# Project 4: **Build a Deep Research System**
Welcome to project 4! For this project, we shift our focus from tool use and agents to *reasoning* models. You will practice state‑of‑the‑art inference‑time scaling methods such as *Chain‑of‑Thought* prompting and *Tree‑of‑Thoughts*, and briefly explore, at a high level, how reasoning models are trained using techniques like **STaR**.


Finally, you will put everything together to build a *deep research agent* that can browse the web, reason over what it finds, and give structured answers.

## Learning Objectives  
* Apply common inference‑time scaling methods: **zero‑shot / few‑shot CoT, self‑consistency, sequential decoding, tree‑of‑thoughts**  
* Gain intuition for **training** reasoning‑capable models following the **STaR** approach 
* Build a minimal **deep‑research agent** that combines step‑by‑step reasoning with live web search   
* Practice extending the deep-research agent to a multi-agent system 

## Roadmap  
1. Environment setup  
2. Inference‑time scaling  
   2.1 Few‑shot CoT  
   2.2 Zero‑shot CoT  
   2.3 Self‑consistency  
   2.4 Sequential revision  
   2.5 Tree‑of‑Thoughts
3. STaR for training models for reasoning  
4. Deep-research agent  
5. (Optional) Multi-agent deep-research

# 1‑ Environment setup

## 1.1 Conda environment

Before we start coding, you need a reproducible setup. Open a terminal in the same directory as this notebook and run:

```bash
# Create and activate the conda environment
conda env create -f environment.yaml && conda activate deep_research

# Register this environment as a Jupyter kernel
python -m ipykernel install --user --name=deep_research --display-name "deep_research"
```
Once this is done, you can select "deep_research" from the Kernel → Change Kernel menu in Jupyter or VS Code.

## 1.2 Ollama setup

In this project we use the `llama3.2:3b` and `deepseek-r1:8b` models. You can try other smaller or larger reasoning LLMs such as `qwen2.5:3b-instruct` or `phi4-mini` to compare performance. Explore available models here: https://ollama.com/library.

```bash
ollama pull llama3.2:3b
ollama pull deepseek-r1:8b
# Additional small reasoning models to compare
# ollama pull qwen2.5:3b-instruct
# ollama pull phi4-mini

```

`ollama pull` downloads the model so you can run it locally without API calls.

---  
# 2‑ Inference‑time scaling

Inference-time scaling refers to techniques that make an existing model reason better without retraining it. Instead of changing the model’s weights, we improve reasoning by adjusting how we prompt the model, how we sample from it, and how we aggregate its outputs.

In this section, we’ll explore several inference-time strategies that improve reasoning quality using a non-reasoning base model. You will experiment with and compare methods such as:

- Few-shot Chain-of-Thought (CoT)
- Zero-shot CoT
- Self-consistency
- Sequential revision
- Tree-of-Thoughts (ToT)

### 2.1: Few‑Shot CoT
Few-shot prompting helps a model reason by showing one or multiple examples before asking a new question. By observing the pattern of reasoning and final answers, the model learns how to structure its own reasoning process on the new input.

In this exercise, you will create a prompt that includes a few example Q&A pairs demonstrating step-by-step reasoning. Then, you will feed a new question and see the model’s output.

In [None]:
# Step 1: Write a few examples showing reasoning steps
# Step 2: Write your new question
# Step 3: Concatenate examples + new question into a single prompt
# Step 4: Call your Ollama or OpenAI client to get a response from llama3.2:3b # e.g., client.chat.completions.create(...)
# Step 5: Print the final answer

from openai import OpenAI

client = OpenAI(api_key = "ollama", base_url = "http://localhost:11434/v1")

few_shot_examples = """Q: If it is 3 PM in London (UTC+0), what time is it in New York (UTC-5)?
A: London is 5 hours ahead, so we subtract 5. The final answer is 10 AM.

Q: A tank holds 60 L of water. It leaks 3 L per hour and is filled at 5 L per hour. How much water after 4 h?
A: Net fill = 5-3 = 2 L/h. 2x4 = 8 L. The final answer is 68 L.
"""

question = "A rectangle has perimeter 40 cm and width 5 cm. What is its length?"
# Put the new question on its own line, in the same Q/A layout as the examples
prompt = few_shot_examples + f"\nQ: {question}\nA:"

MODEL = "llama3.2:3b"

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role":"user","content": prompt}],
    temperature=0.9
)
print(response.choices[0].message.content)

To solve the problem, we can use the formula for the perimeter of a rectangle:

Perimeter = 2(length + width)

Since we know the perimeter (40 cm) and the width (5 cm), we can plug in these values and solve for the length:

40 = 2(length + 5)
20 = length + 5
length = 15

The final answer is: 15 cm


### (Optional) Few-shot CoT on GPT-2
GPT-2 is a pre-trained language model without instruction tuning. It continues text rather than answering questions. In this section, you'll try the exact same CoT pattern on GPT-2 and observe what happens. The goal is to test whether few-shot CoT alone can elicit structured reasoning from a non-chat LLM.

In [None]:
import os
import torch
from transformers import pipeline

# Step 1: Load GPT-2 text-generation from huggingface (https://huggingface.co/docs/transformers/en/model_doc/gpt2)
# Step 2: Write 1–2 few-shot reasoning examples (short, explicit steps + final answer in your own unique format)
# Step 3: Append a new test question after the examples to form one prompt string
# Step 4: Generate 1–3 completions with different decoding settings (e.g., greedy vs. top-k)
# Step 5: Print raw outputs; check if steps are followed and if the final answer is correct

# Pick the best available device: Apple MPS first, then CUDA, then CPU
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
if torch.backends.mps.is_available():
    torch.mps.empty_cache()
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Use half precision only on an accelerator; stay in float32 on CPU
dtype = torch.float16 if device.type != "cpu" else torch.float32

# Use a separate name so the imported `pipeline` factory is not shadowed
generator = pipeline(task="text-generation", model="openai-community/gpt2", dtype=dtype, device=device)

few_shot = """Q: If it is 3 PM in London (UTC+0), what time is it in New York (UTC-5)?
A: London is 5 hours ahead, so we subtract 5. The final answer is 10 AM.

Q: A tank holds 60 L of water. It leaks 3 L per hour and is filled at 5 L per hour. How much water after 4 h?
A: Net fill = 5-3 = 2 L/h. 2x4 = 8 L. The final answer is 68 L.
"""

q = "A rectangle has perimeter 40 cm and width 5 cm. What is its length?"
prompt = few_shot + f"Q: {q}\nA:"

# Greedy decoding: always take the most likely next token
out_greedy = generator(
    prompt,
    max_new_tokens=128,
    do_sample=False,
    use_cache=False
)[0]["generated_text"]

# Nucleus (top-p) sampling with a moderate temperature
out_sample = generator(
    prompt,
    max_new_tokens=128,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    use_cache=False
)[0]["generated_text"]

print("Greedy decoding:\n", out_greedy)
print("\nSampled decoding:\n", out_sample)


Device set to use mps
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Greedy decoding:
 Q: If it is 3 PM in London (UTC+0), what time is it in New York (UTC-5)?
A: London is 5 hours ahead, so we subtract 5. The final answer is 10 AM.

Q: A tank holds 60 L of water. It leaks 3 L per hour and is filled at 5 L per hour. How much water after 4 h?
A: Net fill = 5-3 = 2 L/h. 2x4 = 8 L. The final answer is 68 L.
Q: A rectangle has perimeter 40 cm and width 5 cm. What is its length?
A: The rectangle is 40 cm x 5 cm.

Q: A tank holds 60 L of water. It leaks 3 L per hour and is filled at 5 L per hour. How much water after 4 h?
A: Net fill = 5-3 = 2 L/h. 2x4 = 8 L. The final answer is 68 L.

Q: A rectangle has perimeter 40 cm and width 5 cm. What is its length?

A: The rectangle is 40 cm x 5 cm.

Q: A tank holds 60 L of water. It leaks 3 L per hour and is filled at 5

Sampled decoding:
 Q: If it is 3 PM in London (UTC+0), what time is it in New York (UTC-5)?
A: London is 5 hours ahead, so we subtract 5. The final answer is 10 AM.

Q: A tank holds 60 L of water. It 

### 2.2: Zero‑Shot Chain‑of‑Thought
Zero-shot CoT encourages the model to reason without examples by adding a short cue such as “Let’s think step by step.” This simple phrase often activates the model’s latent reasoning ability even when no demonstrations are provided. It serves as a baseline to compare with few-shot and other inference-time scaling methods.

In [None]:
from openai import OpenAI

# Step 1: Write the question and a zero-shot CoT cue (e.g., "Let's think step by step.")
# Step 2: Build a single prompt string that includes brief role guidance plus the question
# Step 3: Call your Ollama or OpenAI client to get a response from llama3.2:3b  # e.g., client.chat.completions.create(...)
# Step 4: Print the chain and the final answer

client = OpenAI(api_key = "ollama", base_url = "http://localhost:11434/v1")

question = "Why do we use neural networks to build LLMs?"

prompt = f"""You are a knowledgeable tutor. Answer the question. 
Question: {question}
Let's think step by step."""

MODEL = "llama3.2:3b"
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role":"user","content": prompt}],
    temperature=0
)
print(response.choices[0].message.content)

To understand why neural networks are used to build Large Language Models (LLMs), let's break down the process step by step:

1. **Understanding the Problem**: The primary goal of building an LLM is to create a model that can generate human-like text, answer questions, or perform other natural language processing tasks.

2. **Traditional Approaches**: Before neural networks, traditional approaches to NLP involved rule-based systems and statistical models. These methods were limited in their ability to handle complex linguistic structures and nuances of human language.

3. **The Rise of Neural Networks**: In the 1980s and 1990s, researchers began exploring the use of neural networks for NLP tasks. The key innovation was the development of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which allowed models to capture sequential dependencies in language.

4. **Why Neural Networks?**: Neural networks are particularly well-suited for LLMs because they can:
   -

### 2.3: Self‑Consistency
Self-consistency enhances reasoning accuracy by sampling multiple independent reasoning paths for the same question instead of relying on a single deterministic answer. Each run may follow a slightly different logical chain, and the diversity helps correct individual mistakes. After generating several reasoning traces, you then aggregate the final answers using majority voting.

This approach is especially useful when tasks involve multi-step reasoning or arithmetic, where single-path outputs may be incorrect.

In [None]:
from openai import OpenAI
import re, collections

client = OpenAI(api_key = "ollama", base_url = "http://localhost:11434/v1")
MODEL = "llama3.2:3b"


def cot_answer(question, temperature=1.3):
    prompt = f"""Answer the following question with step-by-step reasoning and final answer after **Therefore,**.
        Question: {question}
        Let's think step by step."""
    
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role":"user","content": prompt}],
        temperature=temperature
    )

    content = r.choices[0].message.content
    match = re.search(r"[Tt]herefore,?\s*(.*)", content) # extract text after 'Therefore'.
    return content, match.group(1).strip() if match else None


def self_consistent(question, n=5):
    answers = []
    for _ in range(n):
        _, ans = cot_answer(question, temperature=0.9)
        answers.append(ans)
    counter = collections.Counter(answers)
    winner, _ = counter.most_common(1)[0]
    return winner, counter


question = "What is the square root of 144?"
winner, counter = self_consistent(question)
print("Votes:", counter)
print("Chosen answer:", winner)

Votes: Counter({'12 is a possible candidate for being the square root of 144.': 1, '**the square root of 144 is 12**.': 1, '**The square root of 144 is 12.**': 1, '**': 1, 'instead of using trial and error division, we can simply state the square root of 144 as 12.': 1})
Chosen answer: 12 is a possible candidate for being the square root of 144.
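
The votes above split five ways because each extracted string is a free-form sentence, so majority voting has nothing to agree on. One possible remedy, assuming the expected answer is numeric, is to reduce each extracted sentence to its last number before counting. The helper names `normalize_answer` and `majority_vote` are illustrative and not part of the original notebook, and the "last number" rule is a crude heuristic rather than a robust parser.

```python
import re
from collections import Counter

def normalize_answer(text):
    """Reduce a free-form answer sentence to its last number, or None."""
    if not text:
        return None
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def majority_vote(raw_answers):
    """Vote over normalized answers, skipping samples with no usable number."""
    votes = Counter(a for a in map(normalize_answer, raw_answers) if a is not None)
    winner = votes.most_common(1)[0][0] if votes else None
    return winner, votes

# The five raw strings extracted in the run above
raw = [
    "12 is a possible candidate for being the square root of 144.",
    "**the square root of 144 is 12**.",
    "**The square root of 144 is 12.**",
    "**",
    "instead of using trial and error division, we can simply state the square root of 144 as 12.",
]
winner, votes = majority_vote(raw)
print("Votes:", votes)
print("Chosen answer:", winner)
```

On these five strings the vote is three for `12` and one for `144`: the first sentence ends with "144", so the heuristic misreads it, and the bare `**` is dropped. Asking the model to end with a fixed pattern such as `Final answer: <number>` would likely make extraction more reliable than any post-hoc parsing.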


### 2.4: Sequential Revision

Sequential revision iteratively improves an answer by generating a first draft, critiquing it, and producing revised drafts that condition on prior answers. Each round should be short and focused, so improvements accumulate without drifting from the question.

In [None]:
MODEL = "llama3.2:3b"


def sequential_revision(question: str, max_steps: int = 3) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Keep your answers clear and correct."},
        {"role": "user", "content": question}
    ]
   
    draft = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=0.7,
    ).choices[0].message.content.strip()
    print(f"Draft 1: {draft}")

    # Iterative revision
    for idx in range(1, max_steps):
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Improve answers by making them clearer and more accurate."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "Please revise your answer. Make it clearer, more accurate, and better written. Only include the new answer."}
        ]
        draft = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            temperature=0.7,
        ).choices[0].message.content.strip()
        print(f"Draft {idx+1}: {draft}")

    return draft


output = sequential_revision("If a rectangle is twice as long as it is wide and the perimeter is 30 cm, what is the area?")

Draft 1: Let's break this down step by step:

1. Let's call the width "w" (in cm).
2. Since the rectangle is twice as long as it is wide, the length can be represented as 2w.
3. The perimeter of a rectangle is given by the formula: Perimeter = 2(length + width)
4. We are told that the perimeter is 30 cm, so we can set up an equation: 
   30 = 2(2w + w)
5. Simplify the equation:
   30 = 6w
6. Divide both sides by 6 to solve for w:
   w = 30/6
   w = 5

This means the width is 5 cm.

7. Since the length is twice the width, we can now find the length: 
   Length = 2w
   Length = 2(5)
   Length = 10 cm

8. Now that we have both dimensions (width and length), we can calculate the area:
   Area = length × width
   Area = 10 × 5
   Area = 50
Draft 2: The area of the rectangle is 50 square centimeters (cm²). To find this, we first calculated that the width was 5 cm and the length was twice the width, so 10 cm. Then, using these dimensions, we multiplied them together to get the area.
Draft 3: 
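
The function above asks for a revision directly, without a separate critique. A variant that adds the explicit critique step described at the start of this section might look like the sketch below. The `ask` parameter is a hypothetical stand-in for any function that sends one prompt to a model and returns its text; `fake_ask` exists only so the sketch runs offline.

```python
from typing import Callable, List

def critique_and_revise(question: str, ask: Callable[[str], str], rounds: int = 2) -> List[str]:
    """Return every draft: the initial answer plus one revision per round."""
    drafts = [ask(f"Question: {question}\nAnswer concisely.")]
    for _ in range(rounds):
        # Step 1: ask for a short critique of the latest draft
        critique = ask(
            f"Question: {question}\nDraft answer: {drafts[-1]}\n"
            "List concrete errors or omissions in the draft. Be brief."
        )
        # Step 2: ask for a rewrite that addresses the critique
        revised = ask(
            f"Question: {question}\nDraft answer: {drafts[-1]}\nCritique: {critique}\n"
            "Rewrite the answer so it addresses the critique. Only output the new answer."
        )
        drafts.append(revised)
    return drafts

# Offline stand-in for a model call (an assumption for this sketch)
def fake_ask(prompt: str) -> str:
    if "Critique:" in prompt:
        return "The area is 50 cm^2."
    if "Draft answer:" in prompt:
        return "The unit is missing."
    return "50"

drafts = critique_and_revise(
    "If a rectangle is twice as long as it is wide and the perimeter is 30 cm, what is the area?",
    fake_ask,
    rounds=1,
)
print(drafts)
```

To try it against the local model, `ask` could wrap the same `client.chat.completions.create(...)` call used earlier and return `choices[0].message.content`.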

### 2.5: Tree‑of‑Thoughts
Tree-of-Thoughts reframes reasoning as a search process rather than a single forward chain.
Instead of producing one linear sequence of thoughts, the model generates multiple candidate thoughts at each step, evaluates their promise, and then expands only the best few. This allows exploration of different reasoning paths before committing to a final answer, similar to how humans brainstorm, prune, and refine ideas.


In this section, you’ll experiment with two simplified versions of ToT:
1. Word Ladder puzzle solver: a small example where each “thought” is a candidate word transition.
2. Generic ToT search (depth 2, width 2): minimal logic to expand, evaluate, and select reasoning branches.

In [None]:
###### Word Ladder Puzzle ##########

def neighbors(word, vocabulary):
    for i, c1 in enumerate(word):
        for c2 in 'abcdefghijklmnopqrstuvwxyz':
            if c1 != c2:
                candidate = word[:i] + c2 + word[i+1:]
                if candidate in vocabulary:
                    yield candidate


def tree_of_thought(start, goal, vocab, max_depth=5, beam_width=4):
    frontier = [[start]]
    for depth in range(max_depth):
        candidates = []
        for path in frontier:
            for nxt in neighbors(path[-1], vocab):
                if nxt in path:  # avoid loops
                    continue
                candidates.append(path + [nxt])
        # score: number of letters that differ from the goal (lower is better)
        scored = sorted(candidates, key=lambda p: sum(a!=b for a,b in zip(p[-1], goal)))
        frontier = scored[:beam_width]
        if any(p[-1] == goal for p in frontier):
            return [p for p in frontier if p[-1]==goal][0]
    return None


vocab = {"hit","dot","cog","log","dog","lot","lit","hot"}
print(tree_of_thought("hit", "cog", vocab))


['hit', 'hot', 'dot', 'dog', 'cog']


In [None]:
###### Generic ToT Search ##########

import re

MODEL = "llama3.2:3b"

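def propose_thoughts(question, state, k=2):
    prompt = f"""You are exploring solutions.
            Problem: {question}
            Current partial solution: {state}

            Propose at most {k} different next thoughts."""

    # Sample k separate completions in a loop. Ollama's OpenAI-compatible
    # endpoint may ignore the `n` parameter and return a single choice.
    thoughts = []
    for _ in range(k):
        r = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.9,
        )
        thoughts.append(r.choices[0].message.content.strip())
    return thoughts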


def score_state(question, state):
    prompt = f"""Problem: {question}
        Rate from 1-10 how promising this partial solution is: {state}"""
    
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    try:
        return int(re.findall(r"\d+", r.choices[0].message.content)[0])
    except Exception:
        return 5  # neutral fallback


def tree_of_thoughts(question, depth=2, width=2):
    frontier = [("", 0)]
    for _ in range(depth):
        new_frontier = []
        for state, _ in frontier:
            for thought in propose_thoughts(question, state, k=width):
                new_state = (state + "\n" + thought).strip()
                score = score_state(question, new_state)
                new_frontier.append((new_state, score))
        new_frontier.sort(key=lambda x: x[1], reverse=True)  # keep top‑k
        frontier = new_frontier[:width]
    best_state, best_score = frontier[0]
    return best_state, best_score


question = "Design a plan for a weekend science workshop for 12-year-olds."
solution, score = tree_of_thoughts(question)

print(f"Best solution (score {score}):\n{solution}")

Best solution (score 8):
Here are two potential next steps to develop the plan for the weekend science workshop:

1. **Define the Workshop's Theme and Objectives**: Determine what specific scientific topics or themes would be most engaging and accessible for 12-year-olds. This could include topics such as:
	* Environmental science (e.g., recycling, climate change)
	* Simple physics (e.g., magnets, solar system)
	* Biology (e.g., cells, genetics)

Establishing a clear theme and set of objectives will help guide the planning process and ensure that the workshop stays focused and interactive.

2. **Identify Potential Facilitators or Guest Speakers**: Consider engaging experienced science educators, scientists, or industry experts to lead hands-on activities, discussions, and experiments during the workshop. This could include:
	* Local science teachers or professors
	* Museum staff or curators
	* Scientists from relevant industries (e.g., environmental consulting, biotechnology)

Involvin

---  
# 3‑ Training Models for Reasoning

### 3.1: CoT Training
Chain-of-Thought (CoT) training conditions the model on explicit rationales during fine-tuning. Instead of teaching the model to output only the final answer, we train on (question, rationale, answer) triples so the model learns to internalize multi-step reasoning patterns. A practical recipe is STaR (Self-Taught Reasoner), in which the model generates its own rationales, keeps only those that reach the correct answer, and is fine-tuned on them, repeating the loop over several rounds. A common variant instead uses a stronger teacher model to produce rationales for a smaller student, which is closer to distillation.

For tasks that require multi-hop reasoning, models fine-tuned on rationales often achieve higher accuracy and are more stable at inference time than models trained on direct answers only. 

Training a full language model is beyond the scope of this notebook, but here is the high-level workflow, followed by short pseudocode:
- Collect questions: Prepare a dataset of questions and correct answers.
- Generate rationales: Use the model itself (as in STaR) or a stronger LLM to produce step-by-step reasoning that ends with an answer.
- Filter and clean: Discard incorrect or low-quality rationales.
- Prepare training data: Format triples (question, rationale, answer) for supervised fine-tuning.
- Fine-tune: Fine-tune the LLM on rationales.
- Iterate: Refine prompts, improve data quality, and retrain for stronger reasoning.

In [None]:
# Pseudocode (STaR loop)
# for round in 1 ... iters:
    # STEP 1: self-generate reasoning (the current model writes a rationale + answer for each question)
    # STEP 2: keep only traces whose final answer matches the known correct answer
    # STEP 3: fine-tune the model on the kept (question, rationale, answer) triples
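
The loop above can be made concrete with a toy sketch. No real model is trained here: `make_model` and `toy_fine_tune` are hypothetical stand-ins, where a "model" is a lookup table and "fine-tuning" builds a new table from the kept traces. Only the control flow (generate, filter by correctness, retrain, repeat) is meant to mirror STaR.

```python
from typing import Callable, Dict, List, Tuple

Trace = Tuple[str, str, str]                  # (question, rationale, answer)
Generate = Callable[[str], Tuple[str, str]]   # question -> (rationale, answer)

def star_round(dataset: List[Dict[str, str]], generate: Generate, tries: int = 4) -> List[Trace]:
    """Steps 1 and 2: sample rationales and keep those that reach the gold answer."""
    kept = []
    for ex in dataset:
        for _ in range(tries):
            rationale, predicted = generate(ex["question"])
            if predicted.strip() == ex["answer"].strip():
                kept.append((ex["question"], rationale, predicted.strip()))
                break  # one correct trace per question is enough for this sketch
    return kept

def star(dataset, generate, fine_tune, iters=2):
    """Step 3, repeated: retrain on the kept traces and use the new model next round."""
    for _ in range(iters):
        generate = fine_tune(star_round(dataset, generate))
    return generate

# Toy stand-ins (assumptions for this sketch, not a real training setup)
def make_model(memory):
    def generate(question):
        return memory.get(question, ("I am not sure.", "unknown"))
    return generate

def toy_fine_tune(traces):
    return make_model({q: (r, a) for q, r, a in traces})

dataset = [
    {"question": "2+3?", "answer": "5"},
    {"question": "7*6?", "answer": "42"},
]
base = make_model({
    "2+3?": ("2 plus 3 is 5.", "5"),
    "7*6?": ("7 times 6 is 48.", "48"),  # wrong on purpose, so it gets filtered out
})
traces = star_round(dataset, base)
print(traces)
tuned = star(dataset, base, toy_fine_tune)
```

Only the correct trace for `2+3?` survives the filter. The STaR paper also describes a "rationalization" step that retries failed questions with the correct answer given as a hint; that step is left out here for brevity.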

### 3.2: ORM vs PRM + RL
Training a Reward Model (RM) allows large language models to be improved through reinforcement learning (RL). Instead of fine-tuning directly on examples, we train a separate model that can score or rank model outputs, and use those scores as feedback signals to refine the policy model.

Two main reward modeling approaches are the Outcome Reward Model (ORM), which predicts a scalar reward for the final answer, and the Process Reward Model (PRM), which evaluates the individual reasoning steps instead of just the outcome.

| Approach | What it does | When to use |
|-----------|-------------|-------------|
| *Outcome Reward Model (ORM)* | Predicts one scalar reward for the final answer | Training data is easy to collect using verifiers |
| *Process Reward Model (PRM)* | Predicts a reward for each reasoning step | Training data is harder to collect, but the feedback is more precise |
| *RLHF* | Uses an RM as the reward in **RL** fine‑tuning | Aligns the model policy with human or synthetic preferences |




In [None]:
# for round = 1 ... iters:
    # STEP 1:  Generate reasoning
        # sample a minibatch of questions
        # policy roll‑out (actions + log‑probs)
    # STEP 2:  Score the trajectory
        # ORM: scalar reward for the final answer / PRM: scalar reward for the thought process
    # STEP 3:  Reinforce the policy (PPO)
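
To make the ORM/PRM distinction in STEP 2 concrete, the sketch below scores one reasoning trace both ways. It is a toy under stated assumptions: real reward models are learned, whereas here `outcome_reward` is an exact-match check and `step_is_valid` is a hypothetical rule-based verifier that only understands steps of the form `a op b = c`.

```python
import re
from typing import List

def outcome_reward(final_answer: str, gold: str) -> float:
    """ORM-style signal: one scalar for the whole trace, based only on the result."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0

STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")

def step_is_valid(step: str) -> bool:
    """Toy verifier: a step passes if its 'a op b = c' arithmetic is correct."""
    m = STEP.search(step)
    if not m:
        return False
    a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
    return {"+": a + b, "-": a - b, "*": a * b}[op] == c

def process_rewards(steps: List[str]) -> List[float]:
    """PRM-style signal: one reward per reasoning step."""
    return [1.0 if step_is_valid(s) else 0.0 for s in steps]

# A flawed trace for the water-tank question used earlier (correct answer: 68)
steps = ["5 - 3 = 2", "2 * 4 = 9", "60 + 9 = 69"]
orm = outcome_reward("69", "68")
prm = process_rewards(steps)
print("ORM reward:", orm)
print("PRM rewards:", prm)
```

The outcome reward only says the trace failed (`0.0`), while the per-step rewards `[1.0, 0.0, 1.0]` point at the second step as the source of the error. That finer signal is what makes process supervision attractive despite the extra labeling cost.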

---  
# 4‑ A Deep Research Agent

A deep-research agent pairs a reasoning model (e.g., `deepseek-r1`) with external tools for web search and retrieval. We follow the *ReAct* pattern (reason → tool → observation) in a multi-step setup:

1. The model reasons and decides whether to call a tool.
2. The agent runs the search and feeds condensed snippets back as context.
3. The loop repeats until the model answers or hits a step limit.

We use `AgentType.OPENAI_FUNCTIONS`, which hides the loop inside the LangChain agent.
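
Because that loop is hidden, it can help to see a stripped-down version written out. The sketch below is not LangChain's implementation: it assumes the classic text-based ReAct format (`Action: tool[input]` and `Final Answer: ...`), whereas a function-calling agent exchanges structured tool calls instead. `fake_llm` and `fake_search` are hypothetical stand-ins so the loop runs offline.

```python
import re
from typing import Callable, Dict

def react_loop(question: str, llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]], max_steps: int = 4) -> str:
    """Alternate model turns and tool calls until a final answer or the step limit."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        # Stop as soon as the model commits to an answer
        final = re.search(r"Final Answer:\s*(.*)", reply, re.S)
        if final:
            return final.group(1).strip()
        # Otherwise look for a tool call and append its result as an observation
        action = re.search(r"Action:\s*(\w+)\[(.*?)\]", reply)
        if action:
            name, arg = action.groups()
            tool = tools.get(name)
            observation = tool(arg) if tool else f"Unknown tool: {name}"
            transcript += f"Observation: {observation}\n"
    return "No answer within the step limit."

# Offline stand-ins (assumptions for this sketch)
def fake_llm(transcript: str) -> str:
    if "Observation:" in transcript:
        return "Thought: I have what I need.\nFinal Answer: Paris"
    return "Thought: I should look this up.\nAction: search[capital of France]"

def fake_search(query: str) -> str:
    return "Paris is the capital of France."

answer = react_loop("What is the capital of France?", fake_llm, {"search": fake_search})
print(answer)
```

The step limit matters in practice: without it, a model that never emits a final answer would keep calling tools indefinitely.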

In [None]:
from ddgs import DDGS
from langchain.tools import Tool


def ddg_search(query: str, k: int = 5) -> str:
    """Basic DuckDuckGo web search that returns a concatenated text snippet."""
    with DDGS() as ddgs:
        results = [hit["body"] for hit in ddgs.text(query, max_results=k)]
    return "\n".join(results)

search_tool = Tool(
    # Function-calling APIs generally expect tool names without spaces
    name="duckduckgo_search",
    func=ddg_search,
    description="Search the public web. Input: a plain English query. Returns: concatenated snippets."
)


In [None]:
from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatOllama

MODEL = "deepseek-r1:8b"
question = "What are the best resources to learn machine learning in 2025?"

# Step 1: Initialize the reasoning model via ChatOllama
llm = ChatOllama(model=MODEL, temperature=0.2)

# Step 2: Build the agent with tool access (DuckDuckGo Search) and function-calling interface (initialize_agent)
agent = initialize_agent(
    tools=[search_tool],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
)

# Step 3: Ask a query and let the agent search + reason to produce an answer
result = agent.invoke({"input": question})
print(result["output"])



> Entering new AgentExecutor chain...
<think>
Okay, user is asking about the best resources to learn machine learning for 2025. That's an interesting timeframe - they're looking ahead two years, which suggests they want future-proof knowledge rather than just current trends. 

First, I should acknowledge that predicting exact educational needs two years in advance is tricky because ML evolves rapidly. But some foundational concepts and practical skills will likely remain relevant regardless of technological shifts. The user probably wants to know what's most valuable now but with an eye toward long-term career sustainability.

Hmm... they didn't specify their background or goals, which makes this broad. Are they a complete beginner? A developer looking to pivot? Or someone with math/stats knowledge wanting hands-on skills? Since they didn't say, I should cover multiple learning paths comprehensively.

I notice they're asking about "resources" plural - not just co

# 5‑ (Optional) Multi-agent Deep Research
Instead of a single multi-step agent, you can design multiple collaborating agents such as a Planner, Searcher, Summarizer, and Verifier that pass information to each other and refine each other’s outputs. This setup can improve robustness and diversity of reasoning, and it divides the work into smaller, more focused roles.

Try building a simple setup with 2–3 agents that share goals and messages, for example Planner → Researcher → Writer.

In [None]:
def parallel_research(query, n=3):
    # Run n independent research runs in parallel and return their answers.
    # Steps: use ThreadPoolExecutor; submit n calls to your agent/search pipeline; gather results in order.
    """
    YOUR CODE HERE
    """

answers = parallel_research("What are the best resources to learn ML in 2025?")
for i,a in enumerate(answers,1):
    print(f"[Run {i}] {a[:200]}…")
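
As a starting point for the exercise above, the fan-out step can be sketched independently of any agent. The function below follows the steps listed in the stub (use `ThreadPoolExecutor`, submit `n` calls, gather results in order). `run_parallel`, `research_fn`, and `fake_research` are illustrative names, and `fake_research` stands in for a real agent call so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def run_parallel(query: str, research_fn: Callable[[str], str], n: int = 3) -> List[str]:
    """Call research_fn(query) n times concurrently; results keep submission order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(research_fn, query) for _ in range(n)]
        # Reading futures in list order preserves order even if runs finish out of order
        return [f.result() for f in futures]

# Offline stand-in for one research run (an assumption for this sketch)
def fake_research(query: str) -> str:
    return f"Summary for: {query}"

answers = run_parallel("What are the best resources to learn ML in 2025?", fake_research)
for i, a in enumerate(answers, 1):
    print(f"[Run {i}] {a[:200]}")
```

With the agent from section 4, `research_fn` could be something like `lambda q: agent.invoke({"input": q})["output"]`. Whether one agent object can safely be shared across threads is not guaranteed, so creating a separate agent per run may be the safer choice.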

## 🎉 Congratulations!

* Practised various inference‑time reasoning methods
* Gained intuition about training reasoning models
* Built a **deep-research agent**: a reasoning model such as `deepseek-r1` + a ReAct-style agent + tool use (web search)
* As a next step, try adding more tools and extending the deep-research agent to a multi-agent system in which many agents research the web in parallel.


👏 **Great job!** Take a moment to celebrate. The techniques you implemented here power many production agents and chatbots.