# Deep Research Agent with Pydantic AI

Our agent will:

- Start with the user's initial question
- Explore the topic through multiple search queries
- Generate follow-up questions based on initial results
- Expand the search to cover related and complementary topics
- Synthesize findings into a comprehensive research report

## Setting Up the Dataset

We'll use the [DataTalks.Club podcast archive](https://www.youtube.com/playlist?list=PL3MmuxUbc_hK60wsCyvrEK2RjQsUi4Oa_) as our knowledge base.

In [2]:
from pathlib import Path

data_folder = Path('../data_cache/youtube_videos/')
data_files = sorted(data_folder.glob('*.txt'))

In [4]:
len(data_files)

94

Read and chunk the transcripts:

In [5]:
import docs
from tqdm.auto import tqdm

documents = []

for f in tqdm(data_files):
    video_id = f.stem  # file name without the .txt extension
    content = f.read_text(encoding='utf-8')
    chunks = docs.sliding_window(content, size=3000, step=1500)

    for chunk in chunks:
        chunk['video_id'] = video_id
        documents.append(chunk)


In [6]:
len(documents)

3756

In [7]:
documents[100]

{'start': 42000,
 'content': "always recommend this because\n44:05 sometimes we just cannot think about\n44:07 something there's maybe a little thing\n44:08 we missed\n44:10 yeah so so I I think it's always an\n44:14 incremental work really to you know make\n44:17 things better and so on but I think this\n44:19 you know this is the same as in life and\n44:21 in business\n44:23 yeah and um there is one quite Hot Topic\n44:28 these days these llms right so everyone\n44:30 is talking about the lamps large\n44:32 language models we actually we were in\n44:35 our podcast we were pretty late to the\n44:37 party but recently we had two podcast\n44:39 interviews that were about our lamps so\n44:42 better late than never\n44:44 and yeah I guess all the lamps are kind\n44:46 of hot because of charge BT\n44:49 um at least this is when I noticed them\n44:53 so before when it was just gpt3 it was\n44:56 like okay so what but when I saw chargpd\n44:59 it like completely changed my perception\n45:01 
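
The `docs` module is a local helper whose source is not shown here. As a rough idea of what `sliding_window` could look like, here is a minimal sketch. It assumes the function returns dictionaries with `start` and `content` keys, as in the sample chunk above; the real implementation may differ.

```python
def sliding_window(text, size, step):
    """Split text into overlapping chunks (sketch, not the actual docs module)."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    chunks = []
    for start in range(0, len(text), step):
        # Each chunk records its character offset in the source text
        chunks.append({'start': start, 'content': text[start:start + size]})
        # Stop once the current window reaches the end of the text
        if start + size >= len(text):
            break
    return chunks


example_chunks = sliding_window("abcdefghij", size=4, step=2)
print([c['content'] for c in example_chunks])
```

With `size=3000` and `step=1500`, consecutive chunks overlap by half, so a sentence cut at a chunk boundary still appears whole in the neighboring chunk.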

## Indexing the Documents

In [8]:
from minsearch import Index

index = Index(
    text_fields=["content"],
    keyword_fields=["video_id"]
)

index.fit(documents)


<minsearch.minsearch.Index at 0x123bef230>

In [9]:
index.search("How do i make money with AI?")

[{'start': 22500,
  'content': "at kind of\n22:43 money you need for the next 12 to 18\n22:45 months for example it depends so I want\n22:48 to hire I don't know two three\n22:50 developers I want to hire\n22:52 whatever I need this amount amount of\n22:54 money for that I want to make some\n22:56 marketing is I don't know experiments I\n22:58 need some money for that so at all I\n22:59 need XYZ type of money\n23:02 um and then I go out with this I need\n23:05 this because all the investors is going\n23:06 to ask you what do you need that money\n23:08 for and if you don't have a good answer\n23:09 to that that's not a good sign\n23:12 um if you don't know what you what\n23:13 you're raising for so and and then yeah\n23:16 the the investors have their process and\n23:18 the investors normally if you have a\n23:20 venture fund and you invest typically in\n23:23 one with the money from One Fund you\n23:26 invest into 20 to 40\n23:29 um startups depends always on this on\n23:31 the around 

## Search Tool

In [10]:
from typing import List, TypedDict

class SearchResult(TypedDict):
    """Represents a single search result entry."""
    start: int
    content: str
    video_id: str


def search(query: str) -> List[SearchResult]:
    """
    Search the index for documents matching the given query.

    Args:
        query (str): The search query string.

    Returns:
        List[SearchResult]: A list of search results. Each result dictionary contains:
            - start (int): The starting position or offset within the source file.
            - content (str): A text excerpt or snippet containing the match.
            - video_id (str): YouTube video ID for the snippet.
    """
    return index.search(
        query=query,
        num_results=5,
    )


## First Research Agent

In [11]:
from pydantic_ai import Agent

instructions = """
Your role is to explore the topic provided by the user as deep as possible. 
Use the search function for that, and then based on the search results, 
create more queries to explore relevant topics.
""".strip()

agent_tools = [search]

agent = Agent(
    name="search",
    instructions=instructions,
    tools=agent_tools,
    model='gpt-4o-mini'
)

Test the basic agent:

In [12]:
results = await agent.run(user_prompt='how do I make money with AI?')
print(results.output)

Making money with AI can be approached from several angles, depending on your interests, skills, and resources. Here are some avenues to consider:

### 1. **Startups and Entrepreneurship**
   - **AI Product Development**: Build products that leverage AI technologies. This might be software applications that solve specific problems (like customer service automation through chatbots) or hardware integrations.
   - **Fundraising and Investment**: Many startups look for venture capital. Demonstrating that your AI-based business has the potential to generate revenue and scale can help attract investors. Having AI tools or platforms that simplify user engagement can further entice investors.

### 2. **Freelancing and Consulting**
   - Many companies are looking for AI experts to help them implement AI solutions. You can offer consulting services to organizations needing guidance on AI technologies. This includes providing insights on which AI technologies to adopt or assisting them in AI mod

Check which search calls the agent made:

In [13]:
messages = results.all_messages()

for message in messages:
    for part in message.parts:
        if part.part_kind == 'tool-call':
            print(part)


ToolCallPart(tool_name='search', args='{"query":"how to make money with AI"}', tool_call_id='call_EeiGhtIFNKJhGiE7zHfz0K5z')
ToolCallPart(tool_name='search', args='{"query": "business models using AI"}', tool_call_id='call_5euRix5jHk5NkIOBa7zDVGZf')
ToolCallPart(tool_name='search', args='{"query": "AI in entrepreneurship"}', tool_call_id='call_seOUHWaed17PlkywHBzDh8WK')
ToolCallPart(tool_name='search', args='{"query": "AI tools for making money"}', tool_call_id='call_fMPpNeo62F1mh6BbzVbXASsB')
ToolCallPart(tool_name='search', args='{"query": "ways to monetize AI technology"}', tool_call_id='call_oqUYkRaoSPSAfmmCyYdrkbNQ')
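
In the output above, `args` is a JSON string. To compare runs it can help to pull out just the query text. The sketch below works on plain strings copied from that output; `extract_queries` is a hypothetical helper, and accepting both strings and dicts is an assumption, since a model provider may return either.

```python
import json


def extract_queries(tool_call_args):
    """Return the 'query' value from each tool-call argument payload."""
    queries = []
    for args in tool_call_args:
        # args may arrive as a JSON string or as an already parsed dict
        parsed = json.loads(args) if isinstance(args, str) else args
        queries.append(parsed["query"])
    return queries


sample_args = [
    '{"query":"how to make money with AI"}',
    '{"query": "business models using AI"}',
]
queries = extract_queries(sample_args)
print(queries)
```

In the notebook, the list passed in would be `[part.args for part in message.parts if part.part_kind == 'tool-call']`.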


## Improving Research Depth

Let's add this line to the instructions:

`Don't stop until you perform at least 5 queries.`


In [14]:
instructions = """
Your role is to explore the topic provided by the user as deep as possible. 
Use the search function for that, and then based on the search results, 
create more queries to explore relevant topics.

Don't stop until you perform at least 5 queries.
""".strip()

agent_tools = [search]

agent = Agent(
    name="search",
    instructions=instructions,
    tools=agent_tools,
    model='gpt-4o-mini'
)

To see what's happening while we're waiting for query execution, let's add an event handler:

In [15]:
from pydantic_ai.messages import FunctionToolCallEvent

async def print_function_calls(ctx, event):
    # Detect nested streams
    if hasattr(event, "__aiter__"):
        async for sub in event:
            await print_function_calls(ctx, sub)
        return

    if isinstance(event, FunctionToolCallEvent):
        print("TOOL CALL:", event.part.tool_name, event.part.args)
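
The `__aiter__` check is what lets the handler cope with streams nested inside streams. The sketch below reproduces the same recursion without Pydantic AI, so it can be run in isolation. `FakeToolCall` is a hypothetical stand-in for `FunctionToolCallEvent`, and the handler collects calls in a list instead of printing them.

```python
import asyncio


class FakeToolCall:
    """Hypothetical stand-in for FunctionToolCallEvent."""
    def __init__(self, tool_name, args):
        self.tool_name = tool_name
        self.args = args


async def stream(items):
    # Async generator: objects it produces have __aiter__
    for item in items:
        yield item


async def collect_tool_calls(event, collected):
    # Nested stream: recurse into every sub-event
    if hasattr(event, "__aiter__"):
        async for sub in event:
            await collect_tool_calls(sub, collected)
        return

    if isinstance(event, FakeToolCall):
        collected.append((event.tool_name, event.args))


async def demo():
    collected = []
    inner = stream([FakeToolCall("search", '{"query": "b"}')])
    outer = stream([FakeToolCall("search", '{"query": "a"}'), inner, "other event"])
    await collect_tool_calls(outer, collected)
    return collected


calls = asyncio.run(demo())
print(calls)
```

Inside a notebook, where an event loop is already running, use `calls = await demo()` instead of `asyncio.run(demo())`.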

Test the improved agent:


In [16]:
question = 'how do I get into machine learning?'

results = await agent.run(
    user_prompt=question,
    event_stream_handler=print_function_calls
)

TOOL CALL: search {"query":"how to get started in machine learning"}
TOOL CALL: search {"query": "best resources for learning machine learning"}
TOOL CALL: search {"query": "common challenges in machine learning projects"}
TOOL CALL: search {"query": "skills needed for machine learning careers"}
TOOL CALL: search {"query": "successful machine learning projects tips"}
TOOL CALL: search {"query": "online courses and certifications for machine learning"}


In [17]:
print(results.output)

Getting into machine learning can seem daunting, but it involves several clear steps and resources. Here's a comprehensive guide based on current insights and perspectives from various professionals in the field:

### 1. Understanding the Basics
- **Start with a Foundation in Mathematics & Statistics**: A strong grasp of linear algebra, calculus, and statistics is essential, as these are the mathematical foundations of most machine learning algorithms.
- **Programming Skills**: Familiarity with Python is often recommended since it has a rich ecosystem of libraries (e.g., NumPy, pandas, scikit-learn, TensorFlow, and PyTorch) for machine learning.

### 2. Educational Resources
- **Online Courses**: Websites like Coursera, edX, and Udacity offer valuable courses on machine learning by recognized institutions (e.g., Andrew Ng’s course on Coursera). 
- **Books**: Find a good introductory book on machine learning (like “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by A

## Cost Analysis

In [18]:
results.usage()

RunUsage(input_tokens=28918, cache_read_tokens=4224, output_tokens=824, details={'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, requests=3, tool_calls=6)

In [19]:
from toyaikit.pricing import PricingConfig
pricing = PricingConfig()

usage = results.usage()

pricing.calculate_cost(
    model=agent.model.model_name,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens
)

CostInfo(input_cost=0.0043377, output_cost=0.0004944, total_cost=0.0048321)
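
The numbers are easy to verify by hand. The sketch below assumes per-million-token rates of $0.15 for input and $0.60 for output. These rates are inferred from the `CostInfo` values above rather than taken from `toyaikit`, so treat them as an assumption and check current pricing. Like the call above, it ignores `cache_read_tokens`.

```python
# Assumed rates in USD per 1M tokens, inferred from the CostInfo output
INPUT_PRICE_PER_1M = 0.15
OUTPUT_PRICE_PER_1M = 0.60


def estimate_cost(input_tokens, output_tokens):
    """Return (input_cost, output_cost, total_cost) in USD."""
    input_cost = input_tokens * INPUT_PRICE_PER_1M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_1M / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


# Token counts from the RunUsage output above
input_cost, output_cost, total_cost = estimate_cost(28918, 824)
print(round(input_cost, 7), round(output_cost, 7), round(total_cost, 7))
```

Input tokens dominate the bill: every search call returns five 3000-character chunks, and all of them are sent back to the model on the next request.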

## Advanced Research Structure

In [20]:
instructions = """
You are a deep research agent exploring topics using a proprietary podcast/video database.

GOAL

Given a user question, perform a structured multi-stage exploration to deeply understand the topic and all relevant adjacent ideas.

PROCESS

Stage 1 — Initial Search

- Take the user's question and perform 1–2 broad searches.
- Summarize the main insights from the top results.
- Note related subtopics or recurring ideas.

Stage 2 — Expansion

- Generate 5 targeted follow-up search queries based on the Stage 1 insights.
    Example: If the user asks "how to make money with AI", follow-ups might be:
    - "AI startup business models"
    - "freelancing with AI tools"
    - "AI side hustles"
    - "ethical considerations in AI monetization"
    - "AI job opportunities"
- Perform each search and summarize key takeaways with references.

Stage 3 — Deep Dive

- Based on findings so far, generate 5 even deeper or contrasting queries.
    These might cover debates, frameworks, case studies, or expert insights.
- Perform these searches and extract detailed insights.


OUTPUT FORMAT

At the end, output a structured summary:

**Main Question:** <original question>  
**Overview:** A 1-paragraph synthesis of the key ideas.

**Stage 1 Findings:**  
- Bullet summaries + [reference links]

**Stage 2 Expansions:**  
- Subtopic summaries + [reference links]

**Stage 3 Deep Dives:**  
- In-depth findings or nuanced perspectives + [reference links]

**References:**  
List clickable YouTube links in this format:  
[Title](https://www.youtube.com/watch?v=<video_id>&t=<timestamp>s)

RULES

- Always use `search()` to gather evidence before summarizing.
- Derive each new query from the content of previous results.
- Only use information returned by search() as references.
- Always include at least 5 unique searches.
- Prefer quality and diversity over repetition.
""".strip()

## Structured Output with Pydantic Models

Sometimes the agent quits before completing all stages. We can force it to finish them with structured output: when the output schema requires a report for each stage, the agent has to execute every stage to fill it in.

We can also add extra instructions to individual fields by using `Field` with a description. The LLM reads these descriptions.

In [21]:
from pydantic import BaseModel, Field
from typing import List

class Reference(BaseModel):
    """Citations that directly tie each claim to a verifiable source."""
    quote: str = Field(..., description="A short, verbatim quote (2–4 sentences) from the database snippet.")
    youtube_id: str = Field(..., description="Video ID")
    timestamp: str = Field(..., description="Timestamp to the exact position in the video where the quote is, 'h:mm:ss' or 'mm:ss' format.")

class Keyword(BaseModel):
    """Research results for a specific keyword"""
    search_keyword: str = Field(..., description="Exact keyword used for search.")
    summary: str = Field(..., description="Short summary of the search result.")
    references: List[Reference] = Field(..., description="Specific references to help us track the findings of the research.")
    relevance_summary: str = Field(..., description="1 sentence for each reference explaining how it supports the keyword's summary. Ensure factual consistency.")
    other_ideas: str = Field(..., description="Free-form description of related or complementary ideas to explore in next stages.")

class StageReport(BaseModel):
    """Summarizes what was found during a single exploration stage."""
    stage: int = Field(..., description="Stage number (1 for initial search, 2 for expansion, 3 for deep dive).")
    keywords: List[Keyword] = Field(..., description="Search keywords ")
    summary: str = Field(..., description="A concise synthesis of insights found in this stage, summarizing themes and discoveries from all queries executed in the stage.")

class Claim(BaseModel):
    """A factual statement supported by one specific reference."""
    description: str = Field(..., description=(
        "A short paragraph (3–4 sentences) that paraphrases the meaning of the quote in your own words. "
        "It must stay faithful to the factual content of the quote — no speculation or extrapolation."
    ))
    relevance_check: str = Field(..., description=(
        "1–2 sentences explaining *why* this quote supports the claim — a brief justification to ensure factual grounding."
    ))
    reference: Reference = Field(..., description=(
        "A direct quote that explicitly supports or demonstrates the statement made in 'description'. "
        "The claim should be a paraphrase or interpretation of this quote."
    ))

class ArticleSection(BaseModel):
    """One thematic part of the final article, containing multiple claims."""
    title: str = Field(..., description="A concise section title summarizing the theme.")
    claims: List[Claim] = Field(..., description="3–4 claims that explore different aspects of this section's theme.")

class ActionPoint(BaseModel):
    """Practical takeaways from the research."""
    point: str = Field(..., description="A concrete recommendation, insight, or action derived from the research.")
    relevance_check: str = Field(..., description="Explain how the referenced quote supports this action point — must show logical connection, not assumption.")
    reference: Reference = Field(..., description="Source supporting this action point.")

class Article(BaseModel):
    """The final synthesized output — a structured article summarizing all research stages."""
    title: str = Field(..., description="Compelling headline summarizing the topic and main insight (7-10 words).")
    introduction: str = Field(..., description="A short overview (3-4 paragraphs) explaining what the research explored and why it matters.")
    sections: List[ArticleSection] = Field(..., description="5-8 well-structured sections presenting grouped claims by topic.")
    action_points: List[ActionPoint] = Field(..., description="Optional 3-5 key insights or recommendations derived from the findings.")
    conclusion: str = Field(..., description="Final synthesis paragraph summarizing the broader takeaways and closing thoughts.")

class ResearchReport(BaseModel):
    """The complete record of exploration across all stages, culminating in the final article."""
    stages: List[StageReport] = Field(..., description="Exploration stage reports (Stage 1–3) detailing the search process.")
    article: Article = Field(..., description="The final article.")


Note the ordering of fields in each class.

For example, in the Keyword class, we ask for the summary and references before other ideas. This sequence helps the model "think" more systematically: it first writes down what it found, and with that already in the output it's easier for the model to generate other ideas.

Another technique I used here is called "grounding". We do it with the relevance_check fields.

Because the model has to state explicitly why each quote is relevant, it is less likely to attach references that don't actually support the claim.
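
The relevance fields rely on the model checking itself. Since `Reference.quote` is supposed to be verbatim, a simple programmatic check can complement them. The sketch below is not part of the original notebook: `quote_is_grounded` is a hypothetical helper that tests whether a quote appears in any search snippet, after stripping the inline transcript timestamps and normalizing whitespace and case.

```python
import re

# Matches inline transcript timestamps such as '44:05' or '1:02:03'
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")


def normalize(text):
    """Drop timestamps, lowercase, and collapse whitespace."""
    text = TIMESTAMP.sub(" ", text)
    return " ".join(text.lower().split())


def quote_is_grounded(quote, snippets):
    """True if the normalized quote occurs in at least one snippet."""
    needle = normalize(quote)
    return bool(needle) and any(needle in normalize(s) for s in snippets)


snippet = (
    "always recommend this because\n44:05 sometimes we just cannot think about\n"
    "44:07 something there's maybe a little thing\n44:08 we missed"
)
print(quote_is_grounded("sometimes we just cannot think about something", [snippet]))
print(quote_is_grounded("a quote that is not in the transcript", [snippet]))
```

Exact substring matching is strict: a quote that the model lightly cleaned up (punctuation, filler words) would fail, so a fuzzy match may be a better fit in practice.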



## Final Agent

In [22]:
instructions = """
You are a deep research agent exploring topics using a proprietary podcast/video database.

Given a user question, perform a structured, multi-stage exploration to understand
the topic deeply and comprehensively through the database.

## DATA SOURCE

- You can only use the results from the `search()` function.
- Each search result includes `video_id` and snippet text.
- All references must link to YouTube URLs derived from the database and contain a quote
- Do not create, infer, or guess podcast names, titles, or timestamps.

## PROCESS

Stage 1 — Initial Search

1. Use the user's question as the first query with `search()`.
2. Summarize the most relevant insights from the results.
3. Identify key ideas, recurring themes, or related questions.

Stage 2 — Expansion

1. Generate 5-7 follow-up queries that explore related subtopics or complementary ideas.
2. For each query, call `search()` again.
3. Summarize the main insights from each result.

Stage 3 — Deep Dive

1. From the Stage 2 findings, generate 5-7 deeper or contrasting exploration queries.
2. For each, call `search()` again and summarize findings.
3. At the end of Stage 3, write an article that describes everything you discovered.

## Exploration rules

You are not allowed to stop until you perform at least 11 queries:

- 1 initial query for stage 1
- 5-7 follow up queries for stage 2
- 5-7 deeper exploration queries for stage 3

## References

When generating a claim or action point:

- Read the reference quote carefully.
- Write the claim as a faithful paraphrase or inference strictly supported by the quote.
- After each claim, provide a 1–2 sentence "relevance_check" explaining why the quote supports it.
- Do not generalize or introduce new facts not mentioned in the quote.

## Article

- The resulting article should contain an introduction, 5-8 sections and a conclusion.
- Each section should present 3-4 claims (backed by references) grouped by topics
- Each claim should be a paragraph with 3-4 sentences.
"""

agent = Agent(
    name="search",
    instructions=instructions,
    tools=agent_tools,
    model='gpt-4o-mini',
    output_type=ResearchReport
)


## Displaying the Results

Helper functions to format references and display results:

In [23]:
def to_link(reference) -> str:
    """
    Converts the timestamp to a YouTube URL with a proper time offset.
    Supports 'h:mm:ss', 'mm:ss' and plain seconds formats.
    """
    base_url = f"https://www.youtube.com/watch?v={reference.youtube_id}"

    # No timestamp (None, empty or whitespace only): link to the start of the video
    ts = (reference.timestamp or "").strip()
    if not ts:
        return base_url

    try:
        parts = [int(p) for p in ts.split(":")]
    except ValueError:
        return base_url

    if len(parts) == 3: # h:mm:ss
        hours, minutes, seconds = parts
    elif len(parts) == 2: # mm:ss
        hours, minutes, seconds = 0, parts[0], parts[1]
    elif len(parts) == 1: # plain seconds
        hours, minutes, seconds = 0, 0, parts[0]
    else:
        # Unexpected format with more than three parts: fall back to the plain link
        return base_url

    total_seconds = hours * 3600 + minutes * 60 + seconds
    return f"{base_url}&t={total_seconds}s"

def diplay_reference(reference: Reference): 
    return f"[{reference.quote}]({to_link(reference)})" 
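
The time offset is plain base-60 arithmetic. As a standalone check, here is a compact sketch of the same conversion; `timestamp_to_seconds` is a hypothetical helper, not part of the notebook, and unlike `to_link` it raises on malformed input instead of falling back.

```python
def timestamp_to_seconds(timestamp):
    """Convert 'h:mm:ss', 'mm:ss' or 'ss' to a number of seconds."""
    parts = [int(p) for p in timestamp.strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported timestamp format: {timestamp!r}")

    seconds = 0
    for part in parts:
        # Shift the running total one base-60 position, then add the next part
        seconds = seconds * 60 + part
    return seconds


print(timestamp_to_seconds("19:39"))
print(timestamp_to_seconds("1:02:03"))
```

A timestamp of `19:39` becomes 19 * 60 + 39 = 1179 seconds, which produces a `&t=1179s` suffix in the URL.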


In [24]:
question

'how do I get into machine learning?'

## Execute it...

In [25]:
results = await agent.run(
    user_prompt=question,
    event_stream_handler=print_function_calls
)

TOOL CALL: search {"query":"how to get into machine learning"}
TOOL CALL: search {"query": "machine learning beginner resources"}
TOOL CALL: search {"query": "skills needed for machine learning"}
TOOL CALL: search {"query": "common pitfalls in machine learning projects"}
TOOL CALL: search {"query": "importance of programming in machine learning"}
TOOL CALL: search {"query": "machine learning training and education"}
TOOL CALL: search {"query": "how to build machine learning projects"}
TOOL CALL: search {"query": "career paths in machine learning"}
TOOL CALL: search {"query": "best programming languages for machine learning"}
TOOL CALL: search {"query": "essential tools for machine learning projects"}
TOOL CALL: search {"query": "networking in machine learning community"}
TOOL CALL: search {"query": "building machine learning portfolio"}


Display results:

In [26]:
report = results.output

# Display stage-by-stage findings
for stage in report.stages:
    print('Stage:', stage.stage)
    for kw in stage.keywords:
        print('  keyword:', kw.search_keyword)
        print('  summary:', kw.summary)
        print('  references:', [diplay_reference(r) for r in kw.references])
    print(stage.summary)

# Display the final article
article = report.article
print('#', article.title)
print('## Introduction')
print(article.introduction)

for section in article.sections:
    print('##', section.title)
    for claim in section.claims:
        print(claim.description, '(', diplay_reference(claim.reference), ')')

print('## Action Points')
for action_point in article.action_points:
    print('*', action_point.point, diplay_reference(action_point.reference))

print('## Conclusion')
print(article.conclusion)


Stage: 1
  keyword: how to get into machine learning
  summary: Interpersonal communication and understanding stakeholder needs are crucial for successful machine learning projects. Executing such projects often requires complex support from various business aspects and is significantly more demanding than typical software engineering projects.
  references: ['[a huge part of your role in machine learning is to be able to communicate back value to BU yourself...](https://www.youtube.com/watch?v=su2M058m3Lw&t=1179s)']
Understanding the complexities of machine learning projects is essential for anyone looking to enter this field. Key skills such as communication, stakeholder engagement, and adaptability to various project needs are crucial.
Stage: 2
  keyword: machine learning beginner resources
  summary: For those new to machine learning, gaining foundational knowledge is crucial. Online platforms offer courses tailored to beginners, which include coding in Python, data handling, and b

Check the cost:

In [27]:
usage = results.usage()

pricing.calculate_cost(
    model=agent.model.model_name,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens
)

CostInfo(input_cost=0.0127539, output_cost=0.002706, total_cost=0.0154599)

## Improvements

We can improve it by splitting the single agent that does everything into separate research and synthesis agents. One agent will focus purely on exploration and keyword discovery, then pass its results to a synthesis agent for fact checking and article writing.
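
The hand-off between the two agents could be sketched as follows. This is a framework-agnostic outline under stated assumptions, not a tested design: `Finding`, `run_pipeline`, `fake_research` and `fake_synthesize` are hypothetical names, and the two stub coroutines stand in for calls to two separate `Agent` instances, each with its own instructions and output type.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """Hypothetical hand-off record: one search keyword and its quotes."""
    keyword: str
    quotes: List[str]


async def run_pipeline(question, research, synthesize):
    # Step 1: exploration only, no article writing
    findings = await research(question)
    # Step 2: writing only, based on the collected findings
    return await synthesize(question, findings)


async def fake_research(question):
    # Stub: a real version would run the research agent with the search tool
    return [Finding(keyword=question, quotes=["quote A", "quote B"])]


async def fake_synthesize(question, findings):
    # Stub: a real version would run the synthesis agent without tools
    total_quotes = sum(len(f.quotes) for f in findings)
    return f"Article on '{question}' built from {total_quotes} quotes"


article = asyncio.run(
    run_pipeline("how do I get into machine learning?", fake_research, fake_synthesize)
)
print(article)
```

Keeping the hand-off as a small typed structure means the synthesis step only sees the selected quotes, not the full search results, which should also reduce input tokens. In a notebook, replace `asyncio.run(...)` with `await run_pipeline(...)`.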