# Unit 3

## Parsing and Selecting Useful Information


# Introduction: The Role of Parsing and Selection in DeepResearcher

-----

Welcome back! In the previous lessons, you learned how **DeepResearcher** is structured and how it generates search queries using OpenAI. Now, you are ready for the next step: making sense of the information you collect from the web.

When you run a search, you get a lot of web pages. Not all of them are helpful. Some might be off-topic, and others might have only a small piece of useful information. That's why **parsing** (breaking down) and **selecting** (choosing) the right information are so important. In this lesson, you'll learn how DeepResearcher uses AI to filter out the noise and keep only what matters for your research question.

By the end of this lesson, you'll understand how to:

  * Decide if a web page is useful for your research.
  * Extract only the relevant information from a web page.
  * Use these steps in your own code.

Let's get started!

## Evaluating Relevance with the LLM

-----

Now that we have web content, the first thing we need to do is decide: **Is this page useful for our research question?**

DeepResearcher uses a language model (LLM) to help with this. It does this by sending a special prompt to the LLM, asking it to answer with just **Yes** or **No** to the question: "Is this page relevant to the user's query?"

Let's look at how this works in code, step by step.

### Step 1: Prepare the Variables

We need to give the LLM two things:

1.  The user's original research question.
2.  The content of the web page.

Here's how we set up these variables:

```python
variables = {
    "user_query": user_query,
    "page_text": page_text[:20000]
}
```

  * `user_query` is the question the user asked.
  * `page_text[:20000]` is the first 20,000 characters of the web page content. We limit the length to avoid sending too much data to the LLM.
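
As a minimal sketch of this step, the truncation can be wrapped in a small helper. The name `build_relevance_variables` and the `max_chars` parameter are illustrative assumptions and are not part of DeepResearcher.

```python
def build_relevance_variables(user_query: str, page_text: str, max_chars: int = 20000) -> dict:
    """Hypothetical helper: package the inputs for the relevance prompt."""
    return {
        "user_query": user_query,
        # Slicing never raises, even when the page is shorter than max_chars.
        "page_text": page_text[:max_chars],
    }


variables = build_relevance_variables("Are electric cars greener?", "x" * 50000)
print(len(variables["page_text"]))  # 20000
```

Because slicing is safe on short strings, no separate length check is needed before truncating.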

### Step 2: Ask the LLM if the Page is Useful

We use the function called `generate_boolean` to send our prompt and variables to the LLM. You will need to write the prompt in the exercises of this unit.

```python
is_useful = generate_boolean(
    "relevance_evaluator_system",
    "relevance_evaluator_user",
    variables
)
print(f"{page['url']} - Useful: {is_useful}")
```

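If the LLM thinks the page is useful, it answers **Yes**; if not, it answers **No**. Judging by its name and by the `if is_useful:` check used later in this lesson, `generate_boolean` turns that answer into a boolean, in which case the `print` call shows `True` or `False` rather than the literal words.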

**Example Output:**

```
https://example.com/article1 - Useful: Yes
https://example.com/article2 - Useful: No
```

This way, we can quickly filter out pages that don't help answer the user's question.
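
The lesson does not show how `generate_boolean` turns the model's reply into a value that works in an `if` statement. One plausible approach, sketched here with a hypothetical `parse_yes_no` helper (an assumption, not the library's actual code), is to normalize the reply and compare it to "yes".

```python
def parse_yes_no(reply: str) -> bool:
    """Hypothetical parser: True only when the reply is a clear 'yes'."""
    # Strip whitespace and common surrounding punctuation, then ignore case.
    cleaned = reply.strip().strip(".!\"'").lower()
    return cleaned == "yes"


for reply in ["Yes", " yes. ", "No", "Maybe"]:
    print(repr(reply), "->", parse_yes_no(reply))
```

Treating anything other than a clear "yes" as `False` is a conservative choice: an ambiguous reply drops the page instead of letting noise through.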

## Extracting Key Information Using the Extractor Prompt

-----

Once we know a page is useful, the next step is to pull out only the information that answers the user's question. We don't want to keep the whole page, just the relevant parts.

DeepResearcher uses another prompt for this, called the **extractor prompt**, which you will also need to write. Let's see how this works.

### Step 1: Prepare the Extraction Variables

We need to give the LLM:

1.  The user's research question.
2.  The search query that led to this page.
3.  The content of the web page.

Here's how we set up these variables:

```python
variables = {
    "user_query": user_query,
    "search_query": query,
    "page_text": page_text[:20000]
}
```

### Step 2: Ask the LLM to Extract Relevant Information

We use the `generate_response` function to send our prompt and variables to the LLM. This function uses a prompt that tells the LLM to act as an expert information extractor.

```python
context = generate_response(
    "extractor_system",
    "extractor_user",
    variables
)
```

The LLM reads the web page and returns only the parts that are relevant to the user's question.

The result, `context`, is a string containing the extracted information.

**Example Output:**

```
Electric cars produce fewer emissions over their lifetime compared to gasoline vehicles, especially when charged with renewable energy. Battery production has an environmental cost, but this is offset by lower emissions during use.
```

This makes it much easier to build a final report or summary later.

## Putting It All Together in Code

-----

Let's see how these steps fit together in the main code. We'll focus on the part of the code that checks if a page is useful and then extracts the relevant information.

Here's a simplified version of the process:

```python
for query in new_search_queries:
    results = search_and_fetch_markdown(query, max_results=3)
    for page in results:
        page_text = page["markdown"]
        if not page_text.strip():
            continue
        
        # Step 1: Check if the page is useful
        variables = {
            "user_query": user_query,
            "page_text": page_text[:20000]
        }
        is_useful = generate_boolean("relevance_evaluator_system", "relevance_evaluator_user", variables)
        print(f"{page['url']} - Useful: {is_useful}")
        
        # Step 2: If useful, extract relevant information
        if is_useful:
            variables = {
                "user_query": user_query,
                "search_query": query,
                "page_text": page_text[:20000]
            }
            context = generate_response("extractor_system", "extractor_user", variables)
            if context:
                context_list.append(context)
```

Let's break this down:

  * For each search query, we get a few web pages.
  * For each page, we check if it has any content.
  * We ask the LLM if the page is useful for the user's question.
  * If the answer is **Yes**, we ask the LLM to extract the relevant information.
  * We save the extracted information for later use.

This process helps us build a collection of only the most useful and relevant information for the user's research.
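
To try the control flow without network or LLM access, the loop can be run against stand-ins. Everything prefixed `fake_` below, along with `collect_contexts`, is a hypothetical stub written for this sketch; the real `search_and_fetch_markdown`, `generate_boolean`, and `generate_response` behave differently.

```python
def fake_search(query: str, max_results: int = 3) -> list:
    """Stand-in for search_and_fetch_markdown: returns canned pages."""
    pages = [
        {"url": "https://example.com/ev", "markdown": "Electric cars emit less CO2 over their lifetime."},
        {"url": "https://example.com/empty", "markdown": "   "},
        {"url": "https://example.com/recipes", "markdown": "How to bake sourdough bread."},
    ]
    return pages[:max_results]


def fake_is_relevant(user_query: str, page_text: str) -> bool:
    """Stand-in for generate_boolean: a crude keyword check, not an LLM."""
    return "electric" in page_text.lower()


def fake_extract(user_query: str, search_query: str, page_text: str) -> str:
    """Stand-in for generate_response: returns the first sentence."""
    return page_text.split(".")[0].strip() + "."


def collect_contexts(user_query: str, search_queries: list) -> list:
    context_list = []
    for query in search_queries:
        for page in fake_search(query, max_results=3):
            page_text = page["markdown"]
            if not page_text.strip():
                continue  # skip pages with no content
            if not fake_is_relevant(user_query, page_text[:20000]):
                continue  # skip pages judged not useful
            context = fake_extract(user_query, query, page_text[:20000])
            if context:
                context_list.append(context)
    return context_list


contexts = collect_contexts("Are electric cars greener?", ["electric car emissions"])
print(contexts)
```

Of the three canned pages, the empty one is skipped before any evaluation, the recipe page is rejected by the relevance stub, and only the electric-car page contributes a context.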

## Summary and Practice Preview

-----

In this lesson, you learned how DeepResearcher filters and extracts useful information from web pages. You saw how to:

  * Use the LLM to decide if a page is relevant to the user's question.
  * Extract only the key information from useful pages.
  * Combine these steps in code to build a focused research tool.

Next, you'll get to practice these skills yourself. You'll work with real code to filter and extract information, just like we did here. Take a moment to review the examples, and get ready to try it out on your own!

## Collecting Web Content for Research

Now that you've learned how DeepResearcher processes web content, it's time to put that knowledge into practice! In this exercise, you'll implement the first part of the `perform_iterative_research` function, the foundation of our research pipeline.

Your task is to write code that collects and prepares web content for analysis. Specifically, you need to:

  * Create a list to store extracted information.
  * Loop through each search query and retrieve web pages.
  * Process each page by checking whether it contains actual content.
  * Set up the structure that will later evaluate and extract information.

This is an important first step in building our research tool: by properly collecting and filtering web content, you're setting the stage for the more advanced processing we'll add in upcoming exercises. Let's get those search results flowing!

```python
from deepresearcher.llm.llm_manager import generate_response, generate_boolean
from deepresearcher.web.web_searcher import search_and_fetch_markdown, clear_visited_pages


def generate_initial_search_queries(user_query: str):
    variables = {"user_query": user_query}
    search_queries_str = generate_response("search_generator_system", "search_generator_user", variables)
    try:
        queries = eval(search_queries_str)
        if not isinstance(queries, list):
            raise ValueError("Not a list")
        return queries
    except Exception:
        print("Invalid response for search queries:", search_queries_str)
        return []


def perform_iterative_research(user_query: str, new_search_queries: list, all_search_queries: list, iteration_limit: int):
    # TODO: Create an empty list to store the extracted context
    
    # TODO: Loop through each query in new_search_queries
        # TODO: Call search_and_fetch_markdown with the query and max_results=3
        # TODO: Loop through each page in the results
            # TODO: Get the page text from the page's markdown field
            # TODO: Skip pages with no content after stripping whitespace
            # TODO: Print the first 200 characters of the page
    return []


def generate_final_report(user_query: str, contexts: list):
    pass


def research_main():
    user_query = input("Enter your research query/topic: ").strip()
    iteration_limit = input("Max number of iterations (default 10): ").strip()
    iteration_limit = int(iteration_limit) if iteration_limit.isdigit() else 10

    clear_visited_pages()
    new_search_queries = generate_initial_search_queries(user_query)
    if not new_search_queries:
        return

    all_search_queries = new_search_queries.copy()
    aggregated_contexts = perform_iterative_research(user_query, new_search_queries, all_search_queries, iteration_limit)
    generate_final_report(user_query, aggregated_contexts)


if __name__ == "__main__":
    research_main()

```

Got it! Implementing the content collection loop is the essential first step in the research workflow. It handles fetching the data and basic filtering before the AI processing begins.

Here is the completed `perform_iterative_research` function with the required content collection and filtering logic:

```python
from deepresearcher.llm.llm_manager import generate_response, generate_boolean
from deepresearcher.web.web_searcher import search_and_fetch_markdown, clear_visited_pages


def generate_initial_search_queries(user_query: str):
    variables = {"user_query": user_query}
    search_queries_str = generate_response("search_generator_system", "search_generator_user", variables)
    try:
        queries = eval(search_queries_str)
        if not isinstance(queries, list):
            raise ValueError("Not a list")
        return queries
    except Exception:
        print("Invalid response for search queries:", search_queries_str)
        return []


def perform_iterative_research(user_query: str, new_search_queries: list, all_search_queries: list, iteration_limit: int):
    # TODO: Create an empty list to store the extracted context
    extracted_contexts = []
    
    print("\nStarting web content collection...")

    # TODO: Loop through each query in new_search_queries
    for query in new_search_queries:
        print(f"\nSearching for: {query}")
        
        # TODO: Call search_and_fetch_markdown with the query and max_results=3
        # The search_and_fetch_markdown returns a list of page objects
        results = search_and_fetch_markdown(query, max_results=3)
        
        # TODO: Loop through each page in the results
        for page in results:
            # The page object is a dictionary, typically with 'url' and 'markdown'
            
            # TODO: Get the page text from the page's markdown field
            page_text = page.get("markdown", "")
            
            # TODO: Skip pages with no content after stripping whitespace
            if not page_text.strip():
                print(f"Skipping page at {page.get('url', 'Unknown URL')} - No content found.")
                continue

            # --- Placeholder for AI evaluation and extraction (future steps) ---
            
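            # TODO: Print the first 200 characters of the page (for verification)
            # Build the snippet first: a backslash inside an f-string expression
            # is a syntax error before Python 3.12.
            snippet = page_text[:200].replace("\n", " ")
            print("-" * 20)
            print(f"Found content at: {page.get('url', 'Unknown URL')}")
            print(f"Content snippet: {snippet}...")
            print("-" * 20)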

    # Returning the empty list for now, as extraction logic is in future exercises
    return extracted_contexts


def generate_final_report(user_query: str, contexts: list):
    pass


def research_main():
    user_query = input("Enter your research query/topic: ").strip()
    iteration_limit = input("Max number of iterations (default 10): ").strip()
    iteration_limit = int(iteration_limit) if iteration_limit.isdigit() else 10

    clear_visited_pages()
    new_search_queries = generate_initial_search_queries(user_query)
    if not new_search_queries:
        return

    all_search_queries = new_search_queries.copy()
    aggregated_contexts = perform_iterative_research(user_query, new_search_queries, all_search_queries, iteration_limit)
    generate_final_report(user_query, aggregated_contexts)


if __name__ == "__main__":
    research_main()
```
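
The starter code parses the model's reply with `eval`, which executes whatever expression the string contains. A more defensive variant, sketched below, uses the standard library's `ast.literal_eval`, which only accepts Python literals. The function name `parse_search_queries` is a hypothetical stand-in, and this is a suggestion rather than the course's own solution.

```python
import ast


def parse_search_queries(raw: str) -> list:
    """Hypothetical parser for a reply that should be a Python list of strings."""
    try:
        queries = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        # The reply was not a valid Python literal.
        return []
    if not isinstance(queries, list):
        return []
    # Keep only non-empty strings.
    return [q for q in queries if isinstance(q, str) and q.strip()]


print(parse_search_queries('["ev emissions", "battery lifecycle"]'))
print(parse_search_queries("__import__('os').getcwd()"))
```

The second call returns an empty list because a function call is not a literal, so nothing is executed.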

## Creating Prompts for Content Relevance Evaluation

Now that you've successfully implemented the web content collection part of our research tool, let's focus on the decision-making process! In this exercise, you'll create the prompts that help our AI determine whether a webpage is worth keeping for research.

You'll be writing two critical prompt files that work together to evaluate webpage relevance:

  * A system prompt that establishes the AI's role as a strict evaluator.
  * A user prompt that gives specific instructions on how to judge content relevance.

These prompts are the brain of our filtering system: they determine which information makes it into our final research and which gets filtered out. A well-crafted prompt will help the AI make smart, consistent decisions about what content truly answers the user's research questions.

Get ready to shape how our research tool thinks about relevance. Your prompts will directly impact the quality of information we collect!

```python
# relevance_evaluator_system.txt
# TODO: Write a system prompt that establishes the AI as a strict and concise evaluator of research relevance

# relevance_evaluator_user.txt
# TODO: Write a user prompt that:
# 1. Includes variable placeholders for the user's query and webpage content
# 2. Instructs the AI to evaluate if the webpage is relevant to the query
# 3. Tells the AI to respond with only "Yes" or "No"
# 4. Emphasizes the need for critical evaluation of usefulness

```

That's a great next step! Defining the AI's role and instructions with clear prompts is crucial for effective content filtering. Here are the two prompt files designed for a strict, consistent relevance evaluation:

## Prompt Files for Relevance Evaluation

### System Prompt (Establishing Role)

This prompt sets the stage, making the AI's role clear: it is a **strict, non-negotiable gatekeeper** that must be concise.

```text
# relevance_evaluator_system.txt
You are an **extremely strict, binary content relevance evaluator**. Your sole function is to act as a rigorous gatekeeper, judging if a given webpage is highly and directly relevant to a specific research query. Your evaluation must be based strictly on the provided content and query. You must only output a single word: "Yes" or "No". Do not output any other text, explanation, or conversational filler under any circumstances.
```

-----

### User Prompt (Providing Instructions and Data)

This prompt provides the necessary data (query and content) and the specific, strict instructions for the evaluation, including the binary "Yes" or "No" output requirement.

```text
# relevance_evaluator_user.txt
Evaluate the following web page content against the provided research query.

**Research Query:** {$QUERY_PLACEHOLDER}

**Web Page Content:** {$CONTENT_PLACEHOLDER}

**Instruction:** Critically assess the usefulness of the **Web Page Content** for answering the **Research Query**. The content must be the primary focus of the page and directly address the core topic of the query, providing substantial, non-trivial information. Superficial, tangential, or overly commercial content should be judged as "No."

**Output Requirement:** Respond with only a single word: "Yes" if the content is highly relevant, or "No" if it is not.
```

These two prompts work together effectively:

1.  The **System Prompt** locks the AI into the mindset of a **strict, binary evaluator**.
2.  The **User Prompt** provides the necessary **data placeholders** (`{$QUERY_PLACEHOLDER}` and `{$CONTENT_PLACEHOLDER}`) and clearly defines the **criteria for relevance** (must be the primary focus, directly address the core topic, and provide substantial information).
3.  Both emphasize the required **"Yes" or "No"** output, ensuring the response is immediately parseable by your research tool's code.
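
How the LLM manager fills in these placeholders is not shown in this unit. As a sketch under that caveat, a `{$name}` placeholder could be replaced by plain string substitution. The `render_prompt` function below is hypothetical and only illustrates why the dictionary keys passed from the code must match the placeholder names in the prompt file.

```python
def render_prompt(template: str, variables: dict) -> str:
    """Hypothetical renderer: replace each {$key} with its value."""
    rendered = template
    for key, value in variables.items():
        rendered = rendered.replace("{$" + key + "}", str(value))
    return rendered


template = "Research Query: {$user_query}\nWeb Page Content: {$page_text}"
prompt = render_prompt(template, {"user_query": "EV emissions", "page_text": "Some text"})
print(prompt)

# A key that matches no placeholder leaves the template unchanged.
unfilled = render_prompt("Query: {$QUERY_PLACEHOLDER}", {"user_query": "EV emissions"})
print(unfilled)
```

Under this assumed substitution scheme, a mismatch between key and placeholder fails silently: the model would receive the literal placeholder text instead of the query.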

## Evaluating Content Relevance with AI

Now that you've built the foundation for collecting web content and created prompts for evaluating relevance, let's focus on the decision-making part of our research tool! In this exercise, you'll implement the code that determines whether a webpage contains useful information for our research.

Your task is to work with the AI evaluation system by:

  * Creating the variables dictionary that provides the user query and page content to the AI.
  * Using the `generate_boolean` function to get a Yes/No decision on page relevance.
  * Printing the results so you can see which pages are being kept or filtered out.

This exercise connects what you've learned about collecting content with how we filter for quality. By completing it, you'll understand how DeepResearcher makes smart decisions about which information to keep for your research questions.

```python
from deepresearcher.llm.llm_manager import generate_response, generate_boolean
from deepresearcher.web.web_searcher import search_and_fetch_markdown, clear_visited_pages


def generate_initial_search_queries(user_query: str):
    variables = {"user_query": user_query}
    search_queries_str = generate_response("search_generator_system", "search_generator_user", variables)
    try:
        queries = eval(search_queries_str)
        if not isinstance(queries, list):
            raise ValueError("Not a list")
        return queries
    except Exception:
        print("Invalid response for search queries:", search_queries_str)
        return []


def perform_iterative_research(user_query: str, new_search_queries: list, all_search_queries: list, iteration_limit: int):
    context_list = []

    for query in new_search_queries:
        results = search_and_fetch_markdown(query, max_results=3)
        for page in results:
            page_text = page["markdown"]
            if not page_text.strip():
                continue

            # TODO: Create a variables dictionary with user_query and page_text (limited to 20000 chars)
            
            # TODO: Call generate_boolean with the right parameters to check if the page is useful
            
            # TODO: Print the page URL and whether it's useful
    
    return []


def generate_final_report(user_query: str, contexts: list):
    pass


def research_main():
    user_query = input("Enter your research query/topic: ").strip()
    iteration_limit = input("Max number of iterations (default 10): ").strip()
    iteration_limit = int(iteration_limit) if iteration_limit.isdigit() else 10

    clear_visited_pages()
    new_search_queries = generate_initial_search_queries(user_query)
    if not new_search_queries:
        return

    all_search_queries = new_search_queries.copy()
    aggregated_contexts = perform_iterative_research(user_query, new_search_queries, all_search_queries, iteration_limit)
    generate_final_report(user_query, aggregated_contexts)


if __name__ == "__main__":
    research_main()
```

The keys in the variables dictionary need to line up with the placeholders defined in the user prompt (`relevance_evaluator_user.txt`). The solution below uses the key names the exercise specifies, `"user_query"` and `"page_text"`, and assumes the LLM manager substitutes them into the prompt internally. If your prompt file uses other placeholder names, such as `{$QUERY_PLACEHOLDER}`, rename one side so that they match.

Here is the corrected `perform_iterative_research` function:

```python
from deepresearcher.llm.llm_manager import generate_response, generate_boolean
from deepresearcher.web.web_searcher import search_and_fetch_markdown, clear_visited_pages


def generate_initial_search_queries(user_query: str):
    variables = {"user_query": user_query}
    search_queries_str = generate_response("search_generator_system", "search_generator_user", variables)
    try:
        queries = eval(search_queries_str)
        if not isinstance(queries, list):
            raise ValueError("Not a list")
        return queries
    except Exception:
        print("Invalid response for search queries:", search_queries_str)
        return []


def perform_iterative_research(user_query: str, new_search_queries: list, all_search_queries: list, iteration_limit: int):
    context_list = []

    for query in new_search_queries:
        results = search_and_fetch_markdown(query, max_results=3)
        for page in results:
            page_text = page["markdown"]
            if not page_text.strip():
                continue

            # 1. CORRECTED: Create a variables dictionary using "user_query" and "page_text" as keys
            variables = {
                "user_query": user_query,
                # Limit content to 20000 characters
                "page_text": page_text[:20000] 
            }
            
            # 2. Call generate_boolean with the right parameters to check if the page is useful
            is_useful = generate_boolean(
                system_prompt_name="relevance_evaluator_system",
                user_prompt_name="relevance_evaluator_user",
                variables=variables
            )
            
            # 3. Print the page URL and whether it's useful
            status = "KEEPING (Relevant)" if is_useful else "FILTERING OUT (Not Relevant)"
            print(f"--- Evaluation for: {page['url']}")
            print(f"   Query: '{query}'")
            print(f"   Decision: **{status}**")
            print("-" * 50)

            if is_useful:
                context_list.append(page) 
    
    return context_list


def generate_final_report(user_query: str, contexts: list):
    pass


def research_main():
    user_query = input("Enter your research query/topic: ").strip()
    iteration_limit = input("Max number of iterations (default 10): ").strip()
    iteration_limit = int(iteration_limit) if iteration_limit.isdigit() else 10

    clear_visited_pages()
    new_search_queries = generate_initial_search_queries(user_query)
    if not new_search_queries:
        return

    all_search_queries = new_search_queries.copy()
    print("\nStarting Relevance Evaluation...")
    aggregated_contexts = perform_iterative_research(user_query, new_search_queries, all_search_queries, iteration_limit)
    print(f"\nEvaluation Complete. Pages kept: {len(aggregated_contexts)}")
    generate_final_report(user_query, aggregated_contexts)


if __name__ == "__main__":
    research_main()
```

This version correctly uses the specified keys, ensuring proper data transfer for the relevance evaluation!

## Designing Prompts for Information Extraction

Now that we've set up our web content collection and learned how to evaluate relevance, it's time to focus on the heart of our research tool: extracting valuable information! In this exercise, you'll create the prompts that tell our AI exactly how to pull out the most useful content from web pages.

You'll write two important prompt files:

  * A system prompt that establishes the AI's role as an information extraction expert.
  * A user prompt that gives clear instructions on how to identify and extract relevant information.

These prompts are crucial because they determine what information makes it into your final research. Well-designed prompts will help the AI focus on exactly what matters for answering the research question, ignoring everything else.

By crafting effective extraction prompts, you'll ensure your research tool collects high-quality, focused information instead of unnecessary details!


```python
# TODO: Write a system prompt that establishes the AI as an expert in extracting and summarizing information

# TODO: Write a user prompt that:
# 1. Includes placeholders for user_query, search_query, and page_text
# 2. Instructs the AI to extract information relevant to the user's query
# 3. Tells the AI to return only the relevant context as plain text without commentary
# 4. Explains how to determine what information is relevant

```

This is the final, critical step in the filtering process! The extraction prompts need to be precise to ensure the AI acts as a sophisticated summarizer and context builder.

Here are the two prompt files designed to make the AI an **expert information extractor**:

## Prompt Files for Information Extraction

### System Prompt (Establishing Expert Role)

This prompt establishes the AI as a meticulous expert whose sole purpose is to synthesize and distill information accurately, while being concise.

```text
# information_extractor_system.txt
You are an **Expert Research Information Extractor and Synthesizer**. Your task is to meticulously read the provided source text and extract only the information that is directly and highly relevant to the main research query. Your goal is to create a condensed, self-contained summary (context block) that perfectly answers the research query using only the facts found in the source text. Output only the extracted and synthesized text. Do not add any commentary, conversational filler, titles, or introductory/concluding remarks.
```

-----

### User Prompt (Providing Instructions and Data)

This prompt supplies the data and provides the clear, actionable steps for the AI, ensuring the output is clean and usable as a context block.

```text
# information_extractor_user.txt
Analyze the provided 'Page Content' below and extract all sentences, paragraphs, and data points that are critically useful for answering the 'User Research Query'.

**User Research Query:** {$user_query}

**Original Search Query (Context):** {$search_query}

**Page Content:** {$page_text}

**Extraction Instructions:**
1.  **Relevance:** Focus exclusively on information that directly addresses the User Research Query.
2.  **Synthesis:** Do not simply copy large sections. Synthesize related facts into concise, flowing paragraphs.
3.  **Completeness:** Ensure the extracted context block is informative and contains sufficient detail to be used independently. If a section is relevant, extract it in its entirety, but rephrase it for conciseness if possible.
4.  **Formatting:** The final output must be **plain text** (markdown formatting is acceptable for lists/bolding if present in the original data) and contain **ONLY** the extracted context block.
5.  If absolutely **no** relevant information is found, output the single phrase: "No relevant information found."
```

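These prompts ensure that your research tool gets not just a *relevant page*, but a *cleaned, focused block of relevant information* that is ready for the final report generation. Note that the lesson code passes the names `extractor_system` and `extractor_user` to `generate_response`, while these files are named `information_extractor_*`; whichever naming you choose, the prompt names used in the code presumably need to match the file names.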
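
Instruction 5 asks the model to return a fixed phrase when nothing relevant is found. Because that phrase is a non-empty string, a plain `if context:` check would still store it. A hedged sketch of a filter follows; `is_usable_context` and `NO_INFO_SENTINEL` are hypothetical names and are not part of DeepResearcher.

```python
NO_INFO_SENTINEL = "No relevant information found."


def is_usable_context(context) -> bool:
    """Hypothetical check: reject empty replies and the 'nothing found' phrase."""
    if not context or not context.strip():
        return False
    # Compare case-insensitively and ignore a trailing period.
    normalized = context.strip().rstrip(".").lower()
    return normalized != NO_INFO_SENTINEL.rstrip(".").lower()


replies = ["EVs emit less CO2 over their lifetime.", "No relevant information found.", "", None]
print([is_usable_context(r) for r in replies])
```

A check like this could replace `if context:` before appending to the list of contexts, so that the final report is not padded with "nothing found" entries.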

## Extracting Gold from Web Content

Now that we've set up our web content collection and created prompts for evaluating relevance, it's time to put the final piece in place: extracting the valuable information! In this exercise, you'll implement the code that pulls out only the useful content from web pages that have been deemed relevant.

Your task is to work with the information extraction system by:

  * Creating the variables dictionary that provides all the necessary context to the AI.
  * Using the `generate_response` function to extract relevant information.
  * Adding the extracted information to our growing collection of research data.
  * Making sure the complete list of extracted contexts is returned at the end.

This exercise completes our information processing pipeline. You'll see how DeepResearcher transforms raw web content into focused, relevant research material. By implementing this extraction step, you'll have built a complete system that can intelligently collect and refine information for any research question!

```python
from deepresearcher.llm.llm_manager import generate_response, generate_boolean
from deepresearcher.web.web_searcher import search_and_fetch_markdown, clear_visited_pages


def generate_initial_search_queries(user_query: str):
    variables = {"user_query": user_query}
    search_queries_str = generate_response("search_generator_system", "search_generator_user", variables)
    try:
        queries = eval(search_queries_str)
        if not isinstance(queries, list):
            raise ValueError("Not a list")
        return queries
    except Exception:
        print("Invalid response for search queries:", search_queries_str)
        return []


def perform_iterative_research(user_query: str, new_search_queries: list, all_search_queries: list, iteration_limit: int):
    context_list = []

    for query in new_search_queries:
        results = search_and_fetch_markdown(query, max_results=3)
        for page in results:
            page_text = page["markdown"]
            if not page_text.strip():
                continue

            variables = {
                "user_query": user_query,
                "page_text": page_text[:20000]
            }
            is_useful = generate_boolean("relevance_evaluator_system", "relevance_evaluator_user", variables)
            print(f"{page['url']} - Useful: {is_useful}")

            if is_useful:
                # TODO: Create a variables dictionary with user_query, search_query, and page_text (limited to 20000 chars)
                
                # TODO: Call generate_response with the extractor prompts and your variables dictionary
                
                # TODO: If the extracted context exists, add it to the context_list
                
    
    # TODO: Return the list of extracted contexts
    return []


def generate_final_report(user_query: str, contexts: list):
    pass


def research_main():
    user_query = input("Enter your research query/topic: ").strip()
    iteration_limit = input("Max number of iterations (default 10): ").strip()
    iteration_limit = int(iteration_limit) if iteration_limit.isdigit() else 10

    clear_visited_pages()
    new_search_queries = generate_initial_search_queries(user_query)
    if not new_search_queries:
        return

    all_search_queries = new_search_queries.copy()
    aggregated_contexts = perform_iterative_research(user_query, new_search_queries, all_search_queries, iteration_limit)
    generate_final_report(user_query, aggregated_contexts)


if __name__ == "__main__":
    research_main()

```

That's the final link in the chain! By implementing the extraction logic, your research tool will now transform raw web data into focused, research-ready contexts.

Here is the completed `perform_iterative_research` function with the information extraction logic:

```python
from deepresearcher.llm.llm_manager import generate_response, generate_boolean
from deepresearcher.web.web_searcher import search_and_fetch_markdown, clear_visited_pages


def generate_initial_search_queries(user_query: str):
    variables = {"user_query": user_query}
    search_queries_str = generate_response("search_generator_system", "search_generator_user", variables)
    try:
        queries = eval(search_queries_str)
        if not isinstance(queries, list):
            raise ValueError("Not a list")
        return queries
    except Exception:
        print("Invalid response for search queries:", search_queries_str)
        return []


def perform_iterative_research(user_query: str, new_search_queries: list, all_search_queries: list, iteration_limit: int):
    context_list = []

    for query in new_search_queries:
        results = search_and_fetch_markdown(query, max_results=3)
        for page in results:
            page_text = page["markdown"]
            # Truncate content for both relevance check and extraction
            truncated_page_text = page_text[:20000] 

            if not truncated_page_text.strip():
                continue

            # 1. Relevance Check (Uses keys: user_query, page_text)
            relevance_vars = {
                "user_query": user_query,
                "page_text": truncated_page_text
            }
            is_useful = generate_boolean("relevance_evaluator_system", "relevance_evaluator_user", relevance_vars)
            
            # Print decision based on relevance check
            status = "KEEPING (Relevant)" if is_useful else "FILTERING OUT (Not Relevant)"
            print(f"--- Evaluation for: {page['url']}")
            print(f"   Search Query: '{query}'")
            print(f"   Decision: **{status}**")
            print("-" * 50)


            if is_useful:
                # 2. Information Extraction (only if the page was deemed relevant)

                # Build the variables for the extractor prompt
                extraction_vars = {
                    "user_query": user_query,
                    "search_query": query,  # The specific query used to find the page
                    "page_text": truncated_page_text
                }

                # Ask the LLM to extract only the information relevant to the query
                extracted_context = generate_response(
                    system_prompt_name="information_extractor_system",
                    user_prompt_name="information_extractor_user",
                    variables=extraction_vars
                ).strip()

                # Keep the result unless it is empty or the "No relevant information found." sentinel
                if extracted_context and extracted_context != "No relevant information found.":
                    # Contexts are stored as plain strings in this pipeline
                    context_list.append(extracted_context)
                    print("   ✅ Extracted context added.")
                else:
                    print("   ❌ Extractor returned no valid context.")

    # Return the full list of focused context blocks
    return context_list


def generate_final_report(user_query: str, contexts: list):
    # This function would use the 'contexts' list to generate the final answer
    print("\n--- Final Contexts Collected ---")
    for i, context in enumerate(contexts):
        print(f"\n[Context Block {i+1}]")
        print(context)
        print("-------------------------------")
    print(f"\nTotal context blocks: {len(contexts)}")


def research_main():
    user_query = input("Enter your research query/topic: ").strip()
    iteration_limit = input("Max number of iterations (default 10): ").strip()
    iteration_limit = int(iteration_limit) if iteration_limit.isdigit() else 10

    clear_visited_pages()
    new_search_queries = generate_initial_search_queries(user_query)
    if not new_search_queries:
        return

    all_search_queries = new_search_queries.copy()
    print("\nStarting Research and Extraction Pipeline...")
    aggregated_contexts = perform_iterative_research(user_query, new_search_queries, all_search_queries, iteration_limit)
    generate_final_report(user_query, aggregated_contexts)


if __name__ == "__main__":
    research_main()
```
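
If you want to check the filter-then-extract flow without calling the LLM or the web, a small dry run can help. The sketch below is only an illustration under stated assumptions: `collect_contexts`, `is_usable_context`, `fake_is_relevant`, `fake_extract`, and the `"query"` key on each page are hypothetical names that are not part of DeepResearcher. The sentinel string `"No relevant information found."` and the 20,000-character limit come from the solution above. The case-insensitive comparison is an extra precaution, assuming the model may not always reproduce the sentinel exactly.

```python
NO_INFO_SENTINEL = "No relevant information found."


def is_usable_context(text: str) -> bool:
    """Return True if the extractor output looks like real content."""
    cleaned = text.strip()
    if not cleaned:
        return False
    # Assumption: tolerate differences in case and a missing trailing period
    return cleaned.rstrip(".").lower() != NO_INFO_SENTINEL.rstrip(".").lower()


def collect_contexts(user_query, pages, is_relevant, extract, max_chars=20000):
    """Mirror the pipeline: truncate, check relevance, extract, then keep or drop."""
    contexts = []
    for page in pages:
        page_text = page["markdown"][:max_chars]
        if not page_text.strip():
            continue  # Skip empty pages
        if not is_relevant(user_query, page_text):
            continue  # Filter out irrelevant pages before extraction
        extracted = extract(user_query, page["query"], page_text)
        if is_usable_context(extracted):
            contexts.append(extracted.strip())
    return contexts


# Hypothetical stand-ins for generate_boolean and generate_response
def fake_is_relevant(user_query, page_text):
    return "solar" in page_text.lower()


def fake_extract(user_query, search_query, page_text):
    if "efficiency" in page_text.lower():
        return "Panel efficiency is discussed on this page."
    return "no relevant information found"


pages = [
    {"query": "solar panel efficiency", "markdown": "Solar panel efficiency has improved."},
    {"query": "solar panel efficiency", "markdown": "A recipe for banana bread."},
    {"query": "solar panel efficiency", "markdown": "Solar farms need land."},
    {"query": "solar panel efficiency", "markdown": "   "},
]

contexts = collect_contexts(
    "How efficient are solar panels?", pages, fake_is_relevant, fake_extract
)
print(contexts)
```

Only the first page survives: the second fails the relevance check, the third passes it but the extractor returns the sentinel, and the fourth is empty. Passing the relevance and extraction steps in as functions is a design choice for this sketch, so the same loop can later be pointed at the real LLM calls.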

## Extracting Gold from Web Content