# Unit 3

## Parsing and Selecting Useful Information

## Introduction: The Role of Parsing and Selection in DeepResearcher

-----

Welcome back! In the previous lessons, you learned how **DeepResearcher** is structured and how it generates search queries using OpenAI. Now, you are ready for the next step: making sense of the information you collect from the web.

When you run a search, you get a lot of web pages. Not all of them are helpful. Some might be off-topic, and others might have only a small piece of useful information. That’s why **parsing** (breaking down) and **selecting** (choosing) the right information are so important. In this lesson, you’ll learn how DeepResearcher uses AI to filter out the noise and keep only what matters for your research question.

By the end of this lesson, you’ll understand how to:

  * Decide if a web page is useful for your research.
  * Extract only the relevant information from a web page.
  * Use these steps in your own code.

Let’s get started!

## Evaluating Relevance with the LLM

-----

Now that we have web content, the first thing we need to do is decide: **Is this page useful for our research question?**

DeepResearcher uses a language model (LLM) to help with this. It does this by sending a special prompt to the LLM, asking it to answer with just **Yes** or **No** to the question: "Is this page relevant to the user's query?"

Let’s look at how this works in code, step by step.

### Step 1: Prepare the Variables

We need to give the LLM two things:

1.  The user’s original research question.
2.  The content of the web page.

Here’s how we set up these variables:

```python
variables = {
    "user_query": user_query,
    "page_text": page_text[:20000]
}
```

  * `user_query` is the question the user asked.
  * `page_text[:20000]` is the first 20,000 characters of the web page content. We limit the length to avoid sending too much data to the LLM.
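
To make the role of these variables concrete, here is a minimal sketch of how they might be substituted into a prompt template. The template text, the `MAX_PAGE_CHARS` constant, and the `render_prompt` helper are all hypothetical: you write the real prompts in this unit's exercises, and DeepResearcher's own templating may work differently.

```python
MAX_PAGE_CHARS = 20000  # same cap as the slice used in the lesson

# Hypothetical template; the real one is written in this unit's exercises.
RELEVANCE_USER_TEMPLATE = (
    "User query: {user_query}\n\n"
    "Page content:\n{page_text}\n\n"
    "Is this page relevant to the user's query? Answer Yes or No."
)


def render_prompt(template: str, variables: dict) -> str:
    """Fill the template's {placeholders} with the given variables."""
    return template.format(**variables)


user_query = "Are electric cars better for the environment?"
page_text = "x" * 50000  # stand-in for a very long web page

variables = {
    "user_query": user_query,
    "page_text": page_text[:MAX_PAGE_CHARS],  # keep only the first 20,000 characters
}
prompt = render_prompt(RELEVANCE_USER_TEMPLATE, variables)
print(len(variables["page_text"]))
```

Slicing a string that is shorter than the cap simply returns the whole string, so short pages pass through unchanged.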

### Step 2: Ask the LLM if the Page is Useful

We use the function called `generate_boolean` to send our prompt and variables to the LLM. You will need to write the prompt in the exercises of this unit.

```python
is_useful = generate_boolean(
    "relevance_evaluator_system",
    "relevance_evaluator_user",
    variables
)
print(f"{page['url']} - Useful: {is_useful}")
```

If the LLM answers **Yes**, `generate_boolean` returns `True`.
If it answers **No**, it returns `False`.

**Example Output:**

```
https://example.com/article1 - Useful: True
https://example.com/article2 - Useful: False
```

This way, we can quickly filter out pages that don’t help answer the user’s question.
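
The lesson does not show how `generate_boolean` works internally. One plausible way to turn a Yes/No reply from the LLM into a Python boolean is sketched below. `parse_yes_no` is a hypothetical helper, not part of DeepResearcher, and it treats anything that is not a clear "yes" as `False`.

```python
def parse_yes_no(reply: str) -> bool:
    """Return True only when the reply clearly starts with 'yes'.

    Hypothetical helper: anything else (including an empty reply)
    is treated as 'not useful', so unclear answers are filtered out.
    """
    cleaned = reply.strip().lower()
    return cleaned.startswith("yes")


print(parse_yes_no(" Yes."))
print(parse_yes_no("No"))
```

Defaulting to `False` is a conservative design choice: a page is kept only when the model gives an unambiguous positive answer.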

## Extracting Key Information Using the Extractor Prompt

-----

Once we know a page is useful, the next step is to pull out only the information that answers the user’s question. We don’t want to keep the whole page — just the relevant parts.

DeepResearcher uses another prompt for this that you will also need to write, called the **extractor prompt**. Let’s see how this works.

### Step 1: Prepare the Extraction Variables

We need to give the LLM:

1.  The user’s research question.
2.  The search query that led to this page.
3.  The content of the web page.

Here’s how we set up these variables:

```python
variables = {
    "user_query": user_query,
    "search_query": query,
    "page_text": page_text[:20000]
}
```

### Step 2: Ask the LLM to Extract Relevant Information

We use the `generate_response` function to send our prompt and variables to the LLM. This function uses a prompt that tells the LLM to act as an expert information extractor.

```python
context = generate_response(
    "extractor_system",
    "extractor_user",
    variables
)
```

The LLM reads the web page and returns only the parts that are relevant to the user’s question.

The result, `context`, is a string containing the extracted information.

**Example Output:**

```
Electric cars produce fewer emissions over their lifetime compared to gasoline vehicles, especially when charged with renewable energy. Battery production has an environmental cost, but this is offset by lower emissions during use.
```

This makes it much easier to build a final report or summary later.
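
As a rough illustration of that later step, the sketch below joins several extracted snippets into one block of text. `combine_contexts` is a hypothetical helper introduced only for this example; the lesson does not show how the final report is actually assembled.

```python
def combine_contexts(contexts: list[str]) -> str:
    """Join non-empty extracted snippets, separated by blank lines."""
    cleaned = [c.strip() for c in contexts if c and c.strip()]
    return "\n\n".join(cleaned)


context_list = [
    "Electric cars produce fewer lifetime emissions than gasoline vehicles.",
    "   ",  # whitespace-only entries are dropped
    "Battery production has an environmental cost.",
]
combined = combine_contexts(context_list)
print(combined)
```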

## Putting It All Together in Code

-----

Let’s see how these steps fit together in the main code. We’ll focus on the part of the code that checks if a page is useful and then extracts the relevant information.

Here’s a simplified version of the process:

```python
for query in new_search_queries:
    results = search_and_fetch_markdown(query, max_results=3)
    for page in results:
        page_text = page["markdown"]
        if not page_text.strip():
            continue
        
        # Step 1: Check if the page is useful
        variables = {
            "user_query": user_query,
            "page_text": page_text[:20000]
        }
        is_useful = generate_boolean("relevance_evaluator_system", "relevance_evaluator_user", variables)
        print(f"{page['url']} - Useful: {is_useful}")
        
        # Step 2: If useful, extract relevant information
        if is_useful:
            variables = {
                "user_query": user_query,
                "search_query": query,
                "page_text": page_text[:20000]
            }
            context = generate_response("extractor_system", "extractor_user", variables)
            if context:
                context_list.append(context)
```

Let’s break this down:

  * For each search query, we get a few web pages.
  * For each page, we check if it has any content.
  * We ask the LLM if the page is useful for the user’s question.
  * If the answer is **Yes**, we ask the LLM to extract the relevant information.
  * We save the extracted information for later use.

This process helps us build a collection of only the most useful and relevant information for the user’s research.
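
To try the control flow without a network connection or an API key, you can swap the three DeepResearcher functions for simple stand-ins. The sketch below does this. The stub bodies are assumptions made up for illustration (a keyword check instead of an LLM call, canned pages instead of a web search) and say nothing about how the real functions are implemented.

```python
def search_and_fetch_markdown(query, max_results=3):
    """Stub: return canned pages instead of searching the web."""
    pages = [
        {"url": "https://example.com/article1",
         "markdown": "Electric cars produce fewer emissions."},
        {"url": "https://example.com/article2",
         "markdown": "Top ten pasta recipes."},
        {"url": "https://example.com/empty", "markdown": "   "},
    ]
    return pages[:max_results]


def generate_boolean(system_prompt, user_prompt, variables):
    """Stub: a keyword check standing in for the LLM's Yes/No answer."""
    return "emissions" in variables["page_text"].lower()


def generate_response(system_prompt, user_prompt, variables):
    """Stub: return the page's first sentence as the 'extracted' text."""
    return variables["page_text"].split(".")[0].strip() + "."


user_query = "Are electric cars better for the environment?"
new_search_queries = ["electric car emissions"]
context_list = []

for query in new_search_queries:
    for page in search_and_fetch_markdown(query, max_results=3):
        page_text = page["markdown"]
        if not page_text.strip():
            continue  # skip pages with no real content

        variables = {"user_query": user_query, "page_text": page_text[:20000]}
        is_useful = generate_boolean(
            "relevance_evaluator_system", "relevance_evaluator_user", variables
        )
        print(f"{page['url']} - Useful: {is_useful}")

        if is_useful:
            variables["search_query"] = query  # the extractor also gets the query
            context = generate_response("extractor_system", "extractor_user", variables)
            if context:
                context_list.append(context)

print(context_list)
```

With these stubs, the empty page is skipped, the pasta page is rejected, and only the snippet from the first page ends up in `context_list`.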

## Summary and Practice Preview

-----

In this lesson, you learned how DeepResearcher filters and extracts useful information from web pages. You saw how to:

  * Use the LLM to decide if a page is relevant to the user’s question.
  * Extract only the key information from useful pages.
  * Combine these steps in code to build a focused research tool.

Next, you’ll get to practice these skills yourself. You’ll work with real code to filter and extract information, just like we did here. Take a moment to review the examples, and get ready to try it out on your own!

## Collecting Web Content for Research

Now that you've learned how DeepResearcher processes web content, it's time to put that knowledge into practice! In this exercise, you'll implement the first part of the `perform_iterative_research` function, the foundation of our research pipeline.

Your task is to write code that collects and prepares web content for analysis. Specifically, you need to:

  * Create a list to store extracted information.
  * Loop through each search query and retrieve web pages.
  * Process each page by checking whether it contains actual content.
  * Set up the structure that will later evaluate and extract information.

This is an important first step in building our research tool: by properly collecting and filtering web content, you're setting the stage for the more advanced processing we'll add in upcoming exercises. Let's get those search results flowing!

```python
import ast

from deepresearcher.llm.llm_manager import generate_response, generate_boolean
from deepresearcher.web.web_searcher import search_and_fetch_markdown, clear_visited_pages


def generate_initial_search_queries(user_query: str):
    variables = {"user_query": user_query}
    search_queries_str = generate_response("search_generator_system", "search_generator_user", variables)
    try:
        # ast.literal_eval parses Python literals only, so LLM output is never
        # executed as code (unlike eval).
        queries = ast.literal_eval(search_queries_str)
        if not isinstance(queries, list):
            raise ValueError("Not a list")
        return queries
    except Exception:
        print("Invalid response for search queries:", search_queries_str)
        return []


def perform_iterative_research(user_query: str, new_search_queries: list, all_search_queries: list, iteration_limit: int):
    # TODO: Create an empty list to store the extracted context
    
    # TODO: Loop through each query in new_search_queries
        # TODO: Call search_and_fetch_markdown with the query and max_results=3
        # TODO: Loop through each page in the results
            # TODO: Get the page text from the page's markdown field
            # TODO: Skip pages with no content after stripping whitespace
            # TODO: Print the first 200 characters of the page
    return []


def generate_final_report(user_query: str, contexts: list):
    pass


def research_main():
    user_query = input("Enter your research query/topic: ").strip()
    iteration_limit = input("Max number of iterations (default 10): ").strip()
    iteration_limit = int(iteration_limit) if iteration_limit.isdigit() else 10

    clear_visited_pages()
    new_search_queries = generate_initial_search_queries(user_query)
    if not new_search_queries:
        return

    all_search_queries = new_search_queries.copy()
    aggregated_contexts = perform_iterative_research(user_query, new_search_queries, all_search_queries, iteration_limit)
    generate_final_report(user_query, aggregated_contexts)


if __name__ == "__main__":
    research_main()

```

Implementing the content collection loop is the essential first step in the research workflow. It handles fetching the data and basic filtering before the AI processing begins.

Here is the completed `perform_iterative_research` function with the required content collection and filtering logic:

```python
import ast

from deepresearcher.llm.llm_manager import generate_response, generate_boolean
from deepresearcher.web.web_searcher import search_and_fetch_markdown, clear_visited_pages


def generate_initial_search_queries(user_query: str):
    variables = {"user_query": user_query}
    search_queries_str = generate_response("search_generator_system", "search_generator_user", variables)
    try:
        # ast.literal_eval parses Python literals only, so LLM output is never
        # executed as code (unlike eval).
        queries = ast.literal_eval(search_queries_str)
        if not isinstance(queries, list):
            raise ValueError("Not a list")
        return queries
    except Exception:
        print("Invalid response for search queries:", search_queries_str)
        return []


def perform_iterative_research(user_query: str, new_search_queries: list, all_search_queries: list, iteration_limit: int):
    # TODO: Create an empty list to store the extracted context
    extracted_contexts = []
    
    print("\nStarting web content collection...")

    # TODO: Loop through each query in new_search_queries
    for query in new_search_queries:
        print(f"\nSearching for: {query}")
        
        # TODO: Call search_and_fetch_markdown with the query and max_results=3
        # The search_and_fetch_markdown returns a list of page objects
        results = search_and_fetch_markdown(query, max_results=3)
        
        # TODO: Loop through each page in the results
        for page in results:
            # The page object is a dictionary, typically with 'url' and 'markdown'
            
            # TODO: Get the page text from the page's markdown field
            page_text = page.get("markdown", "")
            
            # TODO: Skip pages with no content after stripping whitespace
            if not page_text.strip():
                print(f"Skipping page at {page.get('url', 'Unknown URL')} - No content found.")
                continue

            # --- Placeholder for AI evaluation and extraction (future steps) ---
            
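            # TODO: Print the first 200 characters of the page (for verification)
            # Build the snippet first: a backslash inside an f-string expression
            # is a SyntaxError before Python 3.12.
            snippet = page_text[:200].replace("\n", " ")
            print("-" * 20)
            print(f"Found content at: {page.get('url', 'Unknown URL')}")
            print(f"Content snippet: {snippet}...")
            print("-" * 20)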

    # Returning the empty list for now, as extraction logic is in future exercises
    return extracted_contexts


def generate_final_report(user_query: str, contexts: list):
    pass


def research_main():
    user_query = input("Enter your research query/topic: ").strip()
    iteration_limit = input("Max number of iterations (default 10): ").strip()
    iteration_limit = int(iteration_limit) if iteration_limit.isdigit() else 10

    clear_visited_pages()
    new_search_queries = generate_initial_search_queries(user_query)
    if not new_search_queries:
        return

    all_search_queries = new_search_queries.copy()
    aggregated_contexts = perform_iterative_research(user_query, new_search_queries, all_search_queries, iteration_limit)
    generate_final_report(user_query, aggregated_contexts)


if __name__ == "__main__":
    research_main()
```

## Creating Prompts for Content Relevance Evaluation

## Evaluating Content Relevance with AI

## Designing Prompts for Information Extraction

## Extracting Gold from Web Content
