# Lesson 16: Initial Data Ingestion and Tooling

In this lesson, we focus on building the first set of essential MCP tools for data gathering in our research agent. We'll implement tools that read article guideline files, extract web URLs programmatically, and scrape their content in parallel. This lesson demonstrates how file-based approaches can save tokens for the orchestrating agent, which only needs to process simple success or failure messages rather than large content blocks.

Learning Objectives:
- Learn how to build MCP tools that extract URLs and references from text files
- Understand the benefits of file-based tool outputs for token efficiency
- Implement robust web scraping tools using external services
- Handle error cases gracefully thanks to appropriate policies in the MCP prompt instructions

## 1. Setup

First, we load the IPython `autoreload` extension with magic commands so that Python packages are reloaded automatically whenever they change:

In [1]:
%load_ext autoreload
%autoreload 2

### Set Up Python Environment

To set up your Python virtual environment using `uv` and load it into the Notebook, follow the step-by-step instructions from the `Course Admin` lesson from the beginning of the course.

**TL;DR:** Be sure the correct kernel pointing to your `uv` virtual environment is selected.

### Configure Gemini API

To run this lesson, you'll need several API keys configured:

1. **Gemini API Key**, `GOOGLE_API_KEY` variable: Get your key from [Google AI Studio](https://aistudio.google.com/app/apikey).
2. **Firecrawl API Key**, `FIRECRAWL_API_KEY` variable: Get your key from [Firecrawl](https://firecrawl.dev/). They have a free tier that allows you to scrape 500 pages, which is enough for testing the agent for free.
3. **GitHub token (optional)**, `GITHUB_TOKEN` variable: If you want to process private GitHub repositories, you'll need a GitHub token with access to them. In case you want to test this functionality, you can get a token from [here](https://github.com/settings/personal-access-tokens). However, this is not required for the lesson, as we can easily use public repositories for explaining the functionalities.

In [2]:
from utils import env

env.load(required_env_vars=["GOOGLE_API_KEY", "FIRECRAWL_API_KEY", "GITHUB_TOKEN"])

Environment variables loaded from `/Users/fabio/Desktop/course-ai-agents/.env`
Environment variables loaded successfully.


### Import Key Packages

In [3]:
import nest_asyncio
nest_asyncio.apply() # Allow nested async usage in notebooks

## 2. Understanding the Research Agent Workflow

As we saw in the previous lesson, the research agent follows a systematic workflow. For data ingestion, the MCP prompt defines a clear two-phase approach:

- **Step 1**: Extract URLs and file references from the article guidelines.
- **Step 2**: Process all the extracted resources in parallel (local files, web URLs, GitHub repos, YouTube videos).

Here's a snapshot of the MCP prompt that defines the first two steps of the workflow:
```markdown
1. Setup:

    1.1. Explain to the user the numbered steps of the workflow. Be concise. Keep them numbered so that the user
    can easily refer to them later.
    
    1.2. Ask the user for the research directory, if not provided. Ask the user if any modification is needed for the
    workflow (e.g. running from a specific step, or adding user feedback to specific steps).

    1.3 Extract the URLs from the ARTICLE_GUIDELINE_FILE with the "extract_guidelines_urls" tool. This tool reads the
    ARTICLE_GUIDELINE_FILE and extracts three groups of references from the guidelines:
    • "github_urls" - all GitHub links;
    • "youtube_videos_urls" - all YouTube video links;
    • "other_urls" - all remaining HTTP/HTTPS links;
    • "local_files" - relative paths to local files mentioned in the guidelines (e.g. "code.py", "src/main.py").
    Only extensions allowed are: ".py", ".ipynb", and ".md".
    The extracted data is saved to the GUIDELINES_FILENAMES_FILE within the NOVA_FOLDER directory.

2. Process the extracted resources in parallel:

    You can run the following sub-steps (2.1 to 2.4) in parallel. In a single turn, you can call all the
    necessary tools for these steps.

    2.1 Local files - run the "process_local_files" tool to read every file path listed under "local_files" in the
    GUIDELINES_FILENAMES_FILE and copy its content into the LOCAL_FILES_FROM_RESEARCH_FOLDER subfolder within
    NOVA_FOLDER, giving each copy an appropriate filename (path separators are replaced with underscores).

    2.2 Other URL links - run the "scrape_and_clean_other_urls" tool to read the `other_urls` list from the
    GUIDELINES_FILENAMES_FILE and scrape/clean them. The tool writes the cleaned markdown files inside the
    URLS_FROM_GUIDELINES_FOLDER subfolder within NOVA_FOLDER.

    2.3 GitHub URLs - run the "process_github_urls" tool to process the `github_urls` list from the
    GUIDELINES_FILENAMES_FILE with gitingest and save a Markdown summary for each URL inside the
    URLS_FROM_GUIDELINES_CODE_FOLDER subfolder within NOVA_FOLDER.

    2.4 YouTube URLs - run the "transcribe_youtube_urls" tool to process the `youtube_videos_urls` list from the
    GUIDELINES_FILENAMES_FILE, transcribe each video, and save the transcript as a Markdown file inside the
    URLS_FROM_GUIDELINES_YOUTUBE_FOLDER subfolder within NOVA_FOLDER.
        Note: Please be aware that video transcription can be a time-consuming process. For reference,
        transcribing a 39-minute video can take approximately 4.5 minutes.
```

Let's examine the MCP tools involved in these first two steps of the workflow. As we saw in the previous lesson, the MCP tool endpoints are defined in the `src/routers/tools.py` file.

Source: _research_agent_part_2/mcp_server/src/routers/tools.py_

```python
def register_mcp_tools(mcp: FastMCP) -> None:
    """Register all MCP tools with the server instance."""
    
    # Step 1: Extract URLs and file references from guidelines
    @mcp.tool()
    async def extract_guidelines_urls(research_directory: str) -> Dict[str, Any]:
        """
        Extract URLs and local file references from article guidelines.
        
        Reads the ARTICLE_GUIDELINE_FILE file in the research directory and extracts:
        - GitHub URLs
        - Other HTTP/HTTPS URLs  
        - Local file references (files mentioned in quotes with extensions)
        
        Results are saved to GUIDELINES_FILENAMES_FILE in the research directory.
        """
        result = extract_guidelines_urls_tool(research_directory)
        return result

    # Step 2.1: Process local files
    @mcp.tool()
    async def process_local_files(research_directory: str) -> Dict[str, Any]:
        """Process local files referenced in the article guidelines."""
        result = process_local_files_tool(research_directory)
        return result
        
    # Step 2.2: Scrape web URLs
    @mcp.tool() 
    async def scrape_and_clean_other_urls(research_directory: str, concurrency_limit: int = 4) -> Dict[str, Any]:
        """Scrape and clean other URLs from GUIDELINES_FILENAMES_FILE."""
        result = await scrape_and_clean_other_urls_tool(research_directory, concurrency_limit)
        return result

    # Step 2.3: Process GitHub repositories
    @mcp.tool()
    async def process_github_urls(research_directory: str) -> Dict[str, Any]:
        """
        Process GitHub URLs from GUIDELINES_FILENAMES_FILE using gitingest.
        
        Reads the GUIDELINES_FILENAMES_FILE file and processes each URL listed
        under 'github_urls' using gitingest to extract repository summaries, file trees,
        and content. The results are saved as markdown files in the
        URLS_FROM_GUIDELINES_CODE_FOLDER subfolder.
        """
        result = await process_github_urls_tool(research_directory)
        return result
        
    # Step 2.4: Transcribe YouTube videos
    @mcp.tool()
    async def transcribe_youtube_urls(research_directory: str) -> Dict[str, Any]:
        """
        Transcribe YouTube video URLs from GUIDELINES_FILENAMES_FILE using Gemini 2.5 Pro.
        
        Reads the GUIDELINES_FILENAMES_FILE file and processes each URL listed
        under 'youtube_videos_urls'. Each video is transcribed, and the results are
        saved as markdown files in the URLS_FROM_GUIDELINES_YOUTUBE_FOLDER subfolder.
        """
        result = await transcribe_youtube_videos_tool(research_directory)
        return result
```

Notice how each of these tools returns a concise summary rather than the full extracted content. We'll see the exact outputs in the next sections. This design choice has several advantages:

1. **Token Efficiency**: The agent receives only essential information (counts, status, file path) rather than large content blocks.
2. **Context Management**: Keeps the agent's context window manageable for complex workflows.
3. **Selective Reading**: The agent can choose to read the output file only if needed for decision-making. However, the MCP client must have a way to read files, either as a tool or through another MCP server. You can add a separate MCP server to the MCP client, or use an MCP client that already has this capability (e.g. Cursor).
4. **Error Handling**: Clear status messages help the agent understand what succeeded or failed, and how to proceed.
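
To make the file-based pattern concrete, here is a minimal sketch, not the course's implementation. A hypothetical `save_and_summarize` helper writes a large payload to disk and returns only a short status dictionary. The function name, the file name, and the returned keys are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_and_summarize(payload: Dict[str, list], output_path: Path) -> Dict[str, Any]:
    """Write the full payload to disk and return only a compact summary.

    Hypothetical helper for illustration; names and keys are assumptions.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")

    # The agent sees counts and a path, never the payload itself.
    return {
        "status": "success",
        "items_count": sum(len(items) for items in payload.values()),
        "output_path": str(output_path),
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    big_payload = {"other_urls": [f"https://example.com/page/{i}" for i in range(1000)]}
    summary = save_and_summarize(big_payload, Path(tmp_dir) / "references.json")
    payload_chars = len(json.dumps(big_payload))
    summary_chars = len(json.dumps(summary))
    print(f"payload: {payload_chars} chars, summary: {summary_chars} chars")
```

The summary stays roughly the same size no matter how large the payload grows, which is what keeps the orchestrating agent's context small.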

Let's now see how each of these MCP tools is implemented.

## 3. Extracting URLs from Guidelines

The first tool in our data ingestion pipeline reads an article guideline file and programmatically extracts all URLs and file references it contains.

Here's its implementation:

Source: _research_agent_part_2/mcp_server/src/tools/extract_guidelines_urls_tool.py_

```python
def extract_guidelines_urls_tool(research_folder: str) -> Dict[str, Any]:
    """
    Extract URLs and local file references from the article guidelines in the research folder.
    
    Reads the ARTICLE_GUIDELINE_FILE file and extracts:
    - GitHub URLs
    - YouTube video URLs  
    - Other HTTP/HTTPS URLs
    - Local file references
    
    Results are saved to GUIDELINES_FILENAMES_FILE in the research folder.
    """
    ...

    # Convert to Path object
    research_path = Path(research_folder)
    nova_path = research_path / NOVA_FOLDER
    guidelines_path = research_path / ARTICLE_GUIDELINE_FILE
    ...
    
    # Read guidelines content
    guidelines_content = read_file_safe(guidelines_path)
    
    # Extract URLs and categorize them by domain
    all_urls = extract_urls(guidelines_content)
    github_source_urls = [u for u in all_urls if "github.com" in u]
    youtube_source_urls = [u for u in all_urls if "youtube.com" in u]
    web_source_urls = [u for u in all_urls if "github.com" not in u and "youtube.com" not in u]

    # Extract local file paths
    local_paths = extract_local_paths(guidelines_content)
    
    # Prepare the extracted data structure
    # (the "local_files" key is the one read later by process_local_files_tool)
    extracted_data = {
        "github_urls": github_source_urls,
        "youtube_videos_urls": youtube_source_urls,
        "other_urls": web_source_urls,
        "local_files": local_paths,
    }
    
    # Save to JSON file
    output_path = nova_path / GUIDELINES_FILENAMES_FILE
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(extracted_data, f, indent=2, ensure_ascii=False)
    
    return {
        "status": "success",
        "github_sources_count": len(github_source_urls),
        "youtube_sources_count": len(youtube_source_urls),
        "web_sources_count": len(web_source_urls),
        "local_files_count": len(local_paths),
        "output_path": str(output_path),
        "message": f"Successfully extracted URLs from article guidelines in '{research_folder}'. "
                  f"Found {len(github_source_urls)} GitHub URLs, {len(youtube_source_urls)} YouTube videos URLs, "
                  f"{len(web_source_urls)} other URLs, and {len(local_paths)} local file references. "
                  f"Results saved to: {output_path}"
    }
```

The code:
1. Identifies the location of the article guidelines file,
2. Uses the `extract_urls` function to extract the URLs from the guidelines content and groups them by domain (GitHub, YouTube, other),
3. Extracts local file paths with the `extract_local_paths` function,
4. Saves the extracted data to a JSON file, and
5. Returns a summary of the results.

Let's now see how the URLs are extracted from the guidelines content.

### 3.1 URLs Extraction

The `extract_urls` function from the guideline extractions handler finds all HTTP/HTTPS URLs:

Source: _research_agent_part_2/mcp_server/src/app/guideline_extractions_handler.py_

```python
def extract_urls(text: str) -> list[str]:
    """Extract all HTTP/HTTPS URLs from the given text."""
    url_pattern = re.compile(r"https?://[^\s)>\"',]+")
    return url_pattern.findall(text)
```

This regular expression pattern:
- `https?://` - Matches both HTTP and HTTPS protocols
- `[^\s)>\"',]+` - Matches any characters except whitespace, closing parentheses, greater-than signs, quotes, or commas
- This ensures URLs are extracted cleanly from markdown links, plain text, and various formatting contexts

After extraction, the URLs are categorized by domain to enable specialized processing for each type of content source.
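
To see the extraction and categorization together, here is a small self-contained sketch. It reuses the regular expression shown above and the same substring checks on the domain. The `categorize_urls` name and the sample text are assumptions made for this example.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)>\"',]+")


def categorize_urls(text: str) -> dict[str, list[str]]:
    """Extract URLs and group them by domain (illustrative grouping logic)."""
    all_urls = URL_PATTERN.findall(text)
    return {
        "github_urls": [u for u in all_urls if "github.com" in u],
        "youtube_videos_urls": [u for u in all_urls if "youtube.com" in u],
        "other_urls": [u for u in all_urls if "github.com" not in u and "youtube.com" not in u],
    }


sample = (
    "Read [the docs](https://example.com/docs/intro) first, "
    "then clone https://github.com/example/repo and "
    "watch <https://www.youtube.com/watch?v=abc123>"
)
groups = categorize_urls(sample)
print(groups)
```

Two limits of this approach are worth knowing. The pattern does not exclude periods, so a URL that ends a sentence keeps its trailing `.`. The domain check is a plain substring test, so a `youtu.be` short link would land in `other_urls`.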

### 3.2 Local File Path Extraction

The `extract_local_paths` function is used to extract local file paths from the guidelines content, and it is defined in the `app/guideline_extractions_handler.py` file.

We won't show its code here as it's not interesting for teaching how AI agents work. You can check how it works in the code if you're curious. You only need to know the following:
- It only looks for specific file extensions (`.py`, `.ipynb`, `.md`)
- It excludes anything that looks like a URL
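
For intuition only, the two rules above could be sketched as follows. This is not the code from `guideline_extractions_handler.py`: the `find_local_paths` name and the path pattern are hypothetical, and the real function may match paths differently.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)>\"',]+")
# Hypothetical pattern: a path-like token ending in one of the allowed extensions.
LOCAL_PATH_PATTERN = re.compile(r"[\w./\\-]+\.(?:py|ipynb|md)\b")


def find_local_paths(text: str) -> list[str]:
    """Return unique local file references, in order of first appearance."""
    text_without_urls = URL_PATTERN.sub(" ", text)  # drop anything that is a URL
    matches = LOCAL_PATH_PATTERN.findall(text_without_urls)
    return list(dict.fromkeys(matches))  # de-duplicate while keeping order


sample = (
    'Use "src/main.py" and notes.md, but not '
    "https://github.com/example/repo/blob/main/README.md. "
    "Also see src/main.py again."
)
paths = find_local_paths(sample)
print(paths)
```

Removing URLs before matching is what keeps `README.md` inside the GitHub link from being reported as a local file.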

### 3.3 Running the Tool

Let's test this tool programmatically to get an idea of its output:

In [None]:
from research_agent_part_2.mcp_server.src.tools import extract_guidelines_urls_tool

# Update this path to your actual sample research folder
research_folder = "/path/to/research_folder"
result = extract_guidelines_urls_tool(research_folder=research_folder)
print(result)

{'status': 'success', 'github_sources_count': 1, 'youtube_sources_count': 1, 'web_sources_count': 2, 'local_files_count': 0, 'output_path': '/Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder/.nova/guidelines_filenames.json', 'message': "Successfully extracted URLs from article guidelines in '/Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder'. Found 1 GitHub URLs, 1 YouTube videos URLs, 2 other URLs, and 0 local file references. Results saved to: /Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder/.nova/guidelines_filenames.json"}


The tool returns a structured summary. Pretty-printed, and with counts that depend on your guidelines file, it looks like this:

```json
{
  "status": "success",
  "github_sources_count": 1,
  "youtube_sources_count": 2, 
  "web_sources_count": 6,
  "local_files_count": 0,
  "output_path": "/path/to/research_folder/.nova/guidelines_filenames.json",
  "message": "Successfully extracted URLs from article guidelines in '/path/to/research_folder'. Found 1 GitHub URLs, 2 YouTube videos URLs, 6 other URLs, and 0 local file references. Results saved to: /path/to/research_folder/.nova/guidelines_filenames.json"
}
```

With this summary, the agent can tell whether the step succeeded and decide how to proceed if it failed (e.g. by asking the user for help).

## 4. Processing Local Files

The `process_local_files_tool` function handles local file references found in the guidelines. It copies referenced files to an organized folder structure and formats them for LLM consumption.

Source: _research_agent_part_2/mcp_server/src/tools/process_local_files_tool.py_

```python
def process_local_files_tool(research_directory: str) -> Dict[str, Any]:
    """
    Process local files referenced in the article guidelines.

    Reads the guidelines JSON file and copies each referenced local file
    to the local files subfolder. Path separators in filenames are
    replaced with underscores to avoid creating nested folders.

    Args:
        research_directory: Path to the research directory containing the guidelines JSON file

    Returns:
        Dict with status, processing results, and file paths
    """
    ...

    # Convert to Path object
    research_path = Path(research_directory)
    nova_path = research_path / NOVA_FOLDER

    # Look for GUIDELINES_FILENAMES_FILE
    metadata_path = nova_path / GUIDELINES_FILENAMES_FILE

    ...

    # Load JSON metadata
    data = json.loads(metadata_path.read_text(encoding="utf-8"))
    local_files = data.get("local_files", [])

    if not local_files:
        return {
            "status": "success",
            "message": f"No local files to process in research folder '{research_directory}'.",
            "files_processed": 0,
            "files_total": 0,
            "warnings": [],
            "errors": [],
        }

    # Create destination folder if it doesn't exist
    dest_folder = nova_path / LOCAL_FILES_FROM_RESEARCH_FOLDER
    dest_folder.mkdir(parents=True, exist_ok=True)

    processed = 0
    warnings = []
    errors = []
    processed_files = []

    # Initialize notebook converter for .ipynb files
    notebook_converter = NotebookToMarkdownConverter(include_outputs=True, include_metadata=False)

    for rel_path in local_files:
        # Local files are relative to the research folder
        src_path = research_path / rel_path
        ...

        # Sanitize destination filename (replace path separators with underscores)
        dest_name = rel_path.replace("/", "_").replace("\\", "_")

        try:
            # Handle .ipynb files specially by converting to markdown
            if src_path.suffix.lower() == ".ipynb":
                # Convert .ipynb to .md extension for destination
                dest_name = dest_name.rsplit(".ipynb", 1)[0] + ".md"
                dest_path = dest_folder / dest_name

                # Convert notebook to markdown string
                markdown_content = notebook_converter.convert_notebook_to_string(src_path)

                # Write markdown content to destination
                dest_path.write_text(markdown_content, encoding="utf-8")
            else:
                # For other file types, copy as before
                dest_path = dest_folder / dest_name
                shutil.copy2(src_path, dest_path)

            processed += 1
            processed_files.append(dest_name)
        except Exception as e:
            errors.append(f"Failed to process {rel_path}: {str(e)}")

    # Build result message using the dedicated function
    result_message = build_result_message(research_directory, processed, local_files, dest_folder, warnings, errors)

    return {
        "status": "success" if processed > 0 else "warning",
        "files_processed": processed,
        "files_total": len(local_files),
        "processed_files": processed_files,
        "warnings": warnings,
        "errors": errors,
        "output_directory": str(dest_folder.resolve()),
        "message": result_message,
    }
```

This local file processing tool looks for the local files extracted by `extract_guidelines_urls_tool` and copies them into a single destination folder. It distinguishes between notebooks (which it converts to Markdown) and all other file types (which it copies as is).
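
The filename handling is easy to try in isolation. The sketch below mirrors the sanitization logic from the tool in a standalone function. The `destination_name` name is an assumption made for this example.

```python
def destination_name(rel_path: str) -> str:
    """Flatten a relative path into a single filename, mapping notebooks to .md."""
    # Replace both POSIX and Windows separators so no nested folders are created.
    flat_name = rel_path.replace("/", "_").replace("\\", "_")
    if flat_name.lower().endswith(".ipynb"):
        flat_name = flat_name[: -len(".ipynb")] + ".md"
    return flat_name


for path in ["notebooks/demo.ipynb", "src/utils/helpers.py", "docs\\README.md"]:
    print(path, "->", destination_name(path))
```

One trade-off of flattening is that distinct paths can collide: `a/b.py` and `a_b.py` both map to `a_b.py`, so the later copy would overwrite the earlier one.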

The `NotebookToMarkdownConverter` class can be found in the `app/notebook_handler.py` file. We won't show its code here as it's not interesting for teaching how AI agents work. You can check how it works in the code if you're curious. You only need to know that it keeps both markdown cells and code cells, and that it also keeps the outputs of executed cells, truncated to a maximum number of characters.
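
If you want a feel for what such a conversion involves, here is a rough sketch that works directly on the notebook's JSON structure. It is not the `NotebookToMarkdownConverter` implementation: the function names, the 200-character limit, and the decision to handle only stream outputs are all assumptions.

```python
FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable
MAX_OUTPUT_CHARS = 200  # hypothetical truncation limit


def as_text(value) -> str:
    """Notebook JSON stores text as either a string or a list of lines."""
    return "".join(value) if isinstance(value, list) else str(value)


def notebook_to_markdown(notebook: dict, max_output_chars: int = MAX_OUTPUT_CHARS) -> str:
    """Render markdown cells, code cells, and truncated stream outputs."""
    chunks = []
    for cell in notebook.get("cells", []):
        source = as_text(cell.get("source", ""))
        if cell.get("cell_type") == "markdown":
            chunks.append(source)
        elif cell.get("cell_type") == "code":
            chunks.append(f"{FENCE}python\n{source}\n{FENCE}")
            for output in cell.get("outputs", []):
                # Stream outputs store their text under "text"; other kinds are skipped.
                text = as_text(output.get("text", ""))
                if len(text) > max_output_chars:
                    text = text[:max_output_chars] + "... [truncated]"
                if text:
                    chunks.append(f"Output:\n{FENCE}\n{text}\n{FENCE}")
    return "\n\n".join(chunks)


sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo\n", "Some notes."]},
        {
            "cell_type": "code",
            "source": "print('x' * 500)",
            "outputs": [{"output_type": "stream", "text": ["x" * 500 + "\n"]}],
        },
    ]
}
markdown = notebook_to_markdown(sample_notebook)
print(markdown)
```

Truncating outputs matters because a single cell that prints a large dataframe or a long log can otherwise dominate the token budget of the converted file.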

## 5. Web Scraping with Firecrawl and LLM Cleaning

This is the most complex tool in our data ingestion pipeline. It scrapes web URLs and cleans the content using both external services and LLM processing. Here's its implementation:

Source: _research_agent_part_2/mcp_server/src/tools/scrape_and_clean_other_urls_tool.py_

```python
async def scrape_and_clean_other_urls_tool(research_directory: str, concurrency_limit: int = 4) -> Dict[str, Any]:
    """
    Scrape and clean other URLs from guidelines file in the research folder.
    
    Reads the guidelines file and scrapes/cleans each URL listed
    under 'other_urls'. The cleaned markdown content is saved to the
    URLS_FROM_GUIDELINES_FOLDER subfolder with appropriate filenames.
    """    
    # Convert to Path object
    research_path = Path(research_directory)
    nova_path = research_path / NOVA_FOLDER
    
    # Look for GUIDELINES_FILENAMES_FILE file
    guidelines_file_path = nova_path / GUIDELINES_FILENAMES_FILE
    
    # Read the guidelines filenames file
    guidelines_data = json.loads(read_file_safe(guidelines_file_path))
    urls_to_scrape = guidelines_data.get("other_urls", [])
    
    if not urls_to_scrape:
        return {
            "status": "success",
            "urls_processed": [],
            "urls_failed": [],
            "total_urls": 0,
            "successful_urls_count": 0,
            "failed_urls_count": 0,
            "output_directory": str(nova_path / URLS_FROM_GUIDELINES_FOLDER),
            "message": "No other URLs found to scrape in the guidelines filenames file."
        }
    
    # Read article guidelines for context
    guidelines_path = research_path / ARTICLE_GUIDELINE_FILE
    guidelines_content = read_file_safe(guidelines_path)
    
    # Scrape URLs concurrently
    completed_results = await scrape_urls_concurrently(
        urls_to_scrape, 
        concurrency_limit, 
        guidelines_content
    )
    
    # Write results to files
    output_dir = nova_path / URLS_FROM_GUIDELINES_FOLDER
    saved_files, successful_scrapes = write_scraped_results_to_files(completed_results, output_dir)
    
    # Calculate statistics
    failed_urls = [res["url"] for res in completed_results if not res.get("success", False)]
    successful_urls = [res["url"] for res in completed_results if res.get("success", False)]
    
    return {
        "status": "success",
        "urls_processed": successful_urls,
        "urls_failed": failed_urls,
        "total_urls": len(urls_to_scrape),
        "successful_urls_count": successful_scrapes,
        "failed_urls_count": len(failed_urls),
        "output_directory": str(output_dir),
        "message": f"Successfully processed {successful_scrapes}/{len(urls_to_scrape)} URLs. "
                  f"Results saved to: {output_dir}"
    }
```

Here's how it works:
1. It looks for the URLs to scrape in the guidelines filenames file.
2. It uses the `scrape_urls_concurrently` function to scrape the URLs concurrently using Firecrawl and clean the content using an LLM.
3. It saves the cleaned content to the `URLS_FROM_GUIDELINES_FOLDER` folder.
4. It returns a summary of the results.
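
The `concurrency_limit` parameter caps how many URLs are processed at the same time. A common way to get this behavior with `asyncio` is a semaphore, sketched below. This is an assumption about the general technique, not a copy of `scrape_urls_concurrently`: `fake_scrape` and `run_with_limit` are hypothetical names, and the peak counter exists only to make the limit observable.

```python
import asyncio


async def fake_scrape(url: str) -> dict:
    """Stand-in for a real scrape-and-clean call (assumption for this sketch)."""
    await asyncio.sleep(0.01)
    return {"url": url, "success": True}


async def run_with_limit(urls: list[str], concurrency_limit: int = 4) -> tuple[list[dict], int]:
    """Process all URLs concurrently, with at most `concurrency_limit` in flight."""
    semaphore = asyncio.Semaphore(concurrency_limit)
    in_flight = 0
    peak_in_flight = 0

    async def worker(url: str) -> dict:
        nonlocal in_flight, peak_in_flight
        async with semaphore:  # waits here while the limit is reached
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
            try:
                return await fake_scrape(url)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input URLs
    results = await asyncio.gather(*(worker(u) for u in urls))
    return list(results), peak_in_flight
```

In a script you would call `asyncio.run(run_with_limit(urls, concurrency_limit=2))`. In a notebook you can `await run_with_limit(...)` directly. A low limit is gentler on rate-limited services such as a free scraping tier, at the cost of longer total run time.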

Let's see in more detail how the `scrape_urls_concurrently` function works.

### 5.1 The Two-Stage Cleaning Process

The scraping process uses a two-stage approach:

1. **Firecrawl for Initial Scraping**: Firecrawl is a specialized service that handles the complexity of modern web scraping, including:
   - JavaScript rendering
   - Dynamic content loading  
   - Anti-bot protection
   - Clean markdown extraction

2. **LLM for Content Refinement**: After Firecrawl extracts the raw content, an LLM (Gemini 2.5 Flash) further cleans and structures the content by:
   - Removing irrelevant sections (ads, navigation, footers)
   - Focusing on content relevant to the article guidelines
   - Maintaining proper markdown formatting
   - Preserving important links and references

The `scrape_url` function wraps the Firecrawl call with retries and a per-request timeout:

```python
async def scrape_url(url: str, firecrawl_app: AsyncFirecrawl) -> dict:
    """
    Scrape a URL using Firecrawl with retries and return a dict with url, title, markdown.

    Uses maxAge=1 week for 500% faster scraping by leveraging cached data when available.
    This optimization significantly improves performance for documentation, articles, and
    relatively static content while maintaining freshness within acceptable limits.
    """
    max_retries = 3
    base_delay = 5  # seconds
    timeout_seconds = 120000  # passed to Firecrawl in milliseconds: 120000 ms = 2 minutes

    for attempt in range(max_retries):
        try:
            # Add timeout to individual Firecrawl request
            # Use maxAge=1 week for 500% faster scraping with cached data
            res = await firecrawl_app.scrape(
                url, formats=["markdown"], maxAge=MAX_AGE_ONE_WEEK, timeout=timeout_seconds
            )
            title = res.metadata.title if res and res.metadata and res.metadata.title else "N/A"
            markdown_content = res.markdown if res and res.markdown else ""
            return {"url": url, "title": title, "markdown": markdown_content, "success": True}
        except asyncio.TimeoutError:
            # Manage retries
            ...
        except Exception as e:
            # Manage retries
            ...
    
    return {
        "url": url,
        "title": "Scraping Failed",
        "markdown": f"⚠️ Error scraping {url} after {max_retries} attempts.",
        "success": False,
    }
```

The core of it is the `firecrawl_app.scrape` method, which scrapes the URL and returns its content as markdown.
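
The retry handling is elided in the listing above. As a general illustration of the pattern, here is a minimal retry helper with exponential backoff. The `retry_with_backoff` name and the doubling delay schedule are assumptions. The actual `scrape_url` function may wait differently between attempts.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_delay: float = 5.0,
) -> Optional[T]:
    """Call `operation` up to `max_retries` times; return None if every attempt fails.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    This schedule is an assumption, not necessarily what `scrape_url` does.
    """
    for attempt in range(max_retries):
        try:
            return await operation()
        except Exception:  # includes asyncio.TimeoutError
            if attempt < max_retries - 1:  # no point sleeping after the last attempt
                await asyncio.sleep(base_delay * (2 ** attempt))
    return None
```

Returning a sentinel instead of raising matches the spirit of `scrape_url`, which reports a failed scrape as a result with `"success": False` so that one bad URL does not abort the whole batch.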

The LLM cleaning process is handled by the `clean_markdown` function:

Source: _research_agent_part_2/mcp_server/src/app/scraping_handler.py_

```python
async def clean_markdown(
    markdown_content: str, article_guidelines: str, url_for_log: str, chat_model: BaseChatModel
) -> str:
    """Clean markdown content via LLM and convert image syntax to URLs."""
    if not markdown_content.strip():
        return markdown_content

    prompt_text = PROMPT_CLEAN_MARKDOWN.format(article_guidelines=article_guidelines, markdown_content=markdown_content)
    timeout_seconds = 180  # 3 minutes timeout for LLM call

    try:
        # Add timeout to LLM API call
        response = await asyncio.wait_for(chat_model.ainvoke(prompt_text), timeout=timeout_seconds)
        cleaned_content = response.content if hasattr(response, "content") else str(response)

        if isinstance(cleaned_content, list):
            cleaned_content = "".join(str(part) for part in cleaned_content)

        # Post-process: convert markdown images to just URLs
        cleaned_content = convert_markdown_images_to_urls(cleaned_content)

        return cleaned_content
    except asyncio.TimeoutError:
        logger.error(f"LLM API call timed out after {timeout_seconds}s for {url_for_log}. Using original content.")
        return markdown_content
    except Exception as e:
        logger.error(f"Error cleaning markdown for {url_for_log}: {e}. Using original content.", exc_info=True)
        return markdown_content
```

The `clean_markdown` function simply uses Gemini 2.5 Flash to clean the markdown content using an appropriate prompt.
Here's the prompt used (`PROMPT_CLEAN_MARKDOWN`) to clean the markdown content:

```markdown
Your task is to clean markdown content scraped from a webpage by *only removing* all irrelevant sections such as
headers, footers, navigation bars, advertisements, sidebars, self-promotion, call-to-actions, etc.
Focus on keeping only the core textual content (and code content if there are code sections) that is pertinent to
the article guidelines provided below.
Return *only* the cleaned markdown.
Do not summarize or rewrite the original content. This task is only about *removing* irrelevant content.
Good content should be kept as is, do not touch it.

Here are the article guidelines:
<article_guidelines>
{article_guidelines}
</article_guidelines>

Here is the markdown content to clean:
<markdown_content>
{markdown_content}
</markdown_content>
```

The cleaning process significantly reduces token count while preserving the most relevant information for research purposes.
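
The post-processing helper `convert_markdown_images_to_urls` is not shown in this lesson. Based only on its name and the comment in `clean_markdown`, a plausible sketch is a regular-expression substitution that replaces each markdown image with its bare URL. The `images_to_urls` name and the pattern below are hypothetical.

```python
import re

# Matches ![alt text](url) and ![alt text](url "optional title")
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)")


def images_to_urls(markdown: str) -> str:
    """Replace each markdown image with its bare URL (hypothetical behavior)."""
    return IMAGE_PATTERN.sub(r"\1", markdown)


print(images_to_urls('See ![diagram](https://example.com/a.png "Architecture") here.'))
```

Regular links such as `[docs](https://example.com)` are left untouched because the pattern requires the leading `!`.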

### 5.2 Why Use External Scraping Services?

Web scraping is notoriously complex due to:

- **Dynamic Content**: Modern websites heavily use JavaScript
- **Anti-Bot Measures**: CAPTCHAs, rate limiting, IP blocking
- **Varied Formats**: Inconsistent HTML structures across sites
- **Performance Issues**: Slow loading, timeouts, redirects

Rather than building a robust scraper from scratch (which would require significant effort and still fall short), using a specialized service like Firecrawl allows us to:

- Focus on our core research logic
- Get reliable results across diverse websites  
- Benefit from ongoing improvements to the scraping infrastructure
- Handle edge cases that would be time-consuming to solve ourselves

Let's now test the scraping tool to see its output:

In [7]:
from research_agent_part_2.mcp_server.src.tools import scrape_and_clean_other_urls_tool

# Test the scraping tool
result = await scrape_and_clean_other_urls_tool(research_directory=research_folder, concurrency_limit=2)
print(result)



{'status': 'success', 'urls_processed': 2, 'urls_total': 2, 'files_saved': 2, 'output_directory': '/Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder/.nova/urls_from_guidelines', 'saved_files': ['function-calling-with-the-gemini-api-google-ai-for-developer.md', 'openai-platform.md'], 'message': "Scraped and cleaned 2/2 other URLs from guidelines_filenames.json in '/Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder'.\nSaved 2 files to urls_from_guidelines folder: function-calling-with-the-gemini-api-google-ai-for-developer.md, openai-platform.md"}


The output lists how many URLs were processed out of the total, the names of the saved files, and the output directory where the cleaned content is stored. You can now open the output directory (the `.nova/urls_from_guidelines` folder) to see the cleaned content.

Now, let's see how GitHub URLs are processed.

## 6. Processing GitHub URLs

For GitHub repositories, we use a different approach optimized for code analysis. The `process_github_urls_tool` function (from `mcp_server/src/tools/process_github_urls_tool.py`) leverages the `gitingest` library to extract comprehensive information from GitHub repositories, making code and documentation available for research purposes.

We won't show the code here as it's not interesting for teaching how AI agents work. You can check how it works in the code if you're curious. You only need to know that it uses the `gitingest` library to extract the information from the GitHub repositories.

Let's test the GitHub processing tool here. The GitHub URL from the sample article guideline refers to a prompting guide for GPT-5 and is available in a [public repository](https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/gpt-5_prompting_guide.ipynb), so you don't need to provide a GitHub token for it.

In [5]:
from research_agent_part_2.mcp_server.src.tools import process_github_urls_tool

# Test GitHub URL processing
result = await process_github_urls_tool(research_directory=research_folder)
print(result)

{'status': 'success', 'urls_processed': 1, 'urls_total': 1, 'files_saved': 1, 'output_directory': '/Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder/.nova/urls_from_guidelines_code', 'message': "Processed 1/1 GitHub URLs from guidelines_filenames.json in '/Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder'. Saved markdown summaries to urls_from_guidelines_code folder."}


From its result, you can see that the tool has extracted the information from the GitHub repository and saved it in the `.nova/urls_from_guidelines_code` folder. You can now open the output directory to see the extracted information.

## 7. YouTube Video Transcription

The `transcribe_youtube_videos_tool` (from the `mcp_server/src/tools/transcribe_youtube_videos_tool.py` file) leverages Gemini's multimodal capabilities to process video content directly and generate structured transcripts for research purposes.

Its core is the `transcribe_youtube` function, which performs the actual transcription. Here's its implementation:

Source: _research_agent_part_2/mcp_server/src/app/youtube_handler.py_

```python
async def transcribe_youtube(
    url: str,
    output_path: Path,
    timestamp: int = 30,
) -> None:
    """
    Transcribes a public YouTube video using a Gemini model and saves the
    result to a file.

    Args:
        url: The public URL of the YouTube video.
        output_path: The path to save the transcription markdown file.
        timestamp: The interval in seconds for inserting timestamps in the
                   transcription.
    """
    # Create client internally using settings and track with Opik if configured
    base_client = genai.Client(api_key=settings.google_api_key.get_secret_value())
    client = track_genai_client(base_client)
    model_name = settings.youtube_transcription_model

    prompt = PROMPT_YOUTUBE_TRANSCRIPTION.format(timestamp=timestamp)

    parts: list[types.Part] = [
        types.Part(
            file_data=types.FileData(file_uri=url)  # YouTube URL - no download needed
        ),
        types.Part(text=prompt),
    ]

    ...
    response: types.GenerateContentResponse = await client.aio.models.generate_content(
        model=model_name,
        contents=types.Content(parts=parts),
    )
    ...

    output_path.write_text(response.text, encoding="utf-8")
```

The Gemini API can [transcribe YouTube videos](https://ai.google.dev/gemini-api/docs/video-understanding#transcribe-video) by adding the video URL as a `types.Part` to the request, as shown above. Visit the provided link to learn more about the Gemini API's video understanding capabilities.

You can run the following code to test the YouTube transcription tool.

In [7]:
from research_agent_part_2.mcp_server.src.tools import transcribe_youtube_videos_tool

# Test YouTube transcription (note: this can be time-consuming)
result = await transcribe_youtube_videos_tool(research_directory=research_folder)
print(result)

{'status': 'success', 'videos_processed': 1, 'videos_total': 1, 'output_directory': '/Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder/.nova/urls_from_guidelines_youtube_videos', 'message': "Processed 1 YouTube URLs from guidelines_filenames.json in '/Users/fabio/Desktop/course-ai-agents/lessons/research_agent_part_2/data/sample_research_folder'. Saved transcriptions to urls_from_guidelines_youtube_videos folder."}


The output lists the number of videos processed, the total number of videos found, and the output directory where the transcriptions are saved. You can now open the output directory to see the transcription.

**Note**: Video transcription is time-intensive. A 39-minute video typically takes about 4.5 minutes to process. The tool processes videos concurrently but with controlled concurrency to respect API limits and avoid overwhelming the service.
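One common way to get this kind of controlled concurrency is an `asyncio.Semaphore`. The sketch below illustrates the pattern under stated assumptions and is not the tool's actual code: `run_with_limit`, the `worker` callback, and the default limit of 2 are hypothetical. The returned keys simply mirror the `videos_processed` and `videos_total` fields seen in the output above.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_limit(
    urls: list[str],
    worker: Callable[[str], Awaitable[None]],
    max_concurrent: int = 2,  # hypothetical default, tune to your API limits
) -> dict:
    """Run `worker` on every URL with at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> None:
        # At most `max_concurrent` coroutines can be inside this block at once.
        async with semaphore:
            await worker(url)

    # return_exceptions=True collects failures as values instead of raising the first one.
    results = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )
    failed = sum(isinstance(result, Exception) for result in results)
    return {"videos_processed": len(urls) - failed, "videos_total": len(urls)}
```

Here `worker` would be a coroutine that transcribes a single video. All tasks are scheduled immediately, but the semaphore makes the extra ones wait, so the API never sees more than `max_concurrent` requests at a time.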

## 8. Running the Full Agent with MCP Prompt

We're now ready to run the MCP client and see how these tools work together! Run the following code cell to start the client.

Once the client is running, you can:

1. **Start the workflow**: Type `/prompt/full_research_instructions_prompt` to load the complete research workflow. It will load the MCP prompt with all the instructions and feed it to the LLM, which will in turn write a message to the user asking for the research directory path and whether the workflow should be run with modifications.
2. **Answer the agent**: Give the path to your sample research folder, and tell the agent to run only the first two steps of the workflow, and to stop after that.
3. **Watch the agent work**: Observe how it runs the tools in sequence.
4. **Examine outputs**: Check the `.nova` folder for generated files.

Try these commands in sequence:
- `/prompt/full_research_instructions_prompt`
- `The research folder is /path/to/research_folder. Run only the first two steps of the workflow and stop after that, and ask me how to proceed.` Replace `/path/to/research_folder` with the path to your sample research folder.
- `/quit` after the agent has finished running the tools and asked you how to proceed.

In [8]:
from research_agent_part_2.mcp_client.src.client import main as client_main
import sys

async def run_client():
    _argv_backup = sys.argv[:]
    sys.argv = ["client"]
    try:
        await client_main()
    finally:
        sys.argv = _argv_backup

# Start client with in-memory server 
await run_client()

🛠️  Available tools: 11
📚 Available resources: 2
💬 Available prompts: 1

Available Commands: /tools, /resources, /prompts, /prompt/<name>, /resource/<uri>, /model-thinking-switch, /quit


🤔 LLM's Thoughts:
**Understanding the Research Task**

Okay, so the user wants me to run a research workflow. My first step is clear: I need to concisely explain the entire process to them, breaking it down into manageable chunks. Numbered steps will be the most straightforward approach here. Once I've laid out the plan, I'll need to gather some crucial information: the location where the research will be saved (`research_directory`) and whether they want to customize the workflow at all. This is important – I want to make sure I'm aligned with their expectations before diving in. Then, and only then, can I actually begin running the research workflow.

💬 LLM Response: Hello! I will help you with the research workflow. Here are the steps:

1.  **Setup:** Extract URLs and

Now, read the above output and notice the following:
- Since there isn't any local file to extract, the agent skipped the `process_local_files_tool` tool.
- Read the agent's thoughts to understand the reasoning behind the choices it made. They usually refer to the previous tool outputs.
- Read the final message from the agent.

## 9. Exploring Generated Files

After running the tools, examine the organized file structure in your research directory:

```
research_directory/
├── article_guideline.md                     # Input guidelines
├── .nova/                                   # Hidden folder with all data
│   ├── guidelines_filenames.json           # Extracted URLs and files
│   ├── local_files_from_research/          # Copied local files  
│   ├── urls_from_guidelines/               # Scraped web content
│   ├── urls_from_guidelines_code/          # GitHub repo summaries
│   └── urls_from_guidelines_youtube_videos/ # Video transcripts
```

Each folder contains processed content ready for the next stages of the research workflow. The file-based approach ensures that:

- **Content is persistent** across agent sessions
- **Large content blocks** don't overwhelm the agent's context
- **Selective access** allows the agent to read only relevant files
- **Human inspection** is possible for debugging and verification
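To make the token-saving idea concrete, here is a small sketch of the pattern: the tool writes the bulky content to disk and hands the agent only a compact status dictionary. This is an illustration, not the course's implementation. `save_results` and the `.md` file naming are assumptions, while the returned keys mirror those seen in the tool outputs earlier in this lesson.

```python
from pathlib import Path


def save_results(contents: dict[str, str], output_dir: Path) -> dict:
    """Write each document to disk and return only a short status message."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for name, text in contents.items():
        # One markdown file per source, so a later step can read files selectively.
        (output_dir / f"{name}.md").write_text(text, encoding="utf-8")

    # The agent sees this small dict, not the (possibly very large) file contents.
    return {
        "status": "success",
        "files_saved": len(contents),
        "output_directory": str(output_dir),
    }
```

However large the scraped pages are, the returned message stays roughly the same size, which is what keeps the orchestrating agent's context small.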

In a production setting, these files can be replaced with a database to enable more efficient querying and retrieval.
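As a rough illustration of that idea, the sketch below stores documents in SQLite using Python's built-in `sqlite3` module. The `documents` table, its columns, and both function names are hypothetical. A real deployment would likely use a different schema, and possibly full-text or vector search instead of `LIKE`.

```python
import sqlite3


def store_documents(conn: sqlite3.Connection, docs: list[tuple[str, str, str]]) -> None:
    """Insert (source_type, name, content) rows into a `documents` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "source_type TEXT NOT NULL, name TEXT NOT NULL, content TEXT NOT NULL, "
        "PRIMARY KEY (source_type, name))"
    )
    # INSERT OR REPLACE makes re-running an ingestion step idempotent.
    conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?, ?)", docs)
    conn.commit()


def find_documents(conn: sqlite3.Connection, source_type: str, keyword: str) -> list[str]:
    """Return names of documents of one type whose content contains `keyword`."""
    rows = conn.execute(
        "SELECT name FROM documents "
        "WHERE source_type = ? AND content LIKE ? ORDER BY name",
        (source_type, f"%{keyword}%"),
    )
    return [name for (name,) in rows]
```

With this layout, the agent could ask for only the names of matching documents first and then fetch the content of the few it actually needs.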