# Build an Agentic Web Crawler with Tavily & OpenAI
---

This notebook demonstrates how to build an agentic web crawler that intelligently searches and crawls for information based on user objectives.

## Overview

The crawler navigates the web autonomously to fulfill a specific information need. Our approach combines Tavily's 🔍 `/Search` and 🕸️ `/Crawl` endpoints with OpenAI's LLMs to create a system that can:

1. **Understand user objectives** - Turn a natural-language goal into structured search parameters
2. **Discover relevant sources** - Leverage Tavily `/Search` to identify promising websites
3. **Make strategic decisions** - Use LLMs to evaluate and select the most valuable sites to explore
4. **Extract comprehensive content** - Deploy Tavily's `/Crawl` API to gather in-depth information
5. **Synthesize findings** - Consolidate and analyze the collected data
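
The five steps above can be read as a simple pipeline. The sketch below only illustrates that control flow: `run_agentic_crawl` and its callable parameters are hypothetical placeholders, not part of the Tavily or OpenAI SDKs. The rest of the notebook implements each stage concretely.

```python
from typing import Any, Callable, Dict, List

Page = Dict[str, Any]


def run_agentic_crawl(
    objective: str,
    search: Callable[[str], List[Page]],
    select: Callable[[str, List[Page]], List[str]],
    crawl: Callable[[str], List[Page]],
    summarize: Callable[[str, List[Page]], str],
) -> str:
    """Hypothetical skeleton: search, pick sites, crawl them, summarize."""
    candidates = search(objective)         # step 2: discover sources
    urls = select(objective, candidates)   # step 3: decide what to crawl
    pages: List[Page] = []
    for url in urls:                       # step 4: gather content
        pages.extend(crawl(url))
    return summarize(objective, pages)     # step 5: synthesize findings
```

Passing the stages in as callables keeps the skeleton independent of any particular API client, which also makes it easy to exercise with stubs.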

---

In [1]:
import getpass
import os
import requests
import json
from typing import List, Dict, Any
from openai import OpenAI

# Check for environment variables or prompt for API keys
if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY:\n")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY:\n")

TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [2]:
from tavily import TavilyClient

# Initialize Tavily client
tavily_client = TavilyClient(TAVILY_API_KEY)
# Initialize OpenAI client
openai_client = OpenAI(api_key=OPENAI_API_KEY)

## Step 1: Define User Objective

First, we'll define the objective that the agent will use to guide its search and crawling strategy. An LLM-based optimizer then converts that objective into structured Tavily search parameters.

---

In [3]:
# Define user objective
user_objective = "I want to learn about the Tavily API"

In [4]:
from optimize_parameters.optimize import OptimizeParameters
from optimize_parameters.schemas import (
    TavilySearchParameters,
)

# Using OpenAI models
optimizer = OptimizeParameters(model="gpt-4o-mini", provider="openai")
params = optimizer.optimize_parameters(user_objective)

In [5]:
params

TavilySearchParameters(query='Tavily API', include_domains=None, exclude_domains=None, include_images=False, include_image_descriptions=False, topic='general', time_range=None)

In [6]:
def search_tavily(parameters: TavilySearchParameters) -> Dict[str, Any]:
    """
    Execute a search with the Tavily API using the provided parameters.

    Args:
        parameters: The parameters to use for the search

    Returns:
        Dict[str, Any]: The raw search results from the Tavily API
    """
    # Convert the parameters to a dictionary, dropping unset (None) fields,
    # so it can be passed directly to the Tavily client as keyword arguments
    params_dict = parameters.model_dump(exclude_none=True)
    params_dict["search_depth"] = "advanced"

    # Execute the search
    return tavily_client.search(**params_dict)
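
To make the conversion step concrete, here is a minimal standalone sketch of what `model_dump(exclude_none=True)` achieves before the call to `tavily_client.search`. It uses a hypothetical dataclass, `SearchParams`, as a stand-in for `TavilySearchParameters`, whose real definition lives in the `optimize_parameters` package and is not shown here. The field list is an assumption copied from the `params` output printed above.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SearchParams:
    """Hypothetical stand-in for TavilySearchParameters (fields assumed)."""
    query: str
    include_domains: Optional[List[str]] = None
    exclude_domains: Optional[List[str]] = None
    include_images: bool = False
    include_image_descriptions: bool = False
    topic: str = "general"
    time_range: Optional[str] = None


def to_search_kwargs(params: SearchParams) -> Dict[str, Any]:
    """Drop None-valued fields and force the advanced search depth."""
    kwargs = {k: v for k, v in asdict(params).items() if v is not None}
    kwargs["search_depth"] = "advanced"
    return kwargs


search_kwargs = to_search_kwargs(SearchParams(query="Tavily API"))
print(search_kwargs)
```

Only the fields that were explicitly set (or have non-`None` defaults) survive, so optional filters such as domain lists are never sent as empty values.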

## Step 2: Perform Initial Search with Tavily

We'll use Tavily's search API to find the most relevant websites related to the user's objective. This will serve as our starting point for determining which sites to crawl.

---

In [7]:
# Perform the initial search
search_results = search_tavily(params)

print(f"Found {len(search_results.get('results', []))} potential sources\n")

Found 5 potential sources



Let's take a quick look at each result's title, URL, and content snippet as returned by Tavily.

In [8]:
for result in search_results.get("results", []):
    print(f"Title: {result.get('title')}")
    print(f"URL: {result.get('url')}")
    print(f"Content Snippet: {result.get('content')}")
    print("\n")

Title: Tavily
URL: https://tavily.com/
Content Snippet: Tavily Search API is a specialized search engine designed for Large Language Models (LLMs) and AI agents. It provides real-time, accurate, and unbiased information, enabling AI applications to retrieve and process data efficiently. Tavily is built with AI developers in mind, simplifying the process of integrating dynamic web information into AI-driven solutions.

How is Tavily Search API different from other APIs? [...] Unlike Bing, Google and SerpAPI, Tavily Search API reviews multiple sources to find the most relevant content from each source, delivering concise, ready-to-use information optimized for LLM context. This focus on RAG and LLMs ensures your AI applications access only the highest-quality data. Tavily Search API is also more affordable and flexible.

What are the key advantages of using Tavily Search API? [...] ### Tavily is trusted by AI leaders around the world

![quotes](./images/icons/quotes.svg)

We've been usin

## Step 3: Analyze Search Results with LLM

Now we'll use OpenAI's `o1` reasoning model to analyze the search results and decide which sites are the most relevant to crawl. This is where the system becomes "agentic" - making decisions about what to explore next.

We pass the titles, URLs, and content snippets returned by the Tavily `/Search` endpoint to the model as context.

---

In [9]:
def analyze_search_results(objective: str, results: List[Dict]) -> List[Dict]:
    """Use LLM to analyze search results and select the most relevant sites to crawl."""
    results_text = "\n".join(
        [
            f"{i+1}. {r['title']}\n   URL: {r['url']}\n   Content Snippet: {r['content']}"
            for i, r in enumerate(results)
        ]
    )

    prompt = f"""
    User Objective: {objective}
    
    Search Results:
    {results_text}
    
    Identify up to 2 URLs that contain the most relevant information for the user's specific objective.
    When selecting URLs, you may trim URL paths to more general directories if those would provide broader coverage of relevant information. For example, instead of 'amazon.com/careers/engineering/ai/', choose 'amazon.com/careers/engineering/' if the user wants information about all engineering positions rather than just AI roles.
    The URLs will later be used as starting points for crawling each website, so prefer the most general path that still covers the relevant information.

    Format your response as a JSON object with a 'selected_sites' array of objects, each containing a 'url' key.
    """

    response = openai_client.chat.completions.create(
        model="o1-2024-12-17",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )

    return json.loads(response.choices[0].message.content).get("selected_sites", [])


selected_sites = analyze_search_results(
    user_objective, search_results.get("results", [])
)
print(f"Selected {len(selected_sites)} base urls for crawling.")

Selected 2 base urls for crawling.


In [10]:
selected_sites

[{'url': 'https://tavily.com/'},
 {'url': 'https://python.langchain.com/v0.1/docs/integrations/retrievers/tavily/'}]
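
Because the selected URLs come from model output, it may be worth checking them before spending crawl credits. The sketch below is one possible guard and is not part of the original notebook: `parse_selected_sites` is a hypothetical helper that parses the JSON reply, keeps only well-formed `http(s)` URLs, removes duplicates, and caps the list at the two URLs requested in the prompt.

```python
import json
from typing import List
from urllib.parse import urlparse


def parse_selected_sites(raw_reply: str, max_sites: int = 2) -> List[str]:
    """Hypothetical guard: return at most max_sites valid, unique URLs."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError:
        return []  # the reply was not valid JSON

    sites = payload.get("selected_sites") if isinstance(payload, dict) else None
    if not isinstance(sites, list):
        return []

    urls: List[str] = []
    for site in sites:
        url = site.get("url") if isinstance(site, dict) else None
        if not isinstance(url, str):
            continue
        parsed = urlparse(url)
        # Accept only absolute http(s) URLs that have a host, skipping repeats
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in urls:
            urls.append(url)
    return urls[:max_sites]
```

The helper returns plain URL strings rather than the `{'url': ...}` dictionaries used elsewhere in the notebook, so a caller would need to adapt whichever shape it expects.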

## Step 4: Crawl Selected Websites

Now we'll crawl each selected website with Tavily's `/Crawl` endpoint, starting from the URLs chosen by the agent. Every crawl uses the same fixed limits on depth, breadth, and total page count.

---

In [16]:
# Crawling function
def crawl_website(url: str) -> Dict:
    """Crawl a website using Tavily's API."""
    response = requests.post(
        "https://api.tavily.com/crawl",
        headers={"Authorization": f"Bearer {TAVILY_API_KEY}"},
        json={
            "url": url,
            "limit": 50,
            "max_depth": 2,
            "extract_depth": "basic",
            "max_breadth": 20,
            "select_domains": [],
            # "select_paths": ["/examples/*"],
        },
    )

    if response.status_code != 200:
        print(f"Error crawling {url}: {response.status_code} {response.text}")
        return {"url": url, "status": "error", "results": []}

    return response.json()

In [17]:
# Perform crawling (this may take a minute or two)
crawl_results = [crawl_website(site["url"]) for site in selected_sites]

In [19]:
def extract_data_fields(crawl_results):
    """Flatten the per-site crawl responses into a single list of page records."""
    data_fields = []
    for result in crawl_results:
        # Keep only responses that returned pages; failed crawls
        # (see crawl_website) carry an empty "results" list
        if result.get("results"):
            data_fields.extend(result["results"])
    return data_fields

In [20]:
crawl_data = extract_data_fields(crawl_results)

In [21]:
crawl_data

[]
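
When several related sites are crawled, the same page can appear in more than one response. The sketch below shows one way to flatten and deduplicate results by URL. It assumes each crawl response is a dictionary with a `results` list whose items carry `url` and `raw_content` keys. That shape is an assumption about the `/Crawl` payload and should be checked against a real response. `flatten_unique_pages` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def flatten_unique_pages(crawl_responses: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: merge crawl responses, keeping the first copy of each URL."""
    seen = set()
    pages: List[Dict[str, Any]] = []
    for response in crawl_responses:
        # "results" and its item keys are assumed, not confirmed by the notebook
        for page in response.get("results") or []:
            url = page.get("url")
            # Skip pages without a URL, without content, or already collected
            if not url or url in seen or not page.get("raw_content"):
                continue
            seen.add(url)
            pages.append(page)
    return pages
```

Dropping empty pages at this stage also avoids sending content-free records to the LLM later on.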

## Step 5: Process and Extract Key Information

Now we'll process the crawled content in batches. An LLM summarizes each batch, and a final call combines the batch summaries into a single report focused on the user's objective.

---


In [16]:
# After you have your crawl_result


def generate_batch_reports(crawl_result, batch_size=5):
    """
    Process crawl results in batches and generate reports using an LLM.

    Args:
        crawl_result: List of crawled page records to summarize
        batch_size: Number of results to analyze in each LLM call

    Returns:
        str: The final report, combined from the per-batch reports
    """
    # Split crawl results into batches of batch_size
    batches = []
    for i in range(0, len(crawl_result), batch_size):
        batch = crawl_result[i : i + batch_size]
        batches.append(batch)

    print(f"Split crawl results into {len(batches)} batches")

    # Process each batch with the LLM
    batch_reports = []
    for i, batch in enumerate(batches):
        print(f"Processing batch {i+1}/{len(batches)}...")

        # Create a prompt for the LLM to analyze this batch
        prompt = f"""
        Analyze the following web crawl results and provide a detailed summary of the key information:
        
        {batch}
                
        Format your response as a well-structured report section.
        """

        # Call the LLM with the prompt
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        # Keep only the generated text, not the full API response object
        batch_reports.append(response.choices[0].message.content)

    # Combine the batch reports into a final comprehensive report
    prompt = f"""
    User Objective: {user_objective}
    Create a report based on the findings:
    {batch_reports}
    """

    final_report = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )

    return final_report.choices[0].message.content
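
Batching by a fixed number of pages does not bound prompt size, since a single crawled page can be very long. As a rough sketch of an alternative, the hypothetical helper below groups pages by a character budget and truncates oversized pages. The `raw_content` key and the 20,000-character default are assumptions chosen for illustration, and character counts are only a crude proxy for tokens.

```python
from typing import Any, Dict, List


def batch_by_char_budget(
    pages: List[Dict[str, Any]], max_chars: int = 20_000
) -> List[List[Dict[str, Any]]]:
    """Hypothetical helper: group pages so each batch stays within max_chars."""
    batches: List[List[Dict[str, Any]]] = []
    current: List[Dict[str, Any]] = []
    used = 0
    for page in pages:
        # Truncate any single page that would exceed the budget on its own
        text = (page.get("raw_content") or "")[:max_chars]
        # Start a new batch when adding this page would overflow the current one
        if current and used + len(text) > max_chars:
            batches.append(current)
            current, used = [], 0
        current.append({"url": page.get("url"), "raw_content": text})
        used += len(text)
    if current:
        batches.append(current)
    return batches
```

Each returned batch could then be passed to the LLM in place of the fixed-size slices, at the cost of losing whatever text falls past the truncation point.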

In [None]:
# Use the function on your crawl results
report_results = generate_batch_reports(crawl_data)

In [18]:
# You can also save the report to a file
with open("report.md", "w") as f:
    f.write(report_results)

View the saved report file: [report.md](report.md)

In [None]:
print("FINAL REPORT:")
print(report_results)