## **Search, Extract and Crawl the Web**

### **Introduction**

In this tutorial, you'll gain hands-on experience with the core capabilities of the Tavily API: searching the web with semantic understanding, extracting content from live web pages, and crawling entire websites.

These skills are essential for anyone building AI agents or applications that need up-to-date, relevant information from the internet. By learning how to programmatically access and process real-time web data, you'll be able to bridge the gap between static language models and the dynamic world they operate in, making your agents smarter, more accurate, and more context-aware.

We'll cover:
- How to perform web searches and retrieve the most relevant results
- How to extract clean, usable content from any URL
- How to crawl websites to gather comprehensive information
- How to fine-tune your queries with advanced parameters

### **Getting Started**

1. **Sign up** for Tavily at `app.tavily.com` to get an API key.
2. **Copy your API key** and paste it into a `.env` file as `TAVILY_API_KEY`.

In [1]:
# Install dependencies
!pip install -q -q -q python-dotenv tavily-python

### **Setting up Tavily Client**

In [2]:
import os
from dotenv import load_dotenv
from tavily import TavilyClient

load_dotenv()

tavily_api_key = os.getenv("TAVILY_API_KEY")

if tavily_api_key:
    tavily_client = TavilyClient(api_key=tavily_api_key)
    print("Tavily client initialized successfully.")
else: 
    print("Please set the TAVILY_API_KEY environment variable.")

Tavily client initialized successfully.
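
The setup cell above prints a message when the key is missing, but later cells would still fail because `tavily_client` is never created. A minimal sketch of a fail-fast alternative follows. The helper name `require_api_key` and the example key value are made up for illustration; only the `TAVILY_API_KEY` variable name comes from this tutorial.

```python
import os


def require_api_key(env=None, name="TAVILY_API_KEY"):
    """Return the API key from `env`, or raise if it is missing or blank."""
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return key


# Demonstration with a stand-in mapping instead of the real environment.
print(require_api_key({"TAVILY_API_KEY": "example-key"}))
```

In the notebook, the returned value could be passed to `TavilyClient(api_key=...)` after `load_dotenv()` has run.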


### **Search**

Let's run a basic web search query to retrieve up-to-date information about NYC.

In [None]:
search_results = tavily_client.search(
    query="What happened in NYC today?",
    max_results=3
)

In [9]:
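# Print each result dictionary returned by the search
for result in search_results["results"]:
    print(result)

# Show the fields available on a result (uses the last result from the loop)
print(result.keys())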

{'url': 'https://abc7ny.com/news/', 'title': 'Breaking News | Eyewitness News Feed - ABC7 New York', 'content': 'The massive fire has completely destroyed at least six businesses along Maple Avenue, between Winans and Conklin avenues, according to officials. Eyewitness', 'score': 0.50769037, 'raw_content': None}
{'url': 'https://www.cbsnews.com/newyork/local-news/new-york/', 'title': 'New York News', 'content': "High tide brings coastal flooding to parts of Long Island. A nor'easter that's bearing down on the NYC area continues to cause flooding concerns today for Long", 'score': 0.46497503, 'raw_content': None}
{'url': 'https://nypost.com/metro/', 'title': 'Breaking NYC News & Local Headlines | New York Post', 'content': 'Grieving parents sue NYC day care after 1-year-old drowned as caretaker was cooking.', 'score': 0.3831801, 'raw_content': None}
dict_keys(['url', 'title', 'content', 'score', 'raw_content'])
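
Each result carries a `score` field, so one common follow-up step is to drop weak matches before passing results to an agent. The sketch below assumes a higher score means a more relevant result. The helper name `filter_by_score` and the `0.4` threshold are arbitrary choices for illustration, and the sample data is an abridged copy of the output above, so no API call is made.

```python
# Sample results copied (abridged) from the search output in this tutorial.
sample_results = [
    {"url": "https://abc7ny.com/news/", "score": 0.50769037},
    {"url": "https://www.cbsnews.com/newyork/local-news/new-york/", "score": 0.46497503},
    {"url": "https://nypost.com/metro/", "score": 0.3831801},
]


def filter_by_score(results, min_score=0.4):
    """Keep results scoring at least `min_score`, highest score first."""
    # Results without a score are treated as 0.0 and therefore dropped.
    kept = [r for r in results if r.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


for r in filter_by_score(sample_results):
    print(f'{r["score"]:.2f}  {r["url"]}')
```

The same function can be applied directly to `search_results["results"]` from a live search.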


Let's run another search query with a more specific question, this time narrowing the results by time range, source domain, and topic.

In [10]:
search_results = tavily_client.search(
    query="Anthropic model release?",
    max_results=3,
    time_range="month",
    include_domains=["techcrunch.com"],
    topic="news"
)

In [11]:
for result in search_results["results"]:
    print(result["title"])
    print(result["url"])
    print(result["content"])
    print("\n")

‘Selling coffee beans to Starbucks’ – how the AI boom could leave AI’s biggest companies behind - TechCrunch
https://techcrunch.com/2025/09/14/selling-coffee-beans-to-starbucks-how-the-ai-boom-could-leave-ais-biggest-companies-behind/
It might seem like a silly question, but it’s come up a lot in my conversations with AI startups, which are increasingly comfortable with businesses that used to be dismissed as “GPT wrappers,” or companies that build interfaces on top of existing AI models like ChatGPT. Throughout the contemporary boom, the success of AI has been inextricable from the success of the companies building foundation models — specifically, OpenAI, Anthropic, and Google. For years, foundation model development was the only AI business there was — and the fast pace of progress made their lead seem insurmountable. The assumption was that, however AI models ended up making money, the lion’s share of the benefit would flow back to the foundation model companies, who had done the w

### **Extract**

Next, we'll use the Tavily extract endpoint to retrieve the complete content (i.e., `raw_content`) of each page using the URLs from our previous search results. Instead of relying on the short content snippets returned by search, this gives us access to the full text of each page. For efficiency, the extract endpoint can process up to 20 URLs in a single call.

In [12]:
extract_results = tavily_client.extract(
    urls=[result["url"] for result in search_results["results"]]
)

In [None]:
# Print the results
for result in extract_results["results"]:
    print(result["url"])
    print(result["raw_content"])
    print("\n")
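
Because a single extract call accepts up to 20 URLs, a longer list has to be split into batches first. The following is a minimal sketch of that batching step only. The helper name `chunked` and the `example.com` URLs are hypothetical, and no request is sent.

```python
def chunked(urls, size=20):
    """Yield consecutive slices of `urls`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical URL list, used only to show the batching.
urls = [f"https://example.com/page/{i}" for i in range(45)]
batches = list(chunked(urls))
print([len(b) for b in batches])  # prints [20, 20, 5]
```

Each batch could then be passed to `tavily_client.extract(urls=batch)` in turn.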

Rather than calling the extract endpoint separately to return raw page content, we can combine search and extraction into a single API call by passing `include_raw_content=True` to the search endpoint.

In [14]:
search_results = tavily_client.search(
    query="Anthropic model release?",
    max_results=1,
    include_raw_content=True
)

In [None]:

# Print the results
for result in search_results["results"]:
    print(result["url"])
    print(result["content"])
    print(result["score"])
    print(result["raw_content"])
    print("\n")
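
In the first search output of this tutorial, `raw_content` was `None` because it had not been requested. Code that consumes results may therefore want a fallback to the shorter `content` snippet. The sketch below assumes `raw_content` can also be missing for individual results. The helper name `best_text` and the `max_chars` limit are made up for illustration.

```python
def best_text(result, max_chars=500):
    """Return the full page text if present, else the snippet, truncated."""
    # `or` skips both None and empty strings, so the snippet is the fallback.
    text = result.get("raw_content") or result.get("content") or ""
    return text[:max_chars]


# Stand-in results: one with full text, one where raw_content is None.
with_raw = {"content": "short snippet", "raw_content": "full page text"}
without_raw = {"content": "short snippet", "raw_content": None}

print(best_text(with_raw))
print(best_text(without_raw))
```

Truncating to `max_chars` keeps printed output and downstream prompts to a manageable size.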

### **Crawl**

Now let's use Tavily to crawl a website and discover its linked pages. Web crawling is the process of automatically navigating through websites by following hyperlinks to discover web pages and URLs (think of it like falling down a Wikipedia rabbit hole: clicking from page to page, diving deeper into interconnected topics). For autonomous web agents, this capability is essential for accessing deep web data that might be difficult to retrieve via search.

Let's begin by crawling the Tavily website to gather all nested pages.

In [16]:
crawl_results = tavily_client.crawl(url="tavily.com")

In [None]:
for result in crawl_results["results"]:
    print(result["url"])

If you're interested in just the links (without the full page content), use the Map endpoint. It's a faster and more cost-effective way to retrieve all the links from a site.

In [22]:
map_results = tavily_client.map(url="tavily.com")

In [None]:
map_results

The `instructions` parameter of the crawl and map endpoints lets you guide the crawl using natural language instructions.

In [20]:
guided_map_results = tavily_client.map(
    url="tavily.com",
    instructions="find only the developer docs"
)

In [21]:
guided_map_results

{'base_url': 'tavily.com',
 'results': ['https://docs.tavily.com/',
  'https://docs.tavily.com/api-reference',
  'https://docs.tavily.com/api-reference/endpoint/crawl',
  'https://docs.tavily.com/api-reference/endpoint/extract',
  'https://docs.tavily.com/api-reference/endpoint/search',
  'https://docs.tavily.com/documentation/api-reference/endpoint/crawl',
  'https://docs.tavily.com/documentation/api-reference/endpoint/extract',
  'https://docs.tavily.com/documentation/api-reference/endpoint/map',
  'https://docs.tavily.com/documentation/api-reference/endpoint/search',
  'https://docs.tavily.com/documentation/api-reference/endpoint/usage',
  'https://docs.tavily.com/documentation/api-reference/introduction',
  'https://docs.tavily.com/sdk',
  'https://docs.tavily.com/sdk/javascript/quick-start',
  'https://docs.tavily.com/sdk/javascript/reference',
  'https://docs.tavily.com/sdk/python/quick-start',
  'https://docs.tavily.com/sdk/python/reference'],
 'response_time': 4.11,
 'request_i
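
The guided map response lists URLs as plain strings under `results`, so they can be narrowed further locally. The sketch below keeps only URLs under a given path using the standard library's `urllib.parse`. The helper name `filter_by_path_prefix` is hypothetical, and the sample URLs are copied from the output above.

```python
from urllib.parse import urlparse

# A few URLs copied from the guided map output in this tutorial.
mapped_urls = [
    "https://docs.tavily.com/api-reference/endpoint/crawl",
    "https://docs.tavily.com/documentation/api-reference/endpoint/crawl",
    "https://docs.tavily.com/sdk/python/quick-start",
    "https://docs.tavily.com/sdk/javascript/quick-start",
]


def filter_by_path_prefix(urls, prefix):
    """Keep URLs whose path component starts with `prefix`."""
    return [u for u in urls if urlparse(u).path.startswith(prefix)]


print(filter_by_path_prefix(mapped_urls, "/sdk/python"))
```

With a live response, the same call would take `guided_map_results["results"]` as its first argument.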

### **Conclusion & Next Steps**

In this tutorial, you learned how to:
- Perform real-time web searches using the Tavily API
- Extract content from web pages
- Crawl and map websites to gather links and information
- Guide crawls with natural language instructions for targeted data extraction

These foundational skills enable your agents to access and utilize up-to-date web information, making them more powerful and context-aware. 