## **Search, Extract and Crawl the Web**

### **Introduction**

In this tutorial, you'll gain hands-on experience with the core capabilities of the Tavily API: searching the web with semantic understanding, extracting content from live web pages, and crawling entire websites.

These skills are essential for anyone building AI agents or applications that need up-to-date, relevant information from the internet. By learning how to programmatically access and process real-time web data, you'll be able to bridge the gap between static language models and the dynamic world they operate in, making your agents smarter, more accurate, and more context-aware.

We'll cover:
- How to perform web searches and retrieve the most relevant results
- How to extract clean, usable content from any URL
- How to crawl websites to gather comprehensive information
- How to fine-tune your queries with advanced parameters

### **Getting Started**

1. **Sign up** for Tavily at `app.tavily.com` to get an API key.
2. **Copy the API key** and paste it into a `.env` file as `TAVILY_API_KEY=<your-api-key>`.

In [1]:
# Install dependencies
!uv add python-dotenv tavily-python --quiet

### **Setting up Tavily Client**

In [2]:
import os
from dotenv import load_dotenv
from tavily import TavilyClient

load_dotenv()

tavily_api_key = os.getenv("TAVILY_API_KEY")

if tavily_api_key:
    tavily_client = TavilyClient(api_key=tavily_api_key)
    print("Tavily client initialized successfully.")
else: 
    print("Please set the TAVILY_API_KEY environment variable.")

Tavily client initialized successfully.


### **Search**

Let's run a basic web search query to retrieve up-to-date information about NYC.

In [3]:
search_results = tavily_client.search(
    query="What happened in NYC today?",
    max_results=3
)

In [4]:
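# Print each result dict returned by the search
for result in search_results["results"]:
    print(result)

# `result` still refers to the last item from the loop above
print(result.keys())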

{'url': 'https://www.cbsnews.com/newyork/local-news/new-york/', 'title': 'New York - CBS News', 'content': "High tide brings coastal flooding to parts of Long Island. A nor'easter that's bearing down on the NYC area continues to cause flooding concerns today for Long", 'score': 0.46672463, 'raw_content': None}
{'url': 'https://www.nyc.gov/main/events', 'title': 'New York City events - NYC.gov', 'content': 'Columbus Day. Federal, State, & City Holiday - NYC public schools are closed. Post offices are closed. City offices and other government buildings are c.', 'score': 0.4457955, 'raw_content': None}
{'url': 'https://abc7ny.com/news/', 'title': 'Breaking News | Eyewitness News Feed - ABC7 New York', 'content': 'Police discovered a 27-year-old man shot multiple times throughout the body at 79 Alexander Ave. in the Mott Haven section. Eyewitness News at', 'score': 0.34820226, 'raw_content': None}
dict_keys(['url', 'title', 'content', 'score', 'raw_content'])
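
Each result carries a `score` field, so one natural follow-up is to discard weaker matches before passing results to an agent. The sketch below assumes that a higher `score` means a more relevant result. The `filter_by_score` helper and the `0.4` threshold are illustrative choices, not part of the Tavily client, and the sample data is a shortened copy of the output above so the cell runs without an API call.

```python
def filter_by_score(results, min_score=0.4):
    """Keep results at or above min_score, highest score first.

    Hypothetical helper: the threshold is an arbitrary illustration,
    not a value recommended by Tavily.
    """
    kept = [r for r in results if r.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# Sample data shaped like search_results["results"] (fields shortened).
sample_results = [
    {"url": "https://abc7ny.com/news/", "score": 0.34820226},
    {"url": "https://www.cbsnews.com/newyork/local-news/new-york/", "score": 0.46672463},
    {"url": "https://www.nyc.gov/main/events", "score": 0.4457955},
]

relevant = filter_by_score(sample_results, min_score=0.4)
for r in relevant:
    print(f'{r["score"]:.2f}  {r["url"]}')
```

With a live response, the same call would be `filter_by_score(search_results["results"])`.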


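Let's run another search with a more specific question. This time we restrict the results to news articles from `techcrunch.com` published within the past month.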

In [5]:
search_results = tavily_client.search(
    query="Anthropic model release?",
    max_results=3,
    time_range="month",
    include_domains=["techcrunch.com"],
    topic="news"
)

In [6]:
for result in search_results["results"]:
    print(result["title"])
    print(result["url"])
    print(result["content"])
    print("\n")

Salesforce launches ‘Missonforce,’ a national security-focused business unit - TechCrunch
https://techcrunch.com/2025/09/16/salesforce-launches-missonforce-a-national-security-focused-business-unit/
It will be focused on incorporating AI into defense workflows in three main areas: personnel, logistics, and decision making, according to a company press release. This news is the latest in a wave of tech companies building and offering services specifically for the U.S. government. OpenAI launched a version of its ChatGPT designed for U.S. government agencies in January. In August, the company announced it struck a deal with the government to give federal agencies access to its enterprise ChatGPT tier for just $1 a year. AI, AI agents, Anthropic, artificial intelligence, defense, Enterprise, Google, Government & Policy, national security, OpenAI, Salesforce, United States Becca is a senior writer at TechCrunch that covers venture capital trends and startups.


Former UK Prime Minister Ris

### **Extract**

Next, we'll use the Tavily extract endpoint to retrieve the complete content (i.e., `raw_content`) of each page, using the URLs from our previous search results. Instead of relying on the short content snippets returned by search, this gives us the full text of each page. For efficiency, the extract endpoint can process up to 20 URLs in a single call.

In [7]:
extract_results = tavily_client.extract(
    urls=[result["url"] for result in search_results["results"]]
)

In [8]:
# Print the results
for result in extract_results["results"]:
    print(result["url"])
    print(result["raw_content"])
    print("\n")

https://techcrunch.com/2025/09/16/salesforce-launches-missonforce-a-national-security-focused-business-unit/
Published Time: 2025-09-16T14:00:00+00:00

Salesforce launches 'Missonforce,' a national security-focused business unit | TechCrunch

[Skip to content](https://techcrunch.com/2025/09/16/salesforce-launches-missonforce-a-national-security-focused-business-unit/#wp--skip-link--target)

[![Image 1](https://techcrunch.com/wp-content/uploads/2024/09/tc-lockup.svg)TechCrunch Desktop Logo](https://techcrunch.com/)[![Image 2](https://techcrunch.com/wp-content/uploads/2024/09/tc-logo-mobile.svg)TechCrunch Mobile Logo](https://techcrunch.com/)

*   [Latest](https://techcrunch.com/latest/)
*   [Startups](https://techcrunch.com/category/startups/)
*   [Venture](https://techcrunch.com/category/venture/)
*   [Apple](https://techcrunch.com/tag/apple/)
*   [Security](https://techcrunch.com/category/security/)
*   [AI](https://techcrunch.com/category/artificial-intelligence/)
*   [Apps](https://
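
Because a single extract call accepts up to 20 URLs, a longer URL list has to be split into batches first. The following is a minimal sketch of that batching step. The `chunked` helper and the `example.com` URLs are hypothetical and exist only for illustration.

```python
def chunked(urls, size=20):
    """Yield successive lists of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical URLs, for illustration only.
urls = [f"https://example.com/page/{i}" for i in range(45)]

batches = list(chunked(urls))
print([len(batch) for batch in batches])  # [20, 20, 5]
```

Each batch could then be passed to `tavily_client.extract(urls=batch)` in a loop, collecting the `results` list from every response.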

Rather than calling the extract endpoint separately to return raw page content, we can combine search and extraction into a single API call by passing `include_raw_content=True` to the search endpoint.

In [9]:
search_results = tavily_client.search(
    query="Anthropic model release?",
    max_results=1,
    include_raw_content=True
)

In [10]:

# Print the results
for result in search_results["results"]:
    print(result["url"])
    print(result["content"])
    print(result["score"])
    print(result["raw_content"])
    print("\n")

https://www.cnbc.com/2025/10/15/anthropic-claude-haiku-4-5-ai.html
[Skip Navigation](https://www.cnbc.com/2025/10/15/anthropic-claude-haiku-4-5-ai.html#MainContent) [SIGN IN](https://www.cnbc.com/2025/10/15/anthropic-claude-haiku-4-5-ai.html#) [Create free account](https://www.cnbc.com/2025/10/15/anthropic-claude-haiku-4-5-ai.html#) [Anthropic](https://www.cnbc.com/2025/06/10/anthropic-cnbc-disruptor-50.html) on Wednesday announced Claude Haiku 4.5, a small [artificial intelligence](https://www.cnbc.com/ai-artificial-intelligence/) model that's available as a lower-cost offering for all of the company's users. Claude Haiku 4.5 is better at using computers than [Claude Sonnet 4](https://www.cnbc.com/2025/05/22/claude-4-opus-sonnet-anthropic.html), for instance, which is a midsized model the company launched in May. It performs similarly to Claude Sonnet 4 and OpenAI's most recent model, [GPT-5](https://www.cnbc.com/2025/08/07/openai-launches-gpt-5-model-for-all-chatgpt-users.html), at c
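
As the earlier search output shows, `raw_content` is `None` when it was not requested, so code that consumes results may want a fallback. The sketch below prefers `raw_content`, falls back to the `content` snippet, and truncates the text to a fixed length. The `best_text` helper and the 500-character limit are assumptions made for illustration, and the two sample dicts are made up.

```python
def best_text(result, max_chars=500):
    """Return raw_content when present, else the content snippet, truncated.

    Hypothetical helper: max_chars is an arbitrary limit for illustration.
    """
    text = result.get("raw_content") or result.get("content") or ""
    return text[:max_chars]


# Made-up results shaped like the search responses in this tutorial.
with_raw = {"content": "short snippet", "raw_content": "full page text " * 100}
without_raw = {"content": "short snippet", "raw_content": None}

print(len(best_text(with_raw)))
print(best_text(without_raw))
```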

### **Crawl**

Now let's use Tavily to crawl a website and collect its pages and links. Web crawling is the process of automatically navigating through websites by following hyperlinks to discover web pages and URLs (think of it like falling down a Wikipedia rabbit hole: clicking from page to page, diving deeper into interconnected topics). For autonomous web agents, this capability is essential for accessing deep web data that might be difficult to retrieve via search.

Let's begin by crawling the Tavily website to gather all nested pages.

In [11]:
crawl_results = tavily_client.crawl(url="tavily.com")

In [12]:
for result in crawl_results["results"]:
    print(result["url"])

https://www.tavily.com/
https://www.tavily.com/enterprise
https://www.tavily.com/contact
https://www.tavily.com/privacy
https://www.tavily.com/use-cases
https://www.tavily.com/careers
https://www.tavily.com/terms
https://www.tavily.com/#benchmarks
https://www.tavily.com/#features
https://www.tavily.com/#pricing
https://help.tavily.com/
https://blog.tavily.com/
https://community.tavily.com/
https://docs.tavily.com/
https://status.tavily.com/
https://app.tavily.com/playground
https://x.com/tavilyai
https://github.com/tavily-ai
https://app.tavily.com/home
https://blog.tavily.com/jetbrains-tavily
https://mailto:support@tavily.com/
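
The crawl output above contains a few entries that are probably not useful as distinct pages: fragment links such as `#benchmarks` that point back to the home page, and a mangled `mailto` address. A minimal cleanup sketch using only the standard library follows. The `clean_urls` helper is hypothetical, and the sample list is a subset of the output above.

```python
from urllib.parse import urldefrag, urlsplit


def clean_urls(urls):
    """Strip fragments, skip URLs carrying user info, de-duplicate in order."""
    seen = set()
    cleaned = []
    for url in urls:
        base, _fragment = urldefrag(url)
        # "https://mailto:support@tavily.com/" parses with "mailto" as a
        # username, which is a sign the link was originally an email address.
        if urlsplit(base).username is not None:
            continue
        if base not in seen:
            seen.add(base)
            cleaned.append(base)
    return cleaned


crawled = [
    "https://www.tavily.com/",
    "https://www.tavily.com/enterprise",
    "https://www.tavily.com/#benchmarks",
    "https://www.tavily.com/#features",
    "https://docs.tavily.com/",
    "https://mailto:support@tavily.com/",
]

for url in clean_urls(crawled):
    print(url)
```

With a live response, the input would be `[result["url"] for result in crawl_results["results"]]`.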


If you only need the links (without the full page content), use the map endpoint. It's a faster and more cost-effective way to retrieve all the links from a site.

In [13]:
map_results = tavily_client.map(url="tavily.com")

In [14]:
map_results

{'base_url': 'tavily.com',
 'results': ['https://www.tavily.com/',
  'https://www.tavily.com/enterprise',
  'https://www.tavily.com/careers',
  'https://www.tavily.com/contact',
  'https://www.tavily.com/use-cases',
  'https://www.tavily.com/privacy',
  'https://www.tavily.com/terms',
  'https://www.tavily.com/#features',
  'https://www.tavily.com/#benchmarks',
  'https://www.tavily.com/#pricing',
  'https://mailto:support@tavily.com/',
  'https://blog.tavily.com/',
  'https://community.tavily.com/',
  'https://docs.tavily.com/',
  'https://help.tavily.com/',
  'https://status.tavily.com/',
  'https://app.tavily.com/home',
  'https://x.com/tavilyai',
  'https://app.tavily.com/playground',
  'https://blog.tavily.com/jetbrains-tavily',
  'https://github.com/tavily-ai'],
 'response_time': 0.12,
 'request_id': 'a6affe52-c13e-4ccf-b47b-a41551f55841'}
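
The map response mixes pages from the main site, its subdomains, and external sites such as GitHub. Grouping the URLs by host name makes that structure easier to see. This is a small sketch: the `group_by_host` helper is hypothetical, and the sample list is a subset of the `results` shown above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URLs by their host name, preserving the original order."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname].append(url)
    return dict(groups)


mapped = [
    "https://www.tavily.com/",
    "https://www.tavily.com/careers",
    "https://blog.tavily.com/",
    "https://blog.tavily.com/jetbrains-tavily",
    "https://github.com/tavily-ai",
]

for host, host_urls in group_by_host(mapped).items():
    print(host, len(host_urls))
```

With a live response, the input would be `map_results["results"]`.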

The `instructions` parameter of the crawl and map endpoints is a powerful feature that lets you guide the crawl using natural language.

In [15]:
guided_map_results = tavily_client.map(
    url="tavily.com",
    instructions="find only the developer docs"
)

In [16]:
guided_map_results

{'base_url': 'tavily.com',
 'results': ['https://docs.tavily.com/',
  'https://docs.tavily.com/api-reference',
  'https://docs.tavily.com/api-reference/endpoint/crawl',
  'https://docs.tavily.com/api-reference/endpoint/extract',
  'https://docs.tavily.com/api-reference/endpoint/search',
  'https://docs.tavily.com/documentation/about',
  'https://docs.tavily.com/documentation/api-reference/endpoint/crawl',
  'https://docs.tavily.com/documentation/api-reference/endpoint/extract',
  'https://docs.tavily.com/documentation/api-reference/endpoint/map',
  'https://docs.tavily.com/documentation/api-reference/endpoint/search',
  'https://docs.tavily.com/documentation/api-reference/endpoint/usage',
  'https://docs.tavily.com/documentation/api-reference/introduction',
  'https://docs.tavily.com/documentation/best-practices/best-practices-crawl',
  'https://docs.tavily.com/documentation/best-practices/best-practices-extract',
  'https://docs.tavily.com/documentation/best-practices/best-practices-s
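
A guided map pairs naturally with the extract endpoint: map first to find the relevant URLs, then extract their content in batches of at most 20. The sketch below shows one way to chain the two, using only the `map(url=..., instructions=...)` and `extract(urls=...)` calls already seen in this tutorial. The `map_then_extract` function is a hypothetical helper, not part of the Tavily client, and it omits error handling.

```python
def map_then_extract(client, url, instructions, batch_size=20):
    """Map a site with natural-language instructions, then extract each URL.

    Hypothetical helper. `client` is expected to behave like the
    `tavily_client` created earlier in this tutorial.
    """
    mapped = client.map(url=url, instructions=instructions)
    urls = mapped["results"]

    pages = []
    # Send the URLs to the extract endpoint in batches of `batch_size`.
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        extracted = client.extract(urls=batch)
        pages.extend(extracted["results"])
    return pages
```

A call might look like `pages = map_then_extract(tavily_client, "tavily.com", "find only the developer docs")`, after which each item in `pages` has a `url` and its `raw_content`.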

### **Conclusion & Next Steps**

In this tutorial, you learned how to:
- Perform real-time web searches using the Tavily API
- Extract content from web pages
- Crawl and map websites to gather links and information
- Guide crawls with natural language instructions for targeted data extraction

These foundational skills enable your agents to access and utilize up-to-date web information, making them more powerful and context-aware. 