# Build a RAG System with Tavily Crawl

In this tutorial, you'll learn how to turn any website into a searchable knowledge base. We'll use Tavily's crawl API to extract content from a website, convert that content into a searchable vector index with OpenAI embeddings and an in-memory Chroma vector store, and build a RAG question-answering system on top of it. The tutorial is self-contained: apart from the API keys and packages set up below, nothing else is required.

## Getting Started

Follow these steps to set up:

1. **Sign up** for Tavily at [app.tavily.com](https://app.tavily.com/home/) to get your API key.

2. **Sign up** for OpenAI to get your API key. Feel free to substitute any other LLM provider.

3. **Copy your API keys** from your Tavily and OpenAI account dashboards.

4. **Paste your API keys** into the cell below and execute the cell.

In [None]:
# To export your API keys into a .env file, run the following cell (replace with your actual keys):
!echo "TAVILY_API_KEY=<your-tavily-api-key>" >> .env
!echo "OPENAI_API_KEY=<your-openai-api-key>" >> .env

Install dependencies in the cell below.

In [None]:
%pip install --upgrade tavily-python langchain-openai langchain-chroma langchain-community langchain-text-splitters python-dotenv --quiet

### Setting Up Your Tavily API Client

The code below will instantiate the Tavily client with your API key.

In [None]:
import os
import getpass
from dotenv import load_dotenv
from tavily import TavilyClient

# Load environment variables from .env file
load_dotenv()

# Prompt the user to securely input the API key if not already set in the environment
if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY:\n")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

# Initialize the Tavily API client using the loaded or provided API key
tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

## Step 1: Define the Target Website

Now let's utilize Tavily to crawl a webpage and retrieve all of its links. Web crawling involves automatically traversing websites by following hyperlinks to uncover various web pages and URLs. Tavily's crawl feature is AI-native, offering rapid responses via parallelized, graph-based processing.

For this example, we're using `www.tavily.com`.


In [4]:
# The website to crawl
base_url = "www.tavily.com"

When crawling web pages, we can specify the output format as either "text" (clean text) or "markdown". For this tutorial, we'll use "text" format since it's better suited for creating embeddings later.

In [5]:
# Crawl the base url with clean text format as output
crawl_result = tavily_client.crawl(
    url=base_url,
    max_depth=3,
    limit=100,
    format="text",
)

Now let's examine the URLs the crawl discovered.

In [None]:
for page in crawl_result["results"]:
    print(page["url"])
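
A long flat list of URLs can be hard to scan. As a minimal sketch, the snippet below groups URLs by their first path segment to show which sections of a site the crawl reached. The helper `first_path_segment` and the `sample_results` list are hypothetical; the sample only mimics the `{"url": ...}` shape used above, so substitute `crawl_result["results"]` to run it on real data.

```python
from collections import Counter
from urllib.parse import urlparse


def first_path_segment(url: str) -> str:
    """Return the first path segment of a URL, or "/" for the site root."""
    # urlparse only separates host and path reliably when a scheme is present,
    # so add one for bare hosts such as "www.tavily.com".
    if "://" not in url:
        url = "https://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else "/"


# Hypothetical sample shaped like crawl_result["results"]
sample_results = [
    {"url": "https://example.com/"},
    {"url": "https://example.com/blog/post-a"},
    {"url": "https://example.com/blog/post-b"},
    {"url": "https://example.com/pricing"},
]

# Count how many crawled pages fall under each top-level section
section_counts = Counter(first_path_segment(p["url"]) for p in sample_results)
for section, count in section_counts.most_common():
    print(f"{section}: {count}")
```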

Let's run a second crawl with natural language `instructions` to specifically target developer documentation pages. This demonstrates how we can focus the crawler on specific types of content.

In [None]:
# Crawl the base url again, using natural language instructions to focus on developer docs
focused_crawl_result = tavily_client.crawl(
    url=base_url,
    max_depth=3,
    limit=100,
    format="text",
    instructions="find all developer documentation pages",
)

The results should now be limited to pages the crawler judged to be developer documentation on the Tavily website.

In [None]:
for page in focused_crawl_result["results"]:
    print(page["url"])
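
To see what the `instructions` parameter changed, it can help to compare the two URL sets directly. The following is a rough sketch using plain set operations. The helper `crawled_urls` and the two sample responses are hypothetical stand-ins for `crawl_result` and `focused_crawl_result`.

```python
def crawled_urls(result: dict) -> set[str]:
    """Collect the URLs from a response shaped like {"results": [{"url": ...}]}."""
    return {page["url"] for page in result.get("results", [])}


# Hypothetical responses; replace with crawl_result and focused_crawl_result
broad = {
    "results": [
        {"url": "https://example.com/"},
        {"url": "https://example.com/blog"},
        {"url": "https://example.com/docs/a"},
    ]
}
focused = {
    "results": [
        {"url": "https://example.com/docs/a"},
        {"url": "https://example.com/docs/b"},
    ]
}

shared = crawled_urls(broad) & crawled_urls(focused)
only_in_broad = crawled_urls(broad) - crawled_urls(focused)
only_in_focused = crawled_urls(focused) - crawled_urls(broad)

print(f"Shared: {len(shared)}")
print(f"Only in broad crawl: {len(only_in_broad)}")
print(f"Only in focused crawl: {len(only_in_focused)}")
```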

## Step 2: Preview the Raw Content

Let's examine a sample of the raw content from one of the crawled pages to understand the webpage data we're working with:


In [None]:
data = crawl_result["results"]

# View one sample page from the results list
if data:
    sample_page = data[0]  # Get the first page
    raw_content = sample_page.get("raw_content") or ""  # Content may be missing or None
    print(f"URL: {sample_page['url']}\n")
    # Show only the first 500 characters as a preview
    print(f"Raw Content: \n{raw_content[:500]}...")
    print("-" * 50)
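
Beyond a single sample, a quick summary across all pages shows how much text there is to index and how many pages came back empty. This is a minimal sketch: `summarize_pages` is a hypothetical helper, and the sample list only imitates the `raw_content` field seen above.

```python
def summarize_pages(results: list[dict]) -> dict:
    """Summarize content sizes for pages shaped like {"url": ..., "raw_content": ...}."""
    # Treat a missing or None raw_content as an empty page
    lengths = [len(page.get("raw_content") or "") for page in results]
    non_empty = [n for n in lengths if n > 0]
    return {
        "pages": len(lengths),
        "empty_pages": len(lengths) - len(non_empty),
        "total_chars": sum(non_empty),
        "longest_page_chars": max(non_empty, default=0),
    }


# Hypothetical sample; replace with crawl_result["results"]
sample_pages = [
    {"url": "https://example.com/a", "raw_content": "Hello world"},
    {"url": "https://example.com/b", "raw_content": None},
    {"url": "https://example.com/c", "raw_content": ""},
    {"url": "https://example.com/d", "raw_content": "abc" * 10},
]

summary = summarize_pages(sample_pages)
print(summary)
```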

## Step 3: Process Content into Documents

We'll convert the crawled content into LangChain Document objects, which will allow us to:

1. Maintain important metadata (the source URL)
2. Prepare the text for chunking
3. Make the content ready for vectorization


In [10]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


def process_crawl_result(crawl_result):
    # Convert the crawled content into LangChain Document objects
    documents = []
    # Loop through the crawl results and create a Document object for each page
    for result in crawl_result["results"]:
        raw_content = result.get("raw_content")
        if not raw_content:  # Skip pages with missing or empty content
            continue

        doc = Document(
            page_content=raw_content,
            metadata={
                "url": result.get("url", ""),
            },
        )
        documents.append(doc)
    return documents
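
A crawl may return the same content under more than one URL, for example with and without a trailing slash. As an optional sketch that is not part of the original pipeline, the hypothetical helper `dedupe_pages` below drops repeated pages by hashing their whitespace-normalized text before they are turned into documents.

```python
import hashlib


def dedupe_pages(results: list[dict]) -> list[dict]:
    """Keep the first page for each distinct raw_content, skipping empty pages."""
    seen = set()
    unique = []
    for page in results:
        text = (page.get("raw_content") or "").strip()
        if not text:
            continue
        # Collapse whitespace so copies that differ only in spacing hash the same
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(page)
    return unique


# Hypothetical sample; replace with crawl_result["results"]
sample_pages = [
    {"url": "https://example.com/docs", "raw_content": "Getting started guide"},
    {"url": "https://example.com/docs/", "raw_content": "Getting  started guide\n"},
    {"url": "https://example.com/pricing", "raw_content": "Plans and pricing"},
    {"url": "https://example.com/empty", "raw_content": None},
]

unique_pages = dedupe_pages(sample_pages)
print([page["url"] for page in unique_pages])
```

The deduplicated list could then be wrapped as `{"results": unique_pages}` and passed to `process_crawl_result`.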

Let's run this on the generic crawl results and the developer-specific crawl results.

In [11]:
documents = process_crawl_result(crawl_result)
documents_focussed = process_crawl_result(focused_crawl_result)


## Step 4: Split Documents into Chunks

We'll split the documents into smaller, more manageable chunks using the `RecursiveCharacterTextSplitter` and preview the result.

In [None]:
# Split each page into smaller, overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Maximum number of characters per chunk
    chunk_overlap=100,  # Characters shared between consecutive chunks
    length_function=len,
    is_separator_regex=False,
)
# Split the documents while preserving metadata
all_chunks = text_splitter.split_documents(documents)
focussed_chunks = text_splitter.split_documents(documents_focussed)

# Each page is a separate document that carries its source URL as metadata
print(f"Created {len(documents)} page-level documents")
print(f"Split into {len(all_chunks)} total chunks")

# Example of accessing the documents
for i, doc in enumerate(documents[:2]):  # Print the first 2 as an example
    print(f"\nDocument {i+1}:")
    print(f"Source: {doc.metadata.get('url')}")
    print(f"Content length: {len(doc.page_content)} characters")
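
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter. It is only an illustration of the two parameters and is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and lines. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where neighbors share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # How far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
for chunk in demo_chunks:
    print(repr(chunk))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a sentence cut at a boundary still appears partly in the neighboring chunk.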

## Step 5: Create Vector Embeddings

Now we'll create vector embeddings for our document chunks using OpenAI's embedding model and store them in a Chroma vector database. This allows us to perform semantic search on our document collection.

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Create embeddings for the documents
embeddings = OpenAIEmbeddings(api_key=os.getenv("OPENAI_API_KEY"))

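# Create one vector store per crawl from the chunks produced in Step 4.
# Distinct collection names keep the two in-memory indexes separate.
vector_store = Chroma.from_documents(all_chunks, embeddings, collection_name="tavily_all")
vector_store_focussed = Chroma.from_documents(
    focussed_chunks, embeddings, collection_name="tavily_focussed"
)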
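
Semantic search ranks stored vectors by how close they are to the query vector. The toy sketch below uses cosine similarity, one common choice, with hand-made three-dimensional vectors standing in for real embeddings. The metric Chroma actually uses depends on how the collection is configured, and all names and numbers here are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made toy vectors, not real OpenAI embeddings
toy_index = [
    ("pricing and plans", [0.9, 0.1, 0.0]),
    ("crawl endpoint parameters", [0.1, 0.9, 0.2]),
    ("company blog", [0.0, 0.2, 0.9]),
]
toy_query = [0.2, 0.8, 0.1]

matches = top_k(toy_query, toy_index, k=2)
for score, text in matches:
    print(f"{score:.3f}  {text}")
```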

## Step 6: Build the Question-Answering System

Finally, we'll create a retrieval-based question-answering system using gpt-4.1-mini. We use the "stuff" chain type, which combines all relevant retrieved documents into a single context for the model.

In [14]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Initialize the language model
llm = ChatOpenAI(model="gpt-4.1-mini", streaming=True)

# Create a QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
)
qa_chain_focussed = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store_focussed.as_retriever(),
)
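
Conceptually, the "stuff" strategy concatenates the retrieved chunks into one context block and sends it to the model together with the question. The sketch below illustrates that idea only. The function `build_stuff_prompt` and its prompt wording are assumptions for illustration, not the template LangChain uses.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine all retrieved chunks into a single prompt for the model."""
    # "Stuffing" means every chunk goes into one context block, in retrieval order
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks
demo_prompt = build_stuff_prompt(
    "What output formats does the crawl support?",
    ["The crawl can return clean text.", "The crawl can also return markdown."],
)
print(demo_prompt)
```

Because every chunk lands in a single prompt, this strategy works best when the retriever returns a small number of chunks that fit in the model's context window.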

## Step 7: Test the System

Let's test our RAG system by asking one question against each index.

First, let's ask a general question about Tavily.

In [15]:
# Example question
query = "What is Tavily's production rate limit?"
answer = qa_chain.invoke(query)

In [None]:
from IPython.display import Markdown, display

# Render the answer as Markdown
display(Markdown(answer["result"]))

For the developer-specific index, let's ask a more detailed question.

In [17]:
# Example question
query = "tell me about the chunks_per_source parameter"
answer = qa_chain_focussed.invoke(query)

In [None]:
display(Markdown(answer["result"]))

## Conclusion

We've successfully built a complete RAG system that can:

1. Crawl web content from a specific domain
2. Process and structure the content
3. Create vector embeddings for semantic search
4. Answer questions based on the crawled information

This approach can be extended to create knowledge bases from any website, documentation, or content repository, making it a powerful tool for building domain-specific assistants and search systems. 

For a more advanced implementation of this concept:
- Try out our [hosted demo application](https://crawl-to-rag.tavily.com/)
- View the complete [source code](https://github.com/tavily-ai/crawl2rag)


For more information, read the crawl [API reference](https://docs.tavily.com/documentation/api-reference/endpoint/crawl) and [best practices guide](https://docs.tavily.com/documentation/best-practices/best-practices-crawl).