# Data Collection and PDF Export with Tavily Crawl

In this tutorial, you'll learn how to use Tavily's crawl API to collect data from websites and export the results as organized PDF files. This is perfect for creating offline archives, research documentation, or building a local knowledge base from web content.

By the end of this lesson, you'll know how to:
- Crawl websites to collect structured data
- Process and clean the crawled content
- Export individual pages as well-formatted PDF files
- Organize your exported data in a structured file system

## Getting Started

Follow these steps to set up:

1. **Sign up** for Tavily at [app.tavily.com](https://app.tavily.com/home/) to get your API key.
   

2. **Copy your API key** from your Tavily account dashboard.

3. **Paste your API key** into the cell below and execute the cell.

In [None]:
# To save your API key to a .env file, run the following cell (replace the placeholder with your actual key):
!echo "TAVILY_API_KEY=<your-tavily-api-key>" >> .env

Install dependencies in the cell below.

In [None]:
%pip install -U tavily-python python-dotenv reportlab beautifulsoup4 --quiet

### Setting Up Your Tavily API Client

The code below will instantiate the Tavily client with your API key.

In [1]:
import os
import getpass
from dotenv import load_dotenv
from tavily import TavilyClient

# Load environment variables from .env file
load_dotenv()

# Prompt the user to securely input the API key if not already set in the environment
if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY:\n")

# Initialize the Tavily API client using the loaded or provided API key
tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

### Define the Target Website and Crawl Parameters

Now let's configure our crawl to collect data from a website. We'll use Tavily's crawl API to systematically traverse the website and collect content that we can later export as PDFs.

For this example, we're targeting `www.tavily.com` to collect blog posts, but you can modify the URL and instructions for any website you want to archive.

In [None]:
# Define the target website and crawl parameters
base_url = "www.tavily.com"

# Crawl the website with specific parameters for data collection
crawl_result = tavily_client.crawl(
    url=base_url,
    limit=30,
    max_depth=2,
    max_breadth=30,
    instructions="Blog posts and articles",
    format="text",  # Get clean text format for better PDF conversion
)

In [None]:
# Preview the crawl results
print(f"Successfully crawled {len(crawl_result['results'])} pages")
print("\nDiscovered URLs:")
for i, page in enumerate(crawl_result["results"][:5]):  # Show first 5 URLs
    print(f"{i+1}. {page['url']}")

if len(crawl_result["results"]) > 5:
    print(f"... and {len(crawl_result['results']) - 5} more pages")

Successfully crawled 15 pages

Discovered URLs:
1. https://blog.tavily.com/
2. https://blog.tavily.com/#/portal/
3. https://blog.tavily.com/#main
4. https://blog.tavily.com/ai-enablers-the-building-blocks-of-next-gen-enterprise-solutions
5. https://blog.tavily.com/building-and-breaking-weblangchain
... and 10 more pages
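
The sample output above lists `https://blog.tavily.com/`, `https://blog.tavily.com/#/portal/`, and `https://blog.tavily.com/#main`, which differ only in their fragment. If those turn out to be the same page, you may want to drop the repeats before exporting. The sketch below is one possible approach, not part of the Tavily API: `drop_fragment_duplicates` is a hypothetical helper, and treating fragment-only variants as duplicates is an assumption that may not hold for sites that use hash-based routing.

```python
from urllib.parse import urldefrag


def drop_fragment_duplicates(results):
    """Keep the first page for each URL once its #fragment is removed."""
    seen = set()
    unique = []
    for page in results:
        # urldefrag splits "https://site/#main" into ("https://site/", "main")
        base_url, _fragment = urldefrag(page["url"])
        if base_url not in seen:
            seen.add(base_url)
            unique.append(page)
    return unique


# Sample data shaped like crawl_result["results"], using URLs from the output above
sample_results = [
    {"url": "https://blog.tavily.com/"},
    {"url": "https://blog.tavily.com/#/portal/"},
    {"url": "https://blog.tavily.com/#main"},
    {"url": "https://blog.tavily.com/building-and-breaking-weblangchain"},
]

unique_pages = drop_fragment_duplicates(sample_results)
for page in unique_pages:
    print(page["url"])
```

In the notebook you could apply it as `crawl_result["results"] = drop_fragment_duplicates(crawl_result["results"])` before running the export.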


### Preview the Raw Content

Let's examine a sample of the raw content from one of the crawled pages to understand the data we're working with before converting it to PDF:

In [None]:
# Preview content from the first crawled page
if crawl_result["results"]:
    sample_page = crawl_result["results"][0]
    # Fall back to an empty string if raw_content is missing or None
    raw_content = sample_page.get("raw_content") or ""

    print(f"URL: {sample_page['url']}")
    print(f"Content length: {len(raw_content)} characters")
    print("\nFirst 500 characters of content:")
    print("-" * 50)
    print(raw_content[:500] + "..." if len(raw_content) > 500 else raw_content)
    print("-" * 50)

### Set Up PDF Export Functions

Now we'll create functions to process the crawled content and export it as well-formatted PDF files. We'll use ReportLab to create professional-looking PDFs with proper formatting, headers, and metadata.

In [None]:
import os
from datetime import datetime
from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.lib import colors
from urllib.parse import urlparse
import re


def clean_text_for_pdf(text):
    """Clean and prepare text content for PDF conversion"""
    if not text:
        return ""

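    # Collapse runs of blank lines into a single paragraph break
    text = re.sub(r"\n\s*\n", "\n\n", text)
    # Collapse spaces and tabs only; keeping newlines preserves the paragraph
    # breaks that create_pdf_from_content() later splits on
    text = re.sub(r"[^\S\n]+", " ", text)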

    # Escape special characters for ReportLab
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")

    return text.strip()


def generate_filename_from_url(url, max_length=50):
    """Generate a safe filename from a URL"""
    # Parse the URL to get the path
    parsed = urlparse(url)
    path = parsed.path.strip("/")

    if not path:
        path = parsed.netloc

    # Clean the path for filename use
    filename = re.sub(r"[^\w\-_.]", "_", path)
    filename = re.sub(r"_+", "_", filename)
    filename = filename.strip("_")

    # Truncate if too long
    if len(filename) > max_length:
        filename = filename[:max_length]

    # Ensure it's not empty
    if not filename:
        filename = "page"

    return filename


def create_pdf_from_content(content_data, output_dir="exported_pdfs"):
    """
    Create a PDF file from crawled content data

    Args:
        content_data: Dict containing url, raw_content, and other metadata
        output_dir: Directory to save the PDF files

    Returns:
        String: Path to the created PDF file
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Generate filename
    url = content_data.get("url", "unknown")
    base_filename = generate_filename_from_url(url)
    pdf_filename = f"{base_filename}.pdf"
    pdf_path = os.path.join(output_dir, pdf_filename)

    # Handle duplicate filenames
    counter = 1
    while os.path.exists(pdf_path):
        pdf_filename = f"{base_filename}_{counter}.pdf"
        pdf_path = os.path.join(output_dir, pdf_filename)
        counter += 1

    # Create PDF document
    doc = SimpleDocTemplate(
        pdf_path,
        pagesize=A4,
        topMargin=1 * inch,
        bottomMargin=1 * inch,
        leftMargin=0.75 * inch,
        rightMargin=0.75 * inch,
    )

    # Get styles
    styles = getSampleStyleSheet()
    title_style = ParagraphStyle(
        "CustomTitle",
        parent=styles["Heading1"],
        fontSize=16,
        spaceAfter=30,
        textColor=colors.darkblue,
    )

    url_style = ParagraphStyle(
        "URLStyle",
        parent=styles["Normal"],
        fontSize=10,
        textColor=colors.blue,
        spaceAfter=20,
    )

    content_style = ParagraphStyle(
        "ContentStyle", parent=styles["Normal"], fontSize=11, leading=14, spaceAfter=12
    )

    # Build PDF content
    story = []

    # Add title
    title = f"Crawled Content from {urlparse(url).netloc}"
    story.append(Paragraph(title, title_style))

    # Add URL (escaped so characters such as & don't break ReportLab's markup parser)
    story.append(Paragraph(f"Source: {clean_text_for_pdf(url)}", url_style))

    # Add crawl date
    crawl_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    story.append(Paragraph(f"Crawled on: {crawl_date}", url_style))

    # Add content (fall back to a placeholder if raw_content is missing or None)
    raw_content = content_data.get("raw_content") or "No content available"
    cleaned_content = clean_text_for_pdf(raw_content)

    # Split content into paragraphs
    paragraphs = cleaned_content.split("\n\n")
    for para in paragraphs:
        if para.strip():
            story.append(Paragraph(para.strip(), content_style))
            story.append(Spacer(1, 6))

    # Build PDF
    doc.build(story)

    return pdf_path
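
Because the PDF builder splits the cleaned text on blank lines, it helps to check the cleaning step on its own before generating any files. The standalone sketch below shows one way to normalize whitespace while keeping paragraph boundaries. `normalize_paragraphs` is a hypothetical helper, not part of the notebook's functions, and it uses the standard library's `xml.sax.saxutils.escape` in place of the manual replacements.

```python
import re
from xml.sax.saxutils import escape


def normalize_paragraphs(text):
    """Return a list of markup-safe paragraphs, one per blank-line-separated block."""
    if not text:
        return []
    paragraphs = []
    # Split on blank lines first so paragraph boundaries are never lost
    for block in re.split(r"\n\s*\n", text):
        # Inside a paragraph, collapse all whitespace (including single newlines)
        collapsed = re.sub(r"\s+", " ", block).strip()
        if collapsed:
            # escape() converts &, < and > to XML entities
            paragraphs.append(escape(collapsed))
    return paragraphs


sample = "First  paragraph\nstill first.\n\n\nSecond <b>paragraph</b> & more."
paragraphs = normalize_paragraphs(sample)
for number, paragraph in enumerate(paragraphs, 1):
    print(f"{number}: {paragraph}")
```

If the sample yields a single paragraph instead of two, the whitespace handling is collapsing the blank lines and every page would render as one block of text.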

### Bulk Export All Pages to PDF

Now let's export all crawled pages as organized PDF files. We'll create a structured directory and generate summary information:


In [None]:
def export_all_pages_to_pdf(crawl_results, base_output_dir="tavily_export"):
    """
    Export all crawled pages to organized PDF files

    Args:
        crawl_results: The results from tavily_client.crawl()
        base_output_dir: Base directory for organizing exports

    Returns:
        dict: Summary of the export process
    """
    # Create timestamped directory
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    export_dir = f"{base_output_dir}_{timestamp}"
    pdf_dir = os.path.join(export_dir, "pdfs")

    # Create directories
    os.makedirs(pdf_dir, exist_ok=True)

    # Export summary
    summary = {
        "export_directory": export_dir,
        "total_pages": len(crawl_results["results"]),
        "successful_exports": 0,
        "failed_exports": 0,
        "exported_files": [],
        "errors": [],
    }

    print(f"📁 Creating export directory: {export_dir}")
    print(f"🔄 Exporting {summary['total_pages']} pages to PDF...")

    # Process each page
    for i, page_data in enumerate(crawl_results["results"], 1):
        try:
            url = page_data.get("url", f"page_{i}")
            print(f"  [{i}/{summary['total_pages']}] Processing: {url}")

            # Create PDF
            pdf_path = create_pdf_from_content(page_data, pdf_dir)

            summary["successful_exports"] += 1
            summary["exported_files"].append(
                {
                    "url": url,
                    "pdf_path": pdf_path,
                    "file_size": os.path.getsize(pdf_path),
                }
            )

        except Exception as e:
            print(f"    ❌ Error processing {url}: {str(e)}")
            summary["failed_exports"] += 1
            summary["errors"].append({"url": url, "error": str(e)})

    return summary


# Run the bulk export
export_summary = export_all_pages_to_pdf(crawl_result)
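
The export function returns a summary dictionary but does not display or save it. A minimal sketch for reporting the result and writing it next to the PDFs is shown below. `format_summary`, `write_manifest`, and the `manifest.json` filename are hypothetical names introduced for illustration, and the sample dictionary only imitates the keys that `export_all_pages_to_pdf` returns.

```python
import json
import os
import tempfile


def format_summary(summary):
    """Build a one-line, human-readable report from an export summary."""
    total_bytes = sum(item["file_size"] for item in summary["exported_files"])
    return (
        f"{summary['successful_exports']}/{summary['total_pages']} pages exported, "
        f"{summary['failed_exports']} failed, {total_bytes / 1024:.1f} KiB written"
    )


def write_manifest(summary, manifest_name="manifest.json"):
    """Save the summary as JSON inside the export directory and return its path."""
    export_dir = summary["export_directory"]
    os.makedirs(export_dir, exist_ok=True)
    manifest_path = os.path.join(export_dir, manifest_name)
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2)
    return manifest_path


# Hypothetical summary with the same keys as the export function's return value
sample_summary = {
    "export_directory": tempfile.mkdtemp(prefix="tavily_export_"),
    "total_pages": 2,
    "successful_exports": 1,
    "failed_exports": 1,
    "exported_files": [
        {"url": "https://example.com/a", "pdf_path": "pdfs/a.pdf", "file_size": 2048}
    ],
    "errors": [{"url": "https://example.com/b", "error": "example failure"}],
}

report = format_summary(sample_summary)
manifest_path = write_manifest(sample_summary)
print(report)
print(f"Manifest written to {manifest_path}")
```

In the notebook, passing `export_summary` to both functions would give the same report for the real export.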

Congratulations! You've successfully built a complete data collection and PDF export system using Tavily's crawl API. 