<a href="https://colab.research.google.com/github/vectara/example-notebooks/blob/main/notebooks/api-examples/2-data-ingestion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectara Data Ingestion

In this notebook we demonstrate how to ingest data into the two corpora created in Notebook 1:
1. **File Upload API**: Upload AI research papers as PDFs from ArXiv
2. **Core Indexing API**: Index Vectara documentation as pre-chunked text

Together, these two sources give us a dataset that combines academic papers with product documentation.

## About Vectara

[Vectara](https://vectara.com/) is the Agent Operating System for trusted enterprise AI: a unified Agentic RAG platform with built-in multi-modal retrieval, orchestration, and always-on governance. Deploy it on-prem (air-gapped), in your VPC, or as SaaS.

Vectara provides a complete API-first platform for building production RAG and agentic applications:

- **Simple Integration**: RESTful APIs and SDKs for Python, TypeScript, and Java make integration straightforward
- **Flexible Deployment**: Choose SaaS, VPC, or on-premises deployment based on your security and compliance requirements
- **Multi-Modal Support**: Index and search across text, tables, and images from various document formats
- **Advanced Retrieval**: Hybrid search combining semantic and keyword matching with multiple reranking options
- **Grounded Generation**: LLM responses with citations and factual consistency scores to reduce hallucinations
- **Enterprise-Ready**: Built-in access controls, audit logging, and compliance certifications (SOC2, HIPAA)

## Setup

This notebook assumes you've completed Notebook 1 (corpus creation), that the corpus keys below match the ones you created, and that your API key is stored in the `VECTARA_API_KEY` environment variable.
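
As a minimal sketch (not part of the original notebook), a small helper can fail early with a readable message when a required environment variable is missing, instead of surfacing a bare `KeyError`. The name `require_env` and its `env` parameter are hypothetical and exist only for illustration.

```python
import os

def require_env(name, env=None):
    """Return a non-empty environment variable value or raise a clear error."""
    # Allow a plain dict to be passed in so the helper is easy to test
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before running the cells below."
        )
    return value
```

With this in place, the setup cell could read the key with `api_key = require_env("VECTARA_API_KEY")`.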

In [1]:
import os
import requests
import json
from time import sleep

# Read the API key from the environment; the corpus keys are the ones created in Notebook 1
api_key = os.environ['VECTARA_API_KEY']
research_corpus_key = 'tutorial-ai-research-papers'
docs_corpus_key = 'tutorial-vectara-docs'

# Base API URL
BASE_URL = "https://api.vectara.io/v2"

## Part 1: Upload AI Research Papers (PDFs)

We'll upload several key papers about RAG, embeddings, and retrieval from ArXiv. Vectara will automatically:
- Extract text from PDFs
- Chunk the content intelligently
- Create Boomerang embeddings

In [2]:
# Key research papers about RAG, LLMs, and retrieval
research_papers = [
    {
        "url": "https://arxiv.org/pdf/2005.14165.pdf",
        "filename": "gpt3-language-models.pdf",
        "metadata": {
            "source": "arxiv",
            "year": 2020,
            "topic": "LLMs",
            "title": "Language Models are Few-Shot Learners",
            "authors": "Brown et al."
        }
    },
    {
        "url": "https://arxiv.org/pdf/2005.11401v4.pdf",
        "filename": "rag-retrieval-augmented-generation.pdf",
        "metadata": {
            "source": "arxiv",
            "year": 2020,
            "topic": "RAG",
            "title": "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks",
            "authors": "Lewis et al."
        }
    },
    {
        "url": "https://arxiv.org/pdf/1706.03762.pdf",
        "filename": "attention-is-all-you-need.pdf",
        "metadata": {
            "source": "arxiv",
            "year": 2017,
            "topic": "embeddings",
            "title": "Attention Is All You Need",
            "authors": "Vaswani et al."
        }
    },
    {
        "url": "https://arxiv.org/pdf/2104.08663.pdf",
        "filename": "beir-retrieval-benchmark.pdf",
        "metadata": {
            "source": "arxiv",
            "year": 2021,
            "topic": "retrieval",
            "title": "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
            "authors": "Thakur et al."
        }
    },
    {
        "url": "https://arxiv.org/pdf/2104.06967.pdf",
        "filename": "dense-passage-retrieval.pdf",
        "metadata": {
            "source": "arxiv",
            "year": 2021,
            "topic": "retrieval",
            "title": "Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling",
            "authors": "Hofstätter et al."
        }
    },
    {
        "url": "https://aclanthology.org/2025.naacl-short.38.pdf",
        "filename": "hallucination-detection-naacl.pdf",
        "metadata": {
            "source": "acl",
            "year": 2025,
            "topic": "RAG",
            "title": "Hallucination Detection in RAG Systems",
            "authors": "NAACL 2025"
        }
    },
    {
        "url": "https://arxiv.org/pdf/2210.03629",
        "filename": "retrieval-evaluation-metrics.pdf",
        "metadata": {
            "source": "arxiv",
            "year": 2022,
            "topic": "retrieval",
            "title": "Retrieval Evaluation Metrics and Methods",
            "authors": "ArXiv 2022"
        }
    }
]

In [3]:
# Upload each paper
upload_url = f"{BASE_URL}/corpora/{research_corpus_key}/upload_file"
upload_results = []

for paper in research_papers:
    try:
        print(f"Downloading {paper['filename']}...")
        # Download PDF
        response = requests.get(paper['url'], timeout=30)
        if response.status_code != 200:
            print(f"  ✗ Failed to download: {response.status_code}")
            upload_results.append({'filename': paper['filename'], 'success': False, 'error': 'Download failed'})
            continue
            
        pdf_content = response.content
        
        print(f"  Uploading to Vectara...")
        files = {
            'file': (paper['filename'], pdf_content, 'application/pdf'),
            'metadata': (None, json.dumps(paper['metadata']), 'application/json'),
            'table_extraction_config': (None, json.dumps({'extract_tables': True}), 'application/json')
        }

        headers = {
            'Accept': 'application/json',
            'x-api-key': api_key
        }
        
        upload_response = requests.post(upload_url, headers=headers, files=files, timeout=60)
        
        if upload_response.status_code in [200, 201]:
            print(f"  ✓ Successfully uploaded {paper['filename']}")
            upload_results.append({'filename': paper['filename'], 'success': True})
        else:
            print(f"  ✗ Upload failed: {upload_response.status_code} - {upload_response.text}")
            upload_results.append({'filename': paper['filename'], 'success': False, 'error': upload_response.text})
        
        # Small delay between uploads
        sleep(1)
        
    except Exception as e:
        print(f"  ✗ Error: {str(e)}")
        upload_results.append({'filename': paper['filename'], 'success': False, 'error': str(e)})

# Summary
successful = sum(1 for r in upload_results if r['success'])
print(f"\n=== Upload Summary ===")
print(f"Total: {len(upload_results)}, Successful: {successful}, Failed: {len(upload_results) - successful}")

Downloading gpt3-language-models.pdf...
  Uploading to Vectara...
  ✓ Successfully uploaded gpt3-language-models.pdf
Downloading rag-retrieval-augmented-generation.pdf...
  Uploading to Vectara...
  ✓ Successfully uploaded rag-retrieval-augmented-generation.pdf
Downloading attention-is-all-you-need.pdf...
  Uploading to Vectara...
  ✓ Successfully uploaded attention-is-all-you-need.pdf
Downloading beir-retrieval-benchmark.pdf...
  Uploading to Vectara...
  ✓ Successfully uploaded beir-retrieval-benchmark.pdf
Downloading dense-passage-retrieval.pdf...
  Uploading to Vectara...
  ✓ Successfully uploaded dense-passage-retrieval.pdf
Downloading hallucination-detection-naacl.pdf...
  Uploading to Vectara...
  ✓ Successfully uploaded hallucination-detection-naacl.pdf
Downloading retrieval-evaluation-metrics.pdf...
  Uploading to Vectara...
  ✓ Successfully uploaded retrieval-evaluation-metrics.pdf

=== Upload Summary ===
Total: 7, Successful: 7, Failed: 0
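
The download loop above treats any HTTP 200 response as a PDF. As a hedged sketch (not part of the original notebook), a check on the file signature can catch cases where a server returns an HTML page with a 200 status. The helper name `looks_like_pdf` is hypothetical; the only assumption about the format is that PDF files begin with the `%PDF-` header.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the payload begins with the PDF header signature."""
    return content.startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.7\n..."))           # True
print(looks_like_pdf(b"<html>Not found</html>"))  # False
```

In the upload loop, this could be called on `response.content` before building the multipart request, recording a failed result when it returns `False`.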


## Part 2: Crawl and Index Vectara Documentation

Now we'll use a small crawler built on `requests` and BeautifulSoup to crawl docs.vectara.com (up to a configurable page limit), extract content from each documentation page, and index the results using the Core Indexing API. This demonstrates how to build a RAG corpus from a website.
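
The crawler defined in the cells that follow visits pages breadth-first and stops at a page cap. The sketch below illustrates that traversal on an in-memory link graph, with no network access. The function `crawl_order` and the `site` dictionary are hypothetical stand-ins, and the sketch uses `collections.deque` for the queue, which is a design choice of this example rather than something taken from the notebook.

```python
from collections import deque

def crawl_order(start, get_links, max_pages):
    """Breadth-first traversal with a page cap; returns URLs in visit order."""
    visited = []
    seen = {start}          # everything ever queued, to avoid duplicates
    queue = deque([start])  # FIFO frontier
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph standing in for real pages
site = {
    "/docs/": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/b", "/docs/c"],
    "/docs/b": ["/docs/"],
    "/docs/c": [],
}

order = crawl_order("/docs/", lambda u: site.get(u, []), max_pages=3)
print(order)  # ['/docs/', '/docs/a', '/docs/b']
```

Tracking a `seen` set alongside the queue keeps the duplicate check constant-time, which matters once the frontier grows beyond a few dozen URLs.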

In [4]:
# Install required libraries for web crawling
import subprocess
import sys

try:
    from bs4 import BeautifulSoup
except ImportError:
    print("Installing beautifulsoup4...")
    # Use the current interpreter's pip so the package is installed into this kernel's environment
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'beautifulsoup4'])
    from bs4 import BeautifulSoup

# urllib.parse is part of the Python 3 standard library
from urllib.parse import urljoin, urlparse

In [5]:
# Global list to store scraped documents
scraped_docs = []

class VectaraDocsCrawler:
    """
    Simple web crawler for docs.vectara.com that works in Jupyter notebooks
    """
    def __init__(self, start_url, max_pages=100):
        self.start_url = start_url
        self.max_pages = max_pages
        self.visited_urls = set()
        self.to_visit = [start_url]
        self.scraped_docs = []
        
    def is_valid_url(self, url):
        """Check if URL should be crawled"""
        parsed = urlparse(url)
        
        # Must be from docs.vectara.com
        if parsed.netloc != 'docs.vectara.com':
            return False
        
        # Must be under /docs/
        if not parsed.path.startswith('/docs/'):
            return False
        
        # Skip search pages and anchors
        if '/search' in parsed.path or '#' in url:
            return False
        
        return True
    
    def extract_content(self, url, html):
        """Extract content from a documentation page"""
        try:
            soup = BeautifulSoup(html, 'html.parser')
            
            # Extract title
            title = None
            for selector in ['h1', 'title', '.page-title']:
                if selector.startswith('.'):
                    title_elem = soup.select_one(selector)
                else:
                    title_elem = soup.find(selector)
                if title_elem:
                    title = title_elem.get_text(strip=True)
                    break
            
            if not title:
                title = url.rstrip('/').split('/')[-1].replace('-', ' ').title()
            
            # Remove "| Vectara" or similar suffixes from title
            title = title.split('|')[0].strip()
            
            # Find main content
            content_elem = (
                soup.find('article') or 
                soup.find('main') or 
                soup.find('div', class_='content') or
                soup.find('div', {'role': 'main'})
            )
            
            if not content_elem:
                content_elem = soup.body
            
            # Remove navigation, headers, footers
            for unwanted in content_elem.find_all(['nav', 'header', 'footer', 'aside']):
                unwanted.decompose()
            
            # Extract chunks with section context
            chunks = []
            current_section = "introduction"
            
            for elem in content_elem.find_all(['h1', 'h2', 'h3', 'h4', 'p', 'li', 'pre']):
                if elem.name in ['h1', 'h2', 'h3', 'h4']:
                    # New section
                    current_section = elem.get_text(strip=True).lower().replace(' ', '_')
                elif elem.name == 'p':
                    text = elem.get_text(strip=True)
                    if text and len(text) > 20:
                        chunks.append({
                            'text': text,
                            'section': current_section
                        })
                elif elem.name == 'li':
                    text = elem.get_text(strip=True)
                    if text and len(text) > 10:
                        chunks.append({
                            'text': text,
                            'section': current_section
                        })
                elif elem.name == 'pre':
                    # Code block
                    text = elem.get_text(strip=True)
                    if text and len(text) > 10:
                        chunks.append({
                            'text': f"Code example: {text[:500]}",  # Limit code length
                            'section': current_section
                        })
            
            # Skip pages with no meaningful content
            if len(chunks) < 3:
                return None
            
            # Determine doc_type and topic from URL
            url_path = url.replace('https://docs.vectara.com/', '')
            doc_type = 'guide'
            topic = 'general'
            
            if '/api-reference/' in url_path or '/rest-api/' in url_path:
                doc_type = 'api_reference'
            
            if 'query' in url_path or 'search' in url_path:
                topic = 'query'
            elif 'index' in url_path or 'upload' in url_path:
                topic = 'indexing'
            elif 'embed' in url_path or 'boomerang' in url_path:
                topic = 'embeddings'
            elif 'agent' in url_path:
                topic = 'agents'
            elif 'rerank' in url_path:
                topic = 'reranking'
            elif 'corpus' in url_path or 'corpora' in url_path:
                topic = 'corpus_management'
            elif 'grounded' in url_path or 'generation' in url_path:
                topic = 'grounded_generation'
            
            # Store the scraped document
            doc_data = {
                'url': url,
                'title': title,
                'chunks': chunks[:50],  # Limit chunks per document
                'doc_type': doc_type,
                'topic': topic
            }
            
            return doc_data
            
        except Exception as e:
            print(f"  ✗ Error parsing {url}: {str(e)}")
            return None
    
    def extract_links(self, url, html):
        """Extract all links from a page"""
        soup = BeautifulSoup(html, 'html.parser')
        links = []
        
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            # Convert relative URLs to absolute
            absolute_url = urljoin(url, href)
            
            # Remove fragments
            absolute_url = absolute_url.split('#')[0]
            
            if self.is_valid_url(absolute_url) and absolute_url not in self.visited_urls:
                links.append(absolute_url)
        
        return links
    
    def crawl(self):
        """Crawl the documentation site"""
        print("Starting documentation crawler...")
        print(f"Max pages: {self.max_pages}\n")
        
        while self.to_visit and len(self.visited_urls) < self.max_pages:
            url = self.to_visit.pop(0)
            
            if url in self.visited_urls:
                continue
            
            try:
                print(f"[{len(self.visited_urls) + 1}/{self.max_pages}] Crawling: {url}")
                
                # Fetch the page
                response = requests.get(url, timeout=10)
                if response.status_code != 200:
                    print(f"  ✗ Failed to fetch: {response.status_code}")
                    self.visited_urls.add(url)
                    continue
                
                html = response.text
                self.visited_urls.add(url)
                
                # Extract content
                doc_data = self.extract_content(url, html)
                if doc_data:
                    self.scraped_docs.append(doc_data)
                    print(f"  ✓ Extracted: {doc_data['title']} ({len(doc_data['chunks'])} chunks)")
                else:
                    print(f"  ⊘ Skipped (insufficient content)")
                
                # Extract and queue new links
                new_links = self.extract_links(url, html)
                for link in new_links:
                    if link not in self.visited_urls and link not in self.to_visit:
                        self.to_visit.append(link)
                
                # Be respectful - small delay between requests
                sleep(0.5)
                
            except Exception as e:
                print(f"  ✗ Error: {str(e)}")
                self.visited_urls.add(url)
        
        print(f"\n=== Crawling Complete ===")
        print(f"Pages visited: {len(self.visited_urls)}")
        print(f"Documents scraped: {len(self.scraped_docs)}")
        
        return self.scraped_docs

print("Crawler class defined")

Crawler class defined
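
The link handling in the crawler (resolving relative links, dropping fragments, and keeping only URLs under `docs.vectara.com/docs/`) can be exercised on its own with the standard library. This is a hedged sketch: `normalize_link` and `in_scope` are hypothetical helper names, and `urldefrag` is used here in place of the `split('#')` call in the class.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def normalize_link(page_url, href):
    """Resolve href against the page URL and drop any #fragment."""
    absolute = urljoin(page_url, href)
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment

def in_scope(url, host="docs.vectara.com", prefix="/docs/"):
    """Keep only URLs on the docs host and under the /docs/ path."""
    parsed = urlparse(url)
    return parsed.netloc == host and parsed.path.startswith(prefix)

page = "https://docs.vectara.com/docs/learn/data-privacy/privacy-overview"

link = normalize_link(page, "/docs/release-notes#latest")
print(link)            # https://docs.vectara.com/docs/release-notes
print(in_scope(link))  # True
print(in_scope(normalize_link(page, "https://vectara.com/")))  # False
```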


In [6]:
# Run the documentation crawler
print("Starting documentation crawler...\n")
print("This will crawl docs.vectara.com/docs/ and extract all documentation pages.")
print("Depending on max_pages setting, this may take several minutes.\n")

# Create and run the crawler
# Note: Set max_pages to control how many pages to crawl (default: 100)
# For a full crawl, you can increase this number, but it will take longer
crawler = VectaraDocsCrawler(start_url='https://docs.vectara.com/docs/', max_pages=50)
scraped_docs = crawler.crawl()

print(f"\nSample pages:")
for doc in scraped_docs[:5]:
    print(f"  - {doc['title']} ({doc['topic']})")

Starting documentation crawler...

This will crawl docs.vectara.com/docs/ and extract all documentation pages.
Depending on max_pages setting, this may take several minutes.

Starting documentation crawler...
Max pages: 50

[1/50] Crawling: https://docs.vectara.com/docs/
  ✓ Extracted: The Vectara Platform (47 chunks)
[2/50] Crawling: https://docs.vectara.com/docs/rest-api
  ⊘ Skipped (insufficient content)
[3/50] Crawling: https://docs.vectara.com/docs/sdk/vectara-python-sdk
  ✓ Extracted: Vectara Python SDK (26 chunks)
[4/50] Crawling: https://docs.vectara.com/docs/release-notes
  ✓ Extracted: Vectara Release Notes (50 chunks)
[5/50] Crawling: https://docs.vectara.com/docs/changelog
  ✓ Extracted: Vectara Documentation Changelog (50 chunks)
[6/50] Crawling: https://docs.vectara.com/docs/learn/data-privacy/privacy-overview
  ✓ Extracted: Privacy Overview (7 chunks)
[7/50] Crawling: https://docs.vectara.com/docs/learn/authentication/authentication-authorization-vectara
  ✓ Extracted: A

## Index Scraped Documentation

Now let's index the scraped documentation into Vectara using the Core Indexing API.
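
Before sending anything over the network, it can help to build and inspect the request body on its own. The sketch below mirrors the payload shape used in the indexing cell of this notebook (`id`, `type`, `document_parts`, `metadata`). The function names `make_doc_id` and `build_core_document` and the sample chunk text are hypothetical.

```python
def make_doc_id(url):
    """Derive a document ID from a URL by replacing separators with hyphens."""
    return url.replace("https://", "").replace("/", "-").replace(".", "-")

def build_core_document(doc_data):
    """Build the JSON body for one scraped page (no network calls)."""
    return {
        "id": make_doc_id(doc_data["url"]),
        "type": "core",
        "document_parts": [
            {"text": chunk["text"], "metadata": {"section": chunk["section"]}}
            for chunk in doc_data["chunks"]
        ],
        "metadata": {
            "source": "vectara_docs",
            "title": doc_data["title"],
            "doc_type": doc_data["doc_type"],
            "topic": doc_data["topic"],
            "url": doc_data["url"],
        },
    }

# Hypothetical scraped page used only to show the resulting structure
sample = {
    "url": "https://docs.vectara.com/docs/release-notes",
    "title": "Vectara Release Notes",
    "doc_type": "guide",
    "topic": "general",
    "chunks": [{"text": "Placeholder paragraph text.", "section": "introduction"}],
}

payload = build_core_document(sample)
print(payload["id"])  # docs-vectara-com-docs-release-notes
```

Keeping payload construction separate from the HTTP call makes it straightforward to check IDs and chunk counts before indexing.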

In [7]:
# Index scraped documentation
index_url = f"{BASE_URL}/corpora/{docs_corpus_key}/documents"
index_results = []

headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'x-api-key': api_key
}

print("Indexing scraped documentation...\n")

for doc_data in scraped_docs:
    # Create a unique document ID from the URL. This happens before the try block
    # so the except handler below can always reference doc_id.
    doc_id = doc_data['url'].replace('https://', '').replace('/', '-').replace('.', '-')
    try:
        
        # Prepare document for Core Indexing API
        doc = {
            'id': doc_id,
            'type': 'core',
            'document_parts': [
                {
                    'text': chunk['text'],
                    'metadata': {'section': chunk['section']}
                } for chunk in doc_data['chunks']
            ],
            'metadata': {
                'source': 'vectara_docs',
                'title': doc_data['title'],
                'doc_type': doc_data['doc_type'],
                'topic': doc_data['topic'],
                'url': doc_data['url']
            }
        }
        
        print(f"Indexing: {doc_data['title']} ({len(doc_data['chunks'])} chunks)")
        response = requests.post(index_url, headers=headers, json=doc, timeout=30)
        
        if response.status_code in [200, 201]:
            print(f"  ✓ Successfully indexed {doc_id}")
            index_results.append({'id': doc_id, 'title': doc_data['title'], 'success': True})
        else:
            print(f"  ✗ Indexing failed: {response.status_code} - {response.text[:200]}")
            index_results.append({'id': doc_id, 'title': doc_data['title'], 'success': False, 'error': response.text})
        
        sleep(0.5)
        
    except Exception as e:
        print(f"  ✗ Error: {str(e)}")
        index_results.append({'id': doc_id, 'title': doc_data.get('title', 'Unknown'), 'success': False, 'error': str(e)})

# Summary
successful = sum(1 for r in index_results if r['success'])
print(f"\n=== Indexing Summary ===")
print(f"Total: {len(index_results)}, Successful: {successful}, Failed: {len(index_results) - successful}")

if successful > 0:
    print(f"\nSuccessfully indexed documentation:")
    for result in index_results:
        if result['success']:
            print(f"  ✓ {result['title']}")

Indexing scraped documentation...

Indexing: The Vectara Platform (47 chunks)
  ✓ Successfully indexed docs-vectara-com-docs-
Indexing: Vectara Python SDK (26 chunks)
  ✓ Successfully indexed docs-vectara-com-docs-sdk-vectara-python-sdk
Indexing: Vectara Release Notes (50 chunks)
  ✓ Successfully indexed docs-vectara-com-docs-release-notes
Indexing: Vectara Documentation Changelog (50 chunks)
  ✓ Successfully indexed docs-vectara-com-docs-changelog
Indexing: Privacy Overview (7 chunks)
  ✓ Successfully indexed docs-vectara-com-docs-learn-data-privacy-privacy-overview
Indexing: Authentication and Authorization in Vectara (26 chunks)
  ✓ Successfully indexed docs-vectara-com-docs-learn-authentication-authentication-authorization-vectara
Indexing: Getting Started (13 chunks)
  ✓ Successfully indexed docs-vectara-com-docs-getting-started
Indexing: Private Deployment (11 chunks)
  ✓ Successfully indexed docs-vectara-com-docs-deployments
Indexing: Data Management (11 chunks)
  ✓ Successfully

## Verify Indexed Content

Let's verify that the expected number of documents was indexed in each corpus by listing all documents in both corpora.
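
Listing documents relies on key-based pagination: each response may carry a `page_key` under `metadata`, and the loop continues until no key is returned. The sketch below isolates that loop from the HTTP layer so it can be checked against fake pages. The generator `iter_documents`, the `fetch_page` callable, and the `fake_pages` data are hypothetical; the response shape follows what the verification cell in this notebook reads.

```python
def iter_documents(fetch_page):
    """Yield documents across pages; fetch_page(page_key) returns one response dict."""
    page_key = None
    while True:
        data = fetch_page(page_key)
        yield from data.get("documents", [])
        # A missing or empty page_key means this was the last page
        page_key = data.get("metadata", {}).get("page_key")
        if not page_key:
            break

# Fake two-page listing used in place of real API responses
fake_pages = {
    None: {"documents": [{"id": "a"}, {"id": "b"}], "metadata": {"page_key": "p2"}},
    "p2": {"documents": [{"id": "c"}], "metadata": {}},
}

ids = [doc["id"] for doc in iter_documents(lambda key: fake_pages[key])]
print(ids)  # ['a', 'b', 'c']
```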

In [8]:
headers = {
    'Accept': 'application/json',
    'x-api-key': api_key
}

# Helper function to list all documents with pagination
def list_all_documents(corpus_key, corpus_name):
    """List all documents in a corpus with pagination"""
    all_documents = []
    page_key = None
    
    while True:
        # Build request with pagination
        params = {'limit': 100}
        if page_key:
            params['page_key'] = page_key
        
        response = requests.get(
            f"{BASE_URL}/corpora/{corpus_key}/documents",
            headers=headers,
            params=params,
            timeout=30
        )
        
        if response.status_code != 200:
            print(f"  ✗ Error listing documents: {response.status_code}")
            print(f"    {response.text}")
            break
        
        data = response.json()
        documents = data.get('documents', [])
        all_documents.extend(documents)
        
        # Check if there are more pages
        page_key = data.get('metadata', {}).get('page_key')
        if not page_key:
            break
    
    return all_documents

# Verify AI Research Papers corpus
print("=== AI Research Papers Corpus ===")
print(f"Expected: {len(research_papers)} documents\n")

research_docs = list_all_documents(research_corpus_key, "AI Research Papers")
print(f"Actual: {len(research_docs)} documents indexed\n")

if len(research_docs) > 0:
    print("Indexed papers:")
    for doc in research_docs:
        doc_id = doc.get('id', 'N/A')
        metadata = doc.get('metadata', {})
        title = metadata.get('title', 'Unknown')
        print(f"  • {title}")
else:
    print("  ⚠ No documents found")

# Check if count matches expected
if len(research_docs) == len(research_papers):
    print(f"\n✓ All {len(research_papers)} papers successfully indexed")
elif len(research_docs) < len(research_papers):
    print(f"\n⚠ Warning: Expected {len(research_papers)} papers but found {len(research_docs)}")
else:
    print(f"\n⚠ Warning: Found more documents ({len(research_docs)}) than expected ({len(research_papers)})")

# Verify Vectara Documentation corpus
print("\n\n=== Vectara Documentation Corpus ===")
print(f"Expected: {len(scraped_docs)} documents\n")

docs_docs = list_all_documents(docs_corpus_key, "Vectara Documentation")
print(f"Actual: {len(docs_docs)} documents indexed\n")

if len(docs_docs) > 0:
    print(f"Sample indexed documentation (showing first 10):")
    for doc in docs_docs[:10]:
        doc_id = doc.get('id', 'N/A')
        metadata = doc.get('metadata', {})
        title = metadata.get('title', 'Unknown')
        topic = metadata.get('topic', 'N/A')
        doc_type = metadata.get('doc_type', 'N/A')
        print(f"  • {title} (topic: {topic}, type: {doc_type})")
else:
    print("  ⚠ No documents found")

# Check if count matches expected
if len(docs_docs) == len(scraped_docs):
    print(f"\n✓ All {len(scraped_docs)} documentation pages successfully indexed")
elif len(docs_docs) < len(scraped_docs):
    print(f"\n⚠ Warning: Expected {len(scraped_docs)} pages but found {len(docs_docs)}")
else:
    print(f"\n⚠ Warning: Found more documents ({len(docs_docs)}) than expected ({len(scraped_docs)})")

# Overall summary
print("\n\n=== Overall Indexing Summary ===")
print(f"Research Papers: {len(research_docs)}/{len(research_papers)} indexed")
print(f"Documentation: {len(docs_docs)}/{len(scraped_docs)} indexed")
print(f"Total: {len(research_docs) + len(docs_docs)} documents across both corpora")

=== AI Research Papers Corpus ===
Expected: 7 documents

Actual: 7 documents indexed

Indexed papers:
  • Language Models are Few-Shot Learners
  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  • Attention Is All You Need
  • BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models
  • Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
  • Hallucination Detection in RAG Systems
  • Retrieval Evaluation Metrics and Methods

✓ All 7 papers successfully indexed


=== Vectara Documentation Corpus ===
Expected: 41 documents

Actual: 41 documents indexed

Sample indexed documentation (showing first 10):
  • The Vectara Platform (topic: general, type: guide)
  • Vectara Python SDK (topic: general, type: guide)
  • Vectara Release Notes (topic: general, type: guide)
  • Vectara Documentation Changelog (topic: general, type: guide)
  • Privacy Overview (topic: general, type: guide)
  • Authentication and Auth