<a href="https://colab.research.google.com/github/vectara/example-notebooks/blob/main/notebooks/api-examples/1-corpus-creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectara Corpus Creation

In this notebook we demonstrate how to create corpora using Vectara's REST API. We'll create two corpora:
- **AI Research Papers**: For academic papers from ArXiv about RAG, LLMs, and retrieval
- **Vectara Documentation**: For Vectara product documentation and guides

Both corpora will use the Boomerang embedding model and include filterable attributes for metadata-based search.

## About Vectara

[Vectara](https://vectara.com/) is the Agent Operating System for trusted enterprise AI: a unified Agentic RAG platform with built-in multi-modal retrieval, orchestration, and always-on governance. Deploy it on-prem (air-gapped), in your VPC, or as SaaS.

Vectara provides a complete API-first platform for building production RAG and agentic applications:

- **Simple Integration**: RESTful APIs and SDKs for Python, TypeScript, and Java make integration straightforward
- **Flexible Deployment**: Choose SaaS, VPC, or on-premises deployment based on your security and compliance requirements
- **Multi-Modal Support**: Index and search across text, tables, and images from various document formats
- **Advanced Retrieval**: Hybrid search combining semantic and keyword matching with multiple reranking options
- **Grounded Generation**: LLM responses with citations and factual consistency scores to reduce hallucinations
- **Enterprise-Ready**: Built-in access controls, audit logging, and compliance certifications (SOC2, HIPAA)

## Getting Started

To get started with Vectara, [sign up](https://console.vectara.com/signup?utm_source=vectara&utm_medium=signup&utm_term=DevRel&utm_content=example-notebooks&utm_campaign=vectara-signup-DevRel-example-notebooks) (if you haven't already) and create a personal API key from the console.

This notebook assumes you have a `VECTARA_API_KEY` environment variable set with your personal API key.
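
If the variable is missing, a plain `os.environ[...]` lookup fails with a bare `KeyError`. The sketch below shows one way to fail with a clearer message. The helper name `get_api_key` is hypothetical and is not part of any Vectara SDK.

```python
import os


def get_api_key(var_name: str = "VECTARA_API_KEY") -> str:
    """Return the API key from the environment, or raise a descriptive error.

    Hypothetical convenience helper for this notebook, not a Vectara API.
    """
    # Treat an unset variable and a whitespace-only value the same way.
    api_key = os.environ.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Environment variable {var_name} is not set. "
            "Create a personal API key in the Vectara console and export it first."
        )
    return api_key
```

Calling `get_api_key()` in place of the direct `os.environ` lookup leaves the rest of the notebook unchanged.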

In [1]:
import os
import requests

# Get API key from environment (raises KeyError if VECTARA_API_KEY is not set)
api_key = os.environ['VECTARA_API_KEY']

# Base URL for Vectara API v2
BASE_URL = "https://api.vectara.io/v2"

# Common headers for all requests
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'x-api-key': api_key
}

## Corpus 1: AI Research Papers

First, we'll create a corpus for academic papers about RAG, LLMs, and retrieval techniques. We'll configure:
- **Boomerang embedding model**: Vectara's embedding model, selected through the `encoder_name` field
- **Filterable attributes**: `source`, `year`, `topic` for metadata-based filtering
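
Once documents are indexed, these attributes can be referenced in a metadata filter at query time. The sketch below only builds the filter string. It assumes Vectara's SQL-like filter syntax, in which document-level attributes are addressed as `doc.<name>`. The helper `build_metadata_filter` is hypothetical, so check the expression syntax against the filter documentation before relying on it.

```python
def build_metadata_filter(source=None, min_year=None, topic=None):
    """Build a document-level filter string from optional criteria.

    Hypothetical helper. The `doc.<name>` syntax and the AND operator are
    assumptions about Vectara's filter expression language.
    """

    def quoted(value):
        # Escaping rules are not covered here, so refuse embedded quotes.
        if "'" in value:
            raise ValueError("single quotes are not supported in filter values")
        return f"'{value}'"

    clauses = []
    if source is not None:
        clauses.append(f"doc.source = {quoted(source)}")
    if min_year is not None:
        # `year` is declared as an integer attribute, so no quoting is needed.
        clauses.append(f"doc.year >= {int(min_year)}")
    if topic is not None:
        clauses.append(f"doc.topic = {quoted(topic)}")
    return " AND ".join(clauses)


print(build_metadata_filter(source="arxiv", min_year=2023))
# prints: doc.source = 'arxiv' AND doc.year >= 2023
```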

In [2]:
# Configuration for the AI Research Papers corpus
research_corpus_config = {
    "key": "tutorial-ai-research-papers",
    "name": "AI Research Papers",
    "description": "Academic papers from ArXiv on RAG, LLMs, embeddings, and retrieval techniques",
    "encoder_name": "boomerang-2023-q3",
    "filter_attributes": [
        {
            "name": "source",
            "level": "document",
            "description": "Source of the document (e.g., arxiv)",
            "type": "text"
        },
        {
            "name": "year",
            "level": "document",
            "description": "Publication year",
            "type": "integer"
        },
        {
            "name": "topic",
            "level": "document",
            "description": "Main topic of the paper (e.g., RAG, embeddings, retrieval)",
            "type": "text"
        }
    ]
}

# Send the creation request; the API responds with 201 on success
response = requests.post(
    f"{BASE_URL}/corpora",
    headers=headers,
    json=research_corpus_config
)

if response.status_code == 201:
    research_corpus = response.json()
    research_corpus_key = research_corpus['key']
    print("✓ Created AI Research Papers corpus")
    print(f"  Corpus Key: {research_corpus_key}")
    print(f"  Encoder: {research_corpus.get('encoder_name', 'N/A')}")
    print(f"  Filter Attributes: {len(research_corpus.get('filter_attributes', []))}")
else:
    print(f"Error creating corpus: {response.status_code}")
    print(response.text)

✓ Created AI Research Papers corpus
  Corpus Key: tutorial-ai-research-papers
  Encoder: boomerang-2023-q3
  Filter Attributes: 3
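
Re-running the cell above after the corpus exists takes the error branch, because the key is already in use. The sketch below wraps creation in a helper that treats that case as non-fatal. It assumes the API reports a duplicate key with HTTP 409, which should be confirmed against the API reference. The name `create_corpus` and its return values are hypothetical. The `post` argument is injected so the logic can be exercised without network access.

```python
def create_corpus(config, post, base_url="https://api.vectara.io/v2", headers=None):
    """Create a corpus, treating "already exists" as a non-fatal outcome.

    `post` is any callable that accepts the same arguments as `requests.post`.
    Returns a (status, corpus_key) tuple where status is "created" or "exists".
    """
    response = post(f"{base_url}/corpora", headers=headers or {}, json=config)
    if response.status_code == 201:
        return "created", response.json()["key"]
    if response.status_code == 409:
        # Assumption: 409 means a corpus with this key already exists.
        return "exists", config["key"]
    raise RuntimeError(
        f"Error creating corpus: {response.status_code} {response.text}"
    )
```

In this notebook the call would look like `create_corpus(research_corpus_config, requests.post, BASE_URL, headers)`.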


## Corpus 2: Vectara Documentation

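Next, we'll create a corpus for Vectara's product documentation. We'll configure:
- **Boomerang embedding model**: the same `boomerang-2023-q3` encoder used for the first corpus
- **Filterable attributes**: `source`, `doc_type`, `topic` for categorizing different types of documentation, each explicitly marked as `indexed`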

In [3]:
# Configuration for the Vectara Documentation corpus
docs_corpus_config = {
    "key": "tutorial-vectara-docs",
    "name": "Vectara Documentation",
    "description": "Vectara product documentation, API references, guides, and tutorials",
    "encoder_name": "boomerang-2023-q3",
    "filter_attributes": [
        {
            "name": "source",
            "level": "document",
            "description": "Source of the document (e.g., vectara_docs)",
            "indexed": True,
            "type": "text"
        },
        {
            "name": "doc_type",
            "level": "document",
            "description": "Type of documentation (e.g., api_reference, guide, tutorial)",
            "indexed": True,
            "type": "text"
        },
        {
            "name": "topic",
            "level": "document",
            "description": "Main topic covered (e.g., query, indexing, agents)",
            "indexed": True,
            "type": "text"
        }
    ]
}

# Send the creation request; the API responds with 201 on success
response = requests.post(
    f"{BASE_URL}/corpora",
    headers=headers,
    json=docs_corpus_config
)

if response.status_code == 201:
    docs_corpus = response.json()
    docs_corpus_key = docs_corpus['key']
    print("✓ Created Vectara Documentation corpus")
    print(f"  Corpus Key: {docs_corpus_key}")
    print(f"  Encoder: {docs_corpus.get('encoder_name', 'N/A')}")
    print(f"  Filter Attributes: {len(docs_corpus.get('filter_attributes', []))}")
else:
    print(f"Error creating corpus: {response.status_code}")
    print(response.text)

✓ Created Vectara Documentation corpus
  Corpus Key: tutorial-vectara-docs
  Encoder: boomerang-2023-q3
  Filter Attributes: 3
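
Both corpus configurations share the same shape, so the repeated dictionaries can be generated. The sketch below uses two hypothetical helpers, `filter_attribute` and `corpus_config`, that emit the same fields used in the cells above. Setting `indexed=True` by default is a choice made for this example, not a documented API default.

```python
def filter_attribute(name, description, attr_type="text", level="document", indexed=True):
    """Return one filter attribute definition in the shape used above."""
    return {
        "name": name,
        "level": level,
        "description": description,
        "indexed": indexed,
        "type": attr_type,
    }


def corpus_config(key, name, description, attributes, encoder_name="boomerang-2023-q3"):
    """Assemble a corpus creation payload from its parts."""
    return {
        "key": key,
        "name": name,
        "description": description,
        "encoder_name": encoder_name,
        "filter_attributes": list(attributes),
    }


docs_config = corpus_config(
    key="tutorial-vectara-docs",
    name="Vectara Documentation",
    description="Vectara product documentation, API references, guides, and tutorials",
    attributes=[
        filter_attribute("source", "Source of the document (e.g., vectara_docs)"),
        filter_attribute("doc_type", "Type of documentation (e.g., api_reference, guide, tutorial)"),
        filter_attribute("topic", "Main topic covered (e.g., query, indexing, agents)"),
    ],
)
```

The resulting `docs_config` has the same keys and values as `docs_corpus_config` in the cell above.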


## Verify Corpus Creation

Let's verify both corpora were created successfully and view their details.

In [4]:
# List all corpora with pagination
print("\n=== Your Corpora ===")

all_corpora = []
page_key = None

# Fetch all pages
while True:
    # Build request with pagination
    params = {'limit': 100}
    if page_key:
        params['page_key'] = page_key
    
    response = requests.get(f"{BASE_URL}/corpora", headers=headers, params=params)
    
    if response.status_code != 200:
        print(f"Error listing corpora: {response.status_code}")
        print(response.text)
        break
    
    data = response.json()
    corpora = data.get('corpora', [])
    all_corpora.extend(corpora)
    
    # Check if there are more pages
    page_key = data.get('metadata', {}).get('page_key')
    if not page_key:
        break

print(f"Total corpora found: {len(all_corpora)}\n")

# Display our tutorial corpora
tutorial_corpus_keys = ['tutorial-ai-research-papers', 'tutorial-vectara-docs']
found_tutorial_corpora = [c for c in all_corpora if c['key'] in tutorial_corpus_keys]

if found_tutorial_corpora:
    for corpus in found_tutorial_corpora:
        print(f"\n{corpus['name']}")
        print(f"  Key: {corpus['key']}")
        print(f"  Description: {corpus.get('description', 'N/A')}")
        print(f"  Encoder: {corpus.get('encoder_name', 'N/A')}")
        print(f"  Filter Attributes: {[attr['name'] for attr in corpus.get('filter_attributes', [])]}")
else:
    print("Tutorial corpora not found in the list.")


=== Your Corpora ===
Total corpora found: 52


AI Research Papers
  Key: tutorial-ai-research-papers
  Description: Academic papers from ArXiv on RAG, LLMs, embeddings, and retrieval techniques
  Encoder: boomerang-2023-q3
  Filter Attributes: ['source', 'topic', 'year']

Vectara Documentation
  Key: tutorial-vectara-docs
  Description: Vectara product documentation, API references, guides, and tutorials
  Encoder: boomerang-2023-q3
  Filter Attributes: ['doc_type', 'source', 'topic']
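
The pagination loop above can also be written as a generator, which keeps the paging logic separate from the HTTP call. This is a minimal sketch that follows the same `metadata.page_key` convention as the cell above. The name `iter_corpora` and the injected `fetch_page` callable are hypothetical.

```python
def iter_corpora(fetch_page, limit=100):
    """Yield every corpus, following `metadata.page_key` until it is absent.

    `fetch_page(params)` must return the decoded JSON body of one list response.
    """
    page_key = None
    while True:
        params = {"limit": limit}
        if page_key:
            params["page_key"] = page_key
        data = fetch_page(params)
        yield from data.get("corpora", [])
        # A missing or empty page_key means this was the last page.
        page_key = data.get("metadata", {}).get("page_key")
        if not page_key:
            break
```

In this notebook, `fetch_page` could be `lambda params: requests.get(f"{BASE_URL}/corpora", headers=headers, params=params).json()`, after which `list(iter_corpora(fetch_page))` would collect all corpora.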
