<a href="https://colab.research.google.com/github/vectara/example-notebooks/blob/main/notebooks/api-examples/3-query-api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectara Query API Examples

This notebook demonstrates how to call Vectara's Query API directly over REST. We'll cover:
- Single corpus queries with hybrid search and reranking
- Multiple corpora queries
- Metadata filtering
- Streaming responses
- Conversational chat

## About Vectara

[Vectara](https://vectara.com/) is the Agent Operating System for trusted enterprise AI: a unified Agentic RAG platform with built-in multi-modal retrieval, orchestration, and always-on governance. Deploy it on-prem (air-gapped), in your VPC, or as SaaS. Vectara agents deliver grounded answers and safe actions with source citations, step-level audit trails, fine-grained access controls, and real-time policy and factual-consistency enforcement, so teams ship faster with lower risk, and with trusted, production-grade AI agents at scale.

Vectara provides a complete API-first platform for building production RAG and agentic applications:

- **Simple Integration**: RESTful APIs and SDKs (Python, JavaScript) for quick integration into any stack
- **Flexible Deployment**: Choose SaaS, VPC, or on-premises deployment based on your requirements
- **Multi-Modal Support**: Index and search across text, tables, and images from PDFs, documents, and structured data
- **Advanced Retrieval**: Hybrid search combining semantic and keyword matching with state-of-the-art reranking
- **Grounded Generation**: LLM responses with citations and factual consistency scores to reduce hallucinations
- **Enterprise-Ready**: Built-in access controls, audit logging, and compliance (SOC2, HIPAA) from day one

## Getting Started

This notebook assumes you've completed Notebooks 1 and 2:
- Notebook 1: Created two corpora (ai-research-papers and vectara-docs) with Boomerang embeddings
- Notebook 2: Ingested AI research papers and Vectara documentation


In [1]:
import os
import requests
import json

# Set up authentication
api_key = os.environ['VECTARA_API_KEY']

# Corpus keys created in Notebook 1 (hardcoded here; update them if yours differ)
research_corpus_key = 'tutorial-ai-research-papers'
docs_corpus_key = 'tutorial-vectara-docs'

# Base URL for Vectara API v2
BASE_URL = "https://api.vectara.io/v2"

# Common headers for all requests
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'x-api-key': api_key
}

print(f"Research Corpus: {research_corpus_key}")
print(f"Docs Corpus: {docs_corpus_key}")

Research Corpus: tutorial-ai-research-papers
Docs Corpus: tutorial-vectara-docs


## Example 1: Basic Query with Hybrid Search and Reranking

This example demonstrates a single corpus query using:
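- Hybrid search (lexical_interpolation=0.005, which weights retrieval almost entirely toward semantic matching while keeping a small keyword component)
- Chain reranker combining multilingual reranker with MMR (diversity_bias=0.05) for improved relevance and diversity
- Staged retrieval: fetch 100 results, rerank and keep the top 30, then pass the top 10 to generation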
- Generation with vectara-summary-ext-24-05-med-omni preset
- Factual Consistency Score to detect potential hallucinations

In [9]:
# Construct the query request - querying research papers corpus
query_request = {
    "query": "What is retrieval augmented generation?",
    "search": {
        "corpora": [
            {
                "corpus_key": research_corpus_key,
                "lexical_interpolation": 0.005
            }
        ],
        "limit": 100,
        "context_configuration": {
            "sentences_before": 2,
            "sentences_after": 2
        },
        "reranker": {
            "type": "chain",
            "rerankers": [
                {
                    "type": "customer_reranker",
                    "reranker_id": "rnk_272725719", 
                    "limit": 30,
                },
                {
                    "type": "mmr",
                    "diversity_bias": 0.05
                }
            ],
        }
    },
    "generation": {
        "generation_preset_name": "vectara-summary-ext-24-05-med-omni",
        "max_used_search_results": 10,
        "response_language": "eng",
        "enable_factual_consistency_score": True
    }
}

# Make the query request
url = f"{BASE_URL}/query"
response = requests.post(url, headers=headers, json=query_request)

if response.status_code == 200:
    result = response.json()
    print("\n=== Generated Summary ===")
    print(result['summary'])
    print(f"\n=== Factual Consistency Score: {result.get('factual_consistency_score', 'N/A')} ===")
else:
    print(f"Error: {response.status_code}")
    print(response.text)


=== Generated Summary ===
Retrieval-augmented generation (RAG) is a method that combines pre-trained parametric memory models, such as sequence-to-sequence (seq2seq) transformers, with non-parametric memory, which is typically a dense vector index of external data sources like Wikipedia. This approach uses a pre-trained neural retriever to access the non-parametric memory, allowing the model to retrieve relevant information to enhance language generation tasks. RAG models can condition on the same retrieved passages for the entire generated sequence or use different passages for each token, providing flexibility in generating responses for knowledge-intensive tasks [1], [4].

=== Factual Consistency Score: 0.96875 ===
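
The later examples reuse this request body with small changes (corpora, filter, result limits). As a minimal sketch, the shared structure could be factored into a helper. `build_query_request` and its parameter names are hypothetical and not part of any Vectara SDK; the field names and values are copied from the request above.

```python
def build_query_request(query, corpus_keys, metadata_filter=None,
                        fetch_limit=100, rerank_limit=30, max_used=10):
    """Build a query request body shaped like the one in Example 1.

    Hypothetical helper: only the JSON field names come from the notebook.
    """
    corpora = []
    for key in corpus_keys:
        corpus = {"corpus_key": key, "lexical_interpolation": 0.005}
        if metadata_filter:
            # Apply the same filter to every corpus in the request
            corpus["metadata_filter"] = metadata_filter
        corpora.append(corpus)

    return {
        "query": query,
        "search": {
            "corpora": corpora,
            "limit": fetch_limit,  # stage 1: candidates fetched
            "context_configuration": {"sentences_before": 2, "sentences_after": 2},
            "reranker": {
                "type": "chain",
                "rerankers": [
                    # stage 2: rerank and keep the top `rerank_limit`
                    {"type": "customer_reranker",
                     "reranker_id": "rnk_272725719",
                     "limit": rerank_limit},
                    {"type": "mmr", "diversity_bias": 0.05},
                ],
            },
        },
        "generation": {
            "generation_preset_name": "vectara-summary-ext-24-05-med-omni",
            "max_used_search_results": max_used,  # stage 3: results given to the LLM
            "response_language": "eng",
            "enable_factual_consistency_score": True,
        },
    }


request_body = build_query_request(
    "What is retrieval augmented generation?",
    ["tutorial-ai-research-papers"],
)
print(request_body["search"]["limit"],
      request_body["generation"]["max_used_search_results"])
```

The returned dictionary can be passed as `json=` to `requests.post`, exactly like `query_request` above.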


### Examining Search Results and Citations

The response includes the retrieved documents that were used to generate the summary, along with citation information.

In [11]:
if response.status_code == 200:
    result = response.json()
    
    # Display first 5 search results
    print("\n=== Top Search Results ===")
    for i, search_result in enumerate(result.get('search_results', [])[:5], 1):
        print(f"\n--- Result {i} ---")
        print(f"Text: {search_result['text'][:200]}...")
        print(f"Score: {search_result.get('score', 'N/A')}")
        print(f"Document ID: {search_result.get('document_id', 'N/A')}")
        if 'document_metadata' in search_result:
            print(f"Metadata: {search_result['document_metadata']}")


=== Top Search Results ===

--- Result 1 ---
Text: Additionally, providing provenance for their
decisions and updating their world knowledge remain open research problems. Pre-
trained models with a differentiable access mechanism to explicit non-para...
Score: 0.9941574335098267
Document ID: rag-retrieval-augmented-generation.pdf
Metadata: {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'CreationDate': 'D:20210413004838Z', 'Keywords': '', 'Producer': 'pdfTeX-1.40.21', 'Author': '', 'Title': '', 'Creator': 'LaTeX with hyperref', 'ModDate': 'D:20210413004838Z', 'Trapped': '/False', 'Subject': '', 'source': 'arxiv', 'year': 2020, 'topic': 'RAG', 'title': 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks', 'authors': 'Lewis et al.'}

--- Result 2 ---
Text: arXiv preprint arXiv:2203.05115, 2022. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
    Heinrich Küttler, Mi
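
The summary in Example 1 cites sources with markers such as `[1]` and `[4]`. The sketch below shows one way to resolve those markers, under the assumption that `[n]` refers to the n-th entry (1-based) of `search_results`; that mapping is not confirmed by this notebook and should be checked against the API documentation. `cited_results` and the sample data are hypothetical.

```python
import re


def cited_results(summary, search_results):
    """Map numeric citation markers like [1] to search results.

    Assumes (unverified here) that [n] points at search_results[n - 1].
    Markers outside the available range are ignored.
    """
    numbers = sorted({int(n) for n in re.findall(r"\[(\d+)\]", summary)})
    return {n: search_results[n - 1]
            for n in numbers
            if 1 <= n <= len(search_results)}


# Hypothetical stand-ins for result['summary'] and result['search_results']
sample_summary = "RAG combines parametric and non-parametric memory [1], [4]."
sample_results = [{"document_id": f"doc-{i}"} for i in range(1, 6)]

cited = cited_results(sample_summary, sample_results)
for number, item in cited.items():
    print(number, item["document_id"])
```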

## Example 2: Querying Multiple Corpora

Vectara allows you to query across multiple corpora simultaneously. This is useful when you have data organized across different collections.

In [16]:
# Query both corpora simultaneously
# This combines results from research papers AND documentation
multi_corpus_request = {
    "query": "How do Agents work with Vectara?",
    "search": {
        "corpora": [
            {
                "corpus_key": research_corpus_key,
                "lexical_interpolation": 0.005
            },
            {
                "corpus_key": docs_corpus_key,
                "lexical_interpolation": 0.005
            }
        ],
        "limit": 100,
        "context_configuration": {
            "sentences_before": 2,
            "sentences_after": 2
        },
        "reranker": {
            "type": "chain",
            "rerankers": [
                {
                    "type": "customer_reranker",
                    "reranker_id": "rnk_272725719",
                    "limit": 30
                },
                {
                    "type": "mmr",
                    "diversity_bias": 0.05
                }
            ],
        }
    },
    "generation": {
        "generation_preset_name": "vectara-summary-ext-24-05-med-omni",
        "max_used_search_results": 10,
        "response_language": "eng",
        "enable_factual_consistency_score": True
    }
}

response = requests.post(f"{BASE_URL}/query", headers=headers, json=multi_corpus_request)

if response.status_code == 200:
    result = response.json()
    print("\n=== Generated Summary (Multiple Corpora) ===")
    print(result['summary'])
    print(f"\n=== Factual Consistency Score: {result.get('factual_consistency_score', 'N/A')} ===")
    
    # Show which corpus each result came from
    print("\n=== Result Sources ===")
    for i, search_result in enumerate(result.get('search_results', [])[:5], 1):
        doc_meta = search_result.get('document_metadata', {})
        source = doc_meta.get('source', 'unknown')
        title = doc_meta.get('title', 'N/A')
        print(f"{i}. Source: {source}, Title: {title}")
else:
    print(f"Error: {response.status_code}")
    print(response.text)


=== Generated Summary (Multiple Corpora) ===
Vectara Agents work by enabling enterprises to build sophisticated, enterprise-grade intelligent applications that go beyond basic question answering. These agents interpret user input, reason through context, leverage external tools, and maintain continuity across multi-turn interactions. Unlike traditional Retrieval Augmented Generation (RAG) systems that simply retrieve documents and pass them to a language model, Vectara agents provide orchestrated workflows capable of taking action, retrieving information, invoking APIs, or maintaining user sessions. This comprehensive framework allows for the creation of AI-powered applications that can autonomously reason through problems, orchestrate multiple tools, maintain conversation context, and integrate with enterprise systems through standardized protocols [1], [6].

=== Factual Consistency Score: 0.9609375 ===

=== Result Sources ===
1. Source: vectara_docs, Title: Agents
2. Source: vectara
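
To see how a multi-corpus result set is split between collections, the results can be tallied by the `source` field of `document_metadata`, the same field printed above. This is a small sketch: `count_by_source` is a hypothetical helper, and the sample list stands in for `result['search_results']`.

```python
from collections import Counter


def count_by_source(search_results):
    """Count search results per document_metadata['source'] value."""
    return Counter(
        r.get("document_metadata", {}).get("source", "unknown")
        for r in search_results
    )


# Hypothetical sample data shaped like the search results in this notebook
sample_results = [
    {"document_metadata": {"source": "vectara_docs", "title": "Agents"}},
    {"document_metadata": {"source": "arxiv"}},
    {"document_metadata": {"source": "vectara_docs"}},
    {},  # a result without metadata falls back to 'unknown'
]

counts = count_by_source(sample_results)
for source, n in counts.most_common():
    print(f"{source}: {n}")
```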

## Example 3: Metadata Filtering

Metadata filters narrow search results based on document-level (`doc.`) or chunk-level (`part.`) metadata.

In [19]:
# Example with metadata filtering
# Filter to only get research papers from 2023 or later
filtered_request = {
    "query": "What are the key innovations in retrieval augmented generation?",
    "search": {
        "corpora": [
            {
                "corpus_key": research_corpus_key,
                "lexical_interpolation": 0.005,
                # Keep only documents whose year metadata is 2023 or later
                "metadata_filter": "doc.year >= 2023"
            }
        ],
        "limit": 100,
        "context_configuration": {
            "sentences_before": 2,
            "sentences_after": 2
        },
        "reranker": {
            "type": "chain",
            "rerankers": [
                {
                    "type": "customer_reranker",
                    "reranker_id": "rnk_272725719",
                    "limit": 30
                },
                {
                    "type": "mmr",
                    "diversity_bias": 0.05
                }
            ],
        }
    },
    "generation": {
        "generation_preset_name": "vectara-summary-ext-24-05-med-omni",
        "max_used_search_results": 10,
        "response_language": "eng",
        "enable_factual_consistency_score": True
    }
}

response = requests.post(f"{BASE_URL}/query", headers=headers, json=filtered_request)

if response.status_code == 200:
    result = response.json()
    print("\n=== Generated Summary (With Metadata Filter) ===")
    print(result['summary'])
    print(f"\n=== Number of results: {len(result.get('search_results', []))} ===")
    
    # Show filtered results
    print("\n=== Filtered Papers ===")
    for search_result in result.get('search_results', [])[:3]:
        doc_meta = search_result.get('document_metadata', {})
        print(f"- {doc_meta.get('title', 'N/A')} ({doc_meta.get('year', 'N/A')}) - Topic: {doc_meta.get('topic', 'N/A')}")
else:
    print(f"Error: {response.status_code}")
    print(response.text)


=== Generated Summary (With Metadata Filter) ===
The key innovations in Retrieval-Augmented Generation (RAG) include the integration of retrieval mechanisms with generation models to enhance the accuracy and reliability of generated content. This approach allows large language models (LLMs) to generate answers or summaries by leveraging external knowledge sources, thereby reducing the likelihood of hallucinations, which are unsupported or incorrect information in the generated text. Additionally, the development of benchmarks like FaithBench and RAGTruth provides a framework for evaluating and improving the trustworthiness of RAG systems by focusing on hallucination detection and mitigation strategies [2], [5], [10].

=== Number of results: 30 ===

=== Filtered Papers ===
- Hallucination Detection in RAG Systems (2025) - Topic: RAG
- Hallucination Detection in RAG Systems (2025) - Topic: RAG
- Hallucination Detection in RAG Systems (2025) - Topic: RAG
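
Filters often need more than one condition, for example restricting by both year and topic. The sketch below joins simple comparisons into a single expression. It assumes the filter language accepts `AND` between comparisons and single-quoted string literals, and that `year` and `topic` were defined as filter attributes when the corpus was created in Notebook 1. `and_filter` is a hypothetical helper.

```python
def and_filter(*conditions):
    """Join non-empty filter conditions with AND.

    Each condition is assumed to be a simple comparison such as
    "doc.year >= 2023"; no precedence handling is attempted.
    """
    return " AND ".join(c for c in conditions if c)


year_filter = "doc.year >= 2023"
topic_filter = "doc.topic = 'RAG'"  # assumes 'topic' is a filterable attribute

combined = and_filter(year_filter, topic_filter)
print(combined)
```

The resulting string would be used as the `metadata_filter` value of a corpus entry, in place of the single condition shown above.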


## Example 4: Streaming Responses

For a better user experience, you can stream the generated response in real time using Server-Sent Events (SSE).

In [29]:
# Streaming query request - query the documentation corpus
streaming_request = {
    "query": "How do I use chunking with Vectara",
    "stream_response": True,
    "search": {
        "corpora": [
            {
                "corpus_key": docs_corpus_key,
                "lexical_interpolation": 0.005
            }
        ],
        "limit": 100,
        "context_configuration": {
            "sentences_before": 2,
            "sentences_after": 2
        },
        "reranker": {
            "type": "chain",
            "rerankers": [
                {
                    "type": "customer_reranker",
                    "reranker_id": "rnk_272725719",
                    "limit": 30
                },
                {
                    "type": "mmr",
                    "diversity_bias": 0.05
                }
            ],
        }
    },
    "generation": {
        "generation_preset_name": "vectara-summary-ext-24-05-med-omni",
        "max_used_search_results": 15,
        "response_language": "eng",
        "enable_factual_consistency_score": True
    }
}

# Make streaming request
streaming_headers = headers.copy()
streaming_headers['Accept'] = 'text/event-stream'

response = requests.post(
    f"{BASE_URL}/query",
    headers=streaming_headers,
    json=streaming_request,
    stream=True
)

print("\n=== Streaming Response ===")
if response.status_code == 200:
    for line in response.iter_lines():
        if line:
            line_str = line.decode('utf-8')
            if line_str.startswith('data:'):
                try:
                    data = json.loads(line_str[5:])  # Strip the 'data:' prefix; json.loads ignores the leading space
                    # Handle different event types
                    if data.get('type') == 'generation_chunk':
                        # Print generation text as it arrives
                        print(data.get('generation_chunk', ''), end='', flush=True)
                    elif data.get('type') == 'factual_consistency_score':
                        print(f"\n\n=== FCS: {data.get('factual_consistency_score')} ===")
                    elif data.get('type') == 'search_results':
                        # Search results arrive before generation starts
                        pass
                except json.JSONDecodeError:
                    pass
    print("\n")
else:
    print(f"Error: {response.status_code}")
    print(response.text)


=== Streaming Response ===
To use chunking with Vectara, you can choose between sentence-based and character-based chunking strategies. By default, Vectara uses sentence-based chunking, where each chunk contains one complete sentence. This method can lead to higher retrieval latency due to the increased number of chunks. Alternatively, you can opt for character-based chunking to create larger chunks by setting the type to `max_chars_chunking_strategy` and defining the `max_chars_per_chunk` value. This allows you to create chunks containing 3-7 sentences (512 to 1024 characters), balancing retrieval speed and contextual integrity [1], [2], [3].

=== FCS: 0.78125 ===
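
The parsing loop above can be separated from the network call, which makes it easier to test offline. The following is a minimal sketch: `iter_sse_events` is a hypothetical helper, and the sample lines are hand-written to mimic the `generation_chunk` and `factual_consistency_score` events handled above, not captured from a real response.

```python
import json


def iter_sse_events(lines):
    """Yield decoded JSON payloads from an iterable of SSE lines.

    Accepts bytes (as returned by response.iter_lines()) or str.
    Lines without a 'data:' prefix and non-JSON payloads are skipped.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue
        try:
            yield json.loads(line[len("data:"):])
        except json.JSONDecodeError:
            continue


# Hand-written sample lines (hypothetical, for illustration only)
sample_lines = [
    b"event: generation_chunk",
    b'data: {"type": "generation_chunk", "generation_chunk": "Hello"}',
    b"",
    b'data: {"type": "generation_chunk", "generation_chunk": " world"}',
    b"data: not-json",
    b'data: {"type": "factual_consistency_score", "factual_consistency_score": 0.78125}',
]

events = list(iter_sse_events(sample_lines))
text = "".join(
    e["generation_chunk"] for e in events if e.get("type") == "generation_chunk"
)
print(text)
```

With a live request, `response.iter_lines()` can be passed to `iter_sse_events` in place of `sample_lines`.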


