# Evaluate AI Search Quality with Tavily & Quotient: A Cookbook

<a target="_blank" href="https://colab.research.google.com/github/quotient-ai/quotient-alpha/blob/main/cookbooks/search/tavily/tavily_quotient_detections.ipynb">
 <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This cookbook walks through how we can monitor AI search results from [Tavily](https://www.tavily.com/) for hallucinations or retrieval issues using [Quotient AI](https://www.quotientai.co/).

We’ll cover:
- Performing AI-powered search using Tavily
- Logging search results in Quotient
- Automatically detecting hallucinations and irrelevant results with Quotient
- Understanding common failure cases and how to fix them

In [8]:
# !pip install -qU quotientai tavily-python tqdm

## Step 0: Grab your API keys

We’ll use API keys from:
 - [Tavily](https://www.tavily.com/) — get your API key from the [Tavily app](https://app.tavily.com)
 - [Quotient AI](https://www.quotientai.co) — get your API key from the [Quotient AI app](https://app.quotientai.co)
 
Both Tavily and Quotient offer generous free tiers to get started; see the pricing pages for [Tavily](https://www.tavily.com/#pricing) and [Quotient](https://www.quotientai.co/pricing).


In [9]:
import os
# Set API keys (replace the placeholders with your own keys):
os.environ['TAVILY_API_KEY'] = "tavily_api_key_here"
os.environ['QUOTIENT_API_KEY'] = "quotient_api_key_here"
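
A forgotten placeholder key usually shows up later as an authentication error. The sketch below fails early instead. `require_env` is a hypothetical helper written for this cookbook, not part of either SDK, and the placeholder check assumes the `..._api_key_here` strings used in the cell above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of `name`, or raise if it is unset or still a placeholder."""
    value = os.environ.get(name, "")
    # Assumption: placeholders follow the "<service>_api_key_here" pattern used above.
    if not value or value.endswith("_api_key_here"):
        raise RuntimeError(f"{name} is not set; assign a real key before continuing.")
    return value


# Usage (run after setting real keys):
# require_env("TAVILY_API_KEY")
# require_env("QUOTIENT_API_KEY")
```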

## Step 1: Connect to Tavily Search and Quotient monitoring

We’ll use Tavily’s API to retrieve content from the web and AI-generated answers for each query.

In [10]:
from tavily import TavilyClient

tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

Quotient is an intelligent observability platform designed for retrieval-augmented and search-augmented AI systems.

Quotient performs automated detections on two key fronts each time you send it a log:

- **Hallucination:** Identifies statements in the model output that are unsupported by the retrieved documents or that contradict them. Flagging is done at the sentence level, and the detection returns a boolean indicating whether any part of the answer contains a hallucination.

- **Context Relevance:** Evaluates each retrieved document to determine whether it meaningfully contributed to grounding the answer. Quotient returns relevance labels for all documents, helping gauge retrieval and search quality.
  
These capabilities are enabled automatically when `hallucination_detection=True` is set during logger initialization.

Below, we'll set up the Quotient logger, send each AI-search result for automatic evaluation, and retrieve structured logs and detections:

In [11]:
from quotientai import QuotientAI

# Initialize the Quotient SDK

quotient = QuotientAI()

quotient_logger = quotient.logger.init(
    app_name = "search-eval", # Name your application or project
    environment = "test", # Set the environment (e.g., "dev", "prod", "staging")
    sample_rate = 1, # Set the sample rate for logging (0-1)
    hallucination_detection = True, # Enable hallucination detection
    hallucination_detection_sample_rate = 1, # Set the sample rate for detections (0-1)
)

## Step 2: Get a set of example queries

We’ll evaluate on a set of realistic user queries from the open-source [Tavily Web Eval Generator](https://github.com/Eyalbenba/tavily-web-eval-generator), covering a diverse set of topics. From each sample, we will use the `question` attribute to run a fresh search and compare the generated answer against the retrieved documents.

In [12]:
import json

# Load queries from file
with open("search_queries.jsonl") as f:
    queries = [json.loads(line)["question"] for line in f]
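
The loader above expects one JSON object per line, each with a `question` field. Here is a slightly more defensive sketch of the same idea that skips blank lines and records without a question. `load_questions` and the two sample records are illustrative only and do not come from the Tavily dataset.

```python
import json
import tempfile
from pathlib import Path


def load_questions(path):
    """Read one JSON object per line and return the non-empty `question` values."""
    questions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            question = json.loads(line).get("question")
            if question:
                questions.append(question)
    return questions


# Illustrative data only: two made-up records written to a temporary file.
sample_path = Path(tempfile.mkdtemp()) / "sample_queries.jsonl"
sample_path.write_text(
    '{"question": "What is retrieval-augmented generation?"}\n'
    "\n"
    '{"question": "Who maintains the Python language?"}\n',
    encoding="utf-8",
)
sample_questions = load_questions(sample_path)
print(sample_questions)
```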

Alternatively, you can connect Quotient to a live development or production environment and run detections automatically as data comes in — no manual setup required beyond the few-lines-of-code initial integration.

## Step 3: Query Tavily for each example query and log your results in Quotient

Let’s run fresh searches for a subset of examples.

In [13]:
from tqdm import tqdm

tavily_results = []
log_ids = []

num_results = 10

for query in queries[:num_results]:

    response = tavily_client.search(
        query = query, 
        include_answer = 'advanced',
        search_depth = 'advanced',
        include_raw_content = True
    )

    print(f"\n🧠 {query}\n➡️ {response['answer']}")

    log_id = quotient_logger.log(
        user_query=query,
        model_output=response['answer'],
        # Send each retrieved result (stringified, including its raw content) for grounding checks
        documents=[str(doc) for doc in response['results']],
    )
    
    print(f"📝 Logged to Quotient with log_id: {log_id}")
    
    log_ids.append(log_id)


🧠 What is the top emerging technology in 2025 according to the article '25 New Technology Trends for 2025'?
➡️ According to the article "25 New Technology Trends for 2025," Generative AI stands as the top emerging technology for 2025. This artificial intelligence technology, which can create new content including text, images, code, and other media, leads the comprehensive list of 25 transformative technologies. Generative AI is positioned at the forefront of the AI and Machine Learning category, reflecting its significant impact across industries and its potential to revolutionize how businesses operate and how people interact with technology. The technology's prominence is underscored by its placement as the first item in multiple technology trend analyses, highlighting its critical role in shaping the technological landscape of 2025.
📝 Logged to Quotient with log_id: 778ad299-ed49-4c60-baac-04d8168c31f0

🧠 What is the name of the 105-qubit quantum processor unveiled by Alphabet?
➡️
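
The logging cell passes each whole result dict as a string, so titles, URLs, and scores are sent along with the page text. If you would rather send only the text, one option is sketched below. It assumes each result is a dict with a `content` snippet and an optional `raw_content` field that may be `None`. `documents_from_results` is a hypothetical helper, and the sample results are made up.

```python
def documents_from_results(results):
    """Build one plain-text document per search result.

    Prefers `raw_content` (full page text) and falls back to `content`
    (the shorter snippet) when raw content is missing or None.
    """
    documents = []
    for result in results:
        text = result.get("raw_content") or result.get("content") or ""
        if text:
            documents.append(text)
    return documents


# Made-up fragment shaped like the `results` list used above.
fake_results = [
    {"title": "A", "url": "https://example.com/a", "content": "snippet A", "raw_content": "full text A"},
    {"title": "B", "url": "https://example.com/b", "content": "snippet B", "raw_content": None},
]
fake_documents = documents_from_results(fake_results)
print(fake_documents)
```

You could then pass `documents=documents_from_results(response['results'])` to `quotient_logger.log(...)` in place of the stringified dicts.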

### How It Works

When `.log()` is called:

1. **Data ingestion:** The query, model output, and all retrieved document contents are logged to Quotient.

2. **Async detection pipeline:** Quotient runs:
   - **Hallucination detection**, labeling the output as hallucinated or not.
   - **Document relevance scoring**, marking which retrieved documents helped ground the output.

3. **Result retrieval:** You can poll or fetch detections linked to your `log_id`.

4. **Monitor and troubleshoot in the Quotient app:** Access the [Quotient dashboard](https://app.quotientai.co) to:
   - Monitor your AI system over time.
   - Review flagged hallucinated sentences.
   - See which documents were irrelevant.
   - Compare across tags or environments for deeper insights.

For full implementation details, visit the Quotient [docs](https://docs.quotientai.co/).


## Step 4: Review detections in Quotient

You can now view your logs and detections in the [Quotient dashboard](https://app.quotientai.co), where you can also filter them by tags and environments to identify common failure patterns.

![Quotient AI Dashboard](Quotient_Dashboard.png "Quotient AI Dashboard")


## What You’ve Built

A lightweight search and monitoring pipeline that:
- Runs live AI search queries
- Automatically checks if answers are grounded in retrieved evidence
- Flags hallucinations and irrelevant retrievals

You can scale this to monitor production traffic, benchmark retrieval and search performance, or compare different models side by side.

## How to interpret the results
- Well-grounded systems typically show **< 5% hallucination rate**. If yours is higher, it’s often a signal that your data ingestion, retrieval pipeline, or prompting needs improvement.

- High-performing systems typically show **> 75% document relevance**. Lower scores may signal ambiguous user queries, incorrect retrieval, or noisy source data.
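
These rules of thumb can be turned into a quick automated check. The sketch below is one way to do it, assuming you have a list of per-log hallucination flags and a list of per-log relevance ratios. `summarize` is a hypothetical helper, its default thresholds are just the two figures quoted above, and the input values are made up.

```python
def summarize(hallucination_flags, relevance_ratios,
              max_hallucination_rate=0.05, min_relevance=0.75):
    """Compare aggregate detection results with the rule-of-thumb thresholds."""
    # Drop missing values so they do not distort the averages.
    flags = [f for f in hallucination_flags if f is not None]
    ratios = [r for r in relevance_ratios if r is not None]
    hallucination_rate = sum(flags) / len(flags) if flags else None
    avg_relevance = sum(ratios) / len(ratios) if ratios else None
    return {
        "hallucination_rate": hallucination_rate,
        "avg_relevance": avg_relevance,
        "hallucination_ok": hallucination_rate is not None
        and hallucination_rate < max_hallucination_rate,
        "relevance_ok": avg_relevance is not None and avg_relevance > min_relevance,
    }


# Made-up inputs: one flagged answer out of four, high document relevance.
report = summarize([False, False, True, False], [1.0, 0.8, 0.9, 0.9])
print(report)
```

With these inputs the relevance check passes and the hallucination check fails, which would point you toward prompting or grounding rather than retrieval.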


## (Optional) Grab the detection results from Quotient

Quotient's detections are now available to fetch via the Quotient SDK using the `log_id` you received earlier:

In [14]:
hallucination_detections = []
doc_relevancy_detections = []

for log_id in tqdm(log_ids):  # avoid shadowing the built-in `id`

    try:
        detection = quotient_logger.poll_for_detection(log_id=log_id)
        # Record whether the answer was flagged as containing a hallucination
        hallucination_detections.append(detection.has_hallucination)
        # Record the fraction of retrieved documents labeled relevant
        docs = detection.log_documents
        if docs:
            relevant = sum(1 for doc in docs if doc.get('is_relevant') is True)
            doc_relevancy_detections.append(relevant / len(docs))

    except Exception:
        # Skip logs whose detections are missing or could not be fetched
        continue

print(f"Number of results: {len(log_ids)}")
if hallucination_detections:
    print(f"Percentage of hallucinations: {sum(hallucination_detections)/len(hallucination_detections)*100:.2f}%")
if doc_relevancy_detections:
    print(f"Average percentage of relevant document: {sum(doc_relevancy_detections)/len(doc_relevancy_detections)*100:.2f}%")

100%|██████████| 10/10 [01:37<00:00,  9.71s/it]

Number of results: 10
Percentage of hallucinations: 25.00%
Average percentage of relevant document: 90.00%



