# Generate Custom Benchmark

This notebook walks through how to generate a custom benchmark based on your data.

We will be using Anthropic's claude-3-5-sonnet for generating queries and OpenAI's text-embedding-3-large for embedding, but these models can easily be switched out:
- Various embedding functions are provided in `functions/embed.py`
- LLM prompts are provided in `functions/llm.py`

## 1. Setup

### 1.1 Install & Import

Install the necessary packages.

In [None]:
!pip install -r requirements.txt

Import modules.

In [19]:
%load_ext autoreload
%autoreload 2

import chromadb
import pandas as pd
import numpy as np
import datasets
import json
import datetime
from openai import OpenAI as OpenAIClient
from anthropic import Anthropic as AnthropicClient
from functions.llm import *
from functions.embed import *
from functions.chroma import *
from functions.evaluate import *
from functions.visualize import *


### 1.2 Load API Keys

To use Chroma Cloud, sign up for a Chroma Cloud account [here](https://www.trychroma.com/) and create a new database. If you want to use local Chroma, skip this step and fill in only `COLLECTION_NAME`, `OPENAI_API_KEY`, and `ANTHROPIC_API_KEY`.

In [2]:
# Chroma Cloud
CHROMA_TENANT = "YOUR CHROMA TENANT ID"
X_CHROMA_TOKEN = "YOUR CHROMA API KEY"
DATABASE_NAME = "YOUR CHROMA DATABASE NAME"

# Chroma Collection
COLLECTION_NAME = "YOUR COLLECTION NAME"

# Embedding Model
OPENAI_API_KEY = "YOUR OPENAI API KEY"

# LLM
ANTHROPIC_API_KEY = "YOUR ANTHROPIC API KEY"

### 1.3 Set Clients

Initialize the clients.

In [None]:
chroma_client = chromadb.HttpClient(
  ssl=True,
  host='api.trychroma.com',
  tenant=CHROMA_TENANT,
  database=DATABASE_NAME,
  headers={
    'x-chroma-token': X_CHROMA_TOKEN
  }
)

# If you want to use the local Chroma instead, uncomment the following line:
# chroma_client = chromadb.Client()

openai_client = OpenAIClient(api_key=OPENAI_API_KEY)
anthropic_client = AnthropicClient(api_key=ANTHROPIC_API_KEY)

## 2. Create Chroma Collection

If you already have a Chroma Collection for your data, skip to **3. Filter Documents for Quality**.

### 2.1 Load in Your Data

We use pre-chunked [Chroma Docs](https://docs.trychroma.com/docs/overview/introduction) as an example, but replace this with your own data.

In [None]:
with open('data/chroma_docs.json', 'r') as f:
    corpus = json.load(f)

In [None]:
corpus_ids = list(corpus.keys())
corpus_documents = [corpus[key] for key in corpus_ids]
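
If your data is not chunked yet, it needs to end up in the same shape as the `corpus` loaded above: a dictionary that maps chunk ids to chunk text. The following is a minimal sketch of one way to get there, assuming a naive fixed-size character chunker with overlap. `chunk_text`, `build_corpus`, and the `{doc_id}-{i}` id scheme are hypothetical helpers for illustration and are not part of this repo.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def build_corpus(raw_documents: Dict[str, str], chunk_size: int = 500, overlap: int = 50) -> Dict[str, str]:
    """Return {chunk_id: chunk_text}, with chunk ids derived from the document id (assumed scheme)."""
    corpus = {}
    for doc_id, text in raw_documents.items():
        for i, chunk in enumerate(chunk_text(text, chunk_size, overlap)):
            corpus[f"{doc_id}-{i}"] = chunk
    return corpus


# Tiny example with a 10-character "document"
example_corpus = build_corpus({"doc": "abcdefghij"}, chunk_size=4, overlap=1)
```

Character-based chunking is only a starting point; splitting on headings, paragraphs, or tokens may suit your data better.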

### 2.2 Embed Data & Add to Chroma Collection

Embed your documents using an embedding model of your choice. We use OpenAI's text-embedding-3-large here, but other embedding functions are available in `functions/embed.py`. You may also define your own embedding function.

We use batching and multi-threading for efficiency.

In [None]:
corpus_embeddings = openai_embed_in_batches(
    openai_client=openai_client,
    texts=corpus_documents,
    model="text-embedding-3-large",
)

corpus_collection = chroma_client.get_or_create_collection(
    name=COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)

collection_add_in_batches(
    collection=corpus_collection,
    ids=corpus_ids,
    texts=corpus_documents,
    embeddings=corpus_embeddings,
)
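
The internals of `openai_embed_in_batches` are not shown in this notebook. As a rough sketch of the batching and multi-threading pattern mentioned above, the example below splits the texts into batches and embeds them concurrently with a thread pool. `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake embedder stands in for a real API call so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
    max_workers: int = 4,
) -> List[List[float]]:
    """Embed texts batch by batch on a thread pool, preserving input order."""
    batches = [list(texts[i:i + batch_size]) for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the batches were submitted
        batch_results = list(pool.map(embed_batch, batches))
    # Flatten the per-batch results back into one list of embeddings
    return [embedding for batch in batch_results for embedding in batch]


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding call: one-dimensional 'embedding' = text length."""
    return [[float(len(text))] for text in batch]


example_embeddings = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
```

Threads fit this workload because each batch spends most of its time waiting on the network rather than on computation.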

## 3. Filter Documents for Quality

We begin by filtering our documents prior to query generation. This step ensures that we avoid generating queries from irrelevant or incomplete documents.

### 3.1 Set Criteria

We use the following criteria:
- `relevance` checks whether the document is relevant to the specified context
- `completeness` checks for overall quality of the document

You can modify the criteria as you see fit.

Fill in `context` according to your use case.

In [None]:
context = "FILL IN WITH CONTEXT RELEVANT TO YOUR USE CASE"

In [None]:
relevance = f"The document is relevant to the following context: {context}"
completeness = "The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction to the main content that users may be looking for."

criteria = [relevance, completeness]
criteria_labels = ["relevance", "completeness"]

### 3.2 Get Documents

Get your Chroma collection and load its documents. The criteria defined above are applied in the next step.

In [None]:
corpus_collection = chroma_client.get_collection(
    name=COLLECTION_NAME
)

corpus = get_collection_items(
    collection=corpus_collection
)

corpus_ids = [key for key in corpus.keys()]
corpus_documents = [corpus[key]['document'] for key in corpus_ids]

### 3.3 Filter Documents

We create a batch request for our LLM calls (this is cheaper and typically faster).

In [None]:
filtered_documents_batch_id = create_document_filter_batch(
    client=anthropic_client,
    documents=corpus_documents,
    ids=corpus_ids,
    criteria=criteria,
    criteria_labels=criteria_labels
)

You can check the status of your batch through the [Anthropic Console](https://console.anthropic.com/workspaces/default/batches).

Retrieve the batch once it is finished.

In [None]:
filtered_documents_batch = retrieve_document_filter_batch_df(
    client=anthropic_client,
    batch_id=filtered_documents_batch_id
)

passed_document_ids = get_filtered_ids(
    filtered_documents_batch_df=filtered_documents_batch
)

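passed_documents = [corpus[doc_id]['document'] for doc_id in passed_document_ids]

# Use a set for fast membership checks when collecting the failed ids
passed_id_set = set(passed_document_ids)
failed_document_ids = [doc_id for doc_id in corpus_ids if doc_id not in passed_id_set]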
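
The exact layout of the dataframe returned by `retrieve_document_filter_batch_df` is not shown here. As an illustration of the pass/fail logic only, the sketch below assumes one row per document with an `id` column and one boolean column per criterion label, and keeps the documents that satisfy every criterion. `ids_passing_all` is a hypothetical helper, and the example rows are made up.

```python
import pandas as pd


def ids_passing_all(df: pd.DataFrame, labels: list) -> list:
    """Return the ids of rows where every criterion column is True."""
    mask = df[labels].all(axis=1)
    return df.loc[mask, "id"].tolist()


# Made-up example rows; the column names are assumptions
example_df = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "relevance": [True, True, False],
        "completeness": [True, False, True],
    }
)
example_passed = ids_passing_all(example_df, ["relevance", "completeness"])
```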

### 3.4 View Results

In [None]:
print(f"Number of documents passed: {len(passed_document_ids)}")
print(f"Number of documents failed: {len(failed_document_ids)}")
print("-"*80)
print("Example of passed document:")
print(corpus[passed_document_ids[0]]['document'])
print("-"*80)
print("Example of failed document:")
print(corpus[failed_document_ids[0]]['document'])
print("-"*80)

## 4. Generate Golden Dataset

Using our filtered documents, we can generate a golden dataset of queries.

### 4.1 Create Custom Prompt

We will use the `context` (from the prior section) and `example_queries` for query generation.

Fill in `example_queries` with examples of what users may ask. These examples help indicate what kind of topics users typically focus on, as well as the style of query that should be generated.

In [None]:
example_queries = "FILL IN WITH EXAMPLE QUERIES"

### 4.2 Generate Queries

Send a batch request for generation.

In [None]:
golden_dataset_batch_id = create_golden_dataset_batch(
    client=anthropic_client,
    model="claude-3-5-sonnet-20241022",
    documents=passed_documents,
    ids=passed_document_ids,
    context=context,
    example_queries=example_queries
)

Retrieve the batch once it has finished.

In [None]:
golden_dataset = retrieve_batch(
    client=anthropic_client,
    batch_id=golden_dataset_batch_id
)

golden_dataset.head()

## 5. Evaluate

Now that we have our golden dataset, we can run our evaluation.

### 5.1 Prepare Inputs

In [None]:
queries = golden_dataset['query'].tolist()
ids = golden_dataset['id'].tolist()

Embed generated queries.

In [None]:
query_embeddings = openai_embed_in_batches(
    openai_client=openai_client,
    texts=queries,
    model="text-embedding-3-large"
)

# Map each query id to its text and embedding
query_embeddings_lookup = {
    query_id: {
        "text": query,
        "embedding": embedding
    }
    for query_id, query, embedding in zip(ids, queries, query_embeddings)
}

Create our qrels (query relevance labels) dataframe. In this case, each query and its corresponding document share the same id.

In [None]:
qrels = pd.DataFrame(
    {
        "query-id": ids,
        "corpus-id": ids,
        "score": 1
    }
)

### 5.2 Run Benchmark

In [None]:
results = run_benchmark(
    query_embeddings_lookup=query_embeddings_lookup,
    collection=corpus_collection,
    qrels=qrels
)
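
Which metrics `run_benchmark` reports is not shown in this notebook. To illustrate how qrels like the ones above can be scored, here is a minimal sketch of Recall@k for the case where each query has exactly one relevant document. `recall_at_k` and the toy ids are hypothetical and only meant to show the idea.

```python
from typing import Dict, List


def recall_at_k(
    ranked_ids_by_query: Dict[str, List[str]],
    relevant_id_by_query: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose single relevant document appears in the top k results."""
    hits = sum(
        1
        for query_id, relevant_id in relevant_id_by_query.items()
        if relevant_id in ranked_ids_by_query.get(query_id, [])[:k]
    )
    return hits / len(relevant_id_by_query)


# Toy retrieval results: ranked document ids per query
toy_ranked = {"q1": ["d1", "d2"], "q2": ["d3", "d2"]}
# Toy qrels: the one relevant document per query
toy_relevant = {"q1": "d1", "q2": "d2"}

toy_recall_at_1 = recall_at_k(toy_ranked, toy_relevant, k=1)
toy_recall_at_2 = recall_at_k(toy_ranked, toy_relevant, k=2)
```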

Save results.

This is helpful for comparison (e.g. comparing different embedding models or chunking strategies).

In [None]:
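# `datetime.now()` needs the datetime class, not just the datetime module
from datetime import datetime

results_to_save = {
    "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    "model": "text-embedding-3-large",
    "results": results
}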

In [None]:
with open('results/results_v1.json', 'w') as f:
    json.dump(results_to_save, f)
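
Writing to `results/` fails if that directory does not exist yet. The sketch below shows one way to create the directory on save and to read a run back for later comparison. `save_results` and `load_results` are hypothetical helpers, and the demo run holds a placeholder value rather than real benchmark output.

```python
import json
import tempfile
from pathlib import Path


def save_results(results: dict, path) -> None:
    """Write results as JSON, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(results, f, indent=2)


def load_results(path) -> dict:
    """Read a previously saved results file."""
    with Path(path).open() as f:
        return json.load(f)


# Round trip in a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "results" / "results_demo.json"
    demo_run = {"model": "text-embedding-3-large", "results": {"placeholder_metric": 0.0}}
    save_results(demo_run, demo_path)
    reloaded_run = load_results(demo_path)
```

Loading two saved runs this way lets you compare their `results` side by side, for example across embedding models.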