# Query Generation
Generative benchmarking allows for a more tailored and representative approach to evaluation. First, we filter documents with an aligned LLM judge, using context supplied by the user, to identify those that are relevant to the specified use case and contain enough information to generate queries from. Next, we generate queries, using the given context and example queries to steer generation.

## 1. Setup

### 1.1 Install & Import
Install the necessary packages and import modules.

In [None]:
%pip install -r requirements.txt

In [None]:
%load_ext autoreload
%autoreload 2

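# Swap the standard-library sqlite3 module for pysqlite3 before importing chromadb;
# Chroma needs a newer SQLite than some environments ship with.
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')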
import chromadb
import pandas as pd
import numpy as np
import json
import os
from pathlib import Path
from datetime import datetime
from openai import OpenAI as OpenAIClient
from functions.llm import *
from functions.embed import *
from functions.chroma import *
from functions.evaluate import *
from functions.visualize import *

### 1.2 Load Data
Load our curated data of cafe reviews.

In [None]:
# Each line of cafes.json is a standalone JSON object (JSON Lines).
with open('data/cafes.json', 'r') as f:
    records = [json.loads(line) for line in f]
df_cafes = pd.DataFrame(records)
df_cafes.head()

In [None]:
ids = df_cafes['id'].tolist()
documents = df_cafes['text'].tolist()
metadatas = df_cafes[['name', 'address']].to_dict(orient='records')

### 1.3 Set Clients
Initialize clients for OpenAI and Chroma.

In [None]:
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
openai_client = OpenAIClient(api_key=OPENAI_API_KEY)

chroma_client = chromadb.Client()

## 2. Create Chroma Collection

### 2.1 Embed Documents
Generate embeddings for reviews using OpenAI's text-embedding-3-large model.

In [None]:
embeddings = openai_embed_in_batches(
    openai_client=openai_client,
    texts=documents,
    model="text-embedding-3-large",
)
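
The implementation of `openai_embed_in_batches` lives in `functions/embed.py` and is not shown here. As a rough sketch of the batching idea only, assuming a hypothetical `embed_fn` callable that maps a batch of texts to a list of vectors and an illustrative default batch size, it could look like this:

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    texts: Sequence[str],
    batch_size: int = 100,  # illustrative default, not taken from the notebook
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors than texts")
        embeddings.extend(vectors)
    return embeddings


# Hypothetical wiring to the OpenAI client created in section 1.3:
#   embed_fn = lambda batch: [
#       item.embedding
#       for item in openai_client.embeddings.create(
#           model="text-embedding-3-large", input=list(batch)
#       ).data
#   ]
```

Passing the embedding call in as a function keeps the batching logic easy to check without network access.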

### 2.2 Create Collection & Add Documents
Store reviews and metadata in Chroma's vector database (in batches).

In [None]:
COLLECTION_NAME = "cafes-openai-large"

In [None]:
collection = chroma_client.get_or_create_collection(
    name=COLLECTION_NAME,
    metadata={"hnsw:space": "cosine"}
)

collection_add_in_batches(
    collection=collection,
    ids=ids,
    texts=documents,
    metadatas=metadatas,
    embeddings=embeddings,
)
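
`collection_add_in_batches` comes from `functions/chroma.py`, which is not reproduced in this notebook. A minimal sketch of what such a helper might do, assuming it slices the parallel lists and calls the collection's `add` method once per slice (the name `add_in_batches` and the default batch size are hypothetical):

```python
from typing import Any, Dict, Sequence


def add_in_batches(
    collection: Any,
    ids: Sequence[str],
    texts: Sequence[str],
    metadatas: Sequence[Dict[str, Any]],
    embeddings: Sequence[Sequence[float]],
    batch_size: int = 500,  # illustrative default, not taken from the notebook
) -> int:
    """Add parallel lists to a collection slice by slice; return the number of batches sent."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    n = len(ids)
    if not (len(texts) == len(metadatas) == len(embeddings) == n):
        raise ValueError("ids, texts, metadatas and embeddings must have equal length")

    batches = 0
    for start in range(0, n, batch_size):
        end = start + batch_size
        # Same keyword arguments as a single Chroma `collection.add` call.
        collection.add(
            ids=list(ids[start:end]),
            documents=list(texts[start:end]),
            metadatas=list(metadatas[start:end]),
            embeddings=list(embeddings[start:end]),
        )
        batches += 1
    return batches
```

Checking the list lengths up front avoids a partially populated collection when the inputs are misaligned.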

In [None]:
collection = chroma_client.get_collection(name=COLLECTION_NAME)
corpus = get_collection_items(collection=collection)
corpus_ids = list(corpus.keys())
corpus_documents = [corpus[key]['document'] for key in corpus_ids]
corpus_metadatas = [corpus[key]['metadata'] for key in corpus_ids]
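
The cell above relies on `get_collection_items` returning a dictionary keyed by document id, with `'document'` and `'metadata'` entries. The helper itself is not shown; a minimal sketch that would produce that shape, assuming it wraps a single Chroma `collection.get` call (the name `collection_to_dict` is hypothetical), might be:

```python
from typing import Any, Dict


def collection_to_dict(collection: Any) -> Dict[str, Dict[str, Any]]:
    """Return {id: {'document': ..., 'metadata': ...}} for every item in the collection."""
    # Calling `get` without ids returns all stored items as parallel lists.
    result = collection.get(include=["documents", "metadatas"])
    return {
        item_id: {"document": document, "metadata": metadata}
        for item_id, document, metadata in zip(
            result["ids"], result["documents"], result["metadatas"]
        )
    }
```

The real helper may also return embeddings or page through large collections; this sketch does neither.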

## 3. Filter Documents for Quality

### 3.1 Set Criteria
Define the criteria a review must meet to be a useful source for query generation.

In [None]:
relevance = "The review is relevant if it provides meaningful information about the cafe, including aspects like atmosphere, food, drinks, service quality, accessibility, decor, or overall experience."
specificity = "The review is specific if it provides detailed and precise information about the cafe, such as particular menu items, exact features, or clear descriptions of the environment, rather than vague or general statements."
positivity = "The review reflects a generally positive sentiment about the cafe, indicating a favorable experience."
criteria = [relevance, specificity, positivity]
criteria_labels = ["relevance", "specificity", "positivity"]
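
`criteria` and `criteria_labels` are parallel lists, so they need to stay the same length and in the same order. The sketch below shows one way to pair them and to turn a single criterion into a yes/no question for a judge model. Both function names and the prompt wording are hypothetical and are not the prompts used by `functions/llm.py`:

```python
from typing import Dict, Sequence


def pair_criteria(labels: Sequence[str], criteria: Sequence[str]) -> Dict[str, str]:
    """Map each label to its criterion text, rejecting mismatched or duplicate labels."""
    if len(labels) != len(criteria):
        raise ValueError("criteria and criteria_labels must have the same length")
    if len(set(labels)) != len(labels):
        raise ValueError("criteria_labels must be unique")
    return dict(zip(labels, criteria))


def build_judge_prompt(label: str, criterion: str, document: str) -> str:
    """Assemble a yes/no question about one criterion for one review (illustrative wording)."""
    return (
        f"Criterion ({label}): {criterion}\n\n"
        f"Review:\n{document}\n\n"
        "Does the review meet the criterion? Answer with exactly 'yes' or 'no'."
    )
```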

### 3.2 Filter Documents
Filter reviews before query generation so that queries are not generated from reviews that fail the criteria above.

In [None]:
filtered_document_ids = filter_documents(
    client=openai_client,
    model="gpt-4o-mini",
    documents=corpus_documents,
    ids=corpus_ids,
    criteria=criteria,
    criteria_labels=criteria_labels
)
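
How `filter_documents` combines the per-criterion verdicts is not shown in this notebook. The sketch below assumes the strictest rule, that a document passes only if the judge accepts it on every criterion, and takes the judge as a hypothetical callable `judge(document, criterion) -> bool` so the logic can be followed without an API call:

```python
from typing import Callable, Dict, List, Sequence


def judge_documents(
    judge: Callable[[str, str], bool],
    documents: Sequence[str],
    ids: Sequence[str],
    criteria: Sequence[str],
    criteria_labels: Sequence[str],
) -> Dict[str, Dict[str, bool]]:
    """Return {id: {label: verdict}} with one verdict per criterion."""
    if len(documents) != len(ids):
        raise ValueError("documents and ids must have the same length")
    if len(criteria) != len(criteria_labels):
        raise ValueError("criteria and criteria_labels must have the same length")

    verdicts: Dict[str, Dict[str, bool]] = {}
    for doc_id, document in zip(ids, documents):
        verdicts[doc_id] = {
            label: bool(judge(document, criterion))
            for label, criterion in zip(criteria_labels, criteria)
        }
    return verdicts


def passed_ids(verdicts: Dict[str, Dict[str, bool]]) -> List[str]:
    """Keep only the ids whose document met every criterion (assumed pass rule)."""
    return [doc_id for doc_id, result in verdicts.items() if all(result.values())]
```

Keeping the per-label verdicts around, rather than only the final list of ids, makes it easier to see which criterion rejects the most reviews.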

In [None]:
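passed_documents = [corpus[doc_id]['document'] for doc_id in filtered_document_ids]

# Use a set for fast membership checks and avoid shadowing the built-in `id`.
passed_id_set = set(filtered_document_ids)
failed_document_ids = [doc_id for doc_id in corpus_ids if doc_id not in passed_id_set]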

In [None]:
print(f"Number of documents passed: {len(filtered_document_ids)}")
print(f"Number of documents failed: {len(failed_document_ids)}")
print("-"*80)
print("Example of passed document:")
print(corpus[filtered_document_ids[0]]['document'])
print("-"*80)
print("Example of failed document:")
print(corpus[failed_document_ids[0]]['document'])
print("-"*80)

## 4. Generate Golden Dataset

### 4.1 Create Custom Prompt & Generate Queries

Provide context and example queries to generate a golden dataset of queries using OpenAI's gpt-4o.

In [None]:
context = "This is a search assistant for Corner, a review platform where users discover local cafes."
example_queries = """
    quiet cafe for studying
    romantic first date spot
    best iced matcha latte
    fresh fruit pastries
    natural light with plants
    vegan-friendly with oat milk
    trendy cafe with artsy vibes
    perfect espresso shot
    brunch with large portions
    """

golden_dataset = create_golden_dataset(
    client=openai_client,
    model="gpt-4o",
    documents=passed_documents,
    ids=filtered_document_ids,
    context=context,
    example_queries=example_queries
)

golden_dataset.head()
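
`create_golden_dataset` is defined in the `functions` package and its prompt is not shown here. As an illustration of how the context and example queries might steer generation, the sketch below builds one prompt per document and collects one query per id. The prompt wording, the function names, the `generate` callable, and the `id`/`query` field names are all assumptions, and the sketch returns plain records rather than the DataFrame used in the notebook:

```python
from typing import Callable, Dict, List, Sequence


def build_query_prompt(context: str, example_queries: str, document: str) -> str:
    """Combine use-case context, example queries and one document into a prompt."""
    # The examples arrive as an indented multi-line string; keep the non-empty lines.
    examples = [line.strip() for line in example_queries.splitlines() if line.strip()]
    bullet_list = "\n".join(f"- {query}" for query in examples)
    return (
        f"{context}\n\n"
        "Write one short search query that the document below would answer.\n"
        "Match the style of these example queries:\n"
        f"{bullet_list}\n\n"
        f"Document:\n{document}"
    )


def generate_queries(
    generate: Callable[[str], str],
    documents: Sequence[str],
    ids: Sequence[str],
    context: str,
    example_queries: str,
) -> List[Dict[str, str]]:
    """Return one {'id', 'query'} record per document, in input order."""
    if len(documents) != len(ids):
        raise ValueError("documents and ids must have the same length")
    return [
        {
            "id": doc_id,
            "query": generate(build_query_prompt(context, example_queries, document)).strip(),
        }
        for doc_id, document in zip(ids, documents)
    ]
```

Each record keeps the id of the document it was generated from, which is what lets the dataset later serve as ground truth for retrieval.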

In [None]:
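# Create the output directory first; to_json does not create missing folders.
Path('queries').mkdir(parents=True, exist_ok=True)
golden_dataset.to_json('queries/golden_dataset.json', orient='records', lines=True)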
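
With `orient='records'` and `lines=True`, the file is written as JSON Lines: one JSON object per line. A small sketch for reading it back with the standard library follows. The helper name `load_jsonl` is hypothetical and makes no assumption about the column names:

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file into a list of dictionaries, skipping blank lines."""
    records: List[Dict[str, Any]] = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Example, assuming the file written above exists:
#   golden_records = load_jsonl("queries/golden_dataset.json")
```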