## Retrieval-Augmented Generation (RAG) Evaluation for Recipe Assistant

This notebook evaluates the RAG system described in `rag-flow.ipynb`.  
It covers ground-truth generation, retrieval metrics (Hit Rate, MRR), parameter optimization, and LLM-as-judge answer quality for both Minsearch and Elasticsearch, using the best-performing combination of retrieval and reranking for each backend.

---

### 1. Setup and Imports
Import dependencies, load the OpenAI API key, and connect to the OpenAI API.

In [None]:
import pandas as pd
import numpy as np
import json
import random
import re
from tqdm.auto import tqdm
import dotenv
import minsearch
from elasticsearch import Elasticsearch
from openai import OpenAI

In [None]:
dotenv.load_dotenv("../.env")
client = OpenAI()

### 2. Data Loading and Preprocessing
Load the recipe dataset from a CSV file and prepare it for indexing.

In [None]:
df = pd.read_csv('../data/recipes_clean.csv')

# Add an ID column if it doesn't exist
if 'id' not in df.columns:
    df['id'] = range(len(df))
    
# Create documents for indexing
documents = df.to_dict(orient='records')
documents[0]

### 3. Ground Truth Generation
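We generate five ground-truth user questions for each of the first 100 recipes (`documents[:100]`) using the LLM.  
Because each question is derived from a known recipe, that recipe's `id` serves as the expected retrieval result when evaluating retrieval quality.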

In [None]:
prompt_template = """
You emulate a user of our recipe assistant application.
Formulate 5 questions this user might ask based on a provided recipe.
Make the questions specific to ingredients, cooking methods, 
cooking duration (prep/cook time), or dietary information in this recipe.
Do NOT mention the recipe name in the question.
The record should contain the answer to the questions, 
and the questions should be complete and not too short.
Use as few words as possible from the record.

The record:

Recipe: {recipe_name}
Cuisine: {cuisine_type}
Main Ingredients: {main_ingredients}
Instructions: {instructions}
Dietary Info: {dietary_restrictions}

Provide the output in parsable JSON without using code blocks:

{{"questions": ["question1", "question2", ..., "question5"]}}
""".strip()

def generate_questions(doc):
    prompt = prompt_template.format(**doc)
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

results = {}
for i, doc in enumerate(tqdm(documents[:100])):
    doc_id = doc.get('id', i)
    if doc_id in results:
        continue
    try:
        questions_raw = generate_questions(doc)
        questions = json.loads(questions_raw)
        results[doc_id] = questions['questions']
    except (json.JSONDecodeError, KeyError):
        continue

final_results = []
for doc_id, questions in results.items():
    for q in questions:
        final_results.append((doc_id, q))

df_results = pd.DataFrame(final_results, columns=['id', 'question'])
df_results.to_csv('../data/ground-truth-retrieval.csv', index=False)
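The prompt asks for JSON without code blocks, but a model may still wrap its reply in a Markdown fence, and the loop above silently skips any reply that fails to parse. The following is a minimal sketch of a more tolerant parser. The helper name `parse_questions` and the `FENCE` constant are hypothetical and not part of the original notebook.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker (hypothetical constant)


def parse_questions(raw):
    """Hypothetical helper: extract the 'questions' list from an LLM reply.

    Returns None when the reply cannot be parsed into a list of questions.
    """
    text = raw.strip()
    # Drop a surrounding Markdown fence and an optional "json" language tag
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)].strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    questions = data.get("questions") if isinstance(data, dict) else None
    if not isinstance(questions, list):
        return None
    # Keep only non-empty string entries
    return [q for q in questions if isinstance(q, str) and q.strip()]
```

In the generation loop, a `None` result could be logged together with `doc_id`, which makes it possible to see how many recipes ended up without questions.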

### 4. Load Ground Truth Questions

Load the generated ground-truth questions for evaluation.

In [None]:
df_gt = pd.read_csv('../data/ground-truth-retrieval.csv')
ground_truth = df_gt.to_dict(orient='records')

### 5. Utility Functions

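#### 5.1. Create Time Filter
Users are asked how much time they have available. This filter keeps only recipes whose combined prep and cook time fits within that limit; recipes whose times cannot be parsed are excluded when a limit is set.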

In [None]:
def filter_by_max_time(results, max_time=None):
    if max_time is None:
        return results
    filtered = []
    for doc in results:
        try:
            total_time = int(doc.get('prep_time_minutes', 0)) + int(doc.get('cook_time_minutes', 0))
        except Exception:
            total_time = 99999
        if total_time <= max_time:
            filtered.append(doc)
    return filtered

#### 5.2. Print Unused Ingredients

Utility function that, given the query and the results, prints which query ingredients were not covered by the selected recipes.

In [None]:
# tokenize_ingredients is used by print_unused_ingredients below
def tokenize_ingredients(ingredient_str):
    # Split only on commas, strip whitespace, and lowercase
    return {ing.strip().lower() for ing in ingredient_str.split(',') if ing.strip()}

In [None]:
def print_unused_ingredients(ingredients, results):
    query_ings = tokenize_ingredients(ingredients)
    used = set()
    for doc in results:
        recipe_ings = tokenize_ingredients(doc.get('all_ingredients', ''))
        used |= recipe_ings
    unused = query_ings - used
    print("Unused ingredients:", ", ".join(unused) if unused else "All ingredients used!")

#### 5.3. Deduplicate Search Results
Elasticsearch can sometimes return duplicate or near-duplicate documents in the RAG pipeline.

In [None]:
def deduplicate_results(results):
    seen = set()
    deduped = []
    for doc in results:
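        # Prefer the unique 'id'; fall back to a tuple of fields when it is missing.
        # Compare against None explicitly so that id 0 is still treated as a valid key.
        doc_id = doc.get('id')
        key = doc_id if doc_id is not None else (doc.get('recipe_name'), doc.get('prep_time_minutes'), doc.get('cook_time_minutes'))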
        if key not in seen:
            seen.add(key)
            deduped.append(doc)
    return deduped

### 6. Search Backend Setup
Set up the Minsearch and Elasticsearch retrieval backends as in the main RAG notebook.

#### 6.1. Minsearch Index

In [None]:
# Setup minsearch index for recipes
index = minsearch.Index(
    text_fields=['recipe_name', 'main_ingredients', 'all_ingredients',
                 'instructions', 'cuisine_type', 'dietary_restrictions'],
    keyword_fields=['meal_type', 'difficulty_level']
)

# Fit/train the index on the recipe documents
index.fit(documents)
index.documents = index.docs

#### 6.2. Elasticsearch Index

Start Elasticsearch with Docker from a terminal:

`docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

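Or, with security disabled and the JVM heap capped. Elasticsearch 8 enables security by default, so this form is the one that matches the plain `http://localhost:9200` connection used below: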

`docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

Then check that Elasticsearch is up:

`curl http://localhost:9200`

In [None]:
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

# Create Elasticsearch Client (make sure Elasticsearch is running locally)
es_client = Elasticsearch('http://localhost:9200')

In [None]:
# Define Elasticsearch index settings and mappings for recipes
index_settings = {
    "settings": {
        "number_of_shards": 1,  # A shard is a unit of storage and search; more shards can improve parallelism for large datasets
        "number_of_replicas": 0  # Replicas are shard copies for fault tolerance and search throughput; 0 is fine locally but not recommended for production
    },
    "mappings": {
        "properties": {
            "recipe_name": {"type": "text"},
            "main_ingredients": {"type": "text"},
            "all_ingredients": {"type": "text"},
            "instructions": {"type": "text"},
            "cuisine_type": {"type": "text"},
            "dietary_restrictions": {"type": "text"},
            "meal_type": {"type": "keyword"},
            "difficulty_level": {"type": "keyword"},
            "prep_time_minutes": {"type": "integer"},
            "cook_time_minutes": {"type": "integer"},
            "all_ingredients_vector": {
                "type": "dense_vector",
                "dims": 1536  # text-embedding-3-small returns 1536-dimensional vectors
            }
        }
    }
}

index_name = "recipes"

# Create the index (ignore error if it already exists)
try:
    es_client.indices.create(index=index_name, body=index_settings)
except Exception as e:
    print("Index may already exist:", e)

In [None]:
def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[text]
    )
    return np.array(response.data[0].embedding)

emb = get_embedding("test")
print(emb.shape)  # Should print (1536,)

(1536,)


In [None]:
# Index documents into Elasticsearch, including the embedding vector
for doc in tqdm(documents):
    doc['all_ingredients_vector'] = get_embedding(doc['all_ingredients']).tolist()
    es_client.index(index=index_name, document=doc)


### 7. Best Retrieval Strategies

#### 7.1. Best Minsearch Retrieval and Rerank Combination
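The evaluation cells later in this notebook call `minsearch_search(query, boost=..., num_results=...)`, but its definition is not shown here. Below is a minimal sketch of such a wrapper, assuming it delegates to Minsearch's `Index.search` with `boost_dict` and `num_results`. It covers plain boosted retrieval only, not whichever reranking combination the main notebook settled on, and the `search_index` parameter is a hypothetical addition that makes the function testable without the notebook-level `index`.

```python
def minsearch_search(query, boost=None, num_results=10, search_index=None):
    """Sketch of a boosted keyword search; the main notebook's version may differ."""
    # Fall back to the notebook-level `index` built in the Minsearch Index cell
    idx = search_index if search_index is not None else index
    return idx.search(
        query=query,
        filter_dict={},          # no keyword filters in this sketch
        boost_dict=boost or {},  # per-field weights, e.g. {"main_ingredients": 2.0}
        num_results=num_results,
    )
```

Passing `boost=None` leaves all text fields equally weighted, which gives a baseline to compare the optimized boosts against.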

#### 7.2. Best Elasticsearch Retrieval and Rerank Combination
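The evaluation cells also call `elasticsearch_search(query, num_results=...)` without a definition in this notebook. The sketch below assumes a plain `multi_match` query over the text fields of the `recipes` mapping and the keyword-argument form of `search` from the Elasticsearch 8 Python client. The helper `build_es_query`, the field list, and the `es` parameter are illustrative assumptions; the main notebook's best-performing variant may add vector search or reranking.

```python
# Text fields from the recipes mapping defined earlier in this notebook
ES_TEXT_FIELDS = [
    "recipe_name", "main_ingredients", "all_ingredients",
    "instructions", "cuisine_type", "dietary_restrictions",
]


def build_es_query(query, fields=None):
    """Hypothetical helper: build a multi_match query over the recipe text fields."""
    return {
        "multi_match": {
            "query": query,
            "fields": fields or ES_TEXT_FIELDS,
            "type": "best_fields",
        }
    }


def elasticsearch_search(query, num_results=10, es=None, index_name="recipes"):
    """Sketch of a keyword search; returns the stored documents of the top hits."""
    # Fall back to the notebook-level `es_client` when no client is passed in
    es = es if es is not None else es_client
    response = es.search(index=index_name, query=build_es_query(query), size=num_results)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

Because documents are indexed without an explicit ID, re-running the indexing cell can store the same recipe twice, so passing the output through `deduplicate_results` before evaluation may be worthwhile.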

### 8. LLM-based Enhancements

#### 8.1. Rerank with LLM

In [None]:
def rerank_with_llm(query, candidates):
    context = "\n\n".join([f"Recipe: {doc['recipe_name']}\nIngredients: {doc['main_ingredients']}" for doc in candidates])
    prompt = f"""
Given the following user query and candidate recipes, rank the recipes from most to least relevant.

Query: {query}

Candidates:
{context}

Return a JSON list of recipe names in ranked order.
""".strip()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
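    content = response.choices[0].message.content
    # Extract the first JSON array from the reply; the model may wrap it in prose
    json_match = re.search(r'\[.*\]', content, re.DOTALL)
    if not json_match:
        return candidates
    try:
        ranked_names = json.loads(json_match.group())
    except json.JSONDecodeError:
        # Fall back to the original order when the array is not valid JSON
        return candidates
    # Map the ranked names back to documents; names that match no candidate are skipped
    ranked_docs = [doc for name in ranked_names for doc in candidates if doc['recipe_name'] == name]
    return ranked_docs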

#### 8.2. Query Rewriting

In [None]:
def rewrite_query(query):
    prompt = f"Rewrite this user query for a recipe search system to be more specific and clear: '{query}'"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

### 9. Prompt Strategies

In [None]:
def get_prompt_strategy_1():
    return """
You are a chef assistant. Based on the available recipes, recommend dishes that use the requested ingredients.
Provide the recipe name, brief description, and cooking instructions and time.

CONTEXT:
{context}

QUESTION: {question}

Answer:
""".strip()

def get_prompt_strategy_2():
    return """
You are an expert chef specializing in ingredient substitutions. When users provide ingredients,
recommend recipes and suggest alternatives for missing ingredients. Always explain possible substitutions
and how they might affect the dish.

CONTEXT:
{context}

QUESTION: {question}

Provide recommendations with substitution suggestions:
""".strip()

def get_prompt_strategy_3():
    return """
You are a nutritionist and chef. Recommend recipes based on ingredients provided, considering
nutritional value and dietary restrictions. Highlight health benefits and suggest modifications
for different dietary needs (vegetarian, gluten-free, etc.).

CONTEXT:
{context}

QUESTION: {question}

Provide nutritionally-aware recommendations:
""".strip()


### 10. Retrieval Evaluation: Hit Rate and MRR

In [None]:
def hit_rate(relevance_total):
    cnt = 0
    for line in relevance_total:
        if True in line:
            cnt += 1
    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0
    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score += 1 / (rank + 1)
                break
    return total_score / len(relevance_total)

def evaluate_retrieval(ground_truth, search_function):
    relevance_total = []
    for q in tqdm(ground_truth):
        doc_id = str(q['id'])
        results = search_function(q)
        relevance = [str(d.get('id')) == doc_id for d in results]
        relevance_total.append(relevance)
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

print("Evaluating Minsearch retrieval...")
metrics_minsearch = evaluate_retrieval(ground_truth, lambda q: minsearch_search(q['question'], num_results=10))
print("Minsearch:", metrics_minsearch)

print("Evaluating Elasticsearch retrieval...")
metrics_es = evaluate_retrieval(ground_truth, lambda q: elasticsearch_search(q['question'], num_results=10))
print("Elasticsearch:", metrics_es)
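Each entry of `relevance_total` holds one boolean per ranked result, marking whether that result is the recipe the question was generated from. Hit Rate is the fraction of questions with at least one `True`, and MRR averages the reciprocal rank of the first `True` (counting 0 when there is none). A small worked example on made-up relevance lists, using a hypothetical helper `first_hit_rank`:

```python
# Three toy queries, three ranked results each; True marks the ground-truth recipe
relevance_total = [
    [False, True, False],   # found at rank 2 -> reciprocal rank 1/2
    [True, False, False],   # found at rank 1 -> reciprocal rank 1
    [False, False, False],  # not found       -> contributes 0
]


def first_hit_rank(line):
    """1-based rank of the first relevant result, or None if there is none."""
    for rank, is_relevant in enumerate(line, start=1):
        if is_relevant:
            return rank
    return None


ranks = [first_hit_rank(line) for line in relevance_total]
toy_hit_rate = sum(r is not None for r in ranks) / len(ranks)
toy_mrr = sum(1 / r for r in ranks if r is not None) / len(ranks)

print("ranks:", ranks)            # [2, 1, None]
print("hit rate:", toy_hit_rate)  # 2 of 3 queries have a hit
print("MRR:", toy_mrr)            # (1/2 + 1 + 0) / 3 = 0.5
```

Hit Rate ignores where the hit appears within the result list, while MRR rewards placing it near the top, which is why the two can move independently when boosts change.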


### 11. Parameter Optimization for Minsearch

In [None]:
param_ranges = {
    'recipe_name': (0.0, 1.0),
    'main_ingredients': (0.0, 4.0),
    'all_ingredients': (0.0, 5.0),
    'instructions': (0.0, 3.0),
    'cuisine_type': (0.0, 1.0),
    'dietary_restrictions': (0.0, 2.0)
}

def simple_optimize(param_ranges, objective_function, n_iterations=10):
    best_params = None
    best_score = float('-inf')
    for _ in range(n_iterations):
        current_params = {}
        for param, (min_val, max_val) in param_ranges.items():
            current_params[param] = random.uniform(min_val, max_val)
        current_score = objective_function(current_params)
        if current_score > best_score:
            best_score = current_score
            best_params = current_params
    return best_params, best_score

gt_val = df_gt.sample(n=50, random_state=42).to_dict(orient='records')

def objective(boost_params):
    def search_function(q):
        return minsearch_search(q['question'], boost=boost_params, num_results=10)
    results = evaluate_retrieval(gt_val, search_function)
    return results['mrr']

best_boost, best_score = simple_optimize(param_ranges, objective, n_iterations=20)
print("Best Minsearch boost params:", best_boost)
print("Best validation MRR:", best_score)
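`simple_optimize` is a random search: each iteration samples every boost uniformly from its range and keeps the best-scoring set. Since it draws from the global `random` module, the result changes between runs unless a seed is set. The sketch below is a seeded variant run on a toy objective with a known optimum; the name `random_search`, the `seed` parameter, and the toy objective are assumptions for illustration, not part of the original notebook.

```python
import random


def random_search(param_ranges, objective_function, n_iterations=10, seed=None):
    """Random search with a private generator so runs are reproducible."""
    rng = random.Random(seed)  # does not touch the global random state
    best_params, best_score = None, float("-inf")
    for _ in range(n_iterations):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}
        score = objective_function(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Toy stand-in for validation MRR, maximal at main_ingredients=2, instructions=1."""
    return -((params["main_ingredients"] - 2.0) ** 2 + (params["instructions"] - 1.0) ** 2)


toy_ranges = {"main_ingredients": (0.0, 4.0), "instructions": (0.0, 3.0)}
run_a = random_search(toy_ranges, toy_objective, n_iterations=200, seed=42)
run_b = random_search(toy_ranges, toy_objective, n_iterations=200, seed=42)
print("Same result for the same seed:", run_a == run_b)
print("Best toy params:", run_a[0])
```

With the real objective, each iteration runs a full retrieval evaluation over `gt_val`, so the iteration count trades search coverage against run time.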


### 12. RAG Pipeline and LLM Answer Quality

In [None]:
def build_prompt(query, search_results):
    entry_template = """
Recipe: {recipe_name}
Cuisine: {cuisine_type}
Meal Type: {meal_type}
Difficulty: {difficulty_level}
Prep Time: {prep_time_minutes} minutes
Cook Time: {cook_time_minutes} minutes
Main Ingredients: {main_ingredients}
Instructions: {instructions}
Dietary Info: {dietary_restrictions}
""".strip()
    context = "\n\n".join([entry_template.format(**doc) for doc in search_results])
    prompt_template = """
You are an expert chef and culinary assistant. Answer the question based on the content from 
our recipe database. Use only the facts from the context when answering the question.

CONTEXT:
{context}

QUESTION: {question}

Provide recipe recommendations with brief explanations of why they match the requested ingredients.
If exact ingredients aren't available, suggest the closest matches and mention any substitutions needed.
""".strip()
    return prompt_template.format(context=context, question=query)

def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def rag_minsearch(question):
    search_results = minsearch_search(question, boost=best_boost, num_results=5)
    prompt = build_prompt(question, search_results)
    answer = llm(prompt)
    return answer

def rag_elasticsearch(question):
    search_results = elasticsearch_search(question, num_results=5)
    prompt = build_prompt(question, search_results)
    answer = llm(prompt)
    return answer

### 13. LLM-as-Judge Evaluation (RAG answer quality)

In [None]:
prompt2_template = """
You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

sample = df_gt.sample(n=50, random_state=1).to_dict(orient='records')
evaluations_minsearch = []
evaluations_es = []

print("Evaluating RAG (Minsearch) with LLM-as-judge...")

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag_minsearch(question)
    prompt = prompt2_template.format(question=question, answer_llm=answer_llm)
    evaluation = llm(prompt)
    try:
        evaluation = json.loads(evaluation)
    except Exception:
        evaluation = {"Relevance": "ERROR", "Explanation": evaluation}
    evaluations_minsearch.append({
        "id": record['id'],
        "question": question,
        "answer": answer_llm,
        "relevance": evaluation.get("Relevance"),
        "explanation": evaluation.get("Explanation")
    })

print("Evaluating RAG (Elasticsearch) with LLM-as-judge...")

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag_elasticsearch(question)
    prompt = prompt2_template.format(question=question, answer_llm=answer_llm)
    evaluation = llm(prompt)
    try:
        evaluation = json.loads(evaluation)
    except Exception:
        evaluation = {"Relevance": "ERROR", "Explanation": evaluation}
    evaluations_es.append({
        "id": record['id'],
        "question": question,
        "answer": answer_llm,
        "relevance": evaluation.get("Relevance"),
        "explanation": evaluation.get("Explanation")
    })

df_eval_minsearch = pd.DataFrame(evaluations_minsearch)
df_eval_es = pd.DataFrame(evaluations_es)
df_eval_minsearch.to_csv('../data/rag-eval-minsearch.csv', index=False)
df_eval_es.to_csv('../data/rag-eval-elasticsearch.csv', index=False)

print("Minsearch RAG relevance proportions:")
print(df_eval_minsearch['relevance'].value_counts(normalize=True))
print("Elasticsearch RAG relevance proportions:")
print(df_eval_es['relevance'].value_counts(normalize=True))

### 14. Summary

In [None]:
print("\n=== RETRIEVAL METRICS ===")
print("Minsearch:", metrics_minsearch)
print("Elasticsearch:", metrics_es)
print("\n=== RAG LLM-as-Judge (proportion RELEVANT) ===")
print("Minsearch:", (df_eval_minsearch['relevance'] == 'RELEVANT').mean())
print("Elasticsearch:", (df_eval_es['relevance'] == 'RELEVANT').mean())
print("\nAll evaluation results saved to CSV in ../data/")

---

**How to interpret the LLM-as-Judge RAG evaluation output:**

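The output shows the proportion of answers the LLM-as-judge assigned to each label for both the Minsearch and Elasticsearch RAG pipelines.  
- **RELEVANT:** Generated answers judged fully relevant to the user's question.
- **PARTLY_RELEVANT:** Answers judged partially relevant.
- **NON_RELEVANT:** Answers judged not relevant to the question.
- **ERROR:** The judge's reply could not be parsed as JSON, so no label was recorded.

This relevance-based evaluation is closer to the user experience than strict retrieval metrics: Hit Rate and MRR only check whether the source recipe was retrieved, while the judge rates the answer the user actually sees.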
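To compare the two pipelines label by label, the proportions can be computed over a fixed label set so that labels with zero answers still show up. A minimal sketch follows, assuming the labels used by the judge prompt plus the `ERROR` fallback from the evaluation loop; the helper `relevance_proportions` and the example labels are hypothetical and are not results from this notebook.

```python
from collections import Counter

LABELS = ["RELEVANT", "PARTLY_RELEVANT", "NON_RELEVANT", "ERROR"]


def relevance_proportions(labels):
    """Share of answers per label; labels outside LABELS are ignored."""
    total = len(labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    counts = Counter(labels)
    return {label: counts.get(label, 0) / total for label in LABELS}


# Hypothetical judge labels, for illustration only
example_labels = ["RELEVANT", "RELEVANT", "PARTLY_RELEVANT", "NON_RELEVANT", "RELEVANT"]
summary = relevance_proportions(example_labels)
for label in LABELS:
    print(f"{label:<16} {summary[label]:.2f}")
```

In the notebook, the input would be `df_eval_minsearch['relevance'].tolist()` or `df_eval_es['relevance'].tolist()`. A high `ERROR` share points to a parsing problem in the judge step, not to poor answers.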