## Retrieval-Augmented Generation (RAG) for Recipe Assistant: RAG Evaluation

This notebook provides an evaluation framework for the RAG system described in `rag-flow.ipynb`.  
It covers ground-truth generation, retrieval metrics (Hit Rate, MRR), parameter optimization, and LLM-as-judge answer quality for both Minsearch and Elasticsearch using the best-performing combined retrieval and reranking approaches.

---

### 1. Set-up Dependencies, OpenAI Client and Dataset

#### 1.1 Dependencies and OpenAI Client
Import dependencies, load the OpenAI API key, and connect to the OpenAI API.

In [None]:
import pandas as pd
import numpy as np
import json
import random
import re
from tqdm.auto import tqdm
import dotenv
import minsearch
from elasticsearch import Elasticsearch
from openai import OpenAI

In [None]:
dotenv.load_dotenv("../.env")
client = OpenAI()

#### 1.2. Load and Index Recipe Data
Load the recipe dataset from CSV file and prepare it for indexing.

In [None]:
df = pd.read_csv('../data/recipes_clean.csv')

# Add an ID column if it doesn't exist
if 'id' not in df.columns:
    df['id'] = range(len(df))
    
# Create documents for indexing
documents = df.to_dict(orient='records')
documents[0]

### 2. Utility Functions

#### 2.1 Create Time Filter
Users are asked how much time they have available. Recipes whose combined prep and cook time exceeds that limit are filtered out.
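
To make the rule concrete, here is a minimal, self-contained sketch of the time filter on toy data. The helper name `within_time_budget` and the sample recipes are illustrative assumptions; only the field names `prep_time_minutes` and `cook_time_minutes` come from this notebook.

```python
# Illustrative sketch: keep only recipes whose prep + cook time fits the budget.
# `within_time_budget` and the toy recipes are hypothetical, not part of the notebook.
def within_time_budget(recipes, max_time=None):
    if max_time is None:  # no limit given: keep everything
        return list(recipes)
    kept = []
    for recipe in recipes:
        try:
            total = int(recipe.get("prep_time_minutes", 0)) + int(recipe.get("cook_time_minutes", 0))
        except (TypeError, ValueError):
            continue  # unparseable times are treated as "too long"
        if total <= max_time:
            kept.append(recipe)
    return kept


toy_recipes = [
    {"recipe_name": "Omelette", "prep_time_minutes": 5, "cook_time_minutes": 10},
    {"recipe_name": "Stew", "prep_time_minutes": 20, "cook_time_minutes": 90},
    {"recipe_name": "Mystery", "prep_time_minutes": "n/a", "cook_time_minutes": 10},
]
quick = within_time_budget(toy_recipes, max_time=30)
print([r["recipe_name"] for r in quick])  # ['Omelette']
```

Dropping recipes with unreadable times is a conservative choice: a recipe is only recommended when its duration is known to fit.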

In [None]:
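def filter_by_max_time(results, max_time=None):
    """Keep recipes whose prep + cook time (in minutes) is at most max_time."""
    if max_time is None:
        return results
    filtered = []
    for doc in results:
        try:
            total_time = int(doc.get('prep_time_minutes', 0)) + int(doc.get('cook_time_minutes', 0))
        except (TypeError, ValueError, OverflowError):
            # Missing or non-numeric times (e.g. None, NaN): treat the recipe as too long
            total_time = 99999
        if total_time <= max_time:
            filtered.append(doc)
    return filtered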

#### 2.2 Print Unused Ingredients

Utility function that, given the query and the results, prints which query ingredients were not covered by the selected recipes.
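
The idea is a plain set difference between the query ingredients and the union of all recipe ingredients. The sketch below shows it on toy data; `split_ingredients`, `unused_ingredients`, and the sample recipes are hypothetical names used only for illustration.

```python
# Illustrative sketch: which query ingredients are not covered by any selected recipe?
def split_ingredients(text):
    """Split a comma-separated ingredient string into a set of lowercase names."""
    return {part.strip().lower() for part in text.split(",") if part.strip()}


def unused_ingredients(query, recipes):
    covered = set()
    for recipe in recipes:
        covered |= split_ingredients(recipe.get("all_ingredients", ""))
    return split_ingredients(query) - covered


selected = [
    {"recipe_name": "Tomato Rice", "all_ingredients": "Tomato, rice, onion"},
    {"recipe_name": "Garlic Butter Shrimp", "all_ingredients": "shrimp, garlic, butter"},
]
missing = unused_ingredients("tomato, shrimp, lemon, Garlic", selected)
print(sorted(missing))  # ['lemon']
```

Matching is exact after lowercasing, so "tomatoes" in a recipe would not count as covering "tomato" in the query.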

In [None]:
# tokenize_ingredients is used in below Cover Ingredients Searches
def tokenize_ingredients(ingredient_str):
    # Split only on commas, strip whitespace, and lowercase
    return set([ing.strip().lower() for ing in ingredient_str.split(',') if ing.strip()])

In [None]:
def print_unused_ingredients(ingredients, results):
    """Print the query ingredients that no recipe in results contains (exact, case-insensitive match)."""
    query_ings = tokenize_ingredients(ingredients)
    used = set()
    for doc in results:
        recipe_ings = tokenize_ingredients(doc.get('all_ingredients', ''))
        used |= recipe_ings
    unused = query_ings - used
    # Sort for a stable, readable output order
    print("Unused ingredients:", ", ".join(sorted(unused)) if unused else "All ingredients used!")

#### 2.3. Deduplicate Search Results
Elasticsearch and LLM-based steps can sometimes return duplicate or near-duplicate documents. This helper keeps the first occurrence of each recipe.
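
A small sketch of key-based deduplication follows. It assumes documents carry an `id` and fall back to a tuple of fields when the `id` is missing; `dedupe_by_key` and the toy hits are hypothetical. Note the explicit `is not None` check, which keeps an `id` of `0` valid.

```python
# Illustrative sketch: keep the first occurrence of each document, keyed by id.
def dedupe_by_key(docs):
    seen, unique = set(), []
    for doc in docs:
        doc_id = doc.get("id")
        # `is not None` keeps id 0 valid; a plain `or` would treat it as missing
        key = doc_id if doc_id is not None else (doc.get("recipe_name"), doc.get("prep_time_minutes"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


hits = [
    {"id": 0, "recipe_name": "Pancakes"},
    {"id": 3, "recipe_name": "Chili"},
    {"id": 0, "recipe_name": "Pancakes"},
    {"recipe_name": "Salad", "prep_time_minutes": 10},
    {"recipe_name": "Salad", "prep_time_minutes": 10},
]
print([d["recipe_name"] for d in dedupe_by_key(hits)])  # ['Pancakes', 'Chili', 'Salad']
```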

In [None]:
def deduplicate_results(results):
    seen = set()
    deduped = []
    for doc in results:
        # Use a unique field ('id'), falling back to a tuple of fields when it is missing.
        # Checking `is not None` matters because id 0 is a valid id but falsy.
        doc_id = doc.get('id')
        if doc_id is not None:
            key = doc_id
        else:
            key = (doc.get('recipe_name'), doc.get('prep_time_minutes'), doc.get('cook_time_minutes'))
        if key not in seen:
            seen.add(key)
            deduped.append(doc)
    return deduped

### 3. Ground Truth Generation
Generate ground-truth user questions for each recipe using the LLM.  
This is used to evaluate retrieval quality.
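
Because the model's reply is free text, it helps to validate it before storing the questions. The sketch below is one possible way to do that, assuming the reply should contain a JSON object with a `questions` list of five items; `parse_questions` is a hypothetical helper, not part of the notebook.

```python
import json


# Illustrative sketch: extract and validate the question list from a model reply.
def parse_questions(raw, expected=5):
    """Return the list of questions, or None if the reply is unusable."""
    # Tolerate text around the JSON object by slicing from the first "{" to the last "}"
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    questions = payload.get("questions")
    if not isinstance(questions, list) or len(questions) != expected:
        return None
    return [str(q).strip() for q in questions]


good = '{"questions": ["q1", "q2", "q3", "q4", "q5"]}'
wrapped = "Here you go:\n" + good
print(parse_questions(good))        # ['q1', 'q2', 'q3', 'q4', 'q5']
print(parse_questions(wrapped))     # ['q1', 'q2', 'q3', 'q4', 'q5']
print(parse_questions("not json"))  # None
```

Returning `None` instead of raising lets the generation loop count and skip bad replies without stopping.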

In [None]:
# The fields in the template below are the ones most relevant for generating realistic user questions.
prompt_template = """
You emulate a user of our recipe assistant application.
Formulate 5 questions this user might ask based on a provided recipe.
Make the questions specific to ingredients, cooking methods, 
cooking duration (prep/cook time), or dietary information in this recipe.
Do NOT mention the recipe name in the question.
The record should contain the answer to the questions, 
and the questions should be complete and not too short.
Use as few words as possible from the record.

The record:

Recipe: {recipe_name}
Cuisine: {cuisine_type}
Main Ingredients: {main_ingredients}
Instructions: {instructions}
Dietary Info: {dietary_restrictions}
Total Time: {total_time_minutes}
Servings: {servings}

Provide the output in parsable JSON without using code blocks:

{{"questions": ["question1", "question2", ..., "question5"]}}
""".strip()

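def generate_questions(doc):
    # Every placeholder in prompt_template (e.g. total_time_minutes) must be a key in doc;
    # otherwise str.format raises KeyError.
    prompt = prompt_template.format(**doc)
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

results = {}
skipped = []  # ids of documents whose questions could not be generated or parsed
for i, doc in enumerate(tqdm(documents[:100])):
    doc_id = doc.get('id', i)
    if doc_id in results:
        continue
    try:
        questions_raw = generate_questions(doc)
        questions = json.loads(questions_raw)
        results[doc_id] = questions['questions']
    except (json.JSONDecodeError, KeyError):
        # Malformed JSON, a missing "questions" key, or a missing template field
        skipped.append(doc_id)
        continue

# Report skipped documents so silent failures do not go unnoticed
print(f"Generated questions for {len(results)} recipes; skipped {len(skipped)}")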

final_results = []
for doc_id, questions in results.items():
    for q in questions:
        final_results.append((doc_id, q))

df_results = pd.DataFrame(final_results, columns=['id', 'question'])
df_results.to_csv('../data/ground-truth-retrieval.csv', index=False)

### 4. Load Ground Truth Questions

Load the generated ground-truth questions for evaluation.

In [None]:
df_gt = pd.read_csv('../data/ground-truth-retrieval.csv')
ground_truth = df_gt.to_dict(orient='records')

### 5. Minsearch Setup and Retrieval Strategies
Set up Minsearch retrieval backend as in the main RAG notebook.

#### 5.1. Minsearch Index 

In [None]:
# prep/cook time and servings are numeric and not suitable for Minsearch's text/keyword search,
# so they are left out of the index fields.
# Set up the Minsearch index for recipes
index = minsearch.Index(
    text_fields=['recipe_name', 'main_ingredients', 'all_ingredients',
                 'instructions', 'cuisine_type', 'dietary_restrictions'],
    keyword_fields=['meal_type', 'difficulty_level']
)

# Fit the index on the recipe documents
index.fit(documents)
# Alias so the retrieval functions below can read the indexed documents as index.documents
index.documents = index.docs

#### 5.2 Run RAG Retrieval Strategies for Evaluation (Minsearch)

##### Approach 4: Hybrid Search With OpenAI (Keyword + Embedding)
Combines keyword search with OpenAI embedding-based (semantic) similarity for more robust retrieval.
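
The fusion rule used here can be sketched without any search backend: a document earns `alpha` if the keyword search found it and `1 - alpha` if the embedding search found it, and the two are summed. The function name `fuse_rankings` and the toy id lists are assumptions for illustration.

```python
# Illustrative sketch: weighted fusion of a keyword result list and an embedding result list.
def fuse_rankings(keyword_ids, embedding_ids, alpha=0.5):
    """Score each id: alpha if found by keyword search, plus (1 - alpha) if found by embeddings."""
    scores = {}
    for doc_id in keyword_ids:
        scores[doc_id] = alpha
    for doc_id in embedding_ids:
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha)
    # Highest combined score first; Python's sort is stable, so ties keep insertion order
    return sorted(scores, key=scores.get, reverse=True)


keyword_ids = [7, 2, 9]
embedding_ids = [2, 4, 7]
print(fuse_rankings(keyword_ids, embedding_ids, alpha=0.7))  # [7, 2, 9, 4]
```

Documents found by both searches (7 and 2) rise to the top. With `alpha=0.7`, a keyword-only hit (9) outranks an embedding-only hit (4). The returned order must then be used to fetch the documents, otherwise the ranking is lost.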

In [None]:
def get_embedding(text):
    response = client.embeddings.create( # OpenAI client
        model="text-embedding-3-small",
        input=[text]
    )
    return np.array(response.data[0].embedding)

# Precompute embeddings for all recipes (DO THIS ONCE AND CACHE)
if not hasattr(index, "embeddings"):
    index.embeddings = [get_embedding(doc['all_ingredients']) for doc in index.documents]

def ms_hybrid_search(query, index, num_results=5, alpha=0.5, max_time=None):
    # Keyword search with the Minsearch index
    keyword_results = index.search(query, num_results=10)
    # Embedding search
    query_emb = get_embedding(query)
    similarities = [np.dot(query_emb, emb) for emb in index.embeddings]
    top_indices = np.argsort(similarities)[-10:][::-1]
    embedding_results = [index.documents[i] for i in top_indices]
    # Combine results: alpha for a keyword hit plus (1 - alpha) for an embedding hit
    combined = {}
    doc_by_id = {}
    for doc in keyword_results:
        combined[doc['id']] = alpha
        doc_by_id[doc['id']] = doc
    for doc in embedding_results:
        combined[doc['id']] = combined.get(doc['id'], 0) + (1 - alpha)
        doc_by_id[doc['id']] = doc
    # Sort by combined score and keep that order in the returned documents
    sorted_ids = sorted(combined, key=combined.get, reverse=True)
    results = [doc_by_id[doc_id] for doc_id in sorted_ids]
    results = filter_by_max_time(results, max_time)
    # Deduplicate before returning (optional but recommended if you ever combine sources)
    deduped_results = deduplicate_results(results)
    return deduped_results[:num_results]

##### Approach 5: Cover Ingredients Search
Selects a set of recipes that together cover as many of the query ingredients as possible, optionally filtering out recipes that exceed the max_time constraint.
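
This is a greedy set-cover heuristic: at each step, pick the recipe that covers the most ingredients still uncovered. The sketch below shows the loop on toy data; `greedy_cover` and the sample recipes (with ingredients already tokenized into sets) are hypothetical.

```python
# Illustrative sketch: greedy set cover over query ingredients.
def greedy_cover(query_ingredients, recipes, max_recipes=5):
    uncovered = set(query_ingredients)
    remaining = list(recipes)
    chosen = []
    while uncovered and remaining and len(chosen) < max_recipes:
        # Pick the recipe that covers the most still-uncovered ingredients
        best = max(remaining, key=lambda r: len(uncovered & r["ingredients"]))
        if not uncovered & best["ingredients"]:
            break  # no remaining recipe covers anything new
        chosen.append(best)
        uncovered -= best["ingredients"]
        remaining.remove(best)
    return chosen, uncovered


recipes = [
    {"name": "Fried Rice", "ingredients": {"rice", "eggs", "onion"}},
    {"name": "Garlic Shrimp", "ingredients": {"shrimp", "garlic", "butter"}},
    {"name": "Egg Drop Soup", "ingredients": {"eggs", "onion"}},
]
chosen, uncovered = greedy_cover({"rice", "eggs", "shrimp", "garlic", "lemon"}, recipes)
print([r["name"] for r in chosen])  # ['Fried Rice', 'Garlic Shrimp']
print(uncovered)                    # {'lemon'}
```

Greedy selection does not guarantee the smallest possible set of recipes, but it is simple and usually close enough for a handful of ingredients.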


In [None]:
def ms_cover_ingredients_search(query, index, num_results=5, max_time=None):
    """
    Selects a set of recipes that together cover as many of the query ingredients as possible,
    optionally filtering out recipes that exceed the max_time constraint.
    """
    query_ings = tokenize_ingredients(query)
    uncovered = set(query_ings)
    selected = []
    docs = index.documents.copy() 

    # Apply time filter to docs at the start for efficiency
    docs = filter_by_max_time(docs, max_time)

    while uncovered and len(selected) < num_results and docs:
        best_doc = None
        best_overlap = 0
        for doc in docs:
            recipe_ings = tokenize_ingredients(doc.get('all_ingredients', ''))
            overlap = len(uncovered & recipe_ings)
            if overlap > best_overlap:
                best_overlap = overlap
                best_doc = doc
        if best_doc and best_overlap > 0:
            selected.append(best_doc)
            recipe_ings = tokenize_ingredients(best_doc.get('all_ingredients', ''))
            uncovered -= recipe_ings
            docs.remove(best_doc)
        else:
            break
    # Deduplicate results before returning (recommended if docs may have overlap)
    deduped_results = deduplicate_results(selected)
    return deduped_results

#### 5.3. Build Prompt for the LLM Using Retrieved Context (Minsearch + Elasticsearch)

In [None]:
def build_prompt(query, search_results):
    entry_template = """
Recipe: {recipe_name}
Cuisine: {cuisine_type}
Meal Type: {meal_type}
Difficulty: {difficulty_level}
Prep Time: {prep_time_minutes} minutes
Cook Time: {cook_time_minutes} minutes
Main Ingredients: {main_ingredients}
Instructions: {instructions}
Dietary Info: {dietary_restrictions}
""".strip()
    context = "\n\n".join([entry_template.format(**doc) for doc in search_results])
    prompt_template = """
You are an expert chef and culinary assistant. Answer the question based on the content from our recipe database.
Use only the facts from the context when answering the question.

CONTEXT:
{context}

QUESTION: {question}

Provide recipe recommendations with brief explanations of why they match the requested ingredients.
If exact ingredients aren't available, suggest the closest matches and mention any substitutions needed.
""".strip()
    return prompt_template.format(context=context, question=query)

#### 5.4 LLM Call Function (Minsearch + Elasticsearch)

In [None]:
def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

#### 5.5 RAG Pipeline Using Minsearch
Test ms_cover_ingredients_search and ms_hybrid_search on the RAG Pipeline.

In [None]:
# Note: section 12.1 redefines rag_minsearch with a single-argument signature for the evaluation runs.
def rag_minsearch(query, max_time=None, num_results=5, approach="cover"): 
    """
    approach: "cover" for ms_cover_ingredients_search, "hybrid" for ms_hybrid_search
    Deduplicates results before building prompt.
    """
    if approach == "cover":
        search_results = ms_cover_ingredients_search(query, index=index, num_results=num_results, max_time=max_time)
        deduped_results = deduplicate_results(search_results)
    elif approach == "hybrid":
        search_results = ms_hybrid_search(query, index=index, num_results=num_results, max_time=max_time)
        deduped_results = deduplicate_results(search_results)
    else:
        raise ValueError("Unknown approach: choose 'cover' or 'hybrid'")
    prompt = build_prompt(query, deduped_results)
    answer = llm(prompt)
    return answer

#### 5.6 Combine Cover and Hybrid RAG Pipelines (Minsearch)

Combine the two approaches: first run Cover Ingredients Search to select a diverse set of recipes that cover as many ingredients as possible, then rerank that pool by the embedding-based (semantic) similarity used in Hybrid Search and keep the top results.
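
The reranking step amounts to sorting a small candidate pool by cosine similarity to the query vector. Here is a dependency-free sketch with made-up three-dimensional vectors; real embeddings from `text-embedding-3-small` have 1536 dimensions, and `rerank_by_similarity` is a hypothetical helper.

```python
import math


# Illustrative sketch: rerank a candidate pool by cosine similarity to the query.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_by_similarity(query_vec, candidates, top_k=5):
    """Order (name, vector) candidates by cosine similarity to the query, best first."""
    scored = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [name for name, _ in scored[:top_k]]


query_vec = [1.0, 0.0, 1.0]
pool = [
    ("Lemon Cake", [0.0, 1.0, 0.0]),
    ("Tomato Pasta", [1.0, 0.0, 0.9]),
    ("Chicken Rice", [1.0, 1.0, 0.0]),
]
print(rerank_by_similarity(query_vec, pool, top_k=2))  # ['Tomato Pasta', 'Chicken Rice']
```

Note that `top_k` caps how many results survive: asking a pipeline for 10 results while `top_k` is 5 still yields at most 5.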

In [None]:
def ms_cover_then_hybrid_search(query, index, num_results=5, max_time=None, hybrid_top_k=5):
    """
    Combine Cover Ingredients Search and Hybrid Search:
    1. Run Cover Ingredients Search to get a diverse set of recipes (pool size = num_results*4).
    2. Rerank the pool by semantic similarity to the query using OpenAI embeddings.
    3. Deduplicate results before returning.
    """
    # Step 1: Cover Ingredients Search to get a diverse set and increase the pool size
    cover_results = ms_cover_ingredients_search(query, index, num_results=num_results*4, max_time=max_time)
    if not cover_results:
        return []
    # Step 2: Rerank, using only cover_results as the candidate pool
    # Compute embedding for the query
    query_emb = get_embedding(query)
    # Compute similarities only for cover_results
    cover_embeddings = [get_embedding(doc['all_ingredients']) for doc in cover_results]
    similarities = [np.dot(query_emb, emb) for emb in cover_embeddings]
    # Get top-k by semantic similarity (at most hybrid_top_k results survive this step)
    top_indices = np.argsort(similarities)[-hybrid_top_k:][::-1]
    # Ingredient coverage comes from the cover step; the embedding ranking adds the semantic part
    hybrid_results = [cover_results[i] for i in top_indices]
    # Deduplicate before returning (defensive, in case of overlap)
    deduped_results = deduplicate_results(hybrid_results[:num_results])
    return deduped_results

### 6. Elasticsearch Setup and Retrieval Strategies
Set up Elasticsearch retrieval backend as in the main RAG notebook.

#### 6.1 Elasticsearch Docker, Client and Indexing

---
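Run Elasticsearch in Docker from a terminal:

`docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

Or, with security disabled and the JVM heap capped (Elasticsearch 8.x enables security by default, so this variant is the one that matches the plain `http://localhost:9200` client used below):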

`docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

And check if Elasticsearch is up:

`curl http://localhost:9200`

---

In [None]:
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

# Create Elasticsearch Client (make sure Elasticsearch is running locally)
es_client = Elasticsearch('http://localhost:9200')

In [None]:
# Define Elasticsearch index settings and mappings for recipes
index_settings = {
    "settings": {
        "number_of_shards": 1,  # A unit of storage and search. More shards can improve parallelism for large datasets
        "number_of_replicas": 0  # A shard copy for fault tolerance and increased search throughput (0 is not recommended for production)
    },
    "mappings": {
        "properties": {
            "recipe_name": {"type": "text"},
            "main_ingredients": {"type": "text"},
            "all_ingredients": {"type": "text"},
            "instructions": {"type": "text"},
            "cuisine_type": {"type": "text"},
            "dietary_restrictions": {"type": "text"},
            "meal_type": {"type": "keyword"},
            "difficulty_level": {"type": "keyword"},
            "prep_time_minutes": {"type": "integer"},
            "cook_time_minutes": {"type": "integer"},
            "all_ingredients_vector": {
                "type": "dense_vector",
                "dims": 1536  # text-embedding-3-small returns 1536-dimensional vectors
            }
        }
    }
}

index_name = "recipes"

# Create the index (ignore error if it already exists)
try:
    es_client.indices.create(index=index_name, body=index_settings)
except Exception as e:
    print("Index may already exist:", e)

In [None]:
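# Index documents into Elasticsearch, including the embedding vector
for doc in tqdm(documents):
    # Copy the document so the shared `documents` list (also used by Minsearch) is not modified
    es_doc = {**doc, 'all_ingredients_vector': get_embedding(doc['all_ingredients']).tolist()}
    # Using the recipe id as the Elasticsearch _id makes re-running this cell overwrite instead of duplicate
    es_client.index(index=index_name, id=doc['id'], document=es_doc)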

  0%|          | 0/477 [00:00<?, ?it/s]

#### 6.2 Run RAG Retrieval Strategies for Evaluation (Elasticsearch)

##### Approach 1: Basic Keyword Search

Returns recipes that match the query using simple keyword matching across all fields. **Will be used in cover strategy simulation for Elasticsearch**.


In [None]:
# Basic Search
def es_basic_search(query, num_results=5, max_time=None):
    search_query = {
        "size": num_results * 2,  # get more to allow filtering
        "query": {
            "multi_match": {
                "query": query,
                "fields": [ # Elasticsearch's multi_match query is for text searching
                    "recipe_name",
                    "main_ingredients^2",
                    "all_ingredients^3",
                    "instructions^1.5",
                    "cuisine_type",
                    "dietary_restrictions^1.5"
                ],
                "type": "best_fields" # default, can be "most_fields", "cross_fields", etc.
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = []
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    # Apply time filter
    if max_time is not None:
        result_docs = filter_by_max_time(result_docs, max_time)
    return result_docs[:num_results]

##### Approach 4: Hybrid Search With OpenAI (Keyword + Embedding)
Combines keyword search with OpenAI embedding-based (semantic) similarity for more robust retrieval.

In [None]:
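# Hybrid Search (keyword + vector; uses the built-in dense_vector field and precomputed vectors)
def es_hybrid_search(query, num_results=5, max_time=None):
    # This assumes you have a vector field called 'all_ingredients_vector' in your index
    # and a function get_embedding(text) that returns a vector for the query.
    # Note: the multi_match query selects the candidates, while the script below sets the
    # final score from cosine similarity alone (the keyword _score is not added in).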
    query_vector = get_embedding(query).tolist()
    search_query = {
        "size": num_results * 2,
        "query": {
            "script_score": {
                "query": {
                    "multi_match": {
                        "query": query,
                        "fields": [
                            "main_ingredients^2",
                            "all_ingredients^3",
                            "instructions^1.5",
                            "cuisine_type",
                            "dietary_restrictions^1.5",
                            "recipe_name"
                        ],
                        "type": "best_fields"
                    }
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'all_ingredients_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                }
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = [hit['_source'] for hit in response['hits']['hits']]
    if max_time is not None:
        result_docs = filter_by_max_time(result_docs, max_time)
    # Deduplicate before returning (recommended for hybrid search)
    deduped_results = deduplicate_results(result_docs)
    return deduped_results[:num_results]

##### Approach 5: Cover Ingredients Search
Elasticsearch queries return a ranked list based on scoring, but do not natively support iterative, set-cover-style selection across multiple results. A workaround is to retrieve many candidates from Elasticsearch, then run the cover algorithm in Python on those results.

In [None]:
def es_cover_ingredients_search(query, num_results=5, max_time=None, candidate_pool_size=200):
    """
    Simulate Cover Ingredients Search using Elasticsearch results, with optional time filtering.
    Returns a set of recipes that together cover as many of the query ingredients as possible.
    Deduplicates results before returning.
    """
    # Step 1: Get a large pool of candidates from ES (basic search, large pool)
    candidates = es_basic_search(query, num_results=candidate_pool_size)
    # Step 2: Filter by time if needed
    if max_time is not None:
        candidates = filter_by_max_time(candidates, max_time)
    # Step 3: Apply greedy cover algorithm on word-level tokens
    def word_tokens(text):
        # Turn commas into spaces before stripping punctuation so "a,b" does not merge into "ab"
        return set(re.sub(r'[^\w\s]', '', str(text).lower().replace(',', ' ')).split())

    uncovered = word_tokens(query)
    selected = []
    docs = candidates.copy()
    while uncovered and len(selected) < num_results and docs:
        best_doc = None
        best_overlap = 0
        for doc in docs:
            overlap = len(uncovered & word_tokens(doc.get('all_ingredients', '')))
            if overlap > best_overlap:
                best_overlap = overlap
                best_doc = doc
        if best_doc and best_overlap > 0:
            selected.append(best_doc)
            uncovered -= word_tokens(best_doc.get('all_ingredients', ''))
            docs.remove(best_doc)
        else:
            break
    # Deduplicate results before returning (recommended if docs may have overlap)
    deduped_results = deduplicate_results(selected)
    return deduped_results

#### 6.3 RAG Pipeline Using Elasticsearch 
Test es_cover_ingredients_search and es_hybrid_search on the RAG Pipeline.

In [None]:
# Note: section 12.1 redefines rag_elasticsearch with a single-argument signature for the evaluation runs.
def rag_elasticsearch(query, max_time=None, num_results=5, approach="cover"):  
    """
    approach: "cover" for es_cover_ingredients_search, "hybrid" for es_hybrid_search
    """
    if approach == "cover":
        search_results = es_cover_ingredients_search(query, num_results=num_results, max_time=max_time)
        deduped_results = deduplicate_results(search_results)
    elif approach == "hybrid":
        search_results = es_hybrid_search(query, num_results=num_results, max_time=max_time)
        deduped_results = deduplicate_results(search_results)
    else:
        raise ValueError("Unknown approach: choose 'cover' or 'hybrid'")
    prompt = build_prompt(query, deduped_results)
    answer = llm(prompt)
    return answer

#### 6.4 Combine Cover and Hybrid RAG Pipelines (Elasticsearch)

Combine the two approaches: first run Cover Ingredients Search to select a diverse set of recipes that cover as many ingredients as possible, then rerank that pool by the embedding-based (semantic) similarity used in Hybrid Search and keep the top results.

In [None]:
def es_cover_then_hybrid_search(query, num_results=5, max_time=None, hybrid_top_k=5, candidate_pool_size=200):
    """
    Combine Cover Ingredients Search and Hybrid Search for Elasticsearch:
    1. Run Cover Ingredients Search to get a diverse set of recipes (pool size = candidate_pool_size).
    2. Rerank the pool by semantic similarity to the query using OpenAI embeddings.
    3. Remove 'all_ingredients_vector' from each result for cleaner output.
    4. Deduplicate results before returning.
    """
    # Step 1: Cover Ingredients Search to get a diverse set and increase the pool size
    cover_results = es_cover_ingredients_search(query, num_results=candidate_pool_size, max_time=max_time)
    if not cover_results:
        return []
    # Step 2: Rerank the cover results by embedding similarity to the query
    query_emb = get_embedding(query)
    # Reuse the vector stored in Elasticsearch when present; otherwise embed on the fly
    cover_embeddings = [
        np.array(doc['all_ingredients_vector']) if doc.get('all_ingredients_vector')
        else get_embedding(doc['all_ingredients'])
        for doc in cover_results
    ]
    similarities = [np.dot(query_emb, emb) for emb in cover_embeddings]
    # At most hybrid_top_k results survive this step
    top_indices = np.argsort(similarities)[-hybrid_top_k:][::-1]
    hybrid_results = [cover_results[i] for i in top_indices]
    # Step 3: Remove all_ingredients_vector from each result for cleaner output
    for doc in hybrid_results:
        doc.pop('all_ingredients_vector', None)
    # Step 4: Deduplicate (defensive, in case the index holds repeated documents)
    return deduplicate_results(hybrid_results)[:num_results]

### 7. Rerank with LLM

Re-ranking means taking an initial set of retrieved documents (candidates) and re-ordering them, often using a more sophisticated model (like an LLM), to improve the relevance of the top results.
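
Once the model returns an ordered list of names, that order has to be mapped back onto the candidate documents. The sketch below shows one defensive way to do this without calling an LLM: unknown or repeated names are ignored, and candidates the ranker left out are appended at the end. `apply_ranking` is a hypothetical helper, and appending the leftovers is a design assumption rather than the notebook's behaviour.

```python
# Illustrative sketch: reorder candidates according to a list of ranked recipe names.
def apply_ranking(candidates, ranked_names):
    by_name = {doc["recipe_name"]: doc for doc in candidates}
    ordered, used = [], set()
    for name in ranked_names:
        if name in by_name and name not in used:  # ignore unknown or repeated names
            ordered.append(by_name[name])
            used.add(name)
    # Candidates the ranker forgot keep their original relative order at the end
    ordered.extend(doc for doc in candidates if doc["recipe_name"] not in used)
    return ordered


candidates = [{"recipe_name": "A"}, {"recipe_name": "B"}, {"recipe_name": "C"}]
ranked = apply_ranking(candidates, ["C", "Z", "A", "C"])
print([d["recipe_name"] for d in ranked])  # ['C', 'A', 'B']
```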

In [None]:
def rerank_with_llm(query, candidates, max_time=None): 
    # Apply time filter before reranking
    if max_time is not None:
        candidates = filter_by_max_time(candidates, max_time)
    context = "\n\n".join([f"Recipe: {doc['recipe_name']}\nIngredients: {doc['main_ingredients']}" for doc in candidates])
    prompt = f"""
Given the following user query and candidate recipes, rank the recipes from most to least relevant.

Query: {query}

Candidates:
{context}

Return a JSON list of recipe names in ranked order.
""".strip()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
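    content = response.choices[0].message.content
    # Extract the JSON array from the reply; fall back to the input order if parsing fails
    json_match = re.search(r'\[.*\]', content, re.DOTALL)
    if not json_match:
        return candidates
    try:
        ranked_names = json.loads(json_match.group())
    except json.JSONDecodeError:
        return candidates
    # Candidates whose names the model did not return are dropped from the result
    ranked_docs = [doc for name in ranked_names for doc in candidates if doc['recipe_name'] == name]
    return ranked_docs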

#### 7.1 Re-rank the ms_cover_then_hybrid_search RAG Pipeline 

In [None]:
# 1. Define a query and get some candidate recipes (using any retrieval function)
query = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
candidates = ms_cover_then_hybrid_search(query, index, num_results=5)

# 2. Rerank the candidates using the LLM
reranked = rerank_with_llm(query, candidates)

# 3. Print the reranked recipe names
for doc in reranked:
    print(doc['recipe_name'])

#### 7.2. Re-rank the es_cover_then_hybrid_search RAG Pipeline 

In [None]:
# 1. Define a query and get some candidate recipes (using any retrieval function)
query = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
candidates = es_cover_then_hybrid_search(query, num_results=5)

# 2. Rerank the candidates using the LLM
reranked = rerank_with_llm(query, candidates)

# 3. Print the reranked recipe names
for doc in reranked:
    print(doc['recipe_name'])

#### 7.3 Run and Compare Both Pipelines

In [None]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 45

print("Minsearch Cover + Hybrid Search (before rerank):")
ms_candidates = ms_cover_then_hybrid_search(ingredients, index, num_results=5, max_time=max_time)
for doc in ms_candidates:
    print(doc['recipe_name'])
print_unused_ingredients(ingredients, ms_candidates)

print("\nMinsearch Cover + Hybrid Search (after rerank):")
ms_reranked = rerank_with_llm(ingredients, ms_candidates, max_time=max_time)
for doc in ms_reranked:
    print(doc['recipe_name'])
print_unused_ingredients(ingredients, ms_reranked)

print("\nElasticsearch Cover + Hybrid Search (before rerank):")
es_candidates = es_cover_then_hybrid_search(ingredients, num_results=5, max_time=max_time)
for doc in es_candidates:
    print(doc['recipe_name'])
print_unused_ingredients(ingredients, es_candidates)

print("\nElasticsearch Cover + Hybrid Search (after rerank):")
es_reranked = rerank_with_llm(ingredients, es_candidates, max_time=max_time)
for doc in es_reranked:
    print(doc['recipe_name'])
print_unused_ingredients(ingredients, es_reranked)

### 8. Prompt Strategies

In [None]:
def get_prompt_strategy_1():
    return """
You are a chef assistant. Based on the available recipes, recommend dishes that use the requested ingredients.
Provide the recipe name, brief description, and cooking instructions and time.

CONTEXT:
{context}

QUESTION: {question}

Answer:
""".strip()

def get_prompt_strategy_2():
    return """
You are an expert chef specializing in ingredient substitutions. When users provide ingredients,
recommend recipes and suggest alternatives for missing ingredients. Always explain possible substitutions
and how they might affect the dish.

CONTEXT:
{context}

QUESTION: {question}

Provide recommendations with substitution suggestions:
""".strip()

def get_prompt_strategy_3():
    return """
You are a nutritionist and chef. Recommend recipes based on ingredients provided, considering
nutritional value and dietary restrictions. Highlight health benefits and suggest modifications
for different dietary needs (vegetarian, gluten-free, etc.).

CONTEXT:
{context}

QUESTION: {question}

Provide nutritionally-aware recommendations:
""".strip()


### 10. Retrieval Evaluation: Hit Rate and MRR
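
Hit Rate is the fraction of questions for which the source recipe appears anywhere in the returned list. MRR (Mean Reciprocal Rank) averages 1/rank of the first relevant result, counting 0 when it is missing. The worked example below uses a hypothetical set of four queries; the function names are illustrative.

```python
# Illustrative sketch: Hit Rate and MRR on hand-made relevance lists.
def hit_rate_at_k(relevance_lists):
    # A query counts as a hit if any returned document is the relevant one
    return sum(any(flags) for flags in relevance_lists) / len(relevance_lists)


def mean_reciprocal_rank(relevance_lists):
    total = 0.0
    for flags in relevance_lists:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1 / rank  # only the first relevant result counts
                break
    return total / len(relevance_lists)


relevance = [
    [True, False, False],   # relevant at rank 1 -> 1
    [False, False, True],   # relevant at rank 3 -> 1/3
    [False, False, False],  # miss -> 0
    [False, True, False],   # relevant at rank 2 -> 1/2
]
print(hit_rate_at_k(relevance))                   # 0.75
print(round(mean_reciprocal_rank(relevance), 4))  # 0.4583
```

Three of four queries are hits, so Hit Rate is 0.75. MRR is (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.4583, which is lower because it also penalizes relevant results that appear further down the list.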

In [None]:
def hit_rate(relevance_total):
    cnt = 0
    for line in relevance_total:
        if True in line:
            cnt += 1
    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0
    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score += 1 / (rank + 1)
                break
    return total_score / len(relevance_total)

def evaluate_retrieval(ground_truth, search_function):
    relevance_total = []
    for q in tqdm(ground_truth):
        doc_id = str(q['id'])
        results = search_function(q)
        relevance = [str(d.get('id')) == doc_id for d in results]
        relevance_total.append(relevance)
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

# hybrid_top_k defaults to 5, so it is raised to match num_results; otherwise at most 5 results come back
print("Evaluating Minsearch retrieval...")
metrics_minsearch = evaluate_retrieval(ground_truth, lambda q: ms_cover_then_hybrid_search(q['question'], index, num_results=10, hybrid_top_k=10))
print("Minsearch:", metrics_minsearch)

print("Evaluating Elasticsearch retrieval...")
metrics_es = evaluate_retrieval(ground_truth, lambda q: es_cover_then_hybrid_search(q['question'], num_results=10, hybrid_top_k=10))
print("Elasticsearch:", metrics_es)

### 11. Parameter Optimization for Minsearch

In [None]:
param_ranges = { # Used for optimizing text field boosts in Minsearch
    'recipe_name': (0.0, 1.0),
    'main_ingredients': (0.0, 4.0),
    'all_ingredients': (0.0, 5.0),
    'instructions': (0.0, 3.0),
    'cuisine_type': (0.0, 1.0),
    'dietary_restrictions': (0.0, 2.0)
}

def simple_optimize(param_ranges, objective_function, n_iterations=10):
    best_params = None
    best_score = float('-inf')
    for _ in range(n_iterations):
        current_params = {}
        for param, (min_val, max_val) in param_ranges.items():
            current_params[param] = random.uniform(min_val, max_val)
        current_score = objective_function(current_params)
        if current_score > best_score:
            best_score = current_score
            best_params = current_params
    return best_params, best_score

gt_val = df_gt.sample(n=50, random_state=42).to_dict(orient='records')

def objective(boost_params):
    def search_function(q):
        return index.search(q['question'], boost_dict=boost_params, num_results=10)
    results = evaluate_retrieval(gt_val, search_function)
    return results['mrr']

random.seed(42)  # make the random search reproducible across runs
best_boost, best_score = simple_optimize(param_ranges, objective, n_iterations=20)
print("Best Minsearch boost params:", best_boost)
print("Best validation MRR:", best_score)


### 12. RAG Pipeline Evaluation and LLM Answer Quality

#### 12.1. Fixed Function Versions (no approach argument as in rag-flow.ipynb)

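These definitions replace the `rag_minsearch` and `rag_elasticsearch` functions from sections 5.5 and 6.3 with single-argument versions that always use the combined cover + hybrid retrieval. During evaluation they generate an answer for each ground-truth question automatically, so that the answers can be judged for relevance and quality in the LLM-as-judge section.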

In [None]:
def rag_minsearch(question):
    search_results = ms_cover_then_hybrid_search(question, index, num_results=5)
    prompt = build_prompt(question, search_results)
    answer = llm(prompt)
    return answer

def rag_elasticsearch(question):
    search_results = es_cover_then_hybrid_search(question, num_results=5)
    prompt = build_prompt(question, search_results)
    answer = llm(prompt)
    return answer

#### 12.2. LLM-as-Judge Evaluation (RAG answer quality)

In [None]:
prompt2_template = """
You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

sample = df_gt.sample(n=50, random_state=1).to_dict(orient='records')

def judge_pipeline(rag_function, records):
    """Generate an answer per question with rag_function and have the LLM judge its relevance."""
    evaluations = []
    for record in tqdm(records):
        question = record['question']
        answer_llm = rag_function(question)
        prompt = prompt2_template.format(question=question, answer_llm=answer_llm)
        evaluation = llm(prompt)
        try:
            evaluation = json.loads(evaluation)
        except Exception:
            # Keep the raw judge output so parsing failures can be inspected later
            evaluation = {"Relevance": "ERROR", "Explanation": evaluation}
        evaluations.append({
            "id": record['id'],
            "question": question,
            "answer": answer_llm,
            "relevance": evaluation.get("Relevance"),
            "explanation": evaluation.get("Explanation")
        })
    return evaluations

print("Evaluating RAG (Minsearch) with LLM-as-judge...")
evaluations_minsearch = judge_pipeline(rag_minsearch, sample)

print("Evaluating RAG (Elasticsearch) with LLM-as-judge...")
evaluations_es = judge_pipeline(rag_elasticsearch, sample)

df_eval_minsearch = pd.DataFrame(evaluations_minsearch)
df_eval_es = pd.DataFrame(evaluations_es)
df_eval_minsearch.to_csv('../data/rag-eval-minsearch.csv', index=False)
df_eval_es.to_csv('../data/rag-eval-elasticsearch.csv', index=False)

print("Minsearch RAG relevance proportions:")
print(df_eval_minsearch['relevance'].value_counts(normalize=True))
print("Elasticsearch RAG relevance proportions:")
print(df_eval_es['relevance'].value_counts(normalize=True))

### 14. Summary

In [None]:
print("\n=== RETRIEVAL METRICS ===")
print("Minsearch:", metrics_minsearch)
print("Elasticsearch:", metrics_es)
print("\n=== RAG LLM-as-Judge (proportion RELEVANT) ===")
print("Minsearch:", (df_eval_minsearch['relevance'] == 'RELEVANT').mean())
print("Elasticsearch:", (df_eval_es['relevance'] == 'RELEVANT').mean())
print("\nAll evaluation results saved to CSV in ../data/")

---

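**How to interpret the LLM-as-Judge RAG evaluation output:**

The `value_counts` output shows the proportion of answers that the LLM judge assigned to each class, for both the Minsearch and Elasticsearch RAG pipelines.  
- **RELEVANT:** Generated answers judged fully relevant to the user's question.
- **PARTLY_RELEVANT:** Answers judged partially relevant.
- **NON_RELEVANT:** Answers judged not relevant to the question.
- **ERROR:** The judge's reply could not be parsed as JSON.

This relevance-based evaluation gives a more realistic picture of user experience than strict retrieval metrics, because Hit Rate and MRR only check whether the exact source recipe was retrieved, whereas the judge assesses the answer the user actually sees.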
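
As a quick illustration of how the proportions are derived, the sketch below computes them from a list of judge labels using only the standard library. The labels are made-up sample data, not results from this notebook, and `relevance_proportions` is a hypothetical helper.

```python
from collections import Counter


# Illustrative sketch: turn judge labels into class proportions.
def relevance_proportions(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() orders classes from most to least frequent
    return {label: count / total for label, count in counts.most_common()}


labels = ["RELEVANT", "RELEVANT", "PARTLY_RELEVANT", "RELEVANT", "NON_RELEVANT"]
print(relevance_proportions(labels))
# {'RELEVANT': 0.6, 'PARTLY_RELEVANT': 0.2, 'NON_RELEVANT': 0.2}
```

With only 50 sampled questions per pipeline, a difference of one or two answers shifts a proportion by 2 to 4 percentage points, so small gaps between the two backends should be read with caution.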

---

- **Retrieval metrics (Hit Rate, MRR)** and **LLM-as-judge** scores show that the combined cover + hybrid + rerank approach is the best choice for both Minsearch and Elasticsearch.
- Minsearch is best suited to small datasets and ingredient coverage; Elasticsearch is the better option at scale.
- All evaluation results are saved to CSV in `../data/`.

---