### Evaluation Framework: Ground-Truth Generation

This section generates ground-truth questions for each recipe in the dataset.  
We use the OpenAI API to create realistic user questions for the first 100 recipes, which will later be used to evaluate the retrieval quality of different search approaches.  
The generated questions are saved to a CSV file for use in downstream evaluation.

In [None]:
import pandas as pd
from openai import OpenAI
from tqdm.auto import tqdm 
import json
import dotenv

dotenv.load_dotenv(dotenv_path="../.env") 

client = OpenAI()
df = pd.read_csv('../data/recipes_clean.csv')
if 'id' not in df.columns:
    df['id'] = range(len(df))
documents = df.to_dict(orient='records')

prompt_template = """
You emulate a user of our recipe assistant application.
Formulate 5 questions this user might ask based on a provided recipe.
Make the questions specific to ingredients or cooking methods in this recipe.
The record should contain the answer to the questions, and the questions should
be complete and not too short. Use as fewer words as possible from the record.

The record:

Recipe: {recipe_name}
Cuisine: {cuisine_type}
Main Ingredients: {main_ingredients}
Instructions: {instructions}
Dietary Info: {dietary_restrictions}

Provide the output in parsable JSON without using code blocks:

{{"questions": ["question1", "question2", ..., "question5"]}}
""".strip()

def generate_questions(doc):
    prompt = prompt_template.format(**doc)
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

results = {}
for i, doc in enumerate(tqdm(documents[:100])):
    doc_id = doc.get('id', i)
    if doc_id in results:
        continue
    try:
        questions_raw = generate_questions(doc)
        questions = json.loads(questions_raw)
        results[doc_id] = questions['questions']
    except (json.JSONDecodeError, KeyError):
        continue

final_results = []
for doc_id, questions in results.items():
    for q in questions:
        final_results.append((doc_id, q))

df_results = pd.DataFrame(final_results, columns=['id', 'question'])
df_results.to_csv('../data/ground-truth-retrieval.csv', index=False)

  0%|          | 0/100 [00:00<?, ?it/s]

### Manual LLM Question Generation for a Recipe

Test if the LLM can generate realistic, context-specific questions about a single recipe to build a ground-truth set of user questions for later evaluation of the retrieval system.

* Takes one recipe from the dataset.
* Builds a prompt using that recipe’s details.
* Asks the LLM to generate 5 user-like questions about that recipe.
* Inspect the quality and relevance of the generated questions.


In [18]:
# Pick a recipe to test
doc = documents[0] 

In [19]:
# Build the prompt 
prompt = prompt_template.format(**doc)

In [None]:
# Call the LLM
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{"role": "user", "content": prompt}]
)
questions = response.choices[0].message.content

In [21]:
import json
print(questions)
print(json.loads(questions))

{"questions": ["What type of pasta is used in the Spaghetti Carbonara recipe?", "How should the pancetta be cooked for the best texture?", "What should I do with the eggs and Parmesan before mixing with the spaghetti?", "Why is it important to toss the spaghetti and egg mixture off the heat?", "What should be added to the dish just before serving for flavor?"]}
{'questions': ['What type of pasta is used in the Spaghetti Carbonara recipe?', 'How should the pancetta be cooked for the best texture?', 'What should I do with the eggs and Parmesan before mixing with the spaghetti?', 'Why is it important to toss the spaghetti and egg mixture off the heat?', 'What should be added to the dish just before serving for flavor?']}


### Implement and Evaluate Multiple Retrieval Approaches
Define and compare different methods for searching the recipe database to find out which retrieval method works best for the use case.
* These functions are used to retrieve recipes in response to user queries.
* Later, these approaches are evaluated to see which retrieves the most relevant recipes.



First we implement a SimpleIndex class so that its search method processes the query and applies the retrieval strategies (like keyword matching, field boosting, or query expansion).

In [71]:
import re

class SimpleIndex:
    def __init__(self, documents):
        self.documents = documents

    def search(self, ingredients, num_results=5, boost_dict=None, diversity_boost=False, max_time=None):
        """
        ingredients: string, e.g. "tomato, flour, sugar"
        max_time: int or None, e.g. 10 (total prep + cook time in minutes)
        """
        import collections
        # Tokenize ingredients string
        query_tokens = set(re.sub(r'[^\w\s]', '', ingredients.lower()).replace(',', ' ').split())
        scored_docs = []
        cuisines = collections.Counter()
        meal_types = collections.Counter()
        for doc in self.documents:
            score = 0
            for field, boost in (boost_dict or {
                'main_ingredients': 2.0,
                'all_ingredients': 3.0,
                'instructions': 1.0,
                'prep_time_minutes': 1.5,
                'cook_time_minutes': 1.5
            }).items():
                field_value = doc.get(field, "")
                field_tokens = set(re.sub(r'[^\w\s]', '', str(field_value).lower()).replace(',', ' ').split())
                overlap = len(query_tokens & field_tokens)
                score += boost * overlap

            # Filter by max_time if provided
            if max_time is not None:
                try:
                    total_time = int(doc.get('prep_time_minutes', 0)) + int(doc.get('cook_time_minutes', 0))
                except Exception:
                    total_time = 99999
                if total_time > max_time:
                    score = 0

            if diversity_boost and score > 0:
                # Penalize if cuisine or meal_type is already common in scored_docs
                cuisine = doc.get('cuisine_type', '').lower()
                meal_type = doc.get('meal_type', '').lower()
                score -= cuisines[cuisine] * 0.5
                score -= meal_types[meal_type] * 0.5
            if score > 0:
                scored_docs.append((score, doc))
                cuisines[doc.get('cuisine_type', '').lower()] += 1
                meal_types[doc.get('meal_type', '').lower()] += 1
        scored_docs.sort(reverse=True, key=lambda x: x[0])
        return [doc for _, doc in scored_docs[:num_results]]

index = SimpleIndex(documents)

Now we implement several search strategies:
- **Basic keyword search:** Standard search over all indexed fields.
- **Ingredient-focused search:** Boosts ingredient fields for higher relevance.
- **Query expansion:** Expands user queries with culinary synonyms to improve recall.
- **cover_ingredients_search:** Selects recipes that together cover as many of the query ingredients as possible
- **Hybrid search:** Combines keyword and embedding-based similarity for best-practice retrieval.

In [None]:
# Approach 1: Basic Keyword Search
# Returns recipes that match the query using simple keyword matching across all fields.
def basic_search(query, index, max_time=None):
    return index.search(query, num_results=5, max_time=max_time)

# Approach 2: Boosting Search
# Similar to basic, but gives more weight to matches in time fields.
def boosting_search(query, index, max_time=None):
    boost_dict = {
            'main_ingredients': 2.0,
            'all_ingredients': 3.0,
            'instructions': 1.0,
            'prep_time_minutes': 2.5,
            'cook_time_minutes': 2.5
    }
    return index.search(query, num_results=5, boost_dict=boost_dict, max_time=max_time)

# Approach 3: Query Expansion (User Query Rewriting)
# Expands the query with synonyms (e.g., "chicken" → "poultry", "breast") before searching, to improve recall.
def expand_culinary_query(query):
    synonyms = {
        'chicken': ['poultry', 'fowl', 'breast', 'thigh'],
        'pasta': ['noodles', 'spaghetti', 'linguine', 'vermicelli', 'orzo', 'macaroni'],
        'tomatoes': ['tomato', 'roma', 'cherry tomatoes', 'heirloom'],
        'beef': ['steak', 'ground beef', 'brisket', 'sirloin'],
        'potatoes': ['spuds', 'russet', 'yukon', 'sweet potatoes'],
        'cheese': ['cheddar', 'mozzarella', 'parmesan', 'feta', 'gruyere'],
        'onion': ['onions', 'red onion', 'yellow onion', 'shallot'],
        'rice': ['basmati', 'jasmine', 'arborio', 'brown rice'],
        'shrimp': ['prawns', 'shellfish', 'seafood'],
        'eggs': ['egg', 'yolk', 'whites'],
        'herbs': ['basil', 'oregano', 'thyme', 'rosemary']
    }
    tokens = query.lower().split()
    expanded_tokens = tokens.copy()
    for token in tokens:
        if token in synonyms:
            expanded_tokens.extend(synonyms[token][:2])
    return ' '.join(expanded_tokens)

def query_expansion_search(query, index, max_time=None):
    expanded_query = expand_culinary_query(query)
    return index.search(expanded_query, num_results=5, max_time=max_time)

# Approach 4: Cover Ingredients Search
# Selects a set of recipes that together cover as many of the query ingredients as possible,
# optionally filtering out recipes that exceed the max_time constraint.
def cover_ingredients_search(query, index, num_results=5, max_time=None):
    query_tokens = set(re.sub(r'[^\w\s]', '', query.lower()).replace(',', ' ').split())
    uncovered = set(query_tokens)
    selected = []
    docs = index.documents.copy()
    while uncovered and len(selected) < num_results and docs:
        best_doc = None
        best_overlap = 0
        for doc in docs:
            # Time filter
            if max_time is not None:
                try:
                    total_time = int(doc.get('prep_time_minutes', 0)) + int(doc.get('cook_time_minutes', 0))
                except Exception:
                    total_time = 99999
                if total_time > max_time:
                    continue
            ingredients = set(re.sub(r'[^\w\s]', '', str(doc.get('all_ingredients', '')).lower()).replace(',', ' ').split())
            overlap = len(uncovered & ingredients)
            if overlap > best_overlap:
                best_overlap = overlap
                best_doc = doc
        if best_doc and best_overlap > 0:
            selected.append(best_doc)
            ingredients = set(re.sub(r'[^\w\s]', '', str(best_doc.get('all_ingredients', '')).lower()).replace(',', ' ').split())
            uncovered -= ingredients
            docs.remove(best_doc)
        else:
            break
    return selected

# Utility function that, given the query and the results, prints which query ingredients were not covered by the selected recipes
def print_unused_ingredients(ingredients, results):
    query_tokens = set(re.sub(r'[^\w\s]', '', ingredients.lower()).replace(',', ' ').split())
    used = set()
    for doc in results:
        used |= set(re.sub(r'[^\w\s]', '', str(doc.get('all_ingredients', '')).lower()).replace(',', ' ').split())
    unused = query_tokens - used
    print("Unused ingredients:", ", ".join(unused) if unused else "All ingredients used!")


In [74]:
# Approach 5: Hybrid Search
# Combines keyword and embedding-based (semantic) similarity for more robust retrieval.
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[text]
    )
    return np.array(response.data[0].embedding)

# Precompute embeddings for all recipes (do this once and cache)
if not hasattr(index, "embeddings"):
    index.embeddings = [get_embedding(doc['all_ingredients']) for doc in index.documents]

def hybrid_search(query, index, num_results=5, alpha=0.5, max_time=None):
    # Keyword search using SimpleIndex
    keyword_results = index.search(query, num_results=10, max_time=max_time)
    keyword_ids = [doc['id'] for doc in keyword_results]
    # Embedding search
    query_emb = get_embedding(query)
    similarities = [np.dot(query_emb, emb) for emb in index.embeddings]
    top_indices = np.argsort(similarities)[-10:][::-1]
    embedding_results = [index.documents[i] for i in top_indices]
    # Combine results (simple union, or weighted score)
    combined = {}
    for doc in keyword_results:
        combined[doc['id']] = alpha
    for doc in embedding_results:
        combined[doc['id']] = combined.get(doc['id'], 0) + (1 - alpha)
    # Sort by combined score
    sorted_ids = sorted(combined, key=combined.get, reverse=True)
    return [doc for doc in index.documents if doc['id'] in sorted_ids[:num_results]]

In [77]:
# Test each retrieval approach manually with a random ingredient list
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 20

results = basic_search(ingredients, index, max_time=max_time)
print("Basic Search:", results)
print_unused_ingredients(ingredients, results)

results = boosting_search(ingredients, index, max_time=max_time)
print("Ingredient Search:", results)
print_unused_ingredients(ingredients, results)

results = query_expansion_search(ingredients, index, max_time=max_time)
print("Query Expansion Search:", results)
print_unused_ingredients(ingredients, results)

results = hybrid_search(ingredients, index, max_time=max_time)
print("Hybrid Search:", results)
print_unused_ingredients(ingredients, results)

results = cover_ingredients_search(ingredients, index, num_results=5, max_time=max_time)
for doc in results:
    print(doc['recipe_name'], "-", doc['all_ingredients'])
print_unused_ingredients(ingredients, results)

Basic Search: [{'recipe_name': 'Middle Eastern Pork Crumble', 'cuisine_type': 'Middle Eastern', 'meal_type': 'Desserts', 'difficulty_level': 'Medium', 'prep_time_minutes': 15, 'cook_time_minutes': 0, 'servings': 4, 'main_ingredients': 'Pork, Fish, Cheese', 'all_ingredients': 'Pork, Fish, Cheese, Cream, Butter, Onion, Honey', 'dietary_restrictions': 'Gluten-free, Contains nuts', 'instructions': 'Prepare Pork, Fish, Cheese. Cook with Pork, Fish, Cheese, Cream, Butter, Onion, Honey. Season and serve.', 'nutritional_info': 'Calories: 248, Protein: 7g, Carbs: 65g, Fat: 15g', 'id': 39}, {'recipe_name': 'Indian Beef Bake', 'cuisine_type': 'Indian', 'meal_type': 'Breakfast', 'difficulty_level': 'Easy', 'prep_time_minutes': 11, 'cook_time_minutes': 6, 'servings': 3, 'main_ingredients': 'Beef, Pasta, Eggs', 'all_ingredients': 'Beef, Pasta, Eggs, Onion, Cream, Butter, Honey', 'dietary_restrictions': 'Contains shellfish, Vegan', 'instructions': 'Prepare Beef, Pasta, Eggs. Cook with Beef, Pasta, Eg

### Inspect Fields
Examine specific fields of each recipe in the results to understand why it was retrieved.

In [78]:
for recipe in cover_ingredients_search(ingredients, index, num_results=5, max_time=max_time):
    print(f"Name: {recipe['recipe_name']}")
    print(f"Main Ingredients: {recipe['main_ingredients']}")
    print(f"All Ingredients: {recipe['all_ingredients']}")
    print(f"Instructions: {recipe['instructions']}")
    print(f"Cuisine: {recipe['cuisine_type']}")
    print(f"Preparation Time: {recipe['prep_time_minutes']} minutes")
    print(f"Cooking Time: {recipe['cook_time_minutes']} minutes")
    print("-" * 40)

Name: Middle Eastern Pork Crumble
Main Ingredients: Pork, Fish, Cheese
All Ingredients: Pork, Fish, Cheese, Cream, Butter, Onion, Honey
Instructions: Prepare Pork, Fish, Cheese. Cook with Pork, Fish, Cheese, Cream, Butter, Onion, Honey. Season and serve.
Cuisine: Middle Eastern
Preparation Time: 15 minutes
Cooking Time: 0 minutes
----------------------------------------
Name: Chinese Shrimp Salad
Main Ingredients: Shrimp, Beef, Tofu
All Ingredients: Shrimp, Beef, Tofu, Flour, Honey, Sugar, Soy Sauce
Instructions: Prepare Shrimp, Beef, Tofu. Cook with Shrimp, Beef, Tofu, Flour, Honey, Sugar, Soy Sauce. Season and serve.
Cuisine: Chinese
Preparation Time: 6 minutes
Cooking Time: 9 minutes
----------------------------------------
Name: Avocado Toast
Main Ingredients: Bread, Avocado
All Ingredients: Whole-grain Bread, Ripe Avocado, Lemon Juice, Salt, Pepper, Olive Oil
Instructions: Toast bread slices. Mash avocado with lemon juice, salt, and pepper. Spread on toast and drizzle with olive o

### Manual RAG Test
Test the full RAG pipeline for answering a user query to see how well the system can answer real user questions using both retrieval and generation. Manual LLM Question Generation is about making questions for evaluation. Manual RAG Test is about answering a user’s question using retrieval + LLM.
* Takes a user query (e.g., ""tomato, flour, sugar, chicken, ...").
* Uses the retrieval system (e.g., hybrid_search) to find the most relevant recipes.
* Builds a context string from the retrieved recipes.
* Asks the LLM to answer the user’s question using only the retrieved context.
* Inspect the quality of the final answer.
All retrieval approaches are relevant, but we will use Hybrid Search for our RAG test, since it balances relevance and diversity.

In [79]:
# Pick a query
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 20

In [80]:
# Run the cover_ingredients_search retrieval
results = cover_ingredients_search(ingredients, index, num_results=5, max_time=max_time)

In [81]:
# Build a context string
context = "\n\n".join(
    [
        f"Recipe: {doc['recipe_name']}\n"
        f"Ingredients: {doc['main_ingredients']}\n"
        f"Cuisine: {doc.get('cuisine_type', '')}\n"
        f"Instructions: {doc.get('instructions', '')}\n"
        f"Total time: {doc.get('prep_time_minutes', 0)} + {doc.get('cook_time_minutes', 0)}\n"
        for doc in results
    ]
)

In [82]:
# Build an answer prompt
prompt = f"""
You're a chef assistant. Based on the CONTEXT from our recipes database, for each recipe, list:
- The recipe name
- The Cuisine
- Which of the provided INGREDIENTS it uses
- The Instructions
- The Total time

Then, list any INGREDIENTS that are not used in any recipe.

INGREDIENTS: {ingredients}

CONTEXT:
{context}
""".strip()

In [83]:
# Get the answer from the LLM
answer = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{"role": "user", "content": prompt}]
).choices[0].message.content
print(answer)

Based on the provided CONTEXT, here are the details for each recipe:

### Recipes

1. **Recipe Name:** Middle Eastern Pork Crumble  
   **Cuisine:** Middle Eastern  
   **Ingredients Used:** Cheese (used), Butter (used), Onion (used)  
   **Instructions:** Prepare Pork, Fish, Cheese. Cook with Pork, Fish, Cheese, Cream, Butter, Onion, Honey. Season and serve.  
   **Total Time:** 15 minutes  

2. **Recipe Name:** Chinese Shrimp Salad  
   **Cuisine:** Chinese  
   **Ingredients Used:** Shrimp (used), Flour (used), Sugar (used)  
   **Instructions:** Prepare Shrimp, Beef, Tofu. Cook with Shrimp, Beef, Tofu, Flour, Honey, Sugar, Soy Sauce. Season and serve.  
   **Total Time:** 15 minutes  

3. **Recipe Name:** Avocado Toast  
   **Cuisine:** American  
   **Ingredients Used:** None from the provided set  
   **Instructions:** Toast bread slices. Mash avocado with lemon juice, salt, and pepper. Spread on toast and drizzle with olive oil.  
   **Total Time:** 5 minutes  

4. **Recipe Name

### Document Re-ranking and Query Rewriting with LLM

To further improve retrieval quality, we use the LLM to:
- **Re-rank candidate recipes** based on their relevance to the rewritten query.
- **Rewrite user queries** to be more specific and clear, leveraging LLM understanding.
These steps represent best practices for maximizing retrieval relevance.

In [92]:
def rerank_with_llm(query, candidates):
    context = "\n\n".join([f"Recipe: {doc['recipe_name']}\nIngredients: {doc['main_ingredients']}" for doc in candidates])
    prompt = f"""
Given the following user query and candidate recipes, rank the recipes from most to least relevant.

Query: {query}

Candidates:
{context}

Return a JSON list of recipe names in ranked order.
""".strip()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    import re
    content = response.choices[0].message.content
    # Try to extract JSON list from the response
    json_match = re.search(r'\[.*\]', content, re.DOTALL)
    if json_match:
        ranked_names = json.loads(json_match.group())
    else:
        # fallback: return candidates as-is if parsing fails
        return candidates
    ranked_docs = [doc for name in ranked_names for doc in candidates if doc['recipe_name'] == name]
    return ranked_docs

In [None]:
# User Query Rewriting (LLM-based)
def rewrite_query(query):
    prompt = f"Rewrite this user query for a recipe search system to be more specific and clear: '{query}'"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()

# Approaches list for evaluation
approaches = [
    ("Approach 1 (Basic)", basic_search),
    ("Approach 2 (Boosted)", ingredient_search),
    ("Approach 3 (Expanded)", query_expansion_search),
    ("Approach 4 (Hybrid)", hybrid_search),
    ("Approach 5 (Cover Ingredients)", lambda q, idx, max_time=None: cover_ingredients_search(q, idx, num_results=5, max_time=max_time))
]

# Example evaluation loop including query rewriting and re-ranking
test_queries = ["chicken pasta", "beef potatoes", "vegetarian rice", "chocolate dessert"]
max_time = 20  

for query in test_queries:
    rewritten = rewrite_query(query)
    hybrid_results = hybrid_search(rewritten, index, max_time=max_time)
    reranked = rerank_with_llm(rewritten, hybrid_results)

To implement and evaluate multiple prompt strategies, we define several prompt templates for the LLM, each focusing on a different aspect of recipe recommendation:
- **Basic recommendation**
- **Ingredient substitution**
- **Nutritional and dietary focus**
These prompt strategies can be tested to determine which yields the most helpful and relevant responses.

In [94]:
# Prompt Strategy 1: Basic Recommendation
def get_prompt_strategy_1():
    return """
You are a chef assistant. Based on the available recipes, recommend dishes that use the requested ingredients.
Provide the recipe name, brief description, and cooking instructions and time.

CONTEXT:
{context}

QUESTION: {question}

Answer:
""".strip()

# Prompt Strategy 2: Ingredient-Substitution Focused
def get_prompt_strategy_2():
    return """
You are an expert chef specializing in ingredient substitutions. When users provide ingredients,
recommend recipes and suggest alternatives for missing ingredients. Always explain possible substitutions
and how they might affect the dish.

CONTEXT:
{context}

QUESTION: {question}

Provide recommendations with substitution suggestions:
""".strip()

# Prompt Strategy 3: Nutritional and Dietary Focused
def get_prompt_strategy_3():
    return """
You are a nutritionist and chef. Recommend recipes based on ingredients provided, considering
nutritional value and dietary restrictions. Highlight health benefits and suggest modifications
for different dietary needs (vegetarian, gluten-free, etc.).

CONTEXT:
{context}

QUESTION: {question}

Provide nutritionally-aware recommendations:
""".strip()

In [None]:
# Run all prompt strategies for a sample query

# 1. Retrieve recipes
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, \
    garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 60
results = cover_ingredients_search(ingredients, index, max_time=max_time)

# 2. Build context string
context = "\n\n".join(
    [
        f"Recipe: {doc['recipe_name']}\n"
        f"Ingredients: {doc['main_ingredients']}\n"
        f"Cuisine: {doc.get('cuisine_type', '')}\n"
        f"Instructions: {doc.get('instructions', '')}\n"
        f"Total time: {doc.get('prep_time_minutes', 0)} + {doc.get('cook_time_minutes', 0)}\n"
        for doc in results
    ]
)

# 3. Prepare the user question
question = "What can I cook with tomato, flour, sugar, chicken, rice, butter, chocolate, \
    shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme in under 60 minutes?"

# 4. Use each prompt strategy
prompt1 = get_prompt_strategy_1().format(context=context, question=question)
prompt2 = get_prompt_strategy_2().format(context=context, question=question)
prompt3 = get_prompt_strategy_3().format(context=context, question=question)

# 5. Get LLM responses for each strategy
response1 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt1}]
).choices[0].message.content

response2 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt2}]
).choices[0].message.content

response3 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt3}]
).choices[0].message.content

print("=== Basic Recommendation ===")
print(response1)
print("\n=== Ingredient Substitution ===")
print(response2)
print("\n=== Nutritional/Dietary Focus ===")
print(response3)

=== Basic Recommendation ===
Based on the ingredients you provided, here are some recipes that you can cook in under 60 minutes:

### 1. **Chicken and Rice Stir-Fry**
**Description:** A quick and delicious stir-fry featuring chicken, rice, vegetables, and a savory sauce.

**Instructions:**
1. Prepare the chicken by cutting it into bite-sized pieces.
2. In a pan, melt butter and sauté chopped garlic and onion until fragrant.
3. Add the chicken and cook until browned.
4. Add cooked rice, diced bell peppers, and any additional vegetables you like.
5. Season with soy sauce and cook until everything is heated through. 

**Total Time:** 20 minutes

---

### 2. **Garlic Shrimp and Pasta**
**Description:** A delightful pasta dish cooked with garlic shrimp and a touch of lemon.

**Instructions:**
1. Cook the pasta according to the package instructions.
2. In a large skillet, melt butter and sauté minced garlic.
3. Add shrimp and cook until pink, about 2-3 minutes.
4. Toss in the cooked pasta, a

### Evaluation Metrics and Selection of Best Approaches

We define metrics to quantitatively evaluate retrieval quality:
- **Relevance score:** Measures ingredient overlap between query and results.
- **Diversity score:** Measures variety in cuisine and meal types.
- **LLM-as-judge:** Uses the LLM to rate the quality of generated answers.
We use these metrics to compare all retrieval approaches and select the best configuration.

In [96]:
def calculate_relevance_score(query_terms, search_results):
    if not search_results:
        return 0.0
    query_tokens = set(query_terms.lower().split())
    total_score = 0
    for result in search_results:
        text_content = f"{result['recipe_name']} {result['main_ingredients']} {result['all_ingredients']}".lower()
        content_tokens = set(text_content.split())
        overlap = len(query_tokens.intersection(content_tokens))
        score = overlap / len(query_tokens) if query_tokens else 0
        total_score += score
    return total_score / len(search_results)

def calculate_diversity_score(search_results):
    if not search_results:
        return 0.0
    cuisines = set()
    meal_types = set()
    for result in search_results:
        cuisines.add(result.get('cuisine_type', '').lower())
        meal_types.add(result.get('meal_type', '').lower())
    cuisines.discard('')
    meal_types.discard('')
    diversity = (len(cuisines) + len(meal_types)) / 2
    return min(diversity, 5.0)

def llm_as_judge_evaluation(query, response):
    judge_prompt = f"""
    Evaluate this recipe recommendation response on a scale of 1-5:
    Query: {query}
    Response: {response}
    Rate based on:
    1. Relevance to ingredients (1-5)
    2. Practicality of recipes (1-5)
    3. Clarity of instructions (1-5)
    4. Helpfulness of suggestions (1-5)
    Provide your evaluation in JSON format:
    {{
        "relevance": <score>,
        "practicality": <score>,
        "clarity": <score>,
        "helpfulness": <score>,
        "overall": <average_score>,
        "explanation": "Brief explanation"
    }}
    """.strip()
    try:
        response_eval = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": judge_prompt}]
        )
        eval_text = response_eval.choices[0].message.content
        import re
        json_match = re.search(r'\{.*\}', eval_text, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        else:
            return {"error": "Could not parse evaluation"}
    except Exception as e:
        return {"error": str(e)}

### Evaluate All Approaches and Select the Best

This section runs the evaluation loop for all retrieval approaches:
- For each approach and each test query, the query is rewritten, results are retrieved, and optionally re-ranked.
- Relevance and diversity scores are computed for both original and reranked results.
- The best approach is selected based on the highest combined reranked scores.

In [None]:
# Evaluate all approaches and prompt strategies, select the best based on metrics and LLM-as-judge.
evaluation_queries_test = [
    "chicken, pasta, tomato, garlic",
    "beef, potatoes, onion, thyme",
    "rice, bell pepper, eggs, shrimp",
    "flour, sugar, chocolate, butter"
]
evaluation_results = {}

for approach_name, approach_func in approaches:
    print(f"\nEvaluating {approach_name}...")
    relevance_scores = []
    diversity_scores = []
    rerank_relevance_scores = []
    rerank_diversity_scores = []
    llm_judge_scores = []

    for query in evaluation_queries_test:
        # User query rewriting (LLM-based)
        rewritten_query = rewrite_query(query)
        # Run retrieval approach (including hybrid)
        results = approach_func(rewritten_query, index)
        # Optionally, apply document re-ranking with LLM
        reranked_results = rerank_with_llm(rewritten_query, results)

        # Evaluate original results
        relevance = calculate_relevance_score(rewritten_query, results)
        diversity = calculate_diversity_score(results)
        relevance_scores.append(relevance)
        diversity_scores.append(diversity)

        # Evaluate reranked results
        rerank_relevance = calculate_relevance_score(rewritten_query, reranked_results)
        rerank_diversity = calculate_diversity_score(reranked_results)
        rerank_relevance_scores.append(rerank_relevance)
        rerank_diversity_scores.append(rerank_diversity)

        # LLM-as-judge evaluation (on reranked results)
        if reranked_results:
            # You can join the top reranked results into a string for the judge
            response_text = "\n".join([f"{doc['recipe_name']}: {doc.get('main_ingredients', '')}" for doc in reranked_results])
            judge_eval = llm_as_judge_evaluation(rewritten_query, response_text)
            llm_judge_scores.append(judge_eval.get("overall", 0))
        else:
            llm_judge_scores.append(0)

    evaluation_results[approach_name] = {
        'avg_relevance': np.mean(relevance_scores),
        'avg_diversity': np.mean(diversity_scores),
        'avg_rerank_relevance': np.mean(rerank_relevance_scores),
        'avg_rerank_diversity': np.mean(rerank_diversity_scores),
        'avg_llm_judge': np.mean(llm_judge_scores)
    }

    print(f"  Average Relevance: {np.mean(relevance_scores):.3f}")
    print(f"  Average Diversity: {np.mean(diversity_scores):.3f}")
    print(f"  Reranked Average Relevance: {np.mean(rerank_relevance_scores):.3f}")
    print(f"  Reranked Average Diversity: {np.mean(rerank_diversity_scores):.3f}")
    print(f"  LLM-as-Judge Score: {np.mean(llm_judge_scores):.3f}")


Evaluating Approach 1 (Basic)...
  Average Relevance: 0.146
  Average Diversity: 3.750
  Reranked Average Relevance: 0.115
  Reranked Average Diversity: 2.750
  LLM-as-Judge Score: 1.500

Evaluating Approach 2 (Boosted)...
  Average Relevance: 0.150
  Average Diversity: 4.000
  Reranked Average Relevance: 0.111
  Reranked Average Diversity: 3.000
  LLM-as-Judge Score: 1.562

Evaluating Approach 3 (Expanded)...
  Average Relevance: 0.138
  Average Diversity: 3.875
  Reranked Average Relevance: 0.107
  Reranked Average Diversity: 2.875
  LLM-as-Judge Score: 1.375

Evaluating Approach 4 (Hybrid)...
  Average Relevance: 0.129
  Average Diversity: 3.750
  Reranked Average Relevance: 0.095
  Reranked Average Diversity: 2.750
  LLM-as-Judge Score: 1.562

Evaluating Approach 5 (Cover Ingredients)...
  Average Relevance: 0.108
  Average Diversity: 1.375
  Reranked Average Relevance: 0.064
  Reranked Average Diversity: 0.750
  LLM-as-Judge Score: 0.875


### Summarize and Save Evaluation Results

Finally, we select the best retrieval approach based on evaluation metrics and save the results for future use.  
This ensures that the most effective configuration is documented and reproducible.

In [99]:
# Determine best approaches (can use reranked scores if better)
best_retrieval = max(
    evaluation_results.items(),
    key=lambda x: x[1]['avg_rerank_relevance'] + x[1]['avg_rerank_diversity']
)

print(f"\n=== EVALUATION SUMMARY ===")
print(f"Best Retrieval Approach (Reranked): {best_retrieval[0]}")
print(f"  Combined Reranked Score: {best_retrieval[1]['avg_rerank_relevance'] + best_retrieval[1]['avg_rerank_diversity']:.3f}")


=== EVALUATION SUMMARY ===
Best Retrieval Approach (Reranked): Approach 2 (Boosted)
  Combined Reranked Score: 3.111


In [100]:
# Save results for production use
final_config = {
    'best_retrieval_approach': best_retrieval[0],
    'evaluation_results': evaluation_results,
    'recommendations': {
        'retrieval': 'Use hybrid search with query rewriting and LLM-based re-ranking',
        'prompt': 'Focus on practical recommendations with substitutions'
    }
}

with open('../data/evaluation_results.json', 'w') as f:
    json.dump(final_config, f, indent=2)

print("Evaluation results saved to '../data/evaluation_results.json'")
print("Ground truth data saved to '../data/ground-truth-retrieval.csv'")

Evaluation results saved to '../data/evaluation_results.json'
Ground truth data saved to '../data/ground-truth-retrieval.csv'


### Production Retrieval Approach

Although the evaluation framework selects the "Boosted" approach as the best based on reranked metrics, we use `cover_ingredients_search` as the default retrieval method in production. This is because, for our small dataset, it provides better ingredient coverage and more diverse recipe results for users.