## Retrieval-Augmented Generation (RAG) for Recipe Assistant: RAG Evaluation

This notebook provides an evaluation framework for the RAG system described in `rag-flow.ipynb`.  
It covers ground-truth generation, retrieval metrics (Hit Rate, MRR), parameter optimization, and LLM-as-judge answer quality for both Minsearch and Elasticsearch using the best-performing combined retrieval and reranking approaches.

---

### 1. Set-up Dependencies, OpenAI Client and Dataset

#### 1.1 Dependencies and OpenAI Client
Import dependencies, load OpenAI API key and connect to OpenAI API. 

In [1]:
import pandas as pd
import numpy as np
import json
import random
import re
from tqdm.auto import tqdm
import dotenv
import minsearch
from elasticsearch import Elasticsearch
from openai import OpenAI

In [2]:
dotenv.load_dotenv("../.env")
client = OpenAI()

#### 1.2. Load and Index Recipe Data
Load the recipe dataset from CSV file and prepare it for indexing.

In [3]:
df = pd.read_csv('../data/recipes_clean.csv')

# Add an ID column if it doesn't exist
if 'id' not in df.columns:
    df['id'] = range(len(df))
    
# Create documents for indexing
documents = df.to_dict(orient='records')
documents[0]

{'recipe_name': 'Spaghetti Carbonara',
 'cuisine_type': 'Italian',
 'meal_type': 'Dinner',
 'difficulty_level': 'Medium',
 'prep_time_minutes': 15,
 'cook_time_minutes': 20,
 'servings': 4,
 'main_ingredients': 'Spaghetti, Eggs, Pancetta, Parmesan',
 'all_ingredients': 'Spaghetti, Eggs, Pancetta, Parmesan, Black Pepper, Olive Oil, Salt',
 'dietary_restrictions': 'Contains gluten, dairy, pork',
 'instructions': 'Boil spaghetti until al dente. Fry pancetta until crispy. Whisk eggs with grated Parmesan. Toss hot spaghetti with pancetta and egg mixture off heat. Serve immediately with black pepper.',
 'nutritional_info': 'Calories: 520, Protein: 22g, Carbs: 60g, Fat: 22g',
 'id': 0}

### 2. Utility Functions

#### 2.1 Create Time Filter
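Users can filter recipes by maximum total time (preparation plus cooking). Recipes whose time fields cannot be parsed are treated as too long and are excluded whenever a limit is set.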

In [4]:
def filter_by_max_time(results, max_time=None):
    if max_time is None:
        return results
    filtered = []
    for doc in results:
        try:
            total_time = int(doc.get('prep_time_minutes', 0)) + int(doc.get('cook_time_minutes', 0))
        except Exception:
            total_time = 99999
        if total_time <= max_time:
            filtered.append(doc)
    return filtered

#### 2.2. Deduplicate Search Results
Elasticsearch and LLMs can sometimes return duplicate or near-duplicate documents.

In [5]:
def deduplicate_results(results):
    seen = set()
    deduped = []
    for doc in results:
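        # Use 'id' as the key; fall back to a tuple of fields only when 'id' is missing.
        # (An `or` fallback would also trigger for id 0, which is a valid id in this dataset.)
        doc_id = doc.get('id')
        key = doc_id if doc_id is not None else (doc.get('recipe_name'), doc.get('prep_time_minutes'), doc.get('cook_time_minutes'))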
        if key not in seen:
            seen.add(key)
            deduped.append(doc)
    return deduped
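
The first document shown earlier has `id == 0`, and `0` is falsy in Python, so how the deduplication key handles a falsy id matters. The following is a minimal sketch of the difference between an `or` fallback and an explicit `None` check. `key_with_or`, `key_with_none_check`, and `_fallback` are hypothetical helper names used only for illustration.

```python
def _fallback(doc):
    # Tuple of fields used when no id is available
    return (doc.get('recipe_name'), doc.get('prep_time_minutes'), doc.get('cook_time_minutes'))

def key_with_or(doc):
    # `or` falls through to the tuple for any falsy id, including 0
    return doc.get('id') or _fallback(doc)

def key_with_none_check(doc):
    # Only fall back when the id is actually missing
    doc_id = doc.get('id')
    return doc_id if doc_id is not None else _fallback(doc)

first = {'id': 0, 'recipe_name': 'Spaghetti Carbonara',
         'prep_time_minutes': 15, 'cook_time_minutes': 20}

print(key_with_or(first))          # ('Spaghetti Carbonara', 15, 20)
print(key_with_none_check(first))  # 0
```

Both keys are usable for deduplication, but mixing id-based and tuple-based keys in one `seen` set makes the behaviour harder to reason about.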

### 3. Ground Truth Generation

#### 3.1. Generate ground-truth user questions for each recipe using the LLM.  
This is used to evaluate retrieval quality.

In [None]:
# Prompt template: all recipe fields are passed so the LLM can write realistic user questions that follow the instructions below.
prompt_template = """
You are emulating a user of a recipe assistant.
Given the following recipe record, generate 5 realistic user queries that could be answered by retrieving this recipe.
Each query should be phrased as if a user is searching for a recipe to cook, mentioning available ingredients and/or time constraints.
- At least 3 queries must mention specific ingredient combinations (e.g., "What can I cook with eggs, pancetta, and parmesan?").
- At least 2 queries must mention a time constraint (e.g., "I need a dinner recipe that takes less than 40 minutes.").
- Do NOT mention the recipe name or copy long phrases from the record.
- Paraphrase naturally and vary the style of the queries.
- Each query should be answerable using only the information in the recipe record.

The recipe record:
Recipe Name: {recipe_name}
Cuisine: {cuisine_type}
Meal Type: {meal_type}
Difficulty: {difficulty_level}
Prep Time: {prep_time_minutes} minutes
Cook Time: {cook_time_minutes} minutes
Servings: {servings}
Main Ingredients: {main_ingredients}
All Ingredients: {all_ingredients}
Dietary Restrictions: {dietary_restrictions}
Instructions: {instructions}
Nutritional Info: {nutritional_info}

Return the output as valid JSON (no code block), in this format:
{{"queries": ["query1", "query2", "query3", "query4", "query5"]}}
""".strip()
    
def generate_queries(doc):
    """
    Fill in the prompt template with the recipe fields, send it to the LLM
    (OpenAI API), and return the raw response (a JSON string with the queries).
    """
    prompt = prompt_template.format(**doc)
    # Send the prompt to the LLM; the caller parses the JSON response
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

results = {}
for i, doc in enumerate(tqdm(documents[:100])):
    doc_id = doc.get('id', i)
    if doc_id in results:
        continue
    try:
        queries_raw = generate_queries(doc)
        # Extract the questions from the LLM’s output
        queries = json.loads(queries_raw)
        results[doc_id] = queries['queries']
    except (json.JSONDecodeError, KeyError):
        continue

final_results = []
for doc_id, queries in results.items():
    for q in queries:
        final_results.append((doc_id, q))
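
The loop above skips a recipe whenever `json.loads` fails. Chat models sometimes wrap JSON in a Markdown code fence even when asked not to, which would cause such skips. Below is a minimal sketch of a more tolerant parser. `parse_queries` is a hypothetical helper, not part of the original notebook, and it assumes the response contains a single JSON object with a `queries` list.

```python
import json
import re

def parse_queries(raw):
    """Extract the 'queries' list from an LLM response, tolerating code fences."""
    # Grab the outermost {...} span so surrounding fences or prose are ignored
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if match is None:
        return []
    try:
        payload = json.loads(match.group())
    except json.JSONDecodeError:
        return []
    queries = payload.get('queries', [])
    if not isinstance(queries, list):
        return []
    # Keep only non-empty strings
    return [q.strip() for q in queries if isinstance(q, str) and q.strip()]

fence = "`" * 3  # build the fence marker programmatically
fenced = (fence + "json\n"
          + '{"queries": ["What can I cook with eggs and pancetta?", "Dinner in under 40 minutes?"]}'
          + "\n" + fence)

print(parse_queries(fenced))
print(parse_queries("not json"))  # []
```

Returning an empty list instead of raising keeps the calling loop simple: a recipe with no parsed queries can be logged and retried later.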

#### 3.2. Save and Re-load Ground Truth Questions

The generated questions were saved once with the commented-out cell below; they are reloaded here for evaluation.

In [None]:
#df_results = pd.DataFrame(final_results, columns=['id', 'question'])
#df_results.to_csv('../data/ground-truth-retrieval.csv', index=False)

In [6]:
df_gt = pd.read_csv('../data/ground-truth-retrieval.csv')
ground_truth = df_gt.to_dict(orient='records')

### 4. Minsearch Setup and Retrieval Strategies
Set up Minsearch retrieval backend as in the main RAG notebook.

#### 4.1. Minsearch Index 

Prep/cook time and servings are numeric, and not suitable for Minsearch's default text/keyword search.


In [None]:
index = minsearch.Index(
    text_fields=['recipe_name', 'main_ingredients', 'all_ingredients',
                 'instructions', 'cuisine_type', 'dietary_restrictions'],
    keyword_fields=['meal_type', 'difficulty_level'] # Categorical values
)

# Fit/train the index on the recipe documents
index.fit(documents)
index.documents = index.docs

#### 4.2. Generate an Embedding Vector (the user's search input at runtime)
Hybrid retrieval combines keyword matching with OpenAI's embedding-based (semantic) similarity for more robust retrieval. `index.documents` (an alias for `index.docs`) is the list of recipe dictionaries, i.e. the original documents. An embedding is generated with the OpenAI API for each recipe's `all_ingredients` text and stored in `index.embeddings`, a list/array with one vector per recipe in the same order as `index.documents` (and therefore the `documents` list). For Elasticsearch, the same vectors are added to each document under the `all_ingredients_vector` field before indexing. The user's query is embedded at runtime with the same model, and these vectors are used for semantic search (vector similarity) in both Minsearch and Elasticsearch retrieval.

In [8]:
def get_embedding(text):
    response = client.embeddings.create( # OpenAI client
        model="text-embedding-3-small",
        input=[text]
    )
    return np.array(response.data[0].embedding)

In [None]:
"""# Precompute embeddings for all recipes if not saved in rag-flow.ipynb 
# DO THIS ONCE AND CACHE: rerun only if the recipes_clean.csv dataset changes
if not hasattr(index, "embeddings"):
    index.embeddings = [get_embedding(doc['all_ingredients']) for doc in index.documents]
np.save('embeddings.npy', np.array(index.embeddings))"""

In [9]:
# Load embeddings from file for future use/new notebook session
index.embeddings = np.load('embeddings.npy', allow_pickle=True)
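
The retrieval functions in this notebook score candidates with `np.dot`. A dot product equals cosine similarity only when both vectors have unit length. OpenAI's embedding documentation states that its embeddings are normalized to length 1, but vectors from other sources may not be. Here is a minimal sketch with toy vectors; `cosine_similarity` is a hypothetical helper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; independent of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([3.0, 4.0])    # length 5, not normalized
recipe_vec = np.array([6.0, 8.0])   # same direction, length 10

print(np.dot(query_vec, recipe_vec))             # 50.0: grows with vector length
print(cosine_similarity(query_vec, recipe_vec))  # 1.0: same direction
```

If the embedding model is ever swapped for one that does not normalize its output, replacing `np.dot` with a cosine function like this keeps the ranking comparable.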

#### 4.3. Retrieval Strategies

##### Approach 5: Cover Ingredients Search
Selects a set of recipes that together cover as many of the query ingredients as possible, optionally filtering out recipes that exceed the max_time constraint.


In [10]:
def tokenize_ingredients(ingredient_str):
    # Split only on commas, strip whitespace, and lowercase
    return set([ing.strip().lower() for ing in ingredient_str.split(',') if ing.strip()])

In [11]:
def ms_cover_ingredients_search(query, index, num_results=5, max_time=None):
    """
    Selects a set of recipes that together cover as many of the query ingredients as possible,
    optionally filtering out recipes that exceed the max_time constraint.
    """
    query_ings = tokenize_ingredients(query)
    uncovered = set(query_ings)
    selected = []
    docs = index.documents.copy() 

    # Apply time filter to docs at the start for efficiency
    docs = filter_by_max_time(docs, max_time)

    while uncovered and len(selected) < num_results and docs:
        best_doc = None
        best_overlap = 0
        for doc in docs:
            recipe_ings = tokenize_ingredients(doc.get('all_ingredients', ''))
            overlap = len(uncovered & recipe_ings)
            if overlap > best_overlap:
                best_overlap = overlap
                best_doc = doc
        if best_doc and best_overlap > 0:
            selected.append(best_doc)
            recipe_ings = tokenize_ingredients(best_doc.get('all_ingredients', ''))
            uncovered -= recipe_ings
            docs.remove(best_doc)
        else:
            break
    # Deduplicate results before returning (recommended if docs may have overlap)
    deduped_results = deduplicate_results(selected)
    return deduped_results
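
To make the greedy selection concrete, here is a small self-contained sketch on three made-up recipes. `greedy_cover` and `tokenize` are simplified, hypothetical restatements of the functions above (no time filter or deduplication), and the toy data is invented for illustration.

```python
def tokenize(ingredient_str):
    # Same rule as tokenize_ingredients: split on commas, strip, lowercase
    return {ing.strip().lower() for ing in ingredient_str.split(',') if ing.strip()}

def greedy_cover(query, docs, num_results=5):
    uncovered = tokenize(query)
    remaining = list(docs)
    selected = []
    while uncovered and remaining and len(selected) < num_results:
        # Pick the recipe that covers the most still-uncovered ingredients
        best = max(remaining, key=lambda d: len(uncovered & tokenize(d['all_ingredients'])))
        if not uncovered & tokenize(best['all_ingredients']):
            break  # nothing left can cover any remaining ingredient
        selected.append(best)
        uncovered -= tokenize(best['all_ingredients'])
        remaining.remove(best)
    return selected

toy_docs = [
    {'recipe_name': 'Omelette', 'all_ingredients': 'Eggs, Butter, Cheese'},
    {'recipe_name': 'Fried Rice', 'all_ingredients': 'Rice, Eggs, Garlic'},
    {'recipe_name': 'Garlic Bread', 'all_ingredients': 'Bread, Garlic, Butter'},
]

picked = greedy_cover('eggs, butter, rice, garlic', toy_docs)
print([d['recipe_name'] for d in picked])  # ['Fried Rice', 'Omelette']

# A natural-language question without commas becomes one long token,
# which matches no ingredient, so nothing is selected.
print(greedy_cover('What can I cook with eggs and rice?', toy_docs))  # []
```

Fried Rice is chosen first because it covers three of the four query ingredients. Omelette then covers the remaining `butter`, and the loop stops because nothing is left uncovered.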

#### Combined Pipeline Approach: Combine Cover and Hybrid RAG Pipelines (Minsearch)

Combine the two approaches by first running Cover Ingredients Search to select a diverse set of recipes that cover as many ingredients as possible, and then passing those results through the Hybrid Search to rerank or filter them by semantic relevance.

In [12]:
def ms_cover_then_hybrid_search(query, index, num_results=5, max_time=None, hybrid_top_k=5):
    """
    Combine Cover Ingredients Search and Hybrid Search:
    1. Run Cover Ingredients Search to get a diverse set of recipes (pool size = num_results*4).
    2. Rerank the pool by semantic similarity to the query using OpenAI embeddings.
    3. Deduplicate results before returning.
    """
    # Step 1: Cover Ingredients Search to get a diverse set and increase the pool size
    cover_results = ms_cover_ingredients_search(query, index, num_results=num_results*4, max_time=max_time)
    if not cover_results:
        return []
    # Step 2: Hybrid Search, but restrict to cover_results as the candidate pool
    # Compute embedding for the query
    query_emb = get_embedding(query)
    # Compute similarities only for cover_results using cached embeddings
    cover_embeddings = [
        index.embeddings[index.documents.index(doc)]
        for doc in cover_results
    ]
    similarities = [np.dot(query_emb, emb) for emb in cover_embeddings]
    # Get top-k by semantic similarity
    top_indices = np.argsort(similarities)[-hybrid_top_k:][::-1]
    # "Hybrid" here means keyword-based candidate selection (cover search) followed by embedding-based reranking
    hybrid_results = [cover_results[i] for i in top_indices]
    # Deduplicate before returning (defensive, in case of overlap)
    deduped_results = deduplicate_results(hybrid_results[:num_results])
    return deduped_results
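
The top-k step relies on `np.argsort`, which sorts in ascending order, so the last `k` indices are taken and then reversed. A minimal sketch with made-up similarity scores:

```python
import numpy as np

similarities = [0.12, 0.87, 0.45, 0.91, 0.30]  # made-up scores, one per candidate
k = 3

# argsort is ascending: take the last k indices, then reverse for best-first order
top_indices = np.argsort(similarities)[-k:][::-1]
print(top_indices.tolist())  # [3, 1, 2]

# If k exceeds the pool size, the slice simply returns every index
all_indices = np.argsort(similarities)[-10:][::-1]
print(all_indices.tolist())  # [3, 1, 2, 4, 0]
```

Note that in `ms_cover_then_hybrid_search` the pool is cut to `hybrid_top_k` before the `[:num_results]` slice, so with the default `hybrid_top_k=5` at most five recipes are returned even when a larger `num_results` is requested.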

### 5. Elasticsearch Setup and Retrieval Strategies

#### 5.1 Set up Elasticsearch Docker and Client

---
Start Elasticsearch with Docker from a terminal:

`docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

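Or, with security disabled and a bounded Java heap (Elasticsearch 8.x enables security by default, so the plain `http://localhost:9200` connection used in this notebook needs `xpack.security.enabled=false`):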

`docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

And check if Elasticsearch is up:

`curl http://localhost:9200`

---

In [13]:
# Create Elasticsearch Client (make sure Elasticsearch is running locally)
es_client = Elasticsearch('http://localhost:9200')
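
Elasticsearch can take several seconds to accept connections after the container starts. As an alternative to checking with `curl`, a small polling helper can be used from the notebook. This is only a sketch: `wait_until_ready` is a hypothetical helper that accepts any zero-argument callable returning a boolean (the client's `ping` method is one such callable), and the retry count and delay are arbitrary.

```python
import time

def wait_until_ready(ping, retries=30, delay=2.0):
    """Call `ping()` until it returns True or the retries run out."""
    for _ in range(retries):
        try:
            if ping():
                return True
        except Exception:
            pass  # treat connection errors as "not ready yet"
        time.sleep(delay)
    return False

# Usage with the client created above (requires a running Elasticsearch):
# wait_until_ready(es_client.ping)
```

Passing the check as a callable keeps the helper independent of the client library, which also makes it easy to exercise without a running cluster.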

#### 5.2. Elasticsearch Index 
Set up Elasticsearch retrieval backend as in the main RAG notebook.

In [14]:
# Define Elasticsearch index settings and mappings for recipes
index_settings = {
    "settings": {
        "number_of_shards": 1, # A shard is a unit of storage and search; more shards can improve parallelism for large datasets
        "number_of_replicas": 0 # Replicas are shard copies for fault tolerance and search throughput; 0 is fine locally but not recommended for production
    },
    "mappings": {
        "properties": {
            "recipe_name": {"type": "text"},
            "main_ingredients": {"type": "text"},
            "all_ingredients": {"type": "text"},
            "instructions": {"type": "text"},
            "cuisine_type": {"type": "text"},
            "dietary_restrictions": {"type": "text"},
            "meal_type": {"type": "keyword"},
            "difficulty_level": {"type": "keyword"},
            "prep_time_minutes": {"type": "integer"},
            "cook_time_minutes": {"type": "integer"},
            "all_ingredients_vector": { # Define a dense vector field
                "type": "dense_vector",
                "dims": 1536  # text-embedding-3-small returns 1536-dimensional vectors
            }
        }
    }
}

index_name = "recipes"

# Create the index (ignore error if it already exists)
try:
    es_client.indices.create(index=index_name, body=index_settings)
except Exception as e:
    print("Index may already exist:", e)

Index may already exist: BadRequestError(400, 'resource_already_exists_exception', 'index [recipes/RWTKgx2XR_KP-wnMuWG2mQ] already exists')


In [15]:
# Add the embedding vector (all_ingredients_vector) to each document when indexing.
# The same embeddings are reused for Elasticsearch, since their order matches the documents list.
# Passing an explicit id makes re-running this cell overwrite documents instead of indexing duplicates.
for doc, emb in zip(documents, index.embeddings):
    doc['all_ingredients_vector'] = emb.tolist()
    es_client.index(index=index_name, id=str(doc['id']), document=doc)
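
Indexing one document per request is fine for a small dataset, but the `elasticsearch` Python package also provides `helpers.bulk` for batching. The sketch below only builds the action dictionaries. `build_bulk_actions` is a hypothetical helper, and using the recipe `id` as the Elasticsearch `_id` is an assumption that makes repeated runs overwrite documents rather than duplicate them.

```python
def build_bulk_actions(documents, embeddings, index_name):
    """Yield one bulk action per recipe, pairing each document with its embedding."""
    for doc, emb in zip(documents, embeddings):
        source = dict(doc)  # copy so the original document is not mutated
        source['all_ingredients_vector'] = [float(x) for x in emb]
        yield {
            '_index': index_name,
            '_id': doc['id'],   # assumption: recipe ids are unique
            '_source': source,
        }

toy_docs = [{'id': 0, 'recipe_name': 'A'}, {'id': 1, 'recipe_name': 'B'}]
toy_embs = [[0.1, 0.2], [0.3, 0.4]]
actions = list(build_bulk_actions(toy_docs, toy_embs, 'recipes'))
print(actions[0]['_id'], actions[0]['_source']['all_ingredients_vector'])

# With a running cluster, the actions could be sent in one call:
# from elasticsearch import helpers
# helpers.bulk(es_client, build_bulk_actions(documents, index.embeddings, index_name))
```

Copying each document before attaching the vector also keeps the in-memory `documents` list free of the large embedding field.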

#### 5.3 Run RAG Retrieval Strategies for Evaluation (Elasticsearch)

##### Approach 1: Basic Keyword Search

Returns recipes that match the query using simple keyword matching across all fields. **Will be used in cover strategy simulation for Elasticsearch**.

In [16]:
# Basic keyword-based search using Elasticsearch's multi_match query
def es_basic_search(query, num_results=5, max_time=None):
    search_query = {
        "size": num_results * 2,  # get more to allow filtering
        "query": {
            "multi_match": {
                "query": query,
                "fields": [ # Elasticsearch's multi_match query is for text searching
                    "recipe_name",
                    "main_ingredients^2",
                    "all_ingredients^3",
                    "instructions^1.5",
                    "cuisine_type",
                    "dietary_restrictions^1.5"
                ],
                "type": "best_fields" # default, can be "most_fields", "cross_fields", etc.
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = [hit['_source'] for hit in response['hits']['hits']]
    # Apply time filter
    if max_time is not None:
        result_docs = filter_by_max_time(result_docs, max_time)
    return result_docs[:num_results]

##### Approach 5: Cover Ingredients Search
Elasticsearch queries return a ranked list based on scoring but do not natively support iterative, set-cover-style selection across multiple results. The workaround used here is to retrieve a large candidate pool from Elasticsearch with the basic keyword search, then run the greedy cover algorithm in Python on those candidates. This function does not use embeddings or vector similarity.

In [17]:
def es_cover_ingredients_search(query, num_results=5, max_time=None, candidate_pool_size=200):
    """
    Simulate Cover Ingredients Search using Elasticsearch results, with optional time filtering.
    Returns a set of recipes that together cover as many of the query ingredients as possible.
    Deduplicates results before returning.
    """
    # Step 1: Get a large pool of candidates from ES (basic search, large pool)
    candidates = es_basic_search(query, num_results=candidate_pool_size)
    # Step 2: Filter by time if needed
    if max_time is not None:
        candidates = filter_by_max_time(candidates, max_time)
    # Step 3: Apply greedy cover algorithm
    query_tokens = set(re.sub(r'[^\w\s]', '', query.lower()).replace(',', ' ').split())
    uncovered = set(query_tokens)
    selected = []
    docs = candidates.copy()
    while uncovered and len(selected) < num_results and docs:
        best_doc = None
        best_overlap = 0
        for doc in docs:
            ingredients = set(re.sub(r'[^\w\s]', '', str(doc.get('all_ingredients', '')).lower()).replace(',', ' ').split())
            overlap = len(uncovered & ingredients)
            if overlap > best_overlap:
                best_overlap = overlap
                best_doc = doc
        if best_doc and best_overlap > 0:
            selected.append(best_doc)
            ingredients = set(re.sub(r'[^\w\s]', '', str(best_doc.get('all_ingredients', '')).lower()).replace(',', ' ').split())
            uncovered -= ingredients
            docs.remove(best_doc)
        else:
            break
    # Deduplicate results before returning (recommended if docs may have overlap)
    deduped_results = deduplicate_results(selected)
    return deduped_results
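
The Elasticsearch variant tokenizes differently from `tokenize_ingredients`: it strips punctuation and splits on whitespace, so matching happens on single words rather than whole comma-separated phrases. Here is a small sketch of the two rules side by side; `comma_tokens` and `word_tokens` are hypothetical names restating the logic used in this notebook.

```python
import re

def comma_tokens(text):
    # Rule used by tokenize_ingredients: whole comma-separated phrases
    return {t.strip().lower() for t in text.split(',') if t.strip()}

def word_tokens(text):
    # Rule used in es_cover_ingredients_search: drop punctuation, split on whitespace
    return set(re.sub(r'[^\w\s]', '', text.lower()).split())

query = "What can I cook with bell pepper and rice?"
ingredients = "Rice, Bell Pepper, Onion"

print(comma_tokens(query) & comma_tokens(ingredients))        # set()
print(sorted(word_tokens(query) & word_tokens(ingredients)))  # ['bell', 'pepper', 'rice']
```

Word-level matching lets natural-language questions overlap with ingredient lists, at the cost of also treating words such as "with" or "and" as items to cover.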

#### Combined Pipeline Approach: Combine Cover and Hybrid RAG Pipelines (Elasticsearch)

Combine the two approaches by first running Cover Ingredients Search to select a diverse set of recipes that cover as many ingredients as possible, and then passing those results through the Hybrid Search to rerank or filter them by semantic relevance.

In [18]:
def es_cover_then_hybrid_search(query, num_results=5, max_time=None, hybrid_top_k=5, candidate_pool_size=200):
    """
    Combine Cover Ingredients Search and Hybrid Search for Elasticsearch:
    1. Run Cover Ingredients Search (which already deduplicates) to get a diverse set of recipes
       (up to candidate_pool_size).
    2. Rerank the pool by semantic similarity to the query using OpenAI embeddings.
    3. Remove 'all_ingredients_vector' from each result for cleaner output.
    """
    # Step 1: Cover Ingredients Search to get a diverse set and increase the pool size
    cover_results = es_cover_ingredients_search(query, num_results=candidate_pool_size, max_time=max_time)
    if not cover_results:
        return []
    # Step 2: Hybrid Search, but restrict to cover_results as the candidate pool
    query_emb = get_embedding(query)
    # Use precomputed embeddings from 'all_ingredients_vector'
    cover_embeddings = [
        np.array(doc['all_ingredients_vector']) for doc in cover_results
    ]
    similarities = [np.dot(query_emb, emb) for emb in cover_embeddings]
    # Get top-k by semantic similarity
    top_indices = np.argsort(similarities)[-hybrid_top_k:][::-1]
    # "Hybrid" here means keyword-based candidate selection (cover search) followed by embedding-based reranking
    hybrid_results = [cover_results[i] for i in top_indices] 
    # Remove all_ingredients_vector from each result
    for doc in hybrid_results:
        doc.pop('all_ingredients_vector', None)

    return hybrid_results[:num_results]

### 6. Rerank with LLM

Re-ranking means taking an initial set of retrieved documents (candidates) and re-ordering them, often using a more sophisticated model (like an LLM), to improve the relevance of the top results.

#### 6.1. Build Prompt for the LLM Using Retrieved Context (Minsearch + Elasticsearch)
Use OpenAI's language model to generate answers based on the context retrieved from Minsearch and Elasticsearch. 

In [19]:
def build_prompt(query, search_results):
    entry_template = """
Recipe: {recipe_name}
Cuisine: {cuisine_type}
Meal Type: {meal_type}
Difficulty: {difficulty_level}
Prep Time: {prep_time_minutes} minutes
Cook Time: {cook_time_minutes} minutes
Main Ingredients: {main_ingredients}
Instructions: {instructions}
Dietary Info: {dietary_restrictions}
""".strip()
    context = "\n\n".join([entry_template.format(**doc) for doc in search_results])
    prompt_template = """
You are an expert chef and culinary assistant. Answer the question based on the content from our recipe database.
Use only the facts from the context when answering the question.

CONTEXT:
{context}

QUESTION: {question}

Provide recipe recommendations with brief explanations of why they match the requested ingredients.
If exact ingredients aren't available, suggest the closest matches and mention any substitutions needed.
""".strip()
    return prompt_template.format(context=context, question=query)

#### 6.2. LLM Call Function (Minsearch + Elasticsearch)

In [20]:
def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

#### 6.3. Re-ranking Function

In [21]:
def rerank_with_llm(query, candidates, max_time=None): 
    # Apply time filter before reranking
    if max_time is not None:
        candidates = filter_by_max_time(candidates, max_time)
    context = "\n\n".join([f"Recipe: {doc['recipe_name']}\nIngredients: {doc['main_ingredients']}" for doc in candidates])
    prompt = f"""
Given the following user query and candidate recipes, rank the recipes from most to least relevant.

Query: {query}

Candidates:
{context}

Return a JSON list of recipe names in ranked order.
""".strip()
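    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    content = response.choices[0].message.content
    # Extract the JSON list from the response (`re` is already imported at the top of the notebook)
    json_match = re.search(r'\[.*\]', content, re.DOTALL)
    if not json_match:
        return candidates
    try:
        ranked_names = json.loads(json_match.group())
    except json.JSONDecodeError:
        # Fall back to the original order if the list is not valid JSON
        return candidates
    # Map the ranked names back to documents, preserving the LLM's order
    ranked_docs = [doc for name in ranked_names for doc in candidates if doc['recipe_name'] == name]
    return ranked_docs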
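
Mapping the ranked names back to documents by exact match can drop candidates if the model omits a name or changes its casing. Below is a minimal sketch of a more forgiving mapping, assuming recipe names can be compared case-insensitively. `order_by_names` is a hypothetical helper, not part of the original notebook.

```python
def order_by_names(candidates, ranked_names):
    """Order candidates by the LLM's ranking; unranked candidates are appended at the end."""
    by_name = {}
    for doc in candidates:
        # First occurrence wins if two candidates share a name
        by_name.setdefault(doc['recipe_name'].strip().lower(), doc)

    ordered, used = [], set()
    for name in ranked_names:
        key = str(name).strip().lower()
        if key in by_name and key not in used:
            ordered.append(by_name[key])
            used.add(key)

    # Append anything the model left out, preserving the original order
    leftovers = [doc for doc in candidates if not any(doc is d for d in ordered)]
    return ordered + leftovers

toy_candidates = [
    {'recipe_name': 'Fried Rice'},
    {'recipe_name': 'Omelette'},
    {'recipe_name': 'Garlic Bread'},
]
toy_ranking = ['omelette', 'Unknown Dish', 'Fried Rice']

reordered = order_by_names(toy_candidates, toy_ranking)
print([d['recipe_name'] for d in reordered])  # ['Omelette', 'Fried Rice', 'Garlic Bread']
```

Keeping unranked candidates at the end means the answer-generation prompt still sees every retrieved recipe, even when the reranker's output is incomplete.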

#### 6.4. Re-rank the ms_cover_then_hybrid_search RAG Pipeline 

In [None]:
def ms_best_rag_with_rerank(query):
    candidates = ms_cover_then_hybrid_search(query, index, num_results=5)
    reranked = rerank_with_llm(query, candidates)
    prompt = build_prompt(query, reranked)
    answer = llm(prompt)
    return answer

#### 6.5. Re-rank the es_cover_then_hybrid_search RAG Pipeline 

In [None]:
def es_best_rag_with_rerank(query):
    candidates = es_cover_then_hybrid_search(query, num_results=5)
    reranked = rerank_with_llm(query, candidates)
    prompt = build_prompt(query, reranked)
    answer = llm(prompt)
    return answer

#### 6.6. Run and Compare Both Pipelines

In [None]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"

print("Minsearch Best RAG with LLM Rerank:")
answer_ms = ms_best_rag_with_rerank(ingredients)
print(answer_ms)

print("\nElasticsearch Best RAG with LLM Rerank:")
answer_es = es_best_rag_with_rerank(ingredients)
print(answer_es)

Minsearch Best RAG with LLM Rerank:
Based on the requested ingredients, here are the closest recipe recommendations along with explanations:

1. **Mediterranean Rice Skillet**
   - **Matching Ingredients:** Rice, Eggs, Cheese, Garlic, Onion
   - **Explanation:** This recipe primarily utilizes rice, which is one of your requested ingredients. It also includes eggs and cheese, which match your list. Garlic and onion add flavor, aligning with the garlic and onion you requested. However, it does not include tomatoes or bell peppers. You could add chopped bell peppers as a substitution to enhance the dish.

2. **Chinese Pasta Soup**
   - **Matching Ingredients:** Pasta, Shrimp, Garlic, Milk (for butter)
   - **Explanation:** This recipe uses pasta and shrimp, matching your list. While it does not include chicken, eggs, or rice, you could consider substituting pasta with rice if desired. The soup's creamy texture is achieved with milk and butter, which can add richness to your meal.

3. **Be

#### 6.7. Conclusion of Combined RAG Pipeline with Re-ranking for Minsearch and Elasticsearch

1. Both pipelines surface similar top matches.

    * Both recommend Mediterranean Rice Skillet and Chinese Pasta Soup as strong matches for the ingredient list.
    * French Onion Soup appears in both lists, though not always as a top match.

2. The explanations are ingredient-aware and suggest substitutions.

    * Both systems explain which ingredients from the list match the recipes.
    * They offer suggestions for substitutions or adaptations (e.g., using chicken instead of beef, or substituting fish).

3. Elasticsearch results are slightly more ingredient-dense and explicit.

    * The Elasticsearch output lists more matching ingredients for each recipe and sometimes provides more detailed substitution advice.
    * It also suggests how to use leftover ingredients (e.g., lemon, chocolate) in side dishes or desserts.

4. Both systems handle partial matches and adaptation.

    * Recipes that don’t fully match the list are still suggested, with clear notes on what’s missing and how to adapt.
    * Both pipelines are robust to imperfect matches and provide practical suggestions.

5. The LLM reranker produces user-friendly, actionable recommendations.

    * The final output is readable, helpful, and tailored to the user’s pantry.
    * The explanations help users understand why each recipe was chosen and how to adapt it.

### 8. Prompt Strategies

#### 8.1. Compare Different Prompt Strategies for Minsearch and Elasticsearch RAG pipelines

In [24]:
def get_prompt_strategy_1():
    return """
You are a chef assistant. Based on the available recipes, recommend dishes that use the requested ingredients.
Provide the recipe name, brief description, and cooking instructions and time.

CONTEXT:
{context}

QUESTION: {question}

Answer:
""".strip()

def get_prompt_strategy_2():
    return """
You are an expert chef specializing in ingredient substitutions. When users provide ingredients,
recommend recipes and suggest alternatives for missing ingredients. Always explain possible substitutions
and how they might affect the dish.

CONTEXT:
{context}

QUESTION: {question}

Provide recommendations with substitution suggestions:
""".strip()

def get_prompt_strategy_3():
    return """
You are a nutritionist and chef. Recommend recipes based on ingredients provided, considering
nutritional value and dietary restrictions. Highlight health benefits and suggest modifications
for different dietary needs (vegetarian, gluten-free, etc.).

CONTEXT:
{context}

QUESTION: {question}

Provide nutritionally-aware recommendations:
""".strip()


#### 8.2. Prompt-strategy version (flexible prompt template)

Prompt strategies enable experimentation with different prompt styles, tones, or instructions. You can test how changing the prompt wording or focus (e.g., substitutions, nutrition, chef persona) affects the quality and usefulness of answers. This is valuable for prompt engineering, user experience research, and optimizing the assistant for different use cases or audiences.
See fixed version below.

In [None]:
def rag_minsearch(question, prompt_strategy=None):
    search_results = ms_cover_then_hybrid_search(question, index, num_results=5)
    # Build context string as in build_prompt
    entry_template = (
        "Recipe: {recipe_name}\n"
        "Cuisine: {cuisine_type}\n"
        "Meal Type: {meal_type}\n"
        "Difficulty: {difficulty_level}\n"
        "Prep Time: {prep_time_minutes} minutes\n"
        "Cook Time: {cook_time_minutes} minutes\n"
        "Main Ingredients: {main_ingredients}\n"
        "Instructions: {instructions}\n"
        "Dietary Info: {dietary_restrictions}"
    )
    context = "\n\n".join([entry_template.format(**doc) for doc in search_results])
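    # Fall back to the first strategy when none is given (calling None would raise TypeError)
    template = (prompt_strategy or get_prompt_strategy_1)()
    prompt = template.format(context=context, question=question)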
    answer = llm(prompt)
    return answer

def rag_elasticsearch(question, prompt_strategy=None):
    search_results = es_cover_then_hybrid_search(question, num_results=5)
    entry_template = (
        "Recipe: {recipe_name}\n"
        "Cuisine: {cuisine_type}\n"
        "Meal Type: {meal_type}\n"
        "Difficulty: {difficulty_level}\n"
        "Prep Time: {prep_time_minutes} minutes\n"
        "Cook Time: {cook_time_minutes} minutes\n"
        "Main Ingredients: {main_ingredients}\n"
        "Instructions: {instructions}\n"
        "Dietary Info: {dietary_restrictions}"
    )
    context = "\n\n".join([entry_template.format(**doc) for doc in search_results])
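    # Fall back to the first strategy when none is given (calling None would raise TypeError)
    template = (prompt_strategy or get_prompt_strategy_1)()
    prompt = template.format(context=context, question=question)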
    answer = llm(prompt)
    return answer
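
Each strategy template is filled with `str.format(context=..., question=...)`. A template with an extra placeholder raises `KeyError`, and one that lacks `{context}` silently drops the retrieved recipes. Here is a small sketch of a check using the standard library's `string.Formatter`; `template_fields` is a hypothetical helper.

```python
from string import Formatter

def template_fields(template):
    """Return the set of named placeholders in a str.format template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

template = "CONTEXT:\n{context}\n\nQUESTION: {question}\n\nAnswer:"
print(sorted(template_fields(template)))  # ['context', 'question']
```

Comparing the result against `{'context', 'question'}` before calling the LLM is a cheap way to catch a mistyped placeholder when adding a new strategy.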

In [30]:
# Test get_prompt_strategy_1, get_prompt_strategy_2 and get_prompt_strategy_3
question = "What can I cook with chicken, rice, and bell pepper?"
print(rag_minsearch(question, prompt_strategy=get_prompt_strategy_1))
print(rag_elasticsearch(question, prompt_strategy=get_prompt_strategy_1))
print(rag_minsearch(question, prompt_strategy=get_prompt_strategy_2))
print(rag_elasticsearch(question, prompt_strategy=get_prompt_strategy_2))
print(rag_minsearch(question, prompt_strategy=get_prompt_strategy_3))
print(rag_elasticsearch(question, prompt_strategy=get_prompt_strategy_3))

Here are some delicious dishes you can prepare using chicken, rice, and bell pepper:

### 1. **Chicken and Bell Pepper Stir-Fry**
**Description:** A quick and vibrant stir-fry featuring tender chicken pieces, crunchy bell peppers, and flavorful spices.  
**Cooking Instructions:**  
1. Slice chicken breast and bell peppers into thin strips.  
2. In a large skillet or wok, heat a tablespoon of oil over medium-high heat.  
3. Add the chicken strips and cook until browned and cooked through (about 5-7 minutes).  
4. Add sliced bell peppers and stir-fry for another 3-4 minutes until they are tender-crisp.  
5. Season with soy sauce, garlic, and ginger, and cook for an additional minute.  
6. Serve over steamed rice.  
**Total Time:** 20 minutes  

---

### 2. **One-Pan Chicken Rice with Bell Peppers**
**Description:** A hearty one-pan meal that combines chicken, bell peppers, and rice, all cooked together for an easy and tasty dinner.  
**Cooking Instructions:**  
1. In a large skillet, hea

#### 8.3. Conclusion of Prompt Strategies for Minsearch and Elasticsearch

All prompt strategies and both retrieval pipelines successfully generate relevant, practical recipes using chicken, rice, and bell pepper as main ingredients.
The answers include a variety of dishes (e.g., casseroles, stir-fries, rice bowls, pilafs, stuffed peppers), showing the system’s ability to surface diverse options.

The RAG pipeline, with prompt strategies, produces high-quality, context-aware, and adaptable recipe recommendations, supporting different user intents (basic cooking, substitutions, nutrition). This validates the usefulness of prompt engineering and the retrieval system for real-world recipe assistant scenarios.

### 9. Retrieval Evaluation Metrics 

#### 9.1. Hit Rate and MRR

Evaluate the final retrieval quality of the best-performing pipelines, i.e. how well the search step surfaces the ground-truth document in its top-N results. The LLM-reranked pipelines (`ms_best_rag_with_rerank`, `es_best_rag_with_rerank`) return a final answer rather than a ranked list of retrieved documents, so they cannot be used directly for retrieval metrics.

In [26]:
def hit_rate(relevance_total):
    """Share of queries with at least one relevant document in the results."""
    cnt = 0
    for line in relevance_total:
        if True in line:
            cnt += 1
    return cnt / len(relevance_total)

def mrr(relevance_total):
    """Mean reciprocal rank of the first relevant document (0 for a miss)."""
    total_score = 0.0
    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score += 1 / (rank + 1)  # ranks are 1-based
                break
    return total_score / len(relevance_total)

def evaluate_retrieval(ground_truth, search_function):
    """Run search_function for every ground-truth question and score the results."""
    relevance_total = []
    for q in tqdm(ground_truth):
        doc_id = str(q['id'])
        results = search_function(q)
        # One boolean per returned document: does it match the ground-truth id?
        relevance = [str(d.get('id')) == doc_id for d in results]
        relevance_total.append(relevance)
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
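
As a quick sanity check of the two metrics, the following sketch applies them to a hand-made relevance matrix. The toy data and the `_compact` function names are illustrative assumptions; the functions are meant to mirror the logic of `hit_rate` and `mrr` above so that the block runs on its own.

```python
# Toy relevance matrix: one row per query, one boolean per retrieved rank.
relevance_total = [
    [True, False, False],   # hit at rank 1 -> reciprocal rank 1
    [False, False, True],   # hit at rank 3 -> reciprocal rank 1/3
    [False, False, False],  # miss          -> reciprocal rank 0
    [False, True, False],   # hit at rank 2 -> reciprocal rank 1/2
]

def hit_rate_compact(rows):
    # Fraction of queries with at least one relevant result.
    return sum(any(row) for row in rows) / len(rows)

def mrr_compact(rows):
    # Average of 1/rank of the first relevant result; misses contribute 0.
    total = 0.0
    for row in rows:
        for rank, is_hit in enumerate(row, start=1):
            if is_hit:
                total += 1 / rank
                break
    return total / len(rows)

toy_hit_rate = hit_rate_compact(relevance_total)  # 3 of 4 queries hit -> 0.75
toy_mrr = mrr_compact(relevance_total)            # (1 + 1/3 + 0 + 1/2) / 4
print(toy_hit_rate, round(toy_mrr, 3))
```

Hit rate only asks whether the document was found, while MRR also rewards finding it early, so MRR can never exceed hit rate.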

In [27]:
print("Evaluating Minsearch retrieval...")
metrics_ms = evaluate_retrieval(ground_truth, lambda q: ms_cover_then_hybrid_search(q['question'], index, num_results=10))
print("Minsearch:", metrics_ms)

print("Evaluating Elasticsearch retrieval...")
metrics_es = evaluate_retrieval(ground_truth, lambda q: es_cover_then_hybrid_search(q['question'], num_results=10))
print("Elasticsearch:", metrics_es)

Evaluating Minsearch retrieval...


  0%|          | 0/500 [00:00<?, ?it/s]

Minsearch: {'hit_rate': 0.05, 'mrr': 0.05}
Evaluating Elasticsearch retrieval...


  0%|          | 0/500 [00:00<?, ?it/s]

Elasticsearch: {'hit_rate': 0.428, 'mrr': 0.424}


#### 9.2. Conclusion of Hit Rate and MRR

* Elasticsearch dramatically outperforms Minsearch on both hit rate and MRR (Mean Reciprocal Rank) for the ground-truth retrieval evaluation.

    * Hit Rate: The proportion of queries where the correct recipe appears in the top 10 results is ~42.8% for Elasticsearch, but only ~5% for Minsearch.
    * MRR: The average reciprocal rank of the correct recipe is also much higher for Elasticsearch (~0.42 vs. ~0.05).
* Minsearch is not effective for this retrieval task with the current setup and queries. It surfaces the correct recipe in the top 10 for only about 1 in 20 queries.

* Elasticsearch is much more suitable for ingredient/time-based recipe retrieval, likely due to its more advanced text search and ranking capabilities. For a production system, a hit rate of ~43% is moderate, not excellent, but may be reasonable, especially with a small dataset.

Hit rate and MRR are identical for Minsearch (0.05) and nearly identical for Elasticsearch (0.428 vs. 0.424). This means that whenever the correct document is retrieved at all, it is almost always at rank 1; otherwise it is missing from the top 10 entirely.
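
The rank-1 interpretation can be checked directly by counting where the first hit lands for each query. The sketch below is a hypothetical helper, not part of the notebook: it assumes access to the per-query relevance lists, which `evaluate_retrieval` builds internally but does not currently return, and it uses toy data in place of real results.

```python
from collections import Counter

def first_hit_rank_counts(relevance_total):
    """Count the 1-based rank of the first relevant result per query.

    Queries without any relevant result are counted under the key None.
    """
    counts = Counter()
    for row in relevance_total:
        rank = next((i for i, is_hit in enumerate(row, start=1) if is_hit), None)
        counts[rank] += 1
    return counts

# Toy data: two hits at rank 1, one hit at rank 2, one miss.
toy_rows = [[True, False], [True, False], [False, True], [False, False]]
counts = first_hit_rank_counts(toy_rows)
print(counts)
```

If nearly all of the mass sits at rank 1 and `None`, hit rate and MRR will be close to each other, which matches the pattern reported above.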

### 10. Parameter Optimization for Minsearch

#### 10.1. Parameter optimization of Minsearch's text field boosts (for keyword/text search)

In [28]:
param_ranges = { # Used for optimizing text field boosts in Minsearch
    'recipe_name': (0.0, 1.0),
    'main_ingredients': (0.0, 4.0),
    'all_ingredients': (0.0, 5.0),
    'instructions': (0.0, 3.0),
    'cuisine_type': (0.0, 1.0),
    'dietary_restrictions': (0.0, 2.0)
}

def simple_optimize(param_ranges, objective_function, n_iterations=10):
    best_params = None
    best_score = float('-inf')
    for _ in range(n_iterations):
        current_params = {}
        for param, (min_val, max_val) in param_ranges.items():
            current_params[param] = random.uniform(min_val, max_val)
        current_score = objective_function(current_params)
        if current_score > best_score:
            best_score = current_score
            best_params = current_params
    return best_params, best_score

gt_val = df_gt.sample(n=50, random_state=42).to_dict(orient='records')

def objective(boost_params):
    def search_function(q):
        return index.search(q['question'], boost_dict=boost_params, num_results=10)
    results = evaluate_retrieval(gt_val, search_function)
    return results['mrr']

In [29]:
best_boost, best_score = simple_optimize(param_ranges, objective, n_iterations=20)
print("Best Minsearch boost params:", best_boost)
print("Best validation MRR:", best_score)

  0%|          | 0/50 [00:00<?, ?it/s]

Best Minsearch boost params: {'recipe_name': 0.05695903438842731, 'main_ingredients': 2.918459989752664, 'all_ingredients': 1.1310668468873153, 'instructions': 0.37781160720585716, 'cuisine_type': 0.10329587286913522, 'dietary_restrictions': 0.18388683959828445}
Best validation MRR: 0.46774603174603163


#### 10.2. Conclusions of Parameter Optimization for Minsearch

* Field importance: The optimizer found that giving a high boost to main_ingredients (≈2.92), a moderate boost to all_ingredients (≈1.13), and a smaller boost to instructions (≈0.38) improves retrieval performance. recipe_name (≈0.06), cuisine_type (≈0.10), and dietary_restrictions (≈0.18) are less important but still contribute.
* Retrieval quality: The best Mean Reciprocal Rank (MRR) achieved on the validation set is about 0.47. As a rough guide, this corresponds to the correct recipe appearing around rank 2 (since 1/0.47 ≈ 2.1), although MRR also counts queries with no hit as 0.
* Optimization effect: Tuning the field boosts can significantly improve Minsearch's retrieval effectiveness compared to default or arbitrary weights.
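
Because `simple_optimize` draws boosts with the module-level `random.uniform` and no seed, a rerun may return different best parameters and a different best score. A minimal sketch of a reproducible variant follows; the function name `seeded_random_search`, the `seed` parameter, and the toy objective are assumptions for illustration, and in the notebook the `objective` function from above would take the place of the toy one.

```python
import random

def seeded_random_search(param_ranges, objective_function, n_iterations=20, seed=42):
    """Random search over uniform ranges, using a private seeded generator."""
    rng = random.Random(seed)  # isolated from the global random state
    best_params, best_score = None, float("-inf")
    for _ in range(n_iterations):
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in param_ranges.items()
        }
        score = objective_function(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical toy objective: best when the main_ingredients boost is near 3.0.
def toy_objective(params):
    return -(params["main_ingredients"] - 3.0) ** 2

toy_ranges = {"main_ingredients": (0.0, 4.0), "instructions": (0.0, 3.0)}
run_a = seeded_random_search(toy_ranges, toy_objective)
run_b = seeded_random_search(toy_ranges, toy_objective)
print(run_a == run_b)  # same seed, same sampled parameters
```

Fixing the seed makes the reported boosts reproducible, but it does not make them optimal; more iterations or several seeds would still be needed to judge how stable the result is.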

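**Relation to the Hit Rate and MRR results in 9.2 for Minsearch:**
The evaluation in 9.1 uses the default or initial Minsearch configuration (default field boosts, no optimization). Minsearch's basic keyword search is not well tuned for the complex, ingredient/time-based queries in the ground-truth set, so it rarely surfaces the correct recipe. In that full 500-question evaluation, Elasticsearch still outperforms Minsearch. The tuned validation MRR of about 0.47 comes from a 50-question sample and is therefore not directly comparable with the Elasticsearch MRR of 0.424.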

### 11. RAG Pipeline Evaluation and LLM Answer Quality

#### 11.1. Fixed version (single prompt template)

The fixed version ensures consistency and fairness in evaluation. Every answer is generated using the same prompt style, so results are directly comparable across all questions and retrieval methods. This matters for benchmarking and for automated LLM-as-judge evaluation.

In [30]:
def rag_minsearch(question):
    return ms_best_rag_with_rerank(question)

def rag_elasticsearch(question):
    return es_best_rag_with_rerank(question)

#### 11.2. LLM-as-Judge Evaluation (RAG answer quality)

In [31]:
prompt_template_2 = """
You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

sample = df_gt.sample(n=50, random_state=1).to_dict(orient='records')
evaluations_ms = []
evaluations_es = []

print("Evaluating RAG (Minsearch) with LLM-as-judge...")

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag_minsearch(question)
    prompt = prompt_template_2.format(question=question, answer_llm=answer_llm)
    evaluation = llm(prompt)
    try:
        evaluation = json.loads(evaluation)
    except Exception:
        evaluation = {"Relevance": "ERROR", "Explanation": evaluation}
    evaluations_ms.append({
        "id": record['id'],
        "question": question,
        "answer": answer_llm,
        "relevance": evaluation.get("Relevance"),
        "explanation": evaluation.get("Explanation")
    })

print("Evaluating RAG (Elasticsearch) with LLM-as-judge...")

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag_elasticsearch(question)
    prompt = prompt_template_2.format(question=question, answer_llm=answer_llm)
    evaluation = llm(prompt)
    try:
        evaluation = json.loads(evaluation)
    except Exception:
        evaluation = {"Relevance": "ERROR", "Explanation": evaluation}
    evaluations_es.append({
        "id": record['id'],
        "question": question,
        "answer": answer_llm,
        "relevance": evaluation.get("Relevance"),
        "explanation": evaluation.get("Explanation")
    })

# Persist evaluation results for further analysis and reproducibility (record the relevance judgments)
df_eval_ms = pd.DataFrame(evaluations_ms)
df_eval_es = pd.DataFrame(evaluations_es)
df_eval_ms.to_csv('../data/rag-eval-minsearch.csv', index=False)
df_eval_es.to_csv('../data/rag-eval-elasticsearch.csv', index=False)

Evaluating RAG (Minsearch) with LLM-as-judge...


  0%|          | 0/50 [00:00<?, ?it/s]

Evaluating RAG (Elasticsearch) with LLM-as-judge...


  0%|          | 0/50 [00:00<?, ?it/s]
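
The judging loops above fall back to an `ERROR` label whenever `json.loads` fails. Judge models sometimes wrap their JSON in a Markdown code fence despite the instruction not to, so a slightly more tolerant parser can reduce spurious errors. The following is a sketch under that assumption; `parse_judge_output` and `VALID_LABELS` are hypothetical names, not part of the notebook.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker
VALID_LABELS = {"NON_RELEVANT", "PARTLY_RELEVANT", "RELEVANT"}

def parse_judge_output(raw):
    """Parse the judge's reply into a dict, tolerating a surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.startswith("json"):  # optional language tag after the fence
            text = text[len("json"):]
    if text.endswith(FENCE):
        text = text[:-len(FENCE)]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {"Relevance": "ERROR", "Explanation": raw}
    # Reject valid JSON that does not carry one of the expected labels.
    if not isinstance(parsed, dict) or parsed.get("Relevance") not in VALID_LABELS:
        return {"Relevance": "ERROR", "Explanation": raw}
    return parsed

plain = parse_judge_output('{"Relevance": "RELEVANT", "Explanation": "ok"}')
fenced = parse_judge_output(
    FENCE + 'json\n{"Relevance": "PARTLY_RELEVANT", "Explanation": "ok"}\n' + FENCE
)
broken = parse_judge_output("The answer is relevant.")
print(plain["Relevance"], fenced["Relevance"], broken["Relevance"])
```

Validating the label as well as the JSON syntax keeps unexpected values out of the `relevance` column, so `value_counts` only ever reports the three expected classes plus `ERROR`.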

In [32]:
print("Minsearch RAG relevance proportions:")
print(df_eval_ms['relevance'].value_counts(normalize=True))
print("Elasticsearch RAG relevance proportions:")
print(df_eval_es['relevance'].value_counts(normalize=True))

Minsearch RAG relevance proportions:
relevance
RELEVANT           0.9
PARTLY_RELEVANT    0.1
Name: proportion, dtype: float64
Elasticsearch RAG relevance proportions:
relevance
RELEVANT           0.56
PARTLY_RELEVANT    0.42
NON_RELEVANT       0.02
Name: proportion, dtype: float64


#### 11.3. Conclusion of LLM-as-Judge Evaluation

* The Minsearch RAG pipeline produces highly relevant answers:

    * 90% of answers are judged RELEVANT by the LLM-as-judge.
    * The remaining 10% are PARTLY_RELEVANT; none are NON_RELEVANT.
    * This indicates Minsearch is very effective at generating fully relevant answers for the evaluated questions.

* The Elasticsearch RAG pipeline scores lower than Minsearch in this evaluation:

    * 56% of answers are RELEVANT.
    * 42% are PARTLY_RELEVANT and 2% NON_RELEVANT.
    * This suggests Elasticsearch often produces answers that are only partially relevant, and occasionally one that is not relevant.

* Both systems rarely produce non-relevant answers, but Minsearch achieves a clearly higher proportion of fully relevant answers in this 50-question sample.
* Minsearch may be better for maximizing answer relevance on this dataset and evaluation set, while Elasticsearch yields more partially relevant outputs.
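
With only 50 judged answers per pipeline, the proportions carry noticeable sampling uncertainty. As a rough check, the sketch below computes 95% Wilson score intervals for the RELEVANT share, using the counts implied by the printed proportions (0.9 and 0.56 of 50 answers, i.e. 45 and 28). The helper `wilson_interval` is an illustrative assumption, not part of the notebook.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

ms_low, ms_high = wilson_interval(45, 50)  # Minsearch: 45 of 50 RELEVANT
es_low, es_high = wilson_interval(28, 50)  # Elasticsearch: 28 of 50 RELEVANT
print(f"Minsearch:     {ms_low:.2f} to {ms_high:.2f}")
print(f"Elasticsearch: {es_low:.2f} to {es_high:.2f}")
```

If the two intervals do not overlap, the gap between the pipelines is unlikely to be explained by sampling noise alone, although it still depends on the judge model and the prompt.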

### 12. Summary

In [33]:
print("\n=== RETRIEVAL METRICS ===")
print("Minsearch:", metrics_ms)
print("Elasticsearch:", metrics_es)
print("Best Minsearch boost params:", best_boost)
print("Best validation MRR:", best_score)
print("\n=== RAG LLM-as-Judge (proportion RELEVANT) ===")
print("Minsearch:", (df_eval_ms['relevance'] == 'RELEVANT').mean())
print("Elasticsearch:", (df_eval_es['relevance'] == 'RELEVANT').mean())
print("\nAll evaluation results saved to CSV in ../data/")


=== RETRIEVAL METRICS ===
Minsearch: {'hit_rate': 0.05, 'mrr': 0.05}
Elasticsearch: {'hit_rate': 0.428, 'mrr': 0.424}
Best Minsearch boost params: {'recipe_name': 0.05695903438842731, 'main_ingredients': 2.918459989752664, 'all_ingredients': 1.1310668468873153, 'instructions': 0.37781160720585716, 'cuisine_type': 0.10329587286913522, 'dietary_restrictions': 0.18388683959828445}
Best validation MRR: 0.46774603174603163

=== RAG LLM-as-Judge (proportion RELEVANT) ===
Minsearch: 0.9
Elasticsearch: 0.56

All evaluation results saved to CSV in ../data/


Stop Elasticsearch running in Docker:

`docker stop elasticsearch`

Remove the container (optional, if you want to delete it):

`docker rm elasticsearch`
