## Retrieval-Augmented Generation (RAG) for Recipe Assistant: Minsearch and Elasticsearch

This notebook demonstrates an end-to-end RAG workflow for recipe recommendations using both Minsearch (for prototyping and local development) and Elasticsearch (for scalable, production-grade search).  
It includes advanced retrieval strategies, LLM-based query rewriting and re-ranking, prompt strategies, and evaluation metrics.
These strategies are evaluated in `rag-evaluation.ipynb`.

---

### 1. Set-up Dependencies, OpenAI Client and Dataset

#### 1.1. Dependencies and OpenAI Client
Install (run once) and import dependencies, load OpenAI API key and connect to OpenAI API.

In [9]:
# %pip install minsearch elasticsearch tqdm openai python-dotenv 

In [None]:
import pandas as pd
import numpy as np
import json
import re
from tqdm.auto import tqdm
import dotenv
import minsearch
from elasticsearch import Elasticsearch
from openai import OpenAI

In [2]:
dotenv.load_dotenv(dotenv_path="../.env")
client = OpenAI()

#### 1.2. Load and Index Recipe Data

Load the recipe dataset from CSV file and prepare it for indexing.

In [3]:
df = pd.read_csv('../data/recipes_clean.csv')

# Add an ID column if it doesn't exist
if 'id' not in df.columns:
    df['id'] = range(len(df))
    
# Create documents for indexing
documents = df.to_dict(orient='records')
documents[0]

{'recipe_name': 'Spaghetti Carbonara',
 'cuisine_type': 'Italian',
 'meal_type': 'Dinner',
 'difficulty_level': 'Medium',
 'prep_time_minutes': 15,
 'cook_time_minutes': 20,
 'servings': 4,
 'main_ingredients': 'Spaghetti, Eggs, Pancetta, Parmesan',
 'all_ingredients': 'Spaghetti, Eggs, Pancetta, Parmesan, Black Pepper, Olive Oil, Salt',
 'dietary_restrictions': 'Contains gluten, dairy, pork',
 'instructions': 'Boil spaghetti until al dente. Fry pancetta until crispy. Whisk eggs with grated Parmesan. Toss hot spaghetti with pancetta and egg mixture off heat. Serve immediately with black pepper.',
 'nutritional_info': 'Calories: 520, Protein: 22g, Carbs: 60g, Fat: 22g',
 'id': 0}

### 2. Utility Functions

#### 2.1. Create Time Filter
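The user states how much time is available (`max_time`, in minutes). Recipes whose combined prep and cook time exceeds that limit are dropped; recipes with missing or non-numeric times are treated as too long and dropped as well.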

In [4]:
def filter_by_max_time(results, max_time=None):
    if max_time is None:
        return results
    filtered = []
    for doc in results:
        try:
            total_time = int(doc.get('prep_time_minutes', 0)) + int(doc.get('cook_time_minutes', 0))
        except Exception:
            total_time = 99999
        if total_time <= max_time:
            filtered.append(doc)
    return filtered

#### 2.2. Print Unused Ingredients

Utility function that, given the query and the results, prints which query ingredients were not covered by the selected recipes.

In [5]:
# tokenize_ingredients is used in below Cover Ingredients Searches
def tokenize_ingredients(ingredient_str):
    # Split only on commas, strip whitespace, and lowercase
    return set([ing.strip().lower() for ing in ingredient_str.split(',') if ing.strip()])

In [6]:
def print_unused_ingredients(ingredients, results):
    query_ings = tokenize_ingredients(ingredients)
    used = set()
    for doc in results:
        recipe_ings = tokenize_ingredients(doc.get('all_ingredients', ''))
        used |= recipe_ings
    unused = query_ings - used
    print("Unused ingredients:", ", ".join(unused) if unused else "All ingredients used!")
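
The helpers above compare ingredient strings exactly, so a query for "tomato" does not match a recipe that lists "Tomatoes". Below is a minimal sketch of one way to soften this, assuming a naive English plural rule is acceptable for this dataset. `normalize_ingredient` and `tokenize_normalized` are hypothetical names, not part of the notebook, and the rule will mis-handle some words.

```python
def normalize_ingredient(name):
    """Lowercase, trim, and strip a simple English plural ending (naive heuristic)."""
    name = name.strip().lower()
    if name.endswith("oes"):          # tomatoes -> tomato, potatoes -> potato
        return name[:-2]
    if name.endswith("s") and not name.endswith(("ss", "us")):
        return name[:-1]              # eggs -> egg, onions -> onion
    return name


def tokenize_normalized(ingredient_str):
    """Comma-split like tokenize_ingredients, but normalize each ingredient (hypothetical helper)."""
    return {normalize_ingredient(part) for part in ingredient_str.split(",") if part.strip()}


query = tokenize_normalized("tomato, eggs, bell pepper")
recipe = tokenize_normalized("Tomatoes, Egg, Bell Pepper, Salt")
print(sorted(query - recipe))  # → []
```

With exact matching, the same inputs would report "tomato" and "eggs" as unused.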

#### 2.3. Deduplicate Search Results
Elasticsearch and LLMs can sometimes return duplicate or near-duplicate documents.

In [7]:
def deduplicate_results(results):
    seen = set()
    deduped = []
    for doc in results:
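        # Prefer the unique 'id'; fall back to a tuple of fields only when it is missing.
        # Testing "is not None" keeps id 0 (which is falsy) from triggering the fallback.
        doc_id = doc.get('id')
        key = doc_id if doc_id is not None else (doc.get('recipe_name'), doc.get('prep_time_minutes'), doc.get('cook_time_minutes'))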
        if key not in seen:
            seen.add(key)
            deduped.append(doc)
    return deduped

### 3. Minsearch: In-memory Search Library

Minsearch is a simple, fast, in-memory search library ideal for prototyping, small datasets, and educational purposes. It requires no external dependencies or server.

#### 3.1. Build Minsearch Index and Run Simple Retrieval Strategies

In [None]:
# prep/cook time and servings are numeric, and not suitable for Minsearch's default text/keyword search.
# Setup minsearch index for recipes
index = minsearch.Index(
    text_fields=['recipe_name', 'main_ingredients', 'all_ingredients', \
                 'instructions', 'cuisine_type', 'dietary_restrictions'],
    keyword_fields=['meal_type', 'difficulty_level'] # Categorical values
)

# Fit/train the index on the recipe documents
index.fit(documents)
# Alias the fitted documents as index.documents, the name used by later cells
index.documents = index.docs

##### Approach 1: Basic Keyword Search

Returns recipes that match the query using simple keyword matching across all fields.

In [36]:
def ms_basic_search(query, index=index, num_results=5, max_time=None):
    results = index.search(query=query, num_results=num_results*2)  # get more to allow filtering
    results = filter_by_max_time(results, max_time)
    return results[:num_results]

##### Approach 2: Boosting Search

Similar to basic, but gives more weight to important fields.

In [37]:
def ms_boosting_search(query, index=index, num_results=5, max_time=None):
    # Only fields listed in text_fields can be boosted; the numeric time fields
    # are not indexed as text, so they are left out here.
    boost_dict = {
        'main_ingredients': 4.0,
        'all_ingredients': 5.0,
        'instructions': 2.0,
        'dietary_restrictions': 2.0
    }
    results = index.search(query=query, boost_dict=boost_dict, num_results=num_results*2)
    results = filter_by_max_time(results, max_time)
    return results[:num_results]

#### 3.2. Advanced Retrieval Strategies
Below are advanced retrieval strategies that allow for more flexible, robust, and user-friendly recipe search.



##### Approach 3: Query Expansion Search

Expands the query with synonyms (e.g., "chicken" → "poultry", "breast") before searching, to improve recall.


In [38]:
def expand_culinary_query(query):
    synonyms = {
        'chicken': ['poultry', 'fowl', 'breast', 'thigh'],
        'pasta': ['noodles', 'spaghetti', 'linguine', 'vermicelli', 'orzo', 'macaroni'],
        'tomatoes': ['tomato', 'roma', 'cherry tomatoes', 'heirloom'],
        'beef': ['steak', 'ground beef', 'brisket', 'sirloin'],
        'potatoes': ['spuds', 'russet', 'yukon', 'sweet potatoes'],
        'cheese': ['cheddar', 'mozzarella', 'parmesan', 'feta', 'gruyere'],
        'onion': ['onions', 'red onion', 'yellow onion', 'shallot'],
        'rice': ['basmati', 'jasmine', 'arborio', 'brown rice'],
        'shrimp': ['prawns', 'shellfish', 'seafood'],
        'eggs': ['egg', 'yolk', 'whites'],
        'herbs': ['basil', 'oregano', 'thyme', 'rosemary']
    }
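    # Extract word tokens so comma-separated input ("chicken, rice") still matches the synonym keys;
    # a plain split() would leave "chicken," with its comma attached and never match.
    tokens = re.findall(r"[\w']+", query.lower())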
    expanded_tokens = tokens.copy()
    for token in tokens:
        if token in synonyms:
            expanded_tokens.extend(synonyms[token][:2])
    return ' '.join(expanded_tokens)

def ms_expansion_search(query, index=index, num_results=5, max_time=None):
    expanded_query = expand_culinary_query(query)
    results = index.search(expanded_query, num_results=num_results*2)
    results = filter_by_max_time(results, max_time)
    return results[:num_results]

##### Approach 4: Hybrid Search With OpenAI (Keyword + Embedding)
Combines keyword search with OpenAI's embedding-based (semantic) similarity for more robust retrieval.

In [39]:
def get_embedding(text):
    response = client.embeddings.create( # OpenAI client
        model="text-embedding-3-small",
        input=[text]
    )
    return np.array(response.data[0].embedding)

# Precompute embeddings for all recipes (DO THIS ONCE AND CACHE)
if not hasattr(index, "embeddings"):
    index.embeddings = [get_embedding(doc['all_ingredients']) for doc in index.documents]

def ms_hybrid_search(query, index, num_results=5, alpha=0.5, max_time=None):
    # Keyword search using the minsearch index
    keyword_results = index.search(query, num_results=10)
    # Embedding search: rank all recipes by dot-product similarity to the query
    query_emb = get_embedding(query)
    similarities = [np.dot(query_emb, emb) for emb in index.embeddings]
    top_indices = np.argsort(similarities)[-10:][::-1]
    embedding_results = [index.documents[i] for i in top_indices]
    # Combine: a doc scores alpha for a keyword hit plus (1 - alpha) for an embedding hit
    combined = {}
    docs_by_id = {}
    for doc in keyword_results:
        combined[doc['id']] = alpha
        docs_by_id[doc['id']] = doc
    for doc in embedding_results:
        combined[doc['id']] = combined.get(doc['id'], 0) + (1 - alpha)
        docs_by_id[doc['id']] = doc
    # Sort by combined score and keep that order in the returned list
    sorted_ids = sorted(combined, key=combined.get, reverse=True)
    results = [docs_by_id[doc_id] for doc_id in sorted_ids]
    results = filter_by_max_time(results, max_time)
    # Deduplicate before returning (defensive; ids are already unique here)
    deduped_results = deduplicate_results(results)
    return deduped_results[:num_results]
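
The embedding step above uses `np.dot` as its similarity score. A dot product equals cosine similarity only when both vectors have unit length. OpenAI documents its embeddings as normalized to length 1, so the two should coincide here, but that may not hold for vectors from another model. The sketch below uses made-up 3-dimensional vectors and a hypothetical helper, `cosine_similarities`, to show how the two rankings can differ; it assumes no vector is all zeros.

```python
import numpy as np


def cosine_similarities(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix (hypothetical helper)."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_norm = np.linalg.norm(query_vec)
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    return (doc_matrix @ query_vec) / (doc_norms * query_norm)


# Toy "embeddings" with different lengths (made up for illustration)
docs = np.array([[1.0, 0.0, 0.0],
                 [10.0, 10.0, 0.0],
                 [0.0, 0.0, 2.0]])
query = np.array([2.0, 0.0, 0.0])

raw_dot = docs @ query
cosine = cosine_similarities(query, docs)
print(np.argsort(raw_dot)[::-1])  # ranking by raw dot product: the long vector wins
print(np.argsort(cosine)[::-1])   # ranking by cosine: the vector pointing the same way wins
```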

##### Approach 5: Cover Ingredients Search
Selects a set of recipes that together cover as many of the query ingredients as possible, optionally filtering out recipes that exceed the max_time constraint.


In [41]:
def ms_cover_ingredients_search(query, index, num_results=5, max_time=None):
    """
    Selects a set of recipes that together cover as many of the query ingredients as possible,
    optionally filtering out recipes that exceed the max_time constraint.
    """
    query_ings = tokenize_ingredients(query)
    uncovered = set(query_ings)
    selected = []
    docs = index.documents.copy()

    # Apply time filter to docs at the start for efficiency
    docs = filter_by_max_time(docs, max_time)

    while uncovered and len(selected) < num_results and docs:
        best_doc = None
        best_overlap = 0
        for doc in docs:
            recipe_ings = tokenize_ingredients(doc.get('all_ingredients', ''))
            overlap = len(uncovered & recipe_ings)
            if overlap > best_overlap:
                best_overlap = overlap
                best_doc = doc
        if best_doc and best_overlap > 0:
            selected.append(best_doc)
            recipe_ings = tokenize_ingredients(best_doc.get('all_ingredients', ''))
            uncovered -= recipe_ings
            docs.remove(best_doc)
        else:
            break
    # Deduplicate results before returning (recommended if docs may have overlap)
    deduped_results = deduplicate_results(selected)
    return deduped_results

#### 3.3. Test All Retrieval Approaches
Test each retrieval approach manually with a sample ingredient list.

In [None]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 45

ms_basic_results = ms_basic_search(ingredients, index, max_time=max_time)
print("MS Basic Search:", ms_basic_results)
print_unused_ingredients(ingredients, ms_basic_results)

ms_boosting_results = ms_boosting_search(ingredients, index, max_time=max_time)
print("MS Ingredient Search:", ms_boosting_results)
print_unused_ingredients(ingredients, ms_boosting_results)

ms_query_expansion_results = ms_expansion_search(ingredients, index, max_time=max_time)
print("MS Query Expansion Search:", ms_query_expansion_results)
print_unused_ingredients(ingredients, ms_query_expansion_results)

ms_hybrid_results = ms_hybrid_search(ingredients, index, max_time=max_time)
print("MS Hybrid Search:", ms_hybrid_results)
print_unused_ingredients(ingredients, ms_hybrid_results)

# num_results defaults to 5 in every search function above.
# It is passed explicitly here only to make visible how many recipes the cover loop may select.
cover_results = ms_cover_ingredients_search(ingredients, index, num_results=5, max_time=max_time)
for doc in cover_results:
    print(doc['recipe_name'], "-", doc['all_ingredients'])
print_unused_ingredients(ingredients, cover_results)

MS Basic Search: [{'recipe_name': 'African Pasta Roast', 'cuisine_type': 'African', 'meal_type': 'Breakfast', 'difficulty_level': 'Medium', 'prep_time_minutes': 21, 'cook_time_minutes': 3, 'servings': 5, 'main_ingredients': 'Pasta, Rice, Chicken', 'all_ingredients': 'Pasta, Rice, Chicken, Butter, Flour, Garlic, Soy Sauce', 'dietary_restrictions': 'Contains shellfish', 'instructions': 'Prepare Pasta, Rice, Chicken. Cook with Pasta, Rice, Chicken, Butter, Flour, Garlic, Soy Sauce. Season and serve.', 'nutritional_info': 'Calories: 320, Protein: 35g, Carbs: 44g, Fat: 6g', 'id': 235}]
Unused ingredients: onion, tomato, milk, shrimp, potatoes, cheese, lemon, sugar, chocolate, thyme, bell pepper, fish, eggs
MS Ingredient Search: []
Unused ingredients: tomato, milk, rice, potatoes, cheese, lemon, sugar, chocolate, thyme, fish, shrimp, eggs, onion, butter, flour, garlic, chicken, bell pepper, pasta
MS Query Expansion Search: [{'recipe_name': 'African Pasta Roast', 'cuisine_type': 'African', 'm

#### 3.4. Inspect ms_cover_ingredients_search
Print selected fields of each recipe returned by `ms_cover_ingredients_search` to inspect the results in a readable form.

In [43]:
for recipe in ms_cover_ingredients_search(ingredients, index, num_results=5, max_time=max_time):
    print(f"Name: {recipe['recipe_name']}")
    print(f"Main Ingredients: {recipe['main_ingredients']}")
    print(f"All Ingredients: {recipe['all_ingredients']}")
    print(f"Instructions: {recipe['instructions']}")
    print(f"Cuisine: {recipe['cuisine_type']}")
    print(f"Preparation Time: {recipe['prep_time_minutes']} minutes")
    print(f"Cooking Time: {recipe['cook_time_minutes']} minutes")
    print("-" * 40)

Name: African Pasta Roast
Main Ingredients: Pasta, Rice, Chicken
All Ingredients: Pasta, Rice, Chicken, Butter, Flour, Garlic, Soy Sauce
Instructions: Prepare Pasta, Rice, Chicken. Cook with Pasta, Rice, Chicken, Butter, Flour, Garlic, Soy Sauce. Season and serve.
Cuisine: African
Preparation Time: 21 minutes
Cooking Time: 3 minutes
----------------------------------------
Name: African Lentils Tagine
Main Ingredients: Lentils, Shrimp, Eggs
All Ingredients: Lentils, Shrimp, Eggs, Butter, Yogurt, Sugar, Onion
Instructions: Prepare Lentils, Shrimp, Eggs. Cook with Lentils, Shrimp, Eggs, Butter, Yogurt, Sugar, Onion. Season and serve.
Cuisine: African
Preparation Time: 23 minutes
Cooking Time: 4 minutes
----------------------------------------
Name: African Potatoes Curry
Main Ingredients: Potatoes, Beef, Cheese
All Ingredients: Potatoes, Beef, Cheese, Herbs, Yogurt, Honey, Milk
Instructions: Prepare Potatoes, Beef, Cheese. Cook with Potatoes, Beef, Cheese, Herbs, Yogurt, Honey, Milk. Sea

#### 3.5. Retrieval Approach Conclusion:

* MS Basic Search and MS Query Expansion Search both return only a single recipe, covering just a few of the provided ingredients and leaving most unused. This shows that simple keyword or synonym expansion is not sufficient for broad ingredient coverage.
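* MS Ingredient Search (the label printed for the boosting search) returns no results. Boosting only reweights field scores and never excludes documents, so the likeliest cause is that none of the top `num_results*2` boosted candidates passed the `max_time` filter, which is applied after retrieval.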
* MS Hybrid Search improves coverage, returning multiple recipes and using more of the provided ingredients, but still leaves several unused.
* Cover Ingredients Search (as shown in the last block of results) is the most effective at maximizing ingredient usage, returning a diverse set of recipes that together cover nearly all the user's ingredients, with only a few left unused.

For users who want to use as many of their ingredients as possible, Cover Ingredients Search is the best approach. Hybrid and basic/expansion searches are better for finding the most relevant individual recipes, but not for maximizing ingredient usage across a set. Combining the two (cover first, then semantic reranking, as built in section 4.5) aims to provide both diversity and relevance for real-world recipe recommendations.
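
The comparison above can be made less impressionistic by putting a number on "ingredient usage". The sketch below computes the fraction of query ingredients that appear in at least one returned recipe, using the same comma-split, exact-match rule as `tokenize_ingredients`. `ingredient_coverage` is a hypothetical helper and the sample records are made up.

```python
def ingredient_coverage(query, results):
    """Return the fraction of query ingredients found in at least one recipe (hypothetical helper)."""
    def tokenize(text):
        return {part.strip().lower() for part in text.split(",") if part.strip()}

    wanted = tokenize(query)
    if not wanted:
        return 0.0
    found = set()
    for doc in results:
        found |= tokenize(doc.get("all_ingredients", ""))
    return len(wanted & found) / len(wanted)


sample_results = [  # made-up records for illustration
    {"all_ingredients": "Pasta, Rice, Chicken, Butter"},
    {"all_ingredients": "Shrimp, Eggs, Onion"},
]
print(ingredient_coverage("chicken, rice, shrimp, lemon", sample_results))  # → 0.75
```

Running it on the output of each search function with the same query and `max_time` would give one comparable score per approach.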


### 4. OpenAI Integration and Minsearch RAG Flow
Use OpenAI's language model to generate answers based on the context retrieved from Minsearch and later Elasticsearch.

#### 4.1. Build Prompt for the LLM Using Retrieved Context

In [None]:
def build_prompt(query, search_results):
    entry_template = """
Recipe: {recipe_name}
Cuisine: {cuisine_type}
Meal Type: {meal_type}
Difficulty: {difficulty_level}
Prep Time: {prep_time_minutes} minutes
Cook Time: {cook_time_minutes} minutes
Main Ingredients: {main_ingredients}
Instructions: {instructions}
Dietary Info: {dietary_restrictions}
""".strip()
    context = "\n\n".join([entry_template.format(**doc) for doc in search_results])
    prompt_template = """
You are an expert chef and culinary assistant. Answer the question based on the content from our recipe database.
Use only the facts from the context when answering the question.

CONTEXT:
{context}

QUESTION: {question}

Provide recipe recommendations with brief explanations of why they match the requested ingredients.
If exact ingredients aren't available, suggest the closest matches and mention any substitutions needed.
""".strip()
    return prompt_template.format(context=context, question=query)

#### 4.2. LLM Call Function

In [9]:
def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

#### 4.3. RAG Pipeline Using Minsearch
Test ms_cover_ingredients_search and ms_hybrid_search on the RAG Pipeline.

In [None]:
def rag_minsearch(query, max_time=None, num_results=5, approach="cover"):
    """
    Retrieve recipes with Minsearch, build the prompt, and return the LLM answer.
    approach: "cover" for ms_cover_ingredients_search, "hybrid" for ms_hybrid_search
    Deduplicates results before building prompt.
    """
    if approach == "cover":
        search_results = ms_cover_ingredients_search(query, index=index, num_results=num_results, max_time=max_time)
    elif approach == "hybrid":
        search_results = ms_hybrid_search(query, index=index, num_results=num_results, max_time=max_time)
    else:
        raise ValueError("Unknown approach: choose 'cover' or 'hybrid'")
    deduped_results = deduplicate_results(search_results)
    prompt = build_prompt(query, deduped_results)
    answer = llm(prompt)
    return answer

In [47]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 45
print("Cover Ingredients Search:")
print(rag_minsearch(ingredients, max_time=max_time, approach="cover"))
print("\nHybrid Search:")
print(rag_minsearch(ingredients, max_time=max_time, approach="hybrid"))

Cover Ingredients Search:
Based on the ingredients you provided, here are the recipe recommendations from the context, along with brief explanations for their compatibility:

1. **African Pasta Roast**
   - **Ingredients Used**: Pasta, Chicken, Garlic, Butter, Rice
   - **Explanation**: This recipe matches well with the chicken, pasta, garlic, and butter you have. Since you mentioned rice as well, it is also included in the recipe, allowing for a complete match. You can omit any ingredient not listed explicitly if necessary.

2. **African Lentils Tagine**
   - **Ingredients Used**: Shrimp, Eggs, Onion, Sugar, Butter
   - **Explanation**: This recipe makes use of shrimp and eggs from your list. Although lentils are not mentioned in your ingredients, you can incorporate them if you have them on hand. Sugar and onion are included, making it a suitable option.

3. **African Potatoes Curry**
   - **Ingredients Used**: Potatoes, Cheese, Milk
   - **Explanation**: This recipe utilizes potatoe

#### 4.4. Minsearch RAG Pipeline Conclusion

* Cover Ingredients Search provides a diverse set of recipes that, together, maximize the use of the available ingredients. It selects recipes so that, across the set, most ingredients are used at least once. This approach is best when the goal is to use as many ingredients as possible, even if no single recipe uses them all. In the generated answer, the LLM also offers suggestions for substitutions and adaptations.

* Hybrid Search surfaces recipes that are individually the most semantically relevant to the ingredient list. These recipes tend to match several of the main ingredients and provide strong conceptual alignment, but may not maximize overall ingredient coverage across multiple recipes. Hybrid Search is best when you want the most relevant single recipes for the main ingredients.

Use Cover Ingredients Search when you want to minimize waste and use a broad range of ingredients across several recipes. Use Hybrid Search when you want the most relevant or conceptually similar recipes, even if some ingredients are left unused. Neither approach will always use every ingredient, but Cover Ingredients Search will typically leave fewer unused items.

#### 4.5. Combine Cover and Hybrid RAG Pipelines (Minsearch)

Combine the two approaches by first running Cover Ingredients Search to select a diverse set of recipes that cover as many ingredients as possible, and then reranking that pool by embedding (semantic) similarity to the query, which is the semantic half of Hybrid Search.

In [None]:

def ms_cover_then_hybrid_search(query, index, num_results=5, max_time=None, hybrid_top_k=5):
    """
    Combine Cover Ingredients Search and Hybrid Search:
    1. Run Cover Ingredients Search to get a diverse set of recipes (pool size = num_results*4).
    2. Rerank the pool by semantic similarity to the query using OpenAI embeddings.
    3. Deduplicate results before returning.
    """
    # Step 1: Cover Ingredients Search to get a diverse set and increase the pool size
    cover_results = ms_cover_ingredients_search(query, index, num_results=num_results*4, max_time=max_time)
    if not cover_results:
        return []
    # Step 2: Rerank the cover_results pool by embedding similarity to the query
    # Compute embedding for the query
    query_emb = get_embedding(query)
    # Compute similarities only for cover_results (one embedding API call per candidate)
    cover_embeddings = [get_embedding(doc['all_ingredients']) for doc in cover_results]
    similarities = [np.dot(query_emb, emb) for emb in cover_embeddings]
    # Get top-k by semantic similarity
    top_indices = np.argsort(similarities)[-hybrid_top_k:][::-1]
    # Ranking here is purely embedding-based; the keyword signal comes from the cover step's ingredient matching
    hybrid_results = [cover_results[i] for i in top_indices]
    # Deduplicate before returning (defensive, in case of overlap)
    deduped_results = deduplicate_results(hybrid_results[:num_results])
    return deduped_results
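
The function above requests a fresh embedding for every candidate, although section 3.2 already stores one vector per recipe in `index.embeddings`. Below is a sketch of reranking from a cache keyed by document id instead. `rerank_with_cache` and `embedding_by_id` are hypothetical names, and the toy vectors stand in for real embeddings.

```python
import numpy as np


def rerank_with_cache(query_vec, candidates, embedding_by_id, top_k=5):
    """Order candidates by dot-product similarity using cached vectors keyed by doc id."""
    scored = []
    for doc in candidates:
        vec = embedding_by_id.get(doc["id"])
        if vec is None:
            continue  # no cached vector: skip rather than call the API here
        scored.append((float(np.dot(query_vec, vec)), doc))
    # Sort on the score only, so ties never fall back to comparing dicts
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Toy data standing in for index.documents / index.embeddings
cache = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([0.6, 0.8])}
pool = [{"id": 0, "recipe_name": "A"}, {"id": 1, "recipe_name": "B"}, {"id": 2, "recipe_name": "C"}]
top = rerank_with_cache(np.array([0.0, 1.0]), pool, cache, top_k=2)
print([doc["recipe_name"] for doc in top])  # → ['B', 'C']
```

In the notebook, the cache could be built as `{doc['id']: emb for doc, emb in zip(index.documents, index.embeddings)}`, assuming the two lists are still in the same order.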

In [49]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 45
print("Cover + Hybrid Search:")
print(ms_cover_then_hybrid_search(ingredients, index, num_results=5, max_time=max_time))

Cover + Hybrid Search:
[{'recipe_name': 'Thai Tomatoes Power Bowl', 'cuisine_type': 'Thai', 'meal_type': 'Lunch', 'difficulty_level': 'Hard', 'prep_time_minutes': 25, 'cook_time_minutes': 2, 'servings': 2, 'main_ingredients': 'Tomatoes, Fish, Mushrooms', 'all_ingredients': 'Tomatoes, Fish, Mushrooms, Garlic, Spices, Milk, Sugar', 'dietary_restrictions': 'Contains nuts, Low-carb', 'instructions': 'Prepare Tomatoes, Fish, Mushrooms. Cook with Tomatoes, Fish, Mushrooms, Garlic, Spices, Milk, Sugar. Season and serve.', 'nutritional_info': 'Calories: 359, Protein: 27g, Carbs: 75g, Fat: 16g', 'id': 19}, {'recipe_name': 'African Pasta Roast', 'cuisine_type': 'African', 'meal_type': 'Breakfast', 'difficulty_level': 'Medium', 'prep_time_minutes': 21, 'cook_time_minutes': 3, 'servings': 5, 'main_ingredients': 'Pasta, Rice, Chicken', 'all_ingredients': 'Pasta, Rice, Chicken, Butter, Flour, Garlic, Soy Sauce', 'dietary_restrictions': 'Contains shellfish', 'instructions': 'Prepare Pasta, Rice, Chic

#### 4.6. Conclusion of Combined Approach (Cover + Hybrid Search):

The combined Cover + Hybrid Search approach successfully returns a set of recipes that:

* Maximize ingredient coverage: Across the five recipes, most of the user's provided ingredients are used at least once, minimizing waste and increasing the likelihood that the user can cook with what they have.
* Balance diversity and relevance: The recipes span multiple cuisines (Thai, African, Mexican) and meal types, showing diversity, while also being semantically relevant to the input ingredient list.
* Provide practical options: Even if no single recipe uses all ingredients, the set as a whole covers a broad range, and each recipe is a strong match for a subset of the user's ingredients.

The combined approach is more robust than using either Cover Ingredients Search or Hybrid Search alone. It is ideal for users who want both high ingredient utilization and relevant, varied recipe recommendations. This makes it well-suited for real-world recipe assistant scenarios.

### 5. Elasticsearch: Production-Grade Search Engine
Elasticsearch is a powerful, scalable, distributed search engine designed for production use, large datasets, and advanced search features.
It requires running a server and is suitable for real-world applications where persistence, scalability, and advanced querying are needed.

#### 5.1. Elasticsearch Docker, Client and Indexing

Start Elasticsearch with Docker from a terminal:

`docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

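Elasticsearch 8.x enables security (TLS and authentication) by default, so the plain `http://localhost:9200` client and `curl` check used below need the following variant, which disables security and also caps the JVM heap: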

`docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

And check if Elasticsearch is up:

`curl http://localhost:9200`

In [10]:
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

# Create Elasticsearch Client (make sure Elasticsearch is running locally)
es_client = Elasticsearch('http://localhost:9200')

In [None]:
# Define Elasticsearch index settings and mappings for recipes
index_settings = {
    "settings": {
        "number_of_shards": 1, # A shard is a unit of storage and search; more shards can improve parallelism for large datasets
        "number_of_replicas": 0 # Replicas are shard copies for fault tolerance and search throughput; 0 is fine locally but not recommended for production
    },
    "mappings": {
        "properties": {
            "recipe_name": {"type": "text"},
            "main_ingredients": {"type": "text"},
            "all_ingredients": {"type": "text"},
            "instructions": {"type": "text"},
            "cuisine_type": {"type": "text"},
            "dietary_restrictions": {"type": "text"},
            "meal_type": {"type": "keyword"},
            "difficulty_level": {"type": "keyword"},
            "prep_time_minutes": {"type": "integer"},
            "cook_time_minutes": {"type": "integer"},
            "all_ingredients_vector": {
                "type": "dense_vector",
                "dims": 1536  # text-embedding-3-small returns 1536-dimensional vectors
            }
        }
    }
}

index_name = "recipes"

# Create the index (ignore error if it already exists)
try:
    es_client.indices.create(index=index_name, body=index_settings)
except Exception as e:
    print("Index may already exist:", e)

Index may already exist: BadRequestError(400, 'resource_already_exists_exception', 'index [recipes/7f3OhyhISIGjhfTsVnRwfA] already exists')


In [12]:
def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[text]
    )
    return np.array(response.data[0].embedding)

emb = get_embedding("test")
print(emb.shape)  # Should print (1536,)

(1536,)


In [13]:
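# Index documents into Elasticsearch, including the embedding vector.
# Passing an explicit id makes re-runs overwrite existing documents instead of creating duplicates.
for doc in tqdm(documents):
    doc['all_ingredients_vector'] = get_embedding(doc['all_ingredients']).tolist()
    es_client.index(index=index_name, id=str(doc['id']), document=doc)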

  0%|          | 0/477 [00:00<?, ?it/s]
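
The loop above sends one HTTP request per document. The Python client also ships a `helpers.bulk` function that accepts an iterable of actions and batches them. The sketch below only builds those actions, so it runs without a cluster. `bulk_actions` and `id_field` are hypothetical names, and the sample records are made up.

```python
def bulk_actions(docs, index_name, id_field="id"):
    """Yield one bulk 'index' action per document, using id_field as the document _id."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc[id_field],
            "_source": doc,
        }


sample_docs = [  # made-up records; real ones would also carry the embedding vector
    {"id": 0, "recipe_name": "Spaghetti Carbonara"},
    {"id": 1, "recipe_name": "African Pasta Roast"},
]
actions = list(bulk_actions(sample_docs, "recipes"))
print(actions[0]["_id"], actions[1]["_source"]["recipe_name"])

# With a running cluster, the actions could be sent in batches:
# from elasticsearch import helpers
# helpers.bulk(es_client, bulk_actions(documents, index_name))
```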

#### 5.2. Build Elasticsearch Index and Run Simple Retrieval Strategies

##### Approach 1: Basic Keyword Search

Returns recipes that match the query using simple keyword matching across all fields.


In [None]:
# Basic Search
def es_basic_search(query, num_results=5, max_time=None):
    search_query = {
        "size": num_results * 2,  # get more to allow filtering
        "query": {
            "multi_match": {
                "query": query,
                "fields": [ # Elasticsearch's multi_match query is for text searching
                    "recipe_name",
                    "main_ingredients^2",
                    "all_ingredients^3",
                    "instructions^1.5",
                    "cuisine_type",
                    "dietary_restrictions^1.5"
                ],
                "type": "best_fields" # default, can be "most_fields", "cross_fields", etc.
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = []
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    # Apply time filter
    if max_time is not None:
        result_docs = filter_by_max_time(result_docs, max_time)
    return result_docs[:num_results]
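
The function above applies the time limit after retrieval, so it can return fewer than `num_results` recipes (or none) when the top-ranked hits are all too slow. One alternative is to let Elasticsearch enforce the limit with a `bool` query whose filter is a Painless script. The sketch below only builds the request body and has not been run against this notebook's cluster. `build_timed_query` is a hypothetical helper, and the script assumes both time fields are present on every document.

```python
def build_timed_query(query, fields, num_results=5, max_time=None):
    """Build a search body whose time limit is enforced by Elasticsearch itself."""
    text_clause = {"multi_match": {"query": query, "fields": fields, "type": "best_fields"}}
    if max_time is None:
        return {"size": num_results, "query": text_clause}
    time_filter = {
        "script": {
            "script": {
                # Painless: total time = prep + cook; assumes both fields exist on every document
                "source": "doc['prep_time_minutes'].value + doc['cook_time_minutes'].value <= params.max_time",
                "params": {"max_time": max_time},
            }
        }
    }
    return {"size": num_results, "query": {"bool": {"must": [text_clause], "filter": [time_filter]}}}


body = build_timed_query("chicken, rice", ["all_ingredients^3", "recipe_name"], max_time=45)
print(body["query"]["bool"]["filter"][0]["script"]["script"]["params"])  # → {'max_time': 45}
```

The resulting body could be passed to `es_client.search(index=index_name, body=body)`, in which case over-fetching `num_results * 2` hits would no longer be needed for the time filter.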

##### Approach 2: Boosting Search

Similar to basic, but gives more weight to important fields.

In [15]:
# Boosted Field Search
def es_boosting_search(query, num_results=5, max_time=None):
    search_query = {
        "size": num_results * 2,
        "query": {
            "multi_match": {
                "query": query,
                "fields": [
                    "recipe_name",
                    "main_ingredients^4",
                    "all_ingredients^5",
                    "instructions^3",
                    "cuisine_type",
                    "dietary_restrictions^2"
                ],
                "type": "best_fields"
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = [hit['_source'] for hit in response['hits']['hits']]
    if max_time is not None:
        result_docs = filter_by_max_time(result_docs, max_time)
    return result_docs[:num_results]

##### Approach 3: Query Expansion Search

Expands the query with synonyms (e.g., "chicken" → "poultry", "breast") before searching, to improve recall using es_basic_search.

In [21]:
# Query Expansion (with synonyms)
def es_expansion_search(query, num_results=5, max_time=None):
    # Simple synonym expansion (can be replaced with a more advanced method)
    synonyms = {
        'chicken': ['poultry', 'fowl', 'breast', 'thigh'],
        'pasta': ['noodles', 'spaghetti', 'linguine', 'vermicelli', 'orzo', 'macaroni'],
        'tomatoes': ['tomato', 'roma', 'cherry tomatoes', 'heirloom'],
        'beef': ['steak', 'ground beef', 'brisket', 'sirloin'],
        'potatoes': ['spuds', 'russet', 'yukon', 'sweet potatoes'],
        'cheese': ['cheddar', 'mozzarella', 'parmesan', 'feta', 'gruyere'],
        'onion': ['onions', 'red onion', 'yellow onion', 'shallot'],
        'rice': ['basmati', 'jasmine', 'arborio', 'brown rice'],
        'shrimp': ['prawns', 'shellfish', 'seafood'],
        'eggs': ['egg', 'yolk', 'whites'],
        'herbs': ['basil', 'oregano', 'thyme', 'rosemary']
    }
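    # Extract word tokens so comma-separated input ("chicken, rice") still matches the synonym keys;
    # a plain split() would leave "chicken," with its comma attached and never match.
    tokens = re.findall(r"[\w']+", query.lower())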
    expanded_tokens = tokens.copy()
    for token in tokens:
        if token in synonyms:
            expanded_tokens.extend(synonyms[token][:2])
    expanded_query = ' '.join(expanded_tokens)
    return es_basic_search(expanded_query, num_results=num_results, max_time=max_time)

#### 5.3. Advanced Retrieval Strategies
Below are advanced retrieval strategies that allow for more flexible, robust, and user-friendly recipe search.

##### Approach 4: Hybrid Search With OpenAI (Keyword + Embedding)
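Combines keyword matching with OpenAI embedding-based (semantic) similarity for more robust retrieval. In the query below, the keyword multi_match selects the candidate recipes, and the cosine similarity between the query embedding and each recipe's all_ingredients_vector determines their ranking.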

In [20]:
# Hybrid Search (keyword + vector; requires a dense_vector field with precomputed embeddings in the index)
def es_hybrid_search(query, num_results=5, max_time=None):
    # This assumes you have a vector field called 'all_ingredients_vector' in your index
    # and a function get_embedding(text) that returns a vector for the query.
    query_vector = get_embedding(query).tolist()
    search_query = {
        "size": num_results * 2,
        "query": {
            "script_score": {
                "query": {
                    "multi_match": {
                        "query": query,
                        "fields": [
                            "main_ingredients^2",
                            "all_ingredients^3",
                            "instructions^1.5",
                            "cuisine_type",
                            "dietary_restrictions^1.5",
                            "recipe_name"
                        ],
                        "type": "best_fields"
                    }
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'all_ingredients_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                }
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = [hit['_source'] for hit in response['hits']['hits']]
    if max_time is not None:
        result_docs = filter_by_max_time(result_docs, max_time)
    # Deduplicate before returning (recommended for hybrid search)
    deduped_results = deduplicate_results(result_docs)
    return deduped_results[:num_results]
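
The `+ 1.0` in the script shifts cosine similarity from the range [-1, 1] to [0, 2], which matters because Elasticsearch does not accept negative scores from a script_score query. The sketch below reproduces that arithmetic in plain Python on toy 3-dimensional vectors (real embeddings are far longer); `cosine_similarity` and `script_score` are hypothetical helpers that mirror the script, not Elasticsearch APIs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def script_score(query_vector, doc_vector):
    """Mirror of the script: cosine similarity shifted to be non-negative."""
    return cosine_similarity(query_vector, doc_vector) + 1.0

query_vector = [1.0, 0.0, 0.0]
print(script_score(query_vector, [2.0, 0.0, 0.0]))   # same direction -> 2.0
print(script_score(query_vector, [0.0, 3.0, 0.0]))   # orthogonal -> 1.0
print(script_score(query_vector, [-1.0, 0.0, 0.0]))  # opposite -> 0.0
```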

##### Approach 5: Cover Ingredients Search
Elasticsearch queries return a ranked list based on scoring, but do not natively support iterative, set-cover-style selection across multiple results. A workaround is to retrieve many candidates from Elasticsearch, then run the cover algorithm in Python on those results.

In [21]:
def es_cover_ingredients_search(query, num_results=5, max_time=None, candidate_pool_size=200):
    """
    Simulate Cover Ingredients Search using Elasticsearch results, with optional time filtering.
    Returns a set of recipes that together cover as many of the query ingredients as possible.
    Deduplicates results before returning.
    """
    # Step 1: Get a large pool of candidates from ES (basic search, large pool)
    candidates = es_basic_search(query, num_results=candidate_pool_size)
    # Step 2: Filter by time if needed
    if max_time is not None:
        candidates = filter_by_max_time(candidates, max_time)
    # Step 3: Apply greedy cover algorithm
    query_tokens = set(re.sub(r'[^\w\s]', '', query.lower()).replace(',', ' ').split())
    uncovered = set(query_tokens)
    selected = []
    docs = candidates.copy()
    while uncovered and len(selected) < num_results and docs:
        best_doc = None
        best_overlap = 0
        for doc in docs:
            ingredients = set(re.sub(r'[^\w\s]', '', str(doc.get('all_ingredients', '')).lower()).replace(',', ' ').split())
            overlap = len(uncovered & ingredients)
            if overlap > best_overlap:
                best_overlap = overlap
                best_doc = doc
        if best_doc and best_overlap > 0:
            selected.append(best_doc)
            ingredients = set(re.sub(r'[^\w\s]', '', str(best_doc.get('all_ingredients', '')).lower()).replace(',', ' ').split())
            uncovered -= ingredients
            docs.remove(best_doc)
        else:
            break
    # Deduplicate results before returning (recommended if docs may have overlap)
    deduped_results = deduplicate_results(selected)
    return deduped_results
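
The greedy selection is easier to follow on a tiny in-memory example. The sketch below is a standalone illustration that needs no Elasticsearch; the three recipes, the `tokenize` helper, and the `greedy_cover` function are made up for the example, and punctuation is replaced by spaces rather than removed.

```python
import re

def tokenize(text):
    """Lowercase, replace punctuation with spaces, and split into a set of words."""
    return set(re.sub(r"[^\w\s]", " ", text.lower()).split())

def greedy_cover(query, recipes, num_results=5):
    """Repeatedly pick the recipe that covers the most still-uncovered query words."""
    uncovered = tokenize(query)
    remaining = list(recipes)
    selected = []
    while uncovered and remaining and len(selected) < num_results:
        best = max(remaining, key=lambda r: len(uncovered & tokenize(r["all_ingredients"])))
        gain = uncovered & tokenize(best["all_ingredients"])
        if not gain:
            break  # no remaining recipe covers anything new
        selected.append(best)
        uncovered -= gain
        remaining.remove(best)
    return selected, uncovered

# Toy data, invented for this illustration.
recipes = [
    {"recipe_name": "Fried Rice", "all_ingredients": "Rice, Eggs, Onion"},
    {"recipe_name": "Omelette", "all_ingredients": "Eggs, Cheese"},
    {"recipe_name": "Garlic Shrimp", "all_ingredients": "Shrimp, Garlic, Butter"},
]
selected, uncovered = greedy_cover("rice, eggs, cheese, shrimp, garlic, lemon", recipes)
print([r["recipe_name"] for r in selected])  # ['Fried Rice', 'Garlic Shrimp', 'Omelette']
print(sorted(uncovered))                     # ['lemon']
```

Note that "Omelette" is picked last even though it ties with the others in the first round: once "eggs" is covered by "Fried Rice", it only contributes "cheese".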

##### Approach 6: Filtering (by time, dietary restrictions)
Filters results by dietary restriction inside the Elasticsearch query (a bool query with an additional match clause) and by maximum time in Python after retrieval, using filter_by_max_time.

In [32]:
def es_filtered_search(query, num_results=5, max_time=None, dietary_restriction=None):
    """
    Filtered search using Elasticsearch's query DSL.
    Allows filtering by both time and dietary restriction.
    Deduplicates results before returning.
    """
    must_clauses = [
        {"multi_match": {
            "query": query,
            "fields": [
                "main_ingredients^2",
                "all_ingredients^3",
                "instructions^1.5",
                "cuisine_type",
                "dietary_restrictions^1.5",
                "recipe_name"
            ],
            "type": "best_fields"
        }}
    ]
    if dietary_restriction:
        must_clauses.append({
            "match": {"dietary_restrictions": dietary_restriction}
        })
    search_query = {
        "size": num_results * 2,
        "query": {
            "bool": {
                "must": must_clauses
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = [hit['_source'] for hit in response['hits']['hits']]
    if max_time is not None:
        result_docs = filter_by_max_time(result_docs, max_time)
    # Deduplicate before returning (recommended for filtered search)
    deduped_results = deduplicate_results(result_docs)
    return deduped_results[:num_results]
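
As a possible variation, both restrictions could be pushed into the query itself by using the bool query's non-scoring `filter` clause. The sketch below only builds the request body and does not call Elasticsearch. The helper name `build_filtered_query` and the `total_time_minutes` field are assumptions: the notebook's index is not shown to contain such a field, so it would have to be added at indexing time (for example as prep plus cook time).

```python
def build_filtered_query(query, num_results=5, max_time=None, dietary_restriction=None):
    """Build a bool query whose filters restrict results without affecting scores."""
    filters = []
    if dietary_restriction:
        filters.append({"match": {"dietary_restrictions": dietary_restriction}})
    if max_time is not None:
        # ASSUMPTION: 'total_time_minutes' is a hypothetical precomputed numeric field.
        filters.append({"range": {"total_time_minutes": {"lte": max_time}}})
    return {
        "size": num_results,
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": query,
                        "fields": ["main_ingredients^2", "all_ingredients^3", "recipe_name"],
                        "type": "best_fields",
                    }
                }],
                "filter": filters,
            }
        },
    }

body = build_filtered_query("tomato, rice", max_time=45, dietary_restriction="Vegetarian")
print(body["query"]["bool"]["filter"])
```

Filtering inside the query means the requested `size` already reflects the restrictions, so there is no need to over-fetch and trim in Python.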

##### Approach 7: Custom Scoring (function_score)
Adds scoring functions (e.g., penalizing higher prep/cook times) to the text relevance score, so recipes with lower prep/cook times are ranked higher.

In [33]:
def es_custom_score_search(query, num_results=5, max_time=None):
    """
    Custom scoring search using Elasticsearch's function_score.
    Adds scoring functions (e.g., penalizing higher prep/cook times) to the text relevance score,
    so recipes with lower prep/cook times are ranked higher.
    Deduplicates results before returning.
    """
    search_query = {
        "size": num_results * 2,
        "query": {
            "function_score": {  # Custom Scoring
                "query": {
                    "multi_match": {
                        "query": query,
                        "fields": [
                            "main_ingredients^2",
                            "all_ingredients^3",
                            "instructions^1.5",
                            "cuisine_type",
                            "dietary_restrictions^1.5",
                            "recipe_name"
                        ],
                        "type": "best_fields"
                    }
                },
                "boost_mode": "sum",
                "score_mode": "sum",
                "functions": [
                    {
                        "field_value_factor": {
                            "field": "prep_time_minutes",
                            "factor": 1.0,
                            "modifier": "reciprocal",  # Prefer lower prep time
                            "missing": 1
                        }
                    },
                    {
                        "field_value_factor": {
                            "field": "cook_time_minutes",
                            "factor": 1.0,
                            "modifier": "reciprocal",  # Prefer lower cook time
                            "missing": 1
                        }
                    }
                ]
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = [hit['_source'] for hit in response['hits']['hits']]
    if max_time is not None:
        result_docs = filter_by_max_time(result_docs, max_time)
    # Deduplicate before returning (recommended for custom score search)
    deduped_results = deduplicate_results(result_docs)
    return deduped_results[:num_results]
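
With `score_mode` and `boost_mode` both set to "sum", the final score is the text relevance score plus the two reciprocal terms. The sketch below works through that arithmetic in plain Python; `custom_score` is a hypothetical helper that mirrors the query settings, and the text score of 8.0 is an invented value.

```python
def custom_score(text_score, prep_time_minutes=None, cook_time_minutes=None,
                 factor=1.0, missing=1):
    """text_score + 1/(factor * prep) + 1/(factor * cook), mirroring the query above."""
    def reciprocal(value):
        value = missing if value is None else value  # 'missing' replaces absent fields
        return 1.0 / (factor * value)
    return text_score + reciprocal(prep_time_minutes) + reciprocal(cook_time_minutes)

# Same (invented) text relevance, different preparation and cooking times.
quick = custom_score(8.0, prep_time_minutes=5, cook_time_minutes=10)   # 8.0 + 0.2 + 0.1
slow = custom_score(8.0, prep_time_minutes=30, cook_time_minutes=60)   # 8.0 + 1/30 + 1/60
print(round(quick, 3), round(slow, 3))
```

Two things stand out from the arithmetic. The time bonus is small next to typical text scores, so it mostly acts as a tie-breaker. And a value of 0 (for example a no-cook recipe) makes the reciprocal undefined, so it is worth checking how such rows behave in the index.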

#### 5.4. Test All Retrieval Approaches:
Test each retrieval approach manually with a sample ingredient list.

In [34]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 45

# Elasticsearch retrieval approaches
es_basic_results = es_basic_search(ingredients, num_results=5, max_time=max_time)
print("ES Basic Search:")
for doc in es_basic_results:
    doc_copy = dict(doc)
    doc_copy.pop('all_ingredients_vector', None)
    print(doc_copy)
print_unused_ingredients(ingredients, es_basic_results)

es_boosting_results = es_boosting_search(ingredients, num_results=5, max_time=max_time)
print("ES Boosting Search:")
for doc in es_boosting_results:
    doc_copy = dict(doc)
    doc_copy.pop('all_ingredients_vector', None)
    print(doc_copy)
print_unused_ingredients(ingredients, es_boosting_results)

es_expansion_results = es_expansion_search(ingredients, num_results=5, max_time=max_time)
print("ES Query Expansion Search:")
for doc in es_expansion_results:
    doc_copy = dict(doc)
    doc_copy.pop('all_ingredients_vector', None)
    print(doc_copy)
print_unused_ingredients(ingredients, es_expansion_results)

es_hybrid_results = es_hybrid_search(ingredients, num_results=5, max_time=max_time)
print("ES Hybrid Search:")
for doc in es_hybrid_results:
    doc_copy = dict(doc)
    doc_copy.pop('all_ingredients_vector', None)
    print(doc_copy)
print_unused_ingredients(ingredients, es_hybrid_results)

es_cover_results = es_cover_ingredients_search(ingredients, num_results=5, max_time=max_time)
print("ES Cover Ingredients Search:")
for doc in es_cover_results:
    print(doc['recipe_name'], "-", doc['all_ingredients'])
for doc in es_cover_results:
    doc_copy = dict(doc)
    doc_copy.pop('all_ingredients_vector', None)
    print(doc_copy)    
print_unused_ingredients(ingredients, es_cover_results)

es_filtered_results = es_filtered_search(ingredients, num_results=5, max_time=max_time, dietary_restriction=None)
print("ES Filtered Search:")
for doc in es_filtered_results:
    doc_copy = dict(doc)
    doc_copy.pop('all_ingredients_vector', None)
    print(doc_copy)
print_unused_ingredients(ingredients, es_filtered_results)

es_custom_score_results = es_custom_score_search(ingredients, num_results=5, max_time=max_time)
print("ES Custom Score Search:")
for doc in es_custom_score_results:
    doc_copy = dict(doc)
    doc_copy.pop('all_ingredients_vector', None)
    print(doc_copy)
print_unused_ingredients(ingredients, es_custom_score_results)

ES Basic Search:
{'recipe_name': 'Avocado Toast', 'cuisine_type': 'American', 'meal_type': 'Breakfast', 'difficulty_level': 'Easy', 'prep_time_minutes': 5, 'cook_time_minutes': 0, 'servings': 1, 'main_ingredients': 'Bread, Avocado', 'all_ingredients': 'Whole-grain Bread, Ripe Avocado, Lemon Juice, Salt, Pepper, Olive Oil', 'dietary_restrictions': 'Vegetarian, Vegan', 'instructions': 'Toast bread slices. Mash avocado with lemon juice, salt, and pepper. Spread on toast and drizzle with olive oil.', 'nutritional_info': 'Calories: 280, Protein: 6g, Carbs: 28g, Fat: 16g', 'id': 2}
Unused ingredients: bell pepper, thyme, butter, shrimp, garlic, onion, potatoes, tomato, pasta, sugar, flour, chocolate, cheese, rice, chicken, milk, lemon, eggs, fish
ES Boosting Search:
{'recipe_name': 'Avocado Toast', 'cuisine_type': 'American', 'meal_type': 'Breakfast', 'difficulty_level': 'Easy', 'prep_time_minutes': 5, 'cook_time_minutes': 0, 'servings': 1, 'main_ingredients': 'Bread, Avocado', 'all_ingredie

#### 5.5. Retrieval Approach Conclusion
* Basic, Boosting, Expansion, and Filtered Search:
These approaches consistently return only a single recipe (e.g., "Avocado Toast"), leaving most of the provided ingredients unused. This shows that simple keyword matching, boosting, or synonym expansion in Elasticsearch is not sufficient for broad ingredient coverage when the user provides a long list of ingredients.

* Hybrid Search:
Returns a couple of recipes that match more ingredients, but still leaves many unused. It improves relevance by combining keyword and semantic similarity, but does not maximize ingredient coverage across multiple recipes.

* Cover Ingredients Search:
This approach is the most effective for maximizing ingredient usage. It returns a diverse set of recipes that together cover nearly all the user's ingredients, leaving only a few unused (e.g., bell pepper, thyme, tomato, lemon, and chocolate). This demonstrates that a greedy set-cover algorithm on top of Elasticsearch results is best for users who want to use as many of their ingredients as possible.

* Custom Score Search:
Returns a few more recipes than basic search, but still leaves many ingredients unused. It is better for prioritizing recipes with lower prep/cook times, not for maximizing ingredient coverage.

**Comparison to MS (Minsearch) Retrieval Approach:**

* MS Basic, Boosting, and Expansion Search:
Like Elasticsearch, these approaches return only a single recipe or none, leaving most ingredients unused.

* MS Hybrid Search:
Returns several relevant recipes, improving ingredient coverage, but still leaves a moderate number unused.

* MS Cover Ingredients Search:
Like ES Cover, this approach is the best at maximizing ingredient usage, returning a set of recipes that together use almost all the provided ingredients (only 4 unused: thyme, bell pepper, chocolate, lemon).

* Conclusion:
For both Elasticsearch and Minsearch, the Cover Ingredients Search approach is superior when the goal is to use as many of the user's ingredients as possible. Other approaches are better for finding the single most relevant recipe, but not for maximizing ingredient usage across a set. The results are similar for both ES and MS, but ES may be preferable for scalability and production use, while MS is simpler for prototyping.

### 6. OpenAI Integration and Elasticsearch RAG Flow
Use OpenAI's language model to generate answers based on the context retrieved from Minsearch or Elasticsearch. See the build_prompt and llm functions above.


#### 6.1. RAG Pipeline Using Elasticsearch 
Test es_cover_ingredients_search and es_hybrid_search on the RAG Pipeline.

In [None]:
# RAG pipeline: retrieve with Elasticsearch, build the prompt, then ask the LLM
def rag_elasticsearch(query, max_time=None, num_results=5, approach="cover"):
    """
    approach: "cover" for es_cover_ingredients_search, "hybrid" for es_hybrid_search
    """
    if approach == "cover":
        search_results = es_cover_ingredients_search(query, num_results=num_results, max_time=max_time)
    elif approach == "hybrid":
        search_results = es_hybrid_search(query, num_results=num_results, max_time=max_time)
    else:
        raise ValueError("Unknown approach: choose 'cover' or 'hybrid'")
    # Both search functions already deduplicate; this call is a cheap safeguard
    deduped_results = deduplicate_results(search_results)
    prompt = build_prompt(query, deduped_results)
    answer = llm(prompt)
    return answer

In [38]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 45
print("ES Cover Ingredients Search:")
print(rag_elasticsearch(ingredients, max_time=max_time, approach="cover"))
print("\nES Hybrid Search:")
print(rag_elasticsearch(ingredients, max_time=max_time, approach="hybrid"))

ES Cover Ingredients Search:
Based on your list of ingredients, here are some recipe recommendations:

1. **African Pasta Roast**: 
   - **Matching Ingredients**: Pasta, Chicken, Butter, Flour, Garlic, Rice.
   - **Explanation**: This recipe uses both pasta and rice, along with chicken and basic seasonings like garlic and butter that you have. You could use any available seasonings in place of those specifically listed.

2. **African Lentils Tagine**:
   - **Matching Ingredients**: Eggs, Shrimp, Butter, Sugar, Onion.
   - **Explanation**: If you're willing to substitute, this recipe works well with shrimp from your list, and it also incorporates eggs. It might require some substitutions for lentils and yogurt if those aren’t available, but it can still work with what you have.

3. **African Potatoes Curry**:
   - **Matching Ingredients**: Potatoes, Beef, Cheese, Yogurt (as a potential substitute for milk).
   - **Explanation**: This recipe directly matches your potato ingredient and al

#### 6.3. Elasticsearch RAG Pipeline Conclusion
* ES Cover Ingredients Search:
This approach recommends a set of recipes (e.g., African Pasta Roast, African Lentils Tagine, African Potatoes Curry, Avocado Toast) that together use a broad range of the provided ingredients. It suggests substitutions (e.g., swapping beef for chicken, or using eggs in the tagine) to maximize ingredient usage. Some ingredients remain unused and not every recipe is a perfect match, but the approach is practical for minimizing waste and offering variety.

* ES Hybrid Search:
This approach surfaces recipes that are semantically most relevant to the main ingredients (e.g., Thai Tomatoes Power Bowl, Mediterranean Tomatoes Soup). These recipes tend to match several key ingredients (like tomatoes, fish, garlic, milk, sugar, chicken, cheese), but leave many other ingredients unused. Hybrid search is best for finding the most relevant individual recipes, not for maximizing ingredient coverage across a set.

**Comparison to MS RAG Pipeline:**

* MS Cover Ingredients Search:
The MS pipeline also recommends a diverse set of recipes (African Pasta Roast, African Lentils Tagine, African Potatoes Curry, Beef Tacos) that together use most of your ingredients. It provides similar substitution suggestions and covers a wide range of the ingredient list, with only a few items left unused.

* MS Hybrid Search:
Like ES Hybrid, this approach surfaces recipes that are individually most relevant to the main ingredients (Thai Tomatoes Power Bowl, African Pasta Roast, Mediterranean Tomatoes Soup). It matches several key ingredients but does not maximize overall ingredient coverage.

Use Cover Ingredients Search for maximum ingredient utilization and variety. Use Hybrid Search for the most semantically relevant recipes. Both ES and MS pipelines provide similar quality, but ES is preferable for larger, production deployments.

#### 6.4. Combine the Two Approaches (Elasticsearch)
Combine the two approaches by first running Cover Ingredients Search to select a diverse set of recipes that cover as many ingredients as possible, and then passing those results through the Hybrid Search to rerank or filter them by semantic relevance.

In [None]:

def es_cover_then_hybrid_search(query, num_results=5, max_time=None, hybrid_top_k=5, candidate_pool_size=200):
    """
    Combine Cover Ingredients Search and Hybrid Search for Elasticsearch:
    1. Run Cover Ingredients Search to get a diverse, already deduplicated set of recipes
       (up to candidate_pool_size).
    2. Rerank that set by embedding similarity to the query using OpenAI embeddings
       and keep the top hybrid_top_k.
    3. Remove 'all_ingredients_vector' from each result for cleaner output.
    4. Return at most num_results recipes.
    """
    # Step 1: Cover Ingredients Search to get a diverse set and increase the pool size
    cover_results = es_cover_ingredients_search(query, num_results=candidate_pool_size, max_time=max_time)
    if not cover_results:
        return []
    # Step 2: Hybrid Search, but restrict to cover_results as the candidate pool
    query_emb = get_embedding(query)
    cover_embeddings = [get_embedding(doc['all_ingredients']) for doc in cover_results]
    similarities = [np.dot(query_emb, emb) for emb in cover_embeddings]
    top_indices = np.argsort(similarities)[-hybrid_top_k:][::-1]
    # Keyword relevance comes from the cover step; the embedding similarity above adds the semantic ranking
    hybrid_results = [cover_results[i] for i in top_indices]
    # Remove all_ingredients_vector from each result
    for doc in hybrid_results:
        doc.pop('all_ingredients_vector', None)

    return hybrid_results[:num_results]
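
The reranking step can be tried without any API calls by using precomputed toy vectors. The sketch below is a standalone illustration: `rerank_by_similarity` is a hypothetical helper, the 2-dimensional vectors are invented, and it assumes unit-length embeddings, in which case the dot product equals cosine similarity.

```python
def rerank_by_similarity(query_vector, docs, top_k=5, vector_field="all_ingredients_vector"):
    """Sort docs by dot product with the query vector and return the top_k without vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(docs, key=lambda d: dot(query_vector, d[vector_field]), reverse=True)
    # Return copies without the vector field so the input docs are not mutated.
    return [{k: v for k, v in d.items() if k != vector_field} for d in ranked[:top_k]]

# Toy pool, standing in for the output of the cover step.
pool = [
    {"recipe_name": "Soup", "all_ingredients_vector": [0.6, 0.8]},
    {"recipe_name": "Roast", "all_ingredients_vector": [1.0, 0.0]},
    {"recipe_name": "Toast", "all_ingredients_vector": [0.0, 1.0]},
]
top = rerank_by_similarity([1.0, 0.0], pool, top_k=2)
print([d["recipe_name"] for d in top])  # ['Roast', 'Soup']
```

Reusing vectors that are already stored with each document, where available, avoids one embedding request per candidate.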

In [25]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
max_time = 45
print("ES Cover + Hybrid Search:")
print(es_cover_then_hybrid_search(ingredients, num_results=5, max_time=max_time))

ES Cover + Hybrid Search:
[{'recipe_name': 'Mediterranean Tomatoes Soup', 'cuisine_type': 'Mediterranean', 'meal_type': 'Breakfast', 'difficulty_level': 'Hard', 'prep_time_minutes': 10, 'cook_time_minutes': 22, 'servings': 1, 'main_ingredients': 'Tomatoes, Cheese, Chicken', 'all_ingredients': 'Tomatoes, Cheese, Chicken, Garlic, Milk, Soy Sauce, Sugar', 'dietary_restrictions': 'Contains dairy, Low-carb, Gluten-free', 'instructions': 'Prepare Tomatoes, Cheese, Chicken. Cook with Tomatoes, Cheese, Chicken, Garlic, Milk, Soy Sauce, Sugar. Season and serve.', 'nutritional_info': 'Calories: 216, Protein: 36g, Carbs: 40g, Fat: 7g', 'id': 255}, {'recipe_name': 'African Pasta Roast', 'cuisine_type': 'African', 'meal_type': 'Breakfast', 'difficulty_level': 'Medium', 'prep_time_minutes': 21, 'cook_time_minutes': 3, 'servings': 5, 'main_ingredients': 'Pasta, Rice, Chicken', 'all_ingredients': 'Pasta, Rice, Chicken, Butter, Flour, Garlic, Soy Sauce', 'dietary_restrictions': 'Contains shellfish', 'i

#### 6.5. Conclusion of Combined Approach (Elasticsearch)

* The ES Cover + Hybrid Search approach returns a set of recipes that are semantically relevant and diverse, but in this specific result set, it covers fewer of the user's provided ingredients compared to the Minsearch (MS) pipeline. For example, the ES results include recipes like "Mediterranean Tomatoes Soup," "African Pasta Roast," "African Lentils Tagine," and "Avocado Toast." While these are relevant and varied, some ingredients from the user's list (such as potatoes, chocolate, and fish) are not covered in the top results.

**Comparison to ms_cover_then_hybrid_search:**

* The MS Cover + Hybrid Search approach provides broader ingredient coverage across its top 5 recipes. It includes "Thai Tomatoes Power Bowl," "African Pasta Roast," "African Lentils Tagine," "African Potatoes Curry," and "Beef Tacos." This set covers more of the user's ingredients (including potatoes and fish), resulting in fewer unused ingredients overall.

While both pipelines provide relevant and diverse recipes, the Minsearch Cover + Hybrid Search approach is more effective at maximizing ingredient usage for this query, leaving fewer ingredients unused. The Elasticsearch approach is still strong and more scalable for production, but may require further tuning or a larger candidate pool to match MS's ingredient coverage in this scenario.

### 7. Rerank with LLM

Re-ranking means taking an initial set of retrieved documents (candidates) and re-ordering them, often using a more sophisticated model (like an LLM), to improve the relevance of the top results.

In [None]:
def rerank_with_llm(query, candidates, max_time=None): 
    # Apply time filter before reranking
    if max_time is not None:
        candidates = filter_by_max_time(candidates, max_time)
    context = "\n\n".join([f"Recipe: {doc['recipe_name']}\nIngredients: {doc['main_ingredients']}" for doc in candidates])
    prompt = f"""
Given the following user query and candidate recipes, rank the recipes from most to least relevant.

Query: {query}

Candidates:
{context}

Return a JSON list of recipe names in ranked order.
""".strip()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
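    content = response.choices[0].message.content
    # Extract the JSON array from the reply (the model may wrap it in prose or a code fence)
    json_match = re.search(r'\[.*\]', content, re.DOTALL)
    if not json_match:
        return candidates
    try:
        ranked_names = json.loads(json_match.group())
    except json.JSONDecodeError:
        # Fall back to the original order if the reply is not valid JSON
        return candidates
    # Map the ranked names back to the candidate documents, preserving the LLM's order
    ranked_docs = [doc for name in ranked_names for doc in candidates if doc['recipe_name'] == name]
    return ranked_docs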
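
LLM replies are not guaranteed to be clean JSON or to mention every candidate, so the parsing step is worth testing on its own. The sketch below is one possible way to handle this without calling the API: `parse_ranked_names` and `apply_ranking` are hypothetical helpers, and appending unranked candidates at the end is a design choice of this sketch rather than the notebook's behavior.

```python
import json
import re

def parse_ranked_names(content):
    """Extract a JSON list of names from an LLM reply; return [] if none can be parsed."""
    match = re.search(r"\[.*\]", content, re.DOTALL)
    if not match:
        return []
    try:
        names = json.loads(match.group())
    except json.JSONDecodeError:
        return []
    return [name for name in names if isinstance(name, str)]

def apply_ranking(candidates, ranked_names):
    """Order candidates by ranked_names; unknown names are ignored, unranked docs go last."""
    by_name = {doc["recipe_name"]: doc for doc in candidates}
    ranked = []
    for name in ranked_names:
        doc = by_name.pop(name, None)
        if doc is not None:
            ranked.append(doc)
    ranked.extend(by_name.values())  # leftovers keep their original relative order
    return ranked

candidates = [{"recipe_name": "A"}, {"recipe_name": "B"}, {"recipe_name": "C"}]
reply = 'Sure! Here is the ranking: ["B", "C", "Z"]'
ranked = apply_ranking(candidates, parse_ranked_names(reply))
print([doc["recipe_name"] for doc in ranked])  # ['B', 'C', 'A']
```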

#### 7.1. Re-rank the ms_cover_then_hybrid_search RAG Pipeline 

In [None]:
# 1. Define a query and get some candidate recipes (using any retrieval function)
query = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
candidates = ms_cover_then_hybrid_search(query, index, num_results=5)

# 2. Rerank the candidates using the LLM
reranked = rerank_with_llm(query, candidates)

# 3. Print the reranked recipe names
for doc in reranked:
    print(doc['recipe_name'])

Mediterranean Rice Skillet
Chinese Pasta Soup
Indian Tofu Cookies
French Onion Soup
Beef Tacos


In [None]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
print_unused_ingredients(ingredients, ms_cover_then_hybrid_search(ingredients, index, num_results=5))

Unused ingredients: chocolate, bell pepper, lemon


#### 7.2. Re-rank the es_cover_then_hybrid_search RAG Pipeline 

In [27]:
# 1. Define a query and get some candidate recipes (using any retrieval function)
query = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
candidates = es_cover_then_hybrid_search(query, num_results=5)

# 2. Rerank the candidates using the LLM
reranked = rerank_with_llm(query, candidates)

# 3. Print the reranked recipe names
for doc in reranked:
    print(doc['recipe_name'])

Chicken Tikka Masala
Mediterranean Rice Skillet
French Fish Soup
Chinese Pasta Soup
French Onion Soup


In [28]:
ingredients = "tomato, flour, sugar, chicken, rice, butter, chocolate, shrimp, potatoes, eggs, milk, pasta, cheese, garlic, onion, bell pepper, fish, lemon, thyme"
print_unused_ingredients(ingredients, es_cover_then_hybrid_search(ingredients, num_results=5))

Unused ingredients: lemon, tomato, bell pepper, chocolate


#### 7.3. Reranking Conclusion

The output is a list of recipe names ordered by the LLM from most to least relevant to the ingredient list. Reranking reorders the retrieved recipes rather than adding new ones, so ingredient coverage is determined by the retrieval step. Combining cover-then-hybrid retrieval with LLM reranking therefore provides both broad ingredient coverage and high relevance. If the priority is to use as many of the user's ingredients as possible, Minsearch currently performs better, with only 3 unused ingredients versus 4 for Elasticsearch. If the priority is scalability or production-ready search, Elasticsearch is preferable.

### 8. Key findings
For a dataset size of 477 rows, Minsearch is likely to outperform Elasticsearch for ingredient coverage and diversity after deduplication. For larger datasets, Elasticsearch’s strengths become more apparent.

* Minsearch is an in-memory, brute-force search that works very well on small datasets. It can efficiently maximize ingredient coverage and diversity because it can scan all recipes directly and apply greedy or hybrid logic without any retrieval bottlenecks.
* Elasticsearch is optimized for large-scale, distributed search. On small datasets, its advantages (scalability, distributed indexing, advanced query DSL) are not needed, and its retrieval is limited by the initial candidate pool and ranking heuristics.
* Deduplication exposes that ES often returns similar or duplicate recipes when the pool is small, reducing its effective diversity and coverage.

For both engines, the "cover then hybrid" approach outperforms all other methods. For small datasets, Minsearch is faster and more effective. For larger datasets, Elasticsearch is more scalable.



Stop Elasticsearch running in Docker:

`docker stop elasticsearch`

Remove the container (optional, if you want to delete it):

`docker rm elasticsearch`
