## Retrieval-Augmented Generation (RAG) for Recipe Assistant: Minsearch and Elasticsearch

This notebook demonstrates an end-to-end RAG workflow for recipe recommendations using both Minsearch (for prototyping and local development) and Elasticsearch (for scalable, production-grade search).  
It indexes and searches recipes, builds prompts from the retrieved results, and calls an OpenAI chat model to generate answers grounded in those results.

---

### Workflow Overview

* Load and preprocess recipe data
* Build a Minsearch index and perform keyword search
* Build an Elasticsearch index and perform production-grade search
* Construct prompts and call OpenAI LLM for answer generation
* Compare Minsearch and Elasticsearch for RAG
* Relation to `evaluation-data-generation.ipynb`: This notebook demonstrates the practical RAG workflow and retrieval strategies that are evaluated and compared in detail in `evaluation-data-generation.ipynb`.

In [1]:
# Install dependencies (run once) 
# %pip install minsearch elasticsearch tqdm openai python-dotenv

### 1. Load and Preprocess Recipe Data

We load the recipe dataset from CSV and prepare it for indexing.

In [2]:
import pandas as pd

# Load recipe data
df = pd.read_csv('../data/recipes_clean.csv')

# Create documents for indexing
documents = df.to_dict(orient='records')

# Show an example document
documents[0]

{'recipe_name': 'Spaghetti Carbonara',
 'cuisine_type': 'Italian',
 'meal_type': 'Dinner',
 'difficulty_level': 'Medium',
 'prep_time_minutes': 15,
 'cook_time_minutes': 20,
 'servings': 4,
 'main_ingredients': 'Spaghetti, Eggs, Pancetta, Parmesan',
 'all_ingredients': 'Spaghetti, Eggs, Pancetta, Parmesan, Black Pepper, Olive Oil, Salt',
 'dietary_restrictions': 'Contains gluten, dairy, pork',
 'instructions': 'Boil spaghetti until al dente. Fry pancetta until crispy. Whisk eggs with grated Parmesan. Toss hot spaghetti with pancetta and egg mixture off heat. Serve immediately with black pepper.',
 'nutritional_info': 'Calories: 520, Protein: 22g, Carbs: 60g, Fat: 22g'}
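
Text search indexes generally expect every text field to hold a string, while pandas represents empty CSV cells as float `NaN`. The file name suggests the data is already clean, so the step below may be unnecessary here. It is a minimal sketch of a defensive cleanup; the helper name `clean_documents` is hypothetical and not part of the notebook.

```python
import math


def clean_documents(records, text_fields):
    """Return copies of the records with missing text values replaced by ''."""
    cleaned = []
    for record in records:
        doc = dict(record)  # copy so the caller's record is not mutated
        for field in text_fields:
            value = doc.get(field)
            # pandas represents missing CSV cells as float NaN
            if value is None or (isinstance(value, float) and math.isnan(value)):
                doc[field] = ""
            else:
                doc[field] = str(value).strip()
        cleaned.append(doc)
    return cleaned


# Illustrative input only; not taken from the dataset
sample = [{"recipe_name": " Spaghetti Carbonara ", "instructions": float("nan")}]
cleaned_sample = clean_documents(
    sample, ["recipe_name", "instructions", "cuisine_type"]
)
```

In the notebook this could be applied as `documents = clean_documents(documents, text_fields)` before indexing, assuming `text_fields` lists the same text columns passed to the index.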

### 2. Minsearch: Lightweight In-Memory Search

Minsearch is a simple, fast, in-memory search library suited to prototyping, small datasets, and teaching.  
It runs inside the notebook process and needs no external server or service; installing the Python package is enough.

In [3]:
# Install minsearch if it is not already available (run once)
# %pip install minsearch

In [4]:
import minsearch

# Setup minsearch index for recipes
index = minsearch.Index(
    text_fields=['recipe_name', 'main_ingredients', 'all_ingredients', 'instructions',
                 'cuisine_type', 'dietary_restrictions'],
    keyword_fields=['meal_type', 'difficulty_level']
)

# Fit the index on the recipe documents
index.fit(documents)



<minsearch.Index at 0x7f978db49f40>

In [5]:
# Define a Minsearch search function with field boosting
def minsearch_search(query, boost=None, num_results=5):
    if boost is None:
        boost = {'main_ingredients': 4.0, 'all_ingredients': 5.0, 'instructions': 3.0,
                 'cuisine_type': 1.0, 'dietary_restrictions': 2.0}
    results = index.search(
        query=query,
        boost_dict=boost,
        num_results=num_results
    )
    return results
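
To build intuition for what the boost dictionary does, here is a toy scorer that multiplies the number of query terms found in each field by that field's boost and sums the results. This is not Minsearch's actual scoring; it is a simplified illustration, and `toy_boosted_score` is a hypothetical name.

```python
def toy_boosted_score(query, doc, boost):
    """Sum over fields of (boost * number of query terms found in the field)."""
    # Split "chicken, pasta" style queries into lowercase terms
    terms = [term.lower() for term in query.replace(",", " ").split()]
    score = 0.0
    for field, weight in boost.items():
        field_text = str(doc.get(field, "")).lower()
        # Naive substring matching, only for illustration
        matches = sum(1 for term in terms if term in field_text)
        score += weight * matches
    return score


doc = {
    "main_ingredients": "Chicken, Shrimp, Pasta",
    "all_ingredients": "Chicken, Shrimp, Pasta, Cream, Butter, Spices, Garlic",
}
boost = {"main_ingredients": 4.0, "all_ingredients": 5.0}

# main_ingredients matches 2 terms (2 * 4.0), all_ingredients matches 3 (3 * 5.0)
score = toy_boosted_score("chicken, pasta, tomato, garlic", doc, boost)
print(score)  # 23.0
```

A match in a higher-boost field moves a document up the ranking more than the same match in a lower-boost field, which is the effect the boost values in `minsearch_search` are meant to have.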

In [6]:
# Test Minsearch retrieval
query = "chicken, pasta, tomato, garlic"
minsearch_results = minsearch_search(query)
minsearch_results

[{'recipe_name': 'Middle Eastern Chicken Bowl',
  'cuisine_type': 'Middle Eastern',
  'meal_type': 'Breakfast',
  'difficulty_level': 'Medium',
  'prep_time_minutes': 79,
  'cook_time_minutes': 170,
  'servings': 4,
  'main_ingredients': 'Chicken, Shrimp, Pasta',
  'all_ingredients': 'Chicken, Shrimp, Pasta, Cream, Butter, Spices, Garlic',
  'dietary_restrictions': 'Kosher, Contains shellfish, Contains nuts',
  'instructions': 'Prepare Chicken, Shrimp, Pasta. Cook with Chicken, Shrimp, Pasta, Cream, Butter, Spices, Garlic. Season and serve.',
  'nutritional_info': 'Calories: 585, Protein: 17g, Carbs: 29g, Fat: 29g'},
 {'recipe_name': 'African Chicken Roast',
  'cuisine_type': 'African',
  'meal_type': 'Desserts',
  'difficulty_level': 'Easy',
  'prep_time_minutes': 31,
  'cook_time_minutes': 1,
  'servings': 5,
  'main_ingredients': 'Chicken, Rice, Pasta',
  'all_ingredients': 'Chicken, Rice, Pasta, Yogurt, Garlic, Chili, Onion',
  'dietary_restrictions': 'Halal',
  'instructions': 'Pr

### 3. OpenAI Integration and RAG Flow

We use OpenAI's language model to generate answers based on the context retrieved from Minsearch or Elasticsearch.

In [7]:
import dotenv
dotenv.load_dotenv(dotenv_path="../.env")  # Load OpenAI API key

from openai import OpenAI
client = OpenAI()

In [8]:
# Build prompt for the LLM using retrieved context
def build_prompt(query, search_results):
    entry_template = """
Recipe: {recipe_name}
Cuisine: {cuisine_type}
Meal Type: {meal_type}
Difficulty: {difficulty_level}
Prep Time: {prep_time_minutes} minutes
Cook Time: {cook_time_minutes} minutes
Main Ingredients: {main_ingredients}
Instructions: {instructions}
Dietary Info: {dietary_restrictions}
""".strip()
    context = "\n\n".join([entry_template.format(**doc) for doc in search_results])
    prompt_template = """
You are an expert chef and culinary assistant. Answer the question based on the content from our recipe database.
Use only the facts from the context when answering the question.

CONTEXT:
{context}

QUESTION: {question}

Provide recipe recommendations with brief explanations of why they match the requested ingredients.
If exact ingredients aren't available, suggest the closest matches and mention any substitutions needed.
""".strip()
    return prompt_template.format(context=context, question=query)
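
One caveat with `entry_template.format(**doc)`: it raises `KeyError` if a retrieved document lacks one of the template fields. The sketch below shows one possible way to fall back to a placeholder instead. The class `DefaultingDict`, the function `format_entry`, the shortened template, and the `"N/A"` placeholder are all assumptions for illustration.

```python
class DefaultingDict(dict):
    """dict that returns a placeholder for keys missing from the document."""

    def __missing__(self, key):
        return "N/A"


# Shortened version of the entry template, for illustration only
ENTRY_TEMPLATE = (
    "Recipe: {recipe_name}\n"
    "Cuisine: {cuisine_type}\n"
    "Prep Time: {prep_time_minutes} minutes"
)


def format_entry(doc, template=ENTRY_TEMPLATE):
    # str.format_map uses the mapping directly, so __missing__ is consulted
    return template.format_map(DefaultingDict(doc))


entry = format_entry({"recipe_name": "Spaghetti Carbonara", "prep_time_minutes": 15})
print(entry)
```

Whether a silent placeholder is preferable to a loud failure depends on the use case; for evaluation runs, failing fast may be the better choice.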

In [9]:
# Define LLM call function
def llm(prompt, model='gpt-4o-mini'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

In [10]:
# Define RAG pipeline using Minsearch
def rag_minsearch(query):
    search_results = minsearch_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

# Example usage
rag_minsearch(query)

'Based on the provided ingredients (chicken, pasta, tomato, garlic), here are some recipe recommendations from our database, along with explanations and necessary substitutions:\n\n1. **Middle Eastern Chicken Bowl**\n   - **Main Ingredients**: Chicken, Pasta, Garlic\n   - **Explanation**: This recipe includes chicken and pasta, and garlic as an ingredient. However, it does not contain tomato. You can easily add fresh or canned tomatoes to the dish to incorporate this ingredient.\n   - **Substitutions**: Add tomatoes at the cooking stage with the Cream and Spices to create a richer flavor profile.\n\n2. **Spanish Pasta Gazpacho**\n   - **Main Ingredients**: Pasta, Chicken, Garlic\n   - **Explanation**: This dish has pasta and chicken, and includes garlic. Though it does not specify tomatoes, classic gazpacho is typically made with tomatoes. You could incorporate chopped tomatoes into this recipe to align it more closely with your ingredient list.\n   - **Substitutions**: Add fresh or ca

### 4. Elasticsearch: Production-Grade Search Engine

Elasticsearch is a powerful, scalable, distributed search engine designed for production use, large datasets, and advanced search features.  
It requires running a server and is suitable for real-world applications where persistence, scalability, and advanced querying are needed.

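Start a single-node Elasticsearch container from a terminal:

`docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

Elasticsearch 8.x images enable security (TLS and authentication) by default, so with the command above, plain-HTTP requests to `http://localhost:9200` (as used in this notebook) are expected to fail. For local development, use this variant instead, which disables security and sets the JVM heap size: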

`docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:8.13.4`

Then check that Elasticsearch is up (the container can take a short while to start):

`curl http://localhost:9200`
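
The same readiness check can be done from Python, which is convenient when the notebook is run top to bottom right after starting the container. This is a minimal sketch using only the standard library; `http_probe` and `wait_until_ready` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:9200", timeout=2.0):
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, timed out, or an HTTP error status
        return False


def wait_until_ready(probe=http_probe, attempts=30, delay_seconds=2.0):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Calling `wait_until_ready()` before creating the client returns `True` once the server responds, or `False` after roughly a minute with the defaults above.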

In [11]:
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

# Create Elasticsearch Client (make sure Elasticsearch is running locally)
es_client = Elasticsearch('http://localhost:9200')

In [12]:
# Define Elasticsearch index settings and mappings for recipes
index_settings = {
    "settings": {
        "number_of_shards": 1,  # A shard is a unit of storage and search; more shards can improve parallelism for large datasets
        "number_of_replicas": 0  # Replicas are shard copies for fault tolerance and search throughput; 0 is fine on a single local node but not recommended for production
    },
    "mappings": {
        "properties": {
            "recipe_name": {"type": "text"},
            "main_ingredients": {"type": "text"},
            "all_ingredients": {"type": "text"},
            "instructions": {"type": "text"},
            "cuisine_type": {"type": "text"},
            "dietary_restrictions": {"type": "text"},
            "meal_type": {"type": "keyword"},
            "difficulty_level": {"type": "keyword"},
            "prep_time_minutes": {"type": "integer"},
            "cook_time_minutes": {"type": "integer"}
        }
    }
}

index_name = "recipes"

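# Create the index; if it already exists, report the error and continue
# (elasticsearch-py 8.x takes `settings` and `mappings` as keyword arguments; `body` is deprecated)
try:
    es_client.indices.create(
        index=index_name,
        settings=index_settings["settings"],
        mappings=index_settings["mappings"]
    )
except Exception as e:
    print("Index may already exist:", e)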

Index may already exist: BadRequestError(400, 'resource_already_exists_exception', 'index [recipes/YLkybRVbRMSnJpk0FiIC-w] already exists')


In [13]:
# Index documents into Elasticsearch
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/477 [00:00<?, ?it/s]
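
Because the loop above lets Elasticsearch assign a new ID on every call, re-running the cell against an existing index adds the same recipes again, which would produce duplicate hits. One possible alternative is to give each document a deterministic `_id` and send everything in one bulk request. The sketch below only builds the bulk actions; `build_bulk_actions` is a hypothetical helper, and using the list position as the ID is an assumption that holds only while the CSV order stays stable.

```python
def build_bulk_actions(documents, index_name):
    """Yield one bulk action per document, using the list position as the _id."""
    for position, doc in enumerate(documents):
        yield {
            "_index": index_name,
            "_id": str(position),  # deterministic ID: re-indexing overwrites
            "_source": doc,
        }


# Illustrative documents only
actions = list(
    build_bulk_actions(
        [{"recipe_name": "Spaghetti Carbonara"}, {"recipe_name": "Chicken Tikka Masala"}],
        "recipes",
    )
)
```

The actions can then be passed to the client's bulk helper, for example `elasticsearch.helpers.bulk(es_client, build_bulk_actions(documents, index_name))`, so that a second run overwrites existing documents rather than duplicating them.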

In [14]:
# Define Elasticsearch search function for recipes
def elasticsearch_search(query, num_results=5):
    # Field boosts (^N) mirror the Minsearch boost dictionary
    search_query = {
        "multi_match": {
            "query": query,
            "fields": [
                "recipe_name",
                "main_ingredients^4",
                "all_ingredients^5",
                "instructions^3",
                "cuisine_type",
                "dietary_restrictions^2"
            ],
            "type": "best_fields"
        }
    }
    # elasticsearch-py 8.x takes `query` and `size` as keyword arguments; `body` is deprecated
    response = es_client.search(index=index_name, query=search_query, size=num_results)
    # Return only the stored documents, without Elasticsearch metadata
    return [hit['_source'] for hit in response['hits']['hits']]
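
The `keyword` fields in the mapping (`meal_type`, `difficulty_level`) are not searched by the `multi_match` query, but they can be used for exact-match filtering. The sketch below shows one way to combine the two in a `bool` query. `build_recipe_query` is a hypothetical helper that only builds the request dictionary, so it can be inspected without a running server.

```python
def build_recipe_query(query, meal_type=None, num_results=5):
    """Build a search request: boosted multi_match plus an optional meal_type filter."""
    multi_match = {
        "multi_match": {
            "query": query,
            "fields": [
                "recipe_name",
                "main_ingredients^4",
                "all_ingredients^5",
                "instructions^3",
                "cuisine_type",
                "dietary_restrictions^2",
            ],
            "type": "best_fields",
        }
    }
    bool_query = {"must": [multi_match]}
    if meal_type is not None:
        # term filters match keyword fields exactly (case-sensitive)
        # and do not affect the relevance score
        bool_query["filter"] = [{"term": {"meal_type": meal_type}}]
    return {"size": num_results, "query": {"bool": bool_query}}


request = build_recipe_query("chicken, pasta, tomato, garlic", meal_type="Dinner")
```

The `query` and `size` entries of the returned dictionary could then be passed to `es_client.search` in the same way as in `elasticsearch_search`.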

In [15]:
# Test Elasticsearch retrieval
query = "chicken, pasta, tomato, garlic"
es_results = elasticsearch_search(query)
es_results

[{'recipe_name': 'Chicken Tikka Masala',
  'cuisine_type': 'Indian',
  'meal_type': 'Dinner',
  'difficulty_level': 'Hard',
  'prep_time_minutes': 30,
  'cook_time_minutes': 45,
  'servings': 4,
  'main_ingredients': 'Chicken, Yogurt, Tomato Sauce, Spices',
  'all_ingredients': 'Chicken, Yogurt, Tomato Paste, Cream, Onion, Garlic, Ginger, Garam Masala, Turmeric, Chili Powder, Cumin, Coriander',
  'dietary_restrictions': 'Contains dairy',
  'instructions': 'Marinate chicken in yogurt and spices. Grill until charred. Prepare sauce with onion, garlic, ginger, spices, tomato paste, and cream. Simmer chicken in sauce. Serve with rice or naan.',
  'nutritional_info': 'Calories: 600, Protein: 35g, Carbs: 45g, Fat: 28g',
  'id': 1},
 {'recipe_name': 'Chicken Tikka Masala',
  'cuisine_type': 'Indian',
  'meal_type': 'Dinner',
  'difficulty_level': 'Hard',
  'prep_time_minutes': 30,
  'cook_time_minutes': 45,
  'servings': 4,
  'main_ingredients': 'Chicken, Yogurt, Tomato Sauce, Spices',
  'all_

In [16]:
# Deduplication function to ensure unique recipes in results
def deduplicate_results(results, key='recipe_name'):
    seen = set()
    deduped = []
    for doc in results:
        val = doc.get(key)
        if val and val not in seen:
            deduped.append(doc)
            seen.add(val)
    return deduped

In [17]:
# Define RAG pipeline using Elasticsearch
def rag_elasticsearch(query):
    search_results = elasticsearch_search(query)
    search_results = deduplicate_results(search_results)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

# Example usage
rag_elasticsearch(query)

"Based on the requested ingredients of chicken, pasta, tomato, and garlic, here are the recipe recommendations from our database:\n\n1. **Spanish Pasta Gazpacho**:\n   - **Explanation**: This recipe includes chicken and pasta as main ingredients. Although it does not explicitly list tomato, the traditional gazpacho style usually incorporates tomatoes in some form, and the dish may include tomato flavors like tomato paste. Garlic is also a key ingredient in the preparation. Therefore, this recipe would align closely with your request. \n   - **Possible Adjustment**: If a more prominent tomato flavor is desired, you could add diced tomatoes or tomato paste to enhance it.\n\n2. **Middle Eastern Chicken Bowl**:\n   - **Explanation**: This recipe features chicken and includes pasta as well. Garlic is also part of the preparation. However, it does not directly list tomato as an ingredient; thus, for this recipe, you may want to incorporate fresh tomatoes or a tomato-based sauce to complement

Stop and shut down Elasticsearch running in Docker:

`docker stop elasticsearch`

Remove the container (optional, if you want to delete it):

`docker rm elasticsearch`


### 5. End-to-End RAG Workflow Summary

- **Minsearch** is ideal for prototyping, small datasets, and educational use. It is fast and easy to set up, but not suitable for large-scale or production deployments.
- **Elasticsearch** is a robust, scalable, and production-ready search engine. It supports advanced search features, persistence, and can handle large datasets and concurrent users.

**Typical RAG Workflow:**
1. Load and preprocess recipe documents.
2. Index documents using Minsearch or Elasticsearch.
3. Retrieve relevant recipes for a user query.
4. Build a prompt for the LLM using the retrieved context.
5. Generate an answer using the LLM.
6. (Optional) Compare retrieval quality and LLM answers between Minsearch and Elasticsearch.
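
Steps 3 to 5 have the same shape regardless of the search backend, so they can be written once with the retrieval function passed in as a parameter. The sketch below uses stub functions so that it runs without Minsearch, Elasticsearch, or an API key; `run_rag` and the stubs are hypothetical names, not part of the notebook.

```python
def run_rag(query, search_fn, prompt_fn, llm_fn):
    """Retrieve, build the prompt, and generate; return all intermediate results."""
    search_results = search_fn(query)
    prompt = prompt_fn(query, search_results)
    answer = llm_fn(prompt)
    return {"answer": answer, "prompt": prompt, "search_results": search_results}


# Stubs standing in for the real search, prompt, and LLM functions
def stub_search(query):
    return [{"recipe_name": "Spaghetti Carbonara"}]


def stub_prompt(query, search_results):
    names = ", ".join(doc["recipe_name"] for doc in search_results)
    return f"QUESTION: {query}\nCONTEXT: {names}"


def stub_llm(prompt):
    return f"(answer based on {len(prompt)} prompt characters)"


result = run_rag("pasta, eggs", stub_search, stub_prompt, stub_llm)
```

With the notebook's own functions, the same call would look like `run_rag(query, minsearch_search, build_prompt, llm)`. Returning the prompt and search results alongside the answer makes step 6, comparing backends, easier to do on identical queries.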

---

### Minsearch vs. Elasticsearch: When to Use Each

- **Minsearch** is best for:
  - Fast prototyping and experimentation.
  - Educational purposes and small datasets.
  - No server or infrastructure requirements.

- **Elasticsearch** is best for:
  - Production deployments and large datasets.
  - Advanced search features (filtering, scoring, aggregations).
  - Scalable, persistent, and distributed search.

---

### Relation to `rag-evaluation.ipynb`

- This notebook demonstrates the practical RAG workflow and retrieval strategies (Minsearch, Elasticsearch) that are evaluated and compared in detail in `rag-evaluation.ipynb`.
- In `rag-evaluation.ipynb`, you will find systematic evaluation of retrieval quality, diversity, and LLM answer relevance, as well as advanced retrieval strategies (e.g., hybrid search, ingredient coverage, Hit Rate and MRR).
- Use this notebook for hands-on RAG prototyping and as a foundation for further evaluation and productionization.
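
For orientation, Hit Rate and MRR are commonly defined as follows: Hit Rate is the fraction of queries for which a relevant document appears anywhere in the results, and MRR (Mean Reciprocal Rank) averages 1/rank of the first relevant result, counting 0 when none is found. The sketch below implements these common definitions; the exact formulation used in the evaluation notebook may differ, and the relevance data is made up for illustration.

```python
def hit_rate(relevance):
    """Fraction of queries with at least one relevant result."""
    return sum(1 for row in relevance if any(row)) / len(relevance)


def mrr(relevance):
    """Mean of 1/rank of the first relevant result (0 if none is relevant)."""
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance)


# One row per query; each entry marks whether the result at that rank is relevant
relevance = [
    [True, False, False, False],   # first relevant result at rank 1
    [False, True, False, False],   # rank 2
    [False, False, False, False],  # no relevant result
    [False, False, False, True],   # rank 4
]

print(hit_rate(relevance))  # 0.75
print(mrr(relevance))       # 0.4375
```

Here three of four queries have a hit (0.75), and the reciprocal ranks 1, 1/2, 0, and 1/4 average to 0.4375.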

---
