# Recipe Search with BM25 and MongoDB

This notebook connects to MongoDB to load recipe data, builds a BM25 index on the combined recipe fields (title, ingredients, and instructions), and sets up a Flask API for searching recipes and providing recommendations. The recommendation system tracks user search tags and uses the most frequent ones to drive the suggestions.

## 1. Connect to MongoDB and Load Recipes

This section connects to the local MongoDB instance and loads all recipes from the 'recipes' collection in the 'food_recipes' database.

In [10]:
from pymongo import MongoClient

# Connect to MongoDB (ensure mongod is running)
client = MongoClient("mongodb://127.0.0.1:27017/")
db = client['food_recipes']

# Load recipes from MongoDB
batch_size = 1000
recipes_collection = db['recipes']
cursor = recipes_collection.find({}).batch_size(batch_size)
recipes = []

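# The cursor yields one document per iteration; batch_size only controls
# how many documents are fetched from the server per round trip.
for recipe in cursor:
    recipes.append(recipe)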

print(f"Loaded {len(recipes)} recipes from MongoDB.")

Loaded 1567551 recipes from MongoDB.


## 2. Download NLTK Data

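NLTK's `word_tokenize` is used to split text into tokens, and it relies on the Punkt tokenizer data. This cell downloads that data if it is not already present. Recent NLTK releases look for the `punkt_tab` resource instead, so `nltk.download('punkt_tab')` may also be needed there.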

In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 3. Define the BM25 Class

The BM25 class builds an inverted index for a list of text documents and computes BM25 scores. These scores are later used to rank recipes based on the query.
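
For reference, the scoring that the class in this section implements can be written as follows. This restates the code in this notebook and is not necessarily the exact variant used by other BM25 libraries. Here $N$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, $\mathrm{tf}(t, d)$ the count of $t$ in document $d$, $|d|$ the token count of $d$, and $\mathrm{avgdl}$ the average token count over all documents.

$$
\mathrm{idf}(t) = \ln\!\left(\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5} + 1\right)
$$

$$
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t)\,\frac{\mathrm{tf}(t, d)\,(k_1 + 1)}{\mathrm{tf}(t, d) + k_1\left(1 - b + b\,\dfrac{|d|}{\mathrm{avgdl}}\right)}
$$

The sum runs over query tokens with multiplicity, because the query tokens are not deduplicated: a term repeated in the query contributes once per occurrence. The defaults are $k_1 = 1.5$ and $b = 0.75$.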

In [12]:
import math
from collections import defaultdict
from nltk.tokenize import word_tokenize

class BM25:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.documents = documents  # list of text documents
        self.k1 = k1
        self.b = b
        self.N = len(documents)
        self.doc_lengths = []
        self.avg_doc_length = 0
        self.index = defaultdict(list)  # mapping word -> list of (doc_index, term_frequency)
        self.idf = {}

        self._build_index()

    def _build_index(self):
        """Build the inverted index and compute IDF values."""
        doc_freqs = defaultdict(int)
        total_length = 0

        for i, doc in enumerate(self.documents):
            words = word_tokenize(doc.lower())
            self.doc_lengths.append(len(words))
            total_length += len(words)
            word_counts = defaultdict(int)
            for word in words:
                word_counts[word] += 1
            for word, tf in word_counts.items():
                doc_freqs[word] += 1
                self.index[word].append((i, tf))

        self.avg_doc_length = total_length / self.N if self.N > 0 else 0

        for word, df in doc_freqs.items():
            self.idf[word] = math.log((self.N - df + 0.5) / (df + 0.5) + 1)

    def search(self, query, top_n=5):
        """Score the documents given a query and return the top_n document indices and scores."""
        query_words = word_tokenize(query.lower())
        scores = defaultdict(float)

        for word in query_words:
            if word not in self.index:
                continue

            idf = self.idf.get(word, 0)
            for doc_id, tf in self.index[word]:
                dl = self.doc_lengths[doc_id]
                score = idf * ((tf * (self.k1 + 1)) /
                               (tf + self.k1 * (1 - self.b + self.b * (dl / self.avg_doc_length))))
                scores[doc_id] += score

        return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]

    def save(self, index_file="bm25_index.pkl"):
        import pickle
        with open(index_file, "wb") as f:
            pickle.dump({
                "index": self.index,
                "doc_lengths": self.doc_lengths,
                "avg_doc_length": self.avg_doc_length,
                "idf": self.idf,
                "N": self.N,
                "documents": self.documents,
                "k1": self.k1,
                "b": self.b,
            }, f)

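    @classmethod
    def load(cls, index_file="bm25_index.pkl"):
        """Restore a saved index without re-tokenizing the documents."""
        import pickle
        # Only unpickle files you created yourself: pickle can run arbitrary code.
        with open(index_file, "rb") as f:
            data = pickle.load(f)
        # Bypass __init__, which would rebuild the whole index from scratch
        # and defeat the purpose of saving it.
        bm25 = cls.__new__(cls)
        bm25.documents = data["documents"]
        bm25.k1 = data["k1"]
        bm25.b = data["b"]
        bm25.index = data["index"]
        bm25.doc_lengths = data["doc_lengths"]
        bm25.avg_doc_length = data["avg_doc_length"]
        bm25.idf = data["idf"]
        bm25.N = data["N"]
        return bm25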
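
To make the scoring concrete, here is a minimal, dependency-free sketch that applies the same IDF and term-frequency formulas as the `BM25` class to a three-document toy corpus. The function `bm25_scores` and the `toy_docs` list are hypothetical names introduced only for this illustration, and plain whitespace splitting stands in for `word_tokenize`, so token boundaries can differ from the real index. It assumes a non-empty corpus.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Return one BM25 score per document (illustrative sketch only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = [0.0] * n_docs
    for term in query.lower().split():
        if term not in doc_freq:
            continue  # unseen terms contribute nothing
        df = doc_freq[term]
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        for i, toks in enumerate(tokenized):
            tf = toks.count(term)
            if tf == 0:
                continue
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            scores[i] += idf * tf * (k1 + 1) / norm
    return scores

toy_docs = [
    "garlic butter shrimp",
    "butter chicken with rice and butter sauce",
    "green salad",
]
toy_scores = bm25_scores(toy_docs, "butter")
print(toy_scores)
```

The third document never mentions "butter" and scores zero. The second document mentions it twice, which in this toy corpus is enough to outweigh the penalty for its greater length.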


## 4. Build BM25 Documents

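We concatenate each recipe's title (`Name`), ingredients (`RecipeIngredientParts`), and instructions (`RecipeInstructions`) into one text document. This combined text is what the BM25 index will use. Each recipe's `RecipeId` is recorded at the same list position so that search results can be mapped back to recipes.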

In [None]:
documents = []
recipe_ids = []
for recipe in recipes:
    name = recipe.get('Name', '')
    ingredients = ' '.join(recipe.get('RecipeIngredientParts', []))
    instructions = ' '.join(recipe.get('RecipeInstructions', []))
    text = "{} {} {}".format(name, ingredients, instructions)
    documents.append(text)
    recipe_ids.append(recipe.get('RecipeId'))
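
The cell above assumes that `RecipeIngredientParts` and `RecipeInstructions` are always lists of strings. If a field is stored as a plain string, `' '.join(...)` inserts a space between every character, and if it is `None`, the join raises a `TypeError`. The following is a minimal sketch of a more defensive way to build the text. `field_to_text` and `recipe_to_document` are hypothetical helpers, not part of the original notebook, and the actual field types in the dataset are not known here.

```python
def field_to_text(value):
    """Return a single string for a field that may be None, a string, or a list."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, (list, tuple)):
        # Join list items, skipping missing entries.
        return " ".join(str(item).strip() for item in value if item is not None)
    return str(value)

def recipe_to_document(recipe):
    """Concatenate the three text fields, ignoring any that are empty."""
    keys = ("Name", "RecipeIngredientParts", "RecipeInstructions")
    parts = [field_to_text(recipe.get(key)) for key in keys]
    return " ".join(part for part in parts if part)

example = {
    "Name": "Soup",
    "RecipeIngredientParts": ["salt", "water"],
    "RecipeInstructions": "Boil.",
}
print(recipe_to_document(example))
```

With helpers like these, the loop body could reduce to `documents.append(recipe_to_document(recipe))`.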


## 5. Build the BM25 Index

We create a BM25 index from the combined documents. This index is used later to score and rank recipes based on a search query.

In [None]:
bm25_index = BM25(documents)
print("BM25 index built.")

BM25 index built.


## 6. Define the Recipe Search Function

This function takes a query, uses the BM25 index to get the best matching documents, maps these back to the full recipe data from MongoDB, and appends the BM25 score to each recipe.

In [None]:
def search_recipes(query, top_n=5):
    """Search for recipes using BM25 and return matching recipes with scores."""
    # Get BM25 results: a list of (document index, score) pairs, best first
    results = bm25_index.search(query, top_n=top_n)

    # Map document indices to the RecipeId values stored in recipe_ids
    score_by_id = {recipe_ids[doc_idx]: score for doc_idx, score in results}

    # Retrieve the full recipe documents. recipe_ids holds RecipeId values,
    # so the lookup must use that field rather than _id.
    matched_recipes = list(
        recipes_collection.find({"RecipeId": {"$in": list(score_by_id)}})
    )

    # Add the BM25 score and convert the ObjectId so the result can be JSON-encoded
    for recipe in matched_recipes:
        recipe['bm25_score'] = score_by_id[recipe['RecipeId']]
        recipe['_id'] = str(recipe['_id'])

    # $in does not return documents in ranking order, so sort by score again
    matched_recipes.sort(key=lambda r: r['bm25_score'], reverse=True)
    return matched_recipes


## 7. Test the Search Function

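We run a test search using the query "salt" and print the top BM25-scored recipes with their scores, followed by a sample of the BM25 vocabulary for inspection. In the run recorded below, both the result list and the vocabulary are empty, and `documents[0]` prints as blank. No tokens were indexed, most likely because the field names used in Section 4 do not match the recipe schema in this collection.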

In [None]:
query = "salt"
results = search_recipes(query, top_n=5)
print(f"Search results for query '{query}':")
for res in results:
    title = res.get('Name', 'No Title')  # use 'Name' instead of 'title'
    score = res.get('bm25_score', 0)
    print(f"{title} - BM25 Score: {score:.3f}")

# Debug: Print a sample of the BM25 vocabulary tokens
print("\nSample BM25 vocabulary tokens:", list(bm25_index.idf.keys())[:20])


Search results for query 'salt':

Sample BM25 vocabulary tokens: []
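
An empty vocabulary means every document was empty when the index was built. One quick way to investigate is to count how often the expected fields are actually populated and to list which other keys the records carry. The sketch below assumes nothing about the real schema: `summarize_fields` is a hypothetical helper, and the sample records use made-up keys (`title`, `ingredients`) purely to show the output shape.

```python
from collections import Counter

def summarize_fields(records, expected_keys):
    """Count non-empty expected fields and tally any other keys that appear."""
    present = Counter()
    other_keys = Counter()
    for record in records:
        for key in expected_keys:
            if record.get(key):  # missing, None, '' and [] all count as empty
                present[key] += 1
        for key in record:
            if key not in expected_keys:
                other_keys[key] += 1
    return dict(present), dict(other_keys)

# Made-up records for illustration; real ones would come from `recipes`.
sample_records = [
    {"title": "Soup", "ingredients": ["salt"]},
    {"Name": "Stew"},
]
expected = ("Name", "RecipeIngredientParts", "RecipeInstructions")
present, other_keys = summarize_fields(sample_records, expected)
print(present)
print(other_keys)
```

Running `summarize_fields(recipes[:1000], expected)` on the loaded data would show whether the expected keys exist at all, and which keys to use instead if they do not.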


In [20]:
print(documents[0])

  


## 8. Create Flask API for Recipe Search and Recommendations

This section sets up a basic Flask API with two endpoints:

- **/search**: Accepts a query and an optional user_id. It runs the BM25 search and updates an in-memory tag store based on the query.
- **/recommendations**: Uses the tracked search tags for a user to build a query from their top tags and returns recommended recipes.

A helper function `update_user_tags` tokenizes the query and updates the user's tag counts.

In [19]:
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory store for tracking user search tags
user_tags = {}

def update_user_tags(user_id, query):
    # Tokenize the query and update tag counts for the user
    tags = word_tokenize(query.lower())
    if user_id not in user_tags:
        user_tags[user_id] = {}
    for tag in tags:
        user_tags[user_id][tag] = user_tags[user_id].get(tag, 0) + 1

@app.route('/search', methods=['GET'])
def search():
    query = request.args.get('query', '').strip()
    user_id = request.args.get('user_id', 'anonymous')
    if not query:
        # Nothing to search for: return no results and leave the tag store untouched
        return jsonify([])
    update_user_tags(user_id, query)
    results = search_recipes(query, top_n=5)
    return jsonify(results)

@app.route('/recommendations', methods=['GET'])
def recommendations():
    user_id = request.args.get('user_id', 'anonymous')
    # Use the tracked tags to form a recommendation query
    tags = user_tags.get(user_id, {})
    if not tags:
        return jsonify({"message": "No search history available for recommendations."})
    # Sort tags by frequency and take the top 3
    sorted_tags = sorted(tags.items(), key=lambda x: x[1], reverse=True)
    top_tags = " ".join([tag for tag, count in sorted_tags[:3]])
    results = search_recipes(top_tags, top_n=5)
    return jsonify(results)

if __name__ == '__main__':
    app.run(debug=False)

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [01/Mar/2025 15:45:43] "GET / HTTP/1.1" 404 -
127.0.0.1 - - [01/Mar/2025 15:45:53] "GET /search HTTP/1.1" 200 -
127.0.0.1 - - [01/Mar/2025 15:45:58] "GET /search?query=egg HTTP/1.1" 200 -
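
The tag bookkeeping behind `/recommendations` can be tried out without starting the server. The sketch below reproduces the same idea with `collections.Counter`: count query tokens per user, then join the most frequent ones into a new query. The names `tag_counts`, `record_query`, and `recommendation_query` are hypothetical stand-ins for `user_tags`, `update_user_tags`, and the logic inside `recommendations`, and whitespace splitting replaces `word_tokenize` to keep the example dependency-free.

```python
from collections import Counter, defaultdict

# One Counter of tag frequencies per user (stand-in for user_tags).
tag_counts = defaultdict(Counter)

def record_query(user_id, query):
    """Add every token of the query to the user's tag counts."""
    tag_counts[user_id].update(query.lower().split())

def recommendation_query(user_id, top_k=3):
    """Build a search query from the user's top_k most frequent tags."""
    return " ".join(tag for tag, _ in tag_counts[user_id].most_common(top_k))

record_query("u1", "egg salad")
record_query("u1", "egg curry")
record_query("u1", "Egg salad sandwich")
print(recommendation_query("u1", top_k=2))
```

A user with no history yields an empty string, which corresponds to the "No search history available" branch of the endpoint. Like the original `user_tags` dictionary, this store lives in process memory and is lost when the server restarts.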
