# Recipe Search with BM25 and MongoDB

This notebook connects to a MongoDB instance where the recipes have already been loaded, builds a BM25 index over the combined text fields (recipe title, ingredients, and instructions), and defines a search function that retrieves the recipes matching a query.

## 1. Connect to MongoDB and Load Recipes

Make sure that your MongoDB server (mongod) is running. Adjust the database/collection names if needed.

In [1]:
from pymongo import MongoClient

# Connect to MongoDB (ensure mongod is running)
client = MongoClient("mongodb://127.0.0.1:27017/")
db = client['food_recipes']

# Load recipes from MongoDB
recipes_collection = db['recipes']
recipes = list(recipes_collection.find({}))
print(f"Loaded {len(recipes)} recipes from MongoDB.")

Loaded 522517 recipes from MongoDB.


## 2. Download NLTK Data

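The BM25 class tokenizes text with NLTK’s `word_tokenize`, which needs the `punkt` tokenizer data. You can skip this cell if the data is already downloaded. Recent NLTK releases may ask for the `punkt_tab` resource instead; if `word_tokenize` raises a `LookupError`, download the resource named in the error message.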

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 3. Define the BM25 Class

This class builds an inverted index and computes BM25 scores.
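
For reference, the scoring rule that the `search` method below implements can be written as follows. The symbols are notation introduced here rather than names from the code: $f(t, d)$ is the frequency of token $t$ in document $d$, $|d|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$. The class defaults are $k_1 = 1.5$ and $b = 0.75$.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1\right)
```

The `+ 1` inside the logarithm keeps the IDF positive even for tokens that appear in more than half of the documents.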

In [3]:
import math
from collections import defaultdict
from nltk.tokenize import word_tokenize

class BM25:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.documents = documents  # list of text documents
        self.k1 = k1
        self.b = b
        self.N = len(documents)
        self.doc_lengths = []
        self.avg_doc_length = 0
        self.index = defaultdict(list)  # mapping word -> list of (doc_index, term_frequency)
        self.idf = {}

        self._build_index()

    def _build_index(self):
        """Build the inverted index and compute IDF values."""
        doc_freqs = defaultdict(int)
        total_length = 0

        for i, doc in enumerate(self.documents):
            words = word_tokenize(doc.lower())
            self.doc_lengths.append(len(words))
            total_length += len(words)
            word_counts = defaultdict(int)
            for word in words:
                word_counts[word] += 1
            for word, tf in word_counts.items():
                doc_freqs[word] += 1
                self.index[word].append((i, tf))

        self.avg_doc_length = total_length / self.N if self.N > 0 else 0

        for word, df in doc_freqs.items():
            self.idf[word] = math.log((self.N - df + 0.5) / (df + 0.5) + 1)

    def search(self, query, top_n=5):
        """Score the documents given a query and return the top_n document indices and scores."""
        query_words = word_tokenize(query.lower())
        scores = defaultdict(float)

        for word in query_words:
            if word not in self.index:
                continue

            idf = self.idf.get(word, 0)
            for doc_id, tf in self.index[word]:
                dl = self.doc_lengths[doc_id]
                score = idf * ((tf * (self.k1 + 1)) /
                               (tf + self.k1 * (1 - self.b + self.b * (dl / self.avg_doc_length))))
                scores[doc_id] += score

        return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]

    def save(self, index_file="bm25_index.pkl"):
        import pickle
        with open(index_file, "wb") as f:
            pickle.dump({
                "index": self.index,
                "doc_lengths": self.doc_lengths,
                "avg_doc_length": self.avg_doc_length,
                "idf": self.idf,
                "N": self.N,
                "documents": self.documents,
                "k1": self.k1,
                "b": self.b,
            }, f)

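    @classmethod
    def load(cls, index_file="bm25_index.pkl"):
        """Restore a saved index without re-tokenizing the documents."""
        import pickle
        with open(index_file, "rb") as f:
            data = pickle.load(f)  # only unpickle files you trust
        # Bypass __init__, which would rebuild the whole index from scratch
        # before the saved values overwrite it.
        bm25 = cls.__new__(cls)
        bm25.documents = data["documents"]
        bm25.k1 = data["k1"]
        bm25.b = data["b"]
        bm25.index = data["index"]
        bm25.doc_lengths = data["doc_lengths"]
        bm25.avg_doc_length = data["avg_doc_length"]
        bm25.idf = data["idf"]
        bm25.N = data["N"]
        return bm25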


## 4. Build BM25 Documents

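Concatenate the recipe fields so that BM25 can index the combined text. Adjust the keys to match your collection's schema: a key that is missing from a document contributes an empty string, so if none of the keys match, every document is empty and nothing gets indexed.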

In [4]:
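# Concatenate title, ingredients, and instructions for each recipe
def as_text(value):
    """Return a field as plain text, joining list values with spaces."""
    if isinstance(value, (list, tuple)):
        return " ".join(str(item) for item in value)
    return "" if value is None else str(value)

documents = []
recipe_ids = []
for recipe in recipes:
    text = " ".join(as_text(recipe.get(key)) for key in ('title', 'ingredients', 'instructions'))
    documents.append(text)
    recipe_ids.append(recipe['_id'])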

print(f"Built BM25 documents for {len(documents)} recipes.")

Built BM25 documents for 522517 recipes.
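
Before indexing, it can help to check that the chosen keys actually exist in the data. The sketch below is one way to do that, under assumptions: `field_coverage` is a hypothetical helper, and the `sample` records (including the `Name` field) are made up to stand in for `recipes`; they do not reflect the real schema.

```python
def field_coverage(records, keys):
    """Count, for each key, how many records hold a non-empty value."""
    coverage = {key: 0 for key in keys}
    for record in records:
        for key in keys:
            value = record.get(key)
            # Treat missing keys, None, empty strings, and empty lists as absent.
            if value not in (None, "", [], ()):
                coverage[key] += 1
    return coverage

# Hypothetical sample standing in for `recipes`; real field names may differ.
sample = [
    {"_id": 1, "Name": "Tomato Sauce", "ingredients": ["tomato", "salt"]},
    {"_id": 2, "Name": "Plain Rice", "ingredients": []},
]

coverage = field_coverage(sample, ["title", "ingredients", "instructions"])
print(coverage)

# Listing the keys that do occur shows which names to use instead.
all_keys = sorted({key for record in sample for key in record})
print(all_keys)
```

In the notebook, calling `field_coverage(recipes[:1000], ['title', 'ingredients', 'instructions'])` should show at a glance whether any of the three keys is missing from the collection. A count of zero for every key means the concatenated documents will be empty.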


## 5. Build the BM25 Index

In [5]:
bm25_index = BM25(documents)
print("BM25 index built.")

BM25 index built.
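
To see what the index contains, here is a minimal sketch of the same idea on a three-document toy corpus. It assumes `str.split()` as a stand-in for `word_tokenize` so that it runs without NLTK, and the corpus and the `bm25_scores` helper are illustrative only; the `k1` and `b` values are the class defaults.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; str.split() stands in for NLTK's word_tokenize here.
docs = ["tomato sauce with basil", "rice with salt", "tomato tomato soup"]
k1, b = 1.5, 0.75

tokenized = [doc.lower().split() for doc in docs]
doc_lengths = [len(tokens) for tokens in tokenized]
avg_len = sum(doc_lengths) / len(docs)

# Inverted index: term -> list of (doc_index, term_frequency)
index = defaultdict(list)
for i, tokens in enumerate(tokenized):
    for term, tf in Counter(tokens).items():
        index[term].append((i, tf))

def bm25_scores(query):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, [])
        if not postings:
            continue  # term never seen in the corpus
        df = len(postings)
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
        for doc_id, tf in postings:
            # Length normalization: longer-than-average documents are penalized.
            norm = 1 - b + b * doc_lengths[doc_id] / avg_len
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(index["tomato"])   # postings for one term
ranked = bm25_scores("tomato")
print(ranked)
```

For the query `"tomato"`, document 2 ranks above document 0 because it contains the term twice and is shorter than average. A query made only of unseen terms returns an empty list, which is the same symptom an empty index produces.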


## 6. Define the Recipe Search Function

This function takes a query, retrieves the top BM25-scored document indices, and then maps them back to the corresponding MongoDB recipes. The BM25 score is added to each recipe for inspection.

In [6]:
def search_recipes(query, top_n=5):
    """Search for recipes using BM25 and return matching recipes with scores."""
    # Get BM25 results: a list of (document index, score) pairs
    results = bm25_index.search(query, top_n=top_n)

    # Map BM25 results back to MongoDB _id values, keeping the score for each
    scores_by_id = {recipe_ids[doc_idx]: score for doc_idx, score in results}

    # Retrieve the full recipe documents ($in does not preserve ranking order)
    matched_recipes = list(recipes_collection.find({"_id": {"$in": list(scores_by_id)}}))

    # Attach the BM25 score to each recipe and restore descending-score order
    for recipe in matched_recipes:
        recipe['bm25_score'] = scores_by_id[recipe['_id']]
    matched_recipes.sort(key=lambda r: r['bm25_score'], reverse=True)
    return matched_recipes


## 7. Test the Search Function

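Try a sample query. If the results are empty, try a more common word (like `'salt'`) and inspect the BM25 vocabulary. An empty vocabulary means no tokens were indexed at all, which usually indicates that the field keys used in step 4 do not match the documents in the collection.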

In [9]:
query = "sauce"
results = search_recipes(query, top_n=5)
print(f"Search results for query '{query}':")
for res in results:
    title = res.get('title', 'No Title')
    score = res.get('bm25_score', 0)
    print(f"{title} - BM25 Score: {score:.3f}")

# Debugging: Print a sample of the BM25 vocabulary tokens
print("\nSample BM25 vocabulary tokens:", list(bm25_index.idf.keys())[:20])

Search results for query 'sauce':

Sample BM25 vocabulary tokens: []
