<a href="https://colab.research.google.com/github/sonalvrshny/IR23-MRRS/blob/sonal-search-queries/Project-2_changes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install libraries for the Multilingual Recipe Retrieval System (MRRS).
These libraries are essential for various tasks such as text preprocessing, language detection,
query translation, similarity search, and other natural language processing tasks.

In [96]:
pip install nltk spacy tensorflow torch

In [95]:
!pip install sentence_transformers
!pip install faiss-cpu

In [99]:
!pip install googletrans==4.0.0-rc1
!pip install langid
!pip install stanza

In [101]:
# For english
!python -m spacy download en_core_web_sm

# For spanish
!python -m spacy download es_core_news_sm

# For hindi
!python -m spacy download xx_ent_wiki_sm

Prepare the NLP language models for English, Spanish and Hindi using SpaCy and stanza.

In [132]:
import spacy

# Loading and setting up language models
nlp_en = spacy.load("en_core_web_sm")  # Multilingual model for English
nlp_es = spacy.load("es_core_news_sm") # Multilingual model for Spanish
nlp_hi = spacy.load("xx_ent_wiki_sm")  # Multilingual model for Hindi

In [133]:
# downloading the hindi language model using stanza
import stanza
stanza.download('hi')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json:   0%|   …

2023-12-13 00:01:06 INFO: Downloading default packages for language: hi (Hindi) ...
2023-12-13 00:01:06 INFO: File exists: C:\Users\sonal\stanza_resources\hi\default.zip
2023-12-13 00:01:09 INFO: Finished downloading models and saved to C:\Users\sonal\stanza_resources.


In [134]:
# Set up the Hindi NLP pipeline with only the necessary processors
nlp_hi = stanza.Pipeline(lang='hi', processors='tokenize,pos,lemma') 
def parse_hindi_query(query):
    doc = nlp_hi(query)
    return [word.lemma for sent in doc.sentences for word in sent.words]

2023-12-13 00:01:12 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json:   0%|   …

2023-12-13 00:01:14 INFO: Loading these models for language: hi (Hindi):
| Processor | Package       |
-----------------------------
| tokenize  | hdtb          |
| pos       | hdtb_charlm   |
| lemma     | hdtb_nocharlm |

2023-12-13 00:01:14 INFO: Using device: cpu
2023-12-13 00:01:14 INFO: Loading: tokenize
2023-12-13 00:01:14 INFO: Loading: pos
2023-12-13 00:01:15 INFO: Loading: lemma
2023-12-13 00:01:15 INFO: Done loading processors!


Import the langid library and set the supported languages to Spanish ('es'), English ('en'), and Hindi ('hi'). The function detect_language(query) takes a text query as input and attempts to classify its language using langid.classify(). If successful, the function returns the detected language.

In [135]:
import langid

langid.set_languages(['es', 'en', 'hi'])
def detect_language(query):
  try:
    lang, _ = langid.classify(query)
    return lang
  except Exception as e:
    return "Error: " + str(e)

Define a function process_query(query) that processes a given text query. It first uses the previously defined detect_language() function to determine the language of the query. Depending on the detected language ('en' for English, 'es' for Spanish, 'hi' for Hindi), the query is processed using the appropriate natural language processing (NLP) model.

For each processed query, the function extracts keywords by lemmatizing the non-stopword, alphabetic tokens. The result is a dictionary containing the detected language and a list of keywords.

In [136]:
def process_query(query):
    # check the detected language and process the query accordingly
    lang = detect_language(query)
    if lang == 'hi':
        # For Hindi, use the Stanza-based parser
        keywords = parse_hindi_query(query)
    else:
        # For other languages, continue using spaCy
        if lang == 'en':
            doc = nlp_en(query)
        elif lang == 'es':
            doc = nlp_es(query)
        else:
            return "Unsupported language"
        keywords = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

    return {"language": lang, "keywords": keywords}

Some sample queries to show how each query is processed in different languages

In [137]:
sample_queries = ["How to make traditional Mexican guacamole?", "Receta de paella de mariscos", "बर्गर चिकन की रेसिपी"]

for query in sample_queries:
    result = process_query(query)
    print(f"Query: {query}\nResult: {result}\n")

Query: How to make traditional Mexican guacamole?
Result: {'language': 'en', 'keywords': ['traditional', 'mexican', 'guacamole']}

Query: Receta de paella de mariscos
Result: {'language': 'es', 'keywords': ['Receta', 'paella', 'marisco']}

Query: बर्गर चिकन की रेसिपी
Result: {'language': 'hi', 'keywords': ['बर्गर', 'चिकन', 'का', 'रेसिपी']}



Define functions to translate a user query into Hindi, Spanish, and English using the Google Translate API (googletrans). The get_translated_queries function takes a user query, detects its language using the detect_language function, and then generates translated queries in the other two languages. If the language is not detected, it translates the query into all three languages.

In [138]:
from googletrans import Translator

# translate a query into a specified target language
def translate_query(query, target_language):
    translator = Translator()
    translation = translator.translate(query, dest=target_language)
    return translation.text

# get translated queries for a given user query in all the languages
def get_translated_queries(user_query):
    lang = detect_language(user_query)
    if lang == "hi":
        translated_queries = {
                'hindi': user_query,
                'spanish': translate_query(user_query, 'es'),
                'english': translate_query(user_query, 'en')
            }
    elif lang == "es":
        translated_queries = {
                'hindi': translate_query(user_query, 'hi'),
                'spanish': user_query,
                'english': translate_query(user_query, 'en')
            }
    elif lang == "en":
        translated_queries = {
                'hindi': translate_query(user_query, 'hi'),
                'spanish': translate_query(user_query, 'es'),
                'english': user_query
            }
    else : # added this else, if there is a scenario of no language detected, translate query into all lang
        translated_queries = {
                'hindi': translate_query(user_query, 'hi'),
                'spanish': translate_query(user_query, 'es'),
                'english': translate_query(user_query, 'en')
        }

    return translated_queries

# example usage
user_queries = ["How to make traditional Mexican guacamole?", "Receta de paella de mariscos", "बटर चिकन की रेसिपी"]
for q in user_queries:
    print(get_translated_queries(q))

{'hindi': 'पारंपरिक मैक्सिकन गुआकामोल कैसे बनाएं?', 'spanish': '¿Cómo hacer guacamole mexicano tradicional?', 'english': 'How to make traditional Mexican guacamole?'}
{'hindi': 'सीफूड पेला नुस्खा', 'spanish': 'Receta de paella de mariscos', 'english': 'Seafood Paella Recipe'}
{'hindi': 'बटर चिकन की रेसिपी', 'spanish': 'Receta de pollo con mantequilla', 'english': 'Butter chicken recipe'}


Load the recipe database into a list called recipes, which can be used for searching

In [139]:
import json

# Load recipe JSON data
with open('recipes.json', 'r') as file:
    recipes = json.load(file)

The sentence_transformers library can be used to load a pre-trained model (paraphrase-MiniLM-L6-v2) for generating sentence embeddings. The function embed_recipe, which combines the name, ingredients, and instructions of a recipe, then embeds the resulting text using the sentence embedding model.

In [140]:
from sentence_transformers import SentenceTransformer

# load pre-trained sentence embedding model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# generate embeddings for any given text and model
def embed_text(text, model):
    return model.encode(text)

# generate embeddings for a recipe
def embed_recipe(recipe, model):
    # combine recipe components into a single text
    combined_text = f"{recipe['recipeName']} {' '.join(recipe['ingredients'])} {' '.join(recipe['instruction'])}"
    return embed_text(combined_text, model)

Employ the FAISS library to create an index for efficiently searching through recipe embeddings. The faiss library is designed for similarity search and clustering of dense vectors.

In [141]:
import faiss
import numpy as np

# embedding recipes and building FAISS index
database_embeddings = [embed_recipe(recipe, model) for recipe in recipes]
dim = len(database_embeddings[0])

# initialize a flat L2 index with the specified dimension
index = faiss.IndexFlatL2(dim)

# add the database embeddings to the FAISS index
index.add(np.array(database_embeddings).astype('float32'))

Query the FAISS index to retrieve the top-k most similar recipes to a given query embedding

In [142]:
# Function to search the database
def search_database(query_embedding, index, database, k=5):
    # perform a search using the FAISS index to find the k most similar recipes to the query_embedding
    _, indices = index.search(np.array([query_embedding]).astype('float32'), k)

    # retrieve the corresponding recipes from the database based on the indices
    return [database[i] for i in indices[0]]

Define several functions to calculate different aspects of a recipe's similarity or alignment with a user's query. The calculate_ingredient_similarity, calculate_recipe_alignment, and calculate_score functions use embeddings generated by the model to compute various similarity scores.

In [143]:
from sentence_transformers import util

# Calculates ingredient similarity between recipe ingredients and query
def calculate_ingredient_similarity(recipe_ingredients, query_embedding, model):
    recipe_ingredients_str = ' '.join(recipe_ingredients)
    recipe_emb = model.encode(recipe_ingredients_str, convert_to_tensor=True)

    # Calculate cosine similarity between recipe and query embeddings
    similarity_score = util.pytorch_cos_sim(recipe_emb, query_embedding).item()
    return similarity_score

# Calculates similarity between recipe instructions and query
def calculate_recipe_alignment(recipe_instructions, query_embedding, model):
    recipe_instructions_str = ' '.join(recipe_instructions)
    recipe_emb = model.encode(recipe_instructions_str, convert_to_tensor=True)

    # Calculate cosine similarity between recipe instructions and query embeddings
    alignment_score = util.pytorch_cos_sim(recipe_emb, query_embedding).item()
    return alignment_score

# Calculates the overall score for a recipe based on various components
def calculate_score(recipe, query_embedding, model):
    recipe_embedding = embed_recipe(recipe, model)
    
    ingredient_similarity = calculate_ingredient_similarity(recipe['ingredients'], query_embedding, model)
    recipe_alignment = calculate_recipe_alignment(recipe['instruction'], query_embedding, model)
    semantic_similarity = util.pytorch_cos_sim(recipe_embedding, query_embedding).item()
    language_match = 1 if detect_language(recipe['recipeName']) == detect_language(query_embedding) else 0
    
    # Combine all scores with adjusted weights
    score = (semantic_similarity * 0.5) + (ingredient_similarity * 0.2) + (language_match * 0.2) + (recipe_alignment * 0.1)
    return score


Finally, use all the above functions to search the database for similar recipes and score them based on the given aspects. Return a sorted top k result list in each language

In [144]:
from collections import defaultdict

def search_and_score_recipes(query):
    translated_queries = get_translated_queries(query)

    # Score the search results
    scored_results = defaultdict(list)
    for lang, translated_query in translated_queries.items():
        query_embedding = embed_text(translated_query, model)
        search_results = search_database(query_embedding, index, recipes)

        # score each recipe based on its similarity to the translated query
        for recipe in search_results:
            score = calculate_score(recipe, query_embedding, model)
            scored_results[lang].append((recipe, score))

    # sort each result in the dictionary
    for lang, results in scored_results.items():
        scored_results[lang] = sorted(results, key=lambda x: x[1], reverse=True)
    return scored_results

Here is an example of how the results would be displayed. The example query here is "Chicken Curry" asked in Hindi

In [145]:
user_query = "चिकन करी" # chicken curry in Hindi
results = search_and_score_recipes(user_query)

for lang, result in results.items():
    print(f"===== Results for {lang.upper()} =====")
    for idx, (recipe, score) in enumerate(result, 1):
        print(f"{idx}. Recipe: {recipe['recipeName']}\n   Score: {score}\n")
    print("-" * 40)


===== Results for HINDI =====
1. Recipe: रिसोटो
   Score: 0.5913658738136292

2. Recipe: टाकोस
   Score: 0.5747588038444519

3. Recipe: चिकन करी
   Score: 0.5744103252887726

4. Recipe: गजपाचो
   Score: 0.5739733040332795

5. Recipe: बिरयानी
   Score: 0.5623327314853668

----------------------------------------
===== Results for SPANISH =====
1. Recipe: Pollo al Curry
   Score: 0.2990070521831512

2. Recipe: Burger
   Score: 0.28818392157554623

3. Recipe: Biryani
   Score: 0.2845807015895844

4. Recipe: Paneer Tikka
   Score: 0.28277455568313603

5. Recipe: Paella
   Score: 0.28202427029609684

----------------------------------------
===== Results for ENGLISH =====
1. Recipe: Chicken Curry
   Score: 0.49066288471221925

2. Recipe: Paella
   Score: 0.39084060192108155

3. Recipe: Biryani
   Score: 0.38794624507427217

4. Recipe: Tacos
   Score: 0.3610280334949494

5. Recipe: Ramen
   Score: 0.3151261299848557

----------------------------------------
