# 🧪 Qwen 0.5B Exploration – Semantic Recipe Retrieval

This notebook explores the use of the **Qwen2.5-0.5B-Instruct** model as a lightweight baseline for rephrasing recipe-related questions into structured JSON queries. It also integrates FAISS-based semantic search over recipe text embeddings to retrieve relevant results.

---

## 🧰 Workflow Summary

### **1. Dataset Loading and Cleanup**
- The recipe dataset (`filtered_recipes.csv`) is loaded and inspected.
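- The stringified `steps` lists are parsed with `yaml.safe_load`; rows that fail are retried with a second pass that escapes embedded double quotes.
- Rows that still fail to parse (592 of 134,746) are dropped and the dataset is saved to `parsed_recipes.csv`.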

### **2. String Field Restoration**
- After reloading the parsed dataset, stringified columns like `ingredients`, `ingredients_raw`, `amount_gram`, `amounts`, and `tags` are converted back into Python list/dict structures using `ast.literal_eval`.
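
As a minimal sketch of why this restoration step is needed: writing a list to CSV stores its string representation, so reading it back yields a plain string. The toy row below is hypothetical and uses the standard-library `csv` module instead of pandas; only `ast.literal_eval` is taken from the notebook.

```python
import ast
import csv
import io

# Hypothetical toy row; the real dataset has many more columns.
row = {"name": "Toy Salad", "ingredients": ["lettuce", "olive oil"]}

# Writing to CSV stringifies the list (its repr is stored).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "ingredients"])
writer.writeheader()
writer.writerow(row)

# Reading it back yields a plain string, not a list.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
raw_value = loaded["ingredients"]

# ast.literal_eval safely evaluates Python literals (lists, dicts, numbers, strings).
restored = ast.literal_eval(raw_value)
print(type(raw_value).__name__, "->", type(restored).__name__)
```

Unlike `eval`, `ast.literal_eval` only accepts literal syntax, which makes it a reasonable choice for columns that were serialized by Python itself.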

### **3. Embedding and Clustering**
- Uses **`SentenceTransformer` (`all-MiniLM-L6-v2`)** to embed the combined `name + description` text.
- Vectors are normalized and clustered using **FAISS KMeans** into 100 clusters.
- The nearest recipe to each cluster centroid is selected as a **representative sample**.
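
The representative-selection step can be illustrated with a tiny pure-Python sketch. It is a toy stand-in for the FAISS nearest-neighbor search, under the assumption that "nearest" means smallest squared L2 distance. The 2-D vectors, centroids, and helper names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_to_centroids(vectors, centroids):
    """For each centroid, return the index of the closest vector."""
    representatives = []
    for centroid in centroids:
        best = min(range(len(vectors)), key=lambda i: squared_l2(vectors[i], centroid))
        representatives.append(best)
    return representatives

# Hypothetical 2-D "embeddings" forming two obvious groups.
vectors = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 1.0), (1.1, 0.9)]
# Hypothetical centroids, e.g. as produced by a k-means run.
centroids = [(0.04, 0.06), (1.0, 0.97)]

rep_indices = nearest_to_centroids(vectors, centroids)
print(rep_indices)
```

Each returned index points at an actual data row, which is why a real recipe (and not the synthetic centroid itself) can be shown to the question generator.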

### **4. Prompt Generation and Question Answering**
- A function is defined to generate:
  - A general user-style question from the recipe.
  - A concise answer based on the description.
- OpenAI’s `gpt-3.5-turbo` is used to generate these (e.g., “Can you recommend a comforting dessert for fall and winter days?”).

### **5. Semantic Retrieval with FAISS**
- The questions are embedded and queried against the original recipe text embeddings using FAISS.
- Retrieval results are compared to see whether the source recipe is among the top-k hits.
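
One way to express the top-k check is a hit-rate metric. The sketch below is a hedged illustration with made-up recipe ids; the function names `hit_at_k` and `hit_rate` are hypothetical and do not appear in the notebook.

```python
def hit_at_k(ranked_ids, source_id, k):
    """Return True if source_id appears among the first k retrieved ids."""
    return source_id in ranked_ids[:k]

def hit_rate(results, k):
    """Fraction of queries whose source recipe is in the top-k results."""
    hits = sum(hit_at_k(ranked, source, k) for ranked, source in results)
    return hits / len(results)

# Hypothetical retrieval results: (ranked recipe ids, id of the source recipe).
results = [
    ([7, 3, 9, 1], 3),   # source found at rank 2
    ([4, 8, 2, 6], 6),   # source found at rank 4
    ([5, 0, 1, 2], 9),   # source not retrieved
]

rate_at_2 = hit_rate(results, k=2)
rate_at_4 = hit_rate(results, k=4)
print(rate_at_2, rate_at_4)
```

Reporting the rate at several values of k makes it easier to see whether misses are near-misses or complete retrieval failures.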

### **6. Qwen Model Evaluation**
- The **Qwen2.5-0.5B-Instruct** model is loaded and queried locally.
- A prompt-based function asks Qwen to rephrase questions into a JSON format with two keys:
  - `search_query`
  - `requested_count`
- The results are evaluated based on whether the original recipe is successfully retrieved.

---

## ⚠️ Known Limitations

### **🧪 Baseline Scope**
- This setup uses a **baseline Qwen model** without any fine-tuning.
- Currently, it **only utilizes**:
  - `name`
  - `description`
- It **does not include** richer recipe metadata such as:
  - `ingredients`
  - `steps`
  - `tags`
  - `nutritional information`

### **⚙️ JSON Parsing Accuracy**
- The Qwen model:
  - Struggles with **consistent JSON formatting**.
  - Often fails to generate a valid or accurate `requested_count` field.
- These errors reduce the robustness of downstream search.
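
A possible mitigation, sketched here as an assumption and not as something the notebook does, is to normalize whatever the model returns into the two-key schema before searching. The fallback object and the default count of 5 mirror the prompt rules used later; `normalize_query` is a hypothetical helper.

```python
import json

FALLBACK = {"requested_count": 0, "search_query": "[Out of scope or needs clarification]"}

def normalize_query(raw_json, default_count=5):
    """Parse a model reply and coerce it to the two-key schema; fall back on failure."""
    try:
        data = json.loads(raw_json)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK)
    if not isinstance(data, dict) or not isinstance(data.get("search_query"), str):
        return dict(FALLBACK)
    count = data.get("requested_count")
    # Accept ints and digit strings such as "3"; anything else gets the default.
    if isinstance(count, bool):
        count = default_count
    elif isinstance(count, str) and count.strip().isdigit():
        count = int(count.strip())
    elif not isinstance(count, int):
        count = default_count
    return {"requested_count": count, "search_query": data["search_query"].strip()}

ok = normalize_query('{"requested_count": "3", "search_query": " chicken "}')
bad_count = normalize_query('{"requested_count": "a few", "search_query": "tofu"}')
broken = normalize_query('{"requested_count": 3, "search_query": ')
print(ok, bad_count, broken)
```

This keeps a usable `search_query` even when `requested_count` is malformed, so one bad field does not discard the whole reply.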

> 🔧 **Future work** may involve fine-tuning Qwen on structured data prompts to:
> - Improve JSON compliance.
> - Control the structure and schema of rephrased queries.
> - Utilize richer inputs beyond just name and description.

---

## ✅ Final Outputs
- Cluster-level questions and answers: `cluster_questions.csv`
- Model evaluation log with captured queries: `captured_by_qwen` column
- Manual review samples for failure cases: `review_df`

---

This notebook provides a lightweight baseline for evaluating small LLMs in recipe retrieval workflows. It opens doors for future improvements in rephrasing reliability, schema adherence, and multimodal context inclusion.


In [1]:
import pandas as pd
import numpy as np

recipes = pd.read_csv('filtered_recipes.csv')

In [2]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134746 entries, 0 to 134745
Data columns (total 26 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   id                                134746 non-null  int64  
 1   name                              134746 non-null  object 
 2   description                       134078 non-null  object 
 3   ingredients_raw                   134746 non-null  object 
 4   steps                             134746 non-null  object 
 5   servings                          134746 non-null  float64
 6   serving_size                      134746 non-null  object 
 7   tags                              134746 non-null  object 
 8   ingredients                       134746 non-null  object 
 9   amounts                           134746 non-null  object 
 10  amount_gram                       134746 non-null  object 
 11  serving_size_numeric              134746 non-null  f

In [3]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
recipes.head(1)

Unnamed: 0,id,name,description,ingredients_raw,steps,servings,serving_size,tags,ingredients,amounts,amount_gram,serving_size_numeric,predicted_total,actual_total,approx_rate,total_recipe_weight,recipe_energy_per100g,recipe_carbohydrates_per100g,recipe_proteins_per100g,recipe_fat_per100g,recipe_energy_kcal_per100g,recipe_energy_per_serving,recipe_carbohydrates_per_serving,recipe_proteins_per_serving,recipe_fat_per_serving,recipe_energy_kcal_per_serving
0,76133,Reuben and Swiss Casserole Bake,I think this is even better than a reuben sand...,"[""1/2-1 lb corned beef, cooked and choppe...","[""Set oven to 350 degrees F."", ""Butter a 9 x 1...",4.0,1 (207 g),"[""60-minutes-or-less"", ""time-to-make"", ""course...","['corned beef', 'thousand island dressing', 's...","[{'unit': 'pound', 'amount_min': 0.5, 'amount_...","['226.8-453.6', 60.0, 453.6, 226.8, 150.0, 53.9]",207.0,1284.5,828.0,0.644609,1284.5,837.891086,8.031474,12.64761,13.490992,200.260776,1734.434548,16.625151,26.180553,27.926353,414.539806


In [None]:
# Try parsing the 'steps' field as a YAML list
import yaml

def safe_yaml_parse(s):
    try:
        # First pass: replace single quotes with double quotes for YAML compatibility
        # (apostrophes inside a step can still break parsing; failures are retried later)
        s_clean = s.replace("'", '"')
        parsed = yaml.safe_load(s_clean)
        # Accept only list results (not None or other types)
        return parsed if isinstance(parsed, list) else None
    except Exception:
        # Return None for failed parsing (tracked later)
        return None

# Create a new column for parsed steps (keep original 'steps')
recipes['steps_parsed'] = recipes['steps'].apply(safe_yaml_parse)

# Track validation status (True/False)
recipes['steps_validated'] = recipes['steps_parsed'].notna()

# Save failed entries with their IDs to a CSV
failed_mask = ~recipes['steps_validated']
failed_entries = recipes.loc[failed_mask, ['id', 'steps']]  # Assuming 'id' column exists
failed_entries.to_csv('failed_entries.csv', index=False)

# Optional: Drop the temporary 'steps_parsed' column if needed
# recipes = recipes.drop(columns=['steps_parsed'])
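
An alternative worth considering is to leave the string untouched and try two standard-library parsers in turn. This is a sketch under the assumption that the `steps` values are either JSON-style or Python-literal-style lists; it is not the method used in this notebook, and `parse_list_string` is a hypothetical helper.

```python
import ast
import json

def parse_list_string(s):
    """Try JSON first, then Python-literal syntax; return a list or None."""
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(s)
        except (ValueError, SyntaxError, TypeError):
            # json.JSONDecodeError is a ValueError; literal_eval may raise SyntaxError.
            continue
        if isinstance(parsed, list):
            return parsed
    return None

# Hypothetical sample values in the two common styles, plus a malformed one.
json_style = '["Set oven to 350 degrees F.", "Don\'t overbake."]'
python_style = "['corned beef', 'swiss cheese']"
malformed = '["unterminated'

parsed_json = parse_list_string(json_style)
parsed_python = parse_list_string(python_style)
parsed_bad = parse_list_string(malformed)
print(parsed_json, parsed_python, parsed_bad)
```

Because no quotes are rewritten, apostrophes inside double-quoted steps survive intact.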

In [6]:
print(f"Successfully parsed: {recipes['steps_validated'].sum()}")
print(f"Failed: {len(recipes) - recipes['steps_validated'].sum()}")

Successfully parsed: 113775
Failed: 20971


In [None]:
# Retry parsing failed entries with improved escaping

import yaml

def safe_yaml_parse(s):
    try:
        # Escape existing double quotes first, then replace single quotes
        s_clean = s.replace('"', '\\"').replace("'", '"')
        parsed = yaml.safe_load(s_clean)
        return parsed if isinstance(parsed, list) else None  # Ensure it's a list
    except Exception:
        return None  # Keep as None for tracking

# Identify rows that previously failed parsing
mask = ~recipes['steps_validated']

# Reprocess only failed entries
recipes.loc[mask, 'steps_parsed'] = recipes.loc[mask, 'steps'].apply(safe_yaml_parse)

# Update validation status for reprocessed rows
recipes.loc[mask, 'steps_validated'] = recipes.loc[mask, 'steps_parsed'].notna()

# Save remaining failed entries to a new file
still_failed_mask = ~recipes['steps_validated']
still_failed_entries = recipes.loc[still_failed_mask, ['id', 'steps']]
still_failed_entries.to_csv('still_failed_entries.csv', index=False)

print(f"Successfully parsed after reprocessing: {recipes['steps_validated'].sum()}")
print(f"Still failed: {still_failed_entries.shape[0]}")

Successfully parsed after reprocessing: 134154
Still failed: 592


In [8]:
# Check a sample of remaining failures
still_failed_sample = recipes[still_failed_mask].sample(3)
still_failed_sample['steps'].apply(print)

["1. Stick a stick in through the top of the apple, and twirl the apple in the heated caramel dip.", "2. Roll that apple in some peanuts.", "3.  It"s called a carmel apple and ain"t it grand?"]
["Date Filling: Cook dates in water with orange rind {grated} with sugar over a moderate heat until thickened and smooth remove from heat and add fruit juices mix well and cool before spreading.", "Crumb Mixture: Sift flour, baking soda and salt rub in butter add sugar and oatmeal mix well spread 1/2 of the crumb mixture in a greased pan and press in smoothly. Cover with cooled date mixture evenly then cover with the remaining crumb mixture pat to make smooth bake 30 - 35 minutes in a cool oven 325 Fahrenheit then increase the heat to 350 Fahrenheit to lightly brown cake.", "Cut in squares while hot allow to cool in pan.", "Note: Date mixture can be cut in half if you do not want that much filling. Crumb mixture I blend the butter  well with the crumb mixture then I put the mixture in my Kitchen

18229    None
44026    None
60958    None
Name: steps, dtype: object

In [None]:
# Drop unfixable rows and reset index
recipes = recipes[recipes['steps_validated']]
recipes = recipes.reset_index(drop=True)
print(f"Remaining rows: {len(recipes)}")


Remaining rows: 134154


In [None]:
# Convert stringified columns back into actual Python lists/dicts
import ast
recipes['ingredients'] = recipes['ingredients'].apply(ast.literal_eval)

In [14]:
recipes['ingredients_raw'] = recipes['ingredients_raw'].apply(ast.literal_eval)

In [15]:
recipes['amount_gram'] = recipes['amount_gram'].apply(ast.literal_eval)

In [16]:
recipes['amounts'] = recipes['amounts'].apply(ast.literal_eval)

In [17]:
recipes['tags'] = recipes['tags'].apply(ast.literal_eval)

In [16]:
recipes.to_csv('parsed_recipes.csv',index=False)

In [12]:
recipes = pd.read_csv('parsed_recipes.csv')

In [9]:
# We are going to generate questions from each recipe's name and description.
# This will help build our baseline model.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')
openai = OpenAI(api_key = api_key)
# Check the key

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!


In [18]:
# We will load the full dataset and combine the name and description fields into a single text field for each recipe.
# Using a SentenceTransformer, we will convert all combined texts into embeddings.
# We will normalize these embeddings so that cosine similarity can be computed using inner products.
# Then we will cluster the 130k+ embeddings using Faiss K-Means.
# For each of the 100 centroids, we will generate one synthetic question and answer so that together they cover the diversity of our dataset.

from sentence_transformers import SentenceTransformer
import faiss

df = recipes[['name','description']].copy()


In [None]:
# Combine name and description into one string
df["description"] = df["description"].fillna("")  
df["name_description"] = df["name"] + " - " + df["description"]
df["name_description"] = df["name_description"].str.strip(" -")  

In [20]:
# Load a SentenceTransformer model for embedding generation
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings for all combined texts
embeddings = embedder.encode(df["name_description"].tolist(), convert_to_numpy=True)
# Normalize embeddings so that inner product equals cosine similarity
faiss.normalize_L2(embeddings)

In [None]:
# Cluster recipes into 100 groups using Faiss KMeans
num_clusters = 100
d = embeddings.shape[1]

# Perform K-Means clustering with Faiss
kmeans = faiss.Kmeans(d, num_clusters, niter=20, verbose=True)
kmeans.train(embeddings)

13619.9892578125

In [22]:
# Create a FAISS index for retrieving the nearest neighbor of each centroid.
index = faiss.IndexFlatL2(d)
index.add(embeddings)
# For each centroid, retrieve the index of the nearest recipe embedding.
_, cluster_indices = index.search(kmeans.centroids, 1)
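
The index above is an L2 index even though the embeddings were normalized for cosine similarity. For unit-length vectors the two give the same ranking, because squared L2 distance equals `2 - 2 * cosine_similarity`. The sketch below checks that identity numerically on two hypothetical vectors. It holds exactly only when both vectors have unit length, which the k-means centroids used as queries here are not guaranteed to have.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical 3-D vectors standing in for embeddings.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cosine = dot(a, b)             # inner product of unit vectors = cosine similarity
distance_sq = squared_l2(a, b)
print(distance_sq, 2 - 2 * cosine)
```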

In [None]:
# Define the question/answer generation function used for each cluster's representative recipe
def generate_question_and_answer(recipe_text):
    """
    Given a recipe text (combined name and description), generate:
      - A general question a user might ask (without having seen the recipe)
      - A concise answer summarizing the recipe (including the recipe name and description)
    
    The prompt instructs the model to output in a structured format:
    
    Question: <your generated general question>
    Answer: <your generated answer>
    """
    prompt = (
         "Based on the following recipe details, generate one broad and general question that a user might ask when looking for a recipe. "
        "Ignore very specific details and focus on the main category, key ingredients, or overall theme of the recipe. For example, instead of mentioning every detail, "
        "the question could be like 'Can you suggest some comfort food desserts?' or 'Are there any budget-friendly recipes that include a specific ingredient?'\n\n"
        "Also, generate a concise answer summarizing the recipe, including the recipe name and a brief description.\n\n"
        f"Recipe Details: {recipe_text}\n\n"
        "Please format your response exactly as follows:\n"
        "Question: <your general question>\n"
        "Answer: <your summary answer>"
    )
    
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",  
        messages=[
            {"role": "system", "content": "You are a helpful culinary assistant skilled in both understanding recipe details and generalizing user search queries."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,
        temperature=0.35,
    )
    content = response.choices[0].message.content.strip()
    
    # Parse the output assuming it follows the requested format.
    # We'll split on "Question:" and "Answer:".
    question = ""
    answer = ""
    if "Question:" in content and "Answer:" in content:
        parts = content.split("Answer:")
        question_part = parts[0].split("Question:")[-1].strip()
        answer = parts[1].strip()
        question = question_part
    else:
        # Fallback: return the full content as question and empty answer.
        question = content
        answer = ""
    
    # Final clean up: remove any stray numbering or punctuation.
    question = question.strip(" -0123456789").strip()
    answer = answer.strip(" -0123456789").strip()
    return question, answer
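
The parsing step of the function above can be exercised without calling the API. The following sketch is a simplified variant: it uses `str.partition`, so only the first `Answer:` marker splits the text, and it omits the digit-stripping clean-up. `parse_question_answer` and the sample strings are hypothetical.

```python
def parse_question_answer(content):
    """Split a 'Question: ... Answer: ...' reply; fall back to the raw text."""
    if "Question:" in content and "Answer:" in content:
        # partition splits on the first "Answer:" and keeps everything after it.
        before, _, after = content.partition("Answer:")
        question = before.split("Question:")[-1].strip()
        return question, after.strip()
    # Fallback: the whole reply is treated as the question.
    return content.strip(), ""

sample = "Question: Can you suggest a cozy dessert?\nAnswer: Bread Pudding, a warm comfort dessert."
question, answer = parse_question_answer(sample)
unstructured = parse_question_answer("Sorry, I cannot help.")
print(question, "|", answer, "|", unstructured)
```

Testing the parser on fixed strings like this makes format drift in the model's replies easier to spot before running all 100 clusters.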

In [None]:
# Run question generation for each cluster’s representative recipe
all_cluster_data = []

for cluster_id in range(num_clusters):
    rep_idx = int(cluster_indices[cluster_id][0])  # Representative index for the cluster
    recipe_text = df.iloc[rep_idx]["name_description"]
    
    # Generate one general question and an answer from the representative recipe text.
    q, a = generate_question_and_answer(recipe_text)
    
    all_cluster_data.append({
        "cluster_id": cluster_id,
        "representative_index": rep_idx,
        "recipe_text": recipe_text,
        "generated_question": q,
        "generated_answer": a
    })

In [None]:
# Save generated cluster questions to CSV
questions_df = pd.DataFrame(all_cluster_data)
questions_df.to_csv("cluster_questions.csv", index=False)

In [23]:
questions_df = pd.read_csv('cluster_questions.csv')

In [29]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
questions_df.head(5)

Unnamed: 0,cluster_id,representative_index,recipe_text,generated_question,generated_answer
0,0,63064,Bread Pudding - This is my mom's bread pudding recipe that she found in a magazine shortly after she and my dad were married. It is a wonderful comfort food for cool fall and winter days!,Can you recommend a comforting dessert for fall and winter days?,"Recipe Name: Mom's Bread Pudding\nDescription: A comforting bread pudding recipe perfect for cool fall and winter days, passed down from the user's mom."
1,1,54901,"Pizza Dough - This makes an awesome pizza crust, crispy and yummy",Can you recommend a classic recipe for homemade pizza dough?,"Recipe Name: Homemade Pizza Dough\nDescription: This recipe provides a simple and delicious homemade pizza dough that results in a crispy and tasty crust, perfect for making your own pizzas at home."
2,2,57229,Baked Pork Chops - This is a great meal but a little longer than your weekday meal might take. Great flavor!,Can you recommend a flavorful and slightly time-consuming dinner recipe?,Recipe Name: Baked Pork Chops\nDescription: This Baked Pork Chops recipe is a delicious and flavorful meal that takes a bit longer to prepare than a typical weekday dinner. It offers great taste and is worth the extra time spent in the kitchen.
3,3,119890,Inexpensive Caramels - This is from my Mom's recipe cards. I have not tried this recipe.,Do you have a simple recipe for homemade candy?,Recipe Name: Inexpensive Caramels\nDescription: This recipe is a classic homemade caramel recipe passed down from the user's Mom. It is a simple and inexpensive way to make delicious caramels at home.
4,4,101570,Turkey Tenderloins With Caramelized Onions - I love this with mashed potatoes! This is also good with chicken in place of turkey. It is adapted from a Betty Crocker cookbook.,Can you recommend a recipe featuring turkey tenderloins as the main ingredient?,"Recipe Name: Turkey Tenderloins With Caramelized Onions - A delicious dish featuring turkey tenderloins and caramelized onions, perfect when served with mashed potatoes. It can also be made with chicken as a substitute. Adapted from a Betty Crocker cookbook."


In [None]:
# Semantic search using FAISS based on query question
# The function will be used by Qwen to retrieve the context
def direct_faiss_search(query_text, top_k=5):
    """
    1) Convert query_text into an embedding with the same model that produced 'embeddings'.
    2) Normalize it (because the index presumably expects normalized embeddings).
    3) Search your FAISS index for the top_k hits.
    4) Return a DataFrame slice of the best matches.
    """
    # 1) Encode the query
    q_emb = embedder.encode([query_text], convert_to_numpy=True)
    
    # 2) Normalize so that squared L2 distance = 2 - 2*cosine_similarity
    faiss.normalize_L2(q_emb)
    
    # 3) Query FAISS
    distances, indices = index.search(q_emb, top_k)
    
    # 4) Return the matching rows from df
    # 'indices' is shape (1, top_k)
    best_idx = indices[0]
    return df.iloc[best_idx]

# Testing with the question:
question = questions_df.loc[2,'generated_question']
results_df = direct_faiss_search(question, top_k=5)

print("QUESTION:", question)
print("\nFAISS Search Results:\n", results_df)


QUESTION: Can you recommend a flavorful and slightly time-consuming dinner recipe?

FAISS Search Results:
                       name  description  \
86630  International Dinner

In [171]:
print(questions_df.loc[2,'recipe_text'])

Baked Pork Chops - This is a great meal but a little longer than your weekday meal might take. Great flavor!


In [None]:
# Load Qwen2.5-0.5B-Instruct model and tokenizer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
qwen_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
qwen_model.eval()

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbe

In [None]:
# Send model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Qwen model is running on: {device.upper()}")

qwen_model.to(device)

Qwen model is running on: CUDA



In [None]:
# Extract and rephrase user query into structured JSON using Qwen

import re
import json

def extract_final_json_block(raw_text):
    """
    1) Find all substrings that start with '{' immediately followed by the
       "requested_count" key and continue until the first '}'.
    2) Return the LAST one found, in case Qwen prints multiple partial blocks or instructions.
    3) Attempt to parse it as JSON; if parsing fails, fall back to out-of-scope.

    This approach ignores any lines with 'assistant' or system instructions,
    because it hunts only for the final curly brace block that starts with "requested_count".
    Note: a block whose first key is not "requested_count", or that contains nested braces,
    is not matched correctly.
    """

    # This regex does a LAZY match of any characters (including newlines) from
    # the literal '{' up through the first '}' after "requested_count".
    pattern = r'\{\s*"requested_count"(?:.|\n)*?\}'

    matches = re.findall(pattern, raw_text)
    if not matches:
        # If no match, we have no valid JSON => fallback
        return {
            "requested_count": 0,
            "search_query": "[Out of scope or needs clarification]"
        }

    # Take the last match to get the final JSON block
    last_block = matches[-1].strip()

    try:
        parsed = json.loads(last_block)
        return parsed
    except Exception as e:
        print("[DEBUG] parse error on last_block:", e)
        return {
            "requested_count": 0,
            "search_query": "[Out of scope or needs clarification]"
        }

        

def rephrase_query_for_retrieval_json(original_query):
    

    system_instructions = (
        "You are a helpful assistant that extracts structured recipe request info from user input.\n"
        "Return ONLY a valid JSON object like this:\n"
        '{ "requested_count": <integer>, "search_query": "<keywords or constraints>" }\n\n'
        "Rules:\n"
        "- If user says 'Give me X recipes' or 'I want X dishes', set requested_count = X , otherwise set requested_count = 5.\n"
        "- Time or ingredient limits (e.g. '3 ingredients', '15 minutes') go in search_query.\n"
        "- If unclear or out of scope, respond with:\n"
        '{ \"requested_count\": 0, \"search_query\": \"[Out of scope or needs clarification]\" }\n'
        "- No other output. Only JSON. No text or role names.\n\n"
        "Examples:\n"
        "\"Give me 3 chicken recipes\"\n"
        "→ { \"requested_count\": 3, \"search_query\": \"chicken\" }\n\n"
        "\"Quick vegan meals\"\n"
        "→ { \"requested_count\": 5, \"search_query\": \"quick vegan meals\" }\n\n"
        "\"I want 10 dishes with at most 3 ingredients, ready in 15 minutes\"\n"
        "→ { \"requested_count\": 10, \"search_query\": \"at most 3 ingredients, 15 minutes\" }\n\n"
        "\"I want some recipes with tofu\"\n"
        # Blank line added so the last two examples are not run together
        "→ { \"requested_count\": 5, \"search_query\": \"tofu\" }\n\n"
        "\"Give me some recipes with chicken as a main ingredient.\"\n"
        "→ { \"requested_count\": 5, \"search_query\": \"chicken\" }"
    )



    user_message = f"{original_query}\n→"


    messages = [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": user_message}
    ]

    # Build final prompt for Qwen
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    print("\n===== [DEBUG] Final Prompt =====")
    print(prompt)
    print("================================\n")

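    inputs = tokenizer(prompt, return_tensors="pt").to(qwen_model.device)
    with torch.no_grad():
        output = qwen_model.generate(**inputs, max_new_tokens=150)

    # Decode only the newly generated tokens; otherwise the example JSON blocks
    # in the prompt could be picked up by the extraction step.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    raw_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()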

    print("===== [DEBUG] Raw Qwen Output =====")
    print(raw_text)
    print("===================================\n")

    # Use the regex-based extraction
    parsed = extract_final_json_block(raw_text)

    print("===== [DEBUG] Parsed JSON =====")
    print(parsed)
    print("================================\n")

    return parsed
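
The regex-based extraction above assumes `"requested_count"` is the first key and that the object contains no nested braces. As a hedged alternative (an assumption, not what this notebook runs), the standard library's `json.JSONDecoder.raw_decode` can scan free text for complete JSON objects regardless of key order. `extract_last_json_object` and the `filters` key in the sample reply are hypothetical.

```python
import json

def extract_last_json_object(text, required_key="requested_count"):
    """Scan for JSON objects in free text and return the last one with required_key."""
    decoder = json.JSONDecoder()
    found = None
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON starting here; try the next opening brace.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and required_key in obj:
            found = obj
        # Continue scanning after the object that was just decoded.
        pos = text.find("{", end)
    return found

# Hypothetical model reply with chatter, reordered keys, and a nested object.
reply = (
    'Sure! { "search_query": "quick vegan meals", "requested_count": 5 }\n'
    'Final: { "search_query": "vegan", "filters": {"max_minutes": 15}, "requested_count": 2 }'
)
result = extract_last_json_object(reply)
no_json = extract_last_json_object("no json here { broken")
print(result, no_json)
```

When nothing parses, the function returns `None`, so the caller can substitute the same out-of-scope fallback object the notebook already uses.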




In [None]:
# Run rephrasing on one sample question

user_question = questions_df.loc[5,'generated_question']
parsed_result = rephrase_query_for_retrieval_json(user_question)

if parsed_result["requested_count"] == 0 and parsed_result["search_query"].startswith("[Out of scope"):
    print("Qwen flagged the query or parse error =>", parsed_result["search_query"])
else:
    print("Final parsed JSON =>", parsed_result)

print(questions_df.loc[5,'recipe_text'])
direct_faiss_search(parsed_result["search_query"],5)




===== [DEBUG] Final Prompt =====
<|im_start|>system
You are a helpful assistant that extracts structured recipe request info from user input.
Return ONLY a valid JSON object like this:
{ "requested_count": <integer>, "search_query": "<keywords or constraints>" }

Rules:
- If user says 'Give me X recipes' or 'I want X dishes', set requested_count = X , otherwise set requested_count = 5.
- Time or ingredient limits (e.g. '3 ingredients', '15 minutes') go in search_query.
- If unclear or out of scope, respond with:
{ "requested_count": 0, "search_query": "[Out of scope or needs clarification]" }
- No other output. Only JSON. No text or role names.

Examples:
"Give me 3 chicken recipes"
→ { "requested_count": 3, "search_query": "chicken" }

"Quick vegan meals"
→ { "requested_count": 5, "search_query": "quick vegan meals" }

"I want 10 dishes with at most 3 ingredients, ready in 15 minutes"
→ { "requested_count": 10, "search_query": "at most 3 ingredients, 15 minutes" }

"I want some recip

Unnamed: 0,name,description,name_description
14471,Joanne's Sweet and Hot Pickles,"This recipe was given to me by a grand lady that I work with. It is very easy, fast and taste very good.","Joanne's Sweet and Hot Pickles - This recipe was given to me by a grand lady that I work with. It is very easy, fast and taste very good."
25924,Simply Sweet Pickles (No Processing Required),"This is the simplest recipe ever! I've canned a lot of pickles, and this is now our favorite sweet pickle recipe. And the beauty of it is there is no water bath required! It has a little kick to it from tobasco sauce. I do NOT like hot, spicey food so I was sceptical at first, but trust me, there is just enough kick to this to make it interesting.","Simply Sweet Pickles (No Processing Required) - This is the simplest recipe ever! I've canned a lot of pickles, and this is now our favorite sweet pickle recipe. And the beauty of it is there is no water bath required! It has a little kick to it from tobasco sauce. I do NOT like hot, spicey food so I was sceptical at first, but trust me, there is just enough kick to this to make it interesting."
58545,Crisp Sweet Pickles,Super easy and they taste sooo good! I love this recipe because it only takes 10 minutes + chilling time to make. I found this recipe in Taste of Home magazine and now I keep it on hand in my recipe box!,Crisp Sweet Pickles - Super easy and they taste sooo good! I love this recipe because it only takes 10 minutes + chilling time to make. I found this recipe in Taste of Home magazine and now I keep it on hand in my recipe box!
68444,Sweet Hot Pickles,"This is a combination of several recipes, one from my father, one from an old cook book, and my own adjustments","Sweet Hot Pickles - This is a combination of several recipes, one from my father, one from an old cook book, and my own adjustments"
98261,Lite Spicy Dill Pickle Dip,"Make with your favorite dill pickle, or better yet, your homemade ones","Lite Spicy Dill Pickle Dip - Make with your favorite dill pickle, or better yet, your homemade ones"


In [181]:
# Here we check which questions Qwen understands, at least well enough to produce a usable search query.
# We don't check requested_count for now: it is unreliable and will probably need fine-tuning or other configuration changes.
# 1) Create a new column (all False by default).
questions_df["captured_by_qwen"] = False

for i in range(len(questions_df)):
    # Extract the question + original recipe text
    user_question = questions_df.loc[i, "generated_question"]
    original_text = questions_df.loc[i, "recipe_text"]

    # 2) Parse user_question into a search query
    parsed = rephrase_query_for_retrieval_json(user_question)

    # If Qwen indicates it's out of scope or parse error => skip
    if parsed["requested_count"] == 0 and parsed["search_query"].startswith("[Out of scope"):
        # We'll just mark it as not captured
        questions_df.loc[i, "captured_by_qwen"] = False
        continue

    # 3) Retrieve the top 50 matches from FAISS
    search_query = parsed["search_query"]
    results_df = direct_faiss_search(search_query, top_k=50)

    # 4) Check whether the original recipe_text is among those top 50 by comparing
    #    'name_description' in results_df to the exact text from 'recipe_text'.
    is_hit = (results_df["name_description"] == original_text).any()
    questions_df.loc[i, "captured_by_qwen"] = bool(is_hit)

# Finally, we can inspect questions_df["captured_by_qwen"]
# to see which rows had their recipe_text appear in the top 50 results.



===== [DEBUG] Final Prompt =====
<|im_start|>system
You are a helpful assistant that extracts structured recipe request info from user input.
Return ONLY a valid JSON object like this:
{ "requested_count": <integer>, "search_query": "<keywords or constraints>" }

Rules:
- If user says 'Give me X recipes' or 'I want X dishes', set requested_count = X , otherwise set requested_count = 5.
- Time or ingredient limits (e.g. '3 ingredients', '15 minutes') go in search_query.
- If unclear or out of scope, respond with:
{ "requested_count": 0, "search_query": "[Out of scope or needs clarification]" }
- No other output. Only JSON. No text or role names.

Examples:
"Give me 3 chicken recipes"
→ { "requested_count": 3, "search_query": "chicken" }

"Quick vegan meals"
→ { "requested_count": 5, "search_query": "quick vegan meals" }

"I want 10 dishes with at most 3 ingredients, ready in 15 minutes"
→ { "requested_count": 10, "search_query": "at most 3 ingredients, 15 minutes" }

"I want some recip
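The prompt above asks for a bare JSON object, but a small model may still wrap it in extra text or emit malformed JSON. The following is a minimal sketch of a defensive parser, not the notebook's actual implementation: `parse_model_json` and `OUT_OF_SCOPE` are hypothetical names, and the fallback mirrors the out-of-scope object defined in the prompt rules.

```python
import json
import re

# Hypothetical fallback, copied from the out-of-scope object in the prompt rules.
OUT_OF_SCOPE = {
    "requested_count": 0,
    "search_query": "[Out of scope or needs clarification]",
}


def parse_model_json(raw_text, default_count=5):
    """Extract the first flat JSON object from raw model text.

    Returns a copy of OUT_OF_SCOPE when no usable object is found.
    """
    match = re.search(r"\{[^{}]*\}", raw_text)
    if match is None:
        return dict(OUT_OF_SCOPE)
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(OUT_OF_SCOPE)

    query = data.get("search_query")
    if not isinstance(query, str) or not query.strip():
        return dict(OUT_OF_SCOPE)

    # Fall back to the prompt's default when the count is missing or not a valid integer.
    count = data.get("requested_count", default_count)
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        count = default_count
    return {"requested_count": count, "search_query": query.strip()}
```

Returning the same out-of-scope object for every failure mode keeps the downstream check on `requested_count == 0` and the `"[Out of scope"` prefix valid without extra branches.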

In [184]:
captured = questions_df["captured_by_qwen"]
print(f"Captured by qwen:{captured.sum()}")
print(f"Not captured by qwen:{(~captured).sum()}")

Captured by qwen:50
Not captured by qwen:50
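
The count above reflects a single cutoff (top 50). To see how the capture rate changes with the cutoff, a hit-rate-at-k helper can be sketched as below. This is an illustrative sketch: `hit_rate_at_k` is a hypothetical name, and it assumes each entry of `ranked_results` is a list of result texts ordered from best to worst match.

```python
def hit_rate_at_k(ranked_results, targets, ks=(5, 20, 50)):
    """Return {k: fraction of queries whose target appears in the top k results}.

    ranked_results: one list of result texts per query, best match first.
    targets: the expected text for each query, in the same order.
    """
    if len(ranked_results) != len(targets):
        raise ValueError("ranked_results and targets must have the same length")
    if not targets:
        return {k: 0.0 for k in ks}

    rates = {}
    for k in ks:
        hits = sum(
            target in ranked[:k]
            for ranked, target in zip(ranked_results, targets)
        )
        rates[k] = hits / len(targets)
    return rates
```

In this notebook, `ranked_results` could be built from `direct_faiss_search(query, top_k=50)["name_description"].tolist()` for each question, assuming that function returns rows in ranked order, with `targets` taken from the `recipe_text` column.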


In [185]:
# Half of the questions (50 of 100) led back to their source recipe within the top 50 results.
# Now let's check the other half.
# Collect the rows that were not captured in a new DataFrame for manual review.
rows_to_review = questions_df[~questions_df["captured_by_qwen"]].copy()

# We'll build a list of records for a "review_df"
review_records = []

for i in rows_to_review.index:
    question_text = rows_to_review.loc[i, "generated_question"]
    # Re-run Qwen to recover the search query, since it was not stored in the first pass
    parsed = rephrase_query_for_retrieval_json(question_text)
    search_q = parsed.get("search_query", "")
    
    # Retrieve top 5 from FAISS
    faiss_results = direct_faiss_search(search_q, top_k=5)
    
    # For clarity, let's store each of the top 5 hits as a dictionary entry,
    # along with the question.
    record = {
        "row_index": i,
        "generated_question": question_text,
        "search_query": search_q,
        "top5_results": faiss_results["name_description"].tolist()  
    }
    review_records.append(record)

# Convert these records into a DataFrame for manual checking
review_df = pd.DataFrame(review_records)



In [186]:
print("Rows to review for manual checking:")
review_df.head()

Rows to review for manual checking:


Unnamed: 0,row_index,generated_question,search_query,top5_results
0,0,Can you recommend a comforting dessert for fall and winter days?,comfortable dessert for fall and winter days,"[Dessert - Dessert. It is a good cookie recipe for the winter or any time. I love this recipe made more in the winter so it can set better. Enjoy this recipe., Pumpkin Roll - Excellent dessert for fall season. Freezes well., Strawberry Dessert - This is so easy and so GOOD. Great for your dessert on a hot summer's day., Dessert - It's what's for dessert., Cherries in the Snow - This was absolutely my favorite dessert growing up. It's very easy and yummy.]"
1,2,Can you recommend a flavorful and slightly time-consuming dinner recipe?,"flavorful, slightly time-consuming dinner recipe","[Chicken Hogan - This is a very basic recipe that began as a way to keep from going grocery shopping for a day or two more. It's very forgiving if you have any ingredient that you are short on or have extra you are trying to get rid of. It has a good blend of flavors and is very simple., Baked Garlic Chicken - Speedy weeknight supper! Very flavorful!, International Dinner - I rarely try my own recipes so I hope its edible lol gimme feedback, Cowboy Scramble - This came to me in an e-mail. As we had guests at the moment I decided to try the recipe. Served it for breakfast but it's equally good as lunch. Doesn't take long to make either., Turkey Sloppy Joes - This is one of those recipes that came from having something in the fridge that had to be used - its easy and has good flavors]"
2,3,Do you have a simple recipe for homemade candy?,homemade candy recipe,"[Candy Jewels - Homemade candy with an elegant design, Homemade Candy Bars - Rich and gooey., Penuche - Old candy recipe., Granny White's Brown Sugar Candy - This recipe came from my mother and now my daughters like to make it too., Homemade Candy Bars - Taste like a twix candy bars!]"
3,8,Can you recommend a healthy and quick cookie recipe?,"healthy cookies, quick","[Healthy Cookies - simple, healthy, yummy, Miracle Whip Cookies - Good cookies that are easy and quick., Healthy &quot;cookies&quot; - I have yet to try this, but they seem to be quick, fresh, and delicious bites., Healthy Morning Cookies - a healthy morning cookie, Chocolate Cookies - cookies]"
4,12,Can you suggest a classic holiday drink recipe?,classic holiday drink recipe,"[Hot Spiced Drink - This is a drink I had at a christmas, Hot Spiced Drink - This is a drink I had at a christmas, Holiday Party Punch - Another recipe from the National Honey Board. Very refreshing!, New Year's Eve Punch - This is a great non alcoholic drink for the holidays., Holiday Cheer Eggnog - What a delightful Holiday drink - and you can take your choice of flavoring!!]"


In [187]:
# Qwen's search queries look reasonable, and the top-5 results are relevant to each question.
# The exact source recipe is likely missed because of how the questions were built:
# each one was generated from the recipe nearest a k-means centroid, so many similar
# recipes in that cluster compete with it for the top ranks.
# Let's check some others to be sure.

review_df.tail()

Unnamed: 0,row_index,generated_question,search_query,top5_results
45,89,Can you suggest a refreshing dessert recipe that combines berries and chocolate?,refreshing dessert with berries and chocolate,"[White Chocolate Berry Dessert - From Taste of Home, Berries N Bits - This is a great summertime desserts. I got the recipe from a coworker. I have used different fresh fruits and more than one. If you want to make this when fresh fruits are not available, you can use frozen, but it does come out a little runny. I would advise using the mini chocolate chips, the regular ones are too big., Berry Topped Brownies - Strawberries and chocolate--YUM!, Strawberries With Rose Cream - Another very simple but elegant dessert, Chocolate Raspberry Dessert - This is a low calorie, delicious desert!!]"
46,94,Can you suggest a tropical-inspired party snack recipe?,tropical-inspired snacks,"[Crazy Good Tropical Cookies - Tropical flavor cookies, Tropical Fruit - Quick, Easy, and Delicious drink., Tropical Fruit Quesadillas - Dessert, Peanut Buttery Coconut Bars - Very easy snack and very good 8), Tropical Dip - Serve with fresh fruit and honey graham sticks.]"
47,96,Can you suggest a simple and crowd-pleasing vegetable side dish recipe?,vegetable side dish,"[Mixed Vegetables Casserole - Simple side dish and goes with so many things. Source - internet, Mixed Vegetable Delight - Mixed vegetable side dish that kicks., Parslied Potatoes - Tasty little side dish., Vegetable Saute - vegetable dish, side dish, baked potato accompaniment, Carrot-Potato Casserole - a quick and easy recipe for a side dish to a meal]"
48,97,Can you suggest a versatile and customizable breakfast recipe?,"versatile, customizable breakfast recipes","[Mexican Breakfast - I can't remember where I found this recipe but it's excellent for when you have overnight guests., Ranch style eggs - something different for breakfast., Breakfast Egg Nests - Looking for more versatile breakfast recipes? I got this from Vegetarian Times., Something Yummy for Breakfast - This recipe is so good for a weekend breakfast!, Healthy Breakfast Scramble - Just something I threw together to start eating healthier. It is delicious!!]"
49,98,Can you recommend a refreshing breakfast beverage using fruit?,"fruit, refreshing, drink","[Hot Fruit Drink - Have a batch on hand for a cold day., Fruit Smoothie - A quick, healthy fruit smoothie., Peach and Ginger Green Tea Smoothie - Refreshing and so good for you., Berry Banana Smoothies - Very refreshing. From Taste of Home., Mango Orange Cooler - Refreshing.]"
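
The review tables above show two things that affect an exact-string comparison: duplicate hits (the same "Hot Spiced Drink" entry appears twice) and HTML entities such as `&quot;` in the recipe text. A minimal sketch of a more forgiving comparison follows. The helper names are hypothetical, and whether normalization changes the capture count for this dataset has not been measured here.

```python
import html


def normalize_text(text):
    """Decode HTML entities, lowercase, and collapse runs of whitespace."""
    return " ".join(html.unescape(text).lower().split())


def dedupe_preserving_order(texts):
    """Drop results whose normalized text was already seen, keeping rank order."""
    seen = set()
    unique = []
    for text in texts:
        key = normalize_text(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def contains_normalized(results, target):
    """True if any result equals the target after normalization."""
    target_key = normalize_text(target)
    return any(normalize_text(result) == target_key for result in results)
```

Deduplicating before taking the top k gives the source recipe more distinct slots to land in, while the normalized comparison guards against mismatches caused only by casing, spacing, or entity encoding.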
