# Advanced RAG with LM Studio
This notebook demonstrates how to use LM Studio to build an advanced Retrieval-Augmented Generation (RAG) pipeline.

In [33]:
import os
from dotenv import load_dotenv
import pandas as pd
from tqdm.auto import tqdm
from qdrant_client.models import Filter, FieldCondition, MatchValue
from utils.vectordb_client import get_vector_store

load_dotenv()

LM_STUDIO_MODEL = os.getenv("LM_STUDIO_MODEL", "qwen/qwen3-30b-a3b")
LM_STUDIO_EMBEDDING_MODEL = os.getenv("LM_STUDIO_EMBEDDING_MODEL", "text-embedding-nomic-embed-text-v1.5")


Load datasets and prepare them for vectorization

In [34]:
tqdm.pandas(desc="Generating Documents")
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
# Load all three datasets
# df1 = pd.read_csv('./datasets/The_Flavors_of_India.csv')
# df2 = pd.read_csv('./datasets/indian_food.csv')
df3 = pd.read_csv('./datasets/IndianFoodDataset.csv')

# Standardize column names for df2
# df2 = df2.rename(columns={
#     'name': 'RecipeName',
#     'ingredients': 'Ingredients',
#     'prep_time': 'PrepTimeInMins',
#     'cook_time': 'CookTimeInMins',
#     'flavor_profile': 'FlavorProfile',
#     'course': 'Course',
#     'state': 'State',
#     'region': 'Region'
# })

# Standardize column names for df3 to match df1 structure
df3_standardized = df3.copy()
# df3 already has the right column names, but let's ensure consistency
df3_standardized = df3_standardized.rename(columns={
    'Diet': 'diet'  # Standardize the diet column name
})

# Add missing columns to df2 to match the structure
# df2['TranslatedRecipeName'] = df2['RecipeName']
# df2['TranslatedIngredients'] = df2['Ingredients']
# df2['TranslatedInstructions'] = ''
# df2['URL'] = ''
# df2['Cuisine'] = df2['Region']

# Add missing columns to df3 to match the structure
df3_standardized['FlavorProfile'] = ''
df3_standardized['State'] = ''
df3_standardized['Region'] = df3_standardized['Cuisine']
df3_standardized['Cleaned-Ingredients'] = ''
df3_standardized['image-url'] = ''
df3_standardized['Ingredient-count'] = ''

# Select common columns for a combination
common_columns = [
    'TranslatedRecipeName', 'RecipeName', 'TranslatedIngredients', 'Ingredients',
    'PrepTimeInMins', 'CookTimeInMins', 'TotalTimeInMins', 'Cuisine', 'Course',
    'diet', 'TranslatedInstructions', 'URL'
]

# Ensure all dataframes have these columns
for df in [df1, df2, df3_standardized]:
    for col in common_columns:
        if col not in df.columns:
            df[col] = ''

# Select only common columns from each dataframe
df1_subset = df1[common_columns]
df2_subset = df2[common_columns]
df3_subset = df3_standardized[common_columns]

# Combine all three datasets
combined_df = pd.concat([df1_subset, df2_subset, df3_subset], axis=0, ignore_index=True)

def create_index(row):
    name = row['TranslatedRecipeName'] if pd.notna(row['TranslatedRecipeName']) and row['TranslatedRecipeName'] != '' else row['RecipeName']
    cuisine = row['Cuisine'] if pd.notna(row['Cuisine']) and row['Cuisine'] != '' else 'Unknown'
    return f"{name}_{cuisine}"

combined_df['UniqueID'] = combined_df.apply(create_index, axis=1)
combined_df.set_index('UniqueID', inplace=True)
combined_df = combined_df.sort_index()

# Display the combined dataset info
# print(f"Combined dataset shape: {combined_df.shape}")
# print(f"Number of recipes from df1: {len(df1)}")
# print(f"Number of recipes from df2: {len(df2)}")
# print(f"Number of recipes from df3: {len(df3)}")
# print(f"Total combined recipes: {len(combined_df)}")

# Prepare data for vectorization
data = combined_df[:].progress_apply(
    lambda x: x.to_markdown(),
    axis=1
)

# combined_df.head()

Generating Documents:   0%|          | 0/6871 [00:00<?, ?it/s]

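Next, chunk the documents with RecursiveCharacterTextSplitter and add them to the vector store. The chunking and upload code is commented out in this run, so the cell only calls `get_vector_store`.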

In [35]:
vector_store_unchunked = get_vector_store("my-fav-indian-food")
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=3000,
#     chunk_overlap=500,
#     length_function=len
# )
#
# docs = text_splitter.create_documents(data)
# try:
#     vector_store_unchunked.add_documents(docs)
# except ResponseHandlingException:
#     pass
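To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Tiny values so the overlap is visible; the commented-out cell uses 3000 and 500.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk repeats the last two characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.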

Now let's import the following helper functions:
  - generate_metadata
  - rewrite_query
  - break_query
  - rerank_results

In [36]:
from utils.rag_helpers import (
    generate_metadata,
    rewrite_query,
    break_query,
    rerank_results
)

Let's input the user query

In [37]:
user_query = "I am looking for Bengali Quick non veg dish" # input("Enter your query: ")

Next, LM Studio rewrites the query into a form better suited to retrieval, so the vector store receives a cleaner, more precise search string.

From the rewritten query we then derive two things: metadata (here, cuisine and diet), which the search step uses as a filter, and a list of subqueries that each cover one aspect of the request.
Splitting is particularly useful for compound queries such as this one, which combines cuisine, cooking speed, and diet.

Each subquery is then run against the vector store, and the retrieved documents become the context the model draws on when generating a response.
This is the core of the RAG process: the model can only answer accurately if the relevant information has been retrieved first.
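The internals of `generate_metadata` are not shown in this notebook. As a rough sketch of one part such a helper might need, the code below parses a model reply into a filter dictionary and drops fields the vector store cannot filter on. `parse_metadata_reply` and `ALLOWED_FILTER_KEYS` are hypothetical names, and the reply string is a made-up example rather than real model output.

```python
import json
import re

# Hypothetical whitelist of payload fields that are safe to filter on.
ALLOWED_FILTER_KEYS = {"Cuisine", "Diet", "Course"}


def parse_metadata_reply(reply: str) -> dict:
    """Extract a JSON object from an LLM reply and keep whitelisted, non-empty string fields."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}  # malformed JSON: fall back to "no filter"
    if not isinstance(parsed, dict):
        return {}
    return {
        key: value.strip()
        for key, value in parsed.items()
        if key in ALLOWED_FILTER_KEYS and isinstance(value, str) and value.strip()
    }


# Made-up reply for illustration; the Diet label follows the dataset's own spelling.
reply = (
    "Here is the metadata:\n"
    '{"Cuisine": "Bengali Recipes", "Diet": "Non Vegeterian", "ComplexityLevel": "Easy"}'
)
print(parse_metadata_reply(reply))
```

Returning an empty dictionary on any parse failure means a bad model reply degrades to an unfiltered search instead of raising an error.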

In [38]:
cleaned_query = rewrite_query(user_query, LM_STUDIO_MODEL)
metadata = generate_metadata(cleaned_query, LM_STUDIO_MODEL)
subqueries = break_query(cleaned_query, LM_STUDIO_MODEL)

print("Cleaned Query: ", cleaned_query)
print("Metadata: ", metadata)
print("Subqueries: ", subqueries)

rewritting query:  Bengali quick non‑vegetarian recipe


metadata: {'Cuisine': 'Bengali Recipes', 'Diet': 'Non Vegeterian'}
subqueries: ['Bengali recipes', 'quick recipes', 'non-vegetarian recipes']
Cleaned Query:  Bengali quick non‑vegetarian recipe


Metadata:  {'Cuisine': 'Bengali Recipes', 'Diet': 'Non Vegeterian'}
Subqueries:  ['Bengali recipes', 'quick recipes', 'non-vegetarian recipes']


In [39]:
# Intelligent search with safe filtering and semantic fallback
ret_docs = []

# Remove unsupported fields from metadata (e.g., ComplexityLevel)
safe_metadata = metadata.copy()
if 'ComplexityLevel' in safe_metadata:
    del safe_metadata['ComplexityLevel']

# Build Qdrant filter
filter_conditions = [
    FieldCondition(key=k, match=MatchValue(value=v))
    for k, v in safe_metadata.items() if v
]
qdrant_filter = Filter(must=filter_conditions) if filter_conditions else None

# Filtered semantic search
for subquery in subqueries:
    ret_docs += vector_store_unchunked.similarity_search_with_score(
        subquery, k=20, score_threshold=0.1, filter=qdrant_filter
    )

# Fallback to unfiltered semantic search if needed
if not ret_docs:
    fallback_docs = []
    for subquery in subqueries:
        fallback_docs += vector_store_unchunked.similarity_search_with_score(
            subquery, k=10, score_threshold=0.1
        )
    ret_docs = fallback_docs[:20] if fallback_docs else []

# Build results DataFrame and prioritize
if ret_docs:
    searched_df = pd.DataFrame([
        {
            'score': score,
            **doc.metadata,
            'page_content': doc.page_content,
        } for doc, score in ret_docs
    ])

    searched_df = searched_df.groupby('TranslatedRecipeName').first().reset_index()

    if 'Cuisine' in searched_df.columns and 'Diet' in searched_df.columns:
        bengali_nonveg = searched_df[
            (searched_df['Cuisine'].str.contains('Bengali', case=False, na=False)) &
            (searched_df['Diet'].str.contains('Non Vegeterian', case=False, na=False))
        ]
        bengali_any = searched_df[searched_df['Cuisine'].str.contains('Bengali', case=False, na=False)]
        nonveg_any = searched_df[searched_df['Diet'].str.contains('Non Vegeterian', case=False, na=False)]

        prioritized_df = pd.concat([
            bengali_nonveg,
            bengali_any[~bengali_any.index.isin(bengali_nonveg.index)],
            nonveg_any[~nonveg_any.index.isin(pd.concat([bengali_nonveg, bengali_any]).index)],
            searched_df[~searched_df.index.isin(pd.concat([bengali_nonveg, bengali_any, nonveg_any]).index)]
        ]).head(10)

        searched_df = prioritized_df

    # Display concise top results
    display_cols = ['TranslatedRecipeName', 'Cuisine', 'Diet', 'Course', 'score']
    available_cols = [c for c in display_cols if c in searched_df.columns]
    print(searched_df[available_cols].head(10).to_markdown(index=False))
else:
    searched_df = pd.DataFrame()
    print("No results found")


| TranslatedRecipeName                                              | Cuisine         | Diet                    | Course      |    score |
|:------------------------------------------------------------------|:----------------|:------------------------|:------------|---------:|
| Bengali style meat broth recipe-mutton curry                      | Bengali Recipes | Non Vegeterian          | Dinner      | 0.918324 |
| Chingiri Paturi Recipe                                            | Bengali Recipes | Non Vegeterian          | Lunch       | 0.918324 |
| Kakrar Jhal Recipe - Bengali Style Crab Curry                     | Bengali Recipes | Non Vegeterian          | Main Course | 0.918324 |
| Angoori Rasmalai Recipe                                           | Bengali Recipes | Vegetarian              | Dessert     | 0.918324 |
| Bengali Egg Curry Recipe                                          | Bengali Recipes | Eggetarian              | Lunch       | 0.918324 |
| Bengali Sita Bhog Recipe 
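The search cell above follows a "filtered search first, unfiltered fallback second" pattern. The sketch below isolates that pattern from Qdrant so it can be run on its own. Everything in it is hypothetical: `toy_search` stands in for `similarity_search_with_score`, the three-document corpus is invented, and scores are simple word-overlap counts rather than embedding similarities.

```python
from typing import Callable, Optional

# Invented toy corpus; a real run would query the vector store instead.
CORPUS = [
    {"name": "fish curry", "text": "bengali fish curry quick", "Cuisine": "Bengali", "Diet": "Non Vegetarian"},
    {"name": "sweet rice", "text": "bengali sweet rice dessert", "Cuisine": "Bengali", "Diet": "Vegetarian"},
    {"name": "lentil soup", "text": "quick lentil soup", "Cuisine": "Punjabi", "Diet": "Vegetarian"},
]


def toy_search(query: str, metadata_filter: Optional[dict]) -> list:
    """Return (name, score) pairs; score is the number of words shared with the query."""
    words = set(query.lower().split())
    results = []
    for doc in CORPUS:
        if metadata_filter and any(doc.get(k) != v for k, v in metadata_filter.items()):
            continue  # document does not satisfy the metadata filter
        overlap = len(words & set(doc["text"].split()))
        if overlap:
            results.append((doc["name"], overlap))
    return results


def search_with_fallback(search: Callable, subqueries: list, metadata_filter: Optional[dict], limit: int = 20) -> list:
    """Run every subquery with the filter; if nothing matches, retry without it."""
    hits = [hit for q in subqueries for hit in search(q, metadata_filter)]
    if not hits and metadata_filter:
        hits = [hit for q in subqueries for hit in search(q, None)]

    # Deduplicate by name, keeping the best score seen for each document.
    best = {}
    for name, score in hits:
        if name not in best or score > best[name]:
            best[name] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:limit]


subqueries = ["bengali recipes", "quick recipes"]
filtered = search_with_fallback(toy_search, subqueries, {"Cuisine": "Bengali", "Diet": "Non Vegetarian"})
fallback = search_with_fallback(toy_search, subqueries, {"Cuisine": "Goan"})
print(filtered)
print(fallback)
```

One design difference worth noting: this sketch keeps the highest score per document when deduplicating, whereas the notebook's `groupby(...).first()` keeps whichever row happens to come first.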

Next, we rerank the retrieved documents by their relevance to the user's original query, so that the most pertinent results come first when the response is generated.
The `rerank_results` function takes the embedding model, the user query, and the search results, and assigns each document a `rerank_score`.
We then sort the results by that score in descending order to refine the ordering produced by the initial search.
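The implementation of `rerank_results` is not shown here. Since it takes an embedding model and logs the shapes of the document and query embeddings, one plausible reading is that it scores each document by cosine similarity to the query embedding. The sketch below illustrates that idea under this assumption, using tiny hand-written vectors in place of real 768-dimensional embeddings; `cosine_similarity` and `rerank` are hypothetical names.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query_embedding: list, documents: list) -> list:
    """Score (label, embedding) pairs against the query and sort best-first."""
    scored = [(label, cosine_similarity(query_embedding, emb)) for label, emb in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors; real embeddings would come from the embedding model.
query_vec = [1.0, 0.0, 1.0]
docs = [
    ("doc_a", [1.0, 0.0, 1.0]),  # same direction as the query
    ("doc_b", [0.0, 1.0, 0.0]),  # orthogonal to the query
    ("doc_c", [1.0, 1.0, 0.0]),  # partially aligned
]
ranked = rerank(query_vec, docs)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because a pure embedding score only measures overall semantic closeness, it does not enforce hard constraints such as diet, which is why metadata filtering before reranking still matters.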

In [40]:
# Check if we have results to rerank
if len(ret_docs) > 0 and 'searched_df' in locals() and not searched_df.empty:
    reranked_df = rerank_results(
        LM_STUDIO_EMBEDDING_MODEL,
        user_query,
        searched_df,
    ).sort_values('rerank_score', ascending=False)

    # Concise final results
    display_cols = ['rerank_score', 'TranslatedRecipeName', 'Cuisine', 'Diet', 'Course']
    available_cols = [col for col in display_cols if col in reranked_df.columns]
    print(reranked_df[available_cols].head(10).to_markdown(index=False))

else:
    print("No results available for reranking.")

Number of documents: 10
Search query: 'I am looking for Bengali Quick non veg dish'
Document embeddings shape: (10, 768)
Query embedding shape: (768,)
|   rerank_score | TranslatedRecipeName                                              | Cuisine         | Diet                    | Course      |
|---------------:|:------------------------------------------------------------------|:----------------|:------------------------|:------------|
|       0.753471 | Bengali Sita Bhog Recipe                                          | Bengali Recipes | Vegetarian              | Dessert     |
|       0.752818 | Bengali style meat broth recipe-mutton curry                      | Bengali Recipes | Non Vegeterian          | Dinner      |
|       0.752818 | Chingiri Paturi Recipe                                            | Bengali Recipes | Non Vegeterian          | Lunch       |
|       0.752818 | Kakrar Jhal Recipe - Bengali Style Crab Curry                     | Bengali Recipes | Non Vegeterian     

# Let's Do a Chat with LM Studio
In this section, we will create a simple chatbot using LM Studio to help users find recipes from the Indian Food Dataset.
The bot will follow a specific persona and objective to ensure it provides helpful and relevant responses to user queries.
We will define the bot's persona, objective, and a prompt template that will guide the bot's responses.
The bot will be able to handle various user requests related to recipes, such as finding dishes based on dietary preferences, cooking time, and flavor profiles.
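The chat cell relies on a client named `lm_studio_client` whose creation is not shown in this notebook. A minimal sketch of how it could be set up follows, assuming the `openai` Python package is installed and LM Studio's local server is running at its default address. `LM_STUDIO_BASE_URL` is a hypothetical environment variable introduced here for illustration.

```python
import os

# Assumption: LM Studio's OpenAI-compatible server listens on its default port.
# LM_STUDIO_BASE_URL is a hypothetical variable name, not one the notebook defines.
LM_STUDIO_BASE_URL = os.getenv("LM_STUDIO_BASE_URL", "http://localhost:1234/v1")

try:
    from openai import OpenAI
except ImportError:
    OpenAI = None  # the SDK is missing; install it with `pip install openai`

# A local LM Studio server typically ignores the API key,
# but the SDK requires a non-empty value.
lm_studio_client = (
    OpenAI(base_url=LM_STUDIO_BASE_URL, api_key="lm-studio") if OpenAI else None
)
```

Creating the client does not contact the server; the first network request happens when `chat.completions.create` is called.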

In [42]:
# Simple Chat Bot using LM Studio (OpenAI-compatible client available as lm_studio_client)

chat_messages = []

llm_persona = '''
You are a helpful assistant that helps users find recipes from the Indian Food Dataset.
Your persona is polite, friendly and helpful.
You are not a chef, you are a recipe search assistant.

While trying to solve the customer's query, you can use the following information:
 - You can ask clarifying questions to understand the user's needs better.
 - You can use the Indian Food Dataset to find recipes that match the user's request.
 - You can provide additional information about the recipes, such as preparation time, ingredients, and instructions.
 - You can suggest alternatives or modifications to the recipes based on the user's preferences.

You are not allowed to:
 - Provide medical or dietary advice.
 - Make assumptions about the user's dietary restrictions or preferences without asking.
 - Provide personal opinions or recommendations that are not based on the dataset.
 - Provide recipes that are not in the Indian Food Dataset.
 - Provide recipes that are not relevant to the user's query.
'''

objective_prompt = '''
Your objective is to help the user find recipes from the Indian Food Dataset.
Your hidden side task is also to suggest, cross-sell, and/or upsell dishes.
'''

bot_prompt_template = '''
Your persona:
{llm_persona}

Your objective:
{objective_prompt}

current user message: {user_message}

Current chat history:
{chat_history}
'''

user_test_messages = [
    'I want to eat something spicy, non oily and quick to cook.',
    'I am looking for a recipe that is healthy and easy to cook.',
    'I want to eat something that is non vegetarian, quick to cook and spicy.',
    'I am looking for a recipe that is gluten free and easy to cook.',
    'I want to eat something that is diabetic friendly and quick to cook.',
    'I am looking for a recipe that is high protein vegetarian and easy to cook.',
    user_query
]

for user_message in user_test_messages:
    # Format the history before appending the new message, so the current
    # message is not repeated in both {user_message} and {chat_history}
    prompt_str = bot_prompt_template.format(
        llm_persona=llm_persona.strip(),
        objective_prompt=objective_prompt.strip(),
        user_message=user_message,
        chat_history='\n'.join(f"{m['role']}: {m['content']}" for m in chat_messages)
    )

    chat_messages.append({'role': 'user', 'content': user_message})

    # Use the OpenAI-compatible LM Studio client already initialized as lm_studio_client
    resp = lm_studio_client.chat.completions.create(
        model=LM_STUDIO_MODEL,
        messages=[
            {"role": "system", "content": "Follow the given persona and objective strictly."},
            {"role": "user", "content": prompt_str}
        ],
        temperature=0.6,
        max_tokens=400
    )
    assistant_reply = resp.choices[0].message.content
    print('response:', assistant_reply)

    chat_messages.append({'role': 'assistant', 'content': assistant_reply})


response: Sure thing! I’ll pull up a few quick‑and‑spicy dishes from the Indian Food Dataset that are low on oil and can be whipped up in no time.  
Before I do, could you let me know a couple of things?

1. **Do you prefer vegetarian or non‑vegetarian options?**  
2. **Any particular protein source you like (e.g., paneer, chicken, lentils)?**  
3. **How quick is “quick”? (e.g., under 15 min, 20‑30 min?)**

Once I have that info, I’ll suggest a handful of recipes and even offer a tasty side or dessert to round out the meal.
response: Got it! To find the perfect healthy, easy‑to‑cook recipe from our Indian Food Dataset, could you help me narrow it down a bit?

1. **Do you prefer vegetarian or non‑vegetarian dishes?**  
2. **Is there a particular protein or main ingredient you’d like to focus on (e.g., lentils, chicken, paneer, tofu)?**  
3. **How much time do you have for cooking?** (e.g., 15 min, 20‑30 min)  
4. **Any specific dietary preferences or restrictions?** (e.g., low‑fat, glut