# Recipe Search System Development Summary

## Project Overview

This notebook implements a conversational recipe search system that bridges natural language user queries with a structured PostgreSQL recipe database. The system uses Llama 3.2 for natural language understanding and maintains context across multiple conversation turns.

## Architecture Components

The system consists of several key components:

### 1. Natural Language Extraction
Three specialized prompts extract structured data from user messages:
- Recipe query extraction (query text, ingredients to include/exclude, count)
- Semantic tag extraction (time, difficulty, cuisine, diet, etc.)
- Conversation continuation detection

### 2. Database Integration
Connection to PostgreSQL database containing:
- Recipes with embeddings, ingredients arrays, and tags arrays
- Canonical ingredients with variants
- Hierarchical tags organized by groups

### 3. Search Strategy
Evolution from embedding-first to ingredients-first approach:
- Initial: Embedding search followed by filtering
- Final: Ingredients and critical tags first, then ranking by embedding similarity
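
The two-step "final" strategy above can be illustrated with a small in-memory sketch. It is not the notebook's database code: the `Recipe` dataclass, the toy data, and the rule that every included ingredient must be present are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Set


@dataclass
class Recipe:
    # Hypothetical in-memory stand-in for a database row
    name: str
    ingredients: Set[str]
    tags: Set[str]
    embedding: Sequence[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ingredients_first_search(
    recipes: List[Recipe],
    query_embedding: Sequence[float],
    include: Set[str],
    exclude: Set[str],
    required_tags: Set[str],
    limit: int = 5,
) -> List[Recipe]:
    # Step 1: hard filters on ingredients and critical tags
    candidates = [
        r for r in recipes
        if include <= r.ingredients           # assumption: all includes must be present
        and not (exclude & r.ingredients)     # no excluded ingredient may appear
        and required_tags <= r.tags           # every critical tag must be present
    ]
    # Step 2: rank only the survivors by embedding similarity
    candidates.sort(
        key=lambda r: cosine_similarity(r.embedding, query_embedding),
        reverse=True,
    )
    return candidates[:limit]


recipes = [
    Recipe("Pasta al pomodoro", {"pasta", "tomatoes", "basil"}, {"vegetarian", "italian"}, [0.9, 0.1]),
    Recipe("Pasta carbonara", {"pasta", "eggs", "bacon"}, {"italian"}, [0.8, 0.2]),
    Recipe("Tomato salad", {"tomatoes", "onions"}, {"vegetarian"}, [0.1, 0.9]),
    Recipe("Penne arrabbiata", {"pasta", "tomatoes", "chili"}, {"vegetarian", "italian"}, [0.7, 0.3]),
]
results = ingredients_first_search(
    recipes, [1.0, 0.0],
    include={"tomatoes"}, exclude={"onions"}, required_tags={"vegetarian"},
)
print([r.name for r in results])  # ['Pasta al pomodoro', 'Penne arrabbiata']
```

Filtering first means a recipe that violates a hard constraint can never outrank a valid one, which is the weakness of running the embedding search first and filtering afterwards.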

### 4. Conversation Management
Multi-turn conversation support with:
- State tracking across turns
- Ingredient list merging with conflict detection
- Reset detection for topic changes

## Key Successes

### 1. Robust Ingredient Handling
- Successfully resolves typos like "tomatos" to canonical "tomatoes"
- Expands to all variants (32 variants for tomatoes)
- Handles both inclusion and exclusion constraints
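
As a rough illustration of the typo resolution and variant expansion listed above, the sketch below uses fuzzy string matching from the standard library. The summary does not say how the notebook actually resolves typos, so `difflib` is a swapped-in technique, and the `CANONICAL_VARIANTS` table is a hypothetical three-entry stand-in (the real database reportedly holds 32 variants for tomatoes).

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical canonical-ingredient table; the real data lives in PostgreSQL
CANONICAL_VARIANTS: Dict[str, List[str]] = {
    "tomatoes": ["tomatoes", "cherry tomatoes", "tomato paste"],
    "potatoes": ["potatoes", "sweet potatoes"],
    "onions": ["onions", "red onions"],
}


def resolve_canonical(term: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a possibly misspelled term to the closest canonical name, if any."""
    matches = difflib.get_close_matches(
        term.lower().strip(), list(CANONICAL_VARIANTS), n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def expand_variants(term: str) -> List[str]:
    """Return every known variant of the canonical ingredient behind `term`."""
    canonical = resolve_canonical(term)
    return CANONICAL_VARIANTS[canonical] if canonical else []


print(resolve_canonical("tomatos"))  # tomatoes
print(expand_variants("tomatos"))
```

The expanded variant list is what an include or exclude filter would then be matched against, so that excluding "tomatoes" also excludes recipes listing "tomato paste".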

### 2. Effective Conversation Flow
- Correctly identifies continuations vs resets
- Merges ingredient updates while preserving original query
- Tracks changes and conflicts transparently

### 3. Improved Search Results
- Ingredients-first strategy found 12 relevant Italian vegetarian pasta recipes
- Respects dietary restrictions (vegetarian tag required)
- Ranks the filtered results by similarity to the query text

### 4. Schema Mapping
- Successfully maps between extraction schema and database schema
- Handles tag group differences (DIETS to DIETARY_HEALTH)
- Special handling for "quick" time constraint
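
A minimal sketch of the tag-group mapping, assuming a plain dictionary lookup. Only the `DIETS` to `DIETARY_HEALTH` rename comes from the summary; treating every other extraction field as identical to its database group name is an assumption, and the special case for "quick" is deliberately left out because its details are not described here.

```python
from typing import Dict

# Only this rename is stated in the summary; other fields are assumed to
# share their name with the database tag group.
EXTRACTION_TO_DB_GROUP: Dict[str, str] = {
    "DIETS": "DIETARY_HEALTH",
}


def map_tag_groups(extracted: Dict[str, str]) -> Dict[str, str]:
    """Rename extraction fields to database tag groups, dropping empty values."""
    mapped: Dict[str, str] = {}
    for field, phrase in extracted.items():
        if not phrase:
            continue  # the extractor uses "" for fields the user did not mention
        mapped[EXTRACTION_TO_DB_GROUP.get(field, field)] = phrase
    return mapped


print(map_tag_groups({"DIETS": "vegan", "TIME_DURATION": "quick", "SCALE": ""}))
# {'DIETARY_HEALTH': 'vegan', 'TIME_DURATION': 'quick'}
```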

## Identified Issues

### 1. Extraction Inconsistencies
- "Make it quick" did not extract TIME_DURATION
- "Mexican tacos" did not extract CUISINES_REGIONAL
- System sometimes infers ingredients not explicitly mentioned

### 2. Over-Inference Problem
- "Actually exclude nuts too" caused the extractor to also add dairy, eggs, and soy sauce, none of which the user mentioned
- This led to zero results due to over-constraining
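
One possible mitigation, which the notebook does not implement, is a post-extraction guard that discards any ingredient the model returned but the user never typed. The helper name `keep_mentioned` is hypothetical, and the whole-word match is a simplifying assumption: it would also drop legitimate extractions whose wording differs from the message (for example "nut" versus "nuts", or a corrected typo).

```python
import re
from typing import List


def keep_mentioned(ingredients: List[str], user_message: str) -> List[str]:
    """Keep only ingredients that literally appear in the user's message."""
    text = user_message.lower()
    kept: List[str] = []
    for ingredient in ingredients:
        # Whole-word, case-insensitive match; multi-word names must match as a phrase
        pattern = r"\b" + re.escape(ingredient.lower()) + r"\b"
        if re.search(pattern, text):
            kept.append(ingredient)
    return kept


extracted = ["nuts", "dairy", "eggs", "soy sauce"]
print(keep_mentioned(extracted, "Actually exclude nuts too"))  # ['nuts']
```

Applied to the failing turn, the guard leaves only "nuts" in the exclusion list, so the merged query is no longer over-constrained.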

### 3. Tag Extraction Gaps
- Some semantic fields are missed during extraction
- Inconsistent extraction between similar phrasings

## Technical Implementation

The system uses:
- Ollama with Llama 3.2 for local LLM inference
- Pydantic for structured output validation
- SentenceTransformer for embedding generation
- PostgreSQL with pgvector for similarity search
- Array-based ingredient and tag storage
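
To show how array storage and pgvector might combine in a single statement, here is a hedged sketch that only builds a parameterized query and never touches a database. The table name `recipes` and the columns `id`, `name`, `ingredients`, `tags`, and `embedding` are assumptions, as is the choice of cosine distance. `&&` (array overlap) and `@>` (array contains) are standard PostgreSQL array operators, and `<=>` is pgvector's cosine-distance operator.

```python
from typing import List, Sequence, Tuple


def build_search_query(
    include_variants: Sequence[str],
    exclude_variants: Sequence[str],
    required_tags: Sequence[str],
    query_embedding: Sequence[float],
    limit: int = 5,
) -> Tuple[str, List[object]]:
    """Return (sql, params) for a filter-then-rank search; nothing is executed."""
    conditions: List[str] = []
    params: List[object] = []
    if include_variants:
        conditions.append("ingredients && %s")        # shares at least one variant
        params.append(list(include_variants))
    if exclude_variants:
        conditions.append("NOT (ingredients && %s)")  # shares no excluded variant
        params.append(list(exclude_variants))
    if required_tags:
        conditions.append("tags @> %s")               # contains every required tag
        params.append(list(required_tags))
    where = " AND ".join(conditions) if conditions else "TRUE"

    # pgvector accepts a '[x,y,...]' text literal cast to vector
    vector_literal = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    sql = (
        "SELECT id, name, embedding <=> %s::vector AS distance "
        "FROM recipes "
        f"WHERE {where} "
        "ORDER BY distance "
        "LIMIT %s"
    )
    # Parameter order follows placeholder order: vector, filters, limit
    return sql, [vector_literal] + params + [limit]


sql, params = build_search_query(["tomatoes"], ["nuts"], ["vegetarian"], [0.1, 0.2], limit=3)
print(sql)
print(params)
```

The returned pair is shaped for `cursor.execute(sql, params)` in psycopg2, which adapts Python lists to PostgreSQL arrays; depending on the actual column types an explicit cast may still be needed.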

The final implementation provides a functional conversational recipe search system that successfully handles multi-turn interactions, though some refinement of the extraction prompts would improve consistency.

In [1]:
import json
from ollama import Client
from typing import List, Dict, Optional, Set, Tuple
from pydantic import BaseModel, Field

In [2]:
class RecipeQuery(BaseModel):
    query: str = Field(
        ..., 
        description="High-level natural language description of the desired recipe(s), excluding specific ingredient information."
    )
    include_ingredients: List[str] = Field(
        default_factory=list,
        description="Ingredients the user explicitly wants to include in the recipe."
    )
    exclude_ingredients: List[str] = Field(
        default_factory=list,
        description="Ingredients the user explicitly wants to exclude from the recipe."
    )
    count: int = Field(
        default=5,
        ge=1,
        le=10,
        description="Number of recipe results to return. If unspecified, defaults to 5."
    )


In [3]:
from pydantic import BaseModel, Field

class TagsSemanticSchema(BaseModel):
    TIME_DURATION: str = Field(default="", description="User's phrasing about how long the recipe takes")
    DIFFICULTY_SCALE: str = Field(default="", description="Phrasing that reflects the ease or complexity of the recipe")
    SCALE: str = Field(default="", description="Phrasing about how many people or the serving size")
    FREE_OF: str = Field(default="", description="What the user wants to avoid: allergens or ingredients")
    DIETS: str = Field(default="", description="Dietary or health style mentioned (e.g. vegan, diabetic)")
    CUISINES_REGIONAL: str = Field(default="", description="Cultural or regional cuisine preference")
    MEAL_COURSES: str = Field(default="", description="Meal type or time (e.g. lunch, snack)")
    PREPARATION_METHOD: str = Field(default="", description="How the food is to be cooked or prepared")

In [4]:
system_prompt_main = """
You are a structured recipe query extractor. Convert natural language into a structured format.

OUTPUT FORMAT:
{
  "query": string,               // Natural language summary WITHOUT ingredients
  "include_ingredients": [list], // ONLY explicitly mentioned ingredients to include
  "exclude_ingredients": [list], // ONLY explicitly mentioned ingredients to exclude  
  "count": integer              // 1-10, default 5 if not specified
}

EXTRACTION RULES:

1. QUERY FIELD:
   - Summarize the recipe type, cuisine, meal, or style
   - NEVER include specific ingredients in the query
   - Keep it concise and descriptive
   
2. INGREDIENTS:
   - ONLY add ingredients that are EXPLICITLY mentioned
   - Never infer, assume, or add related ingredients
   - "Vegan" ≠ exclude meat (unless user says "no meat")
   - "Healthy" ≠ exclude sugar (unless user says "no sugar")

3. DECISION FLOW:
   ```
   User mentions ingredient?
   ├─ YES → Says "with/include"? → include_ingredients
   │        Says "without/no/exclude"? → exclude_ingredients
   └─ NO → Leave empty
   ```

4. COUNT:
   - Look for numbers + "recipes/dishes/options"
   - Default: 5, Min: 1, Max: 10

EXAMPLES BY COMPLEXITY:

SIMPLE REQUESTS
Input: "Mexican food"
Output: {
  "query": "Mexican recipes",
  "include_ingredients": [],
  "exclude_ingredients": [],
  "count": 5
}

Input: "Pasta with tomatoes"
Output: {
  "query": "Pasta recipes",
  "include_ingredients": ["tomatoes"],
  "exclude_ingredients": [],
  "count": 5
}

CONSTRAINT REQUESTS 
Input: "7 vegetarian breakfast recipes with cheese but no eggs"
Output: {
  "query": "Vegetarian breakfast recipes",
  "include_ingredients": ["cheese"],
  "exclude_ingredients": ["eggs"],
  "count": 7
}


Input: "I want a quick vegan breakfast without eggs"
Output: {
  "query": "Quick vegan breakfast recipes",
  "include_ingredients": [],
  "exclude_ingredients": ["eggs"],
  "count": 5
}


Input: "I'm allergic to nuts and dairy, need 3 dinner ideas"
Output: {
  "query": "Dinner recipes",
  "include_ingredients": [],
  "exclude_ingredients": ["nuts", "dairy"],
  "count": 3
}

COMPLEX/AMBIGUOUS REQUESTS 
Input: "Something healthy for lunch"
Output: {
  "query": "Healthy lunch recipes",
  "include_ingredients": [],
  "exclude_ingredients": [],
  "count": 5
}
Note: "Healthy" is descriptive, not an ingredient constraint

Input: "Asian fusion with shrimp, no soy sauce, make it spicy"
Output: {
  "query": "Spicy Asian fusion recipes",
  "include_ingredients": ["shrimp"],
  "exclude_ingredients": ["soy sauce"],
  "count": 5
}
Note: "Spicy" describes style, not an ingredient

Input: "I love mushrooms and garlic but hate onions, looking for comfort food"
Output: {
  "query": "Comfort food recipes",
  "include_ingredients": ["mushrooms", "garlic"],
  "exclude_ingredients": ["onions"],
  "count": 5
}

EDGE CASES 
Input: "No strawberries"
Output: {
  "query": "Recipes",
  "include_ingredients": [],
  "exclude_ingredients": ["strawberries"],
  "count": 5
}

Input: "Just add bacon to the previous"
Output: {
  "query": "Recipes",
  "include_ingredients": ["bacon"],
  "exclude_ingredients": [],
  "count": 5
}
Note: Context-dependent, extract only what's in current message

COMMON MISTAKES TO AVOID:
- Adding "tofu" because user said "vegan"
- Excluding "gluten" because user said "healthy"  
- Including "oil" when user says "fried" (technique, not ingredient)
- Adding ingredients from previous context

REMEMBER: When in doubt, be conservative. Only extract what is EXPLICITLY stated.
""".strip()

In [5]:

system_prompt_reset = """
You are a conversation continuity analyzer for a recipe assistant.

You will receive:
1. Conversation history (all previous messages)
2. Current message

Determine if the current message continues the existing conversation or starts a completely new recipe search.

OUTPUT FORMAT:
{
  "continue": boolean,
  "reason": string
}

CONTINUE = true when:
- User modifies/refines the existing search (add/remove ingredients, change count)
- User asks about variations of the same recipe theme
- User references previous messages ("that", "those", "the previous")
- User corrects themselves ("sorry, I meant...")

CONTINUE = false when:
- User asks about completely different cuisine/meal type
- User uses reset phrases ("forget that", "let's start over", "new search")
- User switches to unrelated recipe category (breakfast → dinner, dessert → main course)
- Topic has no connection to previous messages

Return only the JSON object.
""".strip()


In [6]:
system_prompt_semantic_fields = """
Extract recipe-related attributes from user messages. Return ONLY what is explicitly mentioned.

OUTPUT FORMAT:
{
  "TIME_DURATION": "",
  "DIFFICULTY_SCALE": "",
  "SCALE": "",
  "FREE_OF": "",
  "DIETS": "",
  "CUISINES_REGIONAL": "",
  "MEAL_COURSES": "",
  "PREPARATION_METHOD": ""
}

EXTRACTION RULES:
- Extract EXACT user phrases, not standardized terms
- Leave empty ("") if not mentioned
- Do NOT infer or add related concepts
- Keep user's original wording

FIELD DEFINITIONS:

TIME_DURATION: Cooking/prep time mentions
- "quick", "30 minutes", "overnight", "fast"
- NOT ingredients or methods

DIFFICULTY_SCALE: Skill level or simplicity
- "easy", "simple", "beginner", "few ingredients"
- NOT time-related terms

SCALE: Serving size or portions
- "for 2", "family dinner", "meal prep", "party"
- NOT difficulty or method

FREE_OF: Explicit exclusions for allergies/intolerances
- "gluten-free", "nut-free", "no dairy"
- NOT general dietary preferences (those go in DIETS)

DIETS: Named dietary patterns
- "vegan", "keto", "paleo", "vegetarian"
- NOT individual exclusions (those go in FREE_OF)

CUISINES_REGIONAL: Geographic or cultural origins
- "Italian", "Thai", "Southern", "Mediterranean"
- NOT meal types or methods

MEAL_COURSES: When/what type of meal
- "breakfast", "appetizer", "dessert", "lunch"
- NOT cuisines or ingredients

PREPARATION_METHOD: How it's cooked
- "grilled", "baked", "no-cook", "slow cooker"
- NOT difficulty or time

PRIORITY RULES:
1. "vegan" → DIETS (not FREE_OF)
2. "quick" → TIME_DURATION (not DIFFICULTY_SCALE)
3. "gluten-free" → FREE_OF (not DIETS)
4. Time + difficulty mentioned → assign each appropriately

EXAMPLES:

Input: "Quick Italian dinner for 2"
{
  "TIME_DURATION": "quick",
  "DIFFICULTY_SCALE": "",
  "SCALE": "for 2",
  "FREE_OF": "",
  "DIETS": "",
  "CUISINES_REGIONAL": "Italian",
  "MEAL_COURSES": "dinner",
  "PREPARATION_METHOD": ""
}

Input: "Easy vegan breakfast, no nuts, grilled"
{
  "TIME_DURATION": "",
  "DIFFICULTY_SCALE": "easy",
  "SCALE": "",
  "FREE_OF": "no nuts",
  "DIETS": "vegan",
  "CUISINES_REGIONAL": "",
  "MEAL_COURSES": "breakfast",
  "PREPARATION_METHOD": "grilled"
}

Input: "I want pasta with tomatoes"
{
  "TIME_DURATION": "",
  "DIFFICULTY_SCALE": "",
  "SCALE": "",
  "FREE_OF": "",
  "DIETS": "",
  "CUISINES_REGIONAL": "",
  "MEAL_COURSES": "",
  "PREPARATION_METHOD": ""
}
Note: "pasta" and "tomatoes" are ingredients, not semantic fields

Input: "30-minute gluten-free Asian stir-fry"
{
  "TIME_DURATION": "30-minute",
  "DIFFICULTY_SCALE": "",
  "SCALE": "",
  "FREE_OF": "gluten-free",
  "DIETS": "",
  "CUISINES_REGIONAL": "Asian",
  "MEAL_COURSES": "",
  "PREPARATION_METHOD": "stir-fry"
}
""".strip()

In [7]:
# Test the main prompt
client = Client()
test_messages = [
    "Something healthy for lunch",
    "I want 7 vegetarian breakfast recipes with cheese but no eggs",
    "Asian fusion with shrimp, no soy sauce, make it spicy",
    "Just add bacon"
]

for msg in test_messages:
    response = client.chat(
        model='llama3.2:latest',
        messages=[
            {"role": "system", "content": system_prompt_main},
            {"role": "user", "content": msg}
        ],
        options={'temperature': 0.0},
        format=RecipeQuery.model_json_schema()
    )
    print(f"Input: {msg}")
    print(f"Output: {response['message']['content']}\n")

Input: Something healthy for lunch
Output: {
  "query": "Healthy lunch recipes",
  "include_ingredients": [],
  "exclude_ingredients": [],
  "count": 5
}

Input: I want 7 vegetarian breakfast recipes with cheese but no eggs
Output: {
  "query": "Vegetarian breakfast recipes",
  "include_ingredients": ["cheese"],
  "exclude_ingredients": ["eggs"],
  "count": 7
}

Input: Asian fusion with shrimp, no soy sauce, make it spicy
Output: {
  "query": "Spicy Asian fusion recipes",
  "include_ingredients": ["shrimp"],
  "exclude_ingredients": ["soy sauce"],
  "count": 5
}

Input: Just add bacon
Output: {
  "query": "Recipes",
  "include_ingredients": ["bacon"],
  "exclude_ingredients": [],
  "count": 5
}



In [8]:
def check_continue(conversation_history: List[str], current_message: str) -> tuple[bool, str]:
    """
    Check if current message continues from conversation history.
    
    Args:
        conversation_history: List of all previous messages in order
        current_message: The new message to evaluate
        
    Returns:
        tuple: (should_continue, reason)
    """
    # Format history for the prompt
    history_text = "\n".join([f"Message {i+1}: {msg}" for i, msg in enumerate(conversation_history)])
    
    prompt = f"""Conversation history:
{history_text}

Current message: {current_message}"""
    
    response = client.chat(
        model='llama3.2:latest',
        format='json',
        messages=[
            {"role": "system", "content": system_prompt_reset},
            {"role": "user", "content": prompt}
        ],
        options={'temperature': 0.0}
    )
    
    result = json.loads(response['message']['content'])
    should_continue = result.get("continue", False)
    reason = result.get("reason", "")
    
    return should_continue, reason

In [9]:
test_conversations = [
    {
        "history": ["I want Italian pasta recipes", "Add mushrooms"],
        "current": "Make it creamy too",
        "expected": True
    },
    {
        "history": ["Show me vegan breakfast ideas", "Add tofu", "Remove nuts"],
        "current": "Actually, let's look at Mexican dinner recipes",
        "expected": False
    },
    {
        "history": ["Healthy salads"],
        "current": "Something with quinoa",
        "expected": True
    },
    {
        "history": ["Thai curry recipes", "Make it spicy", "Add coconut milk"],
        "current": "I need dessert ideas",
        "expected": False
    }
]

print("Testing simplified reset detection:\n")
for test in test_conversations:
    should_continue, reason = check_continue(test["history"], test["current"])
    print(f"History: {test['history']}")
    print(f"Current: '{test['current']}'")
    print(f"Continue: {should_continue} (expected: {test['expected']})")
    print(f"Reason: {reason}\n")

Testing simplified reset detection:

History: ['I want Italian pasta recipes', 'Add mushrooms']
Current: 'Make it creamy too'
Continue: True (expected: True)
Reason: User is refining the existing search by adding a new ingredient ('creamy') to the previous request for Italian pasta recipes with mushrooms.

History: ['Show me vegan breakfast ideas', 'Add tofu', 'Remove nuts']
Current: 'Actually, let's look at Mexican dinner recipes'
Continue: False (expected: False)
Reason: completely different cuisine/meal type

History: ['Healthy salads']
Current: 'Something with quinoa'
Continue: True (expected: True)
Reason: User references a key ingredient from the previous conversation

History: ['Thai curry recipes', 'Make it spicy', 'Add coconut milk']
Current: 'I need dessert ideas'
Continue: False (expected: False)
Reason: completely different cuisine/meal type
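
`check_continue` trusts whatever `json.loads` returns for the `"continue"` key, so a string such as `"false"` would be truthy in an `if` test. The sketch below is one way to make that parsing stricter; `parse_continue_decision` is a hypothetical helper, not part of the notebook, and falling back to a reset (`False`) on bad output is an assumed policy.

```python
import json
from typing import Tuple


def parse_continue_decision(raw: str, default: bool = False) -> Tuple[bool, str]:
    """Parse the model's JSON reply into (should_continue, reason), strictly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return default, "unparseable model output"
    if not isinstance(data, dict):
        return default, "model output was not a JSON object"

    value = data.get("continue")
    # Tolerate the common case of a boolean returned as a string
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "false"):
            value = lowered == "true"
    if not isinstance(value, bool):
        return default, "missing or non-boolean 'continue' field"
    return value, str(data.get("reason", ""))


print(parse_continue_decision('{"continue": true, "reason": "refines search"}'))
# (True, 'refines search')
print(parse_continue_decision('{"continue": "false"}'))
# (False, '')
```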



In [10]:
# Simple test for semantic fields extraction
test_inputs = [
    "Quick Italian dinner for 2",
    "Easy vegan breakfast, no nuts, grilled",
    "I want pasta with tomatoes",
    "30-minute gluten-free Asian stir-fry",
    "Healthy keto lunch that's beginner-friendly",
    "Mexican street food style tacos",
    "Simple one-pot family meal",
    "Fancy French dessert that takes overnight to prepare"
]

print("Testing semantic fields extraction with improved prompt:\n")


for input_text in test_inputs:
    response = client.chat(
        model='llama3.2:latest',
        format=TagsSemanticSchema.model_json_schema(),
        messages=[
            {"role": "system", "content": system_prompt_semantic_fields},
            {"role": "user", "content": input_text}
        ],
        options={'temperature': 0.0}
    )
    
    tags = json.loads(response['message']['content'])
    
    print(f"\nInput: '{input_text}'")
    print("Extracted:")
    
    # Only show non-empty fields
    for field, value in tags.items():
        if value:
            print(f"  {field}: '{value}'")
    
    if not any(tags.values()):
        print("  (no semantic fields detected)")
    

Testing semantic fields extraction with improved prompt:


Input: 'Quick Italian dinner for 2'
Extracted:
  TIME_DURATION: 'quick'
  SCALE: 'for 2'
  CUISINES_REGIONAL: 'Italian'
  MEAL_COURSES: 'dinner'

Input: 'Easy vegan breakfast, no nuts, grilled'
Extracted:
  DIFFICULTY_SCALE: 'easy'
  FREE_OF: 'no nuts'
  DIETS: 'vegan'
  MEAL_COURSES: 'breakfast'
  PREPARATION_METHOD: 'grilled'

Input: 'I want pasta with tomatoes'
Extracted:
  (no semantic fields detected)

Input: '30-minute gluten-free Asian stir-fry'
Extracted:
  TIME_DURATION: '30-minute'
  FREE_OF: 'gluten-free'
  CUISINES_REGIONAL: 'Asian'

Input: 'Healthy keto lunch that's beginner-friendly'
Extracted:
  DIFFICULTY_SCALE: 'beginner'
  DIETS: 'keto'
  MEAL_COURSES: 'lunch'

Input: 'Mexican street food style tacos'
Extracted:
  CUISINES_REGIONAL: 'Mexican'

Input: 'Simple one-pot family meal'
Extracted:
  (no semantic fields detected)

Input: 'Fancy French dessert that takes overnight to prepare'
Extracted:
  TIME_DURATION:

In [11]:
from dataclasses import dataclass
from typing import List, Set, Dict
from copy import deepcopy
@dataclass
class MergeResult:
    """Result of merging two recipe queries"""
    merged_query: RecipeQuery
    changes: Dict[str, List[str]]
    conflicts: List[str]

def merge_recipe_query_v2(base: RecipeQuery, patch: RecipeQuery) -> MergeResult:
    """
    Merge patch into base with change tracking.
    
    Returns:
        MergeResult containing merged query, changes made, and any conflicts
    """
    updated = deepcopy(base)
    changes = {
        "added_includes": [],
        "removed_includes": [],
        "added_excludes": [],
        "removed_excludes": [],
        "count_changed": None
    }
    conflicts = []
    
    # Convert to sets for operations
    inc = set(updated.include_ingredients)
    exc = set(updated.exclude_ingredients)
    p_inc = set(patch.include_ingredients)
    p_exc = set(patch.exclude_ingredients)
    
    # Track new includes
    new_includes = p_inc - inc
    changes["added_includes"] = list(new_includes)
    
    # Track new excludes
    new_excludes = p_exc - exc
    changes["added_excludes"] = list(new_excludes)
    
    # Find conflicts: ingredients moving from include to exclude
    moving_to_exclude = inc.intersection(p_exc)
    if moving_to_exclude:
        conflicts.extend([f"'{ing}' moved from include to exclude" for ing in moving_to_exclude])
        changes["removed_includes"].extend(list(moving_to_exclude))
    
    # Find reverse conflicts: ingredients moving from exclude to include
    moving_to_include = exc.intersection(p_inc)
    if moving_to_include:
        conflicts.extend([f"'{ing}' moved from exclude to include" for ing in moving_to_include])
        changes["removed_excludes"].extend(list(moving_to_include))
    
    # Apply changes
    inc.update(p_inc)
    inc.difference_update(p_exc)  # Remove newly excluded
    exc.update(p_exc)
    exc.difference_update(p_inc)  # Remove newly included
    
    updated.include_ingredients = sorted(inc)
    updated.exclude_ingredients = sorted(exc)
    
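    # Track count change. A patch count of 5 is treated as "not specified",
    # since 5 is the RecipeQuery default, so an explicit request to go back
    # to 5 results cannot be distinguished from no request and is ignored.
    if patch.count != 5 and patch.count != base.count: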
        changes["count_changed"] = f"{base.count} → {patch.count}"
        updated.count = patch.count
    
    # Keep original query (as before)
    # updated.query stays the same
    
    return MergeResult(
        merged_query=updated,
        changes=changes,
        conflicts=conflicts
    )

# Test the improved merge logic
def test_merge_improvements():
    print("Testing improved merge logic with change tracking:\n")
    print("="*60)
    
    # Test 1: Simple addition
    print("\nTest 1: Adding ingredients")
    base = RecipeQuery(
        query="Italian pasta",
        include_ingredients=["tomatoes"],
        exclude_ingredients=["nuts"],
        count=5
    )
    patch = RecipeQuery(
        query="",
        include_ingredients=["basil", "garlic"],
        exclude_ingredients=["dairy"],
        count=5
    )
    
    result = merge_recipe_query_v2(base, patch)
    print(f"Base: includes={base.include_ingredients}, excludes={base.exclude_ingredients}")
    print(f"Patch: includes={patch.include_ingredients}, excludes={patch.exclude_ingredients}")
    print(f"Result: includes={result.merged_query.include_ingredients}, excludes={result.merged_query.exclude_ingredients}")
    print(f"Changes: {result.changes}")
    print(f"Conflicts: {result.conflicts}")
    
    # Test 2: Conflict - moving from include to exclude
    print("\n" + "-"*40)
    print("\nTest 2: Conflict - ingredient moves from include to exclude")
    base = RecipeQuery(
        query="Salad recipes",
        include_ingredients=["avocado", "tomatoes"],
        exclude_ingredients=["onions"],
        count=5
    )
    patch = RecipeQuery(
        query="",
        include_ingredients=["cucumber"],
        exclude_ingredients=["avocado"],  # Was included, now excluded
        count=7
    )
    
    result = merge_recipe_query_v2(base, patch)
    print(f"Base: includes={base.include_ingredients}")
    print(f"Patch: excludes={patch.exclude_ingredients}")
    print(f"Result: includes={result.merged_query.include_ingredients}, excludes={result.merged_query.exclude_ingredients}")
    print(f"Changes: {result.changes}")
    print(f"Conflicts: {result.conflicts}")
    
    # Test 3: Multiple changes
    print("\n" + "-"*40)
    print("\nTest 3: Multiple changes including count")
    base = RecipeQuery(
        query="Mexican dinner",
        include_ingredients=["beans", "cheese"],
        exclude_ingredients=["meat"],
        count=5
    )
    patch = RecipeQuery(
        query="",
        include_ingredients=["rice", "meat"],  # meat was excluded!
        exclude_ingredients=["beans", "gluten"],  # beans was included!
        count=8
    )
    
    result = merge_recipe_query_v2(base, patch)
    print(f"Result: includes={result.merged_query.include_ingredients}, excludes={result.merged_query.exclude_ingredients}")
    print(f"Count: {result.merged_query.count}")
    print(f"Changes: {result.changes}")
    print(f"Conflicts: {result.conflicts}")

# Run the test
test_merge_improvements()

Testing improved merge logic with change tracking:


Test 1: Adding ingredients
Base: includes=['tomatoes'], excludes=['nuts']
Patch: includes=['basil', 'garlic'], excludes=['dairy']
Result: includes=['basil', 'garlic', 'tomatoes'], excludes=['dairy', 'nuts']
Changes: {'added_includes': ['garlic', 'basil'], 'removed_includes': [], 'added_excludes': ['dairy'], 'removed_excludes': [], 'count_changed': None}
Conflicts: []

----------------------------------------

Test 2: Conflict - ingredient moves from include to exclude
Base: includes=['avocado', 'tomatoes']
Patch: excludes=['avocado']
Result: includes=['cucumber', 'tomatoes'], excludes=['avocado', 'onions']
Changes: {'added_includes': ['cucumber'], 'removed_includes': ['avocado'], 'added_excludes': ['avocado'], 'removed_excludes': [], 'count_changed': '5 → 7'}
Conflicts: ["'avocado' moved from include to exclude"]

----------------------------------------

Test 3: Multiple changes including count
Result: includes=['cheese', 'meat', 'r

In [12]:


response = client.chat(
    model='llama3.2:latest',
    format=TagsSemanticSchema.model_json_schema(),
    messages=[
        {"role": "system", "content": system_prompt_semantic_fields},
        {"role": "user", "content": "Vegan mexican breakfast that will be ready under 1 hour , i want to grill it , i dont want beans"}
    ]
)

tags = json.loads(response['message']['content']) 
print(tags)

{'TIME_DURATION': 'under 1 hour', 'DIFFICULTY_SCALE': '', 'SCALE': '', 'FREE_OF': '', 'DIETS': 'vegan', 'CUISINES_REGIONAL': '', 'MEAL_COURSES': '', 'PREPARATION_METHOD': 'grilled'}


In [13]:
turn_counter = 0
current_recipe_query: RecipeQuery | None = None
current_tags: TagsSemanticSchema | None = None
conversation_history: List[str] = []

In [14]:
from typing import Any, Union

def coerce_recipe(obj: Union[str, dict, "RecipeQuery"]) -> "RecipeQuery":
    """Always return a RecipeQuery instance."""
    if isinstance(obj, RecipeQuery):
        return obj
    if isinstance(obj, str):
        return RecipeQuery.model_validate_json(obj)
    if isinstance(obj, dict):
        return RecipeQuery.model_validate(obj)
    raise TypeError(f"Cannot coerce {type(obj)} to RecipeQuery")


def coerce_tags(obj: Union[str, dict, "TagsSemanticSchema"]) -> "TagsSemanticSchema":
    if isinstance(obj, TagsSemanticSchema):
        return obj
    if isinstance(obj, str):
        return TagsSemanticSchema.model_validate_json(obj)
    if isinstance(obj, dict):
        return TagsSemanticSchema.model_validate(obj)
    raise TypeError(f"Cannot coerce {type(obj)} to TagsSemanticSchema")


def extract_recipe_query(msg: str) -> "RecipeQuery":
    resp = client.chat(
        model="llama3.2:latest",
        format=RecipeQuery.model_json_schema(),
        messages=[
            {"role": "system", "content": system_prompt_main},
            {"role": "user",   "content": msg},
            
        ],
        options={'temperature': 0.0},
    )
    return coerce_recipe(resp["message"]["content"])


def extract_tags(msg: str) -> "TagsSemanticSchema":
    resp = client.chat(
        model="llama3.2:latest",
        format=TagsSemanticSchema.model_json_schema(),
        messages=[
            {"role": "system", "content": system_prompt_semantic_fields},
            {"role": "user",   "content": msg}
        ],
        options={'temperature': 0.0},
    )
    return coerce_tags(resp["message"]["content"])

In [15]:
def process_user_message(user_message: str) -> None:
    global turn_counter, current_recipe_query, current_tags, conversation_history

    # FIRST TURN – extract everything and store
    if turn_counter == 0:
        conversation_history.append(user_message)
        current_recipe_query = extract_recipe_query(user_message)
        current_tags = extract_tags(user_message)
        turn_counter = 1
        
        print(" INITIAL STATE")
        print(f"Query: {current_recipe_query.model_dump()}")
        print(f"Tags: {current_tags.model_dump()}")
        print(f"History: {conversation_history}\n")
        return

    # CONTINUATION / RESET decision using full history
    is_cont, reason = check_continue(conversation_history, user_message)
    print(f" Continue decision: {is_cont} - Reason: {reason}")

    if is_cont:
        # CONTINUATION: Only merge ingredients and count, keep original query and tags
        conversation_history.append(user_message)
        
        # Extract patch (only ingredients and count matter)
        patch_query = extract_recipe_query(user_message)
        
        # Merge with tracking
        merge_result = merge_recipe_query_v2(current_recipe_query, patch_query)
        current_recipe_query = merge_result.merged_query
        
        print(" CONTINUATION STATE")
        print(f"Query: {current_recipe_query.model_dump()}")
        print(f"Tags: {current_tags.model_dump()}")  # Tags unchanged
        
        # Show what changed
        if any(merge_result.changes.values()):
            print(" Changes made:")
            if merge_result.changes["added_includes"]:
                print(f" Added to includes: {merge_result.changes['added_includes']}")
            if merge_result.changes["removed_includes"]:
                print(f" Removed from includes: {merge_result.changes['removed_includes']}")
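            if merge_result.changes["added_excludes"]:
                print(f" Added to excludes: {merge_result.changes['added_excludes']}")
            if merge_result.changes["removed_excludes"]:
                print(f" Removed from excludes: {merge_result.changes['removed_excludes']}")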
            if merge_result.changes["count_changed"]:
                print(f" Count changed: {merge_result.changes['count_changed']}")
        
        if merge_result.conflicts:
            print(" Conflicts resolved:")
            for conflict in merge_result.conflicts:
                print(f"  - {conflict}")
        
        print(f"History: {conversation_history}\n")
        
    else:
        # RESET: Start fresh with new query and tags
        conversation_history = [user_message]  # Reset history
        current_recipe_query = extract_recipe_query(user_message)
        current_tags = extract_tags(user_message)
        turn_counter = 0  # the increment at the end of the function makes this 1
        
        print("RESET STATE")
        print(f"Query: {current_recipe_query.model_dump()}")
        print(f"Tags: {current_tags.model_dump()}")
        print(f"History: {conversation_history}\n")

    turn_counter += 1

In [16]:
process_user_message("I want a quick vegan breakfast without eggs")
process_user_message("Now add corn and i dont like tofu , and show me 7 recipes")
process_user_message("Give me some Japanese dinners for two people")

 INITIAL STATE
Query: {'query': 'Quick vegan breakfast recipes', 'include_ingredients': [], 'exclude_ingredients': ['eggs'], 'count': 5}
Tags: {'TIME_DURATION': 'quick', 'DIFFICULTY_SCALE': '', 'SCALE': '', 'FREE_OF': '', 'DIETS': 'vegan', 'CUISINES_REGIONAL': '', 'MEAL_COURSES': 'breakfast', 'PREPARATION_METHOD': ''}
History: ['I want a quick vegan breakfast without eggs']

 Continue decision: True - Reason: User modifies/refines the existing search (add/remove ingredients)
 CONTINUATION STATE
Query: {'query': 'Quick vegan breakfast recipes', 'include_ingredients': ['corn'], 'exclude_ingredients': ['eggs', 'tofu'], 'count': 7}
Tags: {'TIME_DURATION': 'quick', 'DIFFICULTY_SCALE': '', 'SCALE': '', 'FREE_OF': '', 'DIETS': 'vegan', 'CUISINES_REGIONAL': '', 'MEAL_COURSES': 'breakfast', 'PREPARATION_METHOD': ''}
 Changes made:
 Added to includes: ['corn']
 Added to excludes: ['tofu']
 Count changed: 5 → 7
History: ['I want a quick vegan breakfast without eggs', 'Now add corn and i dont like

In [17]:
# Test the extraction directly
test_inputs = [
    "I want a quick vegan breakfast without eggs",
    "Mexican food with beans but no cheese",
    "Pasta without gluten"
]

for msg in test_inputs:
    result = extract_recipe_query(msg)
    print(f"Input: '{msg}'")
    print(f"Include: {result.include_ingredients}")
    print(f"Exclude: {result.exclude_ingredients}\n")

Input: 'I want a quick vegan breakfast without eggs'
Include: []
Exclude: ['eggs']

Input: 'Mexican food with beans but no cheese'
Include: ['beans']
Exclude: ['cheese']

Input: 'Pasta without gluten'
Include: []
Exclude: ['gluten']



In [18]:
# Reset everything for another test
turn_counter = 0
current_recipe_query = None
current_tags = None
conversation_history = []

# Running the test messages
process_user_message("I want a quick vegan breakfast without eggs")
process_user_message("Now add corn and i dont like tofu, and show me 7 recipes")
process_user_message("Give me some Japanese dinners for two people")

 INITIAL STATE
Query: {'query': 'Quick vegan breakfast recipes', 'include_ingredients': [], 'exclude_ingredients': ['eggs'], 'count': 5}
Tags: {'TIME_DURATION': 'quick', 'DIFFICULTY_SCALE': '', 'SCALE': '', 'FREE_OF': '', 'DIETS': 'vegan', 'CUISINES_REGIONAL': '', 'MEAL_COURSES': 'breakfast', 'PREPARATION_METHOD': ''}
History: ['I want a quick vegan breakfast without eggs']

 Continue decision: True - Reason: User modifies/refines the existing search (add/remove ingredients)
 CONTINUATION STATE
Query: {'query': 'Quick vegan breakfast recipes', 'include_ingredients': ['corn'], 'exclude_ingredients': ['eggs', 'tofu'], 'count': 7}
Tags: {'TIME_DURATION': 'quick', 'DIFFICULTY_SCALE': '', 'SCALE': '', 'FREE_OF': '', 'DIETS': 'vegan', 'CUISINES_REGIONAL': '', 'MEAL_COURSES': 'breakfast', 'PREPARATION_METHOD': ''}
 Changes made:
 Added to includes: ['corn']
 Added to excludes: ['tofu']
 Count changed: 5 → 7
History: ['I want a quick vegan breakfast without eggs', 'Now add corn and i dont like
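
The "Changes made" log above implies a merge step that folds a follow-up turn's ingredients into the running state. The notebook's actual merge code is not shown in this section, so the following is only a minimal sketch of how such a merge could work. The helper name `merge_ingredient_lists`, its return shape, and the "newest instruction wins" conflict rule are all assumptions.

```python
from typing import Dict, List, Tuple


def merge_ingredient_lists(
    current_include: List[str],
    current_exclude: List[str],
    new_include: List[str],
    new_exclude: List[str],
) -> Tuple[List[str], List[str], Dict[str, List[str]]]:
    """Merge a follow-up turn's ingredients into the running state (hypothetical helper)."""
    include = list(current_include)
    exclude = list(current_exclude)
    changes: Dict[str, List[str]] = {
        "added_includes": [],
        "added_excludes": [],
        "conflicts": [],
    }

    for ing in new_include:
        if ing in exclude:
            # Previously excluded, now requested: assume the newest instruction wins.
            exclude.remove(ing)
            changes["conflicts"].append(ing)
        if ing not in include:
            include.append(ing)
            changes["added_includes"].append(ing)

    for ing in new_exclude:
        if ing in include:
            # Previously requested, now excluded: again the newest instruction wins.
            include.remove(ing)
            changes["conflicts"].append(ing)
        if ing not in exclude:
            exclude.append(ing)
            changes["added_excludes"].append(ing)

    return include, exclude, changes


# Replays the second turn from the run above.
include, exclude, changes = merge_ingredient_lists([], ["eggs"], ["corn"], ["tofu"])
print(include, exclude, changes)
```

Recording conflicts separately from additions lets the caller report them to the user rather than resolving them silently.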

In [19]:
from sentence_transformers import SentenceTransformer

# Shared sentence-embedding model used by the recipe, ingredient, and tag searches below
model = SentenceTransformer("all-MiniLM-L6-v2")




In [20]:
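import os

import psycopg2
import numpy as np

# Database connection
def get_db_connection():
    conn = psycopg2.connect(
        dbname="recipes_db",
        user="postgres",
        # Read the password from the standard libpq PGPASSWORD environment
        # variable rather than hard-coding it in the notebook
        password=os.environ.get("PGPASSWORD"),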
        host="localhost",
        port=5432,
        options='-c client_encoding=utf8'
    )
    conn.set_client_encoding('UTF8')
    return conn

In [21]:
def search_recipes_by_embedding(query_text: str, limit: int = 20, similarity_threshold: float = 0.5) -> List[Dict]:
    """Search recipes using embedding similarity."""
    query_embedding = model.encode(query_text)
    
    conn = get_db_connection()
    cur = conn.cursor()
    
    try:
        embedding_list = query_embedding.tolist()
        
        query = """
        SELECT 
            id as recipe_id,
            name,
            1 - (embedding <=> %s::vector) as similarity
        FROM recipes
        WHERE 1 - (embedding <=> %s::vector) > %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """
        
        cur.execute(query, (embedding_list, embedding_list, similarity_threshold, embedding_list, limit))
        
        results = []
        for row in cur.fetchall():
            results.append({
                'recipe_id': row[0],
                'name': row[1],
                'similarity': float(row[2])
            })
        
        return results
    
    finally:
        cur.close()
        conn.close()

        
# Test the embedding search
test_queries = [
    "Italian pasta dishes",
    "Quick breakfast ideas",
    "Healthy vegetarian dinner",
    "Mexican street food"
]

print("Testing Embedding Search:")
print("-"*50)
for test_query in test_queries:
    results = search_recipes_by_embedding(test_query, limit=5)
    print(f"\nQuery: '{test_query}'")
    print(f"Found {len(results)} recipes:")
    for i, recipe in enumerate(results[:3], 1):
        print(f"  {i}. {recipe['name']} (similarity: {recipe['similarity']:.3f})")

Testing Embedding Search:
--------------------------------------------------


Query: 'Italian pasta dishes'
Found 5 recipes:
  1. Macaroni and Cheese Italia (similarity: 0.760)
  2. Italian Spaghetti Sauce (similarity: 0.747)
  3. Casserole Italiano (similarity: 0.741)

Query: 'Quick breakfast ideas'
Found 5 recipes:
  1. Quick Start Breakfast Drink (similarity: 0.796)
  2. Snappy Eggs (similarity: 0.774)
  3. Easy Breakfast Pull Apart (similarity: 0.753)

Query: 'Healthy vegetarian dinner'
Found 5 recipes:
  1. Chickpea Casserole (similarity: 0.705)
  2. Whole Wheat Pasta With Peppers, Tomatoes and Olives (similarity: 0.703)
  3. Cheese and Potato Casserole (similarity: 0.694)

Query: 'Mexican street food'
Found 5 recipes:
  1. Mexican Side Dish (similarity: 0.705)
  2. Mexican Lasanga (similarity: 0.697)
  3. Mexican Rice (similarity: 0.694)
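
The similarity column in these results comes from `1 - (embedding <=> query)`. pgvector's `<=>` operator is cosine distance, so the reported value is cosine similarity. The sketch below reproduces that arithmetic in plain Python on toy two-dimensional vectors. The vectors and the `passes_threshold` helper are illustrative assumptions, not values from the recipe database.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, i.e. 1 minus the cosine distance that pgvector's <=> returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_threshold(a: Sequence[float], b: Sequence[float], threshold: float = 0.5) -> bool:
    """Mirror of the WHERE clause: keep a row only if similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold


query_vec = [1.0, 0.0]           # toy query embedding
same_direction = [2.0, 0.0]      # points the same way, different length
orthogonal = [0.0, 3.0]          # unrelated direction

print(cosine_similarity(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal))
print(passes_threshold(query_vec, orthogonal))
```

Cosine similarity ignores vector length, which is why the two parallel vectors score as identical even though their magnitudes differ.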


In [22]:
def resolve_ingredient_to_canonical(ingredient_name: str, confidence_threshold: float = 0.65) -> Optional[Dict]:
    """
    Find canonical ingredient from user input using exact match or embedding search.
    
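    Returns:
        Dict with canonical_id, canonical_name, confidence, and method
        (plus matched_variant for embedding matches), or None if nothing
        clears confidence_threshold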
    """
    conn = get_db_connection()
    cur = conn.cursor()
    
    try:
        # First try exact match in variants
        cur.execute("""
            SELECT DISTINCT i.id, i.canonical
            FROM ingredients i
            JOIN ingredient_variants iv ON i.id = iv.canonical_id
            WHERE LOWER(iv.variant) = LOWER(%s)
        """, (ingredient_name,))
        
        result = cur.fetchone()
        if result:
            return {
                'canonical_id': result[0],
                'canonical_name': result[1],
                'confidence': 1.0,
                'method': 'exact_match'
            }
        
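        # If no exact match, fall back to embedding search. Similarity is computed
        # against the canonical ingredient's embedding, so the variant returned by
        # the join is an arbitrary variant of the best-matching canonical ingredient.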
        ingredient_embedding = model.encode(ingredient_name).tolist()
        
        cur.execute("""
            SELECT 
                i.id,
                i.canonical,
                iv.variant,
                1 - (i.embedding <=> %s::vector) as similarity
            FROM ingredients i
            JOIN ingredient_variants iv ON i.id = iv.canonical_id
            WHERE 1 - (i.embedding <=> %s::vector) > %s
            ORDER BY i.embedding <=> %s::vector
            LIMIT 1
        """, (ingredient_embedding, ingredient_embedding, confidence_threshold, ingredient_embedding))
        
        result = cur.fetchone()
        if result:
            return {
                'canonical_id': result[0],
                'canonical_name': result[1],
                'matched_variant': result[2],
                'confidence': float(result[3]),
                'method': 'embedding_match'
            }
        
        return None
    
    finally:
        cur.close()
        conn.close()

def get_all_ingredient_variants(canonical_id: int) -> List[str]:
    """Get all variants of a canonical ingredient."""
    conn = get_db_connection()
    cur = conn.cursor()
    
    try:
        cur.execute("""
            SELECT variant 
            FROM ingredient_variants 
            WHERE canonical_id = %s
        """, (canonical_id,))
        
        return [row[0] for row in cur.fetchall()]
    
    finally:
        cur.close()
        conn.close()

def filter_recipes_by_ingredients(
    include_ingredients: List[str], 
    exclude_ingredients: List[str],
    recipe_ids: Optional[List[int]] = None
) -> Tuple[List[int], Dict]:
    """
    Filter recipes by ingredient constraints.
    Assumes ingredients are stored as array in recipes.ingredients column
    """
    # Resolve all ingredients to canonical form
    resolved_includes = []
    resolved_excludes = []
    unresolved = []
    
    for ing in include_ingredients:
        resolved = resolve_ingredient_to_canonical(ing)
        if resolved:
            resolved['variants'] = get_all_ingredient_variants(resolved['canonical_id'])
            resolved_includes.append(resolved)
        else:
            unresolved.append(('include', ing))
    
    for ing in exclude_ingredients:
        resolved = resolve_ingredient_to_canonical(ing)
        if resolved:
            resolved['variants'] = get_all_ingredient_variants(resolved['canonical_id'])
            resolved_excludes.append(resolved)
        else:
            unresolved.append(('exclude', ing))
    
    # Now filter recipes
    conn = get_db_connection()
    cur = conn.cursor()
    
    try:
        # Build query for array-based ingredients
        query_parts = ["SELECT id FROM recipes WHERE 1=1"]
        params = []
        
        # Restrict to the candidate IDs if provided. Checking "is not None" means
        # an empty candidate list yields no recipes rather than a search over
        # the whole table.
        if recipe_ids is not None:
            query_parts.append("AND id = ANY(%s::int[])")
            params.append(list(recipe_ids))
        
        # Each included ingredient must appear: at least one element of the recipe's
        # ingredients array has to equal one of that ingredient's variants
        for inc in resolved_includes:
            variants_list = [v.lower() for v in inc['variants']]
            query_parts.append("""
                AND EXISTS (
                    SELECT 1 FROM unnest(ingredients) AS ing
                    WHERE LOWER(ing) = ANY(%s::text[])
                )
            """)
            params.append(variants_list)
        
        # Add constraints for excluded ingredients
        if resolved_excludes:
            exclude_variants = []
            for exc in resolved_excludes:
                exclude_variants.extend([v.lower() for v in exc['variants']])
            
            query_parts.append("""
                AND NOT EXISTS (
                    SELECT 1 FROM unnest(ingredients) AS ing
                    WHERE LOWER(ing) = ANY(%s::text[])
                )
            """)
            params.append(exclude_variants)
        
        query = '\n'.join(query_parts)
        cur.execute(query, params)
        
        matching_recipe_ids = [row[0] for row in cur.fetchall()]
        
        resolution_info = {
            'resolved_includes': resolved_includes,
            'resolved_excludes': resolved_excludes,
            'unresolved': unresolved
        }
        
        return matching_recipe_ids, resolution_info
    
    finally:
        cur.close()
        conn.close()

# Test ingredient matching
print("\nTesting Ingredient Resolution:")
print("-"*50)

test_ingredients = [
    "tomatos",  # typo
    "eggs",
    "chicken breast",
    "asdfgh"  # nonsense
]

for ingredient in test_ingredients:
    result = resolve_ingredient_to_canonical(ingredient)
    if result:
        print(f"\n'{ingredient}' → '{result['canonical_name']}' (confidence: {result['confidence']:.3f})")
        variants = get_all_ingredient_variants(result['canonical_id'])
        print(f"  Variants: {variants}")
    else:
        print(f"\n'{ingredient}' → NOT FOUND")

# Test recipe filtering
print("\n\nTesting Recipe Filtering:")
print("-"*50)
include = ["tomato", "basil"]
exclude = ["nuts"]
recipe_ids, info = filter_recipes_by_ingredients(include, exclude)
print(f"Include: {include}")
print(f"Exclude: {exclude}")
print(f"Found {len(recipe_ids)} recipes")
print(f"Resolved includes: {[r['canonical_name'] for r in info['resolved_includes']]}")
print(f"Resolved excludes: {[r['canonical_name'] for r in info['resolved_excludes']]}")
print(f"Unresolved: {info['unresolved']}")


Testing Ingredient Resolution:
--------------------------------------------------

'tomatos' → 'tomatoes' (confidence: 0.884)
  Variants: ['baby tomatoes', 'beef tomatoes', 'bush tomatoes', 'chili tomatoes', 'chpd tomatoes', 'cubed tomatoes', 'cut tomatoes', 'diced tomatoes', 'dried tomatoes', 'egg tomatoes', 'firm tomatoes', 'fresh tomatoes', 'grape tomatoes', 'green tomatoes', 'meaty tomatoes', 'mixed tomatoes', 'paste tomatoes', 'plum tomatoes', 'pomi tomatoes', 'red tomatoes', 'rice tomatoes', 'ripe tomatoes', 'roma tomatoes', 'rotel tomatoes', 'sweet tomatoes', 'thin tomatoes', 'tomato', 'tomatoes', 'tomatoes, --', 'truss tomatoes', 'vine tomatoes', 'whole tomatoes']

'eggs' → 'eggs' (confidence: 1.000)
  Variants: ['egg', 'eggs']

'chicken breast' → 'chicken breasts' (confidence: 1.000)
  Variants: ['bone in chicken breasts', 'bone-in chicken breasts', 'boneless chicken breast', 'boneless chicken breasts', 'chicken breast', 'chicken breast fillet', 'chicken breast fillets', 'chi
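
The resolution output above reflects a two-stage lookup: an exact match on a known variant first, then a similarity fallback that must clear a confidence threshold. The sketch below shows that control flow without a database. It swaps the embedding search for `difflib.SequenceMatcher` string similarity purely so the example runs standalone, so the confidence values will not match the notebook's. The `VARIANTS` table and the function name `resolve_ingredient` are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

# Toy stand-in for the ingredients / ingredient_variants tables (assumed data).
VARIANTS: Dict[str, List[str]] = {
    "tomatoes": ["tomato", "tomatoes", "roma tomatoes"],
    "eggs": ["egg", "eggs"],
}


def resolve_ingredient(name: str, threshold: float = 0.65) -> Optional[Dict]:
    """Resolve user text to a canonical ingredient: exact match, then fuzzy fallback."""
    needle = name.lower()

    # Stage 1: exact, case-insensitive match against any known variant.
    for canonical, variants in VARIANTS.items():
        if needle in (v.lower() for v in variants):
            return {"canonical_name": canonical, "confidence": 1.0, "method": "exact_match"}

    # Stage 2: fuzzy fallback (string similarity here, embeddings in the notebook).
    best_name, best_score = None, 0.0
    for canonical in VARIANTS:
        score = SequenceMatcher(None, needle, canonical).ratio()
        if score > best_score:
            best_name, best_score = canonical, score

    if best_name is not None and best_score > threshold:
        return {"canonical_name": best_name, "confidence": best_score, "method": "fuzzy_match"}
    return None  # nothing cleared the threshold


for text in ["tomatos", "Egg", "asdfgh"]:
    print(text, "->", resolve_ingredient(text))
```

Trying the exact match first keeps common inputs cheap and deterministic. The fallback only runs for typos and unseen phrasings, and the threshold is what rejects nonsense input.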

In [23]:

# Map extraction group names to database group names
GROUP_NAME_MAPPING = {
    "DIETS": "DIETARY_HEALTH",
    "FREE_OF": "DIETARY_HEALTH",
    "SCALE": "DIFFICULTY_SCALE"
}

QUICK_TIME_TAGS = ["15-minutes-or-less", "30-minutes-or-less", "60-minutes-or-less"]
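
Because `DIETS` and `FREE_OF` both map to `DIETARY_HEALTH`, a result dict keyed by database group with a single string per key can hold only one of them. The sketch below shows one possible way to keep both values by collecting a list per database group. The function name `map_tags_keep_all` and the list-valued return type are assumptions, and any code consuming the result would need to handle lists.

```python
from collections import defaultdict
from typing import Dict, List

# Same mapping as in the cell above, repeated so the sketch is self-contained.
GROUP_NAME_MAPPING = {
    "DIETS": "DIETARY_HEALTH",
    "FREE_OF": "DIETARY_HEALTH",
    "SCALE": "DIFFICULTY_SCALE",
}


def map_tags_keep_all(tags_dict: Dict[str, str]) -> Dict[str, List[str]]:
    """Map extraction groups to database groups, keeping every non-empty value."""
    mapped: Dict[str, List[str]] = defaultdict(list)
    for group, value in tags_dict.items():
        if not value:
            continue  # skip groups the extractor left empty
        db_group = GROUP_NAME_MAPPING.get(group, group)
        mapped[db_group].append(value)
    return dict(mapped)


print(map_tags_keep_all({"DIETS": "vegan", "FREE_OF": "gluten-free", "MEAL_COURSES": ""}))
```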


In [24]:
def resolve_tag(tag_text: str, tag_group: Optional[str] = None, confidence_threshold: float = 0.7) -> Optional[Dict]:
    """
    Find matching tag using exact match or embedding search.
    
    Args:
        tag_text: The tag text to search for
        tag_group: Optional tag group to limit search
        confidence_threshold: Minimum similarity score
    
    Returns:
        Dict with tag info or None
    """
    conn = get_db_connection()
    cur = conn.cursor()
    
    try:
        # First try exact match
        if tag_group:
            cur.execute("""
                SELECT t.tag_name, t.group_name
                FROM tags t
                WHERE LOWER(t.tag_name) = LOWER(%s) AND t.group_name = %s
            """, (tag_text, tag_group))
        else:
            cur.execute("""
                SELECT t.tag_name, t.group_name
                FROM tags t
                WHERE LOWER(t.tag_name) = LOWER(%s)
            """, (tag_text,))
        
        result = cur.fetchone()
        if result:
            return {
                'tag_name': result[0],
                'group_name': result[1],
                'confidence': 1.0,
                'method': 'exact_match'
            }
        
        # If no exact match, use embedding search
        tag_embedding = model.encode(tag_text).tolist()
        
        if tag_group:
            cur.execute("""
                SELECT 
                    t.tag_name,
                    t.group_name,
                    1 - (t.embedding <=> %s::vector) as similarity
                FROM tags t
                WHERE t.group_name = %s
                    AND 1 - (t.embedding <=> %s::vector) > %s
                ORDER BY t.embedding <=> %s::vector
                LIMIT 1
            """, (tag_embedding, tag_group, tag_embedding, confidence_threshold, tag_embedding))
        else:
            cur.execute("""
                SELECT 
                    t.tag_name,
                    t.group_name,
                    1 - (t.embedding <=> %s::vector) as similarity
                FROM tags t
                WHERE 1 - (t.embedding <=> %s::vector) > %s
                ORDER BY t.embedding <=> %s::vector
                LIMIT 1
            """, (tag_embedding, tag_embedding, confidence_threshold, tag_embedding))
        
        result = cur.fetchone()
        if result:
            return {
                'tag_name': result[0],
                'group_name': result[1],
                'confidence': float(result[2]),
                'method': 'embedding_match'
            }
        
        return None
    
    finally:
        cur.close()
        conn.close()

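def map_tags_for_db(tags_dict: Dict[str, str]) -> Dict[str, str]:
    """Map extracted tag groups to database group names, dropping empty values.

    Note: DIETS and FREE_OF both map to DIETARY_HEALTH (and SCALE shares
    DIFFICULTY_SCALE), so when two colliding groups are set, the one seen
    later overwrites the earlier one.
    """
    mapped = {}
    
    for group, value in tags_dict.items():
        if not value:
            continue
            
        # Map group name
        db_group = GROUP_NAME_MAPPING.get(group, group)
        mapped[db_group] = value
    
    return mapped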

def filter_recipes_by_tags(tags: Dict[str, str], recipe_ids: Optional[List[int]] = None) -> Tuple[List[int], Dict]:
    """
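    Filter recipes by tag constraints.
    Resolved DIETARY_HEALTH and CUISINES_REGIONAL tags are required (AND).
    All other resolved tags, including the QUICK_TIME_TAGS expansion of
    "quick"/"fast", are pooled: a recipe must carry at least one of them (OR).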
    """
    # First apply mapping
    mapped_tags = map_tags_for_db(tags)
    
    # Separate tags by type
    required_tags = []  # Must have ALL of these
    optional_tags = []  # Can have ANY of these
    unresolved_tags = []
    
    for group, value in mapped_tags.items():
        if value:
            # Special handling for "quick" time
            if group == "TIME_DURATION" and value.lower() in ["quick", "fast"]:
                time_tags = []
                for quick_tag in QUICK_TIME_TAGS:
                    time_tags.append({
                        'tag_name': quick_tag,
                        'group_name': 'TIME_DURATION',
                        'confidence': 1.0,
                        'method': 'quick_mapping'
                    })
                optional_tags.extend(time_tags)  # These are OR conditions
            else:
                resolved = resolve_tag(value, group)
                if resolved:
                    # Dietary and Cuisine tags are REQUIRED
                    if group in ["DIETARY_HEALTH", "CUISINES_REGIONAL"]:
                        required_tags.append(resolved)
                    else:
                        optional_tags.append(resolved)
                else:
                    unresolved_tags.append((group, value))
    
    if not (required_tags or optional_tags):
        return recipe_ids or [], {'resolved_tags': [], 'unresolved_tags': unresolved_tags}
    
    conn = get_db_connection()
    cur = conn.cursor()
    
    try:
        # Build query
        query_parts = ["SELECT id FROM recipes WHERE 1=1"]
        params = []
        
        # Restrict to the candidate IDs if provided. Checking "is not None" means
        # an empty candidate list yields no recipes rather than a search over
        # the whole table.
        if recipe_ids is not None:
            query_parts.append("AND id = ANY(%s::int[])")
            params.append(list(recipe_ids))
        
        # Add REQUIRED tags (ALL must match)
        for req_tag in required_tags:
            query_parts.append("""
                AND EXISTS (
                    SELECT 1 FROM unnest(tags) AS tag
                    WHERE LOWER(tag) = LOWER(%s)
                )
            """)
            params.append(req_tag['tag_name'])
        
        # Add OPTIONAL tags (ANY can match)
        if optional_tags:
            optional_tag_names = [tag['tag_name'].lower() for tag in optional_tags]
            query_parts.append("""
                AND EXISTS (
                    SELECT 1 FROM unnest(tags) AS tag
                    WHERE LOWER(tag) = ANY(%s::text[])
                )
            """)
            params.append(optional_tag_names)
        
        query = '\n'.join(query_parts)
        cur.execute(query, params)
        
        matching_recipe_ids = [row[0] for row in cur.fetchall()]
        
        # Get match counts for ranking
        all_tags = required_tags + optional_tags
        if matching_recipe_ids and all_tags:
            tag_names = [tag['tag_name'].lower() for tag in all_tags]
            count_query = """
                SELECT id, 
                       (SELECT COUNT(*) 
                        FROM unnest(tags) AS tag 
                        WHERE LOWER(tag) = ANY(%s::text[])) as match_count
                FROM recipes
                WHERE id = ANY(%s::int[])
                ORDER BY match_count DESC
            """
            cur.execute(count_query, (tag_names, matching_recipe_ids))
            match_counts = {row[0]: row[1] for row in cur.fetchall()}
        else:
            match_counts = {}
        
        resolution_info = {
            'resolved_tags': required_tags + optional_tags,
            'required_tags': required_tags,
            'optional_tags': optional_tags,
            'unresolved_tags': unresolved_tags,
            'match_counts': match_counts
        }
        
        return matching_recipe_ids, resolution_info
    
    finally:
        cur.close()
        conn.close()


# Test tag resolution
print("\nTesting Tag Resolution:")
print("-"*50)

test_tags = [
    ("Italian", "CUISINES_REGIONAL"),
    ("quick", "TIME_DURATION"),
    ("vegan", None),  # no group given: search across all tag groups
    ("spicy", None)
]

for tag_text, group in test_tags:
    result = resolve_tag(tag_text, group)
    if result:
        print(f"\n'{tag_text}' → '{result['tag_name']}' in group '{result['group_name']}'")
        print(f"  Confidence: {result['confidence']:.3f}, Method: {result['method']}")
    else:
        print(f"\n'{tag_text}' → NOT FOUND")

# Test tag filtering
print("\n\nTesting Tag Filtering:")
print("-"*50)
tags_to_search = {
    "CUISINES_REGIONAL": "Italian",
    "MEAL_COURSES": "dinner",
    "DIETS": "vegetarian"  
}
recipe_ids, info = filter_recipes_by_tags(tags_to_search)
print(f"Searching for tags: {tags_to_search}")
print(f"Found {len(recipe_ids)} recipes")
print(f"Resolved tags: {[(t['tag_name'], t['group_name']) for t in info['resolved_tags']]}")
print(f"Unresolved: {info['unresolved_tags']}")
if recipe_ids:
    print(f"Top 3 recipes with match counts: {list(info['match_counts'].items())[:3]}")


Testing Tag Resolution:
--------------------------------------------------

'Italian' → 'italian' in group 'CUISINES_REGIONAL'
  Confidence: 1.000, Method: exact_match

'quick' → NOT FOUND

'vegan' → 'vegan' in group 'DIETARY_HEALTH'
  Confidence: 1.000, Method: exact_match

'spicy' → 'chili' in group 'DISH_TYPES'
  Confidence: 0.724, Method: embedding_match


Testing Tag Filtering:
--------------------------------------------------
Searching for tags: {'CUISINES_REGIONAL': 'Italian', 'MEAL_COURSES': 'dinner', 'DIETS': 'vegetarian'}
Found 166 recipes
Resolved tags: [('italian', 'CUISINES_REGIONAL'), ('vegetarian', 'DIETARY_HEALTH'), ('dinner-party', 'MEAL_COURSES')]
Unresolved: []
Top 3 recipes with match counts: [(200, 3), (1504, 3), (1945, 3)]
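
The filter above combines two kinds of constraint: every required tag must be present, and at least one tag from the pooled optional set must be present. The same predicate can be sketched over in-memory data, which makes the AND/OR split easier to see than in the generated SQL. The toy recipes and the function name `matches_tags` are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def matches_tags(recipe_tags: Iterable[str], required: List[str], optional: List[str]) -> bool:
    """True if the recipe has ALL required tags and, when any are given, ANY optional tag."""
    tags = {t.lower() for t in recipe_tags}
    has_all_required = all(r.lower() in tags for r in required)
    has_any_optional = (not optional) or any(o.lower() in tags for o in optional)
    return has_all_required and has_any_optional


# Toy recipes (assumed data, not rows from the notebook's database).
recipes: Dict[int, List[str]] = {
    1: ["italian", "vegetarian", "30-minutes-or-less"],
    2: ["italian", "vegetarian"],              # no optional tag
    3: ["italian", "60-minutes-or-less"],      # missing a required tag
}

required = ["italian", "vegetarian"]
optional = ["15-minutes-or-less", "30-minutes-or-less", "60-minutes-or-less"]

matching = [rid for rid, tags in recipes.items() if matches_tags(tags, required, optional)]
print(matching)
```

Under this logic, recipe 2 is dropped even though it satisfies both required tags. The same effect applies to the SQL version: once any optional tag resolves, the optional pool acts as a hard filter, not only as a ranking signal.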


In [25]:
def get_recipe_details(recipe_ids: List[int]) -> Dict[int, Dict]:
    """Get recipe details for display."""
    if not recipe_ids:
        return {}
        
    conn = get_db_connection()
    cur = conn.cursor()
    try:
        placeholders = ','.join(['%s'] * len(recipe_ids))
        cur.execute(f"""
            SELECT id, name, description, ingredients, tags
            FROM recipes
            WHERE id IN ({placeholders})
        """, recipe_ids)
        
        details = {}
        for row in cur.fetchall():
            details[row[0]] = {
                'recipe_id': row[0],
                'name': row[1],
                'description': row[2],
                'ingredients': row[3],  
                'tags': row[4]          
            }
        return details
    finally:
        cur.close()
        conn.close()

In [26]:
def search_recipes_combined(recipe_query: RecipeQuery, tags: TagsSemanticSchema) -> Dict:
    """Combined search using embeddings, ingredients, and tags."""
    
    # Convert tags to dict (remove empty values)
    if hasattr(tags, 'model_dump'):
        tags_dict = {k: v for k, v in tags.model_dump().items() if v}
    else:
        # If tags is already a dict
        tags_dict = {k: v for k, v in tags.items() if v} if isinstance(tags, dict) else {}
    
    # Phase 1: Embedding search
    embedding_results = search_recipes_by_embedding(recipe_query.query, limit=100)
    candidate_ids = [r['recipe_id'] for r in embedding_results]
    embedding_scores = {r['recipe_id']: r['similarity'] for r in embedding_results}
    
    # Phase 2: Ingredient filtering
    if recipe_query.include_ingredients or recipe_query.exclude_ingredients:
        filtered_ids, ingredient_info = filter_recipes_by_ingredients(
            recipe_query.include_ingredients,
            recipe_query.exclude_ingredients,
            candidate_ids
        )
    else:
        filtered_ids = candidate_ids
    
    # Phase 3: Tag filtering
    if tags_dict and filtered_ids:
        final_ids, tag_info = filter_recipes_by_tags(tags_dict, filtered_ids)
    else:
        final_ids = filtered_ids
    
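    # Rank surviving candidates by embedding similarity (the filter queries
    # return IDs in arbitrary order), then take the top N
    final_ids = sorted(final_ids, key=lambda rid: embedding_scores.get(rid, 0.0), reverse=True)
    result_ids = final_ids[:recipe_query.count]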
    
    return {
        'recipe_ids': result_ids,
        'embedding_scores': embedding_scores,
        'total_found': len(final_ids)
    }

In [27]:

# Create the objects
test_query = RecipeQuery(
    query="Italian vegetarian pasta",
    include_ingredients=["tomato", "basil"],
    exclude_ingredients=["meat"],
    count=5
)

test_tags = TagsSemanticSchema(
    TIME_DURATION="quick",
    DIFFICULTY_SCALE="",
    SCALE="",
    FREE_OF="",
    DIETS="vegetarian",
    CUISINES_REGIONAL="Italian",
    MEAL_COURSES="dinner",
    PREPARATION_METHOD=""
)

In [28]:
# Check if any Italian vegetarian recipes have both tomato and basil
conn = get_db_connection()
cur = conn.cursor()

# Get tomato and basil variants
tomato_result = resolve_ingredient_to_canonical("tomato")
basil_result = resolve_ingredient_to_canonical("basil")

tomato_variants = get_all_ingredient_variants(tomato_result['canonical_id'])
basil_variants = get_all_ingredient_variants(basil_result['canonical_id'])

print(f"Tomato variants: {len(tomato_variants)}")
print(f"Basil variants: {len(basil_variants)}")

# Find Italian vegetarian recipes with both
cur.execute("""
    SELECT id, name, ingredients
    FROM recipes 
    WHERE 'italian' = ANY(tags) 
    AND 'vegetarian' = ANY(tags)
    AND EXISTS (
        SELECT 1 FROM unnest(ingredients) AS ing
        WHERE LOWER(ing) = ANY(%s::text[])
    )
    AND EXISTS (
        SELECT 1 FROM unnest(ingredients) AS ing
        WHERE LOWER(ing) = ANY(%s::text[])
    )
    LIMIT 10
""", ([v.lower() for v in tomato_variants], [v.lower() for v in basil_variants]))

results = cur.fetchall()
print(f"\nFound {len(results)} Italian vegetarian recipes with both tomato and basil:")
for row in results:
    print(f"- {row[1]} (ID: {row[0]})")

cur.close()
conn.close()

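# Run the combined (embedding-first) search for comparison.
# Note: search_recipes_combined hard-codes limit=100 for its embedding phase,
# so this call still uses 100 candidates despite the message printed below.
print("\n\nTrying search with 500 candidates instead of 100:")

# Convert test_tags to dict
tags_dict = {k: v for k, v in test_tags.model_dump().items() if v}

# Run the full search (embedding limit unchanged)
results = search_recipes_combined(test_query, test_tags)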
print(f"Final result: Found {results['total_found']} recipes")

Tomato variants: 32
Basil variants: 2

Found 10 Italian vegetarian recipes with both tomato and basil:
- Easy Bruschetta Dip (ID: 108554)
- Marinara Sauce (ID: 14979)
- Red Pepper Pesto Bruschetta (ID: 18194)
- Johnny Carino's Pepper Jack Crab Fondue - Copycat (ID: 56333)
- Pizza Sauce (ID: 65314)
- Linguini Alla Cecca (ID: 78702)
- Sweet Potato &quot;pasta&quot; With Tangy Marinara: a Raw Food R (ID: 80084)
- Marinara Sauce (ID: 84604)
- Delicious Basic Pasta Sauce (ID: 85482)
- Sauteed Zucchini And Tomatoes (ID: 102797)


Trying search with 500 candidates instead of 100:
Final result: Found 0 recipes
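
The direct SQL query finds matching recipes while the combined search returns none. One plausible explanation is that the recipes satisfying every constraint are not among the top embedding candidates, so they are cut before the filters run. The toy sketch below illustrates that failure mode and the filter-first alternative. The similarity values, the `matches_filters` flag, and both function names are made up for the illustration, and the combined search's tag filters may contribute to the empty result as well.

```python
from typing import Dict, List

# Toy corpus: the best embedding matches fail the hard constraints (assumed data).
recipes: List[Dict] = [
    {"id": 1, "similarity": 0.90, "matches_filters": False},
    {"id": 2, "similarity": 0.85, "matches_filters": False},
    {"id": 3, "similarity": 0.57, "matches_filters": True},
    {"id": 4, "similarity": 0.55, "matches_filters": True},
]


def embedding_first(corpus: List[Dict], k: int) -> List[int]:
    """Take the top-k by similarity, then apply the hard filters to those k only."""
    top_k = sorted(corpus, key=lambda r: r["similarity"], reverse=True)[:k]
    return [r["id"] for r in top_k if r["matches_filters"]]


def filter_first(corpus: List[Dict], count: int) -> List[int]:
    """Apply the hard filters to the whole corpus, then rank survivors by similarity."""
    survivors = [r for r in corpus if r["matches_filters"]]
    ranked = sorted(survivors, key=lambda r: r["similarity"], reverse=True)
    return [r["id"] for r in ranked[:count]]


print(embedding_first(recipes, k=2))
print(filter_first(recipes, count=5))
```

With a candidate cutoff of 2, the embedding-first path never sees recipes 3 and 4. Filtering first guarantees that every constraint-satisfying recipe is eligible for ranking, at the cost of scoring a potentially larger set.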


In [29]:
def search_recipes_ingredients_first(recipe_query: RecipeQuery, tags: TagsSemanticSchema) -> Dict:
    """
    Search strategy: Ingredients + Critical Tags first, then rank by embedding.
    """
    conn = get_db_connection()
    cur = conn.cursor()
    
    try:
        # Convert tags to dict and map
        tags_dict = {k: v for k, v in tags.model_dump().items() if v}
        mapped_tags = map_tags_for_db(tags_dict)
        
        # Extract critical tags (DIETARY_HEALTH and CUISINES_REGIONAL)
        critical_tags = []
        if mapped_tags.get('DIETARY_HEALTH'):
            resolved = resolve_tag(mapped_tags['DIETARY_HEALTH'], 'DIETARY_HEALTH')
            if resolved:
                critical_tags.append(resolved['tag_name'])
        
        if mapped_tags.get('CUISINES_REGIONAL'):
            resolved = resolve_tag(mapped_tags['CUISINES_REGIONAL'], 'CUISINES_REGIONAL')
            if resolved:
                critical_tags.append(resolved['tag_name'])
        
        # Resolve ingredients once: one variant list per included ingredient
        # (each must match), and one flat list of all excluded variants
        include_variant_groups = []
        exclude_variants = []
        
        for ing in recipe_query.include_ingredients:
            resolved = resolve_ingredient_to_canonical(ing)
            if resolved:
                variants = get_all_ingredient_variants(resolved['canonical_id'])
                include_variant_groups.append([v.lower() for v in variants])
        
        for ing in recipe_query.exclude_ingredients:
            resolved = resolve_ingredient_to_canonical(ing)
            if resolved:
                variants = get_all_ingredient_variants(resolved['canonical_id'])
                exclude_variants.extend([v.lower() for v in variants])
        
        # Build query: Start with ingredients and critical tags
        query_parts = ["SELECT id FROM recipes WHERE 1=1"]
        params = []
        
        # Must have ALL included ingredients
        for variants in include_variant_groups:
            query_parts.append("""
                AND EXISTS (
                    SELECT 1 FROM unnest(ingredients) AS ing
                    WHERE LOWER(ing) = ANY(%s::text[])
                )
            """)
            params.append(variants)
        
        # Must NOT have any excluded ingredients
        if exclude_variants:
            query_parts.append("""
                AND NOT EXISTS (
                    SELECT 1 FROM unnest(ingredients) AS ing
                    WHERE LOWER(ing) = ANY(%s::text[])
                )
            """)
            params.append(exclude_variants)
        
        # Must have ALL critical tags
        for tag in critical_tags:
            query_parts.append("""
                AND %s = ANY(tags)
            """)
            params.append(tag.lower())
        
        query = '\n'.join(query_parts)
        cur.execute(query, params)
        
        # Get all matching recipe IDs
        matching_ids = [row[0] for row in cur.fetchall()]
        
        if not matching_ids:
            return {
                'recipe_ids': [],
                'embedding_scores': {},
                'total_found': 0
            }
        
        # Now get embedding scores for these recipes
        query_embedding = model.encode(recipe_query.query).tolist()
        
        # Get embeddings and calculate scores
        placeholders = ','.join(['%s'] * len(matching_ids))
        cur.execute(f"""
            SELECT id, 1 - (embedding <=> %s::vector) as similarity
            FROM recipes
            WHERE id IN ({placeholders})
            ORDER BY similarity DESC
        """, [query_embedding] + matching_ids)
        
        # Collect results
        results = []
        embedding_scores = {}
        for row in cur.fetchall():
            results.append(row[0])
            embedding_scores[row[0]] = float(row[1])
        
        # Return top N
        return {
            'recipe_ids': results[:recipe_query.count],
            'embedding_scores': embedding_scores,
            'total_found': len(results)
        }
        
    finally:
        cur.close()
        conn.close()

In [30]:
# Test with the same query
results = search_recipes_ingredients_first(test_query, test_tags)

# Get recipe details
recipe_details = get_recipe_details(results['recipe_ids'])

# Display results
print(f"\nINGREDIENTS-FIRST SEARCH RESULTS")
print("="*60)
print(f"Found {results['total_found']} recipes total")
print(f"Showing top {len(results['recipe_ids'])}:\n")

for i, recipe_id in enumerate(results['recipe_ids'], 1):
    details = recipe_details.get(recipe_id, {})
    score = results['embedding_scores'].get(recipe_id, 0)
    
    print(f"\n{i}. {details.get('name', 'Unknown')} (ID: {recipe_id})")
    print(f"   Embedding similarity: {score:.3f}")
    
    # Verify it has our requirements
    ingredients = details.get('ingredients', [])
    tags = details.get('tags', [])
    
    has_tomato = any('tomat' in ing.lower() for ing in ingredients)
    has_basil = any('basil' in ing.lower() for ing in ingredients)
    has_meat = any(meat in ing.lower() for ing in ingredients for meat in ['meat', 'beef', 'pork', 'chicken'])
    
    print(f"   ✓ Ingredients: tomato={has_tomato}, basil={has_basil}, meat={has_meat}")
    print(f"   ✓ Tags: vegetarian={'vegetarian' in tags}, italian={'italian' in tags}")
    # Guard against a NULL description coming back from the database
    print(f"   Description: {(details.get('description') or '')[:100]}...")


INGREDIENTS-FIRST SEARCH RESULTS
Found 12 recipes total
Showing top 5:


1. Sweet Potato "pasta" With Tangy Marinara: a Raw Food R (ID: 80084)
   Embedding similarity: 0.574
   ✓ Ingredients: tomato=True, basil=True, meat=False
   ✓ Tags: vegetarian=True, italian=True
   Description: Sweet potato noodles are nothing more than finely shredded or spiralized raw sweet potato. It has th...

2. Delicious Basic Pasta Sauce (ID: 85482)
   Embedding similarity: 0.568
   ✓ Ingredients: tomato=True, basil=True, meat=False
   ✓ Tags: vegetarian=True, italian=True
   Description: I wanted to make my meatless tomato sauce, and so with a little searching for ideas and a little cre...

3. Marinara Sauce (ID: 14979)
   Embedding similarity: 0.548
   ✓ Ingredients: tomato=True, basil=True, meat=False
   ✓ Tags: vegetarian=True, italian=True
   Description: Great sauce recipe for pasta, quinoa, or veggies....

4. Summer Pasta (ID: 107004)
   Embedding similarity: 0.545
   ✓ Ingredients: tomat
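
The spot check in the display cell above can be factored into a small helper that reports each requirement separately instead of printing flags inline. This is only a sketch: `check_requirements`, `all_passed`, and the sample ingredient lists are hypothetical and do not appear in the notebook. It keeps the same substring matching, which can misfire (for example, 'beef' matches 'beefsteak tomatoes'), so it is a sanity check rather than a guarantee.

```python
def check_requirements(ingredients, tags, must_contain, must_not_contain, required_tags):
    """Return pass/fail flags for one recipe using case-insensitive substring matching."""
    lowered = [ing.lower() for ing in ingredients]
    tag_set = {t.lower() for t in tags}
    return {
        # True when some ingredient contains the substring
        'included': {s: any(s in ing for ing in lowered) for s in must_contain},
        # True when no ingredient contains the substring
        'excluded_absent': {s: not any(s in ing for ing in lowered) for s in must_not_contain},
        # True when the tag is present
        'tags': {t: t.lower() in tag_set for t in required_tags},
    }


def all_passed(report):
    """True only if every individual flag in the report is True."""
    return all(flag for group in report.values() for flag in group.values())


# Invented sample recipes, for illustration only
meat_terms = ['meat', 'beef', 'pork', 'chicken']
good = check_requirements(
    ['Roma tomatoes', 'fresh basil', 'olive oil'],
    ['vegetarian', 'italian'],
    must_contain=['tomat', 'basil'],
    must_not_contain=meat_terms,
    required_tags=['vegetarian', 'italian'],
)
bad = check_requirements(
    ['tomato paste', 'ground beef'],
    ['italian'],
    must_contain=['tomat', 'basil'],
    must_not_contain=meat_terms,
    required_tags=['vegetarian', 'italian'],
)
print(all_passed(good), all_passed(bad))
```

Returning a structured report makes it possible to print a ✓ only for requirements that actually hold, and to assert on the result in a test.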

In [32]:
def test_conversation_flow():
    """Test the conversation flow with predefined messages."""
    test_messages = [
        "I want Italian vegetarian pasta with tomatoes",
        "Add basil and make it quick", 
        "Actually exclude nuts too",
        "Show me Mexican tacos instead"
    ]
    
    # Reset state
    global turn_counter, current_recipe_query, current_tags, conversation_history
    turn_counter = 0
    current_recipe_query = None
    current_tags = None
    conversation_history = []
    
    for user_message in test_messages:
        print(f"\n{'-'*60}")
        print(f"User: {user_message}")
        
        process_user_message(user_message)
        
        # Search using our existing function
        if current_recipe_query and current_tags:
            results = search_recipes_ingredients_first(current_recipe_query, current_tags)
            print(f"\nSearch Results: Found {results['total_found']} recipes")
            
            if results['recipe_ids']:
                # Get details for top result
                details = get_recipe_details(results['recipe_ids'][:1])
                for rid, info in details.items():
                    print(f"   Top result: {info['name']} (ID: {rid})")

# Run test
test_conversation_flow()


------------------------------------------------------------
User: I want Italian vegetarian pasta with tomatoes
 INITIAL STATE
Query: {'query': 'Italian vegetarian pasta recipes', 'include_ingredients': ['tomatoes'], 'exclude_ingredients': [], 'count': 5}
Tags: {'TIME_DURATION': '', 'DIFFICULTY_SCALE': '', 'SCALE': '', 'FREE_OF': '', 'DIETS': 'vegetarian', 'CUISINES_REGIONAL': 'Italian', 'MEAL_COURSES': 'pasta', 'PREPARATION_METHOD': ''}
History: ['I want Italian vegetarian pasta with tomatoes']


Search Results: Found 93 recipes
   Top result: Whole Wheat Pasta With Peppers, Tomatoes and Olives (ID: 36588)

------------------------------------------------------------
User: Add basil and make it quick
 Continue decision: True - Reason: User modifies/refines the existing search
 CONTINUATION STATE
Query: {'query': 'Italian vegetarian pasta recipes', 'include_ingredients': ['basil', 'tomatoes'], 'exclude_ingredients': [], 'count': 5}
Tags: {'TIME_DURATION': '', 'DIFFICULTY_SCALE': '', 