# Plan

## Target Table Structure
- `recipe_id`: Unique identifier for each recipe
- `title`: Name of the recipe
- `ingredients_raw`: Original ingredients list
- `instructions`: Recipe cooking instructions
- `heat_processed`: Boolean indicating if recipe involves heat processing
- `vegan`: Boolean indicating if recipe is vegan
- `vegetarian`: Boolean indicating if recipe is vegetarian
- `ingredients_processed`: JSON array of ingredients with heat processing status
- `cuisine_tags`: JSON array of cuisine classifications
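
To make the target structure concrete, the sketch below builds one row as a plain dictionary. All values are made up for illustration, and the ordering rule for `cuisine_tags` (most relevant first) is an assumption; the plan only states that the array is ordered.

```python
import json

# Hypothetical example row; the values are illustrative, not taken from the dataset.
example_row = {
    "recipe_id": 1,
    "title": "Example Tomato Soup",
    "ingredients_raw": "['2 ripe tomatoes, diced', '1 tbsp olive oil']",
    "instructions": "Heat oil in a pot, add tomatoes and simmer.",
    "heat_processed": True,
    "vegan": True,
    "vegetarian": True,
    # JSON array of objects: one entry per ingredient with its own heat flag
    "ingredients_processed": json.dumps([
        {"name": "tomatoes", "heat_processed": True},
        {"name": "olive oil", "heat_processed": True},
    ]),
    # JSON array of strings (assumed order: most relevant cuisine first)
    "cuisine_tags": json.dumps(["Italian", "Mediterranean"]),
}

# Both JSON columns round-trip through json.loads
ingredients = json.loads(example_row["ingredients_processed"])
cuisines = json.loads(example_row["cuisine_tags"])
print(ingredients[0]["name"], cuisines[0])
```

Storing the two structured columns as JSON strings keeps them readable after a round trip through CSV.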

## Implementation Steps

### Phase 1: Basic Processing
1. **[Data Preparation](#data-preparation)**
   - Load and clean the original CSV dataset
   - Create empty columns for new features

2. **[Heat Processing Detection](#heat-processing-detection)**
   - Use keyword matching to determine if recipes involve heat (bake, fry, roast, etc.)
   - Set `heat_processed` boolean flag for each recipe

3. **[Dietary Classification](#dietary-classification)**
   - Create dictionaries of non-vegetarian/non-vegan ingredients
   - Analyze ingredients to set `vegan` and `vegetarian` boolean flags

### Phase 2: Advanced Processing (LLM Required)
4. **[Ingredient Heat Processing Analysis](#ingredient-heat-processing-analysis)**
   - **Only for recipes with `heat_processed=True`**
   - Use LLM to analyze which specific ingredients are heat processed
   - Store as structured JSON in `ingredients_processed` column

5. **[Cuisine Classification](#cuisine-classification)**
   - Use LLM to determine cuisine types for all recipes
   - Store as ordered array in `cuisine_tags` column


### Imports

In [3]:
import ast
import json
import pandas as pd

# Phase 1: Basic Processing

### Data Preparation
   - Load and clean the original CSV dataset
   - Rename the columns on load, then drop the `image_name` and `ingredients` columns (only `ingredients_raw` is kept)
   - Create empty columns for new features

Read the table and add the new columns.

In [4]:
# Read the CSV
df = pd.read_csv('/content/Kaggle_dataset.csv', header=0, names=['recipe_id', 'title', 'ingredients_raw', 'instructions', 'image_name', 'ingredients'])

# Drop the image_name and ingredients columns; only ingredients_raw is kept
df = df.drop(['image_name', 'ingredients'], axis=1)

# Create empty columns for new features
df['ingredients_processed'] = None
df['heat_processed'] = False
df['cuisine_tags'] = None
df['vegan'] = False
df['vegetarian'] = False

# Reorder columns
df = df[['recipe_id', 'title', 'ingredients_raw', 'ingredients_processed', 'instructions', 'heat_processed', 'cuisine_tags', 'vegan', 'vegetarian']]

df.head()

| | recipe_id | title | ingredients_raw | ingredients_processed | instructions | heat_processed | cuisine_tags | vegan | vegetarian |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Miso-Butter Roast Chicken With Acorn Squash Pa... | ['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher... | | Pat chicken dry with paper towels, season all ... | False | | False | False |
| 1 | 1 | Crispy Salt and Pepper Potatoes | ['2 large egg whites', '1 pound new potatoes (... | | Preheat oven to 400°F and line a rimmed baking... | False | | False | False |
| 2 | 2 | Thanksgiving Mac and Cheese | ['1 cup evaporated milk', '1 cup whole milk', ... | | Place a rack in middle of oven; preheat to 400... | False | | False | False |
| 3 | 3 | Italian Sausage and Bread Stuffing | ['1 (¾- to 1-pound) round Italian loaf, cut in... | | Preheat oven to 350°F with rack in middle. Gen... | False | | False | False |
| 4 | 4 | Newton's Law | ['1 teaspoon dark brown sugar', '1 teaspoon ho... | | Stir together brown sugar and hot water in a c... | False | | False | False |
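
The `ingredients_raw` column holds each ingredient list as a stringified Python list, and `ast` is imported above but not yet used. A minimal sketch of turning such a string back into a real list could look like this; the helper name `parse_ingredient_list` and the sample string are made up for illustration.

```python
import ast

def parse_ingredient_list(raw):
    """Turn a stringified Python list into a real list of strings."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # malformed or missing value
    if not isinstance(parsed, list):
        return []
    return [str(item) for item in parsed]

# Made-up sample in the same format as the ingredients_raw column
sample = "['2 large egg whites', '1 pound new potatoes']"
print(parse_ingredient_list(sample))
```

Working on the parsed list makes it possible to check ingredients one at a time instead of scanning the whole string.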


Drop rows with missing values.

In [5]:
# Columns to check for missing values; ingredients_processed and cuisine_tags are skipped because they are intentionally empty at this stage
columns_to_check = [col for col in df.columns if col not in ['ingredients_processed', 'cuisine_tags']]

# Show missing values in these columns before dropping
print("Missing values before dropping rows:")
print(df[columns_to_check].isnull().sum())

# Count total rows before dropping
rows_before = len(df)

# Drop rows with missing values in any of the specified columns
df = df.dropna(subset=columns_to_check)

# Count total rows after dropping
rows_after = len(df)

# Show how many rows were dropped
print(f"\nRows before: {rows_before}")
print(f"Rows after: {rows_after}")
print(f"Rows dropped: {rows_before - rows_after}")


Missing values before dropping rows:
recipe_id          0
title              5
ingredients_raw    0
instructions       8
heat_processed     0
vegan              0
vegetarian         0
dtype: int64

Rows before: 13501
Rows after: 13493
Rows dropped: 8


### Heat Processing Detection
   - Use keyword matching to determine if recipes involve heat (bake, fry, roast, etc.)
   - Set `heat_processed` boolean flag for each recipe

In [6]:
# Heat-related keywords
heat_keywords = ['bake', 'barbecue', 'blacken', 'blanch', 'blister', 'boil', 'braise',
                'broil', 'brown', 'bubble', 'burn', 'caramelize', 'char', 'coddle',
                'confit', 'convection', 'cook', 'crisp', 'crock pot', 'crust', 'deep-fry',
                'deglaze', 'double-boil', 'fire', 'flame', 'flambé', 'flash-fry', 'fry',
                'griddle', 'grill', 'hard-boil', 'heat', 'hot', 'hot-smoke', 'induction',
                'infuse', 'melt', 'microwave', 'oven', 'pan-fry', 'pan-sear', 'parboil',
                'poach', 'preheat', 'pressure cook', 'quick-broil', 'reduce', 'reheat',
                'render', 'roast', 'rotisserie', 'salamander', 'sauté', 'scald', 'scorch',
                'sear', 'shallow-fry', 'simmer', 'sizzle', 'skillet', 'slow cook', 'smoke',
                'smoke-roast', 'soft-boil', 'sous vide', 'steam', 'steep', 'stew',
                'stir-fry', 'sweat', 'temper', 'toast', 'torch', 'warm', 'water bath', 'wok']


# Check whether any heat keyword appears in the instructions and update the heat_processed column
df['heat_processed'] = df['instructions'].str.lower().apply(
    lambda x: any(keyword in x for keyword in heat_keywords)
)

df.head()


| | recipe_id | title | ingredients_raw | ingredients_processed | instructions | heat_processed | cuisine_tags | vegan | vegetarian |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Miso-Butter Roast Chicken With Acorn Squash Pa... | ['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher... | | Pat chicken dry with paper towels, season all ... | True | | False | False |
| 1 | 1 | Crispy Salt and Pepper Potatoes | ['2 large egg whites', '1 pound new potatoes (... | | Preheat oven to 400°F and line a rimmed baking... | True | | False | False |
| 2 | 2 | Thanksgiving Mac and Cheese | ['1 cup evaporated milk', '1 cup whole milk', ... | | Place a rack in middle of oven; preheat to 400... | True | | False | False |
| 3 | 3 | Italian Sausage and Bread Stuffing | ['1 (¾- to 1-pound) round Italian loaf, cut in... | | Preheat oven to 350°F with rack in middle. Gen... | True | | False | False |
| 4 | 4 | Newton's Law | ['1 teaspoon dark brown sugar', '1 teaspoon ho... | | Stir together brown sugar and hot water in a c... | True | | False | False |
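
Plain substring matching also fires inside longer words, for example `heat` inside "wheat" or `hot` inside "shot". One possible refinement, sketched here as an alternative rather than as the notebook's method, is to match on word boundaries with a regular expression. The keyword subset, the suffix list, and the name `involves_heat` are illustrative assumptions.

```python
import re

# Subset of the notebook's keyword list, for illustration only
sample_keywords = ['bake', 'fry', 'heat', 'hot', 'preheat', 'slow cook']

# Longest keywords first so multi-word phrases win over their prefixes
alternatives = '|'.join(
    re.escape(k) for k in sorted(sample_keywords, key=len, reverse=True)
)
# \b = word boundary; the optional group accepts a few simple inflections
heat_pattern = re.compile(rf'\b(?:{alternatives})(?:s|d|ed|ing)?\b')

def involves_heat(instructions):
    """True if any keyword occurs as a whole word in the instructions."""
    return bool(heat_pattern.search(instructions.lower()))

cold = "Whisk the wheat flour with a shot of cold espresso."
hot = "Preheat the oven, then bake until golden."

print(any(k in cold.lower() for k in sample_keywords))  # substring check
print(involves_heat(cold))                              # word-boundary check
print(involves_heat(hot))
```

Word boundaries do not help when a keyword appears as a whole word in a non-cooking sense, such as "brown sugar" or "hot water", and inflections like "baking" or "fried" still need their own entries.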


### Dietary Classification
   - Create dictionaries of non-vegetarian/non-vegan ingredients
   - Analyze ingredients to set `vegan` and `vegetarian` boolean flags

In [7]:
# Non-vegetarian ingredients
non_vegetarian_ingredients = ['meat', 'chicken', 'beef', 'pork', 'lamb', 'veal', 'turkey',
                            'duck', 'goose', 'fish', 'salmon', 'tuna', 'shrimp', 'prawn',
                            'crab', 'lobster', 'oyster', 'mussel', 'clam', 'scallop',
                            'anchovy', 'bacon', 'ham', 'sausage', 'gelatin', 'lard',
                            'suet', 'stock', 'broth', 'venison', 'rabbit', 'quail', 'pheasant',
                            'bison', 'buffalo', 'elk', 'deer', 'squab', 'liver', 'kidney',
                            'heart', 'tongue', 'tripe', 'sweetbread', 'foie gras', 'caviar',
                            'cod', 'halibut', 'tilapia', 'sardine', 'herring', 'mackerel',
                            'catfish', 'trout', 'flounder', 'mahi mahi', 'swordfish', 'eel',
                            'octopus', 'squid', 'calamari', 'chorizo', 'pepperoni', 'salami',
                            'prosciutto', 'pancetta', 'bologna', 'hot dog', 'jerky', 'pate',
                            'bone marrow', 'bone broth', 'animal fat', 'tallow', 'schmaltz',
                            'collagen', 'isinglass', 'rennet', 'animal shortening', 'cochineal',
                            'carmine', 'shellac', 'confectioner\'s glaze', 'omega-3', 'fish sauce',
                            'worcestershire sauce', 'caesar dressing', 'dashi', 'katsuobushi',
                            'bonito', 'escargot', 'frog legs']

# Additional non-vegan ingredients (on top of non-vegetarian ones)
non_vegan_ingredients = ['egg', 'milk', 'cream', 'butter', 'cheese', 'yogurt', 'honey',
                        'mayonnaise', 'whey', 'casein', 'ghee', 'lactose', 'rennet',
                        'albumin', 'carmine', 'shellac', 'royal jelly', 'beeswax', 'propolis',
                        'bee pollen', 'buttermilk', 'kefir', 'sour cream', 'ice cream',
                        'custard', 'pudding', 'creme fraiche', 'mascarpone', 'ricotta',
                        'cottage cheese', 'quark', 'paneer', 'egg white', 'egg yolk',
                        'meringue', 'hollandaise', 'marshmallow', 'frosting', 'nougat',
                        'whipped cream', 'condensed milk', 'evaporated milk', 'powdered milk',
                        'milk solids', 'milk protein', 'lactose', 'caseinates', 'lactoferrin',
                        'lactitol', 'lactoglobulin', 'lactalbumin', 'recaldent', 'curds',
                        'vitamin D3', 'lanolin', 'pepsin', 'trypsin', 'glycerin', 'glycerol',
                        'stearic acid', 'oleic acid', 'capric acid', 'myristic acid',
                        'palmitic acid', 'l-cysteine', 'keratin', 'elastin', 'cetyl alcohol',
                        'cholesterol', 'lecithin', 'mono and diglycerides', 'natural flavor',
                        'e120', 'e441', 'e542', 'e631', 'e901', 'e904', 'e910', 'e920',
                        'e921', 'e966', 'e1105']


# Flag a recipe as vegetarian/vegan only if none of the listed ingredients appear.
# Keywords are lowercased too, so entries such as 'vitamin D3' can match the lowercased text.
df['vegetarian'] = df['ingredients_raw'].str.lower().apply(
    lambda x: not any(ingredient.lower() in x for ingredient in non_vegetarian_ingredients)
)

df['vegan'] = df['ingredients_raw'].str.lower().apply(
    lambda x: not any(ingredient.lower() in x for ingredient in non_vegetarian_ingredients + non_vegan_ingredients)
)

df.head()


| | recipe_id | title | ingredients_raw | ingredients_processed | instructions | heat_processed | cuisine_tags | vegan | vegetarian |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Miso-Butter Roast Chicken With Acorn Squash Pa... | ['1 (3½–4-lb.) whole chicken', '2¾ tsp. kosher... | | Pat chicken dry with paper towels, season all ... | True | | False | False |
| 1 | 1 | Crispy Salt and Pepper Potatoes | ['2 large egg whites', '1 pound new potatoes (... | | Preheat oven to 400°F and line a rimmed baking... | True | | False | True |
| 2 | 2 | Thanksgiving Mac and Cheese | ['1 cup evaporated milk', '1 cup whole milk', ... | | Place a rack in middle of oven; preheat to 400... | True | | False | True |
| 3 | 3 | Italian Sausage and Bread Stuffing | ['1 (¾- to 1-pound) round Italian loaf, cut in... | | Preheat oven to 350°F with rack in middle. Gen... | True | | False | False |
| 4 | 4 | Newton's Law | ['1 teaspoon dark brown sugar', '1 teaspoon ho... | | Stir together brown sugar and hot water in a c... | True | | False | True |
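
Substring matching over the whole ingredient string can misclassify recipes: `egg` matches "eggplant", `ham` matches "champagne", and `milk` matches "coconut milk". The sketch below shows one way this could be tightened, by checking each ingredient on whole words and skipping a short allow-list of plant-based phrases. The term subsets, the allow-list, and the function name are hypothetical and would need to be extended for real use.

```python
import ast
import re

# Illustrative subsets; the notebook's full lists would be used in practice
animal_terms = ['chicken', 'ham', 'egg', 'milk', 'butter']
# Hypothetical allow-list of plant-based phrases that contain an animal term
plant_exceptions = ['coconut milk', 'peanut butter', 'almond milk']

def _compile(terms):
    joined = '|'.join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.compile(rf'\b(?:{joined})s?\b')  # whole words, optional plural

animal_re = _compile(animal_terms)
exception_re = _compile(plant_exceptions)

def contains_animal_product(ingredients_raw):
    """Check each ingredient line of a stringified list on whole words."""
    for line in ast.literal_eval(ingredients_raw):
        text = exception_re.sub(' ', line.lower())  # drop allow-listed phrases
        if animal_re.search(text):
            return True
    return False

print(contains_animal_product("['1 large eggplant', '1 cup coconut milk']"))
print(contains_animal_product("['2 large egg whites', '1 cup champagne']"))
```

With a helper like this, `vegetarian` would be the negation of the check against the non-vegetarian list, and `vegan` the negation of the check against both lists combined.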


# Phase 2: Advanced Processing (LLM Required)

### Ingredient Heat Processing Analysis
   - **Only for recipes with `heat_processed=True`**
   - Use LLM to analyze which specific ingredients are heat processed
   - Store as structured JSON in `ingredients_processed` column
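
Independently of which model is used, this step needs a prompt and a way to validate the reply before it is written to `ingredients_processed`. The following is a minimal sketch under stated assumptions: the prompt wording and the helper names `build_heat_prompt` and `parse_heat_response` are hypothetical, and the response text is made up rather than real model output.

```python
import json
import re

def build_heat_prompt(ingredient_names, instructions):
    """Assemble a hypothetical instruction prompt for the model."""
    listing = "\n".join(f"- {name}" for name in ingredient_names)
    return (
        "For each ingredient, decide whether it is heated in the recipe.\n"
        "Answer only with a JSON array of objects shaped like "
        '{"name": "...", "heat_processed": true}.\n\n'
        f"Ingredients:\n{listing}\n\nInstructions:\n{instructions}\n"
    )

def parse_heat_response(response_text):
    """Extract the bracketed JSON array from a reply and validate its entries."""
    match = re.search(r"\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list):
        return None
    valid = [
        {"name": str(i["name"]), "heat_processed": bool(i["heat_processed"])}
        for i in items
        if isinstance(i, dict) and "name" in i and "heat_processed" in i
    ]
    return json.dumps(valid) if valid else None

# Made-up response text standing in for real model output
fake_response = 'Sure! [{"name": "butter", "heat_processed": true}]'
print(parse_heat_response(fake_response))
```

Returning `None` for unusable replies leaves the column empty for that recipe, so failed rows can be retried later.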

In [13]:
# Using Google Colab Secrets (recommended for security)
# Look for the 🔑 icon in the left sidebar of Colab
# Click it and add a new secret with:
# - Name: "HF_TOKEN"
# - Value: your Hugging Face access token

# Then access it in your code:
import os
from google.colab import userdata

os.environ["HUGGING_FACE_HUB_TOKEN"] = userdata.get('HF_TOKEN')

# Verify the token is set (it will show just the first few characters)
print(f"Token set: {os.environ['HUGGING_FACE_HUB_TOKEN'][:5]}...")


Token set: hf_SQ...


In [31]:
# Install required packages
!pip install -q transformers sentencepiece

# Import libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import json

# Load Mistral 7B Instruct in 16-bit precision
# (the weights alone take roughly 14 GB, so this needs a GPU or a lot of RAM)
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # Use float16 precision
    device_map="auto",          # Place weights on GPU if available, otherwise CPU or disk
    low_cpu_mem_usage=True      # Reduce peak RAM usage while loading
)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.
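
This error most likely means that neither GPU memory nor RAM could hold the weights, so `device_map="auto"` ended up assigning everything to disk. A back-of-the-envelope estimate of the weight size shows why. The parameter count below is an assumption, rounded to 7 billion from the model's name, and activations and other runtime overhead are ignored.

```python
def estimate_weight_memory_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

# Assumption: about 7 billion parameters, rounded from the model's name
approx_params = 7e9

for label, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{label}: ~{estimate_weight_memory_gib(approx_params, width):.1f} GiB")
```

At two bytes per parameter the weights come to about 13 GiB (14 GB), which is more than a small CPU-only runtime typically offers.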

In [None]:
# Set the padding token
tokenizer.pad_token = tokenizer.eos_token

# Test the model
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)  # Put inputs on the same device as the model
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # Pass attention mask explicitly
    max_length=50,
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True,  # Enable sampling-based generation
    pad_token_id=tokenizer.eos_token_id  # Set pad token ID to EOS token ID
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

In [None]:
# Install required packages
!pip install -q transformers

# Import libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load GPT-2 model and tokenizer
model_id = "gpt2"  # GPT-2 model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)



In [None]:
# Set the padding token
tokenizer.pad_token = tokenizer.eos_token

# Test the model
prompt = "Answer the following question factually: What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to("cpu")  # Use CPU

# Generate a response
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # Pass attention mask explicitly
    max_length=50,  # Limit the total length (prompt plus response) in tokens
    num_return_sequences=1,  # Generate one response
    temperature=0.1,  # Adjust creativity (lower = more deterministic)
    do_sample=True,  # Enable sampling-based generation
    pad_token_id=tokenizer.eos_token_id  # Set pad token ID to EOS token ID
)

# Decode and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)




### Cuisine Classification
   - Use LLM to determine cuisine types for all recipes
   - Store as ordered array in `cuisine_tags` column
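
Whatever model produces the cuisine labels, its reply has to be turned into an ordered JSON array before it goes into `cuisine_tags`. The sketch below assumes the model is asked for a comma-separated list; that reply format, the `max_tags` cap, and the function name are hypothetical choices.

```python
import json

def normalize_cuisine_tags(response_text, max_tags=3):
    """Turn a comma-separated model answer into an ordered JSON array."""
    tags = []
    for part in response_text.split(","):
        tag = part.strip().strip(".").title()
        if tag and tag not in tags:  # keep first occurrence, preserve order
            tags.append(tag)
    return json.dumps(tags[:max_tags])

# Made-up reply standing in for real model output
print(normalize_cuisine_tags("thai, Southeast Asian, Thai"))
```

Deduplicating while preserving order keeps the first tag as the primary cuisine.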

___________

## Model logic

In [None]:
mock_df = pd.read_csv('receipts_table.csv')
mock_df

| | recipe_id | title | ingredients_raw | ingredients_processed | instructions | heat_processed | cuisine_tags | vegan | vegetarian |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Miso-Butter Roast Chicken | ['1 (3½-4-lb.) whole chicken', '2¾ tsp. kosher... | [{"name":"whole chicken", "heat_processed":tru... | Pat chicken dry with paper towels, season all ... | True | ["Japanese", "American"] | False | False |
| 1 | 2 | Thai Red Curry Noodle Soup | ['1 tablespoon vegetable oil', '1 onion, chopp... | [{"name":"vegetable oil", "heat_processed":tru... | Heat oil in a large pot over medium heat. Add ... | True | ["Thai", "Southeast Asian"] | True | True |
| 2 | 3 | Mediterranean Feta Salad | ['1 large cucumber, diced', '2 cups cherry tom... | [{"name":"cucumber", "heat_processed":false}, ... | In a large bowl, combine cucumber, tomatoes, r... | False | ["Greek", "Mediterranean"] | False | True |
| 3 | 4 | Double Chocolate Brownies | ['200g dark chocolate', '175g unsalted butter'... | [{"name":"dark chocolate", "heat_processed":tr... | Preheat oven to 350°F. Grease and line a 9-inc... | True | ["American", "British"] | False | True |
| 4 | 5 | Vegetable Pad Thai | ['8 oz rice noodles', '3 tbsp vegetable oil', ... | [{"name":"rice noodles", "heat_processed":true... | Soak rice noodles in hot water for 10 minutes ... | True | ["Thai", "Southeast Asian"] | False | True |
| 5 | 6 | Fresh Tomato Bruschetta | ['6 ripe tomatoes, diced', '1/2 red onion, fin... | [{"name":"tomatoes", "heat_processed":false}, ... | In a bowl, combine tomatoes, onion, 2 cloves m... | True | ["Italian", "Mediterranean"] | True | True |


In [None]:
# Keep only the recipes that involve heat processing
mock_df = mock_df[mock_df["heat_processed"]]

In [None]:
def normalize_ingredient_name(ingredient_name):
    # Lowercase, then reduce each word to a rough singular form
    # (e.g. 'tomatoes' -> 'tomato', 'onions' -> 'onion'); this is a heuristic, not a full stemmer
    words = []
    for word in ingredient_name.lower().split():
        if word.endswith('oes'):
            word = word[:-2]
        elif word.endswith('s') and not word.endswith('ss'):
            word = word[:-1]
        words.append(word)
    return ' '.join(words)

def match_ingredient(ingredient_name, user_input_ingredient):
    # Normalize both the ingredient and the user input
    normalized_ingredient = normalize_ingredient_name(ingredient_name)
    normalized_user_input = normalize_ingredient_name(user_input_ingredient)

    # Split into individual words for more flexible matching
    ingredient_words = set(normalized_ingredient.split())
    user_input_words = set(normalized_user_input.split())

    # Match if the ingredient and the user input share at least one word
    return bool(ingredient_words & user_input_words)

def find_heat_processed_ingredient(mock_df, ingredient_name, vegan=False, vegetarian=False):
    # Normalization happens inside match_ingredient, so the raw name is passed through as-is
    # Create a list to hold results
    result = []

    # Loop through each row of the DataFrame
    for index, row in mock_df.iterrows():
        # First, check if the recipe matches the dietary restrictions (vegan, vegetarian)
        if (vegan and not row['vegan']) or (vegetarian and not row['vegetarian']):
            continue  # Skip this recipe if it doesn't match the restrictions

        # Ensure 'ingredients_processed' is a valid JSON string and parse it
        try:
            ingredients = json.loads(row['ingredients_processed'])  # Parse the JSON string into a list of dictionaries
        except (json.JSONDecodeError, TypeError) as e:
            print(f"Error parsing ingredients for recipe '{row['title']}': {e}")
            continue  # Skip this row if parsing fails

        # Loop through the ingredients in the parsed list
        for ingredient in ingredients:
            # Ensure the ingredient is a dictionary with 'name' and 'heat_processed' keys
            if isinstance(ingredient, dict) and 'name' in ingredient and 'heat_processed' in ingredient:
                # Check if the ingredient matches the user input and is heat processed
                if match_ingredient(ingredient['name'], ingredient_name) and ingredient['heat_processed']:
                    result.append(row['title'])
                    break  # If found, no need to check other ingredients in this recipe

    return result

# Function to handle valid 'yes' or 'no' inputs for vegan and vegetarian questions
def get_valid_input(prompt):
    while True:
        user_input = input(prompt).strip().lower()
        if user_input in ['yes', 'no']:
            return user_input == 'yes'  # Return True for 'yes', False for 'no'
        else:
            print("Please answer with 'yes' or 'no'.")

# Ask the user if they have any dietary restrictions
vegan = get_valid_input("Are you vegan? (yes/no): ")
vegetarian = get_valid_input("Are you vegetarian? (yes/no): ")

# Prompt user to input an ingredient
user_input_ingredient = input("Enter an ingredient to search for (e.g., 'tomato', 'onion', etc.): ")

# Handle case where user presses 'Escape' or 'Cancel' (empty input)
if not user_input_ingredient.strip():  # Check if the input is empty (just whitespace or no input)
    print("No ingredient entered. Please provide a valid ingredient to search for.")
else:
    # Call function to find recipes with heat processed ingredient
    matching_recipes = find_heat_processed_ingredient(mock_df, user_input_ingredient, vegan, vegetarian)

    # Display the result
    if matching_recipes:
        print(f"Recipes with heat processed {user_input_ingredient}:")
        for recipe in matching_recipes:
            print(f"- {recipe}")
    else:
        print(f"No recipes found with heat processed {user_input_ingredient}.")

Recipes with heat processed oil:
- Miso-Butter Roast Chicken
- Thai Red Curry Noodle Soup
- Vegetable Pad Thai
- Fresh Tomato Bruschetta
