# Plan

## Target Table Structure
- `title`: Name of the recipe
- `ingredients_raw`: Original ingredients list
- `ingredients_processed`: JSON array of ingredients with heat processing status
- `instructions`: Recipe cooking instructions
- `language`: Code for the language that the recipe is in
- `heat_processed`: Boolean indicating whether the recipe involves heat processing (not used in the end)
- `cuisine_tags`: JSON array of cuisine classifications
- `vegan`: Boolean indicating if recipe is vegan
- `vegetarian`: Boolean indicating if recipe is vegetarian
- `processing_error`: Information about any error that occurred during the LLM analysis



## Implementation Steps

### Phase 1: Basic Processing
1. **[Data Preparation](#data-preparation)**
   - Load and clean the original CSV dataset
   - Create empty columns for new features

2. **[Heat Processing Detection](#heat-processing-detection)**
   - Use keyword matching to determine if recipes involve heat (bake, fry, roast, etc.)
   - Set `heat_processed` boolean flag for each recipe

3. **[Dietary Classification](#dietary-classification)**
   - Create dictionaries of non-vegetarian/non-vegan ingredients
   - Analyze ingredients to set `vegan` and `vegetarian` boolean flags

4. **[Batching the recipes for LLM Analysis](#batching-the-recipes-for-llm-analysis)**
   - Create batches of 200 recipes
   - Create 2 test batches of 10 recipes each to check that everything works before the full run


### Phase 2: Advanced Processing (LLM)
1. **[LLM Analysis of Ingredient Heat Processing and Cuisine Classification](#llm-analysis-of-ingredient-heat-processing-and-cuisine-classification)**
   For each batch:
   - Load the language model and tokenizer
   - Define the prompt template
   - Process each recipe batch by iterating through the df
   - Generate and decode the LLM's response.
   - Parse the JSON output from the response to extract the heat processing status for ingredients and the assigned cuisine tags.
   - Update the df with the extracted JSON data
   - Save the processed batch df to a CSV file.

2. **[Merging Recipes](#merging-recipes)**
   - Combine all of the CSV files into one file
   - Drop rows with missing values caused by LLM errors

### Imports

In [1]:
import ast
import json
import pandas as pd
import os
import glob

---

# Phase 1: Basic Processing

### Data Preparation
   - Load and clean the original CSV dataset
   - Drop the URL column and rename Title, Ingredients, Language, and Instructions for consistency with the code previously used on a sample CSV
   - Create empty columns for new features

Read the table and add the new columns.

In [None]:
# Read the CSV
df = pd.read_csv('../recipes/english_recipes.csv', header=0, names=['Title', 'Ingredients', 'Instructions', 'URL', 'Language'])

# Drop URL column and rename columns
df = df.drop(['URL'], axis=1)
df = df.rename(columns={
    'Title': 'title',
    'Ingredients': 'ingredients_raw',
    'Language': 'language',
    'Instructions': 'instructions'
})

# Create empty columns for new features
df['ingredients_processed'] = None
df['heat_processed'] = False
df['cuisine_tags'] = None
df['vegan'] = False
df['vegetarian'] = False

# Reorder columns
df = df[['title', 'ingredients_raw', 'ingredients_processed', 'instructions', 'language', 'heat_processed', 'cuisine_tags', 'vegan', 'vegetarian']]

df.head()

Unnamed: 0,title,ingredients_raw,ingredients_processed,instructions,language,heat_processed,cuisine_tags,vegan,vegetarian
0,Easy Salad Recipes: Chopped Salad & More,"['1 recipe Homemade Italian Dressing', '1 Roma...",,Make the Homemade Italian Dressing.\nChop roma...,en,False,,False,False
1,30 Easy Salad Dressing Recipes,"['2 tablespoons aged balsamic vinegar', '2 tab...",,For the balsamic vinaigrette\nIn a medium bowl...,en,False,,False,False
2,21 Green Salad Recipes,"['1 recipe Homemade Italian Dressing', '1 Roma...",,Make the Homemade Italian Dressing.\nChop roma...,en,False,,False,False
3,Winter Salad with Pear & Greens,"['1 head radicchio, washed, dried and torn int...",,"Make the dressing: In a medium bowl, whisk tog...",en,False,,False,False
4,15 Easy Vegan Salads,"['1/2 recipe Homemade Croutons', '1 romaine he...",,"If using, make the Homemade Croutons.\nWash an...",en,False,,False,False


Check how many recipes are in the DataFrame.

In [None]:
df.count()

title                    3554
ingredients_raw          3554
ingredients_processed       0
instructions             3554
language                 3554
heat_processed           3554
cuisine_tags                0
vegan                    3554
vegetarian               3554
dtype: int64

Drop rows with missing values.

In [None]:
# Columns to check; ingredients_processed and cuisine_tags are excluded because they are intentionally empty
columns_to_check = [col for col in df.columns if col not in ['ingredients_processed', 'cuisine_tags']]

# Show missing values in these columns before dropping
print("Missing values before dropping rows:")
print(df[columns_to_check].isnull().sum())

# Count total rows before dropping
rows_before = len(df)

# Drop rows with missing values in any of the specified columns
df = df.dropna(subset=columns_to_check)

# Count total rows after dropping
rows_after = len(df)

# Show how many rows were dropped
print(f"\nRows before: {rows_before}")
print(f"Rows after: {rows_after}")
print(f"Rows dropped: {rows_before - rows_after}")


Missing values before dropping rows:
title              0
ingredients_raw    0
instructions       0
language           0
heat_processed     0
vegan              0
vegetarian         0
dtype: int64

Rows before: 3554
Rows after: 3554
Rows dropped: 0


### Heat Processing Detection
   - Use keyword matching to determine if recipes involve heat (bake, fry, roast, etc.)
   - Set `heat_processed` boolean flag for each recipe

In [None]:
# Heat-related keywords
heat_keywords = ['bake', 'barbecue', 'blacken', 'blanch', 'blister', 'boil', 'braise',
                'broil', 'brown', 'bubble', 'burn', 'caramelize', 'char', 'coddle',
                'confit', 'convection', 'cook', 'crisp', 'crock pot', 'crust', 'deep-fry',
                'deglaze', 'double-boil', 'fire', 'flame', 'flambé', 'flash-fry', 'fry',
                'griddle', 'grill', 'hard-boil', 'heat', 'hot', 'hot-smoke', 'induction',
                'infuse', 'melt', 'microwave', 'oven', 'pan-fry', 'pan-sear', 'parboil',
                'poach', 'preheat', 'pressure cook', 'quick-broil', 'reduce', 'reheat',
                'render', 'roast', 'rotisserie', 'salamander', 'sauté', 'scald', 'scorch',
                'sear', 'shallow-fry', 'simmer', 'sizzle', 'skillet', 'slow cook', 'smoke',
                'smoke-roast', 'soft-boil', 'sous vide', 'steam', 'steep', 'stew',
                'stir-fry', 'sweat', 'temper', 'toast', 'torch', 'warm', 'water bath', 'wok']


# Check if heat keywords are in the instructions and update heat_processed column
df['heat_processed'] = df['instructions'].str.lower().apply(
    lambda x: any(keyword in x for keyword in heat_keywords)
)

df.head()


Unnamed: 0,title,ingredients_raw,ingredients_processed,instructions,language,heat_processed,cuisine_tags,vegan,vegetarian
0,Easy Salad Recipes: Chopped Salad & More,"['1 recipe Homemade Italian Dressing', '1 Roma...",,Make the Homemade Italian Dressing.\nChop roma...,en,True,,False,False
1,30 Easy Salad Dressing Recipes,"['2 tablespoons aged balsamic vinegar', '2 tab...",,For the balsamic vinaigrette\nIn a medium bowl...,en,True,,False,False
2,21 Green Salad Recipes,"['1 recipe Homemade Italian Dressing', '1 Roma...",,Make the Homemade Italian Dressing.\nChop roma...,en,True,,False,False
3,Winter Salad with Pear & Greens,"['1 head radicchio, washed, dried and torn int...",,"Make the dressing: In a medium bowl, whisk tog...",en,True,,False,False
4,15 Easy Vegan Salads,"['1/2 recipe Homemade Croutons', '1 romaine he...",,"If using, make the Homemade Croutons.\nWash an...",en,False,,False,False
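
The cell above uses plain substring matching, so a short keyword such as `hot` also matches inside `shot`, and `char` inside `chard`. A minimal sketch of a whole-word alternative follows. It is not the method used in this notebook: the names `sample_heat_keywords`, `keyword_to_regex`, `build_heat_pattern`, and `has_heat_keyword` are hypothetical, and the suffix rules only cover regular English verb endings.

```python
import re

# Shortened keyword list for illustration (an assumption);
# the notebook's full `heat_keywords` list could be passed instead.
sample_heat_keywords = ['bake', 'boil', 'char', 'fry', 'heat', 'hot', 'roast', 'simmer']

def keyword_to_regex(keyword):
    """Return a regex fragment for a keyword plus a few regular English endings."""
    escaped = re.escape(keyword)
    if keyword.endswith('e'):      # bake -> bakes, baked, baking
        return escaped[:-1] + r'(?:e|es|ed|ing)'
    if keyword.endswith('y'):      # fry -> fries, fried, frying
        return escaped[:-1] + r'(?:y|ies|ied|ying)'
    return escaped + r'(?:s|es|ed|ing)?'   # boil -> boils, boiled, boiling

def build_heat_pattern(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole word."""
    fragments = sorted((keyword_to_regex(k) for k in keywords), key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(fragments) + r')\b', re.IGNORECASE)

heat_pattern = build_heat_pattern(sample_heat_keywords)

def has_heat_keyword(instructions):
    """True if any heat keyword occurs as a whole word in the instructions."""
    return heat_pattern.search(instructions) is not None

print(has_heat_keyword("Take a shot of espresso and chop the chard."))  # False
print(has_heat_keyword("Baking time: 20 minutes."))                     # True
```

On the DataFrame this could be applied as `df['instructions'].apply(has_heat_keyword)`. Irregular forms (for example doubled consonants as in "charred") are still missed and would need extra entries or a stemmer.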


### Dietary Classification
   - Create dictionaries of non-vegetarian/non-vegan ingredients
   - Analyze ingredients to set `vegan` and `vegetarian` boolean flags

In [None]:
# Non-vegetarian ingredients
non_vegetarian_ingredients = ['meat', 'chicken', 'beef', 'pork', 'lamb', 'veal', 'turkey',
                            'duck', 'goose', 'fish', 'salmon', 'tuna', 'shrimp', 'prawn',
                            'crab', 'lobster', 'oyster', 'mussel', 'clam', 'scallop',
                            'anchovy', 'bacon', 'ham', 'sausage', 'gelatin', 'lard',
                            'suet', 'stock', 'broth', 'venison', 'rabbit', 'quail', 'pheasant',
                            'bison', 'buffalo', 'elk', 'deer', 'squab', 'liver', 'kidney',
                            'heart', 'tongue', 'tripe', 'sweetbread', 'foie gras', 'caviar',
                            'cod', 'halibut', 'tilapia', 'sardine', 'herring', 'mackerel',
                            'catfish', 'trout', 'flounder', 'mahi mahi', 'swordfish', 'eel',
                            'octopus', 'squid', 'calamari', 'chorizo', 'pepperoni', 'salami',
                            'prosciutto', 'pancetta', 'bologna', 'hot dog', 'jerky', 'pate',
                            'bone marrow', 'bone broth', 'animal fat', 'tallow', 'schmaltz',
                            'collagen', 'isinglass', 'rennet', 'animal shortening', 'cochineal',
                            'carmine', 'shellac', 'confectioner\'s glaze', 'omega-3', 'fish sauce',
                            'worcestershire sauce', 'caesar dressing', 'dashi', 'katsuobushi',
                            'bonito', 'escargot', 'frog legs']

# Additional non-vegan ingredients (on top of non-vegetarian ones)
non_vegan_ingredients = ['egg', 'milk', 'cream', 'butter', 'cheese', 'yogurt', 'honey',
                        'mayonnaise', 'whey', 'casein', 'ghee', 'lactose', 'rennet',
                        'albumin', 'carmine', 'shellac', 'royal jelly', 'beeswax', 'propolis',
                        'bee pollen', 'buttermilk', 'kefir', 'sour cream', 'ice cream',
                        'custard', 'pudding', 'creme fraiche', 'mascarpone', 'ricotta',
                        'cottage cheese', 'quark', 'paneer', 'egg white', 'egg yolk',
                        'meringue', 'hollandaise', 'marshmallow', 'frosting', 'nougat',
                        'whipped cream', 'condensed milk', 'evaporated milk', 'powdered milk',
                        'milk solids', 'milk protein', 'lactose', 'caseinates', 'lactoferrin',
                        'lactitol', 'lactoglobulin', 'lactalbumin', 'recaldent', 'curds',
                        'vitamin D3', 'lanolin', 'pepsin', 'trypsin', 'glycerin', 'glycerol',
                        'stearic acid', 'oleic acid', 'capric acid', 'myristic acid',
                        'palmitic acid', 'l-cysteine', 'keratin', 'elastin', 'cetyl alcohol',
                        'cholesterol', 'lecithin', 'mono and diglycerides', 'natural flavor',
                        'e120', 'e441', 'e542', 'e631', 'e901', 'e904', 'e910', 'e920',
                        'e921', 'e966', 'e1105']


# Lowercase the term lists once, so that mixed-case entries such as 'vitamin D3' can match the lowercased text
non_vegetarian_terms = [term.lower() for term in non_vegetarian_ingredients]
non_vegan_terms = non_vegetarian_terms + [term.lower() for term in non_vegan_ingredients]

# Check if ingredients contain non-vegetarian/non-vegan items and update columns
df['vegetarian'] = df['ingredients_raw'].str.lower().apply(
    lambda x: not any(term in x for term in non_vegetarian_terms)
)

df['vegan'] = df['ingredients_raw'].str.lower().apply(
    lambda x: not any(term in x for term in non_vegan_terms)
)

df.head()

Unnamed: 0,title,ingredients_raw,ingredients_processed,instructions,language,heat_processed,cuisine_tags,vegan,vegetarian
0,Easy Salad Recipes: Chopped Salad & More,"['1 recipe Homemade Italian Dressing', '1 Roma...",,Make the Homemade Italian Dressing.\nChop roma...,en,True,,False,False
1,30 Easy Salad Dressing Recipes,"['2 tablespoons aged balsamic vinegar', '2 tab...",,For the balsamic vinaigrette\nIn a medium bowl...,en,True,,False,False
2,21 Green Salad Recipes,"['1 recipe Homemade Italian Dressing', '1 Roma...",,Make the Homemade Italian Dressing.\nChop roma...,en,True,,False,False
3,Winter Salad with Pear & Greens,"['1 head radicchio, washed, dried and torn int...",,"Make the dressing: In a medium bowl, whisk tog...",en,True,,False,True
4,15 Easy Vegan Salads,"['1/2 recipe Homemade Croutons', '1 romaine he...",,"If using, make the Homemade Croutons.\nWash an...",en,False,,False,False
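
Substring matching on the raw ingredient string also flags `egg` inside `eggplant` and `ham` inside `champagne`. The sketch below shows one possible refinement, not the approach used in this notebook: it parses the stringified list in `ingredients_raw` with `ast.literal_eval` and matches whole words with an optional plural. The short term lists and the helper names (`parse_ingredients`, `contains_any`, `classify`) are hypothetical.

```python
import ast
import re

# Shortened term lists for illustration (assumptions); the full lists above could be used instead.
sample_non_vegetarian = ['chicken', 'ham', 'bacon', 'anchovy']
sample_non_vegan = ['egg', 'butter', 'honey']

def parse_ingredients(raw):
    """Turn the stringified list stored in `ingredients_raw` into lowercase strings."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        items = [raw]  # fall back to treating the whole string as one ingredient
    if not isinstance(items, (list, tuple)):
        items = [items]
    return [str(item).lower() for item in items]

def contains_any(ingredients, terms):
    """True if any term occurs as a whole word (optionally pluralised) in any ingredient."""
    alternatives = '|'.join(re.escape(term.lower()) for term in terms)
    pattern = re.compile(r'\b(?:' + alternatives + r')(?:s|es)?\b')
    return any(pattern.search(item) for item in ingredients)

def classify(raw):
    """Return vegetarian/vegan flags for one raw ingredient string."""
    ingredients = parse_ingredients(raw)
    vegetarian = not contains_any(ingredients, sample_non_vegetarian)
    vegan = vegetarian and not contains_any(ingredients, sample_non_vegan)
    return {'vegetarian': vegetarian, 'vegan': vegan}

print(classify("['1 eggplant', '2 tbsp champagne vinegar']"))
# {'vegetarian': True, 'vegan': True}
print(classify("['2 eggs', '1 cup flour']"))
# {'vegetarian': True, 'vegan': False}
```

Whole-word matching does not resolve every false positive: a term such as `heart` still matches "romaine heart", and `stock` still matches "vegetable stock", so the term lists themselves would also need review.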


### Batching the recipes for LLM Analysis

Create batches of 200 recipes.

In [None]:
# Relative path, matching the paths printed in the output below
main_batches_output_dir = 'recipes/batched_recipes'
main_batch_size = 200

# Create the output directory for main batches if it doesn't exist
os.makedirs(main_batches_output_dir, exist_ok=True)

# Calculate number of batches for the main run
num_recipes_main = len(df)
num_main_batches = (num_recipes_main + main_batch_size - 1) // main_batch_size
print(f"Total recipes for main batches: {num_recipes_main}")
print(f"Dividing into {num_main_batches} batches of size {main_batch_size}")

# Split and save main batches
for i in range(num_main_batches):
    # Calculate the start and end index for the current batch
    start_index = i * main_batch_size
    # Ensure the end index does not exceed the total number of recipes
    end_index = min((i + 1) * main_batch_size, num_recipes_main)
    # Select the slice of the DataFrame for the current batch
    batch_df = df.iloc[start_index:end_index]

    # Define the output filename for this batch
    batch_filename = f'recipes_batch_{i+1:04d}.csv'
    output_path = os.path.join(main_batches_output_dir, batch_filename)

    # Save the batch DataFrame to CSV
    batch_df.to_csv(output_path, index=False)
    print(f"Saved main batch {i+1}/{num_main_batches} to {output_path}")

print("Batch splitting completed")

Total recipes for main batches: 3554
Dividing into 18 batches of size 200
Saved main batch 1/18 to recipes/batched_recipes/recipes_batch_0001.csv
Saved main batch 2/18 to recipes/batched_recipes/recipes_batch_0002.csv
Saved main batch 3/18 to recipes/batched_recipes/recipes_batch_0003.csv
Saved main batch 4/18 to recipes/batched_recipes/recipes_batch_0004.csv
Saved main batch 5/18 to recipes/batched_recipes/recipes_batch_0005.csv
Saved main batch 6/18 to recipes/batched_recipes/recipes_batch_0006.csv
Saved main batch 7/18 to recipes/batched_recipes/recipes_batch_0007.csv
Saved main batch 8/18 to recipes/batched_recipes/recipes_batch_0008.csv
Saved main batch 9/18 to recipes/batched_recipes/recipes_batch_0009.csv
Saved main batch 10/18 to recipes/batched_recipes/recipes_batch_0010.csv
Saved main batch 11/18 to recipes/batched_recipes/recipes_batch_0011.csv
Saved main batch 12/18 to recipes/batched_recipes/recipes_batch_0012.csv
Saved main batch 13/18 to recipes/batched_recipes/recipes_b
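
The batch count in the cell above comes from integer ceiling division. As a small worked check of that arithmetic, the sketch below recomputes the slice bounds for 3554 rows and a batch size of 200; `batch_bounds` is a hypothetical helper name, not part of the notebook.

```python
import math

def batch_bounds(num_rows, batch_size):
    """Return (start, end) index pairs covering num_rows in chunks of batch_size."""
    # Ceiling division without floats: equivalent to math.ceil(num_rows / batch_size)
    num_batches = (num_rows + batch_size - 1) // batch_size
    return [
        (i * batch_size, min((i + 1) * batch_size, num_rows))
        for i in range(num_batches)
    ]

bounds = batch_bounds(3554, 200)
print(len(bounds))                 # 18
print(math.ceil(3554 / 200))       # 18
print(bounds[-1])                  # (3400, 3554)
print(bounds[-1][1] - bounds[-1][0])  # 154
```

The first 17 batches hold 200 recipes each and the last one holds the remaining 154.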

Create test batches: 2 batches of 10 recipes.

In [None]:
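# Configuration for test batches (relative path, matching the paths printed in the output below)
test_batches_output_dir = 'recipes/testing_batch'
test_batch_size = 10
num_test_batches_to_save = 2

# Create the output directory for test batches if it doesn't exist
os.makedirs(test_batches_output_dir, exist_ok=True)

total_test_recipes_needed = test_batch_size * num_test_batches_to_save

# Use the first recipes of the DataFrame as the test subset
test_df_subset = df.head(total_test_recipes_needed).copy()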

for i in range(num_test_batches_to_save):
    # Calculate the start and end index for the current test batch from the subset
    start_index = i * test_batch_size
    end_index = min((i + 1) * test_batch_size, len(test_df_subset))
    test_batch_df = test_df_subset.iloc[start_index:end_index]

    # Define the output filename for this test batch
    test_batch_filename = f'test_batch_{i+1:04d}.csv'
    output_path = os.path.join(test_batches_output_dir, test_batch_filename)

    # Save the test batch DataFrame to CSV
    test_batch_df.to_csv(output_path, index=False)
    print(f"Saved test batch {i+1}/{num_test_batches_to_save} to {output_path}")

print("Specific test batch splitting complete.")


Saved test batch 1/2 to recipes/testing_batch/test_batch_0001.csv
Saved test batch 2/2 to recipes/testing_batch/test_batch_0002.csv
Specific test batch splitting complete.


---

# Phase 2: Advanced Processing (LLM)

1. LLM Analysis of Ingredient Heat Processing and Cuisine Classification
   For each batch:
   - Load the language model and tokenizer
   - Define the prompt template
   - Process each recipe batch by iterating through the df
   - Generate and decode the LLM's response.
   - Parse the JSON output from the response to extract the heat processing status for ingredients and the assigned cuisine tags.
   - Update the df with the extracted JSON data
   - Save the processed batch df to a CSV file.

The steps above are carried out in `process_batch.py`.

2. Merging Recipes
   - Combine all of the CSV files into one file
   - Drop rows with missing values caused by LLM errors
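
The JSON-parsing step in item 1 can be sketched as follows. `process_batch.py` is not shown in this notebook, so this is only an illustration under assumptions: the function name `extract_first_json_object` is hypothetical, and it assumes the decoded response contains a single JSON object, possibly surrounded by extra text.

```python
import json

def extract_first_json_object(response_text):
    """Return (parsed_object, error_message) for the first valid JSON object in the text."""
    decoder = json.JSONDecoder()
    start = response_text.find('{')
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            parsed, _end = decoder.raw_decode(response_text, start)
            return parsed, None
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next opening brace
            start = response_text.find('{', start + 1)
    return None, 'no valid JSON object found'

response = 'Sure! Here is the result:\n{"cuisine_tags": ["Italian", "French"]}\nHope this helps.'
parsed, error = extract_first_json_object(response)
print(parsed)   # {'cuisine_tags': ['Italian', 'French']}
print(error)    # None
```

When parsing fails, the returned error message is the kind of value that could be stored in the `processing_error` column, leaving `ingredients_processed` and `cuisine_tags` empty for that row.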

### Merging Recipes

Merge the CSV output files into one file.

In [2]:
# Directory with processed CSV files
INPUT_DIR = "LLM/batched_recipes_results"

# Output file
OUTPUT_FILE = "merged_final_results.csv"

# Find all files ending with .csv in the input directory (sorted so the row order is reproducible)
csv_files = sorted(glob.glob(os.path.join(INPUT_DIR, '*.csv')))

# Filter out the output file itself if it exists from a previous run
if os.path.exists(OUTPUT_FILE):
    csv_files = [f for f in csv_files if os.path.abspath(f) != os.path.abspath(OUTPUT_FILE)]


print(f"Found {len(csv_files)} CSV files to merge.")

# List to hold DataFrames
df_list = []

# Read each CSV file
for file_path in csv_files:
    try:
        df = pd.read_csv(file_path)
        df_list.append(df)
    except Exception as e:
        print(f"Warning: Could not read {file_path}. Skipping. Error: {e}")

# Check if any DataFrames were loaded; stop this cell without shutting down the notebook kernel
if not df_list:
    raise SystemExit("No valid DataFrames were loaded. Cannot merge.")

print("Concatenating DataFrames...")
# Concatenate all DataFrames
merged_df = pd.concat(df_list, ignore_index=True)

print(f"Successfully merged {len(df_list)} files into a single DataFrame with {len(merged_df)} rows.")

# Save the merged DataFrame to CSV
try:
    merged_df.to_csv(OUTPUT_FILE, index=False)
    print(f"Merged results saved successfully to {OUTPUT_FILE}")
except Exception as e:
    print(f"Error saving merged results: {e}")


print("Local merge script finished.")

Found 12 CSV files to merge.
Concatenating DataFrames...
Successfully merged 12 files into a single DataFrame with 2400 rows.
Merged results saved successfully to merged_final_results.csv
Local merge script finished.
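
The merge output above reports 12 files and 2400 rows, while the batching step reported 18 batches. A quick way to see which batches have no result file is sketched below. The naming of the result files is not shown in this notebook, so the sketch assumes they keep the `batch_NNNN` numbering of the input batches; `missing_batch_numbers` and the example paths are hypothetical.

```python
import os
import re

def missing_batch_numbers(result_paths, expected_batches):
    """Return the batch numbers in 1..expected_batches that have no matching result file."""
    found = set()
    for path in result_paths:
        # Assumption: the batch number follows 'batch_' in the file name
        match = re.search(r'batch_(\d+)', os.path.basename(path))
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(1, expected_batches + 1)) - found)

# Hypothetical file names, for illustration only
example_paths = [
    'LLM/batched_recipes_results/recipes_batch_0001.csv',
    'LLM/batched_recipes_results/recipes_batch_0003.csv',
]
print(missing_batch_numbers(example_paths, 4))  # [2, 4]
```

With the notebook's variables this could be called as `missing_batch_numbers(csv_files, 18)`.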


Drop rows that have missing values after the LLM analysis (errors in LLM processing).

In [3]:
# Count missing values in each column
missing_values = merged_df.isnull().sum()

# Print total number of rows with missing values in ingredients_processed or cuisine_tags
rows_with_missing = merged_df[['ingredients_processed', 'cuisine_tags']].isnull().any(axis=1).sum()
print(f"Number of rows with missing values in ingredients_processed or cuisine_tags: {rows_with_missing}")

print("\nMissing values by column:")
print(missing_values[missing_values > 0])

merged_df_clean = merged_df.dropna(subset=['ingredients_processed', 'cuisine_tags'])

print(f"\nNumber of rows before cleaning: {len(merged_df)}")
print(f"Number of rows after cleaning: {len(merged_df_clean)}")

# Save back to CSV
merged_df_clean.to_csv('merged_final_results.csv', index=False)
print("\nCleaned data saved back to merged_final_results.csv")

Number of rows with missing values in ingredients_processed or cuisine_tags: 52

Missing values by column:
ingredients_processed      52
cuisine_tags               52
processing_error         2348
dtype: int64

Number of rows before cleaning: 2400
Number of rows after cleaning: 2348

Cleaned data saved back to merged_final_results.csv


In [4]:
print(merged_df_clean.count())
merged_df_clean.head()

title                    2348
ingredients_raw          2348
ingredients_processed    2348
instructions             2348
language                 2348
heat_processed           2348
cuisine_tags             2348
vegan                    2348
vegetarian               2348
processing_error            0
dtype: int64


Unnamed: 0,title,ingredients_raw,ingredients_processed,instructions,language,heat_processed,cuisine_tags,vegan,vegetarian,processing_error
0,Easy Salad Recipes: Chopped Salad & More,"['1 recipe Homemade Italian Dressing', '1 Roma...","[{""ingredient"": ""Romaine heart"", ""heat_process...",Make the Homemade Italian Dressing.\nChop roma...,en,True,"[""Italian"", ""Fusion"", ""Island""]",False,False,
1,30 Easy Salad Dressing Recipes,"['2 tablespoons aged balsamic vinegar', '2 tab...","[{""ingredient"": ""balsamic vinegar"", ""heat_proc...",For the balsamic vinaigrette\nIn a medium bowl...,en,True,"[""Italian"", ""Caesar"", ""French""]",False,False,
2,21 Green Salad Recipes,"['1 recipe Homemade Italian Dressing', '1 Roma...","[{""ingredient"": ""Romaine heart"", ""heat_process...",Make the Homemade Italian Dressing.\nChop roma...,en,True,"[""Italian"", ""Fusion"", ""Island""]",False,False,
3,Winter Salad with Pear & Greens,"['1 head radicchio, washed, dried and torn int...","[{""ingredient"": ""radicchio"", ""heat_processed"":...","Make the dressing: In a medium bowl, whisk tog...",en,True,"[""Italian"", ""French"", ""Salad""]",False,True,
4,15 Easy Vegan Salads,"['1/2 recipe Homemade Croutons', '1 romaine he...","[{""ingredient"": ""Homemade Croutons"", ""heat_pro...","If using, make the Homemade Croutons.\nWash an...",en,False,"[""Italian"", ""American"", ""Salad""]",False,False,
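
After the CSV is read back, `cuisine_tags` holds JSON arrays stored as strings, so each cell has to be decoded before the tags can be analysed. A minimal sketch, using the first three `cuisine_tags` values shown above as sample input; `count_cuisine_tags` is a hypothetical helper name.

```python
import json
from collections import Counter

def count_cuisine_tags(tag_cells):
    """Decode each JSON-array string and count how often every tag occurs."""
    counts = Counter()
    for cell in tag_cells:
        counts.update(json.loads(cell))
    return counts

# First three cuisine_tags values from the table above
sample_cells = [
    '["Italian", "Fusion", "Island"]',
    '["Italian", "Caesar", "French"]',
    '["Italian", "Fusion", "Island"]',
]
counts = count_cuisine_tags(sample_cells)
print(counts['Italian'])  # 3
print(counts['Fusion'])   # 2
```

On the cleaned DataFrame the same helper could be called as `count_cuisine_tags(merged_df_clean['cuisine_tags'])`.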
