### Dataset Cleaning and Processing

Use the ChatGPT or Claude web interface to generate the raw recipe dataset.

Prompt for LLM:

Create a comprehensive recipe dataset in CSV format with the following columns:
recipe_name, cuisine_type, meal_type, difficulty_level, prep_time_minutes, cook_time_minutes, servings, main_ingredients, all_ingredients, dietary_restrictions, instructions, nutritional_info

Generate 500 diverse recipes covering:
- Various cuisines (Italian, Asian, Mexican, Indian, Mediterranean, American, etc.)
- All meal types (breakfast, lunch, dinner, snacks, desserts)
- Different difficulty levels
- Various dietary restrictions
- Mix of cooking times (15 min to 3+ hours)

Format as properly escaped CSV with semicolon separators to handle commas in ingredient lists.
Start with 10 examples to verify format, then continue generating.
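
Since the LLM output is pasted in batch by batch, it can help to check each batch against the requested schema before loading it. The following is a minimal sketch using only the standard library; the function name `check_batch` and the sample row are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "recipe_name", "cuisine_type", "meal_type", "difficulty_level",
    "prep_time_minutes", "cook_time_minutes", "servings", "main_ingredients",
    "all_ingredients", "dietary_restrictions", "instructions", "nutritional_info",
]

def check_batch(csv_text, expected=EXPECTED_COLUMNS, sep=";"):
    """Return a list of problems found in a semicolon-separated CSV batch."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=sep))
    if not rows:
        return ["batch is empty"]
    problems = []
    header = [h.strip() for h in rows[0]]
    if header != expected:
        problems.append(f"unexpected header: {header}")
    # Data rows start on line 2 of the batch
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected):
            problems.append(f"row {line_no}: {len(row)} fields, expected {len(expected)}")
    return problems

# Made-up sample batch: header plus one well-formed row
sample = (
    ";".join(EXPECTED_COLUMNS) + "\n"
    'Avocado Toast;American;Breakfast;Easy;5;0;1;"Bread, Avocado";'
    '"Bread, Avocado, Lemon Juice";"Vegetarian, Vegan";Toast bread. Mash avocado.;'
    '"Calories: 280"\n'
)
print(check_batch(sample))  # []
```

An empty list means the header matches and every row has twelve fields; anything else is worth fixing before the batch is appended to the raw file.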

In [93]:
import pandas as pd
import numpy as np

# Load and clean dataset
df = pd.read_csv('../data/raw_recipes.csv', sep=';')
df.head()

Unnamed: 0,recipe_name,cuisine_type,meal_type,difficulty_level,prep_time_minutes,cook_time_minutes,servings,main_ingredients,all_ingredients,dietary_restrictions,instructions,nutritional_info
0,Spaghetti Carbonara,Italian,Dinner,Medium,15,20,4,"Spaghetti, Eggs, Pancetta, Parmesan","Spaghetti, Eggs, Pancetta, Parmesan, Black Pep...","Contains gluten, dairy, pork",Boil spaghetti until al dente. Fry pancetta un...,"Calories: 520, Protein: 22g, Carbs: 60g, Fat: 22g"
1,Chicken Tikka Masala,Indian,Dinner,Hard,30,45,4,"Chicken, Yogurt, Tomato Sauce, Spices","Chicken, Yogurt, Tomato Paste, Cream, Onion, G...",Contains dairy,Marinate chicken in yogurt and spices. Grill u...,"Calories: 600, Protein: 35g, Carbs: 45g, Fat: 28g"
2,Avocado Toast,American,Breakfast,Easy,5,0,1,"Bread, Avocado","Whole-grain Bread, Ripe Avocado, Lemon Juice, ...","Vegetarian, Vegan",Toast bread slices. Mash avocado with lemon ju...,"Calories: 280, Protein: 6g, Carbs: 28g, Fat: 16g"
3,Sushi Rolls,Japanese,Lunch,Medium,40,0,4,"Sushi Rice, Nori, Fish/Vegetables","Sushi Rice, Rice Vinegar, Sugar, Salt, Nori Sh...","Gluten-free (with tamari), Contains fish","Cook sushi rice, season with vinegar mixture. ...","Calories: 350, Protein: 18g, Carbs: 50g, Fat: 8g"
4,Beef Tacos,Mexican,Dinner,Easy,15,15,4,"Ground Beef, Tortillas","Ground Beef, Taco Seasoning, Corn Tortillas, L...","Contains gluten, dairy (if flour tortillas/che...",Brown beef with seasoning. Warm tortillas. Ass...,"Calories: 420, Protein: 24g, Carbs: 38g, Fat: 18g"


In [94]:
len(df)

500

In [95]:
# Count duplicate rows based on 'recipe_name'
num_duplicates = df.duplicated(subset=['recipe_name']).sum()
print(num_duplicates)

0
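
The check above compares names exactly, so entries that differ only in case or stray whitespace would not be counted. A small sketch on made-up names (the toy frame is an assumption for illustration) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({"recipe_name": ["Beef Tacos", "beef tacos ", "Sushi Rolls"]})

# Exact comparison, as in the cell above
exact = toy.duplicated(subset=["recipe_name"]).sum()

# Comparison after trimming whitespace and lowercasing
normalised = toy["recipe_name"].str.strip().str.lower().duplicated().sum()

print(exact, normalised)  # 0 1
```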


In [96]:
# Check for mismatched dietary labels 
import re

# Define animal products for vegan/vegetarian checks
non_vegan = [
    'chicken', 'beef', 'pork', 'fish', 'shrimp', 'egg', 'cheese', 'milk', 'cream',
    'yogurt', 'butter', 'honey', 'lamb', 'bacon', 'sausage', 'pancetta'
]
non_vegetarian = [
    'chicken', 'beef', 'pork', 'fish', 'shrimp', 'lamb', 'bacon', 'sausage', 'pancetta'
]

def has_any(text, keywords):
    text = str(text).lower()
    return any(re.search(r'\b' + re.escape(word) + r'\b', text) for word in keywords)

# Check for vegan mismatches
vegan_mismatches = df[
    df['dietary_restrictions'].str.contains('vegan', case=False, na=False) &
    df.apply(lambda row: has_any(f"{row['main_ingredients']}, {row['all_ingredients']}", non_vegan), axis=1)
]

# Check for vegetarian mismatches
vegetarian_mismatches = df[
    df['dietary_restrictions'].str.contains('vegetarian', case=False, na=False) &
    df.apply(lambda row: has_any(f"{row['main_ingredients']}, {row['all_ingredients']}", non_vegetarian), axis=1)
]

print("Potentially mislabeled as Vegan:")
print(vegan_mismatches[['recipe_name', 'main_ingredients', 'all_ingredients', 'dietary_restrictions']])

print("\nPotentially mislabeled as Vegetarian:")
print(vegetarian_mismatches[['recipe_name', 'main_ingredients', 'all_ingredients', 'dietary_restrictions']])

Potentially mislabeled as Vegan:
                         recipe_name           main_ingredients  \
13               Mexican Cheese Bowl      Cheese, Chicken, Beef   
16               American Rice Roast      Rice, Tofu, Mushrooms   
20          Thai Tomatoes Power Bowl  Tomatoes, Fish, Mushrooms   
50            American Tomatoes Bake   Tomatoes, Chicken, Pasta   
51   Mediterranean Tomatoes Stir-Fry   Tomatoes, Pasta, Chicken   
..                               ...                        ...   
472           Chinese Pork Dumplings           Pork, Fish, Eggs   
477              Indian Cheese Salad    Cheese, Tofu, Mushrooms   
478        Spanish Eggs Mini Skewers        Eggs, Spinach, Pork   
490            Japanese Fish Skillet        Fish, Chicken, Eggs   
491               American Tofu Bake         Tofu, Shrimp, Fish   

                                       all_ingredients  \
13   Cheese, Chicken, Beef, Honey, Soy Sauce, Spice...   
16   Rice, Tofu, Mushrooms, Cream, Milk, Yogur
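
One caveat with the whole-word pattern used in `has_any`: `\begg\b` does not match the plural `Eggs`, so plural ingredient names can slip through the check. Below is a hedged sketch of a plural-tolerant variant. The name `has_any_plural` is hypothetical, and the optional `(?:s|es)?` suffix is an assumption that only covers regular English plurals.

```python
import re

def has_any_plural(text, keywords):
    """True if any keyword appears as a whole word, optionally pluralised."""
    text = str(text).lower()
    return any(
        re.search(r"\b" + re.escape(word) + r"(?:s|es)?\b", text)
        for word in keywords
    )

print(has_any_plural("Eggs, Spinach", ["egg"]))      # True
print(has_any_plural("Eggplant, Spinach", ["egg"]))  # False
```

Swapping this in for `has_any` would likely change the mismatch counts reported in this notebook, so it is shown only as an option.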

In [97]:
# Fix mismatched dietary labels 

# Helper function to remove a specific dietary label
# (e.g., removing "Vegetarian" from "Vegan, Vegetarian, Gluten-Free" -> "Vegan, Gluten-Free")
def remove_label(labels, label):
    # Remove label from comma-separated string, clean up spaces and commas
    labels = [l.strip() for l in str(labels).split(',')]
    labels = [l for l in labels if l.lower() != label.lower()]
    return ', '.join(labels)

# Keep track of recipes that had labels removed
vegan_removed, vegetarian_removed = [], []

# For each row, combine the main and all ingredients into a single string
for idx, row in df.iterrows():
    ingredients = f"{row['main_ingredients']}, {row['all_ingredients']}"
    labels = row['dietary_restrictions']

    # Remove 'Vegan' if animal products present
    if 'vegan' in str(labels).lower() and has_any(ingredients, non_vegan):
        df.at[idx, 'dietary_restrictions'] = remove_label(labels, 'Vegan')
        vegan_removed.append(row['recipe_name'])

    # Remove 'Vegetarian' if animal flesh present
    if 'vegetarian' in str(labels).lower() and has_any(ingredients, non_vegetarian):
        df.at[idx, 'dietary_restrictions'] = remove_label(df.at[idx, 'dietary_restrictions'], 'Vegetarian')
        vegetarian_removed.append(row['recipe_name'])

print(f"Removed 'Vegan' label from {len(vegan_removed)} recipes: {vegan_removed[:10]}")
print(f"Removed 'Vegetarian' label from {len(vegetarian_removed)} recipes: {vegetarian_removed[:10]}")

df.to_csv('../data/recipes_clean.csv', index=False)

Removed 'Vegan' label from 107 recipes: ['Mexican Cheese Bowl', 'American Rice Roast', 'Thai Tomatoes Power Bowl', 'American Tomatoes Bake', 'Mediterranean Tomatoes Stir-Fry', 'Mediterranean Rice Stew', 'Vietnamese Pasta Pilaf', 'French Fish Bowl', 'African Beef Stir-Fry', 'American Mushrooms Cookies']
Removed 'Vegetarian' label from 74 recipes: ['Mexican Cheese Bowl', 'Indian Eggs Salad', 'Chinese Chicken Soup', 'American Pork Pilaf', 'Mexican Beef Burrito Bowl', 'Mediterranean Fish Orzo Salad', 'Thai Lentils Larb', 'Middle Eastern Spinach Shawarma', 'Mediterranean Rice Stew', 'Vietnamese Chicken Crumble']
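
A recipe whose only label was `Vegan` or `Vegetarian` ends up with an empty `dietary_restrictions` string after this step, which is what the empty-string count further down picks up. A small self-contained illustration, repeating the `remove_label` helper on made-up label strings:

```python
def remove_label(labels, label):
    """Drop `label` (case-insensitive) from a comma-separated label string."""
    parts = [l.strip() for l in str(labels).split(',')]
    parts = [l for l in parts if l.lower() != label.lower()]
    return ', '.join(parts)

print(repr(remove_label("Vegan, Vegetarian, Gluten-Free", "Vegetarian")))  # 'Vegan, Gluten-Free'
print(repr(remove_label("Vegan", "Vegan")))                                # ''
```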


In [None]:
# Check recipes with short instructions
nonsensical = df[df['instructions'].str.len() < 20]
print("Recipes with short instructions:", nonsensical['recipe_name'].tolist())

Recipes with short instructions: []


In [99]:
# Count empty strings in each column
empty_counts = (df == '').sum()
print("Empty string counts per column:")
print(empty_counts)

Empty string counts per column:
recipe_name              0
cuisine_type             0
meal_type                0
difficulty_level         0
prep_time_minutes        0
cook_time_minutes        0
servings                 0
main_ingredients         0
all_ingredients          0
dietary_restrictions    23
instructions             0
nutritional_info         0
dtype: int64


In [100]:
# Drop rows whose dietary_restrictions became an empty string after the label fix,
# since these fields are later fed to scikit-learn's vectorizers
df = df[df['dietary_restrictions'] != '']
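
The comparison with `''` only catches exact empty strings. If whitespace-only or missing values are also a concern, a sketch along these lines treats all three the same way. The toy frame and the name `is_blank` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "recipe_name": ["A", "B", "C", "D"],
    "dietary_restrictions": ["Vegan", "", "   ", None],
})

# Blank means: missing, empty, or whitespace-only
is_blank = toy["dietary_restrictions"].fillna("").str.strip().eq("")
cleaned = toy[~is_blank].reset_index(drop=True)

print(cleaned["recipe_name"].tolist())  # ['A']
```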

In [101]:
empty_counts = (df == '').sum()
print("Empty string counts per column:")
print(empty_counts)

Empty string counts per column:
recipe_name             0
cuisine_type            0
meal_type               0
difficulty_level        0
prep_time_minutes       0
cook_time_minutes       0
servings                0
main_ingredients        0
all_ingredients         0
dietary_restrictions    0
instructions            0
nutritional_info        0
dtype: int64


In [102]:
df.to_csv('../data/recipes_clean.csv', index=False)
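
One reason it is convenient to drop the empty labels before saving: pandas writes an empty string as an empty CSV field, and `read_csv` turns that back into `NaN` by default. The sketch below shows the round trip on a toy frame, using an in-memory buffer instead of a file (both are assumptions for illustration).

```python
import io
import pandas as pd

toy = pd.DataFrame({"recipe_name": ["A", "B"], "dietary_restrictions": ["Vegan", ""]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["dietary_restrictions"].isna().tolist())  # [False, True]

# keep_default_na=False preserves the empty string instead
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)
print(kept["dietary_restrictions"].tolist())  # ['Vegan', '']
```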