# Interactive Recipe & Kitchen Management Assistant

## Step 1: Data Source & Setup

This notebook implements the first step of our Interactive Recipe & Kitchen Management Assistant capstone project for the Google Gen AI Intensive Course. We'll acquire, explore, and prepare the recipe dataset that will serve as the foundation for our recipe retrieval and customization system.

### Project Overview

The Interactive Recipe & Kitchen Management Assistant helps users:
1. Discover recipes based on available ingredients
2. Customize recipes according to dietary needs
3. Receive step-by-step cooking guidance

This assistant will use multiple Gen AI capabilities including:
- Audio understanding (for voice input)
- Few-shot prompting (for recipe customization)
- Function calling (for specific recipe operations)
- RAG (Retrieval-Augmented Generation) for recipe knowledge
- Grounding (using web search for supplemental information)

## Setup Environment

Let's start by installing and importing the necessary libraries for data processing.

In [None]:
# Install required libraries
!pip install -q pandas matplotlib seaborn 

# Install dependencies as needed:
!pip install kagglehub[pandas-datasets]
# Uncomment if you need to download the dataset via Kaggle API
# !pip install -q kaggle
# !mkdir -p ~/.kaggle
# !echo '{"username":"YOUR_USERNAME","key":"YOUR_KEY"}' > ~/.kaggle/kaggle.json
# !chmod 600 ~/.kaggle/kaggle.json
# !kaggle datasets download -d shuyangli94/food-com-recipes-and-user-interactions

## Importing the Dataset in Kaggle

If you're working in a Kaggle notebook, you can import the Food.com Recipes dataset directly:

1. Search for "Food.com Recipes and User Interactions" in the Kaggle datasets section
2. Or use this direct link: https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions

In Kaggle, you can either:
- Add the dataset to your notebook directly from the "Add data" button in the right sidebar
- Use the Kaggle datasets API, as shown in the commented-out commands in the install cell above
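
Either route ends with the CSV files on disk, so later cells only need a path. A minimal sketch of resolving that path, assuming the "Add data" route mounts the dataset under the `/kaggle/input/` directory used later in this notebook. `find_dataset_file` and the `./data` fallback directory are hypothetical names introduced for illustration:

```python
from pathlib import Path

# Directory where the attached dataset appears (same path the loading cell uses).
KAGGLE_INPUT = Path("/kaggle/input/food-com-recipes-and-user-interactions")
# Hypothetical local fallback, e.g. where a manual download was unzipped.
LOCAL_FALLBACK = Path("./data")


def find_dataset_file(filename, search_dirs=(KAGGLE_INPUT, LOCAL_FALLBACK)):
    """Return the first existing path for `filename`, or None if it is not found."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


recipes_path = find_dataset_file("RAW_recipes.csv")
if recipes_path is None:
    print("RAW_recipes.csv not found; add the dataset or download it first.")
else:
    print(f"Found recipes at {recipes_path}")
```

Checking for the file up front gives a clearer message than a `FileNotFoundError` raised halfway through loading.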

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import os
from collections import Counter
import warnings



# Configure visualizations
plt.style.use('ggplot')
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_colwidth', 100)

print("Environment setup complete!")

## Data Loading

We'll use the Food.com Recipes and Interactions dataset. This contains recipe information including ingredients, steps, and user interactions.

The loading code below expects the datasets to be attached to the notebook under `/kaggle/input/`. If you downloaded the files with the Kaggle API instead, point the paths at your download location.

Alongside the raw recipes and interactions, we also load the vectorized recipe and user files and a separate nutritional breakdown dataset. These are used in subsequent steps, particularly for the few-shot prompting recipe customization implementation.

In [None]:
# Option 1: Direct Kaggle dataset import
# This is the easiest way to import datasets in Kaggle notebooks

try:
    # If the dataset is added via the "Add data" button, it will be available at /kaggle/input/
    recipes_df = pd.read_csv('/kaggle/input/food-com-recipes-and-user-interactions/RAW_recipes.csv')
    interactions_df = pd.read_csv('/kaggle/input/food-com-recipes-and-user-interactions/RAW_interactions.csv')
    pp_recipes_df = pd.read_csv('/kaggle/input/food-com-recipes-and-user-interactions/PP_recipes.csv')
    pp_users_df = pd.read_csv('/kaggle/input/food-com-recipes-and-user-interactions/PP_users.csv')
    nutrition_df = pd.read_csv('/kaggle/input/nutritional-breakdown-of-foods/cleaned_nutrition_dataset.csv')

    print(f"Successfully loaded {len(recipes_df)} recipes")
    print(f"Successfully loaded {len(interactions_df)} interactions")
    print(f"Successfully loaded nutritional dataset with {len(nutrition_df)} records")
    print(f"Successfully loaded vectorized recipe data with {len(pp_recipes_df)} records")
    print(f"Successfully loaded vectorized user data with {len(pp_users_df)} records")
    
    
    
except FileNotFoundError:
    print("Dataset files not found. Please make sure you've added the dataset to your Kaggle notebook.")
    print("You can add it by clicking the 'Add data' button in the right sidebar.")
    print("Alternatively, you can use direct URLs if available.")

# Parse the string-encoded list columns into real Python lists
import ast

if 'recipes_df' in locals():
    # In the Food.com dataset, ingredients, steps, and tags are stored as strings
    # that represent Python lists, e.g. "['flour', 'sugar']".
    # ast.literal_eval parses such literals without executing arbitrary code (unlike eval).
    try:
        if 'ingredients' in recipes_df.columns:
            recipes_df['ingredients'] = recipes_df['ingredients'].apply(ast.literal_eval)
            print("Successfully parsed ingredients column")
        
        if 'steps' in recipes_df.columns:
            recipes_df['steps'] = recipes_df['steps'].apply(ast.literal_eval)
            print("Successfully parsed steps column")
        
        if 'tags' in recipes_df.columns:
            recipes_df['tags'] = recipes_df['tags'].apply(ast.literal_eval)
            print("Successfully parsed tags column")
            
            # Add cuisine type based on tags
            recipes_df['cuisine_type'] = recipes_df['tags'].apply(
                lambda x: next((tag for tag in x if tag in ['italian', 'mexican', 'chinese', 'indian', 'french', 'thai']), 'other')
            )
        
      
        # Count number of ingredients
        recipes_df['n_ingredients'] = recipes_df['ingredients'].apply(len)
            
        print("\nDataset successfully processed")
        
    except Exception as e:
        print(f"Error processing dataset: {e}")
        print("Column sample values:")
        for col in recipes_df.columns:
            print(f"{col}: {recipes_df[col].iloc[0]}")
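
The list-valued columns arrive as strings, so parsing is the step most likely to fail on a malformed row. A small sketch of a tolerant parser built on `ast.literal_eval`, assuming that an empty list is an acceptable fallback for bad cells. `parse_list_column` is a hypothetical helper, not part of the dataset tooling:

```python
import ast


def parse_list_column(value):
    """Parse a stringified Python list such as "['flour', 'sugar']" into a real list.

    ast.literal_eval only accepts Python literals, so expressions hidden in
    the data are rejected instead of being executed.
    """
    if isinstance(value, list):
        return value  # already parsed
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []  # malformed or missing cell (e.g. NaN)
    return parsed if isinstance(parsed, list) else []


print(parse_list_column("['winter squash', 'honey', 'butter']"))
print(parse_list_column("__import__('os').getcwd()"))  # not a literal, so it is rejected
```

With pandas this could be applied as `recipes_df['ingredients'].apply(parse_list_column)`, which keeps one bad row from aborting the whole column.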



## Data Exploration

Let's explore the dataset to understand its structure and content. This will help us plan our cleaning and preprocessing steps.

In [None]:

# Basic dataset information
print("Raw Datasets information:")
print(f"Number of recipes: {len(recipes_df)}")
print("\nDataset columns:")
print(recipes_df.columns.tolist())
print(15 * "-")
print(f"Number of interactions: {len(interactions_df)}")
print("\nDataset columns:")
print(interactions_df.columns.tolist())
print(15 * "-")
print(f"Number of nutrition records: {len(nutrition_df)}")
print("\nDataset columns:")
print(nutrition_df.columns.tolist())
print(15 * "-")
print("Vectorized Datasets information:")

print(f"Number of vectorized recipes: {len(pp_recipes_df)}")
print("\nDataset columns:")
print(pp_recipes_df.columns.tolist())
print(15 * "-")
print(f"Number of vectorized users: {len(pp_users_df)}")
print("\nDataset columns:")
print(pp_users_df.columns.tolist())
print(15 * "-")

In [None]:
# Check data types and missing values using a lighter approach
print("\nData types:")
for col in recipes_df.columns:
    print(f"{col}: {recipes_df[col].dtype}")

print("\nMissing values per column:")
missing_values = recipes_df.isnull().sum()
for col, missing in zip(missing_values.index, missing_values.values):
    if missing > 0:
        print(f"{col}: {missing} missing values ({missing/len(recipes_df):.2%})")
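
The same missing-value check can be packaged as a reusable report. A minimal sketch, assuming pandas is available as in the rest of this notebook. `missing_value_report` and the toy frame are illustrative stand-ins, not real Food.com rows:

```python
import pandas as pd


def missing_value_report(df):
    """Return one row per column that has missing values, with count and share."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # Keep only columns that actually have gaps, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for recipes_df (values are illustrative only).
toy = pd.DataFrame({
    "name": ["soup", None, "salad", "stew"],
    "description": [None, None, "fresh", None],
    "minutes": [30, 45, 10, 120],
})
print(missing_value_report(toy))
```

Returning a DataFrame rather than printing makes it easy to run the same report on `interactions_df` or `nutrition_df` and compare the results.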

In [None]:
    
# Lighter summary statistics - only for numeric columns
print("\nNumeric columns summary:")
numeric_cols = recipes_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
if numeric_cols:
    # Show basic stats for numeric columns only
    print(recipes_df[numeric_cols].describe().T[['count', 'mean', 'min', 'max']])
else:
    print("No numeric columns found")

In [None]:
# Sample a few rows instead of full stats
print("\nSample rows:")
print(recipes_df.sample(3))

In [None]:
# Distribution of cuisine types
if 'cuisine_type' in recipes_df.columns:
    # Create the figure only when there is something to plot
    plt.figure(figsize=(12, 6))
    # Show at most the 15 most common cuisine labels to avoid a cluttered plot
    recipes_df['cuisine_type'].value_counts().nlargest(15).plot(kind='bar')
    plt.title('Top 15 Cuisine Types')
    plt.xlabel('Cuisine')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

In [None]:
# Distribution of cooking time
# RAW_recipes.csv is expected to store the total time in a 'minutes' column;
# 'cooking_time' is kept as a fallback in case the column was renamed.
time_col = next((c for c in ('minutes', 'cooking_time') if c in recipes_df.columns), None)
if time_col is not None:
    plt.figure(figsize=(10, 6))
    # Clip at the 95th percentile when a long right tail would dominate the plot
    if recipes_df[time_col].max() > 5 * recipes_df[time_col].median():
        sns.histplot(recipes_df[time_col].clip(upper=recipes_df[time_col].quantile(0.95)), bins=20)
        plt.title('Distribution of Cooking Time (minutes) - Clipped at 95th percentile')
    else:
        sns.histplot(recipes_df[time_col], bins=20)
        plt.title('Distribution of Cooking Time (minutes)')
    plt.xlabel('Cooking Time (minutes)')
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()
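
The histogram cell caps values at the 95th percentile before plotting. A small sketch of that step in isolation, using made-up durations with one extreme outlier. `clip_at_quantile` is a hypothetical helper name:

```python
import pandas as pd


def clip_at_quantile(values, q=0.95):
    """Cap values above the q-th quantile so a few outliers do not flatten the histogram."""
    cap = values.quantile(q)
    return values.clip(upper=cap), cap


# Illustrative durations in minutes: 1..99 plus one data-entry style outlier.
minutes = pd.Series(list(range(1, 100)) + [100000])
clipped, cap = clip_at_quantile(minutes)
print(f"max before: {minutes.max()}, cap: {cap}, max after: {clipped.max()}")
```

Clipping keeps every row in the plot, with the outliers piled into the last bin. Filtering them out instead would change the counts.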

In [None]:
# Number of ingredients distribution
if 'n_ingredients' in recipes_df.columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(recipes_df['n_ingredients'], bins=range(1, min(30, recipes_df['n_ingredients'].max()+1)))
    plt.title('Distribution of Number of Ingredients')
    plt.xlabel('Number of Ingredients')
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

## Data Cleaning and Preprocessing

Now we'll clean the data. The full plan has five steps; the cells below implement steps 1, 2, and 5:
1. Removing duplicate recipes
2. Normalizing ingredient names
3. Standardizing measurements
4. Handling missing values
5. Creating dietary tags
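
Step 3 can be approached in several ways. One possible sketch, assuming measurements appear as free text with common abbreviations. The alias table and `standardize_units` are hypothetical and would need extending to match the units that actually occur in the data:

```python
import re

# Hypothetical alias table mapping abbreviations and plurals to one canonical form.
UNIT_ALIASES = {
    "tsp": "teaspoon", "tsps": "teaspoon", "teaspoons": "teaspoon",
    "tbsp": "tablespoon", "tbsps": "tablespoon", "tablespoons": "tablespoon",
    "cups": "cup",
    "oz": "ounce", "ounces": "ounce",
    "lb": "pound", "lbs": "pound", "pounds": "pound",
    "g": "gram", "grams": "gram",
}


def standardize_units(text):
    """Replace known unit aliases with their canonical singular form."""
    def replace(match):
        word = match.group(0)
        # Unknown words are returned unchanged.
        return UNIT_ALIASES.get(word.lower(), word)

    # Match whole alphabetic words only, so 'g' inside 'garlic' is left alone.
    return re.sub(r"\b[A-Za-z]+\b", replace, text)


print(standardize_units("2 Tbsp olive oil"))
print(standardize_units("1 lb ground beef, 3 cups stock"))
```

A lookup table keeps the rule set easy to audit. Converting between units (for example ounces to grams) would be a separate step.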

In [None]:
# Check for duplicates
print(f"Number of duplicate recipes: {recipes_df.duplicated(subset=['name']).sum()}")

# Remove duplicates
recipes_df = recipes_df.drop_duplicates(subset=['name']).reset_index(drop=True)
print(f"Number of recipes after removing duplicates: {len(recipes_df)}")

# Function to normalize ingredient names
def normalize_ingredients(ingredient_list):
    """
    Normalize ingredient names by removing quantities and standardizing format
    """
    normalized = []
    # If ingredient_list is already a list of strings
    if isinstance(ingredient_list, list):
        for ingredient in ingredient_list:
            # Skip empty ingredients
            if not ingredient or not isinstance(ingredient, str):
                continue
            
            # Remove quantities (simplified for demonstration)
            cleaned = re.sub(r'^\d+\s+\d+/\d+\s+', '', ingredient)
            cleaned = re.sub(r'^\d+/\d+\s+', '', cleaned)
            cleaned = re.sub(r'^\d+\s+', '', cleaned)
            
            # Convert to lowercase and strip whitespace
            cleaned = cleaned.lower().strip()
            
            normalized.append(cleaned)
    else:
        # Handle the case where ingredient_list might be a string or another format
        print("Warning: Expected ingredient_list to be a list, but got:", type(ingredient_list))
        if isinstance(ingredient_list, str):
            # Try to interpret as a string representation of a list
            import ast  # local import so this function does not depend on earlier cells
            try:
                # literal_eval parses list literals without executing code (unlike eval)
                actual_list = ast.literal_eval(ingredient_list) if ingredient_list.startswith('[') else [ingredient_list]
                return normalize_ingredients(actual_list)
            except (ValueError, SyntaxError):
                normalized = [ingredient_list.lower().strip()]
    
    return normalized

# Apply normalization to ingredients - with error handling
recipes_df['normalized_ingredients'] = recipes_df['ingredients'].apply(
    lambda x: normalize_ingredients(x) if isinstance(x, list) or isinstance(x, str) else []
)

# Show a sample recipe with normalized ingredients
if len(recipes_df) > 0:
    sample_idx = 0
    print(f"Original ingredients: {recipes_df.iloc[sample_idx]['ingredients']}")
    print(f"Normalized ingredients: {recipes_df.iloc[sample_idx]['normalized_ingredients']}")
else:
    print("No recipes found in the dataframe.")
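
To make the quantity-stripping rules concrete, here is a single-pattern variant run on a few hand-written strings. It is a sketch rather than a drop-in replacement: `strip_quantity` is a hypothetical name, and like the function above it leaves units such as "cups" in place:

```python
import re

# Leading quantity: a mixed number ("1 1/2"), a fraction ("3/4"), or an integer ("2"),
# followed by whitespace. The longest form is tried first.
LEADING_QUANTITY = re.compile(r"^\s*(?:\d+\s+\d+/\d+|\d+/\d+|\d+)\s+")


def strip_quantity(ingredient):
    """Remove one leading quantity, then lowercase and trim the rest."""
    return LEADING_QUANTITY.sub("", ingredient).lower().strip()


for raw in ["1 1/2 cups Flour", "3/4 cup sugar", "2 Eggs", "salt"]:
    print(f"{raw!r} -> {strip_quantity(raw)!r}")
```

Trying the mixed-number form first matters. If the integer alternative came first, "1 1/2 cups" would lose only the "1" and keep "1/2 cups".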

In [None]:
# Function to identify dietary tags based on ingredients
def identify_dietary_tags(ingredients):
    """
    Identify dietary preferences based on ingredients
    """
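    # Handle empty or unexpected input
    if not ingredients or not isinstance(ingredients, (list, str)):
        return []

    # A bare string would be joined character by character below, so wrap it in a list
    if isinstance(ingredients, str):
        ingredients = [ingredients]
        
    # Convert list of ingredients to a single string for easier checking.
    # Note: substring checks are approximate (e.g. 'egg' also matches 'eggplant').
    ingredients_str = ' '.join(ingredients).lower()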
    
    tags = []
    
    # Vegetarian check (simplified)
    meat_ingredients = ['chicken', 'beef', 'pork', 'lamb', 'turkey', 'veal', 'bacon']
    if not any(meat in ingredients_str for meat in meat_ingredients):
        tags.append('vegetarian')
        
        # Vegan check (simplified)
        animal_products = ['cheese', 'milk', 'cream', 'yogurt', 'butter', 'egg', 'honey']
        if not any(product in ingredients_str for product in animal_products):
            tags.append('vegan')
    
    # Gluten-free check (simplified)
    gluten_ingredients = ['flour', 'wheat', 'barley', 'rye', 'pasta', 'bread']
    if not any(gluten in ingredients_str for gluten in gluten_ingredients):
        tags.append('gluten-free')
    
    # Low-carb check (simplified)
    high_carb_ingredients = ['sugar', 'pasta', 'rice', 'potato', 'bread', 'flour']
    if not any(carb in ingredients_str for carb in high_carb_ingredients):
        tags.append('low-carb')
    
    # Dairy-free check
    dairy_ingredients = ['milk', 'cheese', 'cream', 'yogurt', 'butter']
    if not any(dairy in ingredients_str for dairy in dairy_ingredients):
        tags.append('dairy-free')
    
    return tags

# Apply dietary tagging
recipes_df['dietary_tags'] = recipes_df['normalized_ingredients'].apply(identify_dietary_tags)

# Show the distribution of dietary tags
diet_counts = {}
for tags in recipes_df['dietary_tags']:
    for tag in tags:
        diet_counts[tag] = diet_counts.get(tag, 0) + 1

plt.figure(figsize=(10, 6))
diet_df = pd.Series(diet_counts).sort_values(ascending=False)
diet_df.plot(kind='bar')
plt.title('Distribution of Dietary Tags')
plt.xlabel('Dietary Tag')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Show sample recipes with their dietary tags
print("\nSample recipes with dietary tags:")
sample_recipes = recipes_df[['name', 'normalized_ingredients', 'dietary_tags']].sample(5)
for _, recipe in sample_recipes.iterrows():
    print(f"\nRecipe: {recipe['name']}")
    print(f"Ingredients: {', '.join(recipe['normalized_ingredients'])}")
    print(f"Dietary Tags: {', '.join(recipe['dietary_tags']) if recipe['dietary_tags'] else 'None'}")
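
The tagging function relies on substring checks, so "egg" also matches "eggplant" and "butter" matches "butternut squash". A sketch of a stricter whole-word check, assuming simple English plurals are enough. `contains_any` is a hypothetical helper, and multi-word cases such as "peanut butter" or "coconut milk" would still need an explicit exception list:

```python
import re


def contains_any(ingredients, keywords):
    """True if any keyword appears as a whole word (optionally plural) in any ingredient."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")(?:s|es)?\b"
    )
    return any(pattern.search(item.lower()) for item in ingredients)


animal_products = ["cheese", "milk", "cream", "yogurt", "butter", "egg", "honey"]

# 'eggplant' contains 'egg' as a substring but not as a whole word.
print(contains_any(["eggplant", "olive oil"], animal_products))
print(contains_any(["2 eggs", "flour"], animal_products))
```

Swapping this into the vegan and dairy-free checks would reduce false negatives for those tags, at the cost of compiling one pattern per keyword list.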

## Final Data Structure and Storage

Now we'll organize the data into the final structure and save it for use in subsequent steps.

In [None]:
# Define paths for loading and saving data
from pathlib import Path

# Read-only location where a previously saved copy appears if this notebook's
# output was attached as a Kaggle dataset
DATA_DIR = Path('/kaggle/input/step1-data-setup')
# Writable working directory
FINAL_DIR = Path('.')
RECIPE_FILE = FINAL_DIR / 'processed_recipes.json'

# Create the output directory if it doesn't exist (/kaggle/input is read-only)
FINAL_DIR.mkdir(exist_ok=True, parents=True)

# Reuse a previously saved copy of the processed recipes if one exists;
# otherwise save the dataframe built above
try:
    kaggle_json_path = DATA_DIR / 'processed_recipes.json'
    
    # First check the current directory
    if RECIPE_FILE.exists():
        with open(RECIPE_FILE, 'r') as f:
            recipes_data = json.load(f)
        recipes_df = pd.DataFrame(recipes_data)
        print(f"Loaded {len(recipes_df)} recipes from JSON file in current directory")
    
    # Then check the Kaggle input directory
    elif kaggle_json_path.exists():
        with open(kaggle_json_path, 'r') as f:
            recipes_data = json.load(f)
        recipes_df = pd.DataFrame(recipes_data)
        print(f"Loaded {len(recipes_df)} recipes from Kaggle dataset input directory (JSON)")
    
    # No saved copy found: write the processed recipes for subsequent steps
    else:
        recipes_df.to_json(RECIPE_FILE, orient='records')
        print(f"Saved {len(recipes_df)} recipes to {RECIPE_FILE}")
except Exception as e:
    print(f"\nError loading or saving recipe data: {e}")
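
The JSON file is a list of record dictionaries, which is what `pd.DataFrame(recipes_data)` expects when reading it back. A minimal round-trip sketch using only the standard library, with two made-up records in the shape produced above. `save_recipes` and `load_recipes` are hypothetical helpers:

```python
import json
import tempfile
from pathlib import Path

# Two illustrative records (field values are made up for the example).
records = [
    {"name": "tomato soup", "normalized_ingredients": ["tomato", "onion"],
     "dietary_tags": ["vegetarian", "vegan"]},
    {"name": "cheese toast", "normalized_ingredients": ["bread", "cheese"],
     "dietary_tags": ["vegetarian"]},
]


def save_recipes(recipes, path):
    """Write a list of recipe dicts as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(recipes, f, ensure_ascii=False)


def load_recipes(path):
    """Read the list of recipe dicts back."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "processed_recipes.json"
    save_recipes(records, out_file)
    restored = load_recipes(out_file)

print(restored == records)
```

Starting from a DataFrame, `recipes_df.to_dict(orient="records")` yields a list in this shape. List-valued columns survive the round trip because JSON has a native array type.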
 

## Data Integration for Recipe Customization

In Step 3, we'll need to implement few-shot prompting for recipe customization. For this, we'll leverage the following data sources:

1. **Raw Recipe Data**: Provides readable recipe ingredients, steps, and descriptions
2. **Vectorized Recipe Data**: Contains pre-processed tokens and numerical representations for efficient similarity matching
3. **Nutritional Data**: Allows us to make informed decisions about ingredient substitutions

This integrated data will enable us to:
- Adjust recipes based on dietary requirements
- Suggest ingredient substitutions with similar nutritional profiles
- Scale recipes for different serving sizes
- Adapt cooking methods based on available equipment
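
As an illustration of the serving-size item, here is a sketch that scales quantities with exact fractions. It assumes quantities have already been parsed into `(quantity, unit, name)` tuples, which the ingredient lists in this notebook may not provide. `scale_quantities` and the sample recipe are hypothetical:

```python
from fractions import Fraction


def scale_quantities(ingredients, original_servings, target_servings):
    """Scale (quantity, unit, name) tuples from one integer serving count to another.

    Fractions keep results exact, e.g. 1/3 cup doubled is 2/3 cup rather than 0.666...
    """
    if original_servings <= 0 or target_servings <= 0:
        raise ValueError("serving counts must be positive")
    factor = Fraction(target_servings, original_servings)
    return [(Fraction(qty) * factor, unit, name) for qty, unit, name in ingredients]


# Hypothetical recipe for 4 servings, scaled to 6.
recipe = [("1/3", "cup", "olive oil"), ("2", "clove", "garlic"), ("3/4", "teaspoon", "salt")]
for qty, unit, name in scale_quantities(recipe, original_servings=4, target_servings=6):
    print(f"{qty} {unit} {name}")
```

Exact arithmetic avoids awkward values such as 0.4999 cups. Rounding to kitchen-friendly amounts can then be done as a final presentation step.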

In [None]:
# Store references to the datasets for use in further steps
datasets = {
    'raw_recipes': recipes_df if 'recipes_df' in locals() else None,
    'raw_interactions': interactions_df if 'interactions_df' in locals() else None,
    'vectorized_recipes': pp_recipes_df if 'pp_recipes_df' in locals() else None,
    'vectorized_users': pp_users_df if 'pp_users_df' in locals() else None,
    'nutrition': nutrition_df if 'nutrition_df' in locals() else None
}

# Save the datasets dictionary to a pickle file for easy access in step 3
import pickle

try:
    with open('datasets.pkl', 'wb') as f:
        pickle.dump(datasets, f)
    print("Successfully saved datasets dictionary to datasets.pkl for use in subsequent steps")
except Exception as e:
    print(f"Error saving datasets: {e}")

# Optional: Save a metadata file with information about the datasets
metadata = {
    'dataset_shapes': {
        name: df.shape if df is not None else None for name, df in datasets.items()
    },
    'dataset_columns': {
        name: df.columns.tolist() if df is not None else None for name, df in datasets.items()
    }
}

try:
    with open('dataset_metadata.json', 'w') as f:
        # Convert any NumPy scalar types to Python native types for JSON serialization
        def convert(o):
            if isinstance(o, np.integer): return int(o)
            if isinstance(o, np.floating): return float(o)
            raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
        
        json.dump(metadata, f, default=convert)
    print("Successfully saved dataset metadata to dataset_metadata.json")
except Exception as e:
    print(f"Error saving metadata: {e}")
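
A later notebook can read the pickle back and check that the entries it needs are present. A minimal sketch, using plain Python stand-ins for the DataFrames so it runs on its own. `load_datasets` and its `required` parameter are hypothetical:

```python
import pickle
import tempfile
from pathlib import Path


def load_datasets(path, required=("raw_recipes",)):
    """Load the saved datasets dictionary and fail early if a required entry is missing."""
    with open(path, "rb") as f:
        datasets = pickle.load(f)  # only unpickle files you created yourself
    missing = [name for name in required if datasets.get(name) is None]
    if missing:
        raise KeyError(f"Missing datasets: {missing}")
    return datasets


# Round-trip demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "datasets.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump({"raw_recipes": [{"name": "tomato soup"}], "nutrition": None}, f)
    loaded = load_datasets(demo_path)

print(sorted(loaded))
```

Because the dictionary stores `None` for any dataset that failed to load, checking the required keys up front avoids a confusing `AttributeError` later on.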

## Conclusion and Next Steps

In this notebook, we've completed Step 1 of our Interactive Recipe & Kitchen Management Assistant:

1. We loaded and explored the Food.com recipe dataset
2. We cleaned the data by removing duplicates and normalizing ingredients
3. We enhanced the data with dietary preference tags 
4. We structured the data in a format that will facilitate future steps

This processed dataset will serve as the foundation for:
- Building our vector database for RAG implementation
- Creating few-shot examples for recipe customization
- Developing function calling capabilities for specific recipe operations

**Next steps:**
- Step 2: Implement audio input and command recognition
- Step 3: Develop few-shot prompting for recipe customization
- Step 4: Create RAG implementation for recipe retrieval

The data preparation steps in this notebook may appear simple, but they're crucial for ensuring our Gen AI components work effectively in subsequent steps. Clean, well-structured data will lead to better embedding representations, more accurate text matching, and overall improved performance.