
# Day 1: Project Kickoff & Dataset Exploration  
_A 100-Day Machine-Learning Challenge: Indian Food Recipes_

**Author:** Ramakrishnan Sathyavageeswaran

**Date:** June 29, 2025  



## Project Overview
This notebook explores Indian food datasets as part of my 100-day machine learning challenge. The goal is to understand the structure and content of different Indian food datasets and prepare them for future analysis and modeling.

### Datasets Used
- `cuisines.csv`: A comprehensive dataset of Indian recipes with detailed information
- `ifood_new.csv`: A curated dataset of Indian foods with regional information
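- `indian_food.csv`: A second dataset of Indian dishes with the same basic structure as `ifood_new.csv` (255 rows)
- `indianFoodDatasetCSV.csv`: A larger recipe dataset (6,871 rows) with translated names, ingredients, and instructions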

### Download from Kaggle
```
kaggle datasets download -d kritirathi/indian-food-dataset-with -p data/kritirathi
kaggle datasets download -d sukhmandeepsinghbrar/indian-food-dataset -p data/sukhmandeepsingh
kaggle datasets download -d kishanpahadiya/indian-food-and-its-recipes-dataset-with-images -p data/kishan
kaggle datasets download -d nehaprabhavalkar/indian-food-101 -p data/food101
```

Note: The commands above download the datasets into subdirectories of `data/`. Make sure the Kaggle CLI is installed and authenticated before running them.
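
By default the Kaggle CLI saves each dataset as a zip archive (unless `--unzip` is passed). A minimal sketch for extracting them, assuming the archives sit in the `data/` subdirectories used above; the helper name `extract_archives` is hypothetical:

```python
import zipfile
from pathlib import Path

def extract_archives(root: str = "data") -> list[str]:
    """Extract every .zip found under `root` next to the archive itself.

    Returns the names of all extracted members, sorted for readability.
    """
    extracted = []
    for archive in sorted(Path(root).rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)   # unpack beside the archive
            extracted.extend(zf.namelist())
    return sorted(extracted)
```

Calling `extract_archives("data")` would unpack everything in place. The notebook reads its CSV files directly from `data/`, so the extracted files may need to be moved there or the paths adjusted.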

### Objectives
- Explore and understand the structure of each dataset
- Clean and preprocess the data
- Perform initial exploratory data analysis
- Identify potential features for future analysis

### Helper Function for a Quick View of the Data

In [25]:
def quick_view(file_path, sample_rows=5, show_info=True, show_stats=True, 
              show_missing=True, show_unique=5):
    """
    Display a comprehensive overview of a CSV file.
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV file (can be relative or absolute)
    sample_rows : int, optional (default=5)
        Number of sample rows to display
    show_info : bool, optional (default=True)
        Whether to show DataFrame info
    show_stats : bool, optional (default=True)
        Whether to show basic statistics
    show_missing : bool, optional (default=True)
        Whether to show missing values info
    show_unique : int, optional (default=5)
        Number of unique values to show per column (0 to disable)
    """
    import os
    import pandas as pd
    from IPython.display import display, Markdown
    import time
    
    start_time = time.time()
    
    try:
        # Convert to absolute path and normalize
        abs_path = os.path.abspath(file_path)
        
        # Check if file exists
        if not os.path.exists(abs_path):
            display(Markdown("## Error: File not found"))
            display(f"Looking for: {abs_path}")
            display("\nCurrent working directory:", os.getcwd())
            display("\nAvailable files in current directory:")
            display(os.listdir('.'))
            return None
            
        # Read the CSV file
        display(Markdown(f"*Loading file {os.path.basename(abs_path)}...*"))
        df = pd.read_csv(abs_path)
        
        # Display file info
        display(Markdown(f"## File: `{os.path.basename(abs_path)}`"))
        display(Markdown(f"### Shape: {df.shape[0]:,} rows × {df.shape[1]:,} columns"))
        display(Markdown(f"### Full path: `{abs_path}`\n"))
        
        # Rest of the function remains the same...
        
        # Add execution time at the end
        end_time = time.time()
        display(Markdown(f"*Analysis completed in {end_time - start_time:.2f} seconds*"))
        
        return df
    
    except Exception as e:
        display(Markdown("## An error occurred"))
        display(f"Error: {str(e)}")
        return None
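
The middle of `quick_view` is elided above, so the following is not the author's implementation. As a rough sketch of what a missing-value overview could look like, the hypothetical helper `missing_summary` reports the count and percentage of missing values per column, on invented toy data:

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data, purely illustrative
toy = pd.DataFrame({
    "name": ["Boondi", "Ghevar", "Phulka", "Aloo gobi"],
    "course": ["dessert", None, "main course", "main course"],
    "diet": ["vegetarian", "vegetarian", None, None],
})
print(missing_summary(toy))
```

Percentages like the ones quoted for the `cuisine`, `course`, and `diet` columns can be read off a table of this shape.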

## Dataset 1: Cuisines Dataset Exploration
Let's first examine the structure and content of the cuisines.csv dataset.

### Key Observations from Cuisines Dataset
- The dataset contains 4,236 recipes with 9 columns
- Most recipes are categorized as vegetarian
- The dataset includes detailed ingredients and instructions
- There are minimal missing values, primarily in cuisine (0.14%), course (0.9%), and diet (0.87%) columns

In [26]:
cuisines_df = quick_view('data/cuisines.csv', 
                sample_rows=3, 
                show_unique=3, 
                show_stats=True)

*Loading file cuisines.csv...*

## File: `cuisines.csv`

### Shape: 4,236 rows × 9 columns

### Full path: `/mnt/d/projects/ML-100day-challenge/data/cuisines.csv`


*Analysis completed in 0.22 seconds*

### Data Cleaning and Preparation: Cuisines Dataset

To prepare the `cuisines_df` for analysis, we will perform the following cleaning and feature engineering steps:

1.  **Drop Unnecessary Columns**: Remove the `image_url` column as it is not required for this analysis.
2.  **Handle Missing Values**: Remove rows where the `course` is not specified, as this is a key categorical feature for potential modeling.
3.  **Clean Ingredients Column**: The `ingredients` column contains a block of text. We will process it into a standardized, semicolon-separated list of ingredients, which is easier to parse and analyze.
4.  **Feature Engineering**: Create a new feature, `n_ingredients`, which counts the number of ingredients in each recipe. This could be a useful proxy for recipe complexity.
5.  **Remove Non-English Rows**: Drop rows whose text columns contain non-ASCII characters, as a rough proxy for non-English text.

In [27]:
import re
import pandas as pd
from IPython.display import display, Markdown

def clean_ingredient_block(raw: str | float) -> str | None:
    """
    1) Split on any newline / tab sequence to recover individual lines
    2) Collapse repeated whitespace inside each line
    3) Strip stray commas / semicolons at the ends
    4) Join back with `; ` so one recipe sits on one line
    """
    if pd.isna(raw):
        return None
    
    # split
    parts = re.split(r'[\n\t]+', str(raw))
    
    # normalise & keep non-empty
    parts = [re.sub(r'\s+', ' ', p).strip(' ,;') for p in parts if p.strip()]
    
    if not parts:
        return None
    
    # join
    return '; '.join(parts)

# --- Data Cleaning Pipeline for Cuisines DataFrame ---

# Create a copy to avoid SettingWithCopyWarning
cleaned_cuisines_df = cuisines_df.copy()

# 1. Drop the image_url column
cleaned_cuisines_df.drop(columns=['image_url'], inplace=True, errors='ignore')

# 2. Drop rows with missing 'course' values
cleaned_cuisines_df.dropna(subset=['course'], inplace=True)

# 3. Clean the 'ingredients' column using the function defined above
cleaned_cuisines_df['ingredients'] = cleaned_cuisines_df['ingredients'].apply(clean_ingredient_block)

# 4. Create the 'n_ingredients' feature
# N semicolons separate N+1 items; rows with no ingredients count as 0.
ingredients = cleaned_cuisines_df['ingredients']
cleaned_cuisines_df['n_ingredients'] = (
    (ingredients.str.count(';') + 1).where(ingredients.notna(), 0).astype(int)
)

# 5. Remove rows with non-ASCII text (a rough proxy for non-English rows).
# Note that this also drops rows containing symbols such as '°' or '½'.
def has_non_english(text):
    if pd.isna(text):
        return False
    try:
        text.encode('ascii')
        return False
    except UnicodeEncodeError:
        return True

# Check all string columns for non-English characters
str_cols = cleaned_cuisines_df.select_dtypes(include=['object']).columns
has_non_eng = cleaned_cuisines_df[str_cols].apply(
    lambda col: col.astype(str).apply(has_non_english)
).any(axis=1)
cleaned_cuisines_df = cleaned_cuisines_df[~has_non_eng].reset_index(drop=True)


# --- Display Results ---
display(Markdown("### Cleaned Cuisines DataFrame"))
display(cleaned_cuisines_df.head())

display(Markdown("#### Sample of cleaned ingredients and their counts:"))
display(cleaned_cuisines_df[['ingredients', 'n_ingredients']].sample(5))

### Cleaned Cuisines DataFrame

Unnamed: 0,name,description,cuisine,course,diet,prep_time,ingredients,instructions,n_ingredients
0,Doddapatre Soppina Chitranna Recipe (Spiced In...,Doddapatre Soppina Chitranna (Indian Thyme Ric...,South Indian Recipes,Lunch,Vegetarian,Total in 50 M,1-1/2 cups Cooked rice; 2 tablespoons Oil; 10 ...,To start preparing Doddapatre Soppina Chitrann...,22
1,Goan Style Mushroom Vindaloo Recipe,Goan Style Mushroom Vindaloo Recipe is a varia...,Goan Recipes,Dinner,Vegetarian,Total in 50 M,250 grams Button mushrooms; cut into quarters;...,To begin making the Goan Style Mushroom Vindal...,18
2,Assamese Style Walking Catfish In Curry Leaf G...,Assamese Style Walking Catfish In Curry Leaf G...,Assamese,Side Dish,Non Vegeterian,Total in 40 M,5 Walking Catfish; thoroughly cleaned; 4 clove...,To begin making Assamese Style Walking Catfish...,14
3,Nutty Aloo Paratha Recipe,Nutty Aloo Paratha Recipe is a wonderful twist...,North Indian Recipes,North Indian Breakfast,Vegetarian,Total in 55 M,1 cup Whole Wheat Flour; 1 cup Spinach Leaves ...,"To begin making Nutty Aloo Paratha Recipe,firs...",22
4,Phulka Recipe (Roti/Chapati) - Puffed Indian B...,Phulkas also known as Roti or Chapati in some ...,North Indian Recipes,Main Course,Vegetarian,Total in 40 M,1 cup Whole Wheat Flour; 1/2 teaspoon Salt; op...,To begin making the Phulka (roti/ chapati) rec...,6


#### Sample of cleaned ingredients and their counts:

|    | ingredients | n_ingredients |
|---:|---|---:|
| 57 | Ingredients for the Coconut Jaggery Filling / ... | 23 |
| 95 | 1 Green zucchini; as needed, sliced to disc-sh... | 16 |
| 6 | 1 cup Yellow Moong Dal (Split); Mooli Ke Patte... | 17 |
| 76 | 2 cups Phool Makhana (Lotus Seeds); 1/4 cup Gr... | 19 |
| 0 | 1-1/2 cups Cooked rice; 2 tablespoons Oil; 10 ... | 22 |


## Dataset 2: IFood Dataset Exploration
Now let's explore the `ifood_new.csv` dataset, which contains regional information about Indian foods.

In [28]:
ifood_df = quick_view('data/ifood_new.csv', 
                sample_rows=3, 
                show_unique=3, 
                show_stats=True)

*Loading file ifood_new.csv...*

## File: `ifood_new.csv`

### Shape: 255 rows × 10 columns

### Full path: `/mnt/d/projects/ML-100day-challenge/data/ifood_new.csv`


*Analysis completed in 0.02 seconds*

## Dataset 3: Indian Food Dataset Exploration
Next, let's examine the `indian_food.csv` dataset.

In [29]:
indian_food_df = quick_view('data/indian_food.csv', 
                sample_rows=3, 
                show_unique=3, 
                show_stats=True)

*Loading file indian_food.csv...*

## File: `indian_food.csv`

### Shape: 255 rows × 9 columns

### Full path: `/mnt/d/projects/ML-100day-challenge/data/indian_food.csv`


*Analysis completed in 0.02 seconds*

### Data Cleaning: `ifood_new.csv` and `indian_food.csv`

Next, we will clean the `ifood_df` and `indian_food_df` datasets. Both datasets share a similar structure and require similar preprocessing steps. To maintain consistency and avoid code duplication, we will create a reusable function to perform the following actions:

1.  **Drop Unnecessary Columns**: Remove metadata columns like `flavor_profile`, `state`, `region`, and `img_url`.
2.  **Standardize Ingredients**: Convert the comma-separated `ingredients` string to a semicolon-separated format for consistency with our other datasets.
3.  **Feature Engineering**: Calculate the `n_ingredients` for each recipe.
4.  **Remove Non-English Rows**: Remove rows containing non-English characters to ensure consistent data.
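
As a small illustration of steps 2 and 3, the sketch below converts comma-separated ingredient strings into the semicolon format and counts the items. It runs on made-up values and uses a hypothetical helper `standardize_ingredients`; unlike a plain `str.replace(',', '; ')`, the regular expression also absorbs the space that usually follows each comma, and missing values stay missing:

```python
import pandas as pd

def standardize_ingredients(s: pd.Series) -> pd.Series:
    """Turn 'a, b,c' into 'a; b; c', leaving missing values untouched."""
    return s.str.replace(r"\s*,\s*", "; ", regex=True).str.strip()

# Toy values, purely illustrative
toy = pd.Series(["Rice flour, jaggery, ghee", "Potato,peas", None])
cleaned = standardize_ingredients(toy)

# Count items; a missing ingredient string counts as 0 items
n_ingredients = cleaned.str.split(";").str.len().fillna(0).astype(int)

print(cleaned.tolist())
print(n_ingredients.tolist())
```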

In [30]:
import pandas as pd
from IPython.display import display, Markdown

def process_simple_food_df(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """
    Cleans and processes simple food DataFrames by dropping columns,
    standardizing ingredients, and adding an ingredient count.
    """
    if df is None:
        display(Markdown(f"Skipping processing for {name} as DataFrame is not loaded."))
        return None

    cleaned_df = df.copy()
    
    # 1. Drop unnecessary columns if they exist
    cols_to_drop = ['flavor_profile', 'state', 'region', 'img_url']
    cleaned_df.drop(columns=cols_to_drop, inplace=True, errors='ignore')
    
    # 2. Standardize the ingredients column
    if 'ingredients' in cleaned_df.columns:
        # Ensure ingredients are strings, then replace each comma
        # (and any surrounding spaces) with a single '; '
        cleaned_df['ingredients'] = (
            cleaned_df['ingredients']
            .astype(str)
            .str.replace(r'\s*,\s*', '; ', regex=True)
            .str.strip()
        )
    
    # 3. Add n_ingredients column
    if 'ingredients' in cleaned_df.columns:
        cleaned_df['n_ingredients'] = cleaned_df['ingredients'].str.split(';').str.len()
    
    display(Markdown(f"### Cleaned {name} DataFrame"))
    display(cleaned_df.head())

    # 4. Remove rows with non-ASCII text (a rough proxy for non-English rows)
    def has_non_english(text):
        if pd.isna(text):
            return False
        try:
            text.encode('ascii')
            return False
        except (UnicodeEncodeError, UnicodeDecodeError):
            return True
    
    # Check all string columns for non-English characters
    str_cols = cleaned_df.select_dtypes(include=['object']).columns
    has_non_eng = cleaned_df[str_cols].apply(
        lambda col: col.astype(str).apply(has_non_english)
    ).any(axis=1)
    cleaned_df = cleaned_df[~has_non_eng].reset_index(drop=True)
    
    return cleaned_df

# --- Process ifood_df ---
cleaned_ifood_df = process_simple_food_df(ifood_df, "iFood")

# --- Process indian_food_df ---
cleaned_indian_food_df = process_simple_food_df(indian_food_df, "Indian Food")



### Cleaned iFood DataFrame

|    | name | ingredients | diet | prep_time | cook_time | course | n_ingredients |
|---:|---|---|---|---:|---:|---|---:|
| 0 | Adhirasam | Rice flour; jaggery; ghee; vegetable oil; ... | vegetarian | 10 | 50 | dessert | 5 |
| 1 | Aloo gobi | Cauliflower; potato; garam masala; turmeric... | vegetarian | 10 | 20 | main course | 5 |
| 2 | Aloo matar | Potato; peas; chillies; ginger; garam masa... | vegetarian | 5 | 40 | main course | 6 |
| 3 | Aloo methi | Potato; fenugreek leaves; chillies; salt; oil | vegetarian | 10 | 40 | main course | 5 |
| 4 | Aloo shimla mirch | Potato; shimla mirch; garam masala; amchur ... | vegetarian | 10 | 40 | main course | 5 |


### Cleaned Indian Food DataFrame

|    | name | ingredients | diet | prep_time | cook_time | course | n_ingredients |
|---:|---|---|---|---:|---:|---|---:|
| 0 | Balu shahi | Maida flour; yogurt; oil; sugar | vegetarian | 45 | 25 | dessert | 4 |
| 1 | Boondi | Gram flour; ghee; sugar | vegetarian | 80 | 30 | dessert | 3 |
| 2 | Gajar ka halwa | Carrots; milk; sugar; ghee; cashews; raisins | vegetarian | 15 | 60 | dessert | 6 |
| 3 | Ghevar | Flour; ghee; kewra; milk; clarified butter... | vegetarian | 15 | 30 | dessert | 10 |
| 4 | Gulab jamun | Milk powder; plain flour; baking powder; gh... | vegetarian | 15 | 40 | dessert | 8 |


## Dataset 4: indianFoodDatasetCSV Exploration
Finally, let's examine the structure and content of the `indianFoodDatasetCSV.csv` dataset.

In [31]:
indian_food_dataset_df = quick_view('data/indianFoodDatasetCSV.csv', 
                sample_rows=3, 
                show_unique=3, 
                show_stats=True)

*Loading file indianFoodDatasetCSV.csv...*

## File: `indianFoodDatasetCSV.csv`

### Shape: 6,871 rows × 15 columns

### Full path: `/mnt/d/projects/ML-100day-challenge/data/indianFoodDatasetCSV.csv`


*Analysis completed in 0.42 seconds*

### Data Cleaning: `indianFoodDatasetCSV.csv`

The final dataset, `indianFoodDatasetCSV.csv`, is the most detailed. It requires a few specific cleaning steps to standardize it with the others and prepare it for analysis.

Our plan is as follows:

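1.  **Rename and Drop Columns**: Several column names are long or carry a `Translated` prefix. We will rename them to be shorter and more intuitive (e.g., `TranslatedRecipeName` to `name`) and drop the untranslated originals and other columns we do not need (`RecipeName`, `Ingredients`, `Servings`, `Instructions`, `URL`).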
2.  **Clean Text Columns**: We will clean the `ingredients` and `instructions` columns by removing extra whitespace and standardizing the format.
3.  **Feature Engineering**: As before, we will create the `n_ingredients` feature to count the number of ingredients.

In [32]:
import pandas as pd
import re
from IPython.display import display, Markdown

def clean_and_process_recipe_df(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """
    Cleans and processes a detailed recipe DataFrame by renaming columns,
    cleaning text fields, and adding new features.
    """
    if df is None:
        display(Markdown(f"Skipping processing for {name} as DataFrame is not loaded."))
        return None

    cleaned_df = df.copy()



    # 1. Rename columns for clarity and consistency
    column_mapping = {
        'TranslatedRecipeName': 'name',
        'TranslatedIngredients': 'ingredients',
        'TranslatedInstructions': 'instructions',
        'PrepTimeInMins': 'prep_time',
        'CookTimeInMins': 'cook_time',
        'TotalTimeInMins': 'total_time',
        'Cuisine': 'cuisine',
        'Course': 'course',
        'Diet': 'diet',
    }
    
    # Filter mapping to only include columns present in the DataFrame
    rename_map = {k: v for k, v in column_mapping.items() if k in cleaned_df.columns}
    cleaned_df.rename(columns=rename_map, inplace=True)

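    # Drop the untranslated originals and other unused columns
    cleaned_df.drop(columns=['RecipeName', 'Ingredients', 'Servings', 'Instructions', 'URL'],
                    inplace=True, errors='ignore')

    # 2. Clean text columns (ingredients and instructions)
    def clean_text_block(raw: str | float) -> str | None:
        if pd.isna(raw):
            return None
        # Use single backslashes inside raw strings: r'[\n\t]+' matches newlines and tabs.
        # A doubled backslash (r'[\\n\\t]+') would instead match the letters 'n' and 't'
        # and shred words such as "begin" into "begi; ".
        parts = re.split(r'[\n\t]+', str(raw))
        parts = [re.sub(r'\s+', ' ', p).strip(' ,;') for p in parts if p.strip()]
        return '; '.join(parts) if parts else None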

    if 'ingredients' in cleaned_df.columns:
        cleaned_df['ingredients'] = cleaned_df['ingredients'].apply(clean_text_block)
    if 'instructions' in cleaned_df.columns:
        cleaned_df['instructions'] = cleaned_df['instructions'].apply(clean_text_block)

    # 3. Create the 'n_ingredients' feature (0 when ingredients are missing)
    if 'ingredients' in cleaned_df.columns:
        ingredients = cleaned_df['ingredients']
        cleaned_df['n_ingredients'] = (
            (ingredients.str.count(';') + 1).where(ingredients.notna(), 0).astype(int)
        )

    display(Markdown(f"### Cleaned {name} DataFrame"))
    display(cleaned_df.head())
    
    return cleaned_df

# --- Process indian_food_dataset_df ---
cleaned_indian_food_dataset_df = clean_and_process_recipe_df(indian_food_dataset_df, "Indian Food Dataset")

### Cleaned Indian Food Dataset DataFrame

Unnamed: 0,Srno,name,ingredients,prep_time,cook_time,total_time,cuisine,course,diet,instructions,n_ingredients
0,1,Masala Karela Recipe,"6 Karela (Bi; er Gourd/ Pavakkai) - deseeded,S...",15,30,45,Indian,Side Dish,Diabetic Friendly,"To begi; maki; g; he Masala Karela Recipe,de-s...",28
1,2,Spicy Tomato Rice (Recipe),"2-1 / 2 cups rice - cooked, 3; oma; oes, 3; ea...",5,10,15,South Indian Recipes,Main Course,Vegetarian,"To make; oma; o puliogere, firs; cu; he; oma; ...",30
2,3,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1-1/2 cups Rice Vermicelli Noodles (Thi; ),1 O...",20,30,50,South Indian Recipes,South Indian Breakfast,High Protein Vegetarian,"To begi; maki; g; he Ragi Vermicelli Recipe, f...",28
3,4,Gongura Chicken Curry Recipe - Andhra Style Go...,"500 grams Chicke; 2 O; io; - chopped,1 Toma; o...",15,30,45,Andhra,Lunch,Non Vegeterian,To begi; maki; g Go; gura Chicke; Curry Recipe...,43
4,5,Andhra Style Alam Pachadi Recipe - Adrak Chutn...,"1; ablespoo; cha; a dal, 1; ablespoo; whi; e u...",10,20,30,Andhra,South Indian Breakfast,Vegetarian,"To make A; dhra S; yle Alam Pachadi, firs; hea...",23
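
The fragments in the output above ("To begi; maki; g; he Masala Karela Recipe") are what a split on the letters `n` and `t` produces. A short, self-contained check of that reading, using a made-up sentence: inside a raw string, `[\\n\\t]` is a character class containing a backslash, `n`, and `t`, whereas `[\n\t]` matches only newline and tab.

```python
import re

text = "To begin making the recipe\n\twash the rice"

# Doubled backslashes in a raw string: the class matches '\', 'n' and 't'
wrong = [p.strip() for p in re.split(r'[\\n\\t]+', text) if p.strip()]

# Single backslashes: the class matches newline and tab only
right = [p.strip() for p in re.split(r'[\n\t]+', text) if p.strip()]

print(wrong)   # words are cut at every lowercase 'n' and 't'
print(right)   # two clean lines
```

The same applies to `r'\\s+'`, which looks for a literal backslash followed by `s` rather than for whitespace.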


## Summary and Next Steps

In this notebook, we successfully explored, cleaned, and prepared four different datasets on Indian food recipes.

### Key Accomplishments:
- **Standardized Data**: We created consistent column names and formats across all datasets.
- **Cleaned Text**: Ingredients and instructions were processed into a clean, machine-readable format.
- **Feature Engineering**: A new `n_ingredients` feature was created to quantify recipe complexity.
- **Modular Code**: The cleaning logic was refactored into reusable functions, making the notebook more organized and efficient.

### Next Steps:
With the data now cleaned and prepared, several exciting possibilities are available for future analysis:
- **Merge Datasets**: Combine the cleaned dataframes into a single, comprehensive master dataset of Indian recipes.
- **Exploratory Data Analysis (EDA)**: Perform in-depth analysis to uncover patterns, such as the most common ingredients or the distribution of vegetarian vs. non-vegetarian dishes.
- **Machine Learning**: Build a model to predict a recipe's cuisine or course based on its ingredients.
- **Recommendation System**: Develop a system to recommend recipes to users based on their preferences.

#### Inspect the Columns of Each Cleaned Dataset

In [23]:
print(cleaned_cuisines_df.columns)
print(cleaned_ifood_df.columns)
print(cleaned_indian_food_df.columns)
print(cleaned_indian_food_dataset_df.columns)


Index(['name', 'description', 'cuisine', 'course', 'diet', 'prep_time',
       'ingredients', 'instructions', 'n_ingredients'],
      dtype='object')
Index(['name', 'ingredients', 'diet', 'prep_time', 'cook_time', 'course',
       'n_ingredients'],
      dtype='object')
Index(['name', 'ingredients', 'diet', 'prep_time', 'cook_time', 'course',
       'n_ingredients'],
      dtype='object')
Index(['Srno', 'name', 'ingredients', 'prep_time', 'cook_time', 'total_time',
       'cuisine', 'course', 'diet', 'instructions', 'n_ingredients'],
      dtype='object')


In [33]:
# 1. Collect the cleaned DataFrames in a list
dfs = [cleaned_cuisines_df, cleaned_ifood_df, cleaned_indian_food_df, cleaned_indian_food_dataset_df]

# 2. Concatenate them
combined = pd.concat(dfs, ignore_index=True, sort=False)

# 3. (Optional) Fill NaNs with the string "NA"
combined = combined.fillna("NA")
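
`pd.concat` aligns on column names, so a column that exists in only some of the inputs is filled with `NaN` for the others. The toy frames below (invented data) show this, together with a more selective alternative to `fillna("NA")` that fills only text columns so numeric columns keep a numeric dtype. This is a sketch of one option, not what the notebook does:

```python
import pandas as pd

a = pd.DataFrame({"name": ["Boondi"], "prep_time": [80], "description": ["Sweet"]})
b = pd.DataFrame({"name": ["Phulka"], "prep_time": [40], "cook_time": [10]})

# Columns are unioned in order of first appearance; gaps become NaN
combined_toy = pd.concat([a, b], ignore_index=True, sort=False)

# Fill only the text columns; numeric columns keep NaN and stay numeric
text_cols = ["name", "description"]
combined_toy[text_cols] = combined_toy[text_cols].fillna("NA")

print(combined_toy)
print(combined_toy.dtypes)
```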

In [35]:
# Inspect the result
print(combined.columns)
print(combined.shape)
combined.drop(columns=['Srno'], inplace=True, errors='ignore')  # errors='ignore' keeps the cell safe to re-run
combined.head()


Index(['name', 'description', 'cuisine', 'course', 'diet', 'prep_time',
       'ingredients', 'instructions', 'n_ingredients', 'cook_time', 'Srno',
       'total_time'],
      dtype='object')
(7500, 12)


Unnamed: 0,name,description,cuisine,course,diet,prep_time,ingredients,instructions,n_ingredients,cook_time,total_time
0,Doddapatre Soppina Chitranna Recipe (Spiced In...,Doddapatre Soppina Chitranna (Indian Thyme Ric...,South Indian Recipes,Lunch,Vegetarian,Total in 50 M,1-1/2 cups Cooked rice; 2 tablespoons Oil; 10 ...,To start preparing Doddapatre Soppina Chitrann...,22,,
1,Goan Style Mushroom Vindaloo Recipe,Goan Style Mushroom Vindaloo Recipe is a varia...,Goan Recipes,Dinner,Vegetarian,Total in 50 M,250 grams Button mushrooms; cut into quarters;...,To begin making the Goan Style Mushroom Vindal...,18,,
2,Assamese Style Walking Catfish In Curry Leaf G...,Assamese Style Walking Catfish In Curry Leaf G...,Assamese,Side Dish,Non Vegeterian,Total in 40 M,5 Walking Catfish; thoroughly cleaned; 4 clove...,To begin making Assamese Style Walking Catfish...,14,,
3,Nutty Aloo Paratha Recipe,Nutty Aloo Paratha Recipe is a wonderful twist...,North Indian Recipes,North Indian Breakfast,Vegetarian,Total in 55 M,1 cup Whole Wheat Flour; 1 cup Spinach Leaves ...,"To begin making Nutty Aloo Paratha Recipe,firs...",22,,
4,Phulka Recipe (Roti/Chapati) - Puffed Indian B...,Phulkas also known as Roti or Chapati in some ...,North Indian Recipes,Main Course,Vegetarian,Total in 40 M,1 cup Whole Wheat Flour; 1/2 teaspoon Salt; op...,To begin making the Phulka (roti/ chapati) rec...,6,,


In [39]:
combined['prep_time'].sample(30)

5997               10
1556               10
2035               30
9       Total in 60 M
1710              360
6629               10
1230               10
6988              720
6832               15
6605               10
4823               10
2254               20
2239               15
4815               10
5481               10
1818               15
7395               10
2720               20
3058               10
5529               10
1650              930
3029               20
2551               30
1398               20
2901               10
6898                0
6657               40
1930               15
2031               15
6324               15
Name: prep_time, dtype: object

In [40]:
import re
import numpy as np
import pandas as pd

def parse_prep_time(val):
    """
    - Handles strings like "1h 30m" or "1 hr 30 mins"
    - Falls back to grabbing the first number it sees (e.g. "Total in 60 M" → 60)
    - Returns NaN if no digits are found
    """
    if pd.isnull(val):
        return np.nan
    s = str(val).lower()
    # look for hours
    hours = 0
    m_h = re.search(r'(\d+)\s*h', s)
    if m_h:
        hours = int(m_h.group(1))
    # look for minutes
    m_m = re.search(r'(\d+)\s*m', s)
    if m_m:
        minutes = int(m_m.group(1))
        return hours * 60 + minutes
    # hours given without minutes (e.g. "2 hr"): convert to minutes
    if m_h:
        return hours * 60
    # otherwise just grab the first number you see
    m_any = re.search(r'(\d+)', s)
    if m_any:
        return int(m_any.group(1))
    return np.nan

# apply it
combined['prep_time'] = combined['prep_time'].apply(parse_prep_time)

# sanity check
print(combined['prep_time'].dtype)    # should be int64 or float64
print(combined['prep_time'].sample(10))


int64
691     10
1635    10
7205    30
1645    10
1225    10
7103    10
2686    10
4690    15
4696    10
1894    10
Name: prep_time, dtype: int64
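
To make the parsing rules concrete, here is a compact, standalone variant with a few sample inputs. The function name `to_minutes` and the sample strings other than "Total in 60 M" are illustrative assumptions; this variant also converts hours-only strings such as "2 hr" to minutes.

```python
import math
import re

def to_minutes(val) -> float:
    """Parse a time value into minutes; returns NaN when no number is found."""
    if val is None or (isinstance(val, float) and math.isnan(val)):
        return math.nan
    s = str(val).lower()
    m_h = re.search(r'(\d+)\s*h', s)   # e.g. "1h", "1 hr"
    m_m = re.search(r'(\d+)\s*m', s)   # e.g. "30m", "60 M"
    if m_h or m_m:
        hours = int(m_h.group(1)) if m_h else 0
        minutes = int(m_m.group(1)) if m_m else 0
        return hours * 60 + minutes
    # No unit found: fall back to the first number, if any
    m_any = re.search(r'\d+', s)
    return int(m_any.group()) if m_any else math.nan

for sample in ["Total in 60 M", "1 hr 30 mins", "2 hr", 45, "NA"]:
    print(repr(sample), "->", to_minutes(sample))
```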


In [41]:
import re
import numpy as np
import pandas as pd

def parse_time(val):
    """
    Convert a time string into integer minutes.
    - Handles hours (e.g. "1h", "1 hr") and minutes (e.g. "30m", "30 mins").
    - Falls back to the first number it sees (e.g. "Total in 60 M" → 60).
    - Returns NaN if no digits found.
    """
    if pd.isnull(val):
        return np.nan
    s = str(val).lower()
    hours = 0
    m_h = re.search(r'(\d+)\s*h', s)
    if m_h:
        hours = int(m_h.group(1))
    m_m = re.search(r'(\d+)\s*m', s)
    if m_m:
        return hours * 60 + int(m_m.group(1))
    if m_h:
        # hours given without minutes (e.g. "2 hr"): convert to minutes
        return hours * 60
    m_any = re.search(r'(\d+)', s)
    if m_any:
        return int(m_any.group(1))
    return np.nan

# 1. prep_time is already numeric (parsed in the previous cell).
# 2. Parse cook_time:
combined['cook_time'] = combined['cook_time'].apply(parse_time)

# 3. Compute total_time = prep_time + cook_time
combined['total_time'] = combined['prep_time'] + combined['cook_time']

# 4. Fill any remaining missing total_time with whichever component exists
combined['total_time'] = (
    combined['total_time']
      .fillna(combined['prep_time'])
      .fillna(combined['cook_time'])
)

# Quick check
print(combined[['name', 'prep_time', 'cook_time', 'total_time']].head())


                                                name  prep_time  cook_time  \
0  Doddapatre Soppina Chitranna Recipe (Spiced In...         50        NaN   
1                Goan Style Mushroom Vindaloo Recipe         50        NaN   
2  Assamese Style Walking Catfish In Curry Leaf G...         40        NaN   
3                          Nutty Aloo Paratha Recipe         55        NaN   
4  Phulka Recipe (Roti/Chapati) - Puffed Indian B...         40        NaN   

   total_time  
0        50.0  
1        50.0  
2        40.0  
3        55.0  
4        40.0  


In [42]:
combined

Unnamed: 0,name,description,cuisine,course,diet,prep_time,ingredients,instructions,n_ingredients,cook_time,total_time
0,Doddapatre Soppina Chitranna Recipe (Spiced In...,Doddapatre Soppina Chitranna (Indian Thyme Ric...,South Indian Recipes,Lunch,Vegetarian,50,1-1/2 cups Cooked rice; 2 tablespoons Oil; 10 ...,To start preparing Doddapatre Soppina Chitrann...,22,,50.0
1,Goan Style Mushroom Vindaloo Recipe,Goan Style Mushroom Vindaloo Recipe is a varia...,Goan Recipes,Dinner,Vegetarian,50,250 grams Button mushrooms; cut into quarters;...,To begin making the Goan Style Mushroom Vindal...,18,,50.0
2,Assamese Style Walking Catfish In Curry Leaf G...,Assamese Style Walking Catfish In Curry Leaf G...,Assamese,Side Dish,Non Vegeterian,40,5 Walking Catfish; thoroughly cleaned; 4 clove...,To begin making Assamese Style Walking Catfish...,14,,40.0
3,Nutty Aloo Paratha Recipe,Nutty Aloo Paratha Recipe is a wonderful twist...,North Indian Recipes,North Indian Breakfast,Vegetarian,55,1 cup Whole Wheat Flour; 1 cup Spinach Leaves ...,"To begin making Nutty Aloo Paratha Recipe,firs...",22,,55.0
4,Phulka Recipe (Roti/Chapati) - Puffed Indian B...,Phulkas also known as Roti or Chapati in some ...,North Indian Recipes,Main Course,Vegetarian,40,1 cup Whole Wheat Flour; 1/2 teaspoon Salt; op...,To begin making the Phulka (roti/ chapati) rec...,6,,40.0
...,...,...,...,...,...,...,...,...,...,...,...
7495,Goan Mushroom Xacuti Recipe,,Goan Recipes,Lunch,Vegetarian,15,"20 बटन मशरुम,2 प्याज - काट ले,1 टमाटर - बारीक ...",गोअन मशरुम जकुटी रेसिपी बनाने के लिए सबसे पहले...,1,45.0,60.0
7496,Sweet Potato & Methi Stuffed Paratha Recipe,,North Indian Recipes,North Indian Breakfast,Diabetic Friendly,30,"1 बड़ा चम्मच तेल,1 कप गेहूं का आटा,नमक - स्वाद ...",शकरकंदी और मेथी का पराठा रेसिपी बनाने के लिए स...,1,60.0,90.0
7497,Ullikadala Pulusu Recipe | Spring Onion Curry,,Andhra,Side Dish,Vegetarian,5,150 grams Spri; g O; io; (Bulb & Gree; s) - ch...,To begi; maki; g Ullikadala Pulusu Recipe | Sp...,42,10.0,15.0
7498,Kashmiri Style Kokur Yakhni Recipe-Chicken Coo...,,Kashmiri,Lunch,Non Vegeterian,30,"1 kg Chicke; - medium pieces,1/2 cup Mus; ard ...",To begi; maki; g; he Kashmiri Kokur Yakh; i re...,27,45.0,75.0


In [43]:
# 1. Zero-fill cook time
combined['cook_time'] = combined['cook_time'].fillna(0).astype(int)

# 2. Recalc total_time
combined['total_time'] = combined['prep_time'] + combined['cook_time']
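
An alternative to zero-filling `cook_time` in place is `Series.add` with `fill_value=0`, which treats a missing component as zero for the sum while leaving the missing values in the original column visible. A sketch on invented numbers:

```python
import numpy as np
import pandas as pd

times = pd.DataFrame({
    "prep_time": [50, 15, np.nan],
    "cook_time": [np.nan, 45, np.nan],
})

# A missing component counts as 0; the result is NaN only when both are missing
times["total_time"] = times["prep_time"].add(times["cook_time"], fill_value=0)

print(times)
```

Whether to keep the gaps or store an explicit 0 depends on how later steps should treat recipes with an unknown cook time.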


In [44]:
# Normalise the categorical labels: trim whitespace and use title case
for col in ['cuisine','course','diet']:
    combined[col] = combined[col].astype(str).str.strip().str.title()

In [45]:
# Store all time columns as integers (minutes)
for t in ['prep_time','cook_time','total_time']:
    combined[t] = combined[t].astype(int)

In [46]:
# 1. Split ingredients into a clean list
combined['ingredient_list'] = (
    combined['ingredients']
      .str.split(';')                                   # split on “;”
      .apply(lambda items: [i.strip()                      # strip whitespace
                           for i in items 
                           if i and i.strip()])           # drop empty strings
)

# 2. Count from that list
combined['n_ingredients'] = combined['ingredient_list'].apply(len)
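
With `ingredient_list` in place, per-ingredient statistics become straightforward. As a sketch of the kind of analysis mentioned under Next Steps, the snippet below counts how often each ingredient appears using `Series.explode`. The three rows are taken from the sample output shown earlier; the trailing `; ` on the last one is added here to show that empty items are dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Boondi", "Gajar ka halwa", "Balu shahi"],
    "ingredients": [
        "Gram flour; ghee; sugar",
        "Carrots; milk; sugar; ghee",
        "Maida flour; yogurt; oil; sugar; ",
    ],
})

# Same split-and-strip logic as above, plus lower-casing for counting
toy["ingredient_list"] = toy["ingredients"].str.split(";").apply(
    lambda items: [i.strip().lower() for i in items if i.strip()]
)

# One row per ingredient, then count occurrences
counts = toy["ingredient_list"].explode().value_counts()
print(counts.head())
```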

In [47]:
combined

Unnamed: 0,name,description,cuisine,course,diet,prep_time,ingredients,instructions,n_ingredients,cook_time,total_time,ingredient_list
0,Doddapatre Soppina Chitranna Recipe (Spiced In...,Doddapatre Soppina Chitranna (Indian Thyme Ric...,South Indian Recipes,Lunch,Vegetarian,50,1-1/2 cups Cooked rice; 2 tablespoons Oil; 10 ...,To start preparing Doddapatre Soppina Chitrann...,22,0,50,"[1-1/2 cups Cooked rice, 2 tablespoons Oil, 10..."
1,Goan Style Mushroom Vindaloo Recipe,Goan Style Mushroom Vindaloo Recipe is a varia...,Goan Recipes,Dinner,Vegetarian,50,250 grams Button mushrooms; cut into quarters;...,To begin making the Goan Style Mushroom Vindal...,18,0,50,"[250 grams Button mushrooms, cut into quarters..."
2,Assamese Style Walking Catfish In Curry Leaf G...,Assamese Style Walking Catfish In Curry Leaf G...,Assamese,Side Dish,Non Vegeterian,40,5 Walking Catfish; thoroughly cleaned; 4 clove...,To begin making Assamese Style Walking Catfish...,14,0,40,"[5 Walking Catfish, thoroughly cleaned, 4 clov..."
3,Nutty Aloo Paratha Recipe,Nutty Aloo Paratha Recipe is a wonderful twist...,North Indian Recipes,North Indian Breakfast,Vegetarian,55,1 cup Whole Wheat Flour; 1 cup Spinach Leaves ...,"To begin making Nutty Aloo Paratha Recipe,firs...",22,0,55,"[1 cup Whole Wheat Flour, 1 cup Spinach Leaves..."
4,Phulka Recipe (Roti/Chapati) - Puffed Indian B...,Phulkas also known as Roti or Chapati in some ...,North Indian Recipes,Main Course,Vegetarian,40,1 cup Whole Wheat Flour; 1/2 teaspoon Salt; op...,To begin making the Phulka (roti/ chapati) rec...,6,0,40,"[1 cup Whole Wheat Flour, 1/2 teaspoon Salt, o..."
...,...,...,...,...,...,...,...,...,...,...,...,...
7495,Goan Mushroom Xacuti Recipe,,Goan Recipes,Lunch,Vegetarian,15,"20 बटन मशरुम,2 प्याज - काट ले,1 टमाटर - बारीक ...",गोअन मशरुम जकुटी रेसिपी बनाने के लिए सबसे पहले...,1,45,60,"[20 बटन मशरुम,2 प्याज - काट ले,1 टमाटर - बारीक..."
7496,Sweet Potato & Methi Stuffed Paratha Recipe,,North Indian Recipes,North Indian Breakfast,Diabetic Friendly,30,"1 बड़ा चम्मच तेल,1 कप गेहूं का आटा,नमक - स्वाद ...",शकरकंदी और मेथी का पराठा रेसिपी बनाने के लिए स...,1,60,90,"[1 बड़ा चम्मच तेल,1 कप गेहूं का आटा,नमक - स्वाद..."
7497,Ullikadala Pulusu Recipe | Spring Onion Curry,,Andhra,Side Dish,Vegetarian,5,150 grams Spri; g O; io; (Bulb & Gree; s) - ch...,To begi; maki; g Ullikadala Pulusu Recipe | Sp...,42,10,15,"[150 grams Spri, g O, io, (Bulb & Gree, s) - c..."
7498,Kashmiri Style Kokur Yakhni Recipe-Chicken Coo...,,Kashmiri,Lunch,Non Vegeterian,30,"1 kg Chicke; - medium pieces,1/2 cup Mus; ard ...",To begi; maki; g; he Kashmiri Kokur Yakh; i re...,27,45,75,"[1 kg Chicke, - medium pieces,1/2 cup Mus, ard..."


In [48]:
import re

def is_ascii(s: str) -> bool:
    return not bool(re.search(r'[^\x00-\x7F]', s))

# Choose which text columns to enforce English-only on:
text_cols = ['name','description','ingredients','instructions']

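# Build a mask that's True only if *all* those columns are ASCII strings.
# Series.map is applied per column because DataFrame.applymap is deprecated in recent pandas.
mask = (
    combined[text_cols]
      .apply(lambda col: col.map(lambda x: isinstance(x, str) and is_ascii(x)))
      .all(axis=1)
)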

# Filter
combined = combined[mask].reset_index(drop=True)
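
The ASCII test is strict: it also rejects otherwise English rows that contain characters such as `½` or `°`. If the aim is only to drop the Hindi rows visible above, one alternative is to look for Devanagari characters (Unicode block U+0900 to U+097F). This is a sketch of that idea with a hypothetical helper `has_devanagari`, not a drop-in replacement validated on the full data:

```python
import re

# Devanagari Unicode block
DEVANAGARI = re.compile(r'[\u0900-\u097F]')

def has_devanagari(text) -> bool:
    """True if the value is a string containing at least one Devanagari character."""
    return isinstance(text, str) and bool(DEVANAGARI.search(text))

samples = [
    "1 cup Whole Wheat Flour",      # plain ASCII
    "Bake at 180°C for ½ hour",     # English with non-ASCII symbols (made up)
    "1 कप गेहूं का आटा",              # Hindi, as in the rows above
]
for s in samples:
    print(has_devanagari(s), s.isascii(), s)
```

A filter built on `has_devanagari` would keep the second sample, which the ASCII mask discards.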




In [49]:
combined

Unnamed: 0,name,description,cuisine,course,diet,prep_time,ingredients,instructions,n_ingredients,cook_time,total_time,ingredient_list
0,Doddapatre Soppina Chitranna Recipe (Spiced In...,Doddapatre Soppina Chitranna (Indian Thyme Ric...,South Indian Recipes,Lunch,Vegetarian,50,1-1/2 cups Cooked rice; 2 tablespoons Oil; 10 ...,To start preparing Doddapatre Soppina Chitrann...,22,0,50,"[1-1/2 cups Cooked rice, 2 tablespoons Oil, 10..."
1,Goan Style Mushroom Vindaloo Recipe,Goan Style Mushroom Vindaloo Recipe is a varia...,Goan Recipes,Dinner,Vegetarian,50,250 grams Button mushrooms; cut into quarters;...,To begin making the Goan Style Mushroom Vindal...,18,0,50,"[250 grams Button mushrooms, cut into quarters..."
2,Assamese Style Walking Catfish In Curry Leaf G...,Assamese Style Walking Catfish In Curry Leaf G...,Assamese,Side Dish,Non Vegeterian,40,5 Walking Catfish; thoroughly cleaned; 4 clove...,To begin making Assamese Style Walking Catfish...,14,0,40,"[5 Walking Catfish, thoroughly cleaned, 4 clov..."
3,Nutty Aloo Paratha Recipe,Nutty Aloo Paratha Recipe is a wonderful twist...,North Indian Recipes,North Indian Breakfast,Vegetarian,55,1 cup Whole Wheat Flour; 1 cup Spinach Leaves ...,"To begin making Nutty Aloo Paratha Recipe,firs...",22,0,55,"[1 cup Whole Wheat Flour, 1 cup Spinach Leaves..."
4,Phulka Recipe (Roti/Chapati) - Puffed Indian B...,Phulkas also known as Roti or Chapati in some ...,North Indian Recipes,Main Course,Vegetarian,40,1 cup Whole Wheat Flour; 1/2 teaspoon Salt; op...,To begin making the Phulka (roti/ chapati) rec...,6,0,40,"[1 cup Whole Wheat Flour, 1/2 teaspoon Salt, o..."
...,...,...,...,...,...,...,...,...,...,...,...,...
1318,Spiced Turnips with Spinach Recipe,,Indian,Lunch,Vegetarian,15,"3 Tur; ips - peeled a; d diced i; o cubes,1 Sp...",To s; ar; prepari; g Spiced Tur; ips wi; h Spi...,41,60,75,"[3 Tur, ips - peeled a, d diced i, o cubes,1 S..."
1319,Uttar Pradesh Style Satpaita Dal Recipe,,Uttar Pradesh,Side Dish,High Protein Vegetarian,15,"1/2 cup Cha; a dal (Be; gal Gram Dal),1/4 cup ...",To begi; maki; g U; ar Pradesh S; yle Sa; pai;...,49,20,35,"[1/2 cup Cha, a dal (Be, gal Gram Dal),1/4 cup..."
1320,Kuvar Pak Recipe,,Gujarati Recipes﻿,Dessert,Vegetarian,5,500 ml Milk - full fa; 1/2 cup Aloe vera ex; r...,To prepare Kuvar Pak Recipe; ake 500 ml of ful...,16,60,65,"[500 ml Milk - full fa, 1/2 cup Aloe vera ex, ..."
1321,Creamy Spinach And Potato Breakfast Casserole ...,,Continental,World Breakfast,Eggetarian,20,"1/2 cup O; io; s - chopped,1 cup Soy Chu; ks (...",To begi; maki; g Creamy Spi; ach A; d Po; a; o...,25,45,65,"[1/2 cup O, io, s - chopped,1 cup Soy Chu, ks ..."
