# Day 2: Cleaning and Preprocessing the Indian Food Dataset

This notebook details the process of cleaning and preparing a dataset of Indian food recipes. The goal is to create a well-structured, analysis-ready dataset.

## Data Processing Pipeline

To keep the process clean and reproducible, the cleaning and preprocessing steps are consolidated into two functions: `clean_and_process_data` for structural cleaning and `translate_dataframe` for translation.

Together with the initial load, they perform the following operations:

1.  **Load Data**: The raw dataset is loaded from `data/combined_indian_food.csv`.
2.  **Drop Unnecessary Columns**: Removes the redundant `Srno` column.
3.  **Standardize Time Columns**:
    *   Parses the `prep_time` and `cook_time` columns to extract numerical values in minutes.
    *   Calculates a `total_time` column by summing the preparation and cooking times.
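4.  **Build an Ingredient List**:
    *   Splits the `ingredients` column on `;` into an `ingredient_list` column, dropping empty entries.
    *   Recomputes `n_ingredients` from the length of that list.
5.  **Translate Text to English**:
    *   Treats any non-ASCII string in the `ingredients` and `instructions` columns as non-English.
    *   Uses the Google Cloud Translation API (`google.cloud.translate_v2`) to translate that text from Hindi into English.
    *   Saves the translated text into new columns (`ingredients_en` and `instructions_en`) to preserve the original data.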

This streamlined approach makes the notebook more organized and the data preparation process easy to understand and reuse.
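
The time-standardization step can be illustrated on a small toy frame. This is a minimal sketch: the sample values are hypothetical, and `parse_minutes` is an illustrative name that mirrors the first-integer regex used later in the notebook. Note that this approach reads a value such as `"1 hour 30 minutes"` as `1`, so it assumes every value is already expressed in minutes.

```python
import re

import numpy as np
import pandas as pd


def parse_minutes(val):
    """Return the first integer found in val, or NaN if there is none."""
    m = re.search(r"(\d+)", str(val))
    return int(m.group(1)) if m else np.nan


# Hypothetical sample values; the real columns may be formatted differently.
toy = pd.DataFrame({
    "prep_time": ["15 M", "50", None],
    "cook_time": ["30 M", None, None],
})

toy["prep_time"] = toy["prep_time"].apply(parse_minutes)
toy["cook_time"] = toy["cook_time"].apply(parse_minutes)

# Missing values count as 0 minutes, so a row with no times at all sums to 0.
toy["total_time"] = toy["prep_time"].fillna(0) + toy["cook_time"].fillna(0)
print(toy)
```

Because missing times are filled with 0 before summing, a `total_time` of 0 means "no time recorded" rather than an instant recipe.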

In [2]:
import pandas as pd

In [3]:
# 1. Load your raw data
raw_df = pd.read_csv("data/combined_indian_food.csv")

In [4]:
import os
from google.cloud import translate_v2 as translate

# --- IMPORTANT: SET UP AUTHENTICATION ---
# Point the environment variable to your downloaded JSON key file.
# The Google Cloud library reads this variable when the client is created,
# so it must be set before calling translate.Client().
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/ram/.config/gcloud/application_default_credentials.json'

# Initialize the Translation client
translate_client = translate.Client()

def translate_text(text: str) -> str:
    """
    Translates a given text string to English using the official Google Cloud Translation API.

    - It checks if the text is a valid string.
    - It avoids translating text that is already in ASCII (likely English).
    - It handles potential errors during the API call, returning the original text on failure.

    Args:
        text: The string to be translated.

    Returns:
        The translated string in English, or the original string if translation is not needed or fails.
    """
    # Return non-strings or ASCII strings immediately
    if not isinstance(text, str) or text.isascii():
        return text
    
    try:
        # The API returns a dictionary; the translated text is in the 'translatedText' key.
        result = translate_client.translate(text, source_language='hi', target_language='en')
        return result['translatedText']
    except Exception as e:
        print(f"API translation failed for text: '{text[:50]}...'. Error: {e}")
        return text # Return original text on failure

print("Official Google Cloud Translation client initialized.")


Official Google Cloud Translation client initialized.
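
The ASCII guard in `translate_text` decides which values are sent to the API. The sketch below isolates that check so it can be tried without credentials; `needs_translation` is a hypothetical helper name, and the sample strings are made up for illustration.

```python
def needs_translation(text) -> bool:
    """Mirror the guard in translate_text: only non-ASCII strings go to the API."""
    return isinstance(text, str) and not text.isascii()


# Hypothetical sample values, not taken from the dataset.
samples = ["2 tablespoons Oil", "2 कप चावल", "1 cup crème fraîche", None, 3.5]

for value in samples:
    print(repr(value), "->", needs_translation(value))
```

The third sample shows a limitation of the heuristic: English text with accented characters is also non-ASCII, so it would be sent to the API with `source_language='hi'`.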


In [5]:
import pandas as pd
import numpy as np
import re

def clean_and_process_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Cleans and preprocesses the Indian Food dataset.

    This function performs the following steps:
    1. Drops the 'Srno' column.
    2. Parses 'prep_time' and 'cook_time' into numerical minutes.
    3. Calculates 'total_time'.
    4. Splits 'ingredients' on ';' into 'ingredient_list'.
    5. Drops empty entries from each list.
    6. Recomputes 'n_ingredients' from the cleaned list.

    Translation is handled separately by `translate_dataframe`.

    Args:
        df: The raw pandas DataFrame.

    Returns:
        A cleaned and processed pandas DataFrame.
    """
    print("Starting data cleaning and processing...")
    
    data = df.copy()

    # 1. Drop unnecessary column
    if 'Srno' in data.columns:
        data.drop(columns=['Srno'], inplace=True)
        print("- 'Srno' column dropped.")

    # 2. Clean and parse time columns
    def _parse_time(val):
        # Take the first integer in the value; units are assumed to be minutes
        m = re.search(r'(\d+)', str(val))
        return int(m.group(1)) if m else np.nan

    data['prep_time'] = data['prep_time'].apply(_parse_time)
    data['cook_time'] = data['cook_time'].apply(_parse_time)
    print("- Time columns parsed.")
    
    # 3. Calculate total time
    data['total_time'] = data['prep_time'].fillna(0) + data['cook_time'].fillna(0)
    print("- 'total_time' calculated.")

    
    # 4. Fill missing with '' and split
    data['ingredient_list'] = (
        data['ingredients']
        .fillna('')            # turn NaNs into empty strings
        .str.split(';')        # now every item is a list (possibly [''] for originally missing)
    )

    # 5. Clean up each list, dropping any empty or whitespace-only entries
    def clean_ingredients(items):
        if not isinstance(items, list):
            return []            # if somehow still not a list, bail out
        return [i.strip() for i in items if isinstance(i, str) and i.strip()]

    data['ingredient_list'] = data['ingredient_list'].apply(clean_ingredients)

    # 6. Recompute n_ingredients from the cleaned list
    data['n_ingredients'] = data['ingredient_list'].apply(len)
    
    
    print("\nData cleaning and processing complete!")
    return data
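
The ingredient-splitting steps can be checked on a few hand-written rows. This is a sketch on hypothetical data; it reuses the same split-and-strip logic as the function above.

```python
import pandas as pd

# Hypothetical rows: trailing separators, a preparation note, and a missing value.
toy = pd.DataFrame({
    "ingredients": [
        "1 cup Whole Wheat Flour; 1/2 teaspoon Salt; ; ",
        "250 grams Button mushrooms; cut into quarters",
        None,
    ]
})


def clean_ingredients(items):
    """Strip each entry and drop empty or whitespace-only ones."""
    if not isinstance(items, list):
        return []
    return [i.strip() for i in items if isinstance(i, str) and i.strip()]


toy["ingredient_list"] = (
    toy["ingredients"].fillna("").str.split(";").apply(clean_ingredients)
)
toy["n_ingredients"] = toy["ingredient_list"].apply(len)
print(toy[["ingredient_list", "n_ingredients"]])
```

The second row illustrates a caveat: a preparation note that follows a semicolon ("cut into quarters") is counted as its own ingredient, so `n_ingredients` can overstate the true count.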

In [6]:
import time

def translate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Translates the 'ingredients' and 'instructions' columns of a DataFrame from Hindi to English.

    Args:
        df: The pandas DataFrame containing the 'ingredients' and 'instructions' columns.

    Returns:
        A pandas DataFrame with the translated 'ingredients_en' and 'instructions_en' columns.
    """
    
    df = df.copy()
    

    print("Translating 'ingredients' column...")
    start_time = time.time()
    df['ingredients_en'] = df['ingredients'].apply(translate_text)
    end_time = time.time()
    print(f"Translation completed in {end_time - start_time:.2f} seconds")
    # Note: no delay is applied between the two columns; add one here if the API rate-limits requests
    
    start_time = time.time()
    print("Translating 'instructions' column...")
    df['instructions_en'] = df['instructions'].apply(translate_text)
    end_time = time.time()
    print(f"Translation completed in {end_time - start_time:.2f} seconds")
    
    return df
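
If rate limiting or API cost becomes a concern, one option is to translate each distinct string only once and map the results back. The sketch below is an assumption-laden illustration, not part of the original pipeline: `translate_unique` is a hypothetical helper, and `fake_translate` is a stand-in for `translate_text` so the example runs without credentials. It only helps when a column contains repeated strings.

```python
import pandas as pd

calls = []


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for translate_text; records calls instead of hitting an API."""
    calls.append(text)
    return f"<en:{text}>"


def translate_unique(series: pd.Series, translate_fn) -> pd.Series:
    """Translate each distinct non-missing value once, then map results onto the series."""
    unique_values = series.dropna().unique()
    lookup = {value: translate_fn(value) for value in unique_values}
    return series.map(lookup)


col = pd.Series(["नमक", "salt", "नमक", None, "नमक"])
translated = translate_unique(col, fake_translate)

print(translated.tolist())
print("translation calls:", len(calls))
```

Here five rows require only two translation calls, and the missing value stays missing.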


In [7]:
from IPython.display import display

# 1. Load your raw data
raw_df = pd.read_csv("data/combined_indian_food.csv")

# 2. Run the cleaning process (translation runs in a later cell)
cleaned_df = clean_and_process_data(raw_df)


# 3. Display a sample of the cleaned data
print("\nSample of the cleaned DataFrame:")
display(cleaned_df.head())



Starting data cleaning and processing...
- 'Srno' column dropped.
- Time columns parsed.
- 'total_time' calculated.

Data cleaning and processing complete!

Sample of the cleaned DataFrame:


Unnamed: 0,name,description,cuisine,course,diet,prep_time,ingredients,instructions,n_ingredients,cook_time,total_time,ingredient_list
0,Doddapatre Soppina Chitranna Recipe (Spiced In...,Doddapatre Soppina Chitranna (Indian Thyme Ric...,South Indian Recipes,Lunch,Vegetarian,50,1-1/2 cups Cooked rice; 2 tablespoons Oil; 10 ...,To start preparing Doddapatre Soppina Chitrann...,22,,50.0,"[1-1/2 cups Cooked rice, 2 tablespoons Oil, 10..."
1,Goan Style Mushroom Vindaloo Recipe,Goan Style Mushroom Vindaloo Recipe is a varia...,Goan Recipes,Dinner,Vegetarian,50,250 grams Button mushrooms; cut into quarters;...,To begin making the Goan Style Mushroom Vindal...,18,,50.0,"[250 grams Button mushrooms, cut into quarters..."
2,Assamese Style Walking Catfish In Curry Leaf G...,Assamese Style Walking Catfish In Curry Leaf G...,Assamese,Side Dish,Non Vegeterian,40,5 Walking Catfish; thoroughly cleaned; 4 clove...,To begin making Assamese Style Walking Catfish...,14,,40.0,"[5 Walking Catfish, thoroughly cleaned, 4 clov..."
3,Nutty Aloo Paratha Recipe,Nutty Aloo Paratha Recipe is a wonderful twist...,North Indian Recipes,North Indian Breakfast,Vegetarian,55,1 cup Whole Wheat Flour; 1 cup Spinach Leaves ...,"To begin making Nutty Aloo Paratha Recipe,firs...",22,,55.0,"[1 cup Whole Wheat Flour, 1 cup Spinach Leaves..."
4,Phulka Recipe (Roti/Chapati) - Puffed Indian B...,Phulkas also known as Roti or Chapati in some ...,North Indian Recipes,Main Course,Vegetarian,40,1 cup Whole Wheat Flour; 1/2 teaspoon Salt; op...,To begin making the Phulka (roti/ chapati) rec...,6,,40.0,"[1 cup Whole Wheat Flour, 1/2 teaspoon Salt, o..."


In [None]:
cleaned_and_translated_df = translate_dataframe(cleaned_df)

# 4. Check a translated row to confirm it worked
# The translated columns exist only on cleaned_and_translated_df, not on cleaned_df
print("\nChecking a sample of a translated row:")
display(cleaned_and_translated_df.iloc[7495][['ingredients', 'ingredients_en', 'instructions', 'instructions_en']])

Translating 'ingredients' column...


# Summary of Data Preparation and Next Steps

## Summary

We have successfully cleaned and prepared the Indian food dataset. The key accomplishments include:

*   **Consolidated Cleaning**: All data cleaning logic has been organized into a single, reusable function (`clean_and_process_data`).
*   **Time Standardization**: The `prep_time` and `cook_time` columns were parsed into a consistent numerical format (minutes), and a `total_time` column was created.
*   **Ingredient Lists**: The `ingredients` column was split into an `ingredient_list` column, and `n_ingredients` was recomputed from it.
*   **Language Translation**: The `ingredients` and `instructions` columns, which contained mixed languages, are translated into English by `translate_dataframe` and stored in new `ingredients_en` and `instructions_en` columns.

The final, clean dataset is stored in the `cleaned_and_translated_df` DataFrame, ready for exploratory data analysis (EDA) and visualization.
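
Because `translate_text` returns the original string when an API call fails, it can be worth counting how many entries in the translated columns are still non-ASCII before saving. This is a minimal sketch on hypothetical data; `count_non_ascii` is an illustrative helper, not part of the notebook.

```python
import pandas as pd


def count_non_ascii(series: pd.Series) -> int:
    """Count string entries that still contain non-ASCII characters."""
    flags = series.apply(lambda v: isinstance(v, str) and not v.isascii())
    return int(flags.sum())


# Hypothetical before/after columns for illustration only.
toy = pd.DataFrame({
    "ingredients": ["2 कप चावल", "1 cup rice"],
    "ingredients_en": ["2 cups rice", "1 cup rice"],
})

print("original:", count_non_ascii(toy["ingredients"]))
print("translated:", count_non_ascii(toy["ingredients_en"]))
```

A nonzero count on a translated column points to rows where translation failed or where the output legitimately contains accented characters, so those rows deserve a manual look.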

---

In [None]:
cleaned_and_translated_df

In [None]:
# Save the cleaned DataFrame to a new CSV file in the 'data' directory
# index=False prevents pandas from writing the DataFrame index as a column
cleaned_and_translated_df.to_csv('data/cleaned_indian_food.csv', index=False)

print("Cleaned data has been saved to 'data/cleaned_indian_food.csv'")
print("Ready for visualization next time!")