### What is the dataset about? <br>

This dataset was provided by Yummly and is used in the Kaggle competition "<a href="https://www.kaggle.com/c/whats-cooking">What's Cooking?</a>". The training data has 3 columns (id, ingredients and cuisine) and a total of 39,774 training examples distributed across 20 cuisine categories. The data is stored in JSON format in 2 files: train.json and test.json. <br>

train.json - Train set contains recipe id, type of cuisine, and list of ingredients. <br>
test.json - Test set contains recipe id and list of ingredients.

<b><i>Attribute Explanation:</i></b>

id: Unique identifier for recipe i.e. recipe id <br>
ingredients: List of the ingredients that make up the recipe <br>
cuisine: Type of cuisine the recipe falls into (Target variable)

Link to dataset: https://www.kaggle.com/datasets/kaggle/recipe-ingredients-dataset <br>

### Aim / Problem Statement <br>

The aim is to predict the cuisine category, given the list of ingredients as input. The cuisine categories in the dataset are: <br>
'brazilian', 'british', 'cajun_creole', 'chinese', 'filipino', 'french', 'greek', 'indian', 'irish', 'italian', 'jamaican', 'japanese', 'korean', 'mexican', 'moroccan', 'russian', 'southern_us', 'spanish', 'thai', 'vietnamese' <br>

### Contents <br>
Apart from predicting the cuisine category, the Kaggle competition also poses some data wrangling questions. All of these are answered in this notebook (Part 1). <br>
The contents of the Part 1 and Part 2 notebooks are as follows:

Part 1 Contents:

1) Answering the questions from the Kaggle competition. <br>
2) Creating corresponding code snippets for the questions, and providing relevant output.

Part 2 Contents:

1) Creating a Classification model for cuisine category prediction. <br>
2) Performing pre-processing on the text, and choosing a relevant model for prediction. <br>
3) Noting future work that can be done for improving the model performance.

Let's glance through Part 1 Contents.

### Read training set and test set
- read `train.json` as train_df 
- split train_df into 80% training data and 20% for validation

In [1]:
# Importing necessary libraries

import pandas as pd
import json

In [2]:
# Reading file contents of train.json

with open("Recipe Ingredients/train.json") as train_file:
    train_json = json.load(train_file)
train_json[0]

{'id': 10259,
 'cuisine': 'greek',
 'ingredients': ['romaine lettuce',
  'black olives',
  'grape tomatoes',
  'garlic',
  'pepper',
  'purple onion',
  'seasoning',
  'garbanzo beans',
  'feta cheese crumbles']}

In [3]:
# Constructing train_df from the train.json contents

train_df = pd.DataFrame(train_json)
train_df.head()

| | id | cuisine | ingredients |
|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... |


In [4]:
# Reading file contents of test.json

with open("Recipe Ingredients/test.json") as test_file:
    test_json = json.load(test_file)
test_json[0]

{'id': 18009,
 'ingredients': ['baking powder',
  'eggs',
  'all-purpose flour',
  'raisins',
  'milk',
  'white sugar']}

In [5]:
# Constructing test_df from the test.json contents

test_df = pd.DataFrame(test_json)
test_df.head()

| | id | ingredients |
|---|---|---|
| 0 | 18009 | [baking powder, eggs, all-purpose flour, raisi... |
| 1 | 28583 | [sugar, egg yolks, corn starch, cream of tarta... |
| 2 | 41580 | [sausage links, fennel bulb, fronds, olive oil... |
| 3 | 29752 | [meat cuts, file powder, smoked sausage, okra,... |
| 4 | 35687 | [ground black pepper, salt, sausage casings, l... |
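
As an alternative to `json.load` followed by `pd.DataFrame`, pandas can parse a JSON array of records directly with `pd.read_json`. The following is a minimal sketch, assuming a file laid out like train.json; the temporary file and its two records are made up for illustration and are not part of the dataset.

```python
import json
import os
import tempfile

import pandas as pd

# Two made-up records in the same shape as train.json (illustrative only)
sample_records = [
    {"id": 1, "cuisine": "greek", "ingredients": ["feta cheese", "olives"]},
    {"id": 2, "cuisine": "indian", "ingredients": ["rice", "garam masala", "salt"]},
]

# Write the records to a temporary JSON file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample_records, tmp)
    sample_path = tmp.name

# pd.read_json turns the array of records into one row per recipe
sample_df = pd.read_json(sample_path)
os.remove(sample_path)

print(sample_df.shape)
```

The ingredient lists stay as Python lists in the `ingredients` column, just as they do with the `json.load` approach.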


#### Splitting train_df into 80% Training and 20% Validation sets

In [6]:
from sklearn.model_selection import train_test_split

X = train_df.drop('cuisine', axis=1)
Y = train_df['cuisine']
train_X, val_X, train_Y, val_Y = train_test_split(X, Y, test_size=0.2, random_state=101)

In [7]:
# Getting overview of training and validation sets shapes

print("X.shape: ",X.shape)
print("Y.shape: ",Y.shape)
print("-----------------------------")
print("train_X.shape: ",train_X.shape)
print("val_X.shape: ",val_X.shape)
print("train_Y.shape: ",train_Y.shape)
print("val_Y.shape: ",val_Y.shape)

X.shape:  (39774, 2)
Y.shape:  (39774,)
-----------------------------
train_X.shape:  (31819, 2)
val_X.shape:  (7955, 2)
train_Y.shape:  (31819,)
val_Y.shape:  (7955,)
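
The split sizes above can be checked by hand. This is a quick sketch of the arithmetic, assuming `train_test_split` rounds the validation count up for a fractional `test_size` and assigns the remaining rows to training; the variable names are illustrative.

```python
import math

n_samples = 39774   # rows in train_df, from X.shape above
test_size = 0.2     # fraction passed to train_test_split

# Validation count is rounded up; the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```

Under that assumption the result matches the shapes printed above: 31819 training rows and 7955 validation rows.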


## Calculate the average number of Ingredients for each cuisine for train_df

In [8]:
# Maintaining a list with the number of ingredients in each recipe
# The following for loop computes the number of ingredients in each recipe and stores it in a list: number_of_ingredients
number_of_ingredients = []

for ingredients in train_df["ingredients"]:
    length_ingredients = len(ingredients)
    number_of_ingredients.append(length_ingredients)

In [9]:
# Adding the newly created column contents: number_of_ingredients to train_df

train_df["number_of_ingredients"] = pd.Series(number_of_ingredients)
train_df.head()

| | id | cuisine | ingredients | number_of_ingredients |
|---|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... | 9 |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... | 11 |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... | 12 |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] | 4 |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... | 20 |


In [10]:
# Grouping recipes as per cuisine and finding out average number of ingredients

print("Average number of Ingredients for each cuisine:")
train_df.groupby("cuisine")["number_of_ingredients"].mean()

Average number of Ingredients for each cuisine:


cuisine
brazilian        9.520343
british          9.708955
cajun_creole    12.617076
chinese         11.982791
filipino        10.000000
french           9.817838
greek           10.182128
indian          12.705961
irish            9.299850
italian          9.909033
jamaican        12.214829
japanese         9.735067
korean          11.284337
mexican         10.877446
moroccan        12.909866
russian         10.224949
southern_us      9.634954
spanish         10.423660
thai            12.545809
vietnamese      12.675152
Name: number_of_ingredients, dtype: float64
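
The per-recipe count and the per-cuisine average can also be computed without an explicit loop. The following is a minimal sketch on a made-up three-recipe frame; `toy_df` and its contents are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "cuisine": ["greek", "indian", "indian"],
    "ingredients": [
        ["feta cheese", "olives"],
        ["rice", "salt", "garam masala"],
        ["water", "wheat", "salt", "vegetable oil", "cumin"],
    ],
})

# Series.apply(len) gives the length of each ingredient list
toy_df["number_of_ingredients"] = toy_df["ingredients"].apply(len)

# Mean ingredient count per cuisine
avg_per_cuisine = toy_df.groupby("cuisine")["number_of_ingredients"].mean()
print(avg_per_cuisine)
```

On the real data, the same two lines applied to `train_df` should give the same averages as the loop-based version.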

## Find top 5 most common ingredients in training set

In [11]:
# Creating a dictionary: ingredients_count that contains each ingredients' count
# The following for loop initializes the count of each new ingredient and updates the count of existing ingredients

ingredients_count = {}

for ingredients in train_df["ingredients"]:
    for item in ingredients:
        if item in ingredients_count:
            ingredients_count[item] = ingredients_count[item] + 1
        else:
            ingredients_count[item] = 1

In [12]:
# Sorting the ingredients as per their counts in reverse order, so that we can get top 5 most common ingredients

sortedCount = sorted(ingredients_count.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)[:5]
sortedCount

[('salt', 18049),
 ('onions', 7972),
 ('olive oil', 7972),
 ('water', 7457),
 ('garlic', 7380)]

In [13]:
# Creating a dataframe and displaying the 5 most common ingredients with their total counts

print("Top 5 most common ingredients and their value counts:")
df = pd.DataFrame(sortedCount, columns=["Ingredient","Count"])
df

Top 5 most common ingredients and their value counts:


| | Ingredient | Count |
|---|---|---|
| 0 | salt | 18049 |
| 1 | onions | 7972 |
| 2 | olive oil | 7972 |
| 3 | water | 7457 |
| 4 | garlic | 7380 |
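
`collections.Counter` from the standard library offers a shorter way to build the same kind of counts. The following is a minimal sketch on made-up recipes (`toy_recipes` is an assumption for illustration). Note that `most_common` orders ties by first occurrence, whereas the sort above breaks ties by ingredient name in reverse order, so tied entries may come out in a different order.

```python
from collections import Counter

toy_recipes = [
    ["salt", "garlic", "water"],
    ["salt", "onions"],
    ["salt", "garlic", "olive oil"],
]

# Count every ingredient across all recipes in a single pass
toy_counts = Counter(item for recipe in toy_recipes for item in recipe)

# The two most frequent ingredients with their counts
top_two = toy_counts.most_common(2)
print(top_two)
```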


## Calculate the most common ingredient in each cuisine (excluding 'salt','onions','olive oil')

In [14]:
# Creating a dictionary: ingredients_count_in_cuisines which stores each cuisine category
# The cuisine category is mapped to another dictionary corresponding to ingredients in the cuisine with their total counts
# Following for loop adds new cuisines to the dictionary, with the ingredients and their value counts
# It also updates the value counts of ingredients for specific cuisine
# ingredients_to_exclude is a set which is used to exclude certain ingredients and their value counts in each cuisine
# As it is mentioned in the question to exclude ingredients: "salt","onions","olive oil", we ignore these ingredients

ingredients_count_in_cuisines = {}
# The above dictionary has the form: {indian: {tomato:2, potato:3}}

ingredients_to_exclude = {"salt","onions","olive oil"}

for row_index in train_df.index:
    
    cuisine = train_df["cuisine"][row_index]
    ingredients = train_df["ingredients"][row_index]
    
    if cuisine in ingredients_count_in_cuisines:
        ingredients_count = ingredients_count_in_cuisines[cuisine]
      
        for item in ingredients:
            if item not in ingredients_to_exclude:
                if item in ingredients_count:
                    ingredients_count[item] = ingredients_count[item] + 1
                else:
                    ingredients_count[item] = 1
        ingredients_count_in_cuisines[cuisine] = ingredients_count
        
    else:
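        # First recipe of this cuisine: start each count at 1 and skip excluded ingredients
        ingredients_count = {item: 1 for item in ingredients if item not in ingredients_to_exclude}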
        ingredients_count_in_cuisines[cuisine] = ingredients_count

In [15]:
# Maintaining a dictionary: most_common_in_cuisine, which stores cuisine and the most common ingredient
# For this, we have utilized the dictionary created in previous code snippet: ingredients_count_in_cuisines
# We have sorted the value counts of ingredients for each cuisine in reverse order, and got the most common ingredient
# Later we have stored the final result in a dataframe and displayed the same

most_common_in_cuisine={}
print("Most Common Ingredient in each Cuisine:")

for cuisine in ingredients_count_in_cuisines.keys():
    ingredients = ingredients_count_in_cuisines[cuisine]
    most_common_ingredient = sorted(ingredients.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)[0]
    most_common_in_cuisine[cuisine] = most_common_ingredient[0]

df=pd.DataFrame(most_common_in_cuisine.items(), columns=["Cuisine","Most Common Ingredient"])
df

Most Common Ingredient in each Cuisine:


| | Cuisine | Most Common Ingredient |
|---|---|---|
| 0 | greek | dried oregano |
| 1 | southern_us | butter |
| 2 | filipino | garlic |
| 3 | indian | garam masala |
| 4 | jamaican | water |
| 5 | spanish | garlic cloves |
| 6 | italian | garlic cloves |
| 7 | mexican | ground cumin |
| 8 | chinese | soy sauce |
| 9 | british | all-purpose flour |
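
The nested-dictionary bookkeeping can also be expressed with `defaultdict(Counter)`, which removes the need for a separate branch when a cuisine is seen for the first time. The following is a minimal sketch on made-up rows; `toy_rows` and `excluded` are assumptions for illustration.

```python
from collections import Counter, defaultdict

toy_rows = [
    ("greek", ["salt", "feta cheese", "olive oil"]),
    ("greek", ["feta cheese", "dried oregano"]),
    ("indian", ["salt", "garam masala", "onions"]),
    ("indian", ["garam masala", "water"]),
]
excluded = {"salt", "onions", "olive oil"}

# Each cuisine maps to its own Counter, created on first access
counts_by_cuisine = defaultdict(Counter)
for cuisine, ingredients in toy_rows:
    counts_by_cuisine[cuisine].update(
        item for item in ingredients if item not in excluded
    )

# Pick the single most frequent remaining ingredient per cuisine
top_by_cuisine = {
    cuisine: counts.most_common(1)[0][0]
    for cuisine, counts in counts_by_cuisine.items()
}
print(top_by_cuisine)
```

With this structure every recipe, including the first one of each cuisine, goes through the same counting path.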


## Create a classification model to classify the cuisine
- what are some models that you can think of?
- what are the tradeoffs?

This question is addressed in Part 2 of this notebook: Recipe Classification using NLP Part 2. <br>
Click <a href="https://github.com/shalaka-thorat/Recipe-Classification-using-NLP/blob/main/Recipe%20Classification%20using%20NLP%20Part%202.ipynb">HERE</a> to go to Part 2 of the Notebook.