## Indian Food Exploration

In this short analysis, I explore the Indian cuisines, tastes, ingredients and flavors of the various dishes provided in this dataset.

## Reading the dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from collections import Counter
%matplotlib inline

In [None]:
data=pd.read_csv("../input/indian-food-101/indian_food.csv")

In [None]:
data.head()

## Data Cleaning and handling missing values

In [None]:
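## Check for missing values.
## The data description says that missing values are indicated with -1.
## Cast to str so that both the string '-1' and the integer -1 are counted.
for c in data.columns:
    print(f'''Total missing values in column {c} is {(data[c].astype(str)=='-1').sum()}''')

flavor_profile, state and region have -1 in them, and so do the numeric prep_time and cook_time columns, which are handled further below. Are there any other missing values?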

In [None]:
data.isna().sum()

Region has one blank. Let's check it and see if we can impute it ourselves.

In [None]:
data[data['region'].isna()]

The state is given as Uttar Pradesh, so we can set the region to North.

In [None]:
data.loc[data['region'].isna(),'region']='North'

Let's see if we can impute missing values for the other rows as well (rows with -1).

In [None]:
data.loc[data['region']=='-1',]

For simplicity, let us impute 'All region' for these recipes.

In [None]:
data.loc[data['region']=='-1','region']='All region'

Similarly for the state column:

In [None]:
##state column,
data.loc[data['state']=='-1']

In [None]:
data.loc[data['state']=='-1','state']='All States'

Let's check the flavor profile:

In [None]:
data.loc[data['flavor_profile']=='-1',]

Based on the ingredients used for each dish and on my own taste of it, I have tried to map the flavor_profile for the missing dishes. Please let me know in the comments if I have mapped any flavors wrongly.

In [None]:
flavor_dict={'Chapati':'sweet',
'Naan':'sweet',
'Rongi':'sweet',
'Kanji':'sweet',
'Pachadi':'sweet',
'Paniyaram':'sweet',
'Paruppu sadam':'sour',
'Puli sadam':'sour',
'Puttu':'sweet',
'Sandige':'sweet',
'Sevai':'sweet',
'Thayir sadam':'sour',
'Theeyal':'spicy',
'Bhakri':'sweet',
'Copra paak':'sweet',
'Dahi vada':'sweet',
'Dalithoy':'spicy',
'Kansar':'sweet',
'Farsi Puri':'spicy',
'Khar':'sweet',
'Luchi':'sweet',
'Bengena Pitika':'sweet',
'Bilahi Maas':'sweet',
'Black rice':'sour',
'Brown Rice':'sweet',
'Chingri Bhape':'sweet',
'Pakhala':'spicy',
'Pani Pitha':'sweet',
'Red Rice':'spicy'}

In [None]:
## Using a loop to change the values. There is probably a better way to do this!
for c in data.loc[data['flavor_profile']=='-1',['name','flavor_profile']]['name']:
    print(f'Assigning flavor profile for {c}')
    data.loc[data['name']==c,'flavor_profile']=flavor_dict[c]
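
The loop above can probably be replaced with a single vectorized assignment. A minimal sketch on a made-up toy frame, assuming each dish name appears once; `toy` and `toy_flavors` are hypothetical stand-ins for `data` and `flavor_dict`:

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    'name': ['Naan', 'Theeyal', 'Jalebi'],
    'flavor_profile': ['-1', '-1', 'sweet'],
})
# Toy stand-in for flavor_dict
toy_flavors = {'Naan': 'sweet', 'Theeyal': 'spicy'}

# Select only the rows whose flavor is missing, then look each name up in the dict
missing = toy['flavor_profile'] == '-1'
toy.loc[missing, 'flavor_profile'] = toy.loc[missing, 'name'].map(toy_flavors)

print(toy)
```

One caveat: a name that is missing from the dictionary becomes NaN with `map`, whereas the loop would raise a `KeyError`.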

In [None]:
## Now let's check again (cast to str so the integer -1 in numeric columns is counted too)
for c in data.columns:
    print(f'''Total missing values in column {c} is {(data[c].astype(str)=='-1').sum()}''')

Now we have to impute prep time and cook time. For simplicity, let us impute a default value based on whether the dish is vegetarian or non-vegetarian. For a vegetarian dish I assume a default of 10 minutes for both prep and cook time, and for a non-vegetarian dish I assume 20 minutes.

In [None]:
## Prep time and Cook time,
data.loc[(data['prep_time']==-1) & (data['diet']=='vegetarian'),'prep_time']=10
data.loc[(data['cook_time']==-1) & (data['diet']=='vegetarian'),'cook_time']=10
data.loc[(data['prep_time']==-1) & (data['diet']=='non vegetarian'),'prep_time']=20
data.loc[(data['cook_time']==-1) & (data['diet']=='non vegetarian'),'cook_time']=20
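
The four assignments above follow one pattern, so they could be written as a small loop over a dictionary of defaults. A minimal sketch on a made-up toy frame; `toy` and `defaults` are hypothetical names, and the values 10 and 20 are the same assumed defaults as above:

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    'diet': ['vegetarian', 'non vegetarian', 'vegetarian'],
    'prep_time': [-1, -1, 15],
    'cook_time': [30, -1, -1],
})

# Assumed default minutes per diet, as in the text above
defaults = {'vegetarian': 10, 'non vegetarian': 20}

for col in ['prep_time', 'cook_time']:
    missing = toy[col] == -1
    # Look up the default for each missing row from its diet
    toy.loc[missing, col] = toy.loc[missing, 'diet'].map(defaults)

print(toy)
```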

Now that we have handled the missing values, let's begin our analysis. First, let us get a bird's-eye view of all the columns.

### Data Analysis

In [None]:
##How many dishes ?
print(f'''There are {data['name'].nunique()} dishes ''')

In [None]:
(data['diet'].value_counts()/data['name'].nunique())*100

88 % of the dishes are vegetarian, whereas 11 % are non-vegetarian.
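
The percentage shares in this section can also be computed with `value_counts(normalize=True)`. A minimal sketch on made-up data (`toy` is hypothetical); it gives the same result as dividing by `nunique()` only if every dish name is unique and the column has no NaN values:

```python
import pandas as pd

# Toy stand-in: three vegetarian dishes and one non-vegetarian dish
toy = pd.DataFrame({'diet': ['vegetarian'] * 3 + ['non vegetarian']})

# normalize=True returns fractions that sum to 1; scale to percentages
share = toy['diet'].value_counts(normalize=True) * 100
print(share)
```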

In [None]:
(data['course'].value_counts()/data['name'].nunique())*100

50 % of the dishes are main course, whereas 33 % are desserts.

In [None]:
(data['flavor_profile'].value_counts()/data['name'].nunique())*100

More than 50 % are spicy dishes, whereas 42 % are sweet.

### Prep Time vs cook time

* Prep time is the time taken to prepare the ingredients prior to cooking, such as mixing, washing and stirring.

* Cook time is the actual time taken for the dish to cook.

Here it is assumed that the prep time and cook time are given in minutes (or is it in seconds? Let's find out).

In [None]:
## sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
sns.histplot(data['prep_time'],color='red',kde=True)
plt.title('Preparation time distribution')
plt.xlabel('Preparation Time')
plt.ylabel('Frequency')
plt.subplot(1,2,2)
sns.histplot(data['cook_time'],color='blue',kde=True)
plt.title('Cooking time distribution')
plt.xlabel('Cooking Time')
plt.ylabel('Frequency')

* Most of the dishes have a preparation time of less than 100 minutes (~1.7 hrs). There are a few dishes for which the preparation takes more than 500 minutes (~8 hrs).
* The peak in cooking time is between 0 and 100 minutes, and there is a slight peak near 700 minutes (~12 hrs).

This histogram does not show the difference between vegetarian and non-vegetarian dishes. Also, we should consider the total time taken for the recipe.

In [None]:
data['total_time']=data['prep_time']+data['cook_time']

In [None]:
plt.figure(figsize=(8,8))
sns.boxplot(x='diet',y='total_time',data=data,palette=sns.color_palette('colorblind'))
plt.title('Total Time taken for a dish by diet',fontsize=15)
plt.xlabel('Diet preference',fontsize=8)
plt.ylabel('Total Time',fontsize=8)

The total time taken for vegetarian dishes appears higher than for non-vegetarian dishes. But this can't be generalized, since ~88 % of the data consists of vegetarian dishes and we can't draw conclusions from the ~11 % that are non-vegetarian.

In [None]:
## Dishes with total time >= 400 minutes:
data.loc[data['total_time']>=400,]

Hmm, our assumption that the prep time and cook time are in minutes seems to be wrong, because I don't think the preparation of Dosa or Idli takes more than 5 hrs. More clarity is required on these two columns in order to investigate further.

### Ingredients

In [None]:
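## total ingredients required:
## note: x.split() splits on whitespace, so this counts distinct words, a rough proxy for the number of ingredients
data['total_ingredients']=data['ingredients'].apply(lambda x:len(set(x.split())))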
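
Since `x.split()` with no argument splits on whitespace, the cell above counts distinct words rather than distinct ingredients. A minimal sketch of a comma-based count, assuming the ingredients are comma-separated (as `ingre_count` further below also assumes); `count_ingredients` and the example string are hypothetical:

```python
def count_ingredients(ingredients):
    """Count distinct comma-separated ingredients, ignoring case and extra spaces."""
    items = {part.strip().lower() for part in ingredients.split(',') if part.strip()}
    return len(items)

# Made-up ingredient string for illustration
example = 'Flour, ghee, kewra, milk, clarified butter, sugar'

print(count_ingredients(example))   # comma-based count
print(len(set(example.split())))    # word-based count, as in the cell above
```

It could be applied with `data['ingredients'].apply(count_ingredients)`. The figures reported in this section come from the word-based count, so they would change under this definition.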

In [None]:
data['total_ingredients'].describe()

The maximum number of ingredients required for a dish in this dataset is 12 and the minimum is 2. Let's check the dishes.

In [None]:
data.loc[data['total_ingredients']==12,]

In [None]:
data.loc[data['total_ingredients']==2,]

* 2 dishes require the maximum number of ingredients: Ghevar, a Rajasthani dessert, and Mysore Pak, another dessert from Karnataka.

* 6 dishes, 5 desserts and 1 main course, require only 2 ingredients.

It is interesting to note that the dishes requiring the most as well as the fewest ingredients are all desserts, except for 1 dish.

Let's check the total ingredients by course, flavor_profile and diet.

In [None]:
plt.figure(figsize=(18,8))
plt.subplot(1,3,1)
sns.boxplot(x='course',y='total_ingredients',data=data,palette=sns.color_palette('colorblind'))
plt.title('Course vs total ingredients',fontsize=15)
plt.xlabel('Course',fontsize=8)
plt.ylabel('Total Ingredients',fontsize=8)
plt.subplot(1,3,2)
sns.boxplot(x='flavor_profile',y='total_ingredients',data=data,palette=sns.color_palette('colorblind'))
plt.title('Flavor Profile vs total ingredients',fontsize=15)
plt.xlabel('Flavor Profile',fontsize=8)
plt.ylabel('Total Ingredients',fontsize=8)
plt.subplot(1,3,3)
sns.boxplot(x='diet',y='total_ingredients',data=data,palette=sns.color_palette('colorblind'))
plt.title('Diet vs total ingredients',fontsize=15)
plt.xlabel('Diet',fontsize=8)
plt.ylabel('Total Ingredients',fontsize=8)

* Though main course dishes make up 50 % of the total, the maximum number of ingredients required for one is 11.

* Among the dishes in this dataset, the median number of ingredients for starters is higher than for other courses.

* Though the maximum number of ingredients required to prepare a dessert is 12, the median is lower than for other courses. Similarly, for sweet dishes the median is lower than for other flavors, whereas the maximum is the highest.

### Most common ingredients

Let's create a user-defined function to count the ingredients, so that we can apply it by flavor profile and diet.

In [None]:
def ingre_count(d):
    foo=list(d['ingredients'].apply(lambda x:[i.strip() for i in x.split(',')]))
    return Counter(i for j in foo for i  in j).most_common(5)

What are the top 5 ingredients in Indian cuisine?

In [None]:
## top 5 ingredients in Indian cuisine (ingre_count returns the 5 most common)
ingre_count(data)

What are the top 5 ingredients for preparing vegetarian and non-vegetarian dishes ?

In [None]:
## top 5 in vegetarian dishes,
ingre_count(data.loc[data['diet']=='vegetarian',])

In [None]:
## top 5 in non-vegetarian dishes,
ingre_count(data.loc[data['diet']=='non vegetarian',])

43 vegetarian dishes in this dataset are prepared using sugar and ghee, whereas the non-vegetarian dishes, which make up 11 % of the dataset, have mustard oil and ginger as common ingredients.

In [None]:
### top 5 in spicy dishes
ingre_count(data.loc[data['flavor_profile']=='spicy',])

In [None]:
### top 5 in sweet dishes
ingre_count(data.loc[data['flavor_profile']=='sweet',])

In [None]:
### top 5 in sour dishes
ingre_count(data.loc[data['flavor_profile']=='sour',])

In [None]:
### top 5 in bitter dishes
ingre_count(data.loc[data['flavor_profile']=='bitter',])

The top ingredients by flavor profile are not surprising, since they are the items that give each dish its flavor.

## Conclusion

In this short analysis, we have tried to explore different aspects of Indian dishes.

* There were 255 dishes available, of which 88 % were vegetarian, ~50 % were main course and ~54 % had a spicy flavor.

* The total ingredients required ranged from 2 to 12. Ghevar and Mysore Pak, both sweet desserts, required the most ingredients, whereas another 5 desserts required only 2 ingredients.

* The most common ingredients among the 255 dishes were sugar, ginger, garam masala and ghee. Sugar and ghee were used more in vegetarian dishes, while mustard oil was common in non-vegetarian dishes. Since non-vegetarian dishes make up only 11 % of the data, we could not draw strong conclusions about them.


## What can be done further ?

* We noted in our analysis that the preparation and cooking times were high for certain dishes which would not normally take that many hours. After getting more clarity on those columns, they could be analysed further.

* The ingredients column can be cleaned, since a quick look at it indicates that a few items could be grouped. For example, Urad dal and urad dal are counted as two separate items, whereas we know they are the same.
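
One possible first step for the ingredient clean-up suggested above is to normalize case and whitespace before counting. A minimal sketch, assuming comma-separated ingredient strings; `normalized_ingredient_counts` and the sample rows are hypothetical:

```python
from collections import Counter

def normalized_ingredient_counts(ingredient_strings):
    """Count ingredients after lowercasing and collapsing repeated spaces."""
    counts = Counter()
    for text in ingredient_strings:
        for item in text.split(','):
            # 'Urad  Dal ' and 'urad dal' both become 'urad dal'
            key = ' '.join(item.lower().split())
            if key:
                counts[key] += 1
    return counts

# Made-up rows for illustration
rows = ['Urad dal, rice', 'urad dal, ghee', 'Rice flour,  Urad  Dal']
counts = normalized_ingredient_counts(rows)
print(counts.most_common(5))
```

This only merges spelling variants that differ in case or spacing. Synonyms or alternate spellings would still need a manual mapping.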