## **<center> EDA on Indian Food 101 using plotly and geopandas </center>**

![North%20India%20Food%20Collage.jpg?](https://www.sodhatravel.com/hs-fs/hubfs/North%20India%20Food%20Collage.jpg?width=5120&name=North%20India%20Food%20Collage.jpg)

### Table of Contents
* [Reading Dataset](#chapter1)
* [Replacing the missing value](#chapter2)
* [Vegetarian & Non Vegetarian](#chapter3)
* [Types of Flavour](#chapter4)
* [Distribution of items based on courses](#chapter5)
* [Region wise distribution of items](#chapter6)
* [Course type and Region distribution](#chapter7)
* [Distribution of Desserts across states](#chapter8)
* [Top 10 sweets in West Bengal based on cooking time](#chapter9)
* [Common ingredients within desserts](#chapter10)
* [Distribution of Main Course items across states](#chapter11)
* [Top 10 main course items in Punjab based on cooking time](#chapter12)


In [None]:
import warnings

warnings.filterwarnings('ignore')

In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import init_notebook_mode, iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
import pandas_profiling
import geopandas as gpd
import descartes
from wordcloud import WordCloud , ImageColorGenerator

### Reading the data set <a class="anchor" id="chapter1"></a>

In [None]:
# Read the indian_food.csv

food = pd.read_csv('../input/indian-food-101/indian_food.csv')
food.head()

In [None]:
# Inspecting the dataset
food.info()

**Replacing the missing value** <a class="anchor" id="chapter2"></a>

In [None]:
# Replacing the placeholder -1 with NaN
food = food.replace('-1', np.nan)
food.cook_time = food.cook_time.replace(-1,np.nan)
food.prep_time = food.prep_time.replace(-1,np.nan)

# Checking the %missing values in each column
null_percent = (food.isnull().sum()/len(food))*100
null_percent

**Observation**

The data type of each column is correct. However, there are missing values in `prep_time`, `cook_time`, `flavor_profile`, `state` and `region`. These could be imputed, but for now the analysis proceeds without filling them.
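If imputation were wanted later, one simple option is sketched below. It is only an illustration on a toy frame: the column names match the dataset, but the values are made up, and filling with the median and the mode is an assumption rather than the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `food`; the values are illustrative only.
toy = pd.DataFrame({
    "prep_time": [10.0, np.nan, 20.0, 30.0],
    "cook_time": [25.0, 40.0, np.nan, 60.0],
    "flavor_profile": ["spicy", np.nan, "sweet", "spicy"],
})

# Numeric columns: fill gaps with the column median, which is robust to outliers.
for col in ["prep_time", "cook_time"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Categorical column: fill gaps with the most frequent value.
most_common_flavor = toy["flavor_profile"].mode()[0]
toy["flavor_profile"] = toy["flavor_profile"].fillna(most_common_flavor)

print(toy)
```

Whether a median or mode fill is appropriate depends on why the values are missing, so this should be treated as a starting point only.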

In [None]:
# Generating a profile report to get a better sense of the data
food.profile_report()

**Univariate Analysis**

In [None]:
food.columns

In [None]:
# Function to plot pie charts for the categorical variables
def pltpie(var,t,c):
    temp = food[var].value_counts()
    df = pd.DataFrame({'labels': temp.index,'values': temp.values})
    df.iplot(kind='pie',labels='labels',values='values', title=t,colors=c) 

**Composition of Veg & Non Vegetarian** <a class="anchor" id="chapter3"></a>

In [None]:
# % of Vegetarian and Non Vegetarian items
pltpie('diet','Composition of type of food item',['#75e757', '#ea7c96'])

**Insight**

The majority of the food items are vegetarian, which is in line with the fact that Indians make up more than 70% of the world's vegetarian population.

**Composition of flavour of food items** <a class="anchor" id="chapter4"></a>

In [None]:
# Checking the composition of flavour of food item
pltpie('flavor_profile','Composition of flavour of food item',['#75efff','#7e7e7e','#75e757', '#ea7c96'])

**Insight**

More than 50% of the items are spicy, while only 2% of the dishes are sour or bitter (no wonder India is called the Land of Spices).

**Composition of courses of food items** <a class="anchor" id="chapter5"></a>

In [None]:
# Checking the composition of courses of food

pltpie('course','Composition of courses of food items',None)


**Insight**

83% of the items are main course or dessert items


In [None]:
# Checking the cook time
fig = px.box(food, y="cook_time",title='Distribution of cooking time',width=400, height=400)
fig.show()


**Insight**

One dish has a cooking time of more than 600 minutes, which appears to be an outlier.

In [None]:
food[food.cook_time>600]

In [None]:
# This record is an outlier - removing it from dataframe

food = food[~(food.cook_time>600)]
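The 600-minute cut-off above is a manual choice read off the box plot. A more general alternative is the 1.5 × IQR rule, sketched below on made-up cooking times. The helper `iqr_outliers` is hypothetical and not part of the notebook, and on the real data this rule may flag more rows than the single record removed above.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN evaluate to False, so missing values are never flagged.
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative cooking times in minutes (not taken from the dataset).
cook_time = pd.Series([20, 30, 35, 40, 45, 60, 720])
mask = iqr_outliers(cook_time)

print(cook_time[mask])
```

Applied to the notebook's data, the equivalent filter would be `food[~iqr_outliers(food.cook_time)]`, which keeps rows with a missing `cook_time` just as the filter above does.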

In [None]:
# Re-checking the cook time after removing the outlier
fig = px.box(food, y="cook_time",title='Distribution of cooking time',width=400, height=400)
fig.show()


**Insight**

The majority of food items take less than 50 minutes to cook.

**Region wise distribution of cuisines** <a class="anchor" id="chapter6"></a>

In [None]:
# Regional distribution of the items 

temp = food['region'].value_counts(normalize=True)*100
temp.iplot(kind='bar',xTitle='Regions',yTitle='Percent of food items',title="Region wise distribution of cuisines") 

**Insight**

54% of the food items are from the West and South regions.

**Bivariate Analysis**

In [None]:
food.columns

**Region wise distribution of courses of food items** <a class="anchor" id="chapter7"></a>

In [None]:
# Composition of courses in each region
table = pd.pivot_table(food, values ='name', index ='region',columns ='course' ,aggfunc = 'count',fill_value=0).reset_index()
fig = px.bar(table, x="region", y=["dessert", "main course", "snack","starter"], title="Composition of courses in each region")
fig.show()

**Insights**

1. Desserts predominantly originate from, or are famous in, states in the West and East regions.

2. Most of the dishes from the North region are eaten as a main course.

3. Most of the snacks originate from, or are famous in, the West and South regions.
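Because the regions contribute different numbers of dishes, raw counts can hide the mix within a region. A minimal sketch of row-normalised shares using `pd.crosstab` follows. The toy rows are invented for illustration, and only the column names `region` and `course` come from the dataset.

```python
import pandas as pd

# Toy rows standing in for `food`; the values are illustrative only.
toy = pd.DataFrame({
    "region": ["North", "North", "North", "West", "West", "East"],
    "course": ["main course", "main course", "dessert", "snack", "dessert", "dessert"],
})

# normalize="index" divides each row by its total, so every region sums to 100%.
share = pd.crosstab(toy["region"], toy["course"], normalize="index") * 100

print(share.round(1))
```

On the real data, `pd.crosstab(food["region"], food["course"], normalize="index")` would give the share of each course within a region, which makes statements such as insight 2 directly checkable.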

In [None]:
# Composition of flavours within each course
table_1 = pd.pivot_table(food, values ='name', index ='course',columns ='flavor_profile' ,aggfunc = 'count',fill_value=0).reset_index()
fig = px.bar(table_1, x="course", y=["spicy", "sweet", "bitter","sour"], title="Composition of flavour within courses")
fig.show()

In [None]:
food[(food.course == 'main course') & (food.flavor_profile =='sweet')]

In [None]:
food[(food.course == 'snack') & (food.flavor_profile =='bitter')]

**Visualizing the geographical distribution of desserts**

In [None]:
# Creating the pivot table
state_course = pd.pivot_table(food, values ='name', index ='state',columns ='course' ,aggfunc = 'count',fill_value=0).reset_index()
state_course.head()

In [None]:
# Creating a summary table of dessert counts per state
state_desserts = state_course.loc[:, ['state', 'dessert']]

In [None]:
fp = "../input/state-boundaries/StateBoundary.shp"
map_df = gpd.read_file(fp)

In [None]:
# Uppercasing the state names to merge with the GeoPandas dataframe
state_course['state_name'] = state_course.state.apply(lambda x : x.upper())
state_course.head()

In [None]:
map_df.head()

In [None]:
# Merging with the GeoPandas dataframe
merged = map_df.set_index('state').join(state_course.set_index('state_name'))
merged.head()
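A join on state names silently drops or blanks any state whose spelling differs between the two sources. One way to check for this is an outer merge with `indicator=True`, sketched below with plain pandas frames. The state names and counts are invented, and the `ODISHA`/`ORISSA` mismatch is a hypothetical example, not something confirmed in these files.

```python
import pandas as pd

# Stand-ins for the shapefile attributes and the per-state counts (illustrative values).
shapes = pd.DataFrame({"state": ["PUNJAB", "WEST BENGAL", "ODISHA"]})
counts = pd.DataFrame({
    "state_name": ["PUNJAB", "WEST BENGAL", "ORISSA"],
    "dessert": [2, 24, 7],
})

# An outer merge keeps rows from both sides; `_merge` records where each row came from.
check = shapes.merge(
    counts, left_on="state", right_on="state_name", how="outer", indicator=True
)

# Rows that did not find a partner on the other side.
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

Any `left_only` rows would show up as blank states on the map, and any `right_only` rows are dishes whose state never reaches the map at all.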

**State wise distribution of Desserts** <a class="anchor" id="chapter8"></a>

In [None]:
# Plotting the geomap
fig, ax = plt.subplots(1, figsize=(15, 10))
ax.axis('off')
ax.set_title('State Wise Distribution of Desserts', fontdict={'fontsize': '25', 'fontweight' : '3'})
merged.plot(column='dessert', cmap='YlOrRd', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)
plt.show()

**Insight**

West Bengal and Maharashtra are the states where most of the desserts are concentrated.

**Top 10 desserts in West Bengal based on cooking time** <a class="anchor" id="chapter9"></a>

In [None]:
# Top 10 desserts in West Bengal based on cooking time 

food_west_bengal = food[(food['state']=='West Bengal') & (food['course']=='dessert')].sort_values('cook_time',ascending = False).head(10)
top_10 = food_west_bengal.loc[:,['name','cook_time']]
top_10.set_index('name',inplace=True)
top_10.iplot(kind='bar',xTitle='Desserts',yTitle='Cooking Time',title="Top 10 desserts in West Bengal based on Cooking time") 


**Insights**

Rasgulla, a very famous sweet from West Bengal, surprisingly takes the longest to cook compared to the other sweets.

*Let's see which ingredients are commonly used in desserts from West Bengal & Maharashtra* <a class="anchor" id="chapter10"></a>

In [None]:
# Creating word cloud

dessert_df  = food[(food['course']=='dessert') & (food['state'].isin(['Maharashtra','West Bengal'])) ].reset_index()

# Joining the ingredient strings of all selected desserts into one text
text = ' '.join(dessert_df['ingredients'])

wordcloud = WordCloud(width = 400, height = 400, colormap = 'seismic'
                      ,background_color ='white', 
                min_font_size = 8).generate(text)                  
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis('off') 
plt.show()

**Insights**

The three most common ingredients are milk, flour and sugar.
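A word cloud splits multi-word ingredients into separate words and only shows relative size. An exact tally can be obtained with `collections.Counter`, as in the sketch below. The ingredient strings are invented examples in the same comma-separated format as the `ingredients` column.

```python
from collections import Counter

# Illustrative ingredient strings in the dataset's comma-separated format.
ingredient_lists = [
    "Milk, sugar, saffron",
    "Chhena, sugar, ghee",
    "Maida flour, milk, sugar",
]

# Split on commas, normalise case and whitespace, and count whole ingredients.
counts = Counter(
    item.strip().lower()
    for row in ingredient_lists
    for item in row.split(",")
    if item.strip()
)

top = counts.most_common(2)
print(top)
```

On the notebook's data, passing `dessert_df['ingredients']` in place of `ingredient_lists` would give the actual counts behind the word cloud.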

**Visualizing the geographical distribution of main course** <a class="anchor" id="chapter11"></a>

In [None]:
# Plotting the geomap of main course dishes per state
fig, ax = plt.subplots(1, figsize=(15, 10))
ax.axis('off')
ax.set_title('State Wise Distribution of Main Course', fontdict={'fontsize': '25', 'fontweight' : '3'})
merged.plot(column='main course', cmap='YlOrRd', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)
plt.show()

**Insights**

1. Most of the main course dishes originated in Punjab
2. Within the South region, Tamil Nadu has the highest concentration, while in the East region it is Assam
3. Gujarat and Maharashtra have nearly equal number of main course dishes

**Top 10 main course items in Punjab based on cooking time** <a class="anchor" id="chapter12"></a>

In [None]:
# Top 10 main course dishes in Punjab based on cooking time 

food_punjab = food[(food['state']=='Punjab') & (food['course']=='main course')].sort_values('cook_time',ascending = False).head(10)
top_10 = food_punjab.loc[:,['name','cook_time']]
top_10.set_index('name',inplace=True)
top_10.iplot(kind='bar',xTitle='Main Course',yTitle='Cooking Time',title="Top 10 main course in Punjab based on Cooking time") 


**Insights**

Pindi chana requires 2 hours of cooking time, which is more than double the time required for the other dishes.