# Goal

I want to find the most frequently used spices in Indian cuisine, so that newcomers to Indian cooking can buy the spices they will use most often and not spend money on "niche" spices that can be bought later. 

Ideally, my analysis will give you the most "bang for your buck" spices.

# Cleaning the data

Let's clean the data first: we need to drop the columns that are not needed for this goal and drop the cuisines that are not strictly Indian.

Then, let's look at how many recipes we have left.


In [None]:
import pandas as pd
import numpy as np

raw = pd.read_csv('../input/6000-indian-food-recipes-dataset/IndianFoodDatasetCSV.csv')
df = raw.copy()

columns_to_drop = ['RecipeName', 'Ingredients', 'URL', 'PrepTimeInMins' , 'CookTimeInMins',
                   'TotalTimeInMins','TranslatedInstructions', 'Instructions', 'Servings', 'Srno']

df = df.drop(columns = columns_to_drop).dropna()

# the data seems to contain more than just Indian cuisines, so I drop these
cuisines_to_drop = ['Mexican', 'Italian Recipes', 'Thai', 'Chinese', 'Asian', 'Middle Eastern', 'European',
                   'Arab', 'Japanese', 'Vietnamese', 'British', 'Greek', 'French', 'Mediterranean', 'Sri Lankan',
                   'Indonesian', 'African', 'Korean', 'American', 'Carribbean', 'World Breakfast', 'Malaysian', 'Dessert',
                   'Afghan', 'Snack', 'Jewish', 'Brunch', 'Lunch', 'Continental', 'Fusion']

# drop desserts, breakfasts, snacks and similar courses, as these are much less likely to contain spices
courses_to_drop = ['South Indian Breakfast', 'Snack', 'Appetizer', 'Indian Breakfast', 'Dessert', 'North Indian Breakfast',
                  'World Breakfast', 'Brunch', 'Side Dish']

df = df[~df['Cuisine'].isin(cuisines_to_drop)]
df = df[~df['Course'].isin(courses_to_drop)]
df.shape

Next, let's remove all rows whose ingredient lists were not properly translated from Hindi and keep only the ingredient lists that are in English.

In [None]:
# the dataset contains Hindi even in the "translated" column; drop those rows for convenience
df = df['TranslatedIngredients']

def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

#create boolean mask
mask = df.apply(isEnglish)
df = df[mask].dropna()

df.shape #I see we dropped about 350 entries.

We can see that we dropped around 350 recipes.
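
To make the filter concrete, here is a minimal sketch of the same ASCII check on a few made-up ingredient strings. The helper name `is_ascii` and the samples are hypothetical and not taken from the dataset. The last sample shows a side effect worth knowing: an English line with a single accented character is also rejected.

```python
def is_ascii(text):
    """Return True if every character in text is plain ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical sample rows, for illustration only
samples = [
    "1 teaspoon Cumin seeds (Jeera)",  # plain English
    "1 छोटा चम्मच जीरा",                 # Hindi (Devanagari script)
    "1/2 cup Crème fraîche",           # English text with accented letters
]

for sample in samples:
    print(is_ascii(sample), sample)
```

On Python 3.7 and later, the built-in `str.isascii()` gives the same answer without the try/except.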

Next, we import a list of Indian spices from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Indian_spices) and clean it. **Unfortunately, `read_html` does not work in this notebook, so I imported the table in Jupyter and created a CSV, which I then uploaded to Kaggle.**

Some spices need to be deleted because they are not strictly considered spices (such as capers or chilli pepper), and some because of anomalies in the dataset (for example, cardamom is often listed simply as "cardamom", without specifying the color).

Furthermore, spices often have several names (such as amchoor/amchur or jeera/cumin). Therefore, for each such spice I created a single string containing all of its possible names, separated by slashes.
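
One detail is easy to miss when building these slash-separated names in pandas: `Series.str.replace` rewrites substrings inside every value, while `Series.replace` only swaps values that match the search string as a whole. The sketch below shows the difference on a made-up series; the names `names` and `aliases` are hypothetical.

```python
import pandas as pd

# Hypothetical spice names, for illustration only
names = pd.Series(['ginger', 'dried ginger', 'amchoor'])
aliases = {'amchoor': 'amchur/amchoor', 'ginger': 'ginger/ginger powder'}

# Series.replace: only values equal to a dictionary key are swapped
whole_value = names.replace(aliases)

# Series.str.replace: every occurrence of the substring is rewritten
substring = names.str.replace('ginger', 'ginger/ginger powder', regex=False)

print(whole_value.tolist())
# ['ginger/ginger powder', 'dried ginger', 'amchur/amchoor']
print(substring.tolist())
# ['ginger/ginger powder', 'dried ginger/ginger powder', 'amchoor']
```

Whole-value replacement is usually the safer choice here, because a short name such as "ginger" also appears inside longer names.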

In [None]:
#I am interested in only the recipe ingredients
recipes = df

#read file of all indian spices on wikipedia
raw = pd.read_csv('../input/spices/spices.csv')
spices = raw['Standard English'].copy().str.lower()

#some important spices I also added from experience/my own cookbook
spices_to_add = pd.Series(['black mustard seed/raee', 'black salt', 'cumin powder', 'coriander leaves'])

#some spices are too common (such as pepper), are vegetables rather than spices, or are otherwise ambiguous (for example,
#cardamom is often listed as "cardamom", not specifying whether it is black or green)
spices_to_drop = ['black pepper', 'peppercorns', 'capers', 'chili pepper powder', 'cinnamon buds', 'garlic',
                  'cumin seed ground into balls', 'dried ginger', 'green chili pepper', 'indian bedellium tree',
                 'indian gooseberry', 'mango extract', 'saffron pulp', 'black cumin']

# Series.append was removed in pandas 2.0, so use pd.concat to add the extra spices
spices = pd.concat([spices[~spices.isin(spices_to_drop)], spices_to_add])

#editing the spices so that my string counter can find different versions of the same spice
spices = spices.str.replace('amchoor', 'amchur/amchoor') \
                    .replace('asafoetida', 'asafetida/asafoetida') \
                    .replace('alkanet root', 'alkanet/alkanet root') \
                    .replace('capsicum', 'CHILIPLACEHOLD') \
                    .replace('celery / radhuni seed', 'celery/radhuni seed') \
                    .replace('bay leaf, indian bay leaf', 'bay leaf/bay leaves') \
                    .replace('curry tree or sweet neem leaf', 'curry leaf/curry leaves') \
                    .replace('fenugreek leaf', 'fenugreek leaf/fenugreek leaves/kasoori methi') \
                    .replace('nigella seed', 'nigella seed/black cumin') \
                    .replace('thymol/carom seed', 'fenuthymol/carom seed') \
                    .replace('ginger', 'ginger/dried ginger/ginger powder') \
                    .replace('black cardamom', 'blackcardamom') \
                    .replace('cloves', 'laung') \
                    .replace('green cardamom', 'cardamom') 



Now, I need to iterate over my list of spices and my list of ingredients and find the counts of each spice in the whole dataset.
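
Before running this on the real data, the idea can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: `toy_recipes`, `toy_spices` and `count_spices` are hypothetical names, and the sketch assumes that a spice should be counted at most once per recipe, so that dividing by the number of recipes gives the share of recipes containing it.

```python
from collections import Counter

# Hypothetical toy data: comma-separated ingredient strings and slash-separated spice names
toy_recipes = [
    "1 teaspoon Cumin seeds (Jeera), 2 Bay leaves (tej patta), Salt",
    "1 Bay leaf (tej patta), 1 inch Ginger, 1 teaspoon Cumin seeds (Jeera)",
    "2 cups Rice, 1 teaspoon Turmeric powder (Haldi)",
]
toy_spices = ["cumin seed", "bay leaf/bay leaves", "turmeric", "ginger/dried ginger"]


def count_spices(recipes, spices):
    """Count, for each spice, the number of recipes that mention any of its names."""
    counts = Counter()
    for recipe in recipes:
        ingredients = [part.strip().lower() for part in recipe.split(",")]
        found = set()  # a spice is counted at most once per recipe
        for ingredient in ingredients:
            for spice in spices:
                if any(name in ingredient for name in spice.split("/")):
                    found.add(spice)
        counts.update(found)
    return counts


counts = count_spices(toy_recipes, toy_spices)
shares = {spice: round(n / len(toy_recipes), 2) for spice, n in counts.items()}
print(counts)
print(shares)
```

Collecting matches in a set per recipe keeps a spice that appears twice in one ingredient list, or under two of its names, from inflating the share.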

In [None]:
# build a flat list of ingredients, then record every spice found in each ingredient

# iterate over each recipe and split into individual ingredients

words = []
for recipe in recipes:
    w = recipe.split(',')
    words.append(w)

# iterate over each sublist in list of ingredients and create one list (series), set to lowercase, strip whitespace
    
ing = [item for sublist in words for item in sublist]
ing = pd.Series(ing).str.strip()
ing = ing.str.lower()

list_of_spices = []
for row in ing:
    row = row.replace('red chili powder', 'CHILIPLACEHOLD').replace('red chilli powder','CHILIPLACEHOLD') \
            .replace('chilli flakes', 'CHILIPLACEHOLD').replace('chili flakes', 'CHILIPLACEHOLD') \
            .replace('red chilli','').replace('red chili', '') \
            .replace('green chilli', '').replace('green chili', '')
    row = row.replace('black cardamom', 'blackcardamom')
    row = row.replace('coriander (dhania) seeds', 'coriander seed')
    row = row.replace('coriander (dhania) leaves', 'coriander leaves').replace('coriander -', 'coriander leaves')
    row = row.replace('ginger garlic paste', '')
    for k in spices:
        # count a spice once per ingredient, even if several of its names match
        if any(name in row for name in k.split('/')):
            list_of_spices.append(k.capitalize())
list_of_spices = pd.Series(list_of_spices)

# change back the spices for better readability and to include indian names where applicable

list_of_spices = list_of_spices.str.replace('Amchur/amchoor','Amchur (dried mango powder)') \
                                   .replace('Bay leaf/bay leaves','Bay leaves') \
                                    .replace('Chiliplacehold', 'Red chilli powder') \
                                    .replace('Asafetida/asafoetida','Asafetida (hing)') \
                                    .replace('Curry leaf/curry leaves','Curry leaves') \
                                    .replace('Fenuthymol/carom seed', 'Carom seed (ajwain)') \
                                    .replace('Fenugreek leaf/fenugreek leaves/kasoori methi','Fenugreek leaves (kasoori methi)') \
                                    .replace('Curry leaf/curry leaves','Curry leaves') \
                                    .replace('Cumin seed','Cumin seed (jeera)') \
                                    .replace('Black salt/kala namak','Black salt (kala namak)') \
                                    .replace('Nigella seed/black cumin','Nigella seed (black cumin)') \
                                    .replace('Celery/radhuni seed','Celery seed (radhuni)') \
                                    .replace('Ginger/dried ginger/ginger powder','Ginger') \
                                    .replace('Blackcardamom','Black cardamom') \
                                    .replace('Cardamom','Green cardamom') \
                                    .replace('Laung','Cloves (laung)') \
                                    .replace('Ginger','Ginger (paste or root)') \
                                    .replace('Turmeric','Turmeric (haldi)') 

# count occurrences of each spice
list_of_spices = pd.Series(list_of_spices).value_counts()

# exclude rare spices (20 or fewer occurrences in the filtered recipes)
list_of_spices = list_of_spices[list_of_spices.values > 20]

# calculate the share of recipes that include each spice
list_of_spices = pd.DataFrame(list_of_spices)
list_of_spices.columns = ['freq']
list_of_spices['perc'] = (list_of_spices['freq'] / len(recipes)).round(2)

#view
list_of_spices

# Plotting
 
Now our data is ready to be plotted!

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

#create fig, axes
fig, ax = plt.subplots(figsize=(28, 28))

#create barplot
sns.barplot(y=list_of_spices.index, x=list_of_spices['perc'], 
            data = list_of_spices, orient = 'h',  palette="Reds_r")
plt.xticks(fontsize = 35);
plt.yticks(fontsize = 35);
sns.despine()

# titles, labels
ax.set_title('Most frequently used spices in Indian cuisine', fontsize = 80, pad = 100);
ax.text(x=0.5, y=1.025, s="Sample consists of ~2000 Indian recipes from the website 'Archana's Kitchen'", 
        fontsize=40, alpha=0.75, ha='center', va='bottom', transform=ax.transAxes);
ax.set_xlabel('Percentage of recipes containing each spice', fontsize = 35, labelpad = 20);
ax.set_xticks(np.linspace(0,1,11), minor = False)

# set x axis to percentages
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0%}'.format(x))) 

# create grid for better readability
ax.grid(axis = 'x', linestyle = ':', linewidth = 3)

plt.show()
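
As a quick check of the tick formatter used in the plot, the percent format spec can be tried on its own. This small sketch needs no plotting libraries; `ticks` and `labels` are hypothetical names that mirror the eleven tick positions between 0 and 1.

```python
# Eleven evenly spaced fractions from 0.0 to 1.0, like the x ticks in the plot
ticks = [i / 10 for i in range(11)]

# '{:.0%}' multiplies by 100, rounds to zero decimals and appends a percent sign
labels = ['{:.0%}'.format(tick) for tick in ticks]

print(labels)
```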