# **Data Wrangling**
In this notebook we dummify the dataset (one indicator column per ingredient) and begin manipulating it, with the goal of reducing the total number of attributes.

### Load Train Dataset

In [81]:
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sb
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
import Methods
InteractiveShell.ast_node_interactivity = "all"
#%matplotlib inline
rcParams['figure.figsize'] = 12, 10
sb.set_style('whitegrid')

In [82]:
df_original = pd.read_json('dataset/train.json')
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39774 entries, 0 to 39773
Data columns (total 3 columns):
cuisine        39774 non-null object
id             39774 non-null int64
ingredients    39774 non-null object
dtypes: int64(1), object(2)
memory usage: 932.3+ KB


## Dummify the whole dataset

In [83]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

dummies_original = DataFrame(mlb.fit_transform(df_original['ingredients']), columns=mlb.classes_, index=df_original.index)
dummies_original.head()
dummies_original.shape

Unnamed: 0,( oz.) tomato sauce,( oz.) tomato paste,(10 oz.) frozen chopped spinach,"(10 oz.) frozen chopped spinach, thawed and squeezed dry",(14 oz.) sweetened condensed milk,(14.5 oz.) diced tomatoes,(15 oz.) refried beans,1% low-fat buttermilk,1% low-fat chocolate milk,1% low-fat cottage cheese,...,yukon gold potatoes,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


(39774, 6714)
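
To make the encoding concrete, the following plain-Python sketch builds the same kind of indicator matrix on a toy dataset. The recipes and the helper name `multi_hot` are illustrative assumptions and not part of the notebook; `MultiLabelBinarizer` does the equivalent work above, producing one column per distinct ingredient.

```python
def multi_hot(recipes):
    """Return (columns, rows): one 0/1 column per distinct ingredient."""
    # Sorted vocabulary of every ingredient seen in any recipe
    columns = sorted({ing for recipe in recipes for ing in recipe})
    position = {ing: i for i, ing in enumerate(columns)}
    rows = []
    for recipe in recipes:
        row = [0] * len(columns)
        for ing in recipe:
            row[position[ing]] = 1  # a repeated ingredient still maps to 1
        rows.append(row)
    return columns, rows


# Toy data (hypothetical)
toy_recipes = [["salt", "feta", "olive oil"], ["salt", "rice"]]
columns, rows = multi_hot(toy_recipes)
print(columns)
for row in rows:
    print(row)
```

Summing a row gives the number of distinct ingredients in that recipe, which is how `size_recipe` is derived further down.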

In [84]:
# Join original and dummified
df_original_left = df_original.iloc[:].reindex(['cuisine', 'id'], axis=1)
df_original_dummified = df_original_left.join(dummies_original)

df_original_dummified.head()
df_original_dummified.shape

Unnamed: 0,cuisine,id,( oz.) tomato sauce,( oz.) tomato paste,(10 oz.) frozen chopped spinach,"(10 oz.) frozen chopped spinach, thawed and squeezed dry",(14 oz.) sweetened condensed milk,(14.5 oz.) diced tomatoes,(15 oz.) refried beans,1% low-fat buttermilk,...,yukon gold potatoes,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms
0,greek,10259,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,southern_us,25693,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,filipino,20130,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,indian,22213,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,indian,13162,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


(39774, 6716)

In [85]:
# Create a variable for recipe size 
df_original_dummified.insert(2, 'size_recipe', df_original_dummified.iloc[:, 2:].sum(axis=1))

df_original_dummified.head()
df_original_dummified.shape

Unnamed: 0,cuisine,id,size_recipe,( oz.) tomato sauce,( oz.) tomato paste,(10 oz.) frozen chopped spinach,"(10 oz.) frozen chopped spinach, thawed and squeezed dry",(14 oz.) sweetened condensed milk,(14.5 oz.) diced tomatoes,(15 oz.) refried beans,...,yukon gold potatoes,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms
0,greek,10259,9,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,southern_us,25693,11,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,filipino,20130,12,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,indian,22213,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,indian,13162,20,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


(39774, 6717)

# Divide dataset into train and test sets

The data is usually partitioned during the train/test phase of the process. Because this is a classification problem and the data must be dummified before any model can run, it is easier to partition at the beginning. That way the attributes of the train set still match those of the test set after the feature selection process.
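
The alignment concern can be illustrated on toy data: once a column subset has been chosen on the train set, the test set is restricted to exactly the same columns. The frame and the selection rule below are hypothetical; the sketch only assumes that both partitions were cut from one dummified table.

```python
import pandas as pd

# Toy dummified table (hypothetical data): both partitions share its columns
full = pd.DataFrame(
    {
        "cuisine": ["greek", "thai", "greek", "thai"],
        "feta": [1, 0, 1, 0],
        "fish sauce": [0, 1, 0, 1],
        "salt": [1, 1, 1, 1],
    }
)
train, test = full.iloc[:3], full.iloc[3:]

# Illustrative feature selection on the train set only:
# drop indicator columns that are 1 in every train recipe
features = [c for c in train.columns if c != "cuisine"]
kept = [c for c in features if train[c].mean() < 1]

# Apply the very same column list to both partitions
train_sel = train[["cuisine"] + kept]
test_sel = test[["cuisine"] + kept]
print(list(train_sel.columns) == list(test_sel.columns))
```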

In [86]:
# Partition the data in train and test sets
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df_original_dummified, test_size=0.2, random_state=123)  # random_state fixes the seed (like set.seed in R)

print("train_set = " + str(len(train_set)) + " and test_set = " + str(len(test_set)))

train_set = 31819 and test_set = 7955


### Creates the TRAIN dataset

In [87]:
#Saves train_set index in case it is needed in the future
train_set_idx = train_set.index

# Assigns TRAIN DATASET
df_TRAIN_original = df_original_dummified.copy()
df_TRAIN_original = df_TRAIN_original.loc[train_set_idx].reset_index(drop=True)

df_TRAIN_original.head()
df_TRAIN_original.shape

Unnamed: 0,cuisine,id,size_recipe,( oz.) tomato sauce,( oz.) tomato paste,(10 oz.) frozen chopped spinach,"(10 oz.) frozen chopped spinach, thawed and squeezed dry",(14 oz.) sweetened condensed milk,(14.5 oz.) diced tomatoes,(15 oz.) refried beans,...,yukon gold potatoes,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms
0,chinese,45548,25,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,mexican,8172,11,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,italian,24224,14,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,thai,3640,9,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,mexican,13754,17,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


(31819, 6717)

### Creates the TEST dataset

In [88]:
#Saves test_set index in case it is needed in the future
test_set_idx = test_set.index

# Assigns TEST DATASET
df_TEST_original = df_original_dummified.copy()
df_TEST_original = df_TEST_original.loc[test_set_idx].reset_index(drop=True)

df_TEST_original.head()
df_TEST_original.shape

Unnamed: 0,cuisine,id,size_recipe,( oz.) tomato sauce,( oz.) tomato paste,(10 oz.) frozen chopped spinach,"(10 oz.) frozen chopped spinach, thawed and squeezed dry",(14 oz.) sweetened condensed milk,(14.5 oz.) diced tomatoes,(15 oz.) refried beans,...,yukon gold potatoes,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms
0,southern_us,21672,9,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,southern_us,25355,8,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,french,302,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,southern_us,43816,6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,italian,32357,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


(7955, 6717)

### Creates a DataFrame in the original format for train set to be manipulated

In [89]:
#df_original_dummified.insert(2, 'size_recipe', df_original_dummified.iloc[:, 2:].sum(axis=1))
df_train_original = df_original.loc[train_set_idx].reset_index(drop=True)

if 'size_recipe' in df_train_original.columns:  # Index.contains is not available in recent pandas
    df_train_original.head()
    df_train_original.shape
else:
    df_train_original.insert(2, 'size_recipe', df_train_original['ingredients'].apply(len))
    df_train_original.head()
    df_train_original.shape

Unnamed: 0,cuisine,id,size_recipe,ingredients
0,chinese,45548,25,"[seasoning, meat marinade, pork tenderloin, se..."
1,mexican,8172,11,"[eggs, cream style corn, enchilada sauce, shre..."
2,italian,24224,14,"[frozen chopped spinach, ground black pepper, ..."
3,thai,3640,9,"[coconut sugar, Thai red curry paste, coconut ..."
4,mexican,13754,17,"[frozen orange juice concentrate, olive oil, r..."


(31819, 4)

## Update all variables shown in `DataPresentation.ipynb` to Train set size

### Number of recipes per cuisine

In [90]:
# Number of recipes per cuisine
number_recipe_cuisine = df_train_original['cuisine'].value_counts()
number_recipe_cuisine

print('There are a total of ' + str(number_recipe_cuisine.sum()) + ' recipes and ' + str(len(number_recipe_cuisine)) + ' cuisines')

italian         6337
mexican         5129
southern_us     3446
indian          2406
chinese         2155
french          2079
cajun_creole    1233
thai            1223
japanese        1130
greek            953
spanish          799
vietnamese       666
moroccan         651
british          647
korean           643
filipino         597
irish            536
jamaican         425
russian          387
brazilian        377
Name: cuisine, dtype: int64

There are a total of 31819 recipes and 20 cuisines


### Most occurring ingredients

In [91]:
# Top 10 most occurring ingredients
from Methods import get_ingredients
  
all_ingredients = get_ingredients(df_train_original)
all_ingredients.value_counts()[:10]


#Number of unique ingredients
print('There are ' + str(len(set(all_ingredients))) + ' unique ingredients.')

salt                   14436
olive oil               6391
onions                  6373
water                   5983
garlic                  5895
sugar                   5101
garlic cloves           4998
butter                  3834
ground black pepper     3823
all-purpose flour       3663
dtype: int64

There are 6289 unique ingredients.
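
`get_ingredients` lives in `Methods.py`, which is not shown here. Assuming it simply flattens the per-recipe ingredient lists into one long sequence, the two figures above (top occurrences and number of distinct ingredients) can be reproduced on toy data with the standard library. `flatten_ingredients` is a hypothetical stand-in, not the notebook's helper.

```python
from collections import Counter
from itertools import chain


def flatten_ingredients(recipes):
    """Concatenate every recipe's ingredient list into one flat list."""
    return list(chain.from_iterable(recipes))


# Toy data (hypothetical)
toy_recipes = [["salt", "garlic"], ["salt", "water"], ["salt", "garlic", "sugar"]]
flat = flatten_ingredients(toy_recipes)

top = Counter(flat).most_common(2)  # most frequent ingredients first
n_unique = len(set(flat))           # number of distinct ingredients
print(top, n_unique)
```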


### Groups ingredients by cuisine

In [92]:
# Group all ingredients by cuisine
all_ingredients_cuisine = df_train_original.groupby('cuisine')['ingredients'].sum()

all_ingredients_cuisine

cuisine
brazilian       [papaya, hemp seeds, raspberries, bananas, chi...
british         [bread crumbs, dried sage, pork butt, ground b...
cajun_creole    [boneless skinless chicken breasts, garlic, ro...
chinese         [seasoning, meat marinade, pork tenderloin, se...
filipino        [condensed milk, crushed ice, berries, small p...
french          [chopped fresh chives, garlic cloves, unsalted...
greek           [olive oil, lemon juice, fresh dill, garlic, g...
indian          [curry powder, chopped onion, red potato, garb...
irish           [tomato paste, beef brisket, dill, brown sugar...
italian         [frozen chopped spinach, ground black pepper, ...
jamaican        [red chili peppers, garlic cloves, ground cinn...
japanese        [sugar, green onions, thai chile, dried shiita...
korean          [sesame seeds, garlic, sirloin, green onions, ...
mexican         [eggs, cream style corn, enchilada sauce, shre...
moroccan        [water, yellow onion, serrano chile, black pep...
ru

### Count how many times each ingredient appears in all recipes of a cuisine

In [93]:
from Methods import count_occ

all_ingredient_by_cuisine_count = all_ingredients_cuisine.apply(count_occ)

all_ingredient_by_cuisine_count

cuisine
brazilian       {'salt': 155, 'onions': 103, 'olive oil': 98, ...
british         {'salt': 330, 'all-purpose flour': 193, 'butte...
cajun_creole    {'salt': 591, 'onions': 428, 'garlic': 301, 'g...
chinese         {'soy sauce': 1104, 'sesame oil': 757, 'corn s...
filipino        {'salt': 328, 'garlic': 257, 'water': 247, 'on...
french          {'salt': 952, 'sugar': 489, 'unsalted butter':...
greek           {'salt': 457, 'olive oil': 414, 'dried oregano...
indian          {'salt': 1547, 'onions': 966, 'garam masala': ...
irish           {'salt': 307, 'all-purpose flour': 177, 'butte...
italian         {'salt': 2796, 'olive oil': 2521, 'garlic clov...
jamaican        {'salt': 279, 'onions': 146, 'water': 125, 'ga...
japanese        {'soy sauce': 433, 'salt': 336, 'sugar': 304, ...
korean          {'soy sauce': 326, 'sesame oil': 309, 'garlic'...
mexican         {'salt': 2145, 'onions': 1171, 'ground cumin':...
moroccan        {'salt': 329, 'olive oil': 321, 'ground cumin'...
ru
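
`count_occ` is also imported from `Methods.py`. Judging from the output above, it maps a list of ingredients to a dictionary of counts ordered from most to least frequent. A minimal sketch under that assumption (`count_occurrences` is a hypothetical name):

```python
from collections import Counter


def count_occurrences(ingredients):
    """Map each ingredient to its count, most frequent first."""
    # most_common() sorts by descending count; dict() keeps that order
    return dict(Counter(ingredients).most_common())


counts = count_occurrences(["salt", "onions", "salt", "olive oil", "onions", "salt"])
print(counts)
```

The ordering matters later in the notebook, where the first ten keys of each dictionary are read as the top ten ingredients of a cuisine.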

### Unique ingredients by cuisine

In [94]:
# Get unique ingredients by cuisine
from Methods import remove_duplicates

unique_ingredients_cuisine = all_ingredients_cuisine.apply(remove_duplicates)

unique_ingredients_cuisine

cuisine
brazilian       [rotisserie chicken, mustard, key lime, whole ...
british         [mustard, pie crust, pickling salt, whole milk...
cajun_creole    [shredded parmesan cheese, rotisserie chicken,...
chinese         [rotisserie chicken, sweet and sour sauce, sob...
filipino        [pie crust, whole milk, chicken breasts, guava...
french          [shredded parmesan cheese, pie crust, dark mus...
greek           [rotisserie chicken, whole milk, chicken breas...
indian          [rotisserie chicken, pie crust, lamb neck, app...
irish           [mustard, pie crust, whole milk, dark muscovad...
italian         [shredded parmesan cheese, rotisserie chicken,...
jamaican        [sweet and sour sauce, codfish, chicken breast...
japanese        [mustard, whole milk, boneless center cut pork...
korean          [stir fry beef meat, pickling salt, soba, chic...
mexican         [rotisserie chicken, shredded parmesan cheese,...
moroccan        [rotisserie chicken, whole milk, veal demi-gla...
ru

### Number of unique ingredients used in each cuisine

In [95]:
# Lists number of unique ingredients per cuisine
unique_ingredients_cuisine.apply(len).sort_values(ascending=False)

cuisine
italian         2735
mexican         2464
southern_us     2263
french          1932
chinese         1634
indian          1539
cajun_creole    1430
japanese        1305
thai            1255
spanish         1139
greek           1108
british         1056
vietnamese      1023
moroccan         894
irish            870
filipino         853
korean           799
brazilian        783
russian          782
jamaican         777
Name: ingredients, dtype: int64
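
The per-cuisine counts above depend on `remove_duplicates` from `Methods.py`. Assuming it keeps one copy of each ingredient, the same kind of count can be obtained on toy data as follows. `dedupe` is a hypothetical stand-in that preserves first-seen order, which the original helper may not do.

```python
def dedupe(ingredients):
    """Drop repeated ingredients, keeping the first occurrence of each."""
    return list(dict.fromkeys(ingredients))


# Toy mapping from cuisine to the concatenated ingredients of its recipes
by_cuisine = {
    "greek": ["feta", "olive oil", "feta", "oregano"],
    "thai": ["fish sauce", "lime", "fish sauce"],
}
unique_by_cuisine = {c: dedupe(ings) for c, ings in by_cuisine.items()}
n_unique = {c: len(ings) for c, ings in unique_by_cuisine.items()}
print(n_unique)
```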

### Unique ingredients in the train set

In [96]:
# Get unique ingredients
unique_ingredients = get_ingredients(unique_ingredients_cuisine).unique()

print('There are ' + str(len(unique_ingredients)) + ' unique ingredients in the dataset.')

There are 6289 unique ingredients in the dataset.


### Number of times each ingredient appears in a cuisine

In [97]:
# Number of times each ingredient appears in a cuisine
from Methods import check_for_ingredient
def number_cuisine_per_ingredient(unique):
    cuisines_count_per_ingredient = dict()
    for ingredient in unique:
        cuisines_count_per_ingredient[ingredient] = check_for_ingredient(all_ingredient_by_cuisine_count, ingredient)
    return Series(cuisines_count_per_ingredient)
        
cuisines_count_per_ingredient = number_cuisine_per_ingredient(unique_ingredients)

cuisines_count_per_ingredient.head()

rotisserie chicken    {'brazilian': 1, 'cajun_creole': 2, 'chinese':...
mustard               {'brazilian': 1, 'british': 9, 'cajun_creole':...
key lime               {'brazilian': 2, 'mexican': 2, 'southern_us': 1}
whole milk            {'brazilian': 11, 'british': 44, 'cajun_creole...
chicken breasts       {'brazilian': 8, 'cajun_creole': 32, 'chinese'...
dtype: object

### Unique cuisines in which each ingredient appears

In [98]:
# Get name of cuisine from dictionary with frequency in which they appear
from Methods import extract_cuisine

cuisines_per_ingredient = cuisines_count_per_ingredient.apply(extract_cuisine)

In [99]:
cuisines_per_ingredient[:10]

rotisserie chicken    [brazilian, cajun_creole, chinese, greek, indi...
mustard               [brazilian, british, cajun_creole, chinese, fr...
key lime                              [brazilian, mexican, southern_us]
whole milk            [brazilian, british, cajun_creole, chinese, fi...
chicken breasts       [brazilian, cajun_creole, chinese, filipino, f...
chipped beef                                                [brazilian]
apple juice           [brazilian, chinese, french, indian, irish, it...
pinto beans           [brazilian, cajun_creole, french, italian, mex...
ground cinnamon       [brazilian, british, cajun_creole, chinese, fi...
chicken thighs        [brazilian, cajun_creole, chinese, filipino, f...
dtype: object
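
`check_for_ingredient` and `extract_cuisine` are defined in `Methods.py`. From the outputs above, the first appears to collect, for one ingredient, its count in every cuisine that uses it, and the second appears to keep only the cuisine names. A sketch under those assumptions, with hypothetical function names and toy counts:

```python
def counts_for_ingredient(counts_by_cuisine, ingredient):
    """Return {cuisine: count} for the cuisines that use the ingredient."""
    return {
        cuisine: counts[ingredient]
        for cuisine, counts in counts_by_cuisine.items()
        if ingredient in counts
    }


def cuisines_of(count_by_cuisine):
    """Keep only the cuisine names from a {cuisine: count} mapping."""
    return list(count_by_cuisine)


# Toy per-cuisine counts (hypothetical)
counts_by_cuisine = {
    "brazilian": {"key lime": 2, "salt": 5},
    "greek": {"salt": 7},
    "mexican": {"key lime": 2},
}
per_cuisine = counts_for_ingredient(counts_by_cuisine, "key lime")
print(per_cuisine, cuisines_of(per_cuisine))
```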

## Check for *common ingredients* in all cuisines
After updating all the variables shown in the `DataPresentation.ipynb` notebook to the train set, we now attempt to reduce the dimensionality of the dataset by finding the ingredients common to all cuisines and removing them. The rationale is that an ingredient that appears in every cuisine cannot be a good predictor of any one cuisine.

In [100]:
from Methods import get_common_ingredients

common_ingredients = get_common_ingredients(unique_ingredients_cuisine)

common_ingredients[:10]

print('There are ' + str(len(common_ingredients)) + ' common ingredients.')

['vegetable oil',
 'ground cinnamon',
 'garlic powder',
 'all-purpose flour',
 'sea salt',
 'cayenne',
 'ginger',
 'minced garlic',
 'red bell pepper',
 'extra-virgin olive oil']

There are 86 common ingredients.
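
An ingredient is common when it belongs to the ingredient set of every cuisine, which is a set intersection. `get_common_ingredients` is not shown in this notebook, so the sketch below is only an assumption about its behavior, with a hypothetical name and toy data. It sorts the result for reproducibility, which the original helper does not appear to do.

```python
def common_to_all(unique_by_cuisine):
    """Return the ingredients present in every cuisine's list, sorted."""
    ingredient_sets = [set(ings) for ings in unique_by_cuisine.values()]
    if not ingredient_sets:
        return []
    return sorted(set.intersection(*ingredient_sets))


# Toy data (hypothetical)
unique_by_cuisine = {
    "greek": ["salt", "feta", "garlic"],
    "thai": ["salt", "fish sauce", "garlic"],
    "irish": ["salt", "garlic", "cabbage"],
}
common = common_to_all(unique_by_cuisine)
print(common)
```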


### Remove common ingredients by cuisine

In [101]:
from Methods import remove_ingredients

unique_ingredients_cuisine = unique_ingredients_cuisine.apply(remove_ingredients, remove=common_ingredients)

unique_ingredients_cuisine.apply(len).sort_values(ascending=False)

cuisine
italian         2649
mexican         2378
southern_us     2177
french          1846
chinese         1548
indian          1453
cajun_creole    1344
japanese        1219
thai            1169
spanish         1053
greek           1022
british          970
vietnamese       937
moroccan         808
irish            784
filipino         767
korean           713
brazilian        697
russian          696
jamaican         691
Name: ingredients, dtype: int64
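
Each count above drops by exactly 86, the number of common ingredients, as expected when every common ingredient occurs in every cuisine. A minimal sketch of the filtering step, assuming `remove_ingredients` returns the list without the entries named in `remove` (the function below is a hypothetical stand-in):

```python
def without(ingredients, remove):
    """Return the ingredients that are not in `remove`, order preserved."""
    blocked = set(remove)  # set membership keeps each lookup constant time
    return [ing for ing in ingredients if ing not in blocked]


greek = ["salt", "feta", "garlic", "oregano"]
filtered = without(greek, remove=["salt", "garlic"])
print(filtered, len(greek) - len(filtered))
```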

### Remove common ingredients from the list of all ingredients

In [102]:
# Removes the common ingredients from the list of all ingredient occurrences across all recipes

all_ingredients = all_ingredients[~all_ingredients.isin(Series(common_ingredients))]
all_ingredients.reset_index(drop=True, inplace=True)
all_ingredients.value_counts()[:20]

soy sauce                 2633
tomatoes                  2434
ground cumin              2178
chili powder              1613
grated parmesan cheese    1537
sesame oil                1417
corn starch               1402
jalapeno chilies          1402
fresh lemon juice         1379
chopped cilantro fresh    1376
dried oregano             1368
fresh parsley             1324
diced tomatoes            1288
sour cream                1235
fresh ginger              1210
lime                      1157
fresh lime juice          1094
fish sauce                 998
dry white wine             992
chopped onion              990
dtype: int64

In [103]:
# Number of unique ingredients after removing common ingredients
print('There are ' + str(len(set(all_ingredients))) + ' unique ingredients.')

There are 6203 unique ingredients.


# DataFrame Manipulation

In [104]:
df_train = df_train_original.copy()

df_train.head()

Unnamed: 0,cuisine,id,size_recipe,ingredients
0,chinese,45548,25,"[seasoning, meat marinade, pork tenderloin, se..."
1,mexican,8172,11,"[eggs, cream style corn, enchilada sauce, shre..."
2,italian,24224,14,"[frozen chopped spinach, ground black pepper, ..."
3,thai,3640,9,"[coconut sugar, Thai red curry paste, coconut ..."
4,mexican,13754,17,"[frozen orange juice concentrate, olive oil, r..."


## Remove common ingredients from recipes
Now the common ingredients will be removed from the dataset. Note, however, that doing so may leave some recipes empty. The functions `remove_ingredients_from_table` and `remove_ingredients_from_table_dummified`, imported from `Methods.py`, perform this task and already take that into account: they remove any recipe left with zero ingredients.

### From normal table

In [105]:
from Methods import remove_ingredients_from_table

df_train = remove_ingredients_from_table(df_train, common_ingredients)

df_train.head()
df_train.shape

Unnamed: 0,cuisine,id,size_recipe,ingredients
0,chinese,45548,16,"[oyster sauce, fresh ginger, pork tenderloin, ..."
1,mexican,8172,9,"[sour cream, lime wedges, enchilada sauce, chi..."
2,italian,24224,8,"[grated parmesan cheese, frozen chopped spinac..."
3,thai,3640,7,"[coconut milk, Thai red curry paste, lime juic..."
4,mexican,13754,11,"[dried black beans, crushed garlic, lime, sage..."


(31676, 4)
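
The train set shrinks from 31,819 to 31,676 rows, so 143 recipes consisted only of common ingredients. The sketch below shows the logic described above on toy records: filter each ingredient list, recompute `size_recipe`, and drop recipes left empty. It is an assumption about what `remove_ingredients_from_table` does, written with plain dictionaries rather than a DataFrame, and `strip_common` is a hypothetical name.

```python
def strip_common(recipes, common):
    """Remove common ingredients and drop recipes that end up empty."""
    blocked = set(common)
    cleaned = []
    for recipe in recipes:
        kept = [ing for ing in recipe["ingredients"] if ing not in blocked]
        if kept:  # skip recipes with no ingredients left
            cleaned.append({**recipe, "ingredients": kept, "size_recipe": len(kept)})
    return cleaned


# Toy records (hypothetical)
recipes = [
    {"id": 1, "cuisine": "greek", "ingredients": ["salt", "feta", "garlic"]},
    {"id": 2, "cuisine": "irish", "ingredients": ["salt", "garlic"]},
]
cleaned = strip_common(recipes, common=["salt", "garlic"])
print(cleaned)
```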

### From Dummified table 

In [106]:
from Methods import remove_ingredients_from_table_dummified

df_train_dummified = remove_ingredients_from_table_dummified(df_original_dummified, common_ingredients)

df_train_dummified.head()
df_train_dummified.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,cuisine,id,size_recipe,( oz.) tomato sauce,( oz.) tomato paste,(10 oz.) frozen chopped spinach,"(10 oz.) frozen chopped spinach, thawed and squeezed dry",(14 oz.) sweetened condensed milk,(14.5 oz.) diced tomatoes,(15 oz.) refried beans,...,yukon gold potatoes,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms
0,greek,10259,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,southern_us,25693,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,filipino,20130,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,indian,22213,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,indian,13162,10,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


(39603, 6631)
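
The warning printed above is raised by pandas when a value is assigned to a DataFrame that was derived from another one by slicing. The source of `remove_ingredients_from_table_dummified` is not shown, so the following is only a sketch of one way to do the same job without the warning, by working on an explicit copy. The function name, the column layout and the data are assumptions.

```python
import pandas as pd


def drop_common_columns(df, common, meta=("cuisine", "id", "size_recipe")):
    """Drop common-ingredient columns, recompute size, remove empty recipes."""
    # drop() returns a new frame; copy() makes the independence explicit
    out = df.drop(columns=[c for c in common if c in df.columns]).copy()
    ingredient_cols = [c for c in out.columns if c not in meta]
    out["size_recipe"] = out[ingredient_cols].sum(axis=1)
    # Keep only recipes that still have at least one ingredient
    return out[out["size_recipe"] > 0].reset_index(drop=True)


# Toy dummified table (hypothetical)
toy = pd.DataFrame(
    {
        "cuisine": ["greek", "irish"],
        "id": [1, 2],
        "size_recipe": [3, 2],
        "feta": [1, 0],
        "garlic": [1, 1],
        "salt": [1, 1],
    }
)
result = drop_common_columns(toy, common=["salt", "garlic"])
print(result)
```

In the notebook's own output the column count falls from 6717 to 6631, which matches the 86 common ingredients being dropped.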

## Updates variables after removal of common ingredients from dataset

In [107]:
# Number of recipes per cuisine
number_recipe_cuisine = df_train['cuisine'].value_counts()
number_recipe_cuisine[:10]

italian         6318
mexican         5114
southern_us     3431
indian          2397
chinese         2154
french          2053
cajun_creole    1230
thai            1223
japanese        1125
greek            951
Name: cuisine, dtype: int64

In [108]:
all_ingredients = get_ingredients(df_train)
all_ingredients.value_counts()[:10]

soy sauce                 2633
tomatoes                  2434
ground cumin              2178
chili powder              1613
grated parmesan cheese    1537
sesame oil                1417
corn starch               1402
jalapeno chilies          1402
fresh lemon juice         1379
chopped cilantro fresh    1376
dtype: int64

In [109]:
# Group all ingredients by cuisine after removing common ingredients
all_ingredients_cuisine = df_train.groupby('cuisine')['ingredients'].sum()

all_ingredients_cuisine

cuisine
brazilian       [hemp seeds, açai, chia seeds, raspberries, p...
british         [bread crumbs, pork liver, dried sage, pork bu...
cajun_creole    [boneless skinless chicken breasts, cajun seas...
chinese         [oyster sauce, fresh ginger, pork tenderloin, ...
filipino        [coconut milk, crushed ice, berries, condensed...
french          [grated Gruyère cheese, chopped fresh chives,...
greek           [fresh dill, plain yogurt, baking potatoes, to...
indian          [red potato, chopped onion, unsweetened coconu...
irish           [black peppercorns, chopped celery, chopped on...
italian         [grated parmesan cheese, frozen chopped spinac...
jamaican        [red chili peppers, ground allspice, lime juic...
japanese        [dried shiitake mushrooms, sesame oil, corn st...
korean          [sirloin, mirin, sesame oil, sesame seeds, dar...
mexican         [sour cream, lime wedges, enchilada sauce, chi...
moroccan        [cumin, serrano chile, chiles, fresh cilantro,...
ru

In [110]:
all_ingredient_by_cuisine_count = all_ingredients_cuisine.apply(count_occ)

all_ingredient_by_cuisine_count

cuisine
brazilian       {'lime': 75, 'cachaca': 60, 'tomatoes': 53, 'c...
british         {'baking soda': 58, 'whipping cream': 49, 'who...
cajun_creole    {'cajun seasoning': 241, 'dried thyme': 193, '...
chinese         {'soy sauce': 1104, 'sesame oil': 757, 'corn s...
filipino        {'soy sauce': 195, 'fish sauce': 75, 'coconut ...
french          {'fresh lemon juice': 201, 'dry white wine': 1...
greek           {'dried oregano': 207, 'feta cheese crumbles':...
indian          {'garam masala': 691, 'ground turmeric': 581, ...
irish           {'baking soda': 96, 'buttermilk': 65, 'cabbage...
italian         {'grated parmesan cheese': 1296, 'fresh basil'...
jamaican        {'ground allspice': 106, 'dried thyme': 88, 'c...
japanese        {'soy sauce': 433, 'mirin': 303, 'sake': 210, ...
korean          {'soy sauce': 326, 'sesame oil': 309, 'sesame ...
mexican         {'ground cumin': 1061, 'chili powder': 977, 'j...
moroccan        {'ground cumin': 272, 'couscous': 116, 'ground...
ru

In [111]:
unique_ingredients_cuisine = all_ingredients_cuisine.apply(remove_duplicates)

unique_ingredients_cuisine

cuisine
brazilian       [rotisserie chicken, mustard, key lime, whole ...
british         [mustard, pie crust, pickling salt, whole milk...
cajun_creole    [mustard, shredded parmesan cheese, rotisserie...
chinese         [rotisserie chicken, sweet and sour sauce, sob...
filipino        [pie crust, whole milk, chicken breasts, guava...
french          [shredded parmesan cheese, pie crust, dark mus...
greek           [rotisserie chicken, whole milk, chicken breas...
indian          [rotisserie chicken, pie crust, lamb neck, app...
irish           [mustard, pie crust, whole milk, dark muscovad...
italian         [shredded parmesan cheese, rotisserie chicken,...
jamaican        [sweet and sour sauce, codfish, chicken breast...
japanese        [mustard, whole milk, boneless center cut pork...
korean          [stir fry beef meat, pickling salt, soba, chic...
mexican         [rotisserie chicken, shredded parmesan cheese,...
moroccan        [rotisserie chicken, whole milk, veal demi-gla...
ru

In [112]:
# Lists number of unique ingredients per cuisine
unique_ingredients_cuisine.apply(len).sort_values(ascending=False)

cuisine
italian         2649
mexican         2378
southern_us     2177
french          1846
chinese         1548
indian          1453
cajun_creole    1344
japanese        1219
thai            1169
spanish         1053
greek           1022
british          970
vietnamese       937
moroccan         808
irish            784
filipino         767
korean           713
brazilian        697
russian          696
jamaican         691
Name: ingredients, dtype: int64

In [113]:
# Get unique ingredients
unique_ingredients = Series(get_ingredients(unique_ingredients_cuisine).unique())
unique_ingredients[:10]

print('There are ' + str(len(unique_ingredients)) + ' unique ingredients after removing common ingredients')

0    rotisserie chicken
1               mustard
2              key lime
3            whole milk
4       chicken breasts
5          chipped beef
6           apple juice
7           pinto beans
8        chicken thighs
9             soy sauce
dtype: object

There are 6203 unique ingredients after removing common ingredients


In [114]:
# Number of cuisines each ingredient appears after removing common ingredients
cuisines_count_per_ingredient = number_cuisine_per_ingredient(unique_ingredients)

cuisines_count_per_ingredient[:20]

rotisserie chicken       {'brazilian': 1, 'cajun_creole': 2, 'chinese':...
mustard                  {'brazilian': 1, 'british': 9, 'cajun_creole':...
key lime                  {'brazilian': 2, 'mexican': 2, 'southern_us': 1}
whole milk               {'brazilian': 11, 'british': 44, 'cajun_creole...
chicken breasts          {'brazilian': 8, 'cajun_creole': 32, 'chinese'...
chipped beef                                              {'brazilian': 1}
apple juice              {'brazilian': 1, 'chinese': 1, 'french': 4, 'i...
pinto beans              {'brazilian': 1, 'cajun_creole': 1, 'french': ...
chicken thighs           {'brazilian': 1, 'cajun_creole': 15, 'chinese'...
soy sauce                {'brazilian': 1, 'cajun_creole': 10, 'chinese'...
shells                   {'brazilian': 1, 'cajun_creole': 6, 'chinese':...
plain yogurt             {'brazilian': 3, 'chinese': 1, 'french': 6, 'g...
beer                     {'brazilian': 5, 'british': 20, 'cajun_creole'...
garlic salt              

In [115]:
# Cuisines in which each ingredient appears
cuisines_per_ingredient = cuisines_count_per_ingredient.apply(extract_cuisine)
cuisines_per_ingredient[:10]

rotisserie chicken    [brazilian, cajun_creole, chinese, greek, indi...
mustard               [brazilian, british, cajun_creole, chinese, fr...
key lime                              [brazilian, mexican, southern_us]
whole milk            [brazilian, british, cajun_creole, chinese, fi...
chicken breasts       [brazilian, cajun_creole, chinese, filipino, f...
chipped beef                                                [brazilian]
apple juice           [brazilian, chinese, french, indian, irish, it...
pinto beans           [brazilian, cajun_creole, french, italian, mex...
chicken thighs        [brazilian, cajun_creole, chinese, filipino, f...
soy sauce             [brazilian, cajun_creole, chinese, filipino, f...
dtype: object

In [116]:
# Check number of ingredients to verify deletion
print(str(len(cuisines_count_per_ingredient)) + ' ingredients left from an original of 6289.')

6203 ingredients left from an original of 6289.


### Top 10 ingredients per cuisine after removing common ingredients

In [117]:
# count_occ output is ordered by descending count, so the first 10 keys are the top 10
top10 = Series({cuisine: list(counts)[:10]
                for cuisine, counts in all_ingredient_by_cuisine_count.items()})

# DataFrame.from_items was removed in pandas 1.0; build the table from the lists directly
df_top10 = DataFrame(top10.tolist(), index=top10.index,
                     columns=['top{}'.format(i) for i in range(1, 11)])
df_top10



Unnamed: 0,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10
brazilian,lime,cachaca,tomatoes,coconut milk,sweetened condensed milk,black beans,ice,lime juice,fresh cilantro,fresh lime juice
british,baking soda,whipping cream,whole milk,worcestershire sauce,raisins,ground nutmeg,egg yolks,plain flour,powdered sugar,large egg yolks
cajun_creole,cajun seasoning,dried thyme,andouille sausage,creole seasoning,shrimp,fresh parsley,worcestershire sauce,diced tomatoes,celery ribs,hot sauce
chinese,soy sauce,sesame oil,corn starch,rice vinegar,fresh ginger,oyster sauce,hoisin sauce,peanut oil,Shaoxing wine,light soy sauce
filipino,soy sauce,fish sauce,coconut milk,vinegar,corn starch,tomatoes,pork,ground pork,cabbage,chicken
french,fresh lemon juice,dry white wine,large egg yolks,fresh parsley,whipping cream,bay leaf,tomatoes,flat leaf parsley,whole milk,egg yolks
greek,dried oregano,feta cheese crumbles,fresh lemon juice,feta cheese,cucumber,tomatoes,fresh parsley,fresh dill,fresh oregano,greek yogurt
indian,garam masala,ground turmeric,cumin seed,ground cumin,tomatoes,tumeric,chili powder,green chilies,curry powder,fresh ginger
irish,baking soda,buttermilk,cabbage,fresh parsley,raisins,bacon,whole wheat flour,beer,russet potatoes,Irish whiskey
italian,grated parmesan cheese,fresh basil,dry white wine,fresh parsley,dried oregano,flat leaf parsley,tomatoes,fresh lemon juice,parmesan cheese,diced tomatoes


## Investigate ingredients whose name contains a cuisine name
Now that the common ingredients have been identified and removed from the dataset and all other variables have been updated, we broaden the search for ingredients to eliminate, since removing the 86 identified so far does little to mitigate the dimensionality problem.

### Find same name ingredients
After noticing that many ingredients share the name of a cuisine, I decided to investigate those ingredients and the cuisines where they are used, to verify their occurrence and determine whether they might be strong cuisine predictors.

In [118]:
# Finds, for each cuisine, the ingredients whose name contains the cuisine name
all_ingredients_same_name_cuisine = Series({
    cuisine: [ingredient for ingredient in list_ingredients if cuisine in ingredient]
    for cuisine, list_ingredients in all_ingredients_cuisine.items()
})

In [119]:
all_ingredients_same_name_cuisine

brazilian                                                      []
british                                                        []
cajun_creole                                                   []
chinese         [chinese rice wine, chinese five-spice powder,...
filipino                                      [filipino eggplant]
french          [french baguette, french bread, french toast, ...
greek           [greek yogurt, nonfat greek yogurt, greek seas...
indian                                                         []
irish           [irish cream liqueur, irish cream liqueur, iri...
italian         [italian seasoning, italian seasoning, italian...
jamaican        [jamaican rum, jamaican jerk season, jamaican ...
japanese        [japanese cucumber, japanese cucumber, japanes...
korean          [korean chile paste, korean chile paste, korea...
mexican         [crema mexicana, fresh mexican cheese, mexican...
moroccan                 [moroccan seasoning, moroccan seasoning]
russian   
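
The match above is a plain, case-sensitive substring test. This is why `Irish whiskey` (listed among the top Irish ingredients earlier) is not picked up for `irish`, and why an underscore-joined label such as `cajun_creole` can only match an ingredient containing that exact string. A minimal sketch of a more forgiving match, assuming it is acceptable to lowercase both sides and replace underscores with spaces; the helper `same_name_ingredients` and the toy data are hypothetical and not part of the notebook:

```python
def same_name_ingredients(cuisine, ingredients):
    """Return ingredients whose name mentions the cuisine, ignoring case and underscores."""
    # Hypothetical normalisation: 'southern_us' -> 'southern us'
    needle = cuisine.replace('_', ' ').lower()
    return [ing for ing in ingredients if needle in ing.lower()]


# Toy data, for illustration only
toy = {
    'irish': ['Irish whiskey', 'irish bacon', 'butter'],
    'cajun_creole': ['cajun seasoning', 'creole mustard'],
}
matches = {c: same_name_ingredients(c, ings) for c, ings in toy.items()}
print(matches)
```

Even with this normalisation, compound labels like `cajun_creole` still find nothing in the toy data, because no ingredient contains the full phrase; handling them would need a separate decision about which parts of the label count as a match.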

### Count occurrences of all same name ingredients per cuisine

In [120]:
# Count the occurrences of same name ingredients in each cuisine
all_ingredients_same_name_cuisine.apply(len)

brazilian         0
british           0
cajun_creole      0
chinese         459
filipino          1
french           58
greek           163
indian            0
irish            19
italian         620
jamaican         33
japanese         41
korean           12
mexican         106
moroccan          2
russian           1
southern_us       0
spanish          66
thai            207
vietnamese       21
dtype: int64

In [121]:
# Total occurrences of same name ingredients across all cuisines
all_ingredients_same_name_cuisine.apply(len).sum(axis=0)

1809

### Count occurrences of each same name ingredient

In [122]:
all_ingredients_same_name_cuisine_count = all_ingredients_same_name_cuisine.apply(count_occ)

all_ingredients_same_name_cuisine_count

brazilian                                                      {}
british                                                        {}
cajun_creole                                                   {}
chinese         {'chinese five-spice powder': 187, 'chinese ri...
filipino                                 {'filipino eggplant': 1}
french          {'french bread': 43, 'french baguette': 9, 'fr...
greek           {'greek yogurt': 65, 'greek style plain yogurt...
indian                                                         {}
irish           {'irish cream liqueur': 13, 'irish bacon': 5, ...
italian         {'italian seasoning': 273, 'italian sausage': ...
jamaican        {'jamaican jerk season': 18, 'jamaican rum': 5...
japanese        {'japanese rice': 16, 'japanese eggplants': 15...
korean          {'korean chile paste': 8, 'korean chile': 2, '...
mexican         {'mexican chorizo': 27, 'mexican chocolate': 2...
moroccan                                {'moroccan seasoning': 2}
russian   
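
`count_occ` is defined elsewhere in the project and its source is not shown here. Judging only from the output above (a dictionary per cuisine, most frequent ingredient first), it could look roughly like the following sketch. The name `count_occ_sketch` is a stand-in, and the ordering is an assumption:

```python
from collections import Counter


def count_occ_sketch(ingredients):
    """Map each ingredient to its number of occurrences, most frequent first."""
    # Counter.most_common() returns (item, count) pairs sorted by descending count
    return dict(Counter(ingredients).most_common())


example = count_occ_sketch(['greek yogurt', 'feta cheese', 'greek yogurt'])
print(example)
```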

### Unique same name ingredients by cuisine

In [123]:
# Lists of unique same name ingredients per cuisine
unique_ingredients_same_name_cuisine = all_ingredients_same_name_cuisine.apply(remove_duplicates)
unique_ingredients_same_name_cuisine

brazilian                                                      []
british                                                        []
cajun_creole                                                   []
chinese         [chinese celery cabbage, chinese cinnamon, chi...
filipino                                      [filipino eggplant]
french          [french sandwich rolls, french baguette, frenc...
greek           [greek seasoning, greek style plain yogurt, fe...
indian                                                         []
irish              [irish bacon, irish oats, irish cream liqueur]
italian         [sweet italian sausage, country style italian ...
jamaican        [jamaican rum, jamaican pumpkin, jamaican curr...
japanese        [japanese radish, japanese rice, japanese cucu...
korean          [korean chile paste, korean chile, korean buck...
mexican         [crema mexican, mexican style 4 cheese blend, ...
moroccan                                     [moroccan seasoning]
russian   
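
`remove_duplicates` is another helper whose source is not shown. The ordering in the output above differs from first-seen order, which suggests (but does not confirm) a set-based implementation. The sketch below is a hypothetical alternative that keeps first-seen order, so the result is the same on every run:

```python
def remove_duplicates_sketch(items):
    """Return the distinct items, keeping the order of first appearance."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


deduplicated = remove_duplicates_sketch(
    ['irish cream liqueur', 'irish cream liqueur', 'irish bacon']
)
print(deduplicated)
```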

### Count number of unique same name ingredients per cuisine

In [124]:
# Number of unique ingredients per cuisine
unique_ingredients_same_name_cuisine.apply(len)
print("")

# Total number of unique same name ingredients, summed over cuisines
print('There are ' + str(all_ingredients_same_name_cuisine_count.apply(len).sum()) + ' instances of same cuisine/ingredient name.')

brazilian        0
british          0
cajun_creole     0
chinese         31
filipino         1
french           7
greek           15
indian           0
irish            3
italian         38
jamaican         6
japanese         5
korean           4
mexican          7
moroccan         1
russian          1
southern_us      0
spanish          5
thai             5
vietnamese       2
dtype: int64


There are 131 instances of same cuisine/ingredient name.


### Unique same name ingredients 

In [125]:
# Get unique same name ingredients
unique_same_name_ingredients = Series(get_ingredients(unique_ingredients_same_name_cuisine).unique())
%store unique_same_name_ingredients

unique_same_name_ingredients[:10]

print('There are ' + str(len(unique_same_name_ingredients)) + ' unique_ingredients.')

Stored 'unique_same_name_ingredients' (Series)


0       chinese celery cabbage
1             chinese cinnamon
2               chinese turnip
3               chinese chives
4    chinese five-spice powder
5          chinese chili paste
6             chinese pea pods
7            chinese eggplants
8           chinese plum sauce
9           chinese roast pork
dtype: object

There are 131 unique_ingredients.
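
`get_ingredients` is not shown in this notebook. Since its result is followed by `.unique()`, it presumably returns a flat `Series` of ingredient names. A minimal sketch under that assumption, with the hypothetical name `get_ingredients_sketch` and toy data:

```python
import pandas as pd


def get_ingredients_sketch(lists_by_cuisine):
    """Flatten a Series of ingredient lists into one Series of names."""
    return pd.Series(
        [ing for ingredients in lists_by_cuisine for ing in ingredients],
        dtype=object,
    )


# Toy data, for illustration only
toy = pd.Series({
    'greek': ['greek yogurt', 'greek seasoning'],
    'thai': ['thai basil'],
    'indian': [],
})
unique_names = pd.Series(get_ingredients_sketch(toy).unique())
print(len(unique_names))
```

In the notebook, the flattened total (131) equals the sum of the per-cuisine unique counts, so no same name ingredient is counted under more than one cuisine at this step.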


### Number of times each same name ingredient appears in a cuisine

In [126]:
# Number of times each same name ingredient appears in each cuisine

cuisines_count_per_same_name_ingredient = number_cuisine_per_ingredient(unique_same_name_ingredients)

cuisines_count_per_same_name_ingredient[:10]

chinese celery cabbage                                          {'chinese': 1}
chinese cinnamon                               {'chinese': 1, 'vietnamese': 1}
chinese turnip                                       {'chinese': 1, 'thai': 1}
chinese chives               {'chinese': 14, 'japanese': 1, 'korean': 6, 't...
chinese five-spice powder    {'chinese': 187, 'filipino': 1, 'indian': 1, '...
chinese chili paste                                             {'chinese': 1}
chinese pea pods                                                {'chinese': 1}
chinese eggplants            {'chinese': 6, 'filipino': 1, 'japanese': 1, '...
chinese plum sauce                                              {'chinese': 3}
chinese roast pork                                              {'chinese': 3}
dtype: object
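
The source of `number_cuisine_per_ingredient` is not shown. One way to produce a result of the same shape is sketched below, assuming a recipe table with `cuisine` and `ingredients` columns like `df_original`. The function name `cuisine_counts_sketch` and the toy recipes are hypothetical:

```python
import pandas as pd


def cuisine_counts_sketch(recipes, ingredients):
    """For each ingredient, count how many recipes of each cuisine use it."""
    # One row per (recipe, ingredient) pair
    long = recipes.explode('ingredients')
    long = long[long['ingredients'].isin(list(ingredients))]

    result = {}
    for ingredient, group in long.groupby('ingredients'):
        # sort_index() orders the cuisines alphabetically, as in the output above
        result[ingredient] = group['cuisine'].value_counts().sort_index().to_dict()
    return pd.Series(list(result.values()), index=list(result.keys()), dtype=object)


# Toy recipes, for illustration only
recipes = pd.DataFrame({
    'cuisine': ['chinese', 'thai', 'chinese'],
    'ingredients': [['chinese chives', 'soy sauce'], ['chinese chives'], ['chinese chives']],
})
counts_by_ingredient = cuisine_counts_sketch(recipes, ['chinese chives'])
print(counts_by_ingredient)
```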

### Unique cuisines in which each same name ingredient appears

In [127]:
# Check in which cuisines each same name ingredient appears

# Extract the cuisine names from each ingredient's {cuisine: count} dictionary
cuisines_per_same_name_ingredient = cuisines_count_per_same_name_ingredient.apply(extract_cuisine)

cuisines_per_same_name_ingredient[:10]

chinese celery cabbage                                               [chinese]
chinese cinnamon                                         [chinese, vietnamese]
chinese turnip                                                 [chinese, thai]
chinese chives                   [chinese, japanese, korean, thai, vietnamese]
chinese five-spice powder    [chinese, filipino, indian, italian, jamaican,...
chinese chili paste                                                  [chinese]
chinese pea pods                                                     [chinese]
chinese eggplants                    [chinese, filipino, japanese, vietnamese]
chinese plum sauce                                                   [chinese]
chinese roast pork                                                   [chinese]
dtype: object

### Number of unique cuisines in which each ingredient appears

In [128]:
# Number of cuisines in which each same name ingredient appears (first 10 ingredients)
cuisines_per_same_name_ingredient.apply(len)[:10]

chinese celery cabbage       1
chinese cinnamon             2
chinese turnip               2
chinese chives               5
chinese five-spice powder    8
chinese chili paste          1
chinese pea pods             1
chinese eggplants            4
chinese plum sauce           1
chinese roast pork           1
dtype: int64
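
The number of cuisines alone does not show how concentrated an ingredient is in its namesake cuisine. The sketch below is one possible way to quantify that from the per-cuisine counts. Both helpers are hypothetical (`extract_cuisine_sketch` only guesses at what `extract_cuisine` does), and the two dictionaries are copied from the counts output shown earlier:

```python
def extract_cuisine_sketch(counts):
    """Return the cuisines in which an ingredient appears."""
    return list(counts.keys())


def namesake_share(counts, cuisine):
    """Fraction of an ingredient's occurrences that fall in the given cuisine."""
    total = sum(counts.values())
    return counts.get(cuisine, 0) / total if total else 0.0


# Counts taken from the notebook output for two ingredients
chinese_turnip = {'chinese': 1, 'thai': 1}
chinese_plum_sauce = {'chinese': 3}

print(extract_cuisine_sketch(chinese_turnip))
print(namesake_share(chinese_turnip, 'chinese'))
print(namesake_share(chinese_plum_sauce, 'chinese'))
```

A share close to 1 would mark an ingredient as nearly exclusive to its namesake cuisine, although with counts this small the figure says little on its own.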

## Conclusions
After investigating the same name ingredients, it turned out that there are only 131 of them, occurring just 1809 times across all recipes. A subset of the dataset could be created to investigate them further and check whether they are good cuisine predictors. However, because they are so few compared to the size of the dataset (39774 recipes and 6714 ingredient columns before any removal), their impact on the overall accuracy of a model will probably not be significant, regardless of how good they are. They also do not solve the dimensionality problem. Therefore, instead of digging deeper into them, I decided to change strategy and apply PCA, a dimensionality reduction technique, to reduce the number of attributes in the dataset. The `PCAModel.ipynb` notebook demonstrates step by step how this was done.
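
To put "so few" into numbers, the figures reported in this notebook can be compared directly. This is a small sketch using only values printed above. Because one recipe may contain more than one same name ingredient, the 1809 occurrences are an upper bound on the number of recipes affected:

```python
n_same_name = 131      # unique same name ingredients
n_occurrences = 1809   # total occurrences of those ingredients
n_columns = 6714       # ingredient columns of the dummified dataset, before any removal
n_recipes = 39774      # recipes in train.json

column_share = n_same_name / n_columns
recipe_share_upper_bound = n_occurrences / n_recipes

print(f"Share of ingredient columns: {column_share:.2%}")
print(f"Share of recipes (upper bound): {recipe_share_upper_bound:.2%}")
```

The same name ingredients make up about 2% of the ingredient columns and touch fewer than 5% of the recipes.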