# What's Cooking Kaggle Challenge

Link: https://www.kaggle.com/c/whats-cooking/overview

In this notebook I'll be looking at the dataset from the Kaggle What's Cooking Challenge which is a supervised problem of predicting cuisines from ingredients list. The dataset is a .json file with only 2 important columns being the cuisine as well as the ingredients. In this notebook we will be doing exploratory data analysis followed by using some Natural Language Processing techniques to solve the problem. I will be using PyTorch to implement the deep learning model to learn the vector representations of the words.

## Things to do: 
Add in a content page using Markdown
Segment the Notebook into different components
- Data Preprocessing
- Preliminary Model
- Data Visualisation - 1st Iteration
- CBOW Model
- NN Architecture
- Evaluation
- Visualisation - 2nd Iteration



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn
import random


In [2]:
df = pd.read_json("train.json")
df_test = pd.read_json("test.json")
df.head()

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe..."


In [3]:
#Different Cuisines present and their counts
df["cuisine"].value_counts()

italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64

In [4]:
ingredients = df["ingredients"].tolist()
test_ingredients = df_test["ingredients"].tolist()
ingredients = ingredients + test_ingredients

# The Vector space has to include all ingredients from both Train and Test

In [5]:
word_dict ={}
for i_list in ingredients:
    for ing in i_list:
        ing = ing.split(" ")
        for word in ing:
            word_dict[word] = word_dict.get(word,0) + 1
            
ingredients_dict = {}
for recipe in ingredients:
    for ingredient in recipe:
        ingredients_dict[ingredient] = ingredients_dict.get(ingredient,0)+ 1

ing_df = pd.DataFrame(data = ingredients_dict.values(),index = ingredients_dict.keys(),columns = ["Counts"])
ing_df.sort_values(["Counts"],ascending = False, inplace = True)
ing_df

Unnamed: 0,Counts
salt,22534
onions,10008
olive oil,9889
water,9293
garlic,9171
...,...
seville orange juice,1
dried hibiscus blossoms,1
pancake batter,1
dairy free coconut ice cream,1


As can be seen from the dataframe, there are currently 7137 differnt types of ingredients present in the dataset. However many of them are repeated but have a slightly different name in the recipe. (Eg. Garlic vs Chopped Garlic). Below you can see a list of stopwords which are redundant. This results in a reduction of the number of ingredients by around 200 which is relatively sizeable. 

In [34]:
stopwords= ["fresh","chopped","large","all-purpose","grated","freshly","crushed","minced","skinless"
           "sodium","low","diced","unsalted","coarse","low-fat","medium","powdered","finely","fine",
           "pitted","plain","low-fat","full-fat","nonfat","fat-free"]
def find_occurence(word,recipe_list):
    result = {}
    for recipe in recipe_list:
        for ingredient in recipe:
            if word in ingredient:
                result[ingredient] = result.get(ingredient,0) + 1
    return list(result.keys())

ingredients2 = []
for index,i in enumerate(ingredients):
    recipe = []
    for j in i:
        ing_word = j.split(" ")
        ing_word = [i for i in ing_word if i not in stopwords]
        recipe.append(" ".join(ing_word))
    ingredients2.append(recipe)
    
ingredients_dict2 = {}
for recipe in ingredients2:
    for ingredient in recipe:
        ingredients_dict2[ingredient] = ingredients_dict2.get(ingredient,0)+ 1
ing_df = pd.DataFrame(data = ingredients_dict2.values(),index = ingredients_dict2.keys(),columns = ["Counts"])
ing_df.sort_values(["Counts"],ascending = False, inplace = True)
df["ingredients"]= ingredients[:len(df)] #Append the "cleaned" list of ingredients to the dataframe
ingredients_map = {k:v for k,v in zip(ing_df.index,range(len(ing_df)))}

def convert_recipe(recipe):
    '''
    Convert Recipe from a List of String Ingredients to a Vector
    recipe: List of Ingredients
    output: 7137x1 Vector
    '''
    output = np.zeros(7137)
    for ingredient in recipe:
        output[ingredients_map[ingredient]] = 1
    return output
    
df["Vector"] = df["ingredients2"].apply(convert_recipe) # Convert each recipe to a OHE Sparse Vector Form

Now that most of the preprocessing is done, traditional ML methods are used as a baseline effectiveness for the classification task so that we can compare the performance using Deep Learning.

In [7]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Target"] = le.fit_transform(df["cuisine"])

Things to do:
- Reduce dimensionality of the data by removing redundant words
- Turning everything into a one hot encoded vector
- K-Means Clustering to treat the problem like an unsupervised one
- KNN, SVM vs Neural Network
- Use additional feature engineering to enhance model performance - Protein, spices others etc.


In [8]:
#Store all the vectors as a Matrix of M x 7137
mat = list(df["Vector"])
mat = np.array(mat)
mat.shape

(39774, 7137)

Because of the curse of dimensionality, training a model on this vector space will take way too long. For this we first use a linear dimensionality reduction tool PCA and later in the Natural Language Processing Section we can see how this actually compares with non-linear methods such as using an autoencoder

In [12]:
from sklearn.decomposition import PCA
pca_128 = PCA(128)
mat_pca_128 = pca_128.fit_transform(mat)

In [13]:
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(mat_pca_128,df['Target'],
                                                    test_size=0.30)
sv_linear = SVC(kernel = "linear")
sv_linear.fit(X_train,y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [32]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

predictions = sv_linear.predict(X_test)
print(classification_report(y_test,predictions))
cr = classification_report(y_test,predictions,output_dict= True)

              precision    recall  f1-score   support

           0       0.40      0.20      0.27       144
           1       0.30      0.17      0.22       236
           2       0.73      0.66      0.69       451
           3       0.72      0.80      0.76       769
           4       0.63      0.53      0.57       212
           5       0.47      0.52      0.49       748
           6       0.67      0.60      0.63       346
           7       0.85      0.86      0.85       886
           8       0.40      0.17      0.24       218
           9       0.72      0.84      0.77      2353
          10       0.81      0.57      0.67       162
          11       0.70      0.55      0.61       433
          12       0.72      0.57      0.64       242
          13       0.87      0.88      0.88      1961
          14       0.76      0.64      0.69       248
          15       0.41      0.30      0.35       139
          16       0.58      0.73      0.65      1334
          17       0.61    

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c= df["Target"])

plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')

In [None]:

X_train, X_test, y_train, y_test = train_test_split(x_pca,df['Target'],
                                                    test_size=0.30)

In [None]:
svm = SVC()
svm.fit(X_train,y_train)



In [None]:
predictions = svm.predict(X_test)

In [None]:
score = sum(y_test == predictions)/len(y_test)
score

In [None]:
subset_df = df[df["Target"] < 6]
subset = list(subset_df["Vector"])
subset = np.array(subset)
x_pca = pca.fit_transform(subset)

In [None]:
sns.scatterplot(x = x_pca[:,0], y = x_pca[:,1],hue = subset_df["Target"])
plt.legend()

In [None]:
le.inverse_transform(range(0,6))

In [None]:
sns.set_style('whitegrid')

## Natural Language Processing

Now that we have seen how something like PCA can make sense of the data, we come to the interesting part which is to try to find a better vector representation of the words in a recipe.

Here we will use the Common Bag Of Words (CBOW) model in order to learn the representation of the words. The idea behind CBOW is to use the context of a word to learn what the word actually means. For example a sentence such as "I like to eat pasta", in order to learn the representation of the word "eat", we look at "pasta" and "like" as context words. In this case because the recipes are inherently unordered, the context words will be obtained randomly from the sample.

The implementation of this model is to use an autoencoder architecture where the One-Hot encoded word vectors are used in the model as inputs and encoded into the learnt representations and then the decoder will try to recreate the target word. 

Link: https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa

## To-Do: Insert picture of the autoencoder
    

Recipe Sampling, Randomly choose 5 other ingredients that are in the same recipe and try to use the words to learn the representation of that word

In [43]:
import torch
import torch.nn as nn

In [63]:
#Helper Functions
#random.choices -> with replacement, random.sample() -> Without replacement
#The sampling function to get context words from a recipe
def sample(recipe,ingredient,samples):
    recipe = recipe[:] #Copy the recipe to prevent alteration
    recipe.remove(ingredient)
    if len(recipe) < CONTEXT_SIZE:
        return random.choices(recipe, k=samples)
    else: 
        return random.sample(recipe, k=samples)
    
def create_context_vector(context):
    idxs = [ingredients_map[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)
def generatePair()


In [64]:
sample1 = sample(ingredients2[1],"flour",CONTEXT_SIZE)
create_context_vector(sample1)

tensor([592,  13,  51, 211,  11])

In [57]:
ingredients2[1]

['flour',
 'ground pepper',
 'salt',
 'tomatoes',
 'ground black pepper',
 'thyme',
 'eggs',
 'green tomatoes',
 'yellow corn meal',
 'milk',
 'vegetable oil']

In [55]:
ingredients_map

{'salt': 0,
 'garlic': 1,
 'onions': 2,
 'olive oil': 3,
 'butter': 4,
 'water': 5,
 'garlic cloves': 6,
 'sugar': 7,
 'eggs': 8,
 'flour': 9,
 'tomatoes': 10,
 'ground black pepper': 11,
 'cilantro': 12,
 'vegetable oil': 13,
 'pepper': 14,
 'ginger': 15,
 'soy sauce': 16,
 'kosher salt': 17,
 'lemon juice': 18,
 'green onions': 19,
 'carrots': 20,
 'parmesan cheese': 21,
 'ground cumin': 22,
 'extra-virgin olive oil': 23,
 'black pepper': 24,
 'lime juice': 25,
 'milk': 26,
 'parsley': 27,
 'chili powder': 28,
 'oil': 29,
 'red bell pepper': 30,
 'scallions': 31,
 'purple onion': 32,
 'onion': 33,
 'corn starch': 34,
 'shrimp': 35,
 'sesame oil': 36,
 'jalapeno chilies': 37,
 'baking powder': 38,
 'dried oregano': 39,
 'sour cream': 40,
 'chicken broth': 41,
 'cayenne pepper': 42,
 'lime': 43,
 'cooking spray': 44,
 'brown sugar': 45,
 'shallots': 46,
 'green bell pepper': 47,
 'garlic powder': 48,
 'basil': 49,
 'celery': 50,
 'ground pepper': 51,
 'honey': 52,
 'vanilla extract': 5

In [44]:
VOCAB_SIZE = len(ingredients_dict2)
EMBED_DIM = 64
CONTEXT_SIZE = 5

class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOWModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim) #Transform to Lower Dimension Embeddings
        self.linear1 = nn.Linear(context_size*embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
        
    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out)
        return log_probs
cbow = CBOWModel(VOCAB_SIZE,EMBED_DIM,CONTEXT_SIZE)

In [49]:
cbow(vecs[0])

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not numpy.ndarray

In [47]:
vecs = df["Vector"]