I had the idea to apply distributed word vectors ([word2vec]( https://en.wikipedia.org/wiki/Word2vec)) to this dataset. 

Word2vec, at a very high level, is an algorithm that learns the relationships between words from their context (neighbouring words) and encodes those relationships in vectors. Using these vectors, we can cluster the words in our vocabulary, or even do arithmetic on them. The classic example of the latter is "king - man + woman = queen".
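To make the "operations on vectors" idea concrete, here is a minimal sketch using hand-made 3-dimensional toy vectors. The numbers and the helper names (`cosine`, `analogy`) are invented for illustration; real word2vec vectors are learned from data and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (hypothetical values, NOT learned by word2vec).
# The three dimensions can be read loosely as: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, and return
    the closest remaining word by cosine similarity."""
    dims = len(vectors[positive[0]])
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Exclude the input words themselves from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


result = analogy(positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy values, `king - man + woman` lands closest to `queen`; the same add-and-subtract idea is what we will try on ingredients later.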

Word2vec learns with a shallow, two-layer neural network and usually works best with huge datasets (billions of words), but we will see how it performs on the cooking dataset, where each recipe will be a sentence. One of the best features of this algorithm, published by Google, is its speed: earlier neural language models had been proposed, but they were extremely expensive in CPU time. If you want more detailed information, I strongly suggest reading [here](https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiU5PmPoOPSAhVn5IMKHcIUDmIQFggaMAA&url=https%3A%2F%2Fpapers.nips.cc%2Fpaper%2F5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf&usg=AFQjCNFvn2t3S41dxIocYbx5EpeOwmjXVQ&sig2=IxYxjFBtWI_BkYLKymPAsw&bvm=bv.149760088,d.amc), [here](https://www.quora.com/How-does-word2vec-work) and [here](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/). Now let's tackle our dataset.
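Before loading anything, here is a rough sketch of what "each recipe is a sentence" means for training. Word2vec looks at each word together with its neighbours inside a context window; the hypothetical helper `context_pairs` below lists those (target, context) pairs for a made-up recipe. It is a simplification: real implementations also subsample frequent words and vary the window size, which this sketch ignores.

```python
def context_pairs(sentence, window):
    """Return every (target, context) pair where the context word lies
    within `window` positions of the target word."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


# A made-up recipe, treated as a "sentence" of ingredient tokens
recipe = ["feta cheese", "olive oil", "tomatoes", "oregano"]
pairs = context_pairs(recipe, window=1)

for target, context in pairs:
    print("{} -> {}".format(target, context))
```

With a window as large as the recipe itself, every ingredient ends up paired with every other ingredient of that recipe, which suits recipes because the order of ingredients carries little meaning.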


In [None]:
from __future__ import print_function

# Handle data
import json
import operator
import collections
import re

# Handle table-like data 
import numpy as np
import pandas as pd

# Model Algorithms
# TensorFlow would also work; there are multiple implementations of word2vec
from gensim.models import word2vec

# Modelling helpers (t-SNE, used later to project the vectors to 2-D)
from sklearn.manifold import TSNE

# Visualisation
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
%matplotlib inline

In [None]:
# Load the dataset
# JSON fields per recipe: cuisine, id number and ingredients (list)
with open('../input/train.json', 'r') as f:
    trainrecipts = json.load(f)

## General Exploration of the Dataset

Although this dataset is probably quite clean, a general exploration of the data is a really good habit, no matter what kind of model you will apply to it. Plotting a few frequencies and means can reveal all sorts of potential problems: bias, typos, etc.

In [None]:
# Quick and dirty code to collect ingredients and cuisines into flat lists
raw_ingredients = list()

for recipt in trainrecipts:
    for ingredient in recipt[u'ingredients']:
        raw_ingredients.append(ingredient.strip())
        

raw_cuisines = list()
for recipt in trainrecipts:
    raw_cuisines.append(recipt[u'cuisine'].strip())



In [None]:

# use Counter to get frequencies 
counts_ingr = collections.Counter(raw_ingredients)
counts_cuis = collections.Counter(raw_cuisines)

### Cuisines

In [None]:
# This gives us an idea of what our corpus of
# ingredients looks like
print('Size Ingredients dataset (with repetition):  \t{}'.format(len(raw_ingredients)))
print('Unique Ingredients dataset: \t\t\t{}'.format(len(counts_ingr)))

# This gives the distribution of cuisines, which is indirect
# information about the ingredients
print('Total # of recipes \t\t\t\t{}'.format(len(raw_cuisines)))
print('Total # of Cuisines \t\t\t\t{}'.format(len(counts_cuis)))



In [None]:
# top 10
counts_cuis.most_common(10)

In [None]:
# Distribution 

print(np.mean(list(counts_cuis.values())))
print(np.std(list(counts_cuis.values())))

In [None]:
# Let's plot this, sorted by frequency
x_cu = [cu for cu, frq in counts_cuis.most_common()]
y_frq = [frq for cu, frq in counts_cuis.most_common()]
fbar = sns.barplot(x = x_cu, y = y_frq)
# xlabels
for item in fbar.get_xticklabels():
    item.set_rotation(90)

For instance, as we can see in the first plot, Italian and Mexican recipes represent more than a third of the entire dataset, so this will probably affect how our vectors form. It is good to keep this in mind for this or any further model we apply to this dataset. Let's check if there is a bias in the size of the recipes.
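As a side note, the share taken by the most frequent cuisines can be read straight from a `Counter` instead of from the plot. The sketch below uses made-up toy labels and a hypothetical helper `top_share`; on the real data the same function could be called on `counts_cuis`.

```python
import collections

# Toy cuisine labels (invented for illustration, not the real dataset)
toy_cuisines = ["italian"] * 4 + ["mexican"] * 3 + ["indian"] * 2 + ["thai"]
toy_counts = collections.Counter(toy_cuisines)


def top_share(counts, k):
    """Fraction of all items that belong to the k most common keys."""
    total = sum(counts.values())
    top = sum(freq for _, freq in counts.most_common(k))
    return top / float(total)


share = top_share(toy_counts, 2)
print("Top 2 cuisines cover {:.0%} of the toy recipes".format(share))
```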



### Ingredients

Another interesting parameter is the size of the recipes: how long are they? Is there any bias?

In [None]:
# init a dict with an empty list for each cuisine
num_ingredients = dict(zip(counts_cuis.keys(), [list() for x in counts_cuis.keys()]))
for recipt in trainrecipts:
    # append the number in the list
    num_ingredients[recipt['cuisine']].append(len(recipt['ingredients']))

len(num_ingredients)

In [None]:
for cu, frq in num_ingredients.items():

    print('{}    \t\t{:.2f}'.format(cu, np.mean(frq)))

In [None]:
x_cu = [cu for cu, frq in num_ingredients.items()]
y_frq = [np.mean(frq) for cu, frq in num_ingredients.items()]
err = [np.std(frq) for cu, frq in num_ingredients.items()]
fbar = sns.barplot(x = x_cu, y = y_frq, yerr=err)
# xlabels
for item in fbar.get_xticklabels():
    item.set_rotation(90)

Well, in terms of size, the recipes appear to be more similar across cuisines. Now let's focus on the ingredients. As I mentioned above, I guess this dataset is really clean, or at least cleaner than a real-world dataset, and I do not expect to need any pre-processing. Also, full disclosure: I did not check any of the models submitted to Kaggle, and my intention is to build the word2vec model as a proof of concept.

In [None]:
# Dispersion of the frequencies Ingredients
print(np.mean(list(counts_ingr.values())))
print(np.std(list(counts_ingr.values())))



The frequency of the ingredients presents a similar scenario: a few ingredients are tremendously popular. That makes sense, since some ingredients, such as salt or water, are common in almost any recipe. Let's look at the percentiles.

In [None]:
# This is too big to plot, let's check the percentiles
print(np.median(list(counts_ingr.values())))
print(np.percentile(list(counts_ingr.values()), [25., 50., 75., 99.]))

Half of the ingredients appear only 4 times or fewer in the dataset, which is far less than I expected. Let's check the most popular ones.
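A long tail like this matters if rare ingredients are later filtered out by a minimum count. The sketch below, with invented toy counts and a hypothetical helper `vocabulary_kept`, shows how such a filter can drop a large share of the distinct ingredients while keeping almost all of the occurrences.

```python
import collections

# Toy ingredient counts (invented numbers, not the real dataset)
toy_counts = collections.Counter({
    "salt": 50, "water": 30, "olive oil": 20,
    "sumac": 2, "yuzu": 1, "ramps": 1,
})


def vocabulary_kept(counts, min_count):
    """Return (share of distinct ingredients kept, share of occurrences kept)
    when ingredients seen fewer than `min_count` times are dropped."""
    kept = [word for word, freq in counts.items() if freq >= min_count]
    kept_tokens = sum(counts[word] for word in kept)
    word_share = len(kept) / float(len(counts))
    token_share = kept_tokens / float(sum(counts.values()))
    return word_share, token_share


word_share, token_share = vocabulary_kept(toy_counts, min_count=3)
print("distinct ingredients kept: {:.0%}".format(word_share))
print("occurrences kept:          {:.0%}".format(token_share))
```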

In [None]:
# top 15
counts_ingr.most_common(15)

For a few ingredients, like salt and water, it makes a lot of sense that they are highly frequent, but the presence of olive oil among these omnipresent ingredients makes me think it is an artefact of the dataset's bias towards Italian cooking.

In [None]:
# Tail 50
counts_ingr.most_common()[-50:]

There are some very specific ingredients... I expect that some of those are typos, or just variants of other ingredients. Also notice that the same ingredient can appear in different forms in the dataset, e.g. garlic and garlic cloves. First, a quick search for parentheses or similar symbols that raise a red flag for typos or weird writing.


In [None]:
symbols = list()

for recipt in trainrecipts:
    for ingredient in recipt['ingredients']:
        # Flag ingredients containing a parenthesis or another unusual symbol.
        # re.search scans the whole string (re.match only tests its start).
        if re.search(r"[(@$?]", ingredient):
            symbols.append(ingredient)

print(len(symbols))
counts_symbols = collections.Counter(symbols)
counts_symbols.most_common(20)

Well, I guess some pre-processing could be good, but let's see how our model behaves. Let's train the neural network with an almost raw version of the dataset.
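For reference, fuller pre-processing could look something like the sketch below. It is not what this notebook uses: the helper `normalise` and the `ALIASES` table are hypothetical, and a real alias table would have to be built by inspecting the data.

```python
import re

# Hypothetical alias table mapping variants to one canonical name
ALIASES = {"garlic cloves": "garlic"}


def normalise(ingredient):
    """Lowercase, drop parenthesised notes such as "(10 oz.)",
    collapse whitespace, then apply the alias table."""
    text = ingredient.lower()
    text = re.sub(r"\([^)]*\)", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return ALIASES.get(text, text)


print(normalise("(10 oz.) Frozen Spinach"))
print(normalise("Garlic  cloves"))
```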

# Word2Vec

In [None]:
sentences = list()
# Each recipe becomes one "sentence": a list of ingredient tokens

for recipt in trainrecipts:
    clean_recipt = list()
    for ingredient in recipt['ingredients']:
        # Minimal preprocessing: remove package sizes such as "(10 oz.)"
        # and a few preparation words from the ingredient name
        ingredient =  re.sub(r'\(.*oz.\)|crushed|crumbles|ground|minced|powder|chopped|sliced',
                             '', 
                             ingredient)
        clean_recipt.append(ingredient.strip())
    sentences.append(clean_recipt)
        
len(sentences)

In [None]:
# Set values for the word2vec parameters
num_features = 300    # Word vector dimensionality
min_word_count = 3    # Drop ingredients seen fewer than 3 times
                      # (about half appear 4 times or fewer)
num_workers = 4       # Number of worker threads
context = 10          # Context window size;
                      # let's use the average recipe size
downsampling = 1e-3   # Threshold above which higher-frequency
                      # words are randomly downsampled

# Initialize and train the model
# (gensim < 4.0 API; from gensim 4.0 on, `size` is called `vector_size`)
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient.
# (init_sims is deprecated from gensim 4.0 on.)
model.init_sims(replace=True)

Once the model is trained, we can ask it a few questions, for example what is similar to feta cheese. All the ingredients returned are things you would expect to find with feta cheese, and they look like they belong to Greek cuisine.

In [None]:
## Results

In [None]:
model.wv.most_similar(u'feta cheese')

In [None]:
model.wv.similarity('broccoli', 'bacon')

In [None]:
model.wv.similarity('broccoli', 'carrots')

In [None]:
model.wv.similarity('broccoli', u'mushrooms')

In [None]:
#['garlic', 'onion'], ['olive oil']
# Analogy: a is to b as x is to ?  (vector b - a + x)
x = 'basil'
b = 'tomato sauce'
a = 'pasta'
predicted = model.wv.most_similar(positive=[x, b], negative=[a])[0][0]
print("{} is to {} as {} is to {}".format(a, b, x, predicted))

In [None]:
# Analogy: a is to b as x is to ?  (vector b - a + x)
x = 'chicken'
b = 'broccoli'
a = 'bacon'
predicted = model.wv.most_similar(positive=[x, b], negative=[a])[0][0]
print("{} is to {} as {} is to {}".format(a, b, x, predicted))

In [None]:
model.wv.most_similar_cosmul(positive=['chili', u'meat'], negative=['tomato sauce'])

# Viz with t-SNE

In [None]:
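# Vocabulary of the trained model, sorted alphabetically
# (gensim < 4.0 API; from gensim 4.0 on, use model.wv.key_to_index)
corpus = sorted(model.wv.vocab.keys())
# Stack the word vectors into a (n_words, num_features) matrix
X = np.vstack([model.wv[v] for v in corpus])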

In [None]:
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)

In [None]:
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])



It looks like there are some clusters of ingredients, but it is difficult to say anything else without adding labels or colors. Let's start easy and color each ingredient by the cuisine where it is most frequent. It may not be the best way, but it is one of the fastest approaches to test whether this is working. I will normalize the frequency by the number of recipes in each cuisine.
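The normalization matters because the cuisines are so unevenly represented. The toy sketch below (all numbers invented, `dominant_cuisine` is a hypothetical helper) shows an ingredient whose raw count is highest in the biggest cuisine but whose per-recipe rate is highest in a small one.

```python
import collections

# Toy data (invented): number of recipes and ingredient counts per cuisine
toy_recipes = {"italian": 100, "greek": 10}
toy_ingredients = {
    "italian": collections.Counter({"feta cheese": 5, "basil": 60}),
    "greek": collections.Counter({"feta cheese": 4, "basil": 1}),
}


def dominant_cuisine(ingredient):
    """Cuisine with the highest count of `ingredient` per recipe."""
    best, best_rate = None, 0.0
    for cuisine, counter in toy_ingredients.items():
        rate = counter[ingredient] / float(toy_recipes[cuisine])
        if rate > best_rate:
            best, best_rate = cuisine, rate
    return best


# Raw counts favour italian (5 vs 4), per-recipe rates favour greek (0.05 vs 0.4)
print(dominant_cuisine("feta cheese"))
print(dominant_cuisine("basil"))
```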

In [None]:
# Count how often each ingredient appears in every cuisine
track_ingredients = dict(zip(counts_cuis.keys(), [list() for x in counts_cuis.keys()]))
for recipt in trainrecipts:
    clean_recipt = list()
    for ingredient in recipt['ingredients']:
        # Apply the same minimal preprocessing used to build the sentences,
        # including the "(10 oz.)"-style descriptions, so that the names
        # match the model vocabulary
        ingredient = re.sub(r'\(.*oz.\)|crushed|crumbles|ground|minced|powder|chopped|sliced',
                            '',
                            ingredient)
        clean_recipt.append(ingredient.strip())
        
    track_ingredients[recipt['cuisine']].extend(clean_recipt)

for label, tracking in track_ingredients.items():
    track_ingredients[label] = collections.Counter(tracking)

In [None]:
def return_most_popular(v):
    cuisine = None
    record = 0
    for label, tracking in track_ingredients.items():
        norm_freq = float(tracking[v]) / float(counts_cuis[label])
        if norm_freq > record:
            cuisine = label
            record = norm_freq
    return cuisine

In [None]:
track_2color = {u'irish':"#000000", # black
                u'mexican':"#FFFF00", # yellow
                u'chinese':"#1CE6FF", # cyan
                u'filipino': "#FF34FF", # pink
                u'vietnamese':"#FF4A46", # red
                u'spanish':"#FFC300",  # amber
                u'japanese':"#006FA6", # ocean blue
                u'moroccan':"#A30059", # purple
                u'french':"#FFDBE5",  # light pink
                u'greek': "#7A4900",  # brown
                u'indian':"#0000A6", # dark blue
                u'jamaican':"#63FFAC", # mint green
                u'british': "#B79762", # light brown
                u'brazilian': "#EEC3FF", # lilac
                u'russian':"#8FB0FF", # light blue
                u'cajun_creole':"#997D87", # greyish mauve
                u'thai':"#5A0007", # dark red
                u'southern_us':"#809693", # grey-green
                u'korean':"#FEFFE6", # light yellow
                u'italian':"#1B4400"} # dark green

color_vector = list()
for v in corpus:
    cuisine = return_most_popular(v)
    color_vector.append(track_2color[cuisine])

In [None]:
# assemble the legend
lgend = list()
for l, c in track_2color.items():
    lgend.append(mpatches.Patch(color=c, label=l))

In [None]:
sns.set_context("poster")
fig, ax = plt.subplots(figsize=(18,18))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=color_vector, alpha=.6, s=60)
plt.legend(handles=lgend)