## RecipeToVec
Word2vec using gensim: https://radimrehurek.com/gensim/models/word2vec.html
Here we create word vectors from the recipes.
Each document is one recipe: the words of its name, its list of cleaned ingredients, and its verbs.
We use the gensim model to create the similarity matrix (cosine similarity).
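
As a reminder of what cosine similarity measures, here is a minimal sketch in plain Python. The three toy vectors are made up for illustration and do not come from the trained model; returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (hypothetical, not taken from the model)
stew = [1.0, 2.0, 0.0]
soup = [2.0, 4.0, 0.0]   # same direction as stew
cake = [0.0, 0.0, 3.0]   # orthogonal to stew

print(cosine_similarity(stew, soup))
print(cosine_similarity(stew, cake))
```

Vectors that point in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.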

In [112]:
import numpy as np
import pandas as pd
from collections import defaultdict
#from gensim.models.word2vec import Word2Vec
from gensim import corpora, models
import pickle
import os
import logging

In [113]:
# Enable logging (for gensim)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

cleanerIngredientsDict=pd.read_pickle('cleanerIngredients.pkl')  
verbsDict=pd.read_pickle('verbs.pkl') 

In [114]:
print len(cleanerIngredientsDict)

45245


In [115]:
allRecipes=pd.read_pickle('originalDF.pkl')

In [137]:
recipes=[]
names=[]
for idx, row in allRecipes.iterrows():
    # Assumption: both pickled dicts are keyed by the recipe URL
    url = row["url"]
    name = row["name"].lower().split()
    verbs = verbsDict[url]
    ingredients = list(cleanerIngredientsDict[url])
    # One document per recipe: name words + ingredients + verbs
    recipes.append(name + ingredients + verbs)
    names.append(row["name"])

In [117]:
allRecipes.sample(1)

Unnamed: 0,categories,cookingTime,description,ingredients,instructionSteps,name,rating,ratingCount,url,cookingTimeMinutes,cleanedIngredients,verbs
2367,"[Indian Recipes, Asian Recipes, Drinks, Everyd...",PT5M,,"[2 (15.25 ounce) cans mango pulp, or mango sli...","[Pour mangos, yogurt, milk, and ice into the b...",Restaurant Style Mango Lassi,4.26,26,http://allrecipes.com/recipe/54319/restaurant-...,5.0,"[mango pulp mango juice, plain yogurt, milk, ice]","[pour, blend]"


In [118]:
# Concatenate phrases into single tokens
recipes=[[w.replace(' ','_') for w in recipe] for recipe in recipes]

In [119]:
print 'Total recipes loaded: %s ' % len(recipes)
print recipes[0]

Total recipes loaded: 89061 
[u'fresco', u'salsa', u'barbeque_sauce', u'bratwurst', u'fill', u'inject', u'cook', u'turning', u'serve']


In [120]:
# Create a dictionary and save it
dictionary = corpora.Dictionary(recipes)
dictionary.save('recipe2vec.dict')
print(dictionary)
print "The token ID of milk is: %s " % dictionary.token2id["milk"] 

2017-08-20 21:19:06,392 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-08-20 21:19:06,609 : INFO : adding document #10000 to Dictionary(4475 unique tokens: [u'chicken-coconut', u"nick's", u'zavioli', u'chilean-style', u"sean's"]...)
2017-08-20 21:19:06,822 : INFO : adding document #20000 to Dictionary(7460 unique tokens: [u'caramel-nut', u'blast-off', u"jackie's", u'clapshot', u'charoset']...)
2017-08-20 21:19:07,034 : INFO : adding document #30000 to Dictionary(9566 unique tokens: [u'caramel-nut', u'blast-off', u"jackie's", u'clapshot', u'charoset']...)
2017-08-20 21:19:07,238 : INFO : adding document #40000 to Dictionary(10483 unique tokens: [u'caramel-nut', u'blast-off', u"jackie's", u'clapshot', u'cajun-crusted']...)
2017-08-20 21:19:07,497 : INFO : adding document #50000 to Dictionary(11750 unique tokens: [u'caramel-nut', u'blast-off', u"jackie's", u'clapshot', u'cajun-crusted']...)
2017-08-20 21:19:07,762 : INFO : adding document #60000 to Dictionary(12328 un

Dictionary(14204 unique tokens: [u'caramel-nut', u'blast-off', u"jackie's", u'clapshot', u'cajun-crusted']...)
The token ID of milk is: 880 


In [121]:
# Create a corpus and save it
corpus = [dictionary.doc2bow(recipe) for recipe in recipes]
corpora.MmCorpus.serialize('recipe2vec.mm', corpus)

2017-08-20 21:19:31,630 : INFO : storing corpus in Matrix Market format to recipe2vec.mm
2017-08-20 21:19:31,633 : INFO : saving sparse matrix to recipe2vec.mm
2017-08-20 21:19:31,634 : INFO : PROGRESS: saving document #0
2017-08-20 21:19:31,702 : INFO : PROGRESS: saving document #1000
2017-08-20 21:19:31,784 : INFO : PROGRESS: saving document #2000
2017-08-20 21:19:31,837 : INFO : PROGRESS: saving document #3000
2017-08-20 21:19:31,887 : INFO : PROGRESS: saving document #4000
2017-08-20 21:19:31,941 : INFO : PROGRESS: saving document #5000
2017-08-20 21:19:32,000 : INFO : PROGRESS: saving document #6000
2017-08-20 21:19:32,085 : INFO : PROGRESS: saving document #7000
2017-08-20 21:19:32,166 : INFO : PROGRESS: saving document #8000
2017-08-20 21:19:32,219 : INFO : PROGRESS: saving document #9000
2017-08-20 21:19:32,271 : INFO : PROGRESS: saving document #10000
2017-08-20 21:19:32,347 : INFO : PROGRESS: saving document #11000
2017-08-20 21:19:32,432 : INFO : PROGRESS: saving document #1
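
To make the dictionary and corpus steps above concrete, here is a small pure-Python sketch of the same idea: map each token to an integer ID, then turn each recipe into a sorted list of `(token_id, count)` pairs. The helper names `build_token2id` and `doc_to_bow` and the toy recipes are hypothetical, and IDs are assigned in first-seen order here, which may differ from the order gensim uses.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer ID to every distinct token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy recipes (made up for illustration)
toy_recipes = [
    ["mango_pulp", "plain_yogurt", "milk", "ice", "pour", "blend"],
    ["milk", "sugar", "milk", "stir"],
]
token2id = build_token2id(toy_recipes)
bow = [doc_to_bow(doc, token2id) for doc in toy_recipes]

print(token2id["milk"])
print(bow[1])
```

The second toy recipe mentions `milk` twice, so its bag-of-words entry for that token has a count of 2.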

## Now let's build a model

In [41]:
if (os.path.exists("recipe2vec.dict")):
    dictionary = corpora.Dictionary.load('recipe2vec.dict')
    corpus = corpora.MmCorpus('recipe2vec.mm')
    print("Loaded dictionary and corpus from disk")
else:
    print("Error: Could not find dictionary \"recipe2vec.dict\"")

Loaded dictionary and corpus from disk


In [42]:
tfidf = models.TfidfModel(corpus, normalize=True)
corpus_tfidf = tfidf[corpus]

In [44]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100) # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->lsi
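
For intuition about what the TF-IDF step does to the bag-of-words counts, here is a minimal sketch. It assumes the weight is `tf * log2(N / df)` followed by L2 normalization of each document, which is our understanding of gensim's defaults with `normalize=True`; check the gensim documentation before relying on the exact formula. The function name `tfidf_weights` and the toy corpus are hypothetical.

```python
import math

def tfidf_weights(bow_corpus):
    """Reweight (token_id, count) pairs by tf * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents does each token appear?
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted = []
    for bow in bow_corpus:
        vec = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
        vec = [(tid, w) for tid, w in vec if w > 0.0]  # drop zero-weight tokens
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

# Toy corpus: token 0 appears in every document, so it carries no information
toy_corpus = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 1), (1, 1), (2, 1)],
]
toy_tfidf = tfidf_weights(toy_corpus)
for doc in toy_tfidf:
    print(doc)
```

A token that occurs in every recipe gets weight 0 and disappears, while tokens that occur in only a few recipes dominate the document vector.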

In [130]:
print('Training a Word2vec model...')
w2v_model = models.word2vec.Word2Vec(recipes, size=200, window=7, min_count=4, workers=4, hs=1, negative=0)

2017-08-20 21:23:54,348 : INFO : collecting all words and their counts
2017-08-20 21:23:54,350 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-08-20 21:23:54,385 : INFO : PROGRESS: at sentence #10000, processed 108347 words, keeping 4475 word types
2017-08-20 21:23:54,429 : INFO : PROGRESS: at sentence #20000, processed 217559 words, keeping 7460 word types
2017-08-20 21:23:54,471 : INFO : PROGRESS: at sentence #30000, processed 324660 words, keeping 9566 word types
2017-08-20 21:23:54,514 : INFO : PROGRESS: at sentence #40000, processed 431118 words, keeping 10483 word types


Training a Word2vec model...


2017-08-20 21:23:54,562 : INFO : PROGRESS: at sentence #50000, processed 541794 words, keeping 11750 word types
2017-08-20 21:23:54,607 : INFO : PROGRESS: at sentence #60000, processed 651033 words, keeping 12328 word types
2017-08-20 21:23:54,653 : INFO : PROGRESS: at sentence #70000, processed 760307 words, keeping 12917 word types
2017-08-20 21:23:54,698 : INFO : PROGRESS: at sentence #80000, processed 867107 words, keeping 13822 word types
2017-08-20 21:23:54,738 : INFO : collected 14204 word types from a corpus of 963451 raw words and 89061 sentences
2017-08-20 21:23:54,740 : INFO : Loading a fresh vocabulary
2017-08-20 21:23:54,768 : INFO : min_count=4 retains 5538 unique words (38% of original 14204, drops 8666)
2017-08-20 21:23:54,770 : INFO : min_count=4 leaves 947355 word corpus (98% of original 963451, drops 16096)
2017-08-20 21:23:54,791 : INFO : deleting the raw counts dictionary of 14204 items
2017-08-20 21:23:54,794 : INFO : sample=0.001 downsamples 17 most-common words


In [66]:
w2v_model['beef'][:10]

array([-0.11901543, -0.04799895, -0.01910873, -1.06104159,  2.0417552 ,
       -0.32504481,  0.0864769 , -0.2238549 , -0.82190663,  0.37192655], dtype=float32)

In [73]:
w2v_model['beef'][:10]

array([-0.81907874,  1.27890754,  0.15164381,  0.81708187,  0.68151563,
        0.05034285, -0.17278571,  1.18341053, -1.6170727 ,  0.20949043], dtype=float32)

In [107]:
w2v_model.wv.most_similar(positive=['stew','beef'],negative=['taco'], topn=3)

[(u'soup', 0.5903193950653076),
 (u'kugel', 0.47400134801864624),
 (u'medley', 0.4700500965118408)]
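
The query above is vector arithmetic followed by a cosine ranking. Here is a rough sketch of that idea on hand-made vectors: add the unit vectors of the positive words, subtract those of the negative words, and rank every other word by cosine similarity to the result. This is a simplified illustration, not gensim's implementation, and the toy vectors are invented so the outcome is easy to follow.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words
    scores = [(w, sum(q * x for q, x in zip(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 3-dimensional vectors, not taken from the trained model
toy = {
    "stew": [1.0, 1.0, 0.0],
    "beef": [1.0, 0.0, 1.0],
    "taco": [0.0, 0.0, 1.0],
    "soup": [1.0, 0.9, 0.0],
    "salsa": [0.0, 0.1, 1.0],
}
result = most_similar(toy, positive=["stew", "beef"], negative=["taco"], topn=1)
print(result)
```

In the toy data, subtracting `taco` cancels the third component that `beef` and `salsa` share, so `soup` ranks first.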

In [133]:
v=[sum([w2v_model[word] for word in cent if word in w2v_model]) for cent in recipes]
print len(v)
print len(v[0])

89061


In [136]:
print len(v[0])

200
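
One caveat with building recipe vectors by summing: Python's `sum([])` returns the integer `0`, so a recipe whose tokens were all dropped from the vocabulary would yield a scalar instead of a 200-dimensional vector. Here is a small sketch of a variant that always returns a vector of the right size. The function name `recipe_vector` and the toy word vectors are hypothetical, and plain lists stand in for the numpy arrays the model returns.

```python
def recipe_vector(tokens, word_vectors, dim):
    """Sum the vectors of all known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # token is not in the vocabulary (e.g. removed by min_count)
        total = [t + x for t, x in zip(total, vec)]
    return total

# Toy 2-dimensional word vectors (made up for illustration)
toy_vectors = {"milk": [1.0, 0.0], "blend": [0.5, 0.5]}

print(recipe_vector(["milk", "blend", "zavioli"], toy_vectors, dim=2))
print(recipe_vector(["zavioli"], toy_vectors, dim=2))
```

With this variant, every recipe vector has the same length, which matters for any later step that stacks the vectors into a matrix.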


In [None]:
# This runs a shell command from the notebook.
!pip install plotly

# Plotly imports.
import plotly.offline as plotly
plotly.init_notebook_mode()  # `plotly` already refers to plotly.offline here
import plotly.graph_objs as go


In [140]:
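n = 1000  # number of recipes to plot
# Assumption: plot the first two embedding dimensions of the first n recipe vectors,
# one marker per recipe, so that the hover text (names) lines up with the points.
data = [go.Scatter(x=[vec[0] for vec in v[:n]], y=[vec[1] for vec in v[:n]], text=names[:n],
                   mode='markers', textposition='bottom', hoverinfo='text')]
fig = go.Figure(data=data, layout=go.Layout(title="Word Embeddings", hovermode='closest'))
plotly.iplot(fig)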