# Creating FastText word embeddings for recipe titles

We can use FastText to create word vectors for the words that are used in recipe titles.
With the skipgram model using character n-grams, FastText can be useful to rank semantically close words with a similar word vector, even if they are written slightly differently.
Thus, this vector representation works with typos as well.

First, we need to generate a corpus from the recipe dataset and then train the FastText model on it:

In [1]:
import json

with open('data/kochbar_03.json') as data_file:    
    kochbar = json.load(data_file)

In [2]:
# How does the data look?
kochbar[0]

{'avg_rating': '4,83',
 'calories': '933 (223)',
 'clicks': '101092',
 'comment_number': '59',
 'date': '26.07.2009',
 'difficulty': 'leicht',
 'favorites': '925',
 'ingredients': '500   gr. Hackfleisch    3    Zucchini    1   Btl. Spätzle, fertige   Salz, Pfeffer   Knoblauchzehe    1   Pck. Kräuterfrischkäse, 200gr.    1   Becher süße Sahne    2   Eßl. frische Petersilie, gehackt   Salz, Knoblauchgranulat    4    Eier',
 'name': 'Zucchini-Hackauflauf mit Frischkäse und Spätzle ',
 'number_votes': '141',
 'preparation': '    1  Das  Hackfleisch mit etwas Margarine in der Pfanne krümelig anbraten.\rZucchini waschen, halbieren und mit einem Teelöffel die Kerne entfernen, dann in Scheibchen schneiden. dazugeben und kurz mitbraten. Salzen und Pfeffern,  Knobi auspressen, dazugeben und abschmecken.2  Alles in eine Auflaufform geben und die Spätzle unterheben.\rDen Kräuterfrischkäse mit der Sahne und den Eiern glattrühren, die gehackte Petersilie mit einrühren. Mit Salz und Pfeffer und Knobl

In [4]:
text_file = open("data/stopwords.txt", "r", encoding = "ISO-8859-1")
stopwords  = text_file.readlines()
for i in range(len(stopwords)):
    stopwords[i] = stopwords[i].replace('\n','')

more_stopwords = ('Art', 'a', 'la', 'de', 'á', 'à', 'ohne', 'mal', 'lecker', 'Mein', 'dazu', 
                  'ala', 'au', 'alla', 'ganz', 'schnelle', 'al', 'di', 'Die', 'Der', 'Das', 
                  'MIT', 'Mit', 'a’la', 'Ein', 'so', 'u', 'e', 'E', 'nur', 'D', 'A', 'Co', 
                  'I', 'um', 'ihn', 'and', 'Aus', 'II', 'Nr', 'Style', 'gut', 'ca', 'UND', 'con', 
                  'NUR', 'to', 'go', 'fettarm', 'Low', 'Carb', 'Food', 'original', 'Original', 
                  'super', 'sooo')
for w in more_stopwords:
    stopwords.append(w)
    
stopwords

['aber',
 'als',
 'am',
 'an',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'dadurch',
 'daher',
 'darum',
 'das',
 'daß',
 'dass',
 'dein',
 'deine',
 'dem',
 'den',
 'der',
 'des',
 'dessen',
 'deshalb',
 'die',
 'dies',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'du',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'er',
 'es',
 'euer',
 'eure',
 'für',
 'hatte',
 'hatten',
 'hattest',
 'hattet',
 'hier',
 'hinter',
 'ich',
 'ihr',
 'ihre',
 'im',
 'in',
 'ist',
 'ja',
 'jede',
 'jedem',
 'jeden',
 'jeder',
 'jedes',
 'jener',
 'jenes',
 'jetzt',
 'kann',
 'kannst',
 'können',
 'könnt',
 'machen',
 'mein',
 'meine',
 'mit',
 'muß',
 'mußt',
 'musst',
 'müssen',
 'müßt',
 'nach',
 'nachdem',
 'nein',
 'nicht',
 'nun',
 'oder',
 'seid',
 'sein',
 'seine',
 'sich',
 'sie',
 'sind',
 'soll',
 'sollen',
 'sollst',
 'sollt',
 'sonst',
 'soweit',
 'sowie',
 'und',
 'unser',
 'unsere',
 'unter',
 'vom',
 'von',
 'vor',
 'wann',
 'warum',
 'was',
 'weit

In [5]:
# we need to clean the titles from characters, stopwords, etc.
titlelist = []
for i in range(len(kochbar)):
    title = kochbar[i]['name']
    for char in ['"','-','=','!','(',')','.','♥',',',':','~','„','“','/','–','&','+',';','*','☆']:
        title = title.replace(char,' ')
    tmp_list = list(filter(None, title.split(" ")))   # filter empty elements
    for stop in stopwords:
        tmp_list = list(filter(lambda k: stop != k, tmp_list))  # filter stopwords
    titlelist.append(tmp_list) 
titlelist

[['Zucchini', 'Hackauflauf', 'Frischkäse', 'Spätzle'],
 ['Japanischer', 'Soufflé', 'Käsekuchen'],
 ['Chili', 'Carne', 'Spezial'],
 ['Kartoffelgratin', 'klassisch'],
 ['Schnellste', 'Bienenstich'],
 ['Pfannkuchen', 'Grundrezept'],
 ['Eierlikörkugeln'],
 ['Grundrezept', 'Muffins'],
 ['Käsesuppe', 'Hackfleisch', 'Porree'],
 ['Amerikanische', 'Pancakes'],
 ['Allerbester', 'Käsekuchen'],
 ['Nutella', 'Ecken', 'Banane'],
 ['Bruschetta', 'Tomaten'],
 ['Raffaello', 'Pralinen'],
 ['Spaghetti', 'Bolognese'],
 ['Pfannkuchen'],
 ['Carbonara', 'Rezept', 'Calabrien', 'Sahne'],
 ['American', 'Brownies'],
 ['Pizza', 'Schnecken'],
 ['Flammkuchen', 'Schneller', 'Flammkuchen', 'Hefe'],
 ['Baileys', 'Muffins'],
 ['Knusprige', 'Knoblauch', 'Kartoffeln'],
 ['Nudelsalat', 'Hausfrauen'],
 ['Überbackene', 'Ofen', 'Kartoffeln', 'Cheddar', 'Hähnchen'],
 ['Schneller', 'Nudelauflauf'],
 ['Kartoffelsuppe', 'Würstchen'],
 ['Pizza', 'Waffeln'],
 ['Käsespätzle'],
 ['Schichtsalat'],
 ['Allerbester', 'Rhabarberkuchen'],

In [6]:
# Add all titles to a single string
all_titles = ""
for title in titlelist:
    for w in title:
        all_titles += w+" "
    all_titles += " \n "

In [7]:
titlefile = open('word_embeddings/titles.txt', 'w')
titlefile.write(all_titles)

11487663

### Cleaning the preparation steps before training

It usually provides better word embeddings if the model is trained on longer texts instead of single words, as the sequences in which the words occur influences their position in the word embedding's feature space.

However, it is hard to say if the vectors will provide a much higher benefit for this project when trained on the preparation steps instead of the titles only. Thus, we will need to test both models before deciding.

When preparing the text from the preparation steps, we do not need to filter as much as we did in the titles. Stopwords can now be useful to identify relations between entities, as they are part of continuous text instead of short titles.

In [8]:
preplist = []
for i in range(len(kochbar)):
    prep = kochbar[i]['preparation']
    for char in ['"','-','=','!','(',')','.','♥',',',':','~','„','“','/','–','&','+',';','*','☆','\r']:
        prep = prep.replace(char,' ')
    tmp_list = list(filter(None, prep.split(" ")))   # filter empty elements
    preplist.append(tmp_list) 
preplist

[['1',
  'Das',
  'Hackfleisch',
  'mit',
  'etwas',
  'Margarine',
  'in',
  'der',
  'Pfanne',
  'krümelig',
  'anbraten',
  'Zucchini',
  'waschen',
  'halbieren',
  'und',
  'mit',
  'einem',
  'Teelöffel',
  'die',
  'Kerne',
  'entfernen',
  'dann',
  'in',
  'Scheibchen',
  'schneiden',
  'dazugeben',
  'und',
  'kurz',
  'mitbraten',
  'Salzen',
  'und',
  'Pfeffern',
  'Knobi',
  'auspressen',
  'dazugeben',
  'und',
  'abschmecken',
  '2',
  'Alles',
  'in',
  'eine',
  'Auflaufform',
  'geben',
  'und',
  'die',
  'Spätzle',
  'unterheben',
  'Den',
  'Kräuterfrischkäse',
  'mit',
  'der',
  'Sahne',
  'und',
  'den',
  'Eiern',
  'glattrühren',
  'die',
  'gehackte',
  'Petersilie',
  'mit',
  'einrühren',
  'Mit',
  'Salz',
  'und',
  'Pfeffer',
  'und',
  'Knoblauchgranulat',
  'abschmecken',
  'alles',
  'über',
  'die',
  'Hackfleischmischung',
  'geben',
  '3',
  'Im',
  'vorgeheizten',
  'Backofen',
  'mittl',
  'Schiene',
  'auf',
  '200',
  'Grad',
  'ca',
  '30',
 

In [9]:
import re
def containsNumbers(string):
    return bool(re.search(r'\d', string))

for prep in preplist:
    prep = [w for w in prep if not containsNumbers(w)] 

In [10]:
# Add all preparation steps and words to a single string
all_preps = ""
for prep in preplist:
    for word in prep:
        all_preps += word+" "
    all_preps += " \n \n "

In [11]:
prepfile = open('word_embeddings/preps.txt', 'w')
prepfile.write(all_preps)

266260787

## Train FastText models

Now we can use the generated corpuses to train a FastText vector representations. For simplicity, we will trainwith the recommended default parameters and only adapt the number of threads to the 8 cores of the computer that has been used to train these models.

In [5]:
import fasttext 
title_model = fasttext.skipgram('word_embeddings/titles.txt', 'word_embeddings/title_model', thread=8)

In [7]:
import fasttext 
prep_model = fasttext.skipgram('word_embeddings/preps.txt', 'word_embeddings/prep_model', thread=8)

## Load models

In order to use the word embeddings in our implementation, we will need to load the models from their files after training them once.

In [12]:
import fasttext
title_model = fasttext.load_model('word_embeddings/title_model.bin')
prep_model = fasttext.load_model('word_embeddings/prep_model.bin')

## Evaluation

To choose between the two models that we have trained, we will evaluate their vectors for a chosen set of words that are either similar or different. Using the cosine similarity, we can measure how close or distant those words have been embedded in the model:

In [13]:
print("Similar words:")
print(title_model.cosine_similarity('Schinken', 'Salami'))
print(title_model.cosine_similarity('Filet', 'Steak'))
print(title_model.cosine_similarity('braten', 'backen'))
print(title_model.cosine_similarity('Fisch', 'Lachs'))
print(title_model.cosine_similarity('Paprika', 'Peperoni'))

print("\n Different words:")
print(title_model.cosine_similarity('Gurke', 'Leber'))
print(title_model.cosine_similarity('Salz', 'Kuchen'))
print(title_model.cosine_similarity('Chili', 'Milch'))

Similar words:
0.745956040107
0.683937946663
0.261322767337
0.650748116
0.735996235529

 Different words:
0.340309920804
0.218356984463
0.414159474427


In [14]:
print("Similar words:")
print(prep_model.cosine_similarity('Schinken', 'Salami'))
print(prep_model.cosine_similarity('Filet', 'Steak'))
print(prep_model.cosine_similarity('braten', 'backen'))
print(prep_model.cosine_similarity('Fisch', 'Lachs'))
print(prep_model.cosine_similarity('Paprika', 'Peperoni'))

print("\n Different words:")
print(prep_model.cosine_similarity('Gurke', 'Leber'))
print(prep_model.cosine_similarity('Salz', 'Kuchen'))
print(prep_model.cosine_similarity('Chili', 'Milch'))

Similar words:
0.849591554368
0.665286910104
0.417357009985
0.7826654159
0.783189233043

 Different words:
0.293415141586
0.00538366289464
0.265963553708


### Result

We can see that using the cosine similarity of word vectors, the model trained on the preparation steps rates similar words closer and different word more distant than the other model, thus is a better fit as a language model.
Consecutively, we will use prep_model in further analysis.