# Bytes Into Baking

## Project Overview
- Food is one of the most common topics on the internet, with content published by everyone from big businesses such as *foodnetwork* and *allrecipes* to home chefs writing their own blogs.
- Food is trendy--keeping on top of food trends could be valuable to people who work in the food or food publishing industries.
- The purpose of this project is to gather recipe data from a variety of websites and then use supervised and unsupervised NLP models to discover useful information from the data.

## Project Phases
- Create a utility web scraper to capture recipes from a variety of different websites
- Use ML to compare and contrast recipes

## Goals
- Generate a list of target websites for a particular recipe class by pulling urls from Google search
- Develop a utility scraper to identify recipes and then grab pertinent information about the recipe from the recipe section--starting with instructions
- Test the results of the scraper on a supervised model that classifies the vectorized recipes
- Explore the possibility of identifying different groupings within a specific recipe class

## Tools and techniques used in this project
- **Tools**
> - Python, Beautiful Soup, Pandas, Numpy, Gensim
- **Visualization**
> - Matplotlib, Plotly
- **Techniques**
> - Web-scraping, K Fold cross validation, Multinomial Naive Bayes Classification, Non-negative Matrix Factorization (NMF)

## In this Notebook (phase two of the project)
- Corpus preprocessing and turning documents into vectors
- Using a supervised model (multinomial naive bayes) to predict recipe class (after target words like 'croissant' or 'puff pastry' have been removed from the feature set)
- Using unsupervised machine learning to look for similarities and differences between recipes of a certain class

In [1]:
import numpy as np
import pandas as pd
from numpy.linalg import svd
# import string

import matplotlib.pyplot as plt
%matplotlib inline

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
# from nltk.stem.porter import PorterStemmer
# from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans 
from sklearn.metrics.pairwise import linear_kernel

from sklearn.model_selection import train_test_split, KFold
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import confusion_matrix

In [2]:
%load_ext autoreload
%autoreload 2

## Step 1

### Corpus preprocessing
- Turned the recipe documents into vectors for a supervised machine learning classifier problem--can recipe instructions be used to predict recipe class?
- Text preprocessing pipeline included removing punctuation, tokenization, word lemmatization, removing stop words
- Additionally, removed target words (like croissant, puff pastry etc) that appear in the feature set to build a more robust model
- Converted documents into uni-gram count vectors
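The target-word removal described above operates on single tokens, so a multi-word target such as "pie crust" can only be caught before tokenization. The following is a minimal sketch of one way to do that; `strip_phrases` is a hypothetical helper that is not part of this notebook, and the sample sentence is made up for illustration.

```python
import re

def strip_phrases(text, phrases):
    """Remove whole-word occurrences of each phrase from text (hypothetical helper)."""
    # Longest phrases first so 'pie crust' is removed before the shorter 'pie'.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    cleaned = re.sub(pattern, " ", text.lower())
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())

sample = "Roll the pie crust thin then chill the puff pastry"
targets = ["pie", "pie crust", "puff pastry"]
result = strip_phrases(sample, targets)
print(result)
```

Running a step like this on the raw string, before tokenizing, would keep residual words such as "crust" or "pastry" from leaking the class label into the feature set.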

## Preprocessing functions and methods

In [3]:
wordnet = WordNetLemmatizer()
# porter = PorterStemmer()
# snowball = SnowballStemmer('english')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/salvir1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
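# Target words are compared against single tokens, so the multi-word entries ('pie crust', 'pie dough') never match.
titles_to_remove = ['ciambellone', 'ciambella','puff','croissant', 'croissants', 'pie crust', 'pie dough', 'pie', 'crescent', 'brioche']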

In [5]:
import re
def remove_punc(string:str) -> str:
    '''Given a string, removes all punctuation and returns the punctuation-free string'''
    return re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", string)

In [6]:
def tokenize(text):
    '''
    Tokenize a string and return a list of tokens.
    '''
    # word_tokenize already returns a list; the parameter is named text to avoid shadowing the built-in str
    return word_tokenize(text)

In [7]:
def lemmatize(doc):
    '''Takes in a doc and lemmatizes tokens in doc
    Parameters
    ----------
    doc: list of tokens
    
    Returns
    -------
    lemmatized tokens
    '''
    return [wordnet.lemmatize(tkn) for tkn in doc]

In [8]:
def rm_stop_words(doc, stops=set(stopwords.words('english'))):
    '''Takes in a doc and removes stop words
    Parameters
    ----------
    doc: list of tokens
    
    Returns
    -------
    Tokens with stop words removed
    '''
    return([w for w in doc if w not in stops])

In [9]:
def rm_title_words(doc, title_words):
    '''Takes in a doc and removes title words to allow algorithm to focus on other keywords
    Parameters
    ----------
    doc: list of tokens
    
    Returns
    -------
    Tokens with title words removed
    '''
    return([w for w in doc if w not in title_words])

In [10]:
def preprocess_corpus(content):
    '''
    Run the full preprocessing pipeline on every document: lowercase, remove punctuation,
    tokenize, lemmatize, remove stop words, and remove target (title) words.
    TODO: make flexible to allow for doing, or not doing, individual preprocessing steps.
    Parameters
    ----------
    content: an iterable of strings (e.g. a pandas Series), one per document
    Returns
    -------
    A list of lists: each list contains a tokenized version of the original string
    '''
    preprocessed = []
    for doc in content:
        step_1 = remove_punc(doc.lower())
        step_2 = tokenize(step_1)
        step_3 = lemmatize(step_2)
        step_4 = rm_stop_words(step_3)
        step_5 = rm_title_words(step_4, titles_to_remove)
        preprocessed.append(step_5)
    return preprocessed

In [11]:
# loading sample data to check functions
df_piecrust = pd.read_csv('data/us_piecrust.csv')
df_csnt = pd.read_csv('data/us_croissant.csv')
df_ciambellone = pd.read_csv('data/us_ciambellone.csv')
df_puff = pd.read_csv('data/us_puff.csv')
df_brioche = pd.read_csv('data/us_brioche.csv')
df = pd.concat([df_piecrust, df_csnt, df_ciambellone, df_puff, df_brioche], axis=0, ignore_index=True) 
df.columns = ['drop','url','instructions','recipe type']
df = df.drop('drop', axis = 1)
corpus = df['instructions']
y = df['recipe type']

In [12]:
corpus.iloc[81] # Instructions for Gourmetier croissant recipe

'The preparation of the croissants consists of two main phases: the initial dough, called détrempe, and the laminating, which consists of integrating the butter to the détrempe, forming alternate layers.\nThe difficulty in making croissants at home lies in manually creating thin and even layers of dough/butter in order to achieve a product with good volume and an open, honey-comb texture.\n\nTools\nFor making the dough and shaping the butter:\n\nClingfilm and freezer bag. It is important to correctly wrap the détrempe (and then the dough) in clingfilm and then put it inside a freezer bag to prevent it from drying out. It will also protect the dough against possible unpleasant smells inside the refrigerator.\nWaxed paper, pencil and ruler. To give butter the right shape in order to work with precision.\n\nFor laminating:\n\nCold tray. Before rolling, place a tray (better if steel or marble) inside the freezer so you can place in it croissants once cut and keep them cold before shaping.\

In [13]:
corpus

0      Mix the flour and salt together in a large bow...
1      Add&nbsp;flour, salt to a large mixing bowl.&n...
2      Add&nbsp;flour, salt to a large mixing bowl.&n...
3      Pour cold water into a small dish with a few p...
4      In the bowl of a food processor; add the flour...
                             ...                        
339    In a medium mixing bowl, whisk together the wa...
340    Combine 1/3 cup of milk and 1 tablespoon of fl...
341    In a small bowl, whisk the yeast with the butt...
342    Warm the milk, transfer to a bowl and crumble ...
343    In a glass measuring cup, combine one cup warm...
Name: instructions, Length: 344, dtype: object

### Preprocessing--data load and function calls

In [14]:
cleaned_tokenized = preprocess_corpus(corpus) # cleaned and tokenized
str_cleaned_tokenized = [" ".join(x) for x in cleaned_tokenized] # string version of cleaned and tokenized 

In [15]:
pd.Series(str_cleaned_tokenized).to_csv('data/str_cleaned_tokenized.csv', index=False)

## Step 2
## Multinomial naive bayes text classifier
- First pass: train test split to get a sense of how well the model appears to be interpreting the data
- Second pass: K Fold cross validation for a more robust approach
- MNB is usually applied to word count vectors
> - The distribution is parametrized by theta_y vectors for each class
> - theta_y is estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting
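The smoothed relative-frequency estimate mentioned above can be sketched in a few lines. For one class, each word's probability is its count plus `alpha`, divided by the class's total word count plus `alpha` times the vocabulary size. The helper `smoothed_theta` and the toy documents are assumptions made for illustration; they are not taken from the recipe corpus or from sklearn's internals.

```python
from collections import Counter

def smoothed_theta(docs, vocab, alpha=1.0):
    """Estimate P(word | class) from the tokenized docs of one class with additive smoothing."""
    counts = Counter(tok for doc in docs for tok in doc if tok in vocab)
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

# Toy data (made up for illustration, not taken from the recipe corpus).
brioche_docs = [["yeast", "dough", "rise"], ["dough", "egg", "yeast"]]
vocab = ["dough", "egg", "fold", "rise", "yeast"]

theta = smoothed_theta(brioche_docs, vocab, alpha=1.0)
for word in vocab:
    print(f"{word:<6} {theta[word]:.3f}")
```

With `alpha=1` a word that never appears in the class ("fold" here) still receives a small non-zero probability, which keeps a single unseen word from zeroing out the whole class likelihood.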

## Results
- Mean accuracy of about 90% for the train test split set
- Similar results from the K Fold cross validation approach

## Analysis
- What goes wrong when it goes wrong? Is it bad data (from the scraping process), insufficient data, or was the model just fooled?
- Most of the errors occurred before the ML model step. In several cases, the recipe author misclassified their recipe. One common mistake is to classify a pie crust as a puff pastry--they are not the same. Google search can also have a relatively high rate of type I errors--false positives. Non-recipe websites were weeded out by the web scraper, but some non-target recipes made it through, such as a breakfast sausage croissant sandwich recipe.
- The scraper grabbed insufficient data in one case.
- The model made bad predictions in two cases.

| Error type | Number | Percent (%) |
| :--- |  :---: | :---: |
| Recipe author miscategorization or inaccurate Google search data | 7 | 8% |
| Scraper error | 1 | 1% |
| Bad prediction by model | 2 | 2% |

In [16]:
# 'Bag of words function'
vect = CountVectorizer(max_features=5000)
word_counts = vect.fit_transform(str_cleaned_tokenized)

In [17]:
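# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
feature_words = vect.get_feature_names_out()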

In [18]:
X_train, X_test, y_train, y_test = train_test_split(word_counts, y, random_state=3)

In [19]:
clf = MultinomialNB(alpha=1)
clf.fit(X_train, y_train)

scoreboard = zip(clf.predict_proba(X_test), clf.predict(X_test), y_test)
spc=''
print(f'\nMean accuracy: {clf.score(X_test, y_test)}\n')

print(f'     Prediction:          Actual                Prob brioche       Prob ciam         Prob crsnt      Prob pie      Prob puff')
print(f'     ----------           ------                ------------       ---------         ----------      --------      ---------')
for i in scoreboard:
    if i[2] != i[1]:
        print(f'P:  {i[1]:<15} A:  {i[2]:<18} Probs:   {i[0][0]:.5f}{spc:<10}{i[0][1]:.5f}{spc:<10}{i[0][2]:.5f}{spc:<10}{i[0][3]:.5f}{spc:<10}{i[0][4]:.5f}')



Mean accuracy: 0.8953488372093024

     Prediction:          Actual                Prob brioche       Prob ciam         Prob crsnt      Prob pie      Prob puff
     ----------           ------                ------------       ---------         ----------      --------      ---------
P:  puff_pastry     A:  pie_crust          Probs:   0.00001          0.00000          0.05476          0.05696          0.88827
P:  brioche         A:  puff_pastry        Probs:   0.99696          0.00000          0.00003          0.00000          0.00301
P:  croissant       A:  puff_pastry        Probs:   0.00000          0.00000          0.99936          0.00000          0.00064
P:  brioche         A:  croissant          Probs:   0.97196          0.00148          0.02582          0.00005          0.00068
P:  brioche         A:  croissant          Probs:   1.00000          0.00000          0.00000          0.00000          0.00000
P:  brioche         A:  pie_crust          Probs:   0.99996          0.000

### Errors analysis

In [20]:
yhat = clf.predict(X_test)
msk = (y_test == yhat)
df.iloc[msk[~msk].index]

| index | url | instructions | recipe type |
| :--- | :--- | :--- | :--- |
| 15 | https://www.crazyforcrust.com/favorite-all-but... | Make sure your butter is diced and cold before... | pie_crust |
| 239 | https://www.cookwithmanali.com/easy-puff-pastry/ | In the steel bowl of your stand mixer whisk to... | puff_pastry |
| 218 | https://sallysbakingaddiction.com/danish-pastr... | To help guarantee success, I recommend reading... | puff_pastry |
| 121 | https://lifewiththecrustcutoff.com/hot-ham-swi... | Mix the mustard, honey and brown sugar togethe... | croissant |
| 115 | https://lifemadesimplebakes.com/overnight-saus... | In a large skillet, brown the sausage over med... | croissant |
| 66 | https://boulderlocavore.com/all-butter-pie-crust/ | Cut the butter into ½-inch cubes. Place in the... | pie_crust |
| 155 | https://anitalianinmykitchen.com/italian-cake/ | Pre-heat oven to 320F (160C), grease and flour... | ciambellone |
| 275 | https://flourandfloral.com/heart-shaped-fruit-... | Roll out puff-pastry onto a lightly floured su... | puff_pastry |
| 146 | https://amandascookin.com/john-wayne-casserole/ | Preheat oven to 350 F and place your oven rack... | croissant |


In [21]:
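df.iloc[146]['recipe type']  # access by label; integer position lookups on a Series are deprecated in recent pandas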

'croissant'

In [22]:
# Predictions are passed first, so rows are predictions and columns are actuals
cm = confusion_matrix(clf.predict(X_test), y_test)

print('\nConfusion matrix. Predictions in rows. Actuals in columns\n')

print('\t\t     Br  Cm  Cs Pi Pf')
print('\t\t     --  --  -- -- --')
for row_label, row in zip(['Brio', 'Ciam', 'Csnt', 'Pie ', 'Puff'], cm):
    print(f'\t\t{row_label} {row}')

print(f'\nMean accuracy: {clf.score(X_test, y_test)}\n')



Confusion matrix. Predictions in rows. Actuals in columns

		     Br  Cm  Cs Pi Pf
		     --  --  -- -- --
		Brio [11  1  2  1  2]
		Ciam [ 0 11  0  0  0]
		Csnt [ 0  0 17  0  1]
		Pie  [ 0  0  1 17  0]
		Puff [ 0  0  0  1 21]

Mean accuracy: 0.8953488372093024
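Per-class precision and recall can be read off the confusion matrix above. This is a small sketch that copies the printed matrix by hand (predictions in rows, actuals in columns); the variable names are illustrative and not part of the notebook.

```python
# Confusion matrix copied from the output above: predictions in rows, actuals in columns.
labels = ["brioche", "ciambellone", "croissant", "pie_crust", "puff_pastry"]
cm = [
    [11, 1, 2, 1, 2],
    [0, 11, 0, 0, 0],
    [0, 0, 17, 0, 1],
    [0, 0, 1, 17, 0],
    [0, 0, 0, 1, 21],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

metrics = {}
for i, name in enumerate(labels):
    predicted = sum(cm[i])              # row sum: times the class was predicted
    actual = sum(row[i] for row in cm)  # column sum: true members in the test set
    metrics[name] = (cm[i][i] / predicted, cm[i][i] / actual)

print(f"accuracy: {accuracy:.4f}")
for name, (precision, recall) in metrics.items():
    print(f"{name:<12} precision {precision:.2f}  recall {recall:.2f}")
```

Six of the nine misclassifications sit in the brioche row, so brioche has perfect recall but the lowest precision of the five classes.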



In [23]:
kf = KFold(n_splits=5, shuffle=True)  # almost always use shuffle=True
fold_scores = []

for train, test in kf.split(word_counts):
    model = MultinomialNB()
    model.fit(word_counts[train], y[train])
    fold_scores.append(model.score(word_counts[test], y[test]))
    
print(np.mean(fold_scores))

0.8778772378516624


### Exploring the features which are most important to the supervised model 

- The predictive power of the model is good
- What are the key words that the model is relying upon?
- This exploration is where the model isn't so helpful. The significant words repeat across each class. The model is not weighing certain obvious key words as highly as one would expect--*lamination*, *yeast*, *eggs*, and *cold* are feature words that should be present in every case for certain recipes and not others.
- The model prediction may be good, but it suffers from a lack of solid intuition. Ideally the most significant words would clearly identify class. 
- This could be due to the small data set. Given that the web scraper yield from part 1 was only 47%, it may be worthwhile to look for ways to improve the yield.

- Can an unsupervised model look within a particular recipe class to identify different schools of thought for that particular product? E.g. Can it identify different types of croissant recipes?
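One possible reason the top words overlap is that ranking by per-class probability alone favors words that are frequent everywhere. The sketch below shows an alternative, not used in this notebook: rank each word by how much more likely it is in the target class than in the other classes. The probabilities are made up for illustration; in the notebook they would come from `np.exp(clf.feature_log_prob_)`. `discriminative_words` is a hypothetical helper.

```python
import math

# Toy per-class word probabilities (made up for illustration).
vocab = ["dough", "butter", "yeast", "fold", "batter"]
probs = {
    "brioche":     [0.40, 0.30, 0.20, 0.05, 0.05],
    "croissant":   [0.40, 0.30, 0.10, 0.18, 0.02],
    "ciambellone": [0.30, 0.25, 0.02, 0.03, 0.40],
}

def discriminative_words(target, probs, vocab, n=2):
    """Rank words by log P(word | target) minus the mean log P(word | other classes)."""
    others = [c for c in probs if c != target]
    scores = []
    for j, word in enumerate(vocab):
        rest = sum(math.log(probs[c][j]) for c in others) / len(others)
        scores.append((math.log(probs[target][j]) - rest, word))
    return [word for _, word in sorted(scores, reverse=True)[:n]]

for cls in probs:
    print(cls, discriminative_words(cls, probs, vocab))
```

In this toy example "dough" has the highest raw probability for brioche and croissant, yet the ratio-based ranking puts "yeast", "fold", and "batter" first for the three classes respectively.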

In [24]:
for j in range(5):
    print(f'\n          Recipe class: {clf.classes_[j]}')
    print(f'Most significant words: {[feature_words[i] for i in (word_counts.toarray()[273] * np.exp(clf.feature_log_prob_[j])).argsort()[-20:][::-1]]}\n')


          Recipe class: brioche
Most significant words: ['dough', 'butter', 'place', 'minute', 'roll', 'cup', 'bowl', 'make', 'milk', 'warm', 'add', 'yeast', 'two', 'flour', 'one', 'pan', 'tablespoon', 'mix', 'speed', 'turn']


          Recipe class: ciambellone
Most significant words: ['dough', 'butter', 'cup', 'baking', 'make', 'one', 'flour', 'add', 'milk', 'pan', 'sugar', 'wa', 'minute', 'place', 'two', 'teaspoon', 'bowl', 'recipe', 'mix', 'mixture']


          Recipe class: croissant
Most significant words: ['dough', 'butter', 'roll', 'make', 'place', 'fold', 'baking', 'third', 'one', 'water', 'sheet', 'yeast', 'minute', 'flour', 'two', 'cup', 'milk', 'pastry', 'warm', 'turn']


          Recipe class: pie_crust
Most significant words: ['dough', 'butter', 'roll', 'pastry', 'water', 'place', 'make', 'add', 'flour', 'cup', 'tablespoon', 'two', 'baking', 'fold', 'use', 'together', 'mixture', 'minute', 'one', 'bowl']


          Recipe class: puff_pastry
Most significant words: ['d

In [25]:
# Here's another way to gauge model output. Can you figure out the recipe class based upon the top n tokens the model uses for that class? 

# feature_words = word_counts.get_feature_names()
n = 20 #number of top words associated with the category that we wish to see
target_names = ['brioche', 'ciambellone', 'croissant', 'pie crust', 'puff pastry']

for cat in range(5):
    print(f"\nTarget: {cat}, name: {target_names[cat]}")
    log_prob = clf.feature_log_prob_[cat]
    i_topn = np.argsort(log_prob)[::-1][:n]
    features_topn = [feature_words[i] for i in i_topn]
    print(f"Top {n} tokens: ", features_topn)


Target: 0, name: brioche
Top 20 tokens:  ['dough', 'minute', 'bowl', 'butter', 'flour', 'egg', 'place', 'add', 'loaf', 'let', 'mixer', 'cover', 'pan', 'piece', 'side', 'ball', 'oven', 'hour', 'mix', 'speed']

Target: 1, name: ciambellone
Top 20 tokens:  ['cake', 'flour', 'baking', 'pan', 'sugar', 'egg', 'minute', 'oven', 'add', 'bowl', 'recipe', 'italian', 'cup', 'oil', 'one', 'ingredient', 'dough', 'batter', 'mix', 'mixture']

Target: 2, name: croissant
Top 20 tokens:  ['dough', 'butter', 'oven', 'roll', 'minute', 'make', 'flour', 'baking', 'place', 'fold', 'side', 'time', 'rectangle', 'one', 'water', 'rolling', 'cut', 'use', 'temperature', 'triangle']

Target: 3, name: pie crust
Top 20 tokens:  ['dough', 'crust', 'butter', 'flour', 'water', 'add', 'roll', 'edge', 'use', 'wrap', 'together', 'pastry', 'place', 'mixture', 'minute', 'rolling', 'time', 'make', 'plate', 'bowl']

Target: 4, name: puff pastry
Top 20 tokens:  ['dough', 'butter', 'pastry', 'flour', 'fold', 'roll', 'make', 'wr

## Word2Vec
- For future consideration as a way to improve MNB model intuition
- Would be better to have more data to improve the ability of word2vec to identify similarities

In [26]:
from gensim.models import Word2Vec

In [27]:
word2vec = Word2Vec(cleaned_tokenized)
# min_count is not set, so gensim's default of 5 applies: only words that appear at least five times in the corpus are included in the model.

In [28]:
vocabulary = word2vec.wv.vocab  # gensim < 4.0; in gensim 4.0+ use word2vec.wv.key_to_index
# vocabulary = the words kept by the model, i.e. those appearing at least min_count times in the corpus

In [29]:
word2vec.wv.most_similar('cold')
# Returns the words most similar to the given word (similarity, supposedly, rather than association)

[('ice', 0.9950507879257202),
 ('allpurpose', 0.9947797060012817),
 ('unsalted', 0.9928722381591797),
 ('pulse', 0.9901953339576721),
 ('food', 0.9886661767959595),
 ('dry', 0.9885644912719727),
 ('blend', 0.9884785413742065),
 ('softened', 0.9879720211029053),
 ('lemon', 0.9878119826316833),
 ('instant', 0.9873054027557373)]

# Unsupervised NLP Models

## K Means

- How well is an unsupervised K Means model able to cluster the recipes into groupings that identify with one of the five classes? 

> - Investigate the 'centroids' to find out what "topics" Kmeans has discovered by mapping these vectors back into the 'word space'.  Think of each feature/dimension of the centroid vector as representing the "average" article or the average occurrences of words for that cluster.
   
> - Of the six centroids printed below (the model itself is fit with 8 clusters), four classes are fairly clear. The first centroid refers to pie crust recipes, the second looks like it identifies puff pastry recipes, the third appears to be for brioche, and the sixth is picking up on ciambellone recipes.

> - The groupings aren't super clear though. There are quite a few overlapping words.


In [30]:
tfidfvect = TfidfVectorizer(max_features=500)
tfidf_vectorized = tfidfvect.fit_transform(str_cleaned_tokenized)
tf_idf= tfidf_vectorized.toarray()

In [31]:
clusters = 8
kmeans = KMeans(n_clusters=clusters, 
                random_state=0).fit(tfidf_vectorized)

In [32]:
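def sort_desc(pairs):
    '''Sort (weight, word) pairs by weight, largest first.'''
    return sorted(pairs, key=lambda x: x[0], reverse=True)

def get_word(centroid):
    '''Return only the words from a list of (weight, word) pairs.'''
    return [x[1] for x in centroid]

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = tfidfvect.get_feature_names_out()

# Print the 18 highest-weighted words for the first six centroids
for k in range(6):
    match = sort_desc(zip(kmeans.cluster_centers_[k], terms))
    print(' '.join(get_word(match[:18])), '\n')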

pulse dough processor food wrap water time add butter shortening mixture together form disc flour clump disk crust 

dough butter pastry flour fold roll wrap rectangle rolling make third water turn square time layer fridge recipe 

dough bowl loaf minute mixer let egg cover speed rise yeast ball pan butter flour add place hour 

pastry cheese oven dough baking sheet egg half minute bread almond recipe make set cook slice place one 

dough rectangle triangle roll butter inch fold sheet side cut baking half minute wrap end place rolling third 

batter cake pan sugar bundt minute powder oven flour baking zest egg preheat oil cool whisk come bowl 



## Cosine similarity
- Unsupervised learning

- Use the cosine similarity to compare similarity between documents.

- sklearn's [linear_kernel](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html) (computes the dot product) can be used on the tfidf matrix to compute cosine similarity, because `TfidfVectorizer` L2-normalizes each row by default.

- Here's a page on cosine similarity from [sklearn documentation](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) and a relevant [stack overflow post](http://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity).

- *The stack overflow post is helpful. It explains how to slice the tfidf matrix and then how to apply cosine similarity between one doc and all of the rest.*
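The equivalence between a plain dot product and cosine similarity on L2-normalized rows can be checked with a short sketch. The two vectors are made-up stand-ins for TF-IDF rows, and the helper names are illustrative.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy TF-IDF-like rows (made up for illustration).
doc_a = [0.0, 2.0, 1.0, 3.0]
doc_b = [1.0, 1.0, 0.0, 2.0]

plain_cosine = cosine(doc_a, doc_b)
normalized_dot = dot(l2_normalize(doc_a), l2_normalize(doc_b))
print(plain_cosine, normalized_dot)
```

Both values agree up to floating-point rounding, which is why the cheaper dot product is enough once the rows are already normalized.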

In [33]:
# Looking for recipes similar to the croissant recipe found in Gourmetier--a recipe with excellent instructions for a difficult-to-make product
# The Gourmetier recipe is found in the corpus at index 81.

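# Compares document 81 with documents 1 onward (the slice end of 500 is beyond the 344-row corpus).
# Because the slice starts at 1, result index i corresponds to corpus index i + 1.
cosine_similarities = linear_kernel(tfidf_vectorized[81:82], tfidf_vectorized[1:500]).flatten()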

In [34]:
len(cosine_similarities)

343

In [35]:
related_docs_indices = cosine_similarities.argsort()[:-5:-1] # Indices of the 4 highest similarities ([:-5:-1] returns four items); the first is the query document itself.
print(related_docs_indices)

cosine_similarities[related_docs_indices] # and their related cs

[ 80 277 138 112]


array([1.        , 0.59095001, 0.55951136, 0.52166381])
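Two details of the lookup above are easy to miss: the slice `[:-5:-1]` returns four indices rather than five, and because the comparison matrix was sliced from row 1, each result index is one less than the matching row in `df`. The sketch below shows a top-k lookup that carries the offset explicitly. The scores are made up, and `top_k` is a hypothetical helper, not part of the notebook.

```python
# Toy similarity scores standing in for linear_kernel output (values are illustrative).
# scores[i] compares the query with corpus document i + offset.
offset = 1  # the comparison slice started at index 1
scores = [0.10, 0.55, 1.00, 0.20, 0.60, 0.05]

def top_k(scores, k, offset=0):
    """Return (corpus_index, score) pairs for the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i + offset, scores[i]) for i in ranked[:k]]

best = top_k(scores, k=3, offset=offset)
print(best)
```

Comparing against the full matrix instead of a slice would make the offset zero and let the indices be used with `df.iloc` directly.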

In [36]:
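df.iloc[112] 
# Going step by step pulling up the most similar articles by index. Nothing really useful here. 
# 112 is probably the most detailed and helpful of the four identified.
# Caution: the similarity indices are offset by one from the corpus (the comparison slice started at 1), so similarity index 112 refers to df.iloc[113].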

url                https://simplyhomecooked.com/almond-croissant/
instructions    In a bowl or medium measuring cup, combine the...
recipe type                                             croissant
Name: 112, dtype: object

## Non-negative Matrix Factorization (NMF)
- Unsupervised learning
- Good for situations when there's some potentially valid grouping to both rows and columns, such as putting Joe and Sam in the same group because they like similar movies (as opposed to traditional supervised models where there are features and targets)



NMF factorizes a document-term matrix `V` into a matrix `W` (where each row is a latent vector of a single document in the corpus) and a matrix `H` (where each column is a latent vector of a single word in the vocabulary), so that `V ≈ WH`. In the code below the document-term matrix is named `X`. See the docstring of the `NMF.fit()` method in `src.my_nmf` for details.
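To make the factorization concrete, here is a minimal sketch using the classic multiplicative update rules for the Frobenius-norm objective. This is an assumption chosen for illustration: the hand-built `NMF` class used in this notebook may use a different algorithm, and the toy matrix is made up.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Toy document-term counts (made up): two "yeast dough" docs and two "cake batter" docs.
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 4, 2],
    [0, 1, 3, 3],
]
W, H = nmf(V, k=2)
W0, H0 = nmf(V, k=2, iters=0)  # same seed, no updates: the random starting point
print(round(frobenius_error(V, W0, H0), 3), "->", round(frobenius_error(V, W, H), 3))
```

Each row of `W` says how strongly a document loads on each of the `k` topics, and each row of `H` says how strongly each vocabulary term belongs to a topic, which is what the topic word lists printed later are read from.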

In [37]:
from src.nmf_helpers import build_text_vectorizer, hand_label_topics, analyze_recipe
# Note: this hand-built NMF replaces the sklearn.decomposition.NMF imported in the first cell
from src.my_nmf import NMF

In [39]:
# Build text-to-vector vectorizer, then vectorize corpus. This is using a hand-built model to allow for certain functions not available in sklearn.

vectorizer, vocabulary = build_text_vectorizer(corpus,
                             use_tfidf=True,
                             use_stemmer=False,
                             max_features=5000)
X = vectorizer(corpus)



In [40]:
np.random.seed(12345)

# Find k latent topics using our NMF model.
factorizer = NMF(k=5, max_iters=50, alpha=0.5)
W, H = factorizer.fit(X, verbose=True)

iter 0 : reconstruction error: 70.54309328037031
iter 1 : reconstruction error: 25.103364659738038
iter 2 : reconstruction error: 18.395053848077264
iter 3 : reconstruction error: 16.827955129362245
iter 4 : reconstruction error: 16.29152935209108
iter 5 : reconstruction error: 16.09270500029098
iter 6 : reconstruction error: 16.004785836816694
iter 7 : reconstruction error: 15.958451306944863
iter 8 : reconstruction error: 15.930653778803974
iter 9 : reconstruction error: 15.91258574888887
iter 10 : reconstruction error: 15.900066717981336
iter 11 : reconstruction error: 15.89085475539772
iter 12 : reconstruction error: 15.883757314463967
iter 13 : reconstruction error: 15.878110757490578
iter 14 : reconstruction error: 15.873514385507304
iter 15 : reconstruction error: 15.86971285744077
iter 16 : reconstruction error: 15.866539679502852
iter 17 : reconstruction error: 15.86386478441709
iter 18 : reconstruction error: 15.861594757390776
iter 19 : reconstruction error: 15.8596700032125

In [41]:
# Label topics and analyze a few recipes.
hand_labels = hand_label_topics(H, vocabulary)

topic 0
--> nbsp large paper parchment dish flat gently pin disk hands together dough dampen place long rolling smoothly 30 blender piece


please label this topic:  pie crust



topic 1
--> dough minutes bowl mixer cover let 1 speed yeast loaf rise place egg hours 2 oven medium stand doubled loaves


please label this topic:  brioche



topic 2
--> crust dough pulse water plate processor add shortening wrap mixture flour food butter together ice 1 cold fork plastic disc


please label this topic:  pie crust



topic 3
--> cake pan batter powder oil sugar zest baking bundt vanilla lemon eggs oven flour add italian beat toothpick pour orange


please label this topic:  ciambellone



topic 4
--> dough butter roll pastry fold rectangle flour rolling wrap square make 1 fridge inch 2 third cut inches side place


please label this topic:  puff pastry





In [42]:
rand_recipe = np.random.choice(range(len(W)), 15)

for i in rand_recipe:
    analyze_recipe(i, corpus, W, hand_labels)

You can always depend on this cake - it always turns out just right, with a zesty lemon flavour to boot. Olive oil means that it has a marvellous moist texture, even a day or two after cooking.We recently harvested over 700kg of olives from Carlo's family property and they have been cold pressed into a peppery olive oil. So it's the perfect time to make the luscious Tuscan ciambellone to eat with our coffee.In Australia it's often said that you don't cook with extra virgin olive oil - that you should use a lighter olive oil for cooking. But this cake is an exception to the rule, hence I always make it when our family-made Italian olive oil comes back from being pressed.Enjoy!Sioban Ciambellone Recipe (Italian Olive oil cake) Serves 8 Ingredients 2 x 50g eggs1 ¼ cups caster sugar¾ cup extra virgin olive oil¾ cup skim milkzest of 2 lemons and juice of 1 lemon1 ¼ cup plain flour1 heaped tsp baking powder (or 1 16 g sachet of lievito)¼ tsp baking soda (don’t add this if you have the lievit