# Clustering the recipes to map to a simple wine pairing

### I am following https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/ closely

The goal is to cluster the recipes into about 10 clusters to check whether the word embeddings are reliable. If similar recipes end up in the same cluster, we assume the embeddings are reasonably representative and can move forward with them as features in various models.

First, I combine the ingredient and instruction text of each recipe and tokenize it. I train a Word2Vec model that embeds the words in a 150-dimensional space (high dimensions due to the relatively low number of samples, ~2000) and represent each recipe by the average of its word vectors. I then use t-SNE to project the documents (recipes) to two dimensions and visualize how they group. Finally, I perform mini-batch k-means clustering with 10 clusters and display the three "most representative" recipes from each cluster.

In [52]:
# I am following https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/ closely

import pandas as pd
import numpy as np

from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from gensim.models import Word2Vec

import re
import string


from sklearn.manifold import TSNE

import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score

In [53]:
data = pd.read_csv('updated_recipe_data.csv')
data=data.drop(columns='Unnamed: 0')
data

Unnamed: 0,recipe_link,ingredients,instructions
0,https://tasty.co/recipe/fusion-chana-masala-tacos,"['Chana Masala Ingredients:', '4 cups chickpea...","['1: First, make Chana Masala as follows:Heat ..."
1,https://tasty.co/recipe/apple-crumble-cupcakes,"['Apple Mixture', '1 ⅔ cups apple (200 g) dice...","['1: Preheat oven to 160°C (325°F)', '2: In a ..."
2,https://tasty.co/recipe/turkey-spaghetti-squas...,"['1 tablespoon olive oil', '1 lb ground turkey...","['1: Preheat oven to 350°F (180°C)', '2: Heat ..."
3,https://tasty.co/recipe/veggie-hummus-trio,"['24 oz hummus (675 g) divided', '1 small avoc...","['1: In a food processor, combine 8 ounces (22..."
4,https://tasty.co/recipe/honey-lime-fruit-salad,"['½ lb fresh strawberry (225 g) quartered', '...",['1: Place 2 bananas sliced fruits in a large ...
...,...,...,...
2084,https://tasty.co/recipe/slow-cooker-root-beer-...,"['3 lb chicken wings (1360 g)', '2 teaspoons s...",['1: Season 3 lb chicken wings with 2 teaspoon...
2085,https://tasty.co/recipe/quick-easy-and-delicio...,['½ cup unsalted butter (125 g) at room temper...,"['1: Cream the ½ cup unsalted butter, dark ½ c..."
2086,https://tasty.co/recipe/ham-cheese-chicken-rol...,"['2 boneless skinless chicken breasts', '1 tea...",['1: Preheat oven to 400°F (200°C)Cut about ¾ ...
2087,https://tasty.co/recipe/chili-mac-n-cheese,"['1 lb lean ground beef (455 g)', '1 small whi...","['1: In a large pot, cook 1 lb lean ground bee..."


In [5]:
all_instructions = [data.instructions[i][2:-2].replace(" '",'').replace("'",'') for i in range(len(data)) if type(data.instructions[i]) != float]

In [3]:
all_ingredients = [data.ingredients[i][2:-2].replace(" '",'').replace("'",'') for i in range(len(data)) if type(data.ingredients[i]) != float]

In [8]:
relevant_text = [all_instructions[i] + ' '+ all_ingredients[i] for i in range(len(all_instructions))]
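
The slicing and `replace` calls above treat each cell as a plain string. Since the cells look like stringified Python lists, a more robust alternative might be to parse them with `ast.literal_eval`. The sketch below is only an illustration under that assumption; `parse_list_cell` is a hypothetical helper, not part of the notebook, and it joins steps with spaces rather than commas.

```python
import ast


def parse_list_cell(cell):
    """Turn a stringified list such as "['a', 'b']" into a list of strings."""
    if not isinstance(cell, str):  # missing values load from CSV as float NaN
        return []
    return [str(item) for item in ast.literal_eval(cell)]


# Hypothetical cell in the same shape as the ingredients column
cell = "['1 tablespoon olive oil', '1 lb ground turkey']"
steps = parse_list_cell(cell)
print(" ".join(steps))
```

Returning an empty list for missing cells keeps instructions and ingredients aligned row by row, which the index-based concatenation above relies on.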

In [9]:
relevant_text[0]

'1: First, make Chana Masala as follows:Heat 2 tablespoons avocado oil in a pan,2:  To the pan, add 2 whole bay leaves, 2 whole dry chili pepper, cinnamon sticks and whole ½ teaspoon cumin seeds,3: Once ½ teaspoon cumin seeds start to sizzle, add diced 1 medium onion and cook until slightly brown and translucent,4: Lower the heat and add the crushed or fresh 2 roma tomatoes, 2 tablespoons ginger and 3 cloves garlic and mixwell,5: Add 1 teaspoon salt, ½ teaspoon turmeric, 2 teaspoons paprika, red chili powder, 2 teaspoons chana masala mix and water,6:  Mix well,7: Let simmer for about 5 minutes with lid on,8: Add cooked 4 cups chickpeas, mix well and let cook for 7-10 minutes,9: Then, add the 2 tablespoons taco seasoning to the 2 teaspoons chana masala you just made,10:  Mash with masher and add water as needed for a loose but not runny consistency,11:  Heat in a saucepan,12:  Stir occasionally to prevent sticking,13: Serve mixture on 2 tablespoons taco shells with lettuce, 2 roma tomat

In [11]:
# Modified from https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/ 
def clean_text(text, tokenizer, stopwords=stopwords.words("english")):
    """Pre-process text and generate tokens

    Args:
        text: Text to tokenize.
        tokenizer: Object with a tokenize(text) method, e.g. WordPunctTokenizer.
        stopwords: Words to drop from the token list.

    Returns:
        List of cleaned tokens.
    """
    stopwords = set(stopwords)  # Set membership tests are faster than list scans
    text = text.lower()  # Lowercase words
    text = re.sub(r"\[(.*?)\]", "", text)  # Remove [+XYZ chars] in content
    # Replace punctuation with spaces; this also splits dash-joined words
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\s+", " ", text)  # Collapse repeated whitespace
    text = re.sub(r"\w+…|…", "", text)  # Remove ellipsis (and last word)

    tokens = tokenizer.tokenize(text)
    tokens = [t for t in tokens if t not in stopwords]  # Remove stopwords
    tokens = [t for t in tokens if not t.isdigit()]  # Remove digits
    tokens = [t for t in tokens if len(t) > 1]  # Remove short tokens
    return tokens
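
To make the effect of the cleaning steps concrete, here is a minimal standard-library sketch of the same idea. It is not the notebook's function: `clean_text_sketch` and `TOY_STOPWORDS` are hypothetical names, a whitespace split stands in for `WordPunctTokenizer`, and the tiny stopword set stands in for NLTK's English list.

```python
import re
import string

# Hypothetical, tiny stopword set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "a", "in", "and", "to", "of", "for"}


def clean_text_sketch(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip punctuation, and drop stopwords, digits, and 1-char tokens."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    tokens = text.split()  # whitespace split stands in for WordPunctTokenizer
    return [
        t for t in tokens
        if t not in stopwords and not t.isdigit() and len(t) > 1
    ]


sample = "1: Heat 2 tablespoons avocado oil in a pan, 2: Add ½ teaspoon cumin seeds"
print(clean_text_sketch(sample))
```

Step numbers, quantities, and single characters such as `½` are dropped, while measurement units like `tablespoons` and `teaspoon` are kept as ordinary tokens.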

In [12]:
tokenizer = WordPunctTokenizer()

In [13]:
data_tok = [clean_text(text, tokenizer) for text in relevant_text]

In [32]:
model = Word2Vec(sentences=data_tok, vector_size=150, workers=1, seed=314).wv

In [54]:
# Word2Vec sorts its vocabulary by descending frequency by default,
# so the first keys are already the most common words.
words = model.index_to_key[:100]
words[:10]  # most common words

['cup',
 'teaspoon',
 'add',
 'tablespoons',
 'cups',
 'minutes',
 'salt',
 'tablespoon',
 'oil',
 'pepper']

In [44]:
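# Represent each recipe by the average of its word embedding vectors
my_features = []

for tokens in data_tok:
    # Word2Vec drops words seen fewer than min_count (default 5) times, so skip those
    vecs = [model[toke] for toke in tokens if toke in model]
    if vecs:
        my_features.append(np.asarray(vecs).mean(axis=0))
    else:
        # No known tokens: fall back to a zero vector instead of NaN
        my_features.append(np.zeros(model.vector_size))

my_features = np.asarray(my_features)  # shape: (number of recipes, 150)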
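
The averaging step can be illustrated on toy data. The sketch below uses plain Python lists in place of the gensim vectors; `mean_vector` and `toy_vectors` are hypothetical names, and the 3-dimensional embeddings are made up for the example.

```python
def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 3-dimensional embeddings standing in for the 150-dimensional model
toy_vectors = {
    "garlic": [1.0, 0.0, 2.0],
    "onion": [3.0, 2.0, 0.0],
}

print(mean_vector(["garlic", "onion", "unseen"], toy_vectors, dim=3))
print(mean_vector(["unseen"], toy_vectors, dim=3))
```

Unknown tokens are skipped rather than counted, so they do not pull the average toward zero. A document with no known tokens gets a zero vector here; that is a design choice for the sketch, and dropping such documents would be an equally reasonable option.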

In [45]:

# 2-D t-SNE projection of the document (recipe) vectors, despite the variable name
word_tsne_unnorm = TSNE().fit_transform(X=my_features)


In [46]:
#This is taken directly from my source listed above: https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [47]:
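# One point per recipe, so the hover label is the recipe link rather than a vocabulary word
draw_vectors(word_tsne_unnorm[:, 0], word_tsne_unnorm[:, 1], color='green', link=list(data.recipe_link))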



In [48]:
#This is taken directly from my source listed above: https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/
def mbkmeans_clusters(
    X,
    k, 
    mb, 
    print_silhouette_values, 
):
    """Generate clusters and print Silhouette metrics using MBKmeans

    Args:
        X: Matrix of features.
        k: Number of clusters.
        mb: Size of mini-batches.
        print_silhouette_values: Print silhouette values per cluster.

    Returns:
        Trained clustering model and labels based on X.
    """
    km = MiniBatchKMeans(n_clusters=k, batch_size=mb).fit(X)
    print(f"For n_clusters = {k}")
    print(f"Silhouette coefficient: {silhouette_score(X, km.labels_):0.2f}")
    print(f"Inertia:{km.inertia_}")

    if print_silhouette_values:
        sample_silhouette_values = silhouette_samples(X, km.labels_)
        print("Silhouette values:")
        silhouette_values = []
        for i in range(k):
            cluster_silhouette_values = sample_silhouette_values[km.labels_ == i]
            silhouette_values.append(
                (
                    i,
                    cluster_silhouette_values.shape[0],
                    cluster_silhouette_values.mean(),
                    cluster_silhouette_values.min(),
                    cluster_silhouette_values.max(),
                )
            )
        silhouette_values = sorted(
            silhouette_values, key=lambda tup: tup[2], reverse=True
        )
        for s in silhouette_values:
            print(
                f"    Cluster {s[0]}: Size:{s[1]} | Avg:{s[2]:.2f} | Min:{s[3]:.2f} | Max: {s[4]:.2f}"
            )
    return km, km.labels_

In [49]:
#This is taken almost directly from my source listed above: https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/
clustering, cluster_labels = mbkmeans_clusters(
    X=my_features,
    k=10,
    mb=500,
    print_silhouette_values=True,
)


For n_clusters = 10
Silhouette coefficient: 0.11
Inertia:3265.16830490133
Silhouette values:
    Cluster 8: Size:310 | Avg:0.22 | Min:-0.04 | Max: 0.44
    Cluster 9: Size:227 | Avg:0.15 | Min:0.01 | Max: 0.33
    Cluster 2: Size:179 | Avg:0.13 | Min:-0.11 | Max: 0.35
    Cluster 0: Size:225 | Avg:0.12 | Min:-0.02 | Max: 0.28
    Cluster 3: Size:246 | Avg:0.11 | Min:-0.12 | Max: 0.33
    Cluster 6: Size:217 | Avg:0.08 | Min:-0.06 | Max: 0.23
    Cluster 5: Size:212 | Avg:0.07 | Min:-0.11 | Max: 0.24
    Cluster 1: Size:129 | Avg:0.07 | Min:-0.12 | Max: 0.30
    Cluster 4: Size:142 | Avg:0.06 | Min:-0.05 | Max: 0.24
    Cluster 7: Size:202 | Avg:0.05 | Min:-0.12 | Max: 0.24
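
As a reading aid for the scores above, the silhouette value of a single recipe $i$ is commonly defined as follows, where $a(i)$ is the mean distance from $i$ to the other members of its own cluster and $b(i)$ is the mean distance from $i$ to the members of the nearest other cluster:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values near 1 indicate a point well inside its cluster, values near 0 indicate overlapping clusters, and negative values suggest the point may sit closer to a neighboring cluster. Read this way, an average of 0.11 points to fairly weak separation, so the clusters are probably best treated as rough groupings.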


In [50]:
#This is taken almost directly from my source listed above: https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/
df_clusters = pd.DataFrame({
    "link": data.recipe_link,
    "tokens": [" ".join(text) for text in data_tok],
    "cluster": cluster_labels
})

In [51]:
#This is taken almost directly from my source listed above: https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/
for i in range(10):
    most_representative_docs = np.argsort(
        np.linalg.norm(my_features - clustering.cluster_centers_[i], axis=1)
    )
    print('Cluster '+ str(i+1))
    for d in most_representative_docs[:3]:
        print(df_clusters.link[d])
        print("-------------")
    print('\n\n')

Cluster 1
https://tasty.co/recipe/vegetarian-gumbo
-------------
https://tasty.co/recipe/paella
-------------
https://tasty.co/recipe/30-minute-nagan-kozhi-curry-kerala-chicken-curry
-------------



Cluster 2
https://tasty.co/recipe/pull-apart-flauta-ring
-------------
https://tasty.co/recipe/baked-ham-cheese-ring
-------------
https://tasty.co/recipe/spicy-chicken-bacon-flatbread
-------------



Cluster 3
https://tasty.co/recipe/banana-bread-overnight-oats
-------------
https://tasty.co/recipe/dairy-free-key-lime-coconut-bars
-------------
https://tasty.co/recipe/pre-packed-smoothie-in-a-jar
-------------



Cluster 4
https://tasty.co/recipe/pizza-margherita
-------------
https://tasty.co/recipe/beet-gnocchi
-------------
https://tasty.co/recipe/the-best-strawberry-shortcake-you-ll-ever-eat
-------------



Cluster 5
https://tasty.co/recipe/mushroom-and-garlic-quinoa-salad
-------------
https://tasty.co/recipe/one-pot-chicken-and-mushroom-pasta
-------------
https://tasty.co/recipe/