
## Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation


NMF relies on linear algebra and is a deterministic algorithm that arrives at a single representation of the corpus. NMF therefore performs better in cases where the topic probabilities should remain fixed per document (unlikely), or in small datasets.

LDA is based on probabilistic graphical modeling and assumes that documents with similar words or groups of words usually share a topic. As a probabilistic model, LDA can express uncertainty about the assignment of words to topics and about the placement of topics across texts.

The code below extracts topic “descriptions” from the top-ranked words in each basis vector.

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [2]:
# read_excel takes the column labels through `names`; it has no `feature_names` argument
df = pd.read_excel("C:\\Users\\thesk\\Desktop\\RAW_recipes.xlsx", names = ['name', 'id', 'minutes', 'contributor_id', 'submitted', 'tags', 'nutrition', 'n_steps','steps', 'description', 'ingredients', 'n_ingredients'], na_values=['NA'])
df.shape

(231637, 12)

In [3]:
df.head(3)

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13


## Use tf (raw term count or count vectorizer) features ..

Create a vocabulary of the words in our data.
Keep only words that appear in at most 80% of the documents and in at least 100 documents.
I set max_features=2000 in tfv for speed reasons, otherwise there would be 10,000+ features.

In [5]:
tfv = CountVectorizer(max_df=0.80, min_df=100, max_features=2000, stop_words='english')
tf_word_matrix = tfv.fit_transform(df['name'].values.astype('U'))
tf_word_matrix

<231637x1103 sparse matrix of type '<class 'numpy.int64'>'
	with 735331 stored elements in Compressed Sparse Row format>

Each of the 231637 rows in the Excel file (each document) is represented as a 1103-dimensional vector, because our vocabulary has 1103 words.
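
The effect of `max_df` and `min_df` can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not `CountVectorizer`'s implementation: it only splits on whitespace, and both the toy corpus and the helper `filter_vocabulary` are made up for illustration.

```python
from collections import Counter

def filter_vocabulary(docs, max_df=0.8, min_df=2):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs].

    Hypothetical helper: max_df is a fraction of the documents and min_df an
    absolute document count, mirroring how the cell above uses them.
    """
    n_docs = len(docs)
    # Count in how many documents each term occurs (not how often overall).
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )

toy_docs = [
    "easy chicken soup",
    "easy chicken salad",
    "easy chocolate cake",
    "easy banana bread",
    "easy chicken curry",
]
vocab = filter_vocabulary(toy_docs, max_df=0.8, min_df=2)
print(vocab)
```

Here "easy" is dropped because it occurs in every document, and the words that occur only once are dropped by `min_df`, so only "chicken" survives.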

## .. for LDA
Create topics and their probability distributions over the words in our vocabulary.
n_components: number of topics
learning_method='online' is faster than 'batch' on large datasets

In [14]:
lda = LatentDirichletAllocation(n_components=10, learning_method='online')
lda.fit(tf_word_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [15]:
first_topic = lda.components_[0]
print(first_topic)

[0.1000057  0.10000253 0.10000309 ... 0.10002022 0.10000182 0.10001143]


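<p style="color:gray;">The first topic holds one weight for each of the 1103 words in the vocabulary. These weights are unnormalized pseudo-counts rather than probabilities; dividing the row by its sum gives the word distribution of the topic.</p>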
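
The rows of `lda.components_` are not normalized, which is why the printed values hover around 0.1 instead of summing to 1. A minimal sketch of the normalization step, using an invented 2 x 4 matrix in place of the fitted model (the name `topic_word_probs` is an assumption, not something defined in the notebook):

```python
import numpy as np

# Toy stand-in for lda.components_: 2 topics x 4 vocabulary terms.
# The values are invented pseudo-counts, not taken from the fitted model.
components = np.array([
    [0.1, 4.1, 0.1, 5.7],
    [8.1, 0.1, 1.7, 0.1],
])

# Divide each row by its sum so that every topic becomes a distribution over terms.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))
```

The ranking of words inside a topic is unchanged by this step, so the top-word lists in the following cells would stay the same.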

In [16]:
# argsort() returns the indexes that would sort the weights from smallest to largest,
# so the highest weights sit at the last 7 positions of the result.
# Get the indexes of the 7 words with the highest weights.

top_words_in_topic = first_topic.argsort()[-7:]
print(top_words_in_topic)

[ 48 959 653  62 830 255 138]


In [17]:
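# Use these indexes to look up the words in the tfv (count vectorizer) vocabulary.
# On scikit-learn >= 1.0 the method is get_feature_names_out(); get_feature_names() was removed in 1.2.
feature_names = tfv.get_feature_names()
for i in top_words_in_topic:
    print(feature_names[i])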

banana
strawberry
muffins
bean
roasted
cream
cake
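
Because `argsort()` sorts in ascending order, the lists above end with the strongest word. If a strongest-first ordering is preferred, the index array can be reversed before slicing. The helper `top_terms` and the toy data below are hypothetical and only illustrate the idea:

```python
import numpy as np

def top_terms(weights, feature_names, k=7):
    """Return the k highest-weighted terms, strongest first (hypothetical helper)."""
    order = np.argsort(weights)[::-1][:k]  # indexes sorted by descending weight
    return [feature_names[i] for i in order]

toy_names = ["banana", "cake", "cream", "soup", "salad"]
toy_weights = np.array([0.3, 2.5, 1.2, 0.1, 0.1])
best = top_terms(toy_weights, toy_names, k=3)
print(best)
```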


In [18]:
feature_names = tfv.get_feature_names()
for i, topic in enumerate(lda.components_):
    print(f'Top words for topic #{i}:')
    print([feature_names[j] for j in topic.argsort()[-7:]])

Top words for topic #0:
['banana', 'strawberry', 'muffins', 'bean', 'roasted', 'cream', 'cake']
Top words for topic #1:
['pizza', 'salmon', 'white', 'orange', 'stuffed', 'spinach', 'soup']
Top words for topic #2:
['fruit', 'tuna', 'hot', 'red', 'style', 'casserole', 'rice']
Top words for topic #3:
['creamy', 'dressing', 'apple', 'easy', 'sweet', 'potato', 'salad']
Top words for topic #4:
['sugar', 'slow', 'stew', 'pudding', 'bake', 'beef', 'tomato']
Top words for topic #5:
['pumpkin', 'grilled', 'pot', 'lemon', 'spicy', 'pie', 'cheese']
Top words for topic #6:
['oatmeal', 'vegan', 'corn', 'peanut', 'butter', 'cookies', 'chocolate']
Top words for topic #7:
['sausage', 'beans', 'green', 'baked', 'shrimp', 'chicken', 'pasta']
Top words for topic #8:
['honey', 'roast', 'steak', 'vegetable', 'turkey', 'garlic', 'chicken']
Top words for topic #9:
['quick', 'chili', 'dip', 'potatoes', 'pork', 'bread', 'sauce']


In [19]:
# To add a column to the dataset showing the topic,
# pass the document-word matrix to the lda.transform() method.
# It returns the topic distribution of each Excel row (or document), one column per topic.

topic_values = lda.transform(tf_word_matrix)
topic_values.shape

(231637, 10)

In [20]:
# Use argmax() with axis=1 to find, for each row, the topic index with the maximum value.
# topic_values is a NumPy array, so the NumPy syntax applies: ndarray.argmax(axis=None)

df['Topic'] = topic_values.argmax(axis=1)
df.head(3)

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,Topic
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7,7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6,2
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13,9
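
Taking only the `argmax` discards the uncertainty that LDA expresses about each document. A small sketch of how the strength of the winning topic could be kept alongside its index, using an invented 3 x 3 matrix in place of `topic_values`; the 0.5 cut-off is an arbitrary assumption, not a recommended value:

```python
import numpy as np

# Toy stand-in for lda.transform(...): 3 documents x 3 topics, rows sum to 1.
toy_topic_values = np.array([
    [0.80, 0.10, 0.10],
    [0.34, 0.33, 0.33],
    [0.05, 0.15, 0.80],
])

dominant_topic = toy_topic_values.argmax(axis=1)  # index of the largest entry per row
dominant_weight = toy_topic_values.max(axis=1)    # how strongly that topic dominates

# Hypothetical cut-off: flag documents whose best topic is barely ahead of the rest.
is_ambiguous = dominant_weight < 0.5
print(dominant_topic, dominant_weight, is_ambiguous)
```

The second toy document is assigned topic 0 by `argmax`, but its weights are almost uniform, which is exactly the case a hard label hides.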


In [21]:
# 5 most common words for curiosity reasons
occ = np.asarray(tf_word_matrix.sum(axis=0)).ravel().tolist()
counts_df = pd.DataFrame({'term': tfv.get_feature_names(), 'occurrences': occ})
counts_df.sort_values(by='occurrences', ascending=False).head(5)

Unnamed: 0,term,occurrences
178,chicken,22966
847,salad,13299
856,sauce,10075
170,cheese,9745
191,chocolate,9029


## Use tf-idf features ..

In [22]:
tiv = TfidfVectorizer(max_df=0.90, min_df=100, max_features=2000, stop_words='english')

In [23]:
tfidf_word_matrix = tiv.fit_transform(df['name'].values.astype('U'))
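
For intuition, the tf-idf weighting can be reproduced by hand on a tiny count matrix. The sketch below assumes the `TfidfVectorizer` defaults as I understand them (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization); the counts are invented.

```python
import numpy as np

# Toy term-count matrix: 3 documents x 3 terms (invented numbers).
counts = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(idf.round(3))
print(tfidf.round(3))
```

The first term occurs in every document and gets the lowest idf, so within a row the rarer terms end up with the larger weight.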

## .. for NMF, the Frobenius norm

In [24]:
nmf = NMF(n_components=10, random_state=1, alpha=.1, l1_ratio=.5)

# Fit the NMF model with tf-idf features
nmf.fit(tfidf_word_matrix)

NMF(alpha=0.1, beta_loss='frobenius', init=None, l1_ratio=0.5, max_iter=200,
  n_components=10, random_state=1, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)
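
To make the factorization itself concrete, here is a minimal NumPy sketch that approximates a non-negative matrix V by W @ H under the Frobenius norm. It swaps in the classic multiplicative update rules rather than the coordinate descent solver (`solver='cd'`) used in the cell above, and it omits the regularization terms; `nmf_frobenius` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def nmf_frobenius(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factorize V ~ W @ H with multiplicative updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components))
    H = rng.random((n_components, V.shape[1]))
    for _ in range(n_iter):
        # Multiplying by non-negative ratios keeps both factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix with two obvious groups of terms.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])
W, H = nmf_frobenius(V, n_components=2)
error = np.linalg.norm(V - W @ H)
print(round(error, 3))
```

In this picture the rows of H play the role of `nmf.components_` (topics over words) and W corresponds to the output of `nmf.transform` (documents over topics).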

In [25]:
print("Top words for Topic 1:\n")
first_topic = nmf.components_[0]
top_words_in_topic = first_topic.argsort()[-7:]

for i in top_words_in_topic:
    print(tiv.get_feature_names()[i])
    
print("\nTopics in NMF model (Frobenius norm):")
    
tfidf_feature_names = tiv.get_feature_names()
for i, topic in enumerate(nmf.components_):
    print(f'Top words for topic #{i}:')
    print([tfidf_feature_names[j] for j in topic.argsort()[-7:]])
    
topic_values = nmf.transform(tfidf_word_matrix)
df['Topic_F'] = topic_values.argmax(axis=1)
df.head(3)

Top words for Topic 1:

garlic
curry
fried
breasts
lemon
grilled
chicken

Topics in NMF model (Frobenius norm):
Top words for topic #0:
['garlic', 'curry', 'fried', 'breasts', 'lemon', 'grilled', 'chicken']
Top words for topic #1:
['fruit', 'bean', 'spinach', 'pasta', 'dressing', 'potato', 'salad']
Top words for topic #2:
['white', 'oatmeal', 'peanut', 'butter', 'chip', 'cookies', 'chocolate']
Top words for topic #3:
['ham', 'baked', 'blue', 'dip', 'macaroni', 'cream', 'cheese']
Top words for topic #4:
['creamy', 'lentil', 'vegetable', 'tomato', 'bean', 'potato', 'soup']
Top words for topic #5:
['cream', 'pound', 'lemon', 'apple', 'chocolate', 'coffee', 'cake']
Top words for topic #6:
['pumpkin', 'wheat', 'garlic', 'pudding', 'machine', 'banana', 'bread']
Top words for topic #7:
['lemon', 'spaghetti', 'garlic', 'shrimp', 'tomato', 'pasta', 'sauce']
Top words for topic #8:
['easy', 'pork', 'rice', 'beef', 'crock', 'casserole', 'pot']
Top words for topic #9:
['crust', 'easy', 'pecan', 'cream', 'pumpkin', 'apple', 'pie']

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,Topic,Topic_F
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7,7,8
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6,2,8
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13,9,8



## .. the generalized Kullback-Leibler divergence

In [26]:
nmf = NMF(n_components=10, random_state=1, beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1, l1_ratio=.5)
nmf.fit(tfidf_word_matrix)

NMF(alpha=0.1, beta_loss='kullback-leibler', init=None, l1_ratio=0.5,
  max_iter=1000, n_components=10, random_state=1, shuffle=False,
  solver='mu', tol=0.0001, verbose=0)
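
The only change from the previous model is the quantity being minimized. As a rough illustration, the generalized Kullback-Leibler divergence between a matrix X and its reconstruction Y can be written as a short NumPy function. The function name, the `eps` guard and the toy matrices are assumptions made for this sketch.

```python
import numpy as np

def generalized_kl(X, Y, eps=1e-12):
    """Generalized KL divergence D(X || Y) = sum(X * log(X / Y) - X + Y)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Entries with X == 0 contribute only Y, since x * log(x) -> 0 as x -> 0.
    ratio = np.where(X > 0, X / (Y + eps), 1.0)
    return float(np.sum(X * np.log(ratio) - X + Y))

X = np.array([[1.0, 0.0], [2.0, 3.0]])
close = np.array([[1.1, 0.1], [1.9, 3.2]])
far = np.array([[3.0, 2.0], [0.5, 0.5]])

print(generalized_kl(X, X), generalized_kl(X, close), generalized_kl(X, far))
```

Unlike the squared Frobenius error, this measure is asymmetric, and the `log(X / Y)` term makes it depend on the ratio between an entry and its reconstruction rather than only on their difference.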

In [27]:
print("Top words for Topic 1:\n")
first_topic = nmf.components_[0]
top_words_in_topic = first_topic.argsort()[-7:]

for i in top_words_in_topic:
    print(tiv.get_feature_names()[i])
    
print("\nTopics in NMF model (generalized Kullback-Leibler):")
    
tfidf_feature_names = tiv.get_feature_names()
for i, topic in enumerate(nmf.components_):
    print(f'Top words for topic #{i}:')
    print([tfidf_feature_names[j] for j in topic.argsort()[-7:]])
    
topic_values = nmf.transform(tfidf_word_matrix)
df['Topic_KL'] = topic_values.argmax(axis=1)
df.head(3)

Top words for Topic 1:

curry
salsa
lemon
fried
grilled
style
chicken

Topics in NMF model (generalized Kullback-Leibler):
Top words for topic #0:
['curry', 'salsa', 'lemon', 'fried', 'grilled', 'style', 'chicken']
Top words for topic #1:
['tuna', 'fruit', 'dressing', 'salmon', 'pasta', 'spinach', 'salad']
Top words for topic #2:
['oatmeal', 'bars', 'hot', 'peanut', 'butter', 'cookies', 'chocolate']
Top words for topic #3:
['chili', 'roasted', 'stuffed', 'garlic', 'tomato', 'shrimp', 'potatoes']
Top words for topic #4:
['lentil', 'cabbage', 'black', 'vegetable', 'bean', 'potato', 'soup']
Top words for topic #5:
['cream', 'carrot', 'sour', 'coffee', 'pineapple', 'lemon', 'cake']
Top words for topic #6:
['zucchini', 'pizza', 'dip', 'banana', 'muffins', 'bread', 'cheese']
Top words for topic #7:
['cranberry', 'quick', 'honey', 'pudding', 'orange', 'sweet', 'sauce']
Top words for topic #8:
['crock', 'pot', 'beef', 'casserole', 'pork', 'easy', 'rice']
Top words for topic #9:
['blueberry', '

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,Topic,Topic_F,Topic_KL
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7,7,8,4
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6,2,8,6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13,9,8,4
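
The topic indexes in `Topic`, `Topic_F` and `Topic_KL` are arbitrary per model, so the columns cannot be compared value by value. One possible way to relate two labelings is to count how often each pair of labels occurs together. The sketch below does this with invented labels for six documents and plain Python; on the real data, `pd.crosstab(df['Topic_F'], df['Topic_KL'])` should give the same kind of table.

```python
from collections import Counter

# Invented labels for six documents from two models; label 0 in one column
# need not mean the same thing as label 0 in the other.
topic_f = [0, 0, 1, 1, 2, 2]
topic_kl = [3, 3, 5, 5, 5, 4]

# Count how often each (Frobenius topic, KL topic) pair occurs together.
pair_counts = Counter(zip(topic_f, topic_kl))

# For every Frobenius topic, find the KL topic it overlaps with most.
best_match = {}
for (f, kl), n in pair_counts.items():
    if f not in best_match or n > pair_counts[(f, best_match[f])]:
        best_match[f] = kl

print(pair_counts)
print(best_match)
```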


## Conclusion
<p>Fitting the NMF model with tf-idf features (beta_loss='frobenius') seems to give the best topic extraction.<br>
For example, #0 seems to be chicken, #4 soups, #5 cakes, #9 pies and #7 Mediterranean food.
</p><br>
Top words per topic (NMF, Frobenius norm):<br>
0: ['garlic', 'curry', 'fried', 'breasts', 'lemon', 'grilled', 'chicken']<br>
1: ['fruit', 'bean', 'spinach', 'pasta', 'dressing', 'potato', 'salad']<br>
2: ['white', 'oatmeal', 'peanut', 'butter', 'chip', 'cookies', 'chocolate']<br>
3: ['ham', 'baked', 'blue', 'dip', 'macaroni', 'cream', 'cheese']<br>
4: ['creamy', 'lentil', 'vegetable', 'tomato', 'bean', 'potato', 'soup']<br>
5: ['cream', 'pound', 'lemon', 'apple', 'chocolate', 'coffee', 'cake']<br>
6: ['pumpkin', 'wheat', 'garlic', 'pudding', 'machine', 'banana', 'bread']<br>
7: ['lemon', 'spaghetti', 'garlic', 'shrimp', 'tomato', 'pasta', 'sauce']<br>
8: ['easy', 'pork', 'rice', 'beef', 'crock', 'casserole', 'pot']<br>
9: ['crust', 'easy', 'pecan', 'cream', 'pumpkin', 'apple', 'pie']<br>