# Topic Modeling with LDA

This tutorial shows how to perform LDA topic modeling with `sklearn` (scikit-learn). It uses several functions to explore the topics and the documents, and it works with a small dataset of NYT articles to keep the interpretation manageable.

There are **a few scattered activities for you** in the notebook.


**Table of Contents**

1. [Load the data](#sec1)  
2. [Convert to document-term matrix](#sec2)  
3. [Fit the LDA model and explore it](#sec3)
4. [Finding the optimal number of topics with GridSearch](#sec4)

<a id="sec1"></a>
## 1. Load the data

I used the NYT API to get all articles from March 2024. Then, I combined the fields "snippet" and "lead_paragraph" to create a longer document for each article. Next, I kept the articles whose section_name is food, realestate, or science. Finally, I saved only the documents, one JSON file per section.

Below is a function that reads a JSON file and turns it into a dataframe.

In [1]:
import json
import pandas as pd
import numpy as np

def jsonToDF(name):
    """Read a list of sentences from the JSON file, store them in a dataframe"""
    
    with open(f"{name}.json") as fin:
        textList = json.load(fin)

    # create a name for each document, based on its category
    indexNames = [f"{name}_{i+1}" for i in range(len(textList))]

    # create the dataframe, it will have one column and one index
    df = pd.DataFrame(data=textList, index=indexNames)
    df.columns = ['document']
    return df



First, let's read the content of all three files:

In [2]:
food = jsonToDF("food")
realestate = jsonToDF("realestate")
science = jsonToDF("science")

Check one dataframe:

In [3]:
food.head()

Unnamed: 0,document
food_1,Commit this method to memory for caramelized a...
food_2,"Kenji López-Alt’s buttery, unabashedly garlick..."
food_3,Just add rice or potatoes (and maybe a chilled...
food_4,"It’s a showstopping kaleidoscope of bulgogi, s..."
food_5,In the third installment of her YouTube series...


Check the shape of each dataframe:

In [4]:
print("food:", food.shape)
print("realestate:", realestate.shape)
print("science:", science.shape)

food: (81, 1)
realestate: (82, 1)
science: (71, 1)


Let's concatenate all of them in a single dataframe for the moment:

In [5]:
allDocs = pd.concat([food, realestate, science])
allDocs.shape

(234, 1)

In [6]:
allDocs.head()

Unnamed: 0,document
food_1,Commit this method to memory for caramelized a...
food_2,"Kenji López-Alt’s buttery, unabashedly garlick..."
food_3,Just add rice or potatoes (and maybe a chilled...
food_4,"It’s a showstopping kaleidoscope of bulgogi, s..."
food_5,In the third installment of her YouTube series...


Make the column wide enough to show all text:

In [7]:
pd.set_option("display.max_colwidth",1000)

Look at the results:

In [8]:
allDocs.head()

Unnamed: 0,document
food_1,"Commit this method to memory for caramelized and crisp yet tender vegetables all year long. The kindest thing you can do for yourself when you’re stiff from being in the cold is to find some warmth: Because as the chill in your bones starts to fade, so does your stiffness. The same thing happens to hard winter vegetables when they’re enveloped in the heat of the oven — they soften and sweeten as they roast until they’re golden outside and tender in the middle."
food_2,"Kenji López-Alt’s buttery, unabashedly garlicky noodles are as easy to make as they are to devour. Good morning. The vernal equinox is in less than three weeks, but you wouldn’t know it from the frosted mud in the woods and the storm-wounded lawns where I stay. It’s bare ugly everywhere save in the bays, where water clear as gin flows over rocks in a spectrum of pink. At the market: cabbage and potatoes, a box of turnips, industrial berries that might have been grown in space. The new season’s coming, sure as tulips, but right now it’s hard to imagine."
food_3,"Just add rice or potatoes (and maybe a chilled white wine). Citrus and salmon is a winning combination, previously proven in New York Times Cooking’s recipes for broiled salmon with mustard and lemon, roasted salmon with ginger-lime butter and citrusy roasted salmon and potatoes. Our newest addition to this esteemed company is Farideh Sadeghin’s recipe for orange-glazed baked salmon. It’s a no-fuss fish dinner with a clever, timesaving twist: Farideh builds a side salad into the recipe by tossing salad greens with some of the reserved honeyed orange juice that is used to flavor the salmon. If you’re looking at the above image and thinking, “I bet blood oranges would be especially beautiful and excellent in this recipe,” know that Coco, a reader, already tried that and can confirm that the results were “absolutely delicious.”"
food_4,"It’s a showstopping kaleidoscope of bulgogi, shiitakes, bean sprouts, spinach, carrots and cucumbers, all drizzled with a spicy gochujang sauce. Good morning. On Sunday, I like a project in the kitchen more than on any other day. It’s a chance to work at the stove without the need to get something on the table in 45 minutes, a time to stretch my skill set. Mostly, it’s an opportunity to explore recipes rather than simply following them. On Sundays I don’t want to fly by wire. I want to fly."
food_5,"In the third installment of her YouTube series, the cookbook author and chef Sohla El-Waylly will teach you how to master the basics of the bird. For beginners and experienced cooks alike, preparing chicken can come with a lot of questions (and nerves!). Am I going to get salmonella? How do I butcher a whole bird? Am I doomed to an eternity of dry breast meat? In the third installment of her YouTube series, Cooking 101, the chef and cookbook author Sohla El-Waylly will help you master the basics of the bird, then set you up with a handful of recipes that highlight white and dark meat."


In [9]:
allDocs.tail()

Unnamed: 0,document
science_67,"Dr. Goodall, who is best known for her work with chimpanzees, recently celebrated her forthcoming 90th birthday with as many dogs and explained why she isn’t slowing down. Jane Goodall is turning 90 on April 3 and the primatologist-turned-activist seems busier than ever. This year, she’ll be on the road for 320 days. She’ll be raising money for her nonprofit organizations, the Jane Goodall Institute and Roots & Shoots, and encouraging people to take environmental action."
science_68,How do champion skaters accomplish their extraordinary jumps and spins? Brain science is uncovering clues.
science_69,"The Delta IV Heavy, a rocket that briefly bursts into flame just before it lifts off, is set to launch for the last time soon. The ignition of the Delta IV Heavy rocket is perhaps the most visually striking liftoff you’ll ever see — the rocket seemingly burns itself up on the launchpad before it heads to space. Now, the very last Delta IV Heavy ever is on the launchpad."
science_70,"A device called LightSound is being distributed to help the blind and visually impaired experience this year’s event. On Aug. 21, 2017, Kiki Smith’s teenage sons giddily prepared to watch the partial solar eclipse in Rochester, N.Y. As Ms. Smith listened to their chatter, she felt excluded."
science_71,"The rendezvous between the sun and the moon in 2017 captivated a small region in the Midwest. Lucky for Americans at the eclipse crossroads, they get to see it again. It is rare for a total solar eclipse to hit the same place twice — once every 366 years on average. In 2019, this happened in the Pacific Ocean, far west of the coast of Chile. By a stroke of luck, the next one will span a region of about 10,000 square miles that includes parts of southern Illinois, southeastern Missouri and western Kentucky."


<a id="sec2"></a>
## 2. Convert to document-term matrix

We will apply the `CountVectorizer` to convert our corpus into a document-term matrix. LDA models each document as a collection of word counts, so raw counts are the natural input for it. (It is possible to use the Tf-idf vectorizer too.)

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

This process always has two steps:

1. initialize the vectorizer
2. apply `fit_transform` to perform the transformation.

In [11]:
# Initialize the vectorizer
vectorizer = CountVectorizer(
    strip_accents='unicode',
    stop_words='english',
    lowercase=True,
    token_pattern=r'\b[a-zA-Z]{3,}\b', # we want only words that contain letters and are 3 or more characters long
)

# Transform our data into the document-term matrix
dtm = vectorizer.fit_transform(allDocs['document'])
dtm

<234x4504 sparse matrix of type '<class 'numpy.int64'>'
	with 8227 stored elements in Compressed Sparse Row format>

### Exploring the features   
Let's look at the features of the "model", that is, the columns of our document-term matrix:

In [12]:
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['abel', 'able', 'aboard', ..., 'zootampa', 'zumper', 'zuni'],
      dtype=object)

It's an array, so let's look at its dimensions:

In [13]:
feature_names.shape

(4504,)

Let's look at a larger chunk of values:

In [14]:
feature_names[300:350]

array(['barge', 'bargoer', 'bargoers', 'barked', 'barley', 'barn', 'barr',
       'barriers', 'bars', 'bartender', 'bartered', 'bartlett', 'based',
       'basement', 'basic', 'basics', 'basil', 'basilica', 'basin',
       'basmati', 'bass', 'bassin', 'batch', 'bath', 'bathroom',
       'bathrooms', 'battle', 'bay', 'bays', 'beach', 'beaches', 'beads',
       'bean', 'beans', 'beat', 'beau', 'beautiful', 'beauty', 'beckoned',
       'bedford', 'bedrock', 'bedroom', 'bedrooms', 'beef', 'beefbars',
       'beekman', 'beets', 'began', 'beginners', 'beginning'],
      dtype=object)

It's clear that these are all cleaned words, three or more characters long, which have not been stemmed. That is, we have both "bean" and "beans" as two separate features.
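To see what the `token_pattern` keeps and drops, here is a small sketch that applies the same regular expression with Python's `re` module to a made-up sentence. It only mimics the tokenizing step (lowercasing, then matching). The real `CountVectorizer` also strips accents and removes English stop words, which this sketch does not do. The sentence and the variable names are illustrative assumptions.

```python
import re

# Same pattern that was passed to CountVectorizer as token_pattern
TOKEN_PATTERN = r'\b[a-zA-Z]{3,}\b'

# A made-up sentence, not taken from the NYT data
sample_text = "It's a no-fuss fish dinner: 2 beans, 1 bean!"

# Lowercase first (as lowercase=True does), then keep only runs of 3+ letters
tokens = re.findall(TOKEN_PATTERN, sample_text.lower())
print(tokens)
```

Short pieces such as "it", "a", and "no" and the digits are dropped, while "beans" and "bean" stay separate because nothing stems them.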

### Understanding the document-term matrix

Let's look at a single row of the matrix, the first row, which corresponds to the first document from the NYT articles:

In [15]:
doc1 = dtm[0]
doc1

<1x4504 sparse matrix of type '<class 'numpy.int64'>'
	with 32 stored elements in Compressed Sparse Row format>

It says that it has 4504 columns, but only 32 stored elements (terms with a non-zero count).
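To put a number on this sparsity, the short calculation below reuses the figures printed in the outputs so far (234 documents, 4504 terms, 8227 stored elements in total, 32 in the first row). It is plain arithmetic; the variable names are only illustrative.

```python
# Numbers copied from the outputs shown so far
n_docs, n_terms = 234, 4504      # shape of the document-term matrix
stored_total = 8227              # non-zero entries in the whole matrix
stored_doc1 = 32                 # non-zero entries in the first row

density_total = stored_total / (n_docs * n_terms)   # share of non-zero cells
density_doc1 = stored_doc1 / n_terms                # same, for the first document
avg_terms_per_doc = stored_total / n_docs           # distinct terms per document

print(f"matrix density: {density_total:.2%}")
print(f"first-row density: {density_doc1:.2%}")
print(f"average distinct terms per document: {avg_terms_per_doc:.1f}")
```

Less than 1% of the cells are non-zero, and a typical document uses only about 35 of the 4504 terms.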

We can use some Python code to find the words and their counts for this document:

In [16]:
row_index = 0
doc_vec = dtm.getrow(row_index).toarray()

non_zero_indices = doc_vec.nonzero()[1]
dtm_scores = doc_vec[0, non_zero_indices] # goes and retrieves the values corresponding to the non_zero_indices
words = [feature_names[i] for i in non_zero_indices]

for word, score in zip(words, dtm_scores):
    print(f"{word}: {score}")

bones: 1
caramelized: 1
chill: 1
cold: 1
commit: 1
crisp: 1
does: 1
enveloped: 1
fade: 1
golden: 1
happens: 1
hard: 1
heat: 1
kindest: 1
long: 1
memory: 1
method: 1
middle: 1
outside: 1
oven: 1
roast: 1
soften: 1
starts: 1
stiff: 1
stiffness: 1
sweeten: 1
tender: 2
thing: 2
vegetables: 2
warmth: 1
winter: 1
year: 1


We can look at non_zero_indices to check what that variable stores:

In [17]:
non_zero_indices

array([ 437,  599,  713,  792,  830,  972, 1168, 1331, 1436, 1717, 1807,
       1810, 1840, 2162, 2334, 2503, 2513, 2525, 2792, 2794, 3392, 3728,
       3832, 3863, 3864, 3967, 4027, 4051, 4282, 4357, 4424, 4480],
      dtype=int64)

These values are the column indices of the terms (words) in the matrix. A word like "year" has a high index because it sits toward the end of the matrix, where terms are ordered alphabetically.

Now that we know the indices of these words, we can use them to find how often each word occurs in the entire matrix.

We will check the word "caramelized", which has the index 599.

In [18]:
dtm.getcol(599).toarray().T # get the column, turn it into an array format, then transpose it to be a row

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

The word clearly doesn't show up often: there are only 3 values of 1. Let's check the word "vegetables", index = 4282:

In [19]:
dtm.getcol(4282).toarray().T

array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

Even this word shows up in only 4 documents in total. Meanwhile, let's look at a word like "year", index = 4480:

In [20]:
dtm.getcol(4480).toarray().T

array([[1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]], dtype=int64)

This word seems to occur a bit more often. We can find in how many documents:

In [21]:
np.count_nonzero(dtm.getcol(4480).toarray().T)

18

### **Task for you:** find the top 5 words of this document, meaning the 5 of its words that show up in the most articles.

Use the variable names that we have seen so far.

**Reflection questions** 

1. From these top words would you be able to infer that this document is about cooking/food? 

2. If these words were part of a **topic**, what would you name that topic?

### Going back to the dataframe

We can create a function that takes the representation of each document as a row of numbers in the matrix and converts it back to a list of words.

In [None]:
def matrix2Doc(dtMatrix, features, index):
    """Turns each row of the document-term matrix into a list of terms"""
    row = dtMatrix.getrow(index).toarray()
    non_zero_indices = row.nonzero()[1]
    words = [features[idx] for idx in non_zero_indices]
    return words

In [None]:
allDocsAsTerms = [matrix2Doc(dtm, feature_names, i) for i in range(dtm.shape[0])]

Check that we have all of them:

In [None]:
len(allDocsAsTerms)

Add a column to the dataframe:

In [None]:
allDocs['terms'] = allDocsAsTerms
allDocs.head()

<a id="sec3"></a>
## 3. Fit the LDA model

Now that the data is ready and we understand well how it is represented (and how sparse it is), let us fit the LDA model:

In [22]:
from sklearn.decomposition import LatentDirichletAllocation

# Step 1: Initialize the model

lda = LatentDirichletAllocation(n_components=15, # we are picking the number of topics arbitrarily at the moment
                                random_state=0)

# Step 2: Fit the model
lda.fit(dtm)

The representation of topics can be accessed this way:

In [23]:
lda.components_

array([[0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 1.06666664,
        0.06666667],
       [0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666667,
        0.06666667],
       [0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666667,
        2.06666667],
       ...,
       [0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666667,
        0.06666667],
       [0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666667,
        0.06666667],
       [0.06666667, 0.06666667, 0.06666667, ..., 0.06666667, 0.06666667,
        0.06666667]])

What are the dimensions?

In [24]:
lda.components_.shape

(15, 4504)

So, this is a 15 by 4504 matrix, where each row is one of our topics and each column is a word (term). The values that we see are **not** probabilities, they are the **parameters** fitted by the LDA model for the topic-term distribution. We can see that they are not probabilities, since at least some of them seem to have a value > 1. 

These values are so-called "pseudo-counts" that reflect how many times, probabilistically speaking, each word was assigned to each topic across the entire corpus, adjusted by the model's learning process. The values are proportional to the probability of a term given a topic, but they need to be normalized to sum to one across each row to represent actual probabilities.
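The normalization described here can be sketched on a tiny, made-up matrix of pseudo-counts. The values and the names `toy_components` and `toy_topic_term_probs` are assumptions for illustration, not numbers from the fitted model.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over 4 terms (not the fitted values)
toy_components = np.array([
    [0.5, 2.5, 0.5, 1.5],
    [3.0, 0.5, 0.5, 1.0],
])

# Divide each row by its own sum so that every topic sums to one
toy_topic_term_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

print(toy_topic_term_probs)
print(toy_topic_term_probs.sum(axis=1))
```

With the fitted model, the same division applied to `lda.components_` would be expected to give one probability distribution over the 4504 terms per topic.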

Now that we have such a **topic-term distribution**, we can find the top words associated with each topic.

In [25]:
def display_topics(model, features, no_top_words):
    """Helper function to show the top words of a model"""
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([features[i]
                        for i in topic.argsort()[:-no_top_words-1:-1]])) # syntax for reversing a list [::-1]

display_topics(lda, feature_names, 15)

Topic 0:
moon home world years major built spacecraft head year ago way place skaters champion include
Topic 1:
new city dining home restaurants year house space said restaurant built program years make victorian
Topic 2:
eclipse chicken restaurants new time built monday family home square near place moon style hard
Topic 3:
new rent apartment make building bedroom halloumi home biscuits work century time minute best high
Topic 4:
eggs home just easy sweet people early outside needs park probably day better founded cooking
Topic 5:
home sellers realtors estate real chicken group pay association new national said commissions lawsuits brought
Topic 6:
home bread kenji buy south american ago restaurant drive day species dishes years questions art
Topic 7:
bedroom room properties house space week new com bath kitchen half dining floor basement living
Topic 8:
family home salmon flour near recipe make leeks new work died like just small lower
Topic 9:
new home national united bedroom work s

**To note:** Looking at these words, it is hard to decide what each topic represents, since words about food, real estate, and science are mixed together in most topics. Topic 11 seems relatively homogeneous; it is clearly about food.

Given how sparse our document-term matrix is (only 234 documents, but 4504 terms), it is to be expected that there isn't enough data to learn a model with cleaner topics (and the words associated with them).
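The helper `display_topics` relies on the slice `argsort()[:-no_top_words-1:-1]`, which is easy to misread. Here is a minimal sketch of what it does, using a made-up topic over five terms; the terms and weights are assumptions for illustration.

```python
import numpy as np

# Hypothetical weights of one topic over five terms
toy_features = np.array(['apple', 'bread', 'house', 'moon', 'rent'])
toy_topic = np.array([0.2, 3.1, 0.7, 1.9, 0.1])
no_top_words = 3

ascending = toy_topic.argsort()                  # indices from smallest to largest weight
top_indices = ascending[:-no_top_words - 1:-1]   # walk backwards, stop after 3 items

print(top_indices)
print(toy_features[top_indices])
```

The slice starts at the last position of the sorted indices and steps backwards, so the result lists the three heaviest terms from largest to smallest: bread, moon, house.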

### The document-topic matrix and dominant topics

In the prior step, by fitting the LDA model, we found the topics that are present in our corpus. Now, we will use these topics to describe each document as a mixture of topics. For that, we will use the method `transform`, which turns our document-term matrix into a new matrix, the document-topic matrix. This is where the **dimensionality reduction** happens: we go from the wide document-term matrix (4504 columns) to a narrow document-topic matrix (15 columns).

In [26]:
doc_topic_dist = lda.transform(dtm)
doc_topic_dist 

array([[1.85185343e-03, 1.85185310e-03, 1.85185415e-03, ...,
        1.85185194e-03, 1.85185449e-03, 1.85185232e-03],
       [1.25786180e-03, 9.82389927e-01, 1.25786235e-03, ...,
        1.25786176e-03, 1.25786520e-03, 1.25786206e-03],
       [8.13008233e-04, 8.13008358e-04, 8.13008488e-04, ...,
        8.13008322e-04, 8.13008332e-04, 8.13008816e-04],
       ...,
       [2.29885068e-03, 2.29885189e-03, 2.29885136e-03, ...,
        9.67816078e-01, 2.29885078e-03, 2.29885116e-03],
       [2.29885133e-03, 9.67816079e-01, 2.29885285e-03, ...,
        2.29885289e-03, 2.29885103e-03, 2.29885086e-03],
       [1.58730305e-03, 1.58730257e-03, 9.77777766e-01, ...,
        1.58730338e-03, 1.58730263e-03, 1.58730230e-03]])

Verify the shape:

In [27]:
doc_topic_dist.shape

(234, 15)

**Meaning of the matrix values:** The entries in this matrix represent the proportion of the document's content that is attributed to each topic. This means each row of the output matrix is a distribution over topics for the corresponding document and should sum to one. We can easily test that by summing a row, for example with `doc_topic_dist[0].sum()`.
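As a self-contained illustration of that check, the sketch below fits a small LDA model on a made-up corpus and sums every row of the transformed matrix. The corpus and all `toy_` names are assumptions; only the `sklearn` classes are the ones used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A made-up corpus, much smaller than the NYT data
toy_docs = [
    "roast the vegetables in the oven",
    "the apartment has two bedrooms and a kitchen",
    "the rocket will launch before the eclipse",
    "bake the salmon with lemon in the oven",
]

toy_dtm = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(toy_dtm)
toy_doc_topic = toy_lda.transform(toy_dtm)

# Each row is a distribution over the 3 topics, so it should add up to 1
row_sums = toy_doc_topic.sum(axis=1)
print(row_sums)
```

With the notebook's own variables, `doc_topic_dist.sum(axis=1)` would run the same check on all 234 documents at once.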

**Better representing the document-topic matrix**

The document-topic matrix above is not very legible, so we will create a dataframe with a better representation. First, I'll modify the function `display_topics` so that it returns a short header with a few terms for each topic:

In [28]:
def displayHeader(model, features, no_top_words):
    """Helper function that returns one header per topic: its number and top words"""
    topicNames = []
    for topic_idx, topic in enumerate(model.components_):
        topicNames.append(f"Topic {topic_idx}: " + (", ".join([features[i]
                             for i in topic.argsort()[:-no_top_words-1:-1]])))
    return topicNames

In [29]:
# column names
topicnames = displayHeader(lda, feature_names, 5)

# index names
docnames = allDocs.index.tolist() # We will use the original names of the documents

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(doc_topic_dist, 3), 
                                 columns=topicnames, 
                                 index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1) # index of the largest value in each row
df_document_topic['dominant_topic'] = dominant_topic

df_document_topic.head()

Unnamed: 0,"Topic 0: moon, home, world, years, major","Topic 1: new, city, dining, home, restaurants","Topic 2: eclipse, chicken, restaurants, new, time","Topic 3: new, rent, apartment, make, building","Topic 4: eggs, home, just, easy, sweet","Topic 5: home, sellers, realtors, estate, real","Topic 6: home, bread, kenji, buy, south","Topic 7: bedroom, room, properties, house, space","Topic 8: family, home, salmon, flour, near","Topic 9: new, home, national, united, bedroom","Topic 10: like, new, spring, cooking, winter","Topic 11: sheet, pan, new, eggs, meal","Topic 12: new, birds, known, sea, low","Topic 13: bond, trees, years, mortgage, year","Topic 14: new, study, family, percent, estate",dominant_topic
food_1,0.002,0.002,0.002,0.002,0.974,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,4
food_2,0.001,0.982,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,1
food_3,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.989,0.001,0.001,0.001,0.001,0.001,0.001,8
food_4,0.002,0.002,0.002,0.978,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,3
food_5,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.982,0.001,0.001,0.001,11


Let's look at some documents at the boundary between food and realestate:

In [30]:
df_document_topic[76:86]

Unnamed: 0,"Topic 0: moon, home, world, years, major","Topic 1: new, city, dining, home, restaurants","Topic 2: eclipse, chicken, restaurants, new, time","Topic 3: new, rent, apartment, make, building","Topic 4: eggs, home, just, easy, sweet","Topic 5: home, sellers, realtors, estate, real","Topic 6: home, bread, kenji, buy, south","Topic 7: bedroom, room, properties, house, space","Topic 8: family, home, salmon, flour, near","Topic 9: new, home, national, united, bedroom","Topic 10: like, new, spring, cooking, winter","Topic 11: sheet, pan, new, eggs, meal","Topic 12: new, birds, known, sea, low","Topic 13: bond, trees, years, mortgage, year","Topic 14: new, study, family, percent, estate",dominant_topic
food_77,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.973,0.002,0.002,0.002,0.002,0.002,0.002,8
food_78,0.006,0.006,0.006,0.006,0.915,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,0.006,4
food_79,0.001,0.001,0.001,0.001,0.984,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,4
food_80,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.965,0.002,0.002,0.002,0.002,10
food_81,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.983,0.001,0.001,0.001,0.001,0.001,9
realestate_1,0.002,0.002,0.002,0.972,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,0.002,3
realestate_2,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.005,0.928,0.005,0.005,0.005,0.005,0.005,0.005,8
realestate_3,0.963,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0
realestate_4,0.963,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0
realestate_5,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.839,0.145,13


One interesting thing here is that articles food_78 and food_79 share the same dominant topic, just like realestate_3 and realestate_4. Also, realestate_5 has two topics with value > 0.1, both of which seem to be primarily about real estate.

### Topic distribution across documents

Now that we have the document-topic matrix, we can see which topics show up most frequently:

In [None]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

### Challenge yourself: Add two more columns

Using your pandas skills add two new columns to this dataframe:

1. a column with the top 10 words of the corresponding topic. (see Topic Num for the topic number)
2. a column that lists the document names associated with the topic (document names are things like food_1, food_2, etc.)

By adding these two columns it will be a bit easier to understand what is going on with the model and whether it is capturing something about the corpus of documents.

### Interpretation Task

Pick a topic that doesn't have many documents assigned to it and then read all the articles (see dataframe at the start of the notebook) associated with this topic. Do you see any reason why they were given the same dominant topic? Can you summarize in a single phrase what the meaning of that topic is? (Also make use of the top 15 words for that topic.)

<a id="sec4"></a>
## 4. Grid Search: Find number of topics

In the example so far, we arbitrarily chose the number of topics to be 15. However, that is not the right way to go about it. We should use a method for selecting the number of topics. One option is grid search with cross-validation (`GridSearchCV`), which builds a model for each candidate value and compares how well each one scores on held-out documents.

In [None]:
from sklearn.model_selection import GridSearchCV

# We are going to test multiple values for the number of topics
search_params = {'n_components': [5, 10, 15, 20, 25, 30, 35]}

# Initialize the LDA model
lda = LatentDirichletAllocation()

# Initialize a Grid Search with cross-validation instance
grid = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
grid.fit(dtm)

Let us look at the results:

In [None]:
grid.cv_results_

Since this representation is a bit overwhelming, let's access a few features of the grid instance:

In [None]:
# Best Model
best_lda_model = grid.best_estimator_

# Model Parameters
print("Best Model's Params: ", grid.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", grid.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(dtm))

The results show that the best LDA model should have 5 topics, the smallest number we tried. This raises the question of whether we should try other small numbers, which I do below:

In [None]:
search_params = {'n_components': [1,2,3,4,5,6]}

lda = LatentDirichletAllocation()
grid = GridSearchCV(lda, param_grid=search_params)

grid.fit(dtm)

# Best Model
best_lda_model = grid.best_estimator_

# Model Parameters
print("Best Model's Params: ", grid.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", grid.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(dtm))

According to this search, the best number of topics for this corpus is actually 1.
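Two details of the search may be worth checking before trusting this result. As far as I know, `GridSearchCV` with an unsupervised estimator falls back to 5-fold `KFold` without shuffling. Our documents are ordered by section, so each held-out fold comes from one or two sections rather than a mix of all three. Also, `LatentDirichletAllocation()` was created without a `random_state`, so reruns can differ. The sketch below shows how both could be pinned down. The toy corpus, the grid values, and the `toy_` names are assumptions for illustration, and this is not a claim that the outcome on the NYT data would change.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, KFold

# A made-up corpus, ordered by theme like the NYT documents
toy_docs = [
    "roast vegetables oven", "bake salmon lemon oven", "garlic noodles butter",
    "chicken recipe kitchen", "apartment rent bedroom", "house kitchen basement",
    "mortgage sellers realtors", "bedroom bath floor", "moon eclipse sun",
    "rocket launch space", "birds species study", "eclipse space rocket",
]
toy_dtm = CountVectorizer().fit_transform(toy_docs)

# Shuffle the folds so that each one mixes the themes; fix the seeds for reproducibility
folds = KFold(n_splits=3, shuffle=True, random_state=0)
toy_grid = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    param_grid={'n_components': [1, 2, 3]},
    cv=folds,
)
toy_grid.fit(toy_dtm)

print(toy_grid.best_params_)
```

On the real data, the analogous change would be to pass `cv=KFold(n_splits=5, shuffle=True, random_state=0)` and `random_state=0` in the cells of this section.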

**Meaning of Log Likelihood**. 

Log Likelihood is the logarithm of the probability of observing the given data under the model with specific parameters. Essentially, it measures how well the model explains the observed data. (It is a conditional probability.)

**Meaning of perplexity**

Perplexity is a common metric used to evaluate the quality of probabilistic models. It reflects how well the model describes or predicts the documents in the dataset.

A lower perplexity score suggests that the model is more certain about its predictions (i.e., the probability distributions it assigns to unseen documents are more accurate). This means that the topic distributions learned by the model are a good fit for the observed data.
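The two metrics are closely related: in the usual definition, perplexity is the exponential of the negative log-likelihood per word. As far as I understand, scikit-learn follows the same idea using its approximate bound on the log-likelihood, but treat that as an assumption. The numbers below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

# Hypothetical numbers, chosen only to make the arithmetic easy to follow
log_likelihood = -7000.0   # total log-likelihood of a set of documents
n_words = 1000             # total number of word occurrences in those documents

# Perplexity = exp(-log-likelihood per word)
per_word_log_likelihood = log_likelihood / n_words
perplexity = math.exp(-per_word_log_likelihood)

print(perplexity)
```

Here the perplexity is about 1097, which can be read as the model being roughly as uncertain as if it had to choose uniformly among that many words at each position. A higher (less negative) log-likelihood gives a lower perplexity.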

**Words for the best model with one topic**

Let's see what the top words are for the best model with one topic:

In [None]:
display_topics(best_lda_model, feature_names, 40)

As we can see, it is a mix of food, real estate, and New York. If we had more documents, and documents of a more distinct nature, we might have seen something else.

However, the point of this tutorial was to show the mechanics of building LDA models. 

Now it's time to take what you saw here and apply it to your projects.

Have fun exploring!