# Topic modelling (part B - NMF)

These notes continue the topic modelling series. In the previous notes, I built an LDA topic model, used `GridSearch` to analyse which LDA model could suit the texts best, and then displayed the contents of each topic so that we could choose the most appropriate general label for it.

In this lecture I'll now examine **Non-Negative Matrix Factorization (NMF)**.

See http://jmlr.csail.mit.edu/papers/volume5/hoyer04a/hoyer04a.pdf for further information.

Since these notes are not reliant on the previous session, I'll reload the sample dataset first.

In [1]:
import os
import pandas as pd

all_text_samples = []
data_dir = "clean_data"

# Read every file inside the "clean_data" directory
for filename in os.listdir(data_dir):
    # Build the full path to the file
    file_path = os.path.join(data_dir, filename)
    # The with-block closes the file once it has been read
    with open(file_path, encoding="utf8") as my_text_file:
        all_text_samples.append(my_text_file.read())

# One row per text file, in a single column called "Text"
dataframe = pd.DataFrame(all_text_samples, columns=["Text"])

In [2]:
dataframe.head()

Unnamed: 0,Text
0,"2016 Update: Whether you enjoy myth busting, P..."
1,Let's start with the truth. The 3-point shot w...
2,Media playback is not supported on this device...
3,Krampus with babies postcard (via riptheskull/...
4,"Last week, Michael Dorf published a long and c..."


Now we'll load the TF-IDF vectorizer which we will use to **fit** the NMF model. Refer to the Blackboard notes on **TF-IDF** for further information on TF-IDF vectorising.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

Now I'm building a TF-IDF vectorizer with `max_df` set to 0.90 and `min_df` set to 4. If you are unsure about what these values represent, read **topic modeling using LDA (section A)** on Blackboard where I describe this process in more detail.

In [4]:
tfidf = TfidfVectorizer(max_df=0.90, min_df=4, stop_words="english")

In [5]:
dtm_with_tfidf = tfidf.fit_transform(dataframe["Text"])

The resulting output is a sparse matrix of 7911 rows by 45783 columns. Each row represents one text file from the dataset, so there are 7911 documents. Each column represents one distinct word from the vocabulary built across all of the text documents, and each cell holds the TF-IDF weight of that word in that document.

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. 
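To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for a toy corpus. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 and skips the row normalisation that `TfidfVectorizer` applies by default, so the numbers will not match scikit-learn's output exactly. The corpus and the `tf_idf` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): three tiny "documents"
corpus = [
    "running shoes for marathon running",
    "marathon training plan",
    "training plan for children",
]
tokenised = [doc.split() for doc in corpus]
n_docs = len(tokenised)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in tokenised for word in set(doc))

def tf_idf(word, doc_tokens):
    """Raw term count multiplied by a smoothed inverse document frequency."""
    tf = doc_tokens.count(word)
    idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
    return tf * idf

first_doc = tokenised[0]
for word in sorted(set(first_doc)):
    print(f"{word:10s} {tf_idf(word, first_doc):.3f}")
```

In this sketch `running` scores highest because it occurs twice in the first document and nowhere else, while `for` and `marathon` are discounted because they also appear in another document.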

In [6]:
dtm_with_tfidf

<7911x45783 sparse matrix of type '<class 'numpy.float64'>'
	with 3482007 stored elements in Compressed Sparse Row format>

Next I'll import NMF, create an instance of it, and then fit the NMF model to the TF-IDF document-term matrix built from the dataset.

I'm applying the same value for `n_components` to this model as I did with the LDA one. Read the Blackboard notes on **topic modeling using LDA (section A)** for further information.

In [7]:
from sklearn.decomposition import NMF

In [8]:
nmf_model = NMF(n_components=30, random_state=1)

In [9]:
nmf_model.fit(dtm_with_tfidf)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=30, random_state=1, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)
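For intuition about what `fit` produces, the following sketch factorises a tiny hand-made matrix. NMF approximates a non-negative matrix V as the product of two smaller non-negative matrices: W (documents by topics) and H (topics by words). The toy matrix, the variable names, and the `init` and `max_iter` settings are assumptions chosen for illustration; they are not the settings used for the real model above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix (hypothetical): 4 documents x 5 terms.
# Documents 0-1 use the first three terms, documents 2-3 the last two.
V = np.array([
    [2.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 2.0],
])

toy_model = NMF(n_components=2, init="nndsvda", random_state=1, max_iter=500)
W = toy_model.fit_transform(V)   # document-topic weights, shape (4, 2)
H = toy_model.components_        # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
print((W @ H).round(2))          # approximate reconstruction of V
```

In the real model, `nmf_model.components_` plays the role of H and the result of `nmf_model.transform(...)` plays the role of W.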

I can look up any word in the vectorizer's vocabulary by its column index, just as we did with LDA.

In [10]:
tfidf.get_feature_names()[5000]

'benchwarmer'
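A note for readers on a newer scikit-learn: `get_feature_names()` was deprecated in version 1.0 and removed in 1.2 in favour of `get_feature_names_out()`, which returns a NumPy array instead of a list. The sketch below shows one way to support both; the `vocabulary_list` helper and the two-document corpus are hypothetical and not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def vocabulary_list(vectorizer):
    """Return the fitted vocabulary as a list, whichever method name exists."""
    # Hypothetical helper: prefer the newer method, fall back to the older one
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        getter = vectorizer.get_feature_names
    return list(getter())

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(["apple banana", "banana cherry"])
vocabulary = vocabulary_list(toy_vectorizer)
print(vocabulary[1])
```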

Using this logic, I can build a loop to show the highest-weighted words for each of the topics in my NMF model. In this example, I decided a few steps ago to build a model with `n_components` = 30. This indicates that I'm assuming there are approximately 30 individual topics within the dataset that I'm using.

Once the topic list is displayed using this loop, I can examine whether there is a lot of overlap in the words within each topic. If there is, I can adjust `n_components` accordingly to either reduce or increase the current number of topics in the NMF model.

In [27]:
word_list = []
weight_list = []

top_number = 50
# Fetch the vocabulary once rather than on every iteration
feature_names = tfidf.get_feature_names()

# nmf_model.components_ holds one row of word weights per topic
for count, topic_weights in enumerate(nmf_model.components_):
    text_message = f"Top words for topic {count} are : "
    print(text_message)
    # argsort is ascending, so the last top_number indices are the top words
    for number in topic_weights.argsort()[-top_number:]:
        print([feature_names[number]], end="")
        word_list.append([feature_names[number]])
        weight_list.append(topic_weights[number])
    #show_chart(word_list, weight_list, text_message)
    print("\n")

Top words for topic 0 are : 
['try']['tell']['spend']['money']['best']['help']['great']['maybe']['important']['job']['love']['sleep']['getting']['ask']['doesn']['little']['look']['got']['actually']['didn']['start']['let']['person']['better']['right']['doing']['say']['lot']['thing']['good']['way']['need']['going']['really']['feel']['day']['life']['make']['think']['ll']['ve']['work']['know']['want']['things']['like']['just']['time']['don']['people']

Top words for topic 1 are : 
['salvo']['cabo']['holmgren']['james']['vida']['mar']['luz']['forman']['facebook']['cuba']['lo']['grave']['skiers']['ver']['del']['http']['por']['ski']['https']['su']['com']['las']['aire']['www']['publicado']['los']['cómo']['una']['para']['te']['que']['ha']['es']['en']['esta']['el']['más']['este']['la']['se']['si']['habilitada']['continuación']['automáticamente']['reproducirá']['siguiente']['está']['vídeo']['automática']['reproducción']

Top words for topic 2 are : 
['posting']['influencer']['app']['email']['reac

['raise']['risk']['older']['research']['world']['care']['dad']['behavior']['lives']['girls']['raising']['son']['likely']['helicopter']['daughter']['born']['learn']['skills']['old']['babies']['time']['father']['mom']['study']['childhood']['fathers']['parental']['age']['home']['teach']['says']['life']['baby']['adult']['women']['play']['young']['kid']['mother']['adults']['families']['mothers']['school']['family']['parent']['parenting']['child']['kids']['parents']['children']

Top words for topic 16 are : 
['sports']['jog']['train']['weight']['advertisement']['legs']['foot']['endurance']['fast']['walking']['athletes']['minute']['shoe']['gymnastics']['long']['faster']['slow']['speed']['ll']['feet']['trail']['walk']['time']['muscles']['week']['injury']['exercises']['distance']['mile']['barefoot']['5k']['muscle']['minutes']['fitness']['workouts']['runs']['strength']['pace']['miles']['workout']['runner']['race']['exercise']['body']['shoes']['marathon']['runners']['training']['run']['running']
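The overlap check described before the loop can also be done numerically rather than by eye. The sketch below scores each pair of topics with the Jaccard similarity of their top-word sets (shared words divided by all distinct words). The `topic_overlap` helper, the topic labels and the shortened word lists are hypothetical; the words themselves are hand-picked from the output above.

```python
from itertools import combinations

def topic_overlap(top_words_by_topic):
    """Return {(topic_a, topic_b): Jaccard similarity of their top-word sets}."""
    overlaps = {}
    for (a, words_a), (b, words_b) in combinations(top_words_by_topic.items(), 2):
        set_a, set_b = set(words_a), set(words_b)
        union = set_a | set_b
        overlaps[(a, b)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlaps

# Shortened word lists; the labels are illustrative only
example_topics = {
    "everyday": ["people", "time", "like", "life", "work"],
    "parenting": ["parents", "children", "kids", "time", "life"],
    "running": ["running", "run", "training", "time", "body"],
}

overlaps = topic_overlap(example_topics)
for pair, score in sorted(overlaps.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Pairs with a high score are candidates for merging, which would suggest trying a smaller `n_components`.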


### Add topic number to original dataframe

Just as we did with LDA, now I would like to add the relevant topic number to the original dataframe.

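We can view how strongly each text file is associated with each topic as follows. Note that NMF produces non-negative weights rather than probabilities, so the values in a row do not need to sum to 1: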

In [18]:
textfile_topics = nmf_model.transform(dtm_with_tfidf)

In [19]:
textfile_topics

array([[0.00000000e+00, 6.06487900e-04, 3.16513011e-04, ...,
        0.00000000e+00, 0.00000000e+00, 3.91064411e-05],
       [1.48472845e-02, 0.00000000e+00, 4.80045461e-03, ...,
        0.00000000e+00, 0.00000000e+00, 5.33331022e-03],
       [1.59862823e-02, 0.00000000e+00, 0.00000000e+00, ...,
        1.39454281e-03, 0.00000000e+00, 0.00000000e+00],
       ...,
       [7.06973122e-03, 0.00000000e+00, 0.00000000e+00, ...,
        7.53119750e-03, 0.00000000e+00, 3.40153233e-03],
       [1.21099398e-03, 0.00000000e+00, 1.99435698e-04, ...,
        0.00000000e+00, 9.19710790e-04, 2.09084194e-03],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 1.28626807e-02, 0.00000000e+00]])

In [20]:
# One row per text file (7911) and one column per topic (30)
textfile_topics.shape

(7911, 30)

Let's view the weight of each topic for the first text file.

In [21]:
textfile_topics[0]

array([0.00000000e+00, 6.06487900e-04, 3.16513011e-04, 9.72943769e-04,
       1.65906253e-04, 8.20678945e-04, 1.42364380e-03, 1.49145528e-03,
       4.49516115e-04, 0.00000000e+00, 3.91425916e-03, 0.00000000e+00,
       4.66834930e-04, 3.25802743e-03, 0.00000000e+00, 8.13547793e-04,
       8.19956243e-04, 0.00000000e+00, 5.90019433e-04, 0.00000000e+00,
       4.01814767e-04, 1.47614114e-03, 8.54672078e-02, 1.18697821e-03,
       3.39248863e-03, 2.28920823e-03, 4.04963068e-04, 0.00000000e+00,
       0.00000000e+00, 3.91064411e-05])

Just as we did with LDA, we can make the dominant topic easier to spot by rounding these values to two decimal places. This example shows the rounded weight of each topic for the first text file.

In [22]:
textfile_topics[0].round(2)

array([0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.09, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

It appears that this document belongs to topic 22: it is the only topic whose weight does not round to zero.

We can use `argmax()` to get the position of the highest weight within the array for the first text file and verify the topic number this document is assigned to.

In [23]:
textfile_topics[0].argmax()

22
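Because NMF weights are not probabilities, a value such as 0.09 is hard to judge on its own. One option, sketched below, is to rescale each row so that it sums to 1 and read the result as the share each topic has in the document. The `topic_shares` helper and the toy row are assumptions for illustration; this rescaling is a convenience for interpretation, not something NMF itself provides.

```python
import numpy as np

def topic_shares(weights):
    """Scale one row of NMF weights so that it sums to 1 (all zeros stay zeros)."""
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    return weights / total if total > 0 else weights

# Shortened toy row (hypothetical values, not the real 30-topic output)
toy_row = [0.0, 0.001, 0.004, 0.085, 0.0]
shares = topic_shares(toy_row)
print(shares.round(2))
print("dominant topic:", shares.argmax(), "share:", round(float(shares.max()), 2))
```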

And just as we did previously, we can add a column called **Topic number** to the text file dataframe.

In [24]:
topic_list = []
# textfile_topics holds one array of topic weights
# per text file
for topic_weights in textfile_topics:
    # Get the index of the largest weight in each array
    # and add it to the topic_list list
    topic_list.append(topic_weights.argmax())

# Add a new column to the dataframe
dataframe["Topic number"] = topic_list
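The same result can be obtained without an explicit loop, because `argmax` accepts an `axis` argument. The sketch below uses a small made-up array in place of `textfile_topics`; the names prefixed with `toy_` are hypothetical.

```python
import numpy as np

# Toy stand-in for textfile_topics: 3 documents x 4 topics (hypothetical values)
toy_textfile_topics = np.array([
    [0.00, 0.02, 0.09, 0.01],
    [0.15, 0.00, 0.03, 0.00],
    [0.00, 0.00, 0.01, 0.07],
])

# axis=1 takes the argmax across the topic columns of every row
toy_topic_numbers = toy_textfile_topics.argmax(axis=1)
print(toy_topic_numbers)  # → [2 0 3]
```

With the real data this would be `dataframe["Topic number"] = textfile_topics.argmax(axis=1)`.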

In [25]:
dataframe

Unnamed: 0,Text,Topic number
0,"2016 Update: Whether you enjoy myth busting, P...",22
1,Let's start with the truth. The 3-point shot w...,7
2,Media playback is not supported on this device...,19
3,Krampus with babies postcard (via riptheskull/...,6
4,"Last week, Michael Dorf published a long and c...",25
5,"""Eva Braun was the ""first lady"" of the Third R...",6
6,Reproducción automática Si la reproducción aut...,1
7,"Journal reference:\n\nIn C. Freksa, ed., Found...",22
8,1. Keep makeup remover next to your bed so you...,23
9,"Here, we refrain from providing another genera...",27


Now I can examine each topic and assign a matching description to each topic number. I'll create a dictionary that maps each topic number to its descriptive text and then use it to look up the description for every row.

In [28]:
topic_list = {0: "Wellbeing", 
              1: "Languages", 
              2: "Social media", 
              3: "Politics", 
              4: "Technology", 
              5: "Health", 
              6: "Religion", 
              7: "Sports", 
              8: "Healthcare", 
              9: "Education", 
              10: "Information Technology", 
              11: "Ecology", 
              12: "Fashion", 
              13: "Finance", 
              14: "Tech news", 
              15: "Parenthood", 
              16: "Exercise", 
              17: "Art", 
              18: "Astronomy", 
              19: "Motorsport", 
              20: "Vitamin", 
              21: "Social media", 
              22: "Software development", 
              23: "Cosmetics", 
              24: "Travel", 
              25: "Law", 
              26: "Relationships", 
              27: "Research/Genetics", 
              28: "Meteorology", 
              29: "Cookery"}

topic_no_to_topic = dataframe["Topic number"].map(topic_list)

In [31]:
dataframe["Topic desc"] = topic_no_to_topic

In [33]:
dataframe.head(20)

Unnamed: 0,Text,Topic number,Topic desc
0,"2016 Update: Whether you enjoy myth busting, P...",22,Software development
1,Let's start with the truth. The 3-point shot w...,7,Sports
2,Media playback is not supported on this device...,19,Motorsport
3,Krampus with babies postcard (via riptheskull/...,6,Religion
4,"Last week, Michael Dorf published a long and c...",25,Law
5,"""Eva Braun was the ""first lady"" of the Third R...",6,Religion
6,Reproducción automática Si la reproducción aut...,1,Languages
7,"Journal reference:\n\nIn C. Freksa, ed., Found...",22,Software development
8,1. Keep makeup remover next to your bed so you...,23,Cosmetics
9,"Here, we refrain from providing another genera...",27,Research/Genetics


For comparison, here's what was created using LDA. Refer to **LDA notes** for more details.

| Text | Topic number | Topic desc |
|------|--------------|------------|
| 2016 Update: Whether you enjoy myth busting, P... | 24 | Design |
| Let's start with the truth. The 3-point shot w... | 10 | Sport |
| Media playback is not supported on this device... | 3 | Sport |
| Krampus with babies postcard (via riptheskull/... | 19 | History |
| Last week, Michael Dorf published a long and c... | 2 | Law |
| "Eva Braun was the "first lady" of the Third R... | 19 | History |
| Reproducción automática Si la reproducción aut... | 29 | Social media |
| Journal reference:\n\nIn C. Freksa, ed., Found... | 23 | AI |
| 1. Keep makeup remover next to your bed so you... | 24 | Design |
| Here, we refrain from providing another genera... | 1 | Research |
| He has shared recipe for shakshuka dish of bak... | 11 | Cooking |
| Capturing powerful landscape photographs, imag... | 12 | Career |
| [This story has been optimized for offline rea... | 0 | Education |
| Who loves Crusty Artisan Bread? We sure do! A ... | 11 | Cooking |
| Dear Lifehacker,\n\nI'm tired of the rat race ... | 12 | Career |
| I love to ski, but between the storm tracking,... | 13 | gardening |
| Facebook has been busy with the app updates, a... | 13 | gardening |
| Astronomy Picture of the Day Discover the cosm... | 3 | Sport |
| I will bet anyone $1 million dollars that I ca... | 12 | Career |
| LAST night 40,000 people rented accommodation ... | 18 | Tech news |



The next step is to choose some random documents from the dataset and compare their contents with the assigned label.

Then, depending on the results, I may need to adjust `max_df`, `min_df` or the number of topics to further improve the accuracy.

This is an iterative process and ends when our sampled documents all match with their assigned topics.
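One round of this spot-check could look like the sketch below, which draws a reproducible random sample and prints each label next to the start of its text. The toy dataframe, its rows and the sample size are assumptions for illustration; on the real data the same `sample` call would be made on `dataframe`.

```python
import pandas as pd

# Toy stand-in for the labelled dataframe (hypothetical rows)
toy_dataframe = pd.DataFrame({
    "Text": ["Keep makeup remover next to your bed",
             "The 3-point shot changed basketball",
             "A marathon training plan for beginners",
             "How courts interpret the constitution"],
    "Topic desc": ["Cosmetics", "Sports", "Exercise", "Law"],
})

# Draw a reproducible random sample to read through by hand
spot_check = toy_dataframe.sample(n=3, random_state=1)

for _, row in spot_check.iterrows():
    print(f"[{row['Topic desc']}] {row['Text'][:60]}")
```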

Once the dataset is analysed and verified, we can then use it as labelled data for supervised learning.
