# TOPIC MODELLING WITH NMF

In this notebook we are going to show how to do topic modelling with NMF (non-negative matrix factorization). The goal is to learn how we can extract topics from natural language text using matrix factorization. These topics can then be used for various purposes, for instance, to extract features from a text description that can be used as independent features in a supervised modelling problem.

In the following, we will use both sklearn (for the NMF model) and the spacy NLP library (for NLP transformations).
In particular, spacy comes with pretrained language models (mostly neural networks) which can be used for a variety of purposes, such as tokenization, entity recognition, and so forth. These models must be installed individually in the conda environment.

To install spacy and the English language model in your conda environment, you can use the following commands:

- pip install spacy
- python -m spacy download en_core_web_sm

You can find more information on spacy at https://spacy.io/usage/spacy-101



In [1]:
import pandas as pd
import numpy as np
import spacy
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
import pickle
import os
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
import re

In [2]:
news_df = pd.read_csv("../dataset/news-data.csv")

In [3]:
news_df

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers
...,...,...
1103658,20171231,the ashes smiths warners near miss liven up bo...
1103659,20171231,timelapse: brisbanes new year fireworks
1103660,20171231,what 2017 meant to the kids of australia
1103661,20171231,what the papodopoulos meeting may mean for ausus


## Data cleaning

In [4]:
# first we take the actual texts as a vector
text = news_df['headline_text'].tolist()

In [5]:
# next, this command will load the english language model of spacy, which we have previously installed
nlp = spacy.load("en_core_web_sm")
type(nlp)

spacy.lang.en.English

The first thing we need to do is to transform the natural language sentences in the text vector into lists of tokens. A token can be defined as a string of contiguous characters between two spaces, or between a space and punctuation marks, but there are a lot of exceptions to this definition depending on the language used. Spacy NLP models have been trained to tokenize text for specific languages. Below, we see an example for English.

In [6]:
# we can use the nlp object to tokenize a string into individual tokens:
text[0]

for token in nlp(text[0]):
    print(token)

# note that the spacy language model has been trained to recognize exceptions to tokenization rules
# for instance, in the below, N.Y is treated as a single token
"It's cold in winter in N.Y"
print("\n")
tokenization_example = nlp("It's cold in winter in N.Y")
for token in tokenization_example:
    print(token)

'aba decides against community broadcasting licence'

aba
decides
against
community
broadcasting
licence


"It's cold in winter in N.Y"



It
's
cold
in
winter
in
N.Y
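
For contrast, here is a minimal sketch of what the naive definition of a token (contiguous characters between spaces or punctuation) would give on the same sentence. The `naive_tokenize` helper is hypothetical and is not part of spacy; it only illustrates why a trained, language-specific tokenizer is useful.

```python
import re

def naive_tokenize(text):
    # a token is either a run of word characters, or a single character
    # that is neither a word character nor whitespace (i.e. punctuation)
    return re.findall(r"\w+|[^\w\s]", text)

naive_tokens = naive_tokenize("It's cold in winter in N.Y")
print(naive_tokens)
# ['It', "'", 's', 'cold', 'in', 'winter', 'in', 'N', '.', 'Y']
```

Unlike the spacy output above, the naive rule breaks "N.Y" into three pieces and separates the apostrophe from "'s".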


The spacy tokenizer can be customized if needed. See https://spacy.io/usage/linguistic-features#tokenization for how to do this. Note that for non-English text, e.g. the funda case, you will need the model for that language (Dutch in that case): see https://spacy.io/models/nl.

One of the features of spacy is that the application of the nlp() function (i.e., the "language" object) does not only do tokenization, but applies a full pre-trained end-to-end natural language processing pipeline to the text. For instance, when nlp is called above, spacy applies the following operations:

<div>
<img src="img/spacy_nlp_pipeline.png" width="600">
</div>

Note that spacy will first take the text; then turn it into a list of tokens (a **doc** object); and then use a variety of pre-trained statistical models to annotate these tokens with additional information, to **extract** structured information from the text. For instance, the NER model in the pipeline attempts to identify tokens that refer to specific entities, such as people, organizations, or cities; we call this **named entity recognition**, and it is a fundamental task in NLP.

In [7]:
# the entities that spacy recognized in our tokenization example
tokenization_example.ents
for i in tokenization_example.ents:
    print("Spacy recognized entity: {}, which has been labelled as: {}".format(i, i.label_)) # GPE stands for geopolitical entity

(winter, N.Y)

Spacy recognized entity: winter, which has been labelled as: DATE
Spacy recognized entity: N.Y, which has been labelled as: GPE


In our application, however, we will only need the tokenization feature; we are not interested in the other elements of the pipeline, such as the tagger (which does part of speech tagging, i.e. it identifies nouns, verbs, and so forth), the parser (which identifies syntactic dependencies between tokens, such as **negations**), or NER. Hence we will bypass the other default components of the spacy pipeline and only run the rule-based tokenizer using nlp.make_doc. This will also be a lot faster than using the statistical models.
We can now tokenize all the documents.

In [8]:
# perform tokenization
docs = [nlp.make_doc(x) for x in text]

Next, we need to remove the stopwords from the tokenized documents. Spacy comes with a built-in list of stopwords:

In [9]:
stopwords = nlp.Defaults.stop_words
print(stopwords)

{'twenty', 'also', 'since', 'whom', 'whence', 'side', '’m', 'perhaps', 'should', 'toward', 'bottom', 'therefore', 'am', 'yet', 'around', 'made', 'keep', 'be', 'former', 'hence', 'or', 'per', 'into', 'anywhere', 'been', 'which', 'third', 'every', 'whenever', 'show', 'herein', 'anyhow', 'get', '‘ve', 'doing', 'nevertheless', 'what', 'their', 'wherever', 'itself', 'they', 'thereby', 'otherwise', 'them', 'much', 'me', 'least', 'often', 'else', 'cannot', 'even', 'while', 'below', 'own', 'hundred', 'front', 'after', 'moreover', 'yourselves', 'ten', 'amount', 'becomes', 'due', 'you', 'besides', 'can', 'eight', 'not', 'his', 'few', 'nothing', 'might', 'against', 'everything', 'last', 'did', 'most', 'the', 'full', 'thru', '‘m', 'one', 'becoming', 'once', "'re", 'each', 'used', 'have', 'in', 'by', 'at', 're', 'herself', 'towards', 'thereafter', 'up', 'themselves', 'fifty', 'across', 'whither', "'m", 'somewhere', 'before', 'everyone', 'using', 'a', 'some', 'whose', "n't", 'further', 'we', 'quite'

We can add new stopwords to the list, like this:

In [10]:
#nlp.Defaults.stop_words.add("my_new_stopword")

Stop words, punctuation, and other things like numbers/digits do not encode semantic content and should be removed before doing topic modelling. We can do so easily as follows:

In [11]:
# the following lines of code illustrate how to check if a string is all composed of digits and punctuation using regex
digit_re = re.compile('^([0-9]|[\\.,])*$')
if re.match(digit_re, "1234,500.3"):
    print("it's a match")
else:
    print("it's not a match")

it's a match


In [12]:
tokens_cleaned = []
for doc in docs:
    new_tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct and not re.match(digit_re, token.text):
            new_tokens.append(token.text) # append the text only because we don't need the spacy tokens anymore from now on
    tokens_cleaned.append(new_tokens)

Note two things:

1. after we remove the stopwords as above, tokens_cleaned is no longer a list of doc objects; it is now a list of lists of strings (the token texts). This is because spacy is non-destructive: you should always be able to reconstruct the original text from a doc object (i.e., a doc object only parses and adds new structure over the natural language text), so instead of deleting tokens from a doc we collect the ones we want to keep in a new list
2. if you add your own stopwords, then you will need to do that BEFORE running the tokenization with make_doc, otherwise token.is_stop will not pick them up

In [13]:
tokens_cleaned[0:5]

[['aba', 'decides', 'community', 'broadcasting', 'licence'],
 ['act', 'fire', 'witnesses', 'aware', 'defamation'],
 ['g', 'calls', 'infrastructure', 'protection', 'summit'],
 ['air', 'nz', 'staff', 'aust', 'strike', 'pay', 'rise'],
 ['air', 'nz', 'strike', 'affect', 'australian', 'travellers']]

Now that we have tokenized and cleaned the documents (which in our case are headlines), we are going to use sklearn's CountVectorizer to create a big matrix of dimension N x M, where the N rows are the individual documents (the headlines), and the M columns are the individual words that occur in the *corpus* (i.e., in the whole list of documents). The value of the matrix cell V_ij will be the occurrence count of word j in document i.
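
To make the layout of this matrix concrete, here is a minimal pure-Python sketch on a toy corpus of three short, made-up token lists (they are not the actual headlines). It only illustrates what the cells contain; it is not how CountVectorizer is implemented.

```python
from collections import Counter

# toy corpus: each document is already a list of cleaned tokens
toy_docs = [
    ["air", "nz", "strike", "pay", "rise"],
    ["air", "nz", "strike", "affect", "travellers"],
    ["strike", "strike", "pay"],
]

# the M columns: every distinct word in the corpus, in alphabetical order
vocabulary = sorted({token for doc in toy_docs for token in doc})

# the N rows: one list of counts per document
count_matrix = []
for doc in toy_docs:
    counts = Counter(doc)
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

The third row has a 2 in the "strike" column because that word occurs twice in the third toy document; most other cells are 0.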

In [14]:
# we need to pass a dummy function to the tokenizer and preprocessor parameters of count vectorizer because we already calculated our tokens
def dummy(doc):
    return doc

count_vectorizer = CountVectorizer(
    tokenizer=dummy,
    preprocessor=dummy,
    )

frequency_matrix = count_vectorizer.fit_transform(tokens_cleaned)

In [15]:
type(frequency_matrix)

scipy.sparse.csr.csr_matrix

Note that sklearn stores the frequency matrix as a sparse matrix, which is more efficient for matrices that consist mostly of zeros, like the one above.

In [16]:
count_vectorizer.get_feature_names_out()[:50]

array(['$', "'em", '0.2pc', '0.6pc', '01pc', '026pc', '02pc', '035pc',
       '03pc', '03rd', '04pc', '05pc', '06pc', '083pc', '08s', '09pc',
       '1.1b', '1.26b', '1.2b', '1.2pc', '1.30am', '1.3b', '1.3bn',
       '1.4b', '1.59b', '1.5b', '1.5pc', '1.6b', '1.7b', '1.8b', '1.9bn',
       '10.5pc', '10.7pc', '10.9pc', '10000th', '10000yo', '1000cc',
       '1000k', '1000kms', '1000pc', '1000s', '1000th', '1000yo', '100am',
       '100b', '100k', '100kgs', '100kms', '100kph', '100mi'],
      dtype=object)

You can see that although we removed the digits, there are still words left in the corpus that have little semantic meaning in the context of a bag-of-words model, such as 100kms and so forth. One might want to remove these from the corpus as well, using specific rules or lists of words to remove.
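
As one possible rule (an assumption for illustration, not something this notebook applies), tokens such as 100kms could be dropped by filtering out every token that contains at least one digit. The helper name `drop_tokens_with_digits` is hypothetical.

```python
import re

# matches any token that contains at least one digit anywhere
contains_digit = re.compile(r"\d")

def drop_tokens_with_digits(tokens):
    # keep only the tokens without any digit
    return [tok for tok in tokens if not contains_digit.search(tok)]

sample_tokens = ["100kms", "police", "1.2pc", "wk2", "fire", "$"]
filtered_tokens = drop_tokens_with_digits(sample_tokens)
print(filtered_tokens)
# ['police', 'fire', '$']
```

This rule is deliberately blunt: it would also remove tokens where the digit carries meaning, and it does not touch symbols such as "$", so it should be checked against the data before being used.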

In [17]:
count_vectorizer.get_feature_names_out()[-2000:-1950]

array(['withold', 'witholding', 'withought', 'withour', 'withstand',
       'withstanding', 'withstands', 'withstood', 'witih', 'witkop',
       'witnes', 'witness', 'witnesse', 'witnessed', 'witnesses',
       'witnessing', 'witnesss', 'witnsses', 'wits', 'witsunday', 'witt',
       'wittenoom', 'wittenooms', 'wittner', 'witton', 'witts', 'witty',
       'witzig', 'wivenhoe', 'wives', 'wiwa', 'wiya', 'wiz', 'wizard',
       'wizardry', 'wizards', 'wizkid', 'wj', 'wjdap', 'wk', 'wk2',
       'wknd', 'wks', 'wladimi', 'wladimir', 'wleague', 'wlecome',
       'wlefare', 'wlhd', 'wlliams'], dtype=object)

Note also that words like, e.g., "withold" and "witholding" are considered as separate words in this analysis. This is also not perfect, because the semantic content of the words "withold" and "witholding" is the same, and we would like in principle to count them as one word. To do so, one could apply **lemmatization** to identify the underlying lemma of the two words (in this case, "withold"), and count the occurrences of the lemma itself rather than the individual words. Spacy has a lemmatization component (https://spacy.io/api/lemmatizer) that can be used in an nlp pipeline for this purpose, but we won't worry about it for this demo.

Another thing that can be useful to reduce the dimensionality of the matrix is to pass the max_features argument to the count vectorizer. Then, sklearn will create a matrix with only the top max_features words, ranked by their frequency across the corpus. Here, we are going to use max_features=5000. However, this parameter needs to be adjusted based on data analysis.

In [18]:
count_vectorizer = CountVectorizer(
    tokenizer=dummy,
    preprocessor=dummy,
    max_features=5000 # add the max_features argument
    )

frequency_matrix = count_vectorizer.fit_transform(tokens_cleaned)

In [19]:
frequency_matrix

<1103663x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 4555971 stored elements in Compressed Sparse Row format>
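
A quick back-of-the-envelope calculation shows why the sparse format matters here. The shape and the number of stored elements are taken from the output above; the 8 bytes per cell is the size of one int64 value.

```python
# figures reported by the sparse matrix repr above
n_docs, n_features = 1_103_663, 5_000
n_stored = 4_555_971

# fraction of cells that are actually stored (non-zero)
density = n_stored / (n_docs * n_features)

# average number of distinct vocabulary words per headline
avg_words_per_doc = n_stored / n_docs

# memory needed if every cell were stored as an int64 (8 bytes)
dense_gigabytes = n_docs * n_features * 8 / 1e9

print(f"density: {density:.4%}")
print(f"average non-zero entries per document: {avg_words_per_doc:.2f}")
print(f"dense int64 size: {dense_gigabytes:.1f} GB")
```

Fewer than 0.1% of the cells are non-zero (about 4 words per headline), while a dense int64 array of the same shape would need roughly 44 GB.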

Next, we apply the tf-idf transformer to turn the frequency matrix into a tf-idf matrix. Tf-idf stands for: *term frequency inverse document frequency*. It is a statistical measure of how relevant a word is to a given document, in the context of a collection of documents.

It is calculated as: **frequency of a word in a document * inverse document frequency of the word across all documents**.

The inverse document frequency encodes how rare a word is in the whole dataset. The closer to 0 the inverse document frequency is, the more common the word is. It is calculated by taking the total number of documents, dividing it by the number of documents containing the word, and then taking the logarithm.

The idea is that if a word has a high idf, then it is a rare word in the corpus; hence, it is likely to carry more of the semantic content of the document than words that are very common. E.g., a word like "house" in the funda dataset would occur very often and be less meaningful than a word like "garden" or "school" or "dakkapel"; hence we want to give higher weight to words in a document that are infrequent in general.

<div>
<img src="img/tf_idf.jpeg" width="500">
</div>
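
The calculation described above can be sketched in a few lines of plain Python on a made-up toy corpus (the documents and helper names below are illustrative assumptions). Note that sklearn's TfidfTransformer, with its default settings, uses a smoothed variant of the idf, ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each row, so its values will differ from the textbook formula.

```python
import math

# toy corpus of already-tokenized documents (made up for illustration)
toy_corpus = [
    ["police", "probe", "crash"],
    ["police", "charge", "man"],
    ["police", "search", "man"],
    ["fire", "crews", "blaze"],
]
n_documents = len(toy_corpus)

def document_frequency(word):
    # number of documents that contain the word at least once
    return sum(1 for doc in toy_corpus if word in doc)

def idf_textbook(word):
    # log(total documents / documents containing the word)
    return math.log(n_documents / document_frequency(word))

def idf_smoothed(word):
    # smoothed variant documented for TfidfTransformer(smooth_idf=True)
    return math.log((1 + n_documents) / (1 + document_frequency(word))) + 1

def tf_idf(word, doc):
    # term frequency in this document times the textbook idf
    return doc.count(word) * idf_textbook(word)

for word in ["police", "man", "crash"]:
    print(word, round(idf_textbook(word), 3), round(idf_smoothed(word), 3))
```

"police" occurs in three of the four toy documents and gets the lowest idf, while "crash" occurs in only one and gets the highest.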

In [20]:
tfd_idf_trans = TfidfTransformer()
tf_idf_matrix = tfd_idf_trans.fit_transform(frequency_matrix)

The tf idf matrix is also stored as a sparse matrix.

In [21]:
tf_idf_matrix

<1103663x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 4555971 stored elements in Compressed Sparse Row format>

We now use the NMF model from sklearn (which is implemented as a transformer, i.e., it exposes fit and transform rather than predict) to decompose our tf-idf matrix into two non-negative matrices: a documents x topics matrix W, and a topics x words matrix H. Thus, the model finds two matrices W, H, such that tf_idf_matrix ~ WH.

Note the sign ~ in the equation above, and not the sign =. This is because the two decomposing matrices W, H will not return the original tf_idf_matrix when multiplied together; rather, they will return a new matrix that approximates the original tf_idf matrix with an error E.

One important caveat of this model is that the number of topics to be discovered (the n_components parameter in the NMF model), i.e., the number of columns of W and the number of rows of H, is a **hyperparameter**; i.e., it needs to be specified when the model is instantiated.
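
To illustrate the approximation V ~ WH and the role of the number of topics, here is a minimal numpy sketch of NMF using the classic multiplicative update rules for the squared (Frobenius) error. This is a swapped-in textbook algorithm chosen for brevity, not the solver that sklearn uses by default; the function name, the toy matrix, and all parameter values are assumptions for illustration.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix V as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = V.shape
    # random non-negative starting point
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_words))
    errors = []
    for _ in range(n_iter):
        # multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # reconstruction error ||V - WH|| after this iteration
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# toy documents x words matrix with two visible groups of words
V = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 1.0],
])
W, H, errors = nmf_multiplicative(V, n_components=2)

print(W.shape, H.shape)        # (4, 2) (2, 4)
print(errors[0], errors[-1])   # the error shrinks but does not reach 0
```

With two components the toy matrix cannot be reproduced exactly, so some reconstruction error remains; choosing a different n_components changes both the shapes of W and H and the size of that error.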

In [22]:
# we use 10 as the n_components here
model = NMF(n_components=10, init='nndsvd')

# fit the model
model.fit(X=tf_idf_matrix)

NMF(init='nndsvd', n_components=10)

In [23]:
# the model.components_ matrix is the H matrix mapping topics to word weights
model.components_
model.components_.shape

array([[0.00000000e+00, 5.59881810e-06, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.01220539e-02, 2.61696942e-03, 8.76515833e-03, ...,
        8.20097637e-04, 0.00000000e+00, 6.76022141e-03],
       [3.38853033e-03, 9.90211870e-04, 3.49371009e-03, ...,
        5.81247911e-03, 2.49112509e-03, 0.00000000e+00],
       ...,
       [3.70511399e-03, 2.20310474e-04, 0.00000000e+00, ...,
        8.31889312e-03, 3.40713921e-03, 3.98590572e-04],
       [1.15192810e-02, 1.83898953e-05, 3.90650924e-04, ...,
        1.37729671e-02, 7.12337185e-03, 5.91310975e-03],
       [2.27368217e-03, 3.85378359e-04, 3.48036846e-02, ...,
        7.26846073e-03, 0.00000000e+00, 8.73785186e-03]])

(10, 5000)

In [24]:
def get_topics(model, n_top_words):
    '''This function takes a fitted NMF factorization model, and a n_top_words parameter.
    It then produces a dataframe where the columns are the topics that have been learned, and the
    rows are the top words that define the topic, ranked by their coefficients.
    '''

    feature_names = count_vectorizer.get_feature_names_out()  # get_feature_names() was removed in recent sklearn versions
    d = {}
    for i in range(model.n_components):
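        # model.components_ is the matrix H, where topics are rows and words are columns;
        # argsort sorts ascending, so the reversed slice picks the n_top_words largest weights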
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feature_names[key] for key in words_ids]
        d['Topic # ' + '{:02d}'.format(i+1)] = words
    return pd.DataFrame(d)
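
The slice `argsort()[:-n_top_words - 1:-1]` used in the function can be hard to read, so here is a small standalone sketch of what it does. The word list and weights are made up for illustration.

```python
import numpy as np

# one hypothetical row of H: a weight per vocabulary word
feature_names = np.array(["court", "fire", "police", "crash", "murder"])
topic_weights = np.array([0.10, 0.70, 0.00, 0.40, 0.25])
n_top_words = 3

# indices that would sort the weights in ascending order
ascending_ids = topic_weights.argsort()

# walk backwards from the end and stop after n_top_words elements,
# which yields the indices of the largest weights, biggest first
top_ids = ascending_ids[:-n_top_words - 1:-1]
top_words = feature_names[top_ids]

print(ascending_ids)  # [2 0 4 3 1]
print(top_ids)        # [1 3 4]
print(top_words)      # ['fire' 'crash' 'murder']
```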

In [25]:
get_topics(model, 20)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07,Topic # 08,Topic # 09,Topic # 10
0,interview,man,police,new,says,abc,court,country,fire,crash
1,michael,charged,probe,zealand,council,news,accused,hour,house,car
2,extended,murder,investigate,laws,govt,rural,face,nsw,crews,dies
3,david,jailed,missing,year,plan,business,murder,wa,blaze,killed
4,john,missing,search,hospital,water,weather,faces,podcast,sydney,fatal
5,james,stabbing,death,home,nt,sport,charges,rural,threat,woman
6,nrl,guilty,hunt,york,australia,market,told,qld,school,road
7,ben,found,officer,centre,health,national,case,sa,home,driver
8,matt,arrested,shooting,deal,urged,analysis,high,vic,destroys,plane
9,smith,death,seek,years,report,entertainment,hears,tas,suspicious,dead


We see some coherence in the topics that the NMF model has identified. In particular:

- Topic 01 has "interview" together with a bunch of first names, so it looks like it's identifying headlines whose topic is interviews of people
- Topic 02 seems to identify headlines whose subject is violent crimes
- Topic 03 is similar to topic 02, but seems to focus more on headlines having to do with police investigations
- Topic 04 is not fully clear
- Topic 05 identifies headlines that have to do with government or politics, such as funding, budgets, and so forth
- Topic 06 is not clear
- Topic 07 has also to do with crime headlines; however it focuses more on the legal aspect of these, as witnessed by the words "court", "accused", "face" (probably used in sentences like "faces X years in prison"), and so forth
- Topic 08 seems to be isolating regional headlines: apart from "country" and "hour", the top words shown above are mostly Australian state abbreviations ("nsw", "wa", "qld", "sa", "vic", "tas")
- Topic 09 is natural disasters
- Topic 10 is disasters of a human nature, in particular transport accidents (car crashes, et cetera)

We can now use the above model to identify, for each document, which topics apply to that document.

In [35]:
document_topics = model.transform(X=tf_idf_matrix)
document_topics = pd.DataFrame(document_topics)
main_topic = document_topics.idxmax(axis=1)
topics = ["Topic " + str(x) for x in range(1, 11)]
main_topic = main_topic.apply(lambda x: topics[x])

In [39]:
news_df['main_topic'] = main_topic
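
One caveat with picking the column of the maximum: when a row of the document x topics matrix is all zeros, the maximum is taken at the first column, so the document is silently labelled with the first topic. This could happen, for example, for a headline whose tokens were all removed or fall outside the vocabulary. The sketch below, on a made-up matrix and with a hypothetical "No topic" label, shows one way to guard against it.

```python
import numpy as np

# toy documents x topics weights (made up for illustration)
toy_doc_topics = np.array([
    [0.00, 0.20, 0.05],
    [0.00, 0.00, 0.00],   # no topic has any weight for this document
    [0.30, 0.10, 0.00],
])

# column index of the largest weight in each row
main_topic_ids = toy_doc_topics.argmax(axis=1)

# rows where at least one topic has a positive weight
has_topic = toy_doc_topics.sum(axis=1) > 0

labels = [
    f"Topic {topic_id + 1}" if ok else "No topic"
    for topic_id, ok in zip(main_topic_ids, has_topic)
]
print(labels)
# ['Topic 2', 'No topic', 'Topic 1']
```

Without the `has_topic` check, the second toy document would be labelled "Topic 1" even though no topic applies to it.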

In [42]:
news_df.loc[lambda x: x.main_topic == "Topic 3"].iloc[:20]

Unnamed: 0,publish_date,headline_text,main_topic
35,20030219,death toll continues to climb in s korean subway,Topic 3
62,20030219,greens offer police station alternative,Topic 3
68,20030219,harrington raring to go after break,Topic 3
72,20030219,inquest finds mans death accidental,Topic 3
73,20030219,investigations underway into death toll of korean,Topic 3
98,20030219,more than 40 pc of young men drink alcohol at,Topic 3
114,20030219,nth koreans seek asylum at japanese embassy,Topic 3
132,20030219,police cracking down on driver safety,Topic 3
133,20030219,police defend aboriginal tent embassy raid,Topic 3
161,20030219,search continues for victims in s korean subway,Topic 3
