# Introduction to Machine Learning<br><br>Day 02:<br>Topic Modeling

<center>Dr. William Mattingly<br>
TAP Institute with JSTOR</center>

## Preface

In the last notebook, we learned about some of the fundamental concepts and terms for engaging in machine learning. In this notebook, we begin to apply those concepts to an unsupervised learning problem: topic modeling. This is an unsupervised learning problem because we will not know the number of topics found within our corpus. Instead, we want to create a model that will cluster the data and find topics (based on a number we assign) across an entire corpus.

## Covered in this Notebook

1) What is topic modeling?<br>
2) When should I use topic modeling?<br>
3) Clusters<br>
4) Topics<br>
5) The Gensim Library in Python

## Part One - What is Topic Modeling?

Topic modeling and text classification (addressed below) are branches of natural language processing, better known as NLP. NLP is closely connected to natural language understanding, better known as NLU. NLP is the process by which a researcher uses a computer system to parse human language and extract important metadata from texts. The purpose of NLP is to perform, among other things, distant reading.

Distant reading has a long history extending to the late twentieth century. It is commonly used when the quantity of texts in a given corpus prevents a researcher (or a team of researchers) from reading the corpus closely in its entirety. In order to make sense of that large corpus, the researcher will often pass certain tasks to a computer with the understanding that there is a margin of error. This margin of error is accepted in exchange for the ability to gain a larger, distant understanding of that corpus.

The metadata from these tasks can then be used to get a sense of the texts without reading them closely, hence the term distant reading.

To get a better understanding of how these fields relate to one another, please see the image below.

<img src="https://cdn-images-1.medium.com/max/1000/1*Uf_qQ0zF8G8y9zUhndA08w.png" alt="fishy" class="bg-primary" width="700px">

This image is commonly shared across various NLP tutorials and for good reason. It accurately portrays the diverse field of NLP and its close partner fields of NLU and ASR. The goal of NLU is to give a computer system a text (or collection of texts) and produce some sense of understanding about that text or those texts.

There are various types of tasks that fall under NLU, including summarization, paraphrase, and natural language inference. In summarization, a computer system takes an input text of, say, 5,000 words, reduces that text to its core components, and outputs a summary of the text. This is a task often used by law firms that need to gain a quick understanding of a large corpus of documents to target their investigation and use their time wisely. Another task is sentiment analysis, in which a user gives a computer system a text and the system determines which of two categories it falls into (for example, positive or negative). This is often used by social media companies to determine if a text is abusive so that they can flag and delete inappropriate content automatically.

A common form of NLP and the subject of these notebooks is topic modeling and text classification. While closely linked and rather similar, they are distinct methods that perform distinct tasks. For topic modeling, we give a computer system a text and it tells us what topic(s) is (are) discussed in it. For text classification, we give a system a text and it classifies it into certain categories. In essence, while NLP is essential for working with textual data in a computer environment by parsing it and identifying its key components, NLU goes one step further and tries to understand that same data the way a human may.

For all NLP and NLU tasks, there are rules-based and machine learning-based approaches. In this notebook, we will be looking at each. Parts Two and Three in this book are focused on clustering and topic modeling. In Part Two, we will explore rules-based methods, such as Term Frequency-Inverse Document Frequency, better known as TF-IDF; and in Part Three we will explore machine learning-based methods, specifically Latent Dirichlet Allocation models, better known as LDA models.

Before we move into those subjects, something should be said of rules-based vs. machine learning-based approaches.

A rules-based approach to topic modeling uses a set of rules to extract topics from a text. It does this by identifying keywords in each text in a corpus. One of the most common ways to perform this task is via TF-IDF, or term frequency-inverse document frequency. We will discuss this method a lot more in Part Two of these notebooks. Simply put, TF-IDF measures a word’s frequency in a single text relative to that word’s use across the corpus as a whole. If a word occurs frequently in one document but infrequently in all other documents, then our rules treat that word as a keyword and identify that document as the chief document of the corresponding topic.

For certain problems, a rules-based approach is particularly useful. As we will see, documents that are shorter, such as tweets, tend to fare better with rules-based approaches.

Another option to identify topics in a text is via a machine learning-based approach. In this method, we do not give a computer system a set of rules. Instead, we let the computer generate its own rules to identify topics in a corpus. This is done in two different ways: supervised and unsupervised learning.
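The scoring idea described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the three toy documents, the `tf_idf` helper, and the unsmoothed formula (term frequency multiplied by the log of the inverse document frequency) are all chosen for this example. Libraries that implement TF-IDF typically apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of lowercase tokens
docs = {
    "doc_a": "basketball players love basketball".split(),
    "doc_b": "pasta chefs love pasta".split(),
    "doc_c": "players and chefs love pizza pizza".split(),
}

def tf_idf(term, doc_tokens, all_docs):
    """Score a term in one document relative to the whole corpus."""
    # Term frequency: the share of this document's tokens that are the term
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: the number of documents that contain the term
    df = sum(1 for tokens in all_docs.values() if term in tokens)
    # Inverse document frequency: rarer terms receive a larger weight
    idf = math.log(len(all_docs) / df)
    return tf * idf

scores = {
    name: {term: tf_idf(term, tokens, docs) for term in set(tokens)}
    for name, tokens in docs.items()
}

# The highest-scoring term in each document serves as its keyword
keywords = {name: max(s, key=s.get) for name, s in scores.items()}
print(keywords)
# {'doc_a': 'basketball', 'doc_b': 'pasta', 'doc_c': 'pizza'}
```

Note that "love" appears in every document, so its score is zero everywhere, while a word concentrated in one document rises to the top of that document's list.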

In supervised learning, we know the key subjects in a corpus. We give a computer system a set of documents with their corresponding label to teach it to identify the characteristics that make that particular topic or class unique. This is mostly used for text classification.

Another approach is via unsupervised learning. In unsupervised learning, we do not know the topics of our documents and, instead, we want to let the system identify those topics and cluster the ones with a high degree of similarity together. We then examine the words that occur the most frequently in each cluster to get a sense of the topics at hand. The classic example for machine learning topic modeling is LDA, or Latent Dirichlet Allocation. We will learn about this method in far more detail in Part Three.

## Part Two - When to Use Topic Modeling

All of this leads to a vital question: Why use topic modeling? Topic modeling affords researchers the ability to learn a lot about their corpus very quickly. It is often used when the corpus is so large that no single human could read it in a single lifetime.

In both a rules-based and machine learning-based approach, a researcher can see what major subjects are discussed in a corpus. This information can be used to perform targeted research by weeding out the documents that likely do not contain the information the researcher needs. Additionally, the information drawn from topic modeling can be used to make broad deductions about the corpus at hand. We will also see that topic modeling can be used to draw imprecise or incorrect conclusions.

It is vital, therefore, to understand the limitations of topic modeling. There is always a potential for the researcher to use topic modeling to validate a wrong presumption about the data. Throughout this series, I will emphasize methodological steps that can (and should) be taken to limit these mistakes. Despite this potential for error, topic modeling can provide valuable insight about a large corpus relatively quickly.

## Part Three - What are Clusters and Topics?

Topics are labels assigned to textual data that detail the subjects contained within a given text. In topic modeling we try to create computer systems that can assign topics the way a human would. In order to understand this process, it’s best if we take a step back and think about how we assign topics.

To do this, let’s examine these two texts.

Text 1: Thomas enjoys playing basketball. He is an exceptionally good point guard.<br>
Text 2: Victoria enjoys playing baseball. She is exceptionally good at playing first base.

If I asked you to provide two topics to these texts, what might they be? Basketball and baseball are likely two top candidates. Text 1 would have the topic of basketball, while text 2 would have the topic of baseball. Now, let’s consider these same texts, but add two more into the mix.

Text 3: John is a talented chef. He enjoys making pasta professionally.<br>
Text 4: Jeff is a talented cook. He owns a pizzeria.

Now, if I asked you to assign two topics to all four texts, what might those topics be? It is likely that your answer changed. No longer are the two topics of baseball and basketball relevant because Text 3 and Text 4 do not align well with those topics. Instead, a better pair of topics might be sports and cooking, or something like that. What changed? The collection of texts in our corpus changed.

What does this demonstrate? It tells us that topics are corpus-dependent, meaning the topics we assign to texts depend on their context against surrounding texts. The same holds true for topic modeling via computer systems.

In topic modeling, computer systems do not generate topics. Rather, they generate a list of high-concentration words. Texts that share common terms are clustered together by similarity. A cluster is nothing more than a collection of similar texts.
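To make the idea of "texts that share common terms" concrete, here is a minimal sketch that scores the four example texts by word overlap. It uses Jaccard similarity (shared words divided by all distinct words), which is a simple stand-in chosen for illustration and not the method a topic model such as LDA actually uses. The tiny stopword set and the helper names are likewise assumptions made for this example.

```python
import string
from itertools import combinations

texts = {
    "Text 1": "Thomas enjoys playing basketball. He is an exceptionally good point guard.",
    "Text 2": "Victoria enjoys playing baseball. She is exceptionally good at playing first base.",
    "Text 3": "John is a talented chef. He enjoys making pasta professionally.",
    "Text 4": "Jeff is a talented cook. He owns a pizzeria.",
}

# A tiny, hand-picked stopword set (an assumption made for this example)
example_stopwords = {"he", "she", "is", "a", "an", "at"}

def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords, and return a set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return {word for word in cleaned.split() if word not in example_stopwords}

def jaccard(a, b):
    """Shared words divided by all distinct words across both texts."""
    return len(a & b) / len(a | b)

tokens = {name: tokenize(text) for name, text in texts.items()}

# Score every pair of texts
similarity = {}
for first, second in combinations(tokens, 2):
    similarity[(first, second)] = jaccard(tokens[first], tokens[second])
    print(first, second, round(similarity[(first, second)], 3))
```

In this sketch, Texts 1 and 2 score highest with each other, and Text 3 scores slightly higher with Text 4 than with either sports text, which mirrors the sports and cooking grouping discussed above. With texts this short the margins are thin, which is one reason real topic models need much larger corpora.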

## Part Four - Introduction to Gensim

Gensim is a powerful Python library that was originally designed to produce good topic models. Topic models are machine learning models that read over an entire corpus and group individual documents into clusters of similarity. In order to produce good results, Gensim (like other topic modeling tools) is reliant upon numerical representations of words. In other words, these methods depend on word vectors. Gensim is therefore capable of generating word vectors with relatively minimal code. SpaCy, on the other hand, is an NLP library that is not designed to generate custom word vectors on its own, although users can inject word vectors into its models. For this reason, even spaCy’s documentation recommends using other libraries, such as Gensim, to generate word vectors.

In this notebook, we will be going through the process of generating our own word vectors. In order to reduce the time to perform the task at hand, we will use a toy corpus. This process, however, can easily be scaled for a corpus of millions of documents.

In order to generate word vectors, we need one thing: a corpus. Let’s create one right now.

In [1]:
corpus = "Tom is a cat, while Jerry is a mouse. Tom and Jerry are characters in a cartoon series. Some of the cartoons contain words, but most are silent. Silent cartoons still have music and sound effects."

Before we can give this corpus to Gensim, however, we need to apply a few preprocessing techniques to it.

1. First, we need to remove the stopwords from the corpus. Stopwords are words that occur frequently in a corpus, so frequently that they do not necessarily offer much meaning for distant reading and, as a result, throw off machine learning models. Other stopwords are words that occur with high frequency in a language as a whole. For our purposes, we will use the following stopwords available from the NLTK (Natural Language Toolkit).

In [2]:
stopwords = ["i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves",
             "he","him","his","himself","she","her","hers","herself","it","its","itself","they","them","their",
             "theirs","themselves","what","which","who","whom","this","that","these","those","am","is","are","was",
             "were","be","been","being","have","has","had","having","do","does","did","doing","a","an","the","and",
             "but","if","or","because","as","until","while","of","at","by","for","with","about","against","between",
             "into","through","during","before","after","above","below","to","from","up","down","in","out","on","off",
             "over","under","again","further","then","once","here","there","when","where","why","how","all","any","both",
             "each","few","more","most","other","some","such","no","nor","not","only","own","same","so","than","too","very",
             "s","t","can","will","just","don","should","now"
            ]
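#lowercase the corpus and split it into a list of words on whitespace
corpus = corpus.lower()
words = corpus.split()

#keep only the words that are not in the stopword list
new_corpus = []
for word in words:
    if word not in stopwords:
        new_corpus.append(word)

corpus = " ".join(new_corpus)
print(corpus)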

tom cat, jerry mouse. tom jerry characters cartoon series. cartoons contain words, silent. silent cartoons still music sound effects.


2. Second, this corpus should be divided into sentences. In order to do that, I recommend using spaCy’s sentence tokenizer.<br>
3. While we do this, we should also eliminate the punctuation from the sentences. We can do this with the standard string library from Python.<br>
4. Also at this stage, we should lowercase our words (OPTIONAL)<br>
5. If we wish to produce a smaller number of word vectors, we could also consider lemmatizing our words (OPTIONAL)<br>
6. We need to split each sentence into words and append that list of words to a new object

In [3]:
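import spacy
import string

#load the small English pipeline and process the corpus
nlp = spacy.load("en_core_web_sm")
doc = nlp(corpus)

sentences = []
for sent in doc.sents:
    #strip the punctuation from the sentence, then split it into a list of words
    sentence = sent.text.translate(str.maketrans('', '', string.punctuation))
    words = sentence.split()
    sentences.append(words)
print(sentences)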

[['tom', 'cat', 'jerry', 'mouse'], ['tom', 'jerry', 'characters', 'cartoon', 'series'], ['cartoons', 'contain', 'words', 'silent'], ['silent', 'cartoons', 'still', 'music', 'sound', 'effects']]


## Part Five - Creating Word Vectors

At this stage, we can start preparing our word vectors. To do this, we will use the function below.

In [4]:
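def create_wordvecs(corpus, model_name):
    from gensim.models.word2vec import Word2Vec
    #min_count=1 keeps every word, even those that appear only once
    w2v_model = Word2Vec(min_count=1)
    #build the vocabulary and train on the tokenized sentences passed in as corpus
    w2v_model.build_vocab(corpus)
    w2v_model.train(corpus, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
    #save the vectors as plain text (the data directory must already exist)
    w2v_model.wv.save_word2vec_format(f"data/{model_name}.txt")
create_wordvecs(sentences, "word_vecs")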



Now, we can open up our word vectors and examine them. The first line in this text file will be the shape of the word vectors. This should be two integers. The first number (15) is the number of unique words in the vocabulary. The second number (100) is the number of dimensions of each word vector.

In [5]:
with open("data/word_vecs.txt", "r") as f:
    data = f.readlines()
    print(data[0])

15 100



Let’s look at the first word in our word vectors, “silent”:

In [6]:
print(data[1])

silent -0.00054096937 0.00023816715 0.0051021995 0.009006135 -0.009300297 -0.0071164677 0.0064598355 0.00897975 -0.0050193104 -0.0037623234 0.0073800264 -0.0015330206 -0.0045433184 0.0065578683 -0.004862911 -0.001817435 0.0028776203 0.0009981464 -0.008280853 -0.009456136 0.0073130513 0.0050611165 0.0067697237 0.0007680518 0.0063493964 -0.003403032 -0.00094805425 0.0057731112 -0.0075262673 -0.0039340584 -0.007505744 -0.0009285571 0.009542602 -0.0073249387 -0.002325071 -0.0019412229 0.008077628 -0.0059301443 4.4383218e-05 -0.0047524283 -0.009602784 0.0050010644 -0.008770156 -0.004384546 -3.431853e-05 -0.00030049618 -0.007657594 0.009613245 0.004982054 0.009229672 -0.008153839 0.0044951704 -0.0041342853 0.00082299445 0.008498835 -0.004468288 0.004511762 -0.0067935865 -0.0035477346 0.009396493 -0.0015808161 0.00031599196 -0.0041384036 -0.007684123 -0.0015134802 0.0024720856 -0.00087977306 0.005536616 -0.0027453955 0.0022682755 0.0054518464 0.008352847 -0.001456623 -0.0092013 0.0043750172 0

Here, we see two pieces of information. The first is a string and it is the word itself. In this case, “silent”. The second bit of data is a series of 100 floats, one for each dimension of the word vector. This is the numerical way in which “silent” is understood by the Gensim model.

## Part Six - Topic Modeling with Gensim
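To see how these numbers can be put to use, the sketch below parses text in the same layout as the file above (a header line followed by one word per line) and compares words with cosine similarity, a common way to measure how close two vectors are. The three 4-dimensional vectors are made up for this example and are not output from the model; cosine similarity is an addition here, not something computed in the cells above.

```python
import math

# A hypothetical, hand-made sample in the same layout as data/word_vecs.txt:
# a header line "<vocabulary size> <dimensions>", then one word per line
sample = """3 4
silent 0.1 0.2 0.3 0.4
cartoons 0.1 0.2 0.3 0.5
mouse -0.4 0.3 -0.2 0.1
"""

lines = sample.strip().split("\n")

# The header holds the shape of the word vectors
vocab_size, dimensions = (int(n) for n in lines[0].split())

# Every remaining line is a word followed by its floats
vectors = {}
for line in lines[1:]:
    word, *values = line.split()
    vectors[word] = [float(v) for v in values]

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(vocab_size, dimensions)
print(cosine_similarity(vectors["silent"], vectors["cartoons"]))
print(cosine_similarity(vectors["silent"], vectors["mouse"]))
```

Words whose vectors point in a similar direction score close to 1, while unrelated directions score close to 0. With a toy corpus as small as ours, the trained vectors are too noisy for such scores to be meaningful, but the same calculation applies to vectors trained on a large corpus.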

With these concepts now in place, it is time to engage in topic modeling with Gensim. To do this, we will use a sample dataset known as the 20-newsgroups dataset. In what follows, we will largely follow the standard Gensim tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ with significant modifications. My modifications are designed to reduce the complexity of topic modeling for entry-level students. I have also updated the visualization code.

We first need to load in the data. To do that, we will use Pandas and its built-in read_json() function. This will allow us to grab the data from the internet and automatically format it for us to work with in Python.

In [7]:
import pandas as pd
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')

Now that we have the data loaded, let's convert it to a list and take a look at it. Here's a sample of one entry in the dataset:

In [8]:
news_data = df.content.values.tolist()
print (news_data[1])

From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>



Let's take note of the features of this dataset. Each item in the dataset will have some preliminary header data (which we do not need), separated from the message by a double line break. That which comes after the double line break is the key piece of the message that we are interested in. We are also not interested in email addresses. With this bit of information, we can write a few rules to remove the header and clean up the data. In addition to that, we can leverage the power of Gensim's built-in simple_preprocess() function that allows us to easily clean and structure texts for topic modeling. The code that follows is not what I would call "efficient" nor is it the most Pythonic way to achieve the tasks at hand. It is, however, easier to parse for new coders. I have, therefore, opted to be more verbose in my code so that the reader will be able to more easily follow along.

In [14]:
import re
from gensim.utils import simple_preprocess
final = []
for item in news_data:
    #separate the header from the main body of the text
    item = item.split("\n\n", 1)[1]
    
    #remove all email addresses with Regex (a raw string avoids invalid escape sequence warnings)
    item = re.sub(r'\S*@\S*\s?', '', item)
    
    #create a list of words by removing all non-text material, specifically punctuation and numbers
    words = simple_preprocess(item)
    
    #create an empty list to append to
    final_words = []
    
    #keep only the words that are not stopwords (the list we created above)
    for word in words:
        if word not in stopwords:
            
            #eliminate short words. This dataset has a lot of "rx", "rz", etc. data that may throw off the model.
            if len(word) > 3:
                final_words.append(word)
    final.append(final_words)
print(final[1])



['fair', 'number', 'brave', 'souls', 'upgraded', 'clock', 'oscillator', 'shared', 'experiences', 'poll', 'please', 'send', 'brief', 'message', 'detailing', 'experiences', 'procedure', 'speed', 'attained', 'rated', 'speed', 'cards', 'adapters', 'heat', 'sinks', 'hour', 'usage', 'floppy', 'disk', 'functionality', 'floppies', 'especially', 'requested', 'summarizing', 'next', 'days', 'please', 'network', 'knowledge', 'base', 'done', 'clock', 'upgrade', 'haven', 'answered', 'poll', 'thanks']
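The cleaning cell above assumes that every post contains a blank line after its header. As a hedged variation, the sketch below wraps the same two rules (drop the header, strip email addresses) in a small helper that also tolerates posts without a header. The function name `clean_message` and the sample messages are hypothetical and used only for illustration.

```python
import re

def clean_message(message):
    """Drop the header block and email addresses from one newsgroup post."""
    # Everything before the first blank line is treated as the header.
    # partition() does not raise an error when no blank line is present.
    header, separator, body = message.partition("\n\n")
    if not separator:
        body = message  # no header found, so keep the whole text
    # The same email pattern used in the cell above
    return re.sub(r"\S*@\S*\s?", "", body)

# Hypothetical sample posts
with_header = (
    "From: someone@example.com (A. Poster)\n"
    "Subject: Clock upgrade\n"
    "\n"
    "Please reply to someone@example.com with your results.\n"
)
without_header = "No header here, write to me@example.com please."

print(clean_message(with_header))
print(clean_message(without_header))
```

The first call returns only the body with the address removed, and the second returns the full text minus the address instead of failing on the missing header.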


Our dataset is now properly cleaned and in the format Gensim expects. Now, it's time to create a model. We will not be fine-tuning this model or exploring all the various hyperparameters available to researchers. Instead, we will create a basic model. I always like to do this as a first pass with my topic modeling.

In [19]:
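import gensim
import gensim.corpora as corpora
#map every unique word in the cleaned documents to an integer id
id2word = corpora.Dictionary(final)
#convert each document into a bag of words: a list of (word id, count) pairs
corpus = [id2word.doc2bow(text) for text in final]
#train an LDA model that looks for 20 topics
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=20, chunksize=100, passes=10)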
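The `Dictionary` and `doc2bow` calls above turn each document into a bag of words: a list of (word id, count) pairs that records how often each word occurs and discards word order. The plain-Python sketch below imitates that representation on two made-up documents. The variable names are hypothetical, and the ids are assigned in order of first appearance here, which may differ from the ids Gensim assigns.

```python
from collections import Counter

# Two tiny, made-up documents in the same shape as the cleaned dataset
toy_docs = [
    ["clock", "upgrade", "clock", "poll"],
    ["poll", "thanks", "upgrade"],
]

# Map every distinct token to an integer id, in order of first appearance
token2id = {}
for doc in toy_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

def to_bow(doc):
    """Return sorted (token id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
# {'clock': 0, 'upgrade': 1, 'poll': 2, 'thanks': 3}
print(toy_corpus)
# [[(0, 2), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]
```

The model only ever sees these id and count pairs, which is why the cleaning steps matter so much: any noise left in the word lists ends up in the counts.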



In [20]:
topics = lda_model.print_topics()
for topic in topics:
    print(topic)

(0, '0.058*"health" + 0.045*"medical" + 0.042*"insurance" + 0.034*"father" + 0.034*"kids" + 0.033*"disease" + 0.032*"treatment" + 0.031*"patients" + 0.024*"offers" + 0.021*"studies"')
(1, '0.094*"physical" + 0.067*"master" + 0.064*"window" + 0.043*"friend" + 0.027*"becomes" + 0.023*"direct" + 0.020*"became" + 0.017*"georgia" + 0.016*"militia" + 0.016*"beings"')
(2, '0.024*"would" + 0.023*"writes" + 0.020*"article" + 0.016*"like" + 0.013*"think" + 0.012*"know" + 0.011*"time" + 0.009*"good" + 0.009*"could" + 0.009*"well"')
(3, '0.016*"drive" + 0.014*"thanks" + 0.013*"system" + 0.013*"mail" + 0.011*"please" + 0.011*"also" + 0.010*"windows" + 0.009*"computer" + 0.009*"using" + 0.009*"program"')
(4, '0.026*"video" + 0.023*"serial" + 0.022*"driver" + 0.018*"number" + 0.017*"mouse" + 0.015*"dave" + 0.015*"input" + 0.014*"function" + 0.014*"line" + 0.014*"return"')
(5, '0.049*"values" + 0.045*"floppy" + 0.045*"islam" + 0.032*"upgrade" + 0.031*"islamic" + 0.030*"registration" + 0.024*"muslims" 



In [21]:
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)



In [22]:
vis



pyLDAvis allows us to easily interpret the model's results. We can see that through unsupervised learning the model has found for us the desired 20 clusters (or topics) of data. It now falls on us, as researchers, to interpret these clusters. If I were to assign a label to Cluster 7, for example, I might assign the word "religious". This is because the words that are associated with Cluster 7 are clearly religious in nature. Cluster 9, likewise, deals with sports. Cluster 8 deals with encryption.

One of the notable problems with this result is the presence of very common verb forms that our stopword list did not catch. This can be seen in the high priority of Cluster 4 with the word "would". One of the ways to improve results is through something known as lemmatization, or the reduction of a word to a specific root, or lemma. This takes all the forms of a specific verb or noun and reduces them to a single standard form. The verbs "was", "am", and "is", therefore, all become "be" in lemmatization. Likewise, the words "teams" and "team" become "team". Lemmatization reduces a text further and typically improves the results.

Lemmatization can, however, be computationally expensive.
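The effect of lemmatization on word counts can be sketched without any NLP library. In the example below, a small hand-written lookup table stands in for a real lemmatizer. This is an assumption made purely for illustration; in practice the lemmas would come from a tool such as spaCy, where each token exposes its lemma through the `lemma_` attribute.

```python
from collections import Counter

# A hand-written lookup table standing in for a real lemmatizer (assumption)
lemma_lookup = {
    "teams": "team",
    "played": "play",
    "playing": "play",
    "plays": "play",
    "games": "game",
}

tokens = ["teams", "team", "played", "playing", "plays", "game", "games"]

# Fall back to the token itself when it is not in the table
lemmas = [lemma_lookup.get(token, token) for token in tokens]

print(Counter(tokens))   # seven distinct forms, each counted once
print(Counter(lemmas))   # three lemmas with combined counts
```

Seven distinct surface forms collapse into three lemmas, so the counts that a topic model sees are concentrated on fewer, more meaningful entries.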