# <center>Introduction to Topic Modeling</center>

## Introduction

The purpose of this part of the textbook is fivefold.

1) Introduce the reader to the core concepts of topic modeling and text classification<br>
2) Provide an introduction to three libraries used for traditional topic modeling (Scikit Learn, Gensim, and spaCy) for those with limited Python knowledge<br>
3) Detail common problems that arise in topic modeling, along with their solutions<br>
4) Provide an overview of transformer-based topic modeling<br>
5) Provide code that will be easily reproducible for readers who wish to apply these methods to their own domains<br>

Throughout this part of the textbook, we will work with one dataset, a collection of short descriptions of violence in Apartheid South Africa, which comes from Volume 7 of the Truth and Reconciliation Commission's final report (hereafter, TRC Vol. 7). I have chosen this dataset because I have experience with it and I know that the data is perfectly suited to topic modeling as hidden topics are found within it.

## What is Topic Modeling?

Topic modeling is an approach in natural language processing (NLP) where we try to find hidden themes within a collection of documents, known as a corpus. In this scenario, we do not know the subject of each document in our collection. Topic modeling is particularly useful when the corpus is too vast to manually tag each document with a specific topic name. How, then, do we understand how all the documents in our corpus relate to one another? The answer lies in computational approaches to text analysis.

Topic modeling is distinctly different from another similar approach in NLP known as **text classification**. In text classification, we train a machine learning model to recognize specific known labels. We do this with training data that consists of texts and their corresponding labels. This is known as supervised learning. Topic modeling is an entirely different approach designed to work with an entirely different problem. Since we do not know our labels and do not have training data, supervised learning would not work. Topic modeling is an unsupervised learning approach to finding and identifying the labels.

Today, there are many approaches to topic modeling. In Chapter 2, we will build an LDA (Latent Dirichlet Allocation) model. While useful, this approach to topic modeling has largely been replaced with transformer-based topic models (Chapter 3). Before we can explore each of these, however, it is important to have a firm understanding of the themes and concepts of topic modeling. This will be the subject of this chapter.

## Rules-Based Methods

A rules-based approach to topic modeling uses a set of rules to extract topics from a text. It does this by identifying keywords in each text in a corpus. One of the most common ways to perform this task is via TF-IDF, or term frequency-inverse document frequency. We will discuss this method in much more detail in Part Two of these notebooks. Simply put, TF-IDF weighs a word's frequency in a single text against that word's use across the corpus as a whole. If a word occurs frequently in one document but infrequently in all the others, it receives a high score for that document. We can then use rules to treat the document in which a word scores highest as the chief document of the topic that word represents.
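To make the idea concrete, here is a minimal sketch of TF-IDF written with only the Python standard library. The three sentences are invented for illustration and are not drawn from TRC Vol. 7, and the formula used (relative term frequency multiplied by the natural log of the inverse document frequency) is only one of several common variants. Libraries such as Scikit Learn apply additional smoothing and normalization, so their scores will differ from these.

```python
import math
from collections import Counter

# Toy corpus invented for illustration; it is not drawn from TRC Vol. 7.
docs = [
    "the police arrested the student in the township",
    "the student joined the protest in the city",
    "the farm was attacked in the night",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Return a {word: score} dict for a single tokenized document."""
    counts = Counter(tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / len(tokens)                 # frequency within this document
        idf = math.log(n_docs / doc_freq[word])  # rarity across the corpus
        scores[word] = tf * idf
    return scores


scores = [tf_idf(tokens) for tokens in tokenized]

# Print the three highest-scoring words for each document.
for i, doc_scores in enumerate(scores):
    top = sorted(doc_scores, key=doc_scores.get, reverse=True)[:3]
    print(i, top)
```

Notice that a word such as "the", which appears in every document, receives a score of zero no matter how often it occurs, while a word confined to a single document rises to the top of that document's list.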

For certain problems, a rules-based approach is particularly useful. As we will see, shorter documents, such as tweets, tend to fare better with rules-based approaches.

## Machine Learning-Based Methods

Another option for identifying topics in a text is a machine learning-based approach. In this method, we do not give a computer system a set of rules; rather, we let the computer generate its own rules to identify topics in a corpus. This is done in two different ways: supervised and unsupervised learning.
 
In supervised learning, we know the key subjects in a corpus. We give a computer system a set of documents with their corresponding labels to teach it to identify the characteristics that make each topic or class unique. This is mostly used for text classification.
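As a rough illustration of supervised learning, the sketch below trains a very small multinomial naive Bayes classifier using only the Python standard library. Naive Bayes is my choice for the example because it fits in a few lines; it is not the only option for text classification. The training sentences and the two labels (`detention` and `arson`) are hypothetical and were invented for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training data, invented for illustration.
train = [
    ("police arrested student at protest", "detention"),
    ("activist detained by police without trial", "detention"),
    ("house burned during night attack", "arson"),
    ("shop burned and home destroyed by fire", "arson"),
]

# Count how often each word occurs under each label.
word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}


def predict(text):
    """Return the most probable label under a multinomial naive Bayes model."""
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        total = sum(word_counts[label].values())
        # Log prior: how common the label is in the training data.
        score = math.log(n_label / len(train))
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


print(predict("police detained a student"))
print(predict("home burned in the night"))
```

The key point is that the model can only ever return a label it saw during training. This is why supervised learning requires us to know our labels in advance.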

Another approach is via unsupervised learning. In unsupervised learning, we do not know the topics of our documents and, instead, we want to let the system identify those topics and cluster the documents with a high degree of similarity together. We then examine the words that occur the most frequently in each cluster to get a sense of the topics at hand. The classic example for machine learning topic modeling is LDA, or Latent Dirichlet Allocation. We will learn about this method in far more detail in Part Three.
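The sketch below illustrates the general idea of clustering similar documents and then inspecting the most frequent words in each cluster. It is important to note that this is not LDA. It is a deliberately simple k-means-style loop over cosine similarity, written with only the standard library, and it stands in for the concept rather than for any particular library's method. The documents, the choice of two clusters, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter

# Unlabelled toy documents, invented for illustration.
docs = [
    "police arrested student protest police",
    "house burned fire night",
    "student detained police protest",
    "shop burned fire house",
]
vectors = [Counter(doc.split()) for doc in docs]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, k=2, n_iter=10):
    """Group documents with a small k-means-style loop over cosine similarity."""
    # Deterministic start: the first document, then repeatedly the document
    # least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    assignments = []
    for _ in range(n_iter):
        # Assign each document to its most similar centroid.
        assignments = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        # Recompute each centroid as the summed word counts of its members.
        centroids = [
            sum((v for v, a in zip(vectors, assignments) if a == j), Counter())
            for j in range(k)
        ]
    return assignments, centroids


assignments, centroids = cluster(vectors)
for j, centroid in enumerate(centroids):
    members = [i for i, a in enumerate(assignments) if a == j]
    top_words = [w for w, _ in centroid.most_common(3)]
    print(f"cluster {j}: documents {members}, top words {top_words}")
```

Unlike the supervised case, the system never sees a label. It returns only numbered clusters and their most frequent words, and it falls to the researcher to interpret what each cluster is about.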

## Why use Topic Modeling?

All of this leads to a vital question: Why use topic modeling? Topic modeling affords researchers the ability to learn a lot about their corpus very quickly. It is often used when the corpus is so large that no single human could read it in a single lifetime.

In both a rules-based and a machine learning-based approach, a researcher can see what major subjects are discussed in a corpus. This information can be used to perform targeted research by weeding out the documents that likely do not contain the information the researcher needs. Additionally, the information drawn from topic modeling can be used to make broad deductions about the corpus at hand. We will also see that topic modeling can lead to imprecise or incorrect conclusions.

It is vital, however, to understand the limitations of topic modeling. There is always a potential for the researcher to use topic modeling to validate a wrong presumption about the data. Throughout this series, I will emphasize methodological steps that can (and should) be taken to limit these mistakes. Despite this potential for error, topic modeling can provide valuable insight into a large corpus relatively quickly.