# Topic Modeling using Latent Dirichlet Allocation (LDA)

The Vox News corpus is a collection of all Vox articles published before March 21, 2017. Vox Media released this dataset as part of the KDD 2017 Workshop on Data Science + Journalism. Their goal for publishing this dataset was to enable data science researchers to apply various techniques on a news dataset.

<b>The dataset consists of 22,994 news articles with their titles, author names, categories, published dates, updated-on dates, links to the articles and short descriptions (8 columns). While visualizing the dataset, I noticed that the articles are grouped into 185 distinct categories. Of those articles, 7145 were tagged with the category "The Latest". It is unlikely to be a coincidence that such a large number of articles carry a generic category. Hence, I decided to address this problem with unsupervised learning, because the categories of these articles are not known beforehand, nor can the articles be tagged with their categories in the training dataset.</b>

<b>We get a crude idea of an article by just skimming its category. Hence, topic modeling is useful for categorizing or ranking articles that an individual has yet to read. Moreover, clustering articles by topic also enables them to be organized into groups of similar topics inside a database. This simplifies the collective analysis of such Big Data, especially in news and journalism, where an enormous amount of data is archived and retrieved only when needed. Categorical clustering will also make information retrieval quicker and more efficient.</b>

We can analyze the title, short description and the body of these 7145 articles and predict their categories by using Topic Modeling.

In reality, analyzing the body would drastically improve the topic model. However, due to time constraints and proclivity towards minimalism, I have decided to drop the body column entirely. Also, parsing html tags in the body of articles would be a time-consuming task in itself.

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
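The "10% cats, 90% dogs" intuition can be sketched with a few lines of standard-library Python. This is only an illustration: the two toy topics, their word lists and the function name `generate_document` are assumptions made up for this example, not part of the Vox dataset or of any library.

```python
import random
from collections import Counter

# Hypothetical toy topics: each topic is a small list of words drawn uniformly.
TOPICS = {
    "dogs": ["dog", "bone", "bark", "the", "is"],
    "cats": ["cat", "meow", "purr", "the", "is"],
}

def generate_document(topic_weights, n_words, seed=0):
    """Draw each word by first picking a topic, then a word from that topic."""
    rng = random.Random(seed)
    names = list(topic_weights)
    weights = [topic_weights[name] for name in names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=weights, k=1)[0]
        words.append(rng.choice(TOPICS[topic]))
    return words

# A document that is 90% about dogs and 10% about cats.
doc = generate_document({"dogs": 0.9, "cats": 0.1}, n_words=1000)
counts = Counter(doc)
dog_words = counts["dog"] + counts["bone"] + counts["bark"]
cat_words = counts["cat"] + counts["meow"] + counts["purr"]
print("dog words:", dog_words, "cat words:", cat_words)
print("shared words:", counts["the"] + counts["is"])
```

The dog-specific words should clearly outnumber the cat-specific ones, while "the" and "is" appear regardless of topic. A topic model works in the opposite direction: it starts from the word counts and tries to recover the topics and their proportions.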

Topic models are also referred to as probabilistic topic models as they are based on probabilistic graphical modeling, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body. In the age of information, the amount of the written material we encounter each day is simply beyond our processing capacity. Topic models can help to organize and offer insights for us to understand large collections of unstructured text bodies. Originally developed as a text-mining tool, topic models have been used to detect instructive structures in data such as genetic information, images, and networks. They also have applications in other fields such as bioinformatics. [Source: https://en.wikipedia.org/wiki/Topic_model]

The necessary Python libraries and packages, such as numpy, pandas, matplotlib, spaCy, gensim and NLTK, are imported below.

In [1]:
%matplotlib inline
import pickle
import pprint
import random
import warnings
import time

# numpy, pandas, matplotlib and regular expressions (data science essentials)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

# tqdm
from tqdm import tqdm

# spacy
import spacy
from spacy.lang.en import English
import en_core_web_sm

# gensim
import gensim
from gensim import corpora
from gensim.models import CoherenceModel

# nltk
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import words
from nltk.stem.wordnet import WordNetLemmatizer

# pyLDAvis
import pyLDAvis
import pyLDAvis.gensim

# styling
pd.set_option('display.max_columns',150)
plt.style.use('bmh')
from IPython.display import display

### 1)  Data Visualization

This bubble chart quantifies the number of articles written by different authors.  [Source: https://data.world/elenadata/vox-articles]

![Figure 1](../bin/resources/articles-per-author.png "Figure 1")

This graph shows the gradual increase in the number of articles published each month. However, the average number of articles published per month in 2016 and 2017 seems to be similar.  [Source: https://data.world/elenadata/vox-articles]

![Figure 2](../bin/resources/articles-by-month.png "Figure 2")

The entire dataset consists of a total of 185 distinct topics. This bubble plot shows records grouped by category. We can observe that the category "The Latest" has the maximum number of records.

![Figure 3](../bin/resources/records-by-category.png "Figure 3")

This bar graph shows the distribution of records across topics and across the different authors who have written about the same topic.

![Figure 4](../bin/resources/records-by-category-&-author.png "Figure 4")

Several algorithms have been developed for topic modeling, including approaches based on singular value decomposition (SVD) and the method of moments. Some of them are listed below:
<ul>
<li>Explicit semantic analysis</li>
<li>Latent semantic analysis</li>
<li>Latent Dirichlet Allocation (LDA)</li>
<li>Hierarchical Dirichlet process</li>
<li>Non-Negative Matrix Factorization (NMF)</li>
</ul>

I decided to use LDA as it is a topic modeling technique widely praised by researchers and data scientists. Owing to my Data Mining project, I also had prior experience working with the Gensim library in Python, which has a robust LDA implementation. LDA is a probabilistic model that exploits similarity between data and draws inferences from the resulting analysis.

In natural language processing, Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael I. Jordan in 2003. Essentially the same model was also proposed independently by J. K. Pritchard, M. Stephens, and P. Donnelly in the study of population genetics in 2000. Both papers have been highly influential, with 19858 and 20416 citations respectively by August 2017.  [Source: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation]

In simple words, one can say that LDA converts a set of documents into a set of topics.

This cell downloads the required NLTK corpora and loads the spaCy English model.

In [2]:
warnings.simplefilter('ignore')

nltk.download('wordnet')
nltk.download('stopwords')

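# The 'en' shortcut only works in older spaCy releases; spaCy 3+ needs the full model name
nlp = spacy.load('en_core_web_sm')
# nlp = en_core_web_sm.load()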

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tanveershaikh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tanveershaikh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Using NLTK, I am building a lookup of English words and creating a lemmatizer object based on WordNet.

In [3]:
# Initialization and Global Variables
dictionary = dict.fromkeys(words.words(), None)
lemmatizer = WordNetLemmatizer()
pp = pprint.PrettyPrinter(indent = 4)

# List for filtering out stop words
en_stop = set(nltk.corpus.stopwords.words('english'))

### 2) Reading the dataset (dsjVoxArticles.tsv) - Data Extraction

News article topic modeling is an unsupervised machine learning method that helps us discover hidden semantic structures in articles and learn topic representations of the articles in a corpus.

The data is fetched from the data.world URL and converted into a Pandas DataFrame in the following cell.

In [4]:
url = "https://query.data.world/s/ee6arp6cngynnoj4hvyuhckn3tb4hj"
df = pd.read_csv(url, delimiter = '\t', encoding = 'utf-8')

Selecting only the articles categorized as 'The Latest' and dropping all other articles, which carry their correct categories, would be unfair because we would be discarding our training data. Hence, I am keeping all rows of the dataset at this stage.

### 3) Exploratory Analysis

This section deals with exploring and analyzing the dataset. It will give us a deeper understanding of the dataset by making us familiar with all the rows and columns of the dataset.

In [5]:
# Prints out the first 5 rows of data in the dataset
df.head()

Unnamed: 0,title,author,category,published_date,updated_on,slug,blurb,body
0,Bitcoin is down 60 percent this year. Here's w...,Timothy B. Lee,Business & Finance,2014-03-31 14:01:30,2014-12-16 16:37:36,http://www.vox.com/2014/3/31/5557170/bitcoin-b...,Bitcoins have lost more than 60 percent of the...,<p>The markets haven't been kind to<span> </sp...
1,6 health problems marijuana could treat better...,German Lopez,War on Drugs,2014-03-31 15:44:21,2014-11-17 00:20:33,http://www.vox.com/2014/3/31/5557700/six-probl...,Medical marijuana could fill gaps that current...,<p>Twenty states have so far legalized the med...
2,9 charts that explain the history of global we...,Matthew Yglesias,Business & Finance,2014-04-10 13:30:01,2014-12-16 15:47:02,http://www.vox.com/2014/4/10/5561608/9-charts-...,These nine charts from Thomas Piketty's new bo...,<p>Thomas Piketty's book <i>Capital in the 21s...
3,Remember when legal marijuana was going to sen...,German Lopez,Criminal Justice,2014-04-03 23:25:55,2014-05-06 21:58:42,http://www.vox.com/2014/4/3/5563134/marijuana-...,"Three months after legalizing marijuana, Denve...",<p><span>When Colorado legalized recreational ...
4,Obamacare succeeded for one simple reason: it'...,Sarah Kliff,Health Care,2014-04-01 20:26:14,2014-11-18 15:09:14,http://www.vox.com/2014/4/1/5570780/the-two-re...,"After a catastrophic launch, Obamacare still s...",<p>There's a very simple reason that Obamacare...


In [6]:
# Prints out the last 5 rows of data in the dataset
df.tail()

Unnamed: 0,title,author,category,published_date,updated_on,slug,blurb,body
23019,Marijuana legalization opponents warned teen p...,German Lopez,The Latest,2017-03-21 19:30:01,2017-03-21 19:51:00,http://www.vox.com/policy-and-politics/2017/3/...,,"<p id=""6OljE3"">So far, <a href=""http://www.vox..."
23020,4 ways the House health care vote could go dow...,Andrew Prokop,The Latest,2017-03-21 21:41:12,2017-03-21 23:46:25,http://www.vox.com/policy-and-politics/2017/3/...,This Thursday should be an eventful day.,"<p id=""5WuiOu"">House Speaker Paul Ryan still a..."
23021,In search of Forrest Fenn's treasure,Zachary Crockett,First Person,2017-02-28 12:33:27,2017-02-28 14:12:31,http://www.vox.com/a/fenn-treasure-hunt-map,,"<div class=""restricted""> \n \n <p class=..."
23022,Oscars 2017: every movie nominated for an Acad...,Sarah Frostenson,The Latest,2017-02-23 20:20:01,2017-02-26 14:10:56,http://www.vox.com/a/oscars-2017-movies-nominees,"Yes, even 13 Hours: The Secret Soldiers of Ben...","<h2 id=""RCAyzl""><a href=""http://www.imdb.com/..."
23023,Transcript: President Trumpâ€™s speech to Cong...,Vox Staff,Congress,2017-03-01 01:06:07,2017-03-01 04:06:34,http://www.vox.com/a/trump-speech-transcript-j...,,<p>President Trump took a trip up Pennsylvani...


In [7]:
# Summary statistics about the data column-wise
df.describe()

Unnamed: 0,title,author,category,published_date,updated_on,slug,blurb,body
count,23024,23023,23013,23013,23013,23013,23013.0,23013
unique,22977,825,185,22825,22474,23013,20043.0,23013
top,"Republican debate 2016 live stream: time, TV s...",German Lopez,The Latest,2016-07-13 12:00:03,2016-11-16 21:01:35,http://www.vox.com/2015/1/30/7952719/deflatega...,,"<p>In 1937, shortly after relocating permanent..."
freq,7,2243,7152,9,24,1,2729.0,1


### 4) Data Pre-Processing

#### Split Training and Testing sets

In [8]:
# Training set: starts as a copy of the full dataset
df_train = df.copy()

# Test set: also starts as a copy of the full dataset (filtered to "The Latest" later)
df_test = df.copy()

# Drop rows with category "The Latest" from our training dataset
df_train = df_train.loc[df_train['category'] != 'The Latest']

The first pre-processing steps drop the irrelevant columns from the dataframe and then drop the rows that contain any NaN value.

Before rows are dropped, blurb cells that contain only a single space are replaced with NaN, so that those rows are marked as missing and removed as well.

In [9]:
print("Cleaning the dataset...")
columns_train = ['author','category','published_date','updated_on','slug','body']
df_train.drop(columns_train, axis = 1, inplace = True)

columns_test = ['author','published_date','updated_on','slug','body']
df_test.drop(columns_test, axis = 1, inplace = True)

Cleaning the dataset...


In [10]:
print(df.shape)
print(df.ndim)

(23024, 8)
2


I decided to drop missing values as we have a large number of records to train on and the number of records having at least 1 value missing is negligible. Hence, it will only have minuscule effect on our model’s performance which can be neglected. <br>

<br>I am deleting the author, published_date and updated_on columns as they are irrelevant to my end goal, which is, topic modeling using the title and blurb (short description).<br>

<br>I have also decided to delete the slug and body columns as this is just a naive implementation of topic modeling. I will have to consider those two columns after completing this project to make my topic modeling more coherent.
Also, I am dropping the category column as I am trying to determine that attribute itself and unsupervised learning does not require the training labels.


In [11]:
print("Removing missing values...")
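# Treat a blank (single-space) blurb as missing; assign the result back
# instead of relying on an in-place replace on a column selection
df_train['blurb'] = df_train['blurb'].replace(' ', np.nan)
df_train.dropna(axis = 0, how = 'any', inplace = True)

df_test['blurb'] = df_test['blurb'].replace(' ', np.nan)
df_test.dropna(axis = 0, how = 'any', inplace = True)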

Removing missing values...


Next, I clean up the weird characters in the dataframe. These characters are typical of mojibake: text stored as UTF-8 that was at some point decoded with a different single-byte encoding (most likely Windows-1252), so that a character such as ’ shows up as â€™.

In [12]:
# Whitespace is stripped column by column below. (The original dataframe-wide
# apply(...) call here discarded its result, so it had no effect and is omitted.)

df_train['blurb'] = df_train['blurb'].str.replace('â€™',"'").str.replace('â€”',"-").str.replace('â€œ','"').str.replace('â€','"')
df_train['blurb'] = df_train['blurb'].str.strip()
df_train['blurb'] = df_train['blurb'].apply(lambda x: x.strip())

df_train['title'] = df_train['title'].str.replace('â€™',"'").str.replace('â€”',"-").str.replace('â€œ','"').str.replace('â€','"')
df_train['title'] = df_train['title'].str.strip()
df_train['title'] = df_train['title'].apply(lambda x: x.strip())

# Same operation on test dataset

df_test['blurb'] = df_test['blurb'].str.replace('â€™',"'").str.replace('â€”',"-").str.replace('â€œ','"').str.replace('â€','"')
df_test['blurb'] = df_test['blurb'].str.strip()
df_test['blurb'] = df_test['blurb'].apply(lambda x: x.strip())

df_test['title'] = df_test['title'].str.replace('â€™',"'").str.replace('â€”',"-").str.replace('â€œ','"').str.replace('â€','"')
df_test['title'] = df_test['title'].str.strip()
df_test['title'] = df_test['title'].apply(lambda x: x.strip())

# Checking Values
print("Checking values...")
# print(df_train.at[23003, 'blurb'])

Checking values...
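Where the garbled sequences replaced above come from can be reproduced with the standard library. The sketch below assumes the text was valid UTF-8 that got decoded as Windows-1252 (`cp1252`); that is a plausible guess about this dataset, not something the source confirms, and the sample string is made up for illustration.

```python
# Assumption: the text was valid UTF-8 that was decoded as Windows-1252 (cp1252).
original = "Trump\u2019s speech"   # contains a right single quotation mark

# Simulate the faulty decoding step that produces the "weird" characters.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the two steps restores the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

This round trip only works when every byte is defined in Windows-1252. Some bytes are not, so in practice a targeted `str.replace` chain like the one in the cell above can be the more forgiving option.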


Here, I am keeping only the distinct (unique) values of titles as well as of the blurb and dropping duplicates

In [13]:
df_train = df_train.drop_duplicates('blurb')
df_train = df_train.drop_duplicates('title')

df_test = df_test.drop_duplicates('blurb')
df_test = df_test.drop_duplicates('title')

Converting each dataset into a collection of documents with a single text column consisting of the title concatenated with the blurb.

In [14]:
df_train['documents'] = df_train['title'].map(str) + '. ' + df_train['blurb'].map(str)

df_test['documents'] = df_test['title'].map(str) + '. ' + df_test['blurb'].map(str)

In [15]:
columns = ['title','blurb']
df_train.drop(columns, axis = 1, inplace = True)

df_test.drop(columns, axis = 1, inplace = True)

Selecting only the articles having category as 'The Latest' and dropping all other articles which have their correct categories to build our test set

In [16]:
# Test set with only the articles that have category "The Latest" (5455 rows, per the output below)
df_test = df_test.loc[df_test['category'] == 'The Latest']
df_test.drop('category', axis = 1, inplace = True)

print(df_train.shape)
print(df_train.ndim)

print(df_test.shape)
print(df_test.ndim)

(14554, 1)
2
(5455, 1)
2


LDA Implementation Process:<br>
<br>The number of topics must be chosen in advance, even if we are not sure what the topics are.
<br>Each document is represented as a distribution over topics.
<br>Each topic is represented as a distribution over words.
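The two representations above combine in a simple way: the probability of seeing a word in a document is the topic-weighted sum of that word's probability under each topic. The following sketch uses made-up numbers and a hypothetical helper `word_probability`. It illustrates the mixture idea only, not how Gensim computes it internally.

```python
# Hypothetical numbers for illustration only.
# Each topic is a probability distribution over words.
topic_word = {
    "politics": {"election": 0.5, "campaign": 0.4, "insurance": 0.1},
    "health":   {"election": 0.1, "campaign": 0.1, "insurance": 0.8},
}
# Each document is a probability distribution over topics.
doc_topic = {"politics": 0.7, "health": 0.3}

def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())

for word in ("election", "campaign", "insurance"):
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
```

Training LDA amounts to estimating both tables (`topic_word` and, per document, `doc_topic`) from the observed word counts alone.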

### 5) Text Cleaning and Tokenization

The following function cleans our text and returns a list of tokens

In [17]:
parser = English()
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

### 6) Lemmatization

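Next, I will use NLTK's WordNet to reduce each token to its root word (similar to stemming, but based on a dictionary lookup). The get_lemma function below calls WordNet's morphy and falls back to the token itself when WordNet returns no base form. A WordNetLemmatizer() object was also created earlier, but this function does not use it.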

In [18]:
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

I am trying to keep my code modular by defining functions for specific tasks.

For example, the function defined below prepares the text for topic modeling.

### 7) Stopwords Removal

In [19]:
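def prepare_text_for_lda(text):
    # Tokenize and lowercase the text
    tokens = tokenize(text)
    # Keep only tokens longer than 4 characters
    tokens = [token for token in tokens if len(token) > 4]
    # Remove English stop words
    tokens = [token for token in tokens if token not in en_stop]
    # Reduce each remaining token to its WordNet base form
    tokens = [get_lemma(token) for token in tokens]
    return tokens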
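To see what the length and stop-word filters do to a sentence, here is a simplified stand-in that runs without spaCy or NLTK. The regex tokenizer, the tiny stop list `TOY_STOP_WORDS` and the function `toy_prepare` are assumptions for this sketch. It skips lemmatization, and its output will not match the real pipeline exactly.

```python
import re

# Hypothetical stand-ins: a tiny stop list and a regex tokenizer replace
# the NLTK stop words and the spaCy tokenizer used in the notebook.
TOY_STOP_WORDS = {"about", "there", "their", "which"}

def toy_prepare(text):
    """Lowercase, tokenize, drop short tokens, then drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if len(t) > 4]            # keep tokens of 5+ characters
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    return tokens

print(toy_prepare("There is a vote about the health care bill in Congress"))
```

Note that the length filter also removes short but informative words such as "vote", "care" and "bill". That is a trade-off of the `len(token) > 4` threshold worth keeping in mind when reading the resulting topics.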

### 8) Feature Extraction

The following loop reads each row of the training data frame sequentially, prepares its text for LDA, and appends the resulting tokens to a list. It calls the helper function prepare_text_for_lda(text), which in turn calls the other functions defined above.

The following cell shows how each of our documents is tokenized: every document is converted into a list of tokens.

In [20]:
text_data = []
for row in tqdm(df_train['documents']):
    tokens = prepare_text_for_lda(row)
    #if random.random() > .99:
        #print(tokens)
    text_data.append(tokens)

100%|██████████| 14554/14554 [00:16<00:00, 878.86it/s]


### 9) Feature Engineering

The word dictionary and corpus needed for topic modeling are created and saved on disk for further usage

#### LDA with Gensim

Firstly, I am creating a dictionary from our data, then I am converting it into a bag-of-words corpus and then saving the dictionary and corpus for future use using pickle library

In [21]:
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
# Use a context manager so the pickle file is closed after writing
with open('../bin/resources/corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('../bin/resources/dictionary.gensim')
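The bag-of-words structure built above can be mimicked in plain Python to make the data shape concrete. The helpers `build_vocabulary` and `to_bow` are hypothetical names for this sketch. Gensim's `Dictionary` may assign ids in a different order, but the resulting `(token_id, count)` pairs have the same form.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [["health", "insurance", "health"], ["election", "campaign"]]
vocab = build_vocabulary(docs)
print(vocab)
print(to_bow(docs[0], vocab))
print(to_bow(["health", "unknown"], vocab))
```

Word order is discarded in this representation; only which tokens occur, and how often, is passed on to the topic model.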

### 10) Building the Topic Model

Here, I assume that the number of topics is proportional to the number of articles in the dataset. The original dataset of 22,994 articles has 185 distinct categories. Scaling that ratio to the 14,554 training documents gives roughly 117 topics. The cell below sets NUM_TOPICS = 100.
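The proportional estimate can be checked with a line of arithmetic. The sketch below assumes that the subset in question is the 14,554-row training set printed earlier; a different subset size would give a different estimate.

```python
# Figures taken from the text and cell outputs above.
total_articles = 22994
total_categories = 185
training_documents = 14554   # assumption: rows in df_train after cleaning

# Scale the category count by the fraction of articles kept for training.
estimated_topics = total_categories * training_documents / total_articles
print(round(estimated_topics))
```

Under this assumption the estimate comes out close to the NUM_TOPICS value used in the training cell.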

LDA is finding NUM_TOPICS topics (100 in the cell below) from our data.

In [None]:
NUM_TOPICS = 100
start_time = time.time()
ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics=NUM_TOPICS, 
                                           random_state=89, 
                                           update_every=1,  
                                           id2word=dictionary, 
                                           passes=42, 
                                           alpha='auto', 
                                           per_word_topics=True)

ldamodel.save('../bin/resources/model100.gensim')
train_time = time.time() - start_time
print("Training Time --- %s seconds " % (round(train_time, 2)))

The above LDA model is built with NUM_TOPICS topics (100 in the cell above), where each topic is a combination of keywords and each keyword contributes a certain weight to the topic. You can see the keywords for each topic and the weight (importance) of each keyword using ldamodel.print_topics(), as shown next.

In [None]:
topics = ldamodel.print_topics()
for topic in topics:
    pp.pprint(topic)

Topic 5 includes words like "history", "netflix", "play" and "comedy", which sounds like a topic related to movies/entertainment.

Topic 96 includes words like "election", "hillary", "clinton", "campaign" and "republican"; it is clearly a politics-related topic.

Topic 40 includes words like "obamacare", "health", "insurance", "liberal" and "coverage", which sounds like a topic related to healthcare and health insurance, and so on.

With LDA, we can see that different documents have different topics, and the distinctions between them are clear.

In [None]:
# Compute Perplexity
# log_perplexity returns the per-word likelihood bound (a negative number);
# a higher bound corresponds to a lower perplexity, which is better.
print('\nPerplexity: ', ldamodel.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=text_data, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
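The value printed as "Perplexity" above is a per-word likelihood bound rather than a perplexity. Gensim's own log output turns that bound into a perplexity estimate as 2 raised to the negative bound. A minimal sketch of the conversion follows; the helper name and the input values are hypothetical and are not results from this notebook.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate via 2 ** (-bound)."""
    return 2 ** (-per_word_bound)

# Hypothetical bound values, not results from this notebook.
for bound in (-8.0, -10.0):
    print(bound, bound_to_perplexity(bound))
```

A bound closer to zero yields a smaller perplexity, so when comparing models on the same corpus, the one with the higher bound is preferred by this measure.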

In [None]:
for row in df_test['documents']:
    # Tokenize and clean the current test document (the loop variable is `row`)
    test_doc = prepare_text_for_lda(row)
    test_doc_bow = dictionary.doc2bow(test_doc)
    print(test_doc_bow)
    print(ldamodel.get_document_topics(test_doc_bow))
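To turn the printed topic distributions into a single predicted category per article, one simple option is to keep the most probable topic. The sketch below assumes the list-of-`(topic_id, probability)` shape that `get_document_topics` returns. The helper `dominant_topic` and the sample values are made up for illustration.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not doc_topics:
        return None  # e.g. a document whose tokens were all filtered out
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical output in the shape returned by get_document_topics.
example = [(5, 0.12), (40, 0.61), (96, 0.27)]
print(dominant_topic(example))
```

Taking only the top topic discards the mixture information, so it may be worth keeping the second and third topics as well when a document has no clear winner.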