# Topic Modeling using Latent Dirichlet Allocation (LDA)

The Vox News corpus is a collection of all Vox articles published before March 21, 2017. Vox Media released this dataset as part of the KDD 2017 Workshop on Data Science + Journalism. Their goal for publishing this dataset was to enable data science researchers to apply various techniques on a news dataset.

<b>The dataset consists of 22,994 news articles with their titles, author names, categories, published dates, updated-on dates, links to the articles, short descriptions and article bodies (8 columns). While visualizing the dataset, I noticed that the articles are grouped into 185 distinct categories. Of those articles, 7145 were tagged with the category "The Latest". It is unlikely to be a coincidence that so many articles are tagged with such a generic category. Hence, I decided to address this problem with unsupervised learning, because the categories of these articles are not known beforehand, nor can the articles be labeled with their categories in a training dataset.</b>

<b>We can get a rough idea of what an article is about just by looking at its category. Hence, topic modeling is useful for categorizing or ranking articles that an individual has yet to read. Moreover, clustering articles by topic also enables them to be organized into groups of similar topics inside a database. This simplifies the collective analysis of such large datasets, especially in news and journalism, where an enormous amount of data is archived and retrieved only when needed. Clustering by category also makes information retrieval quicker and more efficient.</b>

We can analyze the title, short description and the body of these 7145 articles and predict their categories by using Topic Modeling.

In reality, analyzing the body would likely improve the topic model considerably. However, due to time constraints and a preference for minimalism, I have decided to drop the body column entirely. Parsing the HTML tags in the body of the articles would also be a time-consuming task in itself.

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
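The "9 times more dog words" intuition above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the two toy topics, their word probabilities and the helper `expected_word_probs` are hypothetical and are not derived from the Vox data.

```python
# Hypothetical toy topics: each maps words to P(word | topic).
topics = {
    "dogs": {"dog": 0.5, "bone": 0.3, "the": 0.2},
    "cats": {"cat": 0.5, "meow": 0.3, "the": 0.2},
}
# A document that is 90% about dogs and 10% about cats.
mixture = {"dogs": 0.9, "cats": 0.1}


def expected_word_probs(topics, mixture):
    """Marginal word probabilities: sum over topics of P(topic) * P(word | topic)."""
    probs = {}
    for topic, weight in mixture.items():
        for word, p in topics[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs


probs = expected_word_probs(topics, mixture)
dog_words = probs["dog"] + probs["bone"]  # topic-specific dog vocabulary
cat_words = probs["cat"] + probs["meow"]  # topic-specific cat vocabulary
ratio = dog_words / cat_words
print(round(ratio, 2))
```

Because both toy topics give the shared word "the" the same probability, it contributes equally per unit of topic weight and does not affect the ratio between the topic-specific words.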

Topic models are also referred to as probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body. In the age of information, the amount of the written material we encounter each day is simply beyond our processing capacity. Topic models can help to organize and offer insights for us to understand large collections of unstructured text bodies. Originally developed as a text-mining tool, topic models have been used to detect instructive structures in data such as genetic information, images, and networks. They also have applications in other fields such as bioinformatics. [Source: https://en.wikipedia.org/wiki/Topic_model]

The necessary Python libraries and packages, such as NumPy, pandas, Matplotlib, NLTK and tqdm, are imported in the following cell.

In [191]:
%matplotlib inline
import logging
import os
import pprint
import random
import re
import sys
import time

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk import Tree, pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.corpus import words
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from tqdm import tqdm

# styling
pd.set_option('display.max_columns',150)
plt.style.use('bmh')
from IPython.display import display

### Data Visualization

This bubble chart shows the number of articles written by each author. [Source: https://data.world/elenadata/vox-articles]

![Figure 1](../bin/resources/articles-per-author.png "Figure 1")

This graph shows a gradual increase in the number of articles published each month. However, the average number of articles published per month appears to be similar in 2016 and 2017. [Source: https://data.world/elenadata/vox-articles]

![Figure 2](../bin/resources/articles-by-month.png "Figure 2")

The entire dataset contains a total of 185 distinct categories. This bubble plot shows records grouped by category. We can observe that the category "The Latest" has the largest number of records.

![Figure 3](../bin/resources/records-by-category.png "Figure 3")

This bar graph shows the distribution of records across categories and across the different authors who have written in the same category.

![Figure 4](../bin/resources/records-by-category-&-author.png "Figure 4")

A number of algorithms have been developed for topic modeling; some of them rely on singular value decomposition (SVD) or the method of moments. Commonly used approaches include:
<ul>
<li>Explicit semantic analysis</li>
<li>Latent semantic analysis</li>
<li>Latent Dirichlet Allocation (LDA)</li>
<li>Hierarchical Dirichlet process</li>
<li>Non-Negative Matrix Factorization (NMF)</li>
</ul>

I decided to use LDA because it is widely used by researchers and data scientists. From my Data Mining project, I also had prior experience working with the Gensim library in Python, which has a robust LDA implementation. LDA is a probabilistic model that exploits similarities in the data and draws inferences from the resulting analysis.

In natural language processing, Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael I. Jordan in 2003. Essentially the same model was also proposed independently by J. K. Pritchard, M. Stephens, and P. Donnelly in the study of population genetics in 2000. Both papers have been highly influential, with 19858 and 20416 citations respectively by August 2017.  [Source: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation]
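The generative story described above is commonly written as the sampling scheme below. This is a summary of the standard smoothed LDA formulation, not something taken from the notebook; the symbols are notation introduced here: $K$ topics, $D$ documents, $N_d$ words in document $d$, and Dirichlet hyperparameters $\alpha$ and $\beta$.

```latex
\begin{align*}
\varphi_k &\sim \operatorname{Dirichlet}(\beta),  & k &= 1, \dots, K   && \text{(word distribution of topic } k\text{)} \\
\theta_d  &\sim \operatorname{Dirichlet}(\alpha), & d &= 1, \dots, D   && \text{(topic mixture of document } d\text{)} \\
z_{d,n}   &\sim \operatorname{Categorical}(\theta_d), & n &= 1, \dots, N_d && \text{(topic of the } n\text{-th word)} \\
w_{d,n}   &\sim \operatorname{Categorical}(\varphi_{z_{d,n}}) &&        && \text{(the observed word)}
\end{align*}
```

Only the words $w_{d,n}$ are observed. Fitting LDA means inferring the hidden $\theta_d$, $\varphi_k$ and $z_{d,n}$ that best explain them.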

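This cell downloads the required NLTK resources (WordNet and the stopword list) if they are not already up to date. The `words` corpus used in the next cell is assumed to be installed already; if it is not, `nltk.download('words')` fetches it.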

In [171]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tanveershaikh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tanveershaikh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Using NLTK, I build a lookup dictionary of English words and create a lemmatizer object based on WordNet.

In [172]:
# Initialization and Global Variables

dictionary = dict.fromkeys(words.words(), None)
lemmatizer = WordNetLemmatizer()

### Reading the dataset (dsjVoxArticles.tsv) - Data Extraction

The data is fetched from the data.world URL and loaded into a pandas DataFrame in the following cell.

In [252]:
url = "https://query.data.world/s/ee6arp6cngynnoj4hvyuhckn3tb4hj"
df = pd.read_csv(url, delimiter = '\t', encoding = 'utf-8')
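If the data.world URL is unavailable, the same parsing step can be tried on an in-memory sample. This is a minimal sketch: the two rows and the reduced column set below are made up for illustration and only mimic the tab-separated layout of the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real file has more columns and rows.
sample_tsv = (
    "title\tauthor\tcategory\tblurb\n"
    "Example headline\tJane Doe\tThe Latest\tA short description.\n"
    "Another headline\tJohn Roe\tPolitics\tAnother description.\n"
)

# sep='\t' is the same option as delimiter='\t' used in the cell above.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample_df.shape)
print(sample_df.columns.tolist())
```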

Selecting only the articles whose category is 'The Latest' and dropping all other articles, which already have their correct categories.

In [253]:
# .copy() makes the subset independent, so the in-place edits below do not act on a view
df = df.loc[df['category'] == 'The Latest'].copy()
print(df.shape)
print(df.ndim)

(7152, 8)
2
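The category figures quoted earlier (the number of distinct categories and the size of "The Latest") can be checked on the full DataFrame before filtering. The sketch below shows the idea on a hypothetical four-row frame named `toy`. In the notebook the same calls would be applied to `df` right after it is loaded.

```python
import pandas as pd

# Hypothetical toy frame standing in for the unfiltered Vox DataFrame.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "category": ["The Latest", "Politics", "The Latest", "Culture"],
})

counts = toy["category"].value_counts()   # records per category, largest first
n_categories = toy["category"].nunique()  # number of distinct categories
largest = counts.idxmax()                 # category with the most records

# Boolean-mask selection; .copy() keeps later in-place edits off the original.
latest = toy.loc[toy["category"] == "The Latest"].copy()

print(largest, n_categories, latest.shape)
```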


### Exploratory Analysis

This section deals with exploring and analyzing the dataset. It will give us a deeper understanding of the dataset by making us familiar with all the rows and columns of the dataset.

In [254]:
# Prints out the first 5 rows of data in the dataset
df.head()

Unnamed: 0,title,author,category,published_date,updated_on,slug,blurb,body
31,Obama wants to fight gerrymandering once he le...,Andrew Prokop,The Latest,2014-04-15 18:00:02,2016-10-25 21:52:52,http://www.vox.com/2014/4/15/5604284/us-electi...,Our neighbor to the north solved its gerrymand...,<p>If Donald Trump wants to complain about US ...
56,"4/20: National Weed Day, explained",German Lopez,The Latest,2014-04-19 20:20:02,2016-04-21 00:09:02,http://www.vox.com/2014/4/19/5624560/why-is-42...,Tens of thousands are celebrating a less tradi...,"<p>It is 4/20, the day tens of thousands of Am..."
64,"You're Shakespeare, but you're playing Hamlet ...",Dara Lind,The Latest,2014-04-20 15:30:02,2016-06-14 20:57:15,http://www.vox.com/2014/4/20/5628860/hes-unive...,"Gregory Rabassa, who died Tuesday, is probably...","<p><i>Gregory Rabassa, who died Tuesday, was a..."
116,This awesome footage of DC was shot by an ille...,Zack Beauchamp,The Latest,2014-05-05 13:00:03,2015-05-15 15:45:54,http://www.vox.com/2014/5/5/5676010/dc-drone-f...,And the troubling things it tells us about the...,<p>As one of those rare people who actually gr...
231,A viewer's guide to the 2016 National Spelling...,Alex Abad-Santos,The Latest,2014-05-27 18:20:10,2016-05-25 15:10:48,http://www.vox.com/2014/5/27/5754264/a-viewers...,What you need to know before the carnage begins.,"<p>This week, one of the most brutal competiti..."


In [255]:
# Prints out the last 5 rows of data in the dataset
df.tail()

Unnamed: 0,title,author,category,published_date,updated_on,slug,blurb,body
23017,Bad typography has ruined more than just the O...,Christophe Haubursin,The Latest,2017-03-21 19:10:01,2017-03-21 19:34:47,http://www.vox.com/2017/3/21/15004126/oscars-g...,,"<p id=""BkGLCG"">You can blame a lot of people f..."
23018,Neil Gorsuch is denying former students' claim...,Emily Crockett,The Latest,2017-03-21 18:50:02,2017-03-21 19:51:37,http://www.vox.com/identities/2017/3/21/150091...,,"<p id=""fSQrxr"">During his confirmation hearing..."
23019,Marijuana legalization opponents warned teen p...,German Lopez,The Latest,2017-03-21 19:30:01,2017-03-21 19:51:00,http://www.vox.com/policy-and-politics/2017/3/...,,"<p id=""6OljE3"">So far, <a href=""http://www.vox..."
23020,4 ways the House health care vote could go dow...,Andrew Prokop,The Latest,2017-03-21 21:41:12,2017-03-21 23:46:25,http://www.vox.com/policy-and-politics/2017/3/...,This Thursday should be an eventful day.,"<p id=""5WuiOu"">House Speaker Paul Ryan still a..."
23022,Oscars 2017: every movie nominated for an Acad...,Sarah Frostenson,The Latest,2017-02-23 20:20:01,2017-02-26 14:10:56,http://www.vox.com/a/oscars-2017-movies-nominees,"Yes, even 13 Hours: The Secret Soldiers of Ben...","<h2 id=""RCAyzl""><a href=""http://www.imdb.com/..."


In [256]:
# Summary statistics about the data column-wise
df.describe()

Unnamed: 0,title,author,category,published_date,updated_on,slug,blurb,body
count,7152,7152,7152,7152,7152,7152,7152.0,7152
unique,7133,220,1,7111,7071,7152,5510.0,7152
top,"Republican debate 2016 live stream: time, TV s...",German Lopez,The Latest,2016-05-19 12:00:03,2016-11-16 21:01:35,http://www.vox.com/science-and-health/2017/1/1...,,"<p><a href=""http://www.vox.com/cards/gender-wa..."
freq,5,648,7152,3,12,1,1598.0,1


### Data Pre-Processing

The data pre-processing steps start with dropping the irrelevant columns from the DataFrame and then dropping the rows in which any value is NaN.

Before the NaN rows are dropped, cells that are empty or contain only whitespace are identified and marked as NaN, so that those rows are dropped as well.

In [257]:
print("Cleaning the dataset...")
columns = ['author','category','published_date','updated_on','slug','body']
df.drop(columns, axis = 1, inplace = True)

Cleaning the dataset...


In [258]:
print(df.shape)
print(df.ndim)

(7152, 2)
2


I decided to drop rows with missing values because we have a large number of records to train on, and I consider the number of records with at least one missing value small enough that it should have only a minuscule effect on our model’s performance.

I am deleting the author, published_date and updated_on columns as they are irrelevant to my end goal, which is topic modeling using the title and blurb (short description).

I have also decided to delete the slug and body columns, as this is just a naive implementation of topic modeling. I will have to consider those two columns after completing this project to make my topic modeling more coherent.
Also, I am dropping the category column because it is the attribute I am trying to determine, and unsupervised learning does not require training labels.


In [259]:
print("Removing missing values...")
# Assign the result back instead of using a chained in-place replace on a column
df['blurb'] = df['blurb'].replace(' ', np.nan)
df.dropna(axis = 0, how = 'any', inplace = True)

Removing missing values...
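Before dropping rows, it can be useful to count how many cells are actually blank. The sketch below is a variant under stated assumptions rather than the notebook's exact step: it uses a regular expression to treat any empty or whitespace-only string as missing, whereas the cell above replaces only a single space. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the title/blurb columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "blurb": ["text", " ", None, "   "],
})

# Mark empty or whitespace-only strings as NaN (regex variant).
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

missing_per_column = cleaned.isna().sum()  # how many cells would be dropped
kept = cleaned.dropna(axis=0, how="any")   # same dropna call as the notebook

print(missing_per_column)
print(len(toy), "->", len(kept))
```

Comparing the row count before and after makes it easy to see how much data the cleaning step removes.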


Next, I clean up stray character sequences in the DataFrame. These sequences are mojibake: they appear when text encoded as UTF-8 is decoded with a different encoding (most likely Windows-1252), so typographic quotes and dashes show up as sequences such as 'â€™'.

In [260]:
# Repair mojibake and strip surrounding whitespace in both text columns.
# 'â€' is a prefix of the other three sequences, so it must be replaced last.
mojibake = [('â€™', "'"), ('â€”', "-"), ('â€œ', '"'), ('â€', '"')]
for col in ['blurb', 'title']:
    for bad, good in mojibake:
        df[col] = df[col].str.replace(bad, good, regex=False)
    df[col] = df[col].str.strip()

# Checking Values
print("Checking values...")
# print(df.at[23003, 'blurb'])

Checking values...
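The garbled sequences handled above can be reproduced in a few lines, which also suggests an alternative repair that reverses the wrong decoding instead of using a replacement table. This is a sketch under the assumption that the wrong codec was Windows-1252 (`cp1252`). The helper `fix_mojibake` is hypothetical and not part of the notebook.

```python
# UTF-8 bytes decoded with the wrong codec (assumed: Windows-1252).
original = "don’t"  # contains U+2019, a typographic apostrophe
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)


def fix_mojibake(text):
    """Undo UTF-8-read-as-cp1252 damage; return the input unchanged if it does not round-trip."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = fix_mojibake(garbled)
print(repaired)
print(fix_mojibake("café"))  # legitimate non-ASCII text is left alone
```

Unlike the replacement table, this approach restores the original typographic characters rather than mapping them to ASCII, so which one is preferable depends on what the later tokenization step expects.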


Here, I keep only the first occurrence of each distinct blurb and each distinct title, dropping the duplicates.

In [261]:
df = df.drop_duplicates('blurb')
df = df.drop_duplicates('title')
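Dropping duplicates column by column is stricter than dropping rows that are identical in both columns. The sketch below illustrates the difference on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical toy frame: row 1 repeats a blurb, row 3 repeats a title.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "A"],
    "blurb": ["x", "x", "y", "z"],
})

# Same two-step order as the notebook; keep='first' is the pandas default.
deduped = toy.drop_duplicates("blurb").drop_duplicates("title")

# For comparison: only rows identical in BOTH columns count as duplicates here.
deduped_pairs = toy.drop_duplicates(["title", "blurb"])

print(deduped.index.tolist(), deduped_pairs.index.tolist())
```

In this toy example the two-step version keeps two rows, while the pairwise version keeps all four because no row repeats both values at once.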

This converts our dataset into a collection of 5495 documents with a single column consisting of the title concatenated with the blurb.

In [262]:
df['documents'] = df['title'].map(str) + '. ' + df['blurb'].map(str)

In [263]:
columns = ['title','blurb']
df.drop(columns, axis = 1, inplace = True)
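The concatenation step can also be written with `Series.str.cat`, which joins two string columns element-wise with a separator. The sketch below uses a hypothetical two-row frame and is meant as an equivalent alternative for columns that contain only strings, not as a replacement for the notebook's cell.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned title/blurb columns.
toy = pd.DataFrame({
    "title": ["Headline one", "Headline two"],
    "blurb": ["First blurb.", "Second blurb."],
})

# Join title and blurb with '. ' between them, then keep only the result.
toy["documents"] = toy["title"].str.cat(toy["blurb"], sep=". ")
toy = toy.drop(columns=["title", "blurb"])

documents = toy["documents"].tolist()
print(documents)
```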