# Topic Modeling using LDA

This notebook demonstrates how to implement topic modeling with Latent Dirichlet Allocation (LDA) on a news feed dataset.

LDA is a generative model commonly used for topic modeling. It assumes that each document is generated by a mixture of latent topics, and that each topic is a probability distribution over words. The Dirichlet priors encode the intuition that a document covers only a small set of topics and that each topic uses only a small set of words frequently.
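
To make the generative story above concrete, here is a minimal sketch that samples one toy document. The vocabulary, the number of topics, and the hyperparameters `alpha` and `beta` are made-up values for illustration and are not taken from the dataset. Dirichlet samples are built from normalized gamma draws so that only the standard library is needed.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, num_topics, doc_length, alpha, beta, rng):
    """Sample a toy document following the LDA generative process."""
    # each topic is a probability distribution over the vocabulary
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    # the document is a probability distribution over topics
    doc_topic = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(doc_length):
        # pick a topic for this position, then pick a word from that topic
        topic = rng.choices(range(num_topics), weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return doc_topic, words

toy_rng = random.Random(0)
toy_vocab = ['election', 'president', 'goal', 'league', 'policy', 'striker']  # hypothetical
toy_doc_topic, toy_words = generate_document(
    toy_vocab, num_topics=3, doc_length=10, alpha=0.5, beta=0.5, rng=toy_rng)
print(toy_doc_topic)
print(toy_words)
```

Small values of `alpha` and `beta` tend to concentrate the probability mass on a few topics per document and a few words per topic, which is the sparsity intuition described above.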

In this example, each news article can be viewed as a mixture of latent topics. Our goal is to count the words in each article and predict which topics are most likely to have generated it, based on its word count vector. We first preprocess the articles by removing stop words and then represent the text as a bag-of-words matrix. LDA takes the bag-of-words matrix as input and outputs a document-topic matrix and a topic-word matrix. The product of these two matrices gives an estimated word distribution for each document, which approximates the document's normalized word counts.
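
As a small illustration of the bag-of-words representation, the sketch below turns two pre-tokenized toy documents into a count matrix with one row per document and one column per vocabulary word. The documents are hypothetical and only the standard library is used.

```python
from collections import Counter

# hypothetical pre-tokenized documents
toy_docs = [
    ['trump', 'election', 'president', 'election'],
    ['league', 'goal', 'goal', 'striker'],
]

# one column per distinct word, in sorted order
toy_bow_vocab = sorted({token for doc in toy_docs for token in doc})

# one row per document, holding the count of each vocabulary word
toy_bow_matrix = []
for doc in toy_docs:
    counts = Counter(doc)
    toy_bow_matrix.append([counts[token] for token in toy_bow_vocab])

print(toy_bow_vocab)
for row in toy_bow_matrix:
    print(row)
```

Word order within a document is discarded; only the counts remain, which is the input LDA works with.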

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from pprint import pprint
from collections import defaultdict
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from gensim import corpora, models, similarities
from utils import clean_text, build_stopwords

## 1. Text Preprocessing

First, let us load and clean the text data. The dataset comprises Chinese news articles that fall into 3 topics, namely '梁振英', '美國大選', '足球' ('Hong Kong Chief Executive', 'US Presidential Election', 'Soccer').

To train an LDA model with the Gensim library, we have to tokenize each text into a list of words. We also remove words that appear 100 times or fewer across all news articles.

In [2]:
# load data
df = pd.read_csv("data/newsfeed.csv",  header=0)
X = df['text']
y = df['tags']

# text preprocessing
stopwords = build_stopwords(filepath='data/stopwords.txt')
X_cleaned = clean_text(X, stopwords)

print('Before text preprocessing:')
print(X[0])
print('\nAfter text preprocessing:')
print(X_cleaned[0])

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/cd/k8wp_1kd24g6w3k6pxhs99_m0000gn/T/jieba.cache
Loading model cost 0.779 seconds.
Prefix dict has been built succesfully.


Before text preprocessing:
利物浦重賽擊敗乙組仔　英足盃過關 英格蘭足總盃第三圈今晨重賽，貴為英超勁旅的利物浦上場被乙組仔埃克斯特尷尬逼和，多獲一次機會的紅軍不敢再有差池。先有近期回勇的「威爾斯沙維」祖阿倫10分鐘開紀錄，加上兩個小將舒爾奧祖，及祖奧迪西拿下半場各入一球，以3比0擊敗對手，總算在主場挽 <p style="text-align: justify;">英格蘭足總盃第三圈今晨重賽，貴為英超勁旅的利物浦上場被乙組仔埃克斯特尷尬逼和，多獲一次機會的紅軍不敢再有差池。先有近期回勇的「威爾斯沙維」祖阿倫10分鐘開紀錄，加上兩個小將舒爾奧祖，及祖奧迪西拿下半場各入一球，以3比0擊敗對手，總算在主場挽回面子，下一圈對手為韋斯咸。</p> <p style="text-align: justify;">另一場英超球隊對壘，今季異軍突起的李斯特城戰至第三圈就宣告畢業。熱刺憑韓國前鋒孫興<U+615C>上半場遠射破網先開紀錄，換邊後此子助攻予中場查迪尼建功，令球隊以兩球輕取李斯特城，第四圈將面對英甲的高車士打。</p>

After text preprocessing:
利物浦 重賽 擊敗 乙組 仔英足 盃 過關   英格蘭足 總 盃 第三圈 今晨 重賽 貴為 英超 勁 旅 利物浦 上場 乙 組仔 埃克斯 特 尷尬 逼 多獲 一次 機會的紅 軍 不敢 差池 先有 近期 回勇 威爾斯沙維 祖阿倫 分鐘 開紀錄 加上 兩個 小將 舒爾奧祖 及祖奧 迪西 拿下 半場 各入 一球 以比擊敗 手 總算 主場 挽   英格蘭足 總 盃 第三圈 今晨 重賽 貴為 英超 勁 旅 利物浦 上場 乙 組仔 埃克斯 特 尷尬 逼 多獲 一次 機會的紅 軍 不敢 差池 先有 近期 回勇 威爾斯沙維 祖阿倫 分鐘 開紀錄 加上 兩個 小將 舒爾奧祖 及祖奧 迪西 拿下 半場 各入 一球 以比擊敗 手 總算 主場 挽回 面子 一圈 對手 韋斯咸   一場 英超球 隊 壘 今季異 軍 突起 李斯特 城戰 第三圈 宣告 畢業 熱刺 韓國 前鋒 孫興 上半 場遠射 破 網先 開紀錄 換邊 此子 助攻 予中場 查迪尼 建功 令球隊 兩球 輕取 李斯特 城 四圈 將面 英甲 高車士


In [3]:
# text to list of words
texts = []
for text in X_cleaned:
    texts.append(text.split())
    
# keep only words that appear more than 100 times across all texts
freq = defaultdict(int)
for text in texts:
    for token in text:
        freq[token] += 1
texts = [[token for token in text if freq[token] > 100] for text in texts]

print("Number of tokenized text data: ", len(texts))
print(texts[0])

Number of tokenized text data:  3894
['利物浦', '擊敗', '盃', '總', '盃', '今晨', '英超', '勁', '利物浦', '上場', '逼', '一次', '軍', '不敢', '近期', '分鐘', '開紀錄', '加上', '兩個', '小將', '半場', '一球', '手', '主場', '總', '盃', '今晨', '英超', '勁', '利物浦', '上場', '逼', '一次', '軍', '不敢', '近期', '分鐘', '開紀錄', '加上', '兩個', '小將', '半場', '一球', '手', '主場', '對手', '韋斯咸', '一場', '英超球', '隊', '軍', '李斯特', '熱刺', '韓國', '前鋒', '上半', '開紀錄', '助攻', '建功', '兩球', '李斯特', '城']


## 2. Train LDA model

The LDA model in Gensim takes a dictionary and a corpus as inputs and learns the topic-word distributions. We first convert the tokenized texts into a dictionary and a corpus, and then train the LDA model with 3 topics.
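
To show what the dictionary and corpus contain, here is a simplified pure-Python imitation: the dictionary maps each token to an integer id, and each document becomes a list of `(token_id, count)` pairs. The helper names `build_token_ids` and `to_bow` are hypothetical, and the id assignment order is an assumption that may not match Gensim's exactly.

```python
from collections import Counter

def build_token_ids(texts):
    """Assign an integer id to each distinct token (simplified stand-in for a dictionary)."""
    token2id = {}
    for text in texts:
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token for token in text if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

toy_texts = [['goal', 'league', 'goal'], ['election', 'president', 'league']]  # hypothetical
toy_token2id = build_token_ids(toy_texts)
toy_corpus = [to_bow(text, toy_token2id) for text in toy_texts]
print(toy_token2id)
print(toy_corpus)
```

This sparse representation stores only the words that actually occur in a document, which keeps the corpus compact when the vocabulary is large.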

In [4]:
# text to dictionary
dic = corpora.Dictionary(texts)
print(dic)

# text to corpus
corpus = [dic.doc2bow(text) for text in texts]

# save dictionary and corpus for future use
dic.save('model/news.dict')
corpora.MmCorpus.serialize('model/news.mm', corpus)

# load dictionary and corpus if necessary
# dic = corpora.Dictionary.load('model/news.dict')
# corpus = corpora.MmCorpus('model/news.mm')

print('Dictionary size: ', len(dic))
print('Corpus size: ', len(corpus))

# build LDA model
lda = models.LdaModel(corpus, id2word=dic, num_topics=3)

Dictionary(1888 unique tokens: ['一場', '一次', '一球', '上半', '上場']...)
Dictionary size:  1888
Corpus size:  3894


## 3. Evaluate the LDA model

To evaluate the trained LDA model, we can inspect the topic-word distributions. Let us look at the 5 highest-weighted words for each of the 3 topics. Here are some observations from this run.

* Topic 0 contains words such as 'HK Chief Executive' and 'Hong Kong', which relate to 'Hong Kong Chief Executive'
* Topic 1 contains words such as 'soccer' and 'Premier League', which relate to 'Soccer'
* Topic 2 contains words such as 'Donald Trump', 'Hillary', 'America' and 'President', which relate to 'US Presidential Election'
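
Reading off the top words of a topic amounts to ranking the vocabulary by probability within that topic. The sketch below does this on a toy topic-word table; the probabilities are made up for illustration and the helper `top_words` is hypothetical, not part of Gensim.

```python
def top_words(word_probs, k):
    """Return the k highest-probability words for one topic."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# made-up topic-word probabilities, for illustration only
toy_topic_word = {
    0: {'梁振英': 0.5, '特首': 0.3, '足球': 0.1, '總統': 0.1},
    1: {'足球': 0.6, '英超': 0.25, '特首': 0.1, '總統': 0.05},
}

for topic_id, word_probs in toy_topic_word.items():
    print(topic_id, top_words(word_probs, k=2))
```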

In [5]:
pprint(lda.print_topics(num_words=5))

[(0, '0.014*"梁振英" + 0.011*"特首" + 0.007*"香港" + 0.007*"盃" + 0.006*"馬"'),
 (1, '0.011*"盃" + 0.011*"香港" + 0.009*"足球" + 0.007*"英超" + 0.007*"球員"'),
 (2, '0.028*"特朗普" + 0.019*"希拉里" + 0.015*"美國" + 0.010*"黨" + 0.009*"總統"')]


Since we have ground-truth labels for the news feed dataset, we can evaluate the LDA model's performance using the accuracy score between the true topics and the predicted topics.

In [7]:
# predict topic probability distribution for corpus
corpus_lda = lda[corpus]

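# get the label (topic) with the highest probability for each document
pred = [max(doc_topics, key=lambda topic_prob: topic_prob[1])[0] for doc_topics in corpus_lda]

# encode the ground-truth labels as topic ids
# note: this mapping matches the topic order of the run shown above; topic ids can differ between runs
class_dict = {0:'梁振英', 1:'足球', 2:'美國大選'}
label_to_id = {label: topic_id for topic_id, label in class_dict.items()}
# raises KeyError on an unexpected tag instead of silently skipping it
y_class = [label_to_id[label] for label in y]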
            
# compute accuracy score
print('Accuracy: ', accuracy_score(y_class, pred))

Accuracy:  0.7668207498715973
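
Because LDA topic ids are arbitrary, a hardcoded mapping between topic ids and class names only holds for one particular run. One possible alternative, sketched here on toy data, is to map each predicted topic to the ground-truth label it most often co-occurs with. The helper `map_topics_to_labels` and the toy labels are hypothetical and are not part of the original notebook.

```python
from collections import Counter, defaultdict

def map_topics_to_labels(pred_topics, true_labels):
    """Map each topic id to the ground-truth label it co-occurs with most often."""
    votes = defaultdict(Counter)
    for topic, label in zip(pred_topics, true_labels):
        votes[topic][label] += 1
    return {topic: counter.most_common(1)[0][0] for topic, counter in votes.items()}

# toy predictions and labels, for illustration only
toy_pred = [2, 2, 0, 0, 1, 1, 1]
toy_true = ['soccer', 'soccer', 'election', 'election', 'politics', 'politics', 'soccer']

toy_topic_to_label = map_topics_to_labels(toy_pred, toy_true)
toy_mapped = [toy_topic_to_label[t] for t in toy_pred]
toy_accuracy = sum(p == t for p, t in zip(toy_mapped, toy_true)) / len(toy_true)
print(toy_topic_to_label)
print('Accuracy: ', toy_accuracy)
```

Note that a majority vote can assign the same label to more than one topic, so it is worth checking the resulting mapping before trusting the score.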
