<a href="https://colab.research.google.com/github/wenxuan0923/My-notes/blob/master/Topic_modeling_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling: Latent Dirichlet Allocation

**Topic modeling** is an unsupervised machine learning technique that clusters unlabelled text documents according to their hidden topics by analyzing "bags" of words that frequently occur together. It can be used to:

- Label documents according to these hidden topics

- Search, organize and summarize text documents

There are many techniques for topic modeling. In this note I will dig into the popular **Latent Dirichlet Allocation** (LDA) method. In particular, I will compare the results obtained with **CountVectorizer** and **TfidfVectorizer**, and visualize them using an interactive visualization tool: **pyLDAvis**. The data set used in this note is from Kaggle, <a target='_blank' href='https://www.kaggle.com/therohk/million-headlines'>A Million News Headlines</a>: news headlines published over a period of 17 years.

In [3]:
# General 
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set_style('whitegrid')

# NLP dependencies
import re
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet, stopwords
STOPWORDS = stopwords.words('english')
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pyLDAvis.sklearn

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Data preprocessing

Let's load the data and see how we should preprocess it.


In [4]:
text_df = pd.read_csv('abcnews-date-text.csv')
print(text_df.shape)
text_df.head(5)

(1186018, 2)


Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


This is a large dataset, so let's randomly sample 100000 headlines for simplicity.

In [5]:
df = text_df.sample(n=100000)

### Check for lower/upper case 

In [6]:
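# Count the headlines that contain no uppercase / no lowercase characters.
# (x.isupper() alone cannot detect mixed-case headlines such as "Hello".)
all_lower = np.sum(df.headline_text.apply(lambda x: x == x.lower()))
all_upper = np.sum(df.headline_text.apply(lambda x: x == x.upper()))
if all_upper==len(df):
  print('There are only uppercase in the document')
elif all_lower==len(df):
  print('There are only lowercase in the document')
else:
  print('There are both uppercase and lowercase in the document')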

There are only lowercase in the document


### Check for punctuation & special characters

In [7]:
punctuations = list(df.headline_text.apply(lambda x: re.findall(r'[^a-zA-Z0-9\s]', x)))
punctuations = set(sum(punctuations, []))
print('There are {} punctuations: {}'.format(len(punctuations), punctuations))

There are 5 punctuations: {'.', ';', '$', ':', "'"}


**We need to perform the following steps:**

- Remove punctuation

- Remove stop words

- Tokenize the text

- Lemmatize the words

The first three steps can be bundled into a single call to a scikit-learn vectorizer, so for now we only need to lemmatize the words.

One important thing to notice when using the NLTK lemmatizer: we should specify the **part of speech (POS) tag** of each word. If it is not specified, the default `pos='n'` (noun) is applied, meaning every word is lemmatized as if it were a noun, which can be very wrong for verbs, adjectives and adverbs. Luckily, we can easily get the POS tag of each token in a sentence with `nltk.pos_tag`. Reference: https://pythonprogramming.net/lemmatizing-nltk-tutorial/


In [8]:
# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    wordnet_tagged = [(word, nltk_tag_to_wordnet_tag(tag)) 
                      for word, tag in nltk_tagged]
    lemmatizer = WordNetLemmatizer()
    
    lemmatized_sentence = ''
    for word, tag in wordnet_tagged:
        if tag:
            lemmatized_sentence += ' ' + lemmatizer.lemmatize(word, tag)
        else:
            lemmatized_sentence += ' ' + word

    return lemmatized_sentence.strip()

### Preview the function with an example

In [11]:
txt = df.headline_text.iloc[2325]
print('\033[34m Original text: \033[30m', txt)
lemmatized_txt = lemmatize_sentence(txt)
print('\033[35m Lemmatized text: \033[30m', lemmatized_txt)

Original text: hospital reaches settlement over disabilities case
Lemmatized text: hospital reach settlement over disability case


Great, it works as expected! Now we can apply this method to the whole dataset and save it in another column `lemmatized_txt`.

In [12]:
%%time
df['lemmatized_txt'] = df.headline_text.apply(lemmatize_sentence)
# it takes quite a while to process the data, let's save it for further usage
# df.to_csv('news.csv')

CPU times: user 1min 4s, sys: 1.1 s, total: 1min 5s
Wall time: 1min 5s


In [13]:
df.head()

Unnamed: 0,publish_date,headline_text,lemmatized_txt
1152845,20190113,deadly paris bakery blast kills firefighters,deadly paris bakery blast kill firefighter
238369,20060522,tamil tiger leader killed in sri lanka,tamil tiger leader kill in sri lanka
1045406,20160831,cavers rescue calf,cavers rescue calf
638377,20110910,vettel on monza pole; webber fifth,vettel on monza pole ; webber fifth
508005,20091218,windies scratch way to lunch,windies scratch way to lunch


## Vectorization 

Now we are ready to vectorize the lemmatized texts, and at the same time remove punctuation and stop words. We will be using **CountVectorizer** and **TfidfVectorizer** from the `sklearn.feature_extraction.text` module.

- **CountVectorizer** converts a collection of text documents to a matrix of token counts.

- **TfidfVectorizer** converts a collection of raw documents to a matrix of TF-IDF features.
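
To make the difference between the two representations concrete, here is a minimal pure-Python sketch that computes both by hand for a toy corpus. The three headlines are made up for illustration, and the TF-IDF weighting assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). It is only meant to show the idea, not to replace the real vectorizers.

```python
import math
from collections import Counter

# Toy corpus (hypothetical headlines, already lemmatized)
docs = ["police charge man", "police find car", "farmer urge water plan"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Count representation: one row per document, one column per vocabulary word
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: in how many documents does each word occur?
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency (assumed to mirror the TfidfVectorizer defaults)
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else row

tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

police, charge = vocab.index("police"), vocab.index("charge")
print("count :", counts[0][police], counts[0][charge])
print("tf-idf:", round(tfidf[0][police], 3), round(tfidf[0][charge], 3))
```

In the first headline both words have a count of 1, but "police" also appears in the second headline, so its TF-IDF weight ends up lower than that of "charge".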

### Some important parameters shared by these two vectorizers:
>**lowercase**: bool, default=True. Convert all characters to lowercase before tokenizing<br><br>
>**analyzer**: {'word', 'char', 'char_wb'} or callable, default='word'. Whether the features should be made of word or character n-grams<br><br>
>**stop_words**: string or list, default=None. If 'english', a built-in stop word list for English is used<br><br>
>**token_pattern**: string. Regular expression denoting what constitutes a 'token', only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)<br><br>
>**ngram_range**: tuple (min_n, max_n), default=(1, 1). (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams<br><br>
>**max_df**: float in range [0.0, 1.0] or int, default=1.0. Ignore terms that have a document frequency strictly higher than the given threshold; a float is read as a proportion of documents, an int as an absolute count<br><br>
>**min_df**: float in range [0.0, 1.0] or int, default=1. Ignore terms that have a document frequency strictly lower than the given threshold, with the same float/int convention

For this news dataset, we have only lowercase characters, and we will use 'word' as the analyzer so that the features are individual words. The list `STOPWORDS` we loaded from NLTK at the beginning of the note will be passed to the `stop_words` argument.
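
As a rough illustration of what `token_pattern`, `min_df` and `max_df` do, the sketch below applies the default word pattern and the stricter three-letter pattern used in this note to one headline, then filters a toy corpus by document frequency. The helper `filter_by_doc_freq` and the toy headlines are hypothetical and written only for this example; the real vectorizers do this filtering internally.

```python
import re

# Default word pattern of the scikit-learn vectorizers vs. the stricter one used in this note
default_pattern = re.compile(r"(?u)\b\w\w+\b")
custom_pattern = re.compile(r"\b[a-zA-Z]{3,}\b")

headline = "21 years in jail for savage murder of flatmate"
print(default_pattern.findall(headline))  # keeps '21', 'in' and 'of'
print(custom_pattern.findall(headline))   # drops numbers and words shorter than 3 letters

def filter_by_doc_freq(tokenized_docs, min_df, max_df):
    """Keep terms found in at least `min_df` documents (int, absolute count)
    and in at most a fraction `max_df` of all documents (float, proportion)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df)

toy_docs = [custom_pattern.findall(h) for h in [
    "police charge man over crash",
    "police find missing man",
    "police urge drivers to slow down",
    "man wins police award",
]]
kept = filter_by_doc_freq(toy_docs, min_df=2, max_df=0.75)
print(kept)
```

Here "police" occurs in every headline and is dropped by `max_df`, while the words that occur only once are dropped by `min_df`.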

In [14]:
count_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                   lowercase = True,
                                   analyzer = 'word',
                                   token_pattern = r'\b[a-zA-Z]{3,}\b',  # words contain at least 3 letters
                                   stop_words = STOPWORDS,
                                   ngram_range = (1, 1),   # only consider unigrams 
                                   max_df = 0.5,   # can't occur in more than half of the documents 
                                   min_df = 20)   # at least occur in 20 documents

# Use the same parameters to define the TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer(**count_vectorizer.get_params())

In [15]:
vec_count = count_vectorizer.fit_transform(df.lemmatized_txt)
vec_tfidf = tf_idf_vectorizer.fit_transform(df.lemmatized_txt)
print(vec_count.shape)
print(vec_tfidf.shape)



(100000, 3726)
(100000, 3726)


## LDA and its Visualization

In the LDA model, each document is viewed as a mixture of the topics that are present in the corpus. The model proposes that each word in the document is attributable to one of the document's topics, and it determines how much of each topic is present in a document.

The two main ideas behind the LDA model are:

- **Every document is a mixture of topics**
> E.g. we can say document A is 90% topic 1 and 10% topic 2, while document B is 30% topic 1 and 70% topic 2.
- **Every topic is a mixture of words**
> E.g. say there are two topics in our news headlines: "Politics" and "Agriculture". The most common words in the "Politics" topic might be "Democrats", "Election", "Federal" and so on, while the "Agriculture" topic may be made up of words like "Farm", "Plant", "Seed", "Weather", etc. Importantly, **words can be shared between topics**: a word like "budget" might appear in both equally.


LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. 

Reference: https://www.tidytextmining.com/topicmodeling.html
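
The two ideas above can be read as a recipe for generating a document, which LDA then tries to invert when it is fitted. Below is a toy sketch of that recipe. The topics, words and probabilities are invented for illustration, and the mixtures are fixed by hand, whereas in the actual model they are drawn from Dirichlet priors and estimated from the data.

```python
import random

rng = random.Random(0)  # fixed seed so the toy example is repeatable

# Every topic is a mixture of words (made-up probabilities; "budget" is shared)
topic_word = {
    "politics":    {"election": 0.4, "federal": 0.3, "budget": 0.3},
    "agriculture": {"farm": 0.4, "seed": 0.3, "budget": 0.3},
}

# Every document is a mixture of topics
doc_topic = {"politics": 0.9, "agriculture": 0.1}

def generate_document(doc_topic, topic_word, n_words, rng):
    """For each word position, draw a topic from the document's mixture,
    then draw a word from that topic's word distribution."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        dist = topic_word[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words

print(generate_document(doc_topic, topic_word, n_words=8, rng=rng))
```

Fitting LDA goes in the opposite direction: it only sees the words of each document and estimates both `topic_word` and `doc_topic`.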

In [16]:
num_topics = 8   # Hyperparameter to be tuned

In [17]:
%%time
# for count vectorizer (the %%time cell magic has to be the first line of the cell)
count_LDA = LatentDirichletAllocation(n_components=num_topics, random_state=0)
count_LDA.fit(vec_count)

CPU times: user 2min 36s, sys: 34.6 ms, total: 2min 36s
Wall time: 2min 36s


In [20]:
# get topics for samples
topics = count_LDA.transform(vec_count)
df['predict_topic'] = topics.argmax(axis=1)

In [28]:
# Headlines whose dominant topic has index 1 (0-based, i.e. the column labelled topic_2 later on)
df[df.predict_topic==1]

Unnamed: 0,publish_date,headline_text,lemmatized_txt,predict_topic
1152845,20190113,deadly paris bakery blast kills firefighters,deadly paris bakery blast kill firefighter,1
974618,20150826,police raid canberra headquarters of cfmeu,police raid canberra headquarters of cfmeu,1
408915,20080905,mother acquitted of raping daughter,mother acquit of rap daughter,1
331403,20070917,mystery toddler from new zealand police,mystery toddler from new zealand police,1
537642,20100514,21 years in jail for savage murder of flatmate,21 year in jail for savage murder of flatmate,1
...,...,...,...,...
649907,20111103,pakistani trio handed jail sentences,pakistani trio hand jail sentence,1
9247,20030403,woman pleads guilty to transvestite murder,woman plead guilty to transvestite murder,1
527280,20100323,womans arm almost severed in dog attack,woman arm almost sever in dog attack,1
514955,20100124,second person questioned over teen party stabbing,second person question over teen party stabbing,1


### Top keywords in each topic

Each topic is characterized by a mixture of words. Let's now extract the keywords from each topic to infer the real topic of the group. 

In [29]:
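def get_topicword(LDA, vectorizer):
  # Rows are vocabulary words, columns are topics (labelled from 1).
  # get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases.
  return pd.DataFrame(data=LDA.components_.T, 
                  index=vectorizer.get_feature_names_out(),
                  columns=['topic_'+str(i) for i in range(1, LDA.n_components+1)])

def get_top_keywords(topicword, topic, top):
  # Sort the words of one topic by descending weight and keep the first `top`
  order = np.argsort(topicword['topic_'+str(topic)].to_numpy())[::-1]
  top_words = np.array(topicword.index)[order][:top]
  return top_words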

`count_LDA.components_` is the topic-word table. `components_[i, j]` can be viewed as a pseudocount representing the number of times word j was assigned to topic i. After normalizing each row so that it sums to 1, it can also be viewed as the distribution over words for each topic. References:

1. <a target='_blank' href='https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html'> sklearn.decomposition.LatentDirichletAllocation </a>


2. <a target='_blank' href='https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py'>Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation</a>
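
The normalization mentioned above can be sketched on a toy pseudocount matrix. The words and numbers below are invented for illustration and are not taken from the fitted model; plain Python lists are used so the snippet runs without any dependency.

```python
# Toy pseudocount matrix shaped like `components_`: rows are topics, columns are words.
words = ["police", "court", "farm", "water"]
components = [
    [120.0, 80.0, 0.5, 0.5],   # topic 0
    [0.5, 0.5, 60.0, 40.0],    # topic 1
]

# Divide each row by its sum so that every topic becomes a probability distribution over words.
# With NumPy the same step would be: components_ / components_.sum(axis=1)[:, np.newaxis]
topic_word_dist = [[v / sum(row) for v in row] for row in components]

for k, row in enumerate(topic_word_dist):
    top = max(range(len(words)), key=lambda j: row[j])
    print(f"topic {k}: most probable word = {words[top]}, p = {row[top]:.3f}")
```

The ranking of words inside a topic is the same before and after normalization, which is why the keyword extraction in this note can work directly on the raw pseudocounts.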

In [30]:
topicword_count = get_topicword(count_LDA, count_vectorizer)
display(topicword_count)

# get top 15 keywords in each topic 
for n in range(1, num_topics + 1):
  print('\033[35m Topic', n , '\n \033[30m')
  print(get_top_keywords(topicword=topicword_count, 
                 topic=n, top=15), '\n')

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8
abalone,0.125185,0.128036,0.125045,0.125069,0.125084,26.121341,0.125113,0.125128
abandon,0.125077,17.971764,0.125098,0.125097,0.125069,0.125142,50.277664,0.125088
abattoir,81.124682,0.125044,0.125028,0.125046,0.125037,0.125050,0.125035,0.125078
abbott,0.125075,0.125042,27.127164,0.125150,18.808663,57.127004,123.135603,21.426299
abbotts,0.125141,0.125013,0.125148,16.023613,0.125117,0.125039,6.225867,0.125062
...,...,...,...,...,...,...,...,...
zealand,0.125057,0.125027,0.125034,6.510910,0.125023,119.738896,0.125029,0.125024
zero,15.790314,0.138739,5.308147,0.125016,0.262394,0.125075,0.125194,0.125121
zimbabwe,0.125039,0.125049,0.125043,0.125118,40.460255,0.151546,73.762919,0.125030
zone,92.086316,0.127648,2.557588,0.125020,0.125132,0.125030,8.786928,15.066338


Topic 1
['road' 'urge' 'change' 'day' 'market' 'pay' 'deal' 'coast' 'gold'
 'strike' 'farmer' 'one' 'war' 'land' 'close'] 

Topic 2
['police' 'man' 'charge' 'court' 'crash' 'woman' 'find' 'murder' 'death'
 'car' 'kill' 'jail' 'drug' 'two' 'arrest'] 

Topic 3
['water' 'rise' 'accuse' 'school' 'face' 'say' 'rate' 'minister' 'china'
 'warn' 'question' 'student' 'link' 'high' 'show'] 

Topic 4
['fire' 'interview' 'year' 'australia' 'claim' 'world' 'cup' 'attack'
 'life' 'melbourne' 'sentence' 'mine' 'body' 'new' 'look'] 

Topic 5
['govt' 'call' 'report' 'labor' 'green' 'power' 'rural' 'indigenous'
 'search' 'probe' 'fall' 'plan' 'nsw' 'run' 'continue'] 

Topic 6
['win' 'new' 'make' 'ban' 'say' 'return' 'top' 'force' 'back' 'final'
 'name' 'trump' 'open' 'medium' 'beat'] 

Topic 7
['election' 'country' 'iraq' 'national' 'law' 'take' 'government' 'hour'
 'act' 'kill' 'park' 'vote' 'campaign' 'da

We can also use the **pyLDAvis** package to visualize the result. Note that the topic labels may differ from the ones above, because pyLDAvis numbers the topics by prevalence.

In [31]:
# Visualizing the models with pyLDAvis
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(count_LDA, vec_count, count_vectorizer)

## How to interpret this visualization?

**Left panel:**
- Each topic is represented by a circle

- The size of each circle represents the prevalence of the topic, which is also indicated by the number on the circle: number 1 is the most prevalent topic across all the documents

- The distance between two circles represents (an estimate of) the similarity between the two topics; it is only an estimate because the high-dimensional topics are mapped to a 2-dimensional space for visualization

**Right panel:**
- The list of words shows the most relevant words for the selected topic

- If the blue bar is longer than the red one, the word appears not only in this topic but also in other topics (and may even be more frequent there)

- The value $\lambda$ balances topic-exclusive words against more generic, frequent words. With $\lambda = 1$, words are ranked by how frequent they are within the topic; with $\lambda = 0$, the ranking favors words that occur almost exclusively in that topic, so the blue bars barely extend beyond the red ones
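
To see how $\lambda$ changes the ranking, here is a small sketch of the relevance score as I understand it from the LDAvis method that pyLDAvis follows: $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$. The two words and their probabilities are invented for illustration, and the function is a hand-written stand-in, not the library's own code.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a word to a topic:
    lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Invented probabilities: word -> (p(word | topic), p(word) over the whole corpus)
words = {
    "police": (0.10, 0.08),   # frequent in the topic, but common everywhere
    "arrest": (0.03, 0.004),  # rarer overall, concentrated in this topic
}

rankings = {}
for lam in (1.0, 0.0):
    rankings[lam] = sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)
    print(f"lambda = {lam}: {rankings[lam]}")
```

With $\lambda = 1$ the generally frequent word "police" comes first, while with $\lambda = 0$ the more topic-specific word "arrest" moves to the top.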




In [32]:
%%time
# for TF-IDF vectorizer (the %%time cell magic has to be the first line of the cell)
tf_idf_LDA = LatentDirichletAllocation(n_components=num_topics, random_state=0)
tf_idf_LDA.fit(vec_tfidf)

CPU times: user 2min 2s, sys: 47.7 ms, total: 2min 2s
Wall time: 2min 2s


In [33]:
topicword_tfidf = get_topicword(tf_idf_LDA, tf_idf_vectorizer)
# get top 15 keywords in each topic 
for n in range(1, num_topics + 1):
  print('\033[35m Topic', n , '\n \033[30m')
  print(get_top_keywords(topicword=topicword_tfidf, 
                 topic=n, top=15), '\n')

Topic 1
['pay' 'change' 'farmer' 'cancer' 'abc' 'safety' 'sport' 'climate'
 'strike' 'road' 'new' 'urge' 'market' 'land' 'report'] 

Topic 2
['man' 'charge' 'police' 'crash' 'court' 'murder' 'woman' 'car' 'find'
 'death' 'miss' 'arrest' 'assault' 'accuse' 'guilty'] 

Topic 3
['water' 'rise' 'rate' 'murray' 'wind' 'plan' 'age' 'toll' 'royal' 'say'
 'farm' 'resident' 'commission' 'school' 'question'] 

Topic 4
['interview' 'police' 'gold' 'jail' 'sentence' 'fire' 'extend' 'year'
 'drug' 'david' 'michael' 'gas' 'claim' 'life' 'tour'] 

Topic 5
['rural' 'govt' 'indigenous' 'call' 'national' 'council' 'labor' 'new'
 'plan' 'green' 'job' 'say' 'government' 'sale' 'community'] 

Topic 6
['win' 'cup' 'world' 'australia' 'day' 'final' 'new' 'sign' 'medium'
 'beat' 'top' 'tiger' 'england' 'make' 'play'] 

Topic 7
['country' 'iraq' 'hour' 'market' 'kill' 'election' 'closer' 'bomb' 'say'
 'drum' 'hill

In [34]:
pyLDAvis.sklearn.prepare(tf_idf_LDA, vec_tfidf, tf_idf_vectorizer)

