#  Latent Dirichlet Allocation Process

Imagine a large law firm takes over a smaller law firm and tries to identify the documents corresponding to different types of cases such as civil or criminal cases which the smaller firm has dealt or is currently dealing with. The presumption is that the documents are not already classified by the smaller law firm. An intuitive way of identifying the documents in such situations is to look for specific sets of keywords and based on the sets of keywords found, identify the type of the documents. In Natural Language Processing (NLP), this task is referred to as topic modelling. Here, the term ‘topic’ refers to a set of words that come to mind when we think of a topic. For instance, when we think of ‘entertainment’- the topic, the words that come to the mind are ‘movies’, ‘sitcoms’, ‘web series’, ‘Netflix’, ‘YouTube’ and so on. In our example of legal documents for the law firm, a set of words such as ‘property’, ‘litigation’ and ‘tort’ help identify that the document is related to a ‘real-estate’ (topic) case. A model trained to automatically discover topics appearing in documents is referred to as a topic model. 

At this point, it is important to note that topic modelling is not the same as topic classification. Topic classification is a supervised learning approach in which a model is trained using manually annotated data with predefined topics. After training, the model accurately classifies unseen texts according to their topics. On the other hand, topic modelling is an unsupervised learning approach in which the model identifies the topics by detecting the patterns such as words clusters and frequencies. The outputs of a topic model are;
1) clusters of documents that the model has grouped based on topics and
2) clusters of words (topics) that the model has used to infer the relations.

The above discussion hints at a couple of underlying assumptions in topic modelling; 1) the distributional assumption and the statistical mixture assumption. The distributional assumption indicates that similar topics make use of similar words, and the statistical mixture assumption indicates that each document deals with several topics. Simply put, for a given corpus of documents, each document can be represented as a statistical distribution of a fixed set of topics. The role of the topic model is to identify the topics and represent each document as a distribution of these topics.

Some of the well-known topic modelling techniques are Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and Correlated Topic Model (CTM). In this article, we will focus on LDA, a popular topic modelling technique. 



Before getting into the details of the Latent Dirichlet Allocation model, let’s look at the words that form the name of the technique. The word ‘Latent’ indicates that the model discovers the ‘yet-to-be-found’ or hidden topics from the documents. ‘Dirichlet’ indicates LDA’s assumption that the distribution of topics in a document and the distribution of words in topics are both Dirichlet distributions. ‘Allocation’ indicates the distribution of topics in the document.  

LDA assumes that documents are composed of words that help determine the topics and maps documents to a list of topics by assigning each word in the document to different topics. The assignment is in terms of conditional probability estimates as shown in figure 2. In the figure, the value in each cell indicates the probability of a word wj belonging to topic tk. ‘j’ and ‘k’ are the word and topic indices respectively. It is important to note that LDA ignores the order of occurrence of words and the syntactic information. It treats documents just as a collection of words or a bag of words. 

Once the probabilities are estimated (we will get to how these are estimated shortly), finding the collection of words that represent a given topic can be done either by picking top ‘r’ probabilities of words or by setting a threshold for probability and picking only the words whose probabilities are greater than or equal to the threshold value. For instance, if we focus on topic-1 in figure 2 and pick top 4 probabilities assuming that the probabilities of the words not shown in the table are less than 0.012, then topic-1 can be represented as shown below using the ‘r’ top probabilities words approach. 

In the above example, if word-k, word1, word3 and word2 are respectively trees, mountains, rivers and streams then topic-1 could correspond to ‘nature’.

One of the important inputs to LDA is the number of expected topics in the documents. In the above example if we set the expected topics to 3, each document can be represented as shown below.


In the above representation, ,  and  are the three weights for topics: topic-1, topic-2 and topic-3 respectively for a given document .  indicates the proportion of words in document  that represent topic-1,  indicates the proportion of words in document  that represent topic-2 and so on.

# LDA Algorithm

LDA assumes that each document is generated by a statistical generative process.  That is, each document is a mix of topics, and each topic is a mix of words. For example, figure 3 shows a document with ten different words. This document could be assumed to be a mix of three topics; tourism, facilities and feedback. Each of these topics, in turn, is a mix of different collections of words.  In the process of generating this document, first, a topic is selected from the document-topic distribution and later, from the selected topic, a word is selected from the multinomial topic-word distributions.

While identifying the topics in the documents, LDA does the opposite of the generation process. The general steps involved in the process are shown in figure 4. It’s important to note that LDA begins with random assignment of topics to each word and iteratively improves the assignment of topics to words through Gibbs sampling.

# Implement topic model using gensim library


In [1]:
####################### Install Prerequesties ###########################
!pip install -r requirements.txt
!python -m spacy download en_core_web_sm



# importing basic libraries...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import metrics
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import gensim
import nltk
import re
import json
import os
import spacy
import gensim
import unicodedata
import numpy as np
import pandas as pd
from operator import itemgetter
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords
nltk.download(['stopwords','punkt'])
nltk.download('stopwords')
from tqdm import tqdm
from spacy.tokenizer import Tokenizer
from gensim import corpora
from gensim.models import ldamodel
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation,strip_numeric


DATA_DIR = "COVID-19-Twitter-India/" # I have more than 1000 csv files but for now I have added around 9 to cut pocessing time.


file_names_hourly = os.listdir(DATA_DIR)
#Mapping Files From Hourly to Daily Basis
file_names_daily = [file_name[:-7] for file_name in file_names_hourly]
file_names_df = pd.DataFrame({'Hourly' : file_names_hourly, 'Daily': file_names_daily})[:100]
file_names_df.head()

def corrupt_or_not(file_name):
    """Some csv files are corrupt this is a program to spot them in DATA_DIR,
    return : True if opens False for corrupt(not open)"""
    try:
        pd.read_csv(os.path.join(*[DATA_DIR,file_name]))
        return False
    except:
        return True

file_names_df['Corrupt'] = file_names_df['Hourly'].apply(corrupt_or_not)
file_names_df.groupby('Corrupt').count()

#Removing Corrupt Files and 
#Converting the Groupby object to dict such that key is the day and values are the hourly file names
file_names_df = file_names_df[file_names_df['Corrupt'] == False]
file_daily_hourly_map = file_names_df.groupby('Daily')['Hourly'].apply(list).to_dict()

daily_full_tweets = {}

for key,files in tqdm(file_daily_hourly_map.items()):
    hourly_df = [pd.read_csv(os.path.join(*[DATA_DIR,file_name])) for file_name in files]
    daily_df = pd.concat(hourly_df)
    daily_df = daily_df[(daily_df['full_text'] != 'No Value Mentioned') | (daily_df['full_retweet_text'] != 'No Value Mentioned')]
    daily_df.loc[daily_df['full_text'] == 'No Value Mentioned','full_text'] =  daily_df.loc[daily_df['full_text'] == 'No Value Mentioned','full_retweet_text']
    #Forcefully type casting to str because some values were just float
    daily_df['full_text'] = daily_df['full_text'].astype(str)
    daily_full_tweets[key] = " ".join(daily_df['full_text'].astype(str).values)

daily_tweets = daily_full_tweets.values()
daily_tweets = list(daily_tweets)

def remove_accent_chars(text):
    text = unicodedata.normalize('NFKD',text).encode('ascii','ignore').decode('utf-8','ignore')
    return text

def remove_special_characters(text, remove_digits=False):
    """This takes text as input and then finds whether each character is not a-z A-Z 0-9 and replaces them with nothing """
    pattern = r'[^a-zA-z\s]' if not remove_digits else r'[^0-9a-zA-z\s]'
    text = re.sub(pattern, '', text)
    return text

def cleaner(doc):
    return " ".join(map(str.lower,(map(str,([token.lemma_ for token in doc if not token.is_stop | token.is_space | token.is_punct | token.like_url])))))


def pipeline_2_tokenizer(daily_df):
    text_data_cleaned = list(nlp.pipe(daily_df.full_text.values.tolist(),disable=["tagger", "parser","ner"]))
    text_data_cleaned = [t for t in text_data_cleaned if t]
    return text_data_cleaned

nlp = spacy.load("en_core_web_sm",max_length = 2000000)
nlp.add_pipe(cleaner,name="cleaner",first=True)
nlp.add_pipe(remove_accent_chars,name='accent_char_removal',after='cleaner')
nlp.add_pipe(remove_special_characters,name='remove_special_char',after='accent_char_removal')
tokenizer = Tokenizer(nlp.vocab)
    
def single_frame(file_names):
    "Concatenates all dataframe from a day and returns dataframe after fixing the full_text column"
    hourly_df = [pd.read_csv(os.path.join(*[DATA_DIR,file_name])) for file_name in file_names]
    daily_df = pd.concat(hourly_df)
    daily_df = daily_df[(daily_df['full_text'] != 'No Value Mentioned') | (daily_df['full_retweet_text'] != 'No Value Mentioned')]
    daily_df.loc[daily_df['full_text'] == 'No Value Mentioned','full_text'] =  daily_df.loc[daily_df['full_text'] == 'No Value Mentioned','full_retweet_text']
    daily_df['full_text'] = daily_df['full_text'].astype(str)
    return daily_df

file_daily_hourly_map
final_df_tweets = pd.DataFrame()
final_retweet_text_updated = []
for key,file_names in tqdm(file_daily_hourly_map.items()):
  final_df_tweets = pd.concat([final_df_tweets,single_frame(file_names)],ignore_index=True)
for f in list(final_df_tweets['full_retweet_text']):
  t = type(f)
  if f!='No Value Mentioned' and t==str:
    final_retweet_text_updated.append(f)

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
You should consider upgrading via the 'd:\deep_learning\anaconda3\python.exe -m pip install --upgrade pip' command.
You should consider upgrading via the 'D:\Deep_Learning\Anaconda3\python.exe -m pip install --upgrade pip' command.


Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     ---------------------------------------- 12.0/12.0 MB 2.7 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
100%|██████████| 1/1 [00:00<00:00,  3.92it/s]
100%|██████████| 1/1 [00:00<00:00,  3.49it/s]


In [2]:
# List of all tweets on coronavirus in 2020 
final_corpus = final_retweet_text_updated

# remove commond stopwords like for a of the and to in from each text in list of documents
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in final_corpus]
all_tokens = sum(texts,[])

# remove duplicate tokens from set of words in each document of list
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
final_corpus = [[word for word in text if word not in tokens_once]
         for text in texts]

# make a bag of words corpus 
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# print out the documents and which is the most probable topics for each doc.
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=20)
corpus_lda = lda[corpus]


lda_topics = lda.show_topics(num_words=5) # num_words signifies total number of words to represent each topic

topics = []
filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

for topic in lda_topics:
    print(topic)
    topics.append(preprocess_string(topic[1], filters))

print(topics)
# As we can see weightage of each word to represent particular topic. 
# Note we can set num_topics parameter above to as much as we want like 10,20,30 and so on...






(7, '0.026*"&amp;" + 0.021*"world" + 0.019*"is" + 0.019*"bapu" + 0.018*"very"')
(14, '0.046*"north" + 0.045*"korea" + 0.043*"cure" + 0.030*"they" + 0.029*"is"')
(15, '0.025*"is" + 0.020*"pollution" + 0.020*"which" + 0.019*"per" + 0.018*"as"')
(1, '0.041*"are" + 0.030*"amount" + 0.022*"donated" + 0.021*"krw" + 0.016*"coronavirus"')
(8, '0.044*"save" + 0.019*"murdered" + 0.019*"corona" + 0.018*"economy" + 0.018*"quite"')
(18, '0.026*"have" + 0.024*"cases" + 0.021*"are" + 0.020*"is" + 0.018*"new"')
(16, '0.025*"financial" + 0.024*"bts" + 0.023*"coronavirus" + 0.018*"relief" + 0.017*"problem"')
(17, '0.025*"is" + 0.022*"coronavirus" + 0.020*"has" + 0.019*"staff" + 0.019*"iran"')
(2, '0.022*"you" + 0.018*"tv" + 0.018*"watch" + 0.018*"-" + 0.017*"sadhana"')
(0, '0.040*"will" + 0.036*"like" + 0.036*"is" + 0.035*"corona" + 0.026*"environment"')
[['amp', 'world', 'is', 'bapu', 'very'], ['north', 'korea', 'cure', 'they', 'is'], ['is', 'pollution', 'which', 'per', 'as'], ['are', 'amount', 'donate

One of the practical application of topic modeling is to determine what topic a given document is about.

To find that, we find the topic number that has the highest percentage contribution in that document.

The `format_topics_sentences()` function below nicely aggregates this information in a presentable table.

In [3]:

def format_topics_sentences(ldamodel=lda, corpus=corpus, texts=texts):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords = format_topics_sentences(ldamodel=lda, corpus=corpus, texts=texts)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text
0,0,10.0,0.8533,"by, every, roils, with, global, is, coronaviru...","[#saudiarabia’s, ministry, tourism, announces,..."
1,1,4.0,0.5149,"over, china, nitrogen, dioxide, &amp;, from, a...","[alternative, read., mind-boggling., nasa, ima..."
2,2,19.0,0.5373,"is, an, that, this, has, have, coronavirus, it...","[just, in:, fifth, case, coronavirus, sydney,,..."
3,3,13.0,0.8643,"coronavirus, from, cases, first, new, death, d...","[breaking:, 35, new, coronavirus, deaths, china]"
4,4,17.0,0.9672,"is, coronavirus, has, staff, iran, past, count...","[coronavirus, epidemy, iran, has, past, crisis..."
5,5,5.0,0.3577,"made, is, help, have, decreases, @bts_twt, (no...","[most, people, who, contract, coronavirus, exp..."
6,6,16.0,0.5268,"financial, bts, coronavirus, relief, problem, ...","[new, :, -, india, suspends, all, kinds, visas..."
7,7,14.0,0.6732,"north, korea, cure, they, is, are, country, it...","[it's, been, horrible, 2, months, iran., top, ..."
8,8,1.0,0.6703,"are, amount, donated, krw, coronavirus, millio...","[indian, envoy, #iran, gaddam, dharmendra, say..."
9,9,13.0,0.905,"coronavirus, from, cases, first, new, death, d...","[breaking:, governor, washington, state, decla..."


Sometimes just the topic keywords may not be enough to make sense of what a topic is about. So, to help with understanding the topic, you can find the documents a given topic has contributed to the most and infer the topic by reading that document. Whew!!

In [4]:
# Group top 5 sentences under each topic
sent_topics_sorteddf = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf = pd.concat([sent_topics_sorteddf, 
                                             grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf.head()

Unnamed: 0,Topic_Num,Topic_Perc_Contrib,Keywords,Text
0,0.0,0.9779,"will, like, is, corona, environment, our, viru...","[@abpnews, @amitshah, when, environment, is, i..."
1,1.0,0.9721,"are, amount, donated, krw, coronavirus, millio...","[thread:, 425, million, reasons, why, #who, re..."
2,2.0,0.9736,"you, tv, watch, -, sadhana, 07:30pm, recovery,...","[@asharamjibapu_, how, prevent, corona, virus?..."
3,3.0,0.981,"corona, virus, be, army, by, will, holi, @prak...","[@dbxpdvq6wbiv4wj, @akashvaniair, @narendramod..."
4,4.0,0.9779,"over, china, nitrogen, dioxide, &amp;, from, a...","[#savinglives, -, special, flight, iaf, c-17, ..."


The tabular output above had 20 rows, one each for a topic. It has the topic number, the keywords and the most representative document. The `Perc_Contribution` column is nothing but the percentage contribution of the topic in the given document.

# Interpret Document Topic Distributions and Summarize Findings


Finally, we want to understand the volume and distribution of topics in order to judge how widely it was discussed. The below table exposes that information.

In [5]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Change Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Show
df_dominant_topics.head(20)

Unnamed: 0,Dominant_Topic,Topic_Keywords,Num_Documents,Perc_Documents
0.0,10.0,"by, every, roils, with, global, is, coronaviru...",381.0,0.0572
1.0,4.0,"over, china, nitrogen, dioxide, &amp;, from, a...",244.0,0.0366
2.0,19.0,"is, an, that, this, has, have, coronavirus, it...",188.0,0.0282
3.0,13.0,"coronavirus, from, cases, first, new, death, d...",263.0,0.0395
4.0,17.0,"is, coronavirus, has, staff, iran, past, count...",406.0,0.0609
5.0,5.0,"made, is, help, have, decreases, @bts_twt, (no...",247.0,0.0371
6.0,16.0,"financial, bts, coronavirus, relief, problem, ...",245.0,0.0368
7.0,14.0,"north, korea, cure, they, is, are, country, it...",255.0,0.0383
8.0,1.0,"are, amount, donated, krw, coronavirus, millio...",169.0,0.0254
9.0,13.0,"coronavirus, from, cases, first, new, death, d...",518.0,0.0778


# Visualization of topics

In [7]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary=lda.id2word)
vis

  and should_run_async(code)
  head(R).drop('saliency', 1)


# Summary of findings

We started with understanding what topic modeling can do. We built a basic topic model using Gensim’s LDA. And then we saw how we can observe topics for each document and what topic most documents signifies to and percentage of documents related to particular topic and at the end document topic distributions in dataframe.

Finally we saw how to aggregate and present the results to generate insights that may be in a more actionable.