# Introduction to Latent Dirichlet Allocation 
I made this notebook to showcase the capability of Latent Dirichlet Allocation (LDA).  
I use this dataset's training data to demonstrate LDA and how to implement it using  
<b>Gensim and pyLDAvis</b>.

We will use LDA to perform topic modelling.  
Topic modelling is the task of identifying the topics that best describe a set of documents.  
These topics are not given in advance; they only emerge during the modelling process, which is why they are called latent.  
Briefly, LDA assumes a fixed number of topics, where each topic is a probability distribution over words and each document is a mixture of topics.  
The goal of LDA is to assign topics to documents such that the words  
in each document are mostly accounted for by those topics.
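
To make the idea of "documents as mixtures of topics" concrete, here is a minimal sketch of the generative story LDA assumes, using only the standard library. The two topics, their words, and their probabilities are made up for illustration, and `sample_dirichlet` and `generate_document` are hypothetical helper names, not part of Gensim. Fitting LDA roughly means running this story in reverse: given only the words, infer the topics and the mixtures.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each one is a probability distribution over words.
toy_topics = {
    "disaster": {"fire": 0.5, "flood": 0.3, "storm": 0.2},
    "everyday": {"love": 0.4, "day": 0.4, "time": 0.2},
}

def sample_dirichlet(alpha, k):
    """Draw a k-dimensional symmetric Dirichlet sample by normalising Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    """Generate one document the way LDA assumes documents are produced."""
    names = list(toy_topics)
    # 1. Pick this document's topic mixture.
    mixture = sample_dirichlet(alpha, len(names))
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word according to the mixture.
        topic = random.choices(names, weights=mixture)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(toy_topics[topic])
        probs = list(toy_topics[topic].values())
        words.append(random.choices(vocab, weights=probs)[0])
    return mixture, words

toy_mixture, toy_words = generate_document(8)
print(toy_mixture)
print(toy_words)
```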

In [None]:
import numpy as np 
import pandas as pd
from collections import Counter
import pickle
#!pip install gensim
#!pip install pyLDAvis
import gensim
from gensim import corpora
import pyLDAvis.gensim  # newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import nltk
#Download the nltk dependencies 
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')
#nltk.download('stopwords')

from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

# Raw string so the backslash is not read as an (invalid) escape sequence
punc = r"!#$%&'()*+-/:;<=>?@[\]^_`{|}~."
stop_words = list(set(stopwords.words('english')))+['dont']


In [None]:
df_train=pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
#df_test=pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

In [None]:
df_train.tail(2)

We will be using the **'text'** column of the Training Data for our task.

In [None]:
# The DataFrame loaded above is named df_train
text_data = list(df_train['text'])

# 1.1 Pre-processing Input

## 1.1.1 Punctuation tokens
We will need to remove punctuation tokens from our text. 

## 1.1.2 Word Tokenization
We use NLTK's word tokenizer for this task. It splits each string into individual word and punctuation tokens; it does more than split on spaces, for example it separates contractions and trailing punctuation.  
Many other tokenizers are available. I have also used spaCy's tokenizer, and it has given me good results.
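
As a small illustration of the two steps above, the sketch below strips punctuation with `str.translate` and then splits the result into tokens. The sample sentence and the shortened `toy_punc` string are made up for this example, and the plain `split()` is only a stand-in for `nltk.word_tokenize`, which handles more cases.

```python
# Hypothetical, shortened punctuation set (the notebook's `punc` is longer).
toy_punc = "!#:,."

# Made-up sample sentence, not taken from the dataset.
sample = "Storm warning: stay indoors, stay safe! #weather"

# str.maketrans('', '', toy_punc) builds a table that deletes every character in toy_punc.
cleaned = sample.translate(str.maketrans('', '', toy_punc))

# Whitespace split as a simple stand-in for a real tokenizer.
toy_tokens = cleaned.split()
print(toy_tokens)
# ['Storm', 'warning', 'stay', 'indoors', 'stay', 'safe', 'weather']
```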

In [None]:
# Remove punctuation characters from every tweet
table = str.maketrans('', '', punc)
text_data = [text.translate(table) for text in text_data]

# Tokenize every tweet and collect all tokens in one flat list
word_list = []
for text in text_data:
    word_list += nltk.word_tokenize(text)

After removing punctuation and tokenizing, our list of words looks like this.

In [None]:
print(*word_list[:20])

# 1.2 Lemmatization of strings
Lemmatization is the task of reducing a word to its base form (its lemma).  
This lets us group words that share the same lemma, which makes the word counts in the coming steps more meaningful.
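
To see why grouping by lemma helps, the toy sketch below compares word counts before and after mapping each token to a base form. The lookup table `toy_lemmas` is hand-written and hypothetical; it only imitates the effect of a real lemmatizer such as `WordNetLemmatizer`, which draws on WordNet instead.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only.
toy_lemmas = {"fires": "fire", "floods": "flood", "homes": "home"}

raw_tokens = ["Fires", "fire", "floods", "flood", "homes", "fire"]

# Lower-case first, then map to the base form (unknown words pass through unchanged).
lemma_tokens = [toy_lemmas.get(t.lower(), t.lower()) for t in raw_tokens]

print(Counter(raw_tokens).most_common())    # counts are spread over surface forms
print(Counter(lemma_tokens).most_common())  # counts are merged per lemma
```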

In [None]:
words_lemma_list=[]

for i in range(0,len(word_list)):
    word=lemmatizer.lemmatize(word_list[i].lower())
    if(word not in stop_words and len(word)>2):
        words_lemma_list.append(word)

List of words after lower-casing, lemmatization, and removal of stop words and very short words.

In [None]:
print(*words_lemma_list[:20])

# 1.3 POS Tagging
POS (part-of-speech) tagging identifies the part of speech of each token.  
Since we are performing topic modelling, we will keep only nouns for our task.  
The POS tags 'NN' and 'NNS' correspond to singular and plural common nouns.
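
The noun filter can be sketched on a few hand-tagged tokens. The `(token, tag)` pairs below are written by hand in the shape that `nltk.pos_tag` returns; they are not the output of a tagger. Note that proper nouns carry the tags 'NNP' and 'NNPS', so a filter on 'NN' and 'NNS' leaves them out.

```python
# Hand-written (token, tag) pairs, for illustration only.
toy_tagged = [("fire", "NN"), ("burning", "VBG"), ("homes", "NNS"),
              ("quickly", "RB"), ("london", "NNP")]

NOUN_TAGS = {"NN", "NNS"}  # common nouns only; proper nouns (NNP/NNPS) are excluded

# Keep the token whenever its tag is one of the noun tags.
toy_nouns = [token for token, tag in toy_tagged if tag in NOUN_TAGS]
print(toy_nouns)
# ['fire', 'homes']
```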

In [None]:
pos_list=[]
for i in range(0,len(words_lemma_list)):
    pos_list+=nltk.pos_tag([words_lemma_list[i]])

Tuples of tokens and their corresponding POS tag

In [None]:
print(*pos_list[20:25])

In [None]:
nouns=[]
for i in range(len(pos_list)):
    if((pos_list[i][1] in ['NN','NNS'])):
        nouns.append(pos_list[i][0])

In [None]:
print("Nouns in the dataset :",*nouns[:10])
print("Number of Nouns in dataset :",len(nouns))
print("Number of distinct Nouns in the dataset : ",len(set(nouns)))

In [None]:
print("Most common Nouns in the dataset : \n",*Counter(nouns).most_common(5))

In [None]:
# Wrap each noun in its own list: Gensim expects a list of tokenized documents
nouns = [[noun] for noun in nouns]

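Gensim builds a dictionary that maps each distinct word to an integer id, and converts each document into a bag-of-words vector of (word id, count) pairs. Because every noun was wrapped in its own list above, each document here contains a single word.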
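
The sketch below shows, in plain Python, the kind of structures this step produces: a token-to-id mapping and one list of (id, count) pairs per document. It is a simplified stand-in, not Gensim's implementation; `toy_docs`, `token2id`, and `to_bow` are hypothetical names, and the ids Gensim assigns may differ.

```python
from collections import Counter

# Toy tokenized documents (made up for illustration).
toy_docs = [["fire", "home", "fire"], ["flood", "home"]]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs, the shape doc2bow produces."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)    # {'fire': 0, 'home': 1, 'flood': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```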

In [None]:
dictionary = corpora.Dictionary(nouns)
corpus = [dictionary.doc2bow(text) for text in nouns]

# Save the corpus and dictionary to disk so they can be reloaded later
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')

There are a few parameters to set:
1. Number of topics - Choose this based on your use case.
2. Passes - The number of passes the algorithm makes over the whole corpus during training.

In [None]:
NUM_TOPICS = 3
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('model5.gensim')
topics = ldamodel.print_topics(num_words=10)
for topic in topics:
    print(topic)

We now have the word distribution for each topic. Let's visualize it using pyLDAvis.

In [None]:
# Reload the dictionary, corpus and trained model saved earlier
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('model5.gensim')

lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

This is one of my first notebooks on Kaggle. I hope you gained some insight into how to implement LDA  
and that it increased your curiosity about topic modelling.

I have not included spell-check in the pre-processing step, but do check out my other notebook,  
where I have implemented spell-check functions to handle all types of spelling mistakes.  
[https://www.kaggle.com/amarananth/spellcheck-python](https://www.kaggle.com/amarananth/spellcheck-python)    

If you want to learn more about LDA and the math behind it, here is one of my references for the textual information at the beginning:  
[https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158](https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158)

