# What is Topic Modeling?
Topic models provide an efficient way to analyze large volumes of text. While there are many different types of topic modeling, the most common and arguably the most useful for search engines is Latent Dirichlet Allocation, or LDA. Topic models based on LDA are a form of text data mining and statistical machine learning which consist of:

Clustering words into “topics”.  
Clustering documents into “mixtures of topics”.  

More specifically, LDA is a Bayesian inference model that associates each document with a probability distribution over topics, where each topic is itself a probability distribution over words.

# What is a Probability Distribution?

A probability distribution is a function that links each possible outcome of a random variable with its probability of occurrence. For example, if we flip a fair coin twice, there are four equally likely outcomes: Heads and Heads, Heads and Tails, Tails and Heads, Tails and Tails. If we score heads as 1 and tails as 0 and let the random variable X be the sum of the two flips (that is, the number of heads), then X has three possible values, represented by x: 0, 1 and 2. So P(X=x), the probability distribution of X, is:  
x=0 -> 0.25  
x=1 -> 0.50  
x=2 -> 0.25  
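
The same distribution can be recovered by enumerating the four equally likely outcomes and counting heads. This is a minimal sketch that assumes a fair coin; the variable names are illustrative and not part of the original notebook.

```python
from collections import Counter
from itertools import product

# Enumerate both flips of a fair coin: heads = 1, tails = 0.
outcomes = list(product([1, 0], repeat=2))

# X is the number of heads, i.e. the sum of the two flips.
counts = Counter(sum(flips) for flips in outcomes)

# Every outcome is equally likely, so P(X=x) = (outcomes with sum x) / 4.
distribution = {x: counts[x] / len(outcomes) for x in sorted(counts)}

for x, p in distribution.items():
    print(f"x={x} -> {p:.2f}")
```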

In topic modeling, a document's probability distribution over topics, i.e. the mixture of topics most likely being discussed in that document, might look like this:  

document 1  

θ₁(topic 1) = 0.33  
θ₁(topic 2) = 0.33  
θ₁(topic 3) = 0.33  

A topic's probability distribution over words, i.e. the words most likely to be used in a given topic, might look like this for the top 3 words in the topic:  

topic 1  

φ₁(bank) = 0.39  
φ₁(money) = 0.32  
φ₁(loan) = 0.29  

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should produce terms such as “health”, “doctor”, “patient” and “hospital” for a healthcare topic, and “farm”, “crops” and “wheat” for a farming topic.

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
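
The generative story described above can be sketched in a few lines. This is only an illustration: the topic names, the "farming" word probabilities, and the `generate_document` helper are hypothetical, and the two distributions are fixed by hand, whereas LDA treats them as unknown quantities to be inferred from the documents.

```python
import random

# Hypothetical, hand-picked distributions for illustration only.
doc_topic = {"finance": 0.7, "farming": 0.3}  # one document's mixture of topics
topic_word = {
    "finance": {"bank": 0.39, "money": 0.32, "loan": 0.29},
    "farming": {"farm": 0.40, "crops": 0.35, "wheat": 0.25},
}


def generate_document(doc_topic, topic_word, n_words, rng):
    """Generate words by first drawing a topic, then drawing a word from it."""
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic according to the document's topic mixture.
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        # Step 2: pick a word according to that topic's word distribution.
        word_probs = topic_word[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
generated = generate_document(doc_topic, topic_word, n_words=10, rng=rng)
print(generated)
```

Fitting an LDA model runs this story in reverse: only the words are observed, and the two distributions are what the model estimates.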


# Parameters of LDA

LDA has two hyperparameters, alpha and beta:

Alpha represents document-topic density.

Beta represents topic-word density.


With a higher value of alpha, documents are composed of more topics; with a lower value of alpha, documents contain fewer topics.

Likewise, with a higher value of beta, topics are composed of a large number of words in the corpus; with a lower value of beta, they are composed of few words.
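
To see the effect of alpha, the sketch below draws topic mixtures from a symmetric Dirichlet distribution and measures how much weight the dominant topic receives on average. It assumes a symmetric prior over three topics; `sample_dirichlet` and `mean_largest_share` are hypothetical helpers written for this illustration, not part of any library. The same reasoning applies to beta and the word distribution of each topic (note that gensim's `LdaModel` names this parameter `eta`).

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    # Standard construction: normalize k independent Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, k=3, n_docs=2000, seed=0):
    """Average weight of the dominant topic across simulated documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs


low = mean_largest_share(alpha=0.1)    # low alpha: one topic tends to dominate
high = mean_largest_share(alpha=10.0)  # high alpha: topics share the weight
print(f"alpha=0.1  -> mean largest topic share {low:.2f}")
print(f"alpha=10.0 -> mean largest topic share {high:.2f}")
```

A larger mean share for the dominant topic means each simulated document is concentrated on fewer topics, which is the behavior described for low alpha.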

In [0]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

# Cleaning and Preprocessing

In [4]:
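import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Download the stop word list and the WordNet data used by the lemmatizer.
nltk.download('stopwords')
nltk.download('wordnet')

stop = set(stopwords.words('english'))  # common English stop words
exclude = set(string.punctuation)       # punctuation characters to strip
lemma = WordNetLemmatizer()


def clean(doc):
    """Lowercase a document, drop stop words, strip punctuation, and lemmatize."""
    # Remove stop words (tokens are split on whitespace).
    stop_free = " ".join(i for i in doc.lower().split() if i not in stop)
    # Remove punctuation character by character.
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    # Reduce each remaining word to its lemma, e.g. "doctors" -> "doctor".
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized


# Each cleaned document becomes a list of tokens.
doc_clean = [clean(doc).split() for doc in doc_complete]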

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


# Preparing Document-Term Matrix
To run a mathematical model on a text corpus, it is good practice to first convert the corpus into a matrix representation. The LDA model looks for repeating term patterns across the entire document-term (DT) matrix. Python provides many libraries for text mining; “gensim” is one that is scalable, robust and efficient. The following code shows how to convert a corpus into a document-term matrix.

In [7]:
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

doc_term_matrix

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)],
 [(2, 1), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(8, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1)],
 [(2, 1),
  (4, 1),
  (18, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1)],
 [(5, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]]
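
Each row in the output above is one document, stored as a sparse list of (term id, count) pairs; a pair such as (5, 2) in the first row means the term with id 5 occurs twice in the first document. The sketch below builds the same kind of representation in plain Python on two tiny hypothetical documents. The `to_bow` helper and the first-seen id assignment are assumptions made for illustration, so the ids will not necessarily match the ones gensim assigns.

```python
from collections import Counter

# Two tiny pre-tokenized documents (hypothetical, for illustration only).
docs = [
    ["sugar", "bad", "sugar"],
    ["father", "sugar", "driving"],
]

# Assign an integer id to each unique term in first-seen order.
# Note: gensim may assign different ids to the same terms.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc, token2id):
    """Return sorted (term id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```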

# Running LDA Model

In [0]:
Lda = gensim.models.ldamodel.LdaModel

# Training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

In [11]:
print(ldamodel.print_topics(num_topics=3, num_words=4))

[(0, '0.075*"expert" + 0.075*"lifestyle" + 0.075*"health" + 0.075*"say"'), (1, '0.072*"father" + 0.072*"sister" + 0.041*"perform" + 0.041*"seems"'), (2, '0.084*"sugar" + 0.048*"driving" + 0.048*"pressure" + 0.048*"increased"')]
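
Each entry in this output pairs a topic id with a string of the form weight*"word", where the weight is the probability of that word within the topic. As a small sketch, the string for topic 2 can be split into (word, weight) pairs with a regular expression. `parse_topic` is a hypothetical helper written for this illustration; gensim's `show_topics` method also accepts `formatted=False`, which should return the pairs directly without any parsing.

```python
import re

# The string for topic 2, copied from the output above.
topic_string = '0.084*"sugar" + 0.048*"driving" + 0.048*"pressure" + 0.048*"increased"'


def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


for word, weight in parse_topic(topic_string):
    print(f"{word}: {weight}")
```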
