<a href="https://colab.research.google.com/github/xmpuspus/Lectures/blob/master/notebooks/IntroTopicModeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling  
As the name suggests, topic modeling is a process for automatically identifying the topics present in a text object and deriving the hidden patterns exhibited by a text corpus.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should produce terms such as “health”, “doctor”, “patient”, and “hospital” for a Healthcare topic, and “farm”, “crops”, and “wheat” for a Farming topic.

Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. For example, the New York Times uses topic models to improve its article recommendation engine. Recruiters use topic models to extract latent features from job descriptions and map them to suitable candidates. Topic models are also used to organize large datasets of emails, customer reviews, and social media profiles.  

[Reference](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/)

### Import Packages

In [0]:
import nltk
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string

from gensim.models import CoherenceModel

# Plotting tools
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

nltk.download('stopwords')
nltk.download('wordnet')

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
    100% |████████████████████████████████| 1.6MB 14.0MB/s
Collecting funcy (from pyLDAvis)
  Downloading https://files.pythonhosted.org/packages/47/a4/204fa23012e913839c2da4514b92f17da82bf5fc8c2c3d902fa3fa3c6eec/funcy-1.11-py2.py3-none-any.whl
Building wheels for collected packages: pyLDAvis
  Running setup.py bdist_wheel for pyLDAvis ... done
  Stored in directory: /root/.cache/pip/wheels/98/71/24/513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.11 pyLDAvis-2.1.2
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already

True

### Preparing Documents  
We define five short documents and combine them into a corpus.

In [0]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

### Cleaning and Preprocessing  
Remove punctuation and stopwords, then normalize the corpus by lemmatizing each word.

In [0]:
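stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase, split on whitespace, and drop English stopwords
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Strip punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word (the lemmatizer treats words as nouns by default)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized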

doc_clean = [clean(doc).split() for doc in doc_complete] 
doc_clean

[['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
 ['father',
  'spends',
  'lot',
  'time',
  'driving',
  'sister',
  'around',
  'dance',
  'practice'],
 ['doctor',
  'suggest',
  'driving',
  'may',
  'cause',
  'increased',
  'stress',
  'blood',
  'pressure'],
 ['sometimes',
  'feel',
  'pressure',
  'perform',
  'well',
  'school',
  'father',
  'never',
  'seems',
  'drive',
  'sister',
  'better'],
 ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle']]
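
The definition of a topic as a repeating pattern of co-occurring terms can be checked by hand on a corpus this small. The following is a minimal sketch, not part of the original notebook: it copies the cleaned tokens from the output above and counts how many documents each pair of terms shares. The name `pair_counts` is introduced here only for illustration.

```python
from collections import Counter
from itertools import combinations

# Cleaned tokens copied from the output above.
doc_clean = [
    ['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
    ['father', 'spends', 'lot', 'time', 'driving', 'sister', 'around',
     'dance', 'practice'],
    ['doctor', 'suggest', 'driving', 'may', 'cause', 'increased', 'stress',
     'blood', 'pressure'],
    ['sometimes', 'feel', 'pressure', 'perform', 'well', 'school', 'father',
     'never', 'seems', 'drive', 'sister', 'better'],
    ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle'],
]

pair_counts = Counter()
for tokens in doc_clean:
    # Count each unordered pair of distinct terms once per document.
    pair_counts.update(combinations(sorted(set(tokens)), 2))

# The pair that appears together in the most documents.
print(pair_counts.most_common(1))  # [(('father', 'sister'), 3)]
```

"father" and "sister" appear together in three of the five documents, which is the kind of repeated co-occurrence a topic model can group into one topic.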

### Preparing Document-Term Matrix

In [0]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)],
 [(2, 1), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(8, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1)],
 [(2, 1),
  (4, 1),
  (18, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1)],
 [(5, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]]
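
Each pair in the output above is `(term id, count)` for one document. To make the bag-of-words idea concrete, here is a rough pure-Python sketch of the same conversion. `build_bow` is a hypothetical helper, not gensim's implementation. It assumes that new terms receive ids in alphabetical order within each document, which is consistent with the ids printed above but is not something to rely on.

```python
from collections import Counter

def build_bow(docs):
    """Hypothetical helper: map tokens to integer ids and count them per document."""
    token2id = {}
    corpus = []
    for tokens in docs:
        counts = Counter(tokens)
        # Assumption: unseen tokens get the next free id in alphabetical order.
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        # One (id, count) pair per distinct token, sorted by id.
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus

# Two of the cleaned documents from above.
docs = [
    ['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
    ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle'],
]
token2id, bow = build_bow(docs)
print(bow[0])  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]
```

The pair `(5, 2)` records that "sugar" (id 5) occurs twice in the first document. Word order is discarded, which is what makes this a bag-of-words representation.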

### Running Latent Dirichlet Allocation (LDA) Model

In [0]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50, random_state=42)

### Results

Topic-Word Distribution

In [0]:
print(ldamodel.print_topics(num_topics=3, num_words=2))

[(0, '0.135*"sugar" + 0.054*"consume"'), (1, '0.079*"driving" + 0.045*"blood"'), (2, '0.057*"father" + 0.057*"sister"')]


The results above show 3 topics. We can label them by inspecting their top words:

Topic 0 -- Sugar  
Topic 1 -- Driving  
Topic 2 -- Family  
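
Reading topics by eye works for three topics. With more, it can help to pull the words and weights out of the printed strings programmatically. A minimal sketch, assuming the `weight*"word"` string format shown in the output above; `parse_topic` is a hypothetical helper and not part of gensim:

```python
import re

# Output of print_topics copied from the cell above.
topics = [
    (0, '0.135*"sugar" + 0.054*"consume"'),
    (1, '0.079*"driving" + 0.045*"blood"'),
    (2, '0.057*"father" + 0.057*"sister"'),
]

def parse_topic(topic_string):
    """Hypothetical helper: turn '0.135*"sugar" + ...' into [('sugar', 0.135), ...]."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

top_words = {topic_id: parse_topic(text) for topic_id, text in topics}
for topic_id, words in top_words.items():
    print(topic_id, words)
```

If parsing strings feels fragile, gensim's `show_topics` accepts `formatted=False`, which should return the word and weight pairs directly.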

Document-Topic Distribution

In [0]:
topic_lists = ['Sugar', 'Driving', 'Family']

In [0]:
lists = ldamodel.get_document_topics(doc_term_matrix, minimum_probability=0.0)

In [0]:
print ('Document 1: ', doc1)
[(topic_lists[i[0]], i[1]) for i in lists[0]]

Document 1:  Sugar is bad to consume. My sister likes to have sugar, but not my father.


[('Sugar', 0.9128762), ('Driving', 0.043293275), ('Family', 0.04383049)]

In [0]:
print ('Document 2: ', doc2)
[(topic_lists[i[0]], i[1]) for i in lists[1]]

Document 2:  My father spends a lot of time driving my sister around to dance practice.


[('Sugar', 0.034941718), ('Driving', 0.93001276), ('Family', 0.0350455)]

In [0]:
print ('Document 3: ', doc3)
[(topic_lists[i[0]], i[1]) for i in lists[2]]

Document 3:  Doctors suggest that driving may cause increased stress and blood pressure.


[('Sugar', 0.033592276), ('Driving', 0.93213683), ('Family', 0.034270868)]

In [0]:
print ('Document 4: ', doc4)
[(topic_lists[i[0]], i[1]) for i in lists[3]]

Document 4:  Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.


[('Sugar', 0.026379962), ('Driving', 0.026502648), ('Family', 0.9471174)]

In [0]:
print ('Document 5: ', doc5)
[(topic_lists[i[0]], i[1]) for i in lists[4]]

Document 5:  Health experts say that Sugar is not good for your lifestyle.


[('Sugar', 0.90422046), ('Driving', 0.047857955), ('Family', 0.04792159)]
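
The five cells above can be summarized by taking the most probable topic for each document. This sketch copies the printed distributions rather than calling the model, so it runs on its own. The names `doc_topics` and `dominant` are introduced here for illustration.

```python
# Document-topic distributions copied from the outputs above.
doc_topics = [
    [('Sugar', 0.9128762), ('Driving', 0.043293275), ('Family', 0.04383049)],
    [('Sugar', 0.034941718), ('Driving', 0.93001276), ('Family', 0.0350455)],
    [('Sugar', 0.033592276), ('Driving', 0.93213683), ('Family', 0.034270868)],
    [('Sugar', 0.026379962), ('Driving', 0.026502648), ('Family', 0.9471174)],
    [('Sugar', 0.90422046), ('Driving', 0.047857955), ('Family', 0.04792159)],
]

# Pick the highest-probability topic for each document.
dominant = [max(dist, key=lambda pair: pair[1]) for dist in doc_topics]
for number, (label, prob) in enumerate(dominant, start=1):
    print(f'Document {number}: {label} ({prob:.2f})')
```

Each distribution sums to roughly 1, and every document here is dominated by a single topic with probability above 0.9. Documents 1 and 5 land in Sugar, 2 and 3 in Driving, and 4 in Family.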

### Compute Perplexity and Coherence Score

In [0]:
# Compute the per-word log-likelihood bound (gensim derives perplexity from it as 2^(-bound))
print('\nPerplexity: ', ldamodel.log_perplexity(doc_term_matrix))  # a bound closer to 0 means lower perplexity, which is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=ldamodel, texts=doc_clean, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -4.158875516680784

Coherence Score:  0.35696487231896973
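
The value printed as "Perplexity" is negative because `log_perplexity` returns a per-word log-likelihood bound, not the perplexity itself. According to the gensim documentation, the perplexity is 2^(-bound). A small sketch of the conversion, using the bound copied from the output above:

```python
# Per-word log-likelihood bound copied from the output above.
bound = -4.158875516680784

# gensim reports perplexity as 2^(-bound); a lower perplexity is better.
perplexity = 2 ** (-bound)
print(f'Perplexity: {perplexity:.2f}')
```

This gives a perplexity of roughly 18. The bound was computed on the training documents themselves, so it says little about how the model would generalize to unseen text.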


### Visualize Topics-Keywords

In [0]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary)
vis