#### What is Topic Modeling?

As the name suggests, topic modeling is a process that automatically identifies the topics present in a text object and derives hidden patterns exhibited by a text corpus, thereby assisting better decision making.

#### How is it different from rule-based text mining approaches?

Rule-based approaches use regular expressions or dictionary-based keyword search, whereas topic modeling is an unsupervised approach for finding and observing groups of words (called “topics”) in large collections of text.
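To make the contrast concrete, here is a minimal sketch of the rule-based side. The labels and keyword patterns in `RULES` are hypothetical and hand-written for illustration; a topic model would instead discover its word groups from the data.

```python
import re

# Hypothetical hand-written rules: each label maps to a keyword pattern.
RULES = {
    "installation": re.compile(r"\b(install\w*|drill\w*|screw\w*)\b", re.IGNORECASE),
    "price": re.compile(r"\b(price|cheap|deal)\b", re.IGNORECASE),
}

def tag_by_rules(text):
    """Return the sorted labels whose keyword pattern occurs in the text."""
    return sorted(label for label, pattern in RULES.items() if pattern.search(text))

labels = tag_by_rules("Great deal, easy to install")
print(labels)
```

Every new theme needs a new rule here, which is the maintenance cost that an unsupervised approach tries to avoid.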

#### 3 common techniques for Topic Modeling


*   Latent Semantic Analysis (LSA)
*   Probabilistic Latent Semantic Analysis (pLSA)
*   Latent Dirichlet Allocation (LDA)





### LDA

In [2]:
!pip install pyLDAvis
!pip install pyLDAvis.gensim
!pip install django-model-utils

Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Collecting gensim
  Downloading gensim-4.1.2-cp39-cp39-macosx_10_9_x86_64.whl (24.0 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (PEP 517) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136900 sha256=18277c0f95a5c35f4c13da0cfa612dbe7e3943503ad656cbbde9c2d

In [5]:
import pandas as pd
import numpy as np
from pprint import pprint
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
#from model_utils import qc_dict, out_topics_docs, check_topic_doc_prob, topn_docs_by_topic
#import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [22]:
df = pd.read_csv('TV Ceiling & Wall Mounts_processed.csv')

In [23]:
df.head()

| Unnamed: 0 | asin | reviewText | overall | summary | title | review_no_html | review_no_contraction | review_no_punc | review_no_stopwords | review_lemmatized |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 972683275 | This is a great buy, compared to a $60 or more... | 5 | Excelant mount for Tv. Would buy more if I nee... | VideoSecu 24&quot; Long Arm TV Wall Mount Low ... | This is a great buy, compared to a $60 or more... | This is a great buy, compared to a $60 or more... | This is a great buy compared to a 60 or more a... | great buy compared 60 retail store | great buy compare retail store |
| 1 | 972683275 | This mount is just what I needed. It is stron... | 5 | Perfect | VideoSecu 24&quot; Long Arm TV Wall Mount Low ... | This mount is just what I needed. It is stron... | This mount is just what I needed. It is stron... | This mount is just what I needed It is strong... | needed strong sturdy folds almost flat enable ... | need strong sturdy fold almost flat enable tel... |
| 2 | 972683275 | Great deal, easy to mount and it appears to be... | 5 | 32 In TV Mount | VideoSecu 24&quot; Long Arm TV Wall Mount Low ... | Great deal, easy to mount and it appears to be... | Great deal, easy to mount and it appears to be... | Great deal easy to mount and it appears to be ... | great deal easy appears fairly sturdy mounted ... | great deal easy appear fairly sturdy work great |
| 3 | 972683275 | This mount works really well once you get it u... | 4 | Pretty simple, but definitely good! | VideoSecu 24&quot; Long Arm TV Wall Mount Low ... | This mount works really well once you get it u... | This mount works really well once you get it u... | This mount works really well once you get it u... | works really well get going problem arms littl... | work really well get go problem arm little goo... |
| 4 | 972683275 | This wall mount does everything it's supposed ... | 4 | Fairly good wall mount | VideoSecu 24&quot; Long Arm TV Wall Mount Low ... | This wall mount does everything it's supposed ... | This wall mount does everything it is supposed... | This wall mount does everything it is supposed... | everything supposed piece attaches overly long... | everything suppose piece attache overly long m... |


#### Create the Dictionary and Corpus Needed for Topic Modeling

In [24]:
reviews = df['review_lemmatized'].copy()
reviews = reviews.apply(lambda x: x.split())

In [25]:
reviews

0                     [great, buy, compare, retail, store]
1        [need, strong, sturdy, fold, almost, flat, ena...
2        [great, deal, easy, appear, fairly, sturdy, wo...
3        [work, really, well, get, go, problem, arm, li...
4        [everything, suppose, piece, attache, overly, ...
                               ...                        
13114    [reason, go, one, sometimes, remote, hard, tim...
13115    [really, well, make, extremely, durable, set, ...
13116    [great, little, contraption, keep, apple, stat...
13117    [sturdy, affix, supply, double, sided, tape, h...
13118    [wellmade, accessory, hold, apple, device, sec...
Name: review_lemmatized, Length: 13119, dtype: object

In [26]:
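# Create Dictionary
dictionary = corpora.Dictionary(reviews)
# Drop tokens that appear in fewer than 3 documents. With the default no_above=0.5,
# tokens that appear in more than half of the documents are dropped as well.
dictionary.filter_extremes(no_below=3)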

In [27]:
# number of terms
nd = dictionary.num_docs
nt = len(dictionary.keys())
print("number of documents", nd)
print("number of terms", nt)

number of documents 13119
number of terms 5707
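To see what the frequency filter is doing, here is a simplified pure-Python sketch of document-frequency filtering. It mirrors the `no_below` / `no_above` idea of `filter_extremes` but is not gensim's implementation (for instance, it ignores the `keep_n` cap), and the toy documents are made up.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=3, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    `no_above` (a fraction) of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus: "mount" is in every document, "rare" in only one.
toy_docs = [
    ["mount", "sturdy"], ["mount", "sturdy"], ["mount", "sturdy"],
    ["mount", "rare"], ["mount"], ["mount"],
]
kept = filter_by_doc_frequency(toy_docs)
print(kept)
```

Very common tokens carry little topical signal and very rare ones are mostly noise, which is why both ends get trimmed.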


In [28]:
# Create Corpus
texts = reviews

# Term Document Frequency
corpus = [dictionary.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]]


In [29]:
# Human readable format of corpus (term-frequency)
[[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('buy', 1), ('compare', 1), ('great', 1), ('retail', 1), ('store', 1)]]
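The `(id, count)` pairs above are a bag-of-words encoding. As a rough sketch of the idea (not gensim's code), the helpers `build_vocab` and `to_bow` below are hypothetical names, and ids are assigned in alphabetical order purely for readability.

```python
from collections import Counter

def build_vocab(docs):
    """Map every distinct token to an integer id (alphabetical order here)."""
    tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_bow(doc, vocab):
    """Encode a tokenized document as sorted (token_id, count) pairs,
    skipping tokens that are not in the vocabulary."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())

docs = [["great", "buy", "compare", "retail", "store"]]
vocab = build_vocab(docs)
bow = to_bow(docs[0], vocab)
print(bow)
```

Word order is thrown away in this representation; only which words occur, and how often, is passed on to the model.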

#### Building the Topic Model

In [30]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True,
                                           iterations=400)

#### View the Topics in the LDA Model

In [31]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.084*"drywall" + 0.084*"behind" + 0.074*"range" + 0.060*"plug" + '
  '0.054*"hide" + 0.040*"super" + 0.040*"call" + 0.036*"send" + 0.035*"finder" '
  '+ 0.034*"durable"'),
 (1,
  '0.121*"use" + 0.079*"screw" + 0.058*"bracket" + 0.054*"bolt" + 0.049*"stud" '
  '+ 0.043*"two" + 0.037*"level" + 0.036*"include" + 0.034*"small" + '
  '0.029*"attach"'),
 (2,
  '0.050*"much" + 0.045*"installation" + 0.044*"instal" + 0.043*"set" + '
  '0.041*"unit" + 0.038*"think" + 0.037*"right" + 0.035*"instruction" + '
  '0.035*"enough" + 0.034*"place"'),
 (3,
  '0.097*"work" + 0.085*"easy" + 0.077*"great" + 0.071*"good" + 0.063*"well" + '
  '0.055*"install" + 0.051*"buy" + 0.051*"hold" + 0.044*"price" + '
  '0.034*"sturdy"'),
 (4,
  '0.090*"screen" + 0.090*"flat" + 0.053*"vesa" + 0.048*"low" + 0.046*"x" + '
  '0.045*"view" + 0.041*"plasma" + 0.041*"turn" + 0.039*"space" + 0.036*"mm"'),
 (5,
  '0.424*"hole" + 0.214*"drill" + 0.067*"measure" + 0.049*"concrete" + '
  '0.047*"apart" + 0.033*"write" + 

#### Changes on each run

The topics can change from run to run, so the keywords below may differ from the output above.

Topic 0 is represented as `0.082*"minute" + 0.031*"player" + 0.025*"customer" + 0.023*"dvd" + 0.022*"hd" + 0.022*"inside" + 0.018*"beat" + 0.016*"double" + 0.016*"ne" + 0.015*"function"`.

This means the top 10 keywords that contribute to this topic are minute, player, customer, and so on, and the weight of "minute" in topic 0 is 0.082.

The weights reflect how important a keyword is to that topic.
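If we want these weights as numbers rather than a formatted string, one option is to parse the string. The sketch below uses a hypothetical helper, `parse_topic`, on a shortened copy of the topic string quoted above.

```python
import re

def parse_topic(topic_str):
    """Turn a string like '0.082*"minute" + 0.031*"player"' into
    a list of (word, weight) pairs, in the order they appear."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.082*"minute" + 0.031*"player" + 0.025*"customer"'
keywords = parse_topic(topic_0)
print(keywords)
```

In practice, `lda_model.show_topic(topic_id, topn=10)` returns the same kind of `(word, weight)` list directly, so parsing is only needed when all we have is the printed output.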

#### Compute Model Perplexity and Coherence Score

In [32]:
# Compute the per-word likelihood bound (log scale). This is not the perplexity itself:
# perplexity = 2 ** (-bound), and a lower perplexity is better.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviews, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -11.28093983685892

Coherence Score:  0.3123686289502873
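The value printed as "Perplexity" is negative because `log_perplexity` returns a per-word likelihood bound on a log scale. A small sketch of the conversion to an actual perplexity estimate, using the bound printed above (the helper name `bound_to_perplexity` is ours):

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word log-likelihood bound into a perplexity estimate,
    following the relation perplexity = 2 ** (-bound)."""
    return 2 ** (-per_word_bound)

bound = -11.28093983685892  # value printed by lda_model.log_perplexity(corpus)
perplexity = bound_to_perplexity(bound)
print(f"perplexity estimate: {perplexity:.1f}")
```

A bound closer to zero gives a smaller perplexity, so when comparing models on the same held-out data, the higher bound (lower perplexity) is the better fit.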


#### Visualize the Topics and Keywords

In [33]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [34]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
vis




Interpretation:
Each bubble in the left-hand plot represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of clustered in one quadrant.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

If we move the cursor over one of the bubbles, the words and bars on the right-hand side update. These words are the salient keywords that form the selected topic.
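"Prevalence" can be made concrete as the share of corpus tokens attributed to each topic. The sketch below is our assumption about roughly how bubble sizes are derived, not pyLDAvis's own code, and the document-topic probabilities and lengths are made-up toy values.

```python
def topic_proportions(doc_topic_dists, doc_lengths):
    """Estimate each topic's share of the corpus by weighting every
    document's topic distribution by that document's token count."""
    num_topics = len(doc_topic_dists[0])
    topic_tokens = [0.0] * num_topics
    for dist, length in zip(doc_topic_dists, doc_lengths):
        for topic_id, prob in enumerate(dist):
            topic_tokens[topic_id] += prob * length
    total = sum(topic_tokens)
    return [t / total for t in topic_tokens]

# Toy example: two documents, two topics.
doc_topic_dists = [[0.9, 0.1], [0.2, 0.8]]
doc_lengths = [10, 30]
shares = topic_proportions(doc_topic_dists, doc_lengths)
print(shares)
```

A long document dominated by one topic moves that topic's share much more than a short one does, which is why a few long reviews can inflate a bubble.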

#### Building the LDA Mallet Model

Next, we will improve upon this model by using Mallet’s version of the LDA algorithm, and then focus on how to arrive at the optimal number of topics for any large corpus of text.
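One common way to choose the number of topics is to train a model for several candidate values and keep the one with the highest coherence. The sketch below shows only that selection loop; `pick_num_topics` and `toy_score` are hypothetical names, and the commented-out scorer uses plain gensim LDA rather than Mallet.

```python
def pick_num_topics(candidates, score_fn):
    """Return (best_k, scores), where scores is a list of (k, score) pairs
    and best_k is the candidate with the highest score."""
    scores = [(k, score_fn(k)) for k in candidates]
    best_k = max(scores, key=lambda pair: pair[1])[0]
    return best_k, scores

# A possible scorer for this notebook (commented out because it needs the
# `corpus`, `dictionary`, and `reviews` objects from the cells above):
#
# def coherence_for(k):
#     model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
#                                    num_topics=k, random_state=100, passes=10)
#     cm = CoherenceModel(model=model, texts=reviews,
#                         dictionary=dictionary, coherence='c_v')
#     return cm.get_coherence()

# Toy stand-in scorer so the sketch runs on its own; it peaks at k = 10.
def toy_score(k):
    return 1.0 - abs(k - 10) / 10.0

best_k, scores = pick_num_topics(range(5, 21, 5), toy_score)
print(best_k, scores)
```

Training one model per candidate is slow on a corpus of this size, so a coarse grid first and a finer one around the best value is a reasonable compromise.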

Collecting gensim==3.8.3
  Downloading gensim-3.8.3.tar.gz (23.4 MB)
Building wheels for collected packages: gensim
  Building wheel for gensim (setup.py) ... done
  Created wheel for gensim: filename=gensim-3.8.3-cp39-cp39-macosx_10_9_x86_64.whl size=24175443 sha256=1f5f866c11c0e573c7b21addd02b7665390d75a43f490ec54b9e15a3608ad0ef
  Stored in directory: /Users/swapnilkailassethi/Library/Caches/pip/wheels/ca/5d/af/618594ec2f28608c1d6ee7d2b7e95a3e9b06551e3b80a491d6
Successfully built gensim
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.1.2
    Uninstalling gensim-4.1.2:
      Successfully uninstalled gensim-4.1.2
Successfully installed gensim-3.8.3


ImportError: cannot import name 'Vocab' from 'gensim.models.word2vec' (/Users/swapnilkailassethi/opt/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py)

AttributeError: module 'gensim.models' has no attribute 'wrappers'