
# Topic Modeling with Latent Dirichlet Allocation (LDA)

**Caution**: To use pyLDAvis in Colab, run the first cell (`pip install pyldavis`) first, then **Restart Runtime** so that the upgraded pandas is loaded.

In [1]:
# Install pyLDAvis
!pip install pyldavis

Collecting pyldavis
  Downloading https://files.pythonhosted.org/packages/03/a5/15a0da6b0150b8b68610cc78af80364a80a9a4c8b6dd5ee549b8989d4b60/pyLDAvis-3.3.1.tar.gz (1.7MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting numpy>=1.20.0
  Downloading https://files.pythonhosted.org/packages/3f/03/c3526fb4e79a793498829ca570f2f868204ad9a8040afcd72d82a8f121db/numpy-1.21.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7MB)
Collecting pandas>=1.2.0
  Downloading https://files.pythonhosted.org/packages/99/f7/01cea7f6c963100f045876eb4aa1817069c5c9eca73d2dbfb5d31ff9a39f/pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8MB)

In [1]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

from pprint import pprint

import spacy

import pickle
import re 
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 


In [2]:
## Access the full tweet dataset. Not used here.
# !wget https://github.com/yohanesnuwara/datasets/raw/master/dp-export-80169544-c25f-4e7d-8a37-c231441be607.zip

# Instead, download the already preprocessed dataset
!wget https://raw.githubusercontent.com/yohanesnuwara/datasets/master/6k_tweet_processed.csv

--2021-07-18 04:49:58--  https://raw.githubusercontent.com/yohanesnuwara/datasets/master/6k_tweet_processed.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18197176 (17M) [text/plain]
Saving to: ‘6k_tweet_processed.csv’


2021-07-18 04:49:59 (70.6 MB/s) - ‘6k_tweet_processed.csv’ saved [18197176/18197176]



In [3]:
path = '/content/6k_tweet_processed.csv'

tweets = pd.read_csv(path)

tweets = tweets.Tweets.values.tolist()
tweets = [t.split(',') for t in tweets]

# Print one of the tweets
print(tweets[0])

['speak', 'health', 'care', 'reform', 'morning', 'live', 'tweeting', 'allow', 'happen', 'true', 'cover', 'hard', 'gawk', 'son', 'build', 'key', 'question', 'hit', 'plan', 'year', 'hit', 'plan', 'year', 'must', 'read', 'ask', 'orderly', 'generation', 'family', 'can', 'speak', 'branch', 'order', 'specific', 'action', 'declaration', 'administrative', 'orderly', 'transition', 'massive', 'disruption', 'go', 'protester', 'chant', 'really', 'motorcade', 'head', 'protester', 'many', 'slice', 'read', 'work', 'solution', 'trade', 'off', 'view', 'chant', 'singe', 'scene', 'start', 'spread', 'quick', 'thought', 'ask', 'how', 'view', 'grind', 'poster', 'reference', 'interview', 'continue', 'press', 'drug', 'rebate', 'classification', 'will', 'fallout', 'happiness', 'speak', 'see', 'spark', 'teach', 'dem', 'board', 'aca', 'replacement', 'republican', 'should', 'take', 'table', 'ugh', 'scandal', 'behavior', 'incentive', 'donation', "'d", 'sing', 'sit', 'dock', 'rap', 'chicken', 'dance', 'thank', 'day

Create a bag-of-words representation: map each token to an integer id, then count how often each id occurs in every tweet document.

In [4]:
id2word = Dictionary(tweets)

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in tweets]
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1), (5, 2), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 2), (13, 1), (14, 1), (15, 1), (16, 2), (17, 1), (18, 1), (19, 1), (20, 2), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 2), (37, 1), (38, 9), (39, 1), (40, 1), (41, 1), (42, 2), (43, 2), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 2), (53, 2), (54, 2), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 2), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 2), (71, 1), (72, 5), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 5), (83, 1), (84, 2), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 5), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 4), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1)

In [5]:
[[(id2word[i], freq) for i, freq in doc] for doc in corpus[:1]]

[[("'d", 1),
  ('-', 1),
  ('absolutely', 1),
  ('aca', 3),
  ('act', 1),
  ('action', 2),
  ('add', 2),
  ('administrative', 1),
  ('affordable', 1),
  ('allow', 1),
  ('amazing', 1),
  ('arrive', 2),
  ('ask', 2),
  ('audits', 1),
  ('av', 1),
  ('avoid', 1),
  ('away', 2),
  ('back', 1),
  ('ball', 1),
  ('baseball', 1),
  ('beget', 2),
  ('begin', 1),
  ('behavior', 1),
  ('believe', 1),
  ('bid', 1),
  ('big', 2),
  ('billy', 1),
  ('board', 1),
  ('bout', 1),
  ('branch', 1),
  ('break', 1),
  ('bring', 1),
  ('brother', 1),
  ('build', 1),
  ('call', 2),
  ('can', 1),
  ('cap', 2),
  ('car', 1),
  ('care', 9),
  ('cell', 1),
  ('certainly', 1),
  ('chair', 1),
  ('change', 2),
  ('chant', 2),
  ('chicken', 1),
  ('child', 1),
  ('chip', 1),
  ('choice', 1),
  ('choke', 1),
  ('chuck', 1),
  ('classification', 1),
  ('close', 1),
  ('come', 2),
  ('community', 2),
  ('compare', 2),
  ('competition', 1),
  ('competitively', 1),
  ('compliant', 1),
  ('conclusion', 1),
  ('conferen
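The mapping from tokens to `(id, frequency)` pairs can be sketched on a toy corpus without gensim. This is only a minimal illustration, not gensim's implementation: the helper `build_vocab_and_bow` and the two toy documents are hypothetical, and assigning new ids in sorted order per document is an assumption chosen to echo the alphabetical ids visible in the output above.

```python
from collections import Counter


def build_vocab_and_bow(docs):
    """Return (token2id, corpus) where corpus holds (id, count) pairs per document."""
    token2id = {}
    corpus = []
    for doc in docs:
        counts = Counter(doc)
        # Unseen tokens get the next free ids, in sorted order (an assumption
        # made for this sketch, not a statement about gensim internals).
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        # One (id, frequency) pair per distinct token, ordered by id.
        bow = sorted((token2id[token], n) for token, n in counts.items())
        corpus.append(bow)
    return token2id, corpus


# Hypothetical toy documents, already tokenized like the tweets above.
toy_docs = [
    ["health", "care", "reform", "care"],
    ["care", "plan", "year", "plan"],
]

token2id, toy_corpus = build_vocab_and_bow(toy_docs)
print(token2id)
print(toy_corpus)
```

Each inner list plays the role of one entry of `corpus`: the token text is gone and only ids and counts remain, which is why the notebook looks the ids up in `id2word` to make the result readable.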

Fit an LDA model to separate the corpus into topics. Each topic is printed as a weighted list of its most probable words.

In [6]:
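# Build LDA model
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=10,         # number of topics to extract
                     random_state=0,        # fixed seed for reproducibility
                     chunksize=100,         # documents per training chunk
                     alpha='auto',          # learn the document-topic prior from the data
                     per_word_topics=True)  # also compute the most likely topics per word

# Print the top words of each topic with their weights
pprint(lda_model.print_topics())

# Apply the trained model to the corpus (topic distribution per document)
doc_lda = lda_model[corpus]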

[(0,
  '0.142*"more" + 0.051*"today" + 0.017*"cancer" + 0.015*"pisce" + '
  '0.011*"capricorn" + 0.011*"aquarius" + 0.010*"arie" + 0.008*"feel" + '
  '0.006*"day" + 0.006*"gemini"'),
 (1,
  '0.019*"game" + 0.017*"good" + 0.017*"play" + 0.015*"win" + 0.012*"great" + '
  '0.010*"team" + 0.010*"go" + 0.010*"look" + 0.009*"think" + 0.009*"time"'),
 (2,
  '0.020*"video" + 0.016*"love" + 0.015*"like" + 0.013*"go" + 0.013*"watch" + '
  '0.013*"good" + 0.008*"fuck" + 0.007*"know" + 0.007*"live" + 0.007*"new"'),
 (3,
  '0.013*"how" + 0.009*"new" + 0.007*"business" + 0.007*"market" + 0.006*"s" + '
  '0.006*"price" + 0.006*"pay" + 0.005*"why" + 0.005*"money" + 0.005*"growth"'),
 (4,
  '0.005*"think" + 0.005*"thing" + 0.005*"woman" + 0.005*"know" + '
  '0.004*"write" + 0.004*"read" + 0.004*"people" + 0.004*"old" + 0.004*"word" '
  '+ 0.004*"man"'),
 (5,
  '0.024*"trump" + 0.014*"people" + 0.012*"think" + 0.010*"know" + '
  '0.008*"vote" + 0.008*"go" + 0.008*"need" + 0.007*"right" + 0.007*"say" + '
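Each printed topic is a sum of `weight*"word"` terms, where the weight is the word's probability within that topic. As a rough sketch of how to read this output, the snippet below parses the string of topic 0 (copied from the output above) into `(word, weight)` pairs. The helper `parse_topic` is hypothetical and written only for illustration; gensim's `show_topics(formatted=False)` should return such pairs directly.

```python
import re

# Topic 0, copied verbatim from the print_topics() output above.
topic_0 = ('0.142*"more" + 0.051*"today" + 0.017*"cancer" + 0.015*"pisce" + '
           '0.011*"capricorn" + 0.011*"aquarius" + 0.010*"arie" + 0.008*"feel" + '
           '0.006*"day" + 0.006*"gemini"')


def parse_topic(topic_string):
    """Split a formatted topic string into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


terms = parse_topic(topic_0)

# Probability mass covered by the printed top words of this topic.
top_mass = sum(weight for _, weight in terms)

for word, weight in terms:
    print(f"{word:<10} {weight:.3f}")
print(f"mass covered by top {len(terms)} words: {top_mass:.3f}")
```

The printed words cover only part of the topic's probability mass; the remainder is spread over the rest of the vocabulary, which is one reason some topics are hard to label from their top words alone.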

Use pyLDAvis to visualize these topics.

The visualization shows 9 groups (circles) for the 10 topics. Here is my interpretation of each topic:
1. Daily life, work, positivity
2. Socio-politics, woman
3. President, voting, activism
4. Workplace, tech
5. Music, sport, film
6. Music, sport, film
7. Unidentified
8. Economics, business, market
9. Life, relationship
10. Unidentified

One circle can contain more than one topic, which is why only 9 groups appear for 10 topics. Topics 5 and 6 seem to overlap, and topics 7 and 10 can't be identified clearly.
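Why two topics end up on top of each other can be illustrated with a distance between their word distributions. To my understanding, pyLDAvis positions the circles by default from the Jensen-Shannon divergence between topics, followed by a 2-D projection; treat that as an assumption here. The sketch below computes only the divergence, on hypothetical distributions over a four-word vocabulary (`topic_a`, `topic_b`, `topic_c` are made up, not taken from the model).

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-word distributions (each sums to 1).
topic_a = [0.40, 0.40, 0.10, 0.10]
topic_b = [0.38, 0.42, 0.10, 0.10]  # nearly the same words as topic_a
topic_c = [0.05, 0.05, 0.45, 0.45]  # mostly different words

print("a vs b:", js_divergence(topic_a, topic_b))
print("a vs c:", js_divergence(topic_a, topic_c))
```

Two topics with nearly identical top words, such as topics 5 and 6 above, have a small divergence and are therefore drawn close together or overlapping.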

In [7]:
# Creating Topic Distance Visualization 
pyLDAvis.enable_notebook()
gensimvis.prepare(lda_model, corpus, id2word)



We can improve these results (obtaining more clearly separated topics) by using LDA Mallet. The full workflow is described in the article referenced below.

References:

* https://neptune.ai/blog/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know