<a href="https://colab.research.google.com/github/rcdbe/sma-online/blob/master/day-3/Topic_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Social Media Analytics Workshop - Telkom University*


---



# Topic Modelling

A topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modelling is a widely used text-mining technique for uncovering hidden semantic structure in a body of text. Intuitively, if a document is about a particular topic, certain words should appear in it more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear more often in documents about cats, and "the" and "is" will appear about equally in both.
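
To make this intuition concrete, the sketch below counts words in a tiny made-up corpus. The four sentences and the `dog_docs` / `cat_docs` groupings are illustrative assumptions, not part of the workshop data. Topic-specific words concentrate in one group, while a function word such as "the" occurs equally often in both.

```python
import re
from collections import Counter

# Hypothetical toy documents, invented for illustration only
dog_docs = ["the dog chews the bone", "the dog buries a bone in the yard"]
cat_docs = ["the cat says meow to the bird", "the cat naps while the kitten says meow"]

def word_counts(docs):
    """Count lowercase alphabetic tokens across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

dog_counts = word_counts(dog_docs)
cat_counts = word_counts(cat_docs)

# Compare how often each word occurs in the two groups
for word in ["dog", "bone", "cat", "meow", "the"]:
    print(f"{word:>5}: dog docs = {dog_counts[word]}, cat docs = {cat_counts[word]}")
```

A topic model works in the opposite direction: it is given only the word counts, without the group labels, and tries to recover groupings of words that explain them.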

In [0]:
# Install Library
! pip install pyLDAvis

In [0]:
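# Import Libraries
import os
import nltk
import numpy as np
import pyLDAvis

# pyLDAvis 3.4 renamed its scikit-learn helper module from
# pyLDAvis.sklearn to pyLDAvis.lda_model; import whichever is installed
try:
    import pyLDAvis.lda_model
except ImportError:
    import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# Import Modules (print is already a function in Python 3, so no __future__ import is needed)
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from matplotlib import pyplot as plt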

In [0]:
# Clone Library and Data from Github
! git clone https://github.com/rcdbe/sma-online

In [0]:
# Set Data Directory
os.chdir('sma-online/day-3/Source')

In [0]:
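# Download the NLTK stop-word list
nltk.download('stopwords')

# Name of the data file inside the Source directory
data_file = 'berita_batubara.csv'

# Load the documents with the workshop helper module MyLib
# (imported from the current working directory)
import MyLib as TS
Tweets = TS.LoadTxt(data_file) 
print('Total loaded tweets = {0}'.format(len(Tweets)))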

In [0]:
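# Model and vocabulary settings
n_topics = 4     # number of topics the LDA model will fit
top_topics = 4   # number of topics to print
top_words = 8    # number of words to print per topic
max_df = 0.75    # drop terms that occur in more than 75% of the documents
min_df = 10      # drop terms that occur in fewer than 10 documents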

In [0]:
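# Feature Extraction (bag-of-words term counts, not word embeddings)
count_vector = CountVectorizer(lowercase=True, token_pattern=r'\b[a-zA-Z]{3,}\b',
                               max_df=max_df, min_df=min_df)
dtm_tf = count_vector.fit_transform(Tweets)  # sparse document-term count matrix
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_terms = list(count_vector.get_feature_names_out())
del Tweets  # only the document-term matrix is needed from here on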
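
The effect of `max_df` and `min_df` is easier to see on a small example. The following is a pure-Python sketch of document-frequency filtering under the same token pattern. It is not scikit-learn's implementation, and the `build_vocabulary` helper and `toy_docs` corpus are hypothetical names introduced here for illustration.

```python
import re
from collections import Counter

def build_vocabulary(docs, max_df=0.75, min_df=10):
    """Keep terms whose document frequency lies between min_df and max_df.

    As in the cell above, max_df is read as a proportion of the documents
    (float) and min_df as an absolute number of documents (int).
    """
    token_re = re.compile(r"\b[a-zA-Z]{3,}\b")
    doc_freq = Counter()
    for doc in docs:
        # A set counts each term at most once per document
        doc_freq.update(set(token_re.findall(doc.lower())))
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

# Hypothetical mini-corpus, invented for illustration only
toy_docs = [
    "coal price rises in the market",
    "coal export rises again",
    "coal mining permit revoked",
    "market price of coal falls",
]

print(build_vocabulary(toy_docs, max_df=0.75, min_df=2))
# ['market', 'price', 'rises']
```

Here "coal" is dropped because it occurs in every document (more than 75%), and words that occur in only one document are dropped by `min_df=2`. Terms that appear almost everywhere or almost nowhere carry little information for separating topics.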

In [0]:
# Fit the LDA Model
lda_tf = LatentDirichletAllocation(n_components=n_topics, learning_method='online', random_state=0).fit(dtm_tf)

# Show Topics
vsm_topics = lda_tf.transform(dtm_tf)  # document-topic weights, one row per document
doc_topic = [a.argmax() + 1 for a in tqdm(vsm_topics)]  # dominant topic of each document (1-based)
print('In total there are {0} major topics, distributed as follows'.format(len(set(doc_topic))))
plt.hist(np.array(doc_topic), alpha=0.5)
plt.show()
print('Printing top {0} Topics, with top {1} Words:'.format(top_topics, top_words))
TS.print_Topics(lda_tf, tf_terms, top_topics, top_words)
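
The two steps in this cell, picking each document's dominant topic and listing the heaviest words per topic, can be sketched in plain Python. The numbers below are made-up stand-ins for the output of `lda_tf.transform(...)` and for `lda_tf.components_`. The source of `TS.print_Topics` is not shown in this notebook, so `top_terms` is a hypothetical helper that only illustrates the general idea.

```python
from collections import Counter

def dominant_topics(doc_topic_rows):
    """Return the 1-based index of the highest-weight topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in doc_topic_rows]

def top_terms(topic_weights, terms, n_words):
    """Return the n_words terms with the largest weight in one topic."""
    ranked = sorted(range(len(terms)), key=lambda i: topic_weights[i], reverse=True)
    return [terms[i] for i in ranked[:n_words]]

# Hypothetical document-topic weights: 3 documents, 3 topics
toy_doc_topic = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.6, 0.3, 0.1]]
# Hypothetical vocabulary and topic-word weights: 3 topics, 4 terms
toy_terms = ["coal", "export", "price", "permit"]
toy_components = [[5.0, 0.5, 3.0, 0.1], [0.2, 4.0, 0.3, 1.0], [0.1, 0.2, 0.4, 6.0]]

labels = dominant_topics(toy_doc_topic)
print(labels)           # [1, 3, 1]
print(Counter(labels))  # documents per dominant topic

for k, weights in enumerate(toy_components, start=1):
    print("Topic", k, top_terms(weights, toy_terms, 2))
```

In this toy case topic 2 is never the dominant topic of any document, so `len(set(labels))` is 2 even though three topics exist. The same can happen with the "major topics" count printed above, which may be smaller than `n_topics`.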

In [0]:
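# Interactively visualize the topics; any warnings can be ignored
# Wait a few minutes, then hover the mouse over the topics to explore
# pyLDAvis 3.4 renamed pyLDAvis.sklearn to pyLDAvis.lda_model; use whichever module was imported
lda_vis = getattr(pyLDAvis, 'lda_model', None) or pyLDAvis.sklearn
lda_vis.prepare(lda_tf, dtm_tf, count_vector)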