<a href="https://colab.research.google.com/github/solharsh/Capstone_Sentiment_Analysis/blob/master/Topic_Modeling_Checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling

Topic models extract distinguishing concepts, or topics, from a large corpus in which each document discusses one or more of them. A topic can be a thought, an opinion, a fact, an outlook, a statement and so on. The aim is to use mathematical and statistical techniques to discover the latent semantic structure of the corpus. In practice, this means extracting features from document terms and applying frameworks such as matrix factorization and singular value decomposition (SVD) to produce clusters of terms that are distinguishable from each other; each cluster of words forms a topic. These topics help interpret the main themes of a corpus and reveal semantic connections among words that frequently co-occur across documents. Several frameworks and algorithms can build topic models. The most popular ones include:

- Latent Semantic Indexing
- Latent Dirichlet Allocation
- Non-negative Matrix Factorization

The last of these, non-negative matrix factorization (NMF, also abbreviated NNMF), is a matrix decomposition technique similar to SVD, but it operates on non-negative matrices and works well for multivariate data. Formally, given a non-negative matrix V, the objective is to find two non-negative matrix factors, W and H, whose product approximately reconstructs V:

$$ V \approx WH $$

such that all three matrices are non-negative.

To reach this approximation, we minimize a cost function that measures the distance between V and WH. A common choice is the Frobenius norm of V - WH, which is the matrix counterpart of the Euclidean (L2) norm for vectors.

scikit-learn implements this in the NMF class of its decomposition module, which we use in this section.
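
As a rough illustration of the factorization described above, the sketch below runs the classic multiplicative update rules for the Frobenius objective in plain NumPy. The function name `nmf_multiplicative`, the toy matrix and the iteration count are assumptions made for this example. scikit-learn's NMF class uses its own solvers (coordinate descent by default), so this is not the code that runs later in the notebook.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix V as W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    # Random non-negative starting factors.
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Frobenius norm of the reconstruction error after this iteration.
        errors.append(np.linalg.norm(V - W @ H, ord="fro"))
    return W, H, errors

# Hypothetical 4x4 matrix with two obvious blocks of co-occurring columns.
V = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.8, 1.0, 0.1, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.8, 1.0]])

W, H, errors = nmf_multiplicative(V, n_components=2)
print(W.shape, H.shape)       # (4, 2) (2, 4)
print(errors[0], errors[-1])  # error after the first and the last iteration
```

In a topic-modeling setting, V would be the document-term matrix, the rows of W the document-topic weights, and the rows of H the topic-term weights.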



In [0]:
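import os
import pickle

DATA_PATH = "/content/drive/My Drive/Capstone Project - NLP/Harsh/Project_Checkpoints/"
# load the cleaned speeches checkpoint; the context manager closes the file afterwards
with open(os.path.join(DATA_PATH, 'speech_cleaned_checkpoint.pkl'), 'rb') as infile:
    df = pickle.load(infile)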

In [0]:
import pandas as pd
import numpy as np
import warnings
import nltk

warnings.filterwarnings("ignore")

# Extract features from Speeches

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# get tf-idf features for all the speeches
speeches = list(df.Speech_Cleaned)
# unigrams and bigrams that occur in at least 2% and at most 75% of the speeches
ptvf = TfidfVectorizer(use_idf=True, min_df=0.02, max_df=0.75, ngram_range=(1, 2), sublinear_tf=True)
ptvf_features = ptvf.fit_transform(speeches)
# view feature set dimensions
print(ptvf_features.shape)

(12, 73663)
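
To make the weighting options used above more concrete, here is a minimal NumPy sketch of what `sublinear_tf=True` and `use_idf=True` compute, following the formulas in the scikit-learn documentation (smoothed idf and L2 row normalisation are the library defaults). The toy count matrix is hypothetical, and the sketch skips tokenisation and the `min_df`/`max_df` filtering.

```python
import numpy as np

# Toy term counts: 3 documents x 4 terms (hypothetical numbers).
counts = np.array([[3, 0, 1, 0],
                   [1, 2, 0, 0],
                   [0, 1, 1, 4]], dtype=float)
n_docs = counts.shape[0]

# sublinear_tf=True replaces each non-zero count tf with 1 + ln(tf).
tf = np.zeros_like(counts)
nonzero = counts > 0
tf[nonzero] = 1.0 + np.log(counts[nonzero])

# Smoothed idf (smooth_idf=True): ln((1 + n) / (1 + df)) + 1.
df = nonzero.sum(axis=0)
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Weight each term, then L2-normalise every document row (norm='l2').
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
print(np.round(tfidf, 3))
```

The logarithm dampens the effect of terms that repeat many times within a single long speech, which matters for documents of this length.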


# Topic Modeling on Speeches

In [7]:
!pip install pyLDAvis


In [12]:
from sklearn.decomposition import LatentDirichletAllocation

# fit a 3-topic LDA model; each row of dt_matrix is one speech's topic distribution
lda = LatentDirichletAllocation(n_components=3, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(ptvf_features)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2', 'T3'])
features

|   | T1 | T2 | T3 |
|---|---|---|---|
| 0 | 0.990795 | 0.004596 | 0.004609 |
| 1 | 0.991518 | 0.004233 | 0.004249 |
| 2 | 0.991638 | 0.004169 | 0.004193 |
| 3 | 0.988292 | 0.005839 | 0.005868 |
| 4 | 0.00394 | 0.003861 | 0.992199 |
| 5 | 0.004295 | 0.004188 | 0.991517 |
| 6 | 0.991772 | 0.0041 | 0.004128 |
| 7 | 0.990232 | 0.004876 | 0.004892 |
| 8 | 0.99242 | 0.003776 | 0.003805 |
| 9 | 0.990713 | 0.004637 | 0.00465 |
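
One way to read the document-topic output above is to pick, for each speech, the topic with the largest weight. The sketch below does that for the first six rows of the output, copied by hand; the variable names are illustrative only.

```python
import numpy as np

# First six rows of the document-topic output above (columns T1, T2, T3).
doc_topic = np.array([
    [0.990795, 0.004596, 0.004609],
    [0.991518, 0.004233, 0.004249],
    [0.991638, 0.004169, 0.004193],
    [0.988292, 0.005839, 0.005868],
    [0.003940, 0.003861, 0.992199],
    [0.004295, 0.004188, 0.991517],
])
topic_names = np.array(["T1", "T2", "T3"])

# The dominant topic is the column with the largest weight in each row.
dominant = topic_names[doc_topic.argmax(axis=1)]
print(dominant.tolist())  # ['T1', 'T1', 'T1', 'T1', 'T3', 'T3']

# Each row is a probability distribution over topics, so it sums to about 1.
print(np.allclose(doc_topic.sum(axis=1), 1.0, atol=1e-5))  # True
```

In these rows no speech is dominated by T2, and every speech puts about 99% of its weight on a single topic.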


# Show topics and their weights

In [15]:
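from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf over the full unigram vocabulary, with no document-frequency filtering
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(df.Speech_Cleaned)
tv_matrix = tv_matrix.toarray()

# on scikit-learn >= 1.0 use tv.get_feature_names_out(); get_feature_names() was removed in 1.2
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)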

Unnamed: 0,aa,aaby,aadhaar,aadhar,aadmi,aai,aajeevika,aakansha,aam,aamayaah,aapka,aapke,aar,aasha,aayakar,aaykar,aayog,ab,abatement,abettor,abeyance,abhiyan,abide,ability,able,abled,abolish,abolished,abolition,abroad,abrupt,absence,absolute,absolutely,absorb,absorbent,absorptive,abundance,abundant,abuse,...,xii,xiii,xiv,xix,xv,xvi,xvii,xviii,xx,xylene,yacht,yannai,yards,yarn,year,yeh,yen,yeoman,yeomen,yesterday,yet,yield,yoga,yogi,yojana,yojanamaking,yojna,young,youth,youthful,yuva,zarda,zari,zeolite,zero,zinc,zirconia,zone,zoo,zozila
0,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.17,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.02,0.0,0.03,0.01,0.01,0.02,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0
3,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.06,0.0,0.01,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.01,0.0,0.07,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
6,0.01,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
7,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.23,0.0,0.01,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
9,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.23,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0


In [16]:
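# topic-term weights: one row per topic, one column per feature lda was fitted on
tt_matrix = lda.components_
# lda was fitted on ptvf_features, so the terms must come from ptvf, not from tv
lda_vocab = ptvf.get_feature_names()
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(lda_vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    # keep only the terms whose weight exceeds 0.3
    topic = [item for item in topic if item[1] > 0.3]
    print(topic)
    print()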
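
The pairing of terms with weights only makes sense when the vocabulary has exactly one entry per column of the topic-term matrix, in the same order. The sketch below shows the same "top terms per topic" idea on toy data, with an explicit length check. The helper name `top_terms`, the vocabulary and the weights are all hypothetical.

```python
import numpy as np

def top_terms(topic_term, vocabulary, n_terms=3):
    """Return the n_terms highest-weighted (term, weight) pairs for each topic."""
    topic_term = np.asarray(topic_term)
    if topic_term.shape[1] != len(vocabulary):
        raise ValueError("vocabulary must come from the vectorizer the model was fitted on")
    result = []
    for weights in topic_term:
        # Indices of the largest weights, in descending order.
        order = np.argsort(weights)[::-1][:n_terms]
        result.append([(vocabulary[i], float(weights[i])) for i in order])
    return result

vocabulary = ["tax", "farmer", "digital", "excise"]   # hypothetical terms
topic_term = np.array([[2.0, 0.1, 0.3, 1.5],          # hypothetical topic 1 weights
                       [0.2, 1.8, 1.1, 0.1]])         # hypothetical topic 2 weights

for terms in top_terms(topic_term, vocabulary, n_terms=2):
    print(terms)
# [('tax', 2.0), ('excise', 1.5)]
# [('farmer', 1.8), ('digital', 1.1)]
```

Without such a check, `zip` silently stops at the shorter of the two sequences, so a mismatched vocabulary produces plausible-looking but wrong term labels.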






# Clustering documents using topic model features

In [18]:
from sklearn.cluster import KMeans

# cluster the speeches on their topic distributions (T1 to T3)
km = KMeans(n_clusters=3)
km.fit(features)
cluster_labels = pd.DataFrame(km.labels_, columns=['ClusterLabel_TM'])
pd.concat([df, cluster_labels], axis=1)

|   | Speaker_Name | Date_Of_Speech | Speech | Speech_Cleaned | word_count | negation | length | has_url | quest_mark | excl_mark | ClusterLabel_TM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pranab Mukherjee | March 16, 2012 | Budget 2012-2013 \n\nSpeech of \n\nPranab Mukh... | budget speech pranab mukherjee minister financ... | 14077 | True | 89122 | True | 0 | 0 | 1 |
| 1 | Arun Jaitley | July 10, 2014 | Budget 2014-2015 \n\nSpeech of \n\nArun Jaitle... | budget speech arun jaitley minister finance ju... | 16395 | True | 103238 | False | 3 | 0 | 1 |
| 2 | Arun Jaitley | February 28, 2015 | CONTENTS \n\nPART -A \n\nPage No. \nIntroducti... | content part page no introduction major challe... | 17885 | True | 112015 | False | 2 | 1 | 1 |
| 3 | Piyush Goyal | February 1, 2019 | Interim Budget 2019-2020 \n\n \n\nSpeech of \n... | interim budget speech piyush goyal minister fi... | 8044 | True | 51078 | False | 0 | 0 | 2 |
| 4 | Nirmala Sitharaman | July 5, 2019 | Budget \n2019-2020 \n\n\nSpeech \nof \nNirmala... | budget speech nirmala sitharaman minister fina... | 19329 | True | 147404 | False | 1 | 1 | 0 |
| 5 | Arun Jaitley | February 1, 2017 | CONTENTS \n\n \n\nPART - A \n\n Page No. \n\n ... | content part page no introduction farmer ii ru... | 18643 | True | 120097 | False | 1 | 0 | 0 |
| 6 | Arun Jaitley | February 1, 2018 | Budget 2018-2019 \n\n \n\nSpeech of \n\nArun J... | budget speech arun jaitley minister finance fe... | 17919 | True | 118836 | False | 0 | 0 | 1 |
| 7 | Pranab Mukherjee | February 26, 2010 | Budget 2010-2011\n\n \n\nSpeech of\n\nPranab... | budget speech pranab mukherjee minister financ... | 12330 | True | 79370 | True | 2 | 0 | 1 |
| 8 | Arun Jaitley | February 29, 2016 | CONTENTS \n\nPART -A \n\nPage No. \n\nIntroduc... | content part page no introduction agriculture ... | 24551 | True | 156692 | False | 1 | 0 | 1 |
| 9 | Pranab Mukherjee | February 28, 2011 | Budget 2011-2012 \n\nSpeech of \n\nPranab Mukh... | budget speech pranab mukherjee minister financ... | 13901 | True | 87777 | True | 0 | 1 | 1 |


In [0]:
import pyLDAvis
import pyLDAvis.sklearn
from sklearn.decomposition import NMF
#import topic_model_utils as tmu

pyLDAvis.enable_notebook()
total_topics = 3

# Display and visualize topics

In [46]:
text = " ".join(speech for speech in df.Speech_Cleaned)
# len() of a string counts characters, not words
print("There are {} characters in the combination of all speeches.".format(len(text)))

There are 763219 characters in the combination of all speeches.
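
Note that `len()` applied to a string returns its length in characters. If a word count is what is wanted, a whitespace split is one simple option. The short string below is a hypothetical stand-in for the joined speeches.

```python
# Hypothetical stand-in for the joined, cleaned speeches.
text = "budget speech minister finance budget"

n_chars = len(text)          # characters, including spaces
n_words = len(text.split())  # whitespace-separated tokens

print("characters:", n_chars)  # characters: 37
print("words:", n_words)       # words: 5
```

Applied to the notebook's `text` variable, `len(text.split())` would give the number of tokens across all cleaned speeches.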


In [0]:
speech_new = np.array(df['Speech_Cleaned'])

In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer

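# get tf-idf features for speech corpus
# (this rebinds ptvf and ptvf_features from the earlier cell; only unigrams that
# occur in at least 50% and at most 95% of the speeches are kept)
all_speeches = list(speech_new)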
ptvf = TfidfVectorizer(use_idf=True, min_df=0.50, max_df=0.95, ngram_range=(1,1), sublinear_tf=True)
ptvf_features = ptvf.fit_transform(all_speeches)
# view feature set dimensions
print(ptvf_features.shape)

(12, 1290)


In [0]:
# prints components of all the topics 
# obtained from topic modeling
def print_topics_udf(topics, total_topics=5,
                     weight_threshold=0.0001,
                     display_weights=False,
                     num_terms=None):
    
    for index in range(total_topics):
        topic = topics[index]
        # weights arrive as strings after np.vstack, so cast them back to float
        topic = [(term, float(wt))
                 for term, wt in topic]
        topic = [(word, round(wt,2)) 
                 for word, wt in topic 
                 if abs(wt) >= weight_threshold]
                     
        if display_weights:
            print('Topic #'+str(index+1)+' with weights')
            # print every term when num_terms is None
            print(topic[:num_terms] if num_terms else topic)
        else:
            print('Topic #'+str(index+1)+' without weights')
            tw = [term for term, wt in topic]
            print(tw[:num_terms] if num_terms else tw)
        print()
        

# extracts topics with their terms and weights
# format is Topic N: [(term1, weight1), ..., (termn, weightn)]        
def get_topics_terms_weights(weights, feature_names):
    feature_names = np.array(feature_names)
    sorted_indices = np.array([list(row[::-1]) 
                           for row 
                           in np.argsort(np.abs(weights))])
    sorted_weights = np.array([list(wt[index]) 
                               for wt, index 
                               in zip(weights,sorted_indices)])
    sorted_terms = np.array([list(feature_names[row]) 
                             for row 
                             in sorted_indices])
    
    topics = [np.vstack((terms.T, 
                     term_weights.T)).T 
              for terms, term_weights 
              in zip(sorted_terms, sorted_weights)]     
    
    return topics         

In [59]:
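# build an NMF topic model on the speech tf-idf features
# (the pos_ prefix is a leftover name; nothing here is sentiment-specific)
# note: scikit-learn 1.2 removed alpha in favour of alpha_W and alpha_H
pos_nmf = NMF(n_components=total_topics, 
          random_state=42, alpha=0.1, l1_ratio=0.2)
pos_nmf.fit(ptvf_features)      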
# extract features and component weights
pos_feature_names = ptvf.get_feature_names()
pos_weights = pos_nmf.components_
# extract and display topics and their components
pos_topics = get_topics_terms_weights(pos_weights, pos_feature_names)
print_topics_udf(topics=pos_topics,
                 total_topics=total_topics,
                 num_terms=30,
                 display_weights=False)

Topic #1 without weights
['part', 'excise', 'cent', 'extend', 'farmer', 'provision', 'improve', 'base', 'basic', 'plan', 'section', 'term', 'person', 'available', 'certain', 'exist', 'institution', 'taxpayer', 'relate', 'specify', 'enable', 'encourage', 'nil', 'yojana', 'process', 'transaction', 'gain', 'like', 'purpose', 'gst']

Topic #2 without weights
['percent', 'rank', 'promise', 'must', 'aadhaar', 'regular', 'recall', 'ago', 'goal', 'young', 'million', 'ten', 'plan', 'whole', 'hence', 'honble', 'derivative', 'branch', 'responsibility', 'states', 'serve', 'crisis', 'minority', 'inflation', 'announcement', 'accept', 'gap', 'restructure', 'trend', 'tonne']

Topic #3 without weights
['connection', 'kisan', 'vision', 'movement', 'healthy', 'promise', 'monthly', 'coverage', 'electricity', 'transform', 'pradhan', 'mantri', 'transparent', 'digital', 'dignity', 'artificial', 'middle', 'approximately', 'rent', 'thank', 'hard', 'aayog', 'pensioner', 'community', 'break', 'de', 'class', 'fam

In [60]:
pyLDAvis.sklearn.prepare(pos_nmf, ptvf_features, ptvf, R=15)