
# Exploring Traditional Statistical Models

Feature engineering matters even more for unstructured, textual data, because free-flowing text must first be converted into numeric representations that machine learning algorithms can work with.

Here I will explore the following feature engineering techniques:

- Bag of Words Model (TF)
- Bag of N-grams Model
- TF-IDF Model
- Similarity Features

In [0]:
import pickle
DATA_PATH = "/content/drive/My Drive/Capstone Project - NLP/Harsh/Project_Checkpoints/"
# DATA_PATH already ends with a slash; the with-block closes the file after loading
with open(DATA_PATH + 'speech_cleaned_checkpoint.pkl', 'rb') as infile:
    df = pickle.load(infile)

## Bag of Words Model - TF

This is perhaps the simplest vector space model for unstructured text. A vector space model is a mathematical model that represents unstructured text (or any other data) as numeric vectors, such that each dimension of the vector is a specific feature/attribute. The bag of words model represents each text document as a numeric vector where each dimension is a specific word from the corpus and the value can be its frequency in the document, its occurrence (denoted by 1 or 0), or a weighted value. The model gets its name because each document is represented literally as a ‘bag’ of its own words, disregarding word order, sequence and grammar.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(df.Speech_Cleaned)
cv_matrix = cv_matrix.toarray()
cv_matrix

array([[0, 0, 9, ..., 2, 0, 0],
       [0, 0, 0, ..., 3, 0, 0],
       [0, 0, 0, ..., 0, 1, 0],
       ...,
       [0, 0, 3, ..., 2, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 0, 7, ..., 1, 0, 0]])

Our speeches have now been converted into numeric vectors, such that each document is represented by one vector (row) in the feature matrix above. The following code presents this in an easier-to-understand format.

In [6]:
import pandas as pd
# get all unique words in the corpus
vocab = cv.get_feature_names_out()
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)

Unnamed: 0,aa,aaby,aadhaar,aadhar,aadmi,aai,aajeevika,aakansha,aam,aamayaah,aapka,aapke,aar,aasha,aayakar,aaykar,aayog,ab,abatement,abettor,abeyance,abhiyan,abide,ability,able,abled,abolish,abolished,abolition,abroad,abrupt,absence,absolute,absolutely,absorb,absorbent,absorptive,abundance,abundant,abuse,...,xii,xiii,xiv,xix,xv,xvi,xvii,xviii,xx,xylene,yacht,yannai,yards,yarn,year,yeh,yen,yeoman,yeomen,yesterday,yet,yield,yoga,yogi,yojana,yojanamaking,yojna,young,youth,youthful,yuva,zarda,zari,zeolite,zero,zinc,zirconia,zone,zoo,zozila
0,0,0,9,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,3,0,1,3,0,0,3,0,0,0,0,4,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,82,0,0,0,0,1,0,3,0,0,4,0,0,0,2,0,0,1,0,0,0,0,0,2,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,1,3,0,0,0,0,3,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1,0,0,1,1,76,0,0,0,0,0,0,2,0,0,13,0,0,4,7,0,0,0,0,0,0,0,0,3,0,0
2,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,7,1,0,1,1,2,6,0,2,0,2,3,0,0,1,1,1,0,0,0,0,1,...,1,1,0,0,0,0,0,0,0,0,0,0,0,0,67,1,0,0,0,0,2,3,4,0,12,1,2,6,11,0,0,0,0,1,0,0,1,0,1,0
3,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,65,0,0,0,0,0,2,0,0,2,15,0,1,0,5,1,0,0,0,0,0,0,0,0,0,0
4,2,0,21,1,0,0,0,1,0,0,0,0,0,1,0,0,1,1,0,0,0,3,0,2,3,0,1,0,0,2,0,0,0,0,1,0,0,0,0,2,...,0,0,0,1,0,0,0,0,0,0,0,1,0,3,84,0,0,0,0,0,1,0,3,2,12,0,0,1,4,0,0,0,1,0,1,0,0,1,0,0
5,2,0,0,7,0,0,0,0,0,0,0,0,3,0,0,0,1,1,0,0,0,1,0,0,5,0,1,0,2,2,0,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,1,0,0,0,1,91,0,0,0,0,0,1,0,0,0,13,0,3,0,11,0,0,0,0,0,4,0,0,1,0,0
6,3,0,0,2,0,2,1,0,1,0,0,0,0,0,0,0,2,1,0,0,0,1,0,1,2,1,1,0,3,0,0,0,0,0,0,0,0,1,0,0,...,2,1,2,0,0,0,0,0,0,0,0,0,0,1,84,0,0,0,0,0,0,0,0,0,22,0,3,1,3,0,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,2,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,5,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,93,0,1,1,0,1,2,0,0,0,5,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0
8,6,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,1,2,15,0,0,7,1,0,2,0,2,0,1,0,1,0,0,0,0,1,0,0,0,1,...,2,1,1,1,1,1,1,1,1,0,0,0,0,1,110,0,0,0,0,0,1,4,0,0,15,0,10,0,6,0,1,0,0,0,2,1,0,4,0,0
9,0,0,3,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,4,0,0,2,0,1,5,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,1,99,0,0,0,0,0,1,2,0,0,3,0,3,3,2,0,0,0,0,0,1,0,0,2,0,0


We can clearly see that each column (dimension) in the feature vectors represents a word from the corpus and each row represents one of our speeches. The value in any cell is the number of times that word (the column) occurs in that document (the row). Hence, if a corpus consists of N unique words across all its documents, each document is represented by an N-dimensional vector.



## Bag of N-Grams Model

A word is just a single token, often known as a unigram or 1-gram. We already know that the Bag of Words model doesn’t consider the order of words. But what if we also want to take into account phrases, that is, collections of words which occur in a sequence? N-grams help us achieve that. An N-gram is a collection of word tokens from a text document such that these tokens are contiguous and occur in a sequence. Bi-grams are n-grams of order 2 (two words), tri-grams are n-grams of order 3 (three words), and so on. The Bag of N-Grams model is hence just an extension of the Bag of Words model that lets us leverage N-gram based features. The following example depicts bi-gram based features in each document feature vector.

In [7]:
# use ngram_range=(1, 2) to get unigrams as well as bigrams
bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(df.Speech_Cleaned)

bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)

Unnamed: 0,aa ab,aa customs,aa grade,aa income,aa new,aa rate,aa rating,aa refer,aaby jsby,aadhaar authentication,aadhaar bank,aadhaar bottom,aadhaar card,aadhaar crore,aadhaar due,aadhaar enable,aadhaar enrolment,aadhaar ensure,aadhaar holder,aadhaar improve,aadhaar interchangeable,aadhaar mobile,aadhaar mr,aadhaar near,aadhaar no,aadhaar number,aadhaar obtain,aadhaar order,aadhaar pan,aadhaar pay,aadhaar place,aadhaar platform,aadhaar prescribe,aadhaar propose,aadhaar realise,aadhaar shall,aadhaar system,aadhaar therefore,aadhaar tool,aadhaar use,...,youth turn,youth vulnerable,youth want,youth woman,youth word,youthful nation,yuva sashakthikaran,zarda scented,zari fish,zeolite ceria,zero budget,zero custom,zero duty,zero excise,zero income,zero investor,zero liquid,zero part,zero people,zero rate,zero rebate,zero tax,zinc alloy,zinc lead,zinc rich,zirconia compound,zone anchor,zone avail,zone begin,zone get,zone government,zone intensify,zone nimz,zone nimzs,zone propose,zone revive,zone sezs,zone well,zoo nationalpark,zozila pass
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,4,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,1,0,0,0,1,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0,2,1,1,2,1,2,0,1,1,0,1,0,1,0,1,...,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
5,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
8,1,0,0,4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,2,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0


This gives us feature vectors for our documents, where each feature is a bi-gram (a sequence of two words) and each value is the number of times that bi-gram occurs in the corresponding document.

## TF-IDF Model

There are some potential problems which might arise with the Bag of Words model when it is used on large corpora like ours. Since the feature vectors are based on absolute term frequencies, there might be some terms which occur frequently across all documents and these may tend to overshadow other terms in the feature set. The TF-IDF model tries to combat this issue by using a scaling or normalizing factor in its computation. TF-IDF stands for Term Frequency-Inverse Document Frequency, which uses a combination of two metrics in its computation, namely: term frequency (tf) and inverse document frequency (idf). This technique was developed for ranking results for queries in search engines and now it is an indispensable model in the world of information retrieval and NLP.

Mathematically, we can define TF-IDF as tfidf(w, D) = tf(w, D) x idf(w, D), where:

- tf(w, D) is the term frequency of the word w in document D, which can be obtained from the Bag of Words model.
- idf(w, D) is the inverse document frequency of the word w, computed as the log of the total number of documents in the corpus C divided by the document frequency of w, that is, the number of documents in the corpus in which w occurs: idf(w, D) = log(C / df(w)).

There are multiple variants of this model, but they all end up giving quite similar results. Let’s apply this on our corpus now!
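As a quick sanity check of the formula, the sketch below recomputes TF-IDF by hand on a hypothetical toy corpus and compares it with scikit-learn. Note that `TfidfVectorizer` with default settings uses one of the variants mentioned above: a smoothed idf, ln((1 + C) / (1 + df(w))) + 1, followed by L2 normalization of each document vector. The toy sentences and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the speeches
toy_docs = ["tax rate cut", "tax tax reform", "rural road scheme"]

# tf(w, D): raw term counts from the Bag of Words model
tf = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)

n_docs = tf.shape[0]                 # C, the number of documents
df_counts = (tf > 0).sum(axis=0)     # df(w), documents containing each word

# Smoothed idf variant used by scikit-learn's defaults
idf = np.log((1 + n_docs) / (1 + df_counts)) + 1

# tf x idf, then scale every document vector to unit L2 length
tfidf_manual = tf * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Because of the L2 normalization step, every row has unit length, which is why the values in a TF-IDF matrix are much smaller than raw counts.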



In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(df.Speech_Cleaned)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

Unnamed: 0,aa,aaby,aadhaar,aadhar,aadmi,aai,aajeevika,aakansha,aam,aamayaah,aapka,aapke,aar,aasha,aayakar,aaykar,aayog,ab,abatement,abettor,abeyance,abhiyan,abide,ability,able,abled,abolish,abolished,abolition,abroad,abrupt,absence,absolute,absolutely,absorb,absorbent,absorptive,abundance,abundant,abuse,...,xii,xiii,xiv,xix,xv,xvi,xvii,xviii,xx,xylene,yacht,yannai,yards,yarn,year,yeh,yen,yeoman,yeomen,yesterday,yet,yield,yoga,yogi,yojana,yojanamaking,yojna,young,youth,youthful,yuva,zarda,zari,zeolite,zero,zinc,zirconia,zone,zoo,zozila
0,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.17,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.02,0.0,0.03,0.01,0.01,0.02,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0
3,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.06,0.0,0.01,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.01,0.0,0.07,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.03,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
6,0.01,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
7,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.23,0.0,0.01,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.04,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
9,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.23,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0


In [17]:
print('BOW model:> Train features shape:', cv_matrix.shape)
# bv_matrix holds the bigram counts, so it is labelled as such
print('Bag of N-grams (bigram) model:> Train features shape:', bv_matrix.shape)

BOW model:> Train features shape: (12, 7942)
Bag of N-grams (bigram) model:> Train features shape: (12, 66579)


The TF-IDF based feature vectors for each of our text documents show scaled and normalized values as compared to the raw Bag of Words model values.



## Document Similarity

Document similarity uses a distance or similarity metric to measure how similar one text document is to other document(s), based on features extracted from the documents, such as bag of words or TF-IDF.

We can therefore build on the TF-IDF features engineered in the previous section and use them to generate new similarity-based features, which are useful in domains like search engines, document clustering and information retrieval.

Pairwise document similarity involves computing the similarity for each pair of documents in a corpus. Thus, if you have C documents in a corpus, you end up with a C x C matrix in which the cell at row i and column j holds the similarity score for documents i and j. Several similarity and distance metrics are used to compute document similarity, including cosine distance/similarity, Euclidean distance, Manhattan distance, BM25 similarity, Jaccard distance and so on. In our analysis, we will use perhaps the most popular and widely used similarity metric, cosine similarity, and compare pairwise document similarity based on the TF-IDF feature vectors.

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1.0,0.623869,0.586365,0.491889,0.536193,0.57989,0.582843,0.835902,0.562226,0.906467,0.534529,0.470884
1,0.623869,1.0,0.716259,0.62075,0.690696,0.691878,0.701598,0.6938,0.672874,0.681637,0.793175,0.677594
2,0.586365,0.716259,1.0,0.666768,0.779599,0.776191,0.757284,0.64512,0.82128,0.649619,0.651844,0.533075
3,0.491889,0.62075,0.666768,1.0,0.666372,0.675469,0.709283,0.577089,0.608201,0.569946,0.563707,0.549967
4,0.536193,0.690696,0.779599,0.666372,1.0,0.770577,0.768824,0.612162,0.778363,0.602718,0.620732,0.505653
5,0.57989,0.691878,0.776191,0.675469,0.770577,1.0,0.763673,0.65486,0.829513,0.645138,0.639248,0.541403
6,0.582843,0.701598,0.757284,0.709283,0.768824,0.763673,1.0,0.657169,0.746744,0.648004,0.634216,0.549497
7,0.835902,0.6938,0.64512,0.577089,0.612162,0.65486,0.657169,1.0,0.633089,0.853445,0.58871,0.538038
8,0.562226,0.672874,0.82128,0.608201,0.778363,0.829513,0.746744,0.633089,1.0,0.624177,0.62388,0.490618
9,0.906467,0.681637,0.649619,0.569946,0.602718,0.645138,0.648004,0.853445,0.624177,1.0,0.585846,0.532736


Cosine similarity gives us the cosine of the angle between the feature vectors of two text documents. The smaller the angle between the vectors, the closer the score is to 1 and the more similar the documents are.
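The definition can be checked by hand. The sketch below uses three hypothetical vectors (not taken from the speech features) and a small helper, `cosine`, introduced here only for illustration, then compares the result with `cosine_similarity` from scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors for illustration
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a


def cosine(u, v):
    """Cosine of the angle between u and v: dot product over product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_ab = cosine(a, b)   # angle of 0 degrees
sim_ac = cosine(a, c)   # angle of 90 degrees
print(sim_ab, sim_ac)

# Pairwise version: a symmetric 3 x 3 matrix with ones on the diagonal
sk = cosine_similarity(np.vstack([a, b, c]))
print(sk)
```

Vectors `a` and `b` point in the same direction, so their similarity is 1 regardless of length, while the orthogonal pair `a` and `c` scores 0. This length invariance is one reason cosine similarity suits documents of different sizes.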

## Clustering using Document Similarity Features

I will use a very popular partition-based clustering method, K-means clustering, to group these speeches based on their similarity-based feature representations. K-means takes an input parameter k, which specifies the number of clusters to produce from the document features. It is a centroid-based method that tries to separate the speeches into clusters of equal variance by minimizing the within-cluster sum of squares, also known as inertia.
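To illustrate what inertia measures, here is a minimal sketch on four hypothetical 2-D points (made up for illustration, unrelated to the speech features). It recomputes the within-cluster sum of squares by hand and compares it with the `inertia_` attribute of the fitted model. Setting `n_init=10` is an explicit choice made here, since the default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical points forming two well-separated groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Centroid assigned to each point, looked up through its cluster label
assigned = toy_km.cluster_centers_[toy_km.labels_]

# Inertia: sum of squared distances from every point to its centroid
manual_inertia = float(((points - assigned) ** 2).sum())

print(toy_km.labels_)
print(manual_inertia, toy_km.inertia_)
```

Each point sits 0.5 away from its centroid, so the inertia is 4 x 0.25 = 1.0. K-means searches for the assignment of points to k centroids that makes this quantity as small as possible.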



In [15]:
from sklearn.cluster import KMeans

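# n_init=10 is set explicitly because newer scikit-learn releases changed the default
km = KMeans(n_clusters=3, n_init=10, random_state=0)
# fit is enough here: only the labels are used, not the transformed distances
km.fit(similarity_matrix)
cluster_labels = km.labels_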
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([df.Speech_Cleaned, cluster_labels], axis=1)

Unnamed: 0,Speech_Cleaned,ClusterLabel
0,budget speech pranab mukherjee minister financ...,2
1,budget speech arun jaitley minister finance ju...,0
2,content part page no introduction major challe...,1
3,interim budget speech piyush goyal minister fi...,1
4,budget speech nirmala sitharaman minister fina...,1
5,content part page no introduction farmer ii ru...,1
6,budget speech arun jaitley minister finance fe...,1
7,budget speech pranab mukherjee minister financ...,2
8,content part page no introduction agriculture ...,1
9,budget speech pranab mukherjee minister financ...,2
