## Part 1: Clustering

For this part I used the inaugural addresses of US Presidents, from George Washington to Barack Obama. I was interested in exploring the patterns and underlying groupings that can be discovered by clustering the documents.

#### Broadly, I carried out the following steps:
   * Downloaded the corpus from NLTK
   * Augmented the data by adding the year of the speech and the political party of each President
   * Ran a TF-IDF vectorizer to vectorize the documents
   * Ran k-means clustering on the 60000+ dimension sparse vectors
   * Used Principal Component Analysis to reduce the sparse vectors to two dimensions to aid visualisation
   * Ran k-means on the dimensionality-reduced data using multiple K values and settled on K=3
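
The last step above does not say how the K values were compared. One possible way, sketched here as an assumption and not as the method used in this notebook, is to score each K with the silhouette coefficient. The toy corpus `toy_docs` is hypothetical and only stands in for the speeches; the code is written for Python 3.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the inaugural speeches.
toy_docs = [
    "the union and the constitution of the states",
    "the constitution protects the union of states",
    "states and the union under the constitution",
    "economy jobs and the recovery of industry",
    "industry and jobs drive the economy",
    "recovery of jobs and the economy",
    "freedom and peace for the world",
    "peace and freedom across the world",
    "the world seeks peace and freedom",
]

# Same pipeline shape as the notebook: TF-IDF, then PCA down to two dimensions.
tfidf = TfidfVectorizer().fit_transform(toy_docs)
points = PCA(n_components=2).fit_transform(tfidf.toarray())

# Fit k-means for several K values and record the silhouette score of each.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
for k in sorted(scores):
    print(k, round(scores[k], 3))
print("best K by silhouette:", best_k)
```

A higher silhouette score means tighter, better separated clusters, so it is only a guide and the final choice of K can still be made by inspecting the plot.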
   
   
#### Observations
* When the k-means clusters on the dimension-reduced data are visualised, there is a definite pattern of clustering by the era of the speech. Speeches from the 19th century to the early 20th century form one prominent cluster, speeches from the early to mid 20th century form a separate cluster, and speeches from the mid 20th century to the present form a third.

* I also tried coloring the points by party affiliation, but did not find any significant grouping.

* One possible explanation is that the theme of inaugural addresses usually stays the same within a period, while the key issues and linguistic style of the speeches differ between periods.

<img src= "cluster.jpg">
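
The era and party observations can also be checked in tabular form. The sketch below (Python 3) assumes a frame shaped like the notebook's `data`, with `year`, `Party` and `clusters` columns; the frame `toy` and its cluster labels are made up for illustration. In the notebook itself `year` is stored as strings, so it would need `astype(int)` first.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame; cluster labels are invented.
toy = pd.DataFrame({
    "year": [1801, 1845, 1889, 1913, 1933, 1949, 1981, 1997, 2009],
    "Party": ["NA", "NA", "Republican", "Democrat", "Democrat",
              "Democrat", "Republican", "Democrat", "Democrat"],
    "clusters": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})

# Year range covered by each cluster: narrow, non-overlapping ranges suggest era-based clusters.
era = toy.groupby("clusters")["year"].agg(["min", "max", "count"])

# Party counts per cluster: rows that mix parties suggest no party-based grouping.
party_table = pd.crosstab(toy["clusters"], toy["Party"])

print(era)
print(party_table)
```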

In [58]:
# Importing Libraries

import numpy as np 
import string
from urllib2 import urlopen
import nltk, re, pprint
from nltk.corpus import wordnet as wn

In [59]:
import nltk
#nltk.download()
from nltk.corpus import gutenberg

In [60]:
import nltk 
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import  PunktSentenceTokenizer
from string import punctuation
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

from nltk.corpus import inaugural
import pandas as pd
from nltk.corpus import state_union



In [61]:

# Exploring the Inaugural address corpus
fields=inaugural.fileids()
print np.unique(fields)

[u'1789-Washington.txt' u'1793-Washington.txt' u'1797-Adams.txt'
 u'1801-Jefferson.txt' u'1805-Jefferson.txt' u'1809-Madison.txt'
 u'1813-Madison.txt' u'1817-Monroe.txt' u'1821-Monroe.txt'
 u'1825-Adams.txt' u'1829-Jackson.txt' u'1833-Jackson.txt'
 u'1837-VanBuren.txt' u'1841-Harrison.txt' u'1845-Polk.txt'
 u'1849-Taylor.txt' u'1853-Pierce.txt' u'1857-Buchanan.txt'
 u'1861-Lincoln.txt' u'1865-Lincoln.txt' u'1869-Grant.txt'
 u'1873-Grant.txt' u'1877-Hayes.txt' u'1881-Garfield.txt'
 u'1885-Cleveland.txt' u'1889-Harrison.txt' u'1893-Cleveland.txt'
 u'1897-McKinley.txt' u'1901-McKinley.txt' u'1905-Roosevelt.txt'
 u'1909-Taft.txt' u'1913-Wilson.txt' u'1917-Wilson.txt' u'1921-Harding.txt'
 u'1925-Coolidge.txt' u'1929-Hoover.txt' u'1933-Roosevelt.txt'
 u'1937-Roosevelt.txt' u'1941-Roosevelt.txt' u'1945-Roosevelt.txt'
 u'1949-Truman.txt' u'1953-Eisenhower.txt' u'1957-Eisenhower.txt'
 u'1961-Kennedy.txt' u'1965-Johnson.txt' u'1969-Nixon.txt'
 u'1973-Nixon.txt' u'1977-Carter.txt' u'1981-Reagan.t

In [62]:
# Data Augmentation

data=pd.DataFrame(columns=("President", "Speech"))


for i in range(len(fields)):
    field=fields[i]
    text=inaugural.raw(field)
    data.loc[i]=[field,text]
a=["NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","Democrat","Democrat","Republican","Republican"
 ,"Republican","Republican","Republican","Republican","Democrat","Republican","Democrat","Republican","Republican"
,"Republican","Republican","Democrat","Democrat","Republican","Republican","Republican","Democrat","Democrat"
,"Democrat","Democrat","Democrat","Republican","Republican","Democrat","Democrat","Republican","Republican"
 ,"Democrat","Republican","Republican","Republican","Democrat","Democrat","Republican","Republican","Democrat"]
data["Party"]=a


In [63]:
# Referred to Brandon Rose's code for stemming; modified the stemming code I used for
# the keyphrase assignment

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [64]:
# Vectorizing the speech data

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1,tokenizer=tokenize_and_stem, ngram_range=(1,2))
x_data=vectorizer.fit_transform(data["Speech"])

In [65]:
# Using PCA to reduce the sparse TF-IDF vectors to 2 principal
# components, to help visualise the clusters
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_x=pca.fit_transform(x_data.toarray())
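
The PCA cell above converts the sparse matrix to a dense array with `toarray()`. As an alternative, and not what this notebook does, `TruncatedSVD` can reduce a sparse TF-IDF matrix directly. It does not centre the data, so the projection is not identical to PCA. The sketch below is written for Python 3 and `toy_docs` is a hypothetical corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the speeches.
toy_docs = [
    "the union and the constitution of the states",
    "the constitution protects the union",
    "jobs and the recovery of the economy",
    "the economy needs jobs and industry",
]

# fit_transform returns a scipy sparse matrix; it is never densified here.
sparse_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Reduce to two components while keeping the input sparse.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(sparse_tfidf)

print(reduced.shape)
print(svd.explained_variance_ratio_)
```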

In [66]:
# Running k-means on the original TF-IDF vectors - I am not using this further

from sklearn.cluster import KMeans

num_clusters = 2

km = KMeans(n_clusters=num_clusters)

%time km.fit(x_data)

clusters = km.labels_.tolist()

CPU times: user 2 s, sys: 11.9 ms, total: 2.01 s
Wall time: 2.01 s


In [67]:
# Running k-means on the dimension-reduced speech data

from sklearn.cluster import KMeans

num_clusters = 4

km = KMeans(n_clusters=num_clusters)

%time km.fit(reduced_x)

clusters = km.labels_.tolist()

CPU times: user 19.7 ms, sys: 2.17 ms, total: 21.8 ms
Wall time: 20.3 ms


In [68]:
# Augmenting the data frame with year of speech
import re
data["clusters"]=clusters
year=[]
for i in data["President"]:
    c= re.findall(r'\d+', i)
    year.append(c[0])
data["year"]=year

In [69]:
# Augmenting the data with a numeric code for the political party affiliation
# (0 = NA, 1 = Democrat, 2 = any other value, which is Republican here)

party_codes={"NA":0, "Democrat":1}
par_cat=[party_codes.get(i, 2) for i in data["Party"]]
    

In [70]:

data["party_cat"]=par_cat
data.head()

| | President | Speech | Party | clusters | year | party_cat |
|---|---|---|---|---|---|---|
| 0 | 1789-Washington.txt | Fellow-Citizens of the Senate and of the House... | | 2 | 1789 | 0 |
| 1 | 1793-Washington.txt | Fellow citizens, I am again called upon by the... | | 3 | 1793 | 0 |
| 2 | 1797-Adams.txt | When it was first perceived, in early times, t... | | 2 | 1797 | 0 |
| 3 | 1801-Jefferson.txt | Friends and Fellow Citizens:\n\nCalled upon to... | | 2 | 1801 | 0 |
| 4 | 1805-Jefferson.txt | Proceeding, fellow citizens, to that qualifica... | | 2 | 1805 | 0 |


In [90]:
# Running visualisation

import matplotlib as mpl
import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()


fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'))
N = 100

y=[]
x=[]
for i in reduced_x:
    y.append(i[1])
    x.append(i[0])

colors=np.asarray(data["clusters"])
scatter = ax.scatter(x,
                     y,
                     c=colors,
                     alpha=0.3,
                     s=50,
                     cmap=plt.cm.jet)
ax.grid(color='white', linestyle='solid')

ax.set_title("Scatter Plot (with tooltips!)", size=20)

labels = np.ndarray.tolist(np.asarray(data["year"]))
tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
mpld3.plugins.connect(fig, tooltip)


mpld3.display()


In [98]:
# Running visualisation: coloring the points by year shows that
# the points from left to right are temporally sequential

colors=np.asarray(data["year"])
scatter = ax.scatter(x,
                     y,
                     c=colors,
                     alpha=0.3,
                     s=50,
                     cmap=plt.cm.jet)
ax.grid(color='white', linestyle='solid')

ax.set_title("Scatter Plot (with tooltips!)", size=20)

labels = np.ndarray.tolist(np.asarray(data["year"]))
tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
mpld3.plugins.connect(fig, tooltip)


mpld3.display()



## Part 2: Word2vec

* In this part I used the pretrained word2vec model trained by Google. For a given noun, adjective or verb, I compared its word2vec similarity (the words embedded nearest to it) with its WordNet synsets (cognitive similarity).
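
"Embedded near" here means a high cosine similarity between word vectors, which is what gensim's `most_similar` ranks by. The sketch below (Python 3) shows the idea in plain Python. The three-dimensional vectors in `toy_vectors` are invented for illustration, whereas the Google vectors have 300 dimensions, and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy embeddings; real word2vec vectors are learned from text.
toy_vectors = {
    "president": [0.9, 0.8, 0.1],
    "chairman": [0.8, 0.9, 0.2],
    "senate": [0.2, 0.9, 0.7],
    "durable": [0.1, 0.0, 0.9],
}

neighbours = most_similar("president", toy_vectors, topn=2)
print(neighbours)
```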

The observations are given below 

In [78]:
import gensim
from gensim.models import Word2Vec
from nltk.data import find
from nltk.corpus import reuters


In [198]:
#word2vec_sample = str(find('models/word2vec_sample.word2vec.txt'))
#text = word_tokenize(reuters.words())
#model = model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)  


In [199]:
#model.save("google_embed")


In [79]:
#Loading up the trained model 
new_model = gensim.models.Word2Vec.load('google_embed')

In [80]:
# Tokenizing and POS tagging of words
text = word_tokenize(inaugural.raw())

In [81]:
tagged_words=nltk.pos_tag(text)

In [82]:
tagged_words[:10]

[(u'Fellow-Citizens', 'NNS'),
 (u'of', 'IN'),
 (u'the', 'DT'),
 (u'Senate', 'NNP'),
 (u'and', 'CC'),
 (u'of', 'IN'),
 (u'the', 'DT'),
 (u'House', 'NNP'),
 (u'of', 'IN'),
 (u'Representatives', 'NNPS')]

In [84]:
#Extracting and defining nouns , verbs and adjectives of interest

Nouns=[]
for i in tagged_words:
    if i[1]=="NNP" and i[0] not in Nouns:
        Nouns.append(i[0])

Adj=[]
for i in tagged_words:
    if i[1]=="JJ" and i[0] not in Adj:
        Adj.append(i[0])

Verb=[]
for i in tagged_words:
    if i[1] in ["VBD", "VB", "VBG"]  and i[0] not in Verb:
        Verb.append(i[0])
        

In [85]:
Nouns_interest= ["senate", "president", "governmnet", "magistrate", "constitution"]
Adj_interest=["immutable", "apprehensive", "durable", "universal", "soverign" ]
Verb_interest =["endure", "restrain", "regulate", "inheriting", "judge" ]

In [97]:

for i in Nouns_interest:
    print "Noun of interest  "
    print i
    print "   "
    print " Word2vec similarity"
    n_vec=new_model.most_similar(positive=i, topn = 5)
    print n_vec
    print "Wordnet synset"
    n_syn=wn.synsets(i)
    print  n_syn
    print " "
    



#model.most_similar(positive=['Americans'], topn = 10)

Noun of interest  
senate
   
 Word2vec similarity
[(u'Senate', 0.7548928260803223), (u'legislature', 0.637714147567749), (u'Senates', 0.6302423477172852), (u'senates', 0.6061011552810669), (u'senators', 0.5849824547767639)]
Wordnet synset
[Synset('senate.n.01'), Synset('united_states_senate.n.01')]
 
Noun of interest  
president
   
 Word2vec similarity
[(u'President', 0.8006276488304138), (u'chairman', 0.6708744764328003), (u'vice_president', 0.6700226068496704), (u'chief_executive', 0.6691274642944336), (u'CEO', 0.6590126752853394)]
Wordnet synset
[Synset('president.n.01'), Synset('president_of_the_united_states.n.01'), Synset('president.n.03'), Synset('president.n.04'), Synset('president.n.05'), Synset('president_of_the_united_states.n.02')]
 
Noun of interest  
governmnet
   
 Word2vec similarity
[(u'governmet', 0.6096758246421814), (u'goverment', 0.6085272431373596), (u'governemnt', 0.6030837297439575), (u'government', 0.5923144817352295), (u'governement', 0.5695896148681641)]
Wo

In [87]:

for i in Adj_interest:
    print "Adjective of interest  "
    print i
    print "   "
    print " Word2vec similarity"
    n_vec=new_model.most_similar(positive=i, topn = 5)
    print n_vec
    print "Wordnet synset"
    n_syn=wn.synsets(i)[:5]
    print  n_syn
    print " "

Adjective of interest  
immutable
   
 Word2vec similarity
[(u'unalterable', 0.6462440490722656), (u'unchangeable', 0.6407211422920227), (u'unchanging', 0.6349762082099915), (u'inalterable', 0.5698657035827637), (u'changeless', 0.5557388663291931)]
Wordnet synset
[Synset('immutable.a.01')]
 
Adjective of interest  
apprehensive
   
 Word2vec similarity
[(u'worried', 0.6918711066246033), (u'leery', 0.6721755862236023), (u'hesitant', 0.6567758321762085), (u'concerned', 0.6544361114501953), (u'nervous', 0.6543329358100891)]
Wordnet synset
[Synset('apprehensive.s.01'), Synset('apprehensive.s.02'), Synset('apprehensive.s.03')]
 
Adjective of interest  
durable
   
 Word2vec similarity
[(u'Durable', 0.6746456027030945), (u'sturdy', 0.5750996470451355), (u'Casey_Janssen_Curtis_Granderson', 0.5669221878051758), (u'durability', 0.5513916611671448), (u'UV_resistant', 0.5325372219085693)]
Wordnet synset
[Synset('durable.s.01'), Synset('durable.s.02'), Synset('durable.s.03')]
 
Adjective of intere

In [88]:
for i in Verb_interest:
    print "Verb of interest  "
    print i
    print "   "
    print " Word2vec similarity"
    n_vec=new_model.most_similar(positive=i, topn = 5)
    print n_vec
    print "Wordnet synset"
    n_syn=wn.synsets(i)[:5]
    print  n_syn
    print " "

Verb of interest  
endure
   
 Word2vec similarity
[(u'endured', 0.7731415033340454), (u'endures', 0.5982805490493774), (u'subjected', 0.5974172353744507), (u'enduring', 0.5964847207069397), (u'suffer', 0.5808500051498413)]
Wordnet synset
[Synset('digest.v.03'), Synset('weather.v.01'), Synset('survive.v.01'), Synset('suffer.v.01'), Synset('wear.v.06')]
 
Verb of interest  
restrain
   
 Word2vec similarity
[(u'restraining', 0.7259188890457153), (u'restrained', 0.607153058052063), (u'subdue', 0.6048118472099304), (u'rein', 0.5926220417022705), (u'restrains', 0.5815330147743225)]
Wordnet synset
[Synset('restrain.v.01'), Synset('restrict.v.03'), Synset('restrain.v.03'), Synset('restrain.v.04'), Synset('intimidate.v.02')]
 
Verb of interest  
regulate
   
 Word2vec similarity
[(u'regulating', 0.8049445152282715), (u'regulates', 0.7252904772758484), (u'regulated', 0.6486166715621948), (u'restrict', 0.6361327171325684), (u'Regulate', 0.5880163908004761)]
Wordnet synset
[Synset('regulate.v.01

#### Observations
* The comparison shows that word2vec returns words of a similar class to the query word, e.g. President: CEO, chairman, which reflect the notion of leadership even though these nouns are not synonyms. Synsets instead capture the cognitive synonyms of a word, e.g. President: president, president_of_the_united_states, all reinforcing the single notion of presidency.

* In certain cases the words returned by word2vec and WordNet are more or less similar, e.g. the results for the gerund 'inheriting' are consistent across word2vec and the synsets.
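
One rough way to put a number on this comparison, offered as a sketch and not as part of the original analysis, is the Jaccard overlap between the word2vec neighbours and the head words of the synset names. The lists below are copied from the 'president' output above. Taking the text before the first dot of a synset name is a simplification, and a fuller comparison would use each synset's `lemma_names()`. The code is written for Python 3.

```python
# Word2vec neighbours of 'president', copied from the output above.
word2vec_neighbours = ["President", "chairman", "vice_president",
                       "chief_executive", "CEO"]

# WordNet synset names for 'president', copied from the output above.
synset_names = [
    "president.n.01",
    "president_of_the_united_states.n.01",
    "president.n.03",
    "president.n.04",
    "president.n.05",
    "president_of_the_united_states.n.02",
]

# Lowercase the neighbours and keep only the head word of each synset name.
neighbour_set = {word.lower() for word in word2vec_neighbours}
synset_heads = {name.split(".")[0] for name in synset_names}

# Jaccard overlap: size of the intersection divided by size of the union.
shared = neighbour_set & synset_heads
score = len(shared) / len(neighbour_set | synset_heads)

print(shared, round(score, 3))
```

A low overlap is consistent with the first observation: word2vec returns related words, while the synsets stay close to the word itself.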


