## Overview;
I make a count matrix from 226 SOU addresses by 41 presidents in the United States. (build a dictionary of words and count the number of occurences of words in the speech after cleaning). Then I apply the Top function to the matrix and get some interesting results.

In [2]:
import sys
sys.path.insert(0,'../code/')
import Top

In [57]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import pickle
import pyspark
import nltk
nltk.download('punkt')
import itertools
import string
import os
import re
import math
from scipy.spatial import distance_matrix
#import Top

[nltk_data] Downloading package punkt to /Users/ontheroad/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [64]:
f = open("../data/speeches.pkl",'rb')
speeches = pickle.load(f)
speeches = pd.DataFrame(speeches,columns = ['president','words','year'])

## Compute Word Counts for Each SOU Address

In [65]:
## clean the data
def clean_and_split(s):
             # encode to UTF-8, convert to lowercase and translate all hyphens and
             # punctuation to whitespace
             s = s.lower().replace('-',' ').translate(string.punctuation)
             # replace \r\n
             s = re.sub('(\r\n)+',' ',s)
             # replace whitespace substrings with one whitespace and remove
             # leading/trailing whitespaces
             s = re.sub(' +',' ',s.strip())
             return s.split(' ')


In [66]:
speeches['cleaned'] = speeches['words'].apply(clean_and_split)

In [67]:
## decide vocabulary 
voc_pool = list(itertools.chain.from_iterable(speeches['cleaned']))
voc_counts = pd.value_counts(voc_pool)
voc_counts = pd.DataFrame(voc_counts)
voc_counts = voc_counts.loc[voc_counts.loc[:,0] > 50,:]
voc_counts = voc_counts.iloc[20:,:]
voc = voc_counts.index

In [68]:
## begin to compute "TF-IDF"
from collections import Counter
word_counter = [Counter(x) for x in speeches['cleaned']]
## speech["TF-IDF"] here is now only the n_i(d)
speeches["TF-IDF"] = [pd.DataFrame([y[x] for x in voc],index = voc)\
                      for y in word_counter] 




## Construct Count Matrix X 

In [79]:
## X is of shape [p,n], where p is the length of the vocabulary, n is the number of documents

p = len(voc)
N = p
n = len(speeches["TF-IDF"])
X = np.concatenate(speeches["TF-IDF"], axis = 1)
X.shape

(3279, 226)

In [45]:
## Take out rows with zero rowsums

In [81]:
p0 = np.where(X.sum(axis = 1) == 0)
p0
## "no need to take out any word from vocabulary"

(array([], dtype=int64),)

## Run Experiments

In [73]:
import time
start = time.time()
test = Top.Top(X)
print("run finished after: ", str(time.time() - start))

run finished after:  8.34918999671936


In [82]:
test["K"]

44

In [91]:
speeches['president'].unique().shape[0]

41

### Comment:
The 226 documents are from 41 presidents, and the algorithm tells us there are 44 groups!

In [83]:
anchor_group = []
for group in test["Anchor groups"]:
    anchor_group.append(voc[group])
    

In [84]:
anchor_group

[Index(['.+.+'], dtype='object'),
 Index(['scheme', 'tariff', 'manufacturers', 'imported'], dtype='object'),
 Index(['notes', 'gold'], dtype='object'),
 Index(['posts'], dtype='object'),
 Index(['banks', 'banks,', 'banking'], dtype='object'),
 Index(['soviet'], dtype='object'),
 Index(['hundred', 'seven', 'thousand', 'fleet', 'fifty', 'play', 'twelve'], dtype='object'),
 Index(['can't'], dtype='object'),
 Index(['challenge', 'parents'], dtype='object'),
 Index(['texas', 'annexation'], dtype='object'),
 Index(['inflation', 'foundation', 'path', 'percent.', 'anti'], dtype='object'),
 Index(['.+.+'], dtype='object'),
 Index(['democracy'], dtype='object'),
 Index(['specie', 'paper'], dtype='object'),
 Index(['atomic'], dtype='object'),
 Index(['gentlemen', 'majesty', 'engagements'], dtype='object'),
 Index(['acquisition'], dtype='object'),
 Index(['reserves', 'forest'], dtype='object'),
 Index(['railway', 'menace'], dtype='object'),
 Index(['german', 'germany'], dtype='object'),
 Index(['

### Comment:
Some of the anchor words indeed make sense...