In [67]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option("display.width", 500)
pd.set_option("display.max_columns", 100)
pd.set_option("display.notebook_repr_html", True)
import seaborn as sns
sns.set_style("darkgrid")
sns.set_context("poster")

##TF-IDF term weighting 
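TF-IDF (term frequency-inverse document frequency) is a method for weighting words and terms by how informative they are. It weights a term up by its frequency within a document, and weights it down by the number of documents the term appears in. The formula is:  $$\text{new_frequency} = tf \times (idf + 1)$$
$tf = \text{number of times term appears in document}/\text{number of words in document}$, $idf = \log(\text{total number of documents}/\text{number of documents in which term appears})$. Note that scikit-learn's `TfidfTransformer`, used below, by default takes $tf$ as the raw count, smooths the document counts inside the logarithm, and rescales each document row to unit Euclidean length.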

See: http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting, http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

The function below takes a bag-of-words matrix (size n_documents\*vocab_size) and returns the corresponding TF-IDF matrix. A small example follows the function definition. 

In [68]:
from sklearn.feature_extraction.text import TfidfTransformer
#this function takes a matrix of size n_documents*vocab_size, and outputs the corresponding tfidf matrix 
def tfidf_mat_creator(wordmatrix):
    tf_idf_transform=TfidfTransformer()
    return(tf_idf_transform.fit_transform(wordmatrix).toarray())

#Example
samplematrix = np.random.randint(0, high=5, size=(3,4))
print("Sample bag-of-words matrix of 3 documents, vocab size 4 words: ")
print(samplematrix)
print("tfidf transformation of above matrix: ")
print(tfidf_mat_creator(samplematrix))

Sample bag-of-words matrix of 3 documents, vocab size 4 words: 
[[2 1 0 0]
 [4 2 1 4]
 [0 0 0 3]]
tfidf transformation of above matrix: 
[[ 0.89442719  0.4472136   0.          0.        ]
 [ 0.65121271  0.32560635  0.21406661  0.65121271]
 [ 0.          0.          0.          1.        ]]
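The sample output above can be reproduced by hand, which makes the weighting explicit. The following is a minimal sketch that assumes `TfidfTransformer` is used with its default settings (raw counts as term frequency, smoothed IDF, L2 row normalization). The function name `manual_tfidf` is our own and is not part of scikit-learn.

```python
import numpy as np

def manual_tfidf(counts):
    """Hand-rolled TF-IDF, assuming scikit-learn's default TfidfTransformer settings."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # document frequency: number of documents in which each term appears
    df = (counts > 0).sum(axis=0)
    # smoothed inverse document frequency, plus 1 so that no term is zeroed out
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    weighted = counts * idf
    # rescale every document (row) to unit Euclidean length
    norms = np.sqrt((weighted ** 2).sum(axis=1, keepdims=True))
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return weighted / norms

# the sample bag-of-words matrix printed above
sample = np.array([[2, 1, 0, 0],
                   [4, 2, 1, 4],
                   [0, 0, 0, 3]])
result = manual_tfidf(sample)
print(np.round(result, 8))
```

The third word appears in only one document, so it receives a larger IDF weight than the other three words, which each appear in two.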


##SVM
We try running some SVMs on nouns. 

First, we must import the noun training and test matrices, and we convert them to tfidf matrices.

In [16]:
#GET TRAINING AND TEST MATRICES FOR NOUNS
from numpy import genfromtxt
##%%time
noun_train_df = pd.read_csv('noun_train_mat.csv',sep=',',header=None)
noun_train_mat = noun_train_df.values
noun_test_df = pd.read_csv('noun_test_mat.csv',sep=',',header=None)
noun_test_mat = noun_test_df.values
train_issue_areas_df = pd.read_csv('train_issue_areas.csv',sep=',',header=None)
train_issue_areas = train_issue_areas_df.values.ravel()
test_issue_areas_df = pd.read_csv('test_issue_areas.csv',sep=',',header=None)
test_issue_areas = test_issue_areas_df.values.ravel()

In [24]:
#convert the matrices to tfidf matrices 
noun_train_tfidf = tfidf_mat_creator(noun_train_mat)
noun_test_tfidf = tfidf_mat_creator(noun_test_mat)
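One caveat about the cell above: because `tfidf_mat_creator` calls `fit_transform`, the test matrix is weighted with document frequencies computed from the test set itself. A common alternative is to learn the IDF weights on the training matrix only and reuse them for the test matrix. A minimal sketch of that variant follows; the helper name `tfidf_train_test` and the toy matrices are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

def tfidf_train_test(train_counts, test_counts):
    """Hypothetical helper: fit IDF weights on the training counts, apply them to both sets."""
    transformer = TfidfTransformer()
    train_tfidf = transformer.fit_transform(train_counts).toarray()
    # transform (not fit_transform) reuses the IDF weights learned from the training data
    test_tfidf = transformer.transform(test_counts).toarray()
    return train_tfidf, test_tfidf

# toy count matrices standing in for noun_train_mat and noun_test_mat
toy_train = np.array([[2, 1, 0, 0],
                      [4, 2, 1, 4],
                      [0, 0, 0, 3]])
toy_test = np.array([[1, 0, 2, 0],
                     [0, 3, 0, 1]])
toy_train_tfidf, toy_test_tfidf = tfidf_train_test(toy_train, toy_test)
print(toy_test_tfidf)
```

Whether this changes the accuracies reported below would have to be checked by rerunning the models.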

###Simple practice SVM
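We train and run a simple SVM. Note that the code below uses an RBF kernel, so this first model is a kernelized rather than a linear SVM. We follow https://www.quantstart.com/articles/Supervised-Learning-for-Document-Classification-with-Scikit-Learn.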

In [25]:
%%time
from sklearn.svm import SVC
def train_svm(X, y):
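    #this creates and trains the SVM (kernel='rbf' makes it a kernelized, non-linear SVM)
    #gamma=0.0 meant "use 1/n_features" in the scikit-learn version used here; newer releases spell this gamma='auto'
    svm = SVC(C=1000000.0, gamma=0.0, kernel='rbf')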
    svm.fit(X, y)
    return svm
svm1 = train_svm(noun_train_tfidf,train_issue_areas)

CPU times: user 37.5 s, sys: 165 ms, total: 37.6 s
Wall time: 37.7 s


In [115]:
#number of support vectors (each row of dual_coef_ has one entry per support vector)
len(svm1.dual_coef_[1])

1044

In [26]:
pred = svm1.predict(noun_test_tfidf)
#accuracy rate on test set 
print(svm1.score(noun_test_tfidf, test_issue_areas))

0.663385826772


In [27]:
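from sklearn.metrics import confusion_matrix
#note: confusion_matrix expects (y_true, y_pred); with the arguments in this order,
#rows of the output are predicted issue areas and columns are true issue areas
print(confusion_matrix(pred, test_issue_areas))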

[[89 13  3  3  1  1  0  2  6  0  0  0  1]
 [ 8 63  6  3  1  1  0 14 12  2  1  2  0]
 [ 0  3 23  0  0  0  0  1  1  0  0  0  0]
 [ 1  0  0  3  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  2  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0 23  0  2  4  0  0  0]
 [ 2  7  0  5  0  1  2 78 10  7  0  6  0]
 [ 4 10  6  1  1  0  1  7 36  1  0  0  0]
 [ 0  1  0  0  0  0  0  2  1  9  2  0  0]
 [ 0  0  0  0  0  0  0  0  0  1  2  0  0]
 [ 0  0  0  0  0  1  0  0  0  0  0  9  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0]]
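The confusion matrix above carries more information than the single accuracy figure. As a rough sketch, the overall accuracy and per-class precision and recall can be read off it with NumPy. The matrix is copied from the output above, and the code assumes rows are predicted classes and columns are true classes, which follows from the argument order `confusion_matrix(pred, test_issue_areas)`. Classes that were never predicted are given precision 0 here by convention.

```python
import numpy as np

# confusion matrix copied from the output above (rows: predicted, columns: true)
cm = np.array([
    [89, 13,  3,  3,  1,  1,  0,  2,  6,  0,  0,  0,  1],
    [ 8, 63,  6,  3,  1,  1,  0, 14, 12,  2,  1,  2,  0],
    [ 0,  3, 23,  0,  0,  0,  0,  1,  1,  0,  0,  0,  0],
    [ 1,  0,  0,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  2,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  0, 23,  0,  2,  4,  0,  0,  0],
    [ 2,  7,  0,  5,  0,  1,  2, 78, 10,  7,  0,  6,  0],
    [ 4, 10,  6,  1,  1,  0,  1,  7, 36,  1,  0,  0,  0],
    [ 0,  1,  0,  0,  0,  0,  0,  2,  1,  9,  2,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  2,  0,  0],
    [ 0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  9,  0],
    [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
])

correct = np.diag(cm).astype(float)
predicted_totals = cm.sum(axis=1).astype(float)  # how often each class was predicted
true_totals = cm.sum(axis=0).astype(float)       # how often each class actually occurs

# overall accuracy: correctly classified documents over all documents
accuracy = np.trace(cm) / float(cm.sum())
# safe division: classes with a zero denominator keep the value 0
precision = np.divide(correct, predicted_totals,
                      out=np.zeros_like(correct), where=predicted_totals > 0)
recall = np.divide(correct, true_totals,
                   out=np.zeros_like(correct), where=true_totals > 0)

print("accuracy: %0.4f" % accuracy)
print(np.round(precision, 2))
print(np.round(recall, 2))
```

The accuracy computed this way agrees with the score printed by `svm1.score` above.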


###Trying multiple SVMs on nouns
We now test a variety of parameters and do cross-validation to optimize the nouns SVM. To do this, we implement functions cv_optimize and do_classify, which are similar to (and largely taken from) the equivalently named functions in problem set 3.

In [69]:
"""
Function
--------
cv_optimize

Inputs
------
clf : an instance of a scikit-learn classifier
parameters: a parameter grid dictionary that is passed to GridSearchCV
X: a document-word matrix (e.g., noun_train_tfidf). Should be training data.
y: the response vector (train_issue_areas)
n_folds: the number of cross-validation folds (default 5)
score_func: a score function we might want to pass (default python None)
   
Returns
-------
Two things: (1) The best estimator from the GridSearchCV, after the GridSearchCV has been used to
fit the model; and (2) the best parameter. 
     
Notes
-----
see do_classify below for an example of how this is used
"""
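#GridSearchCV lives in sklearn.model_selection in newer scikit-learn releases; fall back to the old location otherwise
try:
    from sklearn.model_selection import GridSearchCV
except ImportError:
    from sklearn.grid_search import GridSearchCV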
#note: this code comes directly from lab 6
def cv_optimize(clf, parameters, X, y, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(X, y)
    best_estimator = gs.best_estimator_
    best_param = gs.best_params_
    return (best_estimator,best_param)
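As a rough usage sketch, the same kind of grid search can be tried on a small synthetic problem before running it on the real matrices. Everything here is an illustrative assumption: the data come from `make_classification`, the grid values are arbitrary, and the import uses `sklearn.model_selection`, which is where `GridSearchCV` lives in newer scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# synthetic stand-in for a document-word matrix and its labels (not the project data)
X_toy, y_toy = make_classification(n_samples=120, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

# candidate values of the regularization parameter C
grid = {"C": [0.01, 1.0, 100.0]}
clf = LinearSVC(loss="hinge", dual=True, max_iter=10000, random_state=0)

# 5-fold cross-validation over the grid; the best setting is refit on all of X_toy
gs = GridSearchCV(clf, param_grid=grid, cv=5)
gs.fit(X_toy, y_toy)

best_estimator = gs.best_estimator_
best_param = gs.best_params_
print(best_param)
```

`best_estimator_` has already been refit on the full training data with the winning parameters, so it can be used for prediction directly.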

In [98]:
"""
Function
--------
do_classify

Inputs
------
clf : an instance of a scikit-learn classifier
parameters: a parameter grid dictionary that is passed to GridSearchCV
Xtrain: a training document-word matrix (e.g., noun_train_tfidf)
ytrain: the corresponding training response vector (e.g., train_issue_areas)
Xtest: a test document-word matrix (e.g., noun_test_tfidf)
ytest: the corresponding test response vector (e.g., test_issue_areas)
n_folds: the number of cross-validation folds (default 5)
score_func: a score function we might want to pass (default python None)
   
Returns
-------
4 things, in the following order: (1) an array of predicted y values (ie, topic areas) for the test data; 
                                  (2) the accuracy score; (3) the confusion matrix of the test data predictions;
                                  (4) the best parameter from the gridsearch.

"""

#the import is aliased so that a variable named confusion_matrix elsewhere in the notebook cannot shadow it
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
def do_classify(clf, parameters, Xtrain, ytrain, Xtest, ytest, score_func=None, n_folds=5):
    #best_param stays None when no parameter grid is supplied (avoids a NameError below)
    best_param = None
    if parameters:
        clf,best_param = cv_optimize(clf, parameters, Xtrain, ytrain, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print("############# based on standard predict ################")
    print("Accuracy on training data: %0.2f" % (training_accuracy))
    print("Accuracy on test data:     %0.2f" % (test_accuracy))
    print("Best parameter:  %s" % (best_param,))
    pred = clf.predict(Xtest)
    print("########################################################")
    #rows of the returned confusion matrix are true issue areas, columns are predictions
    return(pred,test_accuracy,sk_confusion_matrix(ytest,pred),best_param)

####Linear SVMs

In [30]:
from sklearn.svm import LinearSVC
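#Try do_classify on 6 candidate values of C
#note: a bare %time line only times an empty statement (hence the microsecond timing below);
#%%time as the first line of the cell would time the whole cell
%time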
parameters1 = {"C": [0.0001, 0.01, 1.0, 100.0, 1000.0, 100000.0]}
predictions1,accuracy1,confusion_matrix1,best_param1=do_classify(LinearSVC(loss='hinge'), parameters1, 
                                                                 noun_train_tfidf,train_issue_areas,
                                                                 noun_test_tfidf,test_issue_areas)

CPU times: user 5 µs, sys: 5 µs, total: 10 µs
Wall time: 12.9 µs
############# based on standard predict ################
Accuracy on training data: 0.96
Accuracy on test data:     0.68
Best parameter:  {'C': 1.0}
[[91  4  1  1  0  0  0  5  2  0  0  0  0]
 [15 64  2  0  0  0  0  5 11  0  0  0  0]
 [ 6  4 21  0  0  0  1  2  4  0  0  0  0]
 [ 4  3  0  2  0  0  1  3  2  0  0  0  0]
 [ 1  0  0  0  0  0  0  1  1  0  0  0  0]
 [ 1  0  0  0  0  2  0  1  1  0  0  1  0]
 [ 0  0  0  0  0  0 23  2  1  0  0  0  0]
 [ 6  4  3  0  0  0  1 88  2  0  0  0  0]
 [ 8  5  1  0  0  0  3 14 37  0  0  0  0]
 [ 0  2  0  0  0  0  6  9  1  6  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  1  4  0  0]
 [ 0  1  0  0  0  0  0  7  0  0  0  9  0]
 [ 1  0  0  0  0  0  0  0  0  0  0  0  0]]
########################################################




In [31]:
#Rerun do_classify based on the best parameter above
paramvalue = best_param1["C"]
parameters2 = {"C": [paramvalue/10.,paramvalue,paramvalue*10]}
predictions2,accuracy2,confusion_matrix2,best_param2=do_classify(LinearSVC(loss='hinge'), parameters2, 
                                                             noun_train_tfidf,train_issue_areas,
                                                             noun_test_tfidf,test_issue_areas)

############# based on standard predict ################
Accuracy on training data: 0.96
Accuracy on test data:     0.68
Best parameter:  {'C': 1.0}
[[91  4  1  1  0  0  0  5  2  0  0  0  0]
 [15 64  2  0  0  0  0  5 11  0  0  0  0]
 [ 6  4 21  0  0  0  1  2  4  0  0  0  0]
 [ 4  3  0  2  0  0  1  3  2  0  0  0  0]
 [ 1  0  0  0  0  0  0  1  1  0  0  0  0]
 [ 1  0  0  0  0  2  0  1  1  0  0  1  0]
 [ 0  0  0  0  0  0 23  2  1  0  0  0  0]
 [ 6  4  3  0  0  0  1 88  2  0  0  0  0]
 [ 8  5  1  0  0  0  3 14 37  0  0  0  0]
 [ 0  2  0  0  0  0  6  9  1  6  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  1  4  0  0]
 [ 0  1  0  0  0  0  0  7  0  0  0  9  0]
 [ 1  0  0  0  0  0  0  0  0  0  0  0  0]]
########################################################
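The coarse-to-fine search above (a wide grid first, then a narrower grid around the winner) can be wrapped in a small helper. This is only a sketch; the name `refine_grid` and its arguments are our own, and the grid it builds is log-spaced around the best value found so far.

```python
import numpy as np

def refine_grid(best_value, num=3, decades=1.0):
    """Hypothetical helper: log-spaced candidates within `decades` orders of magnitude of best_value."""
    center = np.log10(best_value)
    return np.logspace(center - decades, center + decades, num=num)

# with the defaults this gives the same three candidates as parameters2 above
coarse_to_fine = refine_grid(1.0)
print(coarse_to_fine)

# a denser grid over the same range
print(refine_grid(1.0, num=5))
```

The result can be passed as the list of `C` values in the parameter grid, for example `{"C": list(refine_grid(paramvalue))}`.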


####Non-linear SVCs

The next cell optimizes a kernelized SVM across an array of parameters. Because it takes a long time to run, we only run it once. (We've included an image of the initial output below.) 

In [32]:
from sklearn.svm import SVC
##%time
#parameters grid
##parameters3 = {"C": [1e-5,1e-3,1,1e3,1e5],"gamma":[0,1e-3,1e-5,1e-7]}
#test non-linear (kernelized) SVMs
##predictions3,accuracy3,confusion_matrix3,best_param3=do_classify(SVC(), parameters3, 
##                                                             noun_train_tfidf,train_issue_areas,
##                                                             noun_test_tfidf,test_issue_areas)

<img src="kernelized_svm_output.png" width=600 height=400/>

We further optimize around the best parameters from the above cell. The result shows that the best parameters were in fact the ones already found there ($C=10^5$, gamma=$10^{-5}$). 

In [35]:
%%time
#parameters grid
parameters4 = {"C": [1e4,1e5,1e6,1e7],"gamma":[1e-5]}
#test non-linear (kernelized) SVMs
predictions4,accuracy4,confusion_matrix4,best_param4=do_classify(SVC(), parameters4,
                                                                 noun_train_tfidf,train_issue_areas,
                                                                 noun_test_tfidf,test_issue_areas)

############# based on standard predict ################
Accuracy on training data: 1.00
Accuracy on test data:     0.68
Best parameter:  {'C': 100000.0, 'gamma': 1e-05}
[[87  6  1  2  0  0  0  4  4  0  0  0  0]
 [14 63  2  1  0  0  0  6 10  1  0  0  0]
 [ 4  8 21  0  0  0  0  1  4  0  0  0  0]
 [ 3  3  0  3  0  0  0  5  1  0  0  0  0]
 [ 1  1  0  0  0  0  0  0  1  0  0  0  0]
 [ 2  0  0  0  0  2  0  1  0  0  0  1  0]
 [ 0  1  0  0  0  0 22  2  1  0  0  0  0]
 [ 3  4  1  0  0  0  1 87  6  2  0  0  0]
 [ 6  7  1  0  0  0  2 13 39  0  0  0  0]
 [ 0  2  0  0  0  0  6  7  0  9  0  0  0]
 [ 0  0  0  0  0  0  0  1  0  2  2  0  0]
 [ 0  0  0  0  0  0  0  8  0  0  0  9  0]
 [ 1  0  0  0  0  0  0  0  0  0  0  0  0]]
########################################################
CPU times: user 11min 8s, sys: 2.84 s, total: 11min 10s
Wall time: 11min 40s


###SVMs on other types of words
So far, we have only tried running SVMs on nouns. We got equally good output (accuracy score = .68) for both the linear SVM and the kernelized SVM, using their respective optimized parameters. However, we may be able to get better output by running the SVM on other types of words (e.g., verbs, precedents, etc.), or by running it on multiple types of words (e.g., nouns and precedents together). We try some of these things below.

In [101]:
%%time
#first, we import the different types of words 
prec_train_df = pd.read_csv('prec_train_mat.csv',sep=',',header=None)
prec_train_mat = prec_train_df.values
prec_test_df = pd.read_csv('prec_test_mat.csv',sep=',',header=None)
prec_test_mat = prec_test_df.values

verb_train_df = pd.read_csv('verb_train_mat.csv',sep=',',header=None)
verb_train_mat = verb_train_df.values
verb_test_df = pd.read_csv('verb_test_mat.csv',sep=',',header=None)
verb_test_mat = verb_test_df.values

adj_train_df = pd.read_csv('adj_train_mat.csv',sep=',',header=None)
adj_train_mat = adj_train_df.values
adj_test_df = pd.read_csv('adj_test_mat.csv',sep=',',header=None)
adj_test_mat = adj_test_df.values

for_train_df = pd.read_csv('for_train_mat.csv',sep=',',header=None)
for_train_mat = for_train_df.values
for_test_df = pd.read_csv('for_test_mat.csv',sep=',',header=None)
for_test_mat = for_test_df.values

We define a function `matrix_combine`, which takes a list of matrix tuples as input. Each tuple has two elements, which must be the training and test matrices of the same word type. For example, an input could be: `[(for_train_mat,for_test_mat),(adj_train_mat,adj_test_mat)]`. The function returns the column-wise concatenated training and test matrices, in that order. 

In [102]:
def matrix_combine(matrix_list):    
    train_mat = np.concatenate(([element[0] for element in matrix_list]),axis=1)
    test_mat = np.concatenate(([element[1] for element in matrix_list]),axis=1)
    return(train_mat,test_mat)
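A quick shape check on toy data illustrates what the concatenation does. The matrices below are made up for illustration, and the function is repeated so that the snippet runs on its own. All training matrices must have the same number of rows (documents), and likewise for the test matrices.

```python
import numpy as np

def matrix_combine(matrix_list):
    # join the word-type matrices side by side (axis=1), keeping one row per document
    train_mat = np.concatenate([pair[0] for pair in matrix_list], axis=1)
    test_mat = np.concatenate([pair[1] for pair in matrix_list], axis=1)
    return train_mat, test_mat

# toy counts: 4 training documents and 2 test documents
toy_noun_train = np.ones((4, 3), dtype=int)    # 3 noun columns
toy_noun_test = np.ones((2, 3), dtype=int)
toy_verb_train = np.zeros((4, 2), dtype=int)   # 2 verb columns
toy_verb_test = np.zeros((2, 2), dtype=int)

toy_train, toy_test = matrix_combine([(toy_noun_train, toy_noun_test),
                                      (toy_verb_train, toy_verb_test)])
print(toy_train.shape)
print(toy_test.shape)
```

The combined vocabulary is simply the noun columns followed by the verb columns, so the column order of the input list determines the column order of the output.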

####Linear SVM
We try running the linear SVM on each of the four types of words.

In [103]:
#EACH INDIVIDUALLY
linear_1_resultsdict= {}
#loop through each type of words 
for matrixpair in [(prec_train_mat,prec_test_mat,'prec'),(adj_train_mat,adj_test_mat,'adj'),
                   (verb_train_mat,verb_test_mat,'verb'),(for_train_mat,for_test_mat,'for')]:
    #convert to TF-IDF form 
    train_tfidf = tfidf_mat_creator(matrixpair[0])
    test_tfidf = tfidf_mat_creator(matrixpair[1])
    ##print(train_tfidf.shape)
    ##print(test_tfidf.shape)
    #run do-classify
    #(the third return value is named conf_mat so it does not shadow sklearn's confusion_matrix function)
    predictions,accuracy,conf_mat,best_param=do_classify(LinearSVC(loss='hinge'), {'C':[1.0]},
                                                         train_tfidf,train_issue_areas,
                                                         test_tfidf,test_issue_areas)
    #save results 
    linear_1_resultsdict[matrixpair[2]]=accuracy
for key,value in linear_1_resultsdict.items(): 
    print("%s :  %s" % (key, value))

############# based on standard predict ################
Accuracy on training data: 0.99
Accuracy on test data:     0.29
Best parameter:  {'C': 1.0}
########################################################
############# based on standard predict ################
Accuracy on training data: 0.89
Accuracy on test data:     0.42
Best parameter:  {'C': 1.0}
########################################################
############# based on standard predict ################
Accuracy on training data: 0.94
Accuracy on test data:     0.47
Best parameter:  {'C': 1.0}
########################################################
############# based on standard predict ################
Accuracy on training data: 0.11
Accuracy on test data:     0.09
Best parameter:  {'C': 1.0}
########################################################
verb :  0.474409448819
adj :  0.42125984252
for :  0.0925196850394
prec :  0.28937007874


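Verbs (test accuracy 0.47) and adjectives (0.42) give decent results, and precedents (0.29) are not terrible; the `for` matrix (0.09) does poorly and is not used further below. Let's try combining each of the first three with nouns and see what the results are.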

In [132]:
%%time
#EACH OF VERBS, PRECEDENTS, ADJ COMBINED WITH NOUNS 
nouncomboslist = [[(verb_train_mat,verb_test_mat),(noun_train_mat,noun_test_mat),'verbs-nouns'],
                  [(prec_train_mat,prec_test_mat),(noun_train_mat,noun_test_mat),'prec-nouns'],
                  [(adj_train_mat,adj_test_mat),(noun_train_mat,noun_test_mat),'adj-nouns']]
combosdict={}
for combo in nouncomboslist: 
    #concatenate the nouns matrix and the other matrix 
    train_combo,test_combo = matrix_combine([combo[0],combo[1]])
    #convert to tfidf 
    train_tfidf = tfidf_mat_creator(train_combo)
    test_tfidf = tfidf_mat_creator(test_combo)
    #do_classify 
    predictions,accuracy,conf_mat,best_param=do_classify(LinearSVC(loss='hinge'), {'C':[1.0]},train_tfidf,
                                                         train_issue_areas,test_tfidf,test_issue_areas)
    #get accuracies in dict 
    combosdict[combo[2]] = accuracy 
for key,value in combosdict.items():
    print("%s :  %s" % (key, value))

############# based on standard predict ################
Accuracy on training data: 0.97
Accuracy on test data:     0.69
Best parameter:  {'C': 1.0}
########################################################
############# based on standard predict ################
Accuracy on training data: 0.97
Accuracy on test data:     0.68
Best parameter:  {'C': 1.0}
########################################################
############# based on standard predict ################
Accuracy on training data: 0.97
Accuracy on test data:     0.68
Best parameter:  {'C': 1.0}
########################################################
verbs-nouns :  0.692913385827
adj-nouns :  0.681102362205
prec-nouns :  0.681102362205
CPU times: user 8.03 s, sys: 1.56 s, total: 9.59 s
Wall time: 9.51 s


In [133]:
#VERBS, NOUNS, PRECEDENTS COMBINED 
verbnounsprec = [(verb_train_mat,verb_test_mat),(noun_train_mat,noun_test_mat),(prec_train_mat,prec_test_mat)]
#concatenate 
vnp_train,vnp_test = matrix_combine(verbnounsprec)
#convert to TFIDF
vnp_train_tfidf = tfidf_mat_creator(vnp_train)
vnp_test_tfidf = tfidf_mat_creator(vnp_test)
#run do_classify
do_classify(LinearSVC(loss='hinge'), {'C':[1.0]},vnp_train_tfidf,train_issue_areas,vnp_test_tfidf,test_issue_areas)

############# based on standard predict ################
Accuracy on training data: 0.97
Accuracy on test data:     0.69
Best parameter:  {'C': 1.0}
########################################################


(array([  8.,   8.,   9.,   1.,   9.,   1.,   8.,   9.,   1.,   9.,   7.,
          2.,   1.,   7.,  10.,   2.,   8.,   7.,   7.,   8.,  10.,   3.,
          2.,   7.,   1.,   9.,   9.,   2.,  10.,   2.,   2.,   2.,   1.,
          8.,   8.,   3.,   1.,   8.,   2.,   7.,   8.,   8.,   1.,   1.,
          1.,  12.,   8.,   8.,   8.,   2.,   8.,   8.,   8.,   8.,   4.,
          3.,   8.,   2.,   8.,   1.,   8.,   2.,   8.,   1.,   7.,   8.,
          2.,   3.,   9.,   2.,   8.,   9.,   2.,   1.,   8.,   8.,   1.,
          8.,   2.,   8.,   8.,   8.,   8.,   8.,   1.,   9.,   9.,  12.,
          8.,  10.,   2.,   1.,   1.,   1.,   3.,   3.,  11.,   1.,   9.,
          9.,   8.,   1.,   9.,   3.,   6.,  12.,   9.,   7.,   1.,   8.,
          1.,   2.,   7.,   8.,   1.,   7.,   9.,   1.,  11.,   1.,   7.,
          8.,   8.,   9.,   3.,   1.,  10.,   8.,  10.,   8.,   2.,   2.,
          2.,   8.,   7.,   9.,   9.,   1.,   8.,   3.,   8.,   2.,   1.,
          9.,   1.,   2.,   8.,   2., 

###Visualization ideas (for this and other parts of the project) 
- Graph of the SVM confusion matrix to show which issue areas were best predicted, and to show overlap between areas. Maybe a 13x13 heat map (matching the confusion matrices above) showing overlap between columns. For instance, if issue 12 is frequently predicted as issue 10, then the 12-10 (or 10-12) box of the heat map would be dark.
- Bar graph comparing the accuracy rating of the best iteration of each model 
- Show the words that have the greatest influence (e.g., highest absolute coefficients) in each model 
- Exploratory data analysis: show the most common words; visualize the TF-IDF matrix. Perhaps take the sum of each column (word) of the TF-IDF matrix, as this represents "overall influence" of a given word. Then make a plot of these values -- maybe even a visualization that represents importance with word size.  
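The column-sum idea in the last bullet can be sketched in a few lines. The TF-IDF values below are the sample matrix from the first section, and the word labels in `vocab` are hypothetical placeholders, since the real vocabulary lists are not shown in this notebook.

```python
import numpy as np

# sample TF-IDF matrix from the first section (3 documents, 4 words)
tfidf = np.array([[0.89442719, 0.4472136, 0.0, 0.0],
                  [0.65121271, 0.32560635, 0.21406661, 0.65121271],
                  [0.0, 0.0, 0.0, 1.0]])
vocab = ["word_a", "word_b", "word_c", "word_d"]  # hypothetical labels

# "overall influence" of a word: the sum of its TF-IDF weights over all documents
influence = tfidf.sum(axis=0)

# rank words from most to least influential
order = np.argsort(influence)[::-1]
ranked = [(vocab[i], float(influence[i])) for i in order]
for word, score in ranked:
    print("%s: %0.3f" % (word, score))
```

The resulting scores could feed a bar chart or set the font sizes in a word-cloud style plot.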

###TO DO: 
- Keep a dict of all accuracies and models, to select the best one at the end
- Explain why we only consider linear after nouns (b/c linear performs essentially as well as kernelized)
- Do visualizations
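For the first to-do item, a plain dictionary keyed by model name may be enough. The sketch below records the test accuracies that were printed with full precision earlier in this notebook. The names `results` and `record` are our own, and the `model` slot is left as `None` here because the fitted estimators are not kept by the cells above.

```python
# registry of results: name -> {"accuracy": ..., "model": ...}
results = {}

def record(name, accuracy, model=None):
    """Hypothetical helper: store one model's test accuracy (and optionally the fitted model)."""
    results[name] = {"accuracy": accuracy, "model": model}

# accuracies copied from the outputs above
record("rbf-svm-nouns", 0.663385826772)
record("linear-verb", 0.474409448819)
record("linear-adj", 0.42125984252)
record("linear-for", 0.0925196850394)
record("linear-prec", 0.28937007874)
record("linear-verbs-nouns", 0.692913385827)
record("linear-adj-nouns", 0.681102362205)
record("linear-prec-nouns", 0.681102362205)

# pick the entry with the highest test accuracy
best_name = max(results, key=lambda name: results[name]["accuracy"])
print("%s: %0.4f" % (best_name, results[best_name]["accuracy"]))
```

Choosing the final model by test accuracy means the test set is no longer an unbiased check, so a separate validation split would be the safer basis for this selection.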