# Twitter Notebook

## 4.1 a-c)

In [1]:
from twitter import *



In [2]:
np.random.seed(1234)
    
# read the tweets and their labels
dictionary = extract_dictionary('../data/tweets.txt')
X = extract_feature_vectors('../data/tweets.txt', dictionary)
y = read_vector_file('../data/labels.txt')

print("Before split:")

print("Shape of X: " + str(X.shape))
print("Shape of y: " + str(y.shape))

print("After split: ")

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.11, random_state = 42)

print("Shape of Xtrain: " + str(X_train.shape))
print("Shape of Xtest: " + str(X_test.shape))
print("Shape of ytrain: " + str(y_train.shape))
print("Shape of ytest: " + str(y_test.shape))

Before split:
Shape of X: (630, 1811)
Shape of y: (630,)
After split: 
Shape of Xtrain: (560, 1811)
Shape of Xtest: (70, 1811)
Shape of ytrain: (560,)
Shape of ytest: (70,)


## 4.1 d)

After the split, the 630 examples are divided in the ratio given by the problem: 560 for training and 70 for testing (8:1). In particular, X_train is now a 560 x 1811 array and X_test is a 70 x 1811 array.
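The 560/70 shapes follow from `test_size = 0.11`. Below is a quick check of the arithmetic, assuming `train_test_split` rounds a fractional test count up to the next integer (an assumption, though it is consistent with the shapes printed above):

```python
import math

n_samples = 630      # rows in X before the split
test_size = 0.11     # fraction passed to train_test_split

# Assumption: a fractional test count is rounded up to the next integer.
n_test = math.ceil(test_size * n_samples)   # 69.3 rounds up to 70
n_train = n_samples - n_test

print(n_train, n_test)       # 560 70
print(n_train // n_test)     # 8, i.e. an 8:1 train/test ratio
```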

## 4.2 a-b)

In [3]:
metric_list = ["accuracy", "f1_score", "auroc"]
kf = StratifiedKFold(y = y_train, n_folds = 5) # create indices for 5-fold CV

Maintaining the class proportions across the folds is a good idea because each fold then resembles the original distribution as closely as possible. The original distribution is our best "guess" of what the true underlying distribution looks like, and we assume its class proportions are similar to those of the training data. If a fold's proportions differ from those of the training data, the fold is not representative of the true underlying distribution, which may make performance on the test set worse.
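To make the idea concrete, here is a minimal hand-rolled sketch of stratified fold assignment. It is not scikit-learn's `StratifiedKFold` implementation; the function name `stratified_folds` and the toy labels are hypothetical and only illustrate that each fold keeps the overall class ratio.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds):
    """Assign each index to a fold so every fold keeps roughly the class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Deal the indices of each class round-robin across the folds.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Hypothetical toy labels: 60% positive, 40% negative.
labels = [1] * 30 + [-1] * 20
folds = stratified_folds(labels, n_folds=5)
for fold in folds:
    counts = Counter(labels[i] for i in fold)
    print(dict(counts))   # each fold holds 6 positives and 4 negatives
```

A plain, unstratified split of the same labels could by chance put most of the negatives into one fold, so the score on that fold would reflect a different class balance than the data as a whole.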

## 4.2 c-d)

In [4]:
best_c_dict = [{}, {}, {}]
for i, metric in enumerate(metric_list):
    best_c, best_c_dict[i] = select_param_linear(X=X, y=y, kf=kf, metric=metric)
    print(best_c_dict[i])
        
    


Linear SVM Hyperparameter Selection based on accuracy:
10.0
The performance with this value of C was: 0.783601
{0.001: 0.7081825662577874, 0.01: 0.7064126547533627, 0.1: 0.7207316830104441, 1.0: 0.7800133255885469, 10.0: 0.7836005569412648, 100.0: 0.7836005569412648}
Linear SVM Hyperparameter Selection based on f1_score:
10.0
The performance with this value of C was: 0.834326
{0.001: 0.8137795566454397, 0.01: 0.8122673615234884, 0.1: 0.8092921769811928, 1.0: 0.8318029165265688, 10.0: 0.8343262047121867, 100.0: 0.8343262047121867}
Linear SVM Hyperparameter Selection based on auroc:
10.0
The performance with this value of C was: 0.734549
{0.001: 0.5, 0.01: 0.5062629399585921, 0.1: 0.5962365997858499, 1.0: 0.7309176038160226, 10.0: 0.7345489002783662, 100.0: 0.7345489002783662}


The best C was found to be 10 for all three metrics. The output reports the corresponding performance for each metric, along with the performance of every C value tried. Interestingly, the performance plateaus after C = 10: C = 100 yields exactly the same score as C = 10 for every metric.
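The selection step can be sketched from the printed accuracy dictionary alone. How `select_param_linear` breaks the tie between C = 10 and C = 100 is not shown in this notebook; preferring the smaller C (stronger regularization) is an assumption made here for illustration.

```python
# Mean cross-validation accuracy per C, copied from the output above.
accuracy_by_c = {
    0.001: 0.7081825662577874,
    0.01: 0.7064126547533627,
    0.1: 0.7207316830104441,
    1.0: 0.7800133255885469,
    10.0: 0.7836005569412648,
    100.0: 0.7836005569412648,
}

best_score = max(accuracy_by_c.values())
# Several C values can tie; assumption: prefer the smallest one.
tied = [c for c, score in accuracy_by_c.items() if score == best_score]
best_c = min(tied)

print(tied)     # [10.0, 100.0]
print(best_c)   # 10.0
```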

## 4.3 a)

In [5]:
clf = SVC(kernel="linear", C=10)
clf.fit(X_train, y_train)

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

## 4.3 b-c)

In [6]:
perf_dict = {} 
for metric in metric_list:
    perf_dict[metric] = performance_test(clf, X_test, y_test, metric)
    
print(perf_dict)

{'accuracy': 0.8142857142857143, 'f1_score': 0.8505747126436781, 'auroc': 0.7976190476190477}


These are the test-set scores for the three metrics, using the optimal hyperparameter setting for the SVM model found in 4.2 (i.e. C = 10): accuracy is about 0.814, F1 score about 0.851, and AUROC about 0.798.
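For reference, the three metrics can be computed by hand as in the sketch below. The helper `performance_test` is not shown in this notebook, so this is not its implementation; the function names, the ±1 label encoding, and the toy data are all assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auroc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: true labels and real-valued classifier scores.
y_true = [1, 1, 1, -1, -1]
scores = [0.9, 0.4, -0.2, 0.1, -0.7]
y_pred = [1 if s > 0 else -1 for s in scores]   # threshold the scores at 0

print(round(accuracy(y_true, y_pred), 3))   # 0.6
print(round(f1(y_true, y_pred), 3))         # 0.667
print(round(auroc(y_true, scores), 3))      # 0.833
```

Accuracy and F1 depend on the thresholded predictions, while AUROC depends only on how the scores rank positives against negatives, which is why the three numbers can differ noticeably on the same classifier.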