# Classification models and feature selection
In this notebook, we train several classification models on the thread-level features. We then use the weights of a linear model to select the features most relevant to rumour detection.
The steps are:
1. Load dependencies
2. Read the thread-level features from the CSV file containing the samples for each event
3. Define the classification models to be trained
4. Train and test the models on the dataset:
    4.1. Decide on the train data and the test data
    4.2. Train and test

## Load dependencies for this Jupyter Notebook
We need the function that reads the thread-level CSV files, the plotting libraries for the results, and several classification models from scikit-learn.

In [1]:
# Load dependencies for this Jupyter Notebook
import pandas as pd
import time
import numpy as np
from functools import reduce
from lib.util import fetch_thread
import matplotlib.pyplot as plt

import seaborn as sns

#Train and Test preprocessing
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

#Classifiers:
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

### Read the CSV file of thread-level features and separate the is_rumor label from the data:
For each event, fetch the thread-level features and store them in a dictionary.

In [2]:
events=[
            "germanwings-crash",
            "sydneysiege",
            "ottawashooting",
            "ferguson",
            "charliehebdo",
        ]

In [3]:
events_threads={}
for event in events:
    X,y=fetch_thread(event)
    X=X.drop(X.columns.values[np.where(np.isnan(X.values))[1]],axis=1)
    events_threads[event]={'X':X.values,'y':y.values,'columns':X.columns}
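
To make the NaN-handling step concrete, here is a minimal sketch on a made-up DataFrame (the column names are hypothetical, not taken from the dataset). It assumes the intent of the cell above is to drop every feature column that contains at least one missing value, which pandas can also express as `dropna(axis=1)`.

```python
import numpy as np
import pandas as pd

# Toy thread-level feature table; column names are invented for illustration.
X = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0],
    "feat_b": [0.5, np.nan, 0.1],   # contains a missing value
    "feat_c": [10.0, 20.0, 30.0],
})

# Index-based approach: find the column positions of NaN entries, then drop those columns.
nan_cols = X.columns.values[np.where(np.isnan(X.values))[1]]
X_indexed = X.drop(nan_cols, axis=1)

# More direct pandas call that should give the same result.
X_dropna = X.dropna(axis=1)

print(list(X_indexed.columns))
print(list(X_dropna.columns))
```

Both variants remove whole columns rather than rows, so the number of threads per event is unchanged while the number of features may shrink.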

### Helper functions:
* **test_models**: given a dictionary of classification models and the train/test data, fit each model on the train data and print its train and test accuracy.
* **split_train_and_test_data**: given the names of the events to use for training and for testing, build the train and test sets:
    * If an event appears in both lists, split its samples into 75% train data and 25% test data.
    * If an event appears in only one list, put all of its samples in the corresponding train or test set.

In [39]:
def test_models(models,X_train, X_test, y_train, y_test):
    for model_name in models:
        model=models[model_name]
        model.fit(X_train,y_train)
        y_test_hat=model.predict(X_test)
        print('%s train accuracy:' % model_name, np.mean(model.predict(X_train)==y_train))
        print('%s test accuracy:' % model_name, np.mean(y_test_hat==y_test))
        print()
        
def split_train_and_test_data(train_events,test_events):
    d=events_threads[train_events[0]]['X'].shape[1]
    X_train=np.zeros((0,d))
    X_test=np.zeros((0,d))
    y_train=np.zeros((0))
    y_test=np.zeros((0))
    for event in train_events:
        if event in test_events:
            X_train1, X_test1, y_train1, y_test1 = train_test_split(events_threads[event]['X'], events_threads[event]['y'], test_size=0.25, random_state=1)
            X_train=np.concatenate((X_train,X_train1),axis=0)
            y_train=np.concatenate((y_train,y_train1),axis=None)  
            X_test=np.concatenate((X_test,X_test1),axis=0)
            y_test=np.concatenate((y_test,y_test1),axis=None)
        else:
            X_train=np.concatenate((X_train,events_threads[event]['X']),axis=0)
            y_train=np.concatenate((y_train,events_threads[event]['y']),axis=0)


    for event in test_events:
        if event not in train_events:
            X_test=np.concatenate((X_test,events_threads[event]['X']),axis=0)
            y_test=np.concatenate((y_test,events_threads[event]['y']),axis=0)

    le = preprocessing.LabelEncoder()
    le.fit(y_train)
    y_train=le.transform(y_train)
    y_test=le.transform(y_test)
    return X_train, X_test, y_train, y_test
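
A quick way to sanity-check the 75/25 rule is to run the same splitting logic on synthetic events and look at the resulting sizes. The sketch below is a self-contained stand-in, not the notebook's function: `toy_events`, `split_events`, the event names, and the sizes are all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for two events; names, sizes and features are made up.
toy_events = {
    "event_a": {"X": rng.normal(size=(40, 3)), "y": rng.integers(0, 2, size=40)},
    "event_b": {"X": rng.normal(size=(20, 3)), "y": rng.integers(0, 2, size=20)},
}

def split_events(train_events, test_events, data, test_size=0.25, seed=1):
    """Shared events are split 75/25; other events go wholly to one side."""
    X_tr, X_te, y_tr, y_te = [], [], [], []
    for name in train_events:
        X, y = data[name]["X"], data[name]["y"]
        if name in test_events:
            a, b, c, d = train_test_split(X, y, test_size=test_size, random_state=seed)
            X_tr.append(a)
            X_te.append(b)
            y_tr.append(c)
            y_te.append(d)
        else:
            X_tr.append(X)
            y_tr.append(y)
    for name in test_events:
        if name not in train_events:
            X_te.append(data[name]["X"])
            y_te.append(data[name]["y"])
    return (np.concatenate(X_tr), np.concatenate(X_te),
            np.concatenate(y_tr), np.concatenate(y_te))

# event_a appears on both sides (split 75/25); event_b is test-only.
X_train, X_test, y_train, y_test = split_events(
    ["event_a"], ["event_a", "event_b"], toy_events)
print(X_train.shape, X_test.shape)
```

With 40 samples in the shared event, 30 go to the train set and 10 to the test set; the 20 samples of the test-only event are added to the test set in full.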


## Testing different classification models on different parts of the dataset
### Models dictionary:

In [34]:
models={
    'LinearSVC_with_L1_Regularization' : svm.LinearSVC(penalty='l1',dual=False,max_iter=3000),
    'linear_SVM':svm.SVC(gamma='scale', kernel='linear'),
    'SVM_with_RBF_kernel': svm.SVC(gamma='scale', kernel='rbf'),
    'SVM_with_sigmoid_kernel' : svm.SVC(gamma='scale', kernel='sigmoid'),
    'KNN_with_k=5':KNeighborsClassifier(n_neighbors=5),
    'Decision_Tree_Classifier':DecisionTreeClassifier(random_state=0),
    'Random_Forest_Classifier_n=100_maxDepth=3':RandomForestClassifier(n_estimators=100, max_depth=3, random_state=4),
    'AdaBoost_n=100':AdaBoostClassifier(n_estimators=100),
    'Gaussian_Process_Classifier':GaussianProcessClassifier(1.0 * RBF(1.0)),
}

### 1. Train and test on the *charliehebdo* event:
This is the largest event.

In [29]:
X_train, X_test, y_train, y_test=split_train_and_test_data(['charliehebdo'],['charliehebdo'])
test_models(models,X_train, X_test, y_train, y_test)

(1501, 111) (501, 111) (1501,) (501,)
y_train bincount: [ 0.77948035  0.22051965]
y_test bincount: [ 0.76846307  0.23153693]
Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.778443113772

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.694610778443

KNN_with_k=5 train accuracy: 0.82944703531
KNN_with_k=5 test accuracy: 0.744510978044

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.796802131912
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.782435129741

linear_SVM train accuracy: 0.8254497002
linear_SVM test accuracy: 0.802395209581

SVM_with_RBF_kernel train accuracy: 0.830779480346
SVM_with_RBF_kernel test accuracy: 0.776447105788

SVM_with_sigmoid_kernel train accuracy: 0.758161225849
SVM_with_sigmoid_kernel test accuracy: 0.744510978044

AdaBoost_n=100 train accuracy: 0.9127248501
AdaBoost_n=100 test accuracy: 0.828343313373

LinearSVC_with_L1_Regularization train accuracy: 0
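
Because the classes are imbalanced, raw accuracy is easier to interpret next to a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels drawn with a similar imbalance; the data is made up and only illustrates the comparison.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Synthetic labels with about 77% in class 0 (made-up data, not the notebook's).
y_train = (rng.random(1500) < 0.23).astype(int)
y_test = (rng.random(500) < 0.23).astype(int)
X_train = np.zeros((1500, 1))  # features are ignored by the dummy model
X_test = np.zeros((500, 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# The baseline accuracy equals the share of the majority class in the test set.
print("majority-class baseline accuracy:", baseline_acc)
```

If the `y_test bincount` values printed above are class proportions, a model that always predicted class 0 would reach about 0.768 test accuracy on this run, so scores close to that value add little over the baseline.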

### 2. Train and test on the *charliehebdo* and *sydneysiege* events:

In [30]:
X_train, X_test, y_train, y_test=split_train_and_test_data(['sydneysiege','charliehebdo'],['sydneysiege','charliehebdo'])
test_models(models,X_train, X_test, y_train, y_test)

(2380, 111) (795, 111) (2380,) (795,)
y_train bincount: [ 0.70504202  0.29495798]
y_test bincount: [ 0.6918239  0.3081761]
Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.734591194969

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.666666666667

KNN_with_k=5 train accuracy: 0.815546218487
KNN_with_k=5 test accuracy: 0.711949685535

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.752100840336
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.732075471698

linear_SVM train accuracy: 0.737394957983
linear_SVM test accuracy: 0.700628930818

SVM_with_RBF_kernel train accuracy: 0.819327731092
SVM_with_RBF_kernel test accuracy: 0.725786163522

SVM_with_sigmoid_kernel train accuracy: 0.668907563025
SVM_with_sigmoid_kernel test accuracy: 0.679245283019

AdaBoost_n=100 train accuracy: 0.849579831933
AdaBoost_n=100 test accuracy: 0.768553459119

LinearSVC_with_L1_Regularization train accuracy



### 3. Train on the *germanwings-crash* event and test on the *ottawashooting* event:

In [31]:
X_train, X_test, y_train, y_test=split_train_and_test_data(['germanwings-crash'],['ottawashooting'])
test_models(models,X_train, X_test, y_train, y_test)

(405, 111) (857, 111) (405,) (857,)
y_train bincount: [ 0.50123457  0.49876543]
y_test bincount: [ 0.46674446  0.53325554]
Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.541423570595

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.501750291715

KNN_with_k=5 train accuracy: 0.750617283951
KNN_with_k=5 test accuracy: 0.521586931155

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.841975308642
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.537922987165

linear_SVM train accuracy: 0.844444444444
linear_SVM test accuracy: 0.462077012835

SVM_with_RBF_kernel train accuracy: 0.883950617284
SVM_with_RBF_kernel test accuracy: 0.487747957993

SVM_with_sigmoid_kernel train accuracy: 0.641975308642
SVM_with_sigmoid_kernel test accuracy: 0.534422403734

AdaBoost_n=100 train accuracy: 1.0
AdaBoost_n=100 test accuracy: 0.558926487748

LinearSVC_with_L1_Regularization train accuracy: 0.8197530

### 4. Train and test on all of the events:

In [32]:
X_train, X_test, y_train, y_test=split_train_and_test_data(events,events)
test_models(models,X_train, X_test, y_train, y_test)

(4082, 111) (1365, 111) (4082,) (1365,)
y_train bincount: [ 0.66438021  0.33561979]
y_test bincount: [ 0.63882784  0.36117216]
Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.681318681319

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.65347985348

KNN_with_k=5 train accuracy: 0.802792748653
KNN_with_k=5 test accuracy: 0.679853479853

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.68765311122
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.658608058608

linear_SVM train accuracy: 0.692062714356
linear_SVM test accuracy: 0.652014652015

SVM_with_RBF_kernel train accuracy: 0.823370896619
SVM_with_RBF_kernel test accuracy: 0.688644688645

SVM_with_sigmoid_kernel train accuracy: 0.621509064184
SVM_with_sigmoid_kernel test accuracy: 0.604395604396

AdaBoost_n=100 train accuracy: 0.789563939245
AdaBoost_n=100 test accuracy: 0.728937728938

LinearSVC_with_L1_Regularization train accura

### 5. Train on all of the events except the *germanwings-crash* event and test on the *germanwings-crash* event:

In [35]:
X_train, X_test, y_train, y_test=split_train_and_test_data(["sydneysiege","ottawashooting","ferguson","charliehebdo"]
                                    ,["germanwings-crash"])
test_models(models,X_train, X_test, y_train, y_test)

(5042, 111) (405, 111) (5042,) (405,)
y_train bincount: [ 0.67056724  0.32943276]
y_test bincount: [ 0.50123457  0.49876543]
Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.518518518519

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.493827160494

KNN_with_k=5 train accuracy: 0.804046013487
KNN_with_k=5 test accuracy: 0.511111111111

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.704680682269
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.501234567901

linear_SVM train accuracy: 0.692383974613
linear_SVM test accuracy: 0.553086419753

SVM_with_RBF_kernel train accuracy: 0.821697738992
SVM_with_RBF_kernel test accuracy: 0.483950617284

SVM_with_sigmoid_kernel train accuracy: 0.638040460135
SVM_with_sigmoid_kernel test accuracy: 0.513580246914

AdaBoost_n=100 train accuracy: 0.794922649742
AdaBoost_n=100 test accuracy: 0.483950617284

LinearSVC_with_L1_Regularization train accura

### 6. Train on all of the events and test on the held-out 25% of the *ottawashooting* event:

In [36]:
X_train, X_test, y_train, y_test=split_train_and_test_data(events
                                    ,["ottawashooting"])
test_models(models,X_train, X_test, y_train, y_test)

(5232, 111) (215, 111) (5232,) (215,)
y_train bincount: [ 0.66704893  0.33295107]
y_test bincount: [ 0.4372093  0.5627907]
Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.613953488372

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.613953488372

KNN_with_k=5 train accuracy: 0.799694189602
KNN_with_k=5 test accuracy: 0.595348837209

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.682148318043
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.497674418605

linear_SVM train accuracy: 0.688073394495
linear_SVM test accuracy: 0.479069767442

SVM_with_RBF_kernel train accuracy: 0.820718654434
SVM_with_RBF_kernel test accuracy: 0.6

SVM_with_sigmoid_kernel train accuracy: 0.626720183486
SVM_with_sigmoid_kernel test accuracy: 0.46511627907

AdaBoost_n=100 train accuracy: 0.786697247706
AdaBoost_n=100 test accuracy: 0.623255813953

LinearSVC_with_L1_Regularization train accuracy: 0.70087920

### 7. Train on all of the events except the *ferguson* event and test on the *ferguson* event:

In [37]:
X_train, X_test, y_train, y_test=split_train_and_test_data(["germanwings-crash","sydneysiege","ottawashooting","charliehebdo"]
                                    ,["ferguson"])
test_models(models,X_train, X_test, y_train, y_test)

(4437, 111) (1010, 111) (4437,) (1010,)
y_train bincount: [ 0.63804372  0.36195628]
y_test bincount: [ 0.74554455  0.25445545]
Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.493069306931

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.50198019802

KNN_with_k=5 train accuracy: 0.794230335812
KNN_with_k=5 test accuracy: 0.552475247525

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.709939148073
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.60495049505

linear_SVM train accuracy: 0.686499887311
linear_SVM test accuracy: 0.661386138614

SVM_with_RBF_kernel train accuracy: 0.830290736984
SVM_with_RBF_kernel test accuracy: 0.666336633663

SVM_with_sigmoid_kernel train accuracy: 0.599729546991
SVM_with_sigmoid_kernel test accuracy: 0.580198019802

AdaBoost_n=100 train accuracy: 0.788145143115
AdaBoost_n=100 test accuracy: 0.49900990099

LinearSVC_with_L1_Regularization train accurac

## Feature selection based on the weights of the linear SVM model
We rank the features by their weight in the linear SVM (the `linear_SVM` entry of `models`, trained on the *charliehebdo* event) and treat the features with the 15 most negative and the 15 most positive weights as the most important ones.

In [41]:
X_train, X_test, y_train, y_test=split_train_and_test_data(["charliehebdo"]
                                    ,["ferguson"])
model=models['linear_SVM']
model.fit(X_train,y_train)
plt.figure()
plt.title("Linear-SVM Weights identifying most important features for the 15 most negative weights")
# Take the feature names from the training event so they line up with model.coef_
labels = events_threads["charliehebdo"]["columns"].values
coefs=model.coef_.flatten()
# Sort the feature names by their weight, most negative first
sorted_labels = [label for _,label in sorted(zip(coefs,labels), key=lambda pair: pair[0])]
sorted_coefs=np.sort(coefs)
ax = sns.barplot(y=sorted_labels[:15], x=sorted_coefs[:15], palette="Set2")
ax.set(xlabel="Linear-SVM Weights", ylabel="Feature")
plt.show()

plt.figure()
plt.title("Linear-SVM Weights identifying most important features for the 15 most positive weights")
ax = sns.barplot(y=sorted_labels[-15:], x=sorted_coefs[-15:], palette="Set2")
ax.set(xlabel="Linear-SVM Weights", ylabel="Feature")
plt.show()
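
The ranking step can also be written with `np.argsort`, which keeps feature names and weights aligned without sorting pairs. The sketch below applies it to an L1-penalized `LinearSVC` (the same settings as the `LinearSVC_with_L1_Regularization` entry of `models`) fitted on synthetic data; the feature names, the dataset, and `k` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data with hypothetical feature names (not the thread-level features).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
feature_names = np.array([f"feature_{i}" for i in range(X.shape[1])])

# Same estimator settings as the LinearSVC entry in the models dictionary.
model = LinearSVC(penalty="l1", dual=False, max_iter=3000)
model.fit(X, y)
coefs = model.coef_.ravel()

k = 3
order = np.argsort(coefs)  # ascending: most negative weight first
most_negative = list(zip(feature_names[order[:k]], coefs[order[:k]]))
most_positive = list(zip(feature_names[order[-k:]], coefs[order[-k:]]))

print("most negative:", most_negative)
print("most positive:", most_positive)
print("features with zero weight:", int(np.sum(coefs == 0)))
```

The L1 penalty tends to push the weights of uninformative features to exactly zero, so the count printed at the end gives a rough sense of how many features the model ignores.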