# Classification models and feature selection
In this notebook, we train several classification models on the thread-level features. We then use the weights of a linear model (a linear SVM) to identify the features most relevant to detecting rumours.
The steps include:
1. Load dependencies
2. Read the thread-level features from the CSV file containing the samples for each event
3. Define some classification models to be trained
4. Train and test the models on the dataset:
    4.1. Decide on the train data and the test data
    4.2. Train and test 

## Load dependencies for this Jupyter Notebook
We import `fetch_thread` to read the thread-level CSV files, Matplotlib and Seaborn to plot the results, and several classification models from scikit-learn.

In [40]:
# Load dependencies for this Jupyter Notebook
import pandas as pd
import time
import numpy as np
from functools import reduce
from lib.util import fetch_thread
import matplotlib.pyplot as plt

import seaborn as sns

#Train and Test preprocessing
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

#Classifiers:
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

### Read the CSV file of thread-level features and separate the is_rumor tag from the data:
For each event, fetch the thread-level features, drop every feature column that contains a missing value, and store the result in a dictionary.
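
The implementation of `fetch_thread` lives in `lib.util` and is not shown in this notebook. As a rough sketch of the idea only, the snippet below reads a small in-memory CSV and separates the `is_rumor` column from the feature columns. The function name `fetch_thread_sketch`, the `thread_id` index, and the feature names `tweet_count` and `has_url` are hypothetical placeholders, not the project's actual schema.

```python
import io

import pandas as pd

# Hypothetical stand-in for one event's thread-level CSV; apart from
# is_rumor, the column names here are invented for illustration.
csv_text = """thread_id,tweet_count,has_url,is_rumor
1,12,1,1
2,3,0,0
3,7,1,0
"""


def fetch_thread_sketch(csv_source):
    """Read thread-level features and split off the is_rumor label."""
    df = pd.read_csv(csv_source, index_col="thread_id")
    y = df.pop("is_rumor")  # label column, returned as a pandas Series
    return df, y            # the remaining columns are the features


X_demo, y_demo = fetch_thread_sketch(io.StringIO(csv_text))
print(X_demo.shape, y_demo.tolist())
```

Returning a DataFrame and a Series matches how the notebook uses the result, since it later calls `.values` and `.columns` on `X` and `.values` on `y`.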

In [41]:
events=[
            "germanwings-crash",
            "sydneysiege",
            "ottawashooting",
            "ferguson",
            "charliehebdo",
        ]

In [42]:
events_threads={}
for event in events:
    X,y=fetch_thread(event)
    # Drop every feature column that contains at least one NaN
    X=X.dropna(axis=1)
    events_threads[event]={'X':X.values,'y':y.values,'columns':X.columns}
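
Because NaN columns are dropped separately for each event, two events could in principle end up with different feature columns, and arrays from different events can only be stacked when their columns match. In the outputs of this notebook every event has 111 features, so the issue does not arise here. The sketch below, on toy data with hypothetical feature names, shows one possible way to keep only the columns shared by all events.

```python
import numpy as np
import pandas as pd

# Toy per-event feature tables (hypothetical feature names).
frames = {
    "event_a": pd.DataFrame({"f1": [1.0, 2.0], "f2": [np.nan, 1.0], "f3": [0.0, 1.0]}),
    "event_b": pd.DataFrame({"f1": [3.0, 4.0], "f2": [5.0, 6.0], "f3": [np.nan, 2.0]}),
}

# Drop NaN-containing columns per event, as the cell above does.
cleaned = {name: df.dropna(axis=1) for name, df in frames.items()}

# Keep only the columns that survived in every event, in a fixed order.
shared = sorted(set.intersection(*(set(df.columns) for df in cleaned.values())))
aligned = {name: df[shared] for name, df in cleaned.items()}

print(shared)  # ['f1']
```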

### Helper functions:
* **test_models**: given a list of classification models and train/test data, train the models on the data and report the train and test accuracy.
* **split_train_and_test_data**: given the names of some of the events for training and testing, split the data into train and test data:
    * If an event should go in both the train and test data, split the data in that event into 75% train data and 25% test data. 
    * If an event should go in only train or test data, store all the samples in that event in the respective train or test set.
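
The two splitting rules can be illustrated on synthetic data. In this sketch, `event_a` and `event_b` are made-up events with random features and labels: `event_a` is treated as belonging to both the train and test lists, and `event_b` as test-only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for two events: 40 and 20 threads, 3 features each.
toy = {
    "event_a": (rng.normal(size=(40, 3)), rng.integers(0, 2, size=40)),
    "event_b": (rng.normal(size=(20, 3)), rng.integers(0, 2, size=20)),
}

# event_a appears in both lists, so it is split 75% / 25%.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    *toy["event_a"], test_size=0.25, random_state=1
)

# event_b is test-only, so all of its samples go to the test set.
X_train, y_train = Xa_train, ya_train
X_test = np.concatenate((Xa_test, toy["event_b"][0]), axis=0)
y_test = np.concatenate((ya_test, toy["event_b"][1]), axis=0)

print(X_train.shape, X_test.shape)  # (30, 3) (30, 3)
```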

In [43]:
def test_models(models,X_train, X_test, y_train, y_test):
    """Fit each model on the train data and print its train and test accuracy."""
    for model_name in models:
        model=models[model_name]
        model.fit(X_train,y_train)
        y_test_hat=model.predict(X_test)
        # Accuracy is the fraction of predictions that match the true labels
        print('%s train accuracy:' % model_name, np.mean(model.predict(X_train)==y_train))
        print('%s test accuracy:' % model_name, np.mean(y_test_hat==y_test))
        print()
        
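def split_train_and_test_data(train_events,test_events):
    """Build train and test arrays from the given event names.

    An event listed in both arguments is split 75%/25%; an event listed in
    only one of them contributes all of its samples to that set.
    """
    d=events_threads[train_events[0]]['X'].shape[1]
    X_train=np.zeros((0,d))
    X_test=np.zeros((0,d))
    y_train=np.zeros((0))
    y_test=np.zeros((0))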
    for event in train_events:
        if event in test_events:
            X_train1, X_test1, y_train1, y_test1 = train_test_split(events_threads[event]['X'], events_threads[event]['y'], test_size=0.25, random_state=1)
            X_train=np.concatenate((X_train,X_train1),axis=0)
            y_train=np.concatenate((y_train,y_train1),axis=0)
            X_test=np.concatenate((X_test,X_test1),axis=0)
            y_test=np.concatenate((y_test,y_test1),axis=0)
        else:
            X_train=np.concatenate((X_train,events_threads[event]['X']),axis=0)
            y_train=np.concatenate((y_train,events_threads[event]['y']),axis=0)


    for event in test_events:
        if event not in train_events:
            X_test=np.concatenate((X_test,events_threads[event]['X']),axis=0)
            y_test=np.concatenate((y_test,events_threads[event]['y']),axis=0)

    # Encode the labels as integers 0..n_classes-1 so that np.bincount can be applied below
    le = preprocessing.LabelEncoder()
    le.fit(y_train)
    y_train=le.transform(y_train)
    y_test=le.transform(y_test)
    print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
    print('y_train bincount:', np.bincount(y_train)/np.sum(np.bincount(y_train)))
    print('y_test bincount:', np.bincount(y_test)/np.sum(np.bincount(y_test)))
    return X_train, X_test, y_train, y_test


## Testing different classification models on different parts of the dataset

In [44]:
models={
    'linear_SVM':svm.SVC(gamma='scale', kernel='linear'),
    'SVM_with_RBF_kernel': svm.SVC(gamma='scale', kernel='rbf'),
    'SVM_with_sigmoid_kernel' : svm.SVC(gamma='scale', kernel='sigmoid'),
    'KNN_with_k=5':KNeighborsClassifier(n_neighbors=5),
    'Decision_Tree_Classifier':DecisionTreeClassifier(random_state=0),
    'Random_Forest_Classifier_n=100_maxDepth=3':RandomForestClassifier(n_estimators=100, max_depth=3, random_state=4),
    'AdaBoost_n=100':AdaBoostClassifier(n_estimators=100),
    'Gaussian_Process_Classifier':GaussianProcessClassifier(1.0 * RBF(1.0)),
}
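
The SVM kernels and KNN compare threads through distances or inner products of the raw feature values, so a feature with a large numeric range can dominate the others. The models above are fitted on unscaled features, and whether scaling would help on this dataset is not tested in this notebook. As a minimal sketch on synthetic data, a model can be wrapped in a pipeline that standardises the features first.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic data: the label depends only on the first feature.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X[:, 1] *= 1000.0  # the uninformative second feature gets a much larger scale

# The scaler is fitted on the training data only, then applied before the SVM.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
scaled_svm.fit(X, y)
print(scaled_svm.score(X, y))
```

A pipeline built this way exposes `fit` and `predict` like any other estimator, so it could be stored in the `models` dictionary under its own name.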

### 1. Train and Test on the charliehebdo event:
This is the largest event in the dataset, with 2,002 threads.

In [45]:
X_train, X_test, y_train, y_test=split_train_and_test_data(['charliehebdo'],['charliehebdo'])
test_models(models,X_train, X_test, y_train, y_test)

(1501, 111) (501, 111) (1501,) (501,)
y_train bincount: [0.77948035 0.22051965]
y_test bincount: [0.76846307 0.23153693]
SVM_with_RBF_kernel train accuracy: 0.8307794803464357
SVM_with_RBF_kernel test accuracy: 0.7764471057884231

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.7968021319120586
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.782435129740519

AdaBoost_n=100 train accuracy: 0.9127248500999334
AdaBoost_n=100 test accuracy: 0.8283433133732535

SVM_with_sigmoid_kernel train accuracy: 0.7581612258494337
SVM_with_sigmoid_kernel test accuracy: 0.7445109780439122

KNN_with_k=5 train accuracy: 0.8294470353097935
KNN_with_k=5 test accuracy: 0.7445109780439122

Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.7784431137724551

linear_SVM train accuracy: 0.8254497001998667
linear_SVM test accuracy: 0.8023952095808383

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.694610778443
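
The class proportions printed above show that about 77% of the charliehebdo test threads fall into one class, so a classifier that always predicts that class would already score roughly 0.77. The sketch below uses synthetic labels with a similar balance, not the notebook's data, to show how such a majority-class baseline can be computed with scikit-learn's `DummyClassifier` for comparison with the accuracies above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly the class balance printed above (77% / 23%).
y_train = np.array([0] * 77 + [1] * 23)
y_test = np.array([0] * 77 + [1] * 23)
X_train = np.zeros((len(y_train), 1))  # features are ignored by this baseline
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
print("majority-class test accuracy:", baseline_acc)  # 0.77
```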

### 2. Train and Test on the charliehebdo and sydneysiege events:

In [59]:
X_train, X_test, y_train, y_test=split_train_and_test_data(['sydneysiege','charliehebdo'],['sydneysiege','charliehebdo'])
test_models(models,X_train, X_test, y_train, y_test)

(2380, 111) (795, 111) (2380,) (795,)
y_train bincount: [0.70504202 0.29495798]
y_test bincount: [0.6918239 0.3081761]
linear_SVM train accuracy: 0.7373949579831933
linear_SVM test accuracy: 0.70062893081761

AdaBoost_n=100 train accuracy: 0.8495798319327731
AdaBoost_n=100 test accuracy: 0.7685534591194969

Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.7345911949685534

SVM_with_RBF_kernel train accuracy: 0.819327731092437
SVM_with_RBF_kernel test accuracy: 0.7257861635220125

SVM_with_sigmoid_kernel train accuracy: 0.66890756302521
SVM_with_sigmoid_kernel test accuracy: 0.6792452830188679

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.7521008403361344
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.7320754716981132

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.6666666666666666

KNN_with_k=5 train accuracy: 0.815546218487395
KNN_with_k=5 test accuracy: 0.7119496855345911



### 3. Train on the charliehebdo event and test on the sydneysiege event:

In [60]:
X_train, X_test, y_train, y_test=split_train_and_test_data(['charliehebdo'],['sydneysiege'])
test_models(models,X_train, X_test, y_train, y_test)

(2002, 111) (1173, 111) (2002,) (1173,)
y_train bincount: [0.77672328 0.22327672]
y_test bincount: [0.57374254 0.42625746]
linear_SVM train accuracy: 0.8221778221778222
linear_SVM test accuracy: 0.4595055413469736

AdaBoost_n=100 train accuracy: 0.9000999000999002
AdaBoost_n=100 test accuracy: 0.4919011082693947

Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.505541346973572

SVM_with_RBF_kernel train accuracy: 0.8336663336663337
SVM_with_RBF_kernel test accuracy: 0.5754475703324808

SVM_with_sigmoid_kernel train accuracy: 0.7472527472527473
SVM_with_sigmoid_kernel test accuracy: 0.5711849957374254

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.8026973026973027
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.5754475703324808

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.5720375106564365

KNN_with_k=5 train accuracy: 0.8311688311688312
KNN_with_k=5 test accuracy: 0.5694799658

### 4. Train and Test on all events:

In [61]:
X_train, X_test, y_train, y_test=split_train_and_test_data(events,events)
test_models(models,X_train, X_test, y_train, y_test)

(4082, 111) (1365, 111) (4082,) (1365,)
y_train bincount: [0.66438021 0.33561979]
y_test bincount: [0.63882784 0.36117216]
linear_SVM train accuracy: 0.692062714355708
linear_SVM test accuracy: 0.652014652014652

AdaBoost_n=100 train accuracy: 0.7895639392454679
AdaBoost_n=100 test accuracy: 0.7289377289377289

Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.6813186813186813

SVM_with_RBF_kernel train accuracy: 0.8233708966193043
SVM_with_RBF_kernel test accuracy: 0.6886446886446886

SVM_with_sigmoid_kernel train accuracy: 0.6215090641842235
SVM_with_sigmoid_kernel test accuracy: 0.6043956043956044

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.6876531112199902
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.6586080586080586

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.6534798534798535

KNN_with_k=5 train accuracy: 0.8027927486526213
KNN_with_k=5 test accuracy: 0.67985347985

### 5. Train on four events and test on the remaining event:

In [62]:
X_train, X_test, y_train, y_test=split_train_and_test_data(["sydneysiege","ottawashooting","ferguson","charliehebdo"]
                                    ,["germanwings-crash"])
test_models(models,X_train, X_test, y_train, y_test)

(5042, 111) (405, 111) (5042,) (405,)
y_train bincount: [0.67056724 0.32943276]
y_test bincount: [0.50123457 0.49876543]
linear_SVM train accuracy: 0.6923839746132487
linear_SVM test accuracy: 0.5530864197530864

AdaBoost_n=100 train accuracy: 0.7949226497421658
AdaBoost_n=100 test accuracy: 0.4839506172839506

Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.5185185185185185

SVM_with_RBF_kernel train accuracy: 0.8216977389924633
SVM_with_RBF_kernel test accuracy: 0.4839506172839506

SVM_with_sigmoid_kernel train accuracy: 0.6380404601348671
SVM_with_sigmoid_kernel test accuracy: 0.5135802469135803

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.7046806822689409
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.5012345679012346

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.49382716049382713

KNN_with_k=5 train accuracy: 0.8040460134867117
KNN_with_k=5 test accuracy: 0.5111111111

In [63]:
X_train, X_test, y_train, y_test=split_train_and_test_data(["germanwings-crash","sydneysiege","ferguson","charliehebdo"]
                                    ,["ottawashooting"])
test_models(models,X_train, X_test, y_train, y_test)

(4590, 111) (857, 111) (4590,) (857,)
y_train bincount: [0.69368192 0.30631808]
y_test bincount: [0.46674446 0.53325554]
linear_SVM train accuracy: 0.7030501089324619
linear_SVM test accuracy: 0.4632438739789965

AdaBoost_n=100 train accuracy: 0.8065359477124183
AdaBoost_n=100 test accuracy: 0.5320886814469078

Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.514585764294049

SVM_with_RBF_kernel train accuracy: 0.822875816993464
SVM_with_RBF_kernel test accuracy: 0.47607934655775963

SVM_with_sigmoid_kernel train accuracy: 0.6516339869281046
SVM_with_sigmoid_kernel test accuracy: 0.49241540256709454

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.6984749455337691
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.46674445740956827

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.4865810968494749

KNN_with_k=5 train accuracy: 0.8052287581699347
KNN_with_k=5 test accuracy: 0.4714119019

In [64]:
X_train, X_test, y_train, y_test=split_train_and_test_data(["germanwings-crash","sydneysiege","ottawashooting","charliehebdo"]
                                    ,["ferguson"])
test_models(models,X_train, X_test, y_train, y_test)

(4437, 111) (1010, 111) (4437,) (1010,)
y_train bincount: [0.63804372 0.36195628]
y_test bincount: [0.74554455 0.25445545]
linear_SVM train accuracy: 0.6864998873112463
linear_SVM test accuracy: 0.6613861386138614

AdaBoost_n=100 train accuracy: 0.7881451431147172
AdaBoost_n=100 test accuracy: 0.499009900990099

Decision_Tree_Classifier train accuracy: 1.0
Decision_Tree_Classifier test accuracy: 0.49306930693069306

SVM_with_RBF_kernel train accuracy: 0.830290736984449
SVM_with_RBF_kernel test accuracy: 0.6663366336633664

SVM_with_sigmoid_kernel train accuracy: 0.5997295469912103
SVM_with_sigmoid_kernel test accuracy: 0.5801980198019802

Random_Forest_Classifier_n=100_maxDepth=3 train accuracy: 0.7099391480730223
Random_Forest_Classifier_n=100_maxDepth=3 test accuracy: 0.6049504950495049

Gaussian_Process_Classifier train accuracy: 1.0
Gaussian_Process_Classifier test accuracy: 0.501980198019802

KNN_with_k=5 train accuracy: 0.7942303358124859
KNN_with_k=5 test accuracy: 0.55247524752
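
The three runs above each train on four events and test on the one left out. As a minimal sketch of automating that pattern, scikit-learn's `LeaveOneGroupOut` can produce one fold per event. The data below is synthetic, and the event names and array shapes are placeholders rather than the notebook's data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic data: 3 hypothetical events with 30 threads each, 4 features.
X = rng.normal(size=(90, 4))
y = rng.integers(0, 2, size=90)
groups = np.repeat(["event_a", "event_b", "event_c"], 30)

# Each fold holds out all threads of one event and trains on the rest.
scores = cross_val_score(
    SVC(kernel="linear"), X, y, groups=groups, cv=LeaveOneGroupOut()
)
print(len(scores))  # 3, one accuracy per held-out event
```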

## Identifying the most important features with linear SVM weights

In [47]:
X_train, X_test, y_train, y_test=split_train_and_test_data(["germanwings-crash","sydneysiege","ottawashooting","charliehebdo"]
                                    ,["ferguson"])
model=models['linear_SVM']
model.fit(X_train,y_train)
plt.figure()
plt.title("Linear-SVM weights identifying the most important features")
labels = events_threads["ferguson"]["columns"].values
coefs=model.coef_.flatten()
# Order the features from the most negative to the most positive weight
order=np.argsort(coefs)
sorted_labels=labels[order]
sorted_coefs=coefs[order]
ax = sns.barplot(y=sorted_labels, x=sorted_coefs, palette="Set2")
ax.set(xlabel="Linear-SVM Weights", ylabel="Feature")

(4437, 111) (1010, 111) (4437,) (1010,)
y_train bincount: [0.63804372 0.36195628]
y_test bincount: [0.74554455 0.25445545]


[Text(0,0.5,'Feature'), Text(0.5,0,'Linear-SVM Weights')]
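
To turn the weights into a short list of selected features, one option is to rank the features by the absolute value of their linear SVM weight and keep the top `k`. The sketch below does this on synthetic data where only two features influence the label. The feature names and the choice `k = 2` are hypothetical, and weight magnitudes are only comparable when the features are on similar scales.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])  # hypothetical names
X = rng.normal(size=(300, 5))
y = (2.0 * X[:, 0] - 1.5 * X[:, 3] > 0).astype(int)  # only f0 and f3 matter

model = SVC(kernel="linear").fit(X, y)
coefs = model.coef_.ravel()  # one weight per feature for a binary problem

# Rank features by absolute weight, largest first, and keep the top k.
k = 2
top_idx = np.argsort(np.abs(coefs))[::-1][:k]
top_features = feature_names[top_idx].tolist()
print(top_features)
```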