# Replicating paper results
This notebook replicates the results reported in the paper.



In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
from os import path

data_path = '../features/final_feat'
feature_path = 'features.pkl'
label_path = 'labels.npy'

features = pd.read_pickle(path.join(data_path, feature_path))
labels = np.load(path.join(data_path, label_path))

## Compute class priors

In [3]:
_, counts = np.unique(labels, return_counts=True)

n_nonclickbait = counts[0]
n_clickbait = counts[1]
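
As a minimal sketch of how these counts turn into class priors, the snippet below uses a small toy label array (not the notebook's data) and assumes labels are encoded as 0 = no-clickbait and 1 = clickbait. The names `toy_labels`, `priors`, and `neg_pos_ratio` are illustrative only.

```python
import numpy as np

# Toy labels for illustration only (assumed encoding: 0 = no-clickbait, 1 = clickbait).
toy_labels = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# np.unique returns the sorted unique values, so counts[0] belongs to label 0.
values, counts = np.unique(toy_labels, return_counts=True)

# Class priors are the relative frequencies of each label.
priors = counts / counts.sum()

# Ratio of negatives to positives, the same quantity the XGBoost grid
# passes as scale_pos_weight (n_nonclickbait / n_clickbait).
neg_pos_ratio = counts[0] / counts[1]

print(dict(zip(values.tolist(), priors.tolist())))
print(neg_pos_ratio)
```

If the labels were encoded differently (for example as strings), the sort order of `np.unique` would decide which count ends up at index 0, so it is worth checking `values` before relying on the positions.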


## Features
Below is the list of features used in this notebook.

**Note**:
These features are a subset of the features used in the original paper.
We expect scores to be slightly lower because we have less information about the post and the article. Still, the overall trend is expected to be the same, because the paper's results also show that using all features instead of 20 features gives only a marginal improvement.

Expected results on the training set

| Measure   | Interval      |
|-----------|---------------|
| AUC       | 0.583 - 0.715 |
| Accuracy  | 0.636 - 0.732 |
| Precision | 0.743 - 0.75  |
| Recall    | 0.721 - 0.92  |


Expected results on the validation set  

| Measure   | Interval      |
|-----------|---------------|
| AUC       | 0.653 - 0.8   |
| Accuracy  | 0.725 - 0.812 |
| Precision | 0.814 - 0.824 |
| Recall    | 0.811 - 0.966 |


In [25]:
# Print one feature name per line; a plain loop avoids returning a list of None values
for column in features.columns:
    print(column)

numChars_post_title
numChars_article_title
numChars_post_image
numChars_article_kw
numChars_article_desc
numChars_article_par
numQuestionMarks_post_title
numQuestionMarks_article_title
numQuestionMarks_post_image
numQuestionMarks_article_keywords
numQuestionMarks_article_desc
numQuestionMarks_article_par
ratioChars_post_title_article_title
ratioChars_post_title_post_image
ratioChars_post_title_article_kw
ratioChars_post_title_article_desc
ratioChars_post_title_article_par
ratioChars_article_title_post_image
ratioChars_article_title_article_kw
ratioChars_article_title_article_desc
ratioChars_article_title_article_par
ratioChars_post_image_article_kw
ratioChars_post_image_article_desc
ratioChars_post_image_article_par
ratioChars_article_kw_article_desc
ratioChars_article_kw_article_par
ratioChars_article_desc_article_par
diffChars_post_title_article_title
diffChars_post_title_post_image
diffChars_post_title_article_kw
diffChars_post_title_article_desc
diffChars_post_title_article_par
dif

## Optimization by grid search

The authors did not report which parameters their optimal classifiers used, nor how these classifiers were optimized.

Therefore, we apply a grid search to find suitable parameters (such as the number of estimators). The grids are given below.

The following settings are used during this grid search:
- **10-fold cross validation with shuffled data.** Shuffling reduces bias in our estimate, and the resulting folds differ from the ones used during testing.
- **F1 score as performance measure.** The clickbait and no-clickbait classes are imbalanced, which makes a measure such as accuracy uninformative (70% accuracy could mean that every sample is assigned to one class!). Because we want good performance on both classes, we use the F1 score, which is the harmonic mean of precision and recall.
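
To make the accuracy argument concrete, here is a small sketch on toy labels with a 70/30 imbalance (illustrative data, not the notebook's). It assumes scikit-learn 0.22 or newer, where `f1_score` accepts the `zero_division` argument.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with a 70/30 class imbalance (illustrative only).
y_true = np.array([0] * 70 + [1] * 30)

# A degenerate classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids the warning raised when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)

print(accuracy)  # 0.7
print(f1)        # 0.0
```

The majority-class predictor reaches 70% accuracy while its F1 score for the positive class is zero, which is why accuracy alone is a poor optimization target here.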

## Grids

In [5]:
randomforest_grid = {
    'criterion': ['entropy'],
    'n_estimators': [10,25,50,100],
    'max_depth': [1,3,5,None],
    'max_features': [5,10,15, None],
    'class_weight': ['balanced']
}

adaboost_grid = {
    'n_estimators': [10,25,50,100],
    'learning_rate': [0.1,0.2,0.3],
    'base_estimator__max_depth': [1,3,5,None],
    'base_estimator__criterion': ['entropy'],
    'base_estimator__class_weight': ['balanced'],
    
}

xgb_grid = {
    'objective': ['binary:logistic'],
    'learning_rate': [0.1,0.2,0.3],
    'n_estimators': [10,25,50,100],
    'max_depth': [1,3,5,7,9],
    'scale_pos_weight': [n_nonclickbait / n_clickbait] 
}

# This classifier does not really have parameters to tweak
# Use default values to not break the pipeline
naivebayes_grid = {
    'priors': [None], # priors are computed by the algorithm
    'var_smoothing': [1e-9]
}

svc_grid = {
    'kernel': ['linear', 'rbf']
}
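
The cost of a grid search grows with the product of the option counts. As a rough sketch, scikit-learn's `ParameterGrid` can count the candidate settings; the grid below is a copy of the random forest grid above, and `example_grid` is just an illustrative name.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the random forest grid defined above.
example_grid = {
    'criterion': ['entropy'],
    'n_estimators': [10, 25, 50, 100],
    'max_depth': [1, 3, 5, None],
    'max_features': [5, 10, 15, None],
    'class_weight': ['balanced'],
}

# 1 * 4 * 4 * 4 * 1 = 64 candidate parameter combinations.
n_candidates = len(ParameterGrid(example_grid))

# Each candidate is fitted once per fold in 10-fold cross validation.
n_folds = 10
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 64 640
```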

## Define classifiers

In [14]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

classifiers_options = [
    {
        'name': 'RandomForest',
        'clf': RandomForestClassifier(),
#         'grid': randomforest_grid,
        'optimized_param': {'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 5, 'max_features': 15, 'n_estimators': 50}
    },
    {
        'name': 'XGDBoost',
        'clf': XGBClassifier(),
#         'grid': xgb_grid,
        'optimized_param': {'learning_rate': 0.2, 'max_depth': 1, 'n_estimators': 100, 'objective': 'binary:logistic', 'scale_pos_weight': 2.227034120734908}
    },
    {
        'name': 'GaussianNaiveBays',
        'clf': GaussianNB(),
        'optimized_param': {}  # Cannot be optimized
    },
    {
        'name': 'AdaBoost',
        'clf': AdaBoostClassifier(DecisionTreeClassifier()),
#         'grid': adaboost_grid
        'optimized_param': {'base_estimator__class_weight': 'balanced', 'base_estimator__criterion': 'entropy', 'base_estimator__max_depth': 1, 'learning_rate': 0.3, 'n_estimators': 100}
    },
    # Takes too long to optimize, so we assume that the linear kernel is good enough
#     {
#         'name': 'SVM',
#         'clf': SVC(probability=True),
# # #         'grid': svc_grid,
#         'optimized_param': {'kernel': 'linear'}
#     }
]

# `classification` is assumed to be this project's local helper module (not part of scikit-learn)
import classification
classifiers = classification.Classifiers(features, labels, classifiers_options)


## Perform the grid search - optimize on F1

This may take a while.  
The output lists the optimized settings per classifier, which can be passed to the model through the `optimized_param` key.

In [7]:
classifiers.optimize(metric='f1')

-- Start optimizing by grid search --
Optimizing: RandomForest
Optimal settings RandomForest:
{'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 5, 'max_features': 15, 'n_estimators': 50}
Optimizing: XGDBoost
Optimal settings XGDBoost:
{'learning_rate': 0.2, 'max_depth': 1, 'n_estimators': 100, 'objective': 'binary:logistic', 'scale_pos_weight': 2.227034120734908}
Optimizing: AdaBoost
Optimal settings AdaBoost:
{'base_estimator__class_weight': 'balanced', 'base_estimator__criterion': 'entropy', 'base_estimator__max_depth': 1, 'learning_rate': 0.3, 'n_estimators': 100}
-- Finished optimizing -- 
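
The internals of `classifiers.optimize` are not shown in this notebook. A search with the settings described earlier (10 shuffled folds, F1 as the metric) could be sketched with scikit-learn's `GridSearchCV` as follows. The synthetic data, the reduced `small_grid`, and the fixed `random_state` values are assumptions chosen to keep the example quick and reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic, imbalanced stand-in data (assumption; not the notebook's features).
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

# A reduced grid so the sketch runs in a few seconds.
small_grid = {
    'n_estimators': [10, 25],
    'max_depth': [1, 3],
    'class_weight': ['balanced'],
}

# 10-fold cross validation on shuffled data.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scoring='f1' selects the candidate with the best mean F1 on the positive class.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      small_grid, scoring='f1', cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.best_params_` has the same shape as the dictionaries printed above, so it could be copied into an `optimized_param` entry.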


## Performance - optimized on F1

The authors test their classifiers with a 10-fold internal (?) cross validation. For a fair comparison, we use the same strategy.

In [21]:
classifiers.cross_val()

-- Cross validation with 10-folds --
Cross validation performance RandomForest:
TRAIN
train_accuracy     0.790792
train_precision    0.651681
train_f1           0.674160
train_roc_auc      0.850985
train_recall       0.698416

TEST
test_accuracy     0.725504
test_precision    0.551277
test_f1           0.569722
test_roc_auc      0.744672
test_recall       0.594166


Cross validation performance XGDBoost:
TRAIN
train_accuracy     0.747594
train_precision    0.574388
train_f1           0.637206
train_roc_auc      0.809124
train_recall       0.715509

TEST
test_accuracy     0.715718
test_precision    0.533675
test_f1           0.585330
test_roc_auc      0.760781
test_recall       0.653341


Cross validation performance GaussianNaiveBays:
TRAIN
train_accuracy     0.436219
train_precision    0.341750
train_f1           0.492686
train_roc_auc      0.706546
train_recall       0.883042

TEST
test_accuracy     0.433956
test_precision    0.340102
test_f1           0.487934
test_roc_auc      0.68
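
How `classifiers.cross_val` computes these numbers is not shown. The metric names in the output match the keys returned by scikit-learn's `cross_validate`, so a comparable report could be sketched as below. The synthetic data and the choice of `GaussianNB` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (assumption; not the notebook's features).
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

scoring = ['accuracy', 'precision', 'f1', 'roc_auc', 'recall']

# return_train_score=True adds train_<metric> keys next to the test_<metric> keys.
scores = cross_validate(GaussianNB(), X, y, cv=10,
                        scoring=scoring, return_train_score=True)

# Report the mean over the 10 folds for every metric.
for split in ('train', 'test'):
    print(split.upper())
    for metric in scoring:
        key = f'{split}_{metric}'
        print(f'{key:<18}{scores[key].mean():.6f}')
```

A large gap between the train and test means of the same metric is a sign of overfitting.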

The results show that some classifiers have an extremely low recall. Cross validation does not give us access to the individual predictions, so to analyse the results we run another test on a random hold-out subset. This gives us access to the confusion matrices.
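
A hold-out test of this kind could be sketched as follows. The implementation of `classifiers.test` is not shown, so the synthetic data, the stratified 70/30 split, and the classifier settings are assumptions. The sketch also shows the two AUC variants: one computed on hard 0/1 predictions and one on predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption; not the notebook's features).
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

# 70% train, 30% test; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, class_weight='balanced',
                             random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                 # hard 0/1 predictions
y_score = clf.predict_proba(X_test)[:, 1]    # probability of the positive class

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
recall_positive = tp / (tp + fn)

# AUC on hard labels uses a single operating point; AUC on probabilities
# uses the full ranking of the test samples.
auc_binary = roc_auc_score(y_test, y_pred)
auc_proba = roc_auc_score(y_test, y_score)

print(cm)
print(recall_positive, auc_binary, auc_proba)
```

The same reading applies to the confusion matrices printed by the next cell: in the RandomForest matrix, 123 / (94 + 123) ≈ 0.57 matches the reported clickbait recall.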

In [22]:
classifiers.test()

-- Performance on split: 70% train - 30% split --
Test performance: RandomForest
              precision    recall  f1-score   support

no-clickbait       0.81      0.78      0.80       521
   clickbait       0.52      0.57      0.54       217

   micro avg       0.72      0.72      0.72       738
   macro avg       0.67      0.67      0.67       738
weighted avg       0.73      0.72      0.72       738

AUC on binary labels: 0.6740051478457769
AUC on probabilities: 0.7432312904110315
Confusion matrix:
[[407 114]
 [ 94 123]]


Test performance: XGDBoost
              precision    recall  f1-score   support

no-clickbait       0.79      0.74      0.76       500
   clickbait       0.52      0.58      0.55       238

   micro avg       0.69      0.69      0.69       738
   macro avg       0.65      0.66      0.66       738
weighted avg       0.70      0.69      0.69       738

AUC on binary labels: 0.6609159663865547
AUC on probabilities: 0.7283109243697479
Confusion matrix:
[[371 129]
 [