# Replicating paper results
This notebook replicates the results reported in the paper.



In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import pandas as pd
import numpy as np
from os import path

data_path = '../features/2019-04-04_202426'
feature_path = 'features.pkl'
label_path = 'labels.npy'

features = pd.read_pickle(path.join(data_path, feature_path))
labels = np.load(path.join(data_path, label_path))

## Compute class priors

In [4]:
_, counts = np.unique(labels, return_counts=True)

n_nonclickbait = counts[0]
n_clickbait = counts[1]
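
The counts above also give the class priors and the negative-to-positive ratio that appears later as a `scale_pos_weight` candidate. Below is a minimal sketch on toy data, assuming the labels are encoded as 0 = no-clickbait and 1 = clickbait; `toy_labels` is a hypothetical stand-in for `labels`.

```python
import numpy as np

# Toy stand-in for `labels`; assumes 0 = no-clickbait and 1 = clickbait.
toy_labels = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# np.unique returns the sorted unique values, so counts[0] belongs to label 0.
values, counts = np.unique(toy_labels, return_counts=True)

# Class priors: the fraction of samples in each class.
priors = counts / counts.sum()

# Ratio of negatives to positives, a common starting value for
# XGBoost's `scale_pos_weight` on imbalanced data.
neg_pos_ratio = counts[0] / counts[1]

print(dict(zip(values.tolist(), priors.tolist())))
print(neg_pos_ratio)
```

If the label encoding were reversed, `counts[0]` and `counts[1]` would swap meaning, so it is worth checking the encoding before reusing the ratio.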


## Features
Below is a list of the features that are used.

**Note**:
The features are a subset of the features used in the original paper.
We expect the scores to be slightly lower because we have less information about the post and the article. Still, the overall trend is expected to be the same, because the original results also show that using all features instead of 20 features gives only a marginal improvement.

Expected results on the training set

| Measure   | Interval      |
|-----------|---------------|
| AUC       | 0.583 - 0.715 |
| Accuracy  | 0.636 - 0.732 |
| Precision | 0.743 - 0.75  |
| Recall    | 0.721 - 0.92  |


Expected results on the validation set

| Measure   | Interval      |
|-----------|---------------|
| AUC       | 0.653 - 0.8   |
| Accuracy  | 0.725 - 0.812 |
| Precision | 0.814 - 0.824 |
| Recall    | 0.811 - 0.966 |


In [5]:
print(features.columns)

Index(['numChars_post_title', 'numChars_post_image', 'numChars_article_kw',
       'numChars_article_desc', 'numChars_article_title',
       'numChars_article_par', 'numWords_post_title', 'numWords_post_image',
       'numWords_article_kw', 'numWords_article_desc',
       ...
       'diffStopWords_post_image_article_kw',
       'diffStopWords_post_image_article_desc',
       'diffStopWords_post_image_article_title',
       'diffStopWords_post_image_article_par',
       'diffStopWords_article_kw_article_desc',
       'diffStopWords_article_kw_article_title',
       'diffStopWords_article_kw_article_par',
       'diffStopWords_article_desc_article_title',
       'diffStopWords_article_desc_article_par',
       'diffStopWords_article_title_article_par'],
      dtype='object', length=150)


## Optimization by grid search

The authors did not report which parameters their optimal classifiers used, nor how these classifiers were optimized.

Therefore, we apply a grid search to find suitable parameters (such as the number of estimators). The grids are given below.

The following settings are used during this grid search: 
- **10-fold cross validation with shuffled data.** Shuffling reduces bias in our estimate, and the folds differ from the ones used during testing.
- **F1 score as performance measure.** The clickbait and no-clickbait classes are imbalanced, which makes a measure such as accuracy misleading (70% accuracy could mean that every post is assigned to one class!). As we want the performance on both classes to be equally good, we use the F1 score, which is the harmonic mean of precision and recall.
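
The settings above can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration on synthetic data, not the code behind `classifiers.optimize`; the toy data, the reduced `small_grid`, and the use of stratified folds are assumptions of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for `features` / `labels` (roughly 70/30).
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

# Deliberately tiny grid so the sketch runs quickly.
small_grid = {'n_estimators': [10, 25], 'max_depth': [1, 3]}

# 10 shuffled folds; stratification is an assumption of this sketch.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    param_grid=small_grid,
    scoring='f1',  # F1 of the positive class (label 1)
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Note that `scoring='f1'` scores only the positive class; `'f1_macro'` would average the F1 scores of both classes.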

## GRIDS:

In [6]:
randomforest_grid = {
    'criterion': ['entropy', 'gini'],
    'n_estimators': [10,25,50,100],
    'max_depth': [1,3,5,None],
    'max_features': [5,10,15, None],
    'class_weight': ['balanced', None]
}

adaboost_grid = {
    'n_estimators': [10,25,50,100],
    'learning_rate': [0.1,0.2,0.3],
    'base_estimator__max_depth': [1,3,5,None],
    'base_estimator__criterion': ['entropy', 'gini'],
    'base_estimator__class_weight': ['balanced', None]
    
}

xgb_grid = {
    'objective': ['binary:logistic'],
    'learning_rate': [0.1,0.2,0.3],
    'n_estimators': [10,25,50,100],
    'max_depth': [1,3,5,7,9],
    'scale_pos_weight': [1, n_nonclickbait / n_clickbait, n_clickbait / n_nonclickbait] 
}

# This classifier does not really have parameters to tweak
# Use default values to not break the pipeline
naivebayes_grid = {
    'priors': [None], # priors are computed by the algorithm
    'var_smoothing': [1e-9]
}

svc_grid = {
    'kernel': ['linear', 'rbf']
}

## Define classifiers

In [7]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

classifiers_options = [
    {
        'name': 'RandomForest',
        'clf': RandomForestClassifier(),
#         'grid': randomforest_grid
        'optimized_param': {'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 3, 'max_features': 15, 'n_estimators': 100}
    },
    {
        'name': 'XGDBoost',
        'clf': XGBClassifier(),
#         'grid': xgb_grid
        'optimized_param': {'learning_rate': 0.3, 'max_depth': 3, 'n_estimators': 25, 'objective': 'binary:logistic', 'scale_pos_weight': 2.227034120734908}
    },
    {
        'name': 'GaussianNaiveBays',
        'clf': GaussianNB(),
        'optimized_param': {} ## Can not be optimized

    },
    {
        'name': 'AdaBoost',
        'clf': AdaBoostClassifier(DecisionTreeClassifier()),
#         'grid': adaboost_grid
        'optimized_param': {'base_estimator__class_weight': 'balanced', 'base_estimator__criterion': 'gini', 'base_estimator__max_depth': 1, 'learning_rate': 0.2, 'n_estimators': 50}
    },
    # Takes too long to optimize, so we assume that the linear kernel is good enough
#     {
#         'name': 'SVM',
#         'clf': SVC(probability=True),
# # #         'grid': svc_grid,
#         'optimized_param': {'kernel': 'linear'}
#     }
]

import classification
classifiers = classification.Classifiers(features,labels, classifiers_options)


## Perform the grid search - optimize on F1

This may take a while.  
The output lists the optimized settings per classifier, which can be passed to the model through the `optimized_param` key.

In [18]:
classifiers.optimize(metric='f1')

-- Start optimizing by grid search --
Optimizing: RandomForest
Optimal settings RandomForest:
{'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 3, 'max_features': 15, 'n_estimators': 100}
Optimizing: XGDBoost
Optimal settings XGDBoost:
{'learning_rate': 0.3, 'max_depth': 3, 'n_estimators': 25, 'objective': 'binary:logistic', 'scale_pos_weight': 2.227034120734908}
Optimizing: AdaBoost
Optimal settings AdaBoost:
{'base_estimator__class_weight': 'balanced', 'base_estimator__criterion': 'gini', 'base_estimator__max_depth': 1, 'learning_rate': 0.2, 'n_estimators': 50}
-- Finished optimizing -- 


## Performance - optimized on F1

The authors test their classifiers with a 10-fold internal (?) cross validation. For a fair comparison, we use the same strategy.
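
The internals of `classifiers.cross_val()` are not shown in this notebook. The `train_*` and `test_*` names in its output match the keys returned by scikit-learn's `cross_validate`, so one plausible way to obtain such numbers is sketched below. The synthetic data and the stratified, shuffled folds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for `features` / `labels`.
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

scoring = ['accuracy', 'precision', 'f1', 'roc_auc', 'recall']
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# return_train_score=True adds the train_* entries next to the test_* ones.
scores = cross_validate(GaussianNB(), X, y, cv=cv,
                        scoring=scoring, return_train_score=True)

# Average each metric over the 10 folds.
for key in sorted(scores):
    if key.startswith(('train_', 'test_')):
        print(f'{key:16s} {scores[key].mean():.6f}')
```

A large gap between the `train_*` and `test_*` averages for the same metric is a sign of overfitting.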

In [8]:
classifiers.cross_val()

-- Cross validation with 10-folds --
Cross validation performance RandomForest:
TRAIN
train_accuracy     0.720438
train_precision    0.542794
train_f1           0.580049
train_roc_auc      0.768475
train_recall       0.623038

TEST
test_accuracy     0.671417
test_precision    0.476475
test_f1           0.507665
test_roc_auc      0.691357
test_recall       0.546137


Cross validation performance XGDBoost:
TRAIN
train_accuracy     0.808730
train_precision    0.660463
train_f1           0.718524
train_roc_auc      0.890174
train_recall       0.787892

TEST
test_accuracy     0.670199
test_precision    0.474328
test_f1           0.515638
test_roc_auc      0.702012
test_recall       0.571936


Cross validation performance GaussianNaiveBays:
TRAIN
train_accuracy     0.629974
train_precision    0.443805
train_f1           0.516139
train_roc_auc      0.675467
train_recall       0.636787

TEST
test_accuracy     0.622641
test_precision    0.436469
test_f1           0.506364
test_roc_auc      0.65

The results show that some classifiers have an extremely low recall. Cross validation does not give us access to the actual predictions, so to analyse the results we run another test on a random subset. This gives us access to the confusion matrices.
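
A hold-out test of this kind can be sketched as follows. This is a minimal illustration on synthetic data, not the implementation of `classifiers.test()`; the stratified split and the fixed seed are assumptions. It also shows the difference between an AUC computed on hard 0/1 predictions and one computed on predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for `features` / `labels`.
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

# 70% train, 30% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)              # hard 0/1 predictions
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

print(classification_report(y_test, y_pred,
                            target_names=['no-clickbait', 'clickbait']))
# AUC on hard predictions evaluates a single operating point,
# AUC on probabilities evaluates the ranking over all thresholds.
print('AUC on binary labels:', roc_auc_score(y_test, y_pred))
print('AUC on probabilities:', roc_auc_score(y_test, y_prob))

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(cm)
```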

In [9]:
classifiers.test()

-- Performance on split: 70% train - 30% split --
Test performance: RandomForest
              precision    recall  f1-score   support

no-clickbait       0.77      0.71      0.74       513
   clickbait       0.44      0.52      0.47       225

   micro avg       0.65      0.65      0.65       738
   macro avg       0.60      0.61      0.61       738
weighted avg       0.67      0.65      0.66       738

AUC on binary labels: 0.6125536062378167
AUC on probabilities: 0.6265713666883258
Confusion matrix:
[[364 149]
 [109 116]]


Test performance: XGDBoost
              precision    recall  f1-score   support

no-clickbait       0.78      0.70      0.74       519
   clickbait       0.43      0.54      0.48       219

   micro avg       0.65      0.65      0.65       738
   macro avg       0.61      0.62      0.61       738
weighted avg       0.68      0.65      0.66       738

AUC on binary labels: 0.6191173753530235
AUC on probabilities: 0.6707577797133581
Confusion matrix:
[[363 156]
 [

## Performance - optimized on Recall

The results above show that the classifiers struggle to reach a good recall score, especially on the clickbait class, which has fewer samples. This means that only a fraction of the actual clickbait posts is classified as clickbait.

The paper reports an outstanding recall score that is higher than all other reported metrics. This could indicate that the authors decided to optimize on recall. As we have a binary classification problem (clickbait vs no-clickbait), recall can roughly be seen as the probability of detection (source: Wikipedia). Optimizing on this metric makes sense in the clickbait context, as we want to make sure that clickbait is detected (and thus make few false negatives).
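
As a worked example, recall, precision and F1 for the clickbait class can be recomputed from the RandomForest confusion matrix printed in the F1-optimized test above. The sketch assumes the usual scikit-learn layout: rows are true classes, columns are predicted classes, in the order (no-clickbait, clickbait).

```python
import numpy as np

# RandomForest confusion matrix from the F1-optimized hold-out test above.
cm = np.array([[364, 149],
               [109, 116]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)     # detected clickbait / all actual clickbait
precision = tp / (tp + fp)  # correct clickbait calls / all clickbait calls
f1 = 2 * precision * recall / (precision + recall)

print(f'recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}')
```

The values agree with the clickbait row of the corresponding classification report (precision 0.44, recall 0.52, f1-score 0.47): 109 of the 225 clickbait posts in that split were missed.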



In [12]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

classifiers_options = [
    {
        'name': 'RandomForest',
        'clf': RandomForestClassifier(),
#         'grid': randomforest_grid
        'optimized_param': {'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 3, 'max_features': 15, 'n_estimators': 10}
    },
    {
        'name': 'XGDBoost',
        'clf': XGBClassifier(),
#         'grid': xgb_grid
        'optimized_param': {'learning_rate': 0.1, 'max_depth': 1, 'n_estimators': 100, 'objective': 'binary:logistic', 'scale_pos_weight': 2.227034120734908}
    },
    {
        'name': 'GaussianNaiveBays',
        'clf': GaussianNB(),
        'optimized_param': {} ## Can not be optimized
    },
    {
        'name': 'AdaBoost',
        'clf': AdaBoostClassifier(DecisionTreeClassifier()),
#         'grid': adaboost_grid
        'optimized_param': {'base_estimator__class_weight': 'balanced', 'base_estimator__criterion': 'gini', 'base_estimator__max_depth': 1, 'learning_rate': 0.3, 'n_estimators': 10} 
    },
#     {
#         'name': 'SVM',
#         'clf': SVC(probability=True),
# #         'grid': svc_grid,
#         'optimized_param': {'kernel': 'linear'}
#     }
]

import classification
classifiers = classification.Classifiers(features,labels, classifiers_options)

In [11]:
classifiers.optimize(metric='recall')

-- Start optimizing by grid search --
Optimizing: RandomForest
Optimal settings RandomForest:
{'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 3, 'max_features': 15, 'n_estimators': 10}
Optimizing: XGDBoost
Optimal settings XGDBoost:
{'learning_rate': 0.1, 'max_depth': 1, 'n_estimators': 100, 'objective': 'binary:logistic', 'scale_pos_weight': 2.227034120734908}
Optimizing: AdaBoost
Optimal settings AdaBoost:
{'base_estimator__class_weight': 'balanced', 'base_estimator__criterion': 'gini', 'base_estimator__max_depth': 1, 'learning_rate': 0.3, 'n_estimators': 10}
-- Finished optimizing -- 


In [13]:
classifiers.cross_val()

-- Cross validation with 10-folds --
Cross validation performance RandomForest:
TRAIN
train_accuracy     0.698657
train_precision    0.511935
train_f1           0.553291
train_roc_auc      0.737571
train_recall       0.603119

TEST
test_accuracy     0.655147
test_precision    0.449339
test_f1           0.482847
test_roc_auc      0.665696
test_recall       0.524120


Cross validation performance XGDBoost:
TRAIN
train_accuracy     0.702770
train_precision    0.516060
train_f1           0.579131
train_roc_auc      0.763872
train_recall       0.659901

TEST
test_accuracy     0.658804
test_precision    0.460696
test_f1           0.511556
test_roc_auc      0.704153
test_recall       0.580045


Cross validation performance GaussianNaiveBays:
TRAIN
train_accuracy     0.629208
train_precision    0.443776
train_f1           0.515361
train_roc_auc      0.675457
train_recall       0.637850

TEST
test_accuracy     0.615701
test_precision    0.431155
test_f1           0.500378
test_roc_auc      0.65

In [14]:
classifiers.test()

-- Performance on split: 70% train - 30% split --
Test performance: RandomForest
              precision    recall  f1-score   support

no-clickbait       0.78      0.69      0.73       510
   clickbait       0.45      0.57      0.50       228

   micro avg       0.65      0.65      0.65       738
   macro avg       0.62      0.63      0.62       738
weighted avg       0.68      0.65      0.66       738

AUC on binary labels: 0.6311661506707946
AUC on probabilities: 0.666047471620227
Confusion matrix:
[[353 157]
 [ 98 130]]


Test performance: XGDBoost
              precision    recall  f1-score   support

no-clickbait       0.83      0.67      0.74       530
   clickbait       0.44      0.65      0.53       208

   micro avg       0.67      0.67      0.67       738
   macro avg       0.64      0.66      0.64       738
weighted avg       0.72      0.67      0.68       738

AUC on binary labels: 0.6637155297532655
AUC on probabilities: 0.7220564223512336
Confusion matrix:
[[357 173]
 [ 

## TODOS
- **Scaling:** test the performance when the features are scaled; this probably also requires optimizing again
- **Feature reduction:** only use the top k features based on the mutual information criterion
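
Both TODO items could be explored with a scikit-learn `Pipeline`. The following is a minimal sketch on synthetic data, not a tested part of this replication; the value `k=10`, the choice of `GaussianNB`, and the toy data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for `features` / `labels`.
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           weights=[0.7, 0.3], random_state=0)

pipeline = Pipeline([
    ('scale', StandardScaler()),                         # zero mean, unit variance
    ('select', SelectKBest(mutual_info_classif, k=10)),  # keep the top k features
    ('clf', GaussianNB()),
])

# Scaling and selection are refit inside every fold, which avoids
# leaking information from the held-out fold into the preprocessing.
scores = cross_val_score(pipeline, X, y, cv=10, scoring='f1')
print(scores.mean())
```

Wrapping the steps in a pipeline also allows `k` to be tuned in a grid search through the parameter name `select__k`.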