# Clickbait Headline Identification
In this experiment you are to pit three cross-validated models against each other:

   - k-nearest neighbors
   - Naïve Bayes
   - Multilayer perceptron

Each model must be cross-validated using a 5-fold cross validator (for example, in the k-nearest neighbors case, the value of the hyperparameter "k" must be selected through cross-validation).  In the case of the neural network, select 10 network configurations (layer sizes) to use in your experiment where the "winner" is selected by cross validation.

In addition to your computations, at the end of your notebook, please include a markdown block indicating:

   - What data representation you used (counts or Tfidf);
   - What metric you selected to rank the models;
   - How each model scored on each metric both on testing (give the mean CV result +/- the standard deviation) and training data;
   - What values of the hyperparameters gave optimal results in the cross validation;
   - Describe a way in which your classifier could be used as a plugin for a web browser, for example.

In [1]:
# Import Statements
import math
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn as sk
from IPython.display import display
from numpy import average, std
# plot_confusion_matrix is not imported: it was removed in scikit-learn 1.2 and is unused here
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error, \
    average_precision_score, make_scorer
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, cross_val_score, cross_val_predict
from sklearn.preprocessing import LabelEncoder



In [2]:
# Dataset imports
# first import separate datasets for Clickbait and nonClickbait

clickBaitDF = pd.read_table('data/clickbait_data',header=None, names=['message'])
display(clickBaitDF)
nonClickBaitDF = pd.read_table('data/non_clickbait_data',header=None, names=['message'])
display(nonClickBaitDF)

Unnamed: 0,message
0,Should I Get Bings
1,Which TV Female Friend Group Do You Belong In
2,"The New ""Star Wars: The Force Awakens"" Trailer..."
3,"This Vine Of New York On ""Celebrity Big Brothe..."
4,A Couple Did A Stunning Photo Shoot With Their...
...,...
15994,"There Was A Mini ""Sisterhood Of The Traveling ..."
15995,21 Dogs Who Are Thankful For Their Best Friends
15996,People Are Proving No Dick Is Too Big By Dropp...
15997,"I'm An Atheist, But I'm Not"


Unnamed: 0,message
0,Bill Changing Credit Card Rules Is Sent to Oba...
1,"In Hollywood, the Easy-Money Generation Toughe..."
2,1700 runners still unaccounted for in UK's Lak...
3,Yankees Pitchers Trade Fielding Drills for Put...
4,Large earthquake rattles Indonesia; Seventh in...
...,...
15996,"To Make Female Hearts Flutter in Iraq, Throw a..."
15997,"British Liberal Democrat Patsy Calton, 56, die..."
15998,Drone smartphone app to help heart attack vict...
15999,"Netanyahu Urges Pope Benedict, in Israel, to D..."


With the datasets imported into dataframes, we can start thinking about features and targets.
Since the goal is to predict which headlines are likely to be 'Clickbait', we'll add a target column to both datasets
and combine the results into a single pool that we can sample from.

In [3]:
# Label every row with its class; a scalar is broadcast to the whole column
clickBaitDF['target'] = 'clickbait'
nonClickBaitDF['target'] = 'non_clickbait'
# Stack both frames into a single pool (the original row indices are kept)
headlinesDataset = pd.concat([clickBaitDF, nonClickBaitDF])
display(headlinesDataset)

Unnamed: 0,message,target
0,Should I Get Bings,clickbait
1,Which TV Female Friend Group Do You Belong In,clickbait
2,"The New ""Star Wars: The Force Awakens"" Trailer...",clickbait
3,"This Vine Of New York On ""Celebrity Big Brothe...",clickbait
4,A Couple Did A Stunning Photo Shoot With Their...,clickbait
...,...,...
15996,"To Make Female Hearts Flutter in Iraq, Throw a...",non_clickbait
15997,"British Liberal Democrat Patsy Calton, 56, die...",non_clickbait
15998,Drone smartphone app to help heart attack vict...,non_clickbait
15999,"Netanyahu Urges Pope Benedict, in Israel, to D...",non_clickbait


In [4]:
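# Build the feature matrix and the target vector
X = headlinesDataset['message']
y = headlinesDataset['target']
# Convert each headline into a sparse TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)
# Encode the string labels as integers: 'clickbait' -> 0, 'non_clickbait' -> 1
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
print(y)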

[0 0 0 ... 1 1 1]
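
The cell above fits the `TfidfVectorizer` on every headline before any split, so the vocabulary and IDF weights are learned from rows that later serve as validation data. A minimal sketch of one way to avoid that is to put the vectorizer inside the pipeline, so that it is refit on the training portion of each fold. The headlines in `toy_headlines` and the labels in `toy_labels` are made up for illustration and are not taken from the dataset; the 3-fold split is only chosen because the toy set is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, invented for illustration only.
toy_headlines = [
    "21 Dogs Who Will Make You Smile",
    "Which Pizza Topping Are You",
    "You Won't Believe What This Cat Did",
    "17 Things Only Tall People Understand",
    "This Quiz Will Reveal Your Age",
    "Here Is Why Everyone Loves This Trick",
    "Senate passes budget bill after long debate",
    "Earthquake strikes coastal region overnight",
    "Central bank raises interest rates again",
    "Court upholds ruling in patent dispute",
    "Minister resigns amid funding inquiry",
    "Storm closes major airport for two days",
]
toy_labels = [0] * 6 + [1] * 6  # 0 = clickbait, 1 = non_clickbait

# The vectorizer is a pipeline step, so each fold learns its own vocabulary
# from the training rows only.
text_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_accuracy = cross_val_score(
    text_pipeline, toy_headlines, toy_labels, cv=cv, scoring="accuracy"
)
print(fold_accuracy)
```

With this layout the raw strings, not the precomputed matrix, are what gets passed to `GridSearchCV` or `cross_validate`.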


#     k-nearest neighbors

In [5]:
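# Hold out a third of the data; the split is random and differs on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
knnPipeline = Pipeline([
    ('classifier', KNeighborsClassifier(n_neighbors=1))
])
# range(1, 5) tries k = 1, 2, 3, 4
param_grid = [
    {'classifier__n_neighbors': list(range(1, 5))}
]
# cv=5 makes the required 5-fold cross-validation explicit (it is also the default)
knnModel = GridSearchCV(knnPipeline, param_grid=param_grid, cv=5)
knnModel.fit(X_train, y_train)
print(knnModel.best_params_)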

{'classifier__n_neighbors': 3}
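
`best_params_` only reports the winner. As a rough sketch of how the mean and standard deviation of the validation score could be read for every candidate `k`, the fitted search object exposes `cv_results_`. The data below comes from `make_classification` and is a synthetic stand-in, not the headline matrix, so the numbers it prints say nothing about the clickbait task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: used only to keep the sketch self-contained).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 5))},  # k = 1, 2, 3, 4
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One row per candidate: mean and std of the 5 validation-fold scores.
results = search.cv_results_
for params, mean_score, std_score in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(f"k={params['n_neighbors']}: {mean_score:.3f} +/- {std_score:.3f}")
```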


In [6]:
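# Scoring KNN: passing the GridSearchCV object to cross_validate gives nested
# cross-validation (k is re-selected inside each of the 5 outer folds)
# Note: 'average_precision' summarizes the precision-recall curve; it is not plain precision
scores = {'precision': 'average_precision', 'acc': 'accuracy'}
train_scores = cross_validate(knnModel, X_train, y_train,
                              cv=KFold(n_splits=5),
                              scoring=scores,
                              return_train_score=True)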

In [7]:
train_accuracy = train_scores.get('train_acc')
train_precision = train_scores.get('train_precision')
test_accuracy = train_scores.get('test_acc')
test_precision = train_scores.get('test_precision')
knn_avg_train_accuracy = average(train_accuracy)
knn_std_train_accuracy = std(train_accuracy)
knn_avg_train_precision = average(train_precision)
knn_std_train_precision = std(train_precision)
knn_avg_test_accuracy = average(test_accuracy)
knn_std_test_accuracy = std(test_accuracy)
knn_avg_test_precision = average(test_precision)
knn_std_test_precision = std(test_precision)


#     Naïve Bayes

In [8]:
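# A fresh random split: the models are therefore not compared on identical rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
bayesianPipeline = Pipeline([
    ('classifier', MultinomialNB())
])
# Candidate values for the additive smoothing parameter alpha
param_grid = [
    {'classifier__alpha': [0.01, 0.1, 0.3, 0.6, 0.9, 1]}
]
# cv=5 makes the required 5-fold cross-validation explicit (it is also the default)
bayesianModel = GridSearchCV(bayesianPipeline, param_grid=param_grid, cv=5)
bayesianModel.fit(X_train, y_train)
print(bayesianModel.best_params_)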

{'classifier__alpha': 0.1}
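
For context, the `alpha` searched over here is the additive (Laplace/Lidstone) smoothing term of `MultinomialNB`. As described in the scikit-learn documentation, the smoothed estimate of the likelihood of feature $i$ in class $y$ is:

```latex
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}
```

Here $N_{yi}$ is the total weight of feature $i$ over the training rows of class $y$ (with this notebook's representation, a sum of TF-IDF values rather than raw counts), $N_{y} = \sum_{i} N_{yi}$, and $n$ is the number of features. A smaller `alpha` such as 0.1 gives terms never seen in a class a smaller share of the probability mass than `alpha = 1` would.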


In [9]:
# Scoring the Naive Bayes approach (same nested cross-validation setup as for KNN)
scores = {'precision': 'average_precision', 'acc': 'accuracy'}
train_scores = cross_validate(bayesianModel, X_train, y_train,
                              cv=KFold(n_splits=5),
                              scoring=scores,
                              return_train_score=True)

In [10]:
train_accuracy = train_scores.get('train_acc')
train_precision = train_scores.get('train_precision')
test_accuracy = train_scores.get('test_acc')
test_precision = train_scores.get('test_precision')
bayesian_avg_train_accuracy = average(train_accuracy)
bayesian_std_train_accuracy = std(train_accuracy)
bayesian_avg_train_precision = average(train_precision)
bayesian_std_train_precision = std(train_precision)
bayesian_avg_test_accuracy = average(test_accuracy)
bayesian_std_test_accuracy = std(test_accuracy)
bayesian_avg_test_precision = average(test_precision)
bayesian_std_test_precision = std(test_precision)
print(train_scores)


{'fit_time': array([0.1605711 , 0.1665554 , 0.14660811, 0.16954732, 0.15461969]), 'score_time': array([0.00299191, 0.00199437, 0.00199533, 0.00299168, 0.00199413]), 'test_precision': array([0.99721499, 0.99747512, 0.99664473, 0.99626934, 0.9956693 ]), 'train_precision': array([0.99963603, 0.99965255, 0.99945385, 0.99969622, 0.9995335 ]), 'test_acc': array([0.97621269, 0.97318097, 0.96991604, 0.97108209, 0.96758396]), 'train_acc': array([0.99090485, 0.99090485, 0.98857276, 0.99148787, 0.98973881])}
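
The per-model blocks of `average(...)` and `std(...)` calls all do the same thing to the dictionary returned by `cross_validate`. As a sketch of how that could be condensed, the hypothetical helper `summarize_cv` below (not part of the notebook) returns the mean and population standard deviation for every train and test score. It uses only the standard library; `statistics.pstdev` matches the default behaviour of `numpy.std`. The `test_acc` values are copied from the output printed above.

```python
from statistics import mean, pstdev


def summarize_cv(cv_results, prefixes=("train", "test")):
    """Return {score_name: (mean, std)} for each per-fold score list."""
    wanted = tuple(prefix + "_" for prefix in prefixes)
    summary = {}
    for key, values in cv_results.items():
        if key.startswith(wanted):  # skips fit_time / score_time
            folds = [float(v) for v in values]
            summary[key] = (mean(folds), pstdev(folds))
    return summary


# Per-fold test accuracies copied from the Naive Bayes output above.
example_results = {
    "fit_time": [0.1605711, 0.1665554, 0.14660811, 0.16954732, 0.15461969],
    "test_acc": [0.97621269, 0.97318097, 0.96991604, 0.97108209, 0.96758396],
}

summary = summarize_cv(example_results)
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.4f} +/- {sd:.4f}")
```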


#     Multilayer perceptron

In [17]:
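# A fresh random split, as in the previous two sections
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
mlpPipeline = Pipeline([
    ('classifier', MLPClassifier(max_iter=500))
])
# Ten candidate architectures, each with two hidden layers (first, second)
param_grid = [
    {'classifier__hidden_layer_sizes': [(2, 2), (2, 5), (2, 10),
                                        (5, 2), (5, 5), (5, 10),
                                        (10, 2), (10, 5), (10, 10), (10, 20)]}
]
# cv=5 makes the required 5-fold cross-validation explicit (it is also the default)
mlpModel = GridSearchCV(mlpPipeline, param_grid=param_grid, cv=5)
mlpModel.fit(X_train, y_train)
print(mlpModel.best_params_)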

{'classifier__hidden_layer_sizes': (5,)}
{'classifier__hidden_layer_sizes': (10, 20)}
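
To make the two printed tuples easier to compare: each entry of `hidden_layer_sizes` is the width of one hidden layer, so `(5,)` means a single hidden layer of 5 neurons and `(10, 20)` means two hidden layers of 10 and 20 neurons. The sketch below is a plain-Python illustration, not a scikit-learn call. `weight_shapes` is a hypothetical helper, the input width of 1000 is an arbitrary placeholder rather than the real vocabulary size, and the single output unit is an assumption that matches binary classification.

```python
def weight_shapes(n_features, hidden_layer_sizes, n_outputs=1):
    """List the (inputs, outputs) shape of each weight matrix in the network."""
    sizes = [n_features, *hidden_layer_sizes, n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))


one_layer = weight_shapes(1000, (5,))
two_layers = weight_shapes(1000, (10, 20))
print(one_layer)   # [(1000, 5), (5, 1)]
print(two_layers)  # [(1000, 10), (10, 20), (20, 1)]
```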


In [12]:
# Scoring the MLP approach (same nested cross-validation setup as for KNN)
scores = {'precision': 'average_precision', 'acc': 'accuracy'}
train_scores = cross_validate(mlpModel, X_train, y_train,
                              cv=KFold(n_splits=5),
                              scoring=scores,
                              return_train_score=True)

train_accuracy = train_scores.get('train_acc')
train_precision = train_scores.get('train_precision')
test_accuracy = train_scores.get('test_acc')
test_precision = train_scores.get('test_precision')
mlp_avg_train_accuracy = average(train_accuracy)
mlp_std_train_accuracy = std(train_accuracy)
mlp_avg_train_precision = average(train_precision)
mlp_std_train_precision = std(train_precision)
mlp_avg_test_accuracy = average(test_accuracy)
mlp_std_test_accuracy = std(test_accuracy)
mlp_avg_test_precision = average(test_precision)
mlp_std_test_precision = std(test_precision)
print(train_scores)



# Results

   - Data representation used: Tfidf.
   - Metrics used to rank the models: accuracy and average precision (scikit-learn's `average_precision` scorer, which summarizes the precision-recall curve and is not the same as plain precision). Accuracy tracks overall correctness; average precision tracks how well each model separates the two classes across decision thresholds.
   - Hyperparameters that gave optimal results in the cross validation:
      - KNN: cross-validated with k from 1 to 4. The optimal value was 3.
      - Naïve Bayes: cross-validated with alpha values from 0.01 to 1. The optimal value was 0.1.
      - MLP: cross-validated over ten hidden-layer configurations. The reported optimum was `(5,)`, i.e. one hidden layer of 5 neurons; the grid-search cell also printed `(10, 20)`, which is the only one of the two that appears in the grid shown.
   - Possible use as a browser plugin: a plugin could run the classifier on the headlines of a page and use the result to block or flag clickbait links, to support content moderation, or as one signal of an article's credibility.
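
As a rough sketch of the plugin idea, the filtering step that a browser extension's backend might run could look like the following. Everything here is hypothetical: `flag_clickbait` is an illustrative helper, `toy_score` is a hand-written stand-in for a trained model's clickbait probability (it is not one of the models above), and the last headline in `page_headlines` is made up.

```python
def flag_clickbait(headlines, score_fn, threshold=0.5):
    """Return the headlines whose clickbait score meets the threshold."""
    return [h for h in headlines if score_fn(h) >= threshold]


def toy_score(headline):
    # Hypothetical stand-in for model.predict_proba: scores a headline high
    # only if it starts with a number or with "which"/"this".
    words = headline.split()
    first = words[0].lower() if words else ""
    return 0.9 if first.isdigit() or first in {"which", "this"} else 0.1


page_headlines = [
    "21 Dogs Who Are Thankful For Their Best Friends",
    "Which TV Female Friend Group Do You Belong In",
    "Central bank raises interest rates",  # made-up non-clickbait example
]

flagged = flag_clickbait(page_headlines, toy_score)
print(flagged)
```

In a real plugin, `score_fn` would wrap the fitted vectorizer and classifier, and the extension would hide or annotate the links whose text ends up in `flagged`.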

Scores listed below:


#### K-Nearest Neighbors

In [14]:
print('KNN average train precision: ', knn_avg_train_precision, ' +/- ', knn_std_train_precision)
print('KNN average test precision: ', knn_avg_test_precision, ' +/- ', knn_std_test_precision)
print('KNN average train accuracy: ', knn_avg_train_accuracy, ' +/- ', knn_std_train_accuracy)
print('KNN average test accuracy: ', knn_avg_test_accuracy, ' +/- ', knn_std_test_accuracy)

KNN average train precision:  0.9939542112245213  +/-  0.0002160508816649522
KNN average test precision:  0.9672433150236728  +/-  0.0020032999940039014
KNN average train accuracy:  0.9681086753731343  +/-  0.00040224319754337504
KNN average test accuracy:  0.9326492537313433  +/-  0.0022725611408731163


#### Bayesian Model

In [15]:
print('Bayesian average train precision: ', bayesian_avg_train_precision, ' +/- ', bayesian_std_train_precision)
print('Bayesian average test precision: ', bayesian_avg_test_precision, ' +/- ', bayesian_std_test_precision)
print('Bayesian average train accuracy: ', bayesian_avg_train_accuracy, ' +/- ', bayesian_std_train_accuracy)
print('Bayesian average test accuracy: ', bayesian_avg_test_accuracy, ' +/- ', bayesian_std_test_accuracy)

Bayesian average train precision:  0.9995944274565298  +/-  8.825443342709197e-05
Bayesian average test precision:  0.9966546948669276  +/-  0.0006490966211695306
Bayesian average train accuracy:  0.990321828358209  +/-  0.001042942153684622
Bayesian average test accuracy:  0.9715951492537312  +/-  0.0029328749854967352


#### Multi-Layer Perceptron

In [16]:
print('MLP average train precision: ', mlp_avg_train_precision, ' +/- ', mlp_std_train_precision)
print('MLP average test precision: ', mlp_avg_test_precision, ' +/- ', mlp_std_test_precision)
print('MLP average train accuracy: ', mlp_avg_train_accuracy, ' +/- ', mlp_std_train_accuracy)
print('MLP average test accuracy: ', mlp_avg_test_accuracy, ' +/- ', mlp_std_test_accuracy)

MLP average train precision:  1.0  +/-  4.965068306494546e-17
MLP average test precision:  0.9967605519826179  +/-  0.0003260941418391932
MLP average train accuracy:  0.9999533582089551  +/-  9.32835820895317e-05
MLP average test accuracy:  0.9732276119402986  +/-  0.00112714981096963
