# Text classification of clickbait headlines
## Bag-of-words: count vectorisation

Count vectorisation converts every document into an n-dimensional vector, where n is the size of the corpus vocabulary and each element holds the number of times one vocabulary word occurs in that document. It is one of the simplest ways of turning text into inputs for an ML model.
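To make the idea concrete, here is a minimal from-scratch sketch on a made-up two-document corpus. The corpus and the helper `count_vectorise` are hypothetical and only for illustration; the rest of this notebook uses scikit-learn's `CountVectorizer` instead.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the clickbait dataset
toy_corpus = ["the cat sat", "the cat saw the dog"]

# The vocabulary is every distinct word in the corpus; sorting keeps the column order stable
vocabulary = sorted({word for doc in toy_corpus for word in doc.split()})


def count_vectorise(doc: str, vocabulary: list) -> list:
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [count_vectorise(doc, vocabulary) for doc in toy_corpus]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
for vector in vectors:
    print(vector)  # [1, 0, 1, 0, 1] then [1, 1, 0, 1, 2]
```

Each document becomes a row with one column per vocabulary word, so "the" appearing twice in the second document shows up as a 2 in the last column.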

## Load in dependencies and data

In [20]:
import pandas as pd
import numpy as np

from keras import models, layers

In [38]:
# Read in clickbait data: https://github.com/bhargaviparanjape/clickbait/tree/master/dataset
clickbait_train = pd.read_csv("data/clickbait_train.csv", sep="\t", header=0)
clickbait_val = pd.read_csv("data/clickbait_val.csv", sep="\t", header=0)

In [39]:
clickbait_train[:10]

Unnamed: 0,text,label
0,New insulin-resistance discovery may help diab...,0
1,Gates Group Plans to Give More in 2009 Despite...,0
2,"Heather Graham Rides A Garbage Truck, Remains ...",1
3,Irish Developer Found Dead in His Home,0
4,Boat accident in Democratic Republic of the Co...,0
5,"Here's Where ""Joy"" Went Very, Very Wrong",1
6,Russian ICBM test launch failed again,0
7,17 Misconceptions Sorority Girls Want To Set S...,1
8,"How Well Do You Remember The Intro To ""Danny P...",1
9,Harry Potter Fans Are Paying It Forward And Le...,1


## String cleaning
A usual first step in bag-of-words models is to clean the text strings, for two reasons:
* We usually don't want punctuation to be included as part of the vocabulary.
* We want to homogenise the tokens as much as possible so that the same words end up in the same column.

In [23]:
def apply_string_cleaning(dataset: pd.Series) -> pd.Series:
    """
    Applies a series of string cleaning tasks to a Pandas Series containing string data. The following cleaning
    steps are applied, in order:
    - Convert all text to lowercase
    - Remove strings starting with @ (tags), # (hashtags), `r/` (Reddit sub reference)
      or `u/` (Reddit user reference)
    - Replace all non-alphabetic characters with a space
    - Remove all single character words
    - Collapse repeated whitespace into a single space and strip leading/trailing whitespace
    """

    # Raw strings keep the regex escapes (\w, \s, \b) from being read as Python string escapes
    return (
        dataset
        .str.lower()
        .str.replace(r"@\w+", "", regex=True)
        .str.replace(r"#\w+", "", regex=True)
        # [ur] matches "u" or "r"; the original [u|r] also matched a literal "|"
        .str.replace(r"\s[ur]/\w+", "", regex=True)
        .str.replace(r"[^a-zA-Z]", " ", regex=True)
        .str.replace(r"\b\w\b", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

In [40]:
clickbait_train["text_clean"] = apply_string_cleaning(clickbait_train["text"])
clickbait_val["text_clean"] = apply_string_cleaning(clickbait_val["text"])
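To see what these steps do to a single headline, the sketch below applies the same sequence of regular expressions with the standard `re` module. The helper `clean_headline` and the example headline are hypothetical and only for illustration. Note the side effect on contractions: the apostrophe becomes a space and the leftover single letter is dropped, which is why phrases such as "you won believe" appear in the cleaned text.

```python
import re


def clean_headline(text: str) -> str:
    """Apply the notebook's cleaning steps to one string (illustrative helper)."""
    text = text.lower()
    text = re.sub(r"@\w+", "", text)         # tags
    text = re.sub(r"#\w+", "", text)         # hashtags
    text = re.sub(r"\s[ur]/\w+", "", text)   # Reddit user / sub references
    text = re.sub(r"[^a-z]", " ", text)      # anything non-alphabetic becomes a space
    text = re.sub(r"\b\w\b", "", text)       # single character words
    text = re.sub(r"\s+", " ", text)         # collapse repeated whitespace
    return text.strip()


# Made-up headline for illustration
example = "17 Beaches You Won't Believe Actually Exist In India"
print(clean_headline(example))
# beaches you won believe actually exist in india
```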

## Count vectorise the text

In [41]:
from sklearn.feature_extraction.text import CountVectorizer

countVectoriser = CountVectorizer()
countVectoriser.fit(clickbait_train["text_clean"])

In [42]:
X_train_cv = countVectoriser.transform(clickbait_train["text_clean"]).toarray()
X_val_cv = countVectoriser.transform(clickbait_val["text_clean"]).toarray()
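One consequence of fitting on the training set only is worth spelling out: words that appear in the validation set but not in the training vocabulary have no column, so `transform` silently ignores them. A small sketch with made-up texts (the `toy_*` names are hypothetical) shows this:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts for illustration; "gigantic" never appears in the training texts
toy_train = ["dogs you will believe", "confessions about dogs"]
toy_val = ["gigantic dogs you will believe"]

toy_vectoriser = CountVectorizer()
toy_vectoriser.fit(toy_train)

# The validation row only has columns for words seen during fit
toy_val_counts = toy_vectoriser.transform(toy_val).toarray()

print(sorted(toy_vectoriser.vocabulary_))
print(toy_val_counts)
```

The validation matrix always has the same number of columns as the training matrix, which is what lets the same model be applied to both.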

In [43]:
# Show what the count vectoriser does to each of the texts
pd.DataFrame(X_train_cv).rename(columns={v: k for k, v in countVectoriser.vocabulary_.items()})

Unnamed: 0,aa,aaa,aaevpc,aaron,ab,abandon,abandoned,abandoning,abandons,abba,...,zooey,zoolander,zoombak,zoomed,zotob,zowie,zuckerberg,zuma,zurawski,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19195,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19196,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19197,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19198,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
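# "label" is also a word in the vocabulary, so the merge suffixes the two clashing columns:
# "label_x" is the clickbait label and "label_y" is the count of the word "label"
(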
    pd.merge(clickbait_train[["text_clean", "label"]],
             pd.DataFrame(X_train_cv).rename(columns={v: k for k, v in countVectoriser.vocabulary_.items()}),
             left_index=True,
             right_index=True)
    .query("believe == 1")
    [["text_clean", "believe", "gigantic", "dogs", "confessions", "label_x"]]
    .rename(columns = {"label_x": "clickbait_label"})
)

Unnamed: 0,text_clean,believe,gigantic,dogs,confessions,clickbait_label
3907,beaches you won believe actually exist in india,1,0,0,0,1
4425,dogs who are so gigantic you won believe they ...,1,1,1,0,1
4664,untrue facts about mental health you probably ...,1,0,0,0,1
5230,confessions from people who are afraid to admi...,1,0,0,1,1
6072,movies you won believe are turning in,1,0,0,0,1
6579,things you won believe are turning in,1,0,0,0,1
6930,celebrities you won believe are turning this year,1,0,0,0,1
6947,songs you won believe are turning in,1,0,0,0,1
7574,guy made fake fact go viral to prove not to be...,1,0,0,0,1
7723,you won believe how many sharks are swimming o...,1,0,0,0,1


## Train a simple model

We're going to train a simple neural net with one hidden layer. The details of this model are not important; what matters is that it stays essentially the same across all the variations of text processing we're going to try, so that differences in performance come from the features rather than from the model.

In [44]:
def train_text_classification_model(
        train_features: np.ndarray,
        train_labels: np.ndarray,
        validation_features: np.ndarray,
        validation_labels: np.ndarray,
        input_size: int,
        num_epochs: int,
        hidden_layer_size: int) -> models.Sequential:
    model = models.Sequential()
    model.add(layers.Dense(hidden_layer_size, activation="relu", input_shape=(input_size,)))
    model.add(layers.Dense(1, activation="sigmoid"))

    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"]
                  )

    model.fit(train_features,
              train_labels,
              epochs=num_epochs,
              batch_size=1920,
              validation_data=(validation_features, validation_labels)
              )
    return model

In [45]:
def generate_predictions(model: models.Sequential,
                         validation_features: np.ndarray,
                         validation_labels: np.ndarray) -> list:
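    # The sigmoid output has shape (n_samples, 1): round to 0/1 and flatten to a plain list
    predicted_proba = model.predict(validation_features)
    predicted_labels = np.rint(predicted_proba).ravel().tolist()

    # Rows are the true labels, columns are the predicted labels
    print(pd.crosstab(validation_labels, predicted_labels))
    return predicted_labels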

In [46]:
baseline_model = train_text_classification_model(
    X_train_cv,
    clickbait_train["label"].to_numpy(),
    X_val_cv,
    clickbait_val["label"].to_numpy(),
    X_train_cv.shape[1],
    2,
    1700
)

Epoch 1/2
Epoch 2/2


In [47]:
clickbait_val["baseline_pred"] = generate_predictions(baseline_model, X_val_cv, clickbait_val["label"].to_numpy())

col_0   0.0   1.0
row_0            
0      3104   100
1        69  3127
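
The crosstab above can be summarised as accuracy, precision and recall. The sketch below copies the four counts by hand and assumes rows are the true labels and columns the predictions, which matches how `pd.crosstab` is called in `generate_predictions`; the variable names are only illustrative.

```python
# Counts copied from the crosstab above (rows: true label, columns: predicted label)
true_negatives, false_positives = 3104, 100
false_negatives, true_positives = 69, 3127

total = true_negatives + false_positives + false_negatives + true_positives

# Share of all headlines classified correctly
accuracy = (true_positives + true_negatives) / total
# Of the headlines predicted as clickbait, the share that really are clickbait
precision = true_positives / (true_positives + false_positives)
# Of the real clickbait headlines, the share the model caught
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

All three land at roughly 97%, so the baseline makes few errors in either direction on the validation set.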


In [52]:
# Predicted as non-clickbait when they are clickbait
clickbait_val.loc[(clickbait_val["label"] == 1) & (clickbait_val["baseline_pred"] == 0), "text"][:5]

30     The Iconic Beatles Ashram In Rishikesh Is Once...
49     This Body Cam Footage Shows A Vehicle Plow Int...
83     Photographer Gregory Crewdson Releases Hauntin...
139    21 New Year's Resolutions For TV To Consider I...
222    Oscar-Nominated Movie Posters With White Actor...
Name: text, dtype: object

In [55]:
# Predicted as clickbait when they are not
clickbait_val.loc[(clickbait_val["label"] == 0) & (clickbait_val["baseline_pred"] == 1), "text"][:5]

4                              Where Is Oil Going Next?
69            A World of Lingo (Out of This World, Too)
184             Posted deadlines for Christmas delivery
422     Dolls Resembling Daughters Displease First Lady
443    Panasonic GH1 Merges S.L.R. Photos With HD Video
Name: text, dtype: object