# Text classification of clickbait headlines
## Iteration 3: feature weighting

Raw counts are not always the most informative features, because the most common words in a corpus can occur about equally often in clickbait and non-clickbait titles. We can instead apply a weighting called term frequency-inverse document frequency (tf-idf), which upweights terms that are found in only a few documents and downweights terms that are found across most documents.
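
To make the weighting concrete, here is a minimal sketch that computes tf-idf by hand on a toy corpus. It assumes the smoothed idf variant, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is what scikit-learn's documentation describes as the `TfidfVectorizer` default. The corpus and the helper names (`idf`, `tfidf`) are invented for illustration and are not part of this project's support functions.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not from the clickbait dataset)
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the stock market fell".split(),
]
n_docs = len(docs)

# Document frequency: the number of documents each term appears in
doc_freq = Counter(term for doc in docs for term in set(doc))


def idf(term):
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(doc):
    # Raw term count multiplied by idf, then scaled to unit (L2) length
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


weights = tfidf(docs[0])
# Print terms from lowest to highest weight
for term, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "the" (present in every document) gets the lowest weight and "cat" (present in only one) gets the highest, even though each occurs exactly once in that document.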

## Load in dependencies and data

In [9]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from support_functions import (
    apply_string_cleaning,
    generate_predictions,
    lemmatise_text,
    train_text_classification_model,
)

In [3]:
# Read in clickbait data
clickbait_train = pd.read_csv("data/clickbait_train.csv", sep="\t", header=0)
clickbait_val = pd.read_csv("data/clickbait_val.csv", sep="\t", header=0)
clickbait_test = pd.read_csv("data/clickbait_test.csv", sep="\t", header=0)

In [7]:
clickbait_train["text_lemmatised"] = apply_string_cleaning(lemmatise_text(clickbait_train["text"]))
clickbait_val["text_lemmatised"] = apply_string_cleaning(lemmatise_text(clickbait_val["text"]))
clickbait_test["text_lemmatised"] = apply_string_cleaning(lemmatise_text(clickbait_test["text"]))

## Weighting the features

The tf-idf weighting can be applied using `TfidfVectorizer`, making it as easy to implement as `CountVectorizer`.

In [10]:
tfidfVectoriser = TfidfVectorizer(max_features=6000)
tfidfVectoriser.fit(clickbait_train["text_lemmatised"])

In [11]:
X_train_tfidf = tfidfVectoriser.transform(clickbait_train["text_lemmatised"]).toarray()
X_val_tfidf = tfidfVectoriser.transform(clickbait_val["text_lemmatised"]).toarray()
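
As a small illustration of the fit-then-transform pattern used in the cells above, the sketch below runs `CountVectorizer` and `TfidfVectorizer` side by side on a toy corpus. The headlines and variable names are invented for illustration. It shows two things: both vectorisers produce matrices of the same shape, and words that only appear in the validation text are ignored because the vocabulary is learned from the training text alone.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy headlines, invented for illustration (not from the clickbait dataset)
train_titles = [
    "you will not believe this cat",
    "cat rescued from tree",
    "markets fall as oil prices rise",
]
val_titles = ["you will not believe this dog"]  # "dog" never appears in training

count_vectoriser = CountVectorizer(max_features=6000)
tfidf_vectoriser = TfidfVectorizer(max_features=6000)

# Learn the vocabulary (and, for tf-idf, the idf weights) from training data only
count_vectoriser.fit(train_titles)
tfidf_vectoriser.fit(train_titles)

X_val_counts = count_vectoriser.transform(val_titles).toarray()
X_val_weighted = tfidf_vectoriser.transform(val_titles).toarray()

# Same vocabulary, so the two matrices have the same shape
print(X_val_counts.shape, X_val_weighted.shape)

# "dog" was not seen during fitting, so it has no column and is ignored
print("dog" in tfidf_vectoriser.vocabulary_)
```

Fitting on the training split only keeps information from the validation and test titles out of the vocabulary and the idf weights.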

## Train our simple model

We're going to train the same model as last time, with just one adjustment to account for the change in vocabulary size.

In [15]:
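tidy_model = train_text_classification_model(
    X_train_tfidf,
    clickbait_train["label"].to_numpy(),
    X_val_tfidf,
    clickbait_val["label"].to_numpy(),
    X_train_tfidf.shape[1],  # input size: one feature per vocabulary term
    12,  # number of epochs
    64  # presumably the batch size; check support_functions for the signature
)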

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


In [17]:
clickbait_val["tfidf_pred"] = generate_predictions(tidy_model, X_val_tfidf, clickbait_val["label"].to_numpy())

col_0   0.0   1.0
row_0            
0      3066   138
1        70  3126


We can see that our model has a similar accuracy to the baseline (about 96.8% on the validation set), but it is now better at identifying clickbait titles than non-clickbait titles: there are more instances where the model has predicted clickbait when it shouldn't (138) than the other way around (70).
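
To make the comparison concrete, the counts in the crosstab can be turned into accuracy and per-class metrics. This is a minimal sketch that assumes the rows of the crosstab are the true labels and the columns are the predictions; the variable names are chosen for illustration.

```python
# Counts copied from the crosstab above.
# Assumption: rows are the true labels and columns are the predictions.
true_neg, false_pos = 3066, 138  # true label 0 (non-clickbait)
false_neg, true_pos = 70, 3126   # true label 1 (clickbait)

total = true_neg + false_pos + false_neg + true_pos

# Share of all titles classified correctly
accuracy = (true_neg + true_pos) / total
# Share of clickbait titles the model caught
recall_clickbait = true_pos / (true_pos + false_neg)
# Share of non-clickbait titles the model correctly left alone
recall_non_clickbait = true_neg / (true_neg + false_pos)
# Share of "clickbait" predictions that were actually clickbait
precision_clickbait = true_pos / (true_pos + false_pos)

print(f"accuracy:             {accuracy:.4f}")
print(f"recall (clickbait):   {recall_clickbait:.4f}")
print(f"recall (non-clickb.): {recall_non_clickbait:.4f}")
print(f"precision (clickb.):  {precision_clickbait:.4f}")
```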

In [18]:
# False negatives (model is not predicting clickbait when it should)
clickbait_val.loc[(clickbait_val["label"] == 1) & (clickbait_val["tfidf_pred"] == 0), "text"][:5]

49     This Body Cam Footage Shows A Vehicle Plow Int...
83     Photographer Gregory Crewdson Releases Hauntin...
139    21 New Year's Resolutions For TV To Consider I...
190    17 Things Vegetarians In The South Have To Dea...
283    What's Your Stance On These Unspoken Rules For...
Name: text, dtype: object

In [19]:
# False positives (model is predicting clickbait when it shouldn't)
clickbait_val.loc[(clickbait_val["label"] == 0) & (clickbait_val["tfidf_pred"] == 1), "text"][:5]

2          Former 'Dudley Boys' sign with TNA Wrestling
4                              Where Is Oil Going Next?
22          Irish Obama song proves popular on Internet
69            A World of Lingo (Out of This World, Too)
73    Minutes Behind the Leaders, Landis Speaks of a...
Name: text, dtype: object

We can see that the non-clickbait titles that the model misclassified contain a lot of terms we associate with clickbait, like "popular", "internet" or a sentence starting with "where". The misclassified clickbait titles, by contrast, are quite clearly clickbait; my guess is that the words they contain are also common enough in non-clickbait titles to confuse the model.
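
One way to check a guess like this is to look at which terms carry the most tf-idf weight in a misclassified title. The sketch below uses a hypothetical helper, `top_weighted_terms`, and a toy corpus invented for illustration. Note that it only shows how the input is weighted, not what the model has learned about each term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def top_weighted_terms(vectoriser, title, n=5):
    """Return the n highest tf-idf (term, weight) pairs for one title.

    Hypothetical helper, not part of support_functions.
    """
    row = vectoriser.transform([title]).toarray()[0]
    # Invert the term -> column index mapping learned during fit
    index_to_term = {idx: term for term, idx in vectoriser.vocabulary_.items()}
    ranked = sorted(
        ((index_to_term[i], weight) for i, weight in enumerate(row) if weight > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


# Toy corpus, invented for illustration (not from the clickbait dataset)
toy_titles = [
    "the most popular songs on the internet",
    "where is the oil price going",
    "senate votes on budget",
]
toy_vectoriser = TfidfVectorizer().fit(toy_titles)

top_terms = top_weighted_terms(toy_vectoriser, "popular song on the internet")
for term, weight in top_terms:
    print(f"{term}: {weight:.3f}")
```

In the notebook, the same helper could be called with the fitted `tfidfVectoriser` and a lemmatised title from one of the rows listed above, for example `clickbait_val.loc[22, "text_lemmatised"]`.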

We've gotten as far as we can with bag-of-words (BOW) methods, so let's move on to our first word embeddings model using word2vec.