# Text classification of clickbait headlines
## Iteration 2: consolidating the feature columns

Count vectorisation turns every distinct token into its own column, so grammatical variants of the same word (for example, "cat" and "cats") end up in separate columns. It also produces a very large number of columns, many of which are non-zero for only one or two documents. Consolidating columns that carry the same meaning and removing unhelpful ones can improve model performance.
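
To make the sparsity problem concrete, here is a minimal sketch in plain Python. The three headlines are hypothetical toy data (not from the clickbait dataset), and whitespace splitting stands in for a real tokeniser. It builds a small count matrix by hand and reports which columns are non-zero for only one document.

```python
from collections import Counter

# Hypothetical toy headlines, not taken from the clickbait dataset.
docs = [
    "cats are great",
    "a cat is great",
    "dogs are loyal",
]

# Simplification: split on whitespace instead of using a real tokeniser.
tokenised = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenised for tok in toks})

# One row per document, one column per vocabulary entry (raw counts).
matrix = [[toks.count(term) for term in vocab] for toks in tokenised]

# Document frequency: in how many documents is each column non-zero?
doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
rare_columns = [term for term in vocab if doc_freq[term] == 1]

# Share of cells in the matrix that are zero.
n_cells = len(matrix) * len(vocab)
n_zero = sum(row.count(0) for row in matrix)
sparsity = n_zero / n_cells

print(f"{len(vocab)} columns, {len(rare_columns)} used by a single document")
print(f"sparsity: {sparsity:.2f}")
```

Even in this tiny example, "cat" and "cats" occupy separate columns and most columns are used by a single document, which is the pattern the rest of this notebook tries to reduce.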

## Load in dependencies and data

In [6]:
import pandas as pd
import spacy

# ENGLISH_STOP_WORDS is exposed publicly here, so there is no need to import from the private _stop_words module
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

from support_functions import apply_string_cleaning, train_text_classification_model, generate_predictions

In [7]:
# Read in clickbait data
clickbait_train = pd.read_csv("data/clickbait_train.csv", sep="\t", header=0)
clickbait_val = pd.read_csv("data/clickbait_val.csv", sep="\t", header=0)
clickbait_test = pd.read_csv("data/clickbait_test.csv", sep="\t", header=0)

## Consolidate feature columns

We have a lot of feature columns (more than 17K). As we saw in the previous notebook, the feature matrix is also very sparse. There are a few things we can do to tidy it up:
* Lemmatisation: this is where we take words that mean the same thing but have different grammatical forms and reduce them all to the same base form, or lemma (cat, cats -> cat; is, am, are -> be).
* Removing stop words: this is where we remove very common words that usually don't have semantic meaning from the feature set.
* Keeping only the top n words: we should avoid keeping words that occur only once, as they give the model no repeated pattern to learn from and so add columns without adding information.
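
The three steps in the list above can be sketched in plain Python. This is only an illustration under stated assumptions: `TOY_LEMMAS`, `TOY_STOP_WORDS`, `consolidate` and the example sentences are all hypothetical stand-ins, whereas the notebook itself relies on spaCy for lemmas and on scikit-learn for stop words and vocabulary size.

```python
from collections import Counter

# Hypothetical stand-ins: a real pipeline would use spaCy lemmas and
# scikit-learn's English stop word list instead of these tiny tables.
TOY_LEMMAS = {"cats": "cat", "is": "be", "am": "be", "are": "be"}
TOY_STOP_WORDS = {"be", "the", "a", "i"}


def consolidate(docs, top_n):
    """Lemmatise, drop stop words, then keep only the top_n most frequent terms."""
    # 1. Lemmatise: map each token to its base form if we know one.
    lemmatised = [[TOY_LEMMAS.get(t, t) for t in d.lower().split()] for d in docs]
    # 2. Remove stop words.
    filtered = [[t for t in toks if t not in TOY_STOP_WORDS] for toks in lemmatised]
    # 3. Keep only the top_n most frequent remaining terms.
    counts = Counter(t for toks in filtered for t in toks)
    vocab = {term for term, _ in counts.most_common(top_n)}
    return [[t for t in toks if t in vocab] for toks in filtered]


docs = ["The cat is asleep", "Cats are asleep", "I am a sleepy cat"]
result = consolidate(docs, top_n=2)
print(result)
```

In this toy run "cat" and "cats" collapse into one term, the forms of "be" are removed as stop words, and "sleepy" is dropped because it falls outside the two most frequent terms.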

## Lemmatise the text
To get started, we'll lemmatise the text using spaCy. We run the `en_core_web_sm` English pipeline over each of the texts and extract the lemma of each token.

In [None]:
!python3 -m spacy download en_core_web_sm

In [9]:
nlp = spacy.load("en_core_web_sm")

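def lemmatise_text(texts: pd.Series, nlp) -> pd.Series:
    """Return each text with every token replaced by its spaCy lemma."""
    lemmatised_texts = []
    # nlp.pipe streams the texts through the pipeline in batches
    for doc in nlp.pipe(texts):
        lemmatised_texts.append(" ".join([token.lemma_ for token in doc]))
    # Keep the original index so the result aligns when assigned back to the DataFrame
    return pd.Series(lemmatised_texts, index=texts.index)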

In [10]:
clickbait_train["text_lemmatised"] = apply_string_cleaning(lemmatise_text(clickbait_train["text"], nlp))
clickbait_val["text_lemmatised"] = apply_string_cleaning(lemmatise_text(clickbait_val["text"], nlp))
clickbait_test["text_lemmatised"] = apply_string_cleaning(lemmatise_text(clickbait_test["text"], nlp))

## Create count vectoriser which removes stop words and keeps top n vocabulary

Here we'll repeat what we did in the previous notebook, but this time we'll ask the count vectoriser to:
* Remove stop words: we use the `stop_words="english"` argument;
* Keep only the top n most frequent terms: we use the `max_features` argument. I've set this to 6000 because a prior analysis showed that this is roughly the number of terms that occur 3 or more times in the corpus.
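
The prior analysis itself is not shown in this notebook. A minimal sketch of how such a cut-off could be found is below, assuming simple whitespace tokenisation (scikit-learn's own tokeniser behaves differently) and using hypothetical toy headlines. The helper name `vocab_size_for_min_count` is also an assumption made for illustration.

```python
from collections import Counter


def vocab_size_for_min_count(docs, min_count):
    """Number of distinct tokens that occur at least min_count times in total."""
    # Simplification: lowercase and split on whitespace.
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return sum(1 for c in counts.values() if c >= min_count)


# Hypothetical toy corpus, not taken from the clickbait dataset.
toy_docs = [
    "you will not believe this cat",
    "this cat will not stop",
    "you will love this cat",
    "council will vote on budget",
]

size_at_3 = vocab_size_for_min_count(toy_docs, min_count=3)
size_at_2 = vocab_size_for_min_count(toy_docs, min_count=2)
print(size_at_3, size_at_2)
```

On a real corpus, the value returned for `min_count=3` would be a candidate for `max_features`, since `max_features` keeps the terms with the highest total frequency across the corpus.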

In [11]:
tidyCountVectoriser = CountVectorizer(stop_words="english", max_features=6000)
tidyCountVectoriser.fit(clickbait_train["text_lemmatised"])

In [12]:
X_train_tidy = tidyCountVectoriser.transform(clickbait_train["text_lemmatised"]).toarray()
X_val_tidy = tidyCountVectoriser.transform(clickbait_val["text_lemmatised"]).toarray()

## Train our simple model

We're going to train the same model we did last time, with just one adjustment to account for the change in vocabulary size.

In [15]:
tidy_model = train_text_classification_model(
    X_train_tidy,
    clickbait_train["label"].to_numpy(),
    X_val_tidy,
    clickbait_val["label"].to_numpy(),
    X_train_tidy.shape[1],
    2,
    64
)

Epoch 1/2
Epoch 2/2


In [16]:
clickbait_val["tidy_pred"] = generate_predictions(tidy_model, X_val_tidy, clickbait_val["label"].to_numpy())

col_0   0.0   1.0
row_0            
0      2961   243
1       148  3048


It turns out our accuracy has gone down, to about 93.9% on the validation set ((2961 + 3048) / 6400)! After a bit of digging, the culprit appears to be the stopword removal. When we examine the texts that the model misclassified, it's clear that the stopwords actually give clickbait titles a lot of their meaning.

In [17]:
def filter_stopwords(row):
    return " ".join([w for w in row.split() if w not in ENGLISH_STOP_WORDS])

In [18]:
clickbait_val["text_lemmatised_no_stopwords"] = clickbait_val["text_lemmatised"].apply(filter_stopwords)

In [22]:
clickbait_val.loc[
    (clickbait_val["label"] == 1) & (clickbait_val["tidy_pred"] == 0),
    ["text", "text_lemmatised_no_stopwords"]][:10]

Unnamed: 0,text,text_lemmatised_no_stopwords
30,The Iconic Beatles Ashram In Rishikesh Is Once...,iconic beatles ashram rishikesh open public
49,This Body Cam Footage Shows A Vehicle Plow Int...,body cam footage vehicle plow cop car head
83,Photographer Gregory Crewdson Releases Hauntin...,photographer gregory crewdson releases hauntin...
104,We Found Out Who The BABE Was Sitting Behind J...,babe sit jake tapper gop debate
109,Are You More Target Or Walmart,target walmart
190,17 Things Vegetarians In The South Have To Dea...,thing vegetarians south deal
283,What's Your Stance On These Unspoken Rules For...,stance unspoken rules society
292,Which Newly Revealed Wizard School Should You ...,newly reveal wizard school study abroad
359,What's Your Personal Slogan,personal slogan
383,"Stephanie Mills Destroyed Us In NBC's ""The Wiz""",stephanie mills destroy nbc wiz


In [23]:
clickbait_val.loc[
    (clickbait_val["label"] == 1) & (clickbait_val["tidy_pred"] == 1),
    ["text", "text_lemmatised_no_stopwords"]][:10]

Unnamed: 0,text,text_lemmatised_no_stopwords
0,People Keep Making Huge Facebook Chats With Pe...,people make huge facebook chats people
6,Phoebe Buffay Is Supposed To Die On October 15...,phoebe buffay suppose die october
8,"The #Blessed Life Of Kaskade, EDM's Voice Of R...",blessed life kaskade edm voice reason
10,Can You Guess The Christmas Movie From Its Ama...,guess christmas movie amazon review
11,16 Questions We Have About Kylie Jenner,question kylie jenner
12,19 Texts All Twentysomethings Have Sent Their Dad,text twentysomething send dad
15,20 Signs That Definitely Have A Hilarious Stor...,sign definitely hilarious story
16,"12 Things You Probably Didn't Know About The ""...",thing probably know shadowhunters cast
17,Americans Try Canadian Candy For The First Time,americans try canadian candy time
18,"Who Would Be Your ""Harry Potter"" Best Friend B...",harry potter good friend base zodiac sign
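
To see how much of a clickbait headline stop word removal can strip away, here is a small sketch using one misclassified headline from the output above. `TOY_STOP_WORDS` is a hypothetical hand-picked subset for illustration, not the full scikit-learn list, and `strip_stop_words` is an assumed helper name.

```python
# Hypothetical hand-picked subset of a stop word list, for illustration only.
TOY_STOP_WORDS = {"are", "you", "more", "or"}


def strip_stop_words(headline):
    """Lowercase the headline and drop any token found in TOY_STOP_WORDS."""
    return " ".join(w for w in headline.lower().split() if w not in TOY_STOP_WORDS)


headline = "Are You More Target Or Walmart"
stripped = strip_stop_words(headline)

# Fraction of the original tokens that were removed.
removed_share = 1 - len(stripped.split()) / len(headline.split())
print(stripped, f"({removed_share:.0%} of tokens removed)")
```

The second-person question framing ("Are You More ... Or ...") is exactly the part that gets removed, leaving only two brand names behind.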


## Train a model with the stopwords included

Let's test this hypothesis by retraining the same model with the stopwords included. We'll need to create a new vectoriser which does not remove the stopwords. Note that this run trains for 7 epochs rather than 2.

In [24]:
tidySwCountVectoriser = CountVectorizer(max_features=6000)
tidySwCountVectoriser.fit(clickbait_train["text_lemmatised"])

In [25]:
X_train_tidy_sw = tidySwCountVectoriser.transform(clickbait_train["text_lemmatised"]).toarray()
X_val_tidy_sw = tidySwCountVectoriser.transform(clickbait_val["text_lemmatised"]).toarray()

In [30]:
tidy_sw_model = train_text_classification_model(
    X_train_tidy_sw,
    clickbait_train["label"].to_numpy(),
    X_val_tidy_sw,
    clickbait_val["label"].to_numpy(),
    X_train_tidy_sw.shape[1],  # input size taken from the matrix actually used for training
    7,
    64
)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


In [32]:
clickbait_val["tidy_sw_pred"] = generate_predictions(tidy_sw_model, X_val_tidy_sw, clickbait_val["label"].to_numpy())

col_0   0.0   1.0
row_0            
0      3097   107
1        82  3114


We're now back to around the same accuracy as with the baseline model (about 97.0% on the validation set here). It seems that lemmatisation and restricting the vocabulary to the top n terms haven't improved model fit.
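
The two confusion matrices printed in this notebook can be turned into accuracy figures directly. The sketch below copies the counts from the outputs above. The `accuracy` helper is an assumption for illustration and is not part of `support_functions`. Which axis holds the true labels does not affect accuracy, since only the diagonal and the total are used.

```python
def accuracy(confusion):
    """Accuracy from a square confusion matrix: diagonal sum over total count."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Counts copied from the two validation outputs in this notebook.
without_stopwords = [[2961, 243], [148, 3048]]
with_stopwords = [[3097, 107], [82, 3114]]

acc_without = accuracy(without_stopwords)
acc_with = accuracy(with_stopwords)
print(f"stopwords removed: {acc_without:.4f}")
print(f"stopwords kept:    {acc_with:.4f}")
```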