# Text classification of clickbait headlines
## Word embeddings: word2vec

Word embeddings are vector representations of a word's meaning, derived from the contexts in which the word appears across a large text corpus. Each word is represented as an n-dimensional vector, which in this case is taken from the hidden layer of a word2vec model. These embeddings can be compared to each other in the n-dimensional space: words used in similar contexts in the training corpus end up close together, while words used in dissimilar contexts end up far apart.
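As a rough illustration of what "close together" means, embeddings are commonly compared using cosine similarity. The sketch below uses hand-written 3-dimensional vectors (`toy_embeddings` is hypothetical and not produced by any model) purely to show the calculation.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings, written by hand for illustration only
toy_embeddings = {
    "best": np.array([0.9, 0.1, 0.3]),
    "greatest": np.array([0.8, 0.2, 0.4]),
    "insulin": np.array([-0.5, 0.9, -0.2]),
}

similar = cosine_similarity(toy_embeddings["best"], toy_embeddings["greatest"])
dissimilar = cosine_similarity(toy_embeddings["best"], toy_embeddings["insulin"])
print(similar, dissimilar)
```

In this toy setup, the pair pointing in nearly the same direction scores close to 1, while the unrelated pair scores below 0.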

## Load in dependencies and data

In [27]:
import pandas as pd
import numpy as np

from sklearn.manifold import TSNE
import plotly.express as px
from support_functions import train_text_classification_model, generate_predictions

In [28]:
# Load in train and validation data
clickbait_train = pd.read_csv("data/clickbait_train.csv", sep="\t", header=0)
clickbait_val = pd.read_csv("data/clickbait_val.csv", sep="\t", header=0)

## Prepare data for word2vec training

To get the data ready for word2vec training, we need to do a small amount of preparation.

First, we do some light string cleaning: converting all characters to lowercase, replacing punctuation and other non-alphanumeric characters with spaces (digits are kept), and collapsing additional whitespace. This is because word2vec models, like bag-of-words models, operate on word tokens, so we want to normalise the text as much as possible before creating the embeddings.

In [29]:
def apply_light_string_cleaning(dataset: pd.Series) -> pd.Series:
    """
    Cleans a Series of strings: converts all characters to lowercase, replaces runs of
    non-alphanumeric characters with a single space and removes additional whitespace.
    """
    return (
        dataset
        .str.lower()
        .str.replace(r"[\W_]+", " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

In [30]:
# Apply the string cleaning to the train and validation data
clickbait_train["text_clean"] = apply_light_string_cleaning(clickbait_train["text"])
clickbait_val["text_clean"] = apply_light_string_cleaning(clickbait_val["text"])
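To make the effect of the cleaning concrete, here is a minimal single-string sketch that mirrors the same two substitutions using the standard-library `re` module. The helper name `clean_headline` is hypothetical and is not part of the notebook.

```python
import re


def clean_headline(text: str) -> str:
    """Single-string version of the light cleaning applied above."""
    text = text.lower()
    # Replace runs of non-alphanumeric characters (including underscores) with a space
    text = re.sub(r"[\W_]+", " ", text)
    # Collapse any remaining runs of whitespace
    text = re.sub(r"\s+", " ", text)
    return text.strip()


first = clean_headline("New insulin-resistance discovery may help diabetes sufferers")
second = clean_headline("Day 3 Of BuzzFeed's 7-Day Clean Eating Challenge")
print(first)
print(second)
```

Note that hyphens and apostrophes become token boundaries, while digits survive the cleaning.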

Finally, we split each sentence into a list of words, the expected format for a word2vec model.

In [31]:
# Convert headlines into a list of token lists for training
clickbait_w2v_training = clickbait_train["text_clean"].str.split().to_list()

# Remove missing values (NaN entries are floats rather than lists)
clickbait_w2v_training = [s for s in clickbait_w2v_training if isinstance(s, list)]

In [32]:
# Example of a cleaned and tokenised headline
clickbait_w2v_training[0]

['new',
 'insulin',
 'resistance',
 'discovery',
 'may',
 'help',
 'diabetes',
 'sufferers']

## Train w2v model to get word embeddings

In [33]:
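# Import the gensim Word2Vec class
from gensim.models import Word2Vec

# Train a skip-gram (sg=1) word2vec model with 100-dimensional embeddings,
# ignoring words that appear fewer than 2 times (min_count=2)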
w2v_model = Word2Vec(sentences=clickbait_w2v_training,
                     vector_size=100,
                     window=5,
                     min_count=2,
                     workers=4,
                     sg=1)
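With `sg=1`, gensim trains the skip-gram variant, in which each word is used to predict the words around it within `window` positions. The sketch below shows, in simplified form, how such (target, context) pairs could be generated from one tokenised headline. It is an illustration only: the helper `skipgram_pairs` is hypothetical, and gensim's actual training also involves details such as randomly shrinking the window and subsampling frequent words.

```python
def skipgram_pairs(tokens: list, window: int) -> list:
    """Return (target, context) pairs for every word within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


headline = ["new", "insulin", "resistance", "discovery"]
pairs = skipgram_pairs(headline, window=1)
print(pairs)
```

A larger `window` produces more pairs per word, so each embedding is shaped by a broader context.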

In [34]:
# Retrieve word embedding for "best"
print(w2v_model.wv["best"])

[-1.89133793e-01 -3.55381295e-02 -1.15797691e-01  2.04730988e-01
  9.72761214e-02 -6.47749603e-01  4.28570360e-01  6.69306040e-01
 -6.74471080e-01 -4.72689331e-01  2.04731878e-02 -5.08865058e-01
  6.74913451e-02  1.02014065e-01  3.42079908e-01  2.36438252e-02
  5.07577598e-01  3.21764871e-02 -2.72184730e-01 -6.26558185e-01
 -2.26681884e-02  4.43158686e-01 -4.18877676e-02 -3.44026312e-02
  9.81348902e-02  1.88852921e-01 -1.88589960e-01 -1.38748825e-01
  5.55962771e-02 -1.16419028e-02  4.55057591e-01 -2.95428764e-02
 -8.34257249e-03 -4.18107569e-01 -5.32170832e-01  4.81347322e-01
  1.96669310e-01 -6.85726106e-02 -1.54913321e-01 -4.49474901e-01
  1.35256141e-01  1.36328012e-01  1.13667771e-02 -8.53236988e-02
  1.78325325e-02 -4.70572561e-02 -7.68152475e-02 -4.84123796e-01
  1.20714316e-02  1.16457053e-01  1.90856516e-01 -3.03020418e-01
 -1.41172409e-01 -5.24448156e-01 -3.07662696e-01  3.39142606e-02
  8.93116519e-02 -1.65323600e-01  6.70489594e-02  2.77279228e-01
  6.52871327e-03 -1.83777

In [35]:
# Find words most similar to "best"
w2v_model.wv.most_similar("best")

[('worst', 0.945489227771759),
 ('greatest', 0.9432611465454102),
 ('twitter', 0.9333745241165161),
 ('most', 0.9332753419876099),
 ('funniest', 0.9330690503120422),
 ('important', 0.9200592637062073),
 ('absolute', 0.9144424796104431),
 ('costume', 0.9141934514045715),
 ('thing', 0.9140698909759521),
 ('cutest', 0.913139283657074)]
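`most_similar` ranks the vocabulary by cosine similarity to the query word. The list above also shows that "similar" here means "used in similar contexts": "worst" ranks first for "best", plausibly because both appear in the same kinds of headline. A minimal sketch of the ranking step, using a hypothetical hand-written vocabulary and matrix rather than the trained model:

```python
import numpy as np


def rank_neighbours(query: str, vocab: list, matrix: np.ndarray, topn: int = 2) -> list:
    """Rank words by cosine similarity to `query`, excluding the query word itself."""
    # Normalise each row to unit length so the dot product equals cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[vocab.index(query)]
    ranked = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in ranked if vocab[i] != query][:topn]


# Hypothetical vocabulary and embeddings, for illustration only
vocab = ["best", "worst", "greatest", "insulin"]
matrix = np.array([
    [0.9, 0.1, 0.3],
    [0.8, 0.3, 0.2],
    [0.8, 0.2, 0.4],
    [-0.5, 0.9, -0.2],
])

neighbours = rank_neighbours("best", vocab, matrix)
print(neighbours)
```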

## Extract vectors and average them across the documents

In [36]:
def extract_document_vectors(model: Word2Vec, text: str, len_vectors: int):
    """
    Takes in a clickbait headline, and iterates over every word in the sequence. For each word, it retrieves
    its word embedding from the word2vec model, and then appends it to a NumPy array. Returns this array of
    word embeddings.
    """
    # Create empty NumPy array
    vectors = np.empty((0, len_vectors), float)
    # Loop over each word in clickbait headline
    for word in text.split():
        # Checks if word is in word2vec model
        if word in model.wv.key_to_index:
            # Retrieves embedding and appends it to the vectors array
            v = model.wv[word]
            vectors = np.append(vectors, np.array([v]), axis=0)
    return vectors


def calculate_w2v_dataset(model: Word2Vec, dataset: pd.DataFrame, len_vectors: int):
    """
    Create a NumPy array which contains the average embedding for a headline, as well as the label
    (whether it is clickbait or non-clickbait).
    """
    # Create an empty NumPy array to contain the averaged headline vectors
    document_vectors = np.empty((0, len_vectors), float)
    # Create an empty list for the headline labels
    matched_labels = []
    # Iterate over the dataset containing the cleaned headline and the label
    for _, row in dataset.iterrows():
        # Extract the array of word embeddings for each headline
        v = extract_document_vectors(model, row["text_clean"], len_vectors)
        # Skip headlines with no in-vocabulary words, so the output can have
        # fewer rows than the input dataset
        if v.shape[0] > 0:
            # Average the array to yield one headline embedding
            v_mean = v.mean(axis=0)
            # Append the headline embedding and label
            document_vectors = np.append(document_vectors, np.array([v_mean]), axis=0)
            matched_labels.append(row["label"])
    return document_vectors, np.array(matched_labels)
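The averaging step can be illustrated on a toy example. The sketch below uses a hypothetical dictionary of 2-dimensional embeddings in place of the trained model, and additionally returns the positions of the headlines that were kept. Tracking positions is an addition for illustration (the functions above do not do this); it can help keep labels and raw text aligned if any headline contains no in-vocabulary words.

```python
import numpy as np


def average_embeddings(headlines: list, embeddings: dict, dim: int):
    """Mean-pool the embeddings of in-vocabulary words for each headline.

    Returns the averaged vectors and the positions of the headlines that were kept.
    """
    vectors, kept = [], []
    for position, headline in enumerate(headlines):
        known = [embeddings[w] for w in headline.split() if w in embeddings]
        if known:
            vectors.append(np.mean(known, axis=0))
            kept.append(position)
    return np.array(vectors).reshape(-1, dim), kept


# Hypothetical embeddings, for illustration only
toy_embeddings = {
    "new": np.array([1.0, 0.0]),
    "discovery": np.array([0.0, 1.0]),
}
headlines = ["new discovery", "zzz qqq", "new"]

doc_vectors, kept = average_embeddings(headlines, toy_embeddings, dim=2)
print(doc_vectors)
print(kept)
```

The second headline has no known words, so it is dropped and the output has two rows rather than three.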

In [37]:
# Extract the document embeddings for each dataset
document_vectors_train, final_labels_train = calculate_w2v_dataset(w2v_model, clickbait_train, 100)
document_vectors_val, final_labels_val = calculate_w2v_dataset(w2v_model, clickbait_val, 100)

In [39]:
print(f"Document embedding for '{clickbait_train['text'][0]}'")
print(document_vectors_train[0])

Document embedding for 'New insulin-resistance discovery may help diabetes sufferers'
[-0.06827156  0.14209813  0.05956285  0.17069656  0.01225918 -0.27898166
  0.05946321  0.32773007 -0.11368255 -0.12256466 -0.00497826 -0.26394722
 -0.04155935  0.14071047  0.03343765 -0.16796383  0.04735166 -0.13539201
  0.08044776 -0.44081703  0.16462538  0.07199759  0.09642835 -0.0716313
  0.04317991 -0.02760717 -0.07350119 -0.1345146  -0.28000989 -0.01771149
  0.14346436 -0.08980052  0.156179   -0.16703596 -0.07125437  0.18236644
  0.06444096 -0.01729699 -0.05607518 -0.24937303  0.04274284 -0.28840888
 -0.24059521  0.0952557   0.10226476 -0.13289442 -0.19955119 -0.009789
  0.145786    0.19490744  0.09943286 -0.23060559 -0.01701118 -0.05828549
 -0.09470549 -0.03297623  0.12767985 -0.06695042 -0.16603108  0.08497682
  0.03606476  0.0675676   0.05494965 -0.02640681 -0.10660581  0.16696225
 -0.09116359  0.13701866 -0.0981861   0.15390789 -0.07519459  0.18070303
  0.17520152  0.03232608  0.19558767  0.1

## Visualise groupings of headlines

In [12]:
# Use t-SNE to project the 100-dimensional vectors onto a 2-dimensional space
document_vectors_val_tsne = TSNE(n_components=2,
                                 learning_rate='auto',
                                 init='random',
                                 perplexity=3).fit_transform(document_vectors_val)

In [13]:
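# Create a dataset which contains the 2-dimensional projections of the headline embeddings,
# plus the headline labels and raw text (this assumes no validation headlines were
# dropped when averaging, so that rows line up by position)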
document_vectors_plotting = (
    pd.DataFrame(document_vectors_val_tsne, columns=["dimension_1", "dimension_2"])
    .assign(labels=final_labels_val)
    .assign(text=clickbait_val["text"])
)

In [14]:
document_vectors_plotting.to_csv("data/plotting_sample_document_vectors.csv")

In [15]:
# Plot using Plotly
fig = px.scatter(
    document_vectors_plotting,
    x="dimension_1",
    y="dimension_2",
    color="labels",
    title="Vector space of documents in validation set",
    custom_data=["labels", "text"]
)
fig.update_traces(
    hovertemplate="<br>".join([
        "Category: %{customdata[0]}",
        "Headline: %{customdata[1]}"
    ])
)
fig.show()

## Train clickbait classifier

In [40]:
# Train a simple neural net on the training data and
# evaluate its performance on the validation set
w2v_classification_model = train_text_classification_model(
    document_vectors_train,
    final_labels_train,
    document_vectors_val,
    final_labels_val,
    100,
    20,
    32
)

Epoch 1/20
...
Epoch 20/20


In [41]:
# Generate a column in the validation data with the predictions
clickbait_val["w2v_baseline_pred"] = generate_predictions(w2v_classification_model, document_vectors_val,
                                                          final_labels_val)

col_0   0.0   1.0
row_0            
0      3033   171
1       131  3065
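Summary metrics can be read off the confusion matrix above. The sketch below assumes that rows are the true labels and columns are the predictions; the output does not state this, so treat the precision and recall values as tentative (they swap if the orientation is the other way round). Accuracy does not depend on the orientation.

```python
# Counts copied from the confusion matrix above
# Assumption: rows are true labels, columns are predictions
true_negatives, false_positives = 3033, 171
false_negatives, true_positives = 131, 3065

total = true_negatives + false_positives + false_negatives + true_positives

# Share of all headlines classified correctly
accuracy = (true_positives + true_negatives) / total
# Share of predicted clickbait that really is clickbait
precision = true_positives / (true_positives + false_positives)
# Share of actual clickbait that the model caught
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```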


In [42]:
# False negatives: headlines the model predicted as not clickbait, but which are labelled clickbait
clickbait_val.loc[(clickbait_val["label"] == 1) & (clickbait_val["w2v_baseline_pred"] == 0), "text"].head(10)

6      Phoebe Buffay Is Supposed To Die On October 15...
49     This Body Cam Footage Shows A Vehicle Plow Int...
52     Ariana Grande Flawlessly Shut Down Sexist Comm...
78     Robert Pattinson Has Grown A Humongously Bushy...
83     Photographer Gregory Crewdson Releases Hauntin...
92     Amandla Stenberg Co-Wrote A Comic Starring A Y...
160    Watch Footage Of Two Sikh Men Unraveling Their...
234    Joe Biden And Stephen Colbert Have A Remarkabl...
304    Watch 100 Years Of Brazilian Beauty In A Littl...
360     Day 3 Of BuzzFeed's 7-Day Clean Eating Challenge
Name: text, dtype: object

In [43]:
# False positives: headlines the model predicted as clickbait, but which are labelled not clickbait
clickbait_val.loc[(clickbait_val["label"] == 0) & (clickbait_val["w2v_baseline_pred"] == 1), "text"].head(10)

4                               Where Is Oil Going Next?
46     With High-Speed Camera, Glimpsing Worlds Too F...
69             A World of Lingo (Out of This World, Too)
112         Advertisers Change Game Plans for Super Bowl
184              Posted deadlines for Christmas delivery
391    For Refugees, Recession Makes Hard Times Even ...
421        Samsung + T-Mobile = Phone With a Real Camera
430                           Sears Tower Is Going Green
443     Panasonic GH1 Merges S.L.R. Photos With HD Video
488        TomTom Go 740 Live Has Cellphone Connectivity
Name: text, dtype: object