# Text classification of clickbait headlines
## Word embeddings: word2vec

Word embeddings are vector representations of a word's meaning, derived from the contexts in which the word appears across a large text corpus. Each meaning is represented as an n-dimensional vector, which in this case comes from the hidden layer of a word2vec model. These embeddings can be compared to each other in the n-dimensional space: words with similar meanings in the training corpus end up close together, while words with dissimilar meanings end up far apart.
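
As a rough illustration of what "close together" means, the sketch below compares vectors with cosine similarity, one common way to measure closeness between embeddings. The three-dimensional vectors are invented for the example and are not taken from any trained model.

```python
import math

# Toy 3-dimensional "embeddings", invented purely for illustration;
# real word2vec vectors are learned from a corpus and have many more dimensions.
toy_vectors = {
    "best": [0.9, 0.8, 0.1],
    "greatest": [0.8, 0.9, 0.2],
    "insulin": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_close = cosine_similarity(toy_vectors["best"], toy_vectors["greatest"])
sim_far = cosine_similarity(toy_vectors["best"], toy_vectors["insulin"])
print(f"best vs greatest: {sim_close:.3f}")
print(f"best vs insulin:  {sim_far:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score lower.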

## Load in dependencies and data

In [1]:
import pandas as pd
import numpy as np

from sklearn.manifold import TSNE
import plotly.express as px
from support_functions import train_text_classification_model, generate_predictions

In [72]:
clickbait_train = pd.read_csv("data/clickbait_train.csv", sep="\t", header=0)
clickbait_val = pd.read_csv("data/clickbait_val.csv", sep="\t", header=0)

## Prepare data for word2vec training

To get the data ready for word2vec training, we need to do a small amount of preparation.

First, we do some light string cleaning: converting all characters to lowercase, replacing punctuation and underscores with spaces, and collapsing additional whitespace. Note that digits are kept, because `\W` only matches non-word characters and digits count as word characters. We clean the text because word2vec models, like bag-of-words models, are based on word tokens, so we want to normalise the text as much as possible before creating the embeddings.

In [73]:
def apply_light_string_cleaning(dataset: pd.Series) -> pd.Series:
    return (
        dataset
        .str.lower()
        .str.replace(r"[\W_]+", " ", regex=True)  # punctuation and underscores -> space
        .str.replace(r"\s+", " ", regex=True)     # collapse repeated whitespace
        .str.strip()
    )

In [74]:
clickbait_train["text_clean"] = apply_light_string_cleaning(clickbait_train["text"])
clickbait_val["text_clean"] = apply_light_string_cleaning(clickbait_val["text"])
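
To see what the cleaning step does to a single headline, here is a minimal standard-library sketch that mirrors the same three operations with `re`. The helper name `clean_headline` and the sample headline are made up for illustration; the notebook itself applies the pandas version to whole columns.

```python
import re


def clean_headline(text: str) -> str:
    """Standard-library mirror of the pandas cleaning chain, for one string."""
    text = text.lower()
    text = re.sub(r"[\W_]+", " ", text)  # punctuation, symbols, underscores -> space
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()


sample = "10 Things You Didn't Know About  Cats!!"  # made-up headline
cleaned = clean_headline(sample)
print(cleaned)
```

Two side effects are worth noticing: digits survive, because `\W` does not match them, and the apostrophe in "didn't" is replaced by a space, so the contraction becomes two tokens.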

Finally, we split each sentence into a list of words, the expected format for a word2vec model.

In [76]:
# Convert sentences into list of lists for training
clickbait_w2v_training = clickbait_train["text_clean"].str.split(r"\s").to_list()

# Remove nans (missing headlines come through as float NaN rather than a list)
clickbait_w2v_training = [s for s in clickbait_w2v_training if isinstance(s, list)]

In [77]:
clickbait_w2v_training[0]

['new',
 'insulin',
 'resistance',
 'discovery',
 'may',
 'help',
 'diabetes',
 'sufferers']

## Train w2v model to get word embeddings

In [78]:
from gensim.models import Word2Vec

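w2v_model = Word2Vec(sentences=clickbait_w2v_training,
                     vector_size=100,  # dimensionality of each word vector
                     window=5,         # maximum distance between target and context word
                     min_count=2,      # ignore words that appear fewer than 2 times
                     workers=4,        # number of training threads
                     sg=1)             # 1 = skip-gram, 0 = CBOW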
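
Setting `sg=1` selects the skip-gram variant, in which the model learns to predict the words surrounding each target word. The sketch below shows the kind of (target, context) pairs that a fixed window produces for one tokenised headline. It is only an illustration of the idea: the function `skipgram_pairs` is hypothetical, and gensim's actual training loop involves further details that are not reproduced here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


headline = ["new", "insulin", "resistance", "discovery"]
pairs = skipgram_pairs(headline, window=1)
print(pairs)
```

With a larger window, such as the `window=5` used in the notebook, every word in a short headline ends up paired with every other word.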

In [79]:
print(w2v_model.wv["best"])

[-0.14747071  0.04011309 -0.10927655  0.14902066  0.12776141 -0.6504799
  0.44718814  0.5880604  -0.5520279  -0.70186365 -0.05306865 -0.4031026
  0.06684253  0.01200101  0.28834534  0.08133595  0.3530518  -0.05653202
 -0.11454131 -0.45856938  0.19001697  0.35391077 -0.06029467 -0.28179187
 -0.10636678  0.04873601 -0.15744053 -0.15019718  0.04027539 -0.1660255
  0.48455793  0.03202617 -0.00139145 -0.46137607 -0.53922224  0.59351826
  0.17670548 -0.02025568 -0.11744449 -0.42973134  0.1310476  -0.00641482
  0.11426919  0.13336578  0.40492824  0.10240365 -0.13101658 -0.5351497
  0.0510365   0.16228396  0.05715303 -0.2773195   0.03768624 -0.45286545
 -0.07627048 -0.04741495  0.09568824  0.10578471  0.07903598  0.17954843
  0.13526057  0.07581329  0.63539827 -0.12035474 -0.20268515  0.41552052
 -0.2112141   0.4608939  -0.42605555  0.23912817 -0.02042929  0.15214413
  0.16695537  0.13802026  0.87651426  0.10828011  0.01392671  0.34544763
 -0.26921824 -0.13895583 -0.35669312 -0.04676785 -0.260

In [80]:
w2v_model.wv.most_similar("best")

[('worst', 0.9472451210021973),
 ('greatest', 0.9398711323738098),
 ('twitter', 0.9330142736434937),
 ('funniest', 0.9257383346557617),
 ('cutest', 0.9215297102928162),
 ('absolute', 0.920747697353363),
 ('instagram', 0.9169497489929199),
 ('most', 0.916392982006073),
 ('friend', 0.9141044020652771),
 ('costume', 0.9123106598854065)]

## Extract vectors and average them across the documents

In [81]:
def extract_document_vectors(model: Word2Vec, text: str, len_vectors: int) -> np.ndarray:
    """Return a (n_matched_words, len_vectors) array of vectors for in-vocabulary words."""
    vectors = [model.wv[word] for word in text.split() if word in model.wv.key_to_index]
    if not vectors:
        return np.empty((0, len_vectors), float)
    return np.array(vectors, dtype=float)


def calculate_w2v_dataset(model: Word2Vec, dataset: pd.DataFrame, len_vectors: int):
    """Average the word vectors of each document, skipping documents with no matched words."""
    document_vectors = []
    matched_labels = []
    for _, row in dataset.iterrows():
        v = extract_document_vectors(model, row["text_clean"], len_vectors)
        # Documents with no in-vocabulary words are dropped, so the output
        # can have fewer rows than `dataset`
        if v.shape[0] > 0:
            document_vectors.append(v.mean(axis=0))
            matched_labels.append(row["label"])
    if not document_vectors:
        return np.empty((0, len_vectors), float), np.array(matched_labels)
    return np.array(document_vectors), np.array(matched_labels)

In [82]:
document_vectors_train, final_labels_train = calculate_w2v_dataset(w2v_model, clickbait_train, 100)
document_vectors_val, final_labels_val = calculate_w2v_dataset(w2v_model, clickbait_val, 100)
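
The averaging step can be easier to follow on a tiny example. The sketch below uses a plain dictionary of invented two-dimensional vectors in place of the trained model, and a hypothetical helper `mean_document_vector` that returns `None` when no word in the text is in the vocabulary.

```python
# Toy 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "new": [1.0, 0.0],
    "discovery": [0.0, 1.0],
    "help": [1.0, 1.0],
}


def mean_document_vector(text, embeddings):
    """Average the vectors of in-vocabulary words; return None if none match."""
    matched = [embeddings[word] for word in text.split() if word in embeddings]
    if not matched:
        return None
    n_dims = len(matched[0])
    return [sum(vec[d] for vec in matched) / len(matched) for d in range(n_dims)]


# "insulin" is not in the toy vocabulary, so only two words contribute
doc_vector = mean_document_vector("new insulin discovery", toy_embeddings)
no_match = mean_document_vector("zzz qqq", toy_embeddings)
print(doc_vector)
print(no_match)
```

Out-of-vocabulary words simply do not contribute to the mean, which is why words removed by `min_count` have no influence on a headline's document vector.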

In [83]:
print(document_vectors_train[0])

[-0.04292156  0.12544873  0.03882096  0.21273381  0.01500233 -0.26236484
  0.06187111  0.32737116 -0.12300739 -0.0826752  -0.0093294  -0.25763105
 -0.09535895  0.10149708  0.05331289 -0.15364419  0.02159551 -0.11620372
  0.10725712 -0.42554655  0.13211197  0.08984447  0.11354826 -0.11116657
  0.00545645 -0.01523238 -0.07476065 -0.11439017 -0.25137747  0.01816821
  0.15365209 -0.12164204  0.1276247  -0.18941424 -0.06403412  0.19660579
  0.08466378 -0.01745256 -0.0814019  -0.24173464  0.05425641 -0.29057403
 -0.20547444  0.15290917  0.1895702  -0.16748803 -0.19885572  0.03154338
  0.17334231  0.15439048  0.01876524 -0.21415271 -0.02258946 -0.04729926
 -0.10879836 -0.026668    0.11981433 -0.11571548 -0.14648805  0.08123164
  0.05798304  0.135803    0.02847759 -0.00208709 -0.12765199  0.10054199
 -0.0687885   0.15100165 -0.09648447  0.14492192 -0.06739147  0.18583114
  0.16957987  0.1033375   0.20763204  0.12201632  0.00132056  0.00649403
  0.01943747 -0.05439365 -0.14649371 -0.13144331 -0

## Visualise groupings of headlines

In [84]:
# Use t-SNE to project the 100-dimensional vectors onto a 2-dimensional space
document_vectors_val_tsne = TSNE(n_components=2,
                                 learning_rate='auto',
                                 init='random',
                                 perplexity=3).fit_transform(document_vectors_val)

In [85]:
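# Note: this assumes no validation headlines were dropped by calculate_w2v_dataset,
# so that rows of the t-SNE output line up with rows of clickbait_val
document_vectors_plotting = (
    pd.DataFrame(document_vectors_val_tsne, columns=["dimension_1", "dimension_2"])
    .assign(labels=final_labels_val)
    .assign(text=clickbait_val["text"])
)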

In [86]:
fig = px.scatter(
    document_vectors_plotting,
    x = "dimension_1",
    y = "dimension_2",
    color = "labels",
    title = "Vector space of documents in validation set",
    custom_data=["labels", "text"]
)
fig.update_traces(
    hovertemplate = "<br>".join([
        "Category: %{customdata[0]}",
        "Headline: %{customdata[1]}"
    ])
)
fig.show()

## Train clickbait classifier

In [89]:
w2v_classification_model = train_text_classification_model(
    document_vectors_train,
    final_labels_train,
    document_vectors_val,
    final_labels_val,
    100,  # same value as the word2vec vector_size
    20,   # the training log shows 20 epochs
    32
)

Epoch 1/20
...
Epoch 20/20


In [90]:
clickbait_val["w2v_baseline_pred"] = generate_predictions(w2v_classification_model, document_vectors_val,
                                                          final_labels_val)

col_0   0.0   1.0
row_0            
0      3015   189
1        95  3101
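
The cross-tabulation above does not say which axis holds the true labels and which the predictions, so the sketch below derives only quantities that do not depend on that orientation. It assumes that both axes list the classes in the same order (0, then 1), so that the diagonal cells are the correctly classified headlines.

```python
# Counts copied from the cross-tabulation printed above.
confusion = [
    [3015, 189],
    [95, 3101],
]

total = sum(sum(row) for row in confusion)
correct = confusion[0][0] + confusion[1][1]  # diagonal: prediction equals label
errors = total - correct                     # off-diagonal cells
accuracy = correct / total

print(f"headlines scored: {total}")
print(f"misclassified:    {errors}")
print(f"accuracy:         {accuracy:.4f}")
```

Precision and recall per class would need the axis orientation to be known, so they are left out here.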


In [91]:
clickbait_val.loc[(clickbait_val["label"] == 1) & (clickbait_val["w2v_baseline_pred"] == 0), "text"][:10]

6      Phoebe Buffay Is Supposed To Die On October 15...
49     This Body Cam Footage Shows A Vehicle Plow Int...
52     Ariana Grande Flawlessly Shut Down Sexist Comm...
78     Robert Pattinson Has Grown A Humongously Bushy...
83     Photographer Gregory Crewdson Releases Hauntin...
160    Watch Footage Of Two Sikh Men Unraveling Their...
234    Joe Biden And Stephen Colbert Have A Remarkabl...
304    Watch 100 Years Of Brazilian Beauty In A Littl...
365                  7 Struggles Of Taking One More Shot
383      Stephanie Mills Destroyed Us In NBC's "The Wiz"
Name: text, dtype: object

In [93]:
clickbait_val.loc[(clickbait_val["label"] == 0) & (clickbait_val["w2v_baseline_pred"] == 1), "text"][:10]

4                               Where Is Oil Going Next?
46     With High-Speed Camera, Glimpsing Worlds Too F...
69             A World of Lingo (Out of This World, Too)
112         Advertisers Change Game Plans for Super Bowl
184              Posted deadlines for Christmas delivery
391    For Refugees, Recession Makes Hard Times Even ...
421        Samsung + T-Mobile = Phone With a Real Camera
430                           Sears Tower Is Going Green
443     Panasonic GH1 Merges S.L.R. Photos With HD Video
488        TomTom Go 740 Live Has Cellphone Connectivity
Name: text, dtype: object