# Text classification of clickbait headlines
## Word embeddings: word2vec

Word embeddings are representations of each word's meaning, derived by examining the contexts in which the word is used across a large text corpus. The meanings are represented as n-dimensional vectors, which in this case are taken from the hidden layer of a word2vec model. These embeddings can be compared to each other in the n-dimensional space: words that have similar meanings in the training corpus end up close together, while those with dissimilar meanings end up far apart.
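
To make "close together" concrete, one common way to compare two embeddings is cosine similarity, the cosine of the angle between the vectors. The sketch below uses made-up 3-dimensional vectors (the words and values in `toy_embeddings` are invented for illustration and do not come from any trained model) to show the calculation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, invented for illustration only
toy_embeddings = {
    "best": [0.9, 0.8, 0.1],
    "greatest": [0.8, 0.9, 0.2],
    "insulin": [-0.7, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_embeddings["best"], toy_embeddings["greatest"])
sim_far = cosine_similarity(toy_embeddings["best"], toy_embeddings["insulin"])
print(f"best vs greatest: {sim_close:.3f}")
print(f"best vs insulin:  {sim_far:.3f}")
```

Values near 1 mean the vectors point in nearly the same direction; values near 0 or below mean they are unrelated or opposed.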

## Load in dependencies and data

In [3]:
import pandas as pd
import numpy as np

from support_functions import train_text_classification_model, generate_predictions

In [12]:
cwd = "Users/jodie.burchell/Documents/git/text-to-vectors"

In [2]:
# Load in train and validation data
clickbait_train = pd.read_csv(f"{cwd}/data/clickbait_train.csv", sep="\t", header=0)
clickbait_val = pd.read_csv(f"{cwd}/data/clickbait_val.csv", sep="\t", header=0)

## Prepare data for word2vec training

To get the data ready for word2vec training, we need to do a small amount of preparation.

First, we do some light string cleaning: converting all characters to lowercase, replacing punctuation and other non-alphanumeric characters with spaces, and removing additional whitespace. Digits are kept. This is because word2vec models, like bag-of-words models, are based on word tokens, so we want to normalise the text as much as possible before creating the embeddings.

In [3]:
def apply_light_string_cleaning(dataset: pd.Series) -> pd.Series:
    """
    Cleans each string in a Series: converts all characters to lowercase, replaces runs of
    non-alphanumeric characters (including underscores) with a space, and removes additional whitespace.
    """
    return (
        dataset
        .str.lower()
        # Raw strings stop "\W" and "\s" being treated as invalid string escape sequences
        .str.replace(r"[\W_]+", " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

In [4]:
# Apply the string cleaning to the train and validation data
clickbait_train["text_clean"] = apply_light_string_cleaning(clickbait_train["text"])
clickbait_val["text_clean"] = apply_light_string_cleaning(clickbait_val["text"])
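
As a quick check of what these cleaning steps do, the sketch below applies the same two regular expressions to single strings using Python's built-in `re` module rather than pandas. The helper name `clean_headline` is hypothetical; the sample headlines are ones that appear elsewhere in this notebook.

```python
import re

def clean_headline(text: str) -> str:
    """Single-string equivalent of the pandas cleaning steps (hypothetical helper)."""
    text = text.lower()
    # Replace runs of non-word characters and underscores with one space
    text = re.sub(r"[\W_]+", " ", text)
    # Collapse any remaining runs of whitespace, then trim the ends
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_headline("New insulin-resistance discovery may help diabetes sufferers"))
# new insulin resistance discovery may help diabetes sufferers
print(clean_headline("7 Struggles Of Taking One More Shot"))
# 7 struggles of taking one more shot
```

Note that the digit in the second headline survives, because `\W` only matches characters that are not letters, digits, or underscores.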

Finally, we split each sentence into a list of words, the expected format for a word2vec model.

In [5]:
# Convert sentences into list of lists for training
clickbait_w2v_training = clickbait_train["text_clean"].str.split(r"\s").to_list()

# Remove missing values (NaN), which are floats rather than lists of tokens
clickbait_w2v_training = [s for s in clickbait_w2v_training if isinstance(s, list)]

In [6]:
# Example of the cleaned and converted clickbait headline
clickbait_w2v_training[0]

['new',
 'insulin',
 'resistance',
 'discovery',
 'may',
 'help',
 'diabetes',
 'sufferers']

## Train w2v model to get word embeddings

In [7]:
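# Import gensim Word2Vec method
from gensim.models import Word2Vec

# Train word2vec model
w2v_model = Word2Vec(sentences=clickbait_w2v_training,
                     vector_size=100,  # dimensionality of each word embedding
                     window=5,         # maximum distance between target and context word
                     min_count=2,      # ignore words that appear fewer than 2 times
                     workers=4,        # number of worker threads
                     sg=1)             # 1 = skip-gram, 0 = CBOW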
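
In gensim, `sg=1` selects the skip-gram training algorithm, in which the model learns to predict the words surrounding each target word, and `window` sets the maximum distance between a target word and a context word. As a rough illustration of what that means for a single headline, the sketch below lists (target, context) pairs for a fixed window. The function `skipgram_pairs` is a hypothetical helper, not part of gensim, and it is a simplification: gensim's actual training does more than enumerate pairs, and none of the optimisation itself is shown here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

headline = ["new", "insulin", "resistance", "discovery",
            "may", "help", "diabetes", "sufferers"]
pairs = skipgram_pairs(headline, window=2)
print(pairs[:5])
# [('new', 'insulin'), ('new', 'resistance'), ('insulin', 'new'),
#  ('insulin', 'resistance'), ('insulin', 'discovery')]
```

Words that repeatedly share the same context words across many headlines are pushed towards similar embeddings.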

In [8]:
# Retrieve word embedding for "best"
print(w2v_model.wv["best"])

[-2.88190663e-01  2.56053746e-01 -1.04162320e-01  8.60497132e-02
  1.00232720e-01 -4.76431906e-01  4.36868221e-01  6.02400482e-01
 -5.12246609e-01 -5.40899277e-01  2.65833810e-02 -4.35974747e-01
  1.53366581e-01  1.78240821e-01  5.79746664e-01 -5.21296030e-03
  2.77763575e-01 -6.06821477e-03 -1.45945728e-01 -5.57567596e-01
  7.00066388e-02  2.30736330e-01 -1.01522334e-01 -1.40830933e-04
  1.05173676e-03 -8.62729475e-02 -1.89191237e-01 -2.32163504e-01
 -1.39115909e-02 -8.23591948e-02  6.46053910e-01 -1.72430322e-01
 -8.42885580e-03 -4.32763845e-01 -5.46497881e-01  4.41423714e-01
  1.84663013e-01 -6.35258481e-02 -1.93889454e-01 -5.05963504e-01
 -1.45569175e-01  1.49315089e-01 -7.83952773e-02 -4.75108176e-02
  2.12346032e-01 -6.14013262e-02 -1.68439135e-01 -4.50822860e-01
  1.54646203e-01  1.44193724e-01  2.48018473e-01 -4.35005307e-01
  6.89142719e-02 -6.29392207e-01 -2.33305573e-01 -2.27976829e-01
  4.10295092e-02  7.56473839e-02  3.51655111e-02  3.00977468e-01
 -1.04890116e-01 -2.65275

In [9]:
# Find words most similar to "best"
w2v_model.wv.most_similar("best")

[('worst', 0.9578117728233337),
 ('greatest', 0.950883150100708),
 ('funniest', 0.936430037021637),
 ('cutest', 0.9352415204048157),
 ('most', 0.9339564442634583),
 ('twitter', 0.9231436252593994),
 ('friend', 0.9160161018371582),
 ('cast', 0.9158626198768616),
 ('thing', 0.9145550727844238),
 ('absolute', 0.9111383557319641)]

## Extract vectors and average them across the documents

In [10]:
def extract_document_vectors(model: Word2Vec, text: str, len_vectors: int):
    """
    Takes in a clickbait headline, and iterates over every word in the sequence. For each word, it retrieves
    its word embedding from the word2vec model, and then appends it to a NumPy array. Returns this array of
    word embeddings.
    """
    # Create empty NumPy array
    vectors = np.empty((0, len_vectors), float)
    # Loop over each word in clickbait headline
    for word in text.split():
        # Checks if word is in word2vec model
        if word in model.wv.key_to_index:
            # Retrieves embedding and appends it to the vectors array
            v = model.wv[word]
            vectors = np.append(vectors, np.array([v]), axis=0)
    return vectors


def calculate_w2v_dataset(model: Word2Vec, dataset: pd.DataFrame, len_vectors: int):
    """
    Creates a NumPy array which contains the average embedding for each headline, as well as a matching
    array of labels (whether each headline is clickbait or non-clickbait). Headlines with no words in the
    word2vec vocabulary are skipped.
    """
    # Create an empty NumPy array to contain the averaged headline vectors
    document_vectors = np.empty((0, len_vectors), float)
    # Create an empty list for the headline labels
    matched_labels = []
    # Iterate over the dataset containing the cleaned headline and the label
    for index, row in dataset.iterrows():
        # Extract the array of word embeddings for each headline
        v = extract_document_vectors(model, row["text_clean"], len_vectors)
        # Check if the array is not empty
        if v.shape[0] > 0:
            # Average the array to yield one headline embedding
            v_mean = v.mean(axis=0)
            # Append the headline embedding and label
            document_vectors = np.append(document_vectors, np.array([v_mean]), axis=0)
            matched_labels.append(row["label"])
    return document_vectors, np.array(matched_labels)
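
The averaging step can be seen in miniature below. This is a sketch using a made-up lookup table of 3-dimensional vectors in place of the trained model (the dictionary `toy_vectors` and the helper `average_embedding` are hypothetical), and it stacks the word vectors once rather than appending in a loop. Words missing from the lookup are skipped, mirroring the vocabulary check above.

```python
import numpy as np

# Hypothetical toy lookup standing in for model.wv; values are invented
toy_vectors = {
    "new": np.array([1.0, 0.0, 2.0]),
    "discovery": np.array([3.0, 2.0, 0.0]),
    "help": np.array([2.0, 4.0, 1.0]),
}

def average_embedding(text, vectors):
    """Average the vectors of the in-vocabulary words; return None if there are none."""
    found = [vectors[word] for word in text.split() if word in vectors]
    if not found:
        return None
    stacked = np.vstack(found)   # shape: (n_words_found, n_dimensions)
    return stacked.mean(axis=0)  # shape: (n_dimensions,)

print(average_embedding("new insulin discovery may help", toy_vectors))
# [2. 2. 1.]
print(average_embedding("unknown words only", toy_vectors))
# None
```

Averaging discards word order, so two headlines containing the same words in a different order receive the same document vector.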

In [11]:
# Extract the document embeddings for each dataset
document_vectors_train, final_labels_train = calculate_w2v_dataset(w2v_model, clickbait_train, 100)
document_vectors_val, final_labels_val = calculate_w2v_dataset(w2v_model, clickbait_val, 100)

In [12]:
print(f"Document embedding for '{clickbait_train['text'][0]}'")
print(document_vectors_train[0])

Document embedding for 'New insulin-resistance discovery may help diabetes sufferers'
[-4.28797632e-02  1.90623574e-01  5.63060039e-02  1.70299198e-01
  1.26286471e-02 -2.95280412e-01  7.38005054e-02  3.30918918e-01
 -9.69956213e-02 -1.26176431e-01 -9.31461652e-04 -2.19160682e-01
 -4.91200850e-02  1.05296968e-01  1.49029087e-02 -1.32236180e-01
  1.81723403e-02 -1.46027402e-01  1.01369721e-01 -4.44048584e-01
  1.26423729e-01  5.65344516e-02  9.69023287e-02 -1.00688658e-01
  2.89612760e-03  9.48078775e-04 -3.84141317e-02 -9.09979551e-02
 -2.67467407e-01  1.18401208e-02  1.44312273e-01 -8.80012494e-02
  1.29057348e-01 -1.83880170e-01 -2.65405855e-02  1.96613671e-01
  9.31484954e-02 -2.29062152e-02 -7.69391339e-02 -2.46617089e-01
  6.63064066e-02 -2.92988983e-01 -2.18627728e-01  1.14907216e-01
  1.60316158e-01 -1.49795049e-01 -1.91479467e-01 -1.64924543e-02
  1.70000556e-01  1.37826833e-01  8.17044700e-02 -2.00281414e-01
 -2.91220637e-02 -5.16959407e-02 -1.51485297e-01 -2.22565561e-02
  8.

## Train clickbait classifier

In [17]:
# Create a simple neural net which trains on the training data and
# confirms the model performance on the validation set
w2v_classification_model = train_text_classification_model(
    document_vectors_train,
    final_labels_train,
    document_vectors_val,
    final_labels_val,
    100,
    20,
    32
)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [18]:
# Generate a column in the validation data with the predictions
clickbait_val["w2v_baseline_pred"] = generate_predictions(w2v_classification_model,
                                                          document_vectors_val,
                                                          final_labels_val)

col_0   0.0   1.0
row_0            
0      3021   183
1       113  3083
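
The counts in the table above can be turned into summary metrics. Accuracy does not depend on which axis holds the predictions, but precision and recall do. The sketch below assumes that rows are the true labels, columns are the predicted labels, and 1 means clickbait; this is an assumption, since the internals of `generate_predictions` are not shown here. If the axes are the other way around, the precision and recall values swap.

```python
# Counts copied from the table above
# ASSUMPTION: rows = true label, columns = predicted label, 1 = clickbait
true_neg, false_pos = 3021, 183
false_neg, true_pos = 113, 3083

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(f"Accuracy:  {accuracy:.3f}")  # 0.954, regardless of axis orientation
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```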


In [13]:
# Headlines the model thought were not clickbait, but which are
pd.read_csv(f"{cwd}/data/word2vec_incorrect_prediction_not_clickbait.csv",
            sep = "\t",
            header = 0)

| Unnamed: 0 | text |
| --- | --- |
| 0 | Phoebe Buffay Is Supposed To Die On October 15... |
| 1 | This Body Cam Footage Shows A Vehicle Plow Int... |
| 2 | Ariana Grande Flawlessly Shut Down Sexist Comm... |
| 3 | Photographer Gregory Crewdson Releases Hauntin... |
| 4 | Watch Footage Of Two Sikh Men Unraveling Their... |
| 5 | Joe Biden And Stephen Colbert Have A Remarkabl... |
| 6 | Watch 100 Years Of Brazilian Beauty In A Littl... |
| 7 | 7 Struggles Of Taking One More Shot |
| 8 | Stephanie Mills Destroyed Us In NBC's "The Wiz" |
| 9 | We Had Pro Gamers Compete Against Vets At A Sh... |


In [14]:
# Headlines the model thought were clickbait, but which are not
pd.read_csv(f"{cwd}/data/word2vec_incorrect_prediction_clickbait.csv",
            sep = "\t",
            header = 0)

| Unnamed: 0 | text |
| --- | --- |
| 0 | Where Is Oil Going Next? |
| 1 | With High-Speed Camera, Glimpsing Worlds Too F... |
| 2 | A World of Lingo (Out of This World, Too) |
| 3 | Advertisers Change Game Plans for Super Bowl |
| 4 | Posted deadlines for Christmas delivery |
| 5 | For Refugees, Recession Makes Hard Times Even ... |
| 6 | Samsung + T-Mobile = Phone With a Real Camera |
| 7 | Sears Tower Is Going Green |
| 8 | Panasonic GH1 Merges S.L.R. Photos With HD Video |
| 9 | TomTom Go 740 Live Has Cellphone Connectivity |
