# PA4 Part 1. Embeddings and Sentence Classification [30 Marks]

<center>
    <img src="./assets/embeddings.jpeg">
</center>

### Introduction

Embeddings are a way to represent words (or more generally, *tokens*) as vectors. These vectors are useful for many tasks in NLP, including but not limited to: Text Generation, Machine Translation, and Sentence Classification. In this notebook, we will be exploring the concept of Embeddings, and using them for Sentence Classification.

After this notebook, you should be able to:

- Understand the concepts of Embeddings and Vector Similarity.

- Use pre-trained Embeddings for Sentence Classification.

### Instructions

- Follow along with the notebook, filling out the necessary code where instructed.

- <span style="color: red;">Read the Submission Instructions and Plagiarism Policy in the attached PDF.</span>

- <span style="color: red;">Make sure to run all cells for credit.</span>

- <span style="color: red;">Do not remove any pre-written code.</span> We will be using the `print` statements to grade you.

- <span style="color: red;">You must attempt all parts.</span> Do not assume that because something is for 0 marks, you can leave it - it will definitely be used in later parts.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import re

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split

import gensim.downloader as api
from gensim.models.word2vec import Word2Vec

!pip install gpt4all
from gpt4all import Embed4All

import torch
import torch.nn as nn

nltk.download('stopwords')
nltk.download('wordnet')

Collecting gpt4all
  Downloading gpt4all-2.0.2-py3-none-manylinux1_x86_64.whl (4.9 MB)
Installing collected packages: gpt4all
Successfully installed gpt4all-2.0.2


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Exploring Embeddings [10 Marks]

Put simply, Embeddings are fixed-size **dense** vector representations of tokens in natural language. This means you can represent words as vectors, sentences as vectors, even other entities like entire graphs as vectors.

So what really makes them different from something like One-Hot vectors?

What's special is that they have semantic meaning baked into them. This means you can model relationships between entities in text, which itself leads to a lot of fun applications. All modern architectures make use of Embeddings in some way.

You can read more about them [here](https://aman.ai/primers/ai/word-vectors/).
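To make the contrast with One-Hot vectors concrete, here is a minimal sketch in plain Python. The dense values are hand-picked for illustration and do not come from any trained model; the three-word vocabulary is likewise an assumption.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

vocab = ["king", "queen", "apple"]

# One-Hot: one dimension per vocabulary word, with a single 1 at the word's index.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense: toy values chosen by hand (NOT learned), two "features" per word.
dense = {
    "king": [0.9, 0.8],
    "queen": [0.8, 0.9],
    "apple": [-0.7, 0.1],
}

# Any two distinct One-Hot vectors have a dot product of 0,
# so they carry no information about how related two words are.
print(dot(one_hot["king"], one_hot["queen"]))  # 0.0
print(dot(one_hot["king"], one_hot["apple"]))  # 0.0

# Dense vectors can place related words closer together than unrelated ones.
print(dot(dense["king"], dense["queen"]) > dot(dense["king"], dense["apple"]))  # True
```

With One-Hot vectors every pair of different words looks equally unrelated, whereas dense vectors leave room for "king" to sit nearer to "queen" than to "apple".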

We will be using *pretrained* Embeddings: this means that we will be using Embeddings that have already been trained on a large corpus of text. This is because training Embeddings from scratch is a very computationally expensive task, and we don't have the resources to do so. Fortunately, some Good Samaritans have already done this for us, and we can use their publicly available Embeddings for our own tasks.


This part will allow you to explore what Embeddings are. We will load in pretrained Embeddings here and examine some of their properties. If you're interested, feel free to look up the [Word2Vec model](https://arxiv.org/abs/1301.3781): this is the model that was trained to give us the embeddings you will see below.

In [None]:
# Download the text8 corpus and fit a gensim Word2Vec model on it (this may take a few minutes)
import gensim.downloader as api
from gensim.models.word2vec import Word2Vec

corpus = api.load('text8')
w2vmodel = Word2Vec(corpus)

print("Done loading word2vec model!")

Done loading word2vec model!


Now that we've loaded in the Embeddings, we can create an Embedding **layer** in PyTorch, `nn.Embedding`, that will perform the processing step for us.

Note in the following cell that the model has a fixed **vocab size** and **embedding dimension**. The vocab size matters because some sets of Embeddings are defined for a large set of words (a large vocab), whereas others, often older ones, cover a smaller set (a small vocab). The embedding dimension tells us how many *features* have been learned for each word, which we can build further processing on top of.
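Conceptually, an Embedding layer is a lookup table with one row per vocabulary word. The sketch below mimics that with plain Python lists; the three-word vocabulary and the matrix values are made up for illustration and are unrelated to the model loaded above.

```python
# Toy vocabulary: each word is encoded as its row index in the embedding matrix.
key_to_index = {"the": 0, "king": 1, "queen": 2}

# Embedding matrix with shape (vocab_size, embedding_dim) = (3, 4).
# The numbers are invented for this example.
embedding_matrix = [
    [0.1, 0.0, 0.2, 0.3],
    [0.9, 0.8, 0.1, 0.4],
    [0.8, 0.9, 0.2, 0.4],
]

vocab_size = len(embedding_matrix)        # number of rows
embedding_dim = len(embedding_matrix[0])  # number of features per word

def embed(words):
    """Map each word to its index, then return the matching rows."""
    return [embedding_matrix[key_to_index[word]] for word in words]

vectors = embed(["king", "queen"])
print(vocab_size, embedding_dim)      # 3 4
print(len(vectors), len(vectors[0]))  # 2 4
```

However many words are looked up, each one comes back as a vector of length `embedding_dim`.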

In [None]:
# Define embedding layer using gensim
embedding_layer = nn.Embedding.from_pretrained(torch.FloatTensor(w2vmodel.wv.vectors))

# Get some information from the w2vmodel
print(f"Vocab size: {len(w2vmodel.wv.key_to_index)}")

print(f"Some of the words in the vocabulary:\n{list(w2vmodel.wv.key_to_index.keys())[:10]}")

print(f"Embedding dimension: {w2vmodel.wv.vectors.shape[1]}")

Vocab size: 71290
Some of the words in the vocabulary:
['the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two']
Embedding dimension: 100


Now, for a demonstration, we instantiate two words, turn them into numbers (encoding them via their index in the vocab), and pass them through the Embedding layer.

Note how the resultant Embeddings both have the same shape: 1 word, and 100 elements in the vector.

In [None]:
# Take two words and get their embeddings
word1 = "king"
word2 = "queen"

def word2vec(word):
    return embedding_layer(torch.LongTensor([w2vmodel.wv.key_to_index[word]]))

king_embedding = word2vec(word1)
queen_embedding = word2vec(word2)

print(f"Embedding Shape for '{word1}': {king_embedding.shape}")
print(f"Embedding Shape for '{word2}': {queen_embedding.shape}")

Embedding Shape for 'king': torch.Size([1, 100])
Embedding Shape for 'queen': torch.Size([1, 100])


When we have vectors whose scale is arbitrary, one nice way to measure how *similar* they are is with the Cosine Similarity measure.


$$ \text{Cosine Similarity}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} $$


We can apply this idea to our Embeddings. To see how "similar" two words are to the model, we can generate their Embeddings and take the Cosine Similarity of them. This will be a number between -1 and 1 (just like the range of the cosine function). The closer the number is to 1, the more similar the words are to the model; when it is close to 0, the words are not similar; and values near -1 mean the vectors point in opposite directions.
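As a quick sanity check of the formula, here is a small worked example in plain Python on hand-picked 2D vectors (not real Embeddings). The helper name `cosine` is only for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0]

# Rounded to hide floating-point noise from the square roots.
same_direction = round(cosine(u, [3.0, 6.0]), 6)    # u scaled by 3
perpendicular = round(cosine(u, [-2.0, 1.0]), 6)    # at a right angle to u
opposite = round(cosine(u, [-1.0, -2.0]), 6)        # u flipped

print(same_direction, perpendicular, opposite)  # 1.0 0.0 -1.0
```

Note that scaling a vector does not change the result, which is why this measure suits vectors whose scale is arbitrary.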

In [None]:
def cosine_similarity(vec1, vec2):
    '''
    Computes the cosine similarity between two vectors
    '''

    # TODO: Compute the cosine similarity between the two vectors (using PyTorch)
    dot_product = torch.sum(vec1 * vec2)
    norm_vec1 = torch.norm(vec1)
    norm_vec2 = torch.norm(vec2)

    similarity = dot_product / (norm_vec1 * norm_vec2)

    return similarity

def compute_word_similarity(word1, word2):
    '''
    Takes in two words, computes their embeddings and returns the cosine similarity
    '''
    if word1 not in w2vmodel.wv or word2 not in w2vmodel.wv:
        return "One or both words not in vocabulary"

    embedding_word1 = embedding_layer(torch.LongTensor([w2vmodel.wv.key_to_index[word1]]))
    embedding_word2 = embedding_layer(torch.LongTensor([w2vmodel.wv.key_to_index[word2]]))

    similarity = cosine_similarity(embedding_word1, embedding_word2)

    return similarity

# TODO: Define three words (one pair should be similar and one pair should be dissimilar) and compute their similarity
word1 = 'good'
word2 = 'bad'
word3 = 'good'
print(f"Similarity between '{word1}' and '{word2}': {compute_word_similarity(word1, word2)}")
print(f"Similarity between '{word1}' and '{word3}': {compute_word_similarity(word1, word3)}")

Similarity between 'good' and 'bad': 0.7351477742195129
Similarity between 'good' and 'good': 1.0


In [None]:
# Run this cell if you're done with the above section
del embedding_layer

## Sentence Classification with Sentence Embeddings [20 Marks]

Now let's move on to an actual application: classifying whether a tweet is about a real disaster or not. As you can imagine, this could be a valuable model when monitoring social media for disaster relief efforts.

Since we are using Sentence Embeddings, we want something that will take in a sequence of words and output a single fixed-size vector. For this task, we will make use of an LLM via the `gpt4all` library.

This library will allow us to generate pretrained embeddings for sentences, which we can use as **features** to feed to any classifier of our choice.
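To see how a sequence of any length can become a single fixed-size vector, here is a minimal sketch of mean pooling, one common strategy for this. It is an illustration only and makes no claim about what `gpt4all` does internally; the token vectors are made up.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Two "sentences" with different numbers of tokens, 3 features per token.
short_sentence = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
long_sentence = [[1.0, 1.0, 1.0], [0.0, 2.0, 4.0], [2.0, 0.0, 1.0], [1.0, 1.0, 2.0]]

short_vector = mean_pool(short_sentence)
long_vector = mean_pool(long_sentence)

print(short_vector)  # [2.0, 1.0, 1.0]
print(long_vector)   # [1.0, 1.0, 2.0]
```

Both results have the same length, so a classifier can treat them as feature vectors regardless of how long the original sentence was.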

In [None]:
# Read in the data here
df = pd.read_csv("/content/drive/MyDrive/ML PA4/disaster_tweets.csv")
df = df[["text", "target"]]

# Split the data
train, val = train_test_split(df, test_size=0.2, random_state=42)

print(train.shape, val.shape)

(6090, 2) (1523, 2)


Before jumping straight to Embeddings, since our data is sourced from the cesspool that is Twitter, we should probably do some cleaning. This can involve the removal of URLs, punctuation, numbers that don't provide any meaning, stopwords, and so on.

In the following cell, write functions to clean the sentences. You are allowed to add more functions if you wish, but the ones provided are the bare minimum.

**Note:** After cleaning your sentences, it is possible that you may end up with empty sentences (or some that are so short they have lost all meaning). In this event, since we want to demonstrate setting up a Sentence Classification task, you should remove them from your dataset (data cleaning is not the center of this notebook).
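As a rough idea of what cleaning and filtering can look like, here is a self-contained sketch using only the `re` module. The tweets, the tiny stopword set, and the length threshold of 10 are all invented for illustration; your own functions should use the NLTK stopword list and whatever threshold you find sensible.

```python
import re

# Tiny hand-written stopword set for illustration only.
STOPWORDS = {"the", "is", "a", "in", "at", "of", "and", "to"}

def clean(text):
    """Lowercase, then strip URLs, punctuation, digits, and stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # URLs first, while "://" is still intact
    text = re.sub(r"[^\w\s]", "", text)  # punctuation
    text = re.sub(r"\d+", "", text)      # digits
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)

# Made-up example tweets.
tweets = [
    "Wildfire spreading near the highway!! 3 roads closed http://t.co/abc123",
    "LOL!!! http://t.co/xyz",
]

cleaned = [clean(tweet) for tweet in tweets]
kept = [sentence for sentence in cleaned if len(sentence) >= 10]  # hypothetical threshold

print(cleaned)  # ['wildfire spreading near highway roads closed', 'lol']
print(kept)     # ['wildfire spreading near highway roads closed']
```

The second tweet shrinks to a single word after cleaning, which is exactly the kind of sentence the note above suggests dropping.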

In [None]:
# TODO: Clean the sentences (5 marks)
import nltk
from nltk.tokenize import word_tokenize

# TODO: Fill out the following functions, adding more if desired
nltk.download('punkt')
nltk.download('stopwords')

def lowercase(txt):

    return txt.lower()

def remove_punctuation(txt):

    return re.sub(r'[^\w\s]', '', txt)

def remove_stopwords(txt):

    stop_words = set(stopwords.words('english'))
    words = word_tokenize(txt)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

def remove_numbers(txt):

    return re.sub(r'\d+', '', txt)

def remove_url(txt):

    return re.sub(r'http\S+', '', txt)

def normalize_sentence(txt):
    '''
    Aggregates all the above functions to normalize/clean a sentence
    '''
    txt = lowercase(txt)
    txt = remove_punctuation(txt)
    txt = remove_stopwords(txt)
    txt = remove_numbers(txt)
    txt = remove_url(txt)
    return txt

def filter_short_sentences(df, min_length=20):
    return df[df['text'].apply(lambda x: len(x) >= min_length)]

df = pd.read_csv("/content/drive/MyDrive/ML PA4/disaster_tweets.csv")
df = df[["text", "target"]]


# TODO: Clean the sentences
df['clean_text'] = df['text'].apply(normalize_sentence)

# TODO: Filter sentences that are too short (less than 20ish characters)
df = filter_short_sentences(df)
train, val = train_test_split(df, test_size=0.2, random_state=42)

print(train.shape, val.shape)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(6035, 3) (1509, 3)


Now for the fun part, creating our Embeddings!

We will be using the `gpt4all.Embed4All` class for this purpose. You can look up the documentation [here](https://docs.gpt4all.io/gpt4all_python_embedding.html#gpt4all.gpt4all.Embed4All.embed).

This functionality makes use of a model called [Sentence-BERT](https://arxiv.org/abs/1908.10084). This is a Transformer-based model that has been trained on a large corpus of text, and is able to generate high-quality Sentence Embeddings for us.

In [None]:
from gpt4all import Embed4All

# TODO: Generate embeddings for train and validation sentences (5 marks)
feature_extractor = Embed4All()

# TODO: Encode the train sentences
train_embeddings = [feature_extractor.embed(sentence) for sentence in train['clean_text'].tolist()]

# TODO: Encode the validation sentences
val_embeddings = [feature_extractor.embed(sentence) for sentence in val['clean_text'].tolist()]

train_embeddings = np.vstack(train_embeddings)
val_embeddings = np.vstack(val_embeddings)

# TODO: Ready the labels
train_labels = train['target'].tolist()
val_labels = val['target'].tolist()

print(f"Train embeddings shape: {train_embeddings.shape}")
print(f"Validation embeddings shape: {val_embeddings.shape}")

100%|██████████| 45.9M/45.9M [00:01<00:00, 32.8MiB/s]


Train embeddings shape: (6035, 384)
Validation embeddings shape: (1509, 384)


Now with our Embeddings ready, we can move on to the actual classification task.

You have the choice of using **any** classifier you wish. You can use a simple Logistic Regression model, get fancy with Support Vector Machines, or even use a Neural Network. The choice is yours.

We will be looking for a model with a **Validation Accuracy** of around $0.8$. You must also use this model to make predictions on your own provided inputs, after completing the `predict` function.
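If the workflow of "fit on train features, predict on validation features, measure accuracy" is unfamiliar, the sketch below walks through it with a nearest-centroid classifier written in plain Python. This is a deliberately simple stand-in, not the model you are expected to submit, and the 2D features and labels are toy values rather than real Sentence Embeddings.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(vec[i] for vec in vectors) / count for i in range(len(vectors[0]))]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fit(features, labels):
    """Compute one centroid per class label."""
    return {
        label: centroid([f for f, y in zip(features, labels) if y == label])
        for label in set(labels)
    }

def predict_label(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], x))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: 2D "embeddings" with binary labels.
train_X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
train_y = [0, 0, 1, 1]
val_X = [[0.1, 0.2], [0.9, 0.8], [0.6, 0.4], [0.3, 0.4]]
val_y = [0, 1, 1, 0]

centroids = fit(train_X, train_y)
val_pred = [predict_label(centroids, x) for x in val_X]

print(val_pred)                   # [0, 1, 0, 0]
print(accuracy(val_y, val_pred))  # 0.75
```

The same three steps apply whichever classifier you pick; only the `fit` and `predict` internals change.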

In [None]:
# TODO: Get 0.8 Validation Acc with a Classifier (5 marks)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

classifier = LogisticRegression(random_state=42)

scaler = StandardScaler()

train_embeddings_scaled = scaler.fit_transform(train_embeddings)
val_embeddings_scaled = scaler.transform(val_embeddings)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

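# liblinear supports both the 'l1' and 'l2' penalties in the grid above;
# the default lbfgs solver only supports 'l2'
grid_search = GridSearchCV(
    LogisticRegression(random_state=42, solver='liblinear', max_iter=1000),
    param_grid,
    cv=3,
)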
grid_search.fit(train_embeddings_scaled, train_labels)

best_classifier = grid_search.best_estimator_

val_predictions = best_classifier.predict(val_embeddings_scaled)

val_accuracy = accuracy_score(val_labels, val_predictions)

print(f"Validation Accuracy: {val_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(val_labels, val_predictions))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Validation Accuracy: 0.7932

Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.83      0.82       876
           1       0.76      0.75      0.75       633

    accuracy                           0.79      1509
   macro avg       0.79      0.79      0.79      1509
weighted avg       0.79      0.79      0.79      1509



In [None]:
# TODO: Create a function to predict on a sentence (5 marks)

def predict(sentence, clf, feature_extractor):
    '''
    Takes in a sentence and returns the predicted class along with the probability
    '''
    # TODO: Clean and encode the sentence
    cleaned_sentence = normalize_sentence(sentence)

    embedding = feature_extractor.embed(cleaned_sentence)

    embedding = np.array(embedding).reshape(1, -1)

    # Apply the same scaling the classifier saw during training
    # (uses the `scaler` fitted in the classifier cell above)
    embedding = scaler.transform(embedding)

    # TODO: Predict the class and probability
    prediction = clf.predict(embedding)
    probability = clf.predict_proba(embedding)

    return prediction, probability

# TODO: Predict on a few of your own sentences
sentence1 = "Life is like a race!"
sentence2 = "DO good have good."

# Pass the fitted model from the grid search; `classifier` itself was never fit
prediction1, probability1 = predict(sentence1, best_classifier, feature_extractor)
prediction2, probability2 = predict(sentence2, best_classifier, feature_extractor)

# Showing Result
print(f"Sentence: '{sentence1}'\nPrediction: {prediction1[0]}, Probability: {probability1[0][1]:.4f}")
print()
print(f"Sentence: '{sentence2}'\nPrediction: {prediction2[0]}, Probability: {probability2[0][1]:.4f}")


Sentence: 'Life is like a race!'
Prediction: 0, Probability: 0.3809

Sentence: 'DO good have good.'
Prediction: 0, Probability: 0.3730


Hopefully now you realize the power of Embeddings, and the usefulness of pretrained models.

# Fin.