<a href="https://colab.research.google.com/github/vlassner/DSML_4220_Deep_Learning/blob/main/Lab6_airline_tweets_w_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 6: Airline Tweets with (and without) Embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sgeinitz/DSML4220/blob/main/lab6_airline_tweets_w_embeddings.ipynb)

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/sgeinitz/DSML4220/blob/main/lab6_airline_tweets_w_embeddings.ipynb)

In this notebook we'll revisit the Airline Tweets dataset (from [Lab 1](https://github.com/sgeinitz/DSML4220/blob/main/lab1_text_data.ipynb)) and compare an MLP that takes one-hot encodings as input with one that takes word embeddings as input.

In this lab there are three (3) questions/tasks. These questions are listed here but are also inline below.

1. Q1: Choose two words to compare (different from "_wonderful_" vs "_incredible_"). Re-run the parts of the notebook that plot the histogram of the differences between learned weight parameter values for each of your chosen words across the 128 hidden units in the first layer.
2. Q2: Add your two words to the list of words whose embeddings are displayed and compared. Do your two chosen words have similar embeddings? In other words, is the distance between your embeddings very small?
3. Q3: Compare the size of the two models used in this notebook, one of which uses one-hot encoded inputs and the other which uses GloVe embeddings.

In [None]:
!pip uninstall -y numpy torch torchvision torchtext torchmetrics torchaudio transformers
!pip install numpy==1.25.2
!pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 torchtext==0.16.1 torchmetrics==0.11.4


In [None]:
#!pip install torchmetrics
#!pip install torchmetrics tqdm

In [None]:
import torch
import random
#import tqdm #import notebook
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
#from torchmetrics.functional import pairwise_cosine_similarity


In [None]:
data_URL = 'https://raw.githubusercontent.com/sgeinitz/DSML4220/main/data/airlinetweets.csv'
df = pd.read_csv(data_URL)
print(f"df.shape: {df.shape}")
pd.set_option("display.max_colwidth", 240)
df.head(10)

In [None]:
random.seed(2)
indices = list(range(len(df)))
random.shuffle(indices)

df_test = df.iloc[indices[9000:],]
df = df.iloc[indices[:9000],]

In [None]:
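# print both shapes (a bare expression only displays the last one in a cell)
print(f"df_test.shape: {df_test.shape}")
print(f"df.shape: {df.shape}")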

Recall that about 2/3 of the data have negative labels, and that the remaining labels are roughly split between positive and neutral (slightly more neutral than positive).

In [None]:
df.sentiment.value_counts(normalize=True)
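
Because roughly two thirds of the labels are negative, a classifier that always predicts "negative" would already be right about two thirds of the time. The sketch below computes that majority-class baseline on a small hypothetical label list (`labels_toy` is made up for illustration and is not the real data); it is only meant as a reference point when reading the accuracies reported later.

```python
from collections import Counter

# Hypothetical labels with roughly the proportions described above
# (about 2/3 negative, the rest split between neutral and positive).
labels_toy = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 1

# The majority-class baseline always predicts the most frequent label.
counts = Counter(labels_toy)
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels_toy)

print(f"majority label: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

A trained model is only useful to the extent that it beats this baseline.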

Let's start with the nltk TweetTokenizer, which will split the text into separate words and characters based on common Twitter conventions.

In [None]:
from nltk.tokenize import TweetTokenizer
tk = TweetTokenizer()
df['tokens_raw'] = df['text'].apply(lambda x: tk.tokenize(x.lower()))
df.head()

Previously, we did not do a lot of exploratory data analysis (EDA) on this airline tweet dataset. We will not do too much here either, but at the very least let's look at a histogram of the lengths of the tweets. Note that here we are defining length to be the number of tokens, but it may also be useful to look at the number of characters. And, of course, there are other EDA steps we could do.

In [None]:
df['tweet_length'] = df['tokens_raw'].apply(lambda x: len(x))
plt.figure(figsize=(12,6))
df['tweet_length'].hist() #bins=100, range=(0,45), width=0.9) #, df['tweet_length'].mean(), df['tweet_length'].median()
plt.show()

In [None]:
import nltk
nltk.download('stopwords')

Next, let's remove common stop words (e.g. "_the_", "_in_", etc.). In this next cell we will also remove some characters/punctuation, as well as hashtag tokens.

Note: If the following cell raises an error about missing stop words, run the code cell above first to download the nltk stopwords.

In [None]:
import re
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
chars2remove = set(['.','!','/', '?'])
df['tokens_raw'] = df['tokens_raw'].apply(lambda x: [w for w in x if w not in stops])
df['tokens_raw'] = df['tokens_raw'].apply(lambda x: [w for w in x if w not in chars2remove])
df['tokens_raw'] = df['tokens_raw'].apply(lambda x: [w for w in x if not re.match('^#', w)]) # remove hashtags
#df['tokens_raw'] = df['tokens_raw'].apply(lambda x: [w for w in x if not re.match('^http', w)]) # remove web links
#df['tokens_raw'] = df['tokens_raw'].apply(lambda x: [w for w in x if not re.match('^@', w)]) # remove @mentions

df.head()
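
As a small-scale illustration of the filtering above, here is a sketch that applies the same three rules (stop words, selected punctuation, hashtags) to one hypothetical token list. The names `stops_toy`, `chars_toy`, and `tokens_toy` are invented for this example; the real cell uses the full nltk stop-word list.

```python
import re

# Hypothetical stop-word set, punctuation set, and token list (illustration only).
stops_toy = {"the", "is", "to", "my"}
chars_toy = {".", "!", "/", "?"}
tokens_toy = ["@united", "my", "flight", "is", "late", "!", "#fail", "http://t.co/x"]

def clean_tokens(toks):
    """Drop stop words, the selected punctuation characters, and hashtag tokens."""
    return [
        t for t in toks
        if t not in stops_toy
        and t not in chars_toy
        and not re.match(r"^#", t)
    ]

cleaned = clean_tokens(tokens_toy)
print(cleaned)
```

Note that the @mention and the web link survive, which matches the notebook: the two lines that would remove them are commented out.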

For the final step of text pre-processing we will lemmatize the tokens. Note that there are much better ways to do this but that we want to use a simple lemmatizer. For example, some lemmatizers also utilize a model internally to predict the part-of-speech for each word, since whether the word is a noun, adjective, verb, etc. will affect how lemmatization is done. Since we want to keep things simple here, and focus only on the lemmatization step, we'll assume every word is the same part of speech. Note that this is not by any means ideal (try to identify the incorrectly lemmatized token in the five tweets printed out below). In practice we would certainly utilize a 'smarter' lemmatizer.

The last step below is to combine the tokens back into a single string, which is stored in the column `textclean`.

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()
df['tokens'] = df['tokens_raw'].apply(lambda x: [lemmatizer.lemmatize(w, pos="v") for w in x])
#df['tokens'] = df['tokens_raw'].apply(lambda x: [lemmatizer.lemmatize(w) for w in x])

df['textclean'] = df['tokens'].apply(lambda x: ' '.join(x))
df.head()

Now we will perform one-hot encoding using sklearn's `CountVectorizer` with the option `binary=True`. We'll call the resulting vectorized data `X`, or `X_train`, since it is only the training dataset. As with conventional statistical models, "_X_" represents the set of predictors, or independent variables.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

#count_vectorizer = CountVectorizer(binary=True)
count_vectorizer = CountVectorizer(binary=True, min_df=2)
X_np = count_vectorizer.fit_transform(df['textclean']).toarray()

print(f"X_np.shape = {X_np.shape}")
type(X_np)
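
To make the binary encoding concrete, the following pure-Python sketch builds the same kind of 0/1 matrix for a tiny hypothetical corpus. It is a simplification and not sklearn's implementation (for instance, it splits on whitespace only and has no `min_df` filtering); `corpus_toy`, `vocab_toy`, and `one_hot_row` are names made up for this example.

```python
# Hypothetical, already-cleaned tweets.
corpus_toy = ["flight delay again", "great flight great crew", "delay delay delay"]

# Sorted vocabulary: each distinct token gets one column index.
all_tokens = sorted({tok for doc in corpus_toy for tok in doc.split()})
vocab_toy = {tok: i for i, tok in enumerate(all_tokens)}

def one_hot_row(doc, vocab):
    """Return a 0/1 vector marking which vocabulary tokens occur in doc."""
    row = [0] * len(vocab)
    for tok in doc.split():
        if tok in vocab:
            row[vocab[tok]] = 1  # presence only: repeated tokens still give 1
    return row

X_toy = [one_hot_row(doc, vocab_toy) for doc in corpus_toy]

print(vocab_toy)
for row in X_toy:
    print(row)
```

Each row has one entry per vocabulary word, so the row length grows with the vocabulary even though a single tweet only switches on a handful of entries.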

Here is the full vocabulary created by the `CountVectorizer`.

In [None]:
vocab = count_vectorizer.vocabulary_
vocab = {k: v for k, v in sorted(vocab.items(), key=lambda item: item[1], reverse=False)}
print(vocab)

---

### Q1: Choose two words to compare (different from "_wonderful_" vs "_incredible_").

Below you will choose your two words, which have similar meaning and which you suspect the model will treat similarly. Then, re-train the model and plot the histogram of the differences between learned weight values for each of your chosen words across the 128 hidden units in the first layer. Did the histograms show that the learned weight values were similar for your words? More similar than for the neighboring words compared to each other?

They were more similar to each other than neighboring words because in this context there is a good, neutral and bad label for all the tweets, so similar words in the "good" tweets can be grouped easily even without knowing the definition.

---

In [None]:
word1 = 'plane'
word2 = 'flight'

word1_idx = vocab[word1]
print(f"The index for '{word1}': {word1_idx}")

word2_idx = vocab[word2]
print(f"The index for '{word2}': {word2_idx}")


Next, let's look at the tweets themselves that contain each of the two chosen words, starting with `word1`.

In [None]:
rows_w_word1 = np.where(X_np[:, word1_idx] == 1)[0]
print(rows_w_word1)
df.iloc[rows_w_word1,]

In [None]:
rows_w_word2 = np.where(X_np[:, word2_idx] == 1)[0]
print(rows_w_word2)
df.iloc[rows_w_word2,]

Confirm that the input, `X`, has n rows and a column for each word (token) in the vocabulary.

In [None]:
X = torch.tensor(X_np, dtype=torch.float32)
X.size()

In [None]:
# look at one observation and see how many tokens there are (i.e. how many 1's are in the row, and how many 0's)
pd.DataFrame(X_np[1,:]).value_counts()

In [None]:
labels = df['sentiment'].unique()
enum_labels = enumerate(labels)
label_to_idx = dict((lab, i) for i,lab in enum_labels)
print(f"label dictionary: {label_to_idx}")
y = torch.tensor([label_to_idx[lab] for lab in df['sentiment']])

In [None]:
class AirlineTweetDataset(Dataset):
    def __init__(self, observations, labels):
        self.obs = observations
        self.labs = labels
        self.create_split(len(observations))

    def create_split(self, n, seed=2, train_perc=0.7):
        random.seed(seed)
        indices = list(range(n))
        random.shuffle(indices)
        self._train_ids = list(indices[:int(n * train_perc)])
        self._test_ids = list(indices[int(n * train_perc):])
        self._split_X = self.obs[self._train_ids]
        self._split_y = self.labs[self._train_ids]

    def set_split(self, split='train'):
        if split == 'train':
            self._split_X = self.obs[self._train_ids]
            self._split_y = self.labs[self._train_ids]
        else:
            self._split_X = self.obs[self._test_ids]
            self._split_y = self.labs[self._test_ids]

    def __len__(self):
        return len(self._split_y)

    def __getitem__(self, idx):
        return {'x':self._split_X[idx], 'y':self._split_y[idx]}

    def get_num_batches(self, batch_size):
        return len(self) // batch_size

dataset = AirlineTweetDataset(X, y)
dataset.create_split(len(X), seed=42, train_perc=0.85)

In [None]:
dataset.set_split('train')
print(f"len(dataset) = {len(dataset)}")
#len(dataset[:]['x'])
dataset[0]['x']

Confirm that there are no NaN, and that all numerical values are finite.

In [None]:
!pip install numpy==2.0.2
dataset[:]['x'].numpy()[0,:5]

In [None]:
assert not np.any(np.isnan(dataset[:]['x'].numpy()))
assert np.all(np.isfinite(dataset[:]['x'].numpy()))

In [None]:
class AirlineTweetClassifier(nn.Module):
    """ A Multilayer Perceptron (three Linear layers) for classifying airline tweet sentiment """
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input embeddings
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(AirlineTweetClassifier, self).__init__()

        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 32)
        self.fc3 = nn.Linear(32, output_dim)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate_vector = F.relu(self.fc1(x_in))

        intermediate_vector = F.relu(self.fc2(intermediate_vector))
        intermediate_vector = self.dropout(intermediate_vector)

        prediction_vector = self.fc3(intermediate_vector)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
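
The docstring above says `apply_softmax` should stay false when training with cross-entropy. The reason is that `nn.CrossEntropyLoss` expects raw scores (logits) and applies a log-softmax internally. The pure-Python sketch below, using hypothetical logits for a single tweet, shows that the cross-entropy computed directly from logits equals the negative log of the softmax probability of the target class, and that softmax does not change which class has the highest score.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits_toy = [2.0, 0.5, -1.0]  # hypothetical model outputs for one tweet
probs_toy = softmax(logits_toy)
loss_toy = cross_entropy_from_logits(logits_toy, target=0)

print(f"probabilities: {probs_toy}")
print(f"loss from logits: {loss_toy:.6f}")
print(f"-log(prob of target): {-math.log(probs_toy[0]):.6f}")
```

So softmax is only needed when probabilities are wanted for reporting; for training and for picking the predicted class, the logits are enough.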

#### Hyperparameters for model with one-hot encoded inputs

In [None]:
batch_size = 32
learning_rate = 0.001
num_epochs = 20
device = 'cpu'

Take one quick look at the size of the training and validation splits.

In [None]:
dataset.set_split('train')
#print(len(dataloader) * batch_size)
dataset.set_split('val')
#print(len(dataloader) * batch_size)

In [None]:
seed = 2
np.random.seed(seed)
torch.manual_seed(seed)
random.seed(seed)

# create the dataset, model and define loss function and optimizer
dataloader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)
model = AirlineTweetClassifier(len(dataset[0]['x']), 128, 3)
loss_fun = nn.CrossEntropyLoss()#weights)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
import tqdm.notebook
epoch_bar = tqdm.notebook.tqdm(desc='training routine', total=num_epochs, position=0)

dataset.set_split('train')
train_bar = tqdm.notebook.tqdm(desc='split=train', total=dataset.get_num_batches(batch_size), position=1, leave=True)

dataset.set_split('val')
val_bar = tqdm.notebook.tqdm(desc='split=val', total=dataset.get_num_batches(batch_size), position=1, leave=True)

losses = {'train':[], 'val':[]}

for epoch in range(num_epochs):

    dataset.set_split('train')
    model.train()
    running_loss_train = 0.0

    for batch_i, batch_data in enumerate(dataloader):
        tweets = batch_data['x'].to(device)
        labels = batch_data['y'].to(device)

        # forward
        outputs = model(tweets)
        loss = loss_fun(outputs, labels)
        losses['train'].append(loss.item())
        running_loss_train += loss.item()

        # backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        #if (batch_i+1) % 10 == 0:
        #    print(f"    train batch {batch_i+1:3.0f} (of {len(dataloader):3.0f}) loss: {loss.item():.4f}")
            # update bar
        train_bar.set_postfix(loss=running_loss_train, epoch=epoch)
        train_bar.update()

    train_bar.set_postfix(loss=running_loss_train/dataset.get_num_batches(batch_size), epoch=epoch)
    train_bar.update()


    running_loss_train = running_loss_train / len(dataset)

    dataset.set_split('val')
    model.eval() # switch to evaluation mode (disables dropout); gradients are still tracked unless torch.no_grad() is used
    running_loss_val = 0.0

    for batch_i, batch_data in enumerate(dataloader):
        tweets = batch_data['x'].to(device)
        labels = batch_data['y'].to(device)


        # forward (no backward step for validation data)
        outputs = model(tweets)
        loss = loss_fun(outputs, labels)
        losses['val'].append(loss.item())
        running_loss_val += loss.item()
        #if (batch_i+1) % 20 == 0:
        #    print(f"    valid batch {i+1:3.0f} (of {len(dataloader):3.0f}) loss: {loss.item():.4f}")
        val_bar.set_postfix(loss=running_loss_val, epoch=epoch)
        val_bar.update()

    val_bar.set_postfix(loss=running_loss_val/dataset.get_num_batches(batch_size), epoch=epoch)
    val_bar.update()

    train_bar.n = 0
    val_bar.n = 0
    epoch_bar.update()

    running_loss_val = running_loss_val / len(dataset)


In [None]:
matplotlib.rc('figure', figsize=(15,4))
val_ticks = [(i+1)*len(losses['train'])/len(losses['val']) for i in range(len(losses['val']))]
plt.plot(range(len(losses['train'])), losses['train'], c='blue', lw=0.75)
plt.plot(val_ticks, losses['val'], c='orange', lw=0.75)
for i in range(num_epochs):
    plt.axvline(x=i*len(losses['train'])/num_epochs, c='black', lw=0.25, alpha=0.5)
plt.ylabel('Loss')
plt.xlabel('Epoch and Batch')
plt.legend(('Train','Validation'))

In [None]:
# Test the model on full validation set
dataset.set_split('val')

y_true = []
y_pred = []
with torch.no_grad():
    correct = 0
    total = 0
    for batch_data in dataloader:
        tweets = batch_data['x'].to(device)
        labels = batch_data['y'].to(device)
        outputs = model(tweets)
        _, predicted = torch.max(outputs.data, 1)
        y_true += labels.tolist()
        y_pred += predicted.tolist()
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print(f"Accuracy (on {total} validation tweets): {100 * correct / total:.2f}%")


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=list(label_to_idx))  # class names in the same order as the label indices
disp.plot()

In [None]:
import torchinfo
torchinfo.summary(model, tuple(dataset[0]['x'].size()))

Let's now retrieve the weight parameters associated with words (i.e. tokens) that have similar meaning, such as "great", "incredible", and "terrific". These words were in the vocabulary at the following locations.
* index for 'great': 4140
* index for 'incredible': 4608
* index for 'terrific': 7896

In [None]:
fc1_weights = model.fc1.weight.data
print(f"first model layer has weight matrix with shape = {fc1_weights.shape}")

In [None]:
#wonderful_idx = vocab['wonderful']
#incredible_idx = vocab['incredible']
unit_i = 0
print(f"word1 index: {word1_idx}")
print(f"  fc1_weights[{unit_i},{[word1_idx-1,word1_idx, word1_idx+1]}]: {fc1_weights[unit_i,word1_idx-1:word1_idx+2]}")
print(f"word2 index: {word2_idx}")
print(f"  fc1_weights[{unit_i},{[word2_idx-1,word2_idx, word2_idx+1]}]: {fc1_weights[unit_i,word2_idx-1:word2_idx+2]}")

In [None]:
diffs = {"cont1":[], "word1_vs_word2":[], "cont2":[]}
for i in range(128):
    diffs["cont1"].append(abs(fc1_weights[i,word1_idx-1] - fc1_weights[i,word2_idx-1]))
    diffs["word1_vs_word2"].append(abs(fc1_weights[i,word1_idx] - fc1_weights[i,word2_idx]))
    diffs["cont2"].append(abs(fc1_weights[i,word1_idx+1] - fc1_weights[i,word2_idx+1]))

# convert each list to a numpy array
for key in diffs:
    diffs[key] = np.array(diffs[key])
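
The comparison above boils down to taking two columns of the first-layer weight matrix (one column per vocabulary word, one row per hidden unit) and looking at the absolute difference in each row. The sketch below does this on a small hypothetical matrix with 4 hidden units and 5 words; the values in `W_toy` are invented so that columns 2 and 4 are nearly identical.

```python
# Hypothetical weights: 4 hidden units (rows) x 5 vocabulary words (columns).
W_toy = [
    [0.10, -0.20, 0.30, 0.05, 0.31],
    [-0.40, 0.15, 0.22, 0.00, 0.20],
    [0.07, 0.09, -0.11, 0.50, -0.10],
    [0.25, -0.30, 0.02, -0.45, 0.04],
]

def abs_column_diffs(weights, col_a, col_b):
    """Absolute difference between two word columns, one value per hidden unit."""
    return [abs(row[col_a] - row[col_b]) for row in weights]

def mean(values):
    return sum(values) / len(values)

diffs_similar = abs_column_diffs(W_toy, 2, 4)  # two columns built to be alike
diffs_control = abs_column_diffs(W_toy, 1, 3)  # an unrelated pair as a control

print(f"mean |diff| for the similar pair: {mean(diffs_similar):.3f}")
print(f"mean |diff| for the control pair: {mean(diffs_control):.3f}")
```

In the notebook the same idea is applied across all 128 hidden units, and the histograms show the full distribution of these differences instead of just the mean.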

In [None]:
# generate summary statistics for the differences for weight values
diffs_df = pd.DataFrame(diffs)
diffs_df.describe()

In [None]:
vocab = count_vectorizer.vocabulary_

# find which vocab keys sit at the indices directly before and after word1 and word2
for key, value in vocab.items():
    if value == word1_idx-1:
        w_at_incredible_idx_minus_1 = key
        print(f"word at index {word1_idx-1}: {key}")
    if value == word2_idx-1:
        w_at_wonderful_idx_minus_1 = key
        print(f"word at index {word2_idx-1}: {key}")
    if value == word1_idx+1:
        w_at_incredible_idx_plus_1 = key
        print(f"word at index {word1_idx+1}: {key}")
    if value == word2_idx+1:
        w_at_wonderful_idx_plus_1 = key
        print(f"word at index {word2_idx+1}: {key}")



In [None]:
# plots of the differences as three different histograms
plt.figure(figsize=(12,4))
plt.subplot(1,3,1)
plt.hist(diffs["cont1"], bins=20)
# set x-axis limits to be the same for all three plots
plt.xlim(0,0.4)
plt.title(f"cont1: {w_at_incredible_idx_minus_1} vs {w_at_wonderful_idx_minus_1}")
plt.subplot(1,3,2)
plt.hist(diffs["word1_vs_word2"], bins=20)
plt.xlim(0,0.4)
plt.title(f"{word1} vs {word2}")
plt.subplot(1,3,3)
plt.hist(diffs["cont2"], bins=20)
plt.xlim(0,0.4)
plt.title(f"cont2: {w_at_incredible_idx_plus_1} vs {w_at_wonderful_idx_plus_1}")
plt.show()


In [None]:
# length of an input is
len(dataset[0]['x'])

In [None]:

import torchtext as text
vec = text.vocab.GloVe(name='6B', dim=50)

---

### Q2: Add your two words to the list of words whose embeddings are displayed and compared. Do your two chosen words have similar embeddings? In other words, is the distance between your embeddings very small?


I think it will be similar since they are used in the same way and show up constantly in all categories of tweets.

---

In [None]:
examples = ['annoy', 'annoyed', 'disappointed', 'sad', 'happy', 'pilot', 'attendant', 'crew', 'suitcase', 'luggage', 'carryon', 'great', 'amazing', 'terrific',
'incredible', 'wonderful', 'flight','plane']
embeddings = vec.get_vecs_by_tokens(examples, lower_case_backup=True)
embeddings[0,:] # just the first embedding (you can verify by confirming that it is 50 elements long)

In [None]:
def compare_words_with_colors(vecs, wds):
    wdsr = wds[:]
    wdsr.reverse()

    dim = len(vecs[0])

    fig = plt.figure(num=None, figsize=(16, 4), dpi=80, facecolor='w', edgecolor='k')
    ax = fig.add_subplot(111)
    ax.set_facecolor('gray')

    for i,v in enumerate(vecs):
        ax.scatter(range(dim), [i]*dim, c=vecs[i], cmap='Spectral', s=150, marker='s')

    plt.xticks(range(50), [i+1 for i in range(50)])
    plt.xlabel('Dimension')
    plt.yticks(range(len(wds)), wds)

    plt.show()

compare_words_with_colors(embeddings, examples)
#examples.reverse()

In [None]:
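from torchmetrics.functional import pairwise_cosine_similarity  # needed here; the earlier import is commented out

similarities = pairwise_cosine_similarity(embeddings, zero_diagonal=False)
distances = 1 - similarities
print(f"the first row of the distance matrix for our set of words looks like: {distances[0,:]}")
pairwise_top = pd.DataFrame(
    distances.numpy(),  # convert the torch tensor to a NumPy array for pandas
    columns = examples,
    index = examples
)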

In the cell above we created a distance matrix; let's now see what it looks like. Note that since we are plotting pairwise distances, larger values will be red, which suggests that the word in the corresponding row is far away from the word in the corresponding column (and vice versa).

Similarly, words that are similar to each other will have a smaller distance (close to zero), and will be plotted in green.
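
For reference, the distance used here is one minus the cosine similarity of two vectors. The pure-Python sketch below computes it for three hypothetical 3-dimensional vectors (real GloVe vectors here have 50 dimensions): two that point in the same direction and one that is orthogonal to them.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, different length
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(f"distance(a, b) = {cosine_distance(vec_a, vec_b):.4f}")  # close to 0
print(f"distance(a, c) = {cosine_distance(vec_a, vec_c):.4f}")  # equal to 1
```

Because only the direction matters, a vector and a scaled copy of it have distance (approximately) zero, which is why cosine distance is a common choice for comparing embeddings.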

In [None]:
plt.figure(figsize=(8,6))
#sns.color_palette("viridis", as_cmap=True)
sns.color_palette("mako", as_cmap=True)
sns.heatmap(
    pairwise_top,
    cmap='RdYlGn_r',  # Reversed 'RdYlGn' colormap: green for smaller distances, red for larger distances
    linewidth=1
)


In [None]:
data_URL = 'https://raw.githubusercontent.com/sgeinitz/DSML4220/main/data/airlinetweets.csv'
df = pd.read_csv(data_URL)
print(f"df.shape: {df.shape}")
pd.set_option("display.max_colwidth", 240)
df.head(10)

In [None]:
random.seed(2)
indices = list(range(len(df)))
random.shuffle(indices)

df_test = df.iloc[indices[9000:],]
df = df.iloc[indices[:9000],]

In [None]:
df.sentiment.value_counts(normalize=False)

In [None]:
import torchtext
from torchtext.data import get_tokenizer
tokenizer = get_tokenizer("basic_english") # "basic_english"   "subword" uses revtok module (but does not work with GloVe)
df['tokens_raw'] = df['text'].apply(lambda x: tokenizer(x.lower()))
df.head()

In [None]:
df['tweet_length'] = df['tokens_raw'].apply(lambda x: len(x))
#plt.figure(figsize=(12,6))
#df['tweet_length'].hist() #bins=100, range=(0,45), width=0.9) #, df['tweet_length'].mean(), df['tweet_length'].median()
#plt.show()

In [None]:
df.iloc[rows_w_word1,].index.sort_values()

In [None]:
tweet_i= 53
tweet_embeddings = vec.get_vecs_by_tokens(df['tokens_raw'][tweet_i], lower_case_backup=True)
print(f"sentiment of this tweet: {df['sentiment'][tweet_i]}")
print(f"tweet_embeddings.shape = {tweet_embeddings.shape}")
for i in range(df['tweet_length'][tweet_i]):
    print(f"    token, '{df['tokens_raw'][tweet_i][i]}' (at pos {i:2.0f}) has tweet_embeddings[:5] = {tweet_embeddings[i][:5]}")

In [None]:
df.iloc[rows_w_word2,].index.sort_values()

In [None]:
tweet_i= 18
tweet_embeddings = vec.get_vecs_by_tokens(df['tokens_raw'][tweet_i], lower_case_backup=True)
print(f"sentiment of this tweet: {df['sentiment'][tweet_i]}")
print(f"tweet_embeddings.shape = {tweet_embeddings.shape}")
for i in range(df['tweet_length'][tweet_i]):
    print(f"    token, '{df['tokens_raw'][tweet_i][i]}' (at pos {i:2.0f}) has tweet_embeddings[:5] = {tweet_embeddings[i][:5]}")

The tweet above had 9 tokens in it, which we can quickly confirm here by looking at the shape of it:

In [None]:
tweet_embeddings.shape

Before we continue we must decide on a maximum number of tokens to keep per tweet. Let's look at a histogram of the lengths of each tweet (where length equals the number of raw tokens).

In [None]:
def meanTweetEmbeddings(raw_tokens):
    embeddings = vec.get_vecs_by_tokens(raw_tokens, lower_case_backup=True)
    n_embs = 0
    emb_sum = torch.zeros((embeddings.shape[1]))
    for i in range(min(embeddings.shape[0], 35)): # max number of tokens in a tweet is 35
        if embeddings[i].abs().sum() > 0:
            n_embs += 1
            emb_sum += embeddings[i]
    if n_embs > 0:
        emb_avg = emb_sum / n_embs
    else:
        emb_avg = torch.zeros((embeddings.shape[1]))
    if np.any(np.isnan(emb_avg.numpy())):
        print(f"exists an nan: {emb_sum}")
    return emb_avg

X_int = df['tokens_raw'].apply(lambda x: meanTweetEmbeddings(x)).values
print(f"X_int.shape = {X_int.shape}")
X_int[:2]

In [None]:
X_int[0].shape

In [None]:
if len(X_int[0]) > 50:  # compare the embedding length to 50 (the parenthesis was misplaced)
    avg_embedding = False
else:
    avg_embedding = True

X = torch.stack(tuple(X_int))
X.shape
#X[:2]

There should be 9000 rows in X, since this is the number of tweets (i.e. observations) in the training data.

The number of columns is the _embedding size_ itself.
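
The reason each tweet ends up as a single row of embedding-size length is that the token embeddings are averaged. The sketch below mirrors that averaging in pure Python on hypothetical 2-dimensional vectors. As in the notebook's function, all-zero rows are skipped (presumably these are tokens without a GloVe vector) and at most `max_tokens` tokens are used; `mean_embedding` and `tweet_toy` are names invented for this example.

```python
def mean_embedding(vectors, max_tokens=35):
    """Average the non-zero token vectors of one tweet into a single vector."""
    if not vectors:
        return []
    dim = len(vectors[0])
    total = [0.0] * dim
    n_used = 0
    for vec_i in vectors[:max_tokens]:
        if any(x != 0.0 for x in vec_i):  # skip all-zero (unknown) tokens
            n_used += 1
            total = [t + x for t, x in zip(total, vec_i)]
    if n_used == 0:
        return total  # no known tokens: fall back to the zero vector
    return [t / n_used for t in total]

# Hypothetical tweet with three tokens; the middle one has no embedding.
tweet_toy = [[1.0, 3.0], [0.0, 0.0], [3.0, 1.0]]
avg_toy = mean_embedding(tweet_toy)
print(avg_toy)  # [2.0, 2.0]
```

Whatever the tweet length, the output has as many entries as one embedding, which is what lets all tweets be stacked into one matrix.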

In [None]:
labels = df['sentiment'].unique()
enum_labels = enumerate(labels)
label_to_idx = dict((lab, i) for i,lab in enum_labels)
print(f"label dictionary: {label_to_idx}")
y = torch.tensor([label_to_idx[lab] for lab in df['sentiment']])

In [None]:
# Can be a good idea to occasionally check that the dims (or shapes) agree for the inputs (X) and labels (y)
assert len(X) == len(y)

In [None]:
class AirlineTweetDataset(Dataset):
    def __init__(self, observations, labels):
        self.obs = observations
        self.labs = labels
        self.create_split(len(observations))

    def create_split(self, n, seed=2, train_perc=0.7):
        random.seed(seed)
        indices = list(range(n))
        random.shuffle(indices)
        self._train_ids = list(indices[:int(n * train_perc)])
        self._test_ids = list(indices[int(n * train_perc):])
        self._split_X = self.obs[self._train_ids]
        self._split_y = self.labs[self._train_ids]

    def set_split(self, split='train'):
        if split == 'train':
            self._split_X = self.obs[self._train_ids]
            self._split_y = self.labs[self._train_ids]
        else:
            self._split_X = self.obs[self._test_ids]
            self._split_y = self.labs[self._test_ids]

    def __len__(self):
        return len(self._split_y)

    def __getitem__(self, idx):
        return {'x':self._split_X[idx], 'y':self._split_y[idx]}

    def get_num_batches(self, batch_size):
        return len(self) // batch_size

dataset = AirlineTweetDataset(X, y)
dataset.create_split(len(X), seed=42, train_perc=0.85)

In [None]:
dataset.set_split('train')
print(f"len(dataset) = {len(dataset)}")
len(dataset[:]['x'])
dataset[0]['x']

In [None]:
assert not np.any(np.isnan(dataset[:]['x'].numpy()))
assert np.all(np.isfinite(dataset[:]['x'].numpy()))

#### Hyperparameters for model with GloVe embeddings

We'll use the same training configuration as before, although it is worth noting that this model would likely benefit from more training.

In [None]:
# use same batch_size, learning_rate, and epochs as before
batch_size = 32
learning_rate = 0.001
num_epochs = 20
device = 'cpu'

In [None]:
seed = 2
np.random.seed(seed)
torch.manual_seed(seed)
random.seed(seed)

# create dataset, model and define loss function and optimizer
dataloader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True)
model_w_embeddings = AirlineTweetClassifier(len(dataset[0]['x']), 128, 3)
loss_fun = nn.CrossEntropyLoss()#weights)
optimizer = torch.optim.Adam(model_w_embeddings.parameters(), lr=learning_rate)

In [None]:
epoch_bar = tqdm.notebook.tqdm(desc='training routine', total=num_epochs, position=0)

dataset.set_split('train')
train_bar = tqdm.notebook.tqdm(desc='split=train', total=dataset.get_num_batches(batch_size), position=1, leave=True)

dataset.set_split('val')
val_bar = tqdm.notebook.tqdm(desc='split=val', total=dataset.get_num_batches(batch_size), position=1, leave=True)

losses = {'train':[], 'val':[]}

for epoch in range(num_epochs):

    dataset.set_split('train')
    model_w_embeddings.train()
    running_loss_train = 0.0

    for batch_i, batch_data in enumerate(dataloader):
        tweets = batch_data['x'].to(device)
        labels = batch_data['y'].to(device)

        # forward
        outputs = model_w_embeddings(tweets)
        loss = loss_fun(outputs, labels)
        losses['train'].append(loss.item())
        running_loss_train += loss.item()

        # backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        #if (batch_i+1) % 10 == 0:
        #    print(f"    train batch {batch_i+1:3.0f} (of {len(dataloader):3.0f}) loss: {loss.item():.4f}")
            # update bar
        train_bar.set_postfix(loss=running_loss_train, epoch=epoch)
        train_bar.update()

    train_bar.set_postfix(loss=running_loss_train/dataset.get_num_batches(batch_size), epoch=epoch)
    train_bar.update()


    running_loss_train = running_loss_train / len(dataset)

    dataset.set_split('val')
    model_w_embeddings.eval() # switch to evaluation mode (disables dropout); gradients are still tracked unless torch.no_grad() is used
    running_loss_val = 0.0

    for batch_i, batch_data in enumerate(dataloader):
        tweets = batch_data['x'].to(device)
        labels = batch_data['y'].to(device)


        # forward (no backward step for validation data)
        outputs = model_w_embeddings(tweets)
        loss = loss_fun(outputs, labels)
        losses['val'].append(loss.item())
        running_loss_val += loss.item()
        #if (batch_i+1) % 20 == 0:
        #    print(f"    valid batch {i+1:3.0f} (of {len(dataloader):3.0f}) loss: {loss.item():.4f}")
        val_bar.set_postfix(loss=running_loss_val, epoch=epoch)
        val_bar.update()

    val_bar.set_postfix(loss=running_loss_val/dataset.get_num_batches(batch_size), epoch=epoch)
    val_bar.update()

    train_bar.n = 0
    val_bar.n = 0
    epoch_bar.update()

    running_loss_val = running_loss_val / len(dataset)


In [None]:
matplotlib.rc('figure', figsize=(15,4))
val_ticks = [(i+1)*len(losses['train'])/len(losses['val']) for i in range(len(losses['val']))]
plt.plot(range(len(losses['train'])), losses['train'], c='blue', lw=0.75)
plt.plot(val_ticks, losses['val'], c='orange', lw=0.75)
for i in range(num_epochs):
    plt.axvline(x=i*len(losses['train'])/num_epochs, c='black', lw=0.25, alpha=0.5)
plt.ylabel('Loss')
plt.xlabel('Epoch and Batch')
plt.legend(('Train','Validation'))

In [None]:
# Test the model
model_w_embeddings.eval()
dataset.set_split('val')
y_true = []
y_pred = []

with torch.no_grad():
    correct = 0
    total = 0
    for batch_data in dataloader:
        tweets = batch_data['x'].to(device)
        labels = batch_data['y'].to(device)
        outputs = model_w_embeddings(tweets)
        _, predicted = torch.max(outputs.data, 1)
        y_true += labels.tolist()
        y_pred += predicted.tolist()
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print(f"Accuracy (on {total} validation tweets): {100 * correct / total:.2f}%")


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=list(label_to_idx))  # class names in the same order as the label indices
disp.plot()

In [None]:
# length of an input is
len(dataset[0]['x'])

In [None]:
import torchinfo
torchinfo.summary(model_w_embeddings, tuple(dataset[0]['x'].size()))

In [None]:
50*128 + 128

---

### Q3: How much smaller is the model with embeddings than the model with one-hot encoded inputs?

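The model with embeddings is significantly smaller. With one-hot encoding, each input is a vector as long as the vocabulary, filled mostly with zeros and a few ones, so the first layer needs 128 weights for every word in the vocabulary. With GloVe embeddings each input is a dense vector of only 50 values, so the first layer needs just 50*128 + 128 parameters, while the remaining layers are the same size in both models.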

---
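
The size comparison can also be worked out by hand from the layer shapes used in `AirlineTweetClassifier` (input to 128, 128 to 32, 32 to 3), counting one weight per connection plus one bias per output unit. In the sketch below, the input size of 50 comes from the GloVe vectors used above, while the vocabulary size of 5000 is a hypothetical stand-in; the actual value is the number of columns printed for `X_np.shape`.

```python
def mlp_param_count(input_dim, hidden_dim=128, mid_dim=32, output_dim=3):
    """Weights plus biases for three fully connected layers."""
    fc1 = input_dim * hidden_dim + hidden_dim
    fc2 = hidden_dim * mid_dim + mid_dim
    fc3 = mid_dim * output_dim + output_dim
    return fc1 + fc2 + fc3

glove_params = mlp_param_count(50)      # 50-dimensional GloVe input
onehot_params = mlp_param_count(5000)   # hypothetical vocabulary size

print(f"embedding model parameters: {glove_params}")
print(f"one-hot model parameters (assuming 5000 words): {onehot_params}")
print(f"ratio: {onehot_params / glove_params:.1f}x")
```

Only the first layer differs between the two models, so almost all of the size gap comes from the input dimension.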