# Code Embedding

Now that you have learned a little about the theory behind word embeddings, it is time to put it into practice. We'll also take this opportunity to show you how to work with text data in PyTorch!

## What will you learn in this course? 🧐🧐

This course is a code demonstration that walks you through manipulating text data with PyTorch and training a model with an embedding layer!

In [3]:
import io
import os
import re
import shutil
import tarfile
import string
import torch
import tiktoken
import requests
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split

### Load and format the dataset

We begin by loading the text data.

In [4]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

output_path = "aclImdb_v1.tar.gz"
extract_dir = "./aclImdb"
def load_and_extract_files(url,output_path,extract_dir):
    # Download the dataset
    if not os.path.exists(output_path):
        response = requests.get(url, stream=True)
        with open(output_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    file.write(chunk)
        print("Download complete.")
    if not os.path.exists(extract_dir):
        # Extract all contents to a specific folder
        with tarfile.open(output_path, "r:gz") as tar:
            tar.extractall(path=extract_dir)  # Replace with your target folder

        print("Extraction complete!")

load_and_extract_files(url,output_path,extract_dir)

Download complete.
Extraction complete!


In [5]:
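remove_dir = os.path.join(extract_dir, "aclImdb/train", "unsup")
# Drop the unlabeled reviews; the check keeps this cell safe to re-run
if os.path.isdir(remove_dir):
    shutil.rmtree(remove_dir)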

We have imported film reviews organized into two categories: positive and negative. The `unsup` folder holds unlabeled reviews, which we do not need for classification.
We'll load the text data into Python variables.

In [6]:
def load_imdb_data(split="train"):
    data, labels = [], []
    for label in ["pos", "neg"]:
        folder_path = os.path.join(extract_dir,"aclImdb/",split,label)
        for file_name in os.listdir(folder_path):
            with open(os.path.join(folder_path, file_name), "r", encoding="utf-8") as f:
                data.append(f.read())
                labels.append(1 if label == "pos" else 0)
    return data, labels

train_texts, train_labels = load_imdb_data("train")
test_texts, test_labels = load_imdb_data("test")

Let's take a look at a sample of text data.

In [7]:
print( train_labels[0], train_texts[0])

1 For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.


To turn text into sequences of indices ready to be embedded, we need a tokenizer. This is where a lot of the language model magic happens. We'll cover this in the next lecture.
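
To make "text to sequences of indices" concrete before using a real tokenizer, here is a minimal sketch of a word-level tokenizer. The helpers `build_vocab` and `encode` are hypothetical names introduced for illustration; real tokenizers such as the one used in this course work on sub-word pieces, not on whitespace-separated words.

```python
# Toy word-level tokenizer: an illustration only, not how a sub-word tokenizer works.
def build_vocab(texts):
    """Give every distinct lowercase word an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Replace each known word by its id (unknown words are skipped in this sketch)."""
    return [vocab[word] for word in text.lower().split() if word in vocab]

toy_texts = ["a great movie", "a boring movie"]
toy_vocab = build_vocab(toy_texts)
toy_ids = [encode(text, toy_vocab) for text in toy_texts]

print(toy_vocab)  # {'<pad>': 0, 'a': 1, 'great': 2, 'movie': 3, 'boring': 4}
print(toy_ids)    # [[1, 2, 3], [1, 4, 3]]
```

Both reviews share the ids for `a` and `movie`, which is what later lets them share embedding vectors.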

Here we use the `cl100k_base` tokenizer which is used in GPT-4.

In [8]:
tokenizer = tiktoken.get_encoding("cl100k_base")

def encode_texts(texts):
    return [tokenizer.encode(text) for text in texts]

train_tokens = encode_texts(train_texts)
test_tokens = encode_texts(test_texts)

In [9]:
tokenizer.n_vocab

100277

As you can see, the vocabulary size is significant. However, most of these tokens will not appear at all in our text dataset, and at each step only the embedding rows of the tokens present in the batch receive a non-zero gradient. The large vocabulary therefore mostly costs memory; it adds few parameters that actually have to be learned.
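
The claim about gradients can be checked on a small scale. The sketch below uses a toy 10-token vocabulary (an assumption for illustration, not the real tokenizer) and lists which rows of the embedding table receive a non-zero gradient after one backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy table: 10 possible tokens, 3 components each
toy_embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)

# One "sentence" that only uses tokens 1 and 2
toy_batch = torch.tensor([[1, 2, 2]])
toy_embedding(toy_batch).sum().backward()

# Rows of the gradient that are not entirely zero
grad = toy_embedding.weight.grad
touched_rows = grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(touched_rows)  # [1, 2]
```

Note that the gradient tensor itself still has the shape of the full table by default, so memory use and the optimizer update do scale with the vocabulary size.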

In order to build the data loader, we need all sequences to be of the same length.

This is where truncating and padding come in. Long sequences can be cut shorter; this is called truncating. Shorter sequences have to be lengthened; this is called padding, and we fill the missing positions with `0` until the sequence reaches the desired length.
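
As a small worked example, here is a sketch of both operations on two toy sequences. The helper name `pad_or_truncate` and the target length of 4 are assumptions chosen for illustration.

```python
def pad_or_truncate(seq, max_length, pad_id=0):
    """Cut seq to max_length, then right-pad with pad_id up to max_length."""
    clipped = seq[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

toy_sequences = [[5, 6, 7, 8, 9], [1, 2]]
padded = [pad_or_truncate(seq, max_length=4) for seq in toy_sequences]
print(padded)  # [[5, 6, 7, 8], [1, 2, 0, 0]]
```

The first sequence loses its last token, the second gains two padding positions, and both end up with the same length.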

In [81]:
# How are sequence lengths distributed?
seq_lens = [len(seq) for seq in train_tokens]
np.mean(seq_lens)

np.float64(297.07392)

The average length is around 300 tokens. Let's use this as our sequence length.

In [10]:
def pad_sequences(sequences, max_length=300):
    return [seq[:max_length] + [0] * (max_length - len(seq)) for seq in sequences]

train_tokens = pad_sequences(train_tokens)
test_tokens = pad_sequences(test_tokens)

class IMDBDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = torch.tensor(texts, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float32)
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

df_dataset = IMDBDataset(train_tokens, train_labels)
test_dataset = IMDBDataset(test_tokens, test_labels)

# Split dataset into training (80%) and validation (20%)
train_size = int(0.8 * len(df_dataset))
val_size = len(df_dataset) - train_size
train_dataset, val_dataset = random_split(df_dataset, [train_size, val_size])

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

In [11]:
text, label = next(iter(train_loader))
print(text)

tensor([[17773,   383, 14355,  ...,  1109,   430,    13],
        [ 3915, 10307,  1193,  ...,     0,  3639,   264],
        [74627,   380,   676,  ...,   832,    11,  8051],
        ...,
        [ 2028,  1633, 15234,  ...,     0,     0,     0],
        [12331, 12490,  3320,  ...,     0,     0,     0],
        [   40,   574, 14243,  ...,    11,   779,  1364]])
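
To see how the split and the batching behave without loading the full dataset, here is a minimal sketch on toy tensors. `TensorDataset` stands in for `IMDBDataset`, and the sizes (10 samples of length 4, batches of 4) are assumptions chosen to keep the output small.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# 10 toy "reviews" of 4 token ids each, with alternating labels
toy_x = torch.arange(40).reshape(10, 4)
toy_y = torch.tensor([0.0, 1.0] * 5)
toy_ds = TensorDataset(toy_x, toy_y)

# 80% / 20% split; the seeded generator makes the split reproducible
generator = torch.Generator().manual_seed(0)
toy_train, toy_val = random_split(toy_ds, [8, 2], generator=generator)

toy_loader = DataLoader(toy_train, batch_size=4, shuffle=True)
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in toy_loader]

print(len(toy_train), len(toy_val))  # 8 2
print(batch_shapes)                  # [((4, 4), (4,)), ((4, 4), (4,))]
```

Each batch has shape `(batch_size, sequence_length)`, which is the input shape the embedding layer expects.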


### Embedding layer

PyTorch makes it really easy to use an embedding layer. It can be seen as a lookup table that maps encoded tokens to their vector representations. It takes two main arguments: `num_embeddings`, the number of unique tokens in the input vocabulary, and `embedding_dim`, the number of components used to represent each token. We should also use `padding_idx=0` to indicate that `0` stands for padding and should be ignored during gradient descent. Keep in mind that a tokenizer may also assign id `0` to a real token, in which case that token is treated as padding too; a dedicated padding id avoids the overlap.

The input to the embedding layer must already be encoded, meaning that each token has been replaced by an integer. Let's give an example of how it works.

In [12]:
vocab_size = tokenizer.n_vocab

In [13]:
# Example usage

embedding_layer = nn.Embedding(num_embeddings=vocab_size,
                               embedding_dim=5, 
                               padding_idx=0)
# We create a fixed list of three token ids, wrapped in a batch of size 1,
# use it as input for the embedding layer, and take a look at the output.
sample_input = torch.tensor([[1, 2, 3]])
embedded_output = embedding_layer(sample_input)
print(embedded_output.shape)  # Output: torch.Size([1, 3, 5])

torch.Size([1, 3, 5])


In [14]:
# The result is three 5-dimensional vectors (for a batch of one sequence). Each
# token id in the input has been replaced by a vector of 5 floating-point values.
embedded_output

tensor([[[-1.5571, -0.8905, -0.1998, -0.6145,  0.0438],
         [ 1.0645,  1.4452,  0.5393,  0.0646,  0.3227],
         [-1.7981, -0.3515,  0.2251,  0.6581,  0.8487]]],
       grad_fn=<EmbeddingBackward0>)
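
The lookup-table view can be checked directly. The sketch below uses a toy layer with 6 tokens and 4 components (sizes chosen for illustration) and verifies that the layer output is the same as indexing its weight matrix, and that the padding row is zero and receives no gradient.

```python
import torch
import torch.nn as nn

toy_layer = nn.Embedding(num_embeddings=6, embedding_dim=4, padding_idx=0)
toy_ids = torch.tensor([[3, 1, 0]])  # the last position is padding
toy_vectors = toy_layer(toy_ids)

# The forward pass is a row lookup in the weight matrix
same_as_indexing = torch.equal(toy_vectors, toy_layer.weight[toy_ids])

# The padding row starts at zero...
padding_row_is_zero = bool((toy_layer.weight[0] == 0).all())

# ...and gets no gradient, so it stays at zero during training
toy_vectors.sum().backward()
padding_grad_is_zero = bool((toy_layer.weight.grad[0] == 0).all())

print(same_as_indexing, padding_row_is_zero, padding_grad_is_zero)  # True True True
```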

More generally, the embedding layer takes an input with dimensions `(batch_size, sequence_length)` and outputs a tensor with dimensions `(batch_size, sequence_length, embedding_dim)`.

The number of parameters in an embedding layer is equal to `num_embeddings * embedding_dim`: every unique token in the vocabulary maps to its own vector with `embedding_dim` components, each component being a parameter of the layer.

In [16]:
# The shape of the layer's weight matrix confirms what we just explained.
embedding_layer.weight.shape

torch.Size([100277, 5])

### Text Classification

Now that we have seen how the embedding layer works, we are ready to build the model. We will build a very simple classification model: an embedding layer, an average over the sequence, and a linear layer followed by a sigmoid.

In [17]:
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.pooling = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text):
        embedded = self.embedding(text)
        pooled = self.pooling(embedded.permute(0, 2, 1)).squeeze(2)
        return torch.sigmoid(self.fc(pooled))

model = TextClassifier(vocab_size=vocab_size,
                      embed_dim=16, 
                      num_class=1)
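
The pooling step in `forward` may look cryptic because of the `permute`. The sketch below, on a random toy tensor (the shapes are assumptions for illustration), suggests that it is simply the average of the token vectors along the sequence dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the output of the embedding layer:
# (batch_size=2, sequence_length=7, embed_dim=16)
toy_embedded = torch.randn(2, 7, 16)

# Same operations as in TextClassifier.forward
pool = nn.AdaptiveAvgPool1d(1)
pooled = pool(toy_embedded.permute(0, 2, 1)).squeeze(2)

# Plain mean over the sequence dimension
mean_over_tokens = toy_embedded.mean(dim=1)

print(pooled.shape)                                         # torch.Size([2, 16])
print(torch.allclose(pooled, mean_over_tokens, atol=1e-6))  # True
```

One consequence of this design is that padded positions, whose vectors are zero, still count in the average, so short reviews get pooled vectors with a smaller norm.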

In [18]:
sample_input.shape

torch.Size([1, 3])

In [19]:
from torchinfo import summary

print(model)

# Print model summary
summary(model, input_data=sample_input)  # (batch_size, sequence_length)


TextClassifier(
  (embedding): Embedding(100277, 16, padding_idx=0)
  (pooling): AdaptiveAvgPool1d(output_size=1)
  (fc): Linear(in_features=16, out_features=1, bias=True)
)


Layer (type:depth-idx)                   Output Shape              Param #
TextClassifier                           [1, 1]                    --
├─Embedding: 1-1                         [1, 3, 16]                1,604,432
├─AdaptiveAvgPool1d: 1-2                 [1, 16, 1]                --
├─Linear: 1-3                            [1, 1]                    17
Total params: 1,604,449
Trainable params: 1,604,449
Non-trainable params: 0
Total mult-adds (M): 1.60
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 6.42
Estimated Total Size (MB): 6.42
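
The parameter counts in this summary can be reproduced by hand from the formula `num_embeddings * embedding_dim`. The two helper functions below are hypothetical names introduced for this arithmetic check.

```python
def embedding_param_count(num_embeddings, embedding_dim):
    """One vector of embedding_dim components per token."""
    return num_embeddings * embedding_dim

def linear_param_count(in_features, out_features):
    """One weight per input/output pair, plus one bias per output."""
    return in_features * out_features + out_features

vocab_size = 100277  # value of tokenizer.n_vocab reported earlier

print(embedding_param_count(vocab_size, 5))  # 501385 (the 5-dimensional example layer)

emb_params = embedding_param_count(vocab_size, 16)
fc_params = linear_param_count(16, 1)
print(emb_params, fc_params, emb_params + fc_params)  # 1604432 17 1604449
```

Almost all of the model's parameters sit in the embedding table.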

Next, we define the loss function and the optimizer, then write the training loop.
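
Binary cross-entropy, the loss used in the next cell, averages `-(y * log(p) + (1 - y) * log(1 - p))` over the batch, where `p` is the predicted probability and `y` the label. The sketch below compares a hand computation with `nn.BCELoss` on three made-up predictions.

```python
import math

import torch
import torch.nn as nn

# Made-up predicted probabilities and true labels
probs = torch.tensor([0.9, 0.2, 0.6])
targets = torch.tensor([1.0, 0.0, 0.0])

# Hand computation of the mean binary cross-entropy
manual = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p)
    for p, t in zip(probs.tolist(), targets.tolist())
) / len(probs)

builtin = nn.BCELoss()(probs, targets).item()
print(round(manual, 4), round(builtin, 4))
```

The confident wrong-ish prediction (`0.6` for a negative review) contributes most of the loss. A common alternative is to drop the sigmoid from the model and use `nn.BCEWithLogitsLoss`, which combines both steps in a numerically more stable way.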

In [20]:
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train(model, train_loader, val_loader, criterion, optimizer, epochs=100):
    """
    Function to train a PyTorch model with training and validation datasets.
    
    Parameters:
    model: The neural network model to train.
    train_loader: DataLoader for the training dataset.
    val_loader: DataLoader for the validation dataset.
    criterion: Loss function (e.g., Binary Cross Entropy for classification).
    optimizer: Optimization algorithm (e.g., Adam, SGD).
    epochs: Number of training epochs (default=100).
    
    Returns:
    history: Dictionary containing loss and accuracy for both training and validation.
    """
    
    # Dictionary to store training & validation loss and accuracy over epochs
    history = {'loss': [], 'val_loss': [], 'accuracy': [], 'val_accuracy': []}
    
    for epoch in range(epochs):  # Loop over the number of epochs
        model.train()  # Set model to training mode
        total_loss, correct = 0, 0  # Initialize total loss and correct predictions
        
        # Training loop
        for inputs, labels in train_loader:
            optimizer.zero_grad()  # Reset gradients before each batch
            outputs = model(inputs).squeeze(1)  # Forward pass; drop only the class dimension
            loss = criterion(outputs, labels)  # Compute loss
            loss.backward()  # Backpropagation (compute gradients)
            optimizer.step()  # Update model parameters
            
            total_loss += loss.item()  # Accumulate batch loss
            correct += ((outputs > 0.5) == labels).sum().item()  # Count correct predictions
        
        # Compute average loss and accuracy for training
        train_loss = total_loss / len(train_loader)
        train_acc = correct / len(train_loader.dataset)
        
        # Validation phase (without gradient computation)
        model.eval()  # Set model to evaluation mode
        val_loss, val_correct = 0, 0
        with torch.no_grad():  # No need to compute gradients during validation
            for inputs, labels in val_loader:
                outputs = model(inputs).squeeze(1)  # Forward pass; drop only the class dimension
                loss = criterion(outputs, labels)  # Compute loss
                val_loss += loss.item()  # Accumulate validation loss
                val_correct += ((outputs > 0.5) == labels).sum().item()  # Count correct predictions
        
        # Compute average loss and accuracy for validation
        val_loss /= len(val_loader)
        val_acc = val_correct / len(val_loader.dataset)
        
        # Store metrics in history dictionary
        history['loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['accuracy'].append(train_acc)
        history['val_accuracy'].append(val_acc)
        
        # Print training progress
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {train_loss:.4f}, Acc: {train_acc:.4f}, "
              f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")
    
    return history  # Return training history

history = train(model,
                train_loader=train_loader,
                val_loader=val_loader,
                criterion=criterion,
                optimizer=optimizer,
                epochs=10)

Epoch [1/10], Loss: 0.6828, Acc: 0.6133, Val Loss: 0.6632, Val Acc: 0.6812
Epoch [2/10], Loss: 0.6098, Acc: 0.7559, Val Loss: 0.5644, Val Acc: 0.7676
Epoch [3/10], Loss: 0.4982, Acc: 0.8159, Val Loss: 0.4771, Val Acc: 0.8074
Epoch [4/10], Loss: 0.4106, Acc: 0.8563, Val Loss: 0.4165, Val Acc: 0.8362
Epoch [5/10], Loss: 0.3505, Acc: 0.8786, Val Loss: 0.3795, Val Acc: 0.8492
Epoch [6/10], Loss: 0.3085, Acc: 0.8934, Val Loss: 0.3557, Val Acc: 0.8578
Epoch [7/10], Loss: 0.2767, Acc: 0.9051, Val Loss: 0.3405, Val Acc: 0.8628
Epoch [8/10], Loss: 0.2510, Acc: 0.9146, Val Loss: 0.3278, Val Acc: 0.8654
Epoch [9/10], Loss: 0.2290, Acc: 0.9223, Val Loss: 0.3202, Val Acc: 0.8658
Epoch [10/10], Loss: 0.2099, Acc: 0.9291, Val Loss: 0.3131, Val Acc: 0.8680
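
The test set built earlier has not been used yet. A final evaluation could look like the sketch below; the `evaluate` helper is a hypothetical function that mirrors the validation phase of `train`, and it is demonstrated on a tiny stand-in model and random data so that the block runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion):
    """Return (mean batch loss, accuracy) of model on loader, without gradients."""
    model.eval()
    total_loss, correct = 0.0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            outputs = model(inputs).squeeze(1)
            total_loss += criterion(outputs, labels).item()
            correct += ((outputs > 0.5) == labels.bool()).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

# Toy stand-ins for the trained classifier and the test data
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
toy_data = TensorDataset(torch.randn(12, 4), torch.randint(0, 2, (12,)).float())
toy_loader = DataLoader(toy_data, batch_size=5)

toy_loss, toy_acc = evaluate(toy_model, toy_loader, nn.BCELoss())
print(f"Loss: {toy_loss:.4f}, Acc: {toy_acc:.4f}")
```

With the objects from this notebook, the call would be along the lines of `evaluate(model, DataLoader(test_dataset, batch_size=32), criterion)`.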


We have successfully trained an embedding layer! 

An important note! Embedding layers are trainable. In this case, the tokens are mapped to embedding vectors so as to best determine whether a review is positive or negative. Because it is this specific, the embedding would not necessarily work well for a recipe classification problem or for translation. The upside is that pre-trained word embeddings exist and may be used for transfer learning!
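
As a minimal sketch of what reusing pre-trained vectors can look like, `nn.Embedding.from_pretrained` builds an embedding layer from an existing matrix. The random `pretrained_vectors` matrix below is only a stand-in; in practice it would be loaded from a published embedding file whose rows match your tokenizer's ids.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for real pre-trained vectors: 8 tokens, 4 components
pretrained_vectors = torch.randn(8, 4)

# freeze=True keeps the vectors fixed during training
frozen_embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

ids = torch.tensor([[2, 5]])
out = frozen_embedding(ids)

print(frozen_embedding.weight.requires_grad)          # False
print(torch.equal(out[0, 0], pretrained_vectors[2]))  # True
```

Passing `freeze=False` instead would let the vectors be fine-tuned on the new task.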