# Language Model Basics

Wei Li

In [1]:
import torch
import torch.nn as nn
import torch.nn.utils.rnn as rnn
from torch.utils.data import Dataset, DataLoader, TensorDataset
import numpy as np
import time

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE

'cpu'

## Embeddings

Embeddings are a way to convert discrete categorical variables (e.g., words represented as indices) into continuous vectors. The purpose of this is to capture more meaningful representations of the words.

In the context of natural language processing, embeddings are used to capture semantic meanings of words. Words that are semantically similar are mapped to vectors that are close to each other in the embedding space. This is much more informative than using one-hot encoded vectors, where each word is represented as a vector of 0s with a single 1, and all vectors are at the same distance from each other.

Let's consider an example. We have a vocabulary of four words: {'hello': 0, 'world': 1, 'good': 2, 'day': 3}. If we use one-hot encoding, 'hello' might be represented as [1, 0, 0, 0], 'world' as [0, 1, 0, 0], and so on. In this representation, all words are equidistant from each other, which doesn't capture any semantic relationships between the words.
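
The equidistance claim can be checked numerically. The following is a minimal sketch, assuming the four-word vocabulary above; the variable names (`one_hot`, `distances`) are illustrative only.

```python
import torch
import torch.nn.functional as F

vocab = {"hello": 0, "world": 1, "good": 2, "day": 3}

# One row per word: row i has a single 1 in column i.
one_hot = F.one_hot(torch.arange(len(vocab)), num_classes=len(vocab)).float()

# Pairwise Euclidean distances between every pair of one-hot vectors.
differences = one_hot[:, None, :] - one_hot[None, :, :]  # (4, 4, 4)
distances = differences.norm(dim=-1)  # (4, 4)

print(distances)
```

Every off-diagonal entry should equal $\sqrt{2}$, so no pair of words is closer than any other pair.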

On the other hand, with embeddings, each word is represented as a dense vector of continuous values. For example, 'hello' might be represented as [0.1, 0.3, -0.2, 0.8, 0.5] and 'world' as [0.2, 0.3, -0.15, 0.85, 0.45]. Notice that these vectors are not just random. They are learned during the training of the model. The model adjusts these vectors to reduce the prediction error on the training data.

So, if 'hello' and 'world' often appear in similar contexts in the training data, their embeddings will be adjusted to be close to each other. This way, the model learns to capture the semantic relationships between words. For example, synonyms would be close to each other in the embedding space because they often appear in similar contexts.
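
One common way to measure "closeness" in the embedding space is cosine similarity. The sketch below reuses the two illustrative vectors quoted above for 'hello' and 'world'; the third vector, `other`, is a made-up example of a dissimilar embedding and is not taken from any trained model.

```python
import torch
import torch.nn.functional as F

# Illustrative vectors from the text (not learned values).
hello = torch.tensor([0.1, 0.3, -0.2, 0.8, 0.5])
world = torch.tensor([0.2, 0.3, -0.15, 0.85, 0.45])

# Hypothetical vector pointing in a different direction, for contrast.
other = torch.tensor([-0.7, 0.1, 0.9, -0.3, 0.0])

sim_close = F.cosine_similarity(hello, world, dim=0)
sim_far = F.cosine_similarity(hello, other, dim=0)

print(sim_close.item(), sim_far.item())
```

The first similarity should be close to 1, while the second should be much lower.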

**The use of embedding**: When we first initialize an embedding layer in a model (like an LSTM or a Transformer), the embedding table is filled with random numbers. Each word in our vocabulary is assigned one of these random vectors.

During training, the model receives input data (e.g., the words in our text, represented as indices), looks up the corresponding vectors in the embedding table, and uses those vectors to make predictions. The model's predictions are compared to the actual target values, and the difference (the error) is used to update the model's parameters, including the vectors in the embedding table. This is done through a process called backpropagation and an optimization algorithm like stochastic gradient descent (SGD).

The goal is to adjust the vectors in such a way that the model's predictions improve. This often means that words that are used in similar contexts or have similar meanings end up with similar vectors. The exact values and similarities will depend on the data and the specifics of the training process.

So while the lookup itself is deterministic during any single forward pass through the model, the contents of the embedding table change over the course of training. The vectors start out random, but they are learned from the data.
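
To make this concrete, here is a minimal sketch of a single training step. The linear `classifier` head, the target value, and the learning rate are all assumptions chosen for illustration; they do not come from a real task.

```python
import torch
from torch import nn

torch.manual_seed(0)

embedding = nn.Embedding(4, 5)
classifier = nn.Linear(5, 1)  # hypothetical prediction head
params = list(embedding.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

sentence = torch.tensor([0, 1])  # "hello world"
target = torch.tensor([1.0])     # made-up target value

before = embedding.weight.detach().clone()

# Forward pass: look up the vectors, average them, predict.
sentence_vector = embedding(sentence).mean(dim=0)  # (5,)
prediction = classifier(sentence_vector)           # (1,)
loss = nn.functional.mse_loss(prediction, target)

# Backward pass and parameter update.
optimizer.zero_grad()
loss.backward()
optimizer.step()

after = embedding.weight.detach()
changed_rows = (before != after).any(dim=1)
print(changed_rows)
```

With plain SGD, only the rows that were looked up (indices 0 and 1) should come back as changed; the rows for 'good' and 'day' receive no gradient in this step.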

**PyTorch module**:

https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

`nn.Embedding` is a simple lookup table that stores embeddings of a fixed dictionary and size.
This module is often used to store word embeddings and retrieve them using **indices**. 

- Input: $(*)$, IntTensor or LongTensor of arbitrary shape containing the indices to extract
- Output: $(*, H)$, where * is the input shape and $H=$ embedding_dim

When you create an embedding layer in PyTorch with `nn.Embedding(vocab_size, embed_size)`, it initializes a table of size `vocab_size` x `embed_size` with random values (drawn from a standard normal distribution by default). Each row in this table corresponds to the dense vector representation of a word (or whatever your indices represent). `vocab_size` is the number of unique words (or indices) you have, and `embed_size` is the dimensionality of the output vectors you want.
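
The "lookup table" description can be verified directly: calling the layer on a tensor of indices should give the same result as indexing the rows of its `weight` matrix. This is a small sketch with arbitrary example indices.

```python
import torch
from torch import nn

vocab_size, embed_size = 4, 5
embedding = nn.Embedding(vocab_size, embed_size)

# The table itself: one row per index.
print(embedding.weight.shape)  # torch.Size([4, 5])

# Indices of arbitrary shape (*) give an output of shape (*, embed_size).
indices = torch.tensor([[0, 1], [3, 3]])
looked_up = embedding(indices)
indexed = embedding.weight[indices]

print(looked_up.shape)  # torch.Size([2, 2, 5])
print(torch.equal(looked_up, indexed))
```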

#### Numerical examples:

Let's consider a simple example of a text classification task. Suppose we have a vocabulary of four words: {'hello': 0, 'world': 1, 'good': 2, 'day': 3}. Each word is mapped to a unique index.

We use indices instead of the words themselves because neural networks operate on numbers, not strings. An index also serves directly as a row number in the embedding table, so each word can be converted into a dense vector of fixed size (a word embedding) that machine learning models can process efficiently.

Now, let's say we have a sentence "hello world". We first convert this sentence into indices: [0, 1].
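
The conversion from text to indices can be sketched as follows. The `encode` helper is hypothetical and assumes whitespace tokenization and a vocabulary that contains every word in the input; an unknown word would raise a `KeyError`.

```python
import torch

vocab = {"hello": 0, "world": 1, "good": 2, "day": 3}


def encode(sentence: str) -> torch.Tensor:
    """Map a whitespace-separated sentence to a tensor of vocabulary indices."""
    tokens = sentence.lower().split()
    return torch.tensor([vocab[token] for token in tokens], dtype=torch.long)


indices = encode("hello world")
print(indices)  # tensor([0, 1])
```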

We want to feed this sentence into a model to predict some target value (for example, the sentiment of the sentence). But before we can do that, we need to convert these indices into word embeddings.

In [13]:
import torch
from torch import nn

# Create an embedding layer
vocab_size = 4  # We have 4 words in our vocabulary
embed_size = 5  # We want to represent each word as a 5-dimensional vector
embedding = nn.Embedding(vocab_size, embed_size)

# Convert our sentence "hello world" into indices
sentence = torch.tensor([0, 1])  # "hello" is 0 and "world" is 1

# Convert the indices into word embeddings
embeddings = embedding(sentence)

print(embeddings)
# The output will be a 2D tensor of shape (2, 5). 
# Each row is a 5-dimensional vector representing a word in the sentence.


tensor([[-0.4084, -0.0216,  2.0937, -0.5386,  0.4335],
        [-0.4920,  0.2761,  0.9611, -0.5332, -0.1987]],
       grad_fn=<EmbeddingBackward0>)


Another example: let's say we have a vocabulary of 5 unique characters, represented as integers: 0, 1, 2, 3, 4. And we want to represent each character as a 3-dimensional embedding. So we create an embedding layer like this:

In [14]:
vocab_size = 5
embed_size = 3
embedding = nn.Embedding(vocab_size, embed_size)
# This creates an embedding layer that can transform any integer from 0 to 4 into a 3-dimensional vector.

Now let's say we have a batch of 2 sequences (sequences of indices), each of length 4, represented as a 2D tensor:

In [15]:
seq_batch = torch.tensor([[0, 1, 2, 3], [2, 3, 4, 0]])
seq_batch.shape #  2 sequences, each of length 4.

torch.Size([2, 4])

The embedding layer transforms each integer in the tensor into its corresponding 3-dimensional embedding. So the output embed is a 3D tensor with shape (2, 4, 3): 2 sequences, each of length 4, and each integer represented as a 3-dimensional vector.

In [16]:
embed = embedding(seq_batch)
embed.shape

torch.Size([2, 4, 3])

## Pad and pack

It is important to pack a padded sequence before passing it through an RNN, and to pad the
packed sequence again after the RNN has processed it.

**a list of sequences (sorted by length descending) -> input pad_sequence -> `rnn.pack_padded_sequence()`->PackedSequence object -> LSTM -> PackedSequence object -> `rnn.pad_packed_sequence()`-> output pad_sequence**

e.g., 
input pad_sequence (B, T, feature size)  -> nn.LSTM (feature size, hidden_size) -> output pad_sequence (B, T, hidden_size)

### LSTM with input sequences of variable lengths

In PyTorch, the LSTM model `torch.nn.LSTM()` can process sequences of variable lengths using `PackedSequence` objects. `PackedSequence` is a class in PyTorch that holds the data and list of `batch_sizes` of a packed sequence.
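
The `batch_sizes` field records, for each time step, how many sequences are still active at that step. The sketch below recomputes it by hand for a hypothetical batch with lengths `[4, 2, 2, 1]` and compares the result with what `pack_padded_sequence` stores; the all-zero input tensor is only a placeholder.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical sequence lengths, sorted in descending order.
lengths = torch.tensor([4, 2, 2, 1])
max_len = int(lengths.max())

# batch_sizes[t] = number of sequences that have a real element at step t.
expected_batch_sizes = torch.tensor(
    [int((lengths > t).sum()) for t in range(max_len)]
)

# Placeholder padded batch of shape (B, T, F) = (4, 4, 1).
padded = torch.zeros(len(lengths), max_len, 1)
packed = pack_padded_sequence(padded, lengths, batch_first=True)

print(expected_batch_sizes)
print(packed.batch_sizes)
print(packed.data.shape)  # one row per real (non-padding) time step
```

The number of rows in `packed.data` should equal the sum of the lengths, since padding steps are not stored.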

All RNN modules, including LSTM, accept packed sequences as inputs.
To process variable-length sequences with LSTM, you can follow these steps:

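1. Sort the sequences by length (optional): `pack_padded_sequence()` expects the batch to be sorted by length in descending order when `enforce_sorted=True`, which is the default. With `enforce_sorted=False`, as used in the code below, the function sorts the batch internally and pre-sorting is unnecessary.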
2. Pad the sequences: Pad the sequences with zeros (or any other appropriate value) to make them all have the same length. This can be done using the `torch.nn.utils.rnn.pad_sequence()` function.
3. Pack the padded sequences: Use the `torch.nn.utils.rnn.pack_padded_sequence()` function to pack the padded sequences into a `PackedSequence` object. This function takes the input tensor and the lengths of the sequences as arguments.
4. Process the sequences with LSTM: Pass the `PackedSequence` object to the LSTM model. The LSTM model will process the sequences efficiently, taking into account their actual lengths.
5. Unpack the output: After processing the sequences with LSTM, you can use the `torch.nn.utils.rnn.pad_packed_sequence()` function to unpack the output back into padded sequences. This will give you the output for each time step in the original sequences (with padding).
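
The steps above can be sketched end to end. This version assumes `enforce_sorted=False`, in which case `pack_padded_sequence` handles the sorting internally, so the example batch is deliberately left unsorted. The random inputs and layer sizes are arbitrary choices for illustration.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three sequences with 3 features each, deliberately NOT sorted by length.
sequences = [torch.randn(2, 3), torch.randn(4, 3), torch.randn(1, 3)]
lengths = torch.tensor([len(seq) for seq in sequences])

# Pad to a common length: (B=3, T=4, F=3).
padded = pad_sequence(sequences, batch_first=True)

# Pack; sorting is done internally because enforce_sorted=False.
packed = pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False
)

# Run the LSTM on the packed batch.
lstm = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)
packed_output, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor, restored to the original batch order.
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)

print(output.shape)     # (B, T, hidden_size)
print(output_lengths)   # lengths in the original batch order
```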

In [1]:
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Create some example sequences, each shape (time_len, features_dim)
seq1 = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # (3, 2)
seq2 = torch.tensor([[7.0, 8.0], [9.0, 10.0]])  # (2, 2)
seq3 = torch.tensor([[11.0, 12.0]])  # (1, 2)

print(seq1.shape, seq2.shape, seq3.shape)

torch.Size([3, 2]) torch.Size([2, 2]) torch.Size([1, 2])


In [2]:
# Pad the sequences
padded_sequences = pad_sequence([seq1, seq2, seq3], batch_first=True)

print(padded_sequences, type(padded_sequences), padded_sequences.shape)
# padded_sequences.shape: torch.Size([3, 3, 2]) = (B, T, F).
# The batch holds 3 sequences, each padded to a 3-by-2 array (T=3 time steps, F=2 features).

tensor([[[ 1.,  2.],
         [ 3.,  4.],
         [ 5.,  6.]],

        [[ 7.,  8.],
         [ 9., 10.],
         [ 0.,  0.]],

        [[11., 12.],
         [ 0.,  0.],
         [ 0.,  0.]]]) <class 'torch.Tensor'> torch.Size([3, 3, 2])


<img src="./images/pack_padded_seq.png" width="600" height="430">

Source: https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch

In [3]:
# Length (number of time steps) of each sequence in the batch:
# the first sequence has length 3,
# the second sequence has length 2,
# the third sequence has length 1.
seq_lengths = torch.tensor([3, 2, 1])

# Pack the padded sequences
packed_sequences = pack_padded_sequence(
    padded_sequences, seq_lengths, batch_first=True, enforce_sorted=False
)

type(packed_sequences)  # torch.nn.utils.rnn.PackedSequence
print(packed_sequences.data.shape)
print()
packed_sequences

torch.Size([6, 2])



PackedSequence(data=tensor([[ 1.,  2.],
        [ 7.,  8.],
        [11., 12.],
        [ 3.,  4.],
        [ 9., 10.],
        [ 5.,  6.]]), batch_sizes=tensor([3, 2, 1]), sorted_indices=tensor([0, 1, 2]), unsorted_indices=tensor([0, 1, 2]))
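
The row order of `data` above is time-major: all active sequences at step 0, then all active sequences at step 1, and so on. The following sketch rebuilds `data` by hand from the padded tensor to illustrate this layout. It assumes the batch is already sorted by length in descending order, as in the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

seq1 = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
seq2 = torch.tensor([[7.0, 8.0], [9.0, 10.0]])
seq3 = torch.tensor([[11.0, 12.0]])

padded = pad_sequence([seq1, seq2, seq3], batch_first=True)  # (3, 3, 2)
lengths = torch.tensor([3, 2, 1])
packed = pack_padded_sequence(padded, lengths, batch_first=True)

# At time step t, only the first batch_sizes[t] sequences are still active.
rows = [padded[:n, t] for t, n in enumerate(packed.batch_sizes.tolist())]
manual_data = torch.cat(rows, dim=0)  # (6, 2)

print(manual_data)
print(torch.equal(manual_data, packed.data))
```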

In [4]:
# Create an LSTM model
lstm = nn.LSTM(input_size=2, hidden_size=4, batch_first=True)

# Process the packed sequences with LSTM
output, (hn, cn) = lstm(packed_sequences)
# output is also a PackedSequence object, the return of packed_sequences through LSTM.
# output.data has shape (6, H=4)

output.data.shape

torch.Size([6, 4])
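
A practical benefit of packing is that the final hidden state `hn` corresponds to the last real time step of each sequence rather than to a padding step. The sketch below checks this for a single-layer, unidirectional LSTM by comparing `h_n` with the per-step output at index `length - 1`; it rebuilds the example batch so that it runs on its own.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

sequences = [
    torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    torch.tensor([[7.0, 8.0], [9.0, 10.0]]),
    torch.tensor([[11.0, 12.0]]),
]
lengths = torch.tensor([3, 2, 1])

padded = pad_sequence(sequences, batch_first=True)
packed = pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False
)

lstm = nn.LSTM(input_size=2, hidden_size=4, batch_first=True)
packed_output, (h_n, c_n) = lstm(packed)
output, _ = pad_packed_sequence(packed_output, batch_first=True)

# For each sequence, pick the output at its last valid time step.
batch_index = torch.arange(len(lengths))
last_valid = output[batch_index, lengths - 1]  # (B, hidden_size)

print(h_n.shape)  # (num_layers, B, hidden_size)
print(torch.allclose(h_n[-1], last_valid))
```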

In [5]:
# Unpack the LSTM output back into a padded tensor.
unpacked_output, _ = pad_packed_sequence(output, batch_first=True)

print(unpacked_output.shape, unpacked_output)
# shape (3, 3, 4) : (B, T, hidden_size)
print()
print("compare to padded_sequences (input)")

print(padded_sequences.shape, padded_sequences)
# shape (3, 3, 2) : (B, T, feature size)

torch.Size([3, 3, 4]) tensor([[[-3.4508e-02, -1.7152e-02, -6.1093e-02,  1.2922e-02],
         [-3.8915e-02, -1.6568e-02,  5.7424e-02,  6.1641e-03],
         [-1.4188e-02, -1.1229e-02,  2.7400e-01,  1.7242e-03]],

        [[-1.2410e-03, -1.2493e-04,  3.1285e-01,  1.1075e-04],
         [-4.5575e-04, -8.3041e-05,  4.9538e-01,  3.6702e-05],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]],

        [[-3.5044e-05, -1.6551e-06,  3.8911e-01,  1.7866e-06],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00]]],
       grad_fn=<IndexSelectBackward0>)

compare to padded_sequences (input)
torch.Size([3, 3, 2]) tensor([[[ 1.,  2.],
         [ 3.,  4.],
         [ 5.,  6.]],

        [[ 7.,  8.],
         [ 9., 10.],
         [ 0.,  0.]],

        [[11., 12.],
         [ 0.,  0.],
         [ 0.,  0.]]])
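
The rows of zeros in the unpacked output mark padding positions. When aggregating over time, those positions are usually excluded with a mask built from the lengths that `pad_packed_sequence` returns. The following is a minimal sketch; `mask` and `mean_hidden` are illustrative names, and averaging over valid steps is just one possible aggregation.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

sequences = [
    torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    torch.tensor([[7.0, 8.0], [9.0, 10.0]]),
    torch.tensor([[11.0, 12.0]]),
]
lengths = torch.tensor([3, 2, 1])

padded = pad_sequence(sequences, batch_first=True)
packed = pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False
)

lstm = nn.LSTM(input_size=2, hidden_size=4, batch_first=True)
packed_output, _ = lstm(packed)
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)

# mask[b, t] is True where time step t is a real element of sequence b.
max_len = output.size(1)
mask = torch.arange(max_len)[None, :] < output_lengths[:, None]  # (B, T)
print(mask)

# Padding positions in the unpacked output are filled with zeros.
padding_is_zero = bool((output[~mask] == 0).all())
print(padding_is_zero)

# Mean hidden state over valid steps only: the zeros add nothing to the sum,
# so dividing by the true length gives the correct average.
mean_hidden = output.sum(dim=1) / output_lengths[:, None]  # (B, hidden_size)
print(mean_hidden.shape)
```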
