# Transformer Pre-processing

This notebook, adapted from Deeplearning.ai's Deep Learning course, explores the pre-processing methods applied to raw text before it is passed to the encoder and decoder blocks of the transformer architecture.

## Objectives

- Create visualizations to gain intuition on positional encodings
- Visualize how positional encodings affect word embeddings

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Positional Encoding

Here are the positional encoding equations:

$$
PE_{(pos, 2i)}= \sin\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
$$
<br>

$$
PE_{(pos, 2i+1)}= \cos\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
$$
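
As a minimal sketch of how the two equations translate into code, the snippet below evaluates them for a single position in plain Python. The helper name `pe_vector` is hypothetical and not part of this notebook; it only illustrates that columns `2i` and `2i+1` share the same frequency and differ in using sine versus cosine.

```python
import math

def pe_vector(pos, d):
    """Positional encoding for one position, written directly from the equations."""
    vec = []
    for j in range(d):
        i = j // 2                            # columns 2i and 2i+1 share one frequency
        angle = pos / (10000 ** (2 * i / d))
        # even columns use sine, odd columns use cosine
        vec.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return vec

print(pe_vector(0, 4))  # position 0: [0.0, 1.0, 0.0, 1.0]
print(pe_vector(1, 4))  # position 1: sin(1), cos(1), sin(0.01), cos(0.01)
```

At position 0 every angle is zero, so the sine columns are 0 and the cosine columns are 1.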

In natural language processing tasks, it is common practice to convert sentences into tokens before inputting them into a language model. Each token is then represented as a fixed-length numerical vector called an embedding, which encapsulates the meaning of the words. In the Transformer architecture, a positional encoding vector is added to the embedding to convey positional information throughout the model.

Understanding these vectors can be challenging when only numerical representations are examined. However, visualizations can provide insight into the semantic and positional relationships between words. Reducing embeddings to two dimensions and plotting them shows that semantically similar words cluster together, while dissimilar words are spaced further apart. Similarly, positional encoding vectors can be visualized to reveal that words closer together in a sentence appear closer on a Cartesian plane, while those farther apart appear more distant.

In this notebook, a series of visualizations will be created to explore word embeddings and positional encoding vectors, aiming to illustrate how positional encodings impact word embeddings and convey sequential information through the Transformer architecture.
    



### Positional Encoding Visualizations

The next code cell includes the `positional_encoding` function that was implemented previously. This notebook will build upon that work to create additional visualizations using this function.

In [None]:
def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings 
    
    Arguments:
        positions (int) -- Maximum number of positions to be encoded 
        d (int) -- Encoding size 
    
    Returns:
        pos_encoding -- (1, positions, d) A matrix with the positional encodings
    """

    # initialize a matrix angle_rads of all the angles 
    angle_rads = np.arange(positions)[:, np.newaxis] / np.power(10000, (2 * (np.arange(d)[np.newaxis, :]//2)) / np.float32(d))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...]
    
    return tf.cast(pos_encoding, dtype=tf.float32)

Define the embedding dimension as 100, which should match the dimensionality of the word embeddings. In the ["Attention is All You Need"](https://arxiv.org/abs/1706.03762) paper, embedding sizes range from 100 to 1024 depending on the task, with maximum sequence lengths varying from 40 to 512. For this notebook:

- Set the maximum sequence length to 100
- Set the maximum number of words to 64


In [None]:
EMBEDDING_DIM = 100
MAX_SEQUENCE_LENGTH = 100
MAX_NB_WORDS = 64
pos_encoding = positional_encoding(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)

plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('d')
plt.xlim((0, EMBEDDING_DIM))
plt.ylabel('Position')
plt.colorbar()
plt.show()

This visualization has appeared before, but it is worth a closer look. One notable property of the matrix is that the norm of every vector is the same: regardless of the value of `pos`, the norm equals 7.071068. This value is $\sqrt{d/2} = \sqrt{50}$, because each sine/cosine pair contributes $\sin^2 + \cos^2 = 1$ to the squared norm. Since all vectors have the same length, the dot product of two positional encoding vectors reflects only the angle between them and not a difference in scale, which matters for the correlation calculations later in this notebook.

In [None]:
pos = 34
tf.norm(pos_encoding[0,pos,:])
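
To check the constant-norm property for every position rather than a single one, here is a NumPy-only sketch. The function `positional_encoding_np` is an assumed re-implementation of `positional_encoding` without TensorFlow, written only for this check; it is not part of the original notebook.

```python
import numpy as np

def positional_encoding_np(positions, d):
    """NumPy-only stand-in for positional_encoding; returns shape (positions, d)."""
    pos = np.arange(positions)[:, np.newaxis]
    i = np.arange(d)[np.newaxis, :] // 2
    angles = pos / np.power(10000, 2 * i / d)
    angles[:, 0::2] = np.sin(angles[:, 0::2])   # even columns: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])   # odd columns: cosine
    return angles

d = 100
pe = positional_encoding_np(100, d)
norms = np.linalg.norm(pe, axis=1)     # one norm per position

print(norms.min(), norms.max())        # both should match sqrt(d / 2)
print(np.sqrt(d / 2))
```

This sketch uses float64, so the minimum and maximum agree to many decimal places; the notebook's tensor is cast to float32, which is why a value such as 7.071068 is printed there.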

Another noteworthy property is that the norm of the difference between two vectors separated by `k` positions remains constant. When `k` is fixed and `pos` varies, the difference remains approximately the same. This characteristic highlights that the difference is determined by the relative separation between encodings rather than their absolute positions. Expressing positional encodings as linear functions of one another can aid the model in focusing on the relative positions of words.

Achieving this representation of word position differences through vector encodings is challenging, particularly because the values of these encodings must be small enough to avoid distorting the word embeddings.

In [None]:
pos = 70
k = 2
print(tf.norm(pos_encoding[0,pos,:] -  pos_encoding[0,pos + k,:]))
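
The reason the distance depends only on `k` can be made explicit. For each sine/cosine pair, $(\sin a - \sin b)^2 + (\cos a - \cos b)^2 = 2 - 2\cos(a - b)$, so $\lVert PE_{pos} - PE_{pos+k} \rVert^2 = \sum_i \left(2 - 2\cos(k\,\omega_i)\right)$ with $\omega_i = 10000^{-2i/d}$, and `pos` drops out. The following sketch checks this numerically; `positional_encoding_np` is an assumed NumPy-only re-implementation of `positional_encoding`, not part of the original notebook.

```python
import numpy as np

def positional_encoding_np(positions, d):
    """NumPy-only stand-in for positional_encoding; returns shape (positions, d)."""
    pos = np.arange(positions)[:, np.newaxis]
    i = np.arange(d)[np.newaxis, :] // 2
    angles = pos / np.power(10000, 2 * i / d)
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles

d, k = 100, 2
pe = positional_encoding_np(100, d)

# Distance between every position and the position k steps later
dists = np.linalg.norm(pe[:-k] - pe[k:], axis=1)

# Closed form from the identity above; it depends on k only
omega = 1.0 / np.power(10000, 2 * np.arange(d // 2) / d)
closed_form = np.sqrt(np.sum(2 - 2 * np.cos(k * omega)))

print(dists.min(), dists.max(), closed_form)
```

All three printed values should agree up to floating-point error; try other values of `k` to see the distance change while staying independent of `pos`.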

Having observed some interesting properties of the positional encoding vectors, the next step is to create visualizations to explore how these properties influence the relationships between encodings and embeddings.

### Comparing Positional Encodings

#### Correlation

The positional encoding matrix shows that each position gets a unique vector. However, it remains unclear how these vectors represent the relative positions of words within a sentence. To clarify this, calculate the correlation between pairs of vectors at each position; the cell below uses the dot product, which is proportional to cosine similarity here because every vector has the same norm. An effective positional encoder will generate a symmetric matrix where the highest values lie along the main diagonal, since vectors at similar positions should exhibit the highest correlation. Accordingly, correlation values are expected to decrease with distance from the diagonal.

In [None]:
# Positional encoding correlation
corr = tf.matmul(pos_encoding, pos_encoding, transpose_b=True).numpy()[0]
plt.pcolormesh(corr, cmap='RdBu')
plt.xlabel('Position')
plt.xlim((0, MAX_SEQUENCE_LENGTH))
plt.ylabel('Position')
plt.colorbar()
plt.show()
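
The structure of this matrix can also be checked without a plot. Using $\sin a \sin b + \cos a \cos b = \cos(a - b)$, the dot product of the encodings at positions $p$ and $q$ is $\sum_i \cos((p - q)\,\omega_i)$ with $\omega_i = 10000^{-2i/d}$, so it depends only on the offset $p - q$ and peaks at $d/2$ on the diagonal. The sketch below verifies this; `positional_encoding_np` is an assumed NumPy-only re-implementation of `positional_encoding`.

```python
import numpy as np

def positional_encoding_np(positions, d):
    """NumPy-only stand-in for positional_encoding; returns shape (positions, d)."""
    pos = np.arange(positions)[:, np.newaxis]
    i = np.arange(d)[np.newaxis, :] // 2
    angles = pos / np.power(10000, 2 * i / d)
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles

d = 100
pe = positional_encoding_np(100, d)
corr = pe @ pe.T                                   # pairwise dot products

print(np.allclose(corr, corr.T))                   # symmetric
print(np.allclose(np.diag(corr), d / 2))           # diagonal equals d / 2
print(np.isclose(corr[10, 13], corr[40, 43]))      # same offset, same value
print(np.array_equal(corr.argmax(axis=1), np.arange(100)))  # each row peaks on the diagonal
```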

#### Euclidean Distance

Alternatively, the Euclidean distance can be used to compare the positional encoding vectors. In this approach, the visualization will show a matrix where the main diagonal has values of 0, and the off-diagonal values increase as they move away from the diagonal.

In [None]:
# Positional encoding euclidean distance
eu = np.zeros((MAX_SEQUENCE_LENGTH, MAX_SEQUENCE_LENGTH))
print(eu.shape)
for a in range(MAX_SEQUENCE_LENGTH):
    for b in range(a + 1, MAX_SEQUENCE_LENGTH):
        eu[a, b] = tf.norm(tf.math.subtract(pos_encoding[0, a], pos_encoding[0, b]))
        eu[b, a] = eu[a, b]
        
plt.pcolormesh(eu, cmap='RdBu')
plt.xlabel('Position')
plt.xlim((0, MAX_SEQUENCE_LENGTH))
plt.ylabel('Position')
plt.colorbar()
plt.show()
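
The double loop above is easy to read, but the same matrix can be computed in one step with NumPy broadcasting. The sketch below is one possible alternative, not the notebook's method; `positional_encoding_np` is an assumed NumPy-only re-implementation of `positional_encoding`, and the loop is repeated only to confirm that both approaches agree.

```python
import numpy as np

def positional_encoding_np(positions, d):
    """NumPy-only stand-in for positional_encoding; returns shape (positions, d)."""
    pos = np.arange(positions)[:, np.newaxis]
    i = np.arange(d)[np.newaxis, :] // 2
    angles = pos / np.power(10000, 2 * i / d)
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles

n = 100
pe = positional_encoding_np(n, 100)

# Broadcasting: (n, 1, d) - (1, n, d) -> (n, n, d), then norm over the last axis
diff = pe[:, np.newaxis, :] - pe[np.newaxis, :, :]
eu_fast = np.linalg.norm(diff, axis=-1)

# Reference computation with the same double loop as the notebook cell
eu_loop = np.zeros((n, n))
for a in range(n):
    for b in range(a + 1, n):
        eu_loop[a, b] = np.linalg.norm(pe[a] - pe[b])
        eu_loop[b, a] = eu_loop[a, b]

print(eu_fast.shape, np.allclose(eu_fast, eu_loop))
```

The broadcast version builds an intermediate array of shape `(n, n, d)`, so it trades memory for speed; for the sizes used here that is not a concern.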

## Semantic Embedding

The correlation and distance matrices show how positional encoding vectors relate to one another across positions. To understand how positional encodings affect word embeddings, the next step is to visualize the sum of the two vectors.

### Load Pretrained Embedding

To integrate a pretrained word embedding with the positional encodings, begin by loading an embedding from the [GloVe](https://nlp.stanford.edu/projects/glove/) project. The pretrained embeddings file can be downloaded from [this link](https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt). We will use the embedding with 100 features for this purpose.

In [None]:
embeddings_index = {}
# put the downloaded glove file in the same directory as this script
# or change the path accordingly
GLOVE_DIR = "glove"
# each line holds a word followed by its coefficients, separated by spaces
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
print('d_model:', embeddings_index['hi'].shape)

**Note:** This embedding is composed of 400,000 words and each word embedding has 100 features.

Consider the following text, which contains just two sentences. Note that these sentences are constructed to illustrate specific points:

* Each sentence is made up of word sets with semantic similarities within each group.
* In the first sentence, similar terms are placed consecutively, whereas in the second sentence, the order is random.

In [None]:
texts = ['king queen man woman dog wolf football basketball red green yellow',
         'man queen yellow basketball green dog  woman football  king red wolf']

First, run the following code cell to apply tokenization to the raw text. While the details of this step will be covered in later ungraded labs, here’s a brief overview (not crucial for understanding the current lab):

* The code processes an array of plain text with varying sentence lengths and produces a matrix where each row corresponds to a sentence, represented as an array of size `MAX_SEQUENCE_LENGTH`.
* Each value in this array represents a word from the sentence, indexed according to a dictionary (`word_index`).
* Sequences shorter than `MAX_SEQUENCE_LENGTH` are padded with zeros to ensure uniform length.

Detailed explanations will follow in subsequent ungraded labs, so there’s no need to focus too much on this step right now!
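
For readers who want a rough picture now, the three bullet points above can be sketched in plain Python. This only approximates what `Tokenizer` and `pad_sequences` produce for this particular input (lowercasing, whitespace splitting, indices starting at 1 ordered by frequency, zero padding at the end); it is not the Keras implementation, and `MAX_LEN = 15` is a shortened stand-in for `MAX_SEQUENCE_LENGTH` to keep the printout readable.

```python
from collections import Counter

texts = ['king queen man woman dog wolf football basketball red green yellow',
         'man queen yellow basketball green dog  woman football  king red wolf']
MAX_LEN = 15  # assumption: shorter than the notebook's 100, for display only

# Split on whitespace; split() also absorbs the double spaces
tokens = [t.lower().split() for t in texts]

# Index words by frequency (ties keep first-seen order); 0 is reserved for padding
counts = Counter(w for sent in tokens for w in sent)
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# Replace each word by its index, then pad at the end with zeros
sequences = [[word_index[w] for w in sent] for sent in tokens]
padded = [seq + [0] * (MAX_LEN - len(seq)) for seq in sequences]

print(padded[0])  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 0, 0, 0]
print(padded[1])  # [3, 2, 11, 8, 10, 5, 4, 7, 1, 9, 6, 0, 0, 0, 0]
```

Both rows contain the same eleven indices in a different order, which is exactly the property the two example sentences were built to have.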

In [None]:
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, padding='post', maxlen=MAX_SEQUENCE_LENGTH)

print(data.shape)

print(data)

To keep the example small, retrieve embeddings only for the distinct words in the text being examined, in this case the 11 words found in the two sentences. Row 0 of the matrix is left as zeros: the tokenizer reserves index 0 for padding, and any word missing from the GloVe index also keeps an all-zero row.

In [None]:
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)

Create an embedding layer using the weights extracted from the pretrained GloVe embeddings.

In [None]:
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                            trainable=False)

Pass the tokenized, padded data through this layer to obtain the embeddings. The result should have shape `(2, 100, 100)`: two sentences, `MAX_SEQUENCE_LENGTH` positions per sentence, and `EMBEDDING_DIM` features for each position in the last dimension.

In [None]:
embedding = embedding_layer(data)
print(embedding.shape)
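
With frozen weights, an embedding layer acts as a table lookup: each token id selects one row of the weight matrix. The sketch below reproduces that idea with NumPy indexing. The sizes are toy values and the rows are random stand-ins rather than GloVe vectors; both are assumptions made to keep the example self-contained.

```python
import numpy as np

vocab_size, dim = 11, 4                 # toy sizes, not the notebook's
rng = np.random.default_rng(0)

# Row 0 stays zero (padding); the other rows are random stand-ins for GloVe rows
embedding_matrix = np.zeros((vocab_size + 1, dim))
embedding_matrix[1:] = rng.normal(size=(vocab_size, dim))

# Two padded "sentences" containing the same ids in a different order
data = np.array([[1, 2, 3, 0, 0, 0],
                 [3, 2, 1, 0, 0, 0]])

embedded = embedding_matrix[data]       # integer indexing: one row per token id
print(embedded.shape)                   # (2, 6, 4)

# The same id yields the same vector wherever it appears in the sentence
print(np.array_equal(embedded[0, 0], embedded[1, 2]))
# Padded positions map to the all-zero row
print(np.all(embedded[:, 3:] == 0))
```

The second check shows why position information has to be added separately: the lookup itself is blind to word order.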

### Visualization on a Cartesian Plane

Next, create a function to visualize the word encodings on a Cartesian plane. This will involve using PCA to reduce the 100-dimensional GloVe embeddings to just 2 components for easier visualization.
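
For intuition about what a two-component PCA computes, here is a minimal NumPy sketch: center the data, take the singular value decomposition, and project onto the two leading directions. The helper `pca_2d` is hypothetical and the input is random stand-in data, not GloVe embeddings; scikit-learn's `PCA` may flip the sign of a component relative to this sketch.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)          # PCA works on mean-centered data
    # Rows of Vt are principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.default_rng(0)
X = rng.normal(size=(11, 100))               # stand-in for 11 word vectors of size 100
X_2d = pca_2d(X)

print(X_2d.shape)                            # (11, 2)
print(X_2d.var(axis=0))                      # first component has the larger variance
```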

In [None]:
from sklearn.decomposition import PCA

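def plot_words(embedding, sequences, sentence):
    """Project the words of one sentence onto 2 PCA components and plot them."""
    pca = PCA(n_components=2)
    # Keep only the real tokens of the sentence, dropping the padded positions
    n_tokens = len(sequences[sentence])
    X_pca_train = pca.fit_transform(np.asarray(embedding[sentence, 0:n_tokens, :]))

    fig, ax = plt.subplots(figsize=(12, 6))
    plt.rcParams['font.size'] = '12'
    ax.scatter(X_pca_train[:, 0], X_pca_train[:, 1])
    # word_index maps word -> id starting at 1, so id - 1 indexes the key list
    words = list(word_index.keys())
    for i, index in enumerate(sequences[sentence]):
        ax.annotate(words[index - 1], (X_pca_train[i, 0], X_pca_train[i, 1]))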


Now plot the embeddings for each sentence. Each plot should display the embeddings of the different words.


In [None]:
plot_words(embedding, sequences, 0)

Plot the word embeddings of the second sentence, which contains the same words as the first sentence but in a different order. This visualization will demonstrate that the order of the words does not impact their vector representations.

In [None]:
plot_words(embedding, sequences, 1)

## Semantic and Positional Embeddings

Next, combine the original GloVe embeddings with the positional encodings calculated earlier. For this exercise, use a 1-to-1 weight ratio between the semantic and positional embeddings.

In [None]:
embedding2 = embedding * 1.0 + pos_encoding[:,:,:] * 1.0

plot_words(embedding2, sequences, 0)
plot_words(embedding2, sequences, 1)

Both plots now differ drastically from their original versions. In the second plot, which represents the sentence where similar words are not grouped together, very dissimilar words such as `red` and `wolf` appear closer together.

Experiment with different relative weights to see how strongly they influence the vector representations of the words in the sentence.

In [None]:
W1 = 1 # Change me
W2 = 10 # Change me
embedding2 = embedding * W1 + pos_encoding[:,:,:] * W2
plot_words(embedding2, sequences, 0)
plot_words(embedding2, sequences, 1)

# For reference
#['king queen man woman dog wolf football basketball red green yellow',
# 'man queen yellow basketball green dog  woman football  king red wolf']

If `W1 = 1` and `W2 = 10`, the arrangement of the words will start to exhibit a clockwise or counterclockwise pattern, depending on their positions in the sentence. With these parameters, the positional encoding vectors will dominate the embeddings.

Now, try using `W1 = 10` and `W2 = 1`. Under these conditions, the plot will closely resemble the original embedding visualizations, with only minor changes in the positions of the words.

In the previous Transformer notebook, the word embedding was multiplied by `sqrt(EMBEDDING_DIM)`. In this case, using `W1 = sqrt(EMBEDDING_DIM) = 10` and `W2 = 1` will be equivalent.
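
The effect of the weights can also be quantified without plotting. The sketch below places the same word vector at positions 0 and 10 and measures the cosine similarity of the two combined vectors under different weights. The word vector is a random stand-in, not a GloVe embedding, and `positional_encoding_np` and `cosine` are assumed helper names introduced only for this illustration.

```python
import numpy as np

def positional_encoding_np(positions, d):
    """NumPy-only stand-in for positional_encoding; returns shape (positions, d)."""
    pos = np.arange(positions)[:, np.newaxis]
    i = np.arange(d)[np.newaxis, :] // 2
    angles = pos / np.power(10000, 2 * i / d)
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return angles

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d = 100
pe = positional_encoding_np(100, d)
rng = np.random.default_rng(0)
word = rng.normal(size=d)        # random stand-in for one word embedding

results = {}
for w1, w2 in [(1, 10), (1, 1), (10, 1)]:
    at_pos_0 = w1 * word + w2 * pe[0]     # the word placed at position 0
    at_pos_10 = w1 * word + w2 * pe[10]   # the same word placed at position 10
    results[(w1, w2)] = cosine(at_pos_0, at_pos_10)
    print(f"W1={w1}, W2={w2}: cosine similarity = {results[(w1, w2)]:.3f}")
```

The similarity should be highest for `W1 = 10`, `W2 = 1`, where the word identity dominates, and lowest for `W1 = 1`, `W2 = 10`, where the position dominates.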

#### Recap
- Positional encodings can be expressed as linear functions of each other, allowing the model to learn based on the relative positions of words.
- While positional encodings can influence word embeddings, a small relative weight for the positional encoding will preserve the semantic meaning of the words.