## Learning Objectives

At the end of the experiment, you will be able to:

* understand the big picture of transformers
* understand and work with the TextVectorization layer
* understand and work with the Embedding layer
* learn word embeddings during model training
* perform visualization of word embeddings

### The Big Picture of Transformer

The TextVectorization and Embedding layers are used in the encoder-decoder Transformer.

<br>

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST%205%20Big%20Picture.png" width=800px/>
</center>

The figure above shows the full Transformer architecture: a TextVectorization layer, an Embedding layer, an encoder, and a decoder.

The Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, maps an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Transformer architecture was originally designed for translation. In the encoder, the attention layers can use all the words in a sentence, since the translation of a given word can depend on what comes after it as well as before it. The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (that is, only the words before the word currently being generated). For example, once the first three words of the translated target have been predicted, they are given to the decoder, which then uses all the inputs of the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
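The "no future words" restriction is usually implemented as a causal (lower-triangular) attention mask. Below is a minimal NumPy sketch of that idea, not the code of any particular library: the function name `causal_mask` and the all-zero placeholder scores are assumptions made for illustration. Rows index the decoder input position doing the attending and columns index the position being attended to.

```python
import numpy as np

def causal_mask(seq_len):
    """Return a (seq_len, seq_len) boolean mask; True means 'may attend'."""
    # Lower-triangular: position i may look at positions 0..i, never at later ones.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))

# Hypothetical attention scores (all equal) to show the effect of the mask.
scores = np.zeros((4, 4))
masked_scores = np.where(mask, scores, -np.inf)   # blocked positions get -inf

# Softmax over each row: blocked positions end up with weight exactly 0.
exp_scores = np.exp(masked_scores)
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
print(weights)
```

Each row of `weights` still sums to 1, but all of the weight is spread over the current and earlier positions only.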

The TextVectorization and Embedding layers are also required in the encoder-only Transformer.

<br>

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5_AST_05_Image1_Transformer.png" width=900px/>
</center>

In this assignment, the encoder and decoder are not discussed; the focus is on the TextVectorization and Embedding layers,
which are covered in detail in the later sections of this notebook.

## Dataset Description

The **IMDb Movie Reviews dataset** is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as *positive* or *negative*. The dataset contains an equal number of positive and negative reviews.

This dataset is processed and used in the later sections of this notebook.

### Setup Steps:

### Importing required packages

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os, pathlib, shutil, random

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization, Embedding, Dense
from tensorflow.keras.utils import text_dataset_from_directory

from sklearn.decomposition import PCA

## TextVectorization

It involves preparing the text data:
  * Text standardization
  * Text splitting into tokens
  * Vocabulary indexing
  


The flowchart below depicts the sequence of steps followed in a TextVectorization layer.
* 'Standardization' handles basic preprocessing of the text, such as removing punctuation and converting the text to lower case.
* 'Tokenization' splits the sentence into a list of words (tokens).
* 'Indexing' represents each token by an integer index; an Embedding layer later turns these indices into vectors.

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5_AST_05_Transformer_Encoder_Text_data_prep.png" width=650px/>
</center>


All these steps are performed in a TextVectorization Layer.
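The three steps can be sketched in plain Python as follows. This only illustrates the idea and is not how Keras implements the layer: the helper names (`standardize`, `tokenize`, `build_vocabulary`, `encode`) and the toy corpus are hypothetical. The sketch orders the vocabulary by first appearance, whereas Keras orders tokens by frequency. As in Keras's integer mode, index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.

```python
import string

def standardize(text):
    # Lowercase the text and strip ASCII punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    # Split on whitespace.
    return text.split()

def build_vocabulary(corpus):
    # Index 0 is the padding token, index 1 the out-of-vocabulary token.
    vocab = ["", "[UNK]"]
    for doc in corpus:
        for token in tokenize(standardize(doc)):
            if token not in vocab:
                vocab.append(token)
    return vocab

def encode(text, vocab):
    # Map each token to its index; unknown tokens map to 1 ([UNK]).
    index = {token: i for i, token in enumerate(vocab)}
    return [index.get(token, 1) for token in tokenize(standardize(text))]

corpus = ["The cat sat.", "The dog sat!"]
vocab = build_vocabulary(corpus)
print(vocab)                          # ['', '[UNK]', 'the', 'cat', 'sat', 'dog']
print(encode("The bird sat", vocab))  # [2, 1, 4]
```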


*   Keras provides a TextVectorization layer which can be dropped directly into
      - a tf.data pipeline **or**
      - a Keras model

*  Moreover, TextVectorization supports both approaches to representing groups of words:
      - Words as a set or Bag-of-words
      - Words as a sequence
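The two representations listed above can be contrasted with a small NumPy sketch. The token indices and sizes here are hypothetical; in Keras the corresponding `TextVectorization` settings are `output_mode="int"` (sequence) and `output_mode="multi_hot"` (set).

```python
import numpy as np

vocab_size = 6
token_ids = [2, 3, 4, 2]      # hypothetical encoded sentence; note the repeated token 2

# Words as a sequence: order is preserved and the result is padded with 0s
# up to a fixed length.
sequence_length = 6
sequence = np.zeros(sequence_length, dtype=int)
sequence[:len(token_ids)] = token_ids
print(sequence)               # [2 3 4 2 0 0]

# Words as a set (bag-of-words, multi-hot): one slot per vocabulary entry,
# set to 1 if the token occurs at all. Order and repetition are discarded.
multi_hot = np.zeros(vocab_size, dtype=int)
multi_hot[token_ids] = 1
print(multi_hot)              # [0 0 1 1 1 0]
```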







### Define a dummy dataset and a test sentence


In [None]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]

test_sentence = "I write, rewrite, and still rewrite again"
# dataset_t = ["I write, rewrite, and still rewrite again"]
#Q: Is the word 'still' in the dataset (vocabulary)? Is it there in the test_sentence?
#Q: How many words in test_sentence?

### Create a TextVectorization layer and adapt to dummy dataset

Create and demonstrate the working of a TextVectorization layer.

In [None]:
#Q: What 3 things does a TextVec layer do?

# Instantiating a TextVectorization layer/object with output mode as integer
text_vectorization = TextVectorization(
    output_mode="int",              # int is default. There are different kinds of modes available
    max_tokens=15,                  # Maximum vocabulary size
    output_sequence_length=10,      # Output is padded or truncated to exactly this length
    # We can use custom functions also for standardizing and splitting the text - see the Book by Chollet
    # standardize=custom_standardization_fn,
    # split=custom_split_fn,
)

# Adapt to data
text_vectorization.adapt(dataset)      # Computes a vocabulary of string terms from tokens in a dataset


In [None]:
# To see the working of TextVectorization

vocabulary = text_vectorization.get_vocabulary()
print(f"vocabulary = {vocabulary}")
print(f"len(vocabulary) = {len(vocabulary)}")

# To see how the text_vec layer transforms/vectorizes the raw text
encoded_sentence = text_vectorization(test_sentence)
# YOUR CODE HERE to show the 'encoded_sentence' and its length


# decode back for comparison with test_sentence
inverse_vocab = dict(enumerate(vocabulary)) # dictionary mapping each integer index back to its token
print(f"inverse_vocab = {inverse_vocab}")
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(f"decoded sentence = {decoded_sentence}")

print(f"test_sentence = {test_sentence}")

# Q: What is a vocabulary?
# Q: No. of tokens in vocabulary?
# Q: Length of encoded_sentence (output of TextVec layer)?
# Q: Type of elements in encoded_sentence?
# Q: Is decoded sentence the same as the test_sentence? Why?

## Processing the dataset using the TextVectorization layer of Keras

### Data Preparation

A pre-processed version of the IMDB dataset provided by Keras was used in the previous assignments.

The original IMDb dataset contains *train* and *test* folders.
Here, the original dataset is used and the pre-processing it requires is explored.

In [None]:
# List subdirectories
!cd aclImdb && ls -d */

In [None]:
# Remove unnecessary folder
!rm -r aclImdb/train/unsup

In [None]:
# Visualise a sample
!cat aclImdb/train/pos/4077_10.txt

### Create a validation directory and move 20% of the train data to it

In [None]:
# move 20% of the training data to the validation folder
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    # random.Random(1337).shuffle(files) # We should shuffle. Only commenting for demonstration
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

### Create batches of data using `text_dataset_from_directory`

In [None]:
# Create dataset using utility
batch_size = 32

# Q: Name other such utilities seen earlier ?
train_ds = text_dataset_from_directory("aclImdb/train", batch_size=batch_size)

val_ds = # YOUR CODE HERE to apply text_dataset_from_directory() with path "aclImdb/val"

test_ds = # YOUR CODE HERE to apply text_dataset_from_directory() with path "aclImdb/test"

# Extracting only the review text(not labels); to be used later to adapt the TextVec layer
text_only_train_ds = train_ds.map(lambda x, y: x)             # lambda x, y: x  --> replace x,y with x. That is remove labels, just keep text data.


The train, validation, and test directories contain 20,000, 5,000, and 25,000 records respectively, each split across two classes: positive and negative.
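These counts follow from the 20% split performed earlier, assuming the original `aclImdb/train` folder holds 12,500 reviews per class (the standard layout of this dataset; adjust the number if your copy differs). A quick arithmetic check:

```python
files_per_class = 12_500      # assumption: reviews per class in the original train folder
num_classes = 2               # "neg" and "pos"
val_fraction = 0.2

val_per_class = int(val_fraction * files_per_class)
train_per_class = files_per_class - val_per_class

val_total = num_classes * val_per_class
train_total = num_classes * train_per_class
print(train_total, val_total)   # 20000 5000
```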

In [None]:
# Check shapes

for inputs, targets in train_ds:

    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)

    # YOUR CODE HERE to show shape and datatype of 'targets'

    print("inputs[2]:", inputs[2])
    print("targets[2]:", targets[2])

    break

### Create TextVectorization layer and adapt to dataset

In [None]:
# Vectorizing the data
max_length = 600
max_tokens = 20000
text_vectorization = # YOUR CODE HERE to create a TextVectorization layer using max_tokens as vocabulary size, "int" as output type for a token, 600 as maximum length of review

# YOUR CODE HERE to adapt text_vectorization() layer on 'text_only_train_ds'


# Apply TextVec to train, val, test set

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), tf.reshape(y, (-1,1))),
                            num_parallel_calls=4)

int_val_ds = # YOUR CODE HERE to apply text_vectorization() on val_ds

int_test_ds = # YOUR CODE HERE to apply text_vectorization() on test_ds


### Visualize and compare the raw and processed data

In [None]:
# Let's visualize the raw text and the vectorized (to int) text
for text, label in train_ds:
  print(text[0])
  print(label[0])
  break

# YOUR CODE HERE to create a for loop to show the sample text and label from 'int_train_ds'


# Q: How can you verify whether the index of the word 'movie' is 18?


Vector representation of the word 'movie'

In [None]:
text_vectorization("movie")
# Q: What is the shape of the TextVectorization output?
# Q: Why so many 0s?


Vector representation of "great movie" and "a fine story"

In [None]:
text_vectorization(["great movie", "a fine story"])
#Q: shape?

## Word Embeddings

**Why do we need Word Embeddings?**

To deal with textual data, we need to convert it into numbers before feeding it into any machine learning model. For simplicity, words can be compared to categorical variables. We use one-hot encoding to convert categorical features into numbers: we create a dummy feature for each category and populate it with 0s and 1s.

Similarly, if we use one-hot encoding on the words in textual data, we get a dummy feature for each word, which means 10,000 features for a vocabulary of 10,000 words. This is not a feasible approach: it demands a large amount of storage for the word vectors, reduces model efficiency, and captures no relation between words.
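A small NumPy sketch makes both drawbacks concrete. The three-word vocabulary and the dense dimension of 256 are hypothetical values chosen for illustration.

```python
import numpy as np

vocab = ["good", "nice", "bad"]        # hypothetical tiny vocabulary
one_hot = np.eye(len(vocab))           # row i is the one-hot vector of word i

# Dot products between different words are all 0, so "good" is exactly as
# (un)related to "nice" as it is to "bad": no relation is captured.
similarity = one_hot @ one_hot.T
print(similarity)

# Size of a single word vector under each scheme.
vocab_size = 10_000
dense_dim = 256                        # hypothetical embedding size
print(f"one-hot: {vocab_size} values per word, dense: {dense_dim} values per word")
```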

**Word embeddings** are vector representations of words that capture such relations: they map human language into a structured geometric space. Word embeddings are:

* dense (floating-point vectors)
* low-dimensional (typically 256, 512, or 1,024 dimensions, even for very large vocabularies)

There are two ways to obtain word embeddings:

* Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors, in the same way you learn the weights of a neural network. **Move away from manual feature engineering.**
* Load into your model word embeddings that were precomputed using a different machine learning task than the one you’re trying to solve. These are called pretrained word embeddings.

**Q: Do these two ways remind you of something we studied in CNNs?**

In this assignment, the focus is on the first approach: learning word embeddings during model training.




### Embedding Layer


The Embedding layer works as follows:

*   Like a dictionary that **maps integer indices** (which stand for specific words) **to dense vectors**

*   Input: a rank-2 tensor of integers, of shape (batch_size, sequence_length)
*   Output: 3D floating-point tensor of shape (batch_size, sequence_length, embedding_dimensionality)
*   WORD INDEX ⭢ EMBEDDING LAYER ⭢ CORRESPONDING WORD VEC

*   Initial weights are random
*   Learns specialized structure upon training
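The dictionary-like behaviour described above amounts to indexing into a weight matrix. The following NumPy sketch mimics that lookup under assumed small sizes; the random values stand in for the untrained weights and are not Keras's actual initializer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4                        # hypothetical small sizes
weights = rng.normal(size=(vocab_size, embedding_dim))   # one row (vector) per word index

# Rank-2 integer input of shape (batch_size=2, sequence_length=3)
token_ids = np.array([[1, 5, 2],
                      [7, 0, 0]])

# Lookup: each integer index is replaced by the corresponding row of `weights`.
embedded = weights[token_ids]
print(embedded.shape)     # (2, 3, 4) = (batch_size, sequence_length, embedding_dim)
```

During training, the rows of the weight matrix are updated by backpropagation like any other layer weights.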



### Visualization of Word Embeddings

Apply dimensionality reduction to the word embeddings to project them into 2D, then plot the resulting 2D vectors.

<br><br>
<center>
<img src="https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Word_Embedding.png" width="650" height="450">
</center>

Visualization in 3D:

<center>
<img src= "https://cdn.iisc.talentsprint.com/AIandMLOps/Images/M5%20AST5%20Embedding%20Layer.png" width=750px/>
</center>


### Define a NN architecture with a TextVectorization layer, an Embedding layer, and Dense layers

In [None]:
max_tokens = 20000
inputs = keras.Input(shape=(1,), dtype=tf.string)           # shape=(None,), dtype="int64"

# The TextVectorization layer
txt_vec_out = # YOUR CODE HERE to add text_vectorization() layer                # Note that this TextVec layer is already adapted on the train dataset

# The Embedding layer
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256, name='embedding')(txt_vec_out)    # input_dim is the vocabulary size, so every word index
                                                                                                    # in the input must lie between 0 and 19999 (max_tokens - 1).
# Q: What is the input to the Embedding layer?
# Q: What is the dimension of the output embeddings?
# Q: In embedding layer shape, what are None and None?

x = # YOUR CODE HERE to add a GlobalAveragePooling1D layer
x = # YOUR CODE HERE to add a Dense layer with 64 neurons
x = # YOUR CODE HERE to add a Dense layer with 32 neurons
x = # YOUR CODE HERE to add a Dropout layer
x = # YOUR CODE HERE to add a Dense layer with 16 neurons
x = # YOUR CODE HERE to add a Dropout layer
outputs = # YOUR CODE HERE to add a final Dense layer with 1 neuron, use the appropriate activation function

model = # YOUR CODE HERE to create the keras model with (inputs, outputs)

# YOUR CODE HERE to compile the model (use "rmsprop" optimizer, "binary_crossentropy" loss, "accuracy" as performance metric)

model.summary()
#Q: Weights in the embedding layer?
#Hint: Dict; 1 input word => embedding of size ___ .

### Visualize the words in 2D-plane by reducing the dimensions using PCA

Use the word embeddings from before and after model training

In [None]:
# Get the embedding layer
embedding_layer = model.get_layer('embedding')

# Get the embeddings
embeddings = embedding_layer.get_weights()[0]
embeddings.shape

In [None]:
# Get the vocabulary from the TextVectorization layer
vocab = text_vectorization.get_vocabulary()
len(vocab)

In [None]:
# Sample words to visualize word embeddings for
test_words = ['good', 'bad', 'nice', 'poor', 'terrible', 'terrific', 'awesome', 'awful', 'best', 'worst']

print(f"{'Word':<15} {'Index'}")
print("="*30)
for word in test_words:
    print(f"{word:<15} {vocab.index(word)}")

In [None]:
# Embedding dimension
embeddings[vocab.index('good')].shape

In [None]:
from sklearn.decomposition import PCA

# Create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class
# n_components in PCA specifies the no. of dimensions
pca = PCA(n_components=2, random_state=42)

# Fit and transform the vectors using PCA model
reduced_untrained_emb = pca.fit_transform(embeddings)
reduced_untrained_emb.shape

In [None]:
# Reduced embedding for word 'good'
reduced_untrained_emb[vocab.index('good')]

In [None]:
# Visualize the embeddings
plt.figure(figsize=(8, 6))
for word in test_words:
    if word != '':  # Skip the empty string token
        x, y = reduced_untrained_emb[vocab.index(word)]
        plt.scatter(x, y)
        plt.annotate(word, (x, y), xytext=(5, 2), textcoords='offset points')

plt.title("Word Embeddings Visualization (Before training)")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.tight_layout()
plt.show()

### Train the model *(Switch to GPU runtime if needed)*

In [None]:
# Fit the model on train set
callbacks = [keras.callbacks.ModelCheckpoint("one_hot_dense.keras", save_best_only=True)]

# Change target shape from (None,) to (None, 1)
train_dataset = train_ds.map(lambda x, y: (x, tf.reshape(y, (-1,1))))                # Note that we are using 'train_ds' and not 'int_train_ds'
val_dataset = # YOUR CODE HERE to update the val_ds target shape

# YOUR CODE HERE to train the model on 'train_dataset', use 'val_dataset' for validation, train it for 20 epochs, specify the callbacks list


In [None]:
## Load saved model
# model = keras.models.load_model("one_hot_dense.keras")

# Check model performance
test_dataset = test_ds.map(lambda x, y: (x, tf.reshape(y, (-1,1))))

# YOUR CODE HERE to evaluate model on 'test_dataset'

From the test accuracy above, it can be seen that the model does not perform very well. This is expected, as we are using only Dense layers.

Let's see if the embeddings learned during training were able to capture the semantic relationships between words.

In [None]:
# Get the embedding layer
trained_embedding_layer = # YOUR CODE HERE to get 'embedding' layer

# Get the embeddings
trained_embeddings = # YOUR CODE HERE to get the embedding_layer weights
trained_embeddings.shape

In [None]:
# Create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class
# n_components in PCA specifies the no.of dimensions
pca = # YOUR CODE HERE to instantiate PCA()

# Fit and transform the vectors using PCA model
reduced_trained_emb = # YOUR CODE HERE to apply pca and transform the 'trained_embeddings'
reduced_trained_emb.shape

In [None]:
# Visualize the embeddings after model training

# YOUR CODE HERE


From the above plot, it can be seen that 'good' and 'nice' are closely related, 'bad' and 'poor' are closely related, and so on.
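One way to put a number on this "relatedness", rather than reading it off the plot, is cosine similarity between word vectors. The sketch below uses hypothetical 2D toy vectors, not the embeddings learned in this notebook, and the helper name `cosine_similarity` is an assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, -1 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors standing in for learned embeddings.
good = np.array([1.0, 0.9])
nice = np.array([0.9, 1.0])
bad = np.array([-1.0, -0.8])

print("good vs nice:", cosine_similarity(good, nice))
print("good vs bad: ", cosine_similarity(good, bad))
```

With the trained model, the same function could be applied to rows such as `trained_embeddings[vocab.index('good')]` and `trained_embeddings[vocab.index('nice')]`.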