<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **Integrating Word2Vec**


Estimated time needed: **60** minutes


"The world's most valuable resource is no longer oil, but data." You hear that a lot, but did you ever wonder what it really means? Through word2vec, you'll unlock the power of words in large datasets, providing you with the tools to tackle real-world problems effectively.
 <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/data_oil.png" alt="new oil">
    


In this lab, you will be introduced to the Skip-gram and CBOW models and learn how to build and apply them for text classification in PyTorch. You'll also incorporate pretrained GloVe embeddings to enhance the models. An optional section on advanced embedding applications is available for further exploration. By the end of this lab, you'll be adept at using word embeddings for natural language processing (NLP) tasks.


## __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-required-libraries">Installing required libraries</a></li>
            <li><a href="#Importing-Required-Libraries">Importing required libraries</a></li>
        </ol>
    </li>
    <li>
        <a href="#Background">Background</a>
        <ol>
            <li><a href="#Word2Vec">Word2Vec</a></li>
            <li><a href="#GloVe-(Optional)">GloVe (Optional)</a></li>
        </ol>
    </li>
    <li><a href="#Create-and-train-word2vec-models">Create and train word2vec models</a></li>
        <ol>
            <li><a href="#Continuous-Bag-of-Words-(CBOW)">Continuous Bag of Words (CBOW)</a></li>
            <li><a href="#Skip-gram-model">Skip-gram model</a></li>
        </ol>
    <li><a href="#Applying-pretrained-word-embeddings-(optional)">Applying pretrained word embeddings (optional)</a></li>
        <ol>
            <li><a href="#Load-Stanford-GloVe-model">Load Stanford GloVe model</a></li>
            <li><a href="#Train-a-word2vec-model-from-gensim">Train a word2vec model from gensim</a></li>
        </ol>
     <li><a href="#Text-classification-using-pretrained-word-embeddings">Text classification using pretrained word embeddings</a></li>
</ol>


## Objectives

After completing this lab you will be able to:
- Comprehend word embedding with word2vec.
- Create and train basic word2vec models using CBOW and Skip-gram architectures.
- Get pretrained large embedding models and generate word embeddings with them.
- Train a word2vec model on domain-specific data.


----


## Setup


For this lab, you will be using the following libraries:

*   [`torch`](https://pytorch.org/) for building NN models and preparing the data.
*   [`numpy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for mathematical operations.
*   [`gensim`](https://pypi.org/project/gensim/) for word2vec pretrained models.
*   [`seaborn`](https://seaborn.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for visualizing the data.
*   [`matplotlib`](https://matplotlib.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for additional plotting tools.


### Installing required libraries
<h2 style="color:red;">After installing the libraries below please RESTART THE KERNEL and run all cells.</h2>


In [None]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
!mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [None]:
!pip install gensim #4.2.0
!pip install "portalocker>=2.0.0"  # quoted so the shell does not treat ">" as output redirection
!pip install -Uq torchtext
!pip install -Uq torch
!pip install -Uq torchdata


### Importing required libraries

_It is recommended that you import all required libraries in one place (here):_


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

from IPython.display import display, SVG


from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import Dataset


import logging
from gensim.models import Word2Vec
from collections import defaultdict
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.vocab import GloVe,vocab
from torchdata.datapipes.iter import IterableWrapper, Mapper
from torchtext.datasets import AG_NEWS
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
from torchtext.data.utils import get_tokenizer
from torch.utils.data import DataLoader
from tqdm import tqdm

%matplotlib inline

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')


Define a function to plot word embeddings in a 2D space.


In [None]:
def plot_embeddings(word_embeddings,vocab=vocab):

    tsne = TSNE(n_components=2, random_state=0)
    word_embeddings_2d = tsne.fit_transform(word_embeddings)

    # Plotting the results with labels from vocab
    plt.figure(figsize=(15, 15))
    for i, word in enumerate(vocab.get_itos()):  # vocab.get_itos() returns the list of words in the vocab
        plt.scatter(word_embeddings_2d[i, 0], word_embeddings_2d[i, 1])
        plt.annotate(word, (word_embeddings_2d[i, 0], word_embeddings_2d[i, 1]))

    plt.xlabel("t-SNE component 1")
    plt.ylabel("t-SNE component 2")
    plt.title("Word Embeddings visualized with t-SNE")
    plt.show()

Define a function that returns the words most similar to a specific word by calculating cosine similarity.


In [None]:
# This function returns the most similar words to a target word by calculating the cosine similarity of their word vectors
def find_similar_words(word, word_embeddings, top_k=5):
    if word not in word_embeddings:
        print("Word not found in embeddings.")
        return []

    # Get the embedding for the given word
    target_embedding = word_embeddings[word]

    # Calculate cosine similarities between the target word and all other words
    similarities = {}
    for w, embedding in word_embeddings.items():
        if w != word:
            similarity = torch.dot(target_embedding, embedding) / (
                torch.norm(target_embedding) * torch.norm(embedding)
            )
            similarities[w] = similarity.item()

    # Sort the similarities in descending order
    sorted_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

    # Return the top k similar words
    most_similar_words = [w for w, _ in sorted_similarities[:top_k]]
    return most_similar_words

Define a function that trains a word2vec model on toy data.


In [None]:
def train_model(model, dataloader, criterion, optimizer, num_epochs=1000):
    """
    Train the model for the specified number of epochs.
    
    Args:
        model: The PyTorch model to be trained.
        dataloader: DataLoader providing data for training.
        criterion: Loss function.
        optimizer: Optimizer for updating model's weights.
        num_epochs: Number of epochs to train the model for.

    Returns:
        model: The trained model.
        epoch_losses: List of average losses for each epoch.
    """
    
    # List to store the average loss for each epoch
    epoch_losses = []

    # tqdm shows a progress bar over the epochs
    for epoch in tqdm(range(num_epochs)):
        # Accumulates the loss over all batches of the current epoch
        running_loss = 0.0

        # Iterate over the batches provided by the dataloader
        for idx, samples in enumerate(dataloader):

            optimizer.zero_grad()
            
            # Check for EmbeddingBag layer in the model
            if any(isinstance(module, nn.EmbeddingBag) for _, module in model.named_modules()):
                target, context, offsets = samples
                predicted = model(context, offsets)
            
            # Check for Embedding layer in the model
            elif any(isinstance(module, nn.Embedding) for _, module in model.named_modules()):
                target, context = samples
                predicted = model(context)
                
            loss = criterion(predicted, target)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
            optimizer.step()
            running_loss += loss.item()

        # Append average loss for the epoch
        epoch_losses.append(running_loss / len(dataloader))
    
    return model, epoch_losses

# Background

## Word2Vec

Word2Vec is a family of methods that transform words into numeric vectors, positioning similar words close together in the space defined by these numbers. This way, you can quantify and analyze word relationships mathematically. For instance, words like "cat" and "kitten" or "cat" and "dog" have vectors that are close to each other, while a word like "book" is positioned further away in this vector space.


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/Words.png" alt="Word2Vec example" class="bg-primary" width="400px">
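To make "close together" concrete, here is a minimal sketch that measures closeness with cosine similarity, the same measure used by the `find_similar_words` helper. The three-dimensional vectors below are hypothetical values chosen by hand for illustration; they are not learned embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", picked by hand for illustration only
toy_vectors = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.90, 0.15],
    "book":   [0.10, 0.20, 0.95],
}

cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
cat_book = cosine_similarity(toy_vectors["cat"], toy_vectors["book"])

print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs book:   {cat_book:.3f}")
```

With these made-up values, "cat" and "kitten" point in nearly the same direction, so their similarity is much higher than that of "cat" and "book".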



In this lab session, you'll concentrate on the Skip-gram and Continuous Bag of Words (CBOW) models, which are essential precursors to grasping the principles of generative modeling. Additionally, an **optional** summary of the GloVe model is provided to enhance your understanding of its application in natural language processing.

## GloVe (Optional)



GloVe is another popular algorithm for learning word embeddings. It stands for Global Vectors for Word Representation. Unlike word2vec, which is based on predicting context/target words, GloVe focuses on capturing the global word co-occurrence statistics from the entire corpus. It constructs a co-occurrence matrix that represents how often words appear together in the text. The matrix is then factorized to obtain the word embeddings. For example, if "man" and "king" co-occur many times, their vectors will be similar.

The GloVe model follows a fundamental approach by constructing a large word-context co-occurrence matrix that contains pairs of (word, context). Each entry in this matrix represents the frequency of a word occurring within a given context, which can be a sequence of words. The objective of the model is to utilize matrix factorization techniques to approximate this co-occurrence matrix. The process, illustrated in the diagram further below, has three steps:

1. Create a word-context co-occurrence matrix: The model begins by generating a matrix that captures the co-occurrence information of words and their surrounding contexts. Each element in the matrix represents how often a specific word and context pair co-occur in the training data.

2. Apply matrix factorization: Next, the GloVe model applies matrix factorization methods to approximate the word-context co-occurrence matrix. The goal is to decompose the original matrix into lower-dimensional representations that capture the semantic relationships between words and contexts.

3. Obtain word and context embeddings: By factorizing the co-occurrence matrix, the model obtains word and context embeddings. These embeddings are numerical representations that encode the semantic meaning and relationships of words and contexts.

To accomplish this, you usually begin by initializing WF (Word-Feature matrix) and FC (Feature-Context matrix) with random weights. You then multiply these matrices to obtain WC' (an approximation of WC) and assess its similarity to WC. This process is repeated multiple times using Stochastic Gradient Descent (SGD) to minimize the error (WC' - WC).

Once the training is complete, the resulting Word-Feature matrix (WF) provides you with word embeddings or vector representations for each word (the green vector in the diagram). The dimensionality of the embedding vectors can be predetermined by setting the value of F to a specific number of dimensions, allowing for a compact representation of the word semantics.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/matrix%20fact.png" alt="Co-occurence matrix" class="bg-primary" width="600px">
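The three steps above can be sketched in plain Python. This is a simplified illustration of the description in this section, not the actual GloVe implementation (the published GloVe objective is a weighted least-squares fit to log co-occurrence counts). The tiny corpus, the window of one word on each side, the feature count `F = 2`, and the learning rate are all assumptions made for the example.

```python
import random

random.seed(0)

# Tiny hypothetical corpus; the context is one word on each side
corpus = "the cat sat on the mat the dog sat on the rug".split()
words = sorted(set(corpus))
index = {w: i for i, w in enumerate(words)}
V, F = len(words), 2  # vocabulary size and number of features

# Step 1: build the word-context co-occurrence matrix WC
WC = [[0.0] * V for _ in range(V)]
for pos, w in enumerate(corpus):
    for ctx_pos in (pos - 1, pos + 1):
        if 0 <= ctx_pos < len(corpus):
            WC[index[w]][index[corpus[ctx_pos]]] += 1.0

# Step 2: initialize WF (V x F) and FC (F x V) with random weights
WF = [[random.uniform(-0.5, 0.5) for _ in range(F)] for _ in range(V)]
FC = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(F)]

def squared_error():
    """Sum of squared differences between WC' = WF x FC and WC."""
    total = 0.0
    for i in range(V):
        for j in range(V):
            approx = sum(WF[i][f] * FC[f][j] for f in range(F))
            total += (approx - WC[i][j]) ** 2
    return total

initial_error = squared_error()

# Step 3: SGD, one matrix entry at a time, to shrink the error (WC' - WC)
lr = 0.02
for epoch in range(500):
    for i in range(V):
        for j in range(V):
            approx = sum(WF[i][f] * FC[f][j] for f in range(F))
            err = approx - WC[i][j]
            for f in range(F):
                wf, fc = WF[i][f], FC[f][j]
                WF[i][f] -= lr * err * fc
                FC[f][j] -= lr * err * wf

final_error = squared_error()
print("error before training:", round(initial_error, 3))
print("error after training: ", round(final_error, 3))
print("embedding for 'cat':", WF[index["cat"]])
```

After training, each row of `WF` is the F-dimensional vector for one word, which corresponds to the Word-Feature matrix described above.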

The key advantage of GloVe is that it can incorporate both global statistics and local context information. This results in word embeddings that not only capture the semantic relationships between words but also preserve certain syntactic relationships.


# Create and train word2vec models
Now it's time to get your hands dirty. Let's start with a very simple implementation of word2vec to train it on a toy dataset:


In [None]:
toy_data = """I wish I was little bit taller
I wish I was a baller
She wore a small black dress to the party
The dog chased a big red ball in the park
He had a huge smile on his face when he won the race
The tiny kitten played with a fluffy toy mouse
The team celebrated their victory with a grand parade
She bought a small, delicate necklace for her sister
The mountain peak stood majestic and tall against the clear blue sky
The toddler took small, careful steps as she learned to walk
The house had a spacious backyard with a big swimming pool
He felt a sense of accomplishment after completing the challenging puzzle
The chef prepared a delicious, flavorful dish using fresh ingredients
The children played happily in the small, cozy room
The book had an enormous impact on readers around the world
The wind blew gently, rustling the leaves of the tall trees
She painted a beautiful, intricate design on the small canvas
The concert hall was filled with thousands of excited fans
The garden was adorned with colorful flowers of all sizes
I hope to achieve great success in my chosen career path
The skyscraper towered above the city, casting a long shadow
He gazed in awe at the breathtaking view from the mountaintop
The artist created a stunning masterpiece with bold brushstrokes
The baby took her first steps, a small milestone that brought joy to her parents
The team put in a tremendous amount of effort to win the championship
The sun set behind the horizon, painting the sky in vibrant colors
The professor gave a fascinating lecture on the history of ancient civilizations
The house was filled with laughter and the sound of children playing
She received a warm, enthusiastic welcome from the audience
The marathon runner had incredible endurance and determination
The child's eyes sparkled with excitement upon opening the gift
The ship sailed across the vast ocean, guided by the stars
The company achieved remarkable growth in a short period of time
The team worked together harmoniously to complete the project
The puppy wagged its tail, expressing its happiness and affection
She wore a stunning gown that made her feel like a princess
The building had a grand entrance with towering columns
The concert was a roaring success, with the crowd cheering and clapping
The baby took a tiny bite of the sweet, juicy fruit
The athlete broke a new record, achieving a significant milestone in her career
The sculpture was a masterpiece of intricate details and craftsmanship
The forest was filled with towering trees, creating a sense of serenity
The children built a small sandcastle on the beach, their imaginations running wild
The mountain range stretched as far as the eye could see, majestic and awe-inspiring
The artist's brush glided smoothly across the canvas, creating a beautiful painting
She received a small token of appreciation for her hard work and dedication
The orchestra played a magnificent symphony that moved the audience to tears
The flower bloomed in vibrant colors, attracting butterflies and bees
The team celebrated their victory with a big, extravagant party
The child's laughter echoed through the small room, filling it with joy
The sunflower stood tall, reaching for the sky with its bright yellow petals
The city skyline was dominated by tall buildings and skyscrapers
The cake was adorned with a beautiful, elaborate design for the special occasion
The storm brought heavy rain and strong winds, causing widespread damage
The small boat sailed peacefully on the calm, glassy lake
The artist used bold strokes of color to create a striking and vivid painting
The couple shared a passionate kiss under the starry night sky
The mountain climber reached the summit after a long and arduous journey
The child's eyes widened in amazement as the magician performed his tricks
The garden was filled with the sweet fragrance of blooming flowers
The basketball player made a big jump and scored a spectacular slam dunk
The cat pounced on a small mouse, displaying its hunting instincts
The mansion had a grand entrance with a sweeping staircase and chandeliers
The raindrops fell gently, creating a rhythmic patter on the roof
The baby took a big step forward, encouraged by her parents' applause
The actor delivered a powerful and emotional performance on stage
The butterfly fluttered its delicate wings, mesmerizing those who watched
The company launched a small-scale advertising campaign to test the market
The building was constructed with strong, sturdy materials to withstand earthquakes
The singer's voice was powerful and resonated throughout the concert hall
The child built a massive sandcastle with towers, moats, and bridges
The garden was teeming with a variety of small insects and buzzing bees
The athlete's muscles were well-developed and strong from years of training
The sun cast long shadows as it set behind the mountains
The couple exchanged heartfelt vows in a beautiful, intimate ceremony
The dog wagged its tail vigorously, a sign of excitement and happiness
The baby let out a tiny giggle, bringing joy to everyone around"""


Next, you'll prepare data by tokenizing it and creating a vocabulary from data.


In [None]:

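# Step 1: Get tokenizer
tokenizer = get_tokenizer('basic_english')  # This uses the basic English tokenizer. You can choose another.

# Step 2: Define a generator that tokenizes each item of an iterable
def tokenize_data(sentences):
    for sentence in sentences:
        yield tokenizer(sentence)

# Step 3: Tokenize the whole toy corpus into one flat list of tokens
tokenized_toy_data = tokenizer(toy_data)

# Step 4: Build the vocabulary. Each item passed to tokenize_data here is already
# a single token, so each yielded list holds just that one token.
vocab = build_vocab_from_iterator(tokenize_data(tokenized_toy_data), specials=['<unk>'])
vocab.set_default_index(vocab["<unk>"])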


Let's check what a sentence looks like after tokenization and numericalization:


In [None]:
# Test
sample_sentence = "I wish I was a baller"
tokenized_sample = tokenizer(sample_sentence)
encoded_sample = [vocab[token] for token in tokenized_sample]
print("Encoded sample:", encoded_sample)

You can write a function to apply numericalization to all tokens:


In [None]:
text_pipeline = lambda tokens:[ vocab[token]  for token in tokens]
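To see what numericalization does, including how an out-of-vocabulary token falls back to the `<unk>` index, here is a minimal plain-Python sketch. The helpers `build_toy_vocab` and `numericalize` are hypothetical names introduced for illustration; they mimic the idea of a vocabulary with a default index and are not torchtext's implementation or index ordering.

```python
from collections import Counter

def build_toy_vocab(tokens, unk_token="<unk>"):
    """Map each token to an integer; index 0 is reserved for unknown tokens."""
    counts = Counter(tokens)
    # Order by descending frequency, then alphabetically (an arbitrary choice for this sketch)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [unk_token] + [word for word, _ in ordered]
    stoi = {word: i for i, word in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Replace each token with its index, falling back to the <unk> index."""
    return [stoi.get(token, stoi[unk_token]) for token in tokens]

tokens = "i wish i was a baller".split()
stoi, itos = build_toy_vocab(tokens)

# "were" never appeared in the data used to build the vocabulary
encoded = numericalize(["i", "wish", "i", "were", "a", "baller"], stoi)
print(itos)
print(encoded)
```

The unseen word "were" is mapped to index 0, which plays the same role as `vocab.set_default_index(vocab["<unk>"])` in the cell above.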

Let's delve into two main architectures for training word2vec embeddings:


## Continuous Bag of Words (CBOW)

For the Continuous Bag of Words (CBOW) model, use a "context" to predict a target word. The "context" is typically a set of surrounding words. For example, if your context window is of size 2, then you take two words before and two words after the target word as context. The target word is in red and the context is in blue:


<table border="1">
    <tr>
        <th>Time Step</th>
        <th>Phrase</th>
    </tr>
    <tr>
        <td>1</td>
        <td><span style="color:blue;">I wish</span> <span style="color:red;">I</span> <span style="color:blue;">was  little </span></td>
    </tr>
    <tr>
        <td>2</td>
        <td><span style="color:blue;">wish I</span> <span style="color:red;">was</span> <span style="color:blue;">little bit </span></td>
    </tr>
    <tr>
        <td>3</td>
        <td><span style="color:blue;">I was</span> <span style="color:red;">little</span> <span style="color:blue;">  bit taller</span></td>
    </tr>
    <tr>
        <td>4</td>
        <td><span style="color:blue;">was little</span> <span style="color:red;">bit</span> <span style="color:blue;"> taller I</span></td>
    </tr>
    <tr>
        <td>5</td>
        <td><span style="color:blue;">little bit</span> <span style="color:red;">taller</span> <span style="color:blue;"> I wish</span></td>
    </tr>
    <tr>
        <td>6</td>
        <td><span style="color:blue;">bit taller</span> <span style="color:red;">I</span> <span style="color:blue;">wish I</span></td>
    </tr>
    <tr>
        <td>7</td>
        <td><span style="color:blue;">taller I</span> <span style="color:red;">wish</span> <span style="color:blue;">I was</span></td>
    </tr>
    <tr>
        <td>8</td>
        <td><span style="color:blue;">I wish</span> <span style="color:red;">I</span> <span style="color:blue;">was a</span></td>
    </tr>
    <tr>
        <td>9</td>
        <td><span style="color:blue;">wish I</span> <span style="color:red;">was</span> <span style="color:blue;">a baller</span></td>
    </tr>
</table>



You can slide over the sequence and create training data:


In [None]:
CONTEXT_SIZE=2

cobow_data = []
# Start at CONTEXT_SIZE so that the index i - j - 1 never becomes negative
# (a negative index would silently wrap around to the end of the token list)
for i in range(CONTEXT_SIZE, len(tokenized_toy_data) - CONTEXT_SIZE):
    context = (
        [tokenized_toy_data[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [tokenized_toy_data[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = tokenized_toy_data[i]
    cobow_data.append((context, target))
    


You can print the first sample, showcasing both the context words ['wish', 'i', 'was', 'little'] and the target word 'i':
<table border="1">
    <tr>
        <th>Time Step</th>
        <th>Phrase</th>
    </tr>
    <tr>
        <td>1</td>
        <td><span style="color:blue;">I wish</span> <span style="color:red;">I</span> <span style="color:blue;">was  little </span></td>
    </tr>
</table>


In [None]:
print(cobow_data[0])

You can print the next sample, showcasing both the context words ['i', 'wish', 'little', 'bit'] and the target word 'was':
<table border="1">
    <tr>
        <th>Time Step</th>
        <th>Phrase</th>
    </tr>
    <tr>
        <td>2</td>
        <td><span style="color:blue;">wish I</span> <span style="color:red;">was</span> <span style="color:blue;"> little bit</span></td>
    </tr>
</table>


In [None]:
print(cobow_data[1])

You want to find the embeddings that guide the model to predict the following probabilities. $P(w_t| w_{t-2},w_{t-1},w_{t+1},w_{t+2})$ is the probability of $w_t$, the target word, conditioned on the occurrence of the context words $w_{t-2},w_{t-1},w_{t+1},w_{t+2}$.


<table border="1">
    <thead>
        <tr>
            <th>\( P(w_t| w_{t-2},w_{t-1},w_{t+1},w_{t+2}) \)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>\( P(w_1 | w_{-1},w_0,w_2,w_3) = P(I | \text{I wish, was little}) \)</td>
        </tr>
        <tr>
            <td>\( P(w_2 | w_0,w_1,w_3,w_4) = P(was | \text{wish I, little bit}) \)</td>
        </tr>
        <tr>
            <td>\( P(w_3 | w_1,w_2,w_4,w_5) = P(little | \text{I was, bit taller}) \)</td>
        </tr>
        <tr>
            <td>\( P(w_4 | w_2,w_3,w_5,w_6) = P(bit | \text{was little, taller I}) \)</td>
        </tr>
        <tr>
            <td>\( P(w_5 | w_3,w_4,w_6,w_7) = P(taller | \text{little bit, I wish}) \)</td>
        </tr>
        <tr>
            <td>\( P(w_6 | w_4,w_5,w_7,w_8) = P(I | \text{bit taller, wish I}) \)</td>
        </tr>
        <tr>
            <td>\( P(w_7 | w_5,w_6,w_8,w_9) = P(wish | \text{taller I, I was}) \)</td>
        </tr>
    </tbody>
</table>
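
The probabilities in the table connect to training as follows. This is a sketch under two assumptions: the model outputs one score (logit) $z_k$ per vocabulary word $k$, and the loss is the cross-entropy criterion used later in this lab. The symbols $z_k$, $V$ (vocabulary size), and $N$ (number of training samples) are notation introduced here for illustration.

$$
P(w_t = k \mid w_{t-2},w_{t-1},w_{t+1},w_{t+2}) = \frac{\exp(z_k)}{\sum_{j=1}^{V} \exp(z_j)}
$$

$$
\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_{t-2},w_{t-1},w_{t+1},w_{t+2})
$$

Minimizing $\mathcal{L}$ raises the probability the model assigns to each observed target word given its context, and the embeddings are adjusted as a side effect of that optimization.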


The collate_batch function processes batches of data, converting context and target text data into numerical format using a vocabulary and arranging them for model training.


In [None]:
def collate_batch(batch):
    target_list, context_list, offsets = [], [], [0]
    for _context, _target in batch:
        
        target_list.append(vocab[_target])  
        processed_context = torch.tensor(text_pipeline(_context), dtype=torch.int64)
        context_list.append(processed_context)
        offsets.append(processed_context.size(0))
    target_list = torch.tensor(target_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    context_list = torch.cat(context_list)
    return target_list.to(device), context_list.to(device), offsets.to(device)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

Select the first 10 samples from the `cobow_data` list and process them using the `collate_batch` function. The outputs are the tokenized target words (`target_list`), the surrounding context words (`context_list`), and the respective offsets for each sample (`offsets`).


In [None]:
target_list, context_list, offsets=collate_batch(cobow_data[0:10])
print(f"target_list(Tokenized target words): {target_list} , context_list(Surrounding context words): {context_list} , offsets(Starting indexes of context words for each target): {offsets} ")
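To clarify what the offsets mean, here is a small plain-Python sketch of the same bookkeeping that `collate_batch` performs with tensors. The index values in `contexts` are hypothetical; only the lengths matter for the offsets.

```python
from itertools import accumulate

# Hypothetical batch: each inner list holds the context indices for one target word
contexts = [[4, 7, 2, 9], [7, 4, 9, 5], [4, 2, 5, 3]]

# All context indices are concatenated into one flat sequence
flat = [idx for context in contexts for idx in context]

# Offsets are the cumulative lengths, shifted so that the first bag starts at 0
lengths = [len(context) for context in contexts]
offsets = [0] + list(accumulate(lengths))[:-1]

# Each bag can be recovered from the flat sequence using consecutive offsets
ends = offsets[1:] + [len(flat)]
bags = [flat[start:end] for start, end in zip(offsets, ends)]

print("flat:   ", flat)
print("offsets:", offsets)
print("bags:   ", bags)
```

Each offset marks where one sample's context words begin in the flat sequence, which is how a layer such as `nn.EmbeddingBag` knows which indices to pool together.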


Create a `DataLoader` object:


In [None]:
BATCH_SIZE = 64  # batch size for training

dataloader_cbow = DataLoader(
    cobow_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
print(dataloader_cbow) 

The CBOW model shown here starts with an `EmbeddingBag` layer, which takes a variable-length list of context word indices and produces an averaged embedding of size `embed_dim`. This embedding is then passed through a linear layer that reduces its dimension to `embed_dim//2`. After a ReLU activation, another linear layer maps the output to the vocabulary size, so the model produces a score for every word in the vocabulary as a candidate target word. The overall flow moves from the indices of the context words to a prediction of the central word in the Continuous Bag of Words approach.


In [None]:
class CBOW(nn.Module):
    # Initialize the CBOW model
    def __init__(self, vocab_size, embed_dim, num_class):
        
        super(CBOW, self).__init__()
        # Define the embedding layer using nn.EmbeddingBag
        # It outputs the average of context words embeddings
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        # Define the first linear layer with input size embed_dim and output size embed_dim//2
        self.linear1 = nn.Linear(embed_dim, embed_dim//2)
        # Define the fully connected layer with input size embed_dim//2 and output size vocab_size
        self.fc = nn.Linear(embed_dim//2, vocab_size)
        

        self.init_weights()
    # Initialize the weights of the model's parameters
    def init_weights(self):
        # Initialize the weights of the embedding layer
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        # Initialize the weights of the fully connected layer
        self.fc.weight.data.uniform_(-initrange, initrange)
        # Initialize the biases of the fully connected layer to zeros
        self.fc.bias.data.zero_()
        

    def forward(self, text, offsets):
        # Pass the input text and offsets through the embedding layer
        out = self.embedding(text, offsets)
        # Apply the ReLU activation function to the output of the first linear layer
        out = torch.relu(self.linear1(out))
        # Pass the output of the ReLU activation through the fully connected layer
        return self.fc(out)
        

Create an instance of the CBOW model:


In [None]:
vocab_size = len(vocab)
emsize = 24
model_cbow = CBOW(vocab_size, emsize, vocab_size).to(device)

Define the loss function, optimizer, and scheduler for training:


In [None]:

LR = 5  # learning rate

# Define the CrossEntropyLoss criterion. It is commonly used for multi-class classification tasks.
# This criterion combines the softmax function and the negative log-likelihood loss.
criterion = torch.nn.CrossEntropyLoss()

# Define the optimizer using stochastic gradient descent (SGD).
# It optimizes the parameters of the model_cbow, which are obtained by model_cbow.parameters().
# The learning rate (lr) determines the step size for parameter updates during optimization.
optimizer = torch.optim.SGD(model_cbow.parameters(), lr=LR)

# Define a learning rate scheduler.
# The StepLR scheduler adjusts the learning rate during training.
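# It multiplies the learning rate by gamma (here, 0.1) every step_size epochs (here, 1.0).
# The schedule only takes effect when scheduler.step() is called; train_model above does not call it.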
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
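
To illustrate the schedule that a StepLR-style scheduler describes, here is a minimal plain-Python sketch of the decay rule. The function `step_lr` is a hypothetical helper written for this example, not part of PyTorch; `epoch` stands for the number of times the scheduler has been stepped.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps: decay by gamma every step_size steps."""
    return initial_lr * gamma ** (epoch // step_size)

# Same settings as in the cell above: LR = 5, step_size = 1, gamma = 0.1
for epoch in range(4):
    print(epoch, step_lr(initial_lr=5, gamma=0.1, step_size=1, epoch=epoch))
```

With `gamma=0.1` and `step_size=1`, the learning rate would shrink tenfold at every step, so this schedule only makes sense if the scheduler is stepped sparingly.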


Let's train the model:


In [None]:
model_cbow, epoch_losses=train_model(model_cbow, dataloader_cbow, criterion, optimizer, num_epochs=400)

Now, you can plot the loss values over the course of training:


In [None]:
plt.plot(epoch_losses)
plt.xlabel("epochs")


The weights of the model's embedding layer are the actual word embeddings. You can load them into a NumPy array:


In [None]:
word_embeddings = model_cbow.embedding.weight.detach().cpu().numpy() 

Now, check the embedded vector for a sample word. Notice that the length of this vector equals the `emsize = 24` that you defined earlier.


In [None]:
word = 'baller'
word_index = vocab.get_stoi()[word] # getting the index of the word in the vocab
print(word_embeddings[word_index])

Now you can check whether the embeddings capture the similarities among words. For visualization, you need to reduce the dimensionality of the word embeddings, mapping the embedding space to a 2D space. You can do this with t-SNE, using the `plot_embeddings` function defined earlier.


In [None]:
plot_embeddings(word_embeddings,vocab=vocab)

Upon examining the t-SNE projections, you can see that, even with the inevitable information loss from dimensionality reduction and the limitations of a small dataset, words with similar meanings tend to cluster together. For instance, words such as 'bright' and 'shadow' are in proximity near the point (-15, -15) on components 1 and 2. Likewise, 'dog', 'cat', and 'mouse' are grouped around the (5, 5) coordinate, and 'sailed' as well as 'wind' can be found close to the (5, -8) point. The exact coordinates may differ in your run, because the model weights are initialized randomly.


## Skip-gram model

The Skip-gram model is one of the two main architectures used in word2vec, a popular technique for learning word embeddings. In the Skip-gram model, the goal is to predict the surrounding words (context) given a central word (target). The main idea behind this model is that words that appear in similar contexts tend to have similar meanings.  Consider this example:

**I was little bit taller,**

Assuming a context window size of 2, the words in red represent the context, while the word highlighted in blue signifies the target:

**<span style="color:red;"> I was</span> <span style="color:blue;">little</span> <span style="color:red;">bit taller, </span>**

### Window the Skip-gram:

Training the Skip-gram model on the full context at once can be computationally expensive, because it involves predicting a probability for every word in the vocabulary at each position in the context, whereas CBOW predicts a probability for every word in the vocabulary for the target word only. To mitigate this, several approximation techniques are employed.

One common approximation technique is to break the full context into smaller parts and predict them one at a time. This not only simplifies the prediction task but also helps in better training as it provides multiple training examples from a single context-target pair.

### Using the first row of the table as an example:


For the example above the target word is **"little"**. The full context for this target word is:
**I was bit taller**

In the approximation, instead of predicting the full context at once, you break it down into single words. There are four approximations in this example:
1. Approximation 1: **I**
2. Approximation 2: **was**
3. Approximation 3: **bit**
4. Approximation 4: **taller**

For each approximation, the Skip-gram model aims to predict just that part of the context from the target word "little". This means, for the first approximation, the model will try to predict "I" given only the word "little". For the second approximation, it will try to predict "was" given "little", and so on.

In conclusion, the Skip-gram model aims to understand word relationships by predicting the context from a given target word. Approximation techniques, like the one illustrated, help simplify the training process and make it more efficient.


<table border="1">
    <tr>
        <th>Full Context with Target</th>
        <th>Target Word</th>
        <th>Original Target Context</th>
        <th>Approximation 1</th>
        <th>Approximation 2</th>
        <th>Approximation 3</th>
        <th>Approximation 4</th>
    </tr>
    <tr>
        <td><span style="color:red;"> I was</span> <span style="color:blue;">little</span> <span style="color:red;">bit taller, </span></td>
        <td>little</td>
        <td> I was bit taller,</td>
        <td>I</td>
        <td>was</td>
        <td>bit</td>
        <td>taller,</td>
    </tr>
    <tr>
        <td><span style="color:red;"> was little</span> <span style="color:blue;">bit</span> <span style="color:red;">taller, I </span></td>
        <td>bit</td>
        <td>was little taller, I</td>
        <td>was</td>
        <td>little</td>
        <td>taller,</td>
        <td>I</td>
    </tr>
    <tr>
        <td><span style="color:red;">little bit</span> <span style="color:blue;">taller,</span> <span style="color:red;">I wish </span></td>
        <td>taller,</td>
        <td>little bit I wish</td>
        <td>little</td>
        <td>bit</td>
        <td>I</td>
        <td>wish</td>
    </tr>
    <tr>
        <td><span style="color:red;"> bit taller,</span> <span style="color:blue;">I</span> <span style="color:red;">wish I </span></td>
        <td>I</td>
        <td>bit taller, wish I</td>
        <td>bit</td>
        <td>taller,</td>
        <td>wish</td>
        <td>I</td>
    </tr>
    <tr>
        <td><span style="color:red;"> taller, I</span> <span style="color:blue;">wish</span> <span style="color:red;">I was </span></td>
        <td>wish</td>
        <td>taller, I I was</td>
        <td>taller,</td>
        <td>I</td>
        <td>I</td>
        <td>was</td>
    </tr>
    <!-- More rows can be added in a similar pattern for other words in the phrase. -->
</table>
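
The rows of the table above can be generated programmatically. The following is a minimal sketch, assuming simple whitespace tokenization and a hypothetical helper name `skipgram_pairs` (neither comes from the lab's helper functions); it emits one (target, context) pair per approximation.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every position that has a full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        target = tokens[i]
        # Offsets -window..-1 and 1..window, skipping 0 (the target itself)
        for j in range(-window, window + 1):
            if j != 0:
                pairs.append((target, tokens[i + j]))
    return pairs


tokens = "I was little bit taller, I wish I was".split()
pairs = skipgram_pairs(tokens, window=2)

# The first four pairs correspond to the first row of the table
print(pairs[:4])
```

Each table row contributes four pairs, so the five rows shown yield 20 training examples from a nine-word phrase.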


The goal is to optimize the conditional probabilities in order to obtain high-quality word embeddings. The only difference from continuous bag of words is the structure of the conditional probabilities: Skip-gram models $P(w_{t+j} \mid w_{t})$ for each offset $j=-2,-1,1,2$ within your window.


<table border="1">
    <tr>
        <th>j</th>
        <th>Target Word t=3 </th>
        <th>Context Word</th>
        <th>Probability</th>
    </tr>
    <tr>
         <th>-2</th>
        <td>little</td>
        <td>I</td>
        <td> P(I | little) </td>
    </tr>
    <tr>
          <th>-1</th>
        <td>little</td>
        <td>was</td>
        <td> P(was | little)</td>
    </tr>
    <tr>
         <th>1</th>
        <td>little</td>
        <td>bit</td>
        <td>P(bit | little)</td>
    </tr>
    <tr>
         <th>2</th>
        <td>little</td>
        <td>taller,</td>
        <td>P(taller | little) </td>
    </tr>
    <!-- Repeat rows for each context word for each target word -->
</table>
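
For reference, a common way to parameterize these probabilities is the standard softmax formulation of Skip-gram. It is stated here as background rather than as the exact model built in this lab, and the symbols are assumptions introduced for illustration: $v_w$ is the input vector and $u_w$ the output vector of word $w$, $V$ is the vocabulary, and $T$ is the number of target positions.

$$
P(w_{t+j} \mid w_t) = \frac{\exp\left(u_{w_{t+j}}^\top v_{w_t}\right)}{\sum_{w \in V} \exp\left(u_w^\top v_{w_t}\right)}
$$

Training then minimizes the average negative log-likelihood over all target positions and offsets:

$$
\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-2 \le j \le 2 \\ j \ne 0}} \log P(w_{t+j} \mid w_t)
$$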





Note that the terminology here reverses the usual convention in conditional probability: the variable being predicted is normally called the target, whereas in Skip-gram the "target" word $w_t$ is the word you condition on, and the context word $w_{t+j}$ is the one being predicted.



<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/targetvsdepenet.gif" alt="image">




This code constructs a Skip-gram dataset from the tokenized toy data: for each word (the target), it gathers the surrounding words within a window (the context) whose size is set by `CONTEXT_SIZE`.


In [None]:
# Define the window size for the context around the target word.
CONTEXT_SIZE = 2

# Initialize an empty list to store the (target, context) pairs.
skip_data = []

# Iterate over each word in the tokenized toy_data, while excluding the first 
# and last few words determined by the CONTEXT_SIZE.
for i in range(CONTEXT_SIZE, len(tokenized_toy_data) - CONTEXT_SIZE):

    # For a word at position i, the context comprises of words from the preceding CONTEXT_SIZE
    # as well as from the succeeding CONTEXT_SIZE. The context words are collected in a list.
    context = (
        [tokenized_toy_data[i - j - 1] for j in range(CONTEXT_SIZE)]  # Preceding words
        + [tokenized_toy_data[i + j + 1] for j in range(CONTEXT_SIZE)]  # Succeeding words
    )

    # The word at the current position i is taken as the target.
    target = tokenized_toy_data[i]

    # Append the (target, context) pair to the skip_data list.
    skip_data.append((target, context))


Next, window the Skip-gram data by splitting each (target, context list) sample into one pair per context word:


In [None]:
skip_data_=[[(sample[0],word) for word in  sample[1]] for sample in skip_data]

You will have pairs of (target, context) words:


In [None]:
skip_data_flat= [item  for items in  skip_data_ for item in items]
skip_data_flat[8:28]

Creating a collate function to numericalize (target, context) pairs:


In [None]:
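def collate_fn(batch):
    target_list, context_list = [], []
    # Each item in skip_data_flat is a (center word, surrounding word) pair.
    # Note the naming: `_context` receives the center word and `_target` the
    # surrounding word, so `target_list` holds the surrounding-word indices
    # and `context_list` holds the center-word indices.
    for _context, _target in batch:
        target_list.append(vocab[_target])
        context_list.append(vocab[_context])

    target_list = torch.tensor(target_list, dtype=torch.int64)
    context_list = torch.tensor(context_list, dtype=torch.int64)
    return target_list.to(device), context_list.to(device)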

In [None]:
dataloader = DataLoader(skip_data_flat, batch_size=BATCH_SIZE, collate_fn=collate_fn)

Let's check a sample batch of target,context after collation:


In [None]:
next(iter(dataloader))

Here, you will define the Skip-gram network.
The embeddings layer is defined using `nn.Embedding`, which creates word embeddings for the given vocabulary size and embedding dimension.
The fc layer is a fully connected layer with input dimension `embed_dim` and output dimension `vocab_size`.

In the forward method, the input text is passed through the embeddings layer to obtain the word embeddings. The ReLU activation function is applied to these embeddings, and the result is then passed through the fc layer.
The final output, one score per vocabulary word, is returned.


In [None]:
class SkipGram_Model(nn.Module):

    def __init__(self, vocab_size, embed_dim):
        super(SkipGram_Model, self).__init__()
        # Define the embeddings layer
        self.embeddings = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embed_dim
        )
        
        # Define the fully connected layer
        self.fc = nn.Linear(in_features=embed_dim, out_features=vocab_size)

    def forward(self, text):
        # Look up the embedding of each input word
        out = self.embeddings(text)

        # Apply the ReLU activation, then the fully connected layer
        # to get one score (logit) per vocabulary word
        out = torch.relu(out)
        out = self.fc(out)
        
        return out
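
To sanity-check this kind of architecture, it helps to trace tensor shapes on a tiny example. The sketch below rebuilds the same three steps (embedding lookup, ReLU, linear layer) with made-up sizes and word indices; the variable names `center` and `surrounding` are hypothetical and are not part of the lab code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_dim = 10, 4            # hypothetical toy sizes
embeddings = nn.Embedding(vocab_size, embed_dim)
fc = nn.Linear(embed_dim, vocab_size)

center = torch.tensor([3, 3, 7])         # indices of the input (center) words
surrounding = torch.tensor([1, 5, 2])    # indices of the words to predict

# Same order of operations as the forward method: embed -> ReLU -> linear
logits = fc(torch.relu(embeddings(center)))

# CrossEntropyLoss applies log-softmax internally, so it takes raw logits
loss = nn.CrossEntropyLoss()(logits, surrounding)
probs = torch.softmax(logits, dim=1)

print(logits.shape, loss.item())
```

Each row of `probs` is a distribution over the whole vocabulary, which is why the output dimension of the linear layer must equal `vocab_size`.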

Creating an instance of the model:


In [None]:
emsize = 24
model_sg = SkipGram_Model(vocab_size, emsize).to(device)

Now you are going to train the model on toy data:


In [None]:
LR = 5  # learning rate
#BATCH_SIZE = 64  # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_sg.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)

In [None]:
model_sg, epoch_losses=train_model(model_sg, dataloader, criterion, optimizer, num_epochs=400)

In [None]:
plt.plot(epoch_losses)

You can also plot the word embedding by reducing the dimensions using t-SNE:


In [None]:
word_embeddings = model_sg.embeddings.weight.detach().cpu().numpy() 
plot_embeddings(word_embeddings,vocab=vocab)


When selecting CBOW or Skip-Gram, the best approach often depends on the specifics of your task and data. If your dataset is small but you need to have a good representation of rarer words, Skip-gram might be the better choice. If the computational efficiency is more critical and the rare words are less of a concern, CBOW might be adequate. It's also worth noting that for very small datasets, the benefits of neural word embeddings might be limited, and simpler methods or leveraging pretrained embeddings might be more effective.



# Applying pretrained word embeddings (optional)
## Load Stanford GloVe model

Transfer learning, particularly through the use of pretrained word embeddings, serves as a cornerstone in modern NLP. This approach leverages knowledge gleaned from one task, typically learned over massive datasets, and applies it to another, often more specialized task. The primary advantage of this is twofold: it bypasses the need for enormous computational resources to learn from scratch, and it injects a base layer of linguistic understanding into the model. By using embeddings that have already captured complex language patterns and associations, even models with limited exposure to domain-specific data can exhibit remarkably sophisticated behavior, making transfer learning a strategic shortcut to enhanced performance in NLP.


Let's take a look at the pretrained GloVe model from Stanford:


You can specify the model name and embedding dimension: `GloVe(name='GloVe_model_name', dim=300)`


In [None]:
# creating an instance of the 6B version of Glove() model
glove_vectors_6B = GloVe(name ='6B') # you can specify the model with the following format: GloVe(name='840B', dim=300)

In [None]:
# creating another instance of a bigger Glove() model
#glove_vectors_840B = GloVe()

Continue with the 6B model, as it is lighter. You can load the pretrained GloVe vectors into a PyTorch embedding layer using `torch.nn.Embedding.from_pretrained`.


In [None]:
# load the glove model pretrained weights into a PyTorch embedding layer
embeddings_Glove6B = torch.nn.Embedding.from_pretrained(glove_vectors_6B.vectors,freeze=True)

Get ready to look into the embedding vectors of this large pretrained model for the words in the corpus:


You can get a dictionary that maps each word to its index in the GloVe model's vocabulary:


In [None]:
word_to_index = glove_vectors_6B.stoi  # Vocabulary index mapping
word_to_index['team']

You will get the embedded vector for a word:


In [None]:
embeddings_Glove6B.weight[word_to_index['team']]

Let's see how successful the Glove model is in capturing the similarities between words:


In [None]:
# an array of example words
words = [
    "taller",
    "short",
    "black",
    "white",
    "dress",
    "pants",
    "big",
    "small",
    "red",
    "blue",
    "smile",
    "frown",
    "race",
    "stroll",
    "tiny",
    "huge",
    "soft",
    "rough",
    "team",
    "individual"
]


Create a dictionary of words and their embeddings


In [None]:

embedding_dict_Glove6B = {}
for word in words:
    # Look up the index of the word in the GloVe vocabulary
    index = word_to_index.get(word)
    if index is not None:
        # Words not found in the vocabulary are skipped.
        # Add the embedding vector of the word to embedding_dict_Glove6B
        embedding_dict_Glove6B[word] = embeddings_Glove6B.weight[index]


Now that you have loaded the pretrained embeddings for the sample words, let's check if the model can capture the similarity of words by finding the distance between words:


In [None]:
# Call the function to find similar words
target_word = "small"
top_k=2
similar_words = find_similar_words(target_word, embedding_dict_Glove6B, top_k)

# Print the similar words
print("{} most similar words to {}:".format(top_k,target_word) ,similar_words)

As you can see, the pretrained GloVe model does quite a good job of capturing the similarity of words.
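
Similarity between embeddings is often measured with cosine similarity. The helper `find_similar_words` is defined earlier in the lab and may use a different distance; the sketch below is only an illustration with made-up 3-dimensional vectors and a hypothetical function name `cosine_similarity`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, not real GloVe values
toy_embeddings = {
    "small": [0.9, 0.1, 0.0],
    "tiny": [0.8, 0.2, 0.1],
    "team": [0.0, 0.3, 0.9],
}

target = "small"
# Rank every other word by its similarity to the target, most similar first
ranked = sorted(
    (w for w in toy_embeddings if w != target),
    key=lambda w: cosine_similarity(toy_embeddings[target], toy_embeddings[w]),
    reverse=True,
)
print(ranked)
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a popular choice for comparing word embeddings.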


# Train a word2vec model from gensim

Here's a simple hands-on exercise to train a word2vec model using the `gensim` library.
In this example, you have a small corpus consisting of three sentences.

### Prepare your corpus:


In [None]:
sentences = [["I", "like", "to", "eat", "pizza"],
             ["Pizza", "is", "my", "favorite", "food"],
             ["I", "enjoy", "eating", "pasta"]]
sentences = [[word.lower() for word in sentence] for sentence in sentences]


The `vector_size` parameter specifies the dimensionality of the word embeddings (in this case, 100). The `window` parameter determines the size of the context window. The `min_count` parameter sets the minimum frequency of a word to be included in the training process. Finally, the `workers` parameter controls the number of threads used for training.


In [None]:
from gensim.models import Word2Vec

# Create an instance of Word2Vec model
w2v_model = Word2Vec(sentences, vector_size=100, window=3, min_count=1, workers=4)

Create vocab from sentences:


In [None]:
# Build vocab using the training data
w2v_model.build_vocab(sentences, progress_per=10000)

Train the model:


In [None]:
# Train the model on your training data
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

That's it! You've trained a word2vec model using the `gensim` library. You can now access the word embeddings using `w2v_model.wv` and explore various operations such as finding similar words, calculating word similarities, and more.


Use the trained model to find similar words to "pizza" and calculate the similarity between "pizza" and "pasta". 


In [None]:
# Finding similar words
similar_words = w2v_model.wv.most_similar("pizza")
print("Similar words to 'pizza':", similar_words)

# Calculating word similarity
similarity = w2v_model.wv.similarity("pizza", "pasta")
print("Similarity between 'pizza' and 'pasta':", similarity)

The word embeddings obtained from the model would be more meaningful and informative with larger and more diverse training data.


Use the trained model to create a PyTorch embedding layer (just like what you did with the pretrained GloVe model) and use it in any task as an embedding layer.


In [None]:
# Extract word vectors and create word-to-index mapping
word_vectors = w2v_model.wv
# a dictionary to map words to their index in vocab
word_to_index = {word: index for index, word in enumerate(word_vectors.index_to_key)}

# Create an instance of nn.Embedding and load it with the trained vectors
embedding_dim = w2v_model.vector_size
embedding = torch.nn.Embedding(len(word_vectors.index_to_key), embedding_dim)
embedding.weight.data.copy_(torch.from_numpy(word_vectors.vectors))

# Example usage: get the embedding for a word
word = "pizza"
word_index = word_to_index[word]
word_embedding = embedding(torch.LongTensor([word_index]))
print(f"Word: {word}, Embedding: {word_embedding.detach().numpy()}")

# Text classification using pretrained word embeddings

You are now ready to use the embeddings in a task. Let's use the pretrained embeddings to classify text data into topics:


First, you must build vocab from the pretrained GloVe:


In [None]:
from torchtext.vocab import GloVe,vocab
# Build vocab from glove_vectors
# vocab(ordered_dict: Dict, min_freq: int = 1, specials: Optional[List[str]] = None)
vocab = vocab(glove_vectors_6B.stoi, 0,specials=('<unk>', '<pad>'))
vocab.set_default_index(vocab["<unk>"])

In [None]:
vocab(["<unk>","Hello","hello"])

Next, you need to tokenize text. For this, you can use the `basic_english` tokenizer from torchtext:


In [None]:
# Define tokenizer

tokenizer = get_tokenizer("basic_english")
# Define functions to process text and labels

Create splits from AG_NEWS() dataset for training, validation and test:


In [None]:
# Split the dataset into training and testing iterators.
train_iter, test_iter = AG_NEWS()

# Convert the training and testing iterators to map-style datasets.
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

# Determine the number of samples to be used for training (85% for training, 15% for validation).
num_train = int(len(train_dataset) * 0.85)

# Randomly split the training dataset into training and validation datasets using `random_split`.
# The training dataset will contain 85% of the samples, and the validation dataset will contain the remaining 15%.
split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])

Define the class labels:


In [None]:
# define class labels
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}
'''ag_news_label[y]'''
num_class = len(set([label for (label, text) in train_iter ]))

Collate data in batches:


In [None]:
def text_pipeline(x):
    x=x.lower()# you need this as your vocab is in lower case
    return vocab(tokenizer(x))

def label_pipeline(x):
    return int(x) - 1

# create label, text and offset for each batch of data
# text is the concatenated text for all text data in the batch
# you need to have the offsets(the end of text index) for later when you separate texts and predict their label
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))

    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)
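
To see what `offsets` holds, here is a small pure-Python sketch with made-up token indices (no tensors or vocabulary involved). The running sum of the text lengths gives the boundaries at which each text starts and ends inside the concatenated list.

```python
from itertools import accumulate

# Three hypothetical texts, already converted to token indices
texts = [[4, 8, 15], [16, 23], [42, 7, 9, 1]]

# Concatenate all texts, like torch.cat(text_list)
flat = [tok for text in texts for tok in text]

# Running sum over [0, len(text_0), len(text_1), ...], like cumsum(dim=0)
offsets = list(accumulate([0] + [len(t) for t in texts]))

# Text i occupies flat[offsets[i]:offsets[i + 1]]
recovered = [flat[offsets[i]:offsets[i + 1]] for i in range(len(texts))]

print(offsets)
print(recovered == texts)
```

Because the list starts with 0 and ends with the total length, it has one more entry than there are texts in the batch.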


Create data loaders for train, validation and test splits:


In [None]:
BATCH_SIZE = 64

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)

In [None]:
label, text, offsets=next(iter(train_dataloader ))
print(label, text, offsets)
label.shape, text.shape, offsets.shape

Create the classifier model:


In [None]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = torch.nn.Embedding.from_pretrained(glove_vectors_6B.vectors,freeze=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

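    def init_weights(self):
        initrange = 0.5
        # Only the classifier layer is initialized here. The embedding layer
        # keeps the pretrained GloVe weights loaded (and frozen) in __init__;
        # re-initializing it would discard them.
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()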

    def forward(self, text, offsets):
        embedded = self.embedding(text)
        # Average the word embeddings of each text in the batch
        means = []
        for i in range(1, len(offsets)):
            # Tokens of text i-1 occupy positions offsets[i-1] to offsets[i]-1
            text_tmp = embedded[offsets[i-1]:offsets[i]]
            means.append(text_tmp.mean(0))

        return self.fc(torch.stack(means))
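
The loop in `forward` averages the embeddings of each text one at a time. PyTorch's `nn.EmbeddingBag` in `mean` mode computes the same per-text averages in a single call. The sketch below uses a small random weight matrix as a stand-in for the GloVe vectors and is an alternative to consider, not the lab's implementation; `include_last_offset=True` is set because the offsets in this lab also contain the final end index.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
weights = torch.randn(10, 4)            # stand-in for pretrained vectors (10 words, 4 dims)

text = torch.tensor([1, 2, 3, 7, 8])    # two texts concatenated: [1, 2, 3] and [7, 8]
offsets = torch.tensor([0, 3, 5])       # start of each text plus the final end index

# Loop-based mean, mirroring the forward method
embedding = nn.Embedding.from_pretrained(weights, freeze=True)
embedded = embedding(text)
loop_means = torch.stack(
    [embedded[offsets[i - 1]:offsets[i]].mean(0) for i in range(1, len(offsets))]
)

# The same averages computed by EmbeddingBag
bag = nn.EmbeddingBag.from_pretrained(
    weights, freeze=True, mode="mean", include_last_offset=True
)
bag_means = bag(text, offsets)

print(torch.allclose(loop_means, bag_means, atol=1e-6))
```

The explicit loop is easier to read, while the `EmbeddingBag` version avoids a Python-level loop over the batch.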

Define an evaluate function to calculate the accuracy of model:


In [None]:
def evaluate(dataloader):
    model.eval()
    total_acc, total_count= 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text,offsets)

            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count

Create an instance of the model and check its prediction power before training:


In [None]:
# Define hyperparameters
vocab_size=len(vocab)
embedding_dim = 300
# Initialize the model
model = TextClassificationModel(vocab_size, embedding_dim, num_class).to(device)

In [None]:
evaluate(test_dataloader)

Not good! Let's train the model:


In [None]:
def train_TextClassification(model, dataloader, criterion, optimizer, epochs=10):

    cum_loss_list = []
    acc_epoch = []
    acc_old = 0

    for epoch in tqdm(range(1, epochs + 1)):
        model.train()
        cum_loss = 0
        for idx, (label, text, offsets) in enumerate(dataloader):
            optimizer.zero_grad()

            predicted_label = model(text, offsets)

            loss = criterion(predicted_label, label)
            loss.backward()
            # Clip gradients to stabilize training
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
            optimizer.step()
            cum_loss += loss.item()

        # Average training loss per batch for this epoch
        cum_loss_list.append(cum_loss / len(dataloader))
        accu_val = evaluate(valid_dataloader)
        acc_epoch.append(accu_val)

        # Save the weights whenever the validation accuracy improves
        if accu_val > acc_old:
            acc_old = accu_val
            torch.save(model.state_dict(), 'my_model.pth')

    return model, cum_loss_list, acc_epoch


In [None]:
# Define hyperparameters
LR=0.1
EPOCHS = 10


criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)

model,cum_loss_list,acc_epoch  = train_TextClassification(model,train_dataloader,criterion,optimizer,EPOCHS)

Let's plot the loss and accuracy for the trained model:


In [None]:
import matplotlib.pyplot as plt
def plot(COST,ACC):
    fig, ax1 = plt.subplots()
    color = 'tab:red'
    ax1.plot(COST, color=color)
    ax1.set_xlabel('epoch', color=color)
    ax1.set_ylabel('total loss', color=color)
    ax1.tick_params(axis='y', color=color)

    ax2 = ax1.twinx()
    color = 'tab:blue'
    ax2.set_ylabel('accuracy', color=color)  # you already handled the x-label with ax1
    ax2.plot(ACC, color=color)
    ax2.tick_params(axis='y', color=color)
    fig.tight_layout()  # otherwise the right y-label is slightly clipped

    plt.show()

In [None]:
plot(cum_loss_list,acc_epoch)

Finally, evaluate the model on test data:


In [None]:
evaluate(test_dataloader)

Great job! You've acquired the skills to create and train embedding models, as well as utilize large pretrained models for practical applications. This knowledge opens up a world of possibilities where you can leverage the power of embeddings to improve various natural language processing tasks. Keep up the excellent work!


## Authors


Fateme Akbari


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-10-16|0.1|Fateme|Create Lab Template|


© Copyright IBM Corporation. All rights reserved.


# Sequence-to-Sequence RNN Models: Translation Task


Estimated time needed: **60** minutes


In this hands-on guide, you will explore the fundamentals of sequence-to-sequence models and learn how to implement an RNN-based model for a translation task using PyTorch.


## __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-required-libraries">Installing required libraries</a></li>
            <li><a href="#Importing-required-libraries">Importing required libraries</a></li>
        </ol>
    </li>
    <li>
        <a href="#Background">Background</a>
        <ol>
            <li><a href="#History-of-sequence-to-sequence-models">History of sequence-to-sequence models</a></li>
            <li><a href="#Introduction-to-RNNs">Introduction to RNNs</a></li>
            <li><a href="#Sequence-to-sequence-architecture">Sequence-to-sequence architecture</a></li>    
        </ol>
    </li>
    <li><a href="#Encoder-implementation-in-PyTorch">Encoder implementation in PyTorch</a></li>
    <li><a href="#Decoder-implementation-in-PyTorch">Decoder implementation in PyTorch</a></li>
    <li><a href="#Sequence-to-sequence-model-implementation-in-PyTorch">Sequence-to-sequence model implementation in PyTorch</a></li>
    <li><a href="#Training-model-in-PyTorch">Training model in PyTorch</a></li>
    <li><a href="#Evaluating-model-in-PyTorch">Evaluating model in PyTorch</a></li>
    <li><a href="#Data-preprocessing">Data preprocessing</a></li>
    <li><a href="#Training-the-model">Training the model</a></li>
    <ol>
        <li><a href="#Initializations">Initializations</a></li>
        <li><a href="#Training">Training</a></li>
    </ol>
    <li><a href="#Model-inference">Model inference</a></li>
    <li><a href="#BLEU-score-metric-for-evaluation">BLEU score metric for evaluation</a></li>
    <li><a href="#Exercises">Exercises</a></li>

</ol>


## Objectives
After completing this lab, you will be able to:

 - Comprehend recurrent neural networks (RNN) architecture
 - Create an Encoder-Decoder model for a translation task
 - Train and evaluate the model
 - Create a generator for the translation task
 - Explain concepts related to Perplexity and BLEU score and use them for evaluating translations


----


## Setup


### Installing required libraries

<h2 style="color:red;">After installing the libraries below please RESTART THE KERNEL and run all cells.</h2>


In [None]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
!mamba install -qy numpy==1.21.4 seaborn==0.9.0
# Note: If your environment doesn't support "!mamba install", use "!pip install"
# The working version of each package is commented in front of each package name

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [None]:
!pip install torchtext==0.15.1
!pip install torch==2.0.0
!pip install spacy==3.7.2
!pip install torchdata==0.6.0
!pip install "portalocker>=2.0.0" #2.7.0
!pip install nltk==3.8.1
!pip install -U matplotlib

!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

### Importing required libraries

_It is recommended that you import all required libraries in one place (here):_


In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper, Mapper
import torchtext
from nltk.translate.bleu_score import sentence_bleu
import torch
import torch.nn as nn
import torch.optim as optim


import numpy as np
import random
import math
import time
from tqdm import tqdm
import matplotlib.pyplot as plt


# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

## Background

Sequence-to-sequence (Seq2seq) models have revolutionized various natural language processing (NLP) tasks, such as machine translation, text summarization, and chatbots. These models employ Recurrent Neural Networks (RNNs) to process variable-length input sequences and generate variable-length output sequences.


### History of sequence-to-sequence models

Sequence-to-sequence models were introduced as an extension of traditional feedforward neural networks.
Researchers realized the need for models that could handle variable-length input and output sequences, such as machine translation.
The pioneering work of Sutskever et al. (2014) introduced the use of RNNs for seq2seq models.

Here are some of the main applications of seq2seq models:
- Translation: Translating a sequence from one domain to another (e.g., English to French).
- Question answering: Generating a natural language response given an input sentence (e.g., chatbots).
- Summarization: Summarizing a long document into a shorter sequence of sentences.

There are many more applications that involve sequence generation.


### Introduction to RNNs

RNNs are a class of neural networks designed to process sequential data.
They maintain an internal memory ($h_t$) to capture information from previous steps and use it for current predictions.
A recurrent connection allows information to flow from one step to the next, so previous states influence the current state. Here's the general formulation of a simple RNN:


Given:

- $\mathbf{x}_t$: input vector at time step $t$

- $\mathbf{h}_{t-1}$: hidden state vector from the previous time step

- $\mathbf{W}_x$ and $\mathbf{W}_h$: weight matrices for the input and hidden state, respectively

- $\mathbf{b}$: bias vector

- $\sigma$: activation function (often a sigmoid or tanh)

The update equations for the hidden state $ \mathbf{h}_t $ and the output $ \mathbf{y}_t $ are as follows:

$$
\begin{align*}
\mathbf{h}_t &= \sigma(\mathbf{W}_x \cdot \mathbf{x}_t + \mathbf{W}_h \cdot \mathbf{h}_{t-1} + \mathbf{b})
\end{align*}
$$

It can be seen that the hidden state function depends on the previous hidden state as well as the input at time t, which is why it has a collective memory of previous time steps.
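
To make this memory explicit, the recurrence can be unrolled for two steps. This is a worked expansion of the equation above; the only added notation is an initial state $\mathbf{h}_0$.

$$
\begin{align*}
\mathbf{h}_1 &= \sigma(\mathbf{W}_x \cdot \mathbf{x}_1 + \mathbf{W}_h \cdot \mathbf{h}_0 + \mathbf{b}) \\
\mathbf{h}_2 &= \sigma(\mathbf{W}_x \cdot \mathbf{x}_2 + \mathbf{W}_h \cdot \sigma(\mathbf{W}_x \cdot \mathbf{x}_1 + \mathbf{W}_h \cdot \mathbf{h}_0 + \mathbf{b}) + \mathbf{b})
\end{align*}
$$

So $\mathbf{h}_2$ depends on both $\mathbf{x}_2$ and $\mathbf{x}_1$, and the same weights $\mathbf{W}_x$, $\mathbf{W}_h$, and $\mathbf{b}$ are reused at every step.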

For the output (if you're making a prediction at each time step):

$$
\begin{align*}
\mathbf{y}_t &= \text{softmax}(\mathbf{W}_o \cdot \mathbf{h}_t + \mathbf{b}_o)
\end{align*}
$$

Where:

- $\mathbf{W}_o$: weight matrix for the output
- $\mathbf{b}_o$: bias vector for the output
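
As an illustration of the output equation, the sketch below applies softmax to a made-up score vector standing in for $\mathbf{W}_o \cdot \mathbf{h}_t + \mathbf{b}_o$. The numbers and the function name `softmax` are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]   # made-up values of W_o . h_t + b_o
y_t = softmax(scores)
print(y_t, sum(y_t))
```

The largest score receives the largest probability, and the outputs always sum to 1, which is what makes them usable as a distribution over output classes.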



Depending on the specific task, an RNN cell can either produce an output from $h_t$ or solely pass it to the succeeding cell, where it serves as internal memory. The architecture's ability to retain memory might seem elusive at first glance, so let's make it concrete by implementing a simple RNN that reproduces the following state machine:

![a title](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/Screenshot%202023-10-19%20at%2011.29.23%E2%80%AFAM.png)


The diagram showcases a state machine or transition model with three distinct states, depicted by the prominent purple circles. Each state is distinctly labeled with a value for $ h $: $ h = -1 $, $ h = 0 $, and $ h = 1 $.

1. **State $ h = -1 $**:
   - Maintains itself when $ x = 1 $ (illustrated by the yellow loop).
   - Proceeds to the $ h = 0$ state upon receiving $ x = -1$ (highlighted by the red arrow).

2. **State $ h = 0 $**:
   - Moves to the $h = -1 $ state when $ x = 1$ (illustrated by the red arrow).
   - Advances to the $ h = 1 $ state with $ x = -1$ (marked by the red arrow).

3. **State $h = 1 $**:
   - Sustains its position when $ x = -1 $ (indicated by the yellow loop).
   - Transitions to the $ h = 0 $ state upon receiving $ x = 1 $ (signified by the red arrow).

To encapsulate, the diagram effectively portrays transitions among three states based on the input $ x $. Contingent on the prevailing state and the input $ x $, the state machine either transitions to a different state or remains stationary.
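
The transitions listed above can be written down as a lookup table. The sketch below is plain Python with hypothetical names (`TRANSITIONS`, `run`); it simply replays a short input sequence through the table, starting from $h = 0$.

```python
# (current state, input) -> next state, as listed above
TRANSITIONS = {
    (-1, 1): -1, (-1, -1): 0,
    (0, 1): -1,  (0, -1): 1,
    (1, -1): 1,  (1, 1): 0,
}


def run(inputs, h=0):
    """Return the sequence of states visited for the given inputs."""
    states = []
    for x in inputs:
        h = TRANSITIONS[(h, x)]
        states.append(h)
    return states


states = run([1, 1, -1, -1, 1, 1], h=0)
print(states)
```

Note that the next state depends on the current state as well as the input, which is exactly the kind of dependence a recurrent hidden state is meant to capture.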

---



You can represent the previously mentioned state machine using the layer detailed below. Use $\tanh$ as the activation because the value of $h$ should fall within $[-1, 1]$. Note that the output is excluded for simplification:

$$\begin{align*}
W_{xh} & = -10.0 \\
W_{hh} & = 10.0 \\
b_h & = 0.0 \\
x_t & = 1 \\
h_{\text{prev}} & = 0.0 \\
h_t & = \tanh(x_t \cdot W_{xh} + h_{\text{prev}} \cdot W_{hh} + b_h)
\end{align*}$$


In [None]:
W_xh=torch.tensor(-10.0)
W_hh=torch.tensor(10.0)
b_h=torch.tensor(0.0)
x_t=1
# start from the initial state h = 0, as in the equations above
h_prev=torch.tensor(0.0)

Consider the following sequence $x_t$ for $t=1,2,..,6$:


In [None]:
X=[1,1,-1,-1,1,1]

Assuming that you start from the initial state $h = 0$, with the above input vector $x$, the state vector $h$ should look like this:


In [None]:
H=[-1,-1,0,1,0,-1]

In [None]:
# Initialize an empty list to store the predicted state values
H_hat = []
# Loop through each data point in the input sequence X
t=1
for x in X:
    # Assign the current data point to x_t
    print("t=",t)
    x_t = x
    # Print the value of the previous state (h at time t-1)
    print("h_t-1", h_prev.item())

    # Compute the current state (h at time t) using the RNN formula with tanh activation
    h_t = torch.tanh(x_t * W_xh + h_prev * W_hh + b_h)

    # Update h_prev to the current state value for the next iteration
    h_prev = h_t

    # Print the current input value (x at time t)
    print("x_t", x_t)

    # Print the computed state value (h at time t)
    print("h_t", h_t.item())
    print("\n")

    # Append the current state value to the H_hat list, rounded to the nearest integer
    # (int() alone truncates, so a value just below 1 would become 0)
    H_hat.append(int(round(h_t.item())))
    t+=1




You can evaluate the accuracy of the predicted state ```H_hat``` by comparing it to the actual state ```H```. In RNNs, the state $ h_t $ is utilized to predict an output sequence $y_t $ based on the given input sequence $ x_t $.


In [None]:
H_hat

In [None]:
H

While $W_{xh}$, $W_{hh}$, and $b_h$ are predefined here, in practice these values are learned through training on data.


In practice, modifications and enhancements, such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), are often used to address issues like the vanishing gradient problem in basic RNNs.


An LSTM cell has three main components: an input gate, a forget gate, and an output gate.
- The **input gate** controls how much new information should be stored in the cell's memory. It looks at the current input and the previous hidden state and decides which parts of the new input to remember.
- The **forget gate** determines what information should be discarded or forgotten from the cell's memory. It considers the current input and the previous hidden state and decides which parts of the previous memory are no longer relevant.
- The **output gate** determines what information should be outputted from the cell. It looks at the current input and the previous hidden state and decides which parts of the cell's memory to include in the output.

The key idea behind LSTM cells is that they have a separate memory state that can selectively retain or forget information over time. This helps them handle long-range dependencies and remember important information from earlier steps in a sequence.
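
To make the gates concrete, here is a minimal sketch that computes one LSTM step by hand and compares it with PyTorch's `nn.LSTMCell`. The sizes and random inputs are assumptions for illustration. Besides the three gates, the computation uses a candidate vector (`g`) that holds the new values offered to the memory.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, batch_size = 3, 4, 2   # illustrative sizes (assumptions)
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(batch_size, input_size)       # current input
h_prev = torch.randn(batch_size, hidden_size)   # previous hidden state
c_prev = torch.randn(batch_size, hidden_size)   # previous cell (memory) state

# All four pre-activations are computed at once and then split.
# PyTorch stores them in the order: input, forget, candidate, output.
gates = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)

i = torch.sigmoid(i)   # input gate: how much new information to store
f = torch.sigmoid(f)   # forget gate: how much of the old memory to keep
g = torch.tanh(g)      # candidate values that could be written to memory
o = torch.sigmoid(o)   # output gate: how much of the memory to expose

c_t = f * c_prev + i * g       # updated memory state
h_t = o * torch.tanh(c_t)      # updated hidden state

# Compare with PyTorch's own implementation
h_ref, c_ref = cell(x_t, (h_prev, c_prev))
print(torch.allclose(h_t, h_ref, atol=1e-5), torch.allclose(c_t, c_ref, atol=1e-5))
```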


### Sequence-to-sequence architecture

Seq2seq models have an encoder-decoder structure. The encoder encodes the input sequence into a fixed-dimensional representation, often called the context vector ($h_t$). The decoder generates the output sequence based on the encoded context vector.


Let's take a closer look at the encoder and decoder boxes in the video below. Translation is a typical sequence-to-sequence task. The input is a sequence of words in the source language ("I love to travel"), while the output is its translation in the target language ("J'adore voyager"). As shown in the video, the input is fed into the encoder, one word after another. Each RNN cell receives a word ($x_t$) and has an internal memory ($h_t$). After processing the input and $h_t$, the RNN cell passes an updated context vector ($h_{t+1}$) to the next RNN cell. When the end of the sentence is reached, the context vector is passed to the decoder. Decoder cells are also RNN cells; they receive the context vector and generate the output word by word. Each decoder cell receives the previously generated word as well as the updated context vector from the previous cell and generates the next word ($y_t$). This architecture allows for generating output text whose length is not tied to the length of the input.


<video width="640" height="480"
src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/Translation_RNN.mp4"
controls>
</video>


## Encoder implementation in PyTorch

To implement the encoder in PyTorch, you create a subclass of `torch.nn.Module` and define its `__init__()` and `forward()` methods.

Let's first define the parameters used in the `__init__()` function:
- `vocab_len` is the number of unique words in the vocabulary. After preprocessing the data, you can count the unique words in your vocabulary and use that count here. This is the dimension of the model input.
- `emb_dim` is the output dimension of the embedding vector. A good practice is to use 256-512 for a sample demo app like the one you are building here.
- `n_layers` specifies the number of layers in the LSTM. LSTMs can be stacked; this initial implementation uses only one layer, but passing the parameter keeps the class flexible.
- `hid_dim` is the dimensionality of the hidden and cell states.
- `dropout_prob` is the dropout probability. This is a regularization parameter to prevent overfitting.

Now, let's look into the layers:
- The embedding layer takes the input token indices and outputs their embedding vectors, so its dimensions are defined by `vocab_len` and `emb_dim`.
- The LSTM layer takes input of size `emb_dim` and produces three outputs: `outputs`, `hidden`, and `cell`. The number of units in the LSTM is set by `hid_dim`.


In the `forward()` function, the embedding layer maps each token index in `input_batch` to its dense vector, which is equivalent to multiplying a one-hot vector of length `vocab_len` by the embedding matrix. Next, the LSTM layer receives the embedded input and returns three tensors: `outputs`, `hidden`, and `cell`. The encoder does not need the `outputs` tensor, because only the context vector (`hidden` and `cell`) is passed to the decoder block. Therefore, `forward()` returns only `hidden` and `cell`.
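
One way to picture the embedding layer: looking up row `i` of its weight matrix gives the same result as multiplying a one-hot vector for token `i` by that matrix. The sketch below checks this on a small example; the sizes mirror the toy encoder used in this lab, and the token indices are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_len, emb_dim = 8, 10
embedding = nn.Embedding(vocab_len, emb_dim)

tokens = torch.tensor([0, 3, 4, 2, 1])        # token indices

looked_up = embedding(tokens)                 # direct row lookup
one_hot = F.one_hot(tokens, num_classes=vocab_len).float()
multiplied = one_hot @ embedding.weight       # explicit one-hot product

print(looked_up.shape)                        # torch.Size([5, 10])
print(torch.allclose(looked_up, multiplied))  # both routes give the same vectors
```

In practice only the lookup is used, since it avoids building the large one-hot matrix.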

Note: When using an LSTM, you have an additional cell state. However, if you were using a GRU, you would only have the hidden state.
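
The difference in return values can be seen directly. This is a small sketch with assumed sizes and a random tensor standing in for the embedded input:

```python
import torch
import torch.nn as nn

emb_dim, hid_dim, n_layers = 10, 8, 1    # illustrative sizes (assumptions)
embedded = torch.randn(5, 1, emb_dim)    # [src len, batch size, emb dim]

lstm = nn.LSTM(emb_dim, hid_dim, n_layers)
gru = nn.GRU(emb_dim, hid_dim, n_layers)

# The LSTM returns the outputs plus a (hidden, cell) tuple
lstm_out, (lstm_hidden, lstm_cell) = lstm(embedded)
# The GRU returns the outputs plus the hidden state only
gru_out, gru_hidden = gru(embedded)

print(lstm_out.shape, lstm_hidden.shape, lstm_cell.shape)
print(gru_out.shape, gru_hidden.shape)
```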


In [None]:
class Encoder(nn.Module):
    def __init__(self, vocab_len, emb_dim, hid_dim, n_layers, dropout_prob):
        super().__init__()

        self.hid_dim = hid_dim
        self.n_layers = n_layers

        self.embedding = nn.Embedding(vocab_len, emb_dim)

        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout_prob)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, input_batch):
        #input_batch = [src len, batch size]
        embed = self.dropout(self.embedding(input_batch))
        embed = embed.to(device)
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        outputs, (hidden, cell) = self.lstm(embed)

        return hidden, cell


Now you are ready to create an encoder instance to see how it works:


In [None]:
vocab_len = 8
emb_dim = 10
hid_dim=8
n_layers=1
dropout_prob=0.5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

encoder_t = Encoder(vocab_len, emb_dim, hid_dim, n_layers, dropout_prob).to(device)

Let's look at a simple example in which the encoder's forward method transforms the `src` sentence into `hidden` and `cell` states. tensor([[0],[3],[4],[2],[1]]) corresponds to `src` = 0,3,4,2,1, in which each number represents a token in the `src` vocabulary. For instance, 0:`<bos>`, 3:"Das", 4:"ist", 2:"schön", 1:`<eos>`. Note that the batch size here is 1.


In [None]:
src_batch = torch.tensor([[0,3,4,2,1]])
# transpose the input tensor because nn.LSTM expects sequence-first input ([src len, batch size]) by default
src_batch = src_batch.t().to(device)
print("Shape of input(src) tensor:", src_batch.shape)
hidden_t , cell_t = encoder_t(src_batch)
print("Hidden tensor from encoder:",hidden_t ,"\nCell tensor from encoder:", cell_t)

The encoder takes the entire source sequence as input, which consists of a sequence of words or tokens. The encoder LSTM processes the entire input sequence and updates its hidden states at each time step. The hidden states of the LSTM network act as a form of memory and capture the contextual information of the input sequence. After processing the entire input sequence, the final hidden state of the encoder LSTM captures the summarized representation of the input sequence's context. This final hidden state is sometimes referred to as the "context vector".
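
For a single-layer, unidirectional LSTM, the final hidden state is the same tensor as the output at the last time step. The sketch below checks this, using a random tensor in place of real embeddings (an assumption for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=10, hidden_size=8, num_layers=1)
embedded = torch.randn(5, 1, 10)          # [src len, batch size, emb dim]

outputs, (hidden, cell) = lstm(embedded)

# outputs holds the hidden state at every time step;
# hidden holds only the state after the last time step
print(outputs.shape, hidden.shape)
print(torch.allclose(outputs[-1], hidden[-1]))
```

This is why the encoder can discard `outputs`: everything the decoder receives is already contained in `hidden` and `cell`.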


## Decoder implementation in PyTorch

To have a better understanding of the internal mechanism of the decoder part, let's take a closer look into it:


<video width="640" height="480"
       src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/decoder_RNN.mp4"
       controls>
</video>


The decoder class inherits from `nn.Module`, which is the base class for all neural network modules in PyTorch.
The constructor (`__init__` method) initializes the parameters and layers of the decoder.
- `output_dim` is the number of possible output values (target vocab length).
- `emb_dim` is the dimensionality of the embedding layer.
- `hid_dim` is the dimensionality of the hidden state in the LSTM.
- `n_layers` is the number of layers in the LSTM.
- `dropout` is the dropout probability.

The decoder contains the following layers:
- `embedding`: An embedding layer that maps the output values to dense vectors of size emb_dim.
- `lstm`: An LSTM layer that takes the embedded input and produces hidden states of size hid_dim.
-  `fc_out`: A linear layer that maps the LSTM output to the output dimension output_dim.
- `softmax`: A log-softmax activation function applied to the output to obtain a probability distribution over the output values.
- `dropout`: A dropout layer that applies dropout to the embedded input.
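
Note that log-softmax returns log-probabilities rather than probabilities. A short sketch with made-up logits shows how the two relate:

```python
import torch
import torch.nn as nn

# Made-up scores for one sample over a vocabulary of 6 tokens
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0, 1.0, 3.0]])   # [batch size, output dim]

log_probs = nn.LogSoftmax(dim=1)(logits)
probs = log_probs.exp()                  # exponentiate to recover probabilities

print(log_probs)
print(probs.sum(dim=1))                  # each row sums to 1 (up to rounding)
print(log_probs.argmax(dim=1))           # tensor([5]), same winner as the raw logits
```

Since the logarithm is monotonic, taking `argmax` over log-probabilities selects the same token as taking it over probabilities.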


In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers


        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.softmax = nn.LogSoftmax(dim=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell):


        #input = [batch size]

        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]

        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]

        input = input.unsqueeze(0)
        #input = [1, batch size]

        embedded = self.dropout(self.embedding(input))
        #embedded = [1, batch size, emb dim]

        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]

        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        prediction_logit = self.fc_out(output.squeeze(0))
        prediction = self.softmax(prediction_logit)
        #prediction = [batch size, output dim]


        return prediction, hidden, cell

You can create a decoder instance. The output dimension is set as the target vocab length.


In [None]:
output_dim = 6
emb_dim=10
hid_dim = 8
n_layers=1
dropout=0.5
decoder_t = Decoder(output_dim, emb_dim, hid_dim, n_layers, dropout).to(device)

Now that you have instances of both the encoder and decoder, you are ready to connect them (the red box in the diagram below). First, let's see how you can pass the hidden and cell states (the pink cell within the red box) from the encoder (the green boxes container) to the decoder (the orange boxes container). Looking at the diagram, you can see that the decoder also receives an input, which is the previous word that it has predicted. For the first decoder cell, this input is the `<bos>` token. Each decoder cell outputs a prediction and updates the hidden and cell states to pass to the next decoder cell. `prediction` holds log-probabilities over the possible target tokens, one value per entry in the target vocabulary.

![connection](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/ED_connection.JPG)


In [None]:
input_t = torch.tensor([0]).to(device) #<bos>
input_t.shape
prediction, hidden, cell = decoder_t(input_t, hidden_t , cell_t)
print("Prediction:", prediction, '\nHidden:',hidden,'\nCell:', cell)

# Encoder-decoder connection


Alright! You learned how to create encoder and decoder modules and how to pass input to them. Now you need to create the connection so that the model can process (`src`,`trg`) pairs and generate the translation. Suppose that `trg` is tensor([[0],[2],[3],[5],[1]]), which corresponds to the sequence 0,2,3,5,1, in which each number represents a token in the target vocabulary. For instance, 0:`<bos>`, 2:"this", 3:"is", 5:"beautiful", 1:`<eos>`.


In [None]:

#trg = [trg len, batch size]
#teacher_forcing_ratio is probability to use teacher forcing
#e.g. if teacher_forcing_ratio is 0.75 you use ground-truth inputs 75% of the time
teacher_forcing_ratio = 0.5
trg = torch.tensor([[0],[2],[3],[5],[1]]).to(device)


batch_size = trg.shape[1]
trg_len = trg.shape[0]
trg_vocab_size = decoder_t.output_dim

#tensor to store decoder outputs
outputs_t = torch.zeros(trg_len, batch_size, trg_vocab_size).to(device)

#send to device

hidden_t = hidden_t.to(device)
cell_t = cell_t.to(device)


#first input to the decoder is the <bos> tokens
input = trg[0,:]


for t in range(1, trg_len):

    #loop through the trg len and generate tokens
    #the decoder receives the previously generated token plus the hidden and cell states
    #it outputs its prediction (log-probabilities for the next token) and updates hidden and cell
    output_t, hidden_t, cell_t = decoder_t(input, hidden_t, cell_t)

    #place predictions in a tensor holding predictions for each token
    outputs_t[t] = output_t

    #decide if you are going to use teacher forcing or not
    teacher_force = random.random() < teacher_forcing_ratio

    #get the highest predicted token from your predictions
    top1 = output_t.argmax(1)


    #if teacher forcing, use actual next token as next input
    #if not, use predicted token
    input = trg[t] if teacher_force else top1

print(outputs_t,outputs_t.shape )

The size of the output tensor is (trg_len, batch_size, trg_vocab_size). This is because for each `trg` token (length of `trg`) the model outputs a distribution over all possible tokens (trg vocab length). Therefore, to generate the predicted tokens, that is, the translation of the `src` sentence, you take the token with the highest score at each position:


In [None]:
# take the argmax over the last dimension (index 2), the vocabulary axis of outputs_t ([trg len, batch size, trg vocab size])
pred_tokens = outputs_t.argmax(2)
print(pred_tokens)

It is no surprise that the translation is not correct (trg = tensor([[0],[2],[3],[5],[1]])), as the model has not yet gone through any training.


Let's put together all the code for connecting the encoder and decoder in a seq2seq class for better usability.


## Sequence-to-sequence model implementation in PyTorch
Let's connect encoder and decoder components to create the seq2seq model.

You define the seq2seq class that inherits from nn.Module, which is the base class for all neural network modules in PyTorch.
Inputs are:
- `encoder` and `decoder` are instances of the encoder and decoder networks that you have already defined.
- `device` specifies the device (e.g., CPU or GPU) on which the computations will be performed.
- `trg_vocab` represents the vocabulary of the target language. It is used to determine the size of the output vocabulary.

The **forward** method defines the forward pass of the seq2seq model. It takes three arguments: `src`, `trg`, and `teacher_forcing_ratio`:

- `src` represents the source sequences, and `trg` represents the target sequences.
- `teacher_forcing_ratio` is the probability of using teacher forcing at each decoding step during training. Teacher forcing is a technique where the true target token is fed as input to the decoder at the next time step, instead of the decoder's own prediction from the previous time step.

The **forward** method initializes some variables needed for the forward pass, such as `batch_size`, `trg_len`, and `trg_vocab_size`. It also creates a zero-filled tensor called `outputs` to store the decoder outputs for each time step.

The `hidden` and `cell` states of the encoder are obtained by calling `self.encoder(src)`. These states are then used as the initial states for the decoder.

The input to the decoder at the first time step is the `<bos>` token of the target sequences.

The decoder is iterated over for each time step in the target sequences (`for t in range(1, trg_len)`). The input, along with the previous hidden and cell states, is passed to the decoder, and it produces an output tensor. The `output` tensor is stored in the `outputs` tensor.

At each time step, there is a decision made whether to use teacher forcing or not based on the teacher_forcing_ratio probability. If teacher forcing is used, the true next token from the target sequences (`trg[t]`) is used as the input for the next time step. Otherwise, the predicted token from the previous time step (`top1 = output.argmax(1)`) is used.

Finally, the `outputs` tensor containing the predicted outputs for each time step is returned.
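
The teacher forcing decision can be isolated in a small function. This is a sketch only: `next_decoder_input` is a hypothetical helper, and the scores and target token are made up.

```python
import random
import torch

def next_decoder_input(trg_t, output, teacher_forcing_ratio):
    """Pick the token that is fed to the decoder at the next step (hypothetical helper)."""
    top1 = output.argmax(1)                                   # the model's own prediction
    teacher_force = random.random() < teacher_forcing_ratio   # coin flip
    return trg_t if teacher_force else top1

output = torch.tensor([[0.1, 0.2, 3.0, 0.4]])   # [batch size, output dim]; token 2 scores highest
trg_t = torch.tensor([1])                       # the ground-truth token is 1

print(next_decoder_input(trg_t, output, 1.0))   # ratio 1.0: always ground truth, tensor([1])
print(next_decoder_input(trg_t, output, 0.0))   # ratio 0.0: always prediction, tensor([2])
```

With a ratio between 0 and 1, the choice varies randomly from step to step.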


In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device,trg_vocab):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        self.trg_vocab = trg_vocab

        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 you use ground-truth inputs 75% of the time


        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        #last hidden and cell states of the encoder are used as the initial states of the decoder
        hidden, cell = self.encoder(src)
        hidden = hidden.to(self.device)
        cell = cell.to(self.device)


        #first input to the decoder is the <bos> tokens
        input = trg[0,:]

        for t in range(1, trg_len):

            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)

            #place predictions in a tensor holding predictions for each token
            outputs[t] = output

            #decide if you are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio

            #get the highest predicted token from your predictions
            top1 = output.argmax(1)


            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1


        return outputs

## Training model in PyTorch
Now that the model is defined, you define a train function for the seq2seq model. Let's go through the code and understand its components:

1. `train(model, iterator, optimizer, criterion, clip)` takes five arguments:

   - `model` is the model that will be trained.
   - `iterator` is an iterable object that provides the training data in batches.
   - `optimizer` is the optimization algorithm used to update the model's parameters.
   - `criterion` is the loss function that measures the model's performance.
   - `clip` is a value used to clip the gradients to prevent them from becoming too large during backpropagation.

2. The function starts by setting the model to training mode with `model.train()`. This is necessary to enable certain layers (e.g., dropout) that behave differently during training and evaluation.

3. It initializes a variable `epoch_loss` to keep track of the accumulated loss during the epoch.

4. The function iterates over the training data provided by the `iterator`. Each iteration retrieves a batch of input sequences (`src`) and target sequences (`trg`).

5. The input sequences (`src`) and target sequences (`trg`) are moved to the appropriate device (e.g., GPU) using `src = src.to(device)` and `trg = trg.to(device)`.

6. The gradients of the model's parameters are cleared using `optimizer.zero_grad()` to prepare for the new batch.

7. The model is then called with `output = model(src, trg)` to obtain the model's predictions for the target sequences.

8. The `output` tensor has dimensions `[trg len, batch size, output dim]`. To calculate the loss, the first time step is dropped (`output[1:]`), because position 0 corresponds to the initial `<bos>` token and is never predicted. The result is then flattened to `[(trg len - 1) * batch size, output dim]`.

9. The target sequences (`trg`) are sliced the same way (`trg[1:]`) and flattened to `[(trg len - 1) * batch size]`, so that each row of the flattened `output` lines up with one target token.

10. The loss between the reshaped `output` and `trg` tensors is calculated using the specified `criterion`.

11. The gradients of the loss with respect to the model's parameters are computed using `loss.backward()`.

12. The gradients are then clipped using `torch.nn.utils.clip_grad_norm_(model.parameters(), clip)`, which rescales them so that their overall norm does not exceed `clip`. This prevents the gradients from becoming too large, which can cause issues during optimization.

13. The optimizer's `step()` method is called to update the model's parameters using the computed gradients.

14. The current batch loss (`loss.item()`) is added to the `epoch_loss` variable.

15. After all the batches have been processed, the function returns the average loss per batch for the entire epoch, calculated as `epoch_loss / len(list(iterator))`.
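
The effect of the gradient clipping step can be observed on a toy model. This is a sketch under assumptions: a small `nn.Linear` layer and deliberately large inputs stand in for the seq2seq model and its data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 2)              # toy stand-in for the seq2seq model
x = torch.randn(8, 4) * 100          # large inputs to provoke large gradients
loss = model(x).pow(2).mean()
loss.backward()

clip = 1.0
# clip_grad_norm_ rescales the gradients in place and
# returns the total gradient norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

# Recompute the total norm from the (now clipped) gradients
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print("norm before clipping:", norm_before.item())
print("norm after clipping: ", norm_after.item())
```

All gradients are scaled by the same factor, so their direction is preserved and only the step size shrinks.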


In [None]:
def train(model, iterator, optimizer, criterion, clip):

    model.train()

    epoch_loss = 0

    # Wrap iterator with tqdm for progress logging
    train_iterator = tqdm(iterator, desc="Training", leave=False)

    # iterate over the tqdm-wrapped iterator so that the progress bar advances
    for i, (src,trg) in enumerate(train_iterator):

        src = src.to(device)
        trg = trg.to(device)
        optimizer.zero_grad()

        output = model(src, trg)

        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]

        output_dim = output.shape[-1]

        output = output[1:].view(-1, output_dim)

        trg = trg[1:].contiguous().view(-1)

        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]

        loss = criterion(output, trg)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        # Update tqdm progress bar with the current loss
        train_iterator.set_postfix(loss=loss.item())

        epoch_loss += loss.item()


    return epoch_loss / len(list(iterator))

## Evaluating model in PyTorch
You also need to define a function to evaluate the model. Let's go through the code and understand its components:

1. `evaluate(model, iterator, criterion)` takes three arguments:
   - `model` is the neural network model that will be evaluated.
   - `iterator` is an iterable object that provides the evaluation data in batches.
   - `criterion` is the loss function that measures the model's performance.
* Note that the `evaluate` function does not perform any optimization on the model.

2. The function starts by setting the model to evaluation mode with `model.eval()`.

3. It initializes a variable `epoch_loss` to keep track of the accumulated loss during the evaluation.

4. The function enters a `with torch.no_grad()` block, which ensures that no gradients are computed during the evaluation. This saves memory and speeds up the evaluation process since gradients are not needed for parameter updates.

5. The function iterates over the evaluation data provided by the `iterator`. Each iteration retrieves a batch of input sequences (`src`) and target sequences (`trg`).

6. The input sequences (`src`) and target sequences (`trg`) are moved to the appropriate device (e.g., GPU) using `src = src.to(device)` and `trg = trg.to(device)`.

7. The model is then called with `output = model(src, trg, 0)` to obtain the model's predictions for the target sequences. The third argument `0` is passed to indicate that teacher forcing is turned off during evaluation.  During evaluation, teacher forcing is typically turned off to evaluate the model's ability to generate sequences based on its own predictions.

8. The `output` tensor has dimensions `[trg len, batch size, output dim]`. To calculate the loss, the first time step is dropped (`output[1:]`), because position 0 corresponds to the initial `<bos>` (beginning of sequence) token and is never predicted. The result is then flattened to `[(trg len - 1) * batch size, output dim]`.

9. The target sequences (`trg`) are sliced the same way (`trg[1:]`) and flattened to `[(trg len - 1) * batch size]`, so that each row of the flattened `output` lines up with one target token.

10. The loss between the reshaped `output` and `trg` tensors is calculated using the specified `criterion`.

11. The current batch loss (`loss.item()`) is added to the `epoch_loss` variable.

12. After all the batches have been processed, the function returns the average loss per batch for the entire evaluation, calculated as `epoch_loss / len(list(iterator))`.
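
The slicing and flattening in steps 8 and 9, together with the effect of `torch.no_grad()`, can be sketched with random tensors. The sizes and the choice of `nn.CrossEntropyLoss` as `criterion` are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

trg_len, batch_size, output_dim = 5, 4, 6     # illustrative sizes (assumptions)
output = torch.randn(trg_len, batch_size, output_dim, requires_grad=True)
trg = torch.randint(0, output_dim, (trg_len, batch_size))

criterion = nn.CrossEntropyLoss()             # assumed loss function

with torch.no_grad():
    # drop position 0 (<bos>) and merge the time and batch dimensions
    flat_output = output[1:].view(-1, output_dim)
    flat_trg = trg[1:].contiguous().view(-1)
    loss = criterion(flat_output, flat_trg)

print(flat_output.shape)    # torch.Size([16, 6])
print(flat_trg.shape)       # torch.Size([16])
print(loss.requires_grad)   # False: no graph is recorded inside no_grad
```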


In [None]:
def evaluate(model, iterator, criterion):

    model.eval()

    epoch_loss = 0

    # Wrap iterator with tqdm for progress logging
    valid_iterator = tqdm(iterator, desc="Evaluating", leave=False)

    with torch.no_grad():

        for i, (src,trg) in enumerate(valid_iterator):

            src = src.to(device)
            trg = trg.to(device)

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]

            output = output[1:].view(-1, output_dim)

            trg = trg[1:].contiguous().view(-1)


            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            # Update tqdm progress bar with the current loss
            valid_iterator.set_postfix(loss=loss.item())

            epoch_loss += loss.item()

    return epoch_loss / len(list(iterator))

## Data preprocessing


In this section, you will fetch a language translation dataset called Multi30k, collate it (tokenization, numericalization, and adding BOS/EOS and padding) and create iterable batches of src and trg tensors.

This leverages the predefined `collate_fn` to prepare batches for training the seq2seq model, so that you can focus on the RNN encoder and decoder components.


A "Multi30K_de_en_dataloader.py" file has been created that contains all the transformation processes on data. Here, you only download the file:


In [None]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0205EN-SkillsNetwork/Multi30K_de_en_dataloader.py'

Let's run it:


In [None]:
%run Multi30K_de_en_dataloader.py

There you go! You only need to call the function `get_translation_dataloaders(batch_size = N, flip=True)` with an arbitrary batch size `N`. Setting `flip` to `True` makes the LSTM encoder receive the input sequence in reversed order, which can help training.


In [None]:
train_dataloader, valid_dataloader = get_translation_dataloaders(batch_size = 4)#,flip=True)

You can check the `src` and `trg` tensors:


In [None]:
src, trg = next(iter(train_dataloader))
src,trg

You can also get the English and German strings using the `index_to_eng` and `index_to_german` functions provided in the .py file:


In [None]:
data_itr = iter(train_dataloader)
# moving forward in the dataset to reach sequences of longer length for illustration purpose. (Remember the dataset is sorted on sequence len for optimal padding)
for n in range(1000):
    german, english= next(data_itr)

for n in range(3):
    german, english=next(data_itr)
    german=german.T
    english=english.T
    print("________________")
    print("german")
    for g in german:
        print(index_to_german(g))
    print("________________")
    print("english")
    for e in english:
        print(index_to_eng(e))


* Note: When working with PyTorch tensors that represent data, it's important to understand the conventions for representing sequences. For tabular data, the rows (the first dimension) usually represent individual samples, while the columns (the second dimension) represent features. Batches of sequences for recurrent models such as RNNs, LSTMs, and GRUs follow a different default convention: they have the shape [sequence_length, batch_size, feature_size], where `sequence_length` is the length of the longest sequence in the batch (here it is equivalent to `src_len` or `trg_len`). If you check the `src` tensor above, you can see that the first words of all sentences are in the first row, the second words of all sentences are in the second row, and so on. That is why the first dimension is the length of the sequence and each column holds one sentence.

    When you use `pad_sequence`, it adds padding to the sequences in a batch so that they all match the length of the longest sequence. With its default setting (`batch_first=False`), the sequences run along the first dimension, so the padding is appended along that dimension and the output tensor has the format [sequence_length, batch_size]. (Check the output for `src` and `trg` from the above cell.) This convention is commonly used because PyTorch recurrent layers such as `nn.LSTM` expect sequence-first input by default. If you're accustomed to working with tabular data, this can be confusing at first, so keep the convention in mind when you prepare and format sequence data for your models.
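
A short sketch of `pad_sequence` on three made-up token sequences shows both layouts. The padding index `PAD_IDX = 1` and the token values are assumptions for this illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1   # assumed padding index, for illustration only

sequences = [
    torch.tensor([2, 5, 6, 7, 3]),   # length 5
    torch.tensor([2, 8, 3]),         # length 3
    torch.tensor([2, 9, 4, 3]),      # length 4
]

# Default (batch_first=False): [sequence_length, batch_size]
padded = pad_sequence(sequences, padding_value=PAD_IDX)
print(padded.shape)      # torch.Size([5, 3])
print(padded)            # each column is one sequence, padded at the bottom

# batch_first=True: [batch_size, sequence_length]
padded_bf = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
print(padded_bf.shape)   # torch.Size([3, 5])
```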


# Training the model


### Initializations


This code sets the random seed for various libraries and modules. This is done to make the results reproducible:


In [None]:
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
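
To illustrate what seeding buys you, here is a small sketch that uses only Python's built-in `random` module. The same idea applies to the NumPy and PyTorch generators seeded above.

```python
import random

SEED = 1234

random.seed(SEED)
first_run = [random.random() for _ in range(3)]

random.seed(SEED)  # re-seeding restarts the same pseudo-random sequence
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
```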

### Training
Now, define an instance of the model:

- `enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)`: This line creates an instance of the `Encoder` class, which represents the encoder component of the Seq2Seq model. The `Encoder` class takes the input dimension, embedding dimension, hidden dimension, number of layers, and dropout probability as arguments.

- `dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)`: This line creates an instance of the `Decoder` class, which represents the decoder component of the Seq2Seq model. The `Decoder` class takes the output dimension, embedding dimension, hidden dimension, number of layers, and dropout probability as arguments.

- `model = Seq2Seq(enc, dec, device,trg_vocab = vocab_transform['en']).to(device)`: This line creates an instance of the `Seq2Seq` class, which represents the entire Seq2Seq model. The `Seq2Seq` class takes the encoder, the decoder, the device (e.g., CPU or GPU), and the target vocabulary (`trg_vocab`) as arguments. It combines the encoder and decoder to form the complete Seq2Seq architecture, and `.to(device)` moves the model's parameters to that device.


In [None]:
INPUT_DIM = len(vocab_transform['de'])
OUTPUT_DIM = len(vocab_transform['en'])
ENC_EMB_DIM = 128 #256
DEC_EMB_DIM = 128 #256
HID_DIM = 256 #512
N_LAYERS = 1 #2
ENC_DROPOUT = 0.3 #0.5
DEC_DROPOUT = 0.3 #0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device,trg_vocab = vocab_transform['en']).to(device)

`def init_weights(m)` defines a function named `init_weights` that takes a module `m` as input. The purpose of this function is to initialize the weights of the neural network module.

The next line `for name, param in m.named_parameters():` starts a loop that iterates over the named parameters of the module `m`. Each parameter is accessed as `param` and its corresponding name is accessed as `name`.

`nn.init.uniform_(param.data, -0.08, 0.08)` initializes the parameter's data with values drawn from a uniform distribution between `-0.08` and `0.08`. The `nn.init.uniform_` function is provided by the PyTorch library and fills a tensor in place with uniformly distributed values.

Finally, `model.apply(init_weights)` applies the `init_weights` function to the `model` instance. This ensures that the weights of all the parameters in the model are initialized using the specified uniform distribution.


In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)
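
As a quick sanity check, the sketch below applies the same `init_weights` function to a small stand-in module and verifies that every parameter ends up in the expected range. The `toy` module and its sizes are arbitrary assumptions, not the lab's `Seq2Seq` model.

```python
import torch.nn as nn

def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

# Toy stand-in for the lab's model (layer sizes are arbitrary)
toy = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 3))
toy.apply(init_weights)

# Small tolerance because the parameters are stored as 32-bit floats
all_in_range = all(p.abs().max().item() <= 0.08 + 1e-6 for p in toy.parameters())
print(all_in_range)  # True
```

Because `apply` visits every submodule as well as the module itself, and `named_parameters()` is recursive, some parameters are drawn more than once here. The final values still lie in the same range.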

This code defines a function `count_parameters` that counts the number of trainable parameters in a given model. It then prints the count of trainable parameters in a formatted string.


In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
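
To see what `count_parameters` counts, here is a minimal sketch on a single linear layer whose size can be worked out by hand. The layer is a hypothetical example, not part of the lab's model.

```python
import torch.nn as nn

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

layer = nn.Linear(10, 5)           # 10*5 weights + 5 biases
n_all = count_parameters(layer)
print(n_all)                       # 55

layer.bias.requires_grad = False   # frozen parameters are excluded from the count
n_trainable = count_parameters(layer)
print(n_trainable)                 # 50
```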

The following cell sets up the optimizer and loss function for training the model.

1. `optimizer = optim.Adam(model.parameters())`: This line creates an instance of the Adam optimizer and passes the model's parameters (`model.parameters()`) as the parameters to be optimized. The Adam optimizer is a popular optimization algorithm commonly used for training deep neural networks. It adjusts the model's parameters based on the gradients computed during backpropagation to minimize the loss function.

2. `PAD_IDX = vocab_transform['en'].get_stoi()['<pad>']`: This line retrieves the index of the `<pad>` token in the target vocabulary.

3. `criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)`: This line creates an instance of the CrossEntropyLoss criterion. The CrossEntropyLoss is a commonly used loss function for multi-class classification tasks. In this case, it is used for training the model to predict the next word in the translated sequence. The `ignore_index` parameter is set to `PAD_IDX`, which indicates that the loss should be ignored for any predictions where the target is the padding token. This is useful to exclude padding tokens from contributing to the loss during training.


In [None]:
optimizer = optim.Adam(model.parameters())

PAD_IDX = vocab_transform['en'].get_stoi()['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)
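
The effect of `ignore_index` can be checked on a tiny example. In this sketch, the logits, the targets, and `TOY_PAD_IDX` are invented for illustration; the point is that the padded position does not change the loss.

```python
import torch
import torch.nn as nn

TOY_PAD_IDX = 0  # assumed padding index for this sketch only

# 3 target positions over a vocabulary of 4 tokens; the last position is padding
logits = torch.tensor([[2.0, 0.5, 0.1, 0.1],
                       [0.1, 0.2, 3.0, 0.1],
                       [1.0, 1.0, 1.0, 1.0]])
targets = torch.tensor([1, 2, TOY_PAD_IDX])

toy_criterion = nn.CrossEntropyLoss(ignore_index=TOY_PAD_IDX)
loss_with_pad = toy_criterion(logits, targets)

# The same loss computed on the two non-padding positions only
loss_without_pad = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(torch.isclose(loss_with_pad, loss_without_pad))  # tensor(True)
```

With the default mean reduction, the average is taken over the non-ignored positions only, so padding neither adds to the loss nor dilutes it.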

The following helper function provides a convenient way to calculate the elapsed time in minutes and seconds given the start and end times. It will be used to measure the time taken for each epoch during training or any other time-related calculations.


In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
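
A short usage sketch of the helper, with made-up timestamps standing in for the values returned by `time.time()`:

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# 125.7 seconds -> 2 full minutes and 5 full seconds (fractions are truncated)
mins, secs = epoch_time(0.0, 125.7)
print(mins, secs)  # 2 5
```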

Please be aware that training the model using CPUs can be a time-consuming process. If you don't have access to GPUs, you can jump to "loading the saved model" and proceed with loading the pre-trained model using the provided code. The model has been trained for five epochs and saved for your convenience.


Let's start the training epochs:


In [None]:
torch.cuda.empty_cache()

N_EPOCHS = 3 #run the training for at least 5 epochs
CLIP = 1

best_valid_loss = float('inf')
best_train_loss = float('inf')
train_losses = []
valid_losses = []

train_PPLs = []
valid_PPLs = []

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss = train(model, train_dataloader, optimizer, criterion, CLIP)
    train_ppl = math.exp(train_loss)
    valid_loss = evaluate(model, valid_dataloader, criterion)
    valid_ppl = math.exp(valid_loss)


    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)


    if valid_loss < best_valid_loss:

        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'RNN-TR-model.pt')

    train_losses.append(train_loss)
    train_PPLs.append(train_ppl)
    valid_losses.append(valid_loss)
    valid_PPLs.append(valid_ppl)

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {train_ppl:7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {valid_ppl:7.3f}')
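
The loop above reports perplexity as `math.exp` of the average cross-entropy loss. As a rough intuition, sketched below with a hypothetical vocabulary size, a model that guesses uniformly over the vocabulary has a perplexity equal to the vocabulary size, and lower losses correspond to fewer "effective" choices per token.

```python
import math

vocab_size = 1000  # hypothetical vocabulary size, for illustration only

# Cross-entropy of a uniform guess over the vocabulary
uniform_loss = -math.log(1.0 / vocab_size)
uniform_ppl = math.exp(uniform_loss)
print(round(uniform_ppl))  # 1000

# A loss of 3.0 corresponds to roughly 20 equally likely choices per token
print(round(math.exp(3.0), 3))  # 20.086
```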


Let's visualize the model train and validation losses over the training epochs:


In [None]:
import matplotlib.pyplot as plt



# Create a list of epoch numbers
epochs = [epoch+1 for epoch in range(N_EPOCHS)]

# Create the figure and axes
fig, ax1 = plt.subplots(figsize=(10, 6))
ax2 = ax1.twinx()

# Plotting the training and validation loss
ax1.plot(epochs, train_losses, label='Train Loss', color='blue')
ax1.plot(epochs, valid_losses, label='Validation Loss', color='orange')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss/PPL')

# Plotting the training and validation perplexity
ax2.plot(epochs, train_PPLs, label='Train PPL', color='green')
ax2.plot(epochs, valid_PPLs, label='Validation PPL', color='red')
ax2.set_ylabel('Perplexity')

# Adjust the y-axis scaling for PPL plot
ax2.set_ylim(bottom=min(min(train_PPLs), min(valid_PPLs)) - 10, top=max(max(train_PPLs), max(valid_PPLs)) + 10)

# Set the legend
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
lines = lines1 + lines2
labels = labels1 + labels2
ax1.legend(lines, labels, loc='upper right')


# Show the plot
plt.show()


It can be seen that the loss and perplexity decrease as the model is trained. In a longer run, the validation loss starts to stabilize and then grows at around epoch 9, which suggests that training beyond that point is unnecessary and risks overfitting.


## Loading the saved model
If you want to skip training and load the pre-trained model that is provided, go ahead and uncomment the following cell:


In [None]:
# !wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0201EN-Coursera/RNN-TR-model.pt'
# model.load_state_dict(torch.load('RNN-TR-model.pt',map_location=torch.device('cpu')))

## Model inference


Next, create a generator function that generates translations for input source sentences:


In [None]:
import torch.nn.functional as F

def generate_translation(model, src_sentence, src_vocab, trg_vocab, max_len=50):
    model.eval()  # Set the model to evaluation mode

    with torch.no_grad():
        src_tensor = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1).to(device)

        # Pass the source tensor through the encoder
        hidden, cell = model.encoder(src_tensor)

        # Create a tensor to store the generated translation
        # get_stoi() maps tokens to indices
        trg_indexes = [trg_vocab.get_stoi()['<bos>']]  # Start with <bos> token

        # Convert the initial token to a PyTorch tensor
        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(1)  # Add batch dimension

        # Move the tensor to the same device as the model
        trg_tensor = trg_tensor.to(model.device)


        # Generate the translation
        for _ in range(max_len):

            # Pass the target tensor and the previous hidden and cell states through the decoder
            output, hidden, cell = model.decoder(trg_tensor[-1], hidden, cell)

            # Get the predicted next token
            pred_token = output.argmax(1)[-1].item()

            # Append the predicted token to the translation
            trg_indexes.append(pred_token)


            # If the predicted token is the <eos> token, stop generating
            if pred_token == trg_vocab.get_stoi()['<eos>']:
                break

            # Convert the predicted token to a PyTorch tensor
            trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(1)  # Add batch dimension

            # Move the tensor to the same device as the model
            trg_tensor = trg_tensor.to(model.device)

        # Convert the generated tokens to text
        # get_itos() maps indices to tokens
        trg_tokens = [trg_vocab.get_itos()[i] for i in trg_indexes]

        # Remove the <bos> and <eos> from the translation
        if trg_tokens[0] == '<bos>':
            trg_tokens = trg_tokens[1:]
        if trg_tokens[-1] == '<eos>':
            trg_tokens = trg_tokens[:-1]

        # Return the translation list as a string

        translation = " ".join(trg_tokens)

        return translation
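
The function above performs greedy decoding: at each step it keeps the single most likely next token and stops at `<eos>` or after `max_len` steps. The sketch below shows the same control flow without a neural network. `TOY_SCORES` is a hypothetical lookup table standing in for the decoder, and, unlike the real decoder, it depends only on the previous token rather than on hidden and cell states.

```python
# Hypothetical stand-in for the decoder: next-token scores given the last token
TOY_SCORES = {
    "<bos>": {"a": 0.6, "man": 0.3, "<eos>": 0.1},
    "a":     {"man": 0.7, "walks": 0.2, "<eos>": 0.1},
    "man":   {"walks": 0.8, "a": 0.1, "<eos>": 0.1},
    "walks": {"<eos>": 0.9, "a": 0.1},
}

def greedy_decode(scores, max_len=10):
    tokens = ["<bos>"]
    for _ in range(max_len):
        candidates = scores[tokens[-1]]
        next_token = max(candidates, key=candidates.get)  # argmax over toy scores
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    # Strip the special tokens, as generate_translation does
    if tokens[0] == "<bos>":
        tokens = tokens[1:]
    if tokens and tokens[-1] == "<eos>":
        tokens = tokens[:-1]
    return " ".join(tokens)

print(greedy_decode(TOY_SCORES))             # a man walks
print(greedy_decode(TOY_SCORES, max_len=2))  # a man
```

The second call shows the role of `max_len`: generation is cut off even though `<eos>` was never produced.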

Now, you can check the model's output for a sample sentence:


In [None]:
# model.load_state_dict(torch.load('RNN-TR-model.pt'))

# Actual translation: Asian man sweeping the walkway.
src_sentence = 'Ein asiatischer Mann kehrt den Gehweg.'


generated_translation = generate_translation(model, src_sentence=src_sentence, src_vocab=vocab_transform['de'], trg_vocab=vocab_transform['en'], max_len=12)
#generated_translation = " ".join(generated_translation_list).replace("<bos>", "").replace("<eos>", "")
print(generated_translation)


Fantastic! You have created a translation model that can generate German-to-English translations. Pretty accurate, huh?

You can play with the model parameters and hyperparameters to improve the model performance.


## BLEU score metric for evaluation
While perplexity is a general metric for how well a language model predicts the correct next token, the BLEU score is helpful in evaluating the quality of the final generated translation.
Validating the results with the BLEU score is especially helpful when a sentence has more than one valid translation, because you can include several versions in the reference list and compare the generated translation against all of them.

The BLEU (Bilingual Evaluation Understudy) score is a metric commonly used to evaluate the quality of machine-generated translations by comparing them to one or more reference translations. It measures the similarity between the generated translation and the reference translations based on n-gram matching.

The BLEU score is calculated using the following formulas:

1. **Precision**:
   - Precision measures the proportion of n-grams in the generated translation that appear in the reference translations.
   - Precision is calculated for each n-gram order (1 to N) and then combined using a geometric mean.
   - The precision for a particular n-gram order is calculated as:
   
   $$\text{Precision}_n(t) = \frac{\text{CountClip}_n(t)}{\text{Count}_n(t)}$$
   
   where:
     - $\text{CountClip}_n(t)$ is the count of n-grams in the generated translation that appear in any reference translation, clipped by the maximum count of that n-gram in any single reference translation.
     - $\text{Count}_n(t)$ is the count of n-grams in the generated translation.

2. **Brevity penalty**:
   - The brevity penalty accounts for the fact that shorter translations tend to have higher precision scores.
   - It encourages translations that are closer in length to the reference translations.
   - The brevity penalty is calculated as:
   
   $$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - \frac{r}{c})} & \text{if } c \leq r \end{cases}$$
   
   where:
     - $c$ is the total length of the generated translation.
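     - $r$ is the effective reference length (for a single sentence, typically the length of the reference translation that is closest in length to the generated translation).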

3. **BLEU score**:
   - The BLEU score is the geometric mean of the precisions, weighted by the brevity penalty.
   - It is calculated as:
   
   $$\text{BLEU} = \text{BP} \cdot \exp(\sum_{n=1}^{N}w_n \log(\text{Precision}_n(t)))$$
   
   where:
     - $N$ is the maximum n-gram order.
     - $w_n$ is the weight assigned to the precision at n-gram order $n$, commonly set as $\frac{1}{N}$ for equal weights.
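
The three formulas above can be sketched in plain Python as follows. This is an illustrative, unsmoothed implementation written for this explanation, not the NLTK code used in the next cell; the helper names and the example sentences are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, references, n):
    hyp_counts = Counter(ngrams(hypothesis, n))
    # Maximum count of each n-gram in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(c, r):
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(hypothesis, references, max_n=4):
    precisions = [clipped_precision(hypothesis, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; this unsmoothed sketch returns 0
    c = len(hypothesis)
    # Effective reference length: the reference length closest to c (ties -> shorter)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    log_mean = sum(math.log(p) for p in precisions) / max_n  # equal weights 1/N
    return brevity_penalty(c, r) * math.exp(log_mean)

hyp = "an asian man sweeping the sidewalk .".split()
refs = ["an asian man sweeping the walkway .".split(),
        "an asian man is sweeping the sidewalk .".split()]
print(round(bleu(hyp, refs), 4))  # 0.9306
```

In this example every 1-gram, 2-gram, and 3-gram of the hypothesis appears in some reference, three of the four 4-grams do, and the brevity penalty is 1, so the score is $0.75^{1/4} \approx 0.9306$.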


In [None]:
def calculate_bleu_score(generated_translation, reference_translations):
    # Convert the generated translations and reference translations into the expected format for sentence_bleu
    references = [reference.split() for reference in reference_translations]
    hypothesis = generated_translation.split()

    # Calculate the BLEU score
    bleu_score = sentence_bleu(references, hypothesis)

    return bleu_score

Let's calculate the BLEU score for a sample sentence:


In [None]:
reference_translations = [
    "Asian man sweeping the walkway .",
    "An asian man sweeping the walkway .",
    "An Asian man sweeps the sidewalk .",
    "An Asian man is sweeping the sidewalk .",
    "An asian man is sweeping the walkway .",
    "Asian man sweeping the sidewalk ."
]

bleu_score = calculate_bleu_score(generated_translation, reference_translations)
print("BLEU Score:", bleu_score)

# Exercises


### Exercise 1 - Translate a German sentence to English.


In [None]:
# Define the German text to be translated
german_text = "Menschen gehen auf der Straße"

...

<details>
    <summary>Click here for Solution</summary>

```python
german_text = "Menschen gehen auf der Straße"

# generate_translation takes the model, the source sentence, the source and target vocabularies, and a maximum output length.
english_translation = generate_translation(
    model, 
    src_sentence=german_text, 
    src_vocab=vocab_transform['de'], 
    trg_vocab=vocab_transform['en'], 
    max_len=50
)

# Display the original and translated text
print(f"Original German text: {german_text}")
print(f"Translated English text: {english_translation}")
```

</details>


## Authors


[Fateme Akbari](https://www.linkedin.com/in/fatemeakbari/) is a PhD candidate in Information Systems at McMaster University with demonstrated research experience in Machine Learning and NLP.


## Change log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2020-07-17|0.1|Sam|Create Lab Template|


© Copyright IBM Corporation. All rights reserved.
