# Language Translation using Transformer Architecture: English to Telugu

## Objective
The objective of this project is to develop a neural machine translation model using the Transformer architecture. The model is trained to translate text from English to Telugu, using attention mechanisms and positional encoding to improve translation accuracy and efficiency.

**`Note:`** The Transformer model was trained from scratch. After experimenting with different hyperparameters, the values below gave the best results so far. For inference, both **Greedy Search** and **Beam Search** were implemented; the beam width is the hyperparameter that sets how many top-scoring candidates are kept at each decoding step.

* Hyperparameters:

    * embedding_dim = 256          # dimensionality of the embeddings used for tokens in the input and target sequences
    * fully_connected_dim = 512    # dimensionality of the hidden layer of the feedforward neural network within the Transformer block
    * num_layers = 4               # number of Transformer blocks in the encoder and decoder stacks
    * num_heads = 8                # number of heads in the multi-head attention mechanism
    * dropout_rate = 0.1           # dropout rate for regularization

    * input_vocab_size = 20394    
    * target_vocab_size = 32926   

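    * max_positional_encoding_input = 20394    # number of positions in the precomputed positional-encoding table for the input (only needs to be >= MAX_LEN)
    * max_positional_encoding_target = 32926  # number of positions in the precomputed positional-encoding table for the target (only needs to be >= MAX_LEN)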

    * EPOCHS = 120
    * batch_size = 32

    * MAX_LEN = 325
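
The hyperparameters above constrain one another. The short check below is a sketch rather than part of the original notebook: it copies the listed values and verifies that the embedding splits evenly across the attention heads and that the positional-encoding table covers the padded sequence length. The name `depth_per_head` is introduced here for illustration only.

```python
# Hyperparameter values copied from the list above.
embedding_dim = 256
num_heads = 8
MAX_LEN = 325
max_positional_encoding_input = 20394
max_positional_encoding_target = 32926

# Each attention head works on an equal slice of the embedding.
assert embedding_dim % num_heads == 0
depth_per_head = embedding_dim // num_heads  # illustrative name
print(depth_per_head)  # 32

# The positional-encoding tables must cover at least MAX_LEN positions.
assert max_positional_encoding_input >= MAX_LEN
assert max_positional_encoding_target >= MAX_LEN
```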

## Steps Involved
1. Data Preparation
2. Model Architecture
3. Training
4. Translation Methods
5. Evaluation
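
Step 4 covers greedy and beam-search decoding. As a rough illustration of the difference, here is a toy beam search in plain Python. It is a sketch under stated assumptions, not the notebook's implementation: `step_fn` and `toy_step` are hypothetical stand-ins for the trained model and return next-token log-probabilities from a hand-written table.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Toy beam search. `step_fn(prefix)` is a hypothetical stand-in for the
    model: it returns {token: log_prob} for the next token given a prefix."""
    beams = [([start_token], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))  # finished beams carry over
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Keep only the `beam_width` highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]

def toy_step(prefix):
    # Hand-written probabilities, chosen so that the greedy choice is suboptimal.
    table = {
        ("<s>",): {"a": math.log(0.6), "b": math.log(0.4)},
        ("<s>", "a"): {"x": math.log(0.55), "y": math.log(0.45)},
        ("<s>", "b"): {"z": math.log(0.9), "x": math.log(0.1)},
    }
    return table.get(tuple(prefix), {"</s>": 0.0})

greedy = beam_search(toy_step, "<s>", "</s>", beam_width=1)  # same as greedy search
beam = beam_search(toy_step, "<s>", "</s>", beam_width=2)
print(greedy)  # ['<s>', 'a', 'x', '</s>']  (probability 0.33)
print(beam)    # ['<s>', 'b', 'z', '</s>']  (probability 0.36)
```

With a beam width of 1 the procedure reduces to greedy search; a wider beam can recover a sequence whose first token is not the locally best choice.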

### Data Preparation

#### Importing Necessary Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import os 

# from wordcloud import WordCloud
from tensorflow.python.ops.numpy_ops import np_config
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dense, Input, Dropout, LayerNormalization

import pickle
from datetime import datetime
import time
from tqdm import tqdm

np_config.enable_numpy_behavior()



#### Data Preprocessing

In [3]:
dataset_path = "input/telugu-english-translation-alpaca/yahma_alpaca_cleaned_telugu_filtered_and_romanized.csv"

df = pd.read_csv(dataset_path)

cols = ["instruction", "telugu_instruction", "output", "telugu_output"]
df = df[cols]
df['instruction_len'] = df['instruction'].apply(lambda x: len(x.split()))
df['output_len'] = df["output"].apply(lambda x: len(x.split()))
df.head()

Unnamed: 0,instruction,telugu_instruction,output,telugu_output,instruction_len,output_len
0,Give three tips for staying healthy.,ఆరోగ్యంగా ఉండటానికి మూడు చిట్కాలు ఇవ్వండి.,1. Eat a balanced and nutritious diet: Make su...,1. సమతుల్య మరియు పోషకమైన ఆహారం తినండి: మీ భోజన...,6,121
1,What are the three primary colors?,మూడు ప్రాధమిక రంగులు ఏమిటి?,"The three primary colors are red, blue, and ye...","మూడు ప్రాధమిక రంగులు ఎరుపు, నీలం మరియు పసుపు. ...",6,53
2,Describe the structure of an atom.,పరమాణువు నిర్మాణాన్ని వివరించండి.,An atom is the basic building block of all mat...,పరమాణువు అనేది అన్ని పదార్ధాల యొక్క ప్రాథమిక బ...,6,209
3,How can we reduce air pollution?,వాయు కాలుష్యాన్ని ఎలా తగ్గించవచ్చు?,There are several ways to reduce air pollution...,వాయు కాలుష్యాన్ని తగ్గించడానికి అనేక మార్గాలు ...,6,216
4,Pretend you are a project manager of a constru...,మీరు ఒక కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేన...,I had to make a difficult decision when I was ...,ఓ కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేనేజర్ గ...,21,133


In [4]:
import pandas as pd
import re

def preprocess_text(text, is_telugu=False):
    # Lowercase
    text = text.lower()
    
    # Remove a trailing period (applies to both English and Telugu text)
    text = re.sub(r"\.$", '', text)

    if is_telugu:
        # Also remove a trailing ideographic full stop (U+3002) from Telugu text
        text = re.sub("。$", '', text)
    
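    # Intended to pad punctuation with spaces. As written, the unescaped "]" after "\\"
    # closes the character class early, so the pattern only matches the literal "}~]"
    # and punctuation in the sample output below is left unspaced.
    text = re.sub(r"([!#$%&\()*+,-./:;<=>?@[\\]^_`{|}~])", r" \1 ", text)
    text = re.sub(r"['\"]", '', text)  # Remove quotes directly
    text = text.replace(",", '')  # Remove commas
    text = text.replace("comma", '')  # Remove the literal substring "comma"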
    # Remove digits
    text = re.sub(r'\d+', '', text)
    
    # Normalize spacing (e.g., after removing or altering punctuation)
    text = re.sub(r'\s+', ' ', text).strip()
    
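    # Remove special characters at the start of the string.
    # Caveat: the class is ASCII-only, so for Telugu text it also removes the whole
    # first word (visible in the sample output below).
    pattern = r'^[^a-zA-Z0-9\s]+'
    text = re.sub(pattern, '', text)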
    return text

# Apply preprocessing
df['instruction'] = df['instruction'].apply(lambda x: preprocess_text(x))
df['telugu_instruction'] = df['telugu_instruction'].apply(lambda x: preprocess_text(x, is_telugu=True))

df['output'] = df['output'].apply(lambda x: preprocess_text(x))
df['telugu_output'] = df['telugu_output'].apply(lambda x: preprocess_text(x, is_telugu=True))
# Review a sample
df.head()


Unnamed: 0,instruction,telugu_instruction,output,telugu_output,instruction_len,output_len
0,give three tips for staying healthy,ఉండటానికి మూడు చిట్కాలు ఇవ్వండి,eat a balanced and nutritious diet: make sure...,సమతుల్య మరియు పోషకమైన ఆహారం తినండి: మీ భోజనంల...,6,121
1,what are the three primary colors?,ప్రాధమిక రంగులు ఏమిటి?,the three primary colors are red blue and yell...,ప్రాధమిక రంగులు ఎరుపు నీలం మరియు పసుపు. ఈ రంగ...,6,53
2,describe the structure of an atom,నిర్మాణాన్ని వివరించండి,an atom is the basic building block of all mat...,అనేది అన్ని పదార్ధాల యొక్క ప్రాథమిక బిల్డింగ్...,6,209
3,how can we reduce air pollution?,కాలుష్యాన్ని ఎలా తగ్గించవచ్చు?,there are several ways to reduce air pollution...,కాలుష్యాన్ని తగ్గించడానికి అనేక మార్గాలు ఉన్న...,6,216
4,pretend you are a project manager of a constru...,ఒక కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేనేజర్...,i had to make a difficult decision when i was ...,కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేనేజర్ గా...,21,133
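
Comparing the Telugu columns before and after preprocessing shows that the first word of each Telugu sentence is missing. This follows from the ASCII-only pattern `^[^a-zA-Z0-9\s]+`, which treats Telugu letters as special characters. The snippet below reproduces the effect on one sentence from the dataset and contrasts it with a Unicode-aware pattern. The `\w`-based pattern is an assumed alternative for illustration, not what the notebook used.

```python
import re

sample = "ఆరోగ్యంగా ఉండటానికి మూడు చిట్కాలు ఇవ్వండి"

# Pattern used in preprocess_text: anything that is not ASCII alphanumeric
# or whitespace counts as a "special character", including Telugu letters.
ascii_only = re.sub(r'^[^a-zA-Z0-9\s]+', '', sample)

# Assumed alternative: \w is Unicode-aware for str patterns in Python 3,
# so a leading Telugu letter is not treated as punctuation.
unicode_aware = re.sub(r'^[^\w\s]+', '', sample)

print(repr(ascii_only))     # the first word is removed
print(repr(unicode_aware))  # the sentence is unchanged
```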


**Selecting first 2 sentences from the large corpus in each row**

In [6]:
cols = ["output", "telugu_output"]
df_1 = df[cols].copy()

# Select the first two sentences
df_1['output'] = df_1['output'].apply(lambda x: x.split(".")[:2])
df_1['telugu_output'] = df_1['telugu_output'].apply(lambda x: x.split(".")[:2])

# Merge the sentences
df_1['output'] = df_1['output'].apply(lambda x: '.'.join(map(str, x)))
df_1['telugu_output'] = df_1['telugu_output'].apply(lambda x: '.'.join(map(str, x)))

# check the token len
df_1['output_tlen'] = df_1['output'].apply(lambda x: len(x.split()))
df_1['tel_output_tlen'] = df_1['telugu_output'].apply(lambda x: len(x.split()))

df_1.head(10)

Unnamed: 0,output,telugu_output,output_tlen,tel_output_tlen
0,eat a balanced and nutritious diet: make sure...,సమతుల్య మరియు పోషకమైన ఆహారం తినండి: మీ భోజనంల...,47,34
1,the three primary colors are red blue and yell...,ప్రాధమిక రంగులు ఎరుపు నీలం మరియు పసుపు. ఈ రంగ...,36,28
2,an atom is the basic building block of all mat...,అనేది అన్ని పదార్ధాల యొక్క ప్రాథమిక బిల్డింగ్...,43,26
3,there are several ways to reduce air pollution...,కాలుష్యాన్ని తగ్గించడానికి అనేక మార్గాలు ఉన్న...,26,20
4,i had to make a difficult decision when i was ...,కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేనేజర్ గా...,42,29
5,the commodore was a highly successful -bit hom...,అనేది లో కమోడోర్ బిజినెస్ మెషిన్ (సిబిఎం) తయా...,49,41
6,the fraction / is equivalent to / because both...,/ / కు సమానం ఎందుకంటే రెండు భాగాలు ఒకే విలువన...,31,23
7,here are ten items a person might need for a c...,ట్రిప్ కోసం ఒక వ్యక్తికి అవసరమైన పది అంశాలు ఇ...,23,18
8,the great depression was a period of economic ...,మాంద్యం అనేది - వరకు కొనసాగిన ఆర్థిక క్షీణత క...,39,30
9,the motherboard also known as the mainboard or...,బోర్డ్ మెయిన్ బోర్డ్ లేదా సిస్టమ్ బోర్డ్ అని ...,45,37


In [7]:
df_1.output.values[0], df_1.telugu_output.values[0]

(' eat a balanced and nutritious diet: make sure your meals are inclusive of a variety of fruits and vegetables lean protein whole grains and healthy fats. this helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases',
 ' సమతుల్య మరియు పోషకమైన ఆహారం తినండి: మీ భోజనంలో వివిధ రకాల పండ్లు మరియు కూరగాయలు సన్నని ప్రోటీన్ తృణధాన్యాలు మరియు ఆరోగ్యకరమైన కొవ్వులు ఉన్నాయని నిర్ధారించుకోండి. ఇది మీ శరీరానికి ఉత్తమంగా పనిచేయడానికి అవసరమైన పోషకాలను అందించడంలో సహాయపడుతుంది మరియు దీర్ఘకాలిక వ్యాధులను నివారించడంలో సహాయపడుతుంది')

#### Data Preparation for training
- Sampling
- Tokenizing
- Padding

In [8]:
n_samples = 10000

df_1 = df_1.dropna().reset_index(drop=True)
df_sample = df_1.sample(n_samples)


# Input Sentences - English
inp_sentences_untok = df_sample.output.tolist()

# Target Sentences - Telugu
target_sentences_untok =  df_sample.telugu_output.tolist()

In [10]:
for i in range(len(target_sentences_untok)):
    target_sentences_untok[i] = "<start> " + str(target_sentences_untok[i]) + " <end>"
    inp_sentences_untok[i] = "<start> " + str(inp_sentences_untok[i]) + " <end>"

In [11]:
num_words = 15000
tokenizer_tar = Tokenizer(num_words=num_words, filters='!#$%&()*+,-/:;<=>@«»""[\\]^_`{|}~\t\n')
tokenizer_tar.fit_on_texts(target_sentences_untok)
target_sentences = tokenizer_tar.texts_to_sequences(target_sentences_untok)

word_index = tokenizer_tar.word_index
print(f"The number of words in the Telugu vocabulary: {len(word_index)}")

The number of words in the Telugu vocabulary: 32925


In [12]:
tokenizer_inp = Tokenizer(num_words=num_words, filters='!#$%&()*+,-/:;<=>@«»""[\\]^_`{|}~\t\n')
tokenizer_inp.fit_on_texts(inp_sentences_untok)
inp_sentences = tokenizer_inp.texts_to_sequences(inp_sentences_untok)

word_index_inp = tokenizer_inp.word_index
print(f"The number of words in the English vocabulary: {len(word_index_inp)}")

The number of words in the English vocabulary: 20393


In [14]:
# Find the maximum tokenized sentence length in English (input) and Telugu (target).
max_len_en = max(len(lst) for lst in inp_sentences)
max_len_tel = max(len(lst) for lst in target_sentences)

MAX_LEN = max(max_len_en, max_len_tel)
print(f"Max token length in English: {max_len_en}, in Telugu: {max_len_tel} and selected max-len for padding is: {MAX_LEN}")

Max token length in English: 325, in Telugu: 265 and selected max-len for padding is: 325


In [15]:
target_sentences = pad_sequences(target_sentences, maxlen = MAX_LEN, padding='post', truncating='post')
inp_sentences = pad_sequences(inp_sentences, maxlen=MAX_LEN, padding='post', truncating='post')
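
To make the effect of `padding='post'` and `truncating='post'` concrete, the following pure-Python sketch mimics that behaviour on two short sequences. `pad_post` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
def pad_post(seqs, maxlen, value=0):
    """Pure-Python sketch of post-padding and post-truncation."""
    out = []
    for seq in seqs:
        seq = list(seq)[:maxlen]                          # truncating='post': drop the tail
        out.append(seq + [value] * (maxlen - len(seq)))   # padding='post': zeros at the end
    return out

padded = pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(padded)  # [[5, 8, 2, 0], [7, 1, 9, 4]]
```

Because padding uses the value 0, the padding masks defined later can identify padded positions by comparing token ids with 0.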

## Model Architecture

   
#### Helper Functions for transformer

### Positional Encoding

In [16]:
def get_angles(pos, i, embedding_dim):
    """
    Function to compute the angles for positional encoding.
    
    Returns the angle computed
    """
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(embedding_dim))
    return pos * angle_rates


def positional_encoding(position, embedding_dim):
    """
    Builds the sinusoidal positional-encoding table that is added to the token embeddings.
    
    Applies sin to the even dimensions and cos to the odd dimensions of the angles from
    get_angles() and returns a float32 tensor of shape (1, position, embedding_dim).
    """
    angle_rads = get_angles(np.arange(position)[:, np.newaxis], 
                           np.arange(embedding_dim)[np.newaxis, :], embedding_dim)
    
    # apply sin to even indices in the array. ie 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    
    # apply cos to odd indices in the array. ie 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)
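
As a quick sanity check of the sinusoidal scheme, the NumPy-only sketch below recomputes the table for a tiny configuration. `positional_encoding_np` is an illustrative re-implementation that skips the final cast to a TensorFlow tensor; the sizes are arbitrary.

```python
import numpy as np

def positional_encoding_np(num_positions, embedding_dim):
    pos = np.arange(num_positions)[:, np.newaxis]
    i = np.arange(embedding_dim)[np.newaxis, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / np.float32(embedding_dim))
    angles[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions
    angles[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions
    return angles[np.newaxis, ...]

pe = positional_encoding_np(num_positions=5, embedding_dim=8)
print(pe.shape)   # (1, 5, 8)
print(pe[0, 0])   # position 0: sin(0) = 0 on even dims, cos(0) = 1 on odd dims
```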

### Masking

In [18]:
def create_padding_mask(seq):
    """
    Creates a padding mask for a given sequence.
    
    Args:
        seq (tensor): A tensor of shape (batch_size, seq_len) containing the sequence.
        
    Returns:
        A tensor of shape (batch_size, 1, 1, seq_len) containing a mask that is 1 where the sequence is padded, and 0 otherwise.
    """
    # Build a float mask that is 1.0 where the token is padding (id 0) and 0.0 elsewhere.
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    
    # Add an extra dimension to the mask to add the padding to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]

def create_look_ahead_mask(size):
    """
    Creates a look-ahead mask used during training the decoder of a transformer.

    Args:
        size (int): The size of the mask.

    Returns:
        tf.Tensor: A strictly upper triangular matrix of shape (size, size) with ones above
            the diagonal (future positions to mask) and zeros on and below the diagonal.
    """
    # 1 - lower triangle (including the diagonal) leaves ones only above the diagonal
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    
    return mask

def create_masks(inputs, targets):
    """
    Creates masks for the input sequence and target sequence.
    
    Args:
        inputs: Input sequence tensor.
        targets: Target sequence tensor.
    
    Returns:
        A tuple of three masks: the encoder padding mask, the combined mask used in the first attention block, 
        and the decoder padding mask used in the second attention block.
    """
    
    # Create the encoder padding mask.
    enc_padding_mask = create_padding_mask(inputs)
        
    # Create the decoder padding mask.
    dec_padding_mask = create_padding_mask(inputs)
        
    # Create the look ahead mask for the first attention block.
    # It is used to pad and mask future tokens in the tokens received by the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(targets)[1])
    
    # Create the decoder target padding mask.
    dec_target_padding_mask = create_padding_mask(targets)
    
    # Combine the look ahead mask and decoder target padding mask for the first attention block.
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)
        
    return enc_padding_mask, combined_mask, dec_padding_mask
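
The interaction between the padding mask and the look-ahead mask is easiest to see on a tiny example. The NumPy sketch below mirrors the logic of the functions above; `padding_mask_np` and `look_ahead_mask_np` are illustrative stand-ins, not the TensorFlow versions.

```python
import numpy as np

def padding_mask_np(seq):
    # 1.0 where the token id is 0 (padding), else 0.0; shape (batch, 1, 1, seq_len)
    return (seq == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

def look_ahead_mask_np(size):
    # 1.0 strictly above the diagonal (future positions), 0.0 elsewhere
    return 1.0 - np.tril(np.ones((size, size), dtype=np.float32))

targets = np.array([[4, 7, 0]])  # one sequence whose last token is padding
combined = np.maximum(padding_mask_np(targets), look_ahead_mask_np(3))
print(combined.shape)  # (1, 1, 3, 3)
print(combined[0, 0])
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 1.]]
```

A value of 1 marks a position that attention must ignore: row `i` hides every position after `i`, and the last column is hidden for all rows because it is padding.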


### Attention Mechanism - Multihead Attention


In [19]:
def scaled_dot_product_attention(q, k, v, mask):
    """
    Computes the scaled dot product attention weight for the query (q), key (k), and value (v) vectors. 
    The attention weight is a measure of how much focus should be given to each element in the sequence of values (v) 
    based on the corresponding element in the sequence of queries (q) and keys (k).
    
    Args:
    q: query vectors; shape (..., seq_len_q, depth)
    k: key vectors; shape  (..., seq_len_k, depth)
    v: value vectors; shape  (..., seq_len_v, depth_v)
    mask: (optional) mask to be applied to the attention weights
    
    Returns:
    output: The output of the scaled dot product attention computation; shape   (..., seq_len_q, depth_v)
    attention_weights: The attention weights
    """
    # Compute dot product of query and key vectors
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    
    # Compute the square root of the depth of the key vectors
    dk = tf.cast(tf.shape(k)[-1], dtype=tf.float32)
    scaled_dk = tf.math.sqrt(dk)
    
    # Compute scaled attention logits by dividing dot product by scaled dk
    scaled_attention_logits = matmul_qk / scaled_dk
    
    # Apply mask to the attention logits (if mask is not None)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
        
    # Apply softmax to the scaled attention logits to get the attention weights
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    
    # Compute the weighted sum of the value vectors using the attention weights
    output = tf.matmul(attention_weights, v)
    
    return output, attention_weights
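
The same computation can be sketched in NumPy to show how the mask shapes the attention weights. `attention_np` is an illustrative re-implementation with random toy inputs; it subtracts the row maximum before the softmax for numerical stability, which does not change the result.

```python
import numpy as np

def attention_np(q, k, v, mask=None):
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9           # masked positions become very negative
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 3, 4))
k = rng.normal(size=(1, 3, 4))
v = rng.normal(size=(1, 3, 4))
mask = np.triu(np.ones((3, 3)), k=1)  # look-ahead mask: 1 marks future positions

out, w = attention_np(q, k, v, mask)
print(out.shape)            # (1, 3, 4)
print(np.round(w[0], 2))    # entries above the diagonal are 0; each row sums to 1
```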


In [20]:
class MultiHeadAttention(tf.keras.layers.Layer):
    """
    MultiHeadAttention Layer that implements the attention mechanism for the Transformer.
    It splits the input into multiple heads, computes scaled dot-product attention for each head
    and then concatenates the output of the heads and passes it through a dense layer.
    """

    def __init__(self, key_dim, num_heads, dropout_rate=0.0):
        """
        Initializes the MultiHeadAttention layer.
    
        Args:
            key_dim (int): The dimensionality of the key space.
            num_heads (int): The number of attention heads.
            dropout_rate (float): The dropout rate to apply after the output dense layer.
        """
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        #  ensure  that the dimension of the embedding can be evenly split across attention heads
        assert key_dim % num_heads == 0 
        self.depth = self.key_dim // self.num_heads
        
        # dense layers to project the input into queries, keys and values
        self.wq = Dense(key_dim)
        self.wk = Dense(key_dim)
        self.wv = Dense(key_dim)
    
        # dropout layer
        self.dropout = Dropout(dropout_rate)
    
        # dense layer to project the output of the attention heads
        self.dense = Dense(key_dim)
        
    def split_heads(self, x, batch_size):
        """
        Splits the last dimension of the tensor into (num_heads, depth).
        Transposes the result such that the shape is (batch_size, num_heads, seq_len, depth).
    
        Args:
            x (tensor): The tensor to be split.
            batch_size (int): The size of the batch.
    
        Returns:
            tensor: The tensor with the last dimension split into (num_heads, depth) and transposed.
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
        
    def call(self, v, k, q, mask=None):
        """
        Applies the multi-head attention mechanism to the inputs.
    
        Args:
            v (tensor): The value tensor of shape (batch_size, seq_len_v, key_dim).
            k (tensor): The key tensor of shape (batch_size, seq_len_k, key_dim).
            q (tensor): The query tensor of shape (batch_size, seq_len_q, key_dim).
            mask (tensor, optional): The mask tensor, broadcastable to
                                     (batch_size, num_heads, seq_len_q, seq_len_k). Defaults to None.
    
        Returns:
            tensor: The output tensor of shape (batch_size, seq_len_q, key_dim).
            tensor: The attention weights tensor of shape (batch_size, num_heads, seq_len_q, seq_len_k).
        """
        batch_size = tf.shape(q)[0]
        
        # Dense on the q, k, v vectors
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        
        # split the heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        
        # compute scaled dot-product attention for every head, then move the head axis back
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        
        # reshape and add Dense layer
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.key_dim))
        output = self.dense(concat_attention)
        output = self.dropout(output)
        
        return output, attention_weights
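
The reshape and transpose in `split_heads`, and the inverse applied after attention, can be checked with NumPy alone. The sketch below uses the project's `key_dim = 256` and `num_heads = 8`; the batch size and sequence length are arbitrary toy values.

```python
import numpy as np

batch_size, seq_len, key_dim, num_heads = 2, 5, 256, 8
depth = key_dim // num_heads

x = np.arange(batch_size * seq_len * key_dim, dtype=np.float32)
x = x.reshape(batch_size, seq_len, key_dim)

# split_heads: (batch, seq_len, key_dim) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch_size, -1, num_heads, depth).transpose(0, 2, 1, 3)
print(heads.shape)  # (2, 8, 5, 32)

# Inverse used after attention: transpose back, then merge the head axis.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, -1, key_dim)
print(np.array_equal(merged, x))  # True
```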
    

### Fully Connected Neural Network
    
def FeedForward(embedding_dim, fully_connected_dim):
    """Create a fully connected feedforward neural network.
    
    Args:
        embedding_dim (int): Dimensionality of the embedding output from the transformer layer.
        fully_connected_dim (int): Number of neurons in the fully connected layers.
    
    Returns:
        tf.keras.Sequential: A fully connected feedforward neural network with the specified architecture.
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),
        tf.keras.layers.Dense(embedding_dim)
    ])
    return model


### Encoder

In [21]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1):
        """Initializes the encoder layer
        
        Args: 
            embedding_dim: The dimensionality of the input and output of this layer
            num_heads: The number of attention heads to use in the multi-head attention layer
            fully_connected_dim: The dimensionality of the hidden layer in the feedforward network
            dropout_rate: The rate of dropout to apply to the output of this layer during training
            
        Returns:
            A new instance of the EncoderLayer class
        """
        super(EncoderLayer, self).__init__()
        
        # Multi-head self-attention mechanism
        self.mha = MultiHeadAttention(embedding_dim, num_heads, dropout_rate)
        
        # Layer normalization
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        
        # Dropout
        self.dropout = Dropout(dropout_rate)
        
        # Feedforward network
        self.ffn = FeedForward(embedding_dim, fully_connected_dim)
        
    def call(self, x, training, mask):
        """Applies the encoder layer to the input tensor
        
        Args:
            x: The input tensor to the encoder layer
            training: A boolean indicating whether the model is in training mode
            mask: A tensor representing the mask to apply to the attention mechanism
            
        Returns:
            The output of the encoder layer after applying the multi-head attention and feedforward network
        """
        
        # Apply multi-head self-attention mechanism to input tensor
        attn_output, _ = self.mha(x, x, x, mask)
        
        # Apply first layer normalization and add residual connection
        out1 = self.layernorm1(attn_output + x)
        
        # Apply feedforward network to output of first layer normalization
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout(ffn_output, training=training)
        
        # Apply second layer normalization and add residual connection
        out2 = self.layernorm2(ffn_output + out1)
        
        return out2
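
Each sub-layer in the encoder layer follows the same pattern: add the sub-layer output to its input, then apply layer normalization. The NumPy sketch below illustrates that pattern with random data. `layer_norm_np` is a simplified stand-in that omits the learnable scale and offset of `LayerNormalization`, and `sublayer_out` is a placeholder for the attention or feedforward output.

```python
import numpy as np

def layer_norm_np(x, epsilon=1e-6):
    # Normalize over the last (feature) axis; learnable scale/offset omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))          # (batch, seq_len, embedding_dim)
sublayer_out = rng.normal(size=x.shape)  # stand-in for attention or FFN output

out = layer_norm_np(x + sublayer_out)    # residual connection, then normalization
print(out.shape)  # (2, 5, 16)
```

After normalization every position has approximately zero mean and unit variance across its features, which keeps activations on a comparable scale as layers are stacked.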
    

#### Encoder 
    
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        """
        Initializes the Encoder layer of the Transformer model.
        
        Args:
            num_layers (int): Number of EncoderLayers to stack.
            embedding_dim (int): Dimensionality of the token embedding space.
            num_heads (int): Number of attention heads to use in MultiHeadAttention layers.
            fully_connected_dim (int): Dimensionality of the fully connected layer in the EncoderLayer.
            input_vocab_size (int): Size of the input vocabulary.
            maximum_position_encoding (int): Maximum length of input sequences for positional encoding.
            dropout_rate (float): Probability of dropping out units during training.

        """
        super(Encoder, self).__init__()
        
        self.num_layers = num_layers
        self.embedding_dim = embedding_dim
        
        # Embedding layer
        self.embedding = Embedding(input_vocab_size, embedding_dim)
        
        # Positional encoding
        self.pos_encoding = positional_encoding(maximum_position_encoding, embedding_dim)
        
        # Encoder layers
        self.enc_layers = [EncoderLayer(embedding_dim, num_heads, fully_connected_dim, dropout_rate) for _ in range(num_layers)]
        
        # Dropout layer
        self.dropout = Dropout(dropout_rate)
        
    def call(self, inputs, training, mask):
        """
        Call function for the Encoder layer.
        
        Args:
            inputs: tensor of shape (batch_size, sequence_length) representing input sequences
            training: boolean indicating if the model is in training mode
            mask: padding mask of shape (batch_size, 1, 1, sequence_length), broadcast over the attention logits

        Returns:
            A tensor of shape (batch_size, sequence_length, embedding_dim) representing the encoded sequence
        """

        # Get the sequence length
        seq_len = tf.shape(inputs)[1]

        # Embed the input sequence
        inputs = self.embedding(inputs)

        # Scale the embeddings by sqrt(embedding_dim)
        inputs *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))

        # Add positional encodings to the input sequence
        inputs += self.pos_encoding[:, :seq_len, :]

        # Apply dropout to the input sequence
        inputs = self.dropout(inputs, training=training)

        # Pass the input sequence through the encoder layers
        for i in range(self.num_layers):
            inputs = self.enc_layers[i](inputs, training, mask)

        # Return the encoded sequence
        return inputs

### Decoder 

In [22]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1):
        """
        Initializes a single decoder layer of the transformer model.
        
        Args:
        embedding_dim: The dimension of the embedding space.
        num_heads: The number of attention heads to use.
        fully_connected_dim: The dimension of the feedforward network.
        dropout_rate: The dropout rate for regularization.
        """
        super(DecoderLayer, self).__init__()
        
        # Instantiate two instances of MultiHeadAttention.
        self.mha1 = MultiHeadAttention(embedding_dim, num_heads, dropout_rate)
        self.mha2 = MultiHeadAttention(embedding_dim, num_heads, dropout_rate)
        
        # Instantiate a fully connected feedforward network.
        self.ffn = FeedForward(embedding_dim, fully_connected_dim)
        
        # Instantiate three layer normalization layers with epsilon=1e-6.
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.layernorm3 = LayerNormalization(epsilon=1e-6)
        
        # Instantiate a dropout layer for regularization.
        self.dropout3 = Dropout(dropout_rate)
        
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass through the decoder layer.
        
        Args:
        x: The input to the decoder layer, a query vector.
        enc_output: The output from the top layer of the encoder, a set of attention vectors k and v.
        training: Whether to apply dropout regularization.
        look_ahead_mask: The mask to apply to the input sequence so that it can't look ahead to future positions.
        padding_mask: The mask to apply to the input sequence to ignore padding tokens.
        
        Returns:
        The output from the decoder layer, a tensor with the same shape as the input.
        The attention weights from the first multi-head attention layer.
        The attention weights from the second multi-head attention layer.
        """
        
        # Apply the first multi-head attention layer to the query vector x.
        # We pass x as all three inputs to the layer because this is a self-attention layer.
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)
        
        # Add the original input to the output of the attention layer and apply layer normalization.
        out1 = self.layernorm1(attn1 + x) 
        
        # Apply the second multi-head attention layer to the output from the first layer and the encoder output.
        attn2, attn_weights_block2 = self.mha2(enc_output, enc_output, out1, padding_mask)
        
        # Add the output from the first layer to the output of the second layer and apply layer normalization.
        out2 = self.layernorm2(attn2 + out1)
        
        # Apply the feedforward network to the output of the second layer and apply dropout regularization.
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        
        # Add the output from the second layer to the output of the feedforward network and apply layer normalization.
        out3 = self.layernorm3(ffn_output + out2)
        
        return out3, attn_weights_block1, attn_weights_block2


#### Decoder

class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        """
        Initializes the Decoder object.
        
        Args:
            num_layers (int): The number of Decoder layers.
            embedding_dim (int): The size of the embedding dimension.
            num_heads (int): The number of heads in the MultiHeadAttention layer.
            fully_connected_dim (int): The number of units in the feedforward network.
            target_vocab_size (int): The number of words in the target vocabulary.
            maximum_position_encoding (int): The maximum length of a sequence.
            dropout_rate (float): The rate at which to apply dropout.
        """
        super(Decoder, self).__init__()
        
        self.num_layers = num_layers
        self.embedding_dim = embedding_dim
        
        # create layers
        self.embedding = Embedding(target_vocab_size, embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, embedding_dim)
        self.dec_layers = [DecoderLayer(embedding_dim, num_heads, fully_connected_dim, dropout_rate=dropout_rate) for _ in range(num_layers)]
        self.dropout = Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Executes the Decoder.

        Args:
            x (tf.Tensor): The input to the Decoder.
            enc_output (tf.Tensor): The output from the Encoder.
            training (bool): Whether the Decoder is in training mode.
            look_ahead_mask (tf.Tensor): The mask for self-attention in the MultiHeadAttention layer.
            padding_mask (tf.Tensor): The mask for padding in the MultiHeadAttention layer.

        Returns:
            tf.Tensor: The output from the Decoder.
            dict: A dictionary of attention weights.
        """
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        # add embedding and positional encoding
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)

        # apply each layer of the decoder
        for i in range(self.num_layers):
            # pass through decoder layer i
            x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)

            # record attention weights for block1 and block2
            attention_weights[f"decoder_layer{i + 1}_block1"] = block1
            attention_weights[f"decoder_layer{i + 1}_block2"] = block2

        return x, attention_weights

## Transformer


In [23]:
class Transformer(tf.keras.Model):
    """
    A Transformer model that takes in an input and target sequence and outputs a final prediction.

    Args:
        num_layers (int): Number of layers in the Encoder and Decoder.
        embedding_dim (int): Dimensionality of the embedding layer.
        num_heads (int): Number of attention heads used in the Transformer.
        fully_connected_dim (int): Dimensionality of the fully connected layer in the Encoder and Decoder.
        input_vocab_size (int): Size of the input vocabulary.
        target_vocab_size (int): Size of the target vocabulary.
        max_positional_encoding_input (int): Maximum length of the input sequence.
        max_positional_encoding_target (int): Maximum length of the target sequence.
        dropout_rate (float, optional): Dropout rate used in the Encoder and Decoder layers. Defaults to 0.1.
    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, target_vocab_size, max_positional_encoding_input, max_positional_encoding_target, dropout_rate=0.1):
        super(Transformer, self).__init__()
        
        # Initialize the Encoder and Decoder layers
        self.encoder = Encoder(num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, max_positional_encoding_input, dropout_rate)
        self.decoder = Decoder(num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size, max_positional_encoding_target, dropout_rate)
        
        # Add a final dense layer to make the final prediction
        self.final_layer = tf.keras.layers.Dense(target_vocab_size, activation='softmax')
        
    def call(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        """
        Perform a forward pass through the Transformer model.

        Args:
            inp (tf.Tensor): Input sequence tensor with shape (batch_size, input_seq_len).
            tar (tf.Tensor): Target sequence tensor with shape (batch_size, target_seq_len).
            training (bool): Whether the model is being trained or not.
            enc_padding_mask (tf.Tensor): Padding mask for the Encoder with shape (batch_size, 1, 1, input_seq_len).
            look_ahead_mask (tf.Tensor): Mask to prevent the Decoder from looking ahead in the target sequence with shape (batch_size, 1, target_seq_len, target_seq_len).
            dec_padding_mask (tf.Tensor): Padding mask applied to the Encoder output in the Decoder's cross-attention, with shape (batch_size, 1, 1, input_seq_len).

        Returns:
            tuple: A tuple containing the final output of the model and the attention weights of the Decoder.
        """
        # Pass the input sequence through the Encoder
        enc_output = self.encoder(inp, training, enc_padding_mask)
        
        # Pass the target sequence and the output of the Encoder through the Decoder
        dec_output, attention_weights = self.decoder(tar, enc_output, training, look_ahead_mask, dec_padding_mask)
        
        # Pass the output of the Decoder through the final dense layer to get the final prediction
        final_output = self.final_layer(dec_output)
        
        return final_output, attention_weights
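
The masks named in the docstring can be sketched on plain lists. This is a minimal illustration, assuming the convention that 1 marks a position attention must ignore; the notebook's `create_masks` is defined earlier and may use a different convention or layout. The function names below are hypothetical.

```python
def padding_mask(token_ids, pad_id=0):
    """1 marks a padded position that attention should ignore (assumed convention)."""
    return [1 if t == pad_id else 0 for t in token_ids]

def look_ahead_mask(size):
    """1 above the diagonal: position i may not attend to positions j > i."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]

def combined_mask(target_ids, pad_id=0):
    """Element-wise maximum of the look-ahead mask and the target padding mask."""
    pad = padding_mask(target_ids, pad_id)
    ahead = look_ahead_mask(len(target_ids))
    return [[max(a, p) for a, p in zip(row, pad)] for row in ahead]

target = [5, 9, 3, 0]  # toy target sequence; 0 is padding
mask = combined_mask(target)
for row in mask:
    print(row)
```

Each row describes what one decoder position may see: earlier tokens are visible, later tokens and padding are hidden.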


### Initializing Hyperparameters <a name="5-1"></a>

In [25]:
# Set hyperparameters for the Transformer model
embedding_dim = 256          # dimensionality of the embeddings used for tokens in the input and target sequences
fully_connected_dim = 512    # dimensionality of the hidden layer of the feedforward neural network within the Transformer block
num_layers = 4               # number of Transformer blocks in the encoder and decoder stacks
num_heads = 8                # number of heads in the multi-head attention mechanism
dropout_rate = 0.1           # dropout rate for regularization

# Set vocabulary sizes for input and target sequences
input_vocab_size = len(tokenizer_inp.word_index) + 1    # add 1 because index 0 is reserved for padding
target_vocab_size = len(tokenizer_tar.word_index) + 1   # add 1 because index 0 is reserved for padding

# Set the number of positions for which positional encodings are precomputed.
# This only needs to cover the longest sequence (MAX_LEN); the vocabulary sizes used here are larger than necessary.
max_positional_encoding_input = input_vocab_size    # number of precomputed positions for the input sequence
max_positional_encoding_target = target_vocab_size  # number of precomputed positions for the target sequence

# Set the number of epochs and batch size for training
EPOCHS = 50
batch_size = 32

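MAX_LEN = MAX_LEN  # no-op: MAX_LEN (maximum sequence length) is defined earlier in the notebook and is reused unchanged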


### Optimizer Learning Rate Scheduler

In [26]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, embedding_dim, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.embedding_dim = tf.cast(embedding_dim, dtype=tf.float32)
        self.warmup_steps = tf.cast(warmup_steps, dtype=tf.float32)

    def __call__(self, step):
        step = tf.cast(step, dtype=tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.embedding_dim) * tf.math.minimum(arg1, arg2)

# Create an instance of the custom learning rate schedule
learning_rate = CustomSchedule(embedding_dim)
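
The schedule computes `embedding_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)`: a linear ramp during warmup followed by inverse-square-root decay. A minimal pure-Python sketch of the same arithmetic, with the hypothetical name `transformer_lr`, makes the shape easy to inspect:

```python
def transformer_lr(step, embedding_dim=256, warmup_steps=4000):
    """Learning rate at a given optimizer step (step >= 1)."""
    return embedding_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until warmup_steps, peaks there, then decays.
for step in (1, 1000, 4000, 16000):
    print(step, f"{transformer_lr(step):.6f}")
```

The two arguments of `min` are equal exactly at `step == warmup_steps`, which is where the learning rate peaks.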


### Loss Function and Metrics 

In [27]:
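# Define the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

# Define the loss object. The model's final layer already applies softmax, so the
# default from_logits=False is correct. reduction='none' keeps one loss value per
# token so that loss_function can mask out the padding positions.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction='none')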


def loss_function(true_values, predictions):
    # Create a mask to exclude the padding tokens
    mask = tf.math.logical_not(tf.math.equal(true_values, 0))

    # Compute the loss value using the loss object
    loss_ = loss_object(true_values, predictions)

    # Apply the mask to exclude the padding tokens
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    # Calculate the mean loss value
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

def accuracy_function(true_values, predictions):
    # Compute the accuracies using the true and predicted target sequences
    accuracies = tf.equal(true_values, tf.argmax(predictions, axis=2))

    # Create a mask to exclude the padding tokens
    mask = tf.math.logical_not(tf.math.equal(true_values, 0))

    # Apply the mask to exclude the padding tokens from the accuracies
    accuracies = tf.math.logical_and(mask, accuracies)
    accuracies = tf.cast(accuracies, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)

    # Calculate the mean accuracy value
    return tf.reduce_sum(accuracies) / tf.reduce_sum(mask)

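# Define the training metrics
train_loss = tf.keras.metrics.Mean(name='train_loss')
# Note: SparseCategoricalAccuracy scores every position, including padding, so it is
# not the same quantity as the masked accuracy computed by accuracy_function above.
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')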
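
The masking in `loss_function` relies on a per-token loss vector: each position's loss is multiplied by 0 (padding) or 1 (real token) before averaging. The sketch below reproduces that arithmetic on a toy sequence in plain Python; `masked_mean_loss` and the probability values are made up for illustration.

```python
import math

def masked_mean_loss(true_ids, probs, pad_id=0):
    """Average negative log-likelihood over non-padding positions only."""
    per_token = [-math.log(p[t]) for t, p in zip(true_ids, probs)]
    mask = [0.0 if t == pad_id else 1.0 for t in true_ids]
    return sum(l * m for l, m in zip(per_token, mask)) / sum(mask)

# Toy sequence: two real tokens followed by two padding positions.
true_ids = [2, 1, 0, 0]
probs = [
    [0.1, 0.1, 0.8],     # correct token 2 gets probability 0.8
    [0.2, 0.5, 0.3],     # correct token 1 gets probability 0.5
    [0.98, 0.01, 0.01],  # padding is trivially easy to predict
    [0.98, 0.01, 0.01],
]
masked = masked_mean_loss(true_ids, probs)
unmasked = sum(-math.log(p[t]) for t, p in zip(true_ids, probs)) / len(true_ids)
print(f"masked: {masked:.4f}  unmasked: {unmasked:.4f}")
```

Averaging over all positions lets the easy padding predictions dilute the loss, which is why padded datasets can report a deceptively low loss or high accuracy when the mask is not applied.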

### Transformer Initialization

In [28]:
# Create an instance of the Transformer model
transformer = Transformer(num_layers, embedding_dim, num_heads,
                           fully_connected_dim, input_vocab_size, target_vocab_size, 
                           max_positional_encoding_input, max_positional_encoding_target, dropout_rate)

In [29]:
# the train function
train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]


@tf.function(input_signature=train_step_signature)
def train_step(encoder_input, target):
    # Slice the target tensor to get the input for the decoder
    decoder_input = target[:, :-1]

    # Slice the target tensor to get the expected output of the decoder
    expected_output = target[:, 1:]

    # Create masks for the encoder input, decoder input and the padding
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, decoder_input)

    # Perform a forward pass through the model
    with tf.GradientTape() as tape:
        predictions, _ = transformer(encoder_input, decoder_input, True, enc_padding_mask, combined_mask, dec_padding_mask)

        # Calculate the loss between the predicted output and the expected output
        loss = loss_function(expected_output, predictions)

    # Calculate gradients and update the model parameters
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    # Update the training loss and accuracy metrics
    train_loss(loss)
    train_accuracy(expected_output, predictions)
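
The two slices at the top of `train_step` implement teacher forcing: the decoder reads the target shifted right and is scored against the target shifted left. A small sketch on one sequence, with hypothetical token IDs, shows how the pairs line up:

```python
def shift_for_teacher_forcing(target_ids):
    """Split one target sequence into decoder input and expected output."""
    decoder_input = target_ids[:-1]    # everything except the last token
    expected_output = target_ids[1:]   # everything except the first token
    return decoder_input, expected_output

# Hypothetical token IDs: 1 = <start>, 2 = <end>, 0 = padding.
target = [1, 37, 52, 8, 2, 0, 0]
decoder_input, expected_output = shift_for_teacher_forcing(target)
for seen, predicted in zip(decoder_input, expected_output):
    print(f"after token {seen:>2} the model should predict {predicted}")
```

The trailing pairs involve padding only, which is why the loss masks out positions where the expected output is 0.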



### Model Saving

In [30]:
# Get the current date and time
current_time = datetime.now()

# Format the current date and time into a string
time_str = current_time.strftime("%Y-%m-%d-%H-%M")
time_str

'2024-04-03-04-28'

#### Tokenizer Saving
- Save both Tokenizers - English (Input), Telugu (Target)

In [31]:
training_model_name_to_save = f"{time_str}_{embedding_dim}_{fully_connected_dim}_{num_layers}_{num_heads}_{dropout_rate}_{input_vocab_size}_{target_vocab_size}_{EPOCHS}_{batch_size}_{MAX_LEN}"

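# Dictionary to hold both tokenizers - English + Telugu
# Note: the key names below are misleading. tokenizer_inp is the English (input) tokenizer
# and tokenizer_tar is the Telugu (target) tokenizer.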
tokenizers_dict = {
    'english_tokenizer_target': tokenizer_tar,
    'telugu_tokenizer_input': tokenizer_inp
}

# Create the artifact directories if they don't exist
os.makedirs('transformer_training_artifacts/tokenizer', exist_ok=True)
os.makedirs('transformer_training_artifacts/training_checkpoints', exist_ok=True)

def save_the_tokenizer():
    # Save the tokenizer - english & telugu
    with open(f'transformer_training_artifacts/tokenizer/tokenizer_{training_model_name_to_save}.pickle', 'wb') as handle:
        pickle.dump(tokenizers_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
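
Loading the tokenizers back for inference is the mirror image of `save_the_tokenizer`. The round trip below is a sketch: `save_tokenizers` and `load_tokenizers` are hypothetical helpers, a temporary directory replaces the artifact path, and plain dicts stand in for the fitted Keras tokenizers.

```python
import os
import pickle
import tempfile

def save_tokenizers(tokenizers, path):
    """Pickle a dict of tokenizers to `path`."""
    with open(path, 'wb') as handle:
        pickle.dump(tokenizers, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_tokenizers(path):
    """Load the dict written by save_tokenizers."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# Plain dicts stand in for the fitted tokenizers in this sketch.
stand_ins = {
    'english_tokenizer_target': {'word_index': {'<start>': 1, '<end>': 2}},
    'telugu_tokenizer_input': {'word_index': {'<start>': 1, '<end>': 2}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tokenizer_demo.pickle')
    save_tokenizers(stand_ins, path)
    restored = load_tokenizers(path)

print(sorted(restored))
```

Unpickling can execute arbitrary code, so only load tokenizer files that you created yourself.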

#### Checkpoint Saving

In [32]:
checkpoint_path = f"transformer_training_artifacts/training_checkpoints/transformer_{training_model_name_to_save}"

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print ('Latest checkpoint restored!!')

## Training

In [33]:
### Training 
for epoch in tqdm(range(1, EPOCHS+1)):
    start = time.time()
    # Reset the metrics at the start of the next epoch
    train_loss.reset_states()
    train_accuracy.reset_states()
    current_batch_index = 0

    # Iterate through the dataset in full batches of batch_size (a final partial batch is skipped)
    for i in range(len(target_sentences) // batch_size):
        # get the input and target batch
        target_batch = tf.convert_to_tensor(np.array(target_sentences[current_batch_index:current_batch_index+batch_size]),dtype=tf.int64)
        input_batch = tf.convert_to_tensor(np.array(inp_sentences[current_batch_index:current_batch_index+batch_size]),dtype=tf.int64)

        # English --> Telugu Translation
#         input_batch = tf.convert_to_tensor(np.array(target_sentences[current_batch_index:current_batch_index+batch_size]),dtype=tf.int64)
#         target_batch = tf.convert_to_tensor(np.array(inp_sentences[current_batch_index:current_batch_index+batch_size]),dtype=tf.int64)

        current_batch_index = current_batch_index + batch_size
        # call the train_step function to train the model using the current batch
        train_step(input_batch, target_batch)

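    # Save a checkpoint every 5 epochs.
    # Note: epoch is 1-based, so epoch+1 labels the checkpoint one epoch ahead
    # (the save after epoch 5 is reported as "epoch 6").
    if epoch % 5 == 0:
        ckpt_save_path = ckpt_manager.save()
        print ('Saving checkpoint for epoch {} at {}'.format(epoch+1,
                                                             ckpt_save_path))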

    # print the epoch loss and accuracy after iterating through the dataset
    print (f'Epoch {epoch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}') 
    print ('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))

# Save the tokenizers only after training has succeeded, so that a failed run leaves no stray artifacts
save_the_tokenizer()

Epoch 1 Loss 7.2520 Accuracy 0.8438
Time taken for 1 epoch: 182.71731042861938 secs

Epoch 2 Loss 1.1992 Accuracy 0.9188
Time taken for 1 epoch: 140.64455676078796 secs

Epoch 3 Loss 0.6433 Accuracy 0.9234
Time taken for 1 epoch: 140.8220772743225 secs

Epoch 4 Loss 0.6062 Accuracy 0.9251
Time taken for 1 epoch: 140.8088574409485 secs

Saving checkpoint for epoch 6 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-1
Epoch 5 Loss 0.5619 Accuracy 0.9272
Time taken for 1 epoch: 141.86448049545288 secs

Epoch 6 Loss 0.5195 Accuracy 0.9291
Time taken for 1 epoch: 140.7785406112671 secs

Epoch 7 Loss 0.4837 Accuracy 0.9309
Time taken for 1 epoch: 140.74905443191528 secs

Epoch 8 Loss 0.4535 Accuracy 0.9327
Time taken for 1 epoch: 140.826345205307 secs

Epoch 9 Loss 0.4249 Accuracy 0.9343
Time taken for 1 epoch: 140.79870438575745 secs

Saving checkpoint for epoch 11 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-2
Epoch 10 Loss 0.3981 Accuracy 0.9360
Time taken for 1 epoch: 141.75087690353394 secs

Epoch 11 Loss 0.3724 Accuracy 0.9377
Time taken for 1 epoch: 140.8082935810089 secs

Epoch 12 Loss 0.3489 Accuracy 0.9395
Time taken for 1 epoch: 140.78357529640198 secs

Epoch 13 Loss 0.3265 Accuracy 0.9414
Time taken for 1 epoch: 140.7388916015625 secs

Epoch 14 Loss 0.3004 Accuracy 0.9441
Time taken for 1 epoch: 140.70149040222168 secs

Saving checkpoint for epoch 16 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-3
Epoch 15 Loss 0.2718 Accuracy 0.9476
Time taken for 1 epoch: 141.6537425518036 secs

Epoch 16 Loss 0.2430 Accuracy 0.9515
Time taken for 1 epoch: 140.67940068244934 secs

Epoch 17 Loss 0.2166 Accuracy 0.9553
Time taken for 1 epoch: 140.7104253768921 secs

Epoch 18 Loss 0.1933 Accuracy 0.9589
Time taken for 1 epoch: 140.72883868217468 secs

Epoch 19 Loss 0.1733 Accuracy 0.9621
Time taken for 1 epoch: 140.73692226409912 secs

Saving checkpoint for epoch 21 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-4
Epoch 20 Loss 0.1546 Accuracy 0.9653
Time taken for 1 epoch: 141.6169149875641 secs

Epoch 21 Loss 0.1391 Accuracy 0.9679
Time taken for 1 epoch: 140.74676966667175 secs

Epoch 22 Loss 0.1253 Accuracy 0.9705
Time taken for 1 epoch: 140.71928143501282 secs

Epoch 23 Loss 0.1127 Accuracy 0.9729
Time taken for 1 epoch: 140.77496910095215 secs

Epoch 24 Loss 0.1029 Accuracy 0.9748
Time taken for 1 epoch: 140.74073457717896 secs

Saving checkpoint for epoch 26 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-5
Epoch 25 Loss 0.0923 Accuracy 0.9770
Time taken for 1 epoch: 141.65796422958374 secs

Epoch 26 Loss 0.0853 Accuracy 0.9785
Time taken for 1 epoch: 140.707510471344 secs

Epoch 27 Loss 0.0783 Accuracy 0.9800
Time taken for 1 epoch: 140.7250828742981 secs

Epoch 28 Loss 0.0725 Accuracy 0.9814
Time taken for 1 epoch: 140.7936646938324 secs

Epoch 29 Loss 0.0668 Accuracy 0.9826
Time taken for 1 epoch: 140.68007016181946 secs

Saving checkpoint for epoch 31 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-6
Epoch 30 Loss 0.0621 Accuracy 0.9837
Time taken for 1 epoch: 141.66960954666138 secs

Epoch 31 Loss 0.0581 Accuracy 0.9846
Time taken for 1 epoch: 140.751971244812 secs

Epoch 32 Loss 0.0547 Accuracy 0.9855
Time taken for 1 epoch: 140.86904096603394 secs

Epoch 33 Loss 0.0514 Accuracy 0.9863
Time taken for 1 epoch: 140.6739809513092 secs

Epoch 34 Loss 0.0482 Accuracy 0.9870
Time taken for 1 epoch: 140.6664891242981 secs

Saving checkpoint for epoch 36 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-7
Epoch 35 Loss 0.0461 Accuracy 0.9876
Time taken for 1 epoch: 141.63100361824036 secs

Epoch 36 Loss 0.0435 Accuracy 0.9882
Time taken for 1 epoch: 140.712096452713 secs

Epoch 37 Loss 0.0409 Accuracy 0.9889
Time taken for 1 epoch: 140.69272208213806 secs

Epoch 38 Loss 0.0389 Accuracy 0.9895
Time taken for 1 epoch: 140.7165961265564 secs

Epoch 39 Loss 0.0368 Accuracy 0.9900
Time taken for 1 epoch: 140.7060215473175 secs

Saving checkpoint for epoch 41 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-8
Epoch 40 Loss 0.0353 Accuracy 0.9903
Time taken for 1 epoch: 141.69147634506226 secs

Epoch 41 Loss 0.0335 Accuracy 0.9908
Time taken for 1 epoch: 140.71352815628052 secs

Epoch 42 Loss 0.0320 Accuracy 0.9912
Time taken for 1 epoch: 140.71199584007263 secs

Epoch 43 Loss 0.0307 Accuracy 0.9916
Time taken for 1 epoch: 140.70782446861267 secs

Epoch 44 Loss 0.0294 Accuracy 0.9919
Time taken for 1 epoch: 140.68632888793945 secs

Saving checkpoint for epoch 46 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-9
Epoch 45 Loss 0.0281 Accuracy 0.9923
Time taken for 1 epoch: 141.67250180244446 secs

Epoch 46 Loss 0.0272 Accuracy 0.9925
Time taken for 1 epoch: 140.690988779068 secs

Epoch 47 Loss 0.0258 Accuracy 0.9929
Time taken for 1 epoch: 140.65724110603333 secs

Epoch 48 Loss 0.0250 Accuracy 0.9931
Time taken for 1 epoch: 140.69540238380432 secs

Epoch 49 Loss 0.0240 Accuracy 0.9935
Time taken for 1 epoch: 140.638023853302 secs

100%|██████████| 50/50 [1:58:08<00:00, 141.76s/it]

Saving checkpoint for epoch 51 at transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325/ckpt-10
Epoch 50 Loss 0.0229 Accuracy 0.9937
Time taken for 1 epoch: 141.67951703071594 secs
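
The batching arithmetic in the training loop can be isolated as a small generator. This is a sketch with the hypothetical name `iter_full_batches`; like the loop, it yields only full batches, in the original data order, and drops a shorter final slice.

```python
def iter_full_batches(items, batch_size):
    """Yield consecutive slices of exactly batch_size items; a shorter final slice is dropped."""
    for index in range(len(items) // batch_size):
        start = index * batch_size
        yield items[start:start + batch_size]

sentences = list(range(10))   # stand-in for 10 tokenized sentences
batches = list(iter_full_batches(sentences, batch_size=4))
print(batches)                # two full batches; items 8 and 9 are left out
```

Because the order is never shuffled, every epoch sees the same batches in the same sequence.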






In [35]:
# Continuation of training (disabled): same loop as in In [33], with the epoch range changed to
#     for epoch in tqdm(range(51, 101)):
# followed by
#     save_the_tokenizer()

In [36]:
# Continuation of training (disabled): same loop as in In [33], with the epoch range changed to
#     for epoch in tqdm(range(101, 121)):
# followed by
#     save_the_tokenizer()

## Inference

### Greedy Translation

**Translate Helper Function**

**This function takes a source-language sentence as input and generates a translated sentence in the target language using the following steps:**

1. The input sentence is preprocessed, wrapped in start- and end-of-sentence markers, and placed in a list, because the tokenizer expects a list of texts.

2. The preprocessed sentence is vectorized with the input-language tokenizer and padded to a fixed length of `MAX_LEN`.

3. The decoder input is initialized with the start token, which is tokenized with the target-language tokenizer and converted to a tensor.

4. The function then loops for at most `MAX_LEN` steps. In each step it generates predictions for the current decoder input, selects the prediction for the last position along the seq_len dimension, and takes the argmax to get the predicted word ID. If the predicted ID equals the end token, the function returns the decoder input built so far. Otherwise, the predicted ID is appended to the decoder input and the loop continues.

5. If the loop completes without producing the end token, the function returns the decoder input accumulated over all `MAX_LEN` steps.
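
The loop described in the steps above can be sketched independently of the model. In this illustration a fixed lookup table plays the role of the Transformer; `greedy_decode`, `toy_model`, and the token IDs are hypothetical.

```python
def greedy_decode(next_token_probs, start_id, end_id, max_len):
    """Repeatedly append the most probable next token until end_id or max_len."""
    sequence = [start_id]
    for _ in range(max_len):
        probs = next_token_probs(sequence)
        predicted_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if predicted_id == end_id:
            break
        sequence.append(predicted_id)
    return sequence

# Toy stand-in for the model: next-token probabilities keyed by the last token.
# IDs: 0 = padding, 1 = <start>, 2 = <end>, 3 and 4 are ordinary words.
TABLE = {
    1: [0.0, 0.0, 0.1, 0.7, 0.2],
    3: [0.0, 0.0, 0.2, 0.1, 0.7],
    4: [0.0, 0.0, 0.9, 0.05, 0.05],
}

def toy_model(sequence):
    return TABLE[sequence[-1]]

print(greedy_decode(toy_model, start_id=1, end_id=2, max_len=10))
```

As in `translate_helper`, the returned sequence starts with the start token and does not include the end token.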

In [38]:
maxlen = MAX_LEN
def translate_helper(sentence):
    """
    Generate a translation of the input sentence with greedy decoding.

    Args:
    sentence (list[str]): A one-element list holding the input sentence in the source language.

    Returns:
    A 1-D tensor of target-language token IDs, beginning with the start token.
    """
    
    # Preprocess the input sentence
    sentence = preprocess_text(sentence[0], is_telugu=True)

    sentence = '<start> ' + sentence + ' <end>' # Add start and end of sentence markers
    sentence = [sentence] # Convert sentence to list because the tokenizer expects a list of texts
    
    # Vectorize and pad the sentence
    sentence = tokenizer_inp.texts_to_sequences(sentence)
    sentence = pad_sequences(sentence, maxlen=MAX_LEN, padding='post', truncating='post')
    # Convert input to tensor (named encoder_input to avoid shadowing the built-in input)
    encoder_input = tf.convert_to_tensor(np.array(sentence), dtype=tf.int64)
    
    # Tokenize the start of the decoder input and convert it to a tensor
    decoder_input = tokenizer_tar.texts_to_sequences(['<start>'])
    decoder_input = tf.convert_to_tensor(np.array(decoder_input), dtype=tf.int64)

    # Look up the end token ID once instead of on every step
    end_id = tokenizer_tar.texts_to_sequences(['<end>'])[0][0]
    
    # Generate the translated sentence
    for i in range(maxlen):
        # Create masks for the encoder, decoder, and combined
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, decoder_input)
        # Generate predictions for the current input sequence
        predictions, _ = transformer(encoder_input, decoder_input, False, enc_padding_mask, combined_mask, dec_padding_mask)
        # Select the last word from the seq_len dimension
        predictions = predictions[:, -1:, :]
        # Get the predicted word ID by taking the argmax of the predictions
        predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int64)
        
        # If the predicted ID is equal to the end token, return the decoder input
        if int(predicted_id[0, 0]) == end_id:
            return tf.squeeze(decoder_input, axis=0)
        
        # Concatenate the predicted ID to the output which is given to the decoder
        # as its input.
        decoder_input = tf.concat([decoder_input, predicted_id], axis=1)
    
    # No end token was produced within maxlen steps: return what was generated
    return tf.squeeze(decoder_input, axis=0)

def translate(sentence):
    """
    Translate the input sentence with greedy decoding and print the result.

    Args:
    sentence (str): The input sentence in the source language.

    Returns:
    None.
    """
    
    # Convert sentence to list because translate_helper expects a list
    sentence = [sentence]
    
    # Print the input sentence
    print(f'Input sentence: {sentence[0]}')
    print()
    
    # Generate the translated sentence and convert the result tensor to a list of IDs
    result = translate_helper(sentence).numpy().tolist()
    
    # Remove start and end of sentence markers
    start_id = tokenizer_tar.texts_to_sequences(['<start>'])[0][0]
    end_id = tokenizer_tar.texts_to_sequences(['<end>'])[0][0]
    predicted_ids = [i for i in result if i != start_id and i != end_id]
    
    # Convert the predicted IDs to a list of words
    predicted_sentence = tokenizer_tar.sequences_to_texts([predicted_ids])
    
    # Print the predicted translation
    print(f'Translation: {predicted_sentence[0]}')
 


### Beam Search Translation
- Beam search inference code. The beam width (k) is a hyperparameter: at each decoding step only the k highest-scoring partial translations are kept.

In [40]:
# Beam search code

beam_width = 5  # Define the beam width

def translate_helper_beam_search(sentence):
    """
    Generate a translation of the input sentence with beam search.

    Args:
    sentence (list[str]): A one-element list holding the input sentence in the source language.

    Returns:
    A 1-D tensor of target-language token IDs, beginning with the start token.
    """
    
    # Preprocess the input sentence
    sentence = preprocess_text(sentence[0], is_telugu=True)

    sentence = '<start> ' + sentence + ' <end>' # Add start and end of sentence markers
    sentence = [sentence] # Convert sentence to list because the tokenizer expects a list of texts
    
    # Vectorize and pad the sentence
    sentence = tokenizer_inp.texts_to_sequences(sentence)
    sentence = pad_sequences(sentence, maxlen=MAX_LEN, padding='post', truncating='post')
    # Convert input to tensor (named encoder_input to avoid shadowing the built-in input)
    encoder_input = tf.convert_to_tensor(np.array(sentence), dtype=tf.int64)
    
    # Tokenize the start of the decoder input and convert it to a tensor
    decoder_input = tokenizer_tar.texts_to_sequences(['<start>'])
    decoder_input = tf.convert_to_tensor(np.array(decoder_input), dtype=tf.int64)

    # Look up the end token ID once instead of on every step
    end_id = tokenizer_tar.texts_to_sequences(['<end>'])[0][0]
    
    # Initialize the beam search candidates
    candidates = [(decoder_input, 0)]  # List of (decoder_input, score) tuples
    
    # Generate the translated sentence using beam search
    for _ in range(maxlen):
        new_candidates = []
        for decoder_input, score in candidates:
            # Create masks for the encoder, decoder, and combined
            enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, decoder_input)
            # Generate predictions for the current input sequence
            predictions, _ = transformer(encoder_input, decoder_input, False, enc_padding_mask, combined_mask, dec_padding_mask)
            # Select the last word from the seq_len dimension
            predictions = predictions[:, -1:, :]
            # Get the top beam_width predictions and their indices
            topk_probs, topk_ids = tf.nn.top_k(tf.squeeze(predictions, axis=1), k=beam_width)
            for i in range(beam_width):
                # Append the i-th best token (reshaped to (1, 1)) to this candidate
                next_id = tf.reshape(tf.cast(topk_ids[0][i], tf.int64), (1, 1))
                new_decoder_input = tf.concat([decoder_input, next_id], axis=1)
                # Scores are summed log-probabilities
                new_score = score + tf.math.log(topk_probs[0][i]).numpy()
                new_candidates.append((new_decoder_input, new_score))

        # Select the top beam_width candidates
        candidates = sorted(new_candidates, key=lambda x: x[1], reverse=True)[:beam_width]
        
        # Return the highest-scoring candidate that ends with the end token, if any.
        # Note: this stops at the first finished hypothesis, so unfinished candidates
        # are never compared against it once they are complete.
        for candidate, _ in candidates:
            if int(candidate[0][-1]) == end_id:
                return tf.squeeze(candidate, axis=0)
    
    # Return the translated sentence with the highest score among the candidates
    return tf.squeeze(candidates[0][0], axis=0)


def translate_beam(sentence):
    """
    Translate a sentence with beam search and print the result.

    Args:
    sentence (str): The input sentence in the source language.

    Returns:
    None. The input sentence and its translation are printed.
    """
    
    # Wrap the sentence in a list because the beam search helper expects a list of sentences
    sentence = [sentence]
    
    # Print the input sentence
    print(f'Input sentence: {sentence[0]}')
    print()
    
    # Generate the translated sentence as a plain Python list of token IDs
    result = translate_helper_beam_search(sentence).numpy().tolist()
    
    # Look up the start and end marker IDs once, then filter them out of the result
    start_id = tokenizer_tar.texts_to_sequences(['<start>'])[0][0]
    end_id = tokenizer_tar.texts_to_sequences(['<end>'])[0][0]
    predicted_ids = [i for i in result if i not in (start_id, end_id)]
    
    # Convert the predicted IDs to a list of words
    predicted_sentence = tokenizer_tar.sequences_to_texts([predicted_ids])
    
    # Print the predicted translation
    print(f'Translation: {predicted_sentence[0]}')
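
The beam search loop can be easier to follow in isolation. The sketch below is a framework-free toy, not the notebook's implementation: `toy_next_token_probs`, `beam_search`, and the token IDs are hypothetical names introduced only for illustration, with the toy function standing in for the Transformer. One assumed difference from the helper above is that finished hypotheses are carried over unchanged until every kept beam has ended, rather than returning at the first `<end>` token.

```python
import math

START, END = 1, 2  # hypothetical token IDs for the toy vocabulary


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for the model: returns {token_id: probability}."""
    if len(prefix) == 1:          # first step after <start>
        return {3: 0.6, 4: 0.4}
    if prefix[-1] == 3:
        return {5: 0.5, END: 0.5}
    if prefix[-1] == 4:
        return {END: 0.9, 5: 0.1}
    return {END: 1.0}             # token 5 is always followed by <end>


def beam_search(next_probs, beam_width=2, max_len=10):
    """Return (tokens, log_prob) of the best hypothesis found."""
    beams = [([START], 0.0)]      # (token list, cumulative log-probability)
    for _ in range(max_len):
        expanded = []
        for tokens, score in beams:
            if tokens[-1] == END:
                # Finished hypotheses are carried over unchanged
                expanded.append((tokens, score))
                continue
            probs = next_probs(tokens)
            # Keep only the beam_width most probable next tokens
            top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
            for token_id, p in top:
                expanded.append((tokens + [token_id], score + math.log(p)))
        # Prune to the beam_width best hypotheses overall
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(tokens[-1] == END for tokens, _ in beams):
            break
    return beams[0]


best_tokens, best_score = beam_search(toy_next_token_probs, beam_width=2)
greedy_tokens, greedy_score = beam_search(toy_next_token_probs, beam_width=1)
print(best_tokens, round(math.exp(best_score), 2))      # [1, 4, 2] 0.36
print(greedy_tokens, round(math.exp(greedy_score), 2))  # [1, 3, 5, 2] 0.3
```

With `beam_width=1` the search reduces to greedy decoding and commits to token 3 because it is the most probable first step. With `beam_width=2` the less likely first token 4 stays in the beam and leads to a more probable complete sequence.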



### Evaluation

In [51]:
sentence = "climatic phenomenon that occurs when the waters of the pacific ocean near the equator get warmer"
print(f"Greedy Search Output:\n")
translate(sentence)
print(f"\n\nBeam Search Output:")
translate_beam(sentence)

Greedy Search Output:

Input sentence: climatic phenomenon that occurs when the waters of the pacific ocean near the equator get warmer

Translation: సమీపంలో ఉన్న పసిఫిక్ జలాలు సగటు కంటే వెచ్చగా ఉన్నప్పుడు సంభవించే వాతావరణ దృగ్విషయం ఎల్ ఈ వేడెక్కడం ప్రతి కొన్ని సంవత్సరాలకు సంభవిస్తుంది


Beam Search Output:
Input sentence: climatic phenomenon that occurs when the waters of the pacific ocean near the equator get warmer

Translation: సంక్షోభం మధ్య పెరుగుతున్న వాతావరణం ఏమిటి? సుమారు మంది జరిగింది


In [52]:
sentence = "technology has drastically transformed the way we work allowing for increased efficiency flexibility"
print(f"Greedy Search Output:\n")
translate(sentence)
print(f"\n\nBeam Search Output:")
translate_beam(sentence)

Greedy Search Output:

Input sentence: technology has drastically transformed the way we work allowing for increased efficiency flexibility

Translation: పెరిగిన సాంకేతిక పరిజ్ఞానం వైద్య ప్రపంచంలో చాలా ఇప్పుడు చాలా ఇప్పుడు రొటీన్ మరియు పునరావృత పనులను ఆటోమేట్ చేయగలవు ఉత్పాదకతను పెంచుతాయి మరియు సమయాన్ని ఆదా చేస్తాయి


Beam Search Output:
Input sentence: technology has drastically transformed the way we work allowing for increased efficiency flexibility

Translation: పెరిగిన భద్రత సెల్ఫ్ డ్రైవింగ్ కార్లు సిస్టమ్స్ వంటి అధునాతన సాంకేతిక పరిజ్ఞానం మెరుగుపరచడం


In [53]:
sentence = "with the advent of computer systems internet connectivity and digital tools work is no longer confined to traditional office spaces "
print(f"Greedy Search Output:\n")
translate(sentence)
print(f"\n\nBeam Search Output:")
translate_beam(sentence)

Greedy Search Output:

Input sentence: with the advent of computer systems internet connectivity and digital tools work is no longer confined to traditional office spaces 

Translation: పరికరాలు మరియు డిజిటల్ నియంత్రించడం మరియు సంగీతాన్ని ప్లే చేయడం వంటి వివిధ పనులకు ఉపయోగించే కదలికలను పరిశీలించడానికి ఉపయోగిస్తారు. ఈ పరికరాలు కంప్యూటర్లు ఎలా యాక్సెస్ చేయగల వీలు


Beam Search Output:
Input sentence: with the advent of computer systems internet connectivity and digital tools work is no longer confined to traditional office spaces 

Translation: పరికరాలు ఇవి డిజిటల్ చిత్రాలు మరియు వీడియోలను గుర్తించడానికి మరియు ఇంటర్నెట్ ను ఉపయోగిస్తాయి. ఈ పరికరాలు కంప్యూటర్లు ఎలా సంకేతాలను పంపడం
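
The comparisons above are qualitative. If reference Telugu translations were available (an assumption, since none are shown here), greedy and beam search outputs could also be compared numerically. The sketch below computes clipped n-gram precision, the building block of BLEU, on whitespace-separated tokens. The helper names and the English example strings are hypothetical and are not model outputs.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical example strings, used only to show the arithmetic
reference = "the cat sat on the mat"
candidate = "the the the cat"
unigram_p = clipped_precision(candidate, reference, n=1)
bigram_p = clipped_precision(candidate, reference, n=2)
print(unigram_p)  # 0.75
print(bigram_p)
```

Clipping keeps a repeated word from being over-credited: "the" appears three times in the candidate but only twice in the reference, so only two of those occurrences count. A full BLEU score would additionally combine several n-gram orders and apply a brevity penalty.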
