# Inference - Language Translation using Transformer Architecture: English to Telugu

##### Load the trained Transformer model from its checkpoint using the hyperparameters below; the model and tokenizer files are named using the following format.

*training_model_name_to_save = f"{time_str}_{embedding_dim}_{fully_connected_dim}_{num_layers}_{num_heads}_{dropout_rate}_{input_vocab_size}_{target_vocab_size}_{EPOCHS}_{batch_size}_{MAX_LEN}"*

* The tokenizer is loaded from the path ---> tokenizer_{training_model_name_to_save}.pickle
* The model is loaded from the path ---> transformer_{training_model_name_to_save}
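The naming convention above can be turned into file paths with a small helper. This is a minimal sketch: the function name `checkpoint_paths` and the `base_dir` default are assumptions for illustration, so point `base_dir` at wherever the files were actually saved.

```python
from pathlib import Path

def checkpoint_paths(model_name, base_dir="../artifacts"):
    """Return (tokenizer_path, model_path) for a saved run name.

    base_dir is a hypothetical location, not confirmed by the notebook.
    """
    base = Path(base_dir)
    tokenizer_path = base / f"tokenizer_{model_name}.pickle"
    model_path = base / f"transformer_{model_name}"
    return tokenizer_path, model_path

# Run name taken from the inference cell later in this notebook
tok_path, model_path = checkpoint_paths("2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325")
print(tok_path.name)
print(model_path.name)
```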

##### For inference, both **Greedy Search** and **Beam Search** are implemented (the beam width is the hyperparameter that sets how many top-scoring candidates are kept at each decoding step).

* Hyperparameters:

    * embedding_dim = 256          # dimensionality of the embeddings used for tokens in the input and target sequences
    * fully_connected_dim = 512    # dimensionality of the hidden layer of the feedforward neural network within the Transformer block
    * num_layers = 4               # number of Transformer blocks in the encoder and decoder stacks
    * num_heads = 8                # number of heads in the multi-head attention mechanism
    * dropout_rate = 0.1           # dropout rate for regularization

    * input_vocab_size = 20394    
    * target_vocab_size = 32926   

    * max_positional_encoding_input = 20394    # number of positions for which input positional encodings are precomputed (set to input_vocab_size)
    * max_positional_encoding_target = 32926  # number of positions for which target positional encodings are precomputed (set to target_vocab_size)

    * EPOCHS = 120
    * batch_size = 32

    * MAX_LEN = 325
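To make the difference between greedy search and beam search concrete, here is a minimal, framework-free sketch. It is not the notebook's implementation: `step_fn` is a hypothetical stand-in for one decoder step of the trained model, `toy_step` is an invented toy distribution, and no length normalization is applied.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Generic beam search over a next-token log-probability function.

    step_fn(sequence) -> {token: log_prob} is a hypothetical stand-in for
    one decoder step of a trained model.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the beam_width highest-scoring extensions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])

def toy_step(seq):
    """Invented toy distribution in which the greedy first choice is not the best path."""
    last = seq[-1]
    if last == "<s>":
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if last == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if last == "b":
        return {"z": math.log(0.9), "x": math.log(0.1)}
    return {"</s>": 0.0}

# beam_width=1 is greedy search
greedy_seq, greedy_score = beam_search(toy_step, "<s>", "</s>", beam_width=1)
beam_seq, beam_score = beam_search(toy_step, "<s>", "</s>", beam_width=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

With a beam width of 1 the search commits to the locally best token `a`, while a width of 2 keeps `b` alive long enough to find the higher-probability sequence.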

#### Importing Necessary Libraries

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import os 

# from wordcloud import WordCloud
from tensorflow.python.ops.numpy_ops import np_config
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dense, Input, Dropout, LayerNormalization

import pickle
from datetime import datetime
import time
from tqdm import tqdm

np_config.enable_numpy_behavior()

#### Data Preprocessing

In [12]:
df = pd.read_csv('../artifacts/yahma_alpaca_cleaned_telugu_filtered_and_romanized.csv')

cols = ["instruction", "telugu_instruction", "output", "telugu_output"]
df = df[cols]
df['instruction_len'] = df['instruction'].apply(lambda x: len(x.split()))
df['output_len'] = df["output"].apply(lambda x: len(x.split()))
df.head()

| | instruction | telugu_instruction | output | telugu_output | instruction_len | output_len |
|---|---|---|---|---|---|---|
| 0 | Give three tips for staying healthy. | ఆరోగ్యంగా ఉండటానికి మూడు చిట్కాలు ఇవ్వండి. | 1. Eat a balanced and nutritious diet: Make su... | 1. సమతుల్య మరియు పోషకమైన ఆహారం తినండి: మీ భోజన... | 6 | 121 |
| 1 | What are the three primary colors? | మూడు ప్రాధమిక రంగులు ఏమిటి? | The three primary colors are red, blue, and ye... | మూడు ప్రాధమిక రంగులు ఎరుపు, నీలం మరియు పసుపు. ... | 6 | 53 |
| 2 | Describe the structure of an atom. | పరమాణువు నిర్మాణాన్ని వివరించండి. | An atom is the basic building block of all mat... | పరమాణువు అనేది అన్ని పదార్ధాల యొక్క ప్రాథమిక బ... | 6 | 209 |
| 3 | How can we reduce air pollution? | వాయు కాలుష్యాన్ని ఎలా తగ్గించవచ్చు? | There are several ways to reduce air pollution... | వాయు కాలుష్యాన్ని తగ్గించడానికి అనేక మార్గాలు ... | 6 | 216 |
| 4 | Pretend you are a project manager of a constru... | మీరు ఒక కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేన... | I had to make a difficult decision when I was ... | ఓ కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేనేజర్ గ... | 21 | 133 |


In [13]:
import pandas as pd
import re

def preprocess_text(text, is_telugu=False):
    # Lowercase
    text = text.lower()
    
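    # Remove a trailing period (the same rule is applied to English and Telugu text)
    text = re.sub(r"\.$", '', text)

    if is_telugu:
        # Remove a trailing ideographic full stop "。" (a CJK character, not Telugu punctuation)
        text = re.sub("。$", '', text)
    
    # Intended to put spaces around punctuation. NOTE: the "]" right after "\\" closes the
    # character class early, so in practice this pattern only matches the literal "}~]".
    # It is left unchanged here so that inference preprocessing stays identical to the
    # preprocessing the saved tokenizer was presumably fitted on.
    text = re.sub(r"([!#$%&\()*+,-./:;<=>?@[\\]^_`{|}~])", r" \1 ", text)
    text = re.sub(r"['\"]", '', text)  # Remove quotes directly
    text = text.replace(",", ' COMMA')  # Replace each comma with the placeholder token COMMA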
    
    # Remove digits
    text = re.sub(r'\d+', '', text)
    
    # Normalize spacing (e.g., after removing or altering punctuation)
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply preprocessing
df['instruction'] = df['instruction'].apply(lambda x: preprocess_text(x))
df['telugu_instruction'] = df['telugu_instruction'].apply(lambda x: preprocess_text(x, is_telugu=True))

df['output'] = df['output'].apply(lambda x: preprocess_text(x))
df['telugu_output'] = df['telugu_output'].apply(lambda x: preprocess_text(x, is_telugu=True))
# Review a sample
df.head()


| | instruction | telugu_instruction | output | telugu_output | instruction_len | output_len |
|---|---|---|---|---|---|---|
| 0 | give three tips for staying healthy | ఆరోగ్యంగా ఉండటానికి మూడు చిట్కాలు ఇవ్వండి | . eat a balanced and nutritious diet: make sur... | . సమతుల్య మరియు పోషకమైన ఆహారం తినండి: మీ భోజనం... | 6 | 121 |
| 1 | what are the three primary colors? | మూడు ప్రాధమిక రంగులు ఏమిటి? | the three primary colors are red COMMA blue CO... | మూడు ప్రాధమిక రంగులు ఎరుపు COMMA నీలం మరియు పస... | 6 | 53 |
| 2 | describe the structure of an atom | పరమాణువు నిర్మాణాన్ని వివరించండి | an atom is the basic building block of all mat... | పరమాణువు అనేది అన్ని పదార్ధాల యొక్క ప్రాథమిక బ... | 6 | 209 |
| 3 | how can we reduce air pollution? | వాయు కాలుష్యాన్ని ఎలా తగ్గించవచ్చు? | there are several ways to reduce air pollution... | వాయు కాలుష్యాన్ని తగ్గించడానికి అనేక మార్గాలు ... | 6 | 216 |
| 4 | pretend you are a project manager of a constru... | మీరు ఒక కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేన... | i had to make a difficult decision when i was ... | ఓ కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేనేజర్ గ... | 21 | 133 |
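The output above shows that `?` and `:` are not surrounded by spaces after preprocessing, which suggests the punctuation pattern in `preprocess_text` is not matching as intended. The sketch below compares the pattern as written with a hypothetical corrected one (`corrected` is an assumption, not part of the notebook). Any change of this kind would also require refitting the tokenizer and retraining, so it is shown only for reference.

```python
import re

# Pattern as written in preprocess_text: the "]" after "\\" ends the character class early
original = r"([!#$%&\()*+,-./:;<=>?@[\\]^_`{|}~])"

# Hypothetical corrected pattern: "-", "[" and "]" are escaped so the class covers all punctuation
corrected = r"([!#$%&()*+,\-./:;<=>?@\[\\\]^_`{|}~])"

sample = "what are the three primary colors?"

spaced_original = re.sub(original, r" \1 ", sample)
spaced_corrected = re.sub(corrected, r" \1 ", sample)

print(repr(spaced_original))   # the sample comes back unchanged
print(repr(spaced_corrected))  # the question mark is now surrounded by spaces
```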


## Helper Functions for Inference

### Model Name and Parameter Initialization
Load the Tokenizer and Model from Checkpoints using: 
*training_model_name_to_save = f"{time_str}_{embedding_dim}_{fully_connected_dim}_{num_layers}_{num_heads}_{dropout_rate}_{input_vocab_size}_{target_vocab_size}_{EPOCHS}_{batch_size}_{MAX_LEN}"*

* Tokenizer - tokenizer_{training_model_name_to_save}.pickle
* Model - transformer_{training_model_name_to_save}


In [14]:
model_name = "2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325"

# "2024-04-02-03-10_256_512_4_8_0.1_9566_14126_50_32_312"
# training_model_name_to_save = f"{time_str}_{embedding_dim}_{fully_connected_dim}_{num_layers}_{num_heads}_{dropout_rate}_{input_vocab_size}_{target_vocab_size}_{EPOCHS}_{batch_size}_{MAX_LEN}"


In [15]:
# Set hyperparameters for the Transformer model
embedding_dim = int(model_name.split("_")[-10])         
fully_connected_dim = int(model_name.split("_")[-9])
num_layers = int(model_name.split("_")[-8])              
num_heads = int(model_name.split("_")[-7] )                
dropout_rate = float(model_name.split("_")[-6])

input_vocab_size = int(model_name.split("_")[-5])
target_vocab_size = int(model_name.split("_")[-4])

max_positional_encoding_input = input_vocab_size
max_positional_encoding_target = target_vocab_size

EPOCHS = int(model_name.split("_")[-3])
batch_size = int(model_name.split("_")[-2])
MAX_LEN = int(model_name.split("_")[-1])
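The repeated `model_name.split("_")` calls above can also be written as a single parsing step. A minimal sketch follows; `parse_model_name` is a hypothetical helper, and it relies on the assumption that the last ten underscore-separated fields are always the hyperparameters in the order given by the naming convention.

```python
def parse_model_name(model_name):
    """Split a saved run name into a dict of hyperparameters.

    The last ten underscore-separated fields are the hyperparameters;
    everything before them is the timestamp.
    """
    parts = model_name.split("_")
    keys = ["embedding_dim", "fully_connected_dim", "num_layers", "num_heads", "dropout_rate",
            "input_vocab_size", "target_vocab_size", "EPOCHS", "batch_size", "MAX_LEN"]
    values = parts[-len(keys):]
    # dropout_rate is the only non-integer field
    params = {k: (float(v) if k == "dropout_rate" else int(v)) for k, v in zip(keys, values)}
    params["time_str"] = "_".join(parts[:-len(keys)])
    return params

params = parse_model_name("2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325")
print(params)
```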


In [16]:
embedding_dim, dropout_rate, target_vocab_size

(256, 0.1, 32926)

### Positional Encoding

In [17]:
def get_angles(pos, i, embedding_dim):
    """
    Function to compute the angles for positional encoding.
    
    Returns the angle computed
    """
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(embedding_dim))
    return pos * angle_rates


def positional_encoding(position, embedding_dim):
    """
    Computes the sinusoidal positional encodings that are added to the token embeddings
    before they are fed to the Transformer model.
    
    Applies sin to the even feature indices and cos to the odd feature indices of the angles
    returned by get_angles(), and returns a tensor of shape (1, position, embedding_dim).
    """
    angle_rads = get_angles(np.arange(position)[:, np.newaxis], 
                           np.arange(embedding_dim)[np.newaxis, :], embedding_dim)
    
    # apply sin to even indices in the array. ie 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    
    # apply cos to odd indices in the array. ie 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)
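As a quick sanity check of the encoding, the NumPy-only sketch below mirrors the computation above without TensorFlow. The function name `sinusoidal_encoding` and the small sizes are chosen only for illustration.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim):
    """NumPy-only version of the positional encoding, returning shape (1, num_positions, dim)."""
    pos = np.arange(num_positions)[:, np.newaxis]
    i = np.arange(dim)[np.newaxis, :]
    # Feature pairs (2i, 2i+1) share one frequency
    angles = pos / np.power(10000, (2 * (i // 2)) / np.float32(dim))
    angles[:, 0::2] = np.sin(angles[:, 0::2])  # even feature indices
    angles[:, 1::2] = np.cos(angles[:, 1::2])  # odd feature indices
    return angles[np.newaxis, ...]

pe = sinusoidal_encoding(num_positions=50, dim=16)
print(pe.shape)
print(pe[0, 0, :4])  # position 0: sin(0) on even indices, cos(0) on odd indices
```

Every value lies in [-1, 1], and the leading axis of size 1 lets the encoding broadcast over the batch dimension when it is added to the embeddings.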

### Masking

In [18]:
def create_padding_mask(seq):
    """
    Creates a padding mask for a given sequence.
    
    Args:
        seq (tensor): A tensor of shape (batch_size, seq_len) containing the sequence.
        
    Returns:
        A tensor of shape (batch_size, 1, 1, seq_len) containing a mask that is 1 where the sequence is padded, and 0 otherwise.
    """
    # Convert the sequence to a boolean tensor where True indicates a pad token (value 0).
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    
    # Add an extra dimension to the mask to add the padding to the attention logits.
    return seq[:, tf.newaxis, tf.newaxis, :]

def create_look_ahead_mask(size):
    """
    Creates a look-ahead mask used by the decoder's self-attention.

    Args:
        size (int): The size of the mask.

    Returns:
        tf.Tensor: A strictly upper triangular matrix of shape (size, size) with ones above
            the diagonal (future positions to mask) and zeros on and below the diagonal.
    """
    # band_part(..., -1, 0) keeps the lower triangle including the diagonal;
    # subtracting it from 1 leaves ones only above the diagonal
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    
    return mask

def create_masks(inputs, targets):
    """
    Creates masks for the input sequence and target sequence.
    
    Args:
        inputs: Input sequence tensor.
        targets: Target sequence tensor.
    
    Returns:
        A tuple of three masks: the encoder padding mask, the combined mask used in the first attention block, 
        and the decoder padding mask used in the second attention block.
    """
    
    # Create the encoder padding mask.
    enc_padding_mask = create_padding_mask(inputs)
        
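    # Create the decoder padding mask. It is built from the encoder inputs because it
    # masks padded encoder positions in the decoder's second (encoder-decoder) attention block.
    dec_padding_mask = create_padding_mask(inputs)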
        
    # Create the look ahead mask for the first attention block.
    # It is used to pad and mask future tokens in the tokens received by the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(targets)[1])
    
    # Create the decoder target padding mask.
    dec_target_padding_mask = create_padding_mask(targets)
    
    # Combine the look ahead mask and decoder target padding mask for the first attention block.
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)
        
    return enc_padding_mask, combined_mask, dec_padding_mask
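The way the padding mask and the look-ahead mask combine is easiest to see on a tiny example. The sketch below reproduces the same logic in NumPy (the helper names and the toy token ids are illustrative assumptions); a value of 1 means "do not attend".

```python
import numpy as np

def padding_mask(seq):
    # 1.0 where the token id is 0 (padding), shaped (batch, 1, 1, seq_len) for broadcasting
    return (seq == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

def look_ahead_mask(size):
    # 1.0 strictly above the diagonal: position i may not attend to positions j > i
    return 1.0 - np.tril(np.ones((size, size), dtype=np.float32))

targets = np.array([[5, 7, 9, 0]])  # one sequence: three tokens followed by one pad
combined = np.maximum(padding_mask(targets), look_ahead_mask(targets.shape[1]))

print(combined.shape)
print(combined[0, 0])
```

In the printed matrix, row i lists which positions are hidden from query position i: everything after i, plus the padded last column in every row.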


### Attention Mechanism - Multihead Attention


In [19]:
def scaled_dot_product_attention(q, k, v, mask):
    """
    Computes the scaled dot product attention weight for the query (q), key (k), and value (v) vectors. 
    The attention weight is a measure of how much focus should be given to each element in the sequence of values (v) 
    based on the corresponding element in the sequence of queries (q) and keys (k).
    
    Args:
    q: query vectors; shape (..., seq_len_q, depth)
    k: key vectors; shape  (..., seq_len_k, depth)
    v: value vectors; shape  (..., seq_len_v, depth_v)
    mask: (optional) mask to be applied to the attention weights
    
    Returns:
    output: The output of the scaled dot product attention computation; shape   (..., seq_len_q, depth_v)
    attention_weights: The attention weights
    """
    # Compute dot product of query and key vectors
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    
    # Compute the square root of the depth of the key vectors
    dk = tf.cast(tf.shape(k)[-1], dtype=tf.float32)
    scaled_dk = tf.math.sqrt(dk)
    
    # Scale the attention logits by dividing the dot product by sqrt(dk)
    scaled_attention_logits = matmul_qk / scaled_dk
    
    # Apply the mask (if given): masked positions (mask == 1) receive a large negative
    # logit, so their softmax weight is effectively zero
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
        
    # Apply softmax to the scaled attention logits to get the attention weights
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    
    # Compute the weighted sum of the value vectors using the attention weights
    output = tf.matmul(attention_weights, v)
    
    return output, attention_weights
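The effect of the mask on the attention weights can be checked with a small NumPy re-implementation. This is a sketch for illustration only: the `attention` helper and the random inputs are assumptions, and the max-subtraction step is added purely for numerical stability.

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Scaled dot-product attention in NumPy; mask == 1 marks positions to hide."""
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 3, 4))
k = rng.normal(size=(1, 3, 4))
v = rng.normal(size=(1, 3, 4))
mask = np.array([[[0.0, 0.0, 1.0]]])  # hide the last key position from every query

out, w = attention(q, k, v, mask)
print(out.shape)
print(w.round(3))  # each row sums to 1; the last column is ~0
```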


In [20]:
class MultiHeadAttention(tf.keras.layers.Layer):
    """
    MultiHeadAttention Layer that implements the attention mechanism for the Transformer.
    It splits the input into multiple heads, computes scaled dot-product attention for each head
    and then concatenates the output of the heads and passes it through a dense layer.
    """

    def __init__(self, key_dim, num_heads, dropout_rate=0.0):
        """
        Initializes the MultiHeadAttention layer.
    
        Args:
            key_dim (int): The dimensionality of the key space.
            num_heads (int): The number of attention heads.
            dropout_rate (float): The dropout rate to apply after the output dense layer.
        """
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        #  ensure  that the dimension of the embedding can be evenly split across attention heads
        assert key_dim % num_heads == 0 
        self.depth = self.key_dim // self.num_heads
        
        # dense layers to project the input into queries, keys and values
        self.wq = Dense(key_dim)
        self.wk = Dense(key_dim)
        self.wv = Dense(key_dim)
    
        # dropout layer
        self.dropout = Dropout(dropout_rate)
    
        # dense layer to project the output of the attention heads
        self.dense = Dense(key_dim)
        
    def split_heads(self, x, batch_size):
        """
        Splits the last dimension of the tensor into (num_heads, depth).
        Transposes the result such that the shape is (batch_size, num_heads, seq_len, depth).
    
        Args:
            x (tensor): The tensor to be split.
            batch_size (int): The size of the batch.
    
        Returns:
            tensor: The tensor with the last dimension split into (num_heads, depth) and transposed.
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
        
    def call(self, v, k, q, mask=None):
        """
        Applies the multi-head attention mechanism to the inputs.
    
        Args:
            v (tensor): The value tensor of shape (batch_size, seq_len_v, key_dim).
            k (tensor): The key tensor of shape (batch_size, seq_len_k, key_dim).
            q (tensor): The query tensor of shape (batch_size, seq_len_q, key_dim).
            mask (tensor, optional): The mask tensor, broadcastable to
                                     (batch_size, num_heads, seq_len_q, seq_len_k). Defaults to None.
    
        Returns:
            tensor: The output tensor of shape (batch_size, seq_len_q, key_dim).
            tensor: The attention weights tensor of shape (batch_size, num_heads, seq_len_q, seq_len_k).
        """
        batch_size = tf.shape(q)[0]
        
        # Dense on the q, k, v vectors
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        
        # split the heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        
        # compute scaled dot-product attention for all heads, then move the heads axis back next to depth
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        
        # reshape and add Dense layer
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.key_dim))
        output = self.dense(concat_attention)
        output = self.dropout(output)
        
        return output, attention_weights
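The reshape and transpose used in `split_heads`, and undone before the final dense layer, can be verified on a dummy array. The sketch below uses NumPy and the notebook's `key_dim = 256`, `num_heads = 8`; the batch size and sequence length are arbitrary illustrative values.

```python
import numpy as np

batch_size, seq_len, key_dim, num_heads = 2, 5, 256, 8
depth = key_dim // num_heads  # 32 features per head

x = np.arange(batch_size * seq_len * key_dim, dtype=np.float32).reshape(batch_size, seq_len, key_dim)

# split: (batch, seq_len, key_dim) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch_size, -1, num_heads, depth).transpose(0, 2, 1, 3)

# merge: undo the transpose, then flatten the heads back into key_dim
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, -1, key_dim)

print(heads.shape)
print(np.array_equal(merged, x))
```

Head `h` simply sees the contiguous feature slice `[h * depth, (h + 1) * depth)` of every token, and merging restores the original layout exactly.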
    

### Fully Connected Neural Network
    
def FeedForward(embedding_dim, fully_connected_dim):
    """Create a fully connected feedforward neural network.
    
    Args:
        embedding_dim (int): Dimensionality of the embedding output from the transformer layer.
        fully_connected_dim (int): Number of neurons in the fully connected layers.
    
    Returns:
        tf.keras.Sequential: A fully connected feedforward neural network with the specified architecture.
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),
        tf.keras.layers.Dense(embedding_dim)
    ])
    return model


### Encoder


In [21]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1):
        """Initializes the encoder layer
        
        Args: 
            embedding_dim: The dimensionality of the input and output of this layer
            num_heads: The number of attention heads to use in the multi-head attention layer
            fully_connected_dim: The dimensionality of the hidden layer in the feedforward network
            dropout_rate: The rate of dropout to apply to the output of this layer during training
            
        Returns:
            A new instance of the EncoderLayer class
        """
        super(EncoderLayer, self).__init__()
        
        # Multi-head self-attention mechanism
        self.mha = MultiHeadAttention(embedding_dim, num_heads, dropout_rate)
        
        # Layer normalization
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        
        # Dropout
        self.dropout = Dropout(dropout_rate)
        
        # Feedforward network
        self.ffn = FeedForward(embedding_dim, fully_connected_dim)
        
    def call(self, x, training, mask):
        """Applies the encoder layer to the input tensor
        
        Args:
            x: The input tensor to the encoder layer
            training: A boolean indicating whether the model is in training mode
            mask: A tensor representing the mask to apply to the attention mechanism
            
        Returns:
            The output of the encoder layer after applying the multi-head attention and feedforward network
        """
        
        # Apply multi-head self-attention mechanism to input tensor
        attn_output, _ = self.mha(x, x, x, mask)
        
        # Apply first layer normalization and add residual connection
        out1 = self.layernorm1(attn_output + x)
        
        # Apply feedforward network to output of first layer normalization
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout(ffn_output, training=training)
        
        # Apply second layer normalization and add residual connection
        out2 = self.layernorm2(ffn_output + out1)
        
        return out2
    

#### Encoder 
    
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        """
        Initializes the Encoder layer of the Transformer model.
        
        Args:
            num_layers (int): Number of EncoderLayers to stack.
            embedding_dim (int): Dimensionality of the token embedding space.
            num_heads (int): Number of attention heads to use in MultiHeadAttention layers.
            fully_connected_dim (int): Dimensionality of the fully connected layer in the EncoderLayer.
            input_vocab_size (int): Size of the input vocabulary.
            maximum_position_encoding (int): Maximum length of input sequences for positional encoding.
            dropout_rate (float): Probability of dropping out units during training.

        """
        super(Encoder, self).__init__()
        
        self.num_layers = num_layers
        self.embedding_dim = embedding_dim
        
        # Embedding layer
        self.embedding = Embedding(input_vocab_size, embedding_dim)
        
        # Positional encoding
        self.pos_encoding = positional_encoding(maximum_position_encoding, embedding_dim)
        
        # Encoder layers
        self.enc_layers = [EncoderLayer(embedding_dim, num_heads, fully_connected_dim, dropout_rate) for _ in range(num_layers)]
        
        # Dropout layer
        self.dropout = Dropout(dropout_rate)
        
    def call(self, inputs, training, mask):
        """
        Call function for the Encoder layer.
        
        Args:
            inputs: tensor of shape (batch_size, sequence_length) representing input sequences
            training: boolean indicating if the model is in training mode
            mask: padding mask of shape (batch_size, 1, 1, sequence_length), with 1 at padded positions

        Returns:
            A tensor of shape (batch_size, sequence_length, embedding_dim) representing the encoded sequence
        """

        # Get the sequence length
        seq_len = tf.shape(inputs)[1]

        # Embed the input sequence
        inputs = self.embedding(inputs)

        # Scale the embeddings by sqrt(embedding_dim)
        inputs *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))

        # Add positional encodings to the input sequence
        inputs += self.pos_encoding[:, :seq_len, :]

        # Apply dropout to the input sequence
        inputs = self.dropout(inputs, training=training)

        # Pass the input sequence through the encoder layers
        for i in range(self.num_layers):
            inputs = self.enc_layers[i](inputs, training, mask)

        # Return the encoded sequence
        return inputs

### Decoder 

In [22]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1):
        """
        Initializes a single decoder layer of the transformer model.
        
        Args:
        embedding_dim: The dimension of the embedding space.
        num_heads: The number of attention heads to use.
        fully_connected_dim: The dimension of the feedforward network.
        dropout_rate: The dropout rate for regularization.
        """
        super(DecoderLayer, self).__init__()
        
        # Instantiate two instances of MultiHeadAttention.
        self.mha1 = MultiHeadAttention(embedding_dim, num_heads, dropout_rate)
        self.mha2 = MultiHeadAttention(embedding_dim, num_heads, dropout_rate)
        
        # Instantiate a fully connected feedforward network.
        self.ffn = FeedForward(embedding_dim, fully_connected_dim)
        
        # Instantiate three layer normalization layers with epsilon=1e-6.
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.layernorm3 = LayerNormalization(epsilon=1e-6)
        
        # Instantiate a dropout layer for regularization.
        self.dropout3 = Dropout(dropout_rate)
        
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass through the decoder layer.
        
        Args:
        x: The input to the decoder layer, a query vector.
        enc_output: The output from the top layer of the encoder, a set of attention vectors k and v.
        training: Whether to apply dropout regularization.
        look_ahead_mask: The mask to apply to the input sequence so that it can't look ahead to future positions.
        padding_mask: The mask to apply to the input sequence to ignore padding tokens.
        
        Returns:
        The output from the decoder layer, a tensor with the same shape as the input.
        The attention weights from the first multi-head attention layer.
        The attention weights from the second multi-head attention layer.
        """
        
        # Apply the first multi-head attention layer to the query vector x.
        # We pass x as all three inputs to the layer because this is a self-attention layer.
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)
        
        # Add the original input to the output of the attention layer and apply layer normalization.
        out1 = self.layernorm1(attn1 + x) 
        
        # Apply the second multi-head attention layer to the output from the first layer and the encoder output.
        attn2, attn_weights_block2 = self.mha2(enc_output, enc_output, out1, padding_mask)
        
        # Add the output from the first layer to the output of the second layer and apply layer normalization.
        out2 = self.layernorm2(attn2 + out1)
        
        # Apply the feedforward network to the output of the second layer and apply dropout regularization.
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        
        # Add the output from the second layer to the output of the feedforward network and apply layer normalization.
        out3 = self.layernorm3(ffn_output + out2)
        
        return out3, attn_weights_block1, attn_weights_block2


### Decoder
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        """
        Initializes the Decoder object.
        
        Args:
            num_layers (int): The number of Decoder layers.
            embedding_dim (int): The size of the embedding dimension.
            num_heads (int): The number of heads in the MultiHeadAttention layer.
            fully_connected_dim (int): The number of units in the feedforward network.
            target_vocab_size (int): The number of words in the target vocabulary.
            maximum_position_encoding (int): The maximum length of a sequence.
            dropout_rate (float): The rate at which to apply dropout.
        """
        super(Decoder, self).__init__()
        
        self.num_layers = num_layers
        self.embedding_dim = embedding_dim
        
        # create layers
        self.embedding = Embedding(target_vocab_size, embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, embedding_dim)
        self.dec_layers = [DecoderLayer(embedding_dim, num_heads, fully_connected_dim, dropout_rate=dropout_rate) for _ in range(num_layers)]
        self.dropout = Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Executes the Decoder.

        Args:
            x (tf.Tensor): The input to the Decoder.
            enc_output (tf.Tensor): The output from the Encoder.
            training (bool): Whether the Decoder is in training mode.
            look_ahead_mask (tf.Tensor): The mask for self-attention in the MultiHeadAttention layer.
            padding_mask (tf.Tensor): The mask for padding in the MultiHeadAttention layer.

        Returns:
            tf.Tensor: The output from the Decoder.
            dict: A dictionary of attention weights.
        """
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        # add embedding and positional encoding
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)

        # apply each layer of the decoder
        for i in range(self.num_layers):
            # pass through decoder layer i
            x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)

            # record attention weights for block1 and block2
            attention_weights[f"decoder_layer{i + 1}_block1"] = block1
            attention_weights[f"decoder_layer{i + 1}_block2"] = block2

        return x, attention_weights

### Transformer


In [23]:
class Transformer(tf.keras.Model):
    """
    A Transformer model that takes in an input and target sequence and outputs a final prediction.

    Args:
        num_layers (int): Number of layers in the Encoder and Decoder.
        embedding_dim (int): Dimensionality of the embedding layer.
        num_heads (int): Number of attention heads used in the Transformer.
        fully_connected_dim (int): Dimensionality of the fully connected layer in the Encoder and Decoder.
        input_vocab_size (int): Size of the input vocabulary.
        target_vocab_size (int): Size of the target vocabulary.
        max_positional_encoding_input (int): Maximum length of the input sequence.
        max_positional_encoding_target (int): Maximum length of the target sequence.
        dropout_rate (float, optional): Dropout rate used in the Encoder and Decoder layers. Defaults to 0.1.
    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, target_vocab_size, max_positional_encoding_input, max_positional_encoding_target, dropout_rate=0.1):
        super(Transformer, self).__init__()
        
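        # Initialize the Encoder and Decoder layers
        self.encoder = Encoder(num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, max_positional_encoding_input, dropout_rate)
        # NOTE: the decoder embedding is sized with input_vocab_size rather than target_vocab_size.
        # This looks unintended, but the argument must match the one used when the checkpoint
        # was trained for the saved weights to load, so it is left unchanged here.
        self.decoder = Decoder(num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, max_positional_encoding_target, dropout_rate)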
        
        # Add a final dense layer to make the final prediction
        self.final_layer = tf.keras.layers.Dense(target_vocab_size, activation='softmax')
        
    def call(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        """
        Perform a forward pass through the Transformer model.

        Args:
            inp (tf.Tensor): Input sequence tensor with shape (batch_size, input_seq_len).
            tar (tf.Tensor): Target sequence tensor with shape (batch_size, target_seq_len).
            training (bool): Whether the model is being trained or not.
            enc_padding_mask (tf.Tensor): Padding mask for the Encoder with shape (batch_size, 1, 1, input_seq_len).
            look_ahead_mask (tf.Tensor): Mask to prevent the Decoder from looking ahead in the target sequence with shape (batch_size, 1, target_seq_len, target_seq_len).
            dec_padding_mask (tf.Tensor): Padding mask over the encoder output, used in the Decoder's second attention block, with shape (batch_size, 1, 1, input_seq_len).

        Returns:
            tuple: A tuple containing the final output of the model and the attention weights of the Decoder.
        """
        # Pass the input sequence through the Encoder
        enc_output = self.encoder(inp, training, enc_padding_mask)
        
        # Pass the target sequence and the output of the Encoder through the Decoder
        dec_output, attention_weights = self.decoder(tar, enc_output, training, look_ahead_mask, dec_padding_mask)
        
        # Pass the output of the Decoder through the final dense layer to get the final prediction
        final_output = self.final_layer(dec_output)
        
        return final_output, attention_weights


### Optimizer - Learning Rate Scheduler


In [24]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, embedding_dim, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.embedding_dim = tf.cast(embedding_dim, dtype=tf.float32)
        self.warmup_steps = tf.cast(warmup_steps, dtype=tf.float32)

    def __call__(self, step):
        step = tf.cast(step, dtype=tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.embedding_dim) * tf.math.minimum(arg1, arg2)

# Create an instance of the custom learning rate schedule
learning_rate = CustomSchedule(embedding_dim)
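The schedule rises linearly during warm-up and then decays with the inverse square root of the step. A plain-Python sketch of the same formula (the function name `transformer_lr` is an assumption for illustration) makes the shape easy to inspect:

```python
def transformer_lr(step, embedding_dim=256, warmup_steps=4000):
    """lr = embedding_dim^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    return embedding_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)  # both branches of min() coincide at step == warmup_steps

for step in (1, 1000, 4000, 16000):
    print(step, f"{transformer_lr(step):.6f}")
```

With `embedding_dim = 256` and 4000 warm-up steps the peak rate is just under 1e-3, and it halves each time the step count quadruples after warm-up.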


### Loss Function and Metrics 

In [25]:
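# Define the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

# Define the loss object; reduction='none' keeps one loss value per token
# so that loss_function can mask out the padding positions before averaging
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction='none')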


def loss_function(true_values, predictions):
    # Create a mask to exclude the padding tokens
    mask = tf.math.logical_not(tf.math.equal(true_values, 0))

    # Compute the loss value using the loss object
    loss_ = loss_object(true_values, predictions)

    # Apply the mask to exclude the padding tokens
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    # Calculate the mean loss value
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

def accuracy_function(true_values, predictions):
    # Compute the accuracies using the true and predicted target sequences
    accuracies = tf.equal(true_values, tf.argmax(predictions, axis=2))

    # Create a mask to exclude the padding tokens
    mask = tf.math.logical_not(tf.math.equal(true_values, 0))

    # Apply the mask to exclude the padding tokens from the accuracies
    accuracies = tf.math.logical_and(mask, accuracies)
    accuracies = tf.cast(accuracies, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)

    # Calculate the mean accuracy value
    return tf.reduce_sum(accuracies) / tf.reduce_sum(mask)

# Define the training metrics
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
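
The padding mask in `loss_function` only has an effect when the loss is available per token before averaging. The NumPy sketch below illustrates that computation on toy data. The helper `masked_sparse_ce` and the toy probabilities are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def masked_sparse_ce(true_ids, probs, pad_id=0):
    """Mean cross-entropy over non-padding positions.

    true_ids: (batch, seq_len) integer labels
    probs:    (batch, seq_len, vocab) predicted probabilities
    """
    true_ids = np.asarray(true_ids)
    probs = np.asarray(probs, dtype=np.float64)
    # Probability assigned to the correct token at each position
    picked = np.take_along_axis(probs, true_ids[..., None], axis=-1)[..., 0]
    per_token = -np.log(picked)                      # shape (batch, seq_len)
    mask = (true_ids != pad_id).astype(np.float64)   # 1 for real tokens, 0 for padding
    return float((per_token * mask).sum() / mask.sum())

labels = np.array([[2, 1, 0, 0]])        # the last two positions are padding
probs = np.full((1, 4, 3), 1.0 / 3.0)    # uniform guesses everywhere
probs[0, 0] = [0.1, 0.1, 0.8]            # confident and correct at position 0
print(masked_sparse_ce(labels, probs))
```

Only the first two positions contribute to the result; the padded positions are ignored regardless of what the model predicts there.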




## Transformer Initialization

In [26]:
# Create an instance of the Transformer model
transformer = Transformer(num_layers, embedding_dim, num_heads,
                           fully_connected_dim, input_vocab_size, target_vocab_size, 
                           max_positional_encoding_input, max_positional_encoding_target, dropout_rate)

In [27]:
# Define the training step. A fixed input signature avoids retracing the graph for every new batch shape
train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]


@tf.function(input_signature=train_step_signature)
def train_step(encoder_input, target):
    # Slice the target tensor to get the input for the decoder
    decoder_input = target[:, :-1]

    # Slice the target tensor to get the expected output of the decoder
    expected_output = target[:, 1:]

    # Create masks for the encoder input, decoder input and the padding
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, decoder_input)

    # Perform a forward pass through the model
    with tf.GradientTape() as tape:
        predictions, _ = transformer(encoder_input, decoder_input, True, enc_padding_mask, combined_mask, dec_padding_mask)

        # Calculate the loss between the predicted output and the expected output
        loss = loss_function(expected_output, predictions)

    # Calculate gradients and update the model parameters
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    # Update the training loss and accuracy metrics
    train_loss(loss)
    train_accuracy(expected_output, predictions)
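
The two slices at the top of `train_step` implement teacher forcing: the decoder sees the target shifted right and is asked to predict the next token at every position. A small NumPy sketch with hypothetical token IDs (1 = `<start>`, 2 = `<end>`, 0 = padding) shows the alignment:

```python
import numpy as np

# Hypothetical token IDs: 1 = <start>, 2 = <end>, 0 = padding
target = np.array([[1, 7, 9, 2, 0],
                   [1, 5, 2, 0, 0]])

decoder_input = target[:, :-1]     # what the decoder receives
expected_output = target[:, 1:]    # what it should predict at each position

print(decoder_input)
print(expected_output)
```

Position `t` of `expected_output` is the token that follows position `t` of `decoder_input`, so both tensors have the same shape.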



## Model Reload

### Checkpoint Restoration
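- Restore the model weights from the latest checkpoint. The model must be built with the same hyperparameters listed above, otherwise its variables will not match the saved ones.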

In [28]:
checkpoint_path = f"directory/transformer_training_artifacts/training_checkpoints/transformer_{model_name}"

print(f"Loading the Checkpoints from: {checkpoint_path}")

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print ('Latest checkpoint restored!!')

Loading the Checkpoints from: directory/transformer_training_artifacts/training_checkpoints/transformer_2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325
Latest checkpoint restored!!
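
Because the checkpoint directory name encodes the training hyperparameters, it can be parsed to double-check that the model is rebuilt with matching values. The sketch below assumes the field order given by the naming format at the top of this notebook; the helper `parse_model_name` is hypothetical.

```python
def parse_model_name(model_name):
    """Split a saved model name into its hyperparameters (field order assumed from the naming format)."""
    keys = ["time_str", "embedding_dim", "fully_connected_dim", "num_layers", "num_heads",
            "dropout_rate", "input_vocab_size", "target_vocab_size", "EPOCHS", "batch_size", "MAX_LEN"]
    fields = model_name.split("_")
    if len(fields) != len(keys):
        raise ValueError(f"expected {len(keys)} fields, got {len(fields)}")
    parsed = dict(zip(keys, fields))
    # Every field except the timestamp is numeric; only dropout_rate is a float
    for key in keys[1:]:
        parsed[key] = float(parsed[key]) if key == "dropout_rate" else int(parsed[key])
    return parsed

params = parse_model_name("2024-04-03-04-28_256_512_4_8_0.1_20394_32926_50_32_325")
print(params)
```

For the checkpoint name printed above, the `EPOCHS` field parses to 50.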


### Tokenizer Restoration
- Load the Tokenizer from the artifacts directory using the model name

In [29]:
# Load the Tokenizer
import pickle

def load_tokenizers(file_path):
    # Load the pickled dictionary containing tokenizers
    with open(file_path, 'rb') as handle:
        tokenizers_dict = pickle.load(handle)
    return tokenizers_dict

# Path to the pickled tokenizers, derived from the model name
tokenizers_file_path = f'directory/transformer_training_artifacts/tokenizer/tokenizer_{model_name}.pickle'

# Load the tokenizers
loaded_tokenizers_dict = load_tokenizers(tokenizers_file_path)

# Access individual tokenizers by their keys
tokenizer_tar = loaded_tokenizers_dict['english_tokenizer_target']
tokenizer_inp = loaded_tokenizers_dict['telugu_tokenizer_input']

## Inference

### Greedy Search

In [32]:
maxlen = MAX_LEN
def translate_helper(sentence):
    """
    Generate a translation for the given input sentence using greedy search.

    Args:
    sentence (list[str]): Single-element list containing the input sentence in the source language.

    Returns:
    A 1-D tensor of target token IDs, starting with the <start> token.
    """
    
    # Preprocess the input sentence
    sentence = preprocess_text(sentence[0], is_telugu=True)

    sentence = '<start> ' + sentence + ' <end>' # Add start and end of sentence markers
    sentence = [sentence] # Wrap in a list because the Keras tokenizer expects a list of texts
    
    # Vectorize and pad the sentence
    sentence = tokenizer_inp.texts_to_sequences(sentence)
    sentence = pad_sequences(sentence, maxlen=MAX_LEN, padding='post', truncating='post')
    # Named encoder_input to avoid shadowing the built-in input()
    encoder_input = tf.convert_to_tensor(np.array(sentence), dtype=tf.int64)
    
    # Tokenize the start of the decoder input and convert it to a tensor
    decoder_input = tokenizer_tar.texts_to_sequences(['<start>'])
    decoder_input = tf.convert_to_tensor(np.array(decoder_input), dtype=tf.int64)
    
    # Look up the end-token ID once instead of on every decoding step
    end_id = tokenizer_tar.texts_to_sequences(['<end>'])[0][0]
    
    # Generate the translated sentence one token at a time
    for i in range(maxlen):
        # Create the encoder padding mask, the combined look-ahead mask, and the decoder padding mask
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, decoder_input)
        # Generate predictions for the current decoder input
        predictions, _ = transformer(encoder_input, decoder_input, False, enc_padding_mask, combined_mask, dec_padding_mask)
        # Keep only the prediction for the last position: shape (1, 1, vocab_size)
        predictions = predictions[:, -1:, :]
        # Get the predicted token ID by taking the argmax over the vocabulary
        predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int64)
        
        # Stop once the end token is predicted; it is not appended to the output
        if int(predicted_id[0, 0]) == end_id:
            return tf.squeeze(decoder_input, axis=0)
        
        # Append the predicted ID to the decoder input for the next step
        decoder_input = tf.concat([decoder_input, predicted_id], axis=1)
    
    # No end token within maxlen steps: return what has been generated so far
    return tf.squeeze(decoder_input, axis=0)


def translate(sentence):
    """
    Translate function that generates a translation for the given input sentence.

    Args:
    sentence (str): The input sentence in the source language.

    Returns:
    None.
    """
    
    # Wrap the sentence in a list because translate_helper expects a list
    sentence = [sentence]
    
    # Print the input sentence
    print(f'Input sentence: {sentence[0]}')
    print()
    
    # Generate the translated sentence as a list of token IDs
    result = translate_helper(sentence).numpy().tolist()
    
    # Remove the start and end of sentence markers
    start_id = tokenizer_tar.texts_to_sequences(['<start>'])[0][0]
    end_id = tokenizer_tar.texts_to_sequences(['<end>'])[0][0]
    predicted_ids = [i for i in result if i not in (start_id, end_id)]
    
    # Convert the predicted IDs to a list of words
    predicted_sentence = tokenizer_tar.sequences_to_texts([predicted_ids])
    
    # Print the predicted translation
    print(f'Translation: {predicted_sentence[0]}')
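
The greedy loop can be tried in isolation by replacing the Transformer with a small lookup table. This is only a toy sketch: the five-token vocabulary, the `NEXT_PROBS` table, and `greedy_decode` are all hypothetical stand-ins, and the "model" looks at the last token only.

```python
import numpy as np

# Hypothetical vocabulary: 0 = <pad>, 1 = <start>, 2 = <end>, 3 and 4 = words
START_ID, END_ID = 1, 2

# Toy stand-in for the Transformer: row i holds next-token probabilities after token i
NEXT_PROBS = np.array([
    [0.0, 0.0, 1.0, 0.00, 0.00],   # after <pad>   (never used)
    [0.0, 0.0, 0.1, 0.60, 0.30],   # after <start>
    [0.0, 0.0, 1.0, 0.00, 0.00],   # after <end>   (never used)
    [0.0, 0.0, 0.2, 0.10, 0.70],   # after token 3
    [0.0, 0.0, 0.9, 0.05, 0.05],   # after token 4
])

def greedy_decode(max_len=10):
    """Append the single most probable next token until <end> or max_len is reached."""
    sequence = [START_ID]
    for _ in range(max_len):
        next_id = int(np.argmax(NEXT_PROBS[sequence[-1]]))
        if next_id == END_ID:
            break
        sequence.append(next_id)
    return sequence

print(greedy_decode())
```

As in `translate_helper`, the end token stops the loop without being appended, and `max_len` bounds the output when no end token is produced.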

### Beam Search
- Beam width (k) is a hyperparameter: at each decoding step, only the k highest-scoring partial translations are kept and extended.

In [33]:
beam_width = 5  # Define the beam width

def translate_helper_beam_search(sentence):
    """
    Generate a translation for the given input sentence using beam search.

    Args:
    sentence (list[str]): Single-element list containing the input sentence in the source language.

    Returns:
    A 1-D tensor of target token IDs, starting with the <start> token.
    """
    
    # Preprocess the input sentence
    sentence = preprocess_text(sentence[0], is_telugu=True)

    sentence = '<start> ' + sentence + ' <end>' # Add start and end of sentence markers
    sentence = [sentence] # Wrap in a list because the Keras tokenizer expects a list of texts
    
    # Vectorize and pad the sentence
    sentence = tokenizer_inp.texts_to_sequences(sentence)
    sentence = pad_sequences(sentence, maxlen=MAX_LEN, padding='post', truncating='post')
    # Named encoder_input to avoid shadowing the built-in input()
    encoder_input = tf.convert_to_tensor(np.array(sentence), dtype=tf.int64)
    
    # Tokenize the start of the decoder input and convert it to a tensor
    decoder_input = tokenizer_tar.texts_to_sequences(['<start>'])
    decoder_input = tf.convert_to_tensor(np.array(decoder_input), dtype=tf.int64)
    
    # Look up the end-token ID once instead of on every decoding step
    end_id = tokenizer_tar.texts_to_sequences(['<end>'])[0][0]
    
    # Initialize the beam search candidates
    candidates = [(decoder_input, 0.0)]  # List of (decoder_input, cumulative log-probability) tuples
    
    # Generate the translated sentence using beam search
    for _ in range(maxlen):
        new_candidates = []
        for decoder_input, score in candidates:
            # Create the encoder padding mask, the combined look-ahead mask, and the decoder padding mask
            enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, decoder_input)
            # Generate predictions for the current decoder input
            predictions, _ = transformer(encoder_input, decoder_input, False, enc_padding_mask, combined_mask, dec_padding_mask)
            # Keep only the prediction for the last position
            predictions = predictions[:, -1:, :]
            # Get the top beam_width values and their token IDs.
            # Taking the log below assumes the model outputs probabilities (softmax), not raw logits.
            topk_probs, topk_ids = tf.nn.top_k(tf.squeeze(predictions, axis=1), k=beam_width)
            for i in range(beam_width):
                # Reshape the chosen token ID to (1, 1) so it can be appended along the sequence axis
                next_id = tf.reshape(tf.cast(topk_ids[0][i], tf.int64), (1, 1))
                new_decoder_input = tf.concat([decoder_input, next_id], axis=1)
                new_score = score + tf.math.log(topk_probs[0][i]).numpy()
                new_candidates.append((new_decoder_input, new_score))
    
        # Select the top beam_width candidates
        candidates = sorted(new_candidates, key=lambda x: x[1], reverse=True)[:beam_width]
        
        # Return the highest-scoring candidate that ends with the end token, if any
        for candidate, _ in candidates:
            if int(candidate[0][-1]) == end_id:
                return tf.squeeze(candidate, axis=0)
    
    # No candidate ended within maxlen steps: return the highest-scoring one
    return tf.squeeze(candidates[0][0], axis=0)


def translate_beam(sentence):
    """
    Translate function that generates a translation for the given input sentence.

    Args:
    sentence (str): The input sentence in the source language.

    Returns:
    None.
    """
    
    # Wrap the sentence in a list because translate_helper_beam_search expects a list
    sentence = [sentence]
    
    # Print the input sentence
    print(f'Input sentence: {sentence[0]}')
    print()
    
    # Generate the translated sentence as a list of token IDs
    result = translate_helper_beam_search(sentence).numpy().tolist()
    
    # Remove the start and end of sentence markers
    start_id = tokenizer_tar.texts_to_sequences(['<start>'])[0][0]
    end_id = tokenizer_tar.texts_to_sequences(['<end>'])[0][0]
    predicted_ids = [i for i in result if i not in (start_id, end_id)]
    
    # Convert the predicted IDs to a list of words
    predicted_sentence = tokenizer_tar.sequences_to_texts([predicted_ids])
    
    # Print the predicted translation
    print(f'Translation: {predicted_sentence[0]}')
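
To see how beam search can diverge from greedy search, the sketch below runs it on a toy next-token table. Everything here is hypothetical (the token IDs, `NEXT_PROBS`, and `beam_search`). It is also a variation on the implementation above, not a copy: finished hypotheses are set aside and compared at the end rather than returned as soon as one appears, and no length normalization is applied.

```python
import math

START_ID, END_ID = 1, 2   # hypothetical IDs for <start> and <end>

# Toy stand-in for the Transformer: next-token probabilities given the last token
NEXT_PROBS = {
    1: {3: 0.6, 4: 0.4},           # after <start>
    3: {2: 0.4, 3: 0.3, 4: 0.3},   # after token 3
    4: {2: 0.9, 3: 0.1},           # after token 4
}

def beam_search(beam_width=2, max_len=10):
    """Keep the beam_width best partial sequences per step; collect finished ones separately."""
    beams = [([START_ID], 0.0)]    # (token IDs, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        expanded = []
        for tokens, score in beams:
            for next_id, prob in NEXT_PROBS[tokens[-1]].items():
                expanded.append((tokens + [next_id], score + math.log(prob)))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in expanded[:beam_width]:
            # Hypotheses that emitted <end> leave the beam and are compared at the end
            if tokens[-1] == END_ID:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    best_tokens, _ = max(finished or beams, key=lambda item: item[1])
    return best_tokens

print(beam_search(beam_width=1))   # a beam width of 1 behaves like greedy search
print(beam_search(beam_width=2))
```

With a beam width of 1 the search commits to token 3 (probability 0.6) and ends with total probability 0.6 × 0.4 = 0.24. With a beam width of 2 it also keeps token 4 and finds the sequence with total probability 0.4 × 0.9 = 0.36.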



In [34]:
df.head()

| Unnamed: 0 | instruction | telugu_instruction | output | telugu_output | instruction_len | output_len |
|---|---|---|---|---|---|---|
| 0 | give three tips for staying healthy | ఆరోగ్యంగా ఉండటానికి మూడు చిట్కాలు ఇవ్వండి | . eat a balanced and nutritious diet: make sur... | . సమతుల్య మరియు పోషకమైన ఆహారం తినండి: మీ భోజనం... | 6 | 121 |
| 1 | what are the three primary colors? | మూడు ప్రాధమిక రంగులు ఏమిటి? | the three primary colors are red COMMA blue CO... | మూడు ప్రాధమిక రంగులు ఎరుపు COMMA నీలం మరియు పస... | 6 | 53 |
| 2 | describe the structure of an atom | పరమాణువు నిర్మాణాన్ని వివరించండి | an atom is the basic building block of all mat... | పరమాణువు అనేది అన్ని పదార్ధాల యొక్క ప్రాథమిక బ... | 6 | 209 |
| 3 | how can we reduce air pollution? | వాయు కాలుష్యాన్ని ఎలా తగ్గించవచ్చు? | there are several ways to reduce air pollution... | వాయు కాలుష్యాన్ని తగ్గించడానికి అనేక మార్గాలు ... | 6 | 216 |
| 4 | pretend you are a project manager of a constru... | మీరు ఒక కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేన... | i had to make a difficult decision when i was ... | ఓ కన్ స్ట్రక్షన్ కంపెనీలో ప్రాజెక్ట్ మేనేజర్ గ... | 21 | 133 |


In [38]:
t = "field of autonomous vehicles is rapidly evolving with numerous advancements being made in recent years"

print("Greedy Search Output:")
translate(t)
print("\n\nBeam Search Output:")
translate_beam(t)

Greedy Search Output:
Input sentence: field of autonomous vehicles is rapidly evolving with numerous advancements being made in recent years

Translation: వాహనాల రంగం వేగంగా అభివృద్ధి చెందుతోంది ఇటీవలి సంవత్సరాలలో అనేక పురోగతులు తాజా పరిణామాల్లో కొన్ని


Beam Search Output:
Input sentence: field of autonomous vehicles is rapidly evolving with numerous advancements being made in recent years

Translation: వాహనాల రంగం వేగంగా అభివృద్ధి చెందుతోంది ఇటీవలి సంవత్సరాలలో అత్యంత తాజా పరిణామాల్లో కొన్ని


In [39]:
t = "implement a system to reduce waste and increase efficiencies"

print("Greedy Search Output:")
translate(t)
print("\n\nBeam Search Output:")
translate_beam(t)

Greedy Search Output:
Input sentence: implement a system to reduce waste and increase efficiencies

Translation: రీసైక్లింగ్ ను ప్రోత్సహించండి చాలా దేశాల్లో ఇప్పటికీ సమర్థవంతమైన రీసైక్లింగ్ కార్యక్రమాలు లేవు. ప్రభుత్వాలు పన్ను మినహాయింపులు రీసైక్లింగ్ నిధులు మరియు రీసైక్లింగ్ నిధులు


Beam Search Output:
Input sentence: implement a system to reduce waste and increase efficiencies

Translation: రీసైక్లింగ్ ను ప్రోత్సహించండి చాలా రీసైక్లింగ్ మిశ్రమం యొక్క ప్రధాన ఆర్
