<a href="https://colab.research.google.com/github/shstreuber/AI/blob/main/Week_13_CS_345_545_Translator_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **WELCOME TO THE TRANSLATION OPTIONS NOTEBOOK!**

In this notebook, we explore different types of translation models and techniques for translating text from one language to another. Below are the items covered in this notebook:

0. Recap of the Simple Transformer Model

A. English to German Translation using LSTM

B. English to German Translation using Custom Transformer Model

C. English to German Translation using a Prebuilt Transformer Model

Each section of the notebook implements English-to-German translation with a different architecture and approach. By comparing these options, you'll gain insight into the strengths and limitations of each method and how to choose the most suitable approach for your translation task. Let's dive in!

#**0. The Simple Transformer Model**

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, LayerNormalization, MultiHeadAttention, Embedding, Concatenate
from tensorflow.keras.models import Model

class Transformer(Model):
    def __init__(self, vocab_size, max_sequence_length, d_model, num_heads, num_layers, dropout_rate=0.1):
        super(Transformer, self).__init__()

        # Define embedding layer
        self.embedding_layer = Embedding(vocab_size, d_model)

        # Define positional encoding layer
        self.positional_encoding = self.get_positional_encoding(max_sequence_length, d_model)

        # Define transformer layers
        self.transformer_layers = [self.create_transformer_layer(d_model, num_heads, dropout_rate) for _ in range(num_layers)]

        # Define output layer
        self.output_layer = Dense(vocab_size)

    def get_positional_encoding(self, max_sequence_length, d_model):
        # Sinusoidal positional encodings: sine on even dimensions, cosine on odd dimensions
        positions = tf.range(max_sequence_length, dtype=tf.float32)[:, tf.newaxis]
        dims = tf.range(d_model)[tf.newaxis, :]
        exponents = tf.cast(2 * (dims // 2), tf.float32) / tf.cast(d_model, tf.float32)
        angles = positions / tf.pow(10000.0, exponents)
        positional_encoding = tf.where(tf.equal(dims % 2, 0), tf.math.sin(angles), tf.math.cos(angles))
        return tf.expand_dims(positional_encoding, axis=0)

    def create_transformer_layer(self, d_model, num_heads, dropout_rate):
        # Create transformer layer
        return MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads, dropout=dropout_rate)

    def call(self, inputs, training=False):
        # Define forward pass of the model
        input_sequence, target_sequence = inputs

        # Embed input sequence and add positional encoding
        embedded_input = self.embedding_layer(input_sequence) + self.positional_encoding[:, :tf.shape(input_sequence)[1], :]

        # Apply transformer layers sequentially
        transformer_output = embedded_input
        for layer in self.transformer_layers:
            transformer_output = layer(query=transformer_output, value=transformer_output, attention_mask=None, training=training)

        # Apply output layer
        output_logits = self.output_layer(transformer_output)

        return output_logits

# Example usage:
vocab_size = 10000  # Example vocabulary size
max_sequence_length = 50  # Example maximum sequence length
d_model = 128  # Example model dimensionality
num_heads = 4  # Example number of attention heads
num_layers = 2  # Example number of transformer layers
dropout_rate = 0.1  # Example dropout rate

# Instantiate the Transformer model
transformer_model = Transformer(vocab_size, max_sequence_length, d_model, num_heads, num_layers, dropout_rate)

# Compile the model (the output layer returns logits, so the loss uses from_logits=True)
transformer_model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Define and initialize tokenizer object (replace this with your actual tokenizer)
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

# Example usage with text data
input_text = ["This is an example sentence", "Another example sentence"]
target_text = ["Dies ist ein Beispiel Satz", "Ein weiterer Beispiel Satz"]

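# Fit the tokenizer on the example text so that words map to integer IDs
# (an unfitted tokenizer returns empty sequences)
tokenizer.fit_on_texts(input_text + target_text)

# Tokenize input sequences
input_sequences = tokenizer.texts_to_sequences(input_text)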

# Pad sequences to ensure equal length
input_sequences_padded = tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_length, padding='post')

# Tokenize target sequences (if applicable)
target_sequences = tokenizer.texts_to_sequences(target_text)
target_sequences_padded = tf.keras.preprocessing.sequence.pad_sequences(target_sequences, maxlen=max_sequence_length, padding='post')

# Define model inputs
inputs = (input_sequences_padded, target_sequences_padded)

# Feed data into the model
output_logits = transformer_model(inputs)

# Extract predictions (if applicable)
predicted_sequences = tf.argmax(output_logits, axis=-1)


#**A. Translator with LSTM**
In this completed code:

* English and German sentences are tokenized using the Tokenizer class.
* The sequences are padded to ensure equal length using the pad_sequences function.
* The model architecture is defined with an embedding layer, LSTM layers for both the encoder and decoder, and a dense layer for the decoder output.
* The model is compiled with the RMSprop optimizer and sparse categorical cross-entropy loss.
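* Finally, the model is trained using the fit method with teacher forcing: the encoder receives the English sequences, the decoder receives the German sequences without their final token, and the target is the German sequences shifted one position to the left.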
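
The training call feeds the decoder the German sequence without its final token and asks it to predict the same sequence shifted by one position. The sketch below illustrates that slicing with plain NumPy. The token IDs are made up for illustration (1 for the start token, 2 for the end token, 0 for padding), and `german_padded` is a hypothetical array, not the output of the tokenizer in this notebook.

```python
import numpy as np

# Hypothetical padded German sequences: 1 = start token, 2 = end token, 0 = padding
german_padded = np.array([
    [1, 5, 6, 7, 8, 2],
    [1, 9, 10, 11, 2, 0],
])

# Decoder input: everything except the last position
decoder_input = german_padded[:, :-1]
# Decoder target: the same sequence shifted one step to the left
decoder_target = german_padded[:, 1:]

print(decoder_input)
print(decoder_target)
```

At every position the decoder sees the correct previous token and is trained to predict the next one, which is why the input and target differ by exactly one step.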

In [None]:
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Add start/end tokens to target sentences
english_sentences = ['I am a student', 'He likes apples', 'She is reading a book']
german_sentences = ['<start> Ich bin ein Student <end>',
                    '<start> Er mag Äpfel <end>',
                    '<start> Sie liest ein Buch <end>']

# Tokenize the English sentences
english_tokenizer = Tokenizer()
english_tokenizer.fit_on_texts(english_sentences)
english_sequences = english_tokenizer.texts_to_sequences(english_sentences)
max_english_length = max(len(seq) for seq in english_sequences)
english_vocab_size = len(english_tokenizer.word_index) + 1

# Tokenize the German sentences
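# Disable the default punctuation filter so that '<start>' and '<end>' keep their angle brackets
german_tokenizer = Tokenizer(filters='')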
german_tokenizer.fit_on_texts(german_sentences)
german_sequences = german_tokenizer.texts_to_sequences(german_sentences)
max_german_length = max(len(seq) for seq in german_sequences)
german_vocab_size = len(german_tokenizer.word_index) + 1

# Pad sequences
english_sequences_padded = pad_sequences(english_sequences, maxlen=max_english_length, padding='post')
german_sequences_padded = pad_sequences(german_sequences, maxlen=max_german_length, padding='post')

# Define model architecture
latent_dim = 256
encoder_inputs = Input(shape=(None,))
encoder_embedding = Embedding(input_dim=english_vocab_size, output_dim=latent_dim, mask_zero=True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None,))
decoder_embedding = Embedding(input_dim=german_vocab_size, output_dim=latent_dim, mask_zero=True)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(german_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define and compile the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer=RMSprop(), loss='sparse_categorical_crossentropy')

# Train the model
model.fit(
    [english_sequences_padded, german_sequences_padded[:, :-1]],
    np.expand_dims(german_sequences_padded[:, 1:], -1),
    batch_size=64,
    epochs=50,
    validation_split=0.2
)

In [None]:
# === Define Inference Models ===

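# Reuse the trained decoder embedding layer. A newly created Embedding layer would start
# from random weights and ignore what was learned during training.
# Assumption: the decoder embedding is the last Embedding layer in the training model.
decoder_embedding_layer = [layer for layer in model.layers if isinstance(layer, Embedding)][-1]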

# Encoder model
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder model for inference
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_inputs_single = Input(shape=(1,))
decoder_embedding_output = decoder_embedding_layer(decoder_inputs_single)
decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding_output, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_inputs_single] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

# === Translation / Decoding ===

# Example English sentences to translate
english_test_sentences = ['We are happy', 'They love ice cream']

# Tokenize and pad test data
english_test_sequences = english_tokenizer.texts_to_sequences(english_test_sentences)
english_test_sequences_padded = pad_sequences(english_test_sequences, maxlen=max_english_length, padding='post')

# Function to decode predicted sequences
def decode_sequence(input_seq):
    # Encode the input as state vectors
    states_value = encoder_model.predict(input_seq)

    # Start with the <start> token
    start_token_index = german_tokenizer.word_index.get('<start>', 1)
    end_token_index = german_tokenizer.word_index.get('<end>', 2)
    target_seq = np.array([[start_token_index]])

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = german_tokenizer.index_word.get(sampled_token_index, '')

        if sampled_token_index == end_token_index or len(decoded_sentence.split()) > max_german_length:
            stop_condition = True
        else:
            decoded_sentence += sampled_word + ' '

        target_seq = np.array([[sampled_token_index]])
        states_value = [h, c]

    return decoded_sentence.strip()

# Predict and decode German translations
predicted_sentences = []
for seq in english_test_sequences_padded:
    decoded = decode_sequence(seq.reshape(1, -1))
    predicted_sentences.append(decoded)

# Print the translations
for english_sentence, german_translation in zip(english_test_sentences, predicted_sentences):
    print('English:', english_sentence)
    print('German:', german_translation)
    print()

#**B. Translator with Transformer: Option 1**

In [None]:
# Import the needed libraries
import tensorflow as tf
from tensorflow.keras.layers import LayerNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
import numpy as np

##**1. Encoder Layer**
The class TransformerEncoderLayer defines a single layer within the encoder of a Transformer model. In a Transformer architecture, the encoder is responsible for processing input sequences and extracting their representations. Each layer in the encoder consists of two main components: a multi-head self-attention mechanism and a position-wise feedforward neural network (FFNN).

Here's a breakdown of what the class does:

1. Initialization: The constructor initializes the layer, taking parameters such as the dimensionality of the model (d_model), the number of attention heads (num_heads), the dimensionality of the feedforward neural network (dff), and the dropout rate (rate).

2. Components: Inside the layer, there are three main components:

* Multi-Head Self-Attention (MHA): This component allows the model to focus on different parts of the input sequence simultaneously. It takes the input sequence, computes attention weights for each position, and combines information from different positions based on these weights.
* Feedforward Neural Network (FFNN): After the attention mechanism, the output passes through a position-wise feedforward neural network. This network applies a series of linear transformations and non-linear activations to each position independently.
* Layer Normalization and Dropout: Layer normalization and dropout are applied after each sub-layer (attention and feedforward network) to stabilize training and prevent overfitting.
3. Call Method: The call method defines how the layer processes input data. It takes the input sequence and a boolean flag indicating whether the model is in training mode. Inside the method, the input sequence is passed through the multi-head self-attention mechanism and feedforward neural network in sequence. Layer normalization and dropout are applied after each sub-layer, and the output of the layer is returned.

In summary, the TransformerEncoderLayer class encapsulates the functionality of a single layer within the encoder of a Transformer model. It performs operations such as multi-head self-attention and position-wise feedforward processing on input sequences, facilitating the extraction of meaningful representations from the data.
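
To make the attention step more concrete, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. It is not the Keras MultiHeadAttention implementation. The sizes, the random weight matrices, and the helper names `softmax` and `self_attention` are assumptions chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Project the same input into queries, keys, and values
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Scaled dot-product scores: one row of weights per query position
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output position is a weighted mix of all value vectors
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8   # illustrative sizes only
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # each row sums to 1
```

A multi-head layer runs several such projections in parallel and concatenates the results, which is what lets the model attend to different parts of the sequence at once.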

In [None]:
# Define the TransformerEncoderLayer class
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(TransformerEncoderLayer, self).__init__()

        # Multi-head self-attention mechanism
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        # Feedforward neural network
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])

        # Layer normalization
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        # Dropout layers
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training):
        # Self-attention mechanism
        attn_output = self.mha(inputs, inputs, attention_mask=None, training=training)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)

        # Feedforward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)

        return out2


##**2. Decoder Layer**
The TransformerDecoderLayer class defines a single layer within the decoder of a Transformer model. In the Transformer architecture, the decoder is responsible for generating output sequences based on the representations learned by the encoder. Each layer in the decoder consists of three main components: multi-head self-attention, encoder-decoder attention, and a position-wise feedforward neural network (FFNN).

Here's an explanation of what the class does:

1. Initialization: The constructor initializes the layer, taking parameters such as the dimensionality of the model (d_model), the number of attention heads (num_heads), the dimensionality of the feedforward neural network (dff), and the dropout rate (rate).

2. Components: Inside the layer, there are four main components:

* Masked Multi-Head Self-Attention (MHA): This component allows the decoder to attend to previous positions in the output sequence while preventing it from attending to future positions. It computes attention weights for each position in the output sequence and combines information from different positions based on these weights.
* Encoder-Decoder Attention: This component allows the decoder to take into account the representations learned by the encoder. It computes attention weights between the current position in the output sequence and the input sequence representations, enabling the decoder to align its output with the input.
* Feedforward Neural Network (FFNN): After the attention mechanisms, the output passes through a position-wise feedforward neural network, similar to the encoder. This network applies a series of linear transformations and non-linear activations to each position independently.
* Layer Normalization and Dropout: Layer normalization and dropout are applied after each sub-layer (masked self-attention, encoder-decoder attention, and feedforward network) to stabilize training and prevent overfitting.
3. Call Method: The call method defines how the layer processes input data. It takes the input sequence, encoder output, and a boolean flag indicating whether the model is in training mode. Inside the method, the input sequence is passed through the masked multi-head self-attention mechanism, encoder-decoder attention mechanism, and feedforward neural network in sequence. Layer normalization and dropout are applied after each sub-layer, and the output of the layer is returned.
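
The masking described for the decoder's self-attention can be illustrated with a small NumPy sketch. It builds a lower-triangular (causal) mask and shows how it changes the attention weights. The sequence length and the all-zero scores are assumptions picked to make the effect easy to read.

```python
import numpy as np

seq_len = 4  # illustrative length

# Lower-triangular matrix: position i may attend to positions 0..i only
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Hypothetical raw attention scores (all equal, to make the effect easy to see)
scores = np.zeros((seq_len, seq_len))

# Blocked positions get a large negative score so softmax gives them ~0 weight
masked_scores = np.where(causal_mask, scores, -1e9)
weights = np.exp(masked_scores)
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

The first row attends only to itself, the second row splits its weight between the first two positions, and so on. No position receives weight from a later one.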

In [None]:
class TransformerDecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(TransformerDecoderLayer, self).__init__()

        # Multi-head attention mechanisms
        self.mha1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.mha2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)

        # Feedforward neural network
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])

        # Layer normalization
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.layernorm3 = LayerNormalization(epsilon=1e-6)

        # Dropout layers
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, enc_output, training):
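        # Masked self-attention: the causal mask stops each position from attending to later ones
        # (use_causal_mask needs TensorFlow 2.10 or newer)
        attn1 = self.mha1(inputs, inputs, use_causal_mask=True, training=training)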
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + inputs)

        # Encoder-decoder attention mechanism
        attn2 = self.mha2(out1, enc_output, attention_mask=None, training=training)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)

        # Feedforward network
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)

        return out3


##**3. Implement the Translator**
The TransformerTranslator class is a custom Keras model that implements a sequence-to-sequence transformer for translation tasks, such as English to German translation. Here's what the class does:

1. Initialization: The constructor initializes the model, taking parameters such as the number of layers (num_layers), the dimensionality of the model (d_model), the number of attention heads (num_heads), the dimensionality of the feedforward neural network (dff), the vocabulary size of the input and target languages (input_vocab_size, target_vocab_size), and the dropout rate (dropout_rate).

2. Components: Inside the model, there are two main components:

* Encoder: The encoder consists of a stack of transformer encoder layers. Each layer contains multi-head self-attention and feedforward neural network sub-layers. The encoder processes the input sequences (English sentences) and generates representations for each token in the input sequence.
* Decoder: The decoder consists of a stack of transformer decoder layers. Each layer contains masked multi-head self-attention, encoder-decoder attention, and feedforward neural network sub-layers. The decoder takes the encoder output and generates output sequences (German translations) based on the learned representations.
3. Forward Pass: The call method defines the forward pass of the model. Given input sequences in both the source (English) and target (German) languages, the model passes the source sequence through the encoder to generate encoder output representations. The decoder then combines these representations with the target sequence to produce one output vector per target position. The final dense layer maps each vector to unnormalized scores (logits) over the target vocabulary; applying a softmax turns them into token probabilities.
4. Loss Calculation: During training, the loss between the predicted logits and the actual target tokens is calculated with a loss function such as sparse categorical cross-entropy. This happens outside the class, in the compile step later in this section, and the loss is used to update the model's weights during optimization.
5. Model Compilation: The model is compiled with an optimizer (e.g., Adam) and the loss function. This step, also performed outside the class, prepares the model for training.
6. Model Summary: Finally, the model's architecture is summarized using the summary method, which displays the layer names, output shapes, and number of parameters in the model.

In summary, the TransformerTranslator class encapsulates the functionality of a sequence-to-sequence transformer model for translation tasks. It combines encoder and decoder components to process input sequences and generate output sequences, facilitating tasks such as language translation.
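
The final Dense layer in the code has no softmax activation, so its outputs are raw scores (logits) rather than probabilities. The sketch below shows how such logits could be turned into probabilities and greedy token choices. The `logits` array is hypothetical and far smaller than a real vocabulary.

```python
import numpy as np

# Hypothetical logits for 3 output positions over a 5-token vocabulary
logits = np.array([
    [2.0, 0.5, 0.1, -1.0, 0.0],
    [0.0, 0.0, 3.0, 0.0, 0.0],
    [-0.5, 1.5, 0.2, 1.4, 0.0],
])

# Softmax turns each row of logits into a probability distribution
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Greedy choice: the most likely token ID at each position
predicted_ids = probs.argmax(axis=-1)
print(predicted_ids)  # [0 2 1]
```

Because softmax preserves the ordering of the scores, taking argmax directly on the logits would give the same token IDs.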

In [None]:
# Define the TransformerTranslator model
class TransformerTranslator(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, dropout_rate=0.1):
        super(TransformerTranslator, self).__init__()

        # Embedding layers for input and target vocabularies
        self.encoder_embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)

        # Transformer encoder and decoder layers
        self.transformer_encoder_layers = [TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate)
                                           for _ in range(num_layers)]
        self.transformer_decoder_layers = [TransformerDecoderLayer(d_model, num_heads, dff, dropout_rate)
                                           for _ in range(num_layers)]

        # Final output layer
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inputs, targets, training=False):
        # Encoder padding mask
        enc_padding_mask = None
        # Decoder padding mask
        dec_padding_mask = None

        # Encoder
        encoder_input = self.encoder_embedding(inputs)
        for layer in self.transformer_encoder_layers:
            encoder_input = layer(encoder_input, training=training)

        # Decoder
        decoder_input = self.decoder_embedding(targets)
        for layer in self.transformer_decoder_layers:
            decoder_input = layer(decoder_input, encoder_input, training=training)

        # Final output
        output = self.final_layer(decoder_input)
        return output

##**4. Configure Transformer Architecture**
The example hyperparameters define various aspects of the transformer architecture and training process. Here's a breakdown of what each hyperparameter does:

* num_layers: This hyperparameter determines the number of encoder and decoder layers in the transformer model. Each layer contains multiple sub-layers, such as multi-head self-attention and feedforward neural networks.

* d_model: The dimensionality of the model, also known as the hidden size. It represents the dimensionality of the embedding space and the internal representations within the transformer layers.

* num_heads: The number of attention heads in the multi-head attention mechanism. More attention heads allow the model to focus on different parts of the input sequence simultaneously, capturing more complex relationships.

* dff: The dimensionality of the feedforward neural network layer within each transformer layer. It determines the size of the hidden layer in the feedforward network.

* input_vocab_size: The size of the vocabulary for the input language (e.g., English). It represents the total number of unique tokens or words in the vocabulary.

* target_vocab_size: The size of the vocabulary for the target language (e.g., German). Similar to the input vocabulary size, it represents the total number of unique tokens or words in the target language vocabulary.

* dropout_rate: The dropout rate applied to the output of each sub-layer (attention and feedforward) during training. Dropout is a regularization technique that helps prevent overfitting by randomly zeroing a fraction of activations during training.

These hyperparameters collectively define the architecture and behavior of the transformer model, influencing its capacity, attention mechanism, and regularization during training. Adjusting these hyperparameters allows practitioners to tailor the model to specific tasks and datasets, balancing model complexity and performance.
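
As a rough guide to how these values translate into model size, the arithmetic below uses the example hyperparameters from this notebook. The per-head width shown is the conventional even split of d_model across heads. Note that the layers in this notebook pass key_dim=d_model instead, so each head there is 128 wide.

```python
# Values mirror the example hyperparameters used in this notebook
d_model = 128
num_heads = 8
dff = 512
input_vocab_size = 10000

# Conventional per-head width when d_model is split evenly across heads
depth_per_head = d_model // num_heads

# Embedding table: one d_model-sized vector per vocabulary entry
embedding_params = input_vocab_size * d_model

# Feedforward block: Dense(dff) followed by Dense(d_model), each with weights and biases
ffn_params = (d_model * dff + dff) + (dff * d_model + d_model)

print(depth_per_head)    # 16
print(embedding_params)  # 1280000
print(ffn_params)        # 131712
```

The embedding table dominates the parameter count at this scale, so the vocabulary size has a larger effect on model size than dff does.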

In [None]:
# Example hyperparameters
num_layers = 4
d_model = 128
num_heads = 8
dff = 512
input_vocab_size = 10000
target_vocab_size = 10000
dropout_rate = 0.1

##**5. Initialize and Compile the Model**
The "Initialize and compile the model" section of the code involves setting up the model architecture and configuring its training process. Here's what each step in this section does:

* Model Initialization: The TransformerTranslator class is instantiated with the specified hyperparameters (num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, dropout_rate). This creates an instance of the transformer model with the specified architecture.

* Optimizer Initialization: An optimizer is chosen to update the model parameters during training. In this case, the Adam optimizer is used, which is a popular optimization algorithm for deep learning models. It adapts the learning rate for each parameter during training.

* Loss Function Initialization: The loss function measures the discrepancy between the model predictions and the actual target values. In this example, SparseCategoricalCrossentropy is used, which suits classification with integer class labels. It is created with from_logits=True because the final layer outputs raw scores, and with reduction='none' so that a custom loss_function can mask out padding positions (token ID 0) before averaging.

* Model Compilation: The model is compiled with the optimizer and the custom loss function. This step configures the model for training by specifying the optimization algorithm and the loss to be minimized. Additional metrics to monitor during training can also be specified here.

Overall, this section prepares the model for training by specifying its architecture, optimization algorithm, loss function, and any additional metrics to be tracked. After compilation, the model is ready to be trained on the training data.
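
The code cell for this section wraps the loss in a function that ignores padding positions (token ID 0). The NumPy sketch below reproduces that idea outside TensorFlow so the averaging is easy to follow. The function name `masked_sparse_cross_entropy` and the tiny batch are assumptions for illustration, not part of the notebook's code.

```python
import numpy as np

def masked_sparse_cross_entropy(real, logits):
    # Log-softmax over the vocabulary axis, computed stably
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of the correct token at each position
    token_loss = -np.take_along_axis(log_probs, real[..., None], axis=-1)[..., 0]
    # Padding positions (token ID 0) contribute nothing to the average
    mask = (real != 0).astype(token_loss.dtype)
    return (token_loss * mask).sum() / mask.sum()

# Hypothetical batch: 1 sequence, 3 positions, vocabulary of 4; last position is padding
real = np.array([[2, 1, 0]])
logits = np.zeros((1, 3, 4))   # uniform predictions

loss = masked_sparse_cross_entropy(real, logits)
print(loss)  # log(4), about 1.386, averaged over the 2 non-padding positions
```

Dividing by the number of non-padding tokens, rather than by the full sequence length, keeps heavily padded batches from producing an artificially small loss.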

In [None]:
# Initialize and compile the model
model = TransformerTranslator(num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, dropout_rate)
optimizer = Adam()
loss_object = SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# Define custom loss function
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

model.compile(optimizer=optimizer, loss=loss_function)

##**6. Model Summary**
The "Display model summary" section of the code generates a summary of the model architecture, with information about the layers and parameters in the model. Because TransformerTranslator is a subclassed model, it must be called once on example inputs to build its weights before summary() can report them.

The summary() method is called on the model object and prints a concise summary of the model architecture to the console. It includes information such as:
* The type of each layer in the model.
* The output shape of each layer.
* The number of parameters in each layer.
* The total number of parameters in the model.

By displaying the model summary, you can quickly inspect the architecture of the model, including the number of layers, the shape of the input and output tensors, and the number of trainable parameters. This information is helpful for debugging, optimizing the model, and ensuring that the architecture is as intended.




In [None]:
# Display model summary

# Generate random example inputs and outputs for the model
# The input is a batch of sequences (64 sequences of length 50) for the encoder
input_example = tf.random.uniform((64, 50), minval=0, maxval=input_vocab_size, dtype=tf.int32)

# The output is a batch of sequences (64 sequences of length 50) for the decoder
output_example = tf.random.uniform((64, 50), minval=0, maxval=target_vocab_size, dtype=tf.int32)

# Create padding masks for both encoder and decoder
# Here we assume that padding tokens are represented by 0
# Note: these masks are shown for illustration only; TransformerTranslator.call
# does not accept mask arguments, so they are not passed to the model below
# The encoder_padding_mask is 1.0 wherever input tokens are padding (0)
encoder_padding_mask = tf.cast(tf.math.equal(input_example, 0), tf.float32)

# The decoder_padding_mask is 1.0 wherever output tokens are padding (0)
decoder_padding_mask = tf.cast(tf.math.equal(output_example, 0), tf.float32)

# Run the model with the generated input and output examples
# Pass 'training=False' as a keyword argument to avoid the positional argument error
model(input_example, output_example, training=False)

# Display the model architecture and parameters
model.summary()

##**THE ENTIRE CODE**

In [None]:
# ****************** THE ENTIRE CODE *******************
# Import the needed libraries
import tensorflow as tf
from tensorflow.keras.layers import LayerNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
import numpy as np

# Define the TransformerEncoderLayer class
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(TransformerEncoderLayer, self).__init__()

        # Multi-head self-attention mechanism
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        # Feedforward neural network
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])

        # Layer normalization
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        # Dropout layers
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training):
        # Self-attention mechanism
        attn_output = self.mha(inputs, inputs, attention_mask=None, training=training)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)

        # Feedforward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)

        return out2


# Define the TransformerDecoderLayer class
class TransformerDecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(TransformerDecoderLayer, self).__init__()

        # Multi-head attention mechanisms
        self.mha1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.mha2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)

        # Feedforward neural network
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])

        # Layer normalization
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.layernorm3 = LayerNormalization(epsilon=1e-6)

        # Dropout layers
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

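    def call(self, inputs, enc_output, training=None):
        # Masked self-attention: the causal mask stops each target position
        # from attending to later positions (use_causal_mask needs TensorFlow 2.10+)
        attn1 = self.mha1(inputs, inputs, use_causal_mask=True, training=training)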
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + inputs)

        # Encoder-decoder attention mechanism
        attn2 = self.mha2(out1, enc_output, attention_mask=None, training=training)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)

        # Feedforward network
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)

        return out3


# Define the TransformerTranslator model
class TransformerTranslator(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, dropout_rate=0.1):
        super(TransformerTranslator, self).__init__()

        # Embedding layers for input and target vocabularies
        # (no positional encoding is added here, so token order is not explicitly represented)
        self.encoder_embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)

        # Transformer encoder and decoder layers
        self.transformer_encoder_layers = [TransformerEncoderLayer(d_model, num_heads, dff, dropout_rate)
                                           for _ in range(num_layers)]
        self.transformer_decoder_layers = [TransformerDecoderLayer(d_model, num_heads, dff, dropout_rate)
                                           for _ in range(num_layers)]

        # Final output layer
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inputs, targets, training=False):
        # Padding masks are placeholders only: this simplified model does not
        # mask padding tokens inside the attention layers
        enc_padding_mask = None
        dec_padding_mask = None

        # Encoder
        encoder_input = self.encoder_embedding(inputs)
        for layer in self.transformer_encoder_layers:
            encoder_input = layer(encoder_input, training=training)

        # Decoder
        decoder_input = self.decoder_embedding(targets)
        for layer in self.transformer_decoder_layers:
            decoder_input = layer(decoder_input, encoder_input, training=training)

        # Final output
        output = self.final_layer(decoder_input)
        return output


# Example hyperparameters
num_layers = 4
d_model = 128
num_heads = 8
dff = 512
input_vocab_size = 10000
target_vocab_size = 10000
dropout_rate = 0.1

# Initialize the model, the optimizer, and the per-token loss object
model = TransformerTranslator(num_layers, d_model, num_heads, dff, input_vocab_size, target_vocab_size, dropout_rate)
optimizer = Adam()
loss_object = SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# Define custom loss function
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

model.compile(optimizer=optimizer, loss=loss_function)

# Create random example batches (64 sequences of 50 token ids each) to build the model
input_example = tf.random.uniform((64, 50), minval=0, maxval=input_vocab_size, dtype=tf.int32)
output_example = tf.random.uniform((64, 50), minval=0, maxval=target_vocab_size, dtype=tf.int32)

# Run the model with the generated input and output examples
# Pass 'training=False' as a keyword argument to avoid the positional argument error
model(input_example, output_example, training=False)

# Display the model architecture and parameters
model.summary()
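
The custom `loss_function` above averages the cross-entropy only over positions whose target id is not 0, which is treated as the padding id. The NumPy sketch below reproduces that arithmetic on a tiny batch so the effect of the mask is easy to see. The helper name `masked_sparse_ce` and the toy arrays are illustrative assumptions and are not part of the notebook's model.

```python
import numpy as np

def masked_sparse_ce(real, logits, pad_id=0):
    """Mean cross-entropy over non-padding positions only (hypothetical helper)."""
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Per-token loss: negative log-probability of the true token
    token_loss = -np.take_along_axis(log_probs, real[..., None], axis=-1)[..., 0]
    # 1.0 for real tokens, 0.0 for padding
    mask = (real != pad_id).astype(token_loss.dtype)
    return (token_loss * mask).sum() / mask.sum()

real = np.array([[2, 1, 0, 0]])   # two real tokens followed by two padding tokens
logits = np.zeros((1, 4, 3))      # uniform prediction over a 3-token vocabulary
loss = masked_sparse_ce(real, logits)

# Changing the predictions at the padding positions must not change the loss
noisy = logits.copy()
noisy[0, 2:, :] = [0.0, 5.0, -5.0]
noisy_loss = masked_sparse_ce(real, noisy)

print(loss, noisy_loss)
```

With uniform logits every real token contributes `log(3)`, so both values equal `log(3)`; the padded positions are ignored no matter what the model predicts there.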


#**C. Translator with Transformer: Option 2**

The code below uses the transformers library to load a pre-trained translation model (**opus-mt-en-de**) and its tokenizer. Then, it translates English sentences to German using the model's generate method. Finally, it prints the translations.

**opus-mt-en-de** is a machine translation (MT) model, not a library, trained specifically to translate English text into German. It comes from OPUS-MT, an open-source project of the Helsinki-NLP group that publishes pre-trained neural machine translation models for many language pairs. The model was trained on English-German parallel text from the OPUS collection, so it can translate English sentences into German without any training on your side. Loading it through the transformers library gives you a ready-to-use translation model that you can call programmatically.

Please make sure the transformers library is installed before running this code (pip install transformers sentencepiece). The tokenizer for this model also depends on the sentencepiece package.
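
If you are unsure whether your environment is ready, a quick check like the following can help. This is only a sketch: the helper name `missing_packages` is made up for illustration, and the list of required packages (transformers, sentencepiece, tensorflow) is an assumption based on the imports used in this section.

```python
import importlib.util

def missing_packages(names):
    """Return the package names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed requirements for the TensorFlow version of the pre-trained translator
required = ["transformers", "sentencepiece", "tensorflow"]
missing = missing_packages(required)

if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```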

In [None]:
import tensorflow as tf
import numpy as np
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

# Load pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate English sentences to German
english_sentences = ['I am a student', 'He likes apples', 'She is reading a book']

# Tokenize all sentences as one padded batch; the attention mask that the
# tokenizer returns tells the model which positions are padding
inputs = tokenizer(english_sentences, return_tensors="tf", padding=True)

# Generate with beam search (4 beams) and decode the whole batch at once
output_ids = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
german_translations = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# Print translations
for english_sentence, german_translation in zip(english_sentences, german_translations):
    print('English:', english_sentence)
    print('German:', german_translation)
    print()
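
The `generate` call above sets `num_beams=4`, which means the output is produced with beam search rather than by always picking the single most likely next token. The toy sketch below illustrates the idea in plain Python. The probability table `TOY_MODEL` and the function `beam_search` are invented for illustration; the real `generate` method works on model logits and applies additional rules (such as length handling) that are not reproduced here.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>": {"ich": 0.6, "er": 0.4},
    "ich": {"bin": 0.5, "mag": 0.5},
    "er": {"mag": 0.9, "bin": 0.1},
    "bin": {"</s>": 1.0},
    "mag": {"</s>": 1.0},
}

def beam_search(model, num_beams=2, max_length=5):
    """Return (log_probability, tokens) of the best finished sequence."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_length):
        # Extend every open hypothesis by every possible next token
        candidates = []
        for score, tokens in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:num_beams]:
            if cand[1][-1] == "</s>":
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached the end token
    return max(finished or beams, key=lambda c: c[0])

greedy_score, greedy_tokens = beam_search(TOY_MODEL, num_beams=1)
beam_score, beam_tokens = beam_search(TOY_MODEL, num_beams=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

In this toy table the greedy search (one beam) commits to "ich" because it is the most likely first token and ends with a sequence of probability 0.3, while two beams keep "er" alive long enough to find the sequence with probability 0.36.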
