# Sequence-to-Sequence Learning: English-German Translator

This notebook shows how to build an English-to-German translation model as a sequence-to-sequence (seq2seq) learning problem using TensorFlow and Keras. It covers the core components of an encoder-decoder architecture with *bidirectional GRUs*, text preprocessing, and evaluation using BLEU scores.

---

### Sequence-to-Sequence Learning Overview

Sequence-to-sequence (seq2seq) learning uses deep learning models that map an input sequence to an output sequence of potentially different length. Typical applications include:

- Machine Translation (e.g., English-to-German)
- Text Summarization
- Chatbots and Conversational Agents

Key components:
- **Encoder**: Encodes the input sequence into a fixed-length context vector.
- **Decoder**: Decodes the context vector to generate the target sequence.


***Reference***: [TensorFlow in Action](https://www.google.de/books/edition/TensorFlow_in_Action/JYyKEAAAQBAJ?hl=en&gbpv=0).

### **Step 1**: Download and Extract the Dataset

In [1]:
import os
import pandas as pd
import zipfile
from typing import List, Tuple

# Ensure the required dataset is available and extracted
def prepare_data() -> pd.DataFrame:
    """
    Prepares the English-German dataset for translation.
    Downloads and extracts the data if not already present.

    Returns:
        pd.DataFrame: A dataframe containing English and German sentences.
    """
    data_dir = 'data'
    zip_path = os.path.join(data_dir, 'deu-eng.zip')
    extracted_path = os.path.join(data_dir, 'deu.txt')

    if not os.path.exists(zip_path):
        raise FileNotFoundError(
            f"Please download 'deu-eng.zip' from "
            f"http://www.manythings.org/anki/deu-eng.zip and place it in the {data_dir} folder."
        )

    if not os.path.exists(extracted_path):
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Dataset extracted.")
    else:
        print("Dataset already extracted.")

    # Load and preprocess the data
    df = pd.read_csv(extracted_path, delimiter='\t', header=None, names=["EN", "DE", "Attribution"])
    df = df[["EN", "DE"]]
    return df

df = prepare_data()

Dataset already extracted.


**Explanation**: This step ensures the required dataset is extracted for processing. If the archive is not present in the expected location, the function raises a ```FileNotFoundError``` that tells the user where to download it.

---

### **Step 2**: Data Setup and Preprocessing
---

Download the [deu-eng dataset](http://www.manythings.org/anki/deu-eng.zip) manually and place it in the ```data``` folder.

In [2]:
import pandas as pd

# Read the tab-separated file
df = pd.read_csv(os.path.join('data', 'deu.txt'), delimiter='\t', header=None)

# Set column names and keep only the sentence pairs
df.columns = ["EN", "DE", "Attribution"]
df = df[["EN", "DE"]]

print('df.shape = {}'.format(df.shape))

# Drop rows whose German text contains the UTF-8 lead byte 0xC2
# (e.g. non-breaking spaces); umlauts and ß start with 0xC3 and are kept
clean_mask = df["DE"].map(lambda s: b"\xc2" not in s.encode("utf-8"))
df = df[clean_mask]

df.head()

df.shape = (277891, 2)


| | EN | DE |
|---|---|---|
| 0 | Go. | Geh. |
| 1 | Hi. | Hallo! |
| 2 | Hi. | Grüß Gott! |
| 3 | Run! | Lauf! |
| 4 | Run. | Lauf! |


**Explanation**: In this step, we clean the data by removing rows with unwanted characters and retain only English and German sentence pairs for the translation task.

---

### **Step 3**: Random Sampling and Token Addition

In [3]:
n_samples = 50000
random_seed = 4321
df = df.sample(n=n_samples, random_state=random_seed)

start_token = 'sos'
end_token = 'eos'
df["DE"] = start_token + ' ' + df["DE"] + ' ' + end_token

# Randomly sample 10% of the 50,000 examples as the test set
test_df = df.sample(n=int(n_samples/10), random_state=random_seed)

# Randomly sample another 10% from the remaining examples as the validation set
valid_df = df.loc[~df.index.isin(test_df.index)].sample(n=int(n_samples/10), random_state=random_seed)

# Assign the rest to training data
train_df = df.loc[~(df.index.isin(test_df.index) | df.index.isin(valid_df.index))]

**Explanation**: This step draws a random subset of ```50,000``` sentence pairs, wraps each German sentence in a start (```sos```) and an end (```eos```) token, and splits the data into training (40,000), validation (5,000), and test (5,000) sets.
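
The three-way split can also be sketched in plain Python, independent of pandas. This is a minimal illustration rather than the notebook's code: ```split_indices``` is a hypothetical helper, and it shuffles the indices once with ```random.Random``` instead of calling ```DataFrame.sample``` twice.

```python
import random

def split_indices(n_samples, n_test, n_valid, seed=4321):
    """Return disjoint (train, valid, test) index lists covering range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for reproducibility
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

# 10% test, 10% validation, the rest for training
train_idx, valid_idx, test_idx = split_indices(50000, n_test=5000, n_valid=5000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Slicing a single shuffled list guarantees by construction that the three sets do not overlap, which is the same property the ```~df.index.isin(...)``` filters enforce.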

---

### **Step 4**: Analyze Sequence Lengths

In [4]:
def print_sequence_length(str_ser):
    """
    Print the summary stats of the sequence length
    """
    seq_length_ser = str_ser.str.split(' ').str.len()
    print("\nSome summary statistics")
    print("Median length: {}\n".format(seq_length_ser.median()))
    print(seq_length_ser.describe())
    
    print("\nComputing the statistics between the 1% and 99% quantiles (to ignore outliers)")
    
    p_01 = seq_length_ser.quantile(0.01)
    p_99 = seq_length_ser.quantile(0.99)
    
    print(seq_length_ser[
        (seq_length_ser >= p_01) & (seq_length_ser < p_99)
    ].describe())

print("English corpus")
print('='*50)
print_sequence_length(train_df["EN"])
print("\nGerman corpus")
print('='*50)
print_sequence_length(train_df["DE"])

**Explanation**: Here, we analyze the sequence lengths for both corpora. The results help determine appropriate sequence lengths for the encoder and decoder.
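
One way to turn length statistics into a padded sequence length is to take a high quantile of the token counts and add a small margin. The sketch below assumes exactly that rule; ```suggest_max_length```, its nearest-rank quantile, and the toy corpus are illustrative choices, not how the notebook's values were derived.

```python
import math

def suggest_max_length(sentences, quantile=0.99, margin=2):
    """Suggest a padded sequence length: a high quantile of token counts plus a margin."""
    lengths = sorted(len(s.split(' ')) for s in sentences)
    # Nearest-rank quantile: smallest length that covers `quantile` of the sentences
    rank = max(1, math.ceil(quantile * len(lengths)))
    return lengths[rank - 1] + margin

toy_corpus = ["Go .", "I am here .", "Tom went to the bank yesterday ."]
print(suggest_max_length(toy_corpus, quantile=0.99, margin=2))
```

Sentences longer than the chosen length are truncated by the vectorizer, so the margin trades a little extra padding for fewer truncated examples.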

---

### **Step 5**: Vocabulary Analysis

In [5]:
from collections import Counter

en_words = train_df["EN"].str.split().sum()
de_words = train_df["DE"].str.split().sum()
n=10

def get_vocabulary_size_greater_than(words, n, verbose=True):
    """
    Get the vocabulary size above a certain threshold
    """
    counter = Counter(words)
    
    freq_df = pd.Series(
        list(counter.values()),
        index=list(counter.keys())
    ).sort_values(ascending=False)
    
    if verbose:
        print(freq_df.head(n=10))
    n_vocab = (freq_df>=n).sum()
    if verbose:
        print("\nVocabulary size (>={} frequent): {}".format(n, n_vocab))
    return n_vocab

en_vocab = get_vocabulary_size_greater_than(en_words, n)
de_vocab = get_vocabulary_size_greater_than(de_words, n)

Tom    9228
to     8700
I      8620
the    6766
you    6136
a      5741
is     4141
in     2639
of     2470
was    2380
dtype: int64

Vocabulary size (>=10 frequent): 2218
sos      40000
eos      40000
Tom       9713
Ich       7964
ist       4735
nicht     4616
zu        3606
Sie       3441
du        3132
das       2987
dtype: int64

Vocabulary size (>=10 frequent): 2483


**Explanation**: We compute the vocabulary size for both languages, focusing on words that appear at least 10 times. This informs the TextVectorization step later in the pipeline.
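
The same threshold count can be sketched without pandas. Calling ```.sum()``` on a Series of lists concatenates them one by one, which can become slow on large corpora; updating a ```Counter``` sentence by sentence is linear in the number of tokens. ```vocab_size_at_least``` is a hypothetical helper name used only for this illustration.

```python
from collections import Counter

def vocab_size_at_least(sentences, min_count):
    """Count distinct whitespace-separated tokens that occur at least `min_count` times."""
    counter = Counter()
    for sentence in sentences:
        counter.update(sentence.split())  # linear in the number of tokens
    return sum(1 for count in counter.values() if count >= min_count)

toy_sentences = ["Tom is here", "Tom is gone", "I saw Tom"]
print(vocab_size_at_least(toy_sentences, min_count=2))
```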

---

In [6]:
# Define sequence lengths with some extra space for longer sequences
en_seq_length = 19
de_seq_length = 21

print("EN vocabulary size: {}".format(en_vocab))
print("DE vocabulary size: {}".format(de_vocab))
print("\n")

print("EN max sequence length: {}".format(en_seq_length))
print("DE max sequence length: {}".format(de_seq_length))


EN vocabulary size: 2218
DE vocabulary size: 2483


EN max sequence length: 19
DE max sequence length: 21


## *Writing an English-German seq2seq Machine Translator*
---

Machine translation involves transforming text from one language to another. In this notebook, I focus on creating an *English-to-German* translator using a *sequence-to-sequence* (*seq2seq*) deep learning model.

**Seq2Seq Model Architecture**. The seq2seq model consists of two primary components:
- *Encoder:*
    - Processes the input (English) text and generates a fixed-length context vector, also known as a "*[thought vector](https://wiki.pathmind.com/thought-vectors)*."
    - Encodes the input sequence into a hidden representation that summarizes its semantic and syntactic content.
- *Decoder:*
    - Takes the context vector produced by the encoder as input.
    - Decodes it to produce the output sequence (German text).

Both the encoder and decoder are *recurrent neural networks* (*RNNs*), making them suitable for handling sequential data of arbitrary lengths.

**Key Challenges in Seq2Seq Learning**:
- *Variable Lengths*: The input and output sequences often differ in length. For example, the number of words in a translation might be fewer or greater than the source text.
- *Mapping Arbitrary Sequences*: The model must map an input sequence of arbitrary length to an output sequence of arbitrary length while preserving contextual meaning.

![encoder-decoder-machine-translation](plots/enc-dec.svg)

The *encoder* in our sequence-to-sequence model is built using a *Gated Recurrent Unit* (*GRU*), a type of *RNN*. Its role is to process the input sequence and produce a fixed-size output that summarizes the sequence's information. The *GRU* processes each element of the input sequence step by step, updating its hidden state based on the current input and the previous hidden state. After processing the entire sequence, the final hidden state of the *GRU* serves as the context vector, which encapsulates the semantic and syntactic information of the input.

The *decoder*, which is also based on a *GRU*, further processes the context vector to generate the target sequence. In addition to the *GRU*, the *decoder* incorporates *Dense layers*, which play a critical role in producing the final output. The Dense layers map the *GRU's* outputs to the target vocabulary, generating a probability distribution over possible words at each time step. A key feature of the *decoder* is that the weights of the Dense layers are shared across time steps. This means that, similar to how the *GRU* updates and reuses the same weights for each input in the sequence, the Dense layers also reuse their weights for predicting each word in the output sequence. This approach ensures consistency and efficiency in processing sequential data.

![encoder-decoder-machine-translation](plots/gru.svg)

### **Step 6**: Text Vectorization

The TextVectorization layer takes in a string, tokenizes it, and converts the tokens to *IDs* by means of a vocabulary (or dictionary) lookup. It takes a list of strings (or an array of strings) as the input, where each string can be a *word/phrase/sentence* (and so on). Then it learns the vocabulary from that corpus. Finally, the layer can be used to convert a list of strings to a tensor that contains a sequence of token IDs for each string in the list provided.

In [7]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import TextVectorization

def get_vectorizer(
    corpus, n_vocab=100, max_length=None, return_vocabulary=True, name=None
):
    """Return a text vectorization model and, optionally, its vocabulary."""
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='encoder_input')
    vectorize_layer = tf.keras.layers.TextVectorization(
        max_tokens=n_vocab + 2,  # +2 for the padding ('') and [UNK] tokens
        output_mode='int',
        output_sequence_length=max_length,
    )
    # Learn the vocabulary from the corpus
    vectorize_layer.adapt(corpus)
    vectorized_out = vectorize_layer(inp)

    model = tf.keras.models.Model(inputs=inp, outputs=vectorized_out, name=name)
    if return_vocabulary:
        return model, vectorize_layer.get_vocabulary()
    return model

# Get the English vectorizer/vocabulary
en_vectorizer, en_vocabulary = get_vectorizer(
    corpus=np.array(train_df["EN"].tolist()), n_vocab=en_vocab,
    max_length=en_seq_length, name='en_vectorizer'
)

# Get the German vectorizer/vocabulary
# (decoder inputs and labels each drop one token from the full
# "sos ... eos" sequence, hence de_seq_length-1)
de_vectorizer, de_vocabulary = get_vectorizer(
    corpus=np.array(train_df["DE"].tolist()), n_vocab=de_vocab,
    max_length=de_seq_length-1, name='de_vectorizer'
)

**Explanation**: Text vectorization converts sentences into sequences of integer token IDs, which are compatible with neural network input layers.
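
To make the lookup concrete, here is a minimal pure-Python sketch of what an integer-mode vectorizer does: build a frequency-ordered vocabulary, reserve ID 0 for padding and ID 1 for unknown tokens, then pad or truncate to a fixed length. It is a simplification, not the Keras implementation: ```build_vocabulary``` and ```vectorize``` are hypothetical helpers, and only lowercasing and whitespace splitting are applied (the Keras layer's default standardization also strips punctuation).

```python
from collections import Counter

PAD, UNK = '', '[UNK]'  # reserved tokens at IDs 0 and 1

def build_vocabulary(sentences, max_tokens):
    """Return a vocabulary list: reserved tokens first, then tokens by frequency."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + most_common

def vectorize(sentence, vocabulary, seq_length):
    """Map tokens to IDs, then truncate or zero-pad to `seq_length`."""
    token_to_id = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [token_to_id.get(tok, 1) for tok in sentence.lower().split()]
    ids = ids[:seq_length]
    return ids + [0] * (seq_length - len(ids))

vocab = build_vocabulary(["sos ich bin hier eos", "sos ich bin tom eos"], max_tokens=6)
print(vocab)
print(vectorize("sos ich bin müde eos", vocab, seq_length=7))
```

Words outside the vocabulary (here "müde") map to ID 1, and the trailing zeros are the padding that ```mask_zero=True``` in the embedding layers later ignores.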

---


### **Step 7**: Build the Encoder and the Decoder

In [8]:
import tensorflow as tf

def build_encoder(n_vocab, vectorizer):
    """
    Build the encoder for the seq2seq model.
    """
    # Input layer
    encoder_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='encoder_input')
    
    # Text vectorization
    vectorized_output = vectorizer(encoder_input)
    
    # Embedding layer
    embedding_layer = tf.keras.layers.Embedding(
        input_dim=n_vocab + 2, 
        output_dim=128, 
        mask_zero=True, 
        name='encoder_embedding'
    )
    embedded_output = embedding_layer(vectorized_output)
    
    # Bidirectional GRU
    bidirectional_gru = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, name='encoder_gru'), 
        name='encoder_bidirectional_gru'
    )
    encoder_output = bidirectional_gru(embedded_output)
    
    # Define encoder model
    encoder_model = tf.keras.models.Model(
        inputs=encoder_input, 
        outputs=encoder_output, 
        name='encoder'
    )
    return encoder_model


def build_decoder(n_vocab, encoder, vectorizer):
    """
    Build the decoder for the seq2seq model.
    """
    # Encoder input for initializing the decoder's GRU state
    encoder_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='encoder_input_final')
    decoder_initial_state = encoder(encoder_input)

    # Decoder input
    decoder_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='decoder_input')
    
    # Text vectorization for decoder input
    vectorized_output = vectorizer(decoder_input)
    
    # Embedding layer
    embedding_layer = tf.keras.layers.Embedding(
        input_dim=n_vocab + 2, 
        output_dim=128, 
        mask_zero=True, 
        name='decoder_embedding'
    )
    embedded_output = embedding_layer(vectorized_output)
    
    # GRU layer
    gru_layer = tf.keras.layers.GRU(
        256, 
        return_sequences=True, 
        name='decoder_gru'
    )
    gru_output = gru_layer(embedded_output, initial_state=decoder_initial_state)
    
    # Dense layers for output
    dense_layer_1 = tf.keras.layers.Dense(
        512, 
        activation='relu', 
        name='decoder_dense_1'
    )
    dense_output_1 = dense_layer_1(gru_output)
    
    dense_layer_final = tf.keras.layers.Dense(
        n_vocab + 2, 
        activation='softmax', 
        name='decoder_dense_final'
    )
    decoder_output = dense_layer_final(dense_output_1)
    
    # Define decoder model
    decoder_model = tf.keras.models.Model(
        inputs=[encoder_input, decoder_input], 
        outputs=decoder_output, 
        name='decoder'
    )
    return decoder_model

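**Explanation**: The *encoder* processes the input sequence using text vectorization, embeddings, and a bidirectional GRU layer, and produces a context vector representing the input sequence. Because the bidirectional layer concatenates the final states of two 128-unit GRUs, this context vector has 256 dimensions, which is why the decoder GRU uses 256 units. The *decoder* takes the encoder's context vector as its initial state and generates the output sequence: a GRU layer does the decoding, and dense layers produce a probability distribution over target tokens at each time step.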

---

#### *Bidirectional RNN: Reading text forward and backward*
---
A standard *RNN* reads the text forward, one time step at a time, and produces a sequence of outputs. *Bidirectional RNNs*, as the name suggests, read the text both forward and backward, so they produce two sequences of outputs. These two sequences are then combined (e.g., by concatenation) to produce the final output. *Bidirectional RNNs* often outperform standard RNNs because they capture relationships in text in both directions, as shown in the following figure.

![rnn-bi-rnn](plots/rnn-bi-rnn.svg)

Why does reading text backward help? Some languages are written right to left (e.g., Arabic, Hebrew). Unless the text is specifically processed to account for this, a standard RNN that reads in only one direction may struggle with such text. A *bidirectional RNN* removes the model's dependency on a fixed reading direction. Even in English, some relationships cannot be inferred by reading forward only. Consider these two sentences:

- John went toward the bank on Clarence Street.
- John went toward the bank of the river.

Since the two sentences are identical up to the word "bank," it is not possible to know whether "bank" refers to the financial institution or a river bank until you read the rest. A bidirectional RNN also has access to the words that follow, so it can resolve this ambiguity.
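
The "bank" example can be made concrete with a toy recurrence. The sketch below is not a GRU and has no learned weights: ```run_recurrence``` is a hypothetical stand-in whose state at each position depends only on the tokens read so far, which is the one property needed to show why the backward pass matters.

```python
def run_recurrence(tokens):
    """Toy recurrent pass: each state depends only on the tokens seen so far."""
    state, states = 0, []
    for tok in tokens:
        state = (state * 31 + sum(map(ord, tok))) % 1_000_003
        states.append(state)
    return states

def bidirectional_states(tokens):
    """Pair each position's forward state with its backward state."""
    forward = run_recurrence(tokens)
    backward = run_recurrence(tokens[::-1])[::-1]  # re-align to original positions
    return list(zip(forward, backward))

s1 = "John went toward the bank on Clarence Street".split()
s2 = "John went toward the bank of the river".split()
i = s1.index("bank")
f1, b1 = bidirectional_states(s1)[i]
f2, b2 = bidirectional_states(s2)[i]
print("forward states equal:", f1 == f2)
print("backward states equal:", b1 == b2)
```

At the position of "bank", the forward states are identical because both sentences share the same prefix, while the backward states differ because they summarize different continuations.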

### **Step 8**: Build and Compile the Seq2Seq Model

In [9]:
def build_seq2seq_model(encoder, decoder):
    """
    Build the final seq2seq model using the pre-built encoder and decoder.
    """
    # Encoder input
    encoder_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='encoder_input_final')
    encoder_output = encoder(encoder_input)

    # Decoder input
    decoder_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='decoder_input')

    # Use the decoder model
    decoder_output = decoder([encoder_input, decoder_input])

    # Define the seq2seq model
    seq2seq_model = tf.keras.models.Model(
        inputs=[encoder_input, decoder_input], 
        outputs=decoder_output, 
        name='seq2seq_model'
    )
    return seq2seq_model


# Usage
# Assuming `en_vocab`, `en_vectorizer`, and `de_vectorizer` are defined
encoder = build_encoder(en_vocab, en_vectorizer)
decoder = build_decoder(de_vocab, encoder, de_vectorizer)
seq2seq_model = build_seq2seq_model(encoder, decoder)

# Summaries
encoder.summary()
decoder.summary()
seq2seq_model.summary()

In [10]:
# Ensure [UNK] is at the beginning of the vocabulary
en_vocabulary = [v for v in en_vocabulary if v != '[UNK]']  # Remove existing [UNK]
en_vocabulary = ['[UNK]'] + en_vocabulary  # Add [UNK] at the start

de_vocabulary = [v for v in de_vocabulary if v != '[UNK]']  # Remove existing [UNK]
de_vocabulary = ['[UNK]'] + de_vocabulary  # Add [UNK] at the start

In [11]:
seq2seq_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

In [12]:
seq2seq_model.summary()

**Explanation**: The seq2seq model combines the encoder and decoder into a unified framework for training and inference.

---

#### *Detailed Explanation of Seq2Seq Encoder and Decoder Construction*
---
The Seq2Seq model is designed to process sequences of text, such as translating sentences from one language to another. It consists of two key components: the encoder and the decoder. While the encoder compresses the input sequence into a context vector, the decoder expands that context vector to generate the output sequence. Let’s break down the process based on the provided implementation.

**Encoder: Compressing the Input Sequence**. 

The encoder processes the input text sequence and produces a context vector (a fixed-size representation of the input). Here's how it's implemented:
1. ***Input Layer***. An input layer ```encoder_input``` accepts sequences of text strings with a shape of ```(1,)```. The input type is ```tf.string```, enabling flexible handling of textual data.
2. ***Text Vectorization***. The input text is processed through the vectorizer to tokenize and convert it into numerical IDs.
3. ***Embedding Layer***. These token IDs are mapped into dense vectors using an ```Embedding``` layer with a vocabulary size of ```n_vocab + 2```, an output dimension of ```128```, and ```mask_zero=True``` to handle padding.
4. ***Bidirectional GRU***. A bidirectional GRU layer processes the embeddings, combining information from both forward and backward passes. This produces a robust context vector.
5. ***Encoder Model***. The encoder model outputs the processed context vector, which serves as input to the decoder.


**Decoder: Generating the Output Sequence**

The decoder takes the context vector from the encoder and produces, at each time step, a probability distribution over the next output token. It is slightly more complex than the encoder because it consumes two inputs: the context vector and the target-side sequence.

1. ***Decoder's Initial State***. The decoder begins with the context vector (output of the encoder) as its initial state, which serves as the memory of the source sequence.
2. ***Input Layer***. The decoder has a separate input layer ```decoder_input``` for target text sequences.
3. ***Text Vectorization and Embedding***. Similar to the encoder, the target input is passed through a ```vectorizer``` and then through an ```Embedding``` layer. The embedding layer for the decoder is separate since the source and target vocabularies differ.
4. ***GRU Layer***. The GRU layer processes the embedded vectors to predict the next word. Unlike the encoder, the GRU here is unidirectional because predictions are sequential, depending only on past and current inputs.
   - The initial state of the GRU is set to the context vector (```decoder_initial_state```) from the encoder.
   - ```return_sequences=True``` ensures that the GRU outputs a sequence of states, one for each time step.
5. ***Dense Layers***. Two fully connected layers:
    - A hidden dense layer with 512 units and ReLU activation (```decoder_dense_1```) processes the GRU outputs.
    - A final dense layer with ```softmax``` activation predicts the probability distribution over the target vocabulary (```decoder_dense_final```).
6. ***Decoder Model***. The decoder outputs probabilities for the next word at each time step, enabling sequence generation.


![seq-seq-modeling](plots/en2ger-translator.svg)



**Seq2Seq Model: Combining Encoder and Decoder**

The final Seq2Seq model integrates the encoder and decoder into an end-to-end architecture.

1. ***Input***. Two inputs are required:
    - ```encoder_input```: For the source text sequence.
    - ```decoder_input```: For the target text sequence.
2. ***Context Vector***. The ```encoder_input``` is passed through the encoder to generate the context vector.
3. ***Output***. The ```decoder_input``` and the context vector are fed into the decoder to generate the output probabilities.
4. ***Final Model***. The entire model takes both source and target sequences as input and outputs the target sequence probabilities.


**Training the Model with Teacher Forcing**

During training, the decoder receives the actual target sequence as input, a method known as **teacher forcing**. For each time step:
- Given the ground-truth token from the previous position (rather than its own earlier prediction), the model predicts the next token.
- This trains the decoder to align its predictions with the target sequence efficiently.
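
Teacher forcing amounts to shifting the target sequence by one position: the decoder input drops the final ```eos```, and the labels drop the leading ```sos```. The sketch below shows that alignment on a single sentence; ```make_teacher_forcing_pair``` is a hypothetical helper that mirrors the ```rsplit```/```split``` calls used in the data preparation step.

```python
def make_teacher_forcing_pair(target_sentence):
    """Split 'sos ... eos' into decoder input (no eos) and labels (no sos)."""
    tokens = target_sentence.split()
    decoder_input = tokens[:-1]   # what the decoder sees at each step
    decoder_labels = tokens[1:]   # what it should predict at that step
    return decoder_input, decoder_labels

dec_in, dec_out = make_teacher_forcing_pair("sos Ich bin hier eos")
for step, (seen, target) in enumerate(zip(dec_in, dec_out)):
    print(step, seen, "->", target)
```

Both sequences are one token shorter than the full target, which matches the German vectorizer's output length of ```de_seq_length-1```.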

**Key Highlights in the Implementation**
1. **Bidirectional Encoder**. Enhances context understanding by combining forward and backward information.
2. **Separate Embedding Layers**. Different embeddings for source and target sequences ensure language-specific token mapping.
3. **GRU State Initialization**. The decoder initializes with the encoder's context vector, bridging the two components seamlessly.
4. **Dense Output Layers**. These ensure the decoder outputs a probability distribution over the entire target vocabulary.

### **Step 9**: Prepare and Shuffle Data

In [13]:
def prepare_data(train_df, valid_df, test_df):
    """
    Create a data dictionary from the dataframes containing data.
    Note: this redefines the prepare_data() loader from Step 1.
    """
    data_dict = {}
    for label, df in zip(
        ['train', 'valid', 'test'], [train_df, valid_df, test_df]
    ):
        en_inputs = np.array(df["EN"].tolist())
        # Decoder inputs: drop the last token (eos)
        de_inputs = np.array(
            df["DE"].str.rsplit(n=1, expand=True).iloc[:, 0].tolist()
        )
        # Decoder labels: drop the first token (sos)
        de_labels = np.array(
            df["DE"].str.split(n=1, expand=True).iloc[:, 1].tolist()
        )
        data_dict[label] = {
            'encoder_inputs': en_inputs,
            'decoder_inputs': de_inputs,
            'decoder_labels': de_labels
        }
    return data_dict

In [14]:
def shuffle_data(en_inputs, de_inputs, de_labels, shuffle_indices=None):
    """
    Shuffle the data randomly (all inputs and labels at once, with the same permutation)
    """
    if shuffle_indices is None:
        shuffle_indices = np.random.permutation(np.arange(en_inputs.shape[0]))
    else:
        shuffle_indices = np.random.permutation(shuffle_indices)
    return (
        en_inputs[shuffle_indices],
        de_inputs[shuffle_indices],
        de_labels[shuffle_indices]
    ), shuffle_indices

### **Step 10**: Define BLEU Evaluation Metric
---

The source of the ```nmt_bleu.py``` module can be found in this [GitHub repository](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py).

In [15]:
from tensorflow.keras.layers import StringLookup
from nmt_bleu import compute_bleu

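class BLEUMetric:
    def __init__(self, vocabulary, name='bleu', **kwargs):
        """
        Computes the BLEU score for machine translation.
        """
        self.name = name
        # Drop the reserved padding ('') and OOV ('[UNK]') tokens; StringLookup
        # re-adds them at IDs 0 and 1, matching the TextVectorization layer
        self.vocab = [v for v in vocabulary if v not in ('', '[UNK]')]
        self.id_to_token_layer = StringLookup(
            vocabulary=self.vocab,
            mask_token='',       # ID 0 is padding
            oov_token='[UNK]',   # ID 1 is the single OOV token
            num_oov_indices=1,
            invert=True
        )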

    def calculate_bleu_from_predictions(self, real, pred):
        """
        Calculate BLEU score for targets and predictions.
        """
        pred_argmax = tf.argmax(pred, axis=-1)  # Get predicted IDs
        pred_tokens = self.id_to_token_layer(pred_argmax)  # Convert to tokens
        real_tokens = self.id_to_token_layer(real)  # Convert to tokens
        
        # Clean and process tokens
        def clean_text(tokens):
            # Remove padding, strip unwanted tokens
            t = tf.strings.strip(
                tf.strings.regex_replace(
                    tf.strings.join(tf.transpose(tokens), separator=' '),
                    "eos.*", ""  # Remove everything after "eos"
                )
            )
            t = np.char.decode(t.numpy().astype(np.bytes_), encoding='utf-8')
            t = [doc if len(doc) > 0 else '[UNK]' for doc in t]
            return np.char.split(t).tolist()
        
        pred_tokens = clean_text(pred_tokens)
        real_tokens = [[r] for r in clean_text(real_tokens)]  # Format for BLEU

        # Compute BLEU using the provided `compute_bleu` function
        bleu, _, _, _, _, _ = compute_bleu(real_tokens, pred_tokens)
        return bleu

**Explanation**: The *BLEU* metric measures the similarity between predicted translations and ground truth references. It is a widely used evaluation metric for machine translation.
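
To show what the score measures, here is a minimal sketch of sentence-level BLEU for a single reference: clipped n-gram precisions for orders 1 to 4, their geometric mean, and a brevity penalty. It is an illustration under simplifying assumptions (one reference, no smoothing), and ```simple_bleu``` is a hypothetical helper, not the ```compute_bleu``` function imported above.

```python
import math
from collections import Counter

def ngram_counts(tokens, max_order):
    """Count all n-grams of order 1..max_order in a token list."""
    counts = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

def simple_bleu(reference, hypothesis, max_order=4):
    """Single-reference, sentence-level BLEU without smoothing."""
    ref_counts = ngram_counts(reference, max_order)
    hyp_counts = ngram_counts(hypothesis, max_order)
    overlap = hyp_counts & ref_counts  # clipped n-gram matches
    precisions = []
    for order in range(1, max_order + 1):
        matched = sum(c for ng, c in overlap.items() if len(ng) == order)
        possible = max(len(hypothesis) - order + 1, 0)
        precisions.append(matched / possible if possible > 0 else 0.0)
    if min(precisions) == 0:
        return 0.0  # any order with no match zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    ratio = len(hypothesis) / len(reference)
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1 - 1 / ratio)
    return geo_mean * brevity_penalty

reference = "ich bin heute sehr müde".split()
print(simple_bleu(reference, reference))
print(simple_bleu(reference, "ich bin müde".split()))
```

Without smoothing, a hypothesis that misses every n-gram of even one order scores zero, which is one plausible reason for very low BLEU values early in training.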

In [16]:
def clean_text(tokens):
    """Join token sequences into sentences, cut them at eos, and re-split into tokens."""
    # 3. Strip the string of any extra white spaces
    translations_in_bytes = tf.strings.strip(
        # 2. Replace everything from the eos token onward with blank
        tf.strings.regex_replace(
            # 1. Join all the tokens to one string in each sequence
            tf.strings.join(tf.transpose(tokens), separator=' '),
            "eos.*", ""
        ),
    )
    # Decode the byte stream to a string
    translations = np.char.decode(translations_in_bytes.numpy().astype(np.bytes_), encoding='utf-8')

    # If the string is empty, add a [UNK] token;
    # otherwise the BLEU computation raises a division-by-zero error
    translations = [sent if len(sent) > 0 else '[UNK]' for sent in translations]

    # Split the sequences to individual tokens
    translations = np.char.split(translations).tolist()
    return translations
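
One detail of the truncation pattern is worth checking: ```"eos.*"``` matches the letters "eos" anywhere, including inside a longer token. The sketch below reproduces the effect with Python's ```re``` module and compares it with a word-boundary variant. ```truncate_at_eos``` is a hypothetical helper, and whether the ```\b``` pattern behaves identically in ```tf.strings.regex_replace``` is an assumption to verify before changing the notebook code.

```python
import re

def truncate_at_eos(sentence, pattern=r"eos.*"):
    """Remove everything from the first pattern match onward and strip whitespace."""
    return re.sub(pattern, "", sentence).strip()

example = "ich mag videos eos  "
print(repr(truncate_at_eos(example)))                 # substring match cuts inside "videos"
print(repr(truncate_at_eos(example, r"\beos\b.*")))   # word-boundary match keeps the token
```

With the substring pattern, the lowercased token "videos" is cut to "vid"; the word-boundary variant only matches the standalone end token.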

In [17]:
def evaluate_model(model, vectorizer, en_inputs_raw, de_inputs_raw, de_labels_raw, epochs, batch_size):
    """Evaluate the model on various metrics."""
    loss_log, accuracy_log, bleu_log = [], [], []
    bleu_metric = BLEUMetric(de_vocabulary)
    n_batches = en_inputs_raw.shape[0] // batch_size

    for i in range(n_batches):
        print(f"Evaluating batch {i + 1}/{n_batches}", end="\r")

        # Convert inputs to tensors
        x = [
            tf.convert_to_tensor(en_inputs_raw[i * batch_size:(i + 1) * batch_size], dtype=tf.string),
            tf.convert_to_tensor(de_inputs_raw[i * batch_size:(i + 1) * batch_size], dtype=tf.string),
        ]
        # Convert labels to integer token IDs
        y = tf.convert_to_tensor(vectorizer(de_labels_raw[i * batch_size:(i + 1) * batch_size]))

        # Evaluate model
        loss, accuracy = model.evaluate(x, y, verbose=0)
        pred_y = model.predict(x)

        # Compute BLEU score
        bleu = bleu_metric.calculate_bleu_from_predictions(y, pred_y)

        loss_log.append(loss)
        accuracy_log.append(accuracy)
        bleu_log.append(bleu)

    return np.mean(loss_log), np.mean(accuracy_log), np.mean(bleu_log)

In [22]:
def train_model_with_inference(model, vectorizer_layer, train_df, valid_df, test_df, epochs, batch_size):
    """Train the model and perform inference directly after training."""
    bleu_metric = BLEUMetric(de_vocabulary)
    data_dict = prepare_data(train_df, valid_df, test_df)
    shuffle_inds = None

    for epoch in range(epochs):
        bleu_log, accuracy_log, loss_log = [], [], []

        # Shuffle the training data
        (en_inputs_raw, de_inputs_raw, de_labels_raw), shuffle_inds = shuffle_data(
            data_dict['train']['encoder_inputs'],
            data_dict['train']['decoder_inputs'],
            data_dict['train']['decoder_labels'],
            shuffle_inds
        )
        n_train_batches = en_inputs_raw.shape[0] // batch_size

        print(f"\nEpoch {epoch + 1}/{epochs}")

        # Training Loop
        for i in range(n_train_batches):
            print(f"Training batch {i + 1}/{n_train_batches}", end="\r")
            x = [
                tf.convert_to_tensor(en_inputs_raw[i * batch_size:(i + 1) * batch_size], dtype=tf.string),
                tf.convert_to_tensor(de_inputs_raw[i * batch_size:(i + 1) * batch_size], dtype=tf.string),
            ]
            y = tf.convert_to_tensor(vectorizer_layer(de_labels_raw[i * batch_size:(i + 1) * batch_size]))

            # Train on batch
            model.train_on_batch(x, y)

            # Track metrics
            loss, accuracy = model.evaluate(x, y, verbose=0)
            pred_y = model.predict(x)
            bleu = bleu_metric.calculate_bleu_from_predictions(y, pred_y)

            loss_log.append(loss)
            accuracy_log.append(accuracy)
            bleu_log.append(bleu)

        print(f"\t(train) loss: {np.mean(loss_log):.4f} - accuracy: {np.mean(accuracy_log):.4f} - bleu: {np.mean(bleu_log):.4f}")

        # Validation after the epoch
        val_loss, val_accuracy, val_bleu = evaluate_model(
            model,
            vectorizer_layer,
            data_dict['valid']['encoder_inputs'],
            data_dict['valid']['decoder_inputs'],
            data_dict['valid']['decoder_labels'],
            epochs=1,
            batch_size=batch_size
        )
        print(f"\t(valid) loss: {val_loss:.4f} - accuracy: {val_accuracy:.4f} - bleu: {val_bleu:.4f}")

        # Perform inference directly after training
        print("\nPerforming inference on test data...")
        vocabulary = vectorizer_layer.get_vocabulary()  # Retrieve vocabulary from the TextVectorization layer
        for test_example in range(5):  # Perform inference on 5 examples
            en_input = data_dict['test']['encoder_inputs'][test_example]
            de_target = data_dict['test']['decoder_labels'][test_example]
            x = [
                tf.convert_to_tensor([en_input], dtype=tf.string),
                tf.convert_to_tensor(["sos"], dtype=tf.string),  # Start with the SOS token
            ]
            prediction = model.predict(x)
            predicted_sequence_ids = tf.argmax(prediction, axis=-1).numpy()[0]
            predicted_sequence = [vocabulary[token_id] for token_id in predicted_sequence_ids]
            print(f"Input: {en_input}")
            print(f"Target: {de_target}")
            print(f"Prediction: {' '.join(predicted_sequence)}")
            print("-" * 30)

    # Test evaluation after all epochs
    test_loss, test_accuracy, test_bleu = evaluate_model(
        model,
        vectorizer_layer,
        data_dict['test']['encoder_inputs'],
        data_dict['test']['decoder_inputs'],
        data_dict['test']['decoder_labels'],
        epochs=1,
        batch_size=batch_size
    )
    print(f"\n(test) loss: {test_loss:.4f} - accuracy: {test_accuracy:.4f} - bleu: {test_bleu:.4f}")
    print("Training and inference complete.")
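
The inference snippet inside the training loop feeds only the ```sos``` token to the decoder and takes the argmax at every output position. Autoregressive inference typically feeds each predicted token back in as the next decoder input instead. The sketch below shows that greedy loop in plain Python; ```next_token_fn``` is a hypothetical callable (for example, a wrapper around ```model.predict``` that returns the most likely next token), and ```toy_next_token``` is a canned stand-in for a trained model.

```python
def greedy_decode(next_token_fn, source, max_length=20,
                  start_token="sos", end_token="eos"):
    """Generate tokens one at a time, feeding each prediction back as decoder input."""
    decoded = [start_token]
    for _ in range(max_length):
        next_token = next_token_fn(source, decoded)
        if next_token == end_token:
            break
        decoded.append(next_token)
    return decoded[1:]  # drop the start token

# Hypothetical stand-in for the trained model: looks up a canned translation
def toy_next_token(source, decoded_so_far):
    canned = {"I am here": ["ich", "bin", "hier", "eos"]}
    target = canned[source]
    position = len(decoded_so_far) - 1
    return target[position] if position < len(target) else "eos"

print(greedy_decode(toy_next_token, "I am here"))
```

The ```max_length``` cap guards against a model that never emits the end token.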


In [23]:
train_model_with_inference(seq2seq_model, de_vectorizer.layers[1], train_df, valid_df, test_df, epochs=1, batch_size=4096)



Epoch 1/1
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
128/128 ━━━━━━━━━━━━━━━━━━━━ 2s 17ms/step
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step
	(train) loss: 5.2885 - accuracy: 0.0706 - bleu: 0.0000
128/128 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
	(valid) loss: 5.1502 - accuracy: 0.0768 - bleu: 0.0000

Performing inference on test data...
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
Input: 



Loading vocabularies
Loading weights and generating the inference model


ValueError: Unknown layer: 'NotEqual'. Please ensure you are using a `keras.utils.custom_object_scope` and that this object is included in the scope. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.

In [46]:
import time

t1 = time.time()    
train_model(final_model, de_vectorizer, train_df, valid_df, test_df, 1, 4096)
t2 = time.time()

print("\nIt took {} seconds to complete the training".format(t2-t1))

ValueError: Found reserved OOV token at unexpected location in `vocabulary`. Note that passed `vocabulary` does not need to include the OOV and mask tokens. Either remove all mask and OOV tokens, or include them only at the start of the vocabulary in precisely this order: ['[UNK]']. Received: oov_token=[UNK] at vocabulary index [1]