# Sequence-to-Sequence Learning
---
This notebook walks through the steps of building an English-German translation model, framed as a *sequence-to-sequence* problem.

***Reference***: [TensorFlow in Action](https://www.google.de/books/edition/TensorFlow_in_Action/JYyKEAAAQBAJ?hl=en&gbpv=0).

In [1]:
import os
import requests
import zipfile

Download the [deu-eng dataset](http://www.manythings.org/anki/deu-eng.zip) manually and place it in the ```data``` folder.

In [2]:
# Make sure the zip file has been downloaded
if not os.path.exists(os.path.join('data','deu-eng.zip')):
    raise FileNotFoundError(
        "Uh oh! Did you download deu-eng.zip from http://www.manythings.org/anki/deu-eng.zip manually and place it in the data folder?")
else:
    if not os.path.exists(os.path.join('data', 'deu.txt')):
        with zipfile.ZipFile(os.path.join('data','deu-eng.zip'), 'r') as zip_ref:
            zip_ref.extractall('data')
    else:
        print("The extracted data already exists")

The extracted data already exists


In [3]:
import pandas as pd

# Read the tab-separated file (it has no header row)
df = pd.read_csv(os.path.join('data', 'deu.txt'), delimiter='\t', header=None)

# Set column names
df.columns = ["EN", "DE", "Attribution"]
df = df[["EN", "DE"]]

In [4]:
print('df.shape = {}'.format(df.shape))

df.shape = (277891, 2)


In [5]:
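# Keep only rows whose German text has no 0xC2 byte in its UTF-8 encoding (e.g. from non-breaking spaces)
clean_inds = [i for i in range(len(df)) if b"\xc2" not in df.iloc[i]["DE"].encode("utf-8")]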
df = df.iloc[clean_inds]

df.head()

| | EN | DE |
|---|---|---|
| 0 | Go. | Geh. |
| 1 | Hi. | Hallo! |
| 2 | Hi. | Grüß Gott! |
| 3 | Run! | Lauf! |
| 4 | Run. | Lauf! |


In [6]:
n_samples = 50000
random_seed= 4321
df = df.sample(n=n_samples, random_state=random_seed)

In [7]:
start_token = 'sos'
end_token = 'eos'
df["DE"] = start_token + ' ' + df["DE"] + ' ' + end_token

In [8]:
# Randomly sample 10% of the 50,000 examples as the test set
test_df = df.sample(n=int(n_samples/10), random_state=random_seed)

# Randomly sample another 10% (5,000 examples) from the remainder as the validation set
valid_df = df.loc[~df.index.isin(test_df.index)].sample(n=int(n_samples/10), random_state=random_seed)

# Assign the rest to training data
train_df = df.loc[~(df.index.isin(test_df.index) | df.index.isin(valid_df.index))]

### Analyzing the vocabulary size

In [9]:
from collections import Counter

en_words = train_df["EN"].str.split().sum()
de_words = train_df["DE"].str.split().sum()
n=10

def get_vocabulary_size_greater_than(words, n, verbose=True):
    """
    Get the vocabulary size above a certain threshold
    """
    counter = Counter(words)
    
    freq_df = pd.Series(
    list(counter.values()),
    index=list(counter.keys())
    ).sort_values(ascending=False)
    
    if verbose:
        print(freq_df.head(n=10))
    n_vocab = (freq_df>=n).sum()
    if verbose:
        print("\nVocabulary size (>={} frequent): {}".format(n, n_vocab))
    return n_vocab


print("English corpus")
print('='*50)
en_vocab = get_vocabulary_size_greater_than(en_words, n)

print("\nGerman corpus")
print('='*50)
de_vocab = get_vocabulary_size_greater_than(de_words, n)

English corpus
Tom    9228
to     8700
I      8620
the    6766
you    6136
a      5741
is     4141
in     2639
of     2470
was    2380
dtype: int64

Vocabulary size (>=10 frequent): 2218

German corpus
sos      40000
eos      40000
Tom       9713
Ich       7964
ist       4735
nicht     4616
zu        3606
Sie       3441
du        3132
das       2987
dtype: int64

Vocabulary size (>=10 frequent): 2483


### Analyzing the sequence length

In [10]:
def print_sequence_length(str_ser):
    """
    Print the summary stats of the sequence length
    """
    seq_length_ser = str_ser.str.split(' ').str.len()
    print("\nSome summary statistics")
    print("Median length: {}\n".format(seq_length_ser.median()))
    print(seq_length_ser.describe())
    
    print("\nComputing the statistics between the 1% and 99% quantiles (to ignore outliers)")
    
    p_01 = seq_length_ser.quantile(0.01)
    p_99 = seq_length_ser.quantile(0.99)
    
    print(seq_length_ser[
    (seq_length_ser >= p_01) & (seq_length_ser < p_99)
    ].describe())

In [11]:
print("English corpus")
print('='*50)
print_sequence_length(train_df["EN"])
print("\nGerman corpus")
print('='*50)
print_sequence_length(train_df["DE"])

English corpus

Some summary statistics
Median length: 6.0

count    40000.000000
mean         6.294025
std          2.542850
min          1.000000
25%          5.000000
50%          6.000000
75%          8.000000
max         44.000000
Name: EN, dtype: float64

Computing the statistics between the 1% and 99% quantiles (to ignore outliers)
count    39584.000000
mean         6.184671
std          2.284073
min          2.000000
25%          5.000000
50%          6.000000
75%          7.000000
max         14.000000
Name: EN, dtype: float64

German corpus

Some summary statistics
Median length: 8.0

count    40000.000000
mean         8.332250
std          2.536094
min          3.000000
25%          7.000000
50%          8.000000
75%         10.000000
max         52.000000
Name: DE, dtype: float64

Computing the statistics between the 1% and 99% quantiles (to ignore outliers)
count    39227.000000
mean         8.253116
std          2.231582
min          5.000000
25%          7.000000
50%    

In [12]:
print("EN vocabulary size: {}".format(en_vocab))
print("DE vocabulary size: {}".format(de_vocab))
print("\n")
# Define sequence lengths with some extra space for longer sequences
en_seq_length = 19
de_seq_length = 21
print("EN max sequence length: {}".format(en_seq_length))
print("DE max sequence length: {}".format(de_seq_length))

EN vocabulary size: 2218
DE vocabulary size: 2483


EN max sequence length: 19
DE max sequence length: 21


### Writing an English-German seq2seq Machine Translator
---

Machine translation involves transforming text from one language to another. In this notebook, I focus on creating an *English-to-German* translator using a *sequence-to-sequence* (*seq2seq*) deep learning model.

**Seq2Seq Model Architecture**. The seq2seq model consists of two primary components:
- *Encoder:*
    - Processes the input (English) text and generates a fixed-length context vector, also known as a "*[thought vector](https://wiki.pathmind.com/thought-vectors)*."
    - Encodes the input sequence into a hidden representation that summarizes its semantic and syntactic content.
- *Decoder:*
  - Takes the context vector produced by the encoder as input.
  - Decodes it to produce the output sequence (German text).

Both the encoder and decoder are *recurrent neural networks* (*RNNs*), making them suitable for handling sequential data of arbitrary lengths.

**Key Challenges in Seq2Seq Learning**:
- *Variable Lengths*: The input and output sequences often differ in length. For example, the number of words in a translation might be fewer or greater than the source text.
- *Mapping Arbitrary Sequences*: The model must map an input sequence of arbitrary length to an output sequence of arbitrary length while preserving contextual meaning.
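
One common way to cope with the variable-length challenge, and the one this notebook leans on later through `output_sequence_length` and `mask_zero=True`, is to pad or truncate every sequence of token IDs to a fixed length. The following is a minimal pure-Python sketch of that idea; the helper name `pad_or_truncate` and the toy IDs are illustrative assumptions, not part of the original notebook.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` IDs: cut long sequences, right-pad short ones."""
    clipped = token_ids[:length]
    return clipped + [pad_id] * (length - len(clipped))

# Three toy "sentences" of different lengths (IDs are made up)
batch = [[3], [5, 7, 2], [6, 4, 9, 8, 1, 2]]
padded = [pad_or_truncate(seq, length=5) for seq in batch]

for row in padded:
    print(row)
```

Every row now has the same length, so the batch can be stored as a rectangular tensor, while the padding ID 0 can be masked out so that it does not influence the recurrent layers.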

![encoder-decoder-machine-translation](plots/enc-dec.svg)

The *encoder* in our sequence-to-sequence model is built using a *Gated Recurrent Unit* (*GRU*), a type of *RNN*. Its role is to process the input sequence and produce a fixed-size output that summarizes the sequence's information. The *GRU* processes each element of the input sequence step by step, updating its hidden state based on the current input and the previous hidden state. After processing the entire sequence, the final hidden state of the *GRU* serves as the context vector, which encapsulates the semantic and syntactic information of the input.
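
The step-by-step update described above can be sketched in plain Python. This is only an illustration under stated assumptions: the weights are random rather than trained, the sizes are tiny, the helper names (`gru_step`, `encode`, `init_params`) are hypothetical, and the update rule `h_new = z * h + (1 - z) * candidate` is one common convention for combining the old state with the candidate state.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gate(x, h, W, U, b, activation):
    """Compute activation(W x + U h + b) element by element."""
    pre = [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]
    return [activation(p) for p in pre]

def gru_step(x, h, p):
    z = gate(x, h, p["Wz"], p["Uz"], p["bz"], sigmoid)    # update gate
    r = gate(x, h, p["Wr"], p["Ur"], p["br"], sigmoid)    # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]           # partially forget old state
    cand = gate(x, reset_h, p["Wh"], p["Uh"], p["bh"], math.tanh)
    # Blend the previous state and the candidate state
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, cand)]

def init_params(input_size, hidden_size, rng):
    """Random (untrained) weights; the same set is reused at every time step."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for g in "zrh":
        params["W" + g] = mat(hidden_size, input_size)
        params["U" + g] = mat(hidden_size, hidden_size)
        params["b" + g] = [0.0] * hidden_size
    return params

def encode(sequence, params, hidden_size):
    """Run the GRU over the sequence; the final state is the context vector."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = gru_step(x, h, params)
    return h

rng = random.Random(0)
params = init_params(input_size=3, hidden_size=4, rng=rng)
short_seq = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
long_seq = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]
context_short = encode(short_seq, params, hidden_size=4)
context_long = encode(long_seq, params, hidden_size=4)
print(len(context_short), len(context_long))
```

Both inputs, two and seven steps long, end up as context vectors of the same size, which is the property the decoder depends on.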

The *decoder*, which is also based on a *GRU*, further processes the context vector to generate the target sequence. In addition to the *GRU*, the *decoder* incorporates *Dense layers*, which play a critical role in producing the final output. The Dense layers map the *GRU's* outputs to the target vocabulary, generating a probability distribution over possible words at each time step. A key feature of the *decoder* is that the weights of the Dense layers are shared across time steps. This means that, similar to how the *GRU* updates and reuses the same weights for each input in the sequence, the Dense layers also reuse their weights for predicting each word in the output sequence. This approach ensures consistency and efficiency in processing sequential data.
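
To make the weight sharing concrete, here is a small pure-Python sketch in which one set of dense weights is applied to the decoder output at every time step. The numbers, the sizes (hidden size 2, vocabulary size 3), and the helper names are made up for illustration; a real decoder would use trained weights and the full German vocabulary.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(vector, weights, bias):
    """A single dense layer: weights @ vector + bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

# One weight matrix and bias, shared by all time steps
weights = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
bias = [0.0, 0.1, -0.1]

# Hypothetical decoder GRU outputs for four time steps
gru_outputs = [[0.5, -0.2], [0.1, 0.9], [-0.3, 0.4], [0.5, -0.2]]

# The same `weights` and `bias` are reused for every step
distributions = [softmax(dense(h_t, weights, bias)) for h_t in gru_outputs]
predicted_ids = [max(range(len(d)), key=d.__getitem__) for d in distributions]

print(predicted_ids)
```

Because the first and last time steps receive identical inputs and the weights are shared, they yield identical distributions; with separate weights per step this would not hold in general.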

![encoder-decoder-machine-translation](plots/gru.svg)


### The TextVectorization layer
---

The TextVectorization layer takes in a string, tokenizes it, and converts the tokens to *IDs* by means of a vocabulary (or dictionary) lookup. It takes a list of strings (or an array of strings) as the input, where each string can be a *word/phrase/sentence* (and so on). Then it learns the vocabulary from that corpus. Finally, the layer can be used to convert a list of strings to a tensor that contains a sequence of token IDs for each string in the list provided.
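
The tokenize-then-look-up behaviour can be approximated in a few lines of plain Python. This is a rough sketch, not the Keras implementation: it assumes lowercasing and punctuation stripping as the standardization step, reserves ID 0 for padding and ID 1 for `[UNK]`, and breaks frequency ties alphabetically, which may differ from the ordering Keras produces. The function names are hypothetical.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(corpus, max_tokens):
    """Most frequent tokens first, after the padding and [UNK] entries."""
    counts = Counter(tok for text in corpus for tok in standardize(text).split())
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

def vectorize(text, vocabulary, length):
    """Map tokens to IDs (1 for unknown words), then pad with 0 to `length`."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:length]
    return ids + [0] * (length - len(ids))

corpus = ["Run!", "How are you?", "Are you Tom?"]
vocab = build_vocabulary(corpus, max_tokens=10)
print(vocab)
print(vectorize("How are you, Mary?", vocab, length=6))
```

The word "Mary" never appeared in the toy corpus, so it is mapped to the `[UNK]` ID, and the two trailing zeros are padding.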

In [13]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

en_vectorize_layer = TextVectorization(
max_tokens=en_vocab,
output_mode='int',
output_sequence_length=None
)

In [14]:
en_vectorize_layer.adapt(np.array(train_df["EN"].tolist()).astype('str'))
print(en_vectorize_layer.get_vocabulary()[:10])

['', '[UNK]', np.str_('tom'), np.str_('to'), np.str_('i'), np.str_('you'), np.str_('the'), np.str_('a'), np.str_('is'), np.str_('that')]


In [15]:
print(len(en_vectorize_layer.get_vocabulary()))

2218


In [16]:
"""
toy_model = tf.keras.models.Sequential()
toy_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
toy_model.add(en_vectorize_layer)

input_data = [["run"], ["how are you"],["ectoplasmic residue"]]
pred = toy_model.predict(input_data)

print("Input data: \n{}\n".format(input_data))
print("\nToken IDs: \n{}".format(pred))
"""


In [17]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import TextVectorization

# Vocabulary cap for this toy example; kept separate from en_vocab, which later cells reuse
toy_vocab = 1000

# Define the TextVectorization layer
en_vectorize_layer = TextVectorization(
    max_tokens=toy_vocab,
    output_mode='int',
    output_sequence_length=10  # Adjust sequence length as needed
)

# Adapt the TextVectorization layer to a sample dataset
sample_text = ["run", "how are you", "ectoplasmic residue"]
en_vectorize_layer.adapt(sample_text)

# Create a toy model with the TextVectorization layer
toy_model = Sequential([en_vectorize_layer])

# Input data (ensure it's a TensorFlow tensor of strings)
input_data = tf.constant(["run", "how are you", "ectoplasmic residue"])

# Predict token IDs using the model
pred = toy_model.predict(input_data)

# Display results
print("Input data: \n{}\n".format(input_data.numpy()))
print("\nToken IDs: \n{}".format(pred))


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
Input data: 
[b'run' b'how are you' b'ectoplasmic residue']


Token IDs: 
[[3 0 0 0 0 0 0 0 0 0]
 [5 7 2 0 0 0 0 0 0 0]
 [6 4 0 0 0 0 0 0 0 0]]


### Defining the text vectorizers for the encoder-decoder model

In [18]:
def get_vectorizer(
    corpus, n_vocab=100, max_length=None, return_vocabulary=True, name=None
):
    """ Return a text vectorization model and, optionally, its learned vocabulary """
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='encoder_input')
    # Two extra slots cover the padding ('') and out-of-vocabulary ('[UNK]') tokens
    vectorize_layer = tf.keras.layers.TextVectorization(
        max_tokens=n_vocab+2,
        output_mode='int',
        output_sequence_length=max_length,
    )
    # Learn the vocabulary from the corpus
    vectorize_layer.adapt(corpus)
    vectorized_out = vectorize_layer(inp)

    model = tf.keras.models.Model(inputs=inp, outputs=vectorized_out, name=name)
    if not return_vocabulary:
        return model
    return model, vectorize_layer.get_vocabulary()

In [19]:
# Get the English vectorizer/vocabulary
en_vectorizer, en_vocabulary = get_vectorizer(
corpus=np.array(train_df["EN"].tolist()), n_vocab=en_vocab,
max_length=en_seq_length, name='en_vectorizer'
)

# Get the German vectorizer/vocabulary
de_vectorizer, de_vocabulary = get_vectorizer(
corpus=np.array(train_df["DE"].tolist()), n_vocab=de_vocab,
max_length=de_seq_length-1, name='de_vectorizer'
)

#### Bidirectional RNN: Reading text forward and backward

![rnn-bi-rnn](plots/rnn-bi-rnn.svg)
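
A bidirectional layer reads the sequence once from left to right and once from right to left, then combines the two final states; with concatenation, which is the default `merge_mode` of the Keras `Bidirectional` wrapper, a `GRU(128)` yields 256 values per input. The sketch below illustrates only the mechanics, using a deliberately simplified element-wise recurrence instead of a GRU; all names and weights are illustrative assumptions.

```python
import math

def run_rnn(sequence, w_in, w_rec):
    """Toy recurrence per unit: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = [0.0] * len(w_in)
    for x in sequence:
        h = [math.tanh(wi * x + wr * hi) for wi, wr, hi in zip(w_in, w_rec, h)]
    return h

def bidirectional(sequence, fwd_weights, bwd_weights):
    """Concatenate the final states of a forward and a backward pass."""
    forward_state = run_rnn(sequence, *fwd_weights)
    backward_state = run_rnn(list(reversed(sequence)), *bwd_weights)
    return forward_state + backward_state

# Separate (made-up) weights for each direction, three units each
fwd_weights = ([0.5, -0.3, 0.8], [0.1, 0.4, -0.2])
bwd_weights = ([-0.6, 0.2, 0.7], [0.3, -0.1, 0.5])

sequence = [0.2, -1.0, 0.7, 0.4]
state = bidirectional(sequence, fwd_weights, bwd_weights)
print(len(state))
```

The combined state is twice as wide as either direction alone, which is why a decoder that is initialized with this state needs a matching number of units.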

#### Defining the encoder

In [20]:
# The embedding must cover every ID the English vectorizer can emit
n_vocab = en_vocab

# The input is (None,1) shaped and accepts an array of strings
inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input')

# Vectorize the data (assign token IDs)
vectorized_out = en_vectorizer(inp)

# Define an embedding layer to convert IDs to word vectors
emb_layer = tf.keras.layers.Embedding(
input_dim=n_vocab+2, output_dim=128, mask_zero=True, name='e_embedding')

# Get the embeddings of the token IDs
emb_out = emb_layer(vectorized_out)

gru_layer = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128))
gru_out = gru_layer(emb_out)

encoder = tf.keras.models.Model(inputs=inp, outputs=gru_out)

#### The function that returns the encoder

In [21]:
def get_encoder(n_vocab, vectorizer):
    """
    Define the encoder of the seq2seq model
    """
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input')
    vectorized_out = vectorizer(inp)
    emb_layer = tf.keras.layers.Embedding(
    n_vocab+2, 128, mask_zero=True, name='e_embedding'
    )
    emb_out = emb_layer(vectorized_out)
    gru_layer = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(128, name='e_gru'),
    name='e_bidirectional_gru'
    )

    gru_out = gru_layer(emb_out)
    encoder = tf.keras.models.Model(
    inputs=inp, outputs=gru_out, name='encoder'
    )
    return encoder

In [22]:
encoder = get_encoder(en_vocab, en_vectorizer)
encoder.summary()

#### Defining the decoder and the final model

In [23]:
e_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input_final')
d_init_state = encoder(e_inp)

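d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_input')
# Vectorize the decoder input (not the encoder's `inp`)
vectorized_out = de_vectorizer(d_inp)

# The embedding must cover every ID the German vectorizer can emit
emb_layer = tf.keras.layers.Embedding(
input_dim=de_vocab+2, output_dim=128, mask_zero=True, name='d_embedding'
)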
emb_out = emb_layer(vectorized_out)

gru_layer = tf.keras.layers.GRU(256, return_sequences=True)
gru_out = gru_layer(emb_out, initial_state=d_init_state)


In [24]:
def get_final_seq2seq_model(n_vocab, encoder, vectorizer):
    """
    Define the final encoder-decoder model
    """
    
    e_inp = tf.keras.Input(
    shape=(1,), dtype=tf.string, name='e_input_final'
    )
    d_init_state = encoder(e_inp)
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_input')
    d_vectorized_out = vectorizer(d_inp)
    d_emb_layer = tf.keras.layers.Embedding(
    n_vocab+2, 128, mask_zero=True, name='d_embedding'
    )
    d_emb_out = d_emb_layer(d_vectorized_out)
    d_gru_layer = tf.keras.layers.GRU(
    256, return_sequences=True, name='d_gru'
    )
    d_gru_out = d_gru_layer(d_emb_out, initial_state=d_init_state)
    d_dense_layer_1 = tf.keras.layers.Dense(
    512, activation='relu', name='d_dense_1'
    )
    d_dense1_out = d_dense_layer_1(d_gru_out)
    d_dense_layer_final = tf.keras.layers.Dense(
    n_vocab+2, activation='softmax', name='d_dense_final'
    )
    d_final_out = d_dense_layer_final(d_dense1_out)
    seq2seq = tf.keras.models.Model(
    inputs=[e_inp, d_inp], outputs=d_final_out, name='final_seq2seq'
    )
    return seq2seq

In [42]:
# Get the English vectorizer/vocabulary
en_vectorizer, en_vocabulary = get_vectorizer(
corpus=np.array(train_df["EN"].tolist()), n_vocab=en_vocab,
max_length=en_seq_length, name='e_vectorizer'
)
# Get the German vectorizer/vocabulary
de_vectorizer, de_vocabulary = get_vectorizer(corpus=np.array(train_df["DE"].tolist()), n_vocab=de_vocab,
max_length=de_seq_length-1, name='d_vectorizer'
)
# Define the final model
encoder = get_encoder(n_vocab=en_vocab, vectorizer=en_vectorizer)
final_model = get_final_seq2seq_model(
n_vocab=de_vocab, encoder=encoder, vectorizer=de_vectorizer
)

In [43]:
# Ensure [UNK] is at the beginning of the vocabulary
en_vocabulary = [v for v in en_vocabulary if v != '[UNK]']  # Remove existing [UNK]
en_vocabulary = ['[UNK]'] + en_vocabulary  # Add [UNK] at the start

de_vocabulary = [v for v in de_vocabulary if v != '[UNK]']  # Remove existing [UNK]
de_vocabulary = ['[UNK]'] + de_vocabulary  # Add [UNK] at the start


In [44]:
from tensorflow.keras.metrics import SparseCategoricalAccuracy
final_model.compile(
loss='sparse_categorical_crossentropy',
optimizer='adam',
metrics=['accuracy']
)

In [45]:
final_model.summary()

#### Preparing the training/validation/test data for model training and evaluation

In [46]:
def prepare_data(train_df, valid_df, test_df):
    """
    Create a data dictionary from the dataframes containing data
    """
    data_dict = {}
    for label, df in zip(
        ['train', 'valid', 'test'], [train_df, valid_df, test_df]
        ):
        en_inputs = np.array(df["EN"].tolist())
        de_inputs = np.array(
        df["DE"].str.rsplit(n=1, expand=True).iloc[:,0].tolist()
        )
        de_labels = np.array(
        df["DE"].str.split(n=1, expand=True).iloc[:,1].tolist()
        )
        data_dict[label] = {
        'encoder_inputs': en_inputs,
        'decoder_inputs': de_inputs,
        'decoder_labels': de_labels
        }
    return data_dict
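
The `rsplit`/`split` calls in `prepare_data` shift each German sentence by one token: the decoder input drops the final token, and the decoder label drops the first one, so at every position the label is the word that follows the input word. A minimal pure-Python illustration on a single made-up sentence (the helper name is hypothetical):

```python
def split_decoder_sequence(sentence):
    """Return (decoder_input, decoder_label) for one 'sos ... eos' sentence."""
    decoder_input = sentence.rsplit(maxsplit=1)[0]   # everything except the last token
    decoder_label = sentence.split(maxsplit=1)[1]    # everything except the first token
    return decoder_input, decoder_label

dec_in, dec_label = split_decoder_sequence("sos ich bin tom eos")
print(dec_in)
print(dec_label)

# Position by position: what the decoder sees and what it should predict next
for seen, target in zip(dec_in.split(), dec_label.split()):
    print("{:>4} -> {}".format(seen, target))
```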

In [47]:
def shuffle_data(en_inputs, de_inputs, de_labels, shuffle_indices=None):
    """
    Shuffle the data randomly (applying the same permutation to all inputs and labels at once)
    """
    if shuffle_indices is None:
        shuffle_indices = np.random.permutation(np.arange(en_inputs.shape[0]))
    else:
        shuffle_indices = np.random.permutation(shuffle_indices)
    return (
    en_inputs[shuffle_indices],
    de_inputs[shuffle_indices],
    de_labels[shuffle_indices]
    ), shuffle_indices

#### Defining the BLEUMetric for evaluating the machine translation model
---
The source of the ```nmt_bleu.py``` module can be found in this [GitHub repository](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py).
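
For intuition about what the score measures, the following is a simplified, self-contained sketch of BLEU for one translation and one reference: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is not the `compute_bleu` function from the linked module (which works on a whole corpus, supports several references per sentence, and offers optional smoothing); the function names here are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, translation, max_order=4):
    log_precisions = []
    for n in range(1, max_order + 1):
        cand = ngrams(translation, n)
        ref = ngrams(reference, n)
        overlap = sum((cand & ref).values())   # matches, clipped by reference counts
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0                         # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    geo_mean = math.exp(sum(log_precisions) / max_order)
    ratio = len(translation) / len(reference)
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1.0 - 1.0 / ratio)
    return geo_mean * brevity_penalty

reference = ['als', 'müssen', 'müssen', 'wir', 'in', 'erfahrung',
             'bringen', 'wo', 'sie', 'wohnen']
long_match = ['[UNK]', '[UNK]', 'müssen', 'wir', 'in', 'erfahrung',
              'bringen', 'wo', 'sie', 'wohnen']
broken_match = ['[UNK]', 'einmal', 'müssen', '[UNK]', 'in', 'erfahrung',
                'bringen', 'wo', 'sie', 'wohnen']

score_long = sentence_bleu(reference, long_match)
score_broken = sentence_bleu(reference, broken_match)
print(round(score_long, 4), round(score_broken, 4))
```

The second translation has only one fewer correct word, yet its score drops noticeably because the error interrupts the longer matching phrases that the higher-order n-grams reward.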

In [49]:
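from tensorflow.keras.layers import StringLookup
# Assumes nmt_bleu.py (see the link above) is saved next to this notebook
from nmt_bleu import compute_bleu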

class BLEUMetric:
    def __init__(self, vocabulary, name='bleu', **kwargs):
        """
        Computes the BLEU score for machine translation.
        """
        super().__init__()
        self.vocab = vocabulary
        self.id_to_token_layer = StringLookup(
            vocabulary=self.vocab,
            invert=True,
            num_oov_indices=1  # Allow one OOV token
        )

    def calculate_bleu_from_predictions(self, real, pred):
        """
        Calculate BLEU score for targets and predictions.
        """
        pred_argmax = tf.argmax(pred, axis=-1)  # Get predicted IDs
        pred_tokens = self.id_to_token_layer(pred_argmax)  # Convert to tokens
        real_tokens = self.id_to_token_layer(real)  # Convert to tokens
        
        # Clean and process tokens
        def clean_text(tokens):
            # Remove padding, strip unwanted tokens
            t = tf.strings.strip(
                tf.strings.regex_replace(
                    tf.strings.join(tf.transpose(tokens), separator=' '),
                    "eos.*", ""  # Remove everything after "eos"
                )
            )
            t = np.char.decode(t.numpy().astype(np.bytes_), encoding='utf-8')
            t = [doc if len(doc) > 0 else '[UNK]' for doc in t]
            return np.char.split(t).tolist()
        
        pred_tokens = clean_text(pred_tokens)
        real_tokens = [[r] for r in clean_text(real_tokens)]  # Format for BLEU

        # Compute BLEU using the provided `compute_bleu` function
        bleu, _, _, _, _, _ = compute_bleu(real_tokens, pred_tokens)
        return bleu


#### Using the BLEU metric

In [50]:
translation = [['[UNK]', '[UNK]', 'müssen', 'wir', 'in', 'erfahrung', 'bringen', 'wo', 'sie', 'wohnen']]
reference = [[['als', 'müssen', 'müssen', 'wir', 'in', 'erfahrung', 'bringen', 'wo', 'sie', 'wohnen']]]

bleu1, _, _, _, _, _ = compute_bleu(reference, translation)

translation = [['[UNK]', 'einmal', 'müssen', '[UNK]', 'in', 'erfahrung', 'bringen', 'wo', 'sie', 'wohnen']]
reference = [[['als', 'müssen', 'müssen', 'wir', 'in', 'erfahrung', 'bringen', 'wo', 'sie', 'wohnen']]]


bleu2, _, _, _, _, _ = compute_bleu(reference, translation)

print("BLEU score with longer correctly predicted phrases: {}".format(bleu1))
print("BLEU score without longer correctly predicted phrases: {}".format(bleu2))

BLEU score with longer correctly predicted phrases: 0.7598356856515925
BLEU score without longer correctly predicted phrases: 0.537284965911771


In [51]:
def clean_text(tokens):
    

    # 3. Strip the string of any extra white spaces
    translations_in_bytes = tf.strings.strip(
        # 2. Replace everything after the eos token with blank
        tf.strings.regex_replace(
            # 1. Join all the tokens to one string in each sequence
            tf.strings.join(tf.transpose(tokens), separator=' '),
                "eos.*", ""
                ),
                )
    # Decode the byte stream to a string
    translations = np.char.decode(translations_in_bytes.numpy().astype(np.bytes_), encoding='utf-8')
    
    # If the string is empty, add a [UNK] token
    # Otherwise get a Division by zero error
    translations = [sent if len(sent)>0 else '[UNK]' for sent in translations ]
    
    # Split the sequences to individual tokens
    translations = np.char.split(translations).tolist()
    return translations
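
The same cleaning idea can be expressed on a plain Python list of tokens, which may be easier to follow than the tensor version. This is a sketch under assumptions, not a drop-in replacement: it compares whole tokens against `eos`, whereas the regular expression `"eos.*"` above cuts at the first occurrence of the characters `eos`, even inside a longer word. The function name is hypothetical.

```python
def truncate_at_eos(tokens, end_token="eos", fallback="[UNK]"):
    """Keep tokens up to (not including) the end token; never return an empty list."""
    kept = []
    for tok in tokens:
        if tok == end_token:
            break
        if tok:                 # skip empty strings produced by padding
            kept.append(tok)
    # An empty translation would lead to a division by zero in the BLEU computation
    return kept if kept else [fallback]

print(truncate_at_eos(['ich', 'bin', 'tom', 'eos', '', '']))
print(truncate_at_eos(['eos', '', '']))
```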


### Evaluating the encoder-decoder model

In [71]:
def evaluate_model(model, vectorizer, en_inputs_raw, de_inputs_raw, de_labels_raw, epochs, batch_size):
    """Evaluate the model on various metrics."""
    loss_log, accuracy_log, bleu_log = [], [], []
    bleu_metric = BLEUMetric(de_vocabulary)
    n_batches = en_inputs_raw.shape[0] // batch_size

    for i in range(n_batches):
        print(f"Evaluating batch {i + 1}/{n_batches}", end="\r")

        # Convert inputs to tensors
        x = [
            tf.convert_to_tensor(en_inputs_raw[i * batch_size:(i + 1) * batch_size], dtype=tf.string),
            tf.convert_to_tensor(de_inputs_raw[i * batch_size:(i + 1) * batch_size], dtype=tf.string),
        ]
        # Convert labels to integer token IDs
        y = tf.convert_to_tensor(vectorizer(de_labels_raw[i * batch_size:(i + 1) * batch_size]))

        # Evaluate model
        loss, accuracy = model.evaluate(x, y, verbose=0)
        pred_y = model.predict(x)

        # Compute BLEU score
        bleu = bleu_metric.calculate_bleu_from_predictions(y, pred_y)

        loss_log.append(loss)
        accuracy_log.append(accuracy)
        bleu_log.append(bleu)

    return np.mean(loss_log), np.mean(accuracy_log), np.mean(bleu_log)


### Training the model using a custom training/evaluation loop

In [72]:
def train_model(model, vectorizer, train_df, valid_df, test_df, epochs,
    batch_size):
    """ Training the model and evaluating on validation/test sets """
    bleu_metric = BLEUMetric(de_vocabulary)
    data_dict = prepare_data(train_df, valid_df, test_df)
    shuffle_inds = None

    for epoch in range(epochs):
        bleu_log = []
        accuracy_log = []
        loss_log = []
        (en_inputs_raw,de_inputs_raw,de_labels_raw), shuffle_inds = shuffle_data(
        data_dict['train']['encoder_inputs'],
        data_dict['train']['decoder_inputs'],
        data_dict['train']['decoder_labels'],
        shuffle_inds
        )
        n_train_batches = en_inputs_raw.shape[0]//batch_size
        for i in range(n_train_batches):
            print("Training batch {}/{}".format(i+1, n_train_batches), end='\r')
            x = [
                tf.convert_to_tensor(en_inputs_raw[i*batch_size:(i+1)*batch_size], dtype=tf.string),
                tf.convert_to_tensor(de_inputs_raw[i*batch_size:(i+1)*batch_size], dtype=tf.string),
                ]
            y = tf.convert_to_tensor(vectorizer(de_labels_raw[i*batch_size:(i+1)*batch_size]))

            model.train_on_batch(x, y)
            loss, accuracy = model.evaluate(x, y, verbose=0)
            pred_y = model.predict(x)
            bleu = bleu_metric.calculate_bleu_from_predictions(y, pred_y)
            loss_log.append(loss)
            accuracy_log.append(accuracy)
            bleu_log.append(bleu)

        # Evaluate on the validation set once per epoch, after the last training batch
        val_en_inputs = data_dict['valid']['encoder_inputs']
        val_de_inputs = data_dict['valid']['decoder_inputs']
        val_de_labels = data_dict['valid']['decoder_labels']

        val_loss, val_accuracy, val_bleu = evaluate_model(
            model, vectorizer, val_en_inputs, val_de_inputs, val_de_labels,
            epochs, batch_size
        )
        print("\nEpoch {}/{}".format(epoch+1, epochs))
        print("\t(train) loss: {} - accuracy: {} - bleu: {}".format(
            np.mean(loss_log), np.mean(accuracy_log), np.mean(bleu_log)))
        print("\t(valid) loss: {} - accuracy: {} - bleu: {}".format(
            val_loss, val_accuracy, val_bleu))

    # Evaluate on the held-out test set once, after the final epoch
    test_en_inputs = data_dict['test']['encoder_inputs']
    test_de_inputs = data_dict['test']['decoder_inputs']
    test_de_labels = data_dict['test']['decoder_labels']

    test_loss, test_accuracy, test_bleu = evaluate_model(
        model, vectorizer, test_en_inputs, test_de_inputs, test_de_labels,
        epochs, batch_size
    )
    print("\n(test) loss: {} - accuracy: {} - bleu: {}".format(
        test_loss, test_accuracy, test_bleu))


In [73]:
epochs = 5
batch_size = 128

In [None]:
train_model(final_model, de_vectorizer, train_df, valid_df, test_df, epochs, batch_size)

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10