# Seq2Seq models (Sequence-to-Sequence)

Sequence-to-sequence models are a class of deep learning models that consist of an encoder and a decoder. They are used for problems that map an arbitrarily long sequence to another arbitrarily long sequence. For example, in machine translation, you convert a sequence of words in a source language to a sequence of words in a target language. Here we will use a seq2seq model to solve a machine translation task: translating English to German.


<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch11-Ch12-Sequence-to-Sequence-Learning-with-TF2/11.1_seq2seq_machine_translation_part_1.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>



In [1]:
import random
import tensorflow as tf
import numpy as np
import time
import json

def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")
 
# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)

print("TensorFlow version: {}".format(tf.__version__))

TensorFlow version: 2.9.2


Dataset source: http://www.manythings.org/anki/ (German-English sentence pairs)

In [2]:
# Not setting this led to the following error
# _Derived_]RecvAsync is cancelled.   
# [[{{node gradient_tape/model_1/embedding_1/embedding_lookup/Reshape/_172}}]] [Op:__inference_train_function_31985]

%env TF_FORCE_GPU_ALLOW_GROWTH=true

env: TF_FORCE_GPU_ALLOW_GROWTH=true


## Loading the data (Requires manual download)

Unfortunately, this dataset **must be manually downloaded** by clicking [this link](http://www.manythings.org/anki/deu-eng.zip). Then place the downloaded `deu-eng.zip` file in the `Ch11/data` folder before running the cell below.


In [4]:
# Section 11.1

import os
import requests
import zipfile

# Make sure the zip file has been downloaded
if not os.path.exists(os.path.join('data','deu-eng.zip')):
    raise FileNotFoundError(
        "Uh oh! Did you download the deu-eng.zip from http://www.manythings.org/anki/deu-eng.zip manually and place it in the Ch11/data folder?"
    )

else:
    if not os.path.exists(os.path.join('data', 'deu.txt')):
        with zipfile.ZipFile(os.path.join('data','deu-eng.zip'), 'r') as zip_ref:
            zip_ref.extractall('data')
    else:
        print("The extracted data already exists")

## Reading the data

The data is in a single `.txt` file. It is a parallel corpus, meaning each English sentence/phrase/paragraph appears side by side with a corresponding German translation. In the file, the source input and the translation are separated by a tab (i.e., it is a tab-separated file).

In [5]:
# Section 11.1

import pandas as pd

# Read the csv file
df = pd.read_csv(os.path.join('data', 'deu.txt'), delimiter='\t', encoding='utf-8', encoding_errors="strict", header=None)
# Set column names
df.columns = ["EN", "DE", "Attribution"]
df = df[["EN", "DE"]]
print('df.shape = {}'.format(df.shape))

df.shape = (255817, 2)
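
To make the file layout concrete, here is a minimal sketch that parses two made-up lines in the same three-column, tab-separated layout (English, German, attribution). The sample lines and the attribution text are assumptions for illustration, and the sketch uses the standard `csv` module instead of pandas so that it runs without the downloaded data.

```python
import csv
import io

# Two made-up lines in the same layout as deu.txt:
# English <TAB> German <TAB> attribution
sample = (
    "Go.\tGeh.\tattribution text 1\n"
    "Hi.\tHallo!\tattribution text 2\n"
)

# QUOTE_NONE keeps quotation marks inside sentences as literal characters
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)

# Keep only the English and German columns, as the notebook does
pairs = [(en, de) for en, de, _attribution in reader]

print(pairs)
```

Each tuple pairs a source sentence with its translation, which is the same information the `EN` and `DE` columns hold in the dataframe.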


In [6]:
# Some German sentences contain the bytes \xc2\xa0 (a UTF-8 non-breaking space).
# These can cause errors like UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 3: unexpected end of data
# when using the TextVectorization layer, so we drop every row whose German text contains the byte \xc2
clean_inds = [i for i in range(len(df)) if b"\xc2" not in df.iloc[i]["DE"].encode("utf-8")]

df = df.iloc[clean_inds]

In [7]:
df.head()

Unnamed: 0,EN,DE
0,Go.,Geh.
1,Hi.,Hallo!
2,Hi.,Grüß Gott!
3,Run!,Lauf!
4,Run.,Lauf!


In [8]:
df.tail()

Unnamed: 0,EN,DE
255808,Even if some sentences by non-native speakers ...,Auch wenn Sätze von Nichtmuttersprachlern mitu...
255809,Remember that the purpose of the Tatoeba Proje...,"Es gilt zu bedenken, dass es das Anliegen des ..."
255810,"When I was younger, I hated going to weddings....","Als ich jünger war, hasste ich es, auf Hochzei..."
255811,If someone who doesn't know your background sa...,"Wenn jemand, der deine Herkunft nicht kennt, s..."
255812,If someone who doesn't know your background sa...,"Wenn jemand Fremdes dir sagt, dass du dich wie..."


## Use a smaller sample for computational speed

There are more than 220000 samples in the original dataset. We will be using a smaller set of 50000 for our dataset. 

In [9]:
n_samples = 50000
df = df.sample(n=n_samples, random_state=random_seed)

## Introducing the `SOS` and `EOS` tokens (Decoder)

We will add these special tokens to the translated targets. `sos` indicates the start of the sentence and `eos` marks the end of the sentence. 

E.g. `Grüß Gott!` becomes `sos Grüß Gott! eos`

In [10]:
start_token = 'sos'
end_token = 'eos'

df["DE"] = start_token + ' ' + df["DE"] + ' ' + end_token

## Splitting training/validation/testing data

We will create three datasets by random sampling (without replacement):

* Test dataset - 5000 samples
* Validation dataset - 5000 samples
* Training dataset - 40000 samples

In [11]:
# Randomly sample 5000 examples (10%) from the total 50000 for testing
test_df = df.sample(n=int(n_samples/10), random_state=random_seed)
# Randomly sample another 5000 examples from the remaining 45000 for validation
valid_df = df.loc[~df.index.isin(test_df.index)].sample(n=int(n_samples/10), random_state=random_seed)
# Assign the rest to training data
train_df = df.loc[~(df.index.isin(test_df.index) | df.index.isin(valid_df.index))]

print('test_df.shape = {}'.format(test_df.shape))
print('valid_df.shape = {}'.format(valid_df.shape))
print('train_df.shape = {}'.format(train_df.shape))

test_df.shape = (5000, 2)
valid_df.shape = (5000, 2)
train_df.shape = (40000, 2)
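
The split only works as intended if the three sets share no rows. The following is a small sketch of the same idea on plain integer indices (a stand-in for the dataframe's row labels, using only the standard library), with a check that the sets are disjoint. The sizes are scaled down and are assumptions for illustration.

```python
import random

rng = random.Random(4321)
indices = list(range(100))  # stand-in for the 50000 row labels

# Sample the test set first
test_idx = set(rng.sample(indices, 10))

# Sample the validation set only from what is left
remaining = [i for i in indices if i not in test_idx]
valid_idx = set(rng.sample(remaining, 10))

# Everything else is training data
train_idx = set(indices) - test_idx - valid_idx

# No index may appear in more than one set
assert not (test_idx & valid_idx)
assert not (train_idx & (test_idx | valid_idx))

print(len(train_idx), len(valid_idx), len(test_idx))
```

The same check could be run on the real dataframes by intersecting `train_df.index`, `valid_df.index` and `test_df.index`.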


## Analysing the vocabulary sizes (English and German)

Calculate the vocabulary size. We will only consider the words that appear at least 10 times in the corpus.

1. Create a flattened list from English words.
2. Create a flattened list of German words.
3. Get the vocabulary size of words appearing more than or equal to 10 times.
4. Generate a counter object (i.e., dict word -> frequency).
5. Create a pandas series from the counter, and then sort most frequent to least.
6. Print the most common words.
7. Get the count of words that appear at least 10 times.

In [12]:
# Section 11.1

from collections import Counter

# Create a flattened list from English words
en_words = train_df["EN"].str.split().sum()
# Create a flattened list of German words
de_words = train_df["DE"].str.split().sum()

# Get the vocabulary size of words appearing more than or equal to 10 times
n=10

# Code listing 11.1
def get_vocabulary_size_greater_than(words, n, verbose=True):
    
    """ Get the vocabulary size above a certain threshold """
    
    # Generate a counter object i.e. dict word -> frequency
    counter = Counter(words)
    
    # Create a pandas series from the counter, then sort most frequent to least
    freq_df = pd.Series(list(counter.values()), index=list(counter.keys())).sort_values(ascending=False)
    
    if verbose:
        # Print most common words
        print(freq_df.head(n=10))

    # Count of words >= n frequent    
    n_vocab = (freq_df>=n).sum()
    
    if verbose:
        print("\nVocabulary size (>={} frequent): {}".format(n, n_vocab))
        
    return n_vocab

print("English corpus")
print('='*50)
en_vocab = get_vocabulary_size_greater_than(en_words, n)

print("\nGerman corpus")
print('='*50)
de_vocab = get_vocabulary_size_greater_than(de_words, n)

English corpus
Tom    9498
to     8488
I      8243
the    6920
you    6092
a      5800
is     4318
in     2583
of     2544
was    2279
dtype: int64

Vocabulary size (>=10 frequent): 2225

German corpus
sos      40000
eos      40000
Tom       9960
Ich       7782
ist       4773
nicht     4546
zu        3528
Sie       3374
du        3141
das       2941
dtype: int64

Vocabulary size (>=10 frequent): 2482
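
The thresholding step can be checked by hand on a tiny corpus. This is a minimal sketch with a hypothetical helper, `vocabulary_size_at_least`, and a made-up word list. Note that, like the listing above, it counts raw whitespace-separated tokens, so `Tom` and `tom` (or `go` and `go.`) would be counted separately, whereas the `TextVectorization` layer lowercases text and strips punctuation by default. The sizes computed this way are therefore best read as an estimate.

```python
from collections import Counter

def vocabulary_size_at_least(words, min_count):
    """Count the distinct words that occur at least min_count times."""
    counts = Counter(words)
    return sum(1 for c in counts.values() if c >= min_count)

# Made-up corpus: tom=2, ran=2, '.'=2, sat=1, mary=1
toy_words = "tom ran . tom sat . mary ran".split()

size_2 = vocabulary_size_at_least(toy_words, 2)  # tom, ran and '.'
size_1 = vocabulary_size_at_least(toy_words, 1)  # every distinct token

print(size_2, size_1)
```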


## Analysing the sequence length (English and German)

Here we compute the sequence length of the sequences in the English and German corpora. To ignore the outliers, we only consider data between the 1% and 99% quantiles.

1. Create a pd.Series, which contains the sequence length for each sentence.
2. Get the median as well as summary statistics of the sequence length.
3. Get the quantiles at given marks (i.e., the 1st and 99th percentiles).
4. Print the summary stats of the data between the defined quantiles.

In [13]:
# Section 11.1

# Code listing 11.2
def print_sequence_length(str_ser):
    
    """ Print the summary stats of the sequence length """
    
    # Create a pd.Series, which contains the sequence length for each sentence
    seq_length_ser = str_ser.str.split(' ').str.len()

    # Get the median as well as summary statistics of the sequence length
    print("\nSome summary statistics")
    print("Median length: {}\n".format(seq_length_ser.median()))
    print(seq_length_ser.describe())
    
    # Get the quantiles at given marks
    print("\nComputing the statistics between the 1% and 99% quantiles (to ignore outliers)")
    p_01 = seq_length_ser.quantile(0.01)
    p_99 = seq_length_ser.quantile(0.99)
    
    # Print the summary stats of the data between the defined quantiles
    print(seq_length_ser[(seq_length_ser >= p_01) & (seq_length_ser < p_99)].describe())

print("English corpus")
print('='*50)
print_sequence_length(train_df["EN"])

print("\nGerman corpus")
print('='*50)
print_sequence_length(train_df["DE"])

English corpus

Some summary statistics
Median length: 6.0

count    40000.000000
mean         6.299175
std          2.579978
min          1.000000
25%          5.000000
50%          6.000000
75%          8.000000
max         44.000000
Name: EN, dtype: float64

Computing the statistics between the 1% and 99% quantiles (to ignore outliers)
count    39543.000000
mean         6.178995
std          2.304810
min          2.000000
25%          5.000000
50%          6.000000
75%          7.000000
max         14.000000
Name: EN, dtype: float64

German corpus

Some summary statistics
Median length: 8.0

count    40000.000000
mean         8.334825
std          2.581789
min          3.000000
25%          7.000000
50%          8.000000
75%         10.000000
max         42.000000
Name: DE, dtype: float64

Computing the statistics between the 1% and 99% quantiles (to ignore outliers)
count    39157.000000
mean         8.247746
std          2.262261
min          5.000000
25%          7.000000
50%    

## Printing the vocabulary size and sequence length

In [14]:
print("EN vocabulary size: {}".format(en_vocab))
print("DE vocabulary size: {}".format(de_vocab))

# Define sequence lengths with some extra space for longer sequences
en_seq_length = 19
de_seq_length = 21

print("EN max sequence length: {}".format(en_seq_length))
print("DE max sequence length: {}".format(de_seq_length))

EN vocabulary size: 2225
DE vocabulary size: 2482
EN max sequence length: 19
DE max sequence length: 21


## TensorFlow `TextVectorization` layer

The `TextVectorization` layer takes in strings and converts them to token IDs. The layer can build a vocabulary from a given text corpus and then uses it to generate the token IDs.

In [15]:
# Section 11.2

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

print("Defined the vectorization layer for English")

# Create the text vectorization layer (English)
en_vectorize_layer = TextVectorization(
    max_tokens=en_vocab,
    output_mode='int',
    output_sequence_length=None
)

print("Fitting the EN vectorization layer on data")
# Here we are calling adapt to fit the vectorization layer with text
# so that it learns the vocabulary
en_vectorize_layer.adapt(np.array(train_df["EN"].tolist()).astype('str'))
print("\tDone")

print("\nDefined the vectorization layer for German")

# Create the text vectorization layer (German)
de_vectorize_layer = TextVectorization(
    max_tokens=de_vocab,    
    output_mode='int',
    output_sequence_length=de_seq_length,
    pad_to_max_tokens=False,
)

print("Fitting the DE vectorization layer on data")
de_vectorize_layer.adapt(np.array(train_df["DE"].tolist()))
print("\tDone")

Defined the vectorization layer for English
Fitting the EN vectorization layer on data
	Done

Defined the vectorization layer for German
Fitting the DE vectorization layer on data
	Done


## `TextVectorization` layer in action
 
### How to use the layer (EN)

In [16]:
import tensorflow.keras.backend as K
K.clear_session()

# Create the model that uses the vectorize text layer
toy_model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
toy_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
toy_model.add(en_vectorize_layer)

# Now the model can map strings to integers
input_data = [["run"], ["I\'ll go home"],["ectoplasmic residue"]]
pred = toy_model.predict(input_data)

print("Input data: \n{}\n".format(input_data))
print("\nToken IDs: \n{}".format(pred))

Input data: 
[['run'], ["I'll go home"], ['ectoplasmic residue']]


Token IDs: 
[[421   0   0]
 [ 74  49 112]
 [  1   1   0]]


### How to use the layer (DE)

In [17]:
import tensorflow.keras.backend as K
K.clear_session()

# Create the model that uses the vectorize text layer
toy_model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
toy_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
toy_model.add(de_vectorize_layer)

# Now the model can map strings to integers
input_data = [["[sos] Geh"], ["geh lauf"]]
pred = toy_model.predict(input_data)

print("Input data: \n{}\n".format(input_data))
print("\nToken IDs: \n{}".format(pred))

Input data: 
[['[sos] Geh'], ['geh lauf']]


Token IDs: 
[[  2 737   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0]
 [737   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0]]


### Sample of the vocabulary

Let's print some words from the two vocabularies

In [18]:
# Section 11.2

print("English")
# Print first few words in the vocabulary
print(en_vectorize_layer.get_vocabulary()[:10])
# Print the size of the vocabulary
print(len(en_vectorize_layer.get_vocabulary()))

print("\nGerman")
# Print first few words in the vocabulary
print(de_vectorize_layer._lookup_layer.input_vocabulary)
print(de_vectorize_layer.get_vocabulary()[:10])

# Print the size of the vocabulary
print(len(de_vectorize_layer.get_vocabulary()))

English
['', '[UNK]', 'tom', 'to', 'you', 'the', 'i', 'a', 'is', 'that']
2225

German
None
['', '[UNK]', 'sos', 'eos', 'ich', 'tom', 'nicht', 'ist', 'das', 'du']
2482
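
The outputs above show three conventions: index 0 is reserved for padding (`''`), index 1 for out-of-vocabulary words (`[UNK]`), and text is lowercased with punctuation removed before lookup (which is why `[sos] Geh` and `geh` map to known tokens). The sketch below imitates these conventions in pure Python. It is an approximation for illustration, not the Keras implementation: the function names are hypothetical, and ties between equally frequent words are broken alphabetically here, which is an assumption.

```python
from collections import Counter
import string

PAD, UNK = "", "[UNK]"

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(corpus, max_tokens):
    """Padding and [UNK] come first, then the most frequent words."""
    counts = Counter(tok for sent in corpus for tok in standardize(sent).split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # max_tokens includes the two special tokens
    return [PAD, UNK] + [tok for tok, _ in ranked[:max_tokens - 2]]

def vectorize(text, vocabulary, seq_length):
    """Map words to IDs, truncate to seq_length, then pad with 0."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:seq_length]
    return ids + [0] * (seq_length - len(ids))

corpus = ["sos Geh. eos", "sos Lauf! eos", "sos Geh weg! eos"]
vocab = build_vocabulary(corpus, max_tokens=5)

print(vocab)
print(vectorize("[sos] Geh", vocab, seq_length=4))
print(vectorize("geh lauf", vocab, seq_length=4))
```

With `max_tokens=5`, only three real words fit in the vocabulary, so `lauf` falls back to the `[UNK]` ID, mirroring how `lauf` maps to 1 in the German example above.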


## Defining the Seq2Seq model

Here we define an encoder-decoder model to translate English to German. We will use a bidirectional encoder and a standard (unidirectional) decoder. The model uses a Gated Recurrent Unit (GRU) as the recurrent component. The encoder and the decoder have their own `TextVectorization` layers because they process two different languages.

In [19]:
# Section 11.2

import tensorflow.keras.backend as K
K.clear_session()

# Code listing 11.3
def get_vectorizer(corpus, n_vocab, max_length=None, return_vocabulary=True, name=None):
    
    """ Return a text vectorization layer or a model """
    
    # Define an input layer that takes a list of strings (or an array of strings)
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='encoder_input')
    
    # When defining the vocab size, we'd add two for special tokens '' (Padding) and '[UNK]' (Oov tokens)
    vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
        max_tokens=n_vocab+2,
        output_mode='int',
        output_sequence_length=max_length,                
    )
    
    # Fit the vectorizer layer on the data
    vectorize_layer.adapt(corpus)
        
    # Get the token IDs
    vectorized_out = vectorize_layer(inp)
        
    if not return_vocabulary: 
        return tf.keras.models.Model(inputs=inp, outputs=vectorized_out, name=name)    
    else:
        # Returns the vocabulary in addition to the model
        return tf.keras.models.Model(inputs=inp, outputs=vectorized_out, name=name), vectorize_layer.get_vocabulary()
    
# Code listing 11.4   
def get_encoder(n_vocab, vectorizer):
    """ Define the encoder of the seq2seq model"""
    
    # The input is (None,1) shaped and accepts an array of strings
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input')

    # Vectorize the data (assign token IDs)
    vectorized_out = vectorizer(inp)
    
    # Define an embedding layer to convert IDs to word vectors
    emb_layer = tf.keras.layers.Embedding(n_vocab+2, 128, mask_zero=True, name='e_embedding')
    # Get the embeddings of the token IDs
    emb_out = emb_layer(vectorized_out)
    
    # Define a bidirectional GRU layer
    # Encoder looks at the English text (i.e. the input) both backwards and forward
    # this leads to better performance
    gru_layer = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128, name='e_gru'), name='e_bidirectional_gru')
    
    # Get the output of the gru layer
    gru_out = gru_layer(emb_out)
    
    # Define the encoder model
    encoder = tf.keras.models.Model(inputs=inp, outputs=gru_out, name='encoder')
        
    return encoder


# Code listing 11.5
def get_final_seq2seq_model(n_vocab, encoder, vectorizer):
    """ Define the final encoder-decoder model """
    
    # Encoder's input
    e_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input_final')    
    # Get the encoder's final output
    d_init_state = encoder(e_inp)
    
    # The input is (None,1) shaped and accepts an array of strings
    # This input layer is used to train the seq2seq model with teacher-forcing
    # we feed the German sequence as the input and ask the model to predict 
    # it with the words offset by 1 (i.e. next word)
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_input')
    
    # Vectorize the data (assign token IDs)
    d_vectorized_out = vectorizer(d_inp)
    
    # Define an embedding layer to convert IDs to word vectors
    # Note that this is a different embedding layer to the encoder's embedding layer
    d_emb_layer = tf.keras.layers.Embedding(n_vocab+2, 128, mask_zero=True, name='d_embedding')
    
    # Get the embeddings of the token IDs
    d_emb_out = d_emb_layer(d_vectorized_out)
    
    # Define a GRU layer
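    # Unlike the encoder, we cannot define a bidirectional GRU for the decoder:
    # at inference time the decoder generates the translation one word at a time,
    # so the future words a backward pass would need are not available yet.
    # It has 256 units so that the encoder's concatenated (2 x 128) output can be used as its initial state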
    d_gru_layer = tf.keras.layers.GRU(256, return_sequences=True, name='d_gru')
    
    # Get the output of the gru layer
    d_gru_out = d_gru_layer(d_emb_out, initial_state=d_init_state)
    
    # Define an intermediate dense layer
    d_dense_layer_1 = tf.keras.layers.Dense(512, activation='relu', name='d_dense_1')
    d_dense1_out = d_dense_layer_1(d_gru_out)
    
    # The final prediction layer with softmax
    d_dense_layer_final = tf.keras.layers.Dense(n_vocab+2, activation='softmax', name='d_dense_final')
    d_final_out = d_dense_layer_final(d_dense1_out)
    
    # Define the full model
    seq2seq = tf.keras.models.Model(inputs=[e_inp, d_inp], outputs=d_final_out, name='final_seq2seq')
    
    return seq2seq

# Get the English vectorizer/vocabulary
en_vectorizer, en_vocabulary = get_vectorizer(np.array(train_df["EN"].tolist()), en_vocab, max_length=en_seq_length, name='e_vectorizer')
# Get the German vectorizer/vocabulary
de_vectorizer, de_vocabulary = get_vectorizer(np.array(train_df["DE"].tolist()), de_vocab, max_length=de_seq_length-1, name='d_vectorizer')

# Define the final model
encoder = get_encoder(en_vocab, en_vectorizer)
final_model = get_final_seq2seq_model(de_vocab, encoder, de_vectorizer)
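
The comments in `get_final_seq2seq_model` mention teacher forcing: the decoder receives the German sequence as input and is asked to predict the same sequence shifted by one word. The following is a minimal sketch of that offset on a single sentence, using a hypothetical helper name. It also suggests why the decoder's vectorizer is built with `de_seq_length-1`: dropping one token from each end leaves sequences that are one token shorter.

```python
def teacher_forcing_pair(target_sentence):
    """Split a 'sos ... eos' sentence into decoder inputs and labels."""
    tokens = target_sentence.split()
    decoder_input = tokens[:-1]  # everything except the final 'eos'
    decoder_label = tokens[1:]   # everything except the leading 'sos'
    return decoder_input, decoder_label

dec_in, dec_out = teacher_forcing_pair("sos Grüß Gott! eos")

# At each step the decoder sees one word and must predict the next one
for step, (given, expected) in enumerate(zip(dec_in, dec_out), start=1):
    print(step, given, "->", expected)
```

At step 1 the decoder sees `sos` and should predict `Grüß`, and at the last step it sees `Gott!` and should predict `eos`.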


## Compile the model

Compile the model with a suitable loss, an optimizer and metrics.

In [20]:
# Section 11.2
from tensorflow.keras.metrics import SparseCategoricalAccuracy

# Compile the model
final_model.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)
final_model.summary()

Model: "final_seq2seq"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 d_input (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 d_vectorizer (Functional)      (None, 20)           0           ['d_input[0][0]']                
                                                                                                  
 e_input_final (InputLayer)     [(None, 1)]          0           []                               
                                                                                                  
 d_embedding (Embedding)        (None, 20, 128)      317952      ['d_vectorizer[0][0]']           
                                                                                      

## Evaluating MT models - BLEU metric

In machine translation, a popular choice for assessing performance is the Bilingual Evaluation Understudy (BLEU) metric. Word-for-word accuracy does not reflect the true performance of these models, because the same phrase can often be translated in several valid ways. BLEU can take multiple reference translations into account when computing the final score. It also measures precision at multiple n-gram scales (typically 1-grams to 4-grams) between the reference and the predicted translation, so it rewards longer runs of correctly translated words rather than isolated matches.

The implementation is inspired by: https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py
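
To show what "precision at multiple n-gram scales" means, here is a simplified, single-reference sketch of BLEU in pure Python. It is not the `compute_bleu` function used in this notebook: it handles one sentence pair, supports only one reference, applies no smoothing, and the function names are hypothetical. It follows the usual recipe of clipped n-gram precision, a geometric mean over orders 1 to 4, and a brevity penalty.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(reference, translation, n):
    """Fraction of translation n-grams found in the reference (clipped)."""
    ref_counts = ngrams(reference, n)
    hyp_counts = ngrams(translation, n)
    # An n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

def simple_bleu(reference, translation, max_order=4):
    """Geometric mean of the n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(reference, translation, n)
                  for n in range(1, max_order + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    ratio = len(translation) / len(reference)
    # Translations shorter than the reference are penalised
    brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1 - 1 / ratio)
    return geo_mean * brevity_penalty

reference = "the cat sat on the mat".split()
translation = "the cat sat on a mat".split()

print(simple_bleu(reference, translation))
```

A single wrong word lowers the 1-gram precision only slightly (5 of 6) but removes several 3-grams and 4-grams, which is why the score drops well below what word-level accuracy would suggest.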

### Defining the BLEU metric

Below we define a `BLEUMetric` object that can be used to compute the performance of the model.

1. Get the vocabulary from the fitted TextVectorizer.
2. Define a StringLookup layer, which can convert token IDs to words.
3. Get the predicted token IDs.
4. Convert token IDs to words using the vocabulary and the StringLookup.
5. Strip the string of any extra white spaces.
6. Replace everything after the EOS token with a blank.
7. Join all the tokens to one string in each sequence.
8. Decode the byte stream to a string.
9. If the string is empty, add a [UNK] token. If not, it can lead to numerical errors.
10. Split the sequences into individual tokens.
11. Get the clean versions of the predictions and real sequences.
12. We have to wrap each real sequence in a list to make use of a third-party function to compute BLEU.
13. Get the BLEU value for the given batch of targets and predictions.

In [21]:
# Section 11.3

from tensorflow.keras.layers.experimental.preprocessing import StringLookup
from bleu import compute_bleu

# Code listing 11.8
class BLEUMetric(object):
    
    def __init__(self, vocabulary, name='bleu', **kwargs):
        """ Computes the BLEU score (Metric for machine translation) """
        super().__init__()
        self.vocab = vocabulary
        # invert=True makes the layer map token IDs back to words
        self.id_to_token_layer = StringLookup(vocabulary=self.vocab, num_oov_indices=0, oov_token="[KNU]", invert=True)
    
    def calculate_bleu_from_predictions(self, real, pred):
        """ Calculate the BLEU score for targets and predictions """
        
        # Get the predicted token IDs
        pred_argmax = tf.argmax(pred, axis=-1)  
        
        # Convert token IDs to words using the vocabulary and the StringLookup
        pred_tokens = self.id_to_token_layer(pred_argmax)
        real_tokens = self.id_to_token_layer(real)
        
        def clean_text(tokens):
            
            """ Clean padding and [SOS]/[EOS] tokens to only keep meaningful words """
            
            # 3. Strip the string of any extra white spaces
            translations_in_bytes = tf.strings.strip(
                        # 2. Replace everything after the eos token with blank
                        tf.strings.regex_replace(
                            # 1. Join all the tokens to one string in each sequence
                            tf.strings.join(
                                tf.transpose(tokens), separator=' '
                            ),
                        "eos.*", ""),
                   )
            
            # Decode the byte stream to a string
            translations = np.char.decode(
                translations_in_bytes.numpy().astype(np.bytes_), encoding='utf-8'
            )
            
            # If the string is empty, add a [UNK] token
            # Otherwise get a Division by zero error
            translations = [sent if len(sent)>0 else '[UNK]' for sent in translations ]
            
            # Split the sequences to individual tokens 
            translations = np.char.split(translations).tolist()
            
            return translations
        
        # Get the clean versions of the predictions and real sequences
        pred_tokens = clean_text(pred_tokens)
        # We have to wrap each real sequence in a list to make use of a function to compute bleu
        real_tokens = [[token_seq] for token_seq in clean_text(real_tokens)]

        # The compute_bleu method accepts the translations and references in the following format
        # translation - list of list of tokens
        # references - list of list of list of tokens
        bleu, precisions, bp, ratio, translation_length, reference_length = compute_bleu(real_tokens, pred_tokens, smooth=False)

        return bleu

### Using the BLEU metric

Below you can see BLEU being used to compute the similarity between a translation (predicted) and a reference (true target).

In [22]:
translation = [['[UNK]', '[UNK]', 'müssen', 'wir', 'in', 'erfahrung', 'bringen', 'wo', 'sie', 'wohnen']]
reference = [[['als', 'müssen', 'müssen', 'wir', 'in', 'erfahrung', 'bringen', 'wo', 'sie', 'wohnen']]]

bleu1, _, _, _, _, _ = compute_bleu(reference, translation)

translation = [['[UNK]', 'einmal', 'müssen', '[UNK]', 'in', 'erfahrung', 'bringen', 'wo', 'sie', 'wohnen']]
reference = [[['als', 'müssen', 'müssen', 'wir', 'in', 'erfahrung', 'bringen', 'wo', 'sie', 'wohnen']]]

bleu2, _, _, _, _, _ = compute_bleu(reference, translation)

print("BLEU score with longer correctly predicted phrases: {}".format(bleu1))
print("BLEU score without longer correctly predicted phrases: {}".format(bleu2))

BLEU score with longer correctly predicted phrases: 0.7598356856515925
BLEU score without longer correctly predicted phrases: 0.537284965911771


## Training the model with a custom loop

We will train the model using a custom loop because we want to incorporate BLEU as a metric during training. The procedure is as follows:

* Each epoch,
  * Shuffle the training data
  * Train our model on all the training data (in batches)
  * Evaluate the model on validation data
* Finally, evaluate the model on test data
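
The procedure above can be sketched as a skeleton in plain Python. This is an outline under stated assumptions rather than the notebook's training function: `train_step` and `evaluate` are hypothetical callables standing in for the real model calls, and a trailing partial batch is dropped, in the same way `evaluate_model` computes its number of batches with integer division.

```python
import random

def iterate_batches(n_samples, batch_size, rng):
    """Yield lists of shuffled sample indices, one list per full batch."""
    order = list(range(n_samples))
    rng.shuffle(order)                   # reshuffle at the start of every epoch
    n_batches = n_samples // batch_size  # a trailing partial batch is dropped
    for b in range(n_batches):
        yield order[b * batch_size:(b + 1) * batch_size]

def run_training(n_train, n_epochs, batch_size, train_step, evaluate, seed=4321):
    """Train for n_epochs, validating after each epoch and testing at the end."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_epochs):
        for batch_indices in iterate_batches(n_train, batch_size, rng):
            train_step(batch_indices)      # placeholder for one training step
        history.append(evaluate("valid"))  # validation after every epoch
    return history, evaluate("test")       # test evaluation only once

# Stand-ins that only record what was called
seen = []
history, test_score = run_training(
    n_train=10, n_epochs=2, batch_size=4,
    train_step=seen.append,
    evaluate=lambda split: "evaluated " + split,
)

print(len(seen), history, test_score)
```

With 10 samples and a batch size of 4, each epoch yields two full batches, so `train_step` is called four times over two epochs.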


1. Define the metric.
2. Define the data.
3. Reset metric logs at the beginning of every epoch.
4. Shuffle data at the beginning of every epoch.
5. Get the number of training batches.
6. Train one batch at a time.
7. Status update.
8. Get a batch of inputs (English and German sequences).
9. Get a batch of targets (German sequences offset by 1).
10. Train for a single step.
11. Evaluate the model to get the metrics.
12. Get the final prediction to compute BLEU.
13. Compute the BLEU metric.
14. Update the epoch's log records of the metrics.
15. Define validation data.
16. Evaluate the model on validation data.
17. Print the evaluation metrics of each epoch.

In [23]:
# Section 11.3
import time

epochs = 5
batch_size = 128

# Code listing 11.6
def prepare_data(train_df, valid_df, test_df):
    """ Create a data dictionary from the dataframes containing data """
    
    data_dict = {}
    for label, df in zip(['train', 'valid', 'test'], [train_df, valid_df, test_df]):
        en_inputs = np.array(df["EN"].tolist())
        de_inputs = np.array(df["DE"].str.rsplit(n=1, expand=True).iloc[:,0].tolist())
        de_labels = np.array(df["DE"].str.split(n=1, expand=True).iloc[:,1].tolist())
        data_dict[label] = {'encoder_inputs': en_inputs, 'decoder_inputs': de_inputs, 'decoder_labels': de_labels}
    
    return data_dict

# Code listing 11.7
def shuffle_data(en_inputs, de_inputs, de_labels, shuffle_inds=None):
    """ Shuffle the data randomly (applying the same permutation to all inputs and labels) """
        
    if shuffle_inds is None:
        # If shuffle_inds are not passed create a shuffling automatically
        shuffle_inds = np.random.permutation(np.arange(en_inputs.shape[0]))
    else:
        # Shuffle the provided shuffle_inds
        shuffle_inds = np.random.permutation(shuffle_inds)
    
    # Return shuffled data
    return (en_inputs[shuffle_inds], de_inputs[shuffle_inds], de_labels[shuffle_inds]), shuffle_inds


# Code listing 11.9
def evaluate_model(model, vectorizer, en_inputs_raw, de_inputs_raw, de_labels_raw, batch_size):
    """ Evaluate the model on various metrics such as loss, accuracy and BLEU """
    
    # Define the metric
    bleu_metric = BLEUMetric(de_vocabulary)
    
    loss_log, accuracy_log, bleu_log = [], [], []
    # Get the number of batches
    n_batches = en_inputs_raw.shape[0]//batch_size
    print(" ", end='\r')

    # Evaluate one batch at a time
    for i in range(n_batches):
        # Status update
        print("Evaluating batch {}/{}".format(i+1, n_batches), end='\r')

        # Get the inputs and targets
        x = [en_inputs_raw[i*batch_size:(i+1)*batch_size], de_inputs_raw[i*batch_size:(i+1)*batch_size]]
        y = vectorizer(de_labels_raw[i*batch_size:(i+1)*batch_size])

        # Get the evaluation metrics
        loss, accuracy = model.evaluate(x, y, verbose=0)
        # Get the predictions to compute BLEU
        pred_y = model.predict(x, verbose=0)

        # Update logs
        loss_log.append(loss)
        accuracy_log.append(accuracy)
        bleu_log.append(bleu_metric.calculate_bleu_from_predictions(y, pred_y))
    
    return np.mean(loss_log), np.mean(accuracy_log), np.mean(bleu_log)
    

# Code listing 11.10
def train_model(model, vectorizer, train_df, valid_df, test_df, epochs, batch_size):
    """ Training the model and evaluating on validation/test sets """
    
    # Define the metric
    bleu_metric = BLEUMetric(de_vocabulary)

    # Define the data
    data_dict = prepare_data(train_df, valid_df, test_df)

    shuffle_inds = None
    
    
    for epoch in range(epochs):

        # Reset metric logs every epoch
        bleu_log = []
        accuracy_log = []
        loss_log = []

        # =================================================================== #
        #                         Train Phase                                 #
        # =================================================================== #

        # Shuffle data at the beginning of every epoch
        (en_inputs_raw,de_inputs_raw,de_labels_raw), shuffle_inds  = shuffle_data(
            data_dict['train']['encoder_inputs'],
            data_dict['train']['decoder_inputs'],
            data_dict['train']['decoder_labels'],
            shuffle_inds
        )

        # Get the number of training batches
        n_train_batches = en_inputs_raw.shape[0]//batch_size

        # Train one batch at a time
        for i in range(n_train_batches):
            # Status update
            print("Training batch {}/{}".format(i+1, n_train_batches), end='\r')

            # Get a batch of inputs (English and German sequences)
            x = [en_inputs_raw[i*batch_size:(i+1)*batch_size], de_inputs_raw[i*batch_size:(i+1)*batch_size]]
            # Get a batch of targets (German sequences offset by 1)
            y = vectorizer(de_labels_raw[i*batch_size:(i+1)*batch_size])

            # Train for a single step
            model.train_on_batch(x, y)        
            # Evaluate the model to get the metrics
            loss, accuracy = model.evaluate(x, y, verbose=0)
            # Get the final prediction to compute BLEU
            pred_y = model.predict(x, verbose=0)

            # Update the epoch's log records of the metrics
            loss_log.append(loss)
            accuracy_log.append(accuracy)
            bleu_log.append(bleu_metric.calculate_bleu_from_predictions(y, pred_y))

        # =================================================================== #
        #                      Validation Phase                               #
        # =================================================================== #
        
        val_en_inputs = data_dict['valid']['encoder_inputs']
        val_de_inputs = data_dict['valid']['decoder_inputs']
        val_de_labels = data_dict['valid']['decoder_labels']
            
        val_loss, val_accuracy, val_bleu = evaluate_model(
            model, vectorizer, val_en_inputs, val_de_inputs, val_de_labels, batch_size
        )
            
        # Print the evaluation metrics of each epoch
        print("\nEpoch {}/{}".format(epoch+1, epochs))
        print("\t(train) loss: {} - accuracy: {} - bleu: {}".format(np.mean(loss_log), np.mean(accuracy_log), np.mean(bleu_log)))
        print("\t(valid) loss: {} - accuracy: {} - bleu: {}".format(val_loss, val_accuracy, val_bleu))
    
    # =================================================================== #
    #                      Test Phase                                     #
    # =================================================================== #    
    
    test_en_inputs = data_dict['test']['encoder_inputs']
    test_de_inputs = data_dict['test']['decoder_inputs']
    test_de_labels = data_dict['test']['decoder_labels']
            
    test_loss, test_accuracy, test_bleu = evaluate_model(
            model, vectorizer, test_en_inputs, test_de_inputs, test_de_labels, batch_size
    )
    
    print("\n(test) loss: {} - accuracy: {} - bleu: {}".format(test_loss, test_accuracy, test_bleu))
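
The batching in `evaluate_model` and `train_model` relies on integer division (`shape[0]//batch_size`), so any examples left over after the last full batch are never visited. The sketch below illustrates that slicing arithmetic on a toy dataset size. The helper `batch_bounds` is hypothetical and not part of the notebook.

```python
def batch_bounds(n_examples, batch_size):
    """Return the (start, end) index pair of every full batch.

    Hypothetical helper for illustration only; it mirrors the
    `n // batch_size` logic used in the training and evaluation loops.
    """
    n_batches = n_examples // batch_size
    return [(i * batch_size, (i + 1) * batch_size) for i in range(n_batches)]

# Toy sizes (assumptions): 10 examples, batches of 4
bounds = batch_bounds(n_examples=10, batch_size=4)
print(bounds)              # [(0, 4), (4, 8)]
print(10 - bounds[-1][1])  # 2 trailing examples are skipped
```

Because every batch that is visited has the same size, averaging the per-batch losses and accuracies with `np.mean` gives the same result as averaging over the covered examples.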


In [24]:

t1 = time.time()    
train_model(final_model, de_vectorizer, train_df, valid_df, test_df, epochs, batch_size)
t2 = time.time()

print("\nIt took {} seconds to complete the training".format(t2-t1))

Evaluating batch 39/39
Epoch 1/5
	(train) loss: 1.7550944273288434 - accuracy: 0.24879128696062627 - bleu: 0.002123049560838
	(valid) loss: 1.4457139174143474 - accuracy: 0.332460513481727 - bleu: 0.013267734063560356
Evaluating batch 39/39
Epoch 2/5
	(train) loss: 1.3106029087152236 - accuracy: 0.37036055078109104 - bleu: 0.02950098674420384
	(valid) loss: 1.2049159453465388 - accuracy: 0.4025515715281169 - bleu: 0.045067310463008284
Evaluating batch 39/39
Epoch 3/5
	(train) loss: 1.09318800480702 - accuracy: 0.43510486300175005 - bleu: 0.06972906873423713
	(valid) loss: 1.0502231044647021 - accuracy: 0.4519448585999318 - bleu: 0.0828369233741428
Evaluating batch 39/39
Epoch 4/5
	(train) loss: 0.9339591274276758 - accuracy: 0.4842957123540915 - bleu: 0.10476757940391189
	(valid) loss: 0.9576318584955655 - accuracy: 0.4828882202123984 - bleu: 0.10944701101094335
Evaluating batch 39/39
Epoch 5/5
	(train) loss: 0.8074664710423886 - accuracy: 0.5278435642711627 - bleu: 0.14234866198511817

## Save the trained model

We save the trained model as well as the English and German vocabularies.
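
As a quick check on the JSON round trip, the sketch below writes a toy vocabulary to a temporary directory and reads it back. The names `toy_vocabulary` and `toy_vocab.json` are assumptions for illustration. The sketch also assumes the vocabulary is a plain Python list whose positions are the token IDs; the notebook does not confirm this here.

```python
import json
import os
import tempfile

# Toy vocabulary (an assumption): list position == token ID
toy_vocabulary = ["", "[UNK]", "sos", "eos", "ich"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "toy_vocab.json")
    # Save the vocabulary
    with open(vocab_path, "w") as f:
        json.dump(toy_vocabulary, f)
    # Load it back
    with open(vocab_path, "r") as f:
        restored_vocabulary = json.load(f)

# A list survives the round trip unchanged
print(restored_vocabulary == toy_vocabulary)  # True

# A dict with integer keys does not: JSON object keys are always strings
restored_dict = json.loads(json.dumps({0: "", 1: "[UNK]"}))
print(list(restored_dict.keys()))  # ['0', '1']
```

If a vocabulary were stored as an ID-to-word dictionary instead, the keys would need to be converted back to integers after loading.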

In [25]:
# Section 11.3
import os
import json

# Save the model
os.makedirs('models', exist_ok=True)
tf.keras.models.save_model(final_model, os.path.join('models', 'seq2seq'))

# Create the directory for the vocabulary files
os.makedirs(os.path.join('models', 'seq2seq_vocab'), exist_ok=True)

# Save the vocabulary files
with open(os.path.join('models', 'seq2seq_vocab', 'en_vocab.json'), 'w') as f:
    json.dump(en_vocabulary, f)    
with open(os.path.join('models', 'seq2seq_vocab', 'de_vocab.json'), 'w') as f:
    json.dump(de_vocabulary, f)



INFO:tensorflow:Assets written to: models\seq2seq\assets



## Defining the inference model

For inference, we have to create a new model that reuses the weights of the trained model. During training we used teacher forcing, i.e., we fed the words of the reference translation as inputs to the decoder. This is not possible during inference: no reference translation is available, because producing one is the goal.

Therefore, we create a decoder model that can generate one prediction at a time. We start the prediction process by giving the `sos` token as the initial input to the decoder and keep generating words until the decoder outputs `eos`.
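
To make the teacher-forcing setup concrete, here is a minimal sketch of how a single target sentence yields decoder inputs and decoder labels that are offset by one token. It mimics the `rsplit(n=1)` and `split(n=1)` calls in `prepare_data` using plain Python strings. The helper `teacher_forcing_pair` and the example sentence are assumptions for illustration.

```python
def teacher_forcing_pair(target_sentence):
    """Split a target sentence into decoder inputs and decoder labels.

    Hypothetical helper for illustration only.
    """
    # Decoder input: everything except the last token
    decoder_input = target_sentence.rsplit(maxsplit=1)[0]
    # Decoder label: everything except the first token
    decoder_label = target_sentence.split(maxsplit=1)[1]
    return decoder_input, decoder_label

# Toy sentence (an assumption) wrapped in the sos/eos markers
dec_in, dec_label = teacher_forcing_pair("sos ich denke jeden tag eos")

# At each position the decoder sees one word and must predict the next
for inp, lab in zip(dec_in.split(), dec_label.split()):
    print("{:>6} -> {}".format(inp, lab))
```

At inference time the left-hand column is not available beyond `sos`, so each predicted word has to be fed back in as the next input.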

In [26]:
# Section 11.4

# Code listing 11.11
import tensorflow.keras.backend as K
K.clear_session()

def get_inference_model(save_path):
    """ Load the saved model and create an inference model from that """
    
    # Load the model
    model = tf.keras.models.load_model(save_path)
    
    # Get the encoder model
    en_model = model.get_layer("encoder")
    
    # Define two inputs
    # 1. Takes a single word as the input to the decoder
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_infer_input')
    # 2. Takes an initial state to pass to the decoder GRU as an input
    d_state_inp = tf.keras.Input(shape=(256,), name='d_infer_state')
    
    # Generate the vectorized output of inp
    d_vectorizer = model.get_layer('d_vectorizer')    
    d_vectorized_out = d_vectorizer(d_inp)
    
    # Generate the embeddings from the vectorized input
    d_emb_out = model.get_layer('d_embedding')(d_vectorized_out)
    
    # Get the GRU layer
    d_gru_layer = model.get_layer("d_gru")
    # Since we generate one word at a time, we will not need the return_sequences
    d_gru_layer.return_sequences = False
    # Get the GRU out while using d_state_inp from earlier, as the initial state
    d_gru_out = d_gru_layer(d_emb_out, initial_state=d_state_inp) 
    
    # Get the dense output
    d_dense1_out = model.get_layer("d_dense_1")(d_gru_out) 
    
    # Get the final output
    d_final_out = model.get_layer("d_dense_final")(d_dense1_out) 
    
    # Define the final decoder
    de_model = tf.keras.models.Model(inputs=[d_inp, d_state_inp], outputs=[d_final_out, d_gru_out])
    
    return en_model, de_model

def get_vocabularies(save_dir):
    """ Load the vocabulary files from a given path"""
    
    with open(os.path.join(save_dir, 'en_vocab.json'), 'r') as f:
        en_vocabulary = json.load(f)
        
    with open(os.path.join(save_dir, 'de_vocab.json'), 'r') as f:
        de_vocabulary = json.load(f)
        
    return en_vocabulary, de_vocabulary

print("Loading vocabularies")
en_vocabulary, de_vocabulary = get_vocabularies(os.path.join('models', 'seq2seq_vocab'))

print("Loading weights and generating the inference model")
en_model, de_model = get_inference_model(os.path.join('models', 'seq2seq'))
print("\tDone")

Loading vocabularies
Loading weights and generating the inference model
	Done


## Generating new translations

Here we generate a new translation by first starting with the `sos` token and asking the decoder to generate words until it outputs `eos`.
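
The generation procedure can be sketched independently of TensorFlow as a greedy loop over a step function. In the sketch below, `greedy_decode`, `toy_step`, and the `max_len` cap are assumptions for illustration; in particular, the length cap is an added safeguard so that the loop still terminates if the decoder never produces `eos`.

```python
def greedy_decode(step_fn, initial_state, start_token="sos", end_token="eos", max_len=20):
    """Generate tokens one at a time until `end_token` or `max_len` is reached.

    `step_fn(word, state)` must return `(next_word, next_state)`. It is a
    stand-in for a single decoder prediction step.
    """
    word, state = start_token, initial_state
    translation = []
    for _ in range(max_len):
        # Feed the previous word and state, get the next word and state
        word, state = step_fn(word, state)
        translation.append(word)
        if word == end_token:
            break
    return translation

# Toy stand-in for the decoder (an assumption): the state is a position counter
toy_sentence = ["ich", "denke", "jeden", "tag", "eos"]

def toy_step(word, state):
    return toy_sentence[state], state + 1

print(" ".join(greedy_decode(toy_step, initial_state=0)))
# ich denke jeden tag eos

# A step function that never emits the end token still terminates
print(len(greedy_decode(lambda w, s: ("wieder", s), initial_state=0, max_len=5)))
# 5
```

The loop is greedy in the sense that it keeps only the single most likely word at each step rather than exploring alternatives.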

In [27]:
# Code listing 11.12
def generate_new_translation(en_model, de_model, de_vocabulary, sample_en_text):
    """ Generate a new translation """
    
    start_token = 'sos'
    
    # Print the input
    print("Input: {}".format(sample_en_text))
    
    # Get the initial state for the decoder
    d_state = en_model.predict(np.array([sample_en_text]), verbose=0)
    # First word will be sos
    de_word = start_token
    # We collect the translation in this list
    de_translation = []
    
    # Keep predicting until we get eos
    while de_word != 'eos':
        # Override the previous state input with the new state
        de_pred, d_state = de_model.predict([np.array([de_word]), d_state], verbose=0)    
        # Get the actual word from the token ID of the prediction
        de_word = de_vocabulary[np.argmax(de_pred[0])]
        # Add that to the translation
        de_translation.append(de_word)

    print("Translation: {}\n".format(' '.join(de_translation)))

for i in range(5):
    sample_en_text = test_df["EN"].iloc[i]
    generate_new_translation(en_model, de_model, de_vocabulary, sample_en_text)

Input: My hair is naturally curly.
Translation: mein [UNK] hat [UNK] [UNK] eos

Input: I think about it every day.
Translation: ich denke es [UNK] jeden tag eos

Input: I'll never doubt you again.
Translation: ich werde nie wieder wieder [UNK] eos

Input: Did the doctors give you anything for the pain?
Translation: hat die geschichte für ihre [UNK] [UNK] eos

Input: She runs fastest in our class.
Translation: sie [UNK] in der stadt eos

