# Seq2Seq models (Sequence-to-Sequence)

Sequence-to-sequence (seq2seq) models are a family of deep learning models that consist of an encoder and a decoder. They are used for problems that map an arbitrarily long sequence to another arbitrarily long sequence. For example, in machine translation, you convert a sequence of words in a source language to a sequence of words in a target language. Here we will see how to use a seq2seq model to solve a machine translation task: translating English to German.


<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch11/11.1_Seq2seq_machine_translation.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>



In [1]:
import random
import time

import numpy as np
import tensorflow as tf

def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")
 
# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)




The data used here is the German-English sentence-pair set from http://www.manythings.org/anki/.

In [2]:
# Without this setting, training failed with the following error:
# _Derived_]RecvAsync is cancelled.   
# [[{{node gradient_tape/model_1/embedding_1/embedding_lookup/Reshape/_172}}]] [Op:__inference_train_function_31985]

%env TF_FORCE_GPU_ALLOW_GROWTH=true

env: TF_FORCE_GPU_ALLOW_GROWTH=true
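
The `%env` magic only works inside IPython or Jupyter. When running the same code as a plain script, a minimal sketch of the equivalent is shown below. It assumes the variable is set before TensorFlow first initializes the GPU, ideally before `import tensorflow`.

```python
import os

# Ask TensorFlow to grow GPU memory on demand instead of reserving it all up front.
# Assumption: this line runs before TensorFlow touches the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```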


## Loading the data (Requires manual download)

Unfortunately, this dataset **must be manually downloaded** by clicking [this link](http://www.manythings.org/anki/deu-eng.zip). Then place the downloaded `deu-eng.zip` file in the `Ch11/data` folder before running the cell below.


In [3]:
import os
import requests
import zipfile

# Retrieve the data
if not os.path.exists(os.path.join('data','deu-eng.zip')):
    print("Uh oh! Did you download the deu-eng.zip from http://www.manythings.org/anki/deu-eng.zip manually and place it in the Ch11/data folder?")

else:
    if not os.path.exists(os.path.join('data', 'deu.txt')):
        with zipfile.ZipFile(os.path.join('data','deu-eng.zip'), 'r') as zip_ref:
            zip_ref.extractall('data')
    else:
        print("The extracted data already exists")

The extracted data already exists


## Reading the data in

In [4]:
import pandas as pd

df = pd.read_csv(os.path.join('data', 'deu.txt'), delimiter='\t', header=None)
df.columns = ["EN", "DE", "Attribution"]
df = df[["EN", "DE"]]
print('df.shape = {}'.format(df.shape))

df.shape = (227080, 2)


In [5]:
df.head()

|   | EN   | DE         |
|---|------|------------|
| 0 | Go.  | Geh.       |
| 1 | Hi.  | Hallo!     |
| 2 | Hi.  | Grüß Gott! |
| 3 | Run! | Lauf!      |
| 4 | Run. | Lauf!      |


## Use a smaller sample for computational speed

In [6]:
df = df.sample(n=50000, random_state=random_seed)

In [7]:
df["DE"] = '[SOS] ' + df["DE"] + ' [EOS]'
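
The `[SOS]` and `[EOS]` markers matter because the decoder is trained to predict the next token from the previous ones: its input is the target sentence without the last token, and its label is the same sentence without the first token. A minimal sketch of that shift in plain Python follows; `make_decoder_pair` is a hypothetical helper and not part of the notebook.

```python
def make_decoder_pair(target_sentence):
    """Split a marked-up target sentence into (decoder input, decoder label)."""
    tokens = target_sentence.split()
    decoder_input = " ".join(tokens[:-1])   # everything except the final [EOS]
    decoder_label = " ".join(tokens[1:])    # everything except the leading [SOS]
    return decoder_input, decoder_label


decoder_input, decoder_label = make_decoder_pair("[SOS] Grüß Gott! [EOS]")
print(decoder_input)   # [SOS] Grüß Gott!
print(decoder_label)   # Grüß Gott! [EOS]
```

At every position the label is the token that follows the corresponding input token, which is what lets the model learn next-token prediction.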

## Splitting training/validation/testing data

In [8]:
test_df = df.sample(n=5000, random_state=random_seed)
valid_df = df.loc[~df.index.isin(test_df.index)].sample(n=5000, random_state=random_seed)
train_df = df.loc[~(df.index.isin(test_df.index) | df.index.isin(valid_df.index))]

print('test_df.shape = {}'.format(test_df.shape))
print('valid_df.shape = {}'.format(valid_df.shape))
print('train_df.shape = {}'.format(train_df.shape))

test_df.shape = (5000, 2)
valid_df.shape = (5000, 2)
train_df.shape = (40000, 2)
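
The three subsets must not share rows, otherwise validation and test scores would be optimistic. The sketch below illustrates the same idea on plain integer indices using only the standard library; `three_way_split` is a hypothetical helper, and the sizes simply mirror the ones used above.

```python
import random


def three_way_split(indices, n_test, n_valid, seed):
    """Shuffle the indices once and cut them into disjoint train/valid/test parts."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = three_way_split(range(50000), 5000, 5000, seed=4321)
print(len(train_idx), len(valid_idx), len(test_idx))

# The splits are disjoint if no index appears in more than one of them
assert not (set(train_idx) & set(valid_idx))
assert not (set(train_idx) & set(test_idx))
assert not (set(valid_idx) & set(test_idx))
```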


## Vocabulary sizes (English-German)

In [9]:
from collections import Counter

en_words = train_df["EN"].str.split().sum()
de_words = train_df["DE"].str.split().sum()
n=10

def get_vocabulary_size_greater_than(words, n, verbose=True):
    counter = Counter(words)

    freq_df = pd.Series(list(counter.values()), index=list(counter.keys())).sort_values(ascending=False)
    
    if verbose:
        # Print most common words
        print(freq_df.head(n=10))

    # Count of words >= n frequent    
    n_vocab = (freq_df>=n).sum()
    
    if verbose:
        print("\nVocabulary size (>={} frequent): {}".format(n, n_vocab))
        
    return n_vocab

print("English corpus")
print('='*50)
en_vocab = get_vocabulary_size_greater_than(en_words, n)

print("\nGerman corpus")
print('='*50)
de_vocab = get_vocabulary_size_greater_than(de_words, n)

English corpus
Tom    9427
to     8673
I      8436
the    6999
you    6125
a      5680
is     4374
in     2664
of     2613
was    2298
dtype: int64

Vocabulary size (>=10 frequent): 2238

German corpus
[SOS]    40000
[EOS]    40000
Tom       9928
Ich       7749
ist       4753
nicht     4414
zu        3583
Sie       3465
du        3112
das       2909
dtype: int64

Vocabulary size (>=10 frequent): 2497
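
The cell above builds its word lists with `str.split().sum()`, which concatenates one Python list per row and can become slow on larger samples. A possible alternative, sketched here with only the standard library, feeds each sentence into a `Counter` directly; `vocabulary_size` is a hypothetical name and the toy corpus is made up for illustration.

```python
from collections import Counter


def vocabulary_size(sentences, min_count):
    """Count the distinct whitespace-separated words seen at least min_count times."""
    counter = Counter()
    for sentence in sentences:
        counter.update(sentence.split())
    return sum(1 for count in counter.values() if count >= min_count)


toy_corpus = ["Tom is here", "Tom is not here", "I saw Tom"]
print(vocabulary_size(toy_corpus, min_count=2))  # 3 ("Tom", "is", "here")
```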


## Sequence length 

In [10]:
def print_sequence_length(str_ser):
    # Create a pd.Series containing the length, in characters, of each sentence
    # (use str_ser.str.split().str.len() instead to count tokens)
    seq_length_ser = str_ser.str.len()

    # Get the median as well as summary statistics of the sequence length
    print("\nSome summary statistics")
    print("Median length: {}\n".format(seq_length_ser.median()))
    print(seq_length_ser.describe())

    print("\nComputing the statistics between the 10% and 90% quantiles (to ignore outliers)")
    p_10 = seq_length_ser.quantile(0.1)
    p_90 = seq_length_ser.quantile(0.9)

    print(seq_length_ser[(seq_length_ser >= p_10) & (seq_length_ser < p_90)].describe(percentiles=[0.33, 0.66]))

print("English corpus")
print('='*50)
print_sequence_length(train_df["EN"])

print("\nGerman corpus")
print('='*50)
print_sequence_length(train_df["DE"])

English corpus

Some summary statistics
Median length: 29.0

count    40000.000000
mean        31.841100
std         13.496887
min          4.000000
25%         23.000000
50%         29.000000
75%         38.000000
max        537.000000
Name: EN, dtype: float64

Computing the statistics between the 10% and 90% quantiles (to ignore outliers)
count    32161.000000
mean        30.086658
std          7.659525
min         18.000000
33%         26.000000
50%         29.000000
66%         33.000000
max         47.000000
Name: EN, dtype: float64

German corpus

Some summary statistics
Median length: 46.0

count    40000.000000
mean        49.175300
std         16.145143
min         18.000000
25%         38.000000
50%         46.000000
75%         57.000000
max        493.000000
Name: DE, dtype: float64

Computing the statistics between the 10% and 90% quantiles (to ignore outliers)
count    31818.000000
mean        47.255453
std          9.185303
min         33.000000
33%         42.000000
50%
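
Note that `Series.str.len()` in `print_sequence_length` counts characters, while the sequence lengths fed to the model are measured in tokens. The sketch below, using only the standard library and three made-up sentences, shows how far the two measures can differ.

```python
import statistics

sentences = ["Go.", "How are you?", "I do not know what to say."]

char_lengths = [len(s) for s in sentences]           # characters per sentence
token_lengths = [len(s.split()) for s in sentences]  # whitespace tokens per sentence

print("characters:", char_lengths, "median", statistics.median(char_lengths))
print("tokens:    ", token_lengths, "median", statistics.median(token_lengths))
```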

In [11]:
print("EN vocabulary size: {}".format(en_vocab))
print("DE vocabulary size: {}".format(de_vocab))
en_seq_length = 50
de_seq_length = 60
print("EN max sequence length: {}".format(en_seq_length))
print("DE max sequence length: {}".format(de_seq_length))

EN vocabulary size: 2238
DE vocabulary size: 2497
EN max sequence length: 50
DE max sequence length: 60


In [12]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

print("Defined the vectorization layer for English")
# Create the layer.
en_vectorize_layer = TextVectorization(
    max_tokens=en_vocab,
    output_mode='int',
    output_sequence_length=None
)

print("Fitting the EN vectorization layer on data")
# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
en_vectorize_layer.adapt(train_df["EN"].tolist())
print("\tDone")

print("\nDefined the vectorization layer for German")
# Create the layer.
de_vectorize_layer = TextVectorization(
    max_tokens=de_vocab,
    output_mode='int',
    output_sequence_length=de_seq_length,
    pad_to_max_tokens=False
)

print("Fitting the DE vectorization layer on data")
de_vectorize_layer.adapt(train_df["DE"].tolist())
print("\tDone")

Defined the vectorization layer for English
Fitting the EN vectorization layer on data
	Done

Defined the vectorization layer for German
Fitting the DE vectorization layer on data
	Done


## Vectorization layer in action

In [13]:
import tensorflow.keras.backend as K
K.clear_session()

# Create the model that uses the vectorize text layer
toy_model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
toy_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
toy_model.add(en_vectorize_layer)

# Now, the model can map strings to integers, and you can add an embedding
# layer to map these integers to learned embeddings.
input_data = [["run"], ["how are you"],["ectoplasmic residue"]]
pred = toy_model.predict(input_data)
print(pred)

[[427   0   0]
 [ 40  23   4]
 [  1   1   0]]
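
In the output above, `0` is padding and `1` marks an out-of-vocabulary word. The following plain-Python sketch approximates what the layer does with its default settings (lowercase, strip punctuation, split on whitespace, look up IDs, pad to the longest row). It is an illustration and not the Keras implementation; `toy_vectorize` and `toy_vocabulary` are hypothetical.

```python
import string


def toy_vectorize(sentences, vocabulary):
    """Map sentences to padded rows of token IDs (0 = padding, 1 = unknown)."""
    token_to_id = {token: i for i, token in enumerate(vocabulary)}
    strip_punct = str.maketrans("", "", string.punctuation)
    rows = []
    for sentence in sentences:
        tokens = sentence.lower().translate(strip_punct).split()
        rows.append([token_to_id.get(token, 1) for token in tokens])
    width = max(len(row) for row in rows)
    return [row + [0] * (width - len(row)) for row in rows]


# Index 0 is reserved for padding and index 1 for unknown words
toy_vocabulary = ["", "[UNK]", "you", "are", "how", "run"]
vectorized = toy_vectorize(["Run!", "How are you?", "ectoplasmic residue"], toy_vocabulary)
print(vectorized)  # [[5, 0, 0], [4, 3, 2], [1, 1, 0]]
```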


In [14]:
print(en_vectorize_layer.get_vocabulary()[:10])
print(len(en_vectorize_layer.get_vocabulary()))

['', '[UNK]', 'tom', 'to', 'you', 'the', 'i', 'a', 'is', 'that']
2238


## Defining the real model

In [15]:
import tensorflow.keras.backend as K
K.clear_session()

def get_vectorizer(list_of_strings, n_vocab, max_length=None, return_vocabulary=True, name=None):
    
    """ Return a text vectorization layer or a model """
        
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='encoder_input')
    
    vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
        max_tokens=n_vocab+2,
        output_mode='int',
        output_sequence_length=max_length,        
        name=name
    )
    
    vectorize_layer.adapt(list_of_strings)
        
    vectorized_out = vectorize_layer(inp)
        
    if not return_vocabulary: 
        return tf.keras.models.Model(inputs=inp, outputs=vectorized_out)    
    else:
        return tf.keras.models.Model(inputs=inp, outputs=vectorized_out), vectorize_layer.get_vocabulary()        
    
        
def get_encoder_and_state(n_vocab, vectorizer):
    
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input')

    vectorized_out = vectorizer(inp)
    
    emb_layer = tf.keras.layers.Embedding(n_vocab+2, 32, mask_zero=True, name='e_embedding')
    emb_out = emb_layer(vectorized_out)
    
    gru_layer = tf.keras.layers.GRU(32)
    
    gru_out = gru_layer(emb_out)
    
    encoder = tf.keras.models.Model(inputs=inp, outputs=gru_out)
        
    return encoder, gru_out


def get_final_model_and_state(n_vocab, encoder, init_state, vectorizer):
        
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_input')
    
    vectorized_out = vectorizer(inp)
    
    emb_layer = tf.keras.layers.Embedding(n_vocab+2, 32, mask_zero=True, name='d_embedding')
    emb_out = emb_layer(vectorized_out)
    
    gru_layer = tf.keras.layers.GRU(32, return_sequences=True)
    
    gru_out = gru_layer(emb_out, initial_state=init_state)
    
    dense_layer = tf.keras.layers.Dense(n_vocab+2, activation='softmax')
    
    dense_out = dense_layer(gru_out)
    
    decoder = tf.keras.models.Model(inputs=[encoder.input, inp], outputs=dense_out)
    
    return decoder, gru_out


en_vectorizer, en_vocabulary = get_vectorizer(train_df["EN"].tolist(), en_vocab, max_length=en_seq_length, name='en_vectorizer')
de_vectorizer, de_vocabulary = get_vectorizer(train_df["DE"].tolist(), de_vocab, max_length=de_seq_length-1, name='de_vectorizer')

encoder, enc_final_state = get_encoder_and_state(en_vocab, en_vectorizer)
final_model, _ = get_final_model_and_state(de_vocab, encoder, enc_final_state, de_vectorizer)
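
To get a feel for the size of this model, the trainable weights can be counted by hand. The sketch below assumes the Keras defaults (GRU with `reset_after=True` and biases enabled) and uses the vocabulary sizes reported earlier (2238 and 2497, each plus 2) with 32 units everywhere; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def gru_params(input_dim, units):
    # Three gates, each with an input kernel, a recurrent kernel and two bias
    # vectors (the second bias vector comes from reset_after=True)
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units


en_vocab_size, de_vocab_size, hidden = 2238 + 2, 2497 + 2, 32

encoder_total = embedding_params(en_vocab_size, hidden) + gru_params(hidden, hidden)
decoder_total = (
    embedding_params(de_vocab_size, hidden)
    + gru_params(hidden, hidden)
    + dense_params(hidden, de_vocab_size)
)
total = encoder_total + decoder_total
print(encoder_total, decoder_total, total)  # 78016 168771 246787
```

Most of the weights sit in the two embedding tables and the output layer, so the vocabulary sizes chosen earlier largely determine the model size.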


In [37]:
# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Python implementation of BLEU and smooth-BLEU.

This module provides a Python implementation of BLEU and smooth-BLEU.
Smooth BLEU is computed following the method outlined in the paper:
Chin-Yew Lin, Franz Josef Och. ORANGE: a method for evaluating automatic
evaluation metrics for machine translation. COLING 2004.
"""

import collections
import math


def _get_ngrams(segment, max_order):
  """Extracts all n-grams upto a given maximum order from an input segment.

  Args:
    segment: text segment from which n-grams will be extracted.
    max_order: maximum length in tokens of the n-grams returned by this
        method.

  Returns:
    The Counter containing all n-grams up to max_order in segment
    with a count of how many times each n-gram occurred.
  """
  ngram_counts = collections.Counter()
  for order in range(1, max_order + 1):
    for i in range(0, len(segment) - order + 1):
      ngram = tuple(segment[i:i+order])
      ngram_counts[ngram] += 1
  return ngram_counts


def compute_bleu(reference_corpus, translation_corpus, max_order=4,
                 smooth=False):
  """Computes BLEU score of translated segments against one or more references.

  Args:
    reference_corpus: list of lists of references for each translation. Each
        reference should be tokenized into a list of tokens.
    translation_corpus: list of translations to score. Each translation
        should be tokenized into a list of tokens.
    max_order: Maximum n-gram order to use when computing BLEU score.
    smooth: Whether or not to apply Lin et al. 2004 smoothing.

  Returns:
    6-tuple with the BLEU score, n-gram precisions, brevity penalty,
    translation-to-reference length ratio, translation length and
    reference length.
  """

  if isinstance(reference_corpus, tf.Tensor):
    reference_corpus = reference_corpus.numpy().astype('str').tolist()
    print(reference_corpus)
  if isinstance(translation_corpus, tf.Tensor):
    translation_corpus = translation_corpus.numpy().astype('str').tolist()
  matches_by_order = [0] * max_order
  possible_matches_by_order = [0] * max_order
  reference_length = 0
  translation_length = 0
  for (references, translation) in zip(reference_corpus,
                                       translation_corpus):

    reference_length += min(len(r) for r in references)
    translation_length += len(translation)

    merged_ref_ngram_counts = collections.Counter()
    for reference in references:
      merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
    translation_ngram_counts = _get_ngrams(translation, max_order)

    overlap = translation_ngram_counts & merged_ref_ngram_counts
    

    for ngram in overlap:
      matches_by_order[len(ngram)-1] += overlap[ngram]
    for order in range(1, max_order+1):
      possible_matches = len(translation) - order + 1
      if possible_matches > 0:
        possible_matches_by_order[order-1] += possible_matches

  precisions = [0] * max_order
  for i in range(0, max_order):
    if smooth:
      precisions[i] = ((matches_by_order[i] + 1.) /
                       (possible_matches_by_order[i] + 1.))
    else:
      if possible_matches_by_order[i] > 0:
        precisions[i] = (float(matches_by_order[i]) /
                         possible_matches_by_order[i])
      else:
        precisions[i] = 0.0

  if min(precisions) > 0:
    p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
    geo_mean = math.exp(p_log_sum)
  else:
    geo_mean = 0

  ratio = float(translation_length) / reference_length

  if ratio > 1.0:
    bp = 1.
  else:
    bp = math.exp(1 - 1. / ratio)

  bleu = geo_mean * bp

  return (bleu, precisions, bp, ratio, translation_length, reference_length)
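
The two ingredients of `compute_bleu`, clipped n-gram matches and the brevity penalty, are easier to follow on a tiny example. The sketch below is a self-contained unigram-only illustration with a made-up sentence pair. It reuses the `Counter` intersection trick from the code above but is not a replacement for it; `ngram_counts` is a hypothetical helper.

```python
import collections
import math


def ngram_counts(tokens, order):
    """Count the n-grams of a single order in a token list."""
    return collections.Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )


reference = "the cat is on the mat".split()
candidate = "the the the cat".split()

# Counter intersection keeps the minimum count per n-gram, so "the" is credited
# at most twice (its count in the reference) even though it occurs three times
overlap = ngram_counts(candidate, 1) & ngram_counts(reference, 1)
clipped_matches = sum(overlap.values())
precision = clipped_matches / len(candidate)

# Candidates shorter than the reference are penalized
ratio = len(candidate) / len(reference)
brevity_penalty = 1.0 if ratio > 1.0 else math.exp(1 - 1.0 / ratio)

print(clipped_matches, precision, round(brevity_penalty, 4))  # 3 0.75 0.6065
```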

In [43]:
import tensorflow as tf
from functools import partial
import tensorflow.keras.backend as K
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
#from bleu import compute_bleu

class BLEUMetric(tf.keras.metrics.Mean):
    
    def __init__(self, vocabulary, name='bleu', **kwargs):
        super().__init__(name=name, **kwargs)
        self.vocab = vocabulary
        self.id_to_token_layer = StringLookup(vocabulary=self.vocab, invert=True)
    
    def _calculate_bleu(self, real, pred):
      
        pred_argmax = tf.argmax(pred, axis=-1)  
        
        pred_tokens = self.id_to_token_layer(pred_argmax)
        real_tokens = self.id_to_token_layer(real)
        
        def clean_padding(tokens):
            """ If padding left in the sequence, they will count towards BLEU """
            t = tf.strings.split(
                    tf.strings.strip(
                        tf.strings.join(
                            tf.transpose(tokens), separator=' '
                        )
                    ),
                    sep=' '
                ).to_list()
                
            #t = np.char.split(t.numpy().astype('str')).tolist()
            
            return t
        
        pred_tokens = clean_padding(pred_tokens)
        real_tokens = clean_padding(real_tokens)
        
        partial_compute_bleu = partial(compute_bleu, smooth=True)
        bleu = tf.py_function(partial_compute_bleu, [real_tokens, pred_tokens], Tout='float32')
        #print(bleu)
        return bleu

    def update_state(self, y_true, y_pred, sample_weight=None):      
        # bleu, precisions, bp, ratio, translation_length, reference_length
        bleu = self._calculate_bleu(y_true, y_pred)      
        #print(bleu)
        super().update_state(bleu)

        
bleu = BLEUMetric(de_vocabulary)

bleu.update_state(
    np.array([[0,1,0],[0,2,2]]), np.array([[[1,0,0],[0,1,0],[0,1,0]],[[1,0,0],[0,0,1],[0,0,1]]])
)


ValueError: in user code:

    <ipython-input-41-945396f2073d>:46 update_state  *
        bleu = self._calculate_bleu(y_true, y_pred)
    <ipython-input-39-54e66107b70b>:40 _calculate_bleu  *
        bleu = tf.py_function(partial_compute_bleu, [real_tokens, pred_tokens], Tout='float32')
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py:201 wrapper  **
        return target(*args, **kwargs)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py:513 eager_py_func
        func=func, inp=inp, Tout=Tout, name=name, use_tape_cache=True)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py:420 _eager_py_func
        use_tape_cache=use_tape_cache)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py:348 _internal_py_func
        name=name)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/ops/gen_script_ops.py:55 eager_py_func
        ctx=_ctx)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/ops/gen_script_ops.py:96 eager_py_func_eager_fallback
        _attr_Tin, input = _execute.convert_to_mixed_eager_tensors(input, ctx)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/eager/execute.py:295 convert_to_mixed_eager_tensors
        v = [ops.convert_to_tensor(t, ctx=ctx) for t in values]
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/eager/execute.py:295 <listcomp>
        v = [ops.convert_to_tensor(t, ctx=ctx) for t in values]
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/profiler/trace.py:163 wrapped
        return func(*args, **kwargs)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py:1540 convert_to_tensor
        ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py:339 _constant_tensor_conversion_function
        return constant(v, dtype=dtype, name=name)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py:265 constant
        allow_broadcast=True)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py:276 _constant_impl
        return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py:301 _constant_eager_impl
        t = convert_to_eager_tensor(value, ctx, dtype)
    /home/thushv89/anaconda3/envs/manning.tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py:98 convert_to_eager_tensor
        return ops.EagerTensor(value, ctx.device_name, dtype)

    ValueError: Can't convert non-rectangular Python sequence to Tensor.
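
The final message is consistent with `clean_padding` returning nested Python lists of unequal length, which `tf.py_function` then tries to convert to a tensor. One possible workaround, sketched here in plain Python and not the author's fix, is to keep the tokens as ragged Python lists and pass them to `compute_bleu` directly in eager mode. `strip_padding` is a hypothetical helper and the token lists are made up.

```python
def strip_padding(rows, pad_token=""):
    """Drop padding tokens so they do not count towards BLEU."""
    return [[token for token in row if token != pad_token] for row in rows]


real_tokens = [["ich", "bin", "tom", ""], ["hallo", "", "", ""]]
pred_tokens = [["ich", "bin", "", ""], ["hallo", "tom", "", ""]]

translation_corpus = strip_padding(pred_tokens)
# compute_bleu expects a list of references per translation, so each single
# reference is wrapped in its own list
reference_corpus = [[reference] for reference in strip_padding(real_tokens)]

print(translation_corpus)
print(reference_corpus)
```

These two lists have the shapes described in the `compute_bleu` docstring and can be unequal in length per row, because they never pass through a tensor conversion.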


In [42]:
from tensorflow.keras.metrics import SparseCategoricalAccuracy
#final_model_text = tf.keras.models.Model(inputs=[en_vectorizer.input, de_vectorizer.input], outputs=final_model.output)

final_model.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)
final_model.summary()

ValueError: Metric (<__main__.BLEUMetric object at 0x7fe74deadfd0>) passed to model.compile was created inside of a different distribution strategy scope than the model. All metrics must be created in the same distribution strategy scope as the model (in this case <tensorflow.python.distribute.distribute_lib._DefaultDistributionStrategy object at 0x7fe756816dd8>). If you pass in a string identifier for a metric to compile the metric will automatically be created in the correct distribution strategy scope.

## Use the following for BLEU

https://github.com/tensorflow/nmt/blob/master/nmt/scripts/bleu.py

In [31]:
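epochs = 20
batch_size = 64

en_inputs = np.array(train_df["EN"].tolist())
# Decoder inputs: each German sentence without its final token ([EOS])
de_inputs = np.array(train_df["DE"].str.rsplit(n=1, expand=True).iloc[:,0].tolist())
# Decoder labels: each German sentence without its first token ([SOS]), as token IDs
de_labels = de_vectorizer.predict(train_df["DE"].str.split(n=1, expand=True).iloc[:,1].tolist())

final_model.fit([en_inputs, de_inputs], de_labels, epochs=epochs, batch_size=batch_size)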
        


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fb2c0054cc0>

In [None]:
de_labels_vec = de_vectorizer.predict