### Intro: DeepPavlov sequence-to-sequence tutorial

In this tutorial we are going to implement [sequence-to-sequence](https://arxiv.org/abs/1409.3215) model in DeepPavlov.

Sequence-to-sequence is the concept of mapping input sequence to target sequence. Sequence-to-sequence models consist of two main components: encoder and decoder. Encoder is used to encode the input sequence to dense representation and decoder uses this dense representation to generate target sequence.

![sequence-to-sequence](https://cdn-images-1.medium.com/max/1400/1*Ismhi-muID5ooWf3ZIQFFg.png)

(image credit: [towardsdatascience.com](https://towardsdatascience.com))

To implement this model in DeepPavlov we have to code some DeepPavlov abstractions:
* **DatasetReader** to read the data
* **DatasetIterator** to generate batches
* **Vocabulary** to convert words to indexes
* **Model** to train it and then use it
* and some other components for pre- and postprocessing

### Probably the most usefull blog post about tensorflow I've seen
or why `tf.shape != tensor.get_shape`

[TensorFlow: Shapes and dynamic dimensions](https://blog.metaflow.fr/shapes-and-dynamic-dimensions-in-tensorflow-7b1fe79be363)

### Designations
    
for clarity we add the following suffixes to the end of python variables:

`_ph` - tf.placeholder
`_layer` - tf.keras.layer
`_op` - tensorflow operation (remember that tf.Tensor is not a set of valuet, it is a node in computational graph).

In [1]:
import os
import pdb
import json
from itertools import chain
from pathlib import Path

import numpy as np
import tensorflow as tf

import deeppavlov
from deeppavlov import build_model
from deeppavlov.core.data.dataset_reader import DatasetReader
from deeppavlov.core.common.registry import register

In [2]:
tf.__version__, deeppavlov.__version__

('1.13.1', '0.3.1')

In [3]:
BATCH_SIZE = 32
MAXLEN = 50

## Download & extract dataset

In [4]:
from deeppavlov.core.data.utils import download_decompress

dataset_path = './personachat'

if not os.path.exists(dataset_path):
    download_decompress('http://files.deeppavlov.ai/datasets/personachat_v2.tar.gz', dataset_path)
else:
    print('Dataset has already been downloaded')

Dataset has already been downloaded


## DatasetReader

DatasetReader is used to read and parse data from files. Here, we define new PersonaChatDatasetReader which reads [PersonaChat dataset](https://arxiv.org/abs/1801.07243).

PersonaChat dataset consists of dialogs and user personalities.

User personality is described by four sentences, e.g.:

    i like to remodel homes.
    i like to go hunting.
    i like to shoot a bow.
    my favorite holiday is halloween.
    
But we will be using only dialogues in this tutorial.

In [5]:
@register('personachat_dataset_reader')  # to use component later in train config it sould be registered
class PersonaChatDatasetReader(DatasetReader):
    """
    PersonaChat dataset from
    Zhang S. et al. Personalizing Dialogue Agents: I have a dog, do you have pets too?
    https://arxiv.org/abs/1801.07243
    Also, this dataset is used in ConvAI2 http://convai.io/
    This class reads dataset to the following format:
    [{
        'persona': [list of persona sentences],
        'x': input utterance,
        'y': output utterance,
        'dialog_history': list of previous utterances
        'candidates': [list of candidate utterances]
        'y_idx': index of y utt in candidates list
      },
       ...
    ]
    """
    def read(self, dir_path: str, mode='self_original', verbose=False):
        if verbose:
            print('Reading dataset...')
        dir_path = Path(dir_path)
        dataset = {}
        for dt in ['train', 'valid', 'test']:
            dataset[dt] = self._parse_data(dir_path / f'{dt}_{mode}.txt', verbose)

        print('Done\n')
        return dataset

    @staticmethod
    def _parse_data(filename, verbose):
        examples = []
        if verbose:
            print(filename)
        curr_persona = []
        curr_dialog_history = []
        persona_done = False
        with filename.open('r') as fin:
            for line in fin:
                line = ' '.join(line.strip().split(' ')[1:])
                your_persona_pref = 'your persona: '
                if line[:len(your_persona_pref)] == your_persona_pref and persona_done:
                    curr_persona = [line[len(your_persona_pref):]]
                    curr_dialog_history = []
                    persona_done = False
                elif line[:len(your_persona_pref)] == your_persona_pref:
                    curr_persona.append(line[len(your_persona_pref):])
                else:
                    persona_done = True
                    x, y, _, candidates = line.split('\t')
                    candidates = candidates.split('|')
                    example = {
                        'persona': curr_persona,
                        'x': x,
                        'y': y,
                        'dialog_history': curr_dialog_history[:],
                        'candidates': candidates,
                        'y_idx': candidates.index(y)
                    }
                    curr_dialog_history.extend([x, y])
                    examples.append(example)

        return examples

Read dataset, check size and sample some examples

In [6]:
data = PersonaChatDatasetReader().read('./personachat', verbose=True)

for k in data:
    print(k, '\t:', len(data[k]))

Reading dataset...
personachat/train_self_original.txt
personachat/valid_self_original.txt
personachat/test_self_original.txt
Done

train 	: 65719
valid 	: 7801
test 	: 7512


In [7]:
data['train'][100]

{'persona': ['i go to at least 10 concerts a year.',
  'i work in retail.',
  'madonna is my all time favorite.',
  'lady gaga is my current favorite singer.'],
 'x': 'they are both greyhounds . their names are tom and jerry .',
 'y': 'that is cute ! how old are they ?',
 'dialog_history': ['hey , what are you up to ?',
  'hello , i am listening to lady gaga , do you like her ?',
  'i prefer rock music , like led zeppelin .',
  'madonna is my first favorite . do you go to a lot of concerts ?',
  'i would if i could , but i have a farm to maintain .',
  'i work at the mall , so i am close to the venue .',
  'i prefer hiking outdoors and photography rather than crowded malls .',
  'pays well , lol . i make great money as the manager .',
  'work is tiring . i would love to travel the world instead .',
  'i would love to travel as well .',
  'staying here is fine too though . my two dogs keep me company .',
  'i love dogs ! what kind do you have ?'],
 'candidates': ['my girlfriend eloped w

### Dataset iterator

Dataset iterator is used to generate batches from parsed dataset (DatasetReader). Let's extract only `x` and `y` from parsed dataset and use them to predict sentence `y` by sentence `x`.

In [8]:
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

@register('personachat_iterator')
class PersonaChatIterator(DataLearningIterator):
    def split(self, *args, **kwargs):
        for dt in ['train', 'valid', 'test']:
            setattr(self, dt, self._to_tuple(getattr(self, dt)))

    @staticmethod
    def _to_tuple(data):
        """
        Returns:
            list of (x, y)
        """
        return list(map(lambda x: (x['x'], x['y']), data))

In [9]:
iterator = PersonaChatIterator(data)
batch_generator = iterator.gen_batches(5, 'train')
batch = next(batch_generator)
for x, y in zip(*batch):
    print('x:', x)
    print('y:', y)
    print('----------')

x: i am so sorry . my mother died during child birth and my father died 3 years ago .
y: sorry to hear that . it must have been so hard
----------
x: its mine , my dads and brothers cat i live with them
y: cooking is what i love
----------
x: oh . i like lasagna so much
y: we sell lasagna at the aldis store i work part time at
----------
x: great here . my name is reginald . you ?
y: good to hear ! nice to meet you my name is brianna .
----------
x: i am usually there to run marathons
y: wow that is awesome . i am 300 pounds so i have not run a marathon before .
----------


### Tokenizer
Splits utterance into tokens (words)

In [10]:
from deeppavlov.models.tokenizers.lazy_tokenizer import LazyTokenizer
tokenizer = LazyTokenizer()
tokenizer(["I'd like to tokenize some text"])

[nltk_data] Downloading package punkt to
[nltk_data]     /home/not_a_robot/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/not_a_robot/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/not_a_robot/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/not_a_robot/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!


[['I', "'d", 'like', 'to', 'tokenize', 'some', 'text']]

### Vocabulary

Vocabulary object prepares mapping from tokens to token indices.
It uses train data to build this mapping.

We will implement DialogVocab (inherited from SimpleVocabulary) wich adds all tokens from `x` and `y` utterances to vocabulary.

In [11]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

@register('dialog_vocab')
class DialogVocab(SimpleVocabulary):
    def fit(self, *args):
        tokens = chain(*args)
        super().fit(tokens)

    def __call__(self, batch, **kwargs):
        indices_batch = []
        for utt in batch:
            tokens = [self[token] for token in utt]
            indices_batch.append(tokens)
        return indices_batch

Let's create instance of DialogVocab. We define save and load paths, minimal frequence of tokens which are added to vocabulary and set of special tokens.

Special tokens are:
* <PAD\> - padding
* <SOS\> - start of sequence
* <EOS\> - end of sequence
* <UNK\> - unknown token - token which is not presented in vocabulary

And fit it on tokens from *x* and *y*.

In [12]:
vocab = DialogVocab(
    save_path='./vocab.dict',
    load_path='./vocab.dict',
    min_freq=2,
    special_tokens=('<PAD>','<SOS>', '<EOS>', '<UNK>',),
    unk_token='<UNK>'
)

vocab.fit(tokenizer(iterator.get_instances(data_type='train')[0]), tokenizer(iterator.get_instances(data_type='train')[1]))
vocab.save()

PAD_idx = vocab._t2i['<PAD>']
SOS_idx = vocab._t2i['<SOS>']
assert PAD_idx == 0, 'this is required by tf.keras'

2019-06-26 00:49:25.331 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/not_a_robot/Documents/random_notebooks/CISS/vocab.dict]
2019-06-26 00:49:39.320 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/not_a_robot/Documents/random_notebooks/CISS/vocab.dict]


In [13]:
# number of words in vocab
len(vocab)

11595

In [14]:
vocab.freqs.most_common(10)

[('i', 103487),
 ('.', 101599),
 ('you', 48296),
 ('?', 43771),
 (',', 39500),
 ('a', 34214),
 ('to', 32105),
 ('do', 30574),
 ('is', 28579),
 ('my', 26953)]

One can use vocabulary to encode tokenized text

In [19]:
vocab([['<SOS>', 'this', 'is', 'tokenized', 'there_is_no_such_word_in_dataset', 'and_this', 'sentence', '<EOS>', '<PAD>', '<PAD>']])

[[1, 110, 12, 3, 3, 3, 6060, 2, 0, 0]]

## Padding

To feed sequences of token indexes to neural model we should make their lengths equal. If sequence is too short we add <PAD\> symbols to the end of sequence. If sequence is too long we just cut it.

SentencePadder implements such behavior, it also adds <SOS\> and <EOS\> tokens.

In [16]:
from deeppavlov.core.models.component import Component

@register('sentence_padder')
class SentencePadder(Component):
    def __init__(self, length_limit, pad_token_id=0, start_token_id=1, end_token_id=2, *args, **kwargs):
        self.length_limit = length_limit
        self.pad_token_id = pad_token_id
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id

    def __call__(self, batch):
        for i in range(len(batch)):
            batch[i] = batch[i][:self.length_limit]
            batch[i] = [self.start_token_id] + batch[i] + [self.end_token_id]
            batch[i] += [self.pad_token_id] * (self.length_limit + 2 - len(batch[i]))
        return batch

In [17]:
padder = SentencePadder(length_limit=6)
padded = padder(vocab(tokenizer(['this is very very long sentence that does not fit',
                                 'this one is short'])))
padded

[[1, 110, 12, 75, 75, 149, 6060, 2], [1, 110, 76, 12, 456, 2, 0, 0]]

To reverse mapping, just apply vocab again

In [18]:
vocab(padded)

[['<SOS>', 'this', 'is', 'very', 'very', 'long', 'sentence', '<EOS>'],
 ['<SOS>', 'this', 'one', 'is', 'short', '<EOS>', '<PAD>', '<PAD>']]

# Seq2seq model

![](img/seq2seq_training.png)

(image credit: Stanford cs224n)

In [23]:
def build_seq2seq_graph(input_ph, target_ph, build_encoder_fn, build_decoder_fn, hidden_size, vocab_size, emb_size, dropout_rate, is_training_ph):
    """
    Args
        x_ph: input tokens placeholder
        y_ph: expected output tokens placeholder (used at training time for input feeding)
        build_encoder: function to build encoder graph
        build_decoder: function to build decoder graph
        hidden_dim: size of encoder rnn
        vocab_size: number of words in the vocabulary
        emb_dim: embedding size

    Returns:
        tf.Tensor [batch_size, maxlen, decoder_output_dim]
    """
    # embedding is shared between encoder and decoder
    embedding_layer = tf.keras.layers.Embedding(vocab_size, emb_size)
    dropout_layer = tf.keras.layers.Dropout(rate=dropout_rate)
    mask = tf.cast(input_ph, tf.bool)

    encoder_outputs_op, encoder_state_op = build_encoder_fn(
        input_ph,
        embedding_layer,
        dropout_layer,
        hidden_size,
        is_training_ph
    )
    decoder_op = build_decoder_fn(
        encoder_outputs_op,
        encoder_state_op,
        target_ph,
        embedding_layer,
        dropout_layer,
        is_training_ph,
        mask
    )
    logits_layer = tf.keras.layers.Dense(vocab_size)
    logits_op = logits_layer(decoder_op)
    return logits_op


In [24]:
def build_encoder(x_ph, embedding_layer, dropout_layer, hidden_dim, is_training_ph):
    """
    Args:
        x_ph: input tokens placeholder
        embedding_layer: tf.keras.Embedding object
        dropout_layer: tf.keras.Dropout object
        hidden_dim: size of rnn (also output size)
        is_training_ph: is training mode flag

    Returns:
        encoder_outputs_op: tf.Tensor, [batch_size, maxlen, encoder_hidden_size]
        encoder_state_op: tf.Tensor, [batch_size, encoder_hidden_size]
    """
    # embed x, apply dropout if is_training
    x_op = embedding_layer(x_ph)
    x_op = dropout_layer(x_op, training=is_training_ph)

    # make rnn layer and apply it to x
    # rnn layer should return both encoded sequences and state
    rnn_layer = tf.keras.layers.GRU(hidden_dim, return_sequences=True, return_state=True)
    encoder_outputs_op, encoder_state_op = rnn_layer(x_op)

    return encoder_outputs_op, encoder_state_op

Test encoder.

shapes should be `[batch_size, maxlen, encoder_hidden_size]` and `[batch_size, encoder_hidden_size]`

i.e `[3, 11, 17]` and `[3, 17]`

In [22]:
tf.reset_default_graph()

toy_batch_size = 3
toy_vocab_size = 13
toy_hidden_dim = 7
toy_emb_size = 5
toy_maxlen = 11

toy_emb = tf.keras.layers.Embedding(toy_vocab_size, toy_emb_size, input_length=toy_maxlen)
toy_dropout = tf.keras.layers.Dropout(rate=0.5)

# toy_input = tf.cast(tf.random_uniform(shape=[toy_batch_size, toy_maxlen]) * toy_vocab_size, tf.int32)
toy_input = tf.placeholder(tf.int64, [None, toy_maxlen])

encoder_outputs_op, encoder_state_op = build_encoder(
    toy_input, toy_emb, toy_dropout, toy_hidden_dim, True)

if (encoder_outputs_op.shape[1:] == (toy_maxlen, toy_hidden_dim)):
    print('Test 1 passed')
else:
    print('Problem with the shape')
    print(f'Shape should be {(None, toy_maxlen, toy_hidden_dim)}')
    print(f'But got {encoder_outputs_op.shape} instead')

if (encoder_state_op.shape[1:] == (toy_hidden_dim)):
    print('Test 2 passed')
else:
    print('Problem with the shape')
    print(f'Shape should be {(None, toy_hidden_dim)}')
    print(f'But got {encoder_state_op.shape} instead')

encoder_outputs_op, encoder_state_op

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Test 1 passed
Test 2 passed


(<tf.Tensor 'gru/transpose_1:0' shape=(?, 11, 7) dtype=float32>,
 <tf.Tensor 'gru/while/Exit_3:0' shape=(?, 7) dtype=float32>)

## Vanilla seq2seq decoder

In [20]:
def build_decoder(
        encoder_otputs_op,
        encoder_state_op,
        target_ph,
        embedding_layer,
        dropout_layer,
        is_training_ph,
        encoder_mask_ph=None):
    """Decoder without attention
    it ignores encoder_otputs_op and uses only encoder_state_op to generate sequence

    Args:
        encoder_otputs_op: used only to get max_len
        encoder_state_op: state of the encoder RNN, [batch_size, hidden_dim]
        enc_mask_ph: ignored
        target_ph: target placeholder, used at training time for input feeding
        embedding_layer: tf.keras.Embedding object
        dropout_layer: tf.keras.Dropout object
        is_training_ph: is training mode flag

    Returns:
        tf.Tensor [batch_size, max_len, hidden_dim]
    """
    _, max_len, hidden_dim = encoder_otputs_op.get_shape()
    _, max_len, hidden_dim = _, max_len.value, hidden_dim.value

    batch_size_op = tf.shape(encoder_otputs_op)[0]
    vocab_size = embedding_layer.input_dim

    # make decoder cell layer
    decoder_cell = tf.keras.layers.GRUCell(hidden_dim)
    # make decoder output projection layer (projects to vocabulary space)
    decoder_output_proj_layer = tf.keras.layers.Dense(vocab_size)

    # first decoder input is start-of-sentence token
    decoder_input_op = tf.ones([batch_size_op]) * SOS_idx
    # first decoder state is last encoder state
    decoder_state_op = encoder_state_op
    # in this list we will store the logits of predicted sequence
    output_logits = []

    for i in range(max_len):
        decoder_input_emb_op = embedding_layer(decoder_input_op)

        # for some complicated reasons, we must to expand_dims on state
        # decoder_cell returns output and states, in the case of GRU, states and output are the same
        decoder_state_op, _ = decoder_cell(decoder_input_emb_op, tf.expand_dims(decoder_state_op, 1))

        decoder_output_logit_op = decoder_output_proj_layer(decoder_state_op)
        output_logits.append(decoder_output_logit_op)

        # if training, use input feeding i.e. teacher forcing
        decoder_input_op = tf.cond(is_training_ph,
                                   lambda: target_ph[:, i],
                                   lambda: tf.argmax(decoder_output_logit_op, axis=1))

    output_logits_op = tf.stack(output_logits, axis=1)
    return output_logits_op

Test decoder

In [24]:
tf.reset_default_graph()

toy_batch_size = 3
toy_vocab_size = 13
toy_hidden_dim = 7
toy_emb_size = 5
toy_maxlen = 11

toy_emb = tf.keras.layers.Embedding(toy_vocab_size, toy_emb_size, input_length=toy_maxlen)
toy_dropout = tf.keras.layers.Dropout(rate=0.5)

toy_input = tf.placeholder(tf.int64, [None, toy_maxlen])
toy_target = tf.placeholder(tf.int64, [None, toy_maxlen])
toy_mask = tf.placeholder(tf.bool, [None, toy_maxlen])

is_training_ph = tf.placeholder_with_default(tf.constant(True), shape=())

encoder_outputs_op, encoder_state_op = build_encoder(
    toy_input, toy_emb, toy_dropout, toy_hidden_dim, is_training_ph)

toy_logits_op = build_decoder(
    encoder_outputs_op, encoder_state_op, toy_target, toy_emb, toy_dropout, is_training_ph, toy_mask
)

if (toy_logits_op.shape[1:] == (toy_maxlen, toy_vocab_size)):
    print('Test passed')
else:
    print('Problem with the shape')
    print(f'Shape should be {(None, toy_maxlen, toy_vocab_size)}')
    print(f'But got {toy_logits_op.shape} instead')

toy_logits_op

Test passed


<tf.Tensor 'stack:0' shape=(?, 11, 13) dtype=float32>

Test full model

In [25]:
toy_logits_op = build_seq2seq_graph(
    toy_input, toy_target, build_encoder, build_decoder, toy_hidden_dim, toy_vocab_size, toy_emb_size, 0.5, is_training_ph)

### Decoder with attention math

Decoder is much more tricky then encoder, especially with attention.
So it would be better for us to write down all decoder operations mathematically.

Let $m$ be the length of a source sequence, $h$ be dimension of encoder output, $\operatorname E \in \mathbb{R}^{ vocab\_size \times emb\_size}$ - embedding matrix.

Before encoding we have:
$$
\mathbf{h}_i^{enc} \in \mathbb{R}^h - \text{encoder output at i-th timestamp}\\
\mathbf{h}_m^{enc} - \text{last encoder output (encoder state)}\\
$$

#### Zeroth step

At the zeroth decoding step we sould construct decoder **input** and decoder **initial state**.
Decoder **initial state** is encoder state.
Decoder **input** is attention vector with SOS-token embedding as **query**.

Let $sos$ be SOS-token index in embedding matrix.

$$
\mathbf{h}_o^{dec} = \mathbf{h}_m\\
\mathbf{e}_0 = \operatorname{E}[sos]\\
% \mathbf{o}_0 = \mathbf 0, \mathbf{o}_0 \in \mathbb{R}^h\\
% \mathbf{h}_1^{dec} = \operatorname{Decoder}([e_0; o_0])
$$

#### t-th step, t > 0

At t-th step decoder **input** is attention vector with previous predicted token embedding as **query**.

**Note:** at training time we use **teacher forcing** (it is also called input feeding) that means that instead of using previous predicted token decoder uses previous true token from target sequence.


#### Attention

When we got decoder state $h_1$, we can compute attention vector.

Let $\operatorname{W}_{attProj} \in \mathbb{R}^{h \times h}$ be attention weighs, $\mathbf{s}$ be attention scores, $\mathbf{e}_t$ - embedding of the previous predicted token.

$$
\begin{aligned}
\mathbf{s}_{t, i} &= (\mathbf{e}_t)^T \operatorname{W}_{attProj} \mathbf{h}_i^{enc}\\
\mathbf{\alpha}_t &= \operatorname{Softmax}(\mathbf{s}_t) \text{  }\\
\mathbf{a}_t &= \sum_i^m \alpha_{t, i} \mathbf{h}_i^{enc}
\end{aligned}
$$

Or in terms of **query keys ans values**:

$$
\begin{aligned}
\mathbf{s}_{t, i} &= (\mathbf{q}_t)^T \operatorname{W}_{attProj} \mathbf{k}\\
\mathbf{\alpha}_t &= \operatorname{Softmax}(\mathbf{s}_t) \text{  }\\
\mathbf{a}_t &= \sum_i^m \alpha_{t, i} \mathbf{v}
\end{aligned}
$$

In [33]:
def softmax_masked(values, mask):
    masked_values = -np.inf * (1 - tf.cast(mask, tf.float32)) + values
    return tf.nn.softmax(masked_values, 2)

def build_decoder_with_attention(
        encoder_outputs_op,
        encoder_state_op,
        target_ph,
        embedding_layer,
        dropout_layer,
        is_training_ph,
        encoder_mask_ph):
    """Decoder with Luong attention
    https://arxiv.org/abs/1508.04025

    Args:
        encoder_otputs_op: used only to get max_len
        encoder_state_op: state of the encoder RNN, [batch_size, hidden_dim]
        enc_mask_ph: ignored
        target_ph: target placeholder, used at training time for input feeding
        embedding_layer: tf.keras.Embedding object
        dropout_layer: tf.keras.Dropout object
        is_training_ph: is training mode flag

    Returns:
        tf.Tensor [batch_size, max_len, hidden_dim]
    """    
    _, max_len, hidden_dim = encoder_outputs_op.get_shape()
    _, max_len, hidden_dim = _, max_len.value, hidden_dim.value

    batch_size_op = tf.shape(encoder_outputs_op)[0]
    vocab_size = embedding_layer.input_dim

    # make decoder cell layer
    decoder_cell = tf.keras.layers.GRUCell(hidden_dim)
    # make decoder output projection layer (projects to vocabulary space)
    decoder_output_proj_layer = tf.keras.layers.Dense(vocab_size)

    # first decoder input is start-of-sentence token
    decoder_input_op = tf.ones([batch_size_op], dtype=tf.int64) * SOS_idx
    # first decoder state is last encoder state
    decoder_state_op = encoder_state_op
    # in this list we will store the logits of predicted sequence
    output_logits = []

    # attention-related variables:
    attention_proj_layer = tf.keras.layers.Dense(hidden_dim, use_bias=False)
    attention_keys_op = attention_proj_layer(encoder_outputs_op)  # W_attProj @ h_enc
    attention_values_op = encoder_outputs_op
    attention_query_op = decoder_state_op

    for i in range(max_len):
        # compute input tensor for decoder rnn
        decoder_input_emb_op = embedding_layer(decoder_input_op)

        # apply attention with decoder_input_emb_op as query
        attention_query_op = tf.expand_dims(decoder_state_op, 1)  # [batch_size, 1, hidden]
        attention_scores_op = tf.matmul(attention_query_op, attention_keys_op, transpose_b=True)  # [batch_size, 1, maxlen]

        attention_probs_op = softmax_masked(attention_scores_op, tf.expand_dims(encoder_mask_ph, 1))
        attention_probs_op = tf.nn.softmax(attention_scores_op, 2)  # [batch_size, 1, maxlen]
        attention_vec_op = tf.matmul(attention_probs_op, attention_values_op)  # [batch_size, 1, hidden]
        attention_vec_op = tf.squeeze(attention_vec_op, 1)  # [batch_size, hidden]

        # for some complicated reasons, we must to expand_dims(decoder_state_op, 1) on state
        # decoder_cell returns output and states, in the case of GRU, states and output are the same
        decoder_state_op, _ = decoder_cell(attention_vec_op, tf.expand_dims(decoder_state_op, 1))

        decoder_output_logit_op = decoder_output_proj_layer(decoder_state_op)
        output_logits.append(decoder_output_logit_op)

        # if training, use input feeding i.e. teacher forcing
        decoder_input_op = tf.cond(is_training_ph,
                                   lambda: target_ph[:, i],
                                   lambda: tf.argmax(decoder_output_logit_op, axis=1))

    output_logits_op = tf.stack(output_logits, axis=1)
    return output_logits_op

In [27]:
tf.reset_default_graph()

toy_batch_size = 3
toy_vocab_size = 13
toy_hidden_dim = 7
toy_emb_size = 5
toy_maxlen = 11

toy_emb = tf.keras.layers.Embedding(toy_vocab_size, toy_emb_size)
toy_dropout = tf.keras.layers.Dropout(rate=0.5)

toy_input = tf.placeholder(tf.int64, [None, toy_maxlen])
toy_target = tf.placeholder(tf.int64, [None, toy_maxlen])
toy_mask = tf.placeholder(tf.bool, [None, toy_maxlen])

is_training_ph = tf.placeholder_with_default(tf.constant(True), shape=())

encoder_outputs_op, encoder_state_op = build_encoder(
    toy_input, toy_emb, toy_dropout, toy_hidden_dim, is_training_ph)

toy_logits_op = build_decoder_with_attention(
    encoder_outputs_op, encoder_state_op, toy_target, toy_emb, toy_dropout, is_training_ph, toy_mask
)

if (toy_logits_op.shape[1:] == (toy_maxlen, toy_vocab_size)):
    print('Test passed')
else:
    print('Problem with the shape')
    print(f'Shape should be {(toy_batch_size, toy_maxlen, toy_vocab_size)}')
    print(f'But got {toy_logits_op.shape} instead')

toy_logits_op

Test passed


<tf.Tensor 'stack:0' shape=(?, 11, 13) dtype=float32>

In [28]:
toy_logits_op = build_seq2seq_graph(
    toy_input, toy_target, build_encoder, build_decoder_with_attention, toy_hidden_dim, toy_vocab_size, toy_emb_size, 0.5, is_training_ph)

## Make model class and train

![](https://bastings.github.io/annotated_encoder_decoder/images/bahdanau.png)

In [89]:
from deeppavlov.core.models.tf_model import TFModel
# http://docs.deeppavlov.ai/en/master/_modules/deeppavlov/core/models/tf_model.html


@register('seq2seq_57389')
class Seq2Seq(TFModel):
    def __init__(self, **kwargs):
        # model hyperparameters
        self.emb_size = kwargs['emb_size']
        self.hidden = kwargs['hidden']
        self.dropout = kwargs['dropout']
        self.vocab_size = kwargs['vocab_size']
        self.max_len = kwargs['max_len']

        # optimization hyperparameters
        self.grad_clip = kwargs.get('grad_clip', 5.)
        self.learning_rate = kwargs.get('learning_rate', 1e-3)

        # placeholders
        self.input_ph = tf.placeholder(tf.int64, [None, self.max_len])
        self.target_ph = tf.placeholder(tf.int64, [None, self.max_len])
        self.is_training_ph = tf.placeholder_with_default(tf.constant(False), shape=())
        self.target_mask_ph = tf.cast(self.target_ph > 0, tf.float32)

        # graph
        self.logits_op = build_seq2seq_graph(
            self.input_ph,
            self.target_ph,
            build_encoder,
            build_decoder_with_attention,
            self.hidden,
            self.vocab_size,
            self.emb_size,
            self.dropout,
            self.is_training_ph)
        self.predictions_op = tf.argmax(self.logits_op, axis=2)

        self.loss = self._build_loss(self.input_ph, self.logits_op, self.target_mask_ph)

        self.train_op = self.get_train_op(self.loss, self.learning_rate,
                                          optimizer=tf.train.AdamOptimizer,
                                          clip_norm=self.grad_clip)

        # create session and initialize graph variables
        sess_config = tf.ConfigProto()
        sess_config.gpu_options.allow_growth = True  # do not use all GPU memory at once
        self.sess = tf.Session(config=sess_config)

        self.sess.run(tf.global_variables_initializer())

        super().__init__(**kwargs)
        if self.save_path:
            pass
        if self.load_path is not None:
            self.load()

    def _build_loss(self, y_true, y_logits_pred, y_mask):
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_logits_pred) * y_mask
        loss = tf.reduce_sum(loss) / tf.reduce_sum(y_mask)
        return loss

    def _build_feed_dict(self, x, y=None):
        feed_dict = {
            self.input_ph: x,
        }
        if y is not None:
            feed_dict.update({
                self.target_ph: y,
                self.is_training_ph: True,
            })
        return feed_dict

    def train_on_batch(self, x, y):
        feed_dict = self._build_feed_dict(x, y)
        loss, _ = self.sess.run([self.loss, self.train_op], feed_dict=feed_dict)
        return loss

    def __call__(self, x):
        feed_dict = self._build_feed_dict(x)
        y_pred = self.sess.run(self.predictions_op, feed_dict=feed_dict)
        return y_pred

    def process_event(self, *args, **kwargs):
        pass

In [90]:
model = Seq2Seq(emb_size=3, hidden=5, dropout=0.1, vocab_size=7, max_len=9, save_path=None)



### Postprocessing

In postprocessing step we are going to remove all <PAD\>, <SOS\>, <EOS\> tokens.

In [46]:
@register('postprocessing')
class SentencePostprocessor(Component):
    def __init__(self, pad_token='<PAD>', start_token='<SOS>', end_token='<EOS>', *args, **kwargs):
        self.pad_token = pad_token
        self.start_token = start_token
        self.end_token = end_token

    def __call__(self, batch):
        for i in range(len(batch)):
            batch[i] = ' '.join(self._postproc(batch[i]))
        return batch
    
    def _postproc(self, utt):
        if self.end_token in utt:
            utt = utt[:utt.index(self.end_token)]
        return utt

In [47]:
postprocess = SentencePostprocessor()

In [48]:
padder(vocab([['hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this'], ['It']]))

[[1, 70, 13, 240, 3, 3, 2, 0, 0], [1, 3, 2, 0, 0, 0, 0, 0, 0]]

In [51]:
padder = SentencePadder(length_limit=9 - 2)
model = Seq2Seq(emb_size=3, hidden=5, dropout=0.1, vocab_size=7, max_len=9, save_path=None)
postprocess(vocab(model(padder(vocab([['hello', 'my', 'friend', 'there_is_no_such_word_in_dataset', 'and_this']])))))

['<UNK> <SOS> <SOS> you you you you you you']

### Create config file
Let's put is all together in one config file.

In [96]:
config = {
  "dataset_reader": {
    "class_name": "personachat_dataset_reader",
    "data_path": "./personachat"
  },
  "dataset_iterator": {
    "class_name": "personachat_iterator",
    "shuffle": True
  },
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      {
        "class_name": "lazy_tokenizer",
        "id": "tokenizer",
        "in": ["x"],
        "out": ["x_tokens"]
      },
      {
        "class_name": "lazy_tokenizer",
        "id": "tokenizer",
        "in": ["y"],
        "out": ["y_tokens"]
      },
      {
        "class_name": "dialog_vocab",
        "id": "vocab",
        "save_path": "./vocab.dict",
        "load_path": "./vocab.dict",
        "min_freq": 2,
        "special_tokens": ["<PAD>","<SOS>", "<EOS>", "<UNK>"],
        "unk_token": "<UNK>",
        "fit_on": ["x_tokens", "y_tokens"],
        "in": ["x_tokens"],
        "out": ["x_tokens_ids"]
      },
      {
        "ref": "vocab",
        "in": ["y_tokens"],
        "out": ["y_tokens_ids"]
      },
      {
        "class_name": "sentence_padder",
        "id": "padder",
        "length_limit": MAXLEN,
        "in": ["x_tokens_ids"],
        "out": ["x_tokens_ids"]
      },
      {
        "ref": "padder",
        "in": ["y_tokens_ids"],
        "out": ["y_tokens_ids"]
      },
      {
        "class_name": "seq2seq_57389",
        "id": "seq2seq_model",
        "max_len": "#padder.length_limit+2",
        "hidden": 250,
        "emb_size": 100,
        "vocab_size": len(vocab),
        "dropout": 0.1,
        "learning_rate": 1e-3,
        "save_path": "./seq2seq_model_57389",
        "load_path": "./seq2seq_model_57389",
        "in": ["x_tokens_ids"],
        "in_y": ["y_tokens_ids"],
        "out": ["y_predicted_tokens_ids"],
      },
      {
        "ref": "vocab",
        "in": ["y_predicted_tokens_ids"],
        "out": ["y_predicted_tokens"]
      },
      {
        "class_name": "postprocessing",
        "in": ["y_predicted_tokens"],
        "out": ["y_predicted_tokens"]
      }
    ],
    "out": ["y_predicted_tokens"]
  },
  "train": {
    "log_every_n_batches": 500,
    "val_every_n_epochs": 0,
    "batch_size": 64,
    "validation_patience": 5,
    "epochs": 20,
    "metrics": ["bleu"],
  }
}

### Interact with model using config

In [93]:
model = build_model(config)

2019-06-26 07:15:04.624 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/not_a_robot/Documents/random_notebooks/CISS/vocab.dict]
2019-06-26 07:15:17.608 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 52: [loading model from /home/not_a_robot/Documents/random_notebooks/CISS/seq2seq_model]


INFO:tensorflow:Restoring parameters from /home/not_a_robot/Documents/random_notebooks/CISS/seq2seq_model


In [94]:
model(['hi, how are you?', 'any ideas my dear friend?'])

['nd nd nd nd texture texture texture texture wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana',
 'nd nd nd nd texture wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana wana']

### Train model


Run experiments with and without attention, with teacher forcing and without.

In [97]:
with open('seq2seq.json', 'w') as f:
    json.dump(config, f)

In [None]:
from deeppavlov.core.commands.train import train_evaluate_model_from_config

train_evaluate_model_from_config('seq2seq.json')

2019-06-26 07:15:55.221 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/not_a_robot/Documents/random_notebooks/CISS/vocab.dict]


Done



2019-06-26 07:16:08.928 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/not_a_robot/Documents/random_notebooks/CISS/vocab.dict]
2019-06-26 07:16:40.379 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 164: New best bleu of 0.0666
2019-06-26 07:16:40.379 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 166: Saving model
2019-06-26 07:16:40.380 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 76: [saving model to /home/not_a_robot/Documents/random_notebooks/CISS/seq2seq_model_57389]


{"valid": {"eval_examples_count": 7801, "metrics": {"bleu": 0.0666}, "time_spent": "0:00:20", "epochs_done": 0, "batches_seen": 0, "train_examples_seen": 0, "impatience": 0, "patience_limit": 5}}
{"train": {"eval_examples_count": 64, "metrics": {"bleu": 0}, "time_spent": "0:05:00", "epochs_done": 0, "batches_seen": 500, "train_examples_seen": 32000, "loss": 9.187372651100159}}


In [None]:
model = build_model(config)
model(['hi, how are you?', 'any ideas my dear friend?', 'okay, i agree with you', 'good bye!'])

In [None]:
model(['tell me about yourself'])

## Extra
### Decoder with attention math 2 (more realistic case)

This is typical NMT decoder with attention. It uses a lot of hacky tricks to make decoding a bit better.

Decoder is much more tricky then encoder, especially with attention.
So it would be better for us to write down all decoder operations mathematically.

Let $m$ be the length of a source sequence, $h$ be dimension of encoder output, $\operatorname E \in \mathbb{R}^{ vocab\_size \times emb\_size}$ - embedding matrix.

Before encoding we have:
$$
\mathbf{h}_i^{enc} \in \mathbb{R}^h - \text{encoder output at i-th timestamp}\\
\mathbf{h}_m^{enc} - \text{last encoder output (encoder state)}\\
$$

#### Zeroth step

At the zeroth decoding step we sould construct decoder **input** and decoder **initial state**.
Decoder **initial state** is transformed (projected with matrix $\mathbf{\operatorname{W}}_h$) encoder state.
Decoder **input** is _zero_ vector of size $h$.

Let $sos$ be SOS-token index in embedding matrix.

$$
\mathbf{h}_o^{dec} = \operatorname{W}_h \mathbf{h}_m, \; \operatorname{W}_h \in \mathbb{R}^{h \times 2h} - \text{decoder initial state is transformed encoder state}\\
\mathbf{e}_0 = \operatorname{E}[sos]\\
\mathbf{o}_0 = \mathbf 0, \mathbf{o}_0 \in \mathbb{R}^h\\
\mathbf{h}_1^{dec} = \operatorname{Decoder}([e_0; o_0])
$$

#### t-th step, t > 0

At t-th step decoder **input** is concatenated combined-output vector $\mathbf{o}_t$ (it is explained down this page in Attention paragraph) and previous predicted token embedding.

**Note:** at training time we use **teacher forcing** (it is also called input feeding) that means that instead of using previous predicted token decoder uses previous true token from target sequence.


#### Attention

When we got decoder state $h_1$, we can compute attention vector.

Let $\operatorname{W}_{attProj} \in \mathbb{R}^{h \times 2h}$ be attention weighs, $\mathbf{s}$ be attention scores.

$$
\begin{aligned}
\mathbf{s}_{t, i} &= (h_t^{dec})^T \operatorname{W}_{attProj} h_i^{enc}\\
\mathbf{\alpha}_t &= \operatorname{Softmax}(\mathbf{s}_t) \text{  }\\
\mathbf{a}_t &= \sum_i^m \alpha_{t, i} \mathbf{h}_i^{enc}
\end{aligned}
$$

Then, decoder output is concatenated with attention vector and passed through a linear layer, tanh and dropout to attain combined-output vector $\mathbf{o}_t$.

$$
\begin{aligned}
\mathbf{u}_t = [a_t; h_t^{dec}]
    \text{   } \; &where \; \text{  }
        \mathbf{u}_t \in \mathbb{R}^{3h \times 1}\\
\mathbf{o}_t = \operatorname{Dropout(tanh(W_u} \mathbf{u}_t))
    \text{   } \;  &where  \text{   } \; 
        \operatorname{W}_u \in \mathbb{R}^{h \times 3h}, \mathbf{o}_t \in \mathbb{R}^{h \times 1}
\end{aligned}
$$