# Lecture 13: Seq2seq with Attention 
## 0. Overview
- Seq2seq
- Implementation keys
- Example: Chatbot

## 1. Sequence to Sequence
### 1.1 Intro
- The current model class of choice for most dialogue and machine translation systems
- [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,  Cho et al. 2014](https://arxiv.org/pdf/1406.1078.pdf)

### 1.2 Structure: two RNNs
- Encoder: maps a variable-length source sequence (input) to a fixed-length vector
- Decoder: maps the vector representation back to a variable-length target sequence (output)   

Two RNNs are trained jointly to maximize the conditional probability of the target sequence given a source sequence
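
Written out, this objective is the average conditional log-likelihood of the training pairs, following the formulation in Cho et al. (2014). The notation is assumed here rather than taken from these notes: $\theta$ stands for the parameters of both RNNs, $(x_n, y_n)$ is the $n$-th of $N$ source/target pairs, and $c$ is the fixed-length vector produced by the encoder.

$$
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n),
\qquad
p_{\theta}(y \mid x) = \prod_{t=1}^{T} p_{\theta}(y_t \mid y_1, \dots, y_{t-1}, c)
$$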

### 1.3 Encoder and Decoder in TensorFlow
#### Vanilla model
Each input has to be encoded into a **fixed-size state vector**, which is the only thing passed to the decoder
![vanilla](figures/13_01.png)

#### Model with attention
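At each step, the decoder gets direct access to the encoder's states for every input position instead of relying on a single fixed-size vector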
![attention](figures/13_02.png)
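
To make "direct access" concrete, here is a minimal sketch of one attention step in plain Python. It scores each encoder state against the current decoder state, normalizes the scores with a softmax, and returns the weighted average as a context vector. Dot-product scoring and the names `attend`, `encoder_states` and `decoder_state` are assumptions chosen for brevity; they are not taken from the TensorFlow model discussed in these notes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (context, weights) for a single decoder step."""
    # Assumption: dot-product scoring between decoder and encoder states.
    scores = [sum(h * d for h, d in zip(state, decoder_state))
              for state in encoder_states]
    weights = softmax(scores)
    # The context is the attention-weighted average of the encoder states.
    hidden = len(decoder_state)
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(hidden)]
    return context, weights

# Toy example: three source positions, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = attend(decoder_state, encoder_states)
```

The weights always sum to 1, so the context vector stays on the same scale as the encoder states no matter how long the source sequence is.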

## 2. Implementation keys
### 2.1 Bucketing
- Avoid too much padding
- Group sequences of similar lengths into the same buckets
- Create a separate subgraph for each bucket
- in theory (v1.0)

        tf.contrib.training.bucket_by_sequence_length(max_length, examples, batch_size, bucket_boundaries, capacity=2 * batch_size, dynamic_pad=True)
        
- in practice (v0.12): [TF translate model](https://www.tensorflow.org/tutorials/seq2seq)
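
The grouping step can be sketched without TensorFlow. The code below assumes that each bucket is a `(max source length, max target length)` pair and that a pair goes into the smallest bucket that fits it; the helper names `assign_bucket` and `bucket_pairs` are hypothetical, and the bucket sizes are borrowed from the 3-bucket setup later in these notes.

```python
def assign_bucket(source_len, target_len, buckets):
    """Return the index of the smallest bucket that fits the pair, or None."""
    for index, (max_source, max_target) in enumerate(buckets):
        if source_len <= max_source and target_len <= max_target:
            return index
    return None  # too long for every bucket: the pair is dropped

def bucket_pairs(pairs, buckets):
    """Group (source, target) token lists by bucket index."""
    grouped = [[] for _ in buckets]
    for source, target in pairs:
        index = assign_bucket(len(source), len(target), buckets)
        if index is not None:
            grouped[index].append((source, target))
    return grouped

buckets = [(8, 10), (12, 14), (16, 19)]
pairs = [
    ([1] * 5, [2] * 7),    # fits the first bucket
    ([1] * 9, [2] * 9),    # source is too long for the first bucket
    ([1] * 16, [2] * 19),  # exactly fills the last bucket
    ([1] * 30, [2] * 5),   # fits nowhere
]
grouped = bucket_pairs(pairs, buckets)
```

Each batch is then drawn from a single bucket, so sequences only need to be padded up to that bucket's size rather than to the longest sequence in the whole dataset.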

### 2.2 Sampled Softmax
- advantage: avoid the growing complexity of computing the normalization constant
- idea: approximate the negative term of the gradient by importance sampling with a small number of samples
    - At each step, update only the vectors associated with the correct word $w$ and with the sampled words in $V'$
    - Once training is over, use the full target vocabulary to compute the output probability of each target word
- [On Using Very Large Target Vocabulary for Neural Machine Translation (Jean et al., 2015)](https://arxiv.org/pdf/1412.2007.pdf)

        if config.NUM_SAMPLES > 0 and config.NUM_SAMPLES < config.DEC_VOCAB:
            # Output projection from the decoder's hidden state to the vocabulary
            weight = tf.get_variable('proj_w', [config.HIDDEN_SIZE, config.DEC_VOCAB])
            bias = tf.get_variable('proj_b', [config.DEC_VOCAB])
            self.output_projection = (weight, bias)

            def sampled_loss(inputs, labels):
                labels = tf.reshape(labels, [-1, 1])
                # sampled_softmax_loss expects weights of shape [DEC_VOCAB, HIDDEN_SIZE].
                # Argument order (inputs, labels) follows TF 0.12; TF 1.0 swapped the two.
                return tf.nn.sampled_softmax_loss(tf.transpose(weight), bias, inputs, labels,
                                                  config.NUM_SAMPLES, config.DEC_VOCAB)
            self.softmax_loss_function = sampled_loss

- Generally an underestimate of the full softmax loss
- At inference time, compute the **full softmax**. With `weight` of shape `[HIDDEN_SIZE, DEC_VOCAB]` as defined above, no transpose is needed:

        tf.nn.softmax(tf.matmul(inputs, weight) + bias)
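
Why the sampled loss tends to underestimate the full loss can be seen in a simplified sketch. It normalizes over the correct word plus a handful of uniformly sampled negatives, so the normalization term can only shrink compared with the full vocabulary. This is a simplification under stated assumptions: `tf.nn.sampled_softmax_loss` uses a different sampler and corrects the logits for sampling probability, and a real implementation would compute logits only for the sampled rows rather than slicing a full logit vector as done here.

```python
import math
import random

def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def full_softmax_loss(logits, target):
    """Cross-entropy normalized over the whole vocabulary."""
    return log_sum_exp(logits) - logits[target]

def sampled_softmax_loss(logits, target, num_sampled, rng):
    """Cross-entropy normalized over the target plus sampled negatives."""
    negatives = [i for i in range(len(logits)) if i != target]
    sampled = rng.sample(negatives, num_sampled)  # uniform, no correction
    subset = [logits[target]] + [logits[i] for i in sampled]
    return log_sum_exp(subset) - logits[target]

rng = random.Random(0)
vocab_size = 1000
logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
target = 42

full = full_softmax_loss(logits, target)
sampled = sampled_softmax_loss(logits, target, num_sampled=64, rng=rng)
```

In this uncorrected form `sampled` can never exceed `full`, and the two coincide once every other word is included as a negative.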
        

### 2.3 Seq2seq in TensorFlow

        outputs, states = basic_rnn_seq2seq(encoder_inputs,
                                            decoder_inputs,
                                            cell)

- **encoder_inputs**: a list of tensors representing inputs to the encoder
- **decoder_inputs**: a list of tensors representing inputs to the decoder
- **cell**: single or multiple layer cells
- **outputs**: a list of decoder_size tensors, each of dimension 1 x DECODE_VOCAB, corresponding to the probability distribution at each time-step
- **states**: a list of decoder_size tensors, each corresponding to the internal state of the decoder at that time-step

        outputs, states = embedding_rnn_seq2seq(encoder_inputs,
                                                decoder_inputs,
                                                cell,
                                                num_encoder_symbols,
                                                num_decoder_symbols,
                                                embedding_size,
                                                output_projection=None,
                                                feed_previous=False)

- **num_encoder_symbols** & **num_decoder_symbols**: the number of distinct input and output tokens, needed to build the embeddings for the inputs and outputs
- **feed_previous**: set to True to feed the previously predicted word as the next decoder input (instead of the ground-truth token), even if the model's prediction is wrong
- **output_projection**: tuple of projection weights and biases, needed when using sampled softmax

        outputs, states = embedding_attention_seq2seq(encoder_inputs,
                                                      decoder_inputs,
                                                      cell,
                                                      num_encoder_symbols,
                                                      num_decoder_symbols,
                                                      num_heads=1,
                                                      output_projection=None,
                                                      feed_previous=False,
                                                      initial_state_attention=False)

        outputs, losses = model_with_buckets(encoder_inputs,
                                             decoder_inputs,
                                             targets,
                                             weights,
                                             buckets,
                                             seq2seq,
                                             softmax_loss_function=None,
                                             per_example_loss=False)

- **seq2seq**: one of the seq2seq functions defined above
- **softmax_loss_function**: normal softmax or sampled softmax
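
The `targets` and `weights` arguments are not described above. A common convention, assumed in this sketch, is that `weights` holds 1.0 for real target tokens and 0.0 for padding, so padded positions do not contribute to the loss. The function `weighted_sequence_loss` is a hypothetical plain-Python illustration of that idea, not the library's code.

```python
import math

def weighted_sequence_loss(step_probs, targets, weights):
    """Average cross-entropy over the time steps with non-zero weight.

    step_probs: per step, a list of probabilities over the vocabulary
    targets:    per step, the index of the correct token
    weights:    per step, 1.0 for real tokens and 0.0 for padding
    """
    total = 0.0
    for probs, target, weight in zip(step_probs, targets, weights):
        total += -math.log(probs[target]) * weight
    # Guard against division by zero when every step is padding.
    return total / max(sum(weights), 1e-12)

# Two real steps followed by one padded step (token 0 is the pad id here).
step_probs = [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.9, 0.05, 0.05]]
targets = [1, 2, 0]
weights = [1.0, 1.0, 0.0]
loss = weighted_sequence_loss(step_probs, targets, weights)
```

Changing the predicted distribution at the padded step leaves `loss` unchanged, which is exactly what the mask is for.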

## 3. TensorFlow chatbot
### 3.1 [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
- 220,579 conversational exchanges between 10,292 pairs of movie characters
- 9,035 characters from 617 movies
- 304,713 total utterances
- Very well-formatted (almost perfect)
- [*Chameleons in imagined conversations, Cristian Danescu-Niculescu-Mizil and Lillian Lee*](http://www.cs.cornell.edu/~cristian/papers/chameleons.pdf)


### 3.2 Input & bucketing
- input length distribution
![inputlen](figures/13_03.png)
- bucketing
    - 9 buckets: 
    
            [(6, 8), (8, 10), (10, 12), (13, 15), (16, 19), (19, 22), (23, 26), (29, 32), (39, 44)]
            [19530, 17449, 17585, 23444, 22884, 16435, 17085, 18291, 18931]
            
    - 5 buckets:
    
            [(8, 10), (12, 14), (16, 19), (23, 26), (39, 43)] # bucket boundaries
            [37049, 33519, 30223, 33513, 37371] # number of samples in each bucket

    - 3 buckets (recommended)
    
            [(8, 10), (12, 14), (16, 19)]
            [37899, 34480, 31045]

### 3.3 Vocabulary tradeoff
- Keep all tokens that appear at least a minimum number of times (here, twice)
- Alternative approach: get a fixed size vocabulary
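
Both approaches fit in a few lines with `collections.Counter`. This is a sketch on a made-up toy corpus; the function names and the `<unk>` marker string are assumptions, not the chatbot's actual preprocessing code.

```python
from collections import Counter

def vocab_by_min_count(tokens, min_count=2):
    """Keep every token that appears at least min_count times."""
    counts = Counter(tokens)
    return sorted(tok for tok, n in counts.items() if n >= min_count)

def vocab_by_size(tokens, max_size):
    """Keep the max_size most frequent tokens."""
    counts = Counter(tokens)
    return [tok for tok, _ in counts.most_common(max_size)]

def encode(tokens, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with the unknown marker."""
    known = set(vocab)
    return [tok if tok in known else unk for tok in tokens]

corpus = "i love you . i love movies . you love popcorn".split()
thresholded = vocab_by_min_count(corpus)   # frequency threshold
fixed = vocab_by_size(corpus, max_size=2)  # fixed-size vocabulary
encoded = encode(corpus, thresholded)
```

With the threshold, the vocabulary size depends on the corpus; with a fixed size, the effective frequency cutoff does. Either way, rare words such as `movies` and `popcorn` end up as `<unk>`.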

#### Smaller vocabulary:
- Has smaller loss/perplexity (but loss/perplexity isn’t everything)
- Gives `<unk>` answers to questions that require personal information
- Leaves the bot’s answers with little content
- Doesn’t train much faster than big vocab using sampled softmax

### 3.4 Model
- Seq2seq
- Attentional decoder
- Reverse encoder inputs
- Bucketing
- Sampled softmax
- Based on Google’s vanilla translate model (originally used to translate from English to French)
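
Reversing the encoder inputs and padding to the bucket size can be sketched as below. The special-token ids and helper names are hypothetical, and the sketch assumes the pair already fits its bucket. The encoder input is padded first and then reversed, so the padding lands at the front and the first source tokens end up closest to the decoder.

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2  # hypothetical special-token ids

def pad_encoder_input(tokens, size):
    """Pad to the bucket size, then reverse the whole sequence."""
    padded = tokens + [PAD_ID] * (size - len(tokens))
    return list(reversed(padded))

def pad_decoder_input(tokens, size):
    """Prepend GO, append EOS, then pad to the bucket size."""
    framed = [GO_ID] + tokens + [EOS_ID]
    return framed + [PAD_ID] * (size - len(framed))

encoder_size, decoder_size = 8, 10  # one (encoder, decoder) bucket
encoder_input = pad_encoder_input([11, 12, 13], encoder_size)
decoder_input = pad_decoder_input([21, 22], decoder_size)
```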

### 3.5 Sanity check
Check if we implemented our model correctly.
- Run the model on a small dataset (~2,000 pairs) for many epochs to see if it converges (learns all the responses by heart)

### 3.6 Problems
- The bot is very dramatic (thanks to Hollywood screenwriters)
- Topics of conversations aren’t realistic
- The response to a given encoder input is always the same
- Inconsistent personality
- Uses only the most recent utterance as the input for the encoder
- Doesn’t keep track of information about users

### 3.7 More refinements
#### Train on multiple datasets
- [Twitter chat log (courtesy of Marsan Ma)](https://github.com/Marsan-Ma/chat_corpus)
- [More movie subtitles (less clean)](https://github.com/Marsan-Ma/chat_corpus/)
- [Every publicly available Reddit comment (1TB of data!)](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)
- Your own conversations (chat logs, text messages, emails)

#### Chatbot with personalities
- At the decoder phase, inject consistent information about the bot
- Use the decoder inputs from one person only

#### Train on the incoming inputs
- Save the conversation with users and train on those conversations
- Create a feedback loop so users can correct the bot’s responses

#### Remember what users say

#### Use characters instead of tokens
- Character level language modeling seems to be working quite well
- Smaller vocabulary -- no unknown tokens!
- But the sequences will be much longer (approximately 4x longer)
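
The tradeoff shows up even on a toy example. In the sketch below (made-up sentences, not from the corpus), a word never seen in training is unknown at the word level but fully covered at the character level, at the cost of a longer sequence. The "no unknown tokens" property only holds as long as every character of the new input was seen during training.

```python
train = ["the cat sat", "a hat"]
test = "that cat"

# Vocabularies built from the training lines only.
word_vocab = {w for line in train for w in line.split()}
char_vocab = {c for line in train for c in line}

word_tokens = test.split()
char_tokens = list(test)

unknown_words = [w for w in word_tokens if w not in word_vocab]
unknown_chars = [c for c in char_tokens if c not in char_vocab]
```

Here `that` is out of vocabulary as a word, yet all of its characters are known, and the test line grows from 2 word tokens to 8 character tokens.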

#### Improve input pipeline
- Right now, 50% of running time is spent on generating batches