In [37]:
%matplotlib inline

Generation of Questions with a seq2seq network 
*************************************************************




::

    [KEY: > input, = target, < output]

   

Generating a question from a sentence is made possible by the simple but
powerful idea of the `sequence
to sequence network <http://arxiv.org/abs/1409.3215>`__, in which two
recurrent neural networks work together to transform one sequence into
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.

.. figure:: /_static/img/seq-seq-images/seq2seq.png
   :alt:

To improve upon this model we'll use an `attention
mechanism <https://arxiv.org/abs/1409.0473>`__, which lets the decoder
learn to focus over a specific range of the input sequence.

**Requirements**



In [47]:
from __future__ import unicode_literals, print_function, division
from io import open
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Loading data files
===================

The data for this project is a set of many thousands of sentence pairs,
divided into two main groups, agreement and no agreement. Both groups are
stored in the dataset folder of the project.



Similar to the character encoding used in the character-level RNN
tutorials, we will be representing each word in a language as a one-hot
vector, or giant vector of zeros except for a single one (at the index
of the word). Compared to the dozens of characters that might exist in a
language, there are many many more words, so the encoding vector is much
larger. We will however cheat a bit and trim the data to only use a few
thousand words per language.

.. figure:: /_static/img/seq-seq-images/word-encoding.png
   :alt:
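
As a minimal sketch of the one-hot encoding described above (the five-word
vocabulary and the helper name ``one_hot`` are made up for illustration and
are not part of this project's code):

```python
# Toy vocabulary; the words and their indexes are illustrative only.
word2index = {"the": 0, "yaks": 1, "will": 2, "giggle": 3, "quest": 4}


def one_hot(word, word2index):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(word2index)
    vector[word2index[word]] = 1
    return vector


print(one_hot("will", word2index))  # [0, 0, 1, 0, 0]
```

The networks below never build these vectors explicitly: ``nn.Embedding``
takes the word index directly, which is equivalent to multiplying a one-hot
vector by the embedding matrix.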





We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called ``Lang`` which has word → index (``word2index``) and index → word
(``index2word``) dictionaries, as well as a count of each word
``word2count`` to use to later replace rare words.




In [48]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
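
The counts kept in ``word2count`` are not used again in the code shown here.
One way they could be used to replace rare words is sketched below; the
``<unk>`` token, the ``min_count`` threshold, and the function name
``replace_rare_words`` are assumptions for illustration.

```python
from collections import Counter


def replace_rare_words(sentences, min_count=2, unk="<unk>"):
    """Replace every word seen fewer than min_count times with unk."""
    counts = Counter(word for s in sentences for word in s.split(' '))
    return [
        ' '.join(w if counts[w] >= min_count else unk for w in s.split(' '))
        for s in sentences
    ]


sentences = ["the yaks giggle", "the seals giggle", "the yaks smile"]
print(replace_rare_words(sentences))
# ['the yaks giggle', 'the <unk> giggle', 'the yaks <unk>']
```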

We will make everything lowercase and trim most punctuation.




In [49]:


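import unicodedata


# Decompose accented characters and drop the combining marks
def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s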
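
To see what the two regular expressions do, here is a small worked example
on a made-up string (the accent-stripping step is left out because the
string is already plain ASCII):

```python
import re

s = "Will the yaks giggle?  Yes, 2 of them!"
s = s.lower().strip()
s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # collapse every other run to one space
print(s)  # will the yaks giggle ? yes of them !
```

Digits and commas disappear because they fall outside the allowed character
class, and the sentence punctuation ends up as separate tokens.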

To read the data file we will split the file into lines, and then split
each line into a pair at the tab character.




In [50]:
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    with open('./../data/no_agreement_data/train_data.txt') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
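
The line splitting and the effect of ``reverse=True`` can be sketched on an
in-memory string. The two lines below are made up; the column order is an
assumption inferred from the sample pair printed later in this notebook, and
the real file's contents are not reproduced here.

```python
# Two hypothetical lines in "column 1<TAB>column 2" form.
raw = ("will the yaks giggle\tthe yaks will giggle quest\n"
       "the seals can live\tthe seals can live ident\n")

lines = raw.strip().split('\n')
pairs = [line.split('\t') for line in lines]

# reverse=True swaps the two columns of every pair.
reversed_pairs = [list(reversed(p)) for p in pairs]

print(pairs[0])
print(reversed_pairs[0])
```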

The maximum length of each sentence is set to 100 words.



In [51]:
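MAX_LENGTH = 100


# Keep a pair only if both sentences have fewer than MAX_LENGTH words,
# leaving room for the EOS token appended later
def filterPair(p):
    return (len(p[0].split(' ')) < MAX_LENGTH and
            len(p[1].split(' ')) < MAX_LENGTH)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]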

The full process for preparing the data is:

-  Read text file and split into lines, split lines into pairs
-  Normalize text, filter by length and content
-  Make word lists from sentences in pairs




In [52]:
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs


input_lang, output_lang, pairs = prepareData('out', 'inp', True)
print(random.choice(pairs))

Reading lines...
Read 196980 sentence pairs
Trimmed to 196980 sentence pairs
Counting words...
Counted words:
inp 54
out 52
['the yaks that will entertain the elephants will giggle quest', 'will the yaks that will entertain the elephants giggle']


The Seq2Seq Model
=================

A Recurrent Neural Network, or RNN, is a network that operates on a
sequence and uses its own output as input for subsequent steps.

A `Sequence to Sequence network <http://arxiv.org/abs/1409.3215>`__, or
seq2seq network, or `Encoder Decoder
network <https://arxiv.org/pdf/1406.1078v3.pdf>`__, is a model
consisting of two RNNs called the encoder and decoder. The encoder reads
an input sequence and outputs a single vector, and the decoder reads
that vector to produce an output sequence.

.. figure:: /_static/img/seq-seq-images/seq2seq.png
   :alt:

Unlike sequence prediction with a single RNN, where every input
corresponds to an output, the seq2seq model frees us from sequence
length and order, which makes it ideal for translation between two
languages.

Consider the sentence "Je ne suis pas le chat noir" → "I am not the
black cat". Most of the words in the input sentence have a direct
translation in the output sentence, but are in slightly different
orders, e.g. "chat noir" and "black cat". Because of the "ne/pas"
construction there is also one more word in the input sentence. It would
be difficult to produce a correct translation directly from the sequence
of input words.

With a seq2seq model the encoder creates a single vector which, in the
ideal case, encodes the "meaning" of the input sequence into a single
vector — a single point in some N dimensional space of sentences.




The Encoder
-----------

The encoder of a seq2seq network is an RNN that outputs some value for
every word from the input sentence. For every input word the encoder
outputs a vector and a hidden state, and uses the hidden state for the
next input word.

.. figure:: /_static/img/seq-seq-images/encoder-network.png
   :alt:
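
The idea of carrying a hidden state from one word to the next can be
sketched without any library. This toy uses a scalar Elman-style update with
made-up weights (``w_x``, ``w_h``); it is not a GRU and not the encoder used
in this notebook.

```python
import math


def rnn_step(x, h, w_x=0.5, w_h=0.8):
    """One recurrence step: new hidden state from the input and old state."""
    return math.tanh(w_x * x + w_h * h)


def encode(inputs):
    """Run the recurrence, keeping every output and the final hidden state."""
    h = 0.0  # zero initial hidden state
    outputs = []
    for x in inputs:
        h = rnn_step(x, h)
        outputs.append(h)  # in this toy the output equals the hidden state
    return outputs, h


outputs, final_hidden = encode([1.0, 0.0, -1.0])
print(len(outputs), final_hidden == outputs[-1])  # 3 True
```

The list of per-step outputs is what an attention mechanism looks at, while
the final hidden state is what is handed to the decoder.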





In [53]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

The Decoder
-----------

The decoder is another RNN that takes the encoder output vector(s) and
outputs a sequence of words to create the translation.




Simple Decoder
^^^^^^^^^^^^^^

In the simplest seq2seq decoder we use only the last output of the encoder.
This last output is sometimes called the *context vector* as it encodes
context from the entire sequence. This context vector is used as the
initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and
hidden state. The initial input token is the start-of-string ``<SOS>``
token, and the first hidden state is the context vector (the encoder's
last hidden state).

.. figure:: /_static/img/seq-seq-images/decoder-network.png
   :alt:





In [54]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

I encourage you to train and observe the results of this model, but to
save space we'll be going straight for the gold and introducing the
Attention Mechanism.




Attention Decoder
^^^^^^^^^^^^^^^^^

If only the context vector is passed between the encoder and decoder,
that single vector carries the burden of encoding the entire sentence.

Attention allows the decoder network to "focus" on a different part of
the encoder's outputs for every step of the decoder's own outputs. First
we calculate a set of *attention weights*. These will be multiplied by
the encoder output vectors to create a weighted combination. The result
(called ``attn_applied`` in the code) should contain information about
that specific part of the input sequence, and thus help the decoder
choose the right output words.

.. figure:: https://i.imgur.com/1152PYf.png
   :alt:
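
The weighted combination itself is simple arithmetic. The sketch below uses
plain Python lists and hand-picked scores; in the real decoder the scores
come from a learned linear layer, and the function names here are
illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_attention(weights, encoder_outputs):
    """Weighted sum of the encoder output vectors for one decoder step."""
    size = len(encoder_outputs[0])
    return [sum(w * vec[i] for w, vec in zip(weights, encoder_outputs))
            for i in range(size)]


# Three input steps, hidden size 2.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = softmax([2.0, 0.0, 0.0])  # scores favour the first input step
print(weights)
print(apply_attention(weights, encoder_outputs))
```

With equal scores the result is just the average of the encoder outputs; the
more peaked the weights, the closer the result is to a single encoder output.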

Calculating the attention weights is done with another feed-forward
layer ``attn``, using the decoder's input and hidden state as inputs.
Because there are sentences of all sizes in the training data, to
actually create and train this layer we have to choose a maximum
sentence length (input length, for encoder outputs) that it can apply
to. Sentences of the maximum length will use all the attention weights,
while shorter sentences will only use the first few.

.. figure:: /_static/img/seq-seq-images/attention-decoder-network.png
   :alt:
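
The effect of the fixed maximum length can be sketched as follows. Encoder
outputs for a short sentence are placed in a zero-filled buffer, so attention
weights on the unused positions multiply zeros and contribute nothing. The
buffer size of 5 and the helper name ``pad_outputs`` are assumptions for this
example; the notebook uses ``MAX_LENGTH = 100``.

```python
MAX_LEN = 5  # illustrative only


def pad_outputs(encoder_outputs, hidden_size, max_len=MAX_LEN):
    """Copy the encoder outputs into a zero-filled max_len x hidden_size buffer."""
    padded = [[0.0] * hidden_size for _ in range(max_len)]
    for i, vec in enumerate(encoder_outputs):
        padded[i] = list(vec)
    return padded


# A two-word input: only the first two rows are non-zero.
padded = pad_outputs([[1.0, 2.0], [3.0, 4.0]], hidden_size=2)
weights = [0.2] * MAX_LEN  # stand-in attention weights over all 5 positions

applied = [sum(w * row[i] for w, row in zip(weights, padded))
           for i in range(2)]
print(padded)
print(applied)
```

Note that the weights on padded positions are not forced to zero, so part of
the attention mass can be spent on empty rows.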





In [55]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

.. Note::
   There are other forms of attention that work around the length
   limitation by using a relative position approach. Read about "local
   attention" in `Effective Approaches to Attention-based Neural Machine
   Translation <https://arxiv.org/abs/1508.04025>`__.

Training
========

Preparing Training Data
-----------------------

To train, for each pair we will need an input tensor (indexes of the
words in the input sentence) and target tensor (indexes of the words in
the target sentence). While creating these vectors we will append the
EOS token to both sequences.




In [56]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
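
Before the indexes are wrapped in a tensor, the conversion is an ordinary
dictionary lookup. A minimal sketch with a hypothetical vocabulary (real
indexes depend on the order in which words were added):

```python
EOS = 1  # same index the notebook reserves for EOS_token

# Hypothetical vocabulary for illustration.
word2index = {"will": 2, "the": 3, "yaks": 4, "giggle": 5}


def indexes_with_eos(sentence, word2index):
    """Map each word to its index and append the end-of-sequence index."""
    return [word2index[w] for w in sentence.split(' ')] + [EOS]


print(indexes_with_eos("will the yaks giggle", word2index))  # [2, 3, 4, 5, 1]
```

A word that is missing from the vocabulary raises ``KeyError`` in this
lookup, so sentences passed in at evaluation time must use known words only.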

Training the Model
------------------

To train we run the input sentence through the encoder, and keep track
of every output and the latest hidden state. Then the decoder is given
the ``<SOS>`` token as its first input, and the last hidden state of the
encoder as its first hidden state.

"Teacher forcing" is the concept of using the real target outputs as
each next input, instead of using the decoder's guess as the next input.
Using teacher forcing causes it to converge faster but `when the trained
network is exploited, it may exhibit
instability <http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf>`__.

You can observe outputs of teacher-forced networks that read with
coherent grammar but wander far from the correct translation -
intuitively it has learned to represent the output grammar and can "pick
up" the meaning once the teacher tells it the first few words, but it
has not properly learned how to create the sentence from the translation
in the first place.

Because of the freedom PyTorch's autograd gives us, we can randomly
choose to use teacher forcing or not with a simple if statement. Turn
``teacher_forcing_ratio`` up to use more of it.




In [57]:
teacher_forcing_ratio = 0.5


def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
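
The only difference between the two branches is which token is fed to the
decoder at each step. The sketch below makes that explicit with plain lists;
the prediction values are stand-ins, and the early stop at EOS in the
non-teacher-forced branch is left out for simplicity.

```python
SOS = 0  # start-of-sequence index, as in the notebook


def decoder_inputs(targets, predictions, use_teacher_forcing):
    """Return the token fed to the decoder at each step.

    targets: ground-truth token indexes
    predictions: what the decoder guessed at each step (stand-in values)
    """
    previous = targets if use_teacher_forcing else predictions
    # The input at step t is the target (or the guess) from step t - 1.
    return [SOS] + list(previous[:len(targets) - 1])


targets = [5, 6, 7, 1]
predictions = [5, 9, 7, 1]  # the decoder got the second step wrong
print(decoder_inputs(targets, predictions, True))   # [0, 5, 6, 7]
print(decoder_inputs(targets, predictions, False))  # [0, 5, 9, 7]
```

With teacher forcing the wrong guess never reaches the decoder's input, which
is why the network can converge faster but may be less robust to its own
mistakes at evaluation time.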

This is a helper function to print time elapsed and estimated time
remaining given the start time and the fraction of training completed.




In [58]:
import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
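
The estimate is a simple proportion: if a fraction ``p`` of the work took
``s`` seconds, the total is estimated as ``s / p`` and the remainder as
``s / p - s``. A small worked example with made-up numbers (the function
names are illustrative):

```python
import math


def as_minutes(seconds):
    """Format a number of seconds as 'Xm Ys'."""
    minutes = math.floor(seconds / 60)
    return '%dm %ds' % (minutes, seconds - minutes * 60)


def remaining(elapsed, fraction_done):
    """Estimate remaining seconds from elapsed seconds and progress in (0, 1]."""
    estimated_total = elapsed / fraction_done
    return estimated_total - elapsed


print(as_minutes(125))                   # 2m 5s
print(as_minutes(remaining(120, 0.25)))  # 6m 0s
```

The estimate assumes every iteration takes about the same time, so it is
least reliable during the first few iterations.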

The whole training process looks like this:

-  Start a timer
-  Initialize optimizers and criterion
-  Create set of training pairs
-  Start empty losses array for plotting

Then we call ``train`` many times and occasionally print the progress (%
of examples, time so far, estimated time) and average loss.




In [59]:
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)
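
The accumulate-and-reset pattern used for ``print_loss_total`` and
``plot_loss_total`` can be isolated as follows. This is a sketch with made-up
loss values, and ``windowed_averages`` is a hypothetical helper name.

```python
def windowed_averages(losses, every):
    """Average losses over consecutive windows of `every` items."""
    averages = []
    total = 0.0
    for i, loss in enumerate(losses, start=1):
        total += loss
        if i % every == 0:
            averages.append(total / every)
            total = 0.0  # reset for the next window
    return averages


print(windowed_averages([4.0, 2.0, 3.0, 1.0, 5.0], every=2))  # [3.0, 2.0]
```

A trailing partial window (the final ``5.0`` here) is never reported, which
matches what happens in ``trainIters`` when ``n_iters`` is not a multiple of
``print_every`` or ``plot_every``.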

Plotting results
----------------

Plotting is done with matplotlib, using the array of loss values
``plot_losses`` saved while training.




In [60]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    # plt.subplots() already creates the figure, so no separate plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)

Evaluation
==========

Evaluation is mostly the same as training, but there are no targets so
we simply feed the decoder's predictions back to itself for each step.
Every time it predicts a word we add it to the output string, and if it
predicts the EOS token we stop there. We also store the decoder's
attention outputs for display later.




In [61]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]
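
Stripped of the network, the evaluation loop is greedy decoding: take the
highest-scoring token, feed it back in, and stop at EOS or after a fixed
number of steps. The sketch below uses a stand-in ``toy_step`` function in
place of the decoder; all names and the scoring rule are made up.

```python
EOS = 1


def greedy_decode(step, first_input, max_length):
    """Feed each prediction back in until EOS or max_length steps."""
    decoded = []
    token = first_input
    for _ in range(max_length):
        scores = step(token)  # one score per vocabulary index
        token = max(range(len(scores)), key=lambda i: scores[i])  # argmax
        decoded.append(token)
        if token == EOS:
            break
    return decoded


def toy_step(token):
    """Stand-in decoder: prefers token + 2, or EOS once that exceeds 4."""
    nxt = token + 2 if token + 2 <= 4 else EOS
    return [1.0 if i == nxt else 0.0 for i in range(5)]


print(greedy_decode(toy_step, first_input=0, max_length=10))  # [2, 4, 1]
```

The ``max_length`` bound matters: without it, a decoder that never predicts
EOS would loop forever.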

We can evaluate random sentences from the training set and print out the
input, target, and output to make some subjective quality judgements:




In [62]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

Training and Evaluating
=======================

With all these helper functions in place (it looks like extra work, but
it makes it easier to run multiple experiments) we can actually
initialize a network and start training.

The vocabulary here is very small (54 input words and 52 output words, as
counted above), so we can use a relatively small network of 256 hidden
nodes and a single GRU layer. After about 40 minutes on a MacBook CPU
we'll get some reasonable results.

.. Note::
   If you run this notebook you can train, interrupt the kernel,
   evaluate, and continue training later. Comment out the lines where the
   encoder and decoder are initialized and run ``trainIters`` again.




In [63]:
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

trainIters(encoder1, attn_decoder1, 75000, print_every=50)

0m 2s (- 71m 3s) (50 0%) 2.9451
0m 3s (- 49m 29s) (100 0%) 2.6636
0m 5s (- 42m 34s) (150 0%) 3.0684
0m 6s (- 39m 22s) (200 0%) 2.8410
0m 7s (- 37m 16s) (250 0%) 2.9088
0m 8s (- 36m 4s) (300 0%) 2.7883
0m 9s (- 35m 2s) (350 0%) 2.5754
0m 10s (- 34m 11s) (400 0%) 2.6211
0m 12s (- 33m 29s) (450 0%) 2.3578
0m 13s (- 33m 3s) (500 0%) 2.5245
0m 14s (- 32m 46s) (550 0%) 2.4896
0m 15s (- 32m 45s) (600 0%) 2.6542
0m 17s (- 32m 29s) (650 0%) 2.6394
0m 18s (- 32m 22s) (700 0%) 2.4809
0m 19s (- 32m 18s) (750 1%) 2.7272
0m 20s (- 32m 8s) (800 1%) 2.6463
0m 22s (- 32m 0s) (850 1%) 2.5437
0m 23s (- 31m 52s) (900 1%) 2.6954
0m 24s (- 31m 46s) (950 1%) 2.5506
0m 25s (- 31m 40s) (1000 1%) 2.5308
0m 27s (- 31m 44s) (1050 1%) 2.5245
0m 28s (- 31m 58s) (1100 1%) 2.6929
0m 29s (- 31m 51s) (1150 1%) 2.6096
0m 30s (- 31m 46s) (1200 1%) 2.6287
0m 32s (- 31m 47s) (1250 1%) 2.6386
0m 34s (- 32m 10s) (1300 1%) 2.5934
0m 36s (- 32m 49s) (1350 1%) 2.4100
0m 37s (- 32m 55s) (1400 1%) 2.4479
0m 38s (- 32m 57s) (1450 

5m 56s (- 32m 56s) (11450 15%) 0.2991
5m 57s (- 32m 54s) (11500 15%) 0.3199
5m 59s (- 32m 52s) (11550 15%) 0.2965
6m 0s (- 32m 51s) (11600 15%) 0.3098
6m 2s (- 32m 50s) (11650 15%) 0.3157
6m 4s (- 32m 49s) (11700 15%) 0.3073
6m 5s (- 32m 47s) (11750 15%) 0.2911
6m 7s (- 32m 46s) (11800 15%) 0.3037
6m 8s (- 32m 44s) (11850 15%) 0.2312
6m 10s (- 32m 43s) (11900 15%) 0.2807
6m 11s (- 32m 42s) (11950 15%) 0.2551
6m 13s (- 32m 40s) (12000 16%) 0.2504
6m 15s (- 32m 39s) (12050 16%) 0.2473
6m 16s (- 32m 37s) (12100 16%) 0.2821
6m 18s (- 32m 36s) (12150 16%) 0.3002
6m 19s (- 32m 34s) (12200 16%) 0.3085
6m 21s (- 32m 33s) (12250 16%) 0.2922
6m 22s (- 32m 32s) (12300 16%) 0.2606
6m 24s (- 32m 30s) (12350 16%) 0.3845
6m 26s (- 32m 28s) (12400 16%) 0.2678
6m 27s (- 32m 27s) (12450 16%) 0.3001
6m 29s (- 32m 25s) (12500 16%) 0.2560
6m 30s (- 32m 24s) (12550 16%) 0.2808
6m 32s (- 32m 22s) (12600 16%) 0.2492
6m 33s (- 32m 20s) (12650 16%) 0.2723
6m 35s (- 32m 19s) (12700 16%) 0.2881
6m 36s (- 32m 18s)

11m 39s (- 27m 38s) (22250 29%) 0.1619
11m 41s (- 27m 36s) (22300 29%) 0.1168
11m 42s (- 27m 35s) (22350 29%) 0.1230
11m 44s (- 27m 33s) (22400 29%) 0.1105
11m 45s (- 27m 31s) (22450 29%) 0.1380
11m 47s (- 27m 30s) (22500 30%) 0.1291
11m 48s (- 27m 28s) (22550 30%) 0.1151
11m 50s (- 27m 27s) (22600 30%) 0.1458
11m 52s (- 27m 26s) (22650 30%) 0.1498
11m 53s (- 27m 24s) (22700 30%) 0.1187
11m 55s (- 27m 22s) (22750 30%) 0.1107
11m 56s (- 27m 21s) (22800 30%) 0.1212
11m 58s (- 27m 19s) (22850 30%) 0.1527
12m 0s (- 27m 18s) (22900 30%) 0.1167
12m 1s (- 27m 16s) (22950 30%) 0.1018
12m 3s (- 27m 15s) (23000 30%) 0.1227
12m 4s (- 27m 13s) (23050 30%) 0.1391
12m 6s (- 27m 12s) (23100 30%) 0.1124
12m 8s (- 27m 10s) (23150 30%) 0.1142
12m 9s (- 27m 9s) (23200 30%) 0.1094
12m 11s (- 27m 7s) (23250 31%) 0.1158
12m 12s (- 27m 6s) (23300 31%) 0.1207
12m 14s (- 27m 4s) (23350 31%) 0.1118
12m 16s (- 27m 3s) (23400 31%) 0.1257
12m 17s (- 27m 1s) (23450 31%) 0.0738
12m 19s (- 26m 59s) (23500 31%) 0.1256

17m 6s (- 21m 53s) (32900 43%) 0.0344
17m 8s (- 21m 52s) (32950 43%) 0.0557
17m 9s (- 21m 50s) (33000 44%) 0.0783
17m 10s (- 21m 48s) (33050 44%) 0.0846
17m 12s (- 21m 46s) (33100 44%) 0.0612
17m 13s (- 21m 45s) (33150 44%) 0.0683
17m 15s (- 21m 43s) (33200 44%) 0.0598
17m 16s (- 21m 41s) (33250 44%) 0.0711
17m 17s (- 21m 39s) (33300 44%) 0.0709
17m 19s (- 21m 38s) (33350 44%) 0.1000
17m 20s (- 21m 36s) (33400 44%) 0.0660
17m 22s (- 21m 34s) (33450 44%) 0.0805
17m 23s (- 21m 32s) (33500 44%) 0.0538
17m 24s (- 21m 30s) (33550 44%) 0.0660
17m 26s (- 21m 28s) (33600 44%) 0.0696
17m 27s (- 21m 27s) (33650 44%) 0.0521
17m 28s (- 21m 25s) (33700 44%) 0.0687
17m 30s (- 21m 23s) (33750 45%) 0.0767
17m 31s (- 21m 21s) (33800 45%) 0.0841
17m 33s (- 21m 20s) (33850 45%) 0.0600
17m 34s (- 21m 18s) (33900 45%) 0.0537
17m 36s (- 21m 16s) (33950 45%) 0.0472
17m 37s (- 21m 15s) (34000 45%) 0.0529
17m 38s (- 21m 13s) (34050 45%) 0.0627
17m 40s (- 21m 11s) (34100 45%) 0.0499
17m 41s (- 21m 9s) (34150 45

22m 28s (- 16m 16s) (43500 57%) 0.0458
22m 29s (- 16m 14s) (43550 58%) 0.0492
22m 31s (- 16m 13s) (43600 58%) 0.0341
22m 33s (- 16m 11s) (43650 58%) 0.0514
22m 34s (- 16m 10s) (43700 58%) 0.0334
22m 36s (- 16m 8s) (43750 58%) 0.0308
22m 37s (- 16m 7s) (43800 58%) 0.0289
22m 39s (- 16m 5s) (43850 58%) 0.0311
22m 41s (- 16m 4s) (43900 58%) 0.0432
22m 42s (- 16m 2s) (43950 58%) 0.0375
22m 44s (- 16m 1s) (44000 58%) 0.0531
22m 45s (- 15m 59s) (44050 58%) 0.0565
22m 47s (- 15m 58s) (44100 58%) 0.0339
22m 49s (- 15m 56s) (44150 58%) 0.0437
22m 50s (- 15m 55s) (44200 58%) 0.0696
22m 52s (- 15m 53s) (44250 59%) 0.0412
22m 54s (- 15m 52s) (44300 59%) 0.0464
22m 55s (- 15m 50s) (44350 59%) 0.0484
22m 57s (- 15m 49s) (44400 59%) 0.0246
22m 58s (- 15m 47s) (44450 59%) 0.0352
23m 0s (- 15m 46s) (44500 59%) 0.0282
23m 2s (- 15m 44s) (44550 59%) 0.0306
23m 3s (- 15m 43s) (44600 59%) 0.0196
23m 5s (- 15m 41s) (44650 59%) 0.0313
23m 6s (- 15m 40s) (44700 59%) 0.0242
23m 8s (- 15m 38s) (44750 59%) 0.036

28m 10s (- 10m 53s) (54100 72%) 0.0233
28m 12s (- 10m 51s) (54150 72%) 0.0233
28m 13s (- 10m 50s) (54200 72%) 0.0262
28m 15s (- 10m 48s) (54250 72%) 0.0191
28m 16s (- 10m 46s) (54300 72%) 0.0234
28m 18s (- 10m 45s) (54350 72%) 0.0259
28m 20s (- 10m 43s) (54400 72%) 0.0306
28m 21s (- 10m 42s) (54450 72%) 0.0204
28m 23s (- 10m 40s) (54500 72%) 0.0327
28m 24s (- 10m 39s) (54550 72%) 0.0252
28m 26s (- 10m 37s) (54600 72%) 0.0433
28m 28s (- 10m 36s) (54650 72%) 0.0286
28m 29s (- 10m 34s) (54700 72%) 0.0112
28m 31s (- 10m 32s) (54750 73%) 0.0118
28m 32s (- 10m 31s) (54800 73%) 0.0149
28m 34s (- 10m 29s) (54850 73%) 0.0204
28m 36s (- 10m 28s) (54900 73%) 0.0196
28m 37s (- 10m 26s) (54950 73%) 0.0223
28m 39s (- 10m 25s) (55000 73%) 0.0186
28m 41s (- 10m 23s) (55050 73%) 0.0256
28m 42s (- 10m 22s) (55100 73%) 0.0180
28m 44s (- 10m 20s) (55150 73%) 0.0165
28m 45s (- 10m 19s) (55200 73%) 0.0208
28m 47s (- 10m 17s) (55250 73%) 0.0147
28m 49s (- 10m 15s) (55300 73%) 0.0194
28m 50s (- 10m 14s) (5535

33m 57s (- 5m 15s) (64950 86%) 0.0127
33m 58s (- 5m 13s) (65000 86%) 0.0227
34m 0s (- 5m 12s) (65050 86%) 0.0162
34m 2s (- 5m 10s) (65100 86%) 0.0138
34m 3s (- 5m 8s) (65150 86%) 0.0138
34m 5s (- 5m 7s) (65200 86%) 0.0110
34m 6s (- 5m 5s) (65250 87%) 0.0086
34m 8s (- 5m 4s) (65300 87%) 0.0151
34m 10s (- 5m 2s) (65350 87%) 0.0151
34m 11s (- 5m 1s) (65400 87%) 0.0094
34m 13s (- 4m 59s) (65450 87%) 0.0115
34m 14s (- 4m 58s) (65500 87%) 0.0137
34m 16s (- 4m 56s) (65550 87%) 0.0120
34m 17s (- 4m 54s) (65600 87%) 0.0149
34m 19s (- 4m 53s) (65650 87%) 0.0196
34m 21s (- 4m 51s) (65700 87%) 0.0110
34m 22s (- 4m 50s) (65750 87%) 0.0150
34m 24s (- 4m 48s) (65800 87%) 0.0166
34m 25s (- 4m 47s) (65850 87%) 0.0136
34m 27s (- 4m 45s) (65900 87%) 0.0265
34m 29s (- 4m 43s) (65950 87%) 0.0153
34m 30s (- 4m 42s) (66000 88%) 0.0189
34m 32s (- 4m 40s) (66050 88%) 0.0324
34m 34s (- 4m 39s) (66100 88%) 0.0181
34m 35s (- 4m 37s) (66150 88%) 0.0114
34m 37s (- 4m 36s) (66200 88%) 0.0168
34m 39s (- 4m 34s) (6625

In [64]:
evaluateRandomly(encoder1, attn_decoder1)

> some rabbit behind some bird could live ident
= some rabbit behind some bird could live
< some rabbit behind some bird could live <EOS>

> the dog that the seals can live would live quest
= would the dog that the seals can live live
< would the dog that the seals can live live <EOS>

> my monkeys that my monkeys will giggle will giggle quest
= will my monkeys that my monkeys will giggle giggle
< will my monkeys that my monkeys will giggle giggle <EOS>

> some unicorns who some birds could live could live quest
= could some unicorns who some birds could live live
< could some unicorns who some birds could live live <EOS>

> my monkeys will impress the seals that will impress the elephants quest
= will my monkeys impress the seals that will impress the elephants
< will my monkeys impress the seals that will impress the elephants <EOS>

> the cat around our monkey would irritate the cat quest
= would the cat around our monkey irritate the cat
< would the cat around our monkey irritate t

Visualizing Attention
---------------------

A useful property of the attention mechanism is its highly interpretable
outputs. Because it is used to weight specific encoder outputs of the
input sequence, we can imagine looking where the network is focused most
at each time step.

You could simply run ``plt.matshow(attentions)`` to see attention output
displayed as a matrix, with the columns being input steps and rows being
output steps:




In [66]:
output_words, attentions = evaluate(
    encoder1, attn_decoder1, "her rabbit who would irritate our rabbit could smile ident")
plt.matshow(attentions.numpy())

<matplotlib.image.AxesImage at 0x7f2cb05112b0>

In [67]:
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize model
model = TheModelClass()

# Initialize optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

# Print optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])

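# NOTE: `model` is the untrained demo CNN defined in this cell,
# not the trained encoder1 / attn_decoder1 networks.
torch.save(model, "no_agreement.pt")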

Model's state_dict:
conv1.weight 	 torch.Size([6, 3, 5, 5])
conv1.bias 	 torch.Size([6])
conv2.weight 	 torch.Size([16, 6, 5, 5])
conv2.bias 	 torch.Size([16])
fc1.weight 	 torch.Size([120, 400])
fc1.bias 	 torch.Size([120])
fc2.weight 	 torch.Size([84, 120])
fc2.bias 	 torch.Size([84])
fc3.weight 	 torch.Size([10, 84])
fc3.bias 	 torch.Size([10])
Optimizer's state_dict:
param_groups 	 [{'weight_decay': 0, 'params': [139830040301072, 139830040299920, 139830040299776, 139830040297832, 139830040298840, 139830040300064, 139830042299488, 139830039709160, 139830039709232, 139830039709304], 'lr': 0.001, 'dampening': 0, 'momentum': 0.9, 'nesterov': False}]
state 	 {}


For a better viewing experience we will do the extra work of adding axes
and labels:




In [68]:
def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()
    fig.savefig(input_sentence+'.png')


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("your cat who will admire your cat could smile quest")



input = your cat who will admire your cat could smile quest
output = could your cat who will admire your cat smile <EOS>




Exercises
=========

-  Try with a different dataset

   -  Another language pair
   -  Human → Machine (e.g. IOT commands)
   -  Chat → Response
   -  Question → Answer

-  Replace the embeddings with pre-trained word embeddings such as word2vec or
   GloVe
-  Try with more layers, more hidden units, and more sentences. Compare
   the training time and results.
-  If you use a translation file where pairs have two of the same phrase
   (``I am test \t I am test``), you can use this as an autoencoder. Try
   this:

   -  Train as an autoencoder
   -  Save only the Encoder network
   -  Train a new Decoder for translation from there



