# Fundamentals of Deep Learning
## Contents
- Chapter 7. Models for Sequence Analysis
    - Analyzing Variable-Length Inputs
    - Tackling seq2seq with Neural N-Grams
    - Implementing a Part-of-Speech Tagger
    - Dependency Parsing and SyntaxNet
    - Beam Search and Global Normalization
    - A Case for Stateful Deep Learning Models
    - Recurrent Neural Networks
    - Long Short-Term Memory (LSTM) Units
    - Implementing a Sentiment Analysis Model
    - Solving seq2seq Tasks with Recurrent Neural Networks
    - Augmenting Recurrent Networks with Attention


## Analyzing Variable-Length Inputs
In Figure 7-1, we illustrate how our feed-forward neural networks break when analyzing sequences. If the sequence is the same size as the input layer, the model can perform as we expect it to. It’s even possible to deal with smaller inputs by **padding zeros to the end of the input until it’s the appropriate length**. However, the moment the input exceeds the size of the input layer, naively using the feedforward network no longer works.

![7-1](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0701.png)

Figure 7-1. Feed-forward networks thrive on fixed input size problems. Zero padding can address the handling of smaller inputs, but when naively utilized, these models break when inputs exceed the fixed input size. 
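
As a minimal sketch of the zero-padding idea, the helper below right-pads a short input up to the input layer's size and refuses inputs that are too long. The function name `pad_to_length` and the example values are illustrative assumptions, not part of the original text.

```python
def pad_to_length(sequence, target_length, pad_value=0.0):
    """Right-pad `sequence` with `pad_value` up to `target_length`."""
    if len(sequence) > target_length:
        # Padding cannot help here: the input exceeds the input layer.
        raise ValueError("sequence is longer than the input layer")
    return list(sequence) + [pad_value] * (target_length - len(sequence))


padded = pad_to_length([0.3, 0.7, 0.1], 5)
print(padded)  # [0.3, 0.7, 0.1, 0.0, 0.0]
```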

## Tackling seq2seq with Neural N-Grams
In this section, we’ll begin exploring a feed-forward neural network architecture that can process a body of text and produce a sequence of `part-of-speech (POS)` tags. An example of this is shown in Figure 7-2. 

![7-2](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0702.png)

Figure 7-2. An example of an accurate POS parse of an English sentence

We can predict each POS tag one at a time by using a fixed-length subsequence. In particular, we utilize the subsequence starting from the word of interest and extending n words into the past. This neural n-gram strategy is depicted in Figure 7-3.

![7-3](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0703.png)

Figure 7-3. Using a feed-forward network to perform seq2seq when we can ignore long-term dependencies

Specifically, when we predict the POS tag for the $i^{th}$ word in the input, we utilize the $(i-n+1)^{st}, (i-n+2)^{nd}, \ldots, i^{th}$ words as the input. We’ll refer to this subsequence as the `context window`.
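
The context window can be sketched in a few lines. The code below builds one window per word, ending at that word and reaching n - 1 words into the past. The `<PAD>` token used for positions before the start of the sentence is an assumption made for illustration; the text does not say how the first words are handled.

```python
def context_windows(words, n, pad_token="<PAD>"):
    """Return one n-word window per position, ending at that position."""
    # Prepend n - 1 placeholders so the first words also get full windows.
    padded = [pad_token] * (n - 1) + list(words)
    return [padded[i:i + n] for i in range(len(words))]


windows = context_windows(["The", "dog", "barked", "loudly"], 3)
for window in windows:
    print(window)
```

Each window would then be mapped to word vectors and concatenated before being fed to the network.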

## Implementing a Part-of-Speech Tagger
On a high level, the network consists of an input layer that leverages a 3-gram context window. We’ll utilize word embeddings that are 300-dimensional, resulting in a context window of size 900. The feed-forward network will have two hidden layers of size 512 neurons and 256 neurons, respectively. Finally, the output layer will be a softmax calculating the probability distribution of the POS tag output over a space of 44 possible tags.
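
To get a feel for the size of this network, we can count its weights. The sketch below assumes fully connected layers with one bias per neuron and does not count the pretrained embeddings; both are assumptions for illustration.

```python
# Input is a 3-gram window of 300-dimensional embeddings: 3 * 300 = 900.
layer_sizes = [3 * 300, 512, 256, 44]


def count_parameters(sizes):
    """Weights plus biases of consecutive fully connected layers."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


total_parameters = count_parameters(layer_sizes)
print(total_parameters)  # 603948
```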

The tricky part of building the POS tagger is in preparing the dataset. We’ll leverage pretrained word embeddings generated from Google News. It includes vectors for 3 million words and phrases and was trained on roughly 100 billion words. 

As mentioned, the pretrained gensim model contains three million words, far more than appear in our dataset. For the sake of efficiency, we’ll selectively cache word vectors for words in our dataset and discard everything else. To figure out which words we’d like to cache, let’s download the POS dataset from the CoNLL-2000 task.

```
!head data/pos.train.txt

Confidence NN
in IN
the DT
pound NN
is VBZ
widely RB
expected VBN
to TO
take VB
another DT
```


```
!head data/pos.test.txt

Rockwell NNP
International NNP
Corp. NNP
's POS
Tulsa NNP
unit NN
said VBD
it PRP
signed VBD
a DT
```
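
Given files in the two-column format shown above, collecting the words whose vectors we want to cache is straightforward. The sketch below works on a small in-memory sample rather than the real files, and the helper name `read_tagged_words` is hypothetical.

```python
def read_tagged_words(lines):
    """Parse 'word TAG' lines into (word, tag) pairs, skipping blank lines."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs


sample = ["Confidence NN", "in IN", "the DT", "pound NN", "", "the DT"]
pairs = read_tagged_words(sample)

# Only words in this set need a cached embedding.
vocabulary = sorted({word for word, _ in pairs})
tags = sorted({tag for _, tag in pairs})
print(vocabulary)
print(tags)
```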


```
!python feedforward_pos.py 3

LOADING PRETRAINED WORD2VEC MODEL...
Using a 3-gram model
```

Every epoch, we manually inspect the model by parsing the sentence: “The woman, after grabbing her umbrella, went to the bank to deposit her cash.” Within 100 epochs of training, the algorithm achieves over 96% accuracy and nearly perfectly parses the validation sentence (it makes the understandable mistake of confusing the possessive pronoun and personal pronoun tags for the first appearance of the word “her”). We’ll conclude this by including the visualizations of our model’s performance using TensorBoard in Figure 7-4.

![7-4](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0704.png)

Figure 7-4. TensorBoard visualization of our feed-forward POS tagging model

## Dependency Parsing and SyntaxNet
The idea behind building a dependency parse tree is to map the relationships between words in a sentence. Take, for example, the dependency in Figure 7-5.

![7-5](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0705.png)

Figure 7-5. An example of a dependency parse, which generates a tree of relationships between words in a sentence

One way to express a tree as a sequence is by linearizing it. Let’s consider the examples in Figure 7-6.

![7-6](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0706.png)

Figure 7-6. We linearize two example trees; the diagrams omit edge labels for the sake of visual clarity

Using this paradigm, we can take our example dependency parse and linearize it, as shown in Figure 7-7.

![7-7](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0707.png)

Figure 7-7. Linearization of the dependency parse tree example
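
To make linearization concrete, here is a small sketch that flattens a tree into a token sequence using parentheses to mark subtrees. The bracketing convention, the `(label, children)` tree encoding, and the example sentence are all assumptions for illustration; the figures may use a different scheme, and edge labels are omitted.

```python
def linearize(node):
    """Flatten a (label, children) tree into a bracketed token sequence."""
    label, children = node
    if not children:
        return [label]
    tokens = ["(", label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens


# "ate" is the root; "I" and "pizza" depend on it; "the" depends on "pizza".
tree = ("ate", [("I", []), ("pizza", [("the", [])])])
print(" ".join(linearize(tree)))  # ( ate I ( pizza the ) )
```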

One interpretation of this seq2seq problem would be to read the input sentence and produce a sequence of tokens as an output that represents the linearization of the input’s dependency parse.

To make the problem more approachable, we instead reconsider the dependency parsing task as finding a sequence of valid “actions” that generates the correct dependency parse. This technique, known as the `arc-standard system`, is illustrated in Figure 7-8:

![7-8](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0708.png)

Figure 7-8. At any step, we have three options: to shift a word from the buffer (blue) to the stack (green), to draw an arc from the right element to the left element (left arc), or to draw an arc from the left element to the right element (right arc)

At any step, we can take one of three possible classes of actions:

- SHIFT
    - Move a word from the buffer to the front of the stack.
- LEFT ARC
    - Combine the two elements at the front of the stack into a single unit where the root of the right element is the parent node and the root of the left element is the child node.
- RIGHT ARC
    - Combine the two elements at the front of the stack into a single unit where the root of the left element is the parent node and the root of the right element is the child node.

We finally terminate this process when the buffer is empty and the stack has one element in it (which represents the full dependency parse). 
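
The three actions and the termination condition can be sketched directly. In the code below the stack is a Python list whose last item is the front of the stack, arcs are unlabeled `(parent, child)` pairs, and the action names and example sentence are assumptions for illustration.

```python
def apply_action(stack, buffer, arcs, action):
    """Apply one arc-standard action in place; arcs holds (parent, child)."""
    if action == "SHIFT":
        stack.append(buffer.pop(0))
    elif action == "LEFT_ARC":
        child = stack.pop(-2)  # the left element becomes the child
        arcs.append((stack[-1], child))
    elif action == "RIGHT_ARC":
        child = stack.pop()  # the right element becomes the child
        arcs.append((stack[-1], child))
    else:
        raise ValueError("unknown action: %s" % action)


def parse(words, actions):
    """Run a sequence of actions and return the root word and the arcs."""
    stack, buffer, arcs = [], list(words), []
    for action in actions:
        apply_action(stack, buffer, arcs, action)
    # Termination: empty buffer and a single element left on the stack.
    if buffer or len(stack) != 1:
        raise ValueError("actions do not form a complete parse")
    return stack[0], arcs


root, arcs = parse(
    ["I", "ate", "pizza"],
    ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "RIGHT_ARC"],
)
print(root, arcs)  # ate [('ate', 'I'), ('ate', 'pizza')]
```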

![7-9](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0709.png)

Figure 7-9. A sequence of actions that results in the correct dependency parse; we omit labels

At every step, we vectorize the current configuration by extracting a large number of features that describe it (words in specific locations of the stack/buffer, specific children of the words in these locations, part-of-speech tags, etc.). At training time, we feed this vector into a feed-forward network and compare its prediction of the next action to the gold-standard decision made by a human linguist. To use this model in the wild, we can take the action that the network recommends, apply it to the configuration, and use this new configuration as the starting point for the next step (feature extraction, action prediction, and action application). This process is shown in Figure 7-10.

![7-10](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0710.png)

Figure 7-10. A neural framework for arc-standard dependency parsing 

Taken together, these ideas form the core for Google’s [SyntaxNet][1], the state-of-the-art open source implementation for dependency parsing.

## Beam Search and Global Normalization
Consider the following sentence: “The complex houses married and single soldiers and their families.” On a first read, the sentence is confusing. Most people interpret “complex” as an adjective, “houses” as a noun, and “married” as a past tense verb. This makes little semantic sense though, and starts to break down as the rest of the sentence is read. Instead, we realize that “complex” is a noun (as in a military complex) and that “houses” is a verb. In other words, the sentence implies that the military complex contains soldiers (who may be single or married) and their families. A **greedy version** of SyntaxNet would fail to correct the early parse mistake of considering “complex” as an adjective describing the “houses,” and therefore fail on the full version of the sentence.

To remedy this shortcoming, we utilize a strategy known as `beam search`, illustrated in Figure 7-11. We generally leverage beam searches in situations like SyntaxNet, where the output of our network at a particular step influences the inputs used in future steps. The basic idea behind beam search is that **instead of greedily selecting the most probable prediction at each step, we maintain a beam of the most likely hypotheses (up to a fixed beam size b) for the sequence of the first k actions and their associated probabilities.** Beam searching can be broken up into two major phases: `expansion` and `pruning`.

![7-11](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0711.png)

During the expansion step, we take each hypothesis and consider it as a possible input to SyntaxNet. Assume SyntaxNet produces a probability distribution over a space of |A| total actions. We then compute the probability of each of the b|A| possible hypotheses for the sequence of the first k+1 actions. Then, during the pruning step, we keep only the b hypotheses with the largest probabilities out of the b|A| total options.
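
The expansion and pruning phases can be sketched as follows. Here `score_fn` is a hypothetical stand-in for the network: it takes the actions chosen so far and returns a probability for each possible next action. The toy model is contrived so that the greedy choice at the first step leads to a worse overall sequence.

```python
import math


def beam_search(score_fn, actions, num_steps, beam_size):
    """Keep the `beam_size` most probable action sequences at each step."""
    beam = [((), 0.0)]  # (action sequence, log-probability)
    for _ in range(num_steps):
        # Expansion: extend every hypothesis by every possible action.
        candidates = []
        for prefix, log_prob in beam:
            probs = score_fn(prefix)
            for action in actions:
                candidates.append(
                    (prefix + (action,), log_prob + math.log(probs[action])))
        # Pruning: keep the b most probable of the b|A| candidates.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_size]
    return beam


def toy_model(prefix):
    """Hypothetical next-action distribution, for illustration only."""
    if prefix == ():
        return {"A": 0.6, "B": 0.4}
    if prefix == ("A",):
        return {"A": 0.5, "B": 0.5}
    return {"A": 0.9, "B": 0.1}


greedy = beam_search(toy_model, ["A", "B"], num_steps=2, beam_size=1)[0]
wide = beam_search(toy_model, ["A", "B"], num_steps=2, beam_size=2)[0]
print(greedy[0], math.exp(greedy[1]))
print(wide[0], math.exp(wide[1]))
```

With a beam of size 1 the search commits to "A" and ends with probability 0.3, while a beam of size 2 recovers the sequence ("B", "A") with probability 0.36.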

As [Andor et al. describe in 2016][2], this process of global normalization provides both strong theoretical guarantees and clear performance gains relative to local normalization in practice. In a locally normalized network, our network is tasked with selecting the best action given a configuration. The network outputs a score that is normalized using a softmax layer. This is meant to model a probability distribution over all possible actions, provided the actions performed thus far. Our loss function attempts to force the probability distribution to the ideal output (i.e., probability 1 for the correct action and 0 for all other actions). The cross-entropy loss does a spectacular job of ensuring this for us.

In a globally normalized network, our interpretation of the scores is slightly different. Rather than putting the scores through a softmax to generate a per-action probability distribution, we add up all the scores for a hypothesis action sequence. One way of ensuring that we select the correct hypothesis sequence is by computing this sum over all possible hypotheses and then applying a softmax layer to generate a probability distribution. We could theoretically use the same cross-entropy loss function as we used in the locally normalized network. The problem with this strategy, however, is that there is an intractably large number of possible hypothesis sequences. Even with a conservative total of 15 possible actions (1 shift and 7 labels for each of the left and right arcs), sequences of just 10 actions, one per word of an average 10-word sentence, already yield $15^{10} \approx 5.8 \times 10^{11}$ possible hypotheses, and a complete arc-standard parse needs roughly two actions per word.

To make this problem tractable, as shown in Figure 7-12, we apply a beam search, with a fixed beam size, until we either 1) reach the end of the sentence, or 2) the correct sequence of actions is no longer contained on the beam. We then construct a loss function that tries to push the “gold standard” action sequence (highlighted in blue) as high as possible on the beam by maximizing its score relative to the other hypotheses. 

![7-12](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0712.png)

Figure 7-12. We can make global normalization in SyntaxNet tractable by coupling training and beam search
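
A minimal sketch of such a loss, in the spirit of Andor et al., treats the summed scores of the hypotheses on the beam as logits of a softmax over whole sequences and penalizes the negative log-probability of the gold sequence. The function name and the assumption that the gold sequence is one of the scored hypotheses are illustrative, not taken from the SyntaxNet code.

```python
import math


def global_beam_loss(beam_scores, gold_index):
    """Negative log-probability of the gold sequence among beam hypotheses.

    beam_scores[j] is the sum of the unnormalized per-action scores of
    hypothesis j, so the softmax runs over sequences, not single actions.
    """
    max_score = max(beam_scores)  # subtract the max for numerical stability
    log_partition = max_score + math.log(
        sum(math.exp(score - max_score) for score in beam_scores))
    return log_partition - beam_scores[gold_index]


loss = global_beam_loss([2.0, 1.0, 0.0], gold_index=0)
print(loss)
```

Raising the gold sequence's score relative to the other hypotheses lowers this loss, which is exactly the "push it up the beam" behavior described above.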

## A Case for Stateful Deep Learning Models
Sometimes, however, the task is far more complicated than finding a one-to-one mapping between input and output sequences. For example, we might want to develop a model that can consume an entire input sequence at once and then conclude if the sentiment of the entire input was positive or negative. We may want an algorithm that consumes a complex input (such as an image) and generates a sentence, one word at a time, describing the input. We may even want to translate sentences from one language to another (e.g., from English to French). In all of these instances, there’s no obvious mapping between input tokens and output tokens. Instead, the process is more like the situation in Figure 7-13.

![7-13](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0713.png)

Figure 7-13. The ideal model for sequence analysis can store information in memory over long periods of time, leading to a coherent “thought” vector that it can use to generate an answer

The idea is simple. We want our model to maintain some sort of memory over the span of reading the input sequence. As it reads the input, the model should be able to modify this memory bank, taking into account the information that it observes. By the time it has reached the end of the input sequence, the internal memory contains a “thought” that represents the key pieces of information, that is, the meaning, of the original input. We should then, as shown in Figure 7-13, be able to use this thought vector to either produce a label for the original sequence or produce an appropriate output sequence (translation, description, abstractive summary, etc.).

## Recurrent Neural Networks
![7-14](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0714.png)

Figure 7-14. A recurrent layer contains recurrent connections, that is to say, connections between neurons that are located in the same layer

It turns out that, given a fixed lifetime (say t time steps) of an RNN instance, we can actually express the instance as a feed-forward network (albeit irregularly structured).

We perform the transformation by taking the neurons of the single recurrent layer and replicating them t times, once for each time step. We similarly replicate the neurons of the input and output layers.

![7-15](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0715.png)

Figure 7-15. We can run an RNN through time to express it as a feedforward network that we can train using backpropagation

## The Challenges with Vanishing Gradients
Let’s start our investigation by considering the simplest possible RNN, shown in Figure 7-16, with a single input neuron, a single output neuron, and a fully connected recurrent layer with one neuron.

![7-16](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0716.png)

Figure 7-16. A single neuron, fully connected recurrent layer (both compressed and unrolled) for the sake of investigating gradient-based learning algorithms 

Given nonlinearity f, we can express the activation $h^{(t)}$ of the hidden neuron of the recurrent layer at time step t as follows, where $i^{(t)}$ is the incoming logit from the input neuron at time step t:

$h^{(t)}=f(w_{in}^{(t)}i^{(t)}+w_{rec}^{(t-1)}h^{(t-1)})$

Let’s try to compute how the activation of the hidden neuron changes in response to changes to the input logit from k time steps in the past. In analyzing this component of the backpropagation gradient expressions, we can start to quantify how much “memory” is retained from past inputs. We start by taking the partial derivative and apply the chain rule:

$\frac{\partial h^{(t)}}{\partial i^{(t-k)}}=f'(w_{in}^{(t)}i^{(t)}+w_{rec}^{(t-1)}h^{(t-1)})\frac{\partial}{\partial i^{(t-k)}}(w_{in}^{(t)}i^{(t)}+w_{rec}^{(t-1)}h^{(t-1)})$

Because the values of the input and recurrent weights are independent of the input logit at time step t - k, we can further simplify this expression:

$\frac{\partial h^{(t)}}{\partial i^{(t-k)}}=f'(w_{in}^{(t)}i^{(t)}+w_{rec}^{(t-1)}h^{(t-1)})w_{rec}^{(t-1)}\frac{\partial h^{(t-1)}}{\partial i^{(t-k)}}$

We also know that for all common nonlinearities (the tanh, logistic, and ReLU nonlinearities), the maximum value of |f'| is at most 1. This leads to the following recursive inequality: 

$|\frac{\partial h^{(t)}}{\partial i^{(t-k)}}| \le |w_{rec}^{(t-1)}| \cdot |\frac{\partial h^{(t-1)}}{\partial i^{(t-k)}}|$

We can continue to expand this inequality recursively until we reach the base case, at step t-k:

$|\frac{\partial h^{(t)}}{\partial i^{(t-k)}}| \le |w_{rec}^{(t-1)}| \cdots |w_{rec}^{(t-k)}| \cdot |\frac{\partial h^{(t-k)}}{\partial i^{(t-k)}}|$

We can evaluate this partial derivative similarly to how we proceeded previously:

$h^{(t-k)}=f(w_{in}^{(t-k)}i^{(t-k)}+w_{rec}^{(t-k-1)}h^{(t-k-1)})$

$\frac{\partial h^{(t-k)}}{\partial i^{(t-k)}}=f'(w_{in}^{(t-k)}i^{(t-k)}+w_{rec}^{(t-k-1)}h^{(t-k-1)})\frac{\partial}{\partial i^{(t-k)}}(w_{in}^{(t-k)}i^{(t-k)}+w_{rec}^{(t-k-1)}h^{(t-k-1)})$

In this expression, the hidden activation at time t - k - 1 is independent of the value of the input at t - k. Thus we can rewrite this expression as:

$\frac{\partial h^{(t-k)}}{\partial i^{(t-k)}}=f'(w_{in}^{(t-k)}i^{(t-k)}+w_{rec}^{(t-k-1)}h^{(t-k-1)})w_{in}^{(t-k)}$

Finally, taking the absolute value on both sides and again applying the observation about the maximum value of |f'|, we can write:

$|\frac{\partial h^{(t-k)}}{\partial i^{(t-k)}}| \le |w_{in}^{(t-k)}|$

This results in the final inequality (which we can simplify because we constrain the connections at different time steps to have equal value):

$|\frac{\partial h^{(t)}}{\partial i^{(t-k)}}| \le |w_{rec}^{(t-1)}| \cdots |w_{rec}^{(t-k)}| \cdot |w_{in}^{(t-k)}|=|w_{rec}|^k \cdot |w_{in}|$

This relationship places a strong upper bound on how much a change in the input at time t - k can impact the hidden state at time t. **Because the weights of our model are initialized to small values at the beginning of training (so that $|w_{rec}| < 1$), this bound shrinks exponentially, and the value of this derivative approaches zero as k increases.** This issue is commonly referred to as the problem of `vanishing gradients`.
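
We can check this bound numerically on the one-neuron RNN. The sketch below uses tanh as the nonlinearity and arbitrary illustrative values for the inputs and weights (all assumptions), computes the derivative with the chain rule from the derivation above, and prints it next to the bound $|w_{rec}|^k \cdot |w_{in}|$.

```python
import math


def input_sensitivity(inputs, w_in, w_rec, k):
    """Return |d h(t) / d i(t-k)| at the last time step t of a tanh RNN."""
    h = 0.0
    derivs = []  # f'(z) at each time step
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        derivs.append(1.0 - h * h)  # tanh'(z) = 1 - tanh(z)^2
    t = len(inputs) - 1
    grad = derivs[t - k] * w_in  # base case: d h(t-k) / d i(t-k)
    for step in range(t - k + 1, t + 1):  # k recurrent steps of chain rule
        grad *= derivs[step] * w_rec
    return abs(grad)


inputs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4, 0.9, -0.2, 0.7, 0.3, -0.5]
w_in, w_rec = 0.8, 0.5
for k in (0, 2, 5, 10):
    bound = abs(w_rec) ** k * abs(w_in)
    print(k, input_sensitivity(inputs, w_in, w_rec, k), bound)
```

The printed sensitivities never exceed the bound, and both fall off rapidly as k grows.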

## Long Short-Term Memory (LSTM) Units
In order to combat the problem of vanishing gradients, Sepp Hochreiter and Jürgen Schmidhuber introduced the long short-term memory (`LSTM`) architecture. The basic principle behind the architecture was that the network would be designed for the purpose of reliably transmitting important information many time steps into the future. The design considerations resulted in the architecture shown in Figure 7-17.

![7-17](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0717.png)

Figure 7-17. The architecture of an LSTM unit, illustrated at a tensor (designated by arrows) and operation (designated by the purple blocks) level

First, the unit must determine how much of the previous memory to keep. This is determined by the `keep gate`, shown in detail in Figure 7-18.

![7-18](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0718.png)

Figure 7-18. Architecture of the keep gate of an LSTM unit

The memory state tensor from the previous time step is rich with information, but some of that information may be stale (and therefore might need to be erased). We figure out which elements in the memory state tensor are still relevant and which elements are irrelevant by trying to compute a bit tensor (a tensor of zeros and ones) that we multiply with the previous state. If a particular location in the bit tensor holds a 1, it means that location in the memory cell is still relevant and ought to be kept. If that particular location instead holds a 0, it means that the location in the memory cell is no longer relevant and ought to be erased. We approximate this bit tensor by concatenating the input of this time step and the LSTM unit’s output from the previous time step and applying a sigmoid layer to the resulting tensor. A sigmoidal neuron, as you may recall, outputs a value that is either very close to 0 or very close to 1 most of the time (the only exception is when the input is close to zero). As a result, the output of the sigmoidal layer is a close approximation of a bit tensor, and we can use this to complete the keep gate.

Once we’ve decided what to keep from the old state and what to erase, we need to decide what information to write into the memory state. This part of the LSTM unit is known as the `write gate`, and it’s depicted in Figure 7-19. It breaks down into two major parts. The first component is figuring out what information we’d like to write into the state. This is computed by the tanh layer to create an intermediate tensor. The second component is figuring out which components of this computed tensor we actually want to include in the new state and which we want to toss before writing. We do this by approximating a bit vector of 0’s and 1’s using the same strategy (a sigmoidal layer) as we used in the keep gate. We multiply the bit vector with our intermediate tensor and then add the result to the output of the keep gate to create the new state vector for the LSTM.

![7-19](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0719.png)

Figure 7-19. Architecture of the write gate of an LSTM unit

Finally, at every time step, we’d like the LSTM unit to provide an output. The architecture of the output gate is shown in Figure 7-20. Its structure is nearly identical to that of the write gate: 1) the tanh layer creates an intermediate tensor from the state vector, 2) the sigmoid layer produces a bit tensor mask using the current input and previous output, and 3) the intermediate tensor is multiplied with the bit tensor to produce the final output.

![7-20](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0720.png)

Figure 7-20. Architecture of the output gate of an LSTM unit
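
Putting the keep, write, and output gates together, one time step of an LSTM can be sketched as below. To keep the arithmetic readable, this is a single-unit LSTM with scalar weights, so "concatenating the input and the previous output" reduces to a weighted sum of the two. The parameter layout and the example values are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM; params maps a name to (w_x, w_h, b)."""
    def layer(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    keep = layer("keep", sigmoid)  # soft bit: how much old memory survives
    candidate = layer("candidate", math.tanh)  # what we might write
    write = layer("write", sigmoid)  # soft bit: how much of it to write
    c = keep * c_prev + write * candidate  # new memory state
    out = layer("output", sigmoid)  # soft bit: how much of the state to emit
    h = out * math.tanh(c)  # new output
    return h, c


# Gates saturated so that the memory is kept and nothing new is written.
params = {
    "keep": (0.0, 0.0, 100.0),
    "write": (0.0, 0.0, -100.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 100.0),
}
h, c = lstm_step(x=0.5, h_prev=0.1, c_prev=0.7, params=params)
print(h, c)
```

With the keep gate near 1 and the write gate near 0, the memory state passes through the step essentially unchanged, which is the behavior that lets information survive many time steps.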

So why is this better than using a raw RNN unit? The unrolled architecture is shown in Figure 7-21. At the very top, we can observe the propagation of the state vector, whose interactions are primarily linear through time. The result is that the gradient that relates an input several time steps in the past to **the current output does not attenuate as dramatically as in the vanilla RNN architecture.** This means that **the LSTM can learn long-term relationships much more effectively than our original formulation of the RNN.**

![7-21](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0721.png)

Figure 7-21. Unrolling an LSTM unit through time

Just as we can stack RNN layers to create more expressive models with more capacity, we can similarly stack LSTM units, where the input of the second unit is the output of the first unit, the input of the third unit is the output of the second, and so on. An illustration of how this works is shown in Figure 7-22, with a multicellular architecture made of two LSTM units. This means that anywhere we use a vanilla RNN layer, we can easily substitute an LSTM unit.

![7-22](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0722.png)

Figure 7-22. Composing LSTM units as one might stack recurrent layers in a neural network

[1]: https://github.com/tensorflow/models/tree/master/research/syntaxnet
[2]: https://arxiv.org/abs/1603.06042

## Implementing a Sentiment Analysis Model
In this section, we attempt to analyze the sentiment of movie reviews taken from the Large Movie Review Dataset. This dataset consists of 50,000 reviews from IMDB, each of which is labeled as having positive or negative sentiment.

We use the **IMDBDataset** Python class ([read_imdb_data.py](read_imdb_data.py)) to serve both the training and validation sets we’ll use while training our sentiment analysis model.

First, we’ll want to map each word in the input review to a word vector. To do this, we’ll utilize an embedding layer, which is a simple lookup table that stores an embedding vector for each word.

[imdb_lstm.py](imdb_lstm.py):


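```py
import tensorflow as tf  # TensorFlow 1.x API


def embedding_layer(input, weight_shape):
    # weight_shape is [vocabulary_size, embedding_dim]
    weight_init = tf.random_normal_initializer(stddev=(
                    1.0/weight_shape[0])**0.5)
    # E is the lookup table: one trainable row per word ID
    E = tf.get_variable("E", weight_shape,
                        initializer=weight_init)
    # Word IDs arrive as floats from the placeholder, so cast before indexing
    incoming = tf.cast(input, tf.int32)
    embeddings = tf.nn.embedding_lookup(E, incoming)
    return embeddings
```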

We then take the result of the embedding layer and build an LSTM with dropout:

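```py
# Assumed import for `rnn` (TensorFlow 1.x); the excerpt does not show it
from tensorflow.contrib import rnn


def lstm(input, hidden_dim, keep_prob, phase_train):
    lstm = rnn.BasicLSTMCell(hidden_dim)
    # Dropout on the cell's inputs and outputs. phase_train is not used here,
    # so the same keep_prob also applies during evaluation.
    dropout_lstm = rnn.DropoutWrapper(lstm, input_keep_prob=keep_prob, output_keep_prob=keep_prob)
    lstm_outputs, state = tf.nn.dynamic_rnn(dropout_lstm, input, dtype=tf.float32)
    # Max over the time axis: [batch, time, hidden] -> [batch, hidden]
    return tf.reduce_max(lstm_outputs, reduction_indices=[1])
```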

Stringing all of these components together, we can build the inference graph:

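```py
def inference(input, phase_train):
    # 30,000-word vocabulary, 512-dimensional embeddings
    embedding = embedding_layer(input, [30000, 512])
    # 512 hidden units, keep probability 0.5 for dropout
    lstm_output = lstm(embedding, 512, 0.5, phase_train)
    # layer() is defined elsewhere; it maps the 512 features to 2 classes
    output = layer(lstm_output, [512, 2], [2], phase_train)
    return output
```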

```py
import tensorflow as tf
from lstm import LSTMCell
import read_imdb_data as data
from imdb_lstm import loss, training, evaluate, inference

training_epochs = 1 # 1000
batch_size = 32
display_step = 1


with tf.Graph().as_default():
    with tf.device('/gpu:0'):
        # x holds 500 word IDs per review; y holds a two-class label
        x = tf.placeholder("float", [None, 500])
        y = tf.placeholder("float", [None, 2])
        phase_train = tf.placeholder(tf.bool)

        output = inference(x, phase_train)

        cost, train_loss_summary_op, val_loss_summary_op = loss(output, y)

        global_step = tf.Variable(0, name='global_step', trainable=False)

        train_op = training(cost, global_step)

        eval_op, eval_summary_op = evaluate(output, y)

        saver = tf.train.Saver(max_to_keep=100)

        sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))

        summary_writer = tf.summary.FileWriter("imdb_lstm_logs/", graph=sess.graph)

        init_op = tf.global_variables_initializer()

        sess.run(init_op)

        for epoch in range(training_epochs):
            avg_cost = 0.
            total_batch = int(data.train.num_examples/batch_size)
            print("Total of %d minbatches in epoch %d" % (total_batch, epoch))
            # Loop over all batches
            for i in range(total_batch):
                minibatch_x, minibatch_y = data.train.minibatch(batch_size)
                # Fit training using batch data
                _, new_cost, train_summary = sess.run([train_op, cost, train_loss_summary_op], feed_dict={x: minibatch_x, y: minibatch_y, phase_train: True})
                summary_writer.add_summary(train_summary, sess.run(global_step))
                # Compute average loss
                avg_cost += new_cost/total_batch
                print("Training cost for batch %d in epoch %d was:" % (i, epoch), new_cost)
                # Every 100 minibatches, evaluate on the validation set and checkpoint
                if i % 100 == 0:
                    print("Epoch:", '%04d' % (epoch+1), "Minibatch:", '%04d' % (i+1), "cost =", "{:.9f}".format((avg_cost * total_batch)/(i+1)))
                    val_x, val_y = data.val.minibatch(data.val.num_examples)
                    val_accuracy, val_summary, val_loss_summary = sess.run([eval_op, eval_summary_op, val_loss_summary_op], feed_dict={x: val_x, y: val_y, phase_train: False})
                    summary_writer.add_summary(val_summary, sess.run(global_step))
                    summary_writer.add_summary(val_loss_summary, sess.run(global_step))
                    print("Validation Accuracy:", val_accuracy)

                    saver.save(sess, "imdb_lstm_logs/model-checkpoint-" + '%04d' % (epoch+1), global_step=global_step)

        print("Optimization Finished!")
```

```
Total of 703 minbatches in epoch 0
Training cost for batch 0 in epoch 0 was: 0.69461
Epoch: 0001 Minibatch: 0001 cost = 0.694610476
Validation Accuracy: 0.4952
Training cost for batch 1 in epoch 0 was: 0.69163
Training cost for batch 2 in epoch 0 was: 0.692238
Training cost for batch 3 in epoch 0 was: 0.690888
Training cost for batch 4 in epoch 0 was: 0.685791
Training cost for batch 5 in epoch 0 was: 0.693556
Training cost for batch 6 in epoch 0 was: 0.687123
Training cost for batch 7 in epoch 0 was: 0.680736
Training cost for batch 8 in epoch 0 was: 0.657541
Training cost for batch 9 in epoch 0 was: 0.639738
Training cost for batch 10 in epoch 0 was: 0.657558
Training cost for batch 11 in epoch 0 was: 0.627809
Training cost for batch 12 in epoch 0 was: 0.624833
Training cost for batch 13 in epoch 0 was: 0.67078
Training cost for batch 14 in epoch 0 was: 0.720676
Training cost for batch 15 in epoch 0 was: 0.60072
Training cost for batch 16 in epoch 0 was: 0.595772
Training cost for ba
```

## Solving seq2seq Tasks with Recurrent Neural Networks
A recurrent seq2seq model is composed of two separate networks. The first is known as the **encoder** network: a recurrent network (usually one that uses LSTM units) that consumes the entire input sequence. The goal of the encoder network is to generate a condensed understanding of the input and summarize it into a singular thought represented by the final state of the encoder network. Then we use a **decoder** network, whose starting state is initialized with the final state of the encoder network, to produce the target output sequence token by token.

![7-24](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0724.png)

Figure 7-24. Illustration of how we use an encoder/decoder recurrent network schema to tackle seq2seq problems 

In this setup, we are attempting to translate an English sentence into French. We tokenize the input sentence and feed its embeddings (as in the sentiment analysis model we built in the previous section) to the encoder network, one word at a time. At the end of the sentence, we use a special “end of sentence” (EOS) token to indicate the end of the input sequence to the encoder network. Then we take the hidden state of the encoder network and use that as the initialization of the decoder network. The first input to the decoder network is the EOS token, and the output is interpreted as the first word of the predicted French translation. From that point onward, we use the output of the decoder network as the input to itself at the next time step. We continue until the decoder network emits an EOS token as its output, at which point we know that the network has completed producing the translation of the original English sentence.
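
The control flow of this procedure can be sketched independently of any particular network. In the code below, `encoder_step` and `decoder_step` are hypothetical stand-ins for trained recurrent networks, and the toy versions simply memorize the input and look words up in a tiny lexicon; the `max_len` guard is also an added assumption.

```python
EOS = "<EOS>"


def translate(source_tokens, encoder_step, decoder_step, initial_state,
              max_len=50):
    """Greedy encoder/decoder loop over hypothetical step functions."""
    state = initial_state
    # The encoder consumes the whole input, followed by the EOS token.
    for token in list(source_tokens) + [EOS]:
        state = encoder_step(state, token)
    # The decoder starts from the encoder's final state and an EOS input,
    # then feeds each output back in as the next input.
    output, token = [], EOS
    for _ in range(max_len):  # guard in case EOS is never emitted
        token, state = decoder_step(state, token)
        if token == EOS:
            break
        output.append(token)
    return output


lexicon = {"I": "je", "eat": "mange"}


def toy_encoder_step(state, token):
    return state + (token,)  # the "memory" is just the tokens seen so far


def toy_decoder_step(state, previous_token):
    if state[0] == EOS:
        return EOS, state
    return lexicon[state[0]], state[1:]


translation = translate(["I", "eat"], toy_encoder_step, toy_decoder_step, ())
print(translation)  # ['je', 'mange']
```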

For example, Kiros et al. in 2015 invented the notion of a [skip-thought vector](https://papers.nips.cc/paper/5950-skip-thought-vectors). The skip-thought vector was generated by dividing up a passage into a set of triplets consisting of consecutive sentences. The authors utilized a single encoder network and two decoder networks:

![7-25](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0725.png)

Figure 7-25. The skip-thought seq2seq architecture to generate embedding representations of entire sentences

## Augmenting Recurrent Networks with Attention
![7-26](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0726.png)

Figure 7-26. An attempt at engineering attentional abilities in a seq2seq architecture. This attempt falls short because it fails to dynamically select the most relevant parts of the input to focus on. 

The key realization here is that it’s not enough to merely give the decoder access to all the outputs. Instead, we must engineer a mechanism by which the decoder network can dynamically pay **attention** to a specific subset of the encoder’s outputs.

![7-27](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0727.png)

Figure 7-27. A modification to our original proposal that enables a dynamic attentional mechanism based on the hidden state of the decoder network in the previous time step
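
One common way to realize such a mechanism is to score each encoder output against the decoder's previous hidden state, turn the scores into weights with a softmax, and pass the weighted sum to the decoder. The sketch below uses a dot product as the scoring function, which is an assumption for illustration; the figure does not specify how the scores are computed.

```python
import math


def attend(decoder_state, encoder_outputs):
    """Return (weights, context) using dot-product scores and a softmax."""
    scores = [sum(d * e for d, e in zip(decoder_state, output))
              for output in encoder_outputs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    # Context vector: attention-weighted sum of the encoder outputs.
    dim = len(encoder_outputs[0])
    context = [sum(w * output[j] for w, output in zip(weights, encoder_outputs))
               for j in range(dim)]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([2.0, 0.0], encoder_outputs)
print(weights)
print(context)
```

Because the weights depend on the decoder state, they change at every decoding step, which is what makes the attention dynamic.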