# Fundamentals of Deep Learning
## Table of Contents
- Chapter 7. Models for Sequence Analysis
    - Analyzing Variable-Length Inputs
    - Tackling seq2seq with Neural N-Grams
    - Implementing a Part-of-Speech Tagger
    - Dependency Parsing and SyntaxNet
    - Beam Search and Global Normalization
    - A Case for Stateful Deep Learning Models
    - Recurrent Neural Networks
    - Long Short-Term Memory (LSTM) Units

## Analyzing Variable-Length Inputs
Figure 7-1 illustrates how feed-forward neural networks break when analyzing sequences. If the sequence is the same size as the input layer, the model performs as we expect. It’s even possible to handle smaller inputs by **padding zeros to the end of the input until it’s the appropriate length**. However, the moment the input exceeds the size of the input layer, naively using the feed-forward network no longer works.

![7-1](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0701.png)

Figure 7-1. Feed-forward networks thrive on fixed input size problems. Zero padding can address the handling of smaller inputs, but when naively utilized, these models break when inputs exceed the fixed input size. 
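
The zero-padding idea can be sketched in a few lines of Python. This is a minimal illustration rather than code from the book: the helper name `pad_to_length` and its parameters are assumptions made for this example. It pads short inputs up to the input layer size and raises an error for inputs that are too long, which is exactly the case where padding cannot help.

```python
def pad_to_length(sequence, input_size, pad_value=0.0):
    """Right-pad `sequence` with `pad_value` until it has `input_size` entries.

    `pad_to_length` is a hypothetical helper written for this illustration.
    """
    if len(sequence) > input_size:
        # Padding cannot fix this case: the input is larger than the input layer.
        raise ValueError(
            f"sequence of length {len(sequence)} exceeds input size {input_size}"
        )
    return list(sequence) + [pad_value] * (input_size - len(sequence))


padded = pad_to_length([0.3, 0.7, 0.1], input_size=5)
print(padded)  # [0.3, 0.7, 0.1, 0.0, 0.0]
```

Truncating an over-long input would make the call succeed, but it silently discards information, which is why the sketch raises an error instead.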

## Tackling seq2seq with Neural N-Grams
In this section, we’ll begin exploring a feed-forward neural network architecture that can process a body of text and produce a sequence of `part-of-speech (POS)` tags. An example of this is shown in Figure 7-2. 

![7-2](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0702.png)

Figure 7-2. An example of an accurate POS parse of an English sentence

We can predict each POS tag one at a time by using a fixed-length subsequence. In particular, we use the subsequence that starts at the word of interest and extends $n - 1$ words into the past, for $n$ words in total. This neural n-gram strategy is depicted in Figure 7-3.

![7-3](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0703.png)

Figure 7-3. Using a feed-forward network to perform seq2seq when we can ignore long-term dependencies

Specifically, when we predict the POS tag for the $i^{\text{th}}$ word in the input, we use the $(i - n + 1)^{\text{st}}, (i - n + 2)^{\text{nd}}, \ldots, i^{\text{th}}$ words as the input. We’ll refer to this subsequence as the `context window`.
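
As a rough sketch of how context windows could be extracted, the function below returns one window of $n$ words per position, each ending at the word being tagged. The function name `context_windows` and the `<PAD>` token used for the first $n - 1$ positions are assumptions for illustration; the text does not specify how words near the start of the input are handled.

```python
def context_windows(words, n, pad_token="<PAD>"):
    """Return one n-word window per position i, ending at words[i].

    Left-padding with `pad_token` is an assumption made for this sketch,
    so that the first n - 1 words also receive full-size windows.
    """
    padded = [pad_token] * (n - 1) + list(words)
    return [padded[i:i + n] for i in range(len(words))]


sentence = ["Confidence", "in", "the", "pound"]
windows = context_windows(sentence, n=3)
for window in windows:
    print(window)
# ['<PAD>', '<PAD>', 'Confidence']
# ['<PAD>', 'Confidence', 'in']
# ['Confidence', 'in', 'the']
# ['in', 'the', 'pound']
```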

## Implementing a Part-of-Speech Tagger
At a high level, the network consists of an input layer that takes a 3-gram context window. We’ll use 300-dimensional word embeddings, so the concatenated context window has 900 dimensions. The feed-forward network will have two hidden layers of 512 and 256 neurons, respectively. Finally, the output layer will be a softmax that computes a probability distribution over a space of 44 possible POS tags.
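
To make the dimensions concrete, the sketch below works through the layer sizes and counts the parameters, assuming each layer is fully connected with a bias vector (an assumption; the description above only gives the layer widths). It also includes a plain-Python softmax to show what the output layer computes. The names `LAYER_SIZES`, `parameter_counts`, and `softmax` are illustrative.

```python
import math

EMBEDDING_DIM = 300
N_GRAM = 3
# Input (3 x 300 = 900), two hidden layers, and 44 output tags.
LAYER_SIZES = [N_GRAM * EMBEDDING_DIM, 512, 256, 44]


def parameter_counts(layer_sizes):
    """Weights plus biases per layer, assuming dense layers with biases."""
    return [
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    ]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


counts = parameter_counts(LAYER_SIZES)
print(LAYER_SIZES[0])  # 900
print(counts)          # [461312, 131328, 11308]
print(sum(counts))     # 603948

# With identical scores, every one of the 44 tags gets the same probability.
probs = softmax([0.0] * LAYER_SIZES[-1])
```

Under these assumptions, most of the parameters sit in the first layer, because it connects the 900-dimensional context window to 512 hidden units.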

The tricky part of building the POS tagger is preparing the dataset. We’ll leverage pretrained word embeddings generated from Google News. The pretrained model includes vectors for 3 million words and phrases and was trained on roughly 100 billion words.

As we mentioned, the pretrained model, which we load through gensim, contains three million words, far more than appear in our dataset. For the sake of efficiency, we’ll selectively cache word vectors for the words in our dataset and discard everything else. To figure out which words we’d like to cache, let’s download the POS dataset from the CoNLL-2000 task.

    !head data/pos.train.txt

    Confidence NN
    in IN
    the DT
    pound NN
    is VBZ
    widely RB
    expected VBN
    to TO
    take VB
    another DT


    !head data/pos.test.txt

    Rockwell NNP
    International NNP
    Corp. NNP
    's POS
    Tulsa NNP
    unit NN
    said VBD
    it PRP
    signed VBD
    a DT
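
Given this one-pair-per-line format, the caching step could look roughly like the sketch below. It parses `word TAG` lines, collects the vocabulary, and keeps only the pretrained vectors for words that occur in the dataset. The helpers `read_tagged_words` and `cache_vectors` are hypothetical, and `toy_pretrained` is a tiny made-up stand-in for the real 300-dimensional pretrained model, used here so the example runs on its own.

```python
def read_tagged_words(lines):
    """Parse 'word TAG' lines into (word, tag) pairs.

    Lines that do not have exactly two whitespace-separated fields
    (for example, blank lines) are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def cache_vectors(vocabulary, pretrained):
    """Keep only the pretrained vectors whose word occurs in `vocabulary`."""
    return {word: pretrained[word] for word in vocabulary if word in pretrained}


sample_lines = ["Confidence NN", "in IN", "the DT", "pound NN", ""]
pairs = read_tagged_words(sample_lines)
vocabulary = {word for word, _ in pairs}

# Made-up 2-dimensional vectors standing in for the pretrained embeddings.
toy_pretrained = {
    "in": [0.1, 0.2],
    "the": [0.3, 0.4],
    "pound": [0.5, 0.6],
    "unused": [0.7, 0.8],  # not in the dataset, so it is discarded
}
cached = cache_vectors(vocabulary, toy_pretrained)
print(sorted(cached))  # ['in', 'pound', 'the']
```

Note that "Confidence" is missing from the toy table, so it gets no cached vector. A real pipeline needs some policy for such out-of-vocabulary words; this sketch simply leaves them out.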
