# [Sequence to Sequence Learning with Neural Networks](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)
<br><br>

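#### Paper overview
The main result is that, for the first time, a purely neural Sequence-to-Sequence model (a deep LSTM) outperformed a phrase-based statistical machine translation system (phrase-based SMT, PBSMT) on a large translation task.<br>
The paper also reports that reversing the input sentences improved performance, and that the LSTM translated long sentences well.<br>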

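#### Sequence-to-Sequence (Seq2Seq) model
A Sequence-to-Sequence (Seq2Seq) model takes a sequence as input and produces a sequence as output.
Because one RNN (Recurrent Neural Network) converts the input sequence into a vector (the Encoder) and a second RNN generates the output sequence from that vector (the Decoder), it is also called an Encoder-Decoder model. It can handle cases that a plain RNN could not, such as language pairs whose word order and sentence length differ.<br>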


Applications (natural language processing)
<br>using CNN, Attention, Transformer, VAE, GAN, Q-learning, Policy Gradient, or SeqGAN

* Translation
* Chatbot
* Caption generation (generating a description of an input image)
* Reading comprehension
* Question answering
* Headline generation

<br><br>
Diagrams of Seq2Seq as used for translation [3]
<img src="./paperPhoto/20171210145927.jpg">
<br>
<img src="./paperPhoto/seq2seq.jpg">

1. Convert each word into a vector representation.
1. Feed the vector representations into the RNN.
1. `<s>` is the signal to start decoding.

## Abstract

>Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier

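* DNNs (Deep Neural Networks) perform very well when large labeled training sets are available, but they cannot solve sequence-to-sequence problems (taking a sequence as input and producing a sequence as output).
* This paper therefore presents an end-to-end approach to sequence learning in which one deep LSTM (Long Short-Term Memory) maps the input sequence to a fixed-dimensional vector (Encoder) and another deep LSTM decodes the target sequence from that vector (Decoder).
* The main result is on an English-to-French translation task, evaluated with the BLEU (Bilingual Evaluation Understudy) score.
(Other evaluation methods include human evaluation by native translators and other automatic metrics such as the TER score.)
* The LSTM reached a BLEU score of 34.8, higher than the 33.3 of a phrase-based SMT system (PBSMT) on the same dataset.
* The LSTM learned phrase and sentence representations that are sensitive to word order and relatively invariant to active versus passive voice.
* Reversing the word order of the source sentences (but not the target sentences) markedly improved the LSTM's performance.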

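>**Statistical Machine Translation (SMT)**<br>
An approach that builds a translation system using only a large parallel corpus of source sentences and their translations, together with a statistical learning algorithm.<br>

>**BLEU (Bilingual Evaluation Understudy) score**<br>
$$\mathrm{BLEU} = \min\left(1, e^{1-r/c}\right) \exp\left(\sum_{n=1}^N w_n \log p_n\right)$$
>>r (Reference) is the length of the human translation<br>
>>c (Candidate) is the length of the machine translation<br>
>>$w_n$ are weights (typically uniform, $w_n = 1/N$)<br>
>>$p_n$ are the modified n-gram precisions<br><br>
>**brevity penalty**<br>
The first factor on the right-hand side penalizes the case where the Reference is longer than the Candidate, that is, where the Candidate is too short. When $c > r$ the factor is 1 and no penalty applies.<br><br>
**modified n-gram precisions**<br>
When computing precision, each word (n-gram) in the Reference can be matched only as many times as it occurs there; once used, it cannot be matched again.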
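
As a rough illustration of the two ingredients above, the following Python sketch computes clipped (modified) n-gram precisions and the brevity penalty for one candidate and one reference. It is a simplified sentence-level version with uniform weights $w_n = 1/N$ and no smoothing; the function names are invented for this example, and real evaluations (such as the multi-bleu.pl script used in the paper) work at corpus level.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a reference n-gram can be matched only
    as many times as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total > 0 else 0.0


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with a single reference.

    Assumptions of this sketch: uniform weights w_n = 1/max_n, no smoothing.
    """
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    c, r = len(candidate), len(reference)
    brevity_penalty = min(1.0, math.exp(1.0 - r / c))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, with the candidate `"the the the the the the the".split()` and the reference `"the cat is on the mat".split()`, the clipped unigram precision is 2/7, because "the" occurs only twice in the reference.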

## Introduction
<img src="./paperPhoto/Screen Shot 2019-03-19 at 17.25.55.png">

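* The model's input sentence is "ABC" and its output sentence is "WXYZ".
* The EOS (end-of-sentence) token is a special symbol that marks the end of a sentence.
* Reading the input sentence in reverse order introduces many short-term dependencies between source and target, which makes the optimization problem easier.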

>Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent performance on difficult problems such as speech recognition [13, 7] and visual object recognition [19, 6, 21, 20]. DNNs are powerful because they can perform arbitrary parallel computation for a modest number of steps. A surprising example of the power of DNNs is their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size [27]. So, while neural networks are related to conventional statistical models, they learn an intricate computation. Furthermore, large DNNs can be trained with supervised backpropagation whenever the labeled training set has enough information to specify the network's parameters. Thus, if there exists a parameter setting of a large DNN that achieves good results (for example, because humans can solve the task very rapidly), supervised backpropagation will find these parameters and solve the problem. Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful. Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. 
The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM's ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1). There have been a number of related attempts to address the general sequence to sequence learning problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18] who were the first to map the entire input sentence to vector, and is very similar to Cho et al. [5]. Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to focus on different parts of their input, and an elegant variant of this idea was successfully applied to machine translation by Bahdanau et al. [2]. The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, although it assumes a monotonic alignment between the inputs and the outputs [11].

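* DNNs are machine learning models that perform very well on difficult problems such as speech recognition and visual object recognition, and they can be trained with supervised backpropagation.
* However, DNNs can only be applied to problems whose inputs and outputs can be encoded as vectors of fixed dimensionality.
* Many important problems (speech recognition, machine translation, question answering) are sequential and are best expressed with sequences whose lengths are not known a priori, so DNNs cannot handle them well.
* This paper shows that the LSTM can solve general sequence to sequence learning problems.
* The first LSTM reads the input sequence one timestep at a time and encodes it into a single fixed-dimensional vector. The second LSTM then decodes the output sequence from that vector, again one timestep at a time.
* The LSTM was chosen because it can learn from data with long-range temporal dependencies, which matters here because of the considerable time lag between the inputs and their corresponding outputs.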

## The model

### RNN
>The Recurrent Neural Network (RNN) [31, 28] is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs $(x_1, \ldots, x_T)$, a standard RNN computes a sequence of outputs $(y_1, \ldots, y_T)$ by iterating the following equation:
$$ h_t = \mathrm{sigm}(W^{hx}x_t + W^{hh} h_{t-1})$$
$$ y_t = W^{yh} h_t $$
The RNN can easily map sequences to sequences whenever the alignment between the inputs the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and the output sequences have different lengths with complicated and non-monotonic relationships.
A simple strategy for general sequence learning is to map the input sequence to a fixed-sized vector using one RNN, and then to map the vector to the target sequence with another RNN (this approach has also been taken by Cho et al. [5]). While it could work in principle since the RNN is provided with all the relevant information, it would be difficult to train the RNNs due to the resulting long term dependencies [14, 4] (figure 1) [16, 15]. However, the Long Short-Term Memory (LSTM) [16] is known to learn problems with long range temporal dependencies, so an LSTM may succeed in this setting.<br>

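#### Recurrent neural networks (RNN)<br>
The following three statements describe the same thing.
* An RNN is a network architecture well suited to variable-length input sequences.<br>
* RNN is the generic name for neural networks with a recurrent (self-feedback) structure.<br>
* An RNN is a neural network (NN) for time-series data.<br>
(a natural generalization of feedforward neural networks to sequences)<br><br>
by iterating the following equation<br>
$$ h_t = \mathrm{sigm}(W^{hx}x_t + W^{hh} h_{t-1})$$
$$ y_t = W^{yh} h_t $$
>inputs: time-series data of length T, $(x_1, \ldots, x_T)$<br>
outputs: $(y_1, \ldots, y_T)$<br>
Note: the output layer uses the identity activation ($y_t$ is linear in $h_t$), while the hidden layer uses the sigmoid<br>
$W^{hx} \in \mathbb{R}^{|hidden| \times |input|}$: weights from the input layer to the hidden layer<br>
$W^{hh} \in \mathbb{R}^{|hidden| \times |hidden|}$: weights from the hidden layer to the hidden layer<br>
$W^{yh} \in \mathbb{R}^{|output| \times |hidden|}$: weights from the hidden layer to the output layer<br>
Note: no weights are given for a feedback path from the output layer back to the hidden layer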

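* On long time series, an RNN unrolled through time becomes a very deep network, with depth proportional to the sequence length.<br>
* As a result, vanishing gradients occur easily and information is not propagated well.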
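
The recurrence above can be sketched in a few lines of NumPy. This is only a minimal illustration of the two equations, not the paper's implementation: the initial state $h_0 = 0$ and the function names are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function, written sigm in the equations."""
    return 1.0 / (1.0 + np.exp(-z))


def rnn_forward(xs, W_hx, W_hh, W_yh):
    """Run a vanilla RNN over a sequence.

    xs:   array of shape (T, input_dim)
    W_hx: (hidden_dim, input_dim), W_hh: (hidden_dim, hidden_dim),
    W_yh: (output_dim, hidden_dim)
    Returns the outputs y_1..y_T as an array of shape (T, output_dim).
    """
    h = np.zeros(W_hh.shape[0])  # h_0 = 0 is an assumption of this sketch
    ys = []
    for x_t in xs:
        h = sigmoid(W_hx @ x_t + W_hh @ h)  # h_t = sigm(W^{hx} x_t + W^{hh} h_{t-1})
        ys.append(W_yh @ h)                 # y_t = W^{yh} h_t (identity output)
    return np.stack(ys)
```

Because the same `W_hh` is applied at every step, the unrolled computation is as deep as the sequence is long, which is where the vanishing-gradient problem mentioned above comes from.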

### LSTM
>The goal of the LSTM is to estimate the conditional probability $p(y_1, \ldots, y_{T'} | x_1, \ldots, x_T)$ where $(x_1, \ldots, x_T)$ is an input sequence and $y_1, \ldots, y_{T'}$ is its corresponding output sequence whose length $T'$ may differ from $T$. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation $v$ of the input sequence $(x_1, \ldots, x_T)$ given by the last hidden state of the LSTM, and then computing the probability of $y_1, \ldots, y_{T'}$ with a standard LSTM-LM formulation whose initial hidden state is set to the representation $v$ of $x_1, \ldots, x_T$:
$$ p(y_1,....,y_{T'}|x_1,....,x_T)=\prod_{t=1}^{T'}p(y_t|v,y_1,....,y_{t-1}) $$
In this equation, each p(yt|v, y1, . . . , yt−1) distribution is represented with a softmax over all the words in the vocabulary. We use the LSTM formulation from Graves [10]. Note that we require that each sentence ends with a special end-of-sentence symbol “<EOS>”, which enables the model to define a distribution over sequences of all possible lengths. The overall scheme is outlined in figure 1, where the shown LSTM computes the representation of “A”, “B”, “C”, “<EOS>” and then uses this representation to compute the probability of “W”, “X”, “Y”, “Z”, “<EOS>”.
Our actual models differ from the above description in three important ways. First, we used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence a,b,c to the sentence α,β,γ, the LSTM is asked to map c,b,a to α,β,γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β, and so on, a fact that makes it easy for SGD to “establish communication” between the input and the output. We found this simple data transformation to greatly boost the performance of the LSTM.<br>

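#### Gated recurrent neural networks<br>
Because an RNN is a neural network that is deep in the time direction, vanishing gradients make it hard to propagate errors back to time steps that are far apart.<br>
As a result, RNNs are poor at learning the parameters that represent long-range dependencies (long-term memory) and tend to learn only recent dependencies (short-term memory). Gated recurrent networks introduce gates so that the model can learn long-term and short-term memory in a balanced way.<br>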

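#### Long short-term memory (LSTM)<br>
* The LSTM is the most representative gated recurrent neural network.<br>
* The LSTM was designed so that learning remains possible even on fairly long time series.<br>

The conditional probability of a target-language sentence given a source-language sentence is written as
$$ p(y_1,....,y_{T'}|x_1,....,x_T)=\prod_{t=1}^{T'}p(y_t|v,y_1,....,y_{t-1}) $$
Each distribution over $y_t$ is a softmax over all the words in the vocabulary.<br>
* Each factor on the right-hand side is modeled as a probability distribution and learned from the training data.<br>
* The distribution conditioned on the outputs up to step $t-1$ is used to decide which word to emit as $y_t$.
* Repeating this step produces the complete translation of the input sentence.<br>
* In Figure 1, the LSTM computes the representation of A, B, C, `<EOS>` and then uses it to compute the probabilities of W, X, Y, Z, `<EOS>`.<br>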
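
To make the factorization concrete, here is a small sketch that turns per-step decoder scores into the log probability of a whole output sequence by summing the per-step log softmax values. The array layout and function names are assumptions for illustration; in the real model the scores at step $t$ come from the decoder LSTM conditioned on $v$ and $y_1, \ldots, y_{t-1}$.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def sequence_log_prob(step_logits, target_ids):
    """log p(y_1..y_T' | x) = sum over t of log p(y_t | v, y_1..y_{t-1}).

    step_logits: (T', vocab_size) scores, one row per output step
                 (hypothetical stand-in for the decoder's outputs).
    target_ids:  length-T' list of word indices, the last one being <EOS>.
    """
    probs = softmax(np.asarray(step_logits, dtype=float))
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(np.sum(np.log(picked)))
```

Working in log space turns the product over time steps into a sum, which avoids numerical underflow on long sentences.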


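#### Three ways the actual model differs from the description above

* Two different LSTMs are used: one for the input sequence and another for the output sequence.<br>
    This increases the number of model parameters at negligible computational cost and makes it natural to train on multiple language pairs simultaneously.
* A four-layer LSTM is used.<br>
    Deep LSTMs significantly outperformed shallow LSTMs.
* The word order of the input sentence is reversed.<br>
    Instead of mapping a, b, c → α, β, γ, the LSTM maps c, b, a → α, β, γ. This puts a close to α and b fairly close to β, which makes it easier for SGD to connect the input with the output, and the authors found that it greatly improved performance.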


## Experiments
>We applied our method to the WMT’14 English to French MT task in two ways. We used it to directly translate the input sentence without using a reference SMT system and we it to rescore the n-best lists of an SMT baseline. We report the accuracy of these translation methods, present sample translations, and visualize the resulting sentence representation.

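The method was applied to the WMT’14 English-to-French translation task in two ways:
1. Translating the input sentence directly, without using an SMT system.
1. Rescoring the n-best lists (the n highest-scoring candidate translations) produced by an SMT baseline.

The paper reports the accuracy of both methods, presents sample translations, and visualizes the resulting sentence representations.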
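
The second method, rescoring, can be pictured with the short sketch below. The paper says only that the baseline's score and the LSTM's log probability are averaged evenly; the function name, the data layout, and the assumption that both scores are on comparable scales are illustrative choices of this example.

```python
def rescore_nbest(hypotheses, smt_scores, lstm_log_probs):
    """Rerank an n-best list by the even average of two scores.

    hypotheses:     list of candidate translations from the SMT baseline
    smt_scores:     baseline score for each hypothesis (higher is better)
    lstm_log_probs: LSTM log probability of each hypothesis
    Returns (hypothesis, combined_score) pairs, best first.
    """
    combined = [0.5 * s + 0.5 * l for s, l in zip(smt_scores, lstm_log_probs)]
    order = sorted(range(len(hypotheses)), key=lambda i: combined[i], reverse=True)
    return [(hypotheses[i], combined[i]) for i in order]
```

With this scheme the LSTM never has to search: it only scores hypotheses that the SMT system already proposed.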

### Dataset details
>We used the WMT’14 English to French dataset. We trained our models on a subset of 12M sentences consisting of 348M French words and 304M English words, which is a clean “selected” subset from [29]. We chose this translation task and this specific training set subset because of the public availability of a tokenized training and test set together with 1000-best lists from the baseline SMT [29].
As typical neural language models rely on a vector representation for each word, we used a fixed vocabulary for both languages. We used 160,000 of the most frequent words for the source language and 80,000 of the most frequent words for the target language. Every out-of-vocabulary word was replaced with a special “UNK” token.

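* [WMT'14 English to French dataset](http://www.statmt.org/wmt14/translation-task.html)
* The training data is a subset of 12 million sentences, containing 348 million French words and 304 million English words.
* Because each word needs a vector representation, a fixed vocabulary was used for both languages.
* The 160,000 most frequent words were used for the source language (English) and the 80,000 most frequent words for the target language (French). Words outside the vocabulary were replaced with UNK (unknown).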
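
The vocabulary handling described above can be sketched as follows. The helper names are hypothetical and the tokenization is assumed to have been done already; the paper does not describe its preprocessing code.

```python
from collections import Counter


def build_vocab(sentences, max_size):
    """Keep the max_size most frequent words of a tokenized corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, _ in counts.most_common(max_size)}


def replace_oov(sentence, vocab, unk="UNK"):
    """Replace every out-of-vocabulary word with the UNK token."""
    return [word if word in vocab else unk for word in sentence]
```

In the paper's setting `max_size` would be 160,000 for the source side and 80,000 for the target side. Since the model can only emit UNK for words outside the target vocabulary, its BLEU score is penalized on such words.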


### Decoding and Rescoring
>The core of our experiments involved training a large deep LSTM on many sentence pairs. We trained it by maximizing the log probability of a correct translation T given the source sentence S, so the training objective is
$$ \frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T|S)$$
where $\mathcal{S}$ is the training set. Once training is complete, we produce translations by finding the most likely translation according to the LSTM:
$$ \hat{T} = \arg\max_{T} p(T|S) $$
We search for the most likely translation using a simple left-to-right beam search decoder which maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some translation. At each timestep we extend each partial hypothesis in the beam with every possible word in the vocabulary. This greatly increases the number of the hypotheses so we discard all but the B most likely hypotheses according to the model’s log probability. As soon as the “<EOS>” symbol is appended to a hypothesis, it is removed from the beam and is added to the set of complete hypotheses. While this decoder is approximate, it is simple to implement. Interestingly, our system performs well even with a beam size of 1, and a beam of size 2 provides most of the benefits of beam search (Table 1).
We also used the LSTM to rescore the 1000-best lists produced by the baseline system [29]. To rescore an n-best list, we computed the log probability of every hypothesis with our LSTM and took an even average with their score and the LSTM’s score.
    
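The LSTM is trained on a dataset of English-French sentence pairs by maximizing
$$ \frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T|S)$$
S: a sentence in the source language (English)<br>
T: a sentence in the target language (French)<br>
$\mathcal{S}$ (under the Σ): the training set<br>
<br>
Once training is complete, the translation $\hat{T}$ is the most likely one under the model.<br>
$$ \hat{T} = \arg\max_{T} p(T|S) $$

* The search uses a simple left-to-right beam search decoder that keeps only a small number B of partial hypotheses. The system performs well even with a beam size of 1, and a beam size of 2 provides most of the benefit.
* The LSTM was also used to rescore the 1000-best lists produced by the baseline SMT system: the LSTM's log probability of each hypothesis is averaged evenly with the baseline's score.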
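
A left-to-right beam search of the kind described above might look like the sketch below. The `next_log_probs` callback stands in for the trained LSTM and, like `toy_model`, is a hypothetical interface invented for this example; the paper does not give its decoder code.

```python
import math


def beam_search(next_log_probs, beam_size, eos="<EOS>", max_len=20):
    """Left-to-right beam search.

    next_log_probs(prefix) must return {token: log p(token | prefix)}.
    Returns completed hypotheses as (tokens, log_prob), best first.
    """
    beam = [([], 0.0)]  # partial hypotheses: (prefix, cumulative log prob)
    complete = []
    for _ in range(max_len):
        # Extend every partial hypothesis with every possible next word.
        candidates = []
        for prefix, score in beam:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + [token], score + log_p))
        # Discard all but the B most likely hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                complete.append((tokens, score))  # finished: leaves the beam
            else:
                beam.append((tokens, score))
        if not beam:
            break
    return sorted(complete, key=lambda item: item[1], reverse=True)


def toy_model(prefix):
    """Tiny hand-written distribution used only to exercise the search."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ["a"]:
        return {"x": math.log(0.55), "<EOS>": math.log(0.45)}
    return {"<EOS>": 0.0}
```

On this toy distribution a beam of size 1 (greedy decoding) commits to "a" and ends with probability 0.33, while a beam of size 2 also keeps "b" alive and finds the more likely "b `<EOS>`" with probability 0.4. The decoder is approximate: it can still miss the true arg max when the beam is too small.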

### Reversing the Source Sentences
>While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are not reversed). By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6.
While we do not have a complete explanation to this phenomenon, we believe that it is caused by the introduction of many short term dependencies to the dataset. Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large “minimal time lag” [17]. By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence, which in turn results in substantially improved overall performance.
Initially, we believed that reversing the input sentences would only lead to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts. However, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs trained on the raw source sentences (see sec. 3.7), which suggests that reversing the input sentences results in LSTMs with better memory utilization.

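The LSTM performed better when the source sentences were reversed.
* Only the word order of the source sentence is reversed; the target sentence is left as is.<br>
    Mapping c, b, a → α, β, γ instead of a, b, c → α, β, γ puts the first source words close to the first target words.
* Test perplexity dropped from 5.8 to 4.7, and test BLEU rose from 25.9 to 30.6.
* The authors do not have a complete explanation for this phenomenon.
* Normally each source word is far from its corresponding target word, so the problem has a large "minimal time lag". Reversing the source leaves the average distance between corresponding words unchanged, but the first few source words end up very close to the first few target words, which greatly reduces the minimal time lag.
* This makes it easier for backpropagation to "establish communication" between source and target, which improves overall performance.
* Initially the authors expected reversal to help only the early part of the target sentence and to hurt predictions in the later part. In fact, LSTMs trained on reversed sources also did much better on long sentences.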
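
The claim that reversal shortens the minimal time lag without changing the average distance can be checked with a small sketch. It assumes, purely for illustration, that source and target have the same length n, that word i aligns with word i, and that the `<EOS>` symbol is ignored; the function names are hypothetical.

```python
def reverse_source(pairs):
    """Reverse the word order of each source sentence; targets are unchanged."""
    return [(list(reversed(src)), list(tgt)) for src, tgt in pairs]


def alignment_distances(n, reversed_source):
    """Distance between source word i and target word i when the source
    sentence is concatenated with the target sentence.

    Simplifying assumptions: equal lengths and a monotone one-to-one alignment.
    """
    distances = []
    for i in range(n):
        src_pos = (n - 1 - i) if reversed_source else i  # position in the input
        tgt_pos = n + i                                  # target follows the source
        distances.append(tgt_pos - src_pos)
    return distances
```

For n = 3 the distances are [3, 3, 3] without reversal and [1, 3, 5] with it: the mean stays at 3, but the smallest lag drops from 3 to 1.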

### Training details
>We found that the LSTM models are fairly easy to train. We used deep LSTMs with 4 layers, with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. We found deep LSTMs to significantly outperform shallow LSTMs, where each additional layer reduced perplexity by nearly 10%, possibly due to their much larger hidden state. We used a naive softmax over 80,000 words at each output. The resulting LSTM has 380M parameters of which 64M are pure recurrent connections (32M for the “encoder” LSTM and 32M for the “decoder” LSTM). The complete training details are given below:
> * We initialized all of the LSTM’s parameters with the uniform distribution between -0.08 and 0.08
> * We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs, we begun halving the learning rate every half epoch. We trained our models for a total of 7.5 epochs.
> * We used batches of 128 sequences for the gradient and divided it the size of the batch (namely, 128).
> * Although LSTMs tend to not suffer from the vanishing gradient problem, they can have exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10,25] by scaling it when its norm exceeded a threshold. For each training batch, we compute $s = \lVert g \rVert_2$, where g is the gradient divided by 128. If s > 5, we set $g = \frac {5g} s$
> * Different sentences have different lengths. Most sentences are short (e.g., length 20-30) but some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen training sentences will have many short sentences and few long sentences, and as a result, much of the computation in the minibatch is wasted. To address this problem, we made sure that all sentences within a minibatch were roughly of the same length, which a 2x speedup.

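* A 4-layer LSTM with 1000 cells per layer and 1000-dimensional word embeddings was used.
* With an input vocabulary of 160,000 words and an output vocabulary of 80,000 words, the resulting model has 380M parameters.
* A naive softmax over the 80,000 output words was used.
<br>
* All LSTM parameters were initialized from the uniform distribution between -0.08 and 0.08.
* Training used SGD (stochastic gradient descent) without momentum, that is, without a $+\alpha \Delta w^t$ term. The learning rate was fixed at 0.7 for the first 5 epochs and then halved every half epoch, for a total of 7.5 epochs.
$$ w^{t+1}  \gets w^t - \eta \frac {\partial {E(w^t)}} {\partial w^t}$$
* The batch size was 128.
* LSTMs tend not to suffer from vanishing gradients, but their gradients can explode. The gradient norm was therefore constrained: with $s = \lVert g \rVert_2$, the gradient is rescaled whenever $s$ exceeds the threshold 5.
$$ \text{if } s > 5:\quad g \gets \frac{5g}{s}$$
* Most sentences are short (20-30 words) but a few are long (over 100 words), so a minibatch of randomly chosen sentences wastes much of its computation. Sentences within a minibatch were therefore chosen to have roughly the same length, which gave a 2x speedup.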
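
Three of the training details above are easy to express in code. The sketch below is a hedged reading of the description, not the authors' implementation: it assumes the first halving of the learning rate happens exactly at epoch 5.0, and it groups sentences by simply sorting on length, which the paper does not specify.

```python
import numpy as np


def clip_gradient(g, threshold=5.0):
    """Rescale g so that its L2 norm is at most threshold (g = 5g/s if s > 5)."""
    s = np.linalg.norm(g)
    return g * (threshold / s) if s > threshold else g


def learning_rate(epoch, base_lr=0.7, start_decay=5.0, interval=0.5):
    """Fixed rate for the first 5 epochs, then halved every half epoch.

    Assumption: the first halving is applied at epoch 5.0 itself.
    """
    if epoch < start_decay:
        return base_lr
    halvings = int((epoch - start_decay) // interval) + 1
    return base_lr * 0.5 ** halvings


def length_bucketed_batches(sentences, batch_size=128):
    """Group sentences of similar length into the same minibatch.

    Assumption: sorting by length is one simple way to achieve this.
    """
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

Clipping changes only the length of the gradient, never its direction, so an oversized step is shortened rather than redirected.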

### Parallelization
> A C++ implementation of deep LSTM with the configuration from the previous section on a single GPU processes a speed of approximately 1,700 words per second. This was too slow for our purposes, so we parallelized our model using an 8-GPU machine. Each layer of the LSTM was executed on a different GPU and communicated its activations to the next GPU (or layer) as soon as they were computed. Our models have 4 layers of LSTMs, each of which resides on a separate GPU. The remaining 4 GPUs were used to parallelize the softmax, so each GPU was responsible for multiplying by a 1000 × 20000 matrix. The resulting implementation achieved a speed of 6,300 (both English and French) words per second with a minibatch size of 128. Training took about a ten days with this implementation.

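* A C++ implementation of the deep LSTM on a single GPU processed about 1,700 words per second, which was too slow for the authors' purposes.
* The model was parallelized over 8 GPUs: each of the 4 LSTM layers ran on its own GPU, and the remaining 4 GPUs parallelized the softmax, each multiplying by a 1000 × 20000 matrix.
* The resulting implementation reached 6,300 words per second (English and French combined) with a minibatch size of 128, and training took about ten days.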
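
The softmax split can be illustrated on the CPU with NumPy: with 1000 hidden units and an 80,000-word output vocabulary, cutting the output matrix into 4 column blocks gives the 1000 × 20000 matrices mentioned above. The sketch below only mimics the arithmetic; how the authors combined the partial results across GPUs is not described, so the concatenation step is an assumption.

```python
import numpy as np


def sharded_softmax(h, W_out, num_shards=4):
    """Softmax over the vocabulary with the output matrix split column-wise.

    h:     (hidden_dim,) top-layer hidden state
    W_out: (hidden_dim, vocab_size) output projection
    Each shard stands in for the work done on one GPU.
    """
    shards = np.array_split(W_out, num_shards, axis=1)
    logits = np.concatenate([h @ shard for shard in shards])  # one matmul per shard
    logits -= logits.max()  # subtract the maximum for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()
```

The result is identical to an unsharded softmax; only the large matrix multiplication is divided up.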

### Experimental Results
>We used the cased BLEU score [24] to evaluate the quality of our translations. We computed our BLEU scores using multi-bleu.pl on the tokenized predictions and ground truth. This way of evaluating the BLEU score is consistent with [5] and [2], and reproduces the 33.3 score of [29]. However, if we evaluate the state of the art system of [9] (whose predictions can be downloaded from statmt.org\matrix) in this manner, we get 37.0, which is greater than the 35.8 reported by statmt.org\matrix.
The results are presented in tables 1 and 2. Our best results are obtained with an ensemble of LSTMs that differ in their random initializations and in the random order of minibatches. While the decoded translations of the LSTM ensemble do not beat the state of the art, it is the first time that a pure neural translation system outperforms a phrase-based SMT baseline on a large MT task by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0.5 BLEU points of the previous state of the art by rescoring the 1000-best list of the baseline system.

<img src="./paperPhoto/Screen Shot 2019-03-19 at 17.26.31.png">

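* BLEU was computed with the multi-bleu.pl script on tokenized predictions and references. Evaluated the same way, the state-of-the-art system [9] scores 37.0.
* The best results came from an ensemble of LSTMs that differ in their random initializations and in the random order of minibatches. Its direct translations scored 34.8 BLEU, and rescoring the baseline's 1000-best list gave 36.5, within 0.5 BLEU points of the previous state of the art.
* This is the first time a pure neural translation system outperformed a phrase-based SMT baseline on a large MT task.
> SMT (statistical machine translation): translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora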

### Performance on long sentences
>We were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in figure 3. Table 3 presents several examples of long sentences and their translations.

<img src="paperPhoto/Screen Shot 2019-03-19 at 17.27.01.png">
<img src="paperPhoto/Screen Shot 2019-03-19 at 17.27.13.png">

* The LSTM did surprisingly well on long sentences. Figure 3 shows this quantitatively, and Table 3 gives examples of long sentences and their translations.

### Model Analysis
<img src="paperPhoto/Screen Shot 2019-03-19 at 17.26.51.png">

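* Figure 2 shows the LSTM hidden states projected to two dimensions with PCA (principal component analysis).
* The representations are relatively insensitive to active versus passive voice, yet sensitive to word order, which a bag-of-words model cannot capture.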
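
A two-dimensional PCA projection like the one behind Figure 2 can be computed with a singular value decomposition. This is a generic sketch: the input would be one hidden-state vector per sentence, and the function name is an assumption, since the paper does not say which tool produced the figure.

```python
import numpy as np


def pca_2d(hidden_states):
    """Project row vectors onto their first two principal components.

    hidden_states: (num_sentences, hidden_dim) array, one row per sentence.
    Returns an array of shape (num_sentences, 2).
    """
    X = np.asarray(hidden_states, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T
```

Sentences whose hidden states land close together in this projection are ones the encoder represents similarly.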

## Related work
>There is a large body of work on applications of neural networks to machine translation. So far, the simplest and most effective way of applying an RNN-Language Model (RNNLM) [23] or a Feedforward Neural Network Language Model (NNLM) [3] to an MT task is by rescoring the n-best lists of a strong MT baseline [22], which reliably improves translation quality.
More recently, researchers have begun to look into ways of including information about the source language into the NNLM. Examples of this work include Auli et al. [1], who combine an NNLM with a topic model of the input sentence, which improves rescoring performance. Devlin et al. [8] followed a similar approach, but they incorporated their NNLM into the decoder of an MT system and used the decoder’s alignment information to provide the NNLM with the most useful words in the input sentence. Their approach was highly successful and it achieved large improvements over their baseline.
Our work is closely related to Kalchbrenner and Blunsom [18], who were the first to map the input sentence into a vector and then back to a sentence, although they map sentences to vectors using convolutional neural networks, which lose the ordering of the words. Similarly to this work, Cho et al. [5] used an LSTM-like RNN architecture to map sentences into vectors and back, although their primary focus was on integrating their neural network into an SMT system. Bahdanau et al. [2] also attempted direct translations with a neural network that used an attention mechanism to overcome the poor performance on long sentences experienced by Cho et al. [5] and achieved encouraging results. Likewise, Pouget-Abadie et al. [26] attempted to address the memory problem of Cho et al. [5] by translating pieces of the source sentence in way that produces smooth translations, which is similar to a phrase-based approach. We suspect that they could achieve similar improvements by simply training their networks on reversed source sentences.
End-to-end training is also the focus of Hermann et al. [12], whose model represents the inputs and outputs by feedforward networks, and map them to similar points in space. However, their approach cannot generate translations directly: to get a translation, they need to do a look up for closest vector in the pre-computed database of sentences, or to rescore a sentence.

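#### Related work: summary
* There is a large body of work applying neural networks to machine translation.
* So far the simplest and most effective approach has been to rescore the n-best lists of a strong MT baseline with an RNNLM (RNN-Language Model) or an NNLM (Feedforward Neural Network Language Model).
* More recently, researchers have begun to incorporate information about the source language into the NNLM.
* The work of Kalchbrenner and Blunsom is closely related: they were the first to map an input sentence into a vector and back into a sentence, although their convolutional encoder loses the ordering of the words.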
## Conclusion
> In this work, we showed that a large deep LSTM with a limited vocabulary can outperform a standard SMT-based system whose vocabulary is unlimited on a large-scale MT task. The success of our simple LSTM-based approach on MT suggests that it should do well on many other sequence learning problems, provided they have enough training data.
We were surprised by the extent of the improvement obtained by reversing the words in the source sentences. We conclude that it is important to find a problem encoding that has the greatest number of short term dependencies, as they make the learning problem much simpler. In particular, while we were unable to train a standard RNN on the non-reversed translation problem (shown in fig. 1), we believe that a standard RNN should be easily trainable when the source sentences are reversed (although we did not verify it experimentally).
We were also surprised by the ability of the LSTM to correctly translate very long sentences. We were initially convinced that the LSTM would fail on long sentences due to its limited memory, and other researchers reported poor performance on long sentences with a model similar to ours [5, 2, 26]. And yet, LSTMs trained on the reversed dataset had little difficulty translating long sentences.
Most importantly, we demonstrated that a simple, straightforward and a relatively unoptimized approach can outperform a mature SMT system, so further work will likely lead to even greater translation accuracies. These results suggest that our approach will likely do well on other challenging sequence to sequence problems.

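* This paper showed that a large deep LSTM can outperform a standard SMT-based system, so the approach works well for sequence to sequence tasks. Further work on this simple and relatively unoptimized method will likely improve translation accuracy, and with enough training data it should also apply to other sequence learning problems.
* The authors were surprised by how much reversing the source word order improved performance. The lesson is that choosing a problem encoding with many short-term dependencies makes the learning problem much simpler.
* They were also surprised that the LSTM translated long sentences correctly; they had initially expected it to fail on long sentences because of its limited memory.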

## References
[1] M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent
neural networks. In EMNLP, 2013.<br>
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.
arXiv preprint arXiv:1409.0473, 2014.<br>
[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. In Journal of
Machine Learning Research, pages 1137–1155, 2003.<br>
[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult.
IEEE Transactions on Neural Networks, 5(2):157–166, 1994.<br>
[5] K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Arxiv preprint arXiv:1406.1078,
2014.<br>
[6] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification.
In CVPR, 2012.<br>
[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large
vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing - Special
Issue on Deep Learning for Speech and Language Processing, 2012.<br>
[8] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robust neural network
joint models for statistical machine translation. In ACL, 2014.<br>
[9] Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. Edinburgh’s phrase-based machine
translation systems for wmt-14. In WMT, 2014.<br>
[10] A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850,
2013.<br>
[11] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling
unsegmented sequence data with recurrent neural networks. In ICML, 2006.<br>
[12] K. M. Hermann and P. Blunsom. Multilingual distributed representations without word alignment. In
ICLR, 2014.<br>
[13] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen,
T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE
Signal Processing Magazine, 2012.<br>
[14] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master’s thesis, Institut für Informatik, Technische Universität, München, 1991.<br>
[15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty
of learning long-term dependencies, 2001.<br>
[16]  [S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.4320&rep=rep1&type=pdf) <br>
[17] S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. 1997.<br>
[18] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.<br>
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural
networks. In NIPS, 2012.<br>
[20] Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, and A.Y. Ng. Building
high-level features using large scale unsupervised learning. In ICML, 2012.<br>
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 1998.<br>
[22] T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of
Technology, 2012.<br>
[23] T. Mikolov, M. Karafiat, L. Burget, J. Cernocký, and S. Khudanpur. Recurrent neural network based
language model. In INTERSPEECH, pages 1045–1048, 2010.<br>
[24] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: a method for automatic evaluation of machine
translation. In ACL, 2002.<br>
[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv
preprint arXiv:1211.5063, 2012.<br>
[26] J. Pouget-Abadie, D. Bahdanau, B. van Merrienboer, K. Cho, and Y. Bengio. Overcoming the
curse of sentence length for neural machine translation using automatic segmentation. arXiv preprint
arXiv:1409.1257, 2014.<br>
[27] A. Razborov. On small depth threshold circuits. In Proc. 3rd Scandinavian Workshop on Algorithm
Theory, 1992.<br>
[28] D. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors.
Nature, 323(6088):533–536, 1986.<br>
[29]  [H. Schwenk. University le mans.](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/) , 2014. [Online; accessed 03-September-2014].<br>
[30] M. Sundermeyer, R. Schluter, and H. Ney. LSTM neural networks for language modeling. In INTERSPEECH, 2010.<br>
[31] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of IEEE, 1990.<br>

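## Reference materials
1. [Sequence to Sequence Learning
with Neural Networks (original paper)](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)

1. [A new approach to introducing structural constraints into statistical translation (統計翻訳に構造制約を導入する新しいアプローチ, in Japanese)](http://www.japio.or.jp/00yearbook/files/2008book/08_5_05.pdf)

1. [The automatic evaluation metric BLEU (自動評価尺度BLEU, in Japanese)](http://www2.nict.go.jp/astrec-att/member/mutiyama/corpmt/4.pdf)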

1. [Translation with a Sequence to Sequence Network and Attention](https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb)

1. [Neural Machine Translation (seq2seq) Tutorial](https://github.com/tensorflow/nmt/blob/master/README.md)

