# Learn to calculate with seq2seq model

In this assignment, you will learn how to use neural networks to solve sequence-to-sequence prediction tasks. Seq2Seq models are very popular these days because they achieve great results in Machine Translation, Text Summarization, Conversational Modeling and more.

Using sequence-to-sequence modeling, you are going to build a calculator for evaluating arithmetic expressions by taking an equation as input to the neural network and producing the answer as its output.

The resulting solution will be based on state-of-the-art approaches for sequence-to-sequence learning, and you should be able to adapt it easily to other tasks. However, if you want to train your own machine translation system or intelligent chatbot, it would be useful to have access to compute resources such as a GPU, and to be patient, because training such systems is usually time-consuming.

### Libraries

For this task you will need the following libraries:
 - [TensorFlow](https://www.tensorflow.org) — an open-source software library for Machine Intelligence.
 - [scikit-learn](http://scikit-learn.org/stable/index.html) — a tool for data mining and data analysis.
 
If you have never worked with TensorFlow, you will probably want to read some tutorials while working on this assignment. For example, the [Neural Machine Translation](https://www.tensorflow.org/tutorials/seq2seq) tutorial deals with a very similar task and explains several of the concepts used here.

In [1]:
"""
    Use neural networks to solve sequence-to-sequence tasks.
    Sequence-to-sequence models can be applied to many problems,
        such as machine translation, text summarization and conversational modeling, and they achieve very good results.
    In this task, a seq2seq model is used to build a calculator for arithmetic expressions: an expression is the input to the network, and the computed result is its output.
"""

### Data

One benefit of this task is that you don't need to download any data: you will generate it yourself! We will use two operators (addition and subtraction) and work with non-negative integers in some range. Here are examples of correct inputs and outputs:

    Input: '1+2'
    Output: '3'
    
    Input: '0-99'
    Output: '-99'

*Note that there are no spaces between operators and operands.*


Now you need to implement the function *generate_equations*, which will be used to generate the data.

In [2]:
import random
import numpy as np

In [3]:
def generate_equations(allowed_operators, dataset_size, min_value, max_value):
    """Generates pairs of equations and solutions to them.
    
       Each equation has a form of two integers with an operator in between.
       Each solution is an integer with the result of the operation.
    
        allowed_operators: list of strings, allowed operators.
        dataset_size: an integer, number of equations to be generated.
        min_value: an integer, min value of each operand.
        max_value: an integer, max value of each operand.

        result: a list of tuples of strings (equation, solution).
    """
    sample = []
    for _ in range(dataset_size):
        ######################################
        ######### YOUR CODE HERE #############
        ######################################
        # Draw two operands and one operator uniformly at random.
        value1 = random.randint(min_value, max_value)
        value2 = random.randint(min_value, max_value)
        operator = random.choice(allowed_operators)
        if operator == '+':
            solution = value1 + value2
        elif operator == '-':
            solution = value1 - value2
        else:
            raise ValueError("Unsupported operator: {!r}".format(operator))
        equation = '{}{}{}'.format(value1, operator, value2)
        sample.append((equation, str(solution)))
    return sample

To check the correctness of your implementation, use the *test_generate_equations* function:

In [4]:
def test_generate_equations():
    allowed_operators = ['+', '-']
    dataset_size = 10
    for (input_, output_) in generate_equations(allowed_operators, dataset_size, 0, 100):
        if not (type(input_) is str and type(output_) is str):
            return "Both parts should be strings."
        if eval(input_) != int(output_):
            return "The (equation: {!r}, solution: {!r}) pair is incorrect.".format(input_, output_)
    return "Tests passed."

In [5]:
print(test_generate_equations())

Tests passed.


Finally, we are ready to generate the train and test data for the neural network:

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# Generate 100,000 samples.
allowed_operators = ['+', '-']
dataset_size = 100000
data = generate_equations(allowed_operators, dataset_size, min_value=0, max_value=9999)
# Split the dataset into training and test sets.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

## Prepare data for the neural network

The next stage of data preparation is creating mappings of the characters to their indices in some vocabulary. Since in our task we already know which symbols will appear in the inputs and outputs, generating the vocabulary is a simple step.

#### How to create dictionaries for other tasks

First of all, you need to understand what the basic unit of the sequence is in your task. In our case, we operate on symbols, so the basic unit is a symbol. The number of symbols is small, so we don't need to think about filtering or normalization steps. However, in other tasks the basic unit is often a word, and in this case the mapping would be *word $\to$ integer*. The number of words might be huge, so it would be reasonable to filter them, for example by frequency, keeping only the frequent ones. Other strategies that you should consider are data normalization (lowercasing, tokenization, the treatment of punctuation marks), separate vocabularies for input and output (e.g. for machine translation), and any specifics of the task.
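
For a word-level task, the filtering described above could look like the following minimal sketch. The helper name `build_word_vocab`, the `min_count` threshold and the `<pad>`/`<unk>` tokens are assumptions made for illustration; they are not part of this assignment.

```python
from collections import Counter

def build_word_vocab(sentences, min_count=2, specials=('<pad>', '<unk>')):
    """Maps frequent lowercased tokens to integer ids (hypothetical helper)."""
    counts = Counter(token for s in sentences for token in s.lower().split())
    # Keep tokens seen at least min_count times, most frequent first.
    kept = [token for token, n in counts.most_common() if n >= min_count]
    return {token: i for i, token in enumerate(list(specials) + kept)}

corpus = ["The cat sat", "the cat ran", "A dog ran"]
vocab = build_word_vocab(corpus)

# Tokens that were filtered out (or never seen) share the '<unk>' id.
unk = vocab['<unk>']
ids = [vocab.get(token, unk) for token in "the dog sat".split()]
print(vocab)
print(ids)
```

Whitespace splitting and lowercasing stand in for a real tokenizer here; a translation task would usually build one such vocabulary per language.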

In [8]:
"""
    Create a mapping from characters to indices; in our task we already know which symbols appear in the inputs and outputs.
    First, understand what the basic unit of the sequence is in the task. Here the basic unit is a character, and the total number of characters is small, so no filtering is needed.
    In other tasks the basic unit is usually a word, and the mapping is then word --> integer.
    The vocabulary can be very large, so it is usually filtered to keep only frequent words; data normalization (case, punctuation and so on) and separate vocabularies for input and output are also used.
"""

In [9]:
# enumerate yields (index, character) pairs, from which the dictionary is built.
word2id = {symbol:i for i, symbol in enumerate('^$#+-1234567890')}
id2word = {i:symbol for symbol, i in word2id.items()}

In [10]:
#print(word2id)
#print(len(word2id))
print(id2word)

{0: '^', 1: '$', 2: '#', 3: '+', 4: '-', 5: '1', 6: '2', 7: '3', 8: '4', 9: '5', 10: '6', 11: '7', 12: '8', 13: '9', 14: '0'}


#### Special symbols

In [11]:
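start_symbol = '^'       # marks the beginning of the decoding procedure
end_symbol = '$'         # marks the end of a string, for both input and output sequences
padding_symbol = '#'     # pads strings to a common length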

You may have noticed that we have added 3 special symbols: '^', '\$' and '#':
- The '^' symbol will be passed to the network to indicate the beginning of the decoding procedure. We will discuss this one in more detail later.
- The '\$' symbol will be used to indicate the *end of a string*, both for input and output sequences.
- The '#' symbol will be used as a *padding* character to make the lengths of all strings equal within one training batch.

Conventions for special symbols in encoder-decoder networks vary somewhat between authors, so don't be confused if you come across other variants in the tutorials you read.

#### Padding

When the vocabularies are ready, we need to be able to convert a sentence to a list of vocabulary indices and back. At the same time, let's take care of padding. We are going to preprocess each sequence from the input (and output ground truth) in such a way that:
- it has a predefined length *padded_len*
- it is truncated, or padded with the *padding symbol* '#', as needed to reach that length
- it *always* ends with the *end symbol* '$'

We will treat the original characters of the sequence **and the end symbol** as the valid part of the input. We will store *the actual length* of the sequence, which includes the end symbol, but does not include the padding symbols. 

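In [12]:
"""
    The original characters of the sequence together with the final end symbol are treated as the valid part of the input; this is also the actual length that is stored. It includes the end symbol but not the padding symbols.
    Each sentence must be converted to a list of vocabulary indices, and the reverse conversion must be possible too.
"""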

Now you need to implement the function *sentence_to_ids* that does the described job.

In [13]:
# Convert a sentence to a list of indices: truncate it if it exceeds the
# required length, and pad it up to that length if it is shorter.
def sentence_to_ids(sentence, word2id, padded_len):
    """ Converts a sequence of symbols to a padded sequence of their ids.
    
      sentence: a string, input/output sequence of symbols.
      word2id: a dict, a mapping from original symbols to ids.
      padded_len: an integer, a desirable length of the sequence.

      result: a tuple of (a list of ids, an actual length of sentence).
    """
    # Reserve one position for the end symbol '$', truncating the sentence if necessary.
    sent_ids = [word2id[symbol] for symbol in sentence[:padded_len - 1]]
    sent_ids.append(word2id[end_symbol])
    # The actual length counts the end symbol but not the padding.
    sent_len = len(sent_ids)
    # Pad with '#' up to padded_len.
    sent_ids += [word2id[padding_symbol]] * (padded_len - sent_len)
    
    return sent_ids, sent_len

Check that your implementation is correct:

In [14]:
def test_sentence_to_ids():
    sentences = [("123+123", 7), ("123+123", 8), ("123+123", 10)]
    expected_output = [([5, 6, 7, 3, 5, 6, 1], 7), 
                       ([5, 6, 7, 3, 5, 6, 7, 1], 8), 
                       ([5, 6, 7, 3, 5, 6, 7, 1, 2, 2], 8)] 
    for (sentence, padded_len), (sentence_ids, expected_length) in zip(sentences, expected_output):
        #print("sentence:", sentence)
        output, length = sentence_to_ids(sentence, word2id, padded_len)
        print(output)
        if output != sentence_ids:
            return("Conversion of '{}' for padded_len={} to {} is incorrect.".format(
                sentence, padded_len, output))
        if length != expected_length:
            return("Conversion of '{}' for padded_len={} has incorrect actual length {}.".format(
                sentence, padded_len, length))
    return("Tests passed.")

In [15]:
print(test_sentence_to_ids())

[5, 6, 7, 3, 5, 6, 1]
[5, 6, 7, 3, 5, 6, 7, 1]
[5, 6, 7, 3, 5, 6, 7, 1, 2, 2]
Tests passed.


We also need to be able to get back from indices to symbols:

In [16]:
# Convert indices back to a sentence.
def ids_to_sentence(ids, id2word):
    """ Converts a sequence of ids to a sequence of symbols.
    
          ids: a list, indices for the padded sequence.
          id2word:  a dict, a mapping from ids to original symbols.

          result: a list of symbols.
    """
 
    return [id2word[i] for i in ids] 
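
To read a prediction as a string, the decoded symbols can be joined and cut at the end symbol. The sketch below uses a hypothetical helper `ids_to_string`, which is not part of the assignment; the vocabulary is repeated so that the block runs on its own.

```python
# Vocabulary as defined earlier in this notebook.
word2id = {symbol: i for i, symbol in enumerate('^$#+-1234567890')}
id2word = {i: symbol for symbol, i in word2id.items()}

def ids_to_string(ids, id2word, end_symbol='$'):
    """Joins decoded symbols and drops everything from the end symbol on."""
    symbols = [id2word[i] for i in ids]
    if end_symbol in symbols:
        symbols = symbols[:symbols.index(end_symbol)]
    return ''.join(symbols)

padded_ids = [5, 6, 7, 3, 5, 6, 7, 1, 2, 2]   # ids of '123+123$##'
print(ids_to_string(padded_ids, id2word))     # prints 123+123
```

Cutting at the first end symbol also discards the padding, since padding only ever follows the end symbol.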

#### Generating batches

The final step of data preparation is a function that transforms a batch of sentences to a list of lists of indices. 

In [17]:
# sentences: the sequence of strings passed in.
# sentence_to_ids handles a single sentence at a time;
# batch_to_ids applies it to every sentence in a batch.
def batch_to_ids(sentences, word2id, max_len):
    """Prepares batches of indices. 
    
       Sequences are padded to match the longest sequence in the batch,
       if it's longer than max_len, then max_len is used instead.

        sentences: a list of strings, original sequences.
        word2id: a dict, a mapping from original symbols to ids.
        max_len: an integer, max len of sequences allowed.

        result: a list of lists of ids, a list of actual lengths.
    """
    # Padded length: the smaller of (longest sentence + 1) and max_len; the +1 accounts for the end symbol.
    max_len_in_batch = min(max(len(s) for s in sentences) + 1, max_len)
    batch_ids, batch_ids_len = [], []
    for sentence in sentences:
        ids, ids_len = sentence_to_ids(sentence, word2id, max_len_in_batch)
        batch_ids.append(ids)
        batch_ids_len.append(ids_len)
    return batch_ids, batch_ids_len

The function *generate_batches* will help to generate batches with defined size from given samples.

In [18]:
# Yields the samples in batches of batch_size.
def generate_batches(samples, batch_size=64):
    X, Y = [], []
    for i, (x, y) in enumerate(samples, 1):
        X.append(x)
        Y.append(y)
        if i % batch_size == 0:
            yield X, Y
            X, Y = [], []
    if X and Y:
        yield X, Y

To illustrate the result of the implemented functions, run the following cell:

In [19]:
# Here a single (equation, solution) pair is used as a two-sentence batch.
sentences = train_set[0]
ids, sent_lens = batch_to_ids(sentences, word2id, max_len=10)    # both X and Y are limited to length 10
print('Input:', sentences)
print("type(sentences):", type(sentences))
print('Ids: {}\nSentences lengths: {}'.format(ids, sent_lens))    # lengths of the valid part, including the end symbol

Input: ('8579-7887', '692')
type(sentences): <class 'tuple'>
Ids: [[12, 9, 11, 13, 4, 11, 12, 12, 11, 1], [10, 13, 6, 1, 2, 2, 2, 2, 2, 2]]
Sentences lengths: [10, 4]
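
The cell above pads an equation and its answer together because a single tuple is passed in. During training, inputs and outputs would more typically be padded as two separate batches. The following sketch shows that flow under stated assumptions: `to_ids` and `pad_batch` are compact stand-ins for *sentence_to_ids* and *batch_to_ids*, written here only so that the block is self-contained.

```python
word2id = {symbol: i for i, symbol in enumerate('^$#+-1234567890')}

def to_ids(sentence, padded_len):
    """Stand-in for sentence_to_ids: truncate, append '$', pad with '#'."""
    ids = [word2id[c] for c in sentence[:padded_len - 1]] + [word2id['$']]
    length = len(ids)
    return ids + [word2id['#']] * (padded_len - length), length

def pad_batch(sentences, max_len):
    """Stand-in for batch_to_ids: pad to the longest sentence in the batch."""
    padded_len = min(max(len(s) for s in sentences) + 1, max_len)
    pairs = [to_ids(s, padded_len) for s in sentences]
    return [p[0] for p in pairs], [p[1] for p in pairs]

samples = [('1+2', '3'), ('0-99', '-99'), ('8579-7887', '692')]
equations, solutions = zip(*samples)            # split the pairs into X and Y
X_ids, X_lens = pad_batch(equations, max_len=20)
Y_ids, Y_lens = pad_batch(solutions, max_len=20)
print(X_lens, Y_lens)
```

Each of the two batches gets its own padded width, which is why the placeholders for inputs and ground truth are declared with independent sequence dimensions.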


## Encoder-Decoder architecture

Encoder-Decoder is a successful architecture for Seq2Seq tasks with different lengths of input and output sequences. The main idea is to use two recurrent neural networks, where the first neural network *encodes* the input sequence into a real-valued vector and then the second neural network *decodes* this vector into the output sequence. While building the neural network, we will specify some particular characteristics of this architecture.

In [20]:
"""
    Encoder-Decoder is a successful architecture for Seq2Seq tasks in which the input and output sequences differ in length. The main idea is to use two recurrent neural networks:
    the first encodes the input sequence into a real-valued vector, and the second decodes this vector into the output sequence.
"""

In [21]:
import tensorflow as tf

Let us use TensorFlow building blocks to specify the network architecture.

In [22]:
class Seq2SeqModel(object):
    pass

First, we need to create [placeholders](https://www.tensorflow.org/api_guides/python/io_ops#Placeholders) to specify what data we are going to feed into the network at execution time. For this task we will need:
 - *input_batch*: sequences of sentences (shape [batch_size, max_sequence_len_in_batch]);
 - *input_batch_lengths*: lengths of the unpadded sequences (shape [batch_size]);
 - *ground_truth*: ground-truth sequences (shape [batch_size, max_sequence_len_in_batch]);
 - *ground_truth_lengths*: lengths of the unpadded ground-truth sequences (shape [batch_size]);
 - *dropout_ph*: dropout keep probability; this placeholder has a predefined default value of 1;
 - *learning_rate_ph*: learning rate.

In [23]:
"""
input_batch: the sequences of input sentences X
input_batch_lengths: the lengths of the unpadded sentences, i.e. the lengths of X
ground_truth: the sequences of results Y corresponding to the inputs X
ground_truth_lengths: the lengths of Y
dropout_ph: keep probability
learning_rate_ph: learning rate
"""

In [24]:

def declare_placeholders(self):
    """Specifies placeholders for the model."""
    
    # Placeholders for input and its actual lengths.
    # shape, dtype, name
    self.input_batch = tf.placeholder(shape=(None, None), dtype=tf.int32, name='input_batch')    #batch_size * sequence_length
    #print("type(self.input_batch):", type(self.input_batch))
    self.input_batch_lengths = tf.placeholder(shape=(None, ), dtype=tf.int32, name='input_batch_lengths')    #batch_size
    
    # Placeholders for groundtruth and its actual lengths.
    self.ground_truth = tf.placeholder(shape=(None, None), dtype=tf.int32, name="ground_truth")    #batch_size * sequence_length
    self.ground_truth_lengths = tf.placeholder(shape=(None,), dtype=tf.int32, name="ground_truth_lengths")    #batch_size
        
    # tf.cast(x, dtype, name=None) converts x to the given dtype.
    self.dropout_ph = tf.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])
    self.learning_rate_ph = tf.placeholder(dtype=tf.float32, shape=[])######### YOUR CODE HERE ############# 

In [25]:
Seq2SeqModel.__declare_placeholders = classmethod(declare_placeholders)

Now, let us specify the layers of the neural network. First, we need to prepare an embedding matrix. Since we use the same vocabulary for input and output, we need only one such matrix. For tasks with different vocabularies there would be multiple embedding layers.
- Create embeddings matrix with [tf.Variable](https://www.tensorflow.org/api_docs/python/tf/Variable). Specify its name, type (tf.float32), and initialize with random values.
- Perform [embeddings lookup](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) for a given input batch.
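
Conceptually, an embedding lookup is row selection from the embedding matrix. The NumPy sketch below illustrates the shapes involved; it is not the TensorFlow code for this assignment, and the `embeddings_size` of 4 is an arbitrary value chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embeddings_size = 15, 4             # embeddings_size is illustrative only
embeddings = rng.uniform(-1.0, 1.0, size=(vocab_size, embeddings_size))

input_batch = np.array([[5, 3, 6, 1],           # ids of '1+2$'
                        [14, 4, 13, 13]])       # ids of '0-99'
embedded = embeddings[input_batch]              # integer-array indexing selects rows
print(embedded.shape)                           # (batch_size, sequence_len, embeddings_size)
```

Every occurrence of the same id maps to the same row, so the two '9' symbols in the second sequence receive identical vectors.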

In [26]:
def create_embeddings(self, vocab_size, embeddings_size):
    """Specifies embeddings layer and embeds an input batch."""
     
    # Uniform random initialization in [-1, 1); vocab_size is 15 for this task.
    random_initializer = tf.random_uniform((vocab_size, embeddings_size), -1.0, 1.0)
    self.embeddings = tf.Variable(initial_value=random_initializer, dtype=tf.float32, name="embeddings")######### YOUR CODE HERE ############# 
    
    # Perform embeddings lookup for self.input_batch. 
    self.input_batch_embedded = tf.nn.embedding_lookup(self.embeddings, self.input_batch)######### YOUR CODE HERE ############# 

In [27]:
Seq2SeqModel.__create_embeddings = classmethod(create_embeddings)

#### Encoder

The first RNN of the current architecture is called an *encoder* and serves for encoding an input sequence to a real-valued vector. Input of this RNN is an embedded input batch. Since sentences in the same batch could have different actual lengths, we also provide input lengths to avoid unnecessary computations. The final encoder state will be passed to the second RNN (decoder), which we will create soon. 

- TensorFlow provides a number of [RNN cells](https://www.tensorflow.org/api_guides/python/contrib.rnn#Core_RNN_Cells_for_use_with_TensorFlow_s_core_RNN_methods) ready for use. We suggest that you use [GRU cell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/GRUCell), but you can also experiment with other types. 
- Wrap your cells with [DropoutWrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper). Dropout is an important regularization technique for neural networks. Specify input keep probability using the dropout placeholder that we created before.
- Combine the defined encoder cells with [Dynamic RNN](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn). Use the embedded input batches and their lengths here.
- Use *dtype=tf.float32* everywhere.
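
The effect of passing sequence lengths can be illustrated with a small NumPy sketch. It assumes a plain tanh RNN instead of a GRU, and the weight names `W_x` and `W_h` are made up for the example, so it is not the TensorFlow implementation. It only shows the idea that a row's state stops changing once its actual length is reached, which means padding cannot influence the final state.

```python
import numpy as np

def encode(embedded, lengths, W_x, W_h):
    """Runs a tanh RNN over [batch, time, emb] inputs, freezing finished rows."""
    batch_size, max_time, _ = embedded.shape
    state = np.zeros((batch_size, W_h.shape[0]))
    for t in range(max_time):
        candidate = np.tanh(embedded[:, t] @ W_x + state @ W_h)
        still_running = (t < lengths)[:, None]    # False once a row is past its length
        state = np.where(still_running, candidate, state)
    return state

rng = np.random.default_rng(0)
emb_size, hidden_size = 4, 3
W_x = rng.normal(size=(emb_size, hidden_size))
W_h = rng.normal(size=(hidden_size, hidden_size))

batch = rng.normal(size=(2, 5, emb_size))
lengths = np.array([5, 2])
final_state = encode(batch, lengths, W_x, W_h)

# Overwriting the padded positions of the second row leaves its state unchanged.
altered = batch.copy()
altered[1, 2:] = 99.0
assert np.allclose(final_state, encode(altered, lengths, W_x, W_h))
print(final_state.shape)
```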

In [28]:
"""
    The first RNN in this architecture is the encoder, which encodes the input sequence into a real-valued vector. Its input is the embedded input batch.
    Sentences in the same batch can have different actual lengths, so the input lengths are provided to avoid unnecessary computation. The final encoder state is passed to the second RNN.
"""

In [29]:
def build_encoder(self, hidden_size):
    """Specifies encoder architecture and computes its output."""
    
    # Create GRUCell with dropout applied to its inputs.
    encoder_cell = tf.contrib.rnn.DropoutWrapper(
        tf.contrib.rnn.GRUCell(hidden_size),
        input_keep_prob=self.dropout_ph)
    
    # Create RNN with the predefined cell and keep only the final state after encoding.
    _, self.final_encoder_state = tf.nn.dynamic_rnn(cell=encoder_cell, 
                                                    inputs=self.input_batch_embedded, 
                                                    sequence_length=self.input_batch_lengths, 
                                                    dtype=tf.float32)######### YOUR CODE HERE #############

In [30]:
Seq2SeqModel.__build_encoder = classmethod(build_encoder)

#### Decoder

The second RNN is called a *decoder* and serves for generating the output sequence. In the simple seq2seq architecture, the input sequence is provided to the decoder only as the final state of the encoder. This is an obvious bottleneck, and [attention techniques](https://www.tensorflow.org/tutorials/seq2seq#background_on_the_attention_mechanism) can help to overcome it. We do not need them to make our calculator work, but they would be a necessary ingredient for more advanced tasks.

During training, the decoder also uses information about the true output, which is fed in as input symbol by symbol. However, during the prediction stage (which is called *inference* in this architecture), the decoder can only feed its own generated output from the previous step in at the next step. Because of this difference (*training* vs *inference*), we will create two distinct instances, one for each scenario.
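
The difference between the two decoding modes can be sketched in NumPy. This is a toy illustration under stated assumptions: a tanh RNN with random weights replaces the GRU, and the names `step`, `decode_train` and `decode_greedy` are hypothetical, so it is not the TensorFlow decoder built in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_size, hidden_size = 15, 4, 6
START_ID, END_ID = 0, 1                                   # ids of '^' and '$'
E = rng.uniform(-1, 1, (vocab_size, emb_size))            # embedding matrix
W_x = rng.normal(size=(emb_size, hidden_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
W_out = rng.normal(size=(hidden_size, vocab_size))        # projection to logits

def step(prev_id, state):
    """One decoder step: embed the previous symbol, update state, emit logits."""
    state = np.tanh(E[prev_id] @ W_x + state @ W_h)
    return state @ W_out, state

def decode_train(encoder_state, ground_truth):
    """Training mode: the inputs are '^' followed by the true symbols."""
    state, logits = encoder_state, []
    for prev_id in [START_ID] + list(ground_truth[:-1]):
        out, state = step(prev_id, state)
        logits.append(out)
    return np.stack(logits)

def decode_greedy(encoder_state, max_iter):
    """Inference mode: each step is fed the argmax of the previous step."""
    state, prev_id, result = encoder_state, START_ID, []
    for _ in range(max_iter):
        out, state = step(prev_id, state)
        prev_id = int(np.argmax(out))
        result.append(prev_id)
        if prev_id == END_ID:
            break
    return result

encoder_state = rng.normal(size=hidden_size)
train_logits = decode_train(encoder_state, [7, 1])        # target '3$'
prediction = decode_greedy(encoder_state, max_iter=7)
print(train_logits.shape, prediction)
```

Both modes share the same weights and start from the same encoder state, so their first step is identical; they diverge only in what is fed in afterwards.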

The picture below illustrates the point. It also shows how the special characters are used, e.g. look at how the start symbol `^` is used. The transparent parts are ignored: in the decoder, they are masked out in the loss computation; in the encoder, the green state is considered final and passed to the decoder.

In [31]:
"""
    Only the final state of the encoder is decoded; the attention mechanism can help with this limitation.
    During training, the decoder can use the true output, which is simply fed in like input features. At test time, the decoder can only use its own output from the previous step.
    Because training and prediction differ, two separate instances are created, one for each of the two scenarios.
    Handling of the special character "$": in the encoder, nothing after "$" is computed, and the state at "$" is passed to the decoder.
"""

<img src="encoder-decoder-pic.png" style="width: 500px;">

Now, it's time to implement the decoder:
 - First, we should create two [helpers](https://www.tensorflow.org/api_guides/python/contrib.seq2seq#Dynamic_Decoding). These classes help to determine the behaviour of the decoder. During training, we will use [TrainingHelper](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/TrainingHelper). For inference, we recommend using [GreedyEmbeddingHelper](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/GreedyEmbeddingHelper).
 - To share all parameters during training and inference, we use one scope and set the flag 'reuse' to True at inference time. You might be interested to know more about how [variable scopes](https://www.tensorflow.org/programmers_guide/variables) work in TF.
 - To create the decoder itself, we will use the [BasicDecoder](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/BasicDecoder) class. As before, you should choose some RNN cell, e.g. a GRU cell. To turn hidden states into logits, we will need a projection layer. One simple solution is to use [OutputProjectionWrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/OutputProjectionWrapper).
 - For getting the predictions, it will be convenient to use [dynamic_decode](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/dynamic_decode). This function uses the provided decoder to perform decoding.

In [32]:
"""
helper:
    Provides the sampling method used during decoding, e.g. whether the decoder output is chosen by argmax or by
    sampling from a categorical distribution to obtain the output ids, during training and during inference.
GreedyEmbeddingHelper:
    Uses argmax to obtain the output id, which is passed through the embedding layer and used as the input at the next time step.
TrainingHelper:
    Uses plain argmax in its sampling step.
"""

In [33]:
def build_decoder(self, hidden_size, vocab_size, max_iter, start_symbol_id, end_symbol_id):
    """Specifies decoder architecture and computes the output.
    
        Uses different helpers:
          - for train: feeding ground truth
          - for inference: feeding generated output

        As a result, self.train_outputs and self.infer_outputs are created. 
        Each of them contains two fields:
          rnn_output (predicted logits)
          sample_id (predictions).

    """
    
    # Use start symbols as the decoder inputs at the first time step.
    batch_size = tf.shape(self.input_batch)[0]  # number of sequences in the batch, e.g. 128
    print("batch_size:", batch_size)
    # tf.fill(dims, value) creates a tensor filled with the given constant.
    start_tokens = tf.fill([batch_size], start_symbol_id)
    print("start_tokens:", start_tokens)
    print("start_symbol_id:", start_symbol_id)
    # tf.expand_dims(t, 1) adds a dimension at axis 1; the final argument 1 of tf.concat is the axis to concatenate along.
    ground_truth_as_input = tf.concat([tf.expand_dims(start_tokens, 1), self.ground_truth], 1)
    
    # Use the embedding layer defined before to lookup embedings for ground_truth_as_input. 
    self.ground_truth_embedded = tf.nn.embedding_lookup(self.embeddings, ground_truth_as_input)######### YOUR CODE HERE #############
     
    # Create TrainingHelper for the train stage.
    train_helper = tf.contrib.seq2seq.TrainingHelper(self.ground_truth_embedded, 
                                                     self.ground_truth_lengths)
    
    # Create GreedyEmbeddingHelper for the inference stage.
    # You should provide the embedding layer, start_tokens and index of the end symbol.
    ######### YOUR CODE HERE #############
    infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding=self.embeddings, 
                                                            start_tokens=start_tokens,
                                                            end_token=end_symbol_id)
  
    def decode(helper, scope, reuse=None):
        """Creates decoder and return the results of the decoding with a given helper."""
        
        with tf.variable_scope(scope, reuse=reuse):
            # Create GRUCell with dropout. Do not forget to set the reuse flag properly.
            # Dropout in an RNN differs from dropout in a CNN: the recurrent connection is left intact,
            # i.e. the state passed from step t-1 to step t is not dropped. Dropout is applied only to
            # the information passed between layers within the same time step.
            decoder_cell = tf.contrib.rnn.DropoutWrapper(
                tf.contrib.rnn.GRUCell(num_units=hidden_size, reuse=reuse),
                input_keep_prob=self.dropout_ph)
            
            # Create a projection wrapper.
            decoder_cell = tf.contrib.rnn.OutputProjectionWrapper(decoder_cell, vocab_size, reuse=reuse)
            #print("decoder_cell.shape:", decoder_cell.shape)
            
            # Create BasicDecoder, pass the defined cell, a helper, and initial state.
            # The initial state should be equal to the final state of the encoder!
            decoder = tf.contrib.seq2seq.BasicDecoder(cell=decoder_cell,
                                                     helper=helper,
                                                     initial_state=self.final_encoder_state)######### YOUR CODE HERE #############
            
            # The first value returned by dynamic_decode contains two fields:
            #   rnn_output (predicted logits), which is what the loss needs
            #   sample_id (predictions)
            outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder=decoder, 
                                                              maximum_iterations=max_iter, 
                                                              output_time_major=False, 
                                                              impute_finished=True)
            return outputs
        
    self.train_outputs = decode(train_helper, 'decode')
    self.infer_outputs = decode(infer_helper, 'decode', reuse=True)

In [34]:
Seq2SeqModel.__build_decoder = classmethod(build_decoder)

In this task we will use [sequence_loss](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss), which is a weighted cross-entropy loss for a sequence of logits. Take a moment to understand what your training logits and targets are. Also note that we do not want to take into account loss terms coming from padding symbols, so we will mask them out using weights.
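
The masking idea can be written out in a few lines of NumPy. The sketch assumes that the loss is averaged over all non-padding positions in the batch, which is taken here to match the default averaging of `sequence_loss`; the function name `masked_sequence_loss` is hypothetical.

```python
import numpy as np

def masked_sequence_loss(logits, targets, lengths):
    """Mean cross-entropy over the non-padding positions of a batch."""
    max_time = logits.shape[1]
    # Same idea as tf.sequence_mask: 1.0 where t < length, else 0.0.
    weights = (np.arange(max_time)[None, :] < lengths[:, None]).astype(float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each target symbol.
    rows, cols = np.indices(targets.shape)
    crossent = -log_probs[rows, cols, targets]
    return (crossent * weights).sum() / weights.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 15))
targets = np.array([[7, 1, 2, 2],       # '3$##'
                    [4, 13, 13, 1]])    # '-99$'
lengths = np.array([2, 4])
loss = masked_sequence_loss(logits, targets, lengths)
print(loss)
```

Because the two padded positions of the first row have weight zero, whatever the network predicts there has no effect on the loss or on the gradients.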

In [35]:
def compute_loss(self):
    """Computes sequence loss (masked cross-entropy loss with logits)."""
    
    weights = tf.cast(tf.sequence_mask(self.ground_truth_lengths), dtype=tf.float32)  # 1.0 for real target positions, 0.0 for padding
    
    ######### YOUR CODE HERE #############
    print("self.train_outputs.shape:", self.train_outputs)
    # tf.contrib.seq2seq.sequence_loss() is a weighted cross-entropy loss for sequences.
    print("self.train_outputs.rnn_output.shape:", self.train_outputs.rnn_output)
    print("self.ground_truth.shape:", self.ground_truth.shape)
    """
    with ops.name_scope(name, "sequence_loss", [logits, targets, weights]):
        num_classes = array_ops.shape(logits)[2]
        logits_flat = array_ops.reshape(logits, [-1, num_classes])
        targets = array_ops.reshape(targets, [-1])
        if softmax_loss_function is None:
            crossent = nn_ops.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits_flat)
    Judging from this excerpt, the first two dimensions of logits and targets must match.
    """
    # By default the cross-entropy is averaged using the weights.
    self.loss = tf.contrib.seq2seq.sequence_loss(
                                                 self.train_outputs.rnn_output, 
                                                 self.ground_truth, 
                                                 weights 
                                                 )
    

In [36]:
Seq2SeqModel.__compute_loss = classmethod(compute_loss)

The last thing to specify is the optimization of the defined loss. 
We suggest that you use [optimize_loss](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/optimize_loss) with Adam optimizer and a learning rate from the corresponding placeholder. You might also need to pass global step (e.g. as tf.train.get_global_step()) and clip gradients by 1.0.
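
Clipping gradients by 1.0 is assumed here to mean clipping by global norm, which is how a float `clip_gradients` value is understood to be interpreted by `optimize_loss`. The NumPy sketch below shows that operation on its own; `clip_by_global_norm` is a local illustrative function, not a call into TensorFlow.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescales all gradients together so their joint L2 norm is <= clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)    # 1.0 when already small enough
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]   # joint norm is 5
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm, clipped)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its magnitude is limited.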

In [37]:
def perform_optimization(self):
    """Specifies train_op that optimizes self.loss."""
    
    ######### YOUR CODE HERE #############
    self.train_op = tf.contrib.layers.optimize_loss(loss=self.loss, 
                                                    global_step=tf.train.get_global_step(), 
                                                    learning_rate=self.learning_rate_ph, 
                                                    optimizer="Adam", 
                                                    clip_gradients=1.0)
    

In [38]:
Seq2SeqModel.__perform_optimization = classmethod(perform_optimization)

Congratulations! You have specified all the parts of your network. You may have noticed that we haven't dealt with any real data yet, so what you have written is just a set of recipes for how the network should function.
Now we will put them into the constructor of our Seq2SeqModel class to use it in the next section.

In [39]:
"""
    All the parts of the network are now complete; the next section puts them into the Seq2Seq model.
"""

In [40]:
# This function mainly performs initialization.
def init_model(self, vocab_size, embeddings_size, hidden_size, 
               max_iter, start_symbol_id, end_symbol_id, padding_symbol_id):
    
    self.__declare_placeholders()
    self.__create_embeddings(vocab_size, embeddings_size)
    self.__build_encoder(hidden_size)
    self.__build_decoder(hidden_size, vocab_size, max_iter, start_symbol_id, end_symbol_id)
    
    # Compute loss and back-propagate.
    self.__compute_loss()
    self.__perform_optimization()
    
    # Get predictions for evaluation.
    self.train_predictions = self.train_outputs.sample_id   # a field of the output object is accessed with '.'
    self.infer_predictions = self.infer_outputs.sample_id

In [41]:
Seq2SeqModel.__init__ = classmethod(init_model)

## Train the network and predict output

[Session.run](https://www.tensorflow.org/api_docs/python/tf/Session#run) is the entry point that starts computations in the graph we have defined. To train the network, we need to compute *self.train_op*. To predict output, we just need to compute *self.infer_predictions*. In either case, we need to feed actual data through the placeholders that we defined above.

In [42]:
def train_on_batch(self, session, X, X_seq_len, Y, Y_seq_len, learning_rate, dropout_keep_probability):
    # self.input_batch holds the x values of the samples and self.ground_truth holds the y values
    feed_dict = {
            self.input_batch: X,
            self.input_batch_lengths: X_seq_len,
            self.ground_truth: Y,
            self.ground_truth_lengths: Y_seq_len,
            self.learning_rate_ph: learning_rate,
            self.dropout_ph: dropout_keep_probability
        }
    
    pred, loss, _ = session.run(
            [
            self.train_predictions,
            self.loss,
            self.train_op
            ], 
            feed_dict=feed_dict
    )
    return pred, loss

In [43]:
Seq2SeqModel.train_on_batch = classmethod(train_on_batch)

We implemented two prediction functions: *predict_for_batch* and *predict_for_batch_with_loss*. The first only predicts the output for a batch of input sequences, while the second also computes the loss because it is given the ground-truth values. Both are useful: the first for pure prediction, and the second for validating on held-out (non-training) data during training.

In [44]:
"""
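    predict_for_batch: only predicts the output for a batch of input sequences
    predict_for_batch_with_loss: also computes the loss, because the ground-truth values are provided
"""

'\n    predict_for_batch: only predicts the output for a batch of input sequences\n    predict_for_batch_with_loss: also computes the loss, because the ground-truth values are provided\n'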

In [45]:
def predict_for_batch(self, session, X, X_seq_len):
    ######### YOUR CODE HERE #############
    feed_dict = {
        self.input_batch:X,
        self.input_batch_lengths:X_seq_len
    }
    
    pred = session.run(
            [self.infer_predictions], 
            feed_dict=feed_dict
        )[0]
    return pred

def predict_for_batch_with_loss(self, session, X, X_seq_len, Y, Y_seq_len):
    ######### YOUR CODE HERE #############
    feed_dict = {
        self.input_batch:X,
        self.input_batch_lengths:X_seq_len,
        self.ground_truth:Y,
        self.ground_truth_lengths:Y_seq_len
    }
    pred, loss = session.run([
            self.infer_predictions,
            self.loss,
        ], feed_dict=feed_dict)
    return pred, loss

In [46]:
Seq2SeqModel.predict_for_batch = classmethod(predict_for_batch)
Seq2SeqModel.predict_for_batch_with_loss = classmethod(predict_for_batch_with_loss)

## Run your experiment

Create *Seq2SeqModel* model with the following parameters:
 - *vocab_size* — number of tokens;
 - *embeddings_size* — dimension of embeddings, recommended value: 20;
 - *max_iter* — maximum number of steps in decoder, recommended value: 7;
 - *hidden_size* — size of hidden layers for RNN, recommended value: 512;
 - *start_symbol_id* — an index of the start token (`^`).
 - *end_symbol_id* — an index of the end token (`$`).
 - *padding_symbol_id* — an index of the padding token (`#`).
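
As a quick sanity check on *vocab_size*, the sketch below counts the tokens the task needs: the three special symbols listed above, the two operators and the ten digits. The `example_word2id` mapping and the symbol order are assumptions for illustration (the order is chosen so that `^` gets index 0); the notebook's own `word2id` is defined in an earlier section.

```python
# Hypothetical reconstruction of the vocabulary, for counting purposes only.
symbols = "^$#+-1234567890"  # start, end, padding, two operators, ten digits
example_word2id = {symbol: i for i, symbol in enumerate(symbols)}
example_id2word = {i: symbol for symbol, i in example_word2id.items()}

print(len(example_word2id))   # 15
print(example_word2id["^"])   # 0
```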

Set hyperparameters. You might want to start with the following values and see how it works:
- *batch_size*: 128;
- at least 10 epochs;
- value of *learning_rate*: 0.001
- *dropout_keep_probability*: 0.5 for training (typical keep probabilities range from 0.1 to 1.0; larger values mean that fewer units are dropped);
- *max_len*: 20.
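
The effect of the keep probability can be illustrated without TensorFlow. The sketch below is a plain-Python version of inverted dropout, in which each unit is kept with probability `keep_prob` and the kept units are rescaled by `1 / keep_prob`. The `dropout` helper is a hypothetical stand-in written for this illustration, not the layer used inside the model.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and rescale the kept ones
    by 1 / keep_prob so that the expected sum stays the same."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
units = [1.0] * 10000

for keep_prob in (0.5, 0.9, 1.0):
    out = dropout(units, keep_prob, rng)
    kept_fraction = sum(1 for v in out if v != 0.0) / len(units)
    print("keep_prob:", keep_prob, "fraction of units kept:", kept_fraction)
```

A keep probability of 1.0 leaves every unit untouched, which is why it is the natural setting at prediction time.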

In [47]:
#init_model(self, vocab_size, embeddings_size, hidden_size,max_iter, start_symbol_id, end_symbol_id, padding_symbol_id):

tf.reset_default_graph()

model = Seq2SeqModel(vocab_size=15, 
                     embeddings_size=20, 
                     hidden_size=512, 
                     max_iter=7, 
                     start_symbol_id=word2id["^"], 
                     end_symbol_id=word2id["$"], 
                     padding_symbol_id=word2id["#"])######### YOUR CODE HERE #############

batch_size = 128######### YOUR CODE HERE #############
n_epochs = 10######### YOUR CODE HERE #############
learning_rate = 0.001######### YOUR CODE HERE #############
dropout_keep_probability = 0.5######### YOUR CODE HERE #############
max_len = 20######### YOUR CODE HERE #############

n_step = int(len(train_set) / batch_size)

batch_size: Tensor("strided_slice:0", shape=(), dtype=int32)
start_tokens: Tensor("Fill:0", shape=(?,), dtype=int32)
start_symbol_id: 0
self.train_outputs.shape: BasicDecoderOutput(rnn_output=<tf.Tensor 'decode/decoder/transpose:0' shape=(?, ?, 15) dtype=float32>, sample_id=<tf.Tensor 'decode/decoder/transpose_1:0' shape=(?, ?) dtype=int32>)
self.train_outputs.rnn_output.shape: Tensor("decode/decoder/transpose:0", shape=(?, ?, 15), dtype=float32)
self.ground_truth.shape: (?, ?)


Finally, we are ready to run the training! A good indicator that everything works is a loss that decreases during training. You should expect a loss of approximately 2.7 at the beginning of training and close to 1 after the 10th epoch.
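
The starting value of about 2.7 has a simple interpretation. Assuming the loss is the average per-token softmax cross-entropy (natural logarithm) and that an untrained network spreads its probability roughly evenly over the vocabulary, the expected initial loss is the logarithm of the vocabulary size:

```python
import math

vocab_size = 15  # the value passed to Seq2SeqModel in this notebook

# Cross-entropy of a uniform guess over vocab_size tokens.
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 3))  # 2.708
```

A first-step loss far above this value would suggest a problem in the graph or in the data preparation rather than slow learning.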

In [48]:
"""
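    This is the final step: run the training. A good sign that every step works is a loss that keeps decreasing. At the start the loss is
    about 2.7; after 10 epochs it should drop to roughly 1.0
"""

'\n    This is the final step: run the training. A good sign that every step works is a loss that keeps decreasing. At the start the loss is\n    about 2.7; after 10 epochs it should drop to roughly 1.0\n'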

In [51]:
# How the zip function works
a = [1,2,3]
b = [4,5,6]
c = [4,5,6,7,8]
zipped = zip(a,b)     # pairs up the elements into tuples
for z in zipped:
    print(z)
for z in zip(a,b,c):
    print(z)

(1, 4)
(2, 5)
(3, 6)
(1, 4, 4)
(2, 5, 5)
(3, 6, 6)


In [58]:
session = tf.Session()
session.run(tf.global_variables_initializer())    # initialize all global variables
            
invalid_number_prediction_counts = []
all_model_predictions = []
all_ground_truth = []

print('Start training... \n')
for epoch in range(n_epochs):  
    random.shuffle(train_set)
    random.shuffle(test_set)
    
    print('Train: epoch', epoch + 1)
    # Each pass over the training set is processed in batches of batch_size examples
    for n_iter, (X_batch, Y_batch) in enumerate(generate_batches(train_set, batch_size=batch_size)):
        ######################################
        ######### YOUR CODE HERE #############
        ######################################
        
        # prepare the data (X_batch and Y_batch) for training
        # using function batch_to_ids
        X, X_seq_len = batch_to_ids(X_batch, word2id, max_len=max_len)
        Y, Y_seq_len = batch_to_ids(Y_batch, word2id, max_len=max_len)
        #print("Y.shape:", len(Y), len(Y[0]))    128*20
        #print("X.shape:", len(X), len(X[0]))
        predictions, loss = model.train_on_batch(session, X, X_seq_len, Y, Y_seq_len, learning_rate, dropout_keep_probability)
        
        if n_iter % 200 == 0:
            print("Epoch: [%d/%d], step: [%d/%d], loss: %f" % (epoch + 1, n_epochs, n_iter + 1, n_step, loss))

    X_sent, Y_sent = next(generate_batches(test_set, batch_size=batch_size))
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
    
    # prepare test data (X_sent and Y_sent) for predicting 
    # quality and computing value of the loss function
    # using function batch_to_ids
    X, X_seq_len = batch_to_ids(X_sent, word2id, max_len=max_len)
    Y, Y_seq_len = batch_to_ids(Y_sent, word2id, max_len=max_len)
    
    ######### YOUR CODE HERE #############
    predictions, loss = model.predict_for_batch_with_loss(session,  X, X_seq_len, Y, Y_seq_len)
    print('Test: epoch', epoch + 1, 'loss:', loss,)
    
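    # zip() combines the i-th element of each iterable into a tuple; each tuple has one element per iterable, and the number of tuples equals the length of the shortest iterable
    for x, y, p  in list(zip(X, Y, predictions))[:10]:    # may raise an error without list(), since a zip object cannot be sliced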
        print('X:',''.join(ids_to_sentence(x, id2word)))
        print('Y:',''.join(ids_to_sentence(y, id2word)))
        print('O:',''.join(ids_to_sentence(p, id2word)))
        print('')

    model_predictions = []
    ground_truth = []
    invalid_number_prediction_count = 0
    # For the whole test set calculate ground-truth values (as integer numbers)
    # and prediction values (also as integers) to calculate metrics.
    # If generated by model number is not correct (e.g. '1-1'), 
    # increase invalid_number_prediction_count and don't append this and corresponding
    # ground-truth value to the arrays.
    for X_batch, Y_batch in generate_batches(test_set, batch_size=batch_size):
        ######################################
        ######### YOUR CODE HERE #############
        ######################################

        X, X_seq_len = batch_to_ids(X_batch, word2id, max_len=max_len)
        Y, Y_seq_len = batch_to_ids(Y_batch, word2id, max_len=max_len)
        predictions = model.predict_for_batch(session, X, X_seq_len) #pred is a ndarray
        
        # Detect malformed answers (this simply mirrors the loop above)
        # Whether the prediction matches the ground truth is measured later by MAE
        for y, p, y_len in list(zip(Y, predictions, Y_seq_len)):
            true_y = "".join(ids_to_sentence(y, id2word))
            pred_y = "".join(ids_to_sentence(p, id2word))
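            try:
                # Keep only the characters before the end symbol, if there is one.
                stop_idx = pred_y.find("$")
                if stop_idx != -1:
                    prediction = int(pred_y[:stop_idx])
                else:
                    prediction = int(pred_y)
                # Parse the target before appending so both lists stay aligned.
                target = int(true_y[:y_len-1])
                model_predictions.append(prediction)
                ground_truth.append(target)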
            except ValueError:
                invalid_number_prediction_count += 1
               
                
                
    all_model_predictions.append(model_predictions)
    all_ground_truth.append(ground_truth)
    invalid_number_prediction_counts.append(invalid_number_prediction_count)
            
print('\n...training finished.')

Start training... 

Train: epoch 1
Epoch: [1/10], step: [1/625], loss: 2.712346
Epoch: [1/10], step: [201/625], loss: 1.841499
Epoch: [1/10], step: [401/625], loss: 1.756556
Epoch: [1/10], step: [601/625], loss: 1.691974
Test: epoch 1 loss: 1.60785
X: 6848+6299$
Y: 13147$
O: 12916$

X: 7868-1290$
Y: 6578$#
O: 6659$^

X: 7184+5281$
Y: 12465$
O: 11981$

X: 1270+5193$
Y: 6463$#
O: 6116$^

X: 6552+8342$
Y: 14894$
O: 12919$

X: 7137-4608$
Y: 2529$#
O: 2944$^

X: 610+2051$#
Y: 2661$#
O: 5119$^

X: 9976+184$#
Y: 10160$
O: 9989$^

X: 7848-9747$
Y: -1899$
O: -194$^

X: 9375-8483$
Y: 892$##
O: 1194$^

Train: epoch 2
Epoch: [2/10], step: [1/625], loss: 1.661276
Epoch: [2/10], step: [201/625], loss: 1.557394
Epoch: [2/10], step: [401/625], loss: 1.519405
Epoch: [2/10], step: [601/625], loss: 1.513980
Test: epoch 2 loss: 1.43447
X: 3757-4201$
Y: -444$#
O: -1155$

X: 4775+8214$
Y: 12989$
O: 13133$

X: 282-7554$#
Y: -7272$
O: -7933$

X: 8384-2877$
Y: 5507$#
O: 5918$^

X: 7327+6837$
Y: 14164$
O: 13813
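
The evaluation loop in the cell above turns each decoded string into an integer and counts the strings that cannot be parsed. The same rule can be sketched as a standalone helper. `parse_prediction` is a hypothetical name; the first three sample strings are taken from the log above, `1-1$` follows the invalid example mentioned in the code comment, and `$##` is made up.

```python
def parse_prediction(text, end_symbol="$"):
    """Return the integer written before the first end symbol,
    or None when that prefix is not a valid integer."""
    stop_idx = text.find(end_symbol)
    number_text = text if stop_idx == -1 else text[:stop_idx]
    try:
        return int(number_text)
    except ValueError:
        return None


for text in ["6659$^", "-194$^", "13813", "1-1$", "$##"]:
    print(repr(text), "->", parse_prediction(text))
```

Strings that come back as `None` correspond to the cases counted in `invalid_number_prediction_count`.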

## Evaluate results

Because our task is simple and the output is straightforward, we will use the [MAE](https://en.wikipedia.org/wiki/Mean_absolute_error) (mean absolute error) metric to evaluate the trained model after each epoch. Compute the value of the metric for the predictions from each epoch.
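
As a small worked example, MAE is the mean of the absolute differences between targets and predictions. The sketch below computes it by hand for three (target, prediction) pairs copied from the epoch 1 samples in the training log; it uses plain Python rather than scikit-learn so the arithmetic is visible.

```python
# (ground truth, model output) pairs taken from the epoch 1 log samples.
pairs = [(13147, 12916), (6578, 6659), (12465, 11981)]

abs_errors = [abs(target - prediction) for target, prediction in pairs]
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)     # [231, 81, 484]
print(round(mae, 2))  # 265.33
```

Because the errors are measured in the units of the answer itself, an MAE of 35 means that predictions are off by about 35 on average.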

In [59]:
from sklearn.metrics import mean_absolute_error

In [61]:
# enumerate simply attaches a running index to each element as it is iterated, nothing more
for i, (gts, predictions, invalid_number_prediction_count) in enumerate(zip(all_ground_truth,
                                                                            all_model_predictions,
                                                                            invalid_number_prediction_counts), 1):
    mae = mean_absolute_error(y_pred=predictions,y_true=gts)  ######### YOUR CODE HERE #############
    print("Epoch: %i, MAE: %f, Invalid numbers: %i" % (i, mae, invalid_number_prediction_count))

Epoch: 1, MAE: 1038.991350, Invalid numbers: 0
Epoch: 2, MAE: 466.554350, Invalid numbers: 0
Epoch: 3, MAE: 289.825850, Invalid numbers: 0
Epoch: 4, MAE: 235.246850, Invalid numbers: 0
Epoch: 5, MAE: 179.358600, Invalid numbers: 0
Epoch: 6, MAE: 165.955500, Invalid numbers: 0
Epoch: 7, MAE: 99.184600, Invalid numbers: 0
Epoch: 8, MAE: 63.348700, Invalid numbers: 0
Epoch: 9, MAE: 52.505700, Invalid numbers: 0
Epoch: 10, MAE: 35.521250, Invalid numbers: 0


In [None]:
"""
 Completed on 2018.7.2; took two days
"""