
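# Character Sequence to Sequence
In this notebook, we'll build a model that takes in a sequence of letters and outputs a sorted version of that sequence. We'll do that using what we've learned so far about sequence-to-sequence models. This notebook was updated to work with TensorFlow 1.1 and builds on the work of Dave Currie. Check out Dave's post [Text Summarization with Amazon Reviews](https://medium.com/towards-data-science/text-summarization-with-amazon-reviews-41801c2210b).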

<img src="images/sequence-to-sequence.jpg"/>


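## Dataset

The dataset lives in the /data/ folder. At the moment, it is made up of the following files:
  * **letters_source.txt**: The list of input letter sequences. Each sequence is on its own line.
  * **letters_target.txt**: The list of target sequences we'll use in the training process. Each sequence here is the response to the input sequence in letters_source.txt with the same line number.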

In [1]:
import numpy as np
import time

import helper

source_path = 'data/letters_source.txt'
target_path = 'data/letters_target.txt'

source_sentences = helper.load_data(source_path)
target_sentences = helper.load_data(target_path)

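Let's start by examining the current state of the dataset. `source_sentences` contains the entire input sequence file as text delimited by newline characters.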

In [2]:
source_sentences[:50].split('\n')

['bsaqq',
 'npy',
 'lbwuj',
 'bqv',
 'kial',
 'tddam',
 'edxpjpg',
 'nspv',
 'huloz',
 '']

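`target_sentences` contains the entire output sequence file as text delimited by newline characters. Each line corresponds to the line with the same number in `source_sentences`, and holds that line's characters in sorted order.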

In [3]:
target_sentences[:50].split('\n')

['abqqs',
 'npy',
 'bjluw',
 'bqv',
 'aikl',
 'addmt',
 'degjppx',
 'npsv',
 'hlouz',
 '']
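
Because each target line is just the sorted characters of the matching source line, the relationship can be sanity-checked in plain Python. The sketch below uses a few of the sample lines printed above instead of the loaded files; `sort_letters` and the list names are illustrative and not part of the notebook.

```python
# Sample lines copied from the outputs shown above
source_lines = ['bsaqq', 'npy', 'lbwuj', 'bqv', 'kial']
target_lines = ['abqqs', 'npy', 'bjluw', 'bqv', 'aikl']

def sort_letters(line):
    """Return the characters of `line` in sorted order, joined back into a string."""
    return ''.join(sorted(line))

# Collect every pair where the target is NOT the sorted source line
mismatches = [(s, t) for s, t in zip(source_lines, target_lines)
              if sort_letters(s) != t]
print(mismatches)
```

An empty list means every sampled target agrees with its source.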

 
## Preprocessing
To do anything useful with it, we'll need to turn each string into a list of characters:

<img src="images/source_and_target_arrays.png"/>

Then convert the characters to their int values as declared in our vocabulary:

In [4]:
def extract_character_vocab(data):
    special_words = ['<PAD>', '<UNK>', '<GO>',  '<EOS>']

    set_words = set([character for line in data.split('\n') for character in line])
    int_to_vocab = {word_i: word for word_i, word in enumerate(special_words + list(set_words))}
    vocab_to_int = {word: word_i for word_i, word in int_to_vocab.items()}

    return int_to_vocab, vocab_to_int

# Build int2letter and letter2int dicts
source_int_to_letter, source_letter_to_int = extract_character_vocab(source_sentences)
target_int_to_letter, target_letter_to_int = extract_character_vocab(target_sentences)

# Convert characters to ids
source_letter_ids = [[source_letter_to_int.get(letter, source_letter_to_int['<UNK>'])  for letter in line]   for line in source_sentences.split('\n')]
target_letter_ids = [[target_letter_to_int.get(letter, target_letter_to_int['<UNK>']) 
                      for letter in line] + [target_letter_to_int['<EOS>']] 
                         for line in target_sentences.split('\n')] 

print("Example source sequence")
print(source_letter_ids[:3])
print("\n")
print("Example target sequence")
print(target_letter_ids[:3])

Example source sequence
[[24, 25, 11, 27, 27], [28, 17, 4], [21, 24, 8, 15, 18]]


Example target sequence
[[11, 24, 27, 27, 25, 3], [28, 17, 4, 3], [24, 18, 21, 15, 8, 3]]


This is the final shape we need them to be in. We can now proceed to building the model.
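
The conversion can be traced on a tiny input. This is a minimal sketch and not the notebook's exact function: `build_vocab` is a hypothetical variant that sorts the character set so the ids are reproducible between runs (iterating a plain `set` gives no ordering guarantee), and `toy_text` is made up for illustration.

```python
def build_vocab(text):
    """Map each character (plus four special tokens) to an integer id."""
    special_words = ['<PAD>', '<UNK>', '<GO>', '<EOS>']
    # sorted() is an assumption added here to make the ids deterministic
    letters = sorted(set(ch for line in text.split('\n') for ch in line))
    int_to_vocab = dict(enumerate(special_words + letters))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int

toy_text = 'cab\nba'
int_to_vocab, vocab_to_int = build_vocab(toy_text)

# Source ids: one list per line, unknown characters map to <UNK>
source_ids = [[vocab_to_int.get(ch, vocab_to_int['<UNK>']) for ch in line]
              for line in toy_text.split('\n')]
# Target ids: the sorted line, with <EOS> appended
target_ids = [[vocab_to_int[ch] for ch in sorted(line)] + [vocab_to_int['<EOS>']]
              for line in toy_text.split('\n')]
print(source_ids)
print(target_ids)
```

The four special tokens take ids 0 to 3, so the letters start at id 4.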

## Model
#### Check the Version of TensorFlow
This will check to make sure you have the correct version of TensorFlow.

In [5]:
from distutils.version import LooseVersion
import tensorflow as tf
from tensorflow.python.layers.core import Dense


# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.1'), 'Please use TensorFlow version 1.1 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

TensorFlow Version: 1.2.1


### Hyperparameters

In [6]:
# Number of Epochs
epochs = 60
# Batch Size
batch_size = 128
# RNN Size
rnn_size = 50
# Number of Layers
num_layers = 2
# Embedding Size
encoding_embedding_size = 15
decoding_embedding_size = 15
# Learning Rate
learning_rate = 0.001

### Input

In [7]:
def get_model_inputs():
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')

    target_sequence_length = tf.placeholder(tf.int32, (None,), name='target_sequence_length')
    max_target_sequence_length = tf.reduce_max(target_sequence_length, name='max_target_len')
    source_sequence_length = tf.placeholder(tf.int32, (None,), name='source_sequence_length')
    
    return input_data, targets, lr, target_sequence_length, max_target_sequence_length, source_sequence_length



### Sequence to Sequence Model

We can now start defining the functions that will build the seq2seq model. We are building it from the bottom up with the following components:

     2.1 Encoder
         - Embedding
         - Encoder cell
     2.2 Decoder
         1- Process decoder inputs
         2- Set up the decoder
             - Embedding
             - Decoder cell
             - Dense output layer
             - Training decoder
             - Inference decoder
     2.3 Seq2seq model connecting the encoder and decoder
     2.4 Build the training graph hooking up the model with the 
         optimizer

 

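### 2.1 Encoder

The first bit of the model we'll build is the encoder. Here, we'll embed the input data, construct our encoder, then pass the embedded data to the encoder.

- Embed the input data using [`tf.contrib.layers.embed_sequence`](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/embed_sequence)
<img src="images/embed_sequence.png" />

- Pass the embedded input into a stack of RNNs. Save the RNN state and ignore the output.
<img src="images/encoder.png" />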

In [None]:
def encoding_layer(input_data, rnn_size, num_layers,
                   source_sequence_length, source_vocab_size, 
                   encoding_embedding_size):


    # Encoder embedding
    enc_embed_input = tf.contrib.layers.embed_sequence(input_data, source_vocab_size, encoding_embedding_size)

    # RNN cell
    def make_cell(rnn_size):
        enc_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        return enc_cell

    enc_cell = tf.contrib.rnn.MultiRNNCell([make_cell(rnn_size) for _ in range(num_layers)])
    
    enc_output, enc_state = tf.nn.dynamic_rnn(enc_cell, enc_embed_input, sequence_length=source_sequence_length, dtype=tf.float32)
    
    return enc_output, enc_state



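## 2.2 Decoder

The decoder is probably the most involved part of this model. The following steps are needed to create it:

    1- Process decoder inputs
    2- Set up the decoder components
         - Embedding
         - Decoder cell
         - Dense output layer
         - Training decoder
         - Inference decoder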


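### Process Decoder Input


In the training process, the target sequences will be used in two different places:

 1. We'll use them to calculate the loss.
 2. We'll feed them to the decoder during training to make the model more robust.

Now we need to address the second point. Let's assume our targets look like this in their letter/word form (we're doing this for readability; at this point in the code, these sequences would be in int form):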


<img src="images/targets_1.png"/>

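We need to do a simple transformation on the tensor before feeding it to the decoder:

1- We'll feed one item of the sequence to the decoder at each time step. Think about the last time step, where the decoder outputs the final word of its output. The input to that step is the item before last in the target sequence. The decoder therefore has no use for the last item of the target sequence as an input, so we need to remove it.

We do that using TensorFlow's tf.strided_slice() method. We hand it the tensor, the index where the slice should start, and the index where it should end.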

<img src="images/strided_slice_1.png"/>

2- The first item in each sequence we feed to the decoder has to be the GO symbol. So we'll add that to the beginning.


<img src="images/targets_add_go.png"/>


Now the tensor is ready to be fed to the decoder. It looks like this (if we convert from ints to letters/symbols):

<img src="images/targets_after_processing_1.png"/>

In [None]:
# Process the input we'll feed to the decoder
def process_decoder_input(target_data, vocab_to_int, batch_size):
    '''Remove the last word id from each sequence in the batch and prepend <GO> to the beginning of each sequence'''
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input
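
The same transformation can be followed on plain Python lists. This is only an illustrative sketch: `process_decoder_input_py` is a hypothetical stand-in for the TensorFlow version above, and the ids assume the special-token order `<PAD>`=0, `<UNK>`=1, `<GO>`=2, `<EOS>`=3 used in the vocabulary.

```python
def process_decoder_input_py(target_batch, go_id):
    """Drop the last id of every sequence and prepend the <GO> id."""
    return [[go_id] + seq[:-1] for seq in target_batch]

GO = 2  # id of <GO> under the assumed special-token order
targets = [[11, 24, 27, 3],   # ends in <EOS> (3)
           [28, 17, 3, 0]]    # shorter sequence, padded with <PAD> (0)

dec_input = process_decoder_input_py(targets, GO)
print(dec_input)
```

Each row keeps its length: one id is removed at the end and one is added at the front.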



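### Set up the decoder components

         - Embedding
         - Decoder cell
         - Dense output layer
         - Training decoder
         - Inference decoder

#### 1- Embedding
Now that we have prepared the inputs to the training decoder, we need to embed them so they are ready to be passed to the decoder.

We'll create an embedding matrix like the following, then have tf.nn.embedding_lookup convert our input to its embedded equivalent: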
<img src="images/embeddings.png" />
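
Conceptually, an embedding lookup is row selection: each id picks one row of the matrix. The sketch below mimics that on nested Python lists; `embedding_lookup` here is a hypothetical pure-Python helper, and the matrix values are made up (in the model they are learned).

```python
def embedding_lookup(embedding_matrix, id_batch):
    """Replace every id in a batch of sequences with its row of the embedding matrix."""
    return [[embedding_matrix[token_id] for token_id in seq] for seq in id_batch]

# Toy matrix: vocabulary of 4 ids, embedding size 2
embedding_matrix = [[0.0, 0.0],
                    [0.1, 0.9],
                    [0.5, 0.5],
                    [0.9, 0.1]]
batch = [[2, 1, 3],
         [3, 3, 0]]

embedded = embedding_lookup(embedding_matrix, batch)
print(embedded[0])
```

A batch of shape [batch, time] becomes [batch, time, embedding_size].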

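#### 2- Decoder Cell
Then we declare our decoder cell. Just like the encoder, we'll use a tf.contrib.rnn.LSTMCell here as well.

We need to declare a decoder for the training process and a decoder for the inference/prediction process. These two decoders will share their parameters (so that all the weights and biases set during the training phase can be used when we deploy the model).

First, we'll need to define the type of cell we'll be using for our decoder RNNs. We opted for LSTM.

#### 3- Dense output layer
Before we declare our decoders, we need to create an output layer: a tensorflow.python.layers.core.Dense layer that translates the outputs of the decoder into logits, which tell us which element of the decoder vocabulary the decoder is choosing to output at each time step.

#### 4- Training decoder
Essentially, we'll be creating two decoders which share their parameters, one for training and one for inference. The two are similar in that both are created using tf.contrib.seq2seq.**BasicDecoder** and tf.contrib.seq2seq.**dynamic_decode**. They differ, however, in that we feed the target sequences as inputs to the training decoder at each time step to make it more robust.

We can think of the training decoder as looking like this (except that it works with sequences in batches):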
<img src="images/sequence-to-sequence-training-decoder.png"/>

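The training decoder **does not** feed the output of each time step to the next. Rather, the inputs to the decoder time steps are the target sequence from the training dataset (the orange letters).

#### 5- Inference decoder
The inference decoder is the one we'll use when we deploy our model to the wild.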

<img src="images/sequence-to-sequence-inference-decoder.png"/>

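We'll hand our encoder hidden state to both the training and inference decoders and have them process it into their outputs. TensorFlow handles most of the logic for us. We just have to use the appropriate methods from tf.contrib.seq2seq and supply them with the appropriate inputs.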
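
The difference between the two decoders comes down to where each step's input comes from. The toy sketch below illustrates this with a made-up `toy_step` function standing in for one decoder step (it simply predicts the next letter of the alphabet); nothing here is part of tf.contrib.seq2seq.

```python
def toy_step(prev_token):
    """Hypothetical stand-in for one decoder step: predict the next letter."""
    if prev_token == '<GO>':
        return 'a'
    return chr(ord(prev_token) + 1)

def decode_teacher_forced(targets):
    """Training style: step inputs are <GO> plus the targets without their last item."""
    inputs = ['<GO>'] + targets[:-1]
    return [toy_step(token) for token in inputs]

def decode_greedy(max_steps, eos='d'):
    """Inference style: each step is fed the previous step's own prediction."""
    outputs, prev = [], '<GO>'
    for _ in range(max_steps):
        prev = toy_step(prev)
        outputs.append(prev)
        if prev == eos:  # stop once the (toy) end symbol is produced
            break
    return outputs

forced = decode_teacher_forced(['a', 'x', 'c'])
greedy = decode_greedy(max_steps=10)
print(forced)
print(greedy)
```

In the teacher-forced run, the third output depends on the target item `'x'` and not on the model's own second prediction, which is the behaviour the training decoder description above refers to.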

In [None]:
def decoding_layer(target_letter_to_int, decoding_embedding_size, num_layers, rnn_size,
                   target_sequence_length, max_target_sequence_length, enc_state, dec_input):
    # 1. Decoder Embedding
    target_vocab_size = len(target_letter_to_int)
    dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    # 2. Construct the decoder cell
    def make_cell(rnn_size):
        dec_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        return dec_cell

    dec_cell = tf.contrib.rnn.MultiRNNCell([make_cell(rnn_size) for _ in range(num_layers)])
     
    # 3. Dense layer to translate the decoder's output at each time 
    # step into a choice from the target vocabulary
    output_layer = Dense(target_vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))


    # 4. Set up a training decoder and an inference decoder
    # Training Decoder
    with tf.variable_scope("decode"):

        # Helper for the training process. Used by BasicDecoder to read inputs.
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                            sequence_length=target_sequence_length,
                                                            time_major=False)
        
        
        # Basic decoder
        training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                           training_helper,
                                                           enc_state,
                                                           output_layer) 
        
        # Perform dynamic decoding using the decoder
        training_decoder_output = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                                       impute_finished=True,
                                                                       maximum_iterations=max_target_sequence_length)[0]
    # 5. Inference Decoder
    # Reuses the same parameters trained by the training process
    with tf.variable_scope("decode", reuse=True):
        start_tokens = tf.tile(tf.constant([target_letter_to_int['<GO>']], dtype=tf.int32), [batch_size], name='start_tokens')

        # Helper for the inference process.
        inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(dec_embeddings,
                                                                start_tokens,
                                                                target_letter_to_int['<EOS>'])

        # Basic decoder
        inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                        inference_helper,
                                                        enc_state,
                                                        output_layer)
        
        # Perform dynamic decoding using the decoder
        inference_decoder_output = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                            impute_finished=True,
                                                            maximum_iterations=max_target_sequence_length)[0]
         

    
    return training_decoder_output, inference_decoder_output

## 2.3 Seq2seq model
Let's now go a step above, and hook up the encoder and decoder using the methods we just declared.

In [None]:

def seq2seq_model(input_data, targets, lr, target_sequence_length, 
                  max_target_sequence_length, source_sequence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size, 
                  rnn_size, num_layers):
    
    # Pass the input data through the encoder. We'll ignore the encoder output, but use the state
    _, enc_state = encoding_layer(input_data, 
                                  rnn_size, 
                                  num_layers, 
                                  source_sequence_length,
                                  source_vocab_size, 
                                  enc_embedding_size)
    
    
    # Prepare the target sequences we'll feed to the decoder in training mode
    # (target_letter_to_int and batch_size are read from the notebook's global scope)
    dec_input = process_decoder_input(targets, target_letter_to_int, batch_size)
    
    # Pass encoder state and decoder inputs to the decoders
    training_decoder_output, inference_decoder_output = decoding_layer(target_letter_to_int, 
                                                                       dec_embedding_size, 
                                                                       num_layers, 
                                                                       rnn_size,
                                                                       target_sequence_length,
                                                                       max_target_sequence_length,
                                                                       enc_state, 
                                                                       dec_input) 
    
    return training_decoder_output, inference_decoder_output
    



 

The model outputs *training_decoder_output* and *inference_decoder_output* both contain an 'rnn_output' logits tensor that looks like this:

<img src="images/logits.png"/>

The logits we get from the training tensor are passed to tf.contrib.seq2seq.**sequence_loss()** to calculate the loss and, ultimately, the gradient.
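
To make the role of the mask concrete, here is a minimal pure-Python sketch of a masked sequence loss: softmax cross-entropy averaged over the non-padded positions only. It is written under the assumption that this matches the default averaging of sequence_loss; `masked_sequence_loss` is a hypothetical name, and the numbers are made up.

```python
import math

def masked_sequence_loss(logits, targets, lengths):
    """Average softmax cross-entropy over the non-padded positions of a batch.

    logits:  [batch][time][vocab] floats
    targets: [batch][time] ints
    lengths: [batch] ints, number of real (non-padded) steps per sequence
    """
    total, count = 0.0, 0
    for seq_logits, seq_targets, length in zip(logits, targets, lengths):
        for t in range(length):  # steps at or beyond `length` carry mask weight 0
            step = seq_logits[t]
            m = max(step)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in step))
            total += log_z - step[seq_targets[t]]
            count += 1
    return total / count

logits = [[[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],
          [[0.1, 0.2, 3.0], [9.0, 9.0, 9.0]]]  # last step of sequence 2 is padding
targets = [[0, 1], [2, 0]]
lengths = [2, 1]

loss = masked_sequence_loss(logits, targets, lengths)
print(round(loss, 4))
```

Changing the logits at a padded step leaves the loss unchanged, which is the purpose of the `masks` tensor built with tf.sequence_mask.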


In [None]:
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():
    
    # Load the model inputs    
    input_data, targets, lr, target_sequence_length, max_target_sequence_length, source_sequence_length = get_model_inputs()
    
    # Create the training and inference logits
    training_decoder_output, inference_decoder_output = seq2seq_model(input_data, 
                                                                      targets, 
                                                                      lr, 
                                                                      target_sequence_length, 
                                                                      max_target_sequence_length, 
                                                                      source_sequence_length,
                                                                      len(source_letter_to_int),
                                                                      len(target_letter_to_int),
                                                                      encoding_embedding_size, 
                                                                      decoding_embedding_size, 
                                                                      rnn_size, 
                                                                      num_layers)    
    
    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_decoder_output.rnn_output, 'logits')
    inference_logits = tf.identity(inference_decoder_output.sample_id, name='predictions')
    
    # Create the weights for sequence_loss
    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
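
The gradient clipping step clamps every gradient element to the range [-5, 5] before the optimizer applies it. The element-wise behaviour can be sketched in plain Python; this `clip_by_value` is a hypothetical list-based helper written for illustration, not the TensorFlow op.

```python
def clip_by_value(values, low, high):
    """Clamp every element of a flat list of gradient values to [low, high]."""
    return [min(max(v, low), high) for v in values]

grads = [-12.0, -0.3, 0.0, 4.9, 7.5]
clipped = clip_by_value(grads, -5.0, 5.0)
print(clipped)
```

Values already inside the range pass through untouched; only the outliers are pulled back to the bounds.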


 
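## Batches

There's little processing involved when we retrieve the batches. Here's a simple example assuming batch_size = 2.

Source sequences (these are actually in int form; we're showing the characters for clarity):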

<img src="images/source_batch.png" />

Target sequences (also in int form, but showing letters for clarity):

<img src="images/target_batch.png" />

In [None]:
def pad_sentence_batch(sentence_batch, pad_int):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [pad_int] * (max_sentence - len(sentence)) for sentence in sentence_batch]
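
A quick usage sketch of the padding step on a batch of two sequences. The function is repeated so the snippet runs on its own; the ids come from the example source sequences printed earlier, and `PAD = 0` assumes `<PAD>` is the first special token.

```python
def pad_sentence_batch(sentence_batch, pad_int):
    """Same logic as the function above, repeated so this snippet is self-contained."""
    max_sentence = max(len(sentence) for sentence in sentence_batch)
    return [sentence + [pad_int] * (max_sentence - len(sentence))
            for sentence in sentence_batch]

PAD = 0  # assumed id of <PAD>
batch = [[24, 25, 11, 27, 27], [28, 17, 4]]

padded = pad_sentence_batch(batch, PAD)
true_lengths = [len(sentence) for sentence in batch]  # lengths before padding
print(padded)
print(true_lengths)
```

After padding, every row has the length of the longest sequence in the batch, so the batch can be turned into a rectangular array.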

In [None]:
def get_batches(targets, sources, batch_size, source_pad_int, target_pad_int):
    """Batch targets, sources, and the lengths of their sentences together"""
    for batch_i in range(0, len(sources)//batch_size):
        start_i = batch_i * batch_size
        sources_batch = sources[start_i:start_i + batch_size]
        targets_batch = targets[start_i:start_i + batch_size]
        pad_sources_batch = np.array(pad_sentence_batch(sources_batch, source_pad_int))
        pad_targets_batch = np.array(pad_sentence_batch(targets_batch, target_pad_int))
        
        # Need the lengths for the _lengths parameters.
        # Note: these are measured on the padded batches, so every entry
        # equals the length of the longest sequence in the batch.
        pad_targets_lengths = [len(target) for target in pad_targets_batch]
        pad_source_lengths = [len(source) for source in pad_sources_batch]
        
        yield pad_targets_batch, pad_sources_batch, pad_targets_lengths, pad_source_lengths

## Train
We're now ready to train our model. If you run into OOM (out of memory) issues during training, try to decrease the batch_size.

In [None]:
# Split data to training and validation sets
train_source = source_letter_ids[batch_size:]
train_target = target_letter_ids[batch_size:]
valid_source = source_letter_ids[:batch_size]
valid_target = target_letter_ids[:batch_size]
(valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = next(get_batches(valid_target, valid_source, batch_size,
                           source_letter_to_int['<PAD>'],
                           target_letter_to_int['<PAD>']))

display_step = 20 # Check training loss after every 20 batches

checkpoint = "best_model.ckpt" 
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
        
    for epoch_i in range(1, epochs+1):
        for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
                get_batches(train_target, train_source, batch_size,
                           source_letter_to_int['<PAD>'],
                           target_letter_to_int['<PAD>'])):
            
            # Training step
            _, loss = sess.run(
                [train_op, cost],
                {input_data: sources_batch,
                 targets: targets_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths})

            # Debug message updating us on the status of the training
            if batch_i % display_step == 0 and batch_i > 0:
                
                # Calculate validation cost
                validation_loss = sess.run(
                [cost],
                {input_data: valid_sources_batch,
                 targets: valid_targets_batch,
                 lr: learning_rate,
                 target_sequence_length: valid_targets_lengths,
                 source_sequence_length: valid_sources_lengths})
                
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}  - Validation loss: {:>6.3f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(train_source) // batch_size, 
                              loss, 
                              validation_loss[0]))

    
    
    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, checkpoint)
    print('Model Trained and Saved')

## Prediction

In [None]:
def source_to_seq(text):
    '''Prepare the text for the model'''
    sequence_length = 7
    return [source_letter_to_int.get(word, source_letter_to_int['<UNK>']) for word in text]+ [source_letter_to_int['<PAD>']]*(sequence_length-len(text))


In [None]:


input_sentence = 'hello'
text = source_to_seq(input_sentence)

checkpoint = "./best_model.ckpt"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
    target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      target_sequence_length: [len(text)]*batch_size, 
                                      source_sequence_length: [len(text)]*batch_size})[0] 


pad = source_letter_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nSource')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([source_int_to_letter[i] for i in text])))

print('\nTarget')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([target_int_to_letter[i] for i in answer_logits if i != pad])))
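
The printout above filters out `<PAD>` only, so an `<EOS>` id in the prediction would still be shown. A possible post-processing step is sketched below: `ids_to_text` is a hypothetical helper that stops at `<EOS>` and skips `<PAD>`, and the toy vocabulary and predicted ids are made up for illustration.

```python
def ids_to_text(ids, int_to_letter, pad_id, eos_id):
    """Convert predicted ids to a string, stopping at <EOS> and skipping <PAD>."""
    letters = []
    for token_id in ids:
        if token_id == eos_id:
            break
        if token_id != pad_id:
            letters.append(int_to_letter[token_id])
    return ''.join(letters)

# Hypothetical toy vocabulary and prediction
int_to_letter = {0: '<PAD>', 1: '<UNK>', 2: '<GO>', 3: '<EOS>',
                 4: 'e', 5: 'h', 6: 'l', 7: 'o'}
predicted_ids = [4, 5, 6, 6, 7, 3, 0]

decoded = ids_to_text(predicted_ids, int_to_letter, pad_id=0, eos_id=3)
print(decoded)
```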