# Build a sequence to sequence language model to generate Chinese poems
We train a seq2seq RNN on ~43,000 poems from the Tang Dynasty to learn the probability distribution of the next word/character in the sequence given the previous characters.

## Preprocess the file containing all the poems

The format of our input data is like this:

`(optional title + ":")poem`

We will:

- Remove the title
- Discard poems that contain special symbols such as brackets, `_`, or `□`
- Discard poems that are too short or too long
- Remove punctuation (`，` and `。`)

In [0]:
import re
import numpy as np

data_filename ='poetry.txt'
poems = []
with open(data_filename, "r") as in_file:
  for line in in_file.readlines():
    line = line.strip()
    # each line is "title:poem"; keep only lines with exactly one separator
    # (an empty poem is dropped by the length check below)
    parts = line.split(':')
    if len(parts) != 2:
      continue
    poem = parts[1]
    # discard if contains special symbols
    if re.search(r'[(（《_□]', poem):
      continue
    # discard if too short or too long
    if len(poem) < 5 or len(poem) > 40:
      continue
    # remove symbols
    poem = re.sub(u'[，。]','',poem)
    poems.append(poem)

poems = np.random.permutation(poems)

In [5]:
poems[0]

'闭门茅底偶为邻北阮那怜南阮贫却是梅花无世态隔墙分送一枝春'

In [6]:
poems[2]

'杜若溪边手自移旋抽烟剑碧参差何时织得孤帆去悬向秋风访所思'
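
To check the filtering rules without the data file, the same steps can be wrapped in a small function and run on a few lines. This is only a sketch: `clean_poem` and the sample lines are made up for illustration and are not part of the notebook.

```python
import re

def clean_poem(line, min_len=5, max_len=40):
    """Return the cleaned poem of a 'title:poem' line, or None if it is rejected."""
    parts = line.strip().split(':')
    # keep only lines with exactly one title separator
    if len(parts) != 2:
        return None
    poem = parts[1]
    # reject poems with brackets, underscores, or placeholder boxes
    if re.search(r'[(（《_□]', poem):
        return None
    # the length is checked before punctuation is stripped, as in the cell above
    if len(poem) < min_len or len(poem) > max_len:
        return None
    # remove commas and full stops
    return re.sub('[，。]', '', poem)

# hypothetical sample lines in the same "title:poem" format
samples = [
    "静夜思:床前明月光，疑是地上霜。",  # kept
    "无题:",                            # empty poem, rejected
    "残句:春风□□花",                    # placeholder symbol, rejected
]
cleaned = [clean_poem(s) for s in samples]
print(cleaned)
```

Rejected lines come back as `None`, so it is easy to count how many poems each rule removes.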

We select 5 poems as our test set.

In [7]:
poems_train, poems_test = poems[:-5], poems[-5:]
len(poems_train), len(poems_test)

(11098, 5)

## Create a vocabulary mapping word to id using Tokenizer!

In [8]:
import time
from tensorflow.keras.preprocessing.text import Tokenizer

poem_tokenizer = Tokenizer(num_words=None, lower=False, char_level=True)
# Create word to ID dictionary
poem_tokenizer.fit_on_texts(poems)
# Get dictionary
word_index = poem_tokenizer.word_index

# Note that ID starts from 1!!
# We need to add special ID 0
word_index["<PAD>"] = 0
# Create ID to word 
reverse_word_index = dict([(v, k) for (k, v) in word_index.items()])
print("Number of unique chars: {}".format(len(word_index)))

Number of unique chars: 4762
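
To make the mapping less of a black box, here is a rough pure-Python sketch of what a character-level vocabulary looks like. It assumes, as the output above suggests, that IDs start from 1 and are handed out in order of descending frequency. `build_char_vocab` and the toy poems are illustrative names and data, not Keras API.

```python
from collections import Counter

def build_char_vocab(texts, pad_token="<PAD>"):
    """Map each character to an ID, most frequent first, with 0 kept for padding."""
    counts = Counter(ch for text in texts for ch in text)
    # most_common() sorts by descending count; ties keep first-seen order
    vocab = {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}
    vocab[pad_token] = 0
    return vocab

toy_poems = ["春风春雨", "风花雪月"]
vocab = build_char_vocab(toy_poems)
# reverse mapping, used to turn IDs back into characters
id_to_char = {i: ch for ch, i in vocab.items()}
encoded = [[vocab[ch] for ch in poem] for poem in toy_poems]
print(encoded)
print(''.join(id_to_char[i] for i in encoded[1]))
```

Frequent characters get small IDs, which is why common characters such as 不 and 人 appear at the top of the real dictionary.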


Again, always check whether there are any strange symbols in the dictionary. Here we only print the first and last entries.

In [9]:
# sort word index by ID
for (w,i) in sorted(word_index.items(), key=lambda w: w[1]):
# print some words to check if there are errors!
  if i > 10 and i < len(word_index)-5: continue
  print("{} {}".format(w,i))

<PAD> 0
不 1
人 2
一 3
山 4
风 5
无 6
花 7
来 8
日 9
春 10
屩 4757
羖 4758
锱 4759
殖 4760
豗 4761


In [10]:
# Apply word to ID on training and test set
poems_train = poem_tokenizer.texts_to_sequences(poems_train)
poems_test = poem_tokenizer.texts_to_sequences(poems_test)
# Check and see if there is any error
print(poems_train[0])
print(''.join([reverse_word_index[w] for w in poems_train[0]]))

[577, 55, 791, 643, 708, 32, 640, 180, 1587, 424, 256, 54, 1587, 720, 130, 22, 504, 7, 6, 218, 1214, 402, 644, 172, 188, 3, 119, 10]
闭门茅底偶为邻北阮那怜南阮贫却是梅花无世态隔墙分送一枝春


## Prepare the data for input

We flatten the input to a long list.

In [0]:
# flatten the training poems into one long list of character IDs
poems_train = [w for po in poems_train for w in po]

# flatten the test poems into one long list of character IDs
poems_test = [w for po in poems_test for w in po]

## Define an input object

We need to put the input into batches.
* Reshape input data into a rectangular matrix and crop remainders
* Calculate shape of each batch
* Generate batch with input and output = input shift by one time step


In [0]:
import tensorflow as tf

In [0]:
class PoemInput(object):
  def __init__(self, data, config, name=None):
    self.batch_size = batch_size = config.batch_size
    self.num_steps = num_steps = config.num_steps
    self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
    self.sources, self.targets = self.input_producer(
        data, batch_size, num_steps, name=name)

  def input_producer(self, raw_data, batch_size, num_steps, name=None):
    """Reshape the poem data to form input and output.
    This chunks the raw_data into batches of examples and returns Tensors that
    are drawn from these batches.
    Args:
      raw_data: a list of words
      batch_size: int, the batch size.
      num_steps: int, the sequence length.
      name: the name of this operation (optional).
    Returns:
      A pair of Tensors, each shaped [batch_size, num_steps]. The second element
      of the tuple is the same data time-shifted to the right by one.
    """
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)
    # get size of the 1-d tensor
    data_len = tf.size(raw_data)
    # calculate how many batches
    batch_len = data_len // batch_size
    # crop data that does not fit in a batch
    data = tf.reshape(raw_data[0:batch_size*batch_len],
                      [batch_size, batch_len])
    # calculate how many batches in an epoch
    epoch_size = (batch_len-1) // num_steps
    # make sure there is at least one batch
    assertion = tf.assert_positive(epoch_size,
        message="epoch_size == 0, decrease batch_size or num_steps")
    with tf.control_dependencies([assertion]):
      epoch_size = tf.cast(tf.identity(epoch_size, name="epoch_size"), tf.int64)

    # start generating slices
    # range_input_producer builds a queue of batch indices 0..epoch_size-1;
    # each dequeue() yields the next index
    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    x = data[:, i*num_steps  :(i+1)*num_steps]
    y = data[:, i*num_steps+1:(i+1)*num_steps+1]
    return x, y
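
The reshape-and-shift logic is easier to see on a tiny example. The sketch below redoes the same arithmetic in NumPy without queues. `make_batches` is a hypothetical helper written for this illustration, not a replacement for `PoemInput`.

```python
import numpy as np

def make_batches(ids, batch_size, num_steps):
    """Yield (x, y) pairs of shape [batch_size, num_steps]; y is x shifted by one."""
    ids = np.asarray(ids, dtype=np.int32)
    # length of each row after cropping the remainder
    batch_len = len(ids) // batch_size
    data = ids[:batch_size * batch_len].reshape(batch_size, batch_len)
    # one position is reserved so that the last target still exists
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size <= 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y

batches = list(make_batches(range(20), batch_size=2, num_steps=3))
first_x, first_y = batches[0]
print(len(batches))
print(first_x)
print(first_y)
```

With 20 IDs, the data becomes two rows of 10, which gives 3 batches. Each row of `first_y` holds the characters that follow the corresponding row of `first_x`.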

## Define hyperparameters

In [0]:
# Define hyperparameters
class Hparam(object):
  learning_rate = 1.0
  max_grad_norm = 5
  num_layers = 1
  num_steps = 35
  vocab_size = len(word_index)
  embedding_size = 100
  hidden_size = 100
  warmup_epochs = 3
  num_epochs_to_train = 20
  keep_prob = 0.6
  lr_decay = 0.9
  batch_size = 100

config = Hparam()

## Construct model
In this step, the entire model structure must be defined completely, including:
* Input
* Size of layers
* Connection between layers
* Variables in layers
* Output
* Loss
* Operations that apply the gradients (optimizer)
* Placeholder for feeding special values
* Properties that can be read from outside

Note that we will use CudnnLSTM to speed up our training if available. However, I will provide two versions of LSTM here in case you cannot find a machine with GPUs.

In [0]:
from tensorflow.contrib.cudnn_rnn import CudnnLSTM
from tensorflow.contrib.rnn import BasicLSTMCell, MultiRNNCell
from tensorflow.nn import embedding_lookup, dropout

# Build our model
class MySeq2SeqModel(object):
  def __init__(self, is_training, config, input_):
    self._is_training = is_training
    self._input = input_
    self._cell = None
    self.batch_size = input_.batch_size
    self.num_steps = input_.num_steps
    rnn_size = config.hidden_size
    vocab_size = config.vocab_size
    embedding_size = config.embedding_size

    # Keep the embedding table and lookup on the CPU
    with tf.device("/cpu:0"):
      embedding_weights = tf.get_variable("embedding", \
                     [vocab_size, embedding_size])
      embed_inputs = tf.nn.embedding_lookup(embedding_weights, input_.sources)

    if is_training and config.keep_prob < 1.:
      embed_inputs = tf.nn.dropout(embed_inputs, config.keep_prob)

    # build RNN using CudnnLSTM
    output, _ = self._build_rnn(embed_inputs, config, is_training)
    # build RNN using basic LSTM
    # output, _ = self._build_rnn_old_lstm(embed_inputs, config, is_training)

    # Remember RNN output is [batch_size x time, rnnsize]
    # Dense layer for projecting onto vocabulary size
    softmax_w = tf.get_variable("softmax_w", [rnn_size, vocab_size])
    softmax_b = tf.get_variable("softmax_b", [vocab_size])
    logits = tf.nn.xw_plus_b(output, softmax_w, softmax_b)
    # Reshape logits to be a 3-D tensor for sequence loss
    logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
    self._logits = logits

    # Use the contrib sequence loss and average over the batches
    loss = tf.contrib.seq2seq.sequence_loss(
        logits,
        input_.targets,
        tf.ones([self.batch_size, self.num_steps]),
        average_across_timesteps=False,
        average_across_batch=True)

    # Update the cost
    self._cost = tf.reduce_sum(loss)

    if not is_training:
      return

    # A variable to store learning rate
    self._lr = tf.Variable(0.0, trainable=False)

    # Calculate gradients
    # Get a list of trainable variables
    tvars = tf.trainable_variables()
    # Get gradient and clip by norm
    grads, _ = tf.clip_by_global_norm(\
                 tf.gradients(self._cost, tvars),
                 config.max_grad_norm)
    # Define an optimizer
    # Note that the optimizer reads the value of learning rate from variable
    optimizer = tf.train.GradientDescentOptimizer(self._lr)
    # Define an operation that actually applies the gradients
    self._train_op = optimizer.apply_gradients(
        zip(grads, tvars),
        global_step=tf.train.get_or_create_global_step())
    # A placeholder for feeding new learning rates
    self._new_lr = tf.placeholder(
         tf.float32, shape=[], name="new_learning_rate")
    self._lr_update_op = tf.assign(self._lr, self._new_lr)
  
  def _build_rnn(self, inputs, config, is_training):
    # RNN requires time-major
    inputs = tf.transpose(inputs, [1, 0, 2])
    self._cell = CudnnLSTM(
        num_layers=config.num_layers,
        num_units=config.hidden_size,
        )
    self._cell.build(inputs.get_shape())
    outputs, state = self._cell(inputs)
    # Transpose from time-major to batch-major
    outputs = tf.transpose(outputs, [1, 0, 2])
    # Reshape from [batch, time, rnnsize] to [batch x time, rnnsize]
    # For computing softmax later
    outputs = tf.reshape(outputs, [-1, config.hidden_size])
    return outputs, state

  def _build_rnn_old_lstm(self, inputs, config, is_training):
    def make_cell():
      cell = BasicLSTMCell(
        config.hidden_size, forget_bias=0.0, state_is_tuple=True,
        reuse=not is_training)
      if is_training and config.keep_prob < 1:
        cell = tf.contrib.rnn.DropoutWrapper(
            cell, output_keep_prob=config.keep_prob)
      return cell

    cell = tf.contrib.rnn.MultiRNNCell(
        [make_cell() for _ in range(config.num_layers)], state_is_tuple=True)

    self._initial_state = cell.zero_state(config.batch_size, tf.float32)
    state = self._initial_state
    outputs = []
    inputs = tf.unstack(inputs, num=self.num_steps, axis=1)
    outputs, state = tf.nn.static_rnn(cell, inputs,
                                      initial_state=self._initial_state)
    output = tf.reshape(tf.concat(outputs, 1), [-1, config.hidden_size])
    return output, state
  
  def assign_lr(self, session, lr_value):
    session.run(self._lr_update_op, feed_dict={self._new_lr: lr_value})

  @property
  def input(self):
    return self._input

  @property
  def cost(self):
    return self._cost

  @property
  def lr(self):
    return self._lr

  @property
  def train_op(self):
    return self._train_op

  @property
  def logits(self):
    return self._logits
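
The model clips gradients with `tf.clip_by_global_norm` before applying them. As a rough illustration of what that does, here is the same rule in NumPy: all gradients are scaled by `clip_norm / max(global_norm, clip_norm)`, so they shrink together only when their combined norm exceeds the limit. The function below is a sketch for intuition, and the toy gradients are made up.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    # scale is 1 when the norm is already within the limit
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0]), np.array([4.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm, clipped)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved. Only its length changes.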

## Define a training operation for an epoch
This procedure runs the model once per batch and collects the outputs.
We pass `session.run()` a dictionary with these keys:

* "cost": reads the property `model.cost` that we defined above.
* "do_op": performs the operation `model.train_op`, which applies the gradients.

`session.run()` returns a dictionary with the same keys, each holding the value that was computed.

We can add a key for any `@property` of the model that we want to read!

In [0]:
def run_epoch(session, model, do_op=None, verbose=False):
  start_time = time.time()
  costs = 0.0
  iters = 0
  # fetches: the tensors and operations we ask the session to evaluate
  fetch_dict = {
      "cost": model.cost,
  }
  # if an operation is provided, add it to the fetches
  if do_op is not None:
    fetch_dict["do_op"] = do_op

  for step in range(model.input.epoch_size):
    # run one batch; the input queue supplies the data, so no feed_dict is needed
    s_out = session.run(fetch_dict)
    # The returned dictionary will contain the information we need
    cost = s_out["cost"]
    # Accumulate cost
    costs += cost
    # Accumulate number of training steps
    iters += model.input.num_steps
    # Print loss periodically
    if verbose and (step+1) % max(model.input.epoch_size // 5, 1) == 0:
      print("%.0f%% ppl: %.3f, speed: %.0f char/sec" %
            ((step+1) * 100.0 / model.input.epoch_size, \
             np.exp(costs/iters), \
             iters * model.input.batch_size/(time.time() - start_time)))

  return np.exp(costs / iters)
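
The perplexity reported here is `exp(total cost / total time steps)`, where the cost of a batch is the cross-entropy summed over time steps and averaged over the batch. The NumPy sketch below recomputes that quantity by hand. `sequence_cost` is an illustrative stand-in for the TensorFlow loss, not the function the model calls.

```python
import numpy as np

def sequence_cost(logits, targets):
    """Cross-entropy summed over time steps and averaged over the batch."""
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick the log-probability of the correct character at every position
    batch_idx, step_idx = np.indices(targets.shape)
    nll = -log_probs[batch_idx, step_idx, targets]  # [batch_size, num_steps]
    return nll.mean(axis=0).sum()

batch_size, num_steps, vocab_size = 4, 7, 50
# all-zero logits mean the model guesses uniformly over the vocabulary
logits = np.zeros((batch_size, num_steps, vocab_size))
targets = np.random.randint(vocab_size, size=(batch_size, num_steps))
cost = sequence_cost(logits, targets)
perplexity = np.exp(cost / num_steps)
print(perplexity)
```

A model that guesses uniformly has a perplexity equal to the vocabulary size, so an untrained model on our data should start near 4762 and a useful one should end far below it.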


## Define a Generator
We will also create a Generator Model to generate new poems. Note that it is much less complicated than the training model, since it needs no loss or optimizer.
However, we need to add a loop that feeds each predicted character back in as the next input for a fixed number of steps.

In [0]:
class MyGeneratorModel(object):
  def __init__(self, config):
    self._input = tf.placeholder(tf.int32, shape=[1], name="_input")
    self.batch_size = 1
    self.num_steps = config.num_steps
    rnn_size = config.hidden_size
    vocab_size = config.vocab_size
    embedding_size = config.embedding_size

    # Keep the embedding table and lookup on the CPU
    with tf.device("/cpu:0"):
      embedding_weights = tf.get_variable("embedding", \
                     [vocab_size, embedding_size])
      embed_inputs = tf.nn.embedding_lookup(embedding_weights, self._input)
      embed_inputs = tf.expand_dims(embed_inputs, 0)

    # build RNN using CudnnLSTM
    self._cell = CudnnLSTM(
        num_layers=config.num_layers,
        num_units=config.hidden_size,
        )

    # build final projection layer
    softmax_w = tf.get_variable("softmax_w", [rnn_size, vocab_size])
    softmax_b = tf.get_variable("softmax_b", [vocab_size])

    # Collect a sequence of output word IDs
    self._output_word_ids = []

    # Decode first word
    outputs, state = self._cell(embed_inputs)
    outputs = tf.reshape(outputs, [-1, config.hidden_size])
    logits = tf.nn.xw_plus_b(outputs, softmax_w, softmax_b)
    # Get input for next step
    next_input = tf.argmax(logits, axis=-1)
    next_input = tf.squeeze(next_input)
    self._output_word_ids.append(next_input)
    # Convert next input to word embeddings
    next_input = tf.nn.embedding_lookup(embedding_weights, next_input)
    next_input = tf.reshape(next_input, [1, 1, embedding_size])
    
    # Feed back to LSTM
    for _ in range(self.num_steps-1):
      outputs, state = self._cell(next_input, state)
      outputs = tf.reshape(outputs, [-1, config.hidden_size])
      logits = tf.nn.xw_plus_b(outputs, softmax_w, softmax_b)
      next_input = tf.argmax(logits, axis=-1)
      next_input = tf.squeeze(next_input)
      self._output_word_ids.append(next_input)

      next_input = tf.nn.embedding_lookup(embedding_weights, next_input)
      next_input = tf.reshape(next_input, [1, 1, embedding_size])

  @property
  def output_word_ids(self):
    return self._output_word_ids
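
Stripped of the TensorFlow details, the generator is a greedy decoding loop: take the most likely next character, then feed it back in. The sketch below shows that loop with a toy step function in place of the LSTM and projection layer. `greedy_decode`, `toy_step`, and the transition table are all made up for illustration.

```python
import numpy as np

def greedy_decode(step_fn, seed_id, num_steps, state=None):
    """Generate num_steps IDs by always picking the highest-scoring next ID."""
    output_ids = []
    token = seed_id
    for _ in range(num_steps):
        # step_fn returns the scores over the vocabulary and the new state
        logits, state = step_fn(token, state)
        token = int(np.argmax(logits))
        output_ids.append(token)
    return output_ids

# toy stand-in for the LSTM: row i holds the scores for what follows ID i
transition_logits = np.array([
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0, 0.0],
])

def toy_step(token, state):
    return transition_logits[token], state

generated = greedy_decode(toy_step, seed_id=0, num_steps=5)
print(generated)
```

Since `argmax` is deterministic, the same seed always produces the same poem. Sampling from the softmax distribution instead would give more varied output.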


## Define a call to generator
Again we need a decoder to translate word IDs back to words. We also need a procedure to communicate with the model: the `feed_dict` and `fetches` arguments of `session.run()` are the two ways to do that.

In [0]:
def decode_text(text, max_len_newline=5):
  words = [reverse_word_index.get(i, "<UNK>") for i in text]
  fixed_width_string = []

  for w_pos in range(len(words)):
    fixed_width_string.append(words[w_pos])
    if (w_pos+1) % max_len_newline == 0:
      fixed_width_string.append('\n')
  return ''.join(fixed_width_string)

def run_generator(session, model, seed_word, config):
  
  feed_to_model_dict = {
      model._input: [seed_word],
  }
  fetch_model_dict = {
      "output_word_ids": model.output_word_ids
  }

  # An example of sending and receiving data from the model
  vals = session.run(fetches=fetch_model_dict, feed_dict=feed_to_model_dict)
  output_word_ids = vals['output_word_ids']

  # Decode to readable words
  print(decode_text([seed_word] + output_word_ids, (config.num_steps+1)//4))
  return

## Main training controller
Finally, we define a controller that:
* Creates the model for training
* Creates the model for testing, which reuses the variables of the training model
* Prepares the input data
* Defines what to log during training
* Creates a `session` that communicates with the computation graph
* Optionally changes the learning rate
* Gets test set results
* Generates a new poem from a seed character


In [0]:
def main(_):

  with tf.Graph().as_default():
    initializer = tf.random_uniform_initializer(-0.1, 0.1)

    with tf.name_scope("Train"):
      # Create input producer
      train_input = PoemInput(poems_train, config, name="TrainInput")
      # Create the model instance
      with tf.variable_scope("Model", reuse=None, initializer=initializer):
        m = MySeq2SeqModel(is_training=True, config=config, input_=train_input)
      # Add information to logs
      tf.summary.scalar("Training_Loss", m.cost)
      tf.summary.scalar("Learning_Rate", m.lr)

    with tf.name_scope("Test"):
      eval_config = Hparam()
      eval_config.batch_size = 1
      eval_config.num_steps = 20

      # Create another input for test data
      # Note that eval_config was set locally
      test_input = PoemInput(poems_test, eval_config, name="TestInput")
      # Create another model but reuse the variables in the training model
      with tf.variable_scope("Model", reuse=True):
        mtest = MySeq2SeqModel(is_training=False, config=eval_config,
                         input_=test_input)

    with tf.name_scope("Gen"):
      generator_config = Hparam()
      generator_config.batch_size = 1
      generator_config.num_steps = 19
      # Create generator model
      with tf.variable_scope("Model", reuse=True):
        mgenerate = MyGeneratorModel(config=generator_config)

    # Hardware settings
    config_proto = tf.ConfigProto(allow_soft_placement=True)
    # Create a MonitoredTrainingSession that controls the training process
    # Also automatically logs and reports 
    # Note the `checkpoint_dir` setting
    with tf.train.MonitoredTrainingSession(checkpoint_dir="logs", \
                                           config=config_proto, \
                                           log_step_count_steps=-1) as session:
      for i in range(config.num_epochs_to_train):

        # Calculate learning rate decay
        lr_decay = config.lr_decay ** max(i + 1 - config.warmup_epochs, 0.0)
        # Set learning rate
        m.assign_lr(session, config.learning_rate * lr_decay)
        # Print new learning rate
        print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
        # Train one epoch and report loss
        train_perplexity = run_epoch(session, m, do_op=m.train_op, verbose=True)
        print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_perplexity))
      
      # End of training
      # Evaluate test set performance
      test_perplexity = run_epoch(session, mtest)
      print("Test Perplexity: %.3f" % test_perplexity)
      
      # Set a seed word and generate new poem
      seed_word = '天'
      run_generator(session, mgenerate, seed_word=word_index[seed_word], config=generator_config)
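
The learning rate set inside the training loop stays at its base value for the first `warmup_epochs` epochs and is then multiplied by `lr_decay` once per epoch. A small sketch that lists the schedule, using the values from `Hparam` (`learning_rate_schedule` is a helper written only for this illustration):

```python
def learning_rate_schedule(base_lr, lr_decay, warmup_epochs, num_epochs):
    """Learning rate per epoch, using the same formula as the training loop."""
    return [base_lr * lr_decay ** max(epoch + 1 - warmup_epochs, 0)
            for epoch in range(num_epochs)]

schedule = learning_rate_schedule(
    base_lr=1.0, lr_decay=0.9, warmup_epochs=3, num_epochs=20)
for epoch, lr in enumerate(schedule[:5], start=1):
    print("Epoch: %d Learning rate: %.3f" % (epoch, lr))
```

The first three epochs run at 1.000, then the rate drops to 0.900 and 0.810 in epochs 4 and 5.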

## Start training
We can actually start training by calling the controller.

In [20]:
main(1)

Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.range(limit).shuffle(limit).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensors(tensor).repeat(num_epochs)`.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipeli

We can see the training process shown here. Observe that training loss keeps decreasing, which means that the model is actually learning. 

Also, thanks to the speedup from CudnnLSTM, training can be very fast (> 100,000 characters per second). The basic LSTM only reaches ~6,000 characters per second.

If you are running this script locally, starting `tensorboard` and pointing it at the `logs` directory lets you see the loss plot over time. We cannot show that easily in the Colab environment.

You can also continue training by calling the controller again. Try this later and see if the generated poems get better over time.

## Clear previous output

TensorFlow will automatically load previous models if you specify a path for the `session`. However, that becomes a problem if you change parts of the model, e.g., the embedding size, the LSTM size, or the number of layers.

You will see something like 
```
INFO:tensorflow:Restoring parameters from logs/model.ckpt-4465
...
InvalidArgumentError: Assign requires shapes of both tensors to match.
```
Always remember to clear the output directory if you are experimenting with different model structures!

In [0]:
!rm -R logs
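
If you are not in a notebook, or want the cleanup to succeed even when the directory does not exist yet, the same thing can be done from Python. This is a small sketch: `clear_checkpoints` is a hypothetical helper, and the demo runs on a throwaway directory rather than the real `logs`.

```python
import os
import shutil
import tempfile

def clear_checkpoints(checkpoint_dir="logs"):
    """Delete the checkpoint directory; do nothing if it does not exist."""
    shutil.rmtree(checkpoint_dir, ignore_errors=True)

# demo on a temporary directory so that no real checkpoints are touched
demo_dir = os.path.join(tempfile.mkdtemp(), "logs")
os.makedirs(demo_dir)
clear_checkpoints(demo_dir)
still_there = os.path.exists(demo_dir)
print(still_there)
```

Calling `clear_checkpoints()` with no argument removes the `logs` directory used by the training controller above.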

# Summary
What we learned today:
1. Preprocessing for language modeling data
    * Create a dictionary that maps words to unique IDs
    * Convert words to ID
    * Reshape sequences to unified lengths
    * Create a helper to produce data
2. Building a model using tensorflow
    * Hyperparameters
    * Training operation
    * Testing operation
    * Control function
3. Training and evaluation
    * Observe loss
    * Evaluate on test set



# Appendix: connect your Google Drive to Colab for uploading your data

First, copy the file into Google Drive. Then run the following code to link your Drive to this notebook. Follow the link in the output.

In [22]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


Copy (`cp`) the file from `/gdrive` to this server.

In [0]:
!cp /gdrive/My\ Drive/Colab\ Notebooks/poetry.txt ./

In [24]:
!cp /gdrive/My\ Drive/Colab\ Notebooks/Book*.txt ./

cp: cannot stat '/gdrive/My Drive/Colab Notebooks/Book*.txt': No such file or directory
