# Generating Philosophy Texts by Using RNN

This is a project that uses RNN to generate new texts based ont trained texts. This project is based off one of the projects I have done at the Nanodegree program on Udacity. (See link below. Originally the project is about generating texts on Simpsons based on Tensorflow 1.0). A lot of software library packages and training techniques have been updated/deprecated, so I am thinking to reconstruct it in a new notebook, get it running again and see the results now. Although the generated texts are more or less nonsensical (oh well, just like every other philosophy text), based on the word pattern, we can still see how the RNN learns a seq2seq model. 

Here are some helpful links that on this issue:
* Karpathy, The Unreadonable Effectiveness of Recurrent Neural Networks: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
* Github Repo of Udacity's Deep Learning Nanodegree Foundation program: https://github.com/udacity/deep-learning


Here is how different parts of this notebook will be organized:
* Pre-processing
* Build the network
* Training
* Generating

## Pre-processing
### Check the environments
Make sure the Python version is 3.6 and gpu can be detected. (Haven't tested it on lower version of Python.)

In [3]:
import tensorflow as tf
import sys
print (sys.version)
print (tf.test.gpu_device_name())

  return f(*args, **kwds)


3.6.3 |Anaconda, Inc.| (default, Nov 20 2017, 20:41:42) 
[GCC 7.2.0]
/device:GPU:0


### Read in the txt file of the book

I've tried different books as the inputs. This time we use Kant's _Critique of Pure Reason_ . You can find the txt version of the book here: http://www.gutenberg.org/ebooks/4280

In [4]:
with open('data/kant_CPR.txt', 'r') as file:
    book = file.read()

### Lookup Table

In order to feed the texts into the neural network (i.e. represent the texts in the network), some important steps of pre-processing include to build a look-up table for each word.  

Depending on different ways that we want to build our network, there are other options for pre-precessing as well. Note that here we build a look-up table for each _word_, instead of each _character_. Both are valid choices for building seq2seq models. (Alternatively, I am not aware of whether people have doen seq2seq learning at the sentence level. But that does not seem be too interesting for generative models.) After all, the choice between word and character in building seq2seq models does determine how we encode the representations in the network, as we will see soon in the part of network building. 

In [5]:
def look_up_table(text):
    '''
    Generate a lookup table for all words in a text. Input should be the pre-processed list of strings(words).
    Return two dicts, which contain word:int and int:word as key:value pairs, respectively.
    '''
    vocab = set(text)
    vocab_to_int = {c: i for i, c in enumerate(vocab)}
    int_to_vocab = dict(enumerate(vocab))
    return vocab_to_int, int_to_vocab


In [6]:
def build_token_lookup():
    '''
    Build a token lookup table for special characters we encounter in the text. Used for both pre-processing and generating. 
    '''
    keys = ['.', ',', '"', ';', '!', '?', '(', ')', '--','\n'] 
    values = ['||Period||','||Comma||','||Quotation_Mark||','||Semicolon||','||Exclamation_mark||','||Question_mark||','||Left_Parentheses||','||Right_Parentheses||','||Dash||','||Return||']  
    return (dict(zip(keys,values)))

In [7]:
def main_pre_processing(book):
    '''
    Main function of pre-processing. 
    int_text is the encoded numerical representation of texts that we will use in the neural network.
    '''
    token_dict = build_token_lookup()
    for key, token in token_dict.items():
        book = book.replace(key, ' {} '.format(token))
    processed_book = book.lower().split()
    vocab_to_int, int_to_vocab = look_up_table(processed_book)
    int_text = [vocab_to_int[word] for word in processed_book]
    return token_dict, vocab_to_int, int_to_vocab, int_text

token_dict, vocab_to_int, int_to_vocab, int_text = main_pre_processing(book)

Now we can check what the encoded text looks like. Here are the first 200 encoded words of the text and the vocabulary size of the text. 

In [8]:
print (int_text[:200], max(int_text))

[6656, 2468, 2729, 3845, 65, 5025, 5025, 2817, 5545, 5643, 5025, 5025, 561, 2817, 3192, 3449, 2301, 3449, 5886, 3449, 6372, 5025, 5025, 5025, 5025, 5453, 6563, 6656, 301, 4936, 4726, 2198, 5025, 5025, 6291, 65, 4726, 1930, 5467, 715, 2729, 6025, 2328, 4726, 2122, 471, 493, 6563, 4629, 5025, 2944, 4726, 2140, 535, 4696, 5232, 4726, 2869, 5725, 5332, 6933, 2817, 6025, 92, 5025, 874, 4726, 6006, 2140, 535, 4696, 2627, 4726, 2869, 5725, 3450, 6586, 4584, 2729, 5025, 6656, 2125, 3449, 5025, 5025, 535, 3462, 2833, 632, 1170, 1253, 6652, 6975, 2729, 6025, 92, 3449, 535, 6073, 5025, 3143, 3962, 4726, 2140, 4696, 1873, 460, 3143, 1930, 6656, 4816, 2729, 5025, 1606, 4726, 6491, 6656, 1524, 6491, 4170, 2729, 2140, 5332, 4726, 6033, 6656, 6248, 5025, 166, 4726, 1407, 2817, 1606, 3449, 3143, 2569, 3962, 535, 4702, 4726, 1930, 5025, 588, 6563, 6656, 5292, 2729, 6025, 92, 874, 4726, 6563, 4982, 5816, 6491, 5200, 6510, 5025, 379, 3449, 6006, 535, 6244, 3081, 6217, 4726, 1930, 632, 49, 4726, 6025, 4980

## Build the Network

We build this RNN with LSTM cells. We first initial inputs, targets, and learning rates by using tensorflow's Placeholder function. get_init_cell is the function where we put a sequence of LSTM cells together to form the initial state of the network. These cells will also be used in running a dynamic RNN. 

In [9]:
def get_inputs():
    """
    Create TF Placeholders for input, targets, and learning rate.
    :return: Tuple (input, targets, learning rate)
    """
    input = tf.placeholder(tf.int32, [None, None],name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')  
    learning_rate = tf.placeholder(tf.float32, name='learningrate')
    input_tuple = (input, targets, learning_rate)
    
    return input_tuple

In the method _get_init_cell_, there are different options of LSTM cells from Tensorflow for the line here: `lstm = tf.contrib.rnn.LSTMCell(rnn_size)  `. We can use `BasicLSTMCell`, `LayerNormBasicLSTMCell`(which has layer normalization and recurrent dropout), or even `GRUCell` (but I haven't tested on this). For more information on varieties of cells in RNN, see https://www.tensorflow.org/api_docs/python/tf/contrib/rnn

In MultiRNNCell, the argument I use is `[lstm]*2`, which means we double the size/layers of the network. People usually choose 2 or 3. It is from "Tips and Tricks" for training RNN by Andrej Karpathy. See his README here: https://github.com/karpathy/char-rnn


In [10]:
def get_init_cell(batch_size, rnn_size):
    """
    Create an RNN Cell and initialize it. 
    :param batch_size: Size of batches
    :param rnn_size: Size of RNNs
    :return: Tuple (cell, initialize state)
    """
    lstm = tf.contrib.rnn.LSTMCell(rnn_size)  
    cell = tf.contrib.rnn.MultiRNNCell([lstm]*2)    
    initial_state = cell.zero_state(batch_size, tf.float32)
    initial_state = tf.identity(initial_state, name = 'initial_state')
    
    return cell, initial_state

### Get word embeddings
The signicance of this part is that we want to encode the representation of the text into the neural network. In the preprocessing part, we have already transform the document into a numerical representation and each word in the vocabulary of the document corresponds to a natural number. However, as we have notice, there are some 7000 words in the vocabulary in total and if we use one-hot encoding (which is what people would usually do in char2char model since there are only 52 upper and lower case letters and even if adding special characters the dimensions still won't be so big.) But it is different fromt the basic unit in the seq2seq model, which is each word in the vocabulary. Thus, we need to abandon one-hot encoding and embed these thousands of words into lower dimensions using Tensorflow's embedding_lookup function. 

In [11]:
def get_embed(input_data, vocab_size, embed_dim):
    """
    Embed input data for Tensorflow
    :param input_data: TF placeholder for text input.
    :param vocab_size: Number of words in vocabulary.
    :param embed_dim: Number of embedding dimensions
    :return: Embedded input.
    """
    embeddings = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, input_data)
    return embed

### Build RNN and the whole network

Now finally it's time for us to build the network. Unsurprisingly, RNN is the main part of the whole network. We use ` tf.nn.dynamic_rnn` function here. Note that in the Deep Learning Foundation program provided by Udacity, the unittest for this method can only pass in Tensorflow version <1.1, which seems dated now. There is probably something needed to be fixed in the unittest code. However, as I run through this process, it does not really make a difference. As I will show later, we just need to tweak a little thing when running the generative model in the end. 

And this issue seems never to be solved so far (see: https://github.com/udacity/deep-learning/issues/216).

In [12]:
def build_rnn(cell, inputs):
    """
    Create a RNN using a RNN Cell
    :param cell: RNN Cell
    :param inputs: Input text data
    :return: Tuple (Outputs, Final State)
    """
    outputs, final_state = tf.nn.dynamic_rnn(cell,  inputs, dtype=tf.float32)
    final_state = tf.identity(final_state, name = 'final_state')
    return outputs, final_state


In [13]:
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    """
    Build part ofRNN cell
    :param rnn_size: Size of rnns
    :param input_data: Input data
    :param vocab_size the neural network
    :param cell: : Vocabulary size
    :param embed_dim: Number of embedding dimensions
    :return: Tuple (Logits, FinalState)
    """
    
    input_data = get_embed(input_data, vocab_size, rnn_size)
    outputs, final_state = build_rnn(cell, input_data)
    logits = tf.contrib.layers.fully_connected (outputs, vocab_size)
    return logits, final_state


Next step is to get batches. This is for running the neural network. As I have mentioned before, we are building a sequence to sequence model here. In particular, we should map a certain length (determined by `seq_length`) of sequences of words to a sequences of words with the same length right after. Depending on the batch_size, the shapes vary. Here is a good example directly provided in the original Udacity notebook -- `get_batches([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 3, 2)` would return a Numpy array of the following:
```
[
  # First Batch
  [
    # Batch of Input
    [[ 1  2], [ 7  8], [13 14]]
    # Batch of targets
    [[ 2  3], [ 8  9], [14 15]]
  ]

  # Second Batch
  [
    # Batch of Input
    [[ 3  4], [ 9 10], [15 16]]
    # Batch of targets
    [[ 4  5], [10 11], [16 17]]
  ]

  # Third Batch
  [
    # Batch of Input
    [[ 5  6], [11 12], [17 18]]
    # Batch of targets
    [[ 6  7], [12 13], [18  1]]
  ]
]
```

Notice that in the third batch, the last element of _Batch of Targets_ is 1, which is the same with the first element in the _Batch of Input_ for the first batch. Because if we map a sequence of words to the another sequence of words right after, we will encounter problems for the last batch. Thus we map it back to the first element (taken care of by the line `ydata[-1] = xdata[0]  `) Alternatively, we can just ignore the last sequence. 

In [14]:
def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    characters_per_batch = batch_size * seq_length
    n_batches = len(int_text)//characters_per_batch
    
    xdata = np.array(int_text[: n_batches * characters_per_batch])
    ydata = np.array(int_text[1: n_batches * characters_per_batch + 1])

    x_batches = np.split(xdata.reshape(batch_size, -1), n_batches, 1)
    y_batches = np.split(ydata.reshape(batch_size, -1), n_batches, 1)
    
    ydata[-1] = xdata[0]  
 
    return np.array(list(zip(x_batches, y_batches)))


## Training

Setting the parameters for training here. As usual, if we lower the batch size, the training takes longer. But it is not the only important factor that determines the time and performance. `rnn_size` is another important one. I set `num_epochs` to be 150 in this setting because the loss function gets steady after 100 or so.

In [15]:
# Number of Epochs
num_epochs = 150
# Batch Size
batch_size = 32
# RNN Size
rnn_size = 512
# Embedding Dimension Size
embed_dim = 400
# Sequence Length
seq_length = 40
# Learning Rate
learning_rate = 0.003
# Show stats for every n number of batches
show_every_n_batches = 300

### Build the Graph

One important thing in building this graph is the softmax function. It takes the logits as the argument and output the probabilities for word selection. 

A common type of problem in training a vanilla RNN is the vanishing/exploding gradient. LSTM can usually help alleviate these issues. For even better results, we apply Gradient Clipping on the Adam Optimizer (mainly to avoid exploding).

In [16]:
from tensorflow.contrib import seq2seq

training = tf.Graph()
with training.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(logits, targets, tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

### Running the model

Time to run the model! We run the model and save it locally (for the purpose the generating texts later). Usually, the best I have done so far is to lower the loss to around 2.5. The optimal loss value also depends on the sizes of the texts and the network.

In [17]:
import numpy as np
batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=training) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})
        for batch_i, (x, y) in enumerate(batches):
            feed = {input_text: x, targets: y, initial_state: state, lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)
            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch: ', epoch_i, 'Batch: ', batch_i,'/', len(batches), 'train sequence loss: ', train_loss)

    # Save Model
    saver = tf.train.Saver()
    save_dir = './save'
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

Epoch:  0 Batch:  0 / 205 train sequence loss:  8.868579
Epoch:  1 Batch:  95 / 205 train sequence loss:  6.013385
Epoch:  2 Batch:  190 / 205 train sequence loss:  5.3977532
Epoch:  4 Batch:  80 / 205 train sequence loss:  5.1920447
Epoch:  5 Batch:  175 / 205 train sequence loss:  4.935922
Epoch:  7 Batch:  65 / 205 train sequence loss:  4.8224144
Epoch:  8 Batch:  160 / 205 train sequence loss:  4.8773246
Epoch:  10 Batch:  50 / 205 train sequence loss:  4.496931
Epoch:  11 Batch:  145 / 205 train sequence loss:  4.5065074
Epoch:  13 Batch:  35 / 205 train sequence loss:  4.35735
Epoch:  14 Batch:  130 / 205 train sequence loss:  4.214121
Epoch:  16 Batch:  20 / 205 train sequence loss:  3.9536355
Epoch:  17 Batch:  115 / 205 train sequence loss:  4.0311327
Epoch:  19 Batch:  5 / 205 train sequence loss:  3.776989
Epoch:  20 Batch:  100 / 205 train sequence loss:  3.8227055
Epoch:  21 Batch:  195 / 205 train sequence loss:  3.7022438
Epoch:  23 Batch:  85 / 205 train sequence loss: 

## Generating Texts
### Helper functions

There are two helper functions for generating texts. `get_tensors` is to load the saved graph by Tensorflow. What we in the saved graph are input, initial_state, final_state, and probabilities, which are all from the results of running the RNN. 

The method `pick_word` is for picking another word after a word is chosen. And the distribution for the words to be selected is based on the probabilities of the next word. We choose the first one.

In [18]:
def get_tensors(loaded_graph):
    """
    Get input, initial state, final state, and probabilities tensor from <loaded_graph>
    :param loaded_graph: TensorFlow graph loaded from file
    :return: Tuple (InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor)
    """
    input_tensor =  loaded_graph.get_tensor_by_name("input:0")
    i_s_tensor =  loaded_graph.get_tensor_by_name("initial_state:0")
    f_s_tensor =  loaded_graph.get_tensor_by_name("final_state:0")
    probs_tensor =  loaded_graph.get_tensor_by_name("probs:0")
    print (i_s_tensor, f_s_tensor)
    return input_tensor, i_s_tensor, f_s_tensor, probs_tensor


In [19]:
def pick_word(probabilities, int_to_vocab):
    """
    Pick the next word in the generated text
    :param probabilities: Probabilites of the next word
    :param int_to_vocab: Dictionary of word ids as the keys and words as the values
    :return: String of the predicted word
    """
    predicted = np.random.choice(list(int_to_vocab.values()), 1, p = probabilities)[0]
    return predicted


## Generating Process

Now it is the time for us to initiate a tf session again and generate predicated texts. We randomly select some words from the dictionary as the prime word (which is the beginning of the generated text). Based on the sequence length and the trained model, we feed input into the session. When picking a predicted word, for Tensorflow version >1.1, we need to have `pred_word = pick_word(probabilities[0][dyn_seq_length-1], int_to_vocab)`, instead of `pred_word = pick_word(probabilities[dyn_seq_length-1], int_to_vocab)` (which is the line Udacity originally has.) We need to add `[0]` for the probabilities tensor. 

In [25]:
import random
gen_length = 600

prime_word = int_to_vocab[np.random.choice(list(vocab_to_int.values()))]
print ('The chosen prime word is ', prime_word)

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_dir + '.meta')
    loader.restore(sess, save_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word]
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})
    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])
        # Get Prediction
        probabilities, prev_state = sess.run([probs, final_state],{input_text: dyn_input, initial_state: prev_state})      
        pred_word = pick_word(probabilities[0][dyn_seq_length-1], int_to_vocab)
        gen_sentences.append(pred_word)    
    # Remove tokens
    gen_texts = ' '.join(gen_sentences)
    for key, token in token_dict.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        gen_texts = gen_texts.replace(' ' + token.lower(), key)
    gen_texts = gen_texts.replace('\n ', '\n')
    gen_texts = gen_texts.replace('( ', '(')

    print(gen_texts)

The chosen prime word is  substitution
INFO:tensorflow:Restoring parameters from ./save
Tensor("initial_state:0", shape=(2, 2, ?, 512), dtype=float32) Tensor("final_state:0", shape=(2, 2, ?, 512), dtype=float32)
substitution of truth, but
attributed to the use of reluctant cases, that friends and refuses,
is it more preception secondly, is representative into the
he-goat, but observers of a rash and intelligibilis, to subject's all
the expansion which excuse to unconnected with in our
chart of the forlorn and ball of approximate, but to drives it
changed, the principle of which costs from shield it
enjoyed. a impassable trademark/copyright datum speciosa, and that, the still key in our
differences succeed + descendants by the sense in time. reliable.
impetus learner must be subject, therefore, to be remain pretty matters not
on it, as the variable, and not nature by steeled, the
remarkably polysyllogistica, which is insignia of a merely 4557 appeals to an
comprehensibility of a equitab