# M2177.003100 Deep Learning <br> Assignment #3 Part 3: Language Modeling with CharRNN

Copyright (C) Data Science Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Sang-gil Lee, October 2018

This is a character-level language model using recurrent neural networks (RNNs).
It has become very popular as a starter kit for learning how RNNs work in practice.

Before we start, what is "language modeling" anyway? Intuitively, "language modeling" means teaching the model the probability distribution of our words and sentences, i.e. which sequences are likely and which are not.

So we ask the model something like: "hey, just say whatever words come out of your estimate of the Wikipedia word distribution", and the model responds with something like: "ok, I learned from Wikipedia, and the most frequent word is "the", so let me start with "the". The Wikipedia is blah blah blah".

Thus, by teaching the model to speak for itself, we can test the model's capability of learning temporal relationships within sequences.
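
To make "estimate a distribution, then sample from it" concrete, here is a minimal sketch of about the simplest character-level language model there is: a bigram model that only counts which character follows which. It is not part of the assignment; the toy text and the function names (`fit_bigram`, `next_char_distribution`, `generate`) are made up for this illustration. Roughly speaking, the CharRNN replaces the count table with an RNN that conditions on the whole preceding sequence instead of just one character.

```python
import random
from collections import Counter, defaultdict


def fit_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def next_char_distribution(counts, prev):
    """Turn the raw follow-up counts into probabilities P(next | prev)."""
    total = sum(counts[prev].values())
    return {ch: n / total for ch, n in counts[prev].items()}


def generate(counts, start, length, seed=0):
    """Sample up to `length` characters, one at a time, starting from `start`."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        dist = next_char_distribution(counts, out[-1])
        if not dist:  # this character never had a successor in the text
            break
        chars, probs = zip(*dist.items())
        out.append(rng.choices(chars, weights=probs, k=1)[0])
    return "".join(out)


# hypothetical toy corpus, only for illustration
toy_text = "the cat sat on the mat. the rat sat on the hat."
bigram_counts = fit_bigram(toy_text)

print(next_char_distribution(bigram_counts, "t"))
print(generate(bigram_counts, "t", 40))
```

The generated text looks vaguely like the corpus but has no memory beyond one character, which is exactly the limitation an RNN is meant to overcome.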

Original blog post & code:
https://github.com/karpathy/char-rnn
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

But the original code is written in Lua Torch, which looks less pretty :(

There is a clean port of char-RNN in TensorFlow:
https://github.com/sherjilozair/char-rnn-tensorflow
This IPython notebook is basically a copypasta of this repo.

That said, you are allowed to copy-paste code from the original repo.
HOWEVER, <font color=red> try to implement the model yourself first </font>, and consider the original source code as a last resort.
You will learn a lot while wrapping your head around the implementation, and you will understand the nuts and bolts of RNNs more clearly at the code level.

### AND MOST IMPORTANTLY, IF YOU JUST BLINDLY COPY PASTE THE CODE, YOU SHALL RUIN YOUR EXAM.
### The exam is designed to be solvable for students who have actually written the code themselves.
At least re-type the code from the original repo line by line, and make sure you thoroughly understand what each line means.

## YOU HAVE BEEN WARNED. :)



### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  
Once you have completed **all of Assignment Parts 1-5**, run the *CollectSubmission.sh* script with your **Team number** as the input argument. <br>
This will produce a zipped file called *[Your team number].zip*. Please submit this file on ETL. &nbsp;&nbsp; (Usage: ./*CollectSubmission.sh* team_#)

### Character language modeling (20 points)

This assignment is graded on an on/off basis: just make this notebook **"work"** without problems by:

1. implementing **1. \_\_init\_\_()** and **2. sample()** of RNN **Model()** class from **char_rnn.py**

2. Briefly summarizing, at the end of the script, how you implemented the model and why you changed other parts of the code. Yes, <font color=red> there are other intentional pitfalls inside the code </font>. Just copy-pasting the \_\_init\_\_() will not work. Can you tell me why?

### The Definition of **"work"** is as follows:

1. The training loss must be <font color=red> below 0.2 </font>. We will check the training loss output from the training code block. We don't split the data into train/valid/test sets. Don't forget to <font color=red> NOT clear the output from train(args)</font>, where the training loss is printed! The TAs will check the logged output from train(args).

2. Calling sample(args.sample) in the last code block <font color=red> must generate some meaningful sentences </font>. The quality of the sentences does not count, unless the generated text is something like "aaaaaaaaaaaaaabbbbbb" or "b" u tlttfcwaU c  fGcnrh i.\nh mt he!bsthpme".



Now proceed to the code.

In [1]:
# ipython magic function for limiting the gpu to be seen for tensorflow
# if you have just 1 GPU, specify the value to 0
# if you have multiple GPUs (nut) and want to specify which GPU to use, specify this value to 0 or 1 or etc.
%env CUDA_DEVICE_ORDER = PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES = 1

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=1


In [3]:
# load a bunch of libraries
from __future__ import print_function
import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.contrib import legacy_seq2seq
import numpy as np
import argparse
import time
import os
from six.moves import cPickle
from six import text_type
import sys

# this module is from the utils.py file of this folder
# it handles loading texts to digits (aka. tokens) which are recognizable for the model
from utils import TextLoader

# this module is from the char_rnn.py file of this folder
# the task is implementing the CharRNN inside the class definition from this file
from char_rnn import Model

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# for TensorFlow vram efficiency: if this is not specified, the model hogs all the VRAM even if it's not necessary
# bad & greedy TF! but it has a reason for this design choice FWIW, try googling it if interested
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

In [5]:
# argparsing
parser = argparse.ArgumentParser(
                    formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# Data and model checkpoints directories
parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare',
                    help='data directory containing input.txt with training examples')
parser.add_argument('--save_dir', type=str, default='models_char_rnn',
                    help='directory to store checkpointed models')
parser.add_argument('--save_every', type=int, default=1000,
                    help='Save frequency. Number of passes between checkpoints of the model.')
parser.add_argument('--init_from', type=str, default=None,
                    help="""continue training from saved model at this path (usually "save").
                        Path must contain files saved by previous training process:
                        'config.pkl'        : configuration;
                        'chars_vocab.pkl'   : vocabulary definitions;
                        'checkpoint'        : paths to model file(s) (created by tf).
                                              Note: this file contains absolute paths, be careful when moving files around;
                        'model.ckpt-*'      : file(s) with model definition (created by tf)
                         Model params must be the same between multiple runs (model, rnn_size, num_layers and seq_length).
                    """)
# Model params
parser.add_argument('--model', type=str, default='lstm',
                    help='lstm, rnn, gru, or nas')
parser.add_argument('--rnn_size', type=int, default=128,
                    help='size of RNN hidden state')
parser.add_argument('--num_layers', type=int, default=5,
                    help='number of layers in the RNN')
# Optimization
parser.add_argument('--seq_length', type=int, default=500,
                    help='RNN sequence length. Number of timesteps to unroll for.')
parser.add_argument('--batch_size', type=int, default=5,
                    help="""minibatch size. Number of sequences propagated through the network in parallel.
                            Pick batch-sizes to fully leverage the GPU (e.g. until the memory is filled up)
                            commonly in the range 10-500.""")
parser.add_argument('--num_epochs', type=int, default=50,
                    help='number of epochs. Number of full passes through the training examples.')
parser.add_argument('--grad_clip', type=float, default=20.,
                    help='clip gradients at this value')
parser.add_argument('--learning_rate', type=float, default=0.01,
                    help='learning rate')
parser.add_argument('--decay_rate', type=float, default=0.7,
                    help='decay rate for rmsprop')
parser.add_argument('--output_keep_prob', type=float, default=0.1,
                    help='probability of keeping weights in the hidden layer')
parser.add_argument('--input_keep_prob', type=float, default=0.1,
                    help='probability of keeping weights in the input layer')

# needed for argparsing within jupyter notebook
# https://stackoverflow.com/questions/30656777/how-to-call-module-written-with-argparse-in-ipython-notebook
sys.argv = ['-f']
args = parser.parse_args()

# print args: see if the hyperparameters look pretty to you
args

Namespace(batch_size=5, data_dir='data/tinyshakespeare', decay_rate=0.7, grad_clip=20.0, init_from=None, input_keep_prob=0.1, learning_rate=0.01, model='lstm', num_epochs=50, num_layers=5, output_keep_prob=0.1, rnn_size=128, save_dir='models_char_rnn', save_every=1000, seq_length=500)

In [39]:
# protip: always check the data and poke around the data yourself
# you will get a lot of insights by looking at the data
args.seq_length=100
data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
data_loader.reset_batch_pointer()

x, y = data_loader.next_batch()

# our data has a shape of (N, T), where N=batch_size and T=seq_length
print(x)
print(x.shape)
print(y)
print(y.shape)

loading preprocessed files
[[49  9  7 ... 50  3 13]
 [ 0  5  1 ... 19  9  2]
 [24  0  3 ...  6 23  3]
 ...
 [ 7  9 27 ... 13  2  0]
 [ 4  2  0 ...  0  8  3]
 [11  4 19 ... 30  6  0]]
(50, 100)
[[ 9  7  6 ...  3 13  0]
 [ 5  1  0 ...  9  2 15]
 [ 0  3  7 ... 23  3  9]
 ...
 [ 9 27  4 ...  2  0  2]
 [ 2  0 12 ...  8  3  2]
 [ 4 19  1 ...  6  0 11]]
(50, 100)


In [40]:
# see what the first entry of the batch looks like
print(x[0])
print(y[0])
# y is just an x shifted to the left by one: so the network will predict the next token y given x. 

[49  9  7  6  2  0 37  9  2  9 57  1  8 24 10 43  1 18  3  7  1  0 17  1
  0 23  7  3 19  1  1 12  0  4  8 15  0 18 13  7  2  5  1  7 16  0  5  1
  4  7  0 14  1  0  6 23  1  4 28 25 10 10 26 11 11 24 10 35 23  1  4 28
 16  0  6 23  1  4 28 25 10 10 49  9  7  6  2  0 37  9  2  9 57  1  8 24
 10 50  3 13]
[ 9  7  6  2  0 37  9  2  9 57  1  8 24 10 43  1 18  3  7  1  0 17  1  0
 23  7  3 19  1  1 12  0  4  8 15  0 18 13  7  2  5  1  7 16  0  5  1  4
  7  0 14  1  0  6 23  1  4 28 25 10 10 26 11 11 24 10 35 23  1  4 28 16
  0  6 23  1  4 28 25 10 10 49  9  7  6  2  0 37  9  2  9 57  1  8 24 10
 50  3 13  0]
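
The relationship between x and y can be reproduced on a toy array. The sketch below is only an illustration in NumPy: the variable names, the toy sizes, and the wrap-around used for the very last target are assumptions of this example, not necessarily what `TextLoader` in utils.py does.

```python
import numpy as np

tokens = np.arange(10)          # stand-in for an encoded text: 0, 1, ..., 9
batch_size, seq_length = 2, 4   # toy values, not the notebook's settings

# keep only as many tokens as fit into whole (batch_size, seq_length) batches
usable = batch_size * seq_length
xdata = tokens[:usable]

# targets are the inputs shifted left by one position
ydata = np.copy(xdata)
ydata[:-1] = xdata[1:]
ydata[-1] = xdata[0]            # assumption: wrap around for the last target

x = xdata.reshape(batch_size, seq_length)
y = ydata.reshape(batch_size, seq_length)

print(x)
print(y)
```

At every position the target is simply the character that comes next in the text, so the network is trained to predict y[n, t] from x[n, 0..t].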


In [41]:
# training loop definition
def train(args):
    # set new args values first, so that the data loader, the saved config.pkl
    # and the model all use the same hyperparameters
    args.model = 'nas'
    args.rnn_size = 256
    args.num_layers = 2
    args.seq_length = 100
    args.batch_size = 50
    args.num_epochs = 50
    args.grad_clip = 5.0
    args.learning_rate = 2e-3
    args.decay_rate = 0.97
    args.output_keep_prob = 1.0
    args.input_keep_prob = 1.0

    data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
    args.vocab_size = data_loader.vocab_size
    print("vocabulary size: " + str(args.vocab_size))

    # check compatibility if training is continued from previously saved model
    if args.init_from is not None:
        # check if all necessary files exist
        assert os.path.isdir(args.init_from)," %s must be a path" % args.init_from
        assert os.path.isfile(os.path.join(args.init_from,"config.pkl")),"config.pkl file does not exist in path %s"%args.init_from
        assert os.path.isfile(os.path.join(args.init_from,"chars_vocab.pkl")),"chars_vocab.pkl file does not exist in path %s" % args.init_from
        ckpt = tf.train.latest_checkpoint(args.init_from)
        assert ckpt, "No checkpoint found"

        # open old config and check if models are compatible
        with open(os.path.join(args.init_from, 'config.pkl'), 'rb') as f:
            saved_model_args = cPickle.load(f)
        need_be_same = ["model", "rnn_size", "num_layers", "seq_length"]
        for checkme in need_be_same:
            assert vars(saved_model_args)[checkme]==vars(args)[checkme],"Command line argument and saved model disagree on '%s' "%checkme

        # open saved vocab/dict and check if vocabs/dicts are compatible
        with open(os.path.join(args.init_from, 'chars_vocab.pkl'), 'rb') as f:
            saved_chars, saved_vocab = cPickle.load(f)
        assert saved_chars==data_loader.chars, "Data and loaded model disagree on character set!"
        assert saved_vocab==data_loader.vocab, "Data and loaded model disagree on dictionary mappings!"

    # save the config and vocabulary that sample_eval() will load later
    if not os.path.isdir(args.save_dir):
        os.makedirs(args.save_dir)
    with open(os.path.join(args.save_dir, 'config.pkl'), 'wb') as f:
        cPickle.dump(args, f)
    with open(os.path.join(args.save_dir, 'chars_vocab.pkl'), 'wb') as f:
        cPickle.dump((data_loader.chars, data_loader.vocab), f)

    print("building the model... may take some time...")
    ##################### This line builds the CharRNN model defined in char_rnn.py #####################
    tf.reset_default_graph()
    model = Model(args)
    print("model built! starting training...")

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver(tf.global_variables(), max_to_keep=1)
        # restore model
        if args.init_from is not None:
            saver.restore(sess, ckpt)
        for e in range(args.num_epochs):
            sess.run(tf.assign(model.lr,
                               args.learning_rate * (args.decay_rate ** e)))
            data_loader.reset_batch_pointer()
            state = sess.run(model.initial_state)
            
            for b in range(int(data_loader.num_batches)):
                start = time.time()
                x, y = data_loader.next_batch()
                feed = {model.input_data: x, model.targets: y}
                for i, (c, h) in enumerate(model.initial_state):
                    feed[c] = state[i].c
                    feed[h] = state[i].h

                train_loss, state, _ = sess.run([model.cost, model.final_state, model.train_op], feed)

                end = time.time()
                
                # print training log every 100 steps
                if ((e * data_loader.num_batches + b) % 100 == 0):
                    print("{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}"
                          .format(e * data_loader.num_batches + b,
                                  args.num_epochs * data_loader.num_batches,
                                  e, train_loss, end - start))
                if (e * data_loader.num_batches + b) % args.save_every == 0\
                        or (e == args.num_epochs-1 and
                            b == data_loader.num_batches-1):
                    # save for the last result
                    checkpoint_path = os.path.join(args.save_dir, 'model.ckpt')
                    saver.save(sess, checkpoint_path,
                               global_step=e * data_loader.num_batches + b)
                    print("model saved to {}".format(checkpoint_path))
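
The learning rate assigned to `model.lr` at the start of every epoch in the loop above is `args.learning_rate * (args.decay_rate ** e)`. As a quick sanity check, the sketch below evaluates that expression outside TensorFlow for the values set in `train()`; `lr_at_epoch` is a helper name introduced only for this illustration.

```python
learning_rate = 2e-3   # initial learning rate set in train()
decay_rate = 0.97      # per-epoch decay factor set in train()
num_epochs = 50


def lr_at_epoch(epoch):
    """Exponentially decayed learning rate used during the given epoch."""
    return learning_rate * (decay_rate ** epoch)


schedule = [lr_at_epoch(e) for e in range(num_epochs)]

# print a few representative epochs
for e in (0, 1, 10, 49):
    print("epoch {:2d}: lr = {:.6f}".format(e, schedule[e]))
```

With these values the learning rate shrinks by 3% per epoch and ends at a little under a quarter of its initial value in the last epoch.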


In [42]:
# let's train!
train(args)

loading preprocessed files
vocabulary size: 65
building the model... may take some time...
model built! starting training...
0/11150 (epoch 0), train_loss = 4.181, time/batch = 1.186
model saved to models_char_rnn/model.ckpt
100/11150 (epoch 0), train_loss = 2.430, time/batch = 0.204
200/11150 (epoch 0), train_loss = 2.073, time/batch = 0.206
300/11150 (epoch 1), train_loss = 1.885, time/batch = 0.206
400/11150 (epoch 1), train_loss = 1.737, time/batch = 0.205
500/11150 (epoch 2), train_loss = 1.653, time/batch = 0.206
600/11150 (epoch 2), train_loss = 1.654, time/batch = 0.204
700/11150 (epoch 3), train_loss = 1.586, time/batch = 0.206
800/11150 (epoch 3), train_loss = 1.529, time/batch = 0.206
900/11150 (epoch 4), train_loss = 1.529, time/batch = 0.207
1000/11150 (epoch 4), train_loss = 1.475, time/batch = 0.207
model saved to models_char_rnn/model.ckpt
1100/11150 (epoch 4), train_loss = 1.450, time/batch = 0.208
1200/11150 (epoch 5), train_loss = 1.477, time/batch = 0.205
1300/11150
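
A rough way to read these numbers, assuming `model.cost` is the mean cross-entropy per character measured in nats (natural logarithm), which the notebook does not state explicitly: a model that guesses uniformly over the 65 characters would score ln(65), and exp(-loss) is the geometric-mean probability the model assigns to the correct next character. The helper name `loss_to_avg_prob` is made up for this sketch.

```python
import math

vocab_size = 65  # "vocabulary size: 65" from the training log

# loss of a model that assigns probability 1/vocab_size to every character
uniform_loss = math.log(vocab_size)


def loss_to_avg_prob(loss):
    """Geometric-mean probability of the correct character for a given loss."""
    return math.exp(-loss)


print("uniform-guess loss: {:.3f}".format(uniform_loss))
for loss in (4.181, 1.450, 0.2):
    print("loss {:.3f} -> avg. probability of the correct char {:.3f}"
          .format(loss, loss_to_avg_prob(loss)))
```

Under that assumption, the first logged loss of 4.181 is what an untrained model guessing almost uniformly would produce, and the 0.2 target corresponds to the model putting roughly 80% probability on the correct next character on average.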

In [45]:
# we're done with the model. the weights are now safe inside our storage
! ls {args.save_dir}

# so begone all the ops, graphs & variables!
# you may ask, why is this line needed? try commenting out the line and see what happens later in the sampling stage
tf.reset_default_graph()

chars_vocab.pkl  config.pkl			       model.ckpt-11149.index
checkpoint	 model.ckpt-11149.data-00000-of-00001  model.ckpt-11149.meta


## Evaluation

<font color=red>**Your model can be evaluated without the training procedure,**</font> if you saved and loaded your model properly.

In [46]:
%env CUDA_DEVICE_ORDER = PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES = 0

# load a bunch of libraries
from __future__ import print_function
import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.contrib import legacy_seq2seq
import numpy as np
import argparse
import time
import os
from six.moves import cPickle
from six import text_type
import sys

# this module is from the utils.py file of this folder
# it handles loading texts to digits (aka. tokens) which are recognizable for the model
from utils import TextLoader

# this module is from the char_rnn.py file of this folder
# the task is implementing the CharRNN inside the class definition from this file
from char_rnn import Model

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# sampling definition for evaluation phase
# it uses the saved model and spits out some characters from the RNN model
def sample_eval(args):
    with open(os.path.join(args.save_dir, 'config.pkl'), 'rb') as f:
        saved_args = cPickle.load(f)
    with open(os.path.join(args.save_dir, 'chars_vocab.pkl'), 'rb') as f:
        chars, vocab = cPickle.load(f)
    #Use most frequent char if no prime is given
    if args.prime == '':
        args.prime = chars[0]
    model = Model(saved_args, training=False)
    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        saver = tf.train.Saver(tf.global_variables())
        ckpt = tf.train.get_checkpoint_state(args.save_dir)
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
            # print only the sampled text; a second argument to print() would be printed literally after it
            print(str(model.sample(sess, chars, vocab, args.n, args.prime, sampling_type=2)))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [47]:
# argparsing for sampling from the model
parser_sample = argparse.ArgumentParser(
                   formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser_sample.add_argument('--save_dir', type=str, default='models_char_rnn',
                    help='model directory to store checkpointed models')
parser_sample.add_argument('-n', type=int, default=500,
                    help='number of characters to sample')
parser_sample.add_argument('--prime', type=text_type, default=u'',
                    help='prime text')
sys.argv = ['-f']
args_sample = parser_sample.parse_args()
args_sample

Namespace(n=500, prime='', save_dir='models_char_rnn')

In [48]:
# let's sample!
sample_eval(args_sample)
tf.reset_default_graph()

INFO:tensorflow:Restoring parameters from models_char_rnn/model.ckpt-11149
 Dond Thomas Clarence that our death?

BUCKINGHAM:
Why, then I began his son we have found me in heaven,
And set the holy reason knows not for death.

KING EDWARD IV:
Why, sir, we will unto your grace hath been
The man Paris like to his enemies.

BUCKINGHAM:
Why, then can be content to dry like a day.

KING RICHARD III:
Stand by the law to his wife with my wife,
And set the law makes a man and proud Angelo.

BIANCA:
Why, then now the Tower to look to Doctor,
And he shall be so far graced and drea utf-8


## Briefly summarize what & how you did, and why you did that way.
This is also a way to check for yourself whether you really learned something from this assignment.

* CharRNN model:
    - We use a NAS cell as the base RNN cell, which we found to give better results than the LSTM cell.
* Other argument values:
    - rnn_size = hidden size of the RNN cell = 256
    - num_layers = 2 (we use 2 RNN layers; with 5 layers the results were bad)
    - batch_size = 50, seq_length = 100 (each input contains 100 characters), num_epochs = 50
    - initial learning rate = 2e-3, decay rate = 0.97
* Steps to build the CharRNN model:
    - Step 0: Combine the RNN cell layers, each wrapped with dropout, into a single multi-layer cell (using TensorFlow's DropoutWrapper and MultiRNNCell).
    - Step 1: Declare the input and the target as tensors of shape (batch_size, seq_length). The initial state of the RNN cells is set to the zero state.
    - Step 2: Map the input characters to embedding vectors (using tf.nn.embedding_lookup), which gives a tensor of shape (batch_size, seq_length, rnn_size).
    - Step 3: Unstack the embedded input along the time axis so that it fits the RNN decoder => (seq_length, batch_size, rnn_size), i.e. one (batch_size, rnn_size) tensor per time step.
    - Step 4: Use an RNN decoder (TensorFlow's legacy_seq2seq.rnn_decoder) to generate the outputs and the final state, starting from the initial state declared above.
    - Step 5: Flatten the outputs of the RNN cells and apply a softmax layer to obtain softmax outputs that can be compared with the targets.
    - Step 6: Compute the loss as the log loss, averaged over the batch.
    - Step 7: Compute the gradients and apply them to all trainable variables using the Adam optimizer.
* Training:
    - We decay the learning rate after every epoch.
    - Despite our best efforts, the training loss only goes down to around 1.1; we could not get it below 1.0.
* Sampling:
    - To sample a paragraph from a prime character, we feed the input characters into the trained model one after another and get its output (a softmax vector of vocabulary size) and the final state. The softmax vector is then turned into a sampled character, which is fed back into the model as the next input character.
    - The output character is chosen either by taking the highest softmax probability or, if the previous character is a prime character or a space, by a weighted random pick over the softmax probabilities.
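
The two ways of turning a softmax vector into a character mentioned under Sampling can be sketched in NumPy as follows. This is not the code from char_rnn.py: the function names and the meaning given to the `sampling_type` values (0 = always the most likely character, 1 = always a weighted pick, 2 = weighted pick only after a space) are assumptions of this sketch, and the prime-character case is left out for brevity.

```python
import numpy as np


def weighted_pick(probs, rng):
    """Draw an index at random, with probability proportional to probs."""
    cumulative = np.cumsum(probs)
    # pick a point in [0, total) and find the first bucket that contains it
    point = rng.random() * cumulative[-1]
    idx = int(np.searchsorted(cumulative, point, side="right"))
    return min(idx, len(probs) - 1)  # guard against floating-point edge cases


def pick_next(probs, prev_char, sampling_type, rng):
    """Choose the index of the next character from a softmax vector."""
    if sampling_type == 0:
        return int(np.argmax(probs))      # always the most likely character
    if sampling_type == 2 and prev_char != " ":
        return int(np.argmax(probs))      # greedy inside a word
    return weighted_pick(probs, rng)      # random pick (always, or after a space)


# hypothetical three-character vocabulary and one softmax output
chars = ["a", "b", " "]
probs = np.array([0.1, 0.7, 0.2])
rng = np.random.default_rng(0)

print(repr(chars[pick_next(probs, "a", 0, rng)]))   # greedy choice
print(repr(chars[pick_next(probs, " ", 2, rng)]))   # random choice after a space
```

Always taking the argmax tends to produce repetitive loops, while always sampling produces more spelling mistakes; mixing the two is one way to trade these off.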