A clean, no-frills character-level generative language model.
Created by Danijar Hafner (danijar.com), edited by Chip Huyen
for the class CS 20SI: "TensorFlow for Deep Learning Research"
Based on Andrej Karpathy's blog post:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # hide TensorFlow INFO and WARNING messages
import sys
sys.path.append('..')

import time

import tensorflow as tf

import utils

In [2]:
DATA_PATH = 'data/arvix_abstracts.txt'
HIDDEN_SIZE = 200
BATCH_SIZE = 64
NUM_STEPS = 50
SKIP_STEP = 40
TEMPRATURE = 0.7
LR = 0.003
LEN_GENERATED = 300

In [3]:
def vocab_encode(text, vocab):
    return [vocab.index(x) + 1 for x in text if x in vocab]

In [4]:
def vocab_decode(array, vocab):
    return ''.join([vocab[x - 1] for x in array])
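
As a quick check of the two helpers above, the sketch below round-trips a short string. It redefines both functions so that it runs on its own; the three-character `toy_vocab` and the sample text are made up for illustration. Indices start at 1, which leaves 0 free for the padding used in `read_data`, and characters outside the vocabulary are silently dropped.

```python
def vocab_encode(text, vocab):
    # 1-based indices; characters that are not in vocab are skipped
    return [vocab.index(x) + 1 for x in text if x in vocab]


def vocab_decode(array, vocab):
    # inverse mapping: index i refers to vocab[i - 1]
    return ''.join([vocab[x - 1] for x in array])


toy_vocab = "abc"  # hypothetical vocabulary, for illustration only
encoded = vocab_encode("cab!", toy_vocab)  # '!' is not in toy_vocab
decoded = vocab_decode(encoded, toy_vocab)
print(encoded, decoded)
```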

In [5]:
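def read_data(filename, vocab, window=NUM_STEPS, overlap=NUM_STEPS//2):
    # Yield fixed-length windows of encoded characters, one input line at a time.
    # Note: `overlap` is the step between consecutive window starts.
    with open(filename) as f:
        for text in f:
            text = vocab_encode(text, vocab)
            for start in range(0, len(text) - window, overlap):
                chunk = text[start: start + window]
                # pad with 0 so every chunk has exactly `window` entries
                chunk += [0] * (window - len(chunk))
                yield chunk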
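
To see which windows the slicing in `read_data` produces, the sketch below applies the same loop to one already-encoded line. The helper name `windows`, the ten-symbol line, and the small `window` and `overlap` values are assumptions chosen to keep the output readable.

```python
def windows(encoded, window, overlap):
    # Same slicing as read_data, applied to a single encoded line
    for start in range(0, len(encoded) - window, overlap):
        chunk = encoded[start: start + window]
        chunk += [0] * (window - len(chunk))  # pad with 0 if the chunk is short
        yield chunk


line = list(range(1, 11))  # hypothetical encoded line of 10 symbols
chunks = list(windows(line, window=4, overlap=2))
short = list(windows([1, 2, 3], window=4, overlap=2))
print(chunks)
print(short)
```

With these values the window starts are 0, 2 and 4, so consecutive chunks share two symbols. A line that is not longer than `window` yields no chunk at all.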

In [6]:
def read_batch(stream, batch_size=BATCH_SIZE):
    batch = []
    for element in stream:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # skip the final batch if it is empty
        yield batch
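
A small usage sketch of the batching generator, run on a toy stream of integers instead of encoded text. The function is redefined here so the example is self-contained; in this sketch the final partial batch is yielded only when it is non-empty.

```python
def read_batch(stream, batch_size):
    batch = []
    for element in stream:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # the last batch may be smaller than batch_size
        yield batch


# hypothetical streams: 7 and 6 elements, batch size 3
uneven = list(read_batch(iter(range(7)), batch_size=3))
even = list(read_batch(iter(range(6)), batch_size=3))
print(uneven)
print(even)
```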

In [7]:
def create_rnn(seq, hidden_size=HIDDEN_SIZE):
    cell = tf.contrib.rnn.GRUCell(hidden_size)
    in_state = tf.placeholder_with_default(
            cell.zero_state(tf.shape(seq)[0], tf.float32), [None, hidden_size])
    # Compute the real length of each sequence: a time step counts toward the
    # length if its vector has a nonzero entry. All sequences are padded to
    # the same length, NUM_STEPS.
    length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
    output, out_state = tf.nn.dynamic_rnn(cell, seq, length, in_state)
    return output, in_state, out_state
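
The length computation chains three reductions: `tf.sign`, then a max over the feature axis, then a sum over the time axis. The plain-Python sketch below mirrors those steps on nested lists. The toy batch is made up, and it assumes that padded time steps are all-zero vectors.

```python
def sign(x):
    # -1, 0 or 1, like tf.sign for scalars
    return (x > 0) - (x < 0)


def sequence_lengths(batch):
    # batch[b][t] is the feature vector at time step t of sequence b
    lengths = []
    for sequence in batch:
        # for non-negative inputs: 1 if the step has any nonzero entry, else 0
        used = [max(sign(v) for v in step) for step in sequence]
        lengths.append(sum(used))  # sum over the time axis
    return lengths


toy = [
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],  # two real steps, one all-zero step
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],  # one real step, two all-zero steps
]
lengths = sequence_lengths(toy)
print(lengths)
```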

In [8]:
def create_model(seq, temp, vocab, hidden=HIDDEN_SIZE):
    seq = tf.one_hot(seq, len(vocab))
    output, in_state, out_state = create_rnn(seq, hidden)
    # fully_connected is syntactic sugar for tf.matmul(output, w) + b
    # it creates w and b for us; passing None as the activation keeps the layer linear
    logits = tf.contrib.layers.fully_connected(output, len(vocab), None)
    # the output at step t is scored against the character at step t + 1
    loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(
        logits=logits[:, :-1], labels=seq[:, 1:]))
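    # sample the next character from a Maxwell-Boltzmann distribution with temperature temp
    # note: tf.multinomial treats its input as unnormalized log-probabilities, so
    # dropping tf.exp and passing logits[:, -1] / temp also gives a valid sampler
    sample = tf.multinomial(tf.exp(logits[:, -1] / temp), 1)[:, 0]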
    return loss, sample, in_state, out_state
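
Temperature sampling draws the next character with probability proportional to exp(logit / temp), so a lower temperature concentrates the mass on the highest-scoring character. The sketch below is a plain-Python stand-in for that idea, not a copy of the TensorFlow sampling line above. The function names, `toy_logits`, and the random seed are assumptions.

```python
import math
import random


def softmax_with_temperature(logits, temp):
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temp, rng):
    # draw one index according to the tempered probabilities
    probs = softmax_with_temperature(logits, temp)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three characters
sharp = softmax_with_temperature(toy_logits, 0.7)
flat = softmax_with_temperature(toy_logits, 2.0)
rng = random.Random(0)  # fixed seed so repeated runs draw the same indices
draws = [sample_index(toy_logits, 0.7, rng) for _ in range(5)]
print(sharp, flat, draws)
```

Comparing `sharp` with `flat` shows the effect of the temperature: the most likely character gets a larger share at 0.7 than at 2.0.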

In [9]:
def training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state):
    saver = tf.train.Saver()
    start = time.time()
    with tf.Session() as sess:
        writer = tf.summary.FileWriter('graphs/gist', sess.graph)
        sess.run(tf.global_variables_initializer())
        
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/arvix/checkpoint'))
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
        
        iteration = global_step.eval()
        for batch in read_batch(read_data(DATA_PATH, vocab)):
            batch_loss, _ = sess.run([loss, optimizer], {seq: batch})
            if (iteration + 1) % SKIP_STEP == 0:
                print('Iter {}. \n    Loss {}. Time {}'.format(iteration, batch_loss, time.time() - start))
                online_inference(sess, vocab, seq, sample, temp, in_state, out_state)
                start = time.time()
                saver.save(sess, 'checkpoints/arvix/char-rnn', iteration)
            iteration += 1
        writer.close()
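
The condition `(iteration + 1) % SKIP_STEP == 0` fires on every 40th batch, counting from 1, so each report carries an iteration number one below a multiple of 40. A minimal check, with the upper bound of 130 chosen arbitrarily:

```python
SKIP_STEP = 40

# iterations (0-based) at which the training loop would print and checkpoint
reported = [i for i in range(130) if (i + 1) % SKIP_STEP == 0]
print(reported)
```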

In [10]:
def online_inference(sess, vocab, seq, sample, temp, in_state, out_state, seed='T'):
    """ Generate a sequence one character at a time, feeding back the
    previous character and the RNN state
    """
    sentence = seed
    state = None
    for _ in range(LEN_GENERATED):
        batch = [vocab_encode(sentence[-1], vocab)]
        feed = {seq: batch, temp: TEMPRATURE}
        # on the first step the state is None, so in_state falls back to its zero-state default
        if state is not None:
            feed.update({in_state: state})
        index, state = sess.run([sample, out_state], feed)
        sentence += vocab_decode(index, vocab)
    print(sentence)
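
The generation loop threads two things from step to step: the last index produced and the recurrent state. The sketch below shows that pattern without TensorFlow. `toy_step` is a hypothetical, deterministic stand-in for `sess.run([sample, out_state], feed)` and has nothing to do with the trained model.

```python
def toy_step(index, state):
    # Hypothetical stand-in for one model step: returns the next index
    # and an updated state.
    if state is None:
        state = 0  # plays the role of the default zero state
    next_index = (index + state) % 3 + 1
    return next_index, state + 1


def generate(seed_index, length):
    indices = [seed_index]
    state = None  # no state is passed on the first step
    for _ in range(length):
        # feed only the last index, plus the state from the previous step
        next_index, state = toy_step(indices[-1], state)
        indices.append(next_index)
    return indices


generated = generate(1, 4)
print(generated)
```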

In [11]:
vocab = (
            " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "\\^_abcdefghijklmnopqrstuvwxyz{|}")
seq = tf.placeholder(tf.int32, [None, None])
temp = tf.placeholder(tf.float32)
loss, sample, in_state, out_state = create_model(seq, temp, vocab)
global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
optimizer = tf.train.AdamOptimizer(LR).minimize(loss, global_step=global_step)
utils.make_dir('checkpoints')
utils.make_dir('checkpoints/arvix')
training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state)

Iter 39. 
    Loss 9412.35546875. Time 8.740547895431519
T_FX   e  e  e                                                                                              e     e  e  e        e  e  e    e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e  e 
Iter 79. 
    Loss 8205.4765625. Time 8.216433763504028
TX te the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the
Iter 119. 
    Loss 7403.85205078125. Time 8.627424001693726
The the the the the the the the the the the the the the the the the seraling and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and a

Iter 959. 
    Loss 3589.32275390625. Time 9.002385139465332
The computational results on a simple convergence rates of the network problem in over general waik deep neural networks to learn a deep networks that are computation in the convergence rates of the network problem in the convergence rates of the network problem in the convergence rates of the networ
Iter 999. 
    Loss 3225.7978515625. Time 8.245115041732788
The speech recognition to train a deep learning machine learning and the state-of-the-art on the speed up the state-of-the-art on the speed up the state-of-the-art on the speed up the state-of-the-art on the speed up the state-of-the-art on the speed up the state-of-the-art on the speed up the state-
Iter 1039. 
    Loss 3577.48828125. Time 9.034195184707642
The massively a propose a new architectures shaling to a particlly learned for the to the sequence of the network architecture of the network architecture of the network architecture of the network architecture of th

Iter 1879. 
    Loss 2834.0107421875. Time 8.23366379737854
The spare of the art on the explore the approach capled the training and the different trained with the about 10 can be used to train and the classification performance on a single machine learning tasks size of the and theory is decoding learning to a single machine learning tasks. However, improves
Iter 1919. 
    Loss 2757.153076171875. Time 7.796314001083374
The recurrent neural networks are alts in the recent results on a method for training deep neural networks (RNNs) as a probabilistic framework for each neuron learning architectures by a pre-train non-convex optimization are standard formation of the arcaiest of pointic models and formal sentiment an
Iter 1959. 
    Loss 2880.18505859375. Time 7.692707777023315
The infinet from the layers and a new approach for posterior problem in the for derension for state-of-the-art model of the state-of-the-art model for the layers and a new approach for posterior problem in the f

Iter 2799. 
    Loss 2230.43798828125. Time 8.294495105743408
The proposed method for show that the proposed method for show that the proposed method for show that the proposed method for show that the proposed method for show that the proposed method for show that the proposed method for show that the proposed method for show that the proposed method for show 
Iter 2839. 
    Loss 2539.95068359375. Time 8.853570938110352
The approach for domain adaptation, in the descres in standard pooling operators much etsempond model and the on the convergence ratestive from RPM (RN) model as a standard for model and the ensemble of some points, is the new approaches for large visializedity proposed maxout units co presented as a
Iter 2879. 
    Loss 2228.732421875. Time 9.524260759353638
The seas effective of simple framework can be used to train a single model and computation of the sequence of the set of units, and stability techniques from the sequence of the sease of the search for a general 

Iter 3719. 
    Loss 2170.693359375. Time 7.042011022567749
The based on CIFAR-100 and main ans (ASR) consision for speech recognizer which acoustic layers, such as any distribution from an experiments learning rates of convolutional neural networks and dropout difficulties is effectively speedup deep learning wise larger network prant convex optimization pro
Iter 3759. 
    Loss 2318.3310546875. Time 7.5480897426605225
The recurrent layers can learn complex types of maximum inplications with a correctly processing are nonlinear sequence layers. Searning largy near-nover several datasets. We show deeper implrcatings to the normal sanclyins on a probo the connections that the network, and use of defensive distillatio
Iter 3799. 
    Loss 2247.143798828125. Time 7.095857858657837
The approaches on the network is a fixed-point into results are art non-convex optimization of each particaler of a neural network to learn network computer vision in the network depth and stochastic restricture

In [11]:
NUM_STEPS = 1

In [12]:
def create_rnn(seq, hidden_size=HIDDEN_SIZE):
    cell = tf.contrib.rnn.GRUCell(hidden_size)
    in_state = tf.placeholder_with_default(
            cell.zero_state(tf.shape(seq)[0], tf.float32), [None, hidden_size])
    # Compute the real length of each sequence: a time step counts toward the
    # length if its vector has a nonzero entry. All sequences are padded to
    # the same length, NUM_STEPS. Each intermediate is printed for debugging.
    sign = tf.sign(seq)
    sign = tf.Print(sign, [sign], 'sign = ')
    reduce_max = tf.reduce_max(sign, 2)
    reduce_max = tf.Print(reduce_max, [reduce_max], 'reduce_max = ')
    length = tf.reduce_sum(reduce_max, 1)
    length = tf.Print(length, [length], 'length = ')
    output, out_state = tf.nn.dynamic_rnn(cell, seq, length, in_state)
    return output, in_state, out_state

In [None]:
vocab = (
            " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "\\^_abcdefghijklmnopqrstuvwxyz{|}")
seq = tf.placeholder(tf.int32, [None, None])
temp = tf.placeholder(tf.float32)
loss, sample, in_state, out_state = create_model(seq, temp, vocab)
global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
optimizer = tf.train.AdamOptimizer(LR).minimize(loss, global_step=global_step)
utils.make_dir('checkpoints')
utils.make_dir('checkpoints/arvix')
training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state)

INFO:tensorflow:Restoring parameters from checkpoints/arvix/char-rnn-4639
Iter 4679. 
    Loss 1972.266845703125. Time 8.373229026794434
The experiments show that our neural network (RNN) architecture that the full minibability of error in training deep neural networks. Using variants of SVRG for new approximatery of sequent algorithms to train neural networks (DNNs) as achure, theoretical features wish maxout unit (Good to rectifit t
Iter 4719. 
    Loss 1856.330322265625. Time 8.259500980377197
The approach pooring complexity of the network's depth for neural networks the conventional complexity of the network's depth and then the finst such as convolutional and the network preserve to a log-order improvement in an eximen and non-convex optimization problem in the convolutional and the netw
Iter 4759. 
    Loss 1999.20458984375. Time 8.258482933044434
The recurrent and layer with a computational and examples in standard and state of the recently proposed deep are rates, and the class

Iter 5599. 
    Loss 1759.8946533203125. Time 7.352729082107544
The backpropagation of the first structure-cenvertinancy of the first structure-cinvex guminnspars. In this observation with no form of the network sizes, huch results compared to previnus dur computation in the fact of convolutional neural networks which has the potential to unseen data stochastic g
Iter 5639. 
    Loss 1683.31640625. Time 8.305567026138306
The system intented datasets (MNIST, USP, hus neural networks (DNNs) as the training set of a single model and their speaker and Krylov subspace stale. Rpp decal contrall layers and computational complexity. The algorithm is extensive to a constrained approximate an examples, the predictive machine l
Iter 5679. 
    Loss 1921.0531005859375. Time 7.801876068115234
The approach implements the target data to complex models at the teacher local optima, and function the model simple locally a finel-toadering. Recans, and the autoencoder's paper, we propose a new present an 