# Recurrent Neural Networks

**Recurrent Neural Networks** (RNNs) introduce cycles into the network and, with them, a notion of time.  
![rnn](rnn.png)

We can unroll the RNN as follows:

![unrolled_rnn](unrolled_rnn.png)

### How to compute the forward and backward propagation for a recurrent neural network?

For every timestep $t$, $h_t$ and $y_t$ are computed with the following formulas:

$$h_t = \text{tanh}(W_{hh}h_{t-1} + W_{xh}x_t)$$

$$y_t = W_{hy}h_t$$

Note that the same function and the same parameters are used at every timestep. Moreover, we need to compute across time before calculating higher-level parameters.
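The recurrence above can be sketched in plain Python. This is a minimal illustration, not the notebook's TensorFlow model: the helper names `matvec` and `rnn_forward`, the toy weights and the toy input sequence are all assumptions made up for the example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_forward(xs, h0, W_hh, W_xh, W_hy):
    """Run a vanilla RNN over the sequence xs, reusing the same weights at every timestep."""
    h = h0
    hs, ys = [], []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_hh, h), matvec(W_xh, x))]
        h = [math.tanh(p) for p in pre]   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        hs.append(h)
        ys.append(matvec(W_hy, h))        # y_t = W_hy h_t
    return hs, ys

# Toy sizes (assumed): 2 hidden units, 1 input feature, 1 output, 3 timesteps.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
W_hy = [[1.0, -1.0]]
xs = [[1.0], [0.0], [-1.0]]
hs, ys = rnn_forward(xs, [0.0, 0.0], W_hh, W_xh, W_hy)
print(len(hs), len(ys))  # one hidden state and one output per timestep
```

The loop makes the weight sharing explicit: `W_hh`, `W_xh` and `W_hy` are created once and reused for every element of `xs`, and each $h_t$ can only be computed after $h_{t-1}$.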


# Types of Recurrent Neural Networks

## Data Input

Recall that for a non-recurrent neural network, the input data is $x \in \mathbb{R}^{N \times M}$, where $N$ is the number of data points and $M$ is the number of features.  

For a recurrent neural network, we add an additional dimension $T$, which represents the *timestep*.  
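To make the extra dimension concrete, here is a small sketch of how variable-length sequences can be brought to a common length $T$ so that a batch becomes rectangular. The helpers `pad_to_length` and `batch_shape` and the toy word IDs are hypothetical. The zero padding mirrors the caption labels returned by `input_fn` later in this notebook, which have shape `(128, 180)`.

```python
def pad_to_length(seq, T, pad_value=0):
    """Right-pad (or truncate) a list of token IDs to exactly T entries."""
    return (list(seq) + [pad_value] * T)[:T]

def batch_shape(batch):
    """Return the (N, T) shape of a rectangular list-of-lists batch."""
    return (len(batch), len(batch[0]))

# Three toy captions of different lengths, already mapped to word IDs.
captions = [[4, 17, 9], [8, 2], [5, 5, 5, 1, 3]]
T = 6
batch = [pad_to_length(c, T) for c in captions]
print(batch_shape(batch))  # (3, 6)
```

Once every word ID is replaced by an $M$-dimensional embedding vector, the batch has shape $(N, T, M)$.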

In [1]:
# (Example network for image captioning in tensorboard)
import os

from coco_input import *
import matplotlib.pyplot as plt
import PIL
import pandas as pd
import tensorflow as tf
data_dir = '/home/karen/workspace/data/cocodataset/'
coco_data = CocoCaptionData(data_dir)


In [2]:
caption_data, features = coco_data.sample_data()

In [4]:
word2id = os.path.join(data_dir, "word_to_id.csv")
word_map = pd.read_csv(word2id)

In [5]:
# Keep only the word column of each (index, word) row.
word_map = [v for _, v in word_map.values]

In [29]:
VOCAB_SIZE = len(word_map)

In [30]:
print(VOCAB_SIZE)

35848


In [None]:
# caption_labels = pd.read_csv(coco_data.labels_file)


In [None]:
# print(caption_labels.values[0])
# image_id = [i[-1] for i in caption_labels.values]
# caption_ids = [np.trim_zeros(i[1:-2]) for i in caption_labels.values]
 
# print(image_id[0])
# print(caption_ids[0])
# file_name = np.array([np.array(features[i]) for i in image_id])


In [None]:
# ds = tf.data.Dataset.from_tensor_slices((dict(filename=file_name), caption_ids))
 

In [None]:
# image_id = level[-1]
# s = np.trim_zeros(level[1:-2])


In [None]:
# def get_vocab(caption_data):
#     vocab_data = set()
#     for c in caption_data:
#         for v in c.split(" "):
#             v = v.lower()
#             v = v.split(".")[0]
#             vocab_data.add(v)
#     return list(vocab_data)

In [None]:
# l = get_vocab(caption_data["caption"])

In [None]:
# df = pd.DataFrame(np.array(l))
# df.to_csv(os.path.join(data_dir, "word_to_id.csv"))

In [None]:
# word_map = {w:i for i, w in enumerate(l)}

In [None]:
# caption_array = []
# for c in caption_data["caption"]:
#     s = []
#     for v in c.split(" "):
#         v = v.lower()
#         v = v.split(".")[0]
        
#         s.append(int(word_map[v]))
#     length = len(s)
#     s = np.pad(np.array(s), (0, 180-length), "constant")
#     caption_array.append(np.array(s))
    

In [None]:
# for i, t in enumerate(caption_data["image_id"]):
#     caption_array[i] = np.append(caption_array[i], [t])

In [None]:
# img_caption = pd.DataFrame.from_dict(caption_array)
# img_caption.to_csv(os.path.join(data_dir, "caption_int.csv"))

In [None]:
# randint = np.random.randint(len(features))
# image_id = caption_data["image_id"][randint]
# caption = caption_data["caption"][randint]
# sample_image_id = features
# # img_url = features[sample_image_id]
# filename = sample_image_id[image_id]
# # d = coco_data.dict_data
# caption


In [None]:
im = PIL.Image.open(img)
plt.imshow(im)
print(im.size)
plt.show()
print(caption)

In [6]:
input_fn = coco_data.coco_input(data_dir, True, False, False, 1, 500, 128)

In [7]:
input_fn()

({'filename': <tf.Tensor 'IteratorGetNext:0' shape=(128,) dtype=string>,
  'img': <tf.Tensor 'IteratorGetNext:1' shape=(128, 224, 224, 3) dtype=float32>},
 <tf.Tensor 'IteratorGetNext:2' shape=(128, 180) dtype=int64>)

### Sample Models with RNN

![onetomany](12many.png)

**Image captioning**: image captioning takes a single input (the image) and produces output from many recurrent cells (one per caption word).
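The one-to-many pattern can be sketched as follows. This is a simplified illustration and not the `resnet_rnn` model defined below: it assumes the single input is used only as the initial hidden state, that there is no per-step input, and that the function name `one_to_many`, the weights and the 3-word vocabulary are made up for the example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def one_to_many(feature, W_hh, W_hy, num_steps):
    """Unroll num_steps recurrent cells from a single input vector.

    The input is used once, as the initial hidden state h_0;
    every cell then emits one score vector (e.g. over a vocabulary).
    """
    h = list(feature)
    outputs = []
    for _ in range(num_steps):
        h = [math.tanh(p) for p in matvec(W_hh, h)]  # no per-step input x_t
        outputs.append(matvec(W_hy, h))
    return outputs

# Toy setup (assumed): 2-dimensional "image feature", 3-word vocabulary, 4 output steps.
feature = [1.0, -0.5]
W_hh = [[0.6, -0.8], [0.8, 0.6]]
W_hy = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
scores = one_to_many(feature, W_hh, W_hy, num_steps=4)
# Pick the highest-scoring word ID at every step.
word_ids = [max(range(len(s)), key=s.__getitem__) for s in scores]
print(len(scores), word_ids)
```

One input vector goes in and `num_steps` score vectors come out, one per recurrent cell.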

### ResNet for Image Model

In [8]:
from resnet50 import resnet_model_no_last_layer
from run_model import get_available_gpus, run_model_fn, per_device_batch_size

In [46]:
def resnet_rnn(inputs, labels, is_training, use_batchnorm, data_format, name="resnet_rnn"):
    feature_map = resnet_model_no_last_layer(inputs, is_training, use_batchnorm, data_format, name)
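    # feature_map is an N x 2048 tensor.
    initial_state = feature_map
    _, state_size = initial_state.shape
    print(state_size)
    print(VOCAB_SIZE)
    # Randomly initialised embedding table: one 2048-dimensional vector per word ID.
    rand = tf.random_uniform([VOCAB_SIZE, 2048], -1.0, 1.0)
    embeddings = tf.Variable(rand)

    # Map each caption word ID to its embedding: (N, T) -> (N, T, 2048).
    embed = tf.nn.embedding_lookup(embeddings, labels)

    print(embed.shape)
    # Note: the RNN below starts from a zero state; feature_map only supplies state_size.
    return rnn_zero_state(state_size, embed)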
    
def rnn_zero_state(state_size, data):
    cell1 = tf.nn.rnn_cell.BasicRNNCell(state_size)
    cell2 = tf.nn.rnn_cell.BasicRNNCell(state_size)
    multi_cell = tf.nn.rnn_cell.MultiRNNCell([cell1, cell2])
    outputs, state = tf.nn.dynamic_rnn(cell=multi_cell, inputs=data, dtype=tf.float32)
    return outputs, state



In [47]:
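def run_model_fn(features, labels, mode, model, name, use_batchnorm, is_hydra, data_format):
    # weight_decay, learning_rate_fn, momentum and loss_scale are not defined in this
    # notebook; they are assumed to be defined elsewhere (e.g. as module-level hyperparameters).
    features = features["img"]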
    # Generate a summary node for the images
    tf.summary.image('images', features, max_outputs=6)
    
    outputs, state = model(features, labels, mode == tf.estimator.ModeKeys.TRAIN, use_batchnorm, data_format, name)
    # This acts as a no-op if the logits are already in fp32 (provided logits are
    # not a SparseTensor). If dtype is low precision, logits must be cast to
    # fp32 for numerical stability.
    logits = tf.cast(outputs, tf.float32)

    predictions = {
      # The RNN outputs are (N, T, units): take the argmax over the last axis, not the time axis.
      'classes': tf.argmax(logits, axis=-1),
      'probabilities': tf.nn.softmax(logits, name='softmax_tensor')
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        # Return the predictions and the specification for serving a SavedModel
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions=predictions,
            export_outputs={
                'predict': tf.estimator.export.PredictOutput(predictions)
            })

    # Calculate loss, which includes softmax cross entropy and L2 regularization.
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)

    # Create a tensor named cross_entropy for logging purposes.
    tf.identity(cross_entropy, name='cross_entropy')
    tf.summary.scalar('cross_entropy', cross_entropy)


    # Add weight decay to the loss.
    l2_loss = weight_decay * tf.add_n(
      # loss is computed using fp32 for numerical stability.
      [tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables()])
    tf.summary.scalar('l2_loss', l2_loss)
    loss = cross_entropy + l2_loss

    if mode == tf.estimator.ModeKeys.TRAIN:
        global_step = tf.train.get_or_create_global_step()

        learning_rate = learning_rate_fn(global_step)

        # Create a tensor named learning_rate for logging purposes
        tf.identity(learning_rate, name='learning_rate')
        tf.summary.scalar('learning_rate', learning_rate)

        optimizer = tf.train.MomentumOptimizer(
            learning_rate=learning_rate,
            momentum=momentum
        )

        if loss_scale != 1:
            # When computing fp16 gradients, often intermediate tensor values are
            # so small, they underflow to 0. To avoid this, we multiply the loss by
            # loss_scale to make these tensor values loss_scale times bigger.
            scaled_grad_vars = optimizer.compute_gradients(loss * loss_scale)

            # Once the gradient computation is complete we can scale the gradients
            # back to the correct scale before passing them to the optimizer.
            unscaled_grad_vars = [(grad / loss_scale, var)
                                for grad, var in scaled_grad_vars]
            minimize_op = optimizer.apply_gradients(unscaled_grad_vars, global_step)
        else:
            minimize_op = optimizer.minimize(loss, global_step)

        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        train_op = tf.group(minimize_op, update_ops)
    else:
        train_op = None

    if not tf.contrib.distribute.has_distribution_strategy():
        accuracy = tf.metrics.accuracy(labels, predictions['classes'])
    else:
        # Metrics are currently not compatible with distribution strategies during
        # training. This does not affect the overall performance of the model.
        accuracy = (tf.no_op(), tf.constant(0))

    metrics = {'accuracy': accuracy}

    # Create a tensor named train_accuracy for logging purposes
    tf.identity(accuracy[1], name='train_accuracy')
    tf.summary.scalar('train_accuracy', accuracy[1])

    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=predictions,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=metrics)

In [48]:
model_fn = lambda features, labels, mode: run_model_fn(
    features, labels, mode, resnet_rnn, "resnet_rnn", True, False, "channels_first")

In [49]:
# hardware configuration
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
run_config = tf.estimator.RunConfig(session_config=config)

# Definition of Estimators
classifier = tf.estimator.Estimator(model_fn=model_fn, 
                                    model_dir="tmp/test_model", 
                                    config=run_config,
                                    params=None)

INFO:tensorflow:Using config: {'_model_dir': 'tmp/test_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
  allow_growth: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8a9643d438>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [None]:
classifier.train(input_fn, max_steps=1)

## Problems with RNN

### Exploding/Vanishing Gradients

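Recall that $h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t)$ and $y_t = W_{hy}h_t$. Let $\hat{h}_t = W_{hh}h_{t-1} + W_{xh}x_t$ denote the pre-activation, so that $h_t = \tanh(\hat{h}_t)$.

Using the Jacobians (chain rule), for a loss $L$:

$$J_{L}(h_{t-1}) = J_L(h_t) J_{h_t}(h_{t-1})$$

$$J_{h_t}(h_{t-1}) = \text{diag}\left(\frac{1}{\cosh^2 \hat{h}_t}\right)W_{hh}$$

Since $\frac{1}{\cosh^2 \hat{h}_t} \leq 1$, the activation term can only shrink the gradient, whereas the entries of $W_{hh}$ can be arbitrarily large or small.  

Since $W_{hh}$ is multiplied in at every step of the gradient derivation, the gradient with respect to $h_{t-k}$ contains $k$ factors of $W_{hh}$, i.e. it behaves like a power of $W_{hh}$.

If $\Lambda^{\max}_{W_{hh}} > 1$, where $\Lambda^{\max}_{W_{hh}}$ is the largest eigenvalue magnitude of $W_{hh}$, the gradient can grow exponentially (exploding).

If $\Lambda^{\max}_{W_{hh}} < 1$, the gradient tends to shrink exponentially (vanishing).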
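A small numerical experiment illustrates both regimes. It is a sketch under simplifying assumptions: the tanh derivative factor (which is at most 1) is ignored, so only the repeated multiplication by $W_{hh}^T$ is simulated, and $W_{hh}$ is chosen as a scaled 2x2 rotation matrix whose eigenvalues all have magnitude `s`. The helper names `backprop_norms` and `scaled_rotation` are made up for the example.

```python
import math

def backprop_norms(W, g, steps):
    """Norm of the gradient vector g after each repeated multiplication by W^T."""
    norms = []
    for _ in range(steps):
        # (W^T g)_j = sum_i W[i][j] * g[i]
        g = [sum(W[i][j] * g[i] for i in range(len(g))) for j in range(len(W[0]))]
        norms.append(math.sqrt(sum(v * v for v in g)))
    return norms

def scaled_rotation(s, theta=0.3):
    """2x2 rotation matrix scaled by s; both eigenvalues have magnitude s."""
    c, sn = math.cos(theta), math.sin(theta)
    return [[s * c, -s * sn], [s * sn, s * c]]

grow = backprop_norms(scaled_rotation(1.1), [1.0, 0.0], 50)
shrink = backprop_norms(scaled_rotation(0.9), [1.0, 0.0], 50)
print(grow[-1], shrink[-1])
```

After 50 steps the gradient norm is $1.1^{50} \approx 117$ in the first case and $0.9^{50} \approx 0.005$ in the second, even though the two weight matrices differ only slightly.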

## Interesting Topics
1. DRAW: A Recurrent Neural Network For Image Generation, Gregor et al.
2. Multiple Object Recognition with Visual Attention, Ba et al. 
3. Recurrent Network of Attention

