# Assignment 11: Sequence Learning: Recurrent and Recursive Neural Networks (deadline: 27 Jan, 23:59)

For theoretical tasks you are encouraged to write in $\LaTeX$, which Jupyter notebooks support by default. For reference, have a look at the examples in this short, excellent guide: [Typesetting Equations](http://nbviewer.jupyter.org/github/ipython/ipython/blob/3.x/examples/Notebook/Typesetting%20Equations.ipynb)

Alternatively, you can upload your solutions in written form as images and paste them into the cells. If you do this, **make sure** that the images are of high quality, so that we can read them without any problems.

### Exercise 1. Comparing Vanilla RNN and LSTM for different Sequence Lengths (7 points)

**Goal**: To study how training performance varies with sequence length for a Vanilla RNN and a Long Short-Term Memory (LSTM) network on a word prediction task.

For this exercise, you will need to familiarize yourself with LSTMs. A good tutorial on LSTMs is presented at [Colah's Blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

The following LSTM TensorFlow code is derived from an example [here](https://github.com/roatienza/Deep-Learning-Experiments/blob/master/Experiments/Tensorflow/RNN/rnn_words.py). It lets you run an LSTM-based neural network on a word prediction task. Training is set up to predict the next word given the previous `n_input` words.
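To make the training setup concrete, here is a minimal sketch of how one training example can be formed: `n_input` consecutive words as context and the word that follows as the target. The helper name `make_example` and the toy sentence are assumptions for illustration and are not taken from the assignment code or from `train.txt`.

```python
def make_example(words, offset, n_input):
    """Return (context, target): n_input consecutive words and the word after them."""
    context = words[offset:offset + n_input]
    target = words[offset + n_input]
    return context, target


# Hypothetical toy sentence, used only for illustration
words = "the quick brown fox jumps over the lazy dog".split()

context, target = make_example(words, offset=1, n_input=3)
print(context, "->", target)
```

The assignment code does the same thing with word indices instead of strings and encodes the target as a one-hot vector.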

To complete this exercise, use this code together with the file `train.txt` from NNIA's resources page on Piazza, and answer the questions below.

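Note: You will need TensorFlow installed in the environment your IPython Notebook runs in. The code relies on `tf.contrib`, which is only available in TensorFlow 1.x.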

In [1]:
'''
A Recurrent Neural Network (LSTM) implementation example using TensorFlow.
Next word prediction after n_input words learned from text file.
A story is automatically generated if the predicted word is fed back as input.

Source Author: Rowel Atienza
Project: https://github.com/roatienza/Deep-Learning-Experiments
'''

from __future__ import print_function

import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn
import random
import collections
import time
from scipy import spatial
import matplotlib.pyplot as plt


start_time = time.time()
def elapsed(sec):
    if sec<60:
        return str(sec) + " sec"
    elif sec<(60*60):
        return str(sec/60) + " min"
    else:
        return str(sec/(60*60)) + " hr"

# Target log path
logs_path = '/tmp/tensorflow/rnn_words'
writer = tf.summary.FileWriter(logs_path)

# Text file containing words for training
training_file = 'train.txt'

def read_data(fname):
    # Read the file and return a flat 1-D array of whitespace-separated words
    with open(fname) as f:
        content = f.readlines()
    # Flatten all lines into one word list, so that lines of different
    # lengths do not produce a ragged (object) array
    words = [word for line in content for word in line.strip().split()]
    return np.array(words)

training_data = read_data(training_file)
print("Loaded training data...")

def build_dataset(words):
    count = collections.Counter(words).most_common()
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary

dictionary, reverse_dictionary = build_dataset(training_data)
vocab_size = len(dictionary)

# Parameters
learning_rate = 0.001
training_iters = 5000
display_step = 1000
n_input = 10

# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        # (eg. [had] [a] [general] -> [20] [6] [33])
        x = tf.split(x,n_input,1)

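        # 1-layer recurrent cell with n_hidden units.
        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call
        # (done below for tasks b and c; to reproduce task a, uncomment the
        # BasicLSTMCell line and comment out the BasicRNNCell line)
        #rnn_cell = rnn.BasicLSTMCell(n_hidden)
        rnn_cell = tf.contrib.rnn.BasicRNNCell(n_hidden)
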
        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)
    
    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)
    
    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    
    # Initializing the variables
    init = tf.global_variables_initializer()
    
    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        writer.add_graph(session.graph)

        while step < training_iters:
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(training_data)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = [ [dictionary[ str(training_data[i])]] for i in range(offset, offset+n_input) ]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = np.zeros([vocab_size], dtype=float)
            symbols_out_onehot[dictionary[str(training_data[offset+n_input])]] = 1.0
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            if (step+1) % display_step == 0:
                print("Iter= " + str(step+1) + ", Average Loss= " + \
                      "{:.6f}".format(loss_total/display_step) + ", Average Accuracy= " + \
                      "{:.2f}%".format(100*acc_total/display_step))
                acc_total = 0
                loss_total = 0
                symbols_in = [training_data[i] for i in range(offset, offset + n_input)]
                symbols_out = training_data[offset + n_input]
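                # onehot_pred is already a NumPy array of shape [1, vocab_size],
                # so take the argmax with NumPy instead of adding a new op to the graph
                symbols_out_pred = reverse_dictionary[int(np.argmax(onehot_pred, axis=1)[0])]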
                print("%s - [%s] vs [%s]" % (symbols_in,symbols_out,symbols_out_pred))
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Elapsed time: ", elapsed(time.time() - start_time))

Loaded training data...
Iter= 1000, Average Loss= 18.179031, Average Accuracy= 5.50%
['fawned', 'upon', 'him,', 'and', 'licked', 'his', 'hands', 'like', 'a', 'friendly'] - [dog.] vs [was]
Iter= 2000, Average Loss= 8.306223, Average Accuracy= 6.50%
['his', 'hands', 'like', 'a', 'friendly', 'dog.', 'the', 'emperor,', 'surprised', 'at'] - [this,] vs [to]
Iter= 3000, Average Loss= 8.098105, Average Accuracy= 5.70%
['his', 'victim.', 'but', 'as', 'soon', 'as', 'he', 'came', 'near', 'to'] - [androcles] vs [all]
Iter= 4000, Average Loss= 8.001601, Average Accuracy= 5.60%
['to', 'androcles', 'he', 'recognised', 'his', 'friend,', 'and', 'fawned', 'upon', 'him,'] - [and] vs [the]
Iter= 5000, Average Loss= 7.931170, Average Accuracy= 4.60%
['and', 'roaring', 'towards', 'his', 'victim.', 'but', 'as', 'soon', 'as', 'he'] - [came] vs [soon]
Training Finished!
Elapsed time:  32.48968482017517 sec


a) The sequence length used for prediction in the above code is specified by `n_input`. For an LSTM cell, change `n_input` to 3, 7, 10 and report the training accuracy for each. Note: running this code on a machine with 4 GB RAM and a Core i3 processor with TensorFlow v1.4 takes around four to five minutes. (**1.5 points**)

Ans: Average accuracy for `n_input = 3` after 5000 iterations is 31.30%.

Average accuracy for `n_input = 7` after 5000 iterations is 85.30%.

Average accuracy for `n_input = 10` after 5000 iterations is 90.70%.

b) In the function `RNN`, replace the LSTM Cell with a Vanilla RNN Cell at `#TODO`. (**1 point**)

c) Repeat the experiment in a) for the same `n_input` values. (**1.5 points**)

Ans: Average accuracy for `n_input = 3` after 5000 iterations is 6.60%.

Average accuracy for `n_input = 7` after 5000 iterations is 5.30%.

Average accuracy for `n_input = 10` after 5000 iterations is 4.60%.

d) While comparing Vanilla RNN and LSTM, what trends do you observe with training accuracy when the sequence length is varied? (**1 point**)

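Ans: With the LSTM, the average training accuracy rises as the sequence length increases (31.30%, 85.30% and 90.70% for `n_input` = 3, 7 and 10). With the Vanilla RNN it stays low and drops slightly (6.60%, 5.30% and 4.60%).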
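The trend can be read off directly by putting the reported numbers side by side. The following sketch only re-uses the accuracies reported in the answers to a) and c); the helper `trend` is a hypothetical name introduced for illustration.

```python
# Training accuracies (%) as reported in the answers to a) and c)
reported = {
    "LSTM": {3: 31.30, 7: 85.30, 10: 90.70},
    "Vanilla RNN": {3: 6.60, 7: 5.30, 10: 4.60},
}


def trend(acc_by_length):
    """Accuracy change (percentage points) from the shortest to the longest sequence."""
    lengths = sorted(acc_by_length)
    return acc_by_length[lengths[-1]] - acc_by_length[lengths[0]]


for model, acc in reported.items():
    print("{:<12} change from n_input=3 to n_input=10: {:+.2f} points".format(model, trend(acc)))
```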

e) Why do you think one model learns much better than the other?  (**1 point**)

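Ans: The difference in performance between the two models arises because the LSTM can handle long-range dependencies in the text as `n_input` grows, whereas the Vanilla RNN cannot. The gated cell state of the LSTM lets information and gradients pass across many time steps, while in a Vanilla RNN the gradients tend to vanish (or explode) as the sequence gets longer.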
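The following toy calculation illustrates this argument on scalars. It is a sketch under simplifying assumptions: the Vanilla RNN is reduced to `h_t = tanh(w * h_{t-1})`, the LSTM is reduced to the direct path through the cell state with a constant forget gate (the indirect paths through the gates are ignored), and the values `w = 0.5` and forget gate `0.99` are arbitrary choices for illustration.

```python
import math


def rnn_gradient(w, h0, steps):
    """|d h_T / d h_0| for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule: d tanh(w * h_prev) / d h_prev
    return abs(grad)


def cell_state_gradient(forget_gate, steps):
    """d C_T / d C_0 along the direct cell-state path with a constant forget gate."""
    return forget_gate ** steps


for steps in (3, 7, 10):
    print(steps, rnn_gradient(w=0.5, h0=0.8, steps=steps), cell_state_gradient(0.99, steps))
```

In this toy setting the Vanilla RNN gradient shrinks by at least a factor of two per step, while the cell-state path of the LSTM keeps the gradient close to 1 as long as the forget gate stays close to 1.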

f) Do you expect the model with higher training accuracy to generalize well? (**1 point**)

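Ans: No, not necessarily. High training accuracy alone says nothing about generalization; the model could simply be memorizing the training data. This can only be checked on held-out data.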

### Exercise 2. Unfolding Computational Graphs. (6 points)

Imagine you build a Vanilla Recurrent Neural Network with an input layer, a Vanilla-RNN-based hidden layer and an output layer.

a) The backpropagation through time algorithm looks back at a window of 4 previous time steps. Draw this computation as an unfolded graph like Figure 10.3 in the [DL book](http://www.deeplearningbook.org/contents/rnn.html). (**2 points**)

b) Which of the weight matrices used in the graph are the same? Mark these arrows with the same symbol $W$. (**1 point**)

c) This unfolded computational graph for the Vanilla RNN can be represented by an equivalent Recursive Neural Network. Draw the architecture for this graph. (**1 point**)

d) Can you construct a Recursive NN of smaller height than the one in c) with the same coverage of previous time steps? If not, explain why; if so, draw this architecture. (**2 points**)
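As a way to reason about tree height in c) and d), the sketch below builds two tree shapes over the same four inputs as nested tuples and compares their heights. It is an illustration under assumptions, not a model answer: only the inputs are counted as leaves, and the leaf names `x1` to `x4` and the helper functions are hypothetical.

```python
def left_chain(leaves):
    """Left-branching tree: each step combines the result so far with the next input."""
    tree = leaves[0]
    for leaf in leaves[1:]:
        tree = (tree, leaf)
    return tree


def balanced(leaves):
    """Balanced binary tree: split the inputs in half and combine the two halves."""
    if len(leaves) == 1:
        return leaves[0]
    mid = len(leaves) // 2
    return (balanced(leaves[:mid]), balanced(leaves[mid:]))


def height(tree):
    """Number of composition steps on the longest path from the root to a leaf."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(height(child) for child in tree)


leaves = ["x1", "x2", "x3", "x4"]
print(left_chain(leaves), height(left_chain(leaves)))
print(balanced(leaves), height(balanced(leaves)))
```

The chain corresponds to processing the inputs one after another, as a recurrent network does, while the balanced tree covers the same inputs with fewer composition steps along any path.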

### Exercise 3. Forget Gate in LSTMs (3 points)

An LSTM uses the forget gate $f_t$ to discard information from its global cell state ($C_t$) that is irrelevant for prediction at the present time step:

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$.

Refer to [Colah's Blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) for more information on the notation.
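The update rule is applied element-wise, which the following minimal sketch shows on plain Python lists. The gate values and states are made-up numbers chosen for illustration; in a real LSTM the gates are computed from $h_{t-1}$ and $x_t$.

```python
def update_cell_state(c_prev, f, i, c_tilde):
    """Element-wise C_t = f_t * C_{t-1} + i_t * C~_t."""
    return [ft * cp + it * ct for cp, ft, it, ct in zip(c_prev, f, i, c_tilde)]


# Made-up values: the first unit keeps its old state, the second forgets it
# and writes a new value, the third mixes old and new half and half.
c_prev = [2.0, -1.0, 0.5]
f = [1.0, 0.0, 0.5]
i = [0.0, 1.0, 0.5]
c_tilde = [0.3, 0.8, -1.0]

print(update_cell_state(c_prev, f, i, c_tilde))
```

In the second unit the old value `-1.0` is multiplied by a forget gate of `0.0`, so it no longer influences any later cell state.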

The part of $C_{t-1}$ that is suppressed by the forget gate, $(1 - f_t) * C_{t-1}$, is lost. The prediction process might need this information again at some point in the future, but once it has been forgotten there is no way to access it. Suggest a way of saving this information from being forgotten completely. Your solution should work systematically as the LSTM moves over a sequence.

Hint: Do you know about caching mechanisms in the physical memory of computers?

Ans: Save every generated cell state, together with its corresponding hidden state, in a cache. When a new cell state is computed, look up the cached hidden states that are similar to the current one and draw information from the associated cached cell states for the new cell state.
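One possible reading of this answer is sketched below. Everything in it is an assumption made for illustration: the class `StateCache`, the use of cosine similarity to compare hidden states, the oldest-first eviction, and the choice to return only the single best match. How the retrieved cell state would be mixed into the new one is left open.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors given as lists (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class StateCache:
    """Hypothetical cache of (hidden state, cell state) pairs."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.entries = []

    def store(self, h, c):
        self.entries.append((list(h), list(c)))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # evict the oldest entry

    def lookup(self, h_query):
        """Return the cell state stored with the hidden state most similar to h_query."""
        if not self.entries:
            return None
        _, best_c = max(self.entries, key=lambda entry: cosine(entry[0], h_query))
        return best_c


cache = StateCache(capacity=2)
cache.store([1.0, 0.0], [0.9, 0.1])
cache.store([0.0, 1.0], [-0.4, 0.7])
print(cache.lookup([0.9, 0.1]))
```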


### Exercise 4. Recurrent and Recursive Neural Networks: Theory (3 points)

The following are statements that you should answer with either True or False. Also provide a justification for your answer. To answer these questions, you will need to revisit the lecture slides and read the DL book's chapter on [Sequence Modelling](http://www.deeplearningbook.org/contents/rnn.html).

a) A Convolutional Neural Network layer forms a shallower way of sharing parameters through time than a Vanilla RNN layer. \[T/F\]

Ans: True. 

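A convolutional network can be used to share parameters over time, but the sharing is shallow: each output depends only on a small window of neighbouring inputs. In a recurrent network, each output depends on the outputs of the previous steps, so the same parameters are applied through a much deeper computational graph.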
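The difference can be seen in a small scalar experiment. This is a sketch under assumptions: the kernel, the weights `u` and `w`, and the input sequences are arbitrary values chosen for illustration, and `conv1d` and `recurrent` are hypothetical helper names.

```python
import math


def conv1d(xs, kernel):
    """Valid 1-D convolution through time: output t only sees len(kernel) inputs."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, xs[t:t + k])) for t in range(len(xs) - k + 1)]


def recurrent(xs, u, w):
    """Scalar vanilla RNN: every state depends on all earlier inputs via the previous state."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(u * x + w * h)
        states.append(h)
    return states


original = [1.0, 0.0, 0.0, 0.0, 0.0]
changed = [5.0, 0.0, 0.0, 0.0, 0.0]  # only the first input differs

kernel = [0.5, 0.5]
print(conv1d(original, kernel)[-1], conv1d(changed, kernel)[-1])
print(recurrent(original, 1.0, 1.0)[-1], recurrent(changed, 1.0, 1.0)[-1])
```

Changing the first input leaves the last convolution output untouched, because its window does not cover that input, whereas the last recurrent state changes.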

b) Networks with output recurrence are more powerful than networks with hidden-to-hidden recurrence. \[T/F\]

Ans: False. 

Unless the output is very high-dimensional and rich, it will usually lack important information from the past, so output recurrence is the less powerful of the two. Such a network might be easier to train, though, because each time step can be trained on its own and training can therefore be parallelized.

c) Removing the Global Cell State from LSTMs will result in a Vanilla RNN. \[T/F\]

Ans: True. 

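In the standard LSTM, the forget, input and output gates only control what is erased from, written to and read out of the cell state. If the cell state is removed, the gates have nothing left to act on. What remains is the candidate update $\tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$, which, used as the new hidden state, is the update of a Vanilla RNN cell.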

---

## Submission instructions
You should provide a single Jupyter notebook as the solution. The naming should include the assignment number and matriculation IDs of all members in your team in the following format:
**assignment-11_matriculation1_matriculation2_matriculation3.ipynb** (in case of 3 members in a team). 
Make sure to keep the order matriculation1_matriculation2_matriculation3 the same for all assignments.

Please submit the solution to your tutor (with **[NNIA][assignment-11]** in the email subject):
1. Maksym Andriushchenko <s8mmandr@stud.uni-saarland.de>
2. Marius Mosbach <s9msmosb@stud.uni-saarland.de>
3. Rajarshi Biswas <rbisw17@gmail.com>
4. Marimuthu Kalimuthu <s8makali@stud.uni-saarland.de>

Note: **If you are in a team, you should submit only 1 solution to only 1 tutor.** <br>
$\hspace{2em}$ **Submissions violating these rules will not be graded.**