# NNIA 18/19 Project 4:  Optimization \& Recurrent Neural Networks

## Deadline: 28. Februrary 2019, 23:59

In [0]:
# imports
%matplotlib notebook
import re
import math
import random
import numpy as np
import tensorflow as tf
import matplotlib as mpl
from matplotlib import cm
import matplotlib.pyplot as plt
from tensorflow.contrib import rnn
from mpl_toolkits.mplot3d import Axes3D
mpl.rcParams['figure.figsize'] = (12.0, 8.0)

## 1. Optimization Algorithms$~$ (6 points)

In this task, we will get familiar with various optimization methods such as **Vanilla Gradient Descent** (GD), [**Gradient Descent with Momentum**](https://www.tensorflow.org/api_docs/python/tf/train/MomentumOptimizer), [**RMSProp**](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer) and [**AdaGrad**](https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer) by implementing them in TensorFlow and *visualizing* the path (convergence) towards minima using [Matplotlib 3D/Contour plots](https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html).

**3D Loss Surface**

For the following exercises we assume that the surface of the general **loss** we want to minimize is given by a function `z`. On this function, we apply different optimization methods and want to visualize their stepwise improvements. `z` is defined as:

$$ term1 = \frac{2}{\sqrt{(2\pi \alpha_{1}^{2})^{2}}} * \exp{\left(- \left[ \frac{(x-\mu_1)^2}{(\frac{\alpha {1}}{2})^2} + \frac{(y-\mu_1)^2}{(\alpha_1)^2}     \right] \right)} $$

$$ term2 = \frac{1}{\sqrt{(2\pi \alpha_{2}^{2})^{2}}} * \exp{\left(- \left[ \frac{(x-\mu_2)^2 + (y-\mu_2)^2}{(\alpha_2)^2} \right] \right)} $$

$$ term3 = \frac{1}{20} * \left(x^2  + xy + y^2 \right) $$ <br>

$$ z_{\alpha, \mu}(x, y) = term1 - term2 + term3 $$

To make yourself comfortable working with this function we provide a visualization by plotting it in 3D using [matplotlib-3D-wireframe](https://matplotlib.org/devdocs/gallery/mplot3d/wire3d.html). You can interactivaley play around with the plot to get familiar with the surface.

In [0]:
# %matplotlib inline

# params of our error surface `z`
alpha_1 = 1.0
alpha_2 = 2.0
mu_1 = 0.5
mu_2 = 0.0
range_x, range_y = np.arange(-2.0, 3.0, 0.5), np.arange(-2.0, 2.0, 0.5)

def func_z(X, Y):
    """
    function definition of our 3D error surface
    """
    exp_input_1 = -1 * ((((X - mu_1)**2) / (alpha_1/2)**2) + (((Y - mu_1)**2)/(alpha_1**2)))
    term_1 = 2/np.sqrt((2 * np.pi * alpha_1**2)**2) * np.exp(exp_input_1)
    
    exp_input_2 = -1 * ( ((X - mu_2)**2 + (Y - mu_2)**2) / alpha_2**2)
    term_2 = 1/np.sqrt((2 * np.pi * alpha_2**2)**2) * np.exp(exp_input_2)
    
    term_3 = 1/20 * (X**2 + X * Y + Y**2)
    
    return term_1 - term_2 + term_3

# x,y values for `Wireframe` plot
x_wireframe, y_wireframe = np.arange(-2.0, 3.0, 0.5), np.arange(-2.0, 2.0, 0.5)

# x,y values for `Contour` plot
x_contour, y_contour = np.arange(-2.0, 3.0, 0.1), np.arange(-2.0, 2.0, 0.1)

# Following code implements the plotting the Error Surface
X_sparse, Y_sparse = np.meshgrid(x_wireframe, y_wireframe)
Z_sparse = func_z(X_sparse, Y_sparse)

X_dense, Y_dense = np.meshgrid(x_contour, y_contour)
Z_dense = func_z(X_dense, Y_dense)

fig = plt.figure(figsize=(9,4))
ax1 = fig.add_subplot(121,projection='3d')
ax2 = fig.add_subplot(122)

ax1.plot_wireframe(X_sparse, Y_sparse, Z_sparse, linewidth=1, cmap=cm.jet, zorder=1, alpha=0.6)
ax2.contour(X_dense, Y_dense, Z_dense, 32,  cmap=cm.jet)

ax1.set_xlabel(r'$x$',fontsize=18)
ax1.set_ylabel(r'$y$',fontsize=18)
ax1.set_zlabel(r'$z$',fontsize=18)
ax1.set_title('3D Surface', fontsize=18)

ax2.contour(X_dense, Y_dense, Z_dense, 32,  cmap=cm.jet)
ax2.autoscale(False)
ax2.set_title('Contour plot', fontsize=18)

plt.show()

<IPython.core.display.Javascript object>

### 1.1 Error Implementation with Tensorflow

Usually, we minimize the loss function of a neural network which is defined by a tensorflow computational graph which allows us perform optimization easily. Here, we first need to implement the 3D surface of the loss function described above using tensorflow. 

Setup the graph of the function by implementing `problem_3d` using tensorflow operations and variables. Write your code as specified by `# TODO`. (**1 point**)

In [0]:
# The following variables  will come in handy when implementing the error surface using tensorflow functions below
tf_x, tf_y, tf_z, = None, None, None
tf_reinit_x, tf_reinit_y = None, None
session = None

def problem_3d(start_x, start_y):
    global session
    global tf_x, tf_y, tf_z
    global tf_reinit_x, tf_reinit_y
    
    tf.reset_default_graph()
    session = tf.InteractiveSession()

    with tf.variable_scope('opt'):
        tf_x = tf.get_variable('x', initializer=tf.constant(start_x, shape=None, dtype=tf.float32))
        tf_y = tf.get_variable('y', initializer=tf.constant(start_y, shape=None, dtype=tf.float32))

    with tf.variable_scope('opt', reuse=True):
        tf_reinit_x = tf.assign(tf.get_variable('x'), start_x)
        tf_reinit_y = tf.assign(tf.get_variable('y'), start_y)
    
    # TODO Implement 3D error surface using the above defined variables
    exp_input_1 = -1 * ((((tf_x - mu_1)**2) / (alpha_1/2)**2) + (((tf_y - mu_1)**2)/(alpha_1**2)))
    tf_term_1 = 2/np.sqrt((2 * np.pi * alpha_1**2)**2) * tf.math.exp(exp_input_1)
    
    exp_input_2 = -1 * ( ((tf_x - mu_2)**2 + (tf_y - mu_2)**2) / alpha_2**2)
    
    tf_term_2 = 1/np.sqrt((2 * np.pi * alpha_2**2)**2) * tf.math.exp(exp_input_2)
    
    tf_term_3 = 1/20 * (tf_x**2 + tf_x * tf_y + tf_y**2)
    
    
    tf_z = tf_term_1 - tf_term_2 + tf_term_3
    

---
**Points:** $0.0$ of $1$
**Comments:** None

---

### 1.2 Implementation of Gradient Descent with Momentum

In the lecture chapter 8 on slide 20 you got introduced to an advanced implementation of the Gradient Descent Optimizer, called Gradient Descent with Momentum. In this exercise you should implement Gradient Descent with Momentum using tensorflow operations. 

In the following, we provide a class for GD with Momentum where you have to fill in the `#TODO` sections only. To get gradients of the objective you want to minimize, use the function [`tf.gradients`](https://www.tensorflow.org/api_docs/python/tf/gradients). Make sure, that your variables are always shaped correctly! (**2 points**).

In [0]:
class GradientDescentMomentumOptimizer():
    
    def __init__(self, learning_rate, alpha):

        with tf.variable_scope('gdm_opt'):
            self.learning_rate = tf.get_variable('lr', initializer=tf.constant(learning_rate, shape=[], dtype=tf.float32))
            self.alpha = tf.get_variable('alpha', initializer=tf.constant(alpha, shape=None, dtype=tf.float32))
            self.v = tf.get_variable('v', initializer=tf.constant([0, 0], shape=[2, 1], dtype=tf.float32))

        # input 
        self.input_x = tf.placeholder("float", [])
        self.input_y = tf.placeholder("float", [])

        # optimized outputs
        self.out_x = None
        self.out_y = None  

        # gradients
        self.grads = None

        # objective to minimize      
        self.objective = None
        
    def minimize(self, objective):
        self.objective = objective
        return self.optimization_step()

    def update(self, new_x, new_y, new_v):

        with tf.variable_scope('opt', reuse=True):
            tf_reinit_x = tf.assign(tf.get_variable('x'), new_x[0])
            tf_reinit_y = tf.assign(tf.get_variable('y'), new_y[0])
           # print(new_x,'newwwww',new_y)

        with tf.variable_scope('gdm_opt', reuse=True):
            set_v =  tf.assign(tf.get_variable('v'), new_v)
            
        return tf_reinit_x, tf_reinit_y, set_v

    def optimization_step(self):
        
        global tf_x, tf_y
        
        
        # TODO: Implement this function returning the updated positions into self.out_x, self.out_y

        self.grads = tf.gradients(tf_z,[tf_x,tf_y],stop_gradients=[tf_x,tf_y]) 
        self.grads = tf.reshape(self.grads, [2,1])
        self.v = (self.alpha * self.v) - self.learning_rate * self.grads
        self.out_x = tf_x + self.v[0]
        self.out_y = tf_y + self.v[1]

        return self.out_x, self.out_y

In the following, use your implementation to find a local minimum in our loss function. We choose a fixed starting position and run the optimizer for a certain amount of steps.

In [0]:

# starting position
start_x, start_y = 0.55, 0.6
n_steps = 60

problem_3d(start_x,start_y)

lr = 0.2
alpha = 0.9

optimizer = GradientDescentMomentumOptimizer(lr, alpha)
opt_step = optimizer.minimize(objective=tf_z)

# initialize variables
session.run(tf.global_variables_initializer())

# set initial values
session.run([tf_reinit_x, tf_reinit_y])

# keep track of all steps
opt_gd_points_x, opt_gd_points_y, opt_gd_points_z = [],[],[]

# fill in the initial position
opt_gd_points_x.append(start_x)
opt_gd_points_y.append(start_y)
opt_gd_points_z.append(func_z(start_x,start_y))

x, y = [start_x], [start_y]

print(x)
print(y)

print('Momentum GD Optimization started')
import time
start = time.time()

for step in range(n_steps):

    # perform optimization step
    x, y, z, v, cur_gradient, _ = session.run([optimizer.out_x, optimizer.out_y, tf_z, optimizer.v, optimizer.grads, opt_step], feed_dict={optimizer.input_x: x[0], optimizer.input_y: y[0]}) 
    # update the function
    session.run([optimizer.update(x, y, v)])
    
    opt_gd_points_x.append(x[0])
    opt_gd_points_y.append(y[0])
    opt_gd_points_z.append(func_z(x[0], y[0]))
    
    if step % 10 == 0:
        print("Optimization step {} - minimized value: {}".format(step, z))
        
done = time.time()
elapsed = done - start
print(elapsed)

Instructions for updating:
Colocations handled automatically by placer.
[0.55]
[0.6]
Momentum GD Optimization started
Optimization step 0 - minimized value: 0.32791638374328613
Optimization step 10 - minimized value: 0.1342935860157013
Optimization step 20 - minimized value: 0.045574840158224106
Optimization step 30 - minimized value: 0.03934767097234726
Optimization step 40 - minimized value: -0.0014221835881471634
Optimization step 50 - minimized value: -0.022050585597753525
0.888592004776001


In [0]:
range_x,range_y = np.arange(-1.0,2.0,0.2), np.arange(-2.0,2.0,0.2)
X_lowres, Y_lowres = np.meshgrid(range_x, range_y)
Z_lowres = func_z(X_lowres,Y_lowres)

range_x,range_y = np.arange(-1.0,2.0,0.1), np.arange(-2.0,2.0,0.1)
X_hires, Y_hires = np.meshgrid(range_x, range_y)
Z_hires = func_z(X_hires,Y_hires)

fig = plt.figure(figsize=(9,4))

epsilon = 0.0
ax1 = fig.add_subplot(121,projection='3d')
ax2 = fig.add_subplot(122)

# plot
ax1.plot_wireframe(X_lowres, Y_lowres, Z_lowres, linewidth=1, cmap=cm.jet, zorder=1, alpha=0.6)
ax2.contour(X_hires, Y_hires, Z_hires, 32,  cmap=cm.jet)
ax2.autoscale(False)

for idx, (x,y,z) in enumerate(zip(opt_gd_points_x, opt_gd_points_y, opt_gd_points_z)):
    if idx != len(opt_gd_points_x)-1:
        ax1.scatter(x,y,z + epsilon , color='blue', alpha=(idx+10)/(n_steps+10.0), zorder=100)
        ax2.scatter(np.asarray(x),np.asarray(y) , color='blue')
    else:
        ax1.scatter(x,y,z + epsilon , color='blue', alpha=(idx+10)/(n_steps+10.0), label='GD with Momentum', zorder=100)
        ax2.scatter(x,y, color='blue', label='GD with Momentum')

ax1.set_xlabel(r'$x$', fontsize=18)
ax1.set_ylabel(r'$y$', fontsize=18)
plt.legend()
plt.show()

<IPython.core.display.Javascript object>

Try out different combinations for the momentum scaler `alpha` and the learning rate `lr`. What are the impacts of these parameters? Does Momentum bring any benefit in this special example compared to GD without Momentum? - Briefly explain! (**1 point**).

Ans :  Attached are the files with different combinations for alpha and learning rate. It is evident from the figures that higher value of alpha takes longer path to converge but they take less number of steps (We conclude this by comparing the time stamps of running the same optimisation job with a fixed LR). Also, we conclude that higher LR takes less time/less steps to converge. GD with Momentum takes a longer path but less time to converge, since it is based on averaging over a set of samples. 

alpha = 0.5 ; lr =0.5 : https://drive.google.com/file/d/1a-FUSh3VW8ZHby7rJfo6O3vOPZA1cxmo/view?usp=sharing
alpha = 0.5 ; lr =0.2  : https://drive.google.com/file/d/1Gx1iNy077_TL2YnzfL2k53rkzmHmkf-0/view?usp=sharing
alpha = 0.2 ; lr =0.9 : https://drive.google.com/file/d/17UGhbdvXMzIlUoTJ7aEUpQjUTdJBzZeO/view?usp=sharing
alpha = 0.9 ; lr =0.2 : https://drive.google.com/file/d/1_Hoa-nXu2X1q0gP8itnGgGqiz-Y40-Mj/view?usp=sharing
alpha = 0.2 ; lr =0.2 : https://drive.google.com/file/d/1VnaCIcWAUDUGVsbkRNn5z4pxkaEP0TDf/view?usp=sharing
alpha = 0.9 ; lr =0.9 : https://drive.google.com/file/d/1769gC4V0s_93eEJ77tekWJtRBGaYF-qd/view?usp=sharing
alpha = 0.0 ; lr =0.2 : (Simple GD) https://drive.google.com/file/d/1G7Hl8oqipDZv5ANXFME3EU0pRH8kRPGJ/view?usp=sharing

---
**Points:** $0.0$ of $3$
**Comments:** None



### 1.3 Using Tensorflow's Optimizer Implementations

Tensorflow does of course provide optimizers which you do not have to implement explictly. In order to compare these different optimizers, complete the code below. Use the following implementations from the tensorflow library:

- [Gradient Descent](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer)
- [Gradient Descent with Momentum](https://www.tensorflow.org/api_docs/python/tf/train/MomentumOptimizer)
- [RMSProp](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer) 
- [AdaGrad](https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer)

(**0.5 points**)


In [0]:
# starting position
start_x, start_y = 0.55, 0.6
n_steps = 60

# Write code to define GD, Momentum, RMSProp and Adagrad implementations on tf_z global variable defined by problem_3d

with tf.variable_scope('gd', reuse=tf.AUTO_REUSE):
    # TODO: Define Gradient Descent Optimizer with learning rate = 0.1
    op_tf_optimize_z = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    tf_optimize_z=op_tf_optimize_z.minimize(tf_z)

with tf.variable_scope('momentum', reuse=tf.AUTO_REUSE):
    # TODO: Define Gradient Descent with Nestrov's Momentum Optimizer with learning rate = 0.1 and momentum = 0.9
    op_tf_mom_optimize_z = tf.train.MomentumOptimizer(learning_rate=0.1,momentum=0.9,use_nesterov=True)
    tf_mom_optimize_z = op_tf_mom_optimize_z.minimize(tf_z)
    
with tf.variable_scope('rmsprop', reuse=tf.AUTO_REUSE):
    # TODO: Define RMSProp with learning rate = 0.1
    op_tf_rms_optimize_z = tf.train.RMSPropOptimizer(learning_rate=0.1)
    tf_rms_optimize_z=op_tf_rms_optimize_z.minimize(tf_z)
    
with tf.variable_scope('adagrad', reuse=tf.AUTO_REUSE):
    # TODO: Define Adagrad Optimizer with learning rate = 0.1
    op_tf_ada_optimize_z = tf.train.AdagradOptimizer(learning_rate=0.1)
    tf_ada_optimize_z=op_tf_ada_optimize_z.minimize(tf_z)

In [0]:
session.run(tf.global_variables_initializer())

# Run vanilla GD on Error Surface
session.run([tf_reinit_x, tf_reinit_y])

opt_gd_points_x, opt_gd_points_y, opt_gd_points_z = [],[],[]
opt_gd_points_x.append(start_x)
opt_gd_points_y.append(start_y)
opt_gd_points_z.append(func_z(start_x,start_y))

print('Vanilla GD Optimization started')
for step in range(n_steps):
    session.run(tf_optimize_z)
    x, y, z = session.run([tf_x, tf_y, tf_z])    
    opt_gd_points_x.append(x)
    opt_gd_points_y.append(y)
    opt_gd_points_z.append(z)
print('Vanilla GD Optimization finished')


# Run Nestrov's Momentum GD on Error Surface
session.run([tf_reinit_x, tf_reinit_y])

opt_mom_points_x, opt_mom_points_y, opt_mom_points_z = [],[],[]
opt_mom_points_x.append(start_x)
opt_mom_points_y.append(start_y)
opt_mom_points_z.append(func_z(start_x,start_y))



print("Momentum Optimization started")
for step in range(n_steps):
    session.run(tf_mom_optimize_z)
    x, y, z = session.run([tf_x, tf_y, tf_z])    
    opt_mom_points_x.append(x)
    opt_mom_points_y.append(y)
    opt_mom_points_z.append(z)
print('Momentum Optimization finished')
print(opt_mom_points_z)
    
# RMSProp
session.run([tf_reinit_x, tf_reinit_y])

opt_rms_points_x, opt_rms_points_y, opt_rms_points_z = [],[],[]
opt_rms_points_x.append(start_x)
opt_rms_points_y.append(start_y)
opt_rms_points_z.append(func_z(start_x,start_y))

print('RMSProp Optimization started')
for step in range(n_steps):
    session.run(tf_rms_optimize_z)
    x, y, z = session.run([tf_x, tf_y, tf_z])    
    opt_rms_points_x.append(x)
    opt_rms_points_y.append(y)
    opt_rms_points_z.append(z)
print('RMSProp Optimization finished')


# Run AdaGrad on Error Surface
session.run([tf_reinit_x, tf_reinit_y])

opt_ada_points_x, opt_ada_points_y, opt_ada_points_z = [],[],[]
opt_ada_points_x.append(start_x)
opt_ada_points_y.append(start_y)
opt_ada_points_z.append(func_z(start_x,start_y))


print('Adagrad Optimization started')
for step in range(n_steps):
    session.run(tf_ada_optimize_z)
    x, y, z = session.run([tf_x, tf_y, tf_z])    
    opt_ada_points_x.append(x)
    opt_ada_points_y.append(y)
    opt_ada_points_z.append(z)
print('Adagrad Optimization finished')

    
range_x,range_y = np.arange(-1.0,2.0,0.2), np.arange(-2.0,2.0,0.2)
X_lowres, Y_lowres = np.meshgrid(range_x, range_y)
Z_lowres = func_z(X_lowres,Y_lowres)

range_x,range_y = np.arange(-1.0,2.0,0.1), np.arange(-2.0,2.0,0.1)
X_hires, Y_hires = np.meshgrid(range_x, range_y)
Z_hires = func_z(X_hires,Y_hires)

# Subplots visualizing the minimization steps

fig = plt.figure(figsize=(9,4))

epsilon = 0.0
ax1 = fig.add_subplot(121,projection='3d')
ax2 = fig.add_subplot(122)

# plot
ax1.plot_wireframe(X_lowres, Y_lowres, Z_lowres, linewidth=1, cmap=cm.jet, zorder=1, alpha=0.6)
ax2.contour(X_hires, Y_hires, Z_hires, 32,  cmap=cm.jet)
ax2.autoscale(False)

# vanilla GD
for idx, (x,y,z) in enumerate(zip(opt_gd_points_x, opt_gd_points_y, opt_gd_points_z)):
    if idx != len(opt_gd_points_x)-1:
        ax1.scatter(x,y,z + epsilon , color='blue', alpha=(idx+10)/(n_steps+10.0), zorder=100)
        ax2.scatter(np.asarray(x),np.asarray(y) , color='blue')
    else:
        ax1.scatter(x,y,z + epsilon , color='blue', alpha=(idx+10)/(n_steps+10.0), label='GD', zorder=100)
        ax2.scatter(x,y, color='blue', label='GD')

# GD with momentum
for idx, (x,y,z) in enumerate(zip(opt_mom_points_x, opt_mom_points_y, opt_mom_points_z)):
    if idx != len(opt_mom_points_x)-1:
        ax1.scatter(x,y , z + epsilon , color='yellow', alpha=(idx+10)/(n_steps+10.0), zorder=100)
        ax2.scatter(x,y , color='yellow', alpha=(idx+10)/(n_steps+10.0))
    else:
        ax1.scatter(x,y,z + epsilon , color='yellow', alpha=(idx+10)/(n_steps+10.0), label='Momentum', zorder=100)
        ax2.scatter(x,y, color='yellow', alpha=(idx+10)/(n_steps+10.0), label='Momentum')

# RMSProp
for idx, (x,y,z) in enumerate(zip(opt_rms_points_x, opt_rms_points_y, opt_rms_points_z)):
    if idx != len(opt_rms_points_x)-1:
        ax1.scatter(x,y,z + epsilon , color='purple', alpha=(idx+10)/(n_steps+10.0), zorder=100)
        ax2.scatter(x,y , color='purple', alpha=(idx+10)/(n_steps+10.0))
    else:
        ax1.scatter(x,y,z + epsilon , color='purple', alpha=(idx+10)/(n_steps+10.0), label='RMSProp', zorder=100)
        ax2.scatter(x,y, color='purple', alpha=(idx+10)/(n_steps+10.0), label='RMSProp')
        
# AdaGrad
for idx, (x,y,z) in enumerate(zip(opt_ada_points_x, opt_ada_points_y, opt_ada_points_z)):
    if idx != len(opt_ada_points_x)-1:
        ax1.scatter(x,y,z + epsilon , color='green', alpha=(idx+10)/(n_steps+10.0), zorder=100)
        ax2.scatter(x,y , color='green', alpha=(idx+10)/(n_steps+10.0), zorder=100)
    else:
        ax1.scatter(x,y,z + epsilon , color='green', alpha=(idx+10)/(n_steps+10.0), label='AdaGrad', zorder=100)
        ax2.scatter(x,y,color='green', alpha=(idx+10)/(n_steps+10.0), label='AdaGrad', zorder=100)

ax1.set_xlabel(r'$x$', fontsize=18)
ax1.set_ylabel(r'$y$', fontsize=18)
ax1.set_title("Error surface 3D", fontsize=18)
ax2.set_title('GD vs Nesterov Momentum vs RMSProp vs AdaGrad ', fontsize=12)
plt.legend()
plt.show()

Vanilla GD Optimization started
Vanilla GD Optimization finished
Momentum Optimization started
Momentum Optimization finished
[0.3279163883810402, 0.32745057, 0.3263924, 0.32391322, 0.31787026, 0.30317938, 0.27050322, 0.21357372, 0.15144593, 0.115852214, 0.10629766, 0.107473284, 0.110381395, 0.11160818, 0.1102093, 0.10622542, 0.10009432, 0.09240939, 0.08380949, 0.07492136, 0.066322654, 0.05850863, 0.051853746, 0.046571713, 0.042691577, 0.0400717, 0.038457207, 0.03755916, 0.037121937, 0.036957435, 0.036946997, 0.037025142, 0.037158832, 0.03733082, 0.037529267, 0.037742853, 0.03795947, 0.03816678, 0.03835345, 0.038510233, 0.03863076, 0.038711745, 0.03875286, 0.03875629, 0.038726047, 0.038667366, 0.038585976, 0.038487524, 0.038377196, 0.038259394, 0.038137652, 0.03801464, 0.037892256, 0.03777183, 0.037654266, 0.0375403, 0.037430584, 0.037325792, 0.037226662, 0.03713399, 0.037048556]
RMSProp Optimization started
RMSProp Optimization finished
Adagrad Optimization started
Adagrad Optimizatio

<IPython.core.display.Javascript object>

Figure : https://drive.google.com/file/d/17ivbaK0rxVcxCwIQzrk8VKl5mnjD47n_/view?usp=sharing

Evaluate the function `z` at the termination points for each algorithm from the plots above. Which algorithm has made better progress in minimizing `z`?. Is it generally good to  always use this method? Briefly explain your findings. (**1.5 points**)





Ans : Adagrad Optimizer has the best progress in minimizing z. Yes, it is good to use this method because since it dynamically adapts the learning rate, the frequently occuring features get a lower learning rate and the rare features are being noticed by the model when encountered meaning the model can now identify the predictive but infrequent features more easily than if the learning rate was the same. 

---
**Points:** $0.0$ of $2$
**Comments:** None

---

## 2. RNN Implementation in Tensorflow$~$ (14 points)

In the following exercise you should implement a simple Recurrent Neural Network using tensorflow. The task we consider here is learning a certain repeating pattern of digits.

Consider the following infinite sequence: 

$$1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, ..., 9, 1, 2, 2, 3, ...$$

A digit $i \in [1, 9]$ appears i times subsequently followed by $i+1$. After $i$ equals to 9, the sequence continues with $i = 1$. 

While the recognition of this pattern is easy for humans, in this exercise, we want to train a recurrent neural network such that it is able to predict the next digit for a given sequence.

### 2.1. Prepraring the data

First, we have to generate our training and test data. The function `generate_dataset` should return a certain amount of valid sequence snippets from the pattern described above with a given sample_size.

Valid sequcences of size 5 are for example:
- $[1, 2, 2, 3, 3]$ - expected prediction: 3
- $[9, 9, 1, 2, 2]$ - expected prediction: 3
- $[7, 7, 7, 7, 7]$ - expected prediction: 7 or 8

not valid sequcences of size 5 are for example:
- $[1, 1, 2, 2, 3]$
- $[3, 4, 4, 5, 5]$
- $[9, 0, 1, 2, 2]$

Complete the function implementations below. (**1.5 points**)



In [0]:
def get_seq_samples(start=1, length=100):

    res_sequence = []
    
    # TODO: Implement this function returning one sequence starting with the digit 'start' containing 'length' items
    # modified to return length+1 items to make the generating prediction more efficient in the generate_dataset function
    curr=start
    count=np.random.randint(1,start+1)
    while len(res_sequence)<length+1:
      res_sequence=res_sequence+[curr]*count
      if curr==9:
          curr=1
      else:
         curr=curr+1
      count=curr
    
    res_sequence=res_sequence[:length+1]
    return res_sequence

def generate_dataset(sample_count, sample_size):

    dataset = []
    labels = []

    # TODO: Implement this function returning an array containing 'sample_count' generated samples of length 'sample_size'
    # and an array containing the corresponing digits which should get predicted
    i=0
    while i<sample_count:
      start=np.random.randint(1,10)
      seq=get_seq_samples(start=start,length=sample_size)
      dataset.append(seq[:-1])
      labels.append(seq[-1])
      i=i+1

    return np.array(dataset), np.array(labels)   

---
**Points:** $0.0$ of $1.5$
**Comments:** None

---

In [0]:
generate_dataset(40,2)

(array([[9, 9],
        [5, 5],
        [7, 7],
        [6, 6],
        [6, 7],
        [7, 7],
        [8, 8],
        [8, 8],
        [2, 3],
        [6, 6],
        [8, 8],
        [2, 3],
        [9, 9],
        [3, 4],
        [8, 8],
        [5, 5],
        [6, 6],
        [8, 8],
        [7, 7],
        [5, 5],
        [7, 7],
        [6, 6],
        [2, 3],
        [8, 9],
        [2, 2],
        [1, 2],
        [6, 7],
        [8, 8],
        [2, 3],
        [6, 6],
        [3, 3],
        [5, 5],
        [5, 6],
        [8, 8],
        [4, 5],
        [6, 6],
        [8, 8],
        [7, 7],
        [8, 8],
        [6, 7]]),
 array([1, 5, 7, 6, 7, 8, 8, 8, 3, 6, 8, 3, 9, 4, 8, 5, 6, 8, 7, 5, 7, 6,
        3, 9, 3, 2, 7, 8, 3, 6, 3, 6, 6, 8, 5, 6, 8, 7, 8, 7]))

### 2.2. Model Setup

In the next step we implement a recurrent many-to-one neural network which processes batches of input sequences to one single output value. 

An RNN-cell implements the following function:

$$a^{(t)} = W\cdot h^{(t-1)} + U\cdot x^{(t)} + b$$ 
$$h^{(t)} = tanh(a^{(t)})$$
$$...$$
$$o = V\cdot h^{(n)} + c$$

$t$ indicates the time step iteration, $b$ and $c$ are bias values and $U, V $ and $W$ weight parameters. $o$ is the resulting output which gets computed after processing a sequence of $n$ numbers.

a) To get familiar with the model design, draw an unfolded model graph for input sequences of length 3 (check the images in the [**Deep Learning Book - Chapter 10.2**](https://www.deeplearningbook.org/contents/rnn.html) ). For each cell, state its variable name. Also include where which mathematical operation ($+, \cdot, tanh()$) should be applied. (**2 points**)


b) Assume you have implemented the model from a) in tensorflow. For each cell in your image, add the tensor shapes (array dimensions) when the `batch_size` is set to 4 sequences. Assume that the inputs are sequences of digits, the outputs are one-hot encoded and the RNN layer size is 50. (**2 points**)


c) Finally, your task is to complete the following code at the `# TODO` sections, so that the neural network is able to process batches of size `batch_size` of digit sequences of the length `input_seq_len`. The hidden RNN size is given by `n_hidden`. (**5 points**)

Ans : ![2.2 Unfolded Model Graph]https://drive.google.com/open?id=17koUMxbTzelP4OygxzByY-OFjmCDJKEr

In [0]:
tf.reset_default_graph()

# parameters
learning_rate = 0.01
epochs = 70
batch_size = 5

# length of a single sequence
input_seq_len = 10

# number of units in RNN cell
n_hidden = 90

n_vocab=9


RNN_graph = tf.Graph()
with RNN_graph.as_default():

    # tf Graph input: X = sequences, Y = digits to predict 
    batchX_placeholder = tf.placeholder(tf.int32, [batch_size, input_seq_len])
    batchY_placeholder = tf.placeholder(tf.int32, [batch_size, 1])

    # init_state = h0
    init_state = tf.Variable(tf.random_normal([batch_size, n_hidden]))

    # TODO: RNN output node weights and biases - set the tf.Variables with correct shapes and random_normal initialization
    weights = {
        'U': tf.Variable(np.random.rand(n_vocab,n_hidden), dtype=tf.float32),
        'W': tf.Variable(np.random.rand(n_hidden, n_hidden), dtype=tf.float32),
        'V': tf.Variable(np.random.rand(n_hidden, n_vocab), dtype=tf.float32)
        }

    biases = {
        'b': tf.Variable(np.zeros((1,n_hidden)), dtype=tf.float32),
        'c': tf.Variable(np.zeros((1,n_vocab)), dtype=tf.float32)
        }

    # TODO: setup graph for the RNN
    inputs_series = tf.unstack(batchX_placeholder, axis=1)
    labels_series = tf.unstack(batchY_placeholder, axis=1)
    labels = tf.one_hot(labels_series,depth=n_vocab)
    
    h_prev = init_state
    states = []
    for t in range(input_seq_len):
#       input_t = tf.reshape(input_series[t], [batch_size, 1])
      x_t = tf.one_hot(inputs_series[t], depth=n_vocab)
      a_t = tf.matmul(h_prev,weights['W']) + tf.matmul(x_t,weights['U']) +biases['b']
      h_t = tf.tanh(a_t)
      states.append(h_t)
      h_prev = h_t
    
    # TODO: network output
    
    o = tf.matmul(h_t,weights['V'])+biases['c']
    predictions = tf.nn.softmax(o)

    # class predictions
    predictions = tf.argmax(o, axis=1)
    predictions = tf.reshape(predictions, [-1, 1])

    # TODO: accuracy 
    correct_prediction = tf.equal(predictions,labels_series)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # TODO: loss of the current batch
    total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels,logits=o))
    train_step = tf.train.RMSPropOptimizer(learning_rate).minimize(total_loss)


---
**Points:** $0.0$ of $9$
**Comments:** None

---

### 2.3. Training and Testing

a) For the training, generate a training set with 2000 and a test set of 100 samples. (**0.5 points**)

b) Fill in the `#TODO` sections so that after each 100th batch iteration, the current batch sequences get printed with the prediction and ground truth digit in one line like this: 

`
Epoch 34 Batch 200
Sequence: 9 9 1 2 2 3 3 3 4 4 - prediction: 5 label: 4
Sequence: 5 5 5 5 6 6 6 6 6 6 - prediction: 7 label: 7
Sequence: 1 2 2 3 3 3 4 4 4 4 - prediction: 5 label: 5
Sequence: 4 4 4 4 5 5 5 5 5 6 - prediction: 5 label: 6
Sequence: 7 7 7 7 7 7 7 8 8 8 - prediction: 5 label: 8
`

(**1 point**)

In [0]:

# TODO: Generate Train and Test datasets
X_train, y_train = generate_dataset(sample_count=2000,sample_size=10)
X_test, y_test = generate_dataset(sample_count=100,sample_size=10)

num_batches = len(X_train) // batch_size

# Launch the Session
with tf.Session(graph=RNN_graph) as session:

    # label shift for loss computation
    y_train = y_train - 1
    
    # Initializing the variables
    init = tf.global_variables_initializer()
    session.run(init)

    for cur_epoch in range(epochs):

        print("\nEpoch {}".format(cur_epoch))
        acc_sum = 0
        loss_sum = 0

        indices = np.random.permutation(len(X_train))

        for cur_batch_count in range(num_batches):

            batch_indices = np.array(indices[cur_batch_count:cur_batch_count + batch_size])

            x_batch = X_train[batch_indices]
            y_batch = y_train[batch_indices]    
            
            preds, cur_loss, cur_acc, _ = session.run([predictions, total_loss, accuracy, train_step], feed_dict={batchX_placeholder: x_batch, 
                                                                                    batchY_placeholder: np.reshape(y_batch, [batch_size, 1])
                                                                                    })
            acc_sum += cur_acc
            loss_sum += cur_loss
            
            # TODO: Implement the printing of the current batch predictions for batch 0, 100, 200, etc.
            if cur_batch_count%100 ==0:
              print('Batch '+ str(cur_batch_count), end="")
              for i in range(batch_size):
                s=(' Sequence: {} - prediction {} label: {} ').format(x_batch[i],preds[i],y_batch[i])
              print(s,end="")
            
            
        print("\nAvg Training Loss: {} Avg Train Accuracy: {}".format(loss_sum / num_batches, acc_sum / num_batches))
        
    # Testing
    num_batches = len(X_test) // batch_size
    y_test = y_test - 1

    acc_sum = 0
    loss_sum = 0

    for cur_batch_count in range(num_batches):
        x_batch = X_test[cur_batch_count:cur_batch_count+batch_size]
        y_batch = y_test[cur_batch_count:cur_batch_count+batch_size]            

        cur_loss, cur_acc = session.run([total_loss, accuracy], feed_dict={batchX_placeholder: x_batch, 
                                                                           batchY_placeholder: np.reshape(y_batch, [batch_size, 1])
                                                                           })

        acc_sum += cur_acc
        loss_sum += cur_loss

    print("\nFinal Test Loss: {} Final Test Accuracy: {}".format(loss_sum / num_batches, acc_sum / num_batches))


Epoch 0
Batch 0 Sequence: [1 2 2 3 3 3 4 4 4 4] - prediction [1] label: 4 Batch 100 Sequence: [7 7 7 8 8 8 8 8 8 8] - prediction [4] label: 7 Batch 200 Sequence: [5 6 6 6 6 6 6 7 7 7] - prediction [4] label: 6 Batch 300 Sequence: [2 3 3 3 4 4 4 4 5 5] - prediction [6] label: 4 
Avg Training Loss: 2.3982154205441475 Avg Train Accuracy: 0.2640000005811453

Epoch 1
Batch 0 Sequence: [2 2 3 3 3 4 4 4 4 5] - prediction [7] label: 4 Batch 100 Sequence: [5 5 6 6 6 6 6 6 7 7] - prediction [4] label: 6 Batch 200 Sequence: [1 2 2 3 3 3 4 4 4 4] - prediction [4] label: 4 Batch 300 Sequence: [6 7 7 7 7 7 7 7 8 8] - prediction [4] label: 7 
Avg Training Loss: 2.2896947883069516 Avg Train Accuracy: 0.3055000020377338

Epoch 2
Batch 0 Sequence: [3 3 3 4 4 4 4 5 5 5] - prediction [4] label: 4 Batch 100 Sequence: [4 4 4 5 5 5 5 5 6 6] - prediction [8] label: 5 Batch 200 Sequence: [4 4 4 5 5 5 5 5 6 6] - prediction [7] label: 5 Batch 300 Sequence: [6 6 6 6 6 7 7 7 7 7] - prediction [4] label: 6 
Avg Tr

---
**Points:** $0.0$ of $1.5$
**Comments:** None

---

### 2.4. Questions

a) What will happen to the test performance, if we trained the model with a sequence size of 3 or 15? Thereby, consider the generation of the sequences. (**1 point**)

**Ans** Generally, decreasing the sequence length results in a worse perform becasue the model does not have enough context to make a correct prediction and it will just  predict words that occur frequently in the corpus. 
However, in this case a sequence length of 3 resuls in a better performance becasue of the nature of the sequence. If the sequence ends with a number greater than 3, the RNN has to predict the same number to be correct. 
By increasing the sequence length, we the accuracy to improves, but not by a wide margin because RNNs are not good at encoding long range dependencies.

b) Actually we are just considering fixed sized vectors as input and apply a classification on them. Does this mean that we could also train a Fully Connected Neural Network or an SVM for this problem? Which would be more efficient and why? Briefly explain your answer. (**1 point**)

**Ans** RNN builds the sequence representation step by step, which is not the same as classifying fixed sized vectors. The latter is analogous to doing prediction with an n-gram language model with a large n, in which case performance would be bad because of data sparsity. The performance would remain more or less bad regardless of which model is used if the input is represented like this.

We can use a CNN for this problem, the kernels would slide across the input and amount of context encoded in the representation would depend on the kernel size (note that both the left and right context would be considered). If the kernel size is relatively small, this model would be more efficient because parameters are shared in the network. But if one wants to include large context, that would make the CNN really slow ad inefficient compared to the RNN.

In [0]:
#sequence_size = 3
tf.reset_default_graph()

# parameters
learning_rate = 0.01
epochs = 70
batch_size = 5

# length of a single sequence
input_seq_len = 3

# number of units in RNN cell
n_hidden = 90

n_vocab=9


RNN_graph = tf.Graph()
with RNN_graph.as_default():

    # tf Graph input: X = sequences, Y = digits to predict 
    batchX_placeholder = tf.placeholder(tf.int32, [batch_size, input_seq_len])
    batchY_placeholder = tf.placeholder(tf.int32, [batch_size, 1])

    # init_state = h0
    init_state = tf.Variable(tf.random_normal([batch_size, n_hidden]))

    # TODO: RNN output node weights and biases - set the tf.Variables with correct shapes and random_normal initialization
    weights = {
        'U': tf.Variable(np.random.rand(n_vocab,n_hidden), dtype=tf.float32),
        'W': tf.Variable(np.random.rand(n_hidden, n_hidden), dtype=tf.float32),
        'V': tf.Variable(np.random.rand(n_hidden, n_vocab), dtype=tf.float32)
        }

    biases = {
        'b': tf.Variable(np.zeros((1,n_hidden)), dtype=tf.float32),
        'c': tf.Variable(np.zeros((1,n_vocab)), dtype=tf.float32)
        }

    # TODO: setup graph for the RNN
    inputs_series = tf.unstack(batchX_placeholder, axis=1)
    labels_series = tf.unstack(batchY_placeholder, axis=1)
    labels = tf.one_hot(labels_series,depth=n_vocab)
    
    h_prev = init_state
    states = []
    for t in range(input_seq_len):
#       input_t = tf.reshape(input_series[t], [batch_size, 1])
      x_t = tf.one_hot(inputs_series[t], depth=n_vocab)
      a_t = tf.matmul(h_prev,weights['W']) + tf.matmul(x_t,weights['U']) +biases['b']
      h_t = tf.tanh(a_t)
      states.append(h_t)
      h_prev = h_t
    
    # TODO: network output
    
    o = tf.matmul(h_t,weights['V'])+biases['c']
    predictions = tf.nn.softmax(o)

    # class predictions
    predictions = tf.argmax(o, axis=1)
    predictions = tf.reshape(predictions, [-1, 1])

    # TODO: accuracy 
    correct_prediction = tf.equal(predictions,labels_series)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # TODO: loss of the current batch
    total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels,logits=o))
    train_step = tf.train.RMSPropOptimizer(learning_rate).minimize(total_loss)



X_train, y_train = generate_dataset(sample_count=2000,sample_size=3)
X_test, y_test = generate_dataset(sample_count=100,sample_size=3)

num_batches = len(X_train) // batch_size

# Launch the Session
with tf.Session(graph=RNN_graph) as session:

    # label shift for loss computation
    y_train = y_train - 1
    
    # Initializing the variables
    init = tf.global_variables_initializer()
    session.run(init)

    for cur_epoch in range(epochs):

        print("\nEpoch {}".format(cur_epoch))
        acc_sum = 0
        loss_sum = 0

        indices = np.random.permutation(len(X_train))

        for cur_batch_count in range(num_batches):

            batch_indices = np.array(indices[cur_batch_count:cur_batch_count + batch_size])

            x_batch = X_train[batch_indices]
            y_batch = y_train[batch_indices]    
            
            preds, cur_loss, cur_acc, _ = session.run([predictions, total_loss, accuracy, train_step], feed_dict={batchX_placeholder: x_batch, 
                                                                                    batchY_placeholder: np.reshape(y_batch, [batch_size, 1])
                                                                                    })
            acc_sum += cur_acc
            loss_sum += cur_loss
            
            # TODO: Implement the printing of the current batch predictions for batch 0, 100, 200, etc.
            if cur_batch_count%100 ==0:
              print('Batch '+ str(cur_batch_count), end="")
              for i in range(batch_size):
                s=(' Sequence: {} - prediction {} label: {} ').format(x_batch[i],preds[i],y_batch[i])
              print(s,end="")
            
            
        print("\nAvg Training Loss: {} Avg Train Accuracy: {}".format(loss_sum / num_batches, acc_sum / num_batches))
        
    # Testing
    num_batches = len(X_test) // batch_size
    y_test = y_test - 1

    acc_sum = 0
    loss_sum = 0

    for cur_batch_count in range(num_batches):
        x_batch = X_test[cur_batch_count:cur_batch_count+batch_size]
        y_batch = y_test[cur_batch_count:cur_batch_count+batch_size]            

        cur_loss, cur_acc = session.run([total_loss, accuracy], feed_dict={batchX_placeholder: x_batch, 
                                                                           batchY_placeholder: np.reshape(y_batch, [batch_size, 1])
                                                                           })

        acc_sum += cur_acc
        loss_sum += cur_loss

    print("\nFinal Test Loss: {} Final Test Accuracy: {}".format(loss_sum / num_batches, acc_sum / num_batches))



Epoch 0
Batch 0 Sequence: [5 5 6] - prediction [4] label: 5 Batch 100 Sequence: [8 8 8] - prediction [8] label: 7 Batch 200 Sequence: [3 3 3] - prediction [6] label: 3 Batch 300 Sequence: [1 2 2] - prediction [3] label: 2 
Avg Training Loss: 2.385942150056362 Avg Train Accuracy: 0.24829999793320895

Epoch 1
Batch 0 Sequence: [9 9 9] - prediction [2] label: 8 Batch 100 Sequence: [1 2 2] - prediction [7] label: 2 Batch 200 Sequence: [2 3 3] - prediction [5] label: 2 Batch 300 Sequence: [3 4 4] - prediction [2] label: 3 
Avg Training Loss: 2.2996459916234016 Avg Train Accuracy: 0.26779999924823644

Epoch 2
Batch 0 Sequence: [8 8 8] - prediction [8] label: 7 Batch 100 Sequence: [1 2 2] - prediction [2] label: 2 Batch 200 Sequence: [8 8 8] - prediction [2] label: 7 Batch 300 Sequence: [3 4 4] - prediction [2] label: 3 
Avg Training Loss: 2.3024367809295656 Avg Train Accuracy: 0.27209999952465297

Epoch 3
Batch 0 Sequence: [3 3 3] - prediction [2] label: 3 Batch 100 Sequence: [5 5 6] - pred

In [0]:
#sequence_length = 15

tf.reset_default_graph()

# parameters
learning_rate = 0.01
epochs = 70
batch_size = 5

# length of a single sequence
input_seq_len = 15

# number of units in RNN cell
n_hidden = 90

n_vocab=9


RNN_graph = tf.Graph()
with RNN_graph.as_default():

    # tf Graph input: X = sequences, Y = digits to predict 
    batchX_placeholder = tf.placeholder(tf.int32, [batch_size, input_seq_len])
    batchY_placeholder = tf.placeholder(tf.int32, [batch_size, 1])

    # init_state = h0
    init_state = tf.Variable(tf.random_normal([batch_size, n_hidden]))

    # TODO: RNN output node weights and biases - set the tf.Variables with correct shapes and random_normal initialization
    weights = {
        'U': tf.Variable(np.random.rand(n_vocab,n_hidden), dtype=tf.float32),
        'W': tf.Variable(np.random.rand(n_hidden, n_hidden), dtype=tf.float32),
        'V': tf.Variable(np.random.rand(n_hidden, n_vocab), dtype=tf.float32)
        }

    biases = {
        'b': tf.Variable(np.zeros((1,n_hidden)), dtype=tf.float32),
        'c': tf.Variable(np.zeros((1,n_vocab)), dtype=tf.float32)
        }

    # TODO: setup graph for the RNN
    inputs_series = tf.unstack(batchX_placeholder, axis=1)
    labels_series = tf.unstack(batchY_placeholder, axis=1)
    labels = tf.one_hot(labels_series,depth=n_vocab)
    
    h_prev = init_state
    states = []
    for t in range(input_seq_len):
#       input_t = tf.reshape(input_series[t], [batch_size, 1])
      x_t = tf.one_hot(inputs_series[t], depth=n_vocab)
      a_t = tf.matmul(h_prev,weights['W']) + tf.matmul(x_t,weights['U']) +biases['b']
      h_t = tf.tanh(a_t)
      states.append(h_t)
      h_prev = h_t
    
    # TODO: network output
    
    o = tf.matmul(h_t,weights['V'])+biases['c']
    predictions = tf.nn.softmax(o)

    # class predictions
    predictions = tf.argmax(o, axis=1)
    predictions = tf.reshape(predictions, [-1, 1])

    # TODO: accuracy 
    correct_prediction = tf.equal(predictions,labels_series)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # TODO: loss of the current batch
    total_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels,logits=o))
    train_step = tf.train.RMSPropOptimizer(learning_rate).minimize(total_loss)



X_train, y_train = generate_dataset(sample_count=2000,sample_size=15)
X_test, y_test = generate_dataset(sample_count=100,sample_size=15)

num_batches = len(X_train) // batch_size

# Launch the Session
with tf.Session(graph=RNN_graph) as session:

    # label shift for loss computation
    y_train = y_train - 1
    
    # Initializing the variables
    init = tf.global_variables_initializer()
    session.run(init)

    for cur_epoch in range(epochs):

        print("\nEpoch {}".format(cur_epoch))
        acc_sum = 0
        loss_sum = 0

        indices = np.random.permutation(len(X_train))

        for cur_batch_count in range(num_batches):

            batch_indices = np.array(indices[cur_batch_count:cur_batch_count + batch_size])

            x_batch = X_train[batch_indices]
            y_batch = y_train[batch_indices]    
            
            preds, cur_loss, cur_acc, _ = session.run([predictions, total_loss, accuracy, train_step], feed_dict={batchX_placeholder: x_batch, 
                                                                                    batchY_placeholder: np.reshape(y_batch, [batch_size, 1])
                                                                                    })
            acc_sum += cur_acc
            loss_sum += cur_loss
            
            # TODO: Implement the printing of the current batch predictions for batch 0, 100, 200, etc.
            if cur_batch_count%100 ==0:
              print('Batch '+ str(cur_batch_count), end="")
              for i in range(batch_size):
                s=(' Sequence: {} - prediction {} label: {} ').format(x_batch[i],preds[i],y_batch[i])
              print(s,end="")
            
            
        print("\nAvg Training Loss: {} Avg Train Accuracy: {}".format(loss_sum / num_batches, acc_sum / num_batches))
        
    # Testing
    num_batches = len(X_test) // batch_size
    y_test = y_test - 1

    acc_sum = 0
    loss_sum = 0

    for cur_batch_count in range(num_batches):
        x_batch = X_test[cur_batch_count:cur_batch_count+batch_size]
        y_batch = y_test[cur_batch_count:cur_batch_count+batch_size]            

        cur_loss, cur_acc = session.run([total_loss, accuracy], feed_dict={batchX_placeholder: x_batch, 
                                                                           batchY_placeholder: np.reshape(y_batch, [batch_size, 1])
                                                                           })

        acc_sum += cur_acc
        loss_sum += cur_loss

    print("\nFinal Test Loss: {} Final Test Accuracy: {}".format(loss_sum / num_batches, acc_sum / num_batches))





Epoch 0
Batch 0 Sequence: [6 6 6 6 6 7 7 7 7 7 7 7 8 8 8] - prediction [6] label: 7 Batch 100 Sequence: [1 2 2 3 3 3 4 4 4 4 5 5 5 5 5] - prediction [5] label: 5 Batch 200 Sequence: [4 4 4 5 5 5 5 5 6 6 6 6 6 6 7] - prediction [5] label: 6 Batch 300 Sequence: [4 4 4 4 5 5 5 5 5 6 6 6 6 6 6] - prediction [7] label: 6 
Avg Training Loss: 2.497075224816799 Avg Train Accuracy: 0.28520000195130707

Epoch 1
Batch 0 Sequence: [8 9 9 9 9 9 9 9 9 9 1 2 2 3 3] - prediction [5] label: 2 Batch 100 Sequence: [4 4 5 5 5 5 5 6 6 6 6 6 6 7 7] - prediction [5] label: 6 Batch 200 Sequence: [9 9 9 9 9 9 9 9 9 1 2 2 3 3 3] - prediction [5] label: 3 Batch 300 Sequence: [3 3 4 4 4 4 5 5 5 5 5 6 6 6 6] - prediction [5] label: 5 
Avg Training Loss: 2.3022355288267136 Avg Train Accuracy: 0.3478000047430396

Epoch 2
Batch 0 Sequence: [1 2 2 3 3 3 4 4 4 4 5 5 5 5 5] - prediction [8] label: 5 Batch 100 Sequence: [7 7 7 7 8 8 8 8 8 8 8 8 9 9 9] - prediction [5] label: 8 Batch 200 Sequence: [1 2 2 3 3 3 4 4 4 4 5 

---
**Points:** $0.0$ of $2$
**Comments:** None

---

## 3. Language Modelling using a LSTM$~$ (10 points)
Language Modelling describes a task similiar to 2. where a sequence of data is given and the subsequent element should get predicted. Hereby, the input sequence is a sequence of words from a natural language sentence and the model should predict the next upcomming word like in an auto correction system. 

![title](http://ofir.io/images/lm/keyboard.png)

For the model setup, we use the implementation of an RNN cell as well as of an LSTM cell by tensorflow.

### 3.1 Data Preparation

For this task, our dataset (= corpus) is a small text from the tale **"Androcles"** by Aesop you can find in "train.txt".

A data sample should consist of a sequence of integer word IDs, representing a single word each. The one-hot encoding of the subsequent word which should be predicted, constitues the respecitve label. 

One-hot encoding of words requires a mapping between words and word IDs. If n different words appear in the corpus, the encoding of a single word has shape (n, 1).

Fill in the `#TODO` sections to read the corpus, setup a vocabulary and generate one-hot encoded word sequences with respective label. (**3 points**)

In [0]:
import numpy as np
def setup_vocab(word_list):
    """Reads a string list and creates word wise dictionaries by assigning each word a unique id    
    """
    # TODO: create dict with id-word mapping for a list of words
    # so that e.g.: id_word_dict[25] = "dog"
    word_list=list(set(word_list))
    
    id_word_dict = {i : word_list[i] for i in range(0,len(word_list))}
    
    
   
    
    # TODO: create dict with word-id mapping for a list of words
    # so that e.g.: word_id_dict["dog"] = 25
    word_id_dict = {word_list[i] : i for i in range(0,len(word_list))}
   
    
    
    return id_word_dict, word_id_dict



def word_2_onehot(vocab, input_word):
    
    # TODO: implement this function returning the one-hot encoding (float array) of the word 'input_word'
    iwlist = [0]* len(vocab)
    idx =  vocab[input_word]
    iwlist[idx] = 1
    label_one_hot = iwlist
   
    

    return label_one_hot
  


def onehot_2_word(vocab, encoding):
    
    # TODO: implement this function returning the word string from a one-hot encoding
    word_decoded = vocab[np.argmax(encoding)]
   
    return word_decoded


def prepare_text(filepath="corpus.txt"):
    """Reads a text file, removes whitespaces and returns the text as string list
    """

    # read lines
    with open(filepath) as f:
        content = f.readlines()

    # strip lines
    content = [x.strip() for x in content]

    # split lines into single word lists
    content = [content[i].split() for i in range(len(content))]
    
    # remove non-alphabetics and make lowercase
    content = [re.sub('[^A-Za-z]', '', item.lower()) for sublist in content for item in sublist]

    # filter out empry strings
    content = list(filter(None, content))

    return np.array(content)
        
def prepare_sequences(word_list, vocab, seq_len):
    """
    Samples word sequences from word_list and returns sequences of size seq_len and one-hot encoded word successor word
    """

    samples = []
    labels = []
    
    for start_index in range(len(word_list) - seq_len):
        cur_sequence = []
        for offset in range(seq_len):
            
            # sequence of word-ids
            word = word_list[start_index + offset]
            word_id = vocab[word]
            cur_sequence.append(word_id)
            
        # word-id encoded samples
        cur_sequence = np.reshape(np.array(cur_sequence), [seq_len, 1])
        samples.append(cur_sequence)
        
        # one hot encoded label data
        word_label = word_list[start_index + seq_len]
        label_one_hot =  word_2_onehot(vocab, word_label)
        labels.append(label_one_hot)
        
    return np.array(samples), np.array(labels)

---
**Points:** $0.0$ of $3$
**Comments:** None

---

### 3.2 Model setup and Training

In the following, the complete implementation of RNN is given.

Extend the code below at `#TODO`, so that after `display_step` iterations, the function should **print** the currently considered sentence part with the model prediction and the ground truth word like this:

`Iteration 9500, Average Loss: 0.348980 Average Accuracy: 93.00%
sentence: bound up the paw of the - prediction: lion true word: lion`

_(The training of the model with 10000 iterations should not take longer than 20min. For debugging, you can reduce the number)_

(**2 points**)

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')
path='/content/gdrive/My Drive/'


KeyboardInterrupt: ignored

In [0]:
# Parameters
learning_rate = 0.001
training_iters = 10000
display_step = 100
n_input = 6

text_data = prepare_text(path+"train.txt")

id_word_dict, word_id_dict = setup_vocab(text_data)

samples, labels = prepare_sequences(text_data, word_id_dict, seq_len=n_input)

vocab_size = len(word_id_dict)

In [0]:
#RNN

# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        x = tf.split(x,n_input,1)

        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call        
        rnn_cell = rnn.BasicRNNCell(n_hidden)
        # rnn_cell = rnn.BasicLSTMCell(n_hidden)

        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)

    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        while step < training_iters:
            
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(samples)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = samples[offset]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = labels[offset] 
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            
            #TODO: after every 'display_step' steps, print the current information as stated in the exercise
            if step%display_step==0:
              in_seq= ' '.join([id_word_dict[i[0]] for i in symbols_in_keys[0]])
              y_hat= id_word_dict[np.argmax(onehot_pred)]
              y_real= id_word_dict[np.argmax(symbols_out_onehot)]
              s=('Iteration {}, Average Loss: {:.6f} Average Accuracy: {:.2f}% sentence: {} - prediction: {} true word: {}').format(step,loss_total/step,acc_total*100/step,in_seq,y_hat,y_real)
              print(s)
                            
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Computing total accuracy...")
        acc = session.run([accuracy], feed_dict={x: samples, y: labels})

        print("\nTotal Accuracy: " + str(acc[0]))

In [0]:
#LSTM

# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        x = tf.split(x,n_input,1)

        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call        
#         rnn_cell = rnn.BasicRNNCell(n_hidden)
        rnn_cell = rnn.BasicLSTMCell(n_hidden)

        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)

    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        while step < training_iters:
            
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(samples)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = samples[offset]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = labels[offset] 
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            
            #TODO: after every 'display_step' steps, print the current information as stated in the exercise
            if step%display_step==0:
              in_seq= ' '.join([id_word_dict[i[0]] for i in symbols_in_keys[0]])
              y_hat= id_word_dict[np.argmax(onehot_pred)]
              y_real= id_word_dict[np.argmax(symbols_out_onehot)]
              s=('Iteration {}, Average Loss: {:.6f} Average Accuracy: {:.2f}% sentence: {} - prediction: {} true word: {}').format(step,loss_total/step,acc_total*100/step,in_seq,y_hat,y_real)
              print(s)
                            
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Computing total accuracy...")
        acc = session.run([accuracy], feed_dict={x: samples, y: labels})

        print("\nTotal Accuracy: " + str(acc[0]))

---
**Points:** $0.0$ of $2$
**Comments:** None

---

### 3.3 RNN vs. LSTM

a) The sequence length used for prediction in the above code is specified by `n_input`; change `n_inputs` to 1, 3, 30 and report the training accuracies. After computing the values, replace the RNN cell with an LSTM cell (uncomment the line!) and repeat the procedure (in the end you should have 6 accuracy values in total). What trends do you observe with the training accuracies when the sequence length is varied for both models?  (**3 points**)

`n_input`      RNN     LSTM
---------------------------------------
    1       0.026    0.092
    3       0.054    0.279
    30      0.077    0.865

Overall, the LSTM performs better than the RNN for all sequence lengths. While increasing the sequence length improves the performance of both models, the improvement in the LSTM is much higher compared to the RNN. This means that when more context is available for training, the LSTM is able to learn more from that data. 

b) Which model do you think learns better than the other? Briefly explain your answer. (**1 point**)

The LSTM learns better than the RNN. This can be attributed to the model design because while the RNN tries to remember everything (i.e., includes input from all previous time steps in its representation) the LSTM has the ability to 'forget' some of that can only include parts of the sequence that are important for prediction.

c) Do you expect the model with higher training accuracy to generalize well? Why? Why not? (**1 point**)

If the difference in training accuracy is significantly large, the model with the higher accuracy will perform better on the test set. For example, the RNN and LSTM models have a difference of 78.8% for sequence length 30. So the LSTM will definitely perform better on new, unseen data. However if the difference in the training accuracy of the models is less, then we cannot guarantee the generalisation performance. Infact we can expect the model with higher training accuracy to perform worse because of overfitting on the training data. 

In [0]:
#for LSTM
#n_input = 1

# Parameters
learning_rate = 0.001
training_iters = 10000
display_step = 100
n_input = 1

text_data = prepare_text(path+"train.txt")

id_word_dict, word_id_dict = setup_vocab(text_data)

samples, labels = prepare_sequences(text_data, word_id_dict, seq_len=n_input)

vocab_size = len(word_id_dict)




# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        x = tf.split(x,n_input,1)

        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call        
#         rnn_cell = rnn.BasicRNNCell(n_hidden)
        rnn_cell = rnn.BasicLSTMCell(n_hidden)

        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)

    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        while step < training_iters:
            
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(samples)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = samples[offset]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = labels[offset] 
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            
            #TODO: after every 'display_step' steps, print the current information as stated in the exercise
            if step%display_step==0:
              in_seq= ' '.join([id_word_dict[i[0]] for i in symbols_in_keys[0]])
              y_hat= id_word_dict[np.argmax(onehot_pred)]
              y_real= id_word_dict[np.argmax(symbols_out_onehot)]
              s=('Iteration {}, Average Loss: {:.6f} Average Accuracy: {:.2f}% sentence: {} - prediction: {} true word: {}').format(step,loss_total/step,acc_total*100/step,in_seq,y_hat,y_real)
              print(s)
                            
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Computing total accuracy...")
        acc = session.run([accuracy], feed_dict={x: samples, y: labels})

        print("\nTotal Accuracy: " + str(acc[0]))

In [0]:
#For LSTM
#n_input = 3

# Parameters
learning_rate = 0.001
training_iters = 10000
display_step = 100
n_input = 3

text_data = prepare_text(path+"train.txt")

id_word_dict, word_id_dict = setup_vocab(text_data)

samples, labels = prepare_sequences(text_data, word_id_dict, seq_len=n_input)

vocab_size = len(word_id_dict)




# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        x = tf.split(x,n_input,1)

        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call        
#         rnn_cell = rnn.BasicRNNCell(n_hidden)
        rnn_cell = rnn.BasicLSTMCell(n_hidden)

        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)

    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        while step < training_iters:
            
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(samples)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = samples[offset]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = labels[offset] 
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            
            #TODO: after every 'display_step' steps, print the current information as stated in the exercise
            if step%display_step==0:
              in_seq= ' '.join([id_word_dict[i[0]] for i in symbols_in_keys[0]])
              y_hat= id_word_dict[np.argmax(onehot_pred)]
              y_real= id_word_dict[np.argmax(symbols_out_onehot)]
              s=('Iteration {}, Average Loss: {:.6f} Average Accuracy: {:.2f}% sentence: {} - prediction: {} true word: {}').format(step,loss_total/step,acc_total*100/step,in_seq,y_hat,y_real)
              print(s)
                            
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Computing total accuracy...")
        acc = session.run([accuracy], feed_dict={x: samples, y: labels})

        print("\nTotal Accuracy: " + str(acc[0]))

In [0]:
#For LSTM

#n_input = 30

# Parameters
learning_rate = 0.001
training_iters = 10000
display_step = 100
n_input = 30

text_data = prepare_text(path+"train.txt")

id_word_dict, word_id_dict = setup_vocab(text_data)

samples, labels = prepare_sequences(text_data, word_id_dict, seq_len=n_input)

vocab_size = len(word_id_dict)




# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        x = tf.split(x,n_input,1)

        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call        
#         rnn_cell = rnn.BasicRNNCell(n_hidden)
        rnn_cell = rnn.BasicLSTMCell(n_hidden)

        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)

    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        while step < training_iters:
            
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(samples)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = samples[offset]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = labels[offset] 
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            
            #TODO: after every 'display_step' steps, print the current information as stated in the exercise
            if step%display_step==0:
              in_seq= ' '.join([id_word_dict[i[0]] for i in symbols_in_keys[0]])
              y_hat= id_word_dict[np.argmax(onehot_pred)]
              y_real= id_word_dict[np.argmax(symbols_out_onehot)]
              s=('Iteration {}, Average Loss: {:.6f} Average Accuracy: {:.2f}% sentence: {} - prediction: {} true word: {}').format(step,loss_total/step,acc_total*100/step,in_seq,y_hat,y_real)
              print(s)
                            
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Computing total accuracy...")
        acc = session.run([accuracy], feed_dict={x: samples, y: labels})

        print("\nTotal Accuracy: " + str(acc[0]))

In [0]:
#RNN 

#n_input = 1

learning_rate = 0.001
training_iters = 10000
display_step = 100
n_input = 1

text_data = prepare_text(path+"train.txt")

id_word_dict, word_id_dict = setup_vocab(text_data)

samples, labels = prepare_sequences(text_data, word_id_dict, seq_len=n_input)

vocab_size = len(word_id_dict)


# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        x = tf.split(x,n_input,1)

        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call        
        rnn_cell = rnn.BasicRNNCell(n_hidden)
        # rnn_cell = rnn.BasicLSTMCell(n_hidden)

        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)

    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        while step < training_iters:
            
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(samples)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = samples[offset]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = labels[offset] 
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            
            #TODO: after every 'display_step' steps, print the current information as stated in the exercise
            if step%display_step==0:
              in_seq= ' '.join([id_word_dict[i[0]] for i in symbols_in_keys[0]])
              y_hat= id_word_dict[np.argmax(onehot_pred)]
              y_real= id_word_dict[np.argmax(symbols_out_onehot)]
              s=('Iteration {}, Average Loss: {:.6f} Average Accuracy: {:.2f}% sentence: {} - prediction: {} true word: {}').format(step,loss_total/step,acc_total*100/step,in_seq,y_hat,y_real)
              print(s)
                            
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Computing total accuracy...")
        acc = session.run([accuracy], feed_dict={x: samples, y: labels})

        print("\nTotal Accuracy: " + str(acc[0]))





In [0]:
#RNN 

#n_input = 3

learning_rate = 0.001
training_iters = 10000
display_step = 100
n_input = 3

text_data = prepare_text(path+"train.txt")

id_word_dict, word_id_dict = setup_vocab(text_data)

samples, labels = prepare_sequences(text_data, word_id_dict, seq_len=n_input)

vocab_size = len(word_id_dict)


# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        x = tf.split(x,n_input,1)

        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call        
        rnn_cell = rnn.BasicRNNCell(n_hidden)
        # rnn_cell = rnn.BasicLSTMCell(n_hidden)

        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)

    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        while step < training_iters:
            
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(samples)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = samples[offset]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = labels[offset] 
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            
            #TODO: after every 'display_step' steps, print the current information as stated in the exercise
            if step%display_step==0:
              in_seq= ' '.join([id_word_dict[i[0]] for i in symbols_in_keys[0]])
              y_hat= id_word_dict[np.argmax(onehot_pred)]
              y_real= id_word_dict[np.argmax(symbols_out_onehot)]
              s=('Iteration {}, Average Loss: {:.6f} Average Accuracy: {:.2f}% sentence: {} - prediction: {} true word: {}').format(step,loss_total/step,acc_total*100/step,in_seq,y_hat,y_real)
              print(s)
                            
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Computing total accuracy...")
        acc = session.run([accuracy], feed_dict={x: samples, y: labels})

        print("\nTotal Accuracy: " + str(acc[0]))


In [0]:
#RNN 

#n_input = 30

learning_rate = 0.001
training_iters = 10000
display_step = 100
n_input = 30

text_data = prepare_text(path+"train.txt")

id_word_dict, word_id_dict = setup_vocab(text_data)

samples, labels = prepare_sequences(text_data, word_id_dict, seq_len=n_input)

vocab_size = len(word_id_dict)


# number of units in RNN cell
n_hidden = 512

with tf.Graph().as_default():
    # tf Graph input
    x = tf.placeholder("float", [None, n_input, 1])
    y = tf.placeholder("float", [None, vocab_size])

    # RNN output node weights and biases
    weights = {
        'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
    }
    biases = {
        'out': tf.Variable(tf.random_normal([vocab_size]))
    }

    def RNN(x, weights, biases):

        # reshape to [1, n_input]
        x = tf.reshape(x, [-1, n_input])

        # Generate a n_input-element sequence of inputs
        x = tf.split(x,n_input,1)

        # TODO replace the following layer with a Vanilla RNN tf.contrib.rnn call        
        rnn_cell = rnn.BasicRNNCell(n_hidden)
        # rnn_cell = rnn.BasicLSTMCell(n_hidden)

        # generate prediction
        outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

        # there are n_input outputs but
        # we only want the last output
        return tf.matmul(outputs[-1], weights['out']) + biases['out']


    pred = RNN(x, weights, biases)

    # Loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

    # Model evaluation
    correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Initializing the variables
    init = tf.global_variables_initializer()

    # Launch the Session
    with tf.Session() as session:
        session.run(init)
        step = 0
        offset = random.randint(0,n_input+1)
        end_offset = n_input + 1
        acc_total = 0
        loss_total = 0

        while step < training_iters:
            
            # Generate a minibatch. Add some randomness on selection process.
            if offset > (len(samples)-end_offset):
                offset = random.randint(0, n_input+1)

            symbols_in_keys = samples[offset]
            symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])

            symbols_out_onehot = labels[offset] 
            symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])

            _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                    feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
            loss_total += loss
            acc_total += acc
            
            #TODO: after every 'display_step' steps, print the current information as stated in the exercise
            if step%display_step==0:
              in_seq= ' '.join([id_word_dict[i[0]] for i in symbols_in_keys[0]])
              y_hat= id_word_dict[np.argmax(onehot_pred)]
              y_real= id_word_dict[np.argmax(symbols_out_onehot)]
              s=('Iteration {}, Average Loss: {:.6f} Average Accuracy: {:.2f}% sentence: {} - prediction: {} true word: {}').format(step,loss_total/step,acc_total*100/step,in_seq,y_hat,y_real)
              print(s)
                            
            step += 1
            offset += (n_input+1)
        print("Training Finished!")
        print("Computing total accuracy...")
        acc = session.run([accuracy], feed_dict={x: samples, y: labels})

        print("\nTotal Accuracy: " + str(acc[0]))


---
**Points:** $0.0$ of $5$
**Comments:** None

---

---

## Submission instructions
You should provide a single Jupyter notebook (.ipynb file) as the solution. Put the names and student ids of your team members below. **Make sure to submit only 1 solution to only 1 tutor.**

- Khushboo Mehra, 2576512
- Soumya Sahoo, 2576610
- Vinit Hegiste, 2576578

## Points: 0.0 of 30.0 points