# Exercise on Stateful Recurrent Neural Network: SOLUTION

**In this exercise we use an improved RNN model to predict whether an ice cream store has ice cream in stock today. We can only use the past weather for our predictions, and we hope that today's stock depends on the weather of the past couple of days.** 

**The weather is described by 3 states: 0=sunny, 1=cloudy and 2=rainy. People only buy ice cream when it is sunny. The ice cream stand has an unknown stock and reorders from time to time (the policy is unknown, but we hope it depends on the weather).
Unfortunately, we are quite busy with work, so we can only remember the weather of the last 2 days; for that reason our lookback is only 2 days.**

**To improve on the simple RNN model we will use a stateful RNN model. This means that the hidden state reached at the end of one mini-batch is passed on as the initial hidden state of the next mini-batch, which holds the continuation of the sequence (it is not reset to zero!). (For prediction with this stateful RNN we need to process the test data with the same mini-batch size as we used for training.) To work with a stateful RNN model we need to prepare our mini-batches in a special way: the first example of the first batch has to be continued by the first example of the second batch, and so on (see lecture slides).**  
**The idea of passing the current hidden state into the next mini-batch is that we can learn something from parts of the sequence that lie further back than only two steps (the past is summarized in the current hidden state).**
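
The effect of carrying the hidden state over can be illustrated with a minimal NumPy sketch. It assumes a tanh RNN cell with randomly drawn weights (`W_x`, `W_h`, `b` and the helper `run_rnn` are illustrative names, not part of the exercise code). Processing a 4-step sequence in two chunks of 2 steps gives the same final hidden state as processing it in one go, provided the state is handed over; resetting the state to zero in between generally gives a different result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                                    # 3 weather states, 4 hidden units
W_x = rng.normal(scale=0.5, size=(n_in, n_hidden))       # input-to-hidden weights (random stand-ins)
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))   # hidden-to-hidden weights
b = rng.normal(scale=0.5, size=n_hidden)

def run_rnn(x_seq, h):
    """Run a tanh RNN cell over x_seq, starting from hidden state h."""
    for x_t in x_seq:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

weather = np.array([0, 1, 2, 0])        # four days of weather
x = np.eye(n_in)[weather]               # one-hot encoding, shape (4, 3)
h0 = np.zeros(n_hidden)

h_full = run_rnn(x, h0)                           # all four days at once
h_stateful = run_rnn(x[2:], run_rnn(x[:2], h0))   # two chunks, state carried over
h_reset = run_rnn(x[2:], h0)                      # second chunk started from zeros

print(np.allclose(h_full, h_stateful))  # carried state reproduces the full run
print(np.allclose(h_full, h_reset))     # a reset state does not
```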


**a) Look at the RNN model definition, the data preparation, and the model training. What is different compared to the simple RNN?**
            
Solution:

In the model definition we need to set the argument stateful=True in the SimpleRNN layer and fix the batch size via batch_input_shape.
We need to prepare stateful mini-batches that connect the parts of the same sequence in the right order. We need to do that for the train, validation and test sets separately. During training we must not shuffle, since shuffling would destroy the order of the stateful batches. After each epoch we call model.reset_states(), since there is no connection between the last mini-batch of one epoch and the first mini-batch of the next.
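
As a sketch of the ordering described above, the index array built by the loop in the data-preparation cells can also be obtained with a reshape and a transpose. The helper name `stateful_order` is hypothetical, and the sketch assumes that the number of examples is a multiple of the batch size.

```python
import numpy as np

def stateful_order(n_examples, batch_size):
    """Return indices so that example j of mini-batch k+1 continues example j of mini-batch k."""
    n_batches = n_examples // batch_size
    # Row j holds one contiguous sub-sequence: j*n_batches, ..., (j+1)*n_batches - 1.
    # Transposing puts the k-th element of every sub-sequence into mini-batch k.
    return np.arange(n_batches * batch_size).reshape(batch_size, n_batches).T.ravel()

idx = stateful_order(10000, 50)
# mini-batch 0 starts with 0, 200, 400; mini-batch 1 starts with 1, 201, 401
print(idx[:3], idx[50:53])
```

With this ordering, position j of every mini-batch walks through its own contiguous stretch of the original sequence, which is what the stateful layer relies on.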


**b) Take the trained model and predict the first two examples of the test set. What are the probabilities for ice/no-ice for these two examples?**   

Solution:

probability vector for outcome ice/no-ice that we get for first example: 0.22169746 (no-ice) 0.77830255 (ice)

probability vector for outcome ice/no-ice that we get for second example: 0.3353621 (no-ice)  0.66463786 (ice)

**c) Complete the code to do the prediction by "hand/numpy" using the extracted weight matrices. (We use model.get_weights() to get the learned weights.) Which values do we need to give the incoming hidden state for example 1 and for example 2 of the test data? Do we get the same probability vectors as with model.predict?**

Solution:

For the first example of the test data we initialize the incoming hidden state with zeros.
For the second example of the test data, which is the continuation of the first example, we use the hidden state resulting from the first example - in this manner we do a stateful RNN prediction.

We indeed get the same probability vectors as with model.predict, which shows that the predict function does what we expect it to do:

probability vector for outcome ice/no-ice that we get for first example: 0.22169746 (no-ice) 0.77830255 (ice)

probability vector for outcome ice/no-ice that we get for second example: 0.3353621 (no-ice)  0.66463786 (ice)
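
Written out, the computation that the NumPy code has to reproduce is the following. This is a sketch in our own notation: $W_x$, $W_h$ and $b_1$ denote the input kernel, recurrent kernel and bias of the SimpleRNN layer, $W_2$ and $b_2$ the weights of the Dense layer, and $x_t$ the one-hot weather vector of day $t$ within the example.

$$
h_t = \tanh\left(x_t W_x + h_{t-1} W_h + b_1\right), \qquad t = 1, 2
$$

$$
p = \mathrm{softmax}\left(h_2 W_2 + b_2\right)
$$

For the first test example $h_0 = 0$; for the second test example $h_0$ is the final state $h_2$ of the first example.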


**d) Assess the performance of the stateful RNN model on the test data set. How does the achieved accuracy compare to the accuracy you have achieved with a simple RNN model?**

Solution:

With the stateful RNN we achieve an accuracy of 0.739 on the test set, which is above the accuracy we saw with the simple RNN. This shows that, in our case, the stateful RNN outperforms the simple RNN model.

**e) Explain why the stateful RNN model outperforms the simple RNN model in our example. (Hint: remember the data generating process.) How could you improve the performance of the simple RNN model? Play around with the code to check your ideas.**

Solution:

Since the data are simulated, we know the data generating process: the stock is refilled when it was cloudy three days earlier, and the stock level itself carries over from day to day. The ice cream stock therefore depends on more than the last 2 days. Since the stateful RNN transfers the hidden state from one part of the sequence to its continuation in the next mini-batch, it can make use of more than 2 days of the past, which helps to improve the performance. To get a better performance with the simple RNN we could lengthen the input sequence, e.g. use lookback=4 instead of lookback=2. We can indeed see that with a longer lookback the performance of the simple RNN improves and gets close to that of the corresponding stateful RNN.
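
To experiment with a longer lookback, the window construction can be parameterized. The following is a minimal sketch, not the notebook's own code: `make_windows` is a hypothetical helper that cuts non-overlapping windows of length `lookback` and uses the label of the day right after each window as the target, mirroring the loop in the data preparation. Unlike that loop, it drops the final window that has no target instead of leaving a row of zeros.

```python
import numpy as np

def make_windows(X, Y, lookback):
    """Cut X into non-overlapping windows; the target is Y on the day after each window."""
    n = (len(X) - 1) // lookback               # windows whose target index is still valid
    X_win = X[:n * lookback].reshape(n, lookback)
    Y_win = Y[lookback:n * lookback + 1:lookback].reshape(n, 1)
    return X_win, Y_win

# toy weather and stock sequences, only to show the resulting shapes
X_toy = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
Y_toy = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1])
X4, Y4 = make_windows(X_toy, Y_toy, lookback=4)
print(X4.shape, Y4.shape)   # (2, 4) (2, 1)
```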



### Import packages

In [1]:
import numpy as np
import sys
np.random.seed(42)
import tensorflow as tf

import keras
%matplotlib inline
import matplotlib.pyplot as plt
tf.__version__, sys.version_info
import pandas as pd

Using TensorFlow backend.


## Prepare data

In [2]:
def gen_data(size=1000000):
    Xs = np.array(np.random.choice(3, size=(size,))) #Random Weather
    Y = []
    ice = 2 # stock of icecream at start
    for t,x in enumerate(Xs):
        # (t - 3) >= 0: the first ice cream can be delivered on day 3
        # Xs[t - 3] == 1: it was cloudy three days ago, so we ordered ice cream
        # ice < 2: the stock is not full
        if (t - 3) >= 0 and Xs[t - 3] == 1 and ice < 2: 
            ice += 1
        if x == 0: # It is sunny, so we sell ice cream if we have any
            if ice > 0: # We have ice cream
                ice -= 1
        if ice > 0: #We are not out of stock
            Y.append(1)
        else:
            Y.append(0)
    return Xs, np.array(Y)

### Generate the data and split it into train, validation and test sets

In [3]:
X, Y = gen_data(40000) 

lookback=2

X_tr = X[0:20000]
Y_tr = Y[0:20000]
idx=np.arange(0, len(X_tr),lookback)
X_train=np.zeros((len(idx),lookback))
Y_train=np.zeros((len(idx),1))
# the last window has no target inside X_tr, so the final row stays all zeros
for i in range(0,len(idx)-1):
    X_train[i]=X_tr[idx[i]:idx[i+1]]
    Y_train[i]=Y_tr[idx[i]+lookback]

X_va = X[20000:30000]
Y_va = Y[20000:30000]
idx=np.arange(0, len(X_va),lookback)
X_valid=np.zeros((len(idx),lookback))
Y_valid=np.zeros((len(idx),1))
for i in range(0,len(idx)-1):
    X_valid[i]=X_va[idx[i]:idx[i+1]]
    Y_valid[i]=Y_va[idx[i]+lookback]

X_te = X[30000:40000]
Y_te = Y[30000:40000]
idx=np.arange(0, len(X_te),lookback)
X_test=np.zeros((len(idx),lookback))
Y_test=np.zeros((len(idx),1))
for i in range(0,len(idx)-1):
    X_test[i]=X_te[idx[i]:idx[i+1]]
    Y_test[i]=Y_te[idx[i]+lookback]    

In [4]:
print(X_train.shape)
print(Y_train.shape)

print(X_valid.shape)
print(Y_valid.shape)

print(X_test.shape)
print(Y_test.shape)


(10000, 2)
(10000, 1)
(5000, 2)
(5000, 1)
(5000, 2)
(5000, 1)


### converting to one hot encoding for keras

In [5]:
from keras.utils import to_categorical

X_train=to_categorical(X_train,3)
Y_train=to_categorical(Y_train,2)

X_valid=to_categorical(X_valid,3)
Y_valid=to_categorical(Y_valid,2)

X_test=to_categorical(X_test,3)
Y_test=to_categorical(Y_test,2)
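
For intuition, the one-hot encoding of integer class labels can be reproduced in plain NumPy by indexing an identity matrix. This is a sketch of the idea, not the Keras implementation; `one_hot` is an illustrative name.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels of any shape to one-hot vectors along a new last axis."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels, dtype=int)]

weather = np.array([[0, 2], [1, 1]])   # two windows with lookback 2
encoded = one_hot(weather, 3)
print(encoded.shape)                   # (2, 2, 3)
print(encoded[0])                      # sunny -> [1, 0, 0], rainy -> [0, 0, 1]
```

Note that Keras' `to_categorical` additionally drops a trailing axis of length 1, which is why `Y_train` goes from shape `(10000, 1)` to `(10000, 2)`.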


In [6]:
print(X_train.shape)
print(Y_train.shape)

print(X_valid.shape)
print(Y_valid.shape)

print(X_test.shape)
print(Y_test.shape)


(10000, 2, 3)
(10000, 2)
(5000, 2, 3)
(5000, 2)
(5000, 2, 3)
(5000, 2)


### prepare stateful batches


In [7]:
batch_s=50
# first create stateful mini-batches from the training data
# (np.int was removed from recent NumPy versions, so we use integer division)
batches=len(X_train)//batch_s
idx=np.arange(0, batches*batch_s,batches)
for i in range(1,batches):
    idx=np.append(idx,np.arange(0, batches*batch_s,batches)+i)
print(idx[0:100])
X_train_stateful=np.zeros((len(X_train),lookback,3))
for i in range(0,len(idx)):
    X_train_stateful[i]=X_train[idx[i]]
Y_train_stateful=np.zeros((len(Y_train),2))
for i in range(0,len(idx)):
    Y_train_stateful[i]=Y_train[idx[i]]

[   0  200  400  600  800 1000 1200 1400 1600 1800 2000 2200 2400 2600
 2800 3000 3200 3400 3600 3800 4000 4200 4400 4600 4800 5000 5200 5400
 5600 5800 6000 6200 6400 6600 6800 7000 7200 7400 7600 7800 8000 8200
 8400 8600 8800 9000 9200 9400 9600 9800    1  201  401  601  801 1001
 1201 1401 1601 1801 2001 2201 2401 2601 2801 3001 3201 3401 3601 3801
 4001 4201 4401 4601 4801 5001 5201 5401 5601 5801 6001 6201 6401 6601
 6801 7001 7201 7401 7601 7801 8001 8201 8401 8601 8801 9001 9201 9401
 9601 9801]


In [8]:
# now create stateful mini-batches from the validation data
batches=len(X_valid)//batch_s
idx=np.arange(0, batches*batch_s,batches)
for i in range(1,batches):
    idx=np.append(idx,np.arange(0, batches*batch_s,batches)+i)
X_valid_stateful=np.zeros((len(X_valid),lookback,3))
for i in range(0,len(idx)):
    X_valid_stateful[i]=X_valid[idx[i]]
Y_valid_stateful=np.zeros((len(Y_valid),2))
for i in range(0,len(idx)):
    Y_valid_stateful[i]=Y_valid[idx[i]]

## Setting up the stateful RNN model

In [9]:
from keras.layers import Activation, Dense, SimpleRNN

In [10]:
model = keras.models.Sequential()

name = 'RNN_stateful'

model.add(SimpleRNN(4, batch_input_shape=(50,lookback, 3),stateful=True))
model.add(Dense(2))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [11]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_1 (SimpleRNN)     (50, 4)                   32        
_________________________________________________________________
dense_1 (Dense)              (50, 2)                   10        
_________________________________________________________________
activation_1 (Activation)    (50, 2)                   0         
Total params: 42
Trainable params: 42
Non-trainable params: 0
_________________________________________________________________
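
The parameter counts in the summary can be checked by hand. A SimpleRNN layer has an input kernel, a recurrent kernel and a bias; the Dense layer has a kernel and a bias. A short sketch of the arithmetic:

```python
n_in, n_hidden, n_out = 3, 4, 2

rnn_params = n_in * n_hidden + n_hidden * n_hidden + n_hidden   # 12 + 16 + 4
dense_params = n_hidden * n_out + n_out                          # 8 + 2

print(rnn_params, dense_params, rnn_params + dense_params)  # 32 10 42
```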


In [12]:
model.evaluate(X_train_stateful[0:50],Y_train_stateful[0:50],batch_size=50)



[0.7252042293548584, 0.5]

In [13]:
print(model.predict(X_train_stateful[0:50],batch_size=50)[0:5])
print(Y_train_stateful[0:5])

[[0.3813641  0.6186359 ]
 [0.39468852 0.6053115 ]
 [0.39468852 0.6053115 ]
 [0.51871794 0.4812821 ]
 [0.51871794 0.4812821 ]]
[[0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]


### train the stateful RNN model

In [14]:
for i in range(30):
    history1 = model.fit(X_train_stateful, Y_train_stateful, 
                        epochs=1, 
                        batch_size=50, 
                        verbose=2, 
                        validation_data=(X_valid_stateful,Y_valid_stateful),
                        shuffle=False) # since stateful batches are ordered, we must not shuffle
    model.reset_states()  
    # we need to reset the hidden states after each epoch 
    # since there is no connection between last and first minibatch

Train on 10000 samples, validate on 5000 samples
Epoch 1/1
 - 0s - loss: 0.7009 - acc: 0.5217 - val_loss: 0.6795 - val_acc: 0.5836
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
 - 0s - loss: 0.6803 - acc: 0.5866 - val_loss: 0.6719 - val_acc: 0.5980
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
 - 0s - loss: 0.6733 - acc: 0.5904 - val_loss: 0.6665 - val_acc: 0.5980
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
 - 0s - loss: 0.6676 - acc: 0.5920 - val_loss: 0.6623 - val_acc: 0.6092
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
 - 0s - loss: 0.6637 - acc: 0.6106 - val_loss: 0.6607 - val_acc: 0.6258
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
 - 0s - loss: 0.6617 - acc: 0.6224 - val_loss: 0.6600 - val_acc: 0.6280
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
 - 0s - loss: 0.6602 - acc: 0.6218 - val_loss: 0.6588 - val_acc: 0.6294
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
 - 0s - loss: 0.6584 - a

### After the training is completed we extract the learned weights

In [15]:
model.get_weights()

[array([[ 0.46829993, -0.27448082,  0.49198586,  0.18986978],
        [-0.07143547, -0.66210234, -0.05878811, -0.4841847 ],
        [-0.10000977, -0.64688206, -0.13557   , -0.382531  ]],
       dtype=float32),
 array([[-0.5852696 , -0.13081849,  0.42224163,  0.7146077 ],
        [ 1.0864695 , -0.57855856,  0.5216885 , -0.13379107],
        [ 0.84883505,  0.6864449 , -0.44127634,  0.44790152],
        [-0.47710732,  0.72620666,  0.7077398 ,  0.24581367]],
       dtype=float32),
 array([ 0.17804326,  0.24491675,  0.2880884 , -0.19558866], dtype=float32),
 array([[ 0.37118736, -0.96871555],
        [ 1.1224096 , -0.25177336],
        [ 0.1929656 , -0.647232  ],
        [-0.49972147, -0.40730447]], dtype=float32),
 array([-0.17113112,  0.17113112], dtype=float32)]

In [16]:
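# stack the input kernel (3x4) on top of the recurrent kernel (4x4);
# np.vstack replaces the deprecated alias np.row_stack
W1=np.vstack(model.get_weights()[0:2])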
b1=model.get_weights()[2]
W2=model.get_weights()[3]
b2=model.get_weights()[4]

In [17]:
W1 # input kernel (first 3 rows) stacked on top of the recurrent kernel (last 4 rows)

array([[ 0.46829993, -0.27448082,  0.49198586,  0.18986978],
       [-0.07143547, -0.66210234, -0.05878811, -0.4841847 ],
       [-0.10000977, -0.64688206, -0.13557   , -0.382531  ],
       [-0.5852696 , -0.13081849,  0.42224163,  0.7146077 ],
       [ 1.0864695 , -0.57855856,  0.5216885 , -0.13379107],
       [ 0.84883505,  0.6864449 , -0.44127634,  0.44790152],
       [-0.47710732,  0.72620666,  0.7077398 ,  0.24581367]],
      dtype=float32)
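
Stacking the two kernels works because multiplying the concatenated vector `[x_t, h_prev]` with the stacked matrix equals the sum of the two separate products. A small sketch with random stand-in weights (`W_x` and `W_h` are illustrative, not the trained values):

```python
import numpy as np

rng = np.random.default_rng(1)
W_x = rng.normal(size=(3, 4))        # stand-in for the input kernel
W_h = rng.normal(size=(4, 4))        # stand-in for the recurrent kernel
W_stacked = np.vstack([W_x, W_h])    # shape (7, 4), input rows first

x_t = np.array([0.0, 1.0, 0.0])      # one-hot weather (cloudy)
h_prev = rng.normal(size=4)          # some previous hidden state

separate = x_t @ W_x + h_prev @ W_h
stacked = np.concatenate([x_t, h_prev]) @ W_stacked
print(np.allclose(separate, stacked))  # True
```

The order matters: the input must come first in the concatenation because the input kernel occupies the first rows of the stacked matrix.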

### Prepare the test data for a stateful RNN model

In [18]:
# prepare the test data for a stateful RNN model
batch_s=50
batches=len(X_test)//batch_s
idx=np.arange(0, batches*batch_s,batches)
for i in range(1,batches):
    idx=np.append(idx,np.arange(0, batches*batch_s,batches)+i)

X_test_stateful=np.zeros((len(X_test),lookback,3))
for i in range(0,len(idx)):
    X_test_stateful[i]=X_test[idx[i]]
Y_test_stateful=np.zeros((len(Y_test),2))
for i in range(0,len(idx)):
    Y_test_stateful[i]=Y_test[idx[i]]

### Do the prediction on the first two examples of the test data

In [19]:
# reset the hidden state to zero
model.reset_states()
# predict the first two mini-batches (each has size 50)
y_pred1=model.predict(X_test_stateful[0:100],batch_size=50)
print(y_pred1.shape) # for each of the 100 examples we get a 2-dim probability vector
# check the prediction of the first instance in minibatch 1 and in minibatch 2 (each mini-batch has size 50):
# the hidden state gets passed from the first instance in minibatch 1 to first instance in minibatch 2
# below we will do this by hand and check if we get same predictions
print(y_pred1[0])
print(y_pred1[50])

(100, 2)
[0.22169746 0.77830255]
[0.3353621  0.66463786]


## One forward pass of a stateful RNN in numpy by "hand"

### first determine the prediction of the first instance of mini-batch 1

In [20]:
# prepare the ingoing hidden state for the first example of the test data:
h0=np.array((0,0,0,0),dtype="float32")

In [21]:
h1=np.tanh(np.matmul(np.concatenate((X_test_stateful[0][0],h0)),W1)+b1)

In [22]:
h2=np.tanh(np.matmul(np.concatenate((X_test_stateful[0][1],h1)),W1)+b1)

In [23]:
Z=np.matmul(h2,W2)+b2
np.exp(Z)/np.sum(np.exp(Z))

array([0.22169747, 0.77830253])

In [24]:
# do the same again but this time with a for loop to go over the elements of a sequence
# initialize hidden state of first(!) mini-batch with zeros
ht_m1=np.array((0,0,0,0),dtype="float32") 

for i in range(0,lookback):
    ht_m1=np.tanh(np.matmul(np.concatenate((X_test_stateful[0][i],ht_m1)),W1)+b1)
Z=np.matmul(ht_m1,W2)+b2
np.exp(Z)/np.sum(np.exp(Z))

array([0.22169747, 0.77830253])

#### now determine the prediction of the first instance of mini-batch 2 (stateful connected to first instance of mini-batch 1)

In [25]:
# your code here to define the incoming hidden state for the second example of the test data
# keep hidden state from last mini-batch
ht_m2 = ht_m1
for i in range(0,lookback):
    ht_m2=np.tanh(np.matmul(np.concatenate((X_test_stateful[50][i],ht_m2)),W1)+b1)
Z=np.matmul(ht_m2,W2)+b2
np.exp(Z)/np.sum(np.exp(Z))

array([0.3353621, 0.6646379])

#### We get the same result as with predict, which shows that the stateful RNN does what we expect it to do.

## Check if the Performance of the stateful RNN is better than the simple RNN model

In [26]:
model.reset_states()
from sklearn.metrics import confusion_matrix
pred=model.predict(X_test_stateful,batch_size=50,)
print(confusion_matrix(np.argmax(Y_test_stateful,axis=1), np.argmax(pred,axis=1)))
np.sum(np.argmax(pred,axis=1)==np.argmax(Y_test_stateful,axis=1))/len(Y_test)

[[1314  760]
 [ 545 2381]]


0.739
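
The reported accuracy can be recomputed from the confusion matrix above and compared with the trivial baseline of always predicting the more frequent class. A short sketch, assuming the scikit-learn convention that rows are true classes and columns are predicted classes (class 0 = no-ice, class 1 = ice):

```python
import numpy as np

cm = np.array([[1314, 760],
               [545, 2381]])   # confusion matrix printed above

accuracy = np.trace(cm) / cm.sum()            # correct predictions lie on the diagonal
baseline = cm.sum(axis=1).max() / cm.sum()    # always predict the majority class
recall = np.diag(cm) / cm.sum(axis=1)         # per-class recall (no-ice, ice)

print(accuracy)   # 0.739
print(baseline)   # 0.5852
print(recall.round(3))
```

The model therefore beats the majority-class baseline clearly, and it recognizes the "ice" class more reliably than the "no-ice" class.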

#### We showed that a stateful RNN can improve the performance.