<font color="#de3023"><h1><b>REMINDER: MAKE A COPY OF THIS NOTEBOOK, DO NOT EDIT</b></h1></font>

# Lab 10: RNNs
Now that we've finished working through CNNs and computer vision, we're going to move on to the last major topic of the course: RNNs and NLP. RNN stands for "Recurrent Neural Network" and NLP for "Natural Language Processing." The field of NLP existed long before RNNs and modern machine learning, but it has taken large strides recently because of RNNs. We'll cover NLP in some detail next week. Today, we'll just be talking about the basics of RNNs.

## Why RNNs?
We've already discussed a ton of deep learning tools over this course:
- Dense layers
- Dropout
- Convolutions
- MaxPooling

So why add another? It's useful to see where the core deficiencies lie in what we've learned so far. One of the main ones is that, in *all* the data we've worked with so far, we've treated each prediction as totally independent of the others. For those datasets, this assumption makes complete sense: after all, if you predict one image to be a dog, it makes no difference to how you will predict the next image you see. This is totally untrue, however, in some other datasets. Suppose we had a video where we wish to classify, for each frame, whether there's a dog in the frame or not. If there's a dog in the current frame, chances are it will be there in the next frame as well! So, in this case, we *could* continue to treat the predictions as independent and ignore the history of things we have seen up to that point, but it seems likely that doing so would leave some accuracy on the table for us to scoop up.

Datasets that have this time-dependent property often fall into the field of "time series" analysis, although there are other cases where RNNs are useful, such as in NLP. In language, properties of the current word in a sentence are greatly influenced by those that preceded (or those that succeed) it, which is why we want to take these things into account. RNNs are what let us do this: they give us the ability to "keep track" of what we've seen so far in making predictions.

## Use Cases
RNNs can be used for a *number* of cases:

![](https://cs231n.github.io/assets/rnn/types.png)

- One-to-one: what we've been doing so far
- One-to-many: Text generation
- Many-to-one: Sentiment analysis
- Many-to-many: Stock predictions

The main unifying theme, and when you should have sirens in your head to signal the use of RNNs, is when your data is structured in a way where there is dependence, specifically where the past (or future, but we'll ignore this for now) is useful in predicting the next step. 

An important note is that this dependence **need not** be in your predictions! Sometimes it will be, which is why this could be confusing, but in general, the use of RNNs is a function of your **data (X)**. This is similar to how we exploited the structure of images in choosing to incorporate convolutions into our networks: it wasn't the **output** types that affected this choice of architecture. It was the data. So, in choosing your architectures, you should focus on properties of the data, **not** the labels. Keep "sentiment analysis" as a good example for this, since the data are sequential, but the final prediction is just a single overall label.
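To make the data-versus-label distinction concrete, here is a minimal sketch of the array shapes involved. The sizes (`num_reviews`, `num_words`, `embed_dim`) are made up for illustration and don't come from any dataset in this lab:

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
num_reviews, num_words, embed_dim = 4, 7, 3

# Sequential input: one feature vector per word, for every review.
X = np.zeros((num_reviews, num_words, embed_dim))

# Many-to-one (e.g. sentiment analysis): a single label per sequence.
y_many_to_one = np.zeros((num_reviews,))

# Many-to-many (e.g. tagging every word): one label per time step.
y_many_to_many = np.zeros((num_reviews, num_words))

print(X.shape, y_many_to_one.shape, y_many_to_many.shape)
```

In both cases `X` has a time axis, which is what suggests an RNN. Only the shape of the labels changes.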

## Overall Architecture
Great! Now that we've got the main conceptual motivation out, let's see how these things actually work in practice. These architectures can be pretty involved, so we'll dive into specifics later and just walk through the high-level ideas first. The idea is that the network is going to learn how to "store" a representation of the things that it has seen in the past into a hidden state. This, **combined with** the next data point, is what's used to predict the next y!

![](https://stanford.edu/~shervine/teaching/cs-230/illustrations/rnn-many-to-many-same-ltr.png?2790431b32050b34b80011afead1f232)
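As a rough illustration of "storing the past in a state", here is a sketch that carries a single number through a sequence using a fixed, hand-written update rule (an exponential moving average). This is *not* an RNN: the function `running_state` and the mixing factor `alpha` are assumptions made up for this example, whereas an RNN learns its update rule from data.

```python
import numpy as np

def running_state(xs, alpha=0.5):
    """Carry a scalar state through xs: state = (1 - alpha) * state + alpha * x."""
    state = 0.0
    states = []
    for x in xs:
        # The new state mixes the old state (the past) with the new data point.
        state = (1 - alpha) * state + alpha * x
        states.append(state)
    return np.array(states)

xs = np.array([1.0, 1.0, 1.0, 0.0])
states = running_state(xs)
print(states)
```

Notice that the last input is 0, yet the final state is still well above 0: the state "remembers" the earlier ones.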

There are a bunch of different RNNs, but we've gone through a fair bit of theory at this point. So, let's switch gears to see how this is used in practice.

## Setup
For today's lab, we'll start by looking at how to use RNNs for stock prediction:

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/kevincwu0/rnn-google-stock-prediction/master/Google_Stock_Price_Train.csv"
urllib.request.urlretrieve(url, "Google_Stock_Price_Train.csv")

url = "https://raw.githubusercontent.com/kevincwu0/rnn-google-stock-prediction/master/Google_Stock_Price_Test.csv"
urllib.request.urlretrieve(url, "Google_Stock_Price_Test.csv")

One thing to keep in mind is how we construct the datasets: now that we're actually baking in information about the past to predict the future, we have to construct the dataset to account for this! Let's start by loading in the raw dataset:

In [None]:
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
training_set = dataset_train.iloc[:, 1:2].values 

from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)

In [None]:
print(training_set_scaled[0:10])

In [None]:
plt.plot(training_set, color = 'red', label = 'Real Google Stock Price')
plt.title('Google Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Google Stock Price')
plt.legend()
plt.show()

Unlike past work you've done, this data can't be used as is for training: you have to construct the input sequences (windows of past prices) yourself. Let's give that a go:
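To see what "windows of the past plus the current value" looks like, here is a sketch on a toy series with a window of 3 (not the 60 used in the exercise). It uses NumPy's `sliding_window_view` (available in NumPy 1.20 and later) instead of an explicit loop, so treat it as one possible approach, not the intended solution. The names `series`, `X_toy`, and `y_toy` are made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10, dtype=float)  # toy series: 0, 1, ..., 9
window = 3                           # hypothetical window size for this toy

# Every run of `window` consecutive values: shape (10 - 3 + 1, 3) = (8, 3).
windows = sliding_window_view(series, window)

X_toy = windows[:-1]     # drop the last window: no value follows it
y_toy = series[window:]  # the value that comes right after each window

print(X_toy.shape, y_toy.shape)
print(X_toy[0], "->", y_toy[0])
```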

In [None]:
#############################################################################
# Exercise                                                               #
#                                                                           #
# Construct datasets X_train and y_train, which are, for each time step, the#
# price in the last 60 time steps and the current price resp.               #
# The dataset you should pull from is: training_set_scaled                  #
#############################################################################

num_pts, _ = training_set_scaled.shape
X_train = []
y_train = []

for i in range(____):
  ____
  ____

X_train = np.array(X_train)
y_train = np.array(y_train)
print(y_train.shape)

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

## SimpleRNNs
Let's start by seeing how to use RNNs for this task. We will use a new layer called `SimpleRNN` to get started:

In [None]:
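# Import from tensorflow.keras to stay consistent with the imports in the setup cell
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Dropout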

`SimpleRNN` is a basic RNN: we'll see extensions soon. The two main parameters to keep in mind are:

- `units`: The dimension of the hidden state: this does **not** have to match the window size! It's just another hyperparameter to choose
- `return_sequences`: Whether you want to return the output at **every** time step of the input sequence or **just** the output at the final time step
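The relationship between the two `return_sequences` settings can be sketched with plain NumPy. The array `states` below is a stand-in for the hidden states an RNN layer would compute at every time step. Its values and the toy sizes are made up, and no Keras layer is actually called here:

```python
import numpy as np

batch, timesteps, units = 2, 6, 4  # toy sizes; note units != timesteps

# Stand-in for the hidden state at every time step of every sequence.
states = np.arange(batch * timesteps * units, dtype=float).reshape(batch, timesteps, units)

full_sequence = states         # what return_sequences=True hands back
last_state = states[:, -1, :]  # what return_sequences=False hands back

print(full_sequence.shape)  # one hidden state per time step
print(last_state.shape)     # only the final hidden state
```

The final-step output is just the last slice of the full sequence along the time axis.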

Let's see how the results differ before we construct a full RNN:

In [None]:
#############################################################################
# Exercise                                                               #
#                                                                           #
# Construct a SimpleRNN with 50 units (and return_sequences False) and pass 
# in an example from your training set through the layer. What do you expect 
# the dimension to be?
#############################################################################

simple_rnn = ____

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

In [None]:
# Call the layer defined above (no `model` has been built at this point)
simple_rnn(X_train[0:1, :])

In [None]:
#############################################################################
# Exercise                                                               #
#                                                                           #
# Repeat the above, now with return_sequences = True. What do you expect 
# the dimension to be now? 
#############################################################################

____

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

Great! So, `return_sequences` is mostly useful when you either want to stack multiple recurrent layers together *or* want to do a many-to-many prediction (like part-of-speech tagging in a block of text, for instance). Let's now think about what we should be using for this task. The architecture of the model will be:

```
SimpleRNN -> Dropout -> Dense
```

Think about what the task is and what the output dimension should be as a result!

In [None]:
#############################################################################
# Exercise                                                              #
#                                                                           #
# Construct a sequential model with the architecture given above. Don't 
# train it yet: we'll do that next
#############################################################################

model = tf.keras.models.Sequential(
  ____
)

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

In [None]:
#############################################################################
# Exercise                                                               #
#                                                                           #
# Compile your model with the appropriate loss. Remember, we're doing a 
# regression task here!
#############################################################################

model.compile(____)

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

model.fit(X_train, y_train, epochs = 20, batch_size = 32)

Let's now take a look at the results:

In [None]:
dataset_test = pd.read_csv('Google_Stock_Price_Test.csv')
real_stock_price = dataset_test.iloc[:, 1:2].values

In [None]:
def plot_model(model):
  dataset_test = pd.read_csv('Google_Stock_Price_Test.csv')
  real_stock_price = dataset_test.iloc[:, 1:2].values

  # Getting the predicted stock price of 2017
  dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis = 0)
  inputs = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values
  inputs = inputs.reshape(-1,1)
  inputs = sc.transform(inputs)
  X_test = []

  for i in range(60, 80):
      X_test.append(inputs[i-60:i, 0])

  X_test = np.array(X_test)
  X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
  predicted_stock_price = model.predict(X_test)
  predicted_stock_price = sc.inverse_transform(predicted_stock_price)

  # Visualising the results
  plt.plot(real_stock_price, color = 'red', label = 'Real Google Stock Price')
  plt.plot(predicted_stock_price, color = 'blue', label = 'Predicted Google Stock Price')
  plt.title('Google Stock Price Prediction')
  plt.xlabel('Time')
  plt.ylabel('Google Stock Price')
  plt.legend()
  plt.show()

In [None]:
plot_model(model)

## LSTMs

One of the main issues with this SimpleRNN is that, due to how the hidden state is learned, it struggles to keep track of long-term dependencies. This means that, as you consider longer and longer sequences, the values at the beginning of the sequence have very little impact on the predictions made later in the sequence, which defeats the purpose of using RNNs in the first place. We therefore need a way of learning these longer-range dependencies. Luckily for us, a great deal of work has gone into fixing exactly this problem, and it led to something called the "Long Short-Term Memory" (LSTM) cell. Before we actually understand how LSTMs work, let's see how simple it is to use them: just replace the SimpleRNN layer with an LSTM layer!
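Before swapping layers, the long-term dependency problem described above can be made concrete with a small NumPy sketch. It runs a tanh recurrence for 50 steps and tracks how sensitive the current hidden state is to the *initial* hidden state. The weights are random and rescaled so their largest singular value is 0.9. That rescaling is an assumption made here so that the shrinking is guaranteed (with larger weights the same quantity can blow up instead).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8

# Random recurrent weights, rescaled so the largest singular value is 0.9 (assumption).
Whh = rng.normal(size=(hidden_dim, hidden_dim))
Whh *= 0.9 / np.linalg.norm(Whh, 2)
Wxh = rng.normal(size=(1, hidden_dim))

h = np.zeros((1, hidden_dim))
jacobian = np.eye(hidden_dim)  # d h_t / d h_0, starts as the identity
norms = []
for t in range(50):
    x_t = rng.normal(size=(1, 1))     # made-up scalar input
    h = np.tanh(h @ Whh + x_t @ Wxh)  # one recurrent step
    # Chain rule: d h_t[j] / d h_{t-1}[i] = Whh[i, j] * (1 - h_t[j]^2)
    step_jacobian = Whh * (1.0 - h ** 2)
    jacobian = jacobian @ step_jacobian
    norms.append(np.linalg.norm(jacobian))

print(norms[0], norms[9], norms[49])
```

The sensitivity to the first state shrinks roughly geometrically with the number of steps, so early inputs barely influence late hidden states (or the gradients used to train on them).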

In [None]:
from tensorflow.keras.layers import LSTM

In [None]:
#############################################################################
# Exercise                                                               #
#                                                                           #
# Construct a sequential model now with an LSTM and retrain it
#############################################################################

model = ____

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

plot_model(model)

## Opening the Black Box
Let's now take a look inside the SimpleRNN model. Remember that an RNN can be represented by a sequence of hidden states $h_1, \ldots, h_T$, where each hidden state $h_t$ is computed as some function of the previous hidden state $h_{t-1}$, the current input $x_t$, and a set of weights $W$:

$$h_t = f_W(h_{t-1}, x_t)$$

The point of training an RNN is to learn these weights $W$, which make it possible to construct the hidden state on the fly. So, what is this function we're actually using? It's:

$$h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$$

To keep things simple, we'll ignore the bias for the time being. Let's take a moment to parse this. Visually, this is the same as:

![](https://cs231n.github.io/assets/rnn/vanilla_rnn_mformula_1.png)

Conceptually, this looks pretty similar to stuff we've done in the past! Remember that a fully-connected layer is nothing but:

$$y_t = \tanh(W_{xy}x_t)$$

(Here, we're just using a tanh activation instead of a ReLU.) So, an RNN does nothing more than iterate through the input, saving the hidden state and reusing it for the next prediction, where we now have **two** weight matrices: one for the hidden state and the other for the input!
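One consequence of having just these two weight matrices (plus an optional bias) is that the number of parameters doesn't depend on how long the sequence is, since the same weights are reused at every step. Here is a quick sketch of the count. The helper `simple_rnn_param_count` is made up for this example and simply adds up the sizes of $W_{xh}$, $W_{hh}$, and $b_h$:

```python
def simple_rnn_param_count(input_dim, units, use_bias=True):
    """Count the weights in h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    w_xh = input_dim * units        # input-to-hidden weights
    w_hh = units * units            # hidden-to-hidden weights
    b_h = units if use_bias else 0  # one bias per hidden unit
    return w_xh + w_hh + b_h

# One feature per time step (a single price) and 50 hidden units.
print(simple_rnn_param_count(input_dim=1, units=50))                  # 2600
print(simple_rnn_param_count(input_dim=1, units=50, use_bias=False))  # 2550
```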

In [None]:
hidden_dim = 50
x_dim = X_train[0, :].shape[0]

In [None]:
h = tf.Variable(tf.zeros((1, hidden_dim)))
x = tf.Variable(tf.zeros((1, x_dim)))

In [None]:
#############################################################################
# Exercise                                                               #
#                                                                           #
# Construct the two weight matrices (call them "Whh" and "Wxh") that are the
# correct dimensions from the equation above. Just fill them with 0s. Remember:
# h_t = tanh(W_{hh} h_{t-1} + W_{xh}x_t)
#############################################################################

Whh = ____
Wxh = ____

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

In [None]:
#############################################################################
# Exercise                                                               #
#                                                                           #
# Using the weights matrices you defined in the cell above, make predictions
# for the "current" hidden state h and input x
#############################################################################
print(x.shape)
h = ____

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

In [None]:
h.shape

Let's now see how we would use this basic framework to process an entire sequence:

In [None]:
x = tf.Variable(tf.zeros((5, x_dim))) # make up data of sequence size 5
h = tf.Variable(tf.zeros((1, hidden_dim)))

In [None]:
#############################################################################
# Exercise                                                               #
#                                                                           #
# String the code from above together to make predictions on the *entire* 
# sequence x now. Remember: you update the hidden state and use that with
# each successive x to make a prediction!
#############################################################################

h = tf.Variable(tf.zeros((1, hidden_dim)))
num_examples, _  =  x.shape

for i in range(num_examples):
  ____
print(h)

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

In [None]:
#############################################################################
# Exercise                                                              #
#                                                                           #
# Let's try putting it all together into a single function: this function
# receives an input and you have to return the *final* state h. Use zeros
# for your initial hidden state. Make sure to also construct the weights
# with the appropriate sizes
#############################################################################

def simple_rnn(x, hidden_dim):
  num_examples, x_dim  =  x.shape
  Whh = ____
  Wxh = ____

  h = ____

  for i in range(num_examples):
    ____
  return h

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################
test_x = tf.Variable(tf.zeros((20, 5))) 
print(simple_rnn(test_x, 10).shape)

And that's basically it for SimpleRNNs!

## LSTMs in Depth
Finally, let's take a look at LSTMs. The main difference, as we mentioned, is **how** the hidden state is tracked:

![](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)

There are a *lot* of moving parts here, but the core idea is that we're holding onto another state (called the "cell state") in addition to the hidden state. The hand-wavy intuition is that a vanilla RNN overwrites its hidden state at every step, whereas the LSTM only writes relevant things into the cell state, so we avoid "polluting" the memory of the RNN and improve the long-term memory of the unit. The top line in the diagram above represents this cell state, and you can see how information from h and x is regulated by these "sigmoid gates" before it actually impacts the cell state.

Remember: sigmoid outputs values in the range (0, 1), so this roughly corresponds to "how much" of the input should be let through to the cell state. The rough ideas for the gates are as follows:

- Forget gate: How much of the cell state to just throw away
- Input gate: How should the cell state be updated to incorporate the new data that's coming in from x and h
- Output gate: How should the cell state be used in combination with the hidden state and x to produce another output
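The "how much to let through" idea behind all three gates can be sketched in a few lines of NumPy. The pre-activation values in `gate_logits` are made up to show the three regimes (gate nearly closed, half open, nearly open):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

cell_state = np.array([2.0, 2.0, 2.0])
gate_logits = np.array([-10.0, 0.0, 10.0])  # made-up pre-activations

gate = sigmoid(gate_logits)  # values strictly between 0 and 1
gated = gate * cell_state    # elementwise: how much of each entry survives

print(gate)
print(gated)
```

The first entry is almost entirely thrown away, the second is halved, and the third passes through nearly untouched.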

Thinking of the cell state as long-term memory and the hidden state as short-term memory is a very crude intuition you can keep in mind. In math, we first compute an *activation vector* $a\in\mathbb{R}^{4H}$ as $a=W_xx_t + W_hh_{t-1}+b$. We then divide this into four vectors $a_i,a_f,a_o,a_g\in\mathbb{R}^H$ where $a_i$ consists of the first $H$ elements of $a$, $a_f$ is the next $H$ elements of $a$, etc. We then compute the *input gate* $i\in\mathbb{R}^H$, *forget gate* $f\in\mathbb{R}^H$, *output gate* $o\in\mathbb{R}^H$ and *block input* $g\in\mathbb{R}^H$ as

$$
\begin{align*}
i = \sigma(a_i) \hspace{2pc}
f = \sigma(a_f) \hspace{2pc}
o = \sigma(a_o) \hspace{2pc}
g = \tanh(a_g)
\end{align*}
$$
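The "divide $a$ into four vectors" step can be sketched with `np.split`. The size `H = 3` and the contents of `a` are made up so the slices are easy to read off:

```python
import numpy as np

H = 3
a = np.arange(4 * H, dtype=float)  # toy activation vector of length 4H

# Four consecutive chunks of H elements each, in the order i, f, o, g.
a_i, a_f, a_o, a_g = np.split(a, 4)

print(a_i, a_f, a_o, a_g)
```

If `a` carries a leading batch axis, with shape `(1, 4 * H)`, the same call works with `axis=-1`.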

We compute the next cell state $c_t$ and next hidden state $h_t$ as

$$
c_{t} = f\odot c_{t-1} + i\odot g \hspace{4pc}
h_t = o\odot\tanh(c_t)
$$
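Here is a small worked example of these two update equations with $H = 2$. The gate values are hand-picked numbers (not computed from any weights) so the arithmetic is easy to follow:

```python
import numpy as np

c_prev = np.array([2.0, -1.0])  # previous cell state

# Hand-picked gate values for illustration (normally sigmoid / tanh outputs).
f = np.array([0.5, 1.0])   # forget gate
i = np.array([0.9, 0.0])   # input gate
g = np.array([-1.0, 0.7])  # block input
o = np.array([1.0, 0.5])   # output gate

c = f * c_prev + i * g  # [0.5*2 + 0.9*(-1), 1*(-1) + 0*0.7] = [0.1, -1.0]
h = o * np.tanh(c)      # hidden state read out from the new cell state

print(c)
print(h)
```

In the second unit, the forget gate is 1 and the input gate is 0, so the cell state is copied through unchanged. This is the mechanism that lets an LSTM hold onto information over many steps.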

Let's see how to begin to implement this:

In [None]:
cell_dim = 25
hidden_dim = 50
x_dim = X_train[0, :].shape[0]

x = tf.Variable(tf.zeros((1, x_dim)))
c = tf.Variable(tf.zeros((1, hidden_dim)))

In [None]:
#############################################################################
# Exercise                                                              #
#                                                                           #
# Construct the weight matrices AND h vector now to fit the dimensions for 
# the LSTM eqns
#############################################################################

____

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

In [None]:
#############################################################################
# Exercise                                                              #
#                                                                           #
# Compute the four gates for the given x, h above. Remember the g gate is
# different from the rest
#############################################################################

____

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################

In [None]:
#############################################################################
# Exercise                                                             #
#                                                                           #
# Compute the update to the cell state and that to the hidden state
#############################################################################

____

#############################################################################
#                              END OF YOUR CODE                             #
#############################################################################