# Basics of recurrent neural networks

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1">Introduction</a></span></li><li><span><a href="#Requirements" data-toc-modified-id="Requirements-2">Requirements</a></span><ul class="toc-item"><li><span><a href="#Knowledge" data-toc-modified-id="Knowledge-2.1">Knowledge</a></span></li><li><span><a href="#Python-Modules" data-toc-modified-id="Python-Modules-2.2">Python Modules</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-2.3">Data</a></span></li></ul></li><li><span><a href="#Recurrent-neural-network" data-toc-modified-id="Recurrent-neural-network-3">Recurrent neural network</a></span><ul class="toc-item"><li><span><a href="#Implementation" data-toc-modified-id="Implementation-3.1">Implementation</a></span><ul class="toc-item"><li><span><a href="#Forward-pass" data-toc-modified-id="Forward-pass-3.1.1">Forward pass</a></span></li><li><span><a href="#Backward-path" data-toc-modified-id="Backward-path-3.1.2">Backward path</a></span></li><li><span><a href="#Sampling" data-toc-modified-id="Sampling-3.1.3">Sampling</a></span></li><li><span><a href="#Trainings-process" data-toc-modified-id="Trainings-process-3.1.4">Trainings process</a></span><ul class="toc-item"><li><span><a href="#Hyperparameter" data-toc-modified-id="Hyperparameter-3.1.4.1">Hyperparameter</a></span></li><li><span><a href="#Main" data-toc-modified-id="Main-3.1.4.2">Main</a></span></li></ul></li><li><span><a href="#Learning-curve" data-toc-modified-id="Learning-curve-3.1.5">Learning curve</a></span></li></ul></li></ul></li><li><span><a href="#Licenses" data-toc-modified-id="Licenses-4">Licenses</a></span><ul class="toc-item"><li><span><a href="#Notebook-License-(CC-BY-SA-4.0)" data-toc-modified-id="Notebook-License-(CC-BY-SA-4.0)-4.1">Notebook License (CC-BY-SA 4.0)</a></span></li><li><span><a href="#Code-License-(MIT)" data-toc-modified-id="Code-License-(MIT)-4.2">Code License (MIT)</a></span></li></ul></li></ul></div>

## Introduction

We will walk through a vanilla recurrent neural network with a softmax classifier by looking at a NumPy implementation of it.
This implementation is based on the lecture [CS231n Recurrent Neural Networks](https://www.youtube.com/watch?v=6niqTuYFZLQ) and [Karpathy's min-char-rnn example](https://gist.github.com/karpathy/d4dee566867f8291f086). An understanding of backpropagation and matrix calculus is recommended, since we use both without explaining them.

## Requirements

### Knowledge

- [Recommended] [Neural Network and Deep Learning - Backpropagation](http://neuralnetworksanddeeplearning.com/chap2.html)
- [Recommended] [Matrix Calculus](https://explained.ai/matrix-calculus/index.html)


### Python Modules

In [None]:
import numpy as np
import matplotlib.pyplot as plt

### Data

In [None]:
text = "Als rekurrente bzw. rückgekoppelte neuronale Netze bezeichnet man neuronale Netze, die sich im Gegensatz zu den Feedforward-Netzen durch Verbindungen von Neuronen einer Schicht zu Neuronen derselben oder einer vorangegangenen Schicht auszeichnen. Im Gehirn ist dies die bevorzugte Verschaltungsweise neuronaler Netze, insbesondere im Neocortex. In künstlichen neuronalen Netzen werden rekurrente Verschaltung von Modellneuronen benutzt, um zeitlich codierte Informationen in den Daten zu entdecken"

text_length = len(text)
# sorted so that the character-to-integer mapping is the same on every run
chars = sorted(set(text))
char_length = len(chars)

# dictionaries which we will use in the future for transformations
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# transform the text string to a list of integers
X = np.array([char_to_int[char] for char in text])

print('text:\n', text, '\n')
print('length of text:\n', text_length, '\nnumber of unique characters:\n', char_length)
print('alphabet:\n', chars, '\n')
print('first 10 data points:\n', X[0:10])
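
As a quick sanity check of the mappings above, the following sketch encodes a short string, one-hot encodes its first character and decodes the string again. The toy string and the helper name `one_hot` are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

# toy text; an assumption for illustration only
sample_text = "hello"
vocab = sorted(set(sample_text))
to_int = {c: i for i, c in enumerate(vocab)}
to_char = {i: c for i, c in enumerate(vocab)}

# string -> array of integers
encoded = np.array([to_int[c] for c in sample_text])

def one_hot(index, size):
    """Return a (size, 1) column vector with a single 1 at position index."""
    vec = np.zeros((size, 1))
    vec[index] = 1
    return vec

first = one_hot(encoded[0], len(vocab))
# array of integers -> string
decoded = ''.join(to_char[int(i)] for i in encoded)
print(encoded, first.ravel(), decoded)
```

The column-vector shape `(size, 1)` matches the shape the forward pass later uses for its inputs.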

## Recurrent neural network

In traditional feedforward neural networks (e.g. convolutional networks) we assume that all inputs are independent of each other. The network classifies each image separately and does not care about the image that was classified before. For some tasks that is not ideal: if you want to predict the next word in a sentence, you want to know which words came before it. Recurrent neural networks, on the other hand, can use sequential information because they perform the same task for every element of a sequence, with the output depending on the previous calculations. You can think of recurrent neural networks as neural networks with a built-in memory in which they store information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. A simple RNN can be displayed as:

<img src="images/vanilla-rnn.jpg" width="550px"> 
- $n$ is the length of the sequence you look at in each iteration.
- $x_n$ is one part of a sequence which you feed into the network, e.g. if we have a sequence $x = \text{Hello World}$ then $x_0 = \text{Hello}$ and $x_1 = \text{World}$ is possible. In our implementation we feed in only one character at a time, which gives us $x_0 = \text{H}$, $x_1 = \text{e}$, $x_2 = \text{l}$, $x_3 = \text{l}$, $x_4 = \text{o}$, ....
- $h_n$ is our memory, which is called the hidden state of our recurrent neural network. You can see here that the hidden state $h_{n-1}$ is passed on to $h_{n}$.
- $W_{hidden}$ is the weight of our hidden state layer (written $W_{hh}$ in the formulas below).
- $y_n$ is the prediction or output of our network. This layer differs depending on the task we want to solve. If we want to build an RNN that predicts the next character, then $y_n$ could be the output of a softmax.
- $W_{hy}$ is the weight of our output layer.
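
The roles of the symbols above can be sketched with a single recurrent step that is applied repeatedly. This is a minimal sketch with toy sizes and random weights; the function name `rnn_step` and all sizes are assumptions for illustration, and an input weight `W_xh` is included because the implementation further down uses one.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 4, 3   # toy sizes, assumptions for illustration

W_xh = rng.standard_normal((hidden_size, vocab_size)) * 0.01
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
W_hy = rng.standard_normal((vocab_size, hidden_size)) * 0.01

def rnn_step(x, h_prev):
    """One time step: returns the new hidden state and the softmax output."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    y = W_hy @ h
    p = np.exp(y - y.max())
    p /= p.sum()
    return h, p

h = np.zeros((hidden_size, 1))
for index in [1, 0, 2, 2]:            # a toy sequence of character indices
    x = np.zeros((vocab_size, 1))
    x[index] = 1
    h, p = rnn_step(x, h)             # the same weights are reused at every step

print(h.shape, p.shape, float(p.sum()))
```

The hidden state `h` is the only thing carried from one step to the next, which is what makes it act as the network's memory.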

**Note:** You may wonder why the diagram does not contain a loss function. This is because with an RNN you can decide where to put it. One possibility is to compute a loss value for every prediction; another is to compute a loss value only for, say, the last 10 predictions.
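
The two options from the note can be compared on a toy example. In this sketch the per-step probabilities and targets are made-up numbers, and the loss is the negative log-likelihood of the correct class at each step.

```python
import numpy as np

# toy per-step probability distributions over 3 classes (made-up numbers)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.25, 0.25, 0.5],
])
targets = np.array([0, 1, 2, 2])

# negative log-likelihood of the correct class at every step
step_losses = -np.log(probs[np.arange(len(targets)), targets])

loss_all_steps = step_losses.sum()        # a loss term for every prediction
loss_last_two = step_losses[-2:].sum()    # loss terms only for the last 2 predictions
print(loss_all_steps, loss_last_two)
```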

### Implementation
Here we go over the core functions of a recurrent neural network. Every function could be added to a ```rnn.py``` script to run the recurrent neural network outside of a notebook. This implementation is not efficient, though, and it is not recommended for very large texts.

#### Forward pass

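$$
\begin{aligned}
z_t &= W_{xh} * x_t + W_{hh} * h_{t-1} \\
h_t &= \tanh(z_t) \\
y_t &= W_{hy} * h_t \\
p_{t,i} &= \frac{e^{y_{t,i}}}{\sum_k e^{y_{t,k}}} \\
L &= \sum_t - label_t^T * \log(p_t)
\end{aligned}
$$

Here $p_{t,i}$ is the predicted probability of character $i$ at time step $t$, and $label_t$ is the one-hot encoded correct character at time step $t$, so only the log-probability of the correct character contributes to the loss.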
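
The softmax in the formulas exponentiates the raw scores, which can overflow for large values. Subtracting the maximum score first leaves the probabilities unchanged and avoids this. The following sketch shows the idea on deliberately large toy scores; the function name `softmax` is an assumption for illustration.

```python
import numpy as np

def softmax(y):
    """Softmax of a column vector, shifted by its maximum for numerical stability."""
    shifted = y - np.max(y)
    e = np.exp(shifted)
    return e / np.sum(e)

y = np.array([[1000.0], [1001.0], [1002.0]])  # large scores overflow a naive np.exp
p = softmax(y)
target = 2
loss = -np.log(p[target, 0])                  # loss term of a single time step
print(p.ravel(), loss)
```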

In [None]:
def forward_pass(X, Y, h):
    '''
    This is the forward pass of our recurrent neural network. It gets the data, the labels and the current
    hidden state and returns the new hidden states, the probability distributions and the softmax loss.

    Args:
        X: data as list of integers
        Y: labels as list of integers
        h: hidden states as list; only the last entry is used as the starting state

    Returns:
        (h, p, loss) - new hidden states as list, probability distributions as list, loss as number

    '''
    # initializes the variables which will be returned later
    h, p, loss = [h[-1]], [], 0
    for t in range(len(Y)):
        # transforming the data point at this time step to a one-hot encoded vector
        xt = np.zeros((char_length, 1))
        xt[X[t]] = 1
        # calculating the forward pass based on the formulas above
        h.append(np.tanh(np.dot(Wxh, xt) + np.dot(Whh, h[t])))
        y = np.dot(Why, h[t + 1])
        # softmax; subtracting the maximum does not change the result but avoids overflow
        y = y - np.max(y)
        p.append(np.exp(y) / np.sum(np.exp(y)))
        # cross-entropy loss of the correct character
        loss += -np.log(p[t][Y[t], 0])
    # the first entry is the starting state, so it is not returned
    return h[1:], p, loss

#### Backward path

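$$
\begin{aligned}
\frac{\partial L_t}{\partial W_{hy}} &= (p_t - label_t) * h_t^T \\
\frac{\partial L_t}{\partial h_t} &= W_{hy}^T * (p_t - label_t) \\
\frac{\partial h_t}{\partial h_{t-1}} &= \frac{\partial h_t}{\partial z_t} \frac{\partial z_t}{\partial h_{t-1}} = \operatorname{diag}(1 - h_t^2) * W_{hh} \\
\frac{\partial z_t}{\partial W_{hh}} &= h_{t-1}^T \\
\frac{\partial z_t}{\partial W_{xh}} &= x_t^T
\end{aligned}
$$

$L_t$ is the loss term of time step $t$ and $label_t$ is the one-hot encoded correct character. In the code, the gradient is passed one step back in time by multiplying with the transposed Jacobian, i.e. $W_{hh}^T * ((1 - h_t^2) \odot \frac{\partial L}{\partial h_t})$, where $\odot$ is the element-wise product.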
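
The term $p_t - label_t$ is the gradient of the softmax loss with respect to the scores $y_t$. A common way to gain confidence in such a formula is to compare it with a numerical gradient. The sketch below does this with central differences on random toy scores; the helper names `softmax` and `loss_fn` are assumptions for illustration.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / np.sum(e)

def loss_fn(y, target):
    """Negative log-probability of the target class."""
    return -np.log(softmax(y)[target, 0])

rng = np.random.default_rng(1)
y = rng.standard_normal((5, 1))   # toy scores for 5 classes
target = 3

# analytic gradient: p - label
analytic = softmax(y)
analytic[target] -= 1

# numerical gradient by central differences
eps = 1e-5
numeric = np.zeros_like(y)
for i in range(y.shape[0]):
    step = np.zeros_like(y)
    step[i] = eps
    numeric[i] = (loss_fn(y + step, target) - loss_fn(y - step, target)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The same kind of check could be applied to the weight gradients, at the cost of one pair of forward passes per weight entry.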

In [None]:
def backward_pass(X, Y, h, p):
    '''
    Calculates the gradients of our recurrent neural network.

    Args:
        X: data as list of integers
        Y: labels as list of integers
        h: hidden states as list, one entry per time step
        p: probability distributions which were calculated in the forward pass

    Returns:
        (dWhh, dWxh, dWhy) - (gradient of Whh, gradient of Wxh, gradient of Why)
    '''

    # initializes the gradients with zeros
    dWhh, dWxh, dWhy = np.zeros_like(Whh), np.zeros_like(Wxh), np.zeros_like(Why)
    # gradient flowing back into h_t from time step t + 1; zero for the last time step
    dhprevious = np.zeros_like(h[0])
    for t in reversed(range(len(Y))):
        # transforming the data point at this time step to a one-hot encoded vector
        xt = np.zeros((char_length, 1))
        xt[X[t]] = 1
        # gradient of Why
        dy = np.copy(p[t])
        dy[Y[t]] -= 1
        dWhy += np.dot(dy, h[t].T)
        # gradient of Wxh and Whh
        dh = np.dot(Why.T, dy) + dhprevious  # backprop into h
        dz = (1 - h[t] ** 2) * dh  # backprop through tanh nonlinearity
        dWxh += np.dot(dz, xt.T)
        # for t = 0, h[t - 1] reads h[-1], which is expected to hold the
        # hidden state from before the sequence
        dWhh += np.dot(dz, h[t - 1].T)
        dhprevious = np.dot(Whh.T, dz)

    # gradient clip to mitigate exploding gradients
    for dparam in [dWxh, dWhh, dWhy]:
        np.clip(dparam, -5, 5, out=dparam)
    return dWhh, dWxh, dWhy
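
The clipping at the end of `backward_pass` limits every gradient entry separately. The sketch below shows this on a toy gradient and, for comparison, a norm-based rescaling. The norm-based variant is an alternative that the notebook does not use; it is shown only to make the difference visible.

```python
import numpy as np

grad = np.array([[0.5, -12.0], [7.0, 3.0]])   # toy gradient with two large entries

# element-wise clipping, as in backward_pass: each entry is limited to [-5, 5]
clipped = np.clip(grad, -5, 5)

# alternative (not used in the notebook): rescale the whole gradient so that
# its norm does not exceed a threshold; this keeps its direction
max_norm = 5.0
norm = np.linalg.norm(grad)
rescaled = grad * min(1.0, max_norm / norm)

print(clipped)
print(np.linalg.norm(rescaled))
```

Element-wise clipping can change the direction of the gradient, since only the large entries are altered, whereas rescaling shrinks all entries by the same factor.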

#### Sampling

In [None]:
def sample(start_char, h, n):
    '''
    This function returns a sentence of length n based on the starting character and the latest hidden state.

    Args:
        start_char: the character which we start with as integer
        h: latest hidden state
        n: length of sentence as integer

    Returns:
        A sample sentence as String
    '''

    # transforming the starting character to a one-hot encoded vector
    x = np.zeros((char_length, 1))
    x[start_char] = 1
    # initializing output
    text = ''
    for t in range(n):
        # predicting which character will be next
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h))
        y = np.dot(Why, h)
        p = np.exp(y) / np.sum(np.exp(y))
        # draws the next character at random according to the probability distribution p
        random_index = np.random.choice(range(char_length), p=p.ravel())
        # adds the drawn character to our output string, so the text matches what is fed back in
        text += chars[random_index]
        # transforming the drawn character to a one-hot encoded vector for the next step
        x = np.zeros((char_length, 1))
        x[random_index] = 1
    return text
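
Choosing the next character with `argmax` always returns the most likely one, while a random draw picks each character in proportion to its probability. The following sketch contrasts the two on a toy distribution. It uses NumPy's `default_rng` generator with a fixed seed, which is an assumption made here for reproducibility and differs from the `np.random.choice` call in `sample`.

```python
import numpy as np

rng = np.random.default_rng(42)
p = np.array([0.1, 0.6, 0.3])          # toy distribution over 3 characters

greedy = int(np.argmax(p))             # always the most likely index
draws = rng.choice(len(p), size=10000, p=p)
# relative frequency of every index among the random draws
frequencies = np.bincount(draws, minlength=len(p)) / len(draws)

print(greedy, frequencies)
```

With many draws the relative frequencies approach `p`, which is why sampled text is more varied than text generated greedily.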

#### Trainings process

##### Hyperparameter

In [None]:
# how many characters it will look at, at each step
seq_size = 20

# size of the hidden layer 
hidden_size = 100

# learning rate for the gradient descent algorithm
learning_rate = 1e-1

# how many times the model sees the whole data
epochs = 500

print('sequence size:', seq_size, '\nhidden size:', hidden_size, '\nlearning rate:', learning_rate, '\nepochs:', epochs)
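
To make the role of `seq_size` concrete, the sketch below cuts a toy integer sequence into chunks of inputs and targets, where the targets are the inputs shifted by one position. The toy data and the small `seq_size` are assumptions for illustration; stopping the loop at `len(data) - 1` is a choice made here so that every chunk has at least one target.

```python
import numpy as np

data = np.arange(10)      # stands in for the integer-encoded text X
seq_size = 4              # toy value, smaller than the notebook's setting

chunks = []
for start in range(0, len(data) - 1, seq_size):
    inputs = data[start:start + seq_size]
    targets = data[start + 1:start + 1 + seq_size]   # inputs shifted by one position
    chunks.append((inputs, targets))
    print(inputs, targets)
```

The last chunk can be shorter than `seq_size`, and its targets can be shorter than its inputs, because the final character has no successor.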

##### Main

In [None]:
# initializing weights
Wxh = np.random.randn(hidden_size, char_length) * 0.01
Whh = np.random.randn(hidden_size, hidden_size) * 0.01
Why = np.random.randn(char_length, hidden_size) * 0.01

# initializing hidden state, loss history, squared gradient and loss
grad_squared_xh, grad_squared_hh, grad_squared_hy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
loss_history = []
h = [np.zeros((hidden_size, 1))]
loss = 0

for epoch in range(epochs):
    # reset the hidden state at the start of every pass over the text
    h = [np.zeros((hidden_size, 1))]

    # stop at text_length - 1 so that every chunk has at least one target character
    for steps in range(0, text_length - 1, seq_size):
        # splits the data to the right sequence size; the targets are the inputs shifted by one
        inputs = X[steps:steps+seq_size]
        targets = X[steps+1:steps+1+seq_size]
        # remember the hidden state from before this chunk
        h_prev = h[-1]
        # calculate the new hidden states, probability distributions and loss
        h, p, loss = forward_pass(inputs, targets, h)
        loss_history.append(loss)
        # get the gradients; backward_pass reads h[t - 1], which is h[-1] for t = 0,
        # so the previous hidden state is appended as the last list entry
        dWhh, dWxh, dWhy = backward_pass(inputs, targets, h + [h_prev], p)

        # perform parameter update with Adagrad
        for param, dparam, mem in zip([Wxh, Whh, Why],
                                      [dWxh, dWhh, dWhy],
                                      [grad_squared_xh, grad_squared_hh, grad_squared_hy]):
            mem += dparam * dparam
            param += -learning_rate * dparam / np.sqrt(mem + 1e-8)  # adagrad update

    if epoch % 10 == 0:
        print('sample at epoch:', epoch, 'with loss of:', loss)
        print(sample(inputs[0], h[-1], 200), '\n')
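
The Adagrad update in the loop above divides each gradient entry by the square root of its accumulated squared gradients, so entries that have received large gradients take smaller steps over time. The sketch below applies the same rule to a simple quadratic function. The function name `adagrad_step` and the toy problem are assumptions for illustration.

```python
import numpy as np

def adagrad_step(param, grad, mem, learning_rate=0.1, eps=1e-8):
    """Update param and mem in place using the same rule as the training loop."""
    mem += grad * grad
    param += -learning_rate * grad / np.sqrt(mem + eps)

# minimise f(w) = sum(w ** 2); its gradient is 2 * w
w = np.array([3.0, -2.0])
mem = np.zeros_like(w)
start = np.sum(w ** 2)
for _ in range(500):
    adagrad_step(w, 2 * w, mem)

print(start, np.sum(w ** 2))
```

Because `mem` only grows, the effective step size only shrinks, which is one reason Adagrad can slow down on long training runs.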

#### Learning curve

In [None]:
plt.plot(loss_history)
plt.ylabel('loss')
plt.xlabel('iterations')
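
Since one loss value is recorded per chunk, the curve can look noisy. One option is to smooth it with a moving average before plotting. The sketch below shows the smoothing on toy values; the function name `moving_average` and the window size are assumptions for illustration.

```python
import numpy as np

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')

noisy = [4.0, 2.0, 3.0, 1.0, 2.0, 0.0]   # toy loss values
smoothed = moving_average(noisy, window=3)
print(smoothed)
```

Applied to the notebook, `moving_average(loss_history, window)` could be passed to `plt.plot` in place of `loss_history`; the result has `window - 1` fewer points than the input.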

## Licenses

### Notebook License (CC-BY-SA 4.0)

*The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).*

_Notebook title_ <br/>
by _Author (provide a link if possible)_ <br/>
is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).<br/>
Based on a work at https://gitlab.com/deep.TEACHING.


### Code License (MIT)

*The following license only applies to code cells of the notebook.*

Copyright 2018 Benjamin Voigt, Steven Mi

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.