### RNN

This is an implementation of a vanilla RNN for a character-level language model. It is inspired by Andrej Karpathy's great [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and lectures. For more details about RNNs, please see the references. The math of the forward and backward passes of this simple network is given below.

#### Model Parameters

* $ W_{hh} $ - hidden to hidden weight matrix, size = (hidden_size, hidden_size)
* $ W_{hx} $ - input to hidden weight matrix, size = (hidden_size, vocab_size)
* $ W_{hy} $ - hidden to output weight matrix, size = (vocab_size, hidden_size)
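
The shapes listed above can be sketched in NumPy as follows. This is only an illustration: the helper name `init_params`, the `scale` factor, and the sizes in the usage lines are assumptions, not taken from the original implementation.

```python
import numpy as np

def init_params(vocab_size, hidden_size, scale=0.01, seed=0):
    """Return small random weight matrices with the shapes listed above.

    The scale and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    return {
        "W_hh": rng.standard_normal((hidden_size, hidden_size)) * scale,
        "W_hx": rng.standard_normal((hidden_size, vocab_size)) * scale,
        "W_hy": rng.standard_normal((vocab_size, hidden_size)) * scale,
    }

# Hypothetical sizes: 5 distinct characters, 3 hidden units.
params = init_params(vocab_size=5, hidden_size=3)
for name, W in params.items():
    print(name, W.shape)
```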

#### Forward pass

$$ \text{randomly initialize the weight matrices} $$


$$ \text{start with initial hidden state $h_{0}$ equal to zeros} $$


$$ z_{t} = W_{hh}h_{t-1} + W_{hx}x_{t} $$


$$ h_{t} = f(z_{t}) \text{, where $f$ is an activation function } (\sigma, \tanh, \text{etc.)} $$


$$ y_{t} = W_{hy}h_{t} $$


$$ p_{t} = \text{softmax}(y_{t}) $$


$$ J_{t} = \text{loss}(p_{t}, labels_{t}) \text{, where the loss function can be cross-entropy, MSE or something else} $$


$$ J = \sum_{t = 1}^{T} {J_{t}} $$
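
The forward-pass equations above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: $f = \tanh$, cross-entropy loss, and one-hot encoded characters. The function names `softmax` and `forward` and the toy inputs are illustrative, not the original implementation.

```python
import numpy as np

def softmax(y):
    # Subtract the maximum for numerical stability; the result is unchanged.
    e = np.exp(y - np.max(y))
    return e / e.sum()

def forward(inputs, targets, W_hh, W_hx, W_hy):
    """Run the forward pass over a sequence of integer character indices.

    Assumes f = tanh and cross-entropy loss (illustrative choices).
    Returns the total loss J, the hidden states and the probabilities.
    """
    vocab_size = W_hx.shape[1]
    h = np.zeros(W_hh.shape[0])        # h_0 is all zeros
    loss = 0.0
    hs, ps = [h], []
    for idx, target in zip(inputs, targets):
        x = np.zeros(vocab_size)       # one-hot input x_t
        x[idx] = 1.0
        z = W_hh @ h + W_hx @ x        # z_t
        h = np.tanh(z)                 # h_t
        y = W_hy @ h                   # y_t
        p = softmax(y)                 # p_t
        loss += -np.log(p[target])     # J_t, summed into J
        hs.append(h)
        ps.append(p)
    return loss, hs, ps

# With all-zero weights every p_t is uniform, so J = T * log(vocab_size).
V, H = 4, 3
loss, hs, ps = forward([0, 1, 2], [1, 2, 3],
                       np.zeros((H, H)), np.zeros((H, V)), np.zeros((V, H)))
print(loss, 3 * np.log(V))  # the two values should agree up to rounding
```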

#### Backward pass 

During the backward pass we have to calculate the gradient of the loss with respect to the weight matrices.

$$ \frac{\partial J}{\partial W_{hy}}, \frac{\partial J}{\partial W_{hh}}, \frac{\partial J}{\partial W_{hx}} $$

To calculate them we step backward through the network as follows (the first derivative assumes a softmax output with cross-entropy loss):

$$ \frac{\partial J_{t}}{\partial y_{t}} = p_{t} - labels_{t} $$

$$ \frac{\partial J_{t}}{\partial W_{hy}} = (p_{t} - labels_{t})h_{t}^{T} $$

$$ \frac{\partial J_{t}}{\partial h_{t}} = W_{hy}^{T}(p_{t} - labels_{t}) $$

$$ 
\frac{\partial h_{t}}{\partial z_{t}} = 
\frac{\partial f}{\partial z_{t}} =
\begin{cases} 
    1 - \tanh^2{z_{t}}, & \text{ if } f = \tanh \\
    \sigma(z_{t})(1 - \sigma(z_{t})), & \text{ if } f = \sigma
\end{cases} =
\begin{cases} 
    1 - h_{t} ^ 2, & \text{ if } f = \tanh \\
    h_{t}(1 - h_{t}), & \text{ if } f = \sigma
\end{cases} $$
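
As a quick sanity check, the two identities above can be compared against central finite differences. This is a minimal NumPy sketch; the `sigmoid` helper, the sample points and the step `eps` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-2.0, 2.0, 9)   # a few sample pre-activations
eps = 1e-6                      # finite-difference step (illustrative)

# tanh: derivative written in terms of the output h = tanh(z)
h_tanh = np.tanh(z)
numeric_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
tanh_err = np.max(np.abs(numeric_tanh - (1 - h_tanh ** 2)))

# sigmoid: derivative written in terms of the output h = sigmoid(z)
h_sig = sigmoid(z)
numeric_sig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
sig_err = np.max(np.abs(numeric_sig - h_sig * (1 - h_sig)))

print(tanh_err, sig_err)  # both should be very small
```

Writing the derivative in terms of $h_{t}$ is convenient because $h_{t}$ is already available from the forward pass, so $z_{t}$ does not need to be stored.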

$$ \frac{\partial z_{t}}{\partial h_{t - 1}} = W_{hh} $$

$$ \frac{\partial z_{t}}{\partial W_{hx}} = x_{t} $$

$$ \frac{\partial z_{t}}{\partial W_{hh}} = h_{t - 1} $$

At any time $t$, a change in $ W_{hy} $ affects $ J_{t} $ only through $ y_{t} $, so

$$
\frac{\partial J_{t}}{\partial W_{hy}} =  
\frac{\partial J_{t}}{\partial y_{t}} \frac{\partial y_{t}}{\partial W_{hy}} =
(p_{t} - labels_{t})h_{t}^{T}.
$$

At any time $t$, a change in $ W_{hh} $ affects every hidden state $ h_{k}, \text{ for all } k = 1, \dots, t $, so

$$
\frac{\partial J_{t}}{\partial W_{hh}} =  
\sum_{k=1}^{t}
{\frac{\partial J_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{k}}
\frac{\partial h_{k}}{\partial z_{k}} \frac{\partial z_{k}}{\partial W_{hh}}}.
$$

By the same logic,

$$
\frac{\partial J_{t}}{\partial W_{hx}} =  
\sum_{k=1}^{t}
{\frac{\partial J_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{k}}
\frac{\partial h_{k}}{\partial z_{k}} \frac{\partial z_{k}}{\partial W_{hx}}}.
$$

Note also that the derivative of the total loss is just the sum of the loss derivatives over all time steps, for example:

$$
\frac{\partial J}{\partial W_{hh}} =  
\sum_{t=1}^{T} {\frac{\partial J_{t}}{\partial W_{hh}}}.
$$
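
The sums above can be accumulated in a single backward sweep over time (backpropagation through time). The following is a minimal NumPy sketch, assuming $f = \tanh$, cross-entropy loss and one-hot inputs. The function name `loss_and_grads`, the toy sequence and the weight scale are illustrative assumptions, not the original implementation. It ends with a finite-difference check of one entry of $\partial J / \partial W_{hh}$.

```python
import numpy as np

def loss_and_grads(inputs, targets, W_hh, W_hx, W_hy):
    """Forward pass followed by backpropagation through time.

    Assumes f = tanh and cross-entropy loss (illustrative choices).
    """
    V, H, T = W_hx.shape[1], W_hh.shape[0], len(inputs)
    xs, hs, ps = {}, {0: np.zeros(H)}, {}
    loss = 0.0
    for t in range(1, T + 1):
        xs[t] = np.zeros(V)
        xs[t][inputs[t - 1]] = 1.0                       # one-hot x_t
        hs[t] = np.tanh(W_hh @ hs[t - 1] + W_hx @ xs[t])
        y = W_hy @ hs[t]
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()                              # softmax
        loss -= np.log(ps[t][targets[t - 1]])

    dW_hh = np.zeros_like(W_hh)
    dW_hx = np.zeros_like(W_hx)
    dW_hy = np.zeros_like(W_hy)
    dh_next = np.zeros(H)          # gradient arriving from later time steps
    for t in range(T, 0, -1):
        dy = ps[t].copy()
        dy[targets[t - 1]] -= 1.0                        # p_t - labels_t
        dW_hy += np.outer(dy, hs[t])
        dh = W_hy.T @ dy + dh_next                       # total dJ/dh_t
        dz = (1.0 - hs[t] ** 2) * dh                     # through tanh
        dW_hh += np.outer(dz, hs[t - 1])
        dW_hx += np.outer(dz, xs[t])
        dh_next = W_hh.T @ dz                            # pass back to h_{t-1}
    return loss, dW_hh, dW_hx, dW_hy

# Toy problem with hypothetical sizes and data.
rng = np.random.default_rng(0)
V, H = 4, 3
W_hh = rng.standard_normal((H, H)) * 0.5
W_hx = rng.standard_normal((H, V)) * 0.5
W_hy = rng.standard_normal((V, H)) * 0.5
inputs, targets = [0, 1, 2, 3], [1, 2, 3, 0]
loss, dW_hh, dW_hx, dW_hy = loss_and_grads(inputs, targets, W_hh, W_hx, W_hy)

# Central finite difference for one entry of W_hh.
eps = 1e-5
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (loss_and_grads(inputs, targets, W_plus, W_hx, W_hy)[0]
           - loss_and_grads(inputs, targets, W_minus, W_hx, W_hy)[0]) / (2 * eps)
grad_err = abs(numeric - dW_hh[0, 1])
print(grad_err)  # should be very small
```

Carrying `dh_next` backward avoids recomputing the product $\partial h_{t} / \partial h_{k}$ for every pair $(t, k)$: each time step adds its own contribution and hands the rest to the previous step.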

#### Implementation

class RNN
