# Intro to RNNs

Recurrent Neural Networks (RNNs) are an important and widely useful class of models. However, to understand their use clearly, one needs an equally clear understanding of the surrounding topics:

1. RNNs, what for?
2. The nature and types of sequential data
3. Recurrent layer design
4. Basic applications of RNNs

## [Recurrent Neural Networks](https://arxiv.org/pdf/1506.00019v4.pdf)

Recurrent Neural Networks (RNNs) were introduced in the 1980s to tackle the problem of learning from sequential data. This is a special challenge because the network has to abstract out the features of a concept as it is described through a chain of data points. Consider a probabilistic model learning to predict the word 'hello' while reading a sentence one letter at a time:

$$ h \rightarrow P(h) = 1, P(h|?) = 1$$
$$ e \rightarrow P(e) = \frac{1}{2}, P(e|h) = 1$$
$$ l \rightarrow P(l) = \frac{1}{3}, P(l|he) = 1$$
$$ l \rightarrow P(l) = \frac{1}{2}, P(l|hel) = 1$$
$$ o \rightarrow P(o) = \frac{1}{5}, P(o|hell) = 1$$

You'll immediately notice that the prior is somewhat informative, whereas the likelihood is worthless. Presumably there would be many opportunities over the course of a training set to train the likelihood function to a better degree, but even with a large training set there still isn't much certainty about the likelihood (e.g. 'hell', 'hella', 'hellacious', 'hellene', etc.). However, we can do much better if we improve the depth of the likelihood sample by considering the order of the letters that came before, along with the joint histograms, i.e.:

$$P(l|hel) = P(l|e)P(e|he)P(he|hel)$$

Great, but it seems like a big sampling problem. How are you going to get appropriate samples for each of those? The trick here is that the order in which they appear can be used as a piece of information. This is called temporal association (i.e. "the last time I saw all this, this happened"). 
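
To make the sampling problem concrete, here is a minimal sketch that estimates "probability of the next letter given the preceding letters" by plain counting. The word list `WORDS` and the helper `next_char_probs` are made up for illustration; a real training set would be far larger.

```python
from collections import Counter, defaultdict

# Hypothetical toy vocabulary, chosen only for illustration.
WORDS = ["hello", "hell", "hella", "help", "held"]

def next_char_probs(words, context_len):
    """Estimate P(next char | previous `context_len` chars) by counting."""
    counts = defaultdict(Counter)
    for word in words:
        for i in range(context_len, len(word)):
            context = word[i - context_len:i]
            counts[context][word[i]] += 1
    return {
        ctx: {ch: n / sum(c.values()) for ch, n in c.items()}
        for ctx, c in counts.items()
    }

one_char = next_char_probs(WORDS, 1)    # context is the previous letter only
three_char = next_char_probs(WORDS, 3)  # context is the previous three letters

print(one_char["l"])      # what follows a single 'l'
print(three_char["hel"])  # what follows 'hel'
```

A longer context makes the estimate more specific, but each context is seen fewer times, which is exactly the sampling problem described above.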

Although there are other ways to get at this type of temporal association in different models, we can use an individual neuron to capture temporal association within a given feature if we allow the neuron's previous state to inform its current one. Therefore, we add a new **temporal weighting scheme** to each neuron, effectively allowing the neuron to change its output based on its previous state:

![recurrentneuron](./images/recurrent_neuron_2_3.png)

The neuron will have an activation function that depends on its past values as well as its current values:

$$h_{t} = f_{h}(W_{xh}x_{t}+W_{hh}h_{t-1}+\theta_{h})$$

The neuron outputs to a temporal response layer (often nonlinear) that considers the value at this point in time with respect to all other points in time:

$$y_{t} = f_{y}(W_{hy}h_{t} + \theta_{y})$$
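
The two equations above can be sketched as a single time step in NumPy. This is only an illustration under stated assumptions: $f_h$ is taken to be `tanh`, $f_y$ is taken to be the identity, and the layer sizes and random weights are arbitrary.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, theta_h, W_hy, theta_y):
    """One time step: returns the new hidden state h_t and the output y_t."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + theta_h)  # f_h assumed to be tanh
    y_t = W_hy @ h_t + theta_y                           # f_y assumed to be identity
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2            # arbitrary illustrative sizes
W_xh = rng.normal(size=(n_hidden, n_in))
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_hy = rng.normal(size=(n_out, n_hidden))
theta_h = np.zeros(n_hidden)
theta_y = np.zeros(n_out)

x = rng.normal(size=n_in)
h0 = np.zeros(n_hidden)                    # no history yet
h1, y1 = rnn_step(x, h0, W_xh, W_hh, theta_h, W_hy, theta_y)
h2, y2 = rnn_step(x, h1, W_xh, W_hh, theta_h, W_hy, theta_y)  # same input, new state
```

Feeding the same `x` twice generally gives different outputs, because the second call sees `h1` rather than `h0` as its previous state.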


The network can be trained by unrolling it in time and considering it as a standard feed-forward NN, with time as a new dimension.


![TheRecurrentNeuron](./images/RNN-unrolled.png)
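
Unrolling can be sketched as a plain loop over time steps in which every step reuses the same weights. The sketch below assumes a `tanh` hidden layer and arbitrary sizes, and checks that writing the first two "layers" out by hand gives the same values as the loop.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, theta_h):
    """Unroll the recurrence over a sequence; the same weights are used at every step."""
    hs = [h0]
    for x_t in xs:                       # one "layer" of the unrolled network per step
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1] + theta_h))
    return np.stack(hs[1:])              # shape: (T, n_hidden)

rng = np.random.default_rng(1)
T, n_in, n_hidden = 5, 3, 4              # arbitrary illustrative sizes
xs = rng.normal(size=(T, n_in))
W_xh = 0.5 * rng.normal(size=(n_hidden, n_in))
W_hh = 0.5 * rng.normal(size=(n_hidden, n_hidden))
theta_h = np.zeros(n_hidden)

hs = rnn_forward(xs, np.zeros(n_hidden), W_xh, W_hh, theta_h)

# The first two unrolled layers written out explicitly (h_0 is zero).
h1 = np.tanh(W_xh @ xs[0] + theta_h)
h2 = np.tanh(W_xh @ xs[1] + W_hh @ h1 + theta_h)
```

Training on the unrolled network with ordinary backpropagation is usually called backpropagation through time.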


Time now serves as a sampling dimension, which gives us a way to address the sampling problem above. This very simple design is historically called an **Elman Network** after its inventor. The Elman network is essentially a three-layer MLP with a set of "context units" attached to the hidden layer through a matrix of context weights $U_{h}$ (these are the temporal weights described above). The context units allow the network to keep context while **scanning** over a sequence:

$$h_{t} = \sigma_{h}(W_{h}x_{t}+U_{h}h_{t-1}+\theta_{h})$$

$$y_{t} = \sigma_{y}(W_{y}h_{t} + \theta_{y})$$

There is another variant of the Elman network that uses the output layer to feed the context units. This is called a **Jordan Network**:

$$h_{t} = \sigma_{h}(W_{h}x_{t}+U_{h}y_{t-1}+\theta_{h})$$

$$y_{t} = \sigma_{y}(W_{y}h_{t} + \theta_{y})$$
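
The difference between the two designs is only in what feeds the context term. Here is a minimal side-by-side sketch, assuming sigmoid activations for both $\sigma_h$ and $\sigma_y$, a separate output matrix named `W_y` in the code, and arbitrary sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(x_t, h_prev, W_h, U_h, theta_h, W_y, theta_y):
    """Elman: the context comes from the previous hidden state h_{t-1}."""
    h_t = sigmoid(W_h @ x_t + U_h @ h_prev + theta_h)
    y_t = sigmoid(W_y @ h_t + theta_y)
    return h_t, y_t

def jordan_step(x_t, y_prev, W_h, U_h, theta_h, W_y, theta_y):
    """Jordan: the context comes from the previous output y_{t-1}."""
    h_t = sigmoid(W_h @ x_t + U_h @ y_prev + theta_h)
    y_t = sigmoid(W_y @ h_t + theta_y)
    return h_t, y_t

rng = np.random.default_rng(2)
n_in, n_hidden, n_out = 3, 4, 2                  # arbitrary illustrative sizes
W_h = rng.normal(size=(n_hidden, n_in))
W_y = rng.normal(size=(n_out, n_hidden))
U_elman = rng.normal(size=(n_hidden, n_hidden))  # maps h_{t-1} into the hidden layer
U_jordan = rng.normal(size=(n_hidden, n_out))    # maps y_{t-1} into the hidden layer
theta_h, theta_y = np.zeros(n_hidden), np.zeros(n_out)

x = rng.normal(size=n_in)
h_e, y_e = elman_step(x, np.zeros(n_hidden), W_h, U_elman, theta_h, W_y, theta_y)
h_j, y_j = jordan_step(x, np.zeros(n_out), W_h, U_jordan, theta_h, W_y, theta_y)
```

Note that the context matrix has a different shape in each case. With an all-zero initial context the first step is identical for both networks; they only diverge from the second step onward.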


## How it works

![hello-problem](./images/hello_problem_1_2.PNG)

The hello problem is now modeled phenomenologically. Reading from left to right, we can see that the model is trained to predict the following letter in the sequence. This might seem a little odd, as the model is outputting the same sequence shifted forward in time.
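
The "same sequence shifted forward in time" setup can be sketched as follows. The one-hot encoding and the helper name `one_hot` are assumptions made for illustration; the figures may use a different encoding.

```python
import numpy as np

text = "hello"
vocab = sorted(set(text))                        # ['e', 'h', 'l', 'o']
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

inputs = text[:-1]     # "hell"
targets = text[1:]     # "ello": the same sequence shifted forward by one step

def one_hot(indices, size):
    """Turn a list of class indices into rows of a one-hot matrix."""
    out = np.zeros((len(indices), size))
    out[np.arange(len(indices)), indices] = 1.0
    return out

X = one_hot([char_to_idx[c] for c in inputs], len(vocab))  # one row per time step
y = np.array([char_to_idx[c] for c in targets])            # target class per time step

for c_in, c_out in zip(inputs, targets):
    print(f"see {c_in!r} -> predict {c_out!r}")
```

Notice that the input 'l' appears twice with two different targets ('l' and then 'o'), so the network can only get both right by using its hidden state.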

![hello-problem](./images/hello_problem_2_1_2.PNG)


The prediction itself is used to help train the hidden layer. As you can see, we can learn a character-level language model this way:

$$P(c_t \mid \{c_{t-1},c_{t-2},c_{t-3},c_{t-4}, \ldots, c_{0}\})$$
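
As a rough sketch of what such a model computes, the code below scores the word "hello" under an untrained RNN by summing the log-probability of each character given everything before it. The softmax output layer, the `tanh` hidden layer, the omitted bias terms, and the sizes are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # subtract the max for numerical stability
    return e / e.sum()

def sequence_log_prob(indices, W_xh, W_hh, W_hy):
    """Sum of log P(c_t | c_0 .. c_{t-1}) for t = 1 .. T-1 under a simple RNN."""
    n_vocab = W_xh.shape[1]
    h = np.zeros(W_hh.shape[0])
    total = 0.0
    for prev, nxt in zip(indices[:-1], indices[1:]):
        x = np.zeros(n_vocab)
        x[prev] = 1.0                       # one-hot input character
        h = np.tanh(W_xh @ x + W_hh @ h)    # h summarizes all characters so far
        p = softmax(W_hy @ h)               # distribution over the next character
        total += np.log(p[nxt])
    return total

vocab = ["e", "h", "l", "o"]
hello = [vocab.index(c) for c in "hello"]

rng = np.random.default_rng(3)
n_hidden = 8                                 # arbitrary illustrative size
W_xh = 0.1 * rng.normal(size=(n_hidden, len(vocab)))
W_hh = 0.1 * rng.normal(size=(n_hidden, n_hidden))
W_hy = 0.1 * rng.normal(size=(len(vocab), n_hidden))

log_p = sequence_log_prob(hello, W_xh, W_hh, W_hy)
print(log_p)
```

With small random weights the predictions are close to uniform over the four characters; training would adjust the weights to raise this log-probability.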

## RNN Architectures

Recurrent neurons can be leveraged in a variety of ways, enabling the network to accomplish many different tasks:

### [Language models](http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf) also [here](http://www.fit.vutbr.cz/research/groups/speech/publi/2011/mikolov_icassp2011_5528.pdf) : Many-to-Many, delayed

![translation](./images/translation_model.png)

Entire mappings from one language to another can be constructed this way, given enough training.

### Sentiment models: Many-to-One

![sentiment](./images/sentiment_model.png)
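
A many-to-one model reads the whole sequence and only then emits a single value. Here is a minimal sketch, assuming a `tanh` hidden layer, a sigmoid read-out from the final hidden state, and random vectors standing in for word embeddings (all hypothetical).

```python
import numpy as np

def sentiment_score(xs, W_xh, W_hh, w_out):
    """Read the whole sequence, then emit one score from the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return 1.0 / (1.0 + np.exp(-(w_out @ h)))    # sigmoid: a value in (0, 1)

rng = np.random.default_rng(4)
n_in, n_hidden = 5, 6                            # arbitrary illustrative sizes
W_xh = rng.normal(size=(n_hidden, n_in))
W_hh = rng.normal(size=(n_hidden, n_hidden))
w_out = rng.normal(size=n_hidden)

short_review = rng.normal(size=(3, n_in))        # 3 hypothetical word vectors
long_review = rng.normal(size=(12, n_in))        # 12 hypothetical word vectors
scores = [sentiment_score(r, W_xh, W_hh, w_out) for r in (short_review, long_review)]
```

Sequences of any length map to one number, because only the last hidden state is passed to the output.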

### [RNN for Text Generation](http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Sutskver_524.pdf)

A somewhat impractical and often hilarious application of recurrent nets is their use in automatic text generation. Original attempts at this had hoped for working code - hence computers writing programs. Worry not, the modern programmer is safe from having her job taken by a computer, for now.

### CNN-RNN for Image Description

![CNN-RNN](./images/CNN_RNN_1.png)


The combination of CNN and RNN is [unusually powerful](http://cs.stanford.edu/people/karpathy/deepimagesent/) for the task of automatically labeling pictures in natural language. In addition to the filtration of the CNN, these models add a set of temporal hidden weights dependent upon the previous prediction, making the network context-sensitive to previous predictions ("The last time this happened, these features occurred in the context of 'hat', and the likelihood of this feature set occurring is relative to the appearance of these features"):

$$h_{t} = \tanh(W_{xh}x_{t}+W_{hh}h_{t-1}+W_{ih}y_{t-1}+\theta_{h})$$

$$y_{t} = \tanh(W_{hy}h_{t} + \theta_{y})$$

### Recursive Neural Networks

Where recurrent neural networks process data as a linear chain of time steps, **recursive** networks apply the same idea to more general hierarchical structures (including unbalanced trees). The basic principle is that the same set of weights is applied recursively through the data. Individual nodes are combined into their parents using a weight matrix shared through the whole network, and the weight matrix is optimized over the entire dataset, for the entire network.
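
The "combine nodes into their parents with one shared weight matrix" idea can be sketched like this. The binary tree, the nested-tuple representation, the `tanh` nonlinearity, and the helper name `compose` are assumptions made for illustration.

```python
import numpy as np

def compose(node, W, b):
    """Recursively reduce a nested-tuple tree of leaf vectors to a single vector."""
    if isinstance(node, np.ndarray):             # leaf: already a vector
        return node
    left, right = node                           # internal node: exactly two children
    children = np.concatenate([compose(left, W, b), compose(right, W, b)])
    return np.tanh(W @ children + b)             # the same W and b at every node

rng = np.random.default_rng(5)
d = 4                                            # arbitrary vector size
W = rng.normal(size=(d, 2 * d))
b = np.zeros(d)
a, c, e = (rng.normal(size=d) for _ in range(3))

tree = ((a, c), e)                               # unbalanced tree: ((a c) e)
root = compose(tree, W, b)
```

Every parent has the same size as its children, so the same `W` can be applied at any depth of the tree.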

Recursive networks are on the forefront of next-generation tools that are likely to be used to achieve previously unachievable tasks in data science, such as learning a finite reduced representation from a large, dirty dataset.  

### Notable: [Attentional CNN-RNN for Image Description](https://arxiv.org/pdf/1502.03044.pdf) (case example)

Recent times have seen amazing advances in this type of technology, in this case adding an additional prediction layer (set of weights) for detecting emphasis (attention) in the input data. Given the right training methods, this can lead to much stronger labeling outcomes than without. This is a complex model outside the scope of this class and so we will only mention it here.


## Problems with simple RNNs

#### Vanishing/Exploding Gradient

The biggest issue has to do with training the RNN. Note that the hidden state from the previous step is fed back into the network, through the same recurrent weights, at every step of the scan. This means that the weights are restricted to a relatively small range of magnitudes if there are more than a few temporal steps in the sequence, which limits the utility of such architectures for most practical problems.

![vanishing](./images/vanishing_gradient_rnn.png)

Consider a particular weight near $W_{11} = 0.01$ over 100 steps: $0.01^{100} = 10^{-200}$! Even setting numerical precision aside, the gradient disappears long before a good minimum is ever found. For weights smaller than $1$, the energy surface becomes effectively infinitely long and shallow (vanishing gradient); for weights larger than $1$, it becomes infinitely narrow and steep (exploding gradient).
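
The arithmetic is easy to check. The sketch below assumes the simplest possible case, a single-unit linear recurrence $h_t = w\,h_{t-1}$, where the gradient of the last state with respect to the first is just $w$ multiplied by itself once per step (the activation function's derivative is ignored here).

```python
def gradient_factor(w, steps):
    """Product of `steps` identical factors w: d h_T / d h_0 for h_t = w * h_{t-1}."""
    factor = 1.0
    for _ in range(steps):
        factor *= w
    return factor

vanishing = gradient_factor(0.01, 100)   # about 1e-200
exploding = gradient_factor(1.5, 100)    # about 4e17
stable = gradient_factor(1.0, 100)       # exactly 1.0

print(vanishing, exploding, stable)
```

Only weights very close to $1$ keep the factor in a usable range over long sequences.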

Little can be done for vanishing gradients in a simple RNN. In the case of exploding gradients, however, it is possible to apply a manual cutoff (clipping), either to the gradient or to the weights themselves, forcing the gradient to remain at a reasonable level of steepness. The unfortunate side effect is that the artificial gradient may not point toward the true minimum, which can mean badly fit parameters. This makes simple RNNs impractical for all but the most artificial of situations.
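
One common form of a manual cutoff on the gradient is clipping by norm, which caps the gradient's magnitude while keeping its direction. Here is a minimal sketch; the helper name `clip_by_norm` and the threshold of 5.0 are arbitrary choices for illustration.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its L2 norm is at most `max_norm`; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

big = np.array([300.0, -400.0])            # norm 500
small = np.array([0.3, 0.4])               # norm 0.5

clipped_big = clip_by_norm(big, 5.0)       # rescaled to norm 5: [3., -4.]
clipped_small = clip_by_norm(small, 5.0)   # already below the cutoff, left alone
```

The clipped gradient still points the same way, but its length no longer reflects how steep the surface really is, which is the side effect mentioned above.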


#### Circular Predictions

The second biggest issue with the simple RNN is memory. The RNN neuron only has direct access to the previous temporal step, so the tendency is to overfit on single words looking forward. Thus the previous member of the sequence can create undesirable short-term memory effects, e.g.

" Doug saw Jane" $\rightarrow$ "Doug saw Jane saw Steve saw ..."

" Doug saw Spot." $\rightarrow$ "Doug. Jane. Spot."

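As a deliberately extreme caricature of the first failure, consider a generator that picks the next word from the previous word alone. The lookup table `NEXT_WORD` is hypothetical and stands in for a trained model with no longer-term memory.

```python
# Hypothetical "most likely next word" table, as if learned from "X saw Y" sentences.
NEXT_WORD = {"Doug": "saw", "saw": "Jane", "Jane": "saw"}

def greedy_generate(start, steps):
    """Generate words where each choice depends on the previous word only."""
    words = [start]
    for _ in range(steps):
        words.append(NEXT_WORD[words[-1]])
    return " ".join(words)

generated = greedy_generate("Doug", 6)
print(generated)   # Doug saw Jane saw Jane saw Jane
```

With nothing to record that "saw" has already been used, the generator has no way to know the sentence should end.
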
Both of the above problems can be solved by providing each of the neurons a concept of **memory**, which will be the subject of the next lecture session.