# Recurrent Neural Networks

Just as we can train algorithms on tabular data to learn patterns and make predictions from its features, we can train them to do the same with text or language data. But why do we need to develop and study separate algorithms for text data? Why can't we use the exact same neural networks that we used for tabular data? The answer comes from the difference in the nature of the data itself. Here's a slide from Andrew Ng's course:

![image.png](attachment:image.png)

1. Each document, sentence or word (whatever the vector x represents) can be a different size. This is unlike tabular data, where the number of inputs is always fixed.
2. The algorithm needs to know that a pattern appearing at a different position in the input is not a completely new input. Patterns learnt at one position should be applicable to every position in the text.

## Architecture

An RNN, as we will learn, doesn't suffer from either of these limitations. RNNs are a type of neural network, which we already have a strong foundation in, so the novelty is the structure of the network: that structure is what makes it an RNN. As you see more use-cases you'll notice that many different algorithms are just different arrangements of neurons and layers, and of how the data passes through them.
Let's look at the basic RNN architecture:

![image.png](attachment:image.png)

Let's take x to be a sequence of inputs to the model. The sequence could be a sequence of words, sentences etc. x is the vectorized form of the input, meaning it has already been converted to vectors of numbers by the time it reaches the RNN model.
So x[1] is the first element of the sequence, x[2] the second, and so on.

In an RNN, the input x[1] passes through a layer of neurons, which gives us the output y[1] through a set of mathematical transformations. Then the second input x[2] passes through that same layer and follows the same process. Along with x[i], the layer receives an additional input: the activation value from the forward propagation of x[i-1]. We denote these activations a[i-1], a[i] and so on.  
For x[1] there is no previous step, so a[0] is a placeholder activation that is either initialized randomly or set to zero.

The key point is that by adding a[i-1] to the input used to predict y[i], we let the model see information from past inputs, as far back as the activations can carry it. So x[1], x[2] and x[3] all contribute to the prediction of y[3]. Since this is a unidirectional RNN, only x[3] and the inputs that occur before it are used for that prediction; in a bidirectional RNN, inputs on both sides become relevant.
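The description above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the course's implementation: the tanh activation, the names `Wya`, `ba` and `by`, the layer sizes, and the random weights are all assumptions made for the example, and y[i] is left as raw scores without an output activation. The sketch checks that changing x[3] leaves y[1] and y[2] untouched, while changing x[1] does reach y[3].

```python
import numpy as np

def rnn_forward(xs, Waa, Wax, Wya, ba, by):
    """Run a simple unidirectional RNN over a sequence of input vectors."""
    a = np.zeros(Waa.shape[0])               # a[0]: placeholder activation, set to zero
    ys = []
    for x in xs:                             # the same weights are reused at every position
        a = np.tanh(Waa @ a + Wax @ x + ba)  # a[i] depends on a[i-1] and x[i]
        ys.append(Wya @ a + by)              # y[i] as raw scores (no output activation)
    return np.array(ys)

rng = np.random.default_rng(0)
n_a, n_x, n_y = 4, 3, 2                      # hypothetical layer sizes
params = (
    rng.normal(scale=0.5, size=(n_a, n_a)),  # Waa
    rng.normal(scale=0.5, size=(n_a, n_x)),  # Wax
    rng.normal(scale=0.5, size=(n_y, n_a)),  # Wya (assumed name)
    np.zeros(n_a),                           # ba (assumed name)
    np.zeros(n_y),                           # by (assumed name)
)

xs = rng.normal(size=(3, n_x))               # x[1], x[2], x[3]
ys = rnn_forward(xs, *params)

# Perturb only the last input: earlier outputs must not change.
xs_late = xs.copy()
xs_late[2] += 1.0
ys_late = rnn_forward(xs_late, *params)

# Perturb only the first input: the change is carried forward to y[3].
xs_early = xs.copy()
xs_early[0] += 1.0
ys_early = rnn_forward(xs_early, *params)

print("y[1], y[2] unchanged:", np.allclose(ys[:2], ys_late[:2]))
print("max change in y[3]:", np.abs(ys[2] - ys_early[2]).max())

# The same weights also handle a sequence of a different length.
ys_long = rnn_forward(rng.normal(size=(7, n_x)), *params)
print(ys.shape, ys_long.shape)
```

Because the loop reuses one set of weights at every step, the same function works for a 3-element and a 7-element sequence, which is the property a fixed-size tabular network lacks.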


## Forward Pass Calculations

![image.png](attachment:image.png)

Here are the equations for the calculations performed in a forward pass. The subscripts in a weight's name say what the weight is for: Wax, for example, is used to calculate an 'a' value and is multiplied by an 'x' value.

You can see that the calculation of a[1] includes x[1], as expected, but also takes the previous activation a[0] so that information from past inputs is considered as well. Each a[i] includes a[i-1] in its calculation in the same way, so it carries information from all previous inputs.
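For readers who cannot see the slide, the forward pass of a basic RNN is commonly written as below. Treat this as an assumption about what the slide shows rather than a transcription of it: the activation functions $g_1$ and $g_2$ (often tanh for the hidden activation and sigmoid or softmax for the output), the bias terms $b_a$ and $b_y$, and the weight $W_{ya}$ are not named in the text above. Here $a^{\langle t \rangle}$ corresponds to a[t] and $x^{\langle t \rangle}$ to x[t].

$$
\begin{aligned}
a^{\langle t \rangle} &= g_1\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\right) \\
\hat{y}^{\langle t \rangle} &= g_2\left(W_{ya}\, a^{\langle t \rangle} + b_y\right)
\end{aligned}
$$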

## Simplified Notation

![image.png](attachment:image.png)


This notation stacks the a and x vectors into one vector, and places the Waa and Wax weight matrices side by side in one matrix, purely to simplify the writing. The calculation remains the same.
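A quick numerical check can make the equivalence concrete. The sketch below assumes a tanh activation, a bias `ba`, and arbitrary sizes, and uses the hypothetical names `Wa` and `stacked` for the combined matrix and vector; it computes one activation both ways and compares the results.

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_x = 4, 3                           # hypothetical sizes
Waa = rng.normal(size=(n_a, n_a))
Wax = rng.normal(size=(n_a, n_x))
ba = rng.normal(size=n_a)                 # assumed bias term
a_prev = rng.normal(size=n_a)             # a[i-1]
x_t = rng.normal(size=n_x)                # x[i]

# Original notation: two separate matrix-vector products.
a_separate = np.tanh(Waa @ a_prev + Wax @ x_t + ba)

# Simplified notation: Wa = [Waa | Wax] side by side, and [a_prev; x_t] stacked.
Wa = np.hstack([Waa, Wax])                # shape (n_a, n_a + n_x)
stacked = np.concatenate([a_prev, x_t])   # shape (n_a + n_x,)
a_stacked = np.tanh(Wa @ stacked + ba)

same = np.allclose(a_separate, a_stacked)
print(Wa.shape, same)
```

The single product `Wa @ stacked` expands to `Waa @ a_prev + Wax @ x_t`, which is why the two forms agree.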

## Backward Propagation

![image.png](attachment:image.png)

So far we have seen a very specific case of RNNs where the number of elements in the input and the output is the same, but that may not always be the case. Therefore there are many types of RNNs.