# RNN - Code Example

## Code

```python
import copy
import numpy as np
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x, deriv=False):
    if deriv is True :
        return x*(1-x)
    return 1/(1 + np.exp(-x))

# training dataset generation
int2binary = {}
binary_dim = 8

largest_number = pow(2,binary_dim)
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]


# input variables
alpha = 0.1
input_dim = 2
hidden_dim = 16
output_dim = 1


# initialize neural network weights
synapse_0 = 2*np.random.random((input_dim,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim,hidden_dim)) - 1

synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)

# training logic
for j in range(10000):

    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number//2) # int version
    a = int2binary[a_int] # binary encoding

    b_int = np.random.randint(largest_number//2) # int version
    b = int2binary[b_int] # binary encoding

    # true answer
    c_int = a_int + b_int
    c = int2binary[c_int]

    # where we'll store our best guess (binary encoded)
    d = np.zeros_like(c)

    overallError = 0

    layer_2_deltas = list()
    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))

    # moving along the positions in the binary encoding
    for position in range(binary_dim):

        # generate input and output
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T

        # hidden layer (input ~+ prev_hidden)
        layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))

        # output layer (new binary representation)
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # did we miss?... if so, by how much?
        layer_2_error = y - layer_2
        layer_2_deltas.append((layer_2_error)*sigmoid(layer_2, True))
        overallError += np.abs(layer_2_error[0])

        # decode estimate so we can print it out
        d[binary_dim - position - 1] = np.round(layer_2[0][0])

        # store hidden layer so we can use it in the next timestep
        layer_1_values.append(copy.deepcopy(layer_1))

    future_layer_1_delta = np.zeros(hidden_dim)

    for position in range(binary_dim):

        X = np.array([[a[position],b[position]]])
        layer_1 = layer_1_values[-position-1]
        prev_layer_1 = layer_1_values[-position-2]

        # error at output layer
        layer_2_delta = layer_2_deltas[-position-1]
        # error at hidden layer
        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid(layer_1, True)

        # let's update all our weights so we can try again
        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
        synapse_0_update += X.T.dot(layer_1_delta)

        future_layer_1_delta = layer_1_delta


    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha

    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0

    # print out progress
    if(j % 1000 == 0):
        print("Error:" + str(overallError))
        print("Pred:" + str(d))
        print("True:" + str(c))
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")
```

## Description

Importing our dependencies and seeding the random number generator.  
* `numpy` is for matrix algebra
* `copy` provides `deepcopy`, which we use to snapshot the hidden layer at each time step
```python
import copy
import numpy as np
np.random.seed(0)
```


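Our nonlinearity and its derivative. Note that with `deriv=True` the function expects `x` to already be a sigmoid output, since it returns `x*(1-x)`.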
```python
# compute sigmoid nonlinearity
def sigmoid(x, deriv=False):
    if deriv is True :
        return x*(1-x)
    return 1/(1 + np.exp(-x))
```
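To make the `deriv=True` convention concrete, here is a small sanity check. It's a sketch that assumes the same `sigmoid` definition as above; the names `z`, `eps`, `analytic`, and `numeric` are just illustrative. It compares `x*(1-x)` applied to the sigmoid output against a finite-difference estimate of the slope.

```python
import numpy as np

def sigmoid(x, deriv=False):
    if deriv is True:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])   # pre-activation values
out = sigmoid(z)                 # activations, each between 0 and 1

# Analytic derivative: pass the *output* of the sigmoid, not z itself
analytic = sigmoid(out, deriv=True)

# Central finite difference of the sigmoid at z, for comparison
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Passing `z` instead of `out` with `deriv=True` would silently give the wrong slope, which is why the training code always feeds layer activations into it.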

We're going to create a lookup table that maps from an integer to its binary representation.  
The binary representations will be our input and output data for each math problem we try to get the network to solve.  
This lookup table will be very helpful in converting from integers to bit strings.
```python
int2binary = {}
```

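This is where I set the maximum length of the binary numbers we'll be adding.  
In principle you can adjust this to add larger numbers, but note that the lookup table below is built with `np.uint8` and `np.unpackbits`, which only cover 8 bits, so going beyond 8 also means changing how that table is built.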
```python
binary_dim = 8
```

This computes the largest number that is possible to represent with the binary length we chose
```python
largest_number = pow(2,binary_dim)
```

This is a lookup table that maps from an integer to its binary representation.  
We copy it into `int2binary`. This is kind of unnecessary, but I thought it made things look more obvious.
```python
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T, axis=1)

for i in range(largest_number):
    int2binary[i] = binary[i]
```
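If you want to peek inside the table, something like the following should do. It rebuilds the table the same way as above (with a dict comprehension in place of the loop, which is my own shorthand) and prints one entry, so you can see that the bits are stored most significant first.

```python
import numpy as np

binary_dim = 8
largest_number = pow(2, binary_dim)

# One row per integer 0..255, each row holding that integer's 8 bits
binary = np.unpackbits(
    np.array([range(largest_number)], dtype=np.uint8).T, axis=1)
int2binary = {i: binary[i] for i in range(largest_number)}

print(binary.shape)      # (256, 8)
print(int2binary[5])     # [0 0 0 0 0 1 0 1]
```

Because the most significant bit sits at index 0, the training loop later has to walk each array from the right-hand end to process the lowest bit first.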

This is our learning rate.
```python
alpha = 0.1
```

We are adding two numbers together, so we'll be feeding in two bit strings, one character at a time each.  
Thus, we need two inputs to the network (one for each of the numbers being added).
```python
input_dim = 2
```

This is the size of the hidden layer that will be storing our carry bit.  
Notice that it is way larger than it theoretically needs to be.  
Play with this and see how it affects the speed of convergence.  
* Do larger hidden dimensions make things train faster or slower? 
* More iterations or fewer?
```python
hidden_dim = 16
```

Well, we're only predicting one bit of the sum at each time step. Thus, we only need one output.
```python
output_dim = 1
```

This is the matrix of weights that connects our input layer and our hidden layer.  
Thus, it has "input_dim" rows and "hidden_dim" columns. (2 x 16 unless you change it).  
```python
synapse_0 = 2*np.random.random((input_dim, hidden_dim)) - 1
```

This is the matrix of weights that connects the hidden layer to the output layer. 
Thus, it has "hidden_dim" rows and "output_dim" columns. (16 x 1 unless you change it). 
```python
synapse_1 = 2*np.random.random((hidden_dim, output_dim)) - 1
```

This is the matrix of weights that connects the hidden layer in the previous time-step to the hidden layer in the current timestep. It also connects the hidden layer in the current timestep to the hidden layer in the next timestep (we keep using it).  
Thus, it has the dimensionality of "hidden_dim" rows and "hidden_dim" columns. (16 x 16 unless you change it). 
```python
synapse_h = 2*np.random.random((hidden_dim, hidden_dim)) - 1
```
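A quick way to convince yourself of the three shapes (and of the value range that `2*np.random.random(...) - 1` produces) is to print them. This is only a sketch using the same dimensions as above.

```python
import numpy as np
np.random.seed(0)

input_dim, hidden_dim, output_dim = 2, 16, 1

synapse_0 = 2*np.random.random((input_dim, hidden_dim)) - 1   # input -> hidden
synapse_1 = 2*np.random.random((hidden_dim, output_dim)) - 1  # hidden -> output
synapse_h = 2*np.random.random((hidden_dim, hidden_dim)) - 1  # hidden -> hidden

for name, w in [("synapse_0", synapse_0),
                ("synapse_1", synapse_1),
                ("synapse_h", synapse_h)]:
    # np.random.random draws from [0, 1), so every weight lands in [-1, 1)
    print(name, w.shape, bool(w.min() >= -1), bool(w.max() < 1))
```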

These store the weight updates that we would like to make for each of the weight matrices.  
After we've accumulated several weight updates, we'll actually update the matrices.
```python
synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)
```

We're iterating over 10,000 training examples
```python
for j in range(10000):
    ...
```

We're going to generate a random addition problem. So, we're initializing an integer randomly between 0 and half of the largest value we can represent. If we allowed the inputs to be larger than this, then adding two numbers could overflow (be a bigger number than we have bits to represent).  
Thus, we only add numbers that are less than half of the largest number we can represent.
```python
a_int = np.random.randint(largest_number//2) # int version
```

We look up the binary form for "a_int" and store it in "a".
```python
a = int2binary[a_int] # binary encoding
```

Same as "a_int", just getting another random number.
```python
b_int = np.random.randint(largest_number//2) # int version
```

Same as "a", looking up the binary representation.
```python
b = int2binary[b_int] # binary encoding
```

We're computing what the correct answer should be for this addition
```python
c_int = a_int + b_int
```

Converting the true answer to its binary representation
```python
c = int2binary[c_int]
```
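The overflow argument is easy to check with arithmetic. In this sketch, `worst_case` is a name I made up for the largest sum the sampling can produce; it stays below `largest_number`, so the lookup for `c_int` always succeeds.

```python
import numpy as np
np.random.seed(0)

binary_dim = 8
largest_number = pow(2, binary_dim)            # 256

# randint's upper bound is exclusive, so each operand is at most 127
a_int = np.random.randint(largest_number // 2)
b_int = np.random.randint(largest_number // 2)
c_int = a_int + b_int

# Largest possible sum: 127 + 127
worst_case = 2 * (largest_number // 2 - 1)

print(a_int, b_int, c_int)
print(worst_case, worst_case < largest_number)  # 254 True
```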

Initializing an empty binary array where we'll store the neural network's predictions (so we can see them at the end).  
You could get around doing this if you want, but I thought it made things more intuitive.
```python
# where we'll store our best guess (binary encoded)
d = np.zeros_like(c)
```

Resetting the error measure
```python
overallError = 0
```

These two lists will keep track of the layer 2 derivatives and layer 1 values at each time step.
```python
layer_2_deltas = list()
layer_1_values = list()
```

Time step zero has no previous hidden layer, so we initialize one that's off.
```python
layer_1_values.append(np.zeros(hidden_dim))
```

#### forward propagation
This for loop iterates through the binary representation
```python
for position in range(binary_dim):
    ...
```

X is a 1 x 2 array holding one bit from a and one bit from b. It's indexed according to the "position" variable, but we index it in such a way that it goes from right to left. So, when position == 0, this is the farthest bit to the right in "a" and the farthest bit to the right in "b". When position equals 1, this shifts to the left one bit. 
```python
X = np.array([[a[binary_dim - position - 1], b[binary_dim - position - 1]]])
```

Same indexing as `X`, but instead it's the value of the correct answer (either a 1 or a 0).
```python
y = np.array([[c[binary_dim - position - 1]]]).T
```
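To see the right-to-left order in action, here is a toy example with hand-written bit lists for 6 and 3 (plain Python lists rather than the lookup table, just to keep it short). The name `fed` is illustrative.

```python
binary_dim = 8

a = [0, 0, 0, 0, 0, 1, 1, 0]   # 6, most significant bit first
b = [0, 0, 0, 0, 0, 0, 1, 1]   # 3, most significant bit first

# The pair of bits fed to the network at each position, lowest bit first
fed = [(a[binary_dim - position - 1], b[binary_dim - position - 1])
       for position in range(binary_dim)]

print(fed[:3])   # [(0, 1), (1, 1), (1, 0)]
```

Starting from the lowest bit matters because that is the direction a carry travels when adding by hand.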

**This is the magic!!!** To construct the hidden layer, we do the following. 
* First, we propagate from the input to the hidden layer `np.dot(X,synapse_0)`. 
* Then, we propagate from the previous hidden layer (the last entry of `layer_1_values`) to the current hidden layer `np.dot(layer_1_values[-1],synapse_h)`. 
* Then WE SUM THESE TWO VECTORS!!!!... 
* and pass through the sigmoid function.

So, how do we combine the information from the previous hidden layer and the input? After each has been propagated through its various matrices (read: interpretations), we sum the information. 
```python
layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))
```
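Here is a single forward step pulled out on its own, as a sketch with freshly seeded weights. The names `from_input` and `from_memory` are mine; they just label the two vectors being summed. Note that the `(1, 16)` input term and the `(16,)` memory term add up via NumPy broadcasting.

```python
import numpy as np
np.random.seed(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

input_dim, hidden_dim = 2, 16
synapse_0 = 2*np.random.random((input_dim, hidden_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim, hidden_dim)) - 1

X = np.array([[1, 0]])                 # one bit from each number
prev_layer_1 = np.zeros(hidden_dim)    # "off" hidden layer at time step zero

from_input = np.dot(X, synapse_0)                # shape (1, 16)
from_memory = np.dot(prev_layer_1, synapse_h)    # shape (16,)
layer_1 = sigmoid(from_input + from_memory)      # broadcasts to (1, 16)

print(from_input.shape, from_memory.shape, layer_1.shape)
```

At time step zero the memory term is all zeros, so the hidden layer depends on the input alone; from the second step onward both terms contribute.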

This propagates the hidden layer to the output to make a prediction.
```python
layer_2 = sigmoid(np.dot(layer_1,synapse_1))
```

Compute by how much the prediction missed
```python
layer_2_error = y - layer_2
```

We're going to store the output delta (the error scaled by the sigmoid derivative at the output) in a list, holding one delta per timestep.
```python
layer_2_deltas.append((layer_2_error)*sigmoid(layer_2, True))
```
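A worked number may help here. Suppose the correct bit is 1 and the network says 0.8; these values are made up for illustration. The error is about 0.2, the sigmoid slope at the output is about 0.8 * 0.2 = 0.16, and the delta is their product.

```python
import numpy as np

def sigmoid(x, deriv=False):
    if deriv is True:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

y = np.array([[1]])            # correct bit
layer_2 = np.array([[0.8]])    # hypothetical prediction

layer_2_error = y - layer_2                              # about 0.2
layer_2_delta = layer_2_error * sigmoid(layer_2, True)   # about 0.2 * 0.16

print(layer_2_error, layer_2_delta)
```

The delta shrinks when the output is close to 0 or 1, because the sigmoid is nearly flat there.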

Calculate the sum of the absolute errors so that we have a scalar error (to track propagation).  
We'll end up with a sum of the error at each binary position.
```python
overallError += np.abs(layer_2_error[0])
```

Rounds the output (to a binary value, since it is between 0 and 1) and stores it in the designated slot of d. 
```python
d[binary_dim - position - 1] = np.round(layer_2[0][0])
```

Copies the layer_1 value and appends it to the list, so that the next time step can use it as its previous hidden layer.
```python
layer_1_values.append(copy.deepcopy(layer_1))
```

#### backpropagation
So, we've done all the forward propagating for all the time steps, and we've computed the derivatives at the output layers and stored them in a list. Now we need to backpropagate, starting with the last timestep and working back to the first.
```python
for position in range(binary_dim):
    ...
```

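Indexing the input data for the time step being backpropagated. Because this loop runs from the last time step back to the first, `a[position]` selects the same bit that the forward pass fed in at that step.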
```python
X = np.array([[a[position],b[position]]])
```

Selecting the current hidden layer from the list.
```python
layer_1 = layer_1_values[-position-1]
```

Selecting the previous hidden layer from the list
```python
prev_layer_1 = layer_1_values[-position-2]
```
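The negative indexing is easier to follow with labels instead of arrays. In this sketch, the strings `h_init` and `h_0` through `h_7` are placeholders I made up for the initial zero vector and the eight stored hidden layers.

```python
binary_dim = 8

# Same layout as layer_1_values: the initial state, then one entry per time step
layer_1_values = ["h_init"] + ["h_%d" % t for t in range(binary_dim)]

# (current, previous) hidden layer picked at each backward position
pairs = [(layer_1_values[-position - 1], layer_1_values[-position - 2])
         for position in range(binary_dim)]

print(pairs[0])    # ('h_7', 'h_6')
print(pairs[-1])   # ('h_0', 'h_init')
```

So the backward loop starts at the last time step and ends at the first, whose "previous" hidden layer is the zero vector we appended before the forward pass.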

Selecting the current output error from the list
```python
layer_2_delta = layer_2_deltas[-position-1]
```

This computes the current hidden layer error given the error at the hidden layer from the future and the error at the current output layer.
```python
layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid(layer_1, True)
```
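Splitting that line into its two sources of error can make it easier to read. This is a sketch with stand-in values (random activations and a made-up output delta); `from_future` and `from_output` are names I chose for the two terms.

```python
import numpy as np
np.random.seed(0)

hidden_dim, output_dim = 16, 1
synapse_1 = 2*np.random.random((hidden_dim, output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim, hidden_dim)) - 1

layer_1 = np.random.random((1, hidden_dim))    # stand-in hidden activations
layer_2_delta = np.array([[0.032]])            # stand-in output delta
future_layer_1_delta = np.zeros(hidden_dim)    # nothing from the future at the last step

from_future = future_layer_1_delta.dot(synapse_h.T)   # error flowing back through time
from_output = layer_2_delta.dot(synapse_1.T)          # error flowing down from the output

# Scale by the sigmoid slope at the hidden layer, x*(1-x) on the activations
layer_1_delta = (from_future + from_output) * layer_1 * (1 - layer_1)

print(from_future.shape, from_output.shape, layer_1_delta.shape)
```

At the last time step only the output term is non-zero; at earlier steps `future_layer_1_delta` carries the error from every later position.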

Now that we have the derivatives backpropagated at this current time step, we can construct our weight updates (but not actually update the weights just yet). We don't actually update our weight matrices until after we've fully backpropagated everything.  
Why? Well, we use the weight matrices for the backpropagation. Thus, we don't want to go changing them yet until the actual backprop is done.
```python
synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
synapse_0_update += X.T.dot(layer_1_delta)
```
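Each of those three lines is an outer product of a layer's activations with the delta of the layer it feeds, which is why every update has the same shape as the matrix it belongs to. A sketch with stand-in values (the `_step` names are mine):

```python
import numpy as np
np.random.seed(0)

input_dim, hidden_dim, output_dim = 2, 16, 1

# Stand-in values with the same shapes the training loop produces
X = np.array([[1, 0]])                                    # (1, 2) input bits
layer_1 = np.random.random((1, hidden_dim))               # (1, 16) current hidden layer
prev_layer_1 = np.zeros(hidden_dim)                       # (16,) previous hidden layer
layer_2_delta = np.array([[0.032]])                       # (1, 1) output delta
layer_1_delta = np.random.random((1, hidden_dim)) - 0.5   # (1, 16) hidden delta

synapse_1_step = np.atleast_2d(layer_1).T.dot(layer_2_delta)
synapse_h_step = np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
synapse_0_step = X.T.dot(layer_1_delta)

print(synapse_1_step.shape, synapse_h_step.shape, synapse_0_step.shape)
# (16, 1) (16, 16) (2, 16)
```

`np.atleast_2d` is there because the initial hidden layer is a flat `(16,)` vector, and a flat vector has no meaningful transpose.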

Now that we've backpropped everything and created our weight updates, it's time to update our weights (and empty the update variables).
```python
synapse_0 += synapse_0_update * alpha
synapse_1 += synapse_1_update * alpha
synapse_h += synapse_h_update * alpha 

synapse_0_update *= 0
synapse_1_update *= 0
synapse_h_update *= 0
```
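The accumulate, apply, reset pattern can be shown on a tiny matrix. This is a toy sketch with made-up gradients `g`; the names `w` and `upd` stand in for a synapse and its update buffer. Note that the update is added rather than subtracted, because the error was defined as `y - layer_2`.

```python
import numpy as np

alpha = 0.1
w = np.ones((2, 2))        # stand-in weight matrix
upd = np.zeros_like(w)     # stand-in update buffer

# Accumulate contributions from two time steps without touching w
for g in (np.full((2, 2), 0.5), np.full((2, 2), 1.5)):
    upd += g

w += upd * alpha           # apply once: 1 + 0.1 * (0.5 + 1.5)
upd *= 0                   # reset in place, keeping the same array

print(w)
print(upd)
```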

Just some nice logging to show progress
```python
if(j % 1000 == 0):
    print("Error:" + str(overallError))
    print("Pred:" + str(d))
    print("True:" + str(c))
    out = 0
    for index,x in enumerate(reversed(d)):
        out += x*pow(2,index)
    print(str(a_int) + " + " + str(b_int) + " = " + str(out))
    print("------------")
```
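The decoding loop at the end of the logging code can be tried on its own. In this sketch `d` is a hand-written prediction for the number 9, and the `int(x)` cast is my addition so the arithmetic happens on plain Python integers rather than `uint8` values.

```python
import numpy as np

d = np.array([0, 0, 0, 0, 1, 0, 0, 1], dtype=np.uint8)   # bits of 9

out = 0
# reversed(d) starts at the lowest bit, whose weight is 2**0
for index, x in enumerate(reversed(d)):
    out += int(x) * pow(2, index)

print(out)   # 9
```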

Source: [Anyone Can Learn To Code an LSTM-RNN in Python](https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/)