_______________________
# Gradient flow in neural networks
__________________________________
[Check the CS231n course notes for more details](http://cs231n.github.io/optimization-2/)   

Understanding backpropagation is important for understanding how deep neural nets work. Given the loss function L, we are interested in the gradients of the loss with respect to the network's variables. Instead of deriving each gradient by hand with the chain rule, which is tedious for a very large network, we can think of the neural network as a computational graph and work out how the gradient flows through each unit. Once we know how gradients flow through each unit, it is simple to grasp backpropagation without writing out the derivatives. Remember, this is still the chain rule at work, but developing this intuition makes it easier to understand really large and complex networks.
  
Here we look at the nodes a typical neural network is made of: the multiplicative gate, additive gate and the activation gate. 

<img src='assets/grad_gates1.png'>

**Multiplicative gate:**
* The incoming gradient is multiplied by the gate's inputs, switched: each input receives the incoming gradient times the *other* input. Suppose the gate computes $z = w * x$ and we want the gradient of the loss 'L' with respect to the weight 'w', for updating the weights during backpropagation. By the chain rule, $\large \frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} * \frac{\partial z }{\partial w}$, so the incoming gradient $\large \frac{\partial L}{\partial z}$ gets multiplied by the local gradient $\large \frac{\partial z }{\partial w}=x$.
 
**Additive gate:**
* The incoming gradient is copied to each of its inputs, because the local gradient of a sum with respect to each input is 1. The gradient flows through unchanged.

**Activation gate:**
* The incoming gradient is multiplied by the derivative of the activation function, evaluated at the gate's input.
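
The three rules can be sketched as tiny backward functions. This is a minimal illustration in plain Python, assuming scalar inputs and a sigmoid activation; the function names (`mul_backward`, `add_backward`, `act_backward`) are made up for this sketch and do not come from any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mul_backward(x, w, incoming):
    """z = x * w: each input receives the incoming gradient times the OTHER input."""
    return incoming * w, incoming * x  # (dL/dx, dL/dw)

def add_backward(incoming):
    """z = a + b: the local gradient is 1 for both inputs, so the gradient is copied."""
    return incoming, incoming  # (dL/da, dL/db)

def act_backward(z, incoming):
    """a = sigmoid(z): multiply by the derivative of the activation at z."""
    s = sigmoid(z)
    return incoming * s * (1.0 - s)  # dL/dz

# An incoming gradient of 2.0 arriving at each kind of gate
dx, dw = mul_backward(x=3.0, w=-4.0, incoming=2.0)
da, db = add_backward(incoming=2.0)
dz = act_backward(z=0.0, incoming=2.0)
print(dx, dw)  # -8.0 6.0
print(da, db)  # 2.0 2.0
print(dz)      # 0.5
```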

____________________

## Let's now look at gradient flow at a node
<img src='assets/grad_node.png'>
> To calculate the input gradients, the incoming gradient is multiplied by the 'activation gate' gradient, passed to both branches of the additive unit, and then multiplied by the inputs (switched).   
> So to calculate $\large \frac{\partial L}{\partial w2}$, the gradient is the product of three factors: the incoming gradient, the 'local gradient' g'(z)=$\large \frac{\partial g}{\partial z}$ and the input 'x'.   
> Generally the 'local gradient' g'(z) is calculated during the forward pass and stored for use in the backward pass.
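
As a rough sketch of the three-factor product, the snippet below pushes a gradient back through a single node. It assumes a two-input node z = w1*x1 + w2*x2 with a sigmoid activation; all values are hypothetical and are not taken from the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, for illustration only
x1, x2 = 0.5, -1.0
w1, w2 = 0.2, 0.4
incoming = 1.5  # gradient arriving from the rest of the network (dL/da)

# Forward pass: compute the output and keep the local gradient g'(z)
z = w1 * x1 + w2 * x2
a = sigmoid(z)
local_grad = a * (1.0 - a)

# Backward pass
dz = incoming * local_grad                  # activation gate
d_branch1 = d_branch2 = dz                  # additive gate copies the gradient
dw1, dx1 = d_branch1 * x1, d_branch1 * w1   # multiplicative gate: inputs switched
dw2, dx2 = d_branch2 * x2, d_branch2 * w2

print('dL/dw2 =', dw2)  # equals incoming * g'(z) * x2
```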

## Let's look at a neural network in terms of a computational graph
The figure below shows how a neural network is generally represented   

<img src='assets/comp_graph1.png'>
The network in the above figure can be represented with the computational units we defined earlier. We don't need to draw this every time we want to understand a neural network, but we will start with it and build up simple rules as we progress, so that we can eventually read the gradients off the original representation.


### An example

Let $X=\begin{bmatrix} x1 \\ x2 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$, and let the target output be y = 1.   
Let the initial weights be  

$W1=\begin{bmatrix} W1_{11} & W1_{12} \\ W1_{21} & W1_{22}\end{bmatrix} = \begin{bmatrix} 0.3 & 0.4 \\ 0.5 & 0.6\end{bmatrix}$,  

and $W2=\begin{bmatrix} W2_{11} \\ W2_{21} \end{bmatrix} = \begin{bmatrix} 0.7 \\ 0.8 \end{bmatrix}$

Then the forward pass will be:   

<img src='assets/forward_pass1.png'>


Now we want to update the weight $W1_{22}$ using backpropagation, so let's compute the gradient of the loss (L) with respect to that weight, $\large \frac{\partial L}{\partial W1_{22}}$:
* We start from the gradient of the loss with respect to the network output, $\large \frac{\partial L}{\partial y} \small= \nabla L = -0.31$.
* It meets the activation gate next, so we multiply by $\large \frac{\partial g2}{\partial z} \small=\nabla g2=0.2135$, and the result is -0.066.
* Then it meets the additive gate, so the gradient computed above is just passed along each branch.
* Then comes the multiplicative gate, so we multiply by the other input, $W2_{21}=0.8$. The gradient at this node is now -0.053.
* Next comes the activation gate, so we multiply by $\large \frac{\partial g1\_2}{\partial z} \small= 0.2484$, and the gradient is -0.0131.
* Then it reaches the additive gate, so the gradient flows through unchanged.
* Finally we reach $W1_{22}$. Here we multiply the incoming gradient by the other input to this gate, x2 = 0.2, obtaining $\large \frac{\partial L}{\partial W1_{22}}$, which is -0.00262.
<img src='assets/backprop.png'>
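
The steps above can be reproduced in a few lines of plain Python. This sketch assumes sigmoid activations and the squared-error loss ½(out − y)², which is what the numbers above correspond to; the variable names are chosen for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2, y = 0.1, 0.2, 1.0
W1 = [[0.3, 0.4], [0.5, 0.6]]
W2 = [0.7, 0.8]

# Forward pass
z1_1 = x1 * W1[0][0] + x2 * W1[1][0]
z1_2 = x1 * W1[0][1] + x2 * W1[1][1]
h1, h2 = sigmoid(z1_1), sigmoid(z1_2)
out = sigmoid(h1 * W2[0] + h2 * W2[1])

# Backward pass towards W1_22, one gate at a time
g_loss = out - y                     # loss gradient, about -0.31
g_act2 = g_loss * out * (1.0 - out)  # output activation gate, about -0.066
g_add2 = g_act2                      # additive gate: passed along unchanged
g_mul2 = g_add2 * W2[1]              # multiplicative gate: other input is W2_21
g_act1 = g_mul2 * h2 * (1.0 - h2)    # hidden activation gate, about -0.0131
g_add1 = g_act1                      # additive gate: passed along unchanged
dW1_22 = g_add1 * x2                 # multiplicative gate: other input is x2

print('dL/dW1_22 =', dW1_22)         # about -0.00262
```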

## Checking it in tensorflow

```python
# Uses the TensorFlow 1.x graph API (tf.Session, tf.gradients)
import tensorflow as tf
import numpy as np

# The inputs and initial weights
X = tf.constant([[0.1, 0.2]], dtype=tf.float32)
W1 = tf.constant([[0.3, 0.4], [0.5, 0.6]], dtype=tf.float32)
W2 = tf.constant([0.7, 0.8], dtype=tf.float32)
W2 = tf.reshape(W2, [2, 1])
y = tf.constant(1.0, dtype=tf.float32)

# The network
z = tf.matmul(X, W1)
h = tf.sigmoid(z)
out = tf.sigmoid(tf.matmul(h, W2))

# Calculate the error
cost = (1/2.) * (out - y)**2

# Compute the gradient of the cost with respect to W1
gradients = tf.gradients(cost, W1)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Use new names for the results so the `out` tensor is not overwritten
    out_val, grad = sess.run([out, gradients])
# print(out_val)
print('The gradient at W1_22 is:', grad[1][1])
```

```
The gradient at W1_22 is: -0.0026226947
```


We obtain the same gradient as worked out before.

##  Let's do the above in matrix form

For a very large network, we can't look at each node individually to do backpropagation. Since the nodes are stacked together in layers, we convert them to matrix form for easier gradient computation.

* #### Stacked layers of multiplicative and additive gates:
<img src='assets/mul_add_node.png'>
Here we know that the incoming gradients are multiplied by the inputs, switched, to obtain the gradients of the inputs. In matrix form, with row-vector inputs so that $Z = XW$ (the convention used in the code below), they can be written as $\nabla (incoming) * W^{T}$ for the input X and $X^{T}* \nabla (incoming)$ for the weights.     

* #### Stacked layers of activation gates:
<img src='assets/act_node.png'>
Here we know that the incoming gradients are multiplied by the local gradient, to obtain the gradients of the inputs. Since this is an element-by-element multiplication, writing it as a matrix product requires a diagonal matrix whose diagonal elements are the activation (local) gradients. In code it is simpler to keep it as an element-wise product.
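
Both matrix rules can be sketched in NumPy. The example assumes row-vector inputs (Z = X·W) and a sigmoid activation; the shapes and random values are arbitrary and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 3))         # one sample with 3 features (row vector)
W = rng.normal(size=(3, 2))         # weights of a layer with 2 units
incoming = rng.normal(size=(1, 2))  # gradient arriving at the layer output

# Forward pass
Z = X.dot(W)
A = 1.0 / (1.0 + np.exp(-Z))
local_grad = A * (1.0 - A)          # sigmoid derivative, same shape as Z

# Activation gates: element-wise product ...
dZ = incoming * local_grad
# ... which equals a product with a diagonal matrix of the local gradients
dZ_diag = incoming.dot(np.diag(local_grad.ravel()))
print(np.allclose(dZ, dZ_diag))     # True

# Multiplicative and additive gates: inputs switched
dX = dZ.dot(W.T)                    # gradient for the input, same shape as X
dW = X.T.dot(dZ)                    # gradient for the weights, same shape as W
print(dX.shape, dW.shape)           # (1, 3) (3, 2)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick way to check that the transposes are in the right place.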

## An example

<img src='assets/graph.png'>

#### We will implement the above graph in both numpy and tensorflow


```python
# Define inputs
n_hidden_1 = 3 # first hidden layer size
n_hidden_2 = 2 # second hidden layer size

X = np.array([[0.9, 0.2, 0.7]])
y = np.array([[0.3, 0.7]])
W1 = np.array([[0.3, 0.9, 0.5],[0.8, 0.1, 0.5],[0.8, 0.2, 0.7]])
W2 = np.array([[0.7, 0.5],[0.1, 0.5],[0.6, 0.1]])
W3 = np.array([[0.2, 0.3], [0.3, 0.4]])
print('Shapes of the Variables: x={}, y={}, W1={}, W2={}, W3={}'.format(X.shape, y.shape, W1.shape, W2.shape, W3.shape))
```

```
Shapes of the Variables: x=(1, 3), y=(1, 2), W1=(3, 3), W2=(3, 2), W3=(2, 2)
```


```python
# NumPy version

# sigmoid function
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid gradient
def sigmoidGradient(z):
    return sigmoid(z) * (1 - sigmoid(z))

# relu activation
def relu(z):
    return (z > 0) * z

# relu gradient
def reluGradient(z):
    return (z > 0) * 1.

#------------------------
# Forward Pass
#------------------------
z1 = X.dot(W1)
h1 = relu(z1)
dh1 = reluGradient(z1) # local gradient: derivative of the activation

z2 = h1.dot(W2)
h2 = relu(z2)
dh2 = reluGradient(z2)

z3 = h2.dot(W3)
pred = sigmoid(z3)
dh3 = sigmoidGradient(z3)

#-----------------------
# Backpropagation
#-----------------------
Error = (1/2.)*(pred-y)**2
df = (pred-y) # Error gradient

grad = df*dh3 # the incoming gradient is multiplied by the sigmoid gradient
dW3 = h2.T.dot(grad) # gradient for the weight W3
grad = grad.dot(W3.T) # gradient for the input h2

grad = grad*dh2 # the incoming gradient is multiplied by the relu gradient
dW2 = h1.T.dot(grad) # gradient for the weight W2
grad = grad.dot(W2.T) # gradient for the input h1

grad = grad*dh1 # the incoming gradient is multiplied by the relu gradient
dW1 = X.T.dot(grad) # gradient for the weight W1

print('The gradient of the weights, dW3:', dW3)
print('The gradient of the weights, dW2:', dW2)
print('The gradient of the weights, dW1:', dW1)
```

```
The gradient of the weights, dW3: [[0.11214822 0.00065075]
 [0.08597502 0.00049888]]
The gradient of the weights, dW2: [[0.01584061 0.02373813]
 [0.0155206  0.02325858]
 [0.01664064 0.02493703]]
The gradient of the weights, dW1: [[0.02087045 0.01223012 0.01079834]
 [0.00463788 0.0027178  0.00239963]
 [0.01623257 0.00951231 0.00839871]]
```


```python
# TensorFlow version (TensorFlow 1.x graph API)

X_tf = tf.constant(X)
y_tf = tf.constant(y)
W1_tf = tf.constant(W1)
W2_tf = tf.constant(W2)
W3_tf = tf.constant(W3)

# Same network as the NumPy version: relu -> relu -> sigmoid
pred_tf = tf.nn.sigmoid(tf.matmul(tf.nn.relu(tf.matmul(tf.nn.relu(tf.matmul(X_tf, W1_tf)), W2_tf)), W3_tf))
cost = tf.reduce_sum((1/2.)*(pred_tf-y_tf)**2)

dW3_tf = tf.gradients(cost, W3_tf)[0]
dW2_tf = tf.gradients(cost, W2_tf)[0]
dW1_tf = tf.gradients(cost, W1_tf)[0]

with tf.Session() as sess:
    print('The tf gradient of the weights, dW3:',sess.run(dW3_tf))
    print('The tf gradient of the weights, dW2:',sess.run(dW2_tf))
    print('The tf gradient of the weights, dW1:',sess.run(dW1_tf))
```

```
The tf gradient of the weights, dW3: [[0.11214822 0.00065075]
 [0.08597502 0.00049888]]
The tf gradient of the weights, dW2: [[0.01584061 0.02373813]
 [0.0155206  0.02325858]
 [0.01664064 0.02493703]]
The tf gradient of the weights, dW1: [[0.02087045 0.01223012 0.01079834]
 [0.00463788 0.0027178  0.00239963]
 [0.01623257 0.00951231 0.00839871]]
```


*The TensorFlow and NumPy gradients are the same!*
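
Another independent check is a numerical gradient: nudge each weight by a small amount and measure how the loss changes. The sketch below does this for W3 with central differences in NumPy. The helpers `forward_loss` and `numerical_grad` and the step size `eps` are choices made for this example and are not part of the code above.

```python
import numpy as np

X = np.array([[0.9, 0.2, 0.7]])
y = np.array([[0.3, 0.7]])
W1 = np.array([[0.3, 0.9, 0.5], [0.8, 0.1, 0.5], [0.8, 0.2, 0.7]])
W2 = np.array([[0.7, 0.5], [0.1, 0.5], [0.6, 0.1]])
W3 = np.array([[0.2, 0.3], [0.3, 0.4]])

def forward_loss(W1, W2, W3):
    """Same network as above: relu -> relu -> sigmoid, summed squared error."""
    h1 = np.maximum(X.dot(W1), 0.0)
    h2 = np.maximum(h1.dot(W2), 0.0)
    pred = 1.0 / (1.0 + np.exp(-h2.dot(W3)))
    return np.sum(0.5 * (pred - y) ** 2)

def numerical_grad(W, loss_fn, eps=1e-6):
    """Central-difference estimate of d(loss)/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + eps
        loss_plus = loss_fn()
        W[idx] = original - eps
        loss_minus = loss_fn()
        W[idx] = original  # restore the weight
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# W3 is perturbed in place, so the lambda sees each nudged value
dW3_num = numerical_grad(W3, lambda: forward_loss(W1, W2, W3))
print('Numerical dW3:', dW3_num)
```

The estimate should agree with the backpropagated dW3 to several decimal places. It is far too slow for training, since it needs two forward passes per weight, but it is a handy way to test a hand-written backward pass.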
