In [707]:
# Pre-requisites
import numpy as np
import time

# To clear print buffer
from IPython.display import clear_output

# Importing code from previous tutorial:

In [30]:
# Initializing weight matrices from layer sizes
def initializeWeights(layers):
    weights = [np.random.randn(o, i+1) for i, o in zip(layers[:-1], layers[1:])]
    return weights

# Add a bias term to every data point in the input
def addBiasTerms(X):
    # Make the input an np.array()
    X = np.array(X)

    # Forcing 1D vectors to be 2D matrices of 1xlength dimensions
    if X.ndim==1:
        X = np.reshape(X, (1, len(X)))

    # Inserting bias terms
    X = np.insert(X, 0, 1, axis=1)

    return X

# Sigmoid function
def sigmoid(a):
    return 1/(1 + np.exp(-a))

# Forward Propagation of outputs
def forwardProp(X, weights):
    # Initializing an empty list of outputs
    outputs = []
    
    # Assigning a name to reuse as inputs
    inputs = X
    
    # For each layer
    for w in weights:
        # Add bias term to input
        inputs = addBiasTerms(inputs)
        
        # Y = Sigmoid ( X .* W^T )
        outputs.append(sigmoid(np.dot(inputs, w.T)))
        
        # Input of next layer is output of this layer
        inputs = outputs[-1]
        
    return outputs

# Training Neural Networks

$$ Y^{(l)}_{n{\times}o_{l}} = Sigmoid\;(\;X^{(l)}_{n{\times}i_{l}} \; .* \; W^{(l)}{^{T}}_{i_{l}{\times}o_{l}}) \;\;\;\;\;\;-------------(1)$$

A neural network becomes useful once we can compute the $W$ that satisfies $Y = Sigmoid(X\;.*\;W^{T})$ for given $X$ and $Y$ (in supervised training).

But since there are so many weights (especially in bigger networks), solving the above equation algebraically is impractical. (Something like $W = X^{-1} \;.*\; Sigmoid^{-1}(Y)$...)

## Set W to minimize cost (computationally intensive)

A quicker way to compute W would be to randomly initialize it, and keep updating its value in such a way as to decrease the cost of the neural network.

Define the cost as the mean squared error of the output of the neural network:

$$error = yPred-Y$$

Here, $yPred$ = ``forwardProp``$(X)$, and $Y$ is the desired output value from the neural network.

$$Cost \; J = \frac{1}{2} \sum \limits_{n} \frac{ {\left( error \right)}^2 }{n} = \frac{1}{2} \sum \limits_{n} \frac{ {\left( yPred-Y \right)}^2 }{n}$$

Once we have initialized W, we need to change it such that J is minimized.

The direct way to minimize J w.r.t. W is to take the partial derivative of J w.r.t. W and equate it to 0: $\frac{{\partial}J}{{\partial}W} = 0$. But solving this equation for every weight is computationally intensive.

In [433]:
# Compute COST (J) of Neural Network
def nnCost(weights, X, Y):
    # Calculate yPred
    yPred = forwardProp(X, weights)[-1]
    
    # Compute J
    J = 0.5*np.sum((yPred-Y)**2)/len(Y)
    
    return J

In [434]:
# Initialize network
layers = [2, 2, 1]
weights = initializeWeights(layers)

In [435]:
# Declare input and desired output for AND gate
X = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = np.array([[0], [0], [0], [1]])

In [436]:
# Cost
J = nnCost(weights, X, Y)
print(J)

0.284231765606


## Randomly initialize W, change it to decrease cost (more feasible)

Instead, we initialize $W$ by randomly sampling from a standard normal distribution, and then keep changing $W$ so as to decrease the cost $J$.

But by what amount should we change $W$? To find out, let us focus on the weights of one of the neurons in the last layer, $W^{(L)}_{[k]}$, and differentiate $J$ with respect to them:

$$\frac{ {\partial}J} {{\partial}W^{(L)}_{[k]} }=\frac{\partial}{{\partial}W^{(L)}_{[k]}}\left(\frac{1}{2}\sum\limits_{n}{\frac{ {\left( yPred-Y \right)}^2 }{n} }\right)=\frac{1}{2*n}\sum\limits_{n} \left( \frac{\partial} {{\partial}W^{(L)}_{[k]}} (yPred-Y)^2 \right)=\frac{1}{n}\sum\limits_{n} \left( (yPred-Y) * \frac {{\partial} \; yPred} { {\partial}W^{(L)}_{[k]} } \right)$$

$$\Rightarrow \frac{ {\partial}J} {{\partial}W^{(L)}_{[k]} } = \frac{1}{n}\sum\limits_{n} \left( (error) * \frac {{\partial} \; yPred} { {\partial}W^{(L)}_{[k]} }  \right)$$

The above equation tells us how $J$ changes when $W^{(L)}_{[k]}$ changes. Approximating it to first order for a small change ${\Delta}W^{(L)}_{[k]}$:

$${\Delta}J ={{\Delta}W^{(L)}_{[k]}} * \left[ \frac{1}{n}\sum\limits_{n} \left( (error) * \frac {{\partial} \; yPred} { {\partial}W^{(L)}_{[k]} } \right) \right] \;\;\;\;\;\;-------------(2)$$ 

## Change $W^{(L)}_{[k]}$ so that $J$ always decreases

If we ensure that ${\Delta}W^{(L)}_{[k]}$ is equal to $-\left[ \frac{1}{n}\sum\limits_{n} \left( (error) * \frac {{\partial} \; yPred} { {\partial}W^{(L)}_{[k]} } \right) \right]$, we see that ${\Delta}J = {\Delta}W^{(L)}_{[k]}*\left(-\left[{\Delta}W^{(L)}_{[k]}\right]\right) = -\left[{\Delta}W^{(L)}_{[k]}\right]^{2} \Rightarrow$ negative! 

Thus, we decide to change $W^{(L)}_{[k]}$ by that amount which ensures $J$ always decreases!

$${\Delta}W^{(L)}_{[k]} = -\left[ \frac{1}{n}\sum\limits_{n} \left( (error) * \frac {{\partial} \; yPred} { {\partial}W^{(L)}_{[k]} } \right) \right] \;\;\;\;\;\;-------------(3)$$ 

So, for each weight in the last layer, we can compute a ${\Delta}W^{(L)}_{[k]}$ that decreases J (to first order, i.e. as long as the step is small enough).

## Gradient Descent

If we update each weight as $W^{(L)}_{[k]} \leftarrow W^{(L)}_{[k]} + {\Delta}W^{(L)}_{[k]}$, then, provided the step is small enough for the approximation in eq. (2) to hold, the new weights make the neural network produce outputs that are closer to the desired output.

This is how to train a neural network: randomly initialize $W$, then iteratively change $W$ according to eq. (3).

**This is called Gradient Descent.**

One way to think about this: assuming the graph of $J$ vs. $W$ looks like a valley (an upturned hill), we slowly descend its slope by changing $W$, towards the point where $J$ is minimum.

Near a minimum, J is (sort of) a quadratic function of W, so we can assume its graph looks (sort of) like such a valley.
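As a minimal sketch of this descent, consider a hypothetical one-dimensional cost $J(w) = \frac{1}{4}(w-3)^2$. This toy cost and the names `J` and `dJ` are assumptions chosen only for illustration; they are not the network's cost. Repeating $w \leftarrow w + {\Delta}w$ with ${\Delta}w = -\frac{dJ}{dw}$ moves $w$ towards the minimum at $w = 3$:

```python
# Toy cost (an assumption for illustration, not the network's cost)
def J(w):
    return 0.25 * (w - 3.0) ** 2

# Its derivative dJ/dw
def dJ(w):
    return 0.5 * (w - 3.0)

w = 0.0          # arbitrary starting point
costs = [J(w)]   # cost recorded before and after every update

for _ in range(20):
    deltaW = -dJ(w)      # step against the slope, as in eq. (3)
    w = w + deltaW       # W <- W + deltaW
    costs.append(J(w))

print(w)
print(costs[0], costs[-1])
```

For this gentle toy cost every step lowers $J$. A steeper cost would need smaller steps for the same update rule to keep decreasing.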

# Computing ${\Delta}W^{(L)}$ of last layer

To compute ${\Delta}W$, we need to compute $error$ and $\frac{{\partial}\;yPred}{{\partial}W^{(L)}}$

## 1. Computing error

$ error = yPred - Y = $ ``forwardProp``$(X) - Y \;\;\;\;\;\;-------------(4)$

For example, suppose we want to compute those $W$'s in a 3-neuron network that are able to perform AND logic on two inputs.

Here, for $X = \left[\begin{array}{c}(0,0)\\(0,1)\\(1,0)\\(1,1)\end{array}\right]$, $Y = \left[\begin{array}{c}0\\0\\0\\1\end{array}\right]$

In [686]:
# Initialize network
layers = [2, 2, 1]
weights = initializeWeights(layers)

print("weights:")
for i in range(len(weights)):
    print(i+1); print(weights[i].shape); print(weights[i])

weights:
1
(2, 3)
[[-0.87271574  0.35621485  0.95252276]
 [-0.61981924 -1.49164222  0.55011796]]
2
(1, 3)
[[-1.57656753 -1.10359895 -0.34594249]]


Our weights have been randomly initialized. Let us see what yPred they give:

In [687]:
# Declare input and desired output for AND gate
X = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = np.array([[0], [0], [0], [1]])

In [688]:
# Calculate outputs at each layer by forward propagation
outputs = forwardProp(X, weights)
print("outputs"); print(outputs)

outputs
[array([[ 0.29468953,  0.34982256],
       [ 0.51994117,  0.48258173],
       [ 0.37367081,  0.10798781],
       [ 0.60731071,  0.17345395]]), array([[ 0.11682925],
       [ 0.08969868],
       [ 0.11646832],
       [ 0.09056134]])]


In [689]:
# Calculate yPred as the last output from forward propagation
yPred = outputs[-1]
print(yPred.shape); print(yPred)

(4, 1)
[[ 0.11682925]
 [ 0.08969868]
 [ 0.11646832]
 [ 0.09056134]]


In [690]:
# Error = yPred - Y
error = yPred - Y
print(error.shape); print(error)

(4, 1)
[[ 0.11682925]
 [ 0.08969868]
 [ 0.11646832]
 [-0.90943866]]


## 2. Computing $\frac{{\partial}\;yPred}{{\partial}W^{(L)}_{[k]}}$

From eq. (1), $yPred$ can be written as:

$$yPred = Sigmoid(X^{(L)}\;.*\;W^{(L)}{^{T}}) = Sigmoid(\sum\limits_{o_{L}}X^{(L)}.*W^{(L)})$$

So,

$$\frac{{\partial}\;yPred}{{\partial}W^{(L)}_{[k]}} = \frac{{\partial}}{{\partial}W^{(L)}_{[k]}}\left(Sigmoid\left(\sum\limits_{o_{L}}X^{(L)}.*W^{(L)}\right)\right) = Sigmoid^{'}\left(\sum\limits_{o_{L}}X^{(L)}.*W^{(L)}\right)*\left(\frac{{\partial}}{{\partial}W^{(L)}_{[k]}}\left(\sum\limits_{o_{L}}(X^{(L)}.*W^{(L)})\right)\right)$$

### - Computing $Sigmoid^{'}\left(\sum\limits_{o_{L}}X^{(L)}.*W^{(L)}\right)$

It can be verified that $Sigmoid^{'}(a) = Sigmoid(a)*(1-Sigmoid(a))$. Thus, $Sigmoid^{'}(\sum\limits_{o_{L}}X^{(L)}.*W^{(L)}) = yPred*(1 - yPred)$. So,

$$\frac{{\partial}\;yPred}{{\partial}W^{(L)}_{[k]}} = \left(yPred*(1 - yPred)*\left(\frac{{\partial}\sum\limits_{o_{L}}(X^{(L)}.*W^{(L)})}{{\partial}W^{(L)}_{[k]}}\right)\right)$$

$${\Delta}W^{(L)}_{[k]} = -\left[\frac{1}{n}\sum\limits_{n}\left(error*yPred*(1 - yPred)*\left(\frac{{\partial}\sum\limits_{o_{L}}(X^{(L)}.*W^{(L)})}{{\partial}W^{(L)}_{[k]}}\right)\right)\right]  \;\;\;\;\;\;-------------(5)$$
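The identity $Sigmoid^{'}(a) = Sigmoid(a)*(1-Sigmoid(a))$ used above can be checked numerically. The sketch below compares it against a central finite difference; the sample points and the step `h` are arbitrary choices made for this illustration:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Sample points at which to compare the two derivatives
a = np.linspace(-5, 5, 11)

# Central finite-difference approximation of the derivative
h = 1e-5
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)

# Closed-form derivative: Sigmoid(a) * (1 - Sigmoid(a))
analytic = sigmoid(a) * (1 - sigmoid(a))

maxDiff = np.max(np.abs(numeric - analytic))
print(maxDiff)
```

The largest difference should be tiny compared with the derivative itself, which peaks at $0.25$ for $a = 0$.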

### - Computing $\frac{{\partial}}{{\partial}W^{(L)}_{[k]}}(\sum\limits_{o_{L}}(X^{(L)}.*W^{(L)}))$

It can be seen that $\frac{{\partial}}{{\partial}W^{(L)}_{[k]}}(\sum\limits_{o_{L}}(X^{(L)}.*W^{(L)})) = \frac{{\partial}}{{\partial}W^{(L)}_{[k]}}(X^{(L)}.*W^{(L)}_{[0]}+...+X^{(L)}.*W^{(L)}_{[k]}+...+X^{(L)}.*W^{(L)}_{[o_{L}-1]}) = X^{(L)}$

We also know that $X^{(L)} = \left[ \begin{array}{cc} 1 & Y^{(L-1)} \end{array} \right]_{n{\times}i_{L}}$, and $Y^{(L-1)}$ has been computed during Forward Propagation. So,

$$\frac{{\partial}\;yPred}{{\partial}W^{(L)}} = (yPred*(1-yPred))*X^{(L)} $$

$${\Delta}W^{(L)}_{[k]} = -\left[\frac{1}{n}\sum\limits_{n}\left(error*yPred*(1 - yPred)*X^{(L)}\right)\right]\;\;\;\;\;\;-------------(6)$$

## Combining terms to simplify computation

Here, the dimension of $error$, $yPred$, and $(1-yPred)$ is $n{\times}o_{L}$, while that of $x^{(L)}$ is $n{\times}i_{L}$. A little thought has to be given to how these quantities are multiplied.

First of all, we can combine the first three (which share the same dimension and are multiplied element-wise) into one and call it $\delta$.

$${\delta}_{n{\times}o_{L}} = error_{n{\times}o_{L}}*yPred_{n{\times}o_{L}}*(1-yPred)_{n{\times}o_{L}} \;\;\;\;\;\;-----(7)$$

$${\Delta}W^{(L)}_{[k]} = -\left[\frac{1}{n}\sum\limits_{n}\left({\delta}*x^{(L)}\right)\right] $$

One way of figuring out how $\delta$ and $x^{(L)}$ are combined is to see that the dimension of ${\Delta}W$ is $o_{L}{\times}i_{L}$, dimension of $\delta$ is $n{\times}o_{L}$, and the dimension of $x^{(L)}$ is $n{\times}i_{L}$.

Clearly, the $\sum\limits_{n}\left({\delta}*x^{(L)}\right)$ term, when considered for all the weights, is equal to $\delta^{T}_{o_{L}{\times}n}\;.*\;x^{(L)}_{n{\times}i_{L}}$, the summation over $n$ being taken care of by the dot product, and the output dimension ${o_{L}{\times}i_{L}}$ matches that of $W^{(L)}$.

Hence, using matrix operations, ${\Delta}W^{(L)}$ can be found as:

$${\Delta}W^{(L)}_{{o_{L}{\times}i_{L}}} = -\frac{1}{n}\left({\delta}^{T}{_{o_{L}{\times}n}}\;.*\;x^{(L)}_{n{\times}i_{L}}\right) \;\;\;\;\;\;-------------(8)$$
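The claim that the dot product takes care of the summation over $n$ can be checked on small random arrays. In this sketch the sizes ($n=4$, $o_{L}=2$, $i_{L}=3$) and the random values are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, oL, iL = 4, 2, 3

# Stand-ins for delta (n x oL) and the layer input x (n x iL)
delta = rng.standard_normal((n, oL))
xL = rng.standard_normal((n, iL))

# Explicit summation over the n data points:
# each point contributes the outer product of its delta row and its x row
explicit = np.zeros((oL, iL))
for s in range(n):
    explicit += np.outer(delta[s], xL[s])

# Matrix form: delta^T (oL x n) dotted with x (n x iL)
matrixForm = np.dot(delta.T, xL)

# deltaW as in eq. (8)
deltaW = -matrixForm / n

print(np.allclose(explicit, matrixForm))
print(deltaW.shape)
```

Both routes give the same $o_{L}{\times}i_{L}$ matrix, so the loop over data points is not needed.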

In [691]:
# Calculate delta for the last layer
delta = np.multiply(np.multiply(error, yPred), 1-yPred)
print(delta.shape); print(delta)

(4, 1)
[[ 0.01205446]
 [ 0.00732415]
 [ 0.01198499]
 [-0.07490136]]


In [692]:
# Find input to the last layer
xL = addBiasTerms(outputs[-2])
print(xL.shape); print(xL)

(4, 3)
[[ 1.          0.29468953  0.34982256]
 [ 1.          0.51994117  0.48258173]
 [ 1.          0.37367081  0.10798781]
 [ 1.          0.60731071  0.17345395]]


In [693]:
# Find deltaW for last layer
deltaW = -np.dot(delta.T, xL)/len(Y)
print(deltaW.shape); print(deltaW)

(1, 3)
[[ 0.01088444  0.00841238  0.00098657]]


In [694]:
# Checking cost of neural network before and after change in W^{L}
newWeights = [np.array(w) for w in weights]
newWeights[-1] += deltaW

print("old weights:")
for i in range(len(weights)):
    print(i+1); print(weights[i].shape); print(weights[i])

print("new weights:")
for i in range(len(newWeights)):
    print(i+1); print(newWeights[i].shape); print(newWeights[i])

print("old cost:"); print(nnCost(weights, X, Y))
print("new cost:"); print(nnCost(newWeights, X, Y))

old weights:
1
(2, 3)
[[-0.87271574  0.35621485  0.95252276]
 [-0.61981924 -1.49164222  0.55011796]]
2
(1, 3)
[[-1.57656753 -1.10359895 -0.34594249]]
new weights:
1
(2, 3)
[[-0.87271574  0.35621485  0.95252276]
 [-0.61981924 -1.49164222  0.55011796]]
2
(1, 3)
[[-1.5656831  -1.09518657 -0.34495592]]
old cost:
0.107792308277
new cost:
0.107601673739


### **Congratulations! You've just learned how to back propagate!**
(1 layer only)
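A common way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate of $\frac{{\partial}J}{{\partial}W^{(L)}}$. The sketch below does this for the last layer of a [2, 2, 1] network on the AND data. The helpers `withBias`, `forward` and `cost` are compact restatements of `addBiasTerms`, `forwardProp` and `nnCost` so that the snippet runs on its own; the seed and the step `eps` are arbitrary assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def withBias(A):
    # Prepend a column of ones (the bias input) to a 2D array
    return np.insert(np.asarray(A, dtype=float), 0, 1, axis=1)

def forward(X, weights):
    outputs, inputs = [], X
    for w in weights:
        outputs.append(sigmoid(np.dot(withBias(inputs), w.T)))
        inputs = outputs[-1]
    return outputs

def cost(weights, X, Y):
    yPred = forward(X, weights)[-1]
    return 0.5 * np.sum((yPred - Y) ** 2) / len(Y)

rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 3))]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [0], [0], [1]])

# Analytic gradient of J w.r.t. the last layer's weights (deltaW is its negative)
outputs = forward(X, weights)
yPred = outputs[-1]
delta = (yPred - Y) * yPred * (1 - yPred)
analyticGrad = np.dot(delta.T, withBias(outputs[-2])) / len(Y)

# Central finite-difference estimate, one weight at a time
eps = 1e-6
numericGrad = np.zeros_like(weights[-1])
for idx in np.ndindex(*weights[-1].shape):
    plus = [w.copy() for w in weights]
    minus = [w.copy() for w in weights]
    plus[-1][idx] += eps
    minus[-1][idx] -= eps
    numericGrad[idx] = (cost(plus, X, Y) - cost(minus, X, Y)) / (2 * eps)

maxDiff = np.max(np.abs(analyticGrad - numericGrad))
print(maxDiff)
```

If the derivation is right, the two gradients agree up to the finite-difference error.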

# Back-propagation through layers

For the last layer, according to eq. (5),
$${\Delta}W^{(L)}_{[k]} = -\frac{1}{n}\sum\limits_{n}\left(error*yPred*(1 - yPred)*\left(\frac{{\partial}\sum\limits_{o_{L}}(X^{(L)}.*W^{(L)})}{{\partial}W^{(L)}_{[k]}}\right)\right) = -\frac{1}{n}\sum\limits_{n}\left(\delta^{(L)}*\frac{{\partial}\sum\limits_{o_{L}}(X^{(L)}.*W^{(L)})}{{\partial}W^{(L)}_{[k]}}\right)$$

### Computing for Layer L-1

If we go back one more layer to find out ${\Delta}W$ for the $k^{th}$ neuron in the $(L-1)^{th}$ layer,

$${\Delta}W^{(L-1)}_{[k]} = -\frac{1}{n}\sum\limits_{n}\left(\delta^{(L)}*\frac{{\partial}\sum\limits_{o_{L}}(X^{(L)}.*W^{(L)})}{{\partial}W^{(L-1)}_{[k]}}\right) = -\frac{1}{n}\sum\limits_{n}\left(\delta^{(L)}*\frac{{\partial}\sum\limits_{o_{L}}(Y^{(L-1)}.*W^{(L)})}{{\partial}W^{(L-1)}_{[k]}}\right)$$

Ignoring dimensionalities for now, we can see that change in $W^{(L-1)}$ does not affect $W^{(L)}$.

But, change in $W^{(L-1)}$ does produce change in $Y^{(L-1)}$, because $Y^{(L-1)} = Sigmoid(X^{(L-1)}.*W^{(L-1)})$. So,

$${\Delta}W^{(L-1)}_{[k]} = -\frac{1}{n}\sum\limits_{n}\left(\delta^{(L)}*W^{(L)}*\frac{{\partial}\;(Y^{(L-1)})}{{\partial}W^{(L-1)}_{[k]}}\right) = -\frac{1}{n}\sum\limits_{n}\left(\delta^{(L)}*W^{(L)}*\frac{{\partial}\;Sigmoid(X^{(L-1)}.*W^{(L-1)})}{{\partial}W^{(L-1)}_{[k]}}\right)$$

We know how this goes now.

$$\frac{{\partial}\;Sigmoid(X^{(L-1)}.*W^{(L-1)})}{{\partial}W^{(L-1)}_{[k]}} = Sigmoid^{'}(X^{(L-1)}.*W^{(L-1)})*\frac{{\partial}(X^{(L-1)}.*W^{(L-1)})}{{\partial}W^{(L-1)}_{[k]}} = Y^{(L-1)}*(1 - Y^{(L-1)})*X^{(L-1)}$$


Thus,

$${\Delta}W^{(L-1)}_{[k]} = -\left[\frac{1}{n}\sum\limits_{n}\left(\delta^{(L)}*W^{(L)}*Y^{(L-1)}*(1 - Y^{(L-1)})*X^{(L-1)}\right)\right] \;\;\;\;\;\;--------------(9)$$

We can observe here that the terms $\delta^{(L)}$ and $W^{(L)}$ are back-propagated from the last layer. Let's combine them and call it the back-propagated error:
$$bpError^{(L-1)} = \delta^{(L)}*W^{(L)}$$ 

Thus,
$${\Delta}W^{(L-1)}_{[k]} = -\left[\frac{1}{n}\sum\limits_{n}\left(bpError^{(L-1)}*Y^{(L-1)}*(1 - Y^{(L-1)})*X^{(L-1)}\right)\right] \;\;\;\;\;\;--------------(10)$$

### Simplifying to matrix operation

Just as we had done for the last layer,

<center>$bpError^{(l)}_{n{\times}(o_{l}+1)} = \delta^{(l+1)}_{n{\times}o_{l+1}}\;.*\;W^{(l+1)}_{o_{l+1}{\times}i_{l+1}}$ (calculated in the next layer; note that $i_{l+1} = o_{l}+1$ because of the bias term)</center>

While back-propagating, we need to ignore the term associated with the bias weight to make bpError's dimensions correct ($n{\times}o_{l}$).

$${\delta}^{(l)}_{n{\times}o_{l}} = bpError^{(l)}_{n{\times}o_{l}}*yPred^{(l)}_{n{\times}o_{l}}*(1-yPred^{(l)})_{n{\times}o_{l}}$$

$${\Delta}W^{(l)}_{{o_{l}{\times}i_{l}}} = -\frac{1}{n}\left({\delta^{(l)}}^{T}{_{o_{l}{\times}n}}\;.*\;X^{(l)}_{n{\times}i_{l}}\right) \;\;\;\;\;\;-------------(11)$$
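The shapes involved, including the dropped bias column, can be traced on a small [2, 2, 1] network. The sketch below also compares the resulting ${\Delta}W$ of the first layer with a finite-difference estimate. The helper names (`withBias`, `cost`), the seed and the step `eps` are assumptions made for this illustration:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def withBias(A):
    # Prepend a column of ones (the bias input) to a 2D array
    return np.insert(np.asarray(A, dtype=float), 0, 1, axis=1)

def cost(W1, W2, X, Y):
    hidden = sigmoid(np.dot(withBias(X), W1.T))
    yPred = sigmoid(np.dot(withBias(hidden), W2.T))
    return 0.5 * np.sum((yPred - Y) ** 2) / len(Y)

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [0], [0], [1]])
W1 = rng.standard_normal((2, 3))   # first layer: o = 2, i = 2 + 1
W2 = rng.standard_normal((1, 3))   # last layer:  o = 1, i = 2 + 1

# Forward pass
hidden = sigmoid(np.dot(withBias(X), W1.T))       # (4, 2)
yPred = sigmoid(np.dot(withBias(hidden), W2.T))   # (4, 1)

# Last layer's delta, then the back-propagated error
deltaLast = (yPred - Y) * yPred * (1 - yPred)     # (4, 1)
bpErrorFull = np.dot(deltaLast, W2)               # (4, 3), includes the bias column
bpError = bpErrorFull[:, 1:]                      # (4, 2), bias column dropped

# Delta and deltaW of the first layer, as in eq. (11)
deltaHidden = bpError * hidden * (1 - hidden)     # (4, 2)
deltaW1 = -np.dot(deltaHidden.T, withBias(X)) / len(Y)   # (2, 3)

# Finite-difference estimate of dJ/dW1 for comparison
eps = 1e-6
numericGrad = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    plus, minus = W1.copy(), W1.copy()
    plus[idx] += eps
    minus[idx] -= eps
    numericGrad[idx] = (cost(plus, W2, X, Y) - cost(minus, W2, X, Y)) / (2 * eps)

print(bpErrorFull.shape, bpError.shape, deltaW1.shape)
print(np.max(np.abs(deltaW1 + numericGrad)))
```

Since ${\Delta}W$ is the negative of the gradient, `deltaW1 + numericGrad` should be close to zero everywhere.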


In [695]:
# IMPLEMENTING BACK-PROPAGATION
def backProp(weights, X, Y):
    # Forward propagate to find outputs
    outputs = forwardProp(X, weights)
    
    # For the last layer, bpError = error = yPred - Y
    bpError = outputs[-1] - Y
    
    # Back-propagating from the last layer to the first
    for l, w in enumerate(reversed(weights)):
        
        # Find yPred for this layer
        yPred = outputs[-l-1]
        
        # Calculate delta for this layer using bpError from next layer
        delta = np.multiply(np.multiply(bpError, yPred), 1-yPred)
        
        # Find input to the layer, by adding bias to the output of the previous layer
        # Take care: l counts from 0 (last layer) up to len(weights)-1 (first layer),
        # because the weights are traversed in reverse order
        if l==len(weights)-1: # If 1st layer has been reached
            xL = addBiasTerms(X)
        else:
            xL = addBiasTerms(outputs[-l-2])
        
        # Calculate deltaW for this layer
        deltaW = -np.dot(delta.T, xL)/len(Y)
        
        # Calculate bpError for previous layer to be back-propagated
        bpError = np.dot(delta, w)
        
        # Ignore bias term in bpError
        bpError = bpError[:,1:]
        
        # Change weights of the current layer (W <- W + deltaW)
        w += deltaW

In [698]:
# To check with the single back-propagation step done before,
# back up the current weights
oldWeights = [np.array(w) for w in weights]
print("old weights:")
for i in range(len(oldWeights)):
    print(i+1); print(oldWeights[i].shape); print(oldWeights[i])

print("old cost:"); print(nnCost(oldWeights, X, Y))

old weights:
1
(2, 3)
[[-0.87271574  0.35621485  0.95252276]
 [-0.61981924 -1.49164222  0.55011796]]
2
(1, 3)
[[-1.57656753 -1.10359895 -0.34594249]]
old cost:
0.107792308277


Let us define a function to compute the accuracy of our model, irrespective of the number of neurons in the output layer.

In [699]:
# Evaluate the accuracy of weights for input X and desired output Y
def evaluate(weights, X, Y):
    yPreds = forwardProp(X, weights)[-1]
    # Check if maximum probability is from that neuron corresponding to desired class,
    # AND check if that maximum probability is greater than 0.5
    yes = sum( int( ( np.argmax(yPreds[i]) == np.argmax(Y[i]) ) and 
                    ( (yPreds[i][np.argmax(yPreds[i])]>0.5) == (Y[i][np.argmax(Y[i])]>0.5) ) )
              for i in range(len(Y)) )
    print(str(yes)+" out of "+str(len(Y))+" : "+str(float(yes/len(Y))))

Check the results of back-propagation:

In [722]:
# BACK-PROPAGATE, checking old & new weights and costs

# Re-initialize to old weights
weights = [np.array(w) for w in oldWeights]

#print("old weights:")
#for i in range(len(weights)):
#    print(i+1); print(weights[i].shape); print(weights[i])

print("old cost: "); print(nnCost(weights, X, Y))
print("old accuracy: "); evaluate(weights, X, Y)  # evaluate() prints its result and returns None
for i in range(1000):
    # Back propagate
    backProp(weights, X, Y)

    #print("new weights:")
    #for i in range(len(weights)):
    #    print(i+1); print(weights[i].shape); print(weights[i])
    
    if i%50==0:
        time.sleep(1)
        clear_output()
        print(i)
        print("new cost:"); print(nnCost(weights, X, Y))
        print("new accuracy: "); evaluate(weights, X, Y)
        print(forwardProp(X, weights)[-1])


950
new cost:
0.0113971310862
new accuracy: 
4 out of 4 : 1.0
[[ 0.03022141]
 [ 0.13740936]
 [ 0.13683374]
 [ 0.7705247 ]]


In [718]:
# Revert back to original weights (if needed)
weights = [np.array(w) for w in oldWeights]

### Training

Keep calling backProp() again and again until the cost has decreased enough for the network to reach our desired accuracy.

You can observe the cost of the network going down with iterations.

# Problems

### - Not reaching desired accuracy fast enough

It takes too many iterations of the backProp algorithm for the network to reach the desired output.

One of the simplest ways of solving this problem is by adding a Learning Rate (described below) to the back-propagation algorithm.

### - Taking too long to compute one iteration

Within one iteration, the multiplication and summing operations take too long because there are too many data points fed into the network.

This problem is tackled using Stochastic Gradient Descent (talked about in the next tutorial). The above algorithm is running Batch Gradient Descent. 

# Learning Rate

Usually, we want to control the size of the step taken in each back-propagation update, so that the network can reach the desired accuracy faster. So we multiply ${\Delta}W$ by a factor $\eta$, the learning rate:

$$W \leftarrow W + \eta*{\Delta}W$$

If $\eta$ is large, then we take bigger steps to the assumed minimum. If $\eta$ is small, we take smaller steps.

Remember that we are not actually travelling along the gradient: we are only approximating the direction using a finite step ${\Delta}W$ instead of an infinitesimal ${\partial}W$. So we don't always point in the direction of the minimum, and we could undershoot or overshoot.

If $\eta$ is too small, we might take too long to get to the minimum.

If $\eta$ is too big, we might start climbing back up the hill and our cost would keep increasing instead of decreasing!
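These three regimes can be seen on a hypothetical toy cost $J(w) = w^2$, whose gradient is $2w$. The cost, the function name `descend` and the three values of $\eta$ are assumptions chosen for illustration; they are not taken from the network above:

```python
# Run a fixed number of updates w <- w + eta * deltaW on J(w) = w^2,
# where deltaW = -dJ/dw = -2w, and return the final cost
def descend(eta, steps=25, w=1.0):
    for _ in range(steps):
        deltaW = -2.0 * w
        w = w + eta * deltaW
    return w * w

initialCost = 1.0            # J(1.0), the cost at the starting point

small = descend(0.001)       # tiny steps: the cost barely moves
good = descend(0.1)          # moderate steps: the cost drops quickly
big = descend(1.1)           # oversized steps: the cost grows

print(small, good, big)
```

With the moderate $\eta$ the cost falls well below the starting cost, with the tiny $\eta$ it has hardly changed after the same number of steps, and with the oversized $\eta$ every step overshoots the minimum by more than the previous one.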

One simple heuristic for adapting the learning rate is to start at, say, $\eta = 1$, and then:
- increase $\eta$ by 5% if the cost is decreasing
- decrease $\eta$ to 50% of its value if the cost is increasing
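A minimal sketch of this heuristic, on a hypothetical two-weight cost $J(w) = w_{0}^2 + 10w_{1}^2$ (an assumption for illustration, not the network's cost). One further assumption beyond the rule as stated: when a step would increase the cost, the step is discarded before $\eta$ is halved.

```python
import numpy as np

# Toy cost with different curvature along each weight (illustrative assumption)
curvature = np.array([1.0, 10.0])

def cost(w):
    return np.sum(curvature * w ** 2)

def gradient(w):
    return 2 * curvature * w

w = np.array([1.0, 1.0])
eta = 1.0
costs = [cost(w)]
etas = [eta]

for _ in range(200):
    trial = w + eta * (-gradient(w))   # W + eta * deltaW
    if cost(trial) < costs[-1]:
        w = trial                      # cost decreased: keep the step
        eta *= 1.05                    # and increase eta by 5%
    else:
        eta *= 0.5                     # cost increased: discard the step, halve eta
    costs.append(cost(w))
    etas.append(eta)

print(costs[0], costs[-1], eta)
```

Because rejected steps are discarded, the recorded cost never goes up, while $\eta$ keeps probing for the largest step that still helps.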

### Different ways to manipulate learning rate

There are various methods that adapt the learning rate so as to "converge" (reach a minimum) faster. The following image compares several of them, including some with more elaborate ways of trying to converge faster:

<center>![Optimizers](images/optimizers.gif)

As can be seen, Stochastic Gradient Descent (SGD) itself performs slower than all the other methods, and the one that we are using (Batch Gradient Descent) is even slower.

Below is an implementation of backProp with provision for learning rate:

In [485]:
# IMPLEMENTING BACK-PROPAGATION WITH LEARNING RATE
# Added eta, the learning rate, as an input
def backProp(weights, X, Y, learningRate):
    # Forward propagate to find outputs
    outputs = forwardProp(X, weights)
    
    # For the last layer, bpError = error = yPred - Y
    bpError = outputs[-1] - Y
    
    # Back-propagating from the last layer to the first
    for l, w in enumerate(reversed(weights)):
        
        # Find yPred for this layer
        yPred = outputs[-l-1]
        
        # Calculate delta for this layer using bpError from next layer
        delta = np.multiply(np.multiply(bpError, yPred), 1-yPred)
        
        # Find input to the layer, by adding bias to the output of the previous layer
        # Take care: l counts from 0 (last layer) up to len(weights)-1 (first layer),
        # because the weights are traversed in reverse order
        if l==len(weights)-1: # If 1st layer has been reached
            xL = addBiasTerms(X)
        else:
            xL = addBiasTerms(outputs[-l-2])
        
        # Calculate deltaW for this layer
        deltaW = -np.dot(delta.T, xL)/len(Y)
        
        # Calculate bpError for previous layer to be back-propagated
        bpError = np.dot(delta, w)
        
        # Ignore bias term in bpError
        bpError = bpError[:,1:]
        
        # Change weights of the current layer (W <- W + eta*deltaW)
        w += learningRate*deltaW

Given this back-propagation code, it is convenient to write another function that calls it iteratively until we reach the desired accuracy.
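Such a wrapper could look like the sketch below. The name `train` and its parameters `targetCost` and `maxIterations` are hypothetical, and stopping on a cost threshold is only one possible criterion. The helpers `withBias`, `forward`, `cost` and `backPropStep` are compact restatements of `addBiasTerms`, `forwardProp`, `nnCost` and `backProp` so that the snippet runs on its own:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def withBias(A):
    # Prepend a column of ones (the bias input) to a 2D array
    return np.insert(np.asarray(A, dtype=float), 0, 1, axis=1)

def forward(X, weights):
    outputs, inputs = [], X
    for w in weights:
        outputs.append(sigmoid(np.dot(withBias(inputs), w.T)))
        inputs = outputs[-1]
    return outputs

def cost(weights, X, Y):
    return 0.5 * np.sum((forward(X, weights)[-1] - Y) ** 2) / len(Y)

def backPropStep(weights, X, Y, learningRate):
    # One back-propagation update, modifying the weights in place
    outputs = forward(X, weights)
    bpError = outputs[-1] - Y
    for l, w in enumerate(reversed(weights)):
        yPred = outputs[-l-1]
        delta = bpError * yPred * (1 - yPred)
        xL = withBias(X) if l == len(weights) - 1 else withBias(outputs[-l-2])
        deltaW = -np.dot(delta.T, xL) / len(Y)
        bpError = np.dot(delta, w)[:, 1:]   # computed before w is changed
        w += learningRate * deltaW

def train(weights, X, Y, learningRate=1.0, targetCost=0.01, maxIterations=5000):
    # Call backPropStep until the cost is low enough or the iteration budget runs out
    history = [cost(weights, X, Y)]
    while history[-1] > targetCost and len(history) <= maxIterations:
        backPropStep(weights, X, Y, learningRate)
        history.append(cost(weights, X, Y))
    return history

rng = np.random.default_rng(0)
layers = [2, 2, 1]
weights = [rng.standard_normal((o, i + 1)) for i, o in zip(layers[:-1], layers[1:])]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [0], [0], [1]])

history = train(weights, X, Y)
print(len(history) - 1, "iterations")
print(history[0], history[-1])
```

Returning the cost history makes it easy to plot or inspect how quickly the cost falls for a given `learningRate`.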

We shall look at training schemes and experiments in the next tutorial.