In [5]:
# Pre-requisites
import numpy as np

# To clear print buffer
from IPython.display import clear_output

Bringing code from the previous tutorial:

In [6]:
# Initializing weight matrices from layer sizes
def initializeWeights(layers):
    weights = [np.random.randn(o, i+1) for i, o in zip(layers[:-1], layers[1:])]
    return weights

# Add a bias term to every data point in the input
def addBiasTerms(X):
    # Make the input an np.array()
    X = np.array(X)

    # Force 1D vectors to be 2D matrices of shape 1 x length
    if X.ndim == 1:
        X = np.reshape(X, (1, len(X)))

    # Insert a bias term (a leading 1) into every data point
    X = np.insert(X, 0, 1, axis=1)

    return X
    
# Sigmoid function
def sigmoid(a):
    return 1/(1 + np.exp(-a))

# Forward Propagation of outputs
def forwardProp(X, weights):
    # Initializing an empty list of outputs
    outputs = []
    
    # Assigning a name to reuse as inputs
    inputs = X
    
    # For each layer
    for w in weights:
        # Add bias term to input
        inputs = addBiasTerms(inputs)
        
        # Y = Sigmoid ( X .* W^T )
        outputs.append(sigmoid(np.dot(inputs, w.T)))
        
        # Input of next layer is output of this layer
        inputs = outputs[-1]
        
    return outputs

# Training Neural Networks

$$ Y^{(l)}_{n{\times}o_{l}} = Sigmoid\;(\;X^{(l)}_{n{\times}i_{l}} \; .* \; W^{(l)}{^{T}}_{i_{l}{\times}o_{l}} \;)\; $$

Neural networks become useful once we can find the $W$ that satisfies $Y = Sigmoid(X \; .* \; W^{T})$ for given $X$ and $Y$ (as in supervised training).

But since bigger networks have so many weights, solving the above equation algebraically is time-intensive. (Something like $W^{T} = X^{-1} \;.*\; Sigmoid^{-1}(Y)$...)
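
To make the algebraic route concrete, here is a minimal sketch for the one case where it is straightforward: a single layer whose targets were produced by that very layer. It assumes the bias-augmented input has full column rank, stands in for $X^{-1}$ with a least-squares solve (`np.linalg.lstsq`), and uses the logit as the inverse of the sigmoid. The names `W_true`, `X_b`, and `W_rec` are illustrative and do not come from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def logit(y):
    # Inverse of the sigmoid, valid for 0 < y < 1
    return np.log(y / (1 - y))

# Hypothetical single layer: 3 inputs (+1 bias) -> 2 outputs
W_true = rng.standard_normal((2, 4))

# 10 data points with a bias column of ones prepended
X = rng.standard_normal((10, 3))
X_b = np.insert(X, 0, 1, axis=1)

# Targets generated by the layer itself, so an exact W exists
Y = sigmoid(X_b @ W_true.T)

# Solve X_b . W^T = logit(Y) in the least-squares sense
W_rec_T, *_ = np.linalg.lstsq(X_b, logit(Y), rcond=None)
W_rec = W_rec_T.T

# Check whether the recovered weights match the ones that generated Y
print(np.allclose(W_rec, W_true))
```

With more than one layer the hidden-layer outputs are unknown, so this one-step solve no longer applies directly.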

## Set W to minimize cost (computationally intensive)

A quicker way to compute W would be to randomly initialize it, and keep updating its value in such a way as to decrease the cost of the neural network.

Define the cost as half the mean squared error of the output of the neural network, starting from the error:

$$error = yPred-\hat{Y}$$

Here, $yPred$ is the last output returned by ``forwardProp(X, weights)``, and $\hat{Y}$ is the desired value of $Y$.

$$Cost \; J = \frac{1}{2} \sum \limits_{n} \frac{ {\left( error \right)}^2 }{n} = \frac{1}{2} \sum \limits_{n} \frac{ {\left( yPred-\hat{Y} \right)}^2 }{n}$$
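
As a minimal sketch, the cost formula translates to NumPy as follows. The function name `cost` and the toy values are assumptions for illustration, and the sketch assumes a single output neuron so that the sum runs over the $n$ data points only.

```python
import numpy as np

def cost(yPred, Y):
    # J = (1/2) * sum over the n data points of error^2 / n
    error = np.asarray(yPred, dtype=float) - np.asarray(Y, dtype=float)
    n = error.shape[0]
    return 0.5 * np.sum(error ** 2) / n

# Toy values (not from the tutorial): two data points, one of them off by 1
print(cost([[1.0], [0.0]], [[0.0], [0.0]]))  # 0.25
```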

Once we have initialized W, we need to change it such that J is minimized.

The direct way to minimize $J$ w.r.t. $W$ is to take the partial derivative of $J$ w.r.t. $W$, set it to 0, and solve for $W$: $\frac{{\partial}J}{{\partial}W} = 0$. But this is computationally intensive.

## Randomly initialize W, change it to decrease cost (more feasible)

Instead, we initialize $W$ by randomly sampling from a standard normal distribution, and then keep changing $W$ so as to decrease the cost $J$.

But what value to change $W$ by? To find out, let us differentiate $J$ by $W^{(L)}$ (weight matrix of the last layer) and see what we get:

$$\frac{ {\partial}J} {{\partial}W^{(L)} }=\frac{\partial}{{\partial}W^{(L)}}\left(\frac{1}{2}\sum\limits_{n}{\frac{ {\left( yPred-\hat{Y} \right)}^2 }{n} }\right)=\frac{1}{2*n}\sum\limits_{n} \left( \frac{\partial} {{\partial}W^{(L)}} (yPred-\hat{Y})^2 \right)=\frac{1}{n}\sum\limits_{n} \left( (yPred-\hat{Y}) \cdot \frac {{\partial} \; yPred} { {\partial}W^{(L)} } \right)$$

$$\Rightarrow \frac{ {\partial}J} {{\partial}W^{(L)} } = \frac{1}{n}\sum\limits_{n} \left( (error) \cdot \frac {{\partial} \; yPred} { {\partial}W^{(L)} }  \right)$$

Approximating the above equation for numerical analysis:

$${\Delta}J ={{\Delta}W^{(L)}} * \left[ \frac{1}{n}\sum\limits_{n} \left( (error) \cdot \frac {{\partial} \; yPred} { {\partial}W^{(L)} } \right) \right] \;\;\;\;\;\;-------------(1)$$ 

## Change $W^{(L)}$ so that $J$ always decreases

If we ensure that ${{\Delta}W^{(L)}} = -\left[ \frac{1}{n}\sum\limits_{n} \left( (error) \cdot \frac {{\partial} \; yPred} { {\partial}W^{(L)} } \right) \right]$, then the bracketed term in (1) equals $-{{\Delta}W^{(L)}}$, so ${\Delta}J = {{\Delta}W^{(L)}}*\left(-\left[{{\Delta}W^{(L)}}\right]\right) = -\left[{{\Delta}W^{(L)}}\right]^{2} \Rightarrow$ never positive!

Thus, we change $W^{(L)}$ by the amount that ensures $J$ does not increase, as long as the change is small enough for approximation (1) to hold.

$${{\Delta}W^{(L)}} = -\left[ \frac{1}{n}\sum\limits_{n} \left( (error) \cdot \frac {{\partial} \; yPred} { {\partial}W^{(L)} } \right) \right] \;\;\;\;\;\;-------------(2)$$ 
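
The update in equation (2) can be tried out before $\frac{{\partial}\;yPred}{{\partial}W^{(L)}}$ has been derived analytically. The sketch below swaps in a central finite-difference estimate of $\frac{{\partial}J}{{\partial}W^{(L)}}$ (a substitute technique for illustration, not the tutorial's method), applies ${\Delta}W^{(L)}$ once, and compares the cost before and after. The helpers `forward`, `cost`, and `numericalGradLastLayer`, the step `eps`, and the random seed are assumptions introduced here.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def forward(X, weights):
    # Same computation as forwardProp, returning only the last output
    out = np.asarray(X, dtype=float)
    for w in weights:
        out = sigmoid(np.insert(out, 0, 1, axis=1) @ w.T)
    return out

def cost(X, Y, weights):
    # J = (1/2) * sum of error^2 / n
    error = forward(X, weights) - Y
    return 0.5 * np.sum(error ** 2) / len(X)

def numericalGradLastLayer(X, Y, weights, eps=1e-6):
    # Central-difference estimate of dJ/dW for the last weight matrix
    grad = np.zeros_like(weights[-1])
    for idx in np.ndindex(*grad.shape):
        original = weights[-1][idx]
        weights[-1][idx] = original + eps
        costPlus = cost(X, Y, weights)
        weights[-1][idx] = original - eps
        costMinus = cost(X, Y, weights)
        weights[-1][idx] = original  # restore the weight
        grad[idx] = (costPlus - costMinus) / (2 * eps)
    return grad

# AND-gate data and a randomly initialized 2-2-1 network
np.random.seed(0)
layers = [2, 2, 1]
weights = [np.random.randn(o, i + 1) for i, o in zip(layers[:-1], layers[1:])]
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [0], [0], [1]])

costBefore = cost(X, Y, weights)
deltaW = -numericalGradLastLayer(X, Y, weights)  # equation (2)
weights[-1] = weights[-1] + deltaW
costAfter = cost(X, Y, weights)

print(costBefore, costAfter, costAfter < costBefore)
```

Finite differences need two forward passes per weight, so this is only practical as a check on small networks.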

## Computing ${\Delta}W^{(L)}$

To compute ${\Delta}W^{(L)}$, we need to compute $error$ and $\frac{{\partial}\;yPred}{{\partial}W^{(L)}}$.

### Computing error

$ error = yPred - \hat{Y} = $ ``forwardProp(X, weights)[-1]`` $ - \hat{Y} \;\;\;\;\;\;-------------(3)$

For example, suppose we want to find the $W$'s that let a 3-neuron network (2 hidden neurons and 1 output neuron) perform AND logic on two inputs.

Here, for $X = \left[\begin{array}{c}(0,0)\\(0,1)\\(1,0)\\(1,1)\end{array}\right]$, $\hat{Y} = \left[\begin{array}{c}0\\0\\0\\1\end{array}\right]$

In [7]:
# Initialize network
layers = [2, 2, 1]
weights = initializeWeights(layers)

print("weights:")
for i, w in enumerate(weights, start=1):
    print(i)
    print(w.shape)
    print(w)

weights:
1
(2, 3)
[[-0.17485632 -2.30228101 -0.48034053]
 [ 0.84314964  1.07391944  0.58713279]]
2
(1, 3)
[[ 0.86482232  0.51864617  1.07953998]]


Our weights have been randomly initialized. Let us see what $yPred$ they give, and how far it is from the desired output:

In [12]:
# Declare input and desired output for AND gate
X = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = np.array([[0], [0], [0], [1]])

# Calculate yPred as the last output from forward propagation
yPred = forwardProp(X, weights)[-1]

# Error = yPred - Y
error = yPred - Y
print(error)

[[ 0.86486132]
 [ 0.87138219]
 [ 0.86367565]
 [-0.13142692]]
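
To read these numbers as logic outputs, the sketch below recovers $yPred$ from the printed error and thresholds it. The 0.5 decision threshold and the name `predicted` are assumptions made here for illustration; the error values are copied from the output above.

```python
import numpy as np

# Error values as printed above, and the AND targets
error = np.array([[0.86486132], [0.87138219], [0.86367565], [-0.13142692]])
Y = np.array([[0], [0], [0], [1]])

# Since error = yPred - Y, the prediction is error + Y
yPred = error + Y

# Assumed decision rule: an output >= 0.5 counts as logical 1
predicted = (yPred >= 0.5).astype(int)

print(predicted.ravel())            # [1 1 1 1]
print(int(np.sum(predicted == Y)))  # 1
```

With these untrained weights the network outputs roughly 0.87 for every input, so only the (1,1) case happens to match the AND target.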
