# Neural Networks: Problem 3
## Autograd. Name says it all.

## Disclaimer
Some parts of this exercise are based on the Stanford Machine Learning Course [CS229](http://cs229.stanford.edu) by Andrew Ng. The environment of the exercise has been tuned to the theory content taught at Ravensburg Weingarten University.

We are using the Python programming language. If you don't know Python or if you would like to refresh your memory, take a look at the [Python tutorial](http://docs.python.org/tut/).
We will mostly work with NumPy, the fundamental package for scientific computing in Python. Please read the [NumPy quickstart tutorial](https://numpy.org/devdocs/user/quickstart.html). In addition, the documentation of Matplotlib and SciPy can be found here: [Matplotlib](https://matplotlib.org/), [SciPy](https://docs.scipy.org/doc/scipy/reference/tutorial/).

For this part, we will work with the [Autograd](https://github.com/HIPS/autograd) library in Python. Imagine you want to train a machine learning model. You would have to go through the whole process of writing the loss function first and then computing its derivative. For neural networks, this task becomes more demanding, as we have to compute gradients w.r.t. all weight vectors. For a convolutional neural network, deriving and programming the gradients by hand would be an enormous task. How could we simplify it?

On the one hand, you can use frameworks such as TensorFlow; on the other hand, you can use the Autograd library. With Autograd, we just have to write down the loss function using a standard numerical library like NumPy, and Autograd will give us its gradient.

### 3.A The simple same-as-in-lecture problem

During the lectures, we learned to compute the backpropagation of a very simple 2-layer network (shown below). We will use the same network to demonstrate how we can compute the gradient of the output function w.r.t. each input variable.

<img src='./graphic/2layer.png' width='450' height='450'>

There are 2 input variables in layer 1, $x$ and $y$. In the second layer, their sum is computed and then multiplied with the third input variable $z$ to get the output of the network, $f$. The computation of the output is modelled as a function below.

In [None]:
import autograd.numpy as np   # Thinly-wrapped version of Numpy
from autograd import grad # Gradient Func

In [None]:
# Define the output function

def f(x,y,z):
    return (x+y)*z

We can now compute the gradients of the defined function by calling the **grad()** function from the library. Note that the function takes 3 arguments, so we obtain partial derivatives w.r.t. each variable. The syntax is **grad(input_func, var_idx)**, where **var_idx** is the index of the corresponding variable. The index of the 1st variable (which is x here) is **0**.

In [None]:
# Partial derivatives

d_x = '''Code here''' # wrt x
d_y = '''Code here''' # wrt y
d_z = '''Code here''' # wrt z

In [None]:
# Derivative wrt x (note: this prints the function object returned by grad)
print(d_x)

The output of the grad() function is stored in a variable; it is itself a function that can be called to compute numerical values. We will now define numerical values for all three variables and check whether the results match those obtained manually.

In [None]:
# Define values of input variables

X = np.array([2])
Y = np.array([-3])

In [None]:
Z = np.array([4])

In [None]:
# Print and check the derivative values

print('The partial derivative of f wrt x = {}'.format('''Code here'''))
print('The partial derivative of f wrt y = {}'.format('''Code here'''))
print('The partial derivative of f wrt z = {}'.format('''Code here'''))
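
For reference, the manual derivation for $f(x, y, z) = (x + y) z$ can be written out as follows; the numbers assume the values $x = 2$, $y = -3$, $z = 4$ defined above.

$$ \frac{\partial f}{\partial x} = z = 4, \qquad \frac{\partial f}{\partial y} = z = 4, \qquad \frac{\partial f}{\partial z} = x + y = -1 $$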

### 3.B Autograd with the MNIST dataset

In this part, we will define the loss function for a multi-layered neural network as seen in previous problems. The loss function will be defined by computing a forward pass and comparing its result with the true output for that input.

The cost function for the neural network (without regularization) is:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\sum_{k=1}^{K} \left[ - y_k^{(i)} \log \left( \left( h_\theta \left( x^{(i)} \right) \right)_k \right) - \left( 1 - y_k^{(i)} \right) \log \left( 1 - \left( h_\theta \left( x^{(i)} \right) \right)_k \right) \right]$$

where $h_\theta \left( x^{(i)} \right)$ is computed as shown in the neural network figure above, and $K = 10$ is the total number of possible labels. Note that $h_\theta(x^{(i)})_k = a_k^{(3)}$ is the activation (output value) of the $k^{th}$ output unit. Also, recall that whereas the original labels (in the variable y) were 0, 1, ..., 9, for the purpose of training a neural network, we need to encode the labels as vectors containing only values 0 or 1, so that

$$ y = 
\begin{bmatrix} 1 \\ 0 \\ 0 \\\vdots \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \cdots  \quad \text{or} \qquad
\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}.
$$

For example, if $x^{(i)}$ is an image of the digit 5, then the corresponding $y^{(i)}$ (that you should use with the cost function) should be a 10-dimensional vector with $y_5 = 1$, and the other elements equal to 0.
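
One way to build such label vectors is to index the rows of an identity matrix. The sketch below assumes the labels are already integers in 0, ..., 9; the array `labels_demo` is made up for illustration.

```python
import numpy as np

num_classes = 10
labels_demo = np.array([5, 0, 9])  # hypothetical labels for three images

# Row k of the identity matrix is the one-hot vector for class k
Y_demo = np.eye(num_classes)[labels_demo]

print(Y_demo.shape)  # (3, 10)
print(Y_demo[0])     # 1 at index 5, zeros elsewhere
```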

In [None]:
# will be used to load MATLAB mat datafile format
from scipy.io import loadmat

#  training data stored in arrays X, y
data = loadmat('./data/ex4data1.mat')

X, y = data['X'], data['y'].ravel()
m = y.size

print('Total size of dataset is {} images'.format(m))

You are provided with a set of network parameters ($\Theta^{(1)}$, $\Theta^{(2)}$) already trained by Stanford University. They are stored in `ex4weights.mat`. Let us load those parameters into the variables `Theta1` and `Theta2`. The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes). Note that each of the two weight matrices has one extra column.

**Why?** Remember the bias terms!
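
The effect of the extra column can be seen by tracking array shapes through a forward pass. This is only a sketch with random stand-in data: `X_demo`, `Theta1_demo` and `Theta2_demo` are hypothetical arrays with the same shapes as described above, not the real dataset or trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m_demo = 5                                 # 5 made-up samples
X_demo = rng.random((m_demo, 400))         # 400 pixels per sample
Theta1_demo = rng.normal(size=(25, 401))   # 400 inputs + 1 bias
Theta2_demo = rng.normal(size=(10, 26))    # 25 hidden units + 1 bias

# Prepend a column of ones so the first weight column acts as the bias
a1 = np.concatenate([np.ones((m_demo, 1)), X_demo], axis=1)
a2 = sigmoid(a1 @ Theta1_demo.T)
a2 = np.concatenate([np.ones((m_demo, 1)), a2], axis=1)
a3 = sigmoid(a2 @ Theta2_demo.T)

print(a1.shape, a2.shape, a3.shape)  # (5, 401) (5, 26) (5, 10)
```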

In [None]:
# Setup the parameters you will use for this exercise
input_layer_size  = 400  # 20x20 Input Images of Digits
hidden_layer_size = 25   # 25 hidden units
num_labels = 10          # 10 labels, from 0 to 9

# Load the .mat file, which returns a dictionary 
weights = loadmat('./data/ex4weights.mat')

# get the model weights from the dictionary
# Theta1 has size 25 x 401
# Theta2 has size 10 x 26
Theta1, Theta2 = weights['Theta1'], weights['Theta2']

print('Size of Weight-Vector for Hidden Layer is: {} '.format(Theta1.shape))
print('Size of Weight-Vector for Output Layer is: {} '.format(Theta2.shape))

In [None]:
# Map the label 10 to the digit 0.
# This is an artifact of the dataset having been used in MATLAB,
# where there is no index 0.
y[y == 10] = 0

# Move the last row of Theta2 to the front (np.roll along axis 0), so that
# row k corresponds to digit k. This is needed because the weight file
# ex4weights.mat was saved based on MATLAB indexing.
Theta2 = np.roll(Theta2, 1, axis=0)

# Unroll parameters to include all information in 1 array. The two matrices can be separated later based on their sizes.
nn_params = np.concatenate([Theta1.ravel(), Theta2.ravel()])
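
Separating the two matrices again only needs their shapes. A minimal sketch with stand-in matrices (`T1_demo` and `T2_demo` are hypothetical and merely have the same shapes as `Theta1` and `Theta2`):

```python
import numpy as np

hidden, n_in, n_out = 25, 400, 10
T1_demo = np.arange(hidden * (n_in + 1), dtype=float).reshape(hidden, n_in + 1)
T2_demo = np.arange(n_out * (hidden + 1), dtype=float).reshape(n_out, hidden + 1)

# Unroll both matrices into one flat vector
params_demo = np.concatenate([T1_demo.ravel(), T2_demo.ravel()])

# Split at the size of the first matrix and restore the original shapes
split = hidden * (n_in + 1)
T1_back = params_demo[:split].reshape(hidden, n_in + 1)
T2_back = params_demo[split:].reshape(n_out, hidden + 1)

print(params_demo.shape)  # (10285,)
```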

In [None]:
def sigmoid(z):
    
    # Computes the sigmoid of z.
    
    return 1.0 / (1.0 + np.exp(-z))

### Loss Function for 1 sample 

<img src='./graphic/sample.png' width='350' height='350'>

In [None]:
def loss(Theta1,Theta2, X, y):
    
    '''Code here'''

    return J

### Gradient wrt Theta1 and Theta2

In [None]:
Theta1_grad = grad('''Code here''') # wrt Theta1
Theta2_grad = grad('''Code here''') # wrt Theta2

We will compute the loss function and gradient for each sample one by one and then average both values. **Why do we do this?**

**Because the `grad` function from the Autograd library only works with scalar-valued functions: the function being differentiated must return a single number. With the one-sample loss defined above, we therefore feed in one input at a time; otherwise the call would simply fail. Autograd offers other functions for array-valued outputs: try `jacobian`, `elementwise_grad` or `holomorphic_grad`.**

In [None]:
# Initiate as zero
J = 0
grad_auto = np.concatenate([np.zeros(Theta1.shape).ravel(), np.zeros(Theta2.shape).ravel()])

# Run loop over range of inputs
for i in range(m):
  '''Code here'''

# Average Out
J = J/m
grad_auto = grad_auto/m

print('Shape of concatenated gradient by Autograd is {}'.format(grad_auto.shape))
print('First few elements look like : {}'.format(grad_auto[:5]))

### Compute with Backpropagation

**Theory:**
Recall the intuition behind the backpropagation algorithm. Given a training example $(x^{(t)}, y^{(t)})$, we will first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_\theta(x)$. Then, for each node $j$ in layer $l$, we compute an “error term” $\delta_j^{(l)}$ that measures how much that node was “responsible” for any errors in our output.

For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define $\delta_j^{(3)}$ (since layer 3 is the output layer). For the hidden units, you will compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l+1)$. In detail, here is the backpropagation algorithm. Step 5 will divide the accumulated gradients by $m$ to obtain the gradients for the neural network cost function.

1. Perform a feedforward pass, computing the activations $(z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)})$ for layers 2 and 3. You have already done this part above while computing the cost function.

1. For each output unit $k$ in layer 3 (the output layer), set 
$$\delta_k^{(3)} = \left(a_k^{(3)} - y_k \right)$$
where $y_k \in \{0, 1\}$ indicates whether the current training example belongs to class $k$ $(y_k = 1)$, or if it belongs to a different class $(y_k = 0)$.

1. For the hidden layer $l = 2$, set 
$$ \delta^{(2)} = \left( \Theta^{(2)} \right)^T \delta^{(3)} * g'\left(z^{(2)} \right)$$
Note that the symbol $*$ performs element-wise multiplication in `numpy`. Also, you should drop the bias column from the weight matrix $\Theta^{(2)}$ here.

1. Accumulate the gradient from this example using the following formula. 
$$ \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \left(a^{(l)}\right)^T $$

1. Obtain the gradient for the neural network cost function by multiplying the accumulated gradients by $\frac{1}{m}$:
$$ \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)}$$
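
The steps above can be sketched for a single training example on a tiny network and checked against a finite-difference estimate. Everything here is illustrative: the network size (4 inputs, 3 hidden units, 2 outputs), the random weights, and the helper names `cost_one` and `backprop_one` are assumptions, and with $m = 1$ the final averaging step is a division by 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_one(T1, T2, x, y):
    # Cross-entropy cost of one example (forward pass only)
    a1 = np.concatenate([[1.0], x])
    a2 = np.concatenate([[1.0], sigmoid(T1 @ a1)])
    a3 = sigmoid(T2 @ a2)
    return -np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3))

def backprop_one(T1, T2, x, y):
    # Step 1: feedforward pass
    a1 = np.concatenate([[1.0], x])
    z2 = T1 @ a1
    a2 = np.concatenate([[1.0], sigmoid(z2)])
    a3 = sigmoid(T2 @ a2)
    # Step 2: error term of the output layer
    d3 = a3 - y
    # Step 3: error term of the hidden layer (bias column of T2 dropped)
    d2 = (T2[:, 1:].T @ d3) * sigmoid(z2) * (1 - sigmoid(z2))
    # Step 4: gradient contribution of this example
    return np.outer(d2, a1), np.outer(d3, a2)

rng = np.random.default_rng(0)
T1 = rng.normal(scale=0.5, size=(3, 5))  # 4 inputs + bias -> 3 hidden
T2 = rng.normal(scale=0.5, size=(2, 4))  # 3 hidden + bias -> 2 outputs
x = rng.normal(size=4)
y = np.array([1.0, 0.0])

D1, D2 = backprop_one(T1, T2, x, y)

# Central finite differences wrt every entry of T1 as a sanity check
eps = 1e-6
num_D1 = np.zeros_like(T1)
for i in range(T1.shape[0]):
    for j in range(T1.shape[1]):
        E = np.zeros_like(T1)
        E[i, j] = eps
        num_D1[i, j] = (cost_one(T1 + E, T2, x, y)
                        - cost_one(T1 - E, T2, x, y)) / (2 * eps)

max_err = np.max(np.abs(num_D1 - D1))
print('Max abs difference to finite differences:', max_err)
```

The exercise itself expects a version that handles all $m$ samples, so this single-example sketch still has to be adapted.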



In [None]:
def sigmoid_grad(z):
    # Grad of Sigmoid func can be written as g'(z) = g(z) * (1-g(z))

    g = np.zeros(z.shape)
    g = sigmoid(z) * (1 - sigmoid(z))
    
    return g

def nnCostFunction(Theta1, Theta2,num_labels, X, y, lambda_=0):
    
    # Size of dataset
    m = y.size
         
     
    # Neural Network activations...    
    a1 = np.concatenate([np.ones((m, 1)), X], axis=1) # adding a column of 1s (bias term)
    a2 = sigmoid(a1.dot(Theta1.T)) # first dot product and then func application  
    a2 = np.concatenate([np.ones((a2.shape[0], 1)), a2], axis=1) # adding bias term again in hidden layer
    a3 = sigmoid(a2.dot(Theta2.T))
    
    # Modifying Output matrix as per above explanation
    y_matrix = y.reshape(-1)
    y_matrix = np.eye(num_labels)[y_matrix]
 
    # Compute J
    J = (-1. / m) * np.sum((np.log(a3) * y_matrix) + np.log(1 - a3) * (1 - y_matrix))
    
    # Gradients are initialized
    Theta1_grad = np.zeros(Theta1.shape)
    Theta2_grad = np.zeros(Theta2.shape)
    
    
    
    
    '''Complete the Code here'''
    
    
    

    
    grad = np.concatenate([Theta1_grad.ravel(), Theta2_grad.ravel()])
    
    return J, grad

J_n, grad_n = nnCostFunction(Theta1, Theta2,num_labels, X, y, lambda_=0)

print('Shape of concatenated gradient by Backpropagation is {}'.format(grad_n.shape))
print('First few elements look like : {}'.format(grad_n[:5]))

### Compare two results

We compute the relative difference to check how similar the two gradient vectors are.

In [None]:
diff = np.linalg.norm(grad_n-grad_auto)/np.linalg.norm(grad_n+grad_auto)

print('If your implementation is correct, then the relative difference will be small (less than 1e-9).')
print('Relative Diff is {}'.format(diff))
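
To get a feeling for this measure, the sketch below applies the same formula to made-up vectors (`g_demo` and the helper `relative_difference` are illustrative only): nearly identical vectors give a value close to 0, while clearly different ones do not.

```python
import numpy as np

def relative_difference(a, b):
    # Same formula as above: ||a - b|| / ||a + b||
    return np.linalg.norm(a - b) / np.linalg.norm(a + b)

g_demo = np.array([0.1, -0.2, 0.3])

almost_same = relative_difference(g_demo, g_demo + 1e-12)
scaled = relative_difference(g_demo, 1.5 * g_demo)  # 0.5 / 2.5 = 0.2

print(almost_same)  # tiny, far below 1e-9
print(scaled)       # approximately 0.2
```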

# What can you contribute further to this notebook?

1. You may have realized that the loop to compute the loss and gradient is slow. Try out different functions of the Autograd library and find a faster one.