# Midterm Exam : CSC 84020 Neural Networks and Deep Learning

### Problem 1
(1). Briefly describe the key advantages of
 * Batch Normalization
 * Self-Normalizing Activations
 * Max Norm Weight Constraints
 
 (a) Advantages of Batch Normalization:
  * Minimizes the effects of 'internal covariate shift'. 'Covariate shift' is the change in the distribution of the inputs to a learning algorithm. Deep neural nets have a layered structure in which each layer is affected by the previous one, so even a small change in the input can be amplified as it passes through the layers. Because batch normalization normalizes the inputs of each layer to zero mean and unit variance over every mini-batch (followed by a learned scale and shift), it limits this effect.
  * The gradients depend less on the scale of the input features, so training converges faster. Features with a larger scale than the others can otherwise have a disproportionate effect on the gradient. With batch normalization all the features are normalized to the same scale.
  * Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques. Because the normalization statistics are computed over small batches, each example is normalized together with a different set of examples at every step, which adds noise to the activations and dampens the effect of outliers.
  * As the input to each layer is normalized, we can use higher learning rates with less risk of ending up in the low-gradient regions of saturating nonlinearities.
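
As a rough illustration of the normalization step described above, the following NumPy sketch normalizes a mini-batch feature by feature. The function name batch_norm_forward and the parameters gamma, beta and eps are assumptions made for this example, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features) feature by feature."""
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta                # learned scale and shift

rng = np.random.default_rng(0)
# two features on very different scales
x = rng.normal(loc=[0.0, 50.0], scale=[1.0, 100.0], size=(64, 2))
out = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0), out.std(axis=0))
```

With gamma set to ones and beta to zeros, both features should come out with a mean near 0 and a standard deviation near 1, regardless of their original scale.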
  
  (b) Advantages of Self-Normalizing Activations:
  * Self-normalizing neural networks (SNNs) have an inherent normalizing effect, so there is no need for an external normalization technique such as batch normalization.
  * The mapping of the variance from one layer to the next has an upper and a lower bound, which prevents it from exploding or vanishing.
  * The authors reported experimentally that feed-forward networks (FNNs) with batch normalization take longer to train than SNNs.
  * With SNNs, the mean and variance of the activations stay close to 0 and 1 in much deeper networks than with other techniques.
  * Like batch normalization, SNNs also have a regularizing effect, which reduces the need for external regularization techniques.
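
The self-normalizing behaviour can be observed with a small experiment. The sketch below, which assumes the SELU activation with its standard constants and zero-mean Gaussian weights of variance 1/width, pushes random data through a stack of dense layers and prints the mean and variance of the final activations. The width, depth and batch size are arbitrary choices for illustration.

```python
import numpy as np

# Constants of the SELU activation
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

rng = np.random.default_rng(0)
width, depth = 512, 16
h = rng.normal(size=(256, width))          # standardized random inputs
for _ in range(depth):
    # zero-mean weights with variance 1/width
    W = rng.normal(scale=np.sqrt(1.0 / width), size=(width, width))
    h = selu(h @ W)

print(h.mean(), h.var())
```

The printed mean and variance should remain near 0 and 1 even after 16 layers, without any explicit normalization layer.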
  
  (c) Advantages of Max Norm Weight Constraints:
   * As it is a regularization technique, it helps prevent the training algorithm from overfitting the data, because the weights are bounded by a max norm constraint.
   * Higher learning rates can be used. As the weight vector of each neuron is bounded by a max norm constraint, we can safely use higher learning rates without worrying about the weights "exploding".
   * The learning converges faster. Since it is possible to use higher learning rates and the weights are also regularized, the learning typically converges faster. It has been shown to improve the performance of stochastic gradient descent.
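
A max norm constraint is usually enforced by rescaling a weight vector after each update whenever its norm exceeds a bound c. The sketch below assumes that each column of W holds the incoming weights of one neuron; the function name apply_max_norm and the value of c are illustrative only.

```python
import numpy as np

def apply_max_norm(W, c):
    """Rescale each column of W so that its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # columns within the bound get a factor of 1 and stay unchanged
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])            # column norms: 5.0 and 0.5
W_clipped = apply_max_norm(W, c=2.0)
print(np.linalg.norm(W_clipped, axis=0))
```

Only the first column is scaled down here; the second one already satisfies the constraint and is left as it is.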
  
 

 ### Problem 2
 (2). In this problem you are going to execute the forward and backward passes of a neural network on paper or by using a program (your choice) for 3 iterations. The model is shown in Fig. 1. This network comprises a hidden layer with 3 ReLU units and a squared-error loss. (Note: Please use tanh as the activation function in the output unit.)

In [1]:
import math
import numpy as np

### Definition of ReLU (rectified linear unit)
The scalar definition of ReLU is given first, followed by its vectorized form. The ReLU function is defined as:

$$f(x)=\max(0,x)$$

In [2]:
# definition of ReLu
def ReLu(x):
    if x > 0:
        r = x
    else:
        r = 0
    return r  

# Vectorizing the function
vecReLu = np.vectorize(ReLu)

### STEP 1
### Initialization of the weights and the input and output arrays

In [3]:
W1 = np.array([[0.6,0.7,0],[0.01,0.43,0.88]])
W2 = np.array([[0.02],[0.03],[0.09]])
X = np.array([[0.75,0.8],[0.20,0.05],[-0.75,0.8],[0.20,-0.05]])
Y = np.array([[1,1,-1,-1]])

### STEP 2
### Predict the labels for the input data and compute the loss.
The following function propagates forward through the network to compute the predictions for a given set of weights and inputs. Given $m$ inputs, it returns the $m$ predictions, the hidden layer outputs and the total loss over all inputs, which is given as:
$$\mathcal{L} = \frac{1}{2} \sum_{i=1}^m (y_{out}^{(i)} - y^{(i)})^2$$

In [4]:
def forward_prop(W1,W2,X,Y):
    # compute the weighted input to each neuron in first layer
    S = np.matmul(X,W1)
    # pass the output through ReLu
    Z = vecReLu(S)
    # compute the input to output layer
    S_out = np.matmul(Z,W2)
    # pass the input through tanh activation
    Y_out = np.ndarray.flatten(np.tanh(S_out))
    # compute the loss
    Loss = np.sum((Y-Y_out)**2)/2
    # return the output labels Y_out, the first layer outputs Z and the loss 
    return Y_out, Z, Loss

_,_, Loss = forward_prop(W1,W2,X,Y)
print("Loss with the initialized weights : {:.4f}".format(Loss))

Loss with the initialized weights : 1.9666


### Derivatives of ReLu and tanh activations
The derivative of the ReLU activation function is $1$ for $x>0$, $0$ for $x<0$ and undefined at $x=0$. In practice the derivative at $x=0$ is simply set to $0$ or $1$, and the choice has a negligible effect on performance. For this implementation we take the derivative to be $1$ for $x\geq 0$ and $0$ otherwise. The derivative of $\tanh$ is $\frac{\partial}{\partial x}\tanh(x) = 1 - \tanh(x)^2$.
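
Both derivatives can be checked numerically with central differences. The following sketch uses the convention stated above for ReLU; the helper names relu_grad and tanh_grad and the test points are assumptions for this example, and the test points deliberately avoid $x=0$, where the ReLU derivative is not defined.

```python
import numpy as np

def relu_grad(s):
    # convention from the text: 1 for s >= 0, 0 otherwise
    return (s >= 0).astype(float)

def tanh_grad(s):
    return 1.0 - np.tanh(s) ** 2

s = np.array([-1.5, -0.2, 0.3, 2.0])
h = 1e-6
# central differences as a numerical reference
numeric_tanh = (np.tanh(s + h) - np.tanh(s - h)) / (2 * h)
numeric_relu = (np.maximum(0, s + h) - np.maximum(0, s - h)) / (2 * h)

print(np.max(np.abs(numeric_tanh - tanh_grad(s))))
print(np.max(np.abs(numeric_relu - relu_grad(s))))
```

Both printed differences should be close to zero, limited only by floating-point error.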

### Backward propagation
The following function "backward_prop" computes the error signal $\frac{\partial \mathcal{L}}{\partial s_{out}}$ at the output layer and propagates it backwards to compute the error signals $\frac{\partial \mathcal{L}}{\partial s_j}$ for each of the $j$ hidden units. It then uses these error signals to compute the weight updates $\Delta W$ (denoted by del_W1 in the code) and $\Delta w$ (del_W2) for the hidden and output layers respectively. It takes as parameters the hidden layer weights W (denoted by W1 in the code), the output layer weights w (denoted by W2), a matrix of inputs X, the labels Y, the learning_rate and the number of epochs (number of passes over the data), and it updates the weights after every input.

Next, we run this function for 3 epochs, passing in the weights W1, W2, the input X and the labels Y as initialized before and using a learning rate of 0.1. It displays the updated weights and the loss at the end of each epoch. The loss decreases after each epoch.
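
Written out for a single input $x$ with label $y$, the error signals and gradients described above follow from the chain rule. The index notation ($k$ for the inputs, $j$ for the hidden units) and the symbol $\eta$ for the learning rate are introduced here only for illustration.

$$
\begin{aligned}
s_j &= \sum_k x_k W_{kj}, \qquad z_j = \max(0, s_j), \qquad s_{out} = \sum_j z_j w_j, \qquad y_{out} = \tanh(s_{out}) \\
\mathcal{L} &= \tfrac{1}{2}\,(y_{out} - y)^2 \\
\frac{\partial \mathcal{L}}{\partial s_{out}} &= (y_{out} - y)\,(1 - y_{out}^2) \\
\frac{\partial \mathcal{L}}{\partial s_j} &= \frac{\partial \mathcal{L}}{\partial s_{out}}\; w_j\; \mathbb{1}[s_j \geq 0] \\
\frac{\partial \mathcal{L}}{\partial w_j} &= \frac{\partial \mathcal{L}}{\partial s_{out}}\; z_j, \qquad
\frac{\partial \mathcal{L}}{\partial W_{kj}} = \frac{\partial \mathcal{L}}{\partial s_j}\; x_k \\
w_j &\leftarrow w_j - \eta\,\frac{\partial \mathcal{L}}{\partial w_j}, \qquad
W_{kj} \leftarrow W_{kj} - \eta\,\frac{\partial \mathcal{L}}{\partial W_{kj}}
\end{aligned}
$$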

In [5]:
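def backward_prop(W1,W2,X,Y,learning_rate,epochs):
    # forward pass with the initial weights; Y_out and Z are computed once
    # here and reused unchanged in every iteration below
    Y_out, Z, _ = forward_prop(W1,W2,X,Y)
    
    for e in range(0, epochs):
        # one weight update per input sample
        for i in range(0,X.shape[0]):
            # error signal at the output unit: tanh'(s_out) * (y_out - y)
            delta_out = (1-(Y_out.T[i])**2)*(Y_out.T[i] - Y.T[i])

            # error signals at the hidden units
            delta_j = np.zeros((len(Z[i]),1))

            for j in range(0,len(Z[i])):
                # Z holds ReLU outputs, which are never negative, so this
                # test passes for every hidden unit
                if Z[i][j] >= 0:
                    delta_j[j] = W2[j] * delta_out
                else:
                    delta_j[j] = 0

            # gradient for the output layer weights, shape (3,1)
            del_W2 = (delta_out * Z[i]).reshape(3,1)

            # gradient for the hidden layer weights, shape (2,3); note that
            # X[i]*3 multiplies the input by 3 before the outer product
            del_W1 = (np.array([X[i]*3]).T * delta_j.T)

            # gradient descent step on both layers
            W1 = (W1 - learning_rate * del_W1)
            W2 = (W2 - learning_rate * del_W2)
            
        # report the weights and the loss at the end of the epoch
        print("\n#### Updated weights after epoch : {:d} ####\n".format(e+1))
        print("Hidden layer weight : \n")
        print(W1)
        print("\nOutput layer weight : \n")
        print(W2)
        _,_, loss = forward_prop(W1,W2,X,Y)
        print("\nLoss = {:.4f} \n".format(loss))
        print("----------------------------------------")
    
    return W1,W2
    
    
# train for 3 epochs with a learning rate of 0.1
W1n, W2n = backward_prop(W1,W2,X,Y,0.1,3)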


#### Updated weights after epoch : 1 ####

Hidden layer weight : 

[[ 0.62058652  0.73439707  0.05962456]
 [-0.00221834  0.40848283  0.86284415]]

Output layer weight : 

[[ 0.06079306]
 [ 0.11165028]
 [ 0.08266383]]

Loss = 1.8786 

----------------------------------------

#### Updated weights after epoch : 2 ####

Hidden layer weight : 

[[ 0.65904748  0.80457115  0.1160346 ]
 [-0.01484139  0.38615561  0.84576108]]

Output layer weight : 

[[ 0.10158612]
 [ 0.19330056]
 [ 0.07532765]]

Loss = 1.7978 

----------------------------------------

#### Updated weights after epoch : 3 ####

Hidden layer weight : 

[[ 0.71538288  0.91052223  0.16923013]
 [-0.02786915  0.36301833  0.82875079]]

Output layer weight : 

[[ 0.14237918]
 [ 0.27495084]
 [ 0.06799148]]

Loss = 1.7230 

----------------------------------------
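
The listing above runs forward_prop once, before the loops, so every update reuses the Y_out and Z obtained with the initial weights, and X[i]*3 scales the hidden layer gradient by 3. A variant that refreshes the forward pass for every sample and uses the unscaled gradient might look like the sketch below. The helper names forward, total_loss and sgd_epoch are assumptions for this example, and its losses after training will not match the values printed above.

```python
import numpy as np

def forward(W1, W2, x):
    s = x @ W1                  # hidden pre-activations, shape (3,)
    z = np.maximum(0.0, s)      # ReLU outputs
    y_out = np.tanh(z @ W2)[0]  # scalar prediction
    return s, z, y_out

def total_loss(W1, W2, X, Y):
    Z = np.maximum(0.0, X @ W1)
    return 0.5 * np.sum((np.tanh(Z @ W2).ravel() - Y.ravel()) ** 2)

def sgd_epoch(W1, W2, X, Y, lr):
    W1, W2 = W1.copy(), W2.copy()
    for x, y in zip(X, Y.ravel()):
        s, z, y_out = forward(W1, W2, x)              # refreshed for every sample
        delta_out = (y_out - y) * (1.0 - y_out ** 2)  # error signal at the output
        delta_j = delta_out * W2.ravel() * (s >= 0)   # error signals at hidden units
        grad_W2 = delta_out * z.reshape(-1, 1)        # shape (3, 1)
        grad_W1 = np.outer(x, delta_j)                # shape (2, 3)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2

W1 = np.array([[0.6, 0.7, 0.0], [0.01, 0.43, 0.88]])
W2 = np.array([[0.02], [0.03], [0.09]])
X = np.array([[0.75, 0.8], [0.20, 0.05], [-0.75, 0.8], [0.20, -0.05]])
Y = np.array([[1, 1, -1, -1]])

losses = [total_loss(W1, W2, X, Y)]
for epoch in range(3):
    W1, W2 = sgd_epoch(W1, W2, X, Y, lr=0.1)
    losses.append(total_loss(W1, W2, X, Y))
print(["{:.4f}".format(l) for l in losses])
```

The first entry of losses is the loss with the initial weights, so it can be compared directly with the value reported in STEP 2.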
