#### Question:1
There are many activations utilized in DL, e.g., sigmoid, ReLU, SeLU, etc. Can you compare their pros and cons? Please also address proper application contexts for each activation function.

#### Answer.1:
### 1.Sigmoid Activation Function: 
The sigmoid function is a logistic function, which means that whatever the input, the output lies between 0 and 1. That is, every value passed through it is squashed to a value between 0 and 1.
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

### Pros: 
It is nonlinear in nature. Combinations of this function are also nonlinear.
1. It gives an analog (continuous) activation, unlike a step function.
2. It has a smooth gradient.
3. It’s good for a classifier.
4. The output of the activation function is always in the range (0, 1), compared to (-inf, inf) for a linear function, so the activations are bounded.
### Cons:
1. Towards either end of the sigmoid function, the Y values respond very little to changes in X.
2. This gives rise to the “vanishing gradients” problem.
3. Its output isn’t zero-centered (0 < output < 1). As a result, the gradient updates for a layer's weights tend to share the same sign and zig-zag, which makes optimization harder.
4. Sigmoids saturate and kill gradients.
5. As a consequence, the network stops learning or learns drastically slowly.
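
The saturation described above can be checked numerically. The following is a minimal NumPy sketch (the helper names `sigmoid` and `sigmoid_grad` are illustrative, not from any library): the gradient peaks at 0.25 at x = 0 and is close to zero at both ends.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 (at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(xs))       # outputs stay strictly between 0 and 1
print(sigmoid_grad(xs))  # gradient is largest at 0 and near zero at the ends
```

Because backpropagation multiplies such factors layer by layer, a stack of saturated sigmoids shrinks the gradient quickly.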

### Application:
1. If your output is for binary classification, the sigmoid function is a very natural choice for the output layer.
It is usually used in the output layer of a binary classifier, where the result is either 0 or 1: since the value of the sigmoid lies between 0 and 1, the prediction is 1 if the value is greater than 0.5 and 0 otherwise.

2. In agriculture, the logistic S-curve is used to model crop response, for example the response of crop yield (wheat) to both soil salinity and depth to the water table.

4. In artificial neural networks, sometimes non-smooth functions are used instead for efficiency; these are known as hard sigmoids.

5. In audio signal processing, sigmoid functions are used as waveshaper transfer functions to emulate the sound of analog circuitry clipping.[4]

6. In biochemistry and pharmacology, the Hill equation and Hill-Langmuir equation are sigmoid functions.


### 2.ReLU Activation Function: 
It computes the function f(x) = max(0, x). In other words, the activation is simply thresholded at zero.
### Pros:
1. Lower time and space complexity: its activations are sparse and, compared to the sigmoid, it does not involve the exponential operation, which is more costly. So ReLU is computationally efficient.
2. Mitigates the vanishing gradient problem, since the gradient is a constant 1 for positive inputs.
3. It was found to greatly accelerate the convergence of stochastic gradient descent compared to the sigmoid/tanh functions.
4. ReLU can be implemented by simply thresholding a matrix of activations at zero.
5. ReLU has a derivative function and allows for backpropagation.

### Cons:
1. The dying ReLU problem: when inputs are zero or negative, the gradient of the function is zero, so the affected units receive no updates during backpropagation and cannot learn.
2. Unfortunately, ReLU units can be fragile during training and can "die". A large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on.
3. In practice, as much as 40% of a network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high.
4. ReLU does not avoid the exploding gradient problem.
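
To illustrate the "dead" unit described above, here is a small NumPy sketch. The weights, the bias of -50 and the random inputs are made-up values chosen only to mimic a unit whose parameters were pushed far off by a large update.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient convention: 0 for x <= 0, 1 for x > 0.
    return (x > 0).astype(float)

# A hypothetical unit whose bias was pushed far negative by a large update.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 4))
weights = np.array([0.5, -0.3, 0.2, 0.1])
bias = -50.0

pre_activation = inputs @ weights + bias
print(relu(pre_activation).max())       # 0.0: the unit never fires
print(relu_grad(pre_activation).sum())  # 0.0: no gradient reaches the weights
```

Since the gradient is zero for every sample, gradient descent has no signal with which to move the unit back into the active region.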


### Application:

1. ReLU is the usual default for the hidden layers of feed-forward and convolutional networks. (Applying a “filter” to partially keep a certain value, as in an LSTM’s forget gate, needs an output between 0 and 1 and is therefore a job for the sigmoid rather than ReLU.)
2. Biological plausibility: ReLU is one-sided, whereas tanh is antisymmetric.
3. Sparse activation: For example, in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output).
4. Better gradient propagation: Fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.
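
The "about 50%" sparsity figure can be reproduced with a quick experiment. This sketch assumes zero-mean inputs, zero-mean Gaussian weights and no bias; the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 100))                # zero-mean inputs (assumed)
W = rng.normal(size=(100, 256)) / np.sqrt(100)  # symmetric random init, no bias

hidden = np.maximum(0.0, x @ W)                 # ReLU layer
active_fraction = np.mean(hidden > 0)
print(f"fraction of active units: {active_fraction:.3f}")  # close to 0.5
```

With a symmetric initialization each pre-activation is equally likely to be positive or negative, which is where the one-half comes from.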


#### 3.ELU Activation Function: 
$$\mathrm{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha \left(e^{x} - 1\right) & \text{if } x \le 0 \end{cases}$$

### Pros:
1. Avoids the dead ReLU problem.
2. Produces negative outputs, which helps the network nudge weights and biases in the right directions.
3. Produces non-zero gradients for negative inputs instead of letting them be zero.


### Cons:
1. Introduces longer computation time, because of the exponential operation included
2. Does not avoid the exploding gradient problem
3. The neural network does not learn the alpha value
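
A minimal NumPy sketch of ELU and its derivative follows. Here `alpha` is a fixed hyperparameter (1.0 is an assumed default), and the function names are illustrative.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # Clamp before exp so the unused branch cannot overflow for large x.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def elu_grad(x, alpha=1.0):
    """Derivative: 1 for x > 0, alpha * exp(x) otherwise (never exactly zero)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

xs = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(xs))       # negative inputs saturate towards -alpha
print(elu_grad(xs))  # strictly positive everywhere
```

The strictly positive gradient on the negative side is what keeps units from dying, at the cost of one exponential per negative input.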

### Application:
In experiments with a ConvNet trained on the MNIST dataset, the results suggest that ELU helps most when training for many epochs, possibly with deeper networks.



#### 4.SELU Activation Function:
The equation for it looks like this:
$$\mathrm{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases}$$

#### Pros:
1. Internal normalization is faster than external normalization, which means the network converges faster.
2. According to the SELU paper, vanishing and exploding gradients are impossible, as shown by Theorems 2 and 3 in its appendix.
3. ReLU function is the most widely used function and performs better than other activation functions in most of the cases. 

### Cons

1. Relatively new activation function: it needs more papers on architectures such as CNNs and RNNs, where it is comparatively unexplored.
2. The ReLU function should be used only in the hidden layers and not in the output layer.

### Application:
1. For SNNs (Self-Normalizing Networks) to work, they need two things: a custom weight initialization method and the SELU activation function.
   Instead of using external normalization techniques (like batch norm), SNNs perform the normalization inside the activation function.
   To make it clear, instead of normalizing the output of the activation function, the suggested activation function (SELU, scaled exponential linear units) outputs normalized values. SNNs using the specified initialization and the SELU activation function have been tried on the MNIST and CIFAR-10 datasets.
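
The self-normalizing behaviour can be sketched in NumPy. The constants below are the commonly quoted values of α and λ for SELU; the width, depth, and LeCun-normal initialization (variance 1 / fan_in) are assumptions chosen for this illustration, not settings from the text above.

```python
import numpy as np

# Commonly quoted SELU constants (alpha and lambda).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))

rng = np.random.default_rng(0)
width, depth = 512, 20
h = rng.normal(size=(1000, width))  # standardized inputs (assumed)

for _ in range(depth):
    # LeCun-normal initialization: zero mean, variance 1 / fan_in.
    W = rng.normal(scale=np.sqrt(1.0 / width), size=(width, width))
    h = selu(h @ W)

print(f"mean={h.mean():.3f}, std={h.std():.3f}")  # stays near 0 and 1
```

No batch-norm layer is used here: the activations stay roughly standardized through all 20 layers purely because of the activation function and the initialization.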


#### 5.Tanh Activation Function:
tanh is similar to the logistic sigmoid but is often preferred because its output is zero-centered. The range of the tanh function is (-1, 1). tanh is also sigmoidal (S-shaped).

#### Pros:
1. The advantage is that the negative inputs will be mapped strongly negative and the zero inputs will be mapped near zero in the tanh graph.
2. The function is differentiable.
3. The function is monotonic while its derivative is not monotonic.
4. The tanh function is mainly used for classification between two classes.
5. Both tanh and logistic sigmoid activation functions are used in feed-forward nets.

### Cons:
1. A general problem with both the sigmoid and tanh functions is that they saturate. This means that large values snap to 1.0 and small values snap to -1 or 0 for tanh and sigmoid respectively. Further, the functions are only really sensitive to changes around the midpoint of their input (x = 0), where the output is 0.5 for sigmoid and 0.0 for tanh.

2. The limited sensitivity and saturation of the function happen regardless of whether the summed activation from the node provided as input contains useful information or not. Once saturated, it becomes challenging for the learning algorithm to continue to adapt the weights to improve the performance of the model.

#### Application: 
1. Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
2. The tanh function is mainly used for classification between two classes.
3. It is usually used in the hidden layers of a neural network: because its values lie between -1 and 1, the mean of the hidden layer's outputs comes out to be 0 or very close to it, which helps center the data. This makes learning for the next layer much easier.
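
The centering effect can be seen by feeding the same zero-mean values through both functions. This is a small sketch; the standard-normal pre-activations are an assumption made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)  # zero-mean pre-activations (assumed)

tanh_out = np.tanh(z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

print(f"tanh mean:    {tanh_out.mean():.3f}")     # close to 0
print(f"sigmoid mean: {sigmoid_out.mean():.3f}")  # close to 0.5

# tanh is a rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
assert np.allclose(np.tanh(z), 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0)
```

The identity in the last line shows that the two functions share the same shape and saturation behaviour; only the output range and centering differ.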





### Question:2)How to avoid overfitting at DNN model? Please discuss at least three ways and compare them.

#### Answer:2)
#### 1. L1 and L2 Regularization:
1. The main reasons for regularization are (a) to avoid overfitting by not generating high coefficients for predictors that are sparse, and (b) to stabilize the estimates, especially when there is collinearity in the data.
2. A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
3. Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function, whereas Ridge Regression adds the “squared magnitude” of the coefficients as a penalty term.
4. L1 tends to produce sparse weights in which many entries are exactly zero, e.g. (1, 0, 0, 1, 1), whereas L2 produces small but non-zero weights, e.g. (0.5, 0.3, -0.2, 0.1).
5. L1 is good for feature selection, whereas L2 is normally better for training models.
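
The difference in sparsity can be illustrated with the two penalties and their shrinkage (proximal) steps. This is a sketch rather than a full Lasso or Ridge solver; the weight vector and the threshold 0.1 are made-up values.

```python
import numpy as np

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)

def l1_shrink(w, t):
    """Soft-thresholding (proximal step for t * ||w||_1): small weights become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l2_shrink(w, t):
    """Proximal step for t * ||w||_2^2: every weight is scaled down, none hits 0."""
    return w / (1.0 + 2.0 * t)

w = np.array([0.8, -0.05, 0.02, -0.6, 0.01])
print(l1_shrink(w, 0.1))  # sparse: the three small weights are zeroed
print(l2_shrink(w, 0.1))  # dense: all weights shrink proportionally
```

L1 subtracts a constant from every magnitude, so small weights are cut to zero, while L2 rescales all weights and leaves them non-zero.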



#### 2. Drop out: 
1. The idea is simple: during training, every unit of the neural network (except those in the output layer) is given a probability p of being temporarily ignored in the calculations. The hyperparameter p is called the dropout rate, and its default value is very often set to 0.5. Sometimes one part of the network has very large weights and ends up dominating the training, while another part of the network doesn't play much of a role. Dropout randomly turns units off so that the rest of the network is forced to train.
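
A minimal NumPy sketch of dropout is shown below. It uses the "inverted dropout" variant, which rescales the surviving units during training so that nothing needs to change at test time; that rescaling is an implementation choice made for this sketch, not something stated above.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: drop each unit with probability p during training."""
    if not training or p == 0.0:
        return activations  # at test time every unit is kept
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= p
    # Rescale the survivors so the expected activation is unchanged.
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((1000, 100))
dropped = dropout(h, p=0.5, rng=rng)
print(np.mean(dropped == 0))  # roughly half of the units are zeroed
print(dropped.mean())         # the mean stays close to 1.0
```

A fresh mask is drawn on every call, so each training step effectively trains a different thinned network.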

#### 3. Early stopping:
1. It is very convenient to sample our model every few iterations and check how well it works with our validation set. Every model that performs better than all the previous models is saved. We also set a limit, i.e. the maximum number of iterations during which no progress will be recorded. When this value is exceeded, the learning is stopped. Although early stopping allows for a significant improvement in the performance of our model, in practice, its application greatly complicates the process of optimization of our model. 
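
The procedure above can be sketched in plain Python. The name `patience` (the maximum number of checks without improvement) and the list of validation losses are hypothetical; in real training the loss would come from evaluating the model on the validation set.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best model: this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # patience exhausted: stop training
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60]
print(early_stopping_epoch(losses, patience=3))  # (3, 6)
```

Training stops at epoch 6, and the model saved at epoch 3 is the one that is kept.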

#### 4. lambda factor (regularization rate)


#### Question:3)How to use Data Normalization to improve DNN model training and performance? Please discuss at least two methods and compare them

#### Answer:3)
Normalization is an approach which is applied during the preparation of data in order to change the values of numeric columns in a dataset to use a common scale when the features in the data have different ranges. The following methods are different approaches to normalize data:
1. Batch Normalization
2. Weight Normalization
3. Layer Normalization
4. Group Normalization
5. Instance Normalization


#### Batch Normalization: 
1. Batch normalization is a general technique that can be used to normalize the inputs to a layer.
2. It can be used with most network types, such as Multilayer Perceptrons, Convolutional Neural Networks and Recurrent Neural Networks. It may be more appropriate after the activation function for s-shaped functions like the hyperbolic tangent and logistic function.
3. It may be appropriate before the activation function for activations that may result in non-Gaussian distributions like the rectified linear activation function, the modern default for most network types.
4. Using batch normalization makes the network more stable during training. This may allow the use of much larger than normal learning rates, which in turn may further speed up the learning process.
5. In a batch-normalized model, we have been able to achieve a training speedup from higher learning rates, with no ill side effects
6. The faster training also means that the decay rate used for the learning rate may be increased.
7. Batch normalization offers some regularization effect, reducing generalization error, perhaps no longer requiring the use of dropout for regularization.
8. It enables faster and more stable training of deep neural networks by stabilising the distributions of layer inputs during the training phase. This approach is mainly related to internal covariate shift (ICS), which means the change in the distribution of layer inputs caused when the preceding layers are updated. To improve the training of a model, it is important to reduce the internal covariate shift.

9. The advantages of batch normalization are mentioned below:
Batch normalization reduces the internal covariate shift (ICS) and accelerates the training of a deep neural network. This approach reduces the dependence of gradients on the scale of the parameters or of their initial values, which allows higher learning rates without the risk of divergence. Batch normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.
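
As a rough illustration, the training-time forward pass of batch normalization can be written in a few lines of NumPy. The names `gamma`, `beta` and `eps` follow the usual convention for the learned scale, the learned shift and a small stability constant; the input data is made up.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mean, var

rng = np.random.default_rng(0)
# A mini-batch whose features have very different scales and offsets.
x = rng.normal(loc=[5.0, -3.0, 0.0], scale=[10.0, 0.1, 1.0], size=(64, 3))
out, batch_mean, batch_var = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))

print(out.mean(axis=0))  # approximately 0 for every feature
print(out.std(axis=0))   # approximately 1 for every feature
```

At inference time, implementations typically replace the batch statistics with running averages collected during training, which is why the computation differs between train and test.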


#### Weight Normalization:
1. It is a method developed by OpenAI that, instead of normalizing the mini-batch, normalizes the weights of the layer.
2. Weight normalization is a process of reparameterization of the weight vectors in a deep neural network which works by decoupling the length of those weight vectors from their direction. In simple terms, we can define weight normalization as a method for improving the optimisability of the weights of a neural network model.
3.  By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent.
4. The authors propose to reparameterize each weight vector 𝐰 in terms of a parameter vector 𝐯 and a scalar parameter 𝑔, and to perform stochastic gradient descent with respect to those parameters instead:

$$\mathbf{w} = \frac{g}{\lVert \mathbf{v} \rVert}\,\mathbf{v}$$

   where 𝐯 is a 𝑘-dimensional vector, 𝑔 is a scalar, and ‖𝐯‖ denotes the Euclidean norm of 𝐯. They call this reparameterization weight normalization.
   

5. Weight normalization improves the conditioning of the optimisation problem and speeds up the convergence of stochastic gradient descent.
6. It can be applied successfully to recurrent models such as LSTMs as well as in deep reinforcement learning or generative models
7. The experimental results of the paper show that weight normalization combined with mean-only batch normalization achieves the best results on CIFAR-10, an image classification dataset.
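
The reparameterization can be checked with a tiny NumPy sketch. The function name `weight_norm` and the numbers are illustrative only.

```python
import numpy as np

def weight_norm(v, g):
    """Reparameterize w = g * v / ||v||: g sets the length, v the direction."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])   # direction parameter, ||v|| = 5
g = 2.0                    # length parameter

w = weight_norm(v, g)
print(w)                   # [1.2 1.6]
print(np.linalg.norm(w))   # approximately 2.0, i.e. equal to g

# Rescaling v leaves w unchanged, so the length is controlled only by g.
assert np.allclose(weight_norm(10.0 * v, g), w)
```

This is the decoupling of length from direction: gradient steps on `g` change only the norm of the weight vector, and steps on `v` change only its direction.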


### Comparison between batch normalization and weight normalization:
### Batch Normalization:
### Pros:
1. Stable if the batch size is large
2.  Robust (in train) to the scale & shift of input data
3. Robust to the scale of weight vector
4. Scale of update decreases while training

### Cons:
1.  Not good for online learning
2.  Not good for RNN, LSTM
3. Different calculation between train and test

### Weight Normalization:
    
### Pros:
1. Smaller calculation cost on CNN
2. Weight initialization is well considered
3. Implementation is easy
4. Robust to the scale of weight vector
### Cons:
1. Compared with the others, might be unstable during training
2. High dependence on input data

### 5. Best problem provided by our talented students! 
##### 1. When we train a neural network, we notice that the loss does not decrease in a few starting epochs, what’s the reason for this? What is the strategy to improve?

Please calculate X’s min-max normalization, Z-score normalization, and L2 normalization:

    import numpy as np
    X = np.array([[ 1., -1., 2.],
                  [ 2., 0., 0.],
                  [ 0., 1., -1.]])

<b>The reasons for this could be</b>

1. The learning rate is low
2. Regularization parameter is high
3. Stuck at local minima

<b>Strategy to improve?</b>

1. Jitter the learning rate, i.e. change the learning rate for a few epochs
2. Change weight Initialization 
3. Increase or decrease learning rate
4. Reduce Regularization

Min-max, Z-score, and L2 normalization of X:

    # Min-max normalization
    import numpy as np
    from scipy import stats
    from sklearn.preprocessing import MinMaxScaler, normalize

    X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
    scaler = MinMaxScaler()
    print(scaler.fit(X))      # fit() returns the fitted scaler itself
    print(scaler.data_max_)   # per-column maximum seen during fit
    print(scaler.data_min_)   # per-column minimum seen during fit

    # Z-score normalization (column-wise by default)
    print("----------------------------")
    print("Z-score")
    print(stats.zscore(X))

    # L2 normalization (each row scaled to unit length)
    print("----------------------------")
    print("l2 normalization")
    print(normalize(X, norm='l2'))

Output:

    MinMaxScaler(copy=True, feature_range=(0, 1))
    [2. 1. 2.]
    [ 0. -1. -1.]
    ----------------------------
    Z-score
    [[ 0.         -1.22474487  1.33630621]
     [ 1.22474487  0.         -0.26726124]
     [-1.22474487  1.22474487 -1.06904497]]
    ----------------------------
    l2 normalization
    [[ 0.40824829 -0.40824829  0.81649658]
     [ 1.          0.          0.        ]
     [ 0.          0.70710678 -0.70710678]]
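
The cell above prints only the per-column minimum and maximum found by the scaler, not the min-max normalized matrix itself. As a cross-check, all three normalizations can also be computed directly with NumPy; this sketch assumes the same conventions as the library calls (column-wise min-max and Z-score with population standard deviation, row-wise L2).

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Min-max normalization: rescale each column to [0, 1].
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Z-score normalization: zero mean and unit (population) std per column.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

# L2 normalization: scale each row to unit Euclidean length.
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_minmax)
print(X_zscore)
print(X_l2)
```

The Z-score and L2 results agree with the output listed above, and the min-max result has every column running from 0 to 1.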


### 2. The problem is you are trying to train a deep learning model but only a small amount of data is available. Fortunately, there is a pre-trained neural network that was trained on a similar problem. Which one of the following methodologies is the best choice that can make use of this pre-trained network?

A. Retrain the model from scratch for the new dataset

B. Only fine-tune the last couple of layers of the pre-trained model with a small learning rate

C. Freeze all the layers except the last, and only retrain the last layer

D. Assess on every layer how the pre-trained model performs and only select a few of them with respect to their performances

Answer : C (Freeze all the layers except the last, and only retrain the last layer). 
If the new dataset is small and mostly similar to the original one, the best method is to train only the last layer, as all the previous layers work as feature extractors.

### 3. Please provide at least three learning rate scheduling and briefly describe each of them.

### Answer : Common learning rate schedules are:
1. constant, 
2. time-based decay, 
3. step decay and 
4. exponential decay.

<b>Constant learning rate</b>: 
This is the default learning rate schedule of the SGD optimizer in Keras. Both momentum and decay rate are set to zero by default. It is tricky to choose the right learning rate. The SGD optimizer also has an argument called nesterov, which is set to False by default.

<b>Time-Based Decay</b>: 
The mathematical form of time-based decay is lr = lr0/(1+kt), where lr0 and k are hyperparameters and t is the iteration number. When the decay argument is specified, the optimizer shrinks the learning rate at every update, as in the following line:

             lr *= (1. / (1. + self.decay * self.iterations))
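
As a standalone illustration of the same formula, here is a small Python sketch. The function name and the values lr0 = 0.1 and k = 0.01 are hypothetical.

```python
def time_based_decay(initial_lr, k, t):
    """lr = lr0 / (1 + k * t), where t is the iteration (or epoch) index."""
    return initial_lr / (1.0 + k * t)

# Hypothetical settings: lr0 = 0.1 and k = 0.01.
for t in (0, 10, 100):
    print(t, time_based_decay(0.1, 0.01, t))
```

With these values the learning rate has halved by t = 100, and it keeps decaying more and more slowly afterwards.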
             
             
The momentum method helps the parameter vector build up velocity in any direction with a consistent gradient, which dampens oscillations. A typical choice of momentum is between 0.5 and 0.9. Nesterov momentum is a variant of momentum with stronger theoretical convergence guarantees for convex functions, and in practice it works slightly better than standard momentum.

<b>Step Decay</b>: 
A step decay schedule drops the learning rate by a factor every few epochs. A typical choice is to halve the learning rate every 10 epochs.
The mathematical form of step decay is :
                lr = lr0 * drop^floor(epoch / epochs_drop)
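
The step decay formula translates directly into a small Python function. The default values below (initial rate 0.1, halving every 10 epochs) are example settings, not prescribed ones.

```python
import math

def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_drop=10):
    """lr = lr0 * drop ** floor(epoch / epochs_drop)."""
    return initial_lr * drop ** math.floor(epoch / epochs_drop)

for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(epoch))  # 0.1, 0.1, 0.05, 0.025
```

A function of this shape, taking the epoch index and returning a learning rate, is what a per-epoch scheduler callback expects.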
  
<b>Exponential Decay</b>:
Another common schedule is exponential decay. It has the mathematical form

              lr = lr0 * e^(−kt)
where lr0 and k are hyperparameters and t is the iteration number. We can implement this in the same way, by defining a custom exponential decay function and passing it to LearningRateScheduler.
The Python function for exponential decay is as follows:

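    import math
    from keras.callbacks import LearningRateScheduler

    def exp_decay(epoch):
        initial_lrate = 0.1
        k = 0.1
        # Use the epoch index as t in lr = lr0 * e^(-kt).
        lrate = initial_lrate * math.exp(-k * epoch)
        return lrate

    lrate = LearningRateScheduler(exp_decay)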

### Answer 4( California Housing) and Answer 6( Fashion MNIST) are in separate google colab notebooks with .ipynb file format. I compressed  all 3 files in a single zip file.