![alt](images/neural_network.jpg)

A neural network can be seen as a stack of multiple layers of logistic regression units, where each layer may use a different activation function.

- An activation function decides whether a neuron (or node) should be activated or not based on whether its input is important for making the final prediction.
- An activation function adds non-linearity to the network, which lets it approximate any continuous function and therefore solve complex problems. A neural network without activation functions collapses to a linear model, like linear regression.
- The activation functions and the weights of the hidden layers will transform the input features and output the final results.
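
The claim that a network without activation functions is just a linear model can be checked numerically. The snippet below is a minimal sketch with made-up shapes and random weights (`X`, `W1`, `W2` are illustrative names, not part of the model built later): two stacked linear layers give the same output as a single layer whose weights are the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))    # 5 observations, 4 features (arbitrary sizes)
W1 = rng.normal(size=(4, 3))   # weights of a first layer
W2 = rng.normal(size=(3, 2))   # weights of a second layer

# Two layers with no activation in between ...
two_linear_layers = (X @ W1) @ W2
# ... are equivalent to one linear layer with weights W1 @ W2.
one_linear_layer = X @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)

# With a non-linearity (tanh here) between the layers, the equivalence breaks.
with_activation = np.tanh(X @ W1) @ W2
differs = not np.allclose(with_activation, one_linear_layer)

print(collapses, differs)
```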


#### Activation Functions
<img src="images/sigmoid.jpg" style="width: 400px;"/>  <img src="images/sigmoid_gradient.jpg" style="width: 360px;"/> 
- The sigmoid output ranges between 0 and 1, so it is commonly used for predicting probabilities or for binary classification.
- The gradient approaches 0 at both extremes, so the sigmoid suffers from the vanishing gradient problem.
- The output is not symmetric around zero, which makes training the neural network more difficult.
- Its generalization, the softmax, is commonly used in the last layer for multi-class classification.

<img src="images/tanh.jpg" style="width: 400px;"/>  <img src="images/tanh_gradient.jpg" style="width: 390px;"/> 
- The tanh output is zero-centered, so output values can be read as strongly negative, neutral, or strongly positive.
- Sigmoid/Tanh should not be used in deep hidden layers as they are prone to vanishing gradients.

<img src="images/relu.jpg" style="width: 400px;"/>  <img src="images/relu_gradient.jpg" style="width: 400px;"/> 
- ReLU accelerates the convergence of gradient descent.
- The gradient is zero for negative inputs, which can create dead neurons that never get activated.
- Should only be used in the hidden layers.

<img src="images/leaky_relu.jpg" style="width: 400px;"/> <img src="images/swish.jpg" style="width: 400px;"/>
- Leaky ReLU enables backpropagation even for negative input values.
- The gradient for negative values is small, which can make learning slow for those inputs.
- Swish has been reported to outperform ReLU mainly in very deep networks (more than about 40 layers).
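
The activation functions above and their gradients can be sketched in NumPy as follows. The leaky ReLU slope `alpha=0.01` and the Swish form `x * sigmoid(x)` are assumptions chosen for illustration; other values and variants exist.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Zero gradient for negative inputs: the source of "dead" neurons
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Small but non-zero gradient for negative inputs
    return np.where(x > 0, 1.0, alpha)

def swish(x):
    return x * sigmoid(x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print("sigmoid gradient:   ", sigmoid_grad(x))
print("tanh gradient:      ", tanh_grad(x))
print("relu gradient:      ", relu_grad(x))
print("leaky relu gradient:", leaky_relu_grad(x))
print("swish:              ", swish(x))
```

Printing the gradients at `x = -10` and `x = 10` shows the sigmoid and tanh gradients shrinking towards zero, while the leaky ReLU gradient stays at `alpha` for negative inputs.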




For a 3-layer neural network (1 input layer, 1 hidden layer, 1 output layer) we have:<br>
$\sigma(\hat{Y}_{NxC}) = \sigma(X_{NxM}W_{MxC}) \ \ \ \ $input layer to hidden layer (sigmoid activation)<br>
$s(\hat{Z}_{NxK})  = s(\sigma(\hat{Y}_{NxC})V_{CxK}) \ \ \ \ \ \ \ \ $hidden layer to output layer (softmax activation)<br> 
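
As a quick check of the shapes in these two equations, here is a minimal sketch of the forward pass with small, arbitrary sizes for `N`, `M`, `C` and `K` and random weights. Subtracting the row maximum inside the softmax is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def sigmoid(A):
    return 1.0 / (1.0 + np.exp(-A))

def softmax(A):
    # Subtract the row-wise max before exponentiating for numerical stability
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

N, M, C, K = 6, 5, 4, 3           # observations, features, hidden units, classes
rng = np.random.default_rng(0)
X = rng.normal(size=(N, M))
W = rng.normal(size=(M, C))       # input layer to hidden layer
V = rng.normal(size=(C, K))       # hidden layer to output layer

hidden = sigmoid(X @ W)           # sigma(Y_hat), shape N x C
output = softmax(hidden @ V)      # s(Z_hat), shape N x K

print(hidden.shape, output.shape)
print(output.sum(axis=1))         # each row is a probability distribution
```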

CrossEntropy ... <br>
J = $-\dfrac{1}{N} \sum^{N} \sum^{K}[z_{nk}\log(s({\hat{z}_{nk}}))]$<br>
J = $-\dfrac{1}{N} \sum^{N} \sum^{K}[z_{nk} \log(s(\sigma(\hat{y}_{nc})v_{ck}))]$<br>
J = $-\dfrac{1}{N} \sum^{N} \sum^{K}[z_{nk} \log(s(\sigma(x_{nm}w_{mc})v_{ck}))]$

$\dfrac{dJ}{dw_{mc}} = \dfrac{-1}{N} \sum^{N}_{n=1} \sum^{K}_{k=1} \sum^{K}_{j=1} \dfrac{d(z_{nk}\log(s({\hat{z}_{nk}})))}{ds(\hat{z}_{nk})} \dfrac{ds(\hat{z}_{nk})}{d\hat{z}_{nj}} \dfrac{d\hat{z}_{nj}}{d\sigma(x_{nm}w_{mc})} \dfrac{d\sigma(x_{nm}w_{mc})}{d(x_{nm}w_{mc})} \dfrac{d(x_{nm}w_{mc})}{dw_{mc}} $

$\dfrac{dJ}{dw_{mc}} = \dfrac{-1}{N} \sum^{N}_{n=1} \sum^{K}_{j=1} [z_{nj}-s(\hat{z}_{nj})] v_{cj} \sigma(x_{nm}w_{mc})[1-\sigma(x_{nm}w_{mc})]x_{nm}$

Flipping the sign and writing $\hat{y}_{nc}$ for $x_{nm}w_{mc}$, each entry of the gradient is:<br>
$\dfrac{dJ}{dw_{mc}} = \dfrac{1}{N} \sum^{N}_{n=1} \sum^{K}_{j=1} [s(\hat{z}_{nj})-z_{nj}] v_{cj} \sigma(\hat{y}_{nc})[1-\sigma(\hat{y}_{nc})] x_{nm}$

In matrix form:<br>
$\dfrac{dJ}{dW} =  \dfrac{1}{N} X^{T}_{MxN} \big[ [s(\hat{Z})-Z]V^{T} \odot \sigma(\hat{Y}) \odot [1-\sigma(\hat{Y})] \big]_{NxC} $ where $\odot$ stands for element-wise multiplication.
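
One way to gain confidence in the matrix form above is to compare it against a finite-difference approximation of the loss. The following is a minimal sketch on small random data; the sizes, the seed and the step `eps` are arbitrary choices, and `loss` and `grad_W` are illustrative helper names.

```python
import numpy as np

def sigmoid(A):
    return 1.0 / (1.0 + np.exp(-A))

def softmax(A):
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

def loss(X, Z, W, V):
    # Cross entropy J averaged over the N observations
    P = softmax(sigmoid(X @ W) @ V)
    return -np.mean(np.sum(Z * np.log(P), axis=1))

def grad_W(X, Z, W, V):
    # Matrix form: (1/N) X^T [ (s(Z_hat) - Z) V^T * sigma(Y_hat) * (1 - sigma(Y_hat)) ]
    S = sigmoid(X @ W)
    P = softmax(S @ V)
    return X.T @ (((P - Z) @ V.T) * S * (1 - S)) / X.shape[0]

rng = np.random.default_rng(0)
N, M, C, K = 8, 5, 4, 3
X = rng.normal(size=(N, M))
W = rng.normal(size=(M, C))
V = rng.normal(size=(C, K))
Z = np.eye(K)[rng.integers(0, K, size=N)]   # one-hot encoded targets

analytic = grad_W(X, Z, W, V)

# Central finite differences, one weight at a time
numeric = np.zeros_like(W)
eps = 1e-6
for m in range(M):
    for c in range(C):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[m, c] += eps
        W_minus[m, c] -= eps
        numeric[m, c] = (loss(X, Z, W_plus, V) - loss(X, Z, W_minus, V)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print("largest difference between analytic and numeric gradient:", max_abs_diff)
```

If the derivation is right, the largest difference should be tiny compared with the gradient entries themselves.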

### Feature Scaling

Different features of the input data may have different scales: one could range in the thousands while another ranges from 0 to 10. Scaling the input may improve the performance of machine learning models. It can also help with the vanishing gradients of activation functions like the sigmoid, because large inputs push them into their flat regions.

**Min-max scaling** (or normalization) involves shifting and rescaling the values so that they range between 0 and 1. It is done by subtracting the min value, then dividing by the max minus the min.

**Standardization** involves rescaling the input data to have zero mean and unit variance. It is done by subtracting the mean value, then dividing by the standard deviation.
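
Both scalings can be sketched in a few lines of NumPy. The data below is made up for illustration, and the sketch assumes no feature is constant (otherwise the denominators would be zero).

```python
import numpy as np

# Toy data: one feature in the thousands, one between 0 and 10
X = np.array([[1000.0, 2.0],
              [3000.0, 4.0],
              [5000.0, 10.0]])

# Min-max scaling: subtract the min, divide by (max - min), per feature
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization: subtract the mean, divide by the standard deviation, per feature
mean, std = X.mean(axis=0), X.std(axis=0)
X_standard = (X - mean) / std

print(X_minmax)
print(X_standard)
```

In practice the statistics (`col_min`, `col_max`, `mean`, `std`) are usually computed on the training set and then reused to scale the test set.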

### Epochs

The number of epochs is the number of times the learning algorithm works through the entire training set. The more complex the problem or model, the more epochs are typically needed for gradient descent to converge to a good minimum of the loss.

### Weights Initialization

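The initial weights should be neither too large nor too small in magnitude. Large weights push functions like the sigmoid into their saturated regions, where the gradient vanishes, while very small weights shrink the signal as it passes through each layer.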
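
The effect of large initial weights on the sigmoid gradient can be illustrated with a small experiment. This is only a sketch: the inputs are random stand-ins for scaled pixel values, and scaling the standard deviation by `1/sqrt(fan_in)` is one common heuristic assumed here for comparison, not something prescribed by this notebook.

```python
import numpy as np

def sigmoid(A):
    return 1.0 / (1.0 + np.exp(-A))

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 400
X = rng.uniform(0, 1, size=(100, fan_in))   # stand-in for inputs scaled to [0, 1]

def mean_sigmoid_gradient(weight_std):
    # Average derivative of the sigmoid over all hidden units and observations
    W = rng.normal(scale=weight_std, size=(fan_in, fan_out))
    S = sigmoid(X @ W)
    return np.mean(S * (1 - S))

grad_large = mean_sigmoid_gradient(1.0)                     # unit-variance weights
grad_scaled = mean_sigmoid_gradient(1.0 / np.sqrt(fan_in))  # heuristic scaling (assumption)

print("mean sigmoid gradient, std=1:             ", grad_large)
print("mean sigmoid gradient, std=1/sqrt(fan_in):", grad_scaled)
```

With unit-variance weights most hidden units start out saturated, so the average gradient is much smaller than with the scaled weights.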

In [1]:
import os
# Silence TensorFlow's C++ log output; this must be set before TensorFlow is imported
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import numpy as np
import tensorflow as tf
from tensorflow import keras


fashion_mnist = keras.datasets.fashion_mnist
(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
 "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
K = np.max(Y_train)+1
print("Number of output classes:",K)
# one hot encode
Y_train = keras.utils.to_categorical(Y_train)
Y_test = keras.utils.to_categorical(Y_test)

N,w,h = X_train.shape
print("There are", N ,'observations in the training set, each is a', w,'x',h,'image')
print("Each observation should be represented as a vector, so we should flatten each image")
print("The pixel intensities range from 0 to 255, it is always a good idea to scale the input values such that they range from 0 to 1")
# Flatten each image into a vector and scale the pixel intensities to [0, 1]
X_train = X_train.reshape((N,-1))
X_train = X_train/255
# Prepend a column of ones so that the first row of W acts as the bias term
bias = np.ones((X_train.shape[0],1))
X_train = np.hstack((bias,X_train))

# Apply the same preprocessing to the test set
X_test= X_test.reshape((X_test.shape[0],-1))
X_test = X_test/255
bias = np.ones((X_test.shape[0],1))
X_test = np.hstack((bias,X_test))

Number of output classes: 10
There are 60000 observations in the training set, each is a 28 x 28 image
Each observation should be represented as a vector, so we should flatten each image
The pixel intensities range from 0 to 255, it is always a good idea to scale the input values such that they range from 0 to 1


In [2]:
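def sigmoid(Y):
    return 1/(1+np.exp(-Y))

def softmax(Y):
    # Subtract the row-wise max before exponentiating for numerical stability
    expY = np.exp(Y - Y.max(axis=1,keepdims=True))
    return expY/expY.sum(axis=1,keepdims=True)

C = 400      # number of hidden units
lr = 0.01    # learning rate
epochs = 3
W = np.random.normal(size=(X_train.shape[1],C))
V = np.random.normal(size=(C,K))

for epoch in range(epochs):
    print("Epoch #", epoch)
    # Stochastic gradient descent: update the weights one observation at a time
    for i in range(N):
        X_r = X_train[i].reshape((1,-1))
        Y_r = Y_train[i]

        # Forward pass
        S = sigmoid(X_r.dot(W))
        Z = softmax(S.dot(V))
        # Backward pass (both gradients use the weights from before the update)
        gradients_W = X_r.T.dot(((Z-Y_r).dot(V.T))*S*(1-S))
        gradients_V = S.T.dot(Z-Y_r)

        W -= lr*gradients_W
        V -= lr*gradients_V
    print("W=",W.sum())
    print("V=",V.sum())
    print("\n\n")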

Epoch # 0
W= -1526.1314239348037
V= -47.45581055906597



Epoch # 1
W= -1734.3514829131282
V= -47.455810558897745



Epoch # 2
W= -1878.3105057031912
V= -47.45581055873485





In [3]:
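from sklearn.metrics import accuracy_score
# Forward pass on the test set
S = sigmoid(X_test.dot(W))
Z = softmax(S.dot(V))

# Compare the most probable class with the true class (Y_test is one-hot encoded)
accuracy_score(np.argmax(Y_test,axis=1), np.argmax(Z,axis=1))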

0.8242

The accuracy of our model is not that great. Let's try to implement a similar model with Keras.

In [4]:
fashion_mnist = keras.datasets.fashion_mnist
(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()
X_train = X_train/255
X_test = X_test/255

In [5]:
# Create model, number of layers, number of nodes and activation
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=X_train.shape[1:]),
    keras.layers.Dense(C,activation="sigmoid"),
    keras.layers.Dense(K,activation="softmax")
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 dense (Dense)               (None, 400)               314000    
                                                                 
 dense_1 (Dense)             (None, 10)                4010      
                                                                 
Total params: 318,010
Trainable params: 318,010
Non-trainable params: 0
_________________________________________________________________


In [6]:
# Specify loss function and optimizer
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
print(model.optimizer.get_config())

## Fit and eval model
history = model.fit(X_train,Y_train,epochs=3)
model.evaluate(X_test, Y_test)

{'name': 'SGD', 'learning_rate': 0.01, 'decay': 0.0, 'momentum': 0.0, 'nesterov': False}
Epoch 1/3
Epoch 2/3
Epoch 3/3


[0.6181872487068176, 0.7797999978065491]

Next, we try a deeper model with ReLU activation functions and 15 epochs.

In [7]:
# Create model, number of layers, number of nodes and activation
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=X_train.shape[1:]),
    keras.layers.Dense(500,activation="relu"),
    keras.layers.Dense(300,activation="relu"),
    keras.layers.Dense(100,activation="relu"),
    keras.layers.Dense(K,activation="softmax")
])

# Specify loss function and optimizer
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

## Fit and eval model
history = model.fit(X_train,Y_train,epochs=15)
model.evaluate(X_test, Y_test)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


[0.3339943587779999, 0.8804000020027161]

## Hyperparameters Tuning

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Tweaking them manually or at random is not encouraged. Fortunately, there are Python libraries that can be used to optimize hyperparameters.

### Number of Hidden Layers
For many problems, you can begin with a single hidden layer and get reasonable results. However, for complex problems, deeper networks can reach much better performance, but they may take very long to train and require a lot of data. 

It is much more common to reuse parts of a pretrained state-of-the-art network that performs a similar task. Training will then be a lot faster and require much less data.

### Number of Neurons per Hidden Layer
For the hidden layers, a common simplification is to use the same number of neurons in every hidden layer, so that there is only one value to tweak.

In practice, it’s often simpler and more efficient to pick a model with more layers and neurons than you actually need, then use early stopping and other regularization techniques to prevent overfitting.

### Others
- In general, ReLU will be a good default for all hidden layers.
- The number of iterations (epochs) does not need to be tweaked; just use early stopping instead.
- In general, small batch sizes should be used (i.e. fewer than 32). Large batch sizes should be used only in conjunction with learning rate warmup.
- Picking the learning rate and the optimizer is also important.
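
The idea behind early stopping can be sketched in plain Python: stop training once the validation loss has not improved for a given number of epochs, and keep the weights from the best epoch. The function name `early_stopping_epoch`, the `patience` value and the loss values below are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here
            return epoch, best_epoch
    # Never triggered: training ran through all epochs
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving at first, then overfitting
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.62, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # → 6 3
```

Keras offers this behaviour through the `keras.callbacks.EarlyStopping` callback (with parameters such as `monitor`, `patience` and `restore_best_weights`), which can be passed to `model.fit` together with validation data.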