<h1><center> Deep Neural Networks from scratch in Python </center></h1>

In this guide we will build a deep neural network with as many layers as you want. The network can be applied to supervised learning problems with binary classification.

![Structure](img/001.png)

### Notation
    Superscript [l] denotes a quantity associated with the lᵗʰ layer.
    Superscript (i) denotes a quantity associated with the iᵗʰ example.
    Subscript i denotes the iᵗʰ entry of a vector.

![Neuron](img/002.png)

A neuron computes a linear function (z = Wx + b) followed by an activation function. We generally say that the output of a neuron is a = g(Wx + b) where g is the activation function (sigmoid, tanh, ReLU, …).
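
As a minimal sketch of the single-neuron computation a = g(Wx + b), assuming a hypothetical 3-feature input, hand-picked weights, and sigmoid as g (none of these values come from the guide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values chosen only for illustration.
W = np.array([[0.5, -1.0, 2.0]])     # shape (1, 3): one neuron, three inputs
x = np.array([[1.0], [2.0], [0.5]])  # shape (3, 1): one example
b = np.array([[0.5]])                # shape (1, 1)

z = W @ x + b    # linear part: 0.5 - 2.0 + 1.0 + 0.5 = 0.0
a = sigmoid(z)   # activation: sigmoid(0) = 0.5
print(z, a)
```

A full layer stacks several such neurons as rows of W, which is why W becomes a matrix in the sections that follow.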

#### Dataset
Let’s assume that we have a very big dataset.

One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array.
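
A minimal sketch of this preprocessing step, assuming a small made-up array `X` with examples in rows. The first variant uses one mean and one standard deviation for the whole array, as described above; the second is a common per-feature alternative and is shown only for comparison:

```python
import numpy as np

# Hypothetical raw data: 4 examples (rows) with 3 features (columns).
X = np.array([[1.0, 200.0, 3.0],
              [2.0, 180.0, 6.0],
              [3.0, 220.0, 9.0],
              [4.0, 240.0, 12.0]])

# Variant 1: statistics of the whole array, as in the text.
X_std = (X - X.mean()) / X.std()

# Variant 2 (assumption, not from the guide): statistics per feature column.
X_feat = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(), X_std.std())
print(X_feat.mean(axis=0), X_feat.std(axis=0))
```

Whichever variant you pick, the statistics should be computed on the training set and reused unchanged for the test set.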

#### General methodology (building the parts of our algorithm)
    We will follow the Deep Learning methodology to build the model:
    
    1. Define the model structure (such as number of input features)
    
    2. Initialize parameters and define hyperparameters:
        number of iterations
        number of layers L in the neural network
        size of the hidden layers
        learning rate α
    
    3. Loop for num_iterations:
        Forward propagation (calculate current activations and predictions)
        Compute cost function
        Backward propagation (calculate current gradients)
        Update parameters (using the current parameters and the gradients from backpropagation)
    
    4. Use trained parameters to predict labels

#### Initialization

The initialization for a deeper L-layered neural network is more complicated because there are many more weight matrices and bias vectors. The tables below help you keep the dimensions of each structure consistent.

![Dimension Table](img/003.png)


##### Dimensions of weight matrix W, bias vector b and linear output Z for our example neural network architecture

![Dimension Table ABOVE arch](img/004.png)
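
The dimension rules can also be checked in code. This is a minimal sketch that assumes the layer sizes 4, 5, 4, 3, 1 from the `nn_architecture` list defined later in this guide and a hypothetical batch of `m = 10` examples:

```python
# Layer sizes assumed to match the nn_architecture list defined later in this guide.
layer_sizes = [4, 5, 4, 3, 1]
m = 10  # hypothetical number of training examples

shapes = {}
for l in range(1, len(layer_sizes)):
    shapes["W" + str(l)] = (layer_sizes[l], layer_sizes[l - 1])  # (n_l, n_{l-1})
    shapes["b" + str(l)] = (layer_sizes[l], 1)                   # (n_l, 1)
    shapes["Z" + str(l)] = (layer_sizes[l], m)                   # (n_l, m)

for name, shape in shapes.items():
    print(name, shape)
```

The bias has a single column and is broadcast across the m examples when it is added to the product of W and A.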

In [6]:
def initialize_parameters(nn_architecture, seed = 3):
    np.random.seed(seed)
    # python dictionary containing our parameters "W1", "b1", ..., "WL", "bL"
    parameters = {}
    number_of_layers = len(nn_architecture)

    for l in range(1, number_of_layers):
        parameters['W' + str(l)] = np.random.randn(
            nn_architecture[l]["layer_size"],
            nn_architecture[l-1]["layer_size"]
            ) * 0.01
        parameters['b' + str(l)] = np.zeros((nn_architecture[l]["layer_size"], 1))
        
    return parameters


Initializing the parameters with small random numbers is a simple approach, and it usually gives a good enough starting point for our algorithm.
#### Remember:
    Different initialization techniques such as Zero, Random, He or Xavier lead to different results
    Random initialization makes sure different hidden units can learn different things (initializing all the weights to zero makes every neuron in a layer learn the same thing)
    Don’t initialize to values that are too large
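
For comparison with the `* 0.01` scaling used above, here is a minimal sketch of He initialization, which scales each weight matrix by sqrt(2 / fan_in). The function name `initialize_parameters_he` and the plain list of layer sizes are assumptions made for this example, not part of the guide's code:

```python
import numpy as np

def initialize_parameters_he(layer_sizes, seed=3):
    """He initialization: scale weights by sqrt(2 / fan_in); biases start at zero."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_sizes)):
        fan_in = layer_sizes[l - 1]
        W = rng.standard_normal((layer_sizes[l], fan_in)) * np.sqrt(2.0 / fan_in)
        parameters["W" + str(l)] = W
        parameters["b" + str(l)] = np.zeros((layer_sizes[l], 1))
    return parameters

params = initialize_parameters_he([4, 5, 4, 3, 1])
print({name: value.shape for name, value in params.items()})
```

He initialization is usually paired with ReLU layers; whether it helps on a given dataset is something to verify experimentally.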

#### Activation functions
    Activation functions make the network non-linear; without them, a stack of layers would collapse into a single linear function.
    In our example, we will use sigmoid and ReLU.
    Sigmoid outputs a value between 0 and 1, which makes it a good choice for the output layer in binary classification. You can classify the output as 1 if it is greater than 0.5 and as 0 otherwise.


In [7]:
def sigmoid(Z):
    S = 1 / (1 + np.exp(-Z))
    return S

def relu(Z):
    R = np.maximum(0, Z)
    return R

def sigmoid_backward(dA, Z):
    S = sigmoid(Z)
    dS = S * (1 - S)
    return dA * dS

def relu_backward(dA, Z):
    dZ = np.array(dA, copy = True)
    dZ[Z <= 0] = 0
    return dZ

The code above shows vectorized implementations of the activation functions and their derivatives. These functions are used in the calculations that follow.
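
One way to gain confidence in the backward functions is to compare them with a finite-difference estimate. The sketch below repeats the four functions so that it runs on its own; the test points and the step `eps` are arbitrary choices:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def sigmoid_backward(dA, Z):
    S = sigmoid(Z)
    return dA * S * (1 - S)

def relu_backward(dA, Z):
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

Z = np.array([[-2.0, -0.5, 0.3, 1.7]])  # avoids Z = 0, where ReLU has a kink
dA = np.ones_like(Z)                    # an upstream gradient of 1 isolates g'(Z)
eps = 1e-6

# Central differences approximate the derivative of each activation.
num_sigmoid = (sigmoid(Z + eps) - sigmoid(Z - eps)) / (2 * eps)
num_relu = (relu(Z + eps) - relu(Z - eps)) / (2 * eps)

print(np.max(np.abs(num_sigmoid - sigmoid_backward(dA, Z))))
print(np.max(np.abs(num_relu - relu_backward(dA, Z))))
```

Both printed differences should be very close to zero if the analytic derivatives are correct.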

#### Forward propagation

During forward propagation, each layer l needs to know which activation function it uses (sigmoid, tanh, ReLU, etc.). Given the activations from the previous layer, we compute Z and then apply the selected activation function.

![Forward Propagation](img/005.png)

In [8]:
def L_model_forward(X, parameters, nn_architecture):
    forward_cache = {}
    A = X
    number_of_layers = len(nn_architecture)
    for l in range(1, number_of_layers):
        A_prev = A 
        W = parameters['W' + str(l)]
        b = parameters['b' + str(l)]
        activation = nn_architecture[l]["activation"]
        Z, A = linear_activation_forward(A_prev, W, b, activation)
        forward_cache['Z' + str(l)] = Z
        forward_cache['A' + str(l-1)] = A_prev
    AL = A
    return AL, forward_cache

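def linear_activation_forward(A_prev, W, b, activation):
    # The linear step is the same for every activation.
    Z = linear_forward(A_prev, W, b)
    if activation == "sigmoid":
        A = sigmoid(Z)
    elif activation == "relu":
        A = relu(Z)
    else:
        # Fail loudly instead of returning an undefined A.
        raise ValueError("Unsupported activation: " + str(activation))
    return Z, A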

def linear_forward(A, W, b):
    Z = np.dot(W, A) + b
    return Z

We use a “cache” (a Python dictionary that holds the A and Z values computed for each layer) to pass values from forward propagation to the corresponding backward propagation step, where they are needed to compute the derivatives.

#### Loss function

To monitor the learning process, we need to calculate the value of the cost function. We will use the formula below to calculate the cost.


![Cost Equation](img/006.png)
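
The cross-entropy cost evaluates log(AL) and log(1 - AL), which become -inf when a prediction saturates at exactly 0 or 1. Below is a minimal sketch of a clipped variant; the helper name `compute_cost_stable` and the value of `eps` are assumptions for this example and are not part of the guide's code:

```python
import numpy as np

def compute_cost_stable(AL, Y, eps=1e-12):
    """Binary cross-entropy with predictions clipped away from 0 and 1."""
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)  # avoids log(0) = -inf
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)

AL = np.array([[0.9, 0.2, 1.0]])  # hypothetical predictions; the last one saturates
Y = np.array([[1, 0, 1]])
print(compute_cost_stable(AL, Y))
```

Clipping changes the cost only for predictions that are numerically indistinguishable from 0 or 1, so the training curve is otherwise unaffected.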

In [9]:
def compute_cost(AL, Y):
    m = Y.shape[1]

    # Compute loss from AL and y
    logprobs = np.multiply(np.log(AL),Y) + np.multiply(1 - Y, np.log(1 - AL))
    # cross-entropy cost
    cost = - np.sum(logprobs) / m

    cost = np.squeeze(cost)
    
    return cost

### Backpropagation

    Backpropagation is used to calculate the gradient of the loss function with respect to the parameters. It applies the chain rule from differential calculus recursively, layer by layer, from the output back to the input.

    Equations used in backpropagation calculation:
    
  ![Backward Propagation](img/007.png)
  
  
  
### The general idea:
The derivative of the loss function with respect to Z in the lᵗʰ layer is used to calculate the derivative of the loss with respect to A in the (l-1)ᵗʰ layer (the previous layer). That result is then multiplied element-wise by the derivative of the previous layer’s activation function to obtain the derivative with respect to Z in layer l-1.
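
As a small worked check of the chain rule at the output layer: with a sigmoid output and the cross-entropy loss, multiplying dAL by the sigmoid derivative simplifies to AL - Y. The sketch below uses made-up values for Z and Y:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

# Hypothetical output-layer values for three examples.
Z = np.array([[-1.0, 0.5, 2.0]])
Y = np.array([[0.0, 1.0, 1.0]])
AL = sigmoid(Z)

# Derivative of the cross-entropy loss with respect to AL.
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
# Chain rule: multiply by the sigmoid derivative AL * (1 - AL) to get dZ.
dZ = dAL * AL * (1 - AL)

print(dZ)
print(AL - Y)  # the two rows should agree
```

The simplified form AL - Y avoids the division by AL and 1 - AL, which is why many implementations compute the output-layer dZ directly.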

##### Backward propagation for our example neural network

![Backward Propagation for our model](img/008.png)

In [10]:
def L_model_backward(AL, Y, parameters, forward_cache, nn_architecture):
    grads = {}
    number_of_layers = len(nn_architecture)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    
    # Initializing the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    dA_prev = dAL

    for l in reversed(range(1, number_of_layers)):
        dA_curr = dA_prev

        activation = nn_architecture[l]["activation"]
        W_curr = parameters['W' + str(l)]
        Z_curr = forward_cache['Z' + str(l)]
        A_prev = forward_cache['A' + str(l-1)]

        dA_prev, dW_curr, db_curr = linear_activation_backward(dA_curr, Z_curr, A_prev, W_curr, activation)

        grads["dW" + str(l)] = dW_curr
        grads["db" + str(l)] = db_curr

    return grads

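def linear_activation_backward(dA, Z, A_prev, W, activation):
    # First undo the activation to get dZ.
    if activation == "relu":
        dZ = relu_backward(dA, Z)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, Z)
    else:
        # Fail loudly instead of using an undefined dZ.
        raise ValueError("Unsupported activation: " + str(activation))

    # Then backpropagate through the linear step.
    dA_prev, dW, db = linear_backward(dZ, A_prev, W)

    return dA_prev, dW, db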

def linear_backward(dZ, A_prev, W):
    m = A_prev.shape[1]

    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)

    return dA_prev, dW, db

#### Update parameters
The goal of this function is to update the parameters of the model with one step of gradient descent: each parameter moves against its gradient, scaled by the learning rate α.
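
To see the update rule in isolation, here is a minimal sketch that applies it to a toy one-dimensional problem, minimizing f(w) = (w - 3)². The function, starting point, and learning rate are made up for this example:

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent.
w = 0.0
learning_rate = 0.1  # hypothetical value for this toy problem

for _ in range(100):
    grad = 2 * (w - 3)            # df/dw
    w = w - learning_rate * grad  # same rule that update_parameters applies to each W and b

print(w)
```

Each step shrinks the distance to the minimum by a constant factor here; in a neural network the same rule is applied to every weight matrix and bias vector at once.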

In [11]:
def update_parameters(parameters, grads, learning_rate):
    L = len(parameters) // 2 # number of layers in the neural network

    for l in range(1, L+1):
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]

    return parameters

In [None]:
def predict(X, y, parameters, architecture=None):
    # Fall back to the global nn_architecture so existing calls keep working.
    if architecture is None:
        architecture = nn_architecture
    m = X.shape[1]  # number of examples

    # Forward propagation
    probas, _ = L_model_forward(X, parameters, architecture)

    # Convert probabilities to 0/1 predictions with a 0.5 threshold
    p = (probas > 0.5).astype(float)

    # Accuracy is the percentage of predictions that match the labels
    print("Accuracy: " + str(np.sum(p == y) / m * 100) + " %")

    return p

### Full model
The full implementation of the neural network model consists of the functions provided above.

In [12]:
def L_layer_model(X, Y, nn_architecture, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):
    np.random.seed(1)
    # keep track of cost
    costs = []
    
    # Parameters initialization.
    parameters = initialize_parameters(nn_architecture)
    
    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
        AL, forward_cache = L_model_forward(X, parameters, nn_architecture)
        
        # Compute cost.
        cost = compute_cost(AL, Y)
    
        # Backward propagation.
        grads = L_model_backward(AL, Y, parameters, forward_cache, nn_architecture)
 
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
                
        # Print the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" %(i, cost))

        costs.append(cost)
            
    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations')  # the cost is recorded at every iteration
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from DNN import *

In [None]:
nn_architecture = [
    {"layer_size": 4, "activation": "none"}, # input layer
    {"layer_size": 5, "activation": "relu"},
    {"layer_size": 4, "activation": "relu"},
    {"layer_size": 3, "activation": "relu"},
    {"layer_size": 1, "activation": "sigmoid"}
]


In [None]:
parameters = L_layer_model(train_x.T, train_y.T, nn_architecture, num_iterations = 2500, print_cost = True)

In [None]:
pred_train = predict(train_x.T, train_y.T, parameters)

In [None]:
pred_test = predict(test_x.T, test_y.T, parameters)

To make a prediction, you only need to run a full forward propagation with the learned parameters on a set of test data.

You can modify nn_architecture to build a neural network with a different number of layers and different hidden layer sizes.

To use another activation function, implement it together with its derivative alongside sigmoid and relu, then add matching branches to linear_activation_forward and linear_activation_backward.
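
As an illustration of adding an activation, here is a minimal sketch of tanh and its derivative, written in the same style as `sigmoid_backward` and `relu_backward`. The names `tanh` and `tanh_backward` are assumptions for this example and do not appear in the guide's code:

```python
import numpy as np

def tanh(Z):
    return np.tanh(Z)

def tanh_backward(dA, Z):
    # d/dZ tanh(Z) = 1 - tanh(Z)^2
    return dA * (1 - np.tanh(Z) ** 2)

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.ones_like(Z)
print(tanh(Z))
print(tanh_backward(dA, Z))
```

To use it, you would add an `elif activation == "tanh":` branch to both linear_activation_forward and linear_activation_backward and set `"activation": "tanh"` for the relevant layers in nn_architecture.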

## Further improvements
    We can face the “overfitting” problem if the training dataset is not big enough: the learned network fits the training data but doesn’t generalize to new examples that it has never seen. We can use regularization methods such as L2 regularization (which adds a penalty on the weights to the cost function) or dropout (which randomly shuts down some neurons in each iteration).
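
A minimal sketch of how L2 regularization modifies the cost, assuming the same `parameters` dictionary layout ("W1", "b1", ...) used in this guide. The function name `compute_cost_l2` and the hyperparameter name `lambd` are assumptions for this example:

```python
import numpy as np

def compute_cost_l2(AL, Y, parameters, lambd):
    """Cross-entropy cost plus (lambd / (2 * m)) times the sum of squared weights."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    num_layers = len(parameters) // 2
    l2 = sum(np.sum(np.square(parameters["W" + str(l)]))
             for l in range(1, num_layers + 1))
    return float(cross_entropy + lambd * l2 / (2 * m))

# Hypothetical one-layer example.
params = {"W1": np.array([[1.0, -2.0]]), "b1": np.zeros((1, 1))}
AL = np.array([[0.8, 0.3]])
Y = np.array([[1, 0]])
print(compute_cost_l2(AL, Y, params, lambd=0.0))
print(compute_cost_l2(AL, Y, params, lambd=0.1))
```

If you train with this cost, each dW in backpropagation also needs the extra term (lambd / m) * W so that the gradients match the regularized cost. The biases are conventionally left unpenalized.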

    We used gradient descent to update the parameters and minimize the cost. More advanced optimization methods can speed up learning and may even reach a better final value of the cost function, for example:

    Mini-batch gradient descent
    Momentum
    Adam optimizer
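
As a starting point for the first item in the list, here is a minimal sketch of how the training set could be split into shuffled mini-batches, assuming the guide's layout with one example per column. The function name `random_mini_batches` and the batch size are assumptions for this example:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the examples (columns) and split them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    return [
        (X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

X = np.arange(40, dtype=float).reshape(4, 10)  # 4 features, 10 examples
Y = (np.arange(10) % 2).reshape(1, 10)
batches = random_mini_batches(X, Y, batch_size=4)
print([batch_x.shape for batch_x, _ in batches])
```

In a training loop, forward propagation, backward propagation and the parameter update would then run once per mini-batch instead of once per pass over the whole dataset. The last batch is smaller when the number of examples is not a multiple of the batch size.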

## References:
    Course Name: Neural Networks and Deep Learning
    Instructor: Andrew Ng
    From: deeplearning.ai
    Platform: Coursera
    Course URL: https://www.coursera.org/learn/neural-networks-deep-learning/

### Thanks,
### Shubham Sagar
### Follow me at: www.instagaram.com/shubhamthrills  / https://github.com/shubhamthrills