<a href="https://colab.research.google.com/github/shIsmael/DeepLearning/blob/main/Deep_Learning_Shallow_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Instructions:
1. Make a copy of this notebook by opening the "File" tab and selecting "Save a copy in Drive"
2. Close this tab and move to your copy of this notebook
3. Follow the written guides within this notebook
4. If instructed to, add your own code in the corresponding cell

After completing this notebook, you will have:
- An understanding of hidden layers and units
- An understanding of what activation functions are
- An understanding of random initialization
- An understanding of backpropagation for networks with hidden layers
- Built a shallow neural network with one hidden layer to classify planar data




### Neural Networks and Vectorization
Recall the "neural network" that we built and trained in the last notebook for binary classification, in which we used the equation $\hat{y} = \sigma(w^{T}x + b)$ for computing predictions. This equation, consisting of two calculations (the sigmoid is considered a separate calculation), can be used as a single *node* (or *unit*) within a neural network. Each node has a linear computation ($w^{T}x + b$) and an *activation function* (the purpose of which is to make the node's output a nonlinear function of its input), which in this case is the sigmoid function. We can form a *layer* by "stacking" these nodes together; however, the nodes within a given layer have no influence on each other. By stacking these layers, we can build a fully-fledged neural network!

The first layer is referred to as the *input layer*, which, as you might guess, is where we feed in input data. The last layer is referred to as the *output layer*, and any layers between the input and output layer are called *hidden layers*, since their activations are not directly observed in the training data (we only see the inputs and the target outputs). Each node is connected (in other words, it passes on its activation) to every node in the subsequent layer. Here's a simple neural network with two layers, one of which is hidden (by convention, the input layer isn't counted; we'll be training a different architecture!):

![shallow neural net](https://miro.medium.com/max/782/1*CfdaqnNb6RHLzPJTt1UXjQ.png)

Notation-wise, we'll refer to the "activation" of the input layer as $a^{[0]} = X$. We'll refer to the activations in the layer after the input layer as $a^{[1]}$ (the square brackets index layers; don't confuse them with the round brackets we use for training examples!), which is computed as $a^{[1]} = \sigma(W^{[1]}x+b^{[1]})$. In the above image, $W^{[1]}$ is a 4x3 matrix, since there are $4 \times 3 = 12$ connections from the 3 input nodes to the 4 hidden nodes, and $b^{[1]}$ is a 4x1 matrix, since each of the 4 hidden nodes has its own bias. The activation of the node at index $i$ (counting from the 1st node to the $n$th node) in layer $l$ is referred to as $a^{[l]}_i$.

For the purposes of demonstrating exactly how the input is propagated through the network, we'll use the first node in the hidden layer in the above image. It computes $a^{[1]}_1 = \sigma(w^{[1]T}_1x + b^{[1]}_1)$, where $x$ is the three inputs $x_1\dots x_3$. 

It turns out that computing every activation in layer $l$ with a for loop is quite inefficient, so we'll stack the transposed weight vectors $\{w^{[1]T}_1 \dots w^{[1]T}_n\}$ as the rows of a 4x3 matrix $W^{[1]}$ and stack our biases $\{b^{[1]}_1 \dots b^{[1]}_n\}$ vertically into a 4x1 matrix $b^{[1]}$ (here $n = 4$). We can then use vectorization (which you may recall from the Python tutorial) to perform the linear computations for the entire layer at once, resulting in $Z^{[1]} = W^{[1]}x + b^{[1]}$. The activation function, sigmoid, can then be applied elementwise to $Z^{[1]}$ to calculate $a^{[1]}$. The activations $a^{[1]}$ can then be passed into the equation for the next layer, $a^{[2]} = \sigma(W^{[2]}a^{[1]}+b^{[2]})$, to repeat the process.
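
To make the stacking concrete, here is a minimal sketch comparing the node-by-node loop with the single matrix product for one example. The shapes follow the 3-input, 4-hidden-node picture described above; the random values and the variable names (`a1_loop`, `a1_vec`) are illustrative assumptions, not part of the notebook's model.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # one row of weights per hidden node
b1 = rng.standard_normal((4, 1))   # one bias per hidden node
x = rng.standard_normal((3, 1))    # a single example with three features

# Node-by-node version: one dot product per hidden unit.
a1_loop = np.zeros((4, 1))
for i in range(4):
    z_i = W1[i, :] @ x[:, 0] + b1[i, 0]
    a1_loop[i, 0] = sigmoid(z_i)

# Vectorized version: one matrix product for the whole layer.
a1_vec = sigmoid(W1 @ x + b1)

print(a1_vec.shape)                  # (4, 1)
print(np.allclose(a1_loop, a1_vec))  # the two versions agree
```

Both versions compute the same activations; the vectorized one simply hands the loop over to NumPy's optimized matrix routines.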

Vectorization can similarly be applied to the $m$ training examples $\{x^{(1)}\dots x^{(m)}\}$, which we would otherwise feed through the network one at a time with a for loop. Recall that $X$ is a matrix containing every training example as a column. We can then define the vectorized implementation for layer 1 as follows: $A^{[1]} = \sigma(W^{[1]}X + b^{[1]})$.
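
The same idea across training examples can be sketched as follows. This is only an illustration with made-up shapes (3 features, 4 hidden nodes, `m = 5` examples); note how NumPy broadcasting copies the `(4, 1)` bias across all `m` columns.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
m = 5                              # number of examples (illustrative)
W1 = rng.standard_normal((4, 3))
b1 = rng.standard_normal((4, 1))
X = rng.standard_normal((3, m))    # each column is one training example

# Loop over examples, filling one column of activations at a time.
A1_loop = np.zeros((4, m))
for i in range(m):
    x_i = X[:, i:i + 1]            # slicing keeps the (3, 1) column shape
    A1_loop[:, i:i + 1] = sigmoid(W1 @ x_i + b1)

# Vectorized over all examples: b1 is broadcast across the m columns.
A1 = sigmoid(W1 @ X + b1)

print(A1.shape)                    # (4, 5): one column of activations per example
print(np.allclose(A1, A1_loop))
```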

In [None]:
# importing the necessary libraries
import numpy as np # you know what this is
import matplotlib.pyplot as plt # for plotting graphs
import h5py # for working with datasets stored in H5 files (not used in this notebook)
import sklearn # scikit-learn, a general machine learning library
import sklearn.datasets # scikit-learn's dataset utilities
import sklearn.linear_model # scikit-learn's linear models (e.g. logistic regression)
%matplotlib inline 

# loading our dataset, don't worry about the loading function
def load_planar_dataset():
    np.random.seed(1)
    m = 400 # number of examples
    N = int(m/2) # number of points per class
    D = 2 # dimensionality
    X = np.zeros((m,D)) # data matrix where each row is a single example
    Y = np.zeros((m,1), dtype='uint8') # labels vector (0 for red, 1 for blue)
    a = 4 # maximum radius of the flower

    for j in range(2):
        ix = range(N*j,N*(j+1))
        t = np.linspace(j*3.12,(j+1)*3.12,N) + np.random.randn(N)*0.2 # theta
        r = a*np.sin(4*t) + np.random.randn(N)*0.2 # radius
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        Y[ix] = j
        
    X = X.T
    Y = Y.T

    return X, Y

X, Y = load_planar_dataset() 

# visualizing the data: our goal is to build a model that defines these regions as either red or blue
plt.scatter(X[0, :], X[1, :], c=Y.ravel(), s=40, cmap=plt.cm.Spectral) # c needs one value per point, so we flatten Y from (1, m) to (m,)

# defining our neural network's architecture!
# n_x is the size (number of nodes) of the input layer, n_h is the size of the hidden layer, and n_y is the size of the output layer
def layer_sizes(X, Y):
    # X: input dataset of shape (input size, number of examples)
    # Y: labels of shape (output size, number of examples)
  
    n_x = X.shape[0] # we could hardcode this as 2 for our planar dataset, but we'll use the shape of the input so any input can be used
    n_h = 7 # we'll hardcode this as 7 for now, but you can play around with the number of hidden nodes in the final model
    n_y = Y.shape[0]
    return n_x, n_h, n_y

# defining a random parameter initialization function
def initialize_parameters(n_x, n_h, n_y):
    W1 = np.random.randn(n_h, n_x) # random.randn generates an array of random floats
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h)
    b2 = np.zeros((n_y, 1))
    
    assert (W1.shape == (n_h, n_x)) # sanity checks
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))
    
    parameters = {"W1": W1, # weight matrix of shape (n_h, n_x)
                  "b1": b1, # bias vector of shape (n_h, 1)
                  "W2": W2, # weight matrix of shape (n_y, n_h)
                  "b2": b2} # bias vector of shape (n_y, 1)
    
    return parameters

# defining a forward propagation function to pass our inputs through our network according to the architecture we've previously defined
def forward_propagation(X, parameters): # parameters is the output of our random initialization function
    # retrieving each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    
    # implementing forwardpropagation to calculate A2 (the prediction/activation of the output layer)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1) 
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    
    assert(A2.shape == (1, X.shape[1]))
    
    cache = {"Z1": Z1, # these intermediate values are reused during backpropagation to compute the gradients
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    
    return A2, cache


### Activation Functions
Up until now, we've just been using the sigmoid function as our activation function for all layers. However, there exist other activation functions that have their own uses, advantages, and drawbacks. 

Generally, the sigmoid function isn't the best choice for hidden layers: the *hyperbolic tangent*/tanh function, defined as $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, usually results in better performance there. The tanh function is essentially a rescaled and shifted version of sigmoid whose output is bounded by -1 and 1 instead of 0 and 1, so its activations are centered around 0. However, the sigmoid function is still useful when the output layer needs to output a real number between 0 and 1, such as a probability.
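
As a quick check of the relationship between the two functions, the sketch below verifies numerically that the definition above matches `np.tanh`, and that tanh can be written as a stretched and shifted sigmoid, $\tanh(x) = 2\sigma(2x) - 1$. The sample points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)   # a few arbitrary sample points

# tanh written out from its definition
tanh_from_definition = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

# tanh written as a rescaled, shifted sigmoid
tanh_from_sigmoid = 2 * sigmoid(2 * x) - 1

print(np.allclose(tanh_from_definition, np.tanh(x)))
print(np.allclose(tanh_from_sigmoid, np.tanh(x)))

# sigmoid is centered at 0.5, tanh at 0
print(sigmoid(0.0), np.tanh(0.0))
```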

A problem with both sigmoid and tanh is that, if the input is very large (positive or negative), the gradient of either function becomes very small, which can slow down gradient descent. The *rectified linear unit*/ReLU function, defined by $\max(0,x)$, avoids this for positive inputs by fixing the derivative at either 0 (for negative inputs, and by convention for 0) or 1 (for positive inputs). A less commonly used variant is the Leaky ReLU activation function, which slopes slightly downwards for negative inputs so that its derivative there is a small positive constant instead of 0.
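
The saturation effect is easy to see numerically. The sketch below evaluates the derivative of each activation at a few inputs; the helper names and the Leaky ReLU slope of 0.01 are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2

def relu_grad(x):
    # derivative is 0 for x <= 0 (by convention at 0) and 1 for x > 0
    return (x > 0).astype(float)

def leaky_relu_grad(x, slope=0.01):
    # slope=0.01 is an assumed, commonly seen choice
    return np.where(x > 0, 1.0, slope)

x = np.array([-10.0, -1.0, 0.5, 10.0])
for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                   ("ReLU", relu_grad), ("Leaky ReLU", leaky_relu_grad)]:
    print(name, grad(x))
```

At `x = 10` the sigmoid and tanh derivatives are nearly zero, while the ReLU derivative is still 1.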

The rule of thumb to remember is that ReLU/tanh is usually the best choice for hidden layers, while sigmoid should be used for the output layer in binary classification problems.

In [None]:
# sigmoid
def sigmoid(x):
    s = 1/(1+np.exp(-x))
    return s
  
# tanh can be called with np.tanh(x)!

# ReLU!
def ReLU(Z):  
    A = np.maximum(0,Z)
    return A

### Backpropagation and Gradient Descent
If we use the network defined in "Neural Networks and Vectorization", our parameters to train will be $W^{[1]}$, $b^{[1]}$, $W^{[2]}$, and $b^{[2]}$. We'll use the following cost function (cross-entropy loss): $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m}\sum^{m}_{i=1} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$, where $\hat{y}^{(i)} = a^{[2](i)}$ and $\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -(y^{(i)}\log(a^{[2](i)}) + (1-y^{(i)})\log(1-a^{[2](i)}))$.
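
To get a feel for how this cost behaves, here is a small sketch that evaluates it on hand-picked predictions. The function name `cross_entropy_cost` and the example values are illustrative assumptions; the formula is the one given above.

```python
import numpy as np

def cross_entropy_cost(A2, Y):
    # A2: predicted probabilities, shape (1, m); Y: labels, shape (1, m)
    m = Y.shape[1]
    losses = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return float(np.sum(losses) / m)

Y = np.array([[1, 0, 1, 0]])
confident_right = np.array([[0.9, 0.1, 0.9, 0.1]])
unsure = np.array([[0.5, 0.5, 0.5, 0.5]])
confident_wrong = np.array([[0.1, 0.9, 0.1, 0.9]])

print(cross_entropy_cost(confident_right, Y))  # equals -log(0.9)
print(cross_entropy_cost(unsure, Y))           # equals log(2)
print(cross_entropy_cost(confident_wrong, Y))  # equals -log(0.1)
```

Confident wrong predictions are penalized far more heavily than unsure ones, which is what pushes the network towards well-calibrated outputs.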

Our weights will be initialized randomly (because we don't want every single hidden unit to calculate the same function!), and we'll compute predictions for all $m$ examples. Then, we can calculate the partial derivatives of the cost function $J$ and update each parameter accordingly (e.g. $W^{[1]} := W^{[1]} - \alpha \frac{\partial J}{\partial W^{[1]}}$, where $\alpha$ is the learning rate). The gradient descent loop will repeat until our parameters look like they're converging towards a minimum of the cost function.
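
The reason for random initialization can be illustrated with a forward pass alone. In the sketch below (made-up data, 4 hidden units, and a constant of 0.5 chosen purely for illustration), hidden units that start with identical weights produce identical activations for every example. If their outgoing weights also match, they receive identical gradient updates and never become different from each other.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))       # 2 features, 5 examples (illustrative)

def hidden_activations(W1, b1, X):
    return np.tanh(W1 @ X + b1)

b1 = np.zeros((4, 1))

# Constant initialization: every hidden unit starts with the same weights.
W1_const = np.full((4, 2), 0.5)
A1_const = hidden_activations(W1_const, b1, X)

# Random initialization: every hidden unit starts with different weights.
W1_rand = rng.standard_normal((4, 2))
A1_rand = hidden_activations(W1_rand, b1, X)

# Compare every row (hidden unit) against the first row.
print(np.allclose(A1_const, A1_const[0]))  # all units compute the same thing
print(np.allclose(A1_rand, A1_rand[0]))    # units differ
```
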
Here are the equations for the partial derivatives with respect to each parameter, which you'll implement yourself in the code cell below between the "INSERT CODE HERE" comments (note that the full partial derivative notation will be shortened to, say, $dW^{[1]}$ for the first weight matrix):
- $dW^{[1]} = \frac{1}{m}dZ^{[1]}X^T$
- $db^{[1]} = \frac{1}{m}$ np.sum($dZ^{[1]}$, axis = 1, keepdims = True)
- $dZ^{[1]} = W^{[2]T}dZ^{[2]}\star g^{[1]\prime} (Z^{[1]})$ ($\star$ denotes elementwise multiplication, and the prime denotes the derivative of the first layer's activation function $g^{[1]}$)
- $dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]T}$
- $db^{[2]} = \frac{1}{m}$ np.sum($dZ^{[2]}$, axis = 1, keepdims = True)
- $dZ^{[2]} = A^{[2]} - Y$
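
As a short worked derivation of the last equation, consider a single example with output activation $a = \sigma(z)$ and loss $\mathcal{L} = -(y\log a + (1-y)\log(1-a))$ (superscripts are dropped here for readability). The chain rule gives:

$$\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z} = a(1-a)$$

$$\frac{\partial \mathcal{L}}{\partial z} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right)a(1-a) = -y(1-a) + (1-y)a = a - y$$

Stacking this result over all $m$ examples gives $dZ^{[2]} = A^{[2]} - Y$. Similarly, when the hidden layer uses tanh (as the forward propagation code in this notebook does), the derivative needed for $dZ^{[1]}$ is $g^{[1]\prime}(Z^{[1]}) = 1 - \tanh^2(Z^{[1]}) = 1 - (A^{[1]})^2$, where the square is taken elementwise.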

In [None]:
def compute_cost(A2, Y, parameters):
    
    m = Y.shape[1] # number of examples
    logprobs = np.sum((Y*np.log(A2)+(1 - Y)*np.log(1 - A2)))
    cost = -logprobs/m
    
    cost = float(np.squeeze(cost))  
    assert(isinstance(cost, float))
    
    return cost

# defining a function that performs backpropagation (in other words, calculates the partial derivatives of the cost function with respect to each parameter)
def backward_propagation(parameters, cache, X, Y): # parameters is from initialize_parameters(), and cache is from forward_propagation()
    m = X.shape[1] 

    # retrieving weights
    W1 = parameters["W1"] 
    W2 = parameters["W2"]
    
    # retrieving cached activations
    A1 = cache["A1"]
    A2 = cache["A2"]
    
    # backpropagation!
    # INSERT CODE HERE
    dZ2 = None
    dW2 = None
    db2 = None
    dZ1 = None
    dW1 = None
    db1 = None
    # INSERT CODE HERE
    
    grads = {"dW1": dW1, # dictionary containing the partial derivatives with respect to each parameter
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads

# next, we'll implement our update rules for gradient descent
def update_parameters(parameters, grads, learning_rate = 1.2):
    # retrieving parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # retrieving gradients
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]
    
    # updating each parameter
    W1 = W1 - learning_rate*dW1
    b1 = b1 - learning_rate*db1
    W2 = W2 - learning_rate*dW2
    b2 = b2 - learning_rate*db2
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters # dictionary containing updated parameters 

### Putting It All Together
Using the functions we've defined so far and the network defined in the previous notebook as reference, define a function called model() that takes in the training data X and Y, n_h (the number of hidden nodes), num_iterations (the number of iterations to train for; you can choose this), and a boolean variable *print_cost* that decides whether the function prints the cost every 1000 iterations. It should return a dictionary containing the final parameters of the model. After that, you should run the function and assign its return value to a new variable called `parameters`, which the later cells rely on. A template for this function is provided below, along with the code for cost printing. (Warning: Training may take a few minutes, so just be patient.)

Don't worry if you're unable to define the model() function; the fully defined function is provided under the next text cell, "Check Your Work".

In [None]:
def model(X, Y, n_h, num_iterations, print_cost):
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]
    
    # INSERT CODE HERE: initialize your parameters here!

    # gradient descent
    for i in range(0, num_iterations):
        # INSERT CODE HERE: perform forwardpropagation here! you should assign the return values to new variables "A2" and "cache"
        
        # INSERT CODE HERE: compute the cost here! you should assign the returned value to a new variable "cost"
 
        # INSERT CODE HERE: compute the gradient of the cost function here! you should assign the returned value to a new variable "grads"

        # INSERT CODE HERE: update the parameters here! you should assign the returned value to a new variable "parameters"
        
        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))

    return parameters

In [None]:
# INSERT CODE HERE: run your function here by assigning the return value of model() to a variable named "parameters", passing all of the required arguments (pick 100 for n_h and any reasonable value around 10000 for num_iterations)

# defining a prediction function to test your network!
def predict(parameters, X):
    A2, cache = forward_propagation(X, parameters)
    predictions = (A2 > 0.5)
    
    return predictions

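# testing your network! this prints the fraction of points predicted as class 1, which should be around 0.5 because the two classes are balanced
predictions = predict(parameters, X)
print("predictions mean = " + str(np.mean(predictions)))
# the accuracy is the fraction of predictions that match the labels
print("accuracy = " + str(np.mean(predictions == Y)))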

# plotting the decision boundary, you're not expected to understand the workings of this function
def plot_decision_boundary(model, X, y):
    # Set min and max values and give it some padding
    x_min, x_max = X[0, :].min() - 1, X[0, :].max() + 1
    y_min, y_max = X[1, :].min() - 1, X[1, :].max() + 1
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = model(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x2')
    plt.xlabel('x1')
    plt.scatter(X[0, :], X[1, :], c=y.ravel(), cmap=plt.cm.Spectral)
    
plot_decision_boundary(lambda x: predict(parameters, x.T), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(parameters["W1"].shape[0]))

### Check Your Work (Do not open unless finished)

In [None]:

















# backpropagation calculations!
dZ2 = A2 - Y
dW2 = (np.dot(dZ2, A1.T))/m
db2 = (np.sum(dZ2, axis = 1, keepdims = True))/m
dZ1 = np.dot(W2.T, dZ2)*(1 - np.power(A1, 2)) # matrix product with W2.T, then elementwise product with tanh'(Z1) = 1 - A1^2
dW1 = (np.dot(dZ1, X.T))/m
db1 = (np.sum(dZ1, axis = 1, keepdims = True))/m

In [None]:
# the final model
def model(X, Y, n_h, num_iterations, print_cost):
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]
    
    parameters = initialize_parameters(n_x, n_h, n_y)

    for i in range(0, num_iterations):
        A2, cache = forward_propagation(X, parameters)

        cost = compute_cost(A2, Y, parameters)
 
        grads = backward_propagation(parameters, cache, X, Y)

        parameters = update_parameters(parameters, grads)
        
        if print_cost and i % 1000 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))

    return parameters

parameters = model(X, Y, n_h = 7, num_iterations = 10000, print_cost = True)