<a href="https://colab.research.google.com/github/shIsmael/DeepLearning/blob/main/Deep_Learning_Deep_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Instructions:
1. Make a copy of this notebook by opening the "File" tab and selecting "Save a copy in Drive"
2. Close this tab and move to your copy of this notebook
3. Follow the written guides within this notebook
4. If instructed to, add your own code in the corresponding cell

After completing this notebook, you will have:
- An understanding of what deep neural networks are
- An understanding of how to verify matrix/vector dimensions 
- Implemented cache to pass information from forwardpropagation to backpropagation
- Implemented functions for forwardpropagation and backpropagation
- An understanding of hyperparameters
- Built a deep neural network to classify cats from non-cats!




### Deep Neural Networks and Notation
In the context of deep learning, "depth" refers to the number of layers a neural network has (this is a *hyperparameter*, which we'll cover in this notebook). The logistic regression neural network (1 layer) and the shallow neural network (2 layers; remember, we don't count the input layer) that we built in the last notebook are both quite shallow. A neural network is essentially able to learn a nonlinear function based on its architecture and training data. By introducing more layers ($\geq 3$), we can build networks that can model more complex functions to accomplish more complex tasks. We'll cover how to choose the appropriate number of layers for a given task later. Here is an example of a 3-layer neural network:

![neuralnet](https://victorzhou.com/media/nn-series/network.png)

Notation-wise:
- We'll denote the number of layers in a network (again, not including the input layer) as $L$; thus, $\hat{y}$ is the activation of layer $L$, i.e. $\hat{y} = a^{[L]}$
- We'll use $n^{[l]}$ to denote the number of nodes in layer $l$. (In the above example, $n^{[2]} = 6$)
- The number of nodes in the input layer is $n^{[0]} = n_x$
- The activations in layer $l$ will be denoted by $a^{[l]}$; thus, $a^{[l]} = g^{[l]}(z^{[l]})$, where $g^{[l]}$ is the activation function for layer $l$
- The weights and biases of layer $l$ are denoted by $W^{[l]}$ and $b^{[l]}$ (used to compute $z^{[l]}$)
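
As a quick illustration of this notation, the sketch below derives $L$ and each $n^{[l]}$ from a list of layer sizes. The list `layer_dims` and its values are hypothetical, chosen only for illustration (they are not read from the figure above), and the parameter count assumes that $W^{[l]}$ has one weight per pair of nodes in layers $l$ and $l-1$ and that $b^{[l]}$ has one bias per node.

```python
# Hypothetical layer sizes [n^[0], n^[1], n^[2], n^[3]] (illustrative values only)
layer_dims = [4, 5, 6, 1]

L = len(layer_dims) - 1  # number of layers, not counting the input layer
n_x = layer_dims[0]      # n^[0] = n_x

print("L =", L, "and n_x =", n_x)
for l in range(1, L + 1):
    n_l, n_prev = layer_dims[l], layer_dims[l - 1]
    # assumed: W^[l] holds n^[l] * n^[l-1] weights and b^[l] holds n^[l] biases
    n_params = n_l * n_prev + n_l
    print(f"layer {l}: n^[{l}] = {n_l}, parameters in W^[{l}] and b^[{l}]: {n_params}")
```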

In [None]:
# importing
import time 
import numpy as np
import h5py # used to retrieve the dataset, which is stored on an h5 file
import matplotlib.pyplot as plt # for plotting graphs
import scipy # used for post-training testing
from PIL import Image # used for post-training testing
from scipy import ndimage # used for post-training test

# "magic" commands for configuring matplotlib plots that you don't have to worry about :)
%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(1) # setting a fixed seed is used for reproducibility

# download the datasets from https://drive.google.com/drive/folders/1R5kzlNNvhABEm2oCpxwXsmmV48Vp1kQQ?usp=sharing and import them into this notebook when prompted
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn]))) 

# we'll be using the same cat vs non-cat dataset as seen in "Logistic Regression with a Neural Network"!
def load_data():
    train_dataset = h5py.File('train_catvnoncat.h5', "r") # we're opening this h5 file in read ("r") mode
    train_set_x_orig = np.array(train_dataset["train_set_x"][:]) # your train set features
    train_set_y_orig = np.array(train_dataset["train_set_y"][:]) # your train set labels

    test_dataset = h5py.File('test_catvnoncat.h5', "r")
    test_set_x_orig = np.array(test_dataset["test_set_x"][:]) # your test set features
    test_set_y_orig = np.array(test_dataset["test_set_y"][:]) # your test set labels

    classes = np.array(test_dataset["list_classes"][:]) # the list of classes
    
    train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
    test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
    
    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes

# loading our data
train_x_orig, train_y, test_x_orig, test_y, classes = load_data()

# Explore your dataset 
m_train = train_x_orig.shape[0] # number of training examples
num_px = train_x_orig.shape[1] # this should be 64, meaning that our images are 64x64x3
m_test = test_x_orig.shape[0] # number of testing examples

print("train_x_orig shape: " + str(train_x_orig.shape))
print("train_y shape: " + str(train_y.shape))
print("test_x_orig shape: " + str(test_x_orig.shape))
print("test_y shape: " + str(test_y.shape))

# reshaping and normalizing our examples 
train_x_flatten = train_x_orig.reshape(train_x_orig.shape[0], -1).T   # The "-1" makes reshape flatten the remaining dimensions
test_x_flatten = test_x_orig.reshape(test_x_orig.shape[0], -1).T

# scaling pixel values from the range [0, 255] to the range [0, 1]
train_x = train_x_flatten/255.
test_x = test_x_flatten/255.

print(train_x.shape)
print(test_x.shape)

# with our dataset loaded and normalized, we can begin implementing the functions for forwardpropagation, backpropagation, and gradient descent!
# these functions will be applicable to neural networks of any size, unlike the functions we defined in the previous notebooks

### Why Use Deep Neural Networks?
For the sake of gaining intuition as to why we use deep neural networks, let's consider a type of neural network architecture that we'll cover in Chapter 6: a *convolutional neural network* (CNN). CNNs, originally created for image-related tasks, use *convolutional layers* as "filters" to extract image features. Therefore, starting from the first layer onwards, we can think of each layer as successively specializing in more complex tasks. For example, the first layer might recognize all of the edges in an image, the second might recognize corners, the third might recognize basic facial features (e.g. eyes, lips), the fourth might recognize portions of faces, and so on. This type of compositional representation applies to other types of data as well. For example, you might start by detecting basic waveform features in the case of auditory data. 

You can also think of deep neural networks by analogy with Taylor polynomials: just as adding more terms gives a closer fit to the original (target) function, adding more layers lets a network fit a target function that may be extremely complex. As for choosing the number of hidden layers to use, you should try working up from a "baseline" number of hidden layers for a given task, though very deep architectures have been the best models for certain tasks in recent years. 

### Forwardpropagation and Backpropagation
Given a training example $x$ (or $a^{[0]}$), the general equations for forwardpropagation in a forward function are $z^{[l]} = W^{[l]}a^{[l-1]}+b^{[l]}$; $a^{[l]} = g^{[l]}(z^{[l]})$. For a training set $X$, the vectorized equations are $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$; $A^{[l]} = g^{[l]}(Z^{[l]})$, with $A^{[0]} = X$ (to recap: we're stacking our training examples horizontally, as columns, to perform forwardpropagation on the entire set at once). Although vectorization removes the loop over training examples, these equations still have to be computed layer by layer in a for loop, because each layer's input is the previous layer's output; this loop is unavoidable. In addition, when computing the activations of layer $l$, we'll be "caching" the linear computation $Z^{[l]}$ for use in backpropagation (computing the gradient of the cost function). 
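
The layer-by-layer loop with caching can be sketched on a tiny network as follows. This is a minimal sketch, not the notebook's implementation: the layer sizes, the random parameters, and the helper names `demo_relu` and `demo_sigmoid` are assumptions made for illustration, and the input to the first layer is taken to be $A^{[0]} = X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny network: 3 input features, hidden layers of 4 and 2 nodes, 1 output node
layer_dims = [3, 4, 2, 1]
m = 5                                         # number of training examples
X = rng.standard_normal((layer_dims[0], m))   # examples stacked as columns

# Random parameters, only so that the sketch runs
W = {l: rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
     for l in range(1, len(layer_dims))}
b = {l: np.zeros((layer_dims[l], 1)) for l in range(1, len(layer_dims))}

def demo_relu(Z):
    return np.maximum(0, Z)

def demo_sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

A = X                      # A^[0] = X
Z_cache = {}               # Z^[l] for every layer, kept for backpropagation
L = len(layer_dims) - 1
for l in range(1, L + 1):  # the unavoidable loop over layers
    Z = W[l] @ A + b[l]    # Z^[l] = W^[l] A^[l-1] + b^[l]
    Z_cache[l] = Z
    A = demo_sigmoid(Z) if l == L else demo_relu(Z)  # A^[l] = g^[l](Z^[l])
    print(f"layer {l}: Z shape {Z.shape}, A shape {A.shape}")
```

Every `Z` and `A` has one column per training example, which is what lets a single matrix product handle the whole training set.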

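For backpropagation at layer $l$ (note that we'll write $dZ^{[l]}$ as shorthand for the partial derivative of the cost with respect to $Z^{[l]}$, and likewise for $dW^{[l]}$, $db^{[l]}$, and $dA^{[l]}$), we can use these four vectorized equations for implementing our backward function:
- $dZ^{[l]} = dA^{[l]} \star g^{[l]\prime}(Z^{[l]})$, where $\star$ denotes elementwise multiplication and $g^{[l]\prime}$ denotes the derivative of $g^{[l]}$, the activation function of layer $l$; note that we're using our cached $Z^{[l]}$ from our forward function
- $dW^{[l]} = \frac{1}{m} dZ^{[l]} \cdot A^{[l-1]T}$
- $db^{[l]} = \frac{1}{m}\sum^{m}_{i=1} dZ^{[l](i)}$, where $dZ^{[l](i)}$ is the $i$-th column of $dZ^{[l]}$
- $dA^{[l-1]} = W^{[l]T}\cdot dZ^{[l]}$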
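
One way to build confidence in backward equations like these is a numerical gradient check. The sketch below is an illustration under stated assumptions, not part of the notebook's code: it uses a single sigmoid layer with made-up sizes and random data, the cross-entropy cost averaged over the $m$ examples, and gradients for $W$ and $b$ that are averaged over the $m$ examples (as the `linear_backward` function in this notebook does). It compares the analytic gradients against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n, m = 3, 1, 4                       # hypothetical sizes
A_prev = rng.standard_normal((n_prev, m))    # activations from the previous layer
Y = rng.integers(0, 2, size=(n, m)).astype(float)
W = rng.standard_normal((n, n_prev))
b = np.zeros((n, 1))

def cost(W, b):
    # averaged cross-entropy cost of a single sigmoid layer
    A = 1 / (1 + np.exp(-(W @ A_prev + b)))
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

# Analytic gradients
Z = W @ A_prev + b
A = 1 / (1 + np.exp(-Z))
dA = -(Y / A - (1 - Y) / (1 - A))            # derivative of the cost terms with respect to A
dZ = dA * A * (1 - A)                        # dZ = dA * g'(Z), with g = sigmoid
dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m

# Numerical gradients via central differences
eps = 1e-6
dW_num = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    dW_num[idx] = (cost(W_plus, b) - cost(W_minus, b)) / (2 * eps)
db_num = (cost(W, b + eps) - cost(W, b - eps)) / (2 * eps)

print("max |dW - dW_num| =", np.max(np.abs(dW - dW_num)))
print("|db - db_num|     =", abs(db[0, 0] - db_num))
```

If the analytic equations are implemented correctly, both differences should be tiny compared with the gradients themselves.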

The forward and backward functions that you'll need to build your first deep neural network are implemented below. 

You will be using the cross-entropy cost function, defined as:
$J = -\frac{1}{m}\sum^{m}_{i=1}\left(y^{(i)}\log(a^{[L](i)})+(1-y^{(i)})\log(1-a^{[L](i)})\right)$
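
To get a feel for this cost, the sketch below evaluates it on four made-up examples. The helper name `cross_entropy` and the prediction values are hypothetical and serve only as an illustration; the function is a direct transcription of the formula, without any protection against predictions of exactly 0 or 1.

```python
import numpy as np

def cross_entropy(AL, Y):
    # AL: predictions a^[L] of shape (1, m); Y: true labels of shape (1, m)
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m)

Y = np.array([[1, 0, 1, 0]])                  # true labels
AL_good = np.array([[0.9, 0.2, 0.6, 0.4]])    # predictions that lean the right way
AL_bad = np.array([[0.1, 0.8, 0.4, 0.6]])     # predictions that lean the wrong way

print("cost for good predictions:", round(cross_entropy(AL_good, Y), 4))
print("cost for bad predictions: ", round(cross_entropy(AL_bad, Y), 4))
```

The cost is small when the predicted probabilities agree with the labels and grows as they disagree, which is why driving it down during training improves the predictions.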

The update rule can be generalized as
$W^{[l]} := W^{[l]} - \alpha \: dW^{[l]}$; $b^{[l]} := b^{[l]} - \alpha \: db^{[l]}$

In [None]:
# firstly, let's define the activation functions and activation function derivative that we'll need to define our forward/backward function and our network

# sigmoid! note that we are "caching" Z for later use in the computation of dZ
def sigmoid(Z): # Z is the linear computation Wx+b    
    A = 1/(1+np.exp(-Z))
    cache = Z 
    return A, cache

# backpropagation for a single sigmoid unit, where dA is the partial derivative of an activation, cache is the linear computation Z, and 
# dZ is the partial derivative of the cost with respect to Z
def sigmoid_backward(dA, cache):    
    Z = cache
    s = 1/(1+np.exp(-Z)) # don't worry about this derivation
    dZ = dA * s * (1-s)
    assert (dZ.shape == Z.shape) # sanity check
    return dZ

# relu (rectified linear unit)! again, note that we're storing Z for later use
def relu(Z):    
    A = np.maximum(0,Z)   
    assert(A.shape == Z.shape)
    cache = Z 
    return A, cache

# backpropagation for a single relu unit, where dA is the partial derivative of an activation, cache is the linear computation Z, and 
# dZ is the partial derivative of the cost with respect to Z
def relu_backward(dA, cache):    
    Z = cache
    dZ = np.array(dA, copy=True) # don't worry about this derivation
    dZ[Z <= 0] = 0
    assert (dZ.shape == Z.shape)
    return dZ

# we can then implement a function to initialize our parameters for a given number of layers (quantified by L)
def initialize_parameters_deep(layer_dims): # layer_dims will be a list containing the dimensions of each layer in our network    
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)  # number of entries in layer_dims (the input layer plus every layer of the network)

    for l in range(1, L): # iterating over all layers
        # parameters is a dictionary that stores the weights and biases of each layer
        parameters['W' + str(l)] = np.random.randn(layer_dims[l],layer_dims[l-1])*0.01 # weight matrix of shape (layer_dims[l], layer_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layer_dims[l],1)) # bias vector of shape (layer_dims[l], 1)
        # after this loop is finished, we can call, say, W1 and retrieve the weights for layer 1

        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1])) # sanity checks
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

    return parameters

# we can now implement our forwardpropagation module! 

#let's start by implementing a function that computes the vectorized version of Wx+b with caching:
def linear_forward(A, W, b): # where A is the activation from the previous layer of shape (size of previous layer, number of examples)
# use the previous notebook as reference for the shapes of the weights and biases    
    Z = np.dot(W,A)+b # linear computation
    assert(Z.shape == (W.shape[0], A.shape[1])) # sanity check
    cache = (A, W, b) # cache (a tuple) will store these values for later use in backpropagation
    
    return Z, cache

# we'll implement a function that applies the given activation function to linear_forward
def linear_activation_forward(A_prev, W, b, activation): # where activation is a string specifying which activation function to use
# and where A_prev is the activations from the previous layer in the same shape as specified in linear_forward
    Z, linear_cache = linear_forward(A_prev, W, b) # the linear computation is the same for both activation functions
    if activation == "sigmoid": # apply the sigmoid function and store the activation cache, as specified when we defined sigmoid
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
    else: # fail early with a clear message instead of hitting an undefined variable below
        raise ValueError('activation must be "sigmoid" or "relu", got: ' + str(activation))
    
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache) # a tuple containing linear_cache and activation_cache for computing backpropagation efficiently

    return A, cache

# finally, we can implement a function that performs vectorized forwardpropagation throughout the entire network
# every hidden layer will use the ReLU activation function, and the output layer will use the sigmoid activation function, since we're doing binary classification
def L_model_forward(X, parameters): # where X has the shape/dimensions (input size, number of examples) and parameters is the output of initialize_parameters_deep
    caches = [] 
    A = X # redefining X as the activation of the input layer
    L = len(parameters) // 2  # number of layers in the neural network
    
    for l in range(1, L): # iterating over every layer aside from the last layer, with "l" representing the current layer
        A_prev = A # redefining the activation as the activation of the previous layer
        A, cache = linear_activation_forward(A_prev,parameters['W'+str(l)],parameters['b' + str(l)],"relu") # using our function for the linear computation and activation
        caches.append(cache) # adding to our cache for use in backpropagation
    
    # manually performing the linear -> activation for the final layer using the sigmoid activation function
    AL, cache = linear_activation_forward(A,parameters['W'+str(L)],parameters['b'+str(L)],"sigmoid")
    caches.append(cache) # adding to cache
    
    assert(AL.shape == (1,X.shape[1])) # sanity check
            
    return AL, caches # where AL is the last activation value (in other words, yhat)

# we can now implement our cost function to check if our model is actually learning:
def compute_cost(AL, Y): # where Y is a vector containing the true labels (0 if non-cat, 1 if cat) in the shape (1, num of examples)
    m = Y.shape[1] # number of examples
    # computing the cost, as defined in the equation above
    cost = -(np.sum(Y*np.log(AL)+(1-Y)*np.log(1-AL)))/m  
    cost = np.squeeze(cost) # making sure that the shape of cost is what we want
    assert(cost.shape == ()) # sanity check
    return cost

# with our forwardpropagation functions and cost function defined, we can implement our backpropagation functions!

# firstly, we'll implement a function to calculate dW, db, and dA_prev
def linear_backward(dZ, cache): #dZ is defined by the equation above and cache is a tuple of the values (A_prev, W, b)
    A_prev, W, b = cache
    m = A_prev.shape[1] # number of examples

    # the equations for these computations are defined above
    dW = (np.dot(dZ, A_prev.T))/m # we're dividing by m to average the gradient over the m examples
    db = (np.sum(dZ, axis = 1, keepdims = True))/m
    dA_prev = np.dot(W.T, dZ)
    
    assert (dA_prev.shape == A_prev.shape) # sanity check
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    
    return dA_prev, dW, db

# next, we'll define a function that merges linear_backward and the backward activation functions that we defined earlier
def linear_activation_backward(dA, cache, activation): # where dA is the activation gradient of layer l and activation is a string specifying the desired activation
    linear_cache, activation_cache = cache # note that cache is a tuple of values (linear_cache, activation_cache)
    
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache) # we're using the backward activation functions to obtain the partial derivative dZ for use in linear_backward
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache) 
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    
    return dA_prev, dW, db

# finally, we can implement linear_activation_backward for the whole network
# we'll use the cache from each layer from L_model_forward to backpropagate (compute derivatives/gradients) through each layer l
def L_model_backward(AL, Y, caches): # caches is from L_model_forward
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    
    # initializing the backpropagation by computing the partial derivative of the cost function with respect to AL (activation of the last layer)
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # you don't need to know how this is derived
    
    # computing the last (Lth) layer's partial derivatives
    current_cache = linear_activation_backward(dAL, caches[L-1], "sigmoid")
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = current_cache
    
    # iterating over all other layers
    for l in reversed(range(L-1)):
        # lth layer partial derivatives
        # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)] 
        current_cache = linear_activation_backward(grads["dA"+str(l+1)], caches[l], "relu") # using linear_activation_backward to compute the necessary partial derivatives
        dA_prev_temp, dW_temp, db_temp = current_cache 
        grads["dA" + str(l)] = dA_prev_temp # adding the partial derivatives to the gradient dictionary
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads # dictionary containing every partial derivative

# finally, we can implement the update rule for all layers using the partial derivatives from grads
def update_parameters(parameters, grads, learning_rate): # parameters and grads are both dictionaries    
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for l in range(L): # iterating over all layers
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate*grads["dW"+str(l+1)] # updating using the formula above
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate*grads["db"+str(l+1)]
    return parameters # return updated parameters

# you won't need to manually implement the forward and backward functions once you start using a deep learning library (e.g. TensorFlow or PyTorch), but you're seeing 
# this manual implementation to gain some intuition as to what those libraries are actually doing behind the scenes :)



### Matrix Dimensions
A common debugging strategy for verifying the validity of your inputs and your neural network architecture is manually verifying your matrix dimensions.

For the architecture presented in the graphic in "Deep Neural Networks and Notation", the input dimensions are (4, 1) (more generally, the dimensions are ($n^{[0]}$, 1)), and the first layer's weight matrix dimensions are (6, 4), i.e. ($n^{[1]}$, $n^{[0]}$). If we perform the matrix multiplication $W^{[1]}x$, our product should be a (6, 1) vector: the inner dimensions (4 and 4) must match, and the result takes the outer dimensions. 

We can determine the shape of the weights of a layer $l$ by using the number of nodes in layer $l$ and the number of nodes in layer $l-1$: $W^{[l]} : (n^{[l]}, n^{[l-1]})$. Similarly, we can determine the shape of a given layer's bias vector with $b^{[l]} : (n^{[l]}, 1)$. During backpropagation, $dW^{[l]}$ and $db^{[l]}$ should have the same shapes as $W^{[l]}$ and $b^{[l]}$, respectively. 

Using the properties given above, try working out the shapes in a vectorized implementation across $m$ training examples!
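
One way to check your answer is to let NumPy report the shapes. The sketch below is only an illustration: it reuses the layer sizes of the sample 4-layer model defined later in this notebook, picks a hypothetical number of examples `m`, and fills every array with zeros, since only the shapes matter here.

```python
import numpy as np

layer_dims = [12288, 20, 7, 5, 1]  # layer sizes of the sample 4-layer model in this notebook
m = 10                             # hypothetical number of training examples

A = np.zeros((layer_dims[0], m))   # A^[0] = X has shape (n^[0], m)
for l in range(1, len(layer_dims)):
    W = np.zeros((layer_dims[l], layer_dims[l - 1]))  # W^[l]: (n^[l], n^[l-1])
    b = np.zeros((layer_dims[l], 1))                  # b^[l]: (n^[l], 1), broadcast across the m columns
    Z = W @ A + b                                     # Z^[l]: (n^[l], m)
    assert Z.shape == (layer_dims[l], m)
    print(f"layer {l}: W {W.shape}, b {b.shape}, Z and A {Z.shape}")
    A = Z  # activations are applied elementwise, so A^[l] has the same shape as Z^[l]
```

Note that the parameter shapes do not depend on `m`; only `Z` and `A` gain one column per training example.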

### Hyperparameters
The parameters in a neural network are the various weights and biases used in each layer of the network, which are iteratively updated to improve the network's performance via backpropagation and gradient descent. However, there are other parameters called *hyperparameters* that we need to set before training a neural network, an example of which is the *learning rate* $\alpha$ (which you've seen in previous notebooks). 

The learning rate controls the magnitude of our updates; if we set $\alpha$ too high, our updates are likely to overshoot the minimum of the cost function (and may even diverge), and if we set it too low, training will converge very slowly. Another hyperparameter is the number of iterations we should train for. If we train our network for too long, our network might become unable to make accurate predictions for examples outside of our training dataset through what is known as *overfitting*. Conversely, if we don't train long enough, our network might not be able to make accurate predictions on even the examples in the training dataset. 
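
The effect of the learning rate can be seen on a toy problem. The sketch below is an illustration only: instead of a neural network's cost, it minimizes the one-dimensional function $J(w) = w^2$ (whose minimum is at $w = 0$) with the update rule $w := w - \alpha \: dw$, and the helper name `gradient_descent` and the three values of $\alpha$ are hypothetical choices.

```python
def gradient_descent(alpha, w=1.0, steps=20):
    # minimizes J(w) = w**2 starting from w = 1.0
    for _ in range(steps):
        dw = 2 * w          # derivative of J with respect to w
        w = w - alpha * dw  # the update rule
    return w

for alpha in (0.001, 0.1, 1.1):
    print(f"alpha = {alpha}: w after 20 steps = {gradient_descent(alpha):.4f}")
```

With the smallest rate, `w` has barely moved from its starting point after 20 steps; with the middle rate, it is close to the minimum at 0; with the largest rate, every update overshoots the minimum by more than the previous one, so `w` grows in magnitude instead of shrinking.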

More examples of hyperparameters include the number of hidden layers and the number of nodes within those hidden layers. Generally, more hidden layers mean that a network can model more complex functions, but it has to train for a longer period of time to achieve reasonable accuracy. 

The choice of activation function for each layer is also commonly treated as a hyperparameter, although it is a discrete design choice rather than a numeric value that can be tuned in small increments. Since applied deep learning is an empirical process (i.e. the process is a cycle of idea -> code -> experiment), hyperparameters can be systematically tuned by hand. However, you'll find that, as your neural network changes, the best choice of hyperparameters will change accordingly. Luckily, many deep learning libraries (e.g. Keras) offer tools for automatically searching over hyperparameters. 

Later in this course, we'll introduce more hyperparameters, such as momentum, minibatch size, and regularization hyperparameters. 



### Putting It All Together
Using the functions we've defined so far, define a function called model() that takes in train_x, train_y, layers_dims (a list containing the dimensions of each layer), the learning rate (a float), the number of iterations to train for (you can choose this), and a boolean variable *print_cost* that decides whether the function prints the cost every 100 iterations. It should return a dictionary containing the final parameters of the model. After that, you should run the function and assign its return value to a new variable. A template for this function is provided below (insert your code where indicated), along with the code for cost printing and a sample list for a 4-layer neural network. (Warning: training may take a few minutes, so just be patient.)

Don't worry if you're unable to define the model() function; the fully defined function is provided under the next text cell, "Check Your Work". 

In [None]:
layers_dims = [12288, 20, 7, 5, 1] # sample list for 4-layer model, change as you wish as long as the first and last entries remain constant


def model(X, Y, layers_dims, learning_rate, num_iterations, print_cost):
    costs = [] # keeping track of cost
    
    # INSERT CODE: initialize your parameters with layers_dims!
    
    # gradient descent loop!
    for i in range(0, num_iterations): # iterating for the specified number
        # INSERT CODE: forwardpropagation! you should be storing the returned values in new variables AL and caches
        
        # INSERT CODE: compute the cost! you should be storing the returned value in a new variable cost
    
        # INSERT CODE: perform backpropagation! you should be storing the returned value in a new variable grads
 
        # INSERT CODE: update the parameters! you should be storing the returned value in a new variable parameters
                
        # print and record the cost every 100 iterations if print_cost is True
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    return parameters

# default values for the function parameters if you don't know what to choose: learning_rate = 0.0075, num_iterations = 3000, print_cost = True

# INSERT CODE: train the model by assigning it to a new variable named parameters!

In [None]:
# for fun: use your first deep neural network to classify an image of your choice!
from google.colab import files # upload your image (in .jpg format) here
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn]))) 

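my_image = "insertfilename.jpg" # change this string to the name of your image file
my_label_y = [1] # insert the true class of your image (1 is cat, 0 is non-cat)

# predicting!
# note: ndimage.imread and scipy.misc.imresize have been removed from recent SciPy releases, so we load and resize the image with PIL instead
fname = my_image # files.upload() saves the uploaded file to the current working directory
image = Image.open(fname).convert("RGB")
my_image = np.array(image.resize((num_px, num_px))).reshape((num_px*num_px*3, 1)) # flatten to a column vector, like the training data
my_image = my_image/255.
# no predict() helper is defined in this notebook, so we run forwardpropagation directly and threshold the output at 0.5
AL, _ = L_model_forward(my_image, parameters)
my_predicted_image = (AL > 0.5).astype(int)
print("Prediction matches your label: " + str(int(np.squeeze(my_predicted_image)) == my_label_y[0]))

plt.imshow(image)
print ("y = " + str(np.squeeze(my_predicted_image)) + ", your deep neural network predicts a \"" + classes[int(np.squeeze(my_predicted_image))].decode("utf-8") +  "\" picture!")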

### Check Your Work

In [None]:
layers_dims = [12288, 20, 7, 5, 1]
def model(X, Y, layers_dims, learning_rate, num_iterations, print_cost):
    costs = []                     

    parameters = initialize_parameters_deep(layers_dims)

    for i in range(0, num_iterations):
        AL, caches = L_model_forward(X, parameters)

        cost = compute_cost(AL, Y)

        grads = L_model_backward(AL, Y, caches)

        parameters = update_parameters(parameters, grads, learning_rate)

        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)
    
    return parameters

parameters = model(train_x, train_y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost = True) # "default" parameters

### Further Reading: Comparison to The Brain
Similarly to a neural network's use of linear computation and activation, a biological neuron receives inputs from other neurons, does a simple thresholding computation, and if the resulting signal is within a certain range, the neuron fires a pulse of electricity outwards to other neurons. However, this comparison is extremely simplistic and outdated, as there is still much to be discovered as to how exactly neurons work and how the brain as a whole learns. We don't know if the brain updates its own neurons with something like gradient descent or if anything like forwardpropagation is used during computation, as the process of altering the state of neurons is still very mysterious. Furthermore, neurons can store memories depending on their connection to other neurons, whereas standard neural networks rely on external data storage. 