# <font color='black'>Deep Neural Networks</font>

---

<img src="images/ipsa_logo.png" width="100" align="right">


> Version: **1.0**




# 1 . Implementation of a Neural Network

In this exercise you will learn how to implement from scratch a deep neural network.


## Set-up

Firstly you will import all the packages used through the notebook.  

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import h5py

%matplotlib inline

%load_ext autoreload
%autoreload 2

np.random.seed(3)

from utils import *

c:\users\efabr\miniconda3\lib\site-packages\numpy\.libs\libopenblas.QVLO2T66WEPI7JZ63PS3HMOHFEY472BC.gfortran-win_amd64.dll
c:\users\efabr\miniconda3\lib\site-packages\numpy\.libs\libopenblas.WCDJNK7YVMPZQ2ME2ZZHJJRJ3JIKNDB7.gfortran-win_amd64.dll
  stacklevel=1)


## Initialization

Start by defining a function that allows to initialize the parameters of a deep neural network where the dimensions. The number of units in the different layers are passed as argument with `layer_dims`.


In [42]:
def initialization(layer_dims):

    np.random.seed(5)
    parameters = {}
    L = len(layer_dims)            
    
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l],layer_dims[l-1])*0.01
        parameters['b' + str(l)] = np.zeros((1,layer_dims[l]))
        
    return parameters

In [44]:
parameters = initialization([5,4,3])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

W1 = [[ 0.00441227 -0.0033087   0.02430771 -0.00252092  0.0010961 ]
 [ 0.01582481 -0.00909232 -0.00591637  0.00187603 -0.0032987 ]
 [-0.01192765 -0.00204877 -0.00358829  0.00603472 -0.01664789]
 [-0.00700179  0.01151391  0.01857331 -0.0151118   0.00644848]]
b1 = [[0. 0. 0. 0.]]
W2 = [[-9.80607885e-03 -8.56853155e-03 -8.71879183e-03 -4.22507929e-03]
 [ 9.96439827e-03  7.12421271e-03  5.91442432e-04 -3.63310878e-03]
 [ 3.28884293e-05 -1.05930442e-03  7.93053319e-03 -6.31571630e-03]]
b2 = [[0. 0. 0.]]


## Forward propagation

The forward propagation has been split in different steps. Firstly, the linear forward module computes the following equations:

$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\tag{4}$$

where $A^{[0]} = X$. 

Define a function to compute $Z^{[l]}$

In [45]:
def linear_forward(A, W, b):
    Z = W@A + b
    cache = (A, W, b)
    
    return Z, cache

In [46]:
A, W, b = linear_forward_test()

Z, linear_cache = linear_forward(A, W, b)
print("Z = " + str(Z))

Z = [[-0.67356113  0.67062057]]


**Expected output**:

<table style="width:35%">
  
  <tr>
    <td> **Z** </td>
    <td> [[ -0.67356113 0.67062057]] </td> 
  </tr>
  
</table>

### Activation Functions

In the first notebook you implemented the sigmoid function:

- **Sigmoid**: $\sigma(Z) = \sigma(W A + b) = \frac{1}{ 1 + e^{-(W A + b)}}$.

In this notebook, you will need to implement the ReLU activation defined as:

- **ReLU**: $A = RELU(Z) = max(0, Z)$. 

Complete the function below that computes the ReLU an activation fucntion.

In [47]:
def relu(Z):

    A = np.maximum(0,Z)
    cache = Z 
    
    return A, cache

You have implemented a function that determines the linear foward step. You will now combine the output of this function with either a sigmoid() or a relu() activation function. 

In [48]:
def forward_one(A_prev, W, b, activation):
    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation == 'relu' :
        A, activation_cache = relu(Z)
        
    elif activation == 'sigmoid' :
        A, activation_cache = sigmoid(Z)
        
    cache = (linear_cache, activation_cache)

    return A, cache

In [49]:
A_prev, W, b = forward_one_test()

A, linear_activation_cache = forward_one(A_prev, W, b, activation = "sigmoid")
print("With sigmoid: A = " + str(A))

A, linear_activation_cache = forward_one(A_prev, W, b, activation = "relu")
print("With ReLU: A = " + str(A))

With sigmoid: A = [[0.96313579 0.22542973]]
With ReLU: A = [[3.26295337 0.        ]]


### Forward propagation model

The structure you will implement in this exercise consists on $L-1$ layers using a ReLU activation function and a last layer using a sigmoid.
Implement the forward propagation of the above model.

In [50]:
def forward_all(X, parameters):

    caches = []
    A = X
    L = len(parameters) // 2                
    
    for l in range(1, L):
        A_prev = A 
        # Implement L-1 layers of RELU and for each layer add "cache" to the "caches" list.
        A, cache = forward_one(A,parameters["W"+ str(l)],parameters["b"+ str(l)],"relu")
        caches.append(cache)
    AL, cache = forward_one(A,parameters["W"+ str(L)],parameters["b"+ str(L)],"sigmoid")
    caches.append(cache)
    
            
    return AL, caches

In [51]:
X, parameters = forward_all_test()
AL, caches = forward_all(X, parameters)
print("AL = " + str(AL))
print("Length of caches list = " + str(len(caches)))

AL = [[0.03921668 0.70498921 0.19734387 0.04728177]]
Length of caches list = 3


###  Cost function

You will now compute the cross-entropy cost $J$, for all the training set using the following formula: $$-\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) \tag{7}$$


In [52]:
def cost_function(AL, Y):
    
    m = Y.shape[1]

    cost = (-1/m)*np.sum((np.dot(Y,np.log(AL.T))+np.dot((1-Y),np.log(1-AL).T))) 
    cost = np.squeeze(cost)      #  Eliminates useless dimensionality for the variable cost.
    
    return cost

In [53]:
Y, AL = compute_cost()
print("cost = " + str(cost_function(AL, Y)))

cost = 0.2797765635793422


<table>

    <tr>
    <td>**cost** </td>
    <td> 0.2797765635793422</td> 
    </tr>
</table>

##  Backpropagation 

You will now implement the functions that will help you compute the gradient of the loss function with respect to the different parameters.

To move backward in the computational graph you need to apply the chain rule.

### Linear backward

For each layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).

Suppose you have already calculated the derivative $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}$. You want to get $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$.


The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$. The formulas you saw in class are:
$$ dW^{[l]} = \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$
$$ db^{[l]} = \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}$$


In [54]:
def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    
    dW = 1/m * dZ@A_prev.T
    db = 1/m * np.sum(dZ,1, keepdims = True)
    dA_prev = W.T@dZ
    
    return dA_prev, dW, db
    

In [55]:
# Set up some test inputs
dZ, linear_cache = linear_backward_test()

dA_prev, dW, db = linear_backward(dZ, linear_cache)

print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))


dA_prev = [[ 1.62477986e-01  2.08119187e+00 -1.34890293e+00 -8.08822550e-01]
 [ 1.25651742e-02 -2.21287224e-01 -5.90636554e-01  4.05614891e-03]
 [ 1.98659671e-01  2.39946554e+00 -1.86852905e+00 -9.65910523e-01]
 [ 3.18813678e-01 -9.92645222e-01 -6.57125623e-01 -1.46564901e-01]
 [ 2.48593418e-01 -1.19723579e+00 -4.44132647e-01 -6.09748046e-04]]
dW = [[-1.05705158 -0.98560069 -0.54049797  0.10982291  0.53086144]
 [ 0.71089562  1.01447326 -0.10518156  0.34944625 -0.12867032]
 [ 0.46569162  0.31842359  0.30629837 -0.01104559 -0.19524287]]
db = [[ 0.5722591 ]
 [ 0.04780547]
 [-0.38497696]]


** Expected Output**:
    
```
dA_prev = 
[[  1.62477986e-01   2.08119187e+00  -1.34890293e+00  -8.08822550e-01]
 [  1.25651742e-02  -2.21287224e-01  -5.90636554e-01   4.05614891e-03]
 [  1.98659671e-01   2.39946554e+00  -1.86852905e+00  -9.65910523e-01]
 [  3.18813678e-01  -9.92645222e-01  -6.57125623e-01  -1.46564901e-01]
 [  2.48593418e-01  -1.19723579e+00  -4.44132647e-01  -6.09748046e-04]]
dW = 
[[-1.05705158 -0.98560069 -0.54049797  0.10982291  0.53086144]
 [ 0.71089562  1.01447326 -0.10518156  0.34944625 -0.12867032]
 [ 0.46569162  0.31842359  0.30629837 -0.01104559 -0.19524287]]
db = 
[[ 0.5722591 ]
 [ 0.04780547]
 [-0.38497696]]
```

### Activation Functions

Now you need to write the code that computes the derivatives for the activation functions. You have learned the derivatives for the sigmoid and the ReLU during theory class.
Complete the two function below.

In [56]:
def sigmoid_backward(dA, cache):    
    Z = cache
    
    s = Z*(1-Z)
    dZ = dA*s
    
    return dZ

In [57]:
def relu_backward(dA, cache):
    
    Z = cache 
    dZ = np.array(dA, copy=True) # convert dz to an array.
    dZ = dZ*np.where(Z>0,1,0)
    return dZ

### One backpropagation step

Next, you will create a function that implements one step of backpropagation,

In [58]:
def backward_one(dA, cache, activation):
    linear_cache, activation_cache = cache  
    if activation == "relu":
        dZ = relu_backward(dA,activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA,activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    
    return dA_prev, dW, db

In [59]:
dAL, linear_activation_cache = linear_activation_backward_test()

dA_prev, dW, db = backward_one(dAL, linear_activation_cache, "sigmoid")
print ("sigmoid:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db) + "\n")

dA_prev, dW, db = backward_one(dAL, linear_activation_cache, activation = "relu")
print ("relu:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

sigmoid:
dA_prev = [[ 0.00410547  0.03685307]
 [-0.01417887 -0.12727776]
 [ 0.00764463  0.06862266]]
dW = [[ 0.03231386 -0.0904648   0.02919517]]
db = [[0.06163813]]

relu:
dA_prev = [[ 0.01679913  0.16610885]
 [-0.05801838 -0.57368247]
 [ 0.031281    0.30930474]]
dW = [[ 0.14820532 -0.40668077  0.13325465]]
db = [[0.27525652]]


### Backpropagation model

Now you will put all together to compute the backward function for the whole network. 
In the backpropagation step, you will use the variables you stored in cache in the `forward_all` function to compute the gradients. You will iterate from the last layer backwards to layer $1$.

You need to start by computing the derivative of the loss function with respect to $A^{[L]}$. And propagate this gradient backward thourgh all the layers in the network.

You need to save each dA, dW and db in the grads dictionary. 

In [60]:
def backward_all(AL, Y, caches):
    grads = {}
    L = len(caches) 
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) 

    dZ = AL-Y
    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_backward(dZ, current_cache[0])
    dAL = grads["dA" + str(L-1)]
    for l in reversed(range(L-1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = backward_one(dAL, current_cache, "relu")
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads

In [61]:
AL, Y_assess, caches = backward_all_test()
grads = backward_all(AL, Y_assess, caches)
print_grads(grads)

dW1 = [[0.41642713 0.07927654 0.14011329 0.10664197]
 [0.         0.         0.         0.        ]
 [0.05365169 0.01021384 0.01805193 0.01373955]]
db1 = [[-0.22346593]
 [ 0.        ]
 [-0.02879093]]
dA1 = [[-0.80745758 -0.44693186]
 [ 0.88640102  0.49062745]
 [-0.10403132 -0.05758186]]


**Expected Output**

<table style="width:60%">
  
  <tr>
    <td > dW1 </td> 
           <td > [[ 0.41010002  0.07807203  0.13798444  0.10502167]
 [ 0.          0.          0.          0.        ]
 [ 0.05283652  0.01005865  0.01777766  0.0135308 ]] </td> 
  </tr> 
  
    <tr>
    <td > db1 </td> 
           <td > [[-0.22007063]
 [ 0.        ]
 [-0.02835349]] </td> 
  </tr> 
  
  <tr>
  <td > dA1 </td> 
           <td > [[ 0.12913162 -0.44014127]
 [-0.14175655  0.48317296]
 [ 0.01663708 -0.05670698]] </td> 

  </tr> 
</table>



### Gradient Descent

Finally you can update the parameters of the model according: 

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} $$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} $$

where $\alpha$ is the learning rate. After computing the updated parameters, store them in the parameters dictionary. 

In [62]:
def gradient_descent(parameters, grads, learning_rate):
    L = len(parameters) // 2 

    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W"+ str(l+1)] - learning_rate * grads["dW"+ str(l+1)]
        parameters["b" + str(l+1)] = parameters["b"+ str(l+1)] - learning_rate * grads["db"+ str(l+1)]
    return parameters

In [63]:
parameters, grads = gradient_descent_test_case()
parameters = gradient_descent(parameters, grads, 0.1)

print ("W1 = "+ str(parameters["W1"]))
print ("b1 = "+ str(parameters["b1"]))
print ("W2 = "+ str(parameters["W2"]))
print ("b2 = "+ str(parameters["b2"]))

W1 = [[-0.59562069 -0.09991781 -2.14584584  1.82662008]
 [-1.76569676 -0.80627147  0.51115557 -1.18258802]
 [-1.0535704  -0.86128581  0.68284052  2.20374577]]
b1 = [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]]
W2 = [[-0.55569196  0.0354055   1.32964895]]
b2 = [[-0.84610769]]


**Expected Output**:

<table style="width:100%"> 
    <tr>
    <td > W1 </td> 
           <td > [[-0.59562069 -0.09991781 -2.14584584  1.82662008]
 [-1.76569676 -0.80627147  0.51115557 -1.18258802]
 [-1.0535704  -0.86128581  0.68284052  2.20374577]] </td> 
  </tr> 
  
    <tr>
    <td > b1 </td> 
           <td > [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]] </td> 
  </tr> 
  <tr>
    <td > W2 </td> 
           <td > [[-0.55569196  0.0354055   1.32964895]]</td> 
  </tr> 
  
    <tr>
    <td > b2 </td> 
           <td > [[-0.84610769]] </td> 
  </tr> 
</table>


You can now create a deep neural network  combining all the functions defined above.

In [64]:
def dnn(X, Y, layers_dims, learning_rate = 0.009, num_iterations = 100, print_cost=True):#lr was 0.009
    costs = []                         
    
    parameters = initialization(layers_dims)
    for i in range(0, num_iterations):
        AL, caches = forward_all(X, parameters)
        cost = cost_function(AL, Y)
        costs.append(cost)
        if print_cost:
            print(cost)
        grads = backward_all(AL, Y, caches)
        parameters = gradient_descent(parameters, grads, learning_rate)
        
    
    return parameters, costs

# 2 . Deep Neural Networks for Classification

Consider now the dataset you used in the previous exercise. You solved the classification problem using Logistic Regression. Propose a Deep Neural Network architecture using the code you developed in the first part of this exercise that improves on the classification results of Logistic Regression.

In [65]:
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset

%matplotlib inline

In [66]:
train_set_x_orig, train_set_y, test_set_x_orig, test_set_y, classes = load_dataset()

In [67]:
m_train = train_set_x_orig.shape[0]
m_test = test_set_x_orig.shape[0]
num_px = train_set_x_orig.shape[1] 
train_set_x_flatten = train_set_x_orig.reshape(m_train,num_px * num_px * 3).T
test_set_x_flatten = test_set_x_orig.reshape(m_test,num_px * num_px * 3).T
train_set_x = train_set_x_flatten/255.
test_set_x = test_set_x_flatten/255.

In [69]:
print(train_set_x_flatten.shape, num_px)
np.swapaxes(train_set_x_flatten)
dnn(train_set_x_flatten, train_set_y, [train_set_x_flatten.shape[0], 4, 1])

(12288, 209) 64


TypeError: _swapaxes_dispatcher() missing 2 required positional arguments: 'axis1' and 'axis2'