# Neural Network

## Dependency

In [1]:
import numpy as np

## Concept

If a network has $s_j$ units in layer $j$ and $s_{j + 1}$ units in layer $j + 1$, then $\Theta^{(j)}$ has dimension $s_{j + 1} \times (s_{j} + 1)$. The $+ 1$ is there because layer $j$ contributes an additional **bias unit**. $\Theta^{(j)}$ is the matrix of **weights** controlling the function mapping from layer $j$ to layer $j + 1$.
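The dimension rule can be checked with a small sketch. The layer sizes (3, 4, 2) and the names `layer_sizes` and `thetas` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical layer sizes: 3 input units, 4 hidden units, 2 output units
layer_sizes = [3, 4, 2]

# Theta^(j) maps layer j to layer j + 1, so its shape is s_{j+1} x (s_j + 1);
# the extra column multiplies the bias unit of layer j
thetas = [
    np.zeros((s_next, s_curr + 1))
    for s_curr, s_next in zip(layer_sizes[:-1], layer_sizes[1:])
]

for j, theta in enumerate(thetas, start=1):
    print(f"Theta^({j}) shape: {theta.shape}")
# Theta^(1) shape: (4, 4)
# Theta^(2) shape: (2, 5)
```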

When a neural network has **no hidden layers** and has only **one unit in output layer**,

- If the output layer uses a **linear activation**, it's **linear regression** because $y = I (\Theta x) = \Theta x$.
- If the output layer uses a **sigmoid activation**, it's **logistic regression** because $y = \sigma(\Theta x)$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$.
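As a rough illustration of the two cases, the sketch below applies both activations to the same weighted sum. The weights and input are made-up values, and the first entry of `x` is assumed to be the bias unit.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and input; the first entry of x is the bias unit (always 1)
theta = np.array([[0.5, -1.0, 2.0]])  # shape 1 x (p + 1) with p = 2
x = np.array([[1.0], [3.0], [0.5]])   # shape (p + 1) x 1

z = theta @ x  # 0.5 * 1 - 1.0 * 3 + 2.0 * 0.5 = -1.5

y_linear = z             # identity activation: linear regression prediction
y_logistic = sigmoid(z)  # sigmoid activation: logistic regression probability

print(y_linear.item(), y_logistic.item())
```

The only difference between the two models here is the function applied to $\Theta x$.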

## Cost Function

In **multi-class classification**, $n$ is the number of examples, $L$ is the number of layers in the neural network including the input and output layers, $s_{l}$ is the number of units (not including the bias unit) in layer $l$, $K$ is the number of classes, $\Theta$ denotes the weight matrices, $h_{\Theta}(x) \in \mathbb{R}^K$ is the output of the neural network, $(h_{\Theta}(x))_i$ is its $i^{th}$ output, and $J(\Theta)$ is the cost.

$$
J(\Theta) = - \frac{1}{n} \left[ \sum_{i = 1}^{n} \sum_{k = 1}^{K} y_{k}^{(i)} \log (h_{\Theta}(x^{(i)}))_{k} + (1 - y_{k}^{(i)}) \log (1 - (h_{\Theta}(x^{(i)}))_{k}) \right] + \frac{\lambda}{2n} \sum_{l = 1}^{L - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}} (\Theta_{ji}^{(l)})^2
$$

This takes the form,

$$
\text{Regularized cost} = \text{Cost} + \lambda \times \text{Regularization}
$$

The first part, $\sum_{i = 1}^{n} \sum_{k = 1}^{K} y_{k}^{(i)} \dots$, computes the **negative log-likelihood** (cross-entropy) for each class, sums it over all $n$ examples, and divides by $n$ to get the average cost.

The second part, $\sum_{l = 1}^{L - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}}$, sums the squares of all the weight parameters in the neural network, excluding the bias weights, to regularize them. The sum over $l$ stops at $L - 1$ because a network with $L$ layers has $L - 1$ weight matrices.
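A minimal sketch of this cost in NumPy follows. The function name `regularized_cost`, the toy arrays, and the convention that column 0 of each weight matrix holds the bias weights are assumptions made for illustration.

```python
import numpy as np

def regularized_cost(H, Y, thetas, lam):
    """H, Y: (n x K) network outputs and one-hot labels; thetas: list of weight matrices."""
    n = Y.shape[0]
    # Average cross-entropy over the n examples, summed over the K classes
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / n
    # Squared weights, skipping column 0 (assumed to hold the bias weights)
    penalty = sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)
    return cost + lam / (2 * n) * penalty

# Hypothetical outputs for n = 2 examples and K = 2 classes
H = np.array([[0.9, 0.1], [0.2, 0.8]])
Y = np.array([[1, 0], [0, 1]])
thetas = [np.array([[0.0, 1.0, -1.0], [0.0, 2.0, 0.5]])]

print(regularized_cost(H, Y, thetas, lam=0.0))  # unregularized cost
print(regularized_cost(H, Y, thetas, lam=1.0))  # cost plus weight penalty
```

With `lam=0.0` the penalty vanishes, so the difference between the two printed values is exactly the regularization term.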

## Backpropagation

**Backpropagation** is the neural network algorithm for computing the gradient of the cost function with respect to the weights, which gradient descent then uses to minimize the cost. The goal is to solve,

$$
\underset{\Theta}{\min} J(\Theta)
$$

It means that we want to find the set of parameters $\Theta$ that minimizes the cost function $J$.

## Gradient Descent

### Logistic Regression Gradient Descent

$$
z = w^T x + b
$$
$$
\hat{y} = a = \sigma(z)
$$
$$
\mathcal{L}(a, y) = -(y \log(a) + (1 - y) \log(1 - a))
$$

When the number of features is $p = 2$, the computation graph is,

$x_1, w_1, x_2, w_2, b \rightarrow z = w_1 x_1 + w_2 x_2 + b \rightarrow \hat{y} = a = \sigma(z) \rightarrow \mathcal{L}(a, y)$ 

By changing $w_1, w_2, b$, we want to reduce $\mathcal{L}(a, y)$.

The loss function is,

$$
\mathcal{L}(a, y) = -(y \log(a) + (1 - y) \log(1 - a))
$$

Derivative of loss function with respect to $a$ is, by derivative of log and chain rule,

$$
\frac{d \mathcal{L}}{da} = -y \frac{1}{a} - (1 - y) \frac{1}{1 - a} (-1)
$$
$$
= \frac{-y}{a} + \frac{1 - y}{1 - a}
$$
$$
= \frac{-y(1 - a)}{a(1 - a)} + \frac{(1 - y)a}{(1 - a)a}
$$
$$
= \frac{-y + ay + a - ay}{a(1 - a)}
$$

So we have,

$$
\frac{d \mathcal{L}}{da} = \frac{a - y}{a(1 - a)}
$$

Next, derivative of $a$ with respect to $z$, because $a = \sigma(z)$,

$$
\frac{da}{dz} = \frac{d}{dz} \sigma(z)
$$

Because the derivative of the sigmoid function is $\frac{d}{dz} \sigma(z) = \sigma(z)(1 - \sigma(z))$ and $a = \sigma(z)$,

$$
\frac{da}{dz} = a (1 - a)
$$

Finally, derivative of loss function with respect to $z$ is, by chain rule,

$$
\frac{d \mathcal{L}}{dz} = \frac{d \mathcal{L}}{da} \frac{da}{dz}
$$
$$
= \frac{a - y}{a(1 - a)} a (1 - a)
$$

So we have,

$$
\frac{d \mathcal{L}}{dz} = a - y 
$$
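The result $\frac{d \mathcal{L}}{dz} = a - y$ can be sanity-checked numerically. The sketch below compares it with a central finite difference; the values of `z`, `y`, and `eps` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(z, y):
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.7, 1.0  # hypothetical pre-activation and label
eps = 1e-6       # step size for the finite difference

# Central difference approximation of dL/dz
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
# Closed form from the derivation: a - y
analytic = sigmoid(z) - y

print(numeric, analytic)
```

The two printed values should agree to several decimal places.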

Now, we get derivative with respect to parameters, $\frac{d \mathcal{L}}{d w_1}$, $\frac{d \mathcal{L}}{d w_2}$, and $\frac{d \mathcal{L}}{db}$. By chain rule,

$$
\frac{d \mathcal{L}}{d w_1} = \frac{d \mathcal{L}}{da} \frac{da}{dz} \frac{dz}{dw_1}
$$

Because $z = w_1 x_1 + w_2 x_2 + b$, derivative of $z$ with respect to $w_1$ is,

$$
\frac{dz}{dw_1} = x_1
$$

Likewise,

$$
\frac{dz}{dw_2} = x_2
$$
$$
\frac{dz}{db} = 1
$$

So finally, the derivative of the loss function with respect to each parameter is,

$$
\frac{d \mathcal{L}}{dw_1} = \frac{a - y}{a(1 - a)} a (1 - a) x_1 = (a - y) x_1
$$
$$
\frac{d \mathcal{L}}{dw_2} = \frac{a - y}{a(1 - a)} a (1 - a) x_2 = (a - y) x_2
$$
$$
\frac{d \mathcal{L}}{db} = \frac{a - y}{a(1 - a)} a (1 - a) \cdot 1 = (a - y)
$$

Because we got the gradient with respect to parameters, finally we can do gradient descent by

$$
w_1 = w_1 - \alpha \frac{d \mathcal{L}}{d w_1} = w_1 - \alpha (a - y) x_1
$$
$$
w_2 = w_2 - \alpha \frac{d \mathcal{L}}{d w_2} = w_2 - \alpha (a - y) x_2
$$
$$
b = b - \alpha \frac{d \mathcal{L}}{d b} = b - \alpha (a - y)
$$
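As a sketch of a single update on one example, the code below applies the three rules once and compares the loss before and after. The example values, the zero starting parameters, and `alpha = 0.1` are hypothetical.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(a, y):
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Hypothetical example, starting parameters, and learning rate
x_1, x_2, y = 1.0, 2.0, 1.0
w_1, w_2, b = 0.0, 0.0, 0.0
alpha = 0.1

a = sigmoid(w_1 * x_1 + w_2 * x_2 + b)
loss_before = loss(a, y)

# Move each parameter against its gradient: (a - y) * x_j for w_j, (a - y) for b
dz = a - y
w_1 = w_1 - alpha * dz * x_1
w_2 = w_2 - alpha * dz * x_2
b = b - alpha * dz

loss_after = loss(sigmoid(w_1 * x_1 + w_2 * x_2 + b), y)
print(loss_before, loss_after)
```

For this example the loss after the step is smaller than before, which is the behavior gradient descent aims for when the learning rate is small enough.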

### Pseudocode for Gradient Descent in Neural Network

When the neural network is a logistic regression with $p = 2$ features and $n$ examples, and $dw_1$ is shorthand for the derivative $\frac{d \mathcal{L}}{dw_1}$ (likewise $dw_2$ and $db$),

```
# Initialize variables to accumulate sums to compute average
J = 0, dw_1 = 0, dw_2 = 0, db = 0

# Iterate each example
for i = 1 to n
  
  # Forward propagation to compute loss
  z_i = w_1 x_1_i + w_2 x_2_i + b
  a_i = sigma(z_i)
  J += -(y_i log(a_i) + (1 - y_i)log(1 - a_i))
  
  # Backpropagation to compute derivative
  dz_i = a_i - y_i
  dw_1 += x_1_i dz_i
  dw_2 += x_2_i dz_i
  db += dz_i
  
# Compute average
J /= n, dw_1 /= n, dw_2 /= n, db /= n 

# Gradient descent
w_1 = w_1 - alpha dw_1
w_2 = w_2 - alpha dw_2
b = b - alpha db
```
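The pseudocode above can be turned into plain Python roughly as follows. This is a sketch, not a reference implementation: the toy data, the starting parameters, and the learning rate `alpha` are assumed values.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy data: n = 2 examples, each with p = 2 features (x_1, x_2) and a label y
examples = [((1.0, 3.0), 1.0), ((2.0, 4.0), 0.0)]
w_1, w_2, b = 1.0, 2.0, 2.0
alpha = 0.01  # hypothetical learning rate
n = len(examples)

# Accumulators for the cost and the gradients
J = dw_1 = dw_2 = db = 0.0

for (x_1, x_2), y in examples:
    # Forward propagation to compute the loss
    z = w_1 * x_1 + w_2 * x_2 + b
    a = sigmoid(z)
    J += -(y * math.log(a) + (1 - y) * math.log(1 - a))

    # Backpropagation to accumulate the derivatives
    dz = a - y
    dw_1 += x_1 * dz
    dw_2 += x_2 * dz
    db += dz

# Average over the n examples
J, dw_1, dw_2, db = J / n, dw_1 / n, dw_2 / n, db / n

# One gradient descent step
w_1 -= alpha * dw_1
w_2 -= alpha * dw_2
b -= alpha * db

print(J, dw_1, dw_2, db)
```

The loop over examples is kept on purpose so that each line maps onto one line of the pseudocode.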

## Vectorization

**Whenever possible, avoid explicit for-loops** when coding a neural network. Operating on whole matrices with NumPy replaces the loop over examples.
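To illustrate, the sketch below computes the gradient of the logistic regression cost with respect to $w$ twice, once with a loop over examples and once with matrix operations, and checks that the results match. The random data, the seed, and the sizes `p` and `n` are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical random problem: p features, n examples
rng = np.random.default_rng(0)
p, n = 3, 5
X = rng.normal(size=(p, n))
Y = rng.integers(0, 2, size=(1, n))
w = rng.normal(size=(p, 1))
b = 0.1

# Loop version: accumulate the gradient one example at a time
dw_loop = np.zeros((p, 1))
for i in range(n):
    x_i = X[:, [i]]                        # column i as a (p x 1) array
    a_i = sigmoid((w.T @ x_i).item() + b)  # scalar activation
    dw_loop += x_i * (a_i - Y[0, i])
dw_loop /= n

# Vectorized version: all examples at once
A = sigmoid(w.T @ X + b)    # (1 x n) activations
dw_vec = X @ (A - Y).T / n  # (p x 1) gradient

print(np.allclose(dw_loop, dw_vec))  # True
```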

## Resource

- [Machine Learning by Stanford University | Coursera](https://www.coursera.org/learn/machine-learning)
- [Deep Learning Specialization | Coursera](https://www.coursera.org/specializations/deep-learning)

## Note

$X$ is $(p \times n)$ and $Y$ is $(1 \times n)$, where $p$ is the number of features and $n$ is the number of examples.

Logistic regression $\hat{y} = \sigma(w^T x + b)$

The loss function measures the error on a single example, $\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$

The cost function is the average of the loss function over the entire dataset, $J(w, b)$

- https://www.coursera.org/learn/neural-networks-deep-learning/programming/XaIWT/logistic-regression-with-a-neural-network-mindset
- https://github.com/Kulbear/deep-learning-coursera/blob/master/Neural%20Networks%20and%20Deep%20Learning/Logistic%20Regression%20with%20a%20Neural%20Network%20mindset.ipynb

## Coding Neural Network Logistic Regression

In [8]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def initialize_parameters_with_zeros(dim):
    w = np.zeros(shape=(dim, 1))
    b = 0
    return w, b


def propagate(w, b, X, Y):
    """
    n is the number of examples. p is the number of features.

    Arguments:
    w: (p x 1) weights
    b: bias scalar
    X: (p x n) input data
    Y: (1 x n) output data
    
    Return:
    grads: dictionary with
        'dw': gradient of the cost with respect to w, (p x 1), same shape as w
        'db': gradient of the cost with respect to b, a scalar
    cost: a scalar, the average negative log-likelihood
    """
    n = X.shape[1]
    
    # Forward propagation
    # Compute activation function
    A = sigmoid(np.dot(w.T, X) + b)
    # Compute cost function
    cost = (- 1 / n) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
    cost = np.squeeze(cost)
    
    # Backpropagation
    dw = (1 / n) * np.dot(X, (A - Y).T)
    db = (1 / n) * np.sum(A - Y)
    grads = {
        'dw': dw,
        'db': db
    }
    
    return grads, cost

    
print(sigmoid(0))
print(sigmoid(3))
print(sigmoid(-3))
print(initialize_parameters_with_zeros(2))
w, b, X, Y = np.array([[1], [2]]), 2, np.array([[1,2], [3,4]]), np.array([[1, 0]])
grads, cost = propagate(w, b, X, Y)
print("dw = " + str(grads["dw"]))
print("db = " + str(grads["db"]))
print("cost = " + str(cost))

0.5
0.9525741268224334
0.04742587317756678
(array([[0.],
       [0.]]), 0)
dw = [[0.99993216]
 [1.99980262]]
db = 0.49993523062470574
cost = 6.000064773192205
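
The gradients returned by `propagate` can drive a simple training loop. The sketch below is one possible way to do that: the `optimize` function, its parameter names, the iteration count, and the learning rate are assumptions, and `sigmoid` and `propagate` are repeated in compact form only so that the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def propagate(w, b, X, Y):
    # Same computation as the function defined above, in compact form
    n = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / n
    dw = np.dot(X, (A - Y).T) / n
    db = np.sum(A - Y) / n
    return {'dw': dw, 'db': db}, cost

def optimize(w, b, X, Y, num_iterations, learning_rate):
    # Hypothetical helper: repeat the gradient descent update and record the cost
    costs = []
    for _ in range(num_iterations):
        grads, cost = propagate(w, b, X, Y)
        w = w - learning_rate * grads['dw']
        b = b - learning_rate * grads['db']
        costs.append(float(cost))
    return w, b, costs

# Toy data in the (p x n) layout, starting from zero parameters
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[1, 0]])
w, b = np.zeros((2, 1)), 0.0

w, b, costs = optimize(w, b, X, Y, num_iterations=100, learning_rate=0.01)
print(costs[0], costs[-1])
```

Starting from zero weights, every activation is 0.5, so the first recorded cost is $\log 2$; with a small learning rate the cost then decreases over the iterations.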
