# Deep Learning Specialisation

## Neural networks and deep learning 

### Logistic regression as neural network

- Inputs are images (for example, pictures of cats) flattened into a vector of RGB pixel values; the output is 0 or 1
- There are m training examples (x, y) where $x\in\mathbb{R}^n, y\in\{0,1\}$
- X often represents the training data with samples as rows. In this course, however, the columns of X are the training examples, so X has shape (n, m).
- Y represents the row vector of y values with shape (1, m)
- $\hat{y}$ is the prediction. In the cat example $\hat{y} = P(y=1|x)$
- In logistic regression $\hat{y}=\sigma(w^Tx+b)$ where $\sigma(z) = \frac{1}{1+e^{-z}}$
- Loss function $L(\hat{y}, y) = - (y \log(\hat{y}) + (1-y)\log(1-\hat{y}))$
- Cost function $J(w, b) = \frac{1}{m} \sum_{i=1}^mL(\hat{y}^{(i)}, y^{(i)})$; J is convex
- Gradient descent updates w and b to minimise the cost
- We repeatedly update $w := w-\alpha\frac{\partial J}{\partial w} = w-\alpha dw$ and $b := b-\alpha\frac{\partial J}{\partial b} = b -\alpha db$, where $\alpha$ is the learning rate
- When implementing, we vectorise the computation over the m examples rather than looping over them
- Logistic regression diagram
```mermaid
flowchart LR
    x1(($$x_1$$)) --> z("$$z = w^Tx + b$$")
    x2(($$x_2$$)) --> z
    x3(($$x_3$$)) --> z
    z --> a("$$\hat{y} = a= \sigma(z) $$")
```
- Logistic regression equations are often summarised as
$$
\begin{align*}
z^{(i)} &= w^Tx^{(i)}+b \\
\hat{y}^{(i)} &= a^{(i)} = \sigma(z^{(i)}) \\
L(a^{(i)}, y^{(i)}) &= -y^{(i)}\log(a^{(i)}) - (1 - y^{(i)})\log(1-a^{(i)}) \\
J &= \frac{1}{m}\sum_{i=1}^mL(a^{(i)}, y^{(i)})
\end{align*}
$$
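
As a rough illustration of the per-example equations above, the sketch below evaluates $z^{(i)}$, $a^{(i)}$, the loss and the cost with an explicit loop over the examples. The toy data and the starting values of `w` and `b` are assumptions made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(a, y):
    # Log loss for a single example
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Toy data (assumed): n = 2 features, m = 3 examples stored as columns
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.0, 2.0]])
Y = np.array([1, 0, 1])
w = np.zeros(2)   # weights start at zero purely for illustration
b = 0.0

m = X.shape[1]
J = 0.0
for i in range(m):
    z_i = w @ X[:, i] + b      # z^(i) = w^T x^(i) + b
    a_i = sigmoid(z_i)         # a^(i) = sigma(z^(i))
    J += loss(a_i, Y[i])
J /= m                         # cost is the mean loss over the m examples
print(J)
```

With zero weights every prediction is 0.5, so the cost equals $\log 2 \approx 0.693$, which is a handy sanity check for a fresh model.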

In vector form

$$
\begin{align}
A &= \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, \dots, a^{(m-1)}, a^{(m)}) \\
\frac{\partial J}{\partial w} &= dw = \frac{1}{m}X(A-Y)^T \\
\frac{\partial J}{\partial b} &= db = \frac{1}{m}\sum_{i=1}^m (a^{(i)}-y^{(i)}) \\
J &= -\frac{1}{m}(Y\log(A)^T + (1-Y)\log(1-A)^T)
\end{align}
$$
$$
X: (n, m) \quad
w: (n, 1) \quad
b: (1, 1) \quad
A: (1, m) \quad
Y: (1, m)
$$
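
The vectorised equations above can be sketched in NumPy as follows. This is a minimal illustration on synthetic data, not the course's reference implementation; the helper name `cost_and_grads`, the learning rate and the data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, Y):
    """w: (n, 1), b: scalar, X: (n, m), Y: (1, m)."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                                   # (1, m)
    J = -(Y @ np.log(A).T + (1 - Y) @ np.log(1 - A).T).item() / m
    dw = X @ (A - Y).T / m                                     # (n, 1)
    db = np.sum(A - Y) / m                                     # scalar
    return J, dw, db

# Synthetic, linearly separable toy data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)                  # (1, m)

w, b, alpha = np.zeros((2, 1)), 0.0, 0.5
costs = []
for _ in range(100):
    J, dw, db = cost_and_grads(w, b, X, Y)
    costs.append(J)
    w -= alpha * dw     # w := w - alpha * dw
    b -= alpha * db     # b := b - alpha * db
```

No loop over the m examples is needed: the matrix products sum over the examples, and `b` is broadcast across the columns.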

### Shallow neural networks

- For a network we use the superscript $[l]$ to denote the $l$th layer; the dimension n of x is written $n^{[0]}$
- Network flow

```mermaid
flowchart LR
    x1(($$x_1$$)) --> z1("$$z^{[1]} = W^{[1]}x + b^{[1]}$$")
    x2(($$x_2$$)) --> z1
    x3(($$x_3$$)) --> z1
    z1 --> a1("$$a^{[1]} = \sigma(z^{[1]}) $$")
    a1 --> z2("$$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$")
    z2 --> a2("$$\hat{y} = a^{[2]} = \sigma(z^{[2]}) $$")
```
- The forward pass is calculated as
$$
\begin{align}
z^{[1](i)} &= W^{[1]}x^{(i)}+b^{[1]}, \quad\text{for the ith sample} \\
Z^{[1]} &= W^{[1]}X+b^{[1]} \\
A^{[1]} &= \sigma(Z^{[1]}) \\
Z^{[2]} &= W^{[2]}A^{[1]} + b^{[2]} \\
\hat{Y} = A^{[2]} &= \sigma(Z^{[2]})
\end{align}
$$
- Activation functions could be tanh, ReLU, leaky ReLU or the sigmoid
- Taking the example of a two-layer network (the input layer is not included in the count, so there is 1 hidden layer), the parameters and their shapes are

$$\begin{align}
x^{(i)} &: (n, 1) \\
X &: (n, m) \\
W^{[1]} &: (n^{[1]}, n^{[0]}) \\
b^{[1]} &: (n^{[1]}, 1) \\
Z^{[1]} &: (n^{[1]}, m) \\
A^{[1]} &: (n^{[1]}, m) \\
W^{[2]} &: (n^{[2]}, n^{[1]}) \\
b^{[2]} &: (n^{[2]}, 1) \\
Z^{[2]} &: (n^{[2]}, m) \\
A^{[2]} &: (n^{[2]}, m) \\
Y &: (n^{[2]}, m), \quad n^{[2]} = 1
\end{align}
$$ 
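
To make the shapes concrete, here is a small sketch of the two-layer forward pass with shape assertions. The layer sizes, the random data and the `* 0.01` weight scale are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 4, 1, 5          # illustrative sizes (assumed)

X = rng.normal(size=(n0, m))
W1, b1 = rng.normal(size=(n1, n0)) * 0.01, np.zeros((n1, 1))
W2, b2 = rng.normal(size=(n2, n1)) * 0.01, np.zeros((n2, 1))

Z1 = W1 @ X + b1        # b1 (n1, 1) is broadcast across the m columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)        # predictions, one per column

assert Z1.shape == A1.shape == (n1, m)
assert Z2.shape == A2.shape == (n2, m)
```

Keeping the biases as column vectors of shape $(n^{[l]}, 1)$ is what lets NumPy broadcast them over all m examples.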
- Back prop equations
$$\begin{align}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \frac{1}{m}dZ^{[2]}A^{[1]T} \\
db^{[2]} &= \frac{1}{m}\text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims}=\text{True}) \\
dZ^{[1]} &= W^{[2]T}dZ^{[2]} * \sigma'(Z^{[1]}), \quad\text{where * is the elementwise product} \\
dW^{[1]} &= \frac{1}{m}dZ^{[1]}X^T \\
db^{[1]} &= \frac{1}{m}\text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})
\end{align}
$$ 
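- Initialise the weights randomly. If every weight starts at zero, all units in a hidden layer compute the same output and receive the same gradient, so they never learn different features. The biases can start at zero.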
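
One way to gain confidence in the backprop equations above is a numerical gradient check: compare each analytic gradient with a central finite difference of the cost. The sketch below does this for a small two-layer sigmoid network; the sizes, data and helper names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A1, A2

def cost(params, X, Y):
    _, A2 = forward(params, X)
    m = X.shape[1]
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

def backward(params, X, Y):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1, A2 = forward(params, X)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)      # sigma'(Z1) = A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 6))
Y = rng.integers(0, 2, size=(1, 6)).astype(float)
params = [rng.normal(size=(4, 3)), np.zeros((4, 1)),
          rng.normal(size=(1, 4)), np.zeros((1, 1))]

grads = backward(params, X, Y)
eps = 1e-6
max_err = 0.0
for p, g in zip(params, grads):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps                  # perturb one parameter up
        J_plus = cost(params, X, Y)
        p[idx] = old - eps                  # and down
        J_minus = cost(params, X, Y)
        p[idx] = old                        # restore the original value
        numeric = (J_plus - J_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - g[idx]))
print(max_err)
```

A very small `max_err` suggests the analytic gradients agree with the cost; a large value usually points to a sign, transpose or missing $\frac{1}{m}$ error.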

### Deep neural networks

 - The equations for the forward pass can be summarised as below, where there are $L$ layers
 $$
\begin{align}
z^{[l](i)} &= W^{[l]}a^{[l-1](i)}+b^{[l]}, \quad\text{ith sample, lth layer} \\
a^{[l](i)} &= g(z^{[l](i)}) , \quad\text{ith sample, lth layer, g - activation function} \\
a^{[0](i)} &= x^{(i)} \\
\hat{y}^{(i)} &= a^{[L](i)} \\
L(a^{[L](i)}, y^{(i)}) &= -y^{(i)}\log(a^{[L](i)}) - (1 - y^{(i)})\log(1-a^{[L](i)})
\end{align}
$$
 - As vectors
$$
\begin{align}
Z^{[l]} &= W^{[l]}A^{[l-1]}+b^{[l]} \\
A^{[l]} &= g(Z^{[l]}), \quad\text{lth layer, g - activation function} \\
A^{[0]} &= X \\
\hat{Y} &= A^{[L]} \\
J &= -\frac{1}{m}(Y\log(A^{[L]})^T + (1-Y)\log(1-A^{[L]})^T)
\end{align}
$$
- The equations for the backward pass
$$
\begin{align}
dZ^{[L]} &= A^{[L]} - Y \\
dZ^{[l]} &= W^{[l+1]^T}dZ^{[l+1]} * g'(Z^{[l]}) \quad l\in\{1...L-1\} \\
dW^{[l]} &= \frac{1}{m}dZ^{[l]}A^{[l-1]^T} \quad l\in\{1...L\} \\
db^{[l]} &= \frac{1}{m}\text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims}=\text{True}) \quad l\in\{1...L\}
\end{align}
$$
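
The first backward equation is stated without derivation. A short sketch, assuming the output activation is the sigmoid and the loss is the log loss used above, for a single example with $a = \sigma(z)$:

$$
\begin{align*}
\frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1-y}{1-a} = \frac{a-y}{a(1-a)} \\
\frac{da}{dz} &= \sigma(z)(1-\sigma(z)) = a(1-a) \\
dz = \frac{\partial L}{\partial z} &= \frac{\partial L}{\partial a}\frac{da}{dz} = a - y
\end{align*}
$$

Stacking the m examples as columns gives $dZ^{[L]} = A^{[L]} - Y$. With a different output activation or loss this simplification does not hold in general, and $dZ^{[L]}$ has to be derived again.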


In [1]:
from typing import Callable

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from scipy.special import expit


def relu(x):
    return x * (x > 0)

def relu_derivative(x):
    return 1. * (x > 0)

sigmoid = expit

def sigmoid_derivative(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def leaky_relu(x, leaky_constant:float = 0.01):
    return np.where(x > 0.0, x,  x * leaky_constant)

def leaky_relu_derivative(x, leaky_constant=0.01):
    return np.where(x > 0, 1, leaky_constant)

def log_loss(A, Y):
    m = A.shape[1] # m is number samples
    return - (1/m) * (Y @ np.log(A).T + (1 - Y) @ np.log(1 - A).T)

def square_loss(A, Y):
    m = A.shape[1] # m is number samples
    # Mean squared error is non-negative, so there is no leading minus sign
    return (1/m) * (A - Y) @ (A - Y).T

ACTIVATION_FUNCTIONS : dict[str, Callable] = {
    "relu": relu,
    "leaky_relu": leaky_relu,
    "sigmoid": expit,
}

ACTIVATION_FUNCTION_DERIVATIVES: dict[str, Callable] = {
    "relu": relu_derivative,
    "leaky_relu": leaky_relu_derivative,
    "sigmoid": sigmoid_derivative,
}

LOSS_FUNCTIONS: dict[str, Callable] = {
    "log_loss": log_loss,
    "square_loss": square_loss
}
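
A quick way to check derivative helpers like the ones above is to compare them with a central finite difference. The sketch below re-declares two of the functions so that it runs on its own; the helper `numerical_derivative` and the test points are assumptions, and the sigmoid is written with NumPy rather than `scipy.special.expit` only to keep the example dependency-light.

```python
import numpy as np

def numerical_derivative(f, x, eps=1e-6):
    # Central difference approximation of f'(x), applied elementwise
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def leaky_relu(x, leaky_constant=0.01):
    return np.where(x > 0.0, x, x * leaky_constant)

def leaky_relu_derivative(x, leaky_constant=0.01):
    return np.where(x > 0, 1, leaky_constant)

# Test points chosen away from 0, where leaky ReLU is not differentiable
x = np.array([-2.0, -0.5, 0.3, 1.7])

sigmoid_err = np.max(np.abs(numerical_derivative(sigmoid, x) - sigmoid_derivative(x)))
leaky_err = np.max(np.abs(numerical_derivative(leaky_relu, x) - leaky_relu_derivative(x)))
print(sigmoid_err, leaky_err)
```

Both errors should be tiny; a large value would suggest the analytic derivative and the activation are out of sync.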

In [14]:
import logging

logger = logging.getLogger()

class NeuralNetwork:

    def __init__(
        self,
        layer_sizes: list[int],
        learning_rate: float = 0.5,
        layer_activations: dict[int, str] | None = None,
        cost_function: str = "log_loss"
    ):
        self.learning_rate = learning_rate
        self.layer_sizes = layer_sizes
        self.L = len(layer_sizes) - 1
        self.n = layer_sizes[0]  # number of input features, n^{[0]} (not the sample count m)
        self.cost_function = cost_function
        self.layer_activations = layer_activations
        if not layer_activations:
            self.layer_activations = {l:"sigmoid" for l in range(1, self.L)} | {self.L:"sigmoid"}

    def initialise_weights(self) -> None:
        # Ws and bs are dicts keyed by layer number, starting at 1,
        # so Ws[1] is W^{[1]} and the keys match the notation in the notes

        # This is using He initialisation. Try changing to * 0.01 and see the change in cost plot.
        self.Ws = {
            l:np.random.normal(size=(n_l, n_l_minus_1)) * np.sqrt(2 / n_l_minus_1)
            for (l, (n_l, n_l_minus_1)) in enumerate(zip(self.layer_sizes[1:], self.layer_sizes), start=1)
        }
        self.bs = {l:np.zeros((n_l, 1)) for l, n_l in enumerate(self.layer_sizes[1:], start=1)}
        logger.info("Weights initialised")
        logger.debug(f"{self.Ws=}")

    def forward(self, X, cache=False) -> np.ndarray:
        Zs, As = {}, {0:X}
        for l in range(1, self.L + 1):
            Zs[l] = self.Ws[l] @ As[l-1] + self.bs[l]
            g = ACTIVATION_FUNCTIONS[self.layer_activations[l]]
            logger.debug(f"Applying {self.layer_activations[l]} in layer{l}")
            As[l] = g(Zs[l])
        if cache:
            self.Zs, self.As = Zs, As
        return As[self.L]

    def backward(self, Y) -> None:
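        # dZ^{[L]} = A^{[L]} - Y holds for a sigmoid output layer with log_loss;
        # other output activations or losses need a different expression here
        dZs = {self.L: self.As[self.L] - Y}
        m = self.As[0].shape[1]
        dWs, dbs = {}, {}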
        for l in range(self.L, 0, -1):
            logger.debug(f"calculating dZ for layer_id {l}")
            if l != self.L:
                dZs[l] = self.Ws[l+1].T @ dZs[l+1] * \
                    ACTIVATION_FUNCTION_DERIVATIVES[self.layer_activations[l]](self.Zs[l])
                    # For sigmoid we could just use self.As[l] * (1 - self.As[l])
            dWs[l] = (1 / m) * dZs[l] @ self.As[l-1].T
            dbs[l] = (1 / m) * np.sum(dZs[l], axis=1, keepdims=True)
        self.dWs, self.dbs = dWs, dbs

    def update_weights(self):
        for l in range(1, self.L + 1):
            self.Ws[l] -= self.learning_rate * self.dWs[l]
            self.bs[l] -= self.learning_rate * self.dbs[l]

    def train(self, X, Y, n_epochs=10, log_every=100, plot_cost=False):
        import plotly.graph_objs as go
        from IPython.display import display, clear_output

        costs = []
        epochs = []

        if plot_cost:
            fig = go.FigureWidget()
            fig.add_scatter(x=[], y=[], mode='lines+markers', name='Cost')
            fig.update_layout(title='Training Cost over Epochs', xaxis_title='Epoch', yaxis_title='Cost')
            display(fig)

        for epoch in range(n_epochs):
            A = self.forward(X, cache=True)
            cost = self.cost(A, Y)
            if epoch % log_every == 0:
                logger.info(f"Cost after epoch {epoch} = {cost}")
            if plot_cost:
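                # cost is a (1, 1) array; .item() extracts the scalar without
                # triggering NumPy's array-to-scalar deprecation warning
                costs.append(cost.item())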
                epochs.append(epoch)
                if epoch % 10 == 0:
                    with fig.batch_update():
                        fig.data[0].x = epochs
                        fig.data[0].y = costs
            self.backward(Y)
            self.update_weights()

    def cost(self, A, Y):
        if self.cost_function == "log_loss":
            return log_loss(A, Y)
        elif self.cost_function == "square_loss":
            return square_loss(A, Y)
        else:
            raise ValueError(f"Unknown cost function: {self.cost_function}")

    def predict(self, X, return_probability=False):
        Y_hat = self.forward(X)
        if return_probability:
            return Y_hat
        return np.where(Y_hat>0.5, 1, 0)
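
The comment in `initialise_weights` suggests comparing He initialisation with a fixed `* 0.01` scale. The sketch below illustrates, under assumed layer sizes and ReLU activations, how the spread of the pre-activations behaves across layers for each choice; `final_layer_std` is a hypothetical helper, not part of the class above.

```python
import numpy as np

def relu(x):
    return x * (x > 0)

def final_layer_std(scale_fn, layer_sizes, m=1000, seed=0):
    """Standard deviation of the last layer's pre-activations Z."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(layer_sizes[0], m))       # random inputs, unit variance
    for n_prev, n_l in zip(layer_sizes, layer_sizes[1:]):
        W = rng.normal(size=(n_l, n_prev)) * scale_fn(n_prev)
        Z = W @ A
        A = relu(Z)
    return Z.std()

sizes = [100] * 6   # five layers of width 100 (assumed for illustration)
he_std = final_layer_std(lambda n_prev: np.sqrt(2 / n_prev), sizes)
small_std = final_layer_std(lambda n_prev: 0.01, sizes)
print(he_std, small_std)
```

With the He scale the pre-activations stay of order one through the layers, while with the fixed small scale they shrink by orders of magnitude. That is one plausible reason the cost curve changes when the scale is swapped.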


In [15]:
def example():
    X_train = pd.read_feather('../titanic/processed/X_train.feather').to_numpy().T
    y_train = pd.read_feather('../titanic/processed/y_train.feather').to_numpy().T
    X_test = pd.read_feather('../titanic/processed/X_test.feather').to_numpy().T
    y_test = pd.read_feather('../titanic/processed/y_test.feather').to_numpy().T

    logging.basicConfig(level=logging.INFO, force=True)

    layer_sizes = [30, 50, 20, 1] # L = 3, A[3] = Yhat
    neural_network = NeuralNetwork(layer_sizes=layer_sizes)

    # Uncomment to try relu in the hidden layers (three layers, sigmoid output)
    # layer_sizes = [30, 50, 20, 1] # L = 3, A[3] = Yhat
    # neural_network = NeuralNetwork(
    #     layer_sizes=layer_sizes,
    #     layer_activations={1:"relu", 2:"relu", 3:"sigmoid"},
    # )

    neural_network.initialise_weights()
    neural_network.train(X_train, y_train, n_epochs=2000, plot_cost=True)

    y_test_pred = neural_network.predict(X_test)
    accuracy = (y_test_pred == y_test).sum() / y_test.shape[1]

    print(f"X_train.shape: {X_train.shape}")
    print(f"layer_activations = {neural_network.layer_activations}")
    print(f"L = {neural_network.L}")
    print("ws shapes:",  [(i, w.shape) for i, w in neural_network.Ws.items()])
    print("As shapes:",  [(i, a.shape) for i, a in neural_network.As.items()])
    print("zs shapes:",  [(i, z.shape) for i, z in neural_network.Zs.items()])
    print(f"Accuracy on test: {accuracy}")

In [16]:
example()

INFO:root:Weights initialised


[Figure: Training Cost over Epochs (x-axis: Epoch, y-axis: Cost)]

INFO:root:Cost after epoch 0 = [[0.67791741]]

Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)

INFO:root:Cost after epoch 100 = [[0.46155934]]
INFO:root:Cost after epoch 200 = [[0.43274977]]
INFO:root:Cost after epoch 300 = [[0.42003758]]
INFO:root:Cost after epoch 400 = [[0.41334262]]
INFO:root:Cost after epoch 500 = [[0.40872809]]
INFO:root:Cost after epoch 600 = [[0.40503333]]
INFO:root:Cost after epoch 700 = [[0.40186904]]
INFO:root:Cost after epoch 800 = [[0.39904907]]
INFO:root:Cost after epoch 900 = [[0.39646523]]
INFO:root:Cost after epoch 1000 = [[0.39405118]]
INFO:root:Cost after epoch 1100 = [[0.39176588]]
INFO:root:Cost after epoch 1200 = [[0.38958391]]
INFO:root:Cost after epoch 1300 = [[0.3874885]]
INFO:root:Cost after epoch 1400 = [[0.38546622]]
INFO:root:Cost after epoch 1500 = [[0.38350336]]
INFO:root:Cost after epoch

X_train.shape: (30, 712)
layer_activations = {1: 'sigmoid', 2: 'sigmoid', 3: 'sigmoid'}
L = 3
ws shapes: [(1, (50, 30)), (2, (20, 50)), (3, (1, 20))]
As shapes: [(0, (30, 712)), (1, (50, 712)), (2, (20, 712)), (3, (1, 712))]
zs shapes: [(1, (50, 712)), (2, (20, 712)), (3, (1, 712))]
Accuracy on test: 0.8212290502793296
