# Deep Learning Specialisation

## Neural networks and deep learning 

### Logistic regression as a neural network

- Inputs are images of cats, given as RGB pixel values, and the output is 0 or 1
- m training examples (x, y) where $x\in\mathbb{R}^n, y\in\{0,1\}$
- X often represents the training data with samples as rows; in this course, however, the columns of X are the training examples, so X has shape (n, m).
- Y represents the vector of y values with shape (1, m)
- $\hat{y}$ is the prediction. In the cat example $\hat{y} = P(y=1|x)$
- In logistic regression $\hat{y}=\sigma(w^Tx+b)$ where $\sigma(z) = \frac{1}{1+e^{-z}}$
- Loss function $L(\hat{y}, y) = - (y \log(\hat{y}) + (1-y)\log(1-\hat{y}))$
- Cost function $J(w, b) = \frac{1}{m} \sum_{i=1}^mL(\hat{y}^{(i)}, y^{(i)})$, J is convex
- Gradient descent updates w and b to minimise the cost
- We repeatedly update $w := w-\alpha\frac{\partial J}{\partial w} = w-\alpha dw$ where alpha is the learning rate and  $b := b-\alpha\frac{\partial J}{\partial b} = b -\alpha db$
- When implementing, vectorise the computation over all m samples rather than looping over them
- Logistic regression diagram
```mermaid
flowchart LR
    x1(($$x_1$$)) --> z("$$z = w^Tx + b$$")
    x2(($$x_2$$)) --> z
    x3(($$x_3$$)) --> z
    z --> a("$$\hat{y} = a= \sigma(z) $$")
```
- Logistic regression equations are often summarised as
$$
\begin{align*}
z^{(i)} &= w^Tx^{(i)}+b \\
\hat{y}^{(i)} &= a^{(i)} = \sigma(z^{(i)}) \\
L(a^{(i)}, y^{(i)}) &= -y^{(i)}\log(a^{(i)}) - (1 - y^{(i)})\log(1-a^{(i)}) \\
J &= \frac{1}{m}\sum_{i=1}^mL(a^{(i)}, y^{(i)})
\end{align*}
$$

In vector form

$$
\begin{align}
A &= \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, \dots, a^{(m-1)}, a^{(m)}) \\
\frac{\partial J}{\partial w} &= dw = \frac{1}{m}X(A-Y)^T \\
\frac{\partial J}{\partial b} &= db = \frac{1}{m}\sum_{i=1}^m (a^{(i)}-y^{(i)}) \\
J &= -\frac{1}{m}(Y\log(A)^T + (1-Y)\log(1-A)^T)
\end{align}
$$
$$
X - dim: (n, m) \quad 
w - dim: (n, 1) \quad
b - dim: (1, 1) \quad
A - dim: (1, m) \quad
Y - dim: (1, m)
$$
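
The vectorised equations above can be sketched in NumPy as follows. This is a minimal illustration rather than the course's reference code: the function name `logistic_step`, the toy data and the learning rate are assumptions made for the example, with X of shape (n, m) and Y of shape (1, m) as in the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step(w, b, X, Y, alpha=0.1):
    """One vectorised gradient descent step; X is (n, m), Y is (1, m)."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                                  # (1, m) predictions
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # J
    dw = (X @ (A - Y).T) / m                                  # (n, 1)
    db = np.sum(A - Y) / m                                    # scalar
    return w - alpha * dw, b - alpha * db, cost

# Hypothetical toy data: labels come from a known linear rule.
rng = np.random.default_rng(0)
n, m = 3, 200
X = rng.normal(size=(n, m))
true_w = np.array([[1.5], [-2.0], [0.5]])
Y = (true_w.T @ X > 0).astype(float)

w, b = np.zeros((n, 1)), 0.0
costs = []
for _ in range(200):
    w, b, cost = logistic_step(w, b, X, Y)
    costs.append(cost)

print(f"first cost = {costs[0]:.4f}, last cost = {costs[-1]:.4f}")
```

With w and b initialised to zero every prediction is 0.5, so the first cost is log(2); it then falls as the updates are applied.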

### Shallow neural networks

- For a network we use the superscript $[l]$ to denote the $l$th layer; the input is layer 0, so the dimension $n$ of $x$ is written $n^{[0]}$
- Network flow

```mermaid
flowchart LR
    x1(($$x_1$$)) --> z1("$$z^{[1]} = W^{[1]}x + b^{[1]}$$")
    x2(($$x_2$$)) --> z1
    x3(($$x_3$$)) --> z1
    z1 --> a1("$$a^{[1]} = \sigma(z^{[1]}) $$")
    a1 --> z2("$$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$")
    z2 --> a2("$$\hat{y} = a^{[2]} = \sigma(z^{[2]}) $$")
```
- The forward pass is calculated as
$$
\begin{align}
z^{[1](i)} &= W^{[1]}x^{(i)}+b^{[1]}, \quad\text{for the ith sample} \\
Z^{[1]} &= W^{[1]}X+b^{[1]} \\
A^{[1]} &= \sigma(Z^{[1]}) \\
Z^{[2]} &= W^{[2]}A^{[1]} + b^{[2]} \\
\hat{Y} = A^{[2]} &= \sigma(Z^{[2]})
\end{align}
$$
- Activation functions could be tanh, ReLU, leaky ReLU or the sigmoid
- Taking the example of a two layer network (note we don't include the input layer in the count so we have 1 hidden layer) we have params

$$\begin{align}
x^{(i)} &- dim: (n, 1) \\
X &- dim: (n, m) \\
W^{[1]} &- dim: (n^{[1]}, n^{[0]}) \\
b^{[1]} &- dim: (n^{[1]}, 1) \\
Z^{[1]} &- dim: (n^{[1]}, m) \\
A^{[1]} &- dim: (n^{[1]}, m) \\
W^{[2]} &- dim: (n^{[2]}, n^{[1]}) \\
b^{[2]} &- dim: (n^{[2]}, 1) \\
Z^{[2]} &- dim: (n^{[2]}, m) \\
A^{[2]} &- dim: (n^{[2]}, m) \\
&n^{[2]} = 1 \\
Y &- dim: (n^{[2]}, m) \\
\end{align}
$$ 
- Back prop equations
$$\begin{align}
dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= \frac{1}{m}dZ^{[2]}A^{[1]T} \\
db^{[2]} &= \frac{1}{m}np.sum(dZ^{[2]}, axis=1, keepdims=True) \\
dZ^{[1]} &= W^{[2]T}dZ^{[2]} * \sigma'(Z^{[1]}) \\
dW^{[1]} &= \frac{1}{m}dZ^{[1]}X^T \\
db^{[1]} &= \frac{1}{m}np.sum(dZ^{[1]}, axis=1, keepdims=True) \\
\end{align}
$$ 
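- Initialise the weights randomly: if every weight in a layer starts at the same value, all units in that layer compute the same output and receive the same gradient, so they never learn different features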
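
As a sketch of the back prop equations above and of why random initialisation matters, the snippet below runs one forward and backward pass of a two-layer sigmoid network twice: once with constant weights and once with random weights. The function name `two_layer_grads`, the layer sizes and the toy data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_grads(params, X, Y):
    """Forward and backward pass for a 2-layer sigmoid network."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1 = sigmoid(W1 @ X + b1)                  # (n1, m)
    A2 = sigmoid(W2 @ A1 + b2)                 # (1, m)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)         # sigma'(Z1) = A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
n0, n1, m = 4, 3, 50
X = rng.normal(size=(n0, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)

# Constant initialisation: every hidden unit starts out identical.
const = (np.full((n1, n0), 0.5), np.zeros((n1, 1)),
         np.full((1, n1), 0.5), np.zeros((1, 1)))
dW1_const, *_ = two_layer_grads(const, X, Y)

# Random initialisation breaks the symmetry between hidden units.
rand = (rng.normal(size=(n1, n0)), np.zeros((n1, 1)),
        rng.normal(size=(1, n1)), np.zeros((1, 1)))
dW1_rand, db1_rand, dW2_rand, db2_rand = two_layer_grads(rand, X, Y)

rows_identical_const = np.allclose(dW1_const, dW1_const[0])
rows_identical_rand = np.allclose(dW1_rand, dW1_rand[0])
print(rows_identical_const, rows_identical_rand)
```

With constant weights every row of $dW^{[1]}$ is the same, so the hidden units stay identical after each update; with random weights the rows differ.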

### Deep neural networks

 - The equations for the forward pass can be summarised as below, where there are $L$ layers
 $$
\begin{align}
z^{[l](i)} &= W^{[l]}a^{[l-1](i)}+b^{[l]}, \quad\text{ith sample, lth layer} \\
a^{[l](i)} &= g(z^{[l](i)}) , \quad\text{ith sample, lth layer, g - activation function} \\
a^{[0](i)} &= x^{(i)} \\
\hat{y}^{(i)} &= a^{[L](i)} \\
L(a^{[L](i)}, y^{(i)}) &= -y^{(i)}\log(a^{[L](i)}) - (1 - y^{(i)})\log(1-a^{[L](i)})
\end{align}
$$
 - As vectors
$$
\begin{align}
Z^{[l]} &= W^{[l]}A^{[l-1]}+b^{[l]} \\
A^{[l]} &= g(Z^{[l]}), \quad\text{lth layer, g - activation function} \\ 
A^{[0]} &= X \\
\hat{Y} &= A^{[L]} \\
J &= -\frac{1}{m}(Y\log(A^{[L]})^T + (1-Y)\log(1-A^{[L]})^T)
\end{align}
$$
- The equations for the backward pass
$$
\begin{align}
dZ^{[L]} &= A^{[L]} - Y \\
dZ^{[l]} &= W^{[l+1]T}dZ^{[l+1]} * g'(Z^{[l]}) \quad l\in\{1, \dots, L-1\} \\
dW^{[l]} &= \frac{1}{m}dZ^{[l]}A^{[l-1]T} \quad l\in\{1, \dots, L\} \\
db^{[l]} &= \frac{1}{m}np.sum(dZ^{[l]}, axis=1, keepdims=True) \quad l\in\{1, \dots, L\}
\end{align}
$$
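
One way to sanity-check the backward-pass equations is to compare them with a finite-difference estimate of $\partial J/\partial W^{[l]}$. The sketch below assumes sigmoid activations in every layer and a log-loss cost; the helper names (`forward`, `cost`, `backward`, `numerical_grad`), layer sizes and toy data are illustrative choices, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(Ws, bs, X):
    """Return activations [A0, A1, ..., AL] with sigmoid in every layer."""
    As = [X]
    for W, b in zip(Ws, bs):
        As.append(sigmoid(W @ As[-1] + b))
    return As

def cost(Ws, bs, X, Y):
    A = forward(Ws, bs, X)[-1]
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

def backward(Ws, bs, X, Y):
    """Analytic gradients from the backward-pass equations."""
    As = forward(Ws, bs, X)
    m = X.shape[1]
    dZ = As[-1] - Y                                    # dZ^{[L]}
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):                 # list index l holds layer l + 1
        dWs[l] = dZ @ As[l].T / m
        dbs[l] = dZ.sum(axis=1, keepdims=True) / m
        if l > 0:
            dZ = (Ws[l].T @ dZ) * As[l] * (1 - As[l])  # g'(Z) for sigmoid
    return dWs, dbs

def numerical_grad(Ws, bs, X, Y, layer, eps=1e-6):
    """Central-difference estimate of dJ/dW for one layer."""
    grad = np.zeros_like(Ws[layer])
    for idx in np.ndindex(*Ws[layer].shape):
        original = Ws[layer][idx]
        Ws[layer][idx] = original + eps
        plus = cost(Ws, bs, X, Y)
        Ws[layer][idx] = original - eps
        minus = cost(Ws, bs, X, Y)
        Ws[layer][idx] = original                      # restore the weight
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]                                   # n^{[0]}, ..., n^{[L]}
Ws = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
bs = [np.zeros((n_out, 1)) for n_out in sizes[1:]]
X = rng.normal(size=(4, 20))
Y = rng.integers(0, 2, size=(1, 20)).astype(float)

dWs, _ = backward(Ws, bs, X, Y)
max_errors = [np.abs(dWs[l] - numerical_grad(Ws, bs, X, Y, l)).max()
              for l in range(len(Ws))]
print(max_errors)
```

If the analytic and numerical gradients disagree by much more than the finite-difference error, the backward pass most likely has a bug.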


In [13]:
from typing import Callable

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from scipy.special import expit


def relu(x):
    return x * (x > 0)

def relu_derivative(x):
    return 1. * (x > 0)

sigmoid = expit

def sigmoid_derivative(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def leaky_relu(x, leaky_constant:float = 0.01):
    return np.where(x > 0.0, x, x * leaky_constant)

def leaky_relu_derivative(x, leaky_constant=0.01):
    return np.where(x > 0, 1, leaky_constant)

def log_loss(A, Y):
    m = A.shape[1] # m is number samples
    return - (1/m) * (Y @ np.log(A).T + (1 - Y) @ np.log(1 - A).T)

def square_loss(A, Y):
    m = A.shape[1] # m is number samples
    # Mean squared error is non-negative, so there is no leading minus sign
    return (1/m) * (A - Y) @ (A - Y).T

ACTIVATION_FUNCTIONS : dict[str, Callable] = {
    "relu": relu,
    "leaky_relu": leaky_relu,
    "sigmoid": expit,
}

ACTIVATION_FUNCTION_DERIVATIVES: dict[str, Callable] = {
    "relu": relu_derivative,
    "leaky_relu": leaky_relu_derivative,
    "sigmoid": sigmoid_derivative,
}

LOSS_FUNCTIONS: dict[str, Callable] = {
    "log_loss": log_loss,
    "square_loss": square_loss
}

In [14]:
import logging

logger = logging.getLogger()

class NeuralNetwork:

    def __init__(
        self,
        layer_sizes: list[int],
        learning_rate: float = 0.5,
        layer_activations: dict[int, str] | None = None,
        cost_function: str = "log_loss"
    ):
        self.learning_rate = learning_rate
        self.layer_sizes = layer_sizes
        self.L = len(layer_sizes) - 1
        self.m = layer_sizes[0]  # note: this is the input dimension n^{[0]}, not the sample count
        self.cost_function = cost_function
        self.layer_activations = layer_activations
        if not layer_activations:
            self.layer_activations = {l:"sigmoid" for l in range(1, self.L)} | {self.L:"sigmoid"}

    def initialise_weights(self) -> None:
        # Ws and bs are dicts keyed by layer number starting at 1,
        # so Ws[1] is W^{[1]} and bs[1] is b^{[1]}

        # This is using He initialisation. Try changing to * 0.01 and see the change in cost plot.
        self.Ws = {
            l:np.random.normal(size=(n_l, n_l_minus_1)) * np.sqrt(2 / n_l_minus_1)
            for (l, (n_l, n_l_minus_1)) in enumerate(zip(self.layer_sizes[1:], self.layer_sizes), start=1)
        }
        self.bs = {l:np.zeros((n_l, 1)) for l, n_l in enumerate(self.layer_sizes[1:], start=1)}
        logger.info("Weights initialised")
        logger.debug(f"{self.Ws=}")

    def forward(self, X, cache=False) -> np.ndarray:
        Zs, As = {}, {0:X}
        for l in range(1, self.L + 1):
            Zs[l] = self.Ws[l] @ As[l-1] + self.bs[l]
            g = ACTIVATION_FUNCTIONS[self.layer_activations[l]]
            logger.debug(f"Applying {self.layer_activations[l]} in layer{l}")
            As[l] = g(Zs[l])
        if cache:
            self.Zs, self.As = Zs, As
        return As[self.L]

    def backward(self, Y) -> None:
        dZs = {self.L: self.As[self.L] - Y}
        m = self.As[0].shape[1]
        dWs, dbs = {}, {}
        # Walk backwards from the output layer L to layer 1
        for l in range(self.L, 0, -1):
            logger.debug(f"calculating dZ for layer_id {l}")
            if l != self.L:
                dZs[l] = self.Ws[l+1].T @ dZs[l+1] * \
                    ACTIVATION_FUNCTION_DERIVATIVES[self.layer_activations[l]](self.Zs[l])
                    # For sigmoid we could just use self.As[l] * (1 - self.As[l])
            dWs[l] = (1 / m) * dZs[l] @ self.As[l-1].T
            dbs[l] = (1 / m) * np.sum(dZs[l], axis=1, keepdims=True)
        self.dWs, self.dbs = dWs, dbs

    def update_weights(self):
        for l in range(1, self.L + 1):
            self.Ws[l] -= self.learning_rate * self.dWs[l]
            self.bs[l] -= self.learning_rate * self.dbs[l]

    def train(self, X, Y, n_epochs=10, log_every=100, plot_cost=False):
        import plotly.graph_objs as go
        from IPython.display import display, clear_output

        costs = []
        epochs = []

        if plot_cost:
            fig = go.FigureWidget()
            fig.add_scatter(x=[], y=[], mode='lines+markers', name='Cost')
            fig.update_layout(title='Training Cost over Epochs', xaxis_title='Epoch', yaxis_title='Cost')
            display(fig)

        for epoch in range(n_epochs):
            A = self.forward(X, cache=True)
            cost = self.cost(A, Y)
            if epoch % log_every == 0:
                logger.info(f"Cost after epoch {epoch} = {cost}")
            if plot_cost:
                costs.append(float(cost))
                epochs.append(epoch)
                if epoch % 10 == 0:
                    with fig.batch_update():
                        fig.data[0].x = epochs
                        fig.data[0].y = costs
            self.backward(Y)
            self.update_weights()

    def cost(self, A, Y):
        if self.cost_function == "log_loss":
            return log_loss(A, Y)
        elif self.cost_function == "square_loss":
            return square_loss(A, Y)
        else:
            raise Exception(f"Incorrect value for self.cost_function:= {self.cost_function}")

    def predict(self, X, return_probability=False):
        Y_hat = self.forward(X)
        if return_probability:
            return Y_hat
        return np.where(Y_hat>0.5, 1, 0)


In [15]:
def example():
    X_train = pd.read_feather('../titanic/processed/X_train.feather').to_numpy().T
    y_train = pd.read_feather('../titanic/processed/y_train.feather').to_numpy().T
    X_test = pd.read_feather('../titanic/processed/X_test.feather').to_numpy().T
    y_test = pd.read_feather('../titanic/processed/y_test.feather').to_numpy().T

    logging.basicConfig(level=logging.INFO, force=True)

    layer_sizes = [30, 50, 20, 1] # L = 3, A[3] = Yhat
    neural_network = NeuralNetwork(layer_sizes=layer_sizes)

    # uncomment to see that relu converges better!
    # layer_sizes = [30, 50, 20, 1] # L = 3, A[3] = Yhat
    # neural_network = NeuralNetwork(
    #     layer_sizes=layer_sizes,
    #     layer_activations={1:"relu", 2:"relu", 3:"sigmoid"},
    # )

    neural_network.initialise_weights()
    neural_network.train(X_train, y_train, n_epochs=2000, plot_cost=True)

    y_test_pred = neural_network.predict(X_test)
    accuracy = (y_test_pred == y_test).sum() / y_test.shape[1]

    print(f"X_train.shape: {X_train.shape}")
    print(f"layer_activations = {neural_network.layer_activations}")
    print(f"L = {neural_network.L}")
    print("ws shapes:",  [(i, w.shape) for i, w in neural_network.Ws.items()])
    print("As shapes:",  [(i, a.shape) for i, a in neural_network.As.items()])
    print("zs shapes:",  [(i, z.shape) for i, z in neural_network.Zs.items()])
    print(f"Accuracy on test: {accuracy}")

In [16]:
example()

INFO:root:Weights initialised


[Figure: "Training Cost over Epochs" line plot, x-axis "Epoch", y-axis "Cost"]

INFO:root:Cost after epoch 0 = [[0.67791741]]

Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)

INFO:root:Cost after epoch 100 = [[0.46155934]]
INFO:root:Cost after epoch 200 = [[0.43274977]]
INFO:root:Cost after epoch 300 = [[0.42003758]]
INFO:root:Cost after epoch 400 = [[0.41334262]]
INFO:root:Cost after epoch 500 = [[0.40872809]]
INFO:root:Cost after epoch 600 = [[0.40503333]]
INFO:root:Cost after epoch 700 = [[0.40186904]]
INFO:root:Cost after epoch 800 = [[0.39904907]]
INFO:root:Cost after epoch 900 = [[0.39646523]]
INFO:root:Cost after epoch 1000 = [[0.39405118]]
INFO:root:Cost after epoch 1100 = [[0.39176588]]
INFO:root:Cost after epoch 1200 = [[0.38958391]]
INFO:root:Cost after epoch 1300 = [[0.3874885]]
INFO:root:Cost after epoch 1400 = [[0.38546622]]
INFO:root:Cost after epoch 1500 = [[0.38350336]]
INFO:root:Cost after epoch

X_train.shape: (30, 712)
layer_activations = {1: 'sigmoid', 2: 'sigmoid', 3: 'sigmoid'}
L = 3
ws shapes: [(1, (50, 30)), (2, (20, 50)), (3, (1, 20))]
As shapes: [(0, (30, 712)), (1, (50, 712)), (2, (20, 712)), (3, (1, 712))]
zs shapes: [(1, (50, 712)), (2, (20, 712)), (3, (1, 712))]
Accuracy on test: 0.8212290502793296


## Hyperparameter tuning, regularisation and optimisation

### L2 regularisation, dropout, vanishing and exploding gradients, standard scaling

- Decrease bias by making the model more complex, e.g. bigger network
- Decrease variance with more data, regularisation, dropout, new architecture
- The most common form of regularisation is L2 regularisation, so the cost function is
$$
\begin{align}
J &= -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L}\|W^{[l]}\|_F^2 \\
J &= -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \right] + \frac{\lambda}{2m}
 \sum_{l=1}^{L}\sum_{k=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(W_{kj}^{[l]})^2 \\
J &= -\frac{1}{m}(Y\log(A^{[L]})^T + (1-Y)\log(1-A^{[L]})^T) + \frac{\lambda}{2m}\sum_{l=1}^{L}\text{trace}(W^{[l]T}W^{[l]})
\end{align}
$$
- In dropout each node is kept with probability `keep_prob` and dropped otherwise. When implementing, after dropout each activation vector $a^{[l]}$ is divided by `keep_prob` so that its expected value is unchanged; this is called inverted dropout
- Normalise inputs using standard scaling to speed up the training of a neural network
- Vanishing and exploding gradients happen because back prop multiplies many terms together. To mitigate this, use a good initialisation such as $W^{[l]} = np.random.randn(shape) * \sqrt{\frac{2}{n^{[l-1]}}}$
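
A small sketch of inverted dropout, assuming a hypothetical activation matrix `A` filled with uniform random values in place of a real layer's output. It shows that dividing by `keep_prob` keeps the mean activation roughly unchanged while about 20% of the entries are zeroed.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
A = rng.uniform(0.1, 1.0, size=(50, 10_000))   # stand-in for a layer's activations

mask = rng.uniform(size=A.shape) < keep_prob   # True where a node is kept
A_dropped = A * mask / keep_prob               # inverted dropout: rescale the survivors

kept_fraction = mask.mean()
mean_before, mean_after = A.mean(), A_dropped.mean()
print(f"kept {kept_fraction:.3f}, mean before {mean_before:.3f}, mean after {mean_after:.3f}")
```

Because the rescaling happens at training time, no extra scaling is needed at prediction time, when dropout is switched off.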



### L2 regularisation

In [15]:
from typing import Callable

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from scipy.special import expit

def relu(x):
    return x * (x > 0)

def relu_derivative(x):
    return 1. * (x > 0)

sigmoid = expit

def sigmoid_derivative(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def leaky_relu(x, leaky_constant:float = 0.01):
    return np.where(x > 0.0, x, x * leaky_constant)

def leaky_relu_derivative(x, leaky_constant=0.01):
    return np.where(x > 0, 1, leaky_constant)


def log_loss(A, Y, Ws, regularisation_lambda:float = 0):
    m = A.shape[1] # m is number samples
    cost = - (1/m) * ((Y @ np.log(A).T + (1 - Y) @ np.log(1 - A).T))
    if not regularisation_lambda:
        return cost
    return cost + (regularisation_lambda / (2 * m)) * sum(np.square(W).sum() for W in Ws.values())

def square_loss(A, Y, Ws, regularisation_lambda:float = 0):
    m = A.shape[1] # m is number samples
    cost = (1/m) * ((A - Y) @ (A - Y).T)  # squared error is non-negative
    if not regularisation_lambda:
        return cost
    return cost + (regularisation_lambda / (2 * m)) * sum(np.square(W).sum() for W in Ws.values())

ACTIVATION_FUNCTIONS : dict[str, Callable] = {
    "relu": relu,
    "leaky_relu": leaky_relu,
    "sigmoid": expit,
}

ACTIVATION_FUNCTION_DERIVATIVES: dict[str, Callable] = {
    "relu": relu_derivative,
    "leaky_relu": leaky_relu_derivative,
    "sigmoid": sigmoid_derivative,
}

LOSS_FUNCTIONS: dict[str, Callable] = {
    "log_loss": log_loss,
    "square_loss": square_loss
}

In [7]:
import logging

logger = logging.getLogger()

class NeuralNetwork:
    layer_activations: dict[int, str]
    def __init__(
        self,
        layer_sizes: list[int],
        learning_rate: float = 0.5,
        layer_activations: dict[int, str] | None = None,
        regularisation_lambda: float = 0,
        cost_function: str = "log_loss"
    ):
        self.learning_rate = learning_rate
        self.layer_sizes = layer_sizes
        self.regularisation_lambda = regularisation_lambda
        self.L = len(layer_sizes) - 1
        self.m = layer_sizes[0]
        self.cost_function = cost_function
        if layer_activations:
            self.layer_activations = layer_activations
        else:
            # This sets all hidden layers and the output layer to "sigmoid" by default.
            self.layer_activations = {l:"sigmoid" for l in range(1, self.L)} | {self.L:"sigmoid"}

    def initialise_weights(self) -> None:
        # This is using He initialisation. Try changing to * 0.01 and see the change in cost plot.
        self.Ws = {
            l:np.random.normal(size=(n_l, n_l_minus_1)) * np.sqrt(2 / n_l_minus_1)
            for (l, (n_l, n_l_minus_1)) in enumerate(zip(self.layer_sizes[1:], self.layer_sizes), start=1)
        }
        self.bs = {l:np.zeros((n_l, 1)) for l, n_l in enumerate(self.layer_sizes[1:], start=1)}
        logger.info("Weights initialised")
        logger.debug(f"{self.Ws=}")

    def forward(self, X, cache=False) -> np.ndarray:
        Zs, As = {}, {0:X}
        for l in range(1, self.L + 1):
            Zs[l] = self.Ws[l] @ As[l-1] + self.bs[l]
            g = ACTIVATION_FUNCTIONS[self.layer_activations[l]]
            logger.debug(f"Applying {self.layer_activations[l]} in layer{l}")
            As[l] = g(Zs[l])
        if cache:
            self.Zs, self.As = Zs, As
        return As[self.L]

    def backward(self, Y) -> None:
        dZs = {self.L: self.As[self.L] - Y}
        m = self.As[0].shape[1]
        dWs, dbs = {}, {}
        for l in range(self.L, 0, -1):
            logger.debug(f"calculating dZ for layer_id {l}")
            if l != self.L:
                dZs[l] = self.Ws[l+1].T @ dZs[l+1] * \
                    ACTIVATION_FUNCTION_DERIVATIVES[self.layer_activations[l]](self.Zs[l])
                    # For sigmoid we could just use self.As[l] * (1 - self.As[l])
            dWs[l] = (1.0 / m) * dZs[l] @ self.As[l-1].T
            if self.regularisation_lambda:
                dWs[l] += ((self.regularisation_lambda / m) * self.Ws[l])
            dbs[l] = (1.0 / m) * np.sum(dZs[l], axis=1, keepdims=True)
        self.dWs, self.dbs = dWs, dbs

    def update_weights(self):
        for l in range(1, self.L + 1):
            self.Ws[l] -= self.learning_rate * self.dWs[l]
            self.bs[l] -= self.learning_rate * self.dbs[l]

    def train(self, X, Y, n_epochs=10, log_every=100, plot_cost=False, fig=None):
        import plotly.graph_objs as go
        from IPython.display import display, clear_output

        costs = []
        epochs = []

        if plot_cost:
            if fig is None:
                fig = go.FigureWidget()
                display(fig)
            fig.add_scatter(x=[], y=[], mode='lines+markers', name=f'Cost {self.regularisation_lambda}')
            fig.update_layout(title='Training Cost over Epochs', xaxis_title='Epoch', yaxis_title='Cost')

        for epoch in range(n_epochs):
            A = self.forward(X, cache=True)
            self.backward(Y)
            self.update_weights()
            cost = self.cost(A, Y)
            if epoch % log_every == 0:
                logger.info(f"Cost after epoch {epoch} = {cost}")
            if plot_cost:
                costs.append(cost.item()) # extract from (1, 1) array
                epochs.append(epoch)
                if epoch % 10 == 0:
                    with fig.batch_update():
                        fig.data[-1].x = epochs
                        fig.data[-1].y = costs

    def cost(self, A, Y):
        if self.cost_function == "log_loss":
            return log_loss(A, Y, self.Ws, self.regularisation_lambda)
        elif self.cost_function == "square_loss":
            return square_loss(A, Y, self.Ws, self.regularisation_lambda)
        else:
            raise Exception(f"Incorrect value for self.cost_function:= {self.cost_function}")

    def predict(self, X, return_probability=False):
        Y_hat = self.forward(X)
        if return_probability:
            return Y_hat
        return np.where(Y_hat>0.5, 1, 0)


In [8]:
def regularisation_example():

    X_train = pd.read_feather('../titanic/processed/X_train.feather').to_numpy().T
    y_train = pd.read_feather('../titanic/processed/y_train.feather').to_numpy().T
    X_test = pd.read_feather('../titanic/processed/X_test.feather').to_numpy().T
    y_test = pd.read_feather('../titanic/processed/y_test.feather').to_numpy().T

    # Define a simple neural network architecture
    layers = [30, 50, 20, 1]

    fig = go.FigureWidget()
    display(fig)

    # Fit with lambda = 0 (no regularization)
    nn_no_reg = NeuralNetwork(layers, regularisation_lambda=0)
    nn_no_reg.initialise_weights()
    nn_no_reg.train(X_train, y_train, n_epochs=400, log_every=500, plot_cost=True, fig=fig)
    train_Y_pred_no_reg = nn_no_reg.predict(X_train)
    test_Y_pred_no_reg = nn_no_reg.predict(X_test)

    # Fit with lambda = 1 (with regularization)
    nn_reg = NeuralNetwork(layers, regularisation_lambda=1)
    nn_reg.initialise_weights()
    nn_reg.train(X_train, y_train, n_epochs=400, log_every=500, plot_cost=True, fig=fig)
    train_Y_pred_reg = nn_reg.predict(X_train)
    test_Y_pred_reg = nn_reg.predict(X_test)

    train_accuracy_no_reg = (train_Y_pred_no_reg == y_train).sum() / y_train.shape[1]
    test_accuracy_no_reg = (test_Y_pred_no_reg == y_test).sum() / y_test.shape[1]
    train_accuracy_reg = (train_Y_pred_reg == y_train).sum() / y_train.shape[1]
    test_accuracy_reg = (test_Y_pred_reg == y_test).sum() / y_test.shape[1]

    print(f"Accuracy on train with no regularisation: {train_accuracy_no_reg}")
    print(f"Accuracy on test with no regularisation: {test_accuracy_no_reg}")
    print(f"Accuracy on train with regularisation: {train_accuracy_reg}")
    print(f"Accuracy on test with regularisation: {test_accuracy_reg}")


regularisation_example()

[Figure: "Training Cost over Epochs", one cost curve per value of lambda]

Accuracy on train with no regularisation: 0.8328651685393258
Accuracy on test with no regularisation: 0.8100558659217877
Accuracy on train with regularisation: 0.8314606741573034
Accuracy on test with regularisation: 0.8044692737430168
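
The extra gradient term `(regularisation_lambda / m) * W` is equivalent to shrinking the weights slightly before each plain gradient step, which is why L2 regularisation is also called weight decay. The sketch below checks this identity numerically; the random `W` and `dW_data` are stand-ins, while `alpha`, `lam` and `m` mirror the learning rate, lambda and training-set size used in the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
dW_data = rng.normal(size=(3, 4))   # stand-in for (1/m) * dZ @ A_prev.T
alpha, lam, m = 0.5, 1.0, 712

# Update as written in backward(): add (lambda / m) * W to the gradient.
W_regularised = W - alpha * (dW_data + (lam / m) * W)

# The same update written as weight decay: shrink W, then take the plain step.
shrink = 1 - alpha * lam / m
W_decay = shrink * W - alpha * dW_data

print(f"shrink factor per step = {shrink:.6f}")
print(np.allclose(W_regularised, W_decay))
```

With these values the weights shrink by well under 0.1% per step, which may be one reason the regularised and unregularised runs above end up with similar accuracies.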


### Dropout

In [32]:
import logging

logger = logging.getLogger()

class NeuralNetwork:
    layer_activations: dict[int, str]
    def __init__(
        self,
        layer_sizes: list[int],
        learning_rate: float = 0.5,
        layer_activations: dict[int, str] | None = None,
        regularisation_lambda: float = 0.0,
        keep_prob: float = 1.0,
        cost_function: str = "log_loss",
        model_id: str = "",
    ):
        self.learning_rate = learning_rate
        self.layer_sizes = layer_sizes
        self.regularisation_lambda = regularisation_lambda
        self.L = len(layer_sizes) - 1
        self.m = layer_sizes[0]
        self.cost_function = cost_function
        self.keep_prob = keep_prob
        self.model_id = model_id
        if layer_activations:
            self.layer_activations = layer_activations
        else:
            # This sets all hidden layers and the output layer to "sigmoid" by default.
            self.layer_activations = {l:"sigmoid" for l in range(1, self.L)} | {self.L:"sigmoid"}

    def initialise_weights(self) -> None:
        # This is using He initialisation. Try changing to * 0.01 and see the change in cost plot.
        self.Ws = {
            l:np.random.normal(size=(n_l, n_l_minus_1)) * np.sqrt(2 / n_l_minus_1)
            for (l, (n_l, n_l_minus_1)) in enumerate(zip(self.layer_sizes[1:], self.layer_sizes), start=1)
        }
        self.bs = {l:np.zeros((n_l, 1)) for l, n_l in enumerate(self.layer_sizes[1:], start=1)}
        logger.info("Weights initialised")
        logger.debug(f"{self.Ws=}")

    def forward(self, X, cache=False) -> np.ndarray:
        Zs, As, Ds = {}, {0:X}, {}
        for l in range(1, self.L + 1):
            Zs[l] = self.Ws[l] @ As[l-1] + self.bs[l]
            g = ACTIVATION_FUNCTIONS[self.layer_activations[l]]
            logger.debug(f"Applying {self.layer_activations[l]} in layer{l}")
            As[l] = g(Zs[l])
            # apply drop out
            if cache and l != self.L:
                Ds[l] = (np.random.uniform(size=As[l].shape) < self.keep_prob).astype(int) / self.keep_prob
                As[l] *= Ds[l]
        if cache:
            self.Zs, self.As, self.Ds = Zs, As, Ds
        return As[self.L]

    def backward(self, Y) -> None:
        dZs = {self.L: self.As[self.L] - Y}
        m = self.As[0].shape[1]
        dWs, dbs = {}, {}
        for l in range(self.L, 0, -1):
            logger.debug(f"calculating dZ for layer_id {l}")
            if l != self.L:
                # dj/dz = dj/da * da/dz
                dZs[l] = self.Ws[l+1].T @ dZs[l+1] * self.Ds[l] * \
                    ACTIVATION_FUNCTION_DERIVATIVES[self.layer_activations[l]](self.Zs[l])
                    # For sigmoid we could just use self.As[l] * (1 - self.As[l])
            dWs[l] = (1.0 / m) * dZs[l] @ self.As[l-1].T
            # The cost penalises every layer's weights, so every layer gets the L2 gradient term
            if self.regularisation_lambda:
                dWs[l] += ((self.regularisation_lambda / m) * self.Ws[l])
            dbs[l] = (1.0 / m) * np.sum(dZs[l], axis=1, keepdims=True)
        self.dWs, self.dbs = dWs, dbs

    def update_weights(self):
        for l in range(1, self.L + 1):
            self.Ws[l] -= self.learning_rate * self.dWs[l]
            self.bs[l] -= self.learning_rate * self.dbs[l]

    def train(self, X, Y, n_epochs=10, log_every=100, plot_cost=False, fig=None):
        import plotly.graph_objs as go
        from IPython.display import display, clear_output

        costs = []
        epochs = []

        if plot_cost:
            if fig is None:
                fig = go.FigureWidget()
                display(fig)
            fig.add_scatter(x=[], y=[], mode='lines+markers', name=self.model_id)
            fig.update_layout(title='Training Cost over Epochs', xaxis_title='Epoch', yaxis_title='Cost')

        for epoch in range(n_epochs):
            A = self.forward(X, cache=True)
            self.backward(Y)
            self.update_weights()
            cost = self.cost(A, Y).item()
            if epoch % log_every == 0:
                logger.info(f"Cost after epoch {epoch} = {cost}")
            if plot_cost:
                if np.isnan(cost):
                    cost = 10
                costs.append(cost)
                epochs.append(epoch)
                if epoch % 10 == 0:
                    with fig.batch_update():
                        fig.data[-1].x = epochs
                        fig.data[-1].y = costs

    def cost(self, A, Y):
        if self.cost_function == "log_loss":
            return log_loss(A, Y, self.Ws, self.regularisation_lambda)
        elif self.cost_function == "square_loss":
            return square_loss(A, Y, self.Ws, self.regularisation_lambda)
        else:
            raise Exception(f"Incorrect value for self.cost_function:= {self.cost_function}")

    def predict(self, X, return_probability=False):
        Y_hat = self.forward(X)
        if return_probability:
            return Y_hat
        return np.where(Y_hat>0.5, 1, 0)


In [33]:
def dropout_example():

    X_train = pd.read_feather('../titanic/processed/X_train.feather').to_numpy().T
    y_train = pd.read_feather('../titanic/processed/y_train.feather').to_numpy().T
    X_test = pd.read_feather('../titanic/processed/X_test.feather').to_numpy().T
    y_test = pd.read_feather('../titanic/processed/y_test.feather').to_numpy().T

    # Define a simple neural network architecture
    layers = [30, 50, 20, 1]

    fig = go.FigureWidget()
    display(fig)

    # Fit with keep_prob = 1 (no dropout)
    nn_no_reg = NeuralNetwork(layers, keep_prob=1, model_id="No dropout")
    nn_no_reg.initialise_weights()
    nn_no_reg.train(X_train, y_train, n_epochs=500, log_every=500, plot_cost=True, fig=fig)
    train_Y_pred_no_reg = nn_no_reg.predict(X_train)
    test_Y_pred_no_reg = nn_no_reg.predict(X_test)

    # Fit with keep_prob = 0.8 (20% dropout)
    nn_reg = NeuralNetwork(layers, keep_prob=0.8, model_id="20% drop out")
    nn_reg.initialise_weights()
    nn_reg.train(X_train, y_train, n_epochs=500, log_every=50, plot_cost=True, fig=fig)
    train_Y_pred_reg = nn_reg.predict(X_train)
    test_Y_pred_reg = nn_reg.predict(X_test)

    train_accuracy_no_reg = (train_Y_pred_no_reg == y_train).sum() / y_train.shape[1]
    test_accuracy_no_reg = (test_Y_pred_no_reg == y_test).sum() / y_test.shape[1]
    train_accuracy_reg = (train_Y_pred_reg == y_train).sum() / y_train.shape[1]
    test_accuracy_reg = (test_Y_pred_reg == y_test).sum() / y_test.shape[1]

    print(f"Accuracy on train with no regularisation: {train_accuracy_no_reg}")
    print(f"Accuracy on test with no regularisation: {test_accuracy_no_reg}")
    print(f"Accuracy on train with regularisation: {train_accuracy_reg}")
    print(f"Accuracy on test with regularisation: {test_accuracy_reg}")


dropout_example()

[Figure: "Training Cost over Epochs", one cost curve per model ("No dropout", "20% drop out")]

Accuracy on train with no regularisation: 0.8370786516853933
Accuracy on test with no regularisation: 0.8044692737430168
Accuracy on train with regularisation: 0.827247191011236
Accuracy on test with regularisation: 0.8100558659217877


### Standard scaling


In [12]:
from typing import Optional
from numpy.typing import NDArray
import pandas as pd
import plotly.graph_objects as go

class StandardScaler:

    def __init__(self):
        self.means: Optional[NDArray] = None
        self.stds: Optional[NDArray] = None
        self.is_fitted: bool = False

    def fit(self, X: NDArray) -> "StandardScaler":
        # Rows are features and columns are samples, so reduce over axis=1
        self.means = X.mean(axis=1, keepdims=True)
        self.stds = X.std(axis=1, keepdims=True)
        self.is_fitted = True
        return self

    def transform(self, X: NDArray) -> NDArray:
        if not self.is_fitted:
            raise RuntimeError("Call fit before transform")
        return (X - self.means) / self.stds

    def fit_transform(self, X: NDArray) -> NDArray:
        return self.fit(X).transform(X)

def standard_scaling_example():
    X_train = pd.read_feather('../titanic/processed/X_train.feather').to_numpy().T
    y_train = pd.read_feather('../titanic/processed/y_train.feather').to_numpy().T
    X_test = pd.read_feather('../titanic/processed/X_test.feather').to_numpy().T
    y_test = pd.read_feather('../titanic/processed/y_test.feather').to_numpy().T

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    layers = [30, 50, 20, 1]

    nn = NeuralNetwork(layers, model_id="unscaled data")
    nn_scaled = NeuralNetwork(layers, model_id="scaled data")
    nn.initialise_weights()
    nn_scaled.initialise_weights()

    fig = go.FigureWidget()
    display(fig)
    nn.train(X_train, y_train, n_epochs=400, log_every=500, plot_cost=True, fig=fig)
    nn_scaled.train(X_train_scaled, y_train, n_epochs=400, log_every=500, plot_cost=True, fig=fig)

    train_Y_pred = nn.predict(X_train)
    test_Y_pred = nn.predict(X_test)
    train_Y_pred_scaled = nn_scaled.predict(X_train_scaled)
    test_Y_pred_scaled = nn_scaled.predict(X_test_scaled)

    train_accuracy = (train_Y_pred == y_train).sum() / y_train.shape[1]
    test_accuracy = (test_Y_pred == y_test).sum() / y_test.shape[1]
    train_accuracy_scaled = (train_Y_pred_scaled == y_train).sum() / y_train.shape[1]
    test_accuracy_scaled = (test_Y_pred_scaled == y_test).sum() / y_test.shape[1]

    print(f"Accuracy on train: {train_accuracy}")
    print(f"Accuracy on test: {test_accuracy}")
    print(f"Accuracy on train with scaled data: {train_accuracy_scaled}")
    print(f"Accuracy on test with scaled data: {test_accuracy_scaled}")

standard_scaling_example()

[Figure: training cost per epoch for the "unscaled data" and "scaled data" models]

Accuracy on train: 0.8356741573033708
Accuracy on test: 0.8100558659217877
Accuracy on train with scaled data: 0.8455056179775281
Accuracy on test with scaled data: 0.8156424581005587
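
As a minimal sketch of why the scaler is fitted on the training set and only applied to the test set, the toy example below standardises two features that live on very different scales. The arrays, the helper names `fit_standardiser` and `standardise`, and the small `eps` guard against zero-variance features are illustrative assumptions and are not part of the notebook's `StandardScaler`.

```python
import numpy as np

# Hypothetical toy data in the notebook's layout: rows are features, columns are examples.
X_train = np.array([[1.0, 2.0, 3.0, 4.0],
                    [100.0, 200.0, 300.0, 400.0]])
X_test = np.array([[2.5, 5.0],
                   [250.0, 500.0]])


def fit_standardiser(X, eps=1e-8):
    """Return per-feature mean and std; eps (an assumption) avoids division by zero."""
    means = X.mean(axis=1, keepdims=True)
    stds = X.std(axis=1, keepdims=True) + eps
    return means, stds


def standardise(X, means, stds):
    """Apply previously fitted statistics to any data set."""
    return (X - means) / stds


# Statistics come from the training data only and are reused for the test data.
means, stds = fit_standardiser(X_train)
X_train_scaled = standardise(X_train, means, stds)
X_test_scaled = standardise(X_test, means, stds)

print(X_train_scaled.mean(axis=1))  # close to 0 for every feature
print(X_train_scaled.std(axis=1))   # close to 1 for every feature
print(X_test_scaled)                # both features now share the same scale
```

After scaling, both features contribute on a comparable scale, which tends to give gradient descent a better-conditioned cost surface.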


### Mini-batch training
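
The `Optimizer.train` method in this section splits the columns of `X` into consecutive slices of `batch_size` and performs one parameter update per slice. The sketch below isolates that slicing step. The generator name `iterate_mini_batches` and the optional shuffle are assumptions added for illustration; the notebook's loop does not shuffle.

```python
import numpy as np


def iterate_mini_batches(X, Y, batch_size, rng=None):
    """Yield (X_batch, Y_batch) pairs; columns are examples.

    If rng is given, the column order is shuffled first (an addition that the
    notebook's training loop does not perform).
    """
    m = X.shape[1]
    order = np.arange(m) if rng is None else rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[:, idx], Y[:, idx]


# Hypothetical data: 2 features, 10 examples.
X = np.arange(20, dtype=float).reshape(2, 10)
Y = (np.arange(10) % 2).reshape(1, 10)

batches = list(iterate_mini_batches(X, Y, batch_size=4, rng=np.random.default_rng(0)))
batch_sizes = [xb.shape[1] for xb, _ in batches]
print(batch_sizes)  # [4, 4, 2]: the last batch holds the remainder
```

With 10 examples and a batch size of 4, one epoch therefore makes three parameter updates instead of one.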

import plotly.graph_objs as go
from IPython.display import display
import logging

logger = logging.getLogger()


class NeuralNetwork:
    layer_activations: dict[int, str]

    def __init__(
        self,
        layer_sizes: list[int],
        layer_activations: dict[int, str] | None = None,
        regularisation_lambda: float = 0.0,
        keep_prob: float = 1.0,
        cost_function: str = "log_loss",
        model_id: str = "",
    ):
        self.layer_sizes = layer_sizes
        self.regularisation_lambda = regularisation_lambda
        self.L = len(layer_sizes) - 1
        self.m = layer_sizes[0]  # number of input features, not the number of examples
        self.cost_function = cost_function
        self.keep_prob = keep_prob
        self.model_id = model_id
        if layer_activations:
            self.layer_activations = layer_activations
        else:
            # This sets all hidden layers and the output layer to "sigmoid" by default.
            self.layer_activations = {l: "sigmoid" for l in range(1, self.L)} | {
                self.L: "sigmoid"
            }

    def initialise_weights(self) -> None:
        # This is using He initialisation. Try changing to * 0.01 and see the change in cost plot.
        self.params = {}
        for l, (n_l, n_l_minus_1) in enumerate(
            zip(self.layer_sizes[1:], self.layer_sizes), start=1
        ):
            self.params[f"W{l}"] = np.random.normal(size=(n_l, n_l_minus_1)) * np.sqrt(
                2 / n_l_minus_1
            )
            self.params[f"b{l}"] = np.zeros((n_l, 1))
        logger.info("Weights initialised")
        logger.debug(f"{self.params=}")

    def forward(self, X, cache=False) -> np.ndarray:
        Zs, As, Ds = {}, {0: X}, {}
        for l in range(1, self.L + 1):
            W = self.params[f"W{l}"]
            b = self.params[f"b{l}"]
            Zs[l] = W @ As[l - 1] + b
            g = ACTIVATION_FUNCTIONS[self.layer_activations[l]]
            logger.debug(f"Applying {self.layer_activations[l]} in layer{l}")
            As[l] = g(Zs[l])
            # Apply inverted dropout (only when cache=True, i.e. during training), but not on the output layer
            if cache and l != self.L:
                Ds[l] = (np.random.uniform(size=As[l].shape) < self.keep_prob).astype(
                    int
                ) / self.keep_prob
                As[l] *= Ds[l]
        if cache:
            self.Zs, self.As, self.Ds = Zs, As, Ds
        return As[self.L]

    def backward(self, Y):
        dZs = {self.L: self.As[self.L] - Y}
        m = self.As[0].shape[1]
        grads = {}
        for l in range(self.L, 0, -1):
            logger.debug(f"calculating dZ for layer_id {l}")
            if l != self.L:
                W_l_plus_1 = self.params[f"W{l+1}"]
                dZs[l] = (
                    W_l_plus_1.T
                    @ dZs[l + 1]
                    * self.Ds[l]
                    * ACTIVATION_FUNCTION_DERIVATIVES[self.layer_activations[l]](
                        self.Zs[l]
                    )
                )
            grads[f"dW{l}"] = (1.0 / m) * dZs[l] @ self.As[l - 1].T
            # L2 penalty gradient for every layer, matching cost(), which passes all W matrices
            if self.regularisation_lambda:
                grads[f"dW{l}"] += (self.regularisation_lambda / m) * self.params[
                    f"W{l}"
                ]
            grads[f"db{l}"] = (1.0 / m) * np.sum(dZs[l], axis=1, keepdims=True)
        return grads

    def cost(self, A, Y) -> float:
        # For cost, we need to pass the weights. We'll extract them from self.params.
        Ws = {l: self.params[f"W{l}"] for l in range(1, self.L + 1)}
        if self.cost_function == "log_loss":
            return log_loss(A, Y, Ws, self.regularisation_lambda).item()
        elif self.cost_function == "square_loss":
            return square_loss(A, Y, Ws, self.regularisation_lambda).item()
        else:
            raise ValueError(
                f"Unknown cost_function: {self.cost_function!r}"
            )

    def predict(self, X, return_probability=False):
        Y_hat = self.forward(X)
        if return_probability:
            return Y_hat
        return np.where(Y_hat > 0.5, 1, 0)


class Optimizer:
    """Implements gradient descent with optional mini-batching."""

    def __init__(self, learning_rate: float = 0.1, batch_size: int | None = None):
        self.learning_rate = learning_rate
        self.batch_size = batch_size

    def update_model_params(self, model, grads, training_iteration):
        """Update model parameters using grads returned from backward."""
        for param_key in model.params:
            model.params[param_key] -= self.learning_rate * grads[f"d{param_key}"]

    def train(
        self,
        model,
        X,
        Y,
        n_epochs=1000,
        log_every: int | None = None,
        plot_cost=False,
        fig=None,
        plot_every=10,
    ):
        costs, epochs = [], []

        if plot_cost:
            if fig is None:
                fig = go.FigureWidget()
                display(fig)
            fig.add_scatter(x=[], y=[], mode="lines+markers", name=model.model_id)
            fig.update_layout(
                title="Training Cost over Epochs",
                xaxis_title="Epoch",
                yaxis_title="Cost",
            )

        m = X.shape[1]
        batch_size = self.batch_size if self.batch_size is not None else m

        training_iteration = 1
        for epoch in range(n_epochs):
            for i in range(0, m, batch_size):
                X_batch = X[:, i : i + batch_size]
                Y_batch = Y[:, i : i + batch_size]

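                # Forward pass, caching activations and dropout masks for backward
                A = model.forward(X_batch, cache=True)
                # Backward pass, then one parameter update per mini-batch
                grads = model.backward(Y_batch)
                self.update_model_params(model, grads, training_iteration)
                training_iteration += 1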

            # Compute cost on the whole dataset after epoch
            A_full = model.forward(X, cache=False)
            cost = model.cost(A_full, Y)
            if log_every and epoch % log_every == 0:
                logger.info(f"Cost after epoch {epoch} = {cost}")
            if plot_cost and fig is not None:
                costs.append(cost)
                epochs.append(epoch + 1)
                if epoch % plot_every == 0:
                    with fig.batch_update():
                        fig.data[-1].x = epochs  # type: ignore
                        fig.data[-1].y = costs  # type: ignore


class MomentumOptimizer(Optimizer):
    """Implements gradient descent with momentum.

    The update rule is:
    v_t = beta * v_{t-1} + (1 - beta) * grad_t
    param_t = param_{t-1} - learning_rate * v_t

    Where:
    - v_t is the exponentially weighted average of the gradients (first moment)
    - param_t is the parameter
    - grad_t is the gradient of the cost function with respect to the parameter
    """

    def __init__(
        self,
        learning_rate: float = 0.1,
        batch_size: int | None = None,
        beta: float = 0.9,
    ):
        super().__init__(learning_rate, batch_size)
        self.beta = beta
        self.cache = {}

    def update_model_params(self, model, grads, training_iteration):
        for param_key in model.params:
            if param_key not in self.cache:
                self.cache[param_key] = np.zeros_like(model.params[param_key])
            self.cache[param_key] = (
                self.beta * self.cache[param_key]
                + (1 - self.beta) * grads[f"d{param_key}"]
            )
            model.params[param_key] -= self.learning_rate * self.cache[param_key]


class RMSPropOptimizer(Optimizer):
    """Implements RMSProp optimizer.

    Here we track the exponentially weighted average of the squared gradients (second moment).

    The update rule is:
    S_t = beta * S_{t-1} + (1 - beta) * (grad_t)^2
    param_t = param_{t-1} - learning_rate * grad_t / (sqrt(S_t) + epsilon)
    """

    def __init__(
        self,
        learning_rate: float = 0.1,
        batch_size: int | None = None,
        beta: float = 0.9,
        epsilon: float = 1e-8,
    ):
        super().__init__(learning_rate, batch_size)
        self.beta = beta
        self.epsilon = epsilon  # to avoid division by zero
        self.s_cache = {}

    def update_model_params(self, model, grads, training_iteration):
        # Standard RMSProp does NOT use bias correction (unlike Adam)
        for param_key in model.params:
            if param_key not in self.s_cache:
                self.s_cache[param_key] = np.zeros_like(model.params[param_key])
            self.s_cache[param_key] = (
                self.beta * self.s_cache[param_key]
                + (1 - self.beta) * (grads[f"d{param_key}"] ** 2)
            )
            model.params[param_key] -= (
                self.learning_rate
                * grads[f"d{param_key}"]
                / (np.sqrt(self.s_cache[param_key]) + self.epsilon)
            )


class ADAMOptimizer(Optimizer):
    """Implements ADAM optimizer with optional learning rate decay.

    The update rule is:
    v_t = beta_1 * v_{t-1} + (1 - beta_1) * grad_t
    s_t = beta_2 * s_{t-1} + (1 - beta_2) * (grad_t)^2
    v_t_corrected = v_t / (1 - beta_1^t)
    s_t_corrected = s_t / (1 - beta_2^t)
    param_t = param_{t-1} - lr_t * v_t_corrected / (sqrt(s_t_corrected) + epsilon)

    Where lr_t = learning_rate / (1 + learning_rate_decay * t) and t counts
    parameter updates (one per mini-batch), not epochs.
    """

    def __init__(
        self,
        learning_rate: float = 0.1,
        batch_size: int | None = None,
        beta_1: float = 0.9,
        beta_2: float = 0.999,
        epsilon: float = 1e-8,
        learning_rate_decay: float = 0,
    ):
        super().__init__(learning_rate, batch_size)
        self.beta_1 = beta_1  # for the momentum
        self.beta_2 = beta_2  # for the second moment (RMSProp)
        self.epsilon = epsilon  # to avoid division by zero
        self.learning_rate_decay = learning_rate_decay
        self.v_cache = {}
        self.s_cache = {}

    def update_model_params(self, model, grads, training_iteration):
        learning_rate = self.learning_rate * (
            1 / (1 + self.learning_rate_decay * training_iteration)
        )
        for param_key in model.params:
            if param_key not in self.v_cache:
                self.v_cache[param_key] = np.zeros_like(model.params[param_key])
                self.s_cache[param_key] = np.zeros_like(model.params[param_key])
            self.v_cache[param_key] = (
                self.beta_1 * self.v_cache[param_key]
                + (1 - self.beta_1) * grads[f"d{param_key}"]
            )
            self.s_cache[param_key] = (
                self.beta_2 * self.s_cache[param_key]
                + (1 - self.beta_2) * grads[f"d{param_key}"] ** 2
            )
            v_t_corrected = self.v_cache[param_key] / (1 - self.beta_1 ** training_iteration)
            s_t_corrected = self.s_cache[param_key] / (1 - self.beta_2 ** training_iteration)

            model.params[param_key] -= (
                learning_rate * v_t_corrected / (np.sqrt(s_t_corrected) + self.epsilon)
            )


def optimizer_example():
    X_train = pd.read_feather("../titanic/processed/X_train.feather").to_numpy().T
    y_train = pd.read_feather("../titanic/processed/y_train.feather").to_numpy().T
    X_test = pd.read_feather("../titanic/processed/X_test.feather").to_numpy().T
    y_test = pd.read_feather("../titanic/processed/y_test.feather").to_numpy().T

    layers = [30, 50, 20, 1]
    nn = NeuralNetwork(layers, model_id="Using an optimizer")
    nn_mini_batch = NeuralNetwork(layers, model_id="Using a mini-batch optimizer")
    nn_momentum = NeuralNetwork(layers, model_id="Using a momentum optimizer")
    nn_rmsprop = NeuralNetwork(layers, model_id="Using an RMSProp optimizer")
    nn_adam = NeuralNetwork(layers, model_id="Using an ADAM optimizer")
    nn_adam_decay = NeuralNetwork(layers, model_id="Using an ADAM optimizer with learning rate decay")

    nn.initialise_weights()
    nn_mini_batch.initialise_weights()
    nn_momentum.initialise_weights()
    nn_rmsprop.initialise_weights()
    nn_adam.initialise_weights()
    nn_adam_decay.initialise_weights()

    optimizer = Optimizer(learning_rate=0.5)
    mini_batch_optimizer = Optimizer(batch_size=128)
    momentum_optimizer = MomentumOptimizer(batch_size=128)
    rmsprop_optimizer = RMSPropOptimizer(batch_size=128)
    adam_optimizer = ADAMOptimizer(batch_size=128 )
    adam_decay_optimizer = ADAMOptimizer(batch_size=128, learning_rate_decay=0.5)

    models = [nn, nn_mini_batch, nn_momentum, nn_rmsprop, nn_adam, nn_adam_decay]
    optimizers = [optimizer, mini_batch_optimizer, momentum_optimizer, rmsprop_optimizer, adam_optimizer, adam_decay_optimizer]
    names = ["Gradient Descent", "Mini-batch", "Momentum", "RMSProp", "ADAM", "ADAM with learning rate decay"]

    fig = go.FigureWidget()
    display(fig)

    from time import perf_counter

    for model, optimizer, name in zip(models, optimizers, names):
        start_time = perf_counter()
        optimizer.train(model, X_train, y_train, n_epochs=500, plot_cost=True, fig=fig)
        end_time = perf_counter()
        elapsed = end_time - start_time
        print(f"\n Optimiser: {name}")
        print(f"{name} training time: {elapsed:.4f} seconds")
        train_Y_pred = model.predict(X_train)
        test_Y_pred = model.predict(X_test)
        train_accuracy = (train_Y_pred == y_train).sum() / y_train.shape[1]
        test_accuracy = (test_Y_pred == y_test).sum() / y_test.shape[1]
        print(f"Accuracy on train: {train_accuracy}")
        print(f"Accuracy on test: {test_accuracy}")



optimizer_example()

[Figure: "Training Cost over Epochs" (x-axis: Epoch, y-axis: Cost), one trace per model]


 Optimiser: Gradient Descent
Gradient Descent training time: 9.0928 seconds
Accuracy on train: 0.8328651685393258
Accuracy on test: 0.8100558659217877

 Optimiser: Mini-batch
Mini-batch training time: 13.5394 seconds
Accuracy on train: 0.8384831460674157
Accuracy on test: 0.8100558659217877

 Optimiser: Momentum
Momentum training time: 10.1611 seconds
Accuracy on train: 0.8370786516853933
Accuracy on test: 0.8100558659217877



divide by zero encountered in log


invalid value encountered in matmul






 Optimiser: RMSProp
RMSProp training time: 15.1471 seconds
Accuracy on train: 0.9213483146067416
Accuracy on test: 0.8212290502793296

 Optimiser: ADAM
ADAM training time: 13.0351 seconds
Accuracy on train: 0.9283707865168539
Accuracy on test: 0.8044692737430168

 Optimiser: ADAM with learning rate decay
ADAM with learning rate decay training time: 12.5911 seconds
Accuracy on train: 0.8469101123595506
Accuracy on test: 0.8156424581005587
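
To make the three update rules from the optimizer docstrings easier to compare, the sketch below applies them to a small quadratic cost rather than to the network. The toy cost, the function names and the learning rate of 0.03 are assumptions chosen for illustration. The step functions mirror the formulas in `MomentumOptimizer`, `RMSPropOptimizer` and `ADAMOptimizer`, without learning rate decay.

```python
import numpy as np

# Hypothetical toy problem: f(w) = 0.5 * sum(c * w**2), with very different curvatures.
CURVATURE = np.array([1.0, 25.0])


def loss(w):
    return 0.5 * float(np.sum(CURVATURE * w**2))


def grad(w):
    return CURVATURE * w


def momentum_step(w, g, state, t, lr=0.03, beta=0.9):
    # Exponentially weighted average of the gradients (first moment).
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * g
    return w - lr * state["v"]


def rmsprop_step(w, g, state, t, lr=0.03, beta=0.9, eps=1e-8):
    # Exponentially weighted average of the squared gradients (second moment).
    state["s"] = beta * state.get("s", 0.0) + (1 - beta) * g**2
    return w - lr * g / (np.sqrt(state["s"]) + eps)


def adam_step(w, g, state, t, lr=0.03, beta_1=0.9, beta_2=0.999, eps=1e-8):
    # Both moments, each divided by (1 - beta^t) to undo the zero initialisation.
    state["v"] = beta_1 * state.get("v", 0.0) + (1 - beta_1) * g
    state["s"] = beta_2 * state.get("s", 0.0) + (1 - beta_2) * g**2
    v_hat = state["v"] / (1 - beta_1**t)
    s_hat = state["s"] / (1 - beta_2**t)
    return w - lr * v_hat / (np.sqrt(s_hat) + eps)


def run(step_fn, n_steps=200):
    w, state = np.array([1.0, 1.0]), {}
    # t starts at 1 so the bias correction never divides by zero.
    for t in range(1, n_steps + 1):
        w = step_fn(w, grad(w), state, t)
    return w


initial_loss = loss(np.array([1.0, 1.0]))
final_losses = {fn.__name__: loss(run(fn)) for fn in (momentum_step, rmsprop_step, adam_step)}
first_adam_step = run(adam_step, n_steps=1)

print(initial_loss, final_losses)
print(first_adam_step)  # each coordinate moves by about lr, whatever the gradient scale
```

On the first iteration the bias-corrected moments reduce to the gradient and its square, so the ADAM step has magnitude close to the learning rate in every coordinate. This is one reason a learning rate that suits plain gradient descent can be too large for RMSProp or ADAM.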

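
The `divide by zero encountered in log` warning in the output above plausibly arises when a sigmoid output saturates to exactly 0 or 1, so that the log loss evaluates `log(0)`. The `log_loss` helper is not shown in this section, so the following is only a sketch of one common guard, clipping the predictions before taking logs. The name `safe_log_loss` and the `eps` value are assumptions, and the regularisation term is omitted.

```python
import numpy as np


def safe_log_loss(A, Y, eps=1e-12):
    """Binary cross-entropy with predictions clipped away from 0 and 1.

    A and Y have shape (1, m). eps is an assumed clipping threshold.
    """
    A = np.clip(A, eps, 1 - eps)
    m = Y.shape[1]
    return float(-(Y * np.log(A) + (1 - Y) * np.log(1 - A)).sum() / m)


# Hypothetical saturated predictions: the second one is confidently wrong.
A = np.array([[1.0, 0.0, 0.8]])
Y = np.array([[1, 1, 0]])

cost = safe_log_loss(A, Y)
print(np.isfinite(cost))  # True
```

Clipping keeps the reported cost finite, but it does not address the underlying cause, which here is likely a learning rate that is too large for RMSProp.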