# Neural Network

## Concept

If a network has $s_j$ units in layer $j$ and $s_{j + 1}$ units in layer $j + 1$, then $\Theta^{(j)}$ has dimension $s_{j + 1} \times (s_{j} + 1)$. The $+ 1$ comes from the additional **bias unit** in layer $j$: the bias unit feeds into layer $j + 1$ but receives no input itself, so it adds a column to $\Theta^{(j)}$ rather than a row. $\Theta^{(j)}$ is the matrix of **weights** controlling the function mapping from layer $j$ to layer $j + 1$.
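The dimension rule can be checked with a small sketch. The helper names `init_theta` and `layer_pre_activation` are hypothetical, and the random initialization range is an arbitrary assumption; only the shapes matter here.

```python
import random

def init_theta(s_j, s_next, seed=0):
    """Return a weight matrix with s_next rows and s_j + 1 columns (list of rows)."""
    rng = random.Random(seed)
    # One row per unit in layer j + 1; one column per unit in layer j plus the bias unit.
    return [[rng.uniform(-0.5, 0.5) for _ in range(s_j + 1)] for _ in range(s_next)]

def layer_pre_activation(theta, a):
    """Compute z = Theta * [1, a] for one layer."""
    a_with_bias = [1.0] + list(a)  # prepend the bias unit a_0 = 1
    return [sum(w * v for w, v in zip(row, a_with_bias)) for row in theta]

# Assumed sizes for illustration: 3 units in layer j, 5 units in layer j + 1.
theta_j = init_theta(s_j=3, s_next=5)
shape = (len(theta_j), len(theta_j[0]))      # (s_{j+1}, s_j + 1)
z = layer_pre_activation(theta_j, [0.2, -0.1, 0.7])
print(shape, len(z))
```

With 3 units in layer $j$ and 5 units in layer $j + 1$, the matrix has 5 rows and 4 columns, and multiplying it by the bias-extended activation vector yields one value per unit in layer $j + 1$.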

When a neural network has **no hidden layers** and only **one unit in the output layer**,

- If the output unit has a **linear activation**, it's **linear regression** because $y = I (\Theta x) = \Theta x$, where $I$ is the identity function.
- If the output unit has a **sigmoid activation**, it's **logistic regression** because $y = \sigma(\Theta x)$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$.
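The two cases can be sketched as a single output unit whose activation is swapped. The function `single_unit` and the weight values are hypothetical, and the sketch assumes the bias is handled by prepending $x_0 = 1$ to the input.

```python
import math

def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def single_unit(theta, x, activation):
    """Network with no hidden layer and one output unit: y = g(theta . [1, x])."""
    z = sum(t * v for t, v in zip(theta, [1.0] + list(x)))
    return activation(z)

theta = [0.5, 2.0, -1.0]  # hypothetical weights: bias first, then one per feature
x = [1.0, 3.0]

linear_out = single_unit(theta, x, lambda z: z)   # identity activation: linear regression
logistic_out = single_unit(theta, x, sigmoid)     # sigmoid activation: logistic regression
print(linear_out, logistic_out)
```

Both calls share the same weighted sum $\Theta x$; only the activation applied to it differs.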

## Cost Function

In **multi-class classification**, the cost function uses the following notation: $n$ is the number of data, $L$ is the number of layers in the neural network including the input and output layers, $s_{l}$ is the number of units (not including the bias unit) in layer $l$, $K$ is the number of classes, $\Theta$ is the set of weight matrices, $h_{\Theta}(x) \in \mathbb{R}^K$ is the output of the neural network, $(h_{\Theta}(x))_i$ is the $i^{th}$ output, and $J(\Theta)$ is the cost.

$$
J(\Theta) = - \frac{1}{n} \sum_{i = 1}^{n} \sum_{k = 1}^{K} \left[ y_{k}^{(i)} \log (h_{\Theta}(x^{(i)}))_{k} + (1 - y_{k}^{(i)}) \log (1 - (h_{\Theta}(x^{(i)}))_{k}) \right] + \frac{\lambda}{2n} \sum_{l = 1}^{L - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}} (\Theta_{ji}^{(l)})^2
$$

This takes the form of,

$$
\text{Regularized cost} = \text{Cost} + \lambda \times \text{Regularization}
$$

The first double sum, $\sum_{i = 1}^{n} \sum_{k = 1}^{K}$, adds up the **log-likelihood** terms of each class $k$ for each of the $n$ items; negating the total and dividing it by $n$ gives the average cost.

The triple sum, $\sum_{l = 1}^{L - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}}$, collects the squared weight parameters of the neural network to regularize them. It stops at $l = L - 1$ because a network with $L$ layers has only $L - 1$ weight matrices, one between each pair of adjacent layers, and $i$ starts at $1$ so that the bias weights are not regularized.
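As a minimal sketch of how the regularized cost could be evaluated, the function below takes already-computed network outputs. The name `regularized_cost` and the toy values are hypothetical, and the sketch assumes each weight matrix stores its bias weights in column 0 and that every output lies strictly between 0 and 1.

```python
import math

def regularized_cost(h, y, thetas, lam):
    """h, y: n lists of K values each; thetas: weight matrices with bias in column 0."""
    n = len(y)
    data_term = 0.0
    for h_i, y_i in zip(h, y):            # sum over the n items
        for h_ik, y_ik in zip(h_i, y_i):  # sum over the K classes
            data_term += y_ik * math.log(h_ik) + (1 - y_ik) * math.log(1 - h_ik)
    # Sum of squared weights over every matrix, skipping the bias column.
    reg_term = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return -data_term / n + lam / (2 * n) * reg_term

# Toy values: n = 2 items, K = 2 classes, one 2 x 3 weight matrix.
h = [[0.9, 0.1], [0.2, 0.8]]
y = [[1, 0], [0, 1]]
thetas = [[[0.0, 1.0, -1.0], [0.0, 0.5, 0.5]]]

cost_plain = regularized_cost(h, y, thetas, lam=0.0)
cost_reg = regularized_cost(h, y, thetas, lam=1.0)
print(cost_plain, cost_reg)
```

Setting `lam=0.0` leaves only the averaged cross-entropy part, so the difference between the two calls is exactly the regularization penalty.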

## Backpropagation

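**Backpropagation** is the neural network algorithm for computing the gradient of the cost function with respect to the weights; an optimizer such as gradient descent then uses that gradient to minimize the cost. The goal is to compute,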

$$
\underset{\Theta}{\min} J(\Theta)
$$

It means that we want to find the set of parameters $\Theta$ that minimizes the cost function $J$.
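To make the minimization concrete, here is a sketch of gradient descent on the smallest possible network: a single sigmoid output unit with no hidden layer. The gradient comes from the chain rule, which is the one-layer special case of what backpropagation computes layer by layer. The toy data, the learning rate `alpha`, and the iteration count are all assumptions chosen for illustration, and regularization is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(theta, xs, ys):
    """Cross-entropy cost of one sigmoid unit and its gradient (no regularization)."""
    n = len(xs)
    cost = 0.0
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        a = [1.0] + list(x)                     # input with the bias unit prepended
        h = sigmoid(sum(t * v for t, v in zip(theta, a)))
        cost += -(y * math.log(h) + (1 - y) * math.log(1 - h)) / n
        delta = h - y                           # error term at the output unit
        for j, v in enumerate(a):
            grad[j] += delta * v / n            # partial derivative dJ/dtheta_j
    return cost, grad

# Hypothetical toy data: the label is 1 when the single feature is positive.
xs = [[-2.0], [-1.0], [1.0], [2.0]]
ys = [0, 0, 1, 1]

theta = [0.0, 0.0]
alpha = 0.5  # learning rate (assumed value)
initial_cost, _ = cost_and_gradient(theta, xs, ys)
for _ in range(200):
    _, grad = cost_and_gradient(theta, xs, ys)
    theta = [t - alpha * g for t, g in zip(theta, grad)]  # step against the gradient
final_cost, _ = cost_and_gradient(theta, xs, ys)
print(initial_cost, final_cost)
```

Each iteration moves $\Theta$ a small step against the gradient, so the cost after training is lower than the cost at the all-zero starting point.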


## Resource

- [Machine Learning by Stanford University | Coursera](https://www.coursera.org/learn/machine-learning)
- [Deep Learning Specialization | Coursera](https://www.coursera.org/specializations/deep-learning)

## Note

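$X$ has shape $p \times n$ and $Y$ has shape $1 \times n$, where $n$ is the number of data and each column of $X$ is one example with $p$ features.

Logistic regression: $\hat{y} = \sigma(w^T x + b)$

The loss function measures the error on a single example, $l(\hat{y}^{(i)}, y^{(i)})$

The cost function aggregates the loss functions over the entire dataset, $J(w, b)$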
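
For reference, one common concrete choice is the binary cross-entropy loss; this is an assumption, since the note does not name a specific loss. Under that choice, and averaging over the $n$ examples in the same way as the $\frac{1}{n}$ factor in $J(\Theta)$, the two quantities would be:

$$
l(\hat{y}^{(i)}, y^{(i)}) = - \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right], \qquad J(w, b) = \frac{1}{n} \sum_{i = 1}^{n} l(\hat{y}^{(i)}, y^{(i)})
$$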

https://www.coursera.org/learn/neural-networks-deep-learning/lecture/0ULGt/derivatives