## Basic definitions

### Neurons

Strictly speaking, calling the following equation the definition of a neuron is an oversimplification in terms of biology. However, let's just accept it and move forward.

A neuron is a function that, given an input x, calculates the output y using an activation function f:
$ y = f(w*x + b) $, in which:

- y is the output
- x is the input
- w is the weight of the neuron
- b is the bias
- f is the activation function

If $f$ is the identity function, the neuron reduces to a linear function. Hence, you can see that the activation function plays a crucial role in making our neural network a non-linear model, which, mathematically speaking, can represent a lot more than a simple linear model can.
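
As a minimal sketch of the definition above (the names `neuron`, `identity`, and `clip_at_zero` are illustrative, not taken from any library), a neuron can be written as a small Python function. Swapping the activation function is all it takes to go from a linear to a non-linear neuron.

```python
def neuron(x: float, w: float, b: float, f) -> float:
    """Compute y = f(w*x + b) for a single scalar input."""
    return f(w * x + b)


def identity(z: float) -> float:
    """Identity activation: the neuron stays a plain linear function."""
    return z


def clip_at_zero(z: float) -> float:
    """A simple non-linear activation, used here only as an example."""
    return max(0.0, z)


# Same weight and bias, different activation functions
print(neuron(2.0, w=3.0, b=1.0, f=identity))       # 3*2 + 1 = 7.0
print(neuron(-2.0, w=3.0, b=1.0, f=clip_at_zero))  # 3*(-2) + 1 = -5, clipped to 0.0
```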

The trainable parts of a neuron are normally the weight and the bias. Sometimes the activation function has trainable parameters too.

![a neuron](https://nickmccullum.com/images/python-deep-learning/understanding-neurons-deep-learning/neuron-functionality.png)

### Neural network

A neural network is just a network of neurons (duh).
The most basic neural network is a dense network, in which every neuron in one layer is connected to every neuron in the next layer.

![a neural network](https://miro.medium.com/max/578/1*ToPT8jnb5mtnikmiB42hpQ.png)

However, if you do the math, this type of dense connection can make a neural network huge with just a few layers. There are other neuron configurations that can be much more efficient; one of the most widely used for images and video is convolution. But this will be explored in another session.
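
To make "do the math" concrete, here is a rough sketch that counts the weights and biases of a dense network. The helper `dense_param_count` and the layer sizes are assumptions chosen for illustration only.

```python
def dense_param_count(layer_sizes: list[int]) -> int:
    """Count the weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per neuron in the next layer
    return total


# Hypothetical sizes: a flattened 224x224 RGB image, two hidden layers, 10 outputs
sizes = [224 * 224 * 3, 1024, 1024, 10]
print(dense_param_count(sizes))  # roughly 155 million parameters
```

Almost all of those parameters sit in the first layer, because every input pixel is connected to every neuron of the first hidden layer.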

### Common activation functions

We see above that activation functions help us create non-linear models. So, what are the common activation functions?

#### ReLU and its family

ReLU is one of the simplest and also most widely used families of activation functions. It stands for rectified linear unit.

The original ReLU function: 
$$
f(x)= 
\begin{cases}
x & \text{if } x>0\\
0 & \text{otherwise}
\end{cases}
$$

The original ReLU suffers from the dying ReLU problem: all negative values are squashed into a big fat 0, where the gradient is also 0, so no gradient flows back through the neuron.

There are a lot of other ReLU variants, such as leaky ReLU and ELU.
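
A small sketch of ReLU next to leaky ReLU, together with their gradients. The default `negative_slope=0.01` is a commonly used value, assumed here for illustration. The gradient functions show why leaky ReLU avoids the dying ReLU problem: its gradient never becomes exactly 0.

```python
def relu(x: float) -> float:
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    # Gradient is 0 for every non-positive input: this is the "dying" region
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    # A small but non-zero gradient survives for negative inputs
    return 1.0 if x > 0 else negative_slope


for x in (-5.0, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```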

#### Sigmoid
$$
f(x) = {1\over{1+e^{-x}}}
$$
The output of sigmoid lies strictly between 0 and 1.

#### Tanh
$$
f(x) = {{1-e^{-2x}}\over{1+e^{-2x}}}
$$
The output of tanh lies strictly between -1 and 1.
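
The two formulas above translate directly into Python. This is a naive sketch using only the standard library. It follows the equations literally, so `math.exp` can overflow for very negative inputs. It also checks that tanh is just a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)


for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x), tanh(x))

# tanh written in terms of sigmoid; the difference should be (numerically) zero
x = 0.7
print(tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
```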

#### Softmax
Softmax is a special function: it receives a list of inputs and normalizes them into a probability distribution (the sum of all outputs is 1). Hence it is mostly used as the last activation function of a classification model.
$$
f(\vec{x})_{i} = {{e^{x_i}}\over{\sum_{j=1}^{K}{e^{x_j}}}}
$$
$K$ is the size of the input vector $\vec{x}$.
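
A minimal softmax sketch with the standard library. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above. It does not change the result, because the same factor cancels in the numerator and the denominator.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)                           # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]  # e^(x_i - m)
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)       # larger inputs get larger probabilities
print(sum(probs))  # the outputs sum to 1 (up to floating-point rounding)
```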

### Now, let's "handcraft" a neural network model!

Consider a 1-dimensional world with 2 countries, A and B, whose border is at the point x = 3. This means that all of the land with value < 3 belongs to country A and all of the land with value >= 3 belongs to country B. Now create a model that represents this!

First, we can convert the 2 classes into a numeric representation:

A: 0

B: 1

In [19]:
def relu(x: float) -> float:
    # Pass positive inputs through unchanged, clamp everything else to 0
    return x if x > 0 else 0.0
    
weight = 1
bias = 1
dumb_net = lambda x: relu(weight * x + bias)

In [20]:
# Test, expect everything to be True
print([
    dumb_net(2.9) == 0,
    dumb_net(-1) == 0,
    dumb_net(3) > 0,
    dumb_net(100000) > 0
])

[False, True, True, True]


Let's "train" the model ourselves and optimize it by changing the values of the weight and bias.
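
As one possible outcome of this manual "training", here is a sketch with hand-picked values that pass the four checks above. The names `hand_weight`, `hand_bias`, and `hand_net` are illustrative, and the values were chosen by hand, not learned. Note that this only satisfies the test points: any x between 2.9 and 3 still produces a positive output, so it is not an exact model of the border.

```python
def relu(x: float) -> float:
    return x if x > 0 else 0.0


# Hand-picked (not learned) parameters: the neuron only fires once x > 2.9
hand_weight = 1.0
hand_bias = -2.9


def hand_net(x: float) -> float:
    return relu(hand_weight * x + hand_bias)


checks = [
    hand_net(2.9) == 0,
    hand_net(-1) == 0,
    hand_net(3) > 0,
    hand_net(100000) > 0,
]
print(checks)  # [True, True, True, True]
```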

## To train a neural network

We have managed to create a very simple model, but currently we have no systematic way to optimize it iteratively.

### Loss function

First of all, to optimize anything, we need a way to calculate how far we are from the target.

For the following equations, we will assume:
- $y$ stands for the desired (truth) value
- $\hat{y}$ stands for the predicted value
- $L$ stands for the loss function

Some of the common loss functions:

#### Mean squared error loss (MSE) - for regression problem
One of the standard and most popular loss functions for real-valued (continuous) prediction.

$$
L(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n}({y_i - \hat{y}_i})^2
$$
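
The MSE formula above can be sketched in a few lines of plain Python (the function name `mse` is illustrative):

```python
def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean of the squared differences between truth and prediction."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


# Only the last prediction is off (by 2), so the loss is 2**2 / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```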

There are other related regression losses, such as mean squared logarithmic error loss or mean absolute error loss.

#### Cross-entropy loss - for classification problem

If there are just 2 classes, we can use binary cross-entropy loss. Note that the final layer activation should then be a sigmoid, so that the prediction lies between 0 and 1.
$$
L(y, \hat{y}) = - \frac{1}{N}\sum_{i=1}^{N}(y_i*\log(\hat{y}_i) + (1-y_i)*\log(1-\hat{y}_i))
$$
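
A minimal sketch of binary cross-entropy with the standard library. The `eps` clipping is an assumption added here to avoid `log(0)` when a prediction is exactly 0 or 1. It is not part of the formula itself.

```python
import math


def binary_cross_entropy(y_true: list[float], y_pred: list[float],
                         eps: float = 1e-12) -> float:
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Confident and correct predictions give a small loss ...
print(binary_cross_entropy([1.0, 0.0], [0.9, 0.1]))
# ... confident and wrong predictions give a large one
print(binary_cross_entropy([1.0, 0.0], [0.1, 0.9]))
```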

Multi-class cross-entropy loss is used for multi-class classification. The truth matrix should be one-hot encoded, meaning each sample belongs to exactly one class. A one-hot encoded matrix has the shape (number of samples, number of classes). For each sample, only the class it belongs to has the value 1, and the rest are 0.
$$
L(y, \hat{y}) = -\sum_{k=1}^{K}y_k*\log(\hat{y}_k)
$$
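
A sketch of the multi-class case for a single sample, using the standard form $-\sum_k y_k \log(\hat{y}_k)$. The function name and the `eps` guard against `log(0)` are assumptions for illustration. Because the truth vector is one-hot, only the predicted probability of the true class contributes to the loss.

```python
import math


def categorical_cross_entropy(y_true_one_hot: list[float], y_pred: list[float],
                              eps: float = 1e-12) -> float:
    return -sum(y * math.log(max(p, eps))
                for y, p in zip(y_true_one_hot, y_pred))


# The true class is index 1, so the loss is -log(0.7)
print(categorical_cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1]))
```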


### Chain rule

The chain rule is fairly simple:
$$
\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}
$$
$\frac{dy}{dx}$: derivative of y with respect to x

$\frac{dy}{du}$: derivative of y with respect to u

$\frac{du}{dx}$: derivative of u with respect to x

But this is the key to making a neural network trainable.

As you may notice, a neural network is just one big equation. With the chain rule, we can break it up into smaller parts and then calculate how changing each smaller part affects the final prediction.
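
A quick numerical sanity check of the chain rule, using a made-up example: $u = 3x + 1$ and $y = u^2$, so $\frac{dy}{du} = 2u$ and $\frac{du}{dx} = 3$. The functions below are illustrative. The finite-difference estimate should agree with the chain-rule result up to rounding.

```python
def u(x: float) -> float:
    return 3.0 * x + 1.0


def y_of_u(u_val: float) -> float:
    return u_val ** 2


def dy_dx_chain(x: float) -> float:
    dy_du = 2.0 * u(x)  # derivative of u^2 with respect to u
    du_dx = 3.0         # derivative of 3x + 1 with respect to x
    return dy_du * du_dx


def dy_dx_numeric(x: float, h: float = 1e-6) -> float:
    # Central finite difference on the composed function y(u(x))
    return (y_of_u(u(x + h)) - y_of_u(u(x - h))) / (2.0 * h)


print(dy_dx_chain(2.0))    # 2 * 7 * 3 = 42.0
print(dy_dx_numeric(2.0))  # approximately 42
```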

You might also notice that, since differentiation is key to training a neural network, all of the components of the network should be differentiable: the neurons, the activation functions, and the loss functions. (ReLU is not strictly differentiable, but it is differentiable at every point except 0.)


### Backpropagation
Backpropagation is a method to update the weights of the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight is computed using the chain rule.

Let's denote:
- $\partial$: partial derivative
- $f$: the entire neural network model
- $L$: the loss function

Applying the chain rule, we have:
$$
\frac{\partial{L}}{\partial{x}} = \frac{\partial{L}}{\partial{f(x)}}*\frac{\partial{f(x)}}{\partial{x}}
$$
This means that the change in the loss function with respect to the input equals the change in the loss function with respect to the network output times the change in the network output with respect to the input.

Moreover, we can express the loss function in terms of the input and the neural network $f$:
$$
L(y, \hat{y}) = L(y, f(x))
$$

To update (train) the network, we just have to go against the loss gradient (in order to minimize the loss).


### Real example

Consider our `dumb_net` above. The entire network can be represented by this equation:
$$
f(x) = reLU(w*x + b) = reLU(W(x))
$$
where $W(x) = w*x + b$ represents the neuron function.

Let's use MSE loss (this is actually a classification problem, but let's ignore that for now).
Then we will have the following equation representing the loss:
$$
L(y, \hat{y}) = L(y, f(x)) = \frac{1}{n}\sum_{i=1}^{n}({y_i - f(x_i)})^2
$$
$$
L(y, f(x)) = \frac{1}{n}\sum_{i=1}^{n}({y_i - reLU(w*x_i + b)})^2
$$

Now, in order to optimize/update/train this network, we need to minimize the above loss function.

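The derivative of the MSE loss function with respect to our network output, for a single sample and with the constant factor 2 dropped (it can be absorbed into the learning rate):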
$$
\frac{\partial{L}}{\partial{f(x)}} = f(x) - y
$$
The derivative of the reLU activation function with respect to its input $W(x)$:
$$
\frac{\partial{reLU}}{\partial{W(x)}}= 
\begin{cases}
1 & \text{if } W(x)>0\\
0 & \text{otherwise}
\end{cases}
$$
Working backwards, we have:
$$
\frac{\partial{W}}{\partial{w}}= x 
$$

$$
\frac{\partial{W}}{\partial{b}}= 1
$$
where $w$ is the weight and $b$ is the bias. We want to actually update the weight and bias, remember?

Chaining them together, we have:
$$
\frac{\partial{L}}{\partial{w}} = \frac{\partial{L}}{\partial{f}}\frac{\partial{f}}{\partial{W}}\frac{\partial{W}}{\partial{w}}
$$
The bias has a similar equation, with $\frac{\partial{W}}{\partial{b}}$ as the last factor.
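
The chained derivatives can be checked numerically. The sketch below is an illustration under one assumption: it uses the half squared error $\frac{1}{2}(f(x) - y)^2$ for a single sample, so that $\frac{\partial{L}}{\partial{f}} = f(x) - y$ exactly as written above. The function names are made up for this example.

```python
def relu(z: float) -> float:
    return z if z > 0 else 0.0


def relu_grad(z: float) -> float:
    return 1.0 if z > 0 else 0.0


def loss(w: float, b: float, x: float, y: float) -> float:
    # Half squared error for one sample (assumption: the 1/2 cancels the factor 2)
    return 0.5 * (relu(w * x + b) - y) ** 2


def grads(w: float, b: float, x: float, y: float) -> tuple[float, float]:
    z = w * x + b            # W(x), the neuron before activation
    dL_df = relu(z) - y      # dL/df
    df_dW = relu_grad(z)     # df/dW
    return dL_df * df_dW * x, dL_df * df_dW * 1.0  # (dL/dw, dL/db)


def numeric_grads(w, b, x, y, h=1e-6):
    dw = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
    db = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
    return dw, db


print(grads(1.0, 1.0, 2.9, 0.0))          # chain rule
print(numeric_grads(1.0, 1.0, 2.9, 0.0))  # finite differences, should be close
```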

### Optimizer

With the chain rule, we got a nice gradient, but how do we actually use it to minimize the loss? What should the strategy be?
The simplest method is stochastic gradient descent:
$$
w = w - \eta * \frac{\partial{L}}{\partial{w}}
$$
where $\eta$ is the learning rate or the step size (written with a different symbol from the sample count $n$ used in the loss).
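
A toy sketch of the update rule, minimizing the made-up loss $(w - 3)^2$, whose gradient is $2(w - 3)$. This example has no dataset, so it is plain gradient descent. The stochastic variant applies the same step using the gradient from one sample (or a small batch) at a time. The names `sgd_step` and `lr` are illustrative.

```python
def sgd_step(w: float, grad: float, lr: float) -> float:
    """Move the parameter a small step against its gradient."""
    return w - lr * grad


w = 0.0
lr = 0.1
for _ in range(50):
    grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2 with respect to w
    w = sgd_step(w, grad, lr)

print(w)  # close to 3, the minimizer of the toy loss
```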

There are other, often better, optimization strategies, such as Adam and its family. I won't go into detail on them here.

In [25]:
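# Per-sample gradients of the squared error (constant factor 2 dropped).
# The reLU derivative is hard-coded as 1, i.e. we assume w*x + b > 0.
def loss_derivative_weight(w, x, b, y):
    return (w*x + b - y)*(1)*(x)

def loss_derivative_bias(w, x, b, y):
    return (w*x + b - y)*(1)*(1)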

In [26]:
loss_derivative_bias(1, 2.9, 1, 1)

2.9

In [27]:
dataset = [
    (2.9, 0),
    (-1, 0),
    (3, 1),
    (1000, 1)
]

In [40]:
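cur_weight = 1
cur_bias = 1
lr = 0.0001  # learning rate

# Per-sample SGD: each step uses a single (x, y) pair, cycling through the dataset
for i in range(100):
    item = i % len(dataset)
    x, y = dataset[item]
    cur_weight = cur_weight - lr * loss_derivative_weight(cur_weight, x, cur_bias, y)
    # Note: the bias gradient below is evaluated with the already-updated weight
    cur_bias = cur_bias - lr * loss_derivative_bias(cur_weight, x, cur_bias, y)
    print(cur_weight, cur_bias)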

0.998869 0.99961032799
0.998869074132799 0.9996102538646143
0.9979702088899202 0.9993109017765609
-98.79898177027975 10.87920914771436
-98.71904679726379 10.906749750370794
-98.70808421760903 10.895788266973996
-98.62221567829327 10.924385352850788
9762.60691361575 -965.3372984472595
9754.676509017949 -968.06962090503
9753.604234404957 -966.997453519499
9745.116389830047 -969.8241886910961
-964669.4401743058 95497.21691115835
-963885.8473680235 95767.19408520396
-963779.8820638781 95661.23937758905
-962941.1782418339 95940.55570712384
95321582.69037086 -9436227.307285534
95244153.74524736 -9462904.489140928
95233683.03942393 -9452434.830388071
95150808.45543756 -9480034.829341663
-9418982033.505386 932419116.52478
-9411331071.159 935055160.6237637
-9410296432.535822 934020625.4644477
-9402107371.93388 936747855.6135814
930714955035.9928 -92134841322.77115
929958942862.7911 -92395315932.06908
929856707436.9116 -92293090729.73218
929047524327.4376 -92562575677.95734
-91966448650848.42 91

In [41]:
dumb_net = lambda x: relu(cur_weight * x + cur_bias)

In [42]:
print([
    dumb_net(2.9) == 0,
    dumb_net(-1) == 0,
    dumb_net(3) > 0,
    dumb_net(100000) > 0
])

[True, False, False, False]
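
The checks fail because the training above diverged instead of converging: the printed weight flips sign and grows by roughly a factor of 100 every time the loop visits the sample with x = 1000. Here is a rough sketch of why, assuming the linear model `w*x + b` with the reLU derivative fixed at 1, as in the gradient functions above (`weight_scale` and `weight_after_step` are illustrative helpers, not part of the original notebook).

```python
lr = 0.0001


def weight_after_step(w: float, b: float, x: float, y: float, lr: float) -> float:
    """One per-sample SGD step on the weight."""
    return w - lr * (w * x + b - y) * x


def weight_scale(x: float, lr: float) -> float:
    # Expanding the update gives w * (1 - lr * x**2) - lr * (b - y) * x,
    # so each visit to a sample multiplies the weight by this factor.
    return 1.0 - lr * x ** 2


for x in (2.9, 3, 1000):
    print(x, weight_scale(x, lr))

# For x = 1000 the factor is about -99: the weight overshoots and changes sign
print(weight_after_step(1.0, 1.0, 1000, 1, lr))
```

As a rough guide, the step only shrinks the weight when `lr * x**2` stays below 2. For x = 1000 that means a learning rate under about 2e-6, or rescaling the inputs to a smaller range before training.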
