<a href="https://colab.research.google.com/github/wenxuan0923/My-notes/blob/master/DL_Math_of_Layers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Math under the hood: Layers

Last time we built our first DL model using dummy data. This note answers the following questions:
* **What is happening in `keras.layers.Dense`?**
* **Why is the number of params 16?**

In [0]:
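import numpy as np
# 50 samples with 3 features each, drawn uniformly from [0, 1)
data = np.random.random((50, 3))
# The target is a linear function of the features: 2 * (x1 + x2 + x3) - 1
target = 2*np.sum(data, axis=1) - 1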

In [4]:
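import keras
model = keras.models.Sequential()
# Hidden layer: 3 units, each connected to the 3 input features
model.add(keras.layers.Dense(units=3, input_dim=3))
# Output layer: 1 unit producing the prediction
model.add(keras.layers.Dense(units=1))
model.summary()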

Using TensorFlow backend.


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 3)                 12        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 4         
Total params: 16
Trainable params: 16
Non-trainable params: 0
_________________________________________________________________


Layers extract representations from the data fed into them by applying a **data transformation**.

> You can think of it as a **data distillation** process:
> * Some data goes in and comes out in a more informative form
> * Less useful information is dropped along the way

**So that’s what machine learning is, technically:**
- searching for useful representations (defined by parameters in layers) of some input data 
- within a predefined space of possibilities (defined by the structure of network) 
- using guidance from a feedback signal (defined by loss score and optimizer)


#### Math under the hood

**The data transformation for a $\color{red}{\text{single}}$ data point:** 

$$output = \sigma(W^T\boldsymbol{x}+b)$$

Where $\boldsymbol{x} = [x_1, x_2, x_3]^T$ is the input, $W = [w_1, w_2, w_3]^T$ is the weight vector, $b$ is a scalar bias, and $\sigma$ is an activation function.

To further unpack the formula:

$$\sigma(W^T\boldsymbol{x}+b) =\sigma \Bigl([w_{1}, w_{2}, w_{3}]\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + b\Bigr) = \sigma(w_1x_1 + w_2x_2 + w_3x_3 + b) = \text{some scalar}$$
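
To make the formula concrete, here is a minimal NumPy sketch for one data point. The values of `x`, `w` and `b` are made up for illustration, and the logistic sigmoid is only one possible choice for $\sigma$.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values chosen for illustration.
x = np.array([1.0, 2.0, 3.0])    # one data point with 3 features
w = np.array([0.5, -0.25, 0.1])  # weight vector W = [w1, w2, w3]^T
b = 0.2                          # bias (a scalar)

z = w @ x + b                    # W^T x + b = w1*x1 + w2*x2 + w3*x3 + b
output = sigmoid(z)              # a single scalar
```

Whatever the number of features, one unit always maps one data point to one scalar.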

**To process the data for the $\color{red}{\text{whole dataset}}$ with $m$ samples:**

We can simply stack the samples side by side as the columns of a matrix $X$:

$$\sigma(W^TX+b) = \sigma\Bigl([w_{1}, w_{2}, w_{3}]\begin{bmatrix} \vdots & \vdots & & \vdots \\ x^{(1)} & x^{(2)} & \dots & x^{(m)} \\ \vdots & \vdots & & \vdots \end{bmatrix} + b\Bigr) = \begin{bmatrix} \sigma(W^Tx^{(1)}+b) & \dots & \sigma(W^Tx^{(m)}+b) \end{bmatrix} = \text{some vector} $$

Where $x^{(i)}$ represents the i-th sample in the dataset.
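
The same computation over the whole dataset can be sketched in NumPy as follows. This is an illustration under assumptions: the weights are made-up values, the sigmoid stands in for $\sigma$, and `X` stores one sample per column (the transpose of the `(50, 3)` `data` array used earlier, which stores one sample per row).

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 50
X = rng.random((3, m))           # column i is the sample x^(i)
w = np.array([0.5, -0.25, 0.1])  # hypothetical weights
b = 0.2                          # hypothetical bias

# One sample at a time, with an explicit Python loop
looped = np.array([sigmoid(w @ X[:, i] + b) for i in range(m)])

# All samples at once: a single matrix product, b is broadcast to every sample
vectorized = sigmoid(w @ X + b)  # shape (m,)
```

Both versions give the same `m` numbers; the second replaces the loop with one matrix product.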

This is called **vectorization**, which can greatly speed up the computation.
The whole process illustrated above is **a neural network with a single layer containing a single unit**. 

<p align="center">
<img src = 'https://drive.google.com/uc?id=1JX74MDF6yMrxkH9HlVsigxVsmtK4U6a7'
width="350" height="240" style="vertical-align:middle"/>
</p>

It becomes the well-known **Logistic Regression Model** when we choose the sigmoid as the activation function.

With more hidden units or layers, we get a deeper neural network, like the one we defined in the example above.

**The structure of a neural network with 1 hidden layer with 3 units:**

<p align="center">
<img src = 'https://drive.google.com/uc?id=16Wd_swdEhLn4bdT1JgJ4BskJp_G_-s-1'
width="380" height="340" style="vertical-align:middle"/>
</p>

Here the superscript in square brackets `[]` denotes the layer index, and the subscript denotes the hidden unit. <br>
For example: $W^{[1]}_{1}$ is the weight vector of the first hidden unit in the first layer, written as a row vector (so it plays the role of $W^T$ above).

#### The data transformation for this network is:

**Hidden Layer**:

$$h_1 = \sigma(W^{[1]}_{1} X + b^{[1]}_{1})$$
$$h_2 = \sigma(W^{[1]}_{2} X + b^{[1]}_{2})$$
$$h_3 = \sigma(W^{[1]}_{3} X + b^{[1]}_{3})$$

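We can further vectorize this expression by stacking the row vectors $W^{[1]}_{i}$ into a matrix $W^{[1]}$ and the biases into a column vector $b^{[1]}$: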

$$h^{[1]} = \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} = \begin{bmatrix} \sigma(W_{1}^{[1]}X + b_{1}^{[1]}) \\  \sigma(W_{2}^{[1]}X + b_{2}^{[1]}) \\ \sigma(W_{3}^{[1]}X + b_{3}^{[1]}) \end{bmatrix} = \sigma \Biggl( \begin{bmatrix} \dots & W_{1}^{[1]} & \dots \\ \dots & W_{2}^{[1]} & \dots \\ \dots & W_{3}^{[1]} & \dots \end{bmatrix} X+ \begin{bmatrix} b_{1}^{[1]} \\  b_{2}^{[1]} \\ b_{3}^{[1]} \end{bmatrix}\Biggr) = \sigma(W^{[1]}X + b^{[1]})$$
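
A quick numerical check of this identity, as a sketch: the weights are random placeholders, `tanh` stands in for $\sigma$, and the names `W1`, `b1` and `H1` are only illustrative. Row `i` of `W1` (counting from 0) plays the role of $W^{[1]}_{i+1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                              # number of samples
X = rng.random((3, m))             # one sample per column
W1 = rng.standard_normal((3, 3))   # one row of weights per hidden unit
b1 = rng.standard_normal((3, 1))   # one bias per hidden unit

sigma = np.tanh                    # placeholder activation

# Unit by unit: h_i = sigma(W_i X + b_i), each of shape (m,)
h_rows = [sigma(W1[i] @ X + b1[i]) for i in range(3)]
h_stacked = np.vstack(h_rows)      # shape (3, m)

# All units at once: sigma(W1 X + b1), b1 is broadcast across the m columns
H1 = sigma(W1 @ X + b1)            # shape (3, m)
```

`h_stacked` and `H1` agree entry by entry, so the three per-unit formulas and the matrix form describe the same computation.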

**Output Layer**:

$$y' = W^{[2]}h^{[1]} + b^{[2]}$$
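
Putting both layers together gives the full forward pass. The sketch below is not how Keras implements it internally; `forward` is a hypothetical helper, and the weights are hand-picked (not learned) to show that this architecture can represent the dummy target $2\sum_i x_i - 1$ exactly when $\sigma$ is the identity, which is what `Dense` uses when no `activation` is passed.

```python
import numpy as np

def forward(X, W1, b1, W2, b2, sigma=lambda z: z):
    """Hidden layer with activation sigma, followed by a linear output layer."""
    H1 = sigma(W1 @ X + b1)   # hidden representation, shape (3, m)
    return W2 @ H1 + b2       # predictions y', shape (1, m)

rng = np.random.default_rng(0)
data = rng.random((50, 3))              # one sample per row, as in the notebook
target = 2 * np.sum(data, axis=1) - 1

X = data.T                              # one sample per column, shape (3, 50)

# Hand-picked weights, for illustration only
W1 = np.eye(3)                          # hidden layer passes the inputs through
b1 = np.zeros((3, 1))
W2 = np.full((1, 3), 2.0)               # output layer computes 2 * (h1 + h2 + h3)
b2 = np.array([[-1.0]])                 # ... and then subtracts 1

y_pred = forward(X, W1, b1, W2, b2)     # shape (1, 50)
```

Training would not necessarily find these particular weights, since many different weight settings give the same function.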


**Number of parameters in this network**

* $W^{[1]}$ has three vectors $W^{[1]}_{1}$, $W^{[1]}_{2}$, $W^{[1]}_{3}$, each with 3 components $[w_1, w_2, w_3]$
* $b^{[1]}$ has three scalars $b^{[1]}_{1}$, $b^{[1]}_{2}$, $b^{[1]}_{3}$
* $W^{[2]}$ has one vector with 3 components
* $b^{[2]}$ has one scalar

The total number of parameters = $3 \times 3 + 3 + 3 + 1 = 16$, which matches the 12 (hidden layer) + 4 (output layer) reported by `model.summary()`.
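
The same count can be written as a tiny helper. `dense_param_count` is a hypothetical function written for this note, not part of Keras; it just encodes "one weight per input per unit, plus one bias per unit".

```python
def dense_param_count(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer with biases."""
    n_weights = n_inputs * n_units  # one weight per (input, unit) pair
    n_biases = n_units              # one bias per unit
    return n_weights + n_biases

hidden = dense_param_count(n_inputs=3, n_units=3)  # 3*3 + 3
output = dense_param_count(n_inputs=3, n_units=1)  # 3*1 + 1
total = hidden + output
```

Here `hidden` and `output` correspond to the `Param #` column of the summary, and in Keras itself `model.count_params()` should report the same total.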
