# Neural Networks and Deep Learning

This notebook is a set of self-recap notes for the first course, "Neural Networks and Deep Learning", of the five courses in the "Deep Learning Specialization" on Coursera.

## Week1

### Why is Deep Learning taking off?

![alt text](https://miro.medium.com/max/2000/1*yhUYvpYxAjmMp3V9jGnttg.jpeg)

Achieving high performance usually requires two things:

* building large NNs
* preparing large amounts of labeled data.

![alt text](https://cherrythecoder.files.wordpress.com/2018/07/s8-36.png?w=736)

The success of today's deep learning relies not only on the amount of data, but also on computation power and appropriate training algorithms. From an iterative prototyping perspective, a faster "idea-code-experiment" loop gives faster feedback and hence improves products faster.

## Week2
### Binary Classification
For each training sample $(x,y)$

* feature space dimension $n_x$
* number of training samples $m_{train}$, test samples $m_{test}$
* input matrix (each column is a sample in this course) `X.shape =` $(n_x, m)$, label matrix (each column is a label) `Y.shape =` $(1, m)$
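
As a small illustration of the column-per-sample convention, the sketch below stacks a few made-up samples into `X` and `Y`. The values and sizes are arbitrary assumptions, chosen only to make the shapes visible.

```python
import numpy as np

# Hypothetical toy data: m = 4 samples, each with n_x = 3 features.
samples = [np.array([0.1, 0.2, 0.3]),
           np.array([0.4, 0.5, 0.6]),
           np.array([0.7, 0.8, 0.9]),
           np.array([1.0, 1.1, 1.2])]
labels = [0, 1, 1, 0]

# Each sample becomes one column of X; each label becomes one column of Y.
X = np.stack(samples, axis=1)          # shape (n_x, m)
Y = np.array(labels).reshape(1, -1)    # shape (1, m)

n_x, m = X.shape
print(X.shape, Y.shape)
```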

### Logistic Regression
$$\hat{y} = \sigma(\mathbf{w}^\intercal\mathbf{x} + b)$$
where
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The model coefficients $(\mathbf{w}, b)$ are treated separately in this course, instead of grouping them into $\mathbf{\theta} = (b, \mathbf{w})$ with an additional feature row full of ones.

One more piece of notation: the superscript in parentheses, $x^{(i)}$, denotes the $i$-th sample in the training set.
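
The prediction formula can be sketched in NumPy as follows, assuming `w` has shape $(n_x, 1)$ and `X` has shape $(n_x, m)$ as in the notation above. The function names and the toy values are illustrative choices, not part of the course material.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    """Return y_hat of shape (1, m) for w of shape (n_x, 1) and X of shape (n_x, m)."""
    return sigmoid(w.T @ X + b)

# Toy example with n_x = 2 features and m = 2 samples (one sample per column).
w = np.array([[1.0], [-1.0]])
b = 0.0
X = np.array([[0.0, 2.0],
              [0.0, 1.0]])
y_hat = predict_proba(w, b, X)
print(y_hat)
```

Keeping `b` as a separate scalar mirrors the course's choice of not folding the bias into the weight vector.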

#### Loss function
The squared error $L(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$ would give a non-convex optimization objective when combined with the sigmoid. Logistic regression therefore uses the cross-entropy loss

$$L(\hat{y}, y) = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]$$
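
To get a feel for how the logistic regression loss behaves, the sketch below evaluates it for a confident correct prediction and a confident wrong one. The function name `cross_entropy_loss` and the probabilities are illustrative assumptions.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Per-sample loss -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# True label y = 1: a prediction near 1 is cheap, a prediction near 0 is expensive.
loss_good = cross_entropy_loss(0.9, 1)
loss_bad = cross_entropy_loss(0.1, 1)
print(loss_good, loss_bad)
```

The loss grows without bound as the predicted probability of the true class approaches zero, which is what penalizes confident mistakes so heavily.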

#### Cost function
The cost function measures the loss over the whole training set and is minimized during training:

$$J(\mathbf{w}, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

#### Gradient descent

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial J(\mathbf{w}, b)}{\partial\mathbf{w}}$$

$$b \leftarrow b - \alpha \frac{\partial J(\mathbf{w}, b)}{\partial b}$$
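
A minimal sketch of these updates for logistic regression is given below. It relies on the standard result that, for the sigmoid with the cross-entropy loss, $\partial J/\partial \mathbf{w} = \frac{1}{m} X(A - Y)^\intercal$ and $\partial J/\partial b = \frac{1}{m}\sum_i (a^{(i)} - y^{(i)})$. The helper names, the learning rate `alpha`, the iteration count, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, Y):
    """Return the cost J and its gradients for w of shape (n_x, 1), X of shape (n_x, m)."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                                # predictions, shape (1, m)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))   # average cross-entropy
    dZ = A - Y
    dw = X @ dZ.T / m                                       # shape (n_x, 1)
    db = float(np.sum(dZ) / m)
    return J, dw, db

def gradient_descent(X, Y, alpha=0.1, num_iters=200):
    """Run plain batch gradient descent from zero-initialized parameters."""
    w = np.zeros((X.shape[0], 1))
    b = 0.0
    costs = []
    for _ in range(num_iters):
        J, dw, db = cost_and_grads(w, b, X, Y)
        w = w - alpha * dw
        b = b - alpha * db
        costs.append(J)
    return w, b, costs

# Toy data: one feature; the label is 1 when the feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b, costs = gradient_descent(X, Y)
print(costs[0], costs[-1])
```

Zero initialization is fine for logistic regression because there is only one unit; the symmetry issue only appears once a layer has several units.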


In [0]:
# forward prop for j = 3 * (a + b * c) with a = 5, b = 3, c = 2 (edge labels are values)

# a ---5-------- 
#               \
# b              \
#  \               v=a+u ---11--- j=3v ---33
#   3            /
#    \          /
#     u=bc --6--
#    /
#   2
#  /
# c

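# back prop (edge labels are local derivatives, e.g. du/db = c = 2;
# multiply along a path for the full gradient, e.g. dj/db = 3 * 1 * 2 = 6)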

# a ---1-------- 
#               \
# b              \
#  \               v=a+u ---3--- j=3v ---1
#   2            /
#    \          /
#     u=bc --1--
#    /
#   3
#  /
# c
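
The computation graph above, $j = 3(a + bc)$, can be traced in a few lines of Python. This is a sketch: the function names are made up, and the backward pass simply multiplies local derivatives along each path back to an input.

```python
def forward(a, b, c):
    """Evaluate the graph left to right: u = b*c, v = a + u, j = 3*v."""
    u = b * c
    v = a + u
    j = 3 * v
    return u, v, j

def backward(a, b, c):
    """Apply the chain rule right to left, starting from dj/dv = 3."""
    dj_dv = 3.0
    dv_da, dv_du = 1.0, 1.0      # v = a + u
    du_db, du_dc = c, b          # u = b * c
    dj_da = dj_dv * dv_da
    dj_du = dj_dv * dv_du
    dj_db = dj_du * du_db
    dj_dc = dj_du * du_dc
    return dj_da, dj_db, dj_dc

u, v, j = forward(5, 3, 2)
dj_da, dj_db, dj_dc = backward(5, 3, 2)
print(u, v, j)               # 6 11 33
print(dj_da, dj_db, dj_dc)   # 3.0 6.0 9.0
```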




## Week3
### Computing a Neural Network's Output
Notation

For one node, $z = \mathbf{w}^\intercal \mathbf{x} + b$ and activation $a = \sigma(z)$.

$a_i^{[l]}$ is the activation of node $i$ in layer $l$: the superscript $[l]$ is the layer index (the input is indexed as 0), and the subscript $i$ is the node index within that layer.

Similarly, $\mathbf{w}_i^{[l]}$ is the weight vector used to compute node $i$ of layer $l$. Going further, we can stack the transposed vectors $\mathbf{w}_i^{[l]\intercal}$ vertically, one per row, to form a weight matrix $W^{[l]}$ with num_rows = num_outputs and num_cols = num_inputs.
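
As a sketch of this stacking, the example below builds a weight matrix for a layer with three nodes and two inputs, then checks that one matrix product reproduces the per-node computation. The weight values, the input, and the sigmoid activation are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight vector per node (3 nodes, 2 inputs each).
w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 1.0])
w3 = np.array([1.0, 1.0])

# Rows of W are the transposed weight vectors: shape (num_outputs, num_inputs).
W = np.vstack([w1, w2, w3])
b = np.zeros((3, 1))
x = np.array([[2.0], [-1.0]])          # a single input column, shape (2, 1)

z = W @ x + b                          # all three nodes at once, shape (3, 1)
a = sigmoid(z)

# The same result computed node by node.
a_per_node = np.array([[sigmoid(w @ x[:, 0])] for w in (w1, w2, w3)])
print(z.ravel(), np.allclose(a, a_per_node))
```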

![alt text](https://cdn-images-1.medium.com/max/1200/1*buxOnswsinejx2FVZDuF8w.png)

### Backpropagation intuition

Considering one training sample,

"z-step": $\mathbf{z} = \mathbf{Wx + b}$

"a-step": $\mathbf{y} = g(\mathbf{z})$

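To obtain the gradients with respect to the parameters, first calculate $d\mathbf{z}$. Here $d\mathbf{v}$ is shorthand for $\partial L / \partial \mathbf{v}$, and $*$ denotes the elementwise product.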

$$d\mathbf{z} = d\mathbf{y} * g'(\mathbf{z})$$

Then 

$$d\mathbf{W} = (d\mathbf{z})(\mathbf{x^\intercal})$$
$$d\mathbf{b} = d\mathbf{z}$$
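
The single-layer formulas above can be checked numerically. The sketch below assumes $g = \tanh$ and a squared-error loss $L = \frac{1}{2}\lVert\mathbf{y} - \mathbf{t}\rVert^2$ against a hypothetical target `t`, so that $d\mathbf{y} = \mathbf{y} - \mathbf{t}$. Both choices, along with the function names and random shapes, are assumptions made only so that there is a concrete loss to differentiate.

```python
import numpy as np

def layer_forward(W, b, x):
    z = W @ x + b            # "z-step"
    y = np.tanh(z)           # "a-step" with g = tanh
    return z, y

def layer_backward(dy, z, x):
    dz = dy * (1.0 - np.tanh(z) ** 2)   # dz = dy * g'(z), elementwise
    dW = dz @ x.T                       # same shape as W
    db = dz                             # same shape as b
    return dz, dW, db

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=(3, 1))
x = rng.normal(size=(2, 1))
t = rng.normal(size=(3, 1))             # hypothetical target

def loss(W, b):
    _, y = layer_forward(W, b, x)
    return 0.5 * float(np.sum((y - t) ** 2))

z, y = layer_forward(W, b, x)
dz, dW, db = layer_backward(y - t, z, x)

# Central finite difference on a single entry of W as a sanity check.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 0] += eps
W_minus[1, 0] -= eps
numeric = (loss(W_plus, b) - loss(W_minus, b)) / (2 * eps)
print(numeric, dW[1, 0])
```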

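It is easier to apply the chain rule through the "dz"s, as follows.

Calculate $d\mathbf{z}^{[1]}$ from $d\mathbf{z}^{[2]}$. Let $\mathbf{x}$ denote the input to layer 2, which is the layer-1 activation $g^{[1]}(\mathbf{z}^{[1]})$. Then

$$\frac{dL}{d\mathbf{z}^{[1]}} = \frac{dL}{d\mathbf{x}} * \frac{d\mathbf{x}}{d\mathbf{z}^{[1]}}$$

with $\frac{dL}{d\mathbf{x}} = W^{[2]\intercal}d\mathbf{z}^{[2]}$ and $\frac{d\mathbf{x}}{d\mathbf{z}^{[1]}} = g^{[1]'}(\mathbf{z}^{[1]})$, so

$$ d\mathbf{z}^{[1]} = W^{[2]\intercal}d\mathbf{z}^{[2]} * g^{[1]'}(\mathbf{z}^{[1]})$$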
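
The recursion from $d\mathbf{z}^{[2]}$ to $d\mathbf{z}^{[1]}$ can be sketched for a two-layer network as below. The sketch assumes a tanh hidden layer and a sigmoid output with the cross-entropy loss, for which $d\mathbf{z}^{[2]} = \mathbf{a}^{[2]} - y$. The layer sizes, random seed, and function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(params, x, y):
    a2 = forward(params, x)[3]
    return float(np.sum(-(y * np.log(a2) + (1 - y) * np.log(1 - a2))))

def backward(params, x, y):
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(params, x)
    dz2 = a2 - y                             # sigmoid output + cross-entropy
    dW2 = dz2 @ a1.T
    dz1 = (W2.T @ dz2) * (1.0 - a1 ** 2)     # W2^T dz2 * g1'(z1), with g1 = tanh
    dW1 = dz1 @ x.T
    return dW1, dz1, dW2, dz2                # db1 = dz1 and db2 = dz2

rng = np.random.default_rng(1)
params = [rng.normal(size=(4, 3)), rng.normal(size=(4, 1)),
          rng.normal(size=(1, 4)), rng.normal(size=(1, 1))]
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])
dW1, db1, dW2, db2 = backward(params, x, y)

# Central finite difference on one entry of W1 as a sanity check.
eps = 1e-6
def perturbed_loss(delta):
    W1 = params[0].copy()
    W1[2, 1] += delta
    return loss([W1] + params[1:], x, y)

numeric = (perturbed_loss(eps) - perturbed_loss(-eps)) / (2 * eps)
print(numeric, dW1[2, 1])
```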

![alt text](https://sandipanweb.files.wordpress.com/2017/11/grad_summary.png?w=676)

### Random Initialization

Initialize the weights randomly. Nodes with identical (symmetric) weights are effectively equivalent to a single node, since they always receive the same update. Small random values are usually a reasonable choice, since large weights can push $z$ into the saturated, small-gradient region of activations such as sigmoid or tanh.
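
The symmetry problem can be seen directly in the gradients. The sketch below assumes a two-layer network (tanh hidden layer, sigmoid output, biases left at zero for brevity) and compares the gradient of the hidden-layer weights under an identical-weights initialization and under a small random one. The data, the sizes, and the 0.01 scale factor are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_W1(W1, W2, X, Y):
    """Gradient of the mean cross-entropy cost w.r.t. W1 (biases fixed at zero)."""
    m = X.shape[1]
    A1 = np.tanh(W1 @ X)
    A2 = sigmoid(W2 @ A1)
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
    return dZ1 @ X.T / m

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))            # 2 features, 5 samples
Y = np.array([[0, 1, 1, 0, 1]])

# Symmetric start: every hidden unit has the same weights.
g_same = grad_W1(np.full((3, 2), 0.5), np.full((1, 3), 0.5), X, Y)

# Small random start: each hidden unit gets its own weights.
W1_rand = rng.normal(size=(3, 2)) * 0.01
W2_rand = rng.normal(size=(1, 3)) * 0.01
g_rand = grad_W1(W1_rand, W2_rand, X, Y)

rows_identical_same = np.allclose(g_same, g_same[0])
rows_identical_rand = np.allclose(g_rand, g_rand[0])
print(rows_identical_same, rows_identical_rand)
```

With identical weights every row of the gradient is the same, so the hidden units remain copies of each other after every update; the random start breaks that tie.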