# Course 2, Week 1 Notes
---
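Overview:
* What the training, dev, and test sets (train/dev/test sets) are for, and how to split the data
* The difference between bias and variance, and how to handle high bias, high variance, or both at once
* Applying regularization in neural networks to reduce the risk of overfitting, e.g. L2 regularization and dropout
* Methods that speed up neural network training, e.g. normalizing inputs
* Gradient checking: a debugging method for when a network is not working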



* Regularization
    * How does regularization prevent overfitting?
    * dropout
    * data augmentation
    * early stopping

* Gradient Checking
* Vanishing/exploding gradients
* Normalizing inputs


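* Regularization: L1 norm and L2 norm
* Regularization in logistic regression
* Regularization in neural networks

Why does regularization reduce the risk of overfitting?
> L2 regularization shrinks the weights of some neurons, which in effect simplifies the network.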



## Parameter initialization
---
There are two types of parameters to initialize in a neural network:
- the weight matrices $(W^{[1]}, W^{[2]}, W^{[3]}, ..., W^{[L-1]}, W^{[L]})$
- the bias vectors $(b^{[1]}, b^{[2]}, b^{[3]}, ..., b^{[L-1]}, b^{[L]})$

The assignment compares three initialization methods:
- *Zeros initialization*: set `initialization = "zeros"` in the input argument.
- *Random initialization*: set `initialization = "random"` in the input argument. This initializes the weights to large random values.
- *He initialization*: set `initialization = "he"` in the input argument. This initializes the weights to random values scaled according to a paper by He et al., 2015.

He initialization is named for the first author of He et al., 2015. It is similar to Xavier initialization, except that Xavier initialization scales the weights $W^{[l]}$ by `sqrt(1./layers_dims[l-1])`, whereas He initialization uses `sqrt(2./layers_dims[l-1])`.

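To make the two scaling factors concrete, here is a minimal NumPy sketch. The function name `initialize_parameters_sketch`, the `seed` argument, and the `"xavier"` option are assumptions added for illustration; this is not the assignment's own code.

```python
import numpy as np

def initialize_parameters_sketch(layers_dims, initialization="he", seed=0):
    """Return a dict with W1..WL and b1..bL for the given layer sizes.

    layers_dims[l] is the number of units in layer l (index 0 is the input).
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layers_dims)):
        n_in, n_out = layers_dims[l - 1], layers_dims[l]
        if initialization == "zeros":
            W = np.zeros((n_out, n_in))
        elif initialization == "xavier":
            # Xavier scaling: sqrt(1 / n_in)
            W = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)
        elif initialization == "he":
            # He scaling: sqrt(2 / n_in)
            W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
        else:
            raise ValueError(f"unknown initialization: {initialization}")
        parameters[f"W{l}"] = W
        # Biases can start at zero; the random weights already break symmetry.
        parameters[f"b{l}"] = np.zeros((n_out, 1))
    return parameters
```

With `layers_dims = [1000, 500, 1]`, the entries of `W1` should have a standard deviation close to `sqrt(2/1000)` under `"he"` and close to `sqrt(1/1000)` under `"xavier"`.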



## Regularization
---
The standard way to avoid overfitting is called **L2 regularization**. It consists of appropriately modifying your cost function, from:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small  y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}$$
To:
$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}$$

Let's modify your cost and observe the consequences.

**Exercise**: Implement `compute_cost_with_regularization()`, which computes the cost given by formula (2). To calculate $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$, use:
```python
np.sum(np.square(Wl))
```

**Observations**:
- The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
- L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to "oversmooth", resulting in a model with high bias.

**What is L2-regularization actually doing?**:

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values: large weights become too costly. This leads to a smoother model in which the output changes more slowly as the input changes.

**What you should remember** -- the implications of L2-regularization on:
- The cost computation:
    - A regularization term is added to the cost
- The backpropagation function:
    - There are extra terms in the gradients with respect to weight matrices
- Weights end up smaller ("weight decay"): 
    - Weights are pushed to smaller values.
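
The three points above can be sketched in NumPy as follows. The function names and signatures (`regularized_cost`, `l2_gradient_term`) are hypothetical and chosen for illustration; they are not the assignment's API. The sketch assumes `AL` and `Y` have shape `(1, m)` and that `weights` is a list of the matrices $W^{[l]}$.

```python
import numpy as np

def regularized_cost(AL, Y, weights, lambd):
    """Cross-entropy cost plus the L2 term from formula (2)."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    # (1/m) * (lambda/2) * sum over all layers of the squared weights
    l2_cost = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy + l2_cost

def l2_gradient_term(W, lambd, m):
    """Extra term added to dW: the derivative of the L2 cost with respect to W."""
    return (lambd / m) * W

# Tiny example: two samples, one weight matrix.
AL = np.array([[0.5, 0.5]])
Y = np.array([[1.0, 0.0]])
W1 = np.array([[1.0, 2.0]])
cost = regularized_cost(AL, Y, [W1], lambd=0.1)
extra_dW1 = l2_gradient_term(W1, lambd=0.1, m=2)
```

Because the extra gradient term is proportional to `W` itself, each gradient descent step shrinks the weights slightly, which is where the name "weight decay" comes from.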



Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system. 
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.

**Dropout** is a widely used regularization technique that is specific to deep learning.
**It randomly shuts down some neurons in each iteration.**

## Key points from the programming assignments
---
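Assignment 1: how the parameter initialization method affects a neural network

Assignment 2: using regularization to reduce the risk of overfitting

Assignment 3: gradient checking, a debugging method for when a network is not working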


### Dropout (random deactivation)
---
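Randomly select some neurons and remove them, which leaves a simplified neural network.

Commonly used in computer vision to prevent overfitting.

Why does dropout work? The network cannot rely on any single feature, because any feature may be dropped.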
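
A minimal sketch of the mechanism, assuming the "inverted dropout" variant (dividing by `keep_prob` so that the expected activation is unchanged), which these notes do not spell out. The function names and the `keep_prob` parameter are illustrative assumptions.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Randomly zero out units of the activation matrix A.

    Each unit is kept with probability keep_prob. Dividing by keep_prob
    (inverted dropout, an assumption here) keeps the expected value of A.
    """
    mask = rng.random(A.shape) < keep_prob
    A_dropped = A * mask / keep_prob
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask and scaling to the gradient during backprop."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((50, 200))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
```

Dropout is applied only during training; at test time the full network is used without a mask.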


### Other regularization methods
---
Data augmentation: e.g. rotating or shrinking images
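
As a rough illustration, extra training images can be generated with simple array operations. This sketch assumes an image stored as a NumPy array of shape `(height, width, channels)`; the `augment` helper is hypothetical, and the left-right flip is an added example that the notes do not mention.

```python
import numpy as np

def augment(image):
    """Return the original image plus a mirrored and a rotated copy."""
    flipped = np.fliplr(image)      # mirror left-right (reverses the width axis)
    rotated = np.rot90(image, k=1)  # rotate by 90 degrees in the height-width plane
    return [image, flipped, rotated]

image = np.arange(2 * 3 * 1).reshape(2, 3, 1)
augmented = augment(image)
```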

Early stopping: stop iterating early. The decision is based on the training set error curve and the dev set error curve.
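
One simple way to turn this into a rule is to stop once the held-out (dev set) error has not improved for a fixed number of epochs. The sketch below assumes such a "patience" rule; the function name and the `patience` parameter are illustrative and do not come from the course.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Return the index of the epoch whose parameters would be kept.

    Scans the dev error curve in order and stops once `patience` epochs
    have passed without a new best value.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

dev_errors = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.5]
kept = early_stopping_epoch(dev_errors, patience=3)
```

In this example training stops at epoch 5, so the later improvement at epoch 6 is never seen; a larger `patience` trades extra training time for a lower chance of stopping too soon.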

### Normalizing Training Data
---
* Subtract the mean
* Normalize the variance

Purpose: training runs faster and is less likely to overshoot the optimum.
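
The two steps can be sketched as follows, assuming the inputs are stored as a matrix `X` of shape `(n_features, m_examples)`. The helper names and the small `eps` constant are assumptions for illustration.

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute the per-feature mean and variance on the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma2 = X_train.var(axis=1, keepdims=True)
    return mu, sigma2

def normalize(X, mu, sigma2, eps=1e-8):
    """Subtract the mean and scale to unit variance (eps avoids division by zero)."""
    return (X - mu) / np.sqrt(sigma2 + eps)

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 1000))
X_test = rng.normal(loc=5.0, scale=3.0, size=(2, 200))

mu, sigma2 = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma2)
# Reuse the training-set statistics for the test set, so both go through the same transformation.
X_test_norm = normalize(X_test, mu, sigma2)
```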


### Vanishing / Exploding Gradients
---


### Weight initialization for deep networks
---

### Numerical approximation of gradients
---

### Gradient Checking: an effective way to debug neural networks
---
1. Do not run gradient checking during training; use it only while debugging.
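
A minimal sketch of the idea: compare the analytic gradient with a two-sided numerical approximation and look at their relative difference. The function name, the `epsilon` default, and the relative-difference measure are assumptions for illustration; `theta` is assumed to be a float NumPy array holding all parameters, with a nonzero gradient.

```python
import numpy as np

def gradient_check(cost_fn, grad_fn, theta, epsilon=1e-7):
    """Return the relative difference between analytic and numerical gradients."""
    grad = grad_fn(theta)
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus.flat[i] += epsilon
        theta_minus = theta.copy()
        theta_minus.flat[i] -= epsilon
        # Two-sided difference: (J(theta + eps) - J(theta - eps)) / (2 * eps)
        grad_approx.flat[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator

# Toy cost J(theta) = sum(theta^2), whose exact gradient is 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
cost = lambda t: np.sum(t ** 2)
good = gradient_check(cost, lambda t: 2 * t, theta)   # correct gradient
buggy = gradient_check(cost, lambda t: 3 * t, theta)  # deliberately wrong gradient
```

A correct gradient gives a very small relative difference, while the deliberately wrong one gives a large value. The check evaluates the cost twice per parameter, which is why it is too slow to run during training.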


## Further reading

* [On the vanishing and exploding gradient problem](http://blog.csdn.net/qq_29133371/article/details/51867856)
* [The underlying causes of exploding and vanishing gradients](http://blog.csdn.net/lujiandong1/article/details/53320174)