## Regularization
----

### Norm
- A norm is a function that measures the size (length) of a vector. The norm of the difference between two vectors measures the distance between them.
- There are several kinds of norm, such as the p-norm and the maximum norm.
- The most widely used is the p-norm.
- `L1 Norm` : If p=1, it is called the Taxicab norm or Manhattan norm.
- `L2 Norm` : If p=2, it is called the Euclidean norm.

$$ ||x||_p := (\sum_{i=1}^{n} |x_i|^p)^{1/p} $$

In [5]:
import math
import numpy as np

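def p_norm(vector, p):
    # Take absolute values first, as in the formula, so that negative
    # components are handled correctly when p is odd.
    return math.pow(np.sum(np.power(np.abs(vector), p)), 1/p)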

arr1 = np.array([1,2,3,4,5])
print(p_norm(arr1, 1))
print(p_norm(arr1, 2))

15.0
7.416198487095663
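
As a cross-check, NumPy provides its own routine, `np.linalg.norm`, whose `ord` argument selects the norm. The sketch below uses it to compute the L1 and L2 norms and also the maximum norm mentioned above (`ord=np.inf`), which is the largest absolute component. The vector `v` is an arbitrary example, chosen here to include a negative entry.

```python
import numpy as np

# Arbitrary example vector (an assumption for illustration), with a negative entry.
v = np.array([3.0, -4.0, 0.0])

l1 = np.linalg.norm(v, ord=1)          # sum of absolute values
l2 = np.linalg.norm(v, ord=2)          # Euclidean length
l_max = np.linalg.norm(v, ord=np.inf)  # largest absolute component

print(l1, l2, l_max)
```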


### Loss
- Using norms, we can define two types of loss: `L1 loss` and `L2 loss`.

$$ L1 = \sum_{i=1}^{n}|y_i - f(x_i)| $$ 

$$ L2 = \sum_{i=1}^{n}(y_i - f(x_i))^2 $$ 

- L1 loss is called `LAD` (Least Absolute Deviations).
- L2 loss is called `LSE` (Least Square Error). It is the sum of squared errors; dividing it by n gives MSE (Mean Square Error), so both are minimized by the same model.
- Robustness : Robustness describes how strongly the loss is affected by outliers. Because L2 squares each error, a single large error dominates the sum, so LSE grows much faster than LAD when outliers are present. To reduce the influence of outliers, choose L1 loss.

In [6]:
def l1_loss(y, y_pred):
    return sum(np.abs(y-y_pred))

def l2_loss(y, y_pred):
    return sum(np.square(y-y_pred))

arr1 = np.array([1,2,3,4,5])
arr2 = np.array([1.1,2.2,2.9,4.1,4.8])
print(l1_loss(arr1, arr2))
print(l2_loss(arr1, arr2))

0.7000000000000002
0.1100000000000001
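
The robustness point can be checked with a small experiment. The sketch below reuses the same two arrays and corrupts one target value (a hypothetical outlier, changed from 5 to 15), then compares how much each loss grows relative to its clean value.

```python
import numpy as np

def l1_loss(y, y_pred):
    return np.sum(np.abs(y - y_pred))

def l2_loss(y, y_pred):
    return np.sum(np.square(y - y_pred))

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.2, 2.9, 4.1, 4.8])

# Hypothetical outlier: the last target is corrupted from 5 to 15.
y_outlier = y.copy()
y_outlier[-1] = 15.0

# Ratio of the loss with the outlier to the loss without it.
l1_growth = l1_loss(y_outlier, y_pred) / l1_loss(y, y_pred)
l2_growth = l2_loss(y_outlier, y_pred) / l2_loss(y, y_pred)
print(l1_growth, l2_growth)
```

In this example the L1 loss grows by a factor of roughly 15, while the L2 loss grows by a factor of several hundred, because the single large error is squared.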


### Regularization

- Regularization adds a penalty for model complexity to the cost function. This helps prevent overfitting and improves generalization.
- Generalization means that the model performs well on data it was not trained on. If one feature has a far larger weight than the others, the model depends too heavily on that feature and tends to generalize poorly.
- There are many regularization techniques for preventing overfitting and balancing the influence of features (e.g. L1 regularization, L2 regularization, Dropout).
- `Summary : Add a penalty to the loss so that the model does not fit the training data too perfectly`
- L1 Regularization (a regression model that uses L1 is called a Lasso model)

$$ Cost(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}_i, y_i) + \lambda ||w||_1 $$

- L2 Regularization (a regression model that uses L2 is called a Ridge model). The penalty is the squared L2 norm.

$$ Cost(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}_i, y_i) + \lambda ||w||_2^2 $$
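
The two cost functions can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the per-sample loss `L` is taken to be the squared error, the L2 penalty uses the squared norm (the usual convention for Ridge), and the function name `regularized_cost` and its `kind` parameter are hypothetical names chosen for this example.

```python
import numpy as np

def regularized_cost(y, y_pred, w, lam, kind="l2"):
    """Mean squared error plus an L1 or squared-L2 penalty on the weights."""
    # (1/m) * sum of per-sample losses; squared error is an assumption here.
    data_loss = np.mean(np.square(y - y_pred))
    if kind == "l1":
        penalty = np.sum(np.abs(w))      # ||w||_1
    elif kind == "l2":
        penalty = np.sum(np.square(w))   # ||w||_2^2
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return data_loss + lam * penalty

# Predictions are perfect here, so only the penalty term contributes.
y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 3.0])
w = np.array([3.0, -4.0])

cost_l1 = regularized_cost(y, y_pred, w, lam=0.1, kind="l1")
cost_l2 = regularized_cost(y, y_pred, w, lam=0.1, kind="l2")
print(cost_l1, cost_l2)
```

Even with zero data loss the cost is non-zero, which is the point of the penalty: large weights are charged for regardless of how well the training data is fit.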

- L1 drives some weights to exactly zero, so it acts as feature selection and works well when only a few features are relevant (a sparse solution). L2, on the other hand, shrinks all weights toward zero without eliminating any of them. The chart below illustrates this.

![regularization](img/regularization.png)
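
The same effect can be seen numerically in the simplest possible setting. Assume, purely for illustration, that each weight is fitted independently with data loss ½(w − a)², where `a` is the optimum without regularization. With the penalty λ|w| the minimizer is the soft-thresholded value sign(a)·max(|a| − λ, 0); with the penalty λw² it is a / (1 + 2λ). This is a sketch of the mechanism, not a full Lasso or Ridge fit, and the helper names `l1_shrink` and `l2_shrink` are hypothetical.

```python
import numpy as np

def l1_shrink(a, lam):
    # Minimizer of 0.5 * (w - a)**2 + lam * |w| (soft thresholding).
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_shrink(a, lam):
    # Minimizer of 0.5 * (w - a)**2 + lam * w**2 (uniform scaling).
    return a / (1.0 + 2.0 * lam)

# Hypothetical unregularized weights: two large, two small.
a = np.array([3.0, 0.5, -0.2, -2.0])
lam = 1.0

w_l1 = l1_shrink(a, lam)
w_l2 = l2_shrink(a, lam)
print(w_l1)
print(w_l2)
```

With these values the L1 solution sets the two small weights to exactly zero and keeps the two large ones, while the L2 solution scales every weight by one third and leaves all of them non-zero.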