# <font color="bordo">Regularization</font>

## Result of regularization

Reducing the magnitude of the parameters $θ$ gives us a simpler hypothesis (smoother curve)
  - Effectively get rid of some of the terms
  - A simpler hypothesis is less prone to overfitting

## Cost function optimization with regularization

### Penalize and make some of the $θ$ parameters really small
e.g. modify our cost function to penalize large $θ_3$ and $θ_4$
<img src="images/regularization - 1.png">

### Here we end up with $θ_3$ and $θ_4$ being close to zero
so we're basically left with a quadratic function: <font size="3em">$θ_0 + θ_1x + θ_2x^2$</font>
<img src="images/regularization - 2.png">

### In general we penalize ALL our parameters
<img src="images/regularization - 3.png">
  - By convention you don't penalize $θ_0$ - the penalty sum runs from $θ_1$ onwards
  - *<font color="blue" size="3em">$λ$</font>* is the *<font color="blue">regularization parameter</font>*
    - Controls a trade-off between our two goals:
      - Fit the training set well
      - Keep parameters small
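
As a rough illustration of the two goals above, here is a minimal Python sketch of a regularized squared-error cost. It assumes the usual form $J(θ) = \frac{1}{2m}\left[\sum_i (h_θ(x^{(i)}) - y^{(i)})^2 + λ\sum_{j=1}^{n} θ_j^2\right]$; the function names `predict` and `regularized_cost` are made up for this example and are not taken from the notes.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus a penalty on theta_1..theta_n (theta_0 is skipped)."""
    m = len(y)
    # Goal 1: fit the training set well.
    squared_error = sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))
    # Goal 2: keep the parameters small (assumed penalty form, see lead-in).
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return (squared_error + penalty) / (2 * m)
```

With `lam = 0` this reduces to the unregularized cost; increasing `lam` makes large values of $θ_1 ... θ_n$ more expensive, while $θ_0$ is left alone.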

### How to choose λ
- <font color="red">If λ is very large</font> we end up heavily penalizing ALL the parameters (θ1, θ2 etc.) 
  <br>so all the parameters end up being close to zero
  <br>If this happens, it's like we got rid of all the terms in the hypothesis except $θ_0$
  <br>The result is <font color="red">underfitting</font>
  <br>The hypothesis is too <font color="red">biased</font>: effectively it has no parameters left other than $θ_0$, so it is close to a flat line
<p>
- So, λ should be chosen carefully - not too big
  <br>We look at some automatic ways to select λ later...
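
To see the effect of a very large λ in numbers, the sketch below fits a single-feature model in closed form. The formula for `theta1` comes from setting the derivatives of the penalized squared error to zero for this one-feature case; it is an illustration under that assumption, not the method used in these notes, and `fit_one_feature` is a hypothetical helper.

```python
def fit_one_feature(xs, ys, lam):
    """Minimize sum((theta0 + theta1*x - y)^2) + lam * theta1^2 in closed form."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = s_xy / (s_xx + lam)       # larger lam pushes theta1 toward 0
    theta0 = y_mean - theta1 * x_mean  # theta0 is not penalized
    return theta0, theta1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data lying exactly on y = 2x

for lam in (0.0, 1.0, 100.0, 10000.0):
    theta0, theta1 = fit_one_feature(xs, ys, lam)
    print(f"lambda={lam:>8}: theta0={theta0:.3f}, theta1={theta1:.3f}")
```

As λ grows, `theta1` shrinks toward zero and `theta0` moves toward the mean of `ys`, so the fitted line flattens out and underfits the data.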

### <font color="bordo">Cost Function with Regularization for Linear Regression</font>
<img src="images/regularization - LR - 1.png">
<p>
- Previously, Gradient Descent would repeatedly update all the parameters $θ_j$ simultaneously, for j = 0, 1, 2, ..., n
<img src="images/regularization - LR - 2.png" />
<p>
- Let's write the case for $θ_0$ separately
<img src="images/regularization - LR - 3.png" />
<p>
- To modify this algorithm to use the regularized objective:
  - Keep the first update (the one for $θ_0$) unchanged (we don't penalize $θ_0$)
  - Modify the second update (for j = 1, 2, ..., n) as follows:
<img src="images/regularization - LR - 4.png" />
  - The pink-colored term is <font size="4em">$\frac{\partial}{\partial \theta_j}J(\theta)$</font> 
    for the regularized $J(\theta)$
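
The modified update can be sketched in plain Python as below. This assumes the regularized gradient for $j \geq 1$ is $\frac{1}{m}\sum_i (h_θ(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{λ}{m}θ_j$; the name `gradient_step` and the data layout (feature lists without the constant $x_0 = 1$) are choices made for this example only.

```python
def gradient_step(theta, X, y, alpha, lam):
    """One simultaneous update of all theta_j for regularized linear regression.

    X holds one feature list per example, without the constant x_0 = 1 term.
    """
    m = len(y)
    # Prediction errors h_theta(x_i) - y_i, all computed from the old theta.
    errors = [
        theta[0] + sum(t * xj for t, xj in zip(theta[1:], xi)) - yi
        for xi, yi in zip(X, y)
    ]
    # theta_0: plain gradient step, no penalty.
    new_theta = [theta[0] - alpha * sum(errors) / m]
    for j in range(1, len(theta)):
        grad = sum(e * xi[j - 1] for e, xi in zip(errors, X)) / m
        grad += (lam / m) * theta[j]  # extra term from the penalty
        new_theta.append(theta[j] - alpha * grad)
    return new_theta
```

Building `new_theta` from the old `theta` only is what keeps the update simultaneous.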

### Note
If you group all the terms that depend on $\theta_j$, the second update above can be rewritten in an equivalent form:
  - Note that the second term is exactly the same as in the original (unregularized) gradient descent update for $(j = 1,2,...,n)$
    <img src="images/regularization - LR - 5.png" />
  - The term <font size="4em">$\left(1 - \alpha\frac{\lambda}{m}\right)$</font> is a number slightly less than 1 
    <br>usually the learning rate is small and m is large, so this typically evaluates to (1 - a small number),
    often around 0.95 to 0.99
    <br>so each update first shrinks $\theta_j$ a little and then applies the usual gradient step
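
A quick numeric check of this regrouping, as a sketch: `data_grad` stands for the unregularized gradient term $\frac{1}{m}\sum_i (h_θ(x^{(i)}) - y^{(i)})x_j^{(i)}$, and the values of `alpha`, `lam` and `m` are hypothetical, picked only to show the size of the shrink factor.

```python
def update_explicit(theta_j, data_grad, alpha, lam, m):
    """theta_j := theta_j - alpha * (data_grad + (lam / m) * theta_j)"""
    return theta_j - alpha * (data_grad + (lam / m) * theta_j)


def update_grouped(theta_j, data_grad, alpha, lam, m):
    """theta_j := theta_j * (1 - alpha * lam / m) - alpha * data_grad"""
    return theta_j * (1 - alpha * lam / m) - alpha * data_grad


# Hypothetical values, chosen only to illustrate the shrink factor.
alpha, lam, m = 0.01, 10.0, 50
shrink = 1 - alpha * lam / m
print(shrink)
```

With these values the factor is about 0.998, and both update functions return the same result for any `theta_j` and `data_grad`, up to floating-point rounding.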

### <font color="bordo">Cost Function with Regularization for Logistic Regression</font>

We saw earlier that logistic regression can be prone to overfitting with lots of features
<p>
- Original Logistic Regression Cost Function:
  <img src="images/regularization - LR - 6.png" />
<p>
- To regularize it we add an extra penalty term
  <img src="images/regularization - LR - 7.png" />
<p>
- Like in Linear Regression, this has the effect of penalizing the parameters $θ_1$, $θ_2$ up to $θ_n$
<p>
- The $θ$ update rule has the same form as in Linear Regression, except that the hypothesis is now the sigmoid $h_θ(x) = \frac{1}{1 + e^{-θ^Tx}}$
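
A minimal sketch of the regularized logistic regression cost and one gradient descent step, assuming the extra term is $\frac{λ}{2m}\sum_{j=1}^{n} θ_j^2$ added to the usual log-loss. All function names here are invented for the example, and the code makes no attempt to guard against overflow or `log(0)` for extreme inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta_0 + theta_1*x_1 + ... + theta_n*x_n)."""
    return sigmoid(theta[0] + sum(t * xj for t, xj in zip(theta[1:], x)))


def logistic_cost(theta, X, y, lam):
    """Log-loss plus a penalty on theta_1..theta_n (theta_0 is skipped)."""
    m = len(y)
    log_loss = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        log_loss -= yi * math.log(h) + (1 - yi) * math.log(1 - h)
    penalty = (lam / (2 * m)) * sum(t ** 2 for t in theta[1:])
    return log_loss / m + penalty


def logistic_gradient_step(theta, X, y, alpha, lam):
    """Same update form as linear regression, but with the sigmoid hypothesis."""
    m = len(y)
    errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
    new_theta = [theta[0] - alpha * sum(errors) / m]  # theta_0: no penalty
    for j in range(1, len(theta)):
        grad = sum(e * xi[j - 1] for e, xi in zip(errors, X)) / m
        grad += (lam / m) * theta[j]
        new_theta.append(theta[j] - alpha * grad)
    return new_theta
```

The only difference from the linear regression step is the call to `hypothesis`, which applies the sigmoid to the linear combination.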