## Model Representation

To establish notation for future use, we'll use $x^{(i)}$ to denote the **input** variables (*living area in this example*), also called input features, and $y^{(i)}$ to denote the **output** or target variable that we are trying to predict (*price*).

A pair $(x^{(i)}, y^{(i)})$ is called a **training example**, and the dataset that we'll be using to learn, a list of $m$ training examples $(x^{(i)}, y^{(i)}); i = 1, \dots, m$, is called a **training set**.

Note that the superscript "$^{(i)}$" in the notation is simply an index into the training set, and has nothing to do with exponentiation.

We will also use $X$ to denote the space of input values, and $Y$ to denote the space of output values. In this example, $X = Y = \mathbb{R}$.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h: X \to Y$, so that $h(x)$ is a "*good*" predictor for the corresponding value of $y$.

For historical reasons, this function $h$ is called a **hypothesis**.
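As a rough illustration of this notation, the sketch below stores a tiny training set and represents a hypothesis as an ordinary function. The data values, the linear form of $h$, and the name `make_hypothesis` are assumptions made up for this example; they are not part of the original notes.

```python
import numpy as np

# Hypothetical toy training set: x is the living area, y is the price.
# The numbers are invented purely to illustrate the notation.
x_train = np.array([1.0, 2.0, 3.0, 4.0])   # x^(1), ..., x^(m)
y_train = np.array([2.0, 4.0, 6.0, 8.0])   # y^(1), ..., y^(m)
m = len(x_train)                            # number of training examples


def make_hypothesis(theta0, theta1):
    """Return a hypothesis h: X -> Y, assumed here to be a straight line."""
    def h(x):
        return theta0 + theta1 * x
    return h


# One candidate hypothesis; a learning algorithm would choose the parameters.
h = make_hypothesis(0.0, 2.0)
predictions = h(x_train)

for i in range(m):
    # Note: i + 1 is an index into the training set, not an exponent.
    print(f"x^({i + 1}) = {x_train[i]}, y^({i + 1}) = {y_train[i]}, h(x) = {predictions[i]}")
```

Here `h` happens to predict every training target exactly only because the toy data was built that way.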

Seen pictorially, the process is therefore like this:



## Diagnosing Bias vs. Variance

In this section we examine the relationship between the degree of the polynomial $d$ and the underfitting or overfitting of our hypothesis.

- We need to distinguish whether **bias** or **variance** is the problem contributing to bad predictions.
- **High bias is underfitting** and **high variance is overfitting**. Ideally, we need to find a golden mean between these two.

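The training error will tend to **decrease** as we increase the degree $d$ of the polynomial. At the same time, the cross validation error will tend to **decrease** as we increase $d$ up to a point, and then it will **increase** as $d$ is increased further, forming a convex curve.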

#### High Bias (Underfitting):

Both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$.

#### High Variance (Overfitting):

$J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$.

![image.png](attachment:image.png)
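To make the diagnosis concrete, here is a minimal sketch that fits polynomials of increasing degree $d$ and reports $J_{train}$ and $J_{CV}$ for each. The synthetic data, the train/CV split, and the helper names `half_mse` and `errors_by_degree` are assumptions chosen for illustration, and the fit uses `np.polyfit` (plain least squares) rather than any particular optimizer from the course.

```python
import numpy as np


def half_mse(theta, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2) for a polynomial h."""
    residuals = np.polyval(theta, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(x)))


def errors_by_degree(x_train, y_train, x_cv, y_cv, degrees):
    """Fit one polynomial per degree; return (d, J_train, J_cv) tuples."""
    results = []
    for d in degrees:
        theta = np.polyfit(x_train, y_train, d)  # least-squares fit of degree d
        results.append((d,
                        half_mse(theta, x_train, y_train),
                        half_mse(theta, x_cv, y_cv)))
    return results


# Synthetic data (an assumption for this example): a noisy sine curve.
rng = np.random.default_rng(0)
x_all = rng.uniform(-1.0, 1.0, size=60)
y_all = np.sin(3.0 * x_all) + rng.normal(scale=0.2, size=60)
x_train, y_train = x_all[:40], y_all[:40]
x_cv, y_cv = x_all[40:], y_all[40:]

results = errors_by_degree(x_train, y_train, x_cv, y_cv, range(1, 9))
for d, j_train, j_cv in results:
    print(f"d = {d}: J_train = {j_train:.4f}, J_cv = {j_cv:.4f}")
```

Reading the printed table: a low degree where both errors are high and close together suggests high bias, while a high degree where $J_{CV}$ is much larger than $J_{train}$ suggests high variance.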

## Regularization and Bias/Variance

#### Linear regression with regularization

Model: 

$$ h_{\theta}(x) = \theta_0 +  \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4 $$

$$ J(\theta) = \frac{1}{2m} \sum_{i = 1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 $$
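The cost above can be written in a few lines of NumPy. This is a sketch under assumptions: the helper names `polynomial_features` and `regularized_cost` are made up for this example, and the tiny data set is chosen only so the result can be checked by hand. Note that the penalty skips $\theta_0$, matching the sum starting at $j = 1$.

```python
import numpy as np


def polynomial_features(x, degree=4):
    """Build the design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * sum of squared errors + lam/(2m) * sum_{j>=1} theta_j^2."""
    m = len(y)
    residuals = X @ theta - y
    fit_term = np.sum(residuals ** 2) / (2 * m)
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return float(fit_term + penalty)


# Hand-checkable example: y = 1 + x^2 is reproduced exactly by theta below,
# so the fit term is 0 and only the penalty contributes.
x = np.array([0.0, 1.0, 2.0])
y = 1.0 + x ** 2
theta = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
X = polynomial_features(x)

cost_no_reg = regularized_cost(theta, X, y, lam=0.0)
cost_reg = regularized_cost(theta, X, y, lam=6.0)  # penalty = 6 / (2 * 3) * 1^2
print(cost_no_reg, cost_reg)
```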

![image.png](attachment:image.png)

In the figure above, we see that as $\lambda$ increases, our fit becomes more rigid.

On the other hand, as $\lambda$ approaches $0$, we tend to overfit the data. So how do we choose our parameter $\lambda$ to get it "just right"? In order to choose the model and the regularization term $\lambda$, we need to:

1. Create a list of lambdas (i.e. $\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24\}$).

2. Create a set of models with different degrees or any other variants.

3. Iterate through the $\lambda$s and for each $\lambda$ go through all the models to learn some $\Theta$.

4. Compute the cross validation error $J_{CV}(\Theta)$ using the learned $\Theta$ (computed with $\lambda$), but **without** regularization, i.e. with $\lambda = 0$.

5. Select the best combo that produces the lowest error on the cross validation set.

6. Using the best combo $\Theta$ and $\lambda$, apply it on $J_{test}(\Theta)$ to see if it generalizes well to the problem.
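The steps above can be sketched as a small grid search. Everything here is illustrative: the synthetic data, the train/CV/test split sizes, the set of degrees, and the helper names are assumptions, and $\Theta$ is learned with the closed-form regularized normal equation rather than gradient descent, which is a substitution made to keep the example short.

```python
import numpy as np

LAMBDAS = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
DEGREES = [1, 2, 3, 4, 5, 6]  # assumed set of candidate models


def design_matrix(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def fit_regularized(X, y, lam):
    """Minimize the regularized cost via (X^T X + lam * L) theta = X^T y."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # do not penalize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)


def unregularized_cost(theta, X, y):
    """J(theta) with lambda = 0, used for the CV and test errors."""
    residuals = X @ theta - y
    return float(np.sum(residuals ** 2) / (2 * len(y)))


def select_model(x_train, y_train, x_cv, y_cv, degrees, lambdas):
    """Return the (degree, lambda, theta) combo with the lowest J_cv."""
    best = None
    for lam in lambdas:
        for d in degrees:
            theta = fit_regularized(design_matrix(x_train, d), y_train, lam)
            j_cv = unregularized_cost(theta, design_matrix(x_cv, d), y_cv)
            if best is None or j_cv < best["j_cv"]:
                best = {"degree": d, "lam": lam, "theta": theta, "j_cv": j_cv}
    return best


# Synthetic data (an assumption for this example), split into train / CV / test.
rng = np.random.default_rng(1)
x_all = rng.uniform(-1.0, 1.0, size=90)
y_all = np.sin(3.0 * x_all) + rng.normal(scale=0.2, size=90)
x_train, y_train = x_all[:50], y_all[:50]
x_cv, y_cv = x_all[50:70], y_all[50:70]
x_test, y_test = x_all[70:], y_all[70:]

best = select_model(x_train, y_train, x_cv, y_cv, DEGREES, LAMBDAS)

# Final check on data that played no role in choosing theta, degree, or lambda.
j_test = unregularized_cost(best["theta"], design_matrix(x_test, best["degree"]), y_test)
print(f"degree = {best['degree']}, lambda = {best['lam']}, "
      f"J_cv = {best['j_cv']:.4f}, J_test = {j_test:.4f}")
```

Keeping the test set out of the selection loop is the point of step 6: $J_{CV}$ of the winning combo is optimistically biased because the combo was chosen to minimize it.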