# Linear Regression

* A very simple approach for supervised learning.
* Especially useful for predicting a quantitative response.
* Many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.

## Simple Linear Regression

* Predicting a quantitative response $Y$ on the basis of a single predictor variable $X$
$$Y = \beta_0 + \beta_1X + \epsilon$$

* $\beta_0$: intercept
* $\beta_1$: slope

* Given some estimates $\hat \beta_0$ and $\hat \beta_1$, we predict the response using
$$\hat y = \hat \beta_0 + \hat \beta_1 x$$
* where $\hat y$ indicates a prediction of $Y$ on the basis of $X = x$

### Estimation of the parameters by least squares

* Let $\hat y_i = \hat \beta_0 + \hat \beta_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat y_i$ represents the $i$th residual.
* We define the *residual sum of squares* (RSS) as

$$\text{RSS} = e_1^2 + e_2^2 + \dots + e_n^2$$

* or equivalently as
$$\text{RSS} = (y_1 - \hat\beta_0 - \hat \beta_1x_1)^2 + (y_2 - \hat\beta_0 - \hat \beta_1x_2)^2 + \dots + (y_n - \hat\beta_0 - \hat \beta_1x_n)^2$$

* The least squares approach chooses $\hat \beta_0$ and $\hat \beta_1$ to minimize the RSS

* The minimizing values can be shown to be:
$$\hat \beta_1 = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}
{\sum_{i=1}^n(x_i - \bar{x})^2}$$
$$\hat \beta_0 = \bar{y} - \hat \beta_1 \bar{x},$$

* where $\bar{y} \equiv \frac{1}{n}\sum_{i=1}^n y_i$ and $\bar{x} \equiv \frac{1}{n}\sum_{i=1}^n x_i$ are the sample means
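
The closed-form estimates above can be checked numerically. The following is a minimal Python sketch, not a reference implementation: the function names `fit_simple_linear_regression` and `rss` and the five data points are made up for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta0_hat, beta1_hat) from the least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n  # sample mean of the predictor
    y_bar = sum(y) / n  # sample mean of the response
    # Numerator and denominator of the slope estimate
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def rss(x, y, beta0_hat, beta1_hat):
    """Residual sum of squares for a candidate line."""
    return sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, chosen only to keep the arithmetic easy to follow
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = fit_simple_linear_regression(x, y)
print(f"intercept = {beta0_hat:.3f}, slope = {beta1_hat:.3f}")  # intercept = 2.200, slope = 0.600
print(f"RSS = {rss(x, y, beta0_hat, beta1_hat):.3f}")  # RSS = 2.400
```

Nudging either coefficient away from the fitted values and recomputing `rss` should give a larger number, which is what minimizing the RSS means.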

### Assessing the Accuracy of the Coefficient Estimates

* The standard error of an estimator reflects how it varies under repeated sampling.
* We have:

$$\text{SE}(\hat \beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}, ~~\text{SE}(\hat \beta_0)^2 = \sigma^2\bigg[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\bigg]$$

* where $\sigma^2 = \text{Var}(\epsilon)$ is the variance of the error term $\epsilon$
 
* These standard errors can be used to compute *confidence intervals*. 
* A 95\% confidence interval is defined as a range of values such that with 95\% probability, the range will contain the true unknown value of the parameter.
* For $\beta_1$, it approximately has the form
$$\hat\beta_1 \pm 2 \cdot \text{SE}(\hat\beta_1)$$

* That is, there is approximately a 95\% chance that the interval
$$\big[\hat\beta_1 - 2 \cdot \text{SE}(\hat \beta_1), ~~\hat\beta_1 + 2 \cdot \text{SE}(\hat \beta_1)\big]$$
* will contain the true value of $\beta_1$ (under a scenario where we got repeated samples)
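
The standard errors and the approximate 95% interval for the slope can be sketched in Python as follows. One assumption goes beyond the notes: $\sigma^2$ is usually unknown, so this sketch replaces it with the common estimate $\text{RSS}/(n-2)$. The function name `coefficient_standard_errors` and the data are hypothetical.

```python
import math


def coefficient_standard_errors(x, y):
    """Return (beta0_hat, beta1_hat, se_beta0, se_beta1) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    rss = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    # Assumption: sigma^2 is unknown and estimated by RSS / (n - 2)
    sigma2_hat = rss / (n - 2)
    se_beta1 = math.sqrt(sigma2_hat / sxx)
    se_beta0 = math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))
    return beta0_hat, beta1_hat, se_beta0, se_beta1


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat, se_beta0, se_beta1 = coefficient_standard_errors(x, y)
# Approximate 95% confidence interval: estimate plus or minus two standard errors
ci_low = beta1_hat - 2 * se_beta1
ci_high = beta1_hat + 2 * se_beta1
print(f"SE(beta0) = {se_beta0:.3f}, SE(beta1) = {se_beta1:.3f}")  # SE(beta0) = 0.938, SE(beta1) = 0.283
print(f"approx. 95% CI for beta1: [{ci_low:.3f}, {ci_high:.3f}]")  # [0.034, 1.166]
```

With only five observations the factor 2 is a rough stand-in for the exact t-quantile, so the interval here is illustrative rather than precise.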

### Hypothesis testing

* Standard errors can also be used to perform *hypothesis tests* on the coefficients. 
* The most common hypothesis test involves testing the *null hypothesis*
    * $H_0$: There is no relationship between $X$ and $Y$
* versus the *alternative hypothesis*
    * $H_A$: There is some relationship between $X$ and $Y$
    
* Mathematically, this corresponds to testing:
$$H_0: \beta_1 = 0$$ versus
$$H_A: \beta_1 \neq 0,$$
* since if $\beta_1 = 0$, the model reduces to $Y = \beta_0 + \epsilon$, and $X$ is not associated with $Y$.

* To test the null hypothesis, we compute a *t-statistic*, given by
$$t = \frac{\hat \beta_1 - 0}{\text{SE}(\hat\beta_1)},$$
* This will have a t-distribution with $n - 2$ degrees of freedom, assuming $\beta_1 = 0$
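* *p-value*: the probability, assuming $\beta_1 = 0$, of observing a t-statistic at least as large in absolute value as $|t|$. A small p-value is evidence against $H_0$.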
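
The t-statistic and its two-sided p-value can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions: $\sigma^2$ is estimated by $\text{RSS}/(n-2)$, the tail probability is obtained by integrating the t-density numerically with Simpson's rule (a statistics library would normally be used instead), and the names `t_pdf`, `two_sided_p_value`, `slope_t_test`, and the data are hypothetical.

```python
import math


def t_pdf(u, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + u * u / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=10_000):
    """P(|T| >= |t_stat|), using Simpson's rule on the interval [0, |t_stat|]."""
    a = abs(t_stat)
    if a == 0:
        return 1.0
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of subintervals
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * (h / 3) * total  # P(-|t| <= T <= |t|), by symmetry
    return max(0.0, 1.0 - central_mass)


def slope_t_test(x, y):
    """Return (t, p-value) for H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    rss = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    # Assumption: sigma^2 is estimated by RSS / (n - 2)
    se_beta1 = math.sqrt(rss / (n - 2) / sxx)
    t_stat = (beta1_hat - 0) / se_beta1
    return t_stat, two_sided_p_value(t_stat, n - 2)


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

t_stat, p_value = slope_t_test(x, y)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")  # t = 2.121, p-value = 0.124
```

Because `math.gamma` overflows for large arguments, this sketch is only suitable for small samples (a few hundred degrees of freedom at most).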