### $\underline{\text{Simple Linear Regression Summary:}}$

Approximate **linear relationship** between $Y$ and $X$ (regress $Y$ on $X$) is:

$$ Y \approx \beta_0 + \beta_1 X$$ 

where $\beta_0$ is the intercept and $\beta_1$ is the slope. 

Use the **training data** to estimate $\beta_0$ and $\beta_1$:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$

Estimate the coefficients using observation pairs $(x_1, y_1), \cdots, (x_n, y_n)$.

Goal is to obtain **coefficient estimates** 
$\hat{\beta}_0$ and $\hat{\beta}_1$ such that the linear model fits the data well: 

$$y_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i, \quad i = 1, \cdots, n.$$


**Residual** is the difference between the $i$th observed response value and the $i$th response value predicted by the linear model:
$$e_i = y_i - \hat{y}_i.$$ 

**Residual Sum of Squares ($RSS$)**, $RSS = e_1^2 + \cdots + e_n^2$, is minimized using least squares to find the best $\hat{\beta}_0$, $\hat{\beta}_1$ in the model. 

**Best Coefficient Estimate, $\hat{\beta_0}$, $\hat{\beta_1}$**:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

$$\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i -\bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$

**Sample means**:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$$

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$$

Note: $\sum_{i=1}^n y_i = n \bar{y}$ and $\sum_{i=1}^n x_i = n \bar{x}$.
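
As a quick numerical check of the closed-form estimates above, the following minimal sketch computes $\hat{\beta}_0$ and $\hat{\beta}_1$ with NumPy. The four data points are illustrative values assumed for this example, not data from a real study.

```python
import numpy as np

# Illustrative data (an assumption made for this example)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by sum of squared x-deviations
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: the fitted line passes through (x_bar, y_bar)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)
```

For these points the slope works out to about $2.4$ and the intercept to about $-1$.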


**True relationship**:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\varepsilon$ is a mean-zero random error or noise. 


**Population regression line**: the best linear approximation to the true relationship between $X$ and $Y$; the coefficients $\beta_0$, $\beta_1$ define the population regression line. 

**Least squares line**: the least squares coefficient estimates characterize this line. The more spread out the $x_i$'s are, the more precise the slope estimate, because $\sum_{i=1}^n(x_i - \bar{x})^2$, the denominator of $\hat{\beta_1}$, grows with the spread. If the points are concentrated we can easily "turn" the slope, but if they are spread out we can more easily pin the slope down. For experimental design we therefore prefer $x_i$'s that are more spread out. 

**Population mean**: the mean of random variable $Y$ is $\mu$.

**Sample mean**: if we have access to $n$ observations from $Y$, then a reasonable estimate for $\mu$ is 

$$\hat{\mu}=\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i$$

where $\hat{\mu}$ and $\mu$ are different, but $\hat{\mu}$ is generally a good estimate of $\mu$. The sample mean is an unbiased estimate of the population mean: if we could average a huge number of estimates $\hat{\mu}$, each computed from a different set of observations, then this average would exactly equal $\mu$.


**Standard error**: roughly, the average amount the estimate $\hat{\mu}$ differs from $\mu$. The standard error shrinks as $n$ grows:

$$Var(\hat\mu)=SE(\hat{\mu})^2=\frac{\sigma^2}{n}$$

where $\sigma$ is the standard deviation of each realization $y_i$ of $Y$. For the regression coefficients the analogous formulas use $\sigma^2=Var(\varepsilon)$:


$$SE(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n(x_i - \bar{x})^2} \right]$$

$$SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}.$$

For the formulas for $SE(\hat{\beta}_0)^2$ and $SE(\hat{\beta}_1)^2$ to be strictly valid, the errors $\varepsilon_i$ for each observation must have common variance $\sigma^2$ and be uncorrelated. 

**Leverage**: $SE(\hat{\beta}_1)$ is smaller when the $x_i$ are more spread out, because there is then more leverage to estimate the slope. 

**Residual standard error ($RSE$)**: $\sigma^2$ is usually not known but can be estimated from the data 

$$RSE = \sqrt{\frac{RSS}{n-p-1}}$$

where $RSS = \sum_{i=1}^n(y_i - \hat{y_i})^2$ and $p=1$ for simple linear regression. The $RSE$ estimates the standard deviation of $\varepsilon$ and is roughly the average amount the response deviates from the true regression line. For example, if $RSE=3.26$ (with sales recorded in thousands of units) then the actual sales in each market deviate from the true regression line by about $3{,}260$ units on average. If the mean sales over all markets is approximately $14{,}000$ units, then the percentage error is $\frac{3260}{14000}\approx 23\%$. The $RSE$ is a measure of the lack of fit of the model to the data. When $\sigma^2$ is estimated from the data, we write $\widehat{SE}(\hat{\beta}_1)$ for the standard error. 
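
The $RSE$ and the two standard-error formulas can be evaluated directly. The sketch below does so on four illustrative data points (assumed for this example), plugging the $RSE$ in for the unknown $\sigma$.

```python
import numpy as np

# Illustrative data (an assumption made for this example)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])
n, p = len(x), 1  # one predictor in simple linear regression

sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual sum of squares and residual standard error
residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)
rse = np.sqrt(rss / (n - p - 1))

# Estimated standard errors, with RSE standing in for sigma
se_beta0 = rse * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
se_beta1 = rse / np.sqrt(sxx)

print(rss, rse, se_beta0, se_beta1)
```

Here $RSS = 1.2$, so $RSE = \sqrt{0.6}$, $\widehat{SE}(\hat{\beta}_1) = \sqrt{0.12}$ and $\widehat{SE}(\hat{\beta}_0) = \sqrt{0.9}$.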

**Confidence intervals**: a $95\%$ confidence interval is a range of values such that, with $95\%$ probability, the range contains the true unknown value of the parameter. For $\beta_1$ it is approximately

$$\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)$$

where each new sample from the population produces a different interval.

The $95\%$ property means that if we take repeated samples and construct a confidence interval from each sample, then about $95\%$ of the intervals will contain the true unknown value of the parameter. Written out, the interval is

$$[\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1), \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)].$$
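
The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a "true" line and noise level (all chosen only for illustration), draws many samples, and counts how often the interval $\hat{\beta}_1 \pm 2\cdot\widehat{SE}(\hat{\beta}_1)$ covers the true slope.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 1.0  # assumed "true" values for the simulation
n, n_sims = 50, 2000

covered = 0
for _ in range(n_sims):
    # Draw a fresh sample from the assumed population
    x = rng.uniform(0.0, 10.0, size=n)
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)

    # Least squares fit and estimated standard error of the slope
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    rse = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    se_b1 = rse / np.sqrt(sxx)

    # Does the approximate 95% interval contain the true slope?
    lower, upper = b1 - 2 * se_b1, b1 + 2 * se_b1
    covered += int(lower <= beta1 <= upper)

coverage = covered / n_sims
print(coverage)
```

The printed fraction should land near $0.95$, though the exact value depends on the seed and the simulation settings.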

**Hypothesis tests**: the *null hypothesis* is $H_0:$ there is no relationship between $X$ and $Y$ or $\beta_1=0$. The *alternative hypothesis* is $H_a:$ there is a relationship between $X$ and $Y$ or $\beta_1\neq0$.

**T-statistic**: is used to test the null hypothesis, i.e. whether $\hat{\beta_1}$ is sufficiently far from $0$ that we can be confident $\beta_1$ is non-zero. It measures the number of standard deviations $\hat{\beta}_1$ is from $0$: 

$$\hat{t}=\frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}.$$

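**P-value**: a small p-value indicates that an association this strong between $X$ and $Y$ would be unlikely to arise by chance if $H_0$ were true, so we reject the null hypothesis and infer that there is a relationship. The typical cutoff for rejecting the null hypothesis is $0.05$ or $5\%$. 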

**$R^2$ statistic**: because the $RSE$ is measured in the units of $Y$, it is not always clear what a good $RSE$ is, so we also use $R^2$ 

$$R^2=\frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

where $TSS=\sum(y_i - \bar{y})^2$ is called the *total sum of squares*; it measures the total variance in the response $Y$, i.e. the variability in the response before the regression is performed. 

$R^2$ measures the proportion of variability in $Y$ that can be explained using $X$, and so is a measure of the linear relationship between $X$ and $Y$. 

 - $R^2\approx 1$ means a large proportion of the variability in the response is explained by the regression. 
 - $R^2\approx 0$ means the regression doesn't explain much of the variability in the response; this can happen because the linear model is wrong, or the error variance $\sigma^2$ is large, or both. 
 - An $R^2=0.612$ tells us that using the predictor $X$ reduced the variability in the response $Y$ by about $61\%$, so $X$ explains a substantial share of the variability in $Y$. 

**Correlation** also measures the linear relationship between $Y$ and $X$. 

$$Cor(X, Y) = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}}$$

where $r = Cor(X, Y)$ and for simple linear regression $R^2 = r^2$. 
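
The identity $R^2 = r^2$ for simple linear regression can be checked numerically. This sketch uses four illustrative data points (assumed for the example) and computes both quantities from their definitions.

```python
import numpy as np

# Illustrative data (an assumption made for this example)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares

# Least squares fit and its residual sum of squares
beta1_hat = sxy / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)

r_squared = 1.0 - rss / tss
r = sxy / (np.sqrt(sxx) * np.sqrt(tss))  # sample correlation

print(r_squared, r ** 2)
```

Both printed values are about $0.96$ for these points.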

---
### $\underline{\text{Multiple Regression Summary:}}$

If we have $p$ distinct predictors, the model is 

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$$

where $\beta_j$ is the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed. 

We use the training data to estimate the regression coefficients

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$$

and we can choose $\hat{\beta}_0$, $\hat{\beta}_1$, $\cdots$, $\hat{\beta}_p$ to minimize the residual sum of squares $RSS$

$$\begin{align*}
RSS &= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&=\sum_{i=1}^n[y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip})]^2.
\end{align*}$$
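
Minimizing this $RSS$ is an ordinary least squares problem, which `numpy.linalg.lstsq` solves once a column of ones is added for the intercept. The sketch below uses synthetic data; the "true" coefficients, noise level, and sample size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
true_beta = np.array([1.0, 2.0, -1.0, 0.5])  # assumed [beta_0, ..., beta_3]

# Synthetic predictors and response
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])  # prepend the intercept column
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Least squares estimates minimize the residual sum of squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
rss = residuals @ residuals

print(beta_hat, rss)
```

At the minimum the residuals are orthogonal to every column of `X`, which is one way to confirm that the fit really is the least squares solution.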

In multiple regression, a predictor like newspaper ads may not be directly related to sales, yet in a simple regression higher values of newspaper ads are associated with higher sales. This happens when newspaper ads serve as a **surrogate** for another predictor, such as radio ads: newspaper ads get credit for the association between radio and sales. 

**Hypothesis test**: for multiple regression the null hypothesis is $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$. The alternative hypothesis is 
$H_a:$ at least one coefficient $\beta_j$ is non-zero.

**F-statistic**: for multiple regression, this hypothesis test uses 

$$F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}$$

where $TSS=\sum(y_i - \bar{y})^2$, $RSS=\sum(y_i - \hat{y}_i)^2$. 

- If $H_0$ is true, then the $F$-statistic is close to $1$.

- If $H_a$ is true, then $\mathbb{E}[(TSS-RSS)/p]>\sigma^2$, so we expect the $F$-statistic to be greater than $1$.

- When $n$ is large, an $F$-statistic only a little bigger than $1$ can still provide evidence against $H_0$; when $n$ is small, a larger $F$-statistic is needed to reject $H_0$. If $H_0$ is true and the errors $\varepsilon_i$ have a normal distribution, the $F$-statistic follows an $F$-distribution.

- The $F$-statistic works well when $p$ is small relative to $n$; if $p>n$, the model cannot be fit by least squares and the $F$-statistic cannot be applied. 
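
The formula translates directly into code. The helper `f_statistic` below is a hypothetical function written for this note, and the data are four illustrative points with a single predictor.

```python
import numpy as np

def f_statistic(predictors, y):
    """Overall F-statistic for H0: beta_1 = ... = beta_p = 0."""
    n, p = predictors.shape
    X = np.column_stack([np.ones(n), predictors])  # add intercept column
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Illustrative data (an assumption made for this example): n = 4, p = 1
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 3.0, 6.0, 9.0])

f_stat = f_statistic(x, y)
print(f_stat)
```

For these points $TSS = 30$ and $RSS = 1.2$, so $F = 28.8 / 0.6 = 48$. With a single predictor the $F$-statistic equals the square of the slope's $t$-statistic.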

**Partial effect of adding a variable**: 

Sometimes we test the null hypothesis that a subset of $q$ coefficients is zero: 

$$H_0: \beta_{p-q+1} = \beta_{p-q+2}= \cdots = \beta_p = 0$$

We fit a second model that uses all variables except those last $q$ and call its residual sum of squares $RSS_0$. The $F$-statistic is then

$$F = \frac{(RSS_0 - RSS)/q}{RSS/(n-p-1)}$$

With $q=1$ this is equivalent to the test that omits a single variable from the model while keeping the others, so it reports the partial effect of adding that variable.

**Variable selection**: determining which predictors are associated with the response, so that we can fit a single model using only those predictors. Examples of automated methods for choosing the best model are forward selection, backward selection, and mixed selection (covered in later chapters). 

**$R^2$ for multiple regression model**: $R^2 \approx 1$ means that the model explains a large portion of the variance in the response variable. $R^2$ always increases (or stays the same) as more variables are added, because adding a variable can only decrease the residual sum of squares on the training data. The residual standard error is

$$RSE = \sqrt{\frac{RSS}{n - p -1}}$$

so models with more variables can have a higher $RSE$ if the decrease in $RSS$ is small relative to the increase in $p$. 

**Interaction effect (synergy)**: arises when combining predictors results in a larger response than using any single predictor alone, so that the effect of one predictor depends on the level of another. 


**Prediction**: there are three sources of prediction uncertainty

- *Reducible error*: inaccuracy in the coefficient estimates produces this type of error. Use confidence intervals to see how close $\hat{y}=\hat{\beta_0} + \hat{\beta_1}x_1 + \cdots + \hat{\beta_p}x_p$ is to $f(X)=\beta_0 + \beta_1x_1 + \cdots + \beta_px_p$. 
- *Model bias*: an additional source of reducible error, arising when the linear model is only an approximation to the true relationship. 
- *Irreducible error*: even if we knew the true values of $\beta_0,\cdots,\beta_p$, the response could not be predicted perfectly because of the random error $\varepsilon$. Use prediction intervals to quantify how far $y$ is from $\hat{y}$.

**Confidence intervals vs. prediction intervals**: the $95\%$ confidence interval quantifies the uncertainty around the average sales over a large number of cities; across repeated samples, $95\%$ of such intervals will contain the true value of $f(X)$. The $95\%$ prediction interval quantifies the uncertainty in sales for a particular city. Both intervals are centered at the same value, but the prediction interval is wider because there is more uncertainty about sales for a given city than about average sales over many locations. 

---
### $\underline{\text{Qualitative Predictors:}}$ 

Suppose that we wish to investigate differences in credit card balance between those who own a house and those who don’t, ignoring the other variables for the moment. If a qualitative predictor (also known as a *factor*) only has two levels, or possible values, then incorporating it into a regression model is very simple. 

We simply create an **indicator** or **dummy variable** that takes on two possible numerical values. For example, based on the variable `Own`, we can define a new variable:

$$
x_i = 
\begin{cases}
1 & \text{if the $i$th person owns a house} \\\\
0 & \text{if the $i$th person does not own a house}
\end{cases}
\tag{1}
$$

We then use this variable as a predictor in the regression model:

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i =
\begin{cases}
\beta_0 + \beta_1 + \varepsilon_i & \text{if the $i$th person owns a house} \\\\
\beta_0 + \varepsilon_i & \text{if the $i$th person does not}
\end{cases}
\tag{2}
$$

Here, $\beta_0$ represents the **average credit card balance among non-owners**, $\beta_0 + \beta_1$ is the **average balance among owners**, and $\beta_1$ is the **average difference in balance between owners and non-owners**.

When a qualitative predictor has more than two levels, a single dummy
variable cannot represent all possible values. In this situation, we can create
additional dummy variables. For example, for the **region** variable we create
two dummy variables. The first could be

$$
x_{i1} = 
\begin{cases}
1 & \text{if the $i$th person is from the South} \\\\
0 & \text{if the $i$th person is not from the South,}
\end{cases}
\tag{3}
$$

and the second could be 

$$
x_{i2} = 
\begin{cases}
1 & \text{if the $i$th person is from the West} \\\\
0 & \text{if the $i$th person is not from the West,}
\end{cases}
\tag{4}
$$

Then both of these variables can be used in the regression equation to obtain the model:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i =
\begin{cases}
\beta_0 + \beta_1 + \varepsilon_i & \text{if the $i$th person is from the South} \\\\
\beta_0 + \beta_2 + \varepsilon_i & \text{if the $i$th person is from the West} \\\\
\beta_0 + \varepsilon_i & \text{if the $i$th person is from the East}
\end{cases}
\tag{5}
$$

Here:

- $\beta_0$ represents the **average credit card balance** for individuals from the **East**.
- $\beta_1$ is the **difference** in average balance between people from the **South** and those from the East.
- $\beta_2$ is the **difference** between those from the **West** and the East.

There will always be **one fewer dummy variable than the number of levels**. The level with no dummy variable—**East** in this example—is called the **baseline**.
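
The dummy-variable coding can be sketched in a few lines. The region labels follow the example above, but the balances are hypothetical numbers made up for illustration.

```python
import numpy as np

# Hypothetical observations (values invented for this example)
regions = np.array(["East", "South", "West", "East", "South", "West"])
balance = np.array([500.0, 520.0, 480.0, 540.0, 560.0, 500.0])

# Two dummy variables for three levels; East is the baseline
is_south = (regions == "South").astype(float)
is_west = (regions == "West").astype(float)
X = np.column_stack([np.ones(len(regions)), is_south, is_west])

beta_hat, *_ = np.linalg.lstsq(X, balance, rcond=None)

# Group means, for comparison with the fitted coefficients
east_mean = balance[regions == "East"].mean()
south_mean = balance[regions == "South"].mean()
west_mean = balance[regions == "West"].mean()

print(beta_hat)
```

With only dummy variables in the model, the intercept estimate equals the baseline group's sample mean and each remaining coefficient equals the difference between that group's mean and the baseline mean.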


---
### $\underline{\text{Extensions of the Linear Model}}$

The linear model makes the *additive assumption*, which means that the association between a predictor $X_j$ and the response $Y$ does not depend on the values of the other predictors. It also makes the *linearity assumption*, that the change in $Y$ associated with a one-unit change in $X_j$ is constant regardless of the value of $X_j$. 

**Interaction effect**: 

One way of extending this model is to include a **third predictor**, called an **interaction term**, which is constructed by multiplying $X_1$ and $X_2$. This gives the model:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon
\tag{3.31}
$$

How does the inclusion of this interaction term **relax the additive assumption**?

Note that equation (3.31) can be rewritten as:

$$
Y = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 + \varepsilon
\tag{3.32}
$$

Letting:

$$
\tilde{\beta}_1 = \beta_1 + \beta_3 X_2,
$$

we can see that $\tilde{\beta}_1$ is now a **function of $X_2$**. This means that the **effect of $X_1$ on $Y$** depends on the value of $X_2$ — the association between $X_1$ and $Y$ is no longer constant.

Similarly, the effect of $X_2$ on $Y$ also depends on the value of $X_1$. Therefore, the model allows for **interaction between the predictors**, relaxing the assumption that their effects are strictly additive.

We now return to the **Advertising** example. A linear model that uses `radio`, `TV`, and an **interaction** between the two to predict `sales` takes the form:

$$
\text{sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{radio} + \beta_3 \cdot (\text{radio} \times \text{TV}) + \varepsilon
$$

This can be rewritten as:

$$
\text{sales} = \beta_0 + (\beta_1 + \beta_3 \cdot \text{radio}) \cdot \text{TV} + \beta_2 \cdot \text{radio} + \varepsilon
\tag{3.33}
$$

From this form, we can interpret $\beta_3$ as the **change in the effectiveness of TV advertising** associated with a **one-unit increase in radio advertising** — or vice versa.

That is, the **marginal effect** of TV on sales depends on the level of radio advertising, and vice versa, due to the interaction term. This captures a **non-additive** relationship between the two predictors.
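
Fitting an interaction model only requires adding the product column $X_1 X_2$ to the design matrix. In the sketch below the coefficients are assumed values and the response is generated without noise, so least squares should recover them up to rounding; `marginal_effect_x1` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.uniform(0.0, 10.0, size=n)
x2 = rng.uniform(0.0, 10.0, size=n)

# Assumed coefficients [beta_0, beta_1, beta_2, beta_3]
true_beta = np.array([1.0, 2.0, 3.0, 0.5])

# Design matrix with the interaction column x1 * x2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
y = X @ true_beta  # noise-free response, for illustration only

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def marginal_effect_x1(x2_value):
    """Change in Y per unit of X1 at a given X2: beta_1 + beta_3 * X2."""
    return beta_hat[1] + beta_hat[3] * x2_value

print(marginal_effect_x1(0.0), marginal_effect_x1(4.0))
```

The marginal effect of $X_1$ grows from about $2$ at $X_2 = 0$ to about $4$ at $X_2 = 4$, which is exactly the non-additive behaviour the interaction term allows.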

**Nonlinear relationships**: 

Polynomial regression extends the linear model to non-linear relationships. Plotting the data shows a non-linear relationship between mpg and horsepower (response and predictor); the quadratic shape suggests the model 

$$mpg = \beta_0 + \beta_1 \times horsepower + \beta_2 \times horsepower^2 + \varepsilon.$$

This involves predicting $mpg$ using a nonlinear function of $horsepower$, but it is still a linear model, since it is linear in the coefficients with $X_1 = horsepower$ and $X_2=horsepower^2$. 

### $\underline{\text{Problems with Linear Regression:}}$

1. **Non-linear data**: linear regression assumes a straight-line relationship between predictors and the response variable
    - use *residual plots*, the residuals $y_i - \hat{y}_i$ plotted against the fitted values $\hat{y}_i$, to identify non-linearity; ideally the plot shows no discernible pattern. 
    - if the residual plot shows a non-linear pattern a simple solution is to use a non-linear transformation $\log x, \sqrt{x}, x^2$ in the model. 

2. **Correlation of error terms**: the standard errors for the estimates are based on assumption that noise $\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_n$ are uncorrelated. 
    - if the $\varepsilon_i$'s are correlated then the estimated standard error will underestimate the true standard error, confidence interval will be narrower, p-values smaller, and result in the erroneous conclusion that the parameter is statistically significant. 

3. **Non-constant variance of error terms**: the linear model assumes the error terms have constant variance $Var(\varepsilon_i)=\sigma^2$. 
    - non-constant variance (heteroscedasticity) shows up as a funnel shape in the residual plot. 
    - one solution is to transform $Y$ using a concave function like $\log Y$ or $\sqrt{Y}$, which shrinks the larger responses more. 
    - another case: the $i$th response may be an average of $n_i$ raw observations. If each raw observation is uncorrelated with variance $\sigma^2$, then the average has variance $\sigma_i^2=\sigma^2/n_i$, and we fit using weighted least squares with weights proportional to the inverse variances, $w_i = n_i$. 

4. **Outliers**: points for which $y_i$ is far from the value predicted by the model. 
    - can use the residual plot to identify outliers
    - can plot the studentized residuals, which divide each residual $e_i$ by its estimated standard error; observations whose studentized residuals exceed $3$ in absolute value are possible outliers
    - outliers can be due to an error in data collection, in which case they can be removed; they can also indicate a deficiency in the model, such as a missing predictor

5. **High leverage points**: observations with an unusual value for $x_i$, i.e. the predictor value is far from those of the other observations.
    - removing a high leverage point can substantially change the fit; identify these points using the leverage statistic 
    
    $$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^n(x_{i'} - \bar{x})^2}$$ 

    which holds for simple regression; $h_i$ increases with the distance of $x_i$ from $\bar{x}$, and the statistic is always between $\frac{1}{n}$ and $1$. 
    - the average leverage over all observations is $(p + 1)/n$, so if an observation has a leverage statistic that greatly exceeds $(p+1)/n$ then suspect a high leverage point. 
    - high leverage and high studentized residual is a bad combination! 

6. **Collinearity**: when predictors are highly correlated with each other (closely related), which is an issue for regression since it is difficult to separate their individual effects on the response. 
    - collinearity reduces the accuracy of the coefficient estimates, causing the standard error for $\hat{\beta}_j$ to grow. 
    - collinearity therefore reduces the $t$-statistic, so we may fail to reject $H_0:\beta_j=0$. 
    - can detect pairs of highly correlated variables with a *correlation matrix*. 
    - *multicollinearity* is collinearity among three or more variables, which can exist even when no pair is highly correlated, so we use the **variance inflation factor (VIF)**. 

    $$VIF(\hat{\beta_j}) = \frac{1}{1 - R_{X_j|X_{-j}}^2}$$

    - $R_{X_j|X_{-j}}^2$ is the $R^2$ from regressing $X_j$ onto all other predictors; VIF = 1 means no collinearity, and a VIF above 5 or 10 indicates a problematic amount of collinearity. 
    - one solution is to drop one of the problematic variables from the regression, since it is redundant due to collinearity. 
    - another solution is to combine the collinear variables into a single predictor. 
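
Two of the diagnostics above, the leverage statistic and the VIF, are straightforward to compute. The helpers `leverage_simple` and `vif` below are hypothetical functions written for this note, and all the arrays are small made-up examples.

```python
import numpy as np

def leverage_simple(x):
    """Leverage h_i for simple linear regression."""
    dev2 = (x - x.mean()) ** 2
    return 1.0 / len(x) + dev2 / dev2.sum()

def vif(predictors, j):
    """VIF for column j: 1 / (1 - R^2) from regressing X_j on the others."""
    n = predictors.shape[0]
    target = predictors[:, j]
    others = np.delete(predictors, j, axis=1)
    X = np.column_stack([np.ones(n), others])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - (1.0 - rss / tss))

# Leverage for four equally spaced x values (illustrative)
h = leverage_simple(np.array([1.0, 2.0, 3.0, 4.0]))

# Two uncorrelated predictors versus two nearly collinear ones (illustrative)
uncorrelated = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
collinear = np.column_stack([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 1.9, 3.2, 3.9, 5.1],
])

vif_uncorrelated = vif(uncorrelated, 0)
vif_collinear = vif(collinear, 0)
print(h, vif_uncorrelated, vif_collinear)
```

The leverages sum to $p + 1 = 2$, matching the average of $(p+1)/n$; the uncorrelated pair has a VIF of $1$, while the nearly collinear pair has a VIF far above the 5 to 10 rule of thumb.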

---
### $\underline{\text{Cubic vs. Linear Model:}}$


When fitting the cubic model to the training data, least squares minimizes the residual sum of squares over a larger parameter space (the parameter vector $\hat{\beta}$ is 4 by 1 instead of 2 by 1). Minimizing over a larger set of possible solutions can't give a higher minimum RSS: at worst it reproduces the linear fit by setting the extra coefficients to 0, and it can often reduce the training RSS even further. 

### Linear Model Fit

$$
\hat{\beta}_{\text{lin}}
= 
\begin{pmatrix}
\beta_0\\
\beta_1
\end{pmatrix}
$$

minimizes

$$
\mathrm{RSS}_{\text{lin}}
= \sum_{i=1}^n 
\Bigl[
    y_i \;-\; (\beta_0 + \beta_1\,x_i)
\Bigr]^2.
$$

In matrix form, let

$$
X_{\text{lin}}
=
\begin{pmatrix}
1 & x_1\\
1 & x_2\\
\vdots & \vdots\\
1 & x_n
\end{pmatrix},
\quad
y
=
\begin{pmatrix}
y_1\\
y_2\\
\vdots\\
y_n
\end{pmatrix}.
$$

Then

$$
\mathrm{RSS}_{\text{lin}}
= 
\min_{\beta_0,\;\beta_1}
\;\bigl\|\,y \;-\; X_{\text{lin}}\,\beta\bigr\|^2.
$$


### Cubic Model Fit

$$
\hat{\beta}_{\text{cub}}
=
\begin{pmatrix}
\beta_0\\
\beta_1\\
\beta_2\\
\beta_3
\end{pmatrix}
$$

minimizes

$$
\mathrm{RSS}_{\text{cub}}
= \sum_{i=1}^n 
\Bigl[
    y_i 
    \;-\; 
    (\beta_0 + \beta_1\,x_i + \beta_2\,x_i^2 + \beta_3\,x_i^3)
\Bigr]^2.
$$

In matrix form, let

$$
X_{\text{cub}}
=
\begin{pmatrix}
1 & x_1 & x_1^2 & x_1^3\\
1 & x_2 & x_2^2 & x_2^3\\
\vdots & \vdots & \vdots & \vdots\\
1 & x_n & x_n^2 & x_n^3
\end{pmatrix}.
$$

Then

$$
\mathrm{RSS}_{\text{cub}}
= 
\min_{\beta_0,\;\beta_1,\;\beta_2,\;\beta_3}
\;\bigl\|\,y \;-\; X_{\text{cub}}\,\beta\bigr\|^2.
$$

### Minimizing Over a Larger Set

In least squares, 
$$
\hat{\beta}_{\text{cub}}
\text{ is chosen to minimize }
\|\,y - X_{\text{cub}}\,\beta\|^2
\text{ over }
\mathbb{R}^4.
$$

Because $\mathbb{R}^2$ (linear parameters) is embedded in $\mathbb{R}^4$ (cubic parameters) via
$$
(\beta_0,\,\beta_1)
\;\mapsto\;
(\beta_0,\,\beta_1,\,0,\,0),
$$
the minimum over the bigger space can only be smaller or equal to the minimum over the smaller space. Symbolically:

$$
\min_{(\beta_0,\beta_1,\beta_2,\beta_3)\in \mathbb{R}^4}
\;\bigl\|\,y - X_{\text{cub}}\,\beta\bigr\|^2
\;\;\le\;\;
\min_{(\beta_0,\beta_1)\in \mathbb{R}^2}
\;\bigl\|\,y - X_{\text{cub}}(\beta_0,\beta_1,0,0)\bigr\|^2.
$$

But
$$
X_{\text{cub}}(\beta_0,\,\beta_1,\,0,\,0)
= 
X_{\text{lin}}(\beta_0,\,\beta_1).
$$

For any $(\beta_0,\beta_1)$, the vector of fitted values in the cubic model collapses to the linear model’s fitted values if $\beta_2=0,\;\beta_3=0$. Hence:
$$
\bigl\|\,y - X_{\text{cub}}(\beta_0,\beta_1,0,0)\bigr\|^2
=
\bigl\|\,y - X_{\text{lin}}(\beta_0,\beta_1)\bigr\|^2.
$$

Therefore,
$$
\min_{\beta_0,\beta_1,\beta_2,\beta_3}
\bigl\|\,y - X_{\text{cub}}\,\beta\bigr\|^2
\;\;\le\;\;
\min_{\beta_0,\beta_1}
\bigl\|\,y - X_{\text{lin}}(\beta_0,\beta_1)\bigr\|^2.
$$

By definition, these minima are precisely the **training RSS** for each model:

$$
\mathrm{RSS}_{\text{cub}}
\;\le\;
\mathrm{RSS}_{\text{lin}}.
$$


Let’s use 4 observations:

$$
x = [\,1,\;2,\;3,\;4\,], 
\quad
y = [\,2,\;3,\;6,\;9\,].
$$

So we have $n=4$ data points:
$$
(x_1,\,y_1)=(1,\,2), 
\quad
(x_2,\,y_2)=(2,\,3), 
\quad
(x_3,\,y_3)=(3,\,6), 
\quad
(x_4,\,y_4)=(4,\,9).
$$

### Design Matrices

The linear model is

$$
Y = \beta_0 \;+\; \beta_1\,X \;+\; \varepsilon.
$$

In matrix form, we write $X_{\text{lin}}$ as

$$
X_{\text{lin}}
=
\begin{pmatrix}
1 & 1\\
1 & 2\\
1 & 3\\
1 & 4
\end{pmatrix}_{4\times 2}.
$$

- The first column is all 1’s (for $\beta_0$).
- The second column is $x_i$.


The cubic model is

$$
Y = \beta_0 \;+\; \beta_1\,X \;+\; \beta_2\,X^2 \;+\; \beta_3\,X^3 \;+\; \varepsilon.
$$

Hence $X_{\text{cub}}$ is

$$
X_{\text{cub}}
=
\begin{pmatrix}
1 & 1 & 1^2 & 1^3\\
1 & 2 & 2^2 & 2^3\\
1 & 3 & 3^2 & 3^3\\
1 & 4 & 4^2 & 4^3
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1\\
1 & 2 & 4 & 8\\
1 & 3 & 9 & 27\\
1 & 4 & 16 & 64
\end{pmatrix}_{4\times 4}.
$$
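
With the two design matrices written out, the training RSS of each model can be computed numerically. The sketch below uses `numpy.linalg.lstsq`; the helper name `training_rss` is an assumption of this example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])

# Design matrices for the linear and cubic models
X_lin = np.column_stack([np.ones_like(x), x])
X_cub = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

def training_rss(X, y):
    """Minimum of ||y - X beta||^2 over beta (least squares)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta_hat) ** 2))

rss_lin = training_rss(X_lin, y)
rss_cub = training_rss(X_cub, y)
print(rss_lin, rss_cub)
```

Here $\mathrm{RSS}_{\text{lin}} = 1.2$, while $\mathrm{RSS}_{\text{cub}}$ is zero up to rounding: with four parameters and four distinct $x$ values the cubic passes through every point.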

---
### $\underline{\text{Overfitting:}}$

When you switch to test $RSS$ there is no guarantee that the cubic model will have an equal or lower $RSS$ than the linear model. In fact, the test $RSS$ can get worse if you add parameters due to **overfitting**.


When the cubic model uses extra coefficients $\beta_2$ and $\beta_3$ it can chase some random noise in the training set that doesn't generalize. This might reduce the training $RSS$ but might not improve predictions on new data. Sometimes the cubic model might capture the true relationship (especially if it's nonlinear!), giving it a lower test $RSS$ than the linear model. But these extra parameters can also cause overfitting, making the test $RSS$ go up for the cubic model. Therefore, unlike training $RSS$, the test $RSS$ can increase, decrease, or be approximately equal when adding parameters. 
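
The contrast between training and test $RSS$ can be explored with a small simulation. The sketch below assumes the true relationship is linear with Gaussian noise (all settings are illustrative) and fits degree-1 and degree-3 polynomials with `numpy.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    """Draw n points from an assumed truly linear relationship plus noise."""
    x = rng.uniform(-2.0, 2.0, size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)
    return x, y

def rss(coefs, x, y):
    """Residual sum of squares of a fitted polynomial on (x, y)."""
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

x_train, y_train = simulate(20)
x_test, y_test = simulate(1000)

lin = np.polyfit(x_train, y_train, deg=1)
cub = np.polyfit(x_train, y_train, deg=3)

train_rss_lin, train_rss_cub = rss(lin, x_train, y_train), rss(cub, x_train, y_train)
test_rss_lin, test_rss_cub = rss(lin, x_test, y_test), rss(cub, x_test, y_test)

print("train:", train_rss_lin, train_rss_cub)
print("test: ", test_rss_lin, test_rss_cub)
```

The training $RSS$ of the cubic fit is never larger than that of the linear fit, whereas the ordering of the two test values depends on the particular sample drawn.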

---
### $\underline{\text{Bias-Variance Trade-Off:}}$ 
The expected test MSE for $x_0$ is 

$$Test\:MSE(\hat{f}) =\mathbb{E}(y_0 - \hat{f}(x_0))^2=\underbrace{Bias^2[\hat{f}(x_0)]}_\text{systematic deviation} + \underbrace{Var[\hat{f}(x_0)]}_\text{variance} + \text{irreducible error}$$

where $\text{irreducible error}=Var(\varepsilon).$ Adding parameters like $\beta_2, \beta_3$ usually reduces bias but increases variance. Therefore, if the increase in variance exceeds the reduction in squared bias, the test error goes up. On the training set, the extra variance cannot make the fit worse, since more parameters allow you to fit at least as well, so the training RSS never rises. 



---
### $\underline{\text{Lecture Questions:}}$

### Question 1. 
Suppose we run a linear regression and obtain a slope estimate of 0.5 with an estimated standard error of 0.2. What is the largest hypothesized value of $\beta$ for which we would NOT reject the null hypothesis? (Assume a normal approximation to the sampling distribution and a $5\%$ significance level for a two-sided test; you need the critical value of the standard normal, which assumes a large sample.)

### Answer 1. 

This is a hypothesis test where we are interested in whether the parameter is significantly higher or lower than a hypothesized value. We split the $5\%$ significance level evenly across both tails of the distribution: $2.5\%$ in the lower tail and $2.5\%$ in the upper tail. Using a standard normal approximation (Z-test) and assuming a large sample (so the Central Limit Theorem applies), we want the Z-values that cut off $2.5\%$ in each tail, i.e. the quantiles at 0.025 and 0.975: 

$z_{0.025}=-1.96$ (lower critical value)

$z_{0.975}=+1.96$ (upper critical value)

and so the critical region is 

$$|z| > 1.96.$$

We need the *z-score*, which tells us how many standard deviations a data point (or estimate) is from the mean (the hypothesized value). Writing $\beta_0$ here for the hypothesized value of the slope (not the intercept), the z-score is 

$$z = \frac{\hat{\beta} - \beta_0}{SE}$$ 

which is the same as the t-stat, except the z-score assumes the sample size is large and you use the standard normal distribution. Also, the z-score is used if the population standard deviation is known (rare in practice). The t-stat assumes a small sample size, estimating the standard deviation from data, errors are assumed to follow a normal distribution, and we use a student's t-distribution, which has heavier tails to account for additional uncertainty. Back to our question

$$z = \frac{\hat{\beta} - \beta_0}{SE}\in[-1.96, 1.96]$$

and solve 

$$-1.96 \leq \frac{0.5 - \beta}{0.2} \leq 1.96$$

and solving for $\beta$

$$0.108 \leq \beta \leq 0.892$$

and so the largest value of $\beta$ for which we would **not** reject the null hypothesis is 0.892. 
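
The arithmetic can be reproduced in a few lines. The helper `rejects` is a hypothetical function for this note that applies the two-sided rule $|z| > 1.96$.

```python
beta_hat = 0.5   # slope estimate from the question
se = 0.2         # its estimated standard error
z_crit = 1.96    # two-sided 5% critical value of the standard normal

# Hypothesized values that would NOT be rejected
lower = beta_hat - z_crit * se
upper = beta_hat + z_crit * se

def rejects(beta_null):
    """True if H0: beta = beta_null is rejected at the 5% level."""
    z = (beta_hat - beta_null) / se
    return abs(z) > z_crit

print(lower, upper)
print(rejects(0.85), rejects(0.90))
```

A hypothesized value of $0.85$ lies inside the non-rejection range, while $0.90$ lies just outside it.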




### Question 2. 
T or F? The estimate $\hat{\beta_1}$ from a linear regression that controls for many variables (many predictors in addition to $x_1$) is a more reliable measure of the causal relationship than $\hat{\beta_1}$ from a univariate regression. 

### Answer 2. 

Consider two regression models for estimating the effect of $x_1$ on $Y$:

1. **Simple regression**:

   $$
   Y = \beta_0 + \beta_1\,x_1 + \varepsilon,\qquad \hat\beta_1^{(S)}
   $$

   This slope $\hat\beta_1^{(S)}$ ignores other predictors and thus can suffer from **omitted-variable bias** whenever a true predictor—say $x_2$—is correlated both with $x_1$ and with $Y$.  In that case, $\hat\beta_1^{(S)}$ “soaks up” part of $x_2$’s effect, so its expectation no longer equals the true $\beta_1$. Therefore, $\mathbb{E}[\hat{\beta}_1] \neq \beta_1$.

2. **Multiple regression**:

   $$
   Y = \beta_0 + \beta_1\,x_1 + \beta_2\,x_2 + \cdots + \beta_p\,x_p + \varepsilon,\qquad \hat\beta_1^{(M)}
   $$

   Here $\hat\beta_1^{(M)}$ is the **partial slope** of $x_1$ “holding all other $x_j$ fixed.”  If each added predictor truly belongs—i.e., is associated with both $x_1$ and $Y$—then including it removes omitted-variable bias, so

   $$
   E\bigl[\hat\beta_1^{(M)}\bigr] \;=\; \beta_1.
   $$

However, each extra predictor also increases the **variance** of $\hat\beta_1^{(M)}$.  In Chapter 3 terms, adding a variable that is highly correlated with $x_1$ inflates $\text{Var}(\hat\beta_1^{(M)})$, which in turn raises its **standard error**.  A larger standard error makes hypothesis tests and confidence intervals for $\beta_1$ less precise.

**Bias–Variance Trade‐Off**

* If those additional predictors genuinely explain part of $Y$ that would otherwise be absorbed into $x_1$’s slope, then $\hat\beta_1^{(M)}$ corrects **omitted‐variable bias** and is more reliable despite the larger SE.   

* If the extra variables are only weakly related to $Y$ or nearly collinear with $x_1$, the **variance inflation** can outweigh any bias reduction, making $\hat\beta_1^{(M)}$ less precise than $\hat\beta_1^{(S)}$.

**Conclusion**

Whether $\hat\beta_1^{(M)}$ is “more reliable” than $\hat\beta_1^{(S)}$ depends on the bias–variance trade‐off.  Include only those predictors that meaningfully reduce omitted‐variable bias; otherwise, a simpler model may yield a more precise (though biased) estimate.


### Question 3. 
For our model of sales vs. TV interacted with radio, what is the effect of an additional \$1 of radio ads if TV = \$50? 


### Answer 3. 

The model with the interaction term is

$$sales = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times (radio \times TV) + \varepsilon.$$

To find the effect that radio has on sales we take the partial derivative with respect to radio: 

$$
\begin{align*}
\frac{\partial sales}{\partial radio} &= \beta_2 + \beta_3 \times TV\\
&= 0.0289 + 0.0011 \times 50 \\
&= 0.0839
\end{align*}
$$

The $R^2$ for the model including the interaction term is $96.8\%$ compared to $89.7\%$ without it. Therefore, $\frac{(96.8 - 89.7)}{(100 - 89.7)} \approx 69\%$ of the variability in sales that remains after fitting the additive model has been explained by the interaction term. 
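
Both numbers in this answer are simple arithmetic on the quoted coefficient estimates and $R^2$ values, as the following sketch shows; the variable names are chosen for this example only.

```python
# Coefficient estimates quoted in the answer above
beta_radio = 0.0289
beta_interaction = 0.0011
tv = 50

# Marginal effect of radio at the given TV level: beta_2 + beta_3 * TV
effect_of_radio = beta_radio + beta_interaction * tv

# Share of the remaining variability explained by the interaction term
r2_with_interaction = 0.968
r2_additive = 0.897
share_explained = (r2_with_interaction - r2_additive) / (1.0 - r2_additive)

print(round(effect_of_radio, 4), round(share_explained, 3))
```

The marginal effect is $0.0839$ and the share explained is about $0.689$, i.e. roughly $69\%$.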

---
### $\underline{\text{Facts, Derivations, etc:}}$

### Expected Value (mean or average value):

For discrete random variables 

$$\mathbb{E}[x]  = \sum_{x\in\mathcal{X}} xp(x)= \mu$$


### Sum Rule for Marginals: 

For discrete random variables 

$$p(x) = \sum_{y}p(x, y)$$

where $p(x)$ is the marginal distribution of random variable $x$ and $p(x, y)$ is the joint distribution.

### Variance: 

$$
\begin{align*}
Var(x) &= \mathbb{V}[x]\\
&=\mathbb{E}[(x - \mu)^2]\\
&= \mathbb{E}[(x^2 - 2x\mu + \mu^2)] \\
&= \mathbb{E}[x^2] - 2\mu\mathbb{E}[x] + \mathbb{E}[\mu^2] \\
&= \mathbb{E}[x^2] - 2\mu^2 + \mu^2 \\
&= \mathbb{E}[x^2] - \mu^2 \\
&= \sigma^2
\end{align*}
$$

note: sometimes $\mu^2$ is written as $\mathbb{E}[x]^2.$ 

### Linearity of Expectation: 

Expectations split across sums, and constants pull out: 

$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y].$$

$$\mathbb{E}[cX] = c\mathbb{E}[X].$$

### Linear Transformation of a Random Variable: 

$x$ is a random variable and $y$ is a linear transformation of $x$ where 

$$y = ax + b.$$

The expectation is
$$
\begin{align*}
\mathbb{E}[y] &= \mathbb{E}[ax + b]\\
&= \mathbb{E}[ax] + \mathbb{E}[b] \\
&= a\mathbb{E}[x] + b \\
&= a\mu + b. \\
\end{align*}
$$

And the variance is 

$$
\begin{align*}
\mathbb{V}[y] &= \mathbb{E}[y^2] - \mathbb{E}[y]^2\\
&= \mathbb{E}[(ax + b)^2] - \mathbb{E}[ax + b]\cdot\mathbb{E}[ax + b] \\
&=  \mathbb{E}[(ax)^2 + 2abx + b^2] - (a\mu + b)^2 \\
&= a^2\mathbb{E}[x^2] + 2ab\mu + b^2 -[(a\mu)^2 + 2ab\mu + b^2] \\
&= a^2\mathbb{E}[x^2] - a^2\mu^2 \\
&= a^2(\mathbb{E}[x^2] - \mu^2) \\
&= a^2\mathbb{V}[x]
\end{align*}
$$
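The two results above can be checked exactly on a small discrete distribution. This is a minimal sketch: the pmf and the constants `a` and `b` are made-up illustrative values, and `expectation` and `variance` are hypothetical helper names.

```python
from fractions import Fraction as F

# Hypothetical pmf of a discrete random variable x (illustrative values).
pmf = {0: F(1, 4), 1: F(1, 2), 3: F(1, 4)}

def expectation(pmf, g=lambda v: v):
    """E[g(x)] = sum over the support of g(x) * p(x)."""
    return sum(g(v) * p for v, p in pmf.items())

def variance(pmf, g=lambda v: v):
    """V[g(x)] = E[g(x)^2] - E[g(x)]^2."""
    return expectation(pmf, lambda v: g(v) ** 2) - expectation(pmf, g) ** 2

a, b = 3, -2
transform = lambda v: a * v + b   # y = a x + b

mu = expectation(pmf)
var_x = variance(pmf)
mean_y = expectation(pmf, transform)
var_y = variance(pmf, transform)

print(mean_y == a * mu + b)       # E[y] = a mu + b
print(var_y == a ** 2 * var_x)    # V[y] = a^2 V[x]; the shift b drops out
```

Using `Fraction` keeps the arithmetic exact, so the identities hold with `==` rather than up to rounding.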

### Covariance between $x \in \mathbb{R}$ and $y \in \mathbb{R}$:

$$
\begin{align*}
Cov[x, y] &= \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])]\\
&= \mathbb{E}[xy - y\mathbb{E}[x] -x\mathbb{E}[y] + \mathbb{E}[x]\mathbb{E}[y]] \\
&= \mathbb{E}[xy - y\mu_x - x\mu_y + \mu_x\mu_y] \\
&= \mathbb{E}[xy] - 2\mu_x\mu_y + \mu_x\mu_y \\
&= \mathbb{E}[xy] - \mu_x\mu_y \\
\end{align*}
$$

### Expected value of a random vector, $\bf{x}\in\mathbb{R}^n$:

We can calculate this expected value elementwise representing the averages of each component

$$\mathbb{E}[\bf{x}] = \begin{bmatrix}
\mathbb{E}[x_1] \\
\vdots \\ 
\mathbb{E}[x_n] \\
\end{bmatrix} = \mathbf{\mu} \in \mathbb{R}^n.
$$

### Covariance of two random vectors, $\bf{x}\in\mathbb{R}^n$ and $\bf{y}\in\mathbb{R}^m$:
$$
\begin{align*}
Cov[\bf{x}, \bf{y}] &= \mathbb{E}[(\bf{x} - \mathbb{E}[\bf{x}])(\bf{y} - \mathbb{E}[\bf{y}])^T]\\
&= \mathbb{E}[\bf{x}\bf{y}^T - \mathbb{E}[\bf{x}]\bf{y}^T - \bf{x}\mathbb{E}[\bf{y}]^T + \mathbb{E}[\bf{x}]\mathbb{E}[\bf{y}]^T] \\
&= \mathbb{E}[\bf{x}\bf{y}^T] - \mu_x\mu_y^T - \mu_x\mu_y^T + \mu_x\mu_y^T\\
&= \mathbb{E}[\bf{x}\bf{y}^T] - \mu_x\mu_y^T \\
\end{align*}
$$

### Uncorrelated random vectors 
Two random variables being **uncorrelated** (i.e., having zero covariance) does **not** imply they are **independent**.

Consider the example:

$$
X \sim \text{Uniform}(-1, 1), \qquad Y = X^2.
$$

Then:

$$
\operatorname{Cov}(X, Y) = \operatorname{Cov}(X, X^2) = 0,
$$

but $X$ and $Y$ are clearly **not independent**. Knowing $Y$ restricts the possible values of $X$. So uncorrelatedness is a weaker condition than independence.
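To see why the covariance vanishes in this example, a short worked calculation helps. It uses only the density $f(x) = \tfrac{1}{2}$ on $(-1, 1)$ and the identity $Cov[x, y] = \mathbb{E}[xy] - \mu_x\mu_y$ from above:

$$
\begin{align*}
\mathbb{E}[X] &= \int_{-1}^{1} \tfrac{1}{2}x\,dx = 0, \\
\mathbb{E}[XY] &= \mathbb{E}[X^3] = \int_{-1}^{1} \tfrac{1}{2}x^3\,dx = 0, \\
\operatorname{Cov}(X, Y) &= \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] = 0 - 0\cdot\tfrac{1}{3} = 0.
\end{align*}
$$

Both integrals vanish because the integrands are odd functions on a symmetric interval. The dependence is still complete: for instance, observing $Y = \tfrac{1}{4}$ forces $X \in \{-\tfrac{1}{2}, \tfrac{1}{2}\}$.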

### Two Random Vectors, $\bf{x}$ and $\bf{y}$:

$$\mathbb{E}[\bf{x} + \bf{y}] = \mathbb{E}[\bf{x}] + \mathbb{E}[\bf{y}]$$

$$\mathbb{E}[\bf{x} - \bf{y}] = \mathbb{E}[\bf{x}] - \mathbb{E}[\bf{y}]$$

$$\mathbb{V}[\bf{x} + \bf{y}] = \mathbb{V}[\bf{x}] + \mathbb{V}[\bf{y}] + Cov[\bf{x}, \bf{y}] + Cov[\bf{y}, \bf{x}]$$

$$\mathbb{V}[\bf{x} - \bf{y}] = \mathbb{V}[\bf{x}] + \mathbb{V}[\bf{y}] - Cov[\bf{x}, \bf{y}] - Cov[\bf{y}, \bf{x}]$$

### Affine transformation of a random vector: 

The affine transformation of a random vector $\bf{x}\in\mathbb{R}^n$ is $\bf{y} = A\bf{x} + \bf{b}$ and has the expectation

$$\begin{align*}
\mathbb{E}[\bf{y}] &= \mathbb{E}[A\bf{x} + \bf{b}] \\ 
&= A\mathbb{E}[\bf{x}] +  \mathbb{E}[\bf{b}]\\
&= A\bf{\mu} + \bf{b}.
\end{align*}
$$

The covariance is 

$$
\begin{align*}
Cov[\bf{y}] &= \mathbb{E}[(\bf{y} - \mathbb{E}[\bf{y}])(\bf{y} - \mathbb{E}[\bf{y}])^T]\\
&= \mathbb{E}[(A\bf{x} - A\mu)(A\bf{x} - A\mu)^T]
\end{align*}
$$

and using **linearity of expectation**

$$
\begin{align*}
&= \mathbb{E}[A(\bf{x} - \mu)(\bf{x} - \mu)^T A^T] \\
&= A\mathbb{E}[(\bf{x} - \mu)(\bf{x} - \mu)^T]A^T \\
&= A\Sigma_x A^T
\end{align*}
$$
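The same identities hold for sample means and sample covariances, which gives a quick numerical check. The sketch below assumes NumPy is available; the matrix `A`, the vector `b`, and the simulated data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 draws of a random vector x in R^3 (rows are samples),
# mixed by an arbitrary matrix so the components are correlated.
x = rng.normal(size=(1000, 3)) @ np.array([[1.0, 0.5, 0.0],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 2.0]])
A = np.array([[2.0, -1.0, 0.0],
              [0.5, 0.0, 1.0]])   # maps R^3 -> R^2
b = np.array([1.0, -3.0])

y = x @ A.T + b                   # y_i = A x_i + b for every sample

mean_x = x.mean(axis=0)
cov_x = np.cov(x, rowvar=False)   # 3 x 3 sample covariance
mean_y = y.mean(axis=0)
cov_y = np.cov(y, rowvar=False)   # 2 x 2 sample covariance

print(np.allclose(mean_y, A @ mean_x + b))   # E[y] = A mu + b
print(np.allclose(cov_y, A @ cov_x @ A.T))   # Cov[y] = A Sigma_x A^T
```

The shift `b` changes the mean but not the covariance, mirroring the scalar result $\mathbb{V}[ax + b] = a^2\mathbb{V}[x]$.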


### Expected Value (continuous):

$$\mathbb{E}[x]  = \int_{-\infty}^{\infty} xf(x)\,dx= \mu.$$

### Marginal distribution (continuous): 

$$f(x) = \int_{\mathbb{R}}f(x, y)dy$$

where $f(x, y)$ is the joint distribution and $f(x)$ is the marginal distribution.

### TODO: more on cont. dist.  

### Ordinary Least Squares Derivation:

Show for 

$$y_i = \beta_0 + \beta_1 x_i$$

the $\hat{\beta}_0$, $\hat{\beta}_1$ that minimize $RSS$ using least squares are 

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2}$$  

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^ny_i$, $\bar{x} = \frac{1}{n}\sum_{i=1}^nx_i$.

The residual, or error, for the $i$th observation is the difference between the observed value and the fitted value: 

$$e_i = y_i - \hat{y}_i$$

and for $n$ observations the $RSS$ or residual sum of squares is 

$$\begin{align*}
RSS &= \sum_{i=1}^n e_i^2 \\
    &= e_1^2 + \dots + e_n^2 \\
    &= (y_1 - \hat{y}_1)^2 + \dots + (y_n - \hat{y}_n)^2
\end{align*}
$$

We now must minimize the $RSS$ with respect to $\hat{\beta}_0$, $\hat{\beta}_1$:

$$\begin{align*}
RSS(\hat{\beta}_0,\hat{\beta}_1) &=(y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2
\end{align*}
$$

such that 

$$\begin{align*}
\frac{\partial RSS(\hat{\beta}_0,\hat{\beta}_1)}{\partial \hat{\beta}_0} &= 2(y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)\cdot(-1) + \dots + 2(y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)\cdot(-1) \\
&=2\sum_{i=1}^n (\hat{\beta}_0 +\hat{\beta}_1 x_i - y_i) = 0  
\end{align*}
$$

$$\begin{align*}
\frac{\partial RSS(\hat{\beta}_0,\hat{\beta}_1)}{\partial \hat{\beta}_1} &= 2(y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)\cdot(-x_1) + \dots + 2(y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)\cdot(-x_n) \\
&=2\sum_{i=1}^n x_i (\hat{\beta}_0 +\hat{\beta}_1 x_i - y_i) = 0  
\end{align*}
$$ 

Dividing each derivative by 2 leaves the two normal equations: 

$$\sum_{i=1}^n (\hat{\beta}_0 +\hat{\beta}_1 x_i - y_i)=0$$

$$\sum_{i=1}^nx_i (\hat{\beta}_0 +\hat{\beta}_1 x_i - y_i) = 0.$$

For the first equation, we use the facts that $\sum_{i=1}^n\hat{\beta}_0 = n\hat{\beta}_0$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^nx_i$ and solve for $\hat{\beta}_0$: 

$$\begin{align*}
\hat{\beta}_0 &= \frac{1}{n}\sum_{i=1}^ny_i - \frac{\hat{\beta}_1}{n}\sum_{i=1}^nx_i \\
             &= \bar{y} - \hat{\beta}_1\bar{x}
\end{align*}
$$

For the second equation we can identify the following two factorizations: 

$$\begin{align*}
\sum_{i=1}^n(x_i - \bar{x})(x_i - \bar{x}) &= \sum x_i^2 - \bar{x}\sum x_i - \bar{x}\sum x_i + \sum \bar{x}^2 \\
&= \sum x_i^2 - n\bar{x}^2 - n\bar{x}^2 + n\bar{x}^2 \\ 
&= \sum x_i^2 - n\bar{x}^2
\end{align*}
$$

$$\begin{align*}
\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y}) &= \sum x_iy_i - \bar{x}\sum y_i - \bar{y}\sum x_i + \sum \bar{x}\bar{y} \\
&= \sum x_iy_i  - n\bar{x}\bar{y} - n\bar{y}\bar{x} + n\bar{x}\bar{y}\\ 
&= \sum x_iy_i - n\bar{x}\bar{y}
\end{align*}
$$

Then from the second equation we have: 

$$\begin{align*}
\sum_{i=1}^n x_i y_i &= \hat{\beta_0}\sum_{i=1}^nx_i + \hat{\beta_1}\sum_{i=1}^n x_i^2 \\
 &= (\bar{y} - \hat{\beta}_1 \bar{x})\sum_{i=1}^nx_i + \hat{\beta_1}\sum_{i=1}^n x_i^2 \\
 &=(\bar{y} - \hat{\beta}_1 \bar{x})(n\bar{x}) + \hat{\beta_1}\sum_{i=1}^n x_i^2 
\end{align*}
$$

Solving for $\hat{\beta_1}$:

$$\begin{align*}
\hat{\beta_1}(\sum_{i=1}^n x_i^2 - n\bar{x}^2) = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} 
\end{align*}
$$

and using the previous two factorizations we have: 

$$\hat{\beta_1}\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n (x_i - \bar{x})(y_i -\bar{y})$$ 

and solving for $\hat{\beta_1}$:

$$\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i -\bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$
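As a sanity check, the closed-form estimates can be computed on a tiny data set and plugged back into the two normal equations. This is a minimal sketch; the data values are invented for illustration and are not from the Advertising data.

```python
# Toy data (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta1_hat = s_xy / s_xx                 # slope
beta0_hat = y_bar - beta1_hat * x_bar   # intercept

# Both normal equations should be zero up to floating-point error.
eq0 = sum(beta0_hat + beta1_hat * xi - yi for xi, yi in zip(x, y))
eq1 = sum(xi * (beta0_hat + beta1_hat * xi - yi) for xi, yi in zip(x, y))

print(beta0_hat, beta1_hat)
print(eq0, eq1)
```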

### Ordinary Least Squares Derivation (matrix):

Let's derive OLS with the matrix gradient of the squared norm (the same quantity as $RSS$). First, recall the following facts for linear and quadratic functions of $\beta \in \mathbb{R}^p$: 

For any $\mathbf{c} \in \mathbb{R}^p$, we have $\nabla_{\beta}(\mathbf{c}^T\mathbf{\beta})=\mathbf{c}$.

For any $A \in \mathbb{R}^{p\times p}$, we have $\nabla_{\beta}(\mathbf{\beta}^TA\mathbf{\beta})=(A + A^T)\beta$.

We are trying to minimize the squared norm with respect to $\beta$: 
$$\begin{align*}
0 &= \nabla||\bold{y} - X\bold{\beta}||^2 \\
&=\nabla (\bold{y} - X\beta)^T(\bold{y} - X\beta)\\
&= \nabla (\bold{y}^T - \beta^TX^T)(\bold{y} - X\beta)\\
&= \nabla(\bold{y}^T\bold{y} - \beta^TX^T\bold{y} - \bold{y}^TX\beta +\beta^TX^TX\beta)
\end{align*}
$$

Since $\beta \in \mathbb{R}^p$, $\mathbf{y} \in \mathbb{R}^n$, and $X \in \mathbb{R}^{n\times p}$, the product $\beta^TX^T\bold{y}$ is a scalar. A scalar equals its own transpose, so $\beta^TX^T\bold{y}=\bold{y}^TX\beta$. Continuing on we get 

$$\begin{align*}
0 &= \nabla(\bold{y}^T\bold{y} - \beta^TX^T\bold{y} - \bold{y}^TX\beta +\beta^TX^TX\beta)\\
&= \nabla(\bold{y}^T\bold{y} - 2\beta^TX^T\bold{y} + \beta^TX^TX\beta) \\
&= - 2X^T\bold{y} + (X^TX + X^TX)\beta \\
&= - 2X^T\bold{y}  + 2X^TX\beta
\end{align*}
$$

Solving for $\beta$ (assuming $X^TX$ is invertible) we arrive at the least squares estimate 

$$\hat{\beta} = (X^TX)^{-1}X^T\bold{y}.$$
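A minimal numerical sketch of this result, assuming NumPy is available. The data are invented for illustration; the design matrix carries an explicit column of ones for the intercept.

```python
import numpy as np

# Toy data (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])   # n x 2 design matrix

# Solve the normal equations X^T X beta = X^T y without forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of the squared norm at the solution: -2 X^T y + 2 X^T X beta.
gradient = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(gradient, 0.0), np.allclose(beta_hat, beta_lstsq))
```

Solving the linear system is usually preferred over computing $(X^TX)^{-1}$ explicitly, since it is cheaper and numerically more stable.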

### $\underline{\text{Probability notation:}}$

$X$ is a **random variable**: a quantity whose value is unknown until we observe or "sample" it. 

### Individual sample values 

When we actually draw a sample of size $n$, we observe numeric realizations 

$$x_1, x_2, \dots, x_n$$

of the random variables 

$$X_1, X_2, \dots, X_n$$ 

In the probabilistic model we assume there is a single underlying random variable $X$ that captures the population distribution—call its density $f_X$ or pmf.  A *sample* of size $n$ is modelled as $n$ independent replicates of that same variable:

$$
X_1,\;X_2,\;\ldots,\;X_n \;\stackrel{\text{iid}}{\sim}\; X .
$$

* **Independent** means the outcome of one draw carries no information about the others.
* **Identically distributed** means every draw has the same mean $\mu$, variance $\sigma^{2}$, and full distribution $f_X$.

So each $X_i$ is a conceptually separate random variable, but all share the same population law.  They are placeholders for the numbers you *might* observe.  Once you actually collect the data, you plug in the realised values and switch to lowercase:

$$
(x_1,\dots,x_n)=\text{the specific numbers your experiment produced}.
$$

That separation—random $X_i$ in theory, observed $x_i$ in practice—is what lets us derive properties (like expectation and variance) before seeing any data, and then apply them to the concrete sample you end up with.


### Sample-mean random variable

Since each $X_i$ is random, their average 

$$\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i$$

is also a random variable. It has expectation 

$$\mathbb{E}[\bar{X}] = \mu$$

which says the sample mean is an unbiased estimator of the population mean. 

### Observed (numeric) sample mean

After we collect the data we compute 

$$\bar{x} = \frac{1}{n}\sum_{i=1}^nx_i$$

which is a number. Lowercase is used because $\bar{x}$ is no longer random; it is the value that $\bar{X}$ takes for this particular sample. 

### Variances

$$\sigma^2 = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

is the *population* variance of a single draw $X$ from the distribution. We can think of each $X_i$ from the sample $$X_1,\;X_2,\;\ldots,\;X_n \;\stackrel{\text{iid}}{\sim}\; X$$ as a separate "dart throw" at the same target, where each throw has the same spread around the bullseye

$$Var(X_i) = \sigma^2, \hspace{0.1cm} i=1, \dots, n.$$ 

Because the throws are *independent*, the law of variances for a sum is 

$$Var(\sum_{i=1}^nX_i)=\sum_{i=1}^nVar(X_i).$$

Every term of the right-hand side equals the same constant, $\sigma^2$, so adding the constant to itself $n$ times gives 

$$\sum_{i=1}^n Var(X_i) = \sigma^2 + \sigma^2 + \dots + \sigma^2 = n\sigma^2.$$

As discussed above, for iid draws $X_1, \dots, X_n$ of a random variable $X$, the sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$

has expectation $\mathbb{E}[\bar{X}]=\mu$, and by independence we have

$$Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^n X_i\right)=\frac{1}{n^2}\sum_{i=1}^n Var(X_i) = \frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}$$

and this is called the *variance of the sample-mean estimator*. As $n$ grows, $\bar{X}$ concentrates around $\mu$ at a rate of $\frac{1}{\sqrt{n}}$. 

In other words, the standard error (the standard deviation of the estimator), 

$$SE(\bar{X})=\sqrt{Var(\bar{X})}=\sigma/\sqrt{n}$$ 

captures two things:
- the numerator $\sigma$ captures the variability in a single draw from the population
- the denominator $\sqrt{n}$ captures the mathematical effect of averaging independent draws. 

Therefore, if we quadruple the sample size, the *spread* (standard deviation, root-mean-square error) of the sampling distribution, and hence the estimator's error, falls by one-half. Any accuracy measure derived from the $SE$, such as the half-width of a confidence interval, likewise scales like $1/\sqrt{n}$. 

In practice, $\sigma$ is unknown, and we can plug in the sample standard deviation $S$ to get the full data-based estimate

$$\hat{SE}(\bar{X}) = \frac{S}{\sqrt{n}}.$$

The resulting standard error is what appears in the t-statistic, the t-based confidence interval, and every subsequent inference step, so each of them carries forward the same $\sqrt{n}$ shrinkage. 
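A small simulation can make the $\sigma/\sqrt{n}$ behaviour concrete. This is a sketch using only the standard library; the population parameters, sample size, and number of replicates are arbitrary illustrative choices.

```python
import math
import random
import statistics

random.seed(0)
mu, sigma, n, replicates = 10.0, 2.0, 25, 20000   # illustrative values

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(replicates)
]

empirical_se = statistics.pstdev(sample_means)   # spread of the sample means
theoretical_se = sigma / math.sqrt(n)            # sigma / sqrt(n)

# Plug-in estimate S / sqrt(n) computed from a single sample.
one_sample = [random.gauss(mu, sigma) for _ in range(n)]
estimated_se = statistics.stdev(one_sample) / math.sqrt(n)

# Quadrupling the sample size halves the standard error.
se_4n = sigma / math.sqrt(4 * n)

print(empirical_se, theoretical_se, estimated_se, se_4n)
```

The empirical spread of the simulated means should sit close to the theoretical value, while the single-sample estimate fluctuates around it because $S$ is itself random.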

### Summary of probability notation:

$X$ is one draw from the population (random)

$X_i$ is the $i$-th draw in a sample (random)

$\bar{X}$ is the *sample-mean random variable* (random)

$\mathbb{E}[X]$ or $\mathbb{E}[\bar{X}]$ is the population mean $\mu$ (deterministic constant)

$x_i$ is the observed value of $X_i$ (not random)

$\bar{x}$ is the numeric mean of the observed sample (not random)


---
### $\underline{\text{More Simple Regression Analysis}}$:
Since $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimators for $\beta_0$ and $\beta_1$ we want to determine how good they are. To construct confidence intervals for the true regression coefficients we need to find the distribution functions for $\hat{\beta}_0$ and $\hat{\beta}_1$. We can assume that the noise, $\varepsilon_i$, are independent and normally distributed random variables with zero mean and variance $\sigma^2$. For now, we assume that $\sigma^2$ is known. 

### Expected value of coefficient estimators

The true relationship between $Y$ and $X$ is given by 

$$Y = f(X) + \varepsilon$$

where $f(X) = \beta_0 + \beta_1X$.

Just as we can ask how accurate the sample mean, $\hat{\mu}$, is as an estimate of the population mean, $\mu$, we can ask how accurate the coefficient estimates are. We estimate $Y$ for a given $X$ by finding the best estimates of the coefficients, $\hat{\beta}_0$ and $\hat{\beta}_1$, using least squares

$$\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$$

$$\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i -\bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$

Since 

$$\begin{align*}
\sum_{i=1}^n(x_i - \bar{x})(y_i -\bar{y}) &= \sum_{i=1}^nx_iy_i - \bar{y}\sum_{i=1}^nx_i - \bar{x}\sum_{i=1}^ny_i + \sum_{i=1}^n\bar{x}\bar{y} \\
&= \sum_{i=1}^nx_iy_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} \\
&= \sum_{i=1}^nx_iy_i - n\bar{x}\bar{y} \\
&= \sum_{i=1}^nx_iy_i - \bar{x}\sum_{i=1}^n{y_i} \\
&= \sum_{i=1}^n(x_i - \bar{x})y_i
\end{align*}
$$

and 

$$\begin{align*}
\sum_{i=1}^n (x_i - \bar{x})^2 &= \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^nx_i + \sum_{i=1}^n\bar{x}^2 \\
&=\sum_{i=1}^n x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
&=\sum_{i=1}^n x_i^2 - n\bar{x}^2.
\end{align*}
$$

then we can re-write $\hat{\beta_1}$ as 

$$\hat{\beta_1} = \frac{\sum_{i=1}^n(x_i - \bar{x})y_i}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}.$$

We have now isolated $y_i$, the observed response. This is a random variable since 

$$y_i = \beta_0 + \beta_1x_i + \varepsilon_i$$

where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, while the other terms, like $x_i$, are observations (numeric realizations that are not random) of the random variable $X_i$. Let's substitute $y_i$ into $\hat{\beta_1}$

$$\begin{align*}
\hat{\beta_1} &= \frac{\sum_{i=1}^n(x_i - \bar{x})(\beta_0 + \beta_1x_i + \varepsilon_i)}{\sum_{i=1}^n x_i^2 - n\bar{x}^2} \\
&=\frac{\sum_{i=1}^n(x_i - \bar{x})\beta_0 + \sum_{i=1}^n(x_i - \bar{x})\beta_1x_i + \sum_{i=1}^n(x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}.
\end{align*}
$$

Taking the expectation of $\hat{\beta_1}$ we have 

$$\begin{align*}
\mathbb{E}[\hat{\beta_1}] &=\frac{\sum_{i=1}^n(x_i - \bar{x})\beta_0 + \sum_{i=1}^n(x_i - \bar{x})\beta_1x_i + \sum_{i=1}^n(x_i - \bar{x})\mathbb{E}[\varepsilon_i]}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}
\end{align*}
$$

and since $\mathbb{E}[\varepsilon_i]=0$ and $\sum_{i=1}^n(x_i - \bar{x}) = n\bar{x} - n\bar{x}=0$ we have 

$$\begin{align*}
\mathbb{E}[\hat{\beta}_1] &=\frac{\sum_{i=1}^n(x_i - \bar{x})x_i\beta_1}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}
\end{align*}
$$

and since $\sum_{i=1}^n x_i^2 - n\bar{x}^2 = \sum_{i=1}^n(x_i - \bar{x})x_i$ 

$$\begin{align*}
\mathbb{E}[\hat{\beta}_1] &=\beta_1\frac{\sum_{i=1}^n(x_i - \bar{x})x_i}{\sum_{i=1}^n(x_i - \bar{x})x_i}\\
&=\beta_1.
\end{align*}
$$

Therefore, $\hat{\beta_1}$ is an unbiased estimator of $\beta_1$. We can do the same for $\hat{\beta_0}$

$$\begin{align*}
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1\bar{x}\\
&=\frac{1}{n}\sum_{i=1}^n y_i - \hat{\beta}_1\bar{x}\\
&=\frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1x_i + \varepsilon_i) - \hat{\beta}_1\bar{x}\\
&=\frac{1}{n}(\sum_{i=1}^n \beta_0 + \sum_{i=1}^n \beta_1 x_i) + \frac{1}{n}\sum_{i=1}^n\varepsilon_i - \hat{\beta}_1\bar{x}\\
&=\frac{1}{n}(n\beta_0 + n\beta_1\bar{x}) + \bar{\varepsilon} - \hat{\beta}_1\bar{x}\\
&=\beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{\varepsilon} \\
\end{align*}
$$

and taking the expectation of both sides 


$$\mathbb{E}[\hat{\beta}_0] =\mathbb{E}[(\beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{\varepsilon})]$$ 

and since $\mathbb{E}[\hat{\beta}_1] = \beta_1$, $\mathbb{E}[\bar{\varepsilon}]=0$, then

$$
\begin{align*}
\mathbb{E}[\hat{\beta}_0]&=\beta_0 + (\beta_1 - \beta_1)\bar{x} \\
&=\beta_0.
\end{align*}
$$ 

Therefore, $\hat{\beta}_0$ is an unbiased estimator of $\beta_0$.
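Because $\hat{\beta}_1$ is a weighted sum of the $y_i$ with weights that depend only on the fixed design, its expectation is that same weighted sum applied to $\mathbb{E}[y_i] = \beta_0 + \beta_1 x_i$. The sketch below checks the unbiasedness result exactly under that reading; the design points and true coefficients are made-up illustrative values.

```python
from fractions import Fraction as F

# Hypothetical fixed design points and true coefficients (illustrative values).
x = [F(1), F(2), F(4), F(7), F(11)]
beta0, beta1 = F(3), F(-2)
n = len(x)

x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# beta1_hat = sum_i c_i * y_i, with weights c_i = (x_i - x_bar) / S_xx.
c = [(xi - x_bar) / s_xx for xi in x]

# E[y_i] = beta0 + beta1 * x_i because E[eps_i] = 0.
expected_y = [beta0 + beta1 * xi for xi in x]

# Linearity of expectation: E[beta1_hat] = sum_i c_i * E[y_i].
expected_beta1_hat = sum(ci * ey for ci, ey in zip(c, expected_y))
# E[beta0_hat] = E[y_bar] - E[beta1_hat] * x_bar.
expected_beta0_hat = sum(expected_y) / n - expected_beta1_hat * x_bar

print(expected_beta1_hat == beta1, expected_beta0_hat == beta0)
```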

### Variance of coefficient estimators

The variance of the response is 

$$
\begin{align*}
Var(Y_i) &= Var(\beta_0 + \beta_1x_i + \varepsilon_i) \\
&=Var(\varepsilon_i) \\
&=\sigma^2
\end{align*}
$$

where the design points $x_i$ (the rows of the design matrix $X$) are fixed, so $\beta_0 + \beta_1x_i$ is a constant and contributes no variance. 

### Note
A more mathematically correct notation is $Var(y_i|x_i)$, which explicitly says that the variance is conditional on those fixed design values, $x_i$. This is to distinguish it from the unconditional variance $Var(Y)=\beta_1^2Var(X)+\sigma^2$ when $X$ is random. Also, this notational difference is made when you are contrasting regression with other models where the predictors themselves are stochastic. In ISLP, the authors will just use $Var(y_i)=\sigma^2$ or $Var(Y_i)=\sigma^2$. This is fine as long as you specify that the variance comes from the noise term and that you’re treating the design points as fixed.

### Variance of coefficient estimators cont. 
Back to the coefficient estimates, we have from before    

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n(x_i-\bar{x})y_i}{\sum_{i=1}^n(x_i-\bar{x})^2}$$

Using the facts that $Var(y_i)=\sigma^2$, that the $y_i$ are independent, and that $Var(aX) = a^2Var(X)$, we can take the variance of both sides to get 

$$
\begin{align*}
Var(\hat{\beta}_1) &=Var\left(\frac{\sum_{i=1}^n(x_i-\bar{x})y_i}{\sum_{i=1}^n(x_i-\bar{x})^2}\right) \\
&=\frac{\sum_{i=1}^n(x_i-\bar{x})^2 Var(y_i)}{\left(\sum_{i=1}^n(x_i-\bar{x})^2\right)^2}\\
&=\frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2}.
\end{align*}
$$

Likewise, using the fact 

$$Var(X - Y) = Var(X) + Var(Y) - 2Cov(X, Y)$$

and taking the variance of both sides 

$$
\begin{align*}
Var(\hat{\beta}_0) &= Var(\bar{y} - \hat{\beta}_1\bar{x}) \\
&= Var(\bar{y}) + \bar{x}^2Var(\hat{\beta}_1) - 2\bar{x}Cov(\bar{y},\hat{\beta}_1)\\
\end{align*}
$$

and since $\bar{y}$ and $\hat{\beta}_1$ are uncorrelated then $Cov(\bar{y},\hat{\beta}_1)=0$

$$
\begin{align*}
Var(\hat{\beta}_0) &= Var(\bar{y}) + \bar{x}^2Var(\hat{\beta}_1) \\
&= Var\left(\frac{1}{n}\sum_{i=1}^ny_i\right) + \bar{x}^2Var(\hat{\beta}_1) \\
&= \frac{1}{n^2}Var(\sum_{i=1}^ny_i) + \bar{x}^2Var(\hat{\beta}_1) \\
&= \frac{1}{n^2}(n\sigma^2) + \bar{x}^2\left(\frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2}\right)\\
&=\frac{\sigma^2}{n} + \frac{\sigma^2\bar{x}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}.
\end{align*}
$$
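The two variance formulas can be compared against a simulation with a fixed design. This is a sketch assuming NumPy is available; the design points, true coefficients, noise level, and number of replicates are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fixed design and true parameters (illustrative values).
x = np.linspace(0.0, 10.0, 20)
beta0, beta1, sigma = 1.0, 0.5, 2.0
n = x.size
s_xx = np.sum((x - x.mean()) ** 2)

# Simulate many response vectors at once: each row is one data set
# generated with the same x but fresh noise.
replicates = 50_000
eps = rng.normal(0.0, sigma, size=(replicates, n))
y = beta0 + beta1 * x + eps

beta1_hat = (y @ (x - x.mean())) / s_xx
beta0_hat = y.mean(axis=1) - beta1_hat * x.mean()

var_beta1_theory = sigma**2 / s_xx
var_beta0_theory = sigma**2 / n + sigma**2 * x.mean() ** 2 / s_xx

print(beta1_hat.var(), var_beta1_theory)
print(beta0_hat.var(), var_beta0_theory)
```

The empirical variances should agree with the formulas to within Monte Carlo error, which shrinks as `replicates` grows.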

### Proof that covariance is zero
We start with the formulas for the estimators:

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x}) Y_i}{\sum_{i=1}^n (x_i - \bar{x})^2},
\qquad
\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i.
$$

We want to compute the covariance:

$$
\operatorname{Cov}(\bar{Y}, \hat{\beta}_1)
= \operatorname{Cov}\left( \frac{1}{n} \sum_{i=1}^n Y_i,
\frac{\sum_{j=1}^n (x_j - \bar{x}) Y_j}{\sum_{j=1}^n (x_j - \bar{x})^2} \right).
$$

Since the denominator in $\hat{\beta}_1$ is constant (a function of the fixed design points $x_j$), we can factor it out:

$$
= \frac{1}{n \sum_{j=1}^n (x_j - \bar{x})^2}
\operatorname{Cov} \left( \sum_{i=1}^n Y_i, \sum_{j=1}^n (x_j - \bar{x}) Y_j \right).
$$

Expanding the covariance:

$$
= \frac{1}{n \sum_{j=1}^n (x_j - \bar{x})^2}
\sum_{i=1}^n \sum_{j=1}^n (x_j - \bar{x}) \operatorname{Cov}(Y_i, Y_j).
$$

Since $\operatorname{Cov}(Y_i, Y_j) = 0$ for $i \ne j$ and $\sigma^2$ for $i = j$ (due to independent errors):

$$
= \frac{1}{n \sum_{j=1}^n (x_j - \bar{x})^2}
\sum_{i=1}^n (x_i - \bar{x}) \sigma^2
= \frac{\sigma^2}{n \sum_{j=1}^n (x_j - \bar{x})^2}
\sum_{i=1}^n (x_i - \bar{x}).
$$

But:

$$
\sum_{i=1}^n (x_i - \bar{x}) = 0,
$$

so we conclude:

$$
\operatorname{Cov}(\bar{Y}, \hat{\beta}_1) = 0.
$$

Thus, $\bar{Y}$ and $\hat{\beta}_1$ are uncorrelated.
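The proof reduces to the statement that the weights defining $\bar{Y}$ and $\hat{\beta}_1$ are orthogonal. A minimal exact check of that step, with made-up design points and an arbitrary value for $\sigma^2$:

```python
from fractions import Fraction as F

# Hypothetical fixed design points (illustrative values).
x = [F(2), F(3), F(5), F(8), F(12), F(13)]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Y_bar = sum_i a_i Y_i and beta1_hat = sum_i c_i Y_i.
a = [F(1, n)] * n
c = [(xi - x_bar) / s_xx for xi in x]

# With independent Y_i of common variance sigma^2,
# Cov(Y_bar, beta1_hat) = sigma^2 * sum_i a_i * c_i.
sigma2 = F(9, 4)
cov_ybar_beta1 = sigma2 * sum(ai * ci for ai, ci in zip(a, c))

print(cov_ybar_beta1)
```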


### Sample variance

If $\sigma^2$ is known, the distributions of the two estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ 
let us construct confidence intervals for the regression coefficients. 
In the majority of cases $\sigma^2$ is not known, so we need an estimator 
for $\sigma^2$, which measures the variation of the data around the 
regression line.

We know the $RSS$ is given by 

$$
\begin{align*}
RSS &= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&=\sum_{i=1}^n(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2. \\
\end{align*}
$$

Let's define the following quantities for convenience 
$$S_{xx} = \sum_{i=1}^n(x_i - \bar{x})^2$$

$$S_{yy} = \sum_{i=1}^n(y_i - \bar{y})^2$$

$$S_{xy} = \sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y}).$$

Since $\hat{\beta}_0 = \bar{y} -  \hat{\beta}_1\bar{x}$ we have 

$$
\begin{align*}
RSS &=\sum_{i=1}^n[y_i - \bar{y} -  \hat{\beta}_1(x_i - \bar{x})]^2 \\
&= \sum_{i=1}^n(y_i - \bar{y})^2 - 2\hat{\beta}_1\sum_{i=1}^n(y_i - \bar{y})(x_i - \bar{x}) + \hat{\beta}_1^2\sum_{i=1}^n(x_i - \bar{x})^2 \\
&= S_{yy} - 2\hat{\beta}_1S_{xy} + \hat{\beta}_1^2S_{xx}.\\
\end{align*}
$$

We know that $\hat{\beta_1}$ is 

$$
\begin{align*}
\hat{\beta_1} &= \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2} \\
&=\frac{S_{xy}}{S_{xx}}
\end{align*}
$$

and plugging back into $RSS$

$$
\begin{align*}
RSS &= S_{yy} - 2\frac{S_{xy}}{S_{xx}}S_{xy} + \left(\frac{S_{xy}}{S_{xx}}\right)^2 S_{xx}\\
&=S_{yy} - \frac{S_{xy}^2}{S_{xx}}\\
&=S_{yy} - \hat{\beta_1}S_{xy}
\end{align*}
$$

and taking the expectation of both sides, using $S_{yy} = \sum_{i=1}^ny_i^2 - n\bar{y}^2$ and $S_{xy} = \hat{\beta}_1S_{xx}$, 
$$
\begin{align*}
\mathbb{E}[RSS] &= \mathbb{E}[S_{yy} - \hat{\beta_1}S_{xy}] \\
&=\mathbb{E}[\sum_{i=1}^ny_i^2 - n\bar{y}^2 - \hat{\beta}_1^2S_{xx}] \\
&=\sum_{i=1}^n(\mathbb{E}[y_i^2]) - n\mathbb{E}[\bar{y}^2] - S_{xx}\mathbb{E}[\hat{\beta}_1^2] \\
\end{align*}
$$

and because $\mathbb{V}[y_i] = \mathbb{E}[y_i^2] - (\mathbb{E}[y_i])^2$

$$
\begin{align*}
\mathbb{E}[RSS]
&=\sum_{i=1}^n(\mathbb{V}[y_i] + \mathbb{E}[y_i]^2) - n(\mathbb{V}[\bar{y}] + \mathbb{E}[\bar{y}]^2) - S_{xx}(\mathbb{V}[\hat{\beta}_1] + \mathbb{E}[\hat{\beta}_1]^2)\\
&=\sum_{i=1}^n\sigma^2 + \sum_{i=1}^n(\beta_0 + \beta_1x_i)^2 - n\left(\frac{\sigma^2}{n} + (\beta_0 + \beta_1\bar{x})^2\right) -S_{xx}\left(\frac{\sigma^2}{S_{xx}} + \beta_1^2\right) \\ 
&=n\sigma^2 + \sum_{i=1}^n\left(\beta_0^2 + 2\beta_0\beta_1x_i + (\beta_1x_i)^2\right) - \sigma^2 - \left(n\beta_0^2 + 2\beta_0\beta_1(n\bar{x}) + n\beta_1^2\bar{x}^2\right) - \sigma^2 - \beta_1^2\sum_{i=1}^n(x_i - \bar{x})^2 \\
&=(n-2)\sigma^2 + \sum_{i=1}^n\left(\beta_0^2 + 2\beta_0\beta_1x_i + (\beta_1x_i)^2\right) - n\beta_0^2  - 2\beta_0\beta_1(n\bar{x}) - n\beta_1^2\bar{x}^2 - \beta_1^2\sum_{i=1}^nx_i^2 + n\beta_1^2\bar{x}^2\\
&=(n-2)\sigma^2 + \sum_{i=1}^n\left(\beta_0^2 + 2\beta_0\beta_1x_i + (\beta_1x_i)^2\right) - \sum_{i=1}^n\beta_0^2 - 2\beta_0\beta_1\sum_{i=1}^nx_i - \beta_1^2\sum_{i=1}^nx_i^2\\
&=(n-2)\sigma^2 + \sum_{i=1}^n\left(\beta_0^2 + 2\beta_0\beta_1x_i + (\beta_1x_i)^2\right) - \sum_{i=1}^n\left(\beta_0^2 + 2\beta_0\beta_1x_i + (\beta_1x_i)^2\right) \\
&=(n-2)\sigma^2.
\end{align*}
$$

Then dividing both sides by $n-2$ 

$$\mathbb{E}\left[\frac{RSS}{n-2}\right] = \sigma^2$$

in other words, $\frac{RSS}{n-2}$ is an unbiased estimator of the true error variance. By 
definition this quantity is called the squared residual standard error 

$$RSE^2 = \frac{RSS}{n-2}$$

which can be expressed using previous quantities 

$$RSE^2= \frac{S_{yy} - \hat{\beta}_1S_{xy}}{n-2}.$$

Taking the square root we get the **residual standard error**

$$RSE= \sqrt{\frac{RSS}{n-2}}.$$
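The shortcut $RSS = S_{yy} - \hat{\beta}_1S_{xy}$ and the resulting $RSE$ can be checked on a small example. A minimal sketch with invented data values:

```python
import math

# Toy data (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.8, 3.1, 4.9, 5.2, 6.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# RSS computed directly from the residuals ...
rss_direct = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
# ... and from the shortcut RSS = S_yy - beta1_hat * S_xy.
rss_shortcut = s_yy - beta1_hat * s_xy

rse = math.sqrt(rss_direct / (n - 2))   # residual standard error

print(rss_direct, rss_shortcut, rse)
```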

So why is this unbiased? Because by construction we have

$$
\mathbb{E}[\mathrm{RSS}] = (n-2)\,\sigma^2
$$
it follows immediately that
$$
\mathbb{E}\!\Bigl[\frac{\mathrm{RSS}}{n-2}\Bigr]
= \frac{\mathbb{E}[\mathrm{RSS}]}{n-2}
= \frac{(n-2)\,\sigma^2}{n-2}
= \sigma^2.
$$
That is exactly the definition of an unbiased estimator of $\sigma^2$: its expectation equals the true parameter. Unbiasedness alone does not determine the distribution of $RSE^2$, but under normal errors we can find it. Writing $y_i - \bar{y} = \beta_1(x_i - \bar{x}) + (\varepsilon_i - \bar{\varepsilon})$ and $\hat{\beta}_1 - \beta_1 = \frac{1}{S_{xx}}\sum_{i=1}^n(x_i - \bar{x})\varepsilon_i$, the same expansion used for $RSS$ above gives 

$$
\begin{align*}
\frac{RSE^2(n-2)}{\sigma^2} &= \frac{RSS}{\sigma^2} \\
&=\frac{\sum_{i=1}^n(\varepsilon_i - \bar{\varepsilon})^2}{\sigma^2} - \frac{(\hat{\beta}_1 - \beta_1)^2 S_{xx}}{\sigma^2} \\
&=\underbrace{\sum_{i=1}^n\left(\frac{\varepsilon_i - \bar{\varepsilon}}{\sigma}\right)^2}_{\chi_{n-1}^2}  - \underbrace{\left(\frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}}\right)^2}_{\chi_1^2}
\end{align*}
$$

which suggests that $\frac{RSE^2(n-2)}{\sigma^2}$ has a $\chi^2$ distribution with $n-2$ degrees of freedom. 

Recall that $\hat{\beta}_1$ is normally distributed with mean $\beta_1$ and variance $\sigma^2/S_{xx}$, and therefore $\frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim \mathcal{N}(0,1)$. If we replace $\sigma$ by its estimate $RSE$ we find

$$
\begin{align*}
\frac{\hat{\beta}_1 - \beta_1}{RSE/\sqrt{S_{xx}}} &= \frac{\frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}}}{\sqrt{\frac{RSE^2}{\sigma^2}\cdot\frac{n-2}{n-2}}} \\
&=\frac{\frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}}}{\sqrt{\frac{RSE^2(n-2)}{\sigma^2}\Big/(n-2)}} \\
&\sim \frac{\mathcal{N}(0, 1)}{\sqrt{\chi^2_{n-2}/(n-2)}} \\
&\sim t(n-2)
\end{align*}
$$

which suggests that $\frac{\hat{\beta}_1 - \beta_1}{RSE/\sqrt{S_{xx}}}$ has a t-distribution with $n-2$ degrees of freedom (the ratio is exactly $t$-distributed because $\hat{\beta}_1$ and $RSE^2$ are independent under normal errors). Next we can consider the distribution for the estimate of the intercept, $\hat{\beta}_0$. Recall that $\hat{\beta}_0 \sim \mathcal{N}\left(\beta_0, \frac{\sigma^2}{n} + \frac{\sigma^2 \bar{x}^2}{S_{xx}}\right)$ and therefore 

$$\frac{\hat{\beta}_0 - \beta_0}{\sqrt{\frac{\sigma^2}{n} + \frac{\sigma^2 \bar{x}^2}{S_{xx}}}} \sim \mathcal{N}(0, 1)$$ 

and if we replace $\sigma$ with $RSE$ then 


$$
\begin{align*}
\frac{\hat{\beta}_0 - \beta_0}{RSE\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} &= \frac{\frac{\hat{\beta}_0 - \beta_0}{\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}}{\sqrt{\frac{RSE^2}{\sigma^2}\cdot \frac{n-2}{n-2}}}\\
&=\frac{\frac{\hat{\beta}_0 - \beta_0}{\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}}{\sqrt{\frac{RSE^2(n-2)}{\sigma^2}\Big/(n-2)}}\\
&\sim \frac{\mathcal{N}(0, 1)}{\sqrt{\chi^2_{n-2}/(n-2)}} \\
&\sim t(n-2)
\end{align*}
$$

which suggests that $\frac{\hat{\beta}_0 - \beta_0}{RSE\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$ has a t-distribution with $n-2$ degrees of freedom. 
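In practice these results are used through the estimated standard errors and the t-statistics for $H_0: \beta_j = 0$. The sketch below computes them for invented data; it stops short of p-values or confidence intervals, which would need quantiles of the $t(n-2)$ distribution from a statistics library.

```python
import math

# Toy data (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.8, 3.1, 4.9, 5.2, 6.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

rss = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
rse = math.sqrt(rss / (n - 2))          # estimate of sigma

# Estimated standard errors: sigma replaced by RSE in the variance formulas.
se_beta1 = rse / math.sqrt(s_xx)
se_beta0 = rse * math.sqrt(1.0 / n + x_bar**2 / s_xx)

# t-statistics for H0: beta_j = 0, each referred to t(n - 2).
t_beta1 = beta1_hat / se_beta1
t_beta0 = beta0_hat / se_beta0

print(se_beta0, se_beta1, t_beta0, t_beta1)
```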

### Bias - Variance relationship to estimators

For any estimator $\hat{\theta}$ of a scalar target $\theta$

$$\operatorname{MSE}(\hat{\theta})
  \;=\;
  \mathbb{E}\!\bigl[(\hat{\theta}-\theta)^2\bigr]
  \;=\;
  \underbrace{\bigl(\operatorname{Bias}(\hat{\theta})\bigr)^2}_{\text{systematic error}}
  \;+\;
  \underbrace{\operatorname{Var}(\hat{\theta})}_{\text{sampling noise}}.
$$

If an estimator is unbiased, like ordinary least squares gives for $\hat{\beta}_0$ and $\hat{\beta}_1$, then the first term of the bias-variance decomposition vanishes. The mean-squared error reduces to variance alone

$$
\operatorname{MSE}(\hat{\beta}_0) \;=\; \operatorname{Var}(\hat{\beta}_0),
\qquad
\operatorname{MSE}(\hat{\beta}_1) \;=\; \operatorname{Var}(\hat{\beta}_1).
$$

Therefore, OLS is the zero-bias corner of the bias-variance plane, and its total risk is entirely variance driven. Unbiasedness is nice, but it does not guarantee low MSE. If the design points $x_i$ are tightly clustered or the noise level $\sigma^2$ is high, then the variance of $\hat{\beta}_0$ and $\hat{\beta}_1$ can be large, giving wider confidence intervals and unstable predictions. Shrinkage methods such as ridge, lasso, elastic net, and Bayesian priors deliberately introduce bias to cut variance, often yielding a smaller MSE overall. This "trade-off" is deciding where on the curve you want to land. OLS chooses the far left edge, while ridge and lasso slide rightward, accepting some bias to reduce variance. 
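The trade-off can be seen in the simplest possible setting. The sketch below is an analogy rather than ridge or lasso themselves: it shrinks the sample mean by a factor `c` (a hypothetical estimator $c\bar{X}$ of $\mu$) and evaluates the decomposition exactly. The values of `mu`, `sigma`, and `n` are arbitrary.

```python
# Hypothetical setup: estimate a mean mu with the shrunken sample mean c * X_bar.
mu, sigma, n = 1.0, 3.0, 9   # illustrative values

def mse_of_shrunken_mean(c):
    """Exact bias, variance, and MSE of the estimator c * X_bar."""
    bias = (c - 1.0) * mu            # E[c * X_bar] - mu
    variance = c**2 * sigma**2 / n   # Var(c * X_bar) = c^2 * sigma^2 / n
    return bias, variance, bias**2 + variance

for c in (1.0, 0.8, 0.5):
    bias, variance, mse = mse_of_shrunken_mean(c)
    print(f"c={c:.1f}  bias={bias:+.2f}  var={variance:.2f}  mse={mse:.2f}")
```

With these numbers the unbiased choice `c = 1.0` has the largest MSE of the three, because the variance saved by shrinking outweighs the squared bias it introduces.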

---
### $\underline{\text{More Multiple Regression Analysis}}$:

Let's rewrite the OLS equations in a form that is amenable to deriving the sample variance, the F-statistic, and other statistics used in multiple regression. We know that the **residual vector** is  

$$\bf{e} = Y - \hat{Y}$$

and the OLS solution is 

$$\hat{\beta} = (X^TX)^{-1}X^TY.$$

The fitted values are 

$$\hat{Y} = X\hat{\beta} = X(X^TX)^{-1}X^TY = PY$$

where

$$P = X(X^TX)^{-1}X^T.$$

Using the residual definition we know

$$\bf{e}=Y - \hat{Y} = Y - PY = (I - P)Y.$$

Plugging in the true model $Y=X\beta + \varepsilon$ 

$$\bf{e}= (I - P)(X\beta + \varepsilon) = (I - P)X\beta + (I-P)\varepsilon.$$

Since $PX = X(X^TX)^{-1}X^TX = XI = X$ then 

$$(I-P)X\beta=X\beta - PX\beta = X\beta - X\beta = 0$$

and all that remains is 

$$\bf{e} = (I - P)\varepsilon$$

which can also be written as 

$$\bf{e} = (I - X(X^TX)^{-1}X^T)\varepsilon.$$
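The properties of $P$ used in this argument are easy to verify numerically. A minimal sketch, assuming NumPy is available; the design matrix, coefficients, and noise are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design: n = 8 observations, an intercept plus p = 2 predictors.
n, p = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, -2.0, 0.5])
eps = rng.normal(scale=0.3, size=n)
Y = X @ beta + eps

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix
I = np.eye(n)

e = Y - P @ Y                          # residual vector e = (I - P) Y

print(np.allclose(P @ X, X))           # P X = X
print(np.allclose(e, (I - P) @ eps))   # e = (I - P) eps: beta drops out
```

The explicit inverse is used here only to mirror the formula; for real data a least-squares solver is the more stable route.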

We will also use the following identities

$$\mathbb{E}[Tr(X)] = Tr(\mathbb{E}[X])$$


$$Tr(AB) = Tr(BA)$$

$$Tr(ABC) = Tr(BCA) = Tr(CAB)$$ 

which is the cyclic property. The inner product of the residual $\bf{e}\in\mathbb{R}^n$ with itself can be expressed as 

$$
\begin{align*}
\sum_{i=1}^n e_i^2 &= \bf{e}^T\bf{e} \\
&= (\bf{Y}-\hat{\bf{Y}})^T (\bf{Y}-\hat{\bf{Y}})\\
&=\it{Tr}\left((\bf{Y}-\hat{\bf{Y}})(\bf{Y}-\hat{\bf{Y}})^T\right).
\end{align*}
$$

Also, we have $\mathbf{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_n)^T$

$$
\mathbf{\varepsilon\,\varepsilon}^T
=
\begin{pmatrix}
  \varepsilon_1\varepsilon_1 & \cdots & \varepsilon_1\varepsilon_n \\
  \vdots & \ddots & \vdots \\
  \varepsilon_n\varepsilon_1 & \cdots & \varepsilon_n\varepsilon_n
\end{pmatrix}.
$$

and therefore 

$$
\begin{align*}
Tr(\varepsilon\,\varepsilon^T) &=\varepsilon_1^2 + \cdots + \varepsilon_n^2 \\
&=\sum_{i=1}^n \varepsilon_i^2 \\
\end{align*}
$$

and if we take the expectation of both sides we get 

$$\mathbb{E}[Tr(\varepsilon\,\varepsilon^T)] = \sum_{i=1}^n \mathbb{E}[\varepsilon_i^2]$$


where 
$$
\begin{align*}
\sum_{i=1}^n \mathbb{E}[\varepsilon_i^2] &= \sum_{i=1}^n[Var(\varepsilon_i) + (\mathbb{E}[\varepsilon_i])^2]\\
&= \sum_{i=1}^nVar(\varepsilon_i) \\
&= n\sigma^2. \\
\end{align*}
$$

### F-statistic

Recall, in multiple regression we have $p$ predictors, and to determine whether there is a relationship between the response and the predictors, we need to determine whether the regression coefficients are zero. The null hypothesis is 

$$H_0:\beta_1 = \beta_2 =\cdots=\beta_p=0$$

versus the alternative 

$$H_a: \text{at least one} \hspace{0.1cm} \beta_j\hspace{0.1cm} \text{is non-zero.}$$

The hypothesis test is performed by computing the F-statistic

$$F=\frac{(TSS - RSS)/p}{RSS/(n-p-1)}$$

where $TSS=\sum_{i=1}^n(y_i - \bar{y})^2$ and $RSS=\sum_{i=1}^n(y_i - \hat{y}_i)^2$. If the 
linear model assumptions are correct then 

$$\mathbb{E}[RSS/(n-p-1)] = \sigma^2$$

and if $H_0$ is true then 

$$\mathbb{E}[(TSS - RSS)/p] = \sigma^2.$$
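A minimal sketch of computing the F-statistic from data, assuming NumPy is available. The data are simulated, with only the first of three hypothetical predictors actually related to the response.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: n = 50 observations, p = 3 predictors, only the first matters.
n, p = 50, 3
predictors = rng.normal(size=(n, p))
y = 2.0 + 1.5 * predictors[:, 0] + rng.normal(size=n)

X = np.column_stack([np.ones(n), predictors])   # add the intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
r_squared = 1.0 - rss / tss

print(f_stat, r_squared)
```

Under $H_0$ both the numerator and the denominator estimate $\sigma^2$, so $F$ should be near 1; a value far above 1 is evidence against $H_0$.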

The $RSS$ is 

$$
\begin{align*}
RSS &= Tr\left( (Y-\hat{Y})^T(Y - \hat{Y})\right)\\ 
&= Tr\left( (Y-\hat{Y})(Y - \hat{Y})^T\right)\\
&= Tr\left((I-P)YY^T(I-P)^T\right)\\
&= Tr\left((I-P)(X\beta + \varepsilon)(X\beta + \varepsilon)^T(I-P)^T\right)\\
&= Tr\left((X\beta + \varepsilon - PX\beta -P\varepsilon)(\beta^TX^T + \varepsilon^T-\beta^TX^TP^T - \varepsilon^T P^T)\right) \\
& = Tr\left((\varepsilon - P\varepsilon)(\varepsilon^T - \varepsilon^T P^T)\right)\\
& = Tr\left(\varepsilon\varepsilon^T -P\varepsilon\varepsilon^T - \varepsilon\varepsilon^TP^T + P\varepsilon\varepsilon^TP^T\right).
\end{align*}
$$

Taking the expectation of both sides and using the symmetry $P = P^T$, we have 


$$\mathbb{E}[RSS] = \mathbb{E}[Tr(\varepsilon\varepsilon^T) - Tr(P\varepsilon\varepsilon^T) - Tr(\varepsilon\varepsilon^TP) + Tr(P\varepsilon\varepsilon^TP)]$$

and since $Tr(\varepsilon\varepsilon^TP)=Tr(P\varepsilon\varepsilon^T)$ by the cyclic property of the trace

$$
\begin{align*}
\mathbb{E}[RSS] &=\mathbb{E}[Tr(\varepsilon\varepsilon^T)] - 2\mathbb{E}[Tr(P\varepsilon\varepsilon^T)] + \mathbb{E}[Tr(P\varepsilon\varepsilon^TP)]\\ 
\end{align*}
$$

and using the linearity of expectation together with $\mathbb{E}[\varepsilon\varepsilon^T]=\sigma^2 I$

$$
\begin{align*}
\mathbb{E}[RSS]&=Tr(\mathbb{E}[\varepsilon\varepsilon^T]) - 2Tr(P\mathbb{E}[\varepsilon\varepsilon^T]) + Tr(P\mathbb{E}[\varepsilon\varepsilon^T]P)\\ 
&=Tr(\sigma^2I) - 2Tr(P\sigma^2I) + Tr(P\sigma^2IP) \\
&=\sigma^2Tr(I) - 2\sigma^2Tr(P) + \sigma^2Tr(P^2) \\
\end{align*}
$$

and $P$ is idempotent since $P^2 =X(X^TX)^{-1}X^T \left(X(X^TX)^{-1}X^T\right)=X(X^TX)^{-1}X^T=P$ so

$$\mathbb{E}[RSS]=\sigma^2Tr(I_{n\times n}) - \sigma^2Tr(P)$$

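and since $I \in \mathbb{R}^{n\times n}$ gives $Tr(I)=n$, and $X \in \mathbb{R}^{n\times (p+1)}$ has $p$ predictors and an intercept so that $Tr(P)=Tr\left((X^TX)^{-1}X^TX\right)=Tr(I_{(p+1)\times(p+1)})=p+1$, then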

$$
\begin{align*}
\mathbb{E}[RSS]&=\sigma^2n - \sigma^2(p + 1) \\
&=(n - p - 1)\sigma^2
\end{align*}
$$

and finally

$$\frac{\mathbb{E}[RSS]}{(n - p - 1)} = \sigma^2.$$

$$\boxed{\mathbb{E}[RSS] = (n - p - 1)\sigma^2}$$
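
The three properties of $P$ used in this derivation (symmetry, idempotence, and $Tr(P)=p+1$) can be verified numerically. This is a sketch on a randomly generated design matrix; the sizes `n` and `p` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 4                                                  # assumed illustrative sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])    # intercept plus p predictors

# P = X (X^T X)^{-1} X^T, computed with a linear solve instead of an explicit inverse
P = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(P, P.T))        # symmetric: P = P^T
print(np.allclose(P @ P, P))      # idempotent: P^2 = P
print(np.trace(P), p + 1)         # Tr(P) equals the number of columns of X
```
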

A shorter derivation is 

$$RSS=\mathbf{e}^T\mathbf{e}=((I-P)Y)^T(I-P)Y=Y^T(I-P)^T(I-P)Y$$ 

and $P$ is symmetric such that $(I-P)^T=I-P$ and idempotent such that $(I-P)^2 =I-P$, therefore 

$$RSS = Y^T(I-P)Y.$$

Since $Y=X\beta + \varepsilon$ and $(I-P)X\beta=(X\beta - X\beta)=0$ you get 

$$RSS = \varepsilon^T(I-P)\varepsilon$$

and then proceed as before by taking the expectation of both sides. 
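
The shorter route can also be checked by simulation: the sketch below confirms that $Y^T(I-P)Y$ and $\varepsilon^T(I-P)\varepsilon$ agree, and that their average is close to $(n-p-1)\sigma^2$. The coefficient vector `beta` and all sizes are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, reps = 30, 2, 1.5, 20_000                       # assumed illustrative values

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # fixed design with intercept
beta = np.array([1.0, -2.0, 0.5])                            # hypothetical true coefficients
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)            # I - P

eps = rng.normal(0.0, sigma, size=(reps, n))                 # one error vector per row
Y = X @ beta + eps                                           # one response vector per row

rss_from_y = np.sum((Y @ M) * Y, axis=1)                     # Y^T (I - P) Y for each row
rss_from_eps = np.sum((eps @ M) * eps, axis=1)               # eps^T (I - P) eps for each row

print(np.allclose(rss_from_y, rss_from_eps))
print(rss_from_y.mean(), (n - p - 1) * sigma**2)
```

The first check shows that the $X\beta$ part of $Y$ is annihilated by $I-P$; the second is a Monte Carlo estimate of $\mathbb{E}[RSS]$.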

Because $TSS$ is the sum of squares around the sample mean, computing it is effectively fitting the intercept-only model
$$
\quad
Y_i = \beta_0 + \varepsilon_i,
\quad \hat\beta_0 = \bar Y.
$$

Under $H_0$ this intercept-only model is the true model, and in population terms we write

$$
\quad
\beta_0 = \mu,
\quad Y_i = \mu + \varepsilon_i,
\quad \varepsilon_i \sim N(0,\sigma^2).
$$

and the residuals are 
$$
\quad
e_i = Y_i - \bar Y,
\quad
\mathrm{TSS} = \sum_{i=1}^n e_i^2
= \sum_{i=1}^n (Y_i - \bar Y)^2.
$$

Hence, under $H_0$, we start the derivation from $Y_i = \mu + \varepsilon_i$:

$$
\begin{aligned}
\mathrm{TSS}
&= \sum_{i=1}^n (Y_i - \bar Y)^2
= \sum_{i=1}^n \bigl(Y_i^2 - 2\,Y_i\,\bar Y + \bar Y^2\bigr)\\[4pt]
&= \sum_{i=1}^n Y_i^2 \;-\; 2\,\bar Y\sum_{i=1}^n Y_i \;+\; n\,\bar Y^2
  = \sum_{i=1}^n Y_i^2 \;-\; n\,\bar Y^2,\\[6pt]
\mathbb{E}[\mathrm{TSS}]
&= \mathbb{E}\Bigl[\sum_{i=1}^n Y_i^2\Bigr]
  - n\,\mathbb{E}[\bar Y^2].
\end{aligned}
$$

and so we get 

$$
\begin{aligned}
\mathbb{E}[Y_i^2]
&= Var(Y_i) + (\mathbb{E}[Y_i])^2
= \sigma^2 + \mu^2
\;\Longrightarrow\;
\mathbb{E}\Bigl[\sum_i Y_i^2\Bigr] = n(\sigma^2+\mu^2),\\[6pt]
\bar Y
&= \mu + \tfrac{1}{n}\sum_i \varepsilon_i,
\quad Var(\bar Y)=\tfrac{\sigma^2}{n}
\;\Longrightarrow\;
\mathbb{E}[\bar Y^2] = Var(\bar Y) + \mathbb{E}[\bar{Y}]^2=\tfrac{\sigma^2}{n} + \mu^2,\\[6pt]
\mathbb{E}[\mathrm{TSS}]
&= n(\sigma^2+\mu^2) - n\Bigl(\tfrac{\sigma^2}{n} + \mu^2\Bigr)
= (n-1)\,\sigma^2.
\end{aligned}
$$

$$
\boxed{\mathbb{E}[\mathrm{TSS}] = (n-1)\,\sigma^2.}
$$
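
A simulation under the intercept-only model gives an approximate check of $\mathbb{E}[\mathrm{TSS}] = (n-1)\sigma^2$ and of the shortcut $\mathrm{TSS} = \sum_i Y_i^2 - n\bar Y^2$. This is a sketch with normally distributed errors; `n`, `mu`, `sigma`, and `reps` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(5)
n, mu, sigma, reps = 25, 3.0, 2.0, 50_000          # assumed illustrative values

Y = mu + rng.normal(0.0, sigma, size=(reps, n))    # each row: Y_i = mu + eps_i
Y_bar = Y.mean(axis=1)

tss = np.sum((Y - Y_bar[:, None]) ** 2, axis=1)    # sum_i (Y_i - Y_bar)^2
tss_shortcut = np.sum(Y**2, axis=1) - n * Y_bar**2 # sum_i Y_i^2 - n * Y_bar^2

print(np.allclose(tss, tss_shortcut))
print(tss.mean(), (n - 1) * sigma**2)
```
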

Also, we can derive the same result using
$$
\begin{aligned}
\mathrm{TSS}
&= (Y - \bar Y\,\mathbf{1})^T (Y - \bar Y\,\mathbf{1})
  = \operatorname{tr}\bigl((I - H)\,Y\,Y^T\bigr), 
  \quad H = \tfrac{1}{n}\,\mathbf{1}\,\mathbf{1}^T, \\[6pt]
Y\,Y^T
&= (\mu\,\mathbf{1} + \varepsilon)\,(\mu\,\mathbf{1} + \varepsilon)^T
  = \mu^2\,\mathbf{1}\,\mathbf{1}^T
    + \mu\,\mathbf{1}\,\varepsilon^T
    + \mu\,\varepsilon\,\mathbf{1}^T
    + \varepsilon\,\varepsilon^T, \\[6pt]
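(I - H)\,\mathbf{1} &= 0
  \quad\Longrightarrow\quad
  (I - H)\,Y\,Y^T = \mu\,(I - H)\,\varepsilon\,\mathbf{1}^T
    + (I - H)\,\varepsilon\,\varepsilon^T, \\[6pt]
\operatorname{tr}\bigl((I - H)\,\varepsilon\,\mathbf{1}^T\bigr)
&= \mathbf{1}^T (I - H)\,\varepsilon = 0
  \quad\Longrightarrow\quad
  \operatorname{tr}\bigl((I - H)\,Y\,Y^T\bigr)
  = \operatorname{tr}\bigl((I - H)\,\varepsilon\,\varepsilon^T\bigr),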
\end{aligned}
$$

because 

$$
H = \frac{1}{n}\,\mathbf{1}\,\mathbf{1}^T
\quad\Longrightarrow\quad
(I - H)\,\mathbf{1}
= \mathbf{1} - \frac{1}{n}\,\mathbf{1}(\mathbf{1}^T\mathbf{1})
= \mathbf{1} - \mathbf{1}
= 0
$$

and we are left with 
$$
\begin{aligned}
\mathrm{TSS}
&= \operatorname{tr}\bigl((I - H)\,\varepsilon\,\varepsilon^T\bigr), \\[6pt]
\mathbb{E}[\mathrm{TSS}]
&= \operatorname{tr}\!\bigl((I - H)\,\mathbb{E}[\varepsilon\,\varepsilon^T]\bigr)
  = \operatorname{tr}\!\bigl((I - H)\,\sigma^2 I\bigr)
  = \sigma^2\,\operatorname{tr}(I - H) \\[4pt]
&= \sigma^2\bigl(\operatorname{tr}(I)-\operatorname{tr}(H)\bigr)
  = \sigma^2\,(n - 1).
\end{aligned}
$$
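
The matrix form can be checked numerically as well. The sketch below builds $H = \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$ and confirms that $(I-H)\mathbf{1}=0$, that $\operatorname{tr}(I-H)=n-1$, and that the trace computed from $YY^T$ matches the one computed from $\varepsilon\varepsilon^T$. The values of `n`, `mu`, and `sigma` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, mu, sigma = 8, 3.0, 2.0                 # assumed illustrative values

ones = np.ones(n)
H = np.outer(ones, ones) / n               # H = (1/n) 1 1^T
C = np.eye(n) - H                          # I - H, which subtracts the sample mean

eps = rng.normal(0.0, sigma, size=n)
Y = mu * ones + eps                        # intercept-only model

tss_direct = np.sum((Y - Y.mean()) ** 2)
tss_trace_Y = np.trace(C @ np.outer(Y, Y))          # tr((I - H) Y Y^T)
tss_trace_eps = np.trace(C @ np.outer(eps, eps))    # tr((I - H) eps eps^T)

print(np.allclose(C @ ones, 0.0))
print(np.trace(C), n - 1)
print(tss_direct, tss_trace_Y, tss_trace_eps)
```

All three TSS values agree because the terms involving $\mu$ contribute nothing to the trace.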

If the linear model assumptions are correct then

$$\mathbb{E}[RSS/(n-p-1)]=\sigma^2.$$

If $H_0$ is true then 

$$
\begin{align*}
\mathbb{E}[(TSS - RSS)/p] &= \frac{1}{p}(\mathbb{E}[TSS] - \mathbb{E}[RSS]) \\
&=\frac{1}{p}(\sigma^2(n - 1) - \sigma^2(n - 1 - p))=\frac{1}{p}p\sigma^2=\sigma^2.
\end{align*}
$$
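
Both expectations can be checked together by simulating data for which $H_0$ is true. In this sketch the design matrix, the intercept value, and the sizes are hypothetical; the averages of the numerator and denominator of $F$ should both land near $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma, reps = 40, 3, 2.0, 20_000                       # assumed illustrative values

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept plus p predictors
P = X @ np.linalg.solve(X.T @ X, X.T)                        # projection onto the columns of X

# H0 is true: the response depends only on the intercept, not on the predictors
Y = 1.0 + rng.normal(0.0, sigma, size=(reps, n))             # one response vector per row

resid = Y - Y @ P                                            # rows are (I - P) y, since P is symmetric
rss = np.sum(resid**2, axis=1)
tss = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2, axis=1)

numerator = ((tss - rss) / p).mean()                         # estimates E[(TSS - RSS)/p]
denominator = (rss / (n - p - 1)).mean()                     # estimates E[RSS/(n - p - 1)]
print(numerator, denominator, sigma**2)
```

Since both averages approximate $\sigma^2$ under $H_0$, an observed F-statistic far above 1 is evidence against the null hypothesis.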

### $\underline{\text{References}}$:

1. James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J.  
   *An Introduction to Statistical Learning: With Applications in Python*.  
   Springer, 2023. ISBN: 978-3031387463.

2. Murphy, K. P.  
   *Probabilistic Machine Learning: An Introduction*.  
   MIT Press, 2022. ISBN: 978-0262046824.

3. Deisenroth, M. P., Faisal, A. A., & Ong, C. S.  
   *Mathematics for Machine Learning*.  
   Cambridge University Press, 2020. ISBN: 978-1108455145.


