# Regression

## ANOVA vs. regression

- Use ANOVA when all independent variables are **discrete** (usually categorical)
- Use a regression when at least some independent variables are **continuous**

## Linear model

**Linear** means that the model uses only **scalar multiplication** and **addition** of regressors. ANOVA, regression, and correlation are all linear models; together they are called the **General Linear Model (GLM)**. By contrast, **nonlinear operations** include logarithms, square roots, powers, and trigonometric functions. In a GLM, the data themselves can contain **nonlinearities**; only the parameters must enter the model linearly. Nonlinear data are nevertheless often linearized to facilitate interpretation.
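
As a small illustration (the two models here are examples chosen for this note, not taken from a particular dataset), a quadratic model is nonlinear in $x$ but linear in the parameters, whereas an exponential model is nonlinear in $\beta_1$ until it is linearized by taking logs:

$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 \quad \text{(linear in } \beta\text{)}
$$
$$
y = \beta_0 e^{\beta_1 x} \quad \Rightarrow \quad \ln{y} = \ln{\beta_0} + \beta_1 x \quad \text{(linear in } \ln{\beta_0} \text{ and } \beta_1\text{)}
$$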

## Evaluation

$$
\hat{y} = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
$$
$$
\epsilon = y - \hat{y}
$$
$$
\text{SS}_{\epsilon} = \sum (y_i - \hat{y}_i)^2
$$
$$
\text{SS}_{\text{Total}} = \sum (y_i - \bar{y})^2
$$
$$
\text{R}^2 = 1 - \frac{\text{SS}_{\epsilon}}{\text{SS}_{\text{Total}}}
$$

When a model fits the data well, $\text{SS}_{\epsilon}$ gets small, so the numerator of the ratio gets small and the ratio itself gets small. Little is subtracted from 1, so $\text{R}^2$ gets large.

$\text{R}^2$ is often used in model comparisons.
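
The formulas above can be sketched in NumPy as follows. This is a minimal illustration on synthetic data; the helper name `r_squared` and the data-generating line are assumptions made for the example.

```python
import numpy as np

def r_squared(y, y_hat):
    """Return R^2 = 1 - SS_error / SS_total."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_error = np.sum((y - y_hat) ** 2)      # residual sum of squares
    ss_total = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_error / ss_total

# Hypothetical data: a noisy straight line.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=x.size)

# Least-squares fit with an intercept column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

r2_fit = r_squared(y, X @ beta)                      # fitted line
r2_mean = r_squared(y, np.full_like(y, y.mean()))    # predicting the mean only
print(r2_fit, r2_mean)
```

Predicting the mean for every point gives $\text{SS}_{\epsilon} = \text{SS}_{\text{Total}}$, so its $\text{R}^2$ is 0; this is the baseline against which the fitted model is compared.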

## F-test evaluation

**F-test** evaluation of model statistical significance has the following null hypothesis.

$$
H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0
$$

The null hypothesis is that the beta coefficients 1 through $k$ are all equal to 0. The index starts from 1 because index 0 is the intercept term, $\beta_0$, which is not part of the test.

$$
H_A: \text{At least one }\beta \ne 0
$$

At least one beta coefficient is not 0, but the test doesn't tell us which ones or how many.

To test the above, we need to compute the F test statistic.

$$
\text{SS}_{\epsilon} = \sum (y_i - \hat{y}_i)^2
$$
$$
\text{SS}_{\text{Model}} = \sum (\hat{y}_i - \bar{y})^2
$$
$$
\text{F}_{(k - 1, N - k)} = \frac{\text{SS}_{\text{Model}} / (k - 1)}{\text{SS}_{\epsilon} / (N - k)}
$$

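This tests only the model as a whole; it doesn't say whether any individual coefficient is significant. If the F statistic is statistically significant, we can evaluate each individual $\beta$ coefficient with a **t-test**.

With more data, the F statistic tends to get larger for the same underlying effect: $\text{SS}_{\text{Model}}$ grows with $N$ while the error mean square $\text{SS}_{\epsilon} / (N - k)$ stays roughly constant, so the model becomes more significant.

Here $k$ is the total number of parameters including the intercept, and $N$ is the number of observations. Note that this count differs from the index $k$ in $\beta_0, ..., \beta_k$ above, where the intercept is counted separately.

The t statistic for coefficient $\beta_i$ divides the estimate by its standard error $s_{\beta_i}$: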

$$
t_{N - k} = \frac{\beta_i}{s_{\beta_i}}
$$
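
A minimal NumPy sketch of both statistics, on synthetic data. The function name `f_and_t_statistics` is hypothetical, and the standard error $s_{\beta_i}$ is taken as the square root of the $i$-th diagonal entry of $\text{MS}_{\epsilon} (X^T X)^{-1}$, the usual ordinary-least-squares formula, which these notes do not spell out. Converting the statistics to p-values would require the F and t distributions (for example from SciPy) and is left out here.

```python
import numpy as np

def f_and_t_statistics(X, y):
    """X must already contain a column of ones for the intercept.

    k counts all parameters including the intercept, as in the text.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta

    ss_error = np.sum((y - y_hat) ** 2)
    ss_model = np.sum((y_hat - y.mean()) ** 2)
    ms_error = ss_error / (n - k)

    f_stat = (ss_model / (k - 1)) / ms_error

    # Assumed standard-error formula: sqrt(diag(MS_error * (X^T X)^{-1})).
    se = np.sqrt(ms_error * np.diag(np.linalg.inv(X.T @ X)))
    t_stats = beta / se
    return f_stat, t_stats

# Hypothetical data: y depends on x1 but not on x2.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
f_stat, t_stats = f_and_t_statistics(X, y)
print(f_stat, t_stats)
```

With a single predictor plus intercept, the F statistic equals the square of that predictor's t statistic, which is a convenient sanity check for an implementation like this one.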

## Multiple regression

Interpretation: $\beta_i$ reflects the effect of a unit change in $x_i$ on $y$ **when all other variables are held constant**.
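
To see what "held constant" means in practice, the sketch below fits the same synthetic data twice, once with and once without a correlated second predictor. The data-generating coefficients and the helper `ols` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 is correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(scale=0.1, size=n)

def ols(columns, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_simple = ols([x1], y)        # x2 left out of the model
beta_multiple = ols([x1, x2], y)  # x2 included, so it is held constant

print(beta_simple[1])     # slope of x1 absorbs part of the effect of x2
print(beta_multiple[1:])  # slopes of x1 and x2 estimated separately
```

In the simple regression the slope of `x1` is close to $1 + 2 \times 0.8 = 2.6$ because `x1` also carries the effect of the omitted `x2`; in the multiple regression it is close to the generating value of 1.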

## Standardized regression coefficients


Raw $\beta$ coefficients from the **left-inverse** formula $\beta = (X^T X)^{-1} X^T Y$ are **unstandardized $\beta$ coefficients**, which change depending on the scale of the independent variables. They are difficult to compare across variables because each variable has a different scale or unit.

Standardized $\beta$ coefficients are in **standard deviation units**, independent of the scales of the data.

$b_k$ is the standardized $\beta_k$ coefficient. $s_{x_k}$ is the standard deviation of the independent variable $x_k$. $s_y$ is the standard deviation of the dependent variable $y$.

$$
b_k = \beta_k \frac{s_{x_k}}{s_y}
$$

That is, the unstandardized $\beta_k$ is multiplied by the standard deviation of the independent variable and divided by the standard deviation of the dependent variable.

Interpretation: $b_k$ is the number of **standard deviations** by which $y$ changes for a **one-standard-deviation change** in $x_k$ **when all other variables are held constant**.

Standardized $\beta$ coefficients can also be computed by first **z-normalizing** (subtracting the mean and dividing by the standard deviation) both the independent variables and the dependent variable, and then computing the regression coefficients.
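
The two routes to standardized coefficients can be checked against each other numerically. The sketch below uses synthetic data with deliberately different scales; the variable names and the helper `ols_with_intercept` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(loc=50, scale=10, size=n)    # large-scale variable
x2 = rng.normal(loc=0, scale=0.1, size=n)    # small-scale variable
y = 3.0 + 0.2 * x1 + 15.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

def ols_with_intercept(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Route 1: rescale the raw slopes, b_k = beta_k * s_xk / s_y.
beta = ols_with_intercept(X, y)
b_rescaled = beta[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Route 2: z-normalize everything first, then fit.
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yz = (y - y.mean()) / y.std(ddof=1)
b_zscore = ols_with_intercept(Xz, yz)[1:]

print(beta[1:])     # raw slopes differ by orders of magnitude
print(b_rescaled)   # standardized slopes are directly comparable
print(b_zscore)     # same values as b_rescaled
```

The raw slopes (about 0.2 and 15) suggest that `x2` matters far more, but this only reflects its smaller scale; the standardized slopes put both variables in the same units.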

## Polynomial regression model

The model contains nonlinearities in the regressors (powers of $x$), but it is still linear in the coefficients, so we can fit it as a standard linear model.

Increasing the order of the polynomial makes the model prone to overfitting. An appropriate order can be found with the **Bayesian information criterion (BIC)**, where $n$ is the number of data points and $k$ is the order of the polynomial.

$$
\text{BIC}_k = n \ln{(\text{SS}_{\epsilon})} + k \ln{(n)}
$$

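We pick the model with the minimum BIC. The first term decreases as the fit improves, while the penalty term $k \ln{(n)}$ increases with the order, so a higher order lowers the BIC only if it reduces $\text{SS}_{\epsilon}$ enough to offset the penalty.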
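
A minimal sketch of order selection with this BIC formula, using `np.polyfit` on synthetic data generated from a cubic. The data, the candidate range of orders 1 to 8, and the helper name `bic` are assumptions for the example.

```python
import numpy as np

def bic(y, y_hat, k):
    """BIC_k = n * ln(SS_error) + k * ln(n), with k the polynomial order."""
    n = len(y)
    ss_error = np.sum((y - y_hat) ** 2)
    return n * np.log(ss_error) + k * np.log(n)

# Hypothetical data generated from a cubic polynomial plus noise.
rng = np.random.default_rng(4)
x = np.linspace(-2, 2, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.3, size=x.size)

orders = list(range(1, 9))
bics = []
for k in orders:
    coeffs = np.polyfit(x, y, deg=k)     # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    bics.append(bic(y, y_hat, k))

best_order = orders[int(np.argmin(bics))]
print(dict(zip(orders, np.round(bics, 1))), best_order)
```

Orders below the generating order fit poorly and get a high BIC from the first term; orders above it reduce $\text{SS}_{\epsilon}$ only slightly, so the penalty term usually dominates.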

## Logistic regression

$$
\ln{\frac{p}{1 - p}} = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
$$

Because $\ln$ and $e$ cancel each other out,

$$
\frac{p}{1 - p} = e^{\beta_0 + \beta_1 x_1 + ... + \beta_k x_k}
$$

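Solving for $p$,

$$
p = \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_k x_k}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_k x_k)}}
$$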

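The **log** is taken because the odds $\frac{p}{1 - p}$ are restricted to positive values, while the log of the odds ranges over all real numbers, matching the linear predictor on the right-hand side. The log of small values also has a larger dynamic range, which is easier to work with in optimization problems.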

The expression $\frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_k x_k)}}$ is nonlinear in the coefficients, so we cannot use the left-inverse to compute the regression coefficients. Instead, we use iterative methods such as **gradient descent** to find the parameters. Just as in linear regression, the raw $\beta$ coefficients of logistic regression are unstandardized, so we cannot compare them across independent variables without standardized $\beta$ coefficients.
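
As a rough sketch of the iterative approach, the code below fits a logistic regression by plain batch gradient descent on the mean negative log-likelihood. The learning rate, iteration count, function names, and the synthetic data are all assumptions chosen for the illustration; practical libraries typically use faster solvers.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Gradient descent; X includes an intercept column, y holds 0/1 labels."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (p - y) / len(y)   # gradient of the mean negative log-likelihood
        beta -= lr * grad
    return beta

# Hypothetical data drawn from a known logistic model.
rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
true_beta = np.array([-0.5, 2.0])
X = np.column_stack([np.ones(n), x])
y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_logistic(X, y)
print(beta_hat)   # should land near true_beta, up to sampling noise
```

The update uses the fact that the gradient of the negative log-likelihood is $X^T (p - y)$, so each step moves the coefficients in the direction that reduces the gap between predicted probabilities and observed labels.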

The name comes from the logistic function that maps the linear predictor to a probability; the dependent variable is binary. When the outcome goes beyond binary to any number of categories, it is **multinomial logistic regression**.

## Control overfitting and underfitting

When you test multiple models, you are doing multiple tests, and the effective significance threshold $\alpha$ lies in $0.05 < \alpha < 0.05 \cdot k$, where $k$ is the number of tests (for independent tests, the chance of at least one false positive is $1 - (1 - 0.05)^k$). It becomes easier for some test to reach significance, because the effective threshold is larger. This flexibility is known as **researcher degrees of freedom**.

- Use separate training and testing datasets, or cross-validation.
- Visualize the data and make an informed decision.
- If the model is a polynomial regression, use the Bayesian information criterion.
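
The first item in the list can be sketched as k-fold cross-validation over candidate polynomial orders. The fold count, the synthetic cubic data, and the helper `kfold_mse` are assumptions for this example.

```python
import numpy as np

def kfold_mse(x, y, degree, n_folds=5, seed=0):
    """Mean held-out squared error of a polynomial fit across folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        coeffs = np.polyfit(x[train], y[train], deg=degree)  # fit on training folds only
        pred = np.polyval(coeffs, x[test])                   # evaluate on the held-out fold
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# Hypothetical data generated from a cubic polynomial plus noise.
rng = np.random.default_rng(6)
x = np.linspace(-2, 2, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.3, size=x.size)

cv_errors = {d: kfold_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_errors, key=cv_errors.get)
print(cv_errors, best_degree)
```

An underfit model has a high error on both training and held-out data, while an overfit model has a low training error but a higher held-out error; comparing held-out errors across orders exposes both problems.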







