# Regression Analysis


> "_Nature has established patterns originating in the return of events, but only for the most part. New illnesses flood the human race, so that no matter how many experiments you have done on corpses, you have not thereby imposed a limit on the nature of events so that in the future they could not vary._"  - **Gottfried Wilhelm Leibniz**


This notebook shows the different types of regression and how they can be used to solve various problems. Regression comes with a number of caveats and some common biases; we will explore only a handful of these in our discussion.

Before we begin with regression, we will take a dive into estimation approaches:
 
 - **Ordinary Least Squares (OLS)**
 - **Maximum Likelihood Estimation (MLE)**
 - **Bayesian (Univariate & Multivariate)**
 
and we'll make efforts to describe them in greater detail.

We will explore several types of regression, namely:

 1. **Linear Regression**
 2. **Ridge Regression**
 3. **Lasso Regression**
 4. **Bayesian Linear Regression**
 5. **Logistic Regression**
 
**NB:** This series is a summary of **Part II: Early Computer-Age Methods** found in **CASI: Computer Age Statistical Inference**.


***


## [Ordinary Least Squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares)

Ordinary Least Squares (OLS) is a statistical method for estimating the ___unknown parameters___ in a linear regression model. It chooses the ___parameters___ of a linear function of a set of ___explanatory variables___ by the principle of ___least squares___:

> *... minimizing the sum of the squares in the differences between the observed ___dependent variable___ (values of the variable being predicted) in the given dataset and those predicted by the linear function ...*

OLS is consistent when the ___regressors___ are exogenous (i.e., uncorrelated with the error term) in the linear model. By the Gauss-Markov theorem, it is also optimal among linear unbiased estimators when the errors are ___homoscedastic___ (have the same finite variance, a.k.a. homogeneity of variance) and uncorrelated with one another. Under these conditions, OLS provides ___minimum-variance mean-unbiased___ estimation.

If we add the assumption that the errors are normally distributed (i.e., follow a [Gaussian](https://en.wikipedia.org/wiki/Normal_distribution) distribution), OLS is the ___maximum likelihood estimator___.

### The linear formulation:

Suppose our data has $n$ observations $\{y_i, x_i\}^n_{i=1}$, where each observation $i$ includes a scalar response $y_i$ and a column vector $x_i$ of values of $p$ predictors (regressors) $x_{ij}$ for $j = 1, ... , p$. In a linear regression model, the response variable, $y_i$, is a linear function of the regressors:

> $y_i = \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_px_{ip} + \epsilon_i,$

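> or, in matrix form, $y = \mathbf{X}\beta + \epsilon,$ where $y$ stacks the $n$ responses and $\mathbf{X}$ is the $n \times p$ matrix whose rows are $x_i^T$.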

A measure of the overall model fit is given by the ___Residual Sum of Squares (RSS)___:

> $S(b) = \sum^{n}_{i=1}(y_i - x_i^Tb)^2 = (y - Xb)^T(y - Xb),$ where $b$ is a candidate value for $\beta$ and $^T$ denotes the transpose.
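
For readers who want the intermediate step, the minimization of $S(b)$ can be sketched as follows. This assumes $X$ has full column rank, so that $X^TX$ is invertible (an assumption not stated above):

> $\frac{\partial S}{\partial b} = -2X^T(y - Xb) = 0 \;\Rightarrow\; X^TXb = X^Ty \;\Rightarrow\; \hat{\beta} = (X^TX)^{-1}X^Ty.$

The same minimizer appears under normal errors. If we additionally assume $\epsilon_i \sim N(0, \sigma^2)$ independently, the log-likelihood is

> $\ell(b, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{S(b)}{2\sigma^2},$

so for any fixed $\sigma^2$, maximizing $\ell$ over $b$ is the same as minimizing $S(b)$.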

After estimating $\beta$, the ___fitted values___ (or ___predicted values___) from the regression will be

> $\hat{y} = X\hat{\beta} = Py,$ where $P = X(X^TX)^{-1}X^T$.
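
As a rough illustration of the fitted values and the matrix $P$, the sketch below uses NumPy on synthetic data. The data, the seed, and names such as `beta_true` and `rss` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (illustrative assumption): n = 50 observations, p = 3 regressors.
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)


def rss(b):
    """Residual sum of squares S(b) = (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return float(r @ r)


# Solve the normal equations (X^T X) b = X^T y without forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Projection ("hat") matrix P = X (X^T X)^{-1} X^T, so that y_hat = P y.
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y

print("beta_hat:", beta_hat)
print("RSS at beta_hat: ", rss(beta_hat))
print("RSS at beta_true:", rss(beta_true))
```

Since $\hat{\beta}$ minimizes $S(b)$, the RSS at `beta_hat` cannot exceed the RSS at any other coefficient vector, including the one used to generate the data.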

It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto $X$. The ___coefficient of determination___ $\bf{R^2}$ is defined as a ratio of "explained" variance to the "total" variance of the dependent variable $y$:

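> $R^2 = \frac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \bar{y})^2} = \frac{y^TP^TLPy}{y^TLy} = 1 - \frac{y^TMy}{y^TLy} = 1 - \frac{RSS}{TSS},$ where TSS is the ___total sum of squares___ for the dependent variable, $L = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix (with $\mathbf{1}$ an $n \times 1$ vector of ones), and $M = I_n - P$. These equalities hold when the regression includes a constant.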
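
The two ways of writing $R^2$ can be checked numerically. The sketch below is a minimal example on made-up data, assuming a design matrix that includes a column of ones. Here `M` is taken to be $I_n - P$, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative assumption): one regressor plus a constant.
n = 40
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])  # the column of ones represents the constant

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix
M = np.eye(n) - P                        # residual-maker matrix, M = I - P
L = np.eye(n) - np.ones((n, n)) / n      # centering matrix, L = I - 11^T / n

y_hat = P @ y

# "Explained over total" form.
r2_explained = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# "One minus RSS over TSS" form, written with the matrices M and L.
r2_matrix = 1.0 - (y @ M @ y) / (y @ L @ y)

print("R^2 (explained / total):", r2_explained)
print("R^2 (1 - RSS / TSS):    ", r2_matrix)
```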


***


## [Maximum Likelihood Estimation (MLE)](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation)

Maximum likelihood estimation (MLE) is arguably the twentieth century's most influential piece of applied mathematics, and continues to be the method of choice in the statistician's toolkit. Generally, MLE provides nearly unbiased estimates of nearly minimum variance in an automatic way. 

**CASI**, _page 91_, discusses the inadequacies and dangers of MLE, in particular the cost of unbiasedness when there are hundreds or thousands of parameters to estimate at the same time. The [___James-Stein Estimator___](https://en.wikipedia.org/wiki/James_stein_estimator) illustrated this point dramatically in 1961 using a few unknown parameters, not hundreds or thousands. Discussing the issues with MLE leads to the story of _shrinkage estimation_: the deliberate introduction of bias to improve overall performance, at a possible danger to individual estimates.
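
To make the shrinkage idea concrete, here is a small simulation sketch. It assumes the textbook setting $x_i \sim N(\mu_i, 1)$ with the James-Stein estimator shrinking towards zero, $\hat{\mu}^{JS} = \left(1 - \frac{p - 2}{\|x\|^2}\right)x$. That formula is the standard form and is not stated in this notebook. The true means, seed, and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

p = 10           # number of means estimated at once (James-Stein needs p >= 3)
n_trials = 5000  # number of simulated data sets
mu = rng.normal(size=p)  # hypothetical true means, held fixed across trials

# One observation x_i ~ N(mu_i, 1) per coordinate; the MLE of mu is x itself.
x = mu + rng.normal(size=(n_trials, p))
mle = x

# James-Stein estimator (assumed standard form): shrink every coordinate towards 0.
shrink = 1.0 - (p - 2) / np.sum(x ** 2, axis=1, keepdims=True)
js = shrink * x

# Average total squared error over the simulated data sets.
mse_mle = np.mean(np.sum((mle - mu) ** 2, axis=1))
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))

print("total squared error, MLE:        ", mse_mle)
print("total squared error, James-Stein:", mse_js)
```

The comparison is on total error across all $p$ coordinates. An individual coordinate can still be estimated worse by the shrunken estimator, which is the "possible danger to individual estimates" mentioned above.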


***


## [Bayesian Estimation](https://en.wikipedia.org/wiki/Bayesian_estimation)



### 1. Linear Regression

#### 1.1. Simple Linear Regression Model

When the data matrix contains only two variables, a constant and a scalar regressor $x_i$, we have a ___simple regression model___ with parameters $(\alpha, \beta)$:

> $y_i = \alpha + \beta x_i + \epsilon_i.$

The least squares estimates in this case are given by:

> $\hat{\beta} = \frac{\sum x_iy_i - \frac{1}{n}\sum x_i \sum y_i}{\sum x_{i}^{2} - \frac{1}{n}(\sum x_i)^2} = \frac{\operatorname{Cov}[x, y]}{\operatorname{Var}[x]}$

> $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x},$

where $\operatorname{Var}[\cdot]$ and $\operatorname{Cov}[\cdot]$ denote the sample variance and sample covariance.
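
The closed-form estimates can be computed directly. The sketch below uses a tiny made-up data set (the numbers are purely illustrative) and evaluates the slope both from the sums and from the sample covariance and variance.

```python
import numpy as np

# Tiny illustrative data set (made up for this sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Slope from the sums in the formula above.
beta_hat = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (
    np.sum(x ** 2) - np.sum(x) ** 2 / n
)

# Intercept from the sample means.
alpha_hat = y.mean() - beta_hat * x.mean()

# Same slope from the sample covariance and variance (ddof=1 in both,
# so the normalizing constants cancel).
beta_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print("beta_hat: ", beta_hat)
print("alpha_hat:", alpha_hat)
print("beta from Cov/Var:", beta_cov)
```

As a cross-check, `np.polyfit(x, y, 1)` fits the same straight line by least squares and returns the slope followed by the intercept.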

### 2. Ridge Regression

Ridge regression starts from the same setup as linear regression, which is itself based on a version of $\hat{\mu}^{MLE}$. We observe an _n_-dimensional vector $y = (y_1, y_2, ..., y_n)'$ from the linear model:

> $y = \bf{X}\beta + \epsilon,$

where $\epsilon$ has uncorrelated components and constant variance $\sigma^2$. The _least squares estimate_ $\hat{\beta}$ is the minimizer of the total sum of squared errors,

> $\hat{\beta} = \arg\min_{\beta} \{\|y - \mathbf{X}\beta\|^2\},$

> given by $\hat{\beta} = \mathbf{S}^{-1} \mathbf{X}' y,$ where $\mathbf{S} = \mathbf{X}'\mathbf{X}$ is the $p \times p$ inner product matrix.
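
The least squares formula can be sketched in NumPy as below. The `ridge` helper is an assumption on my part: it uses the standard ridge form $(\mathbf{S} + \lambda I)^{-1}\mathbf{X}'y$, which the text above has not yet stated, only to hint at how a penalty $\lambda$ shrinks the coefficients. The data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (illustrative assumption): n = 30 observations, p = 4 predictors.
n, p = 30, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=n)

S = X.T @ X                              # p x p inner product matrix
beta_ols = np.linalg.solve(S, X.T @ y)   # least squares estimate S^{-1} X' y


def ridge(lam):
    """Assumed standard ridge form: (S + lam * I)^{-1} X' y (hypothetical helper)."""
    return np.linalg.solve(S + lam * np.eye(p), X.T @ y)


lams = (0.0, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge(lam))) for lam in lams]

for lam, norm in zip(lams, norms):
    print(f"lambda = {lam:6.1f}  ->  ||beta|| = {norm:.4f}")
```

With $\lambda = 0$ the helper reproduces the least squares estimate, and the coefficient norm decreases as $\lambda$ grows.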

#### A digression on the assumptions of a regression model

An assumption of the fitted model is that the ___standard deviations___ of the error terms are constant and do not depend on the x-value (the predictor); this is homoscedasticity. Homoscedasticity is not required for the coefficient estimates to be unbiased, consistent and asymptotically normal. It is required, however, for OLS to be efficient and for the usual standard errors to be valid.
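
A quick simulation can illustrate the unbiasedness claim. The sketch below assumes an error standard deviation that grows with the predictor (a deliberately heteroscedastic setup chosen for illustration) and checks that the OLS slope still averages out to the true value. All parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

n, n_sims = 200, 2000
alpha, beta = 1.0, 2.0                  # hypothetical true intercept and slope
x = rng.uniform(1.0, 5.0, size=n)       # design held fixed across simulations
X = np.column_stack([np.ones(n), x])

# Error standard deviation grows with x, so the errors are heteroscedastic.
sigma = 0.5 * x
eps = rng.normal(size=(n_sims, n)) * sigma
Y = alpha + beta * x + eps              # shape (n_sims, n), one data set per row

# Fit every simulated data set at once: each column of Y.T is one response vector.
coefs, *_ = np.linalg.lstsq(X, Y.T, rcond=None)   # shape (2, n_sims)
slopes = coefs[1]

print("true slope:          ", beta)
print("mean estimated slope:", slopes.mean())
print("std of estimates:    ", slopes.std(ddof=1))
```

The average slope stays close to the true value even though the error variance is not constant. What changes under heteroscedasticity is the spread of the estimates and the validity of the textbook standard-error formula, which this sketch does not evaluate.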