# Linear Regression

Linear regression allows us to model and predict the behavior of continuous variables. In this module, we will study the ordinary least squares (OLS) method for estimating the parameters of a linear regression model and look at examples of its use.

## Ordinary Least Squares Simple Linear Regression

The simple linear regression model is given by
![alt](extras/lm4.png)
where y is the dependent variable, x is the independent variable, e is the random error term, and B1 and B2 are the regression parameters. 

Assumptions of a simple linear regression model: 
1. The mean value of y, for each value of x, is given by the linear regression function:
![alt](extras/lm1.png)
2. For each value of x, the values of y are distributed about their mean value, following probability distributions that all have the same variance, 
![alt](extras/lm2.png)
3. The sample values of y are all uncorrelated and have zero covariance, implying that there is no linear association among them, 
![alt](extras/lm3.png)
This assumption can be made stronger by assuming that the values of y are all statistically independent. 

4. The variable x is not random and must take at least two different values. 

5. The values of y are normally distributed about their mean for each value of x (optional).
6. The value of y, for each value of x, is
![alt](extras/lm4.png)
7. The expected value of the random error e is 
![alt](extras/lm5.png)
which is equivalent to assuming that 
![alt](extras/lm6.png)
8. The variance of the random error e is
![alt](extras/lm7.png)
The random variables y and e have the same variance because they differ only by a constant. 
9. The covariance between any pair of random errors ei and ej is
![alt](extras/lm8.png)
The stronger version of this assumption is that the random errors e are statistically independent, in which case the values of the dependent variable y are also statistically independent. 
10. The variable x is not random and must take on at least 2 different values. 
11. The values of e are normally distributed about their mean if the values of y are normally distributed, and vice versa. 
![alt](extras/lm9.png)

The least squares method attempts to find the line through our data that minimizes the sum of squared residuals. A residual is the difference between the observed and predicted values of our outcome variable:
![alt](extras/lm11.png)
and the sum of squared residuals is
![alt](extras/lm12.png)

The least squares estimators b1 and b2 are 
![alt](extras/lm13.png)
and under assumptions 6-10 above, the Gauss-Markov theorem states that b1 and b2 have the smallest variance of all linear unbiased estimators of B1 and B2. In other words, they are the best linear unbiased estimators.
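
As a minimal sketch of the closed-form estimators, assuming (as the notation above suggests) that b1 is the intercept and b2 the slope, the estimates can be computed from the sample means and the centered sums of squares and cross-products. The function name `ols_simple` and the five data points are made up for illustration.

```python
def ols_simple(x, y):
    """Return (b1, b2): intercept and slope of the least squares line."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sum of squares of x and sum of cross-products of x and y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must take at least two different values")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b2 = sxy / sxx              # slope
    b1 = y_bar - b2 * x_bar     # intercept
    return b1, b2


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b1, b2 = ols_simple(x, y)
residuals = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)   # sum of squared residuals
print(b1, b2, sse)
```

The check on `sxx` mirrors the assumption that x must take at least two different values: without it the slope is not defined.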

#### The Normality Assumption
Our OLS hypothesis tests and interval estimates for the coefficients rely on the assumption that the errors, and hence the dependent variable y, are normally distributed. Although the tests and interval estimates are valid in large samples regardless of whether the errors are normally distributed, we might still want to find an alternative functional form or transform the dependent variable in order to improve our model. We can test for normality using the Jarque-Bera test, which evaluates the skewness (asymmetry) and kurtosis ('peakedness' of the distribution) of the residuals. 
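
The Jarque-Bera statistic can be sketched in a few lines. This sketch uses the common form JB = n/6 · (S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis of the residuals. Under the null hypothesis of normality the statistic is approximately chi-square distributed with 2 degrees of freedom in large samples, so values above roughly 5.99 reject normality at the 5% level. The function name and the residuals are hypothetical.

```python
def jarque_bera(residuals):
    """Return (JB statistic, skewness, kurtosis) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments (dividing by n, not n - 1)
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2          # equals 3 for a normal distribution
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, skew, kurt


# Hypothetical residuals, far too few for the asymptotic test to be reliable
jb, skew, kurt = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
print(jb, skew, kurt)
```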

## Multiple Linear Regression Model

The multiple linear regression model generalizes the simple model to several explanatory variables:
![alt](extras/lm17.png)
The main assumptions of the multiple regression model are:
![alt](extras/lm18.png)
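
For illustration, the least squares estimates in the multiple regression model can be obtained by solving the normal equations (X'X) b = X'y. The sketch below does this in plain Python with Gauss-Jordan elimination; it assumes the first coefficient is the intercept, and the function name `ols_multiple` and the data are hypothetical. In practice a numerical library would be used instead.

```python
def ols_multiple(X, y):
    """Least squares coefficients [intercept, slope_1, ..., slope_m].

    X is a list of rows of explanatory variables (without a constant).
    """
    rows = [[1.0] + list(r) for r in X]          # prepend the constant
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y, stored as an augmented matrix
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are exactly collinear")
        a[col], a[pivot] = a[pivot], a[col]
        for i in range(k):
            if i != col:
                factor = a[i][col] / a[col][col]
                a[i] = [v - factor * p for v, p in zip(a[i], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical noise-free data generated from y = 1 + 2*x2 + 3*x3
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
coefs = ols_multiple(X, y)
print(coefs)
```

The error raised for a zero pivot corresponds to exact collinearity, the one case in which the least squares estimates cannot be computed.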

## Model Specification and Evaluation
Choosing the correct model is a process of combining intuition with empirical observations about the behavior of the model. This section covers the essential tests that can be performed in order to make sure that we have chosen the correct variables and have the optimal model. 

### t-test
In regression output, the t-statistic and p-value columns test, for each coefficient separately, whether that coefficient might be equal to zero. The t-statistic is calculated simply as 
![alt](https://wikimedia.org/api/rest_v1/media/math/render/svg/706d1c514396be8e7301a23ab369cdcf5b1c5096)
If the errors ε follow a normal distribution, t follows a Student-t distribution. Under weaker conditions, t is asymptotically normal. Large absolute values of t indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The p-value column expresses the result of the hypothesis test as a significance level. Conventionally, p-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero.
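
As a rough sketch for the simple regression case, the t-statistic of the slope is the estimate divided by its standard error. The error variance is estimated by SSE/(n − 2) and the variance of the slope by that quantity divided by the centered sum of squares of x. The function name and data are hypothetical.

```python
import math


def slope_t_statistic(x, y):
    """Return (t, se) for the slope of a simple least squares regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b1 = y_bar - b2 * x_bar
    sse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = sse / (n - 2)            # estimated error variance
    se_b2 = math.sqrt(sigma2_hat / sxx)   # standard error of the slope
    return b2 / se_b2, se_b2


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
t_stat, se = slope_t_statistic(x, y)
print(t_stat, se)
```

The resulting t would be compared with a Student-t critical value with n − 2 degrees of freedom.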

### F-Statistic
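The F-statistic tests the hypothesis that all coefficients (except the intercept) are equal to zero. Under the null hypothesis and the normality assumption, this statistic has an F(p–1, n–p) distribution, where p is the number of estimated parameters (including the intercept) and n is the number of observations. Its p-value is the probability, if the null hypothesis were true, of observing an F-statistic at least as large as the one obtained; it is not the probability that the hypothesis is true. Note that when the errors are not normal, the F distribution no longer holds exactly in small samples, and other tests such as the Wald test or the likelihood ratio (LR) test should be used.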
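
A minimal sketch of the overall F-statistic, assuming the model contains an intercept and was fitted by least squares (so that the explained sum of squares equals the total minus the residual sum of squares). Here p counts all estimated parameters including the intercept. The function name is hypothetical, and the fitted values are those of the least squares line 2.2 + 0.6x for x = 1, ..., 5.

```python
def overall_f(y, fitted, p):
    """F-statistic for H0: all coefficients except the intercept are zero."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual
    explained = sst - sse
    return (explained / (p - 1)) / (sse / (n - p))


# Hypothetical observations and their least squares fitted values
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
f_stat = overall_f(y, fitted, p=2)
print(f_stat)
```

With a single explanatory variable the F-statistic equals the square of the slope's t-statistic, which is a useful consistency check.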

### Measuring goodness of fit  
In order to understand how much of the variation in the observed value y is explained by the predicted value yhat, we decompose the total sum of squares (SST) into the sum of squares due to the regression (SSR) plus the sum of squares due to error (SSE):
![alt](extras/lm14.png)
which becomes
![alt](extras/lm15.png)
We can now define a measure called the coefficient of determination, or R^2, which is the proportion of variation in y explained by x within the regression model:
![alt](extras/lm16.png)
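
The decomposition can be checked numerically. The sketch below computes SST, SSR and SSE for a small hypothetical data set whose fitted values come from the least squares line 2.2 + 0.6x, and then forms R^2 as SSR/SST. The function name is an assumption made for illustration.

```python
def sums_of_squares(y, fitted):
    """Return (SST, SSR, SSE) for observed values and fitted values."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error
    return sst, ssr, sse


# Hypothetical observations and their least squares fitted values
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]

sst, ssr, sse = sums_of_squares(y, fitted)
r_squared = ssr / sst
print(sst, ssr, sse, r_squared)
```

The identity SST = SSR + SSE holds only for least squares fits that include an intercept; for other fitted values the two sides generally differ.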

### Adjusted R-Squared

Adjusted R-squared is a slightly modified version of R^2, designed to penalize regressors that do not add to the explanatory power of the regression. This statistic is never larger than R^2, can decrease as new regressors are added, and can even be negative for poorly fitting models:
![alt](https://wikimedia.org/api/rest_v1/media/math/render/svg/7ec4559807623b855036fce5201f9e8b6c7aca4b)
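
As a small sketch, adjusted R-squared can be computed from R^2, the sample size n, and the number of estimated parameters p (counting the intercept), using 1 − (1 − R^2)(n − 1)/(n − p). The function name and the input values are hypothetical.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for n observations and p parameters (with intercept)."""
    if n <= p:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p)


# A reasonable fit: the penalty lowers 0.6 to about 0.467
adj_good = adjusted_r_squared(0.6, n=5, p=2)
# A poor fit with an extra regressor: the result is negative
adj_poor = adjusted_r_squared(0.1, n=5, p=3)
print(adj_good, adj_poor)
```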

### Omitted Variables
Omitting an important explanatory variable from our equation results in biased coefficient estimates, but a reduced variance. However, the bias arises only if the sample covariance between the omitted variable and the remaining variable is not zero. If the covariance is zero, then the least squares estimator in the misspecified model is still unbiased. 
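
The role of the covariance can be illustrated with noise-free, made-up data generated from y = 1 + 2·x2 + 3·x3. When x3 is omitted, the slope on x2 shifts by 3 times cov(x2, x3)/var(x2); when x3 is uncorrelated with x2 in the sample, the slope is unaffected. All names and numbers below are hypothetical.

```python
def slope(x, y):
    """Slope of the least squares line of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def make_y(x2, x3):
    # True model without noise: y = 1 + 2*x2 + 3*x3
    return [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x2, x3)]


x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3_correlated = [1.0, 3.0, 2.0, 5.0, 4.0]       # sample covariance with x2 > 0
x3_uncorrelated = [1.0, -2.0, 2.0, -2.0, 1.0]   # sample covariance with x2 = 0

# Regress y on x2 only, omitting x3 in both cases
slope_when_correlated = slope(x2, make_y(x2, x3_correlated))
slope_when_uncorrelated = slope(x2, make_y(x2, x3_uncorrelated))
print(slope_when_correlated, slope_when_uncorrelated)
```

In the correlated case the estimated slope absorbs part of the effect of x3 and moves away from the true value of 2.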

### Irrelevant Variables
The inclusion of irrelevant variables (often identified by large p-values) can result in increased standard errors of the coefficients estimated for all the variables in the model (thus increasing individual p-values). It also results in a reduced precision of the estimated coefficients for the relevant variables in the equation.

### Collinearity
When two variables move together in systematic ways, they are said to be collinear, and the problem is labeled collinearity. When two variables are perfectly correlated, we have exact (or extreme) collinearity. Collinearity results in a large variance of the estimator, which means a large standard error, which in turn means the estimate may not be significantly different from zero and the interval estimate will be wide. However, it is important to know that non-exact collinearity does not violate any assumption of the least squares model. Additionally, even though collinearity can make it difficult to isolate the effects of the individual variables, accurate forecasts may still be possible if the nature of the collinear relationship remains the same within the out-of-sample observations.  

To test for collinearity, one can look at the correlation coefficients between pairs of explanatory variables. In cases where collinear relationships involve more than two explanatory variables, we can estimate 'auxiliary regressions', where the left-hand side is one of the explanatory variables and the right-hand side variables are the remaining explanatory variables: 
![alt](extras/lm19.png)
If the R^2 from this artificial model is high (above 0.80, say), the implication is that a large portion of the variation in x2 is explained by variation in the other explanatory variables, meaning the precision of b2 is likely negatively affected by this collinearity. 
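
A simplified sketch of an auxiliary regression follows, restricted to the case of a single remaining explanatory variable so that it fits in a few lines (with several remaining variables a multiple regression would take its place). The function name, the data, and the variable names x2 and x3 are hypothetical.

```python
def auxiliary_r_squared(x_left, x_right):
    """R^2 from regressing one explanatory variable on another."""
    n = len(x_left)
    left_bar = sum(x_left) / n
    right_bar = sum(x_right) / n
    sxx = sum((v - right_bar) ** 2 for v in x_right)
    sxy = sum((v - right_bar) * (u - left_bar)
              for v, u in zip(x_right, x_left))
    slope = sxy / sxx
    intercept = left_bar - slope * right_bar
    sse = sum((u - intercept - slope * v) ** 2
              for u, v in zip(x_left, x_right))
    sst = sum((u - left_bar) ** 2 for u in x_left)
    return 1.0 - sse / sst


# Hypothetical explanatory variables; x3 is roughly twice x2
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [2.1, 3.9, 6.2, 7.8, 10.1]

aux_r2 = auxiliary_r_squared(x2, x3)
print(aux_r2, aux_r2 > 0.80)
```

With only one right-hand side variable, this R^2 is simply the squared correlation coefficient between the two variables.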

One way to reduce the negative effects of collinearity is to collect more and better sample data. This gives the model more 'information' and allows it to estimate the parameters more precisely. The other method is to use non-sample data in the form of linear constraints on the parameters (such as a constraint that all the parameters have to add up to 1). However, although this method reduces estimator sampling variability, it also increases the estimator bias, unless the constraints are exactly true. 

### Heteroskedasticity
One assumption of the fitted model (needed for the least squares estimators to be the best linear unbiased estimators of the population parameters, by the Gauss-Markov theorem) is that the standard deviation of the error term is constant and does not depend on the x-value. Consequently, each probability distribution for y has the same standard deviation regardless of the x-value (predictor). In short, this assumption is homoskedasticity. Homoskedasticity is not required for the coefficient estimates to be unbiased, consistent, and asymptotically normal. However, in the presence of heteroskedasticity the ordinary least squares estimator is inefficient (it no longer has the smallest variance among linear unbiased estimators), and the usual formulas for the variances and covariances of the estimates are biased. Biased standard errors lead to biased inference, so the results of hypothesis tests may be wrong.

One can visually identify heteroskedasticity by plotting the residuals against the independent variable (in the case of simple linear regression) or against the fitted values (yhat). The most popular statistical methods for detecting heteroskedasticity are the White Test or the Breusch-Pagan (Lagrange multiplier) test. 
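
One common form of the Breusch-Pagan idea can be sketched for simple regression as follows: fit the model, regress the squared residuals on x, and use LM = n · R^2 from that second regression. Under the null hypothesis of constant variance, LM is approximately chi-square distributed with 1 degree of freedom in large samples (5% critical value about 3.84). This is an illustrative variant under stated assumptions, not a full implementation of the test; the function names and data are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope, R^2) of the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b1 = y_bar - b2 * x_bar
    sse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1.0 - sse / sst if sst > 0 else 0.0
    return b1, b2, r2


def breusch_pagan_lm(x, y):
    """LM statistic n * R^2 from regressing squared residuals on x."""
    b1, b2, _ = fit_line(x, y)
    squared_residuals = [(yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y)]
    _, _, aux_r2 = fit_line(x, squared_residuals)
    return len(x) * aux_r2


# Hypothetical data whose scatter widens as x grows
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.3, 3.6, 5.8, 5.0, 8.5, 6.2]
lm = breusch_pagan_lm(x, y)
print(lm)
```

With only eight observations the chi-square approximation is unreliable, so the number printed here should be read as a demonstration of the mechanics only.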

There are several common corrections for heteroskedasticity. They are:
- View logarithmized data. Non-logarithmized series that are growing exponentially often appear to have increasing variability, random volatility, or volatility clusters as the series rises over time. The variability in percentage terms may, however, be rather stable. The reason for this is that the likelihood function for exponentially growing data lacks a variance. Using regression, the maximum likelihood estimator is the least squares estimator, a form of the sample mean, but the sampling distribution of the estimator is the Cauchy distribution. The Cauchy distribution has no variance and so there is no fixed point for the sample variance to converge to causing it to behave as a random number. Taking the logarithm of the data converts the likelihood function to the hyperbolic secant distribution, which has a defined variance.
- Use a different specification for the model (different X variables, or perhaps non-linear transformations of the X variables).
- Apply a weighted least squares estimation method, in which OLS is applied to transformed or weighted values of X and Y. The weights vary over observations, usually depending on the changing error variances. In one variation the weights are directly related to the magnitude of the dependent variable, and this corresponds to least squares percentage regression.
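
The weighted least squares correction can be sketched for simple regression. The estimator minimizes the weighted sum of squared residuals, which has the same closed form as OLS with weighted means and weighted sums of squares. The weights 1/x used below assume, purely for illustration, that the error variance is proportional to x; the function name and data are hypothetical.

```python
def wls_simple(x, y, w):
    """Weighted least squares intercept and slope for weights w > 0."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, x, y))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Assumed variance structure: var(e_i) proportional to x_i
weights = [1.0 / xi for xi in x]
b1_wls, b2_wls = wls_simple(x, y, weights)

# Equal weights reproduce ordinary least squares
b1_ols, b2_ols = wls_simple(x, y, [1.0] * len(x))
print(b1_wls, b2_wls, b1_ols, b2_ols)
```

Observations with larger assumed error variance receive smaller weights, so they pull the fitted line less than they would under OLS.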

### Testing Nested Models using the F-Test
- Two models are nested if both contain the same terms and one has at
least one additional term.
- Example:  
y = β0 + β1x1 + β2x2 + β3x1x2 + e (1)  
y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2² + e (2)  
- Model (1) is nested within model (2).
- Model (1) is the reduced model and model (2) is the full model.
- How do we decide whether the more complex (full) model contributes additional information about the association between y and the predictors?
- In the example above, this is equivalent to testing H0 : β4 = β5 = 0 versus Ha : at least one of β4, β5 ≠ 0.
- The test consists of comparing the SSE for the reduced model (SSER) with the SSE for the complete model (SSEC).
- SSER ≥ SSEC always, so the question is whether the drop in SSE from fitting the complete model is 'large enough'.
- We use an F-test to compare nested models, one with k parameters (reduced) and another one with k + p parameters (complete or full).
- Hypotheses: H0 : βk+1 = βk+2 = ... = βk+p = 0 versus Ha : at least one of these β ≠ 0.
- Test statistic: F = ((SSER − SSEC) / p) / (SSEC / [n − (k + p + 1)]), where p is the number of additional βs.
- At level α, we compare the F-statistic to the critical value Fα,ν1,ν2 from an F table, where ν1 = p and ν2 = n − (k + p + 1).
- If F ≥ Fα,ν1,ν2, reject H0.
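
The steps above can be sketched as a small function. The function name and all of the numbers are hypothetical; k counts the slope parameters of the reduced model and p the additional ones in the complete model, so the complete model has k + p + 1 parameters including the intercept.

```python
def nested_f_statistic(sse_reduced, sse_complete, n, k, p):
    """F-statistic for H0: the p additional coefficients are all zero."""
    if sse_reduced < sse_complete:
        raise ValueError("the reduced model cannot have the smaller SSE")
    df1 = p                      # numerator degrees of freedom
    df2 = n - (k + p + 1)        # denominator degrees of freedom
    if df2 <= 0:
        raise ValueError("not enough observations for the complete model")
    f_stat = ((sse_reduced - sse_complete) / df1) / (sse_complete / df2)
    return f_stat, df1, df2


# Hypothetical values: SSE drops from 120 to 100 when 2 terms are added
f_stat, df1, df2 = nested_f_statistic(120.0, 100.0, n=50, k=3, p=2)
print(f_stat, df1, df2)
```

The returned degrees of freedom are the ones needed to look up the critical value in an F table.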

### Testing Non-Nested Models
To compare non-nested models, we can use the Cox test, the Davidson-MacKinnon J test, or the encompassing test of Davidson & MacKinnon.