Multiple Linear Regression
===========================

Multiple linear regression is among the most widely used statistical data analysis techniques. These models fit an additive function relating a target variable to a set of predictor variables. This additive function is a sum of terms of the form $\beta_i \times X_i$, where $X_i$ is a predictor variable and $\beta_i$ is a coefficient estimated from the data.

There is no predefined way of handling missing values for this type of modeling technique, so incomplete cases have to be dropped or filled in before the model is fitted.
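
As a minimal sketch of the simplest option (dropping incomplete rows), the example below uses the built-in `airquality` data set as a stand-in, since the algae data is not defined in these notes. The variable names `complete`, `fit.complete` and `fit.omit` are illustrative only.

```r
# airquality ships with R and has missing values in Ozone and Solar.R
data(airquality)

# Option 1: drop incomplete rows explicitly before fitting
complete <- airquality[complete.cases(airquality), ]
fit.complete <- lm(Ozone ~ ., data = complete)

# Option 2: let lm() drop them through its na.action argument
fit.omit <- lm(Ozone ~ ., data = airquality, na.action = na.omit)

# Both fits end up using the same rows, so the coefficients agree
all.equal(coef(fit.complete), coef(fit.omit))

# Number of rows actually used versus rows available
c(used = nobs(fit.omit), available = nrow(airquality))
```

Dropping rows discards information; filling in the missing values is the alternative, and it has to be done before calling `lm()`.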

The following call obtains a linear regression model for predicting the frequency of one of the algae (`a1`) from the other variables in the first 12 columns of `clean.algae`:

```
> lm.a1 <- lm(a1 ~., data = clean.algae[,1:12])
```
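
In a formula such as `a1 ~ .`, the dot stands for every column of `data` other than the target. The sketch below illustrates this on the built-in `mtcars` data set, used here only as a stand-in because `clean.algae` is not defined in these notes; `cars`, `fit.dot` and `fit.explicit` are illustrative names.

```r
# mtcars is a stand-in for clean.algae
cars <- mtcars[, 1:4]                       # columns: mpg, cyl, disp, hp

# "." expands to all columns of the data except the target
fit.dot <- lm(mpg ~ ., data = cars)

# The same model with the predictors written out
fit.explicit <- lm(mpg ~ cyl + disp + hp, data = cars)

names(coef(fit.dot))                        # intercept plus one term per predictor
all.equal(coef(fit.dot), coef(fit.explicit))
```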

Applying the function *summary()* to a linear model gives diagnostic information about the fitted model:

- Information concerning the residuals (i.e., the errors) of the fit of the linear model to the data used. The residuals should have mean zero and be approximately normally distributed (and obviously be as small as possible!)
- For each coefficient (variable) of the multiple regression equation, R shows its value and its standard error (an estimate of the variability of the coefficient).
- For each coefficient, a test of the hypothesis that it is null, that is, $H_0: \beta_i = 0$. The [_t-test_](http://en.wikipedia.org/wiki/Student%27s_t-test) is used for this. R shows a column $Pr(>|t|)$ with the p-value of each coefficient, i.e., the smallest significance level at which the hypothesis that the coefficient is null would be rejected.
- $R^2$ coefficients (multiple and adjusted) indicate the degree of fit of the model to the data, that is, the proportion of variance in the data that is explained by the model. Values near 1 are better.
- We can test the null hypothesis that there is no dependence of the target variable on any of the explanatory variables, that is, $H_0 : \beta_1 = \beta_2 = \dots = \beta_m = 0$. The F-statistic is used for this purpose by comparing it to a critical value. R provides the p-value of this test. A p-value of 0.0001 means that, if the null hypothesis were true, an F-statistic at least this large would be observed with probability 0.0001, so the null hypothesis is rejected at the 99.99% confidence level. If the model fails this test ($p > 0.1$), it makes no sense to look at the _t-tests_ on the individual coefficients.
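
The quantities listed above can also be pulled out of the summary object programmatically. The following is a sketch on the built-in `mtcars` data (a stand-in for the algae data, with an arbitrarily chosen set of predictors); `fit`, `s`, `f` and `p.overall` are illustrative names.

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

# Residuals: the mean is zero up to rounding error for a model with intercept
mean(residuals(fit))

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
s$coefficients

# Multiple and adjusted R^2
s$r.squared
s$adj.r.squared

# F-statistic with its numerator and denominator degrees of freedom
f <- s$fstatistic

# p-value of the overall F-test: upper tail of the F distribution
p.overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
p.overall
```

`summary()` prints the overall p-value but does not store it as a component, which is why it is recomputed here from the F-statistic with `pf()`.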

How to calculate $R^2$ (Coefficient of determination)
------------------------------------------------------

1. [Wikipedia page on $R^2$](http://en.wikipedia.org/wiki/Coefficient_of_determination)

2. Calculation

A data set has $n$ values marked $y_1,\dots,y_n$, each associated with a predicted value $f_1,\dots,f_n$.

If $\bar y$ is the mean of the observed data:

$$\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$$

then the variability of the data set can be measured using three [sum of squares](http://mathworld.wolfram.com/SumofSquaresFunction.html) formulas:

* The total sum of squares (proportional to the variance of the data)

$$ SS_{tot} = \sum_i (y_i - \bar y)^2 $$

* The regression sum of squares, also called the explained sum of squares

$$ SS_{reg} = \sum_i (f_i - \bar y)^2 $$

* The sum of squares of residuals, also called the residual sum of squares

$$ SS_{res} = \sum_i (y_i - f_i)^2 $$

The most general definition of the coefficient of determination is

$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
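
The three sums of squares can be computed by hand and checked against the value R reports. This is a sketch on the built-in `mtcars` data, used as a stand-in for the algae data; the variable names mirror the symbols above.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

y <- mtcars$mpg          # observed values y_i
f <- fitted(fit)         # predicted values f_i
y.bar <- mean(y)         # mean of the observed data

ss.tot <- sum((y - y.bar)^2)   # total sum of squares
ss.reg <- sum((f - y.bar)^2)   # regression (explained) sum of squares
ss.res <- sum((y - f)^2)       # residual sum of squares

r2 <- 1 - ss.res / ss.tot

# Matches the multiple R^2 from summary()
all.equal(r2, summary(fit)$r.squared)

# For a least-squares fit with an intercept, the total splits exactly
all.equal(ss.tot, ss.reg + ss.res)
```

Because of that decomposition, $R^2$ for such a fit can equivalently be written as $SS_{reg} / SS_{tot}$.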


P values
==========

#### 1. Alternative hypothesis vs null hypothesis

- A statistical hypothesis refers to a probability distribution that is assumed to govern the observed data. If $X$ is a *random variable* representing the observed data and $H$ is the statistical hypothesis under consideration, then the notion of statistical significance can be naively quantified by the conditional probability $Pr(X|H)$.
- The *null hypothesis* is a general statement or default position that there is no relationship between two measured phenomena. Rejecting or disproving the null hypothesis amounts to concluding that there are grounds for believing that there is a relationship between the two phenomena.
- The *alternative hypothesis* (*research hypothesis*) is the competing statement that such a relationship exists. It is the hypothesis favored when the null hypothesis is rejected.

#### 2. p-value

The *p-value* is a function of the observed sample results that is used for testing a statistical hypothesis. It is used in the context of null-hypothesis testing to quantify the idea of *statistical significance* of evidence.

Null hypothesis testing is a [reductio ad absurdum](http://en.wikipedia.org/wiki/Reductio_ad_absurdum) argument adapted to statistics.

The procedure has two steps:
- Before performing the test, a threshold value is chosen, called the significance level (typically 5% or 1%) and denoted as $\alpha$.
- After the test, if the _p-value_ is equal to or smaller than the significance level, the observed data are considered inconsistent with the assumption that the null hypothesis is true, so that hypothesis is rejected in favor of the alternative hypothesis.

The p-value is defined as the probability, under the assumption of hypothesis $H$, of obtaining a result equal to or more extreme than what was actually observed.

"More extreme than what was actually observed" can mean $\{X \geq x\}$ (right-tail event), $\{X \leq x\}$ (left-tail event), or whichever of $\{X \geq x\}$ and $\{X \leq x\}$ has the smaller probability (two-tailed event). Thus the p-value is given by

- $Pr(X \geq x|H)$ for a right-tail event
- $Pr(X \leq x|H)$ for a left-tail event
- $2 \min \{Pr(X \leq x|H), Pr(X \geq x|H)\}$ for a two-tailed event
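
The three tail definitions can be checked against the $Pr(>|t|)$ column of a regression summary, where the test statistic follows a t distribution under the null hypothesis. This is a sketch on the built-in `mtcars` data (a stand-in for the algae data); the choice of the `wt` coefficient and of $\alpha = 0.05$ is arbitrary.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
tab <- summary(fit)$coefficients

t.obs <- tab["wt", "t value"]   # observed test statistic x
df <- df.residual(fit)          # degrees of freedom of the t distribution

p.right <- pt(t.obs, df, lower.tail = FALSE)   # Pr(X >= x | H)
p.left  <- pt(t.obs, df)                       # Pr(X <= x | H)
p.two   <- 2 * min(p.left, p.right)            # two-tailed p-value

# The two-tailed value is what summary() reports in Pr(>|t|)
all.equal(p.two, tab["wt", "Pr(>|t|)"], check.attributes = FALSE)

# Decision at a significance level chosen before the test
alpha <- 0.05
reject.null <- p.two <= alpha
reject.null
```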

Backward elimination
====================

    > final.lm <- step(lm.a1)
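
When `step()` is called on a fitted model without a `scope` argument, it searches backward: starting from the full model, it repeatedly drops the term whose removal lowers the AIC the most and stops when no removal helps. The sketch below runs this on the built-in `mtcars` data as a stand-in for `lm.a1`; `full` and `reduced` are illustrative names, and `trace = 0` only silences the step-by-step printout.

```r
# Full model with every available predictor
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC (the default direction when no scope is given)
reduced <- step(full, trace = 0)

# Terms that survived the elimination
formula(reduced)

# extractAIC() returns c(edf, AIC), the criterion step() uses internally
aic.full <- extractAIC(full)[2]
aic.reduced <- extractAIC(reduced)[2]
aic.reduced <= aic.full
```

Note that the selection is driven by AIC, not by the individual _t-tests_, so a retained term is not guaranteed to have a small p-value.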
