# Chapter 6 - Forecasting Numeric Data. Regression Methods

Lantz, B. (2019) - Machine Learning with R. Expert techniques for predictive modeling. [p. 167 - 216]

Using real-world data and prediction tasks, this chapter gives an introduction to techniques for estimating relationships among numeric data. This includes:

*   The basics of regression analysis, a set of statistical processes for modelling the size and strength of numeric relationships.
*   How to prepare data and interpret the regression models.
*   An overview of different techniques on how to adapt decision tree classifiers for numeric prediction tasks.

**Prior knowledge of decision tree classifiers (chapter 5) is an advantage.**

---

## Part A - Basics of regression analysis

Regression analysis is among the most widely used methods in science, and in machine learning in particular. It helps to gain insight into a data set, which in turn can help to explain the past and extrapolate into the future. The method is applied in scientific studies in fields such as economics, physics or psychology, for example to quantify the relationship between a dependent variable (the value to be predicted) and one or more independent variables (the predictors), or to identify patterns that can be used to forecast future behaviour. The following sections introduce this process.

### Understanding regression

Recall that a line in the classical cartesian xy-plane can be defined in the form
$$
y=a+bx
$$
where $y$ indicates the dependent variable ($y$ is to be predicted) and $x$ the independent variable (the predictor). The slope of that line is specified by the term $b$: the line is increasing if $b$ is positive and decreasing if $b$ is negative. The intercept with the y-axis is given by the term $a$ (the value of $y$ when $x=0$).

In machine learning a similar form of that equation is used, where the purpose of the machine is to identify values of $a$ and $b$ such that the specified line describes the relationship between the supplied values of $x$ and the values of $y$ in the best possible way. In practice the function rarely relates those values perfectly, so the machine quantifies the error with an additional term. The smaller that error term, the better the function explains the relationship between $y$ and $x$.

This chapter focuses on the most basic regression models, the so-called **linear** regression models, since they use straight lines to explain the relationships. In the case of only one independent variable ($x_1$) it is called a **simple** linear regression, in the case of two or more independent variables ($x_1, x_2, ..., x_n$) it is known as **multiple** linear regression. 

## Simple linear regression

A simple linear regression model uses a line defined by an equation in the form

$$
y = \alpha + \beta x
$$

to explain the relationship between a dependent variable and a single independent variable. This equation is identical to the equation described previously, apart from the Greek characters, which indicate variables that are parameters of statistical functions. As stated above, those parameters (or rather their estimates) can be determined by performing a regression analysis. Data from the launch of the space shuttle "Challenger" in 1986 (which went terribly wrong) will give a glimpse of how such an analysis can help to gain insight into the data and to test hypotheses.

A regression model that demonstrates the connection between O-ring failures (the dependent variable $y$) and the outside temperature during launch (the independent variable $x$, i.e. the predictor) could predict the possibility of failure given the expected temperature at launch. Since the parameters $\alpha$ and $\beta$ are the necessary *ingredients* to form a line through the data set, finding their values is essential. As usual in real-world problems, finding exact values, meaning that the line passes through every measured data point, is very unlikely. Instead, the line will cut somewhat evenly through the data. Therefore, the resulting values for the parameters (if error > 0) are considered *parameter estimates*. The best estimates are the ones which generate the smallest possible error. To identify the optimal parameters, such that the line is closest to the data points, an estimation method known as **ordinary least squares (OLS)** is applied.

Note that the term *line* refers to the solution space of the equation $y = \alpha + \beta x$. Values on that line, i.e. the line itself, are the predictions made by the regression model. If an observed data point lies below or above the predicted value, the error is non-zero, since the vertical distance from the prediction to the true value is greater than 0.

## Ordinary least squares estimation

Estimating the optimal values for the parameters $\alpha$ and $\beta$ means finding the intercept $\alpha$ and the slope $\beta$ such that the deviation of the predicted values ($\hat{y}$) from the actual values ($y$) is as small as possible. In statistical terms, these *errors* are referred to as **residuals**. In OLS regression, the parameter estimates are chosen such that the **sum of the squared errors (SSE)** is minimal. Mathematically speaking, the goal of OLS regression is to minimise the following expression:

$$
\sum{(y_i - \hat{y_i})^2} = \sum{e^2_i},
$$

where $y_i$ is the actual value and $\hat{y_i}$ the value the regression model has predicted. The difference between those values is the residual, i.e. the error, denoted as $e_i$. Since the errors can be positive (the model under-estimates the actual value) or negative (the model over-estimates it), they are squared so that they cannot cancel each other out, and then summed across all points in the data.
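
To make the criterion concrete, the following minimal sketch compares the SSE of two candidate lines on a small made-up data set. The vectors `x` and `y`, the helper `sse()` and both candidate lines are hypothetical and chosen only for illustration; they are not part of the Challenger data.

```r
# hypothetical toy data, chosen only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# sum of squared errors for a candidate line y = a + b * x
sse <- function(a, b, x, y) {
  y_hat <- a + b * x   # predictions of the candidate line
  e <- y - y_hat       # residuals (actual minus predicted)
  sum(e^2)
}

sse_good <- sse(a = 0, b = 2, x = x, y = y)  # line close to the data
sse_poor <- sse(a = 1, b = 1, x = x, y = y)  # line that fits poorly
c(good = sse_good, poor = sse_poor)
```

Under the OLS criterion the candidate with the smaller SSE is the better fit. OLS goes one step further and picks, among all possible lines, the pair of $\alpha$ and $\beta$ with the smallest SSE.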

### Evaluating solutions for $\alpha$ and $\beta$

The solution for $\alpha$ depends on $\beta$. Its value is obtained by applying simple algebra and solving the following equation:

$$
\alpha = \bar{y} - \beta\bar{x},
$$

where $\bar{y}$ and $\bar{x}$ denote the mean values of $y$ and $x$, respectively. Deriving the slope of the regression model requires a bit more calculus; the result is

\begin{equation}
\beta = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}.
\end{equation}

By breaking up the equation into its components, it becomes evident that the slope $\beta$ can be calculated by dividing the covariance of $x$ and $y$ by the variance of the independent variable $x$:

$$
\beta = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}
$$
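
The step from the sum formula to the covariance form can be written out as follows. This sketch assumes the sample versions of covariance and variance with the divisor $n-1$ (the convention used by R's `cov()` and `var()`), where $n$ is the number of observations; the common factor cancels:

$$
\mathrm{Cov}(x,y) = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{n-1}, \qquad \mathrm{Var}(x) = \frac{\sum{(x_i - \bar{x})^2}}{n-1}
$$

$$
\frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}} = \beta
$$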

For the sake of convenience no proof will be given for the equation of $\beta$; the interested reader can consult standard statistics textbooks. With these equations as a basis, it is straightforward to calculate the slope and the intercept of the regression model, i.e. the values for $\alpha$ and $\beta$, using built-in R functions. This will be demonstrated on the dataset `challenger.csv` from the Packt Publishing website.

First, the data has to be stored in a dataframe. Note that the independent variable $x$ is named `temperature` and the dependent variable $y$ is named `distress_ct`. 

In [0]:
# load data from github-repository
url <- "https://raw.githubusercontent.com/tanasrad/Machine_Learning_with_R/master/Ch6/challenger.csv"
launch <- read.csv(url)

Using built-in R functions for the calculation of covariance and variance is straightforward. Keep in mind that the independent variable $x$ is `temperature` and the value to be predicted, $y$, is the dependent variable `distress_ct`. Thus, $\beta$ is manually calculated using Cov(x,y) and Var(x):

In [5]:
# estimate beta manually
b <- cov(launch$temperature, launch$distress_ct) / var(launch$temperature)
b

The rounded result is -0.0475. The negative slope indicates that $y$ decreases as $x$ increases: for each one-unit increase in `temperature`, the expected `distress_ct` decreases by about 0.0475.

Using the computed slope, the value for the intercept $\alpha$ can be computed, also manually using built-in R functions:

In [6]:
# estimate alpha manually
a <- mean(launch$distress_ct) - b * mean(launch$temperature)
a

Calculating the values for $\alpha$ and $\beta$ manually is not ideal. Before moving on, it is useful to learn a method for measuring the strength of a linear relationship, which helps to understand the fit of the regression model. Afterwards, the more sophisticated way of carrying out linear regressions with the `lm` function will be introduced, as well as how to apply multiple linear regression to problems with multiple independent variables.

### Correlations

Correlation coefficients express the linear relationship between two variables: the coefficient indicates how closely the relationship follows a straight line. The best-known correlation coefficient is the **Pearson correlation coefficient**, which ranges between -1 and +1. A correlation of zero indicates no linear relationship, whereas the minimum and maximum values indicate that the variables are perfectly correlated. A coefficient of +1 indicates a perfect positive correlation, where the two variables move in the same direction; a coefficient of -1 is also a perfect correlation, but the variables move in exactly opposite directions.

The Pearson's correlation is defined as:

$$
\rho_{x,y} = \mathrm{Corr}(x,y) = \frac{\mathrm{Cov}(x,y)}{\sigma_x \sigma_y},
$$

where $\sigma_x$ and $\sigma_y$ denote the standard deviations of $x$ and $y$, respectively. Applying this formula, the correlation between the launch `temperature` and the number of O-ring `distress_ct` events can also be computed manually using built-in R functions:

In [7]:
# calculate the correlation of launch data manually
r <- cov(launch$temperature, launch$distress_ct) /
       (sd(launch$temperature) * sd(launch$distress_ct))
r

Alternatively, using the R-function cor() leads, not surprisingly, to the same result:

In [8]:
# check with the built-in function
cor(launch$temperature, launch$distress_ct)

As stated at the beginning of this section, negative correlations imply an inverse relationship, i.e. an increase in the independent variable `temperature` ($x$) is related to a decrease in `distress_ct` ($y$). Since the value of -0.5111 lies about halfway between 0 and the minimum of -1, there is a non-negligible negative linear association. One **rule of thumb** interprets the correlation strength, based on the absolute value of the coefficient, as "weak" between 0.1 and 0.3, "moderate" between 0.3 and 0.5, and "strong" above 0.5. However, this is only a rule of thumb and correlation should always be interpreted in context! Nevertheless, investigating linear relationships among the independent variables and the dependent variable will be important for understanding regression models with larger numbers of predictors.
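
The boundary cases can be illustrated with a minimal sketch on made-up data. The vectors below are hypothetical and not taken from the launch data; the third case is included to show that a correlation of zero only rules out a *linear* relationship.

```r
# hypothetical toy data, not taken from the launch data set
x <- -2:2

y_up    <- 2 * x + 1    # points on an increasing straight line
y_down  <- -3 * x + 5   # points on a decreasing straight line
y_curve <- x^2          # perfectly determined by x, but not linear

r_up    <- cor(x, y_up)     # perfect positive correlation
r_down  <- cor(x, y_down)   # perfect negative correlation
r_curve <- cor(x, y_curve)  # no linear association

c(up = r_up, down = r_down, curve = r_curve)
```

In the third case `y_curve` depends entirely on `x`, yet the Pearson coefficient does not detect it, which is one reason why correlations should be interpreted together with a plot of the data.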

The slope of the regression model can also be calculated using the correlation coefficient:

$$
\beta = \frac{\mathrm{Cov}(x,y)}{\sigma_x \sigma_y} \cdot \frac{\sigma_y}{\sigma_x} = \rho_{x,y} \cdot \frac{\sigma_y}{\sigma_x}
$$

In [9]:
# computing the slope (beta) manually using correlation
r * (sd(launch$distress_ct) / sd(launch$temperature))

The slope of the regression model is -0.04754; not surprisingly, both ways of manually calculating the value for $\beta$ yield the same result.

Such simple linear regressions are commonly carried out using the built-in `lm` function, which was written to fit linear models. It can be used to carry out regressions, single stratum analysis of variance and analysis of covariance. Everything described so far could have been done with `lm`. Applying the function is simple: the first argument, `formula`, specifies which dependent variable should be modelled by which predictor. The `~` operator can be read as "is being modelled by", so an expression of the form `y ~ model` means that `y` is modelled by a linear predictor specified symbolically by `model`. In the code below, `distress_ct` is modelled by `temperature`. The second argument, `data`, specifies the data set from which the values for `y` and `x` are taken. If nothing else is specified, the function uses the default values for the remaining parameters and performs an OLS regression. The results are returned in a clearly structured form, which is very helpful when calculating a regression on multiple independent variables.

In [10]:
# confirming the regression line using the lm function
model <- lm(distress_ct ~ temperature, data = launch)
model


Call:
lm(formula = distress_ct ~ temperature, data = launch)

Coefficients:
(Intercept)  temperature  
    3.69841     -0.04754  


The output shows the values 3.69841 for the intercept $\alpha$ and -0.04754 for the slope $\beta$, matching the manual calculations. If a thorough statistical analysis of the regression analysis is required, the function `summary` provides a useful summary of the entire model, including an analysis on the residuals, statistical significance of the results, the R-squared, the t and p-values. 
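
With the two estimates, the fitted line can be used for prediction. The following minimal sketch plugs the coefficients from the output above into the line equation; the helper `predict_distress()` and the two temperatures are hypothetical and chosen only for illustration.

```r
# coefficients as reported in the lm() output above
a <- 3.69841    # intercept (alpha)
b <- -0.04754   # slope (beta)

# hypothetical helper: expected number of distress events at a temperature
predict_distress <- function(temperature) {
  a + b * temperature
}

temps <- c(60, 70)               # illustrative launch temperatures
pred <- predict_distress(temps)  # one prediction per temperature
pred
```

When the fitted `model` object is available, `predict(model, data.frame(temperature = c(60, 70)))` should give the same predictions up to rounding of the coefficients.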

In [11]:
summary(model)


Call:
lm(formula = distress_ct ~ temperature, data = launch)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5608 -0.3944 -0.0854  0.1056  1.8671 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  3.69841    1.21951   3.033  0.00633 **
temperature -0.04754    0.01744  -2.725  0.01268 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5774 on 21 degrees of freedom
Multiple R-squared:  0.2613,	Adjusted R-squared:  0.2261 
F-statistic: 7.426 on 1 and 21 DF,  p-value: 0.01268


The summary of the regression `model` shows that the goodness-of-fit (**R-squared**) is 0.2613, meaning that the model explains 26.13% of the variance in the dependent variable. R-squared is the ratio of the variance explained by the model to the total variance of the data. Conceptually, the higher R-squared, the smaller the errors between predicted and actual values, i.e. the closer the actual data points lie to the regression line. Because R-squared never decreases when more variables are added to a model, even uninformative ones, it is advisable to use the **adjusted R-squared** for multiple linear regressions.

Further indicators are the **t-value** and the **P-value** of each coefficient. The t-value measures how many standard errors the estimated coefficient lies away from 0, so the larger its absolute value, the stronger the indication of a relationship between the variables. The P-value gives the probability of observing a t-value at least this extreme if the true coefficient were zero. For `temperature` the P-value is 0.01268, i.e. about 1.3%, which is below the common 5% threshold (marked with `*` in the **significance codes**); the intercept has a P-value of about 0.6% (`**`). Hence, the lower the P-value, the higher the significance of the estimate, and a small value supports the conclusion that a relationship is present.
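
The quantities in the summary can be recomputed by hand. The sketch below is a minimal illustration under stated assumptions: the first part uses hypothetical toy data (`x`, `y`) to rebuild R-squared and adjusted R-squared, while the second part takes the estimate, standard error and degrees of freedom for `temperature` from the coefficient table above.

```r
# hypothetical toy data, used only to illustrate the computation
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 2.3, 2.9, 4.8, 5.1, 6.6)

fit <- lm(y ~ x)
y_hat <- fitted(fit)

sse <- sum((y - y_hat)^2)     # unexplained (residual) variation
sst <- sum((y - mean(y))^2)   # total variation around the mean
r2 <- 1 - sse / sst           # R-squared

n <- length(y)                # number of observations
p <- 1                        # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# compare with the values reported by summary()
c(manual = r2, summary = summary(fit)$r.squared)
c(manual = adj_r2, summary = summary(fit)$adj.r.squared)

# t-value and two-sided p-value for temperature, from the table above
t_val <- -0.04754 / 0.01744              # estimate / standard error
p_val <- 2 * pt(-abs(t_val), df = 21)    # 21 residual degrees of freedom
c(t = t_val, p = p_val)
```

In a simple linear regression R-squared equals the squared Pearson correlation, so `cor(x, y)^2` is a quick cross-check for `r2`.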

As a concluding remark for this section, note that correlation does not imply causation. It merely describes the relationship between a pair of variables, and there could be other, unmeasured explanations. For interested readers, the following website shows what such false conclusions can lead to: real-world cases that are highly correlated but have no causal link whatsoever, known as spurious correlations. See: http://tylervigen.com/spurious-correlations


## Multiple Linear Regressions

Multiple linear regression is an extension of simple linear regression. Both models have the same goal, namely to estimate the values of the slope coefficients which minimize the prediction error of a linear equation. As the name suggests, multiple linear regression allows for additional terms for the additional independent variables (predictors).

Like any model in statistics and mathematics, multiple linear regression comes with its strengths and weaknesses. It is by far the most common approach for modelling numeric data, and since most real-world analyses have more than one independent variable, it is likely that this model is used for most numeric prediction tasks. It is so popular because it can be adapted to model almost any task. However, it also makes strong assumptions about the data and is only suitable for numeric data; categorical data, for example, would require additional preparation. A multiple regression model has the form:

$$
y = \alpha + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_i x_i + \epsilon ,
$$

where $\epsilon$ has been added to describe the error of the prediction. 
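
In matrix notation, which the `reg()` function in the next cell relies on, the model and its OLS solution can be sketched as follows. Here $\mathbf{X}$ is assumed to be the matrix of predictors with a leading column of ones for the intercept, $\mathbf{y}$ the vector of observed values, and $\hat{\boldsymbol{\beta}}$ the vector collecting the estimates of $\alpha, \beta_1, ..., \beta_i$. This notation is introduced here for illustration and is not used elsewhere in the text.

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}
$$

The line `solve(t(x) %*% x) %*% t(x) %*% y` transcribes the right-hand formula directly: `t()` transposes, `%*%` multiplies matrices and `solve()` inverts.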

In [0]:
# creating a simple multiple regression function
reg <- function(y, x) {
  x <- as.matrix(x)
  x <- cbind(Intercept = 1, x)
  b <- solve(t(x) %*% x) %*% t(x) %*% y
  colnames(b) <- "estimate"
  print(b)
}

# examine the launch data
str(launch)

# test regression model with simple linear regression
reg(y = launch$distress_ct, x = launch[2])

# use regression model with multiple regression
reg(y = launch$distress_ct, x = launch[2:4])

# confirming the multiple regression result using the lm function (not in text)
model <- lm(distress_ct ~ temperature + field_check_pressure + flight_num, data = launch)
model
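
As a cross-check that does not depend on the downloaded file, the closed-form estimate can be compared with `lm` on simulated data. This is a minimal sketch: the data, the true coefficients and all variable names below are hypothetical, and the variant using `crossprod()` is an alternative formulation, not the one used in `reg()`.

```r
# hypothetical simulated data with two predictors
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.1)

# closed-form OLS estimate, same formula as in reg()
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)
b_closed <- solve(t(X) %*% X) %*% t(X) %*% y

# alternative: solve the normal equations without an explicit inverse
b_normal <- solve(crossprod(X), crossprod(X, y))

# reference fit with lm
b_lm <- coef(lm(y ~ x1 + x2))

cbind(closed = drop(b_closed), normal = drop(b_normal), lm = unname(b_lm))
```

All three columns should agree up to floating-point error. Internally `lm` uses a QR decomposition rather than the explicit inverse, which tends to be more stable when predictors are strongly correlated.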