# Chapter 6 - Forecasting Numeric Data. Regression Methods

Lantz, B. (2019) - Machine Learning with R. Expert techniques for predictive modeling. [p. 167 - 216]

Using real-world data and prediction tasks, this chapter gives an introduction to techniques for estimating relationships among numeric data. This includes:

*   The basics of regression analysis, a set of statistical processes for modelling the size and strength of numeric relationships.
*   How to prepare data and interpret the regression models.
*   An overview of techniques for adapting decision tree classifiers to numeric prediction tasks.

**Prior knowledge of decision tree classifiers is an advantage (chapter 5).**

---

## Part A - Basics of regression analysis

Regression analysis is among the most widely used methods in science, and especially in machine learning. It helps to gain insight into a data set, which in turn helps to explain the past and extrapolate into the future. The method is applied in scientific studies in fields such as economics, physics or psychology, for example to quantify the relationship between a dependent variable (the value to be predicted) and one or more independent variables (the predictors), or to identify patterns that can be used to forecast future behaviour. The following sections introduce the basic ideas behind it.

### Understanding regression

Recall that a line in the classical cartesian xy-plane can be defined in the form
$$
y=a+bx
$$
where $y$ is the dependent variable (the value to be predicted) and $x$ the independent variable (the predictor). The slope of the line is given by the term $b$: the line rises if $b$ is positive and falls if $b$ is negative. The term $a$ is the intercept with the y-axis, i.e. the value of $y$ when $x=0$.

In machine learning a similar form of this equation is used: the purpose of the machine is to identify values of $a$ and $b$ such that the resulting line describes the relationship between the supplied values of $x$ and the values of $y$ as well as possible. In practice the function rarely relates those values perfectly, so the machine quantifies the deviation with an additional error term. Hence, the smaller the error term, the better the function explains the relationship between $y$ and $x$.
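
The role of the error term can be illustrated with a minimal sketch. The data and the parameter values `a` and `b` below are hypothetical, chosen by hand purely for illustration and not fitted to anything:

```r
# Hypothetical toy data, invented purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# A candidate line y = a + b * x with hand-picked (not fitted) parameters
a <- 0
b <- 2

y_hat  <- a + b * x   # values on the line
errors <- y - y_hat   # vertical deviation of each observation from the line

errors
```

Each entry of `errors` is the signed vertical distance between an observed value and the line. A line that describes the data better would shrink these distances.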

This chapter focuses on the most basic regression models, the so-called **linear** regression models, since they use straight lines to explain the relationships. In the case of only one independent variable ($x_1$) it is called a **simple** linear regression, in the case of two or more independent variables ($x_1, x_2, ..., x_n$) it is known as **multiple** linear regression. 

## Simple linear regression

A simple linear regression model uses a line defined by an equation in the form

$$
y = \alpha + \beta x
$$

to explain the relationship between a dependent variable and a single independent variable. This equation is identical to the one described previously, aside from the Greek characters, which indicate that the variables are parameters of a statistical function. As stated above, these parameters (or rather their estimates) can be determined by performing a regression analysis. Data from the launch of the space shuttle "Challenger" in 1986 (which went terribly wrong) gives a glimpse of how such an analysis can help to gain insight into the data and to test hypotheses.

A regression model that captures the connection between O-ring failures (the dependent variable $y$) and the outside temperature during launch (the independent variable $x$, i.e. the predictor) could predict the likelihood of failure given the expected temperature at launch. Since the parameters $\alpha$ and $\beta$ are the necessary *ingredients* for drawing a line through the data set, finding their values is essential. As usual in real-world problems, it is very unlikely that a line passes exactly through every measured data point. Instead, the line will cut more or less evenly through the data. The resulting values for the parameters are therefore considered *parameter estimates*. The best estimates are the ones that produce the smallest possible error. To identify the parameters for which the line is closest to the data points, an estimation method known as **ordinary least squares (OLS)** is applied.

Note that the term *line* refers to the solution space of the equation $y = \alpha + \beta x$. Values on that line are the predictions made by the regression model. If an observed data point lies above or below the predicted value, the error is nonzero, since the vertical distance between the prediction and the true value is greater than 0.

## Ordinary least squares estimation

Estimating the optimal values for the parameters means finding the intercept $\alpha$ and the slope $\beta$ such that the deviation of the predicted values ($\hat{y}$) from the actual values ($y$) is as small as possible. In statistics, these *errors* are referred to as **residuals**. In OLS regression, the parameter estimates are chosen such that the **sum of the squared errors (SSE)** is minimal. Mathematically speaking, the goal of OLS regression is to minimise the following expression:

$$
\sum{(y_i - \hat{y_i})^2} = \sum{e^2_i},
$$

where $y_i$ is the actual value and $\hat{y_i}$ the value predicted by the regression model. The difference between the two is the residual, i.e. the error, denoted as $e_i$. Since the errors can be positive (under-estimation, the actual value exceeds the prediction) or negative (over-estimation), they are squared to eliminate the signs and then summed across all points in the data.
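
As a minimal sketch, the SSE can be written as a small R function and used to compare candidate lines. The helper name `sse`, the toy data and the two candidate parameter pairs are assumptions made for illustration only:

```r
# Hypothetical toy data, invented purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Sum of squared errors for a line with intercept a and slope b
sse <- function(a, b, x, y) {
  y_hat <- a + b * x      # predictions on the line
  sum((y - y_hat)^2)      # squared residuals, summed over all points
}

# Two hand-picked candidate lines
sse_line1 <- sse(a = 1, b = 1,   x = x, y = y)
sse_line2 <- sse(a = 2, b = 0.5, x = x, y = y)

c(line1 = sse_line1, line2 = sse_line2)
```

The second candidate has the smaller SSE and is therefore the better of the two under the least squares criterion. OLS looks for the pair of parameters with the smallest SSE among all possible lines.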

### Evaluating solutions for $\alpha$ and $\beta$

The solution for $\alpha$ depends on $\beta$. Its value is obtained by simple algebra from the following equation:

$$
\alpha = \bar{y} - \beta\bar{x},
$$

where $\bar{y}$ and $\bar{x}$ denote the mean values of $y$ and $x$, respectively. Determining the slope of the regression model requires a bit more calculus; the result is

$$
\beta = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}.
$$

Breaking this equation into its components shows that the slope $\beta$ is the covariance of $x$ and $y$ divided by the variance of the independent variable $x$:

$$
\beta = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}
$$
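
One way to see this, as a sketch assuming the usual sample definitions of covariance and variance (both carry the factor $\frac{1}{n-1}$, where $n$, the number of observations, is notation introduced here): dividing the numerator and the denominator of the slope formula by $n-1$ leaves the ratio unchanged and gives

$$
\beta = \frac{\frac{1}{n-1}\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\frac{1}{n-1}\sum{(x_i - \bar{x})^2}} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}
$$

These are the same definitions used by R's `cov()` and `var()`.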

For the sake of brevity no proof is given for the equation of $\beta$; the interested reader can consult a standard statistics textbook. With these formulas at hand, it is straightforward to calculate the slope and the intercept of the regression model, i.e. the values for $\alpha$ and $\beta$, using built-in R functions. This is demonstrated on the dataset `challenger.csv` from the Packt Publishing website.

First, the data has to be loaded into a data frame. Note that the independent variable $x$ is named `temperature` and the dependent variable $y$ is named `distress_ct`.

In [0]:
url <- "https://raw.githubusercontent.com/tanasrad/Machine_Learning_with_R/master/Ch6/challenger.csv?token=AH4PSO33CAWPK7DHMLG7U4C6QYS6O"
launch <- read.csv(url)

Calculating the covariance and variance with built-in R functions is straightforward. Keep in mind that the independent variable $x$ is `temperature` and the value to be predicted, $y$, is the dependent variable `distress_ct`. Thus, $\beta$ is calculated manually as $\mathrm{Cov}(x,y) / \mathrm{Var}(x)$:

In [5]:
b <- cov(launch$temperature, launch$distress_ct) / var(launch$temperature)
b

The rounded result is -0.0475. The negative slope indicates that $y$ decreases as $x$ increases: on average, `distress_ct` decreases by about 0.0475 for each one-unit increase in `temperature`.

Using the computed slope, the intercept $\alpha$ can also be computed manually using built-in R functions:

In [6]:
a <- mean(launch$distress_ct) - b * mean(launch$temperature)
a
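
Once both estimates are available, the fitted line can be used for prediction. The following is a minimal sketch on hypothetical toy data (not the Challenger data); the names `x_new` and `y_new` are introduced here for illustration:

```r
# Hypothetical toy data, invented purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Manual OLS estimates: slope first, then the intercept that depends on it
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)

# By construction the fitted line passes through the point of means
all.equal(a + b * mean(x), mean(y))

# Prediction for a new, hypothetical value of x
x_new <- 6
y_new <- a + b * x_new
y_new
```

The check on the point of means follows directly from the intercept formula: substituting $\bar{x}$ into the fitted line returns $\bar{y}$.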

Calculating the values for $\alpha$ and $\beta$ manually is not ideal. Still, to better understand the fit of the regression model it is useful to first learn a method for measuring the strength of a linear relationship, namely correlation. Afterwards, the more sophisticated way of carrying out linear regressions with the `lm()` function is introduced, as well as how to apply multiple linear regression to problems with multiple independent variables.

In [0]:
# calculate the correlation of launch data
r <- cov(launch$temperature, launch$distress_ct) /
       (sd(launch$temperature) * sd(launch$distress_ct))
r
cor(launch$temperature, launch$distress_ct)

# computing the slope using correlation
r * (sd(launch$distress_ct) / sd(launch$temperature))

# confirming the regression line using the lm function (not in text)
model <- lm(distress_ct ~ temperature, data = launch)
model
summary(model)
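
The cell above relies on two identities: the Pearson correlation is the covariance divided by the product of the standard deviations, and the slope equals the correlation multiplied by the ratio of the standard deviation of $y$ to that of $x$. A minimal sketch checks both against `cor()` and `lm()`, using a hypothetical toy data frame `toy` rather than the launch data:

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 4, 5, 4, 5))

# Pearson correlation from its definition, compared with cor()
r <- cov(toy$x, toy$y) / (sd(toy$x) * sd(toy$y))
all.equal(r, cor(toy$x, toy$y))

# Slope recovered from the correlation, intercept from the means
b <- r * sd(toy$y) / sd(toy$x)
a <- mean(toy$y) - b * mean(toy$x)

# lm() should return the same intercept and slope
model <- lm(y ~ x, data = toy)
all.equal(unname(coef(model)), c(a, b))
```

Both comparisons should return `TRUE`, which suggests that the manual route and `lm()` describe the same regression line.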

Clearly, this isn't the most sophisticated way to calculate the values for $\alpha$ and $\beta$, it is nonetheless important to 