# How Does Linear Regression Actually Work?

Regression is the process of estimating the relationship between input (independent) variables and output (dependent) variables. The output variables are continuous real numbers, so there are infinitely many possible values. In linear regression, we assume that the relationship between input and output is linear (a straight line).

The following video (from 0:00 to 7:39) explains how regression works.

<a href="https://www.youtube.com/embed/yMgFHbjbAW8"><img src="resources/video1.png" width="400"></a>

## 1. Theory

The term *linearity* in algebra refers to a linear relationship between two or more variables. If we draw this relationship in a two-dimensional space (between two variables), we get a straight line.

Linear regression predicts the value of a dependent variable (y) from a given independent variable (x). To do so, it finds a linear relationship between x (input) and y (output), hence the name *Linear Regression*. If we plot the independent variable (x) on the x-axis and the dependent variable (y) on the y-axis, linear regression gives us the straight line that best fits the data points, as shown in the figure in the next section.


## 2. Simple Linear Regression

The following figure illustrates simple linear regression:

<img src="./resources/LRgraph.png" style="height: 300px"/>

When implementing simple linear regression, you typically start with a given set of input-output (𝑥-𝑦) pairs (green circles). These pairs are your observations. For example, the leftmost observation (green circle) has the input 𝑥 = 5 and the actual output (response) 𝑦 = 5. The next one has 𝑥 = 15 and 𝑦 = 20, and so on.

The estimated regression function (black line) has the equation `𝑓(𝑥) = 𝑏 + m𝑥`. Your goal is to calculate the optimal values of the predicted weights **b and m** that minimize SSR (Sum of Squared Residuals) and determine the estimated regression function. The value of b, also called the intercept, shows the point where the estimated regression line crosses the 𝑦 axis. It is the value of the estimated response 𝑓(𝑥) for 𝑥 = 0. The value of m determines the slope of the estimated regression line.

The predicted responses (red squares) are the points on the regression line that correspond to the input values. For example, for the input 𝑥 = 5, the predicted response is 𝑓(5) = 8.33 (represented with the leftmost red square).
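To make the calculation of b and m concrete, here is a minimal sketch of the closed-form least-squares solution for a single input variable. Only the first two observations, (5, 5) and (15, 20), come from the text; the other four points are assumed for illustration, chosen so that the fitted line reproduces the predicted response 𝑓(5) ≈ 8.33 mentioned above. The function name is hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b, m) that minimize the sum of squared residuals for f(x) = b + m*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    # Intercept: the fitted line always passes through (x_mean, y_mean)
    b = y_mean - m * x_mean
    return b, m


# First two points are from the text; the remaining four are assumed
xs = [5, 15, 25, 35, 45, 55]
ys = [5, 20, 14, 32, 22, 38]

b, m = fit_simple_linear_regression(xs, ys)
print(f"b = {b:.2f}, m = {m:.2f}, f(5) = {b + m * 5:.2f}")
# prints: b = 5.63, m = 0.54, f(5) = 8.33
```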

The residuals (vertical dashed gray lines) can be calculated as 𝑦ᵢ - 𝑓(𝑥ᵢ) = 𝑦ᵢ - 𝑏 - m𝑥ᵢ for 𝑖 = 1, …, 𝑛. They are the vertical distances between the green circles and the red squares. When you implement linear regression, you are actually trying to minimize these distances and make the red squares as close to the predefined green circles as possible. The most common technique is to minimize the sum of the squares of these distances, a method known as ordinary least squares.
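The residuals and their sum of squares can be sketched in a few lines. The data set below is partly assumed (only the first two points appear in the text), and the values of b and m are the rounded least-squares values for that assumed data; the second candidate line is an arbitrary comparison.

```python
def residuals(xs, ys, b, m):
    """Return the residuals y_i - f(x_i) for the line f(x) = b + m*x."""
    return [y - (b + m * x) for x, y in zip(xs, ys)]


def sum_of_squared_residuals(xs, ys, b, m):
    """Return the SSR, the quantity that least squares minimizes."""
    return sum(r ** 2 for r in residuals(xs, ys, b, m))


# First two points are from the text; the remaining four are assumed
xs = [5, 15, 25, 35, 45, 55]
ys = [5, 20, 14, 32, 22, 38]

# Rounded least-squares line for this data versus an arbitrary steeper line
ssr_best = sum_of_squared_residuals(xs, ys, b=5.633, m=0.54)
ssr_other = sum_of_squared_residuals(xs, ys, b=5.633, m=0.60)

print(f"SSR of least-squares line: {ssr_best:.2f}")
print(f"SSR of steeper line:       {ssr_other:.2f}")
```

Any line other than the least-squares one gives a larger SSR on the same data, which is what "best fit" means here.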

To find the optimal values of m and b that minimize the squared residuals, we often use a 'one-shot learner': we calculate the best values mathematically (via the normal equation from linear algebra) in a single step. With multiple dimensions (more features in X), this gets complicated quickly: when the number of features becomes too large, calculating the best values in one shot requires a lot of memory. In that case, we can find the optimal values with an iterative technique called 'gradient descent'. We'll cover gradient descent when we optimize neural networks (see the Deep Learning part).
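As a preview of the iterative alternative, here is a minimal gradient descent sketch for m and b. It rests on several assumptions that are not in the text: the data set (only the first two points are given above), the learning rate, the number of steps, and the choice to standardize x first so that a single fixed learning rate converges quickly. It minimizes the mean of the squared residuals, which has the same minimum as the SSR.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.1, n_steps=500):
    """Estimate (b, m) for f(x) = b + m*x by gradient descent on the mean squared residual."""
    n = len(xs)
    # Standardize x (zero mean, unit variance) so both weights converge at a similar speed
    x_mean = sum(xs) / n
    x_std = (sum((x - x_mean) ** 2 for x in xs) / n) ** 0.5
    zs = [(x - x_mean) / x_std for x in xs]

    m_z, b_z = 0.0, 0.0  # weights on the standardized scale
    for _ in range(n_steps):
        errors = [b_z + m_z * z - y for z, y in zip(zs, ys)]
        # Partial derivatives of the mean squared residual
        grad_m = 2 / n * sum(e * z for e, z in zip(errors, zs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient
        m_z -= learning_rate * grad_m
        b_z -= learning_rate * grad_b

    # Convert the weights back to the original scale of x
    m = m_z / x_std
    b = b_z - m * x_mean
    return b, m


# First two points are from the text; the remaining four are assumed
xs = [5, 15, 25, 35, 45, 55]
ys = [5, 20, 14, 32, 22, 38]

b, m = fit_line_gradient_descent(xs, ys)
print(f"b = {b:.2f}, m = {m:.2f}")
```

On a problem this small the one-shot calculation is simpler; the iterative version pays off only when the number of features makes the one-shot calculation too expensive.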

## 2. Multiple Linear Regression

The same concept can be extended to cases where there are more than two variables. This is called multiple linear regression. For instance, consider a scenario where you have to predict the price of a house based on its area, the number of bedrooms, the average income of the people in the area, the age of the house, and so on. In this case, the dependent variable (target variable) depends on several independent variables. A regression model involving multiple variables can be represented as:

`y = b + m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ`
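The weights of such a model can be estimated with the normal equation mentioned earlier. The following is a minimal sketch in plain Python, not a production implementation: the house data, the weights used to generate the prices, and all function names are hypothetical, and the small linear solver is just one straightforward way to solve the resulting system.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ solution = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_multiple_linear_regression(rows, ys):
    """Return [b, m1, ..., mn] by solving the normal equations (X^T X) w = X^T y."""
    design = [[1.0] + list(row) for row in rows]  # the leading 1 multiplies the intercept b
    size = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(size)] for i in range(size)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


# Hypothetical houses: (area, number of bedrooms); prices are generated
# from known weights b = 50, m1 = 2, m2 = 10 so the result can be checked
houses = [(50, 1), (80, 2), (120, 3), (100, 2), (150, 4)]
prices = [50 + 2 * area + 10 * bedrooms for area, bedrooms in houses]

b, m_area, m_bedrooms = fit_multiple_linear_regression(houses, prices)
print(b, m_area, m_bedrooms)
```

Because the prices were generated without noise, the fit should recover weights very close to 50, 2 and 10, up to floating-point rounding.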

## 3. Polynomial Regression

Instead of using a straight line to estimate the values of y, it might be more accurate to draw a curve. In that case the estimated regression function is no longer a linear function (a polynomial of degree 1) but a polynomial of degree 2, 3, or higher. You can see some examples below.

<img src="./resources/LRpol.png" style="height: 600px"/>
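A polynomial regression function has the same shape as the multiple linear regression equation, with the powers of x playing the role of the separate input variables. The sketch below only evaluates such a function; the coefficients and function names are hypothetical and chosen so the arithmetic is easy to follow.

```python
def polynomial_features(x, degree):
    """Return [x, x**2, ..., x**degree], the inputs of the equivalent linear model."""
    return [x ** power for power in range(1, degree + 1)]


def predict_polynomial(x, intercept, coefficients):
    """Evaluate f(x) = b + m1*x + m2*x**2 + ... for the given weights."""
    features = polynomial_features(x, len(coefficients))
    return intercept + sum(m * f for m, f in zip(coefficients, features))


# Hypothetical weights: b = 1, m1 = 2, m2 = 3
print(predict_polynomial(2.0, 1.0, [2.0]))       # degree 1: 1 + 2*2 = 5.0
print(predict_polynomial(2.0, 1.0, [2.0, 3.0]))  # degree 2: 1 + 2*2 + 3*4 = 17.0
```

Since the function is still linear in the weights, the weights of a polynomial can be estimated with the same least-squares machinery as for a straight line.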

## 4. Underfitting and overfitting

One very important question that might arise when you’re implementing polynomial regression is related to the choice of the optimal degree of the polynomial regression function.

There is no straightforward rule for doing this. It depends on the case. You should, however, be aware of two problems that might follow from the choice of the degree: underfitting and overfitting. These are explained in the video below (from 0:22 to 5:43).

<a href="https://www.youtube.com/embed/EuBBz3bI-aA?start=22&end=343"><img src="resources/video2.png" width="400"></a>
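The effect of choosing too high a degree can be sketched with a small experiment. All data below is hypothetical. The curve is the degree-4 polynomial that passes exactly through the five training points; it is evaluated with Lagrange interpolation, which for five points with distinct x values gives the same curve as a degree-4 least-squares fit.

```python
def fit_line(xs, ys):
    """Return (b, m) of the least-squares straight line."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    return y_mean - m * x_mean, m


def interpolate(xs, ys, x):
    """Evaluate at x the degree len(xs)-1 polynomial through all points (Lagrange form)."""
    total = 0.0
    for i, (x_i, y_i) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, x_j in enumerate(xs):
            if j != i:
                weight *= (x - x_j) / (x_i - x_j)
        total += y_i * weight
    return total


def ssr(ys, predictions):
    """Sum of squared residuals between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))


# Hypothetical training and test observations
train_x, train_y = [0, 1, 2, 3, 4], [0, 2, 1, 4, 3]
test_x, test_y = [0.5, 3.5], [1.0, 3.0]

b, m = fit_line(train_x, train_y)
line = lambda x: b + m * x
curve = lambda x: interpolate(train_x, train_y, x)

for name, model in [("straight line", line), ("degree-4 curve", curve)]:
    train_ssr = ssr(train_y, [model(x) for x in train_x])
    test_ssr = ssr(test_y, [model(x) for x in test_x])
    print(f"{name}: train SSR = {train_ssr:.2f}, test SSR = {test_ssr:.2f}")
```

On this data the curve fits the training points perfectly but does much worse than the straight line on the test points.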

## 5. Questions

Can you explain the terms *bias* and *variance*? Is the bias low or high for the straight line and for the squiggly line? And what about the variance?

```python
# bias:
# variance: 

# straight line: low or high bias / low or high variability
# squiggly line: low or high bias / low or high variability
```

## 6. Exercise

Consider the following regression curves of different degrees. The green dots are the training data and the red dots are the test data. Are the models underfitted, well-fitted or overfitted? Note: R² = 1 corresponds to SSR = 0, that is, a perfect fit in which the predicted and actual responses coincide.

<img src="./resources/LRoverunder.png" style="height: 700px"/>
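The note on R² can be checked with a minimal sketch. It assumes the common definition R² = 1 - SSR / SST, where SST is the sum of squared distances between the observations and their mean; this definition and the tiny data set are not given in the text.

```python
def r_squared(ys, predictions):
    """Return R^2 = 1 - SSR / SST for observed values ys and model predictions."""
    y_mean = sum(ys) / len(ys)
    ssr = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    sst = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ssr / sst


ys = [1, 2, 3]  # hypothetical observations

print(r_squared(ys, [1, 2, 3]))  # perfect predictions: SSR = 0, so R^2 = 1.0
print(r_squared(ys, [2, 2, 2]))  # always predicting the mean: SSR = SST, so R^2 = 0.0
print(r_squared(ys, [3, 2, 1]))  # worse than the mean: R^2 is negative (-3.0)
```

When judging the plots, compare R² on the training data with R² on the test data rather than looking at the training value alone.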

```python
# a) underfitted, well-fitted or overfitted
# b) underfitted, well-fitted or overfitted
# c) underfitted, well-fitted or overfitted
# d) underfitted, well-fitted or overfitted
```