# Introduction to Linear Regression

### Table of Contents
1. General Overview 
2. Model Parameters
3. Learning/Training the Model
    1. Simple Linear Regression
    2. Ordinary Least Squares
    3. Gradient Descent

## <ins>1. General Overview</ins>

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, it assumes that y can be calculated from a linear combination of the input variables (x).

When there is a single input variable (x), the method is referred to as **simple linear regression**. When there are multiple input variables, literature from statistics often refers to the method as **multiple linear regression**.

<ins>A few important terms before we start</ins>:

* **Independent Variables** (features): An independent variable is a variable that is manipulated to determine the value of a dependent variable. Simply put, these are the features we use to predict the value of y. An independent variable is also called an explanatory variable.
<br><br>
* **Dependent Variable** (target): The dependent variable depends on the values of the independent variables. Simply put, it is the value we are trying to predict. It is also commonly known as a response variable.

Different techniques can be used to fit (train) the linear regression equation from data, the most common of which is Ordinary Least Squares. A model fitted this way is therefore often called Ordinary Least Squares Linear Regression, or just Least Squares Regression.


##### Goal of Regression Analysis

The goal of regression analysis is to derive a **trend line** to best fit the data (hence a regression trend line is also known as a **best-fit line**). This line is positioned to reduce prediction error as much as possible.

##### Regression in Python

There are two main ways to perform linear regression in Python:
* statsmodels
* scikit-learn
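
As a rough illustration of the scikit-learn route, the sketch below fits a line to a handful of made-up points. The data values and variable names are assumptions chosen for this example only; `LinearRegression`, `fit`, `predict`, `coef_` and `intercept_` are part of scikit-learn's public API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (made up for illustration).
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()  # an intercept is fitted by default
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict(np.array([[5.0]]))[0]
print(slope, intercept, prediction)
```

statsmodels offers a comparable route through `sm.OLS(y, sm.add_constant(X)).fit()` (with `import statsmodels.api as sm`), which is often preferred when a statistical summary of the fit is wanted.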


<br><br>
## <ins>2. Linear Regression Model Parameters</ins>

We establish the relationship between the independent and dependent variables by fitting a best-fit line. This line is known as the regression line and is represented by the linear equation Y = a*X + b.

<ins>In this equation</ins>:

* Y – Dependent Variable
* a – Slope
* X – Independent variable
* b – Intercept

The coefficients a and b are derived by minimizing the sum of the squared differences (the vertical distances) between the data points and the regression line.
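
Written out as a formula, this criterion could be stated as follows. The index notation (n data points $(x_i, y_i)$) does not appear in the text above and is introduced here only for illustration:

$$
(a, b) = \arg\min_{a,\, b} \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2
$$

Each term in the sum is the squared gap between an observed value $y_i$ and the value $a x_i + b$ that the line predicts for it.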

<br>

## <ins>3. Learning the Linear Regression Model</ins>

Learning a linear regression model means estimating the values of the coefficients used in the representation with the data that we have available.

#### 3.1) Simple Linear Regression
With simple linear regression, where we have a single input, we can use statistics to estimate the coefficients. This requires calculating statistical properties of the data such as **means**, **standard deviations**, **correlations** and **covariance**. All of the data must be available so that these statistics can be computed.
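
A minimal sketch of this statistical approach, using only the standard library: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The function name and the sample data are made up for illustration.

```python
from statistics import mean

def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) for the line y = slope * x + intercept."""
    x_mean, y_mean = mean(xs), mean(ys)
    # These sums are proportional to covariance(x, y) and variance(x);
    # the shared 1/(n - 1) factor cancels in the ratio, so it is omitted.
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var_x = sum((x - x_mean) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("x must contain at least two distinct values")
    slope = cov_xy / var_x
    intercept = y_mean - slope * x_mean
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)  # roughly 0.6 and 2.2
```

The same slope can be obtained as the correlation of x and y multiplied by the ratio of their standard deviations (s_y / s_x), which is where the standard deviations and correlations mentioned above come in.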


#### 3.2) Ordinary Least Squares
When we have **more than one input** we can use Ordinary Least Squares to estimate the values of the coefficients.

The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. This means that given a regression line through the data we calculate the distance from each data point to the regression line, square it, and sum all of the squared errors together. This is the quantity that ordinary least squares seeks to minimize.
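
The sketch below shows one way this could look with two inputs, using NumPy's least-squares solver. The data are synthetic (generated from y = 1 + 2*x1 + 3*x2) and the variable names are assumptions made for this example.

```python
import numpy as np

# Toy data with two inputs, generated from y = 1 + 2*x1 + 3*x2 (made up).
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# lstsq returns the coefficients that minimise the sum of squared residuals.
coefficients, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)

# The quantity OLS minimises: the squared residuals, summed over all points.
residuals = y - X_design @ coefficients
sum_squared_residuals = float(np.sum(residuals ** 2))
print(coefficients, sum_squared_residuals)
```

Because the toy data contain no noise, the recovered coefficients should be close to 1, 2 and 3, and the sum of squared residuals close to zero. With real data the residuals would not vanish.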

Another way to put it:
<br>
The objective of ordinary least squares (OLS) regression is to learn a linear model (a line) that we can use to predict Y while keeping the error (the error term) as small as possible. Reducing the error term increases the accuracy of our predictions and thereby improves the learned function.


As a reminder, our objective is to learn the model parameters that minimise the error term, thereby increasing the accuracy of our model’s predictions.

<ins>To find the best model parameters</ins>:
> 1. Define a cost function, or loss function, that measures how inaccurate our model’s predictions are.
> 2. Find the parameters that minimise the loss, i.e. make our model as accurate as possible.
>
> Graphically, a model with a single input can be drawn as a line in the Cartesian plane. With two inputs the line becomes a plane in three dimensions, and so on.

TODO: insert picture of OLS here

**<ins>Cost Function</ins>**
<br><br>
$$
J = \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
$$

Simply put, the cost function says to take the difference between each real data point (y) and our model’s prediction (ŷ), then square the differences to avoid negative numbers and to penalise larger differences. Finally, add them up and take the average, except that rather than dividing by n, we divide by 2n. The extra factor of 2 cancels against the 2 that appears when the squared term is differentiated, which makes the derivative tidier; it does not change which parameters minimise the cost. For simplicity, just remember that we divide by 2n.
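
As a small sketch, this cost could be computed for a single-input model ŷ = a*x + b as shown here. The function name `cost` and the sample points are made up for illustration.

```python
def cost(a, b, xs, ys):
    """Sum of squared errors of the line y_hat = a * x + b, divided by 2n."""
    n = len(xs)
    squared_errors = ((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * n)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
perfect_fit_cost = cost(2.0, 0.0, xs, ys)  # the line passes through every point
poor_fit_cost = cost(1.0, 0.0, xs, ys)     # errors of 1, 2 and 3
print(perfect_fit_cost, poor_fit_cost)
```

The line that matches the data exactly has a cost of zero, while the poorer line is penalised with (1 + 4 + 9) / 6, about 2.33.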

For problems that are 2D, we can simply derive the optimal parameters (the coefficients) that minimise our loss function analytically. However, as the model grows more complex, computing the parameters for every variable this way becomes less and less practical. In such cases, a method known as Gradient Descent lets us minimise the loss function instead.
    
    
    

#### 3.3) Gradient Descent
When there are one or more inputs, you can optimize the values of the coefficients by iteratively minimizing the error of the model on your training data.

This procedure is called Gradient Descent. It starts with random values for each coefficient. The sum of the squared errors is calculated over all pairs of input and output values. A learning rate is used as a scale factor, and the coefficients are updated in the direction that reduces the error. The process is repeated until a minimum sum of squared errors is reached or no further improvement is possible.
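
For the single-input model $\hat{y}_i = a x_i + b$, one iteration might be written as follows. Here $J$ denotes the cost from section 3.2 (the sum of squared errors divided by 2n) and $\alpha$ the learning rate. The index notation is an assumption made for this sketch.

$$
\frac{\partial J}{\partial a} = -\frac{1}{n} \sum_{i=1}^{n} x_i \left(y_i - \hat{y}_i\right),
\qquad
\frac{\partial J}{\partial b} = -\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)
$$

$$
a \leftarrow a - \alpha \, \frac{\partial J}{\partial a},
\qquad
b \leftarrow b - \alpha \, \frac{\partial J}{\partial b}
$$

The factor of 2 from differentiating the square cancels against the 2 in the denominator of the cost, which is why the gradients carry 1/n rather than 2/(2n).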

When using this method, you must select a learning rate (alpha) parameter that determines the size of the improvement step to take on each iteration of the procedure.
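
A minimal sketch of batch gradient descent for a single input, using only the standard library. The function name, the default learning rate of 0.05 and the iteration count are assumptions chosen so that this toy example converges; they are not recommended settings. The coefficients start at zero instead of at random values so that the result is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y_hat = a * x + b by minimising the squared error divided by 2n."""
    n = len(xs)
    a, b = 0.0, 0.0  # fixed start for reproducibility; random values also work
    for _ in range(iterations):
        # Prediction error for every training pair under the current line.
        errors = [(a * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the cost with respect to a and b.
        grad_a = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Step against the gradient, scaled by the learning rate.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
a, b = gradient_descent(xs, ys)
print(round(a, 3), round(b, 3))
```

On this noise-free data the estimates should settle very close to a = 2 and b = 1. A learning rate that is too large makes the updates overshoot and diverge, while one that is too small needs many more iterations.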

Gradient descent is often taught using a linear regression model because it is relatively straightforward to understand. In practice, it is useful when you have a very large dataset, either in the number of rows or the number of columns, that may not fit into memory.
