# Linear regression

## Prerequisites

- Basic Python
- Linear algebra

## Learning objectives

- Know the difference between monovariate and multivariate regression
- Implement your first machine learning algorithm from scratch, in Python
- Use the analytical solution to solve for its parameters
- See how to optimize linear regression using the analytical solution

## Loading dataset

Once again we will use the `Boston` dataset from `sklearn`. We saw it previously, so this should be easy by now. Let's also split it into train, validation and test sets:

In [None]:
from sklearn import datasets, model_selection

# 70% for train, 15% each for validation and test
# NOTE: `load_boston` was removed in scikit-learn 1.2, so this cell needs an older version
X, y = datasets.load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)

X_validation, X_test, y_validation, y_test = model_selection.train_test_split(
    X_test, y_test, test_size=0.5
)

print(X_train.shape, y_train.shape)

## What is linear regression?

Linear regression is the classic starting point for machine learning adventures, something like the `Hello World` of the ML world.

__Linear regression predicts continuous outputs__ - hence the regression part of the name.
Linear regression makes predictions that are simply a __`w`eight__ed combination (a linear combination) of the inputs, plus some offset called the __`b`ias__. It is described by a linear function:

$$
    y = wx + b
$$


![](images/linear_model.jpg)

In the future we will encounter much more complex, nonlinear relationships between features and labels that we wish to model.

But __do not underestimate linear regression__: it is often used in statistics and to explain a lot of phenomena. At the end of the chapter we will see when it should be used in the real world.

The function that a model represents is often referred to as the **hypothesis**.

![](images/linear_model_example.jpg)

We will make our model able to make predictions for many examples at a time by expressing the hypothesis in vector form as shown below.

![](images/linear_model_vector.jpg)

Here's an example of what that computation might look like numerically.

![](images/linear_model_vector_example.jpg)


## Mathematical formula of model

The formula below describes linear regression for a single example __but multiple features__:

$$
\begin{equation}
    y = w_1x_1 + w_2x_2 + ... + w_Nx_N + b = \sum_{i=1}^{N} w_ix_i + b
\end{equation}
$$
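As a small numeric illustration of the formula above, the explicit sum and NumPy's dot product give the same prediction. The weights, features and bias here are made-up values, not taken from the dataset:

```python
import numpy as np

# Made-up values for one example with N = 3 features
x = np.array([2.0, 0.5, 4.0])    # features x_1..x_N
w = np.array([0.5, -2.0, 0.25])  # weights  w_1..w_N
b = 1.5                          # bias

# Explicit sum: w_1*x_1 + ... + w_N*x_N + b
y_sum = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# The same computation written as a dot product
y_dot = np.dot(w, x) + b

print(y_sum, y_dot)
```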

Essentially:
- each feature in our sample is multiplied by its weight
- the results are summed and the bias is added

Now let's implement our first machine learning model in code!

## Multiple features

We will go for multiple features, so here is what our weights will look like:

![title](images/w_vector.jpg)

The weights variable (`w`) becomes a row vector, so we need to transpose it when we multiply it by the `X` matrix (or take a `dot` product of `data` and `weights`).

![title](images/vector_linear_regression.jpg)
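A minimal sketch of the batched computation, assuming made-up data and a weight row vector of shape `(1, n_features)` as described above. The names `X`, `w` and `b` are only illustrative:

```python
import numpy as np

# Made-up batch: 4 examples, 3 features each
X = np.array([
    [1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
    [1.0, 0.0, 1.0],
])
w = np.array([[0.5, -1.0, 2.0]])  # row vector, shape (1, 3)
b = 0.5

# X is (4, 3) and w.T is (3, 1), so the product is (4, 1):
# one prediction per example
y_hat = X @ w.T + b

print(y_hat.shape)
print(y_hat.ravel())
```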

## Monovariate vs multivariate

A dichotomy you might sometimes come across:

> monovariate linear regression is linear regression done with __one or multiple features__ but __predicting single target__

And __multivariate__ (as you may have guessed) would be

> linear regression with __one or multiple variables (features)__ but __predicting multiple targets__ (which are correlated with each other)

In this notebook we will be doing __monovariate__ only, but we will get to __multivariate__ when we do multiclass classification.
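In terms of array shapes, the distinction could look like the sketch below. It uses random made-up data, and `n_targets` is an illustrative name that is not used elsewhere in this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 5, 3, 2
X = rng.normal(size=(n_samples, n_features))

# Monovariate: one weight per feature, one prediction per example
W_mono = rng.normal(size=(n_features, 1))
b_mono = rng.normal(size=1)
y_mono = X @ W_mono + b_mono       # shape (5, 1)

# Multivariate: one column of weights (and one bias) per target
W_multi = rng.normal(size=(n_features, n_targets))
b_multi = rng.normal(size=n_targets)
y_multi = X @ W_multi + b_multi    # shape (5, 2)

print(y_mono.shape, y_multi.shape)
```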

## Exercise

Our task is to implement `LinearRegression`!

- Create a class `LinearRegression` which takes a single `n_features` argument during initialization.
    - Create `W` and `b` variables inside initialization: `W` of shape `(n_features, 1)` and `b` (the bias) of shape `1`, both initialized from a random normal distribution
- Create a `__call__` function (what does it do, what is a functor?) which takes `X` (`np.array`). It should return the predictions of our linear regression (see the formulas in the pictures above, it's two operations only)
- Create an `update_params` function which takes `W` and `b` and assigns them to the appropriate variables in `self`.
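
As a hint for the `__call__` question: defining `__call__` on a class lets its instances be called like functions (such objects are sometimes called functors). A minimal sketch with an unrelated, hypothetical `Scale` class:

```python
import numpy as np


class Scale:
    """Hypothetical example class: multiplies its input by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, X):
        # Runs when the instance is called like a function: scale(X)
        return self.factor * X


scale = Scale(factor=2.0)
result = scale(np.array([1.0, 2.0, 3.0]))  # same as scale.__call__(...)
print(result)
```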

In [None]:
import numpy as np

...

In [None]:
model = LinearRegression(n_features=13)  # instantiate our linear model
y_pred = model(X_train)  # make prediction on data
print("Predictions:\n", y_pred[:10]) # print first 10 predictions

In [None]:
import matplotlib.pyplot as plt

def plot_predictions(y_pred, y_true):
    samples = len(y_pred)
    plt.figure()
    plt.scatter(np.arange(samples), y_pred, c='r', label='predictions')
    plt.scatter(np.arange(samples), y_true, c='b', label='true labels', marker='x')
    plt.legend()
    plt.xlabel('Sample numbers')
    plt.ylabel('Values')
    plt.show()

In [None]:
plot_predictions(y_pred[:10], y_train[:10])

## Analysis

As you can see, the predictions of our model are __way off__. This happens because we initialized our model with random weights and bias.

Now we should learn how to improve this model so that it learns from data.

## Loss - how do we know how good our model is?

> Our **loss** should measure __how poorly our model performs__.

The larger the value, the worse the model, so we will later try to __minimize it__ (bring it as close to zero as possible). We will use it to give our model feedback about its performance.

> The loss needs to be a **single number**, not a vector, not a matrix.

__NOTE:__ minimising the objective is equivalent to maximising the negative of it.

The loss is also commonly called the __cost function__, though this is not exact. Let's go over the difference now.

### Squared Error loss

> loss is a function which takes a prediction and a true label and returns __a non-negative scalar__

- The higher the loss value, the worse our model performs
- __Loss is defined on a single data point__

Squared error is one of the loss functions __used for regression tasks__ and is simply defined as:

$$
\begin{equation}
    (\hat{y} - y)^2
\end{equation}
$$

This does exactly what you think: it calculates the error (difference between our model's prediction $\hat{y}$ and the true value $y$):

$$
\begin{equation}
    \hat{y} - y
\end{equation}
$$

and then squares it to make the value non-negative. As long as the error is not zero, it will increase the value of the loss regardless of whether our prediction is below (negative error) or above (positive error) the value of the label.
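
A quick numeric check of this, using a made-up label and made-up predictions. The helper name `squared_error` is only illustrative:

```python
def squared_error(y_hat, y):
    # Loss for a single data point
    return (y_hat - y) ** 2


y = 20.0  # made-up true value
print(squared_error(17.0, y))  # prediction below the label, error -3
print(squared_error(23.0, y))  # prediction above the label, error +3
print(squared_error(20.0, y))  # perfect prediction
```

Errors of the same size in either direction are penalised equally, and only a perfect prediction gives a loss of zero.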

### Mean Squared Error (MSE) cost function

> cost function is a generalization of loss functions for many data samples

So, __loss__ operates on a single sample, while __cost__ operates on multiple samples.
In the case of __Mean Squared Error__ we calculate the squared error for each sample and take the mean of those values:

$$
\begin{equation}
    L_{mse} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2
\end{equation}
$$

There are many other criteria that are useful for different tasks (e.g. the binary cross entropy (BCE) loss for classification, which we will cover later).

Let's write a function to calculate the cost using the mean squared error loss function. It should take in an array of predictions for different example inputs as well as an array of corresponding example labels. It should return a single number (scalar) that represents the MSE loss. 

## Exercise

Implement a `mean_squared_error` function taking `y_pred` and `y_true`. All the formulas you need are above (focusing on the last one is enough ;) )
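
One thing to watch for when implementing this: if the predictions have shape `(n_samples, 1)` (as they will with `W` of shape `(n_features, 1)`) and the labels have shape `(n_samples,)`, NumPy broadcasting turns their difference into an `(n_samples, n_samples)` matrix, and the mean is then taken over the wrong values. A small sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])        # shape (3,)
y_pred = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), perfect predictions

# Broadcasting (3, 1) against (3,) gives a (3, 3) matrix of all pairwise differences
diff_wrong = y_pred - y_true
print(diff_wrong.shape)

# Flattening the predictions first gives one error per example
diff_right = y_pred.ravel() - y_true
print(diff_right.shape)

print(np.mean(diff_wrong ** 2))  # not zero, even though the predictions are perfect
print(np.mean(diff_right ** 2))  # zero, as expected
```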

In [None]:
...

In [None]:
cost = mean_squared_error(y_pred, y_train)
print(cost)

## The analytical solution to minimising mean square error

Now that we have our __cost__ equation we can calculate its derivative w.r.t. the weights. When we set it to zero we can calculate the __weight values (`W`)__ which minimize it.

![](images/analytical_linear_reg.jpg)
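
For reference, here is a sketch of the standard least-squares derivation, which is presumably what the figure above summarises. It assumes that $X$ already contains a column of ones for the bias and that $X^TX$ is invertible:

$$
\begin{aligned}
    L_{mse}(W) &= \frac{1}{N}(XW - y)^T(XW - y) \\
    \frac{\partial L_{mse}}{\partial W} &= \frac{2}{N}X^T(XW - y) = 0 \\
    X^TXW &= X^Ty \\
    W &= (X^TX)^{-1}X^Ty
\end{aligned}
$$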

Now let's implement this analytical solution for least squares regression in code:

## Exercise

Now that we have the mathematical formula, we can jump straight into the implementation.

- For matrix inverse, you can use [`np.linalg.inv`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html) function
- Remember to return the `weights` part of the solution first and the `bias` after it (the `bias` is element `0` of the result, because the column of ones comes first in `X_with_bias`)

In [None]:
def minimize_loss(X_train, y_train):
    X_with_bias = np.hstack((np.ones((X_train.shape[0], 1)), X_train))
    ...
    return ..., ... # Return weights and bias here


weights, bias = minimize_loss(X_train, y_train)
print(weights, bias)

In case you didn't notice, this analytical solution has no mention of the model bias. 
In fact, we incorporate the model bias into our features matrix by adding an extra column filled with `1`.

![](images/bias_in_weight_matrix.jpg)

Doing this makes the analytical solution much cleaner, and means we only have to solve for a single parameter vector $W$ rather than also for $b$.

In practice (iterative optimization), we treat them as separate variables (we will later see more about that).
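
A small sketch of why folding the bias into the weights works, with made-up numbers: putting a column of ones in front of the features and the bias in front of the weights gives the same predictions as `X @ w + b`.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # made-up weights
b = 2.0                    # made-up bias

# Separate weights and bias
y_separate = X @ w + b

# Bias folded into the weights: a column of ones is prepended to X,
# and the bias becomes element 0 of the combined vector
X_with_bias = np.hstack((np.ones((X.shape[0], 1)), X))
w_with_bias = np.concatenate(([b], w))
y_combined = X_with_bias @ w_with_bias

print(y_separate)
print(y_combined)
```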

## Update parameters

Now that we have found the optimal `weights` and `bias`, we should update our model and see how it performs:

In [None]:
model.update_params(weights, bias)
y_pred = model(X_train)
cost = mean_squared_error(y_pred, y_train)
print(cost)

## Success, __BUT__

__Congratulations, we have trained our first model from scratch!__

We should talk about scale though... Let's plot our labels with respect to certain features to see how that looks:

In [None]:
def plot_feature_label(X_train, y_true, feature, n_samples: int = 20):
    features = X_train[:n_samples, feature]
    labels = y_true[:n_samples]
    plt.figure()
    plt.scatter(features, labels, c='r', label='targets')
    plt.legend()
    plt.xlabel('Feature values')
    plt.ylabel('Target values')
    plt.show()


for i in range(13):
    plot_feature_label(X_train, y_train, i, 20)  # labels must come from the same split as the features

## Data normalization

As you may have noticed in the plots above, different features have different value ranges:
- some features were binary or between `[0, 1]`
- others have values ranging in the hundreds or even thousands

This is problematic for most machine learning models.

### Why is it a problem?

> Small change in weight connected to feature with large values changes output significantly

Let's take two weights `a` and `b` and a single example `x` with two features:
- the first has value `0.1`
- the second has value `1000`

The formula for linear regression would then be the following:

$$
    \hat{y} = 0.1a + 1000b
$$

Now, let's see the impact of `a` and `b` on $\hat{y}$:
- $a = 10, b = 0.001$ - `a` and `b` have the same impact on $\hat{y}$
- $a = 1, b = 1$ - `b` has `10000` times (!!!) more impact on $\hat{y}$

It is unlikely that, in the real world, the first feature really has `10000` times less impact on the value we want to predict.
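
The arithmetic in this example can be checked directly. The helper name `contributions` is only illustrative:

```python
x = (0.1, 1000.0)  # the two feature values from the example above


def contributions(a, b):
    # Each term's share of y_hat = 0.1 * a + 1000 * b
    return x[0] * a, x[1] * b


print(contributions(10, 0.001))  # both terms contribute about 1
term_a, term_b = contributions(1, 1)
print(term_b / term_a)           # roughly 10000
```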

> We should assume all variables are __equally important unless we verify them__ via statistical testing or other measures

The range of values __is not an important factor__, relative differences between values are.

### Possible solution

We can normalize our data, which means:

> Normalization is a process of bringing features to the same value range

This ensures that the relative differences between values of each feature matter, not the scale.

> We should always normalize our features (unless they are not continuous)

## Normalization & standardization

There are a lot of schemes to put values in the `[0, 1]` range. One of them is the `minmax` approach: we subtract the minimum, then divide by the range (feature normalisation).

![title](images/normalisation.jpg)
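
A minimal sketch of min-max normalisation on made-up data, applied to each feature (column) separately. It assumes no feature is constant, since a zero range would cause a division by zero:

```python
import numpy as np

# Made-up data: 3 examples, 2 features on very different scales
X = np.array([[0.1, 1000.0],
              [0.5, 3000.0],
              [0.9, 2000.0]])

# Per-feature (column-wise) minimum and range
X_min = X.min(axis=0)
X_range = X.max(axis=0) - X_min

X_normalised = (X - X_min) / X_range
print(X_normalised)
```

After this step both columns lie in `[0, 1]`, regardless of their original scale.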

We can alternatively use a similar method called standardisation, where we subtract the mean then divide by the standard deviation.

![](images/standardisation.jpg)

Feature normalisation puts gradients of each different model parameter on the same order of magnitude. This converts loss surfaces that might look like *valleys* into loss surfaces that look like *bowls*. Feature normalisation means that we should be able to make progress with optimisation for all model parameters using the same learning rate.

![](images/bowl.png)

## Exercise

We will implement the standardization scheme.

The formula was given above. In this case:
- use `mean` and `std` if __both__ of them are not `None`
- otherwise calculate `mean` and `std` from the dataset
- standardize the dataset with those values
- return a `tuple` containing:
    - standardized_dataset
    - `tuple` with:
        - mean
        - std

Do you have any idea why we would like to do it this way? __Tip:__ We may want to normalize another dataset on which we should not calculate any values...
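
The idea behind the tip, sketched with made-up arrays and plain NumPy (this is not the `standardize_data` function itself): the statistics are computed on the training data only and then reused for any other split.

```python
import numpy as np

train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
validation = np.array([[2.0, 400.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_std = (train - mean) / std
# The validation data reuses the training mean and std
validation_std = (validation - mean) / std

print(train_std.mean(axis=0), train_std.std(axis=0))
print(validation_std)
```

The standardized training data has zero mean and unit standard deviation per feature. The validation data generally does not, which is expected.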

In [None]:
def standardize_data(dataset, mean=None, std=None):
    ...

X_train, (mean, std) = standardize_data(X_train)

## Test again

Now that we have our data standardized, let's see how our model performs:

In [None]:
weights, bias = minimize_loss(X_train, y_train)
model.update_params(weights, bias)
y_pred = model(X_train)
cost = mean_squared_error(y_pred, y_train)
print(cost)

## Success?

Our loss is exactly the same? Trust me, we did everything correctly, so why did this happen?
See challenges at the end.

> __Always normalize and sanitize your input data__ (though in this case it didn't change anything)
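
If you want to check this behaviour in isolation, here is a sketch on made-up synthetic data. It uses `np.linalg.lstsq` as a stand-in least-squares solver rather than the `minimize_loss` function from this notebook, and fits the same data before and after standardization:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales
X = rng.normal(size=(50, 3)) * np.array([0.1, 1000.0, 1.0])
y = X @ np.array([2.0, 0.003, -1.0]) + 5.0 + rng.normal(scale=0.1, size=50)


def fit_and_score(features, targets):
    # Least squares with a bias column, then MSE on the same data
    features_b = np.hstack((np.ones((features.shape[0], 1)), features))
    params, *_ = np.linalg.lstsq(features_b, targets, rcond=None)
    return np.mean((features_b @ params - targets) ** 2)


X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
print(fit_and_score(X, y))
print(fit_and_score(X_standardized, y))  # matches up to floating point error
```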

In the next chapter we will see other tricks which will help you improve the score even further.

### Drawbacks of computing the analytical solution

This solution involves inverting a matrix of size $n \times n$.
Here $n$ is the number of features that each example has.

With `560` features this already becomes more difficult. Furthermore, here we only have ~500 samples, while in real life we can have millions or more.

However, as we will see, most problems of practical interest contain examples with many more features. 

> For example, 1080p images have more than 1,000,000 features each. 

The time complexity of inverting a matrix of size $n \times n$ is around $O(n^3)$. 
This means that computing the analytical solution for these kinds of real world problems is often computationally expensive or even impossible.
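
To get a feel for the numbers, here is a rough back-of-the-envelope sketch. It assumes one feature per pixel, `float64` storage (8 bytes per value), and treats $n^3$ as the operation count, so the results are orders of magnitude only:

```python
def normal_equation_cost(n_features):
    # Rough estimates, not measurements
    matrix_bytes = n_features ** 2 * 8   # memory for the n x n matrix
    inversion_ops = n_features ** 3      # order of magnitude of operations
    return matrix_bytes, inversion_ops


n_small = 13           # number of features used in this notebook
n_image = 1920 * 1080  # one feature per pixel of a 1080p image

for n in (n_small, n_image):
    matrix_bytes, inversion_ops = normal_equation_cost(n)
    print(f"n={n}: ~{matrix_bytes / 1e12:.6f} TB, ~{inversion_ops:.2e} operations")
```

Under these assumptions, the matrix for a single 1080p image would need tens of terabytes of memory before any inversion even starts.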

Analytical solutions, however, are not the only approach that we can take (and usually we __cannot even use them__, as the closed form cannot be calculated).

We will see how to update parameters iteratively soon.

## Challenges

- What other normalization schemes exist? Check out unit vector normalization
- Try things presented in this notebook on different datasets. Maybe find one of your own and preprocess it?

## Summary

- linear regression is the "hello world" of machine learning models
- linear regression updates its weight vector and bias in order to improve on the task
- this update can be carried out via an analytically calculated formula
- the MSE loss is appropriate for many regression problems and is the most common loss function for this task
- normalization almost always improves our scores, and __sometimes our solution will diverge without it!__