# Linear Regression

## Learning Objectives
- Understand linear regression modelling and apply it to a real dataset.
- Explore and analyse a real dataset, the Bikeshare dataset, to predict the number of rentals.
- Implement an ordinary linear regression model on a dataset satisfying the linearity assumption using the scikit-learn library.
- Understand and identify multicollinearity in a multiple regression.

## Introduction to Linear Regression
A linear regression model predicts the value of a continuous target variable as a weighted sum of the input features. The target variable is denoted as $y$ and the input features are denoted as a feature vector $x$. It is assumed that $y$ is a linear function of one or more input features with some random noise.
The relationship between the input features $x^{(1)}, x^{(2)}, \ldots, x^{(p)}$ and the target $y$ for one observation in the sample is:
\begin{equation}
y=\beta_{0}+\beta_{1}x^{(1)}+\ldots+\beta_{p}x^{(p)}+\epsilon
\end{equation}
where $\beta_{j}$ are the learned feature weights or coefficients, $x^{(j)}$ are the input features and $\epsilon$ is the noise or error term. Each input feature $x^{(j)}$ appears with degree 1, so the equation represents a **linear** relationship between the inputs and the target. A given set of weights $\beta_{j}$ produces a predicted target $\hat{y}_i$ for each training example $i$, which is then compared against the actual label $y_i$ of that observation. A `loss` function $loss(\hat{y}_i, y_i)$ measures how far the predicted target is from the actual label for training example $i$. Different modelling problems call for different loss functions. For this linear regression problem we use the squared error loss, which is the squared difference between the predicted value and the actual label.
\begin{equation}
 loss = (\hat{y}_{i} - y_{i})^{2}
\end{equation}
The best weights or coefficients are the ones that minimise the error, i.e. the difference between the predicted targets and the actual labels, over all the training examples in the dataset. (A side note: for now we use the training data as the basis for determining the best weights. We will revisit this definition once we reach the definitions of training error and test error.) Therefore, we sum the loss over all $n$ training examples; dividing that sum by $n$ gives the Mean Squared Error (MSE) loss. This process of fitting the equation to all the training examples is called training.
\begin{equation}
\hat{\boldsymbol{\beta}}=\arg\!\min_{\beta_0,\ldots,\beta_p}\sum_{i=1}^n(\hat{y}_{i}- y_{i})^{2}
\end{equation}

The above equation uses $\arg\min$ to denote the set of $\beta$s that minimises the sum of squared errors. Because this sum is just $n$ times the MSE, the same coefficients also minimise the overall MSE.
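To make the minimisation concrete, here is a minimal sketch using NumPy only. The five data points are made up for illustration, and `np.linalg.lstsq` is used as one convenient least-squares solver; it is not necessarily what scikit-learn does internally.

```python
import numpy as np

# Toy data, made up for illustration: y is roughly 1 + 2 * x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def sum_squared_error(beta0, beta1):
    """Sum over all examples of (y_hat_i - y_i)^2 for the line beta0 + beta1 * x."""
    y_hat = beta0 + beta1 * x
    return np.sum((y_hat - y) ** 2)

# Design matrix with a column of ones so that beta_0 acts as the intercept.
X = np.column_stack([np.ones_like(x), x])

# lstsq returns the coefficients that minimise the sum of squared errors.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0_hat, beta1_hat = beta_hat

print("beta_0:", beta0_hat, "beta_1:", beta1_hat)
print("SSE at the minimiser:", sum_squared_error(beta0_hat, beta1_hat))
print("SSE for a shifted line:", sum_squared_error(beta0_hat + 0.5, beta1_hat))
```

Any other choice of coefficients, such as the shifted line in the last print, gives a larger sum of squared errors.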

The equation with the learned coefficients describes a line of best fit that minimises the difference between the actual points and the predicted values in a 2D space, as in the illustration below. In an n-dimensional feature space, the equation describes a hyperplane.

<center><img src='./assets/estimation.png'></center>


## Linear Regression simple 1-D example
- This part illustrates the complete process of applying a machine learning model to a dataset.
- It demonstrates the use of a simple linear regression model on a 1-dimensional toy dataset.
- It uses common libraries such as numpy, pandas, seaborn and scikit-learn.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
plt.style.use("fivethirtyeight")



We start off by collecting or generating the data. Here we generate a random sample of 100 2-D points using a standard normal distribution $\mathcal{N}(0,1)$, where the x and y coordinates are linearly related to each other. Random noise ($\epsilon$) is introduced in the data generation process to scatter the points away from a perfectly straight line. This also simulates the noise present in real input data.
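A minimal sketch of such a data generation step is shown below. The true intercept, slope, noise scale and random seed are assumptions chosen for illustration; they are not taken from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

n_points = 100
true_beta0, true_beta1 = 1.0, 2.0    # assumed values, for illustration only
noise_scale = 0.5                    # assumed standard deviation of epsilon

x = rng.standard_normal(n_points)                       # x ~ N(0, 1)
epsilon = noise_scale * rng.standard_normal(n_points)   # random noise
y = true_beta0 + true_beta1 * x + epsilon               # linear relation plus noise

# scikit-learn expects features as a 2-D array of shape (n_samples, n_features).
X = x.reshape(-1, 1)
print(X.shape, y.shape)
```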

Using a machine learning model in scikit-learn is quite straightforward. We usually follow two steps:
- instantiate an instance of the model, 
- then fit the model with the input-output data.

Once the model finishes training, we use it to predict the target from an unseen data point.
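The two steps and the prediction can be sketched as follows. The toy data here is regenerated with assumed parameters (intercept 1, slope 2) so that the snippet runs on its own; `X_new` is a hypothetical unseen point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with assumed parameters: y = 1 + 2 * x + noise.
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((100, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * rng.standard_normal(100)

# Step 1: instantiate the model.
model = LinearRegression()
# Step 2: fit the model to the input-output data.
model.fit(X, y)

# Predict the target for an unseen data point (hypothetical value x = 0.5).
X_new = np.array([[0.5]])
y_new = model.predict(X_new)
print(y_new)
```

Note that `predict` expects the same 2-D layout as `fit`, which is why the single point is wrapped in a nested list.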

The coefficients of the linear regression model can be accessed through the `coef_` and `intercept_` attributes of the model instance. In this example, since each data point has a single feature, we can access $\beta_0$ and $\beta_1$ as below.
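As a hedged illustration, the snippet below fits a model on toy data generated with assumed parameters and reads the two coefficients back. It also checks that a prediction is nothing more than $\beta_0 + \beta_1 x$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with assumed parameters: y = 1 + 2 * x + noise.
rng = np.random.default_rng(seed=0)
X = rng.standard_normal((100, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * rng.standard_normal(100)

model = LinearRegression().fit(X, y)

beta_0 = model.intercept_   # a scalar for a single target
beta_1 = model.coef_[0]     # coef_ holds one weight per input feature
print("beta_0:", beta_0, "beta_1:", beta_1)

# A prediction is just the weighted sum of the features plus the intercept.
manual_preds = beta_0 + beta_1 * X[:, 0]
print(np.allclose(manual_preds, model.predict(X)))
```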

#### Nonlinear relationship between features and target variable

From the above toy example, we hope you have an idea of what linear regression is and how to train a linear regression model on a dataset using the scikit-learn library. Now we are going to see how to apply an ML model to a real dataset.

<a id="introduce-the-bikeshare-dataset"></a>
## Introduce the Capital Bikeshare Data Set
---

- This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information (http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).
- The goal is to predict how many bikes will be rented depending on the weather and the day.
- Here is the list of features that we are going to use:
 * datetime - hourly date + timestamp  
 * season -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
 * holiday - whether the day is considered a holiday
 * workingday - whether the day is neither a weekend nor holiday
 * weather:
        1. Clear, Few clouds, Partly cloudy
        2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
        3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
        4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
 * temp - temperature in Celsius
 * atemp - "feels like" temperature in Celsius
 * humidity - relative humidity
 * windspeed - wind speed
 * casual - number of non-registered user rentals initiated
 * registered - number of registered user rentals initiated
 * count - number of total rentals

## Exploratory Data Analysis

We want to find a set of coefficients that best predicts the total rentals. The features in our input are temperature, windspeed, holiday, etc.
An example would be something like this:

$total\_rentals = 20 - 2 \cdot temp - 3 \cdot windspeed + \ldots + 0.1 \cdot registered$

#### Let's build a Linear Regression Model in sklearn to apply to the bikeshare dataset
- Select the appropriate features to give to the model as inputs
- Perform feature engineering to transform the data types into the right format to be useful for the model to learn
- Compare train and test error. Even though we mentioned in the introduction that the best set of coefficients is the one that minimises the train MSE, the test MSE is actually more important when evaluating the performance of an ML model. It shows how your model will perform on new, unseen data, which is what it ultimately has to do in practice. Therefore we are going to check both the train error and the test error after every modification step and for every new model we select.

In [2]:
# Import LinearRegression model from sklearn
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

import numpy as np

## We define a train function to reduce the amount of typing
def train_linear(features, label, test_size=0.2):
    """
    Ins: features, label, test_size
    Outs: None, but prints the train RMSE and test RMSE
    """
    X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=test_size, random_state=42)
    # Instantiate a linear regression model and fit it to our training data
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    # Make predictions on our training data
    training_preds = linreg.predict(X_train)

    # Calculate and print the training RMSE (the square root of the MSE)
    print('Training RMSE:', np.sqrt(metrics.mean_squared_error(y_train, training_preds)))
    # Now apply the trained model to our test data and calculate the test RMSE
    test_preds = linreg.predict(X_test)
    print('Test RMSE:', np.sqrt(metrics.mean_squared_error(y_test, test_preds)))


For continuous variable prediction, accuracy is not a good metric, since we can never expect the predicted value to match the target value exactly (remember the noise?). Therefore, we need to look at the size of the prediction errors and try to reduce it.
Looking at a single train or test error in isolation is not very informative, as its magnitude depends on the scale of the target and only describes how far the predicted outputs are from the actual outputs. We normally want to look at how these errors change as we try out different models or modify the features, then select the model or configuration that yields the lowest train and test errors. More on this later. For now, there are two important things that we want you to keep in mind:
- Know how to create a train function to take in input features and create a model to predict the output labels
- Have a train RMSE and test RMSE to be used as baseline

Let's use this as a baseline and see how the other models and feature engineering techniques that you are going to learn later perform against this naive linear regression model.
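As a rough sketch of what such a baseline looks like, the snippet below computes a train and test RMSE on a synthetic stand-in table. The DataFrame `demo`, its invented coefficients and the helper `rmse` are hypothetical and only mimic a few bikeshare column names; on the real dataset the numbers will differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bikeshare table; all coefficients are invented.
rng = np.random.default_rng(seed=0)
n = 500
demo = pd.DataFrame({
    "temp": rng.uniform(0, 35, n),
    "humidity": rng.uniform(20, 100, n),
    "windspeed": rng.uniform(0, 40, n),
})
demo["count"] = (50 + 8 * demo["temp"] - 1.5 * demo["humidity"]
                 - 0.5 * demo["windspeed"] + rng.normal(0, 30, n))

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

X_train, X_test, y_train, y_test = train_test_split(
    demo[["temp", "humidity", "windspeed"]], demo["count"],
    test_size=0.2, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
train_rmse = rmse(y_train, linreg.predict(X_train))
test_rmse = rmse(y_test, linreg.predict(X_test))
print("Training RMSE:", train_rmse)
print("Test RMSE:", test_rmse)
```

Keeping both numbers side by side makes it easy to see later whether a change improves the fit on unseen data or only on the training set.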

### Regression Evaluation Metrics:

With $n$ denoting the number of examples being evaluated, we have the following error metrics:

**Mean absolute error (MAE)** is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
- Easy to understand
- Does not highlight the effect of outliers

**Mean squared error (MSE)** is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
- Squaring makes the derivative easier to deal with in some optimization algorithms
- Highlights the significance of outliers

**Root mean squared error (RMSE)** is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
- Has the benefit that the error is in the same unit as the target
- Still keeps the squaring, so it highlights the significance of outliers

In scikit-learn:
```
>>> print('MAE:', metrics.mean_absolute_error(true, pred))
>>> print('MSE:', metrics.mean_squared_error(true, pred))
>>> print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred)))
```
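To see how the three metrics relate, here is a small worked example that computes them by hand with NumPy and compares them with scikit-learn. The four values in `true` and `pred` are made up; the last prediction is deliberately far off to act as an outlier.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predictions; the last prediction is off by 5.
true = np.array([10.0, 7.0, 5.0, 5.0])
pred = np.array([8.0, 6.0, 5.0, 10.0])

errors = true - pred                 # [2, 1, 0, -5]
mae = np.mean(np.abs(errors))        # (2 + 1 + 0 + 5) / 4 = 2.0
mse = np.mean(errors ** 2)           # (4 + 1 + 0 + 25) / 4 = 7.5
rmse = np.sqrt(mse)                  # about 2.74

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)

# The hand-computed values agree with scikit-learn.
print(np.isclose(mae, metrics.mean_absolute_error(true, pred)))
print(np.isclose(mse, metrics.mean_squared_error(true, pred)))
```

The RMSE is larger than the MAE here because the single large error dominates once it is squared.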

Null RMSE is the RMSE that could be achieved by always predicting the mean response value. It is a benchmark against which you may want to measure your regression model: a useful model should achieve a lower RMSE than this.

Create a NumPy array with the same shape as `y_test`.
```
y_null = np.zeros_like(y_test, dtype=float)
assert len(y_null)==len(y_test)
```

Fill the array with the mean value of `y_test`.
```
y_null.fill(y_test.mean())
y_null
```
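The remaining step is to compute the RMSE between `y_test` and this constant array. Here is a self-contained sketch, assuming a small made-up `y_test` since the real split is created inside the train function:

```python
import numpy as np
from sklearn import metrics

# Made-up test labels, standing in for the real y_test.
y_test = np.array([16.0, 40.0, 32.0, 13.0, 1.0, 110.0])

# Constant prediction: every entry is the mean of y_test.
y_null = np.zeros_like(y_test, dtype=float)
y_null.fill(y_test.mean())

null_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_null))
print("Null RMSE:", null_rmse)

# With the mean of y_test as the prediction, the null RMSE equals
# the (population) standard deviation of y_test.
print(np.isclose(null_rmse, np.std(y_test)))
```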

### Feature Engineering

As we see from the above results, the columns `season`, `holiday`, `workingday` and `weather` should be of `categorical` data type, but their current data type is `integer`. Let us transform the dataset in the following ways to make it more useful for the model to learn from.

- Extract more information from the current datetime by creating new columns `date`, `hour`, `weekDay`, `month` from `datetime` column.
- Convert the datatype of `season`, `holiday`, `workingday` and `weather` to category.
- Drop the datetime column as we already extracted useful features from it.
- Drop the `atemp` column because it conveys almost the same information as the `temp` column, and keeping both would introduce multicollinearity.
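The four steps above can be sketched with pandas as follows. The three rows in `bikes` are made up and only reuse the column names from the dataset description. Storing `weekDay` as a day name is an assumption; the text does not say which representation is used.

```python
import pandas as pd

# Three made-up rows using the column names from the dataset description.
bikes = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00", "2012-12-19 08:00:00"],
    "season": [1, 3, 4],
    "holiday": [0, 1, 0],
    "workingday": [0, 0, 1],
    "weather": [1, 1, 2],
    "temp": [9.84, 32.8, 11.48],
    "atemp": [14.395, 36.365, 13.635],
    "count": [16, 365, 594],
})

# 1. Extract new columns from the datetime column.
timestamps = pd.to_datetime(bikes["datetime"])
bikes["date"] = timestamps.dt.date
bikes["hour"] = timestamps.dt.hour
bikes["weekDay"] = timestamps.dt.day_name()   # assumption: store the day name
bikes["month"] = timestamps.dt.month

# 2. Convert the integer-coded columns to the category dtype.
for column in ["season", "holiday", "workingday", "weather"]:
    bikes[column] = bikes[column].astype("category")

# 3. and 4. Drop datetime (already mined for features) and atemp.
bikes = bikes.drop(columns=["datetime", "atemp"])
print(bikes.dtypes)
```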

### One hot encoding ###

When categorical data are represented using integers, learning algorithms may wrongly interpret them as having an ordinal relationship.
For example, if there are five categories 'A', 'B', 'C', 'D' and 'E' that have no ordinal meaning, encoding them as the integers 1 to 5 may accidentally introduce an ordering. One-hot encoding instead represents each of the five categories as a binary vector of length 5. In practice only 4 of the 5 binary columns are given to the ML model, as the 5th can be inferred from the first four; dropping it removes collinearity between the input features. Pandas has an option to remove this redundant column.

For example, the `season` column, which contains 4 integers (1 - Spring, 2 - Summer, 3 - Autumn and 4 - Winter), can be represented as 4 binary vectors:
- 1: [1, 0, 0, 0]
- 2: [0, 1, 0, 0]
- 3: [0, 0, 1, 0]
- 4: [0, 0, 0, 1]
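A minimal sketch of this encoding with `pandas.get_dummies` is shown below, on a made-up `season` column. The `drop_first=True` argument is the pandas option that removes the redundant column.

```python
import pandas as pd

# Made-up season values: 1 = spring, 2 = summer, 3 = autumn, 4 = winter.
season = pd.Series([1, 2, 3, 4, 1], name="season").astype("category")

# Full one-hot encoding: one binary column per category.
full = pd.get_dummies(season, prefix="season")

# Drop the first category; it is implied when all remaining columns are 0.
reduced = pd.get_dummies(season, prefix="season", drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In the full encoding every row sums to 1, which is exactly the linear dependence between columns that dropping one of them removes.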

## Pros/Cons of Linear Regression
Advantages of linear regression:

- Simple to explain.
- Highly interpretable.
- Model training and prediction are fast.

Disadvantages of linear regression:

- Presumes a linear relationship between the features and the response.
- Predictive performance is often not competitive with more flexible ML models, especially when the true relationship is nonlinear.
