<h1 style="text-align:center;">Regression</h1>

# Introduction

Regression analysis is a statistical process that helps us understand the relationship between a dependent variable and one or more independent variables. The dependent variable is the thing we want to predict or understand better, and the independent variables are the things we think might be related to it. For example, suppose we want to predict the price of a house from features such as its size, location, and number of bedrooms. We can use regression analysis to understand how these features affect the price. In this case, the dependent variable is the price of the house, and the independent variables are the features.

The relationship between the independent variables and the dependent variable is expressed mathematically because that lets us measure how strong or weak the relationship is. We use mathematical formulas to find the best possible mapping between the independent variables and the dependent variable. This mapping can then be used to predict the value of the dependent variable for new values of the independent variables.

Machine learning algorithms aim to predict the values of one output column (called the target vector and denoted by lowercase $y$) using data from one or more input columns (the feature matrix, denoted by capital $X$). The predictions rely on mathematical equations determined by the general class of machine learning problem being addressed.

The most common regression algorithm is linear regression. Linear regression treats each column of the feature matrix as a variable in a linear equation: it multiplies the values by coefficients (also called weights) and sums the results, plus an intercept, to predict the target column. The weights are chosen to minimize the squared error between the predictions and the actual values; scikit-learn's `LinearRegression` does this with ordinary least squares rather than gradient descent. The predictions of linear regression can be any real number.
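
To make the idea of weights concrete, here is a minimal NumPy sketch on a tiny made-up dataset (the numbers, `true_weights`, and `true_intercept` are invented for illustration). It fits the weights by least squares and shows that a prediction is just a weighted sum of the feature columns plus an intercept. It is a sketch of the idea, not scikit-learn's implementation.

```python
import numpy as np

# Tiny made-up feature matrix X (4 rows, 2 features).
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])

# Invented "true" weights and intercept used to build a noise-free target y.
true_weights = np.array([2.0, -1.0])
true_intercept = 5.0
y = X @ true_weights + true_intercept

# Append a column of ones so the intercept is learned as one more weight.
X_with_bias = np.column_stack([X, np.ones(len(X))])

# Least squares: find the weights that minimize the squared error.
weights, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)

# A prediction is each feature times its weight, summed, plus the intercept.
y_pred = X_with_bias @ weights

print("weights and intercept:", np.round(weights, 3))
print("predictions:", np.round(y_pred, 3))
```

Because the target here has no noise, the fitted weights recover the invented ones. On real data such as the bike rentals, the fit is only approximate, which is why we measure the error later on.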

In the bike rentals dataset we worked on in the last section, `'cnt'` is the number of bike rentals on a given day. Predicting this column would be of great use to a bike rental company. We will denote this as our target vector $y$.

Our mission (should we choose to accept it) is to predict the number of bike rentals on a given day based on data such as whether the day is a holiday or working day, forecasted temperature, humidity, windspeed, and so on (our feature matrix $X$).


## Declaring predictor and target columns

Machine learning works by performing mathematical operations on the predictor columns (the feature matrix) to determine the target vector.

Let's import our libraries, including our `data_wrangle` module, so we can get our data ready for the next steps.

In [3]:
import pandas as pd
# Import only the helpers used below instead of a star import
from data_wrangle import get_data, prep_data

In [4]:
url = 'https://raw.githubusercontent.com/theAfricanQuant/XGBoost4machinelearning/main/data/bike_rentals.csv'

In [6]:
df = get_data(url)
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [7]:
df_bikes = prep_data(df)
df_bikes

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,1.0,0.0,1,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,985
1,2,1.0,0.0,1,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,801
2,3,1.0,0.0,1,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,1349
3,4,1.0,0.0,1,0.0,2.0,1.0,1,0.200000,0.212122,0.590435,0.160296,1562
4,5,1.0,0.0,1,0.0,3.0,1.0,1,0.226957,0.229270,0.436957,0.186900,1600
...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,727,1.0,1.0,12,0.0,4.0,1.0,2,0.254167,0.226642,0.652917,0.350133,2114
727,728,1.0,1.0,12,0.0,5.0,1.0,2,0.253333,0.255046,0.590000,0.155471,3095
728,729,1.0,1.0,12,0.0,6.0,0.0,2,0.253333,0.242400,0.752917,0.124383,1341
729,730,1.0,1.0,12,0.0,0.0,0.0,1,0.255833,0.231700,0.483333,0.350754,1796


We start by creating a new variable called `target` and assigning it the name of the target column. Then we create another variable called `features` and, using a list comprehension, assign it the names of the remaining columns (everything except the target).

I think the next lines of code are self-explanatory. There are other ways of accomplishing this step; I just love practicing my list comprehensions whenever I get the chance.

In [8]:
target = 'cnt'
# Keep every column except the target; use != because `in` on a string is a substring test
features = [col for col in df_bikes.columns if col != target]
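
One thing worth keeping in mind when filtering column names against a single string: `in` applied to a string is a substring test, not an equality test. The small sketch below uses hypothetical column names (`'c'` and `'nt'` are invented and are not in the bike rentals data) to show the difference.

```python
target = 'cnt'

# Hypothetical column names: 'c' and 'nt' are invented to show the pitfall.
columns = ['temp', 'hum', 'c', 'nt', 'cnt']

# `in` on a string is a substring test, so 'c' and 'nt' are dropped too.
by_substring = [col for col in columns if col not in target]

# Equality comparison drops only the exact target name.
by_equality = [col for col in columns if col != target]

print(by_substring)  # ['temp', 'hum']
print(by_equality)   # ['temp', 'hum', 'c', 'nt']
```

With the actual bike rentals columns both versions give the same 12 features, since no other column name is a substring of `'cnt'`. With pandas, `df_bikes.drop(columns=target)` is another common route to the feature matrix.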

Next, we separate our data into the target vector $y$ and the feature matrix $X$.

In [12]:
y = df_bikes[target]
X = df_bikes[features]
print(f"shape of target vector, y: {y.shape}")
print(f"shape of feature matrix, X: {X.shape}")

shape of target vector, y: (731,)
shape of feature matrix, X: (731, 12)


Everything up to this point looks okay.

Before running linear regression, we must split the data into a training set and a test set. The model is fit on the training set, using the target column to minimize the error. After the model is built, it is scored against the test set, which it has not seen during fitting.

## Accessing scikit-learn

We will let scikit-learn handle our machine learning tasks. We will import `train_test_split` and `LinearRegression` from scikit-learn:

In [13]:
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

We now split the data into the training set and test set.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

The `random_state` parameter is set here so that when we, or anyone else, rerun the code, the same rows are randomly selected for each set, as long as exactly the same dataset is used. Any number is okay as long as the same one is used when rerunning the code. I chose 43 because that was my age when I started learning data science in 2017.
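
The effect of `random_state` can be checked on a small made-up array (the names `X_demo` and `y_demo` and the seed 7 are invented for this sketch; this is not the bike rentals data). Splitting twice with the same seed should give identical test sets, while a different seed should pick different rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features. Not the bike rentals dataset.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# Same random_state twice: the two splits are identical.
_, _, _, y_test_a = train_test_split(X_demo, y_demo, random_state=43)
_, _, _, y_test_b = train_test_split(X_demo, y_demo, random_state=43)
same_seed_matches = np.array_equal(y_test_a, y_test_b)

# A different random_state: a different set of rows lands in the test set.
_, _, _, y_test_c = train_test_split(X_demo, y_demo, random_state=7)
different_seed_matches = np.array_equal(y_test_a, y_test_c)

print("same seed, same test rows:", same_seed_matches)
print("different seed, same test rows:", different_seed_matches)

# With the default test_size of 0.25, 25 of the 100 rows go to the test set.
print("test rows:", len(y_test_a))
```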

## Modeling linear regression

A linear regression model may be built with the following steps:

1. Initialize the machine learning model. We do so by instantiating the model and storing it in a variable with a name of our choosing.
2. Fit the model on the training set. This is where the machine learning model is built. Note that `X_train` is the feature matrix and `y_train` is the target vector. The model assigns the weights that best map the values in the feature matrix to the values in the target vector.
3. Make predictions for the test set. We store the predictions for `X_test`, the feature matrix of the test set, in the variable `y_pred` using the `.predict` method on `lin_reg`.

In [16]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

The `y_pred` (called $\hat{y}$ in statistics) above contains our predicted values. Our next step is to compare these values with the actual target values of the test set, `y_test`. We calculate the error from the differences between the two. The standard metric for linear regression is the root mean squared error (RMSE): the mean of the squared differences between predicted and actual values, followed by a square root so that the units match those of the target.
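
As a concrete check of how RMSE is computed, here is a hand calculation on three made-up values (the arrays `y_true` and `y_hat` are invented for illustration): square each difference, average the squares, then take the square root.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

errors = y_hat - y_true        # differences: [-1, 0, 2]
squared_errors = errors ** 2   # squares: [1, 0, 4]
mse = squared_errors.mean()    # mean of the squares: 5 / 3
rmse = np.sqrt(mse)            # square root restores the original units

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```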

In [17]:
from sklearn.metrics import mean_squared_error

import numpy as np

In [19]:
mse = mean_squared_error(y_test, y_pred)

rmse = np.sqrt(mse)

print(f"RMSE: {rmse:.2f}")

RMSE: 969.77


To determine whether the value we got is good or bad, we need to know the expected range of rentals per day. The `.describe()` method may be used on the `'cnt'` column to see that range.

In [20]:
df_bikes['cnt'].describe()

count     731.000000
mean     4504.348837
std      1937.211452
min        22.000000
25%      3152.000000
50%      4548.000000
75%      5956.000000
max      8714.000000
Name: cnt, dtype: float64

Our value doesn't look bad at all, especially next to the mean (about 4,504) and the standard deviation (about 1,937) of daily rentals. However, we are going to try a different model to see if we can further reduce this RMSE value.
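
One rough way to put the RMSE in context is to express it relative to the mean and the standard deviation reported by `.describe()`. The sketch below simply reuses the numbers printed above; the variable names are our own. As a rule of thumb, a model that always predicted the mean would have an RMSE of roughly one standard deviation, so a ratio well below 1 in the second line suggests the model is learning something.

```python
# Values copied from the outputs above.
rmse_linear = 969.77
mean_cnt = 4504.348837
std_cnt = 1937.211452

# RMSE as a fraction of the typical daily count.
rmse_vs_mean = rmse_linear / mean_cnt

# RMSE relative to the spread of the target.
rmse_vs_std = rmse_linear / std_cnt

print(f"RMSE / mean: {rmse_vs_mean:.3f}")  # 0.215
print(f"RMSE / std:  {rmse_vs_std:.3f}")   # 0.501
```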

## XGBRegressor

We will now import the `XGBRegressor` and repeat the exact same steps we took with `LinearRegression`.

In [21]:
from xgboost import XGBRegressor

In [22]:
xgb_reg = XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_test)

In [23]:
mse = mean_squared_error(y_test, y_pred)

rmse = np.sqrt(mse)

print(f"RMSE: {rmse:.2f}")

RMSE: 702.00


Our performance has improved with the XGBRegressor: the RMSE has dropped from 969.77 to 702.00, a reduction of about 28%.

One thing we should note is that splitting the data into a training set and a test set is arbitrary, and a different number chosen as `random_state` will give a different RMSE. So relying on a single split, as we did above, is not a reliable way to judge a model. You could try it with a different random number.
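
To see how much the score can move with the split, here is a small sketch on synthetic stand-in data (the data, the coefficients, and the list of seeds are all invented; this is not the bike rentals dataset). It repeats the split, fit, and score steps for several `random_state` values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=200)

rmse_by_seed = {}
for seed in (0, 1, 2, 43):
    # Each seed selects a different random set of rows for the test set.
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    rmse_by_seed[seed] = float(np.sqrt(mse))

for seed, value in rmse_by_seed.items():
    print(f"random_state={seed}: RMSE={value:.3f}")
```

The same model and the same data give a different RMSE for each seed, which is the reason a single split is a shaky basis for comparing models.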

## Cross-validation

Since different splits tend to give different answers, we need a way to address this issue: we will use k-fold cross-validation. It splits the data multiple times into different training sets and test sets, and then takes the mean of the scores. The number of splits, called folds, is denoted by $k$. Common choices are $k$ = 3, 4, 5, or 10.

Cross-validation works by fitting a machine learning model on the first training set and scoring it against the first test set, and then repeating the process once for each remaining fold.

Five folds is a common choice because $20\%$ of the data is held back as the test set each time. With $10$ folds, only $10\%$ of the data is held back, so $90\%$ of the data is available for training and the mean is less vulnerable to outliers. For a smaller dataset, three folds may work better.

At the end, there will be $k$ different scores evaluating the model against $k$ different test sets. Taking the mean score of the $k$ folds gives a more reliable score than any single fold.
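
The splitting itself can be sketched in plain NumPy. This is an illustration of the idea under simple assumptions (contiguous folds, no shuffling), not scikit-learn's implementation. It divides the 731 row indices into $k = 10$ folds and lets each fold take one turn as the test set.

```python
import numpy as np

n_rows = 731  # number of rows in the bike rentals data
k = 10        # number of folds

# Split the row indices into k contiguous, nearly equal folds.
folds = np.array_split(np.arange(n_rows), k)

splits = []
for i, test_idx in enumerate(folds):
    # Every fold takes one turn as the test set; the rest form the training set.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, test_idx))

fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)  # [74, 73, 73, 73, 73, 73, 73, 73, 73, 73]
print(f"held back per fold: about {1 / k:.0%}")
```

Each row ends up in the test set exactly once, so the $k$ scores together cover the whole dataset.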

In [24]:
from sklearn.model_selection import cross_val_score

We will now try this out with `LinearRegression`. This time we pass the entire feature matrix $X$ and the target vector $y$ to `cross_val_score`, which handles the splitting for us.

### Linear Regression

In [26]:
model = LinearRegression()
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)

In the above, we set the `cv` parameter to 10 and `scoring` to `'neg_mean_squared_error'`. Scikit-learn's scoring convention is that higher is better, which works well for metrics like accuracy but not for errors, where the lowest is best. By taking the negative of each mean squared error, the lowest error becomes the highest score. To get back to the RMSE, we negate the scores to make them positive and then take the square root.
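
The sign convention is easy to verify on synthetic stand-in data (the data and coefficients below are invented; this is not the bike rentals dataset). The sketch also uses the `'neg_root_mean_squared_error'` scoring string, which, as far as I know, is available in scikit-learn 0.22 and later, to show that negating it gives the same per-fold RMSE as negating the MSE scores and taking the square root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=120)

# Scores come back as negative MSE values, one per fold.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring='neg_mean_squared_error', cv=5)

# Flip the sign to get a positive MSE, then take the square root.
rmse_from_mse = np.sqrt(-neg_mse)

# Alternative scoring string that returns the negative RMSE directly.
neg_rmse = cross_val_score(LinearRegression(), X_demo, y_demo,
                           scoring='neg_root_mean_squared_error', cv=5)
rmse_direct = -neg_rmse

print("all scores are negative or zero:", bool((neg_mse <= 0).all()))
print("both routes agree:", bool(np.allclose(rmse_from_mse, rmse_direct)))
```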

In [31]:
rmse = np.sqrt(-scores)

print(f'Reg rmse: {np.round(rmse, 2)}')

print(f'RMSE mean: {rmse.mean()}')

Reg rmse: [ 504.01  840.55 1140.88  728.39  640.2   969.95 1133.45 1252.85 1084.64
 1425.33]
RMSE mean: 972.0234147419287


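With cross-validation, the mean RMSE for `LinearRegression` is about 972, close to the 969.77 we got from the single split. The point here is not whether the score is better or worse. The point is that the cross-validated mean is a better estimate of how linear regression will perform on unseen data. Notice also how widely the individual folds vary, from about 504 to about 1,425.

Using cross-validation is generally recommended for a better estimate of the score.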

### XGBoost

Let us repeat the same steps as above, now with the `XGBRegressor`.

In [32]:
model = XGBRegressor()
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)

rmse = np.sqrt(-scores)

print(f'Reg rmse: {np.round(rmse, 2)}')

print(f'RMSE mean: {rmse.mean()}')

Reg rmse: [ 717.65  692.8   520.7   737.68  835.96 1006.24  991.34  747.61  891.99
 1731.13]
RMSE mean: 887.3099729285601


XGBoost wins again.
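
Beyond the means, the fold scores printed above can also be compared fold by fold. The sketch below copies the two sets of RMSE values from the outputs (the variable names are our own) and counts how often each model has the lower error.

```python
import numpy as np

# Per-fold RMSE values copied from the two cross-validation outputs above.
lin_rmse = np.array([504.01, 840.55, 1140.88, 728.39, 640.2,
                     969.95, 1133.45, 1252.85, 1084.64, 1425.33])
xgb_rmse = np.array([717.65, 692.8, 520.7, 737.68, 835.96,
                     1006.24, 991.34, 747.61, 891.99, 1731.13])

print(f"linear mean:  {lin_rmse.mean():.2f}")
print(f"xgboost mean: {xgb_rmse.mean():.2f}")

# Count the folds where XGBoost has the lower RMSE.
xgb_wins = int((xgb_rmse < lin_rmse).sum())
print(f"folds where XGBoost has the lower RMSE: {xgb_wins} of {len(lin_rmse)}")
```

On these numbers XGBoost has the lower mean, while each model has the lower RMSE on five of the ten folds, so the margin is worth treating with some caution.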