# Intro

This exercise is based on the following video:
https://www.youtube.com/watch?v=TrzUlo4BImM

This video explains the different metrics for regression problems.

The metrics that we are going to see are the following ones:


In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
features = data["data"]
target = data["target"]

# Source Data

In [3]:
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [4]:
# Getting the features and the target as DataFrames
features_df = pd.DataFrame(features, columns = data.feature_names)
target_df = pd.DataFrame(target, columns = data.target_names)

# Merging both the features and the target into the same dataframe
housing_df = pd.concat( 
    [features_df, target_df],
    axis= 1 # Horizontally
)

In [5]:
# The target variable is expressed in hundreds of thousands of dollars, but
# I would like to have the original dollar value in the target variable
housing_df["MedHouseVal"] = housing_df["MedHouseVal"] * 100_000

In [6]:
housing_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,452600.0
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,358500.0
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,352100.0


In [7]:
print("---- Data -----")

print(f"Rows: {housing_df.shape[0]}")
print(f"Columns: {housing_df.shape[1]}")

print("Target:", data.target_names[0])

print(f"Features ({len(data.feature_names)}):")
for feature in data.feature_names:
    print(" -", feature)

print("")

example_nrows = 4
print("Example:")
housing_df.head(example_nrows)

---- Data -----
Rows: 20640
Columns: 9
Target: MedHouseVal
Features (8):
 - MedInc
 - HouseAge
 - AveRooms
 - AveBedrms
 - Population
 - AveOccup
 - Latitude
 - Longitude

Example:


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,452600.0
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,358500.0
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,352100.0
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,341300.0


# Split Data between Train and Test sets

In order to demonstrate the different metrics available for regression, we need to train a model and test it on unseen data.

In [8]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    housing_df,
    test_size=0.2, 
    random_state=42
)

print("Train data set: ",  train_df.shape[0])
print("Test data set: ", test_df.shape[0])

Train data set:  16512
Test data set:  4128


# Machine Learning Models

To see how different models obtain different scores, we are going to use a linear regression model and a random forest model.

## Linear Model

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, 
    test_size=0.2, 
    random_state = 42
)

In [10]:
from sklearn.linear_model import LinearRegression
linear_regression_model = LinearRegression()
linear_regression_model.fit(
    X_train, y_train
)

In [11]:
r_squared = linear_regression_model.score(X_train, y_train)

In [12]:
print("R^2 = " , r_squared)

R^2 =  0.6125511913966953


In [13]:
predicted = linear_regression_model.predict( train_df[data.feature_names])



In [14]:
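linear_regression_model_df = LinearRegression()
# Fit on the feature columns only: train_df also contains the target
# column (MedHouseVal), which must not be used as an input feature
linear_regression_model_df.fit(train_df[data.feature_names], y_train)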


In [15]:
y_train

array([1.03 , 3.821, 1.726, ..., 2.221, 2.835, 3.25 ])

In [16]:
train_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,103000.0
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,382100.0
17445,4.1563,4.0,5.645833,0.985119,915.0,2.723214,34.66,-120.48,172600.0


In [17]:
lr = LinearRegression()
lr.fit(train_df[data.feature_names], train_df[data.target_names])


In [18]:
train_df["predicted_lr_train"] = lr.predict(train_df[data.feature_names])
train_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted_lr_train
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,103000.0,193725.844705
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,382100.0,248910.616182
17445,4.1563,4.0,5.645833,0.985119,915.0,2.723214,34.66,-120.48,172600.0,264735.483406


In [19]:
train_df["difference_train"] = train_df["MedHouseVal"] - train_df["predicted_lr_train"]

In [20]:
train_df["difference_squared_train"] = train_df["difference_train"] ** 2

In [21]:
train_df.head(2)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted_lr_train,difference_train,difference_squared_train
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,103000.0,193725.844705,-90725.844705,8231179000.0
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,382100.0,248910.616182,133189.383818,17739410000.0


# Example of $ R^2 $ calculation

The following example uses a toy data set to illustrate the calculation of $R^2$.

The $ R^2 $ formula is defined as follows: 

$ R^2 =  1 - \frac{ SS_{res} }{ SS_{tot} }  $

Where:
- $ SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $
- $ SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2 $
    - $ y_i $: Observed (real) value
    - $ \hat{y}_i $: Predicted value
    - $ \bar{y} $: Mean of the observed values

Meaning:
- $ SS_{res} $: Residual sum of squares (the squared error of the model's predictions)
- $ SS_{tot} $: Total sum of squares (the squared error of always predicting the mean)
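
Two limiting cases may help when reading the formula. This is a short sketch that follows from the definitions above (it is not taken from the video), and it assumes $ SS_{tot} > 0 $:

$$
\hat{y}_i = y_i \ \text{for all } i \;\Rightarrow\; SS_{res} = 0 \;\Rightarrow\; R^2 = 1
$$

$$
\hat{y}_i = \bar{y} \ \text{for all } i \;\Rightarrow\; SS_{res} = SS_{tot} \;\Rightarrow\; R^2 = 0
$$

So perfect predictions score 1, a model that always predicts the mean scores 0, and a model that does worse than predicting the mean gets a negative $ R^2 $.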

Example: 
$$
\begin{array}{|c|c|c|}
\hline
\text{Observed } (y) & \text{Predicted } (\hat{y}) & \text{Error } (y - \hat{y}) \\ 
\hline
5.0 & 4.8 & 0.2 \\ 
\hline
7.0 & 6.5 & 0.5 \\ 
\hline
4.0 & 4.2 & -0.2 \\ 
\hline
\end{array}
$$ 


$ SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $
$ = (5-4.8)^2 + (7-6.5)^2 + (4 - 4.2)^2 $
$ = 0.04 + 0.25 + 0.04 $
$ = 0.33 $

To calculate $ SS_{tot} $ we first need the mean of the observed values ($ \bar{y} $):
$$ \bar{y} = \frac{5 + 7 + 4}{3} = \frac{16}{3} \approx 5.33 $$

Now we can calculate $ SS_{tot} $:

$ SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2 $
$ = (5 - 5.33)^2 + (7 - 5.33)^2 + (4 - 5.33)^2 $
$ = 0.1089 + 2.7889 + 1.7689 $
$ \approx 4.67 $

Now that we have both terms calculated ( $ SS_{res} $ and $ SS_{tot} $ ) we can calculate $ R^2 $

$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $ 
$ = 1 - \frac{0.33}{4.67} $ 
$ \approx 1 - 0.07 $ 
$ = 0.93 $

So we can say that the model explains approximately 93% of the variance in the observed data.
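
The hand calculation above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper `toy_r_squared` and the `toy_*` variable names are hypothetical, chosen here so they do not collide with the names used in the notebook cells.

```python
# Toy data taken from the table above
toy_observed = [5.0, 7.0, 4.0]
toy_predicted = [4.8, 6.5, 4.2]


def toy_r_squared(y_true, y_pred):
    """Return (ss_res, ss_tot, r2) for two equally sized sequences."""
    # Mean of the observed values (y bar)
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: observed vs. predicted
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    # Total sum of squares: observed vs. mean of observed
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    return ss_res, ss_tot, 1 - ss_res / ss_tot


toy_ss_res, toy_ss_tot, toy_r2 = toy_r_squared(toy_observed, toy_predicted)
print(f"SS_res = {toy_ss_res:.2f}")  # 0.33
print(f"SS_tot = {toy_ss_tot:.2f}")  # 4.67
print(f"R^2    = {toy_r2:.2f}")      # 0.93
```

The rounded values agree with the manual steps. The code keeps full precision for the mean, so the unrounded results differ slightly from the hand calculation, which used 5.33.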


## $ R^2 $ in the training set

Now we want to calculate the value of $ R^2 $ on our training set. This means using the model to predict on the same data it was trained on. On its own this is not a fair measure of performance, because the model has already seen these rows, but we can later repeat the calculation on the test set and compare both results.

In [22]:
train_df.head(2)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted_lr_train,difference_train,difference_squared_train
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,103000.0,193725.844705,-90725.844705,8231179000.0
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,382100.0,248910.616182,133189.383818,17739410000.0


In [23]:
# We need the mean of the observed values
mean_observed_values = np.mean(train_df["MedHouseVal"])
print(f"Mean of the house price: {mean_observed_values }")

Mean of the house price: 207194.6937378876


Now, let's calculate the term $ SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2 $ (total sum of squares):


In [24]:
train_df["SS_tot"] =  (train_df["MedHouseVal"] - mean_observed_values)**2

In [25]:
ss_tot = train_df["SS_tot"].sum()
print(f"SS_tot term: {ss_tot}")

SS_tot term: 220728818330670.2


Now, let's calculate the term $ SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $ (residual sum of squares):


In [26]:
train_df["SS_res"] = (train_df["MedHouseVal"] - train_df["predicted_lr_train"]) ** 2


In [27]:
ss_res = train_df["SS_res"].sum()
print(f"SS_res term: {ss_res}")

SS_res term: 85521117686633.47



$ R^2 =  1 - \frac{ SS_{res} }{ SS_{tot} }  $

In [28]:
r_squared = 1 - (ss_res / ss_tot)
print(f"R squared: {r_squared}")

R squared: 0.6125511913966952


We should get the same result if we use the linear regression model's built-in `score` method:


In [41]:
linear_regression = LinearRegression()
linear_regression.fit(train_df[data.feature_names], train_df[data.target_names])
score_sklearn = linear_regression.score(train_df[data.feature_names], train_df[data.target_names])
print(f"Score (R^2) calculated with sklearn built in function: {score_sklearn}")


Score (R^2) calculated with sklearn built in function: 0.6125511913966952


In [43]:
print(f"""
R squared: 
     Self calculated: {r_squared}
  Sklearn calculated: {score_sklearn}
""")


R squared: 
     Self calculated: 0.6125511913966952
  Sklearn calculated: 0.6125511913966952



As we can see, both give the same result.

Now we can use our model to predict on unseen data (the test set) and check which $R^2$ we get.