# Intro

This exercise is based on the following video:
https://www.youtube.com/watch?v=TrzUlo4BImM

This video explains the different metrics for regression problems.

The metrics that we are going to see are the following ones:


In [1]:
import pandas as pd
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
features = data["data"]
target = data["target"]

# Source Data

In [2]:
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [3]:
# Build DataFrames for the features and the target
features_df = pd.DataFrame(features, columns=data.feature_names)
target_df = pd.DataFrame(target, columns=data.target_names)

# Merge the features and the target into a single DataFrame
housing_df = pd.concat(
    [features_df, target_df],
    axis=1,  # side by side (column-wise)
)

In [4]:
# The target variable is expressed in hundreds of thousands of dollars, but
# I would like to have the original dollar value in the target column
housing_df["MedHouseVal"] = housing_df["MedHouseVal"] * 100_000

In [5]:
housing_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,452600.0
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,358500.0
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,352100.0


In [6]:
print("---- Data -----")

print(f"Rows: {housing_df.shape[0]}")
print(f"Columns: {housing_df.shape[1]}")

print("Target:", data.target_names[0])

print(f"Features ({len(data.feature_names)}):")
for feature in data.feature_names:
    print(" -", feature)

print("")

example_nrows = 4
print("Example:")
housing_df.head(example_nrows)

---- Data -----
Rows: 20640
Columns: 9
Target: MedHouseVal
Features (8):
 - MedInc
 - HouseAge
 - AveRooms
 - AveBedrms
 - Population
 - AveOccup
 - Latitude
 - Longitude

Example:


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,452600.0
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,358500.0
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,352100.0
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,341300.0


# Split Data between Train and Test sets

To demonstrate the different metrics available for regression, we need to train a model and test it on unseen data.

In [7]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    housing_df,
    test_size=0.2, 
    random_state=42
)

print("Train data set: ",  train_df.shape[0])
print("Test data set: ", test_df.shape[0])

Train data set:  16512
Test data set:  4128
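
The cells further down split `features` and `target` a second time and rely on both splits picking the same rows. A minimal sketch of why that works, using a small made-up DataFrame (the names `toy_df`, `x` and `y` are illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for housing_df
toy_df = pd.DataFrame({"x": np.arange(10.0), "y": np.arange(10.0) * 2})

# Split the DataFrame as a whole
toy_train_df, toy_test_df = train_test_split(
    toy_df, test_size=0.2, random_state=42
)

# Split the same rows as separate NumPy arrays, with the same settings
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df[["x"]].to_numpy(),
    toy_df["y"].to_numpy(),
    test_size=0.2,
    random_state=42,
)

# Same number of rows, same test_size and same seed give the same row order
same_rows = np.array_equal(toy_train_df["y"].to_numpy(), y_tr)
print("Same training rows:", same_rows)
```

The match depends on all three settings being identical; changing the seed or the test size in only one of the calls would break the row correspondence.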


# Machine Learning Models

To see how different models score on the same data, we are going to train a Linear Regression model and a Random Forest model.

## Linear Model

In [8]:
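# Same test_size and random_state as the DataFrame split above, so X_train and
# y_train hold the same rows as train_df. Note that `target` is still expressed
# in units of $100,000; only housing_df was converted to dollars.
X_train, X_test, y_train, y_test = train_test_split(
    features, target,
    test_size=0.2,
    random_state=42,
)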

In [9]:
from sklearn.linear_model import LinearRegression
linear_regression_model = LinearRegression()
linear_regression_model.fit(
    X_train, y_train
)

In [10]:
# R^2 measured on the training data the model was fitted on
r_squared = linear_regression_model.score(X_train, y_train)

In [11]:
print("R^2 = " , r_squared)

R^2 =  0.6125511913966953


In [12]:
predicted = linear_regression_model.predict( train_df[data.feature_names])



In [13]:
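linear_regression_model_df = LinearRegression()
# Note: train_df still includes the MedHouseVal column, so this model is fitted
# on 9 input columns (the 8 features plus the target itself)
linear_regression_model_df.fit(train_df, y_train)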


In [14]:
y_train

array([1.03 , 3.821, 1.726, ..., 2.221, 2.835, 3.25 ])

In [15]:
train_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,103000.0
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,382100.0
17445,4.1563,4.0,5.645833,0.985119,915.0,2.723214,34.66,-120.48,172600.0


In [16]:
lr = LinearRegression()
lr.fit(train_df[data.feature_names], train_df[data.target_names])


In [17]:
train_df["predicted_lr"] = lr.predict(train_df[data.feature_names])
train_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted_lr
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,103000.0,193725.844705
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,382100.0,248910.616182
17445,4.1563,4.0,5.645833,0.985119,915.0,2.723214,34.66,-120.48,172600.0,264735.483406


In [21]:
train_df["difference"] = train_df["MedHouseVal"] - train_df["predicted_lr"]

In [22]:
train_df["difference_squared"] = train_df["difference"] ** 2

In [23]:
train_df.head(2)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted_lr,difference,difference_squared
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,103000.0,193725.844705,-90725.844705,8231179000.0
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,382100.0,248910.616182,133189.383818,17739410000.0
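
The `difference` and `difference_squared` columns are the building blocks of several regression metrics. A minimal sketch, on made-up values rather than the housing data (the frame `toy` and its column names `actual` and `predicted` are assumptions for illustration), of how they turn into MSE, RMSE and MAE:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical toy values, not taken from the housing data
toy = pd.DataFrame({"actual": [5.0, 7.0, 4.0], "predicted": [4.8, 6.5, 4.2]})
toy["difference"] = toy["actual"] - toy["predicted"]
toy["difference_squared"] = toy["difference"] ** 2

mse = toy["difference_squared"].mean()   # mean squared error
rmse = np.sqrt(mse)                      # back in the units of the target
mae = toy["difference"].abs().mean()     # mean absolute error

# Cross-check the hand-built values against scikit-learn
sk_mse = mean_squared_error(toy["actual"], toy["predicted"])
sk_mae = mean_absolute_error(toy["actual"], toy["predicted"])
print(mse, rmse, mae)
```

Because the errors are squared, MSE is expressed in squared target units (squared dollars for `MedHouseVal`), which is why RMSE is usually easier to interpret.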


# Example of $ R^2 $ calculation

The $ R^2 $ formula is defined as follows: 

$ R^2 =  1 - \frac{ SS_{res} }{ SS_{tot} }  $

Where:
- $ SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2  $ (residual sum of squares)
- $ SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2  $ (total sum of squares, measured around the mean)
    - $ {y_i} $: Real (observed) value
    - $ \hat{y_i} $: Predicted value 
    - $ \bar{y} $: Mean of the observed values

Example: 
$$
\begin{array}{|c|c|c|}
\hline
\text{Observed } (y) & \text{Predicted } (\hat{y}) & \text{Error } (y - \hat{y}) \\ 
\hline
5.0 & 4.8 & 0.2 \\ 
\hline
7.0 & 6.5 & 0.5 \\ 
\hline
4.0 & 4.2 & -0.2 \\ 
\hline
\end{array}
$$ 


$ SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 $
$ = (5-4.8)^2 + (7-6.5)^2 + (4 - 4.2)^2 $
$ = 0.04 + 0.25 + 0.04 $
$ = 0.33 $

To calculate $ SS_{tot} $ we first need the mean of the observed values ($ \bar{y} $). So let's calculate the mean: 
$$ \bar{y} = \frac{5 + 7 + 4}{3} = \frac{16}{3} \approx 5.33 $$

Now, we can calculate $ SS_{tot} $:

$ SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2  $ 
$ = (5 - 5.33)^2 + (7 - 5.33)^2 + (4 - 5.33)^2  $ 
$ =  0.1089 + 2.7889 + 1.7689$ 
$ \approx  4.67 $ 

Now that we have both terms calculated ( $ SS_{res} $ and $ SS_{tot} $ ) we can calculate $ R^2 $

$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $ 
$ = 1 - \frac{0.33}{4.67} $ 
$ = 1 - 0.07 $ 
$ = 0.93 $

So we can say that the model explains approximately 93% of the variance in the observed data.
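
The hand calculation above can be reproduced in a few lines. This is a minimal sketch using NumPy and scikit-learn's `r2_score`; the variable names are chosen here for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

# The three observed / predicted pairs from the table above
y_true = np.array([5.0, 7.0, 4.0])
y_pred = np.array([4.8, 6.5, 4.2])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# scikit-learn applies the same definition
r2_sklearn = r2_score(y_true, y_pred)

print(round(ss_res, 2), round(ss_tot, 2), round(r2_manual, 2))
print(np.isclose(r2_manual, r2_sklearn))
```

Rounded to two decimals, the values agree with the 0.33, 4.67 and 0.93 obtained by hand.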


In [24]:

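# Fails: linear_regression_model_df was fitted on all 9 columns of train_df
# (8 features plus MedHouseVal), while X_train holds only the 8 features
linear_regression_model_df.predict(X_train)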



ValueError: X has 8 features, but LinearRegression is expecting 9 features as input.
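
The error comes from the number of columns seen at fit time. A minimal sketch of the problem and one way around it, on a made-up frame (`toy_df`, `f1`, `f2` and `target` are hypothetical names): select the feature columns explicitly whenever the frame also carries the target.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy frame: two feature columns plus the target column
toy_df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [0.5, 0.1, 0.9, 0.3],
    "target": [2.0, 4.1, 6.2, 7.9],
})
feature_cols = ["f1", "f2"]

# Fitting on the whole frame passes the target in as a third "feature"
leaky_model = LinearRegression().fit(toy_df, toy_df["target"])
print(leaky_model.n_features_in_)

# Fitting on the feature columns only keeps the expected input width
clean_model = LinearRegression().fit(toy_df[feature_cols], toy_df["target"])
print(clean_model.n_features_in_)

# Predicting with the same columns now works
toy_predictions = clean_model.predict(toy_df[feature_cols])
```

Besides the shape mismatch, a model fitted with the target among its inputs would look far better than it really is, so the explicit column selection matters for the metrics as well.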

In [94]:
predicted

array([1.93725845, 2.48910616, 2.64735483, ..., 2.03879912, 2.84075139,
       2.27373156])

In [84]:
linear_regression_model.score(X_train, y_train)



-38239005668.57091
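
The cell numbers show that the notebook was run out of order, so the exact state behind this value is not recorded here. One plausible explanation, stated as an assumption, is a unit mismatch: predictions in dollars scored against targets in units of $100,000. A minimal sketch on synthetic data (all names and numbers below are made up) shows how that alone drives $ R^2 $ to a huge negative value:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Hypothetical targets in units of $100,000, like the raw dataset target
y_units = rng.uniform(0.5, 5.0, size=1_000)
# Hypothetical predictions: close to the targets, in the same units
pred_units = y_units + rng.normal(0.0, 0.3, size=1_000)

# Targets and predictions in the same units
r2_matched = r2_score(y_units, pred_units)

# Same predictions converted to dollars, scored against the unconverted targets
r2_mismatched = r2_score(y_units, pred_units * 100_000)

print(r2_matched, r2_mismatched)
```

$ R^2 $ has no lower bound, so a value this far below zero is usually a sign of a data or unit problem rather than of a merely weak model.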

In [79]:
X_train[0]

array([ 3.25960000e+00,  3.30000000e+01,  5.01765650e+00,  1.00642055e+00,
        2.30000000e+03,  3.69181380e+00,  3.27100000e+01, -1.17030000e+02])

In [81]:
train_df.head(1)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,103000.0


In [40]:
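# lr was fitted on MedHouseVal in dollars, so these predictions are in dollars,
# matching the outputs shown below; ravel() flattens the (n, 1) result to 1-D
predictions = lr.predict(test_df[data.feature_names]).ravel()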

In [65]:
test_predicted_df = test_df.copy()

In [73]:
import numpy as np

In [74]:
test_predicted_df["predicted"] = np.round(predictions, 0)

In [75]:
test_predicted_df["difference"] = round(test_predicted_df["MedHouseVal"] - test_predicted_df["predicted"], 0) 

In [None]:
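# R^2 on the training rows; y_train holds the same rows as train_df
# because both splits used the same test_size and random_state
linear_regression_model.score(train_df[data.feature_names], y_train)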

In [76]:
test_predicted_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted,difference
20046,1.6812,25.0,4.192201,1.022284,1392.0,3.877437,36.06,-119.01,47700.0,71912.0,-24212.0
3024,2.5313,30.0,5.039384,1.193493,1565.0,2.679795,35.14,-119.46,45800.0,176402.0,-130602.0
15663,3.4801,52.0,3.977155,1.185877,1310.0,1.360332,37.8,-122.44,500001.0,270966.0,229035.0


In [67]:
test_predicted_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted
20046,1.6812,25.0,4.192201,1.022284,1392.0,3.877437,36.06,-119.01,47700.0,71912.28416
3024,2.5313,30.0,5.039384,1.193493,1565.0,2.679795,35.14,-119.46,45800.0,176401.657066
15663,3.4801,52.0,3.977155,1.185877,1310.0,1.360332,37.8,-122.44,500001.0,270965.883343


In [51]:
test_predicted_df = pd.DataFrame(predictions, columns=["predicted"])
test_predicted_df.head(3)

Unnamed: 0,predicted
0,71912.28416
1,176401.657066
2,270965.883343


In [59]:
test_df.head(3)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
20046,1.6812,25.0,4.192201,1.022284,1392.0,3.877437,36.06,-119.01,47700.0
3024,2.5313,30.0,5.039384,1.193493,1565.0,2.679795,35.14,-119.46,45800.0
15663,3.4801,52.0,3.977155,1.185877,1310.0,1.360332,37.8,-122.44,500001.0


In [63]:
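# axis=0 stacks the two frames vertically, so the predictions land in extra
# rows instead of a new column next to each test row
pd.concat(
    [test_df.head(3), test_predicted_df.head(3)], axis=0)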

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted
20046,1.6812,25.0,4.192201,1.022284,1392.0,3.877437,36.06,-119.01,47700.0,
3024,2.5313,30.0,5.039384,1.193493,1565.0,2.679795,35.14,-119.46,45800.0,
15663,3.4801,52.0,3.977155,1.185877,1310.0,1.360332,37.8,-122.44,500001.0,
0,,,,,,,,,,71912.28416
1,,,,,,,,,,176401.657066
2,,,,,,,,,,270965.883343


In [55]:
test_predicted_df.isna().sum()

predicted    0
dtype: int64

In [56]:
test_predicted_df.shape

(4128, 1)

In [58]:
test_df.shape

(4128, 9)

In [52]:
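# axis=1 aligns rows by index label: test_df keeps its shuffled labels while
# test_predicted_df is labelled 0..4127, so most rows fail to line up
pd.concat([test_df, test_predicted_df], axis=1)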

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,predicted
20046,1.6812,25.0,4.192201,1.022284,1392.0,3.877437,36.06,-119.01,47700.0,
3024,2.5313,30.0,5.039384,1.193493,1565.0,2.679795,35.14,-119.46,45800.0,261250.527120
15663,3.4801,52.0,3.977155,1.185877,1310.0,1.360332,37.80,-122.44,500001.0,
20484,5.7376,17.0,6.163636,1.020202,1705.0,3.444444,34.28,-118.72,218600.0,
9814,3.7250,34.0,5.492991,1.028037,1063.0,2.483645,36.62,-121.93,278000.0,
...,...,...,...,...,...,...,...,...,...,...
4123,,,,,,,,,,199174.574281
4124,,,,,,,,,,224983.911630
4125,,,,,,,,,,446877.016572
4126,,,,,,,,,,118751.118552


In [44]:
# Fails: pd.concat only accepts Series and DataFrame objects, not NumPy arrays
pd.concat([test_df, predictions], axis=1)

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
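
The attempts above fail for two reasons: `pd.concat` aligns on index labels, and it does not accept raw arrays. A minimal sketch of two ways to attach predictions to the test rows, using small stand-ins (`toy_test_df` and `toy_predictions` are hypothetical, with values rounded for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_df and the predictions array
toy_test_df = pd.DataFrame(
    {"MedHouseVal": [47700.0, 45800.0, 500001.0]},
    index=[20046, 3024, 15663],  # shuffled row labels, as after train_test_split
)
toy_predictions = np.array([71912.3, 176401.7, 270965.9])

# Option 1: wrap the array in a Series that reuses the test index, then concat
pred_series = pd.Series(
    toy_predictions, index=toy_test_df.index, name="predicted"
)
combined = pd.concat([toy_test_df, pred_series], axis=1)

# Option 2: assign the array to a new column of a copy (matched by position)
combined_2 = toy_test_df.copy()
combined_2["predicted"] = toy_predictions

print(combined)
```

Both options rely on the predictions being in the same row order as the test frame, which holds when they come straight from `predict` on that frame.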