# Cross Validation Lab

### Intro and objectives


### In this lab you will learn:
1. A complete example of using cross-validation to evaluate linear models

## What I hope you'll get out of this lab
* Worked examples
* How to interpret the results obtained

In [1]:
import sys

assert sys.version_info >= (3, 7)


In [2]:
from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

from sklearn.model_selection import cross_validate


In [4]:


# Load data

from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()

In [5]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [6]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [6]:
housing.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [None]:
#housing.dropna(inplace=True)

## 1. First Model: Linear Model

In [15]:
# Choose target and features
y = housing['median_house_value']
housing_features = ['housing_median_age', 'total_rooms']
X = housing[housing_features]

In [16]:
# specify the model
linear_model = LinearRegression()

In [17]:
# benchmark the model's performance using cross-validation

scores = cross_validate(
    linear_model, X, y, cv=5,
    scoring=('r2', 'neg_mean_squared_error'),
    return_train_score=True,
)
print(scores['test_r2'])

[0.04505694 0.04818973 0.03266756 0.05468838 0.04396787]


In [19]:
print(scores['train_r2'])

[0.04536379 0.04455361 0.04822794 0.04288053 0.04561384]


### The first linear model has low training and test scores (R² of roughly 0.03 to 0.05 in every fold). This model is therefore underfitting the data.
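To compare train and test scores at a glance, the per-fold arrays can be reduced to a mean and standard deviation. The sketch below is only an illustration: `summarize_r2` is a hypothetical helper that is not part of the lab, and the data comes from `make_regression` as a synthetic stand-in for the housing features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: NOT the housing dataset used in the lab)
X_demo, y_demo = make_regression(n_samples=200, n_features=2, noise=10.0, random_state=0)


def summarize_r2(cv_scores):
    """Hypothetical helper: mean and std of train/test R^2 across the folds."""
    return {
        "train_mean": float(np.mean(cv_scores["train_r2"])),
        "train_std": float(np.std(cv_scores["train_r2"])),
        "test_mean": float(np.mean(cv_scores["test_r2"])),
        "test_std": float(np.std(cv_scores["test_r2"])),
    }


demo_scores = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5,
    scoring=("r2", "neg_mean_squared_error"),
    return_train_score=True,
)
summary = summarize_r2(demo_scores)
print(summary)
```

Low means on both the train and test side, as in the model above, point to underfitting; a large gap between the two means would point to overfitting instead.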

## 2. Second Model: Linear Model (Expanded Features)

In [20]:
# Choose target and features
y = housing['median_house_value']
housing_features = ['housing_median_age', 'total_rooms', 'population', 'households', 'median_income']
X = housing[housing_features]

In [21]:
# specify the model
linear_model_enhanced = LinearRegression()

In [24]:
scores = cross_validate(
    linear_model_enhanced, X, y, cv=5,
    scoring=('r2', 'neg_mean_squared_error'),
    return_train_score=True,
)
print(scores['test_r2'])

[0.5403159  0.5633993  0.55691745 0.57996112 0.56837741]


In [25]:
print(scores['train_r2'])

[0.56849084 0.56269167 0.5643067  0.55821239 0.56143466]


#### Given the training and test scores (R² of roughly 0.54 to 0.58), we conclude that this linear model with expanded features performs significantly better than the first one.

#### The training and test scores are similar, so this model is not overfitting the data.
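The calls above also request `neg_mean_squared_error`, which the lab never prints. scikit-learn reports this metric negated so that higher is always better; a minimal sketch of turning it into a per-fold RMSE (in the units of the target) is shown below. `rmse_from_neg_mse` is a hypothetical helper, and the input values are made up for illustration rather than taken from the housing data.

```python
import numpy as np


def rmse_from_neg_mse(neg_mse_scores):
    """Hypothetical helper: convert negated MSE scores into RMSE values."""
    return np.sqrt(-np.asarray(neg_mse_scores, dtype=float))


# Made-up values for illustration only (not results from the housing data)
example_neg_mse = np.array([-4.0, -9.0, -16.0])
example_rmse = rmse_from_neg_mse(example_neg_mse)
print(example_rmse)  # [2. 3. 4.]
```

In the lab, the same function could be applied to `scores['test_neg_mean_squared_error']` to express the error in dollars of `median_house_value`.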

## 3. Third Model: Polynomial Model (2nd Order)

In [65]:
# Choose target and features
y = housing['median_house_value']
housing_features = ['housing_median_age', 'total_rooms', 'population', 'households', 'median_income']
X = housing[housing_features]

In [66]:
# Add quadratic elements

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)


In [67]:
linear_model_polynomial = LinearRegression()

In [68]:
scores = cross_validate(
    linear_model_polynomial, X_poly, y, cv=5,
    scoring=('r2', 'neg_mean_squared_error'),
    return_train_score=True,
)
print(scores['test_r2'])

[0.58358232 0.60001001 0.59830329 0.42557061 0.60918571]


In [69]:
print(scores['train_r2'])

[0.61359125 0.60878474 0.60912326 0.60632095 0.6071873 ]


In [75]:
#### Given the training and test scores, we conclude that this polynomial model performs somewhat better than the previous one: the training R² rises from about 0.56 to about 0.61, and four of the five test folds improve, although one fold drops to 0.43.

#### The performance on the training data is slightly better than on the test data, which means that the model is starting to overfit the data.
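One common way to organise this kind of model is to place the feature expansion inside a scikit-learn pipeline, so that every preprocessing step is refit on the training portion of each fold. The sketch below is a variant, not the approach used in this lab: it adds a `StandardScaler` step that the lab does not have, and it runs on synthetic data from `make_regression`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption: NOT the housing dataset used in the lab)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

# Scaling is an added step (assumption); the lab fits on unscaled features
poly_pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

pipeline_scores = cross_validate(
    poly_pipeline, X_demo, y_demo, cv=5,
    scoring=("r2", "neg_mean_squared_error"),
    return_train_score=True,
)
print(pipeline_scores["test_r2"])
print(pipeline_scores["train_r2"])
```

Passing the whole pipeline to `cross_validate` keeps the scaler's statistics from being computed on the held-out fold.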

## 4. Fourth Model: Polynomial Model (4th Order)

In [71]:
# Add polynomial terms up to degree 4

poly_features = PolynomialFeatures(degree=4, include_bias=False)
X_poly = poly_features.fit_transform(X)


In [72]:
linear_model_polynomial = LinearRegression()

In [73]:
scores = cross_validate(
    linear_model_polynomial, X_poly, y, cv=5,
    scoring=('r2', 'neg_mean_squared_error'),
    return_train_score=True,
)
print(scores['test_r2'])

[ -1.45313957  -0.7566881   -6.27666764 -59.26181999  -0.18749205]


In [74]:
print(scores['train_r2'])

[0.45560242 0.43051944 0.08447725 0.40197595 0.49271052]


#### Based on the performance on the training data (R² between 0.08 and 0.49, compared with about 0.61 for the 2nd-order model), we conclude that this model is worse than the previous one. Moreover, it is overfitting the data: the test R² is negative in every fold, far below the training R².
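The comparison across model complexities can be automated by sweeping the polynomial degree and recording the mean train and test R² for each. The sketch below assumes synthetic data and a hypothetical helper `r2_by_degree`; it also standardises the features first, which the lab does not do. A least-squares fit on a larger nested feature set should not lower the training R², so the drop seen above may be a numerical conditioning effect of the unscaled high-degree terms; that explanation is an assumption and is not verified here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption: NOT the housing dataset used in the lab)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)


def r2_by_degree(X, y, degrees=(1, 2, 3, 4), cv=5):
    """Hypothetical helper: mean (train, test) R^2 for each polynomial degree."""
    results = {}
    for degree in degrees:
        model = make_pipeline(
            StandardScaler(),  # added step (assumption), not used in the lab
            PolynomialFeatures(degree=degree, include_bias=False),
            LinearRegression(),
        )
        scores = cross_validate(model, X, y, cv=cv, scoring="r2", return_train_score=True)
        results[degree] = (
            float(np.mean(scores["train_score"])),
            float(np.mean(scores["test_score"])),
        )
    return results


degree_results = r2_by_degree(X_demo, y_demo)
for degree, (train_r2, test_r2) in degree_results.items():
    print(f"degree={degree}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

A widening gap between the two columns as the degree grows is the pattern the lab describes as overfitting.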