# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Perform cross-validation on a model
- Compare and contrast model validation strategies

## Let's Get Started

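We included the code to pre-process the Ames Housing dataset below. The full dataset is preprocessed before any splitting for the sake of expediency, although this may result in data leakage (the normalization statistics are computed from rows that later land in the test set) and therefore overly optimistic model metrics.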
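
To make the leakage concern concrete, here is a minimal sketch, using made-up numbers rather than the Ames data, of normalizing with statistics taken from the training rows only. The helper names `fit_scaler` and `apply_scaler` are hypothetical and not part of the lab code; `statistics.stdev` is the sample standard deviation, which matches the default of pandas' `.std()`.

```python
import statistics

def fit_scaler(train_values):
    """Compute the mean and sample standard deviation from training rows only."""
    return statistics.mean(train_values), statistics.stdev(train_values)

def apply_scaler(values, mean, std):
    """Normalize values with statistics that were computed elsewhere."""
    return [(value - mean) / std for value in values]

# Made-up feature values, already split into train and test
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 20.0]

train_mean, train_std = fit_scaler(train)                 # the test rows are never seen here
train_scaled = apply_scaler(train, train_mean, train_std)
test_scaled = apply_scaler(test, train_mean, train_std)   # reuse the training statistics

print(train_mean, train_std)
print(test_scaled)
```

In this lab the normalization is applied before splitting, so the reported test errors may look slightly better than they would with this train-only approach.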

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

ames_cont = ames[continuous]

# log features
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

# normalize (subtract mean and divide by std)

def normalize(feature):
    return (feature - feature.mean()) / feature.std()

ames_log_norm = ames_log.apply(normalize)

# one hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)

preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)

X = preprocessed.drop('SalePrice_log', axis=1)
y = preprocessed['SalePrice_log']

## Train-Test Split

Perform a train-test split with a test set of 20% and a random state of 4.

In [2]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Perform the train-test split (20% test set, random state of 4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=4)

# Display the shapes of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (1168, 47)
X_test shape: (292, 47)
y_train shape: (1168,)
y_test shape: (292,)


### Fit a Model

Fit a linear regression model on the training set.

In [3]:
# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression

In [4]:
# Instantiate and fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)


### Calculate MSE

Calculate the mean squared error on the test set.

In [7]:
# Import mean_squared_error from sklearn.metrics
from sklearn.metrics import mean_squared_error

# Generate predictions on the test set
y_test_pred = model.predict(X_test)


In [8]:
# Calculate MSE on test set
test_mse = mean_squared_error(y_test, y_test_pred)

# Display the MSE
print("Test Set Mean Squared Error:", test_mse)


Test Set Mean Squared Error: 0.1523399721070815
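
As a reminder of what this number measures, the sketch below computes a mean squared error by hand on a few made-up values (not taken from the Ames data). Because the target here is the normalized log of `SalePrice`, the MSE above is expressed in those standardized log units rather than in dollars. The helper name `mse` is illustrative only.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Made-up actual and predicted values
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

print(mse(y_true, y_pred))  # (0.25 + 0.0 + 1.0) / 3
```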


## Cross-Validation using Scikit-Learn

Now let's compare that single test MSE to a cross-validated test MSE.

In [9]:
# Import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Display the single test set MSE computed above, for comparison
print("Single Test Set Mean Squared Error:", test_mse)

# Perform 5-fold cross-validation and calculate the MSE for each fold
cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# Convert negative MSE to positive MSE
cv_mse_scores = -cv_scores

# Calculate the mean cross-validated MSE
mean_cv_mse = cv_mse_scores.mean()

# Display the cross-validated MSE
print("Cross-Validated Mean Squared Error:", mean_cv_mse)


Single Test Set Mean Squared Error: 0.1523399721070815
Cross-Validated Mean Squared Error: 0.17702834210001095


Compare and contrast the results. What is the difference between the train-test split and cross-validation results? Do you "trust" one more than the other?

## Level Up: Let's Build It from Scratch!

### Create a Cross-Validation Function

Write a function `kfolds(data, k)` that splits a dataset into `k` pieces that are as close to equal in size as possible. If the number of records is not divisible by `k`, make the first few folds one record larger than the later ones.

For example, if you had this dataset:

In [10]:
example_data = pd.DataFrame({
    "color": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
})
example_data

| | color |
|---|---|
| 0 | red |
| 1 | orange |
| 2 | yellow |
| 3 | green |
| 4 | blue |
| 5 | indigo |
| 6 | violet |


`kfolds(example_data, 3)` should return:

* a dataframe with `red`, `orange`, `yellow`
* a dataframe with `green`, `blue`
* a dataframe with `indigo`, `violet`

The example dataframe has 7 records, which is not evenly divisible by 3, so the 1 "leftover" record extends the length of the first dataframe.
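
The sizing rule can be worked out with integer division. The sketch below only computes the fold lengths, not the folds themselves, so the exercise stays intact; `fold_sizes` is a hypothetical helper name and not something the lab requires.

```python
def fold_sizes(n_records, k):
    """Return k fold lengths that sum to n_records, with the larger folds first."""
    base, leftover = divmod(n_records, k)
    # The first `leftover` folds each take one extra record
    return [base + 1 if i < leftover else base for i in range(k)]

print(fold_sizes(7, 3))   # [3, 2, 2]
print(fold_sizes(10, 5))  # [2, 2, 2, 2, 2]
```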

In [11]:
def kfolds(data, k):
    folds = []
    
    # Your code here
    
    return folds

In [12]:
results = kfolds(example_data, 3)
for result in results:
    print(result, "\n")

### Apply Your Function to the Ames Housing Data

Get folds for both `X` and `y`.

In [None]:
# Apply kfolds() to X and y with 5 folds


### Perform a Linear Regression for Each Fold and Calculate the Test Error

Remember that for each fold you will need to concatenate all but one of the folds to represent the training data, while the one remaining fold represents the test data.
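
A minimal sketch of that loop structure follows, using plain Python lists in place of dataframes so that it runs on its own. With dataframe folds, `pd.concat` would play the role of the list concatenation here. The fold contents are made up for illustration.

```python
# Three made-up folds, standing in for the dataframes returned by kfolds()
folds = [["red", "orange", "yellow"], ["green", "blue"], ["indigo", "violet"]]

splits = []
for i, test_fold in enumerate(folds):
    # Training data: every fold except the i-th, joined together
    train = [record for j, fold in enumerate(folds) if j != i for record in fold]
    splits.append((train, test_fold))

for train, test in splits:
    print("train:", train, "| test:", test)
```

Each record appears in exactly one test fold and in the training data of every other iteration.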

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import numpy as np

# Number of folds
k = 5

# Initialize KFold with k splits
kf = KFold(n_splits=k, shuffle=True, random_state=4)

# List to store test errors for each fold
test_errs = []

# Iterate over each fold
for train_index, test_index in kf.split(X):
    # Split into train and test for the fold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Initialize the Linear Regression model
    model = LinearRegression()
    
    # Fit the model to the training data
    model.fit(X_train, y_train)
    
    # Generate predictions on the test set
    y_test_pred = model.predict(X_test)
    
    # Evaluate test errors
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_errs.append(test_mse)

# Print test errors for each fold
print(test_errs)


[0.15233997210708142, 0.17537000743137915, 0.23594210947295105, 0.15838282254669322, 0.16149643043448308]


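If your code was written correctly, these should be the same errors as scikit-learn produced with `cross_val_score` (within rounding error). This only holds for contiguous, unshuffled folds like those `kfolds()` produces: with an integer `cv` and a regressor, `cross_val_score` uses unshuffled K-fold splits, so the errors above, which come from `KFold` with `shuffle=True`, will not match it fold for fold. Test this out below: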

In [16]:
# Compare your results with sklearn results
# Perform 5-fold cross-validation and calculate the MSE for each fold using cross_val_score
cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# Convert negative MSE to positive MSE
cv_mse_scores = -cv_scores

# Print the cross-validated MSE scores
print("Cross-Validated MSE Scores (cross_val_score):", cv_mse_scores)


Cross-Validated MSE Scores (cross_val_score): [0.12431546 0.19350065 0.1891053  0.17079325 0.20742705]
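
Averaging these five fold errors reproduces the cross-validated MSE reported earlier, and their spread gives a sense of how much the error from any single split can vary. The sketch below uses only the values printed above.

```python
import statistics

# Fold errors copied from the cross_val_score output above
cv_mse_scores = [0.12431546, 0.19350065, 0.1891053, 0.17079325, 0.20742705]

mean_mse = statistics.mean(cv_mse_scores)
std_mse = statistics.stdev(cv_mse_scores)  # sample standard deviation across folds

print(f"mean: {mean_mse:.4f}, std: {std_mse:.4f}, "
      f"range: {min(cv_mse_scores):.4f} to {max(cv_mse_scores):.4f}")
```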


This was a bit of work! Hopefully you have a clearer understanding of the underlying logic for cross-validation if you attempted this exercise.

## Summary

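Congratulations! You are now familiar with cross-validation and know how to use `cross_val_score()`. Remember that results obtained from cross-validation are generally more robust than those from a single train-test split, because every record is used for testing exactly once and the score is averaged over several folds.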