# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll practice your cross-validation skills!


## Objectives

You will be able to:

- Perform cross-validation on a model
- Compare and contrast model validation strategies

## Let's Get Started

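The code to preprocess the Ames Housing dataset is included below. For the sake of expediency, the features are log-transformed and normalized on the full dataset before any splitting. This can leak information from the test data into training and therefore produce overly optimistic model metrics.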

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

ames_cont = ames[continuous]

# log features
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

# normalize (subtract mean and divide by std)

def normalize(feature):
    return (feature - feature.mean()) / feature.std()

ames_log_norm = ames_log.apply(normalize)

# one hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)

preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)

X = preprocessed.drop('SalePrice_log', axis=1)
y = preprocessed['SalePrice_log']
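
The leakage mentioned above comes from computing the mean and standard deviation on every row, including rows that later end up in the test set. A minimal sketch of a leakage-free alternative follows. It assumes synthetic data from `make_regression` as a stand-in for the Ames features and swaps in scikit-learn's `StandardScaler` for the `normalize` helper; it is an illustration, not the lab's preprocessing.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Ames features (assumption: not the real data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=4)

# Learn mean and std from the training rows only
scaler = StandardScaler().fit(X_tr)

# Apply the training statistics to both sets
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # training columns are centered
print(X_te_scaled.mean(axis=0).round(3))  # test columns are only approximately centered
```

Note that `StandardScaler` divides by the population standard deviation, while pandas' `.std()` uses the sample standard deviation by default, so the scaled values would differ slightly from those produced by `normalize`.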

## Train-Test Split

Perform a train-test split with a test set of 20% and a random state of 4.

In [2]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

In [3]:
# Split the data into training and test sets (assign 20% to test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)


### Fit a Model

Fit a linear regression model on the training set.

In [4]:
# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression

In [5]:
# Instantiate and fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)


LinearRegression()

### Calculate MSE

Calculate the mean squared error on the test set.

In [6]:
# Import mean_squared_error from sklearn.metrics
from sklearn.metrics import mean_squared_error

In [7]:
# Calculate MSE on test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f'Mean Squared Error (MSE) on the test set: {mse}')


Mean Squared Error (MSE) on the test set: 0.1523399721070817


## Cross-Validation using Scikit-Learn

Now let's compare that single test MSE to a cross-validated test MSE.

In [8]:
# Import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score


In [9]:
# Find MSE scores for a 5-fold cross-validation
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mse_scores = -mse_scores


In [10]:
# Get the average MSE score
average_mse = mse_scores.mean()
average_mse

0.1770283421000112

Compare and contrast the results. What is the difference between the train-test split and cross-validation results? Do you "trust" one more than the other?

In [11]:
# Your answer here
"""
The train-test split provides a single evaluation of the model's performance on a holdout test set, which may be sensitive to the specific data split. 
If the test set is unrepresentative or small, the performance estimate can be overly optimistic or pessimistic. 
In contrast, cross-validation (especially 5-fold) gives a more robust estimate by averaging performance over multiple splits, reducing the variance in the evaluation metrics.

In terms of trust, cross-validation is generally considered more reliable because it uses the entire dataset for both training and testing, which mitigates the risk of an unlucky train-test split. 
However, the computational cost of cross-validation is higher, so if speed is critical, the train-test split might be preferred. 
In practice, cross-validation is usually preferred for a more reliable model evaluation.
"""
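
To see how much a single train-test split can depend on the random state, one option is to repeat the split with several seeds and compare the spread with a 5-fold result. The sketch below assumes synthetic data from `make_regression` rather than the Ames data, and the choice of ten seeds is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: not the Ames dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# Test MSE from ten different single splits
split_mses = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    fitted = LinearRegression().fit(X_tr, y_tr)
    split_mses.append(mean_squared_error(y_te, fitted.predict(X_te)))

# Test MSE for each fold of a 5-fold cross-validation
cv_mses = -cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring='neg_mean_squared_error'
)

print(f"Single-split MSE range: {min(split_mses):.1f} to {max(split_mses):.1f}")
print(f"5-fold CV mean MSE: {cv_mses.mean():.1f} (std {cv_mses.std():.1f})")
```

Reporting the standard deviation across folds alongside the mean gives a rough sense of how stable the estimate is.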


## Level Up: Let's Build It from Scratch!

### Create a Cross-Validation Function

Write a function `kfolds(data, k)` that splits a dataset into `k` evenly sized pieces. If the full dataset is not divisible by `k`, make the first few folds one larger than the later ones.

For example, if you had this dataset:

In [12]:
example_data = pd.DataFrame({
    "color": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
})
example_data

    color
0     red
1  orange
2  yellow
3   green
4    blue
5  indigo
6  violet


`kfolds(example_data, 3)` should return:

* a dataframe with `red`, `orange`, `yellow`
* a dataframe with `green`, `blue`
* a dataframe with `indigo`, `violet`

The example dataframe has 7 records, which is not evenly divisible by 3, so the "leftover" 1 record extends the length of the first dataframe.

In [13]:
def kfolds(data, k):
    folds = []
    n = len(data)
    fold_size = n // k   # base size of every fold
    remainder = n % k    # number of folds that get one extra record

    start = 0
    for i in range(k):
        # The first `remainder` folds are one record larger than the rest
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(data.iloc[start:end])
        start = end

    return folds

In [14]:
results = kfolds(example_data, 3)
for result in results:
    print(result, "\n")

    color
0     red
1  orange
2  yellow 

   color
3  green
4   blue 

    color
5  indigo
6  violet 
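
As a quick cross-check of the sizing rule, NumPy's `np.array_split` distributes leftover items the same way: the first `n % k` pieces get one extra element. The sketch below splits row positions rather than the dataframe itself; the variable names are only for illustration.

```python
import numpy as np

n_records, k = 7, 3

# Split the row positions 0..6 into 3 nearly equal chunks
position_chunks = np.array_split(np.arange(n_records), k)

fold_sizes = [len(chunk) for chunk in position_chunks]
print(fold_sizes)  # [3, 2, 2]
```

These sizes match the three dataframes printed above.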



### Apply Your Function to the Ames Housing Data

Get folds for both `X` and `y`.

In [15]:
# Apply kfolds() to X and y with 5 folds
X_folds = kfolds(X, 5)
y_folds = kfolds(y, 5)

# Print the folds for X and y to verify
for i in range(5):
    print(f"Fold {i+1} - X fold:\n{X_folds[i]}\n")
    print(f"Fold {i+1} - y fold:\n{y_folds[i]}\n")


Fold 1 - X fold:
     LotArea_log  1stFlrSF_log  GrLivArea_log  BldgType_2fmCon  \
0      -0.133185     -0.803295       0.529078                0   
1       0.113403      0.418442      -0.381715                0   
2       0.419917     -0.576363       0.659449                0   
3       0.103311     -0.439137       0.541326                0   
4       0.878108      0.112229       1.281751                0   
..           ...           ...            ...              ...   
287    -0.208982     -0.795950      -1.538509                0   
288     0.156994     -0.645537      -1.395230                0   
289    -0.070186     -1.445511      -0.079173                0   
290     1.053039     -0.074628       0.874786                0   
291    -0.898448     -0.522097       0.539579                1   

     BldgType_Duplex  BldgType_Twnhs  BldgType_TwnhsE  KitchenQual_Fa  \
0                  0               0                0               0   
1                  0               0        

Fold 3 - X fold:
     LotArea_log  1stFlrSF_log  GrLivArea_log  BldgType_2fmCon  \
584    -0.756638     -0.348746       0.278715                0   
585     0.452790      1.911383       1.040415                0   
586     0.243216     -0.870183      -1.609221                0   
587    -0.067974     -0.788622      -1.531529                0   
588     1.970412      0.905029       0.081793                0   
..           ...           ...            ...              ...   
871    -0.065764     -1.000542       0.479581                0   
872    -0.034653     -0.610760      -1.362102                0   
873     0.567694     -0.202622      -0.367495                0   
874    -0.887266     -1.546307      -0.710288                0   
875    -0.011322      0.217645       1.796823                0   

     BldgType_Duplex  BldgType_Twnhs  BldgType_TwnhsE  KitchenQual_Fa  \
584                0               0                0               0   
585                0               0        

### Perform a Linear Regression for Each Fold and Calculate the Test Error

Remember that for each fold you will need to concatenate all but one of the folds to represent the training data, while the one remaining fold represents the test data.

In [16]:
# Collect the test MSE for each of the k folds
test_errs = []
k = 5

for n in range(k):
    # Split into train and test for the fold
    X_train = pd.concat([X_folds[i] for i in range(k) if i != n])
    X_test = X_folds[n]
    y_train = pd.concat([y_folds[i] for i in range(k) if i != n])
    y_test = y_folds[n]
    
    # Fit a linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    # Evaluate test errors
    mse = mean_squared_error(y_test, y_pred)
    test_errs.append(mse)


print(test_errs)

[0.12431546148437424, 0.19350064631313132, 0.18910530431311184, 0.17079325250026917, 0.20742704588916946]


If your code was written correctly, these should be the same errors as scikit-learn produced with `cross_val_score` (within rounding error). Test this out below:

In [17]:
# Compare your results with sklearn results
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error')

# Convert negative MSE to positive MSE
cv_mse = -cv_scores

# Compare the manual test errors with the cross_val_score results
print("Manual test errors (from kfolds function):")
print(test_errs)

print("\nCross-validation MSE from cross_val_score:")
print(cv_mse)

# Check if the results are similar
assert np.allclose(test_errs, cv_mse, atol=1e-5), "The errors are not the same!"
print("\nThe errors are identical within the rounding error!")


Manual test errors (from kfolds function):
[0.12431546148437424, 0.19350064631313132, 0.18910530431311184, 0.17079325250026917, 0.20742704588916946]

Cross-validation MSE from cross_val_score:
[0.12431546 0.19350065 0.1891053  0.17079325 0.20742705]

The errors are identical within the rounding error!
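
The match is expected: when `cv` is an integer and the estimator is a regressor, `cross_val_score` uses `KFold` without shuffling, which cuts the rows into consecutive blocks and gives the first `n % k` folds one extra record. A small sketch on a toy array of 7 items (chosen here to mirror the color example) shows the test indices it produces:

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(7)

# shuffle defaults to False, so folds are consecutive blocks of rows
splitter = KFold(n_splits=3)

test_folds = [test_idx.tolist() for _, test_idx in splitter.split(toy)]
print(test_folds)  # [[0, 1, 2], [3, 4], [5, 6]]
```

If shuffling were turned on, the folds would no longer line up with the manual `kfolds` output and the per-fold errors would differ.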


This was a bit of work! Hopefully you have a clearer understanding of the underlying logic for cross-validation if you attempted this exercise.

## Summary

Congratulations! You are now familiar with cross-validation and know how to use `cross_val_score()`. Remember that results obtained from cross-validation are generally more robust than those from a single train-test split.