## Lab 5 - Cross-Validation for Model Selection

### Task

1. Utilize the diabetes dataset from lab 4. Perform cross-validation on nine polynomial models, ranging from degree 0 to 8. (2 points)
2. Construct a table summarizing the cross-validation results. Each model should have a separate row in the table. Include the R-Squared and Mean Absolute Error (MAE) metrics for each model. Calculate the mean value and standard deviation of these metrics from the cross-validation. Include both values. (2 points)
3. Identification of the Best Model: Identify the model that exhibits the highest performance based on the R-Squared and MAE metrics. Provide an explanation for choosing this specific model. (2 points)

In [27]:
# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import (
    train_test_split,
    cross_val_score
)


In [28]:
# Load the diabetes dataset
X, y = datasets.load_diabetes(as_frame=True, scaled=False, return_X_y=True) # as_frame returns data as a pandas DataFrame

X.shape, y.shape

((442, 10), (442,))

In [29]:
# Splitting Data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, y_train.shape

((353, 10), (353,))

In [30]:
X_test.shape, y_test.shape

((89, 10), (89,))

In [31]:
X.head(10)

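| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 2.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.0 | 4.8598 | 87.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.0 | 3.8918 | 69.0 |
| 2 | 72.0 | 2.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.0 | 4.6728 | 85.0 |
| 3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.0 | 4.8903 | 89.0 |
| 4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.0 | 4.2905 | 80.0 |
| 5 | 23.0 | 1.0 | 22.6 | 89.0 | 139.0 | 64.8 | 61.0 | 2.0 | 4.1897 | 68.0 |
| 6 | 36.0 | 2.0 | 22.0 | 90.0 | 160.0 | 99.6 | 50.0 | 3.0 | 3.9512 | 82.0 |
| 7 | 66.0 | 2.0 | 26.2 | 114.0 | 255.0 | 185.0 | 56.0 | 4.55 | 4.2485 | 92.0 |
| 8 | 60.0 | 2.0 | 32.1 | 83.0 | 179.0 | 119.4 | 42.0 | 4.0 | 4.4773 | 94.0 |
| 9 | 29.0 | 1.0 | 30.0 | 85.0 | 180.0 | 93.4 | 43.0 | 4.0 | 5.3845 | 88.0 |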


In [32]:
y.head()

0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64

In [33]:
X.shape

(442, 10)

Model selection uses ***K-Fold Cross Validation*** with `cv=5`. The data is split into 5 folds of roughly equal size, and each model is trained and evaluated 5 times, each time holding out a different fold as the validation set and training on the remaining four. Note that the loop below cross-validates on the full dataset (`X`, `y`), not on the training split created earlier.
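To make the fold mechanics concrete, here is a minimal sketch that performs the 5-fold split by hand and compares the result with `cross_val_score`. It assumes that, for a regressor, an integer `cv` corresponds to an unshuffled `KFold`; the names `manual_r2`, `reference_r2`, and `fold_sizes` are illustrative, and plain linear regression with the default (scaled) copy of the dataset stands in for the polynomial pipelines.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Illustrative data: the diabetes dataset as NumPy arrays (default scaling)
X_demo, y_demo = datasets.load_diabetes(return_X_y=True)

# Assumption: cv=5 on a regressor behaves like KFold(n_splits=5) without shuffling
kf = KFold(n_splits=5)

fold_sizes = []
manual_r2 = []
for train_idx, val_idx in kf.split(X_demo):
    # Train on four folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    predictions = model.predict(X_demo[val_idx])
    manual_r2.append(r2_score(y_demo[val_idx], predictions))
    fold_sizes.append(len(val_idx))

manual_r2 = np.array(manual_r2)

# Reference: the same evaluation through cross_val_score
reference_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')

print("Validation fold sizes:", fold_sizes)
print("Manual R-Squared per fold:", manual_r2.round(4))
print("Matches cross_val_score:", np.allclose(manual_r2, reference_r2))
```

Because 442 is not divisible by 5, the folds cannot be exactly equal: the first two validation folds hold 89 rows and the remaining three hold 88.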

In [34]:
cross_val_scores = []
degrees = np.arange(0, 9)

for degree in degrees:
    # Build a pipeline: polynomial feature expansion followed by linear regression
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Apply 5-fold cross-validation and calculate the mean and standard deviation of R-Squared
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    mean_r2 = cv_scores.mean()
    std_r2 = cv_scores.std()

    # Repeat for MAE; the scorer returns negated MAE, so values closer to 0 are better
    cv_scores_MAE = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    mean_MAE = cv_scores_MAE.mean()
    std_MAE = cv_scores_MAE.std()

    cross_val_scores.append([degree, mean_r2, std_r2, mean_MAE, std_MAE])

In [35]:
df = pd.DataFrame(cross_val_scores, columns=['Degree','Mean R-Squared','Standard dev R-Squared', 'Mean MAE', 'Standard dev MAE'])

In [36]:
df

| | Degree | Mean R-Squared | Standard dev R-Squared | Mean MAE | Standard dev MAE |
|---|---|---|---|---|---|
| 0 | 0 | -0.027506 | 0.036772 | -66.045624 | 3.47466 |
| 1 | 1 | 0.482316 | 0.049269 | -44.276499 | 2.10011 |
| 2 | 2 | -0.528281 | 1.086913 | -66.857204 | 24.676461 |
| 3 | 3 | -203.419757 | 225.878956 | -342.05218 | 142.438635 |
| 4 | 4 | -571.083108 | 369.891883 | -657.260477 | 159.475902 |
| 5 | 5 | -436.856887 | 379.100423 | -562.993636 | 59.917202 |
| 6 | 6 | -1695.473267 | 2632.637763 | -742.496432 | 191.119317 |
| 7 | 7 | -5530.894075 | 9518.586955 | -1032.681725 | 393.439617 |
| 8 | 8 | -16076.255116 | 28049.952578 | -1475.658532 | 706.280274 |


In [37]:
# Identify the best model based on R-Squared and MAE metrics.
# Both scores follow the "higher is better" convention (MAE is negated), so argmax applies to each.
best_r2_model = degrees[np.argmax(df['Mean R-Squared'])]
best_mae_model = degrees[np.argmax(df['Mean MAE'])]

print(f"\nBest model based on R-Squared: Degree {best_r2_model}")
print(f"Best model based on MAE: Degree {best_mae_model}")


Best model based on R-Squared: Degree 1
Best model based on MAE: Degree 1


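As shown above, the degree-1 polynomial model (plain linear regression) achieves the best cross-validation scores on both metrics: mean R-Squared = 0.482316 and a mean `neg_mean_absolute_error` score of -44.276499, i.e. a mean MAE of about 44.28. scikit-learn scorers follow the convention that higher values are better, so MAE is reported negated and the value closest to zero corresponds to the smallest error. Degree 1 is also stable across folds (standard deviation 0.049 for R-Squared and 2.10 for MAE). Every other model has a negative mean R-Squared, meaning it predicts the validation folds worse than a constant equal to the fold mean, and the scores deteriorate sharply from degree 3 onward, which is consistent with overfitting.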
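One plausible contributor to the collapse at higher degrees is the number of columns that `PolynomialFeatures` generates. For 10 input features and degree d, the expansion (including the bias column) has C(10 + d, d) columns. The sketch below tabulates that count for degrees 0 to 8 and cross-checks degrees 1 to 3 against scikit-learn; `feature_counts` is an illustrative name.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10   # columns in the diabetes dataset
n_samples = 442   # rows in the diabetes dataset

# Number of polynomial terms of total degree <= d in n_features variables
feature_counts = {d: comb(n_features + d, d) for d in range(0, 9)}

# Cross-check a few degrees against PolynomialFeatures itself
for d in (1, 2, 3):
    poly = PolynomialFeatures(d).fit(np.zeros((1, n_features)))
    assert poly.n_output_features_ == feature_counts[d]

for d, count in feature_counts.items():
    print(f"degree {d}: {count:>6} columns for {n_samples} samples")
```

From degree 4 upward the expanded matrix has more columns (1001 and up) than the roughly 353 rows available in each training fold, so the least-squares fit can reproduce the training data almost exactly while generalizing poorly.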

