## Lab 5 - Cross-Validation for Model Selection

### Task

1. Utilize the diabetes dataset from lab 4. Perform cross-validation on nine polynomial models, ranging from degree 0 to 8. (2 points)
2. Construct a table summarizing the cross-validation results. Each model should have a separate row in the table. Include the R-Squared and Mean Absolute Error (MAE) metrics for each model. Calculate the mean value and standard deviation of these metrics from the cross-validation. Include both values. (2 points)
3. Identification of the Best Model: Identify the model that exhibits the highest performance based on the R-Squared and MAE metrics. Provide an explanation for choosing this specific model. (2 points)

Optional (No Extra Grade): Additional analysis and interpretation of the models' performance are welcome. You may explore insights beyond the required metrics and include your findings.

In [27]:
# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import (
    train_test_split,
    cross_val_score
)


In [16]:
# Load the diabetes dataset
X, y = datasets.load_diabetes(as_frame=True, scaled=False, return_X_y=True) # as_frame returns data as a pandas DataFrame

X.shape, y.shape

((442, 10), (442,))

In [17]:
# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, y_train.shape

((353, 10), (353,))

In [18]:
X_test.shape, y_test.shape

((89, 10), (89,))

In [19]:
X.head(10)

    age  sex   bmi     bp     s1     s2    s3    s4      s5    s6
0  59.0  2.0  32.1  101.0  157.0   93.2  38.0   4.0  4.8598  87.0
1  48.0  1.0  21.6   87.0  183.0  103.2  70.0   3.0  3.8918  69.0
2  72.0  2.0  30.5   93.0  156.0   93.6  41.0   4.0  4.6728  85.0
3  24.0  1.0  25.3   84.0  198.0  131.4  40.0   5.0  4.8903  89.0
4  50.0  1.0  23.0  101.0  192.0  125.4  52.0   4.0  4.2905  80.0
5  23.0  1.0  22.6   89.0  139.0   64.8  61.0   2.0  4.1897  68.0
6  36.0  2.0  22.0   90.0  160.0   99.6  50.0   3.0  3.9512  82.0
7  66.0  2.0  26.2  114.0  255.0  185.0  56.0  4.55  4.2485  92.0
8  60.0  2.0  32.1   83.0  179.0  119.4  42.0   4.0  4.4773  94.0
9  29.0  1.0  30.0   85.0  180.0   93.4  43.0   4.0  5.3845  88.0


In [20]:
y.head()

0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64

In [21]:
X.shape

(442, 10)

In [22]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
dtypes: float64(10)
memory usage: 34.7 KB


The models are evaluated with ***K-Fold Cross-Validation*** using `cv=5`. The training data is split into 5 folds of approximately equal size, and each model is trained and evaluated 5 times, each time holding out a different fold as the validation set and training on the remaining four folds.
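To make the fold mechanics concrete, here is a minimal standard-library sketch of how an unshuffled 5-fold split partitions row indices. The helper name `kfold_indices` is hypothetical, and the rule for distributing leftover rows (the first folds get one extra sample) is written to mirror the documented behaviour of scikit-learn's `KFold`, which is what an integer `cv` selects for a regressor. It is an illustration, not the library's implementation.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled k-fold splits."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


# 353 is the number of rows in X_train in this lab.
fold_sizes = [(len(train), len(val)) for train, val in kfold_indices(353, 5)]
print(fold_sizes)
# → [(282, 71), (282, 71), (282, 71), (283, 70), (283, 70)]
```

Each model is therefore fitted on roughly 282 rows and scored on roughly 71, which is worth keeping in mind when reading the per-fold standard deviations.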

In [28]:
# Define a function to create polynomial models and perform cross-validation
def perform_cross_validation(degree):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)

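    # Perform 5-fold cross-validation once per metric (both calls use the same folds)
    model = LinearRegression()
    scores_r2 = cross_val_score(model, X_train_poly, y_train, cv=5, scoring='r2')
    # scikit-learn reports MAE negated so that a higher score is always better
    scores_mae = cross_val_score(model, X_train_poly, y_train, cv=5, scoring='neg_mean_absolute_error')

    # Calculate mean and standard deviation across the 5 folds
    mean_r2 = np.mean(scores_r2)
    std_r2 = np.std(scores_r2)
    mean_mae = -np.mean(scores_mae)  # flip the sign back to a positive error
    std_mae = np.std(scores_mae)     # the standard deviation is unaffected by the sign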

    return mean_r2, std_r2, mean_mae, std_mae

In [24]:
# Create a list to store the cross-validation results
results = []

# Iterate over different polynomial degrees and perform cross-validation
for degree in range(9):
    mean_r2, std_r2, mean_mae, std_mae = perform_cross_validation(degree)
    results.append([degree, mean_r2, std_r2, mean_mae, std_mae])

In [25]:
# Create a DataFrame to store the results
results_df = pd.DataFrame(results, columns=['Degree', 'Mean R-Squared', 'Std R-Squared', 'Mean MAE', 'Std MAE'])

# Print the results table
print(results_df)

   Degree  Mean R-Squared  Std R-Squared     Mean MAE     Std MAE
0       0       -0.030469       0.047964    66.759216    7.282955
1       1        0.449256       0.144095    45.543329    3.114136
2       2       -1.774636       2.061861    86.272470   21.303926
3       3    -1067.917338     713.426262   955.567804  230.116961
4       4     -205.000825     109.833480   498.899342  139.303076
5       5     -341.064592     285.522005   533.337079  183.993533
6       6     -752.566519     723.952751   664.415606  255.201828
7       7    -1604.957874    1666.816468   861.068169  360.606033
8       8    -3668.447375    4094.407062  1171.089566  565.618535
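
One plausible reading of the sharp collapse from degree 2 onward is overfitting driven by the number of polynomial terms. The sketch below counts the columns a full polynomial expansion produces for 10 input features, assuming the `PolynomialFeatures` defaults (`include_bias=True`, `interaction_only=False`), for which the count is the binomial coefficient C(n + d, d). The value of `rows_per_fit` is an approximation of the rows available in each training fold (four fifths of 353). Unscaled features may also contribute to numerical ill-conditioning, which this count does not capture.

```python
from math import comb

n_features = 10               # columns in the diabetes feature matrix
rows_per_fit = 353 * 4 // 5   # about 282 rows are used to fit each fold

term_counts = {}
for degree in range(9):
    # Number of monomials of total degree <= `degree` in `n_features` variables.
    term_counts[degree] = comb(n_features + degree, degree)
    note = "more terms than rows" if term_counts[degree] > rows_per_fit else ""
    print(f"degree {degree}: {term_counts[degree]:>6} terms  {note}")
```

From degree 3 upward the expansion has more columns (286 and up) than rows in a training fold, so an unregularized least-squares fit can match the training folds exactly and has little reason to generalize to the held-out fold.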


In [26]:
# Identify the best model based on R-Squared and MAE metrics
best_r2_model = results_df['Mean R-Squared'].idxmax()
best_mae_model = results_df['Mean MAE'].idxmin()

print(f"\nBest model based on R-Squared: Degree {results_df['Degree'][best_r2_model]}")
print(f"Best model based on MAE: Degree {results_df['Degree'][best_mae_model]}")


Best model based on R-Squared: Degree 1
Best model based on MAE: Degree 1
