###### FOUNDATION OF MACHINE LEARNING
### Practical Lab 
# **#5**
---

1. Utilize the diabetes dataset from lab 4. Perform cross-validation on nine polynomial models, ranging from degree 0 to 8. (2 points)
2. Construct a table summarizing the cross-validation results. Each model should have a separate row in the table. Include the R-Squared and Mean Absolute Error (MAE) metrics for each model. Calculate the mean value and standard deviation of these metrics from the cross-validation. Include both values. (2 points)

In [31]:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


X, y = datasets.load_diabetes(as_frame=True, scaled=False, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


# Holds one summary row per polynomial degree.
results = []


# Cross-validate one model and append its summary row to results
def report_metrics(degree, model):
    # cross_val_score uses 5-fold cross-validation by default
    cv_mae = cross_val_score(model, X, y, scoring="neg_mean_absolute_error")
    cv_r2 = cross_val_score(model, X, y, scoring="r2")

    results.append(
        [
            degree,
            -cv_mae.mean(),  # the scorer returns negated MAE, so flip the sign
            cv_mae.std(),  # the standard deviation is unaffected by the sign
            cv_r2.mean(),
            cv_r2.std(),
        ]
    )


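# Degrees 1 through 8; range(1, 9) does not include the degree-0 (constant) model
for degree in range(1, 9):
    # Build a pipeline that expands the features to the given polynomial degree
    # and fits an ordinary least-squares regression on them
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

    report_metrics(degree, model)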

result_df = pd.DataFrame(
    results,
    columns=[
        "Degree",
        "MAE-Mean",
        "MAE-Std",
        "R2-Mean",
        "R2-Std",
    ],
)

print(result_df.to_string(index=False))

 Degree    MAE-Mean    MAE-Std       R2-Mean       R2-Std
      1   44.276499   2.100110      0.482316     0.049269
      2   68.750394  17.429324     -0.566826     0.690190
      3  342.052153 142.438360   -203.419215   225.878367
      4  657.260477 159.475902   -571.083108   369.891883
      5  562.993636  59.917202   -436.856887   379.100423
      6  742.562586 190.956306  -1694.818876  2631.043243
      7 1032.681725 393.439617  -5530.894074  9518.586954
      8 1475.658532 706.280274 -16076.255118 28049.952582


3. Identification of the Best Model: Identify the model that exhibits the highest performance based on the R-Squared and MAE metrics. Provide an explanation for choosing this specific model. (2 points)

To find the best model by R-squared, we look in the table for the degree with the largest "R2-Mean" value. This is also the value closest to 1, since R-squared cannot exceed 1. A higher R-squared means the model explains a larger proportion of the variation in the target, which generally indicates better predictions on unseen data. The negative values for degrees 2 and above mean that those models predict worse on the held-out folds than a constant prediction of the target mean would.
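
To make the range of R-squared concrete, here is a minimal sketch that computes it by hand from its definition, one minus the residual sum of squares divided by the total sum of squares. The target values and predictions are made-up numbers chosen only for illustration and are not taken from the diabetes dataset.

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only
y_true = [100.0, 150.0, 200.0, 250.0]

r2_good = r_squared(y_true, [110.0, 140.0, 210.0, 240.0])  # close predictions
r2_baseline = r_squared(y_true, [175.0] * 4)  # always predict the mean
r2_bad = r_squared(y_true, [400.0, -100.0, 500.0, 0.0])  # wildly off

print(round(r2_good, 3))      # 0.968
print(round(r2_baseline, 3))  # 0.0
print(round(r2_bad, 3))       # -23.4
```

Predicting the mean gives exactly 0, and predictions that are further from the targets than the mean push the score below 0 without any lower bound, which is the pattern seen for the higher-degree models in the table.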

In [42]:
# Winner for best model based on R-squared metric:
f'Degree: {result_df["Degree"][result_df["R2-Mean"].idxmax()]}'

'Degree: 1'

To find the best model by Mean Absolute Error, we look in the table for the degree with the smallest "MAE-Mean" value. MAE is the average absolute difference between the predictions and the actual target values, so the lower it is, the better the model.
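
As a small worked example of the metric, the sketch below computes MAE by hand. The function name `mae` and the numbers are assumptions chosen for illustration and do not come from the diabetes dataset.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only
y_true = [100.0, 150.0, 200.0, 250.0]

mae_close = mae(y_true, [110.0, 140.0, 210.0, 240.0])  # each off by 10
mae_far = mae(y_true, [400.0, -100.0, 500.0, 0.0])  # off by 300, 250, 300, 250

print(mae_close)  # 10.0
print(mae_far)    # 275.0
```

MAE is expressed in the same units as the target, so a value of 10 means the predictions miss by 10 units on average. scikit-learn's `neg_mean_absolute_error` scorer returns this quantity with the sign flipped, which is why `report_metrics` negates the mean before storing it.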

In [43]:
# Winner for best model based on MAE metric:
f'Degree: {result_df["Degree"][result_df["MAE-Mean"].idxmin()]}'

'Degree: 1'