# Sandeep Pandey - 8878312 - Lab 5

In [15]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

In [16]:
# Step 1: Load the diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

In [17]:
# Initialize lists to store results
degrees = list(range(9))
r_squared = []
mae = []
mape = []

In [18]:
# Steps 2 and 3: Generate polynomial features and run 5-fold cross-validation
for degree in degrees:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression()
    # Mean R-squared across the 5 folds
    r2_scores = cross_val_score(model, X_poly, y, cv=5, scoring='r2')
    r_squared.append(np.mean(r2_scores))
    # scikit-learn reports error metrics negated, so flip the sign back
    mae_scores = cross_val_score(model, X_poly, y, cv=5, scoring='neg_mean_absolute_error')
    mae.append(-np.mean(mae_scores))
    mape_scores = cross_val_score(model, X_poly, y, cv=5, scoring='neg_mean_absolute_percentage_error')
    mape.append(-np.mean(mape_scores))

In [19]:
# Step 4: Create a DataFrame to store the results
results_df = pd.DataFrame({
    'Degree': degrees,
    'R-Squared': r_squared,
    'MAE': mae,
    'MAPE': mape
})

In [20]:
# Step 5: Calculate the mean and standard deviation of each column across all degrees
mean_values = results_df.mean()
std_deviation = results_df.std()
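
Step 5 averages over the rows of `results_df`, that is, across degrees. If the goal is instead the mean and standard deviation of each metric across the five folds for a given degree, one possible sketch uses `cross_validate`, which scores several metrics in a single pass. The names `scoring`, `rows`, and `fold_summary`, and the restriction to degrees 1 and 2, are assumptions made for illustration and are not part of the original lab.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)

# Hypothetical short names mapped to scikit-learn scorer strings
scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "mape": "neg_mean_absolute_percentage_error",
}

rows = []
for degree in (1, 2):  # low degrees only, to keep the sketch fast
    # The pipeline applies the feature expansion inside every fold
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    rows.append({
        "Degree": degree,
        "R2 mean": np.mean(cv["test_r2"]),
        "R2 std": np.std(cv["test_r2"]),
        # Error scorers are negated by scikit-learn, so flip the sign of the mean
        "MAE mean": -np.mean(cv["test_mae"]),
        "MAE std": np.std(cv["test_mae"]),
        "MAPE mean": -np.mean(cv["test_mape"]),
        "MAPE std": np.std(cv["test_mape"]),
    })

fold_summary = pd.DataFrame(rows)
print(fold_summary)
```

The per-fold standard deviation shows how stable each degree is across splits, which a single averaged score cannot show.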

In [21]:
# Step 6: Identify the best model
# idxmax() returns an index label, so select the row with .loc rather than .iloc
best_model = results_df.loc[results_df['R-Squared'].idxmax()]

In [22]:
# Best model parameters
best_model

Degree        1.000000
R-Squared     0.482316
MAE          44.276499
MAPE          0.394860
Name: 1, dtype: float64
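
`idxmax()` returns an index label, not a position, so `.loc` is the matching accessor; `.iloc` only picks the same row while the DataFrame keeps its default 0..n-1 index. A minimal sketch of the difference, using a made-up table (the `demo` frame and its values are hypothetical and unrelated to the lab results):

```python
import pandas as pd

# Hypothetical scores, indexed by labels that start at 1 instead of 0
demo = pd.DataFrame({"R-Squared": [0.2, 0.5, 0.3]}, index=[1, 2, 3])

best_label = demo["R-Squared"].idxmax()  # label of the row with the highest score
by_label = demo.loc[best_label]          # row whose label equals best_label
by_position = demo.iloc[best_label]      # row at that position, a different row here

print(best_label, by_label["R-Squared"], by_position["R-Squared"])
```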

In [25]:
results_df

   Degree   R-Squared         MAE      MAPE
0       0   -0.027506   66.045624  0.623622
1       1    0.482316   44.276499  0.394860
2       2    0.391502   46.612882  0.402669
3       3 -183.550427  341.013011  2.342632
4       4  -70.667516  303.158461  2.453685
5       5  -67.387407  295.686026  2.405233
6       6  -67.448908  295.632654  2.404961
7       7  -67.448503  295.630255  2.404951
8       8  -67.446928  295.605950  2.404757


### R-Squared (Coefficient of Determination):
Degree 1 has the highest cross-validated R-Squared, approximately 0.4823. The linear model therefore explains about 48% of the variance in the target on held-out folds, which is a moderate level of predictive power.
Degree 2 comes second at about 0.3915, roughly 9 percentage points lower, so adding squared and interaction terms already hurts generalization.
From Degree 3 onwards the R-Squared values are strongly negative (between about -67 and -184). A negative R-Squared means the model predicts held-out data worse than a constant prediction equal to the mean of the target, which would score 0. Degree 0, the intercept-only model, sits just below 0 at -0.0275 because it predicts the training-fold mean rather than the test-fold mean. The high-degree models are overfitting: the number of polynomial terms grows quickly relative to the 442 samples, so they fit the training folds very closely but do not generalize to unseen data.
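
One way to see why the higher degrees collapse is to count the columns that `PolynomialFeatures` generates. For n input features and degree d, with the bias column included, the expansion has C(n + d, d) terms. The sketch below uses only that formula. The 442 samples and 10 features are properties of the scikit-learn diabetes dataset, and the training-fold size follows from `cv=5`; the helper name `n_poly_terms` and the constants are assumptions for illustration.

```python
from math import comb

N_FEATURES = 10  # columns in the diabetes dataset
N_SAMPLES = 442  # rows in the diabetes dataset
# Smallest training set under 5-fold CV: all rows minus the largest test fold
N_TRAIN = N_SAMPLES - (N_SAMPLES // 5 + 1)

def n_poly_terms(n_features: int, degree: int) -> int:
    """Number of polynomial terms up to `degree`, including the bias term."""
    return comb(n_features + degree, degree)

for degree in range(9):
    terms = n_poly_terms(N_FEATURES, degree)
    note = "more columns than training rows" if terms > N_TRAIN else ""
    print(f"degree {degree}: {terms:>6} terms {note}")
```

At degree 3 there are already 286 terms for about 353 training rows, and from degree 4 (1001 terms) the least-squares problem has more unknowns than equations, which is consistent with the negative scores in the table.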

### Mean Absolute Error (MAE):
Degree 1 has the lowest MAE (about 44.28), meaning the smallest average absolute difference between predicted and actual values, so the linear model gives the most accurate predictions on average.
Degree 2 follows with a slightly higher MAE of about 46.61.
Beyond Degree 2 the MAE jumps to roughly 296 to 341, more than six times the Degree 1 error, so these models perform poorly in terms of absolute prediction accuracy.

### Mean Absolute Percentage Error (MAPE):
Degree 1 has the lowest MAPE (about 0.395, or 39.5% when expressed as a percentage), meaning the smallest average relative difference between predicted and actual values.
Degree 2 follows closely at about 0.403.
As with MAE, the MAPE rises sharply beyond Degree 2, to between about 2.34 and 2.45 (234% to 245%), so these models also have much higher relative prediction errors.
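
For reference, the three metrics discussed in these sections can be written out directly. The sketch below is a plain-Python version of the standard definitions with hypothetical toy values (`y_true`, `good`, `bad`); it is not the scikit-learn implementation. MAPE is returned as a fraction, the same scale as the results table, so 0.39 corresponds to roughly 39%.

```python
def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when predictions are worse than the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Average absolute difference, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Average absolute difference relative to the true value, as a fraction."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data, not taken from the diabetes dataset
y_true = [100.0, 200.0, 300.0]
good = [110.0, 190.0, 310.0]  # close to the truth
bad = [300.0, 100.0, 0.0]     # worse than always predicting the mean (200)

for name, pred in (("good", good), ("bad", bad)):
    print(name, r2(y_true, pred), mae(y_true, pred), mape(y_true, pred))
```

The `bad` predictions produce a negative R-Squared together with a large MAE and a MAPE above 1, the same combination seen for Degree 3 and higher.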

### Insights:
The best-performing model, based on the provided metrics, is Degree 1. It exhibits the highest R-Squared, and the lowest MAE and MAPE values, indicating better overall predictive performance compared to higher-degree models.

The models from Degree 3 onwards perform extremely poorly, with strongly negative R-Squared values and MAE and MAPE values roughly six to eight times those of Degree 1. This pattern points to overfitting: the high-degree polynomial models fit the training folds too closely, capturing noise rather than the underlying pattern, and so fail to generalize to held-out data.