## Name - Shubham Prajapati
## Student ID - 8945201

## Q1. Utilize the diabetes dataset from lab 4. Perform cross-validation on nine polynomial models, ranging from degree 0 to 8.

In [1]:
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

cross_val_scores = []

# degrees from 0 to 8
for degree in range(9):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    
    # Creating a linear regression model
    model = LinearRegression()
    
    # 5-fold cross-validation; the default score for a regressor is R-squared
    scores = cross_val_score(model, X_poly, y, cv=5)
    cross_val_scores.append(scores.mean())

for degree, score in enumerate(cross_val_scores):
    print(f"Degree {degree}: {score:.2f}")


Degree 0: -0.03
Degree 1: 0.48
Degree 2: 0.39
Degree 3: -182.37
Degree 4: -70.67
Degree 5: -67.39
Degree 6: -67.45
Degree 7: -67.45
Degree 8: -67.44
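
The negative scores above are possible because `cross_val_score` falls back to the estimator's default scorer, which for `LinearRegression` is R², and R² drops below zero whenever a model predicts worse than a constant equal to the mean of the targets. A minimal pure-Python sketch of that definition, using made-up numbers and a hypothetical helper named `r_squared` (not part of the notebook):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Illustrative values only, not taken from the diabetes dataset
y_true = [1.0, 2.0, 3.0]
print(r_squared(y_true, [1.0, 2.0, 3.0]))  # 1.0  (perfect predictions)
print(r_squared(y_true, [2.0, 2.0, 2.0]))  # 0.0  (always predicting the mean)
print(r_squared(y_true, [3.0, 2.0, 1.0]))  # -3.0 (worse than predicting the mean)
```

This is why the large negative values for degrees 3 to 8 can be read as predictions on held-out folds that are far worse than a constant baseline.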


## Q2. Construct a table summarizing the cross-validation results. Each model should have a separate row in the table. Include the R-Squared, Mean Absolute Error (MAE) and MAPE metrics for each model. Calculate the mean value and standard deviation of these metrics from the cross-validation. Include both values.

In [2]:
import pandas as pd
from sklearn.metrics import r2_score, mean_absolute_error

results = []

# Looping through degrees from 0 to 8
for degree in range(9):
    poly2 = PolynomialFeatures(degree=degree)
    X_poly2 = poly2.fit_transform(X)
    model2 = LinearRegression()

    # Fit once on the full dataset and reuse the predictions for every metric.
    # These are in-sample (training) metrics, not per-fold cross-validation scores.
    y_pred = model2.fit(X_poly2, y).predict(X_poly2)

    # Calculating the R-squared, MAE, and MAPE (MAPE expressed in percent)
    r_squared = r2_score(y, y_pred)
    mae = mean_absolute_error(y, y_pred)
    mape = np.mean(np.abs((y - y_pred) / y)) * 100
    results.append([degree, r_squared, mae, mape])

# Creating new DataFrame to display the results
columns = ['Degree', 'R-Squared', 'MAE', 'MAPE']
df = pd.DataFrame(results, columns=columns)

# Calculating the mean and standard deviation of each column across the nine models
# (column-wise over degrees, not over cross-validation folds)
mean_values = df.mean()
std_values = df.std()

# Showing the final result
print(df)
print("\nMean Values:")
print(mean_values)
print("\nStandard Deviation:")
print(std_values)


   Degree  R-Squared           MAE          MAPE
0       0   0.000000  6.576457e+01  6.212156e+01
1       1   0.517748  4.327745e+01  3.878618e+01
2       2   0.592440  3.915901e+01  3.459813e+01
3       3   0.798067  2.566021e+01  2.352158e+01
4       4   1.000000  1.217859e-10  1.101501e-10
5       5   1.000000  1.246971e-10  1.139069e-10
6       6   1.000000  1.262092e-10  1.147348e-10
7       7   1.000000  1.278385e-10  1.179825e-10
8       8   1.000000  1.486468e-10  1.348009e-10

Mean Values:
Degree        4.000000
R-Squared     0.767584
MAE          19.317916
MAPE         17.669716
dtype: float64

Standard Deviation:
Degree        2.738613
R-Squared     0.345198
MAE          25.077765
MAPE         23.194237
dtype: float64


## Q3. Identification of the Best Model: Identify the model that exhibits the highest performance based on the R-Squared, MAE and MAPE metrics. Provide an explanation for choosing this specific model.

## Answer
R-Squared (R²) - The degree 4 model has an R-squared of 1.0 on the data it was fitted to. This is the highest possible value and means the model reproduces the target values exactly. Degrees 5 through 8 reach the same value.

Mean Absolute Error (MAE) - The degree 4 model has the lowest MAE in the table (about 1.2e-10), which is zero up to floating-point error.

Mean Absolute Percentage Error (MAPE) - Similarly, the degree 4 model has the lowest MAPE in the table (about 1.1e-10 percent), so the percentage error on the fitted data is negligible.

In summary, the degree 4 model ranks first on all three metrics in the Q2 table: it has an R-squared of 1.0 and the lowest MAE and MAPE. On these metrics, degree 4 is therefore selected as the best model. Because the table is computed on the same data used for fitting, this ranking measures in-sample fit. The cross-validated R² from Q1 is -70.67 for degree 4 versus 0.48 for degree 1, so the choice should be confirmed on held-out data.
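
For reference, the two error metrics compared in this answer can be written out by hand. This is a minimal pure-Python sketch with made-up numbers. The helper names `mae` and `mape` are illustrative, and `mape` assumes no true value is zero (which holds for the diabetes targets).

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute difference, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent; assumes no true value is zero."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only, not taken from the diabetes dataset
y_true = [100.0, 200.0]
y_pred = [125.0, 150.0]
print(mae(y_true, y_pred))   # 37.5  -> (25 + 50) / 2
print(mape(y_true, y_pred))  # 25.0  -> (25% + 25%) / 2
```

MAE weights every unit of error equally, while MAPE weights errors relative to the size of the true value, which is why the two can rank models differently.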

## Q4. Additional analysis and interpretation of the models' performances. You may explore further insights beyond the required metrics. The analysis should provide at least one relevant insight about the choice of the best model, or about characteristics of the chosen one (for example - an analysis of in which instances does it fail)

## Answer
Analysis of Model Performances:

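Degree 4 - This model has an R-squared of 1.0 and an MAE and MAPE of essentially zero on the fitted data. A perfect score here is a sign of overfitting rather than accuracy: a degree 4 expansion of the 10 input features has 1001 columns, more than the 442 samples, so the linear regression can reproduce every training target exactly. The cross-validated R² of -70.67 from Q1 shows that this fit does not carry over to held-out folds.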

Degree 3 - The degree 3 model has a high in-sample R-squared (0.798) with an MAE of 25.66 and a MAPE of 23.52%, but its cross-validated R² in Q1 is -182.37, so it also overfits.

Degree 2 - The degree 2 model has a moderate R-squared (0.592) with a higher MAE (39.16) and MAPE (34.60%), so it explains less of the variance and fits the data less closely.

Degree 1 - The degree 1 model has the lowest R-squared of the non-constant models (0.518) and the highest errors among them (MAE 43.28, MAPE 38.79%). Only the degree 0 model, which predicts a constant, scores lower (R-squared 0.000).

Insight - The degree 4 model scores best on the in-sample metrics but shows clear signs of overfitting: its perfect R-squared comes with a cross-validated R² of -70.67 in Q1, whereas degree 1 (0.48) and degree 2 (0.39) are the only models with clearly positive cross-validated scores. High-degree polynomial models can fit the training data exactly and still fail on unseen data, so the final choice should be validated on held-out validation or test data before it is used in a real-world application.
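
One way to see why the fit becomes exact from degree 4 onward is to count the columns that `PolynomialFeatures` generates and compare that with the number of samples. The sketch below uses the standard combinatorial count for a full polynomial expansion with the bias term included. The helper name `n_poly_terms` is illustrative, and the shape constants are those of the scikit-learn diabetes dataset.

```python
from math import comb

N_SAMPLES, N_FEATURES = 442, 10  # shape of the scikit-learn diabetes dataset

def n_poly_terms(n_features, degree):
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    return comb(n_features + degree, degree)

for degree in range(9):
    terms = n_poly_terms(N_FEATURES, degree)
    note = " (more columns than samples)" if terms > N_SAMPLES else ""
    print(f"Degree {degree}: {terms} columns{note}")
```

Degree 3 gives 286 columns and degree 4 gives 1001, so degree 4 is the first model with more parameters than observations, which matches the jump to an R-squared of 1.0 in the table.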