Lab1: Complete the TODO parts in the following code.
- Using the California Housing Dataset from sklearn, select attributes 1, 3, and 4 as the input features.
- Using the K-fold cross-validation technique (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html), complete the implementation to train a regression model and report performance metrics where the code asks for them.
- For multiple degrees of model complexity (i.e., the polynomial degree in this exercise), loop over the degrees, select the model with the minimum reducible error, and run the selected model on the test data. For this part, split the data into train and test sets at a 75:25 ratio and report the MSE of the final model on the test data.
- Analyse how model performance changes with the polynomial degree and the number of folds used. You can modify the code and share your analysis of the model's performance (MSE and total error); for instance, which polynomial degree gives the better model? Feel free to include other analysis of the generated models in relation to their performance results. You can even plot the results to support your analysis.
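
To make the K-fold step above concrete, here is a minimal sketch of how `KFold` partitions sample indices. The toy arrays `X_demo` and `y_demo` are hypothetical and only illustrate the index mechanics; shuffling with a fixed seed is an assumption, not something the lab requires.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 2 features (hypothetical values, only to show the index mechanics)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Assumption: shuffle with a fixed seed so the folds are reproducible
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
seen_validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_demo)):
    # Each sample lands in the validation part of exactly one fold
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen_validation_indices.extend(val_idx.tolist())
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
```

With 10 samples and 5 folds, every fold trains on 8 samples and validates on the remaining 2.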

In [7]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, explained_variance_score, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import fetch_california_housing

In [24]:
def polynomial_regression(degree, X, y, folds, test_size=0.25, random_state=42):
    """Fit a polynomial regression of the given degree with k-fold CV; return (test MSE, best model)."""
    # Define the cross-validation splitter with the requested number of folds
    kf = KFold(n_splits=folds)

    # Per-fold results: variance, bias2s, total_error, and the fitted models
    variance = []
    bias2s = []
    total_error = []
    models = []

    # Expand the inputs into polynomial features of the requested degree
    poly_features = PolynomialFeatures(degree)
    X_poly = poly_features.fit_transform(X)

    # Perform cross-validation
    for train_index, test_index in kf.split(X_poly):
        # Split data into training and testing sets for this fold
        X_train, X_test = X_poly[train_index], X_poly[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit polynomial regression model
        model = LinearRegression()
        model.fit(X_train, y_train)

        # Make predictions on the test set
        y_pred = model.predict(X_test)

        # Per-fold metrics. This lab reports explained_variance_score as "Variance"
        # and the fold MSE as "Bias2"; R^2 is computed but not reported.
        var = explained_variance_score(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        bias2 = mean_squared_error(y_test, y_pred)

        # Append results to lists
        variance.append(var)
        bias2s.append(bias2)
        total_error.append(bias2 + var)
        models.append(model)

        # Print results for this fold
        print("Variance: {:.4f}, Bias2: {:.4f}, Total error: {:.4f}".format(var, bias2, var + bias2))

    # Select the fold model with the smallest total error and print that error
    min_error_index = np.argmin(total_error)
    best_model = models[min_error_index]
    print("Total Error of Best Model:", total_error[min_error_index])

    # Split the data into train and test parts (75:25 by default) for the final evaluation.
    # Caveat: best_model was fit on a CV fold of the full data, so these test rows overlap its training rows.
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=test_size, random_state=random_state)
    # Obtain the predictions on the test data
    y_test_pred = best_model.predict(X_test)

    # Store the MSE of the selected model on the test data
    mse = mean_squared_error(y_test, y_test_pred)

    return mse, best_model
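
The function above draws its 75:25 test split from the same rows used for cross-validation. One possible alternative, sketched here under assumptions, holds out the test set first, chooses the degree by mean cross-validated MSE on the training part only, and then refits on the full training part. The helper `select_degree` and the synthetic quadratic data are hypothetical and not part of the lab code; the synthetic data is used only so the sketch runs without downloading the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def select_degree(X, y, degrees, folds=5, test_size=0.25, random_state=42):
    """Pick the degree with the lowest mean CV MSE, then score it on held-out data."""
    # Hold out the test set first so it never influences model selection
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    kf = KFold(n_splits=folds, shuffle=True, random_state=random_state)

    cv_mse = {}
    for degree in degrees:
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        # scikit-learn reports negated MSE so that higher is better; flip the sign back
        scores = cross_val_score(
            model, X_train, y_train, cv=kf, scoring="neg_mean_squared_error"
        )
        cv_mse[degree] = -scores.mean()

    # Refit the winning degree on the whole training part, then evaluate once on the test part
    best_degree = min(cv_mse, key=cv_mse.get)
    final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
    final_model.fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, final_model.predict(X_test))
    return best_degree, cv_mse, test_mse


# Synthetic quadratic data (an assumption, not the California Housing data)
rng = np.random.default_rng(0)
X_syn = rng.uniform(-2, 2, size=(400, 1))
y_syn = 1.0 + 2.0 * X_syn[:, 0] ** 2 + rng.normal(scale=0.1, size=400)

best_degree, cv_mse, test_mse = select_degree(X_syn, y_syn, degrees=range(1, 4))
print("Best degree:", best_degree, "Test MSE:", round(test_mse, 4))
```

The same call should work with the lab's `X` and `y` in place of the synthetic arrays. Averaging the error over folds, rather than keeping the single best fold, makes the choice of degree less sensitive to one unusually easy or hard fold.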

In [26]:
# Example usage: load the California Housing Dataset and select the first, third, and fourth attributes as input features in X
ca_housing = fetch_california_housing()
X = ca_housing.data[:, [0, 2, 3]]
# Set the target variable
y = ca_housing.target

degrees = range(1, 6)  # Try polynomial degrees from 1 to 5
# For each degree, report the test MSE of the best model trained with k-fold cross-validation
for degree in degrees:
    mse, _ = polynomial_regression(degree, X, y, folds=5)
    print("Degree:", degree, "MSE:", mse)

Variance: 0.5714, Bias2: 0.5454, Total error: 1.1168
Variance: 0.4615, Bias2: 0.6744, Total error: 1.1359
Variance: 0.4973, Bias2: 0.7296, Total error: 1.2268
Variance: 0.3804, Bias2: 0.7385, Total error: 1.1190
Variance: 0.5540, Bias2: 0.6649, Total error: 1.2189
Total Error of Best Model: 1.1168039310180593
Degree: 1 MSE: 0.6631111314294396
Variance: 0.0711, Bias2: 1.0627, Total error: 1.1338
Variance: 0.4939, Bias2: 0.6217, Total error: 1.1156
Variance: 0.5124, Bias2: 0.7064, Total error: 1.2188
Variance: 0.3932, Bias2: 0.7215, Total error: 1.1147
Variance: 0.5793, Bias2: 0.6307, Total error: 1.2100
Total Error of Best Model: 1.114740798455602
Degree: 2 MSE: 0.624329168125288
Variance: -9.6676, Bias2: 11.5136, Total error: 1.8461
Variance: 0.5016, Bias2: 0.6084, Total error: 1.1099
Variance: 0.4538, Bias2: 0.7873, Total error: 1.2411
Variance: 0.3982, Bias2: 0.7130, Total error: 1.1112
Variance: 0.5993, Bias2: 0.6016, Total error: 1.2009
Total Error of Best Model: 1.1099270673042758