# Session 1a - INMAS Machine Learning Workshop, January 13-14, 2024

Instructor: Christian Kuemmerle - kuemmerle@uncc.edu

**This version of the notebook is more suitable for students with less experience in machine learning, or who are less familiar with using Python for machine learning and/or the underlying concepts. It requires less coding / Python fluency than the version ``session1a_RidgeCrossVal_advanced.ipynb``, but is similar in nature.**

## Regularization and Ridge Regression, Cross Validation

### Case study with a simple one-dimensional function

Using a simple sinusoidal one-dimensional function as an example, we explore:
   - the idea of a [training / test (or validation) set split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) of samples, and based on this, we evaluate the predictive power of 
   - [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html),
   - [ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html), which is an **$L_2$-regularized** linear model, 
   
and furthermore, explore the effect of enriching the model by using [polynomial features](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).

First, we import the packages we need. As during the entire weekend, we will use the package [scikit-learn](https://scikit-learn.org/stable/).

If you have not yet done so, you may go to the pre-work Jupyter notebook `session0a_test_environments.ipynb` available [here](https://webpages.charlotte.edu/~ckuemme1/teaching/machinelearningworkshop2023/00-preparation.html) and **execute its first code cell** to install scikit-learn.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

Importing `sklearn` should not result in an error after going through the pre-work notebook. If it does, please check out the `environment.yml` file available on the GitHub repository / shared OneDrive folder and create a corresponding conda environment. Also note the [installation instructions](https://scikit-learn.org/stable/install.html) for your specific operating system.

### A simple one-dimensional function

We define a function to sample $n$=`nr_samples` points $x_1,\ldots,x_n$ uniformly at random from the interval $[-3,3]$ and define corresponding samples $y_i=f(x_i) + \epsilon_i$ where $f(x)= \sin(4 x) + x$ and $\epsilon_i$ are i.i.d. standard normal random variables.

We start by visualizing the function $f$.

In [None]:
fn = lambda x: np.sin(4 * x) + x # This is one convenient way to define simple, "anonymous" functions.

For more about lambda expressions to define anonymous functions, see also [here](https://realpython.com/python-lambda/).

In [None]:
# Visualize the function f:
line_xvalues = np.linspace(-3, 3, 1000).reshape(-1, 1) # generate 1d grid for x-axis samples
plt.figure(figsize=(14,8))
plt.plot(line_xvalues,fn(line_xvalues))
ax = plt.gca()
ax.set_ylim(-5, 5)
ax.set_xlim(-8, 8)
ax.grid(True)

We create a data set (consisting of $x$-coordinates in '$X$' and $y$-coordinates in '$y$') based on the above model, using 'nr_samples' samples.

In [None]:
### function to create random samples from a wave-like signal
def make_wave(n_samples=100):
    rnd = np.random.RandomState(42) # this fixes the random seed (for reproducibility).
    x = rnd.uniform(-3, 3, size=n_samples)
    x = np.sort(x)
    y_no_noise = (np.sin(4 * x) + x)
    y = y_no_noise + rnd.normal(size=len(x))
    return x.reshape(-1, 1), y

In [None]:
# %% Create data
nr_samples = 50
X, y = make_wave(n_samples=nr_samples)
y = y.reshape((len(y), 1))

In [None]:
print("X values:", X.T,"\n y values:", y.T)

## Training data and test data sets

We now use the method [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test#sklearn.model_selection.train_test_split) of scikit-learn to split the available samples into two random subsets, a training dataset and a test dataset. The share of samples in the training data is provided by the choice of 'train_share', i.e., if `train_share` $= 0.6$, then 60% of the samples are used in the training dataset.
 
From the samples variable $X$, we create a [pandas](https://pandas.pydata.org/docs/) DataFrame dfX whose methods are sometimes more convenient to use. <br>
**Complete the last line below by using train_test_split** for the DataFrame dfX.
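As a rough illustration of how `train_test_split` behaves, the following sketch splits a small toy DataFrame. The names `X_toy`, `y_toy` and the choice `random_state=0` are assumptions made for this example only and are not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with one feature each.
X_toy = pd.DataFrame(np.arange(10).reshape(-1, 1))
y_toy = np.arange(10) * 2.0

# train_size is the share of samples assigned to the training set;
# random_state fixes the shuffling so that the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.6, random_state=0
)

print(X_tr.shape, X_te.shape)  # 6 training rows and 4 test rows
print(sorted(X_tr.index))      # the original row labels are preserved
```

Since the row labels of the DataFrame survive the split, they can be used later to recover which samples ended up in which subset.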

In [None]:
train_share = 0.6
dfX = pd.DataFrame(X)
X_train, X_test, y_train, y_test = # split data into a training data set of size ('train_share'*nr_samples)
# and test data set of size (1-'train_share')*nr_samples.

To see what happened, we print the x-values of both training and test dataset. As we work with `pandas` DataFrame data structures, we can simply type the variables.

In [None]:
X_train

In [None]:
X_test

The first column in each of the outputs above contains the indices of the respective data points. <br>
We can now easily obtain a vector of indices corresponding to the training and test set, respectively. We define the corresponding vectors 'id_train' and 'id_test'.

In [None]:
id_train = X_train.index # obtain index of training set
id_test = X_test.index # obtain index of test set
type(X_train)

Next, **please convert the pandas.DataFrame of X_train and X_test to a multi-dimensional array in numpy**. You may use [this](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html) method.

In [None]:
X_train_ = X_train
X_test_ = X_test
### Add your code below ###
X_train = # convert X_train_ to numpy.ndarray X_train
X_test = # convert X_test_ to numpy.ndarray X_test

In [None]:
# The samples can be visualized as follows.
# %% Plot training data
plt.figure(figsize=(14,8))
plt.plot(X_train,y_train,'o',c='blue')
ax = plt.gca()
ax.set_ylim(-4, 4)
ax.set_xlim(-6, 6)
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('none')
ax.legend(["training data"],loc=0,fontsize=8)
ax.grid(True)

In [None]:
# %% Plot training data and test data
plt.figure(figsize=(14,8))
plt.plot(X_train,y_train,'o',c='blue')
plt.plot(X_test,y_test,'o',c='red')
ax = plt.gca()
ax.set_ylim(-4, 4)
ax.set_xlim(-6, 6)
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('none')
ax.legend(["training data","test data"],loc=0,fontsize=8)
ax.grid(True)

We now train a linear regression model on the samples of the training set (without any modification on the features). This corresponds to a linear model in the standard coordinate system.

In [None]:
# %% Fit linear regression model to training data
lr = LinearRegression().fit(X_train, y_train)

**What are the _coefficient_ and the _intercept_ of the fitted linear regression model?**
<br> Hint: Use `vars` or look up in the documentation of `sklearn.linear_model.LinearRegression` how to access these values.
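For orientation, here is a minimal sketch on toy data (the arrays `X_toy` and `y_toy` are made up for this example) showing where a fitted scikit-learn estimator stores its learned parameters: in attributes whose names end with an underscore.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): points lying exactly on the line y = 2x + 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

model = LinearRegression().fit(X_toy, y_toy)

# Learned parameters are stored in attributes ending with an underscore.
print("coefficient(s):", model.coef_)
print("intercept:", model.intercept_)
```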

In [None]:
### Add your code below ###



### Generation of Polynomial Features
 
Based on the wave-like behavior of the function $f$ that is underlying the data generation process, 
we would like to have larger expressive power of the models that we use.
For this reason we use an appropriate method of sklearn's [preprocessing](https://scikit-learn.org/stable/modules/classes.html?highlight=preprocessing#module-sklearn.preprocessing) to transform the $x$-samples to a vector of monomials, and run linear regression on these "enhanced" features derived from the samples.

Run linear regression with polynomial features of degree `'degree_poly'` (which can be chosen as 10 here).
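To see what the transformation to monomials does, consider this small sketch with two made-up sample points (`X_toy` is an assumption for illustration): each scalar $x$ is mapped to the vector $(1, x, x^2, x^3)$ for degree 3.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy samples (hypothetical) with a single feature each.
X_toy = np.array([[2.0], [3.0]])

# Degree-3 polynomial features: columns are x^0, x^1, x^2, x^3.
poly_toy = PolynomialFeatures(degree=3)
X_toy_poly = poly_toy.fit_transform(X_toy)

print(X_toy_poly)  # rows (1, 2, 4, 8) and (1, 3, 9, 27)
```

Note that the first column is the constant feature $x^0 = 1$, which plays a role similar to an intercept.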

In [None]:
degree_poly = 10
poly = PolynomialFeatures(degree_poly) # define polynomial data model
X_trainpoly = poly.fit_transform(X_train) # obtain coordinates of polynomial features of training set
X_testpoly = poly.transform(X_test) # obtain coordinates of polynomial features of test set
polylr = LinearRegression().fit(X_trainpoly,y_train) # perform linear regression with polynomial features

Now, **we run ridge regression on the polynomial features for a fixed regularization parameter** $\alpha=\text{alphaval}$. Please note the options of [linear_models](https://scikit-learn.org/stable/modules/linear_model.html#linear-model).

In [None]:
alphaval = 10 # ridge regression parameter alpha
polyridge =  # perform ridge regression with polynomial features

**Analogously to the linear regression model with exclusively linear features above, print the learned model parameters in this case as well. <br> <br> What do you notice?** <br>
Play with different values of `alphaval` and see how the coefficients behave.
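The effect of the $L_2$ penalty can be sketched on independent toy data as follows. The data, the degree 5 and the three values of `alpha` are assumptions chosen for illustration only. The sketch fits ridge regression for increasing `alpha` and records the Euclidean norm of the coefficient vector, which shrinks as the regularization gets stronger.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Toy data (hypothetical): noisy samples of sin(4x) on [-1, 1].
rng = np.random.RandomState(0)
x_toy = rng.uniform(-1, 1, size=(30, 1))
y_toy = np.sin(4 * x_toy[:, 0]) + 0.1 * rng.normal(size=30)
X_toy_poly = PolynomialFeatures(degree=5).fit_transform(x_toy)

# Fit ridge regression for increasing alpha and record the coefficient norm.
coef_norms = []
for alpha in [0.01, 1.0, 100.0]:
    ridge_toy = Ridge(alpha=alpha).fit(X_toy_poly, y_toy)
    coef_norms.append(np.linalg.norm(ridge_toy.coef_))
    print(f"alpha={alpha:g}: ||coef||_2 = {coef_norms[-1]:.4f}")
```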

In [None]:
### Add your code below ###

## Model Evaluation

We want to start quantifying the predictive accuracy of the different models. To that end, we evaluate the mean squared errors as well as the resulting $R^2$ _coefficient of determination_ (see this for an [explanation](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score)).

**Compute the $R^2$ for the three considered models, for both training and test set, as well as the _mean squared errors_ for these, at the appropriate position of `trainerrors_list`, `testerrors_list`, `r2_train_list` and `r2_test_list`**.

Hint: For the three models, `lr`, `polylr` and `polyridge`, you can use e.g. `dir(polyridge)` to see a list of methods available for e.g. evaluating the model on your training or test data and for computing the $R^2$ value.
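As a hedged sketch of how the two metrics relate, the following toy example (with made-up arrays `X_toy` and `y_toy`) computes the mean squared error with `mean_squared_error` and the $R^2$ value with the estimator's `score` method, and compares the latter with the defining formula $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy data (hypothetical): roughly linear with small deviations.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

model = LinearRegression().fit(X_toy, y_toy)
y_pred = model.predict(X_toy)

mse = mean_squared_error(y_toy, y_pred)  # average squared residual
r2 = model.score(X_toy, y_toy)           # R^2 as computed by scikit-learn

# R^2 from its definition: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - y_pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse)
print("R^2 (score):", r2, " R^2 (manual):", r2_manual)
```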

In [None]:
text_trainerrors = ['Training error of '+str(lr)+':','Training error of '+str(lr)+', poly. feat.:',
                    'Training error of '+str(polyridge)+', poly. feat.:']
print(text_trainerrors)

In [None]:
trainerrors_list =     # add your code here

In [None]:
text_testerrors = ['Test error of '+str(lr)+':','Test error of '+str(lr)+', poly. feat.:','Test error of '+str(polyridge)+', poly. feat.:']
print(text_testerrors)

In [None]:
testerrors_list =      # add your code here

In [None]:
text_r2 = ['R2 (train) of '+str(lr)+':','R2 (train) of '+str(lr)+', poly. feat.:','R2 (train) of '+str(polyridge)+', poly. feat.:']

In [None]:
r2_train_list =        # add your code here

In [None]:
text_r2_test = ['R2 (test) of '+str(lr)+':','R2 (test) of '+str(lr)+', poly. feat.:','R2 (test) of '+str(polyridge)+', poly. feat.:']

In [None]:
r2_test_list =         # add your code here

To obtain simple, formatted output, it is convenient to create a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame) that includes the text as a row description and the quantities as entries.

In [None]:
pd.options.display.max_colwidth = 100
print(pd.DataFrame(trainerrors_list, index=text_trainerrors, columns=['']))
print(pd.DataFrame(r2_train_list, index=text_r2, columns=['']))

print(pd.DataFrame(testerrors_list, index=text_testerrors, columns=['']))
print(pd.DataFrame(r2_test_list, index=text_r2_test, columns=['']))

In [None]:
# We summarize the resulting curves in a plot.
# %% Plot summary
line = np.linspace(-8, 8, 1000).reshape(-1, 1)
line_poly = poly.transform(line)
plt.figure(figsize=(14,8))
plt.plot(X_train,y_train,'o',c='blue')
plt.plot(X_test,y_test,'o',c='red')
plt.plot(line, lr.predict(line))
plt.plot(line, polylr.predict(line_poly))
plt.plot(line, polyridge.predict(line_poly))
ax = plt.gca()
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_position('center')
ax.spines['top'].set_color('none')
ax.set_ylim(-6, 6)
ax.legend(["training data","test data","linear","lin., poly. features","ridge, alpha="+str(alphaval)], loc=0,fontsize=10)
ax.grid(True)

**How do you interpret the accuracy scores above in view of this plot?**

Answer: 

## Exploring trade-offs of model complexity

## Linear regression on polynomial features of different degree

We now consider the resulting errors of applying linear regression with polynomial features of different degrees in order to explore the behavior of training and test errors for different model complexities.
 
To this end, we write a snippet of code that creates two [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `train_errors` and `test_errors` that contain the mean squared errors on both training and test set that are obtained by running linear regression on polynomial features of degree between 1 and 30.

In [None]:
# %% Explore trade-off of model complexity
degree_range = range(1,31) # contains all integers from 1 to 30 (the upper bound of range is exclusive)
train_errors = np.zeros((len(degree_range), ))
test_errors = np.zeros((len(degree_range), ))
for k, degree in enumerate(degree_range):
    poly =  # define polynomial data model
    X_trainpoly =  # obtain coordinates of polynomial features of training set
    X_testpoly = # obtain coordinates of polynomial features of test set
    polylr = # perform linear regression with polynomial features
    train_errors[k] = # compute Mean Squared Error on training set
    test_errors[k] =  # compute Mean Squared Error on test set

In [None]:
train_errors

Using this, we create a plot containing training and test errors on the $y$-axis and the degree of the polynomial model on the $x$-axis.

In [None]:
# %% Plot training and test errors for different model complexities
plt.figure()
plt.plot(degree_range,train_errors)
plt.plot(degree_range,test_errors)
ax = plt.gca()
ax.set_yscale('log')
ax.legend(["training data","test data"], loc=0)
ax.set(xlabel='degree of polynomial', ylabel='mean squared error',
       title='Linear regression with polynomial features of different degree');

print("Degree resulting in smallest error on test data: %d" % degree_range[np.argmin(test_errors)])

We observe that, as the model complexity grows, the training error *decreases*, whereas the test error eventually *increases*.
 
**Can you explain this behavior?**

## Cross Validation

Since the goal of a supervised machine learning model/algorithm is to attain high accuracy / small errors on data drawn from a similar distribution as the training set (such as the test set), it is of interest to find good "hyperparameters" that select the "best" model among a class of candidate models. A typical workflow looks like this:

<img src=https://scikit-learn.org/stable/_images/grid_search_workflow.png alt="image info" width="400" />

## Ridge regression on polynomial features with different regularization parameters
 
Next, we focus on ridge regression on features with fixed degree, but using different regularization parameters (between $\alpha=10^{-5}$ and $\alpha=10^{9}$).

In [None]:
import warnings
from scipy.linalg import LinAlgWarning 
warnings.filterwarnings("ignore",category= LinAlgWarning, module='sklearn') # suppresses a warning that might come up due to ill-conditioning of linear systems

In the following, we fix the polynomial degree to 11. For such polynomial features, **write code that**
   - **uses the available training-test set split to run a [hyperparameter optimization](https://scikit-learn.org/stable/modules/grid_search.html#tuning-the-hyper-parameters-of-an-estimator) for ridge regression over $60$ different values of $\alpha$, logarithmically spaced between $10^{-5}$ and $10^{9}$. Relevant methods you might want to use are _PredefinedSplit_ and _GridSearchCV_.**
   - **Eventually, extract the test and training errors into the numpy.ndarrays `train_errors_ridge` and `test_errors_ridge`.**

In [None]:
# Creating the polynomial features
degree = 11
poly = PolynomialFeatures(degree)
X_poly = poly.fit_transform(X)

We first focus on a (simpler) hyperparameter optimization over the already existing training-test set split (a single predefined split instead of multiple folds). Strictly speaking, this is not good practice, since we use information from the test set to optimize a parameter in the learned model. We will be more careful about this when working with a "real" data set.
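A single fixed split can be encoded for scikit-learn with `PredefinedSplit`. In the sketch below, the array `fold_toy` is a made-up example: entries equal to `-1` mark samples that are always used for training, while entries equal to `0` mark samples that form the (single) test fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy fold assignment (hypothetical) for 6 samples:
# -1 -> always in the training set, 0 -> member of test fold number 0.
fold_toy = np.array([-1, 0, -1, -1, 0, -1])
cv_toy = PredefinedSplit(fold_toy)

print("number of splits:", cv_toy.get_n_splits())
for train_idx, test_idx in cv_toy.split():
    print("train:", train_idx, "test:", test_idx)
```

An object like `cv_toy` can be passed as the `cv` argument of a hyperparameter search, so that every candidate is evaluated on exactly this one split.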

In [None]:
test_fold = np.zeros((nr_samples, ))
test_fold[id_train] = -1
test_fold[id_test] = 0
from sklearn.model_selection import PredefinedSplit
cv = PredefinedSplit(test_fold)

Using this existing split, we use hyperparameter optimization for ridge regression to optimize the parameter $\alpha$.
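Grids of regularization parameters are typically spaced logarithmically. As a small illustration (the exponents and the name `grid_toy` are chosen for this example only), `np.logspace(a, b, num=k)` returns `k` values between $10^a$ and $10^b$ whose consecutive ratios are constant.

```python
import numpy as np

# Five values between 10^(-2) and 10^(2), equally spaced on a log scale.
grid_toy = np.logspace(-2, 2, num=5)
print(grid_toy)

# Consecutive entries differ by a constant factor (here: 10).
ratios = grid_toy[1:] / grid_toy[:-1]
print(ratios)
```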

In [None]:
alphas =  # create vector of logarithmically interpolated values between 10^(-5) and 10^(9) (hint: see np.logspace)
parameters = {'alpha':alphas}

In [None]:
gridsearch =  # (hint: see the GridSearchCV method)

train_errors_ridge = 
test_errors_ridge = 

In [None]:
# Define plotting function for training and test errors for different model complexities:
def plot_train_test(alphas,train_error,test_error):
    plt.figure()
    plt.plot(alphas,train_error) # use the function arguments, not the global variables
    plt.plot(alphas,test_error)
    ax = plt.gca()
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set(xlabel='alpha', ylabel='mean squared error',
           title='Ridge regression for different model complexities')
    ax.legend(["training data","test data"], loc=0)

In [None]:
# Plot resulting train/test errors:
plot_train_test(alphas,train_errors_ridge,test_errors_ridge)

## K-Fold Cross Validation

Alternatively, we also perform [k-fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html?highlight=kfold#sklearn.model_selection.KFold) with $k=5$ (see also [here](https://scikit-learn.org/stable/modules/cross_validation.html#k-fold)) of the same parameter $\alpha$ on the training set only.

In any realistic machine learning problem, this method would be much preferred over what we did above: During the process of determining the best hyperparameter $\alpha$, the test set is not touched at all - it would be only considered at the end after determining the hyperparameter to "test" the accuracy. In this way, it is more likely that we have learned to understand the actual data distribution instead of potentially overfitting our model to the particular, finite dataset at hand.
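The mechanics of $k$-fold splitting can be sketched on a toy array of 10 samples (the array `X_toy` is an assumption for illustration): with $k=5$, each sample lands in the validation fold exactly once, and each model is trained on the remaining 8 samples.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (hypothetical): 10 samples with one feature each.
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)
val_sets = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    val_sets.append(val_idx)
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
```

Passing `cv=5` to a grid search performs this kind of splitting internally and averages the scores over the five folds.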

<img src=https://scikit-learn.org/stable/_images/grid_search_cross_validation.png alt="image info" width="500" />

**Find an optimal ridge regression model by performing 5-fold cross validation on the training set (polynomial features), and report the average errors on training sets and validation sets.**

In [None]:
gridsearch_kfold = GridSearchCV(Ridge(),param_grid=parameters,
                                scoring='neg_mean_squared_error',
                                return_train_score=True,cv=5)
X_trainpoly = poly.transform(X_train)
gridsearch_kfold.fit(X_trainpoly, y_train)
train_errors_ridge_kfold = -gridsearch_kfold.cv_results_['mean_train_score']
validation_errors_ridge_kfold = -gridsearch_kfold.cv_results_['mean_test_score']

In [None]:
# Again, we plot the resulting train/validation errors:
plot_train_test(alphas,train_errors_ridge_kfold,validation_errors_ridge_kfold)

Comparing the last two plots, we see that for this data set, the two different cross-validation strategies result in basically the same behavior.

Finally, let's plot the obtained function stemming from the "optimal" ridge regression model (optimized regularization parameter $\alpha$) compared to the ridge regression model that we obtain if we choose $\alpha = 10^{-5}$ or $\alpha = 10^{9}$.

In [None]:
# %% Plot optimal, under-fitted and over-fitted model
alpha_opt = gridsearch.best_params_['alpha']
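alpha_overfit = alphas[0] # smallest alpha: weakest regularization, prone to over-fitting
alpha_underfit = alphas[-1] # largest alpha: strongest regularization, prone to under-fitting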
X_trainpoly0 = poly.fit_transform(X_train) # obtain coordinates of polynomial features of training set
line_poly = poly.transform(line)
rd = Ridge(alpha=alpha_opt).fit(X_trainpoly0,y_train)
plt.figure(figsize=(14,8))
plt.plot(X_train,y_train,'o',c='blue')
plt.plot(X_test,y_test,'o',c='red')
plt.plot(line,rd.predict(line_poly))
rd = Ridge(alpha=alpha_overfit).fit(X_trainpoly0,y_train)
plt.plot(line,rd.predict(line_poly))
rd = Ridge(alpha=alpha_underfit).fit(X_trainpoly0,y_train)
plt.plot(line,rd.predict(line_poly))
ax = plt.gca()
ax.set_ylim(-5, 5)
ax.legend(["training data","test data","optimized alpha","alpha="+str(alpha_overfit),"alpha="+str(alpha_underfit)], loc=0)

print("'Optimal' regularization parameter alpha (i.e., resulting in smallest test error): %f" % alpha_opt)

We observe that for the smallest regularization parameter $\alpha = 10^{-5}$ (which almost corresponds to a linear regression model without regularization), we can interpolate the training data points very well, at the cost of a large error on the test data points.
 
On the other hand, for large regularization parameters such as $\alpha = 10^9$, we obtain a "simple" model which interpolates the training data less well, with a comparable error on the test set.
 
The minimal error on the test data will be achieved for a value of $\alpha$ that is in between.

## Regression on the [Boston housing dataset](https://www.openml.org/d/531)
## Case study with a real data set (part 1)

In the following, we study and compare the performance of different regression methods applied to a data set with 11 regressors and one target variable. The dataset is a [housing dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) about the Boston area from 1978.

 For this data set, we cover the
   - visualization of the data,
   - preprocessing of the data,
   - differences between learning algorithms, in particular, between ridge regression and sparse regression (Lasso),
   - cross-validation of an algorithmic hyperparameter.

We will complete the first two tasks in part 1, and the last two tasks in part 2 of this session.

## Understanding the dataset

First, we load the data. We then create a [pandas](https://pandas.pydata.org/docs/) DataFrame from the data as it helps to visualize it in the following.

In [None]:
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None) # raw string avoids an invalid-escape warning
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO']
df_boston = pd.DataFrame(data[:,0:11],columns = column_names)
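df_boston.assign(MEDV=target) # displays features together with the target; assign returns a new DataFrame, df_boston itself is unchanged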

In [None]:
# Generate some descriptive statistics of dataset:
df_boston.describe()

## Visualization

Create a histogram of the target variable, the median house prices, using [matplotlib](https://matplotlib.org):

In [None]:
import matplotlib.pyplot as plt
# %% Plot histogram of median house prices in dataset
plt.figure()
plt.hist(target)
ax1 = plt.gca()
ax1.set(xlabel='median housing price (in $1000)', ylabel='count',
       title='Histogram of Boston area housing price medians')

## Preprocessing

The regressor variables cover varying scales: While the variable 'CRIM', the per capita crime rate, ranges between 0.006320 and 88.97, the variable 'NOX' (nitric oxides concentration in parts per 10 million) has a range between 0.385 and 0.871. This will have implications on the magnitudes of the coefficients in the linear models that are used by the learning algorithms we consider.

In order to mitigate the issue of these varying scales, we preprocess the data by **scaling** all feature/regressor variables to lie between 0 and 1. <br>
More information about preprocessing methods (scaling and others) can be found [here](https://scikit-learn.org/stable/modules/preprocessing.html) and [here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py).
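To make the scaling concrete, here is a minimal sketch with a made-up feature matrix `X_toy` whose two columns live on very different scales. Min-max scaling maps each column separately to $[0,1]$ via $(x - x_{\min}) / (x_{\max} - x_{\min})$, which the example checks against scikit-learn's `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (hypothetical): two columns on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaler_toy = MinMaxScaler()                     # scales each column to [0, 1]
X_toy_scaled = scaler_toy.fit_transform(X_toy)

# The same result computed by hand, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

print(X_toy_scaled)
```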

**Exercise:** Repeat the experiments below _without_ this preprocessing. How does the performance change?

In [None]:
# %% rescale regressor coordinates (improves regression performance by levelling the regression coefficient sizes)
# create numpy arrays representing features and target values
X = 
y = 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
scaler =  # define scaling object
X_scaled = # apply scaling to X

## Data split for experimental setup

We split our data set into two parts, a training and a test set. The test set contains a share of 'test_share' of the total data points.

In [None]:
from sklearn.model_selection import train_test_split
test_share = 0.2

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,random_state=10,test_size=test_share)

## Run learning algorithms

### Ridge Regression for fixed regularization

We run ridge regression for a fixed regularization parameter on the training set. By defining two functions for evaluating the performance of the model and for the visualization of a correlation between predicted target variable and true target variable, we not only achieve what we want for the ridge regression model, but can easily reuse the code for other learning algorithms below.

In [None]:
# %% Set up ridge regression model
import numpy as np
from sklearn.linear_model import Ridge

alpha_ridge = 100
ridge = Ridge(alpha=alpha_ridge,fit_intercept=True)
ridge.fit(X_train, y_train)

Compute the prediction scores of the model for training and test data, and plot the predicted against the true median housing prices.

In [None]:
def plot_correlation_true_predicted(y_train,y_test,model,figsize=(8,8)): # Scatter plot of true vs. predicted median housing price for model
    plt.figure(figsize=figsize)
    plt.scatter(y_train,model.predicted_train)
    plt.scatter(y_test, model.predicted_test)
    maxx= max(np.max(model.predicted_train),np.max(model.predicted_test),np.max(y_train),np.max(y_test))
    ax=plt.gca()
    ax.set_ylim(0, 1.02*maxx)
    ax.set_xlim(0, 1.02*maxx)
    ax.legend(["training data","test data"], loc=0)
    ax.plot(ax.get_xlim(), ax.get_ylim(), ls="--", c=".3")
    ax.set(ylabel='predicted median housing price (in $1000)', xlabel='true median housing price (in $1000)')
    ax.set(title="Correlation between true and predicted prices for model "+str(model))

In [None]:
def eval_prediction_train_test(model,X_train,X_test,y_train,y_test,plot=True):
    model.score_train = model.score(X_train, y_train)
    model.score_test = model.score(X_test, y_test)
    model.predicted_test = model.predict(X_test)
    model.predicted_train = model.predict(X_train)
    print("R^2 value of model",str(model),"on the train set: %f" % model.score_train)
    print("R^2 value of model",str(model),"on the test set: %f" % model.score_test)
    if plot:
        plot_correlation_true_predicted(y_train,y_test,model)
    return model

In [None]:
ridge = eval_prediction_train_test(ridge,X_train,X_test,y_train,y_test)

In this plot, the closeness of the data points to the main diagonal visualizes the accuracy of the prediction. It has some similarities to a plot of the confusion matrix for classification problems, which we will encounter later.

### Ridge regression with cross-validated regularization parameter (hyperparameter optimization via grid search)

Next, we optimize the regularization parameter alpha of ridge regression. We use a grid of logarithmically spaced values between $10^{-5}$ and $10^{9}$, and a 5-fold [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

In [None]:
# %% Use cross validation to optimize regularization parameter alpha of ridge regression
from sklearn.model_selection import GridSearchCV
alphas=np.logspace(-5,9,num=60) # create vector of logarithmically interpolated values between 10^(-5) and 10^(9)
parameters = {'alpha':alphas}

In [None]:
gridsearch = GridSearchCV(Ridge(),param_grid=parameters,scoring='r2',return_train_score=True,cv=5) # cv
gridsearch.fit(X_scaled, y) # use the scaled features, consistent with the preprocessing and data split above
gridsearch.train_errors = gridsearch.cv_results_['mean_train_score']
gridsearch.test_errors  = gridsearch.cv_results_['mean_test_score']

Next, we plot the training and validation scores ($R^2$ values) against the regularization parameter $\alpha$.

In [None]:
def plot_training_test(model,parameters,train_errors,test_errors): # Plot training and test errors for different model complexities of ridge regression
    plt.figure(figsize=(8,8))
    plt.plot(list(parameters.values())[0],train_errors)
    plt.plot(list(parameters.values())[0],test_errors)
    ax = plt.gca()
    ax.set_xscale('log')
    ax.set(xlabel='alpha', ylabel='R2 value',
           title=str(model)+' for different model complexities')
    ax.set_box_aspect(1)
    ax.legend(["training data","test data"], loc=0)

In [None]:
plot_training_test(Ridge(),parameters,gridsearch.train_errors,gridsearch.test_errors)

Now we cross-validate only on the training set:

In [None]:
gridsearch_train = GridSearchCV(Ridge(),param_grid=parameters,scoring='r2',return_train_score=True,cv=5)
gridsearch_train.fit(X_train,y_train)
ridge_optimized = gridsearch_train.best_estimator_

We plot the true prices vs. the prices predicted by the "optimized" ridge regression model.

In [None]:
ridge_optimized = eval_prediction_train_test(ridge_optimized,X_train,X_test,y_train,y_test)

In the first plot, we see the $R^2$ value (which can be thought of as a value that is negatively correlated to the mean squared error, or empirical risk in our case). The regularization parameter $\alpha$ to be used in our predictive model is the one that maximizes the $R^2$ value on the test set (in this context, the term *validation set* is more appropriate).