## Deep Learning and Computer Vision

### Overfitting, Regularization and Cross-validation

Shani Israelov

Jean Monnet University, 2023

In this introductory exercise, we work on a polynomial ridge regression task. The goal is
to understand the notions of overfitting, regularization and cross-validation.
We have some points in a 2D space (X, y), which are our training data. We want to learn a polynomial
function P() that predicts y from X: y = P(X).
The test data will be provided once you think you have a « good » predictor.
The optimization step cannot be changed; you can only play with two hyperparameters: the order
of the polynomial function and the regularization term.
You can use your own Python environment or just run your code online on trinket
(https://trinket.io/python3).

1/ Read, understand and run the provided code.


In [1]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
# from sklearn.model_selection import cross_val_predict

# Data
X=[[-3], [-2.5], [-2], [-1.5], [-1], [-0.5], [0], [0.5], [1], [1.5], [2]]
y=[-0.303, -0.545, -1.025, -0.959, -0.768, -0.375, -0.021, 0.438, 0.883, 0.807, 0.932]

# Hyperparameters
degree=1
regul_param=0

# Model
model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=regul_param))

# Training
model.fit(X, y)

# Testing
ypred=model.predict(X)
r2 = r2_score(y,ypred)
print("Training score : %0.5f" % (r2))

Training score : 0.71818


##### Self-Notes:
##### make_pipeline
Construct a Pipeline from the given estimators.
Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.

##### PolynomialFeatures(degree)
Generate polynomial and interaction features.
Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
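
To make the expansion concrete, here is a minimal standard-library sketch of what a degree-2 expansion computes for a sample [a, b]. The helper `poly_features` is hypothetical and written only for illustration; it is not scikit-learn's implementation, although it lists the monomials in the same order as the example above.

```python
from itertools import combinations_with_replacement
from math import prod


def poly_features(sample, degree):
    """Hypothetical helper: all monomials of the inputs up to `degree`."""
    features = []
    for d in range(degree + 1):
        # Each index tuple picks d input features (with repetition) to multiply.
        # The empty tuple (d = 0) gives the constant feature 1.
        for idx in combinations_with_replacement(range(len(sample)), d):
            features.append(prod(sample[i] for i in idx))
    return features


a, b = 2, 3
# Order: [1, a, b, a^2, ab, b^2]
print(poly_features([a, b], degree=2))  # [1, 2, 3, 4, 6, 9]
```

In this exercise X has a single feature, so a degree-d expansion simply yields [1, x, x^2, ..., x^d].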

##### Ridge(alpha=regul_param)
Linear least squares with l2 regularization. Minimizes the objective function:
||y - Xw||^2_2 + alpha * ||w||^2_2
This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression (i.e., when y is a 2d-array of shape (n_samples, n_targets)).

Source: scikit-learn documentation.

2/ What is the R2 score (maximum, zero value) ?


You need to understand these metrics to judge whether a regression model is accurate or misleading.
In regression, the total variance measures how far the observed values spread around their mean. The part the model leaves unexplained is the spread of the observed values around the predicted values, and the goal is to keep it low.
The R2 score is at most 1 (100%) and can be negative. It is closely related to the MSE, but not the same: R2 = 1 - (residual sum of squares) / (total sum of squares), which is often summarized as
 “(total variance explained by model) / total variance.”
A low value means the model explains little of the variance, which usually, but not in all cases, indicates a regression model that is not valid.
From: https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/

r2_score is the R2 (coefficient of determination) regression score function of scikit-learn.
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). In the general case when the true y is non-constant, a constant model that always predicts the average y disregarding the input features would get a score of 0.0.
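
The three reference values (maximum, zero, negative) can be checked on a toy example. This is a minimal sketch with made-up targets; the function `r2` below is a hand-written illustration of the formula R2 = 1 - SS_res / SS_tot, not scikit-learn's code.

```python
def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]  # toy targets, chosen only for illustration

perfect = r2(y_true, [1.0, 2.0, 3.0])     # 1.0: the maximum score
constant = r2(y_true, [2.0, 2.0, 2.0])    # 0.0: always predicts the mean
reversed_ = r2(y_true, [3.0, 2.0, 1.0])   # -3.0: worse than predicting the mean
print(perfect, constant, reversed_)
```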

3/ For the orders from 1 to 8, which polynomial function provides the best training results
(regularization coefficient = 0, here) ? Why ?


In [2]:
for i in range(9):
    # Hyperparameters
    degree=i
    regul_param=0
    # Model
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=regul_param))
    # Training
    model.fit(X, y)
    # Testing
    ypred=model.predict(X)
    r2 = r2_score(y,ypred)
    print("Degree : %f, Training score : %0.5f" % (degree, r2))

Degree : 0.000000, Training score : 0.00000
Degree : 1.000000, Training score : 0.71818
Degree : 2.000000, Training score : 0.85192
Degree : 3.000000, Training score : 0.98302
Degree : 4.000000, Training score : 0.98515
Degree : 5.000000, Training score : 0.99111
Degree : 6.000000, Training score : 0.99114
Degree : 7.000000, Training score : 0.99882
Degree : 8.000000, Training score : 0.99898


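We get the highest training score for degree 8. This makes sense: each higher-degree polynomial contains the lower-degree ones as special cases, so without regularization the training score cannot decrease as the degree grows. With only 11 data samples, a degree-8 polynomial (9 coefficients) is probably overfitted to the data.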

4/ By fixing the order to 5, play with the regularization coefficient from 1e-4 to 10. What is the
impact on the training result ? Why ?

In [3]:
regul_coeffs = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
for i in range(len(regul_coeffs)):
    # Hyperparameters
    degree=5
    regul_param=regul_coeffs[i]
    # Model
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=regul_param))
    # Training
    model.fit(X, y)
    # Testing
    ypred=model.predict(X)
    r2 = r2_score(y,ypred)
    print("Regularization coefficient : %f, Training score : %0.5f" % (regul_param, r2))

Regularization coefficient : 0.000100, Training score : 0.99111
Regularization coefficient : 0.001000, Training score : 0.99110
Regularization coefficient : 0.010000, Training score : 0.99109
Regularization coefficient : 0.100000, Training score : 0.99013
Regularization coefficient : 1.000000, Training score : 0.95363
Regularization coefficient : 10.000000, Training score : 0.80067


We can see that the larger the regularization coefficient, the lower the training score.
A regularization technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. Regularization significantly reduces the variance of the model without a substantial increase in its bias.
From: https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a

If alpha is bigger, the regularization term has more weight in the objective: large coefficients are penalized more, so the optimizer gives up some fit to the training points in exchange for smaller weights, and the training error increases.
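
The effect on the weights can be observed directly. The sketch below reuses the training data and the degree-5 pipeline from this exercise and reports the l2 norm of the fitted Ridge coefficients next to the training R2. The helper `fit_stats` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Training data from the exercise
X = [[-3], [-2.5], [-2], [-1.5], [-1], [-0.5], [0], [0.5], [1], [1.5], [2]]
y = [-0.303, -0.545, -1.025, -0.959, -0.768, -0.375, -0.021, 0.438, 0.883, 0.807, 0.932]


def fit_stats(alpha, degree=5):
    """Hypothetical helper: coefficient norm and training R2 for one alpha."""
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
    model.fit(X, y)
    coef_norm = float(np.linalg.norm(model.named_steps["ridge"].coef_))
    return coef_norm, model.score(X, y)  # score() is R2 for regressors


for alpha in [1e-4, 1e-1, 10]:
    norm, r2 = fit_stats(alpha)
    print("alpha=%g  ||w||=%.4f  training R2=%.5f" % (alpha, norm, r2))
```

As alpha grows, the coefficient norm shrinks and the training R2 drops with it, which is the trade-off described above.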

5/ Use the « cross_val_predict » function to run a cross validation. What would be a good number of
folds, here ?

In [4]:
from sklearn.model_selection import cross_val_predict
degree = 5
regul_param = 0.1
cv = 8
model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=regul_param))
# cross_val_predict fits a fresh clone of the model on each training split,
# so no prior model.fit(X, y) is needed
y_pred = cross_val_predict(model, X, y, cv=cv) # cv is the number of folds, default is 5 
r2 = r2_score(y,y_pred)
print("Folds Number: %d, Cross-validation score : %0.8f" % (cv, r2))

Folds Number: 8, Cross-validation score : 0.93221821


##### cross_val_predict

Generate cross-validated estimates for each input data point.
The data is split according to the cv parameter (int, number of folds). Each sample belongs to exactly one test set, and its prediction is computed with an estimator fitted on the corresponding training set.
Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. Results can differ from cross_validate and cross_val_score unless all test sets have equal size and the metric decomposes over samples.

6/ Test a cross validation prediction with an order of 5 and a regularization of 0.1. Observe the result.
How many values does this vector contain ?
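
Since each sample belongs to exactly one test fold, the prediction vector should hold one value per training sample, which is 11 here. A minimal sketch to check this, reusing the data and the settings of question 5 (order 5, regularization 0.1, 8 folds):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Training data from the exercise
X = [[-3], [-2.5], [-2], [-1.5], [-1], [-0.5], [0], [0.5], [1], [1.5], [2]]
y = [-0.303, -0.545, -1.025, -0.959, -0.768, -0.375, -0.021, 0.438, 0.883, 0.807, 0.932]

model = make_pipeline(PolynomialFeatures(5), Ridge(alpha=0.1))

# Each prediction comes from a model that did not see that sample during fitting.
y_pred = cross_val_predict(model, X, y, cv=8)

print(len(y_pred), len(y))  # 11 11
```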

7/ Use the cross-validation to find the best hyperparameters. 


In [7]:
regul_coeffs = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
cv = 8  # number of folds
max_r2 = float("-inf")  # R2 can be negative, so do not start from 0
for degree in range(1, 9):
    for regul_param in regul_coeffs:
        model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=regul_param))
        # cross_val_predict fits a clone of the model on each training split
        y_pred = cross_val_predict(model, X, y, cv=cv)
        r2 = r2_score(y, y_pred)
        # Keep the hyperparameters with the best cross-validated score
        if r2 > max_r2:
            max_r2 = r2
            best_degree = degree
            best_regul_param = regul_param

print("Degree: %d, Regularization: %f, cv: %d, Cross-validation score : %0.8f" % (best_degree, best_regul_param, cv, max_r2))

Degree: 3, Regularization: 0.000100, cv: 8, Cross-validation score : 0.96958647
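
The same kind of search can also be sketched with scikit-learn's GridSearchCV. Note that this is a different setup from the loop above, chosen here as an assumption: it uses leave-one-out folds and the negative mean squared error instead of a pooled R2 over 8 folds, because R2 is not well defined on test folds that hold a single sample. It may therefore select different hyperparameters.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Training data from the exercise
X = [[-3], [-2.5], [-2], [-1.5], [-1], [-0.5], [0], [0.5], [1], [1.5], [2]]
y = [-0.303, -0.545, -1.025, -0.959, -0.768, -0.375, -0.021, 0.438, 0.883, 0.807, 0.932]

pipeline = make_pipeline(PolynomialFeatures(), Ridge())

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "polynomialfeatures__degree": list(range(1, 9)),
    "ridge__alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1, 10],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_squared_error",  # assumption: MSE instead of pooled R2
    cv=LeaveOneOut(),                  # assumption: one test sample per fold
)
search.fit(X, y)

print(search.best_params_)
print("Mean cross-validated MSE: %0.5f" % (-search.best_score_))
```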


8/ Test your solution on the test data.

In [10]:
degree = 3
regul_param = 0.0001
model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=regul_param))
model.fit(X, y)
Xtest=[[-3.2], [-2.2], [-1.2], [-0.2], [0.8], [1.8], [2.8]]
ytest=[0.058, -0.808, -0.932, -0.199, 0.717, 0.974, 0.335]
ytestpred=model.predict(Xtest)
r2 = r2_score(ytest, ytestpred)
print("Test score : %0.5f" % (r2))

Test score : 0.98749
