NAME: __TODO: FULLNAME__

# Machine Learning Practice - Asynchronous
## Homework 05: Regularization 

### Data set
The dataset is identical to the one we used in HW04.


### Task
For this assignment you will be exploring **regularization.** Regularization
is a powerful tool in machine learning that imposes constraints on a model
during training in order to mitigate overfitting to the training set and
improve generalization. By adding one or more terms to the cost (error)
function that penalize large weights, the learning algorithm tries to fit
the data while avoiding weight values that might lead to overfitting of the
training data.
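
As a minimal sketch of what "adding a penalty term" means, the snippet below evaluates an L1-penalized cost of the form documented for scikit-learn's `Lasso` (half the mean squared error plus `alpha` times the sum of absolute weights). The function name `lasso_cost` and the synthetic data are assumptions made for illustration only; they are not part of the assignment code.

```python
import numpy as np

def lasso_cost(w, b, X, y, alpha):
    """Half mean squared error plus an L1 penalty on the weights."""
    residual = y - (X @ w + b)
    data_term = np.sum(residual ** 2) / (2 * X.shape[0])
    penalty = alpha * np.sum(np.abs(w))
    return data_term + penalty

# Synthetic data (illustrative only): y depends on two of three features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, 0.0, -2.0])
y = X @ w_true

# The true weights fit perfectly, so only the penalty contributes
small = lasso_cost(w_true, 0.0, X, y, alpha=0.1)

# Inflated weights fit worse AND pay a larger penalty
large = lasso_cost(10 * w_true, 0.0, X, y, alpha=0.1)
print(small, large)
```

Increasing `alpha` makes the penalty dominate, which pushes the optimizer toward smaller (and, for an L1 penalty, often exactly zero) weights.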


### Objectives
* Use and understand regularization in regression
* Learn to select hyper-parameters to tune model behavior


### Instructions
* All homework must be individual work.  Do not look at or copy solutions from other students, from the Internet, or from LLMs
* Only work in a copy of the file that is from your ~/homework_in/ directory
   + __If you do not use your own copy of this file, then it is an automatic zero on the assignment__
* Read the code below 
* For any cell that is flagged as *TODO*, complete the code according to the specifications
* Execute each cell and verify that it is showing correct results.  Note that because we are reusing variables, the order of execution is *really* important.
* Hand-In Procedure
  + Make sure that your notebook has been saved.  You are responsible for ensuring that the copy that you submit is current and complete
  + The name of the file should be the same as what we gave you
  + Download this file to your local machine (extension: .ipynb)
  + Submit to the Gradescope Notebook HW05 dropbox



### General References
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Sci-kit Learn Linear Models](https://scikit-learn.org/stable/api/sklearn.linear_model.html)
* [Sci-kit Learn Model Selection](https://scikit-learn.org/stable/api/sklearn.model_selection.html)
* [JobLib](https://joblib.readthedocs.io/en/latest/)


In [None]:
# PROVIDED
import pickle as pkl
import pandas as pd
import numpy as np
import os, re, fnmatch, time
import matplotlib.pyplot as plt
import joblib
import copy

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

from sklearn.model_selection import GridSearchCV 
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.metrics import make_scorer

# Default figure parameters
plt.rcParams['figure.figsize'] = (10,7)
plt.rcParams['font.size'] = 12
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.constrained_layout.use'] = True
plt.rcParams['axes.titlesize'] = 18
plt.rcParams['axes.labelsize'] = 12
                                   
%matplotlib inline

# LOAD DATA

In [None]:
""" PROVIDED: Execute cell
Load the BMI data from all the folds
"""
fname = '/mlp/datasets/bmi/bmi_dataset.pkl'

with open(fname, 'rb') as f:
    bmi = pkl.load(f)
    theta_folds = bmi['theta']
    dtheta_folds = bmi['dtheta']
    ddtheta_folds = bmi['ddtheta']
    torque_folds = bmi['torque']
    time_folds = bmi['time']
    MI_folds = bmi['MI'] 

print("Number of folds:", len(MI_folds))

# Helper Functions

In [None]:
""" PROVIDED
Evaluate the performance of an already trained model

"""

def predict_score_eval(model, X, y, convert_deg=False):
    '''
    Compute the model predictions and corresponding scores.

    PARAMS:
        model: the trained model used to make predictions
        X: feature data (MxN)
        y: desired output (Mxk)
        convert_deg: Boolean flag to indicate whether rmse should be
            converted from rad to deg
            
    RETURNS:
        mse: mean squared error for each column (k vector)
        rmse: rMSE (k vector)
        fvaf: fraction of variance accounted for metric (k vector)
        preds: predictions made by the model (M x k matrix)
        var: variance of the desired output for each column (k vector)
    '''
    # use the model to predict the outputs from the input data
    preds = model.predict(X) 
    

    # Compute VAR/MSE/RMSE
    mse = np.sum(np.square(y - preds), axis=0) / y.shape[0] 
    var = np.var(y, axis=0)

    fvaf = 1 - mse/var 
    
    rmse = np.sqrt(mse) 
    
    if convert_deg:
        rmse = rmse * 180 / np.pi 

    return mse, rmse, fvaf, preds, var
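
To build intuition for the FVAF metric used above, here is a small sketch that applies the same formula (`1 - mse/var`) to hand-made predictions. The helper name `fvaf_score` and the toy values are assumptions for illustration; the assignment itself uses `predict_score_eval()`.

```python
import numpy as np

def fvaf_score(y, preds):
    """FVAF = 1 - MSE / VAR, the same formula as in predict_score_eval()."""
    mse = np.mean(np.square(y - preds), axis=0)
    return 1 - mse / np.var(y, axis=0)

y = np.array([0.0, 1.0, 2.0, 3.0])

# Perfect predictions explain all of the variance
fvaf_perfect = fvaf_score(y, y)

# Always predicting the mean explains none of it
fvaf_mean = fvaf_score(y, np.full_like(y, y.mean()))

# Predictions worse than the mean give a negative FVAF
fvaf_bad = fvaf_score(y, -y)

print(fvaf_perfect, fvaf_mean, fvaf_bad)
```

FVAF has an upper bound of 1 but no lower bound, so a strongly negative validation FVAF is a sign that the model is doing worse than a constant predictor.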




In [None]:
'''
PROVIDED: Execute Cell
'''
def extract_data_set(folds, data):
    '''
    For the data provided, extract only the specified folds and concatenate them together
    
    :param folds: Python list of fold indices to extract
    :param data: Python list of all folds of any number of data fields (e.g., 20 folds 
        of ddtheta and torque)
    
    :return: Tuple of the specific types, containing only the specified folds
    '''
    # For each field in data, extract only the specified folds
    output = [np.concatenate([d[f] for f in folds]) for d in data]
    
    # Convert the list to a tuple
    return tuple(output)
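
The following sketch shows how `extract_data_set()` behaves on toy inputs. The fold contents and the names `X_folds` and `y_folds` are hypothetical; the function body is a compact restatement of the helper above so that the snippet runs on its own.

```python
import numpy as np

def extract_data_set(folds, data):
    # Same logic as the provided helper above
    return tuple(np.concatenate([d[f] for f in folds]) for d in data)

# Hypothetical toy data: 3 folds, each with a different number of samples
X_folds = [np.zeros((4, 2)), np.ones((5, 2)), 2 * np.ones((6, 2))]
y_folds = [np.zeros((4, 1)), np.ones((5, 1)), 2 * np.ones((6, 1))]

# Select folds 0 and 2 from both fields at once
X_sub, y_sub = extract_data_set([0, 2], [X_folds, y_folds])
print(X_sub.shape, y_sub.shape)  # (10, 2) (10, 1)
```

Note that the second argument is a list of fields, and the returned tuple has one concatenated array per field, in the same order.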

In [None]:
"""
TODO

Construct training, validation and test sets

Training set: used for selecting model parameters
Validation set: used for selecting model hyper-parameters
Test set: used sparingly to evaluate the final models

We are building models to predict joint position
"""
# Extract fold indices for the training, validation and testing sets
trainset_fold_inds = [19] 
validationset_fold_inds = [0, 1, 2, 3] 
testset_fold_inds = [4, 5, 6, 7] 

# Data to predict: Joint position
predict_folds = theta_folds

# We are focusing on just the elbow
predict_index = 1

# Combine the folds into singular numpy arrays for each of the Training, Validation and Testing sets
#  Use extract_data_set() above

# Training set
timetrain, Xtrain, ytrain = extract_data_set(# TODO  
# Extract just the predict_index
ytrain = np.squeeze(ytrain[:,predict_index]) 

# Validation set
timeval, Xval, yval = extract_data_set( # TODO
yval = np.squeeze(yval[:,predict_index]) 

# Testing set
timetest, Xtest, ytest = extract_data_set( # TODO
ytest = np.squeeze(ytest[:,predict_index]) 


In [None]:
# PROVIDED: Execute Cell

print('Train:', Xtrain.shape, ytrain.shape)
print('Validation:', Xval.shape, yval.shape)
print('Test:', Xtest.shape, ytest.shape)

## Linear Model

In [None]:
""" TODO
Construct and train a model using the training set.  This model is a pipeline:
- StandardScaler
- LinearRegression

Display the Training set rmse (degrees) and fvaf
"""

model_lnr = Pipeline([
 # TODO
])

model_lnr.fit(Xtrain, ytrain)

# Show the performance of the model with respect to the training set
#  Print FVAF and RMSE (latter in degrees) 

mse, rmse_deg, fvaf, pred, var = predict_score_eval(# TODO
print(fvaf, rmse_deg)

In [None]:
# TODO
# Show model performance with respect to the validation data set
#  Print FVAF and RMSE (latter in degrees) 
mse, rmse_deg, fvaf, pred, var = predict_score_eval(#TODO
print(fvaf, rmse_deg)

In [None]:
# DELETE
# Show model performance with respect to the test data set
#  Print FVAF and RMSE (latter in degrees) 
mse, rmse_deg, fvaf, pred, var = predict_score_eval(#TODO
print(fvaf, rmse_deg)

## Regularized Regression

In [None]:
# TODO

# Create a Lasso model Pipeline:
# - StandardScaler()
# - Lasso()

model_regularized = Pipeline([
 # TODO
])

# A set of alpha parameter values to try 
#  These are 28 values from 10^-7 to 10^0, evenly spaced on a log scale

alphas = np.logspace(-7, 0, base=10, num=28, endpoint=True)
alphas
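
As a quick check on what `np.logspace` produces here, the sketch below inspects the spacing of the same call. The variable names `exponents`, `steps` and `ratios` are introduced only for this illustration.

```python
import numpy as np

alphas = np.logspace(-7, 0, base=10, num=28, endpoint=True)

# In log10 space the values are evenly spaced
exponents = np.log10(alphas)
steps = np.diff(exponents)

# Equivalently, each alpha is a constant multiple of the previous one
ratios = alphas[1:] / alphas[:-1]

print(len(alphas), alphas[0], alphas[-1])
print(steps[0], ratios[0])
```

With 28 points spanning 7 decades, consecutive exponents differ by 7/27, so each alpha is roughly 1.82 times the previous one rather than a full factor of 10.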

In [None]:
# TODO

def hyperparameter_loop(model, alphas, Xtrain, ytrain, Xval, yval, convert_deg=False):
    '''
    Loop over all possible alphas:
    - Set the Lasso model alpha parameter to the specific alpha
    - Fit model to Xtrain/ytrain
    - Compute rmse (DEG) and FVAF for Xtrain/ytrain and Xval/yval & record these in 
            numpy arrays 
    Return the fvaf and rmse_degree for both the training and validation sets
    
    :param model: ML model to fit
    :param alphas: List of alpha hyper-parameter values to try
    :param Xtrain: training set inputs
    :param ytrain: training set desired output
    :param Xval: validation set inputs
    :param yval: validation set desired output
    :param convert_deg: Convert from radians to degrees
    
    :return: rmse/fvaf for the training set and validation set, as well as the zero coefficient count
    '''
    rmse_train = np.zeros((len(alphas),))
    rmse_valid = np.zeros((len(alphas),))
    fvaf_train = np.zeros((len(alphas),))
    fvaf_valid = np.zeros((len(alphas),))
    zero_count = np.zeros((len(alphas),))
    
    # Loop over all possible alphas
    for i, a in enumerate(alphas):
        # Copy model
        model_tmp = copy.deepcopy(model)
        
        # Set alpha property of the Lasso model
        # TODO
        
        # Fit the model to the training set
        
        # TODO
        
        # Record rmse/fvaf for both training and validation sets
        _, rmse_deg, fvaf, _, _ = predict_score_eval( #TODO
        rmse_train[i] = rmse_deg
        fvaf_train[i] = fvaf
        
        _, rmse_deg, fvaf, _, _ = predict_score_eval( #TODO
        rmse_valid[i] = rmse_deg
        fvaf_valid[i] = fvaf
        
        # Count the number of model parameters that are exactly zero
        zero_count[i] =  # TODO
        
        
    # Return training and validation performance arrays
    return rmse_train, fvaf_train, rmse_valid, fvaf_valid, zero_count
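
The general mechanics of changing a hyper-parameter inside a scikit-learn `Pipeline` can be sketched on synthetic data as follows. The step names `'scaler'` and `'regression'`, the three alpha values, and the data are all assumptions chosen for illustration; your own pipeline may use different step names, in which case the `set_params` keyword changes accordingly.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# Step names are illustrative choices, not required names
pipe = Pipeline([('scaler', StandardScaler()), ('regression', Lasso())])

zero_counts = []
for a in [1e-4, 1e-1, 10.0]:
    # '<step name>__<parameter>' addresses a parameter of one pipeline step
    pipe.set_params(regression__alpha=a)
    pipe.fit(X, y)

    # Fitted coefficients live on the final estimator
    coef = pipe.named_steps['regression'].coef_
    zero_counts.append(int(np.sum(coef == 0)))

print(zero_counts)
```

For a large enough alpha the L1 penalty drives every coefficient to exactly zero, which is why the count reaches the total number of features at the top of the range.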

In [None]:
# TODO
# Call hyperparameter_loop with the regularized model
rmse_train, fvaf_train, rmse_valid, fvaf_valid, zero_count = hyperparameter_loop( # TODO


In [None]:
# TODO
# Plot training and validation FVAF as a function of alpha
#  Set the xscale to 'log'

plt.figure()
# TODO


In [None]:
# TODO
# Plot training and validation rmse as a function of alpha
#  Set the xscale to 'log'

plt.figure()
# TODO


In [None]:
# TODO
# Plot the FRACTION of parameters that are exactly zero as a function of alpha
#  Set the xscale to 'log'

plt.figure()
# TODO


In [None]:
# TODO
# Identify and print the index in fvaf_valid that is best
idx_fvaf = # TODO
idx_fvaf

In [None]:
# TODO
# Show the alpha that corresponds to this best model
# TODO


In [None]:
# TODO
# Identify and print the index in rmse_valid that is best
idx_rmse = # TODO
idx_rmse
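
Selecting the "best" entry of a score array depends on whether larger or smaller is better. The sketch below uses made-up score values (not results from this dataset) to show `np.argmax` and `np.argmin` side by side.

```python
import numpy as np

alphas = np.array([1e-3, 1e-2, 1e-1, 1.0])

# Made-up validation scores, for illustration only
fvaf_scores = np.array([0.40, 0.55, 0.52, 0.10])
rmse_scores = np.array([12.0, 9.5, 9.8, 15.0])

# Higher FVAF is better; lower RMSE is better
best_by_fvaf = np.argmax(fvaf_scores)
best_by_rmse = np.argmin(rmse_scores)

print(best_by_fvaf, alphas[best_by_fvaf])
print(best_by_rmse, alphas[best_by_rmse])
```

The returned index can be used directly to look up the corresponding alpha, as in the last two lines.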

In [None]:
# TODO
# Show the alpha that corresponds to this best model
# TODO


In [None]:
# TODO
# Set the regularized model alpha to the best value with respect to FVAF
#  and fit the model to the training data

# TODO


In [None]:
# TODO
# Compute the predictions for the training data
predtrain =   # TODO


In [None]:
# TODO
# Compute the predictions for the test data
predtest = # TODO


# Report the fvaf and rmse for the test data
# TODO
_, rmse_deg, fvaf, _, _ = # TODO
print(fvaf, rmse_deg)

In [None]:
# TODO
# You have already fit the LinearRegression model to the training data
# (above).   Use it to predict arm movement for the test data
preds_lnr =  # TODO

In [None]:
# Report the LinearRegression fvaf and rmse for the test data
# TODO
_, rmse_deg, fvaf, _, _ = # TODO
print(fvaf, rmse_deg)

In [None]:
# TODO
# Plot: ground truth, regularized model predictions and the Linear model predictions for 
#  time period 700 to 720

plt.figure()
# TODO


In [None]:
""" TODO: complete implementation
Generate a plot that contains two overlapping histograms:
- Coefficients discovered by LinearRegression
- Coefficients discovered by the best regularized model

"""
nbins = 51
start = -0.05
end = 0.05
incr = (end - start) / nbins
bins = np.arange(start, end, incr)

# When using plt.hist(), use bins=bins.  This will use the exact same bins for
#   both histograms
plt.figure()
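
To see why sharing one set of bin edges matters when comparing two distributions, consider this sketch. It uses `np.histogram` and `np.linspace` in place of the `plt.hist` and `np.arange` calls from the cell above, and the two coefficient arrays are synthetic stand-ins, not model output. `plt.hist` accepts a sequence of bin edges through its `bins` argument in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a dense coefficient vector
coef_dense = rng.normal(scale=0.02, size=500)

# Stand-in for a sparse coefficient vector: about 80% set to exactly zero
coef_sparse = np.where(rng.random(500) < 0.8, 0.0, coef_dense)

# 52 edges define 51 bins; the middle bin is centered on zero
edges = np.linspace(-0.05, 0.05, 52)
counts_dense, edges_dense = np.histogram(coef_dense, bins=edges)
counts_sparse, edges_sparse = np.histogram(coef_sparse, bins=edges)

# Same edges for both, so the counts are directly comparable bin by bin
print(counts_dense[25], counts_sparse[25])
```

Because both histograms use identical edges, the spike in the zero-centered bin of the sparse vector can be compared directly against the dense one.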



## Reflection
Respond to each of the following questions with short answers.

_Q1. For the simple LinearRegression model, what is the difference in performance between the training and test sets?  Explain this difference._

**TODO**


_Q2. For the FVAF vs alpha curves, describe the difference between the training and validation data sets.  Explain this difference._

**TODO**


_Q3.  Referring to the figure showing the fraction of zero coefficients as a function of alpha, describe and then explain the shape of the curve._

**TODO**

_Q4. How does the performance of the best Regularized Model compare to that of the LinearRegression model with respect to the test data set?_

**TODO**


_Q5. How many non-zero coefficients are there for the best regularized model?_

**TODO**


_Q6. Why are the model coefficient distributions different for the LinearRegression and the Lasso models?_

**TODO**

_Q7. True or False? The best choice of alpha is the same whether we are optimizing for FVAF or RMSE._

**TODO**

_Q8. Compare and contrast the three joint position curves (ground truth, regularized model and Linear model) in the time-series figure above._

**TODO**

