# Assignment 03: Regression, Learning Curves and Regularization

**Due Date:** Friday 10/02/2020 (by 5pm)


**Please fill these in before submitting, just in case I accidentally mix up file names while grading**:

Name: Joe Student

CWID-5: (Last 5 digits of cwid)

## Introduction 

In this exercise you will use what you have learned about linear regression, polynomial regression and
regularization to explore an artificial dataset.

I have generated a secret dataset.  The dataset uses a polynomial combination of a single parameter.
The unknown function is a polynomial of degree no less than 3 and no more than 15.  Some random
noise has also been added to the function, so that fitting it is not a completely trivial or obvious
exercise.  Since the dataset is generated from a polynomial function, the output labels `y` are
real-valued numbers, and thus you will be performing a regression fitting task in this assignment.

Your task, should you choose to accept it, is to load and explore the data from this function.  Your ultimate
goal is to try your best to determine the degree of the polynomial used, and the values of the parameters
then used in the secret function.  Because of the noise added to the data you are given, you will not be able
to exactly recover the parameters used to generate the artificial data.  You will even find that determining the
exact degree of the generating polynomial function is not possible.  How you apply polynomial fitting and 
regularization techniques can give different and better or worse approximations of the true underlying function.

In the cells below, I give instructions for the tasks you should attempt.  You will need to load the data and
visualize it to begin with.  Then you will be asked to apply polynomial fitting and regularization in an attempt
to fit the data.  Ultimately, at the end, you will be asked to take what you have discovered and
give your best answer for the polynomial degree and best fitted polynomial parameters for this unknown data set.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# By convention, we often just import the specific classes/functions
# from scikit-learn we will need to train a model and perform prediction.
# Here we include all of the classes and functions you should need for this
# assignment from the sklearn library, but there could be other methods you might
# want to try or would be useful to the way you approach the problem, so feel free
# to import others you might need or want to try
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


In [2]:
# notebook wide settings to make plots more readable and visually better to understand
np.set_printoptions(suppress=True)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
plt.rc('figure', titlesize=18)
plt.rc('legend', fontsize=14)
plt.rcParams['figure.figsize'] = (12.0, 8.0) # default figure size if not specified in plot
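# matplotlib 3.6 renamed the seaborn styles; use whichever name this installation provides
darkgrid_style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(darkgrid_style)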

## Part 1: Load Data, Explore and Visualize
------------

You have been given a set of 100 (artificial) data points in the file named `assg-03-data.csv` in our 
data subdirectory.  Start by loading this file into a pandas dataframe.  Explore the data a bit.  
Use the `describe()` function to get a sense of the number of values (there should be `m=100` samples),
and their mean and variance.  There are 2 columns, where `x` is the feature, and `y` is the function
value, or in other words the label we will use for the regression fitting task.  What is the range of
the `x` features?  What is the range of the `y` output label here?

Also plot a scatter plot of the data to get a sense of the function shape.  Does it appear linear
or nonlinear?
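
The loading and exploration steps above might look roughly like the following sketch.  Since the real file is
not reproduced here, the sketch builds a small stand-in CSV in memory from a made-up noisy cubic (an assumption
for illustration only, not the secret function).  With the real data you would instead call `pd.read_csv()` on
the path to `assg-03-data.csv`.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the assignment file: a made-up cubic with noise, NOT the secret data.
rng = np.random.default_rng(0)
x_demo = rng.uniform(-1.0, 1.0, size=100)
y_demo = 1.0 + 2.0 * x_demo - 3.0 * x_demo**3 + rng.normal(scale=0.2, size=100)
csv_text = pd.DataFrame({"x": x_demo, "y": y_demo}).to_csv(index=False)

# With the real file, pass the path to assg-03-data.csv to pd.read_csv() instead.
df = pd.read_csv(io.StringIO(csv_text))

# count, mean, std, min and max for each column
print(df.describe())

# range of the feature and of the label
print("x range:", df["x"].min(), "to", df["x"].max())
print("y range:", df["y"].min(), "to", df["y"].max())

# scikit-learn expects a 2-D feature matrix and a 1-D label vector
X = df[["x"]].to_numpy()
y = df["y"].to_numpy()
```

From here a call such as `plt.scatter(df["x"], df["y"])` gives the scatter plot asked for above.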

In [3]:
# load and explore the data.  Use describe and other functions


In [4]:
# minimal data cleaning / preparation is needed, but if you load using pandas you will have to split out the x input column separate from the y output labels.


In [5]:
# visualize the data we loaded using a simple scatter plot

## Part 2: Create an Overfit Model

You have been told the degree of the polynomial function governing the data you just loaded is somewhere
between 3 and 15.  Let's start by overfitting a degree 20 polynomial regression to the data.
In the next cells, use `PolynomialFeatures` and scikit-learn `LinearRegression()` to create a best fit
degree 20 polynomial model of the data.

In [6]:
# overfit with a degree 20 polynomial, display cross validation of fit
# no regularization


In the next cell, use introspection of your fitted model to display the intercept and fitted coefficient parameters.
Also discover the overall $R^2$ score of the fit.  Display them here for future reference.

You should of course get a single intercept parameter, but 20 fitted theta coefficients.  Is there anything
you think you can learn looking at the coefficients for this degree 20 fit?  For example, knowing that
the true degree is less than 20 for the underlying function, do you have any guess at this point of what
degree the polynomial might be that underlies this data?

You should discover you get a pretty good $R^2$ score here, probably above 0.96.  This indicates a good
fit that explains a lot of the variance seen.  But since we have a very high degree model, you should
be worried at this point that at least some of that performance is coming from overfitting to the noise
present in the data instead of the true function that underlies the data.
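
As a rough sketch of the fit and the introspection described above, the following uses stand-in data (a made-up
noisy cubic, not the assignment data).  Passing `include_bias=False` is a choice made here so that the constant
term is handled by the regression intercept and `coef_` holds exactly 20 values; your own setup may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: a made-up cubic with noise, NOT the assignment data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 3 + rng.normal(scale=0.2, size=100)

# include_bias=False leaves the constant term to LinearRegression's intercept,
# so coef_ has one entry per power x^1 ... x^20
poly = PolynomialFeatures(degree=20, include_bias=False)
X_poly = poly.fit_transform(X)

# ordinary least squares with no regularization
model = LinearRegression()
model.fit(X_poly, y)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)

# score() of a regressor returns the R^2 coefficient of determination
r2 = model.score(X_poly, y)
print("R^2:", r2)
```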

In [7]:
# display the intercept and coefficients of the fit as well as the R^2 score here


And finally for this part, visualize the fit from this degree 20 model.  Plot the raw data as a scatter
plot again, and then use the `predict()` function to visualize the predictions made by the degree 20
polynomial.

Any insights from this visualization?  Do you see evidence of the type of extreme overfitting that we
saw in the lectures? Especially around the ends of the data?
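
One way to draw the fitted curve is to call `predict()` on a dense, evenly spaced grid of `x` values rather than
on the raw samples, so the line is smooth and drawn from left to right.  The sketch below assumes stand-in data
(a made-up noisy cubic) and a grid of 300 points; both are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: a made-up cubic with noise, NOT the assignment data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 3 + rng.normal(scale=0.2, size=100)

model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=20, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
model.fit(X, y)

# predict on a dense grid spanning the observed x range
x_grid = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
y_grid = model.predict(x_grid)

# raw data as points, model predictions as a line on top
fig, ax = plt.subplots()
ax.scatter(X[:, 0], y, label="data")
ax.plot(x_grid[:, 0], y_grid, "r-", linewidth=2, label="degree 20 fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```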

In [8]:
# visualize the fit of the degree 20 polynomial here.  Start by plotting the raw data as a scatter plot


# then display the fitted model using the predict() member function



## Part 3: Cross Validation of Degree 20 Model
------------------------------------------

In these next parts of the assignment, we will walk you through applying regularization and using cross validation
on your degree 20 model to try and discover a model that is not overfitting the noise
present in the data set.  This will hopefully lead to better insights on the true nature of the
function that may be generating the data you are analyzing.

First off, for convenience, we recreate the `plot_learning_curves()` function from our textbook for
our use in this part of the assignment.  Recall that this function, if you give it a 
scikit-learn `Pipeline()` model, and the `X` input data and `y` labels, will perform
a series of cross validation trainings using the model and plot the results.  In this case, the
function trains the model with a single input, then 2 inputs, and so on, and displays the
resulting model predictions on the data it trained with, and on the held back validation data.
As discussed in our lectures, these learning curves can help us determine whether a model is overfitting or
underfitting, and what performance we can expect from a properly powerful trained and fitted model.

In [9]:
def plot_learning_curves(model, X, y):
    """Plot learning curves obtained with training the given scikit-learn model
    with progressively larger amounts of the training data X.
    
    Nothing is returned explicitly from this function, but a plot will be created
    and the resulting learning curves displayed on the plot.
    
    Parameters
    ----------
    model - A scikit-learn estimator model to be trained and evaluated.
    X - The input training data
    y - The target labels for training
    """
    # we actually split out 20% of the data solely for validation, we train on the other 80%
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    
    # keep track of history of the training and validation cost / error function
    train_errors, val_errors = [], []
    
    # train on the first 1 to m samples, up to all of the data in the split off training set
    for m in range(1, len(X_train) + 1):
        # fit/train model on the first m samples of the data
        model.fit(X_train[:m], y_train[:m])
        
        # get model predictions
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        
        # record MSE errors for plotting; the square root is taken below to report RMSE
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
        
    # plot the resulting learning curve
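    # x values are the training set sizes, which start at 1 rather than 0
    sizes = np.arange(1, len(train_errors) + 1)
    plt.plot(sizes, np.sqrt(train_errors), 'r-', linewidth=2, label='train')
    plt.plot(sizes, np.sqrt(val_errors), 'b-', linewidth=2, label='val')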
    plt.xlabel('Training set size')
    plt.ylabel('RMSE')
    plt.legend(fontsize=18)

Using the above function, display the learning curve for a degree 20 polynomial.  If you have not done so
already, you will need a scikit-learn `Pipeline` here that first applies a `PolynomialFeatures` transformation
to get a degree 20 set of input features, and then performs linear regression on the resulting
polynomial features.

Pass your pipeline model to the `plot_learning_curves()` function to display the
learning curves for a degree 20 polynomial which, remember, we suspect is overfitting the
data at this point.
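
A pipeline of the kind described above could be sketched as follows.  The step names `poly_features` and
`lin_reg` are arbitrary labels chosen here, and the data is a made-up stand-in.  To keep the sketch
self-contained, it reports training and validation RMSE for one fixed 80/20 split instead of plotting the
full learning curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: a made-up cubic with noise, NOT the assignment data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 3 + rng.normal(scale=0.2, size=100)

# transform to degree 20 polynomial features, then fit a plain linear regression
degree20_model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=20, include_bias=False)),
    ("lin_reg", LinearRegression()),
])

# one fixed 80/20 split, the same proportions plot_learning_curves() uses
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
degree20_model.fit(X_train, y_train)

# a large gap between these two numbers is a symptom of overfitting
train_rmse = np.sqrt(mean_squared_error(y_train, degree20_model.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, degree20_model.predict(X_val)))
print(f"train RMSE: {train_rmse:.3f}  validation RMSE: {val_rmse:.3f}")
```

The same `degree20_model` object can be handed to the learning curve function as
`plot_learning_curves(degree20_model, X, y)`.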

In [10]:
# create a pipeline here if needed for a degree 20 set of PolynomialFeatures that is then
# trained with a standard LinearRegression

# plot the learning curves.  You may need to change the limits of your plot, because if the data is overfitting
# the performance on the validation data may be very bad compared to on the training data.


You can and will get different results when you plot the learning curve here because the train/validation
split is done randomly each time.  So you should probably run the above plot of your learning curves
more than 1 time, to get a feel for what results you can get.

But you should observe 2 important points here:

1. Does it seem like the model is overfitting?  For example, do you often observe a very big gap in
performance between the validation and training RMSE measures, where validation RMSE is very bad compared to
what is seen on the data the model was trained with?
2. You should make a note of what performance is reached on the training data RMSE measure with this overfit
model.  This level of RMSE for training can probably be approached with a properly tuned model that
generalizes well, and can thus get this performance on data it has not seen before.

You might want to display the intercept, coefficients and $R^2$ score of the model you fit again here,
just to confirm they are similar to what you saw the first time.  But if you are using a pipeline, you
may need to access these parameters in a slightly different way now.
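
With a pipeline, the fitted regression lives inside one of the pipeline steps, so its parameters are reached
through `named_steps`.  A minimal sketch, assuming the final step was named `lin_reg` (a label chosen here)
and using made-up stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: a made-up cubic with noise, NOT the assignment data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 3 + rng.normal(scale=0.2, size=100)

model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=20, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
model.fit(X, y)

# the fitted estimator is looked up by the name given to its pipeline step
lin_reg = model.named_steps["lin_reg"]
print("intercept:", lin_reg.intercept_)
print("coefficients:", lin_reg.coef_)

# score() on the pipeline applies the polynomial transform first, so it takes the raw X
print("R^2:", model.score(X, y))
```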

In [11]:
# display the intercept and coefficients of the fit as well as the R^2 score here


## Part 4: Applying Ridge Regularization
---------------------

In this section you will be asked to perform some of the regularization techniques we discussed in our
lectures on linear regression and regularization.  The goal here is to try and get a better idea of what the
true degree of the governing function might be, as well as the values of the coefficients of this function.

Let's start by trying a simple "ridge" regularization.  Recall ridge regression applies $\ell_2$ penalties
(i.e. it adds in the square of each coefficient $\theta_i^2$) which tends to reduce (but not eliminate)
parameters that are not necessary for minimizing the cost function.
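
For reference, the cost that scikit-learn's `Ridge` minimizes can be written as below, to the best of my reading
of its documentation.  Here $m$ is the number of samples, $n$ is the number of polynomial features,
$\hat{y}^{(j)}$ is the model prediction for sample $j$, and the intercept $\theta_0$ is not penalized.  The
symbols are chosen to match this notebook and are not the library's own notation.

$$
J(\theta) = \sum_{j=1}^{m} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \alpha \sum_{i=1}^{n} \theta_i^2
$$

Larger values of $\alpha$ put more weight on keeping the coefficients small relative to fitting the data.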

As before, create a pipeline that creates degree 20 polynomial features.  But apply and fit a ridge
regression for this part of the assignment.  Try exploring `alpha` values to get a good model.
A "good" model here is one that shows you are no longer overfitting.  You can tell you are no longer
overfitting if there is no longer a big gap between training and validation performance.  You should also compare
the validation RMSE here to that achieved on the training data previously.  If your validation RMSE here
is approaching that seen on only the training data before, the fitted model
with regularization is generalizing well.
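
One possible way to explore `alpha` is a simple loop over candidate values, recording training and validation
RMSE for each.  The sketch below uses made-up stand-in data, and the list of `alpha` values is only an
illustrative starting grid, not a recommendation for the assignment data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: a made-up cubic with noise, NOT the assignment data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 3 + rng.normal(scale=0.2, size=100)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = []
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=20, include_bias=False)),
        ("ridge", Ridge(alpha=alpha)),
    ])
    model.fit(X_train, y_train)

    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

    # overall size of the coefficient vector, which the l2 penalty shrinks
    coef_norm = np.linalg.norm(model.named_steps["ridge"].coef_)

    results.append((alpha, train_rmse, val_rmse, coef_norm))
    print(f"alpha={alpha:<6} train={train_rmse:.3f} val={val_rmse:.3f} |coef|={coef_norm:.3f}")
```

A candidate `alpha` that looks promising in this table can then be checked over several random splits with
`plot_learning_curves()`.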

In [12]:
# apply l2 "ridge" regularization to a degree 20 set of polynomial features

# plot the learning curves you observe for a good example of a ridge regularization fit


When you think you have a relatively good alpha parameter for your ridge regression, display the
intercept, coefficients and $R^2$ score of your fitted model here for future reference.  Compare the
coefficients here to the previous overfitted model.  Do you have any insights on the degree of the
underlying polynomial from looking at your coefficients here?

In [13]:
# display intercept, coefficients and R^2 fit score here


## Part 5: Applying Lasso Regularization
-----------------------

In this part of the assignment we will next apply lasso regression. Recall that lasso regression
applies an $\ell_1$ norm penalty, which in practical terms means we use the absolute value of
each coefficient in our regularization penalty term.
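
For reference, the cost that scikit-learn's `Lasso` minimizes can be written as below, to the best of my reading
of its documentation.  Here $m$ is the number of samples, $n$ is the number of polynomial features,
$\hat{y}^{(j)}$ is the model prediction for sample $j$, and the intercept $\theta_0$ is not penalized.  The
symbols are chosen to match this notebook and are not the library's own notation.

$$
J(\theta) = \frac{1}{2m} \sum_{j=1}^{m} \left( y^{(j)} - \hat{y}^{(j)} \right)^2 + \alpha \sum_{i=1}^{n} \left| \theta_i \right|
$$

Because the penalty grows linearly in $|\theta_i|$, coefficients that contribute little can be driven exactly
to zero, which is what makes lasso useful for guessing which terms are really present.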

As before, create a pipeline that creates degree 20 polynomial features.  But apply and fit a lasso
regression for this part of the assignment.  You should explore `alpha` values around 0.001 to
0.05 or so.  You may get warnings with this fit; try setting the tolerance parameter `tol` to
0.1 here, and maybe increasing `max_iter` as well, though it is not necessary to completely eliminate
warnings to still get a relatively good fit here.  And actually you may want to explore higher values
of `alpha`, as the higher you go, the more pressure on the fitted model to eliminate terms.  This may give
you better insights into the true degree of the underlying polynomial function used to generate the
data here.

Once you have a good example, use your best fitted model to display the learning curves using 
the lasso regularization.

Try to determine an `alpha` parameter that you think is working well here for the regularization.  You are
doing "well" here if your learning curves indicate you are not overfitting, and your
RMSE performance on the validation data is approaching the training RMSE achieved by the previous
overfit model.  You are also doing well if you have some idea of the cutoff that looks likely for
the true degree of the underlying function you are attempting to model.
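
A lasso fit along these lines might be sketched as follows, using the `tol` and `alpha` range suggested above.
The data is a made-up stand-in, `max_iter=10000` is an arbitrary choice made here, and convergence warnings may
still appear.  The sketch lists which powers of `x` keep a nonzero coefficient for each `alpha`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: a made-up cubic with noise, NOT the assignment data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 3 + rng.normal(scale=0.2, size=100)

surviving_powers = {}
for alpha in [0.001, 0.01, 0.05]:
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=20, include_bias=False)),
        ("lasso", Lasso(alpha=alpha, tol=0.1, max_iter=10000)),
    ])
    model.fit(X, y)

    coef = model.named_steps["lasso"].coef_

    # with include_bias=False and one input feature, coef[i] multiplies x^(i+1)
    powers = np.flatnonzero(coef) + 1
    surviving_powers[alpha] = powers

    print(f"alpha={alpha}: R^2={model.score(X, y):.3f}, nonzero powers of x: {powers}")
```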

In [14]:
# apply l1 "lasso" regularization, to try and make unused terms drop out

# plot the learning curves you observe for a good example of a lasso regularization fit


When you think you have found relatively good `alpha` and other parameters, display the intercept, coefficients
and $R^2$ scores again for this fitted model using lasso regularization.  Any observations about
the coefficients now?  Compare them to the degree 20 model with no regularization.  You may be able to 
start getting some ideas on the true degree of the polynomial from the results here.

In [15]:
# display intercept, coefficients and R^2 fit score here


## Part 6: Give me Your Best Model Estimate

Taking what you have observed from the previous parts 3-5 of the assignment, try to give your best guess/estimate
for the true degree of the underlying polynomial.  Your lasso regularization results might be most useful
for this determination.  Try finding values of `alpha` that are obviously too big (maybe by watching your
$R^2$ score), and then reduce this a bit to get an estimate on the upper bound of the number of terms in
the true polynomial.  Recall that because of noise you won't be able to get a perfect answer here.  Also it
helps to know, especially for even powers of the polynomial, that coefficients here are often only
reflecting lower even power effects.  So for example, for a true function with only an $x^2$ term, you might
still get $x^4$ and $x^8$ coefficients using lasso regression, even with relatively high values of `alpha`.

In any case, choose your best estimate of the "true" degree of the underlying polynomial function.  Then train
a final linear regression with no or maybe slight $\ell_2$ regularization to try and get a best estimate
of the true model coefficients.  You should do a little bit of testing again using the learning curves
and checking your $R^2$ fit score to determine that the model appears to be able to fit as well as
your degree 20 models with regularization.  But then train a final model using all of the data, and report
the intercept, coefficients and $R^2$ fit you achieve with your best estimated model for this assignment.
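
The final comparison could be sketched as follows.  The value `best_degree = 3` is only a placeholder that
matches the made-up stand-in data used here; it says nothing about the assignment data, and the ridge
`alpha=0.01` is likewise an arbitrary example of "slight" regularization.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: a made-up cubic with noise, NOT the assignment data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 3 + rng.normal(scale=0.2, size=100)

# placeholder for illustration only; substitute your own estimate
best_degree = 3

candidates = {
    "no regularization": LinearRegression(),
    "slight ridge": Ridge(alpha=0.01),
}

fitted = {}
for name, regressor in candidates.items():
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=best_degree, include_bias=False)),
        ("regressor", regressor),
    ])
    # final fit uses all of the data, with no validation split held back
    model.fit(X, y)

    reg = model.named_steps["regressor"]
    fitted[name] = (reg.intercept_, reg.coef_, model.score(X, y))
    print(name)
    print("  intercept:", reg.intercept_)
    print("  coefficients:", reg.coef_)
    print("  R^2:", fitted[name][2])
```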

In [16]:
# estimate the polynomial degree and create your best model.  You might want to try first with no regularization,
# and then maybe with a little bit of l-2 (ridge) regularization and compare.

# you should confirm that your model does not overfit and performs as well as your degree 20 models with
# regularization


In [17]:
# display intercept, coefficients and R^2 fit score here


Once you settle on your best model, do a final training of it with all of the data.  Display
your intercept, coefficients and $R^2$ fit score for this best model.

Then, visualize the fit of your best model.  Once again scatter plot the raw data.  Then using the
`predict()` function from your scikit-learn model, show the predicted regression as a line on top of the
raw data points.

In [18]:
# perform one final fit of our best performing estimated model on all the data

# fit on all the data

In [19]:
# display intercept, coefficients and R^2 fit score here


In [20]:
# visualize the fit, start with the raw data points we are fitting

# display the best model fit
