### Today's exercise
Gather in the same group as last week, and please go through the following steps:
1. Look back at last week's notebook. If you did not apply any transformation to your inputs because you ran out of time, spend some time thinking about whether it would make sense to do so. You can find relevant transformations in `scikit-learn`: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing. You will probably be most interested in `StandardScaler` and `MinMaxScaler`.
2. Look at the performance of the models you fitted last week: which model is best? Do you see any evidence of overfitting?
3. Fit your maximal models with `Lasso` (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) and `Ridge` (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) estimators instead of `LinearRegression`. Make sure you read the documentation to understand what these do. Fit several models with different values of `alpha` and store the outputs.
4. Plot the performance of your models against your linear and KNN models from last week. Does the performance of the model on the validation set improve with regularization?
5. For both `Lasso` and `Ridge` models, plot the value of the coefficients as a function of `alpha`. You can access the coefficients of a fitted `model` through `model.coef_`. What do you notice about how Lasso and Ridge behave differently? (Look at `example.ipynb` for inspiration.)
6. Finally, if any models are doing better than the linear model without regularization, select the best `Ridge` and the best `Lasso` model, and plot their coefficients alongside the coefficients from the simple linear model. How do the estimates change with regularization? Which values have changed the most? Do you have any hypotheses as to why?
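Steps 3 and 5 can be sketched as a loop over `alpha` values that stores each model's coefficients and validation RMSE. This is only an illustration under stated assumptions: the data are synthetic (generated with `make_regression`, whose features are already roughly standardized), the `alpha` grid is arbitrary, and the names `coef_paths` and `val_rmse` are made up for this example. With the bike data you would use your own (scaled) `X_train`, `y_train`, `X_val` and `y_val` instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: swap in your own splits)
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Arbitrary grid of regularization strengths, evenly spaced on a log scale
alphas = np.logspace(-3, 2, 6)

# One factory per estimator so each alpha gets a fresh model
factories = {
    "Lasso": lambda a: Lasso(alpha=a, max_iter=10_000),
    "Ridge": lambda a: Ridge(alpha=a),
}

coef_paths = {}  # name -> array of shape (n_alphas, n_features)
val_rmse = {}    # name -> list of validation RMSE values, one per alpha

for name, make_model in factories.items():
    coefs, rmses = [], []
    for alpha in alphas:
        model = make_model(alpha).fit(X_tr, y_tr)
        coefs.append(model.coef_.copy())
        rmses.append(float(np.sqrt(mean_squared_error(y_va, model.predict(X_va)))))
    coef_paths[name] = np.array(coefs)
    val_rmse[name] = rmses

# Count coefficients that are exactly zero at each alpha
n_zero = {name: (paths == 0).sum(axis=1) for name, paths in coef_paths.items()}
print(n_zero)
```

To draw the coefficient paths, `plt.plot(alphas, coef_paths["Lasso"])` followed by `plt.xscale("log")` gives one line per coefficient. Comparing the zero counts for the two estimators is a quick way to see the difference the exercise asks about.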

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

In [2]:
df_train = pd.read_csv("/work/SilleHasselbalchMarkussen#4503/DataSci-AU-24/nbs/group_RMDS/data/bikes_train.csv")
df_val = pd.read_csv("/work/SilleHasselbalchMarkussen#4503/DataSci-AU-24/nbs/group_RMDS/data/bikes_validation.csv")
df_test = pd.read_csv("/work/SilleHasselbalchMarkussen#4503/DataSci-AU-24/nbs/group_RMDS/data/bikes_test.csv")

In [12]:
len(df_test)

2606

#### Dividing into X and y

In [3]:
# train
y_train = df_train["propoertion_cas_reg"].values
X_train = df_train.drop(["propoertion_cas_reg", "registered", "casual", "cnt", "dteday"], axis=1)  # axis=1 drops columns

# test
y_test = df_test["propoertion_cas_reg"].values
X_test = df_test.drop(["propoertion_cas_reg", "registered", "casual", "cnt", "dteday"], axis=1)  # axis=1 drops columns

# validation
y_val = df_val["propoertion_cas_reg"].values
X_val = df_val.drop(["propoertion_cas_reg", "registered", "casual", "cnt", "dteday"], axis=1)  # axis=1 drops columns
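
If you decide to scale the predictors (step 1), one common pattern is to fit the scaler on the training split only and reuse it for the other splits, so that no information from validation or test leaks into preprocessing. The sketch below uses toy frames with hypothetical column names and values; `scale` is a helper invented for this example and is not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for X_train / X_val (hypothetical columns and values)
X_train_toy = pd.DataFrame({"temp": [0.2, 0.4, 0.6, 0.8],
                            "hum": [30.0, 50.0, 70.0, 90.0]})
X_val_toy = pd.DataFrame({"temp": [0.5, 0.9], "hum": [40.0, 100.0]})

# Learn means and standard deviations from the training split only
scaler = StandardScaler().fit(X_train_toy)

def scale(frame):
    """Apply the fitted scaler and keep column names and index."""
    return pd.DataFrame(scaler.transform(frame),
                        columns=frame.columns, index=frame.index)

X_train_scaled = scale(X_train_toy)
X_val_scaled = scale(X_val_toy)  # uses the training statistics
print(X_val_scaled)
```

Keeping the result as a `DataFrame` preserves the column names, which is convenient later when labelling coefficient plots.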

#### Reevaluating baseline models

In [4]:
# loading performances from last week
perf_df = pd.read_csv("/work/SilleHasselbalchMarkussen#4503/DataSci-AU-24/nbs/group_RMDS/log/performances.csv", index_col=0)
performances = perf_df.to_dict(orient='records')

In [None]:
# Function for evaluating models

def evaluate(model, X, y, nsplit, model_name, constant_value=None):
    ''' Evaluates the performance of a model and appends the result
    to the global `performances` list.
    Args:
        model (sklearn.Estimator): fitted sklearn estimator
        X (np.array): predictors
        y (np.array): true outcome
        nsplit (str): name of the split
        model_name (str): string id of the model
        constant_value (float or None): relevant if the model predicts a constant
    '''
    if constant_value is not None:
        # baseline that predicts the same value for every observation
        preds = np.full(y.shape[0], constant_value)
    else:
        preds = model.predict(X)
    r2 = r2_score(y, preds)
    rmse = np.sqrt(mean_squared_error(y, preds))
    # built-in round() on plain floats works whether the metric
    # returns a Python float or a NumPy scalar
    performances.append({'model': model_name,
                         'split': nsplit,
                         'rmse': round(float(rmse), 4),
                         'r2': round(float(r2), 4)})
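
As a usage illustration, the sketch below evaluates a constant baseline and a linear model on toy data. To stay self-contained it uses `evaluate_row`, a hypothetical variant of `evaluate` that returns the result row instead of appending to the global `performances` list. The toy arrays and the name `results` are likewise made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_row(model, X, y, nsplit, model_name, constant_value=None):
    """Hypothetical variant of evaluate() that returns the result row."""
    if constant_value is not None:
        preds = np.full(y.shape[0], constant_value)
    else:
        preds = model.predict(X)
    return {"model": model_name,
            "split": nsplit,
            "rmse": round(float(np.sqrt(mean_squared_error(y, preds))), 4),
            "r2": round(float(r2_score(y, preds)), 4)}

# Toy data with an exact linear relationship: y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

results = []
# Baseline: always predict the mean of y (no model needed)
results.append(evaluate_row(None, X_toy, y_toy, "train", "dummy",
                            constant_value=y_toy.mean()))
# Linear model fitted on the same toy data
linear = LinearRegression().fit(X_toy, y_toy)
results.append(evaluate_row(linear, X_toy, y_toy, "train", "linear"))
print(results)
```

Predicting the mean gives an R² of 0 by construction, which makes it a useful reference point when comparing the regularized models.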