# Problem Session: Model Evaluation

Import `simulated_subject_data.csv` as `df`.

What are the unique values of the `subject_id` field?

These data will be used in a predictive regression model.  We should approach model validation differently depending on our goals:

1. Case 1:  We do not need to generalize to new `subject_id`s in the future, so we can treat `subject_id` as a categorical feature.  For instance, if each `subject_id` corresponds to a vendor we do business with, and we engage with a new vendor only very rarely, then it would be appropriate to predict a new row given knowledge of our previous history with the vendor. In this case a regular `train_test_split` and `KFold` would be appropriate.
2. Case 2:  We do need to generalize to new `subject_id`s in the future, so we cannot treat `subject_id` as a feature.  For instance, if `subject_id` corresponds to a subject in a medical trial, and each row corresponds to a biophysical measurement under different conditions, then we want to be able to predict what a *new* subject will do under novel conditions.  We should assess our models using data splits that randomly assign whole subjects to either the training set or the holdout set, so that no subject appears in both.

For this problem set we will do *both* and compare the resulting estimates of model performance.  This will illustrate the danger of using the Case 1 strategy when you should really be using the Case 2 strategy.

In [None]:
# Make a train/test split for case 1.  Use random state 216, and test size 0.2
# Use the names X_train_1, X_test_1, y_train_1, y_test_1


In [None]:
# Make a train/test split for case 2. Put 20 randomly chosen subject_ids in test, the rest in train.
# Use the names X_train_2, X_test_2, y_train_2, y_test_2
# train_test_split cannot split by group, so write this code by hand.
# (sklearn's GroupShuffleSplit is a built-in alternative if you want to check your work.)
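
The two splitting strategies can be sketched side by side as follows. This is only an illustration under stated assumptions: the data frame is a randomly generated stand-in for `simulated_subject_data.csv`, and the column names `x1`, `x2`, `x3`, and `y` are guesses, since the file's actual columns are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 subjects with 10 rows each.
# Column names x1, x2, x3, y are assumptions, not the real file's schema.
rng = np.random.default_rng(216)
n_subjects, n_rows = 100, 10
n = n_subjects * n_rows
df = pd.DataFrame({
    "subject_id": np.repeat(np.arange(n_subjects), n_rows),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = df["x1"] - 2 * df["x2"] + df["x3"] ** 2 + rng.normal(size=n)

features = ["x1", "x2", "x3"]

# Case 1: row-wise split; subject_id stays in X as a categorical feature.
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    df[features + ["subject_id"]], df["y"], test_size=0.2, random_state=216
)

# Case 2: subject-wise split; 20 randomly chosen subjects go to the test set
# and subject_id is dropped from the feature matrix.
subject_ids = df["subject_id"].unique()
test_ids = rng.choice(subject_ids, size=20, replace=False)
is_test = df["subject_id"].isin(test_ids)

X_train_2, y_train_2 = df.loc[~is_test, features], df.loc[~is_test, "y"]
X_test_2, y_test_2 = df.loc[is_test, features], df.loc[is_test, "y"]
groups_train = df.loc[~is_test, "subject_id"]  # kept aside for grouped CV later

# Count the subjects that appear on both sides of each split.
shared_1 = set(X_train_1["subject_id"]) & set(X_test_1["subject_id"])
shared_2 = set(groups_train) & set(test_ids)
print("subjects shared between train and test, case 1:", len(shared_1))
print("subjects shared between train and test, case 2:", len(shared_2))
```

The overlap counts are the point of the comparison: a row-wise split scatters each subject's rows across both sets, while the subject-wise split keeps every subject entirely on one side.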

Let's do some EDA using `X_train_2` and `y_train_2`.  Make some graphs of the target against each feature.  What do you notice?

Let's first get a little practice with residual plots.  Fit a linear regression model to `(X_train_2, y_train_2)` using only the features $x_1, x_2, x_3$. Graph the residuals against each feature and also against the predicted values.  What do you see?


The residual plots clearly show "un-modeled signal" in the form of a quadratic dependence on $x_3$.
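
A minimal sketch of the residual plots described above. The training data here is a hypothetical stand-in, deliberately generated with a quadratic $x_3$ term so that the pattern is visible; the column names `x1`, `x2`, `x3` are assumptions, and in the notebook you would use your own `X_train_2` and `y_train_2` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for (X_train_2, y_train_2), built with a quadratic x3 term.
rng = np.random.default_rng(0)
n = 500
X_train_2 = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y_train_2 = (
    X_train_2["x1"]
    - 2 * X_train_2["x2"]
    + X_train_2["x3"] ** 2
    + rng.normal(scale=0.5, size=n)
)

# Fit the purely linear model and compute its training residuals.
model = LinearRegression().fit(X_train_2, y_train_2)
y_pred = model.predict(X_train_2)
residuals = y_train_2 - y_pred

# One panel per feature, plus one panel against the predicted values.
fig, axes = plt.subplots(1, 4, figsize=(16, 3.5), sharey=True)
for ax, col in zip(axes[:3], X_train_2.columns):
    ax.scatter(X_train_2[col], residuals, s=8, alpha=0.5)
    ax.axhline(0, color="k", linewidth=1)
    ax.set_xlabel(col)
axes[3].scatter(y_pred, residuals, s=8, alpha=0.5)
axes[3].axhline(0, color="k", linewidth=1)
axes[3].set_xlabel("predicted y")
axes[0].set_ylabel("residual")
fig.tight_layout()
```

A well-specified model leaves residuals that look like structureless noise around zero in every panel. A curved band in one panel, as in the `x3` panel for this stand-in data, suggests a missing nonlinear term in that feature.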

We will now compare 4 different models, using the splitting strategy from Case 2:

1. A baseline model which predicts the mean of the training targets.
2. The linear regression model with features $x_1, x_2, x_3$.
3. The linear regression model with features $x_1, x_2, x_3, x_3^2$.
4. A random forest regression model with features $x_1, x_2, x_3$.

As a hint, I have included the `sklearn` imports you will need.  Read the docs for any you are unfamiliar with!

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import root_mean_squared_error as rmse

models = {
    "dummy_mean": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "linear_x3_sq": Pipeline(
        # your code here
    ),
    "random_forest": RandomForestRegressor(
        n_estimators=200, random_state=42, n_jobs=-1
    ),
}

gkf = GroupKFold(n_splits=5)

rows = []
for name, model in models.items():
    # Use cross_val_predict and gkf to generate out-of-fold predictions for each model.
    # Record mean out-of-fold RMSE.
    # Also train each model on the entire training set and record in-sample RMSE.
    # Compare in-sample to out-of-sample performance to assess overfitting.


Finally, let's see what kind of performance we would get from the quadratic model if we followed the Case 1 strategy instead: `train_test_split`, row-wise `KFold`, and a one-hot encoded `subject_id`.

In [None]:
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

# Column transformer: x1,x2,x3 as-is, add x3^2, one-hot encode subject_id
ct = ColumnTransformer([
    # your code here
])

pipe = Pipeline([
    ("features", ct),
    ("linreg", LinearRegression())
])

# In-sample RMSE


# Out-of-sample RMSE (row-wise KFold, no grouping)


print("In-sample RMSE:", in_rmse)
print("Out-of-sample RMSE (row-wise KFold):", out_rmse)


We can see that if we were actually in Case 2, but mistakenly followed the validation strategy from Case 1, we could convince ourselves that the model performs **far better** than it actually will when applied to new subjects.