# Multiple Linear Regression, Part 2: 

## Model Validation

Today we'll focus on how to validate our models.

### Set Up Our Data Again

In [None]:
# Basic imports

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)

import matplotlib.pyplot as plt
import seaborn as sns

Credit data from https://www.kaggle.com/avikpaul4u/credit-card-balance

Target: `Balance`

In [None]:
# Data
df = pd.read_csv('data/Credit.csv', 
                 usecols=['Income', 'Limit', 'Rating', 'Age', 'Balance'])

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
sns.pairplot(df)

## Modeling Practice

Last time, we left off after identifying some issues in our initial multiple linear regression model. Let's build that model back - now, with sklearn! - and then discuss one change we can implement and see if it improves our model.

In [None]:
# Imports
from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [None]:
# Define X and y

X = None
y = None

In [None]:
# Let's be sure to scale our X variables


In [None]:
# Fit our model!


In [None]:
# Grab our predictions and evaluate
y_preds = None

print(f"R2 Score: {r2_score(y, y_preds):.4f}")
print(f"MAE: {mean_absolute_error(y, y_preds):.4f}")
# Take the square root ourselves: the `squared=False` argument is deprecated in newer sklearn releases
print(f"RMSE: {np.sqrt(mean_squared_error(y, y_preds)):.4f}")

In [None]:
# Visualize our residuals

#### What issues are there with this model?

- 


#### Now, make a change!

- 


## Model Validation - AKA How to Build Generalizable Models

![validation gif from giphy](https://media.giphy.com/media/242wLqQerWkxd6GgHB/giphy.gif)

Our premise: Let's say you have a dataframe, with some number of rows of data, and that's all you have available to you. The hope is that you can train a model on this data that can then be used to make predictions about new data that comes in. You want your model to _generalize_ well and work on this incoming data - not so complex that it has memorized the details and noise of the training data, but also not so simple that the model is useless. How do we do that?

First, let's go into detail about this trade-off between simplicity and complexity:

### The Bias-Variance Trade Off

<img alt="original image from https://rmartinshort.jimdofree.com/2019/02/17/overfitting-bias-variance-and-leaning-curves/" src="images/underfit-goodfit-overfit.png" width=750, height=350>  

Remember - by modeling, we're assuming that there is some relationship between our X variables (the features in our dataset) and our y variable (the target). Thus, there is some underlying '_true_' function that captures the relationship between X and y, which we are trying to find by modeling. Of course, the actual relationship may be quite complex and not wholly represented in our data - our approximation, aka the model we create, is likely only a simplified estimator of whatever our '_true_' function actually would look like.

**Bias**: Error introduced by approximating a real-life problem (which may be extremely complicated) by a much simpler model (because the model is too simple to capture the underlying pattern)

**Variance**: Amount by which our model would change if we estimated it using a different training dataset (because the model is over-learning from the training data)

**Representation:**

<img alt="from https://hsto.org/files/281/108/1e9/2811081e9eda44d08f350be5a9deb564.png" src="images/bias-variance.png" width=350, height=350>
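To make those two definitions concrete, here's a rough sketch on made-up data (not our credit data). The sine-shaped "true" function, the noise level, and the polynomial degrees are all assumptions chosen for illustration. We refit each model on many freshly sampled training sets, then measure how far the *average* prediction sits from the truth (bias) and how much the predictions move from one training set to the next (variance).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed "true" relationship, chosen only for illustration
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)      # same x values every time; only the noise changes
x_eval = np.linspace(0.05, 0.95, 50)  # points where we compare predictions

def fit_many(degree, n_datasets=200, noise=0.3):
    """Refit a polynomial of the given degree on many noisy training sets."""
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        y_train = true_fn(x_train) + rng.normal(0, noise, x_train.size)
        poly = Polynomial.fit(x_train, y_train, degree)
        preds[i] = poly(x_eval)
    return preds

results = {}
for degree in (1, 3, 9):
    preds = fit_many(degree)
    # Bias^2: squared gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_eval)) ** 2)
    # Variance: how much predictions move across training sets
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The straight line (degree 1) should show the largest bias, while the degree 9 polynomial should show the largest variance - that's the trade-off in miniature.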

## How To Minimize Bias and Variance

Good news! There are tried and true methods for reducing both bias and variance in our modeling process. Testing different models, trying models on different slices of data, transforming or engineering features - all of these things have a role to play in creating better, more robust models.

In particular, we've learned so far that we can evaluate the performance of our models, using a scoring metric, which will help us catch if a model is underfit - if it's performing quite poorly, it probably isn't capturing the relationship in our data! 

But what about overfitting?

<img alt="I Love Lucy shrug gif from Giphy" src="https://media.giphy.com/media/JRhS6WoswF8FxE0g2R/giphy.gif" width=350, height=350>

### Train-Test Split

The idea: don't train your model on ALL of your data, but keep some of it in reserve to test on, in order to simulate how it will work on new/incoming data.

#### Example:

<img alt="original image from https://www.dataquest.io/wp-content/uploads/kaggle_train_test_split.svg plus some added commentary" src="images/traintestsplit_80-20.png" width=850, height=150>  

Note - here, it looks like we're just taking the tail end of the dataset and setting it aside. In practice (most of the time), the split will randomly choose which rows are in the train vs. test sets.

How does this fight against overfitting? By withholding data from the training process, we are testing whether the model actually _generalizes_ well. If it scores well on the train set but poorly on the test set, it's a good sign that our model learned too much noise from the train set and is overfit!
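Here's a small sketch of the idea on synthetic data, so we don't spoil the practice below. The `demo_` names, the made-up features, and the 80/20 split are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Made-up data: 100 rows, two features, one target
demo_X = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
demo_y = pd.Series(3 * demo_X["feature_a"] + rng.normal(size=100), name="target")

# Hold out 20% of the rows; random_state makes the shuffle reproducible
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)

print(demo_X_train.shape, demo_X_test.shape)

# The rows are shuffled before splitting, so the index is no longer in order
print(demo_X_train.index[:5].tolist())

# No row lands in both sets
overlap = set(demo_X_train.index) & set(demo_X_test.index)
print(f"Rows in both train and test: {len(overlap)}")
```

Notice that X and y are split together, so each row of features stays matched with its target.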

![arrested development gif, found by Andy](https://heavy.com/wp-content/uploads/2013/05/tumblr_mjm9fqhrle1rvnnvyo6_250.gif)

#### Practice:

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
print(X.shape)
X.head()

In [None]:
print(y.shape)
y.head()

In [None]:
# Train test split here!
# Set test_size = .33
# Set random_state = 42



What did that do?

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Add the row counts: `X_train + X_test` would add the dataframes element-wise instead
len(X_train) + len(X_test) == len(X)

In [None]:
X_train.head()

Now let's put our train/test split into practice:

In [None]:
# Instantiate a new scaler to scale our data
# Let's use Standard Scaler here


In [None]:
# Fit our scaler - ON THE TRAINING DATA!!
# Then transform both train and test 


Quick aside: why is it so important to fit the scaler on the train set instead of the full set of X variables? Let's discuss what exactly these scalers are doing under the hood!

- 


**Rule of thumb:** if something is impacted by other rows in the dataset, it should _**only**_ learn from the training set!
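To see that rule in action, here's a minimal sketch on made-up data (the `demo_` names and the numbers are assumptions for illustration). `StandardScaler` learns a mean and standard deviation when it is fit, then reuses those same numbers every time it transforms.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# One made-up feature centered around 50
demo_X = rng.normal(loc=50, scale=10, size=(200, 1))
demo_train, demo_test = train_test_split(demo_X, test_size=0.33, random_state=42)

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean/std from train only
demo_test_scaled = demo_scaler.transform(demo_test)        # reuses the train mean/std

# What the scaler learned - these come from the training rows alone
print(f"Learned mean: {demo_scaler.mean_[0]:.3f}, learned std: {demo_scaler.scale_[0]:.3f}")

# Train is exactly mean 0 / std 1; test is only approximately so
print(f"Train: mean {demo_train_scaled.mean():.3f}, std {demo_train_scaled.std():.3f}")
print(f"Test:  mean {demo_test_scaled.mean():.3f}, std {demo_test_scaled.std():.3f}")
```

If we had fit the scaler on all of `demo_X`, the learned mean and standard deviation would include information from the test rows - a small leak, but a leak all the same.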

In [None]:
# Instantiate an sklearn linear model


In [None]:
# Fit your model - ON THE TRAINING DATA!!


In [None]:
# Grab predictions for train and test set


In [None]:
# How'd we do?

print(f"Train R2 Score: {r2_score(y_train, y_pred_train)}")
print(f"Test R2 Score: {r2_score(y_test, y_pred_test)}")

Evaluate!

- 


In [None]:
# Single variable example

X_single = df['Income']
y = df['Balance']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_single, y, test_size=0.33, random_state=42)

In [None]:
scaler = StandardScaler()
X_s_train = scaler.fit_transform(X_train.values.reshape(-1, 1))
X_s_test = scaler.transform(X_test.values.reshape(-1, 1))

lr = LinearRegression()
lr.fit(X_s_train, y_train)
lr.score(X_s_test, y_test)

### But Wait... There's More!

Let's change something and see what happens:

In [None]:
for n in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=0.33, 
                                                        random_state=n) # <--
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    lr = LinearRegression()
    lr.fit(X_train_scaled, y_train)
    
    y_pred_train = lr.predict(X_train_scaled)
    y_pred_test = lr.predict(X_test_scaled)
    
    print(f"Random Seed: {n}")
    print(f"Train R2 Score: {r2_score(y_train, y_pred_train)}")
    print(f"Test R2 Score: {r2_score(y_test, y_pred_test)}")
    print("-----")

What's happening here? All we're doing is changing our `random_state` - why is that having such an impact on our model's scores? Some models appear overfit, some don't - and for some, the test score is **better** than our train score!

### K-Fold Cross-Validation

Sometimes, random chance means your training data isn't representative, or happens to include wacky data like all of our outliers. So, why do just one train-test split when you can do `k` of them?

![cross validation image from kaggle: https://www.kaggle.com/alexisbcook/cross-validation](images/cross-validation.png)

The good news is, we'll never actually have to do this by hand - `sklearn` will handle it for us!

Documentation: https://scikit-learn.org/stable/modules/cross_validation.html
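Just to demystify what's going on, here's roughly what doing it by hand could look like, using sklearn's `KFold` to generate the splits. This is a sketch on made-up data, not the exact internals of any sklearn helper: the `demo_` and `fold_` names, the synthetic coefficients, and the choice to shuffle are all assumptions. Note that the scaler is refit inside each fold, so the held-out rows never influence it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Made-up data: 150 rows, 3 features, a linear signal plus noise
demo_X = rng.normal(size=(150, 3))
demo_y = demo_X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=150)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, val_idx in kf.split(demo_X):
    # Fit the scaler on this fold's training rows only
    fold_scaler = StandardScaler()
    X_tr = fold_scaler.fit_transform(demo_X[train_idx])
    X_val = fold_scaler.transform(demo_X[val_idx])

    # Fit on the training rows, score on the held-out rows
    fold_model = LinearRegression().fit(X_tr, demo_y[train_idx])
    fold_scores.append(r2_score(demo_y[val_idx], fold_model.predict(X_val)))

fold_scores = np.array(fold_scores)
print(f"R2 per fold: {np.round(fold_scores, 3)}")
print(f"Mean: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Every row gets used for validation exactly once and for training in the other four folds, which is why the averaged score is less at the mercy of one lucky (or unlucky) split.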

In [None]:
# Scale our data

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)
# Note - in practice, better to scale within the cross validate...
# But we're saving how to do that with pipelines til later

In [None]:
# Instantiate a fresh linear regression model
lr = LinearRegression()

In [None]:
# Let's use cross_val_score
# Set cv = 5
from sklearn.model_selection import cross_val_score

scores = None

In [None]:
# Look at the test scores across our folds
scores

In [None]:
# Print it nicely
print(f"Scores: {scores.mean():.3f} +/- {scores.std():.3f}")

Why show the standard deviation of scores here? I want some measure of the spread among my scores, so I can tell how much they changed across the different folds of the data.

If I made a change to my model and the average of my cross-validated scores stayed about the same, but the spread among those scores decreased, that's a sign of a more stable, more generalizable model than before!

### Additional Resources:

- [Great bias/variance infographic](https://elitedatascience.com/bias-variance-tradeoff) from Elite Data Science
- Taking a more statistical approach? [Probabilistic Model Selection with AIC, BIC, and MDL](https://machinelearningmastery.com/probabilistic-model-selection-measures/) 