<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Data Leakage
              
</p>
</div>

DS-NTL-010824
<p>Phase 3</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>

## Objectives
- Explain data leakage
- **Explain** the bias-variance tradeoff and the correlative notions of underfit and overfit models
- Explain the notion of "validation data"

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_validate, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error

#### Data Leakage
- When information that is not generally available or used at prediction time contaminates the model training process.

<center><img src = "Images/despair.jpg" width = 350 /></center>
<center> The despair of realizing you have a data leak.</center>

Leads to overconfident estimates of model performance during the validation and testing phases.

- Bad performance after model deployment in production.


<center><img src = "Images/despair.jpg" width = 350 /></center>
<center> My overlords are upset.</center>

<center><img src = "Images/despair.jpg" width = 350 /></center>
<center> I have spent all night trying to find the data leak.</center>

<center><img src = "Images/despair.jpg" width = 350 /></center>
<center> What is the nature of my leak?</center>

<center><img src = "Images/despair.jpg" width = 350 /></center>
<center> Where is it??</center>

Diagnosing a data leak can be subtle business:
- Understanding different types of leak
- Where in the process they can be accidentally introduced.

#### Training leakage

Fitting and applying transformations **before** train-test(hold out) splitting.

<img src = "Images/training_testing.png" width = 500 />

- Why is this bad?

Applying transformation to training set:
- contains information from test set (contained in fitted attributes of Transformer)

- Contaminated training set.
- Model has inadvertently trained on information you are trying to predict.
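A minimal sketch of the safe ordering, assuming a synthetic feature matrix (`X_demo` and the other names here are made up for illustration): split first, fit the transformer on the training rows only, then apply the fitted transformer to both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, assumed purely for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))

# 1. Split first
X_demo_train, X_demo_test = train_test_split(X_demo, random_state=42)

# 2. Fit the transformer on the training rows only
scaler = StandardScaler().fit(X_demo_train)

# 3. Apply the already-fitted transformer to both sets
X_demo_train_scaled = scaler.transform(X_demo_train)
X_demo_test_scaled = scaler.transform(X_demo_test)

# The fitted means come from the training rows alone
print(np.allclose(scaler.mean_, X_demo_train.mean(axis=0)))
```

The test rows are scaled with the training means and standard deviations, so nothing about the test set ends up in the fitted attributes.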

#### You can introduce it by accident on really innocuous steps.

In [None]:
import pandas as pd
titanic_df = pd.read_csv('Data/titanic.csv').drop(columns = ['Cabin'])
titanic_df.info()

Let's impute those NaNs in age with the mean.

In [None]:
# impute with the mean
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df.info()

Bam. You just introduced data leakage.
- imputed with statistics of entire dataset before train-test split.
- Statistics of train in test and vice-versa.
- Whether impact is significant will depend on data and model.
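A leak-free version of the same imputation might look like the sketch below. It uses a small synthetic stand-in rather than the real Titanic file, and `demo_df`, `demo_train`, and `demo_test` are names assumed for illustration. The mean is learned from the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the Titanic data (assumed, not the real file)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({'Age': rng.normal(30, 12, size=100),
                        'Fare': rng.exponential(30, size=100)})
demo_df.loc[rng.choice(100, size=20, replace=False), 'Age'] = np.nan

# Split before computing any statistics
demo_train, demo_test = train_test_split(demo_df, random_state=42)
demo_train, demo_test = demo_train.copy(), demo_test.copy()

# Learn the mean age from the training rows only
imputer = SimpleImputer(strategy='mean').fit(demo_train[['Age']])

# Fill both sets with that training mean
demo_train['Age'] = imputer.transform(demo_train[['Age']]).ravel()
demo_test['Age'] = imputer.transform(demo_test[['Age']]).ravel()

print(imputer.statistics_)
```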

#### Leakage between train and true test can be a truly costly mistake:
- You'll only figure it out after you've deployed
- Made suboptimal predictions:
    - incorrectly recommending inventory changes to increase sales
    - or *much* worse.


#### Other types of data leakage: Feature leakage

Dependent on how the data/features were collected.

<img src = "Images/data_leakage_predcontam.png" width = 800/>

- Predicting whether we should approve a loan or not.
- Have database of current borrowers.

- Features include:
    - Salary information at time of loan approval
    - Pre-loan bankruptcy history
    - FICO score
    - Listed occupation
    - Monthly interest payments.
    - Bank balance across accounts.
    

**What's the problem here?**

Possibly conflating information from *after* prediction with properties before prediction.
- Features from **after** prediction potentially affected by target.
- Our target is now leaking into the way we trained our model.
- Features *before* and *after* approval may not be drawn from same distribution.

**These sources of feature leakage can be very subtle**
- Understanding of the data collection process and data definitions is critical here.

**Case Study**

Predict sale price of a given home.

- Size of the house (in square meters)
- Average sales price of homes in the same neighborhood
- Latitude and longitude of the house
- Whether the house has a basement

Is there a source of potential data leakage? Explain.

It depends on whether the neighborhood average includes the sale price of the home being predicted, or only sales that closed before it.

**Another Case Study**

Want to predict rate of infection during surgery based on:
- patient specific factors (immuno-history, etc.)
- environmental factors (hospital cleanliness inspections, etc.)

Your idea: include surgeon as a factor.

1. Take all surgeries by each surgeon and calculate the infection rate for that surgeon.
2. For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Potentially introduces both target leakage and train-test leak:

- Target leakage: if a given patient's outcome contributes to the infection rate for their surgeon.
- Using target to calculate feature.
- Then using this feature to predict target.

Can avoid by using:
- Surgeon's infection rate for only surgeries before the patient we are predicting for.
- Tricky, for sure.

Train-test contamination problem if you calculate this using all surgeries a surgeon performed
- Including those from the test-set. 
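One way to build the "prior surgeries only" feature is sketched below, assuming a hypothetical surgery log with `surgeon`, `date`, and `infected` columns (these names and values are invented for illustration). For each row, the rate uses only that surgeon's earlier surgeries, so the patient's own outcome never enters their own feature.

```python
import pandas as pd

# Hypothetical surgery log; column names and values are assumptions
surgeries = pd.DataFrame({
    'surgeon': ['A', 'A', 'A', 'B', 'B', 'A'],
    'date': pd.to_datetime(['2020-01-01', '2020-01-05', '2020-02-01',
                            '2020-01-03', '2020-03-01', '2020-03-10']),
    'infected': [1, 0, 0, 0, 1, 1],
}).sort_values('date').reset_index(drop=True)

grouped = surgeries.groupby('surgeon')['infected']

# Infections and surgery counts strictly before the current row, per surgeon
prior_infections = grouped.cumsum() - surgeries['infected']
prior_count = grouped.cumcount()

# Rate is NaN for a surgeon's first recorded surgery (no history yet)
surgeries['surgeon_prior_rate'] = prior_infections / prior_count.where(prior_count > 0)

print(surgeries)
```

If the train-test split is also made by date, the same construction keeps test-set outcomes out of the training features.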

Are you tearing your hair out? Good. You now understand data leakage.

## Preprocessing

In general all preprocessing steps are subject to the same dangers here. Consider the preprocessing step of one-hot-encoding:

In [None]:
gun_poll = pd.read_csv('data/guns-polls.csv')

In [None]:
gun_poll.head()

In [None]:
gun_poll['Pollster'].value_counts()

Now if I were to fit a one-hot encoder to the whole `Pollster` column here, the encoder would learn all the categories. But I need to prepare myself for the real-world possibility that unfamiliar categories may show up in future records. Let's explore this.

In [None]:
# First I'll do a split
X_train, X_test = train_test_split(gun_poll, random_state=42)

Let's suppose now that I fit a `OneHotEncoder` to the `Pollster` column in my training data.

#### Exercise
Fit an encoder to the `Pollster` column of the training data and then check to see which categories are represented.

In [None]:
# OHE

# So what categories do we have?


<details>
    <summary>Answer</summary>

``` python
to_be_dummied = X_train[['Pollster']]
ohe = OneHotEncoder()
ohe.fit(to_be_dummied)
## So what categories do we have?
ohe.get_feature_names_out()
```
</details>

We'll want to transform both train and test after we've fitted the encoder to the train.

In [None]:
ohe.transform(to_be_dummied)
ohe.transform(X_test[['Pollster']])

There are categories in the testing data that don't appear in the training data! What should 
we do about that?

### Approaches
- **Strategy 1**: Divide up the categories proportionally when we do our train_test_split. If we're using `sklearn`'s tool, that means taking advantage of the `stratify` parameter:

In [None]:
new_X_train, new_X_test = train_test_split(gun_poll,
                                           stratify=gun_poll['Pollster'],
                                           random_state=42)

Unfortunately, in this case, we can't use this since some categories have only a single member.
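To see the failure in isolation, here is a small sketch on made-up data (`demo_polls` is an assumed toy frame, not the polling file). Category `'C'` has a single row, so a stratified split cannot place it in both train and test, and `train_test_split` raises a `ValueError`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: category 'C' appears only once
demo_polls = pd.DataFrame({'Pollster': ['A'] * 6 + ['B'] * 5 + ['C'],
                           'Support': range(12)})

try:
    train_test_split(demo_polls,
                     stratify=demo_polls['Pollster'],
                     random_state=42)
    stratify_failed = False
except ValueError as err:
    # sklearn refuses to stratify on a class with a single member
    stratify_failed = True
    print(err)
```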

- **Strategy 2**: Drop the categories with very few representatives.

In the present case, let's try dropping the single-member categories.

In [None]:
vc = gun_poll['Pollster'].value_counts()
vc_only_1 = vc[vc == 1]
bad_cols = vc_only_1.index
vc

In [None]:
bad_cols

In [None]:
gun_poll['Pollster'] = gun_poll['Pollster'].map(lambda x: np.nan if x in bad_cols else x)
gun_poll = gun_poll.dropna()

In [None]:
gun_poll['Pollster'].value_counts()

We could now split this carefully so that new categories don't show up in the testing data. In fact, now we can try the stratified split:

In [None]:
X_train3, X_test3 = train_test_split(gun_poll,
                                     stratify=gun_poll['Pollster'],
                                     test_size=0.3,
                                     random_state=42)

In [None]:
X_train3['Pollster'].value_counts()

In [None]:
X_test3['Pollster'].value_counts()

Now every category that appears in the test data appears also in the training data.

- **Strategy 3**: Adjust the settings on the one-hot-encoder.

For `sklearn`'s tool, we'll tweak the `handle_unknown` parameter:

#### Exercise
Fit a new encoder to our training data column that won't break when we try to use it to transform the test data. And then use the encoder to transform both train and test.

<details>
    <summary>Answer</summary>
    
```python
ohe2 = OneHotEncoder(handle_unknown='ignore')
ohe2.fit(to_be_dummied)
test_to_be_dummied = X_test[['Pollster']]
ohe2.transform(to_be_dummied)
ohe2.transform(test_to_be_dummied)
```
</details>

In [None]:
t = pd.DataFrame(ohe2.transform(test_to_be_dummied).todense(), columns = ohe2.get_feature_names_out())
t.head()

# The Bias-Variance Tradeoff

We can break up how the model makes mistakes (the error) by saying there are three parts:

- Error inherent in the data (noise): **irreducible error**
- Error from not capturing signal (too simple): **bias**
- Error from "modeling noise", i.e. capturing patterns in the data that don't generalize well (too complex): **variance**
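For squared-error loss, this three-way split can be written down explicitly. A sketch, assuming the data follow $y = f(x) + \epsilon$ with $\mathrm{E}[\epsilon] = 0$ and $\mathrm{Var}[\epsilon] = \sigma^2$, and writing $\hat{f}$ for a model fitted on a randomly drawn training set (this notation is an assumption introduced here):

$$ \mathrm{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathrm{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathrm{E}\left[\left(\hat{f}(x) - \mathrm{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}} $$

The expectations are taken over training sets and noise at a fixed input $x$.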

### Bias

**High-bias** algorithms tend to be less complex, with simple or rigid underlying structure.

![](images/noisy-sine-linear.png)

+ They train models that are consistent, but inaccurate on average.
+ These include linear or parametric algorithms such as linear regression and naive Bayes.
+ The following sorts of difficulties could lead to high bias:
  - We did not include the correct predictors
  - We did not take interactions into account
  - We missed a non-linear (polynomial) relationship

      
High-bias models are generally **underfit**: The models have not picked up enough of the signal in the data. And so even though they may be consistent, they don't perform particularly well on the initial data, and so they will be consistently inaccurate.

### Variance

On the other hand, **high-variance** algorithms tend to be more complex, with flexible underlying structure.

<img src = "images/noisy-sine-decision-tree.png"  width = 600/>


+ They train models that are accurate on average, but inconsistent.
+ These include non-linear or non-parametric algorithms such as decision trees and nearest-neighbor models.
+ The following sorts of difficulties could lead to high variance:
  - We included an unreasonably large number of predictors;
  - We created new features by squaring and cubing each feature.

High variance models are **overfit**: The models have picked up on the noise as well as the signal in the data. And so even though they may perform well on the initial data, they will be inconsistently accurate on new data.

### Balancing Bias and Variance

While we build our models, we have to keep this relationship in mind.  If we build complex models, we risk overfitting our models.  Their predictions will vary greatly when introduced to new data.  If our models are too simple, the predictions as a whole will be inaccurate.   

![](images/noisy-sine-third-order-polynomial.png)

Bias: 
- when model not complex enough
- feature space not adequately rich enough to explain target

Variance: 

- model/weights: large fluctuations about true model given different train sets

- High $ \mathrm{Var}[\textbf{w}] $ over realization of training sets

- High fluctuation in MAE over test sets.

The bulls-eye diagrams of fitting model to different training set realizations:
<center><img src = "images/biasvar_bullseye.png" width = 400/></center>

Each dot is a model:
- Bulls-eye: the *true* model (generating mean of $y$ given $X$ in the population) 
- Each dot: models trained on different samples.
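The bulls-eye picture can be imitated with a small simulation. This is a sketch under assumed settings (a noisy sine as the true model, 30 points per training set, polynomial fits of degree 1, 3, and 9). Each fitted polynomial plays the role of one dot, and we estimate squared bias and variance of the predictions across many training sets.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 2 * np.pi, 50)
true_f = np.sin(x_grid)  # the "bulls-eye": the true mean of y given x

def fit_many(degree, n_sets=200, n_points=30, noise=0.3):
    """Fit a polynomial of the given degree to many noisy samples of sin(x)."""
    preds = np.empty((n_sets, x_grid.size))
    for i in range(n_sets):
        x = rng.uniform(0, 2 * np.pi, n_points)
        y = np.sin(x) + rng.normal(0, noise, n_points)
        preds[i] = Polynomial.fit(x, y, degree)(x_grid)
    return preds

results = {}
for degree in (1, 3, 9):
    preds = fit_many(degree)
    # Squared gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_f) ** 2)
    # Spread of predictions across training sets
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

The straight line should show the largest squared bias, while the degree-9 fit should show a noticeably larger variance than the line.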

**Our goal**: lowering bias and variance in training predictive models

but the two are often at odds.

<center><img src = "images/bias trade off.png" width = 800/></center>

#### Multicollinearity
Have to grapple with these issues when constructing linear models with multicollinear features

We talked about this way back. But how does it increase Var[$\textbf{w}$]?

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

WHO_data = pd.read_csv("data/WHO_life.csv")
X_WHO = WHO_data.drop(columns = ["Life expectancy "])
y = WHO_data["Life expectancy "]

In [None]:
X_WHO.head()

Many features from WHO dataset:

Regressing life expectancy on these features to find the weights.

In [None]:
X_WHO.columns

But let's take a look at a few of these and their correlations:

In [None]:
col_selector = ['Income composition of resources', 'Schooling','Alcohol', ' thinness  1-19 years']
subsetX = X_WHO[col_selector]
sns.heatmap(subsetX.corr())
plt.show()

Let's focus on Schooling and income composition of resources (ICR):

$$ Life= w_1*Alcohol + w_2*Polio + w_3*Schooling + w_4*Measles + w_5*ICR + ... $$

Correlation is very high!

In [None]:
col_selector = ['Income composition of resources', 'Schooling']
X_WHO[col_selector].corr()

Our regression: 
- Y = life expectancy

$$ Y - \sum_{i \neq 3,5} w_i x_i = w_3 Schooling + w_5 ICR $$

- Schooling and ICR highly related:

- Implies that $w_3$ and $w_5$ introduce too much flexibility.
- Maybe could fit almost as well with just $w_3$.

- $w_3$ and $w_5$ are floppy and can become big in either direction to fit data.
- Var[$\textbf{w}$] from $w_3$ and $w_5$ high.

Modeling data by linear model w/ multicollinear features:
- introduces high weight variance
- unnecessary model complexity
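The "floppy weights" effect can be checked with a quick simulation. This is a sketch on synthetic data, not the WHO set: two features with correlation `rho`, true weights 2 and 1, and unit noise are all assumptions chosen for illustration. We refit OLS on many simulated training sets and compare the spread of the fitted weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def weight_std(rho, n_sets=300, n_points=100):
    """Std. dev. of fitted OLS weights across many simulated training sets."""
    cov = [[1.0, rho], [rho, 1.0]]
    weights = np.empty((n_sets, 2))
    for i in range(n_sets):
        X = rng.multivariate_normal([0, 0], cov, size=n_points)
        y = 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, n_points)
        weights[i] = LinearRegression().fit(X, y).coef_
    return weights.std(axis=0)

std_uncorrelated = weight_std(rho=0.0)
std_collinear = weight_std(rho=0.98)

print("weight std, uncorrelated features:", std_uncorrelated)
print("weight std, collinear features:   ", std_collinear)
```

With nearly collinear features the individual weights swing much more from one training set to the next, even though the true weights never change.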

These considerations are all nice and theoretical:
    
- how do we actually assess whether model suffers from bias / variance or both?

#### How to assess model variance: cross-validation

Could get many different training sets:
- Train weights $\textbf{w}$ for each.
- Get variance of $\textbf{w}$ 

Semi-equivalently:
- Test performance of each model on test set.
- Evaluate model performance/variance by looking at average/standard deviation of performance on test set.

Problem: 
- we likely don't have enough data to build many independent training sets that are each large enough for a model to train on effectively.

#### Solution: Cross validation

So first we created our train / test split: 

- the **training set** can be used to develop models
- can assess variance of a model and average performance

<img src = "images/traintestsplit.png"  width = 800/>
<center> Splitting data into training and test sets </center>

<img src = "Images/crossval.png"  width = 800/>
<center> Splitting up training set </center>

Split up training set into folds:
- Training fold
- Validation fold

- For each iteration:
    - train a model.
    - Test on validation fold. 

Effectively sampling multiple training sets:
- testing each model performance on different **validation set**.
- Good for estimating model performance on average
- Good for estimating model variance as well.


So in the end:
- Performance metrics measured on validation
- We get average performance metric across all the models for each cross validation iteration.
- Get variance of performance metric.
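A minimal sketch of these three steps using `cross_validate` on the diabetes data (the variable names are assumptions). The hold-out test set is split off first and is not used at all here.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split

X_cv, y_cv = load_diabetes(return_X_y=True)
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
    X_cv, y_cv, random_state=42)

# 5-fold cross-validation on the training set only
cv_results = cross_validate(LinearRegression(), X_cv_train, y_cv_train,
                            cv=5, scoring='neg_mean_squared_error')

# sklearn reports negated MSE, so flip the sign
val_mse = -cv_results['test_score']

print("validation MSE per fold:", val_mse)
print("average:", val_mse.mean())
print("standard deviation:", val_mse.std())
```

Note that `cross_validate` names the validation scores `test_score`, even though they come from validation folds inside the training set.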

Note: **validation set** is part of training set:
- Not part of true test/hold-out set.

We are often trying out different model types:
- OLS with raw features
- OLS with collinear features dropped
- OLS with polynomial features
- Ridge regressor (will see later)

Idea is that we try out different model types / tune models: 
- assess variance
- assess average performance

**Use train/validation for this**: 
- for each model type: estimate model average performance and variance *across different train/validation realizations*

True and final evaluation:
- Measure performance on tuned model on the test set that has never been seen before.

<img src = "Images/cvtuningflow.png"  width = 800/>
<center> Model comparison/selection using cross-validation </center>
<center> Best model from cross-validation in test phase</center>

Roughly:
- Training data is for building the model;
- Validation data is for *tweaking* the model;
- Testing data is for evaluating the model on unseen data.

- Think of **training** data as what you study for a test
- Think of **validation** data as taking a practice test (note: sometimes called **dev**)
- Think of **testing** data as what you use to judge the model
    - A **holdout** set is when your test dataset is never used for training (unlike in cross-validation)

Selected best model based on:
- what worked best on the given validation folds.

**Iterative optimization of models based on the train/validation data**

Ultimately: 

- want to evaluate our best model class (found by optimizing over the validation sets) 
- on data that has been used for neither training nor validation

![](https://scikit-learn.org/stable/_images/grid_search_workflow.png)
> Image from Scikit-Learn https://scikit-learn.org/stable/modules/cross_validation.html


<img src = "Images/test_phase_afterCV.png"  width = 800/>
<center> Best model from cross-validation in test phase</center>

1. Split data into training data and a holdout test
2. Design a model
3. Evaluate how well it generalizes with **cross-validation** (only training data)
4. Determine if we should adjust model, use cross-validation to evaluate, and repeat
5. After iteratively adjusting your model, do a _final_ evaluation with the holdout test set
6. DON'T TOUCH THE MODEL!!!
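The steps above can be sketched end to end. This is an illustration under assumptions: the two candidates (OLS on raw features and OLS on degree-2 polynomial features) and all variable names are chosen here, not prescribed by the workflow.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 1. Split into training data and a holdout test set
X_all, y_all = load_diabetes(return_X_y=True)
X_tr, X_hold, y_tr, y_hold = train_test_split(X_all, y_all, random_state=42)

# 2. Design candidate models
candidates = {
    'ols_raw': LinearRegression(),
    'ols_poly2': make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

# 3-4. Compare them with cross-validation on the training data only
cv_mse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_tr, y_tr, cv=5,
                             scoring='neg_mean_squared_error')
    cv_mse[name] = -scores.mean()

# 5. Refit the winner on all training data and evaluate once on the holdout
best_name = min(cv_mse, key=cv_mse.get)
final_model = candidates[best_name].fit(X_tr, y_tr)
holdout_mse = mean_squared_error(y_hold, final_model.predict(X_hold))

print(cv_mse)
print(best_name, holdout_mse)
```

The holdout score is computed exactly once, after model selection is finished.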

Cross validation gives us a way to test statistical robustness of model performance:
- evaluate average performance
- evaluate model variance

But seeing a set of models have high variance:
- How to address this problem found in cross-validation trials?
- i.e., how do we lower the variance?

#### Ways to limit/deal with high variance.

- Get more data. With enough training data, even with floppy weights it'll get it right.

- Yeah, but often not possible/easy to get enough data for this.

- Get rid of columns that exhibit a high degree of collinearity with other columns.

- Yeah, but did we throw out some useful information for prediction? 
- ICR and schooling not the same thing.
- How many of the collinear columns should we throw away? Which ones?

Getting rid of columns like this:
- Can lower variance
- But can also increase bias in an arbitrary, non-optimal way

- Or we could come up with ways to directly limit the variance through the cost function itself.

The hope is that with this method:
- decrease variance
- without increasing bias too much

Doing this in an optimal and principled way.

## Leakage in cross validation

<img src = "Images/crossval.png" width = 500/>

When you pollute your training fold with your validation fold:

- Each cross-validation trial has data leakage.

Validation performance measurements not correct:
- Incorrect estimates of average model performance
- Incorrect estimates of model variance.

Messes up your hyperparameter tuning:
- "Best model": hyperparameter settings with best average model performance
- But it doesn't work well on my test/hold-out set...
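One common way to keep each trial clean is to wrap preprocessing and model in a `Pipeline`, so the scaler is refit on the training fold inside every cross-validation iteration. A sketch, with names assumed for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_pipe, y_pipe = load_diabetes(return_X_y=True)
X_pipe_train, X_pipe_test, y_pipe_train, y_pipe_test = train_test_split(
    X_pipe, y_pipe, random_state=42)

# Scaler and model travel together; cross_validate clones and fits the whole
# pipeline on each training fold, so validation rows never reach the scaler
pipe = make_pipeline(StandardScaler(), LinearRegression())

cv = KFold(n_splits=5, shuffle=True, random_state=42)
pipe_results = cross_validate(pipe, X_pipe_train, y_pipe_train, cv=cv,
                              scoring='neg_mean_squared_error')

leak_free_mse = -pipe_results['test_score']
print(leak_free_mse.mean(), leak_free_mse.std())
```

Scaling the full training set once and then cross-validating would instead let every validation fold contribute to the scaler's mean and standard deviation.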

## KFold

In [None]:
data = load_diabetes()

print(data.DESCR)

df = pd.concat([pd.DataFrame(data.data, columns=data.feature_names),
               pd.Series(data.target, name='target')], axis=1)

df.head()

In [None]:
X, y = load_diabetes(return_X_y=True)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, random_state=42)

In [None]:
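for train_ind, val_ind in KFold().split(X_train2):
    # Split the training data into this iteration's training and validation folds
    train = X_train2[train_ind, :]
    val = X_train2[val_ind, :]

    target_train = y_train2[train_ind]
    target_val = y_train2[val_ind]

    # Fit the scaler on the training fold only, so no validation statistics leak in
    ss = StandardScaler().fit(train)

    train_scld = ss.transform(train)
    val_scld = ss.transform(val)

    lr = LinearRegression().fit(train_scld, target_train)

    # The first coefficient changes from fold to fold
    print(lr.coef_[0])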