# The cardinal sin of data leakage

**Data leakage: having data in the training sample that you wouldn't have for real-world predictions.**

Examples
1. y is explicitly in X (yikes)
2. y is a 2018 variable, but there is a 2019 variable in X
3. Subtle: y is loan default, but X contains employee ID, and some employees are brought in specifically to handle troubled loans. Employee ID then partly encodes the outcome, and if you include it, the firm can't use the model to decide where to deploy the troubled-loan specialists.
4. A red flag: if out-of-sample predicted stock movements have an R2 above 10%, that is unlikely to be real, so suspect leakage (or: you'll be richer than Bezos soon).
5. The code below.

```python
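from sklearn.model_selection import cross_validate  # plus a bunch of other sklearn stuff

X, y = load_data()  # placeholder: load your data
X = transform(X)    # placeholder: imputation, encode categorical vars, standardize (uses ALL rows)

# ...then split into train/test, or this:
cross_validate(model, X, y)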
```

**Q: What's the problem here?**

**A: `transform(X)` used the whole dataset, so the X_training data was altered using info from X_test**

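For example, take one variable with missing values, split across training and test rows:

| x1  | sample   |
|-----|----------|
| 1   | training |
| NaN | training |
| 2   | test     |
| NaN | test     |

Imputing with the mean of the training rows only (which is 1) fills both missing values with 1. Imputing with the mean of the whole dataset (1.5) would instead push information from the test rows into the training rows.

| x1 | sample   |
|----|----------|
| 1  | training |
| 1  | training |
| 2  | test     |
| 1  | test     |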
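The toy example above can be checked in a few lines. This is a minimal sketch, assuming mean imputation via scikit-learn's `SimpleImputer`; the variable names (`x_train`, `x_test`, `leaky_fill`, `clean_fill`) are illustrative and not part of the original notes.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The toy data: one feature, one missing value in each sample
x_train = np.array([[1.0], [np.nan]])
x_test = np.array([[2.0], [np.nan]])

# Leaky: the imputer is fit on training AND test rows together
leaky = SimpleImputer(strategy="mean").fit(np.vstack([x_train, x_test]))
leaky_fill = leaky.statistics_[0]   # mean of {1, 2}

# Clean: the imputer is fit on the training rows only
clean = SimpleImputer(strategy="mean").fit(x_train)
clean_fill = clean.statistics_[0]   # mean of {1}

# Apply the training-only fit to both samples
x_train_imputed = clean.transform(x_train)
x_test_imputed = clean.transform(x_test)

print("leaky fill value:", leaky_fill)
print("clean fill value:", clean_fill)
print("imputed test rows:", x_test_imputed.ravel())
```

The leaky fill value depends on the test observation, which is exactly the information a deployed model would not have at training time.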

## Avoiding Data Leakage

- Preventing 1-4: Be very familiar with the data and how it was collected and built
- Preventing 5: Do your data prep _**within**_ each CV fold, so that every transformation is fit using only info from the training portion of that fold

```python

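from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# assumes X and y are NumPy arrays, and that `prep_methods` (a transformer)
# and `model` (an estimator) have already been created
accuracy = []  # out-of-sample score for each fold

# loop over folds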
for train_index, test_index in StratifiedKFold(n_splits=5).split(X,y):

    # .split() yields the indices in train/test sets. use those to get
    # the x/y vars for each separated out:
   
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index] 
    ###################################################################
    # NEW: do the data prep inside this fold, only using training data
    ###################################################################

    # e.g. figure out means/std in X_train so we can impute/standardize
    prep_methods.fit(X_train)                  # "fit" the transform: estimate (as when training a model) what to do
    X_train = prep_methods.transform(X_train)  # apply those to X_train to impute and standardize
   
    # fit/estimate, predict OOS, evaluate and store
    model.fit(X_train,y_train)
   
    ###################################################################
    # NEW: transform the test data the same...
    ###################################################################
   
    X_test = prep_methods.transform(X_test)  # apply to the TEST data the FIT from the TRAIN data

    y_predict = model.predict(X_test)
    accuracy.append(accuracy_score(y_test, y_predict))

```
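For a version of this loop that runs end to end, here is a sketch with concrete stand-ins for the placeholders. The choices are assumptions made for illustration only: the iris data, `StandardScaler` as the data prep, and `SVC(C=1)` as the model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # illustrative dataset
accuracy = []

for train_index, test_index in StratifiedKFold(n_splits=5).split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the prep step on the training fold only
    scaler = StandardScaler().fit(X_train)

    # Train on the transformed training fold
    model = SVC(C=1).fit(scaler.transform(X_train), y_train)

    # Reuse the training-fold fit on the test fold, then score out of sample
    y_predict = model.predict(scaler.transform(X_test))
    accuracy.append(accuracy_score(y_test, y_predict))

print("fold accuracies:", np.round(accuracy, 3))
print("mean accuracy:", round(float(np.mean(accuracy)), 3))
```

Note that `scaler` is created fresh inside the loop, so nothing learned in one fold carries over to the next.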


# Our first pipeline 

Pipeline: a sequence of steps applied in order. Every step except the last must have a `fit` and a `transform`; the last step (usually the model) only needs a `fit`.
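To make the `fit`/`transform` requirement concrete, here is a sketch of a homemade step dropped into a pipeline. The `Winsorizer` class, its `lower`/`upper` parameters, and the simulated data are all hypothetical and invented for this illustration; only `BaseEstimator`, `TransformerMixin`, `make_pipeline`, and `LinearRegression` come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


class Winsorizer(BaseEstimator, TransformerMixin):
    """Hypothetical step: clip each column at percentiles learned in fit()."""

    def __init__(self, lower=1, upper=99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # "Estimate what to do": learn the cutoffs from the data passed to fit
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the learned cutoffs to whatever data arrives later
        return np.clip(X, self.low_, self.high_)


# Simulated data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

pipe = make_pipeline(Winsorizer(lower=5, upper=95), LinearRegression())
pipe.fit(X[:150], y[:150])     # cutoffs and coefficients come from these rows only
preds = pipe.predict(X[150:])  # new rows are clipped with the stored cutoffs
```

Because the cutoffs are stored in `fit` and reused in `transform`, the pipeline can apply them to unseen rows without re-estimating anything.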

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn import svm

iris = load_iris() # data

# set up the pipeline, which will, given a set of observations
# 1. fit and apply these steps to the training fold
# 2. in the testing fold, apply the transform and model to predict (no estimation)

classifier_pipeline = make_pipeline(
    preprocessing.StandardScaler(),  # clean the data
    svm.SVC(C=1),                    # model
)

cross_validate(classifier_pipeline, iris.data, iris.target, cv=5)
```

```
{'fit_time': array([0.0039928, 0.0009973, 0.0009973, 0.0009973, 0.0009973]),
 'score_time': array([0.       , 0.       , 0.0009973, 0.       , 0.       ]),
 'test_score': array([0.96666667, 0.96666667, 0.96666667, 0.93333333, 1.        ])}
```
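One way to confirm that the pipeline refits its prep step in every fold is to ask `cross_validate` to return the fitted pipelines and inspect them. This is a sketch, assuming the same iris setup as above; `fold_means` is an illustrative name, and the step name `"standardscaler"` is the one `make_pipeline` generates automatically from the class name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(C=1))

# return_estimator=True keeps the pipeline that was fit on each training fold
results = cross_validate(pipe, X, y, cv=5, return_estimator=True)

# Each fitted pipeline carries the column means learned from its own training fold
fold_means = [est.named_steps["standardscaler"].mean_ for est in results["estimator"]]

for i, means in enumerate(fold_means):
    print(f"fold {i}: scaler means = {means.round(3)}")
```

The means will typically differ slightly from fold to fold, because each scaler only ever saw its own training rows.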