# Data Leakage
## Tutorial
* [Data leakage](https://www.kaggle.com/code/alexisbcook/data-leakage)

In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [8]:
# Parse the 'yes'/'no' columns as booleans while loading
df = pd.read_csv('./data/aer_credit_card_data.csv',
                 true_values=['yes'], false_values=['no'])
df.head()

Unnamed: 0,card,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,True,0,37.66667,4.52,0.03327,124.9833,True,False,3,54,1,12
1,True,0,33.25,2.42,0.005217,9.854167,False,False,3,34,1,13
2,True,0,33.66667,4.5,0.004156,15.0,True,False,4,58,1,5
3,True,0,30.5,2.54,0.065214,137.8692,False,False,0,25,1,7
4,True,0,32.16667,9.7867,0.067051,546.5033,True,False,2,64,1,5


In [9]:
y = df.card
X = df.drop(['card'], axis=1)
X.shape

(1319, 11)

In [4]:
X.head()

Unnamed: 0,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,0,37.66667,4.52,0.03327,124.9833,True,False,3,54,1,12
1,0,33.25,2.42,0.005217,9.854167,False,False,3,34,1,13
2,0,33.66667,4.5,0.004156,15.0,True,False,4,58,1,5
3,0,30.5,2.54,0.065214,137.8692,False,False,0,25,1,7
4,0,32.16667,9.7867,0.067051,546.5033,True,False,2,64,1,5


There is no preprocessing here, so a pipeline isn't strictly necessary, but we use one anyway as a best practice.
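
For contrast, here is a minimal sketch of a case where the pipeline does matter. It assumes synthetic data with missing values (not the credit card dataset) and a hypothetical imputation step. Because `SimpleImputer` sits inside the pipeline, `cross_val_score` refits it on the training folds only, so statistics from the validation fold never reach the preprocessing.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 200 rows, 4 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] + X_demo[:, 1] > 0

# Knock out about 10% of the entries to create missing values.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# The imputer is a pipeline step, so it is fit on each training fold only.
demo_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo,
                              cv=5, scoring="accuracy")
print("Per-fold accuracy:", demo_scores.round(3))
```

Fitting the imputer on the full dataset before cross-validation would look similar in code but would let validation rows influence the imputed values.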

In [10]:
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())

Cross-validation accuracy: 0.980294


With experience, you'll find that it's very rare to find models that are accurate 98% of the time. It happens, but it's uncommon enough that we should inspect the data more closely for target leakage.
Here is a summary of the data, which you can also find under the data tab:
* card: 1 if credit card application accepted, 0 if not
* reports: Number of major derogatory reports
* age: Age in years plus twelfths of a year
* income: Yearly income (divided by 10,000)
* share: Ratio of monthly credit card expenditure to yearly income
* expenditure: Average monthly credit card expenditure
* owner: 1 if owns home, 0 if rents
* selfemp: 1 if self-employed, 0 if not
* dependents: 1 + number of dependents
* months: Months living at current address
* majorcards: Number of major credit cards held
* active: Number of active credit accounts

A few variables look suspicious. For example, does expenditure mean expenditure on this card or on cards used before applying?
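
One way to probe such a suspicion is to check whether a column can be rebuilt from others. The summary describes share as the ratio of monthly expenditure to yearly income, so the sketch below recomputes it for the five rows printed by `df.head()`. The annualization factor of 12 and the 10,000 income scaling are assumptions read off the column descriptions, not a documented formula, and `share_rebuilt` and `relative_gap` are names introduced here for illustration.

```python
import pandas as pd

# First five rows as printed by df.head() earlier in this notebook.
sample = pd.DataFrame({
    "income": [4.52, 2.42, 4.5, 2.54, 9.7867],
    "share": [0.03327, 0.005217, 0.004156, 0.065214, 0.067051],
    "expenditure": [124.9833, 9.854167, 15.0, 137.8692, 546.5033],
})

# Assumed reconstruction: 12 months of expenditure over income in dollars.
sample["share_rebuilt"] = 12 * sample["expenditure"] / (10_000 * sample["income"])
sample["relative_gap"] = (
    (sample["share_rebuilt"] - sample["share"]).abs() / sample["share"]
)

print(sample[["share", "share_rebuilt", "relative_gap"]])
```

On these five rows the rebuilt values land within about 7% of the recorded share, which is consistent with share being largely determined by expenditure. If expenditure turns out to leak the target, share probably does as well.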

At this point, basic data comparisons can be very helpful:

In [11]:
# The boolean target y doubles as a row mask: True = received a card
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]
print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))

Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02
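
The same comparison can be run across every numeric column at once. The sketch below is one possible way to do that, using a small made-up frame; the column names `spend_after` and `tenure` and the helper `zero_fraction_by_class` are hypothetical. A column whose zero rate is near 1 in one class and near 0 in the other deserves a closer look.

```python
import pandas as pd

def zero_fraction_by_class(features: pd.DataFrame, target: pd.Series) -> pd.DataFrame:
    """Fraction of exact zeros in each numeric column, split by a boolean target."""
    numeric = features.select_dtypes(include="number")
    is_zero = numeric.eq(0)
    table = pd.DataFrame({
        "target_false": is_zero[~target].mean(),
        "target_true": is_zero[target].mean(),
    })
    # A large gap means the zero pattern nearly separates the two classes.
    table["gap"] = (table["target_false"] - table["target_true"]).abs()
    return table.sort_values("gap", ascending=False)

# Hypothetical toy data: spend_after is only nonzero for accepted applicants.
toy_X = pd.DataFrame({
    "spend_after": [120.0, 0.0, 35.5, 0.0, 80.0, 0.0],
    "tenure": [12, 0, 40, 7, 0, 22],
})
toy_y = pd.Series([True, False, True, False, True, False])

print(zero_fraction_by_class(toy_X, toy_y))
```

A large gap is only a flag, not proof of leakage; whether the value would be known at prediction time still has to be judged from how the data was collected.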


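The comparison in In [11] shows that everyone who did not receive a card had no expenditures, while only 2% of those who received a card had none. That explains the high accuracy, and it suggests target leakage: expenditure probably means spending on the card being applied for, which is only known after the decision. Since share is partly determined by expenditure, it should be excluded too. The variables active and majorcards are less clear-cut, but they are dropped here as well to be safe.

In [12]:
# Drop potentially leaky predictors from the dataset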
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)
X2.head()

Unnamed: 0,reports,age,income,owner,selfemp,dependents,months
0,0,37.66667,4.52,True,False,3,54
1,0,33.25,2.42,False,False,3,34
2,0,33.66667,4.5,True,False,4,58
3,0,30.5,2.54,False,False,0,25
4,0,32.16667,9.7867,True,False,2,64


In [13]:

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())

Cross-val accuracy: 0.833198
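
The score drops from about 0.98 to about 0.83 once the suspect columns are removed, but the lower number is the more trustworthy estimate for new applications. Whether an accuracy like this reflects real signal also depends on class balance, so it can help to compare against a baseline that always predicts the majority class. The following is a sketch on synthetic data (the roughly 75/25 class split is an assumption for illustration, not the ratio in the credit card data) using scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): 3 features, roughly 75% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = X_demo[:, 0] + 0.3 * rng.normal(size=400) > -0.7

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
model = RandomForestClassifier(n_estimators=100, random_state=0)

baseline_acc = cross_val_score(baseline, X_demo, y_demo,
                               cv=5, scoring="accuracy").mean()
model_acc = cross_val_score(model, X_demo, y_demo,
                            cv=5, scoring="accuracy").mean()

print("Majority-class baseline accuracy: %f" % baseline_acc)
print("Random forest accuracy: %f" % model_acc)
```

The same two calls could be run with `X2` and `y` from this notebook to see how far the cleaned model sits above the majority-class rate.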
