# Data Leakage with Credit Card Data

Kaggle Intermediate Machine Learning Course: https://www.kaggle.com/code/alexisbcook/data-leakage

There are two main types of leakage: **target leakage** and **train-test contamination**.
ChatGPT helped me understand them as follows:
- **target leakage** is a situation in machine learning where information from the future (data that would not be available at the time of prediction) is inadvertently used to train a model. This can lead to overly optimistic performance metrics during training but can result in poor generalization to new, unseen data.
- **train-test contamination** is a situation in machine learning where information from the test set (or validation set) unintentionally influences the training process. This can lead to overly optimistic performance evaluations during model development but result in poor generalization to new, unseen data.
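
To make the second definition concrete, here is a minimal sketch of train-test contamination and one way to avoid it. The data is randomly generated for illustration only, and the names `X_toy`, `y_toy`, and `clean_pipeline` are made up for this example; the choice of `SimpleImputer` and `LogisticRegression` is also an assumption, not something from the course.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Contaminated: the imputer is fit on every row, including rows that
# cross-validation will later hold out for validation
X_imputed_all = SimpleImputer(strategy="mean").fit_transform(X_toy)
contaminated_scores = cross_val_score(LogisticRegression(), X_imputed_all, y_toy, cv=5)

# Clean: the imputer is a pipeline step, so it is refit on each training fold only
clean_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])
clean_scores = cross_val_score(clean_pipeline, X_toy, y_toy, cv=5)

print("Contaminated CV accuracy: %f" % contaminated_scores.mean())
print("Clean CV accuracy: %f" % clean_scores.mean())
```

On toy data like this the two numbers will usually be close. The point is where the preprocessing is fit, not the size of the gap.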

In [34]:
import pandas as pd

# Read the data
data = pd.read_csv('AER_credit_card_data.csv')

# Select target
y = data.card

# Select predictors
X = data.drop(['card'], axis=1)

print("Number of rows in the dataset:", X.shape[0])
X.head()

Number of rows in the dataset: 1319


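| Unnamed: 0 | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | yes | no | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | no | no | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | yes | no | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | no | no | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | yes | no | 2 | 64 | 1 | 5 |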


## Cross-Validation

The code below differs slightly from the example in the Kaggle course. Running the course code raised an error about categorical (string) columns in my dataset, so I added one-hot encoding as a preprocessing step. I also used the `Pipeline` class, which I already knew, instead of `make_pipeline`, which the course introduces without much explanation.
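
For reference, here is a small sketch of how the two constructors relate. `make_pipeline` builds the same kind of `Pipeline` object and only differs in that it names the steps automatically from the lowercased class names. The variable names `explicit` and `auto` are made up for this example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Pipeline: you choose the step names yourself
explicit = Pipeline(steps=[
    ("preprocessor", OneHotEncoder(handle_unknown="ignore")),
    ("model", RandomForestClassifier(n_estimators=100)),
])

# make_pipeline: same kind of object, step names are generated for you
auto = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(n_estimators=100),
)

print([name for name, _ in explicit.steps])  # ['preprocessor', 'model']
print([name for name, _ in auto.steps])      # ['onehotencoder', 'randomforestclassifier']
```

Either form can be passed to `cross_val_score`, so the choice is mostly about whether you want to pick the step names.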

In [35]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

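# One-hot encoding is the only preprocessing step. Keeping it inside a pipeline
# means it is refit on the training folds only during cross-validation.
my_pipeline = Pipeline(steps=[
    # Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
    ('preprocessor', OneHotEncoder(handle_unknown='ignore', sparse=False)),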
    ('model', RandomForestClassifier(n_estimators=100))
])

cv_scores = cross_val_score(my_pipeline, X, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())



Cross-validation accuracy: 0.975743
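
A side note on the preprocessing: passing the whole of `X` to `OneHotEncoder` encodes the numeric columns too, treating every distinct number as its own category. One possible alternative, sketched here on made-up data rather than on the real dataset, is to encode only the string columns with `ColumnTransformer` and pass the numeric ones through. The names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical, and the score from this toy data is not comparable to the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame that mimics a few columns of X; all values are made up
rng = np.random.default_rng(0)
n = 60
X_demo = pd.DataFrame({
    "age": rng.uniform(20, 60, n),
    "income": rng.uniform(1, 10, n),
    "owner": rng.choice(["yes", "no"], n),
    "selfemp": rng.choice(["yes", "no"], n),
})
y_demo = pd.Series(np.where(X_demo["income"] > 5, "yes", "no"))

# Treat every non-numeric column as categorical
categorical_cols = [
    col for col in X_demo.columns
    if not pd.api.types.is_numeric_dtype(X_demo[col])
]

# One-hot encode the categorical columns, leave the numeric ones untouched
preprocessor = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=3, scoring="accuracy")
print("Categorical columns:", categorical_cols)
print("Toy cross-validation accuracy: %f" % demo_scores.mean())
```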


With experience, you'll find that it's very rare to find models that are accurate 98% of the time. It happens, but it's uncommon enough that we should inspect the data more closely for target leakage.

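Here is a summary of the data, which you can also find under the data tab. Note that in this CSV the binary columns (`card`, `owner`, `selfemp`) are stored as 'yes'/'no' strings rather than 1/0: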

- card: 1 if credit card application accepted, 0 if not
- reports: Number of major derogatory reports
- age: Age in years plus twelfths of a year
- income: Yearly income (divided by 10,000)
- share: Ratio of monthly credit card expenditure to yearly income
- expenditure: Average monthly credit card expenditure
- owner: 1 if owns home, 0 if rents
- selfemp: 1 if self-employed, 0 if not
- dependents: 1 + number of dependents
- months: Months living at current address
- majorcards: Number of major credit cards held
- active: Number of active credit accounts

A few variables look suspicious. For example, does expenditure mean expenditure on this card or on cards used before applying?

At this point, basic data comparisons can be very helpful.

Following the Kaggle course code gave me errors again, so I changed it slightly: I first inspected the values in `y`, then used what I found to make the comparison work.


In [38]:
print(y.unique())  # Confirm the unique values in 'y'
print(X.index)  # Check the DataFrame index
y_boolean = (y == 'yes')  # Convert 'y' to boolean: True for 'yes', False for 'no'

['yes' 'no']
RangeIndex(start=0, stop=1319, step=1)


Since y contains 'yes' and 'no' strings, I convert it to boolean values. 

In [39]:
# y_boolean is True where the application was accepted (cardholders)
expenditures_cardholders = X.expenditure[y_boolean]
# ~y_boolean selects the rejected applications (non-cardholders)
expenditures_noncardholders = X.expenditure[~y_boolean]

print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))

Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02


As shown above, everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures. It's not surprising that our model appeared to have a high accuracy. But this also seems to be a case of target leakage, where expenditures probably means expenditures on the card they applied for. 
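
The same comparison can be run for several columns at once. The sketch below uses a tiny made-up frame (the `toy` data and the `zero_fraction` name are for illustration only) and computes, for each target class, the fraction of rows where a feature is exactly zero. A feature that is near 1.0 for one class and near 0.0 for the other deserves a closer look.

```python
import pandas as pd

# Toy data, made up to mirror the pattern found above
toy = pd.DataFrame({
    "card": ["yes", "yes", "yes", "no", "no", "yes"],
    "expenditure": [124.98, 0.0, 15.0, 0.0, 0.0, 546.50],
    "reports": [0, 0, 1, 0, 2, 0],
})

feature_cols = ["expenditure", "reports"]

# For each class of the target, the share of rows where the feature equals zero
zero_fraction = toy[feature_cols].eq(0).groupby(toy["card"]).mean()
print(zero_fraction)
```

In this toy frame, `expenditure` is zero for every "no" row and for a quarter of the "yes" rows, while `reports` shows no such clean split.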

Since `share` is partially determined by `expenditure`, it should be excluded too. The variables `active` and `majorcards` are a little less clear, but from the description, they sound concerning. In most situations, it's better to be safe than sorry if you can't track down the people who created the data to find out more. 
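
As a rough check of the link between `share` and `expenditure`, the sketch below rebuilds an approximate share from the five rows shown by `X.head()` earlier. It assumes that "ratio of monthly expenditure to yearly income" means monthly spend times 12, divided by income times 10,000; that reading of the data summary is my assumption, not something the dataset documentation confirms.

```python
import pandas as pd

# The five rows shown by X.head() earlier in this notebook
head = pd.DataFrame({
    "income": [4.52, 2.42, 4.5, 2.54, 9.7867],
    "expenditure": [124.9833, 9.854167, 15.0, 137.8692, 546.5033],
    "share": [0.03327, 0.005217, 0.004156, 0.065214, 0.067051],
})

# Assumption: annualize monthly spend and undo the /10,000 scaling of income
approx_share = head["expenditure"] * 12 / (head["income"] * 10_000)

comparison = head[["share"]].assign(approx_share=approx_share)
comparison["abs_diff"] = (comparison["share"] - comparison["approx_share"]).abs()
print(comparison)
```

On these five rows the reconstruction stays within about 0.0005 of the recorded `share`, so the match is close but not exact. That is consistent with `share` carrying almost the same information as `expenditure`.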

We would run a model without target leakage as follows:

In [40]:
# Drop leaky predictors from dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())



Cross-val accuracy: 0.833207


This accuracy is quite a bit lower, which might be disappointing. However, we can expect the model to be right a little over 80% of the time when used on new applications, whereas the leaky model would likely do much worse than that in spite of its higher apparent score in cross-validation.