In [25]:
import pandas as pd

# Read the data
data = pd.read_csv('./input/AER_credit_card_data/AER_credit_card_data.csv')

In [26]:
data

Unnamed: 0,card,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,yes,0,37.66667,4.5200,0.033270,124.983300,yes,no,3,54,1,12
1,yes,0,33.25000,2.4200,0.005217,9.854167,no,no,3,34,1,13
2,yes,0,33.66667,4.5000,0.004156,15.000000,yes,no,4,58,1,5
3,yes,0,30.50000,2.5400,0.065214,137.869200,no,no,0,25,1,7
4,yes,0,32.16667,9.7867,0.067051,546.503300,yes,no,2,64,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1314,yes,0,33.58333,4.5660,0.002146,7.333333,yes,no,0,94,1,19
1315,no,5,23.91667,3.1920,0.000376,0.000000,no,no,3,12,1,5
1316,yes,0,40.58333,4.6000,0.026513,101.298300,yes,no,2,1,1,2
1317,yes,0,32.83333,3.7000,0.008999,26.996670,no,yes,0,60,1,7


In [27]:
# Re-read the data, parsing 'yes'/'no' values as booleans
data = pd.read_csv('./input/AER_credit_card_data/AER_credit_card_data.csv',
                   true_values=['yes'],
                   false_values=['no'])

In [17]:
data

Unnamed: 0,card,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,True,0,37.66667,4.5200,0.033270,124.983300,True,False,3,54,1,12
1,True,0,33.25000,2.4200,0.005217,9.854167,False,False,3,34,1,13
2,True,0,33.66667,4.5000,0.004156,15.000000,True,False,4,58,1,5
3,True,0,30.50000,2.5400,0.065214,137.869200,False,False,0,25,1,7
4,True,0,32.16667,9.7867,0.067051,546.503300,True,False,2,64,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1314,True,0,33.58333,4.5660,0.002146,7.333333,True,False,0,94,1,19
1315,False,5,23.91667,3.1920,0.000376,0.000000,False,False,3,12,1,5
1316,True,0,40.58333,4.6000,0.026513,101.298300,True,False,2,1,1,2
1317,True,0,32.83333,3.7000,0.008999,26.996670,False,True,0,60,1,7
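
Parsing `yes`/`no` as booleans matters later, because the target is used directly as a boolean mask (`X.expenditure[y]` and `~y`). The following is a minimal sketch of how `true_values` and `false_values` behave, using a small in-memory CSV. The `csv_text` contents are made up for illustration and are not rows from the AER data.

```python
import io

import pandas as pd

# Hypothetical three-column CSV, used only to illustrate the parsing behaviour
csv_text = "card,owner,reports\nyes,no,0\nno,yes,2\n"

toy = pd.read_csv(io.StringIO(csv_text),
                  true_values=['yes'],
                  false_values=['no'])

# Columns made up entirely of 'yes'/'no' are read as bool;
# the numeric column is left untouched
print(toy.dtypes)
```

Once a column has a boolean dtype it can be used as a row mask and negated with `~`.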


In [28]:
# Select target
y = data.card

# select predictors
X = data.drop(['card'], axis=1)

print("No. of rows in the data: {}".format(X.shape[0]))

No. of rows in the data: 1319


In [29]:
X.shape

(1319, 11)

In [30]:
# Model
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing we have no need for a pipeline,
# however it is best practice to have one

my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))

cv_scores = cross_val_score(my_pipeline, X, y,
                           cv = 5,
                           scoring = 'accuracy')

print("Cross-validation accuracy: %f" %cv_scores.mean())

Cross-validation accuracy: 0.980283


### Considering leakage

It's very rare to find models that are accurate 98% of the time. It happens, but it's uncommon enough that we should inspect the data more closely for target leakage.

Here is a summary of the data, which we can also find under the data tab:

- card: 1 if credit card application accepted, 0 if not
- reports: Number of major derogatory reports
- age: Age in years plus twelfths of a year
- income: Yearly income (divided by 10,000)
- share: Ratio of monthly credit card expenditure to yearly income
- expenditure: Average monthly credit card expenditure
- owner: 1 if owns home, 0 if rents
- selfemp: 1 if self-employed, 0 if not
- dependents: 1 + number of dependents
- months: Months living at current address
- majorcards: Number of major credit cards held
- active: Number of active credit accounts

A few variables look suspicious. For example, does `expenditure` mean expenditure on this card or on cards used before applying?

At this point, basic data comparisons can be very helpful:

In [31]:
# Comparison of expenditure with target

expenditure_cardholders = X.expenditure[y]
expenditure_noncardholders = X.expenditure[~y]

print("Fraction of clients who did not receive a card and had no expenditure: %.2f" \
     %((expenditure_noncardholders == 0).mean()))

print("Fraction of clients who received a card and had no expenditure: %.2f" \
     %((expenditure_cardholders == 0).mean()))

Fraction of clients who did not receive a card and had no expenditure: 1.00
Fraction of clients who received a card and had no expenditure: 0.02


As shown above, everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures. It's not surprising that our model appeared to have a high accuracy. But this also seems to be a case of **target leakage**, where expenditures probably means ***expenditures on the card they applied for***.
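
The same comparison can be run for every column at once. The sketch below is one possible way to do it, not part of the original analysis: the helper `zero_fraction_by_target` and the `toy` data frame are hypothetical names and values introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the AER dataset
toy = pd.DataFrame({
    "card": [True, True, True, False, False, False],
    "expenditure": [120.0, 0.0, 35.5, 0.0, 0.0, 0.0],
    "months": [12, 0, 40, 0, 24, 60],
})

def zero_fraction_by_target(df, target):
    """For each feature, return the fraction of zeros within each target class."""
    mask = df[target].astype(bool)
    is_zero = df.drop(columns=[target]).eq(0)
    return pd.DataFrame({
        "target_true": is_zero[mask].mean(),
        "target_false": is_zero[~mask].mean(),
    })

zero_fractions = zero_fraction_by_target(toy, "card")
print(zero_fractions)
```

A feature whose zero fraction is close to 1 in one class and close to 0 in the other, as `expenditure` is here, deserves a closer look. A feature with similar fractions in both classes, such as `months` in the toy data, raises no flag by this test alone.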

Since `share` is partially determined by `expenditure`, it should be excluded too. The variables `active` and `majorcards` are a little less clear, but from the description, they sound concerning. In most situations, it's better to be safe than sorry if you can't track down the people who created the data to find out more.
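
The dependence of `share` on `expenditure` can be sanity-checked with the five rows displayed at the top of the notebook. This is a rough sketch that assumes `share` is proportional to `expenditure / income`. The exact scaling is not stated in the data description, so only the correlation is computed.

```python
import pandas as pd

# First five rows as displayed in the notebook output above
rows = pd.DataFrame({
    "income": [4.5200, 2.4200, 4.5000, 2.5400, 9.7867],
    "share": [0.033270, 0.005217, 0.004156, 0.065214, 0.067051],
    "expenditure": [124.983300, 9.854167, 15.000000, 137.869200, 546.503300],
})

# Assumption: share is proportional to expenditure / income
ratio = rows["expenditure"] / rows["income"]
correlation = ratio.corr(rows["share"])

print("Correlation between expenditure/income and share: %.4f" % correlation)
```

A correlation close to 1 on these rows is consistent with `share` being derived from `expenditure`, which would let the leak back in if only `expenditure` were dropped.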

We would run a model without target leakage as follows:

In [32]:
# Drop leaky predictors from the dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']

X2 = X.drop(potential_leaks, axis=1)

# model evaluation with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())

Cross-val accuracy: 0.829406


In [33]:
data.majorcards.nunique()

2

In [34]:
# Comparison of majorcards with target

majorcard_cardholders = X.majorcards[y]
majorcard_noncardholders = X.majorcards[~y]

print("Fraction of clients who received a card and had majorcards: %.2f" \
     %((majorcard_cardholders == 1).mean()))

print("Fraction of clients who received a card and had no majorcard: %.2f" \
     %((majorcard_cardholders == 0).mean()))

print("Fraction of clients who did not receive a card and had majorcards: %.2f" \
     %((majorcard_noncardholders == 1).mean()))

print("Fraction of clients who did not receive a card and had no majorcard: %.2f" \
     %((majorcard_noncardholders == 0).mean()))

Fraction of clients who received a card and had majorcards: 0.84
Fraction of clients who received a card and had no majorcard: 0.16
Fraction of clients who did not receive a card and had majorcards: 0.74
Fraction of clients who did not receive a card and had no majorcard: 0.26


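Cardholders are somewhat more likely to hold a major card than non-cardholders (84% versus 74%). Because `majorcards` is related to the target and its description does not say when it was recorded, it should be treated as a potential source of target leakage.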
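
One way to gauge how much a single suspect column contributes is to drop the columns one at a time and compare cross-validated accuracy. The sketch below uses synthetic data with a deliberately leaky column. The helper `cv_accuracy_without`, the column names, and the data are all hypothetical, and the numbers it produces say nothing about the AER dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic boolean target, roughly balanced
target = pd.Series(rng.random(n) < 0.5)

toy_X = pd.DataFrame({
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    # Leaky by construction: positive only when the target is True
    "leaky": np.where(target, rng.uniform(1, 100, size=n), 0.0),
})

def cv_accuracy_without(X, y, columns):
    """Mean 5-fold CV accuracy after dropping the given columns."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    scores = cross_val_score(model, X.drop(columns=columns), y,
                             cv=5, scoring='accuracy')
    return scores.mean()

baseline = cv_accuracy_without(toy_X, target, [])
ablation = {name: cv_accuracy_without(toy_X, target, [name])
            for name in toy_X.columns}

print("All columns: %.3f" % baseline)
for name, score in ablation.items():
    print("Without %s: %.3f" % (name, score))
```

A large drop when one column is removed shows that the model depends heavily on it. That alone does not prove leakage, so the column's definition and collection time still need to be checked.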