# Credit Card Fraud Detection

- [How to set up this project](https://github.com/rurumimic/credit-card-fraud-detection)
- Kaggle Datasets: [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud)
- Reference: [Dealing with Imbalanced Datasets](https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets) by Janio Martinez Bachmann

---

## About the project

### Background

- Credit card fraud continues to occur.
- With current audit systems, auditors only discover fraud after it has already occurred.

### Goal

- Detect and prevent fraud
- Prevent bank losses and protect the bank's reputation
- Increase audit efficiency and establish an integrated inspection management system
- Establish a risk management system

### Differences from the actual project

![](process.png)

- In the actual project, we used real data from the bank's audit team when developing the machine learning model.
- In this example, we'll use Kaggle's credit card fraud detection data instead of real data.
- This example shows one of the many models used.

---

## Install Packages

- numpy
- pandas
- scikit-learn
- scikit-optimize
- xgboost
- shap

```bash
conda install -c anaconda numpy pandas
conda install -c conda-forge scikit-learn scikit-optimize xgboost shap
```

## Datasets

Differences from the real datasets:

- Feature data type
   - All Kaggle features are numeric; V1 to V28 are the result of a PCA transformation.
   - Real data has a lot of categorical data.
- Number of features
   - Kaggle data contains 28 anonymized columns (V1 to V28) plus `Time` and `Amount`.
   - Real data has many more features.

Therefore, the model in this example is very different from the models that banks actually use.


---
## Preprocess

### Load datasets

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('creditcard.csv', sep=',')

In [3]:
df.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


- Time: Seconds elapsed between this transaction and the first transaction in the dataset
- V1 to V28: Anonymized features (PCA components)
- Amount: Transaction amount
- Class
   - 0: normal
   - 1: fraud

### Scaling

- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) is less sensitive to outliers than scalers based on the mean and standard deviation.
- This scaler removes the median and scales the data according to a quantile range (by default the interquartile range, IQR).
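
To make the scaling rule concrete, here is a minimal standard-library sketch of the same idea: subtract the median and divide by the IQR. The helper `robust_scale` and the toy amounts are hypothetical and only illustrate the formula; the notebook itself uses `RobustScaler`.

```python
import statistics

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

# Toy amounts (not from the dataset); 100.0 plays the role of an outlier
amounts = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(amounts)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays large after scaling, but it does not shift the median or the IQR, so the other values keep a sensible scale.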

In [4]:
from sklearn.preprocessing import RobustScaler

rob_scaler = RobustScaler()

Scale `Time` and `Amount`:

In [5]:
df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))

Remove `Time` and `Amount`:

In [6]:
df.drop(['Amount', 'Time'], axis=1, inplace=True)

Change column order:

In [7]:
scaled_amount = df['scaled_amount']
scaled_time = df['scaled_time']

df.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True)

df.insert(0, 'scaled_amount', scaled_amount)
df.insert(1, 'scaled_time', scaled_time)

Check changed data:

In [8]:
df.head()

| | scaled_amount | scaled_time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.783274 | -0.994983 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | ... | 0.251412 | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 0 |
| 1 | -0.269825 | -0.994983 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | ... | -0.069083 | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 0 |
| 2 | 4.983721 | -0.994972 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | ... | 0.52498 | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 0 |
| 3 | 1.418291 | -0.994972 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | ... | -0.208038 | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 0 |
| 4 | 0.670579 | -0.99496 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | ... | 0.408542 | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 0 |


### Imbalanced datasets

In [9]:
print('Normal', round(df['Class'].value_counts()[0]/len(df) * 100, 5), '%')
print('Fraud', round(df['Class'].value_counts()[1]/len(df) * 100, 5), '%')

Normal 99.82725 %
Fraud 0.17275 %


- Fraud accounts for less than 1% of the data, so a model trained on it as-is may be biased toward the normal class.
- We have to deal with the imbalanced data.
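
A short worked example shows why accuracy alone is misleading here. The fraud count follows from the undersampled shape reported in this notebook (984 rows, half of them fraud); the total of 284,807 rows is the size of the Kaggle file and is consistent with the percentages printed above.

```python
total = 284_807   # rows in creditcard.csv
fraud = 492       # rows with Class == 1
normal = total - fraud

# A trivial "model" that labels every transaction as normal
accuracy = normal / total
fraud_recall = 0 / fraud  # it never catches a single fraud case

print(f"accuracy: {accuracy * 100:.5f} %")  # accuracy: 99.82725 %
print(f"fraud recall: {fraud_recall:.1f}")  # fraud recall: 0.0
```

This is why the notebook evaluates the model with the area under the precision-recall curve rather than with accuracy.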

### Random Under-Sampling

- With the real data, we did not use under- or oversampling because sampling made no significant difference.
- With the Kaggle data, we use undersampling.
- The resulting small training dataset also makes it quick to test.

Normal and Fraud:

In [10]:
Normal = df[df['Class'] == 0]
Fraud = df[df['Class'] == 1]

Randomly select normal data:

In [11]:
# Draw as many normal rows as there are fraud rows, without replacement
NormalSample = Normal.sample(n=Fraud.shape[0])

Merge data:

In [12]:
UnderSample = pd.concat([Fraud, NormalSample]).sample(frac=1)

Check the undersampled data:

In [13]:
UnderSample.shape

(984, 31)
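
The undersampling steps above can also be sketched without pandas, which makes the logic easy to check on a toy input. The function `undersample`, its `seed` parameter, and the toy rows are hypothetical; a fixed seed is used here only so that the sample is reproducible.

```python
import random

def undersample(rows, label_key="Class", seed=0):
    """Keep every fraud row and an equally sized random subset of normal rows."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    sampled = rng.sample(majority, k=len(minority))  # without replacement
    balanced = minority + sampled
    rng.shuffle(balanced)  # mix the two classes
    return balanced

# Toy data: 3 fraud rows and 97 normal rows
rows = [{"id": i, "Class": 1 if i < 3 else 0} for i in range(100)]
balanced = undersample(rows)
print(len(balanced))  # 6
```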

### Train and test data

Divide the data in a 7:3 ratio.

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X = UnderSample.drop('Class', axis=1)
y = UnderSample['Class']

In [16]:
# Hold out 30% of the undersampled rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
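
As a quick sanity check on the split sizes: to my understanding, `train_test_split` rounds the test set size up and gives the remaining rows to the training set when only `test_size` is passed. Treat that rounding rule as an assumption; the resulting test size does agree with the total of the confusion matrix reported in this notebook.

```python
import math

n_samples = 984   # rows in UnderSample
test_size = 0.3

n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
n_train = n_samples - n_test

print(n_train, n_test)  # 688 296
```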

---
## Modeling

In [17]:
from xgboost import XGBClassifier
from sklearn.metrics import auc, precision_recall_curve, confusion_matrix

from skopt import gp_minimize
from skopt.space import Space, Real, Integer, Categorical
from skopt.utils import use_named_args

Find the optimal parameters with `gp_minimize`:

In [18]:
dimensions = [
    Real(low=1e-4, high=2e-1, prior='log-uniform', name='learning_rate'),
    Real(low=0, high=1, prior='uniform', name='reg_alpha'),
    Real(low=0, high=1, prior='uniform', name='reg_lambda'),
    Integer(low=20, high=1000, name='n_estimators'), 
    Real(low=1e-9, high=0.5, name='gamma'),
    Integer(low=2, high=10, name='max_depth')
]
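
The `learning_rate` dimension uses a `log-uniform` prior, which spreads samples evenly across orders of magnitude instead of evenly across the raw range. A minimal sketch of that idea, using only the standard library (the helper `sample_log_uniform` is hypothetical and is not how `skopt` is implemented internally):

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Sample uniformly in log space, then map back with exp."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_log_uniform(1e-4, 2e-1, rng) for _ in range(10_000)]

# Share of samples below 0.01; a plain uniform prior would give about 5%
below_0_01 = sum(s < 1e-2 for s in samples) / len(samples)
print(round(below_0_01, 2))
```

With these bounds, roughly 60% of the samples fall below 0.01, so small learning rates are explored far more often than under a uniform prior.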

In [19]:
@use_named_args(dimensions)
def fitness(**params):
    params['eval_metric'] = 'aucpr'
    
    model = XGBClassifier()
    model.set_params(**params)
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)
    
    precision, recall, _ = precision_recall_curve(y_test, y_proba[:, 1])
    auc_pr = auc(recall, precision)
    
    params.pop('eval_metric')
    params['aucpr'] = auc_pr
    
    return 1 - auc_pr

In [None]:
gp_result = gp_minimize(fitness, dimensions, n_jobs = 5, n_calls = 10)
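
`fitness` returns `1 - auc_pr` because `gp_minimize` minimizes its objective, so a higher area under the precision-recall curve means a lower loss. The sketch below computes that area by hand on toy labels and scores. The helpers `pr_points` and `trapezoid_area` are hypothetical, and the sketch assumes all scores are distinct; I believe it then agrees with `precision_recall_curve` followed by `auc`, but treat that as an assumption.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) points, lowering the threshold one sample at a time."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    tp = fp = 0
    points = [(0.0, 1.0)]  # conventional starting point: recall 0, precision 1
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

def trapezoid_area(points):
    """Integrate precision over recall with the trapezoidal rule."""
    return sum((r1 - r0) * (p0 + p1) / 2
               for (r0, p0), (r1, p1) in zip(points, points[1:]))

# Toy labels and predicted fraud probabilities (not from the dataset)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

area = trapezoid_area(pr_points(y_true, scores))
loss = 1 - area  # the value a fitness function like the one above would return
print(round(area, 4), round(loss, 4))  # 0.7917 0.2083
```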

In [21]:
dimension_names = [x.name for x in dimensions]
# Pair each dimension name with the best value found by gp_minimize
best_params = dict(zip(dimension_names, gp_result['x']))

best_params['eval_metric'] = 'aucpr'
best_params['n_jobs'] = 5

In [22]:
best_params

{'learning_rate': 0.010502347160991049,
 'reg_alpha': 0.26304039594088163,
 'reg_lambda': 0.9689289216806913,
 'n_estimators': 412,
 'gamma': 0.21935353045195558,
 'max_depth': 5,
 'eval_metric': 'aucpr',
 'n_jobs': 5}

Use the optimal parameters and create a model:

In [None]:
model = XGBClassifier()
model.set_params(**best_params)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)

precision, recall, _ = precision_recall_curve(y_test, y_proba[:, 1])
auc_pr = auc(recall, precision)
cm = confusion_matrix(y_test, y_proba[:, 1]>0.5)

model.performance_indicator = {}
model.performance_indicator['aucpr'] = auc_pr
model.performance_indicator['confusion_matrix'] = cm
model.train_col = list(X_train.columns)

Check the model performance metrics:

In [24]:
# Rows of cm are the actual classes and columns are the predicted classes
pd.DataFrame({"predict false": [cm[0][0], cm[1][0]],
              "predict true": [cm[0][1], cm[1][1]]})\
.rename(index = {0:'actual false', 1:'actual true'})

| | predict false | predict true |
|---|---|---|
| actual false | 155 | 2 |
| actual true | 10 | 129 |


| | predict false | predict true |
|---|---|---|
| actual false | True Negatives | False Positives <br> (Type I error) |
| actual true | False Negatives <br> (Type II error) | True Positives |
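
The usual summary metrics follow directly from these four cells. The sketch below uses the array that `model.performance_indicator` prints in this notebook, `[[155, 2], [10, 129]]`, read with the scikit-learn convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix as printed by model.performance_indicator
cm = [[155, 2],
      [10, 129]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)  # of the predicted frauds, how many were real
recall = tp / (tp + fn)     # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.3f}")  # precision: 0.985
print(f"recall:    {recall:.3f}")     # recall:    0.928
print(f"f1:        {f1:.3f}")         # f1:        0.956
```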

In [25]:
auc_pr # Area Under Precision Recall Curve

0.9877179994298837

---
## Prediction

### SHAP

In [26]:
import shap
import random

SHAP:

In [27]:
def _shap(model, X):
    explainer = shap.TreeExplainer(model, data=X)
    shap_values = explainer.shap_values(X)
    # Reuse the values computed above instead of running the explainer a second time
    df_shap_values = pd.DataFrame(shap_values, columns = X.columns.values)
    
    df_importances = pd.DataFrame(columns=['fraud column 1', 'fraud value 1', 'fraud weight 1', 
                                           'fraud column 2', 'fraud value 2', 'fraud weight 2', 
                                           'fraud column 3', 'fraud value 3', 'fraud weight 3', 
                                           'fraud column 4', 'fraud value 4', 'fraud weight 4', 
                                           'fraud column 5', 'fraud value 5', 'fraud weight 5', 
                                           'normal column 1', 'normal value 1', 'normal weight 1', 
                                           'normal column 2', 'normal value 2', 'normal weight 2',
                                           'normal column 3', 'normal value 3', 'normal weight 3',
                                           'normal column 4', 'normal value 4', 'normal weight 4',
                                           'normal column 5', 'normal value 5', 'normal weight 5'])

    for i in range(len(X)):
        importance = sorted(zip(X.columns, X.values[i], df_shap_values.iloc[i, :]), key=lambda x: x[2], reverse=True)
        frd_imp = importance[:5]
        importance = sorted(zip(X.columns, X.values[i], -1 * df_shap_values.iloc[i, :]), key=lambda x: x[2], reverse=True)
        norm_imp = importance[:5]

        frd_imp.extend(norm_imp)
        df_importances.loc[i] = list(sum(frd_imp, ()))
        
    df_importances.insert(0, 'index', X.index.values)
    
    return explainer, shap_values, df_importances
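
The ranking step inside `_shap` can be read in isolation: for one transaction, the features with the largest SHAP values push the prediction toward fraud, and the features with the most negative values push it toward normal. A minimal sketch with a hypothetical helper `top_contributors` and made-up SHAP values (the notebook keeps the top five on each side and also stores the feature values):

```python
def top_contributors(names, shap_row, k=5):
    """Return the k strongest pushes toward fraud and toward normal."""
    pairs = list(zip(names, shap_row))
    toward_fraud = sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
    # Negate the values so that "normal" weights are reported as positive numbers
    negated = [(name, -value) for name, value in pairs]
    toward_normal = sorted(negated, key=lambda p: p[1], reverse=True)[:k]
    return toward_fraud, toward_normal

# Made-up SHAP values for a single row (not taken from the model)
names = ["V4", "V10", "V12", "V14", "scaled_amount"]
shap_row = [0.9, -0.4, 0.3, 1.5, -0.05]

fraud_side, normal_side = top_contributors(names, shap_row, k=2)
print(fraud_side)   # [('V14', 1.5), ('V4', 0.9)]
print(normal_side)  # [('V10', 0.4), ('scaled_amount', 0.05)]
```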

Graph:

In [28]:
shap.initjs()

In [29]:
def AdditiveForceVisualizer(data, explainer, shap_values, X):   
    row_id = data.index.values[0]
    data_id = data['index'].values[0]
    data_class = data['isFraud'].values[0]
    data_fraud = "%.1f" % data['Fraud %'].values[0]

    print(f"{data_id} is Normal" if data_class == 0 else f"{data_id} is Fraud")
    print(f"Fraud: {data_fraud}%")
    plt = shap.force_plot(explainer.expected_value, shap_values[row_id,:], X.iloc[row_id,:], show=False)
    return plt

def Visualizer(importances, explainer, shap_values, X, Class_value = 1):
    data = importances.loc[importances['isFraud'] == Class_value]
    # randrange excludes len(data); randint(0, len(data)) could return an out-of-range position
    random_index = random.randrange(len(data))
    data = data.iloc[[random_index]]
    return AdditiveForceVisualizer(data, explainer, shap_values, X), data

def FactorTable(data):
    return pd.DataFrame({
        'fraud column': [data['fraud column 1'].values[0], data['fraud column 2'].values[0], data['fraud column 3'].values[0], data['fraud column 4'].values[0], data['fraud column 5'].values[0]],
        'fraud value': [data['fraud value 1'].values[0], data['fraud value 2'].values[0], data['fraud value 3'].values[0], data['fraud value 4'].values[0], data['fraud value 5'].values[0]],
        'fraud weight': [data['fraud weight 1'].values[0], data['fraud weight 2'].values[0], data['fraud weight 3'].values[0], data['fraud weight 4'].values[0], data['fraud weight 5'].values[0]],
        'normal column': [data['normal column 1'].values[0], data['normal column 2'].values[0], data['normal column 3'].values[0], data['normal column 4'].values[0], data['normal column 5'].values[0]],
        'normal value': [data['normal value 1'].values[0], data['normal value 2'].values[0], data['normal value 3'].values[0], data['normal value 4'].values[0], data['normal value 5'].values[0]],
        'normal weight': [data['normal weight 1'].values[0], data['normal weight 2'].values[0], data['normal weight 3'].values[0], data['normal weight 4'].values[0], data['normal weight 5'].values[0]],
    }).rename(index = {0:'1', 1:'2', 2:'3', 3:'4', 4:'5'})
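
A force plot is easier to read with the additive property of SHAP in mind: the base value plus the sum of a row's SHAP values gives the model output for that row. The sketch below assumes the explainer reports values in log-odds (raw margin) space, which I believe is the default for a tree explainer on a binary XGBoost classifier; treat that as an assumption. The helper `fraud_probability` and the numbers are hypothetical.

```python
import math

def fraud_probability(expected_value, shap_row):
    """Add the contributions to the base value, then apply the logistic function."""
    margin = expected_value + sum(shap_row)
    return 1.0 / (1.0 + math.exp(-margin))

# Made-up base value and contributions (not taken from the model)
p = fraud_probability(0.0, [1.5, 0.9, -0.4])
print(round(p, 4))  # 0.8808
```

Positive contributions raise the fraud probability and negative ones lower it, which matches the "fraud weight" and "normal weight" columns built in `_shap`.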

### Predict

In [30]:
y_pred = model.predict_proba(X_test)

In [31]:
explainer, shap_values, importances = _shap(model, X_test)

In [32]:
importances.insert(1, 'isFraud', y_test.values)
importances.insert(2, 'Fraud %', (y_pred[:,1] * 100).round(1))
importances.head()

| | index | isFraud | Fraud % | fraud column 1 | fraud value 1 | fraud weight 1 | fraud column 2 | fraud value 2 | fraud weight 2 | fraud column 3 | ... | normal weight 2 | normal column 3 | normal value 3 | normal weight 3 | normal column 4 | normal value 4 | normal weight 4 | normal column 5 | normal value 5 | normal weight 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10897 | 1 | 99.0 | V14 | -14.666389 | 1.521694 | V4 | 11.165526 | 0.830132 | V10 | ... | 0.04296 | V28 | -1.178063 | 0.019617 | V13 | 0.345179 | 0.002714 | V25 | 0.995271 | 0.001054 |
| 1 | 264487 | 0 | 2.2 | V12 | -1.331926 | 0.12492 | V8 | -0.079729 | 0.11885 | V23 | ... | 1.148552 | V10 | 0.400928 | 0.780538 | V17 | 0.172528 | 0.195458 | V11 | -1.485025 | 0.173763 |
| 2 | 223750 | 0 | 39.5 | V4 | 4.432757 | 1.239237 | V7 | 2.619744 | 0.268762 | V18 | ... | 0.750119 | V10 | 0.769091 | 0.247326 | V8 | 0.804989 | 0.201418 | V16 | 0.496301 | 0.125464 |
| 3 | 115258 | 0 | 7.4 | V4 | 2.385177 | 0.814506 | V8 | -0.123559 | 0.226464 | V21 | ... | 0.869566 | V11 | -0.776345 | 0.415647 | V10 | 0.672238 | 0.357633 | V17 | -0.694183 | 0.260896 |
| 4 | 83417 | 1 | 99.0 | V14 | -6.233044 | 1.637585 | V10 | -3.252634 | 0.91946 | V4 | ... | 0.041765 | V2 | -0.364223 | 0.026712 | scaled_amount | -0.224831 | 0.018689 | V20 | 0.019626 | 0.0123 |


### Results

#### Normal Prediction 1

In [33]:
normal_graph, normal_data = Visualizer(importances, explainer, shap_values, X_test, 0)
normal_graph

265064 is Normal
Fraud: 2.5%


In [34]:
FactorTable(normal_data)

| | fraud column | fraud value | fraud weight | normal column | normal value | normal weight |
|---|---|---|---|---|---|---|
| 1 | V8 | 0.133578 | 0.115034 | V14 | -0.074863 | 1.191437 |
| 2 | V18 | 0.508021 | 0.081165 | V12 | 0.952566 | 0.752598 |
| 3 | V22 | 0.765636 | 0.041775 | V4 | 0.308 | 0.571277 |
| 4 | V24 | -0.33741 | 0.023267 | V10 | 0.168784 | 0.542509 |
| 5 | V3 | -0.313453 | 0.005547 | V17 | -0.793385 | 0.217275 |


#### Normal Prediction 2

In [35]:
normal_graph, normal_data = Visualizer(importances, explainer, shap_values, X_test, 0)
normal_graph

30113 is Normal
Fraud: 1.7%


In [36]:
FactorTable(normal_data)

| | fraud column | fraud value | fraud weight | normal column | normal value | normal weight |
|---|---|---|---|---|---|---|
| 1 | V8 | 0.196714 | 0.106228 | V14 | 0.346724 | 1.318384 |
| 2 | V11 | 1.730901 | 0.029039 | V4 | -0.151329 | 0.776884 |
| 3 | V9 | -0.697704 | 0.023541 | V12 | 0.878358 | 0.765717 |
| 4 | V2 | 0.965899 | 0.008261 | V10 | -0.192583 | 0.586084 |
| 5 | scaled_time | -0.574654 | 0.004383 | V17 | -0.300795 | 0.19224 |


#### Fraud Prediction 1

In [37]:
fraud_graph, fraud_data = Visualizer(importances, explainer, shap_values, X_test, 1)
fraud_graph

167305 is Fraud
Fraud: 98.2%


In [38]:
FactorTable(fraud_data)

| | fraud column | fraud value | fraud weight | normal column | normal value | normal weight |
|---|---|---|---|---|---|---|
| 1 | V14 | -8.490813 | 1.505348 | V10 | 0.56668 | 0.355581 |
| 2 | V4 | 6.081321 | 0.995539 | scaled_amount | 1.172221 | 0.054561 |
| 3 | V12 | -4.938284 | 0.443508 | V26 | 0.466165 | 0.051041 |
| 4 | V20 | -1.118687 | 0.412202 | V5 | -1.636071 | 0.028655 |
| 5 | V17 | -3.093013 | 0.195621 | V6 | 0.50061 | 0.001199 |


#### Fraud Prediction 2

In [39]:
fraud_graph, fraud_data = Visualizer(importances, explainer, shap_values, X_test, 1)
fraud_graph

8312 is Fraud
Fraud: 99.0%


In [40]:
FactorTable(fraud_data)

| | fraud column | fraud value | fraud weight | normal column | normal value | normal weight |
|---|---|---|---|---|---|---|
| 1 | V14 | -9.63469 | 1.524121 | V26 | 0.521884 | 0.053495 |
| 2 | V4 | 6.094141 | 0.930637 | V20 | 0.440439 | 0.007345 |
| 3 | V10 | -5.153095 | 0.766357 | V13 | 1.371819 | 0.007026 |
| 4 | V12 | -7.839539 | 0.439015 | V21 | 0.149896 | 0.006568 |
| 5 | scaled_amount | -0.29344 | 0.178986 | V7 | -0.591118 | 0.005215 |


#### All predictions

In [41]:
shap.force_plot(explainer.expected_value, shap_values, X_test)

---
## About the model

In [42]:
model

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='aucpr',
              gamma=0.21935353045195558, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.010502347160991049,
              max_delta_step=0, max_depth=5, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=412, n_jobs=5,
              num_parallel_tree=1, random_state=0,
              reg_alpha=0.26304039594088163, reg_lambda=0.9689289216806913,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              validate_parameters=1, verbosity=None)

### Model performance

In [43]:
model.performance_indicator

{'aucpr': 0.9877179994298837,
 'confusion_matrix': array([[155,   2],
        [ 10, 129]])}

### Prediction Results

In [44]:
pd.DataFrame(y_pred).set_index(X_test.index).head()

| | 0 | 1 |
|---|---|---|
| 10897 | 0.010107 | 0.989893 |
| 264487 | 0.977587 | 0.022413 |
| 223750 | 0.605342 | 0.394658 |
| 115258 | 0.925676 | 0.074324 |
| 83417 | 0.009993 | 0.990007 |


---
## Save the model and predictions

In [45]:
from joblib import dump

In [46]:
dump(model, 'model')
pd.DataFrame(y_pred).set_index(X_test.index).to_csv('predictions.csv', mode='w')
importances.to_csv('importances.csv', mode='w')