This notebook contains two main parts: feature engineering and model training.

**Feature engineering.** Based on the EDA, the feature engineering is as follows:
- Keep only transactions whose `transac_type` is "TRANSFER" or "CASH_OUT". In the given data, only these two transaction types contain fraudulent entries.
    - As a result, the classifier only covers "TRANSFER" and "CASH_OUT" transactions. If fraud were discovered in other transaction types (e.g. by manual flagging), the feature engineering would need to be revised.
- Remove `src_acc` and `dst_acc`. These account numbers carry no quantitative value.
- Remove `is_flagged_fraud`. According to the EDA, this feature has almost no variance (only 16 rows out of ~6M have a value set).
- Add `hour_of_day` and `day_of_month`. The EDA suggests that these two features carry some useful signal.
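
A minimal sketch of how the two time features can be derived, assuming `time_ind` is an hourly step counter (an assumption, since the data dictionary is not shown here) and that the day index should be floored. The helper name `time_features` is illustrative and is not part of the notebook.

```python
def time_features(time_ind: int) -> tuple[int, int]:
    """Map an hourly step counter to (day_of_month, hour_of_day)."""
    # divmod gives the number of whole days elapsed and the hour within the day.
    days_elapsed, hour_of_day = divmod(time_ind, 24)
    # Days are 1-based, hours stay in the range 0..23.
    return days_elapsed + 1, hour_of_day


for step in (1, 23, 24, 25, 48):
    day, hour = time_features(step)
    print(f"time_ind={step:>3} -> day_of_month={day}, hour_of_day={hour}")
```

Flooring keeps every hour of the same 24-hour window on the same day, whereas rounding would move the second half of each day to the next one.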

**Model training.** The task is formulated as a binary classification task. This part is done as follows:
1. After the raw data is transformed according to the feature engineering strategy, the data is split into train, dev, and test sets.
    - The proportions are approximately 70%, 9%, and 21% for train, dev, and test, respectively (30% is held out first, then split 30/70 into dev and test).
    - The train and dev sets are used for manual baseline experiments.
    - The train and dev sets are combined for hyperparameter tuning (because cross-validation is used there).
    - The test set will only be used for reporting the final performance.
2. Two baseline models, a logistic regression and a decision tree, are trained.
    - Logistic regression is chosen purely for its simplicity as a baseline.
    - The decision tree classifier is selected under the following hypothesis: a decision tree might handle the class imbalance better. Unlike logistic regression, a decision tree can partition the feature space to isolate rare patterns, especially if fraudulent transactions follow some distinct rules.
3. The decision tree performs better on the dev set, so it proceeds to hyperparameter tuning with randomized-search CV.
4. Lastly, the best model, the standardization scaler, and the feature list are saved locally to the `../models` directory.
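
The split is done in two stages: the two `train_test_split` calls in this notebook use `test_size=0.3` and then `test_size=0.7` on the held-out part. The sketch below only works out the resulting fractions; the variable names are illustrative.

```python
# Two-stage split fractions, mirroring the test_size values used in this notebook.
holdout_fraction = 0.3        # first split: share held out from training
test_share_of_holdout = 0.7   # second split: share of the held-out part used as test

train_fraction = 1 - holdout_fraction
test_fraction = holdout_fraction * test_share_of_holdout
dev_fraction = holdout_fraction * (1 - test_share_of_holdout)

for name, frac in [("train", train_fraction), ("dev", dev_fraction), ("test", test_fraction)]:
    print(f"{name}: {frac:.0%}")
```

So the effective proportions are about 70% train, 9% dev, and 21% test.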

# 1. Feature engineering

In [1]:
import os
import duckdb
import pandas as pd
import pickle
import json
import numpy as np


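DATA_PATH = os.path.join('..', 'data', 'fraud_mock.csv')

# NOTE: in current DuckDB versions `/` is floating-point division and casting to
# INTEGER rounds to the nearest integer, so day_of_month can reach 32 (see the
# output below). Use `time_ind // 24` instead if a floored day index is intended.
df = duckdb.sql(f"""
SELECT 
    transac_type, 
    amount,
    src_bal,
    src_new_bal,
    dst_bal,
    dst_new_bal,
    is_fraud,
    CAST((time_ind / 24) + 1 AS INTEGER) AS day_of_month,
    MOD(time_ind, 24) AS hour_of_day,
FROM '{DATA_PATH}'
WHERE transac_type IN ('TRANSFER', 'CASH_OUT')
"""
).fetchdf()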

df

Unnamed: 0,transac_type,amount,src_bal,src_new_bal,dst_bal,dst_new_bal,is_fraud,day_of_month,hour_of_day
0,TRANSFER,181.00,181.00,0.0,0.00,0.00,1,1,1
1,CASH_OUT,181.00,181.00,0.0,21182.00,0.00,1,1,1
2,CASH_OUT,229133.94,15325.00,0.0,5083.00,51513.44,0,1,1
3,TRANSFER,215310.30,705.00,0.0,22425.00,0.00,0,1,1
4,TRANSFER,311685.89,10835.00,0.0,6267.00,2719172.89,0,1,1
...,...,...,...,...,...,...,...,...,...
2770404,CASH_OUT,339682.13,339682.13,0.0,0.00,339682.13,1,32,23
2770405,TRANSFER,6311409.28,6311409.28,0.0,0.00,0.00,1,32,23
2770406,CASH_OUT,6311409.28,6311409.28,0.0,68488.84,6379898.11,1,32,23
2770407,TRANSFER,850002.52,850002.52,0.0,0.00,0.00,1,32,23


In [2]:
# After filtering, this column contains only two values, so encode it as a
# binary integer column: TRANSFER -> 1, CASH_OUT -> 0.
df['transac_type'] = (df['transac_type'] == 'TRANSFER').astype(int)
df

Unnamed: 0,transac_type,amount,src_bal,src_new_bal,dst_bal,dst_new_bal,is_fraud,day_of_month,hour_of_day
0,1,181.00,181.00,0.0,0.00,0.00,1,1,1
1,0,181.00,181.00,0.0,21182.00,0.00,1,1,1
2,0,229133.94,15325.00,0.0,5083.00,51513.44,0,1,1
3,1,215310.30,705.00,0.0,22425.00,0.00,0,1,1
4,1,311685.89,10835.00,0.0,6267.00,2719172.89,0,1,1
...,...,...,...,...,...,...,...,...,...
2770404,0,339682.13,339682.13,0.0,0.00,339682.13,1,32,23
2770405,1,6311409.28,6311409.28,0.0,0.00,0.00,1,32,23
2770406,0,6311409.28,6311409.28,0.0,68488.84,6379898.11,1,32,23
2770407,1,850002.52,850002.52,0.0,0.00,0.00,1,32,23


# 2. Model Training

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import tree

X = df.drop(columns=['is_fraud'])
features = X.columns.tolist()

X = X.to_numpy()
y = df['is_fraud'].to_numpy()

X_train, X_test_dev, y_train, y_test_dev = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

X_dev, X_test, y_dev, y_test = train_test_split(
    X_test_dev,
    y_test_dev,
    test_size=0.7,
    random_state=42,
    stratify=y_test_dev
)


scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_dev_scaled = scaler.transform(X_dev)
X_test_scaled = scaler.transform(X_test)
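
The cell above fits the scaler on the training split only and reuses the fitted statistics for dev and test. A minimal pure-Python sketch of that idea for a single feature follows. The numbers are a few `amount` values from the table above, assigned to "train" and "dev" arbitrarily for illustration, and `standardize` is a hypothetical helper. Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev

train = [181.0, 229133.94, 215310.30, 311685.89]  # illustrative assignment only
dev = [339682.13, 850002.52]

# "Fit": estimate mean and population standard deviation on the training split only.
mean = fmean(train)
std = pstdev(train)


def standardize(values):
    """Apply the training statistics to any split."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train)
dev_scaled = standardize(dev)  # reuses the training mean and std, no refitting

print(train_scaled)
print(dev_scaled)
```

Only the training split ends up with exactly zero mean and unit variance; dev and test are shifted by whatever the training statistics imply, which is what keeps information from those splits out of the fitting step.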

In [4]:

summary = pd.DataFrame({
    'dataset': ['train', 'dev', 'test'],
    'negatives': [
        (y_train == 0).sum(),
        (y_dev == 0).sum(),
        (y_test == 0).sum()
    ],
    'positives': [
        (y_train == 1).sum(),
        (y_dev == 1).sum(),
        (y_test == 1).sum()
    ]
})

summary

Unnamed: 0,dataset,negatives,positives
0,train,1933537,5749
1,dev,248597,739
2,test,580062,1725
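
As a quick check on the table above, the sketch below recomputes the positive (fraud) rate per split from the printed counts. Because both splits use `stratify`, the rates should be nearly identical.

```python
# (negatives, positives) per split, copied from the summary table above.
counts = {
    "train": (1933537, 5749),
    "dev": (248597, 739),
    "test": (580062, 1725),
}

rates = {}
for name, (neg, pos) in counts.items():
    rates[name] = pos / (neg + pos)
    print(f"{name}: {rates[name]:.4%} positive")
```

All three splits come out at roughly 0.3% positives, which is the class imbalance the models below have to cope with.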


In [5]:
def report(pred, gold):
    """
    This util function prints confusion matrix and the sklearn's classification report 
    """
    print("Confusion Matrix:")
    print(confusion_matrix(gold, pred))
    print("\nClassification Report:")
    print(classification_report(gold, pred))

## 2.1) Baseline: Logistic regression

In [6]:
# baseline model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

y_dev_pred = model.predict(X_dev_scaled)
report(pred=y_dev_pred, gold=y_dev)

Confusion Matrix:
[[248562     35]
 [   375    364]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    248597
           1       0.91      0.49      0.64       739

    accuracy                           1.00    249336
   macro avg       0.96      0.75      0.82    249336
weighted avg       1.00      1.00      1.00    249336
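
To make the report easier to read, the sketch below recomputes the positive-class metrics directly from the confusion matrix printed above. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions); `positive_class_metrics` is an illustrative helper, not part of the notebook.

```python
def positive_class_metrics(cm):
    """Return (precision, recall, f1) for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


cm_logreg = [[248562, 35], [375, 364]]  # logistic regression on the dev set
precision, recall, f1 = positive_class_metrics(cm_logreg)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The low recall is the main weakness of this baseline: 375 of the 739 fraudulent dev transactions are missed.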



## 2.2) Baseline: Decision tree

In [7]:
model = tree.DecisionTreeClassifier(random_state=42, class_weight='balanced')
model.fit(X_train_scaled, y_train)

y_dev_pred = model.predict(X_dev_scaled)
report(pred=y_dev_pred, gold=y_dev)

Confusion Matrix:
[[248521     76]
 [   115    624]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    248597
           1       0.89      0.84      0.87       739

    accuracy                           1.00    249336
   macro avg       0.95      0.92      0.93    249336
weighted avg       1.00      1.00      1.00    249336
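
The tree above is trained with `class_weight='balanced'`. scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to the training counts from the summary table, as a rough illustration of how strongly fraud rows are up-weighted.

```python
n_neg, n_pos = 1933537, 5749  # training-set counts from the summary table
n_samples = n_neg + n_pos
n_classes = 2

# "balanced" heuristic: weight is inversely proportional to class frequency.
weight_neg = n_samples / (n_classes * n_neg)
weight_pos = n_samples / (n_classes * n_pos)

print(f"weight for class 0: {weight_neg:.3f}")
print(f"weight for class 1: {weight_pos:.3f}")
```

With these weights both classes contribute the same total weight to the impurity computation, even though fraud makes up only about 0.3% of the rows.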



## 2.3) Decision tree feature importance

In [8]:
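# Keep (feature, importance) pairs in a list: using the importance as a dict key
# would silently drop features that share the same importance value.
feature_importance = sorted(
    zip(features, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

print("Feature Importance:")
for feature, importance in feature_importance:
    print(f"{importance:.3f}: {feature}")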

Feature Importance:
0.558: src_bal
0.181: amount
0.118: dst_new_bal
0.102: src_new_bal
0.014: transac_type
0.014: hour_of_day
0.010: day_of_month
0.003: dst_bal
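
A small sketch that accumulates the importances printed above (rounded to three decimals, so the total is only approximately 1) to show how much of the tree's splitting is explained by the top features.

```python
# (feature, importance) pairs copied from the output above.
importances = [
    ("src_bal", 0.558),
    ("amount", 0.181),
    ("dst_new_bal", 0.118),
    ("src_new_bal", 0.102),
    ("transac_type", 0.014),
    ("hour_of_day", 0.014),
    ("day_of_month", 0.010),
    ("dst_bal", 0.003),
]

running = 0.0
cumulative = []
for name, value in importances:
    running += value
    cumulative.append((name, round(running, 3)))
    print(f"{name:<13} {value:.3f}  cumulative={running:.3f}")
```

The four balance and amount features account for about 96% of the total importance, while the two engineered time features contribute about 2.4% combined.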


## 2.4) Hyper-parameter tuning

In [9]:
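# NOTE: in scikit-learn, 'entropy' and 'log_loss' both use the Shannon
# information gain, so those two criteria are equivalent for this classifier.
params = {
    'criterion': ['entropy', 'log_loss', 'gini'],
    'max_depth': [None, 20],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 4],
    'class_weight': ['balanced']
}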

search = RandomizedSearchCV(estimator=tree.DecisionTreeClassifier(), 
                            param_distributions=params, 
                            n_iter=10,
                            cv=5,
                            scoring='f1',
                            verbose=4)

# can combine train and dev set, since the validation is done in CV
X_train_cv = np.concatenate([X_train_scaled, X_dev_scaled])
y_train_cv = np.concatenate([y_train, y_dev])

search.fit(X_train_cv, y_train_cv)
print(search.best_estimator_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END class_weight=balanced, criterion=entropy, max_depth=20, min_samples_leaf=4, min_samples_split=10;, score=0.822 total time=  21.4s
[CV 2/5] END class_weight=balanced, criterion=entropy, max_depth=20, min_samples_leaf=4, min_samples_split=10;, score=0.841 total time=  19.7s
[CV 3/5] END class_weight=balanced, criterion=entropy, max_depth=20, min_samples_leaf=4, min_samples_split=10;, score=0.837 total time=  22.5s
[CV 4/5] END class_weight=balanced, criterion=entropy, max_depth=20, min_samples_leaf=4, min_samples_split=10;, score=0.831 total time=  20.2s
[CV 5/5] END class_weight=balanced, criterion=entropy, max_depth=20, min_samples_leaf=4, min_samples_split=10;, score=0.851 total time=  20.3s
[CV 1/5] END class_weight=balanced, criterion=log_loss, max_depth=None, min_samples_leaf=4, min_samples_split=10;, score=0.842 total time=  19.1s
[CV 2/5] END class_weight=balanced, criterion=log_loss, max_depth=None, min_sa
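
For context on the search above, the sketch below counts how many configurations the parameter grid contains, so the `n_iter=10` budget can be compared against it. It also counts the distinct configurations if `log_loss` is treated as the same criterion as `entropy`, which is how scikit-learn documents them for decision trees.

```python
from itertools import product

params = {
    "criterion": ["entropy", "log_loss", "gini"],
    "max_depth": [None, 20],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 4],
    "class_weight": ["balanced"],
}
n_iter = 10

# Every combination of the listed values.
all_combinations = list(product(*params.values()))

# Collapse 'log_loss' into 'entropy' to count effectively distinct configurations.
distinct = {
    tuple("entropy" if value == "log_loss" else value for value in combo)
    for combo in all_combinations
}

print(f"grid size: {len(all_combinations)}")
print(f"distinct configurations: {len(distinct)}")
print(f"sampled by the search: {n_iter}")
```

Since the grid is this small, an exhaustive `GridSearchCV` would also be feasible if the extra fitting time is acceptable.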

Performance of the best model over the test set

In [10]:
print(f"{search.best_params_=}")

search.best_params_={'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None, 'criterion': 'log_loss', 'class_weight': 'balanced'}


In [11]:
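# Refit the best configuration on train + dev before the final evaluation.
# (With the default refit=True, search.best_estimator_ is already fitted on the same data.)
best_model = tree.DecisionTreeClassifier(**search.best_params_)
best_model.fit(X_train_cv, y_train_cv)

# The test set is used only here, for the final report.
y_test_pred = best_model.predict(X_test_scaled)
report(pred=y_test_pred, gold=y_test)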

Confusion Matrix:
[[579885    177]
 [   237   1488]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    580062
           1       0.89      0.86      0.88      1725

    accuracy                           1.00    581787
   macro avg       0.95      0.93      0.94    581787
weighted avg       1.00      1.00      1.00    581787
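
The accuracy of 1.00 in the report above says little on its own under this class imbalance. The sketch below, using only the printed confusion matrix, compares the model's accuracy with that of a trivial classifier that always predicts "not fraud".

```python
cm_test = [[579885, 177], [237, 1488]]  # [[TN, FP], [FN, TP]] on the test set
(tn, fp), (fn, tp) = cm_test
total = tn + fp + fn + tp

model_accuracy = (tn + tp) / total
# A classifier that always predicts class 0 gets every negative right.
always_negative_accuracy = (tn + fp) / total
false_positive_rate = fp / (fp + tn)

print(f"model accuracy:            {model_accuracy:.4f}")
print(f"always-negative accuracy:  {always_negative_accuracy:.4f}")
print(f"false positive rate:       {false_positive_rate:.5f}")
```

Both accuracies round to 1.00, which is why precision, recall, and F1 on the positive class are the more informative numbers here.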



In [12]:
model_dir = os.path.join('..', 'models')
os.makedirs(model_dir, exist_ok=True)  # create the output directory if it is missing

with open(os.path.join(model_dir, 'model.pkl'), 'wb') as f:
    pickle.dump(best_model, f)

with open(os.path.join(model_dir, 'scaler.pkl'), 'wb') as f:
    pickle.dump(scaler, f)

with open(os.path.join(model_dir, 'features.json'), 'w') as f:
    json.dump(features, f)
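
A hedged sketch of how the three saved artifacts could be loaded again for inference. `load_artifacts` and `to_row` are hypothetical helpers, not part of this notebook, and the demo writes placeholder objects to a temporary directory instead of touching the real files.

```python
import json
import os
import pickle
import tempfile


def load_artifacts(model_dir):
    """Load the model, scaler, and feature list from a directory."""
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        model = pickle.load(f)
    with open(os.path.join(model_dir, "scaler.pkl"), "rb") as f:
        scaler = pickle.load(f)
    with open(os.path.join(model_dir, "features.json")) as f:
        features = json.load(f)
    return model, scaler, features


def to_row(record, features):
    """Order a dict of feature values the same way as during training."""
    return [record[name] for name in features]


# Demo with placeholder objects in a temporary directory (not the real artifacts).
with tempfile.TemporaryDirectory() as tmp:
    for filename, obj in [("model.pkl", "placeholder-model"),
                          ("scaler.pkl", "placeholder-scaler")]:
        with open(os.path.join(tmp, filename), "wb") as f:
            pickle.dump(obj, f)
    with open(os.path.join(tmp, "features.json"), "w") as f:
        json.dump(["transac_type", "amount"], f)

    model, scaler, features = load_artifacts(tmp)

row = to_row({"amount": 181.0, "transac_type": 1}, features)
print(features, row)
```

With the real artifacts, a row ordered this way would go through `scaler.transform` and then `model.predict`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.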