## This notebook is a fraud-detection demo on synthetic transaction data.

### It was created as a warm-up exercise for AML/PhD preparation.

#### Loading the synthetic dataset from 'PaySim'

In [5]:
import pandas as pd
import numpy as np

df_synth = pd.read_csv("data/PS_20174392719_1491204439457_log.csv")
df_synth.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


#### Inspecting the dataset (General Inspection)

In [8]:
df_synth.shape

(6362620, 11)

In [10]:
df_synth.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [12]:
df_synth.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0


#### Identifying columns and computing Fraud rates

In [17]:
df_synth.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

In [19]:
# Computing the Fraud Rate or Mean value:

df_synth["isFraud"].mean()

0.001290820448180152

### Overall fraud rate

- The overall fraud rate is about 0.129%, i.e. roughly 13 fraud cases per 10,000 transactions.
- This is highly imbalanced and realistic for financial crime data, where suspicious events are rare but critical to detect.
- Because fraud is so rare, models trained on this data will see overwhelmingly many non‑fraud examples, so recall and false‑positive rates will matter more than raw accuracy.
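To make the accuracy point concrete, the short sketch below works out what a trivial "always non-fraud" rule would score. It only reuses the fraud rate printed above; the variable names are illustrative and not part of the notebook.

```python
# Fraud rate as printed by df_synth["isFraud"].mean() in the cell above.
fraud_rate = 0.001290820448180152

# A rule that labels every transaction as non-fraud is correct whenever the
# transaction is legitimate, so its accuracy equals 1 - fraud_rate.
baseline_accuracy = 1.0 - fraud_rate
frauds_per_10k = fraud_rate * 10_000

print(f"Frauds per 10,000 transactions: {frauds_per_10k:.1f}")
print(f"Accuracy of always predicting non-fraud: {baseline_accuracy:.4%}")
print("Fraud recall of that rule: 0 (it never raises an alert)")
```

Any model evaluated on the full dataset therefore has to be judged against this baseline rather than against 50%.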

In [22]:
df_synth["type"].value_counts()

type
CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: count, dtype: int64

In [24]:
df_synth.groupby("type")["isFraud"].sum()

type
CASH_IN        0
CASH_OUT    4116
DEBIT          0
PAYMENT        0
TRANSFER    4097
Name: isFraud, dtype: int64

### Fraud distribution by transaction type

- Only CASH_OUT and TRANSFER transactions contain fraud labels in this dataset.
- This mirrors many AML scenarios, where outgoing transfers and cash‑outs are the main channels for moving illicit funds.
- In later modelling, we will pay special attention to distinguishing normal vs fraudulent CASH_OUT and TRANSFER transactions, while treating other types differently.

In [27]:
fraud_rate_by_type = df_synth.groupby("type")["isFraud"].mean()
fraud_rate_by_type

type
CASH_IN     0.000000
CASH_OUT    0.001840
DEBIT       0.000000
PAYMENT     0.000000
TRANSFER    0.007688
Name: isFraud, dtype: float64

### Fraud rate by type

- CASH_OUT and TRANSFER have non-zero fraud rates; other types have zero.
- This suggests scenario generators and training material should emphasise these transaction types to reflect where risk actually lies.

### Context:

- PaySim is a synthetic mobile money fraud dataset. From here on we focus on CASH_OUT and TRANSFER transactions only; fraud prevalence in this subset is ~0.30%.
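The per-type rates and the ~0.30% subset prevalence can be re-derived by hand from the counts printed earlier. The sketch below does that arithmetic in plain Python; the dictionaries simply copy the `value_counts()` and `groupby(...).sum()` outputs, and their names are illustrative.

```python
# Counts copied from the value_counts() and groupby("type")["isFraud"].sum()
# outputs above (only the two types that contain any fraud).
transactions = {"CASH_OUT": 2_237_500, "TRANSFER": 532_909}
frauds = {"CASH_OUT": 4_116, "TRANSFER": 4_097}

# Fraud rate within each transaction type.
rate_by_type = {t: frauds[t] / transactions[t] for t in transactions}

# Fraud prevalence in the combined CASH_OUT + TRANSFER subset.
subset_rate = sum(frauds.values()) / sum(transactions.values())

for t, rate in rate_by_type.items():
    print(f"{t}: {rate:.6f}")
print(f"CASH_OUT + TRANSFER subset: {subset_rate:.4%}")
```

Restricting to these two types roughly doubles the prevalence compared with the full dataset, but the problem remains heavily imbalanced.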

### Importing necessary libraries...

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

### Filtering to CASH_OUT and TRANSFER and building features

In [35]:
df_synth_ct = df_synth[df_synth["type"].isin(["CASH_OUT", "TRANSFER"])].copy()
df_synth_ct["type_code"] = (df_synth_ct["type"] == "TRANSFER").astype(int)

features = ["amount", "oldbalanceOrg", "newbalanceOrig", "type_code"]
X = df_synth_ct[features]
Y = df_synth_ct["isFraud"]

Y.value_counts(normalize=True)

isFraud
0    0.997035
1    0.002965
Name: proportion, dtype: float64

### Train/test split and logistic regression

In [40]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)

print(confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred, digits=4))

[[552381     58]
 [   958    685]]
              precision    recall  f1-score   support

           0     0.9983    0.9999    0.9991    552439
           1     0.9219    0.4169    0.5742      1643

    accuracy                         0.9982    554082
   macro avg     0.9601    0.7084    0.7866    554082
weighted avg     0.9980    0.9982    0.9978    554082



### Interpretation of the Confusion Matrix:

- True negatives (TN, legitimate transactions correctly classified as legitimate): 552,381.

- False positives (FP, legitimate transactions flagged as fraud): 58.

- False negatives (FN, frauds missed): 958.

- True positives (TP, frauds correctly flagged): 685.

- So out of 1,643 frauds, the model catches 685 and misses 958.

### The metrics further tell us:

- Fraud precision ≈ 0.92: when the model says “fraud”, it is right about 92% of the time, so false alarms are few relative to the number of alerts.

- Fraud recall ≈ 0.42: the model catches only about 42% of frauds, so it misses most fraudulent transactions.

- Overall accuracy ≈ 99.8% is not very informative here: the dataset is extremely imbalanced, and always predicting “non-fraud” would already score about 99.7% on this test set.
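The figures in the classification report follow directly from the four confusion-matrix cells. The sketch below recomputes them with a small helper; `metrics_from_confusion` is a hypothetical name introduced here for illustration, not a scikit-learn function.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute fraud-class metrics from the four confusion-matrix cells."""
    precision = tp / (tp + fp)          # share of alerts that are real fraud
    recall = tp / (tp + fn)             # share of frauds that were alerted
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Cells taken from the unweighted logistic regression output above.
m = metrics_from_confusion(tn=552_381, fp=58, fn=958, tp=685)

# Accuracy of a rule that never predicts fraud on the same test set.
always_legit_accuracy = (552_381 + 58) / (552_381 + 58 + 958 + 685)

for name, value in m.items():
    print(f"{name}: {value:.4f}")
print(f"always non-fraud accuracy: {always_legit_accuracy:.4f}")
```

The gap between the model's accuracy and the trivial rule's accuracy is under 0.2 percentage points, which is why precision and recall on the fraud class carry the real information.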

For AML, this recall is usually too low because missed frauds are very costly. Typical next steps are to adjust the decision threshold, reweight the classes, or use a different model, trading some precision for higher recall.

### Class-weighted logistic regression

In [46]:
clf_w = LogisticRegression(max_iter=1000, class_weight="balanced", n_jobs=-1)

clf_w.fit(X_train, Y_train)

Y_pred_w = clf_w.predict(X_test)

print("Confusion matrix(weighted logistic regression):")
print(confusion_matrix(Y_test, Y_pred_w))
print()
print("Classification report(weighted logistic regression):")
print(classification_report(Y_test, Y_pred_w, digits=4))

Confusion matrix(weighted logistic regression):
[[508593  43846]
 [   216   1427]]

Classification report(weighted logistic regression):
              precision    recall  f1-score   support

           0     0.9996    0.9206    0.9585    552439
           1     0.0315    0.8685    0.0608      1643

    accuracy                         0.9205    554082
   macro avg     0.5155    0.8946    0.5097    554082
weighted avg     0.9967    0.9205    0.9558    554082



### Interpretation of Class-weighted logistic regression:

- True Negatives: 508,593;

- False Positives: 43,846;

- False Negatives: 216;

- True Positives: 1,427.

- So we now catch about 87% of frauds (recall 0.8685) but wrongly flag around 44k legitimate transactions as fraud.

- Fraud precision collapses to about 3%: only 3 out of 100 alerts are actually fraud, which is usually unacceptable operationally in AML, despite the high recall.

- In other words, class_weight="balanced" on its own has swung us from a “very strict” model (high precision, low recall) to an “over‑trigger‑happy” one (high recall, extremely low precision).
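The size of this swing is easier to understand once the weights themselves are written out. scikit-learn documents the "balanced" mode as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the class shares printed by `Y.value_counts(normalize=True)`; it assumes the stratified training split has the same shares, which should hold approximately but is not checked here.

```python
# Class shares from Y.value_counts(normalize=True) above. Assumption: the
# stratified training split has (approximately) the same shares.
class_share = {0: 0.997035, 1: 0.002965}
n_classes = len(class_share)

# "balanced" weight for class c is n_samples / (n_classes * count_c),
# which equals 1 / (n_classes * share_c) when expressed with shares.
weights = {c: 1.0 / (n_classes * share) for c, share in class_share.items()}

# How many times more a fraud example counts than a legitimate one.
ratio = weights[1] / weights[0]

print(f"weight for non-fraud: {weights[0]:.4f}")
print(f"weight for fraud:     {weights[1]:.2f}")
print(f"fraud / non-fraud weight ratio: {ratio:.0f}")
```

With each fraud example weighted several hundred times more heavily than a legitimate one, the fitted probabilities no longer reflect the true base rate, so a 0.5 cut-off becomes a very lenient alert rule.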

### Threshold tuning

In [50]:
Y_prob_w = clf_w.predict_proba(X_test)[:, 1]

def eval_threshold(threshold):
    Y_pred_thr = (Y_prob_w >= threshold).astype(int)
    print(f"=== Threshold = {threshold: .2f} ===")
    print(confusion_matrix(Y_test, Y_pred_thr))
    print(classification_report(Y_test, Y_pred_thr, digits=4))
    print()

for thr in [0.50, 0.30, 0.20]:
    eval_threshold(thr)

=== Threshold =  0.50 ===
[[508593  43846]
 [   216   1427]]
              precision    recall  f1-score   support

           0     0.9996    0.9206    0.9585    552439
           1     0.0315    0.8685    0.0608      1643

    accuracy                         0.9205    554082
   macro avg     0.5155    0.8946    0.5097    554082
weighted avg     0.9967    0.9205    0.9558    554082


=== Threshold =  0.30 ===
[[475526  76913]
 [   116   1527]]
              precision    recall  f1-score   support

           0     0.9998    0.8608    0.9251    552439
           1     0.0195    0.9294    0.0381      1643

    accuracy                         0.8610    554082
   macro avg     0.5096    0.8951    0.4816    554082
weighted avg     0.9968    0.8610    0.9224    554082


=== Threshold =  0.20 ===
[[439337 113102]
 [    20   1623]]
              precision    recall  f1-score   support

           0     1.0000    0.7953    0.8859    552439
           1     0.0141    0.9878    0.0279      164

### Interpretation after Threshold tuning:

- At 0.50: recall ≈ 0.87, precision ≈ 0.03, ~43.8k false positives.

- We catch most frauds but almost every alert is a false alarm, which would be operationally unusable in production.

- At 0.30: recall rises slightly to ≈ 0.93, precision drops further to ≈ 0.02, and false positives jump to ~76.9k, so alert volume becomes even less manageable.

- At 0.20: recall is ≈ 0.99 (we miss only 20 frauds), but precision falls to ≈ 0.014 and false positives exceed 113k, which would swamp any AML team.

- This nicely shows that “pushing recall towards 1” without any constraint on precision or alert volume is not realistic for fraud/AML, even though it looks good from a purely statistical viewpoint.

Lowering the classification threshold from 0.50 to 0.30 and 0.20 with the class‑weighted model increases fraud recall from ~0.87 to ~0.99, but precision collapses from ~3% to ~1–2%, generating tens of thousands of false alerts. This mirrors industry discussions that an effective fraud or AML system must balance recall against precision and operational alert capacity, rather than optimising a single metric in isolation.
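One way to read the three confusion matrices operationally is to convert them into alert volumes. The sketch below does this for the class-weighted model, using only the FP and TP counts printed above; the field names are illustrative.

```python
# (threshold, false positives, true positives) from the confusion matrices
# printed above for the class-weighted logistic regression.
points = [(0.50, 43_846, 1_427), (0.30, 76_913, 1_527), (0.20, 113_102, 1_623)]
total_frauds = 1_643  # fraud cases in the test set

rows = []
for thr, fp, tp in points:
    alerts = fp + tp  # everything an investigator would have to review
    rows.append({
        "threshold": thr,
        "alerts": alerts,
        "precision": tp / alerts,
        "recall": tp / total_frauds,
        "false_alerts_per_caught_fraud": fp / tp,
    })

for r in rows:
    print(f"thr={r['threshold']:.2f}  alerts={r['alerts']:>7,}  "
          f"precision={r['precision']:.4f}  recall={r['recall']:.4f}  "
          f"false alerts per caught fraud={r['false_alerts_per_caught_fraud']:.1f}")
```

Framing the trade-off as "false alerts reviewed per fraud caught" tends to be easier to discuss with an investigation team than precision alone.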

### Raising the threshold: towards two candidate operating points

In [55]:
Y_prob_w_tcop = clf_w.predict_proba(X_test)[:, 1]

def eval_threshold(threshold):
    Y_pred_thr = (Y_prob_w_tcop >= threshold).astype(int)
    print(f"=== Threshold = {threshold: .2f} ===")
    print(confusion_matrix(Y_test, Y_pred_thr))
    print(classification_report(Y_test, Y_pred_thr, digits=4))
    print()

for thr in [0.60, 0.70, 0.80]:
    eval_threshold(thr)

=== Threshold =  0.60 ===
[[520271  32168]
 [   264   1379]]
              precision    recall  f1-score   support

           0     0.9995    0.9418    0.9698    552439
           1     0.0411    0.8393    0.0784      1643

    accuracy                         0.9415    554082
   macro avg     0.5203    0.8905    0.5241    554082
weighted avg     0.9967    0.9415    0.9671    554082


=== Threshold =  0.70 ===
[[529648  22791]
 [   333   1310]]
              precision    recall  f1-score   support

           0     0.9994    0.9587    0.9786    552439
           1     0.0544    0.7973    0.1018      1643

    accuracy                         0.9583    554082
   macro avg     0.5269    0.8780    0.5402    554082
weighted avg     0.9966    0.9583    0.9760    554082


=== Threshold =  0.80 ===
[[536554  15885]
 [   449   1194]]
              precision    recall  f1-score   support

           0     0.9992    0.9712    0.9850    552439
           1     0.0699    0.7267    0.1276      164

### Operating points and AML interpretation:

- Model A – conservative alerts (unweighted, threshold 0.50): This model achieves very high precision on frauds (~0.92) with almost no false positives, but recall is only ~0.42, so the majority of frauds are missed. This is suitable for low‑risk portfolios or early pilots where investigators will only trust the system if almost every alert is genuine.

- Model B – cost‑sensitive alerts (weighted, threshold 0.80): With class_weight="balanced" and a higher threshold of 0.80, fraud recall increases to ~0.73, while precision drops to ~0.07 and false positives rise to about 15.9k. This configuration would be preferred when the institution is more concerned about undetected fraud/AML cases and is willing to handle a higher but still bounded alert volume.

- Overall, these two operating points illustrate that fraud and AML models must be tuned against business and regulatory constraints (cost of missed cases, investigation capacity, SLAs) rather than accuracy alone, aligning with industry guidance on precision–recall trade‑offs and threshold tuning in transaction monitoring.
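Investigation capacity can also be turned directly into a threshold instead of trying fixed values such as 0.60 or 0.80. The sketch below is one possible way to do that in plain Python; `threshold_for_alert_budget` and the toy scores and labels are hypothetical and are not taken from the notebook.

```python
def threshold_for_alert_budget(scores, max_alerts):
    """Lowest observed score t such that at most max_alerts scores are >= t."""
    if max_alerts <= 0:
        return float("inf")              # no alerts allowed at all
    ranked = sorted(scores, reverse=True)
    if max_alerts >= len(ranked):
        return ranked[-1]                # budget covers every transaction
    candidate = ranked[max_alerts - 1]
    if ranked[max_alerts] == candidate:
        # A tie straddles the budget boundary, so step up to the next
        # strictly higher score to stay within the budget.
        higher = [s for s in ranked if s > candidate]
        return min(higher) if higher else float("inf")
    return candidate

# Toy fraud scores and labels (hypothetical, for illustration only).
scores = [0.95, 0.90, 0.80, 0.80, 0.40, 0.10]
labels = [1, 0, 1, 0, 0, 0]

budget = 3  # the team can review at most three alerts
thr = threshold_for_alert_budget(scores, budget)
alerted = [y for s, y in zip(scores, labels) if s >= thr]
tp, fp = sum(alerted), len(alerted) - sum(alerted)
print(f"threshold={thr}, alerts={len(alerted)}, TP={tp}, FP={fp}")
```

In the notebook this could be applied to `Y_prob_w`, ideally on a validation split rather than the test set, so that the chosen threshold is not tuned on the data used for the final evaluation.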

### Limitations & next steps:

- Note that only 4 features have been used, with no time or customer-history information.
- Future work could try tree‑based models, PR‑AUC/ROC‑AUC evaluation, or more realistic cost matrices, in line with the threshold optimisation and PR‑AUC evaluation recommended in the fraud/AML literature.

### Importing necessary libraries...

In [60]:
from sklearn.ensemble import RandomForestClassifier

In [64]:
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    n_jobs=-1,
    random_state=42,
    class_weight=None
)
rf.fit(X_train, Y_train)
Y_pred_rf = rf.predict(X_test)

print("Confusion Matrix (RandomForest, default threshold):")
print(confusion_matrix(Y_test, Y_pred_rf))
print()
print("Classification Report (RandomForest, default threshold):")
print(classification_report(Y_test, Y_pred_rf, digits=4))

Confusion Matrix (RandomForest, default threshold):
[[552301    138]
 [   178   1465]]

Classification Report (RandomForest, default threshold):
              precision    recall  f1-score   support

           0     0.9997    0.9998    0.9997    552439
           1     0.9139    0.8917    0.9026      1643

    accuracy                         0.9994    554082
   macro avg     0.9568    0.9457    0.9512    554082
weighted avg     0.9994    0.9994    0.9994    554082



The baseline RandomForest reaches fraud precision of ~0.91 and recall of ~0.89 at the default 0.5 threshold, which shows that a non-linear ensemble is far more sensitive to fraud than logistic regression on the same simple feature set.

- This is a huge improvement over our logistic baseline (recall ~0.42) and over the class-weighted logistic model, which needed tens of thousands of alerts to get high recall.

- Confusion matrix: TN = 552,301; FP = 138; FN = 178; TP = 1,465.

- So it only misses 178 of 1,643 frauds and raises 138 false alarms.

RandomForest vs logistic regression: The RandomForest model achieves fraud precision of ~0.91 and recall of ~0.89 with only 138 false positives, compared to the unweighted logistic regression (~0.92 precision but only ~0.42 recall) and the class‑weighted logistic model (high recall but > 40k false positives). This demonstrates that a non-linear ensemble can deliver both high recall and a manageable alert volume on the same PaySim feature set, which is much closer to what AML teams require in practice.

### Model comparison overview:

| Model | Threshold | Fraud precision | Fraud recall | False positives | False negatives | Notes |
|---|---|---|---|---|---|---|
| Logistic regression (unweighted) | 0.50 | ~0.92 | ~0.42 | 58 | 958 | Very few alerts, but misses most frauds. |
| Logistic regression (class-weighted) | 0.50 | ~0.03 | ~0.87 | 43,846 | 216 | High recall, but operationally unusable alert volume. |
| Logistic regression (class-weighted) | 0.80 | ~0.07 | ~0.73 | 15,885 | 449 | More balanced, but still low precision and many alerts. |
| RandomForest | 0.50 | ~0.91 | ~0.89 | 138 | 178 | High recall with very low false positives; best AML fit. |

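- Overall, the RandomForest model matches the best logistic precision (~0.91 vs ~0.92 for the unweighted baseline) while more than doubling its recall, and it beats the class-weighted variants on precision by a wide margin at comparable or higher recall, all while keeping the number of false positives very low. This aligns with fraud‑detection practice, where tree ensembles are widely used to capture complex, non‑linear risk patterns in transactional data.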

#### Limitations & future work:

- The models in this notebook are trained on the PaySim synthetic mobile money dataset, which is designed to resemble real transactional data but cannot fully capture the diversity of real-world fraud and AML behaviour.
- Only a small set of simple, transaction-level features (amount and balances plus type) is used, without any explicit temporal, customer-history, or network/graph-based information, even though such features are known to be important in production fraud systems.
- Future work could therefore extend this baseline by engineering richer behavioural features, evaluating models with PR‑AUC/ROC‑AUC and cost-sensitive metrics, and eventually testing on more realistic synthetic or anonymised AML datasets that include customer and network context.

### Computing ROC-AUC score and Average Precision score:

In [73]:
from sklearn.metrics import roc_auc_score, average_precision_score

def eval_model(name, Y_true, Y_prob, Y_pred, c_fp=10, c_fn=1000):
    tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred).ravel()
    roc = roc_auc_score(Y_true, Y_prob)
    pr = average_precision_score(Y_true, Y_prob)
    total_cost = c_fp * fp + c_fn * fn
    return {
        "model": name,
        "ROC_AUC": roc,
        "PR_AUC": pr,
        "precision": tp / (tp + fp) if (tp + fp) > 0 else 0,
        "recall": tp / (tp + fn) if (tp + fn) > 0 else 0,
        "FP": fp,
        "FN": fn,
        "total_cost": total_cost
    }

results = []

# 1) Logistic Unweighted:

Y_prob_log = clf.predict_proba(X_test)[:, 1]
Y_pred_log = (Y_prob_log >= 0.5).astype(int)
results.append(eval_model("Logistic (unweighted, 0.5)", Y_test, Y_prob_log, Y_pred_log))

# 2) Logistic Weighted, threshold 0.5 & 0.8:

Y_prob_w = clf_w.predict_proba(X_test)[:, 1]
for thr in [0.5, 0.8]:
    Y_pred_w_thr = (Y_prob_w >= thr).astype(int)
    results.append(eval_model(f"Logistic (weighted, {thr})", Y_test, Y_prob_w, Y_pred_w_thr))

# 3) RandomForest

Y_prob_rf = rf.predict_proba(X_test)[:, 1]
Y_pred_rf = (Y_prob_rf >= 0.5).astype(int)
results.append(eval_model("RandomForest (0.5)", Y_test, Y_prob_rf, Y_pred_rf))

df_synth_results = pd.DataFrame(results)
df_synth_results

Unnamed: 0,model,ROC_AUC,PR_AUC,precision,recall,FP,FN,total_cost
0,"Logistic (unweighted, 0.5)",0.975772,0.549324,0.921938,0.41692,58,958,958580
1,"Logistic (weighted, 0.5)",0.969046,0.449542,0.03152,0.868533,43846,216,654460
2,"Logistic (weighted, 0.8)",0.969046,0.449542,0.06991,0.726719,15885,449,607850
3,RandomForest (0.5),0.99714,0.963074,0.912773,0.891662,140,178,179400
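The table reports 140 false positives for RandomForest, whereas the earlier confusion matrix from `rf.predict` showed 138. A plausible explanation, stated here as an assumption that was not verified against this run, is tie handling: `predict` takes the class with the highest averaged probability and the first class (non-fraud) wins a tie, while `Y_prob_rf >= 0.5` counts an exact 0.5 as fraud. The sketch below reproduces that difference on hypothetical probability rows in plain Python.

```python
# Hypothetical [P(non-fraud), P(fraud)] rows; the middle row is an exact tie.
probas = [[0.70, 0.30], [0.50, 0.50], [0.45, 0.55]]

def argmax_first(row):
    """Index of the largest value; the first index wins ties, as in numpy.argmax."""
    return max(range(len(row)), key=row.__getitem__)

# Rule 1: pick the class with the highest probability (ties go to class 0).
pred_argmax = [argmax_first(p) for p in probas]

# Rule 2: flag fraud whenever P(fraud) >= 0.5 (ties go to class 1).
pred_threshold = [int(p[1] >= 0.5) for p in probas]

print("argmax rule:   ", pred_argmax)
print("threshold rule:", pred_threshold)
```

If this is indeed the cause, using a strict `>` comparison for the 0.5 case would make the two sets of predictions agree.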


In [77]:
#### Exporting the results to CSV:

import os

os.makedirs("results", exist_ok=True)

df_synth_results.to_csv("results/paysim_model_comparison.csv", index=False)

In [79]:
df_synth_results.sort_values("total_cost")

Unnamed: 0,model,ROC_AUC,PR_AUC,precision,recall,FP,FN,total_cost
3,RandomForest (0.5),0.99714,0.963074,0.912773,0.891662,140,178,179400
2,"Logistic (weighted, 0.8)",0.969046,0.449542,0.06991,0.726719,15885,449,607850
1,"Logistic (weighted, 0.5)",0.969046,0.449542,0.03152,0.868533,43846,216,654460
0,"Logistic (unweighted, 0.5)",0.975772,0.549324,0.921938,0.41692,58,958,958580


#### Interpretation: Using a toy cost of 10 per false positive and 1,000 per false negative, the total cost ranking (lowest first) is:

- RandomForest (0.5): total_cost = 179,400

- Logistic (weighted, 0.8): total_cost = 607,850

- Logistic (weighted, 0.5): total_cost = 654,460

- Logistic (unweighted, 0.5): total_cost = 958,580

So RandomForest at the default 0.5 threshold both:

- achieves the best PR‑AUC (0.963) and a very high ROC‑AUC (0.997), and

- minimises total cost by a large margin under this toy cost assumption.
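The totals can be recomputed from the FP and FN columns of the results table, which also makes it easy to see how sensitive the ranking is to the assumed costs. The sketch below is a minimal illustration; `rank_models` is a hypothetical helper, and the second cost setting (100 per missed fraud) is an arbitrary alternative chosen only to show that the order can change.

```python
# (false positives, false negatives) per model, from the results table above.
errors = {
    "RandomForest (0.5)": (140, 178),
    "Logistic (weighted, 0.8)": (15_885, 449),
    "Logistic (weighted, 0.5)": (43_846, 216),
    "Logistic (unweighted, 0.5)": (58, 958),
}

def rank_models(c_fp, c_fn):
    """Return (costs, names sorted by total cost) for the given unit costs."""
    costs = {name: c_fp * fp + c_fn * fn for name, (fp, fn) in errors.items()}
    return costs, sorted(costs, key=costs.get)

# Toy costs used in the notebook: 10 per false alert, 1,000 per missed fraud.
costs, ranking = rank_models(c_fp=10, c_fn=1_000)
for name in ranking:
    print(f"{name}: {costs[name]:,}")

# Arbitrary alternative: missed frauds are only 10x as costly as false alerts.
alt_costs, alt_ranking = rank_models(c_fp=10, c_fn=100)
print("Ranking with c_fn=100:", alt_ranking)
```

RandomForest stays first under both settings, but the order of the logistic variants depends on the cost ratio, so the cost assumptions should be stated alongside any ranking.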

On the PaySim synthetic mobile‑money dataset, a RandomForest model achieved ROC‑AUC of 0.997 and PR‑AUC of 0.963, with fraud precision of 0.91 and recall of 0.89 at the default 0.5 threshold. This corresponds to 178 false negatives and 138 false positives with `rf.predict` (140 false positives when thresholding `predict_proba` at >= 0.5, which is the figure used in the cost table). Under a simple cost model that assigns a cost of 1,000 to missed frauds and 10 to false alerts, this configuration produced the lowest total cost (~179k) among the tested models, substantially outperforming both unweighted logistic regression and class‑weighted logistic variants, which either missed most frauds or generated tens of thousands of alerts. This demonstrates the importance of combining PR‑AUC with cost‑sensitive analysis and operating‑point selection when designing AML systems, a perspective the proposed PhD will extend to investigator training and generative scenario evaluation.