# Context:

- PaySim is a synthetic mobile-money fraud dataset. We focus on the CASH_OUT and TRANSFER transaction types; fraud prevalence in this subset is ~0.30%.

### Importing libraries and loading the data

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

df_synth = pd.read_csv("data/PS_20174392719_1491204439457_log.csv")

In [30]:
df_synth.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

In [31]:
df_synth.type

0           PAYMENT
1           PAYMENT
2          TRANSFER
3          CASH_OUT
4           PAYMENT
             ...   
6362615    CASH_OUT
6362616    TRANSFER
6362617    CASH_OUT
6362618    TRANSFER
6362619    CASH_OUT
Name: type, Length: 6362620, dtype: object

#### Filtering to CASH_OUT and TRANSFER and building features

In [33]:
df_synth_ct = df_synth[df_synth["type"].isin(["CASH_OUT", "TRANSFER"])].copy()
df_synth_ct["type_code"] = (df_synth_ct["type"] == "TRANSFER").astype(int)

features = ["amount", "oldbalanceOrg", "newbalanceOrig", "type_code"]
X = df_synth_ct[features]
Y = df_synth_ct["isFraud"]

Y.value_counts(normalize=True)

isFraud
0    0.997035
1    0.002965
Name: proportion, dtype: float64

## Train/test split and logistic regression

In [35]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)

print(confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred, digits=4))

[[552381     58]
 [   958    685]]
              precision    recall  f1-score   support

           0     0.9983    0.9999    0.9991    552439
           1     0.9219    0.4169    0.5742      1643

    accuracy                         0.9982    554082
   macro avg     0.9601    0.7084    0.7866    554082
weighted avg     0.9980    0.9982    0.9978    554082



#### Interpretation of the Confusion Matrix:

- True negatives (TN, legitimate transactions correctly left unflagged): 552,381.

- False positives (FP, legitimate transactions flagged as fraud): 58.

- False negatives (FN, frauds missed): 958.

- True positives (TP, frauds correctly flagged): 685.

- So out of 1643 frauds, the model catches 685 and misses 958.
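As a quick check on the reading above, the fraud precision and recall can be recomputed by hand from the four cells. This is a minimal sketch that hard-codes the matrix printed earlier; with scikit-learn, the same four values come from `confusion_matrix(Y_test, Y_pred).ravel()` for a binary problem.

```python
# Confusion matrix as printed above: rows = actual class, columns = predicted class.
cm = [[552381, 58],
      [958, 685]]

# Read row by row; for a binary problem this gives TN, FP, FN, TP.
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)  # share of alerts that are real fraud
recall = tp / (tp + fn)     # share of frauds that raise an alert

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision = {precision:.4f}, recall = {recall:.4f}")
```

The two values agree with the class-1 row of the classification report (0.9219 and 0.4169).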

#### The metrics further tell us:

- Fraud precision ≈ 0.92: when the model says “fraud”, it is right about 92% of the time (685 of 743 alerts), so false alarms are rare relative to the number of alerts.

- Fraud recall ≈ 0.42: the model catches only about 42% of frauds, so it still misses most fraudulent transactions.

- Overall accuracy ≈ 99.8% is not very informative here: the data are extremely imbalanced, and predicting “non-fraud” for every transaction would already score about 99.7%.
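The point about accuracy can be made concrete with a majority-class baseline. The sketch below uses only the test-set support counts from the report above; the "always predict non-fraud" rule is a hypothetical comparison, not something fitted in this notebook.

```python
# Test-set class counts, taken from the support column of the report above.
n_legit, n_fraud = 552439, 1643
n_total = n_legit + n_fraud

# Baseline: predict "non-fraud" for every transaction.
baseline_accuracy = n_legit / n_total
baseline_recall = 0 / n_fraud  # it never raises an alert, so it catches nothing

# Logistic regression: correct predictions are TN + TP.
model_accuracy = (552381 + 685) / n_total

print(f"baseline accuracy = {baseline_accuracy:.4f}, fraud recall = {baseline_recall:.4f}")
print(f"model accuracy    = {model_accuracy:.4f}")
```

The gap between the two accuracies is well under one percentage point, which is why precision and recall on the fraud class are the figures to watch.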

For AML, a fraud recall of ~0.42 is usually too low, because missed fraud is very costly. Typical next steps are to adjust the decision threshold, reweight the classes, or try a different model, trading some precision for higher recall.

#### Class-weighted logistic regression

In [65]:
clf_w = LogisticRegression(max_iter=1000, class_weight="balanced", n_jobs=-1)

clf_w.fit(X_train, Y_train)

Y_pred_w = clf_w.predict(X_test)

print("Confusion matrix(weighted logistic regression):")
print(confusion_matrix(Y_test, Y_pred_w))
print()
print("Classification report(weighted logistic regression):")
print(classification_report(Y_test, Y_pred_w, digits=4))

Confusion matrix(weighted logistic regression):
[[508593  43846]
 [   216   1427]]

Classification report(weighted logistic regression):
              precision    recall  f1-score   support

           0     0.9996    0.9206    0.9585    552439
           1     0.0315    0.8685    0.0608      1643

    accuracy                         0.9205    554082
   macro avg     0.5155    0.8946    0.5097    554082
weighted avg     0.9967    0.9205    0.9558    554082



## Interpretation of Class-weighted logistic regression:

- True negatives: 508,593
- False positives: 43,846
- False negatives: 216
- True positives: 1,427

- So we now catch about 87% of frauds (recall 0.8685) but wrongly flag around 44k legitimate transactions as fraud.
- Fraud precision collapses to about 3%: only 3 out of 100 alerts are actually fraud, which is usually unacceptable operationally in AML, despite the high recall.
- In other words, class_weight="balanced" on its own has swung us from a “very strict” model (high precision, low recall) to an “over‑trigger‑happy” one (high recall, extremely low precision).
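To see why the swing is so large, it helps to look at the weights themselves. scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * np.bincount(y))`; the sketch below reimplements that formula in plain Python on a toy label vector (the 997/3 split is an assumption chosen to mimic the ~0.3% prevalence, not the real training counts).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels: 3 frauds in 1000 rows, roughly the prevalence seen here.
y_toy = [0] * 997 + [1] * 3
weights = balanced_class_weights(y_toy)

print(weights)
print(f"one fraud row counts as much as {weights[1] / weights[0]:.0f} legitimate rows")
```

At this prevalence each fraud example is weighted several hundred times more heavily than a legitimate one in the loss, so the model accepts many false positives to avoid a single false negative.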

#### Threshold tuning 

In [70]:
Y_prob_w = clf_w.predict_proba(X_test)[:, 1]

def eval_threshold(threshold):
    Y_pred_thr = (Y_prob_w >= threshold).astype(int)
    print(f"=== Threshold = {threshold: .2f} ===")
    print(confusion_matrix(Y_test, Y_pred_thr))
    print(classification_report(Y_test, Y_pred_thr, digits=4))
    print()

for thr in [0.50, 0.30, 0.20]:
    eval_threshold(thr)

=== Threshold =  0.50 ===
[[508593  43846]
 [   216   1427]]
              precision    recall  f1-score   support

           0     0.9996    0.9206    0.9585    552439
           1     0.0315    0.8685    0.0608      1643

    accuracy                         0.9205    554082
   macro avg     0.5155    0.8946    0.5097    554082
weighted avg     0.9967    0.9205    0.9558    554082


=== Threshold =  0.30 ===
[[475526  76913]
 [   116   1527]]
              precision    recall  f1-score   support

           0     0.9998    0.8608    0.9251    552439
           1     0.0195    0.9294    0.0381      1643

    accuracy                         0.8610    554082
   macro avg     0.5096    0.8951    0.4816    554082
weighted avg     0.9968    0.8610    0.9224    554082


=== Threshold =  0.20 ===
[[439337 113102]
 [    20   1623]]
              precision    recall  f1-score   support

           0     1.0000    0.7953    0.8859    552439
           1     0.0141    0.9878    0.0279      1643

## Interpretation after Threshold tuning:

 - At 0.50: recall ≈ 0.87, precision ≈ 0.03, ~43.8k false positives. We catch most frauds, but about 97% of alerts are false alarms, which would be operationally unusable in production.

 - At 0.30: recall rises to ≈ 0.93, precision drops further to ≈ 0.02, and false positives jump to ~76.9k, so alert volume becomes even less manageable.

 - At 0.20: recall is ≈ 0.99 (we miss only 20 frauds), but precision falls to ≈ 0.014 and false positives exceed 113k, which would swamp any AML team.

 - This shows that pushing recall towards 1 without any constraint on precision or alert volume is not realistic for fraud/AML, even though it looks good from a purely statistical viewpoint.

Lowering the classification threshold from 0.50 to 0.30 and 0.20 with the class-weighted model increases fraud recall from ~0.87 to ~0.99, but precision falls from ~3% to ~1–2% and false alerts grow from ~44k to more than 113k. An effective fraud or AML system therefore has to balance recall against precision and operational alert capacity, rather than optimise a single metric in isolation.
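One way to express this trade-off is the number of false alerts paid per fraud caught, and the marginal price of each step down in threshold. The following sketch only re-uses the counts from the three confusion matrices above; the "false alerts per caught fraud" ratio is an illustrative summary, not a metric reported by scikit-learn.

```python
# (threshold, FP, FN, TP), copied from the three confusion matrices above.
results = [
    (0.50, 43846, 216, 1427),
    (0.30, 76913, 116, 1527),
    (0.20, 113102, 20, 1623),
]

rows = []
for thr, fp, fn, tp in results:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    rows.append((thr, precision, recall, fp / tp))
    print(f"thr={thr:.2f}  precision={precision:.4f}  recall={recall:.4f}  "
          f"false alerts per caught fraud={fp / tp:.1f}")

# Marginal price of each step down: extra false alerts per extra fraud caught.
marginal = []
for (thr_a, fp_a, _, tp_a), (thr_b, fp_b, _, tp_b) in zip(results, results[1:]):
    extra = (fp_b - fp_a) / (tp_b - tp_a)
    marginal.append((thr_a, thr_b, extra))
    print(f"{thr_a:.2f} -> {thr_b:.2f}: {extra:.0f} extra false alerts per extra fraud caught")
```

With these figures, every additional fraud caught below a threshold of 0.50 costs several hundred additional false alerts.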

## Raising the threshold to find two candidate operating points:

In [82]:
Y_prob_w_tcop = clf_w.predict_proba(X_test)[:, 1]

def eval_threshold(threshold):
    Y_pred_thr = (Y_prob_w_tcop >= threshold).astype(int)
    print(f"=== Threshold = {threshold: .2f} ===")
    print(confusion_matrix(Y_test, Y_pred_thr))
    print(classification_report(Y_test, Y_pred_thr, digits=4))
    print()

for thr in [0.60, 0.70, 0.80]:
    eval_threshold(thr)

=== Threshold =  0.60 ===
[[520271  32168]
 [   264   1379]]
              precision    recall  f1-score   support

           0     0.9995    0.9418    0.9698    552439
           1     0.0411    0.8393    0.0784      1643

    accuracy                         0.9415    554082
   macro avg     0.5203    0.8905    0.5241    554082
weighted avg     0.9967    0.9415    0.9671    554082


=== Threshold =  0.70 ===
[[529648  22791]
 [   333   1310]]
              precision    recall  f1-score   support

           0     0.9994    0.9587    0.9786    552439
           1     0.0544    0.7973    0.1018      1643

    accuracy                         0.9583    554082
   macro avg     0.5269    0.8780    0.5402    554082
weighted avg     0.9966    0.9583    0.9760    554082


=== Threshold =  0.80 ===
[[536554  15885]
 [   449   1194]]
              precision    recall  f1-score   support

           0     0.9992    0.9712    0.9850    552439
           1     0.0699    0.7267    0.1276      1643

## Operating points and AML interpretation:

- Model A – conservative alerts (unweighted, threshold 0.50): This model achieves very high precision on frauds (~0.92) with only 58 false positives, but recall is only ~0.42, so the majority of frauds are missed. It suits low-risk portfolios or early pilots where investigators will only trust the system if almost every alert is genuine.

- Model B – cost-sensitive alerts (weighted, threshold 0.80): With class_weight="balanced" and a higher threshold of 0.80, fraud recall increases to ~0.73, while precision drops to ~0.07 and false positives rise to about 15.9k. This configuration would be preferred when the institution is more concerned about undetected fraud/AML cases and is willing to handle a higher but still bounded alert volume.

- Overall, these two operating points illustrate that fraud and AML models must be tuned against business and regulatory constraints (cost of missed cases, investigation capacity, SLAs) rather than accuracy alone; the precision-recall trade-off and the decision threshold are the levers for doing so in transaction monitoring.
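The idea of tuning against investigation capacity can be sketched as a simple selection rule: among the thresholds already evaluated for the weighted model, keep those whose alert count fits a budget and take the one that catches the most fraud. The counts come from the confusion matrices above; the helper `best_under_capacity` and the budget values are hypothetical and only for illustration.

```python
# threshold -> (FP, FN, TP), copied from the weighted model's confusion matrices.
operating_points = {
    0.20: (113102, 20, 1623),
    0.30: (76913, 116, 1527),
    0.50: (43846, 216, 1427),
    0.60: (32168, 264, 1379),
    0.70: (22791, 333, 1310),
    0.80: (15885, 449, 1194),
}

def best_under_capacity(points, max_alerts):
    """Threshold with the most true positives whose alerts (FP + TP) fit the budget."""
    feasible = {thr: tp for thr, (fp, fn, tp) in points.items() if fp + tp <= max_alerts}
    if not feasible:
        return None  # no evaluated threshold fits the budget
    return max(feasible, key=feasible.get)

# Hypothetical alert budgets for a test set of this size.
for capacity in (20_000, 25_000, 50_000):
    thr = best_under_capacity(operating_points, capacity)
    fp, fn, tp = operating_points[thr]
    print(f"budget={capacity:>6}: threshold={thr:.2f}, alerts={fp + tp}, "
          f"recall={tp / (tp + fn):.4f}")
```

Under the assumed budget of 20,000 alerts, the rule lands on threshold 0.80, which is the Model B operating point described above.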

## Limitations & next steps:

- Only four features were used, with no time or customer-history information.
- Future work could try tree-based models, evaluate with PR-AUC and ROC-AUC, or use more realistic cost matrices; threshold optimisation and PR-AUC are commonly recommended for imbalanced fraud/AML problems.
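As a starting point for the cost-matrix idea, the two operating points can be compared under an explicit cost per error type. The sketch below is illustrative only: both unit costs are made-up placeholders, and only the FP/FN counts come from the results above.

```python
# Hypothetical unit costs; real values would come from the business.
COST_FALSE_ALERT = 5.0     # assumed analyst cost to clear one false alert
COST_MISSED_FRAUD = 500.0  # assumed average loss from one undetected fraud

# FP / FN counts for the two operating points discussed above.
models = {
    "A (unweighted, thr 0.50)": {"fp": 58, "fn": 958},
    "B (weighted, thr 0.80)": {"fp": 15885, "fn": 449},
}

def total_cost(fp, fn, cost_fp=COST_FALSE_ALERT, cost_fn=COST_MISSED_FRAUD):
    """Total error cost; correct decisions are assumed to cost nothing."""
    return fp * cost_fp + fn * cost_fn

costs = {name: total_cost(**counts) for name, counts in models.items()}
for name, cost in costs.items():
    print(f"Model {name}: total cost = {cost:,.0f}")

# Cost ratio (missed fraud / false alert) at which both models cost the same.
break_even = (15885 - 58) / (958 - 449)
print(f"Model B is cheaper once a missed fraud costs more than "
      f"{break_even:.1f}x a false alert")
```

The break-even ratio does not depend on the placeholder costs, so it gives a way to frame the choice between the two models before exact figures are available.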