# Machine Learning Midterm - Online Transaction

## Rayhan Diff-1103220039

## Import Libraries

Import the necessary libraries.

In [5]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
import gc  # Used to free memory after large intermediate DataFrames are deleted

pd.set_option('display.max_columns', None)

## Data Loading & Preprocessing

In [6]:
# 1. Load Data
print("Loading datasets...")
train_df = pd.read_csv('train_transaction.csv')
test_df = pd.read_csv('test_transaction.csv')

print(f"Shape Train: {train_df.shape}")
print(f"Shape Test: {test_df.shape}")

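# 2. Temporarily combine train and test so categorical encoding is consistent
#    (isFraud = -1 marks the test rows so they can be separated again later)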
test_df['isFraud'] = -1
all_data = pd.concat([train_df, test_df], axis=0, sort=False).reset_index(drop=True)

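# 3. Handling Missing Values & Categorical Encoding
#    (missing categorical values become 'Unknown'; numeric NaNs are left for XGBoost)

# Identify the categorical (object-dtype) columns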
cat_cols = all_data.select_dtypes(include=['object']).columns

print("Encoding categorical columns...")
for col in cat_cols:
    all_data[col] = all_data[col].fillna('Unknown')
    
    le = LabelEncoder()
    all_data[col] = le.fit_transform(all_data[col].astype(str))

train_df = all_data[all_data['isFraud'] != -1].copy()
test_df = all_data[all_data['isFraud'] == -1].copy()

test_df = test_df.drop('isFraud', axis=1)

del all_data
gc.collect()

Loading datasets...
Shape Train: (590540, 394)
Shape Test: (506691, 393)
Encoding categorical columns...


20
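Fitting the encoder on the combined train and test data matters because a category that appears only in the test set would otherwise have no code. The following is a minimal, dependency-free sketch of that idea; the helper `fit_label_map` and the sample card values are hypothetical and only mimic how `LabelEncoder` assigns integer codes in sorted class order.

```python
def fit_label_map(values):
    """Map each distinct string to an integer code, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def clean(column):
    """Replace missing entries with 'Unknown' and cast everything to str."""
    return ["Unknown" if v is None else str(v) for v in column]

# Hypothetical sample values for one categorical column
train_col = ["visa", "mastercard", None, "visa"]
test_col = ["discover", "visa", None]

# Encoder fitted on train + test together (as in the cell above)
combined_map = fit_label_map(clean(train_col) + clean(test_col))

# Encoder fitted on train only: some test categories have no code
train_only_map = fit_label_map(clean(train_col))
unseen = [v for v in clean(test_col) if v not in train_only_map]

print(combined_map)
print("Unseen in train-only encoder:", unseen)
```

With the combined mapping every test value can be encoded, while the train-only mapping has no code for `discover`.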

## Feature Engineering & Class Imbalance
In this section, the class imbalance problem is addressed. When fraud cases are extremely rare, a model tends to predict “Not Fraud” almost every time. This is handled here with XGBoost's `scale_pos_weight` parameter, set to the ratio of negative to positive samples.


In [7]:
# Defining the Features (X) and the Target (y)
X = train_df.drop(['isFraud', 'TransactionID'], axis=1)
y = train_df['isFraud']

X_test_final = test_df.drop(['TransactionID'], axis=1) # For final submission

# Splitting Data for Local Validation (80% Train, 20% Validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Calculating Scale Pos Weight for Class Imbalance
# Formula: number_of_negatives / number_of_positives
ratio = float(np.sum(y == 0)) / np.sum(y == 1)
print(f"Class Imbalance Ratio: {ratio:.2f}")

Class Imbalance Ratio: 27.58
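As a quick sanity check of the formula, the same ratio can be reproduced by hand. This sketch uses the class counts of the validation split reported in the classification report further down (113,975 non-fraud, 4,133 fraud); because the split is stratified, its ratio should match the one computed on the full training labels.

```python
# Class counts taken from the validation classification report
n_neg = 113975  # non-fraud transactions
n_pos = 4133    # fraud transactions

# scale_pos_weight formula: number_of_negatives / number_of_positives
scale_pos_weight = n_neg / n_pos

# Share of fraud among all transactions
fraud_rate = n_pos / (n_neg + n_pos)

print(f"scale_pos_weight: {scale_pos_weight:.2f}")
print(f"Fraud rate: {fraud_rate:.2%}")
```

Each fraud sample therefore counts roughly 28 times as much as a non-fraud sample in the training loss.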


## Training the XGBoost Model
The AUC (Area Under the ROC Curve) metric is used because the model is evaluated on its predicted probabilities, that is, on how well it ranks fraud above non-fraud, rather than on hard 0 or 1 labels.
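To make the ranking interpretation concrete, here is a small sketch that computes AUC by brute force as the share of (fraud, non-fraud) pairs in which the fraud case gets the higher score. The function `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the pipeline; in practice `roc_auc_score` is used.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: 2 fraud cases (1) and 4 non-fraud cases (0)
labels = [0, 0, 1, 0, 1, 0]
scores = [0.05, 0.20, 0.30, 0.40, 0.90, 0.10]

auc = pairwise_auc(labels, scores)

# Squaring the scores changes their values but not their order,
# so the AUC stays the same: only the ranking matters.
auc_squared = pairwise_auc(labels, [s ** 2 for s in scores])

print(auc, auc_squared)
```

Seven of the eight pairs are ordered correctly, which gives 0.875 for both the raw and the squared scores.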

In [11]:
# XGBoost configuration
model = xgb.XGBClassifier(
    n_estimators=500,           # Number of trees
    max_depth=9,                # Tree depth
    learning_rate=0.05,         # Learning rate
    subsample=0.9,              # Row subsampling per tree (reduces overfitting)
    colsample_bytree=0.9,       # Column subsampling per tree (reduces overfitting)
    missing=np.nan,             # Treat NaN as missing; XGBoost handles it natively
    eval_metric='auc',          # Main evaluation metric
    n_jobs=-1,                  # Use all CPU cores
    scale_pos_weight=ratio,     # KEY FOR IMBALANCED DATA
    tree_method='hist',         # Histogram-based tree construction
    device='cuda',              # Train on GPU
    random_state=42,
    early_stopping_rounds=50,   # Stop if score does not improve for 50 rounds
)

print("Starting Model Training...")
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=50
)

Starting Model Training...
[0]	validation_0-auc:0.86305
[50]	validation_0-auc:0.92255
[100]	validation_0-auc:0.93718
[150]	validation_0-auc:0.94829
[200]	validation_0-auc:0.95410
[250]	validation_0-auc:0.95799
[300]	validation_0-auc:0.96062
[350]	validation_0-auc:0.96256
[400]	validation_0-auc:0.96430
[450]	validation_0-auc:0.96547
[499]	validation_0-auc:0.96645
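The log shows the validation AUC still rising at the last round, so early stopping apparently did not trigger within the 500 trees. The sketch below illustrates the general patience rule behind `early_stopping_rounds`; the function `early_stop_round` and the `plateau` sequence are hypothetical, and the real implementation sees every boosting round rather than the checkpoints printed every 50 rounds.

```python
def early_stop_round(scores, patience):
    """Return (stop_index, best_index) under a simple patience rule.

    Training stops at the first index where the best score is at least
    `patience` steps old. stop_index is None if that never happens.
    """
    best_score = float("-inf")
    best_index = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_index = score, i
        elif i - best_index >= patience:
            return i, best_index
    return None, best_index

# Validation AUC checkpoints copied from the training log above
logged_auc = [0.86305, 0.92255, 0.93718, 0.94829, 0.95410, 0.95799,
              0.96062, 0.96256, 0.96430, 0.96547, 0.96645]

# Hypothetical sequence that stops improving after index 2
plateau = [0.90, 0.92, 0.93, 0.93, 0.929, 0.93]

print(early_stop_round(logged_auc, patience=1))
print(early_stop_round(plateau, patience=3))
```

Since every logged checkpoint improves on the previous one, a larger `n_estimators` might still yield further gains, although that would need to be confirmed by a new run.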


## Model Evaluation

In [12]:
# Probability Prediction (for AUC calculation)
y_pred_prob = model.predict_proba(X_val)[:, 1]

# Label Prediction (0 or 1 with default threshold 0.5)
y_pred_label = model.predict(X_val)

# 1. ROC-AUC Score (Main Metric)
auc_score = roc_auc_score(y_val, y_pred_prob)
print(f"\nValidation ROC-AUC Score: {auc_score:.4f}")

# 2. Classification Report (To see Precision & Recall)
print("\nClassification Report:")
print(classification_report(y_val, y_pred_label))

# 3. Feature Importance (Which features are most influential?)
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(importance.head(10))

Validation ROC-AUC Score: 0.9664

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.97      0.98    113975
           1       0.48      0.82      0.61      4133

    accuracy                           0.96    118108
   macro avg       0.74      0.89      0.79    118108
weighted avg       0.98      0.96      0.97    118108


Top 10 Most Important Features:
    feature  importance
310    V258    0.156976
122     V70    0.078288
270    V218    0.065714
143     V91    0.051541
253    V201    0.036998
346    V294    0.033836
28      C14    0.010792
215    V163    0.010411
22       C8    0.009577
194    V142    0.006850
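The precision and recall in the report depend on the default 0.5 threshold applied by `model.predict`. As a rough illustration of how moving that threshold shifts the balance, the sketch below evaluates a few cut-offs on toy data; the helper `precision_recall_at` and the labels and scores are made up for the example and are not model outputs.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall of the positive class for a given threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s > threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s > threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s <= threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data: 6 non-fraud cases (0) followed by 4 fraud cases (1)
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.02, 0.10, 0.30, 0.55, 0.60, 0.45, 0.40, 0.65, 0.80, 0.95]

for threshold in (0.35, 0.5, 0.7):
    precision, recall = precision_recall_at(labels, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold catches more fraud at the cost of more false alarms, and raising it does the opposite, so the operating point can be tuned to the cost of each error type.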


## Test Set Prediction

In [None]:
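# Final Prediction on Test Set
test_probs = model.predict_proba(X_test_final)[:, 1]

# Build the submission table: one fraud probability per TransactionID
submission = pd.DataFrame({
    'TransactionID': test_df['TransactionID'],
    'isFraud': test_probs
})
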
print("\n" + "="*40)
print("SAMPLE SUBMISSION PREVIEW")
print("="*40)
print(submission.head(10))

print("\n" + "="*40)
print("PREDICTION STATISTICS")
print("="*40)
# Displaying statistics (mean, min, max) to ensure the predictions are valid
print(submission['isFraud'].describe())

# Check how many are predicted to have a high risk of Fraud (> 50%)
high_risk_count = (submission['isFraud'] > 0.5).sum()
print(f"\nNumber of transactions predicted as High Risk Fraud (>50%): {high_risk_count} out of {len(submission)} data.")


SAMPLE SUBMISSION PREVIEW
        TransactionID   isFraud
590540        3663549  0.003068
590541        3663550  0.010426
590542        3663551  0.004724
590543        3663552  0.013365
590544        3663553  0.002487
590545        3663554  0.038541
590546        3663555  0.025501
590547        3663556  0.087754
590548        3663557  0.000404
590549        3663558  0.082110

PREDICTION STATISTICS
count    506691.000000
mean          0.095619
std           0.171669
min           0.000007
25%           0.008409
50%           0.030644
75%           0.097773
max           0.999897
Name: isFraud, dtype: float64

Number of transactions predicted as High Risk Fraud (>50%): 20779 out of 506691 data.



# Conclusion & Analysis
Based on the development of the *end-to-end* Machine Learning pipeline using XGBoost for Fraud Detection, here are the key takeaways and analysis derived from the model performance:

## A. Model Performance
- **Handling Class Imbalance**  
  The model addresses the heavy class imbalance (about 1:28, from the computed ratio of 27.58) through the `scale_pos_weight` parameter. Histogram-based tree construction (`tree_method='hist'`) on GPU does not affect the imbalance; it keeps training fast on roughly 590,000 rows.

- **Excellent Discriminative Ability**  
  The model achieved a ROC-AUC Score of **0.9664** on the validation set. This means that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one about *96.6%* of the time.

- **High Recall (Sensitivity)**  
  The Classification Report shows a Recall of **0.82** for the Fraud class (1) at the default 0.5 threshold, meaning the model captures about *82%* of the actual fraud cases in the validation set. Recall is typically the priority metric in fraud detection.

- **Precision Trade-off**  
  The Precision for the Fraud class is **0.48**, so about 52% of the transactions flagged as fraud are actually legitimate. In many real-world settings this trade-off is acceptable, since flagging a transaction for review is usually cheaper than letting fraud slip through.
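The precision and recall figures above can be turned into approximate confusion-matrix counts, which makes the trade-off easier to picture. This is a back-of-the-envelope sketch: the report rounds its metrics to two decimals, so the counts below are estimates rather than the exact output of `confusion_matrix`.

```python
# Values taken from the validation classification report
support_pos = 4133      # actual fraud cases
support_neg = 113975    # actual non-fraud cases
recall_pos = 0.82       # rounded in the report
precision_pos = 0.48    # rounded in the report

# recall = TP / (TP + FN)  ->  TP is about recall * support_pos
tp = round(recall_pos * support_pos)
fn = support_pos - tp

# precision = TP / (TP + FP)  ->  FP is about TP / precision - TP
fp = round(tp / precision_pos) - tp
tn = support_neg - fp

accuracy = (tp + tn) / (support_pos + support_neg)

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"Approximate accuracy: {accuracy:.2f}")
```

Under these estimates the model catches about 3,400 fraud cases, misses about 740, and raises about 3,700 false alarms, which is consistent with the 0.96 accuracy in the report.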

## B. Feature Importance Analysis
The XGBoost model automatically identified the most significant signals in the dataset. The top features contributing to fraud predictions are:

- **V258** (Importance: ~0.157) — The dominant feature  
- **V70** (Importance: ~0.078)  
- **V218** (Importance: ~0.066)  

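These findings suggest that the anonymized *V-features* carry most of the signal in this model: eight of the top ten features are V-features, and only C14 and C8 come from other columns. Because the meaning of the V-features is not disclosed, reading them as transaction-frequency or historical-behavior variables remains a hypothesis.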
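To gauge how concentrated the importance is, the scores from the top-10 table can simply be added up. This sketch assumes that `feature_importances_` is normalized to sum to 1 across all features, so the sums can be read as shares of total importance; if that assumption does not hold for the installed XGBoost version, only the relative comparison remains valid.

```python
# Importance scores copied from the top-10 table above
top10 = {
    "V258": 0.156976, "V70": 0.078288, "V218": 0.065714,
    "V91": 0.051541, "V201": 0.036998, "V294": 0.033836,
    "C14": 0.010792, "V163": 0.010411, "C8": 0.009577,
    "V142": 0.006850,
}

ranked = sorted(top10.values(), reverse=True)

top3_share = sum(ranked[:3])    # V258 + V70 + V218
top10_share = sum(ranked)       # all ten listed features
v_share = sum(v for name, v in top10.items() if name.startswith("V"))

print(f"Top 3 share:  {top3_share:.3f}")
print(f"Top 10 share: {top10_share:.3f}")
print(f"V-features within top 10: {v_share:.3f}")
```

Under the normalization assumption, the three leading features account for about 30% of total importance and the top ten for about 46%, with the remainder spread over the other columns.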

## C. Test Set Prediction Profile
On the unlabeled test set (`test_transaction.csv`), the predicted probabilities are concentrated near zero (median about 3.1%), with a long tail of high-risk scores:

- **Mean Fraud Probability:** about 9.6%  
- **High-Risk Flags (> 50% probability):** 20,779 transactions out of 506,691 (about 4.1%)  

At roughly 4% of all test transactions, this volume could be routed to prioritized manual review or to secondary authentication layers in a production environment.
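The figures in this profile can be put next to the fraud rate seen in training. The sketch below uses only numbers reported earlier in the notebook; the interpretation in the comments is a hedged reading, not a verified result.

```python
# Numbers reported in the test-set prediction output
n_test = 506691
n_flagged = 20779          # predictions above 0.5
mean_prob = 0.095619       # mean predicted fraud probability

# Class imbalance ratio printed during training (negatives / positives)
train_ratio = 27.58

flag_rate = n_flagged / n_test
base_rate = 1 / (1 + train_ratio)   # fraud share implied by the ratio

print(f"Flagged share of test set: {flag_rate:.2%}")
print(f"Training fraud rate:       {base_rate:.2%}")
print(f"Mean predicted probability is {mean_prob / base_rate:.1f}x the training fraud rate")
```

The mean predicted probability sits well above the training fraud rate. One plausible explanation is that weighting the positive class with `scale_pos_weight` pushes scores upward, so the outputs are better treated as risk scores for ranking than as calibrated probabilities; a shift between the training and test periods could also contribute.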

## D. Final Verdict
The XGBoost model proves *effective* and *efficient* for this high-dimensional, imbalanced tabular dataset. It reaches a strong validation AUC (**0.9664**) with minimal feature engineering, which supports the use of gradient-boosted trees for financial fraud detection. Because the validation set is a random stratified split of the training data, performance on future, unseen transactions may differ.
