<a href="https://colab.research.google.com/github/sammyhasan17/fraud-detection-using-ML/blob/main/fraud_detection_xgboost_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Credit‑Card Fraud Detection with XGBoost

This notebook provides an end‑to‑end example of building a fraud detection model using **XGBoost**. A real‑world dataset of anonymized credit‑card transactions is pulled directly from Figshare via its public API. The dataset covers transactions made by European cardholders in September 2013: 284,807 transactions, of which only 492 (≈0.17 %) are fraudulent. To protect cardholder privacy, all input features except `Time` and `Amount` have been transformed using principal‑component analysis.

Key tasks performed in this notebook:

* Use the Figshare REST API to fetch metadata for the dataset and obtain the direct download URL.
* Download the CSV file and load it into a Pandas DataFrame.
* Clean the data (remove duplicates, check for missing values, standardize `Time` and `Amount`) and explore the class distribution.
* Split the data into training and test sets while preserving the fraud/non‑fraud ratio.
* Train an XGBoost model with class imbalance handling (`scale_pos_weight`).
* Evaluate the model using accuracy, recall, precision, F1 score and ROC‑AUC.
* (Optional) Create an ensemble model using stacking to combine XGBoost with other classifiers.

The code is written to be transparent and reproducible; comments explain each step so that you can adapt it for your own experimentation. Running this notebook requires an internet connection to download the dataset and the necessary Python libraries (e.g. `xgboost`, `pandas`, `numpy`, `scikit‑learn`, and `requests`).

In [None]:
# Install missing packages (run once).
# Uncomment the following lines if you don't already have the required packages.
# !pip install pandas numpy scikit-learn xgboost requests tqdm --quiet

import json
import os
import requests
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from xgboost import XGBClassifier
from tqdm.auto import tqdm

# Display versions for reproducibility
print('pandas:', pd.__version__)
print('numpy:', np.__version__)

pandas: 2.2.2
numpy: 2.0.2


## 1. Use the Figshare REST API to get our data

The data reside in a public Figshare record (article ID `29270873`). Figshare exposes a REST API at `https://api.figshare.com/v2`. We first query the article metadata to find the file ID and direct download URL, then we stream the file to disk. If you already have the file locally, you can skip this step.

In [None]:
# Replace these constants only if the Figshare article or file IDs change
FIGSHARE_ARTICLE_ID = 29270873
LOCAL_DATA_PATH = 'creditcard.csv'

def get_figshare_article(article_id: int) -> dict:
    """Retrieve metadata for a Figshare article and return it as a dict."""
    url = f'https://api.figshare.com/v2/articles/{article_id}'
    # A timeout keeps the notebook from hanging if the API is unreachable
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

def download_figshare_file(download_url: str, dest_path: str) -> None:
    """Stream a file from Figshare to disk in chunks to avoid memory issues."""
    # Write to a temporary name first so that an interrupted download is
    # never mistaken for a complete file on the next run
    tmp_path = dest_path + '.part'
    with requests.get(download_url, stream=True, timeout=30) as r:
        r.raise_for_status()
        # Fall back to an open-ended progress bar if the size is unknown
        total_size = int(r.headers.get('Content-Length', 0)) or None
        chunk_size = 1024 * 1024  # 1 MB
        with open(tmp_path, 'wb') as f:
            with tqdm(total=total_size, unit='B', unit_scale=True, desc='Downloading') as pbar:
                for chunk in r.iter_content(chunk_size=chunk_size):
                    if chunk:
                        f.write(chunk)
                        pbar.update(len(chunk))
    os.replace(tmp_path, dest_path)

# If the CSV is not present locally, fetch metadata and download it
if not os.path.exists(LOCAL_DATA_PATH):
    article_meta = get_figshare_article(FIGSHARE_ARTICLE_ID)
    file_info = article_meta['files'][0]
    print(f"Found file on Figshare: {file_info['name']} ({file_info['size'] / 1e6:.2f} MB)")
    download_url = file_info['download_url']
    print(f'Downloading from {download_url}')
    download_figshare_file(download_url, LOCAL_DATA_PATH)
else:
    print(f'{LOCAL_DATA_PATH} already exists – skipping download.')

creditcard.csv already exists – skipping download.
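
The download cell takes the first entry in the article's `files` list. If a record ever holds more than one file, selecting by file name is a little safer. The following is a minimal sketch, assuming each entry carries the `name`, `size` and `download_url` keys used above; `pick_csv_file` and the example metadata are hypothetical and not part of the notebook.

```python
def pick_csv_file(files):
    """Return the first file entry whose name ends with '.csv' (case-insensitive)."""
    for entry in files:
        if entry.get('name', '').lower().endswith('.csv'):
            return entry
    raise ValueError('No CSV file found in the Figshare article metadata')

# Hypothetical metadata shaped like the `files` list in the article response
example_files = [
    {'name': 'README.txt', 'size': 1200, 'download_url': 'https://example.org/readme'},
    {'name': 'creditcard.csv', 'size': 150000000, 'download_url': 'https://example.org/csv'},
]

chosen = pick_csv_file(example_files)
print(chosen['name'], chosen['download_url'])
```

In the notebook this would replace `article_meta['files'][0]` with `pick_csv_file(article_meta['files'])`.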


## 2. Load data into a DataFrame

Once downloaded, the dataset is a CSV file with 31 columns: the 28 PCA components `V1`–`V28`, the raw `Time` and `Amount` features, and the binary `Class` label (1 for a fraudulent transaction, 0 for a legitimate one).

In [None]:
# Load the CSV into a DataFrame
data = pd.read_csv(LOCAL_DATA_PATH)
print('Dataset shape:', data.shape)
data.head()

Dataset shape: (284807, 31)


|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


### Inspect class distribution

The dataset is highly imbalanced: only ~0.17 % of the transactions are fraudulent. We compute the number and proportion of each class below.

In [None]:
# Compute class distribution
class_counts = data['Class'].value_counts().sort_index()
print('Class distribution:')
for cls, count in class_counts.items():
    print(f'  Class {cls}: {count} samples ({count / len(data) * 100:.3f}%)')

Class distribution:
  Class 0: 284315 samples (99.827%)
  Class 1: 492 samples (0.173%)
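
A quick calculation from these counts shows why accuracy alone says little on this dataset: a rule that labels every transaction as legitimate already scores about 99.8 % accuracy while catching no fraud at all. The variable names below are illustrative.

```python
n_legit = 284315  # Class 0 count from the output above
n_fraud = 492     # Class 1 count from the output above
total = n_legit + n_fraud

# A trivial "classifier" that labels every transaction as legitimate
baseline_accuracy = n_legit / total
baseline_recall = 0 / n_fraud  # it never flags a single fraud

print(f'Baseline accuracy: {baseline_accuracy:.5f}')
print(f'Baseline fraud recall: {baseline_recall:.1f}')
```

This is why the later sections report precision, recall and F1 for the fraud class rather than relying on accuracy.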


## 3. Data cleaning and preprocessing

We perform standard preparation steps: removing duplicate rows, checking for missing values, standardizing `Time` and `Amount`, and creating stratified training and test sets.

In [None]:
# 1. Remove duplicate rows
initial_rows = len(data)
data = data.drop_duplicates()
print(f'Removed {initial_rows - len(data)} duplicate rows')

# 2. Check for missing values
missing_counts = data.isnull().sum()
if missing_counts.any():
    print('Missing values detected:', missing_counts[missing_counts > 0])
else:
    print('No missing values detected.')

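# 3. Separate features and target
X = data.drop('Class', axis=1)
y = data['Class']

# 4. Train/test split with stratification (done before scaling so that
#    no test-set statistics leak into the scaler)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# 5. Standardize 'Time' and 'Amount' with statistics from the training set only
scaler = StandardScaler()
X_train[['Time', 'Amount']] = scaler.fit_transform(X_train[['Time', 'Amount']])
X_test[['Time', 'Amount']] = scaler.transform(X_test[['Time', 'Amount']])

print('Training set size:', X_train.shape[0])
print('Test set size:', X_test.shape[0])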

Removed 1081 duplicate rows
No missing values detected.
Training set size: 198608
Test set size: 85118
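
Scaling statistics should be learned from training data only, and that also applies inside each cross-validation fold. One way to guarantee this is to place the scaler in a scikit-learn `Pipeline`, so it is refit on every training fold. The sketch below assumes a synthetic stand-in frame with hypothetical values; `make_scaled_pipeline`, `demo_X` and `demo_y` are illustrative names, and logistic regression is used only because it is sensitive to feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_scaled_pipeline(estimator, scale_cols=('Time', 'Amount')):
    """Wrap `estimator` so the listed columns are standardized inside each fit."""
    preprocess = ColumnTransformer(
        transformers=[('scale', StandardScaler(), list(scale_cols))],
        remainder='passthrough',  # leave the PCA components untouched
    )
    return Pipeline([('preprocess', preprocess), ('model', estimator)])

# Synthetic stand-in for the real data (hypothetical values, two PCA-like columns)
rng = np.random.default_rng(0)
n = 600
demo_X = pd.DataFrame({
    'Time': rng.uniform(0, 172800, n),
    'V1': rng.normal(size=n),
    'V2': rng.normal(size=n),
    'Amount': rng.exponential(80.0, n),
})
demo_y = (demo_X['V1'] + 0.5 * demo_X['V2'] > 1.5).astype(int)

pipe = make_scaled_pipeline(LogisticRegression(max_iter=1000, class_weight='balanced'))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, demo_X, demo_y, cv=cv, scoring='roc_auc')
print('Fold ROC-AUC:', scores.round(3))
```

With a pipeline like this, the unscaled features can be passed to `cross_val_score` or `StackingClassifier` directly, and the scaler never sees the held-out fold.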


## 4. Train an XGBoost fraud detector

Because fraud cases are extremely rare, we set the `scale_pos_weight` hyperparameter to the ratio of negative to positive examples in the training data, which up-weights errors on the fraud class. We also use a moderate number of trees (200) and limit their depth (4) to reduce overfitting.
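
As a rough way to read `scale_pos_weight`: with the `binary:logistic` objective it acts like a weight $w$ on the positive (fraud) examples in the log loss. The notation below is introduced here for illustration only, with $y_i$ the label, $p_i$ the predicted fraud probability, and $N_{\text{neg}}$, $N_{\text{pos}}$ the class counts in the training set.

```latex
\mathcal{L} = -\sum_{i=1}^{N} \Bigl[\, w \, y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr],
\qquad
w = \frac{N_{\text{neg}}}{N_{\text{pos}}}
```

Setting $w$ to the negative/positive ratio makes the total weight of the fraud examples, $w \cdot N_{\text{pos}}$, equal to $N_{\text{neg}}$, so both classes contribute comparably to the loss.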

In [None]:
# Compute scale_pos_weight as ratio of negative to positive samples
neg_pos_ratio = (y_train == 0).sum() / (y_train == 1).sum()
print(f'Negative/positive ratio in training set: {neg_pos_ratio:.0f}:1')

xgb_clf = XGBClassifier(
    objective='binary:logistic',
    eval_metric='aucpr',
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=neg_pos_ratio,
    random_state=42,
    n_jobs=-1
)

xgb_clf.fit(X_train, y_train)

y_pred = xgb_clf.predict(X_test)
y_proba = xgb_clf.predict_proba(X_test)[:, 1]

print('Classification report:')
print(classification_report(y_test, y_pred, digits=4))

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:', cm)

roc_auc = roc_auc_score(y_test, y_proba)
print(f'ROC-AUC: {roc_auc:.4f}')

Negative/positive ratio in training set: 599:1
Classification report:
              precision    recall  f1-score   support

           0     0.9997    0.9997    0.9997     84976
           1     0.8085    0.8028    0.8057       142

    accuracy                         0.9994     85118
   macro avg     0.9041    0.9012    0.9027     85118
weighted avg     0.9994    0.9994    0.9994     85118

Confusion matrix: [[84949    27]
 [   28   114]]
ROC-AUC: 0.9738
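
The fraud-class figures in the report follow directly from the confusion matrix. As a worked check using the four counts printed above (the variable names are illustrative):

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 84949, 27   # legitimate transactions: correctly passed / wrongly flagged
fn, tp = 28, 114     # fraudulent transactions: missed / correctly flagged

precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f'precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}')
```

The 27 false alarms and 28 missed frauds barely move accuracy, which is why the fraud-class precision and recall are the more informative numbers.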


## 5. Cross-validation and hyperparameter tuning (optional)

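Stratified 5-fold cross-validation on the training set gives a more robust estimate of performance than a single train/test split. The five ROC-AUC scores below range from about 0.974 to 0.996. Hyperparameter tuning, which this notebook does not perform, may improve results further.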

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(xgb_clf, X_train, y_train, cv=cv, scoring='roc_auc', n_jobs=-1)
print('Cross-validated ROC-AUC scores:', cv_scores)
print('Mean ROC-AUC:', cv_scores.mean())

Cross-validated ROC-AUC scores: [0.99054979 0.98701484 0.97796555 0.99557356 0.97373712]
Mean ROC-AUC: 0.9849681729808781


## 6. Creating an ensemble (stacking) model

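Combine logistic regression and a random forest as base learners, with an XGBoost classifier as the final estimator, using scikit-learn's `StackingClassifier`. As configured here, the ensemble reaches higher fraud recall than the single XGBoost model (0.8944 vs. 0.8028) but far lower precision (0.0366 vs. 0.8085), as the report below shows.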

In [None]:
from xgboost import XGBClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

base_learners = [
    ('lr', LogisticRegression(max_iter=1000, class_weight='balanced')),
    ('rf', RandomForestClassifier(n_estimators=200, max_depth=8, class_weight='balanced_subsample', random_state=42))
]

final_estimator = XGBClassifier(
    objective='binary:logistic',
    eval_metric='aucpr',
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=neg_pos_ratio,
    random_state=42,
    n_jobs=-1
)

stacking_clf = StackingClassifier(estimators=base_learners, final_estimator=final_estimator, n_jobs=-1)
stacking_clf.fit(X_train, y_train)

ensemble_pred = stacking_clf.predict(X_test)
ensemble_proba = stacking_clf.predict_proba(X_test)[:, 1]

print('Stacking model classification report:')
print(classification_report(y_test, ensemble_pred, digits=4))
ensemble_auc = roc_auc_score(y_test, ensemble_proba)
print('Ensemble ROC-AUC:', ensemble_auc)

Stacking model classification report:
              precision    recall  f1-score   support

           0     0.9998    0.9607    0.9799     84976
           1     0.0366    0.8944    0.0703       142

    accuracy                         0.9606     85118
   macro avg     0.5182    0.9275    0.5251     85118
weighted avg     0.9982    0.9606    0.9783     85118

Ensemble ROC-AUC: 0.9699187227014884
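
The fraud-class precision of 0.0366 suggests that the default 0.5 decision threshold is too permissive for this ensemble's probability scale. One option is to choose the threshold explicitly. The sketch below is a plain-Python illustration with hypothetical toy data; `pick_threshold` is not part of the notebook, and in practice the threshold should be chosen on a validation split rather than on the test set.

```python
def pick_threshold(y_true, scores, min_precision):
    """Return (threshold, precision, recall) with the highest recall whose
    precision is at least `min_precision`, or None if no threshold qualifies."""
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
        precision = tp / (tp + fp)  # tp + fp >= 1 because threshold is one of the scores
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best

# Toy labels and predicted probabilities (hypothetical, not taken from the notebook)
toy_y = [0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.20, 0.55, 0.60, 0.65, 0.70, 0.80, 0.95]

threshold, precision, recall = pick_threshold(toy_y, toy_scores, min_precision=0.75)
print(f'threshold={threshold} precision={precision:.2f} recall={recall:.2f}')
```

Predictions are then made with `proba >= threshold` instead of `predict`, which trades some recall for fewer false alarms.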


## 7. Conclusion

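This notebook demonstrated how to download a real-world credit-card fraud dataset from Figshare, perform basic preprocessing, and build a fraud detector using XGBoost. We accounted for severe class imbalance using the `scale_pos_weight` parameter. On the held-out test set, the XGBoost model reached a precision of 0.8085 and a recall of 0.8028 on the fraud class, with a ROC-AUC of 0.9738. The optional stacking ensemble raised fraud recall to 0.8944 but, as configured, cut precision to 0.0366 and did not improve ROC-AUC (0.9699), so combining classifiers does not guarantee better results without further tuning.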