## 1.  Project Overview
Credit card fraud is a critical issue for banks and financial institutions, leading to **billions of dollars in losses annually**.  
This project aims to develop a **machine learning pipeline** for fraud detection using the **Kaggle Credit Card Fraud Dataset**.

The workflow includes:
- **Data Preprocessing & Exploratory Data Analysis (EDA)**  
- **Class Imbalance Handling** using SMOTE & SMOTEENN  
- **Model Training & Hyperparameter Tuning** (Logistic Regression, Random Forest, XGBoost)  
- **Model Evaluation & Comparison** using ROC-AUC, PR-AUC, Precision, Recall, and F1-score  
- **Model Selection** for deployment

## 2.  Import Required Libraries

We begin by importing all the **necessary libraries** for data manipulation, visualization, modeling, and evaluation:

- **Pandas, Numpy** → Data handling and numerical operations  
- **Matplotlib, Seaborn** → Visualization  
- **Scikit-learn** → Preprocessing, model training, evaluation  
- **XGBoost** → Gradient boosting model for structured data  
- **Imbalanced-learn** → Resampling techniques (SMOTE, SMOTEENN)  
- **Joblib** → Model saving and loading  
- **SHAP** → Model interpretability (optional)  

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, roc_curve, precision_recall_curve,
    average_precision_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
import joblib

# Optional SHAP
try:
    import shap
    shap_available = True
except ImportError:
    shap_available = False
    print("SHAP not installed. Skipping explainability.")

## 3.  Dataset Overview
- **Total Transactions**: 284,807  
- **Fraudulent Transactions**: 492 (~0.17%) → **severe imbalance**  
- **Features**:  
  - 28 anonymized PCA-transformed features (`V1`–`V28`)  
  - `Time` → seconds elapsed since first transaction  
  - `Amount` → transaction value  
- **Target Variable**: `Class`  
  - `0` → Non-Fraud  
  - `1` → Fraud
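
The imbalance figure above is plain arithmetic on the two counts in the list. A minimal sketch, where the helper name `fraud_rate` is made up for illustration:

```python
def fraud_rate(n_fraud: int, n_total: int) -> float:
    """Return the fraction of transactions labeled as fraud."""
    return n_fraud / n_total


# Counts taken from the dataset overview above
rate = fraud_rate(492, 284_807)
legit_per_fraud = (284_807 - 492) / 492

print(f"Fraud rate: {rate:.4%}")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

On the real data the same ratio should be available from `df["Class"].value_counts(normalize=True)`.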

## 4.  Data Splitting
We perform a **stratified split** to preserve the fraud/non-fraud ratio in both sets:

- **Train Set (80%)**: 227,845 transactions  
- **Test Set (20%)**: 56,962 transactions  

Fraud cases:  
- Train → 394 frauds  
- Test → 98 frauds  

The train set is further split into:
- **Training subset (80%)**  
- **Validation subset (20%)**  
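
To see how the two nested 80/20 splits produce roughly 64% / 16% / 20% of the data, here is a sketch on synthetic labels with the same class counts as the dataset. Only the labels are simulated, and the index names (`idx_fit`, `idx_val`, `idx_test`) are illustrative, not the notebook's own variables:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 492 frauds among 284,807 transactions (counts from the overview)
y = np.zeros(284_807, dtype=int)
y[:492] = 1
idx = np.arange(len(y))

# Stage 1: 80/20 train/test, stratified on the label
idx_train, idx_test = train_test_split(
    idx, test_size=0.2, stratify=y, random_state=42
)
# Stage 2: 80/20 train/validation inside the train set
idx_fit, idx_val = train_test_split(
    idx_train, test_size=0.2, stratify=y[idx_train], random_state=42
)

for name, part in [("fit", idx_fit), ("val", idx_val), ("test", idx_test)]:
    share = len(part) / len(y)
    print(f"{name}: {len(part)} rows, {int(y[part].sum())} frauds, {share:.1%} of data")
```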



In [None]:
df = pd.read_csv("C:/Users/tiwar/OneDrive/Desktop/credit_card_fraud/data/creditcard.csv")

# Split into train and test
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Class"], random_state=42
)

train_df.to_csv("creditcard_train.csv", index=False)
test_df.to_csv("creditcard_test.csv", index=False)

print("Data split completed!")
print("Train shape:", train_df.shape, "Test shape:", test_df.shape)
print("Fraud cases in Train:", train_df["Class"].sum())
print("Fraud cases in Test:", test_df["Class"].sum())

df = train_df.copy()

## 4.  Exploratory Data Analysis (EDA)

### 4.1 Class Imbalance
- Fraud = **0.17%**  
- Non-fraud = **99.83%**

 **Insight**: Models trained naively will predict almost everything as *non-fraud*.  
Thus, we must use **resampling techniques** and imbalance-aware metrics.

In [None]:
# Class distribution
plt.figure(figsize=(6,4))
sns.countplot(x="Class", data=df)
plt.title("Fraud (1) vs Non-Fraud (0) Distribution (Train)")
plt.show()

# Pie chart
plt.figure(figsize=(6,6))
df["Class"].value_counts().plot.pie(
    autopct='%1.2f%%',
    labels=["Non-Fraud","Fraud"],
    colors=["skyblue","red"]
)
plt.title("Fraud vs Non-Fraud (Pie Chart) (Train)")
plt.show()

### 4.2 Transaction Amount Distribution
- **Linear Scale**:  
  - Highly right-skewed.  
  - Most transactions are **small (< $100)**.  
  - Some extreme values (~$25,000).  

- **Log Scale**:  
  - Distribution reveals multiple peaks.  
  - Indicates popular transaction ranges (e.g., small daily purchases, medium recurring payments).

 **Insight**: Fraudulent transactions often cluster at **small values**, possibly to avoid detection.
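
A log scale cannot represent a zero amount, so any zero-valued transactions need separate handling. The sketch below uses made-up amounts and assumes zeros can occur. It contrasts a plain logarithm with `np.log1p`, which stays finite for every non-negative value:

```python
import numpy as np

# Made-up transaction amounts, including a zero
amounts = np.array([0.0, 0.99, 9.99, 99.0, 25_000.0])

with np.errstate(divide="ignore"):
    plain_log = np.log10(amounts)   # the zero amount maps to -inf

shifted_log = np.log1p(amounts)     # log(1 + x) is finite for all x >= 0

print(plain_log)
print(shifted_log)
```

Either dropping zero amounts before plotting or plotting `log1p(Amount)` avoids the problem; which one reads better depends on whether zero-value transactions matter to the analysis.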


In [None]:
# Amount distribution
plt.figure(figsize=(8,5))
sns.histplot(df["Amount"], bins=100, kde=True)
plt.title("Transaction Amount Distribution (Train)")
plt.show()

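# Amount distribution on a log scale
# Zero amounts are excluded because log(0) is undefined
positive_amounts = df.loc[df["Amount"] > 0, "Amount"]
plt.figure(figsize=(8,5))
sns.histplot(positive_amounts, bins=100, log_scale=True, kde=True)
plt.title("Transaction Amount Distribution (Log Scale) (Train)")
plt.show()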

### 4.3 Amount Distribution by Class
- Non-fraud transactions span entire value range.  
- Fraud transactions concentrated below **$500**.  

 **Insight**: Fraudsters seem to prefer smaller charges, likely to bypass suspicion.

In [None]:
# Amount by class
plt.figure(figsize=(8,5))
sns.histplot(df[df["Class"]==0]["Amount"], bins=50, color="blue", label="Non-Fraud", alpha=0.6)
sns.histplot(df[df["Class"]==1]["Amount"], bins=50, color="red", label="Fraud", alpha=0.6)
plt.legend()
plt.title("Amount Distribution by Class (Train)")
plt.show()

### 4.4 Transaction Time Analysis
- Non-fraud transactions:  
  - Follow a clear **daily cycle**.  
  - Peaks during **business hours**.  

- Fraud transactions:  
  - More **randomly distributed** across time.  
  - Appear during unusual hours.  

**Insight**: Fraud may be linked to **off-peak times**, when monitoring is weaker.

In [None]:
# Transaction time distribution
plt.figure(figsize=(10,5))
sns.histplot(df[df["Class"]==0]["Time"], bins=50, color="blue", label="Non-Fraud", alpha=0.6)
sns.histplot(df[df["Class"]==1]["Time"], bins=50, color="red", label="Fraud", alpha=0.6)
plt.legend()
plt.title("Transaction Time Distribution by Class (Train)")
plt.xlabel("Time (seconds since first transaction)")
plt.show()

### 4.5 Correlation Heatmap
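- Features (`V1`–`V28`) are pairwise **decorrelated**, as expected from a PCA transformation.  
- No single feature shows a strong linear correlation with `Class`.  

**Insight**: The heatmap is consistent with PCA-anonymized inputs. Low pairwise correlation is a property of PCA itself, so it says nothing about data leakage, which has to be ruled out separately (for example by keeping the test split untouched until final evaluation).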
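
Why PCA outputs look decorrelated can be shown on toy data. The sketch below builds three correlated synthetic features, projects them onto their principal directions with an SVD, and compares the correlation matrices. All data and names here are made up and unrelated to the real `V1`–`V28`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mix three independent sources into three correlated features
mixing = np.array([[1.0, 0.8, 0.2],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
raw = rng.normal(size=(1000, 3)) @ mixing
centered = raw - raw.mean(axis=0)

# PCA via SVD: rows of vt are the principal directions
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T

corr_raw = np.corrcoef(raw, rowvar=False)
corr_pca = np.corrcoef(scores, rowvar=False)
off_diag = corr_pca[~np.eye(3, dtype=bool)]

print("Raw corr(feature 0, feature 1):", round(corr_raw[0, 1], 3))
print("Largest |off-diagonal| after PCA:", np.abs(off_diag).max())
```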

In [None]:
# Correlation heatmap
plt.figure(figsize=(12,8))
corr = df.corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap (Train)")
plt.show()

### 4.6 PCA Feature Distributions
- Example: **V1–V3 distributions** differ slightly for fraud vs non-fraud.  
- Fraudulent cases tend to deviate more strongly in **V2 & V3**.  

**Insight**: Even anonymized PCA features capture **fraud-specific anomalies**.

In [None]:
# PCA feature distributions
fraud = df[df["Class"]==1]
nonfraud = df[df["Class"]==0]

plt.figure(figsize=(12,6))
for col in ["V1","V2","V3"]:
    sns.kdeplot(nonfraud[col], label=f"{col} Non-Fraud", fill=True, alpha=0.3)
    sns.kdeplot(fraud[col], label=f"{col} Fraud", fill=True, alpha=0.3)
plt.title("Distribution of PCA Features (V1–V3) by Class (Train)")
plt.legend()
plt.show()

## 5.  Feature Engineering

- Split features (`X`) and target (`y` = Class).  
- Standardize **Time** and **Amount** with `StandardScaler` (fit on the 80% train set) so they match the scale of the PCA features.  
- Apply the same fitted scaler to the test set.  
- Further split the training data into **train (64%)** and **validation (16%)** subsets for model tuning.  
- Final test set (20%) is saved as `creditcard_test.csv` for unbiased evaluation.
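
The fit-on-train, transform-on-test pattern can be sketched on synthetic amounts. The data below is made up; the point is only that the held-out data reuses the mean and standard deviation learned from the training data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
train_amount = rng.exponential(scale=88.0, size=(1000, 1))  # made-up training amounts
test_amount = rng.exponential(scale=88.0, size=(250, 1))    # made-up held-out amounts

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_amount)  # learns mean/std from train only
test_scaled = scaler.transform(test_amount)        # reuses the train statistics

print("Train mean/std after scaling:", train_scaled.mean(), train_scaled.std())
print("Test mean/std after scaling: ", test_scaled.mean(), test_scaled.std())
```

The test statistics will generally not be exactly 0 and 1, which is expected: the scaler never sees the test data during fitting.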


In [None]:
X = df.drop("Class", axis=1)
y = df["Class"]

scaler = StandardScaler()
X[["Time", "Amount"]] = scaler.fit_transform(X[["Time", "Amount"]])

X_test_full = test_df.drop("Class", axis=1).copy()
y_test_full = test_df["Class"]
X_test_full[["Time", "Amount"]] = scaler.transform(X_test_full[["Time", "Amount"]])

# Train/validation split for model tuning
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train shape:", X_train.shape, "Validation shape:", X_val.shape)

## 6.  Handling Class Imbalance
We test two approaches:  

1. **SMOTE (Synthetic Minority Oversampling Technique)**  
   - Creates synthetic fraud samples.  
   - Ensures **balanced 50:50 dataset**.  

2. **SMOTEENN (SMOTE + Edited Nearest Neighbors)**  
   - Adds oversampling + cleaning of noisy samples.  
   - Results in slightly fewer samples but reduces overlap.  

 **Insight**:  
- **SMOTE** → preserves more samples.  
- **SMOTEENN** → creates cleaner training sets.
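
The core idea behind SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in a few lines of NumPy. This is an illustration of the technique only, not imbalanced-learn's implementation, and the function name `smote_like_sample` is hypothetical:

```python
import numpy as np


def smote_like_sample(minority: np.ndarray, k: int, n_new: int, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]    # k nearest neighbors per point

    base = rng.integers(0, len(minority), size=n_new)         # random base points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor per base
    gap = rng.random((n_new, 1))                              # weight in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])


# Toy minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority, k=2, n_new=6)
print(synthetic)
```

Every synthetic point lies on a segment between two real minority points. That is why SMOTE can blur class boundaries where the classes overlap, and why SMOTEENN follows it with a cleaning step.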

In [None]:
strategies = {
    "SMOTE": SMOTE(random_state=42),
    "SMOTEENN": SMOTEENN(random_state=42)
}

## 7.  Models & Hyperparameter Tuning
We test 3 classifiers:  

1. **Logistic Regression**  
   - Linear, interpretable.  
   - Class-weight balancing applied.  

2. **Random Forest**  
   - Bagging ensemble of decision trees, trained here on the resampled data to counter the imbalance.  
   - Tuned: `n_estimators`, `max_depth`, `max_features`.  

3. **XGBoost**  
   - Gradient boosting, handles complex patterns.  
   - Tuned: `n_estimators`, `max_depth`, `learning_rate`.  

 **Search Method**: RandomizedSearchCV with **ROC-AUC scoring**.
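
A minimal sketch of a randomized search scored by ROC-AUC, run on a small synthetic imbalanced dataset. The data, grid values, and variable names are illustrative assumptions, kept deliberately small so the search finishes quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data with roughly a 90/10 class split
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, None]},
    n_iter=2,            # sample 2 of the 4 possible combinations
    scoring="roc_auc",   # rank candidates by cross-validated ROC-AUC
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print("Best CV ROC-AUC:", round(search.best_score_, 3))
```

One design point to keep in mind: if the data is resampled before cross-validation, synthetic samples can end up in the validation folds and inflate the scores. Wrapping the sampler and the classifier in `imblearn.pipeline.Pipeline` is one way to confine resampling to the training folds.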

In [None]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(random_state=42),
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42)
}

param_grids = {
    "Random Forest": {
        "n_estimators": [100, 200],
        "max_depth": [10, None],
        "max_features": ["sqrt", "log2"]
    },
    "XGBoost": {
        "n_estimators": [100, 200],
        "max_depth": [3, 6],
        "learning_rate": [0.05, 0.1]
    }
}

results = {}
all_fpr, all_tpr, all_auc, all_prec_rec = {}, {}, {}, {}

## 8. Evaluation Metrics
Why standard accuracy is misleading:  
- Always predicting *non-fraud* yields ~99.8% accuracy but **zero fraud detection**.  

Thus, we focus on:  
- **ROC-AUC** → overall separability.  
- **PR-AUC** → better for imbalanced datasets.  
- **Precision** → of predicted frauds, how many are correct.  
- **Recall** → of true frauds, how many are detected.  
- **F1-score** → harmonic mean of Precision & Recall.
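
The accuracy trap and the three threshold metrics can be worked through by hand. The baseline below uses the test-set counts from the data-splitting section (98 frauds in 56,962 transactions). The second detector's counts are invented for illustration, and the helper name `precision_recall_f1` is hypothetical:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Baseline: always predict non-fraud on the test set
n_total, n_fraud = 56_962, 98
baseline_accuracy = (n_total - n_fraud) / n_total
baseline = precision_recall_f1(tp=0, fp=0, fn=n_fraud)

# Invented detector: catches 86 of 98 frauds and raises 48 false alarms
example = precision_recall_f1(tp=86, fp=48, fn=12)

print(f"Baseline accuracy: {baseline_accuracy:.4f}, precision/recall/F1: {baseline}")
print("Example precision/recall/F1:", tuple(round(v, 3) for v in example))
```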

## 9.  Results & Insights

### 9.1 Logistic Regression
- **Recall ~0.87** → detects most frauds.  
- **Precision ~0.05** → many false positives.  
- F1 ~0.10.  

 **Insight**: Good at catching fraud, but too many false alarms → not practical.

### 9.2 Random Forest
- With **SMOTE**:  
  - Precision = **0.64**  
  - Recall = 0.88  
  - F1 = **0.74**  
  - ROC-AUC = **0.977**  
  - PR-AUC = 0.786  

- With **SMOTEENN**:  
  - Precision = 0.59  
  - Recall = 0.88  
  - F1 = 0.70  
  - ROC-AUC = **0.984 (best overall)**  

 **Insight**: Random Forest achieves the **best trade-off between recall and precision**.

### 9.3 XGBoost
- Recall ~0.89 (very high).  
- Precision ~0.19–0.23 (moderate).  
- F1 ~0.31.  
- ROC-AUC ~0.969–0.970.  

 **Insight**: Very strong recall but tends to **over-flag frauds**.

In [None]:
for strat_name, sampler in strategies.items():
    print(f"\n========== Using {strat_name} ==========")
    X_train_res, y_train_res = sampler.fit_resample(X_train, y_train)
    print("Resampled shape:", X_train_res.shape, y_train_res.value_counts().to_dict())

    for name, model in models.items():
        print(f"\n🔹 Training {name} with {strat_name}...")

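        if name in param_grids:
            # Tune on at most 50,000 resampled rows to keep the search fast.
            # Note: best_estimator_ is refit on this subsample, not on the full resampled set.
            sample_size = min(50000, len(X_train_res))
            X_sample = X_train_res.sample(sample_size, random_state=42)
            y_sample = y_train_res.loc[X_sample.index]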

            search = RandomizedSearchCV(
                model,
                param_distributions=param_grids[name],
                n_iter=3,
                scoring="roc_auc",
                cv=3,
                n_jobs=1,
                random_state=42
            )
            search.fit(X_sample, y_sample)
            model = search.best_estimator_
            print("Best Params:", search.best_params_)
        else:
            model.fit(X_train_res, y_train_res)

        # Predictions
        y_pred = model.predict(X_val)
        y_prob = model.predict_proba(X_val)[:, 1]

        # Metrics
        auc = roc_auc_score(y_val, y_prob)
        pr_auc = average_precision_score(y_val, y_prob)
        print(classification_report(y_val, y_pred))
        print("ROC-AUC:", auc)
        print("PR-AUC:", pr_auc)

        # Save model
        filename = f"{name.replace(' ', '_').lower()}_{strat_name.lower()}_fraud_model.pkl"
        joblib.dump(model, filename)

        # Store results
        results[(name, strat_name)] = {"auc": auc, "pr_auc": pr_auc}

        # ROC + PR curves
        fpr, tpr, _ = roc_curve(y_val, y_prob)
        prec, rec, _ = precision_recall_curve(y_val, y_prob)
        all_fpr[(name, strat_name)] = fpr
        all_tpr[(name, strat_name)] = tpr
        all_auc[(name, strat_name)] = auc
        all_prec_rec[(name, strat_name)] = (prec, rec)

## 10.  ROC & PR Curves

- **ROC Curve**: All models >0.96 AUC.  
  - Random Forest (SMOTEENN) has the highest AUC (0.984), ahead of Random Forest (SMOTE) at 0.977.  

- **PR Curve**: More realistic under imbalance.  
  - Random Forest maintains **highest precision at high recall**.  
  - Logistic Regression collapses quickly in precision.  

 **Insight**: PR-AUC is a better reflection of fraud detection performance.
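
A PR curve can also be used to pick an operating threshold instead of relying on the default 0.5 cut-off. The sketch below uses made-up labels and scores and an assumed precision target of 0.75. It selects the lowest threshold that meets the target, which keeps recall as high as possible:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up validation labels and predicted fraud probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds; drop the final point
meets_target = precision[:-1] >= 0.75

# Lowest qualifying threshold keeps recall as high as possible
chosen = thresholds[meets_target].min()
y_pred = (y_score >= chosen).astype(int)

print("Chosen threshold:", chosen)
print("Flagged transactions:", int(y_pred.sum()))
```

The target precision is a business decision (how many false alarms are tolerable), so 0.75 here is only a placeholder.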

In [None]:
# ROC curves
plt.figure(figsize=(8,6))
for (name, strat), fpr in all_fpr.items():
    plt.plot(fpr, all_tpr[(name, strat)], label=f"{name}-{strat} (AUC={all_auc[(name, strat)]:.3f})")
plt.plot([0,1],[0,1],'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves (All Models)")
plt.legend()
plt.show()

# Precision–Recall curves
plt.figure(figsize=(8,6))
for (name, strat), (prec, rec) in all_prec_rec.items():
    plt.plot(rec, prec, label=f"{name}-{strat}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision–Recall Curves (All Models)")
plt.legend()
plt.show()

##  Feature Importance and Explainability  

To better understand our fraud detection models, we analyze **which features drive predictions** using two complementary methods:  

1. **Feature Importance** (from Random Forest & XGBoost)  
2. **SHAP (SHapley Additive exPlanations)**  

---

###  Feature Importance (Random Forest / XGBoost)  

Feature importance measures the contribution of each feature to the model’s predictions.  

- **XGBoost (SMOTE & SMOTEENN)** shows that **V14** is by far the most influential feature.  
- Other important features include **V10, V12, V4, V8, and V13**.  
- The importance decreases quickly after the top 3–4 features, meaning a few features carry most of the predictive power.  

**Insights:**  
- **V14** consistently stands out as the strongest fraud indicator.  
- Features like **V10, V12, and V4** also play important roles.  
- Both Random Forest and XGBoost agree on the dominance of **V14**, making it critical for fraud detection.  
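
How impurity-based importances behave can be checked on toy data where the answer is known in advance. In this sketch the data is synthetic and only column 2 carries signal by construction, so it should rank first:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 2] > 0).astype(int)   # only column 2 determines the label

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

importances = forest.feature_importances_   # normalized to sum to 1
ranking = np.argsort(importances)[::-1]      # feature indices, most important first

print("Importances:", np.round(importances, 3))
print("Ranking:", ranking)
```

Because the importances are normalized, a single dominant feature such as `V14` necessarily pushes the others down, which matches the steep drop-off described above.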


In [None]:
for (name, strat), _ in results.items():
    # Keys in `results` use the display names ("Random Forest", "XGBoost"), so match on those
    if "random forest" in name.lower() or "xgboost" in name.lower():
        model = joblib.load(f"{name.replace(' ', '_').lower()}_{strat.lower()}_fraud_model.pkl")
        if hasattr(model, "feature_importances_"):
            feat_imp = pd.DataFrame({
                "Feature": X.columns,
                "Importance": model.feature_importances_
            }).sort_values(by="Importance", ascending=False).head(10)

            plt.figure(figsize=(8,6))
            sns.barplot(x="Importance", y="Feature", data=feat_imp)
            plt.title(f"Top 10 Feature Importances - {name} ({strat})")
            plt.show()

###  SHAP Explainability  

While feature importance gives a global view, **SHAP** explains predictions at both the global and local level.  

- The **beeswarm plot** shows how each feature impacts fraud probability.  
- **V4, V14, V8, and V13** have the largest SHAP contributions.  
- The plot also shows **directionality**:  
  - Positive SHAP values → push prediction toward *fraud*.  
  - Negative SHAP values → push prediction toward *non-fraud*.  

**Insights:**  
- SHAP confirms **V14 and V4** as critical fraud predictors.  
- High values of **V14** strongly increase fraud likelihood.  
- SHAP helps ensure **transparency and trust**, which is essential in financial fraud detection.  
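
The additive property behind SHAP is easiest to see on a linear model. There, assuming independent features, each feature's SHAP value reduces to its weight times its deviation from the background mean. The model, weights, and data below are made up, and the sketch does not use the `shap` library:

```python
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(size=(500, 3))   # made-up reference data
weights = np.array([2.0, -1.0, 0.5])     # hypothetical linear model f(x) = w.x + b
bias = 0.1


def predict(x: np.ndarray) -> np.ndarray:
    """Linear model output for one row or a batch of rows."""
    return x @ weights + bias


x = np.array([1.0, 2.0, -1.0])

# Linear model, independent features: phi_i = w_i * (x_i - mean_i)
phi = weights * (x - background.mean(axis=0))
base_value = predict(background).mean()

print("Per-feature contributions:", np.round(phi, 3))
print("Base value + contributions:", base_value + phi.sum())
print("Model prediction:          ", predict(x))
```

The last two printed values agree: the contributions sum exactly to the gap between the prediction and the average prediction. Tree models need the more involved algorithms that `shap` implements, but the same additivity holds.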

In [None]:
if shap_available:
    best_model_name, best_resample = max(results, key=lambda x: results[x]["auc"])
    print(f"\n Best Model: {best_model_name} ({best_resample}) with AUC={results[(best_model_name, best_resample)]['auc']:.4f}")

    best_model = joblib.load(f"{best_model_name.replace(' ', '_').lower()}_{best_resample.lower()}_fraud_model.pkl")
    explainer = shap.Explainer(best_model, X_val)
    shap_values = explainer(X_val[:200])

    # Fix for multi-output explanation shape
    if len(shap_values.values.shape) == 3:
        shap_values = shap_values[:, :, 1]

    shap.plots.beeswarm(shap_values)
    plt.show()

In [None]:


# Save Scaler + Metadata
joblib.dump(scaler, "scaler.pkl")
joblib.dump(list(X.columns), "feature_names.pkl")
print("\n Models, scaler, metadata saved successfully!")
print("Test set saved as 'creditcard_test.csv' for Streamlit Compare Models.")