# Fraud Detection with Machine Learning

This project aims to build a **fraud detection system** using machine learning on financial transaction data.

The goal is to:
- Understand and visualize transaction patterns.
- Build a predictive model to detect potential frauds.
- Optimize for both **accuracy** and **recall**, with priority on minimizing false negatives (missed frauds).
- Deploy the model as an interactive dashboard later.

**Dataset Info:**
- File: `Fraud.csv`
- Rows: 6,362,620
- Columns: 11  
- Source: Synthetic dataset simulating financial transactions.


## 1. Import Required Libraries

We start by importing the essential Python libraries for:
- **Data manipulation:** `pandas`, `numpy`
- **Visualization:** `matplotlib`, `seaborn`
- **Memory management:** `gc`
- **Warning control:** `warnings`


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import warnings

# To suppress unnecessary warnings
warnings.filterwarnings('ignore')

# Show all columns when displaying dataframes
pd.set_option('display.max_columns', None)


## 2. Load the Dataset

The dataset `Fraud.csv` contains transaction details such as:
- Step (time step of the transaction)
- Type (transaction type: PAYMENT, TRANSFER, etc.)
- Amount
- Source and destination account balances
- Fraud indicators (`isFraud` and `isFlaggedFraud`)

We’ll specify `dtype` for each column to reduce memory usage during loading.


In [None]:
DATA_PATH = '/content/Fraud.csv'   # Update if stored elsewhere

print("Loading dataset...")

dtype_map = {
    'step': 'int32',
    'type': 'category',
    'amount': 'float32',
    'oldbalanceOrg': 'float32',
    'newbalanceOrig': 'float32',
    'oldbalanceDest': 'float32',
    'newbalanceDest': 'float32',
    'isFraud': 'int8',
    'isFlaggedFraud': 'int8'
}

df = pd.read_csv(DATA_PATH, dtype=dtype_map, low_memory=False)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
# deep=True also counts the string contents of any object columns
print(f"Memory usage (MB): {round(df.memory_usage(deep=True).sum() / 1024**2, 2)}")

# Display first few rows
df.head()
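
To illustrate why the `dtype` mapping helps, here is a minimal sketch on synthetic data (the values, the row count, and the `mem_mb` helper are made up for illustration and are not part of the notebook). It compares the memory footprint of pandas' default dtypes with the compact ones used above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few Fraud.csv columns (all values are made up)
rng = np.random.default_rng(0)
n = 100_000
demo = pd.DataFrame({
    "step": rng.integers(1, 744, size=n),                              # int64 by default
    "type": rng.choice(["PAYMENT", "TRANSFER", "CASH_OUT"], size=n),   # strings
    "amount": rng.uniform(0, 1e6, size=n),                             # float64 by default
    "isFraud": rng.integers(0, 2, size=n),                             # int64 by default
})

# Same data, cast to the compact dtypes used in dtype_map
compact = demo.astype({
    "step": "int32", "type": "category", "amount": "float32", "isFraud": "int8",
})

def mem_mb(frame):
    """Total memory in MB, including string contents (hypothetical helper)."""
    return frame.memory_usage(deep=True).sum() / 1024**2

before_mb, after_mb = mem_mb(demo), mem_mb(compact)
print(f"default dtypes: {before_mb:.2f} MB, compact dtypes: {after_mb:.2f} MB")
```

One trade-off to keep in mind: `float32` carries only about seven significant digits, so very large balances may lose precision in the cents.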


## 3. Dataset Overview

Now, let's quickly inspect:
- The data types of each column  
- Null or missing values  
- A few descriptive statistics  
- Class balance between fraudulent and non-fraudulent transactions


In [None]:
# Data types and non-null counts
print("Data Info:")
df.info()

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Basic descriptive statistics
print("\nDescriptive Statistics:")
display(df.describe())

# Fraud ratio
fraud_ratio = df['isFraud'].mean() * 100
print(f"\nFraudulent transactions ratio: {fraud_ratio:.4f}%")
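
The fraud ratio matters for how the model is judged. As a small illustration with made-up labels (10 frauds in 10,000 transactions, not the real dataset), a "model" that never flags fraud still scores very high accuracy while catching nothing, which is why recall is tracked alongside accuracy.

```python
import numpy as np

# Hypothetical labels: 10 frauds among 10,000 transactions (made up)
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A "model" that never flags fraud
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()   # fraction of real frauds that were flagged

print(f"accuracy={accuracy:.4f}, recall={recall:.4f}")
```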


## 4. Exploratory Data Analysis (EDA)

Before building a model, it's important to explore the dataset to understand:
- Which transaction types are most common?
- How much money do fraudulent transactions involve?
- How is the balance changing before and after a transaction?
- What’s the imbalance between fraud and non-fraud classes?

Let's start by analyzing transaction types and fraud distribution.


### 4.1 Distribution of Transaction Types

We’ll check the frequency of each transaction type and how often fraud occurs in each.


In [None]:
import plotly.express as px
import plotly.graph_objects as go


This interactive chart shows how frequently each transaction type occurs, split by fraud label.
Hover over a bar to view its exact count.


In [None]:
fig = px.histogram(
    df,
    x='type',
    color='isFraud',
    barmode='group',
    title='Distribution of Transaction Types (Fraud vs Non-Fraud)',
    color_discrete_map={0: 'blue', 1: 'red'}
)

fig.update_layout(
    xaxis_title='Transaction Type',
    yaxis_title='Count',
    template='plotly_white'
)
fig.show()
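
Passing millions of rows to a plotting call can be slow. One possible alternative, sketched here on a tiny made-up frame (`toy` and `counts` are illustrative names), is to aggregate the counts with pandas first and plot only the summary table.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the notebook's df
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT", "TRANSFER"],
    "isFraud": [0, 0, 0, 1, 1, 1],
})

# One row per (type, isFraud) combination with its number of transactions
counts = (
    toy.groupby(["type", "isFraud"])
       .size()
       .reset_index(name="count")
)
print(counts)
```

The resulting `counts` frame has only a handful of rows, so it could then be drawn as a grouped bar chart (for example with `px.bar` and `barmode='group'`) instead of building a histogram from the raw data. When `type` is stored as a categorical column, passing `observed=True` to `groupby` keeps only combinations that actually occur.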


### 4.2 Transaction Amount Distribution Over Time

An animated histogram showing how the distribution of transaction amounts changes across time steps.
This helps reveal whether certain time periods have a higher concentration of fraud.


In [None]:
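# Restrict to amounts below 50,000 and subsample so the plot stays responsive;
# sort by step so the animation frames play in chronological order
sample = (
    df[df['amount'] < 50000]
    .sample(50000, random_state=42)
    .sort_values('step')
)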

fig = px.histogram(
    sample,
    x='amount',
    color='isFraud',
    nbins=50,
    animation_frame='step',
    title='Transaction Amount Distribution Over Time (Animated)',
    color_discrete_map={0: 'lightblue', 1: 'red'}
)
fig.update_layout(
    xaxis_title='Transaction Amount',
    yaxis_title='Count',
    template='plotly_white'
)
fig.show()


### 4.3 Origin Account Balance Changes

A scatter plot showing how origin account balances change during transactions.
Fraudulent ones often display abrupt or illogical balance jumps.


In [None]:
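# Sort by step so the animation frames play in chronological order
sample = df.sample(10000, random_state=42).sort_values('step').copy()
sample['balance_diff'] = sample['oldbalanceOrg'] - sample['newbalanceOrig']
# Cast the label to string so Plotly treats it as a discrete category;
# a numeric color column would get a continuous color scale instead
sample['fraud_label'] = sample['isFraud'].astype(str)

fig = px.scatter(
    sample,
    x='oldbalanceOrg',
    y='newbalanceOrig',
    color='fraud_label',
    animation_frame='step',
    size='amount',
    title='Origin Account Balances Before vs After (Animated)',
    color_discrete_map={'0': 'green', '1': 'red'},
    labels={'fraud_label': 'isFraud'},
    hover_data=['amount', 'type', 'balance_diff']
)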
fig.update_layout(
    xaxis_title='Old Balance (Origin)',
    yaxis_title='New Balance (Origin)',
    template='plotly_white'
)
fig.show()


### 4.4 Fraud Rate by Transaction Type

A bar chart showing the percentage of fraudulent transactions for each transaction type.


In [None]:
fraud_rate = df.groupby('type')['isFraud'].mean().reset_index()
fraud_rate['isFraud'] *= 100

fig = px.bar(
    fraud_rate,
    x='type',
    y='isFraud',
    text='isFraud',
    title='Fraud Percentage by Transaction Type',
    color='isFraud',
    color_continuous_scale='Reds'
)
fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside')
fig.update_layout(
    xaxis_title='Transaction Type',
    yaxis_title='Fraud Rate (%)',
    template='plotly_white'
)
fig.show()


## 5. Feature Engineering

To improve model performance, we'll create additional features that help capture hidden relationships in the data.

### Features to Add:
1. **Type Encoding:** Convert categorical `type` to numeric codes.
2. **Balance Changes:** Calculate how account balances change before and after each transaction.
3. **Log Amount:** Apply logarithmic transformation to large transaction amounts.
4. **Transaction Ratios:** Compare amount to available balance.

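These transformations reduce the skew in transaction amounts and expose balance movements directly, which can make patterns easier for the model to learn and for us to interpret.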


In [None]:
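# 1. Encode transaction type; any type not listed below gets code 4.
# Cast to object first so .map() returns a plain Series rather than a Categorical,
# which keeps fillna(4) valid regardless of which categories are present
type_map = {
    'PAYMENT': 0,
    'TRANSFER': 1,
    'CASH_OUT': 2,
    'DEBIT': 3
}
df['type_code'] = df['type'].astype('object').map(type_map).fillna(4).astype('int8')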

# 2. Compute balance change features
df['delta_orig'] = df['oldbalanceOrg'] - df['newbalanceOrig']
df['delta_dest'] = df['newbalanceDest'] - df['oldbalanceDest']

# 3. Log-transform amount to handle skew
df['amt_log'] = np.log1p(df['amount'])

# 4. Transaction ratio: amount relative to sender's original balance
df['trans_ratio'] = np.where(df['oldbalanceOrg'] > 0,
                             df['amount'] / (df['oldbalanceOrg'] + 1e-5),
                             0).astype('float32')

# Check sample of new features
df[['amount', 'delta_orig', 'delta_dest', 'amt_log', 'trans_ratio', 'type_code']].head()
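
As a worked example of what these features capture, the sketch below applies the same formulas to two made-up transactions: one that empties the sender's account without any change at the destination, and one small ordinary payment. The numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Two made-up transactions: one drains the account, one is a small payment
toy = pd.DataFrame({
    "amount":         [10_000.0,    50.0],
    "oldbalanceOrg":  [10_000.0, 2_000.0],
    "newbalanceOrig": [     0.0, 1_950.0],
    "oldbalanceDest": [     0.0,   500.0],
    "newbalanceDest": [     0.0,   550.0],
})

# Same formulas as in the feature engineering cell
toy["delta_orig"] = toy["oldbalanceOrg"] - toy["newbalanceOrig"]
toy["delta_dest"] = toy["newbalanceDest"] - toy["oldbalanceDest"]
toy["amt_log"] = np.log1p(toy["amount"])
toy["trans_ratio"] = np.where(
    toy["oldbalanceOrg"] > 0,
    toy["amount"] / (toy["oldbalanceOrg"] + 1e-5),
    0,
)
print(toy[["delta_orig", "delta_dest", "amt_log", "trans_ratio"]].round(3))
```

In the first row `trans_ratio` is close to 1 (the whole balance leaves the account) while `delta_dest` is 0, so the money does not show up at the destination. The second row has a small ratio and matching deltas on both sides.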


### 5.1 Handling Missing and Infinite Values

Some columns may contain invalid entries such as `NaN`, `inf`, or `-inf`.
We’ll replace those with safe defaults to avoid model training errors.

We’ll **skip categorical columns** while filling numeric columns with `0`.


In [None]:
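# Restrict the cleanup to numeric columns so categorical and string columns stay untouched
num_cols = df.select_dtypes(include='number').columns

# Convert +/-inf to NaN, then fill every remaining NaN with 0.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about
df[num_cols] = df[num_cols].replace([np.inf, -np.inf], np.nan).fillna(0)

print("Missing values handled successfully.")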

## 6. Prepare Training and Validation Sets

We'll now split the dataset into **train** and **validation** sets for model training.
Given the large dataset size, we’ll sample for efficient training.


In [None]:
from sklearn.model_selection import train_test_split

# Choose relevant features
feature_cols = [
    'amount', 'oldbalanceOrg', 'newbalanceOrig',
    'oldbalanceDest', 'newbalanceDest',
    'delta_orig', 'delta_dest', 'amt_log',
    'trans_ratio', 'type_code'
]

# Downsample for training efficiency (optional); to train on the full
# dataset instead, build X and y from df directly
df_sample = df.sample(500000, random_state=42)
X = df_sample[feature_cols]
y = df_sample['isFraud']

# Split into train-validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Train shape: {X_train.shape}, Validation shape: {X_val.shape}")
print(f"Fraud ratio in train: {y_train.mean():.4f}")
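
The `stratify=y` argument matters with so few positives. A minimal sketch on made-up labels (50 positives in 10,000 rows, with a placeholder feature) shows that a stratified split keeps the positive share the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 50 positives in 10,000 rows (0.5%)
y = np.zeros(10_000, dtype=int)
y[:50] = 1
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature, not real data

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"train positives: {y_tr.sum()} of {len(y_tr)}")
print(f"validation positives: {y_va.sum()} of {len(y_va)}")
```

Without stratification the validation set could, by chance, end up with noticeably more or fewer frauds than expected, which makes recall estimates less stable.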



## 7. Model Training (LightGBM)

We use **LightGBM**, a fast and efficient gradient boosting model well-suited for large datasets and imbalanced classes.


In [None]:
from lightgbm import LGBMClassifier, early_stopping, log_evaluation

# Calculate class weight for imbalance
scale_pos_weight = (len(y_train) - y_train.sum()) / y_train.sum()

# Initialize model
model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=64,
    subsample=0.8,
    subsample_freq=1,  # row subsampling only takes effect in LightGBM when this is > 0
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos_weight,
    random_state=42
)

# Train with callbacks for logging and early stopping
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric='auc',
    callbacks=[early_stopping(stopping_rounds=50), log_evaluation(period=50)]
)
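
For intuition about `scale_pos_weight`, here is the same ratio computed on a made-up label vector (5 frauds in 1,000 rows; `y_train_demo` and `scale_pos_weight_demo` are illustrative names, not notebook variables).

```python
import numpy as np

# Made-up training labels: 5 frauds among 1,000 rows
y_train_demo = np.array([0] * 995 + [1] * 5)

n_pos = y_train_demo.sum()
n_neg = len(y_train_demo) - n_pos

# Ratio of negatives to positives, as in the training cell
scale_pos_weight_demo = n_neg / n_pos
print(scale_pos_weight_demo)
```

With this weight each fraud example counts roughly as much as 199 non-fraud examples in the loss, so the rarer the fraud class, the larger the weight becomes.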


## 8. Model Evaluation

We'll evaluate the model using Precision, Recall, F1-score, and AUC.  
High **recall** is crucial for fraud detection: we prefer to catch more frauds, even at the cost of some false positives.


In [None]:
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, auc

# Predict probabilities
y_proba = model.predict_proba(X_val)[:, 1]

# AUC Score
roc_auc = roc_auc_score(y_val, y_proba)
print(f"ROC-AUC: {roc_auc:.4f}")

# Precision-Recall curve
prec, rec, th = precision_recall_curve(y_val, y_proba)
pr_auc = auc(rec, prec)
print(f"PR-AUC: {pr_auc:.4f}")

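# Choose threshold with best F1.
# prec and rec have one more entry than th (a final point with no threshold), so drop it.
# The guarded division avoids NaN where precision + recall == 0, which np.argmax would otherwise select
prec_t, rec_t = prec[:-1], rec[:-1]
denom = prec_t + rec_t
f1_scores = np.divide(2 * prec_t * rec_t, denom, out=np.zeros_like(denom), where=denom > 0)
best_idx = np.argmax(f1_scores)
best_threshold = th[best_idx]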

print(f"Best Threshold: {best_threshold:.4f}")
print(f"Precision: {prec[best_idx]:.4f}, Recall: {rec[best_idx]:.4f}, F1: {f1_scores[best_idx]:.4f}")

# Final classification report
y_pred = (y_proba >= best_threshold).astype(int)
print("\nClassification Report:")
print(classification_report(y_val, y_pred))
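
The threshold search can be traced by hand on a tiny example. The sketch below uses eight made-up scores (not model output) in which the highest-scored item is a negative, so precision and recall are both 0 at the top threshold. It computes F1 per threshold with a guarded division so that this case yields 0 instead of NaN.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and scores; the highest score (0.95) belongs to a negative
y_true = np.array([0, 0, 1, 0, 1, 1, 1, 0])
y_score = np.array([0.05, 0.10, 0.35, 0.40, 0.60, 0.80, 0.90, 0.95])

prec, rec, th = precision_recall_curve(y_true, y_score)

# prec/rec carry one extra trailing point that has no threshold; drop it
prec_t, rec_t = prec[:-1], rec[:-1]
denom = prec_t + rec_t

# F1 = 2PR / (P + R), defined as 0 where P + R == 0
f1 = np.divide(2 * prec_t * rec_t, denom, out=np.zeros_like(denom), where=denom > 0)

best_idx = int(np.argmax(f1))
best_threshold = th[best_idx]
print(f"best threshold={best_threshold:.2f}, F1={f1[best_idx]:.2f}")
```

Note that a threshold tuned and evaluated on the same validation set gives a somewhat optimistic estimate. A separate held-out test set would give a cleaner picture.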


## 9. Save the Trained Model

We'll save the trained LightGBM model so we can load it later in a Streamlit dashboard or API for live predictions.
We’ll use the `joblib` library, which efficiently stores large model objects.


In [None]:
import joblib
import json
import os

# Create folder if not exists
os.makedirs("model", exist_ok=True)

# Save model
model_path = "model/fraud_model_slim.pkl"
joblib.dump(model, model_path)

# Save metadata (features + threshold)
metadata = {
    "feature_cols": feature_cols,
    "best_threshold": float(best_threshold)
}
with open("model/metadata.json", "w") as f:
    json.dump(metadata, f, indent=4)

print(f"Model saved to: {model_path}")
print("Metadata saved to: model/metadata.json")
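
For the later dashboard or API, the saved artifacts could be loaded back roughly as follows. This is a sketch under assumptions: `load_artifacts` and `flag_fraud` are hypothetical helper names, and the round-trip demo uses a small decision tree on made-up data as a stand-in for the LightGBM model so that it runs without the real files.

```python
import json
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def load_artifacts(model_dir):
    """Load the pickled model and its metadata from model_dir (hypothetical helper)."""
    model = joblib.load(os.path.join(model_dir, "fraud_model_slim.pkl"))
    with open(os.path.join(model_dir, "metadata.json")) as f:
        metadata = json.load(f)
    return model, metadata

def flag_fraud(model, metadata, frame):
    """Return 0/1 flags using the saved feature order and threshold."""
    proba = model.predict_proba(frame[metadata["feature_cols"]])[:, 1]
    return (proba >= metadata["best_threshold"]).astype(int)

# Round-trip demo with a stand-in model and made-up data
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"amount": [10.0, 20.0, 900.0, 1000.0]})
    stand_in = DecisionTreeClassifier(random_state=0).fit(toy, [0, 0, 1, 1])
    joblib.dump(stand_in, os.path.join(tmp, "fraud_model_slim.pkl"))
    with open(os.path.join(tmp, "metadata.json"), "w") as f:
        json.dump({"feature_cols": ["amount"], "best_threshold": 0.5}, f)

    loaded_model, loaded_meta = load_artifacts(tmp)
    flags = flag_fraud(loaded_model, loaded_meta, toy)
    print(flags)
```

Reading the feature list from the metadata file keeps the column order at prediction time identical to the order used in training. Pickled models should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.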


## 10. Feature Importance

LightGBM provides feature importance scores. With the scikit-learn wrapper's default setting (`importance_type='split'`), each score counts how many times a feature is used to split a tree.
We’ll visualize this as a bar chart.


In [None]:
import plotly.express as px
import pandas as pd

# Feature importance
importance_df = pd.DataFrame({
    "Feature": feature_cols,
    "Importance": model.feature_importances_
}).sort_values(by="Importance", ascending=True)

fig = px.bar(
    importance_df,
    x="Importance",
    y="Feature",
    orientation="h",
    title="Feature Importance (LightGBM)",
    color="Importance",
    color_continuous_scale="Blues"
)
fig.update_layout(template="plotly_white")
fig.show()


## 11. Model Explainability using SHAP

To understand why the model predicts a transaction as fraud or not, we use **SHAP (SHapley Additive exPlanations)**.
SHAP values explain the contribution of each feature to individual predictions.
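
As a sketch of what "contribution" means here: SHAP values are additive, so for a transaction $x$ with $M$ features the explained model output decomposes as

$$
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x)
$$

where $\phi_0$ is the explainer's expected (baseline) value and $\phi_i(x)$ is the SHAP value of feature $i$. The symbols $f$, $\phi_0$, $\phi_i$, and $M$ are notation introduced here for illustration. For a tree explainer with default settings on a binary classifier, $f(x)$ is typically the raw log-odds score rather than a probability, so a positive $\phi_i(x)$ pushes the transaction toward the fraud class.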


In [None]:
import shap

# Initialize SHAP explainer (TreeExplainer works best with LightGBM)
explainer = shap.TreeExplainer(model)

# Use a subset of validation data for speed
sample_X = X_val.sample(1000, random_state=42)

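# Compute SHAP values
shap_values = explainer.shap_values(sample_X)

# Some SHAP versions return one array per class for binary classifiers;
# in that case keep the values for the fraud class (index 1)
if isinstance(shap_values, list):
    shap_values = shap_values[1]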


### 11.1 SHAP Summary Plot

The summary plot shows which features have the strongest influence on fraud prediction across many samples.


In [None]:
# Visual summary
shap.summary_plot(shap_values, sample_X, plot_type="dot", show=True)


### 11.2 Single Transaction Explanation

Let's take one random transaction and visualize how each feature contributes to its fraud score.


In [None]:

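single = X_val.sample(1, random_state=42)
shap_single = explainer.shap_values(single)

# Some SHAP versions return one array per class; index 1 is the fraud class
# (index 0 would explain the non-fraud score instead)
if isinstance(shap_single, list):
    shap_single = shap_single[1]

# expected_value may likewise hold one entry per class; take the last (fraud) entry
base_value = np.ravel(explainer.expected_value)[-1]

# Pass a 1-D Explanation so the bar plot shows this single transaction
shap.plots.bar(shap.Explanation(
    values=shap_single[0],
    base_values=base_value,
    data=single.iloc[0].values,
    feature_names=list(single.columns)
))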
