**Business Need for Fraud Detection in Financial Transactions**

## Problem Statement

Financial institutions, e-commerce platforms, and online payment systems lose significant revenue to fraudulent transactions. Fraudsters use stolen credit card details, fake identities, and other tactics to make unauthorized transactions.

## Why is Fraud Detection Important?

- Financial Loss Prevention: Banks and businesses lose billions annually to fraud.
- Customer Trust & Retention: Secure transactions strengthen customer confidence.
- Regulatory Compliance: Companies must comply with financial regulations aimed at preventing fraud.
- Operational Efficiency: Detecting fraud early lowers investigation costs and reduces chargebacks.

## Business Impact

- Reduced Chargebacks & Losses 🏦: Early fraud detection minimizes financial losses.
- Enhanced Security & Compliance 🔐: Detecting anomalies supports compliance with anti-fraud regulations.
- Improved Customer Experience 💳: Preventing fraud protects customer accounts and builds trust.

## Objective of the Project

- Analyze transaction data to identify patterns of fraudulent behavior.
- Develop a machine learning model to detect fraud in real time.
- Improve accuracy using feature engineering and various ML algorithms.



In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import SMOTE


In [5]:
# Load the dataset once and reuse the same DataFrame throughout
df = pd.read_csv("fraud_detection_sample.csv")
print("Columns in dataset:", list(df.columns))

# Display the shape and the first few rows
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

# Count missing values per column
print("\nMissing values:")
print(df.isnull().sum())


Columns in dataset: ['Time', 'Amount', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'Class']
Dataset shape: (500, 13)

First 5 rows:
    Time   Amount        V1        V2        V3        V4        V5        V6  \
0  15795  4292.21  0.593101 -1.592994  0.126380  0.189706  0.333860 -1.535040   
1    860  2145.54 -0.309546  0.440475  1.938929 -0.661982  1.431367 -1.880010   
2  76820  3754.60  0.326133 -0.019638 -1.000331  0.425887  1.081767  0.712712   
3  54886  3772.96 -1.251114  0.552490 -0.677745  0.019148 -1.312219 -1.883150   
4   6265   516.52  0.924027  0.223914  0.513908 -0.641487  0.622070 -0.372319   

         V7        V8        V9       V10  Class  
0  0.872197 -2.386930 -0.190872 -1.846573      0  
1 -0.315087 -0.495878 -0.198196 -0.428655      0  
2 -0.571746  1.097300  0.510157  1.029441      0  
3  0.332608 -1.565648  1.272570 -0.336895      0  
4  0.933128 -3.007632  0.126314 -0.846434      0  

Missing values:
Time      0
Amount    0
V1        0
V2   

In [6]:
target_column = 'Class'
print(df.columns)
# Class distribution of the target column
print(df[target_column].value_counts())

Index(['Time', 'Amount', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9',
       'V10', 'Class'],
      dtype='object')
Class
0    480
1     20
Name: count, dtype: int64
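
The class counts above imply a strong imbalance, which matters when reading accuracy figures later on. A minimal sketch of the arithmetic, with the counts copied by hand from the `value_counts()` output (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 480, 1: 20}

total = sum(class_counts.values())

# Share of rows in the minority class (label 1).
fraud_rate = class_counts[1] / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline_accuracy = max(class_counts.values()) / total

# Number of majority-class rows per minority-class row.
imbalance_ratio = class_counts[0] / class_counts[1]

print(f"Minority-class rate: {fraud_rate:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
print(f"Imbalance ratio: {imbalance_ratio:.0f}:1")
```

Because a model that never flags anything would already score 96% accuracy on this sample, recall and precision on class 1 are likely to be more informative than overall accuracy.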




## EDA: Check class distribution

In [8]:
target_column = 'Class'  # previously was 'isFraud'
X = df.drop(target_column, axis=1)
y = df[target_column]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

Features shape: (500, 12)
Target shape: (500,)


In [9]:

# X and y were created in the previous cell
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Handle class imbalance using SMOTE.
# Caveat: resampling before the split puts synthetic rows into the test set
# and lets test rows influence the training data, so the scores below are
# optimistic. Resampling only the training split avoids this.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("After SMOTE class distribution:")
print(y_resampled.value_counts())

# Train-test split (80/20, stratified on the resampled target)
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled,
    test_size=0.2,
    random_state=42,
    stratify=y_resampled
)

print(f"Train set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

# Feature scaling: fit on the training split only, then apply to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Feature scaling applied")

# Model training with Random Forest
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2
)

model.fit(X_train_scaled, y_train)
print("Random Forest model trained")

# Predictions: hard labels and class-1 probabilities
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Model evaluation
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC AUC Score: {roc_auc:.4f}")

# Feature importance, sorted from most to least important
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values(by='importance', ascending=False)

print("Top 10 Important Features:")
print(feature_importance.head(10))

Features shape: (500, 12)
Target shape: (500,)
After SMOTE class distribution:
Class
0    480
1    480
Name: count, dtype: int64
Train set size: (768, 12)
Test set size: (192, 12)
Feature scaling applied
Random Forest model trained
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.92      0.96        96
           1       0.92      1.00      0.96        96

    accuracy                           0.96       192
   macro avg       0.96      0.96      0.96       192
weighted avg       0.96      0.96      0.96       192

Confusion Matrix:
[[88  8]
 [ 0 96]]
ROC AUC Score: 0.9921
Top 10 Important Features:
   feature  importance
0     Time    0.175172
11     V10    0.168110
1   Amount    0.116719
3       V2    0.098973
4       V3    0.078862
8       V7    0.058873
6       V5    0.057703
9       V8    0.055534
7       V6    0.052845
10      V9    0.052540
