<a href="https://colab.research.google.com/github/thvarsha00/credit-card-fraud-detection/blob/main/credit_card_fraud_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Project Summary – Credit Card Fraud Detection

This project focuses on detecting fraudulent credit card transactions using machine learning techniques. The dataset used is highly imbalanced, with fraud cases making up only a very small fraction of the total transactions.

# Steps Performed:

# Data Preprocessing

Handled missing values and explored the dataset.

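Standardized features like transaction amount and time so that columns measured on very different scales contribute comparably.

Applied PCA (Principal Component Analysis) to project the features into two dimensions for visualization, since most features are anonymized (V1–V28) and cannot be interpreted directly.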

# Model Training

Implemented Logistic Regression (as a baseline model).

Implemented Random Forest Classifier, which can capture nonlinear patterns that a linear baseline misses and, in this project, coped better with the class imbalance.
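The comparison described above can be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the project's actual training code. The sample size, the `class_weight="balanced"` setting, and the tree count are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (only a few percent positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Hypothetical settings; class_weight="balanced" reweights the rare class.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive (fraud) class, used for ROC-AUC.
    fraud_probability = model.predict_proba(X_test)[:, 1]
    scores[name] = {
        "recall": recall_score(y_test, model.predict(X_test)),
        "roc_auc": roc_auc_score(y_test, fraud_probability),
    }
    print(name, scores[name])
```

The stratified split keeps the fraud share the same in the training and test sets, which matters when positives are this rare.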

# Evaluation Metrics

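Since the dataset is imbalanced, accuracy is misleading: with fraud making up less than 1% of transactions, a model that never flags anything would still be more than 99% accurate.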

Focused on Recall (True Positive Rate), because detecting frauds is more critical than avoiding false alarms.

Also used ROC-AUC score to evaluate the overall discriminatory power of models.
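A small sketch of why accuracy can mislead on imbalanced labels. The labels below are synthetic (10 frauds among 10,000 transactions) and the "model" is a deliberately trivial one that predicts non-fraud every time. Neither comes from the project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (not the project's data): 10 frauds among 10,000 transactions.
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A trivial "model" that predicts non-fraud for every transaction.
y_pred_all_negative = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred_all_negative)
recall = recall_score(y_true, y_pred_all_negative)
print(f"accuracy={accuracy:.4f}, recall={recall:.4f}")
```

The accuracy comes out at 0.999 even though recall is 0, meaning every fraud was missed.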

# Results:

Logistic Regression: Provided reasonable performance but missed some fraud cases.

Random Forest: Achieved higher recall and ROC-AUC, making it better suited for fraud detection.

# Key takeaway:

In fraud detection, recall is more important than accuracy, since failing to detect a fraud (false negative) is more harmful than wrongly flagging a normal transaction (false positive).

# Outcome:

The project demonstrates how machine learning models can detect fraudulent transactions effectively, with Random Forest being the most reliable model in this case. This forms the foundation for building real-world fraud detection systems where minimizing fraud losses is the top priority.

In [None]:
from google.colab import drive
drive.mount('/content/drive')






# **Importing Libraries**

This cell loads all the essential libraries used throughout the project:

- **Data Handling & Analysis**:
  - `pandas`, `numpy`: For loading, cleaning, and manipulating the dataset.

- **Visualization**:
  - `matplotlib.pyplot`, `seaborn`: For static plots like histograms, heatmaps, and line charts.
  - `plotly.express`, `plotly.graph_objects`: For interactive visualizations including dashboards, scatter plots, and violin plots.

- **Machine Learning & Dimensionality Reduction**:
  - `StandardScaler`: For feature scaling before modeling or visualization.
  - `PCA`, `TSNE`: For reducing high-dimensional data to 2D for visualization.
  - `precision_recall_curve`, `PrecisionRecallDisplay`: For evaluating model performance on imbalanced data using precision-recall metrics.

These tools form the backbone of the project, enabling both exploratory data analysis and model evaluation.
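As a rough illustration of what `precision_recall_curve` returns, the sketch below runs it on a handful of invented labels and scores. The numbers are made up for the example and have nothing to do with the dataset.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, invented for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Each threshold yields one (precision, recall) pair; sklearn appends a final
# point with precision 1 and recall 0 that has no matching threshold.
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Lowering the threshold raises recall at the cost of precision, which is the trade-off a fraud model has to tune.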

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


# ML Tools
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay


In [None]:

# Load the dataset from Drive folder
file_path = '/content/drive/MyDrive/creditcard_data/creditcard.csv'
df = pd.read_csv(file_path)

print(df.head())
print(df.columns)



X = df.drop("Class", axis=1)
y = df["Class"]


# Fraud vs Non-Fraud Transactions Histogram
This interactive histogram shows the distribution of fraudulent (Class = 1) and non-fraudulent (Class = 0) transactions in the dataset.





In [None]:

fig = px.histogram(df, x="Class", color="Class",
                   title="Fraud vs Non-Fraud Transactions",
                   labels={"Class": "Transaction Type"})
fig.show()


# Transaction Amount Distribution (Fraud vs Non-Fraud)

The boxplot compares the transaction amounts between fraudulent (1) and non-fraudulent (0) transactions.

Most non-fraud transactions are spread across a wide range of amounts.

Fraudulent transactions tend to involve smaller amounts, though a few outliers show much higher values.

This may suggest that fraudsters often attempt small transactions to avoid detection, while occasionally targeting larger amounts.

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x="Class", y="Amount", data=df, palette="coolwarm")
plt.title("Transaction Amount Distribution")
plt.xticks([0, 1], ['Non-Fraud', 'Fraud'])
plt.show()

# Fraud vs Non-Fraud by Transaction Amount (Log Scale)

This histogram compares transaction amounts for fraudulent and non-fraudulent cases, with the count axis on a logarithmic scale so that the much smaller fraud counts remain visible.

Non-fraud transactions (green) dominate across all ranges, especially in mid and high transaction amounts.

Fraudulent transactions (red) are more concentrated in the lower amount ranges, showing fraudsters often attempt small transactions.

The log scale helps highlight fraud patterns that would otherwise be hidden due to class imbalance.
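`plt.yscale("log")` rescales only the count axis. A complementary option, shown here as a sketch and not something this notebook does, is to compress the amount values themselves with `np.log1p` before plotting. The amounts below are hypothetical.

```python
import numpy as np

# Hypothetical amounts, including a zero-value transaction.
amounts = np.array([0.0, 1.0, 9.99, 99.0, 2500.0])

# log1p computes log(1 + x), so a zero amount maps to 0 instead of -inf.
log_amounts = np.log1p(amounts)
print(np.round(log_amounts, 3))

# expm1 inverts the transform when the original scale is needed again.
recovered = np.expm1(log_amounts)
```

Using `log1p` and not a plain logarithm keeps zero-amount transactions in the plot.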

In [None]:
# Fraud vs Non-Fraud by Amount (log scale for better view)
plt.figure(figsize=(10,6))
sns.histplot(df[df['Class']==0]['Amount'], bins=50, color='green', label="Non-Fraud", alpha=0.6)
sns.histplot(df[df['Class']==1]['Amount'], bins=50, color='red', label="Fraud", alpha=0.6)
plt.yscale("log")
plt.legend()
plt.title("Fraud vs Non-Fraud by Transaction Amount (Log Scale)")
plt.show()

# Fraud vs Non-Fraud Transactions Over Time

This plot shows how transactions are distributed across time.

Non-fraud transactions (green) occur consistently throughout the time period.

Fraud transactions (red) appear in clusters, indicating fraudsters often strike in short bursts.

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(df[df['Class']==0]['Time'], bins=100, color='green', label="Non-Fraud", alpha=0.6)
sns.histplot(df[df['Class']==1]['Time'], bins=100, color='red', label="Fraud", alpha=0.6)
plt.legend()
plt.title("Fraud vs Non-Fraud Transactions Over Time")
plt.xlabel("Time (seconds)")
plt.ylabel("Count")
plt.show()


# Fraud Rate by Transaction Amount Range

This bar chart shows the fraud rate (the share of transactions labeled as fraud) within each of ten equal-frequency bins (deciles) of transaction amount.

Fraudulent transactions are more frequent in lower ranges, showing that fraudsters often attempt smaller amounts to avoid detection.

Some higher ranges also have spikes, meaning large-value fraud attempts exist but are less common.
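The binning idea can be sketched on synthetic data as follows. The amounts and labels are randomly generated stand-ins, so the rates printed here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: skewed amounts, roughly 1% fraud.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Amount": rng.lognormal(mean=3.0, sigma=1.5, size=5000),
    "Class": (rng.random(5000) < 0.01).astype(int),
})

# qcut places about the same number of rows in each of the 10 bins (deciles).
amount_bin = pd.qcut(demo["Amount"], 10)

# The mean of a 0/1 label within a bin is that bin's fraud rate.
summary = demo.groupby(amount_bin, observed=True)["Class"].agg(["count", "mean"])
print(summary)
```

Because each decile holds a similar number of rows, the rates are comparable across bins. The bin widths themselves can differ a lot.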

In [None]:
df['AmountBin'] = pd.qcut(df['Amount'], 10)  # Divide into 10 bins
fraud_rate = df.groupby('AmountBin')['Class'].mean()

plt.figure(figsize=(10,6))
fraud_rate.plot(kind='bar', color="orange")
plt.title("Fraud Rate by Transaction Amount Range")
plt.ylabel("Fraud Rate")
plt.xlabel("Transaction Amount Range")
plt.show()


# Fraud vs Non-Fraud Distribution (Pie Chart)

This pie chart highlights the imbalance in the dataset.

Non-fraud (green) makes up almost all transactions.

Fraud (red) is less than 1%, showing the dataset is highly imbalanced.
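The class shares can also be read off numerically with `value_counts`. The sketch below uses a toy label series in place of `df["Class"]`, and the counts are invented.

```python
import pandas as pd

# Toy labels standing in for df["Class"]: 3 frauds in 1,000 transactions.
labels = pd.Series([0] * 997 + [1] * 3, name="Class")

# Rows per class, ordered 0 then 1.
counts = labels.value_counts().sort_index()
# Fraction of rows per class.
shares = labels.value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(f"fraud share: {shares[1]:.2%}")
```

Sorting by index makes the order of the classes explicit, so it does not depend on which class happens to be more frequent.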

In [None]:
plt.figure(figsize=(6,6))
df['Class'].value_counts().plot.pie(autopct="%1.2f%%", labels=["Non-Fraud","Fraud"], colors=["green","red"])
plt.title("Fraud vs Non-Fraud Distribution")
plt.show()


# Transaction Amount by Class

**Insights**:

Fraudulent transactions often have different amount distributions compared to normal ones.

Outliers and spread can reveal whether fraud tends to occur in high- or low-value transactions.

Useful for feature engineering and understanding model behavior.

In [None]:

fig = px.box(df, x="Class", y="Amount", color="Class",
             title="Transaction Amounts by Class")
fig.show()




---

###  **Transaction Time vs Amount (Scatter Plot)**

This scatter plot visualizes the relationship between transaction time and amount, with points color-coded by class (`Class = 0` for non-fraud, `Class = 1` for fraud).

- **X-axis**: Time (in seconds since the first transaction)
- **Y-axis**: Transaction amount
- **Color**: Indicates whether the transaction was fraudulent

**Insights**:
- Helps identify if frauds cluster around specific time periods or transaction values.
- Reveals temporal patterns or anomalies that could be useful for time-based fraud detection.
- Can guide feature engineering (e.g., creating time-of-day features).

In [None]:
fig = px.scatter(df, x="Time", y="Amount", color="Class",
                 title="Transaction Time vs Amount")
fig.show()


# Correlation Heatmap

This heatmap displays the pairwise correlation coefficients between all numerical features in the dataset.

It helps identify relationships between features, including potential multicollinearity.

This is especially valuable in datasets with anonymized features like V1–V28, where direct interpretation isn’t possible.
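Beyond the heatmap, one way to use the correlation matrix is to rank features by how strongly they correlate with the label. The sketch below does this on synthetic data. The column names `signal` and `noise` are placeholders, not columns of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic features: "signal" tracks the label, "noise" does not.
rng = np.random.default_rng(1)
label = (rng.random(2000) < 0.1).astype(int)
demo = pd.DataFrame({
    "signal": label * 2.0 + rng.normal(size=2000),
    "noise": rng.normal(size=2000),
    "Class": label,
})

# Rank features by the absolute value of their Pearson correlation with the label.
corr_with_class = demo.corr()["Class"].drop("Class")
ranked = corr_with_class.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear association, so a low value does not rule out a feature being useful to a nonlinear model.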


In [None]:
df_encoded = df.copy()
if 'AmountBin' in df_encoded.columns:
    df_encoded['AmountBin'] = df_encoded['AmountBin'].cat.codes  # Convert intervals to integers

plt.figure(figsize=(12,8))
corr = df_encoded.corr()
sns.heatmap(corr, cmap="coolwarm", cbar=True)
plt.title("Correlation Heatmap (Including Encoded Categories)")
plt.show()




###  **Interpretation of the Violin Plot: Transaction Amount by Class**

- **Class 0 (Non-Fraudulent)**:
  - The majority of transactions are tightly clustered around lower amounts.
  - There's a long tail of higher-value transactions, but they occur infrequently.
  - The dense concentration near the bottom suggests most legitimate transactions are small.

- **Class 1 (Fraudulent)**:
  - Fewer data points overall, confirming the class imbalance.
  - Fraudulent transactions appear more spread out, with some occurring at significantly higher amounts.
  - The sparse distribution and presence of outliers suggest fraud can happen across a wide range of transaction values.

**Why the Log Scale Matters**:
- Without it, the dense cluster of small transactions would dominate the plot.
- Log scaling reveals the subtle spread and outliers in both classes, especially for fraud.



In [None]:


fig = px.violin(df, x='Class', y='Amount', color='Class',
                box=True, # adds a box plot inside the violin
                points="all", # shows all data points
                labels={  # maps column names (not class values) to axis titles
                    "Class": "Class (0 = Non-Fraudulent, 1 = Fraudulent)",
                    "Amount": "Transaction Amount"
                },
                title="<b>Distribution of Transaction Amount by Class</b>",
                template="plotly_white")

# Using a log scale makes the distributions easier to compare
fig.update_layout(yaxis_type="log")
fig.show()

# Distribution of Fraudulent vs. Non-Fraudulent Transactions (Donut Chart)

In [None]:


class_counts = df['Class'].value_counts()
labels = ['Non-Fraudulent', 'Fraudulent']
values = class_counts.values

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.4,
                             marker_colors=['#636EFA', '#EF553B'],
                             pull=[0, 0.2])]) # Pulls out the fraud slice
fig.update_layout(
    title_text="<b>Distribution of Fraudulent vs. Non-Fraudulent Transactions</b>",
    annotations=[dict(text=f'{class_counts[1]/len(df):.2%}<br>Fraud', x=0.5, y=0.5, font_size=20, showarrow=False)]
)
fig.show()

# PCA Visualization (2D)

This scatter plot shows the result of applying Principal Component Analysis (PCA) to reduce the high-dimensional transaction data to two principal components:

PC1 and PC2: Represent the directions of maximum variance in the data

Color-coded: Fraudulent (Class = 1) vs Non-Fraudulent (Class = 0) transactions
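How much information a 2D projection retains can be checked through `explained_variance_ratio_`. The following is a minimal sketch on randomly generated data, so the printed ratios are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 500 rows, 6 features, two of them strongly correlated.
rng = np.random.default_rng(42)
features = rng.normal(size=(500, 6))
features[:, 1] = features[:, 0] + 0.1 * rng.normal(size=500)

# Standardize first so no feature dominates because of its units.
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=2)
projected = pca.fit_transform(scaled)

# explained_variance_ratio_ gives the share of total variance kept per component.
print(projected.shape)
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())
```

If the two components keep only a small share of the variance, overlap between classes in the scatter plot does not necessarily mean they are inseparable in the full feature space.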

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px

# Keep only numeric features
X_numeric = df.drop("Class", axis=1).select_dtypes(include=["int64","float64"])

# Scale data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(X_numeric)

# PCA
pca = PCA(n_components=2, random_state=42)
pca_result = pca.fit_transform(scaled_data)

# Create DataFrame for visualization
df_pca = pd.DataFrame(pca_result, columns=["PC1", "PC2"])
# Cast the label to string so Plotly treats it as a category, not a continuous scale
df_pca["Class"] = df["Class"].astype(str).values

# Plot
fig = px.scatter(
    df_pca,
    x="PC1",
    y="PC2",
    color="Class",
    title="PCA Visualization (2 Components)",
    color_discrete_map={"0": "blue", "1": "red"}
)
fig.show()


# Fraud Transactions Over Time (Histogram)
This histogram displays the distribution of fraudulent transactions (Class = 1) across time.

X-axis: Time (measured in seconds since the first transaction)

Y-axis: Number of fraud cases

Color: Red, to highlight fraud activity

In [None]:
fraud = df[df['Class']==1]

fig = px.histogram(fraud, x="Time", nbins=50,
                   title="Fraud Transactions Over Time",
                   color_discrete_sequence=["red"])
fig.show()

# Combined Visualization Dashboard

In [None]:
# All four panels are drawn in one cell so they share a single figure
plt.figure(figsize=(15,10))

# Fraud vs Non-Fraud Count
plt.subplot(2,2,1)
sns.countplot(x="Class", data=df, palette="Set2")
plt.title("Fraud vs Non-Fraud Count")

# Transaction Amount Distribution
plt.subplot(2,2,2)
sns.histplot(df['Amount'], bins=100, kde=True, color="blue")
plt.title("Transaction Amount Distribution")

# Fraud Amount Distribution
plt.subplot(2,2,3)
sns.histplot(fraud['Amount'], bins=50, kde=True, color="red")
plt.title("Fraudulent Transaction Amounts")

# Fraud Transactions Over Time
plt.subplot(2,2,4)
sns.histplot(fraud['Time'], bins=50, kde=False, color="purple")
plt.title("Fraud Transactions Over Time")

plt.tight_layout()
plt.show()


# Interactive Dashboard with Plotly Subplots

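In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=2, cols=2, subplot_titles=(
    "Fraud vs Non-Fraud Count",
    "Transaction Amount Distribution",
    "Fraudulent Transaction Amounts",
    "Fraud Transactions Over Time"
))

# Fraud vs Non-Fraud Count (fills the first panel, which would otherwise stay empty)
class_totals = df['Class'].value_counts().sort_index()
fig.add_trace(go.Bar(x=class_totals.index.astype(str), y=class_totals.values, marker_color="green"),
              row=1, col=1)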

# Transaction Amounts (All)

In [None]:

fig.add_trace(go.Histogram(x=df['Amount'], nbinsx=50, marker_color="blue"),
              row=1, col=2)

# Fraudulent Transaction Amounts

In [None]:

fig.add_trace(go.Histogram(x=fraud['Amount'], nbinsx=50, marker_color="red"),
              row=2, col=1)


# Fraud Over Time

In [None]:


fig.add_trace(go.Histogram(x=fraud['Time'], nbinsx=50, marker_color="purple"),
              row=2, col=2)

fig.update_layout(title_text="Interactive Dashboard of Fraud Analysis", height=800, showlegend=False)
fig.show()

# Fraud Rate by Transaction Amount Bucket
This bar chart analyzes how the likelihood of fraud varies across different transaction amount ranges.

Amount Buckets: Transactions are grouped into ranges of the `Amount` column (e.g., 0–10, 10–100, etc.)

Fraud Rate: Calculated as the proportion of fraudulent transactions within each bucket

Color Gradient: Darker red indicates higher fraud risk
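One detail worth knowing when bucketing with `pd.cut`: intervals are open on the left by default, so a value equal to the lowest edge is left without a bucket. The sketch below shows this with a few hypothetical amounts and shortened edges.

```python
import pandas as pd

# Hypothetical amounts and simplified bucket edges for illustration.
amounts = pd.Series([0.0, 5.0, 10.0, 250.0])
edges = [0, 10, 100, float("inf")]
names = ["0-10", "10-100", "100+"]

# Default intervals are (left, right], so an amount of exactly 0 gets no bucket.
default_buckets = pd.cut(amounts, bins=edges, labels=names)

# include_lowest=True closes the first interval on the left and keeps the zeros.
inclusive_buckets = pd.cut(amounts, bins=edges, labels=names, include_lowest=True)

print(default_buckets.tolist())
print(inclusive_buckets.tolist())
```

An infinite last edge also avoids having to look up the maximum amount in advance.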

In [None]:
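# Create amount buckets (include_lowest keeps zero-amount transactions in the first bucket;
# an infinite last edge keeps the bin edges increasing whatever the maximum amount is)
df['AmountBucket'] = pd.cut(df['Amount'], bins=[0, 10, 100, 500, 1000, 5000, 10000, np.inf],
                            labels=['0–10', '10–100', '100–500', '500–1K', '1K–5K', '5K–10K', '10K+'],
                            include_lowest=True)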

# Fraud rate per bucket
bucket_stats = df.groupby('AmountBucket')['Class'].agg(['count', 'sum'])
bucket_stats['FraudRate'] = bucket_stats['sum'] / bucket_stats['count']

# Plot
fig = px.bar(bucket_stats, x=bucket_stats.index, y='FraudRate',
             title='Fraud Rate by Transaction Amount Bucket',
             labels={'FraudRate': 'Fraud Rate'},
             color='FraudRate', color_continuous_scale='Reds')
fig.show()


# Fraudulent Transactions by Hour of Day (Line Plot)
This line chart shows how fraudulent transactions (Class = 1) are distributed across hourly bins.

Time Conversion: Raw time (in seconds) is converted to hourly bins

X-axis: Hours elapsed since the first transaction (0 to 23+). `Time` is an offset, not a clock time, so values above 23 fall on later days.

Y-axis: Number of fraud cases

Line Plot: Highlights peaks and troughs in fraud activity
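The difference between elapsed hours and a position within a 24-hour cycle can be sketched as below. The offsets are hypothetical. Since the clock time of the first transaction is not known here, hour 0 of the cycle is an assumption relative to that first transaction, not midnight.

```python
import pandas as pd

# Hypothetical offsets in seconds from the first transaction.
time_seconds = pd.Series([0, 3599, 3600, 86399, 86400, 100000])

# Whole hours elapsed since the first transaction (keeps growing past 23).
elapsed_hours = time_seconds // 3600

# Position within a 24-hour cycle (always 0 to 23).
hour_in_cycle = (time_seconds % 86400) // 3600

print(elapsed_hours.tolist())
print(hour_in_cycle.tolist())
```

The first list is `[0, 0, 1, 23, 24, 27]` and the second is `[0, 0, 1, 23, 0, 3]`, so the modulo version folds later days back onto the same 24 bins.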

In [None]:
# Convert time to hours (assuming time is in seconds)
df['Hour'] = (df['Time'] / 3600).astype(int)

# Fraud frequency by hour
hourly_fraud = df[df['Class'] == 1]['Hour'].value_counts().sort_index()

# Plot
fig = px.line(x=hourly_fraud.index, y=hourly_fraud.values,
              labels={'x': 'Hour of Day', 'y': 'Number of Fraud Cases'},
              title='Fraudulent Transactions by Hour of Day')
fig.show()
