### **Introduction to the Project**

In today's digital world, credit card fraud is a major concern that costs businesses and consumers billions of dollars every year. One of the biggest challenges in detecting it is the extreme imbalance in the data: fraudulent transactions are rare compared to legitimate ones. This project takes a deep dive into Credit Card Fraud Detection, a crucial problem, and explores effective ways to handle imbalanced datasets.

To build a reliable fraud detection system, we cannot simply train standard machine learning models on the raw data. Without proper preprocessing, a model would be biased towards predicting every transaction as non-fraudulent, simply because fraudulent cases are so few. To address this, we use SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset, so that the model has enough fraud examples to learn to distinguish fraudulent from legitimate transactions.

This project is not just about detecting fraud; it is about understanding data imbalance and mastering techniques to overcome it before training a classification model. By the end of this notebook, you will have a solid grasp of handling skewed data distributions and building more robust fraud detection systems.

#### **Dataset Overview**
The dataset used in this project consists of credit card transactions recorded in September 2013 by European cardholders. The transactions span over two days and contain a total of 284,807 records, out of which only 492 transactions are fraudulent. This means that the fraudulent cases make up a mere 0.172% of the entire dataset, making it highly imbalanced.

Due to confidentiality reasons, the dataset only includes numerical features, which are the result of Principal Component Analysis (PCA) transformations. Here’s a breakdown of the key features:

* `V1` to `V28`: Principal components extracted from PCA transformation.
* `Time`: The elapsed time (in seconds) between each transaction and the first transaction in the dataset.
* `Amount`: The transaction amount, which can be used for cost-sensitive learning.
* `Class`: The target variable, where:
    * `0` indicates a legitimate transaction.
    * `1` indicates a fraudulent transaction.

Since the original feature names and additional background information about the dataset are confidential, we rely solely on these transformed variables for fraud detection.

#### **Dataset Information**
We are using the Credit Card Fraud Detection dataset, available on Kaggle:
🔗 Credit Card Fraud Dataset: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

Next, we define how the model will be evaluated, so that we end up with a fraud detection model that does more than just predict "no fraud" and actually detects fraud!

#### **Evaluation Strategy**
When dealing with highly imbalanced datasets, traditional accuracy metrics can be misleading. In fraud detection, where fraudulent transactions make up a tiny fraction of the data, a model that predicts "no fraud" for every transaction would be about 99.8% accurate, yet it would fail to identify a single actual fraud case.
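
To make this concrete, the arithmetic below uses the class counts quoted in the dataset overview (284,807 transactions, 492 frauds) and scores a hypothetical classifier that labels every transaction as legitimate. It is only an illustration of the arithmetic, not one of the models trained in this notebook.

```python
# Class counts quoted in the dataset overview
total_transactions = 284_807
fraud_transactions = 492
legit_transactions = total_transactions - fraud_transactions

# Hypothetical "always predict no fraud" classifier:
# every legitimate transaction is correct, every fraud is missed.
true_positives = 0
true_negatives = legit_transactions
false_negatives = fraud_transactions

accuracy = (true_positives + true_negatives) / total_transactions
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.3%}")
print(f"Recall on fraud: {recall:.0%}")
```

The accuracy looks excellent while the recall on the fraud class is zero, which is why accuracy alone is not a useful target here.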

To ensure a meaningful evaluation, we will use the Area Under the Precision-Recall Curve (AUPRC) as our primary metric. Unlike standard accuracy or even AUC-ROC, AUPRC focuses on how well the model identifies positive cases (fraudulent transactions) while minimizing false positives, making it better suited for unbalanced classification problems.
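
As a rough sketch of how AUPRC can be computed, the snippet below uses scikit-learn's `average_precision_score`, a common estimator of the area under the precision-recall curve. The data is synthetic (generated with `make_classification`, roughly 1% positives) and the variable names are illustrative assumptions, so the numbers say nothing about the credit card dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 1% positives); NOT the credit card dataset
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99, 0.01], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUPRC needs scores (probabilities), not hard 0/1 predictions
fraud_scores = model.predict_proba(X_te)[:, 1]

auprc = average_precision_score(y_te, fraud_scores)
auroc = roc_auc_score(y_te, fraud_scores)
baseline = y_te.mean()  # a random ranker scores roughly the positive rate

print(f"AUPRC:   {auprc:.3f} (random baseline ~ {baseline:.3f})")
print(f"AUC-ROC: {auroc:.3f}")
```

Note that the AUPRC baseline is the positive-class rate rather than 0.5, so the score has to be read relative to how rare the positive class is.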

#### **Evaluation Approach**
Because this is a real-world dataset rather than a Kaggle competition with a predefined test set, we hold out 30% of the data as our test set (`test_size=0.3` in the splitting cell).

The model is then evaluated on transactions it has not seen during training, which lets us measure its ability to detect fraudulent activity.
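
With so few positives, a purely random split can leave the test set with a distorted fraud rate. One option, sketched here on toy labels, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The `stratify` argument is an assumption added for illustration; the splitting cell in this notebook does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy stand-in: 10,000 rows, 20 of them positive (0.2%)
y_toy = np.zeros(10_000, dtype=int)
y_toy[:20] = 1
X_toy = rng.normal(size=(10_000, 3))

# stratify=y_toy preserves the 0.2% positive rate in both splits
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(f"Positive rate overall: {y_toy.mean():.4f}")
print(f"Positive rate train:   {y_train_toy.mean():.4f}")
print(f"Positive rate test:    {y_test_toy.mean():.4f}")
```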


In [None]:
# Importing Libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Pre-Processing Libs
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE 
from sklearn.model_selection import train_test_split

# Modelling Libs
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Validating/Testing libs
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score


print("Libraries Imported")

#### **Loading the Data: Preparing for Fraud Detection**
The raw CSV exceeds 100MB, which is above GitHub's per-file size limit, so it is stored as a ZIP archive for efficient storage and transfer. The archive needs to be unzipped before we can start working with the data; this is easily handled with Python’s built-in utilities, after which Pandas can load the CSV.

**Steps for Data Loading & Exploration**
* Extract the dataset: Unzip the file to access the raw data. Load the dataset using Pandas for further processing.
* Explore the dataset: Display the first few rows to understand the structure. Check for missing values and data consistency.
* Understand the dataset: Analyze basic information such as column names, data types, and null values. Examine dataset size and the distribution of fraud vs. non-fraud cases.
* Summarize key statistics: Generate descriptive statistics for numerical features. Understand patterns in transaction amounts and time-based trends.
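
The extract-and-load step above can be sketched as follows. As an alternative to unzipping by hand, `pd.read_csv` can read a ZIP archive directly when it contains a single CSV. The file names and the toy values below are hypothetical stand-ins so that the example runs on its own; they are not rows from the real dataset.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "creditcard_sample.csv"  # hypothetical file name
    zip_path = Path(tmp_dir) / "creditcard_sample.zip"  # hypothetical file name

    # Tiny toy CSV that reuses a few of the real column names
    pd.DataFrame(
        {"Time": [0.0, 1.0, 2.0], "V1": [0.1, -0.2, 0.3],
         "Amount": [10.0, 25.5, 3.2], "Class": [0, 0, 1]}
    ).to_csv(csv_path, index=False)

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(csv_path, arcname=csv_path.name)

    # pandas infers zip compression from the ".zip" extension;
    # the archive must contain exactly one data file
    sample_df = pd.read_csv(zip_path)

print(sample_df.head())             # first rows
print(sample_df.isna().sum())       # missing values per column
print(sample_df["Class"].value_counts())  # fraud vs. non-fraud counts
```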


In [None]:
# Read the extracted CSV (adjust the path to wherever you unzipped the data)
credit_df = pd.read_csv("../input/creditcardfraud/creditcard.csv")
 
# Display dataset
credit_df

In [None]:
# Some Information
credit_df.info()

**Observation**

All columns except for `Time` and `Amount` have already been transformed by **PCA** and **scaled**.

That leaves `Time` and `Amount` as the only raw columns, so we will **PCA** transform and **scale** those two later. But before we do anything like this, we'll do some Exploratory Data Analysis (EDA).

In [None]:
# Some Description
credit_df.describe().T

**Observation**

There isn't much to say. The features `V1`-`V28` are anonymized, so we have no information about what they represent, because the original columns contain confidential information that cannot be disclosed to the public. These columns are already well-processed (PCA transformation, dimensionality reduction, and scaling), so they need no further work.

We just need to deal with the **Time** and the **Amount** column.

### **EDA**
* Look at the distribution of the `Time`, `Amount` and `Class` columns
* See how severe the class imbalance is

In [None]:
# Distribution of Time
px.histogram(x=credit_df["Time"])

In [None]:
# Distribution of Amount
px.histogram(x=credit_df["Amount"])

**Observation**

Most values are between 0 and 100, with rare transactions above 5k. We can't treat those as outliers, since transferring more than 5k in a single transaction is entirely plausible.

In [None]:
# Frauds and Non-Frauds
plt.figure(figsize=(8, 5), dpi=120)
credit_df.Class.value_counts().plot(kind="pie", explode=[0, 0.1], shadow=True, startangle=140, autopct='%1.1f%%')
plt.legend(labels=['Normal','Fraud'])
plt.title('"Fraud" Distribution')
plt.axis('off')
plt.show()

**Observation**

As you can see, fraud makes up only about 0.2% of the data (492 samples out of 284,807 entries), which is a severe imbalance. If we train a model on the data as it is, it will rarely, if ever, predict a **FRAUD** case. Dealing with this imbalance is the main topic of this project: Imbalanced Classification!

In [None]:
# Relation of Non-Frauds and Frauds with Transaction Time
values = credit_df["Class"].value_counts().index
figure, (non_fraud, fraud) = plt.subplots(2,1, sharex=True, figsize=(15, 10))

non_fraud.hist((credit_df["Time"]/60/60)[credit_df["Class"] == 0], bins=50, color="lightgreen")
non_fraud.set_title("Class = NON-FRAUD")

fraud.hist((credit_df["Time"]/60/60)[credit_df["Class"] ==1 ], bins=50, color="salmon")
fraud.set_title("Class = FRAUD")

plt.xticks(np.arange(0,54,6))
plt.xlim([0,48])
plt.xlabel("Time after first transaction (HOURS)")
plt.ylabel('Number of Transactions')

plt.show()

**Observation**

As you can see, the number of transactions from genuine users drops during late-night and early-morning hours, which makes sense since most people are asleep then. In contrast, fraudulent transactions show sharp spikes during the late hours, while their count during the daytime is significantly lower.
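
If you want to compare the two days hour by hour, one possible approach is to fold `Time` onto a 24-hour cycle. The sketch below uses a toy `Time` column, and the column name `HourOfCycle` is a hypothetical choice. It also assumes hour 0 is simply the first transaction, since the dataset does not state the wall-clock time at which recording started.

```python
import pandas as pd

# Toy stand-in for the `Time` column (seconds since the first transaction)
toy_df = pd.DataFrame({"Time": [0, 3_599, 3_600, 86_399, 86_400, 100_000]})

# Whole hours elapsed, wrapped onto a 24-hour cycle.
# Assumption: hour 0 is the first transaction, not necessarily midnight.
toy_df["HourOfCycle"] = (toy_df["Time"] // 3600) % 24

print(toy_df)
```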


#### **Cleaning Data**

In [None]:
# Let's create a copy and do all the wrangling there so the original dataset is preserved
credit_df_copy = credit_df.copy()

In [None]:
credit_df_copy.isna().sum()

No **NULL** values

In [None]:
# Duplicate rows (count per class)
print(f"Non-Frauds: {credit_df_copy[credit_df_copy.Class == 0].duplicated().sum()}")
print(f"Frauds: {credit_df_copy[credit_df_copy.Class == 1].duplicated().sum()}")
print("*" * 100)

# Drop
credit_df_copy.drop_duplicates(inplace=True)
print("Dropped Successfully")
print("*" * 100)

# Check
print(f"Non-Frauds: {credit_df_copy[credit_df_copy.Class == 0].duplicated().sum()}")
print(f"Frauds: {credit_df_copy[credit_df_copy.Class == 1].duplicated().sum()}")

We will not remove outliers, because extreme values are plausible here: a transaction amount can easily exceed 5k, and gaps in `Time` can be caused by network or other technical issues. The dataset also appears to have been generated by an automated process rather than entered manually, so unusual values are unlikely to be data-entry errors.

### **Data Pre-Processing**

* PCA-transforming the `Time` & `Amount` columns
* Using the `RobustScaler()` to scale the `Time` & `Amount` columns
* Using the `SMOTE` technique to resolve the class imbalance

In [None]:
# PCA transformations
pca = PCA(n_components = 2)
columns = credit_df_copy[["Time", "Amount"]]
pca.fit(columns)
credit_df_copy[["Time", "Amount"]] = pca.transform(columns)

In [None]:
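# Scaling with the Robust Scaler
# Fit on the PCA-transformed columns; reusing `columns` (the raw values) would discard the PCA step
pca_columns = credit_df_copy[["Time", "Amount"]]
transformer = RobustScaler().fit(pca_columns)
credit_df_copy[["Time", "Amount"]] = transformer.transform(pca_columns)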
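
For intuition about what `RobustScaler` does with its default settings, it centers each column on its median and divides by the interquartile range, which keeps a few very large amounts from dominating the scale. The toy amounts below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy amounts with one very large value (illustrative only)
amounts = np.array([[1.0], [5.0], [20.0], [75.0], [100.0], [5000.0]])

scaled = RobustScaler().fit_transform(amounts)

# The same computation by hand: (x - median) / (Q3 - Q1)
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```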

In [None]:
# Using SMOTE to balance the data
X = credit_df_copy.drop('Class', axis = 1)
y = credit_df_copy['Class']

smote = SMOTE(random_state=42)
X, y = smote.fit_resample(X, y)

# Plot the results
fig = px.pie(values=y.value_counts(), 
             width=800, height=400, 
             title="Data Balance",
             color_discrete_sequence=["skyblue","black"])
fig.show()
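
One caveat about ordering: resampling before the train/test split means synthetic rows can end up in the test set. A common alternative is to split first and oversample only the training portion. The sketch below illustrates that order on toy data, using a hypothetical helper `smote_like_oversample` that mimics SMOTE's core idea of interpolating between a minority row and one of its nearest minority neighbours. It is a simplified stand-in, not imblearn's implementation. In this notebook the equivalent step would be calling `smote.fit_resample` on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=42):
    """Hypothetical helper: build n_new synthetic minority rows by interpolating
    between a minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each row's closest match is normally itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbour_idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)   # starting rows
    column = rng.integers(1, k + 1, size=n_new)      # skip column 0 (the row itself)
    picked = neighbour_idx[base, column]
    gap = rng.random((n_new, 1))                     # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy imbalanced data: 1,000 majority rows and 40 minority rows
data_rng = np.random.default_rng(0)
X_toy = np.vstack([data_rng.normal(0, 1, (1000, 2)), data_rng.normal(3, 1, (40, 2))])
y_toy = np.array([0] * 1000 + [1] * 40)

# 1) Split first, so the test set keeps only real rows at the real class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# 2) Oversample the minority class in the training portion only
X_min = X_tr[y_tr == 1]
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = smote_like_oversample(X_min, n_new)
X_tr_bal = np.vstack([X_tr, X_syn])
y_tr_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

print("Train class counts after oversampling:", np.bincount(y_tr_bal))
print("Test class counts (untouched):        ", np.bincount(y_te))
```

The training set ends up balanced, while the test set still reflects the original class ratio, so metrics computed on it describe performance on realistic data.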

#### **Model: Training & Evaluation**
With our dataset prepared, the next step is to train and evaluate multiple machine learning models to determine the most effective approach for credit card fraud detection. Given the severe class imbalance, choosing the right model and handling data distribution carefully is crucial.

* **Splitting the Data**: To ensure a fair evaluation, we will split the dataset into training and testing sets:

    1. Training Set: Used to train the model.
    2. Testing Set: Used to evaluate the model’s performance on unseen data.

    Since fraud cases are rare, the dataset must be balanced carefully to prevent the model from predicting only non-fraudulent transactions. Note that SMOTE was applied before this split, so the test set here is balanced and also contains synthetic fraud samples.

* **Trying Out Different Models**: We will experiment with multiple supervised learning algorithms to identify the best-performing classifier. The models we’ll test include:

    1. Logistic Regression: A simple yet effective baseline model for classification tasks. Works well with imbalanced data when combined with techniques like class weighting.
    2. Naive Bayes (GaussianNB): A probabilistic model that assumes feature independence. Can be useful for high-dimensional datasets like ours.
    3. Decision Tree Classifier: A single tree of threshold-based splits. Easy to interpret and a useful reference point for the ensemble models.
    4. Random Forest Classifier: An ensemble learning technique that builds multiple decision trees. Usually more robust on imbalanced datasets than a single tree.
    5. K-Neighbors Classifier: A distance-based algorithm that classifies a point based on its nearest neighbors. May struggle with large datasets due to computational cost.
    6. XGBoost Classifier: A powerful gradient boosting algorithm that often performs well on imbalanced data. Its regularization features help limit overfitting.

* **Model Comparison & Selection** : Each model will be evaluated using the AUPRC (Area Under Precision-Recall Curve) and other relevant metrics to ensure accurate fraud detection. Based on performance, we will select the best-performing model for deployment.
By testing multiple models and comparing their results, we aim to develop a robust fraud detection system that effectively identifies fraudulent transactions while minimizing false positives.
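
The class weighting mentioned for Logistic Regression can be sketched as below. This is an alternative to resampling, not what the training cells in this notebook do: `class_weight="balanced"` reweights the loss so that errors on the rare class count more. The data is synthetic and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 2% positives); NOT the credit card dataset
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
precision_weighted = precision_score(y_te, weighted.predict(X_te), zero_division=0)

print(f"Recall without class weights: {recall_plain:.3f}")
print(f"Recall with class weights:    {recall_weighted:.3f}")
print(f"Precision with class weights: {precision_weighted:.3f}")
```

Class weighting typically raises recall on the rare class at the cost of some precision, so both numbers should be checked together.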

In [None]:
# Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"""Data split. Here are the stats:

Rows in X_train: {X_train.shape[0]}
Rows in y_train: {y_train.shape[0]}

Rows in X_test: {X_test.shape[0]}
Rows in y_test: {y_test.shape[0]}

Columns in X_train & X_test: {X_train.shape[1]}
Columns in y_train & y_test: 1 - the TARGET column (i.e. Class)""")

In [None]:
# Naive Bayes
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# round() works whether score() returns a Python float or a NumPy float
classifier_score = round(classifier.score(X_test, y_test), 5)

In [None]:
# Decision Tree
dt = DecisionTreeClassifier(max_features=8, max_depth=6)
dt.fit(X_train, y_train)
dt_score = round(dt.score(X_test, y_test), 5)

In [None]:
# Random Forest Classifier
Rclf = RandomForestClassifier(max_features=8, max_depth=6)
Rclf.fit(X_train, y_train)
Rclf_score = round(Rclf.score(X_test, y_test), 5)

In [None]:
# Logistic Regression
lr = LogisticRegression(C=100, max_iter=1000)
lr.fit(X_train, y_train)
lr_score = round(lr.score(X_test, y_test), 5)

In [None]:
# K-Nearest
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
knn_score = round(knn.score(X_test, y_test), 5)

In [None]:
# XGBoost
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
xgb_score = round(xgb.score(X_test, y_test), 5)

#### **Metrics**

* Accuracy
* F-1 Score
* Precision Score
* Recall Score
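
As a reminder of how these four metrics relate to the confusion matrix, the arithmetic below computes them from hypothetical counts. The numbers are made up for illustration and are not results from this notebook.

```python
# Hypothetical confusion-matrix counts for a fraud classifier (illustrative only)
tp, fp, fn, tn = 80, 20, 40, 9_860

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)     # of the actual frauds, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

Here accuracy stays above 99% even though a third of the frauds are missed, which is why precision, recall, and F1 are reported alongside it.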

In [None]:
model_comparison = {}
# Keep the display names in the same order as the fitted models
names = ["Naive Bayes", "Decision Tree", "Random Forest", "Logistic Regression", "KNN", "XGBoost"]
models = [classifier, dt, Rclf, lr, knn, xgb]
results = {}

# Make Predictions
for name, model in zip(names, models):
    results[name] = model.predict(X_test)

In [None]:
# Find the scores of the metrics
# (F1 is weighted across both classes; precision and recall refer to the fraud class)
for name, preds in results.items():
    model_comparison[name] = [
        round(accuracy_score(y_test, preds), 4),
        round(f1_score(y_test, preds, average='weighted'), 4),
        round(precision_score(y_test, preds), 4),
        round(recall_score(y_test, preds), 4),
    ]

In [None]:
results_df = pd.DataFrame(model_comparison, index=["Accuracy", "F-1 Score", "Precision Score", "Recall Score"])
results_df.style.format("{:.2%}").background_gradient(cmap='Blues')

In [None]:
# Cross Validation Scores (5-fold, default accuracy scoring) on the un-resampled data
X_cv = credit_df_copy.drop("Class", axis=1)
y_cv = credit_df_copy["Class"]

classifier_cr = cross_val_score(classifier, X_cv, y_cv, cv=5).mean()
dt_cr = cross_val_score(dt, X_cv, y_cv, cv=5).mean()
Rclf_cr = cross_val_score(Rclf, X_cv, y_cv, cv=5).mean()
lr_cr = cross_val_score(lr, X_cv, y_cv, cv=5).mean()
knn_cr = cross_val_score(knn, X_cv, y_cv, cv=5).mean()
xgb_cr = cross_val_score(xgb, X_cv, y_cv, cv=5).mean()

In [None]:
# Cross Validation Scores in a Plot
cross_validated_scores = [classifier_cr, dt_cr, Rclf_cr, lr_cr, knn_cr, xgb_cr]
cross_validated_scores = pd.DataFrame(cross_validated_scores, index=["GaussianNB", 
                                            "DecisionTreeClassifier", 
                                            "RandomForestClassifier",
                                            "LogisticRegression",
                                            "KNeighborsClassifier", 
                                            "XGBClassifier"])
cross_validated_scores.rename(columns={0 : "Score"}, inplace=True)
cross_validated_scores.plot(kind="bar", figsize=(10, 5), color=["salmon"])
plt.xticks(rotation=45)
plt.show()
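
Because `cross_val_score` defaults to accuracy for classifiers, the scores above can look high on imbalanced data regardless of how well fraud is detected. A possible variation, sketched on synthetic data, passes `scoring="average_precision"` so that cross-validation reports an AUPRC estimate instead. The explicit `StratifiedKFold` object and the variable names are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (about 3% positives); NOT the credit card dataset
X_demo, y_demo = make_classification(
    n_samples=5_000, n_features=10, weights=[0.97, 0.03], random_state=1
)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

accuracy_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
ap_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="average_precision")

print(f"Mean accuracy:          {accuracy_scores.mean():.3f}")
print(f"Mean average precision: {ap_scores.mean():.3f}")
```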

### **Conclusion**

This project successfully explored the challenge of credit card fraud detection, a real-world problem where fraudulent transactions are extremely rare compared to legitimate ones. The key takeaway from this analysis is that handling class imbalance is crucial to building an effective fraud detection system.

**Key Learnings & Takeaways**
* **Data Preprocessing & Feature Engineering**
    1. The dataset consisted of numerical features obtained from PCA transformation, with `Amount` and `Time` being the only raw variables.
    2. Data cleaning and exploration (null checks, duplicate removal, and distribution plots) helped in understanding patterns and distributions.

* **Handling Class Imbalance**
    1. Fraudulent transactions accounted for only 0.172% of the total dataset, making it highly skewed.
    2. We applied SMOTE (Synthetic Minority Over-sampling Technique) to balance the data before training.

* **Model Selection & Performance Evaluation**
    1. Multiple machine learning models, including Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, K-Nearest Neighbors, and XGBoost, were tested.
    2. AUPRC (Area Under the Precision-Recall Curve) was identified as the most meaningful primary metric for fraud detection; the comparison table reports accuracy, F1, precision, and recall, alongside 5-fold cross-validation scores.
    3. The preferred model is the one that best balances precision and recall, reducing false positives while still identifying fraudulent transactions.
