# Introduction to the Project

Hey there! This notebook walks through a famous yet often overlooked problem: Credit Card Fraud Detection. What makes this project special is that the dataset is highly imbalanced. To deal with that we'll use techniques like **SMOTE**, and we'll surely learn something new along the way.

The difficulty in cases like this is that frauds are very rare. We need a classification model that predicts whether a transaction is fraudulent, but without some pre-processing a model can score well simply by predicting **NOT** fraud almost every time, even when the transaction is a fraud!
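To make that concrete, here is a small sketch of the "always predict NOT fraud" baseline. The labels are made up to mirror the dataset's proportions (492 frauds in 284,807 rows); `y_true` and `y_pred` are placeholder names, not variables from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with the same class proportions as the dataset
y_true = np.zeros(284_807, dtype=int)
y_true[:492] = 1  # 1 = fraud

# A "model" that always predicts NOT fraud
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {baseline_accuracy:.4%}")
print(f"Recall on frauds: {baseline_recall:.4%}")
```

The accuracy looks excellent even though not a single fraud is caught, which is why accuracy alone tells us very little here.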

So this project mainly revolves around the idea of imbalance and the techniques used to balance our data before modelling.

# Data
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data are not provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

# Evaluation
Given the class imbalance ratio, the recommended measure is the Area Under the Precision-Recall Curve (AUPRC). Confusion-matrix accuracy alone is not meaningful for unbalanced classification, so further down we also report F1, precision and recall.
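Here is a minimal sketch of how AUPRC could be computed with scikit-learn's `average_precision_score`, which summarises the precision-recall curve in a single number. It runs on a synthetic imbalanced dataset from `make_classification`, not on the credit card data, and the names `X_demo`, `y_demo`, `demo_model` and `fraud_scores` are placeholders for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 1% positives (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99, 0.01], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUPRC needs scores or probabilities for the positive class, not hard 0/1 predictions
fraud_scores = demo_model.predict_proba(X_te)[:, 1]
auprc = average_precision_score(y_te, fraud_scores)

# A classifier that ranks at random scores about the positive rate
chance_level = y_te.mean()
print(f"AUPRC: {auprc:.3f} (chance level is about {chance_level:.3f})")
```

Unlike accuracy, the chance level for AUPRC is the fraction of positives, so a model has to actually rank frauds above normal transactions to score well.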

As this is just a dataset (and not a Kaggle competition), we'll hold out 30% of our data as the test set, and then we'll begin modelling.

The Dataset/Kaggle Link: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

In [None]:
# Importing Libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Pre-Processing Libs
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import SMOTE 
from sklearn.model_selection import train_test_split

# Modelling Libs
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Validating/Testing libs
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score


print("Libraries Imported")

# Loading the Data
* We'll have to unzip a folder (I had to zip it because the file was too big, and GitHub doesn't allow pushing files over 100 MB). Pandas can read the zipped file directly.

* Look at our data

* Some Information about our data

* A bit of description

In [None]:
# Read the dataset using the compression zip
credit_df = pd.read_csv('creditcard.csv.zip',compression='zip')
 
# Display dataset
credit_df

In [None]:
# Some Information
credit_df.info()

**Observation**

Well, this is not bad. No categorical/string columns are included. All columns except for `Time` and `Amount` have already been transformed by **PCA** and **scaled**.

So we'll have to **PCA** transform and **scale** `Time` and `Amount` ourselves. But before we do anything like this, we'll do some Exploratory Data Analysis (EDA).

In [None]:
# Some Description
credit_df.describe().T

**Observation**

There isn't much to say. The features `V1-V28` are anonymous and we have no information whatsoever about what they represent. This is because these columns hold confidential information that cannot be disclosed to the general public. But these columns are already well-processed (PCA transformation, dimensionality reduction and scaling), so no worries.

We just need to deal with the **Time** and the **Amount** column.

# Some EDA
* Look at the distribution of the `Time`, `Amount` and `Class` column
* Experience the horrible imbalance

In [None]:
# Distribution of Time
px.histogram(x=credit_df["Time"])

**Observation**

I see no pattern other than a wave-like shape (starting low, rising, dropping and then rising again).

In [None]:
# Distribution of Amount
px.histogram(x=credit_df["Amount"])

**Observation**

Most values are around 0-100, and there are rare cases with more than 5k. But we can't consider them as outliers as it is very much possible to transfer over 5k to anyone.

In [None]:
# Frauds and Non-Frauds
plt.figure(figsize=(8, 5), dpi=120)
credit_df.Class.value_counts().plot(kind="pie", explode=[0, 0.1], shadow=True, startangle=140, autopct='%1.1f%%')
plt.legend(labels=['Normal','Fraud'])
plt.title('"Fraud" Distribution')
plt.axis('off')
plt.show()

**Observation**

So as you can see, only about 0.2% of the data is fraud (492 samples out of 284,807 entries), which is a severe imbalance. If we train our model on the data as it is, there is very little chance it will ever predict a **FRAUD** case. So we'll have to deal with this, and this project is mainly about this topic: dealing with imbalanced classification!

In [None]:
# Relation of Non-Frauds and Frauds with Transaction Time
values = credit_df["Class"].value_counts().index
figure, (non_fraud, fraud) = plt.subplots(2,1, sharex=True, figsize=(15, 10))

non_fraud.hist((credit_df["Time"]/60/60)[credit_df["Class"] == 0], bins=50, color="lightgreen")
non_fraud.set_title("Class = NON-FRAUD")

fraud.hist((credit_df["Time"]/60/60)[credit_df["Class"] ==1 ], bins=50, color="salmon")
fraud.set_title("Class = FRAUD")

plt.xticks(np.arange(0,54,6))
plt.xlim([0,48])
plt.xlabel("Time after first transaction (HOURS)")
plt.ylabel('Number of Transactions')

plt.show()

**Observation**

As you can see, the number of transactions by genuine users drops during late-night and early-morning hours. That makes sense, since most people are asleep during these hours. Fraudulent transactions, on the contrary, show sharp spikes during late hours, and their count is significantly lower during the daytime.
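If you want numbers behind that picture, one option is to bucket `Time` into hours and look at the fraud rate per bucket. The sketch below uses a made-up frame called `toy` (not `credit_df`) with random values, so its output carries no meaning; it only shows the mechanics. Also note the assumption baked in: the clock time of the first transaction is unknown, so `Hour` is "hours since the first transaction, modulo 24" rather than a true hour of day.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up stand-in for the real data: two days of seconds and a rare positive class
toy = pd.DataFrame({
    "Time": rng.integers(0, 172_800, size=10_000),   # seconds since first transaction
    "Class": rng.binomial(1, 0.002, size=10_000),    # 1 = fraud
})

# Seconds -> whole hours since the first transaction, wrapped to a 24-hour cycle
toy["Hour"] = (toy["Time"] // 3600) % 24

# Share of frauds and number of transactions in each hour bucket
fraud_rate_by_hour = toy.groupby("Hour")["Class"].mean()
transactions_by_hour = toy.groupby("Hour")["Class"].size()

print(fraud_rate_by_hour.head())
```

Comparing the fraud rate with the raw counts per hour helps separate "more frauds at night" from "fewer normal transactions at night".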


# Cleaning Data
* Missing Data?
* Duplicate Data?
* Outliers?

Let's clean the data a bit.

In [None]:
# Let's create a copy and do all the wrangling on it so we keep our original dataset preserved
credit_df_copy = credit_df.copy()

In [None]:
credit_df_copy.isna().sum()

**Observation**

No **NULL** values

In [None]:
# Duplicate data (number of duplicated rows per class)
print(f"Non-Frauds: {credit_df_copy[credit_df_copy.Class == 0].duplicated().sum()}")
print(f"Frauds: {credit_df_copy[credit_df_copy.Class == 1].duplicated().sum()}")
print("*" * 100)

# Drop
credit_df_copy.drop_duplicates(inplace=True)
print("Dropped Successfully")
print("*" * 100)

# Check
print(f"Non-Frauds: {credit_df_copy[credit_df_copy.Class == 0].duplicated().sum()}")
print(f"Frauds: {credit_df_copy[credit_df_copy.Class == 1].duplicated().sum()}")

Regarding outliers, we'll not deal with them, because every value here is plausible. The amount could easily be over 5k, and the time could be longer because of internet or other technical issues. So I don't think there are any true outliers, as this dataset seems to have been produced by an automated process rather than entered manually.

# Data Pre-Processing
* PCA Transforming the `Time` & `Amount` columns
* Using the `RobustScaler()` to scale the `Time` & `Amount` columns
* Using the `SMOTE` technique to fix the imbalance


**Some Resources**

1. To learn more about PCA transformations, you can read this: https://builtin.com/data-science/step-step-explanation-principal-component-analysis

2. To learn about Scaling, read this: https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/

3. To read on RobustScaler, have a look here: https://www.geeksforgeeks.org/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/
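As a quick illustration of what `RobustScaler` does with its default settings, the sketch below scales a tiny made-up `amounts` column and compares the result with the manual formula `(x - median) / IQR`. The values are invented for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up amounts with one extreme value, as a single-column 2-D array
amounts = np.array([[1.0], [5.0], [20.0], [50.0], [100.0], [5000.0]])

# Default RobustScaler: centre on the median, scale by the interquartile range
scaled = RobustScaler().fit_transform(amounts)

# The same computation by hand
median = np.median(amounts, axis=0)
q1, q3 = np.percentile(amounts, [25, 75], axis=0)
manual = (amounts - median) / (q3 - q1)

print(np.allclose(scaled, manual))
print(scaled.ravel().round(2))
```

Because the median and the IQR barely move when a single huge amount is present, the typical values keep a sensible scale, which is the reason to prefer this scaler for a column like `Amount`.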

In [None]:
# PCA transformations
pca = PCA(n_components = 2)
columns = credit_df_copy[["Time", "Amount"]]
pca.fit(columns)
credit_df_copy[["Time", "Amount"]] = pca.transform(columns)

In [None]:
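# Scaling with the Robust Scaler
# Re-read the columns from the frame: the `columns` variable from the PCA cell
# is a stale copy of the untransformed values and would undo the PCA step
pca_columns = credit_df_copy[["Time", "Amount"]]
transformer = RobustScaler().fit(pca_columns)
credit_df_copy[["Time", "Amount"]] = transformer.transform(pca_columns)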

In [None]:
# Using SMOTE to balance the data
X = credit_df_copy.drop('Class', axis = 1)
y = credit_df_copy['Class']

smote = SMOTE(random_state=42)
X, y = smote.fit_resample(X, y)

# Plot the results
fig = px.pie(values=y.value_counts(), 
             width=800, height=400, 
             title="Data Balance",
             color_discrete_sequence=["skyblue","black"])
fig.show()

Beautiful, isn't it? We did some PCA transformations, did some Scaling using the RobustScaler method and also balanced our dataset!
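For intuition about what SMOTE does under the hood, here is a simplified NumPy-only sketch of its core idea: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is not imblearn's implementation, and `smote_like_oversample` and `X_train_fraud` are hypothetical names used only for this example. The sketch also assumes you would feed it fraud rows from the training split only, so that no synthetic point is built from test rows.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between all minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Random base sample, random neighbour of it, random position in between
    base_idx = rng.integers(0, len(X_minority), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))  # values in [0, 1)

    return X_minority[base_idx] + gaps * (X_minority[neigh_idx] - X_minority[base_idx])

# Made-up "fraud rows from the training split" (20 rows, 3 features)
X_train_fraud = np.random.default_rng(0).normal(size=(20, 3))
synthetic = smote_like_oversample(X_train_fraud, n_new=80)
print(synthetic.shape)
```

Every synthetic row lies between two real minority rows, which is why oversampling before the train/test split can place near-copies of test frauds into the training data.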

Now we move to the fun part: the modelling. Let's go!

# Modelling
* Split the Data into Training & Testing sets
* Try out different models like
    1. Logistic Regression
    2. Naive Bayes (GaussianNB)
    3. Random Forest Classifier
    4. K-Neighbors Classifier
    5. XGBoost Classifier
    6. Decision Tree Classifier

* Then we'll do some testing on our test set, and push our project to [GitHub](https://github.com/muhammadanas0716/Machine-Learning-Projects-101) and [Kaggle](https://www.kaggle.com/muhammadanas0716)

In [None]:
# Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"""Data split. Here are the stats:

Rows in X_train: {X_train.shape[0]}
Rows in y_train: {y_train.shape[0]}

Rows in X_test: {X_test.shape[0]}
Rows in y_test: {y_test.shape[0]}

Columns in X_train & X_test: {X_train.shape[1]}
y_train & y_test hold a single column - the TARGET column (i.e. Class)""")

In [None]:
# Naive Bayes
classifier = GaussianNB()
classifier.fit(X_train , y_train)
classifier_score = classifier.score(X_test , y_test).round(5)

In [None]:
# Decision Tree
dt = DecisionTreeClassifier(max_features=8, max_depth=6)
dt.fit(X_train , y_train)
dt_score = dt.score(X_test , y_test).round(5)

In [None]:
# Random Forest Classifier
Rclf = RandomForestClassifier(max_features=8 , max_depth=6)
Rclf.fit(X_train, y_train)
Rclf_score = Rclf.score(X_test, y_test).round(5)

In [None]:
# Logistic Regression
lr = LogisticRegression(C = 100, max_iter=1000)
lr.fit(X_train , y_train)
lr_score = lr.score(X_test , y_test).round(5)

In [None]:
# K-Nearest
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
knn_score = knn.score(X_test, y_test).round(5)

In [None]:
# XGBoost
xgb = XGBClassifier()
xgb.fit(X_train , y_train)
xgb_score = xgb.score(X_test, y_test).round(5)

# Testing Metrics

We'll use the following metrics:
* Accuracy
* F-1 Score
* Precision Score
* Recall Score
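As a reminder of how these metrics relate to each other, the sketch below derives them by hand from a confusion matrix and checks them against scikit-learn. The ten labels are invented for the example and have nothing to do with our models' predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 1 = fraud, 0 = normal
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted frauds, how many were real
recall = tp / (tp + fn)      # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# The hand-computed values match scikit-learn's (binary, positive class = 1)
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```

For fraud detection, recall tells us how many frauds slip through, while precision tells us how many alarms are false, so both are worth reading next to accuracy.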

In [None]:
model_comparison = {}
# Display names, listed in the same order as `models`
names = ["Naive Bayes", "Decision Tree", "Random Forest", "Logistic Regression", "KNN", "XGBoost"]
models = [classifier, dt, Rclf, lr, knn, xgb]
results = {}

# Make Predictions (keyed by the estimator's class name, e.g. "GaussianNB")
for model in models:
    results[type(model).__name__] = [model.predict(X_test)]

In [None]:
# Find the scores of the metrics
for model, preds in results.items():
    y_pred = np.ravel(preds)  # flatten the stored predictions to a 1-D array
    model_comparison[model] = [
        round(accuracy_score(y_test, y_pred), 2),
        # F1 is averaged over both classes, weighted by class frequency
        round(f1_score(y_test, y_pred, average='weighted'), 2),
        # Precision and recall are reported for the fraud class (label 1)
        round(precision_score(y_test, y_pred), 2),
        round(recall_score(y_test, y_pred), 2),
    ]

In [None]:
results_df = pd.DataFrame(model_comparison, index=["Accuracy", "F-1 Score", "Precision Score", "Recall Score"])
results_df.style.format("{:.2%}").background_gradient(cmap='Blues')

In [None]:
# Cross Validation Scores
classifier_cr = cross_val_score(classifier, credit_df_copy.drop("Class", axis=1), credit_df_copy["Class"], cv=5).mean()
dt_cr = cross_val_score(dt, credit_df_copy.drop("Class", axis=1), credit_df_copy["Class"], cv=5).mean()
Rclf_cr = cross_val_score(Rclf, credit_df_copy.drop("Class", axis=1), credit_df_copy["Class"], cv=5).mean()
lr_cr = cross_val_score(lr, credit_df_copy.drop("Class", axis=1), credit_df_copy["Class"], cv=5).mean()
knn_cr = cross_val_score(knn, credit_df_copy.drop("Class", axis=1), credit_df_copy["Class"], cv=5).mean()
xgb_cr = cross_val_score(xgb, credit_df_copy.drop("Class", axis=1), credit_df_copy["Class"], cv=5).mean()

In [None]:
# Cross Validation Scores in a Plot
cross_validated_scores = [classifier_cr, dt_cr, Rclf_cr, lr_cr, knn_cr, xgb_cr]
cross_validated_scores = pd.DataFrame(cross_validated_scores, index=["GaussianNB", 
                                            "DecisionTreeClassifier", 
                                            "RandomForestClassifier",
                                            "LogisticRegression",
                                            "KNeighborsClassifier", 
                                            "XGBClassifier"])
cross_validated_scores.rename(columns={0 : "Score"}, inplace=True)
cross_validated_scores.plot(kind="bar", figsize=(10, 5), color=["salmon"])
plt.xticks(rotation=45)
plt.show()
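A note on the scoring used for cross-validation: by default `cross_val_score` reports accuracy for classifiers, which is high for almost any model on data this imbalanced. The sketch below shows how the `scoring` parameter and stratified folds could be used to get an AUPRC-style score instead. It runs on synthetic data from `make_classification`, and `X_demo`, `y_demo` and `demo_model` are placeholder names, not objects from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with roughly 2% positives (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=5_000, n_features=20, weights=[0.98, 0.02], random_state=42
)
demo_model = LogisticRegression(max_iter=1000)

# Stratified folds keep the fraud ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

accuracy_scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="accuracy")
auprc_scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="average_precision")

print(f"Mean accuracy: {accuracy_scores.mean():.3f}")
print(f"Mean average precision: {auprc_scores.mean():.3f}")
```

Reading the two numbers side by side makes it easier to tell whether a model is really finding the rare class or just agreeing with the majority.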

# End Notes
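And there we go, our models work amazingly well! The scores are dope. KNN and XGBClassifier did really well, giving us a 100% accuracy, but I would be skeptical of this: SMOTE was applied before the train/test split, so the test set contains synthetic frauds interpolated from the same points the models trained on, which tends to inflate the scores. The cross-validation on the original, unbalanced data is a useful second check, although it reports accuracy, which is high for almost any model on data this imbalanced.

KNN seems to be the BEST algorithm for this problem!

For now, my friends, that's it! And be sure to give this repo a star and follow me [@MuhammadAnas707](https://twitter.com/MuhammadAnas707)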