The project I'm working on focuses on developing a credit card fraud detection system. I've utilized logistic regression, XGBoost, and random forest algorithms for this purpose. The kernel is still a work in progress, and I aim to refine it further based on ongoing research and analysis.

Although the dataset features are scaled and anonymized for privacy reasons, there are still valuable insights to be extracted from analyzing the data. Let's dive into the exploration!

# 1. Importing necessary Libraries

I will use visualization libraries such as matplotlib, seaborn, and plotly for data exploration. I then plan to train Random Forest, Logistic Regression, and XGBoost models and compare them to find the one with the highest accuracy for fraud detection.

In [None]:
#Importing Libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn 
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# 2. Importing Data and performing initial analysis


In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv("../input/creditcard_2023.csv")

In [None]:
data.head()

In [None]:
data.shape


The shape shows that the dataset is large, so models such as Random Forest (RF), Logistic Regression (LR), and XGBoost (XGB) have plenty of examples to learn from.

Now let's check for the null values.

In [None]:
data.isnull().sum()

# 3. Data Visualization and Exploration

Let's check the number of fraud vs. non-fraud transactions in the `Class` column using a histogram.

In [None]:
#Checking Credit card fraud Class
px.histogram(data_frame = data, x='Class', color='Class')
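
The histogram gives a visual impression of the class balance; the exact counts and shares can also be printed directly. Below is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for `data`), since how much an accuracy score tells us depends on how balanced the two classes are.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (hypothetical values).
toy = pd.DataFrame({"Class": [0, 0, 0, 1, 1, 0, 1, 1]})

# Absolute counts and relative shares of each class.
counts = toy["Class"].value_counts().sort_index()
shares = toy["Class"].value_counts(normalize=True).sort_index()

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

Replacing `toy` with `data` should give the same summary for the real `Class` column.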

In [None]:
plt.figure(figsize = (20,10))
sns.heatmap(data.corr(), cmap = 'crest', annot = True)
plt.show()

Some of the features are highly correlated with each other. For example, V17 is highly correlated with both V16 and V18.
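
Reading individual pairs off a large annotated heatmap is error-prone, so it can help to rank the pairs numerically. The sketch below uses synthetic columns (`A`, `B`, `C` are made up for illustration) and keeps only the upper triangle of the absolute correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Synthetic stand-in for the feature columns (hypothetical names).
toy = pd.DataFrame({
    "A": base,
    "B": base + rng.normal(scale=0.1, size=500),  # strongly tied to A
    "C": rng.normal(size=500),                    # independent noise
})

corr = toy.corr().abs()

# Keep the strict upper triangle: drops the diagonal and mirrored duplicates.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs.head(10))
```

Applied to the notebook's frame, `toy` would be replaced by `data.drop(columns=['id', 'Class'])`.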

Let's check the skewness of our features:

In [None]:
data.skew()

**Positively Skewed Data:**

Features like V7, V5, V20, V27, and V28 exhibit positive skewness. This is indicated by their positive skewness coefficients in the `data.skew()` output (not by their means), with V7 being particularly notable at approximately 19.03.
Positive skewness suggests that the majority of data points in these features are concentrated on the lower end, with a long tail extending towards higher values.

**Negatively Skewed Data:**

None of the features in the provided data exhibit significant negative skewness. Negative skewness would be indicated by negative skewness coefficients, but all features have coefficients close to zero or positive.
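
Skewness describes the shape of a distribution rather than its location, so it can be useful to print the mean and the skewness coefficient side by side. The following is a minimal sketch on synthetic data (the column names and distributions are invented for illustration), showing that a column can have a positive mean and still be negatively skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic columns with known shapes (hypothetical, not from the dataset).
toy = pd.DataFrame({
    "right_tail": rng.exponential(scale=1.0, size=n) - 1.0,  # mean near 0, long right tail
    "left_tail": 5.0 - rng.exponential(scale=1.0, size=n),   # mean near 4, long left tail
    "symmetric": rng.normal(loc=3.0, size=n),                # mean near 3, no dominant tail
})

# The sign of `skew` follows the direction of the tail, not the sign of `mean`.
report = pd.DataFrame({"mean": toy.mean(), "skew": toy.skew()})
print(report.round(2))
```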

In [None]:
#visualize the distribution of the 'Amount' variable in the dataset using a KDE plot. 
sns.kdeplot(data = data['Amount'], fill=True)
plt.show()

In [None]:
# Histograms for four features; each plotted column matches its title.
fig, axes = plt.subplots(2, 2, figsize=(10,6))
data['V10'].plot(kind='hist', ax = axes[0,0], title = 'Distribution of V10')
data['V5'].plot(kind='hist', ax = axes[0,1], title = 'Distribution of V5')
data['V7'].plot(kind='hist', ax = axes[1,0], title = 'Distribution of V7')
data['V20'].plot(kind='hist', ax = axes[1,1], title = 'Distribution of V20')
plt.suptitle('Distribution of V10, V5, V7, V20')
plt.tight_layout()
plt.show()

In summary, the data exhibits positively skewed features, with some features displaying more pronounced skewness compared to others. There's no evidence of negatively skewed data, and certain features appear to have symmetric distributions.

# 4. Model Implementation

We will denote the dependent variable (the target) by `y` and the independent features by `X`.


In [None]:
y = data.Class
X = data.drop(['id', 'Class'], axis=1)

In [None]:
X.head()

In [None]:
y.head()

In [None]:
print(X.shape)
print(y.shape)

In [None]:
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

In [None]:
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled_df.head()

Now, let's import all the necessary models and metrics from scikit-learn.

In [None]:
import gc
from datetime import datetime 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier
from sklearn import svm
import lightgbm as lgb
from lightgbm import LGBMClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
#Splitting the dataset into training and testing.
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size = 0.2, random_state = 5, stratify = y)
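
The cells above fit `StandardScaler` on the full dataset before splitting, which lets test-set statistics influence the transformation. One common alternative is to split first and let a pipeline fit the scaler on the training portion only. This is a sketch under assumptions: the data comes from `make_classification`, and the `*_demo`, `X_tr`, `X_te` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (assumption: not the credit card data).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=5)

# Split the raw features first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses it when scoring X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Whether this changes the results noticeably on a large dataset is something to check rather than assume.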

## 1. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()

In [None]:
LR.fit(X_train, y_train)

In [None]:
lr_predictions_train = LR.predict(X_train)
lr_predictions_test = LR.predict(X_test)

In [None]:
asc = accuracy_score(y_train, lr_predictions_train)
cr = classification_report(y_train, lr_predictions_train)
print("Accuracy Score is:", asc)
print(cr)
cm = pd.crosstab(y_train, lr_predictions_train, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Training Confusion Matrix', fontsize=14)
plt.show()

In [None]:
asc = accuracy_score(y_test, lr_predictions_test)
cm = confusion_matrix(y_test, lr_predictions_test)
cr = classification_report(y_test, lr_predictions_test)
print("Accuracy Score is:", asc)
print(cm)
print(cr)

cm = pd.crosstab(y_test, lr_predictions_test, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Testing Confusion Matrix', fontsize=14)
plt.show()

The logistic regression model yielded a high accuracy of 97%.

Reviewing the confusion matrix, we notice generally accurate classifications with a few errors scattered in. Precision and recall metrics for both normal and fraudulent transactions are commendably high, resulting in an overall F1-score of 97%.
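
To make the link between the confusion matrix and the reported metrics explicit, here is a small worked example. The labels and predictions are invented for illustration and are not taken from the model above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = fraud, 0 = not fraud).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

These are the same quantities that `classification_report` prints for the positive class.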

## 2. Random Forest


In [None]:
RFC = RandomForestClassifier(n_jobs = 4,
                             random_state = 5,
                             n_estimators = 100,
                             max_depth = 5,
                             verbose = False)

In [None]:
#Training the Random Forest Classifier model using the training data.
RFC.fit(X_train, y_train)

In [None]:
rf_predictions_train = RFC.predict(X_train)
rf_predictions_test = RFC.predict(X_test)

In [None]:
asc = accuracy_score(y_train, rf_predictions_train)
cr = classification_report(y_train, rf_predictions_train)
print("Accuracy Score is:", asc)
print(cr)

# Plot training confusion matrix 
cm = pd.crosstab(y_train, rf_predictions_train, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Training Confusion Matrix', fontsize=14)
plt.show()

In [None]:
asc = accuracy_score(y_test, rf_predictions_test)
cr = classification_report(y_test, rf_predictions_test)
print("Accuracy Score is:", asc)
print(cr)

# Plot testing confusion matrix 
cm = pd.crosstab(y_test, rf_predictions_test, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Testing Confusion Matrix', fontsize=14)
plt.show()


The accuracy score obtained from the **Random Forest Classifier** is **0.96**.

## 3. XGBoost

In [None]:
# Prepare the train and test datasets
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

#Here the focus is on monitoring training and testing.
watchlist = [(dtrain, 'train'), (dtest, 'test')]

# Set xgboost parameters
params = {}
params['objective'] = 'binary:logistic'
params['eta'] = 0.039
params['verbosity'] = 0  # replaces the deprecated 'silent' flag
params['max_depth'] = 2
params['subsample'] = 0.8
params['colsample_bytree'] = 0.9
params['eval_metric'] = 'auc'
params['random_state'] = 5

In [None]:
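# Pass the round count and watchlist by keyword; recent XGBoost versions expect this.
model = xgb.train(params,
                dtrain,
                num_boost_round=1000,
                evals=watchlist,
                early_stopping_rounds=50,
                maximize=True,
                verbose_eval=50)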

In [None]:
xgb_preds = model.predict(dtest)

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Convert probabilities to binary predictions
threshold = 0.6  # The threshold can be adjusted if needed
binary_preds = np.where(xgb_preds > threshold, 1, 0)

# Calculate accuracy
asc = accuracy_score(y_test, binary_preds)
print("Accuracy Score is:", asc)

# Generate classification report
cr = classification_report(y_test, binary_preds)
print(cr)

# Compute confusion matrix
cm = confusion_matrix(y_test, binary_preds)
print("Confusion Matrix:")
print(cm)

# Plot confusion matrix
plt.figure(figsize=(5,5))
sns.heatmap(cm, 
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,
            linewidths=.2,
            linecolor="Darkblue",
            cmap="Blues")
plt.title('Test Confusion Matrix', fontsize=14)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.show()


XGBoost achieves the highest accuracy score of the three models: 0.99.
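
The 0.6 cut-off used above is one choice among many, and moving it trades precision against recall. The sketch below sweeps a few thresholds over synthetic scores; `scores` is a made-up stand-in for `xgb_preds`, and `y_demo` for `y_test`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(5)

# Hypothetical labels and probability-like scores; frauds tend to score higher.
y_demo = rng.integers(0, 2, size=2000)
scores = np.clip(0.3 + 0.4 * y_demo + rng.normal(scale=0.2, size=2000), 0.0, 1.0)

rows = []
for threshold in (0.3, 0.4, 0.5, 0.6, 0.7):
    preds = (scores > threshold).astype(int)
    rows.append((
        threshold,
        precision_score(y_demo, preds, zero_division=0),
        recall_score(y_demo, preds, zero_division=0),
    ))

for threshold, precision, recall in rows:
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Raising the threshold can only keep recall the same or lower it, so the right cut-off depends on how costly a missed fraud is relative to a false alarm.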

# Conclusion

Of the three models used on this dataset, XGBoost gives the highest accuracy.

Logistic Regression achieved an impressive accuracy of 97%, with generally accurate classifications observed in the confusion matrix.
Random Forest Classifier yielded an accuracy score of 96%, demonstrating robust performance in fraud detection.
XGBoost emerged as the most effective model with the highest accuracy score of 99%.


XGBoost proved to be the most effective model for fraud detection in this dataset, outperforming logistic regression and Random Forest.


The project demonstrates the potential of machine learning algorithms in detecting credit card fraud.
Further enhancements could involve refining the models, exploring additional algorithms, and optimizing hyperparameters to improve accuracy and efficiency.
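
As one possible direction for such refinements, models can be compared with stratified cross-validation rather than a single train/test split. This is a minimal sketch on synthetic data; the dataset, the `candidates` dictionary, and the ROC AUC scoring are illustrative assumptions, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the credit card dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=5)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, max_depth=5, random_state=5),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="roc_auc")
    results[name] = scores
    print(f"{name}: mean AUC {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The spread across folds gives a rough sense of how stable each model's score is, which a single split cannot show.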

Your thoughts and feedback on this project would be greatly appreciated!


