PROJECT TITLE

Credit Card Fraud Detection using Machine Learning

PROBLEM STATEMENT

The objective of this project is to build a machine learning model that can detect fraudulent credit card transactions. Since fraud cases are rare compared to normal transactions, the dataset is highly imbalanced. The goal is to develop a model that accurately identifies fraud cases while minimizing false negatives.

In [1]:
from google.colab import files
uploaded = files.upload()


Saving creditcard.csv.zip to creditcard.csv.zip


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score


In [3]:
import pandas as pd

# pandas infers zip compression from the .zip extension
df = pd.read_csv('creditcard.csv.zip')
df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
df['Class'].value_counts()


Class,count
0,284315
1,492
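
To put the imbalance in numbers, the counts above can be turned into a fraud rate and a class ratio. This is a small sketch that hard-codes the two counts from the output; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 284315, 1: 492}

total = sum(counts.values())
fraud_rate = counts[1] / total            # share of fraudulent transactions
imbalance_ratio = counts[0] / counts[1]   # legitimate transactions per fraud case

print(f"Total transactions: {total}")
print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Legitimate transactions per fraud case: {imbalance_ratio:.0f}")
```

With the DataFrame loaded, `df['Class'].value_counts(normalize=True)` gives the same proportions directly.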


In [6]:
X = df.drop('Class', axis=1)
y = df['Class']


In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)



In [9]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

y_pred_lr = lr.predict(X_test)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
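
The warning above means the solver reached max_iter before converging, and it suggests scaling the data. The following is a minimal sketch of that suggestion, not the cell the notebook ran, so the metrics reported further down do not come from it. It assumes a synthetic dataset from make_classification as a stand-in for creditcard.csv, and the names X_demo, y_demo, and scaled_lr are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): roughly 1% positives.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.99, 0.01], random_state=42
)
X_demo[:, 0] *= 10_000  # mimic an unscaled, large-valued column such as Time

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Standardize the features, then fit logistic regression. The scaler is
# fitted on the training split only, which avoids leaking test statistics.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

# n_iter_ holds the number of solver iterations actually used.
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print("Solver iterations used:", n_iter)
```

On the real data the same pipeline would be fitted with X_train and y_train; whether it changes the fraud-class recall would need to be checked by rerunning the evaluation.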


In [10]:
print(confusion_matrix(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))
print("ROC-AUC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:,1]))


[[56851    13]
 [   30    68]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.84      0.69      0.76        98

    accuracy                           1.00     56962
   macro avg       0.92      0.85      0.88     56962
weighted avg       1.00      1.00      1.00     56962

ROC-AUC: 0.9513502678786764


In [11]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
print("ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]))


[[56859     5]
 [   18    80]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.94      0.82      0.87        98

    accuracy                           1.00     56962
   macro avg       0.97      0.91      0.94     56962
weighted avg       1.00      1.00      1.00     56962

ROC-AUC: 0.9630272515590367
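
The fraud-class rows of both classification reports can be reproduced by hand from the two confusion matrices. The sketch below hard-codes the matrices printed above; fraud_metrics is a hypothetical helper written for illustration.

```python
def fraud_metrics(cm):
    """Return (precision, recall, f1) for the positive (fraud) class.

    cm follows scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # flagged transactions that were fraud
    recall = tp / (tp + fn)      # fraud cases that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrices copied from the outputs above.
lr_cm = [[56851, 13], [30, 68]]
rf_cm = [[56859, 5], [18, 80]]

for name, cm in [("Logistic Regression", lr_cm), ("Random Forest", rf_cm)]:
    p, r, f = fraud_metrics(cm)
    missed = cm[1][0]  # false negatives: fraud cases the model did not flag
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} missed={missed}")
```

The false-negative count is the figure most relevant to the stated goal: Logistic Regression missed 30 of 98 fraud cases, Random Forest missed 18.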


RESULTS

• Logistic Regression ROC-AUC ≈ 0.95
• Random Forest ROC-AUC ≈ 0.96

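Random Forest outperformed Logistic Regression on every fraud-class metric (precision 0.94 vs. 0.84, recall 0.82 vs. 0.69, F1 0.87 vs. 0.76) and missed 18 of the 98 test-set fraud cases compared with 30, so it was selected as the final model.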

Since the dataset is imbalanced, evaluation was done using:

• Confusion Matrix
• Precision
• Recall
• F1 Score
• ROC-AUC Score

Accuracy was not used as the primary metric due to class imbalance.
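
A short worked example shows why accuracy is misleading here. It scores a trivial model that labels every test transaction as legitimate, using the class supports from the classification reports (56864 legitimate, 98 fraud). The trivial model is hypothetical and was not part of the notebook.

```python
# Test-set class sizes, taken from the "support" column of the reports.
n_legit, n_fraud = 56864, 98

# A trivial model that predicts "legitimate" for every transaction.
tn, fp = n_legit, 0   # every legitimate transaction is classified correctly
fn, tp = n_fraud, 0   # every fraud case is missed

accuracy = (tp + tn) / (n_legit + n_fraud)
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.4f}")
print(f"Fraud recall: {recall:.2f}")
```

The trivial model scores above 99.8% accuracy while detecting no fraud at all, which is why the fraud-class precision, recall, and F1 carry the comparison.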

CONCLUSION

The Random Forest model achieved a ROC-AUC score of approximately 0.96, demonstrating strong capability in distinguishing between fraudulent and legitimate transactions.

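Without any resampling or class weighting, the model recovered 80 of the 98 fraud cases in the test set (recall 0.82) at a precision of 0.94. The remaining 18 false negatives leave room for improvement, given the goal of minimizing missed fraud. This project highlights the importance of using appropriate evaluation metrics such as Recall and ROC-AUC in fraud detection systems.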

The complete machine learning pipeline was implemented, including data loading, class-distribution inspection, a stratified train/test split, model training, evaluation, and comparison.
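
Since the problem statement prioritizes minimizing false negatives, one possible follow-up is to lower the decision threshold applied to the predicted fraud probability. The sketch below illustrates the idea on synthetic data from make_classification (an assumption, standing in for creditcard.csv). The 0.2 threshold is an arbitrary illustrative value, not one tuned or used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 3% positives.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
fraud_proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.5, 0.2):
    # Flag a transaction when its fraud probability reaches the threshold.
    y_hat = (fraud_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.2f} "
          f"precision={precision:.2f}")
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision. On real data the threshold would be chosen on a validation split rather than the test set.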