# Credit Card Fraud Detection Model
The credit card dataset is available at: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

## Context
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

## Content
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a Principal Component Analysis (PCA, https://en.wikipedia.org/wiki/Principal_component_analysis) transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for example, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
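
To make the recommendation concrete, here is a minimal pure-Python sketch of average precision, a common step-wise estimate of AUPRC. The function name `average_precision` and the toy labels and scores are illustrative assumptions; in practice scikit-learn's `average_precision_score` would normally be used instead.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Precision at this rank, weighted by the recall step 1 / n_pos.
            total += (true_pos / rank) / n_pos
    return total


# Toy, made-up example: 2 frauds among 10 transactions.
y_true = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.10, 0.20, 0.90, 0.30, 0.80, 0.15, 0.25, 0.70, 0.40, 0.05]

ap = average_precision(y_true, scores)

# A model that flags nothing is still 80% "accurate" on this toy data.
accuracy_all_legit = y_true.count(0) / len(y_true)

print(f"average precision: {ap:.3f}")
print(f"accuracy when predicting all legit: {accuracy_all_legit:.2f}")
```

Unlike accuracy, the average precision depends on how highly the frauds are ranked, which is why it is more informative on data this unbalanced.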

In [109]:
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [111]:
df = pd.read_csv('kaggle/creditcard.csv')

In [112]:
# After the PCA transformation, the columns cannot be interpreted without more context,
# which is desirable when working with real-life PII data

# *Time* - Number of seconds elapsed between this transaction and the first transaction in the dataset
# *V1 - V28* - PCA output variables
# *Amount* - Value of the transaction
# *Class* - 0 if normal transaction; 1 if Fraudulent transaction

df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [113]:
# Add a human-readable label for the dependent variable Class (vectorized instead of a row-wise apply)
df = df.assign(class_nm=np.where(df['Class'] == 0, 'legit', 'fraud'))

In [116]:
#This shows that the dataset is highly unbalanced, with 0.17% of frauds (Class = 1)
display(df['class_nm'].value_counts().reset_index())
display(df['class_nm'].value_counts(normalize=True).reset_index())

Unnamed: 0,class_nm,count
0,legit,284315
1,fraud,492


Unnamed: 0,class_nm,proportion
0,legit,0.998273
1,fraud,0.001727
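
The counts above imply that a model predicting "legit" for every transaction would already reach about 99.83% accuracy while catching no fraud at all. A small arithmetic check, using only the counts printed above (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output above.
n_legit = 284_315
n_fraud = 492
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total          # share of the positive class
baseline_accuracy = n_legit / n_total   # accuracy of always predicting "legit"
baseline_recall = 0 / n_fraud           # such a model finds no fraud at all

print(f"fraud rate:        {fraud_rate:.6f}")
print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"baseline recall:   {baseline_recall:.2f}")
```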


In [119]:
legit = df[df.Class==0]
fraud = df[df.Class==1]

In [121]:
# Most fraud amounts are very small: the first of the 300 bins (amounts below ~7.09) holds 230 of the 492 frauds
counts, bin_edges = np.histogram(fraud['Amount'],bins=300)
hist_df = pd.DataFrame({
    'Bin Start':bin_edges[:-1], 
    'Bin End':bin_edges[1:],
    'Count':counts
})

display(hist_df)

Unnamed: 0,Bin Start,Bin End,Count
0,0.000000,7.086233,230
1,7.086233,14.172467,27
2,14.172467,21.258700,14
3,21.258700,28.344933,5
4,28.344933,35.431167,12
...,...,...,...
295,2090.438833,2097.525067,0
296,2097.525067,2104.611300,0
297,2104.611300,2111.697533,0
298,2111.697533,2118.783767,0


In [None]:
# Sampling 50,000 legitimate transactions (out of 284,315) and keeping all 492 frauds
new_df = pd.concat([legit.sample(50000,random_state=2),fraud])
#new_df=df

# Drop the target variable (and its text label) from the inputs X, and keep only the target in y
X = new_df.drop(columns=['class_nm','Class'])
y = new_df['Class']

#Run the split, using stratify on y because of the highly imbalanced dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)
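
Stratification matters here because a plain random split of so few frauds could leave the test set with noticeably more or fewer frauds than expected. The sketch below is a simplified pure-Python illustration of the idea, not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 1,000 legit and 10 fraud (made-up sizes for illustration).
labels = [0] * 1000 + [1] * 10
train_idx, test_idx = stratified_split(labels)

test_frauds = sum(labels[i] for i in test_idx)
train_frauds = sum(labels[i] for i in train_idx)
print(f"test rows: {len(test_idx)}, test frauds: {test_frauds}, train frauds: {train_frauds}")
```

Each class contributes the same fraction of its rows to the test set, so the fraud rate is preserved on both sides of the split.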

In [432]:
from sklearn.preprocessing import StandardScaler

# Scale the features before training: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
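
The scaler is fitted on the training set and only applied to the test set, so that the mean and standard deviation are learned from training data alone. A minimal sketch of that logic for a single feature, using only the standard library (the helper names and the toy amounts are assumptions):

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]


# Toy transaction amounts (made up for illustration).
train_amounts = [10.0, 20.0, 30.0, 40.0]
test_amounts = [25.0, 100.0]

mean, std = fit_scaler(train_amounts)              # statistics come from train only
train_scaled = transform(train_amounts, mean, std)
test_scaled = transform(test_amounts, mean, std)   # reuse the train statistics

print(train_scaled)
print(test_scaled)
```

The scaled test values need not have zero mean or unit variance; they are simply expressed in the units learned from the training data.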

In [434]:
#train the model
model = LogisticRegression(max_iter=300)
model.fit(X_train_scaled,y_train)

In [442]:
# Predict on the test set. X_test must be the scaled version, because the model was trained on scaled inputs
y_pred = model.predict(X_test_scaled)

In [444]:
print("Logistic Regression Results:")
print(classification_report(y_test, y_pred))

Logistic Regression Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     10001
           1       0.96      0.71      0.82        98

    accuracy                           1.00     10099
   macro avg       0.98      0.86      0.91     10099
weighted avg       1.00      1.00      1.00     10099
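
The f1-score column is the harmonic mean of precision and recall. As a quick check of the fraud row above (a sketch: `f1_from` is an illustrative helper, and the inputs are the rounded values from the report, so the result is approximate):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded fraud-class values from the Logistic Regression report.
f1_fraud = f1_from(0.96, 0.71)
print(round(f1_fraud, 2))  # 0.82
```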



## Random Forest

In [446]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [468]:
# Sampling 150,000 legitimate transactions (out of 284,315) and keeping all 492 frauds
new_df = pd.concat([legit.sample(150000,random_state=2),fraud])
#new_df=df

# Drop the target variable (and its text label) from the inputs X, and keep only the target in y
X = new_df.drop(columns=['class_nm','Class'])
y = new_df['Class']

#Run the split, using stratify on y because of the highly imbalanced dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)

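# Apply SMOTE to the training data only. Note: the Random Forest below is fitted on the original
# X_train/y_train; the resampled set is used later, in the XGBoost section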
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
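
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a deliberately simplified pure-Python sketch of that idea, not the `imblearn` implementation; `smote_like`, its parameters, and the toy points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy 2-D fraud points (made up for illustration).
fraud_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(fraud_points, n_new=6)

for point in new_points:
    print(point)
```

Because the synthetic points are built only from training rows, resampling is applied after the split, so the test set keeps the real class distribution.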

In [472]:
# Step: Train the Random Forest classifier on the original (non-resampled) training split; tree models do not need feature scaling
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)


# Step: Predict on the test set
y_pred_rf = rf_model.predict(X_test)

# Step: Evaluate the Random Forest model
print("Random Forest Results:")
print(classification_report(y_test, y_pred_rf))

Random Forest Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     30001
           1       0.95      0.81      0.87        98

    accuracy                           1.00     30099
   macro avg       0.98      0.90      0.94     30099
weighted avg       1.00      1.00      1.00     30099



## XGBoost

In [476]:
# Use the full population (no sampling)
new_df=df

# Drop the target variable (and its text label) from the inputs X, and keep only the target in y
X = new_df.drop(columns=['class_nm','Class'])
y = new_df['Class']

#Run the split, using stratify on y because of the highly imbalanced dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)

In [480]:
import xgboost as xgb
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Train XGBoost on the SMOTE-resampled training set created in the Random Forest section
# (built from the 150,000-row sample), not on this cell's X_train, so rows in X_test may
# also appear in the training data.
# SMOTE balances the two classes, so scale_pos_weight evaluates to 1.0 here.
xgb_model = xgb.XGBClassifier(n_estimators=400,
                              scale_pos_weight=(y_train_resampled == 0).sum() / (y_train_resampled == 1).sum(),
                              learning_rate=0.3,
                              random_state=42,
                              max_depth=4)

xgb_model.fit(X_train_resampled, y_train_resampled)

# Predict and evaluate the model on the full-population test set
y_pred = xgb_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.79      0.97      0.87        98

    accuracy                           1.00     56962
   macro avg       0.90      0.98      0.94     56962
weighted avg       1.00      1.00      1.00     56962
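
`scale_pos_weight` is commonly set to the ratio of negative to positive training examples. The arithmetic below is a sketch that derives the training counts of the full-population 80/20 split from the totals and test supports reported in this notebook; it also shows why the ratio becomes 1.0 once SMOTE, with its default settings, has balanced the classes. The variable names are illustrative.

```python
# Totals from the dataset description and test supports from the report above.
total_legit, total_fraud = 284_315, 492
test_legit, test_fraud = 56_864, 98

train_legit = total_legit - test_legit   # legit rows left for training
train_fraud = total_fraud - test_fraud   # fraud rows left for training

# Ratio on the original (imbalanced) training labels.
ratio_original = train_legit / train_fraud

# After SMOTE with default settings the minority class is oversampled
# until both classes have the same size, so the ratio collapses to 1.0.
ratio_after_smote = train_legit / train_legit

print(f"original ratio:    {ratio_original:.1f}")
print(f"ratio after SMOTE: {ratio_after_smote:.1f}")
```

A weight near 577 would therefore apply only when training on the original labels; with resampled data the parameter has no reweighting effect.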



Logistic Regression (sample of 50,000 legit transactions plus all frauds):
Highest precision (0.96), meaning the fewest false positives (legit transactions incorrectly flagged as fraud).
Lowest recall (0.71), so it missed more fraud cases than the other models.
F1-score of 0.82, the weakest balance between precision and recall.

Random Forest (sample of 150,000 legit transactions plus all frauds):
Precision (0.95) and recall (0.81) are well balanced.
F1-score of 0.87, reflecting a good balance between fraud detection and avoiding false positives.

XGBoost (test split drawn from the full population of 284,807; max_depth=4, n_estimators=400, learning_rate=0.3):
Precision dropped to 0.79, below Random Forest and Logistic Regression, meaning more false positives.
Highest recall (0.97), meaning it caught more fraud cases than the other models.
F1-score of 0.87, tied with Random Forest for the best balance between precision and recall.

Conclusion: Random Forest and XGBoost have the same F1-score, but XGBoost has much higher recall (TP/(TP+FN), i.e. it detected most of the real frauds), so XGBoost is the model of choice for this problem.
Note that only XGBoost was evaluated on the full population: for Logistic Regression and Random Forest, increasing the sample to the full population gave worse results.
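
The trade-off discussed above follows directly from the definitions of precision and recall. A small sketch with hypothetical confusion-matrix counts (the counts are made up for illustration and are not taken from the models above):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical model A: cautious, few false alarms but more missed frauds.
p_a, r_a, f1_a = precision_recall_f1(tp=80, fp=4, fn=20)
# Hypothetical model B: aggressive, more false alarms but few missed frauds.
p_b, r_b, f1_b = precision_recall_f1(tp=96, fp=24, fn=4)

for name, p, r, f in [("A", p_a, r_a, f1_a), ("B", p_b, r_b, f1_b)]:
    print(f"model {name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Two models can have nearly the same F1 while differing sharply in how many frauds they miss, which is why recall is used as the tie-breaker here.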