# Data 695 Project Work

## Introduction:
Credit card fraud remains a major concern in the financial sector, causing substantial losses for both card issuers and consumers. 
Over $34 billion is lost annually to credit card fraud, a number that is expected to rise as digital transactions continue to grow.

## Part 2 - Machine Learning Models

In [4]:
# Importing the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    mean_squared_error,
    f1_score,
    confusion_matrix,
    precision_score
)

In [5]:
# Loading the Cleaned Credit card transaction data
df_cleaned = pd.read_csv('cleaned_fraud_dataset.csv')
df_cleaned.head()

| Unnamed: 0 | timestamp | amount | transaction_type | merchant_category | location | device_used | is_fraud | spending_deviation_score | velocity_score | geo_anomaly_score | payment_channel |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-12-18 | 28.44 | transfer | retail | Singapore | web | False | 0.23 | 11 | 0.93 | wire_transfer |
| 1 | 2023-02-06 | 64.88 | payment | retail | Toronto | pos | False | 0.44 | 4 | 0.4 | wire_transfer |
| 2 | 2023-07-26 | 5.68 | transfer | online | Dubai | web | True | 0.28 | 18 | 0.09 | wire_transfer |
| 3 | 2023-04-27 | 11.97 | transfer | utilities | Toronto | atm | True | -1.31 | 1 | 0.63 | wire_transfer |
| 4 | 2023-03-14 | 191.39 | withdrawal | retail | Tokyo | pos | False | 1.1 | 12 | 0.16 | UPI |


In [6]:
# Splitting the data into 80% train and 20% test


# Defining X, the predictor variables
X = df_cleaned.drop(columns=['is_fraud'])


# Defining y, the target variable
y = df_cleaned["is_fraud"]


# Splitting into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)


# Calculating the proportion of fraudulent and genuine transactions in the training set
count_fraud = (y_train == 1).sum()
count_genuine = (y_train == 0).sum()


prop_fraud = count_fraud / len(y_train)
prop_genuine = count_genuine / len(y_train)

print('Proportion of Fraudulent Transactions:', round(prop_fraud, 2))
print('Proportion of Genuine Transactions:', round(prop_genuine, 2))


X_train.shape


Proportion of Fraudulent Transactions: 0.33
Proportion of Genuine Transactions: 0.67


(430927, 10)

In [7]:
# Preparing the X features (standardization itself happens in the modelling cells)

# The continuous variables are identified in the next cell

X = df_cleaned.drop(columns=['is_fraud', 'timestamp', 'location'])  
X = pd.get_dummies(X, drop_first=True)  

# Defining target variable
y = df_cleaned['is_fraud'].astype(int)  

# Splitting into training and testing sets (80% train, 20% test), maintaining class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Counting fraud and genuine transactions in training data
count_fraud = (y_train == 1).sum()
count_genuine = (y_train == 0).sum()

# Computing proportions
prop_fraud = count_fraud / len(y_train)
prop_genuine = count_genuine / len(y_train)

# Displaying results
print('Proportion of Fraudulent Transactions:', round(prop_fraud, 4))
print('Proportion of Genuine Transactions:', round(prop_genuine, 4))
print("Shape of X_train:", X_train.shape)

Proportion of Fraudulent Transactions: 0.3333
Proportion of Genuine Transactions: 0.6667
Shape of X_train: (430927, 20)


In [8]:
X_train_continuous = X_train.select_dtypes(include=['int64', 'float64'])

# Printing the row indices of the training subset
print("Row indices of continuous features:", X_train_continuous.index)

# Printing the names of the continuous feature columns
print("Continuous feature columns:", X_train_continuous.columns.tolist())


Row indices of continuous features: Index([345412, 480928, 366564, 214938, 100648, 500675, 218194, 230279,  40532,
       178084,
       ...
       348169, 222722, 128105, 493959, 327261, 200189,  65637, 334343,  91688,
       342457],
      dtype='int64', length=430927)
Continuous feature columns: ['amount', 'spending_deviation_score', 'velocity_score', 'geo_anomaly_score']



# Logistic Regression Model
Logistic Regression is a statistical model used for binary classification problems. It estimates the probability that a given input belongs to a particular category by applying the logistic (sigmoid) function to a linear combination of the input features. Logistic regression performs well when the log-odds of the target are approximately linear in the features. Its strengths are its interpretability, its speed, and its effectiveness when the classes are close to linearly separable.
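
To make the description above concrete, here is a minimal sketch of how a fitted logistic regression turns feature values into a fraud probability. The weights, bias, and feature values are hypothetical numbers chosen for illustration; they are not coefficients from the model trained in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1), in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fraud_probability(features, weights, bias):
    """Apply the sigmoid to a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two standardized features (illustration only)
demo_weights = [0.8, -0.5]
demo_bias = -0.7

demo_p = fraud_probability([1.2, 0.3], demo_weights, demo_bias)
demo_label = int(demo_p >= 0.5)  # default 0.5 decision threshold

print(f"Predicted probability: {demo_p:.4f}, predicted class: {demo_label}")
```

The linear score here is 0.11, which the sigmoid maps to a probability just above 0.5, so the default threshold assigns class 1.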

In [10]:
# Defining X, the predictor variables
X = df_cleaned.drop(columns=['is_fraud'])

# Defining y, the target variable
y = df_cleaned["is_fraud"]

In [11]:
from sklearn.model_selection import cross_val_score

# Features and target variable
X = df_cleaned.drop(columns=['is_fraud', 'timestamp', 'location'])  
X = pd.get_dummies(X, drop_first=True)  
y = df_cleaned['is_fraud'].astype(int)  

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initializing the logistic regression model
model = LogisticRegression(max_iter=1000, class_weight='balanced', solver='liblinear')

# Cross-validating on the training set before the final fit
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=3, scoring='f1')
print(f"Logistic Regression - Cross-validated F1 scores: {cv_scores}")
print(f"Logistic Regression - Mean F1 score: {cv_scores.mean():.4f}")

# Training the model on the full training set
model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Evaluating the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Logistic Regression - Cross-validated F1 scores: [0.40069799 0.39058916 0.40081908]
Logistic Regression - Mean F1 score: 0.3974
Confusion Matrix:
[[35751 36070]
 [17836 18075]]

Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.50      0.57     71821
           1       0.33      0.50      0.40     35911

    accuracy                           0.50    107732
   macro avg       0.50      0.50      0.49    107732
weighted avg       0.56      0.50      0.51    107732
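
One possible refinement of the cross-validation step is sketched below. It wraps the scaler and the classifier in a scikit-learn pipeline, so the scaler is re-fit on the training portion of each fold instead of on the whole training set. The data comes from `make_classification` and is synthetic, standing in for the fraud dataset, so the scores it prints say nothing about the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly a 2:1 class ratio (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.67, 0.33], random_state=42
)

# The scaler is fit inside each fold, so held-out rows never influence the scaling
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight='balanced', solver='liblinear'),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='f1')

print(f"Pipeline cross-validated F1 scores: {pipe_scores}")
print(f"Pipeline mean F1 score: {pipe_scores.mean():.4f}")
```

With simple standardization on a large dataset the difference is usually small, but the pipeline keeps preprocessing and evaluation consistent across folds.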



In [12]:
from sklearn.metrics import confusion_matrix

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Ensuring the shape is valid 
if cm.shape == (2, 2):
    TN, FP, FN, TP = cm.ravel()

    # Computing precision and specificity 
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0

    print(f"Precision: {precision:.4f}")
    print(f"Specificity: {specificity:.4f}")
else:
    print("Confusion matrix shape is not (2,2) — check if it's a binary classification problem.")


Precision: 0.3338
Specificity: 0.4978


In [13]:
from sklearn.metrics import accuracy_score, f1_score

# Making sure predictions exist
if 'y_pred' in locals():
    # Computing Accuracy and F1 Score
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Print metrics
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")
else:
    print("Error: 'y_pred' is not defined. Please run model prediction first.")


Accuracy: 0.4996
F1 Score: 0.4014


In [15]:
from sklearn.metrics import mean_squared_error

# Calculating Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.4f}")


Mean Squared Error (MSE): 0.5004
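
For 0/1 labels, each squared error is 1 for a misclassified row and 0 otherwise, so the MSE of hard predictions is simply the error rate, that is, 1 minus the accuracy (here 0.5004 = 1 - 0.4996). The sketch below checks this identity on a small made-up set of labels; the values are illustrative and unrelated to the fraud data.

```python
# Made-up labels and predictions (illustration only)
demo_true = [0, 1, 1, 0, 1, 0, 0, 1]
demo_hat = [0, 0, 1, 1, 1, 0, 1, 1]

n = len(demo_true)

# Accuracy: share of positions where prediction and label agree
demo_accuracy = sum(t == p for t, p in zip(demo_true, demo_hat)) / n

# MSE: mean of squared differences, which are 0 or 1 for binary labels
demo_mse = sum((t - p) ** 2 for t, p in zip(demo_true, demo_hat)) / n

print(f"Accuracy: {demo_accuracy:.4f}")
print(f"MSE: {demo_mse:.4f}")
print(f"1 - Accuracy: {1 - demo_accuracy:.4f}")
```

Because of this identity, MSE adds no information beyond accuracy when it is computed from class labels rather than predicted probabilities.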


In [25]:
from sklearn.metrics import recall_score, roc_auc_score
# Recall 
logreg_recall = recall_score(y_test, y_pred)
print(f"Logistic Regression Recall: {logreg_recall:.4f}")

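# AUC-ROC (computed here from hard class labels, so it equals the balanced accuracy;
# scores such as model.predict_proba(X_test_scaled)[:, 1] would give a threshold-free AUC)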
logreg_auc = roc_auc_score(y_test, y_pred)
print(f"Logistic Regression AUC-ROC: {logreg_auc:.4f}")


Logistic Regression Recall: 0.5033
Logistic Regression AUC-ROC: 0.5006
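
When `roc_auc_score` receives hard 0/1 predictions, the ROC curve has a single interior point and the area reduces to the balanced accuracy, the mean of recall and specificity. The sketch below contrasts that with an AUC computed from predicted probabilities. It uses synthetic data from `make_classification` as a stand-in, so its numbers are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.67, 0.33], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)

hard_labels = clf.predict(Xd_test)               # 0/1 predictions
fraud_scores = clf.predict_proba(Xd_test)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(yd_test, hard_labels)
auc_from_scores = roc_auc_score(yd_test, fraud_scores)
balanced_acc = balanced_accuracy_score(yd_test, hard_labels)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {balanced_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

The probability-based AUC measures how well the model ranks transactions across all thresholds, which is usually the quantity of interest when the decision threshold is still to be tuned.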


In [27]:
print(f"Logistic Regression Precision: {precision:.4f}")
print(f"Logistic Regression Specificity: {specificity:.4f}")
print(f"Logistic Regression Accuracy: {accuracy:.4f}")
print(f"Logistic Regression F1 Score: {f1:.4f}")
print(f"Logistic Regression Mean Squared Error (MSE): {mse:.4f}")
print(f"Logistic Regression Recall: {logreg_recall:.4f}")
print(f"Logistic Regression AUC-ROC: {logreg_auc:.4f}")

Logistic Regression Precision: 0.3338
Logistic Regression Specificity: 0.4978
Logistic Regression Accuracy: 0.4996
Logistic Regression F1 Score: 0.4014
Logistic Regression Mean Squared Error (MSE): 0.5004
Logistic Regression Recall: 0.5033
Logistic Regression AUC-ROC: 0.5006


## Conclusion:

The Logistic Regression model achieved an overall accuracy of **0.4996** and an F1-score of **0.4014** on the cleaned credit card fraud dataset, with an MSE of **0.5004**.

The model was applied to a downsampled version of our original 5-million-row dataset, consisting of more than 500,000 transaction records. The dataset was intentionally downsampled to a 2:1 ratio of genuine to fraudulent transactions to make the model more sensitive to fraud.

Taken together, these metrics indicate that the model performs essentially no better than random guessing (AUC-ROC of 0.5006) and does not capture the patterns that distinguish fraud in this dataset. Only about one-third of the transactions it flags as fraudulent are actually fraudulent, which matches the 33% share of fraud in the data, and it wrongly flags about half of the genuine transactions while catching only about half of the fraudulent ones.
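
The metrics discussed here can be re-derived by hand from the confusion matrix reported for the logistic regression model. The short check below uses those four counts and compares precision with the share of fraud in the test set; lower-case names are used to avoid overwriting the notebook's own variables.

```python
# Counts taken from the logistic regression confusion matrix reported above
tn, fp, fn, tp = 35751, 36070, 17836, 18075
total = tn + fp + fn + tp

check_precision = tp / (tp + fp)    # share of flagged transactions that are fraud
check_recall = tp / (tp + fn)       # share of fraud that is caught
check_specificity = tn / (tn + fp)  # share of genuine transactions left unflagged
check_accuracy = (tp + tn) / total
check_balanced = (check_recall + check_specificity) / 2
fraud_base_rate = (tp + fn) / total  # share of fraud in the test set

print(f"Precision:         {check_precision:.4f}")
print(f"Recall:            {check_recall:.4f}")
print(f"Specificity:       {check_specificity:.4f}")
print(f"Accuracy:          {check_accuracy:.4f}")
print(f"Balanced accuracy: {check_balanced:.4f}")
print(f"Fraud base rate:   {fraud_base_rate:.4f}")
```

Precision sitting almost exactly on the base rate is what flagging transactions at random would produce, which is consistent with recall and specificity both being close to 0.5.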

While it does provide a useful baseline, more reliable fraud detection will likely require more advanced models such as Random Forests, CatBoost, or Neural Networks, along with techniques like SMOTE. This outcome is therefore a foundational benchmark, not a production-ready solution.

# Support Vector Machine (SVM) Model

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks. It works by finding the optimal hyperplane that maximally separates data points of different classes, thereby improving generalization. SVM is especially effective in high-dimensional spaces and is robust against overfitting, particularly when there are clear class boundaries.
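
As a small illustration of the separating hyperplane and margin described above, the sketch below fits a linear SVM to six hand-picked, linearly separable points and reads the margin width off the learned weight vector as 2 / ||w||. The points and the large `C` value (which approximates a hard margin) are assumptions chosen for the example; the model trained in this notebook uses an RBF kernel, for which `coef_` is not available.

```python
import numpy as np
from sklearn.svm import SVC

# Six hand-picked points forming two well-separated classes (illustration only)
toy_X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
toy_y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data
toy_svm = SVC(kernel='linear', C=1e3)
toy_svm.fit(toy_X, toy_y)

w = toy_svm.coef_[0]                    # normal vector of the separating hyperplane
b = toy_svm.intercept_[0]               # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries
toy_pred = toy_svm.predict(toy_X)

print("Hyperplane normal w:", w)
print("Intercept b:", b)
print(f"Margin width: {margin_width:.4f}")
print("Support vectors:\n", toy_svm.support_vectors_)
```

Only the support vectors determine the hyperplane; moving any of the other points without crossing the margin leaves the fitted boundary unchanged.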

In [104]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
import pandas as pd

# Preparing X and y
X = df_cleaned.drop(columns=['is_fraud', 'timestamp', 'location'])  
X = pd.get_dummies(X, drop_first=True)  
y = df_cleaned['is_fraud'].astype(int)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Initialize and train SVM with class_weight for imbalance
svm_model = SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    class_weight='balanced',  
    random_state=42
)
svm_model.fit(X_train_s, y_train)

# Predict and evaluate
y_pred_svm = svm_model.predict(X_test_s)

print("SVM Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))

print("\nSVM Classification Report:")
print(classification_report(y_test, y_pred_svm))

print(f"Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_svm):.4f}")


SVM Confusion Matrix:
[[35265 36556]
 [18030 17881]]

SVM Classification Report:
              precision    recall  f1-score   support

           0       0.66      0.49      0.56     71821
           1       0.33      0.50      0.40     35911

    accuracy                           0.49    107732
   macro avg       0.50      0.49      0.48    107732
weighted avg       0.55      0.49      0.51    107732

Accuracy: 0.4933
F1 Score: 0.3958


In [None]:
from sklearn.metrics import mean_squared_error

# Calculating MSE
mse = mean_squared_error(y_test, y_pred_svm)
print(f"Mean Squared Error (MSE): {mse:.4f}")


In [None]:
# Recall
svm_recall = recall_score(y_test, y_pred_svm)
print(f"SVM Recall: {svm_recall:.4f}")

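# AUC-ROC (computed here from hard class labels, so it equals the balanced accuracy;
# scores such as svm_model.decision_function(X_test_s) would give a threshold-free AUC)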
svm_auc = roc_auc_score(y_test, y_pred_svm)
print(f"SVM AUC-ROC: {svm_auc:.4f}")


In [None]:
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_svm):.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"SVM Recall: {svm_recall:.4f}")
print(f"SVM AUC-ROC: {svm_auc:.4f}")

## Conclusion:

The Support Vector Machine (SVM) model with an RBF kernel was applied to the cleaned and downsampled dataset of more than 500,000 transactions, with a 2:1 ratio of genuine to fraudulent transactions. Although it is a non-linear model, the SVM performed very similarly to Logistic Regression: it struggled with precision (many false positives) and caught only about half of the fraud.

Our SVM results suggest that, despite SVM's theoretical strength for nonlinear classification, it is not good enough for production. It does, however, help confirm that a simple model trained on an imbalanced dataset, even with a non-linear kernel, cannot detect fraud reliably.

In conclusion, while SVM provides a useful benchmark, it is not sufficient on its own. Future efforts should prioritize more scalable and fraud-aware models such as Random Forests or CatBoost, potentially combined with SMOTE, which might improve both precision and recall in fraud detection.
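
For readers unfamiliar with SMOTE, the sketch below illustrates its core idea: each synthetic minority sample is placed at a random position on the line segment between a minority point and one of its nearest minority neighbours. This is a simplified illustration written for this notebook, not the implementation in the imbalanced-learn package; the function name `smote_like_oversample` and the random data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_minority) - 1)

    # k + 1 neighbours because each point is normally returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)            # starting points
    picked = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                                   # position along the segment

    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Random stand-in for the minority (fraud) class with two numeric features
minority = np.random.default_rng(0).normal(size=(30, 2))
synthetic = smote_like_oversample(minority, n_new=20)

print("Synthetic sample shape:", synthetic.shape)
```

Oversampling of this kind is normally applied to the training split only, after the train-test split, so that the test set keeps the original class distribution.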