# Predicting Credit Card Fraud using a Scikit-Learn Logistic Regression Model, with Synthetic Training & Test Data

Completed for my Codecademy Data Scientist (Machine Learning Specialist) certification course.

The data is synthetically generated and comes from the Kaggle Synthetic Financial Dataset (https://www.kaggle.com/datasets/ealaxi/paysim1).

Robert Hall

10/10/2024.


## Table of Contents

* Basic Exploratory Data Analysis (EDA)
* Feature Generation for Logistic Regression Model
* Generating Training and Test Data
* Evaluating Accuracy Scores for Model on Training and Test Data
* Examine Features with the Highest Impact on Predicted Outcome
* Self-Generated New Test Data to Re-Test Accuracy of Model

## Basic Exploratory Data Analysis (EDA)

In [42]:
import pandas as pd
transactions = pd.read_csv('transactions.csv')

#### Features
source: (https://www.kaggle.com/datasets/ealaxi/paysim1)

**step**: A unit of time in the real world. In this dataset, 1 step is 1 hour.

**type**: categorical [CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER]

**amount**: amount of the transaction in local currency

**nameOrig**: customer who started the transaction

**oldbalanceOrg**: customer's balance before the transaction

**newbalanceOrig**: customer's balance after the transaction.

**nameDest**: recipient ID of the transaction.

**oldbalanceDest**: initial recipient balance before the transaction.

**newbalanceDest**: recipient's balance after the transaction.

**isFraud**: identifies a fraudulent transaction (1) or a non-fraudulent transaction (0)

In [43]:
transactions.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isPayment,isMovement,accountDiff
0,206,CASH_OUT,62927.08,C473782114,0.0,0.0,C2096898696,649420.67,712347.75,0,0,1,649420.67
1,380,PAYMENT,32851.57,C1915112886,0.0,0.0,M916879292,0.0,0.0,0,1,0,0.0
2,570,CASH_OUT,1131750.38,C1396198422,1131750.38,0.0,C1612235515,313070.53,1444820.92,1,0,1,818679.85
3,184,CASH_OUT,60519.74,C982551468,60519.74,0.0,C1378644910,54295.32,182654.5,1,0,1,6224.42
4,162,CASH_IN,46716.01,C1759889425,7668050.6,7714766.61,C2059152908,2125468.75,2078752.75,0,0,0,5542581.85


In [44]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            1000 non-null   int64  
 1   type            1000 non-null   object 
 2   amount          1000 non-null   float64
 3   nameOrig        1000 non-null   object 
 4   oldbalanceOrg   1000 non-null   float64
 5   newbalanceOrig  1000 non-null   float64
 6   nameDest        1000 non-null   object 
 7   oldbalanceDest  1000 non-null   float64
 8   newbalanceDest  1000 non-null   float64
 9   isFraud         1000 non-null   int64  
 10  isPayment       1000 non-null   int64  
 11  isMovement      1000 non-null   int64  
 12  accountDiff     1000 non-null   float64
dtypes: float64(6), int64(4), object(3)
memory usage: 101.7+ KB


In [45]:
# how many fraudulent and non-fraudulent transactions?
transactions['isFraud'].value_counts()

isFraud
0    718
1    282
Name: count, dtype: int64

Of the 1,000 transactions, 282 (28.2%) are labeled fraudulent and 718 (71.8%) are not, so the classes are imbalanced.
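
The class counts above also set a baseline for judging the model later. As a minimal sketch, using only the two counts printed by `value_counts()`, the fraud rate and the accuracy of a trivial classifier that always predicts "not fraudulent" work out as follows (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
counts = {0: 718, 1: 282}
total = sum(counts.values())

# Share of transactions labeled fraudulent
fraud_rate = counts[1] / total

# Accuracy of always predicting the majority class (not fraudulent)
majority_baseline = max(counts.values()) / total

print(f"Fraud rate: {fraud_rate:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

A model whose accuracy falls below this baseline is doing worse than ignoring the features entirely.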

## Feature Generation for Logistic Regression Model

#### isPayment: 

1 if transaction type is 'PAYMENT' or 'DEBIT'

0 if not

In [46]:
transactions['isPayment'] = 0
transactions.loc[transactions['type'].isin(['PAYMENT', 'DEBIT']), 'isPayment'] = 1

#### isMovement: 

1 if transaction type is 'CASH_OUT' or 'TRANSFER'

0 if not

In [47]:
transactions['isMovement'] = 0
transactions.loc[transactions['type'].isin(['CASH_OUT', 'TRANSFER']), 'isMovement'] = 1

#### accountDiff: 

The absolute difference between oldbalanceOrg and oldbalanceDest

In [48]:
transactions['accountDiff'] = abs(transactions['oldbalanceDest'] - transactions['oldbalanceOrg'])
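
As a quick sanity check, the three derived features can be recomputed by hand for rows of the `head()` output shown earlier. The sketch below is plain Python with a hypothetical helper, `derive_features`, that mirrors the pandas logic above; it is not part of the original notebook, and the rounding to two decimals is added only to make the comparison with the printed values easy.

```python
def derive_features(tx_type, old_balance_org, old_balance_dest):
    """Return (isPayment, isMovement, accountDiff) for a single transaction."""
    is_payment = 1 if tx_type in ("PAYMENT", "DEBIT") else 0
    is_movement = 1 if tx_type in ("CASH_OUT", "TRANSFER") else 0
    # Rounded to cents for display; the pandas version does not round
    account_diff = round(abs(old_balance_dest - old_balance_org), 2)
    return is_payment, is_movement, account_diff

# Row 2 of head(): CASH_OUT, oldbalanceOrg=1131750.38, oldbalanceDest=313070.53
row2 = derive_features("CASH_OUT", 1131750.38, 313070.53)

# Row 1 of head(): PAYMENT with both balances at 0.0
row1 = derive_features("PAYMENT", 0.0, 0.0)

print(row2)
print(row1)
```

Both results should agree with the `isPayment`, `isMovement`, and `accountDiff` values shown for those rows.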

## Generating Training and Test Data

#### Features (inputs):

- **amount**: amount of the transaction in local currency

- **isPayment**: 1 if transaction type is 'PAYMENT' or 'DEBIT'; 0 if not

- **isMovement**: 1 if transaction type is 'CASH_OUT' or 'TRANSFER'; 0 if not

- **accountDiff**: The absolute difference between oldbalanceOrg and oldbalanceDest

#### Labels (result):

- **isFraud**: 1 if transaction is fraudulent; 0 if not

In [49]:
# select the feature columns and the label column from 'transactions'

features = transactions[['amount','isPayment','isMovement','accountDiff']]
labels = transactions['isFraud']

In [50]:
# split data into test and training data
# 30% test data
# 70% training data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, 
                                                    labels, 
                                                    test_size=0.3, 
                                                    train_size=0.7)
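
The call above draws a new random split on each run, so the scores reported later may differ between runs. `train_test_split` also accepts `random_state` (to fix the shuffle) and `stratify` (to keep the class ratio the same in both splits). The plain-Python sketch below illustrates what stratification does, using stand-in labels with this dataset's class counts; `stratified_split` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import random

def stratified_split(labels, test_fraction, seed):
    """Split row indices so each class sends test_fraction of its rows to the test set."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        n_test = round(len(class_indices) * test_fraction)
        test_idx.extend(class_indices[:n_test])
        train_idx.extend(class_indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Stand-in labels with the same class counts as the dataset: 718 non-fraud, 282 fraud
labels = [0] * 718 + [1] * 282
train_idx, test_idx = stratified_split(labels, test_fraction=0.3, seed=0)

# The fraud rate in the test split stays close to the overall 28.2%
test_fraud_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_fraud_rate, 3))
```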

In [51]:
# convert training and test datasets to numpy arrays
# (to resolve warnings about training model on data without feature names)
import numpy as np

x_train = np.array(x_train)
x_test = np.array(x_test)

y_train = np.array(y_train)
y_test = np.array(y_test)

In [52]:
# scale data using sklearn's StandardScaler()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train_sc = scaler.fit_transform(x_train)
x_test_sc = scaler.transform(x_test)
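
`StandardScaler` standardizes each feature to zero mean and unit variance using statistics learned from the training data only; the same training mean and standard deviation are then reused for the test data. A minimal plain-Python sketch of that arithmetic for one feature, using the five `amount` values from the `head()` output as a stand-in training column and a made-up test value:

```python
import statistics

# Stand-in training column: the five 'amount' values from head()
train_amounts = [62927.08, 32851.57, 1131750.38, 60519.74, 46716.01]
# Hypothetical unseen value, chosen only for illustration
test_amount = 100000.0

# "fit": learn the mean and population standard deviation from training data only
mean = statistics.fmean(train_amounts)
std = statistics.pstdev(train_amounts)

# "transform": apply the same training statistics to both splits
train_scaled = [(x - mean) / std for x in train_amounts]
test_scaled = (test_amount - mean) / std

print([round(z, 3) for z in train_scaled])
print(round(test_scaled, 3))
```

Fitting the scaler on the training split alone keeps information about the test data out of the model's inputs.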

In [53]:
# fit the model to the scaled training data
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train_sc, y_train)

## Evaluating Accuracy Scores for Model on Training and Test Data

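In [54]:
# Score the model on the scaled training data
# (the model was fit on x_train_sc, so it must be scored on scaled inputs)
train_score = model.score(x_train_sc, y_train)
print(f"Model Accuracy Score on Training Data: {round(train_score, 2)}")

# Score the model on the scaled test data
test_score = model.score(x_test_sc, y_test)
print(f"Model Accuracy Score on Test Data: {round(test_score, 2)}")

Model Accuracy Score on Training Data: 0.64
Model Accuracy Score on Test Data: 0.64

The 0.64 scores shown were recorded by scoring the unscaled arrays (x_train, x_test). Because the model was fit on standardized inputs, the scores obtained from x_train_sc and x_test_sc are the meaningful ones and will generally differ from these values.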
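
Accuracy alone can be misleading when the classes are imbalanced, so it can help to look at precision and recall for the fraud class as well. The sketch below computes all three by hand from a small set of made-up labels and predictions (they are not outputs of the model above); in scikit-learn, `sklearn.metrics` provides `confusion_matrix`, `precision_score`, and `recall_score` for the same purpose.

```python
# Made-up labels and predictions for illustration only (1 = fraud, 0 = not fraud)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as fraud
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate correctly passed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)     # of the fraudulent transactions, how many were flagged

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the accuracy looks reasonable even though two of the three fraudulent transactions are missed, which is the pattern recall is meant to expose.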


## Examine Features with the Highest Impact on Predicted Outcome

In [55]:
print(features.columns)
print(model.coef_)

Index(['amount', 'isPayment', 'isMovement', 'accountDiff'], dtype='object')
[[ 2.37795867 -0.61285316  2.12196728 -0.94341755]]


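The coefficient of "amount" (2.38) is the largest of the four features, followed closely by "isMovement" (2.12). Because the features were standardized before fitting, the coefficient magnitudes are directly comparable: larger transaction amounts and CASH_OUT or TRANSFER transactions are both associated with a higher predicted likelihood of fraud, while "isPayment" (-0.61) and "accountDiff" (-0.94) push the prediction toward non-fraudulent.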
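
Each logistic regression coefficient is a change in the log-odds of fraud per one standard deviation of its standardized feature, so exponentiating it gives an odds ratio. A short sketch using the coefficient values printed above (the dictionary and variable names are illustrative):

```python
import math

# Coefficients copied from the model.coef_ output above
coefficients = {
    "amount": 2.37795867,
    "isPayment": -0.61285316,
    "isMovement": 2.12196728,
    "accountDiff": -0.94341755,
}

# exp(coefficient) = multiplicative change in the odds of fraud
# for a one-standard-deviation increase in that feature
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

# Rank features by the absolute size of their coefficient
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)

for name in ranked:
    print(f"{name:12s} coef={coefficients[name]:+.3f} odds ratio={odds_ratios[name]:.2f}")
```

An odds ratio above 1 raises the odds of a fraud prediction and one below 1 lowers them.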

## Self-Generated New Test Data to Re-Test Accuracy of Model

In [56]:
# new test data; column order: amount, isPayment, isMovement, accountDiff
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])
transaction4 = np.array([1567820.0, 0.0, 1.0, 567820.0])

In [57]:
# combine new test transaction data into a single numpy array
sample_transactions = np.stack((transaction1, 
                                transaction2, 
                                transaction3, 
                                transaction4), axis=0)

# transform sample transaction array using scikit-learn StandardScaler()
sample_transactions = scaler.transform(sample_transactions)

In [58]:
# predict whether each transaction is fraudulent or not
print(model.predict(sample_transactions))

[0 0 0 1]


In [59]:
# predict class probabilities for each sample transaction
# column 0 is P(not fraudulent), column 1 is P(fraudulent); each row sums to 1
print(model.predict_proba(sample_transactions))

[[0.5986548  0.4013452 ]
 [0.99805152 0.00194848]
 [0.99638812 0.00361188]
 [0.13039948 0.86960052]]


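The model predicts that transactions 1-3 are not fraudulent and that transaction 4 is fraudulent. These outcomes are consistent with the model's coefficients (weights), which weight the transaction amount and the movement flag (isMovement) strongly positively: transaction 4 has the largest amount and is a movement transaction, whereas transactions 2 and 3 are payments and receive fraud probabilities below 1%. Transaction 1, a movement transaction with a smaller amount, is the closest call, with a fraud probability of about 0.40.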
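
The link between the two outputs above can be checked directly: for a binary logistic regression, `predict` returns class 1 when the fraud probability from `predict_proba` exceeds 0.5, which is the same as the log-odds of fraud being positive. A small sketch using the printed probabilities (variable names are illustrative):

```python
import math

# Rows copied from the predict_proba output: [P(not fraud), P(fraud)]
probabilities = [
    [0.5986548, 0.4013452],
    [0.99805152, 0.00194848],
    [0.99638812, 0.00361188],
    [0.13039948, 0.86960052],
]

# Each row sums to 1 (up to rounding in the printed values)
row_sums = [sum(row) for row in probabilities]

# Log-odds of fraud: positive means fraud is the more likely class
log_odds = [math.log(p_fraud / p_legit) for p_legit, p_fraud in probabilities]

# Thresholding the fraud probability at 0.5 reproduces the predict() output
predictions = [1 if p_fraud > 0.5 else 0 for _, p_fraud in probabilities]
print(predictions)  # [0, 0, 0, 1]
```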