# Fraud Prevention Using CC Transaction Data

The goal of this notebook is to show how standard classification algorithms can be used for fraud prevention on credit card data. Hopefully I will be able to show that, by using the data that Terex already has available, the fraud monitoring could be done in house and a major business expense could be removed.

I'm using a synthetic dataset that mimics the data that I think Terex has available. The real data will likely require quite a bit more cleaning than this data does, but I haven't included that portion as that is more of a data engineering problem than a machine learning (ML) problem and probably doesn't need as much proof of concept.

Finally, I will show a few different levels of algorithmic complexity, so that we can see whether it would be beneficial to run a highly complex system or whether simple logistic regression would cover the bases.

## Standard Statistical Solution

For this solution I will use straightforward logistic regression and see how it performs for fraud detection/prevention. While logistic regression is really just classic statistics, it is normally lumped into the conversation around machine learning, and therefore also AI, as the simplest classification algorithm.

I'm using classification algorithms since our desired output will be either fraud or not fraud; that is, we are looking for a binary prediction. Classification and regression tasks (such as pricing models or maintenance models) differ greatly, and therefore it shouldn't be assumed that the results from this proof of concept would transfer to a regression problem.

### Standard Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer



In [2]:
data = pd.read_csv('fraud.csv')

### EDA

**Data Dictionary:**

step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (about a month of simulated time; 744 hours is 31 days).

type - CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (merchants).

newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (merchants).

isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.
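Before modelling, it is worth checking how rare fraud is and which transaction types it shows up in. A minimal sketch on a tiny hand-made frame that only borrows the column names above (the values are made up for illustration and are not taken from `fraud.csv`):

```python
import pandas as pd

# Toy rows using the data-dictionary column names; the values are invented.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "CASH_OUT", "DEBIT"],
    "amount": [9839.64, 181.0, 181.0, 11668.14, 229133.94, 5337.77],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Share of fraudulent rows overall (the class imbalance).
fraud_rate = toy["isFraud"].mean()

# Number of fraudulent rows per transaction type.
fraud_by_type = toy.groupby("type")["isFraud"].sum()

print(f"fraud rate: {fraud_rate:.3f}")
print(fraud_by_type)
```

The same two lines can be run on the real `data` frame once it is loaded; a very low fraud rate is a signal that accuracy alone will be a misleading metric.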

In [3]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


A few of these columns probably wouldn't make sense for Terex's use case (such as the old and new origin balances, since all cards are probably connected to the same account). However, I'm going to keep all of the columns so as to maintain enough features for a valid solution. This solution could be tailored to the data that Terex has available.

In [4]:
# However, I will drop the rows that the existing rule already flagged as fraud,
# since those don't hold much value for our model
flagged_fraud = data[data['isFlaggedFraud'] == 1].index

data = data.drop(index=flagged_fraud)
data = data.drop(['isFlaggedFraud'], axis=1)
data

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0
...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1


In [5]:
data.duplicated(['nameOrig']).value_counts()

False    6353291
True        9313
dtype: int64

In [6]:
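# nameOrig is almost always unique (see the counts above), so the account
# identifiers are dropped rather than used as features
data = data.drop(['nameOrig', 'nameDest'], axis = 1)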
data

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,1,PAYMENT,9839.64,170136.00,160296.36,0.00,0.00,0
1,1,PAYMENT,1864.28,21249.00,19384.72,0.00,0.00,0
2,1,TRANSFER,181.00,181.00,0.00,0.00,0.00,1
3,1,CASH_OUT,181.00,181.00,0.00,21182.00,0.00,1
4,1,PAYMENT,11668.14,41554.00,29885.86,0.00,0.00,0
...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,339682.13,0.00,0.00,339682.13,1
6362616,743,TRANSFER,6311409.28,6311409.28,0.00,0.00,0.00,1
6362617,743,CASH_OUT,6311409.28,6311409.28,0.00,68488.84,6379898.11,1
6362618,743,TRANSFER,850002.52,850002.52,0.00,0.00,0.00,1


In [7]:
data[data['isFraud'] == 1]

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
2,1,TRANSFER,181.00,181.00,0.0,0.00,0.00,1
3,1,CASH_OUT,181.00,181.00,0.0,21182.00,0.00,1
251,1,TRANSFER,2806.00,2806.00,0.0,0.00,0.00,1
252,1,CASH_OUT,2806.00,2806.00,0.0,26202.00,0.00,1
680,1,TRANSFER,20128.00,20128.00,0.0,0.00,0.00,1
...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,339682.13,0.0,0.00,339682.13,1
6362616,743,TRANSFER,6311409.28,6311409.28,0.0,0.00,0.00,1
6362617,743,CASH_OUT,6311409.28,6311409.28,0.0,68488.84,6379898.11,1
6362618,743,TRANSFER,850002.52,850002.52,0.0,0.00,0.00,1


### Preprocessing

In [9]:
Y = data['isFraud']
X = data.drop(['isFraud'], axis = 1)

In [10]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
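Because fraud is so rare, a purely random split can leave the train and test sets with slightly different fraud rates, and an unseeded split changes on every run. One option, shown here as a sketch on made-up labels, is to pass `stratify` and `random_state` to `train_test_split` (the seed value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1,000 rows with a 2% positive class.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())
```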

In [11]:
num_columns = list(x_train.select_dtypes(include = ['int64', 'float64']))
cat_columns = list(x_train.select_dtypes(include = ['object']))

In [12]:
#add pipelining for columns
num_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ord', OneHotEncoder(drop='first'))
])

preprocessor = ColumnTransformer(transformers = [
    ('numerical', num_preprocessor, num_columns),
    ('categorical', cat_preprocessor, cat_columns)
])

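# learn the imputation values, scaling and categories from the training set only
x_train_transformed = preprocessor.fit_transform(x_train)
# apply the already-fitted preprocessing to the test set (no refitting)
x_test_transformed = preprocessor.transform(x_test)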
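The preprocessing statistics (means, scales, category lists) should be learned from the training rows and then reused on the test rows. One way to make that hard to get wrong, sketched here on made-up data with an invented labelling rule, is to chain the preprocessor and the model in a single `Pipeline`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one numeric and one categorical column.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "amount": rng.exponential(1000.0, size=n),
    "type": rng.choice(["PAYMENT", "TRANSFER", "CASH_OUT"], size=n),
})
# Invented rule so the model has something to learn.
toy_y = ((toy["amount"] > 1500) & (toy["type"] != "PAYMENT")).astype(int)

num_steps = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())])
cat_steps = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(drop="first"))])
pre = ColumnTransformer(transformers=[
    ("numerical", num_steps, ["amount"]),
    ("categorical", cat_steps, ["type"]),
])

model = Pipeline([("preprocess", pre),
                  ("clf", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

# fit() learns the preprocessing and the model from the training rows only;
# predict() reuses those learned statistics on the test rows.
model.fit(X_tr, y_tr)
test_preds = model.predict(X_te)
print(test_preds[:10])
```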

### Logistic Regression

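Logistic regression models the log-odds of fraud as a weighted sum of the input features and passes that sum through a sigmoid function to get a probability between 0 and 1. By default, a transaction is predicted as fraud when that probability is above 0.5.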

In [11]:
log_reg = LogisticRegression()

log_reg.fit(x_train_transformed, y_train)
predictions = log_reg.predict(x_test_transformed)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
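The warning above means the solver hit its iteration cap before converging. A sketch of the first remedy it suggests, raising `max_iter`, on made-up data (the value 1000 is an arbitrary illustrative choice, not a tuned setting):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data with a simple linear rule for the label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 4))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# max_iter defaults to 100; a larger cap gives the solver room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used.
print(clf.n_iter_)
```

If `n_iter_` comes back equal to `max_iter`, the cap was hit again and the fitted coefficients should be treated with some caution.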


In [12]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

In [13]:
# scikit-learn metric functions take the true labels first, then the predictions
accuracy = accuracy_score(y_test, predictions)
recall = recall_score(y_test, predictions)
precision = precision_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

In [14]:
print('accuracy is:', accuracy)
print('recall is:', recall)
print('precision is:', precision)
print('f1 score is:', f1)

accuracy is: 0.999213372510159
recall is: 0.417636252296387
precision is: 0.9316939890710383
f1 score is: 0.5767441860465116


In this particular case I am assuming that these signals would be sent to a manager who would then decide what to do with them. Because of this, it is much better to have false positives than false negatives, and therefore we should mainly focus on optimizing recall and F1 score. With the default threshold, logistic regression is precise (about 93% of the transactions it flags are fraud) but catches only about 42% of the fraud in the test set.

It can also be helpful to see a confusion matrix of how the algorithm performed.

In [15]:
from sklearn.metrics import confusion_matrix

# rows are the true labels, columns are the predicted labels
confusion_matrix(y_test, predictions)

array([[1270838,      50],
       [    951,     682]])

This tells us that about 1.27 million transactions were not fraud and were correctly classified. 50 transactions were not fraud but were classified as fraud (false positives). 951 transactions were fraud but were classified as not fraud (false negatives); this is the number we want to minimize. 682 cases of fraud were caught.
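If false negatives are costlier than false positives, one lever is the decision threshold: `predict` effectively uses 0.5 for logistic regression, and lowering it flags more transactions. A sketch on made-up imbalanced data (the 0.2 threshold is an illustrative assumption, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Made-up data with roughly 5% positives.
X_toy, y_toy = make_classification(
    n_samples=4000, n_features=6, weights=[0.95], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fraud_prob = clf.predict_proba(X_te)[:, 1]  # probability of class 1

results = {}
for threshold in (0.5, 0.2):
    preds = (fraud_prob >= threshold).astype(int)
    # confusion_matrix(y_true, y_pred): rows are true labels, columns predictions
    tn, fp, fn, tp = confusion_matrix(y_te, preds, labels=[0, 1]).ravel()
    results[threshold] = {"recall": recall_score(y_te, preds),
                          "false_positives": int(fp),
                          "false_negatives": int(fn)}
    print(threshold, results[threshold])
```

Lowering the threshold can only keep or increase recall, at the price of more false positives for the reviewer to check, so the right value depends on how much review time is available.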

## Machine Learning Algorithms

Now I'll move away from classical statistical methods and more into the realm of machine learning (what some folks label as AI). There are a lot of options here, but I will use two: support vector machines and random forest classifiers.

### Support Vector Machines

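A support vector machine looks for the boundary (a hyperplane) that separates the two classes with the widest possible margin. Here I use the linear version, which is fast enough for millions of rows; the parameter `C` controls how heavily points on the wrong side of the margin are penalized.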

In [16]:
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1, loss='hinge')

svm.fit(x_train_transformed, y_train)



In [17]:
svm_preds = svm.predict(x_test_transformed)
# true labels first, so rows are true labels and columns are predictions
confusion_matrix(y_test, svm_preds)

array([[1270871,      17],
       [   1034,     599]])

In [18]:
svm_recall = recall_score(y_test, svm_preds)
svm_f1 = f1_score(y_test, svm_preds)

print('recall score is:', round(svm_recall, 4))
print('f1 score is:', round(svm_f1, 4))

recall score is: 0.3668
f1 score is: 0.5327
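A linear support vector machine scores each row by its signed distance from the separating hyperplane, and `predict` flags the rows with a positive score. A sketch on made-up data showing that relationship; `class_weight='balanced'` is an optional setting added here as an assumption, since it is a common way to make errors on a rare class count for more:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Made-up data with roughly 5% positives.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=6, weights=[0.95], random_state=0
)

# class_weight='balanced' reweights classes inversely to their frequency.
svm_toy = LinearSVC(C=1, class_weight="balanced", max_iter=10000)
svm_toy.fit(X_toy, y_toy)

scores = svm_toy.decision_function(X_toy)  # signed distance to the hyperplane
preds = svm_toy.predict(X_toy)

# predict() is equivalent to thresholding the score at zero
same = np.array_equal(preds, (scores > 0).astype(int))
print(same)
```

Because the scores are available, the zero cut-off could be moved in the same way as a probability threshold if more recall is wanted.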


### Random Forests

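A random forest trains many decision trees, each on a random sample of the rows and features, and combines their predictions. This lets it pick up non-linear patterns and interactions between features that the two linear models above cannot.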

In [19]:
from sklearn.ensemble import RandomForestClassifier

rnd_for = RandomForestClassifier(n_jobs = -1)
rnd_for.fit(x_train_transformed, y_train)
rnd_for_preds = rnd_for.predict(x_test_transformed)

In [20]:
# true labels first, so rows are true labels and columns are predictions
confusion_matrix(y_test, rnd_for_preds)

array([[1263008,    7880],
       [    959,     674]])

In [21]:
rnd_for_recall = recall_score(y_test, rnd_for_preds)
rnd_for_f1 = f1_score(y_test, rnd_for_preds)

print('recall is:', round(rnd_for_recall, 4))
print('f1 score is:', round(rnd_for_f1, 4))

recall is: 0.4127
f1 score is: 0.1323
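Since the flagged transactions go to a human reviewer, it can help to know which features a forest leans on. A sketch on made-up data that reads `feature_importances_` from a fitted forest (the feature names and the labelling rule are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["amount", "oldbalanceOrg", "noise"]  # hypothetical names
X_toy = rng.normal(size=(1000, 3))
# Invented rule: the label depends on the first two columns only.
y_toy = ((X_toy[:, 0] > 0.5) & (X_toy[:, 1] > 0)).astype(int)

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X_toy, y_toy)

# Importances sum to 1; a higher value means the trees relied on that feature more.
importances = forest.feature_importances_
for name, importance in zip(feature_names, importances):
    print(name, round(float(importance), 3))
```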


## Conclusions

In [22]:
#final metrics

d1 = {'logistic regression': [recall], 'support vector machine': [svm_recall], 'random forest': [rnd_for_recall]}
conc = pd.DataFrame(data = d1)
conc

Unnamed: 0,logistic regression,support vector machine,random forest
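0,0.417636,0.366810,0.412737


add some other algorithms

From this very brief exploration we can see that, at their default settings, none of the three models catches most of the fraud. Logistic regression has the highest recall, flagging about 42% of the fraudulent transactions in the test set, followed by the random forest (about 41%) and the support vector machine (about 37%). Logistic regression and the support vector machine are precise, though: when they flag a transaction it is fraud roughly 93% and 97% of the time, while the random forest raises far more false alarms. Since we would rather have false positives than false negatives, recall is the number to improve, and it is reasonable to expect that hyperparameter tuning, weighting the rare fraud class more heavily, or lowering the decision threshold would raise it.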

**How could this be used?**

- Once in production, we would pipe in the data for all of the previous company credit card transactions that we know of.
- We would then label them as fraud or not fraud (something that can be done easily with a simple loop over the records).
- We would then train the ML model on that data and use it to predict whether the transactions from the most recent period x (x = however often you want to run this) are fraudulent or not.
- The machine learning, or AI, system will then output a list of the transactions that it thinks are fraudulent, which can be forwarded to the appropriate managerial staff for review.
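The steps above could be sketched roughly as follows. Everything here is a hypothetical illustration: the single `amount` feature, the made-up history and its labelling rule, and the `flag_transactions` helper are assumptions, not an existing Terex system.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def flag_transactions(model, recent, feature_cols, threshold=0.5):
    """Return the rows of `recent` whose predicted fraud probability meets `threshold`."""
    probs = model.predict_proba(recent[feature_cols])[:, 1]
    scored = recent.assign(fraud_probability=probs)
    return scored[scored["fraud_probability"] >= threshold]


# Made-up labelled history: unusually large amounts are marked as fraud.
rng = np.random.default_rng(0)
history = pd.DataFrame({"amount": rng.exponential(200.0, size=2000)})
history["isFraud"] = (history["amount"] > 900).astype(int)

# Train on the labelled history.
model = Pipeline([("scaler", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(history[["amount"]], history["isFraud"])

# Score the most recent batch and hand the flagged rows to a reviewer.
recent = pd.DataFrame({"amount": [25.0, 140.0, 5000.0]})
review_list = flag_transactions(model, recent, ["amount"])
print(review_list)
```

Keeping the threshold as a parameter lets the reviewers trade a longer review list for fewer missed cases without retraining the model.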

Ultimately, without much heavy lifting, this would provide a system that could oversee corporate credit cards for fraud and misuse with very little maintenance and, I think, provide a very viable alternative to outsourcing this task to a separate company.

**What are some other use cases?**

- Taking data in to predict when maintenance might be needed, allowing salespeople to get in touch with vendors before maintenance is needed. Ideally this would result in a higher likelihood of vendors using Terex for maintenance, increasing revenue in that manner (regression task)
- For keeping track of vehicle inventory and identifying where potential storage lots are (clustering task)
- Using extensive data collection around leads to help salespeople prioritize which leads have the highest potential of converting into a sale