# Predicting fraudulent credit card transactions



According to [creditcards.com][1], there was over £300m in fraudulent credit card transactions in the UK in the first half of 2016, with banks preventing over £470m of fraud in the same period. The data shows that credit card fraud is rising, so there is an urgent need to continue to develop new, and improve current, fraud detection methods.

Using this dataset, we will use machine learning to develop a model that attempts to predict whether or not a transaction is fraudulent. To preserve anonymity, these data have been transformed using principal components analysis.

To begin this analysis, we will first train a random forest model to establish a benchmark, before looping back to EDA, looking at the most important predictive variables and testing other models.
[1]: http://uk.creditcards.com/credit-card-news/uk-britain-credit-debit-card-statistics-international.php

In [None]:
# load packages
library(readr)
library(dplyr)
library(randomForest)
library(ggplot2)
library(Hmisc)
library(party)

In [None]:
# set random seed for model reproducibility
set.seed(1234)

In [None]:
# import data
creditData <- read_csv("../input/creditcard.csv")

In [None]:
# look at the data
glimpse(creditData)

In [None]:
# make Class a factor
creditData$Class <- factor(creditData$Class)

In [None]:
# split by row order: first 150,000 rows for training, the rest for testing
train <- creditData[1:150000, ]
test <- creditData[150001:nrow(creditData), ]

In [None]:
train %>%
  select(Class) %>%
  group_by(Class) %>%
  summarise(count = n()) %>%
  glimpse

In [None]:
test %>%
  select(Class) %>%
  group_by(Class) %>%
  summarise(count = n()) %>%
  glimpse

As we can see, fraudulent transactions are a very small proportion of our dataset. We could build what would appear to be a highly accurate model just by always saying that every transaction was not fraudulent. While we would be right over 99% of the time, that would cost consumers and the industry over £500m per year, so it wouldn't be a useful model.
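To make the "always say not fraudulent" baseline concrete, here is a minimal sketch on toy labels. The counts are made up for illustration and are not taken from the dataset; `actual`, `predicted`, `baselineAccuracy` and `fraudCaught` are names introduced only for this example.

```r
# Toy labels: 0 = legitimate, 1 = fraudulent (made-up counts, not the real data)
actual <- factor(c(rep(0, 995), rep(1, 5)), levels = c(0, 1))

# A "model" that always predicts the majority class
majorityClass <- names(which.max(table(actual)))
predicted <- factor(rep(majorityClass, length(actual)), levels = levels(actual))

# Share of predictions that are correct
baselineAccuracy <- mean(predicted == actual)

# Number of fraudulent transactions the baseline actually flags
fraudCaught <- sum(predicted == "1" & actual == "1")

baselineAccuracy
fraudCaught
```

The accuracy looks excellent, yet not a single fraudulent transaction is caught, which is why accuracy alone is a poor guide here.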

In [None]:
# build random forest model using every variable
rfModel <- randomForest(Class ~ . , data = train)

In [None]:
test$predicted <- predict(rfModel, test)

So we've built a random forest model using all the available variables and used it to predict whether or not a transaction is fraudulent on our test set. As the data is very imbalanced, we would expect our accuracy to be very high, even if our model just always guessed 'not fraudulent'.

[Jason Brownlee has some useful options to work through when you have an imbalanced dataset][1]. For now, though, the `confusionMatrix` function in the `caret` package reports the accuracy alongside some other useful metrics:
[1]: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

In [None]:
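library(caret)
# confusionMatrix() takes the predictions first (data), then the true labels (reference)
confusionMatrix(data = test$predicted, reference = test$Class)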

Looking at the output above, we can indeed see that we have very high accuracy (99.94%), as we expected. Going back to our test set class counts, we can see that we had 134,608 legitimate transactions and 199 fraudulent transactions. Had we said that every transaction was legitimate, we would have got 199 wrong.

If we look at our confusion matrix, we can see that, using our model, we only got 82 predictions wrong; this figure of 82 is made up of 47 false positives and 35 false negatives. Going from 199 wrong to 82 wrong is quite an improvement in performance, but we have to consider the importance of sensitivity and specificity when it comes to the real-world application. Are the implications of a false negative more or less significant than the implications of a false positive?
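As a rough worked example, the counts quoted above can be turned into sensitivity, specificity and precision by hand. This sketch assumes that fraud is treated as the positive class and that the 47 and 35 are read that way round; the variable names are introduced here for illustration.

```r
# Counts quoted in the text for the test set
nLegit   <- 134608
nFraud   <- 199
falsePos <- 47   # legitimate transactions flagged as fraud (assumed reading)
falseNeg <- 35   # fraudulent transactions that were missed (assumed reading)

truePos <- nFraud - falseNeg   # fraud correctly flagged
trueNeg <- nLegit - falsePos   # legitimate correctly passed

accuracy    <- (truePos + trueNeg) / (nLegit + nFraud)
sensitivity <- truePos / (truePos + falseNeg)   # share of fraud caught
specificity <- trueNeg / (trueNeg + falsePos)   # share of legitimate passed
precision   <- truePos / (truePos + falsePos)   # share of fraud flags that are right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

Under those assumptions the model catches roughly 82% of the fraud, and a little under 78% of the transactions it flags are actually fraudulent, which is a much less flattering picture than the 99.94% accuracy.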

Ultimately, there is no way lenders could function if they classified every transaction as fraud and investigated it thoroughly before deciding whether or not to approve it; the costs of doing that would be so high that it wouldn't be feasible. Equally, if the lenders let every transaction through, the costs associated with the fraud would escalate.

In the absence of 100% accuracy, when we are building our models, it is important to consider the purpose of the model and how it will be used. We could optimise our model for area under the ROC curve, but if the real-world use of the model places more importance on reducing false negatives rather than false positives, we may be training against the wrong metric.

For this model, let's use the F1 score from the `MLmetrics` package.

In [None]:
library(MLmetrics)

In [None]:
F1_all <- F1_Score(test$Class, test$predicted)
F1_all
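One thing worth checking is which class the F1 score is being computed for. As far as I know, `F1_Score` takes the first factor level (here `0`, the legitimate class) as the positive class when `positive` is left unset, so treat that as an assumption and check the package documentation. The base R sketch below uses a hypothetical helper, `f1ForClass`, and rebuilds label vectors from the counts quoted earlier (47 false positives, 35 false negatives) to show how much the choice matters.

```r
# F1 for a chosen positive class, computed from raw label vectors
f1ForClass <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  2 * tp / (2 * tp + fp + fn)
}

# Label vectors rebuilt from the counts quoted in the text (assumed reading)
actual    <- c(rep("0", 134608), rep("1", 199))
predicted <- c(rep("0", 134561), rep("1", 47),   # legitimate: 47 wrongly flagged
               rep("0", 35),     rep("1", 164))  # fraud: 35 missed

f1Legit <- f1ForClass(actual, predicted, positive = "0")
f1Fraud <- f1ForClass(actual, predicted, positive = "1")

c(legitimate = f1Legit, fraud = f1Fraud)
```

With these counts, the F1 for the legitimate class sits just below 1, while the F1 for the fraud class is 0.8. If fraud is the class we care about, passing `positive = "1"` to `F1_Score` may give a more informative benchmark.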

Now we have our benchmark figure, obtained very quickly using all the variables to train a random forest model with no tuning. Now to see if we can either simplify, without losing accuracy, or improve that score.

To start off with, let's look at the importance of the predictors:

In [None]:
options(repr.plot.width = 5, repr.plot.height = 4)
varImpPlot(rfModel,
           sort = TRUE,
           n.var = 10,
           main = "Top 10 Most Important Variables")
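Rather than reading the variable names off the plot, they could also be pulled out programmatically. The sketch below assumes an importance matrix shaped like the one returned by `importance(rfModel)` (one row per variable, scores in the first column); `topVariables` is a hypothetical helper and the scores are invented toy values, not results from this model.

```r
# Return the names of the n highest-scoring variables in an importance matrix
topVariables <- function(imp, n = 10) {
  scores <- imp[, 1]                          # named vector of importance scores
  ranked <- names(sort(scores, decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

# Toy importance matrix with made-up names and values
toyImportance <- matrix(c(12.5, 40.1, 3.2, 27.8),
                        ncol = 1,
                        dimnames = list(c("x1", "x2", "x3", "x4"),
                                        "MeanDecreaseGini"))

top2 <- topVariables(toyImportance, n = 2)
top2

# Build a model formula from the selected names
reformulate(top2, response = "Class")
```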

Let's see what sort of performance we get with just our top predictive variable:

In [None]:
rfModelTrim1 <- randomForest(Class ~  V17, 
                            data = train)

test$predictedTrim1 <- predict(rfModelTrim1, test)

F1_1 <- F1_Score(test$Class, test$predictedTrim1)
F1_1

Not a bad score at all, and the training run time was considerably shorter. What about if we go with the top 2?

In [None]:
rfModelTrim2 <- randomForest(Class ~  V17 + V12, 
                            data = train)

test$predictedTrim2 <- predict(rfModelTrim2, test)

F1_2 <- F1_Score(test$Class, test$predictedTrim2)
F1_2

That takes us up to the 0.9996 level, so we're already getting close to our performance using all the variables, but with much faster run times. What about the top 3?

In [None]:
rfModelTrim3 <- randomForest(Class ~  V17 + V12 + V14, 
                            data = train)

test$predictedTrim3 <- predict(rfModelTrim3, test)

F1_3 <- F1_Score(test$Class, test$predictedTrim3)
F1_3

A bit of a dip there, but that could just be due to chance. Let's try a few more models of increasing complexity to see what sort of trend emerges. We could do this in a for loop so that we could set it running and go away and have a nice coffee, but I want to get a feel for how long each model takes to run and keep the code simple to follow, so we'll keep things separate.

I might return to that in a future kernel, though. It might be interesting to play with time-stamping the start and end-points of processes and calculating the run time for each iteration through a loop...
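A minimal sketch of what that timed loop might look like, using `system.time()` from base R. To keep it quick and self-contained, it fits `lm` on the built-in `mtcars` data as a stand-in; `timeModels` and its arguments are hypothetical names, and in this kernel one would presumably pass `randomForest`, `train` and the ranked `V` variables instead.

```r
# Fit one model per predictor set and record how long each fit takes
timeModels <- function(predictorSets, response, data, fitFn) {
  results <- lapply(predictorSets, function(vars) {
    modelFormula <- reformulate(vars, response = response)
    elapsed <- system.time(fitFn(modelFormula, data = data))[["elapsed"]]
    data.frame(numVariables = length(vars), seconds = elapsed)
  })
  do.call(rbind, results)
}

# Stand-in example: predictors ranked by assumed importance, added one at a time
ranked <- c("wt", "hp", "disp")
predictorSets <- lapply(seq_along(ranked), function(k) ranked[seq_len(k)])

timings <- timeModels(predictorSets, response = "mpg", data = mtcars, fitFn = lm)
timings
```

This version throws the fitted models away and only keeps the timings; storing each model in a list alongside its run time would be a natural extension.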

In [None]:
# four variables
rfModelTrim4 <- randomForest(Class ~  V17 + V12 + V14 + V10, 
                            data = train)

test$predictedTrim4 <- predict(rfModelTrim4, test)

F1_4 <- F1_Score(test$Class, test$predictedTrim4)
F1_4

In [None]:
# five variables
rfModelTrim5 <- randomForest(Class ~  V17 + V12 + V14 + V10 + V16, 
                            data = train)

test$predictedTrim5 <- predict(rfModelTrim5, test)

F1_5 <- F1_Score(test$Class, test$predictedTrim5)
F1_5

In [None]:
# ten variables
rfModelTrim10 <- randomForest(Class ~  V17 + V12 + V14 + V10 + V16 
                              + V11 + V9 + V4 + V18 + V26, 
                            data = train)

test$predictedTrim10 <- predict(rfModelTrim10, test)

F1_10 <- F1_Score(test$Class, test$predictedTrim10)
F1_10

With those scores calculated, let's go ahead and plot those out:

In [None]:
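# build dataframe of number of variables and scores
# (the full model uses every column except Class as a predictor)
numVariables <- c(1, 2, 3, 4, 5, 10, ncol(train) - 1)
f1Scores <- c(F1_1, F1_2, F1_3, F1_4, F1_5, F1_10, F1_all)  # avoid reusing the function name F1_Score
variablePerf <- data.frame(numVariables, F1_Score = f1Scores)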

In [None]:
# plot score performance against number of variables
options(repr.plot.width=4, repr.plot.height=3)
ggplot(variablePerf, aes(numVariables, F1_Score)) +
  geom_point() +
  labs(x = "Number of Variables", y = "F1 Score", title = "F1 Score Performance")

In [None]:
rf10 <- randomForest(Class ~ V17 + V12 + V14 + V10 + V16
                     + V11 + V9 + V4 + V18 + V26,
                     ntree = 1000,
                     data = train)

In [None]:
options(repr.plot.width=6, repr.plot.height=4)
plot(rf10)

Plotting our 10-variable model shows that there is not much additional performance gained after what looks like about 50 trees, but let's zoom in on that region to make sure:

In [None]:
options(repr.plot.width=6, repr.plot.height=4)
plot(rf10, xlim=c(0,100))
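Instead of judging the plateau by eye, the point where extra trees stop helping could be estimated from the error curve itself. The sketch below works on any numeric vector of out-of-bag error by number of trees; for this model that would presumably be `rf10$err.rate[, "OOB"]`. The helper `treesToPlateau`, the tolerance and the synthetic curve are assumptions made for illustration.

```r
# Smallest number of trees whose error is within `tolerance` of the best error seen
treesToPlateau <- function(oobError, tolerance = 0.0001) {
  which(oobError <= min(oobError) + tolerance)[1]
}

# Synthetic, smoothly decaying error curve (not output from the real model)
oobError <- 0.002 + 0.01 * exp(-(1:200) / 10)

plateauAt <- treesToPlateau(oobError)
plateauAt
```

On a real, noisy error curve an early dip could trigger this rule too soon, so it is better treated as a rough guide than as a stopping rule.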

## Summing up
We have used a random forest to predict whether or not a credit card transaction is fraudulent, and built a model that offers a useful uplift over the no information rate. By testing models using an increasing number of variables, we have begun to explore the balance between model performance and run-time.

This basic model provides a starting point for continuing to tune the model to seek additional improvements. Although we are dealing with changes in accuracy at the fourth decimal place, these very slight changes in accuracy need to be considered with respect to the volume of credit card transactions that take place every year. Even with only a slight fraction of these being fraudulent, the sheer volume of transactions means that even very slight improvements in model performance may result in significant reductions in credit card fraud.