This dataset is presently one of only four on Kaggle with information on the rising risk of digital financial fraud, which underscores how difficult such data is to obtain. The main technical challenge it poses for predicting fraud is the highly imbalanced distribution between positive and negative classes in 6 million rows of data. Another stumbling block to the utility of this data stems from possible discrepancies in its description. The goal of this analysis is to address both issues through detailed data exploration and cleaning, followed by choosing a suitable machine-learning algorithm to deal with the skew. I show that an optimal solution based on feature engineering and extreme gradient-boosted decision trees yields an enhanced predictive power of 0.997, as measured by the area under the precision-recall curve. Crucially, these results were obtained without artificial balancing of the data, making this approach suitable for real-world applications.

# 1. Import Libraries and Load Data

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import average_precision_score
# from xgboost.sklearn import XGBClassifier
# from xgboost import plot_importance, to_graphviz

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

### Loading data

In [4]:
pdf = pd.read_csv("data/PS_20174392719_1491204439457_log.csv")
# Rename the balance columns to a consistent camelCase scheme
pdf = pdf.rename(columns={"oldbalanceOrg":"oldBalanceOrig","newbalanceOrig":"newBalanceOrig",
                         "oldbalanceDest":"oldBalanceDest","newbalanceDest":"newBalanceDest"})
pdf.head()

Unnamed: 0,step,type,amount,nameOrig,oldBalanceOrig,newBalanceOrig,nameDest,oldBalanceDest,newBalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


Test whether there are any missing values. It turns out there are no obvious missing values but, as we will see below, this does not rule out missing values being encoded by a numerical proxy such as 0.

In [5]:
pdf.isnull().values.any()

False
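The point about numerical proxies can be illustrated with a minimal sketch. It uses a made-up toy DataFrame that borrows the column names from above (the values are illustrative, not taken from the dataset), and the choice of "balance stays at 0 despite a non-zero amount" as the suspicious pattern is an assumption for illustration only.

```python
import pandas as pd

# Toy rows with the same column names as the renamed DataFrame (values are made up).
toy = pd.DataFrame({
    "amount":         [181.0, 181.0, 9839.64, 500.0],
    "oldBalanceDest": [0.0, 21182.0, 0.0, 0.0],
    "newBalanceDest": [0.0, 0.0, 0.0, 500.0],
})

# isnull() reports nothing missing...
has_nulls = toy.isnull().values.any()

# ...but a non-zero amount sent to an account whose balance is 0 both before
# and after the transaction hints that 0 may stand in for an unknown value.
suspect = toy.loc[(toy.oldBalanceDest == 0) &
                  (toy.newBalanceDest == 0) &
                  (toy.amount != 0)]
n_suspect = len(suspect)

print("Any nulls: {}; rows with suspicious zero balances: {}".format(has_nulls, n_suspect))
```

Applied to the full `pdf`, the same boolean mask would give a first estimate of how many zero balances are candidates for imputation or special handling.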

# 2. Exploratory Data Analysis
In this section and until section 4, we wrangle the data exclusively using DataFrame methods. This is the most succinct way to gain insights into the dataset. More elaborate visualizations follow in subsequent sections.

### 2.1 Which types of transactions are fraudulent?
We find that, of the five types of transactions, fraud occurs in only two (see also kernels by [Net](https://www.kaggle.com/netzone/eda-and-fraud-detection), [Phillip Schmidt](https://www.kaggle.com/philschmidt/where-s-the-money-lebowski) and [Ibe_Noriaki](https://www.kaggle.com/ibenoriaki/three-features-with-kneighbors-auc-score-is-0-998)): 'TRANSFER', where money is sent to a customer / fraudster, and 'CASH_OUT', where money is sent to a merchant who pays the customer / fraudster in cash. Remarkably, the number of fraudulent TRANSFERs almost equals the number of fraudulent CASH_OUTs (see the right half of the plot in section 5.1). These observations appear, at first, to bear out the description provided on Kaggle for the modus operandi of fraudulent transactions in this dataset, namely, that fraud is committed by first transferring funds to another account, which subsequently cashes them out. We will return to this issue later in section 2.4.

In [8]:
print("\n The types of fraudulent transactions are {}".format(list(pdf.loc[pdf.isFraud == 1].type.drop_duplicates().values)))

pdfFraudTransfer = pdf.loc[(pdf.isFraud == 1) & (pdf.type == 'TRANSFER')]
pdfFraudCashout = pdf.loc[(pdf.isFraud == 1) & (pdf.type == 'CASH_OUT')]

print("\n The number of fraudulent TRANSFERs = {}".format(len(pdfFraudTransfer)))  # 4097
print("\n The number of fraudulent CASH_OUTs = {}".format(len(pdfFraudCashout)))  # 4116



 The types of fraudulent transactions are ['TRANSFER', 'CASH_OUT']

 The number of fraudulent TRANSFERs = 4097

 The number of fraudulent CASH_OUTs = 4116
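As a complementary view, the per-type fraud counts can be tabulated in a single pass with `groupby`. The sketch below runs on a made-up toy DataFrame rather than the real data; the column names `n_fraud` and `n_total` are introduced here purely for illustration.

```python
import pandas as pd

# Toy transactions (made up); only the two columns needed for the tally.
toy = pd.DataFrame({
    "type":    ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_OUT", "DEBIT", "TRANSFER"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Count fraudulent and total transactions for each transaction type.
fraud_by_type = toy.groupby("type")["isFraud"].agg(["sum", "count"])
fraud_by_type = fraud_by_type.rename(columns={"sum": "n_fraud", "count": "n_total"})

# Transaction types in which at least one fraud occurs.
fraud_types = sorted(fraud_by_type.index[fraud_by_type.n_fraud > 0])

print(fraud_by_type)
print(fraud_types)
```

On the real `pdf`, the same tally should reproduce the counts printed above for TRANSFER and CASH_OUT, and show zero fraud for the remaining types.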


### 2.2 What determines whether the feature isFlaggedFraud gets set or not?

It turns out that the origin of _isFlaggedFraud_ is unclear, in contrast with the description provided. The 16 entries (out of 6 million) where the _isFlaggedFraud_ feature is set do not seem to correlate with any explanatory variable. According to the data description, _isFlaggedFraud_ is set when an attempt is made to 'TRANSFER' an 'amount' greater than 200,000. In fact, as shown below, _isFlaggedFraud_ can remain not set despite this condition being met.


In [11]:
print("\nThe type of transactions in which isFlaggedFraud is set:{}".format(list(pdf.loc[pdf.isFlaggedFraud == 1].type.drop_duplicates())))

pdfTransfer = pdf.loc[pdf.type == 'TRANSFER']
pdfFlagged = pdf.loc[pdf.isFlaggedFraud == 1]
pdfNotFlagged = pdf.loc[pdf.isFlaggedFraud == 0]

print("\nMin amount transacted when isFlaggedFraud is set = {}".format(pdfFlagged.amount.min()))
print("\nMax amount transacted when isFlaggedFraud is not set = {}".format(pdfTransfer.loc[pdfTransfer.isFlaggedFraud == 0].amount.max()))



The type of transactions in which isFlaggedFraud is set:['TRANSFER']

Min amount transacted when isFlaggedFraud is set = 353874.22

Max amount transacted when isFlaggedFraud is not set = 92445516.64
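The mismatch with the documented rule can also be counted directly. The following is a minimal sketch on a made-up toy DataFrame: the 200,000 threshold comes from the description quoted above, while the variable names (`rule`, `violations`) are hypothetical and only serve the illustration.

```python
import pandas as pd

THRESHOLD = 200000  # threshold stated in the dataset description

# Toy transactions (made up) covering flagged and unflagged cases.
toy = pd.DataFrame({
    "type":           ["TRANSFER", "TRANSFER", "TRANSFER", "CASH_OUT"],
    "amount":         [353874.22, 92445516.64, 181.0, 250000.0],
    "isFlaggedFraud": [1, 0, 0, 0],
})

# Rows that should be flagged if the description were accurate.
rule = (toy.type == "TRANSFER") & (toy.amount > THRESHOLD)

# Rows where the rule holds but the flag is not set.
violations = toy.loc[rule & (toy.isFlaggedFraud == 0)]
n_violations = len(violations)

# Rows where the flag is set although the rule does not hold.
n_flagged_without_rule = len(toy.loc[~rule & (toy.isFlaggedFraud == 1)])

print("Rule met but not flagged: {}".format(n_violations))
print("Flagged without rule met: {}".format(n_flagged_without_rule))
```

If the description were accurate, the first count would be zero on the real data; the maximum unflagged TRANSFER amount printed above indicates that it is not.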


Can _oldBalanceDest_ and _newBalanceDest_ determine whether _isFlaggedFraud_ is set? For every TRANSFER where _isFlaggedFraud_ is set, the old balance is identical to the new balance in both the origin and destination accounts. This is presumably because the transaction is halted. Interestingly, _oldBalanceDest_ = 0 in every such transaction. However, as shown below, _isFlaggedFraud_ can remain not set in TRANSFERs where _oldBalanceDest_ and _newBalanceDest_ are both 0, so these conditions do not determine the state of _isFlaggedFraud_.

In [13]:
print("\nThe number of TRANSFERs where isFlaggedFraud = 0, yet oldBalanceDest = 0 and newBalanceDest = 0: {}".\
     format(len(pdfTransfer.loc[(pdfTransfer.isFlaggedFraud == 0) & \
                                (pdfTransfer.oldBalanceDest == 0) & \
                                (pdfTransfer.newBalanceDest == 0)]))) # 4158


The number of TRANSFERs where isFlaggedFraud = 0, yet oldBalanceDest = 0 and newBalanceDest = 0: 4158


Whether _isFlaggedFraud_ is set cannot be determined by a threshold on _oldBalanceOrig_, since the corresponding range of values overlaps with that for TRANSFERs where _isFlaggedFraud_ is not set (see below). Note that we do not need to consider _newBalanceOrig_, since it is updated only after the transaction, whereas _isFlaggedFraud_ would be set before the transaction takes place.