# Detecting credit card fraud using anomaly detection algorithms

### The key idea is that, instead of using supervised learning models, we estimate the probability density of the features for normal transactions, then use the fraudulent cases to set the threshold below which a transaction is called an anomaly.

Author: Sushant N. More

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Importing data

In [4]:
df = pd.read_csv('./creditcard.csv')

In [5]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [6]:
df.shape

(284807, 31)

We already did a detailed data exploration in the CreditCardFraudDetection notebook. 

Here are the main points to recall.

1. The data is highly skewed. Out of 284,807 examples, only 0.17% are fraudulent.

2. We noticed and reasoned that the Time column has no effect on whether a transaction is fraudulent, so we drop it.

3. We standardize the Amount column (zero mean, unit variance) to make it amenable to ML techniques.

4. The probability densities of features V1 through V28 look roughly Gaussian.
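The 0.17% figure in point 1 can be checked from the row counts this notebook prints (`df.shape` and `len(NormalTransaction)`). The snippet below is only an arithmetic check; the variable names are illustrative and not part of the original notebook.

```python
# Row counts as reported by the notebook outputs.
total_rows = 284807    # from df.shape
normal_rows = 284315   # from len(NormalTransaction)

# Every row that is not normal is a fraud case.
fraud_rows = total_rows - normal_rows
fraud_pct = 100 * fraud_rows / total_rows

print(fraud_rows, round(fraud_pct, 2))
```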

In the present case, the workflow is as follows: 

a) Repeat steps 2 and 3 above.

b) Separate the normal transactions and estimate their probability density.

c) Use the cross-validation set to choose the threshold.

d) Finally, evaluate on the test set.
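Steps b) and c) can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own implementation. It assumes independent per-feature Gaussians (a diagonal covariance) and picks the threshold by maximizing F1 on the validation set; both choices are assumptions made here. The function names `fit_gaussian`, `log_density`, and `select_threshold` are hypothetical.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from normal-only training rows."""
    return X.mean(axis=0), X.var(axis=0)

def log_density(X, mu, var):
    """Log of the product of independent univariate Gaussian densities."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

def select_threshold(log_p_val, y_val, n_steps=1000):
    """Scan candidate cutoffs and keep the one with the best F1 on validation data."""
    best_eps, best_f1 = None, -1.0
    for eps in np.linspace(log_p_val.min(), log_p_val.max(), n_steps):
        pred = log_p_val < eps                 # low density => flagged as anomaly
        tp = np.sum(pred & (y_val == 1))
        fp = np.sum(pred & (y_val == 0))
        fn = np.sum(~pred & (y_val == 1))
        if tp == 0:
            continue                           # F1 is zero or undefined here
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Synthetic stand-in data (NOT the credit card data): normal points around 0,
# a few anomalies around 6.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(5000, 3))
X_val = np.vstack([rng.normal(0.0, 1.0, size=(1000, 3)),
                   rng.normal(6.0, 1.0, size=(20, 3))])
y_val = np.concatenate([np.zeros(1000, dtype=int), np.ones(20, dtype=int)])

mu, var = fit_gaussian(X_train)
eps, f1 = select_threshold(log_density(X_val, mu, var), y_val)
```

Working with log-densities rather than raw densities avoids numerical underflow when many features are multiplied together. F1 is used because, with such skewed classes, plain accuracy would reward never flagging anything.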

In [8]:
df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1,1))
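As a quick check of what this scaling step does, the sketch below compares `StandardScaler` with the manual formula (x - mean) / std, where std is the population standard deviation. It uses the five `Amount` values shown in `df.head()` as a toy sample, so the results differ from the `normAmount` column, which is fitted on the full data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy sample: the five Amount values displayed by df.head(), for illustration only.
amounts = np.array([149.62, 2.69, 378.66, 123.50, 69.99])

# StandardScaler expects a 2-D array, hence the reshape to a single column.
scaled = StandardScaler().fit_transform(amounts.reshape(-1, 1)).ravel()

# Manual standardization; np.std uses ddof=0 (population std) by default.
manual = (amounts - amounts.mean()) / amounts.std()

print(np.allclose(scaled, manual))
```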

In [9]:
df.columns

Index([u'Time', u'V1', u'V2', u'V3', u'V4', u'V5', u'V6', u'V7', u'V8', u'V9',
       u'V10', u'V11', u'V12', u'V13', u'V14', u'V15', u'V16', u'V17', u'V18',
       u'V19', u'V20', u'V21', u'V22', u'V23', u'V24', u'V25', u'V26', u'V27',
       u'V28', u'Amount', u'Class', u'normAmount'],
      dtype='object')

In [10]:
dfMod = df.drop(['Time', 'Amount'], axis = 1)

In [11]:
dfMod.columns

Index([u'V1', u'V2', u'V3', u'V4', u'V5', u'V6', u'V7', u'V8', u'V9', u'V10',
       u'V11', u'V12', u'V13', u'V14', u'V15', u'V16', u'V17', u'V18', u'V19',
       u'V20', u'V21', u'V22', u'V23', u'V24', u'V25', u'V26', u'V27', u'V28',
       u'Class', u'normAmount'],
      dtype='object')

In [12]:
dfMod.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Class,normAmount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0,-0.073403


Let's separate the normal transactions from the fraudulent ones.

In [13]:
FraudTransaction = dfMod[dfMod['Class'] == 1]
NormalTransaction = dfMod[dfMod['Class'] == 0]

Note that in this case, the training set consists only of normal examples.
We split the data as follows:

Training set: 60% of the normal data.

Cross-validation set: 20% of the normal data, 50% of the fraudulent data.

Test set: 20% of the normal data, 50% of the fraudulent data.

Before making this split, we randomly shuffle the data using the sample method of pandas and then apply np.split.
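The full split described above, including the half-and-half division of the fraud cases, might look like the sketch below. It runs on a small synthetic frame rather than the real data, uses `iloc` slicing instead of `np.split`, and fixes `random_state` so the shuffle is reproducible. The names `toy`, `train`, `validate`, and `test` are hypothetical and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dfMod: 1000 normal rows and 10 fraud rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "V1": rng.normal(size=1010),
    "Class": np.r_[np.zeros(1000, dtype=int), np.ones(10, dtype=int)],
})

# Shuffle each class separately; random_state makes the order repeatable.
normal = toy[toy["Class"] == 0].sample(frac=1, random_state=0)
fraud = toy[toy["Class"] == 1].sample(frac=1, random_state=0)

# Cut points for the 60/20/20 split of normal rows and the 50/50 split of fraud rows.
n = len(normal)
i60, i80 = int(0.6 * n), int(0.8 * n)
half = len(fraud) // 2

train = normal.iloc[:i60]                                       # normal only
validate = pd.concat([normal.iloc[i60:i80], fraud.iloc[:half]])
test = pd.concat([normal.iloc[i80:], fraud.iloc[half:]])
```

Because the training set contains no fraud cases, the density estimate describes normal behavior only. The fraud rows are reserved for choosing and then evaluating the threshold.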

In [15]:
len(NormalTransaction)

284315

In [16]:
NormalTransaction.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Class,normAmount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0,-0.073403


In [17]:
NormalTransaction.index

Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            284797, 284798, 284799, 284800, 284801, 284802, 284803, 284804,
            284805, 284806],
           dtype='int64', length=284315)

In [18]:
NormalTransactionShuffled = NormalTransaction.sample(frac = 1)

In [19]:
NormalTransactionShuffled.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Class,normAmount
227143,2.008021,-0.210928,-3.134833,-0.343703,2.597961,3.286339,-0.555582,0.832574,0.677253,-0.655852,...,-0.031826,0.094989,0.072766,0.63186,0.123913,0.65912,-0.022794,-0.035894,0,-0.313249
111151,1.178239,-0.112266,-0.528712,0.05566,-0.067912,-1.110653,0.532842,-0.377442,-0.088516,-0.116822,...,-0.027329,-0.272626,-0.226306,-0.034026,0.57489,1.092502,-0.133263,0.001482,0,0.046379
274188,-0.55931,0.252248,0.718171,-0.061253,1.167424,-0.58395,0.425633,0.035439,-0.653711,-0.226866,...,-0.137535,-0.517707,-0.022232,-0.462955,-0.407223,0.374007,0.067637,0.136836,0,-0.349671
36596,-0.707835,1.119971,1.288918,0.942082,-0.084526,-1.000469,0.756299,-0.238079,-0.29902,0.230965,...,0.138632,0.536785,-0.186578,0.713265,-0.390907,-0.434221,-0.539133,-0.359817,0,-0.311529
123354,-1.386725,1.067763,1.822499,-0.682421,-0.377616,-0.680118,0.282821,0.430088,0.150079,-0.372859,...,0.060073,0.011975,-0.091141,0.443813,-0.236341,-0.925158,-0.202221,-0.115952,0,-0.325283


In [20]:
NormalTransactionShuffled.index

Int64Index([227143, 111151, 274188,  36596, 123354, 188769, 143654, 271800,
            205058, 148641,
            ...
            270192, 169599, 223292, 146892, 284442,  62393, 101240, 116059,
             21779, 193586],
           dtype='int64', length=284315)

In [23]:
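# Cut the shuffled normal transactions at 60% and 80% to get a 60/20/20 split.
# Note: NormalFraud holds the last 20% of the *normal* rows (the test portion), not fraud cases.
n_normal = len(NormalTransactionShuffled)
NormalTrain, NormalValidate, NormalFraud = np.split(
    NormalTransactionShuffled, [int(0.6 * n_normal), int(0.8 * n_normal)])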
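To see why two cut indices give three pieces, here is a small illustration of `np.split` on a plain array. When given a list of indices, it cuts at those positions, so cuts at 60% and 80% of the length yield a 60/20/20 split. The array and variable names are made up for this example.

```python
import numpy as np

values = np.arange(10)

# Cut positions at 60% and 80% of the length, i.e. indices 6 and 8.
cuts = [int(0.6 * len(values)), int(0.8 * len(values))]

# np.split returns values[:6], values[6:8], values[8:].
first, second, third = np.split(values, cuts)

print(len(first), len(second), len(third))
```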

In [24]:
NormalTrain.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Class,normAmount
227143,2.008021,-0.210928,-3.134833,-0.343703,2.597961,3.286339,-0.555582,0.832574,0.677253,-0.655852,...,-0.031826,0.094989,0.072766,0.63186,0.123913,0.65912,-0.022794,-0.035894,0,-0.313249
111151,1.178239,-0.112266,-0.528712,0.05566,-0.067912,-1.110653,0.532842,-0.377442,-0.088516,-0.116822,...,-0.027329,-0.272626,-0.226306,-0.034026,0.57489,1.092502,-0.133263,0.001482,0,0.046379
274188,-0.55931,0.252248,0.718171,-0.061253,1.167424,-0.58395,0.425633,0.035439,-0.653711,-0.226866,...,-0.137535,-0.517707,-0.022232,-0.462955,-0.407223,0.374007,0.067637,0.136836,0,-0.349671
36596,-0.707835,1.119971,1.288918,0.942082,-0.084526,-1.000469,0.756299,-0.238079,-0.29902,0.230965,...,0.138632,0.536785,-0.186578,0.713265,-0.390907,-0.434221,-0.539133,-0.359817,0,-0.311529
123354,-1.386725,1.067763,1.822499,-0.682421,-0.377616,-0.680118,0.282821,0.430088,0.150079,-0.372859,...,0.060073,0.011975,-0.091141,0.443813,-0.236341,-0.925158,-0.202221,-0.115952,0,-0.325283


In [25]:
len(NormalTrain) + len(NormalValidate) + len(NormalFraud) 

284315

In [28]:
len(NormalTransaction)

284315