# Credit Card Fraud Detection - Data Analysis Layer

*By Satyam Sharma, B.Tech CSE(ML&AI), 5th Sem, Section ML, Graphic Era Deemed University*


## Context

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

## Content of the Data Set

The dataset contains transactions made with credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

Let us explore the dataset in a bit more detail.

---

First, let's import the libraries required to work on the data.

In [18]:
import numpy as np
import pandas as pd

Then load the dataset itself into a DataFrame.

In [3]:
df = pd.read_csv('creditcard.csv')

Now let's start looking into the data: reading the dimensions of the dataset using its shape attribute, then getting a general idea of its composition using head() and describe().

In [6]:
print(df.shape)

(284807, 31)


In [8]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [9]:
df.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.91956e-15,5.688174e-16,-8.769071e-15,2.782312e-15,-1.552563e-15,2.010663e-15,-1.694249e-15,-1.927028e-16,-3.137024e-15,...,1.537294e-16,7.959909e-16,5.36759e-16,4.458112e-15,1.453003e-15,1.699104e-15,-3.660161e-16,-1.206049e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


This is a binary classification problem: each transaction is either fraudulent (true) or legitimate (false). In the given dataset, this label is stored in the 'Class' column, and the describe() output shows that its distribution is heavily skewed: the mean of 'Class' is only about 0.0017.

Let us take a closer look at this. 

In [19]:
fraud = df[df['Class'] == 1]
legit = df[df['Class'] == 0]

print('Number of fraudulent cases = {}'.format(len(fraud)))
print('Number of legit cases = {}'.format(len(legit)))

# Ratio of fraudulent to legit cases (this divides by the legit count, not by all transactions)
outlierRatio = len(fraud) / float(len(legit))
print("Fraud-to-legit ratio = ", outlierRatio * 100, "%")

Number of fraudulent cases = 492
Number of legit cases = 284315
Fraud-to-legit ratio =  0.17304750013189596 %


---
Given this result, with fraudulent transactions (Class 1) making up only about 0.17% of the data (492 out of 284,807), we are clearly dealing with imbalanced data.

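The Amount column of describe() only summarises all transactions together (mean of about 88.35), so it cannot separate the two classes by itself. Comparing the average Amount per class shows that fraudulent transactions are larger on average, which makes this problem even more important to deal with.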
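One way to check the claim about average amounts is to compute the mean Amount separately for each class. The following is a minimal sketch, assuming the dataset's column names 'Amount' and 'Class'; the small frame `toy` is made-up data standing in for `df`, so its numbers are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; these values are invented for illustration.
toy = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0, 100.0, 200.0],
    "Class": [0, 0, 0, 1, 1],
})

# Mean transaction amount for each class (0 = legit, 1 = fraud).
mean_amount_by_class = toy.groupby("Class")["Amount"].mean()
print(mean_amount_by_class)
```

On the real data, the same comparison would be `df.groupby('Class')['Amount'].mean()`.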

## Result of the Data Analysis

An imbalance in data has been found, and hence we can classify this problem as an imbalanced classification problem. 

There are several ways to deal with this problem.

**Change the performance metric** - Plain accuracy is misleading on imbalanced data, because a model that labels every transaction as legit is still right almost all of the time. Other measures are therefore used to evaluate the model. For example:

Confusion Matrix: a table showing correct predictions and the types of incorrect predictions.

Precision: the number of true positives divided by all positive predictions. Low precision indicates a high number of false positives.

Recall: the number of true positives divided by the number of positive values in the test data. Low recall indicates a high number of false negatives.

F1 Score: the harmonic mean of precision and recall.

AUC: the area under the ROC curve, which plots the true positive rate against the false positive rate at all classification thresholds.
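The threshold-based metrics above can all be derived from the four confusion-matrix counts. Below is a minimal pure-Python sketch, with F1 computed as the harmonic mean of precision and recall. The helper names `confusion_counts` and `precision_recall_f1` and the label lists are made up for illustration; in practice the functions in `sklearn.metrics` (such as `confusion_matrix`, `precision_score`, `recall_score`, `f1_score` and `roc_auc_score`) would normally be used instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 (fraud) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 (harmonic mean of the two) for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: 2 frauds among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

In this toy case the accuracy is 0.8 even though only one of the two frauds is caught (recall 0.5), which is why accuracy alone is a poor guide on imbalanced data.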

**Random Oversampling** - Oversampling balances the data set by adding copies of minority class samples. In Random Oversampling, the added samples are selected at random from the minority class. This technique can be used if the data set is small, but it may cause overfitting.
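As a rough illustration of random oversampling, the sketch below duplicates randomly chosen minority rows until both classes have the same size. The function name `random_oversample` and the toy data are hypothetical, and the code assumes binary labels with at least one minority sample.

```python
import random


def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Draw with replacement from the minority indices to fill the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


# Made-up data: five legit rows and one fraud row.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
y = [0, 0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(y_res.count(0), y_res.count(1))
```

Resampling is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.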

**SMOTE Oversampling** - Creates synthetic samples from the minority class instead of exact copies, which reduces the risk of overfitting.

SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
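The steps above can be sketched for a single synthetic point as follows. This is a simplified illustration, not a full SMOTE implementation: the function name `smote_sample` and the toy points are made up, and it assumes at least two minority samples. For real use, the imbalanced-learn package provides a `SMOTE` class in `imblearn.over_sampling`.

```python
import random


def smote_sample(minority, k=3, seed=0):
    """Create one synthetic point from a list of minority-class feature vectors."""
    rng = random.Random(seed)
    a = rng.choice(minority)

    def squared_distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    # k nearest minority neighbours of a (excluding a itself).
    others = [p for p in minority if p is not a]
    neighbours = sorted(others, key=lambda p: squared_distance(a, p))[:k]
    b = rng.choice(neighbours)

    # Convex combination of a and b: a point on the segment between them.
    lam = rng.random()
    synthetic = [ai + lam * (bi - ai) for ai, bi in zip(a, b)]
    return a, b, synthetic


# Made-up minority-class points in a 2-D feature space.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
a, b, synthetic = smote_sample(minority, k=2)
print(a, b, synthetic)
```

Repeating this procedure until the minority class is large enough gives the oversampled data set.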

**Random Undersampling** - Undersampling balances the data set by removing samples that belong to the majority class. This technique can be used if the data set is large. In Random Undersampling, information may be lost because the removed samples are chosen at random.
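A minimal sketch of random undersampling is shown below: it keeps every minority row and a random subset of majority rows of the same size. The function name `random_undersample` and the toy data are hypothetical, and the code assumes the majority class is at least as large as the minority class.

```python
import random


def random_undersample(X, y, minority_label=1, seed=0):
    """Keep all minority rows plus an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Sample without replacement; the majority rows not drawn here are discarded.
    kept_majority = rng.sample(majority, len(minority))
    keep = sorted(minority + kept_majority)
    return [X[i] for i in keep], [y[i] for i in keep]


# Made-up data: four legit rows and two fraud rows.
X = [[0.1], [0.2], [0.3], [0.4], [9.0], [9.5]]
y = [0, 0, 0, 0, 1, 1]

X_res, y_res = random_undersample(X, y)
print(y_res.count(0), y_res.count(1))
```

The discarded majority rows are exactly the information that is lost with this technique.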

**NearMiss Undersampling** - It aims to limit the loss of information by choosing which majority class samples to keep instead of dropping them at random. It is based on the KNN algorithm: the distances between the majority class samples and the minority class samples are calculated, and the majority class samples that are closest to the minority class, according to the criterion of the chosen version, are preserved.

There are three versions:

NearMiss-1: Majority class examples with minimum average distance to the three closest minority class examples.

NearMiss-2: Majority class examples with minimum average distance to the three furthest minority class examples.

NearMiss-3: Majority class examples with minimum distance to each minority class example.

NearMiss-3 seems desirable, given that it will only keep those majority class examples that are on the decision boundary.
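To make the selection rule concrete, here is a small sketch of NearMiss-1 only: it keeps the majority points with the smallest average distance to their closest minority points. The function name `nearmiss1`, its parameters and the 1-D toy points are made up for illustration. The imbalanced-learn package offers a `NearMiss` class in `imblearn.under_sampling` with a `version` parameter for real use.

```python
def nearmiss1(majority, minority, n_keep, n_neighbors=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their n_neighbors closest minority points (NearMiss-1)."""

    def distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

    def mean_distance_to_closest_minority(p):
        closest = sorted(distance(p, q) for q in minority)[:n_neighbors]
        return sum(closest) / len(closest)

    return sorted(majority, key=mean_distance_to_closest_minority)[:n_keep]


# Made-up 1-D points: the minority class sits near 0-2, the majority is spread out.
minority = [[0.0], [1.0], [2.0]]
majority = [[1.5], [5.0], [10.0], [3.0]]

kept = nearmiss1(majority, minority, n_keep=2)
print(kept)
```

Here the two majority points nearest to the minority cluster are kept and the distant ones are dropped.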

---

It is suggested that the Random Forest classifier be used to get better results, since a Random Forest also reports which features are most important.