## Credit Card Fraud Detection


#### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- The goal of this project is to predict whether a credit card transaction is fraudulent or genuine.


### 2) Data Collection
- Dataset Source - https://www.kaggle.com/mlg-ulb/creditcardfraud
- The data consists of 31 columns and 284807 rows.

### 2.1 Import Data and Required Packages
#### Importing Pandas, NumPy, Matplotlib, Seaborn and the Warnings Library

In [43]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [44]:
df = pd.read_csv('../kaggle_data/creditcard.csv')

#### Show Top 5 Records

In [45]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


#### Shape of the dataset

In [46]:
df.shape

(284807, 31)

### 2.2 Dataset information

- Time - Seconds elapsed between this transaction and the first transaction in the data set.
- V1 to V28 - Anonymized numerical features. The original transaction details are sensitive, so they are published only in this encoded form (the Kaggle dataset description states that they are the result of a PCA transformation).
- Amount - The transaction amount.
- Class - The label: 0 means the transaction is genuine and 1 means the transaction is fraudulent.

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check the categories present in each categorical column
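
The checks listed above can also be gathered into one small helper. The following is only a sketch: `run_basic_checks` and the `toy` frame are illustrative names and made-up values, not part of the original notebook.

```python
import pandas as pd

def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks into one dictionary."""
    return {
        "missing_values": int(frame.isna().sum().sum()),  # total NaN cells
        "duplicate_rows": int(frame.duplicated().sum()),  # repeats after the first copy
        "dtypes": frame.dtypes.astype(str).to_dict(),     # column -> dtype name
        "unique_counts": frame.nunique().to_dict(),       # column -> distinct non-null values
    }

# Tiny made-up frame standing in for the real data
toy = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 1.0],
    "Amount": [149.62, 149.62, 2.69, None],
    "Class": [0, 0, 0, 1],
})

report = run_basic_checks(toy)
print(report)
```

On the real data the same call would be `run_basic_checks(df)`.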

### 3.1 Check Missing values

In [47]:
df.isna().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

#### There are no missing values in the data set

### 3.2 Check Duplicates

In [48]:
df.duplicated().sum()

1081

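#### There are 1081 duplicate rows in the data set

In [49]:
# Remove the duplicates. Note: keep=False drops every copy of a duplicated row
# (1854 rows here), not only the repeats; keep='first' would retain one copy of each.
df = df.drop_duplicates(keep=False)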
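
The row count falls from 284807 to 282953 (1854 rows) even though `duplicated()` reports 1081. The sketch below uses a made-up `toy` frame to show why: `duplicated()` flags only the repeats after the first copy, while `keep=False` also drops the first copy of each duplicated row.

```python
import pandas as pd

# Made-up frame: one row appears twice, another three times
toy = pd.DataFrame({
    "Amount": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0],
    "Class":  [0, 0, 0, 1, 1, 1],
})

n_flagged = int(toy.duplicated().sum())         # repeats after the first copy
keep_first = toy.drop_duplicates(keep="first")  # one copy of each row survives
keep_none = toy.drop_duplicates(keep=False)     # every duplicated row is dropped

print(n_flagged, len(keep_first), len(keep_none))  # 3 3 1
```

Whether to use `keep='first'` or `keep=False` is a judgment call. The first keeps one copy of every distinct transaction, the second discards any transaction that appears more than once.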

### 3.3 Check data types

In [50]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 282953 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    282953 non-null  float64
 1   V1      282953 non-null  float64
 2   V2      282953 non-null  float64
 3   V3      282953 non-null  float64
 4   V4      282953 non-null  float64
 5   V5      282953 non-null  float64
 6   V6      282953 non-null  float64
 7   V7      282953 non-null  float64
 8   V8      282953 non-null  float64
 9   V9      282953 non-null  float64
 10  V10     282953 non-null  float64
 11  V11     282953 non-null  float64
 12  V12     282953 non-null  float64
 13  V13     282953 non-null  float64
 14  V14     282953 non-null  float64
 15  V15     282953 non-null  float64
 16  V16     282953 non-null  float64
 17  V17     282953 non-null  float64
 18  V18     282953 non-null  float64
 19  V19     282953 non-null  float64
 20  V20     282953 non-null  float64
 21  V21     282953 

### 3.4 Checking the number of unique values of each column

In [51]:
df.nunique()

Time      124527
V1        274897
V2        274897
V3        274897
V4        274897
V5        274897
V6        274897
V7        274897
V8        274897
V9        274897
V10       274897
V11       274897
V12       274897
V13       274897
V14       274897
V15       274897
V16       274897
V17       274897
V18       274897
V19       274897
V20       274897
V21       274897
V22       274897
V23       274897
V24       274897
V25       274897
V26       274897
V27       274897
V28       274897
Amount     32737
Class          2
dtype: int64

### 3.5 Check statistics of data set

In [52]:
df.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,...,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0,282953.0
mean,94816.256714,0.010161,-0.006837,0.002906,-0.004665,0.003311,-0.001734,0.002985,-0.002038,-0.002651,...,-0.000316,0.000184,0.000332,0.000372,-0.000347,0.000317,0.00283,0.00074,88.534756,0.001626
std,47479.631543,1.94099,1.643708,1.504189,1.413356,1.374938,1.331984,1.223249,1.173378,1.094047,...,0.721104,0.724223,0.623093,0.605599,0.521199,0.481876,0.391139,0.327223,250.56757,0.040287
min,0.0,-56.40751,-72.715728,-48.325589,-5.683171,-113.743307,-26.160506,-43.557242,-73.216718,-13.434066,...,-34.830382,-10.933144,-44.807735,-2.836627,-10.295397,-2.604551,-22.565679,-15.430084,0.0,0.0
25%,54213.0,-0.912989,-0.601721,-0.888987,-0.851101,-0.688407,-0.769506,-0.55147,-0.209036,-0.645213,...,-0.228236,-0.542743,-0.161658,-0.354423,-0.317659,-0.326567,-0.070453,-0.052736,5.59,0.0
50%,84704.0,0.022459,0.062929,0.180273,-0.023625,-0.052817,-0.275914,0.041333,0.021522,-0.052847,...,-0.02937,0.007041,-0.011184,0.041074,0.016162,-0.052152,0.001564,0.011312,22.0,0.0
75%,139294.0,1.316582,0.797751,1.02719,0.737319,0.612704,0.39522,0.570666,0.324281,0.594912,...,0.186184,0.528316,0.147729,0.43988,0.350621,0.239885,0.09131,0.07827,77.71,0.0
max,172792.0,2.45493,22.057729,9.382558,16.875344,34.801666,73.301626,120.589494,20.007208,15.594995,...,22.614889,10.50309,22.528412,4.584549,7.519589,3.517346,31.612198,33.847808,25691.16,1.0


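#### Insight
- From the above description of the numerical data, the means of V1 to V28 are all very close to zero
- Their standard deviations are also of similar magnitude, between roughly 0.33 (V28) and 1.94 (V1) among the columns shown
- Time and Amount are on much larger scales (Amount has a mean of about 88.5 and a standard deviation of about 250.6)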

### 3.6 Exploring Data

In [53]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### 3.7 Distribution of Data

In [54]:
# distribution of class 
df['Class'].value_counts()

Class
0    282493
1       460
Name: count, dtype: int64

1 - The data is highly imbalanced, so the classes need to be rebalanced (this notebook uses undersampling below) so that the model does not simply learn to predict the majority class
2 - There are 460 fraudulent transactions and 282493 genuine transactions
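
To put the imbalance into numbers, the short sketch below computes the fraud share and the ratio of genuine to fraudulent rows. The counts are copied from the `value_counts()` output above; `class_counts`, `fraud_share` and `imbalance_ratio` are illustrative names.

```python
import pandas as pd

# Counts copied from the value_counts() output above
class_counts = pd.Series({0: 282493, 1: 460}, name="count")

fraud_share = class_counts.loc[1] / class_counts.sum()      # fraction of fraud rows
imbalance_ratio = class_counts.loc[0] / class_counts.loc[1]  # genuine rows per fraud row

print(f"Fraud share: {fraud_share:.4%}")
print(f"Genuine transactions per fraud: {imbalance_ratio:.0f}")
```

With a fraud share this small, a model that always predicts class 0 would still score very high accuracy, so accuracy alone is a misleading metric on the unbalanced data.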

In [55]:
# define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 31 numerical features : ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']

We have 0 categorical features : []
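
As an alternative to comparing each dtype against `'O'`, pandas can split the columns with `select_dtypes`. This is a sketch on a made-up frame; the `Merchant` column is hypothetical and does not exist in the credit card data.

```python
import pandas as pd

# Made-up frame with one hypothetical text column
toy = pd.DataFrame({
    "Amount": [1.5, 2.0],
    "Class": [0, 1],
    "Merchant": ["a", "b"],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"We have {len(numeric_cols)} numerical features : {numeric_cols}")
print(f"We have {len(categorical_cols)} categorical features : {categorical_cols}")
```

Note that `Class` is counted as numerical by either approach, even though it is the label rather than a feature.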


In [56]:
df.head(2)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0


### 4. Separating the Data for Analysis

In [57]:
legit = df[df.Class == 0]
fraud = df[df.Class == 1]

In [58]:
print(legit.shape)
print(fraud.shape)  

(282493, 31)
(460, 31)


In [59]:
legit.Amount.describe()

count    282493.000000
mean         88.476932
std         250.543853
min           0.000000
25%           5.640000
50%          22.000000
75%          77.600000
max       25691.160000
Name: Amount, dtype: float64

In [60]:
fraud.Amount.describe()   

count     460.000000
mean      124.045239
std       262.620752
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
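
The two `describe()` calls above can also be produced side by side with a single `groupby`, which makes the per-class comparison easier to read. A minimal sketch on made-up amounts (the `toy` frame is illustrative only):

```python
import pandas as pd

# Made-up amounts: three genuine rows and three fraud rows
toy = pd.DataFrame({
    "Amount": [5.0, 22.0, 78.0, 1.0, 9.0, 106.0],
    "Class":  [0, 0, 0, 1, 1, 1],
})

# One row of summary statistics per class
summary = toy.groupby("Class")["Amount"].describe()
print(summary[["count", "mean", "50%"]])
```

On the real data this would be `df.groupby('Class')['Amount'].describe()`.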

In [61]:
df.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94840.240208,0.017304,-0.012241,0.013551,-0.01185,0.007926,0.000553,0.01111,-0.00363,0.001367,...,-0.000208,-0.000959,2e-05,0.000503,0.000554,-0.00041,0.000247,0.002449,0.000607,88.476932
1,80087.628261,-4.376653,3.311484,-6.534364,4.407571,-2.830642,-1.406285,-4.986946,0.97551,-2.47067,...,0.416612,0.39472,0.100917,-0.105252,-0.111322,0.038273,0.043374,0.236649,0.082442,124.045239


### The data is imbalanced, so we undersample the majority class

### 4.1 Under Sampling 

In [62]:
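# Undersampling: randomly draw as many legitimate rows as there are fraud rows
# (no random_state is set, so each run draws a different sample)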
df_legit = legit.sample(len(fraud))

In [63]:
# Concatenate the fraud rows and the sampled legitimate rows
df = pd.concat([fraud, df_legit], axis=0)
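
Because the sample above is drawn without a seed, the balanced frame changes on every run. The sketch below shows one way to make the undersampling reproducible and to shuffle the result. It runs on synthetic data, and `toy`, `legit_toy`, `fraud_toy`, `balanced` and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1000 genuine rows, 20 fraud rows
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.exponential(scale=80.0, size=1020).round(2),
    "Class": [0] * 1000 + [1] * 20,
})

legit_toy = toy[toy["Class"] == 0]
fraud_toy = toy[toy["Class"] == 1]

# random_state fixes the draw so the same rows are selected on every run
legit_sample = legit_toy.sample(n=len(fraud_toy), random_state=42)

# Concatenate, then shuffle so the classes are not stored in two solid blocks
balanced = (
    pd.concat([fraud_toy, legit_sample], axis=0)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["Class"].value_counts())
```

Undersampling discards most of the genuine transactions, so the balanced frame is small. That is the trade-off accepted here in exchange for equal class sizes.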

In [66]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
541,406.0,-2.312227,1.951992,-1.609851,3.997906,-0.522188,-1.426545,-2.537387,1.391657,-2.770089,...,0.517232,-0.035049,-0.465211,0.320198,0.044519,0.17784,0.261145,-0.143276,0.0,1
623,472.0,-3.043541,-3.157307,1.088463,2.288644,1.359805,-1.064823,0.325574,-0.067794,-0.270953,...,0.661696,0.435477,1.375966,-0.293803,0.279798,-0.145362,-0.252773,0.035764,529.0,1
4920,4462.0,-2.30335,1.759247,-0.359745,2.330243,-0.821628,-0.075788,0.56232,-0.399147,-0.238253,...,-0.294166,-0.932391,0.172726,-0.08733,-0.156114,-0.542628,0.039566,-0.153029,239.93,1
6108,6986.0,-4.397974,1.358367,-2.592844,2.679787,-1.128131,-1.706536,-3.496197,-0.248778,-0.247768,...,0.573574,0.176968,-0.436207,-0.053502,0.252405,-0.657488,-0.827136,0.849573,59.0,1
6329,7519.0,1.234235,3.01974,-4.304597,4.732795,3.624201,-1.357746,1.713445,-0.496358,-1.282858,...,-0.379068,-0.704181,-0.656805,-1.632653,1.488901,0.566797,-0.010016,0.146793,1.0,1


In [67]:
df['Class'].value_counts()

Class
1    460
0    460
Name: count, dtype: int64

In [68]:
df.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,93538.791304,-0.09314,-0.221958,0.003601,0.044951,-0.010756,0.000128,-0.045588,0.122912,0.008247,...,-0.051628,-0.02636,-0.002283,0.018515,-0.016437,0.041842,0.032153,-0.006081,-0.016496,95.163739
1,80087.628261,-4.376653,3.311484,-6.534364,4.407571,-2.830642,-1.406285,-4.986946,0.97551,-2.47067,...,0.416612,0.39472,0.100917,-0.105252,-0.111322,0.038273,0.043374,0.236649,0.082442,124.045239
