# Data Collection

## Pull data

Command line code:
- `pip install kaggle`  
- `kaggle datasets download -d mlg-ulb/creditcardfraud`

After (1) installing the Kaggle CLI and (2) authenticating with an API token to download the dataset, code is needed to locate and read the CSV file. The path used below is relative to the notebook's working directory.
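The download command saves a zip archive rather than a ready-to-read CSV, so one extraction step sits between the download and the read. The following is a minimal sketch of that step, not the notebook's own code. The helper name `extract_raw_csv` is hypothetical, and the member name `creditcard.csv` is an assumption about the archive's contents; only the target name `01_raw.csv` comes from the notebook.

```python
import shutil
import zipfile
from pathlib import Path


def extract_raw_csv(zip_path, out_dir, member="creditcard.csv", target="01_raw.csv"):
    """Copy one CSV out of a downloaded archive and save it under a new name.

    `member` is an assumed file name inside the archive; adjust it if the
    archive lists something different (see zipfile.ZipFile.namelist()).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target_path = out_dir / target
    # Stream the member to disk so the whole file is never held in memory
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as src, open(target_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target_path
```

A call such as `extract_raw_csv("creditcardfraud.zip", "../data")` would then produce `../data/01_raw.csv`, assuming the archive was saved under that name in the current directory.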

## Import libraries

In [1]:
import csv
import pandas as pd
import matplotlib
%matplotlib inline
import plotly.graph_objects as go # create graph objects for customized plots
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected = True)

# define the data directory (relative to the notebook's working directory)
import os
is_local = False
if is_local: # in current working directory
    path = "../credit_card_fraud_detection/notebooks"
else:
    path = "../data"
print("File:", os.listdir(path))

File: ['01_raw.csv']


## Read data

In [2]:
df_01_raw = pd.read_csv(os.path.join(path, "01_raw.csv"))

## Check data

In [3]:
rows = df_01_raw.shape[0]
columns = df_01_raw.shape[1]
size = df_01_raw.size
print("Number of rows: ", rows)
print("Number of columns: ", columns)
print("Number of elements: ", size)

Number of rows:  284807
Number of columns:  31
Number of elements:  8829017


### Glimpse data

In [4]:
first_five = df_01_raw.head()
print("First five rows:\n\n", first_five)

First five rows:

    Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26  

In [5]:
central_tendencies = df_01_raw.describe()
print("Descriptive statistics for each attribute:\n\n", central_tendencies)

Descriptive statistics for each attribute:

                 Time            V1            V2            V3            V4  \
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean    94813.859575  1.758743e-12 -8.252298e-13 -9.636929e-13  8.316157e-13   
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00   
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00   
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01   
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02   
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01   
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01   

                 V5            V6            V7            V8            V9  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean   1.591952e-13  4.247354e-13 -3.050180e-13  8.693344e-14 -1.179712e-12 

### Check missing data

In [6]:
# number of null values per column, in descending order
total_number_missing_elements = df_01_raw.isnull().sum().sort_values(ascending = False)

# percentage of null values per column (null count / total row count * 100), in descending order
percent_missing_elements = (df_01_raw.isnull().sum() / df_01_raw.isnull().count() * 100).sort_values(ascending = False)

# concatenate the two Series as columns, then transpose the resulting dataframe
pd.concat([total_number_missing_elements, percent_missing_elements], axis = 1, keys = ['Total', 'Percent']).transpose()

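| | Class | V14 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Percent | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |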
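Because the real dataset has no missing values, the summary above is all zeros and does not show what the check looks like when gaps exist. The sketch below runs the same idea on a small toy frame; `df_toy` and its values are hypothetical stand-ins, not data from `01_raw.csv`. It also uses `isnull().mean()` as a shorter way to get the missing fraction, which is a swapped-in shortcut rather than the notebook's expression.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_01_raw (hypothetical values)
df_toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 3.0],
    "Amount": [10.0, np.nan, 5.0, np.nan],
    "Class": [0, 0, 1, 0],
})

# Count of nulls per column
missing_total = df_toy.isnull().sum()
# Mean of a boolean mask is the fraction of True values, i.e. the share missing
missing_percent = df_toy.isnull().mean() * 100

# Combine as columns, sort by the null count, then transpose for a wide view
missing_summary = (
    pd.concat([missing_total, missing_percent], axis=1, keys=["Total", "Percent"])
    .sort_values("Total", ascending=False)
    .transpose()
)
print(missing_summary)
```

Here the `Amount` column has two nulls out of four rows, so it sorts first and reports 50 percent missing, while the other columns report zero.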


### Check for imbalanced class distribution

In [7]:
class_value_count = df_01_raw["Class"].value_counts() # class value count method resulting in count of each value in class column
df_class_value_count = pd.DataFrame({'Class' : class_value_count.index, 'values' : class_value_count.values})
df_class_value_count

| | Class | values |
|---|---|---|
| 0 | 0 | 284315 |
| 1 | 1 | 492 |


In [8]:
# graph object specifications of data to be plotted
trace = go.Bar(
    x = df_class_value_count['Class'], y = df_class_value_count['values'],
    name = 'Imbalanced Credit Card Fraud Class Distribution (0 = Not fraudulent, 1 = Fraudulent)',
    marker = dict(color = 'Red'),
    text = df_class_value_count['values']
)

In [9]:
data = [trace]

In [10]:
# dictionary representation of layout (<br> gives an explicit line break in the title)
layout = dict(title = 'Imbalanced Credit Card Fraud Class Distribution<br>(0 = Not fraudulent, 1 = Fraudulent)',
             xaxis = dict(title = 'Class', showticklabels = True),
             yaxis = dict(title = 'Number of Transactions'),
             hovermode = 'closest', width = 500
             )

In [11]:
# pass trace and layout specifications to plotly
fig = dict(data = data, layout = layout) # figure represented as dictionary
iplot(fig, filename = 'class') # render the plot inline in the notebook (offline mode)

The target variable, 'Class', is highly imbalanced: only 492 of 284,807 transactions (0.173%) are fraudulent. Conventional machine learning algorithms that do not account for class distribution can therefore be biased toward the majority class and miss fraudulent transactions. Exploratory data analysis (EDA) is the next step, to learn as much as possible about the data before modeling.
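The arithmetic behind that percentage, and the reason imbalance makes plain accuracy misleading, can be checked directly from the two class counts. The sketch below is illustrative only: the counts come from the value counts above, while the "always predict class 0" baseline is a hypothetical model introduced here to make the point.

```python
# Class counts taken from the value_counts() output above
n_total = 284807
n_fraud = 492
n_legit = n_total - n_fraud

# Share of fraudulent transactions, as a percentage
fraud_pct = n_fraud / n_total * 100

# Hypothetical baseline that labels every transaction as class 0:
# it is right on every legitimate transaction and wrong on every fraud
baseline_accuracy = n_legit / n_total * 100
baseline_fraud_recall = 0 / n_fraud

print(f"Fraudulent share: {fraud_pct:.3f}%")
print(f"Accuracy of always predicting 0: {baseline_accuracy:.3f}%")
print(f"Fraud recall of that baseline: {baseline_fraud_recall:.0%}")
```

A model can score very high on accuracy while catching no fraud at all, so metrics such as recall or precision on the fraudulent class are likely to be more informative for this dataset.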