# Credit Card Fraud Detection
by **Sai Vamsy Dhulipala**

## Part II - Data Preprocessing

## Importing necessary packages

In [1]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 400)

In [2]:
from os import chdir, getcwd
from collections import Counter
import pickle

In [3]:
from sklearn.decomposition import PCA
from imblearn.combine import SMOTEENN

## Importing the data

### Defining a function to read the data

In [4]:
def read_data(name):
    # Load a pickled data set from data/cleaned_data/ under the current working directory
    with open(getcwd() + "/data/cleaned_data/" + name + ".pkl", "rb") as f:
        return pickle.load(f)

### Reading the data

In [5]:
folder_path = input("Please provide the path of the project folder, using / instead of \\: ")
chdir(folder_path)

In [6]:
train = read_data("train")
valid = read_data("valid")
test = read_data("test")

## Preprocessing data for model building

#### 1. Splitting the data into X and y

In [7]:
X_train = train.drop("is_fraud", axis=1)
y_train = train["is_fraud"]

X_valid = valid.drop("is_fraud", axis=1)
y_valid = valid["is_fraud"]

X_test = test.drop("is_fraud", axis=1)
y_test = test["is_fraud"]

#### 2. Applying PCA to preserve 98% of the variance in the data

Note that the 98% variance threshold was selected by running a grid search with cross-validation (Grid Search CV) on a logistic regression model. That search is not included in this notebook due to computational constraints.
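The grid search itself is not part of this notebook. As a rough sketch of how such a search could be set up, assuming a scikit-learn `Pipeline` that chains `PCA` with `LogisticRegression`: the candidate thresholds, the `recall` scoring, the 3-fold split, and the synthetic data from `make_classification` are all illustrative assumptions, not the settings actually used.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Small synthetic, imbalanced stand-in for the real training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

# PCA feeds its reduced features into logistic regression
pipeline = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical candidate fractions of variance to preserve
candidate_fractions = [0.90, 0.95, 0.98]
search = GridSearchCV(
    pipeline,
    param_grid={"pca__n_components": candidate_fractions},
    scoring="recall",  # assumption: recall is the metric of interest
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Fitting PCA inside the pipeline means it is re-fitted on each training fold, so the held-out fold never influences the projection.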

In [8]:
pca = PCA(n_components=0.98)
pca = pca.fit(X_train)

In [9]:
X_train_decomposed = pca.transform(X_train)
X_valid_decomposed = pca.transform(X_valid)
X_test_decomposed = pca.transform(X_test)

In [10]:
print(f"Shape of train set: {X_train_decomposed.shape}")
print(f"Shape of valid set: {X_valid_decomposed.shape}")
print(f"Shape of test set: {X_test_decomposed.shape}")

Shape of train set: (907672, 68)
Shape of valid set: (389003, 68)
Shape of test set: (555719, 68)
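
To illustrate what a fractional `n_components` means: scikit-learn documents that, for a value between 0 and 1, PCA keeps enough components for the cumulative explained variance to exceed that fraction. The following NumPy sketch reproduces that selection rule on synthetic data. The helper name `components_for_variance` and the demo data are hypothetical and are not part of the notebook.

```python
import numpy as np

def components_for_variance(X, threshold):
    """Return (k, cumulative_ratio), where k is the smallest number of
    principal components whose cumulative explained variance ratio
    exceeds `threshold`."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered data give the component variances
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    cumulative_ratio = np.cumsum(explained_variance / explained_variance.sum())
    # Index of the first cumulative value strictly above the threshold, plus one
    k = int(np.searchsorted(cumulative_ratio, threshold, side="right") + 1)
    return k, cumulative_ratio

# Synthetic data: 3 latent directions spread over 10 features, plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

k, cumulative_ratio = components_for_variance(X_demo, 0.98)
print(f"Components kept for 98% variance: {k}")
```

On the notebook's data, this kind of rule is what yields the 68 columns shown in the shapes above.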


#### 3. Resampling the data to handle class imbalance

SMOTEENN (over-sampling with SMOTE followed by under-sampling with ENN, Edited Nearest Neighbours) is used here because the classes are heavily imbalanced and the data set is large. SMOTEENN gave a 7% increase in recall compared to SMOTE and ADASYN. Building and evaluating the models with SMOTE and ADASYN is not included in this notebook due to computational constraints.

Also note that resampling was done on the validation set rather than the train set, since resampling the validation set alone took approximately 12 hours. The pickle files of the resampled data are therefore attached with the submission files to save time. The code to load these files is provided below; please use it when running the notebook.

In [11]:
# Resample with SMOTEENN (took approximately 12 hours on the validation set)
resampler = SMOTEENN(random_state=42)
X_resampled, y_resampled = resampler.fit_resample(X_valid_decomposed, y_valid)

# Alternative: load the pre-computed resampled data from data/resampled_data/
def read_resampled_data(name):
    with open(getcwd() + "/data/resampled_data/" + name + "_resampled.pkl", "rb") as f:
        return pickle.load(f)

X_resampled = read_resampled_data("X")
y_resampled = read_resampled_data("y")

In [12]:
print(Counter(y_resampled))

Counter({1: 386751, 0: 380700})


## Saving the data into pickle files

In [13]:
for df, file_name in (X_train_decomposed, "X_train_decomposed"), (X_valid_decomposed, "X_valid_decomposed"), (X_test_decomposed, "X_test_decomposed"):
    # Write each decomposed feature matrix to data/decomposed_data/ and close the file afterwards
    with open(getcwd() + "/data/decomposed_data/" + file_name + ".pkl", "wb") as f:
        pickle.dump(df, f)

In [15]:
for df, file_name in (y_train, "y_train"), (y_valid, "y_valid"), (y_test, "y_test"):
    # Write each target vector to data/decomposed_data/ and close the file afterwards
    with open(getcwd() + "/data/decomposed_data/" + file_name + ".pkl", "wb") as f:
        pickle.dump(df, f)

In [16]:
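for df, file_name in (X_resampled, "X_resampled"), (y_resampled, "y_resampled"):
    # Write the resampled features and targets to data/resampled_data/ and close the file afterwards
    with open(getcwd() + "/data/resampled_data/" + file_name + ".pkl", "wb") as f:
        pickle.dump(df, f)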
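
The three save loops above share one pattern, which could be factored into a small helper. This is a sketch only: the names `save_pickle` and `load_pickle` are hypothetical, and a temporary directory stands in for the notebook's `data/` folders. It uses `with` blocks so each file handle is closed as soon as it has been written or read.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, folder, file_name):
    """Pickle `obj` to <folder>/<file_name>.pkl and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    path = folder / f"{file_name}.pkl"
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path

def load_pickle(path):
    """Load and return the object stored in a pickle file."""
    with Path(path).open("rb") as f:
        return pickle.load(f)

# Round trip with toy stand-ins for the notebook's arrays (assumption)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = {"X_demo": [[0.1, 0.2], [0.3, 0.4]], "y_demo": [0, 1]}
    saved_paths = {
        name: save_pickle(values, Path(tmp_dir) / "decomposed_data", name)
        for name, values in demo.items()
    }
    restored = {name: load_pickle(path) for name, path in saved_paths.items()}
    saved_names = sorted(path.name for path in saved_paths.values())
```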

### Deleting all the variables present in the memory

In [17]:
%reset -f