In this project we will use various predictive models to see how accurately they detect whether a transaction is a normal payment or a fraud. As described in the dataset, the features are scaled and their names are withheld for privacy reasons.

The main challenge in modeling fraud detection as a classification problem is that, in real-world data, the vast majority of transactions are not fraudulent. Investment in fraud-detection technology has increased over the years, so this shouldn't be a surprise, but it leaves us with a problem: imbalanced data.
There are many ways of dealing with imbalanced data. We will focus on the following approaches:<br/>
1. Oversampling: RandomOverSampler<br/>
2. Oversampling: SMOTE<br/>
3. Undersampling: RandomUnderSampler<br/>
4. Combined Class Methods: SMOTE+ENN

<b>Understand the data</b>

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
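
To see why plain accuracy is a poor yardstick at this level of imbalance, the short calculation below works out the fraud rate and the accuracy of a trivial classifier that labels every transaction as normal. It is only a sketch that uses the two counts quoted above; the variable names are chosen for illustration.

```python
# Counts quoted in the dataset description.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions

# A "classifier" that always predicts the majority class (no fraud)
# is right on every normal transaction and wrong on every fraud.
baseline_accuracy = (n_transactions - n_frauds) / n_transactions

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Accuracy of always predicting 'no fraud': {baseline_accuracy:.3%}")
```

Such a model would score above 99.8% accuracy while catching no fraud at all, which is why precision and recall on the fraud class are more informative than accuracy here.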
    
The first thing we must do is get a basic sense of our data. Remember that, apart from the time and amount of each transaction, we don't know what the other columns represent (for privacy reasons). The only thing we know is that those unknown columns have already been scaled.<br/><br/>
The description of the data says that all the features except Time and Amount went through a PCA transformation (a dimensionality reduction technique). Keep in mind that features are normally scaled before PCA is applied, so we can assume that all the V features have been scaled.<br/><br/>
Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for example, in example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.
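
As a rough illustration of what example-dependent cost-sensitive evaluation could look like, the sketch below charges each missed fraud its transaction amount and each flagged transaction a fixed handling fee. The function name `total_cost`, the fee `admin_cost` and the toy numbers are assumptions made for this example; they do not come from the dataset description.

```python
def total_cost(y_true, y_pred, amounts, admin_cost=5.0):
    """Sum a per-transaction cost under a hypothetical cost model.

    - Flagged transaction (predicted 1): a fixed review fee.
    - Missed fraud (true 1, predicted 0): the full transaction amount.
    - Normal transaction left alone: no cost.
    """
    cost = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if pred == 1:
            cost += admin_cost
        elif truth == 1:
            cost += amount
    return cost

# Toy data: two frauds, one large and one small.
y_true = [0, 1, 1, 0]
amounts = [20.0, 500.0, 3.0, 80.0]

catch_big = [0, 1, 0, 0]    # catches only the 500.0 fraud
catch_small = [0, 0, 1, 0]  # catches only the 3.0 fraud

print(total_cost(y_true, catch_big, amounts))
print(total_cost(y_true, catch_small, amounts))
```

Both sets of predictions have the same recall (one fraud out of two), yet their costs differ widely, which is the point of weighting errors by 'Amount'.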



In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib.patches as mpatches
import time
# Classifier Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
import collections
# Other Libraries
from imblearn.datasets import fetch_datasets
from sklearn.model_selection import train_test_split
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import auc,precision_score, recall_score, f1_score, roc_auc_score, accuracy_score,precision_recall_curve, classification_report,confusion_matrix
from collections import Counter
from sklearn.model_selection import KFold, StratifiedKFold
import warnings
warnings.filterwarnings("ignore")
import os

In [None]:
import os
os.chdir("C:/Users/StephyJosin/Desktop/Projects/Credit_card")

df = pd.read_csv('creditcard.csv')
df.head()

In [None]:
df.describe()

In [None]:
#checking null values
df.isnull().sum().max()

In [None]:
# The classes are heavily skewed and we will fix this later.
class_share = df['Class'].value_counts(normalize=True)
print('No Frauds', round(class_share[0] * 100, 2), '% of the dataset')
print('Frauds', round(class_share[1] * 100, 2), '% of the dataset')

Most of the transactions are non-fraud. If we train our predictive models on this dataframe as it is, they will probably be biased toward the majority class: a model can reach high accuracy simply by "assuming" that every transaction is legitimate. But we don't want our model to assume; we want it to detect patterns that signal fraud!

In [None]:
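# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
sns.countplot(x='Class', data=df, palette=["#0101DF", "#DF0101"])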
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)

In [None]:
#Amount and Time feature distribution
fig, ax = plt.subplots(1, 2, figsize=(18,4))

amount_val = df['Amount'].values
time_val = df['Time'].values

# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(amount_val, ax=ax[0], color='r', kde=True, stat='density')
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])

sns.histplot(time_val, ax=ax[1], color='b', kde=True, stat='density')
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])


plt.show()

We are going to scale the Amount and Time features, since all the other V features have already been scaled.
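
The idea behind robust scaling can be sketched without any library: centre each value on the median and divide by the interquartile range (IQR), so that a few extreme amounts do not dictate the scale. The helper `robust_scale` and the toy list are hypothetical; the snippet mirrors the default median/IQR behaviour of scikit-learn's `RobustScaler` but is not its implementation, and quartile conventions may differ slightly.

```python
import statistics

def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    median = statistics.median(values)
    # "inclusive" uses linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

amounts = [1, 2, 3, 4, 100]  # toy data with one outlier
scaled = robust_scale(amounts)
print(scaled)
```

The outlier stays far from the rest after scaling, but it does not compress the other values toward zero the way a mean and standard deviation computed on the same list would.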

In [None]:

from sklearn.preprocessing import RobustScaler

# RobustScaler uses the median and interquartile range, so it is less sensitive to outliers.

rob_scaler = RobustScaler()

#create new scaled columns
df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))

#drop original columns
df.drop(['Time','Amount'], axis=1, inplace=True)

#plot amount and time after scaling
fig, ax = plt.subplots(1, 2, figsize=(18,4))
ax[0].set_title('Amount After Robust Scaling')
sns.kdeplot(df['scaled_amount'], ax=ax[0])
ax[1].set_title('Time After Robust Scaling')
sns.kdeplot(df['scaled_time'], ax=ax[1])
plt.show()


In [None]:
#Move new scaled columns to the front
cols_at_front = ['scaled_amount', 'scaled_time']
df = df[[c for c in cols_at_front if c in df]+[c for c in df if c not in cols_at_front]]
df.head()

<b>Splitting the data</b><br/><br/>
Before proceeding with the resampling techniques, we are going to split the dataset into a training set and a test set. We will then apply the resampling techniques to the training data only and build our models on the resampled training data, but we will test the models on the original, untouched test data.<br/>
When the classes are unbalanced, a <b>stratified split</b> is a good choice. It builds the training and test sets so that the class proportions in each are the same as in the whole dataset.
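
What a stratified split does can be sketched in a few lines of plain Python: shuffle the indices of each class separately and send the same fraction of each class to the test set. The function `stratified_split` and the toy labels are hypothetical and only illustrate the idea; they are not how scikit-learn implements it.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [0] * 95 + [1] * 5  # toy data: 5% positives
train_idx, test_idx = stratified_split(labels)
print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```

Both parts keep the 5% share of positives, whereas a plain random split of such a small minority could easily leave the test set with no positive examples at all.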


In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

print('No Frauds', round(df['Class'].value_counts(normalize=True)[0] * 100,2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts(normalize=True)[1]*100,2), '% of the dataset')

# Separate the features from the target column
X_features = df.drop('Class', axis=1)  # training features
y_target = df['Class']  # target column

print(X_features.head())
print(y_target.head())

# Stratification is done based on the y labels, so the train and test sets
# keep the same fraud ratio as the full dataset.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Only one train/test partition is needed, so take the single split directly
train_index, test_index = next(sss.split(X_features, y_target))
original_trainX, original_testX = X_features.iloc[train_index], X_features.iloc[test_index]
original_train_y, original_test_y = y_target.iloc[train_index], y_target.iloc[test_index]

# Check that the fraud share is preserved in both sets
print('Train fraud share:', round(original_train_y.mean() * 100, 3), '%')
print('Test fraud share:', round(original_test_y.mean() * 100, 3), '%')

<h1>Resampling Techniques</h1>

<h2>1. Random Under-Sampling</h2><br/>
Random under-sampling balances the class distribution by randomly removing majority-class examples until the majority and minority classes contain the same number of instances.<br/>

<b>Advantages</b><br/>
It can reduce run time and storage requirements by shrinking the training set, which helps when the training data is huge.<br/>
<b>Disadvantages</b><br/>
It can discard potentially useful information that could be important for building classifiers.<br/>
The sample chosen by random under-sampling may be biased and may not represent the population accurately, which can lead to inaccurate results on the actual test data.<br/>
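
A minimal sketch of random under-sampling, assuming list inputs and a hypothetical helper named `random_under_sample`: every class is cut down to the size of the smallest class by sampling without replacement. This only illustrates the idea; it is not imblearn's `RandomUnderSampler`.

```python
import random
from collections import Counter

def random_under_sample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, n_min))  # sampling without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

X = [[i] for i in range(10)]  # toy features
y = [0] * 8 + [1] * 2         # 8 majority rows, 2 minority rows
X_res, y_res = random_under_sample(X, y)
print(Counter(y_res))
```

Six of the eight majority rows are thrown away here, which makes the information loss described above concrete.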

<h2>2. Random Over-Sampling</h2><br/>
Random over-sampling increases the number of instances in the minority class by randomly replicating them, so that the minority class is better represented in the sample.<br/>
<b>Advantages</b><br/>
Unlike under-sampling, this method leads to no information loss.<br/>
It often outperforms under-sampling, although this depends on the dataset.<br/>
<b>Disadvantages</b><br/>
It increases the likelihood of overfitting, since it replicates the minority class events.<br/>
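
Random over-sampling can be sketched the same way, here with a hypothetical helper `random_over_sample` working on plain lists: minority rows are drawn with replacement and appended until every class matches the largest one. It is an illustration of the idea rather than imblearn's `RandomOverSampler`.

```python
import random
from collections import Counter

def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen rows until all classes have equal size."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_max = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        extra = rng.choices(idx, k=n_max - count)  # sampling with replacement
        X_res.extend(X[i] for i in extra)
        y_res.extend(label for _ in extra)
    return X_res, y_res

X = [[i] for i in range(10)]  # toy features
y = [0] * 8 + [1] * 2         # 8 majority rows, 2 minority rows
X_res, y_res = random_over_sample(X, y)
print(Counter(y_res))
```

All eight minority rows in the result are copies of the same two originals, which is why a model can end up memorising them.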

<h2>3. SMOTE: Synthetic Minority Over-sampling Technique</h2><br/>
SMOTE creates synthetic observations of the minority class (in this case, fraudulent transactions). At a lower level, SMOTE performs the following steps:<br/><br/>

Finding the k nearest neighbors of a minority class observation (finding similar observations)<br/>
Randomly choosing one of the k nearest neighbors and using it to create a similar, but randomly tweaked, new observation.<br/>

<b>Advantages</b><br/>
It mitigates the overfitting caused by random over-sampling, because synthetic examples are generated instead of replicating existing instances.<br/>
No loss of useful information<br/>
<b>Disadvantages</b><br/>
While generating synthetic examples, SMOTE does not take neighboring examples from other classes into consideration. This can increase the overlap between classes and introduce additional noise.<br/>
SMOTE tends to be less effective on high-dimensional data.<br/>
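
The two steps above can be sketched in plain Python. The function `smote_sketch`, its parameters and the toy points are assumptions made for illustration; imblearn's `SMOTE` is more elaborate, but the core interpolation is the same idea: a new point is placed at a random position on the segment between a minority point and one of its nearest minority neighbors.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Step 1: the k nearest minority neighbours of the base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        # Step 2: pick one neighbour and move a random fraction towards it.
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # value in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]  # toy fraud points
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because the search only looks at minority points, nothing stops a synthetic point from landing in a region dominated by the other class, which is the overlap problem listed among the disadvantages.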

<h2>4. Combined class methods: SMOTEENN</h2><br/>
SMOTE can generate noisy samples by interpolating new points between marginal outliers and inliers. This issue can be addressed by cleaning the space that results from over-sampling.<br/>

In this regard, we will use SMOTE together with edited nearest neighbours (ENN). Here, ENN is used as the cleaning method after SMOTE over-sampling to obtain a cleaner space. This is easily achieved with imblearn's SMOTEENN class. SMOTEENN is a combination of over- and under-sampling methods.<br/>
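
The cleaning step can be illustrated with a simplified version of ENN that drops every sample whose label disagrees with the majority of its k nearest neighbors. The helper `enn_clean`, the majority-vote rule and the toy points are assumptions for this sketch; imblearn's implementation offers further options and may apply a stricter rule.

```python
import math
from collections import Counter

def enn_clean(X, y, k=3):
    """Drop samples whose label differs from the majority of their k neighbours."""
    keep_X, keep_y = [], []
    for i, (point, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(point, X[j]))[:k]
        majority_label, _ = Counter(y[j] for j in nearest).most_common(1)[0]
        if majority_label == label:
            keep_X.append(point)
            keep_y.append(label)
    return keep_X, keep_y

# Toy data: a class-1 point at (0.15, 0.0) sits inside the class-0 cluster.
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0),
     (0.15, 0.0), (5.0, 0.0), (5.1, 0.0), (5.2, 0.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

X_clean, y_clean = enn_clean(X, y)
print(len(X), "->", len(X_clean), "samples")
```

Only the point surrounded by the other class is removed; applied after SMOTE, the same rule would discard synthetic points that were interpolated into the wrong neighborhood.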





In [None]:
def train_test(model, X_train, Y_train, X_Test, Y_Test, sampler):
    '''
    Resample the training data, train the model on the resampled data,
    and evaluate its predictions on the untouched test data.
    - model = machine learning model
    - X_train = training features
    - Y_train = training target
    - X_Test = test data features
    - Y_Test = test data target
    - sampler = imblearn sampler object
    '''
    # fit_resample is the current imbalanced-learn API (fit_sample is no longer available)
    x_train_sampled, y_train_sampled = sampler.fit_resample(X_train, Y_train)
    model.fit(x_train_sampled, y_train_sampled)
    # predicting test data outputs
    predictions = model.predict(X_Test)
    eval_metrics(model, Y_Test, predictions)
    
from sklearn.metrics import average_precision_score

def plot_pr_curve(model_auc_pr, y_pred, y_test):
    '''
    This function plots the precision-recall curve and prints summary scores
    - model_auc_pr = area under the precision-recall curve
    - y_pred = predicted class labels for the test data
    - y_test = actual target values
    '''
    # precision_recall_curve returns precision first, then recall
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred, pos_label=1)
    # F1 score (y_pred must hold class labels, as it does in this notebook)
    f1 = f1_score(y_test, y_pred)
    # average precision score
    ap = average_precision_score(y_test, y_pred)
    print('f1=%.3f auc=%.3f ap=%.3f' % (f1, model_auc_pr, ap))
    # a no-skill classifier has precision equal to the share of positives
    no_skill = np.mean(y_test)
    plt.figure()
    plt.plot(recall, precision, marker='.', label=' (area = %0.2f)' % model_auc_pr)
    plt.plot([0, 1], [no_skill, no_skill], 'r--', label='No skill')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision Recall Curve')
    plt.legend(loc="lower right")
    plt.savefig('PRC')
    plt.show()

def eval_metrics(model, Y_Test, predictions):
    '''
    This function calculates the evaluation metrics
    - model = model object
    - Y_Test = actual target values
    - predictions = predicted target values
    '''
    # Accuracy score
    print('Accuracy_score')
    print('-'*40)
    print(accuracy_score(Y_Test, predictions))
    # Confusion matrix
    print('\n Confusion_matrix')
    print('-'*40)
    print(confusion_matrix(Y_Test, predictions))
    # Classification Report
    print('\n Classification Report')
    print('-'*40)
    print(classification_report(Y_Test, predictions))
    # Area under the precision-recall curve. It is computed from hard class
    # labels here, so the curve has a single operating point and the area
    # is only a coarse summary.
    model_auc_pr = pr_auc_score(Y_Test, predictions)
    print('\n Area under curve')
    print('-'*40)
    print(model_auc_pr)
    # plot_pr_curve(model_auc_pr, predictions, Y_Test)


In [None]:
def pr_auc_score(y_test, y_pred):
    '''
    This function computes the area under the precision-recall curve.
    '''
    precisions, recalls, _ = precision_recall_curve(y_test, y_pred, pos_label=1)
    return auc(recalls, precisions)
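
In this notebook `eval_metrics` receives the output of `model.predict`, that is hard 0/1 labels, so the precision-recall summary is built from a single threshold. The sketch below, using a hypothetical pure-Python helper `average_precision` and made-up scores, shows how the summary changes when ranked probabilities are available instead. It follows the usual step-wise definition of average precision rather than the trapezoidal area used by `pr_auc_score`.

```python
def average_precision(y_true, scores):
    """Average precision: sum of (recall step) * (precision at that step)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Evaluate only after the last item of a group of tied scores,
        # because no threshold can separate tied items.
        last_of_group = (rank == len(order) - 1
                         or scores[order[rank + 1]] != scores[i])
        if last_of_group:
            precision = tp / (tp + fp)
            recall = tp / total_pos
            ap += (recall - prev_recall) * precision
            prev_recall = recall
    return ap

y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted fraud probabilities
hard_labels = [0, 1, 0, 1]             # the same predictions cut at 0.4

ap_prob = average_precision(y_true, probabilities)
ap_hard = average_precision(y_true, hard_labels)
print(round(ap_prob, 3), round(ap_hard, 3))
```

The ranked probabilities give a higher score than the thresholded labels on the same predictions, because thresholding discards the ordering within each predicted class. With a fitted scikit-learn classifier, such scores would typically come from `predict_proba`.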

In [None]:

cv=5

GBmodel=GradientBoostingClassifier()
rfmodel=RandomForestClassifier()

#  Random Under Sampling
from imblearn.under_sampling import RandomUnderSampler
print("Random under-sampling")
train_test(rfmodel,original_trainX,original_train_y, original_testX, original_test_y, RandomUnderSampler())

# Random over Sampling
from imblearn.over_sampling import RandomOverSampler
print('-'*40)
print("Random over-sampling")
train_test(rfmodel, original_trainX,original_train_y, original_testX, original_test_y, RandomOverSampler())

# SMOTE
from imblearn.over_sampling import SMOTE
print('-'*40)
print("SMOTE over-sampling")
train_test(rfmodel, original_trainX,original_train_y, original_testX, original_test_y, SMOTE())

# SMOTEENN
from imblearn.combine import SMOTEENN 
print('-'*40)
print("SMOTEENN combined sampling")
train_test(rfmodel, original_trainX,original_train_y, original_testX, original_test_y, SMOTEENN())


<h3>Conclusion</h3>
The results show that, for this dataset, random over-sampling yields the largest area under the precision-recall curve.