## WQD7004 Alternative Assessment

In [1]:
# import libraries

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:
# reading csv into dataframe

df = pd.read_csv("C:/Users/oscar/Desktop/UM/WQD7004 Programming/creditcard.csv")

In [3]:
# first 5 rows of the dataframe

df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
# check for null values

df.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [5]:
nofrauds = df['Class'].value_counts()[0]
frauds = df['Class'].value_counts()[1]

print('No Frauds', round(nofrauds/len(df)*100,2), '% of the dataset')
print('Frauds', round(frauds/len(df)*100,2), '% of the dataset')

No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset


Notice how imbalanced our dataset is, with 99.83% No Frauds and only 0.17% Frauds. If we use this dataframe as the base for our predictive modelling, our algorithms will probably overfit to the majority class, since they will 'assume' that most transactions are No Frauds.

To deal with this, we will create a subsample with a 50/50 ratio of Frauds and No Frauds transactions.
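
To see why the raw imbalance is a problem, consider a "model" that simply labels every transaction as No Fraud. The sketch below uses hypothetical counts (100,000 transactions, 170 of them fraud) chosen only to match the 0.17% fraud rate reported above; they are not the actual row counts of the dataset.

```python
# Hypothetical counts, chosen to match the 0.17% fraud rate reported above.
n_total = 100_000
n_fraud = 170
n_non_fraud = n_total - n_fraud

# A "model" that labels every transaction as No Fraud (class 0).
true_negatives = n_non_fraud   # every legitimate transaction is labelled correctly
false_negatives = n_fraud      # every fraud is missed
true_positives = 0             # no fraud is ever flagged

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.2%}")
print(f"fraud recall: {recall:.2%}")
```

The accuracy looks excellent even though not a single fraud is caught, which is why accuracy on the imbalanced data is a misleading measure here.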

### Random Under-sampling
In this phase of the project we will implement random under-sampling, which consists of removing rows from the majority class in order to obtain a more balanced dataset and thus keep our models from overfitting.

Since there are 492 rows of Frauds transactions and we want a 50/50 subsample, we will randomly choose 492 rows of No Frauds transactions from the dataframe.

In [6]:
# Let's shuffle the data before creating the subsamples

df = df.sample(frac=1)

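# The fraud class has 492 rows, so keep all of them
# and take the first 492 rows of the (already shuffled) non-fraud class.
fraud_df = df.loc[df['Class'] == 1]
non_fraud_df = df.loc[df['Class'] == 0][:492]

# Combine both classes into one balanced dataframe
balanced_df = pd.concat([fraud_df, non_fraud_df])

# Shuffle dataframe rows so the two classes are interleaved
new_df = balanced_df.sample(frac=1, random_state=42)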

In [9]:
# now we have a new dataframe with a combination of 50:50 Frauds and No Frauds data

new_df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
221153,142455.0,0.198757,-2.937021,-1.435928,1.569646,-0.406247,1.837675,0.407623,0.299951,0.436065,...,0.479582,-0.352901,-0.388156,-1.652056,-0.848618,-0.912362,-0.055699,0.097117,850.0,0
276864,167338.0,-1.374424,2.793185,-4.346572,2.400731,-1.688433,0.111136,-0.922038,-2.14993,-2.027474,...,-0.870779,0.504849,0.137994,0.368275,0.103137,-0.414209,0.454982,0.096711,349.08,1
249992,154671.0,-0.336486,1.024339,-0.548793,-0.292096,1.155033,-0.68419,1.043101,-0.341845,-0.163569,...,-0.167662,-0.327067,0.392497,0.539568,-1.350847,0.016465,-0.111222,0.252853,8.99,0
150662,93853.0,-5.839192,7.151532,-12.81676,7.031115,-9.651272,-2.938427,-11.543207,4.843627,-3.494276,...,2.462056,1.054865,0.530481,0.47267,-0.275998,0.282435,0.104886,0.254417,316.06,1
23308,32686.0,0.287953,1.728735,-1.652173,3.813544,-1.090927,-0.984745,-2.202318,0.555088,-2.033892,...,0.262202,-0.633528,0.092891,0.187613,0.368708,-0.132474,0.576561,0.309843,0.0,1
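
A quick way to confirm that a subsample is balanced is to look at the class shares. The sketch below repeats the same recipe on a small made-up dataframe (`toy_df`, 1,000 rows with 20 frauds), since the original CSV is not bundled here. The toy data and all `toy_*` names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the credit card data: 1,000 rows, 20 of them fraud (Class == 1).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "Amount": rng.uniform(0, 500, size=1000).round(2),
    "Class": [1] * 20 + [0] * 980,
})

# Same recipe as above: shuffle, keep all frauds, keep an equal number of non-frauds.
toy_df = toy_df.sample(frac=1, random_state=0)
n_fraud = int((toy_df["Class"] == 1).sum())
toy_fraud = toy_df.loc[toy_df["Class"] == 1]
toy_non_fraud = toy_df.loc[toy_df["Class"] == 0][:n_fraud]
toy_balanced = pd.concat([toy_fraud, toy_non_fraud]).sample(frac=1, random_state=42)

# Share of each class in the balanced subsample.
class_share = toy_balanced["Class"].value_counts(normalize=True)
print(class_share)
```

On the real data, the equivalent check would be `new_df['Class'].value_counts(normalize=True)`, which should report a share of 0.5 for each class.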


### Oversampling (SMOTE)

SMOTE stands for Synthetic Minority Over-sampling Technique. Unlike random under-sampling, SMOTE creates new synthetic points so that the classes are equally balanced. It is another way of addressing the class imbalance problem.

SMOTE balances the class distribution by generating synthetic samples for the minority class: it picks a minority sample, finds its nearest neighbours within the minority class, and places a new synthetic point somewhere on the line segment between the sample and one of those neighbours. Because no rows are discarded, more information is retained than with random under-sampling, but the larger resulting dataset takes more computational resources and time to train on.

Since applying SMOTE properly requires some familiarity with machine learning workflows, we will not demonstrate the full procedure in this notebook, to keep things simple. Please refer to the following link for more information.

https://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets#Test-Data-with-Logistic-Regression
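
For intuition only, the core idea of SMOTE (interpolating between minority samples that are close to each other) can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the procedure from the linked notebook: the function name `smote_sketch`, its parameters and the toy 2-D points are all made up for this example.

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # random minority sample
        nb = neighbours[base, rng.integers(k)]  # one of its k nearest neighbours
        gap = rng.random()                      # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[nb] - minority[base])
    return synthetic

# Toy 2-D minority class (made-up points, not taken from the credit card data).
toy_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                         [1.1, 1.3], [0.9, 0.8], [1.3, 1.1]])
new_points = smote_sketch(toy_minority, n_new=10, k=3)
print(new_points.shape)
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies. In practice a maintained implementation, such as `SMOTE` from the imbalanced-learn package, would normally be preferred over a hand-written sketch like this one.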

### Neural Network Testing: Random Under-sampling and Oversampling (SMOTE)

After creating the dataset using SMOTE, we will train a basic neural network with one hidden layer on both the under-sampled and the over-sampled (SMOTE) datasets, the same two datasets used for the previously developed logistic regression models, and use the confusion matrix to compare how accurately fraud and non-fraud transactions are identified in each case.

The purpose of our study is to investigate the performance of a basic neural network when applied to datasets that have undergone random under-sampling and over-sampling. The goal is to determine the network's ability to accurately identify both fraud and non-fraud cases. It is important to consider non-fraud cases as well, as a false positive (mistakenly identifying a legitimate transaction as fraudulent) can result in inconvenience for the customer, such as having their card blocked. Therefore, our focus is on both correctly identifying fraud and accurately classifying non-fraud transactions.
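
The confusion matrix mentioned above separates the two kinds of error that matter here: false positives (legitimate transactions flagged as fraud) and false negatives (missed frauds). The following is a minimal sketch with made-up labels; `confusion_counts`, `y_true` and `y_pred` are hypothetical names and values, not results from any model in this notebook.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are the true class, columns the predicted class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    return np.array([[tn, fp], [fn, tp]])

# Made-up labels for ten transactions (0 = No Fraud, 1 = Fraud).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

cm = confusion_counts(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)                # share of frauds that were caught
false_positive_rate = fp / (fp + tn)   # share of legitimate transactions wrongly flagged

print(cm)
print(f"fraud recall: {recall:.2f}")
print(f"false positive rate: {false_positive_rate:.2f}")
```

Reading both rates side by side reflects the goal stated above: catching fraud while keeping the number of wrongly blocked legitimate customers low.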