#Description
#Prediction of Credit Card fraud:

Credit cards are among the most widely used financial products for online purchases and payments. Although they are a convenient way to manage your finances, they also carry risk. Credit card fraud is the unauthorized use of someone else's credit card or credit card information to make purchases or withdraw cash.

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. 

The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
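As a quick check of the figures above, the share of fraudulent transactions can be recomputed from the two counts. This is only an arithmetic sketch; the variable names are illustrative and not part of the notebook.

```python
# Counts quoted in the dataset description
n_fraud = 492
n_total = 284_807

fraud_share = n_fraud / n_total * 100   # percentage of the positive class
ratio = (n_total - n_fraud) / n_fraud   # legitimate transactions per fraud

print(f"Fraud share: {fraud_share:.3f}%")
print(f"About {ratio:.0f} legitimate transactions per fraudulent one")
```

The result is roughly 0.17%, which is why the class balance is handled explicitly later on.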

We have to build a classification model to predict whether a transaction is fraudulent or not.


#Reading the data

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df= pd.read_csv("/content/creditcard.csv")
df


#EDA

In [None]:
#Data is preprocessed. Most columns have been processed using PCA, except for Time, Amount and Class.
df.describe()


- Value ranges differ greatly across features, so normalization should be applied.
- Mean and median are similar for most features, which suggests roughly symmetrical distributions.
- Info shows no null data and the same data type across all features except Class.
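The mean-versus-median comparison used above as a symmetry check can be illustrated on toy data. The two lists below are made up for the example and are not taken from the dataset.

```python
from statistics import mean, median

symmetric = [-2, -1, 0, 1, 2]        # toy data, balanced around 0
right_skewed = [0, 0, 1, 1, 20]      # toy data with one large value

for name, values in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    gap = mean(values) - median(values)
    print(f"{name}: mean={mean(values):.2f}, median={median(values):.2f}, gap={gap:.2f}")
```

A gap near zero is consistent with a symmetric distribution, while a large positive gap points to a long right tail.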

In [None]:
df.info()

In [None]:
df.Class.unique()

In [None]:
plt.figure(figsize=(15,10))
corr_matrix= df.corr().round(1)
sns.heatmap(data= corr_matrix, annot=True, linewidths=0.5, square=True)
plt.show()

In [None]:
#Empty values
df.isna().sum()

- Boxplots for all variables to identify ranges and outliers

In [None]:
col_dic= {}
for k,v in enumerate(df.columns):
  col_dic[v]= k+1

In [None]:
plt.figure(figsize=(22,28))

for variable, i in col_dic.items():
  plt.subplot(16,2,i, axisbelow= True)
  sns.boxplot(x= df[variable])
# set the spacing between subplots
plt.subplots_adjust(hspace=0.6)
plt.show()


Takeaways:
- All variables contain a high number of outliers
- All variables have a median close to 0
- The distributions of all variables appear to be roughly normal, with small variations in V1 (positively skewed) and V3 (negatively skewed)
- All columns have different ranges; dispersion is easier to compare after normalization

#Checking Distribution

In [None]:
def centers(data):
  # Draw vertical lines at the mean, median and mode(s) on the current axes
  mean, median, modes = data.mean(), data.median(), data.mode()
  mn = plt.axvline(mean, color="red")
  md = plt.axvline(median, color="green")
  for mode in modes:
    mo = plt.axvline(mode, color="yellow")  # one line per mode
  plt.legend((mn, md, mo), "Mean Median Mode".split())

In [None]:
sns.displot(df.V1)
centers(df.V1)

In [None]:
sns.displot(df.Amount)
centers(df.Amount)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a subplot with multiple plots
fig, axs = plt.subplots(8, 4, figsize=(20, 20))
axs = axs.ravel()

# Plot histograms for all columns
for i, column in enumerate(df.columns):
    sns.histplot(df[column], ax=axs[i])
    plt.sca(axs[i])  # make this subplot current so centers() draws on it
    centers(df[column])
    axs[i].set_title(column)

plt.tight_layout()
plt.show()

#Check class balance:

In [None]:
df['Class'].value_counts()

In [None]:
#Imbalanced dataset: 0 = non-fraudulent, 1 = fraudulent
#The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

plt.figure(figsize=(10,10))
sns.countplot(x= "Class", data= df)
plt.title ("Fraudulent Transactions Vs. Non Fraudulent Transactions")
plt.xlabel ("Class (0 = Non Fraud, 1 = Fraud)")
plt.ylabel ("Number of transactions")
plt.show()

#Check and process Outliers
- Outliers influence the best-fit line. Check the number of outliers and, depending on their percentage, either transform them, drop them, or leave them.
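To make the outlier check concrete, here is a minimal sketch of Tukey's fences on a toy array, followed by the "transform" option implemented as capping at the fences. The values are invented for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # toy data, 100 is the obvious outlier

q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr   # Tukey fences

is_outlier = (values < lower) | (values > upper)
capped = np.clip(values, lower, upper)            # cap values at the fences

print("fences:", lower, upper)
print("outlier share: {:.0f}%".format(is_outlier.mean() * 100))
print("capped:", capped)
```

Capping keeps every row, which matters here because dropping rows could remove some of the few fraudulent transactions.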


In [None]:
import numpy as np
#Tukey Method:
for variable in col_dic.keys():
  #q75,q25=np.percentile(df[variable], [75,25])
  q25 = df[variable].quantile(0.25)
  q75 = df[variable].quantile(0.75)
  iqr=q75-q25
  min_threshold= q25-(iqr*1.5)
  max_threshold= q75+(iqr*1.5)
  outliers = df[(df[variable] < min_threshold) | (df[variable] > max_threshold)]
  num_outliers = outliers.shape[0]
  percentage= num_outliers/len(df)*100  #use the actual row count instead of a hard-coded total
  print("Number of outliers and percentage of it in {}: {} and {:0.2f}% \n".format(variable, num_outliers, percentage))
  

In [None]:
import pandas as pd
import numpy as np

def replace_outliers(df, col_name, k=1.5):
    q1 = df[col_name].quantile(0.25)
    q3 = df[col_name].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k*iqr
    upper_bound = q3 + k*iqr
    
    df[col_name] = np.where(df[col_name] < lower_bound, lower_bound, df[col_name])
    df[col_name] = np.where(df[col_name] > upper_bound, upper_bound, df[col_name])
      
    return df

In [None]:
for i in col_dic.keys():
  if i != "Class":
    replace_outliers(df, i, k=1.5)

In [None]:
df

In [None]:
#Before replacing outliers (run this cell before the replacement loop above; at this point V1 is already capped)
sns.boxplot(x=df["V1"])

In [None]:
#After replacing outliers
sns.boxplot(x=df["V1"])

In [None]:
df.info() #Check null values again

#Scaler
- The V columns (V1, ...) were previously processed using PCA, so I apply StandardScaler only to the remaining features (Time and Amount).
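For reference, standardization subtracts the mean and divides by the standard deviation. The sketch below reproduces that formula with NumPy on toy values; it is an illustration of the transformation, not the notebook's code.

```python
import numpy as np

amount = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy values standing in for a column

mu = amount.mean()
sigma = amount.std()            # population standard deviation (ddof=0)
scaled = (amount - mu) / sigma  # z-score: zero mean, unit variance

print(scaled.round(3))
print("mean:", round(float(scaled.mean()), 6), "std:", round(float(scaled.std()), 6))
```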



In [None]:
from sklearn.preprocessing import StandardScaler


# Scale 'Time' and 'Amount'
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
df['Time'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1))

In [None]:
df

In [None]:
df.isnull().sum().sum()

Dealing with NaNs

In [None]:
df = df.fillna(df.mean())
df.isnull().sum().sum()

Up to this point, the dataset has been analysed and processed by capping outliers and standardizing the data. Next, I will create datasets using undersampling and oversampling to test with different ML and DL models.

#Dealing with imbalanced dataset:
- I test three kinds of datasets:
  1. Undersampling the majority class
  2. Oversampling the minority class
  3. SMOTE

I use the imblearn library.

In [None]:
!pip install imbalanced-learn  #imported as imblearn

In [None]:
X= df.drop(["Class"], axis=1)
y= df.Class

In [None]:
fraud= len(df[df["Class"]==1])
no_fraud= len(df[df["Class"]==0])

In [None]:
print(f"Fraudulent transactions: {fraud}")
print(f"Non Fraudulent transactions: {no_fraud}")

##Undersampling
Testing two techniques:
- EditedNearestNeighbours and RandomUnderSampler

In [None]:
from imblearn.under_sampling import CondensedNearestNeighbour,EditedNearestNeighbours,NearMiss,NeighbourhoodCleaningRule,OneSidedSelection,RandomUnderSampler,TomekLinks

RandomUnderSampler

In [None]:
rus = RandomUnderSampler(random_state=0, replacement=True)
X_random_undersampled, y_random_undersampled = rus.fit_resample(X, y)
X_random_undersampled.head()

In [None]:
print(len(X_random_undersampled))
print(len(y_random_undersampled))

In [None]:
sns.countplot(x=y_random_undersampled)
plt.show()

EditedNearestNeighbours
- kind_sel='all' is less conservative than kind_sel='mode': more samples are excluded with the former strategy than with the latter.
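The difference between the two settings can be sketched with a simplified re-implementation of the editing rule. This is not imblearn's code: the function enn_keep_mask, its parameters, and the toy data are assumptions made for illustration, and the "mode" branch uses a plain majority vote, which matches the mode only for two classes.

```python
import numpy as np

def enn_keep_mask(X, y, majority_label, n_neighbors=3, kind_sel="all"):
    """Return a boolean mask of the rows that survive the editing step."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    for i in np.flatnonzero(y == majority_label):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                              # a point is not its own neighbour
        neighbours = np.argsort(dist)[:n_neighbors]
        agree = y[neighbours] == y[i]
        if kind_sel == "all":
            keep[i] = agree.all()                     # every neighbour must share the label
        else:  # "mode"
            keep[i] = agree.sum() > n_neighbors / 2   # a majority is enough
    return keep

# Toy 1-D data: four majority points (0) next to one minority point (1)
X = [[0.0], [0.1], [0.2], [0.3], [0.35]]
y = [0, 0, 0, 0, 1]

keep_all = enn_keep_mask(X, y, majority_label=0, kind_sel="all")
keep_mode = enn_keep_mask(X, y, majority_label=0, kind_sel="mode")
print("kept with 'all': ", keep_all.sum())
print("kept with 'mode':", keep_mode.sum())
```

In this toy case the two majority points closest to the minority point are dropped under 'all' but kept under 'mode'.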

In [None]:
enn = EditedNearestNeighbours(kind_sel="all")
X_edited_undersampled, y_edited_undersampled = enn.fit_resample(X, y)  #use enn here, not rus
X_edited_undersampled.head()

In [None]:
print(len(X_edited_undersampled))
print(len(y_edited_undersampled))

In [None]:
sns.countplot(x=y_edited_undersampled)
plt.show()

##OverSampling
I use two techniques:
- ADASYN and RandomOverSampler

RandomOverSampler over-samples by duplicating some of the original samples of the minority class, whereas SMOTE and ADASYN generate new samples by interpolation.
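The contrast between duplication and interpolation can be sketched in a few lines of NumPy. This is a simplified illustration and not imblearn's implementation: the helper interpolate and the toy points are hypothetical, and a real SMOTE picks the neighbour among the k nearest minority samples, while ADASYN additionally generates more points around samples that are harder to learn.

```python
import numpy as np

rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy minority samples

# Random over-sampling: draw existing rows again (exact duplicates)
duplicates = minority[rng.integers(0, len(minority), size=4)]

# SMOTE-style interpolation: a point on the segment between a sample and a neighbour
def interpolate(x, neighbour, rng):
    lam = rng.random()                  # uniform in [0, 1)
    return x + lam * (neighbour - x)

synthetic = interpolate(minority[0], minority[1], rng)

print("duplicates:\n", duplicates)
print("synthetic point:", synthetic)
```

Every duplicated row already exists in the data, whereas the synthetic point lies somewhere between two existing minority samples.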

In [None]:
from imblearn.over_sampling import BorderlineSMOTE,ADASYN,KMeansSMOTE,SMOTE,RandomOverSampler,SVMSMOTE

RandomOverSampler

In [None]:
ros = RandomOverSampler(random_state=0)
X__random_oversampled, y_random_oversampled = ros.fit_resample(X, y)
X__random_oversampled.head() 

In [None]:
print(len(X__random_oversampled))
print(len(y_random_oversampled))

In [None]:
sns.countplot(x=y_random_oversampled)
plt.show()

Adasyn Oversampler


In [None]:
X_adasyn_oversampled, y_adasyn_oversampled = ADASYN().fit_resample(X, y)
X_adasyn_oversampled.head() 

In [None]:
print(len(X_adasyn_oversampled))
print(len(y_adasyn_oversampled)) 

In [None]:
sns.countplot(x=y_adasyn_oversampled)
plt.show()

SMOTE
- The Synthetic Minority Oversampling Technique (SMOTE)

In [None]:
X_smote_oversampled, y_smote_oversampled = SMOTE().fit_resample(X, y)
X_smote_oversampled.head() 

In [None]:
print(len(X_smote_oversampled))
print(len(y_smote_oversampled))

Split all the datasets into train and test:

In [None]:
from sklearn.model_selection import train_test_split
#Original
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state= 42)

#Undersampling
X_train_ru, X_test_ru, y_train_ru, y_test_ru = train_test_split(X_random_undersampled, y_random_undersampled, random_state= 42)
X_train_eu, X_test_eu, y_train_eu, y_test_eu = train_test_split(X_edited_undersampled, y_edited_undersampled, random_state= 42)

#Oversampling
X_train_ro, X_test_ro, y_train_ro, y_test_ro = train_test_split(X__random_oversampled, y_random_oversampled, random_state= 42)
X_train_ao, X_test_ao, y_train_ao, y_test_ao = train_test_split(X_adasyn_oversampled, y_adasyn_oversampled, random_state= 42)

X_train_so, X_test_so, y_train_so, y_test_so = train_test_split(X_smote_oversampled, y_smote_oversampled, random_state= 42)

The following dictionary is used to automate training:


In [None]:
datasets_aug= {"original": [X_train, X_test, y_train, y_test],
           "Random Undersampling": [X_train_ru, X_test_ru, y_train_ru, y_test_ru],
           "Edited Undersampling":[X_train_eu, X_test_eu, y_train_eu, y_test_eu],
           "Random Oversampling": [X_train_ro, X_test_ro, y_train_ro, y_test_ro],
           "Adasyn Oversampling": [X_train_ao, X_test_ao, y_train_ao, y_test_ao],
           "Smote Oversampling": [X_train_so, X_test_so, y_train_so, y_test_so]}

#Model training
- I will test a RandomForest and a simple NN
- Comparison of the original, undersampled, and oversampled datasets
- Later, hyperparameter tuning will be performed on both models using the best-scoring dataset.
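When reading the accuracy scores for the original (imbalanced) dataset, it helps to know the baseline. The sketch below uses the class counts from the dataset description and a hypothetical classifier that labels every transaction as non-fraudulent; it is an arithmetic illustration, not a result from this notebook.

```python
n_total, n_fraud = 284_807, 492

# Hypothetical classifier that predicts "non-fraudulent" for every transaction
true_negatives = n_total - n_fraud
false_negatives = n_fraud
true_positives = 0

accuracy = true_negatives / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.4%}, recall={recall:.0%}")
```

Such a model is more than 99.8% accurate while detecting no fraud at all, so per-class metrics such as recall are a useful complement to accuracy on the original dataset.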


In [None]:
def conf_matrix(cm):
    # Plot a confusion matrix as an annotated heatmap
    plt.figure(figsize=(10,10))
    labels= sorted(y_test.unique())  # sorted so the ticks match the row/column order of the matrix
    sns.heatmap(cm, cmap="Blues", annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted", fontsize=20)
    plt.ylabel("Actual", fontsize=20)
    plt.show()

In [None]:
def plot_training_history(history, name):
  # Plot training and validation accuracy per epoch
  hist=history.history
  plt.figure()
  plt.xlabel("Epoch")
  plt.ylabel("Accuracy")
  plt.title(name)
  plt.plot(history.epoch, hist["accuracy"], label="Train Accuracy")
  plt.plot(history.epoch, hist["val_accuracy"], label="Val Accuracy")
  plt.legend(["Training", "Validation"], loc="best")


In [None]:
def training_ml(model, name, dataset):
  # dataset is [X_train, X_test, y_train, y_test]
  model.fit(dataset[0], dataset[2])
  y_pred=model.predict(dataset[1])
  accuracy= accuracy_score(dataset[3],y_pred)
  matrix= confusion_matrix(dataset[3],y_pred)
  print(f"The model name is: {name}. Its accuracy is: {accuracy} \n")
  conf_matrix(matrix)
  return accuracy
  

Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
rfc= RandomForestClassifier()

In [None]:
dic_results= {}
for k,v in datasets_aug.items():
  acc= training_ml(rfc, k,v)
  dic_results[k] = acc
  

In [None]:
dic_results

In [None]:
df_ml= pd.DataFrame(list(dic_results.items()), columns=["Model", "Accuracy"])

In [None]:
df_ml

Neural Network


In [None]:
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [None]:
import tensorflow as tf

# define the input shape
input_shape = (X_train.shape[1],)

# define the sequential model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=64, activation='relu', input_shape=input_shape),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

# compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()



In [None]:
# define early stopping callback
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

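def train_nn(model, name, dataset):
  # dataset is [X_train, X_test, y_train, y_test]
  # Clone and recompile so every dataset starts from freshly initialized weights
  # rather than continuing training from the previous dataset
  model= tf.keras.models.clone_model(model)
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  train_history= model.fit(
    dataset[0],
    dataset[2],
    epochs=20,
    verbose=2,
    validation_split=0.1,
    callbacks=[early_stop]
  )
  score= model.evaluate(dataset[1], dataset[3], verbose=2)
  print(f"\n Model: {name} \n")
  print("Test Accuracy: {:0.2f} \n".format(score[1]*100))
  plot_training_history(train_history, name)
  return score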
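The patience logic behind the early stopping callback used above can be sketched in plain Python. This is a simplified illustration and not Keras's implementation: the function early_stopping_epoch and the loss values are hypothetical, and options such as a minimum improvement threshold or restoring the best weights are ignored.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement in this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: all epochs were run

# Made-up validation losses: the best value is reached at epoch index 2
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_at = early_stopping_epoch(losses, patience=3)
print("training stops at epoch index", stop_at)
```

With patience=3, training halts after three consecutive epochs without a new best validation loss, so the later improvement in the toy sequence is never reached.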



In [None]:
dic_results_nn= {}
for k,v in datasets_aug.items():
  acc = train_nn(model, k,v)
  dic_results_nn[k] = acc[1]


In [None]:
dic_results_nn

In [None]:
df_nn= pd.DataFrame(list(dic_results_nn.items()), columns=["Model", "Accuracy"])
df_nn

#Best performing models:

Random Forest: Random Undersampling.
Oversampling yields near 100% accuracy, which suggests overfitting. The same holds for the original dataset.

Neural Network: Random Undersampling.
As with the Random Forest classifier, oversampling produced an accuracy of 99.8% or above, which may well be a case of overfitting. I therefore choose the next best option.

For both models, the best option is the dataset processed with Random Undersampling. This dataset will be used for further hyperparameter tuning.
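One possible contributor to the near-perfect oversampling scores, offered here as an assumption rather than a verified finding, is that resampling is applied to the full dataset before train_test_split, so copies of the same minority row can land on both sides of the split. The toy sketch below uses only the standard library and invented row ids to show the effect.

```python
import random

random.seed(0)

# Toy minority class: 5 distinct rows identified by an id
minority = list(range(5))

# Random over-sampling first: each row duplicated 10 times
oversampled = minority * 10

# Then a random 75/25 split, mirroring the default test share of train_test_split
random.shuffle(oversampled)
train, test = oversampled[:37], oversampled[37:]

train_ids = set(train)
seen_in_train = sum(row in train_ids for row in test)
print(f"{seen_in_train} of {len(test)} test rows also appear in the training set")
```

A common remedy is to split first and resample only the training portion, so the test set contains only original, unseen transactions.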

#Hyperparameter Tuning

In [None]:
params_rf={
    "criterion": ("gini", "entropy"),
    "min_samples_leaf": list(range(1,10)),
    "max_depth": list(range(1,10))
    }


In [None]:
from sklearn.model_selection import GridSearchCV

gs= GridSearchCV(rfc, params_rf, scoring= "accuracy", n_jobs=-1, cv=3, verbose=1)
gs.fit(X_train_ru, y_train_ru)
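To get a feel for the cost of this search, the number of candidate combinations in the random forest grid can be counted directly. The sketch repeats the grid from the notebook and uses itertools; it only counts combinations and does not train anything.

```python
from itertools import product

params_rf = {
    "criterion": ("gini", "entropy"),
    "min_samples_leaf": list(range(1, 10)),
    "max_depth": list(range(1, 10)),
}
cv = 3  # number of cross-validation folds used in the grid search

candidates = list(product(*params_rf.values()))
n_fits = len(candidates) * cv
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

The grid therefore contains 2 x 9 x 9 combinations, each fitted once per fold.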

In [None]:
gs.best_estimator_

In [None]:
gs.best_params_

In [None]:
rfc_new= RandomForestClassifier(criterion = 'gini', max_depth= 6, min_samples_leaf = 2)

#training_ml fits rfc_new in place and returns its test accuracy
acc_rf= training_ml(rfc_new, "Random UnderSampling",datasets_aug["Random Undersampling"])

In [None]:
import pickle

#Save the fitted classifier (not the accuracy returned by training_ml)
with open("rf_creditcard_fraud.pkl", "wb") as f:
    pickle.dump(rfc_new, f)
#Load:

#with open("rf_creditcard_fraud.pkl", "rb") as f:
#    model = pickle.load(f)

Neural Network:

In [None]:
params_nn={
    "epochs": list(range(10,30)),
    'activation': ['relu', 'tanh'],
    "optimizer": ["adam", "sgd"],
}

In [None]:
import tensorflow as tf
#Note: keras.wrappers.scikit_learn was removed from recent Keras/TensorFlow releases;
#this import requires an older version (the scikeras package offers a similar wrapper)
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Define the neural network architecture
def create_model(optimizer='sgd', activation='relu'):
    # define the input shape
    input_shape = (X_train.shape[1],)

    # define the sequential model
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(units=64, activation= activation, input_shape=input_shape),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(units=32, activation=activation),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(units=16, activation=activation),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])

    # compile the model
    model.compile(optimizer= optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    return model



# Define the hyperparameter grid
params_nn={
    "epochs": list(range(10,30)),
    'activation': ['relu', 'tanh'],
    "optimizer": ["adam", "sgd", "AdamW"] }

# Create a Keras classifier
keras_clf = KerasClassifier(build_fn=create_model, epochs=20, batch_size=32, verbose=0)

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=keras_clf, param_grid=params_nn, cv=3)

# Fit the GridSearchCV object to the training data
grid_search.fit(datasets_aug["Random Undersampling"][0], datasets_aug["Random Undersampling"][2])

# Print the best parameters and score
print("Best parameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)


In [None]:
input_shape = (X_train.shape[1],)

# define the sequential model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=64, activation="tanh", input_shape=input_shape),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=32, activation="tanh"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=16, activation="tanh"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

# compile the model
model.compile(optimizer="adam",
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
train_history= model.fit(
    datasets_aug["Random Undersampling"][0],
    datasets_aug["Random Undersampling"][2],
    epochs=26,
    verbose=2,
    validation_split=0.1,
    callbacks=[early_stop]
)
score= model.evaluate(datasets_aug["Random Undersampling"][1], datasets_aug["Random Undersampling"][3], verbose=2)
print("\n Model: Random Undersampling \n")
print("Test Accuracy: {:0.2f} \n".format(score[1]*100))
plot_training_history(train_history, "Random Undersampling")

In [None]:
model.save_weights('nn_credit_card_fraud_weights')
model.save('nn_credit_card_fraud.h5')