In this notebook, I train a total of six models and evaluate their performance on the fraud detection problem. Because the data is imbalanced, I first train three models (Logistic Regression, Random Forest Classifier, and XGBoost Classifier) on the imbalanced data, then train the same three models on oversampled data. The hyperparameters are tuned separately in each case, so there are six different models in total.

The functions used in this notebook are declared in utils.py. 

Steps included in this notebook:

* Import libraries
* Read and load data
* Exploratory Data Analysis (EDA)
* Visualization of data
* Feature Selection
* Split of data
* Hyperparameter Tuning 
* Model Fitting
* Model Evaluation
* Oversampling data
* Hyperparameter Tuning (oversampled data)
* Model Fitting (oversampled data)
* Model Evaluation (oversampled data)
* Conclusion


##### Importing necessary libraries

In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, precision_score, recall_score
import utils as u

##### Reading and loading data

In [None]:
# Path to dataset
data_path = 'creditcard_data.csv'

In [None]:
# Load the data
df = u.load_data(data_path)

##### Exploratory Data Analysis

In [None]:
# Printing column names
print("\nColumns:\n",df.columns) 

In [None]:
# Check type of data in the dataframe
df.dtypes

In [None]:
# Printing null values count
print("\nNumber of null values in each column:\n",df.isnull().sum()) 

In [None]:
# Printing statistical info
print("\nStatistics of all the columns of the dataframe: \n",df.describe()) 

In [None]:
# Print unique values count for each column
print("\nUnique values count in each column: \n", df.nunique())

In [None]:
# Print skewness of each column
print("\nSkewness for each column: \n", df.skew())

In [None]:
# Print kurtosis of each column
print("\nKurtosis for each column: \n", df.kurtosis())

In [None]:
# Check and print if there are any duplicate rows
print("\nNumber of duplicate rows: \n", df.duplicated().sum())

In [None]:
# For numeric columns, create box plots to check for outliers
u.visualize_boxplots(df)

In [None]:
# Main variables
target_var = 'Class'
amount_column = 'Amount'

In [None]:
# Check Class Difference
u.class_difference(df, target_var, amount_column)

##### Visualization of the data

In [None]:
# Visualize Class Difference
u.visualize_class_difference(df, target_var)

In [None]:
# Visualize Heatmap
u.visualize_heatmap(df, target_var)

In [None]:
# Visualize Scatterplot
u.visualize_scatterplot(df, target_var)

##### Feature Selection

In [None]:
# Feature Selection
features_selected = u.feature_selection(df, target_var)

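# Concatenate the selected features with the target and amount columns,
# skipping either one if feature selection already kept it (avoids duplicate columns)
extra_cols = [c for c in (target_var, amount_column) if c not in list(features_selected)]
df_final = pd.concat([df[features_selected], df[extra_cols]], axis=1)
print("\nSelected features: ", df_final.columns)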
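
The body of `u.feature_selection` lives in `utils.py` and is not shown in this notebook. As a rough sketch only, assuming it relies on the `RFE` class imported at the top, recursive feature elimination could look like the following. The synthetic data, the column names `V1`..`V8`, the choice of `LogisticRegression` as the ranking estimator, and `n_features_to_select=4` are all illustrative assumptions, not the notebook's actual settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical column names V1..V8
X_arr, y_rfe = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
X_rfe = pd.DataFrame(X_arr, columns=[f"V{i}" for i in range(1, 9)])

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X_rfe, y_rfe)

# support_ is a boolean mask over the input columns
selected_demo = X_rfe.columns[selector.support_].tolist()
print(selected_demo)
```

The resulting list of column names can be used to index the original dataframe, which is how `features_selected` is used in the cell above.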

##### Splitting the data

In [None]:
# Split Dataset
X_train, X_test, y_train, y_test = u.split_dataset(df_final, target_var)
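
How `u.split_dataset` splits the data is defined in `utils.py` and not shown here. With a target this imbalanced, one common option is a stratified split, which keeps the fraud rate the same in the training and test sets. The sketch below illustrates the idea on synthetic labels; the 80/20 ratio and `random_state` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 990 negatives and 10 positives
rng = np.random.default_rng(0)
X_split_demo = rng.normal(size=(1000, 5))
y_split_demo = np.array([0] * 990 + [1] * 10)

# stratify=... keeps the 1% positive rate the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_split_demo, y_split_demo, test_size=0.2, stratify=y_split_demo, random_state=42
)

print("Positives in train:", int(y_tr.sum()), "of", len(y_tr))
print("Positives in test: ", int(y_te.sum()), "of", len(y_te))
```

Without stratification, a random 20% sample of so few positives could easily contain none at all, which would make recall on the test set undefined.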

First, three models (Logistic Regression, Random Forest Classifier, and XGB Classifier) are trained on the data without balancing it. The hyperparameters of each model are tuned before the final fit.
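
The tuning helpers (`u.hyperparameter_tuning_logistic` and its siblings) are defined in `utils.py`. Since `GridSearchCV` is imported at the top of the notebook, a minimal sketch of what such a helper might do is shown below. The parameter grid, the `f1` scoring choice, and the five-fold stratified cross-validation are assumptions for illustration; the real grids are not shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the training split (about 5% positives)
X_tune_demo, y_tune_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Hypothetical grid, chosen only for this example
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1",  # accuracy is uninformative when one class dominates
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_tune_demo, y_tune_demo)

# best_params_ is a plain dict, so it can be unpacked into a new estimator
best_params_demo = search.best_params_
print(best_params_demo)
```

Returning `best_params_` as a dict is what makes the `LogisticRegression(**best_params_lr)` pattern used in the cells below work.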

##### Logistic Regression

In [None]:
# Get the best parameters
best_params_lr = u.hyperparameter_tuning_logistic(X_train, y_train)

# Create a new LogisticRegression with the best hyperparameters
best_lr_model = LogisticRegression(**best_params_lr)

# Fit the model with the training data
best_lr_model.fit(X_train, y_train)

##### Evaluating Performance of Logistic Regression

In [None]:
# Predict on the X_test using the trained LogisticRegression model
lr_pred = best_lr_model.predict(X_test)

# Compute and print the evaluation metrics
u.print_evaluation_metrics(y_test, lr_pred)

# Plot the confusion matrix of the predicted result
u.plot_confusion_matrix(y_test, lr_pred)
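
The metrics imported at the top (accuracy, precision, recall, F1, confusion matrix) do not all behave the same way on imbalanced data. The toy example below, which uses made-up labels rather than the notebook's results, shows why accuracy alone can be misleading: a model that never predicts fraud still scores 95% accuracy while catching no fraud at all.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels: 95 legitimate (0) and 5 fraudulent (1) transactions
y_true_demo = np.array([0] * 95 + [1] * 5)

# A useless "model" that predicts the majority class for every transaction
y_pred_demo = np.zeros_like(y_true_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)
# zero_division=0 avoids a warning when no positives are predicted
prec = precision_score(y_true_demo, y_pred_demo, zero_division=0)
rec = recall_score(y_true_demo, y_pred_demo, zero_division=0)
f1 = f1_score(y_true_demo, y_pred_demo, zero_division=0)
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
print(cm)  # rows are true classes, columns are predicted classes
```

For this reason, recall, precision, and F1 on the fraud class are usually the more informative numbers when comparing the models in this notebook.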


##### Random Forest Classifier

In [None]:
# Get the best parameters
best_params_rf = u.hyperparameter_tuning_rf(X_train, y_train)

# Create a new RandomForestClassifier with the best hyperparameters
best_rf_model = RandomForestClassifier(**best_params_rf)

# Fit the model with the training data
best_rf_model.fit(X_train, y_train)

##### Evaluating Performance of Random Forest Classifier

In [None]:
# Predict on the X_test using the trained RandomForestClassifier model
rf_pred = best_rf_model.predict(X_test)

# Compute and print the evaluation metrics
u.print_evaluation_metrics(y_test, rf_pred)

# Plot the confusion matrix of the predicted result
u.plot_confusion_matrix(y_test, rf_pred)

##### XGB Classifier

In [None]:
# Get the best parameters
best_params_xgb = u.hyperparameter_tuning_xgb(X_train, y_train)

# Create a new XGBClassifier with the best hyperparameters
best_xgb_model = XGBClassifier(**best_params_xgb)

# Fit the model with the training data
best_xgb_model.fit(X_train, y_train)

##### Evaluating Performance of XGB Classifier

In [None]:
# Predict on the X_test using the trained XGBClassifier model
xgb_pred = best_xgb_model.predict(X_test)

# Compute and print the evaluation metrics
u.print_evaluation_metrics(y_test, xgb_pred)

# Plot the confusion matrix of the predicted result
u.plot_confusion_matrix(y_test, xgb_pred)

Next, the training data is oversampled using SMOTE, and the same three models are tuned and trained again. Only the training split is oversampled; the models are still evaluated on the original, untouched test set.

In [None]:
# Oversample
X_train_oversampled, y_train_oversampled = u.oversample(X_train, y_train)

##### Logistic Regression on Oversampled Data

In [None]:
# Get the best parameters
best_params_lr_os = u.hyperparameter_tuning_logistic(X_train_oversampled, y_train_oversampled)

# Create a new LogisticRegression with the best hyperparameters
best_lr_model_os = LogisticRegression(**best_params_lr_os)

# Fit the model with the training data
best_lr_model_os.fit(X_train_oversampled, y_train_oversampled)

##### Evaluating Performance of Logistic Regression on Oversampled Data

In [None]:
# Predict on the X_test using the trained LogisticRegression model
lr_os_pred = best_lr_model_os.predict(X_test)

# Compute and print the evaluation metrics
u.print_evaluation_metrics(y_test, lr_os_pred)

# Plot the confusion matrix of the predicted result
u.plot_confusion_matrix(y_test, lr_os_pred)

##### Random Forest Classifier on Oversampled Data

In [None]:
# Get the best parameters
best_params_rf_os = u.hyperparameter_tuning_rf(X_train_oversampled, y_train_oversampled)

# Create a new RandomForestClassifier with the best hyperparameters
best_rf_model_os = RandomForestClassifier(**best_params_rf_os)

# Fit the model with the training data
best_rf_model_os.fit(X_train_oversampled, y_train_oversampled)

##### Evaluating Performance of Random Forest Classifier on Oversampled Data

In [None]:
# Predict on the X_test using the trained RandomForestClassifier model
rf_os_pred = best_rf_model_os.predict(X_test)

# Compute and print the evaluation metrics
u.print_evaluation_metrics(y_test, rf_os_pred)

# Plot the confusion matrix of the predicted result
u.plot_confusion_matrix(y_test, rf_os_pred)

##### XGB Classifier on Oversampled Data

In [None]:
# Get the best parameters
best_params_xgb_os = u.hyperparameter_tuning_xgb(X_train_oversampled, y_train_oversampled)

# Create a new XGBClassifier with the best hyperparameters
best_xgb_model_os = XGBClassifier(**best_params_xgb_os)

# Fit the model with the training data
best_xgb_model_os.fit(X_train_oversampled, y_train_oversampled)

##### Evaluating Performance of XGB Classifier on Oversampled Data

In [None]:
# Predict on the X_test using the trained XGBClassifier model
xgb_os_pred = best_xgb_model_os.predict(X_test)

# Compute and print the evaluation metrics
u.print_evaluation_metrics(y_test, xgb_os_pred)

# Plot the confusion matrix of the predicted result
u.plot_confusion_matrix(y_test, xgb_os_pred)

##### Conclusion

The results of all six models, with and without oversampling, were similar on this data.

Possible improvements to the process:

* Remove outliers and compare the results. Because this is a fraud detection problem, I did not remove the outliers: data points that look like outliers could be the actual positive cases in this kind of problem, and removing them would increase the class imbalance.

* Try a different feature selection approach and compare the results.

* Try different approaches to balancing the data. Instead of only using SMOTE to oversample the minority class, undersampling, class weights, or sample weights could be used, as they might perform better in this case.
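
As a starting point for the class-weight idea in the last bullet, the sketch below shows how scikit-learn's `"balanced"` weighting is computed and how it can be passed to `LogisticRegression`. It uses synthetic data and has not been run on the notebook's dataset, so it says nothing about whether it would actually improve the results here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (about 5% positives), for illustration only
X_cw_demo, y_cw_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# "balanced" weights are n_samples / (n_classes * count_of_class),
# so the rare class receives the larger weight
classes = np.unique(y_cw_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_cw_demo)
print(dict(zip(classes.tolist(), weights.tolist())))

# The same weighting applied inside the estimator; no resampling is needed
weighted_lr = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted_lr.fit(X_cw_demo, y_cw_demo)

# For XGBClassifier the analogous parameter is scale_pos_weight,
# commonly set to (number of negatives) / (number of positives)
neg, pos = np.bincount(y_cw_demo)
scale_pos_weight = neg / pos
print(scale_pos_weight)
```

`RandomForestClassifier` accepts the same `class_weight="balanced"` argument, so all three models could be compared against their SMOTE counterparts using the existing evaluation helpers.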
