# Introduction

Hello there! In this project, we will try to detect fraud in a very imbalanced dataset of credit card transactions. As the dataset description says, the features are transformed by PCA (Principal Component Analysis, a dimensionality reduction technique) and have their names hidden for privacy reasons. We will explore the data and tune a classification model to find the best solution we can. Let's start!

### Acknowledgements
Special thanks to the author of this notebook, which helped me a lot in writing this script!

- https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets

### Dependencies

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np

# Reading Data

In [None]:
data = pd.read_csv('../input/creditcardfraud/creditcard.csv')
data.head()

In [None]:
data.shape

Given that the dataset is extremely imbalanced, we will use StratifiedShuffleSplit to split it into train and test sets that keep the same proportion of our target class.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data['Class']):
    df_train = data.loc[train_index]
    df_test = data.loc[test_index]

print(df_train.shape)
print(df_test.shape)

In [None]:
print("Train Set Class column:")
print(df_train.Class.value_counts(normalize=True))

print("\nTest Set Class column:")
print(df_test.Class.value_counts(normalize=True))
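
To see what stratification buys us on rare classes, here is a minimal sketch on toy labels (the 0.2% positive rate, the array sizes, and the helper name `positive_rate_in_test_split` are assumptions made for illustration, not values from the dataset). It compares the positive rate in the test part of a stratified split with that of a plain shuffle split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Toy labels (assumption for illustration): 10,000 rows, 20 positives
y = np.zeros(10_000, dtype=int)
y[:20] = 1
X = np.arange(len(y)).reshape(-1, 1)


def positive_rate_in_test_split(splitter):
    """Return the share of positives in the test part of the first split."""
    _, test_idx = next(splitter.split(X, y))
    return y[test_idx].mean()


stratified_rate = positive_rate_in_test_split(
    StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
)
plain_rate = positive_rate_in_test_split(
    ShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
)

print(f"Overall positive rate:  {y.mean():.4f}")
print(f"Stratified test rate:   {stratified_rate:.4f}")
print(f"Unstratified test rate: {plain_rate:.4f}")
```

The stratified split reproduces the overall rate by construction, while the plain split can drift from it when positives are this rare.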

Now that we have split our dataset, let's set the test set aside and focus on our training set. Let's explore its data!

# Exploratory Data Analysis (EDA)

In [None]:
df_train.info()

In [None]:
df_train.columns[df_train.isnull().any()]

As we can see, no column has null values, and all of them are numerical. As I said earlier, the columns were scaled and transformed by PCA. But I was lying... not all of them! The `Time` and `Amount` columns aren't scaled. Let's explore them to find the best way to scale them.

In [None]:
df_train[["Time", "Amount", "Class"]].describe()

In [None]:
df_train.Time.nunique()

In [None]:
# Setting some parameters to plot better graphs
custom_params = {"axes.spines.right": False, 
                 "axes.spines.top": False,  
                 "font.family": "arial", 
                 "figure.figsize": (18, 6)}
sns.set_theme(style="ticks", rc=custom_params)

In [None]:
g = sns.histplot(df_train.Time, bins=100, label="Time")
g.set_xlabel("Time")
g.set_ylabel("Frequency")
g.set_title("Time Distribution")
plt.show()

In [None]:
g = sns.boxplot(x="Class", y="Time", data=df_train)
g.set_title("Time Distribution by Class")
plt.show()

In [None]:
g = sns.histplot(df_train.Amount, bins=100, label="Amount")
g.set_ylabel("Frequency")
g.set_title("Amount Distribution")
plt.show()

In [None]:
g = sns.boxplot(x="Class", y="Amount", data=df_train)
g.set_title("Amount Distribution by Class")
plt.show()

Wow, so many outliers! We could remove them, but that might drop rows with Class==1 (which, as we saw earlier, are rare and important). Another option is a scaling technique that is less sensitive to outliers: RobustScaler, which centers on the median and scales by the interquartile range! Check this amazing article by Jeff Hale about different scaling, normalizing, and standardizing techniques: https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02

In [None]:
from sklearn.preprocessing import RobustScaler

rob_amount = RobustScaler()
rob_time = RobustScaler()
df_train['scaled_amount'] = rob_amount.fit_transform(df_train['Amount'].values.reshape(-1,1))
df_train['scaled_time'] = rob_time.fit_transform(df_train['Time'].values.reshape(-1,1))

df_train.drop(['Time','Amount'], axis=1, inplace=True)
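
To see why a robust scaler suits columns with heavy outliers, here is a small sketch on made-up amounts (the six values are an assumption for illustration). It compares RobustScaler with StandardScaler and checks that RobustScaler's default output matches (x - median) / IQR.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy amounts (assumption for illustration): small values plus one huge outlier
amounts = np.array([5.0, 10.0, 20.0, 50.0, 80.0, 25_000.0]).reshape(-1, 1)

robust = RobustScaler().fit_transform(amounts)
standard = StandardScaler().fit_transform(amounts)

# RobustScaler's default: (x - median) / IQR, with IQR = 75th - 25th percentile
median = np.median(amounts)
q25, q75 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q75 - q25)

print("robust:  ", robust.ravel().round(2))
print("standard:", standard.ravel().round(2))
print("matches manual formula:", np.allclose(robust, manual))
```

With StandardScaler, the single outlier inflates the standard deviation and squeezes the ordinary values into a very narrow range; with RobustScaler they keep a usable spread.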

In [None]:
corr = df_train.corr()
g = sns.heatmap(corr,
                xticklabels=corr.columns.values,
                yticklabels=corr.columns.values,
                cmap='coolwarm_r')
g.set_title("Linear Correlation Heatmap")
plt.show()

As the heatmap shows, `V2` and `V5` have the strongest negative correlations with `scaled_amount`, and `V3` has the strongest negative correlation with `scaled_time`.

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(18,10))
g0 = sns.scatterplot(x="scaled_amount", y="V2", data=df_train, ax=ax[0], hue="Class", alpha=0.5)
g1 = sns.scatterplot(x="scaled_amount", y="V5", data=df_train, ax=ax[1], hue="Class", alpha=0.5)
g2 = sns.scatterplot(x="scaled_time", y="V3", data=df_train, ax=ax[2], hue="Class", alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18,6))

sns.boxplot(x="Class", y="V2", data=df_train, ax=ax[0])
sns.boxplot(x="Class", y="V5", data=df_train, ax=ax[1])
plt.show()
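
Reading correlations off a heatmap is approximate, so it can help to rank them numerically. Below is a minimal sketch with a hypothetical helper, `strongest_correlations`, run on a synthetic frame (the column names `target`, `a`, `b`, `c` and their relationships are assumptions for illustration, not the dataset's features).

```python
import numpy as np
import pandas as pd


def strongest_correlations(df, column, n=3):
    """Return the n features with the largest absolute Pearson correlation to `column`."""
    corr = df.corr()[column].drop(column)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)


# Synthetic frame (assumption for illustration): "a" tracks the target negatively
rng = np.random.default_rng(42)
target = rng.normal(size=1_000)
toy = pd.DataFrame({
    "target": target,
    "a": -target + rng.normal(scale=0.5, size=1_000),
    "b": rng.normal(size=1_000),
    "c": 0.3 * target + rng.normal(size=1_000),
})

ranked = strongest_correlations(toy, "target")
print(ranked)
```

In this notebook, a call such as `strongest_correlations(df_train, "scaled_amount")` would list the exact coefficients behind the colors in the heatmap.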

# Training ML Models

In [None]:
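# Second stratified split: carve a validation set out of the training set
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, valid_index in split.split(df_train, df_train['Class']):
    # split() yields positions within df_train, so index df_train with iloc
    # (data.loc would return unscaled rows, some of them from the test set)
    df_train_ml = df_train.iloc[train_index]
    df_valid = df_train.iloc[valid_index]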

X_train = df_train_ml.drop(['Class'], axis=1)
X_valid = df_valid.drop(['Class'], axis=1)
y_train = df_train_ml['Class']
y_valid = df_valid['Class']

Now we will use RandomizedSearchCV to find the best hyperparameters for our model. Because our dataset is imbalanced, we will also use oversampling: the SMOTE algorithm creates synthetic positive instances to balance the training data.
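
To give an intuition for how SMOTE builds synthetic positives, here is a simplified sketch of its core idea: pick a minority row, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The function `smote_like_samples` and the toy data are assumptions for illustration; this is not imblearn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Simplified illustration of the SMOTE idea, not imblearn's implementation.
    """
    rng = np.random.default_rng(seed)
    # k + 1 because each point comes back as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbours = nn.kneighbors(X_minority)

    # Random minority rows to start from
    base = rng.integers(0, len(X_minority), size=n_new)
    # For each base row, one of its k true neighbours (column 0 is the row itself)
    picked = neighbours[base, rng.integers(1, k + 1, size=n_new)]
    # Random position along the segment between base row and neighbour
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Toy minority class (assumption for illustration): 20 points in 2-D
minority = np.random.default_rng(0).normal(size=(20, 2))
synthetic = smote_like_samples(minority, n_new=100)
print(synthetic.shape)
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region the minority class already occupies.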

In [None]:
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, confusion_matrix
import xgboost as xgb

In [None]:
params_grid = {
                "n_estimators": [100, 500, 1000],
                "max_depth": [2, 5, 10],
                "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1.0],
                "gamma": stats.reciprocal(0.001, 0.1),
                "subsample": np.arange(0.1, 1.0, 0.1),
                "colsample_bytree": np.arange(0.1, 1.0, 0.1),
                "scale_pos_weight": [5, 10, 20, 50, 100],
                "n_jobs": [-1],
                "use_label_encoder": [False],
                "random_state": [42]
        }

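# Extra arguments forwarded to XGBClassifier.fit() for early stopping.
# Note: passing early_stopping_rounds and eval_metric to fit() is deprecated
# since xgboost 1.6; newer releases expect them in the XGBClassifier constructor.
fit_params = {
                "early_stopping_rounds": 5,
                "eval_metric": ["auc"],
                "eval_set": [(X_valid, y_valid)],
                "verbose": 0
        }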

In [None]:
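xgb_clf = xgb.XGBClassifier()
rand_grid = RandomizedSearchCV(xgb_clf, 
                               params_grid, 
                               n_iter=10, 
                               cv=5, 
                               scoring="recall", 
                               random_state=42,
                               verbose=1,
                               n_jobs=-1)
pipe = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_grid)
# Fit through the pipeline so SMOTE resamples X_train before the search runs.
# Fit parameters are routed to the search step, named after its lowercased class.
# Caveat: resampling before the search puts synthetic rows in the CV validation
# folds, so the CV recall is optimistic; X_valid stays untouched.
pipe.fit(X_train, y_train,
         **{f"randomizedsearchcv__{k}": v for k, v in fit_params.items()})
# Samplers are skipped at predict time, so X_valid is scored without resampling
preds = pipe.predict(X_valid)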

f1 = f1_score(y_valid, preds)
precision = precision_score(y_valid, preds)
recall = recall_score(y_valid, preds)
print("F1 Score: %.3f" %f1)
print("Precision: %.3f" %precision)
print("Recall: %.3f" %recall)

In [None]:
cm = confusion_matrix(y_valid, preds, normalize='true', labels=[0,1])
plt.figure(figsize=(6,6))
g = sns.heatmap(cm, annot=True, fmt=".2%", cmap="Blues")
g.set_title("Validation Confusion Matrix")
g.set_xlabel("Predicted Class")
g.set_ylabel("True Class")
plt.show()

## Training On Full Data

In [None]:
best_xgb = rand_grid.best_estimator_

In [None]:
best_xgb

In [None]:
X = df_train.drop(['Class'], axis=1)
y = df_train['Class']

smote = SMOTE(sampling_strategy='minority')
X_res, y_res = smote.fit_resample(X, y)
best_xgb.fit(X_res, y_res)

# Model Explainability

In [None]:
!pip install shap

In [None]:
import shap 

explainer = shap.Explainer(best_xgb)

shap_values = explainer(X, check_additivity=False)

In [None]:
shap.summary_plot(shap_values, X)

# Predicting Test Labels

In [None]:
df_test['scaled_amount'] = rob_amount.transform(df_test['Amount'].values.reshape(-1,1))
df_test['scaled_time'] = rob_time.transform(df_test['Time'].values.reshape(-1,1))

df_test.drop(['Time','Amount'], axis=1, inplace=True)

In [None]:
X_test = df_test.drop(["Class"], axis=1)
y_test = df_test["Class"]

In [None]:
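test_preds = best_xgb.predict(X_test)
# Use a positional index so the labels can index the prediction array directly
y_test.index = range(len(y_test))

# Split labels and predictions by true class to score each class separately
y_test_1 = y_test[y_test == 1]
test_preds_1 = test_preds[y_test_1.index]

y_test_0 = y_test[y_test == 0]
test_preds_0 = test_preds[y_test_0.index]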

In [None]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, test_preds)
acc_1 = accuracy_score(y_test_1, test_preds_1)
acc_0 = accuracy_score(y_test_0, test_preds_0)

print("Total Accuracy: %.1f%%" %(acc*100))
print("Fraud Accuracy: %.1f%%" %(acc_1*100))
print("Non-Fraud Accuracy: %.1f%%" %(acc_0*100))
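
The per-class accuracies printed above are the same quantities as per-class recall: accuracy restricted to the true frauds is the recall of class 1, and accuracy restricted to the non-frauds is the recall of class 0. A small sketch on made-up labels (the ten labels and predictions are assumptions for illustration) makes the link explicit and shows why total accuracy alone can mislead on imbalanced data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (assumption for illustration): 2 frauds among 10 transactions
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

fraud_mask = y_true == 1
fraud_accuracy = accuracy_score(y_true[fraud_mask], y_pred[fraud_mask])
non_fraud_accuracy = accuracy_score(y_true[~fraud_mask], y_pred[~fraud_mask])

# Accuracy restricted to one class equals that class's recall
fraud_recall = recall_score(y_true, y_pred, pos_label=1)
non_fraud_recall = recall_score(y_true, y_pred, pos_label=0)
total_accuracy = accuracy_score(y_true, y_pred)

print(f"Fraud accuracy {fraud_accuracy:.3f} vs recall {fraud_recall:.3f}")
print(f"Non-fraud accuracy {non_fraud_accuracy:.3f} vs recall {non_fraud_recall:.3f}")
print(f"Total accuracy {total_accuracy:.3f}")
```

Here the total accuracy looks decent even though half of the frauds are missed, which is why the fraud-class number deserves the most attention.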

In [None]:
cm = confusion_matrix(y_test, test_preds, normalize='true', labels=[0,1])
plt.figure(figsize=(6,6))
g = sns.heatmap(cm, annot=True, fmt=".2%", cmap="Blues")
g.set_title("Test Confusion Matrix")
g.set_xlabel("Predicted Class")
g.set_ylabel("True Class")
plt.show()

In [None]:
output = pd.DataFrame({"Id": y_test.index, "Class": test_preds})
output.to_csv("output.csv", index=False)

If you made it this far, thank you! I hope my solution helped you in some way. Please let me know in the comments how I can improve it :)