<a id='back_to_top'></a>

# Applying SMOTE to Fraud Detection

---


**Created By**: Wuttipat S. <br>
**Created Date**: 2023-09-24 <br>
**Status**: <span style="color:green">Completed</span>

 <h3 style='background:green; color:#F0FFFF; text-align:center'><left>If you found my notebook helpful or informative, please consider upvoting it to show your support 👍</left></h3>

<div style="text-align: center;">
    <img src="https://img.freepik.com/free-vector/hacker-activity-concept_23-2148534946.jpg?w=826&t=st=1695568933~exp=1695569533~hmac=2559706cefbc543821e7e6f48268a1e85672117841ae52b78937acee20149e72" alt="Fraud Transaction" width="500"/>
    <p style="text-align: center;"><a href="https://www.freepik.com/free-vector/hacker-activity-concept_7970717.htm#query=fraud%20transaction&position=3&from_view=search&track=ais">Image by freepik</a></p>
</div>


## Introduction
In this notebook, we explore the domain of **fraud detection**, a critical application in the banking and finance sectors. In particular, we experiment with **SMOTE (Synthetic Minority Over-sampling Technique)** to better understand its effect on the evaluation metrics of a fraud detection model.

<br>

**This notebook outlines the process of fraud detection using a given dataset. The steps include:**
1. Data inspection
1. Exploratory data analysis
1. Preprocessing
1. Model training and evaluation
     1. Baseline model (LogisticRegression)
     1. Applying SMOTE
     1. RandomForest classifier
1. Summary

<br>

**The dataset consists of the following columns:**
- **step**: Integer value, possibly representing a time step or sequence.
- **type**: Type of transaction (e.g., PAYMENT, TRANSFER, CASH_OUT).
- **amount**: Amount involved in the transaction.
- **nameOrig**: Originator of the transaction.
- **oldbalanceOrg**: Initial balance before the transaction.
- **newbalanceOrig**: New balance after the transaction.
- **nameDest**: Recipient of the transaction.
- **oldbalanceDest**: Initial balance of the recipient before the transaction.
- **newbalanceDest**: New balance of the recipient after the transaction.
-  **isFraud**: Binary label indicating if the transaction is fraudulent (1) or not (0).
-  **isFlaggedFraud**: Binary label possibly indicating if the transaction was flagged as fraud by some system (1) or not (0).

In [None]:
import warnings
warnings.filterwarnings('ignore') # Hide all warnings

In [None]:
# Verify which environment the notebook is running in
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

if iskaggle:
    path = '/kaggle/input/paysim1'
else:
    path = os.getcwd()

### Import Python Libraries

In [None]:
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_fscore_support


# 1. Data Inspection
Let's look into the dataset.

In [None]:
# import the dataset
data = pd.read_csv(f"{path}/PS_20174392719_1491204439457_log.csv")
data.head()

In [None]:
#  summary of the dataset.
data.info()

In [None]:
# descriptive statistics.
data.describe()

In [None]:
# Check for missing values in the dataset
missing_values = data.isnull().sum()
missing_values

In [None]:
# Duplicated row
duplicated = data.duplicated().sum()
duplicated

- The dataset has a large number of entries (over 6 million rows).
- Since the dataset has no missing or duplicated values, no data cleaning is needed.

---

# 2. Exploratory Data Analysis

Now, let's proceed with some Exploratory Data Analysis (EDA) to better understand the dataset. We'll:
- Understand the distribution of the 'isFraud' column
- Visualize the distribution of 'isFraud' across transaction 'type'
- Review the 'isFlaggedFraud' variable
- Compare the correlations among all numeric variables for all transactions versus fraudulent ones ('isFraud' = 1)
- Visualize the distribution of the 'amount' column

In [None]:
# Set the color palette
colors = sns.color_palette("pastel")

data['isFraud'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=0, colors=colors)
plt.ylabel('')  # This removes the 'isFraud' label on the y-axis
plt.title('Distribution of Fraudulent Transactions')
plt.show()

- The target variable, 'isFraud', is imbalanced, with the vast majority of transactions being non-fraudulent.
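
To see why this imbalance matters for evaluation, here is a minimal sketch using made-up counts (not the dataset's actual figures). It shows that a trivial model that always predicts "not fraud" can reach very high accuracy while catching no fraud at all.

```python
# Toy illustration with hypothetical counts (not taken from the dataset).
n_legit = 9_990   # assumed number of non-fraudulent transactions
n_fraud = 10      # assumed number of fraudulent transactions

y_true = [0] * n_legit + [1] * n_fraud
y_pred = [0] * len(y_true)  # trivial model: always predict "not fraud"

# Accuracy: share of all predictions that are correct
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of fraudulent transactions that were detected
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fraud

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.999
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

This is why the later sections look at precision, recall, and F1-score rather than accuracy alone.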


In [None]:
# Create the countplot
plt.figure(figsize=(6, 6))
ax = sns.countplot(x='type', data=data, hue='isFraud', palette='pastel')

# Add a title and labels
plt.title('Transaction Types vs. Fraud Count', fontsize=12)
plt.xlabel('Transaction Type', fontsize=13)
plt.ylabel('Count', fontsize=12)

# Add count values above each bar
for p in ax.patches:
    height = p.get_height()
    if not np.isnan(height):  # Skip bars whose height is NaN
        ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='center', fontsize=10, color='black',
                    xytext=(0, 7),  # adjust the vertical offset
                    textcoords='offset points')

# Display the plot
plt.tight_layout()
plt.show()

- `TRANSFER` and `CASH_OUT` are the only transaction types in which fraud occurs in this dataset.

In [None]:
flagged_data = data.loc[data['isFlaggedFraud']==1]

# Create the countplot
plt.figure(figsize=(6, 6))
ax = sns.countplot(x='isFlaggedFraud', data=data, hue='isFraud', palette='pastel')

# Add a title and labels
plt.title('Flagged Transaction Types vs. Fraud Count', fontsize=13)
plt.ylabel('Count', fontsize=12)

# Add count values above each bar
for p in ax.patches:
    height = p.get_height()
    if not np.isnan(height):  # Skip bars whose height is NaN
        ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='center', fontsize=10, color='black',
                    xytext=(0, 7),  # adjust the vertical offset
                    textcoords='offset points')

# Display the plot
plt.tight_layout()
plt.show()

- Out of all the transactions, only 16 were flagged as fraud. This small number indicates that the flagging system detects very few of the fraudulent transactions. However, every transaction it did flag in this dataset is fraudulent.
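
The observation above amounts to two different ratios: how many flagged transactions are really fraud, and how many fraudulent transactions were flagged. The sketch below computes both on hypothetical `(isFlaggedFraud, isFraud)` pairs; the counts are invented for illustration and are not the dataset's.

```python
from collections import Counter

# Hypothetical (isFlaggedFraud, isFraud) pairs; the counts are made up.
rows = [(0, 0)] * 9_000 + [(0, 1)] * 98 + [(1, 1)] * 2

counts = Counter(rows)  # missing combinations count as 0

flagged = counts[(1, 0)] + counts[(1, 1)]
fraud = counts[(0, 1)] + counts[(1, 1)]

# Of the flagged transactions, how many are really fraud?
share_flagged_that_are_fraud = counts[(1, 1)] / flagged
# Of the fraudulent transactions, how many did the flag catch?
share_fraud_that_was_flagged = counts[(1, 1)] / fraud

print(share_flagged_that_are_fraud)  # 1.0
print(share_fraud_that_was_flagged)  # 0.02
```

A flag can therefore be fully trustworthy when it fires and still miss almost all fraud.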

In [None]:
temp = data.select_dtypes(include=['float64', 'int64'])

# Compute the correlation matrix for 'data' dataframe
corr_data = temp.corr()

# Compute the correlation matrix for 'fraudulent_transaction' dataframe
fraudulent_transaction = temp.loc[data['isFraud']==1]
corr_fraudulent = fraudulent_transaction.corr()

# Create subplots
fig, ax = plt.subplots(1, 2, figsize=(15, 6))

# Heatmap for 'data' dataframe
sns.heatmap(corr_data, annot=False, fmt=".2f", cmap='coolwarm', ax=ax[0])
ax[0].set_title("All Transaction", fontsize=12)

# Heatmap for 'fraudulent_transaction' dataframe
sns.heatmap(corr_fraudulent, annot=False, fmt=".2f", cmap='coolwarm', ax=ax[1])
ax[1].set_title("Fraudulent Transaction", fontsize=12)

fig.suptitle('Compare Heatmap Between All Transaction and Fraudulent Transaction', fontsize=15)

plt.tight_layout()
plt.show()


- The correlation between 'amount' and 'oldbalanceOrg' stands out: it differs notably between the heatmap for all transactions and the one for fraudulent transactions. This is typical for fraud, because criminals likely intend to empty the target's account regardless of its balance, so the amount taken tracks the original balance.

In [None]:
colors = sns.color_palette("pastel")

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(5, 8), sharex=True)

# Distribution of Transaction Amounts for all transactions
axes[0].hist(data['amount'], bins=50, color='skyblue', edgecolor='black')
axes[0].set_title('All')
axes[0].set_ylabel('Number of Transactions')
axes[0].set_yscale('log')  # Set y-axis to log scale
axes[0].grid(axis='y', linestyle='--', alpha=0.7)

# Distribution of Transaction Amounts for fraudulent transactions
axes[1].hist(data.loc[data['isFraud']==1]['amount'], bins=10, color='coral', edgecolor='black')
axes[1].set_title('Fraudulent')
axes[1].set_xlabel('Amount')
axes[1].set_ylabel('Number of Transactions')
axes[1].set_yscale('log')  # Set y-axis to log scale
axes[1].grid(axis='y', linestyle='--', alpha=0.7)

fig.suptitle('Distribution of Transaction Amounts', y=1)
plt.tight_layout()
plt.show()


- The distribution of the overall transaction amount is right-skewed, ranging between 0 and 70 million.
- The distribution of fraudulent transaction amounts is also right-skewed, ranging between 0 and 10 million, with a spike at 10 million. This suggests that no fraudulent transaction exceeds 10 million per transfer, possibly because that is the bank's maximum transfer limit.

---
# 3. Data Preprocessing
Before diving into model training, it's essential to preprocess the data to ensure it's in the right format for our machine learning algorithms. In this section:

- Drop the unused columns.
- Convert *isFraud* and *isFlaggedFraud* to boolean type.
- Encode the categorical column using **OneHotEncoder**.
- Split the data into training and testing sets. Stratified sampling is used to ensure that both sets have a similar distribution of the target variable.


In [None]:
# Remove unused columns
data.drop(['step', 'nameOrig', 'nameDest'], axis=1, inplace=True)

In [None]:
# Convert the data type of some columns
data['isFraud'] = data['isFraud'].astype(bool)
data['isFlaggedFraud'] = data['isFlaggedFraud'].astype(bool)
data.info()

---
### OneHotEncoder
If we label-encode a categorical variable (in this case 'type') as integers (0, 1, 2, ...), algorithms might assume an ordinal relationship between the categories. For example:
- PAYMENT = 0
- TRANSFER = 1
- CASH_OUT = 2
- DEBIT = 3
- CASH_IN = 4

This might lead the model to assume that DEBIT (3) is somehow greater than TRANSFER (1), which doesn't make sense. One-hot encoding avoids this problem by giving each category its own binary column.

<div style="text-align: center;">
<img src="https://miro.medium.com/v2/resize:fit:720/0*T5jaa2othYfXZX9W." alt="Example of OneHotEncoder" width="700"/>
<p style="text-align: center;"><a href="https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179">by Michael DelSole's Medium</a></p>
</div>
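
As a rough illustration of the idea, here is a pure-Python sketch of one-hot encoding on a few transaction types. The helper `one_hot` is hypothetical and only for illustration; the notebook itself uses scikit-learn's `OneHotEncoder`.

```python
# Minimal pure-Python sketch of one-hot encoding (illustrative only;
# the notebook itself uses sklearn's OneHotEncoder).
types = ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN", "TRANSFER"]

categories = sorted(set(types))  # fixed, alphabetical column order


def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    return [1 if value == cat else 0 for cat in categories]


encoded = [one_hot(t, categories) for t in types]

print(categories)
for t, row in zip(types, encoded):
    print(f"{t:<9} -> {row}")
```

Each row contains exactly one `1`, so no category is numerically "larger" than another.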

In [None]:
# Encode the categorical variable

encoder = OneHotEncoder(sparse_output=False)

# Reshape the 'type' column to 2D array for encoder
type_col = data['type'].values.reshape(-1, 1)

encoded_cols = encoder.fit_transform(type_col)

# Reset the index of the data dataframe
data = data.reset_index(drop=True)

# Create a DataFrame from the encoded columns
df_encoded = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(['type']))

# Drop the original 'type' column and concatenate the encoded DataFrame
encoded_data = data.drop(columns=['type'])
encoded_data = pd.concat([encoded_data, df_encoded], axis=1)


encoded_data.head()

---
### Stratified Split
After preprocessing the dataset, we divide it into a training set and a test set. Given the extreme class imbalance in the dataset, a stratified split is a suitable choice.

**Stratified Split** is a method used to partition a dataset into training and test sets, ensuring that the distribution of classes in the test set closely mirrors that of the training set.


<div style="text-align: center;">
    <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*EDS6v3bDOW1VQHdEVXdNQA.png" alt="Example of a stratified split" width="500"/>
    <p style="text-align: center;"><a href="https://medium.com/@analyttica/what-is-meant-by-stratified-split-289a8a986a90">by Analyttica Datalab's Medium</a></p>
</div>
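
The sketch below illustrates what stratification does, using a simplified pure-Python helper. `stratified_split` is a hypothetical function written for this illustration, not scikit-learn's implementation, and the toy labels are invented.

```python
import random


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Take the same fraction of every class for the test set
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical toy labels: 1,000 samples with a 1% positive rate.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"fraud rate in train: {train_rate:.3f}")
print(f"fraud rate in test:  {test_rate:.3f}")
```

With a plain random split, a class this rare could easily end up over- or under-represented in the test set; splitting per class avoids that.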

In [None]:
# Split data into features and target
X = encoded_data.drop(columns=['isFraud'])
y = encoded_data['isFraud']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

---
# 4. Training & Evaluating

In this section, we'll develop a function that trains, tests, and evaluates a given model on given input data. Since we need to run multiple experiments to identify the most effective model for fraud detection, a custom function streamlines and simplifies this process.

In [None]:
def run_model_and_evaluate(model, X_train, y_train, X_test, y_test):
    
    start_time = time.time()  # Start the timer

    # Train the model
    model.fit(X_train, y_train)

    # Predict on test set
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]  # Get the probability of the positive class
    
    # Stop the timer
    end_time = time.time()
    elapsed_time = end_time - start_time  # Calculate elapsed time in seconds

    # Evaluate classifier's performance
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)  # Use probabilities to compute ROC AUC
    
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
    
    metrics_dict = {
        'running_time': elapsed_time,
        'accuracy': accuracy,
        'roc_auc': roc_auc,
        'precision': precision,
        'recall': recall,
        'f1_score': f1
    }

    print(f"Running Time: {elapsed_time:.2f} seconds")
    print(f"Accuracy: {accuracy}")
    print(f"ROC AUC: {roc_auc}")
    print(classification_report(y_test, y_pred))

    # Compute the confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)    

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))

    # Confusion matrix
    sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', cbar=False,
                xticklabels=["Not Fraud", "Fraud"], yticklabels=["Not Fraud", "Fraud"], ax=axes[0])
    axes[0].set_xlabel('Predicted labels')
    axes[0].set_ylabel('True labels')
    axes[0].set_title('Confusion Matrix')

    # ROC curve
    axes[1].plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
    axes[1].plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
    axes[1].set_xlabel('False Positive Rate')
    axes[1].set_ylabel('True Positive Rate')
    axes[1].set_title('Receiver Operating Characteristic (ROC) Curve')
    axes[1].legend(loc='lower right')
    axes[1].grid(alpha=0.2)

    plt.tight_layout()
    plt.show()
    
    return metrics_dict


---
### 4.1 Baseline Model
Before diving deep into complex models, it's a good practice to start with a simple model to set a baseline. Here, we train a **Logistic Regression Classifier** on our data. The results highlight the importance of considering metrics beyond accuracy:

In [None]:
lr = run_model_and_evaluate(LogisticRegression(), X_train, y_train, X_test, y_test)

Here are the results from the logistic regression classifier trained on the original (imbalanced) dataset:

- **Running Time**: 21.18 sec
- **Accuracy**: 99.79%
- **ROC AUC**: 0.96
- **Precision (for Fraudulent transactions)**: 0.36
- **Recall (for Fraudulent transactions)**: 0.79
- **F1-score (for Fraudulent transactions)**: 0.5

Logistic Regression achieves high accuracy and ROC AUC, which suggests the model separates the two classes well overall. However, in fraud detection, where fraudulent transactions are vastly outnumbered by legitimate ones, the balance between precision and recall carries more weight than accuracy alone.

- The recall is 0.79, meaning the model detects 79% of fraudulent transactions. This is not ideal: in fraud detection, a high recall is critical to ensure that fewer fraudulent transactions go undetected.

- The precision is 0.36, which means that for every 100 transactions the model predicts as fraudulent, only 36 are genuinely fraudulent. A lower precision implies that more legitimate transactions are incorrectly flagged.

- The F1-score reflects the trade-off between capturing real fraudulent activity and minimizing false alarms. For a baseline model, a score of 0.5 indicates that there is room for improvement.
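
As a quick sanity check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two rounded values reported above:

```python
# Precision and recall as reported above for the baseline model.
precision = 0.36
recall = 0.79

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2f}")  # F1 = 0.49
```

The result agrees with the reported F1-score of about 0.5. Because the harmonic mean is pulled toward the smaller of the two values, the low precision keeps F1 down despite the fairly high recall.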


---

### 4.2. Applying SMOTE


**SMOTE - Synthetic Minority Over-sampling Technique**. It's a method used to handle class imbalance in datasets by specifically over-sampling the minority class.

#### Why SMOTE?
To deal with class imbalance, the usual starting points are undersampling and oversampling:
- Undersampling (decreasing the number of majority-class samples)
- Oversampling (increasing the number of minority-class samples)

Both have drawbacks:
- Undersampling can lead to a loss of valuable data.
- Oversampling by duplicating minority samples can lead to overfitting.
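
Both drawbacks are visible in a small sketch of random undersampling and random oversampling. The data here is a hypothetical toy set, and this is plain random resampling, not SMOTE.

```python
import random

rng = random.Random(42)

# Hypothetical toy data: 95 majority samples and 5 minority samples.
majority = [("legit", i) for i in range(95)]
minority = [("fraud", i) for i in range(5)]

# Random undersampling: keep only as many majority samples as minority ones.
undersampled = rng.sample(majority, k=len(minority)) + minority

# Random oversampling: draw minority samples with replacement until classes match.
oversampled = majority + rng.choices(minority, k=len(majority))

distinct_minority = len(set(oversampled)) - len(majority)

print(len(undersampled), "rows after undersampling")  # 10 (90 majority rows discarded)
print(len(oversampled), "rows after oversampling")    # 190
print(distinct_minority, "distinct minority rows")    # at most 5, the rest are repeats
```

Undersampling throws away most of the majority class, while oversampling adds no new information: the model sees the same few minority rows many times.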

#### How does SMOTE work?
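To mitigate these issues, SMOTE creates synthetic (not duplicated) minority samples: each new sample is interpolated between an existing minority sample and one of its nearest minority-class neighbors. The training data becomes balanced, as with oversampling, but with a reduction in the bias that comes from repeating the same points.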

<br><br>

<div style="text-align: center;">
    <img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*CeOd_Wbn7O6kpjSTKTIUog.png" alt="Example of SMOTE" width="700"/>
    <p style="text-align: center;"><a href="https://medium.com/@parthdholakiya180/smote-synthetic-minority-over-sampling-technique-4d5a5d69d720">by parth dholakiya's Medium</a></p>
</div>
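
The core idea can be sketched in a few lines of pure Python. This is a simplified illustration, not imbalanced-learn's implementation: the helper `smote_sample`, the choice of `k=2`, and the 2-D points are all assumptions made for the example.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample and one of
    its k nearest minority neighbours (simplified sketch of the SMOTE idea)."""
    rng = rng or random.Random()
    base = rng.choice(minority)

    # k nearest neighbours of `base` among the other minority points
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)

    # Pick a random position on the line segment from base to neighbour
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Hypothetical 2-D minority points (e.g. two scaled numeric features).
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 2.0)]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies on a segment between two real minority points, so it resembles them without being an exact copy.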

In [None]:
# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smoted, y_train_smoted = smote.fit_resample(X_train, y_train)

smote_df = pd.concat([X_train_smoted, y_train_smoted], axis=1).reset_index(drop=True)

# Checking the distribution of the target variable after SMOTE
smoted_distribution = y_train_smoted.value_counts()

smoted_distribution.plot(kind='bar', color=colors)
plt.title('Count of SMOTE Training Set')
plt.xticks(rotation=0)  # Keep the x-tick labels horizontal
plt.show()

- SMOTE has generated synthetic fraudulent transactions. The plot shows that our training set is now balanced.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

frac = 0.05 # Using a smaller fraction will reduce the execution time.

# First subplot
sns.scatterplot(x="newbalanceDest", y="amount", hue="isFraud", data=data.sample(frac=frac, random_state=42), palette="pastel", ax=axes[0])
axes[0].set_title("Original Data")

# Second subplot
sns.scatterplot(x="newbalanceDest", y="amount", hue="isFraud", data=smote_df.sample(frac=frac, random_state=42), palette="pastel", ax=axes[1])
axes[1].set_title("SMOTE Data")

# Main title
fig.suptitle('Scatter plot Comparison of Original and SMOTE Data (Sample with frac=0.05)', fontsize=15, y=1)
plt.tight_layout()
plt.show()

- From the plot, we observe that the fraudulent transactions are not merely duplicated. Instead, SMOTE creates new ones that are similar but not identical to the originals.

---

Next, let's try the **LogisticRegression** model with the **SMOTE** training set.

In [None]:
model = LogisticRegression(random_state=42, max_iter=100)
lr_smote = run_model_and_evaluate(model, X_train_smoted, y_train_smoted, X_test, y_test)

- Using SMOTE has made the Logistic Regression model more sensitive to the minority class (in this case 'isFraud' = 1), leading to a significant increase in recall for fraudulent transactions. The model is now better at catching fraudulent activity. However, this comes at the cost of reduced precision, which makes the trade-off between precision and recall clearer.
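
The precision-recall trade-off itself can be illustrated without any model. The sketch below uses hypothetical scores and labels and varies a decision threshold; this is an analogy for a model becoming more sensitive to fraud, not a description of what SMOTE does internally.

```python
# Hypothetical predicted fraud probabilities and true labels (1 = fraud).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def precision_recall(threshold):
    """Precision and recall when predicting fraud for scores >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A lower threshold means the model flags more transactions as fraud.
for threshold in (0.7, 0.5, 0.25):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold drops, recall rises from 0.50 to 1.00 while precision falls from 1.00 to 0.67: catching more fraud means raising more false alarms.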

<br><br>

Next, we'll introduce an ensemble model (Random Forest) to improve our fraud detection.

---

### 4.3 RandomForest
We will conduct experiments with two training sets:

1. **RandomForest** with **imbalanced** training set.
1. **RandomForest** with the **SMOTE** training set.

In [None]:
model = RandomForestClassifier(random_state=42, n_estimators=10)
rf = run_model_and_evaluate(model, X_train, y_train, X_test, y_test)

In [None]:
model = RandomForestClassifier(random_state=42, n_estimators=10)
rf_smote = run_model_and_evaluate(model, X_train_smoted, y_train_smoted, X_test, y_test)

- We have conducted several experiments with two models across two training sets.
- Next, we'll compile the metrics of all the models we've trained into a single DataFrame for easier comparison.

In [None]:
result_df = pd.DataFrame([lr, lr_smote, rf, rf_smote], index=['LogisticRegression', 'LogisticRegression with SMOTE',
                                                              'RandomForest', 'RandomForest with SMOTE'])
result_df

---
## Visualize Metrics of All Models

In [None]:
colors = sns.color_palette("pastel")

max_value = result_df['running_time'].max()
temp = result_df.copy()
temp['normalize_running_time'] = temp['running_time'] / max_value
temp.drop('running_time', inplace=True, axis=1)
temp = temp.T

In [None]:
ax = temp.plot(kind='bar', figsize=(10, 6), color=colors)
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))  # Place the legend outside of the plot
plt.tight_layout()
plt.show()

In [None]:
# create color dictionary for the plot
c = {
    'lr':'#a1c9f4',
    'lr_smote':'#ffb482',
    'rf':'#8de5a1',
    'rf_smote':'#ff9f9b',
}

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Comparing LogisticRegression model between normal and smote
temp[['LogisticRegression', 'LogisticRegression with SMOTE']].plot(kind='bar', color=[c['lr'], c['lr_smote']], ax=axes[0, 0])
axes[0, 0].set_title("LogisticRegression: Original vs. SMOTE")
axes[0, 0].set_ylabel("Metrics Value")
axes[0, 0].set_xticklabels(temp.index, rotation=45)

# 2. Compare RandomForest model between normal and smote 
temp[['RandomForest', 'RandomForest with SMOTE']].plot(kind='bar', color=[c['rf'], c['rf_smote']], ax=axes[0, 1])
axes[0, 1].set_title("RandomForest: Original vs. SMOTE")
axes[0, 1].set_xticklabels(temp.index, rotation=45)

# 3. Compare LogisticRegression model and RandomForest model normal
temp[['LogisticRegression', 'RandomForest']].plot(kind='bar', color=[c['lr'], c['rf']], ax=axes[1, 0])
axes[1, 0].set_title("LogisticRegression vs. RandomForest (Original)")
axes[1, 0].set_ylabel("Metrics Value")
axes[1, 0].set_xticklabels(temp.index, rotation=45)

# 4. Compare LogisticRegression model and RandomForest model with smote
temp[['LogisticRegression with SMOTE', 'RandomForest with SMOTE']].plot(kind='bar', color=[c['lr_smote'], c['rf_smote']], ax=axes[1, 1])
axes[1, 1].set_title("LogisticRegression vs. RandomForest (SMOTE)")
axes[1, 1].set_xticklabels(temp.index, rotation=45)

plt.tight_layout()
plt.show()


# 5. Summary

- RandomForest demonstrates robust performance, especially in the balance between precision and recall, as reflected by the F1-score. However, it takes longer to run than Logistic Regression.
- With SMOTE, the precision for fraud decreases while the recall increases significantly, indicating fewer missed frauds.
- The trade-off between precision and recall is evident, and the choice of model depends on the specific requirements of the detection task.


## [Back to top](#back_to_top)