# Credit Card Fraud Detection With Machine Learning in Python

This example uses classification to help a credit card company detect potential fraud cases.
The original example can be found [here](https://medium.com/codex/credit-card-fraud-detection-with-machine-learning-in-python-ac7281991d87).

### Notes on running this example:

By default, runs use Bodo, so the data is distributed in chunks across processes.
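
As a rough illustration of what "distributed in chunks" means, the sketch below splits row indices into contiguous, near-equal blocks, one per process. The helper `block_ranges` is hypothetical and written only for this note. It is not part of Bodo's API, and Bodo may assign leftover rows differently.

```python
def block_ranges(n_rows, n_procs):
    """Split n_rows into n_procs contiguous, near-equal [start, stop) ranges."""
    base, extra = divmod(n_rows, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # Assumption for this sketch: the first `extra` ranks take one more row.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Hypothetical sizes: 10 rows spread over 4 processes.
for rank, (start, stop) in enumerate(block_ranges(10, 4)):
    print(f"rank {rank}: rows [{start}, {stop})")
```

Each process then works only on its own block of rows, which is why the cells below mark their DataFrame and array arguments as distributed.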

To run the code:
1. Make sure you [add your AWS account credentials to Saturn Cloud](https://saturncloud.io/docs/examples/python/load-data/qs-load-data-s3/#create-aws-credentials) to access the data.
2. If you want to run the example using pandas only (without Bodo):
    1. Comment out the magic expression (`%%px`) and the Bodo decorator (`@bodo.jit`) in all the code cells.
    2. Then re-run the cells from the beginning.

### Start an IPyParallel cluster
Run the following code in a cell to start an IPyParallel cluster. This example uses one engine per physical core, up to a maximum of 8.

In [None]:
import ipyparallel as ipp
import psutil

n = min(psutil.cpu_count(logical=False), 8)
rc = ipp.Cluster(engines="mpi", n=n).start_and_connect_sync(activate=True)

### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [None]:
%%px
import bodo

print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - NumPy to work with arrays
 - scikit-learn to build and evaluate classification models
 - xgboost for the XGBoost classifier

In [None]:
%%px
import warnings

warnings.filterwarnings("ignore")

import time

import bodo
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # Random forest tree algorithm
from sklearn.linear_model import LogisticRegression  # Logistic regression algorithm
from sklearn.metrics import accuracy_score  # evaluation metric
from sklearn.model_selection import train_test_split  # data split
from sklearn.preprocessing import StandardScaler  # data normalization
from sklearn.svm import LinearSVC  # SVM classification algorithm
from xgboost import XGBClassifier  # XGBoost algorithm

## Data Processing and EDA
1. Load the dataset.
2. Compute the percentage of fraud cases in the overall recorded transactions.
3. Get a statistical view of the transaction amounts for both fraud and non-fraud cases.

In [None]:
%%px
@bodo.jit(distributed=["df"], cache=True)
def load_data():
    start = time.time()
    df = pd.read_csv("s3://bodo-example-data/creditcard/creditcard.csv")
    df.drop("Time", axis=1, inplace=True)
    end = time.time()
    print("Read Time: ", (end - start))
    return df


df = load_data()

In [None]:
%%px
df.shape

In [None]:
%%px
def data_processing(df):
    cases = len(df)
    nonfraud_cases = df[df.Class == 0]
    fraud_cases = df[df.Class == 1]
    nonfraud_count = len(nonfraud_cases)
    fraud_count = len(fraud_cases)
    # Share of fraud cases among all recorded transactions
    fraud_percentage = round(fraud_count / cases * 100, 2)
    print("--------------------------------------------")
    print("Total number of cases is", cases)
    print("Number of non-fraud cases is", nonfraud_count)
    print("Number of fraud cases is", fraud_count)
    print("Percentage of fraud cases is", fraud_percentage)
    print("--------------------------------------------")
    print("--------------------------------------------")
    print("NON-FRAUD CASE AMOUNT STATS")
    print(nonfraud_cases.Amount.describe())
    print("FRAUD CASE AMOUNT STATS")
    print(fraud_cases.Amount.describe())
    print("--------------------------------------------")


data_processing(df)
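
The same two steps (fraud share and per-class amount statistics) can be tried on a small in-memory table without any cluster. This is a minimal sketch using plain pandas. The toy values are hypothetical and only mirror the `Amount` and `Class` columns of the real dataset.

```python
import pandas as pd

# Hypothetical toy data: 1 marks a fraud case, 0 a non-fraud case.
toy = pd.DataFrame(
    {
        "Amount": [10.0, 250.0, 3.5, 99.0, 12.0, 40.0, 7.25, 1.0],
        "Class": [0, 0, 0, 1, 0, 0, 0, 0],
    }
)

fraud_count = int((toy["Class"] == 1).sum())
total_count = len(toy)

# Share of fraud cases among all recorded transactions.
fraud_percentage = round(fraud_count / total_count * 100, 2)
print(fraud_percentage)  # 12.5

# Summary statistics of the amount, one row per class.
amount_stats = toy.groupby("Class")["Amount"].describe()
print(amount_stats)
```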

## Feature Selection & Data Split

### 1. Normalize `Amount` variable
The `Amount` variable spans a much wider range of values than the rest of the variables. To bring it onto a comparable scale, we normalize it using `StandardScaler`.

In [None]:
%%px
@bodo.jit(distributed=["df"], cache=True)
def sc(df):
    start = time.time()
    sc = StandardScaler()
    amount = df["Amount"].values
    amount = amount.reshape(-1, 1)
    sc.fit(amount)
    df["Amount"] = (sc.transform(amount)).ravel()
    print("StandardScaler time: ", time.time() - start)
    print(df["Amount"].head(10))


sc(df)
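
With its default settings, `StandardScaler` subtracts the column mean and divides by the population standard deviation. The sketch below reproduces that arithmetic with the standard library only, on hypothetical amounts chosen so the numbers are easy to check by hand.

```python
import statistics

# Hypothetical transaction amounts, chosen only for illustration.
amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(amounts)   # 5.0
std = statistics.pstdev(amounts)   # population standard deviation: 2.0

# z = (x - mean) / std yields values with zero mean and unit variance.
scaled = [(x - mean) / std for x in amounts]
print(scaled)
```

After scaling, an amount equal to the mean maps to 0, and an amount one standard deviation above the mean maps to 1.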

### 2. Split the data into a training set and a testing set

In [None]:
%%px
@bodo.jit(distributed=["df", "X_train", "X_test", "y_train", "y_test"], cache=True)
def data_split(df):
    X = df.drop("Class", axis=1).values
    y = df["Class"].values.astype(np.int64)
    start = time.time()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, train_size=0.8, random_state=0
    )
    print("train_test_split time: ", time.time() - start)
    print("X_train samples :", X_train[:1])
    print("X_test samples :", X_test[0:1])
    print("y_train samples :", y_train[0:20])
    print("y_test samples :", y_test[0:20])
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = data_split(df)
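
Conceptually, an 80/20 split shuffles the rows and then cuts them into two disjoint parts. The sketch below shows that idea in plain Python. `shuffle_split` is a hypothetical helper written for this note, and it will not select the same rows as scikit-learn's `train_test_split`, even with the same seed.

```python
import random


def shuffle_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices, then cut them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 transactions
train, test = shuffle_split(rows)
print(len(train), len(test))  # 8 2
```

Fixing the seed plays the same role as `random_state=0` in the cell above: repeated runs produce the same partition.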

## Modeling
Here we build four different types of classification models and evaluate them using the accuracy score metric provided by the scikit-learn package.
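
For reference, the accuracy score is the fraction of predictions that match the true labels. The sketch below computes it by hand on hypothetical labels. The `accuracy` helper is written for this note and only mirrors the default behavior of `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 1 fraud case among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_nonfraud = [0] * 10

print(accuracy(y_true, always_nonfraud))  # 0.9
```

Because fraud cases are rare, a model that never flags fraud can still score a high accuracy, so the scores below are best read together with the class counts printed earlier.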

#### 1. Logistic Regression

In [None]:
%%px
@bodo.jit(distributed=["X_train", "y_train", "X_test", "y_test"], cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    lr_yhat = lr.predict(X_test)
    print("LogisticRegression fit and predict time: ", time.time() - start)
    print(
        "Accuracy score of the Logistic Regression model is {}".format(
            accuracy_score(y_test, lr_yhat)
        )
    )


lr_model(X_train, X_test, y_train, y_test)

#### 2. Random Forest Tree

In [None]:
%%px
@bodo.jit(distributed=["X_train", "y_train", "X_test", "y_test"], cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier(max_depth=4)
    rf.fit(X_train, y_train)
    rf_yhat = rf.predict(X_test)
    print("RandomForestClassifier fit and predict time: ", time.time() - start)
    print(
        "Accuracy score of the Random Forest Tree model is {}".format(
            accuracy_score(y_test, rf_yhat)
        )
    )


rf_model(X_train, X_test, y_train, y_test)

#### 3. XGBoost Model

In [None]:
%%px
@bodo.jit(distributed=["X_train", "y_train", "X_test", "y_test"], cache=True)
def xgb_model(X_train, X_test, y_train, y_test):
    start = time.time()
    xgb = XGBClassifier(max_depth=4)
    xgb.fit(X_train, y_train)
    xgb_yhat = xgb.predict(X_test)
    print("XGBClassifier fit and predict time: ", time.time() - start)
    print(
        "Accuracy score of the XGBoost model is {}".format(
            accuracy_score(y_test, xgb_yhat)
        )
    )


xgb_model(X_train, X_test, y_train, y_test)

#### 4. SVM

In [None]:
%%px
@bodo.jit(distributed=["X_train", "y_train", "X_test", "y_test"], cache=True)
def lsvc_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lsvc = LinearSVC(random_state=42)
    lsvc.fit(X_train, y_train)
    lsvc_yhat = lsvc.predict(X_test)
    print("LinearSVC fit and predict time: ", time.time() - start)
    print(
        "Accuracy score of the Linear Support Vector Classification model is {}".format(
            accuracy_score(y_test, lsvc_yhat)
        )
    )


lsvc_model(X_train, X_test, y_train, y_test)

In [None]:
# To stop the cluster, run the following command.
rc.cluster.stop_cluster_sync()