```
Copyright 2022 IBM Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Random Forest on Credit Card Fraud Dataset

## Background 

The task posed by this dataset is to predict whether a credit card transaction is fraudulent or genuine based on a set of anonymized features.

## Source

The raw dataset was obtained from [Kaggle: Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud).

## Goal

The goals of this notebook are to illustrate how to use Snap ML to: 1) import a scikit-learn random forest trained on this dataset into Snap ML, and 2) run inference on the Z AI accelerator using the Snap ML prediction engine.

## Code

In [3]:
# Directory where the dataset is stored
# For this meetup, the dataset is already present in the cache directory
CACHE_DIR='cache-dir'

In [9]:
# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

In [10]:
# The numpy library provides efficient arrays for manipulating the dataset
# See https://numpy.org/doc/stable/user/absolute_beginners.html
import numpy as np

# Used to time inference with the sklearn and Snap ML frameworks
import time

# The datasets module loads and pre-processes the dataset
# It can also download the dataset from Kaggle if necessary (but we won't be using that feature for now)
from datasets import CreditCardFraud

# Balanced accuracy (the mean of per-class recall) is used to score the models
from sklearn.metrics import balanced_accuracy_score as score

# The sklearn2pmml library allows us to export an sklearn model in PMML format
# See https://github.com/jpmml/sklearn2pmml
from sklearn2pmml import sklearn2pmml

# PMMLPipeline extends the sklearn Pipeline and is used here for exporting the model
# A pipeline allows us to automate a number of steps required for creating a model
# See https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn2pmml import PMMLPipeline

In [7]:
# We use the Random Forest Classifier to train the model; this is scikit-learn's version
from sklearn.ensemble import RandomForestClassifier

# Import Snap ML's version of the Random Forest Classifier
from snapml import RandomForestClassifier as SnapRandomForestClassifier

In [None]:
dataset = CreditCardFraud(cache_dir=CACHE_DIR)
X_train, X_test, y_train, y_test = dataset.get_train_test_split()

In [None]:
print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))
print("Classes:  ", (np.unique(y_train)))
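
Fraud datasets are typically heavily imbalanced, which is presumably why this notebook scores models with `balanced_accuracy_score` rather than plain accuracy. The sketch below, written in plain Python with made-up labels (not taken from the dataset), illustrates the difference: balanced accuracy is the mean of per-class recall, so a classifier that always predicts "genuine" scores 0.5 even when its plain accuracy is high. The helper names are hypothetical.

```python
from collections import defaultdict

def plain_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# Made-up labels: 98 genuine (0) and 2 fraudulent (1) transactions
y_true = [0] * 98 + [1] * 2
# A trivial classifier that labels every transaction as genuine
y_always_genuine = [0] * 100

acc = plain_accuracy(y_true, y_always_genuine)
bal = balanced_accuracy(y_true, y_always_genuine)
print("Plain accuracy:    %.2f" % acc)
print("Balanced accuracy: %.2f" % bal)
```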

In [None]:
# Create a scikit-learn Random Forest Classifier model
model = RandomForestClassifier(n_estimators=200, max_depth=6, n_jobs=4, random_state=42)

# Train a PMML pipeline that uses the scikit-learn model defined above
pipeline = PMMLPipeline([("model", model)]).fit(X_train, y_train)

# Save the trained PMML pipeline to a file, e.g., "model.pmml"
sklearn2pmml(pipeline, "model.pmml", with_repr=True)

np.random.seed(1000)

# Create and score batches of rows using the PMML pipeline
test_data_size = 128

times = []
scores = []

for batch_index in range(100):
    test_data_indices = np.random.choice(X_test.shape[0], test_data_size)
    
    t0 = time.time()
    preds = pipeline.predict(X_test[test_data_indices])
    t_predict_sklearn = time.time() - t0
    
    times.append(t_predict_sklearn)
    scores.append(score(y_test[test_data_indices], preds))

t_predict_sklearn = np.mean(np.array(times))
score_sklearn = np.mean(np.array(scores))
print("Inference time (sklearn): %6.2f milliseconds" % (1000*t_predict_sklearn))
print("Accuracy score (sklearn): %.4f" % (score_sklearn))
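
The loop above records one wall-clock measurement per batch and reports the mean. A minimal sketch of the same pattern as a reusable helper follows. `time_batches` and `dummy_predict` are hypothetical names, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic, higher-resolution clock. Reporting the median alongside the mean is an addition made here, since a single slow batch can skew the mean.

```python
import statistics
import time

def time_batches(predict, batches):
    """Call predict on each batch; return (mean, median) latency in milliseconds."""
    latencies_ms = []
    for batch in batches:
        t0 = time.perf_counter()
        predict(batch)
        latencies_ms.append(1000 * (time.perf_counter() - t0))
    return statistics.mean(latencies_ms), statistics.median(latencies_ms)

# Hypothetical stand-in for pipeline.predict or snapml_model.predict
def dummy_predict(rows):
    return [sum(row) > 1.0 for row in rows]

# 100 batches of 128 rows with 3 synthetic features each
batches = [[[0.1 * i, 0.2 * i, 0.3 * i] for i in range(128)] for _ in range(100)]
mean_ms, median_ms = time_batches(dummy_predict, batches)
print("Mean latency:   %6.3f milliseconds" % mean_ms)
print("Median latency: %6.3f milliseconds" % median_ms)
```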

In [None]:
# Create a Snap ML Random Forest Classifier model
snapml_model = SnapRandomForestClassifier()

# Import the scikit-learn model into Snap ML
# To indicate that the Snap ML predict engine should run on the Z AI accelerator, use the "zdnn_tensors" tree format
snapml_model.import_model("model.pmml", "pmml", tree_format="zdnn_tensors")

# Set the number of CPU threads used at inference time
snapml_model.set_params(n_jobs=4)

np.random.seed(1000)

# Create and score batches of rows using the Snap ML predict engine
# The current implementation can run inference on test data sets with fewer than 32768 rows
test_data_size = 128

times = []
scores = []

for batch_index in range(100):
    test_data_indices = np.random.choice(X_test.shape[0], test_data_size)

    t0 = time.time()
    preds = snapml_model.predict(X_test[test_data_indices])
    t_predict_snapml = time.time() - t0
    
    times.append(t_predict_snapml)
    scores.append(score(y_test[test_data_indices], preds))

t_predict_snapml = np.mean(np.array(times))
score_snapml = np.mean(np.array(scores))
print("Inference time (snapml): %6.2f milliseconds" % (1000*t_predict_snapml))
print("Accuracy score (snapml): %.4f" % (score_snapml))
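
The comment above states that the current implementation handles test sets with fewer than 32768 rows. One way to score a larger test set, sketched under that assumption, is to split the row range into consecutive chunks of at most 32767 rows and predict each chunk separately. `chunk_bounds` and `MAX_ROWS` are hypothetical names and not part of Snap ML.

```python
MAX_ROWS = 32767  # assumption: largest row count strictly below 32768

def chunk_bounds(n_rows, max_rows=MAX_ROWS):
    """Yield (start, stop) index pairs that cover range(n_rows) in order."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    for start in range(0, n_rows, max_rows):
        yield start, min(start + max_rows, n_rows)

# Example: a test set of 70000 rows is covered by three chunks
bounds = list(chunk_bounds(70000))
print(bounds)
```

With NumPy arrays, each pair could be used as `snapml_model.predict(X_test[start:stop])`, with the per-chunk results joined by `np.concatenate`.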

In [None]:
speed_up = t_predict_sklearn/t_predict_snapml
score_diff = (score_snapml - score_sklearn)/score_sklearn
print("Snap ML vs Scikit-Learn Inference Speed-up: %.1f x" % (speed_up))
print("Relative diff. in score: %.4f" % (score_diff))
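
To make the two summary numbers concrete: the speed-up is the ratio of mean inference times, and the score difference is taken relative to the scikit-learn score, so a value near zero suggests the imported model scores about as well as the original. The figures in the sketch below are invented for illustration and are not measurements; `summarize` is a hypothetical helper.

```python
def summarize(t_sklearn, t_snapml, score_sklearn, score_snapml):
    """Return (speed_up, relative_score_diff) using the formulas from the cell above."""
    speed_up = t_sklearn / t_snapml
    score_diff = (score_snapml - score_sklearn) / score_sklearn
    return speed_up, score_diff

# Invented values: mean times in seconds and balanced accuracy scores
speed_up, score_diff = summarize(0.030, 0.002, 0.90, 0.90)
print("Snap ML vs Scikit-Learn Inference Speed-up: %.1f x" % speed_up)
print("Relative diff. in score: %.4f" % score_diff)
```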

## Disclaimer

Performance results always depend on the hardware and software environment. 

Information about the environment used to run this notebook is provided below:

In [None]:
import utils
environment = utils.get_environment()
for k, v in environment.items():
    print("%15s: %s" % (k, v))
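
`utils.get_environment()` appears to be a helper that ships alongside this notebook, and its contents are not shown here. As a rough stand-in that assumes only the Python standard library, similar information can be gathered with the `platform` module. The function name and the choice of keys below are hypothetical.

```python
import platform

def get_basic_environment():
    """Collect a few environment details using only the standard library."""
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }

environment = get_basic_environment()
for k, v in environment.items():
    print("%15s: %s" % (k, v))
```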