```
Copyright 2022 IBM Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

# Random Forest on Credit Card Fraud Dataset

## Background 

The goal of this competition is to predict if a credit card transaction is fraudulent or genuine based on a set of anonymized features.

## Source

The raw dataset was obtained from [Kaggle: Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud).

## Goal

The goals of this notebook are to illustrate how to use Snap ML to: 1) import a scikit-learn random forest trained on this dataset into Snap ML, and 2) run inference on the Z AI accelerator using the Snap ML prediction engine.

## Code

In [None]:
# This is the directory where the dataset is stored
# For this meetup, the dataset is already present in the cache directory
CACHE_DIR='cache-dir'

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# The numpy library gives us highly efficient arrays for manipulating the dataset
# Please refer to https://numpy.org/doc/stable/user/absolute_beginners.html
import numpy as np

# For measuring the inference time of the sklearn and Snap ML frameworks
import time

# datasets module loads and pre-processes the dataset
# It can also download the dataset from Kaggle if necessary (but we won't be using that feature for now)
from datasets import CreditCardFraud

# The metrics module of sklearn allows us to measure the accuracy of our models
from sklearn.metrics import balanced_accuracy_score as score

# The sklearn2pmml library allows us to export an sklearn model in PMML format
# Please refer to https://github.com/jpmml/sklearn2pmml
from sklearn2pmml import sklearn2pmml

# PMMLPipeline extends the sklearn Pipeline class and is the object that sklearn2pmml exports
# A pipeline allows us to automate a number of steps required for creating a model
# Please refer to https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn2pmml import PMMLPipeline

In [None]:
# We are going to use the Random Forest Classifier for training the model; this is sklearn's version
from sklearn.ensemble import RandomForestClassifier

# Import Snap ML's version of the Random Forest Classifier
from snapml import RandomForestClassifier as SnapRandomForestClassifier

In [None]:
# Load the dataset from cache directory
dataset = CreditCardFraud(cache_dir=CACHE_DIR)

# Here, we use the helper functions from the datasets module to split the dataset for training and testing
# X_train and X_test contain the features for training and testing
# y_train and y_test contain the target values, in this case fraudulent or genuine
# Please see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_train, X_test, y_train, y_test = dataset.get_train_test_split()

In [None]:
# Show the number of examples, features and classes in the training set
print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))
print("Classes:  ", (np.unique(y_train)))
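
Fraudulent transactions are a small minority in this dataset, which is why the notebook imports `balanced_accuracy_score` rather than plain accuracy. The sketch below illustrates the difference on synthetic labels (the 990/10 split and the always-genuine model are made up for illustration). It reimplements balanced accuracy as the mean of per-class recall using only NumPy, which matches the default behaviour of the sklearn metric.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Synthetic, illustrative labels: 990 genuine (0) and 10 fraudulent (1)
y_true = np.array([0] * 990 + [1] * 10)

# A useless model that always predicts "genuine"
y_always_genuine = np.zeros_like(y_true)

plain_accuracy = float(np.mean(y_always_genuine == y_true))
balanced = balanced_accuracy(y_true, y_always_genuine)

print("Plain accuracy:    %.4f" % plain_accuracy)
print("Balanced accuracy: %.4f" % balanced)
```

The always-genuine model reaches a plain accuracy of 0.99 but a balanced accuracy of only 0.5, because it never detects the fraud class.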

In [None]:
# Create a scikit-learn Random Forest Classifier model
# It trains a number of decision trees and averages their outputs for prediction
# n_estimators is the number of trees in the forest
# max_depth is the maximum depth of individual trees
# n_jobs is the number of jobs to be run in parallel
# random_state seeds sklearn's random number generator so that training is reproducible
# Please refer to https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
model = RandomForestClassifier(n_estimators=200, max_depth=6, n_jobs=4, random_state=42)

# Train a PMML pipeline that uses the scikit-learn model defined above
pipeline = PMMLPipeline([("model", model)]).fit(X_train, y_train)

# Export the trained PMML pipeline, which includes the model, to the file "model.pmml"
sklearn2pmml(pipeline, "model.pmml", with_repr=True)
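
To make the "average of their outputs" step concrete, here is a minimal sketch of how a forest combines its trees: the per-tree class probabilities are averaged and the most probable class is chosen. The probabilities below are hypothetical values for three trees and two transactions, not output from the model trained above.

```python
import numpy as np

# Hypothetical class-probability outputs of 3 trees for 2 transactions.
# Shape: (n_trees, n_samples, n_classes); columns are [genuine, fraud].
tree_probas = np.array([
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.7, 0.3], [0.6, 0.4]],
])

# Average over the trees, then pick the most probable class per sample
forest_proba = tree_probas.mean(axis=0)
forest_pred = forest_proba.argmax(axis=1)

print("Averaged probabilities:\n", forest_proba)
print("Predicted classes:", forest_pred)
```

The second transaction is classified as fraud even though one of the three trees disagrees, because the averaged fraud probability is still the larger one.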

In [None]:
# Set the size of dataset to be used for testing
test_data_size = 128

# List of time taken for each inference
inference_time_list = []

# List of inference scores
score_list = []

# Seed the random number generator of numpy
# This makes the random selection of test batches reproducible
np.random.seed(1000)

# Inference is run 100 times, each time on a batch of `test_data_size` transactions
for batch_index in range(100):
    # Randomly draw `test_data_size` transaction indices (with replacement) from the test dataset
    test_data_indices = np.random.choice(X_test.shape[0], test_data_size)
    
    # Get the current time
    t0 = time.time()
    
    # Do inferences on the `test_data_size` number of transactions
    preds = pipeline.predict(X_test[test_data_indices])
    
    # Calculate the time taken to perform the inference
    t_predict_sklearn = time.time() - t0
    
    # Store the time taken for the inference
    inference_time_list.append(t_predict_sklearn)
    
    # Calculate the scores
    score_list.append(score(y_test[test_data_indices], preds))

# Find the average time taken for inference by sklearn
t_predict_sklearn = np.mean(np.array(inference_time_list))

# Find the average inference score of sklearn
score_sklearn = np.mean(np.array(score_list))

# Show the scores
print("Inference time (sklearn): %6.2f milliseconds" % (1000*t_predict_sklearn))
print("Accuracy score (sklearn): %.4f" % (score_sklearn))
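
The timing pattern used in the loop above can be factored into a small helper. This is a sketch only, not part of the notebook: `time_batches` and `dummy_predict` are hypothetical names, `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals, and the untimed warm-up call is an assumption about how one might exclude one-off start-up cost.

```python
import statistics
import time

def time_batches(predict_fn, batches, warmup=1):
    """Return the per-batch latency in seconds of predict_fn over batches."""
    # Warm-up calls are executed but not timed
    for batch in batches[:warmup]:
        predict_fn(batch)

    latencies = []
    for batch in batches:
        t0 = time.perf_counter()
        predict_fn(batch)
        latencies.append(time.perf_counter() - t0)
    return latencies

# Stand-in for a model's predict method (hypothetical)
def dummy_predict(batch):
    return [x > 0.5 for x in batch]

# 100 batches of 128 values, mirroring the batch layout used in this notebook
batches = [[0.01 * i for i in range(128)] for _ in range(100)]
latencies = time_batches(dummy_predict, batches)

print("Mean latency:   %.4f ms" % (1000 * statistics.mean(latencies)))
print("Median latency: %.4f ms" % (1000 * statistics.median(latencies)))
```

Reporting the median next to the mean is a design choice: a single slow batch can shift the mean noticeably when individual calls take well under a millisecond.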

In [None]:
# Create a Snap ML Random Forest Classifier model
snapml_model = SnapRandomForestClassifier()

# Import the scikit-learn model into Snap ML from the PMML file
# Inference will now be performed by the Snap ML framework
# The call below uses the default tree format; to take advantage of the IBM z16's AI acceleration capabilities, pass `tree_format="zdnn_tensors"` as an additional argument
snapml_model.import_model("model.pmml", "pmml")

# Set the number of CPU threads used at inference time
snapml_model.set_params(n_jobs=4)

# As before, we seed numpy's random number generator with the same value, so that the same batches are drawn
np.random.seed(1000)

# Set the size of dataset to be used for testing
test_data_size = 128

inference_time_list = []
score_list = []

# Do inferences on 100 batches of transactions, now using Snap ML instead of sklearn
# See the previous cell for a detailed description of each step
for batch_index in range(100):
    test_data_indices = np.random.choice(X_test.shape[0], test_data_size)

    t0 = time.time()
    preds = snapml_model.predict(X_test[test_data_indices])
    t_predict_snapml = time.time() - t0
    
    inference_time_list.append(t_predict_snapml)
    score_list.append(score(y_test[test_data_indices], preds))

# Find the average time taken and inference scores
t_predict_snapml = np.mean(np.array(inference_time_list))
score_snapml = np.mean(np.array(score_list))
print("Inference time (snapml): %6.2f milliseconds" % (1000*t_predict_snapml))
print("Accuracy score (snapml): %.4f" % (score_snapml))
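
Both benchmark loops call `np.random.seed(1000)` before drawing batches, so sklearn and Snap ML are evaluated on exactly the same transactions. The sketch below checks that property in isolation. `draw_batches` is a hypothetical helper, and the row count of 5000 is an arbitrary stand-in for `X_test.shape[0]`.

```python
import numpy as np

def draw_batches(seed, n_rows, batch_size, n_batches):
    """Seed NumPy's global generator, then draw index batches with replacement."""
    np.random.seed(seed)
    return [np.random.choice(n_rows, batch_size) for _ in range(n_batches)]

first = draw_batches(1000, n_rows=5000, batch_size=128, n_batches=3)
second = draw_batches(1000, n_rows=5000, batch_size=128, n_batches=3)

# With the same seed, every batch contains the same indices in the same order
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print("Identical batches:", same)
```

Note that `np.random.choice` samples with replacement by default, so a batch may contain the same transaction more than once.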

In [None]:
# Compute the speed-up of Snap ML over sklearn as the ratio of their average inference times
speed_up = t_predict_sklearn/t_predict_snapml

# Compute the score difference between Snap ML and sklearn, relative to the sklearn score
score_diff = (score_snapml - score_sklearn)/score_sklearn

# Show the results
print("Snap ML vs Scikit-Learn Inference Speed-up: %.1f x" % (speed_up))
print("Relative diff. in score: %.4f" % (score_diff))
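
As a worked example of the two formulas above, the following sketch plugs in made-up measurements. The numbers are hypothetical and chosen only to illustrate the arithmetic; they are not results from this notebook.

```python
# Hypothetical measurements, chosen only to illustrate the arithmetic
t_sklearn_s = 0.020            # average sklearn inference time per batch, in seconds
t_snapml_s = 0.004             # average Snap ML inference time per batch, in seconds
score_sklearn_example = 0.9000
score_snapml_example = 0.9009

# Speed-up: how many times faster Snap ML is than sklearn
speed_up_example = t_sklearn_s / t_snapml_s

# Relative score difference: positive means Snap ML scored higher
score_diff_example = (score_snapml_example - score_sklearn_example) / score_sklearn_example

print("Speed-up: %.1f x" % speed_up_example)
print("Relative diff. in score: %.4f" % score_diff_example)
```

A speed-up above 1 means Snap ML was faster, and a relative difference close to 0 means the imported model reproduces the sklearn predictions closely.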

## Disclaimer

Performance results always depend on the hardware and software environment. 

Information regarding the environment that was used to run this notebook is provided below:

In [None]:
import utils
environment = utils.get_environment()
for k,v in environment.items():
    print("%15s: %s" % (k, v))
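
The `utils` module is not shown in this notebook. If it is unavailable, comparable information can be gathered with the Python standard library. The sketch below is an assumption about what such a helper might report, not the actual implementation of `utils.get_environment`; `describe_environment` and its keys are hypothetical.

```python
import os
import platform
import sys

def describe_environment():
    """Collect basic hardware and software details (hypothetical stand-in helper)."""
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "cpu_count": os.cpu_count(),
    }

info = describe_environment()
for key, value in info.items():
    print("%15s: %s" % (key, value))
```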