# Dealing with massively imbalanced datasets using YData and UbiOps
This notebook will show you how to create a model from an imbalanced dataset and serve it on UbiOps.

## Credit Fraud - a highly imbalanced dataset
The dataset in this example is ["Credit Card Fraud detection"](https://kaggle.com/mlg-ulb/creditcardfraud) from Kaggle; for demonstration purposes we can only use datasets from the public domain.
This dataset contains labeled transactions from European credit card holders. The features are the result of a dimensionality reduction: 28 continuous features (`V1` to `V28`), plus the transaction `Amount` and a `Time` column giving the number of seconds elapsed between each transaction and the first transaction in the dataset.

### Installing and importing requirements

In [None]:
!pip install --upgrade scikit-learn==0.22.0 scipy==1.5.4 \
    xgboost==1.3.3 ubiops==3.15.0 ydata-synthetic==0.1.1

In [None]:
# Import everything
import xgboost as xgb
import shutil
import ubiops
import json
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as m

from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix, accuracy_score, classification_report
from ydata_synthetic.synthesizers.regular import WGAN_GP
from xgboost import XGBClassifier
from datetime import datetime


### Load the dataset

**Download** the dataset [creditcard.csv](https://storage.googleapis.com/ubiops/data/creditcard.csv) and put it in your current directory.

In [None]:
credit = pd.read_csv('creditcard.csv')

In [None]:
credit.head()

### Data split
Split the dataset into train and test sets. The test set is used again later to evaluate the model trained on the augmented data.

In [None]:
X = credit.drop('Class', axis=1)
cols = X.columns
X = X.values
y = credit['Class']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

In [None]:
count_original = np.unique(y, return_counts=True)
count_train = np.unique(y_train, return_counts=True)
count_test = np.unique(y_test, return_counts=True)

print("Ratio between fraud and normal events for the \033[1mfull\033[0m  dataset:"+ " {:.2%}".format(count_original[1][1]/count_original[1][0]))
print("Ratio between fraud and normal events for the \033[1mtrain\033[0m dataset:"+ " {:.2%}".format(count_train[1][1]/count_train[1][0]))
print("Ratio between fraud and normal events for the \033[1mtest\033[0m dataset:"+" {:.2%}".format(count_test[1][1]/count_test[1][0]))
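
To see why such a small fraud ratio is a problem, consider a baseline that labels every transaction as normal. The sketch below uses the class counts published for this Kaggle dataset (492 frauds among 284,807 transactions) and plain Python lists. It shows that accuracy alone looks excellent while not a single fraud is caught.

```python
# Class counts of the Kaggle credit card fraud dataset:
# 492 frauds among 284,807 transactions
n_total = 284_807
n_fraud = 492

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # baseline that labels every transaction "normal"

correct = sum(t == p for t, p in zip(y_true, y_pred))
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / n_total      # share of all transactions labelled correctly
fraud_recall = caught / n_fraud   # share of frauds that were detected

print(f"accuracy:     {accuracy:.4%}")
print(f"fraud recall: {fraud_recall:.0%}")
```

This is why the rest of the notebook looks at confusion matrices rather than a single accuracy number.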

### The first model
Let's try to develop a model based on the assumption that everything is ok with our dataset, and see how good our classifier is at identifying fraudulent events.
Here we've decided to develop a classifier using `XGBClassifier` from the [XGBoost](https://xgboost.readthedocs.io/) package.

In [None]:
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)

In [None]:
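# Data scaling and preprocessing before training the model

def preprocess_df(df, std_scaler, rob_scaler, fit=True):
    """Scale 'Amount' and 'Time'. The scalers are fitted only when fit=True (training data)."""
    amount = df['Amount'].values.reshape(-1,1)
    time = df['Time'].values.reshape(-1,1)
    if fit:
        std_scaler.fit(amount)
        rob_scaler.fit(time)
    df['Amount'] = std_scaler.transform(amount).ravel()
    df['Time'] = rob_scaler.transform(time).ravel()
    return df

In [None]:
stdscaler = StandardScaler()
robscaler = RobustScaler()

# Fit the scalers on the training data only
X_train = preprocess_df(X_train, stdscaler, robscaler)

In [None]:
# Apply the same (already fitted) transformation to the test dataset
X_test = preprocess_df(X_test, stdscaler, robscaler, fit=False)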
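
A common practice is to learn scaling statistics from the training data only and reuse them for the test data, so that no information from the test set leaks into training. The sketch below illustrates this in plain NumPy. It mirrors what `StandardScaler` (mean and standard deviation) and `RobustScaler` with default settings (median and interquartile range) compute for a single feature. The helper names and toy values are assumptions made for this illustration.

```python
import numpy as np

def fit_standard(x):
    """Return (mean, std), the statistics StandardScaler learns for one feature."""
    return x.mean(), x.std()

def fit_robust(x):
    """Return (median, IQR), the statistics RobustScaler learns by default."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return median, q3 - q1

def apply_scaling(x, center, scale):
    """Scale values with statistics that were learned elsewhere."""
    return (x - center) / scale

# Toy values (illustrative, not taken from the dataset)
train_time = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
test_time = np.array([5.0, 7.0])

# Learn the statistics on the training values, then reuse them on the test values
center, scale = fit_robust(train_time)
test_scaled = apply_scaling(test_time, center, scale)
print(test_scaled)
```

The outlier `100.0` barely affects the median and interquartile range, which is the reason a robust scaler is a reasonable choice for a skewed column.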

In [None]:
# Train an XGBoost classifier

def XGBoost_Classifier(X, y, Xtest):
    """XGBoost training code"""
    classifier = XGBClassifier()
    print('Start fitting XGBoost classifier')
    classifier.fit(X, y)
    y_pred = classifier.predict(Xtest)
    print('Classifier trained.')
    return classifier, y_pred

classifier_model, y_pred = XGBoost_Classifier(X_train, y_train, X_test)

In [None]:
# Now let's check the real metrics for this classifier

def print_confusion_matrix(model, X_test, y_test):
    """ Plot normalized and non-normalized confusion matrices """
    titles_options = [("Confusion matrix, without normalization", None),
                      ("Normalized confusion matrix", 'true')]

    fig, axes = plt.subplots(1,2,figsize=(20,8))
    for (title, normalize), ax in zip(titles_options, axes):

        disp = plot_confusion_matrix(model, X_test, y_test,
                                     display_labels=["Normal", "Fraud"],
                                     cmap=plt.cm.OrRd,
                                     normalize=normalize,
                                     ax=ax)

        ax.set_title(title, fontsize=20, pad=10)

print_confusion_matrix(classifier_model, X_test, y_test)
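
When reading the confusion matrices, the numbers that matter most for fraud detection are precision and recall of the "Fraud" class. In the matrix normalized with `normalize='true'`, each row is divided by the number of true samples in that class, so the Fraud/Fraud cell is the fraud recall. The sketch below computes both metrics from raw counts. The function name and the counts are hypothetical and are not results from this notebook.

```python
def fraud_metrics(tn, fp, fn, tp):
    """Precision and recall for the "Fraud" class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, not results from this notebook
precision, recall = fraud_metrics(tn=85_000, fp=10, fn=20, tp=80)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```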

## Synthetic data to improve the detection of fraud

### Synthetic data with YData synthesizer package

The objective here is to synthesize only fraudulent events. Augmenting the minority class with synthetic fraud samples gives the classifier more fraud examples to learn from, which can improve how well it detects fraud.

In [None]:
# Let's filter by fraudulent events only
aux = X_train.copy()
aux['y'] = y_train.reset_index()['Class']

non_fraud = aux[aux['y'] == 0]
fraud = aux[aux['y']==1]

del aux

In [None]:
noise_dim = 32
dim = 128
batch_size = 164

log_step = 100
epochs = 1000+1
learning_rate = 5e-4
beta_1 = 0.5
beta_2 = 0.9
models_dir = './cache'

gan_args = [batch_size, learning_rate, beta_1, beta_2, noise_dim, fraud.shape[1]-1, dim]
train_args = ['', epochs, log_step]

fraud_synth = WGAN_GP(gan_args, n_critic=2)
fraud_synth.train(fraud.drop('y', axis=1), train_args)

synthetic_fraud = fraud_synth.sample(400)
synthetic_fraud.columns = fraud.drop('y', axis=1).columns

In [None]:
synthetic_fraud.head(5)

### Testing the classifier capacity after adding more fraudulent events

In [None]:
synth_df = synthetic_fraud.copy()
org_df = X_train.copy()

org_df['Class'] = y_train.reset_index()['Class']
org_df['color'] = np.where(org_df['Class']==1, 2, 1)

synth_df['Class'] = 1
synth_df['color'] = 3

full_data = pd.concat([org_df, synth_df])

np.unique(y_test, return_counts=True)

In [None]:
synth_y_train = synth_df['Class']
synth_train = synth_df.drop(['Class', 'color'], axis=1)

X_augmented = pd.concat([X_train, synth_train], axis=0)
y_augmented = pd.concat([y_train, synth_y_train], axis=0)
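
In the cell above, features and labels are matched by row position, which is what the classifier uses during fitting. As an optional safeguard, the sketch below shows how `ignore_index=True` gives the concatenated features and labels the same index, and how to check the new class counts. All data and names in it are toy stand-ins chosen for illustration.

```python
import pandas as pd

# Toy stand-ins (illustrative): features with a fresh RangeIndex,
# labels that still carry their shuffled original index
X_real = pd.DataFrame({"Amount": [0.1, 0.2, 0.3, 0.4]})
y_real = pd.Series([0, 0, 1, 0], index=[17, 3, 42, 8], name="Class")
X_synth = pd.DataFrame({"Amount": [0.9, 1.1]})
y_synth = pd.Series([1, 1], name="Class")

# ignore_index=True gives both objects the same 0..n-1 index
X_aug = pd.concat([X_real, X_synth], axis=0, ignore_index=True)
y_aug = pd.concat([y_real, y_synth], axis=0, ignore_index=True)

print(y_aug.value_counts().to_dict())    # class counts after augmentation
print(X_aug.index.equals(y_aug.index))   # indexes are aligned
```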

In [None]:
synth_classmodel, y_pred = XGBoost_Classifier(X_augmented, y_augmented, X_test)

In [None]:
print_confusion_matrix(synth_classmodel, X_test, y_test)

The results have improved! Now let's continue.

## Export the model to UbiOps

In [None]:
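# Save model for deployment
import os

# Make sure the deployment package folder exists before writing to it
os.makedirs('fraud_deployment', exist_ok=True)

joblib.dump(synth_classmodel, 'fraud_deployment/fraud_model.joblib')
print('XGBoost model built and saved successfully!')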

Contents of `deployment.py`. This is the code that runs on UbiOps.

In [None]:
%%writefile fraud_deployment/deployment.py
"""
The file containing the deployment code is required to be called 'deployment.py' and should contain the 'Deployment'
class and 'request' method.
"""

import pandas as pd
import numpy as np
import os
from joblib import load

class Deployment:

    def __init__(self, base_directory, context):
        print("Initialising xgboost model")

        XGBOOST_MODEL = os.path.join(base_directory, "fraud_model.joblib")
        self.model = load(XGBOOST_MODEL)

    def request(self, data):
        print('Loading data')
        input_data = pd.read_csv(data['input'])
        
        print("Prediction being made")
        prediction = self.model.predict(input_data)
        
        # Writing the prediction to a csv for further use
        print('Writing prediction to csv')
        pd.DataFrame(prediction).to_csv('prediction.csv', header = ['Class prediction'], index_label= 'index')
        
        return {
            "output": 'prediction.csv'
        }
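
Before uploading, it can help to exercise the request logic locally. The following is a rough sketch under stated assumptions: `StubModel` and `LocalDeployment` are hypothetical names, the stub replaces the trained classifier, and the class only mirrors the `request` logic shown above instead of importing `deployment.py`.

```python
import os
import tempfile

import pandas as pd

class StubModel:
    """Hypothetical stand-in for the trained classifier: flags rows with Amount > 1."""
    def predict(self, df):
        return (df["Amount"] > 1).astype(int).to_numpy()

class LocalDeployment:
    """Mirrors the request logic of deployment.py, with the model passed in directly."""
    def __init__(self, model, out_dir):
        self.model = model
        self.out_dir = out_dir

    def request(self, data):
        input_data = pd.read_csv(data["input"])
        prediction = self.model.predict(input_data)
        out_path = os.path.join(self.out_dir, "prediction.csv")
        pd.DataFrame(prediction).to_csv(
            out_path, header=["Class prediction"], index_label="index"
        )
        return {"output": out_path}

with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny input file, run one request, and read the result back
    input_path = os.path.join(tmp, "input.csv")
    pd.DataFrame({"Amount": [0.5, 2.0, 0.1]}).to_csv(input_path, index=False)

    result = LocalDeployment(StubModel(), tmp).request({"input": input_path})
    predictions = pd.read_csv(result["output"])["Class prediction"].tolist()

print(predictions)
```

A check like this catches file-handling mistakes quickly, although it says nothing about the behaviour of the real model or the UbiOps runtime.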


In [None]:
# Zip the deployment package folder so it is ready to be uploaded to UbiOps

shutil.make_archive('fraud_deployment', 'zip', '.', 'fraud_deployment')
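
The `make_archive` call above uses `'.'` as the root directory and `'fraud_deployment'` as the base directory, so every path inside the zip starts with `fraud_deployment/`. The sketch below reproduces that layout in a temporary directory with a placeholder file and lists the archive contents using the standard-library `zipfile` module. The helper name `build_and_list` is an assumption made for this example.

```python
import os
import shutil
import tempfile
import zipfile

def build_and_list(root):
    """Create a dummy package under root, zip it, and return the archived paths."""
    package_dir = os.path.join(root, "fraud_deployment")
    os.makedirs(package_dir, exist_ok=True)
    with open(os.path.join(package_dir, "deployment.py"), "w") as fh:
        fh.write("# placeholder\n")

    # Same argument order as in the notebook: base_name, format, root_dir, base_dir
    archive = shutil.make_archive(
        os.path.join(root, "fraud_deployment"), "zip", root, "fraud_deployment"
    )
    with zipfile.ZipFile(archive) as zf:
        return sorted(zf.namelist())

with tempfile.TemporaryDirectory() as tmp:
    names = build_and_list(tmp)

print(names)
```

Running the same `zipfile` listing on the real `fraud_deployment.zip` is a quick way to confirm that `deployment.py` and the saved model are both included.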

## Deploy the model to UbiOps


### Connect to the UbiOps API

In [None]:
API_TOKEN = 'Token 12344' # Placeholder: replace with your own token, in the format 'Token <your token>'
PROJECT_NAME = 'YOUR PROJECT NAME'
DEPLOYMENT_NAME = 'ydata-fraud-model'
DEPLOYMENT_VERSION = 'v1'

configuration = ubiops.Configuration()
configuration.api_key['Authorization'] = API_TOKEN

client = ubiops.ApiClient(configuration)
api = ubiops.api.CoreApi(client)

# The client stays open here because it is reused for the upload and request calls below
print(api.service_status())
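
Hardcoding an API token in a notebook makes it easy to leak by accident. One option, sketched below, is to read it from an environment variable. The variable name `UBIOPS_API_TOKEN` and the helper `load_api_token` are assumptions made for this example, not part of the UbiOps client library.

```python
import os

def load_api_token(env_var="UBIOPS_API_TOKEN"):
    """Read the token from an environment variable (the variable name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    # The placeholder in this notebook has the form 'Token <value>'; add the prefix if missing
    return token if token.startswith("Token ") else f"Token {token}"

# Demo only: set a dummy value if none is present, so the sketch runs on its own
os.environ.setdefault("UBIOPS_API_TOKEN", "12344")
print(load_api_token().startswith("Token "))
```

With this in place, `API_TOKEN = load_api_token()` could replace the hardcoded string.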

### Create a deployment and version

In [None]:
deployment_template = ubiops.DeploymentCreate(
    name=DEPLOYMENT_NAME,
    description='XGBoost Fraud Detection',
    input_type='structured',
    output_type='structured',
    input_fields=[{'name':'input', 'data_type':'file'}],
    output_fields=[{'name':'output', 'data_type':'file'}]
)

deployment = api.deployments_create(project_name=PROJECT_NAME, data=deployment_template)
print(deployment)

In [None]:
version_template = ubiops.DeploymentVersionCreate(
    version=DEPLOYMENT_VERSION,
    environment='python3-7',
    instance_type='512mb',
    maximum_instances=1,
    minimum_instances=0,
    maximum_idle_time=1800, # = 30 minutes
    request_retention_mode='none' # We don't need request storage in this example
)

version = api.deployment_versions_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    data=version_template
)
print(version)

In [None]:
file_upload_result = api.revisions_file_upload(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    file='fraud_deployment.zip'
)

### Test the deployment after building

In [None]:
ubiops.utils.wait_for_deployment_version(
    client=api.api_client,
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    version=DEPLOYMENT_VERSION,
    revision_id=file_upload_result.revision
)

In [None]:
# Write the request data to csv
X_test.to_csv('test_data.csv', index=False)

# Upload the test_data.csv to UbiOps
file_uri = ubiops.utils.upload_file(client, PROJECT_NAME, 'test_data.csv')

# Make the request using the csv
# Note: this can take a minute because of the cold start (the first time the model is started).
request_result = api.deployment_requests_create(
    project_name=PROJECT_NAME,
    deployment_name=DEPLOYMENT_NAME,
    data={'input': file_uri}
)
print(request_result)

# Download the output file
ubiops.utils.download_file(client, PROJECT_NAME, file_uri=request_result.result['output'])
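
Once the output file has been downloaded, a quick summary shows how many transactions were flagged. The sketch below assumes the column layout written by `deployment.py` (`index` and `Class prediction`) and uses an in-memory sample instead of the real file. The helper name `count_flagged` is hypothetical.

```python
import csv
import io

def count_flagged(csv_file):
    """Return (number of rows predicted as fraud, total number of rows)."""
    rows = list(csv.DictReader(csv_file))
    flagged = sum(1 for row in rows if int(float(row["Class prediction"])) == 1)
    return flagged, len(rows)

# In-memory sample with the same header that deployment.py writes
sample = io.StringIO("index,Class prediction\n0,0\n1,1\n2,0\n")
flagged, total = count_flagged(sample)
print(f"{flagged} of {total} transactions flagged as fraud")
```

For the real output, the same function can be called with `open('prediction.csv', newline='')`.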

## All done! Let's close the client properly.

In [None]:
client.close()

## Wrapping up



In this notebook we have successfully trained a model on a dataset augmented with synthetic data, using the libraries provided by YData. Afterwards we deployed this model on UbiOps for professional serving.

See these links for more info on UbiOps and YData:
https://ydata.ai/
https://ubiops.com/