## Training a Random Forest Model to Predict Fraudulent Credit Card and Interac Requests

In [None]:
# for auto-reloading extensions - helpful if you're writing and testing a package
%reload_ext autoreload
%autoreload 2

# for inline plotting in python using matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

# for easier plots - also makes matplotlib plots look nicer by default
import seaborn as sns

# set up for using plotly offline without an API key - great for interactive plots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import plotly.figure_factory as ff
init_notebook_mode(connected=True)

# for numerical work
import pandas as pd
import numpy as np

import pymongo

import datetime
import time
import json

from pandas.io.json import json_normalize
from pymongo import MongoClient

import pickle

from confluent_kafka import Producer

import bson
from bson import json_util

import math

from einsteinds import db as edb
from einsteinds import event_processing
from einsteinds import ml
from einsteinds import plots
from einsteinds import utils


clean_events = event_processing.clean_events

# load the database credentials from file
with open('../creds/local_creds.json') as json_data:
    creds = json.load(json_data)

### Overview:

The training examples for the random forest model are aggregate features derived from the user's event history in the hour before the request. At a high level, the process to produce these summaries is:

1. Select a credit card or Interac purchase request.
2. Get the events in the hour before the request for the user.
3. Summarize the events into a set of numerical aggregates.
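
The three steps above can be sketched in plain Python. This is a minimal illustration, not the `einsteinds` implementation: the function `summarize_window` and the field names `created_at` and `type` are hypothetical, and the real summaries likely contain many more aggregates than simple counts.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_window(request, events, window=timedelta(hours=1)):
    """Count a user's events by type in the window before a request.

    Field names ('user_email', 'created_at', 'type') are assumptions.
    """
    start = request["created_at"] - window
    # keep only this user's events that fall inside the window
    in_window = [
        e for e in events
        if e["user_email"] == request["user_email"]
        and start <= e["created_at"] < request["created_at"]
    ]
    counts = Counter(e["type"] for e in in_window)
    summary = {"request_id": request["request_id"], "n_events": len(in_window)}
    # one numeric column per event type, e.g. 'n_login'
    summary.update({f"n_{name}": n for name, n in counts.items()})
    return summary


request = {"request_id": 1, "user_email": "a@example.com",
           "created_at": datetime(2018, 1, 5, 12, 0)}
events = [
    {"user_email": "a@example.com", "type": "login",
     "created_at": datetime(2018, 1, 5, 11, 30)},
    {"user_email": "a@example.com", "type": "login",
     "created_at": datetime(2018, 1, 5, 11, 45)},
    # more than an hour before the request, so it is ignored
    {"user_email": "a@example.com", "type": "password_change",
     "created_at": datetime(2018, 1, 5, 10, 0)},
    # belongs to a different user, so it is ignored
    {"user_email": "b@example.com", "type": "login",
     "created_at": datetime(2018, 1, 5, 11, 50)},
]
summary = summarize_window(request, events)
```

Each request becomes one flat record of numbers, which is the shape a classifier expects as a training row.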

The process to train the random forest (or really any classifier) is to:

1. Label each training example as either fraudulent or not by comparing the user emails with the blacklist.
2. Find the optimal random forest model by using Bayesian optimization combined with a grouped n-fold cross-validation split of the training data.
3. Save the model and the model features (columns).
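
The grouped split in step 2 keeps every example from one group on the same side of each fold, so the model is never validated on a group it has already seen. The sketch below shows only the splitting idea in plain Python; it assumes the groups are users (the notebook does not say what the groups are), the helper `grouped_folds` is hypothetical, and the Bayesian search itself is not shown. scikit-learn's `GroupKFold` offers the same kind of split.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Yield (train_idx, valid_idx) so that no group appears on both sides."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    # there cannot be more folds than distinct groups
    n_folds = min(n_folds, len(by_group))
    folds = [[] for _ in range(n_folds)]
    # place groups largest first, each into the currently smallest fold
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        min(folds, key=len).extend(by_group[g])
    for k in range(n_folds):
        valid = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, valid


# one group label per training example (hypothetical users)
groups = ["a", "a", "b", "c", "c", "c", "d"]
splits = list(grouped_folds(groups, n_folds=3))
```

Because the fold count is capped at the number of distinct groups, a dataset with few groups yields fewer folds than requested.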

The process for prediction is as follows:

1. Select a credit card or Interac purchase request for prediction.
2. Generate a single summarized training example.
3. Format the training example so that it is consistent with the features used in the trained model.
4. Generate a prediction.

### Pre-Processing The Data

The `einsteinds` package we created has a number of methods to clean events, generate sets of events related to a request and generate summaries based on those events.

In [None]:
# initialize the database with the credentials
db = edb.Database(creds)

Let's get all the requests for January and February.

In [None]:
# get all the requests in January and February
requests = db.get_deposit_requests(start_date=datetime.datetime(2018,1,1), end_date=datetime.datetime(2018,3,1))

requests[0]

Now we'll generate a single request event set based on the request above.

In [None]:
request_set = db.get_deposit_request_set(requests[0])

request_set

As you can see, the event data is still in its raw format. Before we summarize the data, we want to clean it and make it consistent. The following example shows the output of a cleaned event. We don't normally call this directly; it happens behind the scenes, and the functionality that handles the cleaning lives in the `einsteinds` package.

In [None]:
event_processing.clean_events([request_set['events'][0]])

Now we'll summarize that single request set into a request summary. Note that this also handles event cleaning. We wanted to have the cleaning happen as part of the summary creation rather than at the request set stage, as we may want to use the raw request sets for a different purpose later.

In [None]:
request_summary = db.get_summarized_request_sets(rsets=[request_set])

request_summary.reset_index().to_dict('records')

We can handle this process in pieces or do it all at once. The code below generates all the request sets for January and February.

In [None]:
# get all the request sets in January and February
rsets = db.get_deposit_request_sets(start_date=datetime.datetime(2018,1,1), end_date=datetime.datetime(2018,3,1))

Then we can summarize all those request sets. Alternatively, the second call in the cell below jumps straight to the end result from a date range.

In [None]:
# get the request summaries based on the request sets
summaries = db.get_summarized_request_sets(rsets=rsets)

# or do the whole thing at once in one step
summaries = db.get_summarized_request_sets(start_date=datetime.datetime(2018,1,1), end_date=datetime.datetime(2018,3,1))

In [None]:
summaries

Now we can add the fraud labels to the data with one call that gets the blacklist from the database, compares it against the emails in the requests, and adds a fraud column to the DataFrame.

In [None]:
summaries_with_fraud = db.add_fraud_label(summaries, 'user_email')
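
The labeling rule itself is a membership test against the blacklist. The sketch below works on plain records like those produced by `to_dict('records')`; the helper `label_fraud_records` is hypothetical, and normalizing emails by case and whitespace is an assumption that `db.add_fraud_label` may or may not make.

```python
def label_fraud_records(records, blacklist, email_key="user_email"):
    """Return copies of the records with a 0/1 'fraud' field added."""
    # normalize once so 'A@Example.com' matches 'a@example.com' (an assumption)
    blocked = {email.strip().lower() for email in blacklist}
    return [
        {**record, "fraud": int(record[email_key].strip().lower() in blocked)}
        for record in records
    ]


records = [
    {"request_id": 1, "user_email": "A@Example.com"},
    {"request_id": 2, "user_email": "b@example.com"},
]
labeled = label_fraud_records(records, blacklist=["a@example.com"])
```

The input records are left unchanged, so the unlabeled summaries remain available for other uses.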

Now we can train a random forest on the summarized data. The function below tunes the model using Bayesian hyperparameter optimization and grouped n-fold cross-validation. The number of folds depends on the number of groups in the dataset.

In [None]:
rf_model = ml.generate_optimal_random_forest(summaries_with_fraud)

We can see the parameters of the resulting random forest. The returned dictionary also contains the model features, which need to be saved alongside the classifier to generate predictions.

In [None]:
rf_model
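
Since the classifier and its feature list must travel together, one option is to pickle the whole dictionary. This is a sketch rather than the notebook's actual persistence code: the file name is hypothetical, and a small dict stands in for the fitted classifier so the example runs without any trained model.

```python
import os
import pickle
import tempfile

# stand-in for the dictionary returned by the training step; in practice
# 'classifier' would hold the fitted estimator rather than a plain dict
model = {
    "classifier": {"n_estimators": 100},
    "model_features": ["n_events", "n_login"],
}

# hypothetical location; a temporary directory keeps the example self-contained
path = os.path.join(tempfile.mkdtemp(), "rf_model.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)
```

Only unpickle files from a trusted source, because loading a pickle can execute arbitrary code.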

Now let's get some new data and generate predictions. We'll fetch all the request summaries for March, the month after the training window.


In [None]:
new_data = db.get_summarized_request_sets(start_date=datetime.datetime(2018,3,1), end_date=datetime.datetime(2018,4,1))

In [None]:
new_data.head()

Now we have to format the data for prediction, which basically amounts to selecting the columns that were used to build the model and ordering them correctly.

In [None]:
X = ml.prepare_for_prediction(new_data, rf_model['model_features'])
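
What "consistent with the model features" means can be shown on plain records. The helper `align_features` below is hypothetical and is not the body of `ml.prepare_for_prediction`; filling a missing feature with 0 is an assumption that suits count-style aggregates.

```python
def align_features(records, model_features, fill_value=0):
    """Build rows with exactly the trained columns, in the trained order."""
    return [
        # columns unknown to the model are dropped; missing ones are filled
        [record.get(name, fill_value) for name in model_features]
        for record in records
    ]


model_features = ["n_events", "n_login", "n_password_change"]
new_records = [
    {"n_login": 2, "n_events": 3, "n_new_event_type": 1},
    {"n_events": 1, "n_password_change": 1},
]
rows = align_features(new_records, model_features)
```

Column order matters because a fitted classifier receives positions, not names, once the data is passed as an array.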

In [None]:
results = new_data.reset_index()[['request_id', 'user_email']]

results['fraud'] = rf_model['classifier'].predict(X)

results

We have also created a simple function that predicts directly from a raw request: it creates the request set, cleans the events, produces the summary, and generates a prediction.

In [None]:
new_requests = db.get_deposit_requests(start_date=datetime.datetime(2018,4,1), end_date=datetime.datetime(2018,5,1))

In [None]:
raw_request = new_requests[0]

In [None]:
# predict on a single raw request
ml.predict_from_request(request=raw_request, db=db, model=rf_model)