Like [random decision forest models](https://en.wikipedia.org/wiki/Random_forest), which we covered in [another notebook](03-model-random-forest.ipynb), [gradient boosted trees](https://en.m.wikipedia.org/wiki/Gradient_boosting) work by training an *ensemble* of imprecise decision trees.  However, while the individual trees in a random decision forest are trained independently on [different random subsets of the examples and features](https://en.wikipedia.org/wiki/Bootstrap_aggregating) to reduce variance and avoid overfitting, gradient boosting trains its weak learners one after another, with each new learner focusing on the examples that the existing ensemble mispredicted.
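
To make the idea of "focusing on mispredicted examples" concrete, here is a minimal sketch of boosting under squared-error loss, where each new weak learner (a one-split stump) is fit to the residuals of the current ensemble. It is a toy illustration in plain Python, not how XGBoost is implemented; the helper names `fit_stump`, `boost`, and `mse`, the learning rate, and the data are all assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals (least squares)."""
    best = None
    # Skip the largest value so the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Build an additive ensemble; each stump targets what the ensemble still gets wrong."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy one-dimensional data (hypothetical).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 0, 1, 1]

one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=50)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

Each round shrinks the training error a little, because the new stump is fit only to what the previous rounds left unexplained.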

We will begin by loading in our data set.

In [None]:
import numpy as np
import pandas as pd
df = pd.read_parquet("fraud-cleaned-sample.parquet")

We need to split our data set into two. One part will be used for training the model, and the other will be a testing set we can use to evaluate the model we train. We're dealing with time-series data, so we'll split the data set based on time.

In order to save memory and time, we'll further downsample the training set.

In [None]:
first = df['timestamp'].min()
last = df['timestamp'].max()
cutoff = first + ((last - first) * 0.7)

train = df[df['timestamp'] <= cutoff].sample(frac=0.35, random_state=404).copy()
test = df[df['timestamp'] > cutoff].copy()
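
Note that the cutoff sits at 70% of the *time range*, not at 70% of the rows, so the two parts can differ from a 70/30 row split when transactions are unevenly spread over time. The sketch below shows this on a tiny synthetic frame; the integer timestamps and the `toy` names are assumptions for illustration only.

```python
import pandas as pd

# Synthetic stand-in for the real data: most transactions happen early.
toy = pd.DataFrame({
    "timestamp": [0, 1, 2, 3, 4, 5, 6, 7, 80, 100],
    "amount": [10, 12, 9, 11, 50, 8, 7, 13, 400, 15],
})

first = toy["timestamp"].min()
last = toy["timestamp"].max()
cutoff = first + ((last - first) * 0.7)  # 70% of the way through the time range

toy_train = toy[toy["timestamp"] <= cutoff]
toy_test = toy[toy["timestamp"] > cutoff]

# Every training row precedes every test row, so the model never sees the future.
print(len(toy_train), len(toy_test))
```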

We also load in the feature engineering pipeline stage which we developed in [notebook 2](02-feature-engineering.ipynb). The model takes the feature vectors as input, rather than the raw data.

In [None]:
import cloudpickle as cp
# use a context manager so the file handle is closed once the pipeline is loaded
with open('feature_pipeline.sav', 'rb') as f:
    feature_pipeline = cp.load(f)

#### Dealing with Imbalanced Classes

When the classes in a training data set are not equally represented, we say we are dealing with 'imbalanced classes'. In our data set fewer than 2% of the samples are fraudulent, and the remaining 98% or more are legitimate, so our classes are imbalanced.

This causes problems for a few reasons:

1. A model which classifies all transactions as 'legitimate' would be correct 98% of the time. This high accuracy can trick you into thinking that your model is working well, despite it just returning 'legitimate' for every sample it sees. 
2. Even if your model tries to learn patterns in the data, it may struggle to learn from the fraudulent transactions since there simply aren't enough of them.
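
The first point can be checked with a few lines of plain Python. This is a minimal sketch using hypothetical labels with roughly the same imbalance as our data; the "model" here simply ignores its input.

```python
# Hypothetical labels: 2% fraudulent, 98% legitimate.
labels = ["fraud"] * 20 + ["legitimate"] * 980

# A "model" that always answers 'legitimate', whatever it is shown.
predictions = ["legitimate"] * len(labels)

# Fraction of all samples that were labelled correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of the fraudulent samples that were actually caught.
fraud_total = labels.count("fraud")
fraud_caught = sum(p == "fraud" and y == "fraud" for p, y in zip(predictions, labels))
fraud_recall = fraud_caught / fraud_total

print(f"accuracy: {accuracy:.2f}, fraud recall: {fraud_recall:.2f}")
```

The accuracy looks excellent even though not a single fraudulent transaction is detected, which is why we look at per-class metrics when validating the model.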

Boosting partly addresses this problem on its own: each new tree concentrates on the examples that the ensemble currently mispredicts, which tends to include the rare fraudulent ones. This is the approach we'll take.  We can also give XGBoost an explicit hint about how to weight the classes during training; we'll get counts of the classes in the training data now so that we can weight them later.

In [None]:
fraud_count = train[train["label"] == "fraud"]["label"].count()
legit_count = train[train["label"] == "legitimate"]["label"].count()

We're now ready to train our gradient boosted trees model. The model is trained on the feature vectors (generated using our `feature_pipeline` from the previous notebook).

In [None]:
svecs = feature_pipeline.fit_transform(train)

In [None]:
# ratio of fraudulent to legitimate examples in the training set
pct = fraud_count/legit_count
pct
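
If you want to pass an explicit class weight to XGBoost, its `scale_pos_weight` parameter is the usual place for it, and a commonly suggested value is the number of negative examples divided by the number of positive examples (the inverse of the ratio above, if fraud is treated as the positive class). The following is a sketch under that assumption, with hypothetical label counts; it only computes the value and does not train anything.

```python
from collections import Counter

# Hypothetical training labels; in this notebook they would come from train["label"].
labels = ["fraud"] * 15 + ["legitimate"] * 985

counts = Counter(labels)

# Assumes "fraud" is the positive class (encoded as 1) and "legitimate" the negative class.
scale_pos_weight = counts["legitimate"] / counts["fraud"]
print(round(scale_pos_weight, 2))

# The value could then be supplied when constructing the classifier, for example:
#   XGBClassifier(scale_pos_weight=scale_pos_weight, ...)
```

Whether this helps depends on the data, so it is worth comparing the classification report with and without the weight.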

In [None]:
#%%time
from xgboost import XGBClassifier
from sklearn import model_selection

# set this to:
#  - 'exact' for slow but precise, 
#  - 'hist' for faster and less precise, or
#  - 'gpu_hist' for a GPU-optimized implementation of 'hist'
#    (deprecated in XGBoost 2.0 and later, where 'hist' is combined with device='cuda' instead)

XGB_TREE_METHOD='hist'

xgb = XGBClassifier(tree_method=XGB_TREE_METHOD, 
                    # num_parallel_tree=16, 
                    n_estimators=1024, 
                    max_depth=3, 
                    colsample_bynode=0.3, 
                    colsample_bytree=0.3, 
                    subsample=0.3)

xgb.fit(svecs, train["label"])

## Model Validation 

We need to validate our model to check how well it performs on data it wasn't trained on. We use the model we just trained to make predictions for the data in our test set, and compare those predictions to the truth. 

In [None]:
from sklearn.metrics import classification_report

# only transform the test set; the pipeline was already fit on the training data
predictions = xgb.predict(feature_pipeline.transform(test))
print(classification_report(test.label.values, predictions))
class_report = classification_report(test.label.values, predictions, output_dict=True)

This report shows that the model is performing well and that it is slightly better at identifying legitimate transactions than fraudulent ones. 
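
The per-class precision, recall, and F1 figures in a report like this are all derived from counts of correct and incorrect predictions. As a rough sketch of that arithmetic (the helper `per_class_metrics` and the labels below are hypothetical, and this is not scikit-learn's implementation):

```python
def per_class_metrics(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class, treating it as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # flagged in error
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions for eight transactions.
y_true = ["fraud", "fraud", "fraud", "legitimate",
          "legitimate", "legitimate", "legitimate", "legitimate"]
y_pred = ["fraud", "fraud", "legitimate", "fraud",
          "legitimate", "legitimate", "legitimate", "legitimate"]

fraud_metrics = per_class_metrics(y_true, y_pred, positive="fraud")
legit_metrics = per_class_metrics(y_true, y_pred, positive="legitimate")
print(fraud_metrics)
print(legit_metrics)
```

With the same number of mistakes in each direction, the rarer class ends up with the lower scores, because each mistake is a larger share of its examples.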

We can visualise the classification accuracy in a confusion matrix:

In [None]:
from mlworkflows import plot
# use a separate name so the original data frame `df` isn't overwritten
cm_df, chart = plot.binary_confusion_matrix(test["label"], predictions)
chart

We can also view the raw counts, as well as the proportions of correctly and incorrectly classified items:

In [None]:
cm_df

One interesting aspect of tree ensembles, including both random decision forests and gradient boosted trees, is that they provide a metric for how important each feature was to the model's predictions. This is a useful property both for having explainable models (i.e., so you can explain to a human why the model made a particular prediction) and for guiding further experiments (i.e., so you can learn more about the real world based on what the model has identified as likely to be correlated with what you're trying to predict).

In [None]:
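# rank (feature index, importance) pairs from most to least important and show the top five
ranked = sorted(enumerate(xgb.feature_importances_), key=lambda pair: pair[1], reverse=True)
ranked[:5]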

We can look at the [feature engineering notebook](02-feature-engineering.ipynb) to see specifically that these features are, in order of importance:
- 0: interarrival time since the previous transaction
- 5: a hashed merchant id
- 6: a hashed merchant id
- 2: a hashed merchant id
- 3: a hashed merchant id


We want to save the model so that we can use it outside of this notebook. 

In [None]:
from mlworkflows import util
xgb_dict = {'model': xgb, 'class_report': pd.DataFrame(class_report).transpose()}
util.serialize_to(xgb_dict, "xgb.sav")