# XGBoost Model

In this notebook we train and evaluate an [XGBoost](https://xgboost.readthedocs.io) model. XGBoost uses _gradient boosting_ to build an ensemble of decision trees one tree at a time: each new tree is fit to the errors of the trees already in the ensemble, so later trees concentrate on the examples that the existing ensemble still gets wrong.
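
To make the boosting idea concrete, here is a minimal sketch of gradient boosting on a toy 1-D regression problem. It is not XGBoost's actual algorithm (which also uses second-order gradient information and regularization, and here works on a classification loss); it assumes squared error, depth-1 trees ("stumps"), and at least two distinct input values. The names `fit_stump`, `predict_stump` and `boost` are hypothetical helpers written for this illustration.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single-threshold split of 1-D x that best fits residuals (least squares)."""
    best = None
    # Skip the largest value so that both sides of every split are non-empty.
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Gradient boosting for squared error: each stump is fit to the current residuals."""
    base = y.mean()
    prediction = np.full(len(y), base)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = y - prediction  # negative gradient of 0.5 * (y - prediction)**2
        stump = fit_stump(x, residuals)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return base, stumps, losses

# Toy data, for illustration only.
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
base, stumps, losses = boost(x, y)
```

Each entry of `losses` is the training mean squared error after one more stump has been added, so the list shows how the ensemble improves as later stumps correct what earlier ones missed.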

## Loading training data

In [None]:
import pandas as pd
import numpy as np

import os.path

# The data directory sits one level above the notebook's working directory.
training_data = pd.read_parquet(os.path.join(os.path.dirname(os.getcwd()), "data", "processed", "training.parquet"))

In [None]:
training_data.sample(10)

## Feature Engineering

In [None]:
import cloudpickle as cp
# Use a context manager so the file handle is closed after loading.
with open(os.path.join(os.path.dirname(os.getcwd()), "data", "processed", "feature_pipeline.sav"), "rb") as f:
    feature_pipeline = cp.load(f)

In [None]:
training_vecs = feature_pipeline.fit_transform(training_data["Message"])

## Training a model

In [None]:
from xgboost import XGBClassifier

In [None]:
%%time

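XGB_TREE_METHOD = 'hist'  # histogram-based split finding
xgb = XGBClassifier(tree_method=XGB_TREE_METHOD,
                    # num_parallel_tree=16,
                    n_estimators=100,      # number of boosting rounds
                    max_depth=3,           # maximum depth of each tree
                    colsample_bynode=0.3,  # fraction of features sampled at each split
                    colsample_bytree=0.3,  # fraction of features sampled for each tree
                    subsample=1,           # fraction of rows sampled for each tree
                    reg_alpha=1)           # L1 regularization on leaf weights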

xgb.fit(training_vecs, training_data["Category"])
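
Recent XGBoost releases expect class labels to be integers in the range `0` to `n_classes - 1`. If the `Category` column holds strings, `fit` may reject them and the labels would need encoding first. A minimal sketch, assuming string labels (the values `"ham"` and `"spam"` below are hypothetical stand-ins for the contents of `training_data["Category"]`):

```python
import numpy as np

# Hypothetical string labels standing in for training_data["Category"].
labels = np.array(["ham", "spam", "ham", "ham", "spam"])

# np.unique returns the sorted distinct classes and, with return_inverse=True,
# the integer index of each label within that sorted array.
classes, encoded = np.unique(labels, return_inverse=True)

# Indexing the class array with integer codes recovers the original strings.
decoded = classes[encoded]
```

If this applies, the same mapping has to be used for the test labels, and `classes[xgb.predict(...)]` would translate integer predictions back to category names.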

## Evaluating model performance

In [None]:
xgb.score(training_vecs, training_data["Category"])

In [None]:
testing_data = pd.read_parquet(os.path.join("data", "testing.parquet"))
# Transform the same text column that the pipeline was fit on ("Message").
testing_vecs = feature_pipeline.transform(testing_data["Message"])
xgb.score(testing_vecs, testing_data["Category"])

In [None]:
from mlworkflows import plot

df, chart = plot.confusion_matrix(testing_data["Category"], xgb.predict(testing_vecs))

In [None]:
chart

In [None]:
from sklearn.metrics import classification_report
print(classification_report(testing_data["Category"], xgb.predict(testing_vecs)))
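
The per-class precision, recall and F1 figures in the report can all be derived from a confusion matrix. The sketch below shows that arithmetic with NumPy. It assumes the layout used by `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes) and that every class is both present and predicted at least once, so no division by zero occurs. The function name `per_class_metrics` and the counts are made up for illustration.

```python
import numpy as np

def per_class_metrics(confusion):
    """Precision, recall and F1 per class from a confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_positives = np.diag(confusion)
    precision = true_positives / confusion.sum(axis=0)  # column sums: predicted counts
    recall = true_positives / confusion.sum(axis=1)     # row sums: true counts (support)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts for a two-class problem, for illustration only.
example = [[90, 10],
           [5, 45]]
precision, recall, f1 = per_class_metrics(example)
```

In this made-up example both classes have a recall of 0.9, but the second class has lower precision (45 of 55 predictions correct) because it is the rarer class and absorbs the first class's errors.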

✅ With the parameters selected above, the model performs better than the [random forest model](02-random-forest-model.ipynb). Are there any advantages to using the random forest model over the XGBoost model? 

✅ Try changing the parameters of the model. How does this affect the model's performance?
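
One way to approach this exercise is a small parameter sweep. The sketch below is a generic helper, not part of XGBoost or `mlworkflows`: `sweep` and `grid` are hypothetical names, and `make_model` can be any callable that accepts the grid's keys as keyword arguments and returns an object with `fit` and `score` methods.

```python
from itertools import product

def sweep(make_model, grid, X_train, y_train, X_test, y_test):
    """Fit one model per parameter combination.

    Returns a list of (params, score) pairs, best score first.
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = make_model(**params)
        model.fit(X_train, y_train)
        results.append((params, model.score(X_test, y_test)))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Hypothetical grid over two of the parameters used above.
grid = {"max_depth": [2, 3, 5], "n_estimators": [50, 100]}
```

With the names from this notebook, a call might look like `sweep(lambda **p: XGBClassifier(tree_method=XGB_TREE_METHOD, **p), grid, training_vecs, training_data["Category"], testing_vecs, testing_data["Category"])`. Picking parameters by their test score makes that score optimistic, so a separate validation split would be a safer basis for the choice.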



In [None]:
from mlworkflows import util

# Save the trained model alongside the processed data, one level above the working directory.
util.serialize_to(xgb, os.path.join(os.path.dirname(os.getcwd()), "data", "processed", "model.sav"))