# Part 2.2 - Train Final XGBoost Classifier
We will use XGBoost to train a model that predicts the astronomical classes. XGBoost implements a machine learning technique called gradient boosting. Like random forests, gradient boosting uses an ensemble of decision trees for prediction.

While the trees in a random forest are independent and can be built in parallel, a gradient boosting ensemble is built iteratively: each new decision tree is added to reduce some objective function on top of the trees already in the ensemble. Each iteration focuses on improving predictions for the observations that earlier iterations predicted poorly.
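To make the iterative idea concrete, here is a minimal sketch of gradient boosting for a toy 1-D regression problem with squared-error loss, written in plain Python. It is an illustration only, not how XGBoost is implemented: each "tree" is a single-split decision stump, and the data, `fit_stump`, `predict_stump`, and `boost` are hypothetical names and values invented for this example.

```python
# Toy gradient boosting with squared-error loss.
# Each "tree" is a decision stump fitted to the current residuals.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add one stump per round, each fitted to what the ensemble still gets wrong."""
    predictions = [0.0] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(xs))
    return stumps, losses

# Hypothetical toy data
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
toy_y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
toy_stumps, toy_losses = boost(toy_x, toy_y)
print(toy_losses[0], toy_losses[-1])
```

The training loss shrinks from round to round because every new stump targets the residuals left by the stumps before it, which is the sequential dependency that prevents the ensemble from being built in parallel.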

In [None]:
import cudf as gd
import pandas as pd
import numpy as np
import math
import xgboost as xgb
from termcolor import colored
import matplotlib.pyplot as plt

from utils import xgb_cross_entropy_loss, cross_entropy

from cuml.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
train_final_gd = gd.from_pandas(pd.read_pickle("train_gdf.pkl"))
test_final_gd = gd.from_pandas(pd.read_pickle("test_gdf.pkl"))

In [None]:
train_final_gd.head().to_pandas()

Let's extract the set of unique classes from our training set.

In [None]:
y = train_final_gd['target']
classes = sorted(np.unique(y.to_array()))

In [None]:
classes

Since our classifier expects labels in the range `[0, n_classes-1]`, we can use cuML's `LabelEncoder` class to encode the labels in the training dataset.
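As a rough illustration of what label encoding does, the sketch below maps arbitrary class identifiers to consecutive integers in plain Python. The label values are hypothetical, and the sorted ordering is an assumption made for this example rather than a statement about cuML's internals.

```python
# Hypothetical raw class identifiers (not taken from the dataset)
raw_labels = [90, 42, 90, 15, 42]

# Assign each distinct label an integer in [0, n_classes-1], in sorted order
label_to_index = {label: index for index, label in enumerate(sorted(set(raw_labels)))}
encoded = [label_to_index[label] for label in raw_labels]

print(label_to_index)  # {15: 0, 42: 1, 90: 2}
print(encoded)         # [2, 1, 2, 0, 1]
```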

In [None]:
y = LabelEncoder().fit_transform(y).to_array()

In [None]:
y

Preprocess the feature columns by filling `nan` values with zeros and casting them to `float32`.

In [None]:
cols = [i for i in test_final_gd.columns if i not in ['object_id','target']]
for col in cols:
    train_final_gd[col] = train_final_gd[col].fillna(0).astype('float32')

for col in cols:
    test_final_gd[col] = test_final_gd[col].fillna(0).astype('float32')

We can use the `train_test_split()` function from Scikit-learn to perform a stratified split of our training dataset into 90% training and 10% validation datasets. 

In [None]:
X = train_final_gd[cols].as_matrix()
Xt = test_final_gd[cols].as_matrix()

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1,stratify=y)
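To show what "stratified" means here, the following is a hand-rolled sketch in plain Python that splits each class separately so that both subsets keep the original class proportions. It is not Scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, valid_idx), splitting every class in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out at least one sample per class in this sketch
        n_valid = max(1, round(len(indices) * test_size))
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx

# Hypothetical imbalanced labels: 80% class 0, 20% class 1
toy_labels = [0] * 80 + [1] * 20
train_idx, valid_idx = stratified_split(toy_labels)
valid_counts = Counter(toy_labels[i] for i in valid_idx)
print(valid_counts)  # Counter({0: 8, 1: 2})
```

Without stratification, a random 10% sample of an imbalanced dataset could easily contain too few, or even none, of the rare classes.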

XGBoost models are configured using a dictionary of parameters. You can learn more about the available configuration parameters in the XGBoost [docs](https://xgboost.readthedocs.io/en/latest/parameter.html).

In [None]:
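gpu_params = {
            'objective': 'multi:softprob',    # softmax, return probabilities for each class
            'nthread': 16,                    # number of CPU threads
            'num_class': len(classes),        # number of target classes
            'max_depth': 7,                   # maximum depth of each tree
            'silent': 1,                      # suppress XGBoost log messages
            'subsample': 0.7,                 # fraction of rows sampled for each tree
            'colsample_bytree': 0.7,          # fraction of columns sampled for each tree
            'tree_method': 'gpu_hist'         # compute histograms for splits on GPU
}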
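With the `multi:softprob` objective, the per-class scores for each sample are passed through a softmax so that they form a probability distribution. A minimal sketch of that transformation in plain Python, using made-up scores rather than output from this model:

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

hypothetical_scores = [2.0, 1.0, 0.1]  # one score per class, invented for illustration
probabilities = softmax(hypothetical_scores)
print(probabilities, sum(probabilities))
```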

XGBoost uses a `DMatrix` object to represent data inputs. We can construct one each for our training, validation, and test data.

In [None]:
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dvalid = xgb.DMatrix(data=X_test, label=y_test)
dtest = xgb.DMatrix(data=Xt)

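Train our XGBoost model for up to 60 boosting rounds, stopping early if the evaluation metric stops improving for 10 consecutive rounds.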

In [None]:
# XGBoost uses the last entry in `evals` for early stopping, so the validation set goes last
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]

clf = xgb.train(gpu_params, 
                dtrain=dtrain,
                num_boost_round=60,
                evals=watchlist,
                feval=xgb_cross_entropy_loss(classes),
                early_stopping_rounds=10,
                verbose_eval=1000)
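The `early_stopping_rounds` argument can be understood with a small sketch of the general early-stopping rule: keep track of the best round so far and stop once a fixed number of rounds pass without improvement. The function name and the loss values below are hypothetical, and this is a simplified picture rather than XGBoost's actual code.

```python
def early_stopping_round(losses, patience):
    """Return the index of the best round, stopping after `patience` rounds without improvement."""
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(losses):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round

# Hypothetical validation losses, one per boosting round
hypothetical_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69, 0.75]
best = early_stopping_round(hypothetical_losses, patience=3)
print(best)  # 2
```

Note that training stops before reaching the later, slightly better value of 0.69. A larger patience trades extra training time for a lower risk of stopping too soon.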

In [None]:
yp = clf.predict(dvalid)

gpu_loss = cross_entropy(y_test, yp, classes)

ysub = clf.predict(dtest)

line = 'validation loss %.4f'%gpu_loss
print(colored(line,'green'))

In [None]:
yp

### Independent Exercise

Now that `yp` contains the output probabilities for each class in the validation dataset, the accuracy can be computed by taking the number of correct predictions and dividing by the total number of predictions.
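As a quick illustration of that definition, the sketch below computes accuracy from two lists of class labels in plain Python. The labels are hypothetical, and the function assumes the predictions have already been converted to class labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, prediction in zip(y_true, y_pred) if truth == prediction)
    return correct / len(y_true)

# Hypothetical labels: 4 of the 5 predictions are correct
toy_true = [0, 2, 1, 1, 0]
toy_pred = [0, 2, 0, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 0.8
```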

Recall that the result from `predict()` is an array of size `(n_samples, n_classes)` containing the class probabilities for each sample. 

In the cell below, use these class probabilities to compute `y_pred` so that it contains the actual predicted class labels.

In [None]:
y_pred = # Compute the actual class labels

Generally, we would use our validation data to tune our model and our test data to evaluate final performance of our model. 

We don't have labels for our test set so we will just compute the accuracy of our validation set.

In [None]:
accuracy_score(y_test, y_pred)

A confusion matrix will give us a good indication of the performance for each class in our validation set.

In [None]:
from utils import plot_confusion_matrix
plot_confusion_matrix(y_test, y_pred, np.arange(1, len(classes)+1), normalize=True)
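For reference, a confusion matrix simply counts how often each true class was predicted as each class. The sketch below builds one in plain Python with optional row normalization. It is a stand-in for illustration and makes no claim about how `plot_confusion_matrix` in `utils` works; the labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes, normalize=False):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, prediction in zip(y_true, y_pred):
        matrix[truth][prediction] += 1
    if normalize:
        # Divide each row by its total so every row shows per-class rates
        matrix = [[count / sum(row) if sum(row) else 0.0 for count in row]
                  for row in matrix]
    return matrix

# Hypothetical encoded labels for three classes
toy_true = [0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 2, 0, 2]
counts = confusion_matrix(toy_true, toy_pred, n_classes=3)
rates = confusion_matrix(toy_true, toy_pred, n_classes=3, normalize=True)
print(counts)  # [[1, 1, 0], [0, 2, 0], [1, 0, 3]]
print(rates)   # [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.25, 0.0, 0.75]]
```

With row normalization, the diagonal entries give the fraction of each true class that was classified correctly, which makes weak classes easy to spot even when class sizes differ.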

## Further Exercise

Now that you have trained an XGBoost model using both the timeseries embeddings and the statistical features, try the following:

1. Re-run the previous notebook, but don't merge the timeseries features into your training and test datasets before storing them.
2. Use your new dataset to train a new XGBoost classifier. 
3. Compare the accuracy and confusion matrix against the model that included the timeseries embedding features. 

