# XGBoost in Python

A simple XGBoost model that classifies the Iris data set using default parameters.

In [29]:
import xgboost as xgb
import numpy as np

from sklearn import datasets
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

## Load data

In [30]:
iris = datasets.load_iris()
x = iris.data
y = iris.target

## Split into train/test sets

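`sklearn.model_selection.train_test_split()` randomly partitions the samples into a training subset and a test subset. The features and the target are split with the same shuffle, so each row of features stays aligned with its label. With `test_size=0.33`, 50 of the 150 Iris samples are held out for testing, and `random_state` fixes the shuffle so the split is reproducible.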

In [31]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=503)
y_test

array([0, 0, 1, 1, 2, 0, 1, 2, 2, 1, 0, 0, 2, 1, 0, 2, 2, 0, 1, 1, 1, 1, 0,
       0, 2, 0, 2, 2, 2, 2, 2, 0, 0, 1, 1, 2, 0, 0, 2, 0, 2, 2, 0, 2, 1, 1,
       1, 2, 2, 1])
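
To make the idea of a random partition concrete, here is a minimal pure-Python sketch of the same kind of split. It is not scikit-learn's implementation: the helper name `simple_train_test_split` is hypothetical, it works on row indices only, and because it uses a different random number generator it will not reproduce the exact split shown above. Rounding the test size up matches the 100/50 split seen in this notebook.

```python
import math
import random


def simple_train_test_split(n_samples, test_size, seed):
    """Shuffle row indices and cut them into train and test index lists.

    Hypothetical helper for illustration only; not scikit-learn's code.
    """
    indices = list(range(n_samples))
    rng = random.Random(seed)          # seeded so the split is reproducible
    rng.shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # assumption: round test size up
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return train_idx, test_idx


train_idx, test_idx = simple_train_test_split(n_samples=150, test_size=0.33, seed=503)
print(len(train_idx), len(test_idx))
```

Applying the same index lists to both the features and the target is what keeps each sample paired with its label.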

## Train default model - no hyperparameter tuning

There is no need to specify a three-way softmax objective: `XGBClassifier` detects from the labels in `y_train` that this is a multi-class classification problem and sets the objective accordingly (`multi:softprob` in the output below).

In [32]:
default_model = xgb.XGBClassifier()
default_model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
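
As a rough illustration of what a softmax objective such as `multi:softprob` does at prediction time, the sketch below converts one sample's raw per-class scores into probabilities and picks the most likely class. The scores are made-up values, not output from the model above, and the function is a plain-Python sketch rather than XGBoost's internal code.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    top = max(scores)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for one sample, one per Iris class (0, 1, 2).
raw_scores = [2.0, 0.5, -1.0]
probs = softmax(raw_scores)

# The predicted label is the class with the highest probability.
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```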

## In-sample performance

In [33]:
is_preds = default_model.predict(X_train)
is_acc = accuracy_score(y_train, is_preds)
is_conf_mat = confusion_matrix(y_train, is_preds)
print('In-sample accuracy: ', is_acc*100, '%')
print('In-sample confusion matrix:\n', is_conf_mat)

In-sample accuracy:  100.0 %
In-sample confusion matrix:
 [[34  0  0]
 [ 0 35  0]
 [ 0  0 31]]


## Out-of-sample performance

In [34]:
oos_preds = default_model.predict(X_test)
oos_acc = accuracy_score(y_test, oos_preds)
oos_conf_mat = confusion_matrix(y_test, oos_preds)
print('Out-of-sample accuracy: ', oos_acc*100, '%')
print('Out-of-sample confusion matrix: \n', oos_conf_mat)

Out-of-sample accuracy:  94.0 %
Out-of-sample confusion matrix: 
 [[16  0  0]
 [ 0 13  2]
 [ 0  1 18]]
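
The reported accuracy can be checked directly from the confusion matrix, and the same matrix also gives per-class recall and precision. The sketch below copies the out-of-sample matrix from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Out-of-sample confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
conf_mat = [
    [16, 0, 0],
    [0, 13, 2],
    [0, 1, 18],
]

n_classes = len(conf_mat)
total = sum(sum(row) for row in conf_mat)
correct = sum(conf_mat[i][i] for i in range(n_classes))  # diagonal entries
accuracy = correct / total

# Recall: share of each true class that was predicted correctly (row-wise).
recall = [conf_mat[i][i] / sum(conf_mat[i]) for i in range(n_classes)]

# Precision: share of each predicted class that was actually correct (column-wise).
precision = [
    conf_mat[i][i] / sum(conf_mat[r][i] for r in range(n_classes))
    for i in range(n_classes)
]

print(f"accuracy: {accuracy:.2f}")
for i in range(n_classes):
    print(f"class {i}: recall={recall[i]:.3f}, precision={precision[i]:.3f}")
```

All three errors are confusions between classes 1 and 2, while class 0 is separated perfectly.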
