## Using XGBoost for a Classification Problem: Overview in Python 3.x
Pipeline: 
1. Import the libraries/modules needed
2. Import data
3. Data cleaning and pre-processing
4. Train-test split
5. XGBoost training and prediction
6. Model Evaluation

### 1. Import the libraries/modules needed

In [1]:
# import the libraries needed
import xgboost as xgb
import numpy as np
import pandas as pd
import sklearn

### 2. Import data

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
data = pd.DataFrame(iris.data)
data.columns = iris.feature_names
print(data.sample(10))

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
19                 5.1               3.8                1.5               0.3
86                 6.7               3.1                4.7               1.5
36                 5.5               3.5                1.3               0.2
27                 5.2               3.5                1.5               0.2
83                 6.0               2.7                5.1               1.6
74                 6.4               2.9                4.3               1.3
129                7.2               3.0                5.8               1.6
68                 6.2               2.2                4.5               1.5
131                7.9               3.8                6.4               2.0
140                6.7               3.1                5.6               2.4


### 3. Data pre-processing

In [4]:
## Extract the data to the required variables
X = iris.data
y = iris.target

In [None]:
iris.data.shape

(150, 4)

### 4. Train-test split

In [5]:
## split the data into train and test set. The test size here is 30% of the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [6]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [7]:
y_test.shape

(45,)
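
The split above is purely random, so the three classes are not guaranteed to appear in equal proportions in the test set. As a minimal sketch of one alternative (an assumption, not part of the original pipeline), `train_test_split` accepts a `stratify` argument that preserves the class proportions. Note that this produces a different split from the one used in the rest of this notebook, so downstream outputs would change.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the same data as above
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Count how many samples of each class landed in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train class counts:", train_counts)
print("test class counts: ", test_counts)
```

Because iris has 50 samples per class, a stratified 30% test set should contain the same number of samples from each class.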

### Hyperparameter settings

In [8]:
# set the hyperparameters (fixed values; no search is performed here)
param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the learning rate (step-size shrinkage applied after each boosting round)
    'verbosity': 0,  # logging mode - silent (replaces the deprecated 'silent' parameter)
    'objective': 'multi:softprob',  # multiclass objective; predict() returns per-class probabilities
    'num_class': 3}  # the number of classes that exist in this dataset
num_round = 20  # the number of boosting rounds

### 5. XGBoost training and prediction

In [9]:
bst = xgb.train(param, dtrain, num_round)

In [10]:
preds = bst.predict(dtest)
preds.shape

(45, 3)

In [11]:
# pick the class with the highest predicted probability for each row
best_preds = np.argmax(preds, axis=1)
print(best_preds)

[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]


In [12]:
best_preds.shape

(45,)

### 6. Model Evaluation

In [13]:
from sklearn.metrics import precision_score, f1_score
print(precision_score(y_test, best_preds, average='macro'))
print(f1_score(y_test, best_preds, average='weighted'))


1.0
1.0
