Using XGBoost (eXtreme Gradient Boosting) is easy. Maybe too easy, considering it's widely regarded as one of the strongest ML algorithms around right now, especially for tabular data.

To install it, just:

pip install xgboost

Let's experiment using the Iris data set. This data set includes the width and length of the petals and sepals of 150 Iris flowers, and the specific species of Iris each flower belongs to. Our challenge is to predict the species of a flower sample based only on the sizes of its petals and sepals. We'll revisit this data set later when we talk about principal component analysis too.

In [1]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-2.1.2-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-2.1.2-py3-none-win_amd64.whl (124.9 MB)

In [8]:
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()

# Get the number of samples and features
numSamples, numFeatures = iris.data.shape
print("Number of samples:", numSamples)
print("Number of features:", numFeatures)

# Get the feature names
feature_names = iris.feature_names
print("Feature names:", feature_names)

# Get the target class names
target_names = list(iris.target_names)
print("Target names:", target_names)


Number of samples: 150
Number of features: 4
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa', 'versicolor', 'virginica']


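The labels themselves, in iris.target, are stored as integers: 0, 1, and 2 index into the target names above, with 50 samples of each species.

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,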
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Let's divide our data into 20% reserved for testing our model, and the remaining 80% to train it with. By withholding our test data, we can make sure we're evaluating the model's results based on new flowers it hasn't seen before. Typically we refer to our features (in this case, the petal and sepal measurements) as X, and the labels (in this case, the species) as y.

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
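
As a quick sanity check, here's a minimal sketch that repeats the same split and looks at what came out of it. The names train_counts and test_counts are just introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# 20% of 150 samples is 30 for testing, leaving 120 for training
print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])

# Count how many flowers of each species landed in each split
train_counts = np.bincount(y_train, minlength=3)
test_counts = np.bincount(y_test, minlength=3)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

The split is random rather than stratified by default, so the species won't necessarily be evenly represented in the test set. If that matters for your experiment, train_test_split accepts a stratify argument (for example stratify=iris.target).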

Now we'll load up XGBoost and convert our data into the DMatrix format it expects: one DMatrix for the training data, and one for the test data.

In [11]:
import xgboost as xgb

train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

Now we'll define our hyperparameters. We're choosing the softmax objective since this is a multiclass classification problem (three species), but the other parameters should ideally be tuned through experimentation.

In [12]:
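param = {
    'max_depth': 4,                 # maximum depth of each tree
    'eta': 0.3,                     # learning rate
    'objective': 'multi:softmax',   # predict one class label per sample
    'num_class': 3}                 # three Iris species
epochs = 10                         # number of boosting rounds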
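
To make the softmax choice a little more concrete, here's a rough NumPy sketch of the idea: turn one raw score per class into probabilities, then pick the most likely class. The scores are made up for illustration, and this is not XGBoost's internal code, just the general technique.

```python
import numpy as np

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw scores for one flower, one score per species
raw_scores = np.array([0.2, 2.5, 1.1])

probabilities = softmax(raw_scores)
predicted_class = int(np.argmax(probabilities))

print("Probabilities:", probabilities)
print("Predicted class:", predicted_class)
```

With 'multi:softmax', predict() hands back just the winning class for each sample. If you'd rather see the per-class probabilities, XGBoost also offers a 'multi:softprob' objective.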

Let's go ahead and train our model using these parameters as a first guess.

In [13]:
model = xgb.train(param, train, epochs)

Now we'll use the trained model to predict classifications for the data we set aside for testing. Each classification number we get back corresponds to a specific species of Iris.

In [14]:
predictions = model.predict(test)

In [15]:
print(predictions)

[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]
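
Those numbers are easier to read as species names. Here's a small sketch that maps the first five predictions back to the target names printed earlier; both arrays are typed in by hand so the snippet runs on its own.

```python
import numpy as np

# Target names as printed earlier from iris.target_names
target_names = np.array(['setosa', 'versicolor', 'virginica'])

# The first five predictions from the output above
predictions = np.array([2., 1., 0., 2., 0.])

# Predictions come back as floats, so cast to int before using them as indices
species = target_names[predictions.astype(int)]
print(species)
```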


Let's measure the accuracy on the test data...

In [16]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

1.0
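
For classification, accuracy is simply the fraction of predictions that exactly match the true labels. Here's a minimal sketch of that calculation by hand, using made-up labels (not our Iris results) so you can see a score below 1.0.

```python
import numpy as np

# Hypothetical true labels and predictions, with one deliberate mistake
y_true = np.array([2, 1, 0, 2, 0, 1])
y_pred = np.array([2., 1., 0., 1., 0., 1.])

# Compare element by element, then average the True/False results
correct = (y_true == y_pred)
accuracy = correct.mean()

print("Correct predictions:", int(correct.sum()), "of", len(y_true))
print("Accuracy:", accuracy)
```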

Holy crow! It's perfect, and that's just with us guessing as to the best hyperparameters!

Normally I'd have you experiment to find better hyperparameters as an activity, but you can't improve on those results. Instead, see what it takes to make the results worse! How few epochs (boosting rounds) can you get away with? How low can you set max_depth? Basically, try to optimize the simplicity and performance of the model, now that you already have perfect accuracy on these 30 test flowers.