# Scaling XGBoost with Dask and Coiled

[XGBoost](https://xgboost.readthedocs.io/en/latest/) is a library for training gradient-boosted supervised machine learning models. In this guide, you'll learn how to train an XGBoost model in parallel in your own cloud account using Dask and Coiled. Download {download}`this Jupyter notebook <xgboost.ipynb>` to follow along.

## Before you start

You'll first need to create consistent local and remote software environments with `dask`, `coiled`, and the necessary dependencies installed. You can use [coiled-runtime](https://docs.coiled.io/user_guide/software_environment.html#coiled-runtime), a conda metapackage, which already includes `xgboost` and `dask-ml`.

You can install `coiled-runtime` locally in a conda environment:

```
conda create -n xgboost-example -c conda-forge python=3.9 coiled-runtime
```

And activate the conda environment you just created:

```
conda activate xgboost-example
```

## Launch your Coiled cluster

Create a Dask cluster in your cloud account with Coiled:

In [None]:
import coiled

cluster = coiled.Cluster(
    n_workers=5,
    name="xgboost-example"
)

And connect Dask to your remote Coiled cluster:

In [91]:
import dask.distributed

client = dask.distributed.Client(cluster)
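
If you'd like to rehearse the client workflow before launching cloud resources, a local Dask cluster can stand in for the Coiled one. The following is a minimal sketch that is not part of the original notebook: it assumes `dask.distributed` is installed locally, and the names `local_cluster` and `local_client` are illustrative only.

```python
from dask.distributed import Client, LocalCluster

# Hypothetical local stand-in for the Coiled cluster (names are illustrative).
# processes=False keeps the workers inside this Python process, and
# dashboard_address=None skips starting the dashboard server.
local_cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
)
local_client = Client(local_cluster)

# Run a trivial task on the cluster to confirm the connection works.
future = local_client.submit(sum, [1, 2, 3])
result = future.result()

# Release local resources once finished.
local_client.close()
local_cluster.close()
```

A local cluster has far less memory than the remote one, so it is best suited to a small sample of the data rather than the full dataset.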


## Train your model

You’ll use the [Higgs dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS) available on Amazon S3. This dataset is composed of 11 million simulated particle collisions, each of which is described by 28 real-valued features and a binary label indicating which class the sample belongs to (i.e. whether the sample represents a signal or background event).

You'll use Dask's `read_csv` function to read in all the CSV files in the dataset:

In [None]:
import dask.dataframe as dd

# Load the entire dataset lazily using Dask
ddf = dd.read_csv("s3://coiled-data/higgs/higgs-*.csv", storage_options={"anon": True})

You can separate the classification label and training features and then partition the dataset into training and testing samples. Dask's machine learning library, [Dask-ML](https://ml.dask.org/), mimics Scikit-learn's API, providing scalable versions of `sklearn.datasets.make_classification` and `sklearn.model_selection.train_test_split` that are designed to work with Dask Arrays and DataFrames larger than available RAM.

In [93]:
from dask_ml.model_selection import train_test_split

X, y = ddf.iloc[:, 1:], ddf["labels"]
# use Dask-ML to generate test and train datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=2)

Next you'll persist your training and testing datasets into memory to avoid re-computations (see the Dask documentation for [best practices using *persist*](https://docs.dask.org/en/stable/best-practices.html#persist-when-you-can)):

In [94]:
import dask

X_train, X_test, y_train, y_test = dask.persist(X_train, X_test, y_train, y_test)
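
The distinction between `persist` and `compute` can be seen on a toy Dask array. This is a sketch with illustrative names (`x`, `x_persisted`, `total`); without a distributed client it runs on Dask's default local scheduler, but the behavior mirrors what happens on a cluster.

```python
import dask
import dask.array as da

# A lazy array: no work has happened yet.
x = da.arange(10, chunks=5) ** 2

# persist() computes the chunks and keeps them in memory,
# but still hands back a Dask collection.
(x_persisted,) = dask.persist(x)

# compute() returns a concrete local result instead.
total = x_persisted.sum().compute()
```

Because `x_persisted` is still a Dask array, later operations reuse the stored chunks rather than recomputing them from scratch.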

To train an XGBoost model in a distributed fashion, you'll use XGBoost's Dask interface (see the XGBoost tutorial on [using XGBoost with Dask](https://xgboost.readthedocs.io/en/stable/tutorials/dask.html)). You'll first need to construct a `DaskDMatrix` object for your training dataset. This is the distributed counterpart of `xgboost.DMatrix`, the internal data structure XGBoost uses to manage dataset features and targets. Since you're using XGBoost with Dask, you can pass your Dask collections directly to `xgboost.dask.DaskDMatrix()`.

In [95]:
import xgboost
import xgboost.dask  # import the Dask interface explicitly

dtrain = xgboost.dask.DaskDMatrix(client=client, data=X_train, label=y_train)

Next you'll define the set of hyperparameters to use for the model and train the model (see the [XGBoost documentation on parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)):

In [None]:
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'min_child_weight': 0.5,
    'eval_metric': 'logloss'
}

# train() returns a dict holding the trained "booster" and the evaluation "history"
bst = xgboost.dask.train(client, params, dtrain, num_boost_round=3)
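
For intuition about the `logloss` metric named in the parameters above, the sketch below computes binary log loss in plain Python. The function name `binary_logloss` and the clipping constant `eps` are illustrative choices; XGBoost's internal implementation may differ in such details.

```python
import math


def binary_logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Confident, correct predictions give a small loss.
confident = binary_logloss([1, 0], [0.9, 0.1])

# Predicting 0.5 for everything gives ln(2), roughly 0.693.
uninformed = binary_logloss([1, 0], [0.5, 0.5])
```

A trained model should score well below the `ln(2)` baseline of an uninformed predictor.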

## Generate model predictions

Now that your model has been trained, you can use it to make predictions on the testing dataset which was *not* used to train the model:

In [96]:
y_pred = xgboost.dask.predict(client, bst, X_test)
y_test, y_pred = dask.compute(y_test, y_pred)
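
With the `binary:logistic` objective, the predictions are probabilities of the positive class, so one simple way to score them is to threshold and compare against the true labels. The helper below is a hypothetical sketch (the name `accuracy_at_threshold` and the toy inputs are not from the original notebook); the same function could be applied to the computed `y_test` and `y_pred`.

```python
import numpy as np


def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    return float((y_hat == y_true).mean())


# Toy inputs for illustration: three of the four predictions are correct.
demo_accuracy = accuracy_at_threshold([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.1])
```

Accuracy depends on the chosen threshold, so a threshold-free metric may be a better fit when the two classes are imbalanced.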

Voilà! Congratulations on training a boosted decision tree in the cloud.

Once you're done, you can shut down the cluster (it will shut down automatically after 20 minutes of inactivity):

In [None]:
client.close()
cluster.close()

## Next steps

For a more in-depth look at what you can do with XGBoost, Dask, and Coiled, check out [this Coiled blogpost](https://coiled.io/blog/dask-python-xgboost-example/).