# Get Started with Distributed Training using XGBoost

Ray Train has built-in support for XGBoost.

## Quickstart

### XGBoost Example

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration. Set to True to schedule GPU workers.
        use_gpu=False,
    ),
    label_column="target",
    num_boost_round=20,
    params={
        # XGBoost specific params (see the `xgboost.train` API reference)
        "objective": "binary:logistic",
        # uncomment this and set `use_gpu=True` to use GPU for training
        # "tree_method": "gpu_hist",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    # If running in a multi-node cluster, this is where you
    # should configure the run's persistent storage that is accessible
    # across all worker nodes.
    # run_config=ray.train.RunConfig(storage_path="s3://..."),
)
result = trainer.fit()
print(result.metrics)
```

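The `XGBoostTrainer` constructor takes Ray-specific arguments, such as `scaling_config` and `datasets`, alongside XGBoost arguments, such as `params` and `num_boost_round`.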

## Save and load XGBoost and LightGBM checkpoints

Because training adds a new tree on every boosting round, you can save a checkpoint after any round to snapshot the training progress so far.
[`XGBoostTrainer`](https://docs.ray.io/en/latest/train/api/doc/ray.train.xgboost.XGBoostTrainer.html#ray.train.xgboost.XGBoostTrainer) and [`LightGBMTrainer`](https://docs.ray.io/en/latest/train/api/doc/ray.train.lightgbm.LightGBMTrainer.html#ray.train.lightgbm.LightGBMTrainer) both implement checkpointing out of the box. You can load these checkpoints into memory
with the static methods [`XGBoostTrainer.get_model`](https://docs.ray.io/en/latest/train/api/doc/ray.train.xgboost.XGBoostTrainer.get_model.html#ray.train.xgboost.XGBoostTrainer.get_model) and [`LightGBMTrainer.get_model`](https://docs.ray.io/en/latest/train/api/doc/ray.train.lightgbm.LightGBMTrainer.get_model.html#ray.train.lightgbm.LightGBMTrainer.get_model).

The only required change is to configure [`CheckpointConfig`](https://docs.ray.io/en/latest/train/api/doc/ray.train.CheckpointConfig.html#ray.train.CheckpointConfig) to set the checkpointing frequency. For example, the following configuration
saves a checkpoint on every boosting round and only keeps the latest checkpoint.

```python
from ray.train import RunConfig, CheckpointConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Checkpoint every iteration.
        checkpoint_frequency=1,
        # Only keep the latest checkpoint and delete the others.
        num_to_keep=1,
    )
)

# from ray.train.xgboost import XGBoostTrainer
# trainer = XGBoostTrainer(..., run_config=run_config)
```
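
To make the effect of `checkpoint_frequency` and `num_to_keep` concrete, here is a minimal pure-Python sketch of the retention behavior described above. The helper `simulate_checkpoints` is hypothetical and is not part of Ray. It assumes a simplified model in which a checkpoint is written every `checkpoint_frequency` rounds and only the most recent `num_to_keep` checkpoints are retained. Ray's actual bookkeeping may differ in details such as an extra checkpoint at the end of training.

```python
def simulate_checkpoints(num_boost_round, checkpoint_frequency, num_to_keep):
    """Return the boosting rounds whose checkpoints remain after training.

    Simplified model (an assumption, not Ray's implementation): a checkpoint
    is written every `checkpoint_frequency` rounds, and only the
    `num_to_keep` most recent checkpoints are retained.
    """
    kept = []
    for round_idx in range(1, num_boost_round + 1):
        # A frequency of 0 is treated as "no periodic checkpoints".
        if checkpoint_frequency and round_idx % checkpoint_frequency == 0:
            kept.append(round_idx)
            if num_to_keep is not None:
                # Drop everything except the newest `num_to_keep` entries.
                kept = kept[-num_to_keep:]
    return kept


# Configuration from the example above: checkpoint every round, keep one.
print(simulate_checkpoints(20, checkpoint_frequency=1, num_to_keep=1))  # [20]
# A sparser schedule that keeps the two most recent checkpoints.
print(simulate_checkpoints(20, checkpoint_frequency=5, num_to_keep=2))  # [15, 20]
```

Under this model, the configuration above leaves exactly one checkpoint on disk at any time, which bounds storage use at the cost of not being able to roll back to an earlier round.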

:::{tip}
Once you enable checkpointing, you can follow [this guide](https://docs.ray.io/en/latest/train/user-guides/fault-tolerance.html#train-fault-tolerance) to enable fault tolerance.
:::

## Basic training with tree-based models in Train

Just as in the original [`xgboost.train()`](https://xgboost.readthedocs.io/en/stable/parameter.html) and [`lightgbm.train()`](https://lightgbm.readthedocs.io/en/latest/Parameters.html) functions, the
training parameters are passed as the `params` dictionary.

### XGBoost Example

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=False,
    ),
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()
print(result.metrics)
```
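
The `eval_metric` entry in `params` above requests two XGBoost metrics, `logloss` and `error`. As a rough illustration of what those numbers mean, the following pure-Python sketch computes both for a handful of made-up predictions. The function names and the sample values are illustrative only. The definitions assume the usual XGBoost conventions: `logloss` is the mean negative log-likelihood, and `error` is the fraction of misclassified rows when predictions above 0.5 count as positive.

```python
import math


def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clamp to avoid log(0) for extreme predictions.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def error_rate(labels, probs, threshold=0.5):
    """Fraction of rows where the thresholded prediction disagrees with the label."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != bool(y))
    return wrong / len(labels)


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

print(error_rate(labels, probs))  # 0.5 (the last two rows are misclassified)
print(logloss(labels, probs))
```

Lower is better for both metrics. `logloss` also penalizes confident wrong predictions more heavily than `error` does, which is why tracking both can be informative.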

## How to scale out training?

The benefit of using Ray Train is that you can seamlessly scale up your training by
adjusting the [`ScalingConfig`](https://docs.ray.io/en/latest/train/api/doc/ray.train.ScalingConfig.html#ray.train.ScalingConfig).

:::{note}
Ray Train doesn’t modify or otherwise alter the working of the underlying XGBoost or LightGBM distributed training algorithms. Ray only provides orchestration, data ingest and fault tolerance. For more information on GBDT distributed training, refer to [XGBoost documentation](https://xgboost.readthedocs.io/en/stable/) and [LightGBM documentation](https://lightgbm.readthedocs.io/en/latest/).
:::

### Multi-node CPU Example

Setup: 4 nodes with 8 CPUs each.

Use-case: To utilize all resources in multi-node training.

```python
scaling_config = ScalingConfig(
    num_workers=4,
    resources_per_worker={"CPU": 8},
)
```

### Single-node multi-GPU Example

Setup: 1 node with 8 CPUs and 4 GPUs.

Use-case: If you have a single node with multiple GPUs, you need to use
distributed training to leverage all GPUs.

```python
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
)
```

### Multi-node multi-GPU Example

Setup: 4 nodes with 8 CPUs and 4 GPUs each.

Use-case: If you have multiple nodes with multiple GPUs, you need to
schedule one worker per GPU.

```python
scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
)
```
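
The three configurations above follow a simple pattern: one worker per GPU when GPUs are available, otherwise one worker per node with all of that node's CPUs. The sketch below captures that arithmetic in a hypothetical helper, `scaling_settings`, which is not part of Ray. It returns a plain dictionary whose keys mirror the `ScalingConfig` arguments used in this guide.

```python
def scaling_settings(num_nodes, cpus_per_node, gpus_per_node=0):
    """Hypothetical helper: derive ScalingConfig-style settings from a cluster shape.

    Mirrors the examples in this guide: one worker per GPU if the nodes have
    GPUs, otherwise one worker per node that uses all of the node's CPUs.
    """
    if gpus_per_node > 0:
        return {"num_workers": num_nodes * gpus_per_node, "use_gpu": True}
    return {
        "num_workers": num_nodes,
        "resources_per_worker": {"CPU": cpus_per_node},
    }


# Multi-node CPU: 4 nodes with 8 CPUs each.
print(scaling_settings(4, 8))
# Single-node multi-GPU: 1 node with 8 CPUs and 4 GPUs.
print(scaling_settings(1, 8, gpus_per_node=4))
# Multi-node multi-GPU: 4 nodes with 8 CPUs and 4 GPUs each.
print(scaling_settings(4, 8, gpus_per_node=4))
```

Because the keys match the keyword arguments shown above, the result could be unpacked as `ScalingConfig(**scaling_settings(...))`. Treat this as a starting point rather than a tuning recommendation.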

:::{warning}
Specifying a *shared storage location* (such as cloud storage or NFS) is *optional* for single-node clusters, but it is **required for multi-node clusters**. Using a local path will [raise an error](https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html#multinode-local-storage-warning) during checkpointing for multi-node clusters.

```python
trainer = XGBoostTrainer(
    ..., run_config=ray.train.RunConfig(storage_path="s3://...")
)
```
:::

## How to preprocess data for training?

Ray Data provides out-of-the-box preprocessors that implement common feature preprocessing operations, which is particularly useful for tabular data.
To use them with Ray Train Trainers, fit the preprocessor on the training dataset and apply it to each dataset before passing the datasets to a Trainer. For example:

```python
import ray
from ray.data.preprocessors import MinMaxScaler
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig

train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(0, 32, 3)])
valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(1, 32, 3)])

preprocessor = MinMaxScaler(["x"])
preprocessor.fit(train_dataset)
train_dataset = preprocessor.transform(train_dataset)
valid_dataset = preprocessor.transform(valid_dataset)

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "reg:squarederror"},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()
```
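
To show what fitting on the training data and then transforming both datasets does to the values, here is a pure-Python sketch of min-max scaling on the same `x` values as the example above. It assumes the standard formula `(x - min) / (max - min)` with no clipping. It does not call Ray, and the variable and function names are illustrative.

```python
# Same "x" values as the Ray Data example above.
train_x = list(range(0, 32, 3))  # 0, 3, ..., 30
valid_x = list(range(1, 32, 3))  # 1, 4, ..., 31

# "Fit": compute statistics from the training data only.
x_min, x_max = min(train_x), max(train_x)


def scale(x):
    """Apply min-max scaling using the statistics fitted on the training data."""
    return (x - x_min) / (x_max - x_min)


# "Transform": reuse the fitted statistics on both datasets.
train_scaled = [scale(x) for x in train_x]
valid_scaled = [scale(x) for x in valid_x]

print(train_scaled[0], train_scaled[-1])  # 0.0 1.0
print(max(valid_scaled) > 1.0)  # True: 31 lies above the training maximum of 30
```

Computing the statistics from the training data alone keeps information about the validation set out of the features. As the last line shows, validation values can then fall slightly outside the `[0, 1]` range.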

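The following example combines a checkpoint configuration with a trainer. The `CheckpointConfig` is passed through `RunConfig` rather than directly to the trainer:

```python
import ray
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Reload the breast cancer data so that the "target" label column exists.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

checkpoint_config = CheckpointConfig(
    # Checkpoint every iteration.
    checkpoint_frequency=1,
    # Only keep the latest checkpoint and delete the others.
    num_to_keep=1,
)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="target",
    num_boost_round=20,
    params={"objective": "binary:logistic"},
    datasets={"train": train_dataset, "valid": valid_dataset},
    # The checkpoint settings are wrapped in a RunConfig.
    run_config=RunConfig(checkpoint_config=checkpoint_config),
)
result = trainer.fit()
```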