# Get Started with Distributed Training using XGBoost

Ray Train has built-in support for XGBoost.

## Quickstart

### XGBoost Example

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration. Set to True to schedule GPU workers.
        use_gpu=False,
    ),
    label_column="target",
    num_boost_round=20,
    params={
        # XGBoost-specific params (see the `xgboost.train` API reference).
        "objective": "binary:logistic",
        # To train on GPUs, set `use_gpu=True` above and uncomment the next line.
        # On XGBoost >= 2.0, `gpu_hist` is deprecated; use `"device": "cuda"` instead.
        # "tree_method": "gpu_hist",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    # If running in a multi-node cluster, this is where you
    # should configure the run's persistent storage that is accessible
    # across all worker nodes.
    # run_config=ray.train.RunConfig(storage_path="s3://..."),
)
result = trainer.fit()
print(result.metrics)
```
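
The quickstart asks XGBoost to report two evaluation metrics, `logloss` and `error`. As a rough illustration of what those numbers mean, the sketch below computes both in plain Python for a handful of made-up labels and predicted probabilities. The function names `logloss` and `error_rate` and the sample values are hypothetical and are not part of Ray or XGBoost; the exact keys under which the metrics appear in `result.metrics` depend on the library versions, so the sketch does not assume them.

```python
import math


def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels (illustrative helper)."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip probabilities so that log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def error_rate(labels, probs, threshold=0.5):
    """Fraction of wrong predictions; values above the threshold count as positive."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != bool(y))
    return wrong / len(labels)


# Made-up example data: two confident correct predictions, two wrong ones.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

print("logloss:", logloss(labels, probs))
print("error:", error_rate(labels, probs))
```

Lower is better for both metrics: `error` only counts misclassifications, while `logloss` also penalizes predictions that are correct but not confident.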

## How to scale out training?

With Ray Train, you scale training out by adjusting the `ScalingConfig`; the rest of the
training code stays the same. The examples below show common cluster setups.

### Multi-node CPU Example

Setup: 4 nodes with 8 CPUs each.

Use case: Use all CPUs on every node by running one worker per node.

```python
scaling_config = ScalingConfig(
    num_workers=4,
    resources_per_worker={"CPU": 8},
)
```

### Single-node multi-GPU Example

Setup: 1 node with 8 CPUs and 4 GPUs.

Use case: To use every GPU on a single multi-GPU node, run distributed
training with one worker per GPU.

```python
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
)
```

### Multi-node multi-GPU Example

Setup: 4 nodes with 8 CPUs and 4 GPUs each.

Use case: If you have multiple nodes with multiple GPUs, schedule one worker
per GPU. Here, 4 nodes with 4 GPUs each gives 16 workers.

```python
scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
)
```
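
The three setups above follow one pattern: one worker per GPU when GPUs are available, otherwise one worker per node that claims all of the node's CPUs. A minimal sketch of that rule is shown here. The helper `plan_scaling` is hypothetical and is not part of Ray; it only returns the keyword arguments that the examples above pass to `ScalingConfig`, and it assumes that every node has the same hardware.

```python
def plan_scaling(num_nodes: int, cpus_per_node: int, gpus_per_node: int = 0) -> dict:
    """Return ScalingConfig keyword arguments for a homogeneous cluster.

    Hypothetical helper for illustration; it mirrors the examples in this guide.
    """
    if num_nodes < 1 or cpus_per_node < 1 or gpus_per_node < 0:
        raise ValueError("cluster sizes must be positive (GPUs may be zero)")
    if gpus_per_node > 0:
        # GPU cluster: one worker per GPU across all nodes.
        return {"num_workers": num_nodes * gpus_per_node, "use_gpu": True}
    # CPU-only cluster: one worker per node, each using all of that node's CPUs.
    return {
        "num_workers": num_nodes,
        "resources_per_worker": {"CPU": cpus_per_node},
    }


# The three setups from this section.
print(plan_scaling(num_nodes=4, cpus_per_node=8))                   # multi-node CPU
print(plan_scaling(num_nodes=1, cpus_per_node=8, gpus_per_node=4))  # single-node multi-GPU
print(plan_scaling(num_nodes=4, cpus_per_node=8, gpus_per_node=4))  # multi-node multi-GPU
```

With Ray installed, the result can be unpacked directly, for example `ScalingConfig(**plan_scaling(4, 8, 4))`.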

## How to preprocess data for training?

Ray Data comes with out-of-the-box preprocessors that implement common feature preprocessing operations, which is particularly useful for tabular data.
To use them with Ray Train Trainers, fit a preprocessor on the training dataset and apply it to each dataset before passing the datasets to a Trainer. For example:

```python
import ray
from ray.data.preprocessors import MinMaxScaler
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig

train_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(0, 32, 3)])
valid_dataset = ray.data.from_items([{"x": x, "y": 2 * x} for x in range(1, 32, 3)])

# Fit the scaler on the training data only, then apply it to both datasets.
preprocessor = MinMaxScaler(["x"])
preprocessor.fit(train_dataset)
train_dataset = preprocessor.transform(train_dataset)
valid_dataset = preprocessor.transform(valid_dataset)

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "reg:squarederror"},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()
```
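
To make the effect of the scaler concrete, here is a plain Python sketch of min-max scaling, `(x - min) / (max - min)`, applied to the same `x` values as above. The helpers `fit_min_max` and `transform_min_max` are hypothetical stand-ins written for illustration, not Ray APIs. The point to notice is that the minimum and maximum come from the training data only, so validation values outside the training range map outside `[0, 1]`.

```python
train_x = list(range(0, 32, 3))  # 0, 3, ..., 30
valid_x = list(range(1, 32, 3))  # 1, 4, ..., 31


def fit_min_max(values):
    """Compute the statistics the scaler needs (illustrative helper)."""
    return min(values), max(values)


def transform_min_max(values, lo, hi):
    """Apply (x - lo) / (hi - lo) to every value (illustrative helper)."""
    span = hi - lo
    if span == 0:
        # This sketch simply refuses constant columns.
        raise ValueError("cannot scale a constant column")
    return [(v - lo) / span for v in values]


# Fit on the training split only, then transform both splits.
lo, hi = fit_min_max(train_x)
scaled_train = transform_min_max(train_x, lo, hi)
scaled_valid = transform_min_max(valid_x, lo, hi)

print("train range:", scaled_train[0], "to", scaled_train[-1])
print("valid range:", min(scaled_valid), "to", max(scaled_valid))
```

The training values span exactly 0 to 1, while the largest validation value, 31, lies above the training maximum of 30 and therefore scales to slightly more than 1.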