# XGBoost-Ray with Dask

This notebook includes an example workflow using [XGBoost-Ray](https://docs.ray.io/en/latest/xgboost-ray.html) and Dask for distributed model training, hyperparameter optimization and prediction.

In [None]:
import time
import os

import dask
import dask.dataframe as dd

import ray
from ray import tune
from ray.util.dask import ray_dask_get

from xgboost_ray import RayDMatrix, RayParams, train, predict

## Anyscale Connect

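The cell below initializes Ray. As written, it calls plain `ray.init()`; to connect the notebook to an Anyscale cluster instead, uncomment the `anyscale://` line and remove the plain `ray.init()` call. XGBoost-Ray will then run its computation remotely. Make sure to set `CLUSTER_NAME`, `CLUSTER_ENV` and `CLUSTER_COMPUTE` to match your settings.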

It is not necessary to use Anyscale - this workflow will work just as well with a Ray cluster.

In [None]:
CLUSTER_NAME = "cluster"
CLUSTER_ENV = "cluster_env"
CLUSTER_COMPUTE = "cluster_compute"
# Uncomment to connect to Anyscale (and remove the plain ray.init() below):
# ray.init(f"anyscale://{CLUSTER_NAME}", cluster_env=CLUSTER_ENV, cluster_compute=CLUSTER_COMPUTE)
ray.init()

## Data preparation

We will use the [HIGGS dataset from the UCI Machine Learning dataset repository](https://archive.ics.uci.edu/ml/datasets/HIGGS). The HIGGS dataset consists of 11,000,000 samples and 28 attributes, which is a large enough size to show the benefits of distributed computation.

We set the Dask scheduler to `ray_dask_get` to use the [Dask on Ray](https://docs.ray.io/en/latest/data/dask-on-ray.html) backend.

In [None]:
LABEL_COLUMN = "label"
FILE_NAME = "HIGGS.csv.gz"

print("Loading HIGGS data.")

dask.config.set(scheduler=ray_dask_get)

def download_higgs(target_file):
    # urllib is part of the Python standard library, so no extra install is needed.
    import urllib.request

    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/" \
          "00280/HIGGS.csv.gz"

    print(f"Downloading HIGGS dataset to {target_file}")
    urllib.request.urlretrieve(url, target_file)
    return os.path.exists(target_file)

# Skip the download if the file is already present.
if not os.path.exists(FILE_NAME):
    download_higgs(FILE_NAME)

colnames = [LABEL_COLUMN] + ["feature-%02d" % i for i in range(1, 29)]
data = dd.read_csv(FILE_NAME, names=colnames)
data = data[sorted(colnames)]
data = data.repartition(npartitions=100)

print("Loaded HIGGS data.")

We will split the data into a training set and an evaluation set using an 80-20 proportion.
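
Because a random split assigns each row to a split independently, the resulting sizes only approximate the requested fractions. The sketch below illustrates this in plain Python for the `[0.8, 0.2]` fractions used in the next cell. The helper `sketch_random_split` is hypothetical and is not Dask's implementation; it only mimics the row-wise behaviour.

```python
import random


def sketch_random_split(rows, fractions, seed=0):
    """Assign each row independently to one split, weighted by `fractions`.

    Hypothetical illustration only; this is not how Dask implements it.
    """
    rng = random.Random(seed)
    splits = [[] for _ in fractions]
    split_ids = range(len(fractions))
    for row in rows:
        # Draw one split index, with probability proportional to its fraction.
        idx = rng.choices(split_ids, weights=fractions)[0]
        splits[idx].append(row)
    return splits


num_rows = 100_000
train_rows, eval_rows = sketch_random_split(range(num_rows), [0.8, 0.2])
# The share is close to 0.8, but generally not exactly 0.8.
print(len(train_rows) / num_rows)
```

If reproducible splits matter, Dask's `random_split` accepts a `random_state` argument.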

In [None]:
train_df, eval_df = data.random_split([0.8, 0.2])

## Distributed training

The `train_xgboost` function contains all of the logic necessary for training using XGBoost-Ray.

Distributed training can not only speed up the process, but also allows you to use datasets that are too large to fit in the memory of a single node. With distributed training, the dataset is sharded across different actors running on separate nodes. Those actors communicate with each other to create the final model.
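
To make the idea of sharding concrete, here is a minimal sketch that distributes partition indices over actors in round-robin order. The function `assign_partitions` is hypothetical; XGBoost-Ray's actual assignment logic may differ (for example, it may take data locality into account). The numbers match the 100 partitions and 4 actors used in this notebook.

```python
def assign_partitions(num_partitions, num_actors):
    """Round-robin assignment of partition indices to actor ranks.

    Hypothetical illustration; not XGBoost-Ray's real sharding code.
    """
    shards = {actor: [] for actor in range(num_actors)}
    for partition in range(num_partitions):
        shards[partition % num_actors].append(partition)
    return shards


shards = assign_partitions(num_partitions=100, num_actors=4)
for actor, partitions in shards.items():
    # Each actor only ever loads its own subset of the data.
    print(f"actor {actor}: {len(partitions)} partitions")
```

Each actor holds only a quarter of the partitions here, which is why the full dataset never has to fit in the memory of one node.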

First, the dataframes are wrapped in `RayDMatrix` objects, which handle data sharding across the cluster. Then, the `train` function is called. The evaluation scores will be saved to the `evals_result` dictionary. The function returns a tuple of the trained model (booster) and the evaluation scores.
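
The `evals_result` dictionary is keyed first by the name given in `evals` (here `"eval"`) and then by metric name, with one score per boosting round. The sketch below shows how such a dictionary can be read. The scores are placeholder values chosen for illustration, not real training output.

```python
# Placeholder scores for three boosting rounds; real values come from `train`.
evals_result = {
    "eval": {
        "logloss": [0.66, 0.64, 0.63],
        "error": [0.40, 0.38, 0.39],
    }
}

errors = evals_result["eval"]["error"]
final_error = errors[-1]  # score after the last boosting round
# Index of the round with the lowest error (0-based).
best_round = min(range(len(errors)), key=errors.__getitem__)

print(f"final error: {final_error:.4f}, best round: {best_round}")
```

The final round is not necessarily the best one, which is why it can be worth inspecting the whole list rather than only the last entry.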

The `ray_params` variable expects a `RayParams` object that contains Ray-specific settings, such as the number of workers.

In [None]:
def train_xgboost(config, train_df, test_df, target_column, ray_params, num_boost_round=100):
    train_set = RayDMatrix(train_df, target_column)
    test_set = RayDMatrix(test_df, target_column)

    evals_result = {}

    start_time = time.time()
    # Train the classifier
    bst = train(
        params=config,
        dtrain=train_set,
        evals=[(test_set, "eval")],
        evals_result=evals_result,
        verbose_eval=True,
        num_boost_round=num_boost_round,
        ray_params=ray_params)
    print(f"Total time taken: {time.time()-start_time}")

    model_path = "model.xgb"
    bst.save_model(model_path)
    print("Final validation error: {:.4f}".format(
        evals_result["eval"]["error"][-1]))

    return bst, evals_result

We can now pass our Dask dataframes and run the function. We will use `RayParams` to specify that our model is to be trained on 4 actors.

In [None]:
# standard XGBoost config for classification
config = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}

bst, evals_result = train_xgboost(config, train_df, eval_df, LABEL_COLUMN, RayParams(num_actors=4))
evals_result

## Hyperparameter optimization

If we are not content with the results obtained with default XGBoost parameters, we can use [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) for cutting-edge distributed hyperparameter tuning. XGBoost-Ray automatically integrates with Ray Tune, meaning we can use the same training function as before.

In this workflow, we will tune three hyperparameters - `eta`, `subsample` and `max_depth`. We are using [Tune's samplers to define the search space](https://docs.ray.io/en/latest/tune/user-guide.html#search-space-grid-random).
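
To give a feel for what these samplers produce, the sketch below draws configurations using only the Python standard library. It is an approximation of the search space defined in the next cell, not Tune's implementation. It assumes that `tune.randint` excludes its upper bound, which is worth confirming for your Ray version.

```python
import math
import random


def sample_config(rng):
    """Draw one hyperparameter combination (illustrative stand-in for Tune)."""
    return {
        # Log-uniform: uniform in log space, so each decade is equally likely.
        "eta": math.exp(rng.uniform(math.log(1e-4), math.log(1e-1))),
        # Uniform over the closed interval [0.5, 1.0].
        "subsample": rng.uniform(0.5, 1.0),
        # randrange excludes the upper bound, giving integers 1 through 8.
        "max_depth": rng.randrange(1, 9),
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(10)]
for config in configs:
    print(config)
```

Sampling `eta` on a log scale matters because useful learning rates span several orders of magnitude; a plain uniform sampler would almost never propose values near `1e-4`.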

The experiment configuration is done through `tune.run`. We set the amount of resources each trial (hyperparameter combination) requires by using the `get_tune_resources` method of `RayParams`. The `num_samples` argument controls how many trials will be run in total. In the end, the best combination of hyperparameters evaluated during the experiment will be returned.
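
Since every trial starts its own set of training actors, the per-trial resource request determines how many trials can run at once. The following back-of-the-envelope sketch uses the `num_actors=4` and `cpus_per_actor=8` values from the next cell. The cluster size of 64 CPUs is a hypothetical figure, and the calculation ignores any resources reserved for the trial driver itself.

```python
def actor_cpus_per_trial(num_actors, cpus_per_actor):
    """CPUs requested by the training actors of a single trial."""
    return num_actors * cpus_per_actor


def max_concurrent_trials(cluster_cpus, num_actors, cpus_per_actor):
    """Rough upper bound on trials that fit on the cluster at once."""
    return cluster_cpus // actor_cpus_per_trial(num_actors, cpus_per_actor)


per_trial = actor_cpus_per_trial(num_actors=4, cpus_per_actor=8)
# 64 CPUs is an assumed cluster size, used only for illustration.
concurrent = max_concurrent_trials(cluster_cpus=64, num_actors=4, cpus_per_actor=8)
print(per_trial, concurrent)  # 32 2
```

With these assumed numbers, the 10 trials would run at most two at a time, so lowering `cpus_per_actor` trades per-trial speed for more parallel trials.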

By default, Tune will use simple random search. However, Tune also provides various [search algorithms](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html) and [schedulers](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html) to further improve the optimization process.

In [None]:
def tune_xgboost(train_df, test_df, target_column):
    # Set XGBoost config.
    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9)
    }

    ray_params = RayParams(
        max_actor_restarts=1, gpus_per_actor=0, cpus_per_actor=8, num_actors=4)

    analysis = tune.run(
        tune.with_parameters(
            train_xgboost,
            train_df=train_df,
            test_df=test_df,
            target_column=target_column,
            ray_params=ray_params),
        # Use the `get_tune_resources` helper function to set the resources.
        resources_per_trial=ray_params.get_tune_resources(),
        config=config,
        num_samples=10,
        metric="eval-error",
        mode="min",
        verbose=1)

    accuracy = 1. - analysis.best_result["eval-error"]
    print(f"Best model parameters: {analysis.best_config}")
    print(f"Best model total accuracy: {accuracy:.4f}")

    return analysis.best_config

Hyperparameter optimization may take some time to complete.

In [None]:
best_hyperparameters = tune_xgboost(train_df, eval_df, LABEL_COLUMN)
print("Best hyperparameters:")
best_hyperparameters

## Prediction

With the model trained, we can now predict on unseen data. For the purposes of this example, we will use the same dataset for prediction as for training.

Since prediction is embarrassingly parallel (each row is scored independently of the others), distributing it over multiple actors can reduce the amount of time needed.
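
The reason prediction parallelizes so easily is that each row is scored on its own, so the data can be cut into chunks, scored separately and concatenated. The sketch below demonstrates this with a thread pool and a hypothetical `toy_predict` function standing in for a trained model; it does not use XGBoost-Ray.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_predict(rows):
    # Hypothetical stand-in for a trained model: scores each row independently.
    return [sum(row) / len(row) for row in rows]


def chunked(rows, num_chunks):
    """Split `rows` into at most `num_chunks` contiguous chunks."""
    size = -(-len(rows) // num_chunks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


rows = [[float(i), float(i + 1)] for i in range(10)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves chunk order, so results line up with the input rows.
    parts = list(pool.map(toy_predict, chunked(rows, 4)))

parallel_scores = [score for part in parts for score in part]
print(parallel_scores == toy_predict(rows))  # True
```

No communication between workers is needed, unlike in training, where actors must exchange information to build each tree.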

In [None]:
inference_df = RayDMatrix(data, ignore=[LABEL_COLUMN, "partition"])
results = predict(
    bst,
    inference_df,
    ray_params=RayParams(cpus_per_actor=2, num_actors=16))

results
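
With the `binary:logistic` objective, the predictions are probabilities of the positive class rather than class labels. A minimal sketch of converting them to labels follows. The `probabilities` list holds placeholder values standing in for `results`, and the 0.5 threshold is an assumption chosen to match the cutoff that XGBoost's `error` metric uses by default.

```python
def to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]


# Placeholder values for illustration; not real model output.
probabilities = [0.12, 0.87, 0.51, 0.49]
labels = to_labels(probabilities)
print(labels)  # [0, 1, 1, 0]
```

A different threshold can be passed when false positives and false negatives carry different costs.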