# XGBoost-Ray with Modin 

This notebook includes an example workflow using [XGBoost-Ray](https://docs.ray.io/en/latest/xgboost-ray.html) and [Modin](https://modin.readthedocs.io/en/latest/) for distributed model training and prediction.

Please ensure you are running this notebook in a fresh Conda environment or virtualenv to avoid package conflicts. See the included README for an example of how to set up a new `modin-xgboost-env-kernel` that this notebook can use.

## Python Setup

First, we'll install the required dependencies and import them to verify that our local environment is configured correctly.

In [None]:
! pip install "ray[tune]" "xgboost_ray[default]" modin

In [None]:
import argparse
import json
import os
import time

import modin.pandas as pd
import ray

from modin.experimental.sklearn.model_selection import train_test_split
from ray import tune

import xgboost_ray
from xgboost_ray import RayDMatrix, RayParams, train, predict

## Cluster Setup

Next, we'll set up our Ray cluster. The provided `modin-xgboost.yaml` file can be used to configure an AWS cluster.

In [None]:
! pip install boto3
! ray up modin-xgboost.yaml -y

Now, let's connect our Python script to this newly deployed Ray cluster!

### Connecting to the Ray cluster

In another terminal, run the following to set up port forwarding:

```
ray attach modin-xgboost.yaml -p 10001
```

You can then connect directly to port `10001` here:

In [None]:
ray.init(address="ray://localhost:10001")

### Alternative Approach

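Instead of port forwarding, you can look up the head node's IP address and connect to it directly.

**Note:** This approach does not work with Ray 1.6.0 but is fixed in the latest version.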

In [None]:
address = ! ray get-head-ip modin-xgboost.yaml | tail -n 1
print(address)

In [None]:
ray_address = f"ray://{address[0]}:10001"
print(ray_address)

In [None]:
! ray --version

In [None]:
ray.init(address=ray_address)
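
The address construction above can be sketched as a small helper. This is a minimal sketch: `build_ray_address` is a hypothetical name, not part of Ray, and it assumes the head node IP is the last non-empty line printed by `ray get-head-ip` (which is why the cell above pipes the output through `tail -n 1`).

```python
def build_ray_address(cli_output_lines, port=10001):
    """Build a Ray Client address from the lines printed by `ray get-head-ip`.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    # Drop empty lines; the IP address is assumed to be on the last one.
    lines = [line.strip() for line in cli_output_lines if line.strip()]
    if not lines:
        raise ValueError("no head node IP found in the command output")
    return f"ray://{lines[-1]}:{port}"


# 203.0.113.10 is a documentation-only placeholder IP address.
print(build_ray_address(["some log line", "203.0.113.10"]))
```

Raising an error on empty output makes a failed `ray get-head-ip` call visible before `ray.init` is attempted.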

## Data Preparation

We will use the [HIGGS dataset from the UCI Machine Learning dataset repository](https://archive.ics.uci.edu/ml/datasets/HIGGS). The HIGGS dataset consists of 11,000,000 samples and 28 attributes, which is large enough to show the benefits of distributed computation.

We load the data with Modin's `read_csv`, wrapped in a Ray remote task so that the read runs on the cluster rather than on the local machine.

In [None]:
LABEL_COLUMN = "label"

# Test dataset with only 10,000 records.
FILE_URL = "https://ray-ci-higgs.s3.us-west-2.amazonaws.com/simpleHIGGS.csv"

# Uncomment this to run on the actual dataset. This may take a couple of minutes to run.
#FILE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"

colnames = [LABEL_COLUMN] + ["feature-%02d" % i for i in range(1, 29)]

load_data_start_time = time.time()

# Force read_csv to be executed on the Ray server.
df = ray.get(ray.remote(pd.read_csv).remote(FILE_URL, names=colnames))

load_data_end_time = time.time()
load_data_duration = load_data_end_time - load_data_start_time

print(f"Dataset loaded in {load_data_duration} seconds.")
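
The CSV file has no header row, so the cell above passes the column names explicitly. As a quick check of what that list comprehension produces, the following standalone snippet (which needs neither Ray nor Modin) prints the resulting names:

```python
LABEL_COLUMN = "label"

# One label column followed by 28 zero-padded feature columns.
colnames = [LABEL_COLUMN] + ["feature-%02d" % i for i in range(1, 29)]

print(len(colnames))  # 29
print(colnames[:3])   # ['label', 'feature-01', 'feature-02']
print(colnames[-1])   # feature-28
```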

In [None]:
# Split data into training and validation.
df_train, df_validation = train_test_split(df)

print(df_train)

## Distributed Training

The `train_xgboost` function contains all of the logic necessary for training using XGBoost-Ray.

Distributed training can not only speed up the process, but also allow you to use datasets that are too large to fit in the memory of a single node. With distributed training, the dataset is sharded across different actors running on separate nodes. Those actors communicate with each other to create the final model.

First, the dataframes are wrapped in `RayDMatrix` objects, which handle data sharding across the cluster. Then, the `train` function is called. The evaluation scores are saved to the `evals_result` dictionary. The function returns a tuple of the trained model (booster) and the evaluation scores.

The `ray_params` variable expects a `RayParams` object that contains Ray-specific settings, such as the number of workers.
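
To make the idea of sharding concrete, here is a conceptual sketch in plain Python. It is not XGBoost-Ray's implementation: `interleaved_shards` is a hypothetical helper, and round-robin assignment is just one possible strategy for giving each actor a disjoint subset of rows.

```python
def interleaved_shards(num_rows, num_actors):
    """Assign row indices to actors round-robin: row i goes to actor i % num_actors."""
    return [list(range(rank, num_rows, num_actors)) for rank in range(num_actors)]


shards = interleaved_shards(num_rows=10, num_actors=4)
for rank, rows in enumerate(shards):
    print(f"actor {rank}: rows {rows}")
```

Every row lands on exactly one actor, so no single actor has to hold the full dataset.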

In [None]:
def train_xgboost(config, train_df, test_df, target_column, ray_params):
    train_set = RayDMatrix(train_df, target_column)
    test_set = RayDMatrix(test_df, target_column)

    evals_result = {}

    train_start_time = time.time()
    
    # Train the classifier
    bst = train(
        params=config,
        dtrain=train_set,
        evals=[(test_set, "eval")],
        evals_result=evals_result,
        verbose_eval=False,
        num_boost_round=100,
        ray_params=ray_params)
    
    train_end_time = time.time()
    train_duration = train_end_time - train_start_time
    print(f"Total time taken: {train_duration} seconds.")

    model_path = "model.xgb"
    bst.save_model(model_path)
    print("Final validation error: {:.4f}".format(
        evals_result["eval"]["error"][-1]))

    return bst, evals_result

We can now pass our Modin dataframes and run the function. We will use `RayParams` to specify that our model is to be trained on 4 actors with 8 CPUs each.

The dataset has to be downloaded onto the cluster, which may take a few minutes.
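
When sizing the cluster, it helps to work out how many CPUs a given `RayParams` configuration asks for. The sketch below assumes each actor reserves `cpus_per_actor` CPUs, so the total is a simple product; `total_cpus` is a hypothetical helper, and the numbers are the ones used in the training and prediction cells of this notebook.

```python
def total_cpus(num_actors, cpus_per_actor):
    """CPUs requested in total, assuming each actor reserves cpus_per_actor."""
    return num_actors * cpus_per_actor


training_cpus = total_cpus(num_actors=4, cpus_per_actor=8)
prediction_cpus = total_cpus(num_actors=16, cpus_per_actor=2)

print(f"training requests {training_cpus} CPUs")      # 32
print(f"prediction requests {prediction_cpus} CPUs")  # 32
```

Both steps request the same total, split differently: a few larger actors for training and many small actors for prediction.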

In [None]:
# standard XGBoost config for classification
config = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
}

bst, evals_result = train_xgboost(
    config,
    df_train,
    df_validation,
    LABEL_COLUMN,
    RayParams(cpus_per_actor=8, num_actors=4))
print(evals_result)
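
The printed dictionary is nested by evaluation set name and then by metric name, with one value per boosting round; this is why `train_xgboost` reads the final error as `evals_result["eval"]["error"][-1]`. The sketch below uses made-up values of the same shape to show how the best round could be looked up. `best_round` is a hypothetical helper, and the numbers are illustrative, not results from this notebook.

```python
# Hypothetical values, shaped like the dictionary filled in by `train`:
# {evaluation set name: {metric name: [one value per boosting round]}}
example_evals_result = {
    "eval": {
        "logloss": [0.62, 0.58, 0.55, 0.56],
        "error": [0.35, 0.31, 0.29, 0.30],
    }
}


def best_round(evals_result, eval_name="eval", metric="error"):
    """Return (round index, value) for the lowest value of the given metric."""
    history = evals_result[eval_name][metric]
    best_index = min(range(len(history)), key=history.__getitem__)
    return best_index, history[best_index]


index, value = best_round(example_evals_result)
print(f"Lowest validation error {value:.4f} at boosting round {index}")
```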

## Prediction

With the model trained, we can now run prediction. In practice this would be done on unseen data; for the purposes of this example, we reuse the dataset that was loaded for training.

Since prediction is naively parallelizable, distributing it over multiple actors can measurably reduce the amount of time needed.

In [None]:
inference_df = RayDMatrix(df, ignore=[LABEL_COLUMN, "partition"])
results = predict(
    bst,
    inference_df,
    ray_params=RayParams(cpus_per_actor=2, num_actors=16))

print(results)
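
Prediction parallelizes easily because each row is scored independently of all others. The following standard-library sketch illustrates that property only: threads stand in for Ray actors, `score` is a stand-in for a trained model, and none of the names below come from XGBoost-Ray.

```python
from concurrent.futures import ThreadPoolExecutor


def score(row):
    """Stand-in for a trained model: any pure function of a single row."""
    return sum(row) / len(row)


def predict_chunk(rows):
    """Score every row of one chunk, in order."""
    return [score(row) for row in rows]


def parallel_predict(rows, num_workers):
    """Split rows into contiguous chunks, score them concurrently, and rejoin."""
    # Ceiling division, with a minimum of 1 so an empty input is handled.
    chunk_size = max(1, -(-len(rows) // num_workers))
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in chunk order, so output order matches input order.
        results = list(pool.map(predict_chunk, chunks))
    return [prediction for chunk in results for prediction in chunk]


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
predictions = parallel_predict(rows, num_workers=2)
print(predictions == predict_chunk(rows))  # True
```

Because no chunk depends on another, adding workers changes only how the rows are divided, not the predictions themselves.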