# Training XGBoost with Dask RAPIDS in Databricks

This notebook shows how to deploy a Dask RAPIDS workflow in Databricks. We focus on the HIGGS dataset, a moderately sized classification problem from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/280/higgs).

In the following sections, we start with basic data loading from Delta Lake and preprocessing with Dask. We then train an XGBoost model on the resulting data with different configurations. Lastly, we share some optimization techniques for inference.

**Summary of what the problem is and what we are predicting**

## Pre-requisites

The following example needs to be run on a machine with at least one NVIDIA GPU. One of the advantages of Dask is its flexibility: users can scale the computation up to clusters with minimal code changes.

Additionally, Dask has introduced the [**dask-databricks**](https://github.com/dask-contrib/dask-databricks) CLI tool (available through [conda](https://github.com/conda-forge/dask-databricks-feedstock) and [pip](https://pypi.org/project/dask-databricks/)). With this tool, the `dask databricks run` command makes it easy to launch a RAPIDS Dask cluster on a multi-node, multi-GPU (MNMG) Databricks cluster in just a few minutes.


Follow these [instructions](https://github.com/rapidsai/deployment/blob/main/source/platforms/databricks.md) for more details on how to get started. **Note** that you may need to install additional packages using `pip` in the initialization script to configure the environment based on your specific workflow requirements.

In [0]:
from typing import Tuple

import pandas as pd
import numpy as np
import cupy
import cudf
import dask

import dask_cudf
import dask_databricks

import xgboost as xgb
from xgboost import dask as dxgb

from dask_ml.model_selection import train_test_split
from distributed import wait

# from dask import dataframe as dd



## Connect to Dask client


In [0]:
client = dask_databricks.get_client()
client

**Client**
- Connection method: Cluster object
- Cluster type: dask_databricks.DatabricksCluster
- Dashboard: https://dbc-dp-8721196619973675.cloud.databricks.com/driver-proxy/o/8721196619973675/1031-230718-l2ubf858/8087/status

**Cluster**
- Workers: 2
- Total threads: 2
- Total memory: 30.65 GiB

**Scheduler**
- Comm: tcp://10.59.154.181:8786
- Dashboard: http://10.59.154.181:8087/status
- Started: 6 minutes ago

**Workers**

| Comm | Dashboard | Nanny | Local directory | Total threads | Memory | GPU | GPU memory |
|---|---|---|---|---|---|---|---|
| tcp://10.59.141.37:42093 | http://10.59.141.37:42227/status | tcp://10.59.141.37:33061 | /tmp/dask-scratch-space/worker-riw08g3w | 1 | 15.33 GiB | Tesla T4 | 15.00 GiB |
| tcp://10.59.159.39:41453 | http://10.59.159.39:40211/status | tcp://10.59.159.39:42657 | /tmp/dask-scratch-space/worker-nn0fdlxo | 1 | 15.33 GiB | Tesla T4 | 15.00 GiB |


## Data Preparation 

With a cluster available, we start by loading the data onto the GPUs. Because the data is loaded multiple times during parameter tuning, we convert the CSV file into Parquet format for better performance. This can be done easily using `dask_cudf`.
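
The conversion itself is not shown in this notebook. The following is a minimal sketch, assuming the raw HIGGS CSV has no header row and 29 columns (the label first, followed by 28 features). The paths and the `feature_NN` column names are hypothetical placeholders, not names used elsewhere in this notebook.

```python
from typing import List


def higgs_column_names(n_features: int = 28) -> List[str]:
    """Column names for the header-less HIGGS CSV (feature names are illustrative)."""
    return ["label"] + [f"feature_{i:02d}" for i in range(n_features)]


def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    """Read the CSV with dask_cudf and write it back out as Parquet."""
    # Imported lazily so the helper above can be used without a GPU.
    import dask_cudf

    names = higgs_column_names()
    ddf = dask_cudf.read_csv(
        csv_path,
        header=None,  # assumption: the raw file has no header row
        names=names,
        dtype={name: "float32" for name in names},
    )
    ddf.to_parquet(parquet_path)


# Example call (hypothetical paths):
# csv_to_parquet("/data/HIGGS.csv", "/data/higgs_parquet")
```

Naming the first column `label` keeps the sketch consistent with the `ddf["label"]` lookup used when splitting the data.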

From a high level, this section of the notebook can be divided into the following steps: 

1. Download the dataset and, optionally, upload it to storage
2. Load the dataset and write to a Delta Lake table
3. Read from Delta Lake table using Dask and load into a `dask_cudf` dataframe
4. Split the returned `dask_cudf` dataframe using `dask_ml` in preparation for training


### Upload dataset
First, we download the dataset into `/data` in the current working directory. Optionally, you can upload it to DBFS or another supported storage service (S3, Google Cloud Storage, Azure Data Lake). Refer to the [docs](https://docs.databricks.com/en/storage/index.html#:~:text=Databricks%20uses%20cloud%20object%20storage,storage%20locations%20in%20your%20account.) for more information.

For demo purposes, we take the second option and upload the dataset to storage.

### Create Delta Table

All tables created in Databricks are Delta tables by default.

In [0]:
from pyspark.sql import DataFrame


def create_and_describe_delta_table(file_path: str, table_name: str) -> DataFrame:
    """
    Load data from a Parquet file into a Delta table and return detailed information about the table.

    Parameters:
    - file_path: The path to the Parquet file.
    - table_name: The name to be given to the Delta table.

    Returns:
    - delta_table_schema: DataFrame representing the schema of the Delta table.
    """
    # Load the data from its source. `spark` is the SparkSession that
    # Databricks notebooks provide as a global.
    data = spark.read.load(file_path, format="parquet")

    # Write to Delta table using the schema inferred from `data`
    data.write.saveAsTable(table_name)

    # Display detailed information about the Delta table.
    delta_table_schema = spark.sql("DESCRIBE EXTENDED {}".format(table_name))

    return delta_table_schema

### Load from Delta Lake using Dask

In [0]:
import dask_cudf
import dask_deltatable as ddt


def read_deltatable_with_dask(file_path: str, table_name: str):
    """
    Read data from a Delta table using Dask and convert it to a Dask cuDF DataFrame.

    Parameters:
    - file_path: The path to the Delta table.
    - table_name: The name of the Delta table (not used; the table is read by path).

    Returns:
    - ddf: Dask cuDF DataFrame representing the data from the Delta table.
    """
    # Read from the Delta table using Dask; this returns a CPU-backed Dask DataFrame.
    cpu_ddf = ddt.read_deltalake(file_path)

    # Convert the Dask DataFrame to Dask cuDF for GPU acceleration.
    # An alternative, left disabled here, is to set the cuDF backend globally:
    # dask.config.set({"dataframe.backend": "cudf"})
    ddf = dask_cudf.from_dask_dataframe(cpu_ddf)

    return ddf

### Splitting data

In the preceding step, we used dask-cudf to load data from disk; now we use the `train_test_split` function from dask-ml to split the dataset.

Most of the time, the GPU backend of Dask works seamlessly with utilities in dask-ml, so we can accelerate the entire ML pipeline:


In [0]:
def load_higgs(
    ddf,
) -> Tuple[
    dask_cudf.core.DataFrame,
    dask_cudf.core.Series,
    dask_cudf.core.DataFrame,
    dask_cudf.core.Series,
]:
    y = ddf["label"]
    X = ddf[ddf.columns.difference(["label"])]

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=42
    )
    X_train, X_valid, y_train, y_valid = client.persist(
        [X_train, X_valid, y_train, y_valid]
    )
    wait([X_train, X_valid, y_train, y_valid])

    return X_train, X_valid, y_train, y_valid

In [0]:
# `ddf` is the Dask cuDF DataFrame returned by `read_deltatable_with_dask`.
X_train, X_valid, y_train, y_valid = load_higgs(ddf)

# print(f"len(X_train): {len(X_train)}")
# print(f"len(X_valid): {len(X_valid)}")
# print(f"len(y_train): {len(y_train)}")
# print(f"len(y_valid): {len(y_valid)}")
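
As far as we understand, dask-ml splits Dask DataFrames by assigning rows at random, so the resulting sizes are only approximately 67% and 33% of the data. The NumPy sketch below illustrates that kind of random row assignment. It is an illustration of the idea under that assumption, not dask-ml's actual implementation, and `random_row_split` is a hypothetical helper.

```python
import numpy as np


def random_row_split(n_rows: int, test_size: float, seed: int):
    """Assign each row to train or validation with an independent random draw."""
    rng = np.random.default_rng(seed)
    is_valid = rng.random(n_rows) < test_size
    return np.flatnonzero(~is_valid), np.flatnonzero(is_valid)


n_rows = 100_000
train_idx, valid_idx = random_row_split(n_rows=n_rows, test_size=0.33, seed=42)
valid_fraction = len(valid_idx) / n_rows
print(f"validation fraction: {valid_fraction:.3f}")  # close to 0.33, not exactly
```

Uncommenting the `len(...)` prints in the cell above reports the actual partition sizes on the cluster.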



## Training model
One of the most frequently requested features is early stopping support for the Dask interface. As of the XGBoost 1.4 release, we can not only specify the number of stopping rounds but also develop customized early stopping strategies. In the simplest case, passing the number of stopping rounds to the train function enables early stopping.

There are two things to notice in the training function in this section. First, we specify the number of rounds that triggers early stopping. XGBoost stops training once the validation metric fails to improve for X consecutive rounds, where X is the number of rounds specified for early stopping. Second, we use a data type called `DaskDeviceQuantileDMatrix` for training but `DaskDMatrix` for validation. `DaskDeviceQuantileDMatrix` is a drop-in replacement for `DaskDMatrix` for GPU-based training inputs that avoids extra data copies.
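
The stopping rule described above can be sketched in plain Python. This is a simplified illustration for a metric that should decrease (such as `error`), not XGBoost's own callback code. The helper `early_stopping_round` and the error history are hypothetical.

```python
from typing import List, Optional


def early_stopping_round(errors: List[float], rounds: int) -> Optional[int]:
    """Return the index of the boosting round at which training would stop,
    or None if the metric never goes `rounds` consecutive rounds without improving."""
    best = float("inf")
    stale = 0  # consecutive rounds without improvement
    for i, err in enumerate(errors):
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= rounds:
                return i
    return None


# Hypothetical validation errors, one per boosting round.
history = [0.30, 0.28, 0.27, 0.27, 0.28, 0.27, 0.29, 0.27, 0.26]
print(early_stopping_round(history, rounds=5))  # 7
```

With `rounds=5`, the best value of 0.27 is reached at round index 2 and is not beaten in the following five rounds, so training stops at index 7 even though a later round would have improved on it.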

In [0]:
!pip list

Package                      Version
---------------------------- -------------
absl-py                      1.4.0
asttokens                    2.2.1
astunparse                   1.6.3
backcall                     0.2.0
blinker                      1.4
bokeh                        3.2.2
cachetools                   5.3.1
certifi                      2023.7.22
charset-normalizer           3.2.0
click                        8.1.7
cloudpickle                  3.0.0
comm                         0.1.3
contourpy                    1.1.0
cryptography                 3.4.8
cubinlinker-cu11             0.3.0.post1
cuda-python                  11.8.3
cudf-cu11                    23.10.2
cuml-cu11                    23.10.0
cupy-cuda11x                 12.2.0
cycler                       0.11.0
dask                         2023.9.2
dask-cuda                    23.10.0
dask-cudf-cu11               23.10.2
dask-databricks              0.3.0
dask-deltatable              0.3

In [0]:
def fit_model_es(client, X, y, X_valid, y_valid) -> dxgb.Booster:
    early_stopping_rounds = 5
    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
    # train the model
    booster = dxgb.train(
        client,
        {
            "objective": "binary:logistic",
            "eval_metric": "error",
            "tree_method": "gpu_hist",
        },
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        early_stopping_rounds=early_stopping_rounds,
    )["booster"]
    return booster

In [0]:
booster = fit_model_es(client, X=X_train, y=y_train, X_valid=X_valid, y_valid=y_valid)
booster



Out[15]: <xgboost.core.Booster at 0x7f7b5ece38e0>

## Customized objective and evaluation metric

XGBoost is designed to be extensible through customized objective functions and metrics. In 1.4, this feature was brought to the Dask interface. The requirements are exactly the same as for the single-node interface.

Optional: In the example below, we use a custom objective function and metric to implement logistic regression along with early stopping. Note that the objective function returns both the gradient and the hessian, which XGBoost uses to optimize the model. Also, the parameter named `metric_name` needs to be specified in our callback. It informs XGBoost that the custom error function should be used for evaluating the early stopping criterion.
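
To see why the objective returns `predt - labels` and `predt * (1 - predt)`, one can compare them against numerical derivatives of the log loss with respect to the raw margin. The following NumPy sketch is a standalone sanity check on made-up values; the helper names are hypothetical and it does not involve XGBoost.

```python
import numpy as np


def sigmoid(margin: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-margin))


def logloss(margin: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-sample negative log-likelihood as a function of the raw margin."""
    p = sigmoid(margin)
    return -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))


def logit_grad_hess(margin: np.ndarray, labels: np.ndarray):
    """Analytic first and second derivatives of the log loss w.r.t. the margin."""
    p = sigmoid(margin)
    return p - labels, p * (1.0 - p)


# Made-up margins and labels for the check.
margin = np.array([-2.0, -0.5, 0.0, 1.5])
labels = np.array([0.0, 1.0, 1.0, 0.0])
grad, hess = logit_grad_hess(margin, labels)

# Central finite differences of the loss.
eps = 1e-4
loss_plus = logloss(margin + eps, labels)
loss_minus = logloss(margin - eps, labels)
num_grad = (loss_plus - loss_minus) / (2 * eps)
num_hess = (loss_plus - 2 * logloss(margin, labels) + loss_minus) / eps**2

print(np.allclose(grad, num_grad, atol=1e-6))  # True
print(np.allclose(hess, num_hess, atol=1e-4))  # True
```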

In [0]:
def fit_model_customized_objective(client, X, y, X_valid, y_valid) -> dxgb.Booster:
    def logit(predt: np.ndarray, Xy: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
        predt = 1.0 / (1.0 + np.exp(-predt))
        labels = Xy.get_label()
        grad = predt - labels
        hess = predt * (1.0 - predt)
        return grad, hess

    def error(predt: np.ndarray, Xy: xgb.DMatrix) -> Tuple[str, float]:
        label = Xy.get_label()
        r = np.zeros(predt.shape)
        predt = 1.0 / (1.0 + np.exp(-predt))
        gt = predt > 0.5
        r[gt] = 1 - label[gt]
        le = predt <= 0.5
        r[le] = label[le]
        return "CustomErr", float(np.average(r))

    # Use early stopping with custom objective and metric.
    early_stopping_rounds = 5
    # Specify the metric we want to use for early stopping.
    es = xgb.callback.EarlyStopping(
        rounds=early_stopping_rounds, save_best=True, metric_name="CustomErr"
    )

    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
    booster = dxgb.train(
        client,
        {"eval_metric": "error", "tree_method": "gpu_hist"},
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        obj=logit,  # pass the custom objective
        feval=error,  # pass the custom metric
        callbacks=[es],
    )["booster"]
    return booster

In [0]:
booster_custom = fit_model_customized_objective(
    client, X=X_train, y=y_train, X_valid=X_valid, y_valid=y_valid
)
booster_custom



Out[22]: <xgboost.core.Booster at 0x7f7becae4be0>

## Explaining the model
After obtaining our first model, we might want to explain its predictions using SHAP. SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of machine learning models based on Shapley values. For details about the algorithm, please refer to the SHAP papers. As XGBoost now supports GPU-accelerated Shapley values, this feature has been extended to the Dask interface, so users can compute SHAP values on distributed GPU clusters. This is enabled by the significantly improved predict function and the GPUTreeShap library:

In [0]:
def explain(client, model, X):
    # Use an array instead of a dataframe in case the output has more than 2 dimensions.
    X_array = X.values
    contribs = dxgb.predict(
        client, model, X_array, pred_contribs=True, validate_features=False
    )
    # Use the result for further analysis
    return contribs

In [0]:
contribs = explain(client, model=booster, X=X_train)
contribs

| | Array | Chunk |
|---|---|---|
| Bytes | unknown | unknown |
| Shape | (nan, 30) | (nan, 30) |
| Dask graph | 1 chunks in 3 graph layers | 1 chunks in 3 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |
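
With `pred_contribs=True`, XGBoost returns one contribution per feature plus a final bias column, and each row sums to the raw margin prediction. The additivity behind this can be illustrated with an exact Shapley computation on a tiny toy function. This is a brute-force sketch for intuition only, not the GPUTreeShap algorithm. The toy `model`, the input `x`, and the zero baseline for absent features are all assumptions made for the example.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def shapley_values(value: Callable[[FrozenSet[int]], float], n: int) -> List[float]:
    """Exact Shapley values of an n-player game by enumerating all coalitions."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


x = (1.0, 2.0, 3.0)  # the single input being explained


def model(features) -> float:
    """Toy model with one additive term and one interaction term."""
    return 2.0 * features[0] + features[1] * features[2]


def value(coalition: FrozenSet[int]) -> float:
    """Model output when features outside the coalition are set to a zero baseline."""
    masked = [x[j] if j in coalition else 0.0 for j in range(len(x))]
    return model(masked)


phi = shapley_values(value, n=3)
print([round(v, 6) for v in phi])  # [2.0, 3.0, 3.0]
print(round(sum(phi), 6), model(x) - value(frozenset()))  # 8.0 8.0
```

The interaction term `x1 * x2` contributes 6 in total, and it is shared equally between the two features involved.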


In [0]:
preds = dxgb.predict(client, booster, X_train)
preds

Out[35]: Dask Series Structure:
npartitions=1
    float32
        ...
Name: 0, dtype: float32
Dask Name: getitem, 3 graph layers

In [0]:
preds.head()

Out[37]: 0     0.891946
1     0.769179
3     0.882120
10    0.798238
11    0.663871
Name: 0, dtype: float32
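
Because the model was trained with the `binary:logistic` objective, these values are probabilities of the positive class. The `error` metric treats a probability above 0.5 as a positive prediction, and the same rule can be applied to turn the output into class labels. The sketch below uses the five values shown above; the `labels` array and the `error_rate` helper are hypothetical and only there to illustrate the calculation.

```python
import numpy as np

# The five probabilities from the `preds.head()` output.
probs = np.array([0.891946, 0.769179, 0.882120, 0.798238, 0.663871], dtype=np.float32)

# Same rule as the `error` metric: a probability above 0.5 counts as positive.
pred_labels = (probs > 0.5).astype(np.int32)
print(pred_labels)  # [1 1 1 1 1]


def error_rate(probs: np.ndarray, labels: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of rows whose thresholded prediction disagrees with the label."""
    return float(np.mean((probs > threshold).astype(labels.dtype) != labels))


# Hypothetical labels for these five rows, for illustration only.
labels = np.array([1, 1, 0, 1, 1], dtype=np.int32)
print(error_rate(probs, labels))  # 0.2
```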