# Training XGBoost with Dask RAPIDS in Databricks

This notebook shows how to deploy a Dask RAPIDS workflow in Databricks. We focus on the HIGGS dataset, a moderately sized classification problem from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/280/higgs).

In the following sections, we start with basic data loading from Delta Lake and preprocessing with Dask. We then train an XGBoost model on the resulting data with different configurations. Lastly, we cover model explanation and inference.


## Launch multi-node Dask Cluster

The following example must be run on a machine with at least one NVIDIA GPU. One advantage of Dask is that users can distribute or scale up computation across multiple GPUs with minimal changes to existing code.

Dask recently introduced [**dask-databricks**](https://github.com/dask-contrib/dask-databricks) (available via [conda](https://github.com/conda-forge/dask-databricks-feedstock) and [pip](https://pypi.org/project/dask-databricks/)). By default, the CLI command `dask databricks run` launches a Dask scheduler on the driver node and standard workers on the remaining nodes. To launch a Dask cluster with GPU workers, you must pass the `--cuda` flag.

**Note** that you may need to install additional packages using `pip` in the initialization script to configure the environment based on your specific requirements.

Follow the [instructions](https://github.com/rapidsai/deployment/blob/main/source/platforms/databricks.md) on the RAPIDS deployment page to get started!


## Import packages 

Once your cluster has launched, start by importing all necessary libraries and dependencies.

In [0]:
import os
from time import time
from typing import Tuple

import pandas as pd
import numpy as np
import cupy
import cudf
import dask
import dask_cudf
import dask_databricks
import dask_deltatable as ddt
import xgboost as xgb
from xgboost import dask as dxgb
from dask_ml.model_selection import train_test_split
from distributed import wait



## Connect to Dask Client

Connect to the client (and, optionally, the dashboard) to submit tasks.

In [0]:
client = dask_databricks.get_client()
client

Connection method: Cluster object
Cluster type: dask_databricks.DatabricksCluster
Dashboard: https://dbc-dp-8721196619973675.cloud.databricks.com/driver-proxy/o/8721196619973675/1031-230718-l2ubf858/8087/status

Dashboard: https://dbc-dp-8721196619973675.cloud.databricks.com/driver-proxy/o/8721196619973675/1031-230718-l2ubf858/8087/status
Workers: 2
Total threads: 2
Total memory: 30.65 GiB

Comm: tcp://10.59.190.49:8786
Workers: 2
Dashboard: http://10.59.190.49:8087/status
Total threads: 2
Started: 42 minutes ago
Total memory: 30.65 GiB

Comm: tcp://10.59.170.147:38187
Total threads: 1
Dashboard: http://10.59.170.147:43171/status
Memory: 15.33 GiB
Nanny: tcp://10.59.170.147:45893
Local directory: /tmp/dask-scratch-space/worker-vqj1vwhq
GPU: Tesla T4
GPU memory: 15.00 GiB

Comm: tcp://10.59.172.70:35957
Total threads: 1
Dashboard: http://10.59.172.70:34875/status
Memory: 15.33 GiB
Nanny: tcp://10.59.172.70:36051
Local directory: /tmp/dask-scratch-space/worker-j77rdsmw
GPU: Tesla T4
GPU memory: 15.00 GiB


## Integrating Delta Lake Tables with Dask

[Delta Lake](https://docs.databricks.com/en/delta/index.html) is an optimized storage layer within the Databricks lakehouse that provides a foundational platform for storing data and tables. This open-source software extends Parquet data files by incorporating a file-based transaction log to support [ACID transactions](https://docs.databricks.com/en/lakehouse/acid.html) and scalable metadata handling. 

Delta Lake is the default storage format for all operations on Databricks; unless otherwise specified, all tables on Databricks are Delta tables. Whether you’re using Apache Spark DataFrames or SQL, you can get all the benefits of Delta Lake just by saving your data to the lakehouse with default settings.

Check out the [tutorial](https://docs.databricks.com/en/delta/tutorial.html) for examples of basic Delta Lake operations.


Let's explore step by step how we can use Delta Lake tables with Dask to accelerate data preprocessing with RAPIDS.


### Download dataset from source

First, we download the dataset into `/data` in the current directory or into the Databricks File System (DBFS). Alternatively, you could use cloud object storage (S3, Google Cloud Storage, Azure Data Lake Storage). Refer to the [docs](https://docs.databricks.com/en/storage/index.html#:~:text=Databricks%20uses%20cloud%20object%20storage,storage%20locations%20in%20your%20account.) for more information.

Uncomment the next three cells to download and decompress the dataset in your chosen location. Only run them ONCE!


In [0]:
# %fs mkdirs dbfs:/dbfs/databricks/skirui

In [0]:
# !curl https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz --output /dbfs/dbfs/databricks/skirui/HIGGS.csv.gz

In [0]:
# unzip compressed file into .csv file
# !gunzip "/dbfs/dbfs/databricks/skirui/HIGGS.csv.gz"

Next, we load the data into GPUs. Because the data is loaded multiple times during parameter tuning, we convert the original CSV file into Parquet format for better performance. This can be done easily with `dask_cudf`.

In [0]:
def to_parquet(dirpath):
    """Convert the decompressed HIGGS.csv file in `dirpath` to Parquet files."""

    parquet_path = os.path.join(dirpath, "HIGGS.parquet")

    # Check if the Parquet file already exists, if yes, return its path
    if os.path.exists(parquet_path):
        return parquet_path

    # else, read the CSV file into a Dask DataFrame
    csv_path = os.path.join(dirpath, "HIGGS.csv")
    colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]
    df = dask_cudf.read_csv(csv_path, header=None, names=colnames, dtype=float)

    # write the Dask DataFrame to Parquet files
    df.to_parquet(parquet_path, engine="pyarrow")

    # Return the path to the created Parquet file
    return parquet_path

In [0]:
# path = to_parquet("/dbfs/dbfs/databricks/skirui/")
# path

### Store data in Delta Lake

In this step, we load the Parquet files from their DBFS location using `spark.read.load()` and then write them to a Delta table.

For managed tables, Databricks determines the location for the data. To check the location and schema of the table, you can use the `DESCRIBE DETAIL` statement, e.g. `display(spark.sql('DESCRIBE DETAIL <table_name>'))`.


In [0]:
def create_delta_table(path, tablename):
    """
    Load data from a Parquet file into a Delta table and return detailed information about the table.

    Parameters:
    - path: The DBFS path to the Parquet file.
    - tablename: The name to be given to the Delta table.

    Returns:
    - delta_table: Spark DataFrame holding the `DESCRIBE DETAIL` output for the Delta table.
    """
    # Load the data.
    data = spark.read.load(path, format="parquet")

    # Write to Delta table using the schema inferred from `data`
    data.write.saveAsTable(tablename)

    # Query and display detailed information about the Delta table.
    # `display()` returns None, so keep a reference to the DataFrame itself.
    delta_table = spark.sql(f"DESCRIBE DETAIL {tablename}")
    display(delta_table)

    return delta_table

In [0]:
higgs_table = create_delta_table(
    "dbfs:/dbfs/databricks/skirui/HIGGS.parquet", "HIGGS_table"
)
higgs_table

In [0]:
display(spark.sql("DESCRIBE DETAIL higgs_table"))

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,acc6f59b-9fb4-44e6-a824-558e2cfe11b7,spark_catalog.default.higgs_table,,dbfs:/user/hive/warehouse/higgs_table,2023-12-02T06:06:45.538+0000,2023-12-02T06:07:41.000+0000,List(),10,943799193,Map(),1,2,"List(appendOnly, invariants)",Map()


### Read from Delta table with Dask

With Dask's [**dask-deltatable**](https://github.com/dask-contrib/dask-deltatable/tree/main), we can read the Parquet files behind a Delta Lake table and parallelize the work with Dask. Calling `dask_deltatable.read_deltalake()` returns a Dask DataFrame. However, our objective is to use GPU acceleration for the entire ML pipeline, including data processing, model training, and inference.

For this reason, we convert the Dask DataFrame into a Dask cuDF DataFrame using `dask_cudf.from_dask_dataframe()`.
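
Note that `DESCRIBE DETAIL` reports the table location as a `dbfs:/...` URI, while `read_deltalake()` is called below with a `/dbfs/...` path, which is how DBFS appears to local file APIs on the cluster. The mapping between the two can be sketched as follows; `dbfs_uri_to_local_path` is a hypothetical helper written for illustration, not part of any Databricks or Dask API.

```python
def dbfs_uri_to_local_path(uri: str) -> str:
    """Map a `dbfs:/...` URI to the `/dbfs/...` path seen by local file APIs."""
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a URI starting with {prefix!r}, got {uri!r}")
    # Drop the scheme and re-root the remainder under the /dbfs mount point.
    return "/dbfs/" + uri[len(prefix):].lstrip("/")


# Location reported by DESCRIBE DETAIL for the table created earlier.
location = "dbfs:/user/hive/warehouse/higgs_table"
local_path = dbfs_uri_to_local_path(location)
print(local_path)
```

The resulting string matches the path passed to `read_deltatable_with_dask()` in this notebook.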

In [0]:
def read_deltatable_with_dask(path):
    """
    Read data from a Delta table using Dask and convert it to a Dask cuDF DataFrame.

    Parameters:
    - path: The directory path to the delta lake table.

    Returns:
    - ddf: cuDF Dask DataFrame representing the data from the Delta table.
    """
    # Read from Delta table using Dask, returns dask dataframe
    dd = ddt.read_deltalake(path)

    # Convert the Dask DataFrame to a Dask cuDF DataFrame for GPU acceleration.
    # Optionally, uncomment the line below to make cuDF the default Dask DataFrame backend.
    # dask.config.set({"dataframe.backend": "cudf"})
    ddf = dask_cudf.from_dask_dataframe(dd)

    return ddf

In [0]:
ddf = read_deltatable_with_dask("/dbfs/user/hive/warehouse/higgs_table")
ddf

Unnamed: 0_level_0,__null_dask_index__,1.000000000000000000e+00,8.692932128906250000e-01,-6.350818276405334473e-01,2.256902605295181274e-01,3.274700641632080078e-01,-6.899932026863098145e-01,7.542022466659545898e-01,-2.485731393098831177e-01,-1.092063903808593750e+00,0.000000000000000000e+00,1.374992132186889648e+00,-6.536741852760314941e-01,9.303491115570068359e-01,1.107436060905456543e+00,1.138904333114624023e+00,-1.578198313713073730e+00,-1.046985387802124023e+00,0.000000000000000000e+00.1,6.579295396804809570e-01,-1.045456994324922562e-02,-4.576716944575309753e-02,3.101961374282836914e+00,1.353760004043579102e+00,9.795631170272827148e-01,9.780761599540710449e-01,9.200048446655273438e-01,7.216574549674987793e-01,9.887509346008300781e-01,8.766783475875854492e-01
npartitions=10,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1
,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [0]:
ddf.shape

Out[22]: (Delayed('int-1fd68d93-98b9-46b4-bc34-2783b45dfa49'), 30)

In [0]:
colnames = ["partition"] + ["label"] + ["feature-%02d" % i for i in range(1, 29)]
ddf.columns = colnames
ddf.head()

Unnamed: 0,partition,label,feature-01,feature-02,feature-03,feature-04,feature-05,feature-06,feature-07,feature-08,...,feature-19,feature-20,feature-21,feature-22,feature-23,feature-24,feature-25,feature-26,feature-27,feature-28
0,0,1.0,0.907542,0.329147,0.359412,1.49797,-0.31301,1.095531,-0.557525,-1.58823,...,-1.13893,-0.000819,0.0,0.30222,0.833048,0.9857,0.978098,0.779732,0.992356,0.798343
1,1,1.0,0.798835,1.470639,-1.635975,0.453773,0.425629,1.104875,1.282322,1.381664,...,1.128848,0.900461,0.0,0.909753,1.10833,0.985692,0.951331,0.803252,0.865924,0.780118
2,2,0.0,1.344385,-0.876626,0.935913,1.99205,0.882454,1.786066,-1.646778,-0.942383,...,-0.678379,-1.360356,0.0,0.946652,1.028704,0.998656,0.728281,0.8692,1.026736,0.957904
3,3,1.0,1.105009,0.321356,1.522401,0.882808,-1.205349,0.681466,-1.070464,-0.921871,...,-0.373566,0.113041,0.0,0.755856,1.361057,0.98661,0.838085,1.133295,0.872245,0.808487
4,4,0.0,1.595839,-0.607811,0.007075,1.81845,-0.111906,0.84755,-0.566437,1.581239,...,-0.654227,-1.274345,3.101961,0.823761,0.938191,0.971758,0.789176,0.430553,0.961357,0.957818


### Splitting data

In the preceding step, we used `dask-cudf` to load data from the Delta table. Now we use the `train_test_split()` function from `dask-ml` to split the dataset.

Most of the time, the GPU backend of Dask works seamlessly with utilities in `dask-ml`, so we can accelerate the entire ML pipeline as follows:


In [0]:
def load_higgs(
    ddf,
) -> Tuple[
    dask_cudf.core.DataFrame,
    dask_cudf.core.Series,
    dask_cudf.core.DataFrame,
    dask_cudf.core.Series,
]:
    y = ddf["label"]
    X = ddf[ddf.columns.difference(["label"])]

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=42
    )
    X_train, X_valid, y_train, y_valid = client.persist(
        [X_train, X_valid, y_train, y_valid]
    )
    wait([X_train, X_valid, y_train, y_valid])

    return X_train, X_valid, y_train, y_valid

In [0]:
X_train, X_valid, y_train, y_valid = load_higgs(ddf)



In [0]:
X_train.head()

Unnamed: 0,feature-01,feature-02,feature-03,feature-04,feature-05,feature-06,feature-07,feature-08,feature-09,feature-10,...,feature-20,feature-21,feature-22,feature-23,feature-24,feature-25,feature-26,feature-27,feature-28,partition
0,0.907542,0.329147,0.359412,1.49797,-0.31301,1.095531,-0.557525,-1.58823,2.173076,0.812581,...,-0.000819,0.0,0.30222,0.833048,0.9857,0.978098,0.779732,0.992356,0.798343,0
1,0.798835,1.470639,-1.635975,0.453773,0.425629,1.104875,1.282322,1.381664,0.0,0.851737,...,0.900461,0.0,0.909753,1.10833,0.985692,0.951331,0.803252,0.865924,0.780118,1
3,1.105009,0.321356,1.522401,0.882808,-1.205349,0.681466,-1.070464,-0.921871,0.0,0.800872,...,0.113041,0.0,0.755856,1.361057,0.98661,0.838085,1.133295,0.872245,0.808487,3
10,0.739357,-0.17829,0.829934,0.504539,-0.130217,0.961051,-0.355518,-1.717399,2.173076,0.620956,...,0.39882,3.101961,0.944536,1.026261,0.982197,0.542115,1.250979,0.830045,0.761308,10
11,1.384098,0.116822,-1.179879,0.762913,-0.079782,1.019863,0.877318,1.276887,2.173076,0.331252,...,0.504809,3.101961,0.959325,0.807376,1.191814,1.22121,0.861141,0.929341,0.838302,11


In [0]:
y_train.head()

Out[33]: 0     1.0
1     1.0
3     1.0
10    0.0
11    1.0
Name: label, dtype: float64
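
The gaps in the index above (0, 1, 3, 10, 11) are what one would expect when rows are assigned to each split by an independent random draw, in which case the split sizes only approximate the requested `test_size`. The sketch below illustrates that idea in plain Python. It is an assumption about the general mechanism, not the `dask-ml` implementation, and `random_row_split` is a hypothetical helper.

```python
import random


def random_row_split(n_rows, test_size, seed):
    """Assign each row index to train or validation by an independent random draw."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for i in range(n_rows):
        # Each row lands in the validation set with probability `test_size`.
        if rng.random() < test_size:
            valid_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, valid_idx


train_idx, valid_idx = random_row_split(n_rows=1000, test_size=0.33, seed=42)
print(len(train_idx), len(valid_idx))
print(train_idx[:5])  # training indices are sorted but not contiguous
```

Fixing the seed makes the assignment reproducible, which is the role `random_state=42` plays in `load_higgs()`.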

## Model training

There are two things to notice here. First, we specify the number of rounds that triggers early stopping. XGBoost stops the training process once the validation metric fails to improve for X consecutive rounds, where X is the number of rounds specified for early stopping.

Second, we use a data type called `DaskDeviceQuantileDMatrix` for training but `DaskDMatrix` for validation. `DaskDeviceQuantileDMatrix` is a drop-in replacement for `DaskDMatrix` for GPU-based training inputs that avoids extra data copies.
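
To make the early stopping rule concrete, here is a minimal plain-Python sketch of the stopping logic for a metric where lower is better. It only illustrates the rule described above; it is not XGBoost's implementation, and the function name and metric values are hypothetical.

```python
def early_stopping_round(metric_history, patience):
    """Return (stop_round, best_round) for a metric where lower is better.

    Training stops at the first round where the best value has not improved
    for `patience` consecutive rounds; otherwise it runs to the last round.
    """
    best_value = float("inf")
    best_round = -1
    for round_idx, value in enumerate(metric_history):
        if value < best_value:
            best_value, best_round = value, round_idx
        elif round_idx - best_round >= patience:
            return round_idx, best_round
    return len(metric_history) - 1, best_round


# Hypothetical validation error per boosting round (illustrative values only).
history = [0.40, 0.35, 0.31, 0.30, 0.31, 0.30, 0.32, 0.31, 0.33, 0.29]
stop_round, best_round = early_stopping_round(history, patience=5)
print(stop_round, best_round)
```

With these values the sketch stops at round 8 and reports round 3 as the best; the better value at round 9 is never reached, which is the trade-off a small patience value makes.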

In [0]:
def fit_model_es(client, X, y, X_valid, y_valid) -> dxgb.Booster:
    early_stopping_rounds = 5
    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
    # train the model
    booster = dxgb.train(
        client,
        {
            "objective": "binary:logistic",
            "eval_metric": "error",
            "tree_method": "gpu_hist",
        },
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        early_stopping_rounds=early_stopping_rounds,
    )["booster"]
    return booster

In [0]:
booster = fit_model_es(client, X=X_train, y=y_train, X_valid=X_valid, y_valid=y_valid)
booster



Out[35]: <xgboost.core.Booster at 0x7f21c13bd0f0>

## Train with a customized objective and evaluation metric

In the example below, the XGBoost model is trained using a custom logistic objective function (`logit`) and a custom evaluation metric (`error`), along with early stopping.

Note that the objective function returns both the gradient and the Hessian, which XGBoost uses to optimize the model. Also, `metric_name` must be specified in the early stopping callback so that XGBoost uses the custom metric (`CustomErr`) as the early stopping criterion.
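
As a sanity check on the gradient and Hessian formulas used by the custom objective (`predt - labels` and `predt * (1 - predt)` after the sigmoid), the sketch below compares them with finite differences of the logistic log loss. It uses only the standard library; the helper names and sample points are chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(z, y):
    """Negative log-likelihood of label y (0 or 1) given raw margin z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def analytic_grad_hess(z, y):
    """Closed-form derivatives with respect to z: grad = p - y, hess = p * (1 - p)."""
    p = sigmoid(z)
    return p - y, p * (1.0 - p)


def numeric_grad_hess(z, y, eps=1e-4):
    """Central finite differences of the log loss with respect to z."""
    grad = (log_loss(z + eps, y) - log_loss(z - eps, y)) / (2 * eps)
    hess = (log_loss(z + eps, y) - 2 * log_loss(z, y) + log_loss(z - eps, y)) / eps**2
    return grad, hess


for z, y in [(-1.5, 0.0), (0.3, 1.0), (2.0, 0.0)]:
    g, h = analytic_grad_hess(z, y)
    g_num, h_num = numeric_grad_hess(z, y)
    print(f"z={z:+.1f} y={y:.0f} grad={g:+.5f}/{g_num:+.5f} hess={h:.5f}/{h_num:.5f}")
```

The analytic and numeric values should agree to several decimal places, confirming that the objective is differentiated with respect to the raw margin rather than the probability.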

In [0]:
def fit_model_customized_objective(client, X, y, X_valid, y_valid) -> dxgb.Booster:
    def logit(predt: np.ndarray, Xy: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
        predt = 1.0 / (1.0 + np.exp(-predt))
        labels = Xy.get_label()
        grad = predt - labels
        hess = predt * (1.0 - predt)
        return grad, hess

    def error(predt: np.ndarray, Xy: xgb.DMatrix) -> Tuple[str, float]:
        label = Xy.get_label()
        r = np.zeros(predt.shape)
        predt = 1.0 / (1.0 + np.exp(-predt))
        gt = predt > 0.5
        r[gt] = 1 - label[gt]
        le = predt <= 0.5
        r[le] = label[le]
        return "CustomErr", float(np.average(r))

    # Use early stopping with custom objective and metric.
    early_stopping_rounds = 5
    # Specify the metric we want to use for early stopping.
    es = xgb.callback.EarlyStopping(
        rounds=early_stopping_rounds, save_best=True, metric_name="CustomErr"
    )

    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
    booster = dxgb.train(
        client,
        {"eval_metric": "error", "tree_method": "gpu_hist"},
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        obj=logit,  # pass the custom objective
        feval=error,  # pass the custom metric
        callbacks=[es],
    )["booster"]
    return booster

In [0]:
booster_custom = fit_model_customized_objective(
    client, X=X_train, y=y_train, X_valid=X_valid, y_valid=y_valid
)
booster_custom



Out[37]: <xgboost.core.Booster at 0x7f21b9d674f0>

## Explaining the model
After obtaining our first model, we might want to explain its predictions using [SHAP](https://github.com/slundberg/shap). SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of machine learning models based on Shapley values. For details about the algorithm, please refer to the [papers](https://github.com/slundberg/shap#citations).

In [0]:
# def explain(client, model, X):
#     # Use array instead of dataframe in case of output dim is greater than 2.
#     X_array = X.values
#     contribs = dxgb.predict(client, model, X_array, \
#                     pred_contribs=True, validate_features=False\
#                     )
#     # Use the result for further analysis
#     return contribs

In [0]:
# contribs = explain(client, model=booster, X=X_train)
# contribs
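
With `pred_contribs=True`, XGBoost returns one contribution per feature plus a final bias term for each row, and the values in a row sum to the raw (untransformed) prediction. The sketch below shows how such a row could be read for a `binary:logistic` model. The numbers are hypothetical and describe an imaginary three-feature model, not the HIGGS booster.

```python
import math

# Hypothetical contribution row for a model with three features:
# one value per feature, followed by the bias term in the last position.
contribs_row = [0.80, -0.25, 0.10, 0.35]

feature_contribs, bias = contribs_row[:-1], contribs_row[-1]

# The contributions are additive: their sum is the raw log-odds prediction.
margin = sum(feature_contribs) + bias

# Applying the logistic link recovers the predicted probability.
probability = 1.0 / (1.0 + math.exp(-margin))

print(f"margin={margin:.2f} probability={probability:.4f}")
```

Because the contributions live in log-odds space, they should be compared with each other before the sigmoid is applied, not after.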

## Running inference

After some tuning, we arrive at the final model for performing inference on new data. 



In [0]:
def predict(client, model, X):
    predt = dxgb.predict(client, model, X)
    return predt

In [0]:
preds = predict(client, booster, X_train)
preds.head()

Out[48]: 0     0.835550
1     0.970019
3     0.418864
10    0.297600
11    0.970294
Name: 0, dtype: float32
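
The values returned by `dxgb.predict()` for a `binary:logistic` model are probabilities, so a threshold is needed to obtain class labels. As a small illustration, the sketch below applies a 0.5 threshold to the five predictions shown above and compares them with the five labels from `y_train.head()`. Five rows are far too few to evaluate the model; the point is only to show the conversion.

```python
# Probabilities and labels copied from the `preds.head()` and `y_train.head()` outputs.
probabilities = [0.835550, 0.970019, 0.418864, 0.297600, 0.970294]
labels = [1.0, 1.0, 1.0, 0.0, 1.0]

threshold = 0.5  # predictions above the threshold are treated as the positive class
predicted = [1.0 if p > threshold else 0.0 for p in probabilities]

# Fraction of rows where the thresholded prediction disagrees with the label.
error_rate = sum(p != y for p, y in zip(predicted, labels)) / len(labels)

print(predicted)
print(error_rate)
```

For these five rows, only the row with index 3 falls on the wrong side of the threshold, giving an error rate of 0.2.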

## Clean up

In [0]:
client.close()