# XGBoost regression (multi-node with Dask)

<table>
    <tr>
        <td>
            <img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="300">
        </td>
        <td>
            <img src="https://upload.wikimedia.org/wikipedia/commons/6/69/XGBoost_logo.png" width="300">
        </td>
        <td>
            <img src="./img/snowflake.png" width="450">
        </td>
    </tr>
</table>

This notebook describes how to use Dask to scale training of XGBoost models. For more detailed information, see ["Distributed XGBoost with Dask"](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html) in the XGBoost documentation and ["XGBoost Training with Dask"](https://www.saturncloud.io/docs/tutorials/xgboost/) in Saturn Cloud's documentation.

In [None]:
import os

MODEL_PATH = 'models'
# Create the output directory if it does not already exist
os.makedirs(MODEL_PATH, exist_ok=True)
    
numeric_feat = [
    'pickup_weekday', 
    'pickup_weekofyear', 
    'pickup_hour', 
    'pickup_week_hour', 
    'pickup_minute', 
    'passenger_count',
]
categorical_feat = [
    'pickup_taxizone_id', 
    'dropoff_taxizone_id',
]
features = numeric_feat + categorical_feat
y_col = 'tip_fraction'

# Initialize Dask cluster

The code below uses [`dask-saturn`](https://github.com/saturncloud/dask-saturn) to create a Dask cluster or connect to one that is already running.

In [None]:
import dask
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

n_workers = 3
cluster = SaturnCluster(
    n_workers=n_workers, 
    scheduler_size='medium',
    worker_size='large', 
    nthreads=2,
)
client = Client(cluster)
cluster

Open the dashboard (link above ^) and watch it while you execute commands. You'll see which tasks are running across the cluster.

If you created your cluster in this notebook, it might take a few minutes for all of your nodes to become available. Run the cell below to block until all nodes are ready.

>**Pro tip**: Create and/or start your cluster from the "Dask" page in Saturn if you want to get a head start!

In [None]:
client.wait_for_workers(n_workers=n_workers)

# Load data and feature engineering

Load a sample from a single month for this exercise. We load the data with Dask by defining a `dask.delayed` function, so that partitions are loaded in parallel across the cluster.

In [None]:
import os
import dask.dataframe as dd
import snowflake.connector

SNOWFLAKE_ACCOUNT = os.environ['SNOWFLAKE_ACCOUNT']
SNOWFLAKE_USER = os.environ['SNOWFLAKE_USER']
SNOWFLAKE_PASSWORD = os.environ['SNOWFLAKE_PASSWORD']

SNOWFLAKE_WAREHOUSE = os.environ['SNOWFLAKE_WAREHOUSE']
TAXI_DATABASE = os.environ['TAXI_DATABASE']
TAXI_SCHEMA = os.environ['TAXI_SCHEMA']

conn_info = {
    'account': SNOWFLAKE_ACCOUNT,
    'user': SNOWFLAKE_USER,
    'password': SNOWFLAKE_PASSWORD,
    'warehouse': SNOWFLAKE_WAREHOUSE,
    'database': TAXI_DATABASE,
    'schema': TAXI_SCHEMA,
}
conn = snowflake.connector.connect(**conn_info)

The `load()` function specifies how each partition of the Dask DataFrame should load its chunk of the data. Because each partition of a Dask DataFrame is a pandas DataFrame, we use the `fetch_pandas_all()` method. We use a [binding for the Snowflake query](https://docs.snowflake.com/en/user-guide/python-connector-example.html#binding-data) so that we can pass a different value each time the query runs. In this case, each partition pulls a 30% sample of one day of taxi rides.

In [None]:
query = """
SELECT * FROM (
    SELECT 
        pickup_taxizone_id,
        dropoff_taxizone_id,
        passenger_count,
        DIV0(tip_amount, fare_amount) AS tip_fraction,
        DAYOFWEEKISO(pickup_datetime) - 1 AS pickup_weekday,
        WEEKOFYEAR(pickup_datetime) AS pickup_weekofyear,
        HOUR(pickup_datetime) AS pickup_hour,
        (pickup_weekday * 24) + pickup_hour AS pickup_week_hour,
        MINUTE(pickup_datetime) AS pickup_minute
    FROM taxi_yellow
    WHERE
        DATE(pickup_datetime) = %s
) SAMPLE (30)
"""

@dask.delayed
def load(conn_info, query, day):
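    # Each task opens its own connection on the worker that runs it
    conn = snowflake.connector.connect(**conn_info)
    try:
        taxi = conn.cursor().execute(query, str(day)).fetch_pandas_all()
    finally:
        conn.close()
    # Snowflake returns upper-case column names; lower-case them to match `features`
    taxi.columns = taxi.columns.str.lower()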
    return taxi    
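
To make the derived columns in the query concrete, here is a minimal pure-Python sketch of a few of them. The helper names `div0` and `derive_features` are hypothetical and exist only for illustration. The sketch covers `tip_fraction`, `pickup_weekday`, `pickup_hour`, `pickup_week_hour`, and `pickup_minute`, and it assumes `DIV0` returns 0 when the divisor is 0 and that `DAYOFWEEKISO` numbers Monday as 1.

```python
from datetime import datetime


def div0(numerator, denominator):
    """Illustrative stand-in for Snowflake's DIV0: return 0 when the divisor is 0."""
    return 0.0 if denominator == 0 else numerator / denominator


def derive_features(pickup_datetime, tip_amount, fare_amount):
    """Compute a subset of the query's derived columns for a single ride."""
    # isoweekday() is 1 (Monday) through 7 (Sunday), so subtract 1 to get 0..6
    pickup_weekday = pickup_datetime.isoweekday() - 1
    pickup_hour = pickup_datetime.hour
    return {
        "tip_fraction": div0(tip_amount, fare_amount),
        "pickup_weekday": pickup_weekday,
        "pickup_hour": pickup_hour,
        # Hour of the week: 0 is Monday 00:00, 167 is Sunday 23:00
        "pickup_week_hour": pickup_weekday * 24 + pickup_hour,
        "pickup_minute": pickup_datetime.minute,
    }


# 2019-01-02 was a Wednesday
ride = derive_features(datetime(2019, 1, 2, 14, 30), tip_amount=2.0, fare_amount=10.0)
free_ride = derive_features(datetime(2019, 1, 7, 0, 5), tip_amount=0.0, fare_amount=0.0)
```

Here `ride["pickup_week_hour"]` combines the weekday and the hour into a single feature, and `free_ride["tip_fraction"]` is 0 rather than a division error.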

Run a query to determine which specific dates we need to pull.

In [None]:
def get_dates(start, end):
    date_query = """
    SELECT
        DISTINCT(DATE(pickup_datetime)) as date 
    FROM taxi_yellow
    WHERE
        DATE(pickup_datetime) BETWEEN %s AND %s
    """
    dates_df = conn.cursor().execute(date_query, (start, end)).fetch_pandas_all()
    return dates_df['DATE'].tolist()

dates = get_dates('2019-01-01', '2019-01-31')

The `dd.from_delayed` function takes a list of delayed pandas DataFrames and combines them into a single high-level Dask DataFrame, with one partition per delayed object.

Since we're sampling in the Snowflake query, we run `persist()` to tell Dask to load all results into memory across the cluster. This ensures that Dask's lazy evaluation doesn't call the Snowflake query multiple times.

In [None]:
taxi = dd.from_delayed([load(conn_info, query, day) for day in dates])
taxi = taxi.persist()
_ = wait(taxi)

In [None]:
print(f'Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum().compute() / 1e6} MB')

In [None]:
taxi_train = taxi[features + [y_col]].astype(float).fillna(-1)

In [None]:
taxi_train.head()

# Train a model

This example uses the native Dask integration built into XGBoost. That integration was added in `xgboost` 1.3.0, and should be preferred to [`dask-xgboost`](https://github.com/dask/dask-xgboost).

In [None]:
import xgboost as xgb

Training data for `xgboost.dask` needs to be prepared in a special object called `DaskDMatrix`. This is like the XGBoost `DMatrix` that you might be familiar with, but is backed by Dask's distributed collections (Dask DataFrame and Dask Array).

In [None]:
dtrain = xgb.dask.DaskDMatrix(
    client=client,
    data=taxi_train[features],
    label=taxi_train[y_col]
)

You can pass any [xgboost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) to `xgb.dask.train()`. The training process will then start up on all workers that have some of the data in `dtrain`.

In [None]:
%%time
result = xgb.dask.train(
    client=client,
    params={
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "learning_rate": 0.1,
        "max_depth": 5,
    },
    dtrain=dtrain,
    num_boost_round=50
)

`xgb.dask.train()` returns a dictionary. Its `"booster"` entry is a regular `xgb.core.Booster` object, the same model object produced by non-Dask training.

In [None]:
booster = result["booster"]
type(booster)

# xgboost.core.Booster

## Save model

In [None]:
import cloudpickle

with open(f'{MODEL_PATH}/xgboost_dask.pkl', 'wb') as f:
    cloudpickle.dump(booster, f)

## Calculate metrics on test set

Use a different month for the test set.

In [None]:
test_dates = get_dates('2019-02-01', '2019-02-28')
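taxi_test = dd.from_delayed([load(conn_info, query, day) for day in test_dates])
# Apply the same type conversion and missing-value fill used for the training data
taxi_test = taxi_test[features + [y_col]].astype(float).fillna(-1)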
taxi_test = taxi_test.persist()
_ = wait(taxi_test)

`xgboost.dask.predict()` can be used to create predictions on a Dask collection using an XGBoost model object. Because the model object here is just a regular XGBoost model, using `xgboost.dask` for batch scoring doesn't require that you also perform training on Dask.

This function returns a Dask Array or Dask Series of predictions, depending on the input type.

In [None]:
preds = xgb.dask.predict(
    client=client,
    model=booster,
    data=taxi_test[features]
)

The metrics functions in `dask_ml` can compute metrics on Dask collections like Dask Array and Dask DataFrame, so you never have to hold all of the test data in memory on the client. These functions intentionally mimic the metrics functions in scikit-learn.

In [None]:
from dask_ml.metrics import mean_squared_error

mean_squared_error(taxi_test[y_col].to_dask_array(), preds.to_dask_array(), squared=False)
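
Passing `squared=False` asks for the root mean squared error (RMSE) instead of the mean squared error. As a rough illustration of what that number means, here is a minimal pure-Python sketch of the same calculation on a few made-up values. The `rmse` helper and the sample data are hypothetical and only for illustration; they are not how `dask_ml` computes the metric on distributed data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences of numbers."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up tip fractions, for illustration only
actual = [0.10, 0.20, 0.30]
predicted = [0.10, 0.25, 0.20]
score = rmse(actual, predicted)
```

Because RMSE is in the same units as the target, a score here can be read directly as a typical error in `tip_fraction`.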