# XGBoost regression (multi-node with Dask)

<table>
    <tr>
        <td>
            <img src="https://docs.dask.org/en/latest/_images/dask_horizontal.svg" width="300">
        </td>
        <td>
            <img src="https://upload.wikimedia.org/wikipedia/commons/6/69/XGBoost_logo.png" width="300">
        </td>
    </tr>
</table>

In [1]:
import os

MODEL_PATH = 'models'
os.makedirs(MODEL_PATH, exist_ok=True)  # create the output directory if it is missing
    
numeric_feat = [
    'pickup_weekday', 
    'pickup_weekofyear', 
    'pickup_hour', 
    'pickup_week_hour', 
    'pickup_minute', 
    'passenger_count',
]
categorical_feat = [
    'PULocationID', 
    'DOLocationID',
]
features = numeric_feat + categorical_feat
y_col = 'tip_fraction'

# Initialize Dask cluster

In [2]:
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

n_workers = 3
cluster = SaturnCluster(
    n_workers=n_workers,
    scheduler_size='medium',
    worker_size='large',
    nthreads=2
)
client = Client(cluster)
cluster

[2020-12-14 18:59:23] INFO - dask-saturn | Cluster is ready
[2020-12-14 18:59:23] INFO - dask-saturn | Registering default plugins
[2020-12-14 18:59:23] INFO - dask-saturn | {'tcp://10.0.0.15:46173': {'status': 'OK'}, 'tcp://10.0.12.175:38441': {'status': 'OK'}, 'tcp://10.0.17.82:34435': {'status': 'OK'}}



Open the dashboard (the link in the cluster widget above) and watch it while you run commands. You'll see which tasks are running across the cluster.

If you created your cluster here in this notebook, it might take a few minutes for all your nodes to become available. You can run the cell below to block until all workers are ready.

>**Pro tip**: Create and/or start your cluster from the "Dask" page in Saturn if you want to get a head start!

In [3]:
client.wait_for_workers(n_workers=n_workers)

# Load data and feature engineering

Load a 30% sample of a single month (January 2019) for this exercise.

In [4]:
import dask.dataframe as dd

taxi = dd.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
    assume_missing=True,
).sample(frac=0.3, replace=False)

In [5]:
print(f'Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum().compute() / 1e6} MB')

Num rows: 2300337, Size: 464.668074 MB


In [6]:
def prep_df(df: dd.DataFrame) -> dd.DataFrame:
    '''
    Generate features from a raw taxi dataframe.
    '''
    df = df[df.fare_amount > 0]  # avoid divide-by-zero
    df['tip_fraction'] = df.tip_amount / df.fare_amount
    
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_weekofyear'] = df.tpep_pickup_datetime.dt.isocalendar().week
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [y_col]].astype(float).fillna(-1)
    
    return df
    
taxi_train = prep_df(taxi)

In [7]:
taxi_train.head()

| | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_week_hour | pickup_minute | passenger_count | PULocationID | DOLocationID | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|
| 684541 | 4.0 | 1.0 | 11.0 | 107.0 | 41.0 | 2.0 | 195.0 | 74.0 | 0.0 |
| 710772 | 4.0 | 1.0 | 14.0 | 110.0 | 29.0 | 0.0 | 132.0 | 17.0 | 0.204 |
| 369969 | 2.0 | 1.0 | 22.0 | 70.0 | 53.0 | 2.0 | 151.0 | 116.0 | 0.0 |
| 93652 | 1.0 | 1.0 | 12.0 | 36.0 | 3.0 | 1.0 | 236.0 | 163.0 | 0.217778 |
| 131375 | 1.0 | 1.0 | 16.0 | 40.0 | 25.0 | 6.0 | 138.0 | 137.0 | 0.253469 |
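
To make the feature arithmetic in `prep_df` concrete, here is a minimal single-trip sketch using only the standard library. The helper name `pickup_features` and the trip values (timestamp, tip, fare) are hypothetical and chosen for illustration; the real pipeline applies the same formulas column-wise on a Dask DataFrame.

```python
from datetime import datetime


def pickup_features(pickup: datetime, tip_amount: float, fare_amount: float) -> dict:
    """Compute the same derived columns as prep_df, for one trip (illustrative helper)."""
    weekday = pickup.weekday()  # Monday=0 ... Sunday=6, same convention as pandas .dt.weekday
    hour = pickup.hour
    return {
        'pickup_weekday': weekday,
        'pickup_weekofyear': pickup.isocalendar()[1],  # ISO week number
        'pickup_hour': hour,
        'pickup_week_hour': weekday * 24 + hour,  # hour index within the week, 0..167
        'pickup_minute': pickup.minute,
        'tip_fraction': tip_amount / fare_amount,  # prep_df drops rows with fare_amount <= 0 first
    }


# Hypothetical trip: Friday 2019-01-04 at 11:41, $2 tip on a $10 fare
example = pickup_features(datetime(2019, 1, 4, 11, 41), tip_amount=2.0, fare_amount=10.0)
print(example)
```

For this made-up trip, `pickup_week_hour` is 4 * 24 + 11 = 107, so the feature encodes "Friday, 11 o'clock" as a single number between 0 and 167.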


Dask performs computations in a [lazy manner](https://tutorial.dask.org/01x_lazy.html), so we persist the dataframe to perform data loading and feature processing once.

In [8]:
%%time
taxi_train = taxi_train.persist()
_ = wait(taxi_train)

CPU times: user 29.8 ms, sys: 560 µs, total: 30.4 ms
Wall time: 8.47 s
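
The reason persisting helps can be sketched with a toy stand-in for a lazy collection. The `LazyValue` class below is hypothetical and is not Dask's API; it only counts how often the expensive "load and prepare" step runs with and without persisting.

```python
class LazyValue:
    """Toy lazy value: the wrapped function runs only when a result is requested."""

    def __init__(self, func):
        self.func = func
        self.calls = 0          # how many times the expensive function has run
        self._cached = None
        self._persisted = False

    def compute(self):
        # Without persist, every compute re-runs the whole pipeline.
        if self._persisted:
            return self._cached
        self.calls += 1
        return self.func()

    def persist(self):
        # Run the pipeline once and keep the result for later computes.
        if not self._persisted:
            self.calls += 1
            self._cached = self.func()
            self._persisted = True
        return self


def load_and_prepare():
    # Stand-in for reading the CSV and building features.
    return [x * 2 for x in range(5)]


lazy = LazyValue(load_and_prepare)
lazy.compute()
lazy.compute()
calls_without_persist = lazy.calls

persisted = LazyValue(load_and_prepare).persist()
persisted.compute()
persisted.compute()
calls_with_persist = persisted.calls

print(calls_without_persist, calls_with_persist)
```

In this toy model the unpersisted value repeats the work on every request, while the persisted one does it once. In Dask, the persisted results are additionally held in the workers' memory across the cluster.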


# Train a model

In [11]:
import xgboost as xgb

In [12]:
help(xgb.dask.DaskDMatrix)

Help on class DaskDMatrix in module xgboost.dask:

class DaskDMatrix(builtins.object)
 |  DaskDMatrix(client, data, label=None, missing=None, weight=None, base_margin=None, label_lower_bound=None, label_upper_bound=None, feature_names=None, feature_types=None)
 |  
 |  DMatrix holding on references to Dask DataFrame or Dask Array.  Constructing
 |  a `DaskDMatrix` forces all lazy computation to be carried out.  Wait for
 |  the input data explicitly if you want to see actual computation of
 |  constructing `DaskDMatrix`.
 |  
 |  .. note::
 |  
 |      DaskDMatrix does not repartition or move data between workers.  It's
 |      the caller's responsibility to balance the data.
 |  
 |  .. versionadded:: 1.0.0
 |  
 |  Parameters
 |  ----------
 |  client: dask.distributed.Client
 |      Specify the dask client used for training.  Use default client
 |      returned from dask if it's set to None.
 |  data : dask.array.Array/dask.dataframe.DataFrame
 |      data source of DMatrix.
 |  lab

In [14]:
dtrain = xgb.dask.DaskDMatrix(
    client=client,
    data=taxi_train[features],
    label=taxi_train[y_col]
)

In [16]:
%%time
bst = xgb.dask.train(
    client=client,
    params={
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "learning_rate": 0.1,
        "max_depth": 5,
    },
    dtrain=dtrain,
    num_boost_round=50
)

AttributeError: /opt/conda/envs/saturn/lib/libxgboost.so: undefined symbol: XGDMatrixSetDenseInfo

## Save model

In [None]:
import cloudpickle

with open(f'{MODEL_PATH}/xgboost_dask.pkl', 'wb') as f:
    cloudpickle.dump(bst, f)  # bst is the result returned by xgb.dask.train above

## Calculate metrics on test set

Use a different month (a 1% sample of February 2019) for the test set.

In [None]:
taxi_test = dd.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2019-02.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
    assume_missing=True,
).sample(frac=0.01, replace=False)

taxi_test = prep_df(taxi_test)

The metrics functions in `dask_ml` can compute metrics on Dask collections like Dask Array and Dask DataFrame, so you never have to hold all of the test data in memory on the client. These functions intentionally mimic the metrics functions in scikit-learn.

In [None]:
from dask_ml.metrics import mean_squared_error

# Predict with the functional Dask API; bst is the result of xgb.dask.train above
preds = xgb.dask.predict(client, bst, taxi_test[features])
mean_squared_error(taxi_test[y_col].to_dask_array(), preds, squared=False)
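
With `squared=False`, the call above returns the root mean squared error (RMSE) rather than the MSE. As a rough illustration of why such a metric does not need all of the data in one place, the sketch below accumulates the squared error chunk by chunk in plain Python. The function `chunked_rmse` and the sample values are hypothetical, and this is not a description of how `dask_ml` implements the metric internally.

```python
import math


def chunked_rmse(chunks):
    """RMSE over an iterable of (y_true, y_pred) chunks of equal-length sequences."""
    total_sq_err = 0.0
    n = 0
    for y_true, y_pred in chunks:
        # Each chunk contributes only a partial sum and a row count.
        total_sq_err += sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        n += len(y_true)
    return math.sqrt(total_sq_err / n)


# Two made-up chunks of tip fractions and predictions
chunks = [
    ([0.0, 0.2], [0.1, 0.2]),
    ([0.25], [0.05]),
]
rmse = chunked_rmse(chunks)
print(rmse)
```

Because each chunk reduces to two numbers (a sum of squared errors and a count), only those small partial results have to be combined at the end.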