# LightGBM Classification (multi-node with Dask)

This notebook describes a machine learning training workflow using the famous [NYC Taxi Dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). That dataset contains information on taxi trips in New York City.

In this exercise, you'll load data into a [Dask DataFrame](https://docs.dask.org/en/latest/dataframe.html) and use [LightGBM](https://lightgbm.readthedocs.io/en/latest/) to answer this question:

> based on characteristics that can be known at the beginning of a trip, will this trip result in a high tip?

This notebook gives an introductory tutorial on how to use Dask to scale training of LightGBM models. For more detailed information, see ["LightGBM Training with Dask"](https://www.saturncloud.io/docs/tutorials/lightgbm/) in Saturn Cloud's documentation.

In [1]:
numeric_feat = [
    "pickup_weekday",
    "pickup_weekofyear",
    "pickup_hour",
    "pickup_week_hour",
    "pickup_minute",
    "passenger_count",
]
categorical_feat = [
    "PULocationID",
    "DOLocationID",
]
features = numeric_feat + categorical_feat
y_col = "tip_fraction"

<hr>

## Initialize A Dask Cluster

This tutorial uses multiple machines to show how to apply more computing resources to machine learning training. This is done with Dask. Saturn Cloud offers managed Dask clusters, which can be provisioned and modified programmatically.

The code below creates a Dask cluster using [`dask-saturn`](https://github.com/saturncloud/dask-saturn), the official Dask client for Saturn Cloud. It creates a cluster with the following specs:

* `n_workers=3` --> 3 machines in the cluster
* `scheduler_size='medium'` --> the Dask scheduler will have 2 CPU cores and 4 GB of RAM
* `worker_size='large'` --> each worker machine will have 2 CPU cores and 16 GB of RAM

To see a list of possible sizes, run the code below.

In [2]:
import dask_saturn

dask_saturn.describe_sizes()

{'medium': 'Medium - 2 cores - 4 GB RAM',
 'large': 'Large - 2 cores - 16 GB RAM',
 'xlarge': 'XLarge - 4 cores - 32 GB RAM',
 '2xlarge': '2XLarge - 8 cores - 64 GB RAM',
 '4xlarge': '4XLarge - 16 cores - 128 GB RAM',
 '8xlarge': '8XLarge - 32 cores - 256 GB RAM',
 '12xlarge': '12XLarge - 48 cores - 384 GB RAM',
 '16xlarge': '16XLarge - 64 cores - 512 GB RAM',
 'g4dnxlarge': 'T4-XLarge - 4 cores - 16 GB RAM - 1 GPU',
 'g4dn4xlarge': 'T4-4XLarge - 16 cores - 64 GB RAM - 1 GPU',
 'g4dn8xlarge': 'T4-8XLarge - 32 cores - 128 GB RAM - 1 GPU',
 'p32xlarge': 'V100-2XLarge - 8 cores - 61 GB RAM - 1 GPU',
 'p38xlarge': 'V100-8XLarge - 32 cores - 244 GB RAM - 4 GPU',
 'p316xlarge': 'V100-16XLarge - 64 cores - 488 GB RAM - 8 GPU'}

The `dask-saturn` code below creates two important objects: a cluster and a client.

* `cluster`: knows about and manages the scheduler and workers
    - can be used to create, resize, reconfigure, or destroy those resources
    - knows how to communicate with the scheduler, and where to find logs and diagnostic dashboards
* `client`: tells the cluster to do things
    - can send work to the cluster
    - can restart all the worker processes
    - can send data to the cluster or pull data back from the cluster

In [4]:
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

n_workers = 3
cluster = SaturnCluster(n_workers=n_workers, scheduler_size="medium", worker_size="large")
client = Client(cluster)
cluster

INFO:dask-saturn:Cluster is ready
INFO:dask-saturn:Registering default plugins
INFO:dask-saturn:{'tcp://10.0.16.130:38105': {'status': 'OK'}, 'tcp://10.0.22.67:45431': {'status': 'OK'}, 'tcp://10.0.26.1:41171': {'status': 'OK'}}



If you created your cluster in this notebook, it might take a few minutes for all of its nodes to become available. Run the cell below to block until all workers are ready.

>**Pro tip**: Create and/or start your cluster in the Saturn UI if you want to get a head start!

In [5]:
client.wait_for_workers(n_workers=n_workers)

## Load data and engineer features

Load a 30% sample of a single month (January 2019) for this exercise.

In [6]:
import dask.dataframe as dd

taxi = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
    assume_missing=True,
).sample(frac=0.3, replace=False)

In [7]:
print(f"Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum().compute() / 1e6} MB")

Num rows: 2300337, Size: 464.668074 MB


In [11]:
def prep_df(df: dd.DataFrame) -> dd.DataFrame:
    """
    Generate features from a raw taxi dataframe.
    """
    df = df[df.fare_amount > 0]  # avoid divide-by-zero
    df["tip_fraction"] = df.tip_amount / df.fare_amount

    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    df["pickup_weekofyear"] = df.tpep_pickup_datetime.dt.isocalendar().week
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [y_col]].astype(float).fillna(-1)

    return df


taxi_train = prep_df(taxi)

In [12]:
taxi_train.head()

| | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_week_hour | pickup_minute | passenger_count | PULocationID | DOLocationID | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|
| 492548 | 3.0 | 1.0 | 14.0 | 86.0 | 10.0 | 2.0 | 228.0 | 227.0 | 0.217778 |
| 1120 | 1.0 | 1.0 | 0.0 | 24.0 | 29.0 | 1.0 | 68.0 | 68.0 | 0.257778 |
| 625148 | 4.0 | 1.0 | 6.0 | 102.0 | 37.0 | 1.0 | 262.0 | 48.0 | 0.0 |
| 49041 | 1.0 | 1.0 | 3.0 | 27.0 | 26.0 | 1.0 | 132.0 | 39.0 | 0.209811 |
| 520958 | 3.0 | 1.0 | 16.0 | 88.0 | 52.0 | 1.0 | 162.0 | 229.0 | 0.0 |
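The time-based features built in `prep_df` can be checked by hand for a single timestamp. The sketch below uses only the standard library; the helper `time_features` and the example timestamp are hypothetical, picked so that the result happens to match the time columns of the first row shown above. It assumes the same conventions as the pandas accessors used in `prep_df`: Monday is weekday 0 and the week number is the ISO 8601 week.

```python
from datetime import datetime


def time_features(pickup: datetime) -> dict:
    """Compute the pickup-time features for one timestamp (illustrative helper)."""
    weekday = pickup.weekday()  # Monday=0 ... Sunday=6
    week_of_year = pickup.isocalendar()[1]  # ISO 8601 week number
    return {
        "pickup_weekday": weekday,
        "pickup_weekofyear": week_of_year,
        "pickup_hour": pickup.hour,
        # Hour of the week: 0 (Monday 00:00) through 167 (Sunday 23:00)
        "pickup_week_hour": weekday * 24 + pickup.hour,
        "pickup_minute": pickup.minute,
    }


# Hypothetical pickup time: Thursday, 3 January 2019, 14:10
example = time_features(datetime(2019, 1, 3, 14, 10))
print(example)
```

Here `pickup_week_hour` is 3 * 24 + 14 = 86, which places the pickup on a single 0 to 167 scale covering the whole week.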


Dask performs computations in a [lazy manner](https://tutorial.dask.org/01x_lazy.html), so we persist the dataframe to perform data loading and feature processing once.

In [13]:
%%time
taxi_train = taxi_train.persist()
_ = wait(taxi_train)

CPU times: user 28.2 ms, sys: 0 ns, total: 28.2 ms
Wall time: 8.19 s
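To see why persisting helps, here is a toy analogy written in plain Python. `LazyValue` is a hypothetical class invented for this illustration and is not how Dask is implemented; in particular, a real `persist()` returns a new collection whose results are held in the memory of the cluster's workers. The sketch only shows the bookkeeping idea: without persisting, every request for the result repeats the underlying work.

```python
class LazyValue:
    """Toy stand-in for a lazy collection (not Dask's real implementation)."""

    def __init__(self, func):
        self._func = func
        self._result = None
        self._persisted = False
        self.evaluations = 0  # how many times the underlying work actually ran

    def compute(self):
        # Once persisted, return the stored result without redoing the work
        if self._persisted:
            return self._result
        self.evaluations += 1
        return self._func()

    def persist(self):
        # Run the work once and keep the result for later calls
        if not self._persisted:
            self._result = self.compute()
            self._persisted = True
        return self


lazy_total = LazyValue(lambda: sum(range(1_000_000)))

# Without persisting, each compute() repeats the work
lazy_total.compute()
lazy_total.compute()
evaluations_before_persist = lazy_total.evaluations

# After persisting, further compute() calls reuse the stored result
lazy_total.persist()
lazy_total.compute()
lazy_total.compute()
evaluations_after_persist = lazy_total.evaluations

print(evaluations_before_persist, evaluations_after_persist)
```

In the notebook, the repeated work would be reading the CSV from S3 and running `prep_df`, which is why the persisted dataframe is used for both training and prediction.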


## Train a model

In [70]:
import lightgbm.dask as dlgbm
from dask_ml.metrics import accuracy_score

In [67]:
data = taxi_train[features]
label = taxi_train[y_col]
label = label > 0.2
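The cell above turns the continuous `tip_fraction` into a binary target: a trip counts as a "high tip" when the tip is strictly more than 20% of the fare. A minimal sketch of that arithmetic on a few made-up trips follows; the fare and tip values are hypothetical and are not taken from the dataset.

```python
HIGH_TIP_THRESHOLD = 0.2  # same cutoff as `label > 0.2` above

# Hypothetical (fare_amount, tip_amount) pairs, in dollars
trips = [(10.0, 2.5), (22.5, 4.9), (8.0, 0.0), (15.0, 3.0)]

labels = []
for fare, tip in trips:
    tip_fraction = tip / fare
    # Strict inequality: a tip of exactly 20% is not labeled as a high tip
    labels.append(tip_fraction > HIGH_TIP_THRESHOLD)

# Share of trips labeled as high tip
positive_rate = sum(labels) / len(labels)

print(labels, positive_rate)
```

The last trip tips exactly 20% and is therefore labeled `False`, which shows that the boundary case falls on the negative side.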

In [68]:
import random

# Choose a random port in [9000, 12000] to pass as LightGBM's `local_listen_port` below
port = random.randint(9000, 12000)

In [71]:
dask_classifier = dlgbm.DaskLGBMClassifier(
    silent=False,
    max_depth=5,
    random_state=708,
    learning_rate=0.1,
    tree_learner="data",
    n_estimators=50,
    min_child_samples=1,
    n_jobs=-1,
    local_listen_port=port,
    time_out=5
)
dask_classifier = dask_classifier.fit(data, label, client=client)

In [72]:
preds = dask_classifier.predict(data)

In [73]:
accuracy_score(label, preds)  # argument order is (y_true, y_pred)

0.5381862198462837
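Accuracy is the fraction of predictions that match the true labels. Whether a given score is good depends on the class balance, which this notebook does not report, so it can help to compare against a baseline that always predicts the most common class. The sketch below shows that comparison in plain Python; the helper functions and the label lists are hypothetical and unrelated to the taxi data.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    positives = sum(y_true)
    majority_class = positives > len(y_true) - positives
    return accuracy(y_true, [majority_class] * len(y_true))


# Hypothetical labels and predictions
y_true = [True, False, False, True, False, False, True, False]
y_pred = [True, False, True, False, False, False, True, True]

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = majority_baseline_accuracy(y_true)

print(model_accuracy, baseline_accuracy)
```

In this made-up example both values are 0.625, so the predictions are no more accurate than always answering `False`. Running the same comparison on the real labels would show how much the trained model adds over such a baseline.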

<hr>

## Save model

The model object returned by `dlgbm.DaskLGBMClassifier.fit()` is an instance of `dlgbm.DaskLGBMClassifier`, but its `booster_` attribute holds a regular LightGBM `Booster` object, the same kind that non-Dask training would produce.

That `Booster` can be saved to a file using `.save_model()`, or serialized directly using `cloudpickle`. From there, you can deploy it wherever and however you deploy other machine learning models.



In [74]:
import cloudpickle

bst = dask_classifier.booster_

# write with cloudpickle
with open("model.pkl", "wb") as f:
    cloudpickle.dump(bst, f)
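Files written with `cloudpickle` use the standard pickle format, so they can be read back with `pickle.load`, provided the classes involved (here, LightGBM) are importable in the loading environment. The sketch below shows the write-then-read round trip with a plain dictionary standing in for the `Booster`, so that it runs without LightGBM installed; the stand-in object and the temporary path are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the Booster so the sketch runs without LightGBM installed
stand_in_model = {"n_estimators": 50, "max_depth": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Write the object, as the cell above does with cloudpickle.dump
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back, as a deployment environment would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

With the real model, the restored object would be a `Booster`, and its `predict()` method could then be called on new feature data.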


<hr>

## Calculate metrics on test set

Use a 1% sample of a different month (February 2019) as the test set.

In [75]:
taxi_test = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-02.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
    assume_missing=True,
).sample(frac=0.01, replace=False)

taxi_test = prep_df(taxi_test)

In [76]:
preds = dask_classifier.predict(taxi_test[features])

In [77]:
label = taxi_test[y_col]
label = label > 0.2

In [78]:
from dask_ml.metrics import accuracy_score
accuracy_score(label, preds)  # argument order is (y_true, y_pred)

0.5585918547903764