# An Example with LightGBM

This example is still experimental!

In [1]:
numeric_feat = [
    "pickup_weekday",
    "pickup_weekofyear",
    "pickup_hour",
    "pickup_week_hour",
    "pickup_minute",
    "passenger_count",
]
categorical_feat = [
    "PULocationID",
    "DOLocationID",
]
features = numeric_feat + categorical_feat
y_col = "tip_fraction"

<hr>

## Initialize A Dask Cluster

This tutorial uses Dask to spread machine learning training across multiple machines. Saturn Cloud offers managed Dask clusters, which can be provisioned and modified programmatically.

The code below creates a Dask cluster using [`dask-saturn`](https://github.com/saturncloud/dask-saturn), the official Dask client for Saturn Cloud. It creates a cluster with the following specs:

* `n_workers=3`: 3 worker machines in the cluster
* `scheduler_size='medium'`: the Dask scheduler will have 2 CPU cores and 4 GB of RAM
* `worker_size='large'`: each worker machine will have 2 CPU cores and 16 GB of RAM

To see a list of possible sizes, run the code below.

In [2]:
import dask_saturn

dask_saturn.describe_sizes()

{'medium': 'Medium - 2 cores - 4 GB RAM',
 'large': 'Large - 2 cores - 16 GB RAM',
 'xlarge': 'XLarge - 4 cores - 32 GB RAM',
 '2xlarge': '2XLarge - 8 cores - 64 GB RAM',
 '4xlarge': '4XLarge - 16 cores - 128 GB RAM',
 '8xlarge': '8XLarge - 32 cores - 256 GB RAM',
 '12xlarge': '12XLarge - 48 cores - 384 GB RAM',
 '16xlarge': '16XLarge - 64 cores - 512 GB RAM',
 'g4dnxlarge': 'T4-XLarge - 4 cores - 16 GB RAM - 1 GPU',
 'g4dn4xlarge': 'T4-4XLarge - 16 cores - 64 GB RAM - 1 GPU',
 'g4dn8xlarge': 'T4-8XLarge - 32 cores - 128 GB RAM - 1 GPU',
 'p32xlarge': 'V100-2XLarge - 8 cores - 61 GB RAM - 1 GPU',
 'p38xlarge': 'V100-8XLarge - 32 cores - 244 GB RAM - 4 GPU',
 'p316xlarge': 'V100-16XLarge - 64 cores - 488 GB RAM - 8 GPU'}
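The size descriptions above can be turned into a quick estimate of what a cluster provides in total. The sketch below is plain Python and does not call `dask-saturn`; `parse_size` and `cluster_totals` are hypothetical helpers written for this illustration, and the two descriptions are copied from the output above.

```python
import re

# Size descriptions copied from the describe_sizes() output above.
SIZES = {
    "medium": "Medium - 2 cores - 4 GB RAM",
    "large": "Large - 2 cores - 16 GB RAM",
}


def parse_size(description: str) -> dict:
    """Extract the core count and RAM (in GB) from a size description string."""
    cores = int(re.search(r"(\d+) cores", description).group(1))
    ram_gb = int(re.search(r"(\d+) GB RAM", description).group(1))
    return {"cores": cores, "ram_gb": ram_gb}


def cluster_totals(n_workers: int, worker_size: str) -> dict:
    """Sum cores and RAM across all workers (the scheduler is not counted)."""
    spec = parse_size(SIZES[worker_size])
    return {
        "cores": n_workers * spec["cores"],
        "ram_gb": n_workers * spec["ram_gb"],
    }


totals = cluster_totals(n_workers=3, worker_size="large")
print(totals)  # {'cores': 6, 'ram_gb': 48}
```

With three `large` workers, the cluster in this tutorial has 6 worker cores and 48 GB of worker RAM in total.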

The `dask-saturn` code below creates two important objects: a cluster and a client.

* `cluster`: knows about and manages the scheduler and workers
    - can be used to create, resize, reconfigure, or destroy those resources
    - knows how to communicate with the scheduler, and where to find logs and diagnostic dashboards
* `client`: tells the cluster to do things
    - can send work to the cluster
    - can restart all the worker processes
    - can send data to the cluster or pull data back from the cluster

In [3]:
from dask.distributed import Client, wait
from dask_saturn import SaturnCluster

n_workers = 3
cluster = SaturnCluster(n_workers=n_workers, scheduler_size="medium", worker_size="large")
client = Client(cluster)
cluster

[2021-01-12 22:19:30] INFO - dask-saturn | Cluster is ready
[2021-01-12 22:19:30] INFO - dask-saturn | Registering default plugins
[2021-01-12 22:19:30] INFO - dask-saturn | {'tcp://10.0.0.136:41119': {'status': 'repeat'}, 'tcp://10.0.16.173:37309': {'status': 'repeat'}, 'tcp://10.0.27.235:38503': {'status': 'repeat'}}



If you created your cluster in this notebook, it might take a few minutes for all of its nodes to become available. You can run the cell below to block until all nodes are ready.

>**Pro tip**: Create and/or start your cluster in the Saturn UI if you want to get a head start!

In [4]:
client.wait_for_workers(n_workers=n_workers)
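To make "block until all nodes are ready" concrete, here is a minimal polling sketch in plain Python. It is only an analogy: `wait_until` and `FakeCluster` are hypothetical names invented for this illustration, and nothing here reflects how Dask implements `wait_for_workers` internally.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.01) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One final check after the deadline has passed.
    return condition()


class FakeCluster:
    """Stand-in for a cluster whose workers come online one at a time."""

    def __init__(self) -> None:
        self.workers = 0

    def poll(self) -> int:
        # Each poll simulates one more worker having joined, up to 3.
        self.workers = min(self.workers + 1, 3)
        return self.workers


fake = FakeCluster()
ready = wait_until(lambda: fake.poll() >= 3)
print(ready, fake.workers)  # True 3
```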

## Load Data and Engineer Features

Load a 30% sample of a single month of trip data (January 2019) for this exercise.

In [5]:
import dask.dataframe as dd

taxi = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
    assume_missing=True,
).sample(frac=0.3, replace=False)

In [6]:
print(f"Num rows: {len(taxi)}, Size: {taxi.memory_usage(deep=True).sum().compute() / 1e6} MB")

Num rows: 2300337, Size: 464.668074 MB


In [7]:
def prep_df(df: dd.DataFrame) -> dd.DataFrame:
    """
    Generate features from a raw taxi dataframe.
    """
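    # drop rows with non-positive fares to avoid divide-by-zero
    df = df[df.fare_amount > 0]
    df["tip_fraction"] = df.tip_amount / df.fare_amount

    # calendar features derived from the pickup timestamp
    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    df["pickup_weekofyear"] = df.tpep_pickup_datetime.dt.isocalendar().week
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    # hour of the week: 0 (Monday 00:00) through 167 (Sunday 23:00)
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute
    # keep only the model inputs and the target; cast to float and mark missing values with -1
    df = df[features + [y_col]].astype(float).fillna(-1)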

    return df


taxi_train = prep_df(taxi)

In [8]:
taxi_train.head()

| | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_week_hour | pickup_minute | passenger_count | PULocationID | DOLocationID | tip_fraction |
|---|---|---|---|---|---|---|---|---|---|
| 603621 | 3.0 | 1.0 | 23.0 | 95.0 | 38.0 | 1.0 | 211.0 | 49.0 | 0.218571 |
| 103264 | 1.0 | 1.0 | 13.0 | 37.0 | 23.0 | 1.0 | 163.0 | 142.0 | 0.224615 |
| 167915 | 1.0 | 1.0 | 20.0 | 44.0 | 23.0 | 1.0 | 164.0 | 161.0 | 0.0 |
| 545276 | 3.0 | 1.0 | 18.0 | 90.0 | 15.0 | 1.0 | 143.0 | 48.0 | 0.0 |
| 218676 | 2.0 | 1.0 | 8.0 | 56.0 | 0.0 | 1.0 | 48.0 | 138.0 | 0.0 |
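The feature arithmetic in `prep_df` can be checked by hand for a single trip. The sketch below uses only the standard library rather than Dask; the pickup timestamp and the tip and fare amounts are made-up values chosen for illustration, and `pickup_features` and `tip_fraction` are hypothetical helper names.

```python
from datetime import datetime
from typing import Optional


def pickup_features(pickup: datetime) -> dict:
    """Compute the same calendar features as prep_df for one timestamp."""
    weekday = pickup.weekday()  # Monday=0 ... Sunday=6
    hour = pickup.hour
    return {
        "pickup_weekday": weekday,
        "pickup_weekofyear": pickup.isocalendar()[1],  # ISO week number
        "pickup_hour": hour,
        "pickup_week_hour": weekday * 24 + hour,
        "pickup_minute": pickup.minute,
    }


def tip_fraction(tip_amount: float, fare_amount: float) -> Optional[float]:
    """Tip as a fraction of the fare; None when the fare is not positive."""
    if fare_amount <= 0:
        return None
    return tip_amount / fare_amount


# Hypothetical trip: picked up on Thursday 2019-01-03 at 23:38.
row = pickup_features(datetime(2019, 1, 3, 23, 38))
print(row)
# {'pickup_weekday': 3, 'pickup_weekofyear': 1, 'pickup_hour': 23, 'pickup_week_hour': 95, 'pickup_minute': 38}

# Hypothetical amounts: a 1.53 tip on a 7.00 fare.
print(round(tip_fraction(1.53, 7.0), 6))  # 0.218571
```

A Thursday at 23:00 is weekday 3, so the hour of the week is 3 * 24 + 23 = 95.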


Dask evaluates computations [lazily](https://tutorial.dask.org/01x_lazy.html), so we persist the dataframe to make sure that data loading and feature processing run only once.

In [9]:
%%time
taxi_train = taxi_train.persist()
_ = wait(taxi_train)

CPU times: user 27 ms, sys: 2.61 ms, total: 29.6 ms
Wall time: 7.7 s
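To see why persisting helps, here is a toy analogy in plain Python. It is not Dask code: `LazyValue` and `load_and_prepare` are hypothetical names invented for this sketch, and the class only mimics the idea that a lazy result is recomputed on every use unless it has been persisted.

```python
class LazyValue:
    """A deferred computation that re-runs on every compute() unless persisted."""

    def __init__(self, func):
        self.func = func
        self.runs = 0  # how many times the work was actually done
        self._cached = None
        self._persisted = False

    def compute(self):
        if self._persisted:
            return self._cached
        self.runs += 1
        return self.func()

    def persist(self):
        """Run the work once and keep the result for later calls."""
        if not self._persisted:
            self._cached = self.compute()
            self._persisted = True
        return self


def load_and_prepare():
    # Stand-in for reading the CSV and building features.
    return [x * 2 for x in range(5)]


lazy = LazyValue(load_and_prepare)
lazy.compute()
lazy.compute()
print(lazy.runs)  # 2

persisted = LazyValue(load_and_prepare).persist()
persisted.compute()
persisted.compute()
print(persisted.runs)  # 1
```

Without persisting, each use repeats the loading work. After persisting, later steps reuse the stored result.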


## Train a Model

In [10]:
import lightgbm.dask as dlgbm

In [11]:
data = taxi_train[features]
label = taxi_train[y_col]
label = label > 0.2  # binary target: True when the tip exceeds 20% of the fare
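The comparison above turns the continuous `tip_fraction` into a boolean target for classification. A small plain-Python illustration, using the five `tip_fraction` values from the `taxi_train.head()` output shown earlier:

```python
# tip_fraction values taken from the taxi_train.head() output above.
tip_fractions = [0.218571, 0.224615, 0.0, 0.0, 0.0]

THRESHOLD = 0.2  # same cutoff as in the cell above
labels = [fraction > THRESHOLD for fraction in tip_fractions]
print(labels)  # [True, True, False, False, False]

# Share of positive examples among these five rows.
positive_share = sum(labels) / len(labels)
print(positive_share)  # 0.4
```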

In [12]:
import random

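# random TCP port for LightGBM's distributed training to listen on (passed as local_listen_port below)
port = random.randint(9000, 12000)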

In [13]:
dask_classifier = dlgbm.DaskLGBMClassifier(time_out=5, local_listen_port=port)

In [14]:
dask_classifier = dask_classifier.fit(data, label, client=client)

Parameter tree_learner not set or set to incorrect value (None), using "data" as default


In [15]:
result = dask_classifier.predict(data)

In [16]:
result.compute()

array([ True, False, False, ..., False, False,  True])
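The tutorial stops at generating predictions. One possible next step, which is not part of the original example, is to compare predictions against the labels. The sketch below shows the arithmetic in plain Python on made-up values; `accuracy` is a hypothetical helper, and the two lists are illustrative rather than taken from the model output above.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Made-up predictions and labels, for illustration only.
predicted = [True, False, False, False, True]
actual = [True, True, False, False, False]
print(accuracy(predicted, actual))  # 0.6
```

Because the model here predicts on the same rows it was trained on, a score computed this way would tend to overstate how well it generalizes; a held-out split would give a fairer estimate.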