# Getting up to speed with Dask

## Part 3: Scale up!

We will do the same analysis as in Parts 1 and 2, but now on a Dask cluster!

AWS EC2 instance types
- (notebook): r5.xlarge (2 CPU, 16GB RAM)
- (10 workers): r5.2xlarge (8 CPU, 64GB RAM)


We are running in [Saturn Cloud](https://www.saturncloud.io/), so we use a `SaturnCluster`, but Dask supports many other cluster deployment tools, such as [YARN](https://yarn.dask.org/en/latest/) or [Kubernetes](https://docs.dask.org/en/latest/setup/kubernetes.html).
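If you are not on Saturn Cloud, the same `Client` interface works against other cluster types. As a minimal sketch, assuming only that `dask.distributed` is installed, a `LocalCluster` on a single machine can stand in for `SaturnCluster`. The worker counts below are illustrative, not the settings used in this notebook.

```python
from dask.distributed import Client, LocalCluster

# Illustrative stand-in for SaturnCluster: two single-threaded workers
# running inside the current process (no extra machines involved).
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,         # keep workers in-process for a lightweight demo
    dashboard_address=None,  # skip the diagnostics dashboard
)
client = Client(cluster)

# Any work submitted through the client runs on the cluster's workers.
total = client.submit(sum, [1, 2, 3]).result()
n_workers = len(cluster.workers)

client.close()
cluster.close()
```

The rest of the notebook only talks to `client`, so swapping the cluster type should not require changes to the analysis code itself.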

In [17]:
from dask.distributed import Client
from dask_saturn import SaturnCluster

cluster = SaturnCluster(n_workers=10, worker_size='2xlarge', scheduler_size='xlarge')
client = Client(cluster)
client

Client
- Scheduler: tcp://d-aaron-getting-up-to-speed-wi-31ea6981b2f849c18e9b508d8d4cd002.main-namespace:8786
- Dashboard: https://d-aaron-getting-up-to-speed-wi-31ea6981b2f849c18e9b508d8d4cd002.demo.saturnenterprise.io

Cluster
- Workers: 10
- Cores: 80
- Memory: 635.00 GB


In [2]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import datetime
import s3fs

seed = 42

# Load and explore data

The worker nodes are separate machines, so they do not have the same `data` folder as the Jupyter server. Copying the same data to every node would be expensive, so instead each worker pulls its share directly from S3.

In [3]:
taxi_dtypes = {
    'store_and_fwd_flag': str,
    'RatecodeID': 'float64',
    'VendorID': 'float64',
    'passenger_count': 'float64',
    'payment_type': 'float64',
}
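The integer-like columns are presumably read as `float64` because they contain missing values, which a plain NumPy integer column cannot hold. A small pandas sketch of the issue, using made-up rows rather than the taxi data:

```python
import io

import pandas as pd

# Made-up CSV with one missing VendorID (illustrative data, not from the taxi files).
csv_text = "VendorID,fare_amount\n1,7.0\n,11.0\n2,5.5\n"

# Reading the column as float64 works: the gap becomes NaN.
df = pd.read_csv(io.StringIO(csv_text), dtype={"VendorID": "float64"})
n_missing = int(df["VendorID"].isna().sum())

# Forcing int64 fails, because NaN has no integer representation.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"VendorID": "int64"})
    int_read_failed = False
except ValueError:
    int_read_failed = True
```

Dask infers dtypes from a sample at the start of the data, so stating them up front also guards against a later file disagreeing with that sample.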

In [4]:
%%time

taxi = dd.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    dtype=taxi_dtypes, 
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
)

CPU times: user 143 ms, sys: 21.9 ms, total: 165 ms
Wall time: 293 ms


In [5]:
%%time
len(taxi)

CPU times: user 51.4 ms, sys: 5.35 ms, total: 56.8 ms
Wall time: 16.6 s


84399019

In [6]:
%%time
taxi.memory_usage(deep=True).sum().compute() / 1e9

CPU times: user 73.4 ms, sys: 4.48 ms, total: 77.8 ms
Wall time: 15.4 s


16.367014316
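To put the two numbers above in context, a quick back-of-the-envelope calculation gives the average in-memory size per row and the share of the cluster's memory the dataset would occupy. The inputs are copied from the outputs above; treat the results as rough estimates.

```python
# Values copied from the notebook outputs above.
n_rows = 84_399_019
dataset_gb = 16.367014316   # memory_usage(deep=True) total, in GB (bytes / 1e9)
cluster_gb = 635.0          # total cluster memory reported by the Client

bytes_per_row = dataset_gb * 1e9 / n_rows
share_of_cluster = dataset_gb / cluster_gb

print(f"{bytes_per_row:.0f} bytes per row")
print(f"{share_of_cluster:.1%} of cluster memory")
```

The raw data therefore takes up only a small fraction of the cluster's memory, although any columns added later will increase the footprint somewhat.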

In [7]:
%%time
np.round(taxi.describe().compute(), 3).T

CPU times: user 3.58 s, sys: 61.4 ms, total: 3.64 s
Wall time: 26.8 s


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| VendorID | 84152418.0 | 1.645 | 0.498 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| passenger_count | 84152418.0 | 1.563 | 1.208 | 0.0 | 1.0 | 1.0 | 2.0 | 9.0 |
| trip_distance | 84399019.0 | 3.001 | 8.091 | -37264.53 | 1.07 | 1.93 | 8.82 | 45977.22 |
| RatecodeID | 84152418.0 | 1.061 | 0.76 | 1.0 | 1.0 | 1.0 | 1.0 | 99.0 |
| PULocationID | 84399019.0 | 163.158 | 66.016 | 1.0 | 132.0 | 162.0 | 234.0 | 265.0 |
| DOLocationID | 84399019.0 | 161.353 | 70.251 | 1.0 | 116.0 | 163.0 | 236.0 | 265.0 |
| payment_type | 84152418.0 | 1.289 | 0.479 | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
| fare_amount | 84399019.0 | 13.344 | 174.375 | -1856.0 | 7.0 | 11.0 | 32.04 | 943274.8 |
| extra | 84399019.0 | 1.087 | 1.249 | -60.0 | 0.0 | 1.0 | 3.0 | 535.38 |
| mta_tax | 84399019.0 | 0.495 | 0.067 | -0.5 | 0.5 | 0.5 | 0.5 | 212.42 |


# Feature engineering

In [8]:
def make_features(df):
    """ Same code from Part 1 """
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_weekofyear'] = df.tpep_pickup_datetime.dt.isocalendar().week.astype(int)
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    # note: .dt.seconds is the seconds component within the day (0-86399), not total elapsed seconds
    df['pickup_year_seconds'] = (df.tpep_pickup_datetime - datetime.datetime(2019, 1, 1, 0, 0, 0)).dt.seconds
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['store_and_fwd_flag'] = (df.store_and_fwd_flag == 'Y').astype(int)
    df['VendorID'] = df.VendorID.fillna(-1)
    df['RatecodeID'] = df.RatecodeID.fillna(-1)

In [9]:
%%time

make_features(taxi)

CPU times: user 53.4 ms, sys: 31 µs, total: 53.4 ms
Wall time: 53.4 ms
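One detail in `make_features` is worth knowing: `.dt.seconds` returns only the seconds component of each timedelta (0 to 86399), not the total elapsed seconds, so `pickup_year_seconds` behaves like seconds since midnight. The pandas sketch below contrasts it with `.dt.total_seconds()`; the two timestamps are illustrative values.

```python
import datetime

import pandas as pd

pickups = pd.Series(pd.to_datetime(["2019-01-01 00:46:40", "2018-12-21 13:48:30"]))
elapsed = pickups - datetime.datetime(2019, 1, 1, 0, 0, 0)

# Seconds component only: always between 0 and 86399.
seconds_component = elapsed.dt.seconds.tolist()

# Total elapsed seconds, negative for pickups before 2019-01-01.
total_seconds = elapsed.dt.total_seconds().tolist()
```

If a true seconds-since-start-of-year feature is wanted, `.dt.total_seconds()` would be the call to use. The outputs in this notebook were produced with `.dt.seconds`.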


In [10]:
%%time

taxi.head()

CPU times: user 15 ms, sys: 23 µs, total: 15 ms
Wall time: 2.61 s


| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | ... | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_minute | pickup_year_seconds | pickup_week_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1.0 | 1.5 | 1.0 | 0 | 151 | 239 | 1.0 | ... | 0.0 | 0.3 | 9.95 | | 1 | 1 | 0 | 46 | 2800 | 24 |
| 1 | 1.0 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1.0 | 2.6 | 1.0 | 0 | 239 | 246 | 1.0 | ... | 0.0 | 0.3 | 16.3 | | 1 | 1 | 0 | 59 | 3587 | 24 |
| 2 | 2.0 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3.0 | 0.0 | 1.0 | 0 | 236 | 236 | 1.0 | ... | 0.0 | 0.3 | 5.8 | | 4 | 51 | 13 | 48 | 49710 | 109 |
| 3 | 2.0 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5.0 | 0.0 | 1.0 | 0 | 193 | 193 | 2.0 | ... | 0.0 | 0.3 | 7.55 | | 2 | 48 | 15 | 52 | 57145 | 63 |
| 4 | 2.0 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5.0 | 0.0 | 2.0 | 0 | 193 | 193 | 2.0 | ... | 0.0 | 0.3 | 55.55 | | 2 | 48 | 15 | 56 | 57417 | 63 |


<br>

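If the cluster has enough RAM, you can call `persist()` on the DataFrame to keep it in worker memory and avoid re-reading the CSV files on every computation. `persist()` returns immediately with a new Dask DataFrame backed by [futures](https://docs.dask.org/en/latest/futures.html), and the computation continues in the background until it is complete.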

In [11]:
taxi = taxi.persist()

You can call `wait()` to block until the `persist()` has finished.

In [12]:
%%time

from dask.distributed import wait
_ = wait(taxi)

CPU times: user 59.9 ms, sys: 3.46 ms, total: 63.4 ms
Wall time: 16.2 s


In [13]:
%%time
len(taxi)

CPU times: user 85 ms, sys: 4.23 ms, total: 89.3 ms
Wall time: 190 ms


84399019

In [14]:
%%time
np.round(taxi.describe().compute(), 3).T

CPU times: user 4.95 s, sys: 79.1 ms, total: 5.03 s
Wall time: 16.3 s


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| VendorID | 84399019.0 | 1.638 | 0.517 | -1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| passenger_count | 84152418.0 | 1.563 | 1.208 | 0.0 | 1.0 | 1.0 | 2.0 | 9.0 |
| trip_distance | 84399019.0 | 3.001 | 8.091 | -37264.53 | 1.07 | 1.93 | 8.82 | 45977.22 |
| RatecodeID | 84399019.0 | 1.055 | 0.767 | -1.0 | 1.0 | 1.0 | 1.0 | 99.0 |
| store_and_fwd_flag | 84399019.0 | 0.008 | 0.09 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| PULocationID | 84399019.0 | 163.158 | 66.016 | 1.0 | 132.0 | 162.0 | 234.0 | 265.0 |
| DOLocationID | 84399019.0 | 161.353 | 70.251 | 1.0 | 116.0 | 163.0 | 236.0 | 265.0 |
| payment_type | 84152418.0 | 1.289 | 0.479 | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
| fare_amount | 84399019.0 | 13.344 | 174.375 | -1856.0 | 7.0 | 11.0 | 32.04 | 943274.8 |
| extra | 84399019.0 | 1.087 | 1.249 | -60.0 | 0.0 | 1.0 | 3.0 | 535.38 |
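Compared with the first `describe()` output, the `VendorID` and `RatecodeID` counts now equal the full row count and their minimums are -1. That is what `fillna(-1)` in `make_features` would produce: `count` skips missing values, so filling them raises the count. A small pandas sketch with made-up values:

```python
import numpy as np
import pandas as pd

vendor = pd.Series([1.0, 2.0, np.nan, 2.0])  # made-up values with one gap

count_before = int(vendor.count())           # NaN is excluded from count
filled = vendor.fillna(-1)
count_after = int(filled.count())
min_after = filled.min()

# The difference between the two counts in the notebook outputs
# is the number of rows that had a missing VendorID.
n_missing_vendor = 84_399_019 - 84_152_418
```

`passenger_count` and `payment_type` are not filled in `make_features`, so their counts are unchanged between the two outputs.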


# Machine learning

In [15]:
# same as Part 1
numeric_feat = [
    'pickup_weekday', 
    'pickup_weekofyear', 
    'pickup_hour', 
    'pickup_minute', 
    'pickup_year_seconds',
    'pickup_week_hour', 
    'passenger_count',
]
categorical_feat = [
    'VendorID', 
    'RatecodeID', 
    'store_and_fwd_flag',
    'PULocationID',
    'DOLocationID',
]
features = numeric_feat + categorical_feat
y_col = 'total_amount'

In [18]:
# note the dask_ml imports rather than sklearn
from dask_ml.model_selection import train_test_split
from dask_ml.metrics import mean_squared_error
from xgboost.dask import DaskXGBRegressor

In [19]:
%%time

X_train, X_test, y_train, y_test = train_test_split(
    taxi[features], taxi[y_col], test_size=0.33, random_state=seed, shuffle=True)

CPU times: user 8.96 ms, sys: 39 µs, total: 9 ms
Wall time: 8.67 ms


In [20]:
X_train = X_train.persist()
y_train = y_train.persist()
# block until both persisted collections are in memory
_ = wait([X_train, y_train])

In [21]:
xgb = DaskXGBRegressor(
    n_estimators=10, 
    max_depth=3, 
    learning_rate=0.1, 
    random_state=seed, 
)

In [22]:
%%time

_ = xgb.fit(X_train, y_train)

CPU times: user 149 ms, sys: 164 µs, total: 149 ms
Wall time: 27.8 s


In [24]:
%%time

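# get test RMSE (squared=False returns the root of the MSE)
preds = xgb.predict(X_test)
y_test_arr = y_test.to_dask_array(lengths=True)
# arguments are (y_true, y_pred); the value is the same either way
mean_squared_error(y_test_arr, preds, squared=False)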

CPU times: user 399 ms, sys: 154 ms, total: 553 ms
Wall time: 4.52 s


145.07297131491882
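With `squared=False`, `mean_squared_error` returns the root mean squared error (RMSE), which is in the same units as `total_amount`. A minimal NumPy sketch of the calculation on made-up numbers:

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])  # made-up targets
y_pred = np.array([12.0, 18.0, 33.0])  # made-up predictions

# RMSE: square root of the mean of squared errors.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Squaring makes the metric symmetric in its two arguments.
rmse_swapped = np.sqrt(np.mean((y_pred - y_true) ** 2))
```

Because errors are squared, a few extreme values can dominate the result. The `describe()` output above shows `fare_amount` reaching 943274.8, so outliers of that kind may account for much of the RMSE reported here.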