<img style="float: right" src="img/saturn.png" width="300" />

# Scaling Machine Learning in Python

## Hyperparameters

The hyperparameter search from [02-single-node.ipynb](02-single-node.ipynb) is an example of a compute-bound workload. The data fits comfortably in the memory of the Jupyter Server, but the grid search still takes some time to execute. Let's take this workflow and parallelize it with Dask!

## Initialize Dask cluster

The `dask_saturn` package makes the Dask Cluster that we created in Saturn Cloud accessible from our notebook. If the cluster had already been created, we would not need to specify any arguments when initializing `SaturnCluster`, but it is a good idea to do so for reproducibility. The arguments to `SaturnCluster` match the fields presented when editing a Dask Cluster in Saturn Cloud.

In [None]:
from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = SaturnCluster(
    scheduler_size='medium',
    worker_size='xlarge',
    n_workers=3,
    nthreads=4,
)
client = Client(cluster)

<br>

To see the options for scheduler and worker sizes, and how they match up to the options presented in Saturn Cloud, run the following:

In [None]:
from dask_saturn.core import describe_sizes
describe_sizes()

<br>

The `Client` object is our "entry point" to Dask. Most Dask operations will automatically detect the client and run across the cluster, but sometimes it's necessary to pass a `client` object explicitly when performing more advanced operations. Previewing the `client` object shows details about the cluster and a link to the Dashboard. Open up the Dashboard now and keep it visible in a separate window. You'll see it light up when we run Dask operations!

In [None]:
client

The following cell will block until all workers are available. You can also view cluster status and access the Dashboard link from the Project page in Saturn Cloud.

In [None]:
client.wait_for_workers(3)
print('Ready to go!')

You can change the number of workers on the running cluster using the `cluster.scale()` method. Note that it will take a few minutes to spin up new workers, but you can use the above `wait_for_workers()` function to block until they're ready:

```python
cluster.scale(10) # more workers!
client.wait_for_workers(10)
```


## Scikit-learn + Joblib

Scikit-learn has some algorithms that support parallel execution via the `n_jobs` parameter. With `n_jobs=-1`, this parallelizes across all cores on a single machine (in this case, our Jupyter Server). Dask provides a [Joblib backend](https://ml.dask.org/joblib.html) that hooks into scikit-learn algorithms to parallelize work across a Dask cluster. This enables us to pull in Dask just for the grid search.
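
To make the mechanism concrete without a cluster, here is a minimal local sketch that uses Joblib's built-in `threading` backend. The `square` function and the choice of two jobs are illustrative assumptions, not part of the original notebook. Scikit-learn hands its `n_jobs` work to Joblib, so whichever backend is active inside the `with` block is generally the one that runs it.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    # Stand-in for an expensive task such as one model fit (hypothetical).
    return x * x


# Inside the context manager, Parallel calls that do not pin a backend
# themselves use the backend selected here.
with parallel_backend("threading", n_jobs=2):
    results = Parallel()(delayed(square)(i) for i in range(6))

# Results come back in the same order as the inputs.
print(results)
```

Swapping `"threading"` for `"dask"` while a `Client` is connected should route the same tasks to the Dask cluster instead.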

In this case, all data loading and processing code is exactly the same as in [02-single-node.ipynb](02-single-node.ipynb) and executes on the Jupyter Server. The only part that executes on the Dask cluster is the grid search itself.

The following cell will take a few seconds to execute, but everything in it is copied from [02-single-node.ipynb](02-single-node.ipynb). 

In [None]:
import s3fs
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter("ignore")

s3 = s3fs.S3FileSystem(anon=True)

taxi = pd.read_csv(
    s3.open(
        's3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv',
        mode='rb',
    ),
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']
)

raw_features = [
    'tpep_pickup_datetime', 
    'passenger_count', 
    'tip_amount', 
    'fare_amount',
]
features = [
    'pickup_weekday', 
    'pickup_weekofyear', 
    'pickup_hour', 
    'pickup_week_hour', 
    'pickup_minute', 
    'passenger_count',
]
label = 'tip_fraction'

def prep_df(taxi_df):
    '''
    Generate features from a raw taxi dataframe.
    '''
    df = taxi_df[taxi_df.fare_amount > 0][raw_features].copy()  # avoid divide-by-zero
    df[label] = df.tip_amount / df.fare_amount
     
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_weekofyear'] = df.tpep_pickup_datetime.dt.isocalendar().week  # .dt.weekofyear was removed in pandas 2.0
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [label]].astype(float).fillna(-1)
    
    return df

taxi_feat = prep_df(taxi)

pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('clf', ElasticNet(max_iter=100, l1_ratio=0)),  # `normalize` (formerly False by default) was removed in scikit-learn 1.2
])

params = {
    'clf__l1_ratio': np.arange(0, 1.1, 0.1),
    'clf__alpha': [0, 0.5, 1, 2],
}

grid_search = GridSearchCV(
    pipeline, 
    params, 
    cv=3, 
    n_jobs=-1,
    verbose=1, 
    scoring='neg_mean_squared_error',
)

Take a sample to match what we did in [02-single-node.ipynb](02-single-node.ipynb).

In [None]:
taxi_sample = taxi_feat.sample(frac=0.1, replace=False)

To execute the grid search in Dask we need to run inside a context manager for a Joblib backend. Besides that, we call the `grid_search.fit()` method the same way as before. When you run this cell, watch the Dask Dashboard to see the progress.

In [None]:
%%time
import joblib

with joblib.parallel_backend('dask'):
    _ = grid_search.fit(
        taxi_sample[features],
        taxi_sample[label],
    )

### Success! 

That executed in about half the time of [02-single-node.ipynb](02-single-node.ipynb), just by initializing a Dask Cluster and adding one line of code. With more nodes in the cluster, it would likely execute even faster.
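
For a rough sense of why more workers help, the sketch below counts the model fits in this grid search and compares that with the number of threads in the cluster. It is an idealized back-of-the-envelope estimate: it assumes each thread runs one fit at a time and that all fits take equally long, and it ignores the final refit of the best model.

```python
from itertools import product

# Values mirror the notebook's grid: 11 l1_ratio values and 4 alpha values.
l1_ratios = [round(0.1 * i, 1) for i in range(11)]
alphas = [0, 0.5, 1, 2]
cv_folds = 3

candidates = list(product(l1_ratios, alphas))
n_fits = len(candidates) * cv_folds  # one fit per candidate per fold

# Cluster shape from the SaturnCluster call above: 3 workers x 4 threads.
n_workers, threads_per_worker = 3, 4
slots = n_workers * threads_per_worker

# Ceiling division: rounds of work if every fit took equally long (idealized).
rounds = -(-n_fits // slots)

print(f"{len(candidates)} candidates, {n_fits} fits, {slots} slots, ~{rounds} rounds")
```

Under these assumptions, doubling the number of workers roughly halves the number of rounds, until the cluster has more threads than there are fits.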

> **Note**: Using the Dask Joblib backend requires sending the DataFrame through the scheduler to all the workers. This can cause problems with DataFrames much larger than the one used in this example.

Dask + Joblib is useful for small data scenarios. Our next example will work with DataFrames of any size!
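
One way to judge whether a DataFrame is small enough for this approach is to check its in-memory size before fitting. The following is a minimal sketch on a synthetic frame; the column names and row count are placeholders rather than the taxi data, and no particular size threshold is implied.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for taxi_sample: 1,000 rows x 7 float64 columns
# (6 features plus the label). Sizes here are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1_000, 7)), columns=[f"col_{i}" for i in range(7)])

# Bytes held by the column data, excluding the index.
n_bytes = int(df.memory_usage(index=False, deep=True).sum())
print(f"{n_bytes / 1e6:.3f} MB")
```

Running the same `memory_usage` call on `taxi_sample` gives the amount of data that has to travel to each worker.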

## Dask ML

Dask ML has its own parallel implementations of some scikit-learn algorithms, including `GridSearchCV` and [other hyperparameter search options](https://ml.dask.org/hyper-parameter-search.html). To use it, we convert our pandas DataFrame to a Dask DataFrame and use Dask ML's preprocessing and model selection classes. Don't worry if you don't know all the details of a Dask DataFrame; we'll cover that in the next notebook. For now, enjoy the speedups with Dask!

In [None]:
import dask.dataframe as dd
import numpy as np

taxi_sample_dd = (
    dd.from_pandas(taxi_sample, npartitions=10)
    .persist()
)

Notice how the following code looks almost identical to the scikit-learn version, and even still uses scikit-learn's `Pipeline` and `ElasticNet` classes.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet

from dask_ml.preprocessing import StandardScaler
from dask_ml.model_selection import GridSearchCV

pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('clf', ElasticNet(max_iter=100, l1_ratio=0)),  # `normalize` (formerly False by default) was removed in scikit-learn 1.2
])

params = {
    'clf__l1_ratio': np.arange(0, 1.1, 0.1),
    'clf__alpha': [0, 0.5, 1, 2],
}

grid_search = GridSearchCV(
    pipeline, 
    params, 
    cv=3, 
    scoring='neg_mean_squared_error',
)

### Exercise

Run the grid search using the `grid_search` object defined above. (Hint: it works the same way as scikit-learn's `GridSearchCV` class).

In [None]:
%%time

<FILL IN>

In [None]:
%%time
_ = grid_search.fit(
    taxi_sample_dd[features], 
    taxi_sample_dd[label],
)

--- 
### Super fast!

This ran even faster than the Joblib example, because Dask was able to parallelize all steps of the pipeline. Dask ML's `GridSearchCV` class also [avoids repeated work](https://ml.dask.org/hyper-parameter-search.html#avoid-repeated-work) to make the grid search faster. 
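
To see where the savings from avoiding repeated work could come from, the sketch below counts scaler fits for this grid under two idealized strategies. It is a counting exercise under the assumption that identical pipeline steps on identical inputs are fitted once; it is not Dask ML's actual implementation.

```python
from itertools import product

# Values mirror the notebook's grid: 11 l1_ratio values and 4 alpha values.
l1_ratios = [round(0.1 * i, 1) for i in range(11)]
alphas = [0, 0.5, 1, 2]
folds = range(3)

tasks = list(product(l1_ratios, alphas, folds))

# Naive search: every (candidate, fold) pair refits the whole pipeline,
# including the scaler.
naive_scaler_fits = len(tasks)

# Shared search: a step is identified by its own parameters and its input
# data. The scaler has no searched parameters, so only the fold matters.
shared_scaler_fits = len({("scale", fold) for _, _, fold in tasks})

print(naive_scaler_fits, shared_scaler_fits)
```

The `ElasticNet` step still has to be fitted once per candidate and fold in both cases, so the benefit grows with the cost of the shared preprocessing steps.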

Jump over to [04-large-dataset.ipynb](04-large-dataset.ipynb) for a deeper dive into Dask and to see how we can do batch inference and model training with large amounts of data.