# Machine Learning with Dask

Seattle has seen an increase in pet registrations over the past few years, and Cascadia City wants to plan registration, animal control, dog parks, and other facilities in line with anticipated growth.

We'll analyze a small set of Seattle pet data to try to identify some trends.

<img src='images/dog-park.jpg' width=600>

Let's spin up a small cluster and dive in.

In [None]:
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(name="training-cluster")
client = Client(cluster)
client

## Dask ML Overview

__The general goal of Dask-style ML is to reproduce, as closely as possible, the Pandas and scikit-learn style workflows that data scientists are used to...__

```python
import dask.dataframe as dd
df = dd.read_parquet('...')
data = df[['age', 'income', 'married']]
labels = df['outcome']

from dask_ml.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(data, labels)
```

__... and to integrate smoothly with other ML tools, all while making the necessary changes to support out-of-core (larger than local memory) as well as fully distributed training and scoring use cases.__

In some cases, this is a straightforward mission, while for others -- parallelizing algorithms that are traditionally sequential -- it can be trickier.

<img src='images/ml.png' width=600>
                            
Dask's solutions break down into several high-level categories:

### Provide Dask ML scalable algorithms

Full-scale, out-of-core, distributed processing for several types of tasks, including
* Feature engineering (pre-processing)
* Linear models / GLMs
* Clustering

### Out-of-core non-parallel training with scikit-learn

For modest datasets that don't require parallelization or a big cluster but can't fit in memory, Dask provides an `Incremental` wrapper that trains any scikit-learn estimator supporting `partial_fit`, such as Naive Bayes, one block at a time.
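Here is a minimal sketch of the idea behind block-wise training, using only scikit-learn and NumPy on synthetic data. The `make_block` helper, block sizes, and labels are assumptions for illustration; with Dask-ML, `dask_ml.wrappers.Incremental` would drive a similar `partial_fit` loop over the blocks of a Dask array.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
classes = np.array([0, 1])  # partial_fit needs every class label up front


def make_block(n_rows):
    """Hypothetical stand-in for reading one chunk of a larger-than-memory dataset."""
    y = rng.integers(0, 2, size=n_rows)
    # Two features whose mean depends on the class, so the task is learnable
    X = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(n_rows, 2))
    return X, y


model = GaussianNB()
for _ in range(5):
    X_block, y_block = make_block(1_000)
    # Each call updates the model using only the current block
    model.partial_fit(X_block, y_block, classes=classes)

X_new, y_new = make_block(1_000)
accuracy = model.score(X_new, y_new)
print(f"accuracy on a fresh block: {accuracy:.3f}")
```

Only one block is held in memory at a time, which is the property that makes this approach work for out-of-core data.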


### Parallelize scikit-learn via joblib

Scikit-Learn provides limited parallel computing on a single machine via `joblib`. 

Dask provides a drop-in implementation that extends joblib to many machines in a cluster. This approach works seamlessly any place that scikit-learn supports joblib-based parallelism, but often requires data to fit in memory on each node.

Examples include:
* Parallelizing random forests
* Hyperparameter optimization
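As a rough sketch of the joblib pattern, here is a small grid search run under an explicit joblib backend. The dataset and parameter grid are made up for illustration, and the example uses joblib's local `"threading"` backend so it runs without a cluster; with a `dask.distributed` `Client` active, the backend name would be `"dask"` instead.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (an assumption; any in-memory data works)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=3,
    n_jobs=-1,  # let joblib schedule the fits in parallel
)

# "threading" is a local joblib backend. With a dask.distributed Client
# running, passing "dask" here would send the same tasks to the cluster.
with joblib.parallel_backend("threading"):
    search.fit(X, y)

print(search.best_params_)
```

The scikit-learn code itself does not change between backends; only the context manager does.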

### Parallel prediction/scoring only: small models, big data

Many useful scikit-learn models are small in size, but we want to apply them to large datasets for scoring. Since prediction is usually an embarrassingly parallel task, Dask provides a wrapper that scales `model.predict` to large data and large clusters.

### Partner with other distributed libraries

Some machine learning libraries like XGBoost already have distributed solutions that work well. Dask-ML makes no attempt to re-implement these systems. Instead, Dask-ML makes it easy to use normal Dask workflows to prepare and set up data, then it deploys XGBoost automatically *alongside* Dask, and hands the data over. The result is a smooth API using best-of-breed implementations.

## Dask ML and total pet registrations

We'll start off by trying a simple model to measure the growth in total pet licenses.

In [None]:
import dask.dataframe as ddf

pets = ddf.read_csv('s3://coiled-training/data/pets.csv', parse_dates=["License Issue Date"], 
                    blocksize=1e6, 
                    dtype={'License Number': 'object',
                           'ZIP Code': 'object'},
                    storage_options={"anon": True})
pets

In [None]:
pets.head()

In [None]:
pets.describe().compute()

It looks like if we drop the "Secondary Breed" column, we'll have a lot more complete cases.

Let's do that, drop "License Number" as well, filter out incomplete records, and simplify the column names a bit.

In [None]:
pets = pets.drop(columns=['Secondary Breed', 'License Number' ]).dropna()

pets = pets.rename(columns={'License Issue Date':'license_date','Animal\'s Name':'name',
                            'Species':'species', 'Primary Breed':'breed', 'ZIP Code':'zip'})

pets.head()

It might be handy to have the day represented as an integer.

In [None]:
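import pandas as pd

# meta tells Dask the output name and dtype up front, avoiding an inference warning
pets['day'] = pets['license_date'].apply(pd.Timestamp.toordinal, meta=('day', 'int64'))
pets.head()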
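For reference, here is a small pandas-only sketch of what `toordinal` produces. The dates are arbitrary examples, not values from the pets dataset.

```python
import datetime as dt
import pandas as pd

stamp = pd.Timestamp("2020-01-01")
day = stamp.toordinal()  # proleptic Gregorian ordinal: 0001-01-01 is day 1

# The mapping is reversible, and consecutive dates differ by exactly 1
round_trip = pd.Timestamp.fromordinal(day)
next_day = pd.Timestamp("2020-01-02").toordinal()

print(day, round_trip, next_day - day)
```

Because consecutive dates map to consecutive integers, the ordinal works as a simple numeric time axis for regression.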

Our first goal is just to count the registrations over time.

In [None]:
pet_counts = pets['day'].value_counts()
pet_counts.head()

In this case, the dataset isn't really big ... but let's imagine it might be, and downsample to inspect the data.

In [None]:
pet_counts_local_sample = pet_counts.sample(frac=0.5).compute()
pet_counts_local_sample

In [None]:
local_array = pet_counts_local_sample.reset_index()
local_array

In [None]:
local_array.columns = ['day', 'registrations']

local_array.plot('day', 'registrations', kind='scatter')

Not very linear! Maybe we should take the log.

In [None]:
import numpy as np
from matplotlib import pyplot as plt

plt.scatter(local_array.day, np.log(local_array.registrations))

This still doesn't look great, but it's a little more promising.

Remember this is the local, downsampled data.
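The reason a log transform is worth trying: if counts grow roughly exponentially, their log grows linearly. A minimal sketch on synthetic data (the growth rate, scale, and noise level are assumptions, not estimates from the pets data):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 200)

# Hypothetical exponential growth, counts ~ a * exp(b * t), with multiplicative noise
a, b = 3.0, 0.8
counts = a * np.exp(b * t) * rng.lognormal(mean=0.0, sigma=0.1, size=t.size)

# log(counts) = log(a) + b * t + noise, so a straight-line fit recovers b and log(a)
slope, intercept = np.polyfit(t, np.log(counts), deg=1)
print(f"slope ~ {slope:.2f}, exp(intercept) ~ {np.exp(intercept):.2f}")
```

Real registration counts are unlikely to be this clean, which is consistent with the scatter plot above only looking "a little more promising."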

So our game plan for Dask is to:
* apply these transforms on the full dataset
* perform a train/test split
* model
* evaluate

In [None]:
pet_counts = pet_counts.reset_index()
pet_counts.columns = ['day', 'registrations']
pet_counts

In [None]:
import dask.array as da
pet_counts['log_registrations'] = da.log(pet_counts.registrations)
pet_counts.head()

Next, we'll convert to Dask arrays and compute the chunk sizes (otherwise Dask may not know how many records are in each chunk, since reading a dataframe or series is a lazy operation).

In [None]:
predictor = pet_counts['day'].to_dask_array(lengths=True)
predictor

In [None]:
response = pet_counts['log_registrations'].to_dask_array(lengths=True)
response

Now we'll prepare our training/test split.

Notice:
* we import `train_test_split` from `dask_ml.model_selection`
    * it works like the scikit-learn version but supports parallel operation on large, distributed data
* since we only have 1 predictor, but the function expects a 2-axis array (feature matrix), we reshape
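The reshape step is easiest to see with a tiny NumPy array (the values are arbitrary ordinals); the Dask array call in the next cell uses the same `reshape(-1, 1)` form.

```python
import numpy as np

day = np.array([737425, 737426, 737427, 737428])  # one predictor, shape (4,)

# -1 asks NumPy to infer the row count; 1 fixes a single feature column
X = day.reshape(-1, 1)

print(day.shape, X.shape)
```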

In [None]:
from dask_ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(predictor.reshape(-1, 1), response, test_size=0.1)

X_train

Now we'll fit a model. 

* The syntax pattern (instantiate, configure, fit) looks a lot like scikit-learn, 
* but we're importing from `dask_ml.linear_model` and we're configuring a distributed, iterative solver.

This one is a second-order (quasi-Newton) solver that's popular for modest numbers of predictors (narrow matrix).

In [None]:
from dask_ml.linear_model import LinearRegression

lr = LinearRegression(solver='lbfgs', max_iter=5)
lr_model = lr.fit(X_train, y_train)

Our `model.predict` step is distributed as well.

In [None]:
y_predicted = lr_model.predict(X_test)

y_predicted

And our measurement step -- here we're looking at RMSE on log scale -- looks like scikit-learn but comes from the parallel implementation in `dask_ml.metrics`.

In [None]:
from dask_ml.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(y_test, y_predicted))
rmse

In [None]:
np.exp(rmse)

In [None]:
np.exp(y_train.std().compute())

Probably not a stellar result for the city! 

* We're off by around a factor of 2.5 (vs. a mean-baseline that would be off by 3.8x)

But the goal here was really to explore the API and method -- luckily.
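To see why exponentiating a log-scale RMSE can be read as a multiplicative "off by a factor of" number, here is a small NumPy sketch with made-up values in which every prediction is wrong by exactly 2x. It also shows that the RMSE of a predict-the-mean baseline equals the standard deviation of the target, which is why we compared against `y_train.std()`.

```python
import numpy as np

actual = np.array([10.0, 40.0, 25.0, 80.0])
predicted = actual * np.array([2.0, 0.5, 2.0, 0.5])  # every prediction off by exactly 2x

log_errors = np.log(predicted) - np.log(actual)  # each is +log(2) or -log(2)
rmse_log = np.sqrt(np.mean(log_errors ** 2))
factor = np.exp(rmse_log)

# Baseline: always predicting the mean of log(actual) has RMSE equal to its standard deviation
baseline_rmse = np.sqrt(np.mean((np.log(actual) - np.log(actual).mean()) ** 2))

print(factor, np.exp(baseline_rmse))
```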

*Bonus activity idea: Poisson regression*
* Since we're modeling count data over time in this task, Poisson regression might be a better fit.
* Trying that out with Dask's GLMs is a very small change.
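As a rough illustration of what a Poisson fit looks like, here is a sketch on synthetic counts. It uses scikit-learn's `PoissonRegressor` as a local, in-memory stand-in so it runs without a cluster; Dask-ML's `dask_ml.linear_model` also offers a `PoissonRegression` estimator, which is not exercised here. The rescaled time axis and the true coefficients are assumptions.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=2_000)  # rescaled time, an assumption
true_intercept, true_slope = 1.0, 2.0
counts = rng.poisson(np.exp(true_intercept + true_slope * t))

# alpha=0 turns off the default L2 penalty so the fit is a plain Poisson GLM
model = PoissonRegressor(alpha=0.0)
model.fit(t.reshape(-1, 1), counts)

print(model.intercept_, model.coef_[0])
```

Unlike the log-then-linear approach, a Poisson model works on the raw counts, so days with few registrations don't need special handling.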

## Dask, scikit, joblib, and counting cats

<img src='images/cat.jpg' width=600>

Pet trends seem to come and go in different Seattle neighborhoods, so it seems reasonable and useful to try to forecast registration of dogs and cats by zip code. 

For this task, we'll look at two more Dask features:
* Categorical data for the zip code feature and pet species
* Scaling a random forest with Dask and joblib

In [None]:
pets.species.value_counts().compute()

In [None]:
cats_and_dogs = pets[(pets['species'] == 'Dog') | (pets['species'] == 'Cat')]
cats_and_dogs = cats_and_dogs[['day', 'zip' , 'species']]
cats_and_dogs

We could use scikit-learn's `Pipeline` class and `make_pipeline` helper with Dask components.

In [None]:
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, LabelEncoder, DummyEncoder

pipe = make_pipeline(
    Categorizer(),
    DummyEncoder()
)

pipe.fit_transform(cats_and_dogs)

That works ... and the code looks nice, simple and familiar. But it leads us toward an unnecessary high-dimensionality problem, especially since we're going to be using a forest algorithm. 
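The dimensionality difference is easy to see on a toy column in plain pandas (the zip codes here are hypothetical, and the real column has many more distinct values):

```python
import pandas as pd

# Hypothetical zip codes; the real column has many more distinct values
df = pd.DataFrame({"zip": ["98101", "98103", "98115", "98103", "98122"]})
df["zip"] = df["zip"].astype("category")

one_hot = pd.get_dummies(df["zip"])  # one column per distinct zip code
codes = df["zip"].cat.codes          # a single integer column

print(one_hot.shape, codes.tolist())
```

One-hot encoding adds a column per distinct value, while integer codes keep the feature in one column that tree-based models can still split on.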

Let's label-encode the categorical columns instead of one-hot encoding them. We can use a `LabelEncoder` for the zip predictor as well as the categorical (species) response.

Note that
* We need a parallel-aware LabelEncoder, because Dask needs to identify the union of all values for that field, and then assign them uniform labeling throughout the dataset
* In fact, the same requirement applied to the one-hot encoding above (uniform label map prior to generating the one-hot columns)
* `LabelEncoder` supports a slightly non-standard API within scikit-learn, making it hard to use with a `Pipeline` in Dask (or scikit-learn)
    * for more standard transformations, Dask *does* have a version of scikit-learn's `ColumnTransformer`
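The need for a uniform label map can be sketched with two pandas Series standing in for partitions (the values are hypothetical). Encoding each partition on its own assigns different codes to the same value; building the category list from the union fixes that.

```python
import pandas as pd

# Two hypothetical partitions of the same column, each seeing different values
part_a = pd.Series(["Dog", "Cat", "Dog"])
part_b = pd.Series(["Goat", "Dog"])

# Encoding each partition on its own gives conflicting codes for "Dog"
local_a = part_a.astype("category").cat.codes.tolist()
local_b = part_b.astype("category").cat.codes.tolist()

# Taking the union of values first gives one label map for every partition
categories = sorted(set(part_a) | set(part_b))
dtype = pd.CategoricalDtype(categories)
global_a = part_a.astype(dtype).cat.codes.tolist()
global_b = part_b.astype(dtype).cat.codes.tolist()

print(local_a, local_b)
print(global_a, global_b)
```

In the per-partition encoding, "Dog" is 1 in the first partition and 0 in the second; with the shared categories it is 1 everywhere.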

In [None]:
cats_and_dogs = cats_and_dogs.categorize()
cats_and_dogs

In [None]:
cats_and_dogs['zip'] = LabelEncoder().fit_transform(cats_and_dogs['zip'])
cats_and_dogs.head()

In [None]:
cats_and_dogs['species'] = LabelEncoder().fit_transform(cats_and_dogs['species'])
cats_and_dogs.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(cats_and_dogs[['day', 'zip']], 
                                                    cats_and_dogs.species, test_size=0.1)

X_train

In [None]:
X_train.head()

In [None]:
from sklearn.ensemble import RandomForestClassifier
import joblib

rf = RandomForestClassifier()

with joblib.parallel_backend('dask'):
    rf.fit(X_train, y_train)

We have a lot of options for measuring our performance.

In the case where we have small data and can train a sklearn model locally (or load a model trained elsewhere), we can use Dask to parallelize post-fit operations like `transform`, `predict`, and `predict_proba`.

Dask's `ParallelPostFit` wrapper/meta-estimator can make predictions using parallel tasks for *any* sklearn estimator because, under the hood, it's basically just doing a `map_partitions` or `map_blocks` with the relevant function.
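A minimal sketch of that block-wise idea, calling `map_blocks` directly on a Dask array with a small scikit-learn model. The synthetic data, chunk size, and model settings are assumptions, and this is an illustration of the mechanism rather than `ParallelPostFit`'s actual implementation.

```python
import numpy as np
import dask.array as da
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small in-memory model trained on synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=1_000, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Split the rows into 4 blocks; each block can be scored independently
X_blocks = da.from_array(X, chunks=(250, X.shape[1]))

# predict maps a (rows, features) block to a (rows,) block, so the feature axis is dropped
lazy_predictions = X_blocks.map_blocks(
    model.predict,
    dtype=y.dtype,
    drop_axis=1,
    meta=np.array((), dtype=y.dtype),  # describe the output so Dask need not infer it
)

predictions = lazy_predictions.compute()
print(predictions.shape)
```

Each task carries a reference to the fitted model, which is why the model's serialized size matters once the tasks run on remote workers.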

However, before we launch this, let's think a moment about what's going on.
* We've got a local scikit-learn model that may be large
* We want to use it to score records on remote workers
* That means the model needs to get to all of those workers

If we have high bandwidth between our nodes *and* we get a really valuable speedup from parallel prediction at scale, then this makes sense. But keep in mind it may not always pay off.

Consider the size of our random forest:

In [None]:
import pickle
import sys

p = pickle.dumps(rf)
print(sys.getsizeof(p))

If we don't mind shipping this around, we are ready to go!

In [None]:
from dask_ml.wrappers import ParallelPostFit

parallel_predicting_scorer = ParallelPostFit(estimator=rf, scoring='accuracy')

predictions = parallel_predicting_scorer.predict(X_test)

predictions

In [None]:
from dask_ml.metrics import accuracy_score

accuracy_score(y_test, predictions)

That's no better than always guessing "dog", since dogs outnumber cats in our dataset 2:1 ... The mayor is just about ready to show us the door and invite some better data scientists to join the team.
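To make that baseline concrete, here is a tiny NumPy sketch with hypothetical labels at the same 2:1 ratio. A "model" that ignores its inputs and always predicts the majority class already scores about two thirds.

```python
import numpy as np

# Hypothetical labels with the same 2:1 imbalance (1 = dog, 0 = cat)
y_true = np.array([1] * 200 + [0] * 100)

# A "model" that ignores its inputs and always predicts the majority class
majority = np.bincount(y_true).argmax()
y_pred = np.full_like(y_true, majority)

baseline_accuracy = (y_pred == y_true).mean()
print(baseline_accuracy)
```

Any accuracy figure on imbalanced data is worth comparing against this kind of baseline before drawing conclusions.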

Luckily, we have a job waiting for us on the engineering and ops side.

But before we head out for the day, let's try a lab exercise.

## Lab: Dask + XGBoost

### Activity 1: Train the cat-dog data with XGBoost

As of 2020, XGBoost has a new, official API for working with Dask (although the older `dask-xgboost` package still works).

The new API is documented at https://xgboost.readthedocs.io/en/latest/tutorials/dask.html

First, we'll just train a model. Most of the configuration information in this API is passed via a parameters object, described here: https://xgboost.readthedocs.io/en/latest/parameter.html

To get started, keep it as simple as possible

### Activity 2: Predict on the test set

We'll need to make a `DaskDMatrix` again to feed the test set to XGBoost.

XGBoost also has a Dask-specific API for distributed prediction.

See if you can generate a vector of predictions and inspect those.

### Activity 3: Accuracy

Using a distributed mechanism, convert the prediction probabilities into an accuracy score for the test set.

### Lab wrapup notes

XGBoost does still implement scikit-learn-style interfaces, such as `DaskXGBClassifier` and `DaskXGBRegressor` ... and there's also the scikit-style `dask-xgboost` package.

If the goal of Dask ML is to provide an experience like scikit, and these scikit-style wrappers exist for XGBoost, why did we look at an API that feels different?

* We've done scikit-style examples, so seeing more of the same isn't as valuable
    * i.e., you probably already know how to use those other APIs if you need to
* Any time a tool presents a new *official* API, it's usually a good idea to try it out and consider a path forward, should the other APIs become deprecated or unsupported
* XGBoost is such a powerful and popular tool -- running across many platforms -- that it may be valuable to use its standard API, such as the `params`-based configuration
* Of course, at the end of the day, it's your project and your decision: some scikit-style examples are at https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_cpu_training.py

## Wrapping up a whirlwind tour

This has been a brief introduction to core workflows in Dask machine learning, and there are certainly plenty of areas we haven't explored!