# Machine Learning Lab Solution

In [None]:
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(name="training-cluster")
client = Client(cluster)
client

## Lab: Dask + XGBoost

### Activity 1: Train the cat-dog data with XGBoost

As of 2020, XGBoost ships its own official API for working with Dask, `xgboost.dask` (the older `dask-xgboost` package still works).

The new API is documented at https://xgboost.readthedocs.io/en/latest/tutorials/dask.html

First, we'll just train a model. Most of the configuration information in this API is passed via a parameters object, described here: https://xgboost.readthedocs.io/en/latest/parameter.html
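
As an illustration of what such a parameters object can look like, here is a minimal sketch in plain Python. `objective`, `max_depth`, `eta`, `eval_metric`, and `verbosity` are parameter names from the XGBoost parameter reference; the values are illustrative assumptions, not tuned settings for this lab.

```python
# Illustrative values only (assumptions, not tuned for the pets data).
base_params = {
    "objective": "binary:logistic",  # binary classification, outputs probabilities
    "max_depth": 4,                  # maximum depth of each tree
    "eta": 0.3,                      # learning rate (shrinkage per boosting round)
    "eval_metric": "logloss",        # metric reported for each entry in `evals`
}

# Override or add a single setting without mutating the base configuration.
verbose_params = {**base_params, "verbosity": 2}
```

The parameters object is an ordinary dictionary, so variants of a configuration can be built with normal dictionary operations before being passed to the training call.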

To get started, keep it as simple as possible.

In [None]:
import dask.dataframe as ddf
import pandas as pd

# Read the pet-license CSV from S3 (anonymous access) in roughly 1 MB partitions
pets = ddf.read_csv('s3://coiled-training/data/pets.csv', parse_dates=["License Issue Date"],
                    dtype={'License Number': 'object',
                           'ZIP Code': 'object'},
                    blocksize=1e6, storage_options={"anon": True})

# Drop the columns we won't use, then any rows with missing values
pets = pets.drop(columns=['Secondary Breed', 'License Number']).dropna()

pets = pets.rename(columns={'License Issue Date': 'license_date', "Animal's Name": 'name',
                            'Species': 'species', 'Primary Breed': 'breed', 'ZIP Code': 'zip'})

# Encode the date as an integer feature; `meta` tells Dask the output name and dtype
pets['day'] = pets['license_date'].apply(pd.Timestamp.toordinal, meta=('day', 'int64'))
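
The `day` feature relies on `toordinal`, which `pd.Timestamp` inherits from the standard library's `datetime`. The sketch below uses plain `datetime.date` and a made-up license date to show what the resulting integers mean.

```python
from datetime import date

issued = date(2020, 1, 1)   # hypothetical license date
day = issued.toordinal()    # day count in the proleptic Gregorian calendar; 0001-01-01 is day 1

# The conversion is lossless: the date can be recovered from the integer.
recovered = date.fromordinal(day)

# Differences between ordinals are differences in days (2020 is a leap year).
days_in_2020 = date(2021, 1, 1).toordinal() - day
```

Because the ordinal increases by one per calendar day, a tree-based model can split on it like any other numeric feature.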

In [None]:
# Keep only cats and dogs, and only the columns used for training
cats_and_dogs = pets[pets['species'].isin(['Dog', 'Cat'])]
cats_and_dogs = cats_and_dogs[['day', 'zip', 'species']]

In [None]:
from dask_ml.preprocessing import LabelEncoder

# Convert the string columns to categoricals, then encode them as integer codes
cats_and_dogs = cats_and_dogs.categorize()
cats_and_dogs['zip'] = LabelEncoder().fit_transform(cats_and_dogs['zip'])
cats_and_dogs['species'] = LabelEncoder().fit_transform(cats_and_dogs['species'])
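
To make the encoding step concrete, here is a plain-Python sketch of the idea behind label encoding. The helper `fit_label_mapping` is hypothetical, and assigning codes in sorted order is an assumption of this sketch; the codes that `dask_ml`'s `LabelEncoder` assigns to categorical columns may follow the category order instead.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer code, in sorted order (an assumption)."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

species = ["Dog", "Cat", "Dog", "Dog", "Cat"]  # hypothetical sample
mapping = fit_label_mapping(species)           # one code per distinct label
encoded = [mapping[s] for s in species]        # integers XGBoost can consume
```

Whichever order is used, the point is the same: every distinct string becomes a stable integer, so the label column can serve as a 0/1 target for `binary:logistic`.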

In [None]:
from dask_ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(cats_and_dogs[['day', 'zip']], 
                                                    cats_and_dogs.species, test_size=0.1)

X_train

In [None]:
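import xgboost as xgb

# DaskDMatrix wraps the distributed training data for XGBoost
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

param = {'objective': 'binary:logistic', 'verbosity': 2}

# Returns a dict holding the trained 'booster' and the evaluation 'history'
model = xgb.dask.train(client, param, dtrain,
                       num_boost_round=4, evals=[(dtrain, 'train')])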

In [None]:
model

### Activity 2: Predict on the test set

We'll need to build a `DaskDMatrix` again, this time to feed the test set to XGBoost.

XGBoost also has a Dask-specific API for distributed prediction.

See if you can generate a vector of predictions and inspect it.

In [None]:
dtest = xgb.dask.DaskDMatrix(client, X_test)

predictions = xgb.dask.predict(client, model, dtest)

predictions

In [None]:
predictions[:5].compute()
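
With the `binary:logistic` objective, each prediction is a probability between 0 and 1, obtained by passing the model's raw score (the margin) through the logistic function. A minimal plain-Python sketch, using made-up margin values rather than output from the model above:

```python
import math

def logistic(margin):
    """Convert a raw margin score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

margins = [-2.0, 0.0, 1.5]  # hypothetical raw scores
probabilities = [logistic(m) for m in margins]
```

A margin of zero maps to a probability of exactly 0.5, which is why 0.5 is the natural cut-off when turning these probabilities into class labels.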

### Activity 3: Accuracy

Using a distributed mechanism, convert the prediction probabilities into an accuracy score for the test set.

In [None]:
from dask_ml.metrics import accuracy_score

accuracy_score(y_test, predictions > 0.5)
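
For intuition, the same calculation can be written out on a handful of values in plain Python. The helper `accuracy_at_threshold` and the sample data are hypothetical; this is a single-machine sketch of the arithmetic, not how `dask_ml` computes the score across partitions.

```python
def accuracy_at_threshold(y_true, probabilities, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the true label."""
    predicted = [int(p > threshold) for p in probabilities]
    correct = sum(int(t == p) for t, p in zip(y_true, predicted))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]             # hypothetical encoded labels
probs = [0.9, 0.2, 0.4, 0.7, 0.6]    # hypothetical predicted probabilities
score = accuracy_at_threshold(y_true, probs)
```

Here three of the five thresholded predictions match their labels. Raising or lowering the threshold changes which probabilities count as the positive class and can therefore change the score.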