# Exploring a Real-World Use Case: Batch ML Scoring 

Did you get an email in your inbox today offering product promotions and deals -- customized just for you?

If so, you've already experienced the common real-world use case we're about to explore: batch scoring with a machine learning model.

In batch scoring, we have a trained model and a large number of records. We'd like to parallelize the scoring of these records so that we can generate predictions -- like what to include in your promo email -- in a short period of time.

Unlike near-real-time scoring (like the models which determine whether to flag a credit card swipe as fraud), batch scoring has an advantage: it doesn't need to happen in milliseconds. It also has a disadvantage: instead of scoring for one person or one credit card, there might be hundreds of millions of records involved.

Let's put everything we've learned together and implement this scenario *the right way* with Dask!

## The Setup

Here's what we have so far:

* a trained PyTorch model
    * in our example, a convolutional network trained on Fashion MNIST to recognize clothing items
        * https://github.com/zalandoresearch/fashion-mnist
* a bunch of records we want to score
    * we'll use 10,000, but the goal is a design that scales to arbitrarily large datasets
* Dask!

In [None]:
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(name="training-cluster")
client = Client(cluster)
client

__We know that Dask+Python are flexible enough that we could do this in a lot of ways and it would probably run.__

But we also know that if we don't plan ahead, we might see suboptimal performance due to things like
* moving too much data
* moving data to/from the wrong places (e.g., distant storage or a bottleneck like the process hosting the client)
* tasks that are too short or too long in duration
* tasks that waste cycles repeating common logic
* instability due to running too close to resource limits
* or other issues

__So let's make a plan__

### Input data

The input data are the records we want to score. 

#### Location

We'd like this to be in a place that all of our workers can read from in parallel and without excessive network transport.

Since our compute will happen in AWS, we'll read the source data from an S3 bucket.

#### Format

We are scoring image data, which is effectively array data.

We should use an efficient array format like HDF5 or Zarr.

#### Lazy access

Dask's Array will give us the lazy-read characteristics we want. Handling the shape/chunking should be no problem since we know the details of the data ... __but__ we may want to revisit chunking along the batch (record number) axis. We'll come back to this as we talk about processing.

In [None]:
import dask.array as da

arr = da.from_zarr('s3://coiled-training/data/images_to_score', storage_options={"anon": True})
arr

In [None]:
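# Pass the chunk shape as one tuple; extra positional arguments are read as other rechunk options
arr = arr.rechunk((100, 28, 28))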
arr
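
To get a feel for what chunking along the record axis does, here is a minimal sketch on synthetic data, so it runs without S3 access. The array below is a stand-in with the same shape as our images; the `uint8` dtype is an assumption, not something read from the real dataset.

```python
import dask.array as da
import numpy as np

# Synthetic stand-in for the images: 10,000 records of 28x28 pixels.
# The uint8 dtype is an assumption made for this sketch.
images = da.zeros((10_000, 28, 28), dtype=np.uint8, chunks=(1_000, 28, 28))

# Rechunk along the record axis only, so every block holds whole images.
rechunked = images.rechunk((100, 28, 28))

# Number of blocks along each axis, and the size of one block in bytes.
blocks_per_axis = rechunked.numblocks
bytes_per_block = 100 * 28 * 28 * images.dtype.itemsize

print(blocks_per_axis)
print(bytes_per_block)
```

Each block along the first axis becomes one unit of work later on, so this number is what we'll revisit when tuning.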

### Now the ML Processing ... 

<br>
<br>
<br>
<br>
<br>
<br>

<img src='images/jerry.gif' width=480>

<br>
<br>
<br>
<br>
<br>
<br>

Not quite yet ... 

### More data: the model

In this use case, our model is actually a hefty piece of data. Our small Fashion MNIST model is 2.5MB, which we don't want to transport every time we score records, and a more "serious" model might easily be in the GB range (to say nothing of state-of-the-art language models).

There are a few options. We can load the model from shared storage like S3 or a model database for maximum flexibility, or -- if we are creating a container config for this specific application -- we could bake it into the container image.

Each worker will need to load the model at least once. But Dask is smart enough to figure this out, and move the model around the cluster as needed ... we just need a token (`Future`) that points to the model, and we can get this by calling `client.submit` on a helper that loads the model once.

> For models that are not serializable in the standard way, or where Dask cannot automate replication effectively, we can manually set up the model on each worker via `client.run`. There is an example notebook illustrating that approach, but this "automatic" one is preferable and easier to think about.

First we'll test locally (as a sanity check, since remote execution is tougher to troubleshoot):

In [None]:
import torch
import numpy as np
import urllib.request
from io import BytesIO

file = 'https://coiled-training.s3.amazonaws.com/data/fmnist.pyt'

with urllib.request.urlopen(file) as f:
    model = torch.load(BytesIO(f.read()))
    
model

In [None]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [None]:
model(torch.randn(1, 1, 28, 28))

Ok, looks good.

In [None]:
def load_model():
    file = 'https://coiled-training.s3.amazonaws.com/data/fmnist.pyt'
    with urllib.request.urlopen(file) as f:
        model = torch.load(BytesIO(f.read()))
    return model

model_future = client.submit(load_model)
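
The load-once pattern can also be tried without a remote cluster. The sketch below is an illustration under stated assumptions: `load_standin_model` and `score_block` are hypothetical helpers, the "model" is just a random weight matrix rather than our PyTorch network, and the in-process cluster assumes `distributed` is installed. It uses separate names (`local_client`, `standin_future`) so it does not replace the Coiled client above.

```python
import numpy as np
from dask.distributed import Client


def load_standin_model():
    """Hypothetical stand-in for load_model(): returns a fixed weight matrix."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(28 * 28, 10)).astype(np.float32)


def score_block(block, weights):
    """Flatten each image and apply the stand-in linear model."""
    flat = block.reshape(block.shape[0], -1).astype(np.float32) / 255.0
    return flat @ weights


# In-process cluster (threads only, dashboard disabled) for a quick local check.
local_client = Client(processes=False, dashboard_address=None)
try:
    # The "model" is created once as a task; we only hold a Future pointing to it.
    standin_future = local_client.submit(load_standin_model)

    # Three small fake batches of five images each.
    blocks = [np.full((5, 28, 28), i, dtype=np.uint8) for i in range(3)]

    # Passing the Future as an argument makes each scoring task depend on it;
    # Dask hands the task the loaded object, not the Future.
    score_futures = [
        local_client.submit(score_block, block, standin_future) for block in blocks
    ]
    results = local_client.gather(score_futures)
finally:
    local_client.close()

print([r.shape for r in results])
```

The point of the design is that the scoring tasks never reload the model themselves; they just declare a dependency on the task that did.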

### Processing

We can keep it simple and use a Dask Array `map_blocks` approach. We may need to tune these blocks later, but for now let's recall that our data looks like this:

In [None]:
arr

In [None]:
def test_score(block):
    return np.random.normal(size=(block.shape[0], 10))

In [None]:
test_score(arr.blocks[0]).shape

In [None]:
arr.map_blocks(test_score, chunks=(100,10), dtype=np.float32, drop_axis=2).compute().shape

In [None]:
def score(block, model):
    torch_tensor = torch.tensor(block)
    reshaped = torch_tensor.view(-1, 1, 28, 28)
    rescaled = reshaped / 255.0
    raw_scores = model(rescaled)
    scores = torch.softmax(raw_scores, 1).detach().numpy()
    return scores

In [None]:
handle_to_outputs = arr.map_blocks(score, model_future, chunks=(100,10), dtype=np.float32, drop_axis=2)
handle_to_outputs

In [None]:
handle_to_outputs[:3, :].compute()

In [None]:
handle_to_outputs[:3, :].argmax(axis=1).compute()
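
The same `map_blocks` shape bookkeeping can be checked end to end on synthetic data. This is a minimal sketch, not our actual model: `standin_score` is a hypothetical NumPy scorer (a random linear layer plus softmax) that mimics the input and output shapes of `score`, and the random pixels are made up for the example.

```python
import dask.array as da
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in "model": a random linear layer from 784 pixels to 10 classes.
weights = rng.normal(size=(28 * 28, 10)).astype(np.float32)


def standin_score(block, weights):
    """Rescale pixels, apply the linear layer, and return softmax probabilities."""
    flat = block.reshape(block.shape[0], 28 * 28).astype(np.float32) / 255.0
    logits = flat @ weights
    logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# 1,000 fake images, chunked 100 records per block.
pixels = rng.integers(0, 256, size=(1_000, 28, 28), dtype=np.uint8)
images = da.from_array(pixels, chunks=(100, 28, 28))

# Each (100, 28, 28) input block becomes a (100, 10) output block:
# drop_axis removes one input axis, and chunks declares the output block shape.
probs = images.map_blocks(
    standin_score, weights, chunks=(100, 10), dtype=np.float32, drop_axis=2
)

result = probs.compute()
labels = probs.argmax(axis=1).compute()

print(result.shape, labels.shape)
```

Because `chunks` is only a declaration, it has to match what the function really returns for each block; Dask does not check it for us.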

### Writing output

We want to write the output in parallel from the scoring tasks, to shared storage.

Let's store the class probabilities in the same order as the source records, in zarr format, to shared storage. 

## The Execution

Now that we've planned everything out, the write command will kick it off:

In [None]:
import os

bucket = os.environ['WRITE_BUCKET']

handle_to_outputs.to_zarr('s3://' + bucket + '/scores', overwrite=True)

## Tuning

The last step is to see what our resource consumption looks like, and tune our job. 

Mainly, we want to optimize the number of records scored per batch, so that we are making good use of the Dask Scheduler.

Since the batches are currently aligned with array blocks -- a good practice -- changing the batch size may mean changing array chunk size.

> In my sample run, I have a lot of very short tasks. The actual duration will depend on hardware and other factors. But we could go to 10x or more with the amount of data and aim for a longer task time.
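
A bit of arithmetic helps when picking a batch size. The sketch below is a rough estimate under stated assumptions: `batch_plan` is a hypothetical helper, and it counts only the float32 input tensor for one batch, not the model's activations or the output, so real memory use will be higher.

```python
import math

N_RECORDS = 10_000          # records in our example dataset
PIXELS_PER_RECORD = 28 * 28
FLOAT32_BYTES = 4           # pixels become float32 after rescaling


def batch_plan(records_per_chunk):
    """Return (number of scoring tasks, bytes of one float32 input batch)."""
    n_tasks = math.ceil(N_RECORDS / records_per_chunk)
    batch_bytes = records_per_chunk * PIXELS_PER_RECORD * FLOAT32_BYTES
    return n_tasks, batch_bytes


for size in (100, 500, 2000):
    n_tasks, batch_bytes = batch_plan(size)
    print(f"{size:>5} records/chunk: {n_tasks:>3} tasks, {batch_bytes / 1e6:.2f} MB per input batch")
```

Fewer, larger tasks mean less scheduling overhead per record, at the cost of more memory per task and fewer chances to run in parallel.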

In [None]:
bigger_task_output = arr.rechunk((2000, 28, 28)).map_blocks(score, model_future, chunks=(2000,10), dtype=np.float32, drop_axis=2)
bigger_task_output

In [None]:
bigger_task_output.compute()

Notice that memory usage spikes a little. In a real-life scenario, a much larger data chunk could cause workers to run out of memory or spill to disk, but an even more likely failure is running out of GPU memory if we are scoring on GPUs. Since we want to "load up" our GPUs with as many records as will fit, we'll often be close to the GPU memory limit, and unfortunately that limit is harder to raise :)

We can easily rechunk to use less memory:

In [None]:
bigger_task_output = arr.rechunk((500, 28, 28)).map_blocks(score, model_future, chunks=(500,10), dtype=np.float32, drop_axis=2)
bigger_task_output.compute()

In this last experiment, I'm getting somewhat longer tasks, and not running out of memory.

For a real-world deployment, we would keep tuning, and probably use larger workers and even bigger chunks of data.

In [None]:
bigger_task_output.to_zarr('s3://' + bucket + '/scores', overwrite=True)

For now, we've illustrated the key end-to-end workflow...

Next steps might be to try even larger datasets and make sure we retain good parallel utilization and good task sizes, without failures.

## Bonus Lab Activity

In our data folder, there's another version of this same model, in an open industry-standard format called ONNX. 

You can learn more about ONNX at https://onnx.ai/

There are a variety of reasons to use an open, standard format for deploying your models -- and ONNX can handle a huge variety of models: 
* https://onnx.ai/supported-tools.html
* https://github.com/onnx/onnxmltools

ONNX features a number of runtime (deployment) options. One is an open source implementation for CPU (and another for GPU) from Microsoft: https://microsoft.github.io/onnxruntime/

Try replicating what we did here in this lab, but using the ONNX model artefact and the `onnxruntime` library for scoring.