This notebook is adapted from the TFX "components keras" [tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/components_keras).
The goal is to predict whether a taxi trip will generate a big tip.

In this example, we show how to convert a TFX pipeline to [Ray AIR](https://docs.ray.io/en/latest/ray-air/getting-started.html), covering
every step from data ingestion to pushing a model to serving.

1. Read a CSV file into a Ray Dataset.
2. Process the dataset by chaining a variety of off-the-shelf preprocessors.
3. Train the model using distributed TensorFlow with a few lines of code.
4. Serve the model, applying the same preprocessing to incoming requests.

Note that ``ray.ml.checkpoint.Checkpoint`` serves as the bridge between steps 3 and 4.
By capturing both the model and the preprocessing steps in a way compatible with Ray Serve, this
abstraction lets ML workloads transition seamlessly between training and
serving.
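
To make the "bridge" idea concrete, here is a minimal plain-Python sketch of why bundling preprocessing with the model matters. `ToyCheckpoint` and `fill_and_order` are hypothetical names invented for this illustration; they are not part of Ray and say nothing about how `Checkpoint` is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class ToyCheckpoint:
    """Hypothetical stand-in that bundles preprocessing with model weights."""

    preprocess: Callable[[Row], List[float]]
    weights: List[float]

    def predict(self, row: Row) -> float:
        # The stored preprocessing runs first, so callers send raw rows.
        features = self.preprocess(row)
        return sum(w * x for w, x in zip(self.weights, features))


def fill_and_order(row: Row) -> List[float]:
    # Fill missing values with 0.0 and fix the feature order.
    return [row.get("trip_miles", 0.0), row.get("trip_seconds", 0.0)]


checkpoint = ToyCheckpoint(preprocess=fill_and_order, weights=[0.5, 0.01])
# "trip_seconds" is missing from the request; the bundled logic fills it in.
prediction = checkpoint.predict({"trip_miles": 2.0})
print(prediction)
```

Because the serving side only ever calls `predict` on raw rows, it cannot accidentally skip or reorder a preprocessing step that training relied on.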

Uncomment and run the following lines to install all the necessary dependencies:

In [1]:
# ! pip install "tensorflow>=2.8.0" "ray[tune, data, serve]>=1.12.1"
# ! ray install-nightly
# ! pip install fastapi

# Set up Ray

We will use `ray.init()` to initialize a local cluster. By default, this cluster consists of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.

In [2]:
from pprint import pprint
import ray

ray.init()

RayContext(dashboard_url='127.0.0.1:8265', python_version='3.7.7', ray_version='2.0.0.dev0', ray_commit='5c96e7223b468fed6b6db763c837728c721f78cd', address_info={'node_ip_address': '172.31.80.125', 'raylet_ip_address': '172.31.80.125', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-05-17_20-53-28_756452_181/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-05-17_20-53-28_756452_181/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-05-17_20-53-28_756452_181', 'metrics_export_port': 64821, 'gcs_address': '172.31.80.125:9031', 'address': '172.31.80.125:9031', 'node_id': '387b01b554f692bdb9ad031355d085f9381e5cf1dd22bbb1a9a7469e'})

We can check the resources that make up our cluster. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on that machine.

In [3]:
pprint(ray.cluster_resources())

{'CPU': 8.0,
 'CPU_group_0_67c86083f1dc83ce596e707e262302000000': 1.0,
 'CPU_group_67c86083f1dc83ce596e707e262302000000': 1.0,
 'bundle_group_0_67c86083f1dc83ce596e707e262302000000': 1000.0,
 'bundle_group_67c86083f1dc83ce596e707e262302000000': 1000.0,
 'memory': 19439608628.0,
 'node:172.31.80.125': 1.0,
 'object_store_memory': 9719804313.0}


# Getting the data

Let's start by defining a helper function that fetches the data to work with. Some columns are dropped for simplicity.

In [4]:
import pandas as pd

def get_data() -> pd.DataFrame:
    """Fetch the taxi fare data to work on."""
    _data = pd.read_csv(
        "https://raw.githubusercontent.com/tensorflow/tfx/master/"
        "tfx/examples/chicago_taxi_pipeline/data/simple/data.csv"
    )
    _data["is_big_tip"] = _data["tips"] / _data["fare"] > 0.2
    # We drop some columns here for the sake of simplicity.
    return _data.drop(
        [
            "tips",
            "fare",
            "dropoff_latitude",
            "dropoff_longitude",
            "pickup_latitude",
            "pickup_longitude",
            "pickup_census_tract",
        ],
        axis=1,
    )
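
The label is the expression `tips / fare > 0.2`, so it is worth seeing how it behaves when `fare` is zero. The sketch below uses a made-up four-row frame (the values are illustrative, not taken from the dataset) to show the edge cases.

```python
import pandas as pd

# Hypothetical rows chosen to exercise the edge cases of the label rule.
toy = pd.DataFrame(
    {
        "tips": [3.0, 1.0, 0.0, 2.0],
        "fare": [10.0, 10.0, 0.0, 0.0],
    }
)

ratio = toy["tips"] / toy["fare"]
# 0.0 / 0.0 is NaN, and NaN > 0.2 evaluates to False.
# 2.0 / 0.0 is inf, and inf > 0.2 evaluates to True.
toy["is_big_tip"] = ratio > 0.2
print(toy)
```

In other words, a trip with no fare and no tip is labeled as not a big tip, while any positive tip on a zero fare counts as a big tip.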

In [5]:
data = get_data()

Now let's take a look at the data. Notice that some values are missing. This is exactly where preprocessing comes into the picture. We will come back to this in the preprocessing section below.

In [6]:
data.head(5)

Unnamed: 0,pickup_community_area,trip_start_month,trip_start_hour,trip_start_day,trip_start_timestamp,trip_miles,dropoff_census_tract,payment_type,company,trip_seconds,dropoff_community_area,is_big_tip
0,,5,19,6,1400269500,0.0,,Credit Card,Chicago Elite Cab Corp. (Chicago Carriag,0.0,,False
1,,3,19,5,1362683700,0.0,,Unknown,Chicago Elite Cab Corp.,300.0,,False
2,60.0,10,2,3,1380593700,12.6,,Cash,Taxi Affiliation Services,1380.0,,False
3,10.0,10,1,2,1382319000,0.0,,Cash,Taxi Affiliation Services,180.0,,False
4,14.0,5,7,5,1369897200,0.0,,Cash,Dispatch Taxi Affiliation,1080.0,,False


Next, we split the data into training and test sets.
For the test data, we separate the features, which we will send to serving, from the labels, which we will compare the serving results against.

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split
from typing import Tuple


def split_data(data: pd.DataFrame) -> Tuple[ray.data.Dataset, pd.DataFrame, np.ndarray]:
    """Split the data in a stratified way.

    Returns:
        A tuple containing train dataset, test data and test label.
    """
    train_data, test_data = train_test_split(
        data, stratify=data[["is_big_tip"]], random_state=1113
    )
    _train_ds = ray.data.from_pandas(train_data)
    _test_label = test_data["is_big_tip"].values
    _test_df = test_data.drop(["is_big_tip"], axis=1)
    return _train_ds, _test_df, _test_label

train_ds, test_df, test_label = split_data(data)
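
The effect of `stratify` is easiest to see on a small, imbalanced frame. The sketch below uses made-up data (100 rows, 20% positive) and the same `random_state` as above; it only illustrates `train_test_split` and does not touch Ray.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20 positive rows out of 100.
toy = pd.DataFrame(
    {
        "trip_miles": range(100),
        "is_big_tip": [True] * 20 + [False] * 80,
    }
)

# With no explicit test_size, 25% of the rows go to the test set.
train, test = train_test_split(toy, stratify=toy["is_big_tip"], random_state=1113)

print(len(train), len(test))
print(train["is_big_tip"].mean(), test["is_big_tip"].mean())
```

Both splits keep the 20% positive rate of the full frame, which is why the test accuracy later on is measured against the same class balance the model was trained on.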

In [8]:
print(f"There are {train_ds.count()} samples for training and {test_df.shape[0]} samples for testing.")

There are 11251 samples for training and 3751 samples for testing.


# Preprocessing

Let's focus on preprocessing first.
Input data usually needs to go through some preprocessing before being
fed into a model. It is a good idea to package the preprocessing logic into
a modular component so that the same logic can be applied to
training data as well as to data for online serving or offline batch prediction.

In AIR, this component is `ray.ml.preprocessor.Preprocessor`.
It is designed to be easy to compose.

Now let's construct a chained preprocessor composed of simple preprocessors, including
1. Imputer for filling in missing features;
2. OneHotEncoder for encoding categorical features;
3. BatchMapper, which applies an arbitrary UDF to batches of records;
and so on. Take a look at `ray.ml.preprocessor.Preprocessor` for more details.
The output of the preprocessing step is fed into the model for training.
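
To build intuition for what these steps do to a batch, here is a rough pandas-only analogue on a made-up frame. It is not the Ray implementation, and the treatment of infrequent categories (all encoded columns set to 0) is an assumption about how a category cap such as `limit` behaves.

```python
import pandas as pd

# Hypothetical batch with missing values in both columns.
batch = pd.DataFrame(
    {
        "payment_type": ["Cash", None, "Credit Card", "Cash", "Credit Card", "Unknown"],
        "trip_miles": [0.0, 12.6, None, 1.4, 2.0, 4.0],
    }
)

# Step 1: impute. A constant for the categorical column, the mean for the numeric one.
batch["payment_type"] = batch["payment_type"].fillna("Cash")
batch["trip_miles"] = batch["trip_miles"].fillna(batch["trip_miles"].mean())

# Step 2: one-hot encode, keeping only the 2 most frequent categories.
top_categories = batch["payment_type"].value_counts().index[:2]
for value in top_categories:
    batch[f"payment_type_{value}"] = (batch["payment_type"] == value).astype(int)
batch = batch.drop(columns=["payment_type"])

print(batch)
```

The last row ("Unknown") falls outside the two most frequent categories, so both of its encoded columns are 0 in this sketch.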

In [9]:
from ray.ml.preprocessors import (
    BatchMapper,
    Chain,
    OneHotEncoder,
    SimpleImputer,
)

def get_preprocessor():
    """Construct a chain of preprocessors."""
    imputer1 = SimpleImputer(
        ["dropoff_census_tract"], strategy="constant", fill_value=17031839100
    )
    imputer2 = SimpleImputer(
        ["pickup_community_area", "dropoff_community_area"],
        strategy="constant",
        fill_value=8,
    )
    imputer3 = SimpleImputer(["payment_type"], strategy="constant", fill_value="Cash")
    imputer4 = SimpleImputer(
        ["company"], strategy="constant", fill_value="Taxi Affiliation Services"
    )
    imputer5 = SimpleImputer(
        ["trip_start_timestamp", "trip_miles", "trip_seconds"], strategy="mean"
    )

    ohe = OneHotEncoder(
        columns=[
            "trip_start_hour",
            "trip_start_day",
            "trip_start_month",
            "dropoff_census_tract",
            "pickup_community_area",
            "dropoff_community_area",
            "payment_type",
            "company",
        ],
        limit={
            "dropoff_census_tract": 25,
            "pickup_community_area": 20,
            "dropoff_community_area": 20,
            "payment_type": 2,
            "company": 7,
        },
    )

    def fn(df):
        df["trip_start_year"] = pd.to_datetime(df["trip_start_timestamp"], unit="s").dt.year
        return df

    chained_pp = Chain(
        imputer1,
        imputer2,
        imputer3,
        imputer4,
        imputer5,
        ohe,
        BatchMapper(fn),
        BatchMapper(lambda x: x.drop(["trip_start_timestamp"], axis=1)),
    )
    return chained_pp
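
The two `BatchMapper` steps at the end of the chain can be tried in isolation. The sketch below applies the same two transformations to the first three `trip_start_timestamp` values shown in the `data.head(5)` output above, using plain pandas rather than a Ray preprocessor.

```python
import pandas as pd

# Timestamps (seconds since the Unix epoch) from the first three rows of the data.
batch = pd.DataFrame({"trip_start_timestamp": [1400269500, 1362683700, 1380593700]})

# Same logic as `fn` above: derive the year from the timestamp.
batch["trip_start_year"] = pd.to_datetime(batch["trip_start_timestamp"], unit="s").dt.year

# Same logic as the final lambda: drop the raw timestamp column.
batch = batch.drop(["trip_start_timestamp"], axis=1)

years = batch["trip_start_year"].tolist()
print(years)
```

After these two steps the model sees a single small integer feature (the year) instead of a ten-digit timestamp.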


Now let's define some constants for clarity.

In [10]:
# Note that `INPUT_SIZE` corresponds to the output dimension
# of the previously defined preprocessing steps.
# It is used to specify the input shape of the Keras model as well as
# when converting training data from `ray.data.Dataset` to `tf.Tensor`.
INPUT_SIZE = 120
# The training batch size. Based on `NUM_WORKERS`, each worker
# will get its own share of this batch size. For example, if
# `NUM_WORKERS = 2`, each worker will work on 4 samples per batch.
BATCH_SIZE = 8
# Number of epochs. Adjust it based on how long you want the run to take.
EPOCH = 1
# Number of training workers.
NUM_WORKERS = 2
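
One plausible way to arrive at `INPUT_SIZE = 120` is to add up the widths of the preprocessed columns. The breakdown below is a reconstruction, not something the original notebook states: it assumes every hour, weekday, and month value occurs in the training data, and that each capped column reaches its `limit`.

```python
# Width of each one-hot encoded column (assumptions noted above).
one_hot_widths = {
    "trip_start_hour": 24,         # uncapped: hours 0..23
    "trip_start_day": 7,           # uncapped: 7 weekdays
    "trip_start_month": 12,        # uncapped: 12 months
    "dropoff_census_tract": 25,    # capped by `limit`
    "pickup_community_area": 20,   # capped by `limit`
    "dropoff_community_area": 20,  # capped by `limit`
    "payment_type": 2,             # capped by `limit`
    "company": 7,                  # capped by `limit`
}

# Numeric columns that pass through, plus the derived year column.
# `trip_start_timestamp` is dropped and `is_big_tip` is the label.
passthrough_columns = ["trip_miles", "trip_seconds", "trip_start_year"]

input_size = sum(one_hot_widths.values()) + len(passthrough_columns)
print(input_size)

# Per-worker share of the batch, following the comment in the cell above.
per_worker_batch = 8 // 2
print(per_worker_batch)
```

If you change the `limit` values or the set of encoded columns in the preprocessor, `INPUT_SIZE` has to change accordingly.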

# Training

Let's start by defining a simple Keras model for the classification task.

In [11]:
import tensorflow as tf

def build_model():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.Input(shape=(INPUT_SIZE,)))
    model.add(tf.keras.layers.Dense(50, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    return model
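
As a quick sanity check on the size of this model, the number of trainable parameters can be worked out by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper name `dense_params` below is made up for this illustration.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


hidden = dense_params(120, 50)  # INPUT_SIZE inputs -> 50 ReLU units
output = dense_params(50, 1)    # 50 hidden units -> 1 sigmoid unit
total = hidden + output
print(hidden, output, total)
```

The same total should appear in the output of `build_model().summary()`.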

Now let's define the training loop. This code will be run on each training
worker in a distributed fashion.

In [12]:
from ray import train
from ray.train.tensorflow import prepare_dataset_shard

def train_loop_per_worker():
    dataset_shard = train.get_dataset_shard("train")

    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = build_model()
        model.compile(
            loss="binary_crossentropy",
            optimizer="adam",
            metrics=["accuracy"],
        )

    for epoch in range(EPOCH):
        # This will make sure that the training workers will get their own
        # share of batch to work on.
        # See `ray.train.tensorflow.prepare_dataset_shard` for more information.
        tf_dataset = prepare_dataset_shard(
            dataset_shard.to_tf(
                label_column="is_big_tip",
                output_signature=(
                    tf.TensorSpec(shape=(BATCH_SIZE, INPUT_SIZE), dtype=tf.float32),
                    tf.TensorSpec(shape=(BATCH_SIZE,), dtype=tf.int64),
                ),
                batch_size=BATCH_SIZE,
                drop_last=True,
            )
        )

        model.fit(tf_dataset)
        # Save a checkpoint in a format that Ray Serve can consume later.
        train.save_checkpoint(epoch=epoch, model=model.get_weights())

Now let's define a trainer that takes in the training loop,
the training dataset, and the preprocessor that we just defined.

And run it!

Notice that you can tune how long you want the run to be by changing ``EPOCH``.

In [13]:
from ray.ml.train.integrations.tensorflow import TensorflowTrainer

trainer = TensorflowTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config={"num_workers": NUM_WORKERS},
    datasets={"train": train_ds},
    preprocessor=get_preprocessor(),
)
result = trainer.fit()

E0517 21:19:02.244123215    6495 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers


Trial name,status,loc
TensorflowTrainer_a5bfe_00000,TERMINATED,172.31.80.125:6628


[2m[36m(BaseWorkerMixin pid=6748)[0m Instructions for updating:
[2m[36m(BaseWorkerMixin pid=6748)[0m use distribute.MultiWorkerMirroredStrategy instead
[2m[36m(BaseWorkerMixin pid=6748)[0m 2022-05-17 21:19:14.652171: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
[2m[36m(BaseWorkerMixin pid=6748)[0m 2022-05-17 21:19:14.652210: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
[2m[36m(BaseWorkerMixin pid=6749)[0m Instructions for updating:
[2m[36m(BaseWorkerMixin pid=6749)[0m use distribute.MultiWorkerMirroredStrategy instead
[2m[36m(BaseWorkerMixin pid=6749)[0m 2022-05-17 21:19:14.652324: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda

      7/Unknown - 3s 9ms/step - loss: 31.4913 - accuracy: 0.7679
      6/Unknown - 3s 10ms/step - loss: 36.7398 - accuracy: 0.7292
     18/Unknown - 3s 9ms/step - loss: 20.7746 - accuracy: 0.7361 
     18/Unknown - 3s 9ms/step - loss: 20.7746 - accuracy: 0.7361 
     30/Unknown - 3s 9ms/step - loss: 20.1022 - accuracy: 0.7125
     30/Unknown - 3s 9ms/step - loss: 20.1022 - accuracy: 0.7125
     41/Unknown - 3s 9ms/step - loss: 15.9702 - accuracy: 0.6829 
     41/Unknown - 3s 9ms/step - loss: 15.9702 - accuracy: 0.6829 
     51/Unknown - 3s 10ms/step - loss: 13.1416 - accuracy: 0.6912
     51/Unknown - 3s 10ms/step - loss: 13.1416 - accuracy: 0.6912
     61/Unknown - 3s 10ms/step - loss: 11.2825 - accuracy: 0.6967
     62/Unknown - 3s 10ms/step - loss: 11.1298 - accuracy: 0.6915
     73/Unknown - 3s 10ms/step - loss: 9.9122 - accuracy: 0.6798 
     73/Unknown - 3s 10ms/step - loss: 9.9122 - accuracy: 0.6798 
     87/Unknown - 4s 10ms/step - loss: 8.7285 - accuracy: 0.6897
     87/Unknow

2022-05-17 21:19:31,907	ERROR checkpoint_manager.py:193 -- Result dict has no key: training_iteration. checkpoint_score_attr must be set to a key of the result dict. Valid keys are ['trial_id', 'experiment_id', 'date', 'timestamp', 'pid', 'hostname', 'node_ip', 'config', 'done']


Trial TensorflowTrainer_a5bfe_00000 completed. Last result: 


2022-05-17 21:19:32,019	INFO tune.py:753 -- Total run time: 29.79 seconds (28.51 seconds for the tuning loop).


# Moving on to Serve

Ray Serve serves the trained model through `ray.serve.model_wrappers.ModelWrapper` and `ray.serve.model_wrappers.ModelWrapperDeployment`. These constructs wrap a `ray.ml.checkpoint.Checkpoint` into an endpoint that can readily serve HTTP requests.

This removes the boilerplate code and minimizes the effort to bring your model to production!

Generally speaking, an HTTP request can carry its payload either as JSON or as raw data.
Upon receiving this payload, Ray Serve needs an "adapter" to convert it into a form that the preprocessing steps accept as input. In this case, we send a JSON request and convert it to a pandas DataFrame through `dataframe_adapter`, defined below.

In [14]:
from fastapi import Request

async def dataframe_adapter(request: Request):
    """Serve HTTP Adapter that reads JSON and converts to pandas DataFrame."""
    content = await request.json()
    return pd.DataFrame.from_dict(content)
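
To see what the adapter receives, it helps to trace one row through the JSON round trip without any server involved. The sketch below uses a made-up two-row frame; `json.dumps` and `json.loads` stand in for what the HTTP client and `request.json()` do.

```python
import json

import pandas as pd

# Hypothetical test frame with a non-default index, as after a train/test split.
df = pd.DataFrame(
    {"trip_miles": [0.0, 12.6], "payment_type": ["Cash", "Credit Card"]},
    index=[41, 7],
)

# Client side: select one row as a DataFrame and convert it to nested dicts.
one_row = df.iloc[[1]].to_dict()

# Wire format: JSON object keys are always strings.
payload = json.dumps(one_row)

# Server side: the same call the adapter makes on the decoded JSON.
rebuilt = pd.DataFrame.from_dict(json.loads(payload))
print(rebuilt)
```

Note that the row label comes back as the string `"7"` rather than the integer `7`, so downstream code should rely on column names and not on the index.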

Now let's wrap everything in a Serve endpoint that exposes a URL to which requests can be sent.

In [None]:
from ray import serve
from ray.ml.checkpoint import Checkpoint
from ray.ml.predictors.integrations.tensorflow import TensorflowPredictor
from ray.serve.model_wrappers import ModelWrapperDeployment


def serve_model(checkpoint: Checkpoint, model_definition, adapter, name="Model") -> str:
    """Expose a serve endpoint.

    Returns:
        serve URL.
    """
    serve.start(detached=True)
    deployment = ModelWrapperDeployment.options(name=name)
    deployment.deploy(
        TensorflowPredictor,
        checkpoint,
        # This is due to a current limitation on Serve that's
        # being addressed.
        # TODO(xwjiang): Change to True.
        batching_params=False,
        model_definition=model_definition,
        http_adapter=adapter,
    )
    return deployment.url

# Generally speaking, training and serving are done in totally different Ray clusters.
# To simulate that, let's shut down the old Ray cluster in preparation for serving.
ray.shutdown()

endpoint_uri = serve_model(result.checkpoint, build_model, dataframe_adapter)

2022-05-17 21:19:33,136	INFO worker.py:863 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-05-17 21:19:33,137	INFO worker.py:965 -- Connecting to existing Ray cluster at address: 172.31.80.125:9031


Let's write a helper function to send requests to this endpoint and compare the results with labels.

In [None]:
import requests

NUM_SERVE_REQUESTS = 100

def send_requests(df: pd.DataFrame, label: np.ndarray):
    for i in range(NUM_SERVE_REQUESTS):
        one_row = df.iloc[[i]].to_dict()
        serve_result = requests.post(endpoint_uri, json=one_row).json()
        print(
            f"request[{i}] prediction: {serve_result['predictions']['0']} "
            f"- label: {str(label[i])}"
        )

In [None]:
send_requests(test_df, test_label)
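
The model ends in a sigmoid unit, so each prediction is a score between 0 and 1 rather than a boolean. One way to summarize the printed comparison is to threshold the scores and compute accuracy. The scores and labels below are made up for illustration, and the 0.5 threshold is an assumption, not something the notebook prescribes.

```python
import numpy as np

# Hypothetical sigmoid outputs and ground-truth labels for four requests.
probabilities = np.array([0.91, 0.12, 0.48, 0.70])
labels = np.array([True, False, True, True])

# Treat scores above 0.5 as "big tip" (assumed threshold).
predicted = probabilities > 0.5
accuracy = float((predicted == labels).mean())
print(accuracy)
```

In practice you would collect the `serve_result` values inside `send_requests` into a list and compare them against `label[:NUM_SERVE_REQUESTS]` in the same way.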