# Scaling Many Model Training with Ray Tune

This template is a quickstart to using [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) for training many models in parallel. Ray Tune is one of many libraries under the [Ray AI Runtime](https://docs.ray.io/en/latest/ray-air/getting-started.html). See [this blog post](https://www.anyscale.com/blog/training-one-million-machine-learning-models-in-record-time-with-ray) for more information on the benefits of many-model training with Ray!

This template walks through time-series forecasting using `statsforecast`, but the framework and data format can be swapped out easily -- they are there just to help you build your own application!

At a high level, this template will:

1. [Define the training function for a single partition of data.](https://docs.ray.io/en/latest/tune/tutorials/tune-run.html)
2. [Define a Tune search space to run training over many partitions of data.](https://docs.ray.io/en/latest/tune/tutorials/tune-search-spaces.html)
3. [Extract the best model per dataset partition from the Tune experiment output.](https://docs.ray.io/en/latest/tune/examples/tune_analyze_results.html)

## Installing Dependencies

First, install the necessary dependencies in the Anyscale Workspace. Open up a terminal and follow one of the install steps below, depending on which size template you picked:


### Install Dependencies (Small-scale Template)

The small-scale template only runs on a single node (the head node), so we just need to install the requirements *locally*.

In [None]:
%pip install -r requirements.txt --upgrade


### Install Cluster-wide Dependencies (Large-scale Template)

When running in a distributed Ray Cluster, all nodes need to have access to the installed packages.
For this, we'll use `pip install --user` to install the necessary requirements.
On an [Anyscale Workspace](https://docs.anyscale.com/user-guide/develop-and-debug/workspaces),
this will install packages to a *shared filesystem* that will be available to all nodes in the cluster.

In [None]:
%pip install --user -r requirements.txt --upgrade


> Slot in your code below wherever you see the ✂️ icon to build a many model training Ray application off of this template!

In [None]:
import pandas as pd
from pyarrow import parquet as pq
from sklearn.metrics import mean_squared_error

import ray
from ray import tune
from ray.air import session

try:
    from statsforecast import StatsForecast
    from statsforecast.models import AutoARIMA, AutoETS
except ImportError as e:
    raise RuntimeError("Did you follow the steps above to install dependencies?") from e


> ✂️ Replace this value to change the number of data partitions you will use. This will be the total number of Tune trials you will run!
>
> Note that this template fits two models per data partition and reports the best-performing one.

In [None]:
# Default values for the small-scale template
NUM_DATA_PARTITIONS: int = 50


In [None]:
# Default values for the large-scale template
NUM_DATA_PARTITIONS: int = 1000


```{tip}
If you're running the small-scale version of the template, try setting
`NUM_DATA_PARTITIONS` (the number of trials) to the value recommended for the large-scale version.
It'll be much slower, but you'll see a dramatic speedup once you distribute the load
across a multi-node Ray cluster in the large-scale version!
```

> ✂️ Replace the following with your own data-loading and evaluation helper functions. (Or, just delete these!)

In [None]:
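def get_m5_partition(unique_id: str) -> pd.DataFrame:
    """Load one item's series from the M5 dataset and rename its columns
    to the `unique_id` / `ds` / `y` names that statsforecast expects."""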
    df = (
        pq.read_table(
            "s3://anonymous@m5-benchmarks/data/train/target.parquet",
            columns=["item_id", "timestamp", "demand"],
            filters=[("item_id", "=", unique_id)],
        )
        .to_pandas()
        .rename(columns={"item_id": "unique_id", "timestamp": "ds", "demand": "y"})
    )
    df["unique_id"] = df["unique_id"].astype(str)
    df["ds"] = pd.to_datetime(df["ds"])
    return df.dropna()


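def evaluate_cross_validation(df, metric):
    """Score every model column with `metric` per (unique_id, cutoff) window,
    average the scores over windows, and record the lowest-scoring model
    per unique_id in a `best_model` column."""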
    models = df.drop(columns=["ds", "cutoff", "y"]).columns.tolist()
    evals = []
    for model in models:
        eval_ = (
            df.groupby(["unique_id", "cutoff"])
            .apply(lambda x: metric(x["y"].values, x[model].values))
            .to_frame()
        )
        eval_.columns = [model]
        evals.append(eval_)
    evals = pd.concat(evals, axis=1)
    evals = evals.groupby(["unique_id"]).mean(numeric_only=True)
    evals["best_model"] = evals.idxmin(axis=1)
    return evals
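
The selection logic in `evaluate_cross_validation` can be hard to see through the `groupby` calls. Here is a minimal, pandas-free sketch of the same idea for a single partition: compute the metric per cross-validation window, average over windows, and take the model with the lowest score. The `windows` data and the `pick_best_model` helper are hypothetical and exist only for illustration.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Hypothetical cross-validation output for one partition: two cutoff windows,
# each holding the true values ("y") and one forecast per model.
windows = [
    {"y": [10.0, 12.0], "AutoARIMA": [11.0, 12.0], "AutoETS": [10.0, 15.0]},
    {"y": [9.0, 11.0], "AutoARIMA": [9.0, 10.0], "AutoETS": [8.0, 13.0]},
]


def pick_best_model(windows):
    """Average each model's MSE over all windows and return the lowest."""
    model_names = [key for key in windows[0] if key != "y"]
    scores = {
        name: mean(mse(w["y"], w[name]) for w in windows) for name in model_names
    }
    best = min(scores, key=scores.get)
    return best, scores


best_model, scores = pick_best_model(windows)
print(best_model, scores)
```

With the template's default of `n_windows = 1`, the averaging step is a no-op, but it matters as soon as you evaluate over more than one window.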


> ✂️ Replace this with your own training logic.

In [None]:
model_classes = [AutoARIMA, AutoETS]
n_windows = 1


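def train_fn(config: dict):
    """Fit every model class on one data partition, cross-validate,
    and report the best-performing model with its MSE."""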
    data_partition_id = config["data_partition_id"]
    train_df = get_m5_partition(data_partition_id)

    models = [model_cls() for model_cls in model_classes]
    forecast_horizon = 4

    sf = StatsForecast(
        df=train_df,
        models=models,
        freq="D",
        n_jobs=n_windows * len(models),
    )
    cv_df = sf.cross_validation(
        h=forecast_horizon,
        step_size=forecast_horizon,
        n_windows=n_windows,
    )

    eval_df = evaluate_cross_validation(df=cv_df, metric=mean_squared_error)
    best_model = eval_df["best_model"][data_partition_id]
    forecast_mse = eval_df[best_model][data_partition_id]

    # Report the best-performing model and its corresponding eval metric.
    session.report({"forecast_mse": forecast_mse, "best_model": best_model})


trainable = train_fn
trainable = tune.with_resources(
    trainable, resources={"CPU": len(model_classes) * n_windows}
)


```{note}
`tune.with_resources` is used at the end to specify the resources to assign to *each trial*.
Feel free to change this to the resources required by your application! You can also comment out the `tune.with_resources` block to assign `1 CPU` (the default) to each trial.

Note that this is purely for Tune to know how many trials to schedule concurrently -- setting the number of CPUs does not actually enforce any kind of resource isolation!
```

See [Ray Tune's guide on assigning resources](https://docs.ray.io/en/latest/tune/tutorials/tune-resources.html) for more information.
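
To get a feel for how the per-trial CPU request affects scheduling, here is a rough back-of-the-envelope sketch. It assumes a hypothetical 8-CPU cluster, that CPUs are the only limiting resource, and that all trials take about the same time; the helper names are made up for this example and are not part of Ray.

```python
import math


def max_concurrent_trials(cluster_cpus: int, cpus_per_trial: int) -> int:
    """How many trials fit at once if CPUs are the only constraint."""
    if cpus_per_trial <= 0:
        raise ValueError("cpus_per_trial must be positive")
    return cluster_cpus // cpus_per_trial


def scheduling_waves(num_trials: int, concurrent: int) -> int:
    """Rough number of 'rounds' needed if all trials take similar time."""
    if concurrent <= 0:
        raise ValueError("the cluster cannot fit even one trial")
    return math.ceil(num_trials / concurrent)


# Mirrors the template: 2 model classes * 1 window = 2 CPUs per trial.
cpus_per_trial = 2 * 1
concurrent = max_concurrent_trials(cluster_cpus=8, cpus_per_trial=cpus_per_trial)
waves = scheduling_waves(num_trials=50, concurrent=concurrent)
print(concurrent, waves)
```

Under these assumptions, halving the CPUs requested per trial roughly doubles the number of trials running at once, at the cost of each trial having fewer cores to work with.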

> ✂️ Replace this with your desired hyperparameter search space!
>
> For example, this template searches over the data partition ID to train a model on.

In [None]:
data_partitions = list(pd.read_csv("item_ids.csv")["item_id"])
if NUM_DATA_PARTITIONS > len(data_partitions):
    print(f"There are only {len(data_partitions)} partitions!")

param_space = {
    "data_partition_id": tune.grid_search(data_partitions[:NUM_DATA_PARTITIONS]),
}
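
A grid search produces one trial per combination of grid values, which is why the number of data partitions equals the number of trials here. The sketch below illustrates that expansion in plain Python; `expand_grid`, the item IDs, and the extra `forecast_horizon` key are hypothetical and only show how the trial count would grow if you added a second grid-searched parameter.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Return one config dict per combination of the grid's values."""
    keys = list(grid)
    return [
        dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))
    ]


partition_ids = ["item_a", "item_b", "item_c"]  # hypothetical partition IDs

# One grid-searched key: one trial per partition.
trials = expand_grid({"data_partition_id": partition_ids})

# Two grid-searched keys: the trial count is the product of both lists.
trials_with_horizon = expand_grid(
    {"data_partition_id": partition_ids, "forecast_horizon": [4, 8]}
)
print(len(trials), len(trials_with_horizon))
```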


Run many model training using Ray Tune!

In [None]:
tuner = tune.Tuner(trainable, param_space=param_space)
result_grid = tuner.fit()


> ✂️ Replace the code below to inspect the metrics you reported in your training function.

In [None]:
sample_result = result_grid[0]
sample_result.metrics
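
Once every trial has reported, you will usually want a summary across partitions rather than a single result. The sketch below works on a hypothetical list of plain dicts shaped like what `train_fn` reports. With a real `result_grid`, each record could likely be built from `result.config["data_partition_id"]` and `result.metrics`, but treat that as an assumption to check against your Ray version.

```python
from collections import Counter

# Hypothetical per-trial records, shaped like the values reported by train_fn.
trial_records = [
    {"data_partition_id": "item_a", "best_model": "AutoARIMA", "forecast_mse": 1.5},
    {"data_partition_id": "item_b", "best_model": "AutoETS", "forecast_mse": 0.5},
    {"data_partition_id": "item_c", "best_model": "AutoARIMA", "forecast_mse": 2.5},
]

# Best model per data partition.
best_model_per_partition = {
    rec["data_partition_id"]: rec["best_model"] for rec in trial_records
}

# How often each model class won, and the average error of the winners.
wins = Counter(best_model_per_partition.values())
mean_mse = sum(rec["forecast_mse"] for rec in trial_records) / len(trial_records)

print(best_model_per_partition)
print(wins.most_common(), mean_mse)
```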
