# Scaling Many Model Training with Ray Tune

This template is a quickstart to using [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) for many model training, where a separate model is trained on each partition of a dataset. Ray Tune is one of many libraries under the [Ray AI Runtime](https://docs.ray.io/en/latest/ray-air/getting-started.html). See [this blog post](https://www.anyscale.com/blog/training-one-million-machine-learning-models-in-record-time-with-ray) for more information on the benefits of performing many model training with Ray!

This template walks through time-series forecasting using `statsforecast`, but the framework and data format can be swapped out easily -- they are there just to help you build your own application!

At a high level, this template will:

1. [Define the training function for a single partition of data.](https://docs.ray.io/en/latest/tune/tutorials/tune-run.html)
2. [Define a Tune search space to run training over many partitions of data.](https://docs.ray.io/en/latest/tune/tutorials/tune-search-spaces.html)
3. [Extract the best model per dataset partition from the Tune experiment output.](https://docs.ray.io/en/latest/tune/examples/tune_analyze_results.html)

> Slot in your code below wherever you see the ✂️ icon to build a many model training Ray application off of this template!

## Handling Dependencies

This template requires certain Python packages to be available to every node in the cluster.

> ✂️ Add your own package dependencies in the `requirements.txt` file!


In [None]:
requirements_path = "./requirements.txt"


In [None]:
with open(requirements_path, "r") as f:
    requirements = f.read().strip().splitlines()

print("Requirements:")
print("\n".join(requirements))


First, we may want to use these modules right here in our script, which is running on the head node.
Install the Python packages on the head node using `pip install`.

```{note}
You may need to restart this notebook kernel to access the installed packages.
```


In [None]:
%pip install -r {requirements_path} --upgrade

Next, we need to make sure all worker nodes also have access to the dependencies.
For this, use a [Ray Runtime Environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments)
to dynamically set up dependencies throughout the cluster.


In [None]:
import ray

ray.init(runtime_env={"pip": requirements})


In [None]:
import pandas as pd
from pyarrow import parquet as pq
from sklearn.metrics import mean_squared_error

from ray import tune
from ray.air import session


> ✂️ Replace this value to change the number of data partitions you will use (<= 5000 for this dataset). This will be the total number of Tune trials you will run!
>
> Note that this template fits two models per data partition and reports the best performing one.

In [None]:
NUM_DATA_PARTITIONS: int = 500


> ✂️ Replace the following with your own data-loading and evaluation helper functions. (Or, just delete these!)

In [None]:
def get_m5_partition(unique_id: str) -> pd.DataFrame:
    df = (
        pq.read_table(
            "s3://anonymous@m5-benchmarks/data/train/target.parquet",
            columns=["item_id", "timestamp", "demand"],
            filters=[("item_id", "=", unique_id)],
        )
        .to_pandas()
        .rename(columns={"item_id": "unique_id", "timestamp": "ds", "demand": "y"})
    )
    df["unique_id"] = df["unique_id"].astype(str)
    df["ds"] = pd.to_datetime(df["ds"])
    return df.dropna()


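def evaluate_cross_validation(df: pd.DataFrame, metric) -> pd.DataFrame:
    """Score each model per series and record the best (lowest-metric) one."""
    # Every column other than these is a model's forecast. This assumes
    # `unique_id` is the index of `df` rather than a regular column.
    models = df.drop(columns=["ds", "cutoff", "y"]).columns.tolist()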
    evals = []
    for model in models:
        eval_ = (
            df.groupby(["unique_id", "cutoff"])
            .apply(lambda x: metric(x["y"].values, x[model].values))
            .to_frame()
        )
        eval_.columns = [model]
        evals.append(eval_)
    evals = pd.concat(evals, axis=1)
    evals = evals.groupby(["unique_id"]).mean(numeric_only=True)
    evals["best_model"] = evals.idxmin(axis=1)
    return evals
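
The selection logic in `evaluate_cross_validation` can be hard to see through the `groupby` calls. The following is a minimal pure-Python sketch of the same idea for a single partition: average each model's error across cross-validation windows, then keep the model with the lowest average. The function name `pick_best_model` and the error values are hypothetical and only serve as an illustration.

```python
from statistics import mean


def pick_best_model(errors_by_model: dict) -> tuple:
    """Average each model's per-window error and return (best_name, best_error)."""
    mean_errors = {name: mean(errs) for name, errs in errors_by_model.items()}
    # Lower error is better, so take the minimum over the averaged errors.
    best_name = min(mean_errors, key=mean_errors.get)
    return best_name, mean_errors[best_name]


# Hypothetical per-window MSE values for one partition (made-up numbers).
window_mse = {"AutoARIMA": [4.0, 6.0], "AutoETS": [3.0, 5.0]}
best_name, best_mse = pick_best_model(window_mse)
print(best_name, best_mse)
```

The helper above does the same thing with pandas, for every `unique_id` at once, and stores the winner in the `best_model` column.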


> ✂️ Replace this with your own training logic.

In [None]:
def train_fn(config: dict):
    try:
        from statsforecast import StatsForecast
        from statsforecast.models import AutoARIMA, AutoETS
    except ImportError as e:
        raise RuntimeError("Did you set a runtime env to install dependencies?") from e

    data_partition_id = config["data_partition_id"]
    train_df = get_m5_partition(data_partition_id)

    models = [AutoARIMA(), AutoETS()]
    n_windows = 1
    forecast_horizon = 4

    sf = StatsForecast(
        df=train_df,
        models=models,
        freq="D",
        n_jobs=n_windows * len(models),
    )
    cv_df = sf.cross_validation(
        h=forecast_horizon,
        step_size=forecast_horizon,
        n_windows=n_windows,
    )

    eval_df = evaluate_cross_validation(df=cv_df, metric=mean_squared_error)
    best_model = eval_df["best_model"][data_partition_id]
    forecast_mse = eval_df[best_model][data_partition_id]

    # Report the best-performing model and its corresponding eval metric.
    session.report({"forecast_mse": forecast_mse, "best_model": best_model})


trainable = train_fn
trainable = tune.with_resources(trainable, resources={"CPU": 2 * 1})


```{note}
`tune.with_resources` is used at the end to specify the resources to assign to *each trial*.
Feel free to change this to the resources required by your application! You can also comment out the `tune.with_resources` block to assign `1 CPU` (the default) to each trial.

Note that this is purely for Tune to know how many trials to schedule concurrently -- setting the number of CPUs does not actually enforce any kind of resource isolation!
In this template, `statsforecast` runs cross validation in parallel with M models * N temporal cross-validation windows (e.g. 2 * 1).
```

See [Ray Tune's guide on assigning resources](https://docs.ray.io/en/latest/tune/tutorials/tune-resources.html) for more information.
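
As a rough illustration of how the per-trial CPU request affects scheduling, the sketch below estimates how many trials could run at once. It assumes CPUs are the only limiting resource and uses a hypothetical 16-CPU cluster. The helper `max_concurrent_trials` is not part of Ray. It is only back-of-the-envelope arithmetic.

```python
def max_concurrent_trials(cluster_cpus: int, cpus_per_trial: int) -> int:
    """Estimate how many trials fit at once when CPUs are the only constraint."""
    if cpus_per_trial <= 0:
        raise ValueError("cpus_per_trial must be positive")
    return cluster_cpus // cpus_per_trial


# This template requests 2 models * 1 cross-validation window = 2 CPUs per trial.
cpus_per_trial = 2 * 1
cluster_cpus = 16  # hypothetical cluster size
concurrent = max_concurrent_trials(cluster_cpus, cpus_per_trial)
print(f"Up to {concurrent} trials can run concurrently")
```

Requesting fewer CPUs per trial raises the number of concurrent trials, but since no resource isolation is enforced, trials may then compete for the same cores.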

> ✂️ Replace this with your desired hyperparameter search space!
>
> For example, this template searches over the data partition ID to train a model on.

In [None]:
# Download the list of item ids used to partition the dataset.
data_partitions = list(
    pd.read_csv(
        "https://air-example-data.s3.us-west-2.amazonaws.com/m5_benchmarks_item_ids.csv"
    )["item_id"]
)
if NUM_DATA_PARTITIONS > len(data_partitions):
    print(
        f"There are only {len(data_partitions)} partitions! "
        "All of them will be used."
    )

param_space = {
    "data_partition_id": tune.grid_search(data_partitions[:NUM_DATA_PARTITIONS]),
}
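
A grid search produces one trial per listed value, so a single grid-searched key with N partition IDs yields N trials. The sketch below illustrates that expansion in plain Python. The function `expand_grid` and the partition IDs are hypothetical, and this is not how Tune implements it internally.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Return one config dict per combination of the listed values."""
    keys = list(grid)
    combos = product(*(grid[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


# Hypothetical partition IDs standing in for the real item IDs.
partition_ids = ["item_a", "item_b", "item_c"]
configs = expand_grid({"data_partition_id": partition_ids})
for config in configs:
    print(config)
```

Each resulting dict plays the role of the `config` argument that `train_fn` receives in one trial.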


Run many model training using Ray Tune!

In [None]:
tuner = tune.Tuner(trainable, param_space=param_space)
result_grid = tuner.fit()


View the reported results of all trials as a dataframe.

In [None]:
results_df = result_grid.get_dataframe()
results_df
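
To get the best model per partition out of these results, one option is to work with the rows as plain dictionaries. The sketch below assumes rows shaped like `results_df.to_dict("records")`, with the trial config flattened into a `config/data_partition_id` column alongside the reported `best_model` and `forecast_mse`. The records here are made up for illustration.

```python
from collections import Counter

# Hypothetical rows, shaped like `results_df.to_dict("records")`.
records = [
    {"config/data_partition_id": "item_a", "best_model": "AutoETS", "forecast_mse": 1.5},
    {"config/data_partition_id": "item_b", "best_model": "AutoARIMA", "forecast_mse": 0.8},
    {"config/data_partition_id": "item_c", "best_model": "AutoETS", "forecast_mse": 2.1},
]

# Map each partition to the model that scored best on it.
best_model_by_partition = {
    row["config/data_partition_id"]: row["best_model"] for row in records
}

# Count how often each model type came out on top.
wins = Counter(best_model_by_partition.values())
print(best_model_by_partition)
print(wins.most_common())
```

The same lookup can be done directly on the dataframe by selecting those three columns. The column names should be checked against the actual `results_df` before relying on them.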
