# Getting Data in and out of Tune

Often, you will find yourself needing to pass data into Tune [Trainables](tune_60_seconds_trainables) (datasets, models, other large parameters) and get data out of them (metrics, checkpoints, other artifacts). In this guide, we'll explore different ways of doing that and see in what circumstances they should be used.

1. Getting data in  
    1.1 Search space  
    1.2 Parameters  
    1.3 Loading data in Trainable  
2. Getting data out  
    2.1 Reporting metrics  
    2.2 Logging metrics with callbacks  
    2.3 Checkpoints & other artifacts  

Let's start by defining a simple Trainable function. We'll be expanding this function with different functionality as we go.

In [1]:
# !pip install mlflow

In [2]:
import random
import time
import pandas as pd


def training_function(config, checkpoint_dir=None):
    # For now, we have nothing here.
    data = None
    model = {"hyperparameter_a": None, "hyperparameter_b": None}
    epochs = 0

    # Simulate training & evaluation - we obtain back a "metric" and a "trained_model".
    for epoch in range(epochs):
        # Simulate doing something expensive.
        time.sleep(1)
        metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** (
            -1
        ) + model["hyperparameter_b"] * 0.1 * data["A"].sum()
        trained_model = {"state": model, "epoch": epoch}

Our `training_function` function requires a pandas DataFrame, a model with some hyperparameters and the number of epochs to train the model for as inputs. The hyperparameters of the model impact the metric returned, and in each epoch (iteration of training), the `trained_model` state is changed.

We will run hyperparameter optimization using the [Tuner API](tune-run-ref).

In [3]:
from ray.tune import Tuner
from ray import tune

tuner = Tuner(training_function, tune_config=tune.TuneConfig(num_samples=4))



## Getting data in

The first order of business is to provide the inputs for the Trainable. We can broadly separate them into two categories: variables and constants.

Variables are the parameters we want to tune. They will be different for every [Trial](tune_60_seconds_trials). For example, those may be the learning rate and batch size for a neural network, number of trees and the maximum depth for a random forest, or the data partition if you are using Tune as an execution engine for batch training.

Constants are the parameters that are the same for every Trial. Those can be the number of epochs, model hyperparameters we want to set but not tune, the dataset and so on. Often, the constants will be quite large (e.g. the dataset or the model).

### Search space

The first way of passing inputs into Trainables is the [*search space*](tune_60_seconds_search_spaces) (it may also be called *parameter space* or *config*). In the Trainable itself, it maps to the `config` dict passed in as an argument to the function. You define the search space using the `param_space` argument of the `Tuner`. The search space is a dict whose values may be [*distributions*](<tune-sample-docs>), which sample a different value for each Trial, or constants. It may also contain nested dictionaries, which in turn can contain distributions.
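To make the idea of a nested search space concrete, here is a plain-Python sketch of how a space resolves into one concrete `config` per Trial. It is only a conceptual illustration and not how Tune is implemented: the `uniform` helper and the `resolve` function are hypothetical stand-ins for Tune's distributions and its sampling logic.

```python
import random


# Hypothetical stand-in for a Tune distribution such as tune.uniform:
# a zero-argument callable that draws one value each time it is called.
def uniform(low, high):
    return lambda: random.uniform(low, high)


def resolve(space):
    """Return a concrete config: sample callables, recurse into dicts, keep constants."""
    config = {}
    for key, value in space.items():
        if isinstance(value, dict):
            config[key] = resolve(value)
        elif callable(value):
            config[key] = value()
        else:
            config[key] = value
    return config


space = {
    "model": {
        "hyperparameter_a": uniform(0, 20),
        "hyperparameter_b": uniform(-100, 100),
    },
    "epochs": 10,
}

# One resolved config per Trial: sampled values differ, constants do not.
trial_configs = [resolve(space) for _ in range(4)]
for config in trial_configs:
    print(config)
```

Inside a Trainable, such a nested value would be read as `config["model"]["hyperparameter_a"]`.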

```{warning}
Each value in the search space will be saved directly in the Trial metadata. This means that every value in the search space **must** be serializable and take up a small amount of memory.
```

For example, passing a large pandas DataFrame or an unserializable model object as a value in the search space will at best cause large slowdowns and high disk usage, because the Trial metadata saved to disk will also contain this data. At worst, it will raise an exception because the data cannot be sent to the Trial workers. For more details, see {ref}`tune-bottlenecks`.

Instead, use strings or other identifiers as your values, and initialize/load the objects inside your Trainable directly depending on those.
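The following sketch illustrates the difference between the two approaches. The `DATASET_LOADERS` registry and the `dataset_id` key are hypothetical names chosen for this illustration, and the pickled size is used only as a rough proxy for how much data would end up in the Trial metadata.

```python
import pickle

# Hypothetical registry mapping small identifiers to loader functions.
DATASET_LOADERS = {
    "small": lambda: list(range(10)),
    "large": lambda: list(range(1_000_000)),
}


def training_function(config):
    # The heavy object is created inside the Trainable from the identifier.
    data = DATASET_LOADERS[config["dataset_id"]]()
    return len(data)


# Preferred: the config holds only a short string.
config_with_id = {"dataset_id": "large", "epochs": 10}
# Discouraged: the config holds the whole object.
config_with_object = {"dataset": DATASET_LOADERS["large"](), "epochs": 10}

size_with_id = len(pickle.dumps(config_with_id))
size_with_object = len(pickle.dumps(config_with_object))
print(f"config with identifier: {size_with_id} bytes")
print(f"config with object:     {size_with_object} bytes")
```

The identifier-based config stays tiny regardless of how large the dataset is, while the object-based config grows with the data.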

```{note}
[Ray Datasets](datasets_getting_started) can be used as values in the search space directly.
```

In our example, we want to tune the two model hyperparameters. We also want to set the number of epochs, so that we can easily tweak it later. For the hyperparameters, we will use the `tune.uniform` distribution. We will also modify the `training_function` to obtain those values from the `config` dictionary.

In [4]:
def training_function(config, checkpoint_dir=None):
    # For now, we have nothing here.
    data = None

    model = {
        "hyperparameter_a": config["hyperparameter_a"],
        "hyperparameter_b": config["hyperparameter_b"],
    }
    epochs = config["epochs"]

    # Simulate training & evaluation - we obtain back a "metric" and a "trained_model".
    for epoch in range(epochs):
        # Simulate doing something expensive.
        time.sleep(1)
        metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** (
            -1
        ) + model["hyperparameter_b"] * 0.1 * data["A"].sum()
        trained_model = {"state": model, "epoch": epoch}


tuner = Tuner(
    training_function,
    param_space={
        "hyperparameter_a": tune.uniform(0, 20),
        "hyperparameter_b": tune.uniform(-100, 100),
        "epochs": 10,
    },
    tune_config=tune.TuneConfig(num_samples=4),
)

### Parameters

If we have large objects that are constant across Trials, we can use the [`tune.with_parameters`](tune-with-parameters) utility to pass them into the Trainable directly. The objects will be stored in the [Ray object store](serialization-guide) so that each Trial worker may access them to obtain a local copy to use in its process.

```{warning}
Objects put into the Ray object store must be serializable.
```

Note that the serialization (once) and deserialization (for each Trial) of large objects may incur a performance overhead.

In our example, we will pass the `data` DataFrame using `tune.with_parameters`. In order to do that, we need to modify our function signature to include `data` as an argument.

In [5]:
def training_function(config, data, checkpoint_dir=None):
    model = {
        "hyperparameter_a": config["hyperparameter_a"],
        "hyperparameter_b": config["hyperparameter_b"],
    }
    epochs = config["epochs"]

    # Simulate training & evaluation - we obtain back a "metric" and a "trained_model".
    for epoch in range(epochs):
        # Simulate doing something expensive.
        time.sleep(1)
        metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** (
            -1
        ) + model["hyperparameter_b"] * 0.1 * data["A"].sum()
        trained_model = {"state": model, "epoch": epoch}


tuner = Tuner(
    training_function,
    param_space={
        "hyperparameter_a": tune.uniform(0, 20),
        "hyperparameter_b": tune.uniform(-100, 100),
        "epochs": 10,
    },
)

The next step is to wrap the `training_function` using `tune.with_parameters` before passing it into the `Tuner`. Every keyword argument of the `tune.with_parameters` call will be mapped to the keyword arguments in the Trainable signature.

In [6]:
data = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

tuner = Tuner(
    tune.with_parameters(training_function, data=data),
    param_space={
        "hyperparameter_a": tune.uniform(0, 20),
        "hyperparameter_b": tune.uniform(-100, 100),
        "epochs": 10,
    },
    tune_config=tune.TuneConfig(num_samples=4),
)
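As a loose analogy for the keyword-argument mapping, `functools.partial` from the standard library binds a constant to a function in a similar way. This is only an analogy for the calling convention: unlike `tune.with_parameters`, `functools.partial` does not use the Ray object store. A plain dict stands in for the DataFrame here so that the sketch has no dependencies.

```python
import functools


def training_function(config, data):
    # `data` arrives as a regular keyword argument next to `config`.
    return config["hyperparameter_b"] * 0.1 * sum(data["A"])


# Plain dict standing in for the pandas DataFrame used in this guide.
data = {"A": [1, 2, 3], "B": [4, 5, 6]}

# Bind the constant once; callers then only supply the per-Trial config.
bound = functools.partial(training_function, data=data)

result = bound({"hyperparameter_b": 10.0})
print(result)
```

The name used in the binding (`data=...`) must match the parameter name in the function signature, just as it does for `tune.with_parameters`.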

### Loading data in Trainable

You can also load data directly inside the Trainable, e.g. from cloud storage, NFS, or local disk.

```{warning}
When loading from disk, ensure that all nodes in your cluster have access to the file you are trying to load.
```

A common use-case is to load the dataset from S3 or any other cloud storage with pandas, arrow or any other framework.

The working directory of the Trainable worker will be automatically changed to the corresponding Trial directory. For more details, see {ref}`tune-working-dir`.
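A minimal sketch of this pattern is shown below, assuming a small CSV file on local disk. The `data_path` config key and the `load_column_sum` helper are hypothetical, and the standard `csv` module is used only to keep the sketch dependency-free; a pandas or arrow reader would slot into the same place.

```python
import csv
import os
import tempfile


def load_column_sum(path, column):
    """Read a CSV file and return the sum of one numeric column."""
    with open(path, newline="") as f:
        return sum(float(row[column]) for row in csv.DictReader(f))


def training_function(config):
    # Only the path travels through the config; the data is read inside the Trainable.
    data_sum = load_column_sum(config["data_path"], "A")
    return config["hyperparameter_b"] * 0.1 * data_sum


# Write a tiny example file. An absolute path is used because the worker's
# working directory is changed to the Trial directory.
data_path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(data_path, "w", newline="") as f:
    csv.writer(f).writerows([["A", "B"], [1, 4], [2, 5], [3, 6]])

result = training_function({"hyperparameter_b": 10.0, "data_path": data_path})
print(result)
```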

For brevity, we will not use this approach in our example.

We can now execute our tuning run, though it will not yet produce any meaningful outputs.

In [7]:
results = tuner.fit()

- Current time: 2022-11-29 00:00:27
- Running for: 00:00:15.01
- Memory: 6.7/31.0 GiB

| Trial name | status | loc | hyperparameter_a | hyperparameter_b |
|---|---|---|---|---|
| training_function_cb758_00000 | TERMINATED | 172.31.43.110:566809 | 13.1449 | 33.7213 |
| training_function_cb758_00001 | TERMINATED | 172.31.43.110:566874 | 17.2479 | -84.5562 |
| training_function_cb758_00002 | TERMINATED | 172.31.43.110:566876 | 19.0947 | -99.0795 |
| training_function_cb758_00003 | TERMINATED | 172.31.43.110:566878 | 14.6521 | 50.3226 |


Trial training_function_cb758_00000 completed. Last result: 
Trial training_function_cb758_00001 completed. Last result: 
Trial training_function_cb758_00003 completed. Last result: 
Trial training_function_cb758_00002 completed. Last result: 


2022-11-29 00:00:27,573	INFO tune.py:762 -- Total run time: 15.72 seconds (15.01 seconds for the tuning loop).


## Getting data out

We can now run tuning with the `training_function` Trainable. The next step is to report *metrics* to Tune that can be used to guide the optimization. We will also want to *checkpoint* our trained models so that we can resume the training after an interruption, and to use them for prediction later.

The [`ray.air.session`](air-session-ref) API is used to get data out of the Trainable workers. `session.report` can be called multiple times in the Trainable function. Each call corresponds to one iteration (epoch, step, tree) of training.

### Reporting metrics

*Metrics* are values passed through the `metrics` argument in a `session.report` call. Metrics can be used by Tune [Search Algorithms](search-alg-ref) and [Schedulers](schedulers-ref) to direct the search. After the tuning run is complete, you can [analyze the results](/tune/examples/tune_analyze_results), which include the reported metrics.

```{warning}
Similarly to search space values, each value reported as a metric will be saved directly in the Trial metadata. This means that every value reported as a metric **must** be serializable and take up a small amount of memory.
```

```{note}
Tune will automatically include some metrics, such as the training iteration, timestamp and more.
```

In our example, we want to maximize the `metric`. We will report it each epoch to Tune, and set the `metric` and `mode` arguments in `tune.TuneConfig` to let Tune know that it should use it as the optimization objective.

In [8]:
from ray.air import session


def training_function(config, data, checkpoint_dir=None):
    model = {
        "hyperparameter_a": config["hyperparameter_a"],
        "hyperparameter_b": config["hyperparameter_b"],
    }
    epochs = config["epochs"]

    # Simulate training & evaluation - we obtain back a "metric" and a "trained_model".
    for epoch in range(epochs):
        # Simulate doing something expensive.
        time.sleep(1)
        metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** (
            -1
        ) + model["hyperparameter_b"] * 0.1 * data["A"].sum()
        trained_model = {"state": model, "epoch": epoch}
        session.report(metrics={"metric": metric})


tuner = Tuner(
    tune.with_parameters(training_function, data=data),
    param_space={
        "hyperparameter_a": tune.uniform(0, 20),
        "hyperparameter_b": tune.uniform(-100, 100),
        "epochs": 10,
    },
    tune_config=tune.TuneConfig(num_samples=4, metric="metric", mode="max"),
)

### Logging metrics with callbacks

Every metric logged using `session.report` can be accessed during the tuning run through Tune [Callbacks](tune-logging). Ray AIR provides [several built-in integrations](air-builtin-callbacks) with popular frameworks, such as MLFlow, Weights & Biases, CometML and more. You can also use the [Callback API](tune-callbacks-docs) to create your own callbacks.

Callbacks are passed in the `callbacks` argument of the `Tuner`'s `RunConfig`.

In our example, we'll use the MLFlow callback to track the progress of our tuning run and the changing value of the `metric`.

In [None]:
from ray.air import RunConfig
from ray.air.integrations.mlflow import MLflowLoggerCallback


def training_function(config, data, checkpoint_dir=None):
    model = {
        "hyperparameter_a": config["hyperparameter_a"],
        "hyperparameter_b": config["hyperparameter_b"],
    }
    epochs = config["epochs"]

    # Simulate training & evaluation - we obtain back a "metric" and a "trained_model".
    for epoch in range(epochs):
        # Simulate doing something expensive.
        time.sleep(1)
        metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** (
            -1
        ) + model["hyperparameter_b"] * 0.1 * data["A"].sum()
        trained_model = {"state": model, "epoch": epoch}
        session.report(metrics={"metric": metric})


tuner = Tuner(
    tune.with_parameters(training_function, data=data),
    param_space={
        "hyperparameter_a": tune.uniform(0, 20),
        "hyperparameter_b": tune.uniform(-100, 100),
        "epochs": 10,
    },
    tune_config=tune.TuneConfig(num_samples=4, metric="metric", mode="max"),
    run_config=RunConfig(
        callbacks=[MLflowLoggerCallback(experiment_name="example")]
    ),
)

### Checkpoints & other artifacts

Aside from metrics, you may want to save the state of your trained model and any other artifacts, both to resume after a training failure and to inspect or use them later. Those cannot be saved as metrics, as they are often far too large and may not be easily serializable. They should also be persisted on disk or cloud storage so that they remain accessible after the Tune run is interrupted or terminated.

Ray AIR (which contains Ray Tune) provides a [`Checkpoint`](air-checkpoints-doc) API for that purpose. `Checkpoint` objects can be created from various sources (dictionaries, directories, cloud storage) and used between different AIR components.

In Tune, `Checkpoints` are created by the user in their Trainable functions and reported using the optional `checkpoint` argument of `session.report`. `Checkpoints` can contain arbitrary data and can be freely passed around the Ray cluster. After a tuning run is over, `Checkpoints` can be [obtained from the results](/tune/examples/tune_analyze_results).

Ray Tune can be configured to [automatically sync checkpoints to cloud storage](tune-checkpoint-syncing), [keep only a certain number of checkpoints to save space](air-session-ref) and more.

```{note}
The experiment state itself is checkpointed separately. See {ref}`tune-two-types-of-ckpt` for more details.
```

In our example, we want to be able to resume the training from the latest checkpoint, and to save the `trained_model` in a checkpoint every iteration. To accomplish this, we will use the `session` and `Checkpoint` APIs.

In [10]:
from ray.air import Checkpoint


def training_function(config, data, checkpoint_dir=None):
    model = {
        "hyperparameter_a": config["hyperparameter_a"],
        "hyperparameter_b": config["hyperparameter_b"],
    }
    epochs = config["epochs"]

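    # Load the checkpoint, if there is any.
    loaded_checkpoint = session.get_checkpoint()
    if loaded_checkpoint is not None:
        # The checkpoint is created from {"model": trained_model} below,
        # so the epoch is nested under the "model" key.
        last_epoch = loaded_checkpoint.to_dict()["model"]["epoch"] + 1
    else:
        last_epoch = 0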

    # Simulate training & evaluation - we obtain back a "metric" and a "trained_model".
    for epoch in range(last_epoch, epochs):
        # Simulate doing something expensive.
        time.sleep(1)
        metric = (0.1 + model["hyperparameter_a"] * epoch / 100) ** (
            -1
        ) + model["hyperparameter_b"] * 0.1 * data["A"].sum()
        trained_model = {"state": model, "epoch": epoch}

        # Create the checkpoint.
        checkpoint = Checkpoint.from_dict({"model": trained_model})
        session.report(metrics={"metric": metric}, checkpoint=checkpoint)


tuner = Tuner(
    tune.with_parameters(training_function, data=data),
    param_space={
        "hyperparameter_a": tune.uniform(0, 20),
        "hyperparameter_b": tune.uniform(-100, 100),
        "epochs": 10,
    },
    tune_config=tune.TuneConfig(num_samples=4, metric="metric", mode="max"),
    run_config=RunConfig(
        callbacks=[MLflowLoggerCallback(experiment_name="example")]
    ),
)
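The resume logic can be exercised in isolation with plain dictionaries standing in for the checkpoint contents. This is a simplified sketch, not the AIR `Checkpoint` API: `starting_epoch` and `run` are hypothetical helpers. Since the checkpoint is created from `{"model": trained_model}`, the epoch sits one level down, under the `"model"` key.

```python
def starting_epoch(checkpoint_dict):
    """Return the epoch to resume from, given a checkpoint dict or None."""
    if checkpoint_dict is None:
        return 0
    # The epoch is nested because the checkpoint stores {"model": trained_model}.
    return checkpoint_dict["model"]["epoch"] + 1


def run(epochs, checkpoint_dict=None):
    """Simulate the loop; return the executed epochs and the last checkpoint dict."""
    executed = []
    for epoch in range(starting_epoch(checkpoint_dict), epochs):
        executed.append(epoch)
        checkpoint_dict = {"model": {"state": {}, "epoch": epoch}}
    return executed, checkpoint_dict


# The first run stops after 4 epochs; the second picks up from the saved state.
first_epochs, saved = run(epochs=4)
resumed_epochs, _ = run(epochs=10, checkpoint_dict=saved)
print(first_epochs)
print(resumed_epochs)
```

No epoch is repeated or skipped across the two runs, which is the property the `+ 1` in the resume logic is there to guarantee.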

With all of those changes implemented, we can now run our tuning and obtain meaningful metrics and artifacts.

In [12]:
results = tuner.fit()
results

- Current time: 2022-11-29 00:01:07
- Running for: 00:00:15.49
- Memory: 7.7/31.0 GiB

| Trial name | status | loc | hyperparameter_a | hyperparameter_b | iter | total time (s) | metric |
|---|---|---|---|---|---|---|---|
| training_function_e3653_00000 | TERMINATED | 172.31.43.110:567338 | 16.5098 | 56.7657 | 10 | 10.3257 | 34.69 |
| training_function_e3653_00001 | TERMINATED | 172.31.43.110:567406 | 3.19302 | 94.0236 | 10 | 10.3026 | 58.9957 |
| training_function_e3653_00002 | TERMINATED | 172.31.43.110:567407 | 12.8895 | 14.6021 | 10 | 10.2733 | 9.55486 |
| training_function_e3653_00003 | TERMINATED | 172.31.43.110:567409 | 12.4421 | 91.7 | 10 | 10.2 | 55.8398 |


2022-11-29 00:01:07,624	INFO tune.py:762 -- Total run time: 15.61 seconds (15.48 seconds for the tuning loop).


<ray.tune.result_grid.ResultGrid at 0x7f62edb96430>

For more information on how to interact with the returned `ResultGrid` object, see {doc}`/tune/examples/tune_analyze_results`.
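As a sanity check, the final metrics in the output above can be reproduced by hand from the sampled hyperparameters. The sketch below evaluates the formula from the training loop at the last epoch (epoch 9 of 10) with `data["A"].sum()` equal to 6. The `final_metric` helper is a hypothetical name, and the sketch assumes that trials are compared by their last reported value.

```python
def final_metric(hyperparameter_a, hyperparameter_b, last_epoch=9, data_sum=6):
    """Evaluate the metric from the training loop for a single epoch."""
    inverse_term = (0.1 + hyperparameter_a * last_epoch / 100) ** (-1)
    return inverse_term + hyperparameter_b * 0.1 * data_sum


# (hyperparameter_a, hyperparameter_b, reported metric) as printed in the output above.
table = [
    (16.5098, 56.7657, 34.69),
    (3.19302, 94.0236, 58.9957),
    (12.8895, 14.6021, 9.55486),
    (12.4421, 91.7, 55.8398),
]

recomputed = [final_metric(a, b) for a, b, _ in table]
for (a, b, reported), value in zip(table, recomputed):
    print(f"reported={reported}  recomputed={value:.4f}")

# With mode="max", the Trial with the largest final metric is the best one.
best_index = max(range(len(recomputed)), key=lambda i: recomputed[i])
print("best trial index:", best_index)
```

Small differences in the last digits are expected, because the hyperparameters in the printed table are rounded.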