RLlib Offline Datasets

Working with Offline Datasets

RLlib's I/O APIs enable you to work with datasets of experiences read from offline storage (e.g., disk, cloud storage, streaming systems, HDFS). For example, you might want to read experiences saved from previous training runs, or gathered from policies deployed in web applications. You can also log new agent experiences produced during online training for future use.

RLlib represents trajectory sequences (i.e., (s, a, r, s', ...) tuples) with SampleBatch objects. Using a batch format enables efficient encoding and compression of experiences. During online training, RLlib uses policy evaluation actors to generate batches of experiences in parallel using the current policy. RLlib also uses this same batch format for reading and writing experiences to offline storage.
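As a rough illustration of the batch idea (not RLlib's actual SampleBatch class), the sketch below regroups per-step tuples into one column per field. The field names and the helper `to_columnar_batch` are assumptions made for this example only.

```python
# Hypothetical per-step transitions: (obs, action, reward, new_obs, done).
transitions = [
    ([0.0, 0.1], 1, 1.0, [0.1, 0.2], False),
    ([0.1, 0.2], 0, 1.0, [0.2, 0.1], False),
    ([0.2, 0.1], 1, 1.0, [0.3, 0.0], True),
]


def to_columnar_batch(rows):
    """Group per-step tuples into one list per field (illustrative only)."""
    keys = ("obs", "actions", "rewards", "new_obs", "dones")
    columns = zip(*rows)  # transpose: one tuple per field instead of per step
    return {key: list(column) for key, column in zip(keys, columns)}


batch = to_columnar_batch(transitions)
print(batch["actions"])
print(batch["rewards"])
```

Storing each field as one homogeneous column, rather than as many small tuples, is what makes a batch straightforward to serialize and compress as a unit.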

Example: Training on previously saved experiences

In this example, we will save batches of experiences generated during online training to disk, and then leverage this saved data to train a policy offline using DQN. First, we run a simple policy gradient algorithm for 100k steps with "output": "/tmp/cartpole-out" to tell RLlib to write simulation outputs to the /tmp/cartpole-out directory.

$ rllib train \
    --run=PG \
    --env=CartPole-v0 \
    --config='{"output": "/tmp/cartpole-out", "output_max_file_size": 5000000}' \
    --stop='{"timesteps_total": 100000}'

The experiences will be saved in compressed JSON batch format:

$ ls -l /tmp/cartpole-out
total 11636
-rw-rw-r-- 1 eric eric 5022257 output-2019-01-01_15-58-57_worker-0_0.json
-rw-rw-r-- 1 eric eric 5002416 output-2019-01-01_15-59-22_worker-0_1.json
-rw-rw-r-- 1 eric eric 1881666 output-2019-01-01_15-59-47_worker-0_2.json

Then, we can tell DQN to train using these previously generated experiences with "input": "/tmp/cartpole-out". We disable exploration since it has no effect on the input:

$ rllib train \
    --run=DQN \
    --env=CartPole-v0 \
    --config='{
        "input": "/tmp/cartpole-out",
        "exploration_final_eps": 0,
        "exploration_fraction": 0}'

Since the input experiences are not from running simulations, RLlib cannot report the true policy performance during training. However, you can use tensorboard --logdir=~/ray_results to monitor training progress via other metrics such as estimated Q-value:

.. image:: offline-q.png

In offline input mode, no simulations are run, though you still need to specify the environment in order to define the action and observation spaces. If true simulation is also possible (i.e., your env supports step()), you can also set "input_evaluation": "simulation" to tell RLlib to run background simulations to estimate current policy performance. The output of these simulations will not be used for learning.

Example: Converting external experiences to batch format

When the env does not support simulation (e.g., it is a web application), it is necessary to generate the *.json experience batch files outside of RLlib. This can be done by using the JsonWriter class to write out batches. This runnable example shows how to generate and save experience batches for CartPole-v0 to disk:

.. literalinclude:: ../../python/ray/rllib/examples/saving_experiences.py
   :language: python
   :start-after: __sphinx_doc_begin__
   :end-before: __sphinx_doc_end__

On-policy algorithms and experience postprocessing

RLlib assumes that input batches contain postprocessed experiences. This isn't typically critical for off-policy algorithms (e.g., DQN's postprocessing is only needed if n_step > 1 or worker_side_prioritization: True). For off-policy algorithms, you can also safely set the postprocess_inputs: True config to auto-postprocess data.

However, for on-policy algorithms like PPO, you'll need to pass in the extra values added during policy evaluation and postprocessing to batch_builder.add_values(), e.g., logits, vf_preds, value_target, and advantages for PPO. This is needed since the calculation of these values depends on the parameters of the behaviour policy, which RLlib does not have access to in the offline setting (in online training, these values are automatically added during policy evaluation).
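To make the postprocessed values more concrete, here is a minimal sketch of one common way to derive advantages and value targets from logged rewards and value predictions: generalized advantage estimation (GAE). It assumes the behaviour policy's value predictions (`vf_preds`) were logged alongside the rewards. The function name `compute_gae` and its parameters are hypothetical, and this is not RLlib's own implementation.

```python
def compute_gae(rewards, vf_preds, last_value=0.0, gamma=0.99, lambda_=0.95):
    """Return (advantages, value_targets) for a single trajectory.

    rewards:    per-step rewards r_t
    vf_preds:   value predictions V(s_t) logged from the behaviour policy
    last_value: V of the state after the final step (0.0 if the episode ended)
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - vf_preds[t]  # TD residual
        running = delta + gamma * lambda_ * running
        advantages[t] = running
        next_value = vf_preds[t]
    # The value target is the advantage added back onto the baseline.
    value_targets = [a + v for a, v in zip(advantages, vf_preds)]
    return advantages, value_targets


advantages, value_targets = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    vf_preds=[0.5, 0.5, 0.5],
    gamma=1.0,
    lambda_=1.0,
)
```

With `gamma=1.0` and `lambda_=1.0` the value targets reduce to the undiscounted returns-to-go, which is a convenient sanity check. The logits and value predictions themselves cannot be reconstructed this way; they have to be recorded when the behaviour policy acts.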

Note that for on-policy algorithms, you'll also have to throw away experiences generated by prior versions of the policy. This greatly reduces sample efficiency, which is typically undesirable for offline training, but can make sense for certain applications.

Mixing simulation and offline data

RLlib supports multiplexing inputs from multiple input sources, including simulation. In the following example, 40% of experiences are read from /tmp/cartpole-out, 30% from hdfs:/archive/cartpole, and the remaining 30% are produced via policy evaluation (the "sampler" source). Input sources are multiplexed using np.random.choice:

$ rllib train \
    --run=DQN \
    --env=CartPole-v0 \
    --config='{
        "input": {
            "/tmp/cartpole-out": 0.4,
            "hdfs:/archive/cartpole": 0.3,
            "sampler": 0.3
        },
        "exploration_final_eps": 0,
        "exploration_fraction": 0}'
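The weighted multiplexing described above can be sketched as follows. To keep the example dependency-free, the standard library's `random.choices` stands in for `np.random.choice`; the helper `choose_source` is hypothetical, and the sketch only illustrates the sampling step, not how RLlib reads from each source.

```python
import random
from collections import Counter

# Same source weights as in the config above.
input_weights = {
    "/tmp/cartpole-out": 0.4,
    "hdfs:/archive/cartpole": 0.3,
    "sampler": 0.3,
}


def choose_source(weights, rng):
    """Pick the source of the next batch with probability equal to its weight."""
    sources = list(weights)
    probs = [weights[source] for source in sources]
    return rng.choices(sources, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
num_draws = 10000
counts = Counter(choose_source(input_weights, rng) for _ in range(num_draws))
fractions = {source: n / num_draws for source, n in counts.items()}
print(fractions)
```

Because the source is drawn independently for each batch, the configured fractions hold only on average over many batches, not for any particular short window of training.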

Scaling I/O throughput

Similar to scaling online training, you can scale offline I/O throughput by increasing the number of RLlib workers via the num_workers config. Each worker accesses offline storage independently in parallel, for linear scaling of I/O throughput. Within each read worker, files are chosen in random order for reads, but file contents are read sequentially.
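The read pattern within a single worker (random file order, sequential reads inside each file) can be sketched roughly as below. The class `ShuffledJsonLineReader` is hypothetical, and the sketch assumes one JSON record per line; it is a simplification for illustration rather than RLlib's reader.

```python
import json
import os
import random
import tempfile


class ShuffledJsonLineReader:
    """Hypothetical reader: random file order, sequential reads within a file."""

    def __init__(self, paths, seed=None):
        self.paths = list(paths)
        self.rng = random.Random(seed)

    def read_pass(self):
        """Yield every record once, visiting the files in a random order."""
        order = self.paths[:]
        self.rng.shuffle(order)
        for path in order:
            with open(path) as f:
                for line in f:  # contents are consumed top to bottom
                    if line.strip():
                        yield json.loads(line)


# Demo: write three small files, then read them back through the reader.
tmp_dir = tempfile.mkdtemp()
paths = []
for file_idx in range(3):
    path = os.path.join(tmp_dir, "output-{}.json".format(file_idx))
    with open(path, "w") as f:
        for row_idx in range(4):
            f.write(json.dumps({"file": file_idx, "row": row_idx}) + "\n")
    paths.append(path)

records = list(ShuffledJsonLineReader(paths, seed=0).read_pass())
```

One consequence of this pattern is that consecutive batches from the same file stay in their recorded order, so any decorrelation has to come from the file-level shuffling and from running several workers in parallel.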

Input API

You can configure experience input for an agent using the following options:

.. literalinclude:: ../../python/ray/rllib/agents/agent.py
   :language: python
   :start-after: __sphinx_doc_input_begin__
   :end-before: __sphinx_doc_input_end__

The interface for a custom input reader is as follows:

.. autoclass:: ray.rllib.offline.InputReader
    :members:

Output API

You can configure experience output for an agent using the following options:

.. literalinclude:: ../../python/ray/rllib/agents/agent.py
   :language: python
   :start-after: __sphinx_doc_output_begin__
   :end-before: __sphinx_doc_output_end__

The interface for a custom output writer is as follows:

.. autoclass:: ray.rllib.offline.OutputWriter
    :members: