# Offline reinforcement learning with Ray AIR
In this example, we'll train a reinforcement learning agent using offline training.

Offline training means that the agent learns from experiences (environment observations and rewards, together with the actions the agent took) that have already been recorded and stored on disk. In contrast, online training samples experiences live by interacting with the environment.
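
To make the distinction concrete, here is a minimal sketch that uses only the Python standard library. The environment (`toy_env_step`), the random policy, and the file name are all hypothetical stand-ins and are not part of Ray or RLlib. Experiences are first collected online, written to disk, and then loaded again without touching the environment.

```python
import json
import os
import random
import tempfile


def toy_env_step(state: int, action: int):
    """Hypothetical 1-D environment: move left (0) or right (1), stop at +/-3."""
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if next_state == 3 else 0.0
    done = abs(next_state) >= 3
    return next_state, reward, done


def collect_online(num_episodes: int, seed: int = 0) -> list:
    """Online sampling: interact with the environment and record each transition."""
    rng = random.Random(seed)
    transitions = []
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            action = rng.choice([0, 1])  # random behavior policy
            next_state, reward, done = toy_env_step(state, action)
            transitions.append(
                {"obs": state, "action": action, "reward": reward, "done": done}
            )
            state = next_state
    return transitions


def save_offline(transitions: list, path: str) -> None:
    """Store one JSON record per line."""
    with open(path, "w") as f:
        for transition in transitions:
            f.write(json.dumps(transition) + "\n")


def load_offline(path: str) -> list:
    """Offline access: read recorded transitions back, no environment needed."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


data_path = os.path.join(tempfile.mkdtemp(), "experiences.jsonl")
recorded = collect_online(num_episodes=5)
save_offline(recorded, data_path)
offline_data = load_offline(data_path)
print(f"Loaded {len(offline_data)} recorded transitions from {data_path}")
```

An offline algorithm only ever sees `offline_data`; it cannot try out new actions to see what happens.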

Let's start by installing our dependencies:

In [1]:
!pip install -qU "ray[rllib]" gym

Now we can run some imports:

In [2]:
import argparse
import gym
import os

import numpy as np
import ray
from ray.ml import Checkpoint
from ray.ml.config import RunConfig
from ray.ml.predictors.integrations.rl.rl_predictor import RLPredictor
from ray.ml.train.integrations.rl.rl_trainer import RLTrainer
from ray.ml.result import Result
from ray.rllib.agents.marwil import BCTrainer
from ray.tune.tuner import Tuner



We will be training on offline data: full agent trajectories are stored on disk, and we train on these past experiences.

In practice, this data usually comes from external systems or a database of historical records. For this example, we'll generate some offline data ourselves and store it using RLlib's `output_config`.

In [3]:
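def generate_offline_data(path: str):
    """Train PPO on CartPole-v0 and write the sampled experiences to `path`."""
    print(f"Generating offline data for training at {path}")
    trainer = RLTrainer(
        algorithm="PPO",
        # Stop once the run reports at least 5000 sampled timesteps.
        run_config=RunConfig(stop={"timesteps_total": 5000}),
        config={
            "env": "CartPole-v0",
            # Write the sampled experiences as JSON files under `path`.
            "output": "dataset",
            "output_config": {
                "format": "json",
                "path": path,
                "max_num_samples_per_file": 1,
            },
            # Each sample batch contains only complete episodes.
            "batch_mode": "complete_episodes",
        },
    )
    trainer.fit()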

Here we define the training function. It creates an `RLTrainer` that uses `BCTrainer`, RLlib's behavior cloning algorithm, and trains it on the offline data read from `path`. The `CartPole-v0` environment is used only for evaluation, which samples live episodes.

In [4]:
def train_rl_bc_offline(path: str, num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting offline training")
    dataset = ray.data.read_json(
        path, parallelism=num_workers, ray_remote_args={"num_cpus": 1}
    )

    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config={
            "num_workers": num_workers,
            "use_gpu": use_gpu,
        },
        datasets={"train": dataset},
        algorithm=BCTrainer,
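        config={
            "env": "CartPole-v0",
            "framework": "tf",
            # Evaluate after every training iteration on one dedicated worker.
            "evaluation_num_workers": 1,
            "evaluation_interval": 1,
            # Evaluation samples live episodes from the environment
            # instead of reading the offline data.
            "evaluation_config": {"input": "sampler"},
        },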
    )

    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    # Only one trial is run, so take the first (and only) result.
    result = tuner.fit()[0]
    return result
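
Behavior cloning, which is what the `BCTrainer` in the cell above performs, treats the recorded data as a supervised learning problem: given an observation, predict the action that was taken. The following is a minimal sketch of that idea in plain Python, not RLlib's implementation. The two-feature observations, the rule that generates the "expert" actions, and the logistic-regression policy are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_bc_policy(observations, actions, lr=0.5, epochs=200):
    """Fit P(action=1 | obs) = sigmoid(w . obs + b) by gradient descent on the log loss."""
    dim = len(observations[0])
    w, b = [0.0] * dim, 0.0
    n = len(observations)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for obs, recorded_action in zip(observations, actions):
            logit = sum(wi * xi for wi, xi in zip(w, obs)) + b
            err = sigmoid(logit) - recorded_action
            for i in range(dim):
                grad_w[i] += err * obs[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def act(w, b, obs) -> int:
    """Greedy action of the cloned policy."""
    return int(sum(wi * xi for wi, xi in zip(w, obs)) + b > 0.0)


# Hypothetical recorded data: the "expert" chose action 1 whenever the
# second feature was positive.
rng = random.Random(0)
observations = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
actions = [int(obs[1] > 0) for obs in observations]

w, b = fit_bc_policy(observations, actions)
agreement = sum(act(w, b, o) == a for o, a in zip(observations, actions)) / len(actions)
print(f"Cloned policy agrees with the recorded actions on {agreement:.0%} of samples")
```

Because the objective only imitates the recorded actions, the cloned policy can be at most as good as the policy that generated the data.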

Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we define another utility function:

In [5]:
def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes: int) -> list:
    """Roll out the checkpointed policy and return the total reward of each episode."""
    predictor = RLPredictor.from_checkpoint(checkpoint)

    env = gym.make("CartPole-v0")

    rewards = []
    for _ in range(num_episodes):
        obs = env.reset()
        reward = 0.0
        done = False
        while not done:
            # The predictor works on batches: pass one observation
            # and take the first (and only) action.
            action = predictor.predict([obs])
            obs, r, done, _ = env.step(action[0])
            reward += r
        rewards.append(reward)

    return rewards
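
The loop above relies on two conventions: the predictor takes a batch of observations and returns a batch of actions, and the environment's `step` returns an `(obs, reward, done, info)` tuple. The sketch below runs the same loop with hypothetical stand-ins (`CountdownEnv` and `ConstantPredictor` are invented for this illustration and are not Ray or gym classes), so the control flow can be checked without a trained checkpoint.

```python
class CountdownEnv:
    """Hypothetical environment: every step yields reward 1, episodes last `horizon` steps."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [float(self.t)]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [float(self.t)], 1.0, done, {}


class ConstantPredictor:
    """Hypothetical predictor: returns action 0 for every observation in the batch."""

    def predict(self, batch):
        return [0 for _ in batch]


def rollout(predictor, env, num_episodes: int) -> list:
    """Same structure as the evaluation loop: one total reward per episode."""
    rewards = []
    for _ in range(num_episodes):
        obs = env.reset()
        episode_reward, done = 0.0, False
        while not done:
            action = predictor.predict([obs])  # batch with a single observation
            obs, r, done, _ = env.step(action[0])  # unpack the single action
            episode_reward += r
        rewards.append(episode_reward)
    return rewards


rewards = rollout(ConstantPredictor(), CountdownEnv(horizon=5), num_episodes=3)
print(rewards)  # [5.0, 5.0, 5.0]
```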

Let's put it all together. First, we create the offline data:

In [6]:
path = "/tmp/out"
generate_offline_data(path)



Generating offline data for training at /tmp/out


2022-05-19 14:01:31,158	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8268


| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| AIRPPOTrainer_cf170_00000 | TERMINATED | 127.0.0.1:14487 | 2 | 13.3352 | 8535 | 40.1748 | 126 | 8 | 40.1748 |


(raylet) 2022-05-19 14:01:35,670	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=65257 --object-store-name=/tmp/ray/session_2022-05-19_14-01-28_484617_14404/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_14-01-28_484617_14404/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=65513 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65479 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
(AIRPPOTrainer pid=14487) 2022-05-19 14:01:44,924	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution sp

Write Progress:   0%|          | 0/1 [00:00<?, ?it/s]
(raylet) 2022-05-19 14:01:59,339	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/

Result for AIRPPOTrainer_cf170_00000:
  agent_timesteps_total: 4397
  counters:
    num_agent_steps_sampled: 4397
    num_agent_steps_trained: 4397
    num_env_steps_sampled: 4397
    num_env_steps_trained: 4397
  custom_metrics: {}
  date: 2022-05-19_14-02-04
  done: false
  episode_len_mean: 22.31979695431472
  episode_media: {}
  episode_reward_max: 106.0
  episode_reward_mean: 22.31979695431472
  episode_reward_min: 8.0
  episodes_this_iter: 197
  episodes_total: 197
  experiment_id: b81bc29440f945c690285474ae805258
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6674726605415344
          entropy_coeff: 0.0
          kl: 0.026404578238725662
          model: {}
          policy_loss: -0.04029688984155655
          total_loss: 9.022927284240723
          vf_explained_var: -0.0838877484202385
        

Repartition: 100%|██████████| 1/1 [00:00<00:00, 218.43it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 229.15it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 242.96it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 149.37it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 167.14it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 244.03it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 253.88it/s]
Write Progress:   0%|          | 0/1 [00:00<?, ?it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 267.09it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 202.14it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 284.42it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 179.67it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 252.97it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 148.61it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 276.87it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 245.70it/s]
Write Progress: 100%|████

Result for AIRPPOTrainer_cf170_00000:
  agent_timesteps_total: 8535
  counters:
    num_agent_steps_sampled: 8535
    num_agent_steps_trained: 8535
    num_env_steps_sampled: 8535
    num_env_steps_trained: 8535
  custom_metrics: {}
  date: 2022-05-19_14-02-08
  done: true
  episode_len_mean: 40.1747572815534
  episode_media: {}
  episode_reward_max: 126.0
  episode_reward_mean: 40.1747572815534
  episode_reward_min: 8.0
  episodes_this_iter: 103
  episodes_total: 300
  experiment_id: b81bc29440f945c690285474ae805258
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.6117416620254517
          entropy_coeff: 0.0
          kl: 0.018075594678521156
          model: {}
          policy_loss: -0.03786909207701683
          total_loss: 9.375883102416992
          vf_explained_var: -0.030680589377880096
         

2022-05-19 14:02:09,250	INFO tune.py:753 -- Total run time: 35.56 seconds (34.54 seconds for the tuning loop).
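
Before training on the generated data, it can be worth checking what was actually written. The helper below is a hedged sketch using only the standard library. It assumes that the output directory contains `*.json` files with one JSON object per line, and it makes no assumption about which fields RLlib stores. The function name `summarize_json_dir` and the demo records are hypothetical.

```python
import glob
import json
import os
import tempfile


def summarize_json_dir(directory: str) -> dict:
    """Count files and records, and collect the field names seen across all records."""
    files = sorted(glob.glob(os.path.join(directory, "*.json")))
    num_records, fields = 0, set()
    for file_path in files:
        with open(file_path) as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    num_records += 1
                    fields.update(record.keys())
    return {"files": len(files), "records": num_records, "fields": sorted(fields)}


# Self-contained demo on a temporary directory with made-up records.
# To inspect real data, pass the directory used for generation instead.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "part-0.json"), "w") as f:
    f.write(json.dumps({"obs": [0.1, 0.2], "actions": 1}) + "\n")
    f.write(json.dumps({"obs": [0.3, 0.4], "actions": 0}) + "\n")

summary = summarize_json_dir(demo_dir)
print(summary)  # {'files': 1, 'records': 2, 'fields': ['actions', 'obs']}
```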


Then, we run training:

In [7]:
result = train_rl_bc_offline(path=path, num_workers=2, use_gpu=False)

Starting offline training


| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| AIRBCTrainer_e4891_00000 | TERMINATED | 127.0.0.1:14526 | 5 | 10.9716 | 2321 | | | | |


(raylet) 2022-05-19 14:02:10,878	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=65257 --object-store-name=/tmp/ray/session_2022-05-19_14-01-28_484617_14404/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_14-01-28_484617_14404/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=65513 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65479 --redis-password=5241590000000000 --startup-token=24 --runtime-env-hash=-2010331134
(AIRBCTrainer pid=14526) 2022-05-19 14:02:18,809	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution spe

(RolloutWorker pid=14535) DatasetReader  1  has  38  samples.
(RolloutWorker pid=14536) DatasetReader  2  has  38  samples.


Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
(raylet) 2022-05-19 14:02:29,521	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=65257 --object-store-name=/tmp/ray/session_2022-05-19_14-01-28_484617_14404/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_14-01-28_484617_14404/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=65513 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65479 --redis-password=5241590000000000 --startup-token=27 --runtime-env-hash=-2010331134


Result for AIRBCTrainer_e4891_00000:
  agent_timesteps_total: 482
  counters:
    num_agent_steps_sampled: 482
    num_agent_steps_trained: 2000
    num_env_steps_sampled: 482
    num_env_steps_trained: 2000
  custom_metrics: {}
  date: 2022-05-19_14-02-37
  done: false
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  episodes_total: 0
  evaluation:
    custom_metrics: {}
    episode_len_mean: 32.0
    episode_media: {}
    episode_reward_max: 88.0
    episode_reward_mean: 32.0
    episode_reward_min: 11.0
    episodes_this_iter: 10
    hist_stats:
      episode_lengths:
      - 18
      - 45
      - 11
      - 36
      - 15
      - 88
      - 32
      - 12
      - 34
      - 29
      episode_reward:
      - 18.0
      - 45.0
      - 11.0
      - 36.0
      - 15.0
      - 88.0
      - 32.0
      - 12.0
      - 34.0
      - 29.0
    off_policy_estimator: {}
    policy_reward_max: {}
 

2022-05-19 14:02:39,989	INFO tune.py:753 -- Total run time: 30.29 seconds (30.05 seconds for the tuning loop).
Read progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.29it/s]


And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:

In [8]:
num_eval_episodes = 3

rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: {np.mean(rewards)}")

2022-05-19 14:02:40,504	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-05-19 14:02:40,506	INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Read: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 31.56it/s]
Repartition: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 52.74it/s]


[2m[36m(RolloutWorker pid=14652)[0m DatasetReader  2  has  38  samples.
[2m[36m(RolloutWorker pid=14651)[0m DatasetReader  1  has  38  samples.


2022-05-19 14:02:49,333	INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRBCTrainer_2022-05-19_14-02-09/AIRBCTrainer_e4891_00000_0_2022-05-19_14-02-09/checkpoint_000005/checkpoint-5
2022-05-19 14:02:49,334	INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 10.971628904342651, '_episodes_total': 0}


Average reward over 3 episodes: 35.333333333333336
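
An average over three episodes is a noisy estimate. As a rough illustration of how much individual episodes vary, the sketch below summarizes the ten evaluation episode rewards listed under `hist_stats` in the training log above. It uses only the standard library, and the choice of summary statistics is ours rather than part of the Ray API.

```python
import math
import statistics

# Episode rewards copied from the `hist_stats` entry in the training log above.
episode_rewards = [18.0, 45.0, 11.0, 36.0, 15.0, 88.0, 32.0, 12.0, 34.0, 29.0]

mean_reward = statistics.mean(episode_rewards)
std_reward = statistics.stdev(episode_rewards)  # sample standard deviation
std_error = std_reward / math.sqrt(len(episode_rewards))

print(
    f"mean={mean_reward:.1f} std={std_reward:.1f} stderr={std_error:.1f} "
    f"min={min(episode_rewards)} max={max(episode_rewards)}"
)
```

The mean matches the `episode_reward_mean: 32.0` reported in the log, while the spread between the smallest and largest episode shows why more evaluation episodes give a more reliable estimate.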


