pyRDDLGym-rl: Reinforcement Learning

pyRDDLGym-rl provides wrappers that let deep reinforcement learning libraries (currently Stable Baselines 3 and RLlib) work with pyRDDLGym.


This package requires Python 3.8+ and pyRDDLGym>=2.0, together with at least one of:

  • stable-baselines3>=2.2.1
  • ray[rllib]>=2.9.2

Installing via pip

You can install pyRDDLGym-rl and all of its requirements via pip:

pip install stable-baselines3  # install at least one of these two RL libraries
pip install -U "ray[rllib]"
pip install rddlrepository pyRDDLGym-rl

Installing the Pre-Release Version via git

pip install git+

Running the Basic Stable Baselines 3 Example

To run the Stable Baselines 3 example, navigate to the install directory of pyRDDLGym-rl and type:

python -m pyRDDLGym_rl.examples.run_stable_baselines <domain> <instance> <method> <steps> <learning_rate>


  • <domain> is the name of the domain in rddlrepository, or a path pointing to a domain.rddl file
  • <instance> is the name of the instance in rddlrepository, or a path pointing to an instance.rddl file
  • <method> is the RL algorithm to use [a2c, ddpg, dqn, ppo, sac, td3]
  • <steps> is the (optional) number of samples to generate from the environment for training
  • <learning_rate> is the (optional) learning rate to use for the algorithm
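The argument list above maps naturally onto Python's argparse, with the two trailing values declared as optional positionals. The following is a minimal sketch of such a parser; it assumes nothing about how run_stable_baselines actually parses its arguments, and the default values for steps and learning_rate are hypothetical placeholders, not the script's defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical re-creation of the CLI described above; the real
    # run_stable_baselines script may parse its arguments differently.
    parser = argparse.ArgumentParser(
        description="Train a Stable Baselines 3 agent on a RDDL problem")
    parser.add_argument("domain", help="rddlrepository domain name or path to domain.rddl")
    parser.add_argument("instance", help="rddlrepository instance name or path to instance.rddl")
    parser.add_argument("method", choices=["a2c", "ddpg", "dqn", "ppo", "sac", "td3"])
    # nargs="?" makes a positional optional; these defaults are placeholders only.
    parser.add_argument("steps", nargs="?", type=int, default=1_000_000)
    parser.add_argument("learning_rate", nargs="?", type=float, default=3e-4)
    return parser


# Example: domain and instance are placeholders; learning_rate is omitted.
args = build_parser().parse_args(["domain", "instance", "ppo", "50000"])
print(args.method, args.steps, args.learning_rate)
```

Because both optional values are positional, a learning rate can only be given if the number of steps is given as well.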

Running the Basic RLlib Example

To run the RLlib example, from the install directory of pyRDDLGym-rl, type:

python -m pyRDDLGym_rl.examples.run_rllib <domain> <instance> <method> <iters>


  • <domain> is the name of the domain in rddlrepository, or a path pointing to a domain.rddl file
  • <instance> is the name of the instance in rddlrepository, or a path pointing to an instance.rddl file
  • <method> is the RL algorithm to use [dqn, ppo, sac]
  • <iters> is the (optional) number of training iterations

Running Stable Baselines 3 from the Python API

The following example sets up the Stable Baselines 3 PPO algorithm to work with pyRDDLGym:

from stable_baselines3 import PPO

import pyRDDLGym
from pyRDDLGym_rl.core.agent import StableBaselinesAgent
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv

# create the environment
env = pyRDDLGym.make("domain", "instance", base_class=SimplifiedActionRDDLEnv)

# create the PPO agent (pass additional arguments, such as learning_rate, here)
agent = PPO('MultiInputPolicy', env, verbose=1)

# train the agent: total_timesteps is the number of environment samples
# to collect (10_000 is an arbitrary example value)
agent.learn(total_timesteps=10_000)

# wrap the agent in a RDDL policy and evaluate
ppo_agent = StableBaselinesAgent(agent)
ppo_agent.evaluate(env, episodes=1, verbose=True, render=True)


Running RLlib from the Python API

The following example sets up the RLlib PPO algorithm to work with pyRDDLGym:

from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig

import pyRDDLGym
from pyRDDLGym_rl.core.agent import RLLibAgent
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv

# set up the environment: RLlib passes the env_config dict to the creator
def env_creator(cfg):
    return pyRDDLGym.make(cfg['domain'], cfg['instance'], base_class=SimplifiedActionRDDLEnv)
register_env('RLLibEnv', env_creator)

# create agent
config = {'domain': "domain", 'instance': "instance"}
agent = PPOConfig().environment('RLLibEnv', env_config=config).build()

# train agent (iters is the number of training iterations; 10 is an arbitrary example value)
iters = 10
for _ in range(iters):
    agent.train()

# wrap the agent in a RDDL policy and evaluate
ppo_agent = RLLibAgent(agent)
ppo_agent.evaluate(env_creator(config), episodes=1, verbose=True, render=True)


The Environment Wrapper

You can also use the environment wrapper with your own RL implementation, or with a package we do not currently support:

import pyRDDLGym
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv
env = pyRDDLGym.make("domain", "instance", base_class=SimplifiedActionRDDLEnv)

The goal of this wrapper is to simplify the action space as much as possible. To illustrate, the action space of the MarsRover domain is defined as:

    {'power-x___d1': Box(-0.1, 0.1, (1,), float32),
     'power-x___d2': Box(-0.1, 0.1, (1,), float32),
     'power-y___d1': Box(-0.1, 0.1, (1,), float32),
     'power-y___d2': Box(-0.1, 0.1, (1,), float32),
     'harvest___d1': Discrete(2),
     'harvest___d2': Discrete(2)}

However, the wrapper simplifies this action space to:

    {'discrete': MultiDiscrete([2 2]),
     'continuous': Box(-0.1, 0.1, (4,), float32)}

where the discrete action variables and the continuous action variables have each been aggregated into a single space. Actions passed to the environment must therefore follow this form: a dictionary whose discrete field is a (2,) array of integer type and whose continuous field is a (4,) array of float type.
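To make the aggregation concrete, here is a minimal pure-Python sketch that groups per-variable spaces into one discrete and one continuous block, and maps a simplified action back to per-variable values. It is not the wrapper's implementation: spaces are represented by hypothetical tuples instead of gymnasium objects, and ordering variables by dictionary insertion order is an assumption.

```python
from typing import Dict, List, Tuple


def simplify(spaces: Dict[str, tuple]):
    """Group per-variable spaces into one discrete and one continuous block.

    Hypothetical space encoding: ("box", low, high) for a 1-element Box,
    ("discrete", n) for Discrete(n).
    """
    discrete_keys: List[str] = []
    nvec: List[int] = []
    continuous_keys: List[str] = []
    lows: List[float] = []
    highs: List[float] = []
    for key, space in spaces.items():  # insertion order fixes the layout (assumption)
        if space[0] == "discrete":
            discrete_keys.append(key)
            nvec.append(space[1])
        else:
            continuous_keys.append(key)
            lows.append(space[1])
            highs.append(space[2])
    layout = {"discrete": discrete_keys, "continuous": continuous_keys}
    simplified = {"discrete": nvec, "continuous": (lows, highs)}
    return layout, simplified


def unpack(action: dict, layout: Dict[str, List[str]]) -> dict:
    """Map {'discrete': [...], 'continuous': [...]} back to per-variable values."""
    result = {}
    for field in ("discrete", "continuous"):
        values = list(action.get(field, []))
        if len(values) != len(layout[field]):
            raise ValueError(
                f"{field!r} expects {len(layout[field])} values, got {len(values)}")
        result.update(zip(layout[field], values))
    return result


# The MarsRover action space from above, in the hypothetical tuple encoding.
mars_rover = {
    "power-x___d1": ("box", -0.1, 0.1),
    "power-x___d2": ("box", -0.1, 0.1),
    "power-y___d1": ("box", -0.1, 0.1),
    "power-y___d2": ("box", -0.1, 0.1),
    "harvest___d1": ("discrete", 2),
    "harvest___d2": ("discrete", 2),
}
layout, simplified = simplify(mars_rover)
action = {"discrete": [1, 0], "continuous": [0.05, 0.0, -0.02, 0.1]}
print(unpack(action, layout))
```

Here simplified["discrete"] plays the role of MultiDiscrete([2 2]) and the four continuous bounds play the role of Box(-0.1, 0.1, (4,)). Because unpack checks lengths, a malformed action fails early with a ValueError instead of reaching the simulator.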


The vectorized option is required by the wrapper and is automatically set to True.


The action simplification rules apply max-nondef-actions only to boolean actions, and assume its value is either 1 or at least the total number of boolean actions. Any other setting is currently unsupported in pyRDDLGym-rl and raises an exception.
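The two supported cases can be expressed as a small check. The sketch below is hypothetical (the function names are ours, not pyRDDLGym-rl's). The second function shows one plausible way each case could be encoded, which also suggests why intermediate values are hard to support: independent on/off choices let a policy switch on more actions than the bound allows, while a single categorical choice can only express "at most one".

```python
from typing import List


def check_max_nondef_actions(max_nondef_actions: int, num_bool_actions: int) -> str:
    """Classify the boolean-action constraint; raise if it is unsupported."""
    if max_nondef_actions == 1:
        return "at-most-one"      # at most one boolean action may be true per step
    if max_nondef_actions >= num_bool_actions:
        return "unconstrained"    # the bound can never bind, so booleans are independent
    raise ValueError(
        f"max-nondef-actions={max_nondef_actions} with {num_bool_actions} boolean "
        "actions is not supported (must be 1 or >= the number of boolean actions)")


def boolean_nvec(case: str, num_bool_actions: int) -> List[int]:
    """Hypothetical encoding, not necessarily what pyRDDLGym-rl does.

    "at-most-one": one categorical choice among the n actions plus "none".
    "unconstrained": one independent on/off choice per action.
    """
    if case == "at-most-one":
        return [num_bool_actions + 1]
    return [2] * num_bool_actions


print(boolean_nvec(check_max_nondef_actions(1, 3), 3))
print(boolean_nvec(check_max_nondef_actions(5, 2), 2))
```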


pyRDDLGym-rl currently has the following limitations:

  • The action spaces supported by the chosen Stable Baselines 3 or RLlib algorithm must be compatible with the action space produced by pyRDDLGym (e.g. DQN only handles Discrete spaces).
  • Only special types of constraints on boolean actions are supported (as described above).
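For Stable Baselines 3, the first limitation can be checked before training with a simple lookup. The table below is our transcription of the supported action spaces listed in the Stable Baselines 3 documentation, and compatible_methods is a hypothetical helper; verify the table against the SB3 version you install.

```python
from typing import Dict, List, Set

# Action-space support per Stable Baselines 3 algorithm, transcribed from the
# SB3 documentation; re-check it against the installed version.
SB3_ACTION_SPACES: Dict[str, Set[str]] = {
    "a2c": {"Box", "Discrete", "MultiDiscrete", "MultiBinary"},
    "ddpg": {"Box"},
    "dqn": {"Discrete"},
    "ppo": {"Box", "Discrete", "MultiDiscrete", "MultiBinary"},
    "sac": {"Box"},
    "td3": {"Box"},
}


def compatible_methods(action_space_type: str) -> List[str]:
    """Return the SB3 methods (sorted) that accept this action space type."""
    return sorted(
        method for method, spaces in SB3_ACTION_SPACES.items()
        if action_space_type in spaces)


print(compatible_methods("Discrete"))
print(compatible_methods("Box"))
```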