# The Reinforcement Learning Loop Interface

In [1]:
#define RL_TOOLS_BACKEND_ENABLE_OPENBLAS
#include <rl_tools/operations/cpu_mux.h>
#include <rl_tools/nn/operations_cpu_mux.h>
#include <rl_tools/rl/environments/pendulum/operations_generic.h>
#include <rl_tools/nn_models/operations_cpu.h>
namespace rlt = rl_tools;
#pragma cling load("openblas")

In [2]:
using DEVICE = rlt::devices::DEVICE_FACTORY<>;
using RNG = decltype(rlt::random::default_engine(typename DEVICE::SPEC::RANDOM{}));
using T = float;
using TI = typename DEVICE::index_t;

If you just want to use a deep RL algorithm off the shelf and not tinker with its implementation, you can use the `loop` interface. For each RL algorithm in RLtools we provide a loop interface consisting of a configuration, a corresponding state data structure and a step operation. To use the loop interface we include the core loop of e.g. [PPO](https://arxiv.org/abs/1707.06347):

In [3]:
#include <rl_tools/rl/algorithms/ppo/loop/core/config.h>
#include <rl_tools/rl/algorithms/ppo/loop/core/operations.h>

Next we can define the [MDP](https://en.wikipedia.org/wiki/Markov_decision_process) in the form of an environment (see [Custom Environment](./08-Custom%20Environment.ipynb) for details):

In [4]:
using PENDULUM_SPEC = rlt::rl::environments::pendulum::Specification<T, TI, rlt::rl::environments::pendulum::DefaultParameters<T>>;
using ENVIRONMENT = rlt::rl::environments::Pendulum<PENDULUM_SPEC>;

Based on this environment we can create the default PPO loop config (with default shapes for the actor and critic networks as well as other parameters):

In [5]:
struct LOOP_CORE_PARAMETERS: rlt::rl::algorithms::ppo::loop::core::Parameters<T, TI, ENVIRONMENT>{
    static constexpr TI ENVIRONMENT_STEP_LIMIT = 200;
    static constexpr TI TOTAL_STEP_LIMIT = 300000;
    static constexpr TI STEP_LIMIT = TOTAL_STEP_LIMIT/(ON_POLICY_RUNNER_STEPS_PER_ENV * N_ENVIRONMENTS) + 1; // number of PPO steps
};
using LOOP_CORE_CONFIG = rlt::rl::algorithms::ppo::loop::core::Config<T, TI, RNG, ENVIRONMENT, LOOP_CORE_PARAMETERS>;

This config, which can be customized by creating a subclass and overriding the desired fields, gives rise to a loop state:

In [6]:
using LOOP_CORE_STATE = typename LOOP_CORE_CONFIG::template State<LOOP_CORE_CONFIG>;

Next we can create an instance of this state and allocate as well as initialize it:

In [7]:
DEVICE device;
LOOP_CORE_STATE lsc;
rlt::malloc(device, lsc);
TI seed = 1337;
rlt::init(device, lsc, seed);

Now we can execute PPO steps. A PPO step consists of collecting `LOOP_CONFIG::PARAMETERS::N_ENVIRONMENTS * LOOP_CONFIG::PARAMETERS::ON_POLICY_RUNNER_STEPS_PER_ENV` steps using the `OnPolicyRunner` and then training the actor and critic for `LOOP_CONFIG::PARAMETERS::PPO_PARAMETERS::N_EPOCHS` epochs:

In [8]:
bool finished = rlt::step(device, lsc);

Since we don't want to re-implement e.g. the evaluation for each algorithm, we can wrap the PPO core config in an evaluation loop config which adds its own configuration, state data structure and step operation:

In [9]:
#include <rl_tools/rl/environments/pendulum/ui_xeus.h> // For the interactive UI used later on
#include <rl_tools/rl/loop/steps/evaluation/config.h>
#include <rl_tools/rl/loop/steps/evaluation/operations.h>

In [10]:
template <typename NEXT>
struct LOOP_EVAL_PARAMETERS: rlt::rl::loop::steps::evaluation::Parameters<T, TI, NEXT>{
    static constexpr TI EVALUATION_INTERVAL = 4;
    static constexpr TI NUM_EVALUATION_EPISODES = 10;
    static constexpr TI N_EVALUATIONS = NEXT::PARAMETERS::STEP_LIMIT / EVALUATION_INTERVAL;
};
using LOOP_CONFIG = rlt::rl::loop::steps::evaluation::Config<LOOP_CORE_CONFIG, LOOP_EVAL_PARAMETERS<LOOP_CORE_CONFIG>>;
using LOOP_STATE = typename LOOP_CONFIG::template State<LOOP_CONFIG>;

In [11]:
LOOP_STATE ls;
rlt::malloc(device, ls);
rlt::init(device, ls, seed);
ls.actor_optimizer.parameters.alpha = 1e-3; // increasing the learning rate leads to faster training of the Pendulum-v1 environment
ls.critic_optimizer.parameters.alpha = 1e-3;

In [12]:
while(!rlt::step(device, ls)){
    if(ls.step == 5){
        std::cout << "Stepping yourself > hooks/callbacks" << std::endl;
    }
}

Step: 0/74 Mean return: -1406.78
Step: 4/74 Mean return: -1368.09
Stepping yourself > hooks/callbacks
Step: 8/74 Mean return: -1263.23
Step: 12/74 Mean return: -1444.44
Step: 16/74 Mean return: -1390.65
Step: 20/74 Mean return: -1302.68
Step: 24/74 Mean return: -1287.31
Step: 28/74 Mean return: -1125.27
Step: 32/74 Mean return: -1185.73
Step: 36/74 Mean return: -903.619
Step: 40/74 Mean return: -909.322
Step: 44/74 Mean return: -697.736
Step: 48/74 Mean return: -604.199
Step: 52/74 Mean return: -371.755
Step: 56/74 Mean return: -345.625
Step: 60/74 Mean return: -224.121
Step: 64/74 Mean return: -163.862
Step: 68/74 Mean return: -162.212


In [13]:
using UI_SPEC = rlt::rl::environments::pendulum::ui::xeus::Specification<T, TI, 400, 100>; // float type, index type, size, playback speed (in %)
using UI = rlt::rl::environments::pendulum::ui::xeus::UI<UI_SPEC>;
UI ui;
rlt::MatrixDynamic<rlt::matrix::Specification<T, TI, 1, ENVIRONMENT::OBSERVATION_DIM>> observations_mean;
rlt::MatrixDynamic<rlt::matrix::Specification<T, TI, 1, ENVIRONMENT::OBSERVATION_DIM>> observations_std;
rlt::malloc(device, observations_mean);
rlt::malloc(device, observations_std);
rlt::set_all(device, observations_mean, 0);
rlt::set_all(device, observations_std, 1);
ui.canvas

A Jupyter widget with unique id: 5a01f3b5a6144fbaba15094279b01168

In [18]:
rlt::evaluate(device, ls.env_eval, ui, rlt::get_actor(ls), rlt::rl::utils::evaluation::Specification<1, 200>(), observations_mean, observations_std, ls.actor_deterministic_evaluation_buffers, ls.rng_eval);

You can execute the previous cell again to run another rollout using the UI.