Add tutorial of offline policy selection
takuseno committed Jun 24, 2022
1 parent babc0dc commit f771001
Showing 5 changed files with 97 additions and 0 deletions.
Binary file added docs/assets/dqn_cartpole.png
Binary file added docs/assets/fqe_cartpole_init_value.png
Binary file added docs/assets/fqe_cartpole_soft_opc.png
1 change: 1 addition & 0 deletions docs/tutorials/index.rst
@@ -13,4 +13,5 @@ Tutorials
customize_neural_network
online_rl
finetuning
offline_policy_selection
use_distributional_q_function
96 changes: 96 additions & 0 deletions docs/tutorials/offline_policy_selection.rst
@@ -0,0 +1,96 @@
************************
Offline Policy Selection
************************

d3rlpy supports offline policy selection by training Fitted Q Evaluation (FQE), which is an offline on-policy RL algorithm.
The use of FQE for offline policy selection was proposed by `Paine et al. <https://arxiv.org/abs/2007.09055>`_.
The idea is that FQE trains a Q-function with the trained policy in an on-policy manner, so that the learned Q-function reflects the expected return of that policy.
By using the Q-value estimates from FQE, candidate trained policies can be ranked using only an offline dataset.
Check :doc:`../references/off_policy_evaluation` for more information.
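
Concretely, FQE regresses a Q-function onto a bootstrapped target that is computed with the policy under evaluation.
Roughly, the objective takes the following form (a standard formulation of the FQE loss; implementation details may differ):

.. math::

    L(\theta) = \mathbb{E}_{(s_t, a_t, r_{t+1}, s_{t+1}) \sim D}
    \left[\left(Q_\theta(s_t, a_t) - \left(r_{t+1} + \gamma Q_{\theta'}(s_{t+1}, \pi(s_{t+1}))\right)\right)^2\right]

where :math:`D` is the offline dataset, :math:`\pi` is the trained policy being evaluated and :math:`\theta'` denotes the target network parameters.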

.. note::

    Offline policy selection with FQE has been confirmed to usually work well with discrete action-space policies.
    However, it seems to require some hyperparameter tuning to rank continuous action-space policies.
    More techniques will be supported as this research domain advances.


Prepare trained policies
------------------------

In this tutorial, let's train DQN with the built-in CartPole-v0 dataset.

.. code-block:: python

    import d3rlpy

    # setup replay CartPole-v0 dataset and environment
    dataset, env = d3rlpy.datasets.get_dataset("cartpole-replay")

    # setup algorithm
    dqn = d3rlpy.algos.DQN()

    # start offline training
    dqn.fit(
        dataset,
        eval_episodes=dataset.episodes,
        n_steps=100000,
        n_steps_per_epoch=10000,
        scorers={
            "environment": d3rlpy.metrics.evaluate_on_environment(env),
        },
    )

Here is an example result of the online evaluation.

.. image:: ../assets/dqn_cartpole.png

Train FQE with the trained policies
-----------------------------------

Next, we train the FQE algorithm with the trained policies.
Please note that we use ``initial_state_value_estimation_scorer`` and ``soft_opc_scorer``, both proposed in `Paine et al. <https://arxiv.org/abs/2007.09055>`_.
``initial_state_value_estimation_scorer`` computes the mean action-value estimate at the initial states.
Thus, if this value is larger for one policy than for the others, that policy is expected to obtain a higher episode return.
On the other hand, ``soft_opc_scorer`` computes the mean difference between the action-value estimates for the success episodes and the action-value estimates for all episodes.
If this value is larger for one policy than for the others, the learned Q-function can clearly tell the difference between the success episodes and the rest.
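
The following is a rough, hand-written sketch of what these two scorers measure (it is not the library implementation; the actual scorers are used via ``d3rlpy.metrics`` in the snippet below):

.. code-block:: python

    import numpy as np

    def init_value_sketch(fqe, episodes):
        # mean estimated value of each episode's first state under the evaluated policy
        values = []
        for episode in episodes:
            first_obs = np.expand_dims(episode.observations[0], axis=0)
            action = fqe.predict(first_obs)
            values.append(fqe.predict_value(first_obs, action)[0])
        return float(np.mean(values))

    def soft_opc_sketch(fqe, episodes, return_threshold):
        # mean Q-value over "success" episodes (return >= threshold)
        # minus mean Q-value over all episodes
        success_values, all_values = [], []
        for episode in episodes:
            values = fqe.predict_value(episode.observations, episode.actions)
            all_values.extend(values)
            if episode.compute_return() >= return_threshold:
                success_values.extend(values)
        return float(np.mean(success_values) - np.mean(all_values))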

.. code-block:: python

    import d3rlpy

    # setup the same dataset used in policy training
    dataset, _ = d3rlpy.datasets.get_dataset("cartpole-replay")

    # load the pretrained policy (the log directory name will differ per run)
    dqn = d3rlpy.algos.DQN()
    dqn.build_with_dataset(dataset)
    dqn.load_model("d3rlpy_logs/DQN_20220624191141/model_100000.pt")

    # setup FQE algorithm with the policy to evaluate
    fqe = d3rlpy.ope.DiscreteFQE(algo=dqn)

    # start FQE training
    fqe.fit(
        dataset,
        eval_episodes=dataset.episodes,
        n_steps=10000,
        n_steps_per_epoch=1000,
        scorers={
            "init_value": d3rlpy.metrics.initial_state_value_estimation_scorer,
            "soft_opc": d3rlpy.metrics.soft_opc_scorer(180),  # 180 is the success return threshold
        },
    )
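
To compare several candidates, the same FQE training can be repeated for each saved DQN checkpoint and the resulting estimates can be ranked.
Here is a minimal sketch (the log directory and checkpoint file names are examples from this particular run and will differ on your machine):

.. code-block:: python

    # continuing from the snippet above (d3rlpy imported, dataset loaded)
    # hypothetical mapping from candidate name to saved DQN checkpoint
    candidates = {
        "epoch_1": "d3rlpy_logs/DQN_20220624191141/model_10000.pt",
        "epoch_5": "d3rlpy_logs/DQN_20220624191141/model_50000.pt",
        "epoch_10": "d3rlpy_logs/DQN_20220624191141/model_100000.pt",
    }

    scores = {}
    for name, path in candidates.items():
        # load the candidate policy
        dqn = d3rlpy.algos.DQN()
        dqn.build_with_dataset(dataset)
        dqn.load_model(path)

        # train FQE for this candidate
        fqe = d3rlpy.ope.DiscreteFQE(algo=dqn)
        fqe.fit(
            dataset,
            eval_episodes=dataset.episodes,
            n_steps=10000,
            n_steps_per_epoch=1000,
        )

        # score the candidate by its estimated initial state value
        scores[name] = d3rlpy.metrics.initial_state_value_estimation_scorer(
            fqe, dataset.episodes
        )

    # a higher estimated initial state value suggests a better policy
    ranking = sorted(scores, key=scores.get, reverse=True)
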
In this example, the policies from epoch 10, epoch 5 and epoch 1 (with evaluation episode returns of 107.5, 200.0 and 17.5, respectively) are compared.
The first figure shows the ``init_value`` metric during FQE training.
As you can see, the scale of ``init_value`` correlates with the ranking of the evaluation episode returns.

.. image:: ../assets/fqe_cartpole_init_value.png

The second figure shows the ``soft_opc`` metric during FQE training.
These curves also correlate with the ranking of the evaluation episode returns.

.. image:: ../assets/fqe_cartpole_soft_opc.png

Please note that there is usually no convergence in offline RL training due to the non-fixed bootstrapped target.
