-
Notifications
You must be signed in to change notification settings - Fork 227
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add tutorial of offline policy selection
- Loading branch information
Showing
5 changed files
with
97 additions
and
0 deletions.
There are no files selected for viewing
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,4 +13,5 @@ Tutorials | |
customize_neural_network | ||
online_rl | ||
finetuning | ||
offline_policy_selection | ||
use_distributional_q_function |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
************************ | ||
Offline Policy Selection | ||
************************ | ||
|
||
d3rlpy supports offline policy selection by training Fitted Q Evaluation (FQE), which is an offline on-policy RL algorithm. | ||
The use of FQE for offline policy selection is proposed by `Paine et al. <https://arxiv.org/abs/2007.09055>`_. | ||
The concept is that FQE trains Q-function with the trained policy in on-policy manner so that the learned Q-function reflects the expected return of the trained policy. | ||
By using the Q-value estimation of FQE, the candidate trained policies can be ranked only with offline dataset. | ||
Check :doc:`../references/off_policy_evaluation` for more information. | ||
|
||
.. note:: | ||
|
||
Offline policy selection with FQE is confirmed that it usually works out with discrete action-space policies. | ||
However, it seems require some hyperparameter tuning for ranking continuous action-space policies. | ||
The more techniques will be supported along with the advancement of this research domain. | ||
|
||
|
||
Prepare trained policies | ||
------------------------ | ||
|
||
In this tutorial, let's train DQN with the built-in CartPole-v0 dataset. | ||
|
||
.. code-block:: python | ||
import d3rlpy | ||
# setup replay CartPole-v0 dataset and environment | ||
dataset, env = d3rlpy.datasets.get_dataset("cartpole-replay") | ||
# setup algorithm | ||
dqn = d3rlpy.algos.DQN() | ||
# start offline training | ||
dqn.fit( | ||
dataset, | ||
eval_episodes=dataset.episodes, | ||
n_steps=100000, | ||
n_steps_per_epoch=10000, | ||
scorers={ | ||
"environment": d3rlpy.metrics.evaluate_on_environment(env), | ||
}, | ||
) | ||
Here is the example result of online evaluation. | ||
|
||
.. image:: ../assets/dqn_cartpole.png | ||
|
||
Train FQE with the trained policies | ||
----------------------------------- | ||
|
||
Next, we train FQE algorithm with the trained policies. | ||
Please note that we use ``initial_state_value_estimation_scorer`` and ``soft_opc_scorer`` proposed in `Paine et al. <https://arxiv.org/abs/2007.09055>`_. | ||
``initial_state_value_estimation_scorer`` computes the mean action-value estimation at the initial states. | ||
Thus, if this value for a certain policy is bigger than others, the policy is expected to obtain the higher episode return. | ||
On the other hand, ``soft_opc_scorer`` computes the mean difference between the action-value estimation for the success episodes and the action-value estimation for the all episodes. | ||
If this value for a certain policy is bigger than others, the learned Q-function can clearly tell the difference between the success episodes and others. | ||
|
||
.. code-block:: python | ||
import d3rlpy | ||
# setup the same dataset used in policy training | ||
dataset, _ = d3rlpy.datasets.get_dataset("cartpole-replay") | ||
# load pretrained policy | ||
dqn = d3rlpy.algos.DQN() | ||
dqn.build_with_dataset(dataset) | ||
dqn.load_model("d3rlpy_logs/DQN_20220624191141/model_100000.pt") | ||
# setup FQE algorithm | ||
fqe = d3rlpy.ope.DiscreteFQE() | ||
# start FQE training | ||
fqe.fit( | ||
dataset, | ||
eval_episodes=dataset.episodes, | ||
n_steps=10000, | ||
n_steps_per_epoch=1000, | ||
scorers={ | ||
"init_value": d3rlpy.metrics.initial_state_value_estimation_scorer, | ||
"soft_opc": d3rlpy.metrics.soft_opc_scorer(180), # set 180 for success return threshold here | ||
}, | ||
) | ||
In this example, the policies from epoch 10, epoch 5 and epoch 1 (evaluation episode returns of 107.5, 200.0 and 17.5 respectively) are compared. | ||
The first figure represents the ``init_value`` metrics during FQE training. | ||
As you can see here, the scale of ``init_value`` has correlation with the ranks of evaluation episode returns. | ||
|
||
.. image:: ../assets/fqe_cartpole_init_value.png | ||
|
||
The second figure represents the ``soft_opc`` metrics during FQE training. | ||
These curves also have correlation with the ranks of evaluation episode returns. | ||
|
||
.. image:: ../assets/fqe_cartpole_soft_opc.png | ||
|
||
Please note that there is usually no convergence in offline RL training due to the non-fixed bootstrapped target. |