d4rlpy MDPDataset #135
Closed · cclvr opened this issue Sep 10, 2021 · 14 comments

@cclvr commented Sep 10, 2021

Hi @takuseno, first of all, thanks a lot for such a high-quality offline RL repo.
I have a question about get_d4rl(): why are the rewards all shifted by one step?

```python
while cursor < dataset_size:
    # collect data for step=t
    observation = dataset["observations"][cursor]
    action = dataset["actions"][cursor]
    if episode_step == 0:
        reward = 0.0
    else:
        reward = dataset["rewards"][cursor - 1]
```

Looking forward to your feedback.

@takuseno (Owner)

@cclvr Thanks for the issue. This is because MDPDataset expects rewards as [r_t, r_{t+1}, r_{t+2}, ...], but D4RL datasets store rewards as [r_{t+1}, r_{t+2}, r_{t+3}, ...]. This is why I did something weird there.
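
To make that shift concrete, here is a minimal sketch on toy numpy arrays (the values are made up for illustration; this is not d3rlpy or D4RL code) showing how a D4RL-style reward array is realigned to the layout MDPDataset expects:

```python
import numpy as np

# Toy example. In the D4RL layout, rewards[i] is the reward received AFTER
# executing actions[i] in observations[i], i.e. r_{t+1}.
d4rl_rewards = np.array([1.0, 2.0, 3.0, 4.0])

# MDPDataset expects rewards[i] to be the reward received when ARRIVING at
# observations[i], i.e. r_t, so everything is shifted back by one step and
# the first entry becomes 0.0 -- exactly what the loop above does.
mdp_rewards = np.concatenate([[0.0], d4rl_rewards[:-1]])

print(mdp_rewards)  # [0. 1. 2. 3.]
```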

@takuseno (Owner)

Without this change, the algorithm performance will be significantly worse.

@cclvr (Author) commented Sep 10, 2021

> Without this change, the algorithm performance will be significantly worse.

Thanks a lot for your reply~
By the way, it would be appreciated if the DICE family of OPE methods could be implemented here. I know I'm asking a lot ^-^

@takuseno (Owner)

Thanks for the proposal! Actually, I'm working on an academic paper about this library and preparing the v1.0.0 release. Once those are done, I'll look for a way to add more OPE methods. BTW, I'm always happy to receive contributions from users 😄

@mamengyiyi

Hi, I thought the data collection process of D4RL is the same as the method commonly used in online training, i.e., the transition is <s_t, a_t, r_t, s_{t+1}>. I thought r_t indicates the reward of executing action a_t at s_t, even though that reward is obtained at time t+1. See the code at https://github.com/rail-berkeley/d4rl/blob/master/scripts/generation/mujoco/collect_data.py. I'm not sure whether I understand this code correctly; I would appreciate it if you could check the D4RL code again and discuss it with me further.
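
For reference, the convention described here is the usual online collection loop; the following is a toy, self-contained sketch (DummyEnv, policy, and buffer are made up purely for illustration, this is not the D4RL script) showing that the stored rewards[i] pairs with (observations[i], actions[i]):

```python
import random

# Dummy stand-in environment just to illustrate the convention.
class DummyEnv:
    def reset(self):
        self.t = 0
        return 0.0                       # initial observation s_0

    def step(self, action):
        self.t += 1
        next_obs = float(self.t)         # s_{t+1}
        reward = float(action)           # r_{t+1}, the reward for (s_t, a_t)
        done = self.t >= 3
        return next_obs, reward, done

env = DummyEnv()
policy = lambda obs: random.choice([0, 1])

buffer = []
obs = env.reset()
done = False
while not done:
    action = policy(obs)                         # choose a_t in s_t
    next_obs, reward, done = env.step(action)    # reward for (s_t, a_t)
    buffer.append((obs, action, reward, next_obs, done))
    obs = next_obs
```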

@takuseno (Owner)

In d3rlpy, that reward is considered to be received at t+1. This function shows the implementation difference:

def collect(

I believe I'm following the notation of Sutton's book.

@mamengyiyi

Picture from Sutton's book:

[figure from Sutton & Barto's book]

Actually, R_{t+1} is obtained at timestep t+1, but it indicates the reward of <s_t, a_t>.
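
For context, under Sutton & Barto's convention, executing A_t in S_t yields the reward R_{t+1} together with S_{t+1}, so the return from step t is written as

```math
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```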

@takuseno (Owner)

Yes, but it's at t+1. Anyway, don't worry, the implementation is correct.

@cclvr (Author) commented Sep 11, 2021

Hi @takuseno, from your discussion, I guess that d4rl collects data in the form (s, a, R(s, a)).

Currently I'm reproducing the bestDICE algorithm under the d3rlpy framework. I collect data as d4rl does, i.e. (s, a, R(s, a)), and then use get_d4rl() to proceed, but bestDICE does not work. Then I tried collecting data as (s, a, R(s', a')), and amazingly, bestDICE works!

This phenomenon confuses me a lot: if my implementation is right, it means d4rl's is wrong, and vice versa. Do you have any idea about this?

BTW, you mentioned that "algorithm performance will be significantly worse"; could you point out which algorithms on which environments?

Thanks a lot for your patient reply~

@takuseno (Owner) commented Sep 11, 2021

@cclvr Did you use the Transition objects correctly? For example, if we write the TD error in d3rlpy, the code looks like below:

```python
dataset = ...  # from some dataset

episode = dataset.episodes[0]
transition = episode[0]

Q = ...  # let's say Q is a Q-function

# squared TD error: (observation, action) is paired with next_reward
td = (Q(transition.observation, transition.action)
      - transition.next_reward
      - gamma * Q.max(transition.next_observation)) ** 2
```

https://d3rlpy.readthedocs.io/en/v0.91/references/dataset.html
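
In other words, based on the example above (and assuming the Transition attributes shown there; this is a reading of the thread, not the official docs), transition.next_reward is the reward that pairs with (transition.observation, transition.action), while transition.reward belongs to the preceding step. A minimal sketch of a Bellman target under that reading:

```python
# Sketch only -- attribute names taken from the TD-error example above,
# semantics as discussed in this thread (assumption, not official docs):
#   transition.reward       -> reward received on arriving at
#                              transition.observation (r_t)
#   transition.next_reward  -> reward produced by executing
#                              transition.action in that state (r_{t+1})
# So a Bellman/TD target for (observation, action) should use next_reward:
target = transition.next_reward + gamma * Q.max(transition.next_observation)
```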

@takuseno (Owner)

Regarding the performance, for example, BEAR does not learn correctly from the d4rl datasets without that change.

@cclvr (Author) commented Sep 11, 2021

Hi @takuseno, thanks for your reply. I kept the same dataset form as d4rl and tried transition.next_reward instead of transition.reward, and now it works. It was my mistake not to understand the Transition class correctly. Thanks again~

@takuseno (Owner)

@cclvr Sounds good! To be honest, I'm considering changing the current dataset scheme to match D4RL, because many people ask about this difference. Anyway, thanks for the issue! Please reopen it if you want to discuss this further.
