d4rlpy MDPDataset #135
Closed · cclvr opened this issue Sep 10, 2021 · 14 comments

@cclvr commented Sep 10, 2021

Hi @takuseno, first of all, thanks a lot for such a high-quality offline RL repo.
I have a question about get_d4rl(): why are the rewards all shifted by one step?

```python
while cursor < dataset_size:
    # collect data for step=t
    observation = dataset["observations"][cursor]
    action = dataset["actions"][cursor]
    if episode_step == 0:
        reward = 0.0
    else:
        reward = dataset["rewards"][cursor - 1]
```

Looking forward to your feedback.

@takuseno (Owner)

@cclvr Thanks for the issue. This is because MDPDataset expects rewards as [r_t, r_{t+1}, r_{t+2}, ...], but D4RL datasets store rewards as [r_{t+1}, r_{t+2}, r_{t+3}, ...]. This is why I did something weird there.
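
To make that shift concrete, here is a minimal sketch on toy numpy arrays (the values are made up for illustration; this is not d3rlpy or D4RL code) showing how a D4RL-style reward array is realigned to the layout MDPDataset expects:

```python
import numpy as np

# Toy example. In the D4RL layout, rewards[i] is the reward received AFTER
# executing actions[i] in observations[i], i.e. r_{t+1}.
d4rl_rewards = np.array([1.0, 2.0, 3.0, 4.0])

# MDPDataset expects rewards[i] to be the reward received when ARRIVING at
# observations[i], i.e. r_t, so everything is shifted back by one step and
# the first entry becomes 0.0 -- exactly what the loop above does.
mdp_rewards = np.concatenate([[0.0], d4rl_rewards[:-1]])

print(mdp_rewards)  # [0. 1. 2. 3.]
```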

@takuseno (Owner)

Without this change, the algorithm performance will be significantly worse.

@cclvr (Author) commented Sep 10, 2021

> Without this change, the algorithm performance will be significantly worse.

Thanks a lot for your reply~
By the way, it would be appreciated if the DICE family of OPE methods could be implemented here. I know I'm asking a lot ^-^

@takuseno (Owner)

Thanks for the proposal! Actually, I'm working on an academic paper about this library and preparing the v1.0.0 release. Once those are done, I'll look for a way to add more OPE methods. BTW, I'm always happy to receive contributions from users 😄

@mamengyiyi

Hi, I thought the data collection process of D4RL is the same as the method commonly used in online training, i.e., the transition is <s_t, a_t, r_t, s_{t+1}>. I thought r_t indicates the reward of executing action a_t at s_t, even though that reward is obtained at time t+1. See the code at https://github.com/rail-berkeley/d4rl/blob/master/scripts/generation/mujoco/collect_data.py. I'm not sure whether I understand this code correctly; I would appreciate it if you could check the D4RL code again and discuss it with me further.
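
For reference, the convention described here is the usual online collection loop; the following is a toy, self-contained sketch (DummyEnv, policy, and buffer are made up purely for illustration, this is not the D4RL script) showing that the stored rewards[i] pairs with (observations[i], actions[i]):

```python
import random

# Dummy stand-in environment just to illustrate the convention.
class DummyEnv:
    def reset(self):
        self.t = 0
        return 0.0                       # initial observation s_0

    def step(self, action):
        self.t += 1
        next_obs = float(self.t)         # s_{t+1}
        reward = float(action)           # r_{t+1}, the reward for (s_t, a_t)
        done = self.t >= 3
        return next_obs, reward, done

env = DummyEnv()
policy = lambda obs: random.choice([0, 1])

buffer = []
obs = env.reset()
done = False
while not done:
    action = policy(obs)                         # choose a_t in s_t
    next_obs, reward, done = env.step(action)    # reward for (s_t, a_t)
    buffer.append((obs, action, reward, next_obs, done))
    obs = next_obs
```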

@takuseno (Owner)

In d3rlpy, that reward is considered to be received at t+1. This function shows the implementation difference:

def collect(

I believe I'm following the notation of Sutton's book.

@mamengyiyi

Picture from Sutton's book:

[figure from Sutton & Barto's book]

Actually, R_{t+1} is obtained at timestep t+1, but it indicates the reward of <s_t, a_t>.
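
For context, under Sutton & Barto's convention, executing A_t in S_t yields the reward R_{t+1} together with S_{t+1}, so the return from step t is written as

```math
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```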

@takuseno (Owner)

Yes, but it's at t+1. Anyway, don't worry, the implementation is correct.

@cclvr (Author) commented Sep 11, 2021

Hi @takuseno, from your discussion, I guess that d4rl collects data in the form (s, a, R(s, a)).

Currently I'm reproducing the bestDICE algorithm under the d3rlpy framework. I collect data as d4rl does, i.e. (s, a, R(s, a)), and then use get_d4rl() to proceed, but bestDICE does not work. Then I tried collecting data as (s, a, R(s', a')), and amazingly, bestDICE works!

This phenomenon confuses me a lot: if my implementation is right, it means d4rl's is wrong, and vice versa. Do you have any idea about this?

BTW, you mentioned that "algorithm performance will be significantly worse"; could you point out which algorithms on which environments?

Thanks a lot for your patient reply~

@takuseno (Owner) commented Sep 11, 2021

@cclvr Did you use the Transition objects correctly? For example, if we write the TD error in d3rlpy, the code looks like below:

```python
dataset = ...  # from some dataset

episode = dataset.episodes[0]
transition = episode[0]

Q = ...  # let's say Q is a Q-function

# squared TD error: (observation, action) is paired with next_reward
td = (Q(transition.observation, transition.action)
      - transition.next_reward
      - gamma * Q.max(transition.next_observation)) ** 2
```

https://d3rlpy.readthedocs.io/en/v0.91/references/dataset.html
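
In other words, based on the example above (and assuming the Transition attributes shown there; this is a reading of the thread, not the official docs), transition.next_reward is the reward that pairs with (transition.observation, transition.action), while transition.reward belongs to the preceding step. A minimal sketch of a Bellman target under that reading:

```python
# Sketch only -- attribute names taken from the TD-error example above,
# semantics as discussed in this thread (assumption, not official docs):
#   transition.reward       -> reward received on arriving at
#                              transition.observation (r_t)
#   transition.next_reward  -> reward produced by executing
#                              transition.action in that state (r_{t+1})
# So a Bellman/TD target for (observation, action) should use next_reward:
target = transition.next_reward + gamma * Q.max(transition.next_observation)
```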

@takuseno (Owner)

Regarding the performance, for example, BEAR does not learn correctly from the d4rl datasets without that change.

@cclvr (Author) commented Sep 11, 2021

Hi @takuseno, thanks for your reply. I kept the same dataset form as d4rl and tried transition.next_reward instead of transition.reward, and now it works. It was my mistake not to understand the Transition class correctly. Thanks again~

@takuseno (Owner)

@cclvr Sounds good! To be honest, I'm considering changing the current dataset scheme to match D4RL, because many people ask about this difference. Anyway, thanks for the issue! Please reopen it if you want to discuss this further.
