[BUG] Final observation not stored #98
Comments
I've also encountered this bug. It's especially problematic in MiniGrid environments like FourRooms-v1 (where the reward always comes at the last timestep). This simple fix should help.
Moreover, with the current dataset all the rewards are shifted by one timestep if you use the DQN loss.
Changing …
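(To make the off-by-one concrete, here is a minimal numpy sketch of the one-step DQN target with made-up numbers; the variable names are illustrative, not d3rlpy's API:)

```python
import numpy as np

gamma = 0.99
# Rewards aligned the d3rlpy way: rewards[t] is received on *arriving* at obs[t],
# with a virtual 0.0 prepended for the initial observation.
rewards = np.array([0.0, 1.0, 0.0])  # [r_0 (virtual), r_1, r_2]
q_next_max = np.array([0.5, 0.2])    # max_a Q(s_{t+1}, a) for t = 0, 1

# Correct one-step targets for transitions t -> t+1 must use rewards[t + 1]:
targets = rewards[1:] + gamma * q_next_max
# Indexing with rewards[:-1] instead (the usual (s, a, r, s') alignment)
# would pair each transition with the previous step's reward.
```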
Hi, d3rlpy's data structure is a bit different from others. Basically, each step is stored as a tuple of (observation_t, action_t, reward_t). Specifically, batch.rewards are the rewards received upon arriving at the sampled observations (reward_t is the consequence of action_{t-1}), not the rewards returned for the sampled actions.
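(In array terms, the convention discussed in this thread implies the following per-episode layout; this is a reconstruction from the comments below, not a quote from d3rlpy's docs:)

```python
import numpy as np

# A two-step episode s_0 -a_0-> s_1 -a_1-> s_2 under this convention:
observations = np.array([0.0, 1.0, 2.0])  # [s_0, s_1, s_2]: the final observation is kept
actions      = np.array([0, 1, 1])        # [a_0, a_1, a_virtual]: virtual last action
rewards      = np.array([0.0, 0.5, 1.0])  # rewards[t] is received on arriving at s_t,
                                          # so rewards[0] is a virtual 0.0

# All three arrays share the same length (number of steps + 1).
assert len(observations) == len(actions) == len(rewards) == 3
```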
`a_t` and `r_t` are not defined for the last observation, no? Related: hill-a/stable-baselines#400
At the last step, there is a reward (a.k.a. the terminal reward), and the last action is virtually taken in the d3rlpy framework. This function might be an example: d3rlpy/d3rlpy/online/iterators.py, line 457 in 962efc5
I see... so you are storing the final observation here: d3rlpy/d3rlpy/online/iterators.py, line 494 in 7db11ed
and taking a virtual last action here: d3rlpy/d3rlpy/online/iterators.py, line 510 in 7db11ed
right?
Yes, that's right.
@araffin Thanks for this issue anyway. It made me notice the errors in the d4rl dataset conversion.
Feel free to reopen this if there is anything further to discuss. Thanks for the issue!
I'll need to fix the SB3 conversion too...

```python
import numpy as np
from d3rlpy.dataset import Episode

# observations, next_observations, actions, rewards, dones, new_dones,
# obs_shape and action_size come from the SB3 replay buffer
episodes_list = []
index = 0
for i in range(len(observations)):
    if dones[i]:
        # Convert to d3rlpy convention, see https://github.com/takuseno/d3rlpy/issues/98
        # Add last observation
        episode_obs = np.concatenate((observations[index : i + 1], next_observations[i, None]), axis=0)
        # Add virtual last action (not used in practice normally)
        episode_actions = np.concatenate((actions[index : i + 1], actions[i, None]), axis=0)
        # Add virtual first reward/termination
        episode_rewards = np.concatenate(([0.0], rewards[index : i + 1]))
        # terminals = np.concatenate(([0.0], new_dones))
        episode = Episode(
            observation_shape=obs_shape,
            action_size=action_size,
            observations=episode_obs,
            actions=episode_actions,
            rewards=episode_rewards,
            terminal=new_dones[i],
            create_mask=False,
            mask_size=0,
        )
        episodes_list.append(episode)
        index = i + 1
```
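(As a quick sanity check, here is the loop above run on made-up arrays, hypothetical data shaped like SB3 replay buffer slices; a single two-transition episode produces the expected lengths:)

```python
import numpy as np

# Hypothetical two-transition episode, shaped like SB3 replay buffer slices
observations      = np.array([[0.0], [1.0]])
next_observations = np.array([[1.0], [2.0]])
actions           = np.array([[0], [1]])
rewards           = np.array([0.5, 1.0])
dones             = np.array([False, True])
new_dones         = dones  # no timeouts to mask out in this toy example

# Running the conversion loop on these arrays yields one episode with:
#   episode_obs     == [[0.0], [1.0], [2.0]]  (the final observation is kept)
#   episode_actions == [[0], [1], [1]]        (virtual last action appended)
#   episode_rewards == [0.0, 0.5, 1.0]        (virtual first reward prepended)
```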
Hello,
Describe the bug
It seems that the final observation is not stored in the `Episode` object. Looking at the code, if an episode is only one step long, the `Episode` object should store two observations (the initial and the final one) but only one action and one reward. But it seems that the `observations` array has the same length as the `actions` or `rewards` arrays, which probably means that the final observation is not stored.
Note: this would probably require some changes later on in the code, as no action is taken after the final observation.
Additional context
The way it is handled in SB3, for instance, is to have a separate array that stores the next observation.
Special treatment is also needed when using multiple envs at the same time, since they may reset automatically.
See https://github.com/DLR-RM/stable-baselines3/blob/503425932f5dc59880f854c4f0db3255a3aa8c1e/stable_baselines3/common/off_policy_algorithm.py#L488
and
https://github.com/DLR-RM/stable-baselines3/blob/503425932f5dc59880f854c4f0db3255a3aa8c1e/stable_baselines3/common/buffers.py#L267
(when using only one array)
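(For reference, a simplified sketch of that SB3 convention, paraphrased from the linked code rather than its exact signatures: when a vectorized sub-env auto-resets, the true final observation is recovered from the info dict before the transition is stored:)

```python
import numpy as np

def store_transition(replay_buffer, obs, next_obs, action, reward, done, infos):
    # next_obs comes from a VecEnv: for sub-envs that just terminated it is
    # already the *new* episode's first observation, so recover the real
    # terminal observation from the info dict before storing.
    next_obs = np.array(next_obs, copy=True)
    for i, d in enumerate(done):
        if d and infos[i].get("terminal_observation") is not None:
            next_obs[i] = infos[i]["terminal_observation"]
    replay_buffer.add(obs, next_obs, action, reward, done)
```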
cc @megan-klaiber