[BUG] Final observation not stored #98

Closed
araffin opened this issue Jul 21, 2021 · 12 comments
Labels
bug Something isn't working

Comments

@araffin
Contributor

araffin commented Jul 21, 2021

Hello,

Describe the bug
It seems that the final observation is not stored in the Episode object.

Looking at the code, if an episode is only one step long,
the Episode object should store:

  • initial observation
  • action, reward
  • final observation

But it seems that the observations array has the same length as the actions and rewards arrays, which probably means that the final observation is not stored.

Note: this would probably require some changes later on in the code, as no action is taken after the final observation.
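
For concreteness, here is a minimal sketch (toy values, not d3rlpy's actual API) of what a one-step episode would contain if the final observation were stored:

import numpy as np

# one-step episode: initial observation + final observation,
# but only a single action and a single reward
observations = np.array([[0.1, 0.2], [0.3, 0.4]])  # s_0 and the final s_1
actions = np.array([[1.0]])                        # a_0 only
rewards = np.array([0.5])                          # r_0 only

# the observation array is one element longer than the action/reward arrays
assert len(observations) == len(actions) + 1 == len(rewards) + 1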

Additional context
The way it is handled in SB3, for instance, is to have a separate array that stores the next observation.
Special treatment is also needed when using multiple envs at the same time, since they may reset automatically.

See https://github.com/DLR-RM/stable-baselines3/blob/503425932f5dc59880f854c4f0db3255a3aa8c1e/stable_baselines3/common/off_policy_algorithm.py#L488
and
https://github.com/DLR-RM/stable-baselines3/blob/503425932f5dc59880f854c4f0db3255a3aa8c1e/stable_baselines3/common/buffers.py#L267
(when using only one array)
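
Roughly, the pattern in the linked SB3 code looks like the following simplified sketch (hypothetical buffer/variable names, not SB3's exact implementation):

import numpy as np

def store_transition(buffer, obs, new_obs, actions, rewards, dones, infos):
    # keep a separate array for the next observation
    next_obs = np.array(new_obs, copy=True)
    for i, done in enumerate(dones):
        # when a vectorized env auto-resets, new_obs[i] is already the reset
        # observation, so recover the true final observation from the info dict
        terminal_obs = infos[i].get("terminal_observation")
        if done and terminal_obs is not None:
            next_obs[i] = terminal_obs
    # `buffer` is assumed to expose separate obs / next_obs storage
    buffer.add(obs, next_obs, actions, rewards, dones)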

cc @megan-klaiber

@araffin araffin added the bug Something isn't working label Jul 21, 2021
@vkurenkov

vkurenkov commented Jul 21, 2021

I've also encountered this bug. It's especially problematic in environments like MiniGrid's FourRooms-v1 (where the reward always comes at the last timestep).

This simple fix should help.

In dataset.pyx, up to this line:

prev_transition = None
for i in range(num_data):
    observation = observations[i]
    action = actions[i]
    reward = rewards[i]

    if i < num_data - 1:
        # regular step: the next transition's data is available
        next_observation = observations[i + 1]
        next_action = actions[i + 1]
        next_reward = rewards[i + 1]
    else:
        # last step: repeat the current data instead of reading past the end
        next_observation = observation
        next_action = action
        next_reward = reward

    # only the last transition of the episode is marked terminal
    env_terminal = terminal if i == num_data - 1 else 0.0

@vkurenkov

Moreover, in the current codebase all the rewards are shifted by one timestep if you use the DQN loss:

  1. Correct target: y_t = r_t + gamma * q * (1-terminal)
  2. The target is computed here
  3. rew_tp1 is passed here
  4. As you can see, batch.next_rewards are passed instead of batch.rewards, resulting in y_t = r_{t+1} + gamma * q * (1-terminal)

Changing batch.next_rewards to batch.rewards helps. I've only checked this change together with the fix above. For CQL + the FourRooms env it worked like a charm.
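
For reference, a small sketch of the contrast being described (illustrative only; see the maintainer's replies below regarding d3rlpy's reward indexing):

def dqn_target(reward, next_q, terminal, gamma=0.99):
    # y = r + gamma * max_a Q(s', a) * (1 - terminal)
    return reward + gamma * next_q * (1.0 - terminal)

# built with batch.rewards      -> y_t = r_t     + gamma * q * (1 - terminal)
# built with batch.next_rewards -> y_t = r_{t+1} + gamma * q * (1 - terminal)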

@takuseno
Owner

Hi, d3rlpy's data structure is a bit different from others'. Basically, it uses the tuple (s_t, a_t, r_t), not (s_t, a_t, r_{t+1}). So the last observation and reward are stored correctly.

@takuseno
Owner

Specifically, batch.rewards is r_t and batch.next_rewards is r_{t+1}.
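
That is (a hypothetical helper to illustrate the indexing, not d3rlpy code): for the transition at index t, the reward paired with s_t sits at index t and the reward paired with s_{t+1} at index t + 1:

def make_transition(observations, actions, rewards, t):
    # illustrative only: mirrors the (s_t, a_t, r_t) convention described above
    return {
        "observation": observations[t],           # s_t
        "action": actions[t],                     # a_t
        "reward": rewards[t],                     # r_t      -> batch.rewards
        "next_observation": observations[t + 1],  # s_{t+1}
        "next_reward": rewards[t + 1],            # r_{t+1}  -> batch.next_rewards
    }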

@araffin
Contributor Author

araffin commented Jul 22, 2021

Basically, it uses the tuple (s_t, a_t, r_t), not (s_t, a_t, r_{t+1}). So the last observation and reward are stored correctly.

a_t and r_t are not defined for the last observation, no?
My point is that at the last step of an episode, the agent receives a terminal observation but does not take any further action or receive any further reward, hence the observation array should be one element longer than the action/reward arrays.

Related: hill-a/stable-baselines#400

@araffin
Contributor Author

araffin commented Jul 22, 2021

I made a small diagram with the d3rlpy convention:
[diagram: rl_episode]

With the gym code below, my point is that obs_3 is not stored:

import gym

max_steps = 2
env = gym.wrappers.TimeLimit(gym.make("CartPole-v1"), max_steps)

# initial observation
obs_1 = env.reset()
# agent takes action for step=1
action_1 = env.action_space.sample()
# step=1 and beginning of step=2
obs_2, reward_1, done_1, info_1 = env.step(action_1)
# agent takes action for step=2
action_2 = env.action_space.sample()
# step=2 (end of episode due to timeout)
obs_3, reward_2, done_2, info_2 = env.step(action_2)

assert done_2 is True
assert info_2["TimeLimit.truncated"] is True

@takuseno
Owner

At the last step, there is a reward (a.k.a. the terminal reward), and the last action is virtually taken in the d3rlpy framework. This function might be an example.

def collect(

@araffin
Contributor Author

araffin commented Jul 22, 2021

At the last step, there is a reward (a.k.a. the terminal reward), and the last action is virtually taken in the d3rlpy framework. This function might be an example.

I see...
Looking at the code, it seems that you are also adding a virtual first reward:

observation, reward, terminal = env.reset(), 0.0, False

and taking a virtual last action:
action = algo.sample_action([fed_observation])[0]

right?
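
In other words, the collection loop would be roughly like this (a sketch of my understanding, not d3rlpy's actual collect code; env/algo follow the old gym 4-tuple step API used in this thread):

def collect_episode(env, algo):
    # pair a virtual reward of 0.0 with the initial observation, and a virtually
    # sampled action with the final observation, so that the observation, action,
    # and reward arrays all end up with the same length
    observations, actions, rewards = [], [], []
    observation, reward, terminal = env.reset(), 0.0, False
    while not terminal:
        action = algo.sample_action([observation])[0]
        observations.append(observation)
        actions.append(action)
        rewards.append(reward)
        observation, reward, terminal, _ = env.step(action)
    # final observation, terminal reward, and a virtual last action
    observations.append(observation)
    actions.append(algo.sample_action([observation])[0])
    rewards.append(reward)
    return observations, actions, rewards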

@takuseno
Owner

Yes, that's right.

@takuseno
Owner

@araffin But thanks for this issue anyway. It made me notice errors in the d4rl dataset conversion.

@takuseno
Owner

Feel free to reopen this if there is anything further to discuss. Thanks for the issue!

@araffin
Contributor Author

araffin commented Jul 24, 2021

I'll need to fix the SB3 conversion too...
It should look like this (for each episode):

import numpy as np
from d3rlpy.dataset import Episode

episodes_list = []
index = 0
for i in range(len(observations)):
    if dones[i]:
        # Convert to d3rlpy convention, see https://github.com/takuseno/d3rlpy/issues/98
        # Add last observation
        episode_obs = np.concatenate((observations[index : i + 1], next_observations[i, None]), axis=0)
        # Add virtual last action (not used in practice normally)
        episode_actions = np.concatenate((actions[index : i + 1], actions[i, None]), axis=0)
        # Add virtual first reward/termination
        episode_rewards = np.concatenate(([0.0], rewards[index : i + 1]))
        # terminals = np.concatenate(([0.0], dones[index : i + 1]))

        episode = Episode(
            observation_shape=obs_shape,
            action_size=action_size,
            observations=episode_obs,
            actions=episode_actions,
            rewards=episode_rewards,
            terminal=dones[i],
            create_mask=False,
            mask_size=0,
        )

        episodes_list.append(episode)
        index = i + 1
