[BUG] Final observation not stored #98

Closed
araffin opened this issue Jul 21, 2021 · 12 comments
Labels
bug Something isn't working

Comments

@araffin
Contributor

araffin commented Jul 21, 2021

Hello,

Describe the bug
It seems that the final observation is not stored in the Episode object.

Looking at the code, if an episode is only one step long,
the Episode object should store:

  • initial observation
  • action, reward
  • final observation

But it seems that the observations array has the same length as the actions and rewards arrays, which probably means that the final observation is not stored.

Note: this would probably require some changes later on in the code, as no action is taken after the final observation.
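
For concreteness, here is a minimal sketch (toy values, not d3rlpy's actual API) of what a one-step episode would contain if the final observation were stored:

import numpy as np

# one-step episode: initial observation + final observation,
# but only a single action and a single reward
observations = np.array([[0.1, 0.2], [0.3, 0.4]])  # s_0 and the final s_1
actions = np.array([[1.0]])                        # a_0 only
rewards = np.array([0.5])                          # r_0 only

# the observation array is one element longer than the action/reward arrays
assert len(observations) == len(actions) + 1 == len(rewards) + 1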

Additional context
The way it is handled in SB3, for instance, is to have a separate array that stores the next observation.
Special treatment is also needed when using multiple envs at the same time, since they may reset automatically.

See https://github.com/DLR-RM/stable-baselines3/blob/503425932f5dc59880f854c4f0db3255a3aa8c1e/stable_baselines3/common/off_policy_algorithm.py#L488
and
https://github.com/DLR-RM/stable-baselines3/blob/503425932f5dc59880f854c4f0db3255a3aa8c1e/stable_baselines3/common/buffers.py#L267
(when using only one array)
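
Roughly, the pattern in the linked SB3 code looks like the following simplified sketch (hypothetical buffer/variable names, not SB3's exact implementation):

import numpy as np

def store_transition(buffer, obs, new_obs, actions, rewards, dones, infos):
    # keep a separate array for the next observation
    next_obs = np.array(new_obs, copy=True)
    for i, done in enumerate(dones):
        # when a vectorized env auto-resets, new_obs[i] is already the reset
        # observation, so recover the true final observation from the info dict
        terminal_obs = infos[i].get("terminal_observation")
        if done and terminal_obs is not None:
            next_obs[i] = terminal_obs
    # `buffer` is assumed to expose separate obs / next_obs storage
    buffer.add(obs, next_obs, actions, rewards, dones)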

cc @megan-klaiber

@araffin araffin added the bug Something isn't working label Jul 21, 2021
@vkurenkov

vkurenkov commented Jul 21, 2021

I've also encountered this bug. It's especially problematic in environments like MiniGrid's FourRooms-v1 (where the reward always comes at the last timestep).

This simple fix should help.

In dataset.pyx, up to this line:

prev_transition = None
for i in range(num_data):
    observation = observations[i]
    action = actions[i]
    reward = rewards[i]

    if i < num_data - 1:
        # regular step: the next transition's data is available
        next_observation = observations[i + 1]
        next_action = actions[i + 1]
        next_reward = rewards[i + 1]
    else:
        # last step: repeat the current data instead of reading past the end
        next_observation = observation
        next_action = action
        next_reward = reward

    # only the last transition of the episode is marked terminal
    env_terminal = terminal if i == num_data - 1 else 0.0

@vkurenkov

Moreover, in the current codebase all the rewards are shifted by one timestep if you use the DQN loss:

  1. Correct target: y_t = r_t + gamma * q * (1-terminal)
  2. The target is computed here
  3. rew_tp1 is passed here
  4. As you can see, batch.next_rewards are passed instead of batch.rewards, resulting in y_t = r_{t+1} + gamma * q * (1-terminal)

Changing batch.next_rewards to batch.rewards helps. I've only checked this change together with the fix above. For CQL + the FourRooms env it worked like a charm.
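
For reference, a small sketch of the contrast being described (illustrative only; see the maintainer's replies below regarding d3rlpy's reward indexing):

def dqn_target(reward, next_q, terminal, gamma=0.99):
    # y = r + gamma * max_a Q(s', a) * (1 - terminal)
    return reward + gamma * next_q * (1.0 - terminal)

# built with batch.rewards      -> y_t = r_t     + gamma * q * (1 - terminal)
# built with batch.next_rewards -> y_t = r_{t+1} + gamma * q * (1 - terminal)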

@takuseno
Owner

Hi, d3rlpy's data structure is a bit different from others'. Basically, it uses the tuple (s_t, a_t, r_t), not (s_t, a_t, r_{t+1}). So the last observation and reward are stored correctly.

@takuseno
Owner

Specifically, batch.rewards is r_t and batch.next_rewards is r_{t+1}.
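
That is (a hypothetical helper to illustrate the indexing, not d3rlpy code): for the transition at index t, the reward paired with s_t sits at index t and the reward paired with s_{t+1} at index t + 1:

def make_transition(observations, actions, rewards, t):
    # illustrative only: mirrors the (s_t, a_t, r_t) convention described above
    return {
        "observation": observations[t],           # s_t
        "action": actions[t],                     # a_t
        "reward": rewards[t],                     # r_t      -> batch.rewards
        "next_observation": observations[t + 1],  # s_{t+1}
        "next_reward": rewards[t + 1],            # r_{t+1}  -> batch.next_rewards
    }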

@araffin
Contributor Author

araffin commented Jul 22, 2021

Basically, it uses the tuple (s_t, a_t, r_t), not (s_t, a_t, r_{t+1}). So the last observation and reward are stored correctly.

a_t and r_t are not defined for the last observation, no?
My point is that at the last step of an episode, the agent receives a terminal observation but does not take any further action or receive any further reward, hence the observation array should be one element longer than the action/reward arrays.

Related: hill-a/stable-baselines#400

@araffin
Contributor Author

araffin commented Jul 22, 2021

I made a small diagram with the d3rlpy convention:
[diagram: rl_episode]

With the gym code below, my point is that obs_3 is not stored:

import gym

max_steps = 2
env = gym.wrappers.TimeLimit(gym.make("CartPole-v1"), max_steps)

# initial observation
obs_1 = env.reset()
# agent takes action for step=1
action_1 = env.action_space.sample()
# step=1 and beginning of step=2
obs_2, reward_1, done_1, info_1 = env.step(action_1)
# agent takes action for step=2
action_2 = env.action_space.sample()
# step=2 (end of episode due to timeout)
obs_3, reward_2, done_2, info_2 = env.step(action_2)

assert done_2 is True
assert info_2["TimeLimit.truncated"] is True

@takuseno
Owner

At the last step, there is a reward (a.k.a. the terminal reward), and the last action is virtually taken in the d3rlpy framework. This function might be an example.

def collect(

@araffin
Contributor Author

araffin commented Jul 22, 2021

At the last step, there is a reward (a.k.a. the terminal reward), and the last action is virtually taken in the d3rlpy framework. This function might be an example.

I see...
Looking at the code, it seems that you are also adding a virtual first reward:

observation, reward, terminal = env.reset(), 0.0, False

and taking a virtual last action:
action = algo.sample_action([fed_observation])[0]

right?
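
In other words, the collection loop would be roughly like this (a sketch of my understanding, not d3rlpy's actual collect code; env/algo follow the old gym 4-tuple step API used in this thread):

def collect_episode(env, algo):
    # pair a virtual reward of 0.0 with the initial observation, and a virtually
    # sampled action with the final observation, so that the observation, action,
    # and reward arrays all end up with the same length
    observations, actions, rewards = [], [], []
    observation, reward, terminal = env.reset(), 0.0, False
    while not terminal:
        action = algo.sample_action([observation])[0]
        observations.append(observation)
        actions.append(action)
        rewards.append(reward)
        observation, reward, terminal, _ = env.step(action)
    # final observation, terminal reward, and a virtual last action
    observations.append(observation)
    actions.append(algo.sample_action([observation])[0])
    rewards.append(reward)
    return observations, actions, rewards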

@takuseno
Owner

Yes, that's right.

@takuseno
Owner

@araffin But thanks for this issue anyway. It made me notice errors in the d4rl dataset conversion.

@takuseno
Owner

Feel free to reopen this if there is anything further to discuss. Thanks for the issue!

@araffin
Contributor Author

araffin commented Jul 24, 2021

I'll need to fix the SB3 conversion too...
It should look like this (for each episode):

import numpy as np
from d3rlpy.dataset import Episode

episodes_list = []
index = 0
for i in range(len(observations)):
    if dones[i]:
        # Convert to d3rlpy convention, see https://github.com/takuseno/d3rlpy/issues/98
        # Add last observation
        episode_obs = np.concatenate((observations[index : i + 1], next_observations[i, None]), axis=0)
        # Add virtual last action (not used in practice normally)
        episode_actions = np.concatenate((actions[index : i + 1], actions[i, None]), axis=0)
        # Add virtual first reward/termination
        episode_rewards = np.concatenate(([0.0], rewards[index : i + 1]))
        # terminals = np.concatenate(([0.0], dones[index : i + 1]))

        episode = Episode(
            observation_shape=obs_shape,
            action_size=action_size,
            observations=episode_obs,
            actions=episode_actions,
            rewards=episode_rewards,
            terminal=dones[i],
            create_mask=False,
            mask_size=0,
        )

        episodes_list.append(episode)
        index = i + 1
