Various minor PPO refactors #167

Closed · Tracked by #206

vwxyzjn opened this issue Apr 21, 2022 · 1 comment

vwxyzjn (Owner) commented Apr 21, 2022

Problem Description

Many of the formatting changes below were suggested by @Howuhh.

1. Refactor on next_done

The current code that handles done looks like this:

            next_obs, reward, done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

which is fine, but it became an issue when I tried to adapt the code for IsaacGym. Specifically, I thought the to(device) conversions were no longer needed, so I just did

            next_obs, reward, done, info = envs.step(action)

but this is wrong, because I should have also done next_done = done. The current next_done = torch.Tensor(done).to(device) does not make a lot of sense on its own: the rename from done to next_done is easy to miss once the device conversion is removed.

We should refactor it to

            next_obs, reward, next_done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(next_done).to(device)
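
With that rename, the IsaacGym-style adaptation becomes a one-line change. A minimal sketch, assuming the vectorized envs already return torch tensors on the training device (the exact return types of the IsaacGym wrapper are an assumption here, not taken from the script):

            # Hypothetical IsaacGym-style rollout step: envs.step already returns
            # torch tensors on the right device, so no numpy round-trip or
            # .to(device) conversion is needed.
            next_obs, reward, next_done, info = envs.step(action)
            rewards[step] = reward.view(-1)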

2. make_env refactor

if capture_video:
    if idx == 0:
        env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

to

if capture_video and idx == 0:
    env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
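
For context, a minimal sketch of where this lives in the make_env factory; the surrounding thunk structure is paraphrased, not an exact copy of the file:

    import gym

    def make_env(env_id, idx, capture_video, run_name):
        def thunk():
            env = gym.make(env_id)
            env = gym.wrappers.RecordEpisodeStatistics(env)
            # only the first sub-environment records video
            if capture_video and idx == 0:
                env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
            return env

        return thunk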

3. flatten batch

        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

to

        b_obs = obs.flatten(0, 1)
        b_actions = actions.flatten(0, 1)
        b_logprobs = logprobs.reshape(-1)
        b_returns = returns.reshape(-1)
        b_advantages = advantages.reshape(-1)
        b_values = values.reshape(-1)
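
The two forms are equivalent for the rollout tensors, which have shape (num_steps, num_envs, ...): flatten(0, 1) merges the first two dimensions, exactly like the reshape with the trailing space shape appended. A quick sanity check with hypothetical sizes:

    import torch

    num_steps, num_envs, obs_shape = 128, 4, (84, 84, 4)  # illustrative sizes only
    obs = torch.zeros((num_steps, num_envs) + obs_shape)

    # both produce a (num_steps * num_envs, *obs_shape) batch
    assert obs.reshape((-1,) + obs_shape).shape == obs.flatten(0, 1).shape == (512, 84, 84, 4)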

4. Collapse the nested target_kl check

            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

to

            if args.target_kl is not None and approx_kl > args.target_kl:
                break

5. Simplify the global_step increment

global_step += 1 * args.num_envs

to

global_step += args.num_envs

6. Move the num_updates computation

Move

num_updates = args.total_timesteps // args.batch_size

to the argparse.
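
A minimal sketch of what that could look like, assuming the usual CleanRL pattern of computing derived values right after parser.parse_args(); the argument list and defaults below are illustrative, not copied from the script:

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser()
        # ... other arguments elided ...
        parser.add_argument("--total-timesteps", type=int, default=500000)
        parser.add_argument("--num-envs", type=int, default=4)
        parser.add_argument("--num-steps", type=int, default=128)
        parser.add_argument("--num-minibatches", type=int, default=4)
        args = parser.parse_args()
        # derived values, computed once here instead of in the training loop
        args.batch_size = int(args.num_envs * args.num_steps)
        args.minibatch_size = int(args.batch_size // args.num_minibatches)
        args.num_updates = args.total_timesteps // args.batch_size
        return args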

vwxyzjn (Owner, Author) commented Nov 28, 2023

Closed by #424

vwxyzjn closed this as completed Nov 28, 2023