Various minor PPO refactors #167

Closed · Tracked by #206

vwxyzjn opened this issue Apr 21, 2022 · 1 comment

vwxyzjn (Owner) commented Apr 21, 2022

Problem Description

Many of the formatting changes below were suggested by @Howuhh.

1. Refactor on next_done

The current code that handles done looks like this:

            next_obs, reward, done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

which is fine, but it became an issue when I tried to adapt the code for IsaacGym. Specifically, I thought the to(device) conversions were no longer needed, so I just did

            next_obs, reward, done, info = envs.step(action)

but this is wrong, because I should have also done next_done = done. The current next_done = torch.Tensor(done).to(device) does not make a lot of sense on its own: the rename from done to next_done is easy to miss once the device conversion is removed.

We should refactor it to

            next_obs, reward, next_done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(next_done).to(device)
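
With that rename, the IsaacGym-style adaptation becomes a one-line change. A minimal sketch, assuming the vectorized envs already return torch tensors on the training device (the exact return types of the IsaacGym wrapper are an assumption here, not taken from the script):

            # Hypothetical IsaacGym-style rollout step: envs.step already returns
            # torch tensors on the right device, so no numpy round-trip or
            # .to(device) conversion is needed.
            next_obs, reward, next_done, info = envs.step(action)
            rewards[step] = reward.view(-1)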

2. make_env refactor

if capture_video:
    if idx == 0:
        env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

to

if capture_video and idx == 0:
    env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
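
For context, a minimal sketch of where this lives in the make_env factory; the surrounding thunk structure is paraphrased, not an exact copy of the file:

    import gym

    def make_env(env_id, idx, capture_video, run_name):
        def thunk():
            env = gym.make(env_id)
            env = gym.wrappers.RecordEpisodeStatistics(env)
            # only the first sub-environment records video
            if capture_video and idx == 0:
                env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
            return env

        return thunk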

3. flatten batch

        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

to

        b_obs = obs.flatten(0, 1)
        b_actions = actions.flatten(0, 1)
        b_logprobs = logprobs.reshape(-1)
        b_returns = returns.reshape(-1)
        b_advantages = advantages.reshape(-1)
        b_values = values.reshape(-1)
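
The two forms are equivalent for the rollout tensors, which have shape (num_steps, num_envs, ...): flatten(0, 1) merges the first two dimensions, exactly like the reshape with the trailing space shape appended. A quick sanity check with hypothetical sizes:

    import torch

    num_steps, num_envs, obs_shape = 128, 4, (84, 84, 4)  # illustrative sizes only
    obs = torch.zeros((num_steps, num_envs) + obs_shape)

    # both produce a (num_steps * num_envs, *obs_shape) batch
    assert obs.reshape((-1,) + obs_shape).shape == obs.flatten(0, 1).shape == (512, 84, 84, 4)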

4. Collapse the nested target_kl check

            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

to

            if args.target_kl is not None and approx_kl > args.target_kl:
                break

5. Simplify the global_step increment

global_step += 1 * args.num_envs

to

global_step += args.num_envs

6. Move the num_updates computation

Move

num_updates = args.total_timesteps // args.batch_size

to the argparse.
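
A minimal sketch of what that could look like, assuming the usual CleanRL pattern of computing derived values right after parser.parse_args(); the argument list and defaults below are illustrative, not copied from the script:

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser()
        # ... other arguments elided ...
        parser.add_argument("--total-timesteps", type=int, default=500000)
        parser.add_argument("--num-envs", type=int, default=4)
        parser.add_argument("--num-steps", type=int, default=128)
        parser.add_argument("--num-minibatches", type=int, default=4)
        args = parser.parse_args()
        # derived values, computed once here instead of in the training loop
        args.batch_size = int(args.num_envs * args.num_steps)
        args.minibatch_size = int(args.batch_size // args.num_minibatches)
        args.num_updates = args.total_timesteps // args.batch_size
        return args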

vwxyzjn (Owner, Author) commented Nov 28, 2023

Closed by #424

vwxyzjn closed this as completed Nov 28, 2023