
Replace CategoricalConvPolicy #906

Merged: 2 commits into master from replace_categorical_conv_policy on Oct 16, 2019

Conversation

@lywong92 (Member) commented Oct 2, 2019

  • Remove all occurrences of CategoricalConvPolicy
  • Rename CategoricalConvPolicyWithModel to CategoricalConvPolicy
  • Create and remove integration test

Benchmark script is located in origin/benchmark_categorical_cnn_policy.

Results:
[Plot: MemorizeDigits-v0_benchmark_ppo]
[Plot: MemorizeDigits-v0_benchmark_ppo_mean]

Also tried running both versions in an Atari environment (PongNoFrameskip) with PPO, but realized that this combination of environment and algorithm is not ideal for our testing:
[Plot: PongNoFrameskip results]

As discussed, results from MemorizeDigits are sufficient to show that the layer implementation can be replaced with the model implementation.

@lywong92 requested a review from a team as a code owner on October 2, 2019 20:54
codecov bot commented Oct 2, 2019

Codecov Report

Merging #906 into master will decrease coverage by 0.39%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master     #906     +/-   ##
=========================================
- Coverage   69.67%   69.28%   -0.4%     
=========================================
  Files         171      170      -1     
  Lines        9485     9434     -51     
  Branches     1250     1249      -1     
=========================================
- Hits         6609     6536     -73     
- Misses       2660     2681     +21     
- Partials      216      217      +1
Impacted Files Coverage Δ
src/garage/tf/policies/categorical_cnn_policy.py 96% <100%> (ø)
src/garage/tf/core/network.py 0% <0%> (-23.22%) ⬇️
src/garage/envs/grid_world_env.py 86.56% <0%> (-4.48%) ⬇️
src/garage/experiment/experiment_wrapper.py 84.84% <0%> (-0.76%) ⬇️
src/garage/misc/special.py 33.33% <0%> (+3.03%) ⬆️
.../exploration_strategies/epsilon_greedy_strategy.py 100% <0%> (+3.57%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4673a26...be4c045.

@ahtsan (Contributor) left a comment

+1 for adding an example for CategoricalCNNPolicy. Please replace 'conv' with 'CNN', since that's the preferred convention, before merging. Well done!

"""Run task."""
with LocalTFRunner(snapshot_config=snapshot_config) as runner:
    env = TfEnv(normalize(gym.make('CubeCrash-v0')))
    policy = CategoricalConvPolicy(env_spec=env.spec,
Inline review comment (Contributor) on the snippet above:

Might be better to use CNN for the naming convention, so it would be CategoricalCNNPolicy. Please also edit the other occurrences.
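
For reference, a rough sketch of what such an example script could look like once the rename lands. Only the lines quoted above come from the actual diff; the convolutional keyword names, the LinearFeatureBaseline choice, and the PPO settings here are assumptions on my part, not the final example:

"""Sketch: train the renamed CategoricalCNNPolicy with PPO on CubeCrash-v0."""
import gym

from garage.envs import normalize
from garage.experiment import LocalTFRunner, run_experiment
from garage.np.baselines import LinearFeatureBaseline
from garage.tf.algos import PPO
from garage.tf.envs import TfEnv
from garage.tf.policies import CategoricalCNNPolicy


def run_task(snapshot_config, *_):
    """Run task."""
    with LocalTFRunner(snapshot_config=snapshot_config) as runner:
        env = TfEnv(normalize(gym.make('CubeCrash-v0')))

        # NOTE: the convolutional keyword names below are assumptions;
        # check the CategoricalCNNPolicy signature in your garage version.
        policy = CategoricalCNNPolicy(env_spec=env.spec,
                                      conv_filters=(32, 64),
                                      conv_filter_sizes=(8, 4),
                                      conv_strides=(4, 2),
                                      conv_pad='VALID',
                                      hidden_sizes=(32, 32))

        # A simple baseline, chosen only to keep the sketch short.
        baseline = LinearFeatureBaseline(env_spec=env.spec)

        algo = PPO(env_spec=env.spec,
                   policy=policy,
                   baseline=baseline,
                   max_path_length=100,
                   discount=0.99)

        runner.setup(algo, env)
        runner.train(n_epochs=1000, batch_size=4000)


run_experiment(run_task, snapshot_mode='last', seed=1)

The same sketch should work with the other environments discussed below (e.g. MemorizeDigits-v0) by changing the env id.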

@ryanjulian (Member) commented Oct 2, 2019

Can you run benchmarks using the following environments:

  • MemorizeDigits-v0
  • CubeCrash-v0
  • CarRacing-v0
  • Acrobot-v1^
  • MountainCar-v0^
  • CartPole-v1^
  • LunarLander-v2^

^ - Using the wrappers PixelObservationWrapper and FrameStack (n=4)

I omitted the Sparse and ScreenBecomesBlack variants of CubeCrash because those are meant to test off-policy algorithms and RNNs respectively.

First list found using this script:

> import gym
> from gym.spaces import Box
> targets = [s for s in gym.envs.registry.all()
>            if s._entry_point != 'gym.envs.atari:AtariEnv'
>            and type(s.make().observation_space) == Box
>            and len(s.make().observation_space.shape) == 3]
> print(targets)
[EnvSpec(CubeCrashScreenBecomesBlack-v0),
 EnvSpec(CubeCrash-v0),
 EnvSpec(CarRacing-v0),
 EnvSpec(MemorizeDigits-v0),
 EnvSpec(CubeCrashSparse-v0)]

Second list using this additional query:

> from gym.spaces import Discrete
> targets = [s for s in gym.envs.registry.all()
>            if s._entry_point != 'gym.envs.atari:AtariEnv'
>            and type(s.make().action_space) == Discrete
>            and 'rgb_array' in s.make().metadata['render.modes']]
> print(targets)
[EnvSpec(MountainCar-v0),
 EnvSpec(CubeCrashScreenBecomesBlack-v0),
 EnvSpec(CartPole-v1),
 EnvSpec(CartPole-v0),
 EnvSpec(CubeCrash-v0),
 EnvSpec(Acrobot-v1),
 EnvSpec(LunarLander-v2),
 EnvSpec(MemorizeDigits-v0),
 EnvSpec(CubeCrashSparse-v0)]
  • Along with your PR, can you add two examples (to examples/): one using one of the first set of environments, and another example with an environment using the wrappers? In your examples, you can also include the list of the other environments for which that example will work.
  • Please include your new benchmark script in benchmarks/, of course omitting the part of the benchmark which tests the now-deleted primitive.

@ahtsan (Contributor) commented Oct 2, 2019

> Can you run benchmarks using the following environments: [...]

CarRacing won't work since it has a continuous action space?

@ryanjulian (Member) commented:

@ahtsan the gym documentation mentions that it can easily be controlled as discrete using 100% on/off actions.

@krzentner (Contributor) commented:

I think even a small number of the Atari environments should be enough? It would be very surprising if we saw a large regression on some but not most of them.

@ryanjulian (Member) commented:

@krzentner the issue is that PPO/on-policy algos can't train Atari in any reasonable number of steps, because of exploration (and this primitive is only applicable to on-policy algos).

@krzentner (Contributor) commented:

> @krzentner the issue is that PPO/on-policy algos can't train Atari in any reasonable number of steps, because of exploration (and this primitive is only applicable to on-policy algos).

Ah, I see, my bad. The PPO paper does use Atari as a benchmark, but takes ~10M timesteps to train even the easiest envs.

The list you've given above is a good idea then (perhaps minus CarRacing, if that will take extra work).

@ryanjulian (Member) commented:

@lywong92 did you have a chance to run the additional benchmarks?

@ahtsan (Contributor) commented Oct 7, 2019

> @lywong92 did you have a chance to run the additional benchmarks?

Extra effort is needed to get discrete actions for the CarRacing environment.

@ryanjulian (Member) commented:

okay -- how much extra effort? if it's a lot, we can just test with the smaller set.

@krzentner (Contributor) commented:

If we had a ContinuousToDiscrete (or BangBangControl) env wrapper this would be trivial. Perhaps we should add one? It might actually be useful in other environments (especially with the right clipping options).
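
To make the idea concrete, here is a minimal sketch of such a wrapper (hypothetical; not existing garage API). It exposes a Discrete action space whose indices map to "bang-bang" continuous actions, i.e. every combination of each dimension pinned to its low or high bound:

import itertools

import gym
import numpy as np


class ContinuousToDiscrete(gym.ActionWrapper):
    """Hypothetical wrapper: expose a Box action space as a Discrete one.

    Each discrete index maps to one 'bang-bang' action in which every
    dimension is pinned to either its low or its high bound. This is only
    a sketch of the idea, not existing garage API.
    """

    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.Box)
        low, high = env.action_space.low, env.action_space.high
        # Enumerate every low/high combination across the action dimensions.
        self._actions = np.array(
            list(itertools.product(*zip(low, high))), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, act):
        """Map a discrete index to its continuous bang-bang action."""
        return self._actions[act]

For CarRacing-v0 (steer, gas, brake) that gives 2**3 = 8 discrete actions. As noted in the next comment, making something like this interoperate cleanly with our policies turned out to be harder than it looks.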

@krzentner (Contributor) commented:

After some discussion, implementing that wrapper correctly in a way that would interoperate with our policies seems difficult, so let's just skip CarRacing for now.

@lywong92 (Member, Author) commented Oct 8, 2019

I ran CubeCrash-v0:

[Plots: CubeCrash-v0 benchmark results]

I had some trouble using the pixel wrapper in gym. I think that's solved, but now I'm getting an error saying that the env can't be pickled. I wrapped the Acrobot env with PixelObservationWrapper, Grayscale, then StackFrames.

[Screenshot: pickling error]

Do you happen to know if there's a way to resolve this?
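
For reference, the wrapping chain described above looks roughly like this (a sketch; the import paths and constructor arguments are assumptions and may differ between gym/garage versions):

import gym
from gym.wrappers.pixel_observation import PixelObservationWrapper

from garage.envs.wrappers import Grayscale, StackFrames
from garage.tf.envs import TfEnv

env = gym.make('Acrobot-v1')
env.reset()  # some gym versions need a reset before pixels can be rendered
# NOTE: depending on the gym version, PixelObservationWrapper returns dict
# observations ({'pixels': ...}) that may need unwrapping before Grayscale.
env = PixelObservationWrapper(env)
env = Grayscale(env)                 # H x W x 3 -> H x W
env = StackFrames(env, n_frames=4)   # keep the last 4 frames
env = TfEnv(env)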

@ahtsan (Contributor) commented Oct 8, 2019

> I had some trouble using the pixel wrapper in gym. [...] Do you happen to know if there's a way to resolve this?

Update: the viewer object in gym.Env is not pickleable. We hot-fixed it by setting it to None before pickling happens, since it will always be re-created when it's None (the drawback is things might slow down a bit). We are moving forward now.
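
A minimal sketch of the idea behind that hot fix (the helper name is hypothetical, and it assumes the viewer lives on the unwrapped env as it does for gym's classic_control environments; the actual fix may live elsewhere):

def strip_viewer(env):
    """Drop the non-pickleable viewer so the env can be pickled.

    gym's classic_control environments re-create the viewer lazily in
    render() whenever it is None, so clearing it only costs re-creating
    the window later (the small slowdown mentioned above). A more complete
    fix might also close the window before discarding it.
    """
    if getattr(env.unwrapped, 'viewer', None) is not None:
        env.unwrapped.viewer = None
    return env

Something like this would be applied to each environment right before OnPolicyVectorizedSampler pickles it.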

@ryanjulian (Member) commented:

re: the viewer -- this can be handled in GarageEnv. we already have special cases for closing the viewers (because gym doesn't do it properly)

@ryanjulian (Member) commented:

when you post your results, can you use the confidence interval plots introduced by Yong, Utkarsh, and Anson? They make results much easier to interpret.

@ahtsan (Contributor) commented Oct 8, 2019

> re: the viewer -- this can be handled in GarageEnv. we already have special cases for closing the viewers (because gym doesn't do it properly)

The issue we are having is that the viewer is not pickleable and raises an error when OnPolicyVectorizedSampler tries to pickle the env while setting up sampling. I think GarageEnv handles the viewer when the environment is closed, which is a separate issue.

@ryanjulian (Member) commented:

@ahtsan i'm suggesting you can handle this in a similar way to how we handle closing environments.

@lywong92 (Member, Author) commented Oct 9, 2019

I did a trial run of Acrobot-v1 with 100 iterations, and the average return never improved. I resized the image to a smaller dimension to make it run faster for the last run. I'm trying again without resizing and with more iterations.

I compared the original environment with the one after all the wrapping:
[Screenshot: comparison of the env before and after wrapping]

I'm wondering if the missing line would be a problem.

@ryanjulian (Member) commented Oct 9, 2019

@lywong92 what you've done should be enough, though 100 iterations is a really low number. resizing should make it go faster and not really affect performance. i wouldn't expect to see signs of life until ~1000 or more (assuming the batch size is 4000). perhaps try running this with PPO rather than TRPO, which might be kinder to the CNNs.

let's not get hung up trying to get the Pixel wrappers to work right now. the goal is to ensure that this primitive isn't any worse than before.

since it looks like CubeCrash and MemorizeDigits are the environments which work today, and they look good, i think this is ready to merge.

can you make sure to add your benchmark script before merging?

@ahtsan (Contributor) commented Oct 9, 2019

Another 10 runs for CubeCrash
[Plot: CubeCrash-v0_benchmark_ppo_mean]

@lywong92 (Member, Author) commented Oct 9, 2019

> @lywong92 what you've done should be enough [...] can you make sure to add your benchmark script before merging?

Okay, will do!

@lywong92 force-pushed the replace_categorical_conv_policy branch from 7c20abf to 298f61e on October 9, 2019 23:56
@lywong92 (Member, Author) commented:

Sorry about the lag. I rebased and am seeing a possible bug when I run examples. Currently working on it and will merge as soon as I get that figured out.

@ahtsan (Contributor) commented Oct 10, 2019

This is the bug -- #794

Commits:
- Remove all occurrences of CategoricalConvPolicy
- Rename CategoricalConvPolicyWithModel to CategoricalConvPolicy
- Create and remove integration test