
RND fails on LunarLander-v2 #11

Closed
seungjaeryanlee opened this issue Jul 24, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@seungjaeryanlee (Owner) commented Jul 24, 2019

Update: The problem is gone when I don't normalize observations by dividing by 255. The high value estimation loss does not seem to matter.


It used to work, but now it gives worse performance than vanilla PPO. I suspect it has something to do with either

  1. an overly high value estimation loss, or
  2. observation normalization.

| Algorithm | Average Return | Value Estimation Loss |
| --- | --- | --- |
| RND | (image) | (image) |
| PPO | (image) | (image) |
@seungjaeryanlee added the `bug` (Something isn't working) label Jul 24, 2019
@seungjaeryanlee (Owner, Author)

When calculating `value_estimation_loss`, the returns are way too large:

```python
tf.print(returns)
# With RND:    [7101.44531 7166.82178 7235.9126 ... 2192.41016 1505.95093 776.063354]
# Without RND: [-8.60589886 -8.08707809 -9.66196251 ... -2.82508373 -1.94054389 -1.00004435]

tf.print(value_preds)
# With RND:    [1.00574807e-05 1.06366251e-05 1.37969992e-05 ... 1.24972794e-05 9.22966865e-06 1.47654901e-05]
# Without RND: [-3.9881561e-05 -4.0003295e-05 -3.92220318e-05 ... -3.9767292e-05 -4.233031e-05 -4.445585e-05]
```
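For context, PPO's value estimation loss is essentially a squared error between the computed returns and the predicted values (ignoring clipping and weighting), so returns on the order of 10^3 against near-zero predictions give a loss on the order of 10^7. A minimal NumPy sketch, with magnitudes copied from the printout above rather than the actual TF-Agents loss code:

```python
import numpy as np

# Magnitudes taken from the tf.print output above.
returns_with_rnd = np.array([7101.4, 7166.8, 7235.9])   # returns explode when intrinsic rewards are added
returns_without_rnd = np.array([-8.6, -8.1, -9.7])      # normal LunarLander-v2 scale
value_preds = np.array([1.0e-05, 1.1e-05, 1.4e-05])     # value head outputs near zero early in training

def value_estimation_loss(returns, preds):
    # Mean squared error, roughly what PPO's value loss is (modulo clipping/weighting).
    return np.mean((returns - preds) ** 2)

print(value_estimation_loss(returns_with_rnd, value_preds))     # ≈ 5.1e7 — dominates the objective
print(value_estimation_loss(returns_without_rnd, value_preds))  # ≈ 78 — the vanilla-PPO scale
```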

@seungjaeryanlee (Owner, Author)

In `compute_return_and_advantage`, the normalized intrinsic rewards are far too large compared to the extrinsic rewards:

```
# Unnormalized extrinsic reward
[[2.4655962 1.55493808 -0.12013232 ... -0.0169174224 0.165178448 -0.241332144]
 [-2.2789948 -1.25498295 0.27360487 ... -1.38796663 -1.61660671 4.28619528]
 [-3.87851095 -0.429568112 0.417723715 ... 0.347376496 4.02928114 0.372619241]
 ...
 [1.25972 1.44503868 -1.37598336 ... 0.0951891318 0.978246629 1.19230056]
 [0.293478042 -0.386281192 -0.469306529 ... -2.24820375 -0.359491259 -1.34746885]
 [1.23334634 -2.23048186 -1.25190639 ... -3.44087744 -1.34803867 -2.59871984]]

# Unnormalized intrinsic reward
[[6.36824608 5.42133617 5.31663656 ... 17.2304916 16.5008812 16.64258]
 [14.4909668 15.7531471 17.2095242 ... 12.7710819 13.4693031 14.7130861]
 [19.154789 19.1849155 19.2228661 ... 16.8683529 17.8014469 17.604372]
 ...
 [12.4912157 12.3132076 12.2927904 ... 19.5716267 19.7414684 18.6168365]
 [14.0771189 14.4516859 12.9586821 ... 24.6045246 24.6045246 24.6045246]
 [13.0537968 10.8395157 10.9614 ... 26.7444553 26.7444553 26.7444553]]

# Normalized extrinsic reward
[[1 1 -1 ... -0.534975827 1 -1]
 [-1 -1 1 ... -1 -1 1]
 [-1 -1 1 ... 1 1 1]
 ...
 [1 1 -1 ... 1 1 1]
 [1 -1 -1 ... -1 -1 -1]
 [1 -1 -1 ... -1 -1 -1]]

# Normalized intrinsic reward
[[201.381607 171.437683 168.126801 ... 544.875916 521.80365 526.284546]
 [458.244568 498.158203 544.212891 ... 403.857025 425.936737 465.268585]
 [605.727539 606.680237 607.880371 ... 533.424133 562.931152 556.699097]
 ...
 [395.006897 389.377777 388.732147 ... 618.909119 624.279968 588.716]
 [445.157562 457.002411 409.78949 ... 778.063354 778.063354 778.063354]
 [412.797272 342.775543 346.629883 ... 845.733887 845.733887 845.733887]]
```
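For comparison, the reward normalization described in the RND paper divides intrinsic rewards by a running estimate of the standard deviation of the intrinsic returns, which should bring them to roughly unit scale; values in the hundreds suggest the std estimate is far too small. A rough sketch of the intended behaviour (the `return_std` values below are made up to illustrate the scale mismatch, they are not measured from this run):

```python
import numpy as np

def normalize_intrinsic(intrinsic_rewards, return_std, eps=1e-8):
    # RND-style reward normalization: divide by a running std of the discounted intrinsic returns.
    return intrinsic_rewards / (return_std + eps)

# Raw intrinsic rewards are O(10), matching the dump above.
intrinsic = np.array([6.37, 14.49, 19.15, 12.49, 14.08, 13.05])

# With a sensible std estimate (same order as the returns), normalized rewards are O(1):
print(normalize_intrinsic(intrinsic, return_std=15.0))   # ≈ [0.42 0.97 1.28 0.83 0.94 0.87]

# With a badly initialized (near-zero) std estimate, they blow up into the hundreds,
# which is the pattern in the "Normalized intrinsic reward" dump:
print(normalize_intrinsic(intrinsic, return_std=0.03))   # ≈ [212. 483. 638. 416. 469. 435.]
```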

@seungjaeryanlee (Owner, Author)

This might be the reason.

(image)

@seungjaeryanlee (Owner, Author)

Implemented `_init_rnd_normalizer`, but it does not seem to fix the issue.

| | Average Return | Value Estimation Loss |
| --- | --- | --- |
| RND | (image) | (image) |
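`_init_rnd_normalizer` isn't shown here, but the RND paper's recipe is to pre-initialize the observation normalizer's statistics by stepping a random agent in the environment for a small number of steps before training begins. A sketch of that warm-up, assuming the old-style Gym API that was current in 2019 (function and variable names are illustrative, not this repo's):

```python
import gym
import numpy as np

def collect_random_observations(env, num_steps=1000):
    """Roll out a uniformly random policy to gather observations for normalizer warm-up."""
    observations = []
    obs = env.reset()
    for _ in range(num_steps):
        observations.append(np.asarray(obs, dtype=np.float64))
        obs, _, done, _ = env.step(env.action_space.sample())
        if done:
            obs = env.reset()
    return np.stack(observations)

env = gym.make("LunarLander-v2")
warmup_obs = collect_random_observations(env)

# Seed the streaming normalizer's statistics with the warm-up batch so that the first
# normalized observations (and therefore the first RND intrinsic rewards) are sane.
init_mean = warmup_obs.mean(axis=0)
init_std = warmup_obs.std(axis=0) + 1e-8
```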

@seungjaeryanlee (Owner, Author) commented Jul 24, 2019

With `use_td_lambda_return=False`:

| | Average Return | Value Estimation Loss |
| --- | --- | --- |
| RND | (image) | (image) |

| | Average Returns | Average Value Prediction |
| --- | --- | --- |
| RND | (image) | (image) |

Similar results with `use_gae=False`.
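For reference, those two flags control how the targets for the value loss are built: with GAE, advantages are discounted sums of TD errors, and the TD(λ) return is that advantage added back onto the value baseline. A NumPy sketch of the standard formulas (not the TF-Agents implementation) shows why rewards on the scale of the normalized intrinsic rewards above produce returns in the thousands regardless of which flag is set:

```python
import numpy as np

def gae_advantages(rewards, values, next_values, discount=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory."""
    deltas = rewards + discount * next_values - values
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + discount * lam * running
        advantages[t] = running
    return advantages

def td_lambda_returns(rewards, values, next_values, discount=0.99, lam=0.95):
    """TD(lambda) returns: GAE advantages plus the value baseline."""
    return gae_advantages(rewards, values, next_values, discount, lam) + values

# Toy trajectory with rewards on the scale of the normalized intrinsic rewards above.
rewards = np.array([500.0, 520.0, 480.0])
values = np.zeros(3)
next_values = np.zeros(3)
print(td_lambda_returns(rewards, values, next_values))   # ≈ [1413.6  971.4  480.0]
```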

@seungjaeryanlee (Owner, Author)

Observation normalization was the issue. Reverted to using the streaming normalizer for both PPO and RND for now, but the situation is more complex than I had expected.
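For reference, dividing by 255 is an image-style scaling and doesn't make much sense for LunarLander-v2's 8-dimensional state vector, whereas a streaming normalizer standardizes observations with running mean/std statistics and typically clips the result, as in the RND paper's setup. A minimal sketch of that kind of normalizer, not the repo's actual class:

```python
import numpy as np

class StreamingObsNormalizer:
    """Running mean/std observation normalizer with clipping (roughly the RND-paper setup)."""

    def __init__(self, shape, clip=5.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.clip = clip
        self.eps = eps

    def update(self, obs_batch):
        # Merge batch statistics into the running statistics (parallel-variance formula).
        batch_mean = obs_batch.mean(axis=0)
        batch_var = obs_batch.var(axis=0)
        batch_count = obs_batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean = self.mean + delta * batch_count / total
        self.var = m2 / total
        self.count = total

    def normalize(self, obs):
        return np.clip((obs - self.mean) / np.sqrt(self.var + self.eps), -self.clip, self.clip)

# Usage on a batch of LunarLander-v2 observations (shape [batch_size, 8]):
normalizer = StreamingObsNormalizer(shape=(8,))
batch = np.random.randn(32, 8)            # stand-in for real observations
normalizer.update(batch)
normalized = normalizer.normalize(batch)
```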
