
RND fails on LunarLander-v2 #11

Closed
seungjaeryanlee opened this issue Jul 24, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@seungjaeryanlee (Owner) commented Jul 24, 2019

Update: The problem is gone when I don't normalize observations by dividing by 255. The high value estimation loss does not seem to matter.


It used to work, but now it gives worse performance than vanilla PPO. I suspect it has something to do with either

  1. an overly high value estimation loss, or
  2. observation normalization.

| Algorithm | Average Return | Value Estimation Loss |
| --- | --- | --- |
| RND | (image) | (image) |
| PPO | (image) | (image) |
@seungjaeryanlee added the `bug` (Something isn't working) label Jul 24, 2019
@seungjaeryanlee (Owner, Author)

When calculating `value_estimation_loss`, the returns are way too large:

```python
tf.print(returns)
# With RND:    [7101.44531 7166.82178 7235.9126 ... 2192.41016 1505.95093 776.063354]
# Without RND: [-8.60589886 -8.08707809 -9.66196251 ... -2.82508373 -1.94054389 -1.00004435]

tf.print(value_preds)
# With RND:    [1.00574807e-05 1.06366251e-05 1.37969992e-05 ... 1.24972794e-05 9.22966865e-06 1.47654901e-05]
# Without RND: [-3.9881561e-05 -4.0003295e-05 -3.92220318e-05 ... -3.9767292e-05 -4.233031e-05 -4.445585e-05]
```
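For context, PPO's value estimation loss is essentially a squared error between the computed returns and the predicted values (ignoring clipping and weighting), so returns on the order of 10^3 against near-zero predictions give a loss on the order of 10^7. A minimal NumPy sketch, with magnitudes copied from the printout above rather than the actual TF-Agents loss code:

```python
import numpy as np

# Magnitudes taken from the tf.print output above.
returns_with_rnd = np.array([7101.4, 7166.8, 7235.9])   # returns explode when intrinsic rewards are added
returns_without_rnd = np.array([-8.6, -8.1, -9.7])      # normal LunarLander-v2 scale
value_preds = np.array([1.0e-05, 1.1e-05, 1.4e-05])     # value head outputs near zero early in training

def value_estimation_loss(returns, preds):
    # Mean squared error, roughly what PPO's value loss is (modulo clipping/weighting).
    return np.mean((returns - preds) ** 2)

print(value_estimation_loss(returns_with_rnd, value_preds))     # ≈ 5.1e7 — dominates the objective
print(value_estimation_loss(returns_without_rnd, value_preds))  # ≈ 78 — the vanilla-PPO scale
```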

@seungjaeryanlee (Owner, Author)

In `compute_return_and_advantage`, the normalized intrinsic rewards are far too large compared to the extrinsic rewards:

```
# Unnormalized extrinsic reward
[[2.4655962 1.55493808 -0.12013232 ... -0.0169174224 0.165178448 -0.241332144]
 [-2.2789948 -1.25498295 0.27360487 ... -1.38796663 -1.61660671 4.28619528]
 [-3.87851095 -0.429568112 0.417723715 ... 0.347376496 4.02928114 0.372619241]
 ...
 [1.25972 1.44503868 -1.37598336 ... 0.0951891318 0.978246629 1.19230056]
 [0.293478042 -0.386281192 -0.469306529 ... -2.24820375 -0.359491259 -1.34746885]
 [1.23334634 -2.23048186 -1.25190639 ... -3.44087744 -1.34803867 -2.59871984]]

# Unnormalized intrinsic reward
[[6.36824608 5.42133617 5.31663656 ... 17.2304916 16.5008812 16.64258]
 [14.4909668 15.7531471 17.2095242 ... 12.7710819 13.4693031 14.7130861]
 [19.154789 19.1849155 19.2228661 ... 16.8683529 17.8014469 17.604372]
 ...
 [12.4912157 12.3132076 12.2927904 ... 19.5716267 19.7414684 18.6168365]
 [14.0771189 14.4516859 12.9586821 ... 24.6045246 24.6045246 24.6045246]
 [13.0537968 10.8395157 10.9614 ... 26.7444553 26.7444553 26.7444553]]

# Normalized extrinsic reward
[[1 1 -1 ... -0.534975827 1 -1]
 [-1 -1 1 ... -1 -1 1]
 [-1 -1 1 ... 1 1 1]
 ...
 [1 1 -1 ... 1 1 1]
 [1 -1 -1 ... -1 -1 -1]
 [1 -1 -1 ... -1 -1 -1]]

# Normalized intrinsic reward
[[201.381607 171.437683 168.126801 ... 544.875916 521.80365 526.284546]
 [458.244568 498.158203 544.212891 ... 403.857025 425.936737 465.268585]
 [605.727539 606.680237 607.880371 ... 533.424133 562.931152 556.699097]
 ...
 [395.006897 389.377777 388.732147 ... 618.909119 624.279968 588.716]
 [445.157562 457.002411 409.78949 ... 778.063354 778.063354 778.063354]
 [412.797272 342.775543 346.629883 ... 845.733887 845.733887 845.733887]]
```
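For comparison, the reward normalization described in the RND paper divides intrinsic rewards by a running estimate of the standard deviation of the intrinsic returns, which should bring them to roughly unit scale; values in the hundreds suggest the std estimate is far too small. A rough sketch of the intended behaviour (the `return_std` values below are made up to illustrate the scale mismatch, they are not measured from this run):

```python
import numpy as np

def normalize_intrinsic(intrinsic_rewards, return_std, eps=1e-8):
    # RND-style reward normalization: divide by a running std of the discounted intrinsic returns.
    return intrinsic_rewards / (return_std + eps)

# Raw intrinsic rewards are O(10), matching the dump above.
intrinsic = np.array([6.37, 14.49, 19.15, 12.49, 14.08, 13.05])

# With a sensible std estimate (same order as the returns), normalized rewards are O(1):
print(normalize_intrinsic(intrinsic, return_std=15.0))   # ≈ [0.42 0.97 1.28 0.83 0.94 0.87]

# With a badly initialized (near-zero) std estimate, they blow up into the hundreds,
# which is the pattern in the "Normalized intrinsic reward" dump:
print(normalize_intrinsic(intrinsic, return_std=0.03))   # ≈ [212. 483. 638. 416. 469. 435.]
```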

@seungjaeryanlee (Owner, Author)

This might be the reason.

(image)

@seungjaeryanlee (Owner, Author)

Implemented `_init_rnd_normalizer`, but it does not seem to fix the issue.

| | Average Return | Value Estimation Loss |
| --- | --- | --- |
| RND | (image) | (image) |
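`_init_rnd_normalizer` isn't shown here, but the RND paper's recipe is to pre-initialize the observation normalizer's statistics by stepping a random agent in the environment for a small number of steps before training begins. A sketch of that warm-up, assuming the old-style Gym API that was current in 2019 (function and variable names are illustrative, not this repo's):

```python
import gym
import numpy as np

def collect_random_observations(env, num_steps=1000):
    """Roll out a uniformly random policy to gather observations for normalizer warm-up."""
    observations = []
    obs = env.reset()
    for _ in range(num_steps):
        observations.append(np.asarray(obs, dtype=np.float64))
        obs, _, done, _ = env.step(env.action_space.sample())
        if done:
            obs = env.reset()
    return np.stack(observations)

env = gym.make("LunarLander-v2")
warmup_obs = collect_random_observations(env)

# Seed the streaming normalizer's statistics with the warm-up batch so that the first
# normalized observations (and therefore the first RND intrinsic rewards) are sane.
init_mean = warmup_obs.mean(axis=0)
init_std = warmup_obs.std(axis=0) + 1e-8
```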

@seungjaeryanlee (Owner, Author) commented Jul 24, 2019

With `use_td_lambda_return=False`:

| | Average Return | Value Estimation Loss |
| --- | --- | --- |
| RND | (image) | (image) |

| | Average Returns | Average Value Prediction |
| --- | --- | --- |
| RND | (image) | (image) |

Similar results with `use_gae=False`.
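For reference, those two flags control how the targets for the value loss are built: with GAE, advantages are discounted sums of TD errors, and the TD(λ) return is that advantage added back onto the value baseline. A NumPy sketch of the standard formulas (not the TF-Agents implementation) shows why rewards on the scale of the normalized intrinsic rewards above produce returns in the thousands regardless of which flag is set:

```python
import numpy as np

def gae_advantages(rewards, values, next_values, discount=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory."""
    deltas = rewards + discount * next_values - values
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + discount * lam * running
        advantages[t] = running
    return advantages

def td_lambda_returns(rewards, values, next_values, discount=0.99, lam=0.95):
    """TD(lambda) returns: GAE advantages plus the value baseline."""
    return gae_advantages(rewards, values, next_values, discount, lam) + values

# Toy trajectory with rewards on the scale of the normalized intrinsic rewards above.
rewards = np.array([500.0, 520.0, 480.0])
values = np.zeros(3)
next_values = np.zeros(3)
print(td_lambda_returns(rewards, values, next_values))   # ≈ [1413.6  971.4  480.0]
```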

@seungjaeryanlee (Owner, Author)

Observation normalization was the issue. Reverted to using the streaming normalizer for both PPO and RND for now, but the situation is more complex than I had expected.
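For reference, dividing by 255 is an image-style scaling and doesn't make much sense for LunarLander-v2's 8-dimensional state vector, whereas a streaming normalizer standardizes observations with running mean/std statistics and typically clips the result, as in the RND paper's setup. A minimal sketch of that kind of normalizer, not the repo's actual class:

```python
import numpy as np

class StreamingObsNormalizer:
    """Running mean/std observation normalizer with clipping (roughly the RND-paper setup)."""

    def __init__(self, shape, clip=5.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps
        self.clip = clip
        self.eps = eps

    def update(self, obs_batch):
        # Merge batch statistics into the running statistics (parallel-variance formula).
        batch_mean = obs_batch.mean(axis=0)
        batch_var = obs_batch.var(axis=0)
        batch_count = obs_batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean = self.mean + delta * batch_count / total
        self.var = m2 / total
        self.count = total

    def normalize(self, obs):
        return np.clip((obs - self.mean) / np.sqrt(self.var + self.eps), -self.clip, self.clip)

# Usage on a batch of LunarLander-v2 observations (shape [batch_size, 8]):
normalizer = StreamingObsNormalizer(shape=(8,))
batch = np.random.randn(32, 8)            # stand-in for real observations
normalizer.update(batch)
normalized = normalizer.normalize(batch)
```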
