np.argmax may lead to unexpected behavior #51

ShangtongZhang · 2017-10-28T15:30:12Z

One issue with np.argmax is that it always returns the first index, even if there are multiple maximal values. Instead, we should use

np.random.choice([action for action, value in enumerate(values) if value == np.max(values)])

Originally, I implemented my own argmax like this in utils. However, utils was later removed after many complaints about import errors, and at that point I switched back to np.argmax. It now turns out to be problematic, especially in some tabular examples, where it may lead to an infinite loop.
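To make the infinite-loop remark concrete, here is a minimal sketch on a hypothetical 1-D corridor (not one of the repo's examples). All Q-values are held at zero, standing in for the early phase of learning before any reward has been observed. Action 0 bumps into a wall, so first-index tie-breaking never leaves the start state, while random tie-breaking eventually reaches the goal. The corridor, `steps_to_goal`, `argmax_first`, and `argmax_random` are all illustrative names.

```python
import numpy as np

def argmax_first(values):
    # np.argmax returns the index of the FIRST maximal entry
    return int(np.argmax(values))

def argmax_random(values, rng):
    # Break ties uniformly at random among all maximal indices
    values = np.asarray(values)
    return int(rng.choice(np.flatnonzero(values == values.max())))

def steps_to_goal(select_action, n_states=5, max_steps=1000):
    """Greedy walk on a hypothetical corridor with all-zero Q-values.

    Action 0 moves left (a wall keeps the agent at state 0), action 1 moves
    right, and the goal is the rightmost state. Returns the number of steps
    taken, or None if the goal is not reached within max_steps.
    """
    q = np.zeros((n_states, 2))  # both actions tie at 0 in every state
    state = 0
    for step in range(1, max_steps + 1):
        action = select_action(q[state])
        state = max(state - 1, 0) if action == 0 else state + 1
        if state == n_states - 1:
            return step
    return None

rng = np.random.default_rng(0)
print(steps_to_goal(argmax_first))                     # None: stuck at the wall
print(steps_to_goal(lambda v: argmax_random(v, rng)))  # a finite step count (varies with the seed)
```

Without the `max_steps` cap, the first-index version would loop forever, which mirrors the symptom described here.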

I have fixed the places where it is used for the behavior policy, since I think that is the only case in which np.argmax leads to a problem. However, I am leaving this issue open as a hint in case you run into a strange bug.
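One way to see why the behavior policy is the critical case (a sketch, not the repo's code): an update target that only reads the maximal value, as in Q-learning, gets the same number whichever tied index is returned, so np.argmax is harmless there. `q_learning_target` and the values below are hypothetical. This does not cover every use of a greedy policy; code that needs the greedy action itself, not just its value, can still be affected.

```python
import numpy as np

def q_learning_target(reward, q_next, gamma, pick_index):
    # A Q-learning style target uses only the VALUE of the greedy next action,
    # so any index achieving the maximum yields the same number.
    return reward + gamma * q_next[pick_index(q_next)]

rng = np.random.default_rng(0)
q_next = np.array([0.0, 2.0, 2.0, -1.0])  # actions 1 and 2 tie for the max

first = q_learning_target(1.0, q_next, 0.5, lambda q: int(np.argmax(q)))
rand = q_learning_target(
    1.0, q_next, 0.5,
    lambda q: int(rng.choice(np.flatnonzero(q == q.max()))),
)
print(first, rand)  # 2.0 2.0
```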


zwfcrazy · 2017-11-07T08:40:31Z

I just found this problem as well.
Here is another approach: we can use np.where or np.argwhere instead of writing our own code.

max_actions = np.argwhere(values==np.amax(values))
action = np.random.choice(max_actions.flatten())
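For reference, a quick check of what the np.argwhere snippet produces for a 1-D array, plus np.flatnonzero, an existing NumPy function that returns the same indices without the flatten step. The `values` array is made up for illustration.

```python
import numpy as np

values = np.array([1.0, 3.0, 3.0, 0.5, 3.0])

# For a 1-D input, np.argwhere returns a (k, 1) array of indices,
# which is why the snippet flattens it before sampling.
max_actions = np.argwhere(values == np.amax(values))
print(max_actions.shape)      # (3, 1)
print(max_actions.flatten())  # [1 2 4]

# np.flatnonzero gives the same 1-D index array directly.
print(np.flatnonzero(values == values.max()))  # [1 2 4]

# Sampling many times: each tied index should be picked about equally often.
rng = np.random.default_rng(0)
picks = rng.choice(max_actions.flatten(), size=30000)
counts = np.bincount(picks, minlength=len(values))
print(counts)  # roughly 10000 at indices 1, 2 and 4; zero elsewhere
```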

ShangtongZhang · 2017-11-07T17:12:12Z

@zwfcrazy It looks cool! Did you find any bugs in the repo, or have I fixed them all?

zwfcrazy · 2017-11-16T07:13:30Z

@ShangtongZhang Not yet. I am still reading the book... slowly... I am glad to see that Prof. Sutton has finished the draft! BTW, perhaps we could try to improve the efficiency of the simulations: Exercise 2.9 of Chapter 2 requires running the parameter study for 200k steps, which takes days to complete.
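One possible way to speed up such simulations (a sketch under stated assumptions, not the repo's approach) is to vectorize over independent runs with NumPy instead of looping over them in Python. This assumes a stationary 10-armed testbed (true values drawn from N(0, 1), rewards from N(q*(a), 1)) with sample-average epsilon-greedy; it does not implement the exercise's exact setup. Ties are still broken randomly, row by row. `argmax_random_rows` and `run_bandits` are hypothetical names.

```python
import numpy as np

def argmax_random_rows(q, rng):
    # For each row, pick uniformly among the columns equal to that row's max.
    is_max = q == q.max(axis=1, keepdims=True)
    # Non-maximal entries get key 0 and maximal ones a U(0, 1) key,
    # so the row-wise argmax lands on a uniformly random maximal column.
    keys = rng.random(q.shape) * is_max
    return keys.argmax(axis=1)

def run_bandits(n_runs=2000, n_arms=10, n_steps=1000, epsilon=0.1, seed=0):
    """Sample-average epsilon-greedy on n_runs independent testbeds at once.

    Returns the fraction of runs that chose the optimal arm at each step.
    """
    rng = np.random.default_rng(seed)
    true_q = rng.normal(0.0, 1.0, size=(n_runs, n_arms))
    best = true_q.argmax(axis=1)
    q_est = np.zeros((n_runs, n_arms))
    counts = np.zeros((n_runs, n_arms))
    rows = np.arange(n_runs)
    optimal = np.zeros(n_steps)
    for t in range(n_steps):
        greedy = argmax_random_rows(q_est, rng)
        explore = rng.random(n_runs) < epsilon
        random_arm = rng.integers(0, n_arms, size=n_runs)
        actions = np.where(explore, random_arm, greedy)
        rewards = rng.normal(true_q[rows, actions], 1.0)
        # Each row updates exactly one arm, so fancy-index updates are safe.
        counts[rows, actions] += 1
        q_est[rows, actions] += (rewards - q_est[rows, actions]) / counts[rows, actions]
        optimal[t] = np.mean(actions == best)
    return optimal

optimal = run_bandits()
print(optimal[:3], optimal[-3:])
```

The Python loop now runs once per time step rather than once per run and step, which is usually where most of the time goes.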

zwfcrazy · 2017-11-16T07:19:27Z

@ShangtongZhang Hi, I just noticed that you didn't fix the argmax problem in Chapter 2.

ShangtongZhang · 2017-11-16T15:54:04Z

@zwfcrazy Did it lead to any bugs? I didn't intend to replace every np.argmax call.

zwfcrazy · 2017-11-17T01:12:52Z

@ShangtongZhang The simulation results show no apparent difference, but the simple bandit algorithm requires breaking ties randomly (see p. 24 of the complete draft). I think np.argmax may slow down exploration at the beginning: since we assume the initial estimates Q(a) are the same for all actions, every greedy choice at the start picks action 0. This won't be critical later on, because ties rarely happen once the estimates have been updated.
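A minimal sketch of the simple bandit algorithm with the random tie-breaking mentioned here, assuming sample-average updates Q(A) <- Q(A) + [R - Q(A)] / N(A) and epsilon-greedy selection. `simple_bandit` is a hypothetical name, and `bandit(a)` is an assumed callable returning a reward for arm a.

```python
import numpy as np

def simple_bandit(bandit, n_arms, n_steps, epsilon, rng):
    """Epsilon-greedy with sample averages and random tie-breaking (sketch)."""
    q = np.zeros(n_arms)  # Q(a) <- 0 for every arm, so all arms start tied
    n = np.zeros(n_arms)  # N(a) <- 0
    actions = []
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))  # explore: uniformly random arm
        else:
            # greedy: choose among ALL maximal arms, not just the first one
            a = int(rng.choice(np.flatnonzero(q == q.max())))
        r = bandit(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # Q(A) <- Q(A) + [R - Q(A)] / N(A)
        actions.append(a)
    return q, n, actions

rng = np.random.default_rng(0)
true_values = rng.normal(0.0, 1.0, size=10)  # hypothetical 10-armed testbed
q, n, actions = simple_bandit(lambda a: rng.normal(true_values[a], 1.0), 10, 1000, 0.1, rng)
print(int(n.argmax()), int(true_values.argmax()))  # often, but not always, the same arm
```

With np.argmax in the greedy branch instead, the first greedy step of every run would pick arm 0. With random tie-breaking, the first pick is spread across all arms.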

ShangtongZhang mentioned this issue Oct 28, 2017

Shouldn't update state action value using absolute priority in Priority Sweeping #50

Closed

ShangtongZhang added the Hint label Oct 28, 2017

ShangtongZhang closed this as completed Nov 6, 2017

ShangtongZhang reopened this Nov 7, 2017

ShangtongZhang closed this as completed Apr 29, 2018

ShangtongZhang mentioned this issue May 11, 2018

One bug on the MountainCar.py in the folder Chapter12 #71

Closed
