Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

np.argmax may lead to unexpected behavior #51

Closed
ShangtongZhang opened this issue Oct 28, 2017 · 6 comments
Closed

np.argmax may lead to unexpected behavior #51

ShangtongZhang opened this issue Oct 28, 2017 · 6 comments
Labels

Comments

@ShangtongZhang
Copy link
Owner

ShangtongZhang commented Oct 28, 2017

One issue of np.argmax is that it always return the first index even if there is many maximal values. Instead we should use

np.random.choice([action for action, value in enumerate(values) if value == np.max(values)])

At the very beginning I implemented my own argmax in utils like this. However later on utils was removed due to much complaint of import error and at that time I switched back to np.argmax. Now it turns out to be problematic especially for some tabular examples. It may lead to infinite loop.

I fixed it if it's used for behavior policy as I think it's the only case np.argmax will lead to problem. However I still leave this issue open to give some hint if you find some strange bug.

@zwfcrazy
Copy link

zwfcrazy commented Nov 7, 2017

I just found this problem as well.
Here is another approach, we can use np.where or np.argwhere instead of writing our own codes.

max_actions = np.argwhere(values==np.amax(values))
action = np.random.choice(max_actions.flatten())

@ShangtongZhang
Copy link
Owner Author

@zwfcrazy It looks cool! Did you find any bug in the repo, or I have fixed them all?

@zwfcrazy
Copy link

@ShangtongZhang Not yet. I am still reading the book...slowly...I am glad to see that Prof. Sutton has finished the draft! BTW, perhaps we could try to improve the efficiency of the simulations...Exercise 2.9 of chapter 2 requires running the parameter study for 200k steps, which takes days to complete...

@zwfcrazy
Copy link

@ShangtongZhang Hi I just found you didn't fix the argmax problem in chapter 2.

@ShangtongZhang
Copy link
Owner Author

@zwfcrazy Did it lead to some bugs? I didn't mean to replace all the np.argmax

@zwfcrazy
Copy link

zwfcrazy commented Nov 17, 2017

@ShangtongZhang the simulation results seem no difference, but the simple bandit algorithm requires breaking ties randomly (see p24 of the complete draft). I think it may slow down exploration at the beginning since we assume initial estimates of Q(a) are the same for all actions. This won't be critical later on as ties happen rarely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants