
Categorical Q Network Performing Worse than Q Network #331

Open
lukepolson opened this issue Mar 19, 2020 · 7 comments

@lukepolson

lukepolson commented Mar 19, 2020

I am currently training an agent that receives sparse rewards (a reward of +1 approximately every 5-10 moves). I have the discount factor set to 0.99.

I trained both a Q Network and a Categorical Q Network on the environment. For both I used fc_layer_params=[100, 100, 100], and for the Categorical Q Network I used n_atoms=51, min_q_value=-20, max_q_value=20, and n_step_update=2. The Categorical Q Network performs significantly worse: its average reward initially increases up to +10 (after 1 million steps) and then gradually decreases to +6 (from 1 million to 10 million steps).
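For reference, the categorical setup looks roughly like the sketch below, following tutorial 9 (train_env, optimizer, and train_step_counter stand in for the corresponding objects in my own script):

    from tf_agents.agents.categorical_dqn import categorical_dqn_agent
    from tf_agents.networks import categorical_q_network
    from tf_agents.utils import common

    # Same hidden layers as the plain Q network, plus a 51-atom value distribution.
    categorical_q_net = categorical_q_network.CategoricalQNetwork(
        train_env.observation_spec(),
        train_env.action_spec(),
        num_atoms=51,
        fc_layer_params=(100, 100, 100))

    agent = categorical_dqn_agent.CategoricalDqnAgent(
        train_env.time_step_spec(),
        train_env.action_spec(),
        categorical_q_network=categorical_q_net,
        optimizer=optimizer,
        min_q_value=-20,
        max_q_value=20,
        n_step_update=2,
        td_errors_loss_fn=common.element_wise_squared_loss,
        gamma=0.99,
        train_step_counter=train_step_counter)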

Might there be a reason for this? Are the min and max Q values inappropriate? I'm quite confused: the C51 paper linked in your tutorial says that the C51 network is particularly suited to complex environments with sparse rewards, which is precisely my environment (I'm training a snake game where the agent gets a +1 reward for eating food and a -1 reward for dying).

Any help or suggestions are appreciated.

@sguada
Member

sguada commented Mar 19, 2020

You may need to adjust the values of the Categorical Q-Network. For instance, if you never get negative rewards, or only get -1 when the snake dies, you may want to change to min_q_value=-1; you may also need to adjust to n_atoms=11 or n_atoms=21.

The default values are those used for Atari.
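To see what those values control: the predicted return distribution is supported on n_atoms evenly spaced points between min_q_value and max_q_value, so the support should roughly cover the returns your agent can actually achieve. A quick NumPy sketch of the resulting atom spacing (the numbers are just an example):

    import numpy as np

    def atom_support(min_q, max_q, num_atoms):
        # Atoms z_i = min_q + i * dz, with dz = (max_q - min_q) / (num_atoms - 1).
        return np.linspace(min_q, max_q, num_atoms)

    print(atom_support(-20, 20, 51))  # current setting: spacing 0.8 over [-20, 20]
    print(atom_support(-1, 20, 21))   # suggested: spacing ~1.05 over [-1, 20]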

@lukepolson
Author

What's the reasoning for adjusting n_atoms to 11 or 21? Wouldn't that just give a distribution of Q-values with less resolution? Has this ever been shown to help?

@sguada
Member

sguada commented Mar 19, 2020

Take a look at Figure 3 of https://arxiv.org/pdf/1707.06887.pdf, where they run different numbers of atoms to find the best setting for Atari; 5, 11, 21, and 51 were the values they tried.

@lukepolson
Author

lukepolson commented Mar 19, 2020

I see... so 21 atoms seemed to perform better for Asterix. I am making the following changes:

trial 1: min_q_value=-1, max_q_value=20, n_atoms=21, n_step_update=2
trial 2: min_q_value=-1, max_q_value=20, n_atoms=11, n_step_update=2

I will come back in 8 hours once my code has finished running and comment on the results.

@lukepolson
Author

lukepolson commented Mar 19, 2020

I should also note that I was using an epsilon-greedy policy (decayed to eps=0.01 over 500000 steps) for the Q Network but not for the Categorical Q Network (my reasoning for not using one was that there is none in tutorial 9). Could this possibly be the reason the agent is performing poorly, i.e. not enough initial exploration while learning? @sguada

@sguada
Member

sguada commented Mar 19, 2020

Yeah, that could explain the bad performance; you need some extra exploration at the beginning of training, even for Categorical Q-Networks.
The tutorials are intended to show different use cases and don't cover every aspect in each one.
Try to keep the same exploration policy for both agents.
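Something along the lines of the sketch below should work (not tested here; I believe EpsilonGreedyPolicy also accepts a callable for epsilon, so you can pass the same decaying schedule to both DqnAgent and CategoricalDqnAgent through the epsilon_greedy argument):

    import tensorflow as tf

    # Shared schedule: epsilon decays linearly from 1.0 to 0.01 over the first 500000 steps.
    train_step_counter = tf.Variable(0, dtype=tf.int64)
    epsilon_fn = tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=1.0,   # initial epsilon
        decay_steps=500000,
        end_learning_rate=0.01)      # final epsilon

    # Pass the same callable to both agents, e.g. for the categorical one:
    agent = categorical_dqn_agent.CategoricalDqnAgent(
        train_env.time_step_spec(),
        train_env.action_spec(),
        categorical_q_network=categorical_q_net,
        optimizer=optimizer,
        epsilon_greedy=lambda: epsilon_fn(train_step_counter),
        min_q_value=-1,
        max_q_value=20,
        n_step_update=2,
        train_step_counter=train_step_counter)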

@lukepolson
Author

lukepolson commented Mar 20, 2020

@sguada Here are the results for n_atoms=21 vs n_atoms=51:

[Plot: average return during training on the snake environment, n_atoms=21 vs n_atoms=51]

Recall that this was trained with a uniform replay buffer and an epsilon-greedy policy decayed over 500000 steps. The initial negative reward is simply because I give the snake -0.5 for running into itself (and then take a different random action so that it doesn't actually run into itself); initially it runs into itself a lot.

The most interesting feature I see here is that the reward gets worse for n_atoms=51 over time but stays static for n_atoms=21. What could be the reason for this? It seems to suggest that n_atoms=21 has somehow found a more accurate distribution of the Q-values. Or is it possibly hinting at a design flaw somewhere else?

This seems to be the exact opposite of Figure 3 in the paper you linked (see the Breakout graph, for example), where a small number of atoms causes the reward to decrease over time while a larger number of atoms causes it to increase.
