Reproduce on mix dataset of Hopper-v3 #3
Hi Linh, I did not run into this before. It seems that the ETA network just gave up. The asynchronous updates between ETA and the rest could be the reason. You can notice that once the ETA network gives up, the critic loss becomes high, i.e., your value network no longer estimates the robust value correctly. My suggestion is to tune the ETA network hyperparameters a bit. Hope this helps. Regards, ZX
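This collapse is easy to catch early with a small logging check. Below is a minimal sketch; `eta_net`, the batch tensors, and the logging cadence are illustrative assumptions, not identifiers from the repo itself:

```python
# Hypothetical training-loop diagnostic; `eta_net` is assumed to be a torch
# module mapping (states, actions) -> per-sample eta values.
import torch

def log_eta_health(eta_net, states, actions, critic_loss, step, log_every=1000):
    """Print max eta and critic loss so the failure mode described above
    (max_eta -> 0 while the critic loss blows up) is visible early."""
    if step % log_every != 0:
        return
    with torch.no_grad():
        etas = eta_net(states, actions)
    max_eta = etas.max().item()
    print(f"step={step} max_eta={max_eta:.4f} critic_loss={critic_loss:.4f}")
    if max_eta < 1e-3:
        print("warning: eta outputs have collapsed; robust value "
              "estimates from the critic are likely unreliable")
```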
Thanks @zaiyan-x, let me try this. Thank you so much and have a good weekend :D
Hi @zaiyan-x , I have tried tuning the ETA params a bit, but it didn't work. I realized that in Hopper-v3 the agent can be terminated before the end of the episode (1000 steps). So I incorporated the not_done signal (the current code on GitHub doesn't have it), and I found that the critic loss no longer goes too high, but the eval reward is smaller than with the current version. I'm quite confused about this result. Do you think the missing not_done signal is the reason for the critic loss problem? And what could cause the lower eval reward? Thank you so much :D.
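In code, the fix being described looks roughly like this. A minimal sketch, assuming a plain (non-robust) TD target just to show where `not_done` enters; `critic`, `target_critic`, `policy`, and the batch field names are illustrative, not the repo's actual identifiers:

```python
import torch
import torch.nn.functional as F

def masked_td_target(reward, not_done, next_q, gamma=0.99):
    """not_done is 1.0 if the episode continues past this transition and
    0.0 if it terminated, so bootstrapping is cut off at terminal states."""
    return reward + gamma * not_done * next_q

# Illustrative usage inside a training step:
# with torch.no_grad():
#     next_q = target_critic(next_state, policy(next_state))
#     target = masked_td_target(reward, not_done, next_q)
# critic_loss = F.mse_loss(critic(state, action), target)
```

Without the `not_done` mask, the critic bootstraps through terminal states, which can inflate value estimates and the critic loss.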
Hi Linh, The reason we used a perturbed […] Regards, Zaiyan
It could be. I think we had a discussion on this before haha ;) I am glad you found this issue. Yes, I recommend you fix it this way. As for whether this fixes the whole issue, I am not sure. I still think there is something wrong with the training (not you, just that this algorithm is very difficult to materialize empirically). One thing I am certain of is that […]
Thank you for your suggestion. For me, right now it seems that using the not_done signal during training makes the etas stable and keeps them in a reasonable range.
Ohh, I understand. Just one follow-up question to make it clearer for me (of course :D). I see in the paper you used epsilon-greedy during the data generation process. Did you add the random actions to make the dataset more diverse, or is there another reason for this choice?
Yes, it is for making the dataset more diverse. :D Hope this helps, Zaiyan
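For concreteness, epsilon-greedy data collection amounts to something like the sketch below. `policy.select_action` and `eps=0.3` are illustrative assumptions, not the paper's actual settings; `action_space.sample()` is Gym's uniform sampler over the action space:

```python
import numpy as np

def epsilon_greedy_action(policy, state, action_space, eps=0.3):
    """With probability eps take a uniformly random action (this is what
    diversifies the offline dataset); otherwise follow the behavior policy."""
    if np.random.rand() < eps:
        return action_space.sample()
    return policy.select_action(state)
```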
Yup. Thank you so much 👍 Linh
Hi @zaiyan-x,
Thanks for your work.
I'm trying to reproduce your results on the mix dataset of the Hopper-v3 environment. I start by running generate_offline_data.py to generate the mix dataset for Hopper-v3, then run train_rfqi.py to train the agent. However, around 80k iterations the critic loss rises to a high value and max_eta goes to zero.
I am quite confused about this behavior. Did you encounter the same behavior while training on Hopper-v3?
Thank you so much and have a nice day.
Best,
Linh