
Reproduce in mix dataset of Hopper-v3 #3

Closed · linhlpv opened this issue May 25, 2024 · 9 comments

@linhlpv

linhlpv commented May 25, 2024

Hi @zaiyan-x ,

Thank you for your work.

I'm trying to reproduce your results on the mixed dataset of the Hopper-v3 environment. I started by running generate_offline_data.py to generate the mixed dataset for Hopper-v3, and then ran train_rfqi.py to train the agent. However, around 80k iterations, the critic loss goes to a very high value and max_eta goes to zero.
[screenshots: critic loss and max_eta training curves]

I am quite confused by this. Did you face the same behavior while training on Hopper-v3?
Thank you so much and have a nice day.
Best,
Linh

@zaiyan-x (Owner)

zaiyan-x commented May 25, 2024

Hi Linh,

I did not run into this before. It seems that the ETA network just gave up. The asynchronous updates between the ETA network and the rest of the networks could be the reason. Notice that once the ETA network gives up, the critic loss becomes high, i.e., your value network no longer estimates the robust value correctly.
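
For intuition, here is a minimal, self-contained sketch (not the repo's code; names and shapes are assumptions, matching the target expression that appears later in this thread) of the robust target and of what it degenerates to once the ETA output collapses to zero:

```python
import torch

# Hypothetical batch; this only illustrates the robust target used in this thread.
reward = torch.randn(256, 1)      # rewards
q_next = torch.randn(256, 1)      # target critic value at the next state-action
etas   = torch.rand(256, 1)       # outputs of the ETA (dual-variable) network
gamma, rho = 0.99, 0.5

# Robust target: r - gamma * max(eta - Q', 0) + gamma * (1 - rho) * eta
robust_target = reward - gamma * torch.clamp(etas - q_next, min=0.0) + gamma * (1 - rho) * etas

# If the ETA network "gives up" (eta -> 0), the same expression reduces to
# r + gamma * min(Q', 0): every positive next-state value is discarded, so the
# critic is regressed toward a badly biased target and its loss grows.
collapsed_target = reward + gamma * torch.clamp(q_next, max=0.0)
```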

My suggestion is to tune the ETA network hyperparameters a bit. Hope this helps.

Regards,

ZX

@linhlpv (Author)

linhlpv commented May 26, 2024

Thanks @zaiyan-x, let me try this.
One more question, about the choice of data generation method. In the paper, you said that you trained SAC with the model parameter actuator_ctrlrange set to [−0.85, 0.85], which leads to a more diverse dataset. I'm just curious about this specific choice. Could you please explain more about the intuition and reasoning behind training with actuator_ctrlrange set to [−0.85, 0.85]?

Thank you so much and have a good weekend :D .
Best,
Linh

@linhlpv (Author)

linhlpv commented May 27, 2024

Hi @zaiyan-x ,

I have tried to tune the ETA params a bit, but it didn't work. I realized that in Hopper-v3 the agent can be terminated before the end of the episode (1000 steps), so I incorporated the not_done signal (because the current code on GitHub doesn't have it).

Current version:
target_Q = reward - gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma

Version with not_done:
target_Q = reward - not_done * (gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma)
self.loss_log['robust_target_Q'] = target_Q.mean().item()

I found that the critic loss no longer goes too high, but the eval reward is smaller than with the current version.
[screenshot: eval reward curves]
The red and orange lines are the version with the not_done signal, and the green one is for the current version.

I'm quite confused by this result. Do you think the missing not_done signal is the reason for the critic loss problem? And what could cause the lower eval reward?
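
For context, a minimal sketch (an assumption about the data pipeline, not the repo's actual code; it uses the pre-0.26 gym step API of the v3 environments) of how a not_done flag can be recorded so that hitting the 1000-step time limit is treated as truncation rather than a true termination:

```python
import gym

# Hypothetical collection loop for Hopper-v3 (pre-0.26 gym step API).
env = gym.make('Hopper-v3')
obs = env.reset()
not_dones = []
for t in range(1000):
    action = env.action_space.sample()   # stand-in for the behavior policy
    next_obs, reward, done, info = env.step(action)
    # gym's TimeLimit wrapper marks truncation at the 1000-step limit; that is
    # not a true terminal state, so bootstrapping should not be cut off there.
    true_terminal = done and not info.get('TimeLimit.truncated', False)
    not_dones.append(0.0 if true_terminal else 1.0)
    obs = next_obs if not done else env.reset()
```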

Thank you so much :D.
Best,
Linh

@zaiyan-x (Owner)

> Could you please explain more about the intuition and reasoning behind training with actuator_ctrlrange set to [−0.85, 0.85]?

Hi Linh,

The reason we used a perturbed actuator_ctrlrange is that we wanted to see whether, if we let FQI foresee the perturbation, i.e., during training rather than only at test time, RFQI still outperforms FQI. In other words, the diversity of the training dataset is meant to "help" FQI.
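
For illustration, a minimal sketch of how such a perturbation might be applied when building the data-generation environment (this mechanism is an assumption, not the repo's exact code; it relies on the mujoco_py-backed Hopper-v3 exposing a writable model.actuator_ctrlrange):

```python
import gym
import numpy as np

# Build a data-generation environment with a narrowed control range.
env = gym.make('Hopper-v3')
model = env.unwrapped.model                  # mujoco_py model behind the v3 env
n_act = model.actuator_ctrlrange.shape[0]
model.actuator_ctrlrange[:] = np.array([[-0.85, 0.85]] * n_act)
# The SAC behavior policy is then trained/rolled out in this perturbed env,
# which is what makes the resulting "mix" dataset more diverse.
```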

Regards,

Zaiyan

@zaiyan-x (Owner)

zaiyan-x commented May 27, 2024

> Do you think the missing not_done signal is the reason for the critic loss problem? And what could cause the lower eval reward?

It could be. I think we had a discussion on this before haha ;) I am glad you found this issue. Yes, I recommend you fix it this way. As for whether this fixes the whole issue, I am not sure. I still think there is something wrong with the training (not on your side; this algorithm is just very difficult to materialize empirically). One thing I am certain of is that max_eta should not decrease to zero. In my implementation, max_eta usually fluctuated within a reasonable range (which kind of made sense to me). You can use this as a signal for whether the training has become catastrophic or not. I apologize that I can't give you a definitive suggestion on how to fix this.
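
As one concrete (hypothetical) way to use max_eta as that signal, one could log the batch maximum of the ETA outputs and flag a collapse; eta_net and its call signature here are assumptions, not the repo's actual API:

```python
import torch

def max_eta_health_check(eta_net, states, actions, collapse_threshold=1e-3):
    """Return the batch max of eta and whether it looks collapsed.

    eta_net is a hypothetical module mapping (state, action) -> eta >= 0.
    """
    with torch.no_grad():
        etas = eta_net(states, actions)
    max_eta = etas.max().item()
    # Healthy runs reportedly keep max_eta fluctuating in a moderate range;
    # a value pinned near zero is a red flag for catastrophic training.
    return max_eta, max_eta < collapse_threshold
```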

@linhlpv (Author)

linhlpv commented May 28, 2024

> One thing I am certain of is that max_eta should not decrease to zero. You can use this as a signal for whether the training has become catastrophic or not.

Thank you for your suggestion. For me, right now, it seems that using the not_done signal during training keeps the etas stable and within a reasonable range.

@linhlpv (Author)

linhlpv commented May 28, 2024

> In other words, the diversity of the training dataset is meant to "help" FQI.

Ohh, I understand. Just one follow-up question to make it clearer for me (of course :D ). I see in the paper you used epsilon-greedy during the data generation process. Did you add the random actions to make the dataset more diverse, or is there another reason for this choice?

@zaiyan-x (Owner)

> Did you add the random actions to make the dataset more diverse, or is there another reason for this choice?

Yes, it is for making the dataset more diverse. :D
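
For reference, a minimal sketch of epsilon-greedy data collection in this spirit (the policy callable and the epsilon value are assumptions, not the repo's actual API):

```python
import numpy as np

def epsilon_greedy_action(env, policy, obs, epsilon=0.3):
    """With probability epsilon take a uniform random action (diversifies the
    dataset); otherwise act with the behavior policy, e.g. the trained SAC actor."""
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return policy(obs)
```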

Hope this helps,

Zaiyan

@linhlpv (Author)

linhlpv commented May 28, 2024

Yup. Thank you so much 👍

Linh

linhlpv closed this as completed May 28, 2024