
Reproduce in mix dataset of Hopper-v3 #3

Closed · linhlpv opened this issue May 25, 2024 · 9 comments

@linhlpv

linhlpv commented May 25, 2024

Hi @zaiyan-x ,

Thank you for your work.

I'm trying to reproduce your results on the mixed dataset of the Hopper-v3 environment. I started by running generate_offline_data.py to generate the mixed dataset for Hopper-v3, and then ran train_rfqi.py to train the agent. However, around 80k iterations, the critic loss goes to a very high value and max_eta goes to zero.
[screenshots: critic loss and max_eta training curves]

I am quite confused by this. Did you face the same behavior while training on Hopper-v3?
Thank you so much and have a nice day.
Best,
Linh

@zaiyan-x (Owner)

zaiyan-x commented May 25, 2024

Hi Linh,

I did not run into this before. It seems that the ETA network just gave up. The asynchronous updates between the ETA network and the rest of the networks could be the reason. Notice that once the ETA network gives up, the critic loss becomes high, i.e., your value network no longer estimates the robust value correctly.
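
For intuition, here is a minimal, self-contained sketch (not the repo's code; names and shapes are assumptions, matching the target expression that appears later in this thread) of the robust target and of what it degenerates to once the ETA output collapses to zero:

```python
import torch

# Hypothetical batch; this only illustrates the robust target used in this thread.
reward = torch.randn(256, 1)      # rewards
q_next = torch.randn(256, 1)      # target critic value at the next state-action
etas   = torch.rand(256, 1)       # outputs of the ETA (dual-variable) network
gamma, rho = 0.99, 0.5

# Robust target: r - gamma * max(eta - Q', 0) + gamma * (1 - rho) * eta
robust_target = reward - gamma * torch.clamp(etas - q_next, min=0.0) + gamma * (1 - rho) * etas

# If the ETA network "gives up" (eta -> 0), the same expression reduces to
# r + gamma * min(Q', 0): every positive next-state value is discarded, so the
# critic is regressed toward a badly biased target and its loss grows.
collapsed_target = reward + gamma * torch.clamp(q_next, max=0.0)
```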

My suggestion is to tune the ETA network hyperparameters a bit. Hope this helps.

Regards,

ZX

@linhlpv (Author)

linhlpv commented May 26, 2024

Thanks @zaiyan-x, let me try this.
One more question, about the choice of data generation method. In the paper, you said that you trained SAC with the model parameter actuator_ctrlrange set to [−0.85, 0.85], which leads to a more diverse dataset. I'm just curious about this specific choice. Could you please explain more about the intuition and reasoning behind training with actuator_ctrlrange set to [−0.85, 0.85]?

Thank you so much and have a good weekend :D .
Best,
Linh

@linhlpv (Author)

linhlpv commented May 27, 2024

Hi @zaiyan-x ,

I have tried to tune the ETA params a bit, but it didn't work. I realized that in Hopper-v3 the agent can be terminated before the end of the episode (1000 steps), so I incorporated the not_done signal (because the current code on GitHub doesn't have it).

Current version:
target_Q = reward - gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma

Version with not_done:
target_Q = reward - not_done * (gamma * torch.maximum(etas - target_Q, etas.new_tensor(0)) + (1 - rho) * etas * gamma)
self.loss_log['robust_target_Q'] = target_Q.mean().item()

I found that the critic loss no longer goes too high, but the eval reward is smaller than with the current version.
[screenshot: eval reward curves]
The red and orange lines are the version with the not_done signal, and the green one is for the current version.

I'm quite confused by this result. Do you think the missing not_done signal is the reason for the critic loss problem? And what could cause the lower eval reward?
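
For context, a minimal sketch (an assumption about the data pipeline, not the repo's actual code; it uses the pre-0.26 gym step API of the v3 environments) of how a not_done flag can be recorded so that hitting the 1000-step time limit is treated as truncation rather than a true termination:

```python
import gym

# Hypothetical collection loop for Hopper-v3 (pre-0.26 gym step API).
env = gym.make('Hopper-v3')
obs = env.reset()
not_dones = []
for t in range(1000):
    action = env.action_space.sample()   # stand-in for the behavior policy
    next_obs, reward, done, info = env.step(action)
    # gym's TimeLimit wrapper marks truncation at the 1000-step limit; that is
    # not a true terminal state, so bootstrapping should not be cut off there.
    true_terminal = done and not info.get('TimeLimit.truncated', False)
    not_dones.append(0.0 if true_terminal else 1.0)
    obs = next_obs if not done else env.reset()
```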

Thank you so much :D.
Best,
Linh

@zaiyan-x (Owner)

> Could you please explain more about the intuition and reasoning behind training with actuator_ctrlrange set to [−0.85, 0.85]?

Hi Linh,

The reason we used a perturbed actuator_ctrlrange is that we wanted to see whether, if we let FQI foresee the perturbation, i.e., during training rather than only at test time, RFQI still outperforms FQI. In other words, the diversity of the training dataset is meant to "help" FQI.
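
For illustration, a minimal sketch of how such a perturbation might be applied when building the data-generation environment (this mechanism is an assumption, not the repo's exact code; it relies on the mujoco_py-backed Hopper-v3 exposing a writable model.actuator_ctrlrange):

```python
import gym
import numpy as np

# Build a data-generation environment with a narrowed control range.
env = gym.make('Hopper-v3')
model = env.unwrapped.model                  # mujoco_py model behind the v3 env
n_act = model.actuator_ctrlrange.shape[0]
model.actuator_ctrlrange[:] = np.array([[-0.85, 0.85]] * n_act)
# The SAC behavior policy is then trained/rolled out in this perturbed env,
# which is what makes the resulting "mix" dataset more diverse.
```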

Regards,

Zaiyan

@zaiyan-x (Owner)

zaiyan-x commented May 27, 2024

> Do you think the missing not_done signal is the reason for the critic loss problem? And what could cause the lower eval reward?

It could be. I think we had a discussion on this before haha ;) I am glad you found this issue. Yes, I recommend you fix it this way. As for whether this fixes the whole issue, I am not sure. I still think there is something wrong with the training (not on your side; this algorithm is just very difficult to materialize empirically). One thing I am certain of is that max_eta should not decrease to zero. In my implementation, max_eta usually fluctuated within a reasonable range (which kind of made sense to me). You can use this as a signal for whether the training has become catastrophic or not. I apologize that I can't give you a definitive suggestion on how to fix this.
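
As one concrete (hypothetical) way to use max_eta as that signal, one could log the batch maximum of the ETA outputs and flag a collapse; eta_net and its call signature here are assumptions, not the repo's actual API:

```python
import torch

def max_eta_health_check(eta_net, states, actions, collapse_threshold=1e-3):
    """Return the batch max of eta and whether it looks collapsed.

    eta_net is a hypothetical module mapping (state, action) -> eta >= 0.
    """
    with torch.no_grad():
        etas = eta_net(states, actions)
    max_eta = etas.max().item()
    # Healthy runs reportedly keep max_eta fluctuating in a moderate range;
    # a value pinned near zero is a red flag for catastrophic training.
    return max_eta, max_eta < collapse_threshold
```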

@linhlpv (Author)

linhlpv commented May 28, 2024

> One thing I am certain of is that max_eta should not decrease to zero. You can use this as a signal for whether the training has become catastrophic or not.

Thank you for your suggestion. For me, right now, it seems that using the not_done signal during training keeps the etas stable and within a reasonable range.

@linhlpv (Author)

linhlpv commented May 28, 2024

> In other words, the diversity of the training dataset is meant to "help" FQI.

Ohh, I understand. Just one follow-up question to make it clearer for me (of course :D ). I see in the paper you used epsilon-greedy during the data generation process. Did you add the random actions to make the dataset more diverse, or is there another reason for this choice?

@zaiyan-x (Owner)

> Did you add the random actions to make the dataset more diverse, or is there another reason for this choice?

Yes, it is for making the dataset more diverse. :D
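
For reference, a minimal sketch of epsilon-greedy data collection in this spirit (the policy callable and the epsilon value are assumptions, not the repo's actual API):

```python
import numpy as np

def epsilon_greedy_action(env, policy, obs, epsilon=0.3):
    """With probability epsilon take a uniform random action (diversifies the
    dataset); otherwise act with the behavior policy, e.g. the trained SAC actor."""
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return policy(obs)
```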

Hope this helps,

Zaiyan

@linhlpv (Author)

linhlpv commented May 28, 2024

Yup. Thank you so much 👍

Linh

linhlpv closed this as completed May 28, 2024