Training hyperparameters besides Table 6 #9
Thanks for your interest in our work. 1 & 2: Please refer to Section E.1 in our updated paper for more training details. We used a batch size of 16 with DDP. We didn't accumulate gradients.
1 & 2: The updated version of the paper now has more details, thanks!
We just weighted rotation and position equally. Components within the rotation/position terms are also equally weighted; e.g., assuming a rotation is represented as a quaternion with four parameters, each component is weighted by 1/4.
Besides clamping all quaternion components to the range [-1, 1], we didn't apply any other constraints. We opted to directly represent quaternions with four parameters. You may also represent rotations with the 6-parameter representation (the first two columns of a rotation matrix).
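A minimal sketch of the equal weighting described above (the function name and the 3+4 action layout are illustrative assumptions, not taken from the released code): the position and rotation terms each contribute half of the loss, and components are averaged within each group, so each quaternion component effectively carries a 1/4 weight inside the rotation term.

```python
def weighted_action_loss(pred, target):
    """Squared-error loss with rotation and position weighted equally.

    pred / target: 7-element lists [x, y, z, qw, qx, qy, qz]
    (3 position components + 4 quaternion components).
    """
    # Average over the 3 position components.
    pos_err = sum((p - t) ** 2 for p, t in zip(pred[:3], target[:3])) / 3
    # Average over the 4 quaternion components (each weighted by 1/4).
    rot_err = sum((p - t) ** 2 for p, t in zip(pred[3:7], target[3:7])) / 4
    # Position and rotation contribute equally to the total.
    return 0.5 * pos_err + 0.5 * rot_err
```

A perfect prediction yields zero loss; a unit error in one position component yields (1/3) · 0.5 = 1/6.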
Thanks for your reply. I have another question. During inference (judging from the code), the last token of the prompt-obs-action sequence is taken as the predicted action token, which the decoder then maps to an action. After the prediction, that action is encoded and appended to form the new token sequence. During training, however, how do you handle the action tokens at intermediate steps, e.g., the step-1 action token in a 3-step sequence? Unlike in a language model, where Word-2 is predicted from the ground-truth Word-1, we have no such ground truth for actions: the Action-1 embedding is unknown because the encoder is still being learned.
If I understand your question correctly: during training, the model is conditioned on ground-truth past actions to make predictions.
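A hedged illustration of the distinction being made here (all names are hypothetical, not from the released code): during training the input sequence interleaves observations with ground-truth past actions (teacher forcing), while at inference the model's own predictions fill those slots.

```python
def build_training_sequence(prompt, observations, gt_actions):
    """Teacher forcing: interleave obs and ground-truth action tokens.

    Produces [prompt..., o1, a1, o2, a2, ...]; action t is predicted
    from the position of obs t, conditioned on ground-truth actions < t.
    """
    seq = list(prompt)
    for obs, act in zip(observations, gt_actions):
        seq.append(obs)
        seq.append(act)
    return seq


def rollout(prompt, observations, predict):
    """At inference, past *predicted* actions are fed back instead of GT."""
    seq = list(prompt)
    actions = []
    for obs in observations:
        seq.append(obs)
        act = predict(seq)   # predict from the last (obs) token
        actions.append(act)
        seq.append(act)      # the prediction becomes part of the history
    return actions
```

`predict` stands in for the full transformer forward pass plus action decoder.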
Thanks for your reply. A further question: how can we pass the past actions to the transformer? Since the ground-truth action tokens must be encoded by the action encoder, which is itself trained from scratch, the resulting tokens are no longer exactly ground truth.
Hi @SiyuanHuang95, thanks for the follow-up. Yes, you are right: the action encoder is also trained from scratch. In practice it is updated every mini-batch. In other words, the same action encoder is applied to all history actions in the batch; we then compute the gradients and update the encoder.
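One way to read this, as a toy sketch (a single scalar weight stands in for a real network; everything here is illustrative): one shared encoder instance embeds every history action in the mini-batch, and a single parameter update then changes how all subsequent history actions are embedded.

```python
class ActionEncoder:
    """Toy stand-in for an action encoder trained from scratch."""

    def __init__(self):
        self.weight = 1.0  # pretend this is the network's parameters

    def encode(self, action):
        return [self.weight * a for a in action]

    def update(self, lr, grad):
        self.weight -= lr * grad  # one SGD step per mini-batch


encoder = ActionEncoder()
history = [[0.5, 0.5], [0.25, 0.75]]

# The *same* encoder instance embeds every history action in the batch...
tokens = [encoder.encode(a) for a in history]

# ...and one update per mini-batch changes how all future history
# actions will be embedded.
encoder.update(lr=0.1, grad=2.0)
```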
Hi @yunfanjiang, thanks for your update. A few follow-up questions:
1. During inference, for the first action, the input to the transformer is the sequence (prompt tokens + obs tokens), and the predicted action token is read from the last element of that sequence, i.e., from the obs token. For the second action, the input is (prompt tokens + first obs tokens + first action token + second obs tokens). Is that right? If so, for training we should construct the complete sequence, with the ground-truth first action token inserted alongside the prompt and obs tokens?
2. I guess the initialization of the action embedding would then be somewhat important. What kind of initialization do you use?
3. During batch training, some sequences need to be padded to the same length. What padding strategy do you use? Do you pad every training sample to the same number of action steps, and within each step to the same maximal number of objects, given that action steps and object counts vary across sequences? Or do you only pad to the same total length without considering the varying object counts?
4. Would it be possible for you to provide the data-processing script? It would help followers understand your algorithm better.
Thanks for the follow-up questions! To answer them:
1. No; during inference the model is conditioned on the prompt and its own rollout history, so the past actions are what it predicts rather than ground-truth actions.
2. We didn't observe any sensitivity to the initialization of the action embedding layer, so we just used the default initialization, with values drawn from a normal distribution.
3. We pad sequences to the maximum length in the batch, accounting for both trajectory steps and the varying number of objects.
4. Since the data processing depends on an internal codebase, I'm afraid we don't have any plans to release it at this point. That being said, we will let you know if anything changes.
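A sketch of the padding strategy described above (a hypothetical helper, not the released code): pad along both axes, trajectory steps and per-step object tokens, to the per-batch maximum, and return a mask that marks real entries.

```python
def pad_batch(batch, pad_token=0):
    """Pad a batch of trajectories to its own maxima.

    batch: list of trajectories; each trajectory is a list of steps;
    each step is a list of object tokens (counts may vary).
    Returns (padded, mask), where mask is 1 for real tokens and 0 for padding.
    """
    max_steps = max(len(traj) for traj in batch)
    max_objs = max(len(step) for traj in batch for step in traj)
    padded, mask = [], []
    for traj in batch:
        p_traj, m_traj = [], []
        for step in traj:
            pad = max_objs - len(step)
            p_traj.append(list(step) + [pad_token] * pad)  # pad objects
            m_traj.append([1] * len(step) + [0] * pad)
        for _ in range(max_steps - len(traj)):             # pad steps
            p_traj.append([pad_token] * max_objs)
            m_traj.append([0] * max_objs)
        padded.append(p_traj)
        mask.append(m_traj)
    return padded, mask
```

Padding to the per-batch maximum (rather than a global maximum) keeps sequences short when a batch happens to contain only short trajectories or few objects.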
Hi,
Thanks for sharing your work!
Could you share more training hyperparameters beyond those in Table 6 of your appendix?