
Training hyperparameters besides Table 6 #9

Closed
SiyuanHuang95 opened this issue Apr 5, 2023 · 9 comments

@SiyuanHuang95

Hi,

Thanks for sharing your work!

Could you share more of the training hyperparameters besides those in Table 6 of your appendix?

  1. Batch size and number of epochs (or, equivalently, batch size and number of training steps)?
  2. Do you use gradient accumulation to increase the effective batch size?
  3. A training question for imitation learning. Take a 2-step sequence as an example: a2 should be predicted conditioned on the history action a1. During training, is a1 predicted by the model, or is it filled in with the ground-truth action? If it is the predicted one, we would be forcing the model to predict the correct output from a drifted input (the last pose).
@yunfanjiang
Member

Thanks for your interest in our work.

1 & 2. Please refer to the Section E.1 in our updated paper for more training details. We used a batch size of 16 with DDP. We didn't accumulate gradients.

  1. Models are trained by optimizing NLL of actions predicted autoregressively.
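A minimal sketch of such a teacher-forced autoregressive NLL objective, assuming actions are discretized into bins; `model`, the batch layout, and all names here are hypothetical illustrations, not the authors' actual code:

```python
import torch
import torch.nn.functional as F

# Hypothetical training step: the model sees ground-truth past actions
# (teacher forcing) and is trained to minimize the NLL of each next action,
# here a cross-entropy over discretized action bins.
def training_step(model, batch):
    # batch["actions"]: ground-truth action bin indices, shape (B, T)
    logits = model(
        prompt=batch["prompt"],
        obs=batch["obs"],
        past_actions=batch["actions"][:, :-1],  # GT history, shifted by one
    )  # -> (B, T, num_bins)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["actions"].reshape(-1),
        ignore_index=-100,  # mask out padded steps
    )
```

Because conditioning is on ground-truth history rather than model rollouts, all per-step losses can be computed in parallel in a single forward pass.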

@SiyuanHuang95
Author

1 & 2: The updated version paper now has more details, thanks!
3: how do you balance the loss between rotation and position, and how is the constrain for the rotation parameter?

@yunfanjiang
Member

How do you balance the loss between rotation and position

We simply weighted rotation and position equally. Components within rotation/position are also equally weighted; e.g., since a quaternion is represented with four parameters, each one is weighted by 1/4.

What constraint is applied to the rotation parameters?

Besides forcing all quaternion components into the range [-1, 1], we didn't apply any other constraints. We opted to represent quaternions directly with four parameters. You could also represent rotations with a 6-parameter representation (e.g., the first two columns of the rotation matrix).
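A hedged sketch of this weighting scheme (hypothetical names, not the repository's code), assuming each action component has its own discretized prediction head so the per-component NLLs can be averaged within each group:

```python
import torch
import torch.nn.functional as F

# Equal weighting as described above: position and rotation contribute
# equally to the total loss, and components within each group are averaged,
# so each quaternion component carries weight 1/4 (each position dim 1/3).
def action_nll(pos_logits, pos_targets, quat_logits, quat_targets):
    # pos_logits: list of 3 tensors of shape (B, num_bins)
    # quat_logits: list of 4 tensors of shape (B, num_bins)
    pos_loss = torch.stack([
        F.cross_entropy(l, t) for l, t in zip(pos_logits, pos_targets)
    ]).mean()   # each position component weighted 1/3
    quat_loss = torch.stack([
        F.cross_entropy(l, t) for l, t in zip(quat_logits, quat_targets)
    ]).mean()   # each quaternion component weighted 1/4
    return pos_loss + quat_loss  # rotation and position weighted equally
```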

@SiyuanHuang95
Author

Thanks for your reply.

I have another question:

During inference (from the code), the last token of the prompt-obs-action sequence is taken as the predicted action token, which is then passed through the decoder to obtain the actions. After the prediction, that action is encoded and padded to form the new sequence of tokens.

However, during training, how do you handle the action tokens at intermediate steps, such as the step-1 action token in a 3-step sequence? Unlike in a language model, where Word-2 is predicted based on the ground-truth Word-1, we do not have such ground truth for actions, since the Action-1 embedding is unknown and the encoder is still being learned.

@yunfanjiang
Member

yunfanjiang commented May 27, 2023

However, during training, how do you handle the action tokens at intermediate steps, such as the step-1 action token in a 3-step sequence? Unlike in a language model, where Word-2 is predicted based on the ground-truth Word-1, we do not have such ground truth for actions, since the Action-1 embedding is unknown and the encoder is still being learned.

If I understand your question correctly: during training, the model is conditioned on the ground-truth past actions (i.e., teacher forcing) to make its predictions.

@SiyuanHuang95
Author

Thanks for your reply.

My further question would be: how can we pass the past actions to the transformer? Since the ground-truth actions must be encoded by the action encoder, which is itself trained from scratch, the resulting tokens are no longer precisely the ground truth.

@yunfanjiang
Member

Hi @SiyuanHuang95, thanks for the follow-up. Yes, you are right: the action encoder is also trained from scratch. In practice it is updated every mini-batch. In other words, the same action encoder is applied to all history actions in the batch; we then calculate the gradients and update the encoder.
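A sketch of this arrangement under assumed shapes and names (`ActionEncoder` and everything around it are hypothetical): one encoder, trained from scratch, embeds every ground-truth history action, and the optimizer step after each mini-batch updates it along with the rest of the model:

```python
import torch
import torch.nn as nn

# One shared encoder maps raw ground-truth actions to transformer tokens.
class ActionEncoder(nn.Module):
    def __init__(self, action_dim=7, embed_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, actions):  # (B, T, action_dim) -> (B, T, embed_dim)
        return self.net(actions)

encoder = ActionEncoder()
gt_history = torch.randn(16, 10, 7)   # ground-truth past actions in the batch
action_tokens = encoder(gt_history)   # the same encoder embeds every step
# ...feed action_tokens into the transformer, compute the NLL loss, then
# loss.backward(); optimizer.step() updates the encoder once per mini-batch.
```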

@SiyuanHuang95
Author

Hi @yunfanjiang, thanks for your update. I have one follow-up question:

During inference:

For the first action, the input to the transformer is the sequence (prompt tokens + obs tokens), and the predicted action token comes from the last element of that sequence, i.e., from the obs tokens. For the second action, the input to the transformer would be the sequence (prompt tokens + first obs tokens + ground-truth first action token + second obs tokens). Am I right?

So for the training procedure, should we construct a complete sequence with the ground-truth first action token inserted along with the prompt and obs tokens? I would guess the initialization of that action embedding is somewhat important. What kind of initialization do you use?

Also, during batch training some sequences need to be padded to the same length. What padding strategy do you use? Do you pad every training sample to the same number of action steps and, within each step, to the same maximal number of objects, since the number of steps and objects varies across sequences? Or do you only pad to the same total length, without considering the varying object counts?

Would it be possible for you to provide the data-processing scripts? That would help followers understand your algorithm better.

@yunfanjiang
Member

Thanks for the follow-up questions! To answer them:

[...] For the second action, the input to the transformer would be the sequence (prompt tokens + first obs tokens + ground-truth first action token + second obs tokens)

No, during inference the model is conditioned on the prompt and its own rollout history. Therefore, the past actions are the ones it predicted, not ground-truth actions.
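A hedged sketch of this rollout loop; every name here (`encode_prompt`, `encode_obs`, `encode_action`, `decode_action`, `env`) is a hypothetical placeholder rather than the actual API:

```python
# Inference rollout: unlike training, the model conditions on its *own*
# predicted actions, not ground-truth ones.
def rollout(model, env, prompt, max_steps,
            encode_prompt, encode_obs, encode_action, decode_action):
    tokens = list(encode_prompt(prompt))
    obs = env.reset()
    for _ in range(max_steps):
        tokens += list(encode_obs(obs))        # append new observation tokens
        action_token = model(tokens)[-1]       # predict from the last position
        action = decode_action(action_token)   # decode to an executable action
        obs = env.step(action)                 # advance the environment
        tokens += list(encode_action(action))  # append the *predicted* action
    return tokens
```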

I would guess the initialization of that action embedding is somewhat important. What kind of initialization do you use?

We didn't observe any sensitivity to the initialization of the action embedding layer, so we just used the default scheme, with values drawn from normal distributions.
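For reference, if the action embedding were a standard `nn.Embedding` (an assumption, not confirmed by the thread), PyTorch's default initialization already draws weights from N(0, 1), matching the description above:

```python
import torch.nn as nn

# nn.Embedding weights default to samples from a standard normal distribution.
action_embedding = nn.Embedding(num_embeddings=512, embedding_dim=768)
nn.init.normal_(action_embedding.weight)  # explicit equivalent of the default
```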

Also, during batch training some sequences need to be padded to the same length. What padding strategy do you use? Do you pad every training sample to the same number of action steps and, within each step, to the same maximal number of objects, since the number of steps and objects varies across sequences? Or do you only pad to the same total length, without considering the varying object counts?

We pad sequences to the max length within the batch, considering both trajectory steps and the varying number of objects.
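A minimal sketch of such a collate function (assumed tensor layout, not the authors' pipeline): each sample is padded to the batch's max trajectory length and max object count, with a boolean mask marking the real positions:

```python
import torch

# Pad a batch of variable-length trajectories with variable object counts.
# Each sample's "obs" tensor has shape (steps, objects, feat_dim).
def collate(samples):
    T = max(s["obs"].shape[0] for s in samples)  # max trajectory steps
    O = max(s["obs"].shape[1] for s in samples)  # max objects per step
    D = samples[0]["obs"].shape[2]
    obs = torch.zeros(len(samples), T, O, D)
    mask = torch.zeros(len(samples), T, O, dtype=torch.bool)
    for i, s in enumerate(samples):
        t, o = s["obs"].shape[:2]
        obs[i, :t, :o] = s["obs"]
        mask[i, :t, :o] = True  # True = real token, False = padding
    return {"obs": obs, "mask": mask}
```

The mask can then be used both to exclude padded positions from attention and to drop them from the loss.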

Would it be possible for you to provide the data-processing scripts? That would help followers understand your algorithm better.

Since the data processing depends on an internal codebase, I'm afraid we don't have any plans to release it at this point. That said, we will let you know if anything changes.
