
Training hyperparameters besides Table 6 #9

Closed
SiyuanHuang95 opened this issue Apr 5, 2023 · 9 comments

@SiyuanHuang95

Hi,

Thanks for sharing your work!

Could you share more of the training hyperparameters besides those in Table 6 of your appendix?

  1. Batch size and number of epochs (or, equivalently, batch size and number of training steps)?
  2. Do you use gradient accumulation to increase the effective batch size?
  3. A training question for imitation learning. Take a 2-step sequence as an example: a2 should be predicted conditioned on the history action a1. During training, is a1 predicted by the model, or is it filled in with the ground-truth action? If it is the predicted one, we would be forcing the model to predict the correct output from a drifted input (the last pose).
@yunfanjiang
Member

Thanks for your interest in our work.

1 & 2. Please refer to the Section E.1 in our updated paper for more training details. We used a batch size of 16 with DDP. We didn't accumulate gradients.

  1. Models are trained by optimizing NLL of actions predicted autoregressively.
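A minimal sketch of such a teacher-forced autoregressive NLL objective, assuming actions are discretized into bins; `model`, the batch layout, and all names here are hypothetical illustrations, not the authors' actual code:

```python
import torch
import torch.nn.functional as F

# Hypothetical training step: the model sees ground-truth past actions
# (teacher forcing) and is trained to minimize the NLL of each next action,
# here a cross-entropy over discretized action bins.
def training_step(model, batch):
    # batch["actions"]: ground-truth action bin indices, shape (B, T)
    logits = model(
        prompt=batch["prompt"],
        obs=batch["obs"],
        past_actions=batch["actions"][:, :-1],  # GT history, shifted by one
    )  # -> (B, T, num_bins)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["actions"].reshape(-1),
        ignore_index=-100,  # mask out padded steps
    )
```

Because conditioning is on ground-truth history rather than model rollouts, all per-step losses can be computed in parallel in a single forward pass.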

@SiyuanHuang95
Author

1 & 2: The updated version paper now has more details, thanks!
3: how do you balance the loss between rotation and position, and how is the constrain for the rotation parameter?

@yunfanjiang
Member

How do you balance the loss between rotation and position

We simply weighted rotation and position equally. Components within rotation/position are also equally weighted; e.g., since a quaternion is represented with four parameters, each one is weighted by 1/4.

What constraint is applied to the rotation parameters?

Besides forcing all quaternion components into the range [-1, 1], we didn't apply any other constraints. We opted to represent quaternions directly with four parameters. You could also represent rotations with a 6-parameter representation (e.g., the first two columns of the rotation matrix).
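A hedged sketch of this weighting scheme (hypothetical names, not the repository's code), assuming each action component has its own discretized prediction head so the per-component NLLs can be averaged within each group:

```python
import torch
import torch.nn.functional as F

# Equal weighting as described above: position and rotation contribute
# equally to the total loss, and components within each group are averaged,
# so each quaternion component carries weight 1/4 (each position dim 1/3).
def action_nll(pos_logits, pos_targets, quat_logits, quat_targets):
    # pos_logits: list of 3 tensors of shape (B, num_bins)
    # quat_logits: list of 4 tensors of shape (B, num_bins)
    pos_loss = torch.stack([
        F.cross_entropy(l, t) for l, t in zip(pos_logits, pos_targets)
    ]).mean()   # each position component weighted 1/3
    quat_loss = torch.stack([
        F.cross_entropy(l, t) for l, t in zip(quat_logits, quat_targets)
    ]).mean()   # each quaternion component weighted 1/4
    return pos_loss + quat_loss  # rotation and position weighted equally
```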

@SiyuanHuang95
Author

Thanks for your reply.

I have another question:

During inference (from the code), the last token of the prompt-obs-action sequence is taken as the predicted action token, which is then passed through the decoder to obtain the actions. After the prediction, that action is encoded and padded to form the new sequence of tokens.

However, during training, how do you handle the action tokens at intermediate steps, such as the step-1 action token in a 3-step sequence? Unlike in a language model, where Word-2 is predicted based on the ground-truth Word-1, we do not have such ground truth for actions, since the Action-1 embedding is unknown and the encoder is still being learned.

@yunfanjiang
Member

yunfanjiang commented May 27, 2023

However, during training, how do you handle the action tokens at intermediate steps, such as the step-1 action token in a 3-step sequence? Unlike in a language model, where Word-2 is predicted based on the ground-truth Word-1, we do not have such ground truth for actions, since the Action-1 embedding is unknown and the encoder is still being learned.

If I understand your question correctly: during training, the model is conditioned on the ground-truth past actions (i.e., teacher forcing) to make its predictions.

@SiyuanHuang95
Author

Thanks for your reply.

My further question would be: how can we pass the past actions to the transformer? Since the ground-truth actions must be encoded by the action encoder, which is itself trained from scratch, the resulting tokens are no longer precisely the ground truth.

@yunfanjiang
Member

Hi @SiyuanHuang95, thanks for the follow-up. Yes, you are right: the action encoder is also trained from scratch. In practice it is updated every mini-batch. In other words, the same action encoder is applied to all history actions in the batch; we then calculate the gradients and update the encoder.
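A sketch of this arrangement under assumed shapes and names (`ActionEncoder` and everything around it are hypothetical): one encoder, trained from scratch, embeds every ground-truth history action, and the optimizer step after each mini-batch updates it along with the rest of the model:

```python
import torch
import torch.nn as nn

# One shared encoder maps raw ground-truth actions to transformer tokens.
class ActionEncoder(nn.Module):
    def __init__(self, action_dim=7, embed_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, actions):  # (B, T, action_dim) -> (B, T, embed_dim)
        return self.net(actions)

encoder = ActionEncoder()
gt_history = torch.randn(16, 10, 7)   # ground-truth past actions in the batch
action_tokens = encoder(gt_history)   # the same encoder embeds every step
# ...feed action_tokens into the transformer, compute the NLL loss, then
# loss.backward(); optimizer.step() updates the encoder once per mini-batch.
```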

@SiyuanHuang95
Author

Hi @yunfanjiang, thanks for your update. I have one follow-up question:

During inference:

For the first action, the input to the transformer is the sequence (prompt tokens + obs tokens), and the predicted action token comes from the last element of that sequence, i.e., from the obs tokens. For the second action, the input to the transformer would be the sequence (prompt tokens + first obs tokens + ground-truth first action token + second obs tokens). Am I right?

So for the training procedure, should we construct a complete sequence with the ground-truth first action token inserted along with the prompt and obs tokens? I would guess the initialization of that action embedding is somewhat important. What kind of initialization do you use?

Also, during batch training some sequences need to be padded to the same length. What padding strategy do you use? Do you pad every training sample to the same number of action steps and, within each step, to the same maximal number of objects, since the number of steps and objects varies across sequences? Or do you only pad to the same total length, without considering the varying object counts?

Would it be possible for you to provide the data-processing scripts? That would help followers understand your algorithm better.

@yunfanjiang
Member

Thanks for the follow-up questions! To answer them:

[...] For the second action, the input to the transformer would be the sequence (prompt tokens + first obs tokens + ground-truth first action token + second obs tokens)

No, during inference the model is conditioned on the prompt and its own rollout history. Therefore, the past actions are the ones it predicted, not ground-truth actions.
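A hedged sketch of this rollout loop; every name here (`encode_prompt`, `encode_obs`, `encode_action`, `decode_action`, `env`) is a hypothetical placeholder rather than the actual API:

```python
# Inference rollout: unlike training, the model conditions on its *own*
# predicted actions, not ground-truth ones.
def rollout(model, env, prompt, max_steps,
            encode_prompt, encode_obs, encode_action, decode_action):
    tokens = list(encode_prompt(prompt))
    obs = env.reset()
    for _ in range(max_steps):
        tokens += list(encode_obs(obs))        # append new observation tokens
        action_token = model(tokens)[-1]       # predict from the last position
        action = decode_action(action_token)   # decode to an executable action
        obs = env.step(action)                 # advance the environment
        tokens += list(encode_action(action))  # append the *predicted* action
    return tokens
```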

I would guess the initialization of that action embedding is somewhat important. What kind of initialization do you use?

We didn't observe any sensitivity to the initialization of the action embedding layer, so we just used the default scheme, with values drawn from normal distributions.
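For reference, if the action embedding were a standard `nn.Embedding` (an assumption, not confirmed by the thread), PyTorch's default initialization already draws weights from N(0, 1), matching the description above:

```python
import torch.nn as nn

# nn.Embedding weights default to samples from a standard normal distribution.
action_embedding = nn.Embedding(num_embeddings=512, embedding_dim=768)
nn.init.normal_(action_embedding.weight)  # explicit equivalent of the default
```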

Also, during batch training some sequences need to be padded to the same length. What padding strategy do you use? Do you pad every training sample to the same number of action steps and, within each step, to the same maximal number of objects, since the number of steps and objects varies across sequences? Or do you only pad to the same total length, without considering the varying object counts?

We pad sequences to the max length within the batch, considering both trajectory steps and the varying number of objects.
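A minimal sketch of such a collate function (assumed tensor layout, not the authors' pipeline): each sample is padded to the batch's max trajectory length and max object count, with a boolean mask marking the real positions:

```python
import torch

# Pad a batch of variable-length trajectories with variable object counts.
# Each sample's "obs" tensor has shape (steps, objects, feat_dim).
def collate(samples):
    T = max(s["obs"].shape[0] for s in samples)  # max trajectory steps
    O = max(s["obs"].shape[1] for s in samples)  # max objects per step
    D = samples[0]["obs"].shape[2]
    obs = torch.zeros(len(samples), T, O, D)
    mask = torch.zeros(len(samples), T, O, dtype=torch.bool)
    for i, s in enumerate(samples):
        t, o = s["obs"].shape[:2]
        obs[i, :t, :o] = s["obs"]
        mask[i, :t, :o] = True  # True = real token, False = padding
    return {"obs": obs, "mask": mask}
```

The mask can then be used both to exclude padded positions from attention and to drop them from the loss.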

Would it be possible for you to provide the data-processing scripts? That would help followers understand your algorithm better.

Since the data processing depends on an internal codebase, I'm afraid we don't have any plans to release it at this point. That said, we will let you know if anything changes.
