Splitting out the optimizer utility + LR scheduler for PoS #1320
Conversation
bert_parameters = [p for n, p in model.named_parameters() if p.requires_grad and n.startswith("bert_model.")]
bert_parameters = [{'param_group_name': 'bert', 'params': bert_parameters, 'lr': lr * bert_learning_rate}]
else:
    # because PEFT handles what to hand to an optimizer, we don't want to touch that
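For context, the first branch above collects the transformer weights into their own parameter group so they can train at a scaled learning rate. A minimal, hypothetical sketch of how such groups could feed a dict of separate optimizers; the `build_split_optimizers` name, the AdamW choice, and the `'general'` key are assumptions, not necessarily what the PR does:

```python
import torch

def build_split_optimizers(model, lr, bert_learning_rate):
    # Transformer weights get their own optimizer (and scaled lr) so they can
    # later follow their own schedule; everything else trains at the base lr.
    bert_params = [p for n, p in model.named_parameters()
                   if p.requires_grad and n.startswith("bert_model.")]
    other_params = [p for n, p in model.named_parameters()
                    if p.requires_grad and not n.startswith("bert_model.")]
    return {
        'bert': torch.optim.AdamW([{'param_group_name': 'bert',
                                    'params': bert_params,
                                    'lr': lr * bert_learning_rate}]),
        'general': torch.optim.AdamW(other_params, lr=lr),
    }
```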
what's the difference between the blocks? i would expect that `named_parameters` would include the parameters under `model.bert_model` with `bert_model.` as the prefix of the name
This was a typing mistake; it should be `.parameters()`. That was an oversight when applying patches. Apparently PEFT does weird shenanigans with the parameters list (it works with `.default.lora_a`, etc.), which shadows the original parameters and may not match a specific name filter. This has been corrected in c72a3b9, which sets it correctly as `.parameters()` in the second case.
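In other words, once PEFT wraps the transformer the original weight names are shadowed (adapter weights appear under rewritten names such as `...default.lora_a...`), so a `bert_model.` prefix filter can miss what is actually trainable. A rough sketch of the distinction, with `model.bert_model` and the `is_peft` flag assumed from the surrounding code:

```python
def trainable_bert_params(model, is_peft):
    if not is_peft:
        # Plain finetuning: the transformer weights keep their bert_model. prefix,
        # so selecting them by name works.
        return [p for n, p in model.named_parameters()
                if p.requires_grad and n.startswith("bert_model.")]
    # PEFT already restricts which weights require grad and renames them,
    # so trust whatever the wrapped module reports as trainable.
    return [p for p in model.bert_model.parameters() if p.requires_grad]
```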
stanza/models/pos/trainer.py
Outdated
@@ -48,7 +58,7 @@ def update(self, batch, eval=False):
             self.model.eval()
         else:
             self.model.train()
-            self.optimizer.zero_grad()
+            [i.zero_grad() for i in self.optimizers.values()]
doesn't need to go in a list, i would think
fixed in c72a3b9. apologies.
stanza/models/pos/trainer.py
Outdated
@@ -59,7 +69,11 @@ def update(self, batch, eval=False):

         loss.backward()
         torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.args['max_grad_norm'])
-        self.optimizer.step()
+        [i.step() for i in self.optimizers.values()]
again, maybe just `for optimizer in self.optimizers.values(): optimizer.step()`
addressed in c72a3b9
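Taken together, the two suggestions amount to plain loops over `self.optimizers.values()` in the update step. A schematic sketch of how that could read, shown outside its class for brevity; the `compute_loss` helper is hypothetical, standing in for the real forward pass:

```python
import torch

def update(self, batch, eval=False):
    if eval:
        self.model.eval()
    else:
        self.model.train()
        # plain loops instead of throwaway list comprehensions
        for optimizer in self.optimizers.values():
            optimizer.zero_grad()

    loss = self.compute_loss(batch)  # hypothetical helper for the forward pass

    if not eval:
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.args['max_grad_norm'])
        for optimizer in self.optimizers.values():
            optimizer.step()
    return loss.item()
```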
Looks pretty good to me, thanks. Let me try it out (hopefully tonight) before merging
No rush; will move on to PEFTing NER or depparse after I survive 109 final tonight; thanks as always!
Force-pushed from c72a3b9 to 0f2d0ac
…Bert; this method is backwards compatible, such that get_optimizer is retained for methods that have not yet been migrated, while a new get_split_optimizer is used for methods that want an optimizer that has been split out. Further, the optimizer takes an is_peft option which stages the optimizer to tune .parameters() of the Bert model instead of filtering on .named_parameters(). This lets HF Peft do its weird shenanigans to the parameters value, and when the trainable weights are constrained by the Peft library we tune what it tells us to tune instead of selecting parameters by name. Taking advantage of the fact that we now have a split optimizer, Part of Speech tagging with Bert finetuning gains a learning rate scheduler with a warmup and a linear decay if the user requests the Bert model to be tuned.
…inetuning is doing what we want yet, though)
Force-pushed from 3ef1dbc to 8f137a7
Thanks! Hopefully this helps peft be more effective for depparse or NER, even if it wasn't helping POS yet...
This PR splits out the utility optimizer into separate parts for Bert and non-Bert; this method is backwards compatible, such that `get_optimizer` is retained for methods that have not yet been migrated, while a new `get_split_optimizer` is used for methods that want an optimizer that has been split out.

Further, the optimizer takes an `is_peft` option which stages the optimizer to tune `.parameters()` of the Bert model instead of filtering on `.named_parameters()`. This lets HF Peft do its weird shenanigans to the parameters value, and when the trainable weights are constrained by the Peft library we tune what it tells us to tune instead of selecting parameters by name.

Taking advantage of the fact that we now have a split optimizer, Part of Speech tagging with Bert finetuning gains a learning rate scheduler with a warmup and a linear decay if the user requests the Bert model to be tuned.
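For reference, a warmup plus linear decay on the Bert optimizer can be expressed as a plain `LambdaLR`; this is only a sketch of the scheme described above, not necessarily the PR's implementation, and the step counts in the usage comment are made up:

```python
import torch

def build_warmup_linear_scheduler(bert_optimizer, warmup_steps, total_steps):
    # Learning rate multiplier: ramp from 0 to 1 over warmup_steps,
    # then decay linearly back to 0 by total_steps.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(bert_optimizer, lr_lambda)

# Usage sketch: step once per batch, right after the bert optimizer steps.
# scheduler = build_warmup_linear_scheduler(optimizers['bert'], warmup_steps=1000, total_steps=50000)
# ... loss.backward(); optimizers['bert'].step(); scheduler.step()
```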