Splitting out the optimizer utility + LR scheduler for PoS #1320

Merged
merged 2 commits into dev from split-optim on Dec 14, 2023

Conversation

Jemoka
Member

@Jemoka Jemoka commented Dec 13, 2023

This PR splits the utility optimizer into separate parts for Bert and non-Bert parameters. The change is backwards compatible: get_optimizer is retained for models that have not yet been migrated, while a new get_split_optimizer is available for models that want the optimizer split out.
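
For orientation, here is a minimal sketch of the split shape, assuming AdamW for both groups and the lr * bert_learning_rate scaling shown in the review diff below; the actual get_split_optimizer signature and defaults live in the utility module and may differ.

import torch

def build_split_optimizers(bert_parameters, other_parameters, lr, bert_learning_rate):
    # Hypothetical illustration: one optimizer per parameter group, returned
    # as a dict so the trainer can loop over .values() for zero_grad()/step().
    return {
        "bert_optimizer": torch.optim.AdamW(bert_parameters, lr=lr * bert_learning_rate),
        "general_optimizer": torch.optim.AdamW(other_parameters, lr=lr),
    }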

Further, the optimizer takes an is_peft option, which configures the optimizer to tune .parameters() of the Bert model instead of filtering .named_parameters() by name. This accommodates the way HF PEFT rewrites the parameter list: when the set of trainable weights is constrained by the PEFT library, we tune whatever it tells us to tune rather than selecting parameters by their names.

Taking advantage of the split optimizer, Part of Speech tagging with Bert finetuning now features a learning rate scheduler with a warmup and a linear decay when the user requests that Bert be tuned.
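
As a rough illustration only (not the PR's exact wiring; warmup_steps and total_steps are assumed hyperparameter names), a warmup plus linear decay schedule can be attached to the Bert optimizer like this:

import torch

def warmup_linear_decay(bert_optimizer, warmup_steps, total_steps):
    # Ramp the Bert learning rate linearly from 0 up to its configured value
    # over warmup_steps, then decay it linearly back to 0 by total_steps.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(bert_optimizer, lr_lambda)

The scheduler's step() would then be called once per training step, alongside the optimizer steps.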

bert_parameters = [p for n, p in model.named_parameters() if p.requires_grad and n.startswith("bert_model.")]
bert_parameters = [{'param_group_name': 'bert', 'params': bert_parameters, 'lr': lr * bert_learning_rate}]
else:
# because PEFT handles what to hand to an optimizer, we don't want to touch that
Collaborator


What's the difference between the blocks? I would expect that named_parameters would include the parameters under model.bert_model, with bert_model. as the prefix of the name.

Member Author

@Jemoka Jemoka Dec 13, 2023


This was a typing mistake; it should be .parameters(). It was an oversight when applying patches. PEFT does weird shenanigans with the parameter list (it works with .default.lora_a and similar names), which shadows the original parameters so they may not match a specific name filter. This has been corrected in c72a3b9, which sets the second case to use .parameters().
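
For context, a minimal sketch (a hypothetical helper, not the PR's code) of the two selection strategies being contrasted: the name filter used in the non-PEFT path versus handing .parameters() to the optimizer so PEFT controls what is trainable.

def select_bert_trainable(model, is_peft=False):
    # Hypothetical helper illustrating the two strategies discussed above.
    if is_peft:
        # PEFT decides which weights are trainable (e.g. adapter weights whose
        # names differ from the original bert_model.* names), so take whatever
        # the Bert submodule reports as requiring gradients.
        return [p for p in model.bert_model.parameters() if p.requires_grad]
    # Otherwise select Bert weights by their name prefix, as in the diff above.
    return [p for n, p in model.named_parameters()
            if p.requires_grad and n.startswith("bert_model.")]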

@@ -48,7 +58,7 @@ def update(self, batch, eval=False):
self.model.eval()
else:
self.model.train()
self.optimizer.zero_grad()
[i.zero_grad() for i in self.optimizers.values()]
Collaborator


Doesn't need to go in a list, I would think.

Member Author


Fixed in c72a3b9. Apologies.

@@ -59,7 +69,11 @@ def update(self, batch, eval=False):

loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.args['max_grad_norm'])
self.optimizer.step()

[i.step() for i in self.optimizers.values()]
Collaborator


Again, maybe just: for optimizer in self.optimizers.values(): optimizer.step()

Member Author


Addressed in c72a3b9.
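
For reference, a toy standalone illustration of the pattern this review converged on (made-up parameters, not the trainer code): the split optimizers live in a dict and are driven with plain for loops rather than throwaway list comprehensions.

import torch

params = {"bert": [torch.nn.Parameter(torch.randn(4))],
          "general": [torch.nn.Parameter(torch.randn(4))]}
optimizers = {name: torch.optim.AdamW(group, lr=1e-3)
              for name, group in params.items()}

for optimizer in optimizers.values():
    optimizer.zero_grad()
# ... forward pass, loss.backward(), and gradient clipping would go here ...
for optimizer in optimizers.values():
    optimizer.step()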

@AngledLuffa
Collaborator

Looks pretty good to me, thanks. Let me try it out (hopefully tonight) before merging.

@Jemoka
Member Author

Jemoka commented Dec 13, 2023

No rush; I will move on to PEFTing NER or depparse after I survive the 109 final tonight; thanks as always!

Jemoka and others added 2 commits December 13, 2023 23:13
…inetuning is doing what we want yet, though)
@AngledLuffa
Collaborator

Thanks! Hopefully this helps PEFT be more effective for depparse or NER, even if it wasn't helping POS yet...

@AngledLuffa AngledLuffa merged commit a168a18 into dev Dec 14, 2023
1 check passed
@AngledLuffa AngledLuffa deleted the split-optim branch December 14, 2023 08:33