Depparse peft #1344
Conversation
@Jemoka want to look over some of the changes I made to add a 2nd optimizer or peft integration to the dependency parser? The overall benefit is close to 0, unfortunately, aside from a couple of languages, including Marathi. You can ignore the first change - it's relevant to the conparser, and I wanted to run experiments on both of those tools simultaneously.
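For readers following along, a minimal sketch of the two-optimizer training step described here might look like the following. The function and argument names (`train_step`, `base_optimizer`, `bert_optimizer`) and the assumption that the model call returns a scalar loss are illustrative placeholders, not the parser's actual code.

```python
import torch

def train_step(model, batch, base_optimizer, bert_optimizer=None, max_grad_norm=1.0):
    # Hypothetical sketch: one optimizer for the parser weights and an optional
    # second optimizer for the transformer (bert) weights.
    base_optimizer.zero_grad()
    if bert_optimizer is not None:
        bert_optimizer.zero_grad()

    loss = model(**batch)   # assumed to return a scalar loss tensor
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    base_optimizer.step()
    if bert_optimizer is not None:
        bert_optimizer.step()
    return loss.item()
```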
0e100d2 to 6c1322a
mostly lgtm - the only question is whether the gradient clipping + PEFT combination causes slower convergence, i.e. do we want to sweep on those parameters?
Happy to take that on, but otherwise this is good.
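If such a sweep were done, a simple grid over the clipping norm and the transformer learning rate could be scripted along these lines. This is a sketch only: `run_training` is a placeholder for the actual training entry point, and the value ranges are arbitrary examples, not recommended settings.

```python
import itertools

def run_training(max_grad_norm, bert_learning_rate):
    # Placeholder: launch one depparse training run with these settings
    # and return its dev score.  Replace with the real training entry point.
    return 0.0

# Small grid over the two settings mentioned in the review comment.
grad_norms = [1.0, 3.0, 10.0]
bert_lrs = [1e-5, 5e-5]

best = None
for max_grad_norm, bert_lr in itertools.product(grad_norms, bert_lrs):
    score = run_training(max_grad_norm=max_grad_norm, bert_learning_rate=bert_lr)
    if best is None or score > best[0]:
        best = (score, max_grad_norm, bert_lr)
print("best dev score %.4f with max_grad_norm=%s bert_lr=%s" % best)
```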
```python
opt.zero_grad()
# if there is no bert optimizer, we will tell the model to detach bert so it uses less GPU
detach = "bert_optimizer" not in self.optimizer
loss, _ = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, lemma, head, deprel, word_orig_idx, sentlens, wordlens, text, detach=detach)
loss_val = loss.data.item()
if eval:
    return loss_val

loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.args['max_grad_norm'])
```
I wonder if this gradient clipping causes problems with PEFT
The detach setting just controls whether or not the gradients of the transformer embedding are kept. I don't think it should affect peft in any way if the model is frozen.
I can check - I hadn't really considered whether that would mess with the bert finetuning or not.
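As a rough illustration of what that detach setting is described as doing (a sketch, not the parser's actual embedding code): detaching simply drops the transformer output from the autograd graph, so no gradients flow back into it.

```python
def embed_with_bert(bert_model, input_ids, detach=True):
    # Run the transformer, then optionally cut its output out of the autograd
    # graph so no gradients (or activations for the backward pass) are kept for it.
    hidden = bert_model(input_ids).last_hidden_state
    if detach:
        hidden = hidden.detach()
    return hidden
```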
17a45c3 to cd50fc6
… us use the full batch size even after training the transformer with a smaller batch size because of GPU limitations
… of just doing the one optimizer. Create these multiple optimizers using the util method, which returns a separate optimizer for the transformer. Detach the bert vectors if there is no bert optimizer, the learning rate is 0, etc. Fix up the unit test to match the new layout. (A sketch of the separate-optimizer idea follows the commit list below.)
…nes for scoring a new dev best
Add a test method which checks that the bert optimizer and scheduler were successfully added
…he depparse. Doesn't seem to help much. Fix loading existing models.
…the peft_config utility method
… report the score of a newly reloaded model when switching to the 2nd optimizer
…nding to the previous save...
…ince we do the same thing twice already
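As referenced in the commit above about creating multiple optimizers, a minimal sketch of a utility that returns a separate optimizer for the transformer might look like this. The function name, the parameter-name prefix used to split out the transformer weights, and the learning rates are assumptions for illustration, not the actual util method in this PR.

```python
import torch

def build_optimizers(model, lr=3e-3, bert_learning_rate=1e-5):
    # Illustrative only: split trainable parameters into transformer vs. everything else.
    bert_params, base_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (bert_params if name.startswith("bert_model.") else base_params).append(param)

    optimizers = {"optimizer": torch.optim.Adam(base_params, lr=lr)}
    if bert_params and bert_learning_rate > 0:
        # Only create the second optimizer when the transformer is actually being
        # tuned; otherwise the trainer can detach bert, as in the diff above.
        optimizers["bert_optimizer"] = torch.optim.AdamW(bert_params, lr=bert_learning_rate)
    return optimizers
```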
Add various tooling, such as a 2nd optimizer and integration with peft, to the dependency parser. Unfortunately, aside from Marathi for some reason, there seems to be little in the way of a big win for either full finetuning or using peft on the depparser. Will continue to CV for better training parameters which do have a benefit; integrating this tooling should make it so that any models we do find can be immediately released without a new code release.
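For reference, a peft integration of this kind typically wraps the transformer in a LoRA adapter via the Hugging Face peft library. The model name, target modules, and hyperparameters below are illustrative guesses, not the values used by the shared peft_config utility mentioned in the commits.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Illustrative sketch of a LoRA-style peft setup around a transformer encoder.
bert_model = AutoModel.from_pretrained("xlm-roberta-base")
lora_config = LoraConfig(
    r=8,                                # adapter rank (guess)
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections to adapt (guess)
)
bert_model = get_peft_model(bert_model, lora_config)
bert_model.print_trainable_parameters()  # only the adapter weights remain trainable
```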