
Depparse peft #1344

Merged: 17 commits from the depparse-peft branch into dev on Feb 15, 2024
Conversation

AngledLuffa (Collaborator):

Adds various tooling to the dependency parser, such as a second optimizer and integration with peft. Unfortunately, aside from Marathi for some reason, there seems to be little in the way of a big win from either full finetuning or using peft on the depparse. Will continue to CV for better training parameters, which do have a benefit, and integrating this tooling should make it so that any improved models we do find can be released immediately without a new code release.
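
For reviewers unfamiliar with the setup, here is a rough sketch of the general idea: wrap the transformer in a LoRA adapter via the peft library and give the adapter weights their own optimizer, separate from the parser head's optimizer. This is illustrative only, not the actual Stanza code; the model name, adapter config, and learning rates are made up.

```python
# Illustrative sketch: PEFT-wrapped transformer plus a second optimizer
# dedicated to the adapter weights.  Hyperparameters are placeholders.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

bert = AutoModel.from_pretrained("xlm-roberta-base")
hidden_size = bert.config.hidden_size

lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                         target_modules=["query", "value"])
bert = get_peft_model(bert, lora_config)  # only the LoRA weights stay trainable

parser_head = torch.nn.Linear(hidden_size, 100)  # stand-in for the real depparse head

optimizers = {
    "optimizer": torch.optim.Adam(parser_head.parameters(), lr=3e-3),
    "bert_optimizer": torch.optim.AdamW(
        [p for p in bert.parameters() if p.requires_grad], lr=2e-5),
}
```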

AngledLuffa (Collaborator, Author):

@Jemoka, do you want to look over the changes I made to add a second optimizer and PEFT integration to the dependency parser? The overall benefit is close to zero, unfortunately, aside from a couple of languages including Marathi. You can ignore the first change; it's relevant to the conparser, and I wanted to run experiments on both of those tools simultaneously.

Jemoka (Member) left a comment:

Mostly LGTM. Only question is whether the gradient clipping plus PEFT causes slower convergence, i.e. do we want to sweep on those parameters?
Happy to take that on, but otherwise this is good.

Review comment on stanza/models/depparse/model.py (outdated):
opt.zero_grad()
# if there is no bert optimizer, we will tell the model to detach bert so it uses less GPU
detach = "bert_optimizer" not in self.optimizer
loss, _ = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, lemma, head, deprel, word_orig_idx, sentlens, wordlens, text, detach=detach)
loss_val = loss.data.item()
if eval:
return loss_val

loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.args['max_grad_norm'])
Jemoka (Member):

I wonder if this causes problems with PEFT

AngledLuffa (Collaborator, Author):

The detach setting just controls whether the gradients of the transformer embedding are kept. I don't think it should affect peft in any way if the model is frozen.
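
To illustrate what the detach setting does (a hand-wavy sketch, not the actual Stanza model code; `bert` and `inputs` are placeholders): when there is no transformer optimizer, the transformer's forward pass can be run without keeping a gradient graph at all, which is what saves GPU memory.

```python
# Sketch of the detach idea: if the transformer is not being finetuned,
# compute its embeddings without building a gradient graph.
import torch

def embed(bert, inputs, detach):
    if detach:
        with torch.no_grad():  # no graph kept -> less GPU memory
            return bert(**inputs).last_hidden_state
    return bert(**inputs).last_hidden_state  # gradients flow into the transformer
```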

AngledLuffa (Collaborator, Author):

> Mostly LGTM. Only question is whether the gradient clipping plus PEFT causes slower convergence, i.e. do we want to sweep on those parameters? Happy to take that on, but otherwise this is good.

I can check. Hadn't really considered whether that would mess with the bert finetuning or not.
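
For concreteness, the kind of sweep being discussed could look roughly like this. This is a hypothetical sketch: `train_and_score`, its keyword arguments, and the value grids are placeholders, not anything that exists in Stanza.

```python
# Hypothetical sweep over the clipping threshold and the adapter learning rate.
# `train_and_score` stands in for whatever training entry point gets used.
import itertools

def sweep(train_and_score):
    best = None
    for max_grad_norm, bert_lr in itertools.product([1.0, 3.0, 5.0], [1e-5, 2e-5, 5e-5]):
        score = train_and_score(max_grad_norm=max_grad_norm, bert_learning_rate=bert_lr)
        if best is None or score > best[0]:
            best = (score, max_grad_norm, bert_lr)
    return best  # (dev score, max_grad_norm, bert learning rate)
```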

… us use the full batch size even after training the transformer with a smaller batch size because of GPU limitations
… of just doing the one optimizer

Create these multiple optimizers using the util method which returns a separate optimizer for the transformer (see the sketch after the commit list)

Detach the bert vectors if there is no bert optimizer, learning rate is 0, etc.  Fix up the unit test to match the new layout.
Add a test method which checks that the bert optimizer and scheduler were successfully added
…he depparse. Doesn't seem to help much

Fix loading existing models
… report the score of a newly reloaded model when switching to the 2nd optimizer
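
The commit about creating the optimizers through a util method corresponds to something like the following shape: the parser weights get one optimizer and the (possibly PEFT-wrapped) transformer gets its own "bert_optimizer" entry, so callers can check for it the way the trainer snippet above does. This is a rough sketch with made-up argument names, not the real Stanza helper signature.

```python
# Rough sketch of a "build the optimizers" helper.  Hyperparameters and
# argument names are illustrative only.
import torch

def build_optimizers(model, bert_model, lr, bert_lr):
    optimizers = {
        "optimizer": torch.optim.Adam(model.parameters(), lr=lr),
    }
    if bert_model is not None and bert_lr > 0:
        trainable = [p for p in bert_model.parameters() if p.requires_grad]
        optimizers["bert_optimizer"] = torch.optim.AdamW(trainable, lr=bert_lr)
    return optimizers
```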
AngledLuffa merged commit 82c6968 into dev on Feb 15, 2024
1 check passed
AngledLuffa deleted the depparse-peft branch on February 15, 2024 at 07:55