Depparse peft #1344
Conversation
@Jemoka want to look over some of the changes I made to add a 2nd optimizer or peft integration to the dependency parser? The overall benefit is close to 0, unfortunately, aside from a couple of languages, including Marathi. You can ignore the first change - it's relevant to the conparser, and I wanted to run experiments on both of those tools simultaneously.
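For readers following along, a minimal sketch of the two-optimizer training step described here might look like the following. The function and argument names (`train_step`, `base_optimizer`, `bert_optimizer`) and the assumption that the model call returns a scalar loss are illustrative placeholders, not the parser's actual code.

```python
import torch

def train_step(model, batch, base_optimizer, bert_optimizer=None, max_grad_norm=1.0):
    # Hypothetical sketch: one optimizer for the parser weights and an optional
    # second optimizer for the transformer (bert) weights.
    base_optimizer.zero_grad()
    if bert_optimizer is not None:
        bert_optimizer.zero_grad()

    loss = model(**batch)   # assumed to return a scalar loss tensor
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    base_optimizer.step()
    if bert_optimizer is not None:
        bert_optimizer.step()
    return loss.item()
```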
0e100d2 to 6c1322a
mostly lgtm - the only question is whether the gradient clipping + PEFT combination causes slower convergence, i.e. do we want to sweep on those parameters?
Happy to take that on, but otherwise this is good.
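If such a sweep were done, a simple grid over the clipping norm and the transformer learning rate could be scripted along these lines. This is a sketch only: `run_training` is a placeholder for the actual training entry point, and the value ranges are arbitrary examples, not recommended settings.

```python
import itertools

def run_training(max_grad_norm, bert_learning_rate):
    # Placeholder: launch one depparse training run with these settings
    # and return its dev score.  Replace with the real training entry point.
    return 0.0

# Small grid over the two settings mentioned in the review comment.
grad_norms = [1.0, 3.0, 10.0]
bert_lrs = [1e-5, 5e-5]

best = None
for max_grad_norm, bert_lr in itertools.product(grad_norms, bert_lrs):
    score = run_training(max_grad_norm=max_grad_norm, bert_learning_rate=bert_lr)
    if best is None or score > best[0]:
        best = (score, max_grad_norm, bert_lr)
print("best dev score %.4f with max_grad_norm=%s bert_lr=%s" % best)
```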
```python
opt.zero_grad()
# if there is no bert optimizer, we will tell the model to detach bert so it uses less GPU
detach = "bert_optimizer" not in self.optimizer
loss, _ = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, lemma, head, deprel, word_orig_idx, sentlens, wordlens, text, detach=detach)
loss_val = loss.data.item()
if eval:
    return loss_val

loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.args['max_grad_norm'])
```
I wonder if this gradient clipping causes problems with PEFT
The detach setting just controls whether or not the gradients of the transformer embedding are kept. I don't think it should affect peft in any way if the model is frozen.
I can check - I hadn't really considered whether that would mess with the bert finetuning or not.
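As a rough illustration of what that detach setting is described as doing (a sketch, not the parser's actual embedding code): detaching simply drops the transformer output from the autograd graph, so no gradients flow back into it.

```python
def embed_with_bert(bert_model, input_ids, detach=True):
    # Run the transformer, then optionally cut its output out of the autograd
    # graph so no gradients (or activations for the backward pass) are kept for it.
    hidden = bert_model(input_ids).last_hidden_state
    if detach:
        hidden = hidden.detach()
    return hidden
```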
17a45c3 to cd50fc6
… us use the full batch size even after training the transformer with a smaller batch size because of GPU limitations
… of just doing the one optimizer. Create these multiple optimizers using the util method, which returns a separate optimizer for the transformer. Detach the bert vectors if there is no bert optimizer, the learning rate is 0, etc. Fix up the unit test to match the new layout. (A sketch of the separate-optimizer idea follows the commit list below.)
…nes for scoring a new dev best
Add a test method which checks that the bert optimizer and scheduler were successfully added
…he depparse. Doesn't seem to help much. Fix loading existing models.
…the peft_config utility method
… report the score of a newly reloaded model when switching to the 2nd optimizer
…nding to the previous save...
…ince we do the same thing twice already
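As referenced in the commit above about creating multiple optimizers, a minimal sketch of a utility that returns a separate optimizer for the transformer might look like this. The function name, the parameter-name prefix used to split out the transformer weights, and the learning rates are assumptions for illustration, not the actual util method in this PR.

```python
import torch

def build_optimizers(model, lr=3e-3, bert_learning_rate=1e-5):
    # Illustrative only: split trainable parameters into transformer vs. everything else.
    bert_params, base_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (bert_params if name.startswith("bert_model.") else base_params).append(param)

    optimizers = {"optimizer": torch.optim.Adam(base_params, lr=lr)}
    if bert_params and bert_learning_rate > 0:
        # Only create the second optimizer when the transformer is actually being
        # tuned; otherwise the trainer can detach bert, as in the diff above.
        optimizers["bert_optimizer"] = torch.optim.AdamW(bert_params, lr=bert_learning_rate)
    return optimizers
```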
Add various tooling, such as a 2nd optimizer and integration with peft, to the dependency parser. Unfortunately, aside from Marathi for some reason, there seems to be little in the way of a big win for either full finetuning or using peft on the depparser. Will continue to CV for better training parameters which do have a benefit; integrating this tooling should make it so that any models we do find can be immediately released without a new code release.
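For reference, a peft integration of this kind typically wraps the transformer in a LoRA adapter via the Hugging Face peft library. The model name, target modules, and hyperparameters below are illustrative guesses, not the values used by the shared peft_config utility mentioned in the commits.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Illustrative sketch of a LoRA-style peft setup around a transformer encoder.
bert_model = AutoModel.from_pretrained("xlm-roberta-base")
lora_config = LoraConfig(
    r=8,                                # adapter rank (guess)
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections to adapt (guess)
)
bert_model = get_peft_model(bert_model, lora_config)
bert_model.print_trainable_parameters()  # only the adapter weights remain trainable
```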