
TrainOps + Two Stage Optim for depparse #1337

Merged
merged 3 commits into dev from depparse-ops on Jan 31, 2024

Conversation

@Jemoka (Member) commented Jan 28, 2024

  • Optional two-stage optimization scheme after the first optimizer has converged (sketched below)
  • wandb gradient logging
  • Model checkpointing with optimizer
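A rough sketch of what the two-stage scheme amounts to (illustrative only: build_optimizer and the plateau rule below are made up for this example, while the real code builds its optimizers through utils.get_optimizer and uses its own convergence check):

import torch

def build_optimizer(name, model, lr):
    # illustrative stand-in; the real code calls utils.get_optimizer with more options
    if name == "amsgrad":
        return torch.optim.Adam(model.parameters(), lr=lr, amsgrad=True)
    return torch.optim.Adam(model.parameters(), lr=lr)

def maybe_switch_to_second_stage(args, model, dev_scores, patience=5):
    """Return a new optimizer once the first stage has plateaued, else None."""
    if args.get("second_stage", False) or not args.get("second_optim"):
        return None
    plateaued = (len(dev_scores) > patience
                 and max(dev_scores[-patience:]) <= max(dev_scores[:-patience]))
    if not plateaued:
        return None
    args["second_stage"] = True   # recorded in args so the choice survives save/load
    return build_optimizer(args["second_optim"], model, args["lr"])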

@Jemoka changed the base branch from main to dev on January 28, 2024 18:17
@Jemoka (Member, Author) commented Jan 28, 2024

ack; there is a patch here which is embedded in a commit in #1336 to get the tests to stop barfing. @AngledLuffa would love your thoughts on this

@AngledLuffa (Collaborator) commented:

I still hope to be able to review things one piece at a time - mind pulling that patch into this PR? What I normally do in situations like this is args.get('second_optim', None) and then handle the case of no predefined value for the argument in a reasonable manner.

I also do kind of wonder, it doesn't need an optimizer when it's in eval mode, right? So how much memory & time is it wasting by creating that optimizer in this setting anyway?
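A minimal sketch of both suggestions (a hypothetical function, not the actual Trainer code): read the optional setting with a default so older configs keep loading, and skip the optimizer entirely when the trainer is only used for eval.

import torch

def build_optimizer_if_training(args, model, training=True):
    second_optim = args.get('second_optim', None)   # None if the arg was never set, so no KeyError
    if not training:
        # eval mode never calls backward(), so the optimizer (and its state
        # tensors on the GPU) can be skipped entirely
        return None
    amsgrad = args.get('second_stage', False) and second_optim == 'amsgrad'
    return torch.optim.Adam(model.parameters(), lr=args['lr'], amsgrad=amsgrad)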

@Jemoka (Member, Author) commented Jan 29, 2024

Sounds good; will address these first thing tomorrow if that's ok. It's probably not too much time lost, but you are right that this shouldn't be created during eval mode.


@Jemoka closed this Jan 29, 2024
@Jemoka reopened this Jan 29, 2024
@Jemoka (Member, Author) commented Jan 29, 2024

> I still hope to be able to review things one piece at a time - mind pulling that patch into this PR? What I normally do in situations like this is args.get('second_optim', None) and then handle the case of no predefined value for the argument in a reasonable manner.
>
> I also do kind of wonder, it doesn't need an optimizer when it's in eval mode, right? So how much memory & time is it wasting by creating that optimizer in this setting anyway?

Done. Addressed by dac72d0.

import wandb
# track gradients!
wandb.watch(self.model, log_freq=4, log="all", log_graph=True)
if ignore_model_config:
@AngledLuffa (Collaborator) commented on the diff above:

Why is this needed? To make sure the model shapes are the same? But this doesn't allow for training with a different optimizer, unless I'm mistaken about that

@AngledLuffa (Collaborator):

One thing that might work would be separating the parameters into two separate maps. Or possibly, when the model itself is constructed, the model-shape parameters are reused from the save file, whereas the passed-in args are used for the embedding locations and the optimizer.
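A hypothetical sketch of that separation (the key names below are purely illustrative): shape-determining parameters come from the save file so they match the saved weights, while everything else comes from the passed-in args.

# keys that determine tensor shapes and therefore must match the saved weights (illustrative set)
MODEL_SHAPE_KEYS = {'word_emb_dim', 'char_hidden_dim', 'hidden_dim'}

def merge_args(saved_args, passed_args):
    merged = dict(passed_args)              # runtime settings: optimizer, lr, embedding locations, ...
    for key in MODEL_SHAPE_KEYS:
        if key in saved_args:
            merged[key] = saved_args[key]   # shape settings always come from the save file
    return merged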

self.model = self.model.to(device)
self.optimizer = utils.get_optimizer(self.args['optim'], self.model, self.args['lr'], betas=(0.9, self.args['beta2']), eps=1e-6, bert_learning_rate=self.args.get('bert_learning_rate', 0.0))
self.primary_optimizer = utils.get_optimizer(self.args['optim'], self.model, self.args['lr'], betas=(0.9, self.args['beta2']), eps=1e-6, bert_learning_rate=self.args.get('bert_learning_rate', 0.0))
@AngledLuffa (Collaborator) commented on the diff above:

I would worry that having both primary and secondary optimizers is kind of wasteful in terms of space. The primary optimizer is kept around after switching, but never used again, and its state (moment estimates, etc.) will take up a lot of GPU memory.

What I did with the constituency parser - which is by no means definitive - was to just keep one optimizer as part of the trainer. At the end of each epoch, if the switching condition was triggered, throw away the old one. We could then save a flag in the save file marking whether the optimizer was the first or the second.
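A rough sketch of that single-optimizer approach (hypothetical names, not the constituency parser's actual code): the old optimizer is simply replaced when the switching condition triggers, and a flag recording the stage goes into the save file.

import torch

def maybe_switch(trainer, switch_now):
    if switch_now and not trainer.args.get('second_stage', False):
        trainer.args['second_stage'] = True
        # the old optimizer and its GPU state tensors become unreachable here
        trainer.optimizer = torch.optim.Adam(trainer.model.parameters(),
                                             lr=trainer.args['lr'], amsgrad=True)

def save(trainer, path):
    torch.save({'model': trainer.model.state_dict(),
                'optimizer': trainer.optimizer.state_dict(),
                # flag telling the loader which optimizer to rebuild
                'second_stage': trainer.args.get('second_stage', False)},
               path)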

@AngledLuffa (Collaborator):

Happy to discuss alternatives for how best to keep track of which optimizer is being used and how best to save / load them

@@ -191,7 +194,10 @@ def train(args):
wandb.run.define_metric('dev_score', summary='max')

logger.info("Training parser...")
trainer = Trainer(args=args, vocab=vocab, pretrain=pretrain, device=args['device'])
if args["continue_from"]:
@AngledLuffa (Collaborator) commented on the diff above:

Is there any mechanism that saves & loads the optimizer? I don't see such a thing. I think that would be a rather useful feature to add to the saving / loading & continuing.

The way I did this for the sentiment & constituency models was to keep a "checkpoint" file holding the latest optimizer state alongside the regular save file, which holds the best model so far. Then, if the checkpoint file exists when doing a training run with the same save_name, it would automatically load that checkpoint and continue from there. I suppose there should also be a flag which lets the user discard the checkpoint, but I haven't done that so far.

I think we want to have a unified approach for the different models and their loading / checkpointing mechanisms
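A sketch of that checkpoint scheme (paths, attribute names, and helpers here are hypothetical): the checkpoint always holds the latest model and optimizer state next to the best-so-far save file, and a new run with the same save_name resumes from it if it exists.

import os
import torch

def save_checkpoint(trainer, checkpoint_path):
    torch.save({'model': trainer.model.state_dict(),
                'optimizer': trainer.optimizer.state_dict(),
                'global_step': trainer.global_step,
                'second_stage': trainer.args.get('second_stage', False)},
               checkpoint_path)

def maybe_resume(trainer, checkpoint_path):
    if not os.path.exists(checkpoint_path):
        return
    state = torch.load(checkpoint_path, map_location='cpu')
    trainer.model.load_state_dict(state['model'])
    trainer.optimizer.load_state_dict(state['optimizer'])
    trainer.global_step = state['global_step']        # continue, don't restart from 0
    trainer.args['second_stage'] = state['second_stage']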

@Jemoka (Member, Author) commented Jan 30, 2024

Done, the last few comments should address the two optimizers situation (only loads one, which one to load is stored in args and saved with the model) as well as implement the requested checkpointing w/ optim. Running a training run now to confirm everything.

@AngledLuffa (Collaborator) commented:

Looks pretty good, I'd say. Thanks for making those changes!

There's a unit test for training the depparse (doesn't check the results, just checks that it runs)

stanza/tests/depparse/test_parser.py

How would you feel about extending it to check that training produces the expected checkpoints and switches optimizers when expected? I can take that on myself tomorrow afternoon, actually. It probably isn't too difficult to test those items.

I think this can all be squashed into one change, what do you think?
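Roughly the shape such a test could take (this uses a dummy trainer rather than the real depparse Trainer, and the checkpoint file name and forced-switch hook are made up for illustration):

import os
import torch
import torch.nn as nn

class DummyTrainer:
    """Stand-in for the real Trainer, just to show the shape of the assertions."""
    def __init__(self, save_dir):
        self.args = {'lr': 1e-3, 'second_stage': False}
        self.model = nn.Linear(4, 2)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=self.args['lr'])
        self.checkpoint = os.path.join(save_dir, 'checkpoint.pt')

    def train(self, steps, switch_after):
        for step in range(steps):
            if step == switch_after:                  # forced switch, as a test hook
                self.args['second_stage'] = True
                self.optimizer = torch.optim.Adam(self.model.parameters(),
                                                  lr=self.args['lr'], amsgrad=True)
            torch.save({'model': self.model.state_dict(),
                        'optimizer': self.optimizer.state_dict(),
                        'second_stage': self.args['second_stage']},
                       self.checkpoint)

def test_checkpoint_and_optimizer_switch(tmp_path):
    trainer = DummyTrainer(str(tmp_path))
    trainer.train(steps=5, switch_after=2)
    assert os.path.exists(trainer.checkpoint)
    assert torch.load(trainer.checkpoint)['second_stage'] is True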

@AngledLuffa (Collaborator) commented:

I can take on the testing and the squashing tomorrow, actually; a couple of the other tasks I had lined up for this week are in "waiting for response" mode now.

wandb.watch(self.model, log_freq=4, log="all", log_graph=True)

def __init_optim(self):
    if not (self.args.get("second_stage", False) and self.args.get('second_optim')):
@AngledLuffa (Collaborator) commented on the diff above:

One small readability comment might be to switch the order of the boolean so that it's easier to read: if second_optim and second_stage, build the second optimizer; otherwise build the first (roughly sketched below).

@Jemoka (Member, Author):

Done, addressed by 0738498.
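For reference, a simplified, self-contained sketch of the reordered condition (Adam/AMSGrad here are stand-ins; the actual code builds its optimizers differently):

import torch

def init_optim(args, model):
    # second optimizer first, plain path second, as the review suggests
    if args.get('second_optim') and args.get('second_stage', False):
        return torch.optim.Adam(model.parameters(), lr=args['lr'],
                                amsgrad=(args['second_optim'] == 'amsgrad'))
    return torch.optim.Adam(model.parameters(), lr=args['lr'])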

logger.info("Switching to AMSGrad")
if not is_second_stage and args.get('second_optim', None) is not None:
logger.info("Switching to second optimizer: {}".format(args.get('second_optim', None)))
args["second_stage"] = True
@AngledLuffa (Collaborator) commented on the diff above:

hang on - since the tagger is now copying the whole args with deepcopy (which I still think might not be necessary), will changing this field actually change the args in the tagger?

@Jemoka (Member, Author):

We pass the edited args into the constructor of Trainer, which, given that we passed a model_file, forwards them into:

if model_file is not None:
    # load everything from file
    self.load(model_file, pretrain, args, foundation_cache)

From there, because the args are then passed in to load, they overwrite anything that the Trainer originally had:

if args is not None: self.args.update(args)

Apologies in advance if I mixed something up, but I'm pretty sure this should work in terms of stage switching. Also, empirically it seems to work in the single run I did.

@AngledLuffa (Collaborator):

Yes, that is correct. Thanks for pointing that out

Jemoka and others added 3 commits January 30, 2024 16:28
1) save optimizer in checkpoint 2) two stage tracking using args

Save the checkpoints after switching the optimizer, if applicable, so that reloading uses the new optimizer once it has been created
…s and the checkpoints are loadable

Add a flag which forces the optimizer to switch after a certain number of steps - useful for writing tests which check the behavior of the second optimizer
…t when a checkpoint gets loaded, the training continues from the position it was formerly at rather than restarting from 0

Report some details of the model being loaded after loading it
@AngledLuffa merged commit 5bc22dd into dev on Jan 31, 2024
2 checks passed
@AngledLuffa deleted the depparse-ops branch on January 31, 2024 00:29