
Updating to Lightning 2.0 #210

Merged
30 commits merged into torchmd:main on Sep 6, 2023

Conversation

RaulPPelaez
Collaborator

This is an effort to update the PyTorch Lightning dependency (see #168).
A lot has changed since the LNNP module was written (the version we currently use is 1.6.4), and I expect this update to improve compatibility with some functionality we are after (i.e. torch.compile/TorchScript training).

@RaulPPelaez
Collaborator Author

The resume_from_checkpoint option in train.py has disappeared:

    trainer = pl.Trainer(
        strategy=DDPStrategy(find_unused_parameters=False),
        max_epochs=args.num_epochs,
        accelerator="auto",
        devices=args.ngpus,
        num_nodes=args.num_nodes,
        default_root_dir=args.log_dir,
        #resume_from_checkpoint=None if args.reset_trainer else args.load_model,
        callbacks=[early_stopping, checkpoint_callback],
        logger=_logger,
        precision=args.precision,
        gradient_clip_val=args.gradient_clipping,
        inference_mode=False
    )

Trying to see how one is supposed to replace it

@AntonioMirarchi
Contributor

From LightningDeprecationWarning: Setting Trainer(resume_from_checkpoint=) is deprecated in v1.5 and will be removed in v1.7. Please pass Trainer.fit(ckpt_path=) directly instead.
So it could be something like this:

    if not args.load_model:
        trainer.fit(model, data)
    else:
        trainer.fit(model, data, ckpt_path=args.load_model)

@RaulPPelaez
Collaborator Author

There is one lingering issue: training seems to be overall 6x faster with this PR than with current main:
[wandb screenshots: train/val curves for this PR vs. current main]

But as you can see, the testing graphs are not being updated in wandb. I am afraid it is simply not testing, although the train and val losses look identical to me.

@RaulPPelaez
Collaborator Author

I found this issue where they discuss why Lightning does not want to support testing during the training loop:
Lightning-AI/pytorch-lightning#9254
They also describe the trick we are currently using of adding the test dataloader as an extra validation dataloader so it can be run during training.

However, I do not understand why we are paying such a high performance price to test during training. @PhilippThoelke, you implemented this AFAIK; could you provide some insight?
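
For reference, my understanding of that trick in Lightning terms is something like this (a minimal sketch with placeholder names, not the actual torchmd-net code):

    # Sketch only: expose the test set as a second validation dataloader so it
    # runs inside the fit loop; val_loader/test_loader/_loss are placeholders.
    import pytorch_lightning as pl

    class DataModule(pl.LightningDataModule):
        def val_dataloader(self):
            # index 0 -> validation set, index 1 -> test set
            return [self.val_loader, self.test_loader]

    class Model(pl.LightningModule):
        def validation_step(self, batch, batch_idx, dataloader_idx=0):
            loss = self._loss(batch)
            stage = "val" if dataloader_idx == 0 else "test"
            self.log(f"{stage}_loss", loss)
            return loss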

@PhilippThoelke
Collaborator

Testing during training was useful since it can be difficult to estimate model performance from the val loss when the val loss is also used to adjust the learning rate (e.g. through the ReduceLROnPlateau scheduler). This is particularly important for fast prototyping and architecture development. I wasn't aware that this slows down training by 6x, that is insane! How much of that is actually due to the testing, though, and how much comes from improvements in Lightning 2.0?
I don't know if that LR scheduler is still relevant for torchmd-net, but a 6x efficiency decrease is never acceptable.
Possibly related: #27
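
For reference, the coupling I mean looks roughly like this in Lightning (a generic sketch, not the exact LNNP code; the hyperparameters are illustrative):

    # Generic sketch: the scheduler steps on val_loss, so val_loss does double
    # duty as both the model-selection metric and the LR-control signal.
    import torch

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-4)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, factor=0.8, patience=10
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "monitor": "val_loss"},
        }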

@RaulPPelaez
Collaborator Author

TBH I do not really know how much of the speedup is due to other improvements in Lightning.
It makes sense to me that not testing is the main cause of the speedup, simply because one is not going over a whole dataset every few epochs, which requires reading a bunch of stuff from disk, etc.
I understand the fast-prototyping argument, but we are currently going against the Lightning devs.
I will try to recreate the multiple-dataloader hack (looking at the docs/GitHub issues I do not see any other workaround) so the functionality is still provided and I can measure whether this is indeed the cause of the slowdown.

Well, I guess I can also run the baseline without testing during training -.-

If that turns out to be the case, then perhaps we can print a warning in the "test during training" case to explain that you are paying a high price for it hehe

@giadefa
Contributor

giadefa commented Aug 8, 2023 via email

@RaulPPelaez
Collaborator Author

OK, since I do not yet know how to reproduce the trick in the latest Lightning, I ran current main with and without testing during validation to compare:
[screenshot: wall-clock comparison with and without test-during-training]
Note that I was not trying to benchmark when I noticed this, I was checking correctness. So this training example may not be the most representative one (it does not fully occupy the GPU), but still...
Testing is carried out every 10 epochs in the orange line.
Roughly, it is a 3x slowdown if you test every 10 epochs. Without testing, I see a 3x speedup going from Lightning 1.6.3 to 2.0.4.

I get why testing is super slow: the way it is set up in this example, it just goes over the whole test dataset each time. Besides being a lot of forward passes, it requires reading the whole thing from disk.

With this, I now believe that:

  1. Updating Lightning is well worth it.
  2. We should keep the testing functionality for prototyping.
  3. We should include a warning explaining that test-during-training can be expensive.

@RaulPPelaez
Collaborator Author

I managed to bring back the functionality.
As far as I can tell, the only way to reproduce the test-during-train trick is to reload the dataloaders every epoch.
This on its own does not seem to have a profound effect on performance.
Sadly, the performance when test-during-training is enabled does not look better than before.

I will test some more to make sure these performance numbers are not a fluke, but functionality-wise this should be ready to go!
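
For reference, the relevant knob is roughly this (a sketch; the other Trainer arguments from train.py are omitted):

    # Sketch: reloading the dataloaders every epoch is what lets val_dataloader()
    # decide, per epoch, whether to also hand back the test loader.
    trainer = pl.Trainer(
        max_epochs=args.num_epochs,
        reload_dataloaders_every_n_epochs=1,
        inference_mode=False,
    )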

Comment on lines -119 to -121
    def _get_dataloader(self, dataset, stage, store_dataloader=True):
        store_dataloader = (
            store_dataloader and self.trainer.reload_dataloaders_every_n_epochs <= 0
        )
Collaborator Author


I removed this check here.
With it gone, the dataloaders are always stored.
AFAIK the check was never actually exercised in the project, so the behavior is unchanged, but I want to point it out in case I am missing something.

@guillemsimeon
Collaborator

I get that testing during training only matters when you are prototyping, and in fact in my case it helped me stop training after a few epochs when some hyperparameter change did not provide any improvement. If we are changing this, I have two suggestions:

  • Report validation MAEs on both energies and forces on top of the (separate) validation loss(es), which are MSEs. This can give a better sense of how the training is going.
  • Even when no testing is performed during training, I would suggest providing test errors by default once training has finished. Then people would not need to run inference on the test set separately to report performance.

@guillemsimeon
Collaborator

Another option is, when training is finished, to look for the checkpoint with the lowest validation loss and run the test with that one. I think this is what happens in MACE, for example.

@RaulPPelaez
Collaborator Author

Thanks for your insights!
I am convinced test-during-training has some useful cases. Luckily, I was able to cook it back in, so I believe it is best to leave it there as an option, at least for now. If you want to go fast, simply do not set the "test_interval" option.

Another option is, when training is finished, to look for the checkpoint with the lowest validation loss and run the test with that one. I think this is what happens in MACE, for example.

AFAIK this is exactly what torchmd-train does:

    trainer.fit(model, data)
    # run test set after completing the fit
    model = LNNP.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)
    trainer = pl.Trainer(logger=_logger)
    trainer.test(model, data)

Report validation MAEs

As you say, the values reported as "val_loss_y/dy" are mse_loss. I can add another reported value called "val_l1_loss_y/dy". It should have no effect whatsoever on performance.

@RaulPPelaez
Collaborator Author

Adding these metrics to the reports has zero performance cost.

I changed things so that, each validation epoch, the loss is reported for every loss function in a list (currently L1 and L2).
Additionally, the y, neg_dy and total losses are always reported for each loss function, regardless of their weights (even if total_loss == loss_neg_dy, you might be interested in knowing how loss_y behaves).
At this point, the following metrics are reported when test-during-training is off (the default, set by test_interval < 0):

  • train_loss
  • train_neg_dy_mse_loss
  • train_total_mse_loss
  • train_y_mse_loss
  • val_loss
  • val_neg_dy_l1_loss
  • val_neg_dy_mse_loss
  • val_total_l1_loss
  • val_total_mse_loss
  • val_y_l1_loss
  • val_y_mse_loss

train_loss and val_loss are aliases for train_total_mse_loss and val_total_mse_loss, respectively. I left them there because the checkpoint naming scheme currently in place uses these particular names.

If test_interval > 0, then test_[y,neg_dy,total]_l1_loss entries are added as well.
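
Schematically, the reporting amounts to something like this (a sketch, not the literal module code; log_val_metrics is a made-up helper name and the loss weighting is omitted):

    # Schematic only: each epoch, every component (y, neg_dy, total) is logged
    # once per loss function, regardless of the loss weights used for training.
    import torch.nn.functional as F

    loss_fns = {"mse": F.mse_loss, "l1": F.l1_loss}

    def log_val_metrics(self, y_pred, y, neg_dy_pred, neg_dy):
        for name, fn in loss_fns.items():
            loss_y = fn(y_pred, y)
            loss_neg_dy = fn(neg_dy_pred, neg_dy)
            self.log(f"val_y_{name}_loss", loss_y)
            self.log(f"val_neg_dy_{name}_loss", loss_neg_dy)
            self.log(f"val_total_{name}_loss", loss_y + loss_neg_dy)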

@RaulPPelaez
Collaborator Author

I noticed there was no "on_test_epoch_end" hook in LNNP, meaning that the trainer.test line in train.py effectively just discarded all the work it did, as far as I can tell. Also, no logger is attached to the test run, so nothing about it is written to disk.

I wrote one so that the test losses are actually reported to the terminal:
[screenshot: test losses printed to the terminal]
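
The hook is along these lines (a simplified sketch, not the exact implementation; self._test_losses is a made-up buffer name):

    # Simplified sketch of the on_test_epoch_end hook: average the per-batch test
    # losses accumulated during the epoch and print them, since no logger is
    # attached to the test run.
    import torch

    def on_test_epoch_end(self):
        for name, values in self._test_losses.items():  # made-up buffer name
            mean = torch.stack(values).mean()
            print(f"{name}: {mean.item():.6f}")
            values.clear()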

Commit: Always log losses on y and neg_dy even if they are weighted 0 for the total loss
@RaulPPelaez changed the title from [WIP] Updating to Lightning 2.0 to Updating to Lightning 2.0 on Sep 5, 2023
@RaulPPelaez
Collaborator Author

This is ready to merge on my part.
@AntonioMirarchi, could you rerun some training with this PR to double check?
cc @raimis please review

Review comments on tests/test_model.py and torchmdnet/module.py (outdated, resolved).
@stefdoerr
Collaborator

Test failed for some reason. Can you take a look?

@RaulPPelaez
Collaborator Author

It was just an overprotective numerical check; retriggering the CI fixed it.

@guillemsimeon
Collaborator

guillemsimeon commented Sep 5, 2023

In the end, is this intrinsically faster, or was it just the effect of the testing during training?

@RaulPPelaez
Collaborator Author

The small, not very representative example I am using is definitely faster beyond the test-during-training change.
Although every test I have run turns out faster, I would not expect real-life runs to improve as much. For instance, the ET-SPICE.yaml run, with test-during-training removed, is only about 5% faster. YMMV.
I would expect to see improvements mostly when epochs are really short.

@RaulPPelaez merged commit ac16c09 into torchmd:main on Sep 6, 2023
1 check passed
@cuicathy

cuicathy commented Feb 7, 2024

Hi, thanks for the effort on test-during-training. I wonder whether we can use test-during-training now? If so, could you please give us an example? Thank you very much!
P.S. I do not always need test-during-training, but in some projects, for specific research purposes, I have to...

@RaulPPelaez
Collaborator Author

Hi, the behavior is just like before this PR, via the "test_interval" option. Setting it to -1 will skip testing during training.

Check out this doc page: https://torchmd-net.readthedocs.io/en/latest/torchmd-train.html
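
For example, in a training YAML (assuming the option keeps the test_interval spelling used above):

    # run the test set every 10 epochs during training; set to -1 to skip
    # testing during training
    test_interval: 10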
