Updating to Lightning 2.0 #210
Conversation
The resume_from_checkpoint option in train.py has disappeared:

```python
trainer = pl.Trainer(
    strategy=DDPStrategy(find_unused_parameters=False),
    max_epochs=args.num_epochs,
    accelerator="auto",
    devices=args.ngpus,
    num_nodes=args.num_nodes,
    default_root_dir=args.log_dir,
    #resume_from_checkpoint=None if args.reset_trainer else args.load_model,
    callbacks=[early_stopping, checkpoint_callback],
    logger=_logger,
    precision=args.precision,
    gradient_clip_val=args.gradient_clipping,
    inference_mode=False,
)
```

Trying to see how one is supposed to replace it.
From the LightningDeprecationWarning: setting resume_from_checkpoint on the Trainer is deprecated; the checkpoint path should now be passed to Trainer.fit(ckpt_path=...) instead.
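For reference, a minimal sketch of how the removed argument could be replaced in Lightning 2.0, where the checkpoint path moves from the Trainer constructor to Trainer.fit. The argument names args.reset_trainer and args.load_model come from the snippet above; model and data are placeholders, not the actual objects in train.py:

```python
import pytorch_lightning as pl

# Lightning >= 2.0: resume_from_checkpoint is gone from the Trainer
# constructor; the checkpoint to resume from is passed to fit() instead.
trainer = pl.Trainer(
    max_epochs=args.num_epochs,
    accelerator="auto",
    devices=args.ngpus,
)

# Resume only when the user did not ask for a fresh trainer state.
ckpt_path = None if args.reset_trainer else args.load_model
trainer.fit(model, datamodule=data, ckpt_path=ckpt_path)
```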
I found an issue in which they discuss why Lightning does not want to support testing during the training loop. However, I do not understand why we are paying such a high price in performance to test during training. @PhilippThoelke, you implemented this AFAIK; could you provide some insights?
Testing during training was useful because it can be difficult to estimate model performance from the validation loss when that same loss is also used to adjust the learning rate (e.g. through the ReduceLROnPlateau scheduler). This is particularly important for fast prototyping and architecture development. I wasn't aware that this slows down training by 6x, that is insane! How much of that is actually due to the testing, though, and how much comes from improvements in Lightning 2.0?
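For context, this is roughly how the validation loss ends up driving the learning rate inside a LightningModule (a minimal sketch of the standard Lightning pattern, with made-up hyperparameter values, not the exact torchmd-net code):

```python
import torch
import pytorch_lightning as pl
from torch.optim.lr_scheduler import ReduceLROnPlateau

class Model(pl.LightningModule):
    ...

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-4)
        scheduler = ReduceLROnPlateau(optimizer, factor=0.8, patience=10)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                # The same "val_loss" used to judge the model also drives
                # the LR schedule, which is why an independent test metric
                # during training can help while prototyping.
                "scheduler": scheduler,
                "monitor": "val_loss",
            },
        }
```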
TBH I do not really know what amount of the speedup is due to other improvements in Lightning. It makes sense to me that not testing is the main cause of the speedup, simply because one is not going over a whole dataset every few epochs, which requires reading a bunch of stuff from disk, etc. I understand the fast prototyping argument, but we are currently going against the Lightning devs. I will try to recreate the multiple-dataloader hack (looking at the docs/GitHub issues I do not see any other workaround) so the functionality is still provided and I can measure whether this is indeed the cause of the slowdown. Well, I guess I can also run the baseline without testing during training -.- If that turns out to be the case, then perhaps we can print a warning in the "test during training" case to explain that you are paying a high price for it hehe
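For the record, the "multiple-dataloader hack" amounts to returning both the validation and the test set from val_dataloader, so the test set is evaluated inside the validation loop on selected epochs. A rough sketch of the idea; attribute names such as self.hparams.test_interval, self.val_dataset and self.test_dataset are assumptions, not necessarily the names used in this PR:

```python
import pytorch_lightning as pl

class LNNP(pl.LightningModule):
    def val_dataloader(self):
        loaders = [self._get_dataloader(self.val_dataset, "val")]
        # Attach the test set only on epochs where testing is requested,
        # which is why the dataloaders have to be reloaded every epoch.
        if self.hparams.test_interval > 0 and (
            self.current_epoch % self.hparams.test_interval == 0
        ):
            loaders.append(self._get_dataloader(self.test_dataset, "test"))
        return loaders

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        # dataloader_idx == 0 -> validation set, 1 -> test set
        stage = "val" if dataloader_idx == 0 else "test"
        ...
```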
I would drop it if it really makes such a huge difference. I don't see how validation is not a problem while testing costs 6x.
Alas, it requires reloading the dataloaders every epoch when test_interval > 0.
I managed to bring back the functionality. I will test some more to make sure these performance numbers are not a fluke, but functionality-wise this should be ready to go!
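For reference, the per-epoch reload is controlled by the standard Lightning Trainer flag shown below; a hedged sketch of how it could be wired up, with args.test_interval as an assumed option name rather than the exact code in this PR:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=args.num_epochs,
    # Reload the dataloaders every epoch only when testing during training
    # is enabled, so the test set can be attached on the right epochs;
    # 0 keeps the default behaviour of building the loaders once.
    reload_dataloaders_every_n_epochs=1 if args.test_interval > 0 else 0,
)
```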
```python
def _get_dataloader(self, dataset, stage, store_dataloader=True):
    store_dataloader = (
        store_dataloader and self.trainer.reload_dataloaders_every_n_epochs <= 0
    )
```
I removed this check here. By removing it, the dataloaders are always stored. AFAIK the check never actually had an effect in this project, so the behavior is unchanged, but I want to point it out in case I am missing something.
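To make the change concrete, this is roughly what the getter looks like with the check gone, so a dataloader built once for a stage is reused on later epochs. A sketch only: self._saved_dataloaders, the hyperparameter names and the plain torch DataLoader are assumptions, not the exact code in the diff:

```python
from torch.utils.data import DataLoader

def _get_dataloader(self, dataset, stage, store_dataloader=True):
    # Reuse a previously built dataloader for this stage if we have one.
    if store_dataloader and stage in self._saved_dataloaders:
        return self._saved_dataloaders[stage]

    dl = DataLoader(
        dataset,
        batch_size=self.hparams.batch_size,
        shuffle=(stage == "train"),
        num_workers=self.hparams.num_workers,
    )
    if store_dataloader:
        self._saved_dataloaders[stage] = dl
    return dl
```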
I get that testing during training only matters when you are prototyping, and in fact it helped in my case to stop training after a few epochs when some hyperparameter change did not provide any improvement. If we are changing this, I have two suggestions:
Another option is, when training is finished, to look for the checkpoint with the lowest validation loss and run the test with that one. I think this is what happens in MACE, for example.
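For reference, Lightning supports this pattern out of the box: after fitting, trainer.test can be pointed at the best checkpoint tracked by the ModelCheckpoint callback. A minimal sketch, with model and data as placeholders:

```python
trainer.fit(model, datamodule=data)
# Evaluate the checkpoint with the lowest monitored validation loss
# (as tracked by the ModelCheckpoint callback) on the test set.
trainer.test(model, datamodule=data, ckpt_path="best")
```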
Thanks for your insights!
AFAIK this is exactly what torchmd-train does (see torchmd-net/torchmdnet/scripts/train.py, lines 179 to 184 at dca6679).
As you say, the values reported as "val_loss_y/dy" are computed with mse_loss. I can add another reported value called "val_l1_loss_y/dy"; it should have no effect whatsoever on performance.
Adding these metrics to the reports has zero cost in performance. I changed things so that each validation epoch loss is reported for every loss function in a list (currently l1 and l2).
train_loss and val_loss are aliases for train_total_mse_loss and val_total_mse_loss, respectively. I left them there because the in-place naming scheme for the checkpoints uses these particular names. If test_interval > 0, then test_[y,neg_dy,total]_l1_loss entries are added.
Losses on y and neg_dy are now always logged, even if they are weighted 0 in the total loss.
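To illustrate the naming scheme, a sketch of the idea: every loss function in the list gets its own logged entry per target, and the y/neg_dy losses are logged even when their weight in the total loss is zero. The weight names y_weight and neg_dy_weight are assumptions; the exact reduction and naming in the PR may differ:

```python
import torch.nn.functional as F

loss_fns = {"mse": F.mse_loss, "l1": F.l1_loss}

def log_losses(module, stage, y_pred, y, neg_dy_pred, neg_dy):
    """Log one entry per loss function and per target on a LightningModule."""
    for name, fn in loss_fns.items():
        loss_y = fn(y_pred, y)
        loss_neg_dy = fn(neg_dy_pred, neg_dy)
        # y and neg_dy losses are logged even when their weight in the
        # total loss is zero.
        module.log(f"{stage}_y_{name}_loss", loss_y)
        module.log(f"{stage}_neg_dy_{name}_loss", loss_neg_dy)
        module.log(
            f"{stage}_total_{name}_loss",
            module.hparams.y_weight * loss_y
            + module.hparams.neg_dy_weight * loss_neg_dy,
        )
```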
This is ready to merge on my part.
A test failed for some reason. Can you take a look?
It was just an overprotective numerical check; retriggering the CI fixed it.
In the end, is this intrinsically faster, or was the difference just the effect of testing during training?
The small, non-representative example I am using is definitely faster, beyond the testing change.
Hi, thanks for the effort on test-during-training. I wonder, can we use test-during-training now? If yes, could you please give us an example? Thank you very much!
Hi, the behavior is just like before this PR, via the test_interval option. However, setting it to -1 will skip testing during training. Check out this doc page: https://torchmd-net.readthedocs.io/en/latest/torchmd-train.html
This is an effort to update the PyTorch Lightning dependency (see #168 ).
A lot has changed since the LNNP module was written (the version currently pinned is 1.6.4), and I expect this update to increase compatibility with some functionality we seek (i.e. torch.compile / TorchScript training).
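As a rough illustration of the kind of compatibility this unlocks, a hedged sketch of compiling the model before training; hparams, args and data are placeholders, and whether the network actually compiles cleanly is exactly what still needs to be verified:

```python
import torch
import pytorch_lightning as pl

model = LNNP(hparams)            # the LightningModule updated in this PR
compiled = torch.compile(model)  # PyTorch 2.x graph compilation

# Recent Lightning releases accept a compiled module in Trainer.fit directly.
trainer = pl.Trainer(max_epochs=args.num_epochs, accelerator="auto", devices=1)
trainer.fit(compiled, datamodule=data)
```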