
Fixing logging and errors blocking multi GPU training of Torch models #1509

Merged
merged 11 commits into from Feb 21, 2023
Conversation

solalatus
Contributor

@solalatus solalatus commented Jan 23, 2023

Fixes #1287, fixes #1385

Summary

A very small fix that enables multi-GPU training to run without problems.

Other Information

As the original Lightning documentation states here, in the case of multiple GPUs one has to choose how logging is synchronized between them. As a first attempt, rank_zero_only=True is not a good solution, since it can lead to silent exceptions and break every logging facility, including the progress bar and TensorBoard. Hence the sync_dist=True option was chosen, which waits for, collects, and averages the metrics to be logged from all GPUs. It works stably and has no observable effect on single-GPU training.
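
For context, a minimal sketch of the logging pattern this fix relies on, written as a standalone LightningModule (the module, layer sizes, and metric name are illustrative and not the actual darts pl_forecasting_module code):

```python
import torch
import pytorch_lightning as pl


class TinyForecaster(pl.LightningModule):
    """Illustrative module demonstrating the sync_dist logging pattern."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # sync_dist=True makes Lightning gather and average the metric across
        # all GPUs before logging, instead of logging only on rank zero, which
        # can silently break the progress bar and TensorBoard in DDP runs.
        self.log(
            "train_loss",
            loss,
            batch_size=x.shape[0],
            prog_bar=True,
            sync_dist=True,
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```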

@solalatus
Contributor Author

I'm afraid I made some formatting error according to Black, but I don't really know what it is. Can anyone please advise?

@madtoinou
Collaborator

Hi,

You can find the instructions here; you need to install the dev-all requirements, run `pre-commit install`, and call it on your branch.

@solalatus
Contributor Author

Ok, hopefully this time it works. :-)

@codecov-commenter

codecov-commenter commented Jan 25, 2023

Codecov Report

Base: 94.06% // Head: 94.02% // Decreases project coverage by 0.05% ⚠️

Coverage data is based on head (f2fbf8b) compared to base (690b6f4).
Patch coverage: 100.00% of modified lines in pull request are covered.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1509      +/-   ##
==========================================
- Coverage   94.06%   94.02%   -0.05%     
==========================================
  Files         125      125              
  Lines       11095    11081      -14     
==========================================
- Hits        10437    10419      -18     
- Misses        658      662       +4     
| Impacted Files | Coverage Δ |
| --- | --- |
| darts/models/forecasting/pl_forecasting_module.py | 93.79% <100.00%> (ø) |
| darts/utils/data/tabularization.py | 99.27% <0.00%> (-0.73%) ⬇️ |
| darts/timeseries.py | 92.14% <0.00%> (-0.23%) ⬇️ |
| darts/ad/anomaly_model/filtering_am.py | 91.93% <0.00%> (-0.13%) ⬇️ |
| ...arts/models/forecasting/torch_forecasting_model.py | 89.52% <0.00%> (-0.05%) ⬇️ |
| darts/models/forecasting/block_rnn_model.py | 98.24% <0.00%> (-0.04%) ⬇️ |
| darts/models/forecasting/nhits.py | 99.27% <0.00%> (-0.01%) ⬇️ |
| darts/datasets/__init__.py | 100.00% <0.00%> (ø) |


Contributor

@hrzn hrzn left a comment


LGTM, @dennisbader do you want to have a look?

loss,
batch_size=train_batch[0].shape[0],
prog_bar=True,
sync_dist=True,
Contributor


The PTL doc says "Use with care as this may lead to a significant communication overhead."
Do we have any idea if/when this could cause issues?

Contributor Author


So far, in practical testing on 8 GPUs, I noticed no adverse effects. That said, it also depends on the distribution strategy. I used the default ddp_spawn, as mentioned.

@solalatus
Contributor Author

Maybe this advice should go somewhere into the documentation: #1287 (comment)

@hrzn
Contributor

hrzn commented Feb 16, 2023

Maybe this advice should go somewhere into the documentation: #1287 (comment)

That sounds like a good idea, yes @solalatus.
Would you agree to add a short new subsection about multi-GPU usage to the GPU/TPU page of the userguide?
Thanks!

@solalatus
Contributor Author

solalatus commented Feb 18, 2023

@hrzn I have added a description section to the userguide here, please have a look!

@hrzn hrzn merged commit 955e2b5 into unit8co:master Feb 21, 2023
alexcolpitts96 pushed a commit to alexcolpitts96/darts that referenced this pull request May 31, 2023
…unit8co#1509)

* added fix for multi GPU as per https://pytorch-lightning.readthedocs.io/en/stable/extensions/logging.html#automatic-logging

* trying to add complete logging in case of distributed to avoid deadlock

* fixing the logging on epoch end for multigpu training

* Black fixes for formatting errors

* Added description of multi GPU setup to the User Guide.

---------

Co-authored-by: Julien Herzen <julien@unit8.co>
Co-authored-by: madtoinou <32447896+madtoinou@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[BUG] Enabling multiple gpu causes AssertionError MULTI_GPU DATA_PARALLEL
4 participants