
Improved parallel utils #1023

Merged: 25 commits merged into pytorch:idist from parallel_utils on May 11, 2020

Conversation

vfdev-5 (Collaborator) commented May 8, 2020

PR to the idist branch.

Description:

  • added the idist module
  • added the computation models

Related to #992 and #1014; should enable #988. (A usage sketch follows the check list below.)

TODO:

  • documentation
  • decide about create_from_context API
  • decide on names: ComputationModel, _SerialModel, _NativeDistModel, _XlaDistModel

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)
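
A minimal usage sketch of the facade introduced here, assuming the helper names that appear in the commits of this PR (`idist.spawn`, `idist.device`, `idist.model_name`); exact signatures may differ:

```python
# Hedged sketch: idist dispatches to a computation model
# (_SerialModel, _NativeDistModel or _XlaDistModel) depending on
# how the script was launched.
import ignite.distributed as idist


def training(local_rank, config):
    # each spawned process lands here with its local rank
    print(local_rank, idist.device(), idist.model_name())


if __name__ == "__main__":
    # spawn 2 processes on the "gloo" backend
    # (argument names below are assumptions, not confirmed by this PR text)
    idist.spawn("gloo", training, args=({},), nproc_per_node=2)
```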

@vfdev-5 vfdev-5 changed the base branch from master to idist May 8, 2020 09:34
@vfdev-5 vfdev-5 marked this pull request as ready for review May 8, 2020 09:37
vfdev-5 (Collaborator, Author) commented May 8, 2020

@erip @sdesrozis please review

@vfdev-5 vfdev-5 mentioned this pull request May 8, 2020
sdesrozis (Contributor) left a comment

Some comments :)

Using torch.distributed.launch, if the env vars are not reset, context mode (i.e. dist.init_process_group(...) called by the user) should be fine, right? (See the sketch below.)

Spawn mode is more complex:

  • When spawned by ignite, local_rank could be added to _model.
  • When spawned by the user, a method is missing to let the user provide the local_rank.
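
A hedged sketch of the "context mode" in question, assuming the script is launched with `torch.distributed.launch --use_env` so that the launcher exports the env vars (including `LOCAL_RANK`); none of this is ignite API:

```python
# Hedged sketch of "context mode": the user initializes the process group
# themselves and ignite is expected to pick the configuration up from the
# existing context. Launched e.g. with:
#   python -m torch.distributed.launch --nproc_per_node=2 --use_env script.py
import os
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
print("local rank:", local_rank)
# ... create_from_context would be expected to reuse this existing group ...
dist.destroy_process_group()
```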

Review thread on ignite/distributed/comp_models/native.py (outdated, resolved)
sdesrozis (Contributor) commented:

I was thinking about the mpi backend. With this backend, there is no local rank in the env. The user would have to inform ignite one way or another, and maybe they can't 😔 (a hedged workaround sketch is below).
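
A hedged workaround sketch (an assumption, not ignite API): forward the MPI launcher's local rank through an env var before initializing the process group:

```python
# Hedged sketch: the mpi backend does not export a local rank, so the user
# would have to provide it explicitly, e.g. via an environment variable.
# Requires a PyTorch build with MPI support.
import os
import torch.distributed as dist

# the value would have to come from the MPI launcher
# (e.g. Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK)
os.environ.setdefault("LOCAL_RANK", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
dist.init_process_group(backend="mpi")
```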

@vfdev-5 vfdev-5 merged commit 177fb6f into pytorch:idist May 11, 2020
@vfdev-5 vfdev-5 deleted the parallel_utils branch May 11, 2020 15:59
vfdev-5 added a commit that referenced this pull request May 31, 2020
* Improved parallel utils (#1023)

* add utils for distributed

* autopep8 fix

* [WIP] Added comp models and tests

* [WIP] Added create_from_backend and create_from_context for _DistModel

* [WIP] Added spawn to _DistModel

* [WIP] Refactored comp models and added spawn for xla

* autopep8 fix

* Improved tests

* autopep8 fix

* Fixes flake8

* autopep8 fix

* - Removed is_distributed
- Renamed _DistModel -> _NativeDistModel

* autopep8 fix

* Added docs and tests for xla spawn

* Fixes conftest bug

* autopep8 fix

* Updates

* autopep8 fix

* Fixes available_backends bug

* autopep8 fix

* Fixed tests

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>

* [WIP] create from context for XLA

* autopep8 fix

* Tests for _sync_model for XLA

* autopep8 fix

* More tests and updates

* autopep8 fix

* [WIP] create from context for Native Torch Dist

* autopep8 fix

* Added tests for idist.* created from context for native dist settings

* [WIP] Fix tests

* Fixed metric related tests

* autopep8 fix

* [WIP] idist - Docs & code updates (#1034)

* [WIP] Updates docs

* Adapted metrics with idist, fixed tests
- added local rank estimation with hostname heuristics

* autopep8 fix

* Adapted metrics code and tests to use idist

* autopep8 fix

* Updated docs, docstrings

* Updated xla tests and fix a bug with tensor dtype

* autopep8 fix

* Fixed all gather using all reduce op

* autopep8 fix

* Improved tests of create_supervised_trainer on TPU

Co-authored-by: AutoPEP8 <>

* Tpu metrics (#1042)

* [WIP] make accumulation tests on TPU(s)

* Fixed tests with accumulation metric
- by decreasing tolerance

* Fixed all_gather bug

* autopep8 fix

* Added metric tests for xla

* autopep8 fix

* Fixed bug in test of precision
- updated other regression tests

* Fixed failing tests on TPU
- increased err tolerance

Co-authored-by: AutoPEP8 <>

* Increased err tol for mse and rmse tests on single TPU

* Fixes #991 (#1047)

- average output in RunningAverage

* add TPU checkpointing to CPU. (#1005)

* add TPU checkpointing to CPU.

* autopep8 fix

* update docstring to include TPU notice.

* add skip for non-TPU tests.

* autopep8 fix

* refactor to use idist API.

* autopep8 fix

* add complex save with TPU.

* autopep8 fix

* fix tests.

* fix typo in docstring.

Co-authored-by: vfdev <vfdev.5@gmail.com>

Co-authored-by: AutoPEP8 <>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Co-authored-by: vfdev <vfdev.5@gmail.com>

* Updated tests on checkpoint and TPU

* Added barrier op in idist (#1050)

* Added barrier op in idist

* Fixed test and updated one_rank_only to use idist

* Moved one_rank_only to idist, adapted tests

* autopep8 fix

* Removed redundant imports

* Another test fix of setup_logger

Co-authored-by: AutoPEP8 <>
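
A hedged usage sketch for the barrier op and the one_rank_only helper referenced in the commits above; the import path of one_rank_only is assumed:

```python
# Hedged sketch: run a block on rank 0 only, then synchronize all processes.
import ignite.distributed as idist
from ignite.distributed import one_rank_only  # import path assumed


@one_rank_only()  # body executes on rank 0 only; other ranks skip it
def log_checkpoint():
    print("saving/logging on rank 0")


log_checkpoint()
idist.barrier()  # synchronize all processes before continuing
```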

* Fixed bug with torch.cuda.set_device

* Fixed cuda device index, added warning if cuda device index != local rank

* autopep8 fix

* Issue 1011 (#1053)

* Fixed #1011

* autopep8 fix

* Fixed failing test

* Updated tests/ignite/contrib/engines/test_common.py

* Updated tests to improve coverage

* Fixed test

Co-authored-by: AutoPEP8 <>

* Improved device() method (#1062)

* Improved device() method
- fixes run_gpu_tests.sh to run tests that should skip if there is env `WORLD_SIZE`

* Improved code coverage

* Improve code coverage for serial model

* Idist kwargs dict (#1064)

* Added kwargs_dict to spawn, improved _set_model etc

* Fixed bug in tests/run_gpu_tests.sh

* Fixed black

* Updated typing

* removed badly merged _need_to_sync

* Improved device and setup_common_training_handlers (#1066)

* - idist.device() returns torch.device('cuda') in a non-distributed configuration when a CUDA device is available

* Improved setup_common_training_handlers
- no need to handle train_sampler when idist.model_name() is not serial but train_sampler was not set up as distributed (only one process)
- warn only if train_sampler has set_epoch
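
A hedged sketch of the device() behaviour described above (non-distributed configuration assumed):

```python
# Hedged sketch: in a non-distributed configuration, idist.device() is
# expected to return torch.device("cuda") when CUDA is available and
# torch.device("cpu") otherwise (native distributed configs would carry
# the local device index).
import torch
import ignite.distributed as idist

device = idist.device()
print(device)  # e.g. device(type='cuda') or device(type='cpu')
model = torch.nn.Linear(10, 2).to(device)
```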

* Idist improve2 (#1075)

* Improve tests on XLA

* Fixes xla test when spawn without 'fork'

* Added test of dtype for XLA

* Added support for str input for all gather (#1081)

* Added support for str input for all gather

* More tests for better coverage
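
A hedged sketch of the str support for all_gather mentioned above: each process contributes a string and receives the values from all processes back:

```python
# Hedged sketch: gather one string per process.
import ignite.distributed as idist

payload = "result from rank {}".format(idist.get_rank())
gathered = idist.all_gather(payload)  # expected: one entry per participating process
print(gathered)
```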

* Fix #1055 (#1068)

* issue_1055

* autopep8 fix

* decorate and refactor getattr

* remove decoration - need further discussions

* Added missing decorator for plx

* Added note about dist-friendly interface

* Updated Checkpoint to dist config + TPU

* autopep8 fix

* [WIP] Checkpoint in dist config

* autopep8 fix

* [WIP] Checkpoint in dist config

* autopep8 fix

* [WIP] Checkpoint on XLA

* autopep8 fix

* Fix checkpoint tests on XLA

* Put back Loggers as dist-unfriendly
+ tests for contrib savers

* Updated tests for XLA
- removed neptune xla tests

* autopep8 fix

* minor fix for coverage

* [WIP] New XLA tests for trains logger

* Fixed distrib tests for trains

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>

* Fix failing tests on multi-gpus

* Fix failing XLA tests

* Fixes failing tests on multi-GPUs

* autopep8 fix

* Remove useless barriers (#1085)

* remove useless barriers

* Fix failing tests

* Added missing barrier in test for XLA

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>

* Fixes failing TPU with fork mp

* Applied review suggestions

* autopep8 fix

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
Co-authored-by: Elijah Rippeth <elijah.rippeth@gmail.com>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Successfully merging this pull request may close the following issue:

  • Ignite should expose the ability to run distributed jobs on TPUs