
Improved parallel utils #1023

Merged: 25 commits merged into pytorch:idist from parallel_utils on May 11, 2020

Conversation

vfdev-5 (Collaborator) commented May 8, 2020

PR to the idist branch.

Description:

  • added the idist module
  • added the computation models

Related to #992 and #1014; should enable #988. (A usage sketch follows the check list below.)

TODO:

  • documentation
  • decide about create_from_context API
  • decide on names: ComputationModel, _SerialModel, _NativeDistModel, _XlaDistModel

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)
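
A minimal usage sketch of the facade introduced here, assuming the helper names that appear in the commits of this PR (`idist.spawn`, `idist.device`, `idist.model_name`); exact signatures may differ:

```python
# Hedged sketch: idist dispatches to a computation model
# (_SerialModel, _NativeDistModel or _XlaDistModel) depending on
# how the script was launched.
import ignite.distributed as idist


def training(local_rank, config):
    # each spawned process lands here with its local rank
    print(local_rank, idist.device(), idist.model_name())


if __name__ == "__main__":
    # spawn 2 processes on the "gloo" backend
    # (argument names below are assumptions, not confirmed by this PR text)
    idist.spawn("gloo", training, args=({},), nproc_per_node=2)
```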

@vfdev-5 vfdev-5 changed the base branch from master to idist May 8, 2020 09:34
@vfdev-5 vfdev-5 marked this pull request as ready for review May 8, 2020 09:37
vfdev-5 (Collaborator, Author) commented May 8, 2020

@erip @sdesrozis please review

@vfdev-5 vfdev-5 mentioned this pull request May 8, 2020
sdesrozis (Contributor) left a comment

Some comments :)

Using torch.distributed.launch, if the env vars are not reset, context mode (i.e. dist.init_process_group(...) called by the user) should be fine, right? (See the sketch below.)

Spawn mode is more complex:

  • When spawned by ignite, local_rank could be added to _model.
  • When spawned by the user, a method is missing to let the user provide the local_rank.
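
A hedged sketch of the "context mode" in question, assuming the script is launched with `torch.distributed.launch --use_env` so that the launcher exports the env vars (including `LOCAL_RANK`); none of this is ignite API:

```python
# Hedged sketch of "context mode": the user initializes the process group
# themselves and ignite is expected to pick the configuration up from the
# existing context. Launched e.g. with:
#   python -m torch.distributed.launch --nproc_per_node=2 --use_env script.py
import os
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
print("local rank:", local_rank)
# ... create_from_context would be expected to reuse this existing group ...
dist.destroy_process_group()
```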

Review thread on ignite/distributed/comp_models/native.py (outdated, resolved)
sdesrozis (Contributor) commented:

I was thinking about the mpi backend. With this backend, there is no local rank in the env. The user would have to inform ignite one way or another, and maybe they can't 😔 (a hedged workaround sketch is below).
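
A hedged workaround sketch (an assumption, not ignite API): forward the MPI launcher's local rank through an env var before initializing the process group:

```python
# Hedged sketch: the mpi backend does not export a local rank, so the user
# would have to provide it explicitly, e.g. via an environment variable.
# Requires a PyTorch build with MPI support.
import os
import torch.distributed as dist

# the value would have to come from the MPI launcher
# (e.g. Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK)
os.environ.setdefault("LOCAL_RANK", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
dist.init_process_group(backend="mpi")
```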

@vfdev-5 vfdev-5 merged commit 177fb6f into pytorch:idist May 11, 2020
@vfdev-5 vfdev-5 deleted the parallel_utils branch May 11, 2020 15:59
vfdev-5 added a commit that referenced this pull request May 31, 2020
* Improved parallel utils (#1023)

* add utils for distributed

* autopep8 fix

* [WIP] Added comp models and tests

* [WIP] Added create_from_backend and create_from_context for _DistModel

* [WIP] Added spawn to _DistModel

* [WIP] Refactored comp models and added spawn for xla

* autopep8 fix

* Improved tests

* autopep8 fix

* Fixes flake8

* autopep8 fix

* - Removed is_distributed
- Renamed _DistModel -> _NativeDistModel

* autopep8 fix

* Added docs and tests for xla spawn

* Fixes conftest bug

* autopep8 fix

* Updates

* autopep8 fix

* Fixes available_backends bug

* autopep8 fix

* Fixed tests

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>

* [WIP] create from context for XLA

* autopep8 fix

* Tests for _sync_model for XLA

* autopep8 fix

* More tests and updates

* autopep8 fix

* [WIP] create from context for Native Torch Dist

* autopep8 fix

* Added tests for idist.* created from context for native dist settings

* [WIP] Fix tests

* Fixed metric related tests

* autopep8 fix

* [WIP] idist - Docs & code updates (#1034)

* [WIP] Updates docs

* Adapted metrics with idist, fixed tests
- added local rank estimation with hostname heuristics

* autopep8 fix

* Adapted metrics code and tests to use idist

* autopep8 fix

* Updated docs, docstrings

* Updated xla tests and fix a bug with tensor dtype

* autopep8 fix

* Fixed all gather using all reduce op

* autopep8 fix

* Improved tests of create_supervised_trainer on TPU

Co-authored-by: AutoPEP8 <>

* Tpu metrics (#1042)

* [WIP] make accumulation tests on TPU(s)

* Fixed tests with accumulation metric
- by decreasing tolerance

* Fixed all_gather bug

* autopep8 fix

* Added metric tests for xla

* autopep8 fix

* Fixed bug in test of precision
- updated other regression tests

* Fixed failing tests on TPU
- increased err tolerance

Co-authored-by: AutoPEP8 <>

* Increased err tol for mse and rmse tests on single TPU

* Fixes #991 (#1047)

- average output in RunningAverage

* add TPU checkpointing to CPU. (#1005)

* add TPU checkpointing to CPU.

* autopep8 fix

* update docstring to include TPU notice.

* add skip for non-TPU tests.

* autopep8 fix

* refactor to use idist API.

* autopep8 fix

* add complex save with TPU.

* autopep8 fix

* fix tests.

* fix typo in docstring.

Co-authored-by: vfdev <vfdev.5@gmail.com>

Co-authored-by: AutoPEP8 <>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Co-authored-by: vfdev <vfdev.5@gmail.com>

* Updated tests on checkpoint and TPU

* Added barrier op in idist (#1050)

* Added barrier op in idist

* Fixed test and updated one_rank_only to use idist

* Moved one_rank_only to idist, adapted tests

* autopep8 fix

* Removed redundant imports

* Another test fix of setup_logger

Co-authored-by: AutoPEP8 <>
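
A hedged usage sketch for the barrier op and the one_rank_only helper referenced in the commits above; the import path of one_rank_only is assumed:

```python
# Hedged sketch: run a block on rank 0 only, then synchronize all processes.
import ignite.distributed as idist
from ignite.distributed import one_rank_only  # import path assumed


@one_rank_only()  # body executes on rank 0 only; other ranks skip it
def log_checkpoint():
    print("saving/logging on rank 0")


log_checkpoint()
idist.barrier()  # synchronize all processes before continuing
```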

* Fixed bug with torch.cuda.set_device

* Fixed cuda device index, added warning if cuda device index != local rank

* autopep8 fix

* Issue 1011 (#1053)

* Fixed #1011

* autopep8 fix

* Fixed failing test

* Updated tests/ignite/contrib/engines/test_common.py

* Updated tests to improve coverage

* Fixed test

Co-authored-by: AutoPEP8 <>

* Improved device() method (#1062)

* Improved device() method
- fixes run_gpu_tests.sh to run tests that should skip if there is env `WORLD_SIZE`

* Improved code coverage

* Improve code coverage for serial model

* Idist kwargs dict (#1064)

* Added kwargs_dict to spawn, improved _set_model etc

* Fixed bug in tests/run_gpu_tests.sh

* Fixed black

* Updated typing

* removed badly merged _need_to_sync

* Improved device and setup_common_training_handlers (#1066)

* - idist.device() returns torch.device('cuda') in a non-distributed configuration when a CUDA device is available

* Improved setup_common_training_handlers
- no need to handle train_sampler when idist.model_name() is not serial but train_sampler was not set up as distributed (only one process)
- warn only if train_sampler has set_epoch
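
A hedged sketch of the device() behaviour described above (non-distributed configuration assumed):

```python
# Hedged sketch: in a non-distributed configuration, idist.device() is
# expected to return torch.device("cuda") when CUDA is available and
# torch.device("cpu") otherwise (native distributed configs would carry
# the local device index).
import torch
import ignite.distributed as idist

device = idist.device()
print(device)  # e.g. device(type='cuda') or device(type='cpu')
model = torch.nn.Linear(10, 2).to(device)
```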

* Idist improve2 (#1075)

* Improve tests on XLA

* Fixes xla test when spawn without 'fork'

* Added test of dtype for XLA

* Added support for str input for all gather (#1081)

* Added support for str input for all gather

* More tests for better coverage
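
A hedged sketch of the str support for all_gather mentioned above: each process contributes a string and receives the values from all processes back:

```python
# Hedged sketch: gather one string per process.
import ignite.distributed as idist

payload = "result from rank {}".format(idist.get_rank())
gathered = idist.all_gather(payload)  # expected: one entry per participating process
print(gathered)
```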

* Fix #1055 (#1068)

* issue_1055

* autopep8 fix

* decorate and refactor getattr

* remove decoration - need further discussions

* Added missing decorator for plx

* Added note about dist-friendly interface

* Updated Checkpoint to dist config + TPU

* autopep8 fix

* [WIP] Checkpoint in dist config

* autopep8 fix

* [WIP] Checkpoint in dist config

* autopep8 fix

* [WIP] Checkpoint on XLA

* autopep8 fix

* Fix checkpoint tests on XLA

* Put back Loggers as dist-unfriendly
+ tests for contrib savers

* Updated tests for XLA
- removed neptune xla tests

* autopep8 fix

* minor fix for coverage

* [WIP] New XLA tests for trains logger

* Fixed distrib tests for trains

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>

* Fix failing tests on multi-gpus

* Fix failing XLA tests

* Fixes failing tests on multi-GPUs

* autopep8 fix

* Remove useless barriers (#1085)

* remove useless barriers

* Fix failing tests

* Added missing barrier in test for XLA

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: vfdev-5 <vfdev.5@gmail.com>

* Fixes failing TPU with fork mp

* Applied review suggestions

* autopep8 fix

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: AutoPEP8 <>
Co-authored-by: Elijah Rippeth <elijah.rippeth@gmail.com>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Successfully merging this pull request may close the following issue:

  • Ignite should expose the ability to run distributed jobs on TPUs