[FR] Parallel helper tools #1014

sdesrozis · 2020-05-05T08:29:51Z

Description:

This PR aims to provide helper tools for parallel (distributed) computation. It is meant to handle TPU, GPU and CPU using the xla and torch.distributed backends (gloo for CPU and nccl for GPU).
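To make the device-to-backend pairing in the description concrete, here is a minimal sketch. The helper name `pick_backend` and the table `BACKEND_BY_DEVICE` are hypothetical and are not part of ignite. The backend strings are the ones used in the description above, and the names the library actually accepts may differ.

```python
# Hypothetical helper, not part of ignite: map a device kind to the
# communication backend named in the PR description.
BACKEND_BY_DEVICE = {
    "tpu": "xla",   # TPU is handled through XLA
    "gpu": "nccl",  # GPU uses NCCL
    "cpu": "gloo",  # CPU uses Gloo
}


def pick_backend(device_kind: str) -> str:
    """Return the backend name for a device kind ("tpu", "gpu" or "cpu")."""
    key = device_kind.strip().lower()
    if key not in BACKEND_BY_DEVICE:
        raise ValueError(f"unknown device kind: {device_kind!r}")
    return BACKEND_BY_DEVICE[key]


if __name__ == "__main__":
    for kind in ("tpu", "gpu", "cpu"):
        print(kind, "->", pick_backend(kind))
```

A lookup table keeps the pairing in one place, so a caller only states the device kind and never hard-codes a backend string.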

TODO:

Check list:

New tests are added (if a new feature is added)
New doc strings: description and/or example code are in RST format
Documentation is updated (if required)

examples/contrib/cifar10/main.py

ignite/distributed/auto.py

ignite/distributed/utils.py

ignite/distributed/auto.py

sdesrozis · 2020-05-05T12:57:19Z

@vfdev-5 thank you for your comments :) I will update this draft soon.

sdesrozis · 2020-05-18T20:02:03Z

Ok! There is a lot of work to do for auto. I will first decorate handlers.

vfdev-5 · 2020-05-25T00:01:42Z

Parallel does not work on Colab TPUs...
This is due to setup_logger, which calls idist.get_rank() and thereby initializes the context with 1 TPU...
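The failure mode described here can be illustrated with a toy model. This is a sketch under stated assumptions, not ignite code: `ToyContext`, `get_rank`, `launch`, `reset` and `setup_logger_early` are all hypothetical stand-ins. It assumes the context is created lazily on the first rank query and then reused by the launcher.

```python
from typing import Optional


class ToyContext:
    """Hypothetical stand-in for a distributed context."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size


_context: Optional[ToyContext] = None


def reset() -> None:
    """Forget any existing context (for the demonstration only)."""
    global _context
    _context = None


def get_rank() -> int:
    """Querying the rank creates a default 1-process context if none exists."""
    global _context
    if _context is None:
        _context = ToyContext(world_size=1)
    return 0


def launch(requested_world_size: int) -> int:
    """Start a 'parallel' run, reusing a context if one was already created."""
    global _context
    if _context is None:
        _context = ToyContext(world_size=requested_world_size)
    return _context.world_size


def setup_logger_early() -> str:
    """Toy logger setup that queries the rank before the launch."""
    return f"logger for rank {get_rank()}"


if __name__ == "__main__":
    reset()
    setup_logger_early()          # early rank query locks in one process
    print(launch(8))              # the launcher sees the existing context

    reset()
    print(launch(8))              # no early query: the requested size is used
```

In this toy model, the remedy is to postpone any rank query, including the one hidden in the logger setup, until after the launcher has created the context.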

sdesrozis

Ok!

* [FR] Parallel helper tools (#1014)
* [WIP] auto and parallel dist modules
* [WIP] auto optim
* Added xla optimizer wrapper
  - other code updates
* Updated auto and cifar10 example
* - Fixed resume from
  - other cosmetics
* Fixed bug with _XLADistributedOptimizer
  - updated default LR
* autopep8 fix
* Updated README and minor fixes
* autopep8 fix
* - Removed mnist distributed example
  - Reverted unintended modifications
* Tests of auto methods
* autopep8 fix
* Tests, docs and code updates
* autopep8 fix
* Up code, test, cifar10 example and docs
* Added option to stop the training
  - updated ci
* Updated readme and fixed ci configs
* - Updated code, README and remove old cifar10

Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
Co-authored-by: AutoPEP8 <>

* Fixes failing tests
* Minor updates
* Other minor updates
* Example readme update and minor fixes
* Added test on load_objects ddp to improve coverage
* Added more tests for parallel launcher
* Replaced pbars by logger
* Updated link to cifar10 example
* Fixes codecov upload
* Updated coverage report type for gpu/tpu

Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>

* Added Windows/MacOSX CI for py3.7 only (#1113)
* [WIP] Added Windows CI for py2.7 only
* Excluded examples from windows ci
* Update unittests.yml
* Update unittests.yml
* Fixed shell bash as suggested
* Fixed failing tests on Win32
  - added MNIST test for Win32 in Github actions
  - added tests on macosx in Github actions
* Fixed isort
* Fixes tests with IterableDataset
* Skipped slow deterministic tests on win32
* skip failing timer tests on macos
* fix macos platform name
* fix _test_setup_logging
* skip frequency tests on win platform
* skip time tests on macos
* fix flake8
* fix isort
* Skip distrib tests for Win32
* skip time test for macos
* Updated github actions yaml
* skip modules for macos
* Fixed bad skip of deterministic tests, reduced time for slow tests
* Do not run dist tests on macosx

Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>

* [FR] Parallel helper tools (#1014) (#1116)
* [FR] Parallel helper tools (#1014)
* [WIP] auto and parallel dist modules
* [WIP] auto optim
* Added xla optimizer wrapper
  - other code updates
* Updated auto and cifar10 example
* - Fixed resume from
  - other cosmetics
* Fixed bug with _XLADistributedOptimizer
  - updated default LR
* autopep8 fix
* Updated README and minor fixes
* autopep8 fix
* - Removed mnist distributed example
  - Reverted unintended modifications
* Tests of auto methods
* autopep8 fix
* Tests, docs and code updates
* autopep8 fix
* Up code, test, cifar10 example and docs
* Added option to stop the training
  - updated ci
* Updated readme and fixed ci configs
* - Updated code, README and remove old cifar10

Co-authored-by: vfdev-5 <vfdev.5@gmail.com>
Co-authored-by: AutoPEP8 <>

* Fixes failing tests
* Minor updates
* Other minor updates
* Example readme update and minor fixes
* Added test on load_objects ddp to improve coverage
* Added more tests for parallel launcher
* Replaced pbars by logger
* Updated link to cifar10 example
* Fixes codecov upload
* Updated coverage report type for gpu/tpu

Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>

* Fixes #1120 (#1122)
* Fixes #1120
  - Aligned idist args and method names to torch.distributed.launch
* replace missing num_procs_per_node
* black format
* fix bug
* replace num_nodes by nnodes

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>

* reverse order of remove/save in Checkpoint handling (#1117)
* reverse order of remove/save so there is never an n+1 checkpoint situation.
* evict if new item is better than candidate for eviction.
* swap order of updating saved and saving to ensure consistency of state
* remove redundant method.

Co-authored-by: vfdev <vfdev.5@gmail.com>

* Fix test auto tpu (#1126)
* Fixed failing tpu tests
* Updated docstring of cifar10 example

* Auto pin_memory (#1129)
* Auto pin_memory
* autopep8 fix

Co-authored-by: AutoPEP8 <>

* fix auto pin_memory : idist.device().type should be used (#1131)
* fix auto pin_memory : idist.device().type should be used
* fix cuda in device
* fix test
* use idist.device().type to test
* add missing ()

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>

* Update pascal voc12 example (#1125)
* [WIP][Pascal-VOC12] Update/refactor example
* [WIP][Pascal-VOC12] Update/refactor example 2
* [WIP] Updated mlflow files
* Removed unused files
* Fixed flake and black
* Removed unused import and fixed version for mlflow

Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>

* fix cifar10 model : num_classes missing (#1134)

Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>

* Accuracy MultiLabel Handling and Error Message (#1132)
* Updated check for multilabel and error message
* Updated docstring and error message
* Updated error message formatting

Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>

* Updated ImageNet example (#1138)
* [WIP] Updated ImageNet example
  - minor fixes for Pascal VOC12
* Fixed flake8

* Updated pytorch-version-tests.yml to run cron every day at 00:00 UTC (#1141)

Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>

* Added check_compute_fn argument to EpochMetric and related metrics (#1140)
* Added check_compute_fn argument to EpochMetric and related functions.
* Updated docstrings
* Added check_compute_fn to _BaseRegressionEpoch
* Adding typing hints for check_compute_fn
* Update roc_auc.py

Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Co-authored-by: vfdev <vfdev.5@gmail.com>

* Docs cosmetics (#1142)
* Updated docs, replaced single quote by double quote if is code
  - fixed missing link to Engine
  - cosmetics
* More doc updates
* More updates

* Fix batch size calculation error (#1137)
* Fix batch size calculation error
* Add tests for fixed batch size calculation
* Fix tests
* Test for num_workers
* Fix nproc comparison
* Improve docs
* Fixed docstring

Co-authored-by: vfdev <vfdev.5@gmail.com>

* Docs updates (#1139)
* [WIP] Added teaser gif
* [WIP] Updated README
* [WIP] Updated README
* [WIP] Updated docs
* Reverted unintended pyproject.toml edits
* Updated README and examples parts
* More updates of README
* Added badge to check pytorch/python compatible versions
* Updated README
* Added ref to blog "Using Optuna to Optimize PyTorch Ignite Hyperparameters"
* Update README.md
* Fixed bad internal link in examples
* Updated README

* Fixes docs (#1147)
* Fixed bad link on teaser
* Added manual_seed into docs

* Issue #1115 : pbar persists due to specific rule in tqdm (notebook) when n < total (#1145)
* Issue #1115 pbar persists in notebook due to specific rules when n < total
* close pbar doesn't rise danger bar
* fix when pbar.total is None

Co-authored-by: vfdev <vfdev.5@gmail.com>
Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>

* Updated codebase such that torch>=1.3 (#1150)

Co-authored-by: vfdev <vfdev.5@gmail.com>

* add wandb (#1152)
  wandb integration already exists, just adding it to the requirements file

* Fixed typo and missing part of "Where to go next" (#1151)

* Fixes #1153 (#1154)
  - temporary downgrade of scipy to 1.4.1 instead of 1.5.0

* Use global_step as priority, if it exists (#1155)
* Use global_step as priority, if it exists
* Fix flake8 error
* Style fix

Co-authored-by: vfdev <vfdev.5@gmail.com>

* Fix TrainsSaver handling of Checkpoint's n_saved (#1135)
* Utilize Trains framework callbacks to better support checkpoint saving and respect Checkpoint.n_saved
* Update trains callbacks to new format
* autopep8 fix
* Fix trains mnist example (store checkpoints in local folder)
* Use trains 0.15.1rc0 until PR is approved
* Use CallbackType for Trains callback type resolution. Add unit test for Trains callbacks
* Update trains version
* Updated test_trains_saver_callbacks

Co-authored-by: jkhenning <>
Co-authored-by: vfdev <vfdev.5@gmail.com>

* Stateful handlers (#1156)
* Stateful handlers
* Added state_dict/load_state_dict tests for Checkpoint
* integration test
* Updated docstring and added include_self to ModelCheckpoint
* An integreation test for checkpointing with stateful handlers
* Black and flake8

Co-authored-by: vfdev-5 <vfdev.5@gmail.com>

* Bump version to 0.4rc.0.post1
* bump version to v0.4.0 🎉

Co-authored-by: vfdev <vfdev.5@gmail.com>
Co-authored-by: Sylvain Desroziers <sylvain.desroziers@gmail.com>
Co-authored-by: Desroziers <sylvain.desroziers@ifpen.fr>
Co-authored-by: Marijan Smetko <marijansmetko123@gmail.com>
Co-authored-by: Anmol Joshi <anmolsjoshi@gmail.com>
Co-authored-by: Lavanya Shukla <lavanya.shukla12@gmail.com>
Co-authored-by: Akihiro Matsukawa <amatsukawa@users.noreply.github.com>
Co-authored-by: Jake Henning <59198928+jkhenning@users.noreply.github.com>

sdesrozis self-assigned this May 5, 2020

sdesrozis marked this pull request as draft May 5, 2020 08:29

vfdev-5 reviewed May 5, 2020

ignite/distributed/utils.py (outdated)

vfdev-5 reviewed May 5, 2020

ignite/distributed/utils.py (outdated)

vfdev-5 reviewed May 5, 2020

ignite/distributed/auto.py (outdated)

sdesrozis mentioned this pull request May 6, 2020

Utils for distributed computation #1019

Closed

3 tasks

vfdev-5 mentioned this pull request May 8, 2020

Improved parallel utils #1023

Merged

6 tasks

vfdev-5 changed the base branch from master to idist May 18, 2020 08:45

vfdev-5 marked this pull request as ready for review May 18, 2020 08:59

vfdev-5 mentioned this pull request May 22, 2020

[WIP] Merge idist into master #1045

Merged

9 tasks

vfdev-5 force-pushed the parallel_api branch from b9d386a to 80d3407 Compare May 23, 2020 22:12

vfdev-5 force-pushed the parallel_api branch from 48d1eca to 05cf69b Compare May 25, 2020 08:40

sdesrozis commented May 30, 2020

vfdev-5 added this to In progress in 0.4.0 via automation May 31, 2020

vfdev-5 changed the base branch from idist to master June 1, 2020 21:52

vfdev-5 force-pushed the parallel_api branch from f64095e to 86ddef0 Compare June 5, 2020 10:54

vfdev-5 changed the title from "[WIP] parallel tools" to "[FR] Parallel helper tools" Jun 5, 2020

vfdev-5 added 6 commits June 5, 2020 23:32

[WIP] auto and parallel dist modules

015eea6

[WIP] auto optim

468ca14

Added xla optimizer wrapper

361ae2c

- other code updates

Updated auto and cifar10 example

991a73d

- Fixed resume from

dd81ce9

- other cosmetics

Fixed bug with _XLADistributedOptimizer

589a52f

- updated default LR

vfdev-5 force-pushed the parallel_api branch from 86ddef0 to 589a52f Compare June 7, 2020 00:18

AutoPEP8 and others added 2 commits June 7, 2020 00:19

autopep8 fix

b779029

Updated README and minor fixes

f9b401f

AutoPEP8 and others added 11 commits June 7, 2020 00:38

autopep8 fix

57fe57b

Merge branch 'master' into parallel_api

411eca6

Merge branch 'master' of https://github.com/pytorch/ignite into paral…

34bfbf3

…lel_api

Merge branch 'parallel_api' of https://github.com/sdesrozis/ignite in…

2fc0e68

…to parallel_api

- Removed mnist distributed example

9f7f4fa

- Reverted unintended modifications

Tests of auto methods

1d6631a

autopep8 fix

1b7b775

Tests, docs and code updates

830404e

autopep8 fix

db1a599

Up code, test, cifar10 example and docs

b46912a

Merge branch 'master' into parallel_api

52ac2a2

vfdev-5 changed the base branch from master to parallel_api June 8, 2020 23:49

vfdev-5 added 3 commits June 9, 2020 12:12

Added option to stop the training

bb3810d

- updated ci

Updated readme and fixed ci configs

0c7e019

- Updated code, README and remove old cifar10

3aa47cf

vfdev-5 merged commit 199888e into pytorch:parallel_api Jun 9, 2020

0.4.0 automation moved this from In progress to Done Jun 9, 2020

vfdev-5 mentioned this pull request Jun 9, 2020

[FR] Parallel helper tools (#1014) #1116

Merged

3 tasks

sdesrozis deleted the parallel_api branch June 15, 2020 15:56
