Distrib #573
Conversation
- Test on 2 GPUs, single node
- Added cmd in .travis.yml to indicate how to test locally
- Updated travis to run tests in 4 processes
* Fixes issue pytorch#543. The previous ConfusionMatrix implementation suffered from a problem when the target contains non-contiguous indices. The new implementation is largely taken from torchvision's https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117. This commit also removes the case of targets shaped as (batchsize, num_categories, ...) where num_categories excludes the background class. Confusion matrix computation is possible almost identically for (batchsize, ...), but when the target is all zeros (0, ..., 0), i.e. no classes (background class), the confusion matrix does not count any true/false predictions.
* Update confusion_matrix.py
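For context, a rough sketch of the bincount-based confusion matrix update that the referenced torchvision code (and this commit) rely on; the function name and signature here are illustrative, not the actual ignite API:

import torch

def update_confusion_matrix(cm, y_pred, y, num_classes):
    # cm: (num_classes, num_classes) accumulator; y_pred and y: flattened class-index tensors
    mask = (y >= 0) & (y < num_classes)  # skip out-of-range target entries
    indices = num_classes * y[mask] + y_pred[mask]
    cm += torch.bincount(indices, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return cm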
Hi @vfdev-5, I don't have any experience with distributed training, so I don't think I can give constructive suggestions. But I have some general comments. Ignore me if I am saying something wrong.
@zasdfgbnm thank you for the review and your comments! I'll try to give my point of view on this.
Originally, I saw the synchronization in maskrcnn-benchmark and now in the torchvision references, where they print metrics: https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/references/classification/train.py#L69 As far as I understand, synchronization goes together with a reduction op, so I did not think to separate them.
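To illustrate why the two go together, a minimal sketch (assumed helper name, not existing ignite code): a single all_reduce call both synchronizes the processes and sums their partial counts.

import torch
import torch.distributed as dist

def reduce_metric_parts(num_correct, num_examples, device="cuda"):
    # pack the per-process partial counts and let all_reduce (SUM by default) combine them;
    # the collective call also acts as a synchronization point across processes
    t = torch.tensor([num_correct, num_examples], dtype=torch.float64, device=device)
    dist.all_reduce(t)
    num_correct, num_examples = t.tolist()
    return num_correct / max(num_examples, 1)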
Yes, you are right: most of our metrics are computed by accumulating things such as a sum of something and a number of samples. In these cases, ...
I tried to answer this question here: https://github.com/pytorch/ignite/blob/99a6b4a515acacc84bb7438f0e6afebcb6378d70/docs/source/metrics.rst#how-to-create-a-custom-metric After rereading this, I think we need to add more info about how to play with the distributed decorators and the device, as you asked. In the case of non-distributed computations, I would say everything remains the same (even without decorators). Thank you again for the comments, and please let me know what you think.
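For illustration, a hedged sketch of a custom metric written against the decorators discussed in this PR; the import path and decorator names follow this branch and may differ in the released API:

import torch
from ignite.metrics import Metric
from ignite.metrics.metric import sync_all_reduce, reinit_is_reduced  # assumed location of the decorators

class CustomAccuracy(Metric):
    @reinit_is_reduced
    def reset(self):
        self._num_correct = 0
        self._num_examples = 0

    @reinit_is_reduced
    def update(self, output):
        # accumulate plain Python counters; no device placement is forced here
        y_pred, y = output
        self._num_correct += (y_pred.argmax(dim=1) == y).sum().item()
        self._num_examples += y.shape[0]

    @sync_all_reduce("_num_correct", "_num_examples")
    def compute(self):
        # in a distributed group the two counters are all-reduced before this body runs
        return self._num_correct / max(self._num_examples, 1)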
I changed the behaviour of ... Another remark: in ...
Hi @vfdev-5, the change is big, and I just started looking at it. I am a little busy recently, so I won't finish very quickly, but I will keep reading and discussing. I do have some general suggestions besides my few comments on the code. In general, do you think it would be better if we could create an ignite/metrics/distributed.py and move the decorators and _sync_all_reduce inside that file? I mean, for the Metric class we would not make any changes. Also, if the methods that need to be decorated are the same, why not create a single decorator, so that the user could use it like:
from ignite.metrics.dist import all_reduced_synchronized

@all_reduced_synchronized('myvar1', 'myvar2')
class MyMetric:
    def ...
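For what it's worth, one way such a class-level decorator could look; this is purely a sketch of the suggestion above (all_reduced_synchronized is not existing ignite API, and the metric is assumed to expose a _device attribute):

import torch
import torch.distributed as dist

def all_reduced_synchronized(*attrs):
    # hypothetical decorator: wrap compute() so the listed attributes are all-reduced first
    def wrap(cls):
        original_compute = cls.compute
        def compute(self):
            if dist.is_available() and dist.is_initialized():
                for name in attrs:
                    value = getattr(self, name)
                    if not isinstance(value, torch.Tensor):
                        value = torch.tensor(value, dtype=torch.float64, device=self._device)
                    dist.all_reduce(value)  # with nccl the tensor must live on this rank's GPU
                    setattr(self, name, value)
            return original_compute(self)
        cls.compute = compute
        return cls
    return wrap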
if isinstance(tensor, torch.Tensor):
    # check if the tensor is at specified device
    if tensor.device != self._device:
        tensor = tensor.to(self._device)
I don't know about distributed training, but what would happen if we don't do so? This is only called in self.compute, correct?
As far as I understood, if we have multiple GPUs and call an op such as reduce, then internally it collects data from all GPUs. If the tensor is not defined on one of the GPUs, the operation would hang...
So, in update we do not force all internal variables to be on the associated device (rank), but in compute it is necessary.
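A simplified sketch of that logic (the method name follows the diff above; this is an illustration, not the exact implementation):

import numbers
import torch
import torch.distributed as dist

def _sync_all_reduce(self, tensor):
    if not (dist.is_available() and dist.is_initialized()):
        return tensor  # nothing to reduce outside a distributed group
    if isinstance(tensor, numbers.Number):
        tensor = torch.tensor(tensor, device=self._device, dtype=torch.float64)
    elif isinstance(tensor, torch.Tensor) and tensor.device != self._device:
        # move to the device associated with this rank, otherwise the collective op would hang
        tensor = tensor.to(self._device)
    dist.all_reduce(tensor)
    return tensor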
    return wrapper


def reinit_is_reduced(func):
The name reinit_is_reduced might look confusing to users. Without looking at the source code of Metric, it is hard to know what is_reduced is and why it needs a reinit. BTW: I only see code that writes this variable, but nothing is reading it. Am I correct?
This is really ignite's internal method and should not be used by users except to decorate custom metrics. The purpose of the method is to ensure that compute always returns the same output.
""" | ||
pass | ||
|
||
def _sync_all_reduce(self, tensor): |
Could we define this inside the body of sync_all_reduce, instead of here?
@zasdfgbnm thank you for the review!
I thought about this in the beginning and, as ...
Yes, this could be helpful for the majority of metrics, except some corner cases such as ... The purpose of the decorator is to ensure the following behaviour:

a) distributed, with the decorator:

r1 = m.compute()
r2 = m.compute()
assert r1 == r2

b) distributed, without the decorator:

r1 = m.compute()
r2 = m.compute()
assert r1 != r2

Because ...
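A hedged sketch of how the two decorators could interact to give that behaviour (attribute and function names follow this PR; the real code may differ):

import functools

def sync_all_reduce(*attrs):
    def decorator(compute):
        @functools.wraps(compute)
        def wrapper(self):
            if not getattr(self, "_is_reduced", False):
                for name in attrs:
                    setattr(self, name, self._sync_all_reduce(getattr(self, name)))
                self._is_reduced = True  # a second compute() reuses the already reduced values
            return compute(self)
        return wrapper
    return decorator

def reinit_is_reduced(func):
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        func(self, *args, **kwargs)
        self._is_reduced = False  # new data arrived, the next compute() must reduce again
    return wrapper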
* Added mlflow logger without tests
* Added mlflow tests, updated mlflow logger code and other tests
* Updated docs and added mlflow in travis
* Added tests for mlflow OptimizerParamsHandler - additionally added OptimizerParamsHandler for plx with tests
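For reference, a hedged usage sketch of the MLflow logger added here; handler names follow ignite's contrib logging handlers, `trainer` and `optimizer` are assumed to already exist, and exact signatures may differ:

from ignite.contrib.handlers.mlflow_logger import MLflowLogger, OutputHandler, OptimizerParamsHandler
from ignite.engine import Events

mlflow_logger = MLflowLogger()
# `trainer` is assumed to be an existing ignite Engine and `optimizer` a torch optimizer
mlflow_logger.attach(trainer,
                     log_handler=OutputHandler(tag="training",
                                               output_transform=lambda loss: {"loss": loss}),
                     event_name=Events.ITERATION_COMPLETED)
mlflow_logger.attach(trainer,
                     log_handler=OptimizerParamsHandler(optimizer),
                     event_name=Events.ITERATION_STARTED)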
* Update .travis.yml
* Update .travis.yml
* Fixed tests and improved travis
* Update .travis.yml
* Update .travis.yml
* Fixed tests and improved travis
* Fixes SSL problem to download model weights
* Add tests for event removable handle. Add feature tests for engine.add_event_handler returning removable event handles.
* Return RemovableEventHandle from Engine.add_event_handler.
* Fixup removable event handle test in python 2.7. Explicitly trigger gc, allowing cycle detection between engine and state, in removable handle weakref test. Python 2.7 cycle detection appears to be less aggressive than python 3+.
* Add removable event handler docs. Add autodoc configuration for RemovableEventHandler, expand "concepts" documentation with event remove example following event add example.
* Update concepts.rst
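For readers of this thread, a small usage sketch of the removable handle these commits introduce; `trainer` is assumed to be an existing Engine:

from ignite.engine import Events

def log_epoch(engine):
    print("epoch completed:", engine.state.epoch)

handle = trainer.add_event_handler(Events.EPOCH_COMPLETED, log_epoch)
# ... later the handler can be detached again via the returned handle
handle.remove()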
* [WIP] Added cifar10 distributed example
* [WIP] Metric with all reduce decorator and tests
* [WIP] Added tests for accumulation metric
* [WIP] Updated with reinit_is_reduced
* [WIP] Distrib adaptation for other metrics
* [WIP] Warnings for EpochMetric and Precision/Recall when distrib
* Updated metrics and tests to run on distributed configuration
  - Test on 2 GPUS single node
  - Added cmd in .travis.yml to indicate how to test locally
  - Updated travis to run tests in 4 processes
* Minor fixes and cosmetics
* Fixed bugs and improved contrib/cifar10 example
* Updated docs
* Update metrics.rst
* Updated docs and set device as "cuda" in distributed instead of raising error
* [WIP] Fix missing _is_reduced in precision/recall with tests
* Updated other tests
* Updated travis and renamed tbptt test gpu -> cuda
* Distrib (#573)
* [WIP] Added cifar10 distributed example
* [WIP] Metric with all reduce decorator and tests
* [WIP] Added tests for accumulation metric
* [WIP] Updated with reinit_is_reduced
* [WIP] Distrib adaptation for other metrics
* [WIP] Warnings for EpochMetric and Precision/Recall when distrib
* Updated metrics and tests to run on distributed configuration
  - Test on 2 GPUS single node
  - Added cmd in .travis.yml to indicate how to test locally
  - Updated travis to run tests in 4 processes
* Minor fixes and cosmetics
* Fixed bugs and improved contrib/cifar10 example
* Updated docs
* Fixes issue #543 (#572)
* Fixes issue #543: Previous CM implementation suffered from the problem if target contains non-contiguous indices. New implementation is almost taken from torchvision's https://github.com/pytorch/vision/blob/master/references/segmentation/utils.py#L75-L117 This commit also removes the case of targets as (batchsize, num_categories, ...) where num_categories excludes background class. Confusion matrix computation is possible almost similarly for (batchsize, ...), but when target is all zero (0, ..., 0) = no classes (background class), then confusion matrix does not count any true/false predictions.
* Update confusion_matrix.py
* Update metrics.rst
* Updated docs and set device as "cuda" in distributed instead of raising error
* [WIP] Fix missing _is_reduced in precision/recall with tests
* Updated other tests
* Added mlflow logger (#558)
* Added mlflow logger without tests
* Added mlflow tests, updated mlflow logger code and other tests
* Updated docs and added mlflow in travis
* Added tests for mlflow OptimizerParamsHandler - additionally added OptimizerParamsHandler for plx with tests
* Update to PyTorch v1.2.0 (#580)
* Update .travis.yml
* Update .travis.yml
* Fixed tests and improved travis
* Fix SSL problem of failing travis (#581)
* Update .travis.yml
* Update .travis.yml
* Fixed tests and improved travis
* Fixes SSL problem to download model weights
* Fixed travis for deploy and nightly
* Fixes #583 (#584)
* Fixes docs build warnings (#585)
* Return removable handle from Engine.add_event_handler(). (#588)
* Add tests for event removable handle. Add feature tests for engine.add_event_handler returning removable event handles.
* Return RemovableEventHandle from Engine.add_event_handler.
* Fixup removable event handle test in python 2.7. Explicitly trigger gc, allowing cycle detection between engine and state, in removable handle weakref test. Python 2.7 cycle detection appears to be less aggressive than python 3+.
* Add removable event handler docs. Add autodoc configuration for RemovableEventHandler, expand "concepts" documentation with event remove example following event add example.
* Update concepts.rst
* Updated travis and renamed tbptt test gpu -> cuda
* Compute IoU, Precision, Recall based on CM on CPU
* Fixes incomplete merge with 1856c8e
* Update distrib branch and CIFAR10 example (#647)
* Added tests with gloo, minor updates and fixes
* Added single/multi node tests with gloo and [WIP] with nccl
* Added tests for multi-node nccl, improved examples/contrib/cifar10 example
* Experiments: 1n1gpu, 1n2gpus, 2n2gpus
* Fix flake8
* Fixes #645 (#646) - fix CI and improve create_lr_scheduler_with_warmup
* Fix tests for python 2.7
* Finalized Cifar10 example (#649)
* Added gcp tb logger image and updated README
* Added gcp ai platform scripts to run trainings
* Improved docs and readmes
Fixes #568
Description: I created the "distrib" branch to work on #568, adapting our code to compute metrics in a distributed configuration. The idea is to merge into this branch and test the code in various conditions. We can improve the code iteratively by merging into this branch before merging to master.
Tests are added and pass on a single-node, 2 GPU config. The test pytest.fixture initializes an 'nccl'-backed process group every time and runs various code tests. Sometimes, when running all the tests in a single run, they can get stuck, but they pass when run separately.
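A sketch of the kind of fixture described here (fixture name and environment handling are assumptions; the actual fixture in the PR may differ):

import os
import pytest
import torch.distributed as dist

@pytest.fixture
def nccl_group():
    # rank and world size are expected to come from the process launcher via the environment
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    yield rank
    dist.destroy_process_group()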
Check list:
cc @zasdfgbnm if you could take a look and give your opinion, that would be awesome! Thanks