
Performance regression when training MaskRCNN with CUDA 11 #1458

Closed
changlan opened this issue Jun 24, 2020 · 9 comments
Labels
upstream issue issues in other libraries

Comments

@changlan

changlan commented Jun 24, 2020

It seems that training Mask R-CNN with CUDA 11 underperforms. However, I did not see this issue when training other models without tensorpack (e.g. the official ResNet). I am actually not sure whether this is a tensorpack issue or not (and if so, what the root cause would be).

1. What you did:

(1) If you're using examples, what's the command you run:

mpirun -np 64 -hostfile HOSTFILE -mca plm_rsh_no_tree_spawn 1 --allow-run-as-root -bind-to socket -map-by slot -x TF_CPP_MIN_LOG_LEVEL=0 -x NCCL_SOCKET_IFNAME=^lo,docker0 -x TF_CUDNN_USE_AUTOTUNE=0 -x CUDA_VISIBLE_DEVICES=0,1,3,2,7,6,4,5 -x NCCL_DEBUG=INFO -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 /opt/conda/bin/python tensorpack/examples/FasterRCNN/train.py --config BACKBONE.WEIGHTS=ImageNet-R50-AlignPadding.npz DATA.BASEDIR=coco TRAINER=horovod TRAIN.EVAL_PERIOD=0 TRAIN.LR_SCHEDULE="[32000, 32000, 32000]"

(2) If you're using examples, have you made any changes to the examples? Paste git status; git diff here:

N/A

2. What you observed:

The training throughput is lower than with CUDA 10. On 8 servers with 8 V100 GPUs each (64 GPUs total), the throughput with CUDA 11 was ~180 samples/sec, while the throughput with CUDA 10 was ~270 samples/sec.

3. What you expected, if not obvious.

Higher throughput expected.

4. Your environment:

Paste the output of this command: python -c 'import tensorpack.tfutils as u; print(u.collect_env_info())'
If this command failed, tell us your version of Python/TF/tensorpack.

sys.platform          linux
Python                3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:57:50) [GCC 7.5.0]
Tensorpack            v0.10.1-9-g9c1b1b7b-dirty
Numpy                 1.19.0
TensorFlow            1.15.3/v1.15.3-1-geff0eb3b3c
TF Compiler Version   5.4.0 20160609
TF CUDA support       True
TF MKL support        False
TF XLA support        False
Nvidia Driver         /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.36.06
CUDA                  /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0.171
CUDNN                 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so
NCCL                  /usr/local/nccl2/lib/libnccl.so.2.7.5
CUDA_VISIBLE_DEVICES  Unspecified
GPU 0,1,2,3,4,5,6,7   Tesla V100-SXM2-16GB
Free RAM              553.92/614.16 GB
CPU Count             96
Horovod               0.19.5
cv2                   3.4.2
msgpack               1.0.0
python-prctl          False
@ppwwyyxx
Collaborator

ppwwyyxx commented Jun 24, 2020

Thanks for reporting. I have not had a chance to use cuda 11 yet, but a few things can help root-cause it:

  1. Compare the two settings using 1 machine and eventually 1 GPU. This will tell whether it is a scaling issue with horovod.
  2. If cuda 11 is slower even in the 1 GPU setting, it's possible that there is a regression in some op that Mask R-CNN uses. What I would do is bisect the model (e.g., cut the latter half of the model and return a naive loss directly) to find the ops that behave differently in the two settings. TensorFlow's profiler (https://tensorpack.readthedocs.io/modules/callbacks.html#tensorpack.callbacks.GraphProfiler) may help as well, although I didn't have a good experience with it in the past; a minimal sketch of wiring it in is shown below.

Maybe related: https://github.com/tensorpack/benchmarks/tree/master/ResNet-MultiGPU can be used to check whether there is a regression in plain ResNet training.
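
A minimal sketch of wiring the profiler into a tensorpack run; MyModel and my_dataflow below are hypothetical placeholders for the real ModelDesc and DataFlow, and in examples/FasterRCNN/train.py the equivalent change is simply appending GraphProfiler to the callback list the script already builds. With default arguments the callback dumps chrome-trace timelines into the logger directory, which can be opened in chrome://tracing to compare per-op timings between the two CUDA setups.

# Sketch only: MyModel and my_dataflow are hypothetical stand-ins for the
# ModelDesc / DataFlow that the Mask R-CNN example constructs.
from tensorpack import TrainConfig, SimpleTrainer, launch_train_with_config
from tensorpack.callbacks import GraphProfiler

config = TrainConfig(
    model=MyModel(),          # hypothetical ModelDesc
    dataflow=my_dataflow,     # hypothetical DataFlow
    callbacks=[
        GraphProfiler(),      # writes chrome-trace files into the logger directory
    ],
    max_epoch=1,
)
launch_train_with_config(config, SimpleTrainer())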

@changlan
Author

changlan commented Jul 27, 2020

Quick update: I saw a ~15% slowdown even in the 1 GPU setting. I haven't had the chance to do any detailed profiling yet. Were you able to reproduce the slowdown?

@ppwwyyxx
Collaborator

Maybe I'll be able to test cuda 11 on a 1080 soon, but it will take a long time before I'm able to access a cuda11-capable V100 machine.

I guess you were using cuda 10 with cudnn 7 but cuda 11 with cudnn 8? It might help to try cudnn 8 with cuda 10 as well, since nvidia provides that combination. Presumably, cuda and cudnn are the only two variables between the two settings.
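
To double-check which cudnn each environment actually loads at runtime (as opposed to what the TF wheel was built against), a small ctypes probe like the following should work; cudnnGetVersion is part of the public cuDNN API, and the library-name fallback is just an assumption about how cudnn is installed.

import ctypes
import ctypes.util

# Ask the libcudnn that the dynamic linker resolves (the same one TensorFlow
# would pick up) for its version number, e.g. 7605 for 7.6.5 or 8002 for 8.0.2.
libname = ctypes.util.find_library("cudnn") or "libcudnn.so"
cudnn = ctypes.CDLL(libname)
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN version:", cudnn.cudnnGetVersion())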

@changlan
Author

changlan commented Jul 27, 2020

Yes, I also tested https://github.com/tensorpack/benchmarks/tree/master/ResNet-MultiGPU and CUDA 11 + CUDNN 8 shows slightly better performance than CUDA 10 + CUDNN 7. This might suggest that the issue is related to some ops in Mask R-CNN.

@ppwwyyxx
Collaborator

ppwwyyxx commented Jul 27, 2020

The two models (plain ResNet and Mask R-CNN) use cudnn very differently. For Mask R-CNN we set TF_CUDNN_USE_AUTOTUNE=0 to avoid the long tuning time, and with tuning disabled it's very likely that different cudnn versions choose different algorithms (I've seen this in the past even across minor cudnn version bumps).

The easiest way to verify this might be to enable tuning with TF_CUDNN_USE_AUTOTUNE=1 instead of 0. However, to avoid the long tuning time you'll then need to resize all images to the same resolution with this change:

diff --git i/examples/FasterRCNN/data.py w/examples/FasterRCNN/data.py
index 35d8bd4f..eefe193f 100644
--- i/examples/FasterRCNN/data.py
+++ w/examples/FasterRCNN/data.py
@@ -73,7 +73,8 @@ class TrainingDataPreprocessor:
     def __init__(self, cfg):
         self.cfg = cfg
         self.aug = imgaug.AugmentorList([
-            CustomResize(cfg.PREPROC.TRAIN_SHORT_EDGE_SIZE, cfg.PREPROC.MAX_SIZE),
+            #CustomResize(cfg.PREPROC.TRAIN_SHORT_EDGE_SIZE, cfg.PREPROC.MAX_SIZE),
+            imgaug.Resize((800, 800)),
             imgaug.Flip(horiz=True)
         ])

I often do this when I need to benchmark the full power of the GPUs.

Then benchmark the two environments, both with cudnn tuning enabled, and see if they give similar speed. This should rule out differences in algorithm selection heuristics between cudnn versions, if any.
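
The autotune switch is just an environment variable read by TensorFlow's convolution code, so it can either be exported in the shell before launching train.py (as in the mpirun command above) or set early in Python, for example:

import os

# Must be set before TensorFlow runs its first convolution; setting it before
# the TF import is the simplest way to guarantee that. "0" reproduces the
# no-autotune setting used in the training command above.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "1"

import tensorflow as tf  # noqa: E402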

@ppwwyyxx
Collaborator

I got access to a machine with a new enough nvidia driver for cuda 11. However, TF 1.15 apparently cannot be built with cuda 11 / cudnn 8: support was only added later in tensorflow/tensorflow@28feb4d, tensorflow/tensorflow@255f590, etc.

How did you use TF 1.15 with cuda11/cudnn8? Is there a version maintained elsewhere?

@ppwwyyxx
Collaborator

ppwwyyxx commented Sep 3, 2020

I can reproduce the regression with TF2.3 on 1 GTX1080Ti.

The regression comes from cudnn8: cudnn8 + cuda10.2 and cudnn8 + cuda11 are equally slow, while cudnn7 + cuda10.2 is faster.

The regression only appears when cudnn autotune is disabled. If I apply the above patch and use MODE_MASK=False with TF_CUDNN_USE_AUTOTUNE=1, I see no regression.

So it seems cudnn8 changed some algorithm selection heuristics in a way that affects some of the convolution shapes used in this R-CNN.
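
A rough micro-benchmark along these lines can make the comparison concrete. It assumes TF 2.x eager mode on a GPU, and the shapes are illustrative placeholders rather than the exact Mask R-CNN convolutions:

import os
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"   # same setting as the training run

import time
import tensorflow as tf

# Placeholder NCHW feature map and 3x3 kernel; substitute shapes from a
# specific R-CNN layer to test it in isolation. NCHW convolutions need a GPU.
x = tf.random.normal([1, 256, 200, 336])
w = tf.random.normal([3, 3, 256, 256])

@tf.function
def conv():
    return tf.nn.conv2d(x, w, strides=1, padding="SAME", data_format="NCHW")

conv().numpy()    # warm-up: traces the function and triggers algorithm selection
start = time.time()
for _ in range(50):
    y = conv()
y.numpy()         # block until the GPU has finished the queued convolutions
print("mean conv time: %.2f ms" % ((time.time() - start) / 50 * 1000))

Running this under cudnn 7 and cudnn 8 (autotune off in both) should show whether the heuristic picks a slower algorithm for a given shape.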

@changlan
Author

changlan commented Sep 4, 2020

Thanks for the update! I confirmed that TF_CUDNN_USE_AUTOTUNE=1 brings training back to the same performance level, with no regression.

Re: TF 1.15 with cuda11, I used the tf-latest-gpu-gvnic-debian-10 VM image from GCP's Deep Learning VM, in which TF 1.15 has backported support for cu11. Unfortunately I'm not aware of other TF 1.15 distributions with cu11 support.

@ppwwyyxx
Collaborator

ppwwyyxx commented Sep 4, 2020

cuDNN v8 deprecated the old algorithm selection APIs (tensorflow/tensorflow@255f590#diff-3ddecd9a9809669183ca2750a865f73a) and the new API seems to have a regression.
Reported upstream at https://forums.developer.nvidia.com/t/cudnn8-regression-in-algorithm-selection-heuristics/153667.

@ppwwyyxx ppwwyyxx closed this as completed Sep 4, 2020
@ppwwyyxx ppwwyyxx added the upstream issue issues in other libraries label Sep 4, 2020