
Performance regression when training MaskRCNN with CUDA 11 #1458

Closed
changlan opened this issue Jun 24, 2020 · 9 comments
Labels
upstream issue issues in other libraries

Comments

@changlan

changlan commented Jun 24, 2020

It seems that training Mask R-CNN with CUDA 11 underperforms. However, I did not see this issue when training other models without tensorpack (e.g. the official ResNet). I am actually not sure whether this is a tensorpack issue or not (and if so, what the root cause would be).

1. What you did:

(1) If you're using examples, what's the command you run:

mpirun -np 64 -hostfile HOSTFILE -mca plm_rsh_no_tree_spawn 1 --allow-run-as-root -bind-to socket -map-by slot -x TF_CPP_MIN_LOG_LEVEL=0 -x NCCL_SOCKET_IFNAME=^lo,docker0 -x TF_CUDNN_USE_AUTOTUNE=0 -x CUDA_VISIBLE_DEVICES=0,1,3,2,7,6,4,5 -x NCCL_DEBUG=INFO -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 /opt/conda/bin/python tensorpack/examples/FasterRCNN/train.py --config BACKBONE.WEIGHTS=ImageNet-R50-AlignPadding.npz DATA.BASEDIR=coco TRAINER=horovod TRAIN.EVAL_PERIOD=0 TRAIN.LR_SCHEDULE="[32000, 32000, 32000]"

(2) If you're using examples, have you made any changes to the examples? Paste git status; git diff here:

N/A

2. What you observed:

The training throughput is lower than with CUDA 10. On 8 servers with 8 V100 GPUs each (64 GPUs total), the throughput with CUDA 11 was ~180 samples/sec, while the throughput with CUDA 10 was ~270 samples/sec.

3. What you expected, if not obvious.

Higher throughput expected.

4. Your environment:

Paste the output of this command: python -c 'import tensorpack.tfutils as u; print(u.collect_env_info())'
If this command failed, tell us your version of Python/TF/tensorpack.

sys.platform          linux
Python                3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:57:50) [GCC 7.5.0]
Tensorpack            v0.10.1-9-g9c1b1b7b-dirty
Numpy                 1.19.0
TensorFlow            1.15.3/v1.15.3-1-geff0eb3b3c
TF Compiler Version   5.4.0 20160609
TF CUDA support       True
TF MKL support        False
TF XLA support        False
Nvidia Driver         /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.36.06
CUDA                  /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0.171
CUDNN                 /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so
NCCL                  /usr/local/nccl2/lib/libnccl.so.2.7.5
CUDA_VISIBLE_DEVICES  Unspecified
GPU 0,1,2,3,4,5,6,7   Tesla V100-SXM2-16GB
Free RAM              553.92/614.16 GB
CPU Count             96
Horovod               0.19.5
cv2                   3.4.2
msgpack               1.0.0
python-prctl          False
@ppwwyyxx
Collaborator

ppwwyyxx commented Jun 24, 2020

Thanks for reporting. I have not had a chance to use cuda 11 yet, but a few things can help root-cause it:

  1. Compare the two settings using 1 machine and eventually 1 GPU. This will tell whether it is a scaling issue with horovod.
  2. If cuda 11 is slower even in the 1 GPU setting, it's possible that there is a regression in some op that Mask R-CNN uses. What I would do is bisect the model (e.g., cut the latter half of the model and return a naive loss directly) to find the ops that behave differently in the two settings. TensorFlow's profiler (https://tensorpack.readthedocs.io/modules/callbacks.html#tensorpack.callbacks.GraphProfiler) may help as well, although I didn't have a good experience with it in the past; a minimal sketch of wiring it in is shown below.

Maybe related: https://github.com/tensorpack/benchmarks/tree/master/ResNet-MultiGPU can be used to check whether there is a regression in plain ResNet training.
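
A minimal sketch of wiring the profiler into a tensorpack run; MyModel and my_dataflow below are hypothetical placeholders for the real ModelDesc and DataFlow, and in examples/FasterRCNN/train.py the equivalent change is simply appending GraphProfiler to the callback list the script already builds. With default arguments the callback dumps chrome-trace timelines into the logger directory, which can be opened in chrome://tracing to compare per-op timings between the two CUDA setups.

# Sketch only: MyModel and my_dataflow are hypothetical stand-ins for the
# ModelDesc / DataFlow that the Mask R-CNN example constructs.
from tensorpack import TrainConfig, SimpleTrainer, launch_train_with_config
from tensorpack.callbacks import GraphProfiler

config = TrainConfig(
    model=MyModel(),          # hypothetical ModelDesc
    dataflow=my_dataflow,     # hypothetical DataFlow
    callbacks=[
        GraphProfiler(),      # writes chrome-trace files into the logger directory
    ],
    max_epoch=1,
)
launch_train_with_config(config, SimpleTrainer())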

@changlan
Author

changlan commented Jul 27, 2020

Quick update: I saw a ~15% slowdown even in the 1 GPU setting. I haven't had the chance to do any detailed profiling yet. Were you able to reproduce the slowdown?

@ppwwyyxx
Collaborator

Maybe I'll be able to test cuda 11 on a 1080 soon, but it will take a long time before I'm able to access a cuda11-capable V100 machine.

I guess you were using cuda 10 with cudnn 7 but cuda 11 with cudnn 8? It might help to try cudnn 8 with cuda 10 as well, since nvidia provides that combination. Presumably, cuda and cudnn are the only two variables between the two settings.
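
To double-check which cudnn each environment actually loads at runtime (as opposed to what the TF wheel was built against), a small ctypes probe like the following should work; cudnnGetVersion is part of the public cuDNN API, and the library-name fallback is just an assumption about how cudnn is installed.

import ctypes
import ctypes.util

# Ask the libcudnn that the dynamic linker resolves (the same one TensorFlow
# would pick up) for its version number, e.g. 7605 for 7.6.5 or 8002 for 8.0.2.
libname = ctypes.util.find_library("cudnn") or "libcudnn.so"
cudnn = ctypes.CDLL(libname)
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN version:", cudnn.cudnnGetVersion())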

@changlan
Author

changlan commented Jul 27, 2020

Yes, I also tested https://github.com/tensorpack/benchmarks/tree/master/ResNet-MultiGPU and CUDA 11 + CUDNN 8 shows slightly better performance than CUDA 10 + CUDNN 7. This might suggest that the issue is related to some ops in Mask R-CNN.

@ppwwyyxx
Collaborator

ppwwyyxx commented Jul 27, 2020

The two models (plain ResNet and Mask R-CNN) use cudnn very differently. For Mask R-CNN we set TF_CUDNN_USE_AUTOTUNE=0 to avoid the long tuning time, and with tuning disabled it's very likely that different cudnn versions choose different algorithms (I've seen this in the past even across minor cudnn version bumps).

The easiest way to verify this might be to enable tuning with TF_CUDNN_USE_AUTOTUNE=1 instead of 0. However, to avoid the long tuning time you'll then need to resize all images to the same resolution with this change:

diff --git i/examples/FasterRCNN/data.py w/examples/FasterRCNN/data.py
index 35d8bd4f..eefe193f 100644
--- i/examples/FasterRCNN/data.py
+++ w/examples/FasterRCNN/data.py
@@ -73,7 +73,8 @@ class TrainingDataPreprocessor:
     def __init__(self, cfg):
         self.cfg = cfg
         self.aug = imgaug.AugmentorList([
-            CustomResize(cfg.PREPROC.TRAIN_SHORT_EDGE_SIZE, cfg.PREPROC.MAX_SIZE),
+            #CustomResize(cfg.PREPROC.TRAIN_SHORT_EDGE_SIZE, cfg.PREPROC.MAX_SIZE),
+            imgaug.Resize((800, 800)),
             imgaug.Flip(horiz=True)
         ])

I often do this when I need to benchmark the full power of the GPUs.

Then benchmark the two environments, both with cudnn tuning enabled, and see if they give similar speed. This should rule out differences in algorithm selection heuristics between cudnn versions, if any.
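
The autotune switch is just an environment variable read by TensorFlow's convolution code, so it can either be exported in the shell before launching train.py (as in the mpirun command above) or set early in Python, for example:

import os

# Must be set before TensorFlow runs its first convolution; setting it before
# the TF import is the simplest way to guarantee that. "0" reproduces the
# no-autotune setting used in the training command above.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "1"

import tensorflow as tf  # noqa: E402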

@ppwwyyxx
Collaborator

I got access to a machine with a new enough nvidia driver for cuda 11. However, TF 1.15 apparently cannot be built with cuda 11 / cudnn 8: support was only added later in tensorflow/tensorflow@28feb4d, tensorflow/tensorflow@255f590, etc.

How did you use TF 1.15 with cuda11/cudnn8? Is there a version maintained elsewhere?

@ppwwyyxx
Collaborator

ppwwyyxx commented Sep 3, 2020

I can reproduce the regression with TF2.3 on 1 GTX1080Ti.

The regression comes from cudnn8: cudnn8 + cuda10.2 and cudnn8 + cuda11 are equally slow, while cudnn7 + cuda10.2 is faster.

The regression only appears when cudnn autotune is disabled. If I apply the above patch and use MODE_MASK=False with TF_CUDNN_USE_AUTOTUNE=1, I see no regression.

So it seems cudnn8 changed some algorithm selection heuristics in a way that affects some of the convolution shapes used in this R-CNN.
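
A rough micro-benchmark along these lines can make the comparison concrete. It assumes TF 2.x eager mode on a GPU, and the shapes are illustrative placeholders rather than the exact Mask R-CNN convolutions:

import os
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"   # same setting as the training run

import time
import tensorflow as tf

# Placeholder NCHW feature map and 3x3 kernel; substitute shapes from a
# specific R-CNN layer to test it in isolation. NCHW convolutions need a GPU.
x = tf.random.normal([1, 256, 200, 336])
w = tf.random.normal([3, 3, 256, 256])

@tf.function
def conv():
    return tf.nn.conv2d(x, w, strides=1, padding="SAME", data_format="NCHW")

conv().numpy()    # warm-up: traces the function and triggers algorithm selection
start = time.time()
for _ in range(50):
    y = conv()
y.numpy()         # block until the GPU has finished the queued convolutions
print("mean conv time: %.2f ms" % ((time.time() - start) / 50 * 1000))

Running this under cudnn 7 and cudnn 8 (autotune off in both) should show whether the heuristic picks a slower algorithm for a given shape.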

@changlan
Author

changlan commented Sep 4, 2020

Thanks for the update! I confirmed that TF_CUDNN_USE_AUTOTUNE=1 brings training back to the same performance level, with no regression.

Re: TF 1.15 with cuda11, I used the tf-latest-gpu-gvnic-debian-10 VM image from GCP's Deep Learning VM, in which TF 1.15 has backported support for cu11. Unfortunately I'm not aware of other TF 1.15 distributions with cu11 support.

@ppwwyyxx
Collaborator

ppwwyyxx commented Sep 4, 2020

cuDNN v8 deprecated the old algorithm selection APIs (tensorflow/tensorflow@255f590#diff-3ddecd9a9809669183ca2750a865f73a) and the new API seems to have a regression.
Reported upstream at https://forums.developer.nvidia.com/t/cudnn8-regression-in-algorithm-selection-heuristics/153667.

@ppwwyyxx ppwwyyxx closed this as completed Sep 4, 2020
@ppwwyyxx ppwwyyxx added the upstream issue issues in other libraries label Sep 4, 2020