
Conversation

bddppq
Contributor

@bddppq bddppq commented Mar 29, 2019

No description provided.

@bddppq bddppq added the module: rocm AMD GPU support for Pytorch label Mar 29, 2019
@bddppq bddppq force-pushed the pt-gloo branch 2 times, most recently from 936a5ed to fb98e12 Compare March 31, 2019 03:23
@bddppq bddppq changed the title Enable ROCm multi-gpu with Gloo [WIP] Enable ROCm multi-gpu with Gloo Mar 31, 2019
@petrex
Contributor

petrex commented Apr 1, 2019

           Z$ZZ,$:,..                   
         :$$$ZO:O~=$8                   
        ~$$$$ZZZ:?888888D.              
      .,Z$$$ZZZZ88888888D=              
      .Z$$$ZZZ8DZ$$$ZDDDD,              
      ZZ$$Z$O$MNIMIIMIDD=.              
    ..ZZZOZZ7I??+=I+?+=~:.              
    .ZZZZON:,,:=.+:::,::~=,             
   .ZZZZO888:,:~NZ,:::~~=+++~,,...      
   ..O:=:88D=,,MDDNM?+???$7I,~..:.~.    
    .:~~++D8:,,,7OMMMMMMM=$ZII:+,=+.    
    ..::~,:::::::~=+=8NI7,$I?7?I??=,    
      .=~~7~~~~~~==NO8+I=.+$7I=:=++.    
    ....=ZZZZZZ7??+~=?I=....O$II?+++.   
    .Z$$ZZZZZOOOZZZ88DNDDDDDND?+?8=.    
   .$$$ZZOOOOOOOZZ$77ZOO8$ODNDOOOO..    
 ..$$$ZZOO88888OO$++IZZZOOZIDD88O=      
..+Z$ZOO8+~.O88O$$$$ZOZZZOO88D88+       
 ,.,ZO88=  .OO$$$$$$ZZZZZOO8I .......   
...::~?I=    Z$$$$$$$777$$ZO8..$7$ZO .. 
:..,:~I7I    Z$$$$$7$7777$$$ZOI?I7$OON. 
.,~,,~?I$    $$$$$$$$$$$7$$$$$???$I7$7I 
 :,:~??+I    $$$$$$$$Z$$$$$$Z7IIZ+$$77$,
..==+?==.    $$$$$$$ZZZ$$$$$7777?777777.
   .~.  .~Z8Z=$$$ZZZZZOZZ$$Z$$$O7$777$= 
        I77ZZZOOZOZOOOZOOZZZ$$OI$$$$$=. 
      .,$?$$$$Z888888888=.$ZZ$ZZ88OO.   
      ..77ZZZZZO888888O=. .OO8$777$=.   
      ?$I$ZZZZOO88888O+.   ..D$7$$=.    
    ..$7?7ZOO8888O8I=.       .....      
    .7$I?78O88=,..                      
    ..Z$7$O88O                          
      .OOOOOO=                    

@bddppq bddppq force-pushed the pt-gloo branch 2 times, most recently from c513064 to bbfefae Compare April 5, 2019 17:05
@bddppq
Contributor Author

bddppq commented Apr 5, 2019

@pytorchbot retest this please

@bddppq
Contributor Author

bddppq commented Apr 5, 2019

@iotamudelta I'm running into the following memory allocation issue when trying to enable multi-GPU support on ROCm:

20:14:54 test_sync_params_with_buffers (test_c10d.DistributedDataParallelTest) ... Process process 1:
20:14:55 Traceback (most recent call last):
20:14:55   File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
20:14:55     self.run()
20:14:55   File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
20:14:55     self._target(*self._args, **self._kwargs)
20:14:55   File "/var/lib/jenkins/workspace/test/test_c10d.py", line 473, in _run
20:14:55     getattr(self, self.id().split(".")[2])()
20:14:55   File "/var/lib/jenkins/workspace/test/test_c10d.py", line 437, in wrapper
20:14:55     fn(self)
20:14:55   File "/var/lib/jenkins/workspace/test/test_c10d.py", line 53, in wrapper
20:14:55     return func(*args, **kwargs)
20:14:55   File "/var/lib/jenkins/workspace/test/test_c10d.py", line 1638, in test_sync_params_with_buffers
20:14:55     target = torch.arange(10, dtype=torch.float64, device='cuda:{}'.format(devices[0])).chunk(5)
20:14:55 RuntimeError: HIP out of memory. Tried to allocate 2.00 MiB (GPU 2; 15.98 GiB total capacity; 0 bytes already allocated; 15.73 GiB free; 0 bytes cached)
20:14:59 FAIL

I suspect this is related to HIP initialization in a forked subprocess. Could you help take a look?
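
For reference, a rough sketch of the fork-then-allocate pattern the c10d tests use (devices, tensor sizes, and the worker body are illustrative, not the exact test code):

```python
import torch
import torch.multiprocessing as mp

def worker(rank, device_count):
    # The c10d tests allocate on a GPU inside a fork()ed child, which forces
    # the HIP/CUDA runtime to (re-)initialize in the child process; this is
    # where the "HIP out of memory" error surfaces despite ~16 GiB being free.
    device = 'cuda:{}'.format(rank % device_count)
    t = torch.arange(10, dtype=torch.float64, device=device).chunk(5)
    print(rank, [c.sum().item() for c in t])

if __name__ == '__main__':
    device_count = torch.cuda.device_count()  # parent queries the driver before forking
    ctx = mp.get_context('fork')              # fork, not spawn, as in test_c10d
    procs = [ctx.Process(target=worker, args=(r, device_count)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```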

@bddppq bddppq force-pushed the pt-gloo branch 4 times, most recently from d66603a to 448a54f Compare April 15, 2019 21:02
@petrex
Contributor

petrex commented Apr 22, 2019

@jithunnair-amd Can you please review the failure and triangulate to the relevant experts? Thanks.

@bddppq bddppq changed the title [WIP] Enable ROCm multi-gpu with Gloo Enable ROCm multi-gpu with Gloo May 2, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment

@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@bddppq
Contributor Author

bddppq commented May 2, 2019

I'm able to use this to do ROCm distributed training. The unit test failures are related to re-initializing ROCm after fork. Let's land this first without enabling the tests.
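
For context, the distributed training path this enables is the standard torch.distributed setup with the gloo backend; a minimal sketch (the rendezvous address, port, and toy model are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')  # placeholder rendezvous
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)                  # on ROCm builds this maps to HIP
    model = torch.nn.Linear(10, 10).cuda(rank)   # toy model for illustration
    ddp = DDP(model, device_ids=[rank])

    out = ddp(torch.randn(4, 10, device='cuda:{}'.format(rank)))
    out.sum().backward()                         # gradients are allreduced via Gloo

    dist.destroy_process_group()
```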

@jithunnair-amd
Collaborator

@bddppq We have a potential fix for the tests failing with "HIP out of memory" error. It just involves setting the "HCC_LAZYINIT=ON" environment variable for ROCm builds. It works locally, so I'd like to test it on the CI. Should I submit that change in this PR, or wait for this PR to be closed, so I can submit a different PR that builds on top of this one?
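
For reference, one way to set the variable before the HIP runtime comes up, assuming HCC reads it at first initialization (whether setting it from Python this way is early enough is an assumption; exporting it in the CI job environment is the safer route):

```python
import os

# Assumption: HCC reads HCC_LAZYINIT when the runtime first initializes,
# so it must be set before the first HIP call, i.e. before importing torch
# in a fresh process (or exported in the CI job environment).
os.environ['HCC_LAZYINIT'] = 'ON'

import torch  # noqa: E402  -- HIP initialization now happens lazily, on first use
```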

Contributor

@pietern pietern left a comment

LGTM. Where does the hipification of ProcessGroupGloo happen? We have a bunch of CUDA specifics in there that should be eligible for hipification, but I don't see its path listed in this commit.

@jithunnair-amd
Collaborator

@pietern I believe the hipification of ProcessGroupGloo happens here: https://github.com/pytorch/pytorch/blob/master/tools/amd_build/pyHIPIFY/hipify_python.py#L785

I see it hipified in my local build.
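
For reference, the hipify pass is essentially a source-to-source substitution over the tree; a simplified sketch of the idea (the real hipify_python.py carries a much larger mapping table plus include/exclude path filters):

```python
import re

# A tiny subset of the CUDA -> HIP substitutions applied during the build.
CUDA_TO_HIP = {
    'cudaMalloc': 'hipMalloc',
    'cudaMemcpy': 'hipMemcpy',
    'cudaStream_t': 'hipStream_t',
    'cudaSetDevice': 'hipSetDevice',
}

def hipify_source(text):
    # Replace every known CUDA identifier with its HIP counterpart.
    pattern = re.compile('|'.join(re.escape(k) for k in CUDA_TO_HIP))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], text)

print(hipify_source('cudaStream_t s; cudaSetDevice(0);'))
# -> hipStream_t s; hipSetDevice(0);
```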

@bddppq
Contributor Author

bddppq commented May 6, 2019

@jithunnair-amd HCC_LAZYINIT sounds like exactly what we want. Is there any downside to having it always set to ON?

Contributor

@facebook-github-bot facebook-github-bot left a comment

@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@bddppq
Contributor Author

bddppq commented May 6, 2019

Contributor

@facebook-github-bot facebook-github-bot left a comment

@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jithunnair-amd
Collaborator

@jithunnair-amd The tests failed with HCC_LAZYINIT set to ON:

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-devtoolset7-rocmrpm-centos7.5-trigger/13520/
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-trigger/28231/

I looked at the logs, but couldn't really make out where the SIGSEGV is coming from. All I see is this:

22:21:26 OK (skipped=52, expected failures=1)
22:21:28 Traceback (most recent call last):
22:21:28   File "test/run_test.py", line 435, in <module>
22:21:28     main()
22:21:28   File "test/run_test.py", line 427, in main
22:21:28     raise RuntimeError(message)
22:21:28 RuntimeError: test_nn failed! Received signal: SIGSEGV

Is there a way to get more info about where the SIGSEGV occurred?
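
One way to get a Python-level traceback at the crash site is the built-in faulthandler module; a sketch (where exactly to hook it into the test runner is a judgment call, and the native HIP/C++ side of the stack still needs gdb or a core dump):

```python
import faulthandler
import sys

# Dump the Python stack of every thread when a fatal signal (SIGSEGV,
# SIGABRT, ...) arrives, instead of the process dying silently.
faulthandler.enable(file=sys.stderr, all_threads=True)

# Equivalently, without touching the code:
#   PYTHONFAULTHANDLER=1 python test/test_nn.py
# For the native frames, run the failing test under gdb or enable core
# dumps (`ulimit -c unlimited`) and inspect the core afterwards.
```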

@bddppq
Contributor Author

bddppq commented May 7, 2019

@jithunnair-amd Let me try to create a repro for you to debug.

@facebook-github-bot
Contributor

@bddppq merged this pull request in bc53984.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 7, 2019
Summary: Pull Request resolved: pytorch/pytorch#18640

Differential Revision: D15185822

Pulled By: bddppq

fbshipit-source-id: 1b49ab3fb0f251cfc7ef3ddd62033ae0065a4ec3
@bddppq bddppq deleted the pt-gloo branch May 7, 2019 19:00