
Conversation

bddppq
Contributor

@bddppq bddppq commented Mar 29, 2019

No description provided.

@bddppq bddppq added the module: rocm AMD GPU support for Pytorch label Mar 29, 2019
@bddppq bddppq force-pushed the pt-gloo branch 2 times, most recently from 936a5ed to fb98e12 Compare March 31, 2019 03:23
@bddppq bddppq changed the title Enable ROCm multi-gpu with Gloo [WIP] Enable ROCm multi-gpu with Gloo Mar 31, 2019
@petrex
Contributor

petrex commented Apr 1, 2019

           Z$ZZ,$:,..                   
         :$$$ZO:O~=$8                   
        ~$$$$ZZZ:?888888D.              
      .,Z$$$ZZZZ88888888D=              
      .Z$$$ZZZ8DZ$$$ZDDDD,              
      ZZ$$Z$O$MNIMIIMIDD=.              
    ..ZZZOZZ7I??+=I+?+=~:.              
    .ZZZZON:,,:=.+:::,::~=,             
   .ZZZZO888:,:~NZ,:::~~=+++~,,...      
   ..O:=:88D=,,MDDNM?+???$7I,~..:.~.    
    .:~~++D8:,,,7OMMMMMMM=$ZII:+,=+.    
    ..::~,:::::::~=+=8NI7,$I?7?I??=,    
      .=~~7~~~~~~==NO8+I=.+$7I=:=++.    
    ....=ZZZZZZ7??+~=?I=....O$II?+++.   
    .Z$$ZZZZZOOOZZZ88DNDDDDDND?+?8=.    
   .$$$ZZOOOOOOOZZ$77ZOO8$ODNDOOOO..    
 ..$$$ZZOO88888OO$++IZZZOOZIDD88O=      
..+Z$ZOO8+~.O88O$$$$ZOZZZOO88D88+       
 ,.,ZO88=  .OO$$$$$$ZZZZZOO8I .......   
...::~?I=    Z$$$$$$$777$$ZO8..$7$ZO .. 
:..,:~I7I    Z$$$$$7$7777$$$ZOI?I7$OON. 
.,~,,~?I$    $$$$$$$$$$$7$$$$$???$I7$7I 
 :,:~??+I    $$$$$$$$Z$$$$$$Z7IIZ+$$77$,
..==+?==.    $$$$$$$ZZZ$$$$$7777?777777.
   .~.  .~Z8Z=$$$ZZZZZOZZ$$Z$$$O7$777$= 
        I77ZZZOOZOZOOOZOOZZZ$$OI$$$$$=. 
      .,$?$$$$Z888888888=.$ZZ$ZZ88OO.   
      ..77ZZZZZO888888O=. .OO8$777$=.   
      ?$I$ZZZZOO88888O+.   ..D$7$$=.    
    ..$7?7ZOO8888O8I=.       .....      
    .7$I?78O88=,..                      
    ..Z$7$O88O                          
      .OOOOOO=                    

@bddppq bddppq force-pushed the pt-gloo branch 2 times, most recently from c513064 to bbfefae Compare April 5, 2019 17:05
@bddppq
Contributor Author

bddppq commented Apr 5, 2019

@pytorchbot retest this please

@bddppq
Contributor Author

bddppq commented Apr 5, 2019

@iotamudelta I'm running into the following memory allocation issue when trying to enable multi-GPU support on ROCm:

20:14:54 test_sync_params_with_buffers (test_c10d.DistributedDataParallelTest) ... Process process 1:
20:14:55 Traceback (most recent call last):
20:14:55   File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
20:14:55     self.run()
20:14:55   File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
20:14:55     self._target(*self._args, **self._kwargs)
20:14:55   File "/var/lib/jenkins/workspace/test/test_c10d.py", line 473, in _run
20:14:55     getattr(self, self.id().split(".")[2])()
20:14:55   File "/var/lib/jenkins/workspace/test/test_c10d.py", line 437, in wrapper
20:14:55     fn(self)
20:14:55   File "/var/lib/jenkins/workspace/test/test_c10d.py", line 53, in wrapper
20:14:55     return func(*args, **kwargs)
20:14:55   File "/var/lib/jenkins/workspace/test/test_c10d.py", line 1638, in test_sync_params_with_buffers
20:14:55     target = torch.arange(10, dtype=torch.float64, device='cuda:{}'.format(devices[0])).chunk(5)
20:14:55 RuntimeError: HIP out of memory. Tried to allocate 2.00 MiB (GPU 2; 15.98 GiB total capacity; 0 bytes already allocated; 15.73 GiB free; 0 bytes cached)
20:14:59 FAIL

I suspect this is related to HIP initialization in a forked subprocess. Could you help take a look?
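
For reference, a rough sketch of the fork-then-allocate pattern the c10d tests use (devices, tensor sizes, and the worker body are illustrative, not the exact test code):

```python
import torch
import torch.multiprocessing as mp

def worker(rank, device_count):
    # The c10d tests allocate on a GPU inside a fork()ed child, which forces
    # the HIP/CUDA runtime to (re-)initialize in the child process; this is
    # where the "HIP out of memory" error surfaces despite ~16 GiB being free.
    device = 'cuda:{}'.format(rank % device_count)
    t = torch.arange(10, dtype=torch.float64, device=device).chunk(5)
    print(rank, [c.sum().item() for c in t])

if __name__ == '__main__':
    device_count = torch.cuda.device_count()  # parent queries the driver before forking
    ctx = mp.get_context('fork')              # fork, not spawn, as in test_c10d
    procs = [ctx.Process(target=worker, args=(r, device_count)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```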

@bddppq bddppq force-pushed the pt-gloo branch 4 times, most recently from d66603a to 448a54f Compare April 15, 2019 21:02
@petrex
Contributor

petrex commented Apr 22, 2019

@jithunnair-amd Can you please review the failure and triangulate to the relevant experts? Thanks.

@bddppq bddppq changed the title [WIP] Enable ROCm multi-gpu with Gloo Enable ROCm multi-gpu with Gloo May 2, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment

@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@bddppq
Contributor Author

bddppq commented May 2, 2019

I'm able to use this to do ROCm distributed training. The unit test failures are related to re-initializing ROCm after fork. Let's land this first without enabling the tests.
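
For context, the distributed training path this enables is the standard torch.distributed setup with the gloo backend; a minimal sketch (the rendezvous address, port, and toy model are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')  # placeholder rendezvous
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)                  # on ROCm builds this maps to HIP
    model = torch.nn.Linear(10, 10).cuda(rank)   # toy model for illustration
    ddp = DDP(model, device_ids=[rank])

    out = ddp(torch.randn(4, 10, device='cuda:{}'.format(rank)))
    out.sum().backward()                         # gradients are allreduced via Gloo

    dist.destroy_process_group()
```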

@jithunnair-amd
Collaborator

@bddppq We have a potential fix for the tests failing with "HIP out of memory" error. It just involves setting the "HCC_LAZYINIT=ON" environment variable for ROCm builds. It works locally, so I'd like to test it on the CI. Should I submit that change in this PR, or wait for this PR to be closed, so I can submit a different PR that builds on top of this one?
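
For reference, one way to set the variable before the HIP runtime comes up, assuming HCC reads it at first initialization (whether setting it from Python this way is early enough is an assumption; exporting it in the CI job environment is the safer route):

```python
import os

# Assumption: HCC reads HCC_LAZYINIT when the runtime first initializes,
# so it must be set before the first HIP call, i.e. before importing torch
# in a fresh process (or exported in the CI job environment).
os.environ['HCC_LAZYINIT'] = 'ON'

import torch  # noqa: E402  -- HIP initialization now happens lazily, on first use
```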

Contributor

@pietern pietern left a comment

LGTM. Where does the hipification of ProcessGroupGloo happen? We have a bunch of CUDA specifics in there that should be eligible for hipification, but I don't see its path listed in this commit.

@jithunnair-amd
Collaborator

@pietern I believe the hipification of ProcessGroupGloo happens here: https://github.com/pytorch/pytorch/blob/master/tools/amd_build/pyHIPIFY/hipify_python.py#L785

I see it hipified in my local build.
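
For reference, the hipify pass is essentially a source-to-source substitution over the tree; a simplified sketch of the idea (the real hipify_python.py carries a much larger mapping table plus include/exclude path filters):

```python
import re

# A tiny subset of the CUDA -> HIP substitutions applied during the build.
CUDA_TO_HIP = {
    'cudaMalloc': 'hipMalloc',
    'cudaMemcpy': 'hipMemcpy',
    'cudaStream_t': 'hipStream_t',
    'cudaSetDevice': 'hipSetDevice',
}

def hipify_source(text):
    # Replace every known CUDA identifier with its HIP counterpart.
    pattern = re.compile('|'.join(re.escape(k) for k in CUDA_TO_HIP))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], text)

print(hipify_source('cudaStream_t s; cudaSetDevice(0);'))
# -> hipStream_t s; hipSetDevice(0);
```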

@bddppq
Contributor Author

bddppq commented May 6, 2019

@jithunnair-amd HCC_LAZYINIT sounds like exactly what we want. Is there any downside to having it always set to ON?

Contributor

@facebook-github-bot facebook-github-bot left a comment

@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@bddppq
Contributor Author

bddppq commented May 6, 2019

Contributor

@facebook-github-bot facebook-github-bot left a comment

@bddppq has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jithunnair-amd
Collaborator

@jithunnair-amd The tests failed with HCC_LAZYINIT set to ON:

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-devtoolset7-rocmrpm-centos7.5-trigger/13520/
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-trigger/28231/

I looked at the logs, but couldn't really make out where the SIGSEGV is coming from. All I see is this:

22:21:26 OK (skipped=52, expected failures=1)
22:21:28 Traceback (most recent call last):
22:21:28   File "test/run_test.py", line 435, in <module>
22:21:28     main()
22:21:28   File "test/run_test.py", line 427, in main
22:21:28     raise RuntimeError(message)
22:21:28 RuntimeError: test_nn failed! Received signal: SIGSEGV

Is there a way to get more info about where the SIGSEGV occurred?
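
One way to get a Python-level traceback at the crash site is the built-in faulthandler module; a sketch (where exactly to hook it into the test runner is a judgment call, and the native HIP/C++ side of the stack still needs gdb or a core dump):

```python
import faulthandler
import sys

# Dump the Python stack of every thread when a fatal signal (SIGSEGV,
# SIGABRT, ...) arrives, instead of the process dying silently.
faulthandler.enable(file=sys.stderr, all_threads=True)

# Equivalently, without touching the code:
#   PYTHONFAULTHANDLER=1 python test/test_nn.py
# For the native frames, run the failing test under gdb or enable core
# dumps (`ulimit -c unlimited`) and inspect the core afterwards.
```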

@bddppq
Contributor Author

bddppq commented May 7, 2019

@jithunnair-amd Let me try to create a repro for you to debug.

@facebook-github-bot
Contributor

@bddppq merged this pull request in bc53984.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 7, 2019
Summary: Pull Request resolved: pytorch/pytorch#18640

Differential Revision: D15185822

Pulled By: bddppq

fbshipit-source-id: 1b49ab3fb0f251cfc7ef3ddd62033ae0065a4ec3
@bddppq bddppq deleted the pt-gloo branch May 7, 2019 19:00