
Improve DataChannelMPI #3817

Merged: 1 commit merged into master on Nov 21, 2017
Conversation

apaszke (Contributor) commented Nov 21, 2017

Remove unnecessary messages and make certain functions in-place.

This commit weakens error checking, but I think it's fine to make it UB for now and implement a better asynchronous mechanism later. This is much needed for achieving high performance.

This also adds support for CUDA-aware MPI implementations.
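For context, making the collectives in-place most likely follows MPI's standard MPI_IN_PLACE convention, where one buffer serves as both send and receive buffer, avoiding a staging copy. A minimal sketch of that pattern (plain MPI, not the PR's actual code; the function name is made up):

#include <mpi.h>

// In-place sum-allreduce: MPI_IN_PLACE tells MPI to reduce directly into
// `buf` instead of copying between separate send and receive buffers.
// With a CUDA-aware MPI build, `buf` may even point at GPU memory.
void allreduce_sum_inplace(float* buf, int count) {
  MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}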
soumith merged commit cf3ca13 into master on Nov 21, 2017
mjacar (Contributor) commented Nov 22, 2017

This PR breaks the build for me.

Scanning dependencies of target THD
[  4%] Building CXX object CMakeFiles/THD.dir/base/data_channels/DataChannelTCP.cpp.o
[  8%] Building CXX object CMakeFiles/THD.dir/base/init_methods/InitMethodEnv.cpp.o
[ 16%] Building CXX object CMakeFiles/THD.dir/base/TensorDescriptor.cpp.o
[ 16%] Building CXX object CMakeFiles/THD.dir/base/data_channels/DataChannelMPI.cpp.o
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘at::Tensor thd::DataChannelMPI::_newLikeFlat(std::vector<at::Tensor>&) const’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:157:57: warning: narrowing conversion of ‘((& t)->at::Tensor::is_cuda() ? (& t)->at::Tensor::get_device() : -1l)’ from ‘long int’ to ‘int’ inside { } [-Wnarrowing]
   AutoGPU gpu_guard { t.is_cuda() ? t.get_device() : -1 };
                                                         ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:158:45: warning: narrowing conversion of ‘(& tensors)->std::vector<_Tp, _Alloc>::size<at::Tensor, std::allocator<at::Tensor> >()’ from ‘std::vector<at::Tensor>::size_type {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
   std::vector<int64_t> sizes { tensors.size() };  // sizes = [output.size()] + input.sizes()
                                             ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:158:45: warning: narrowing conversion of ‘(& tensors)->std::vector<_Tp, _Alloc>::size<at::Tensor, std::allocator<at::Tensor> >()’ from ‘std::vector<at::Tensor>::size_type {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:160:10: error: ‘input’ was not declared in this scope
   return input.type().tensor(sizes);
          ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::scatter(std::vector<at::Tensor>&, at::Tensor&, thd::rank_type, THDGroup)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:233:7: error: no match for ‘operator!’ (operand type is ‘at::Tensor’)
   if (!output.contiguous())
       ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:233:7: note: candidate is:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:233:7: note: operator!(bool) <built-in>
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:233:7: note:   no known conversion for argument 1 from ‘at::Tensor’ to ‘bool’
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::reduce(at::Tensor&, THDReduceOp, thd::rank_type, THDGroup)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:281:7: error: no match for ‘operator!’ (operand type is ‘at::Tensor’)
   if (!data.contiguous())
       ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:281:7: note: candidate is:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:281:7: note: operator!(bool) <built-in>
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:281:7: note:   no known conversion for argument 1 from ‘at::Tensor’ to ‘bool’
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::broadcast(at::Tensor&, thd::rank_type, THDGroup)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:299:7: error: no match for ‘operator!’ (operand type is ‘at::Tensor’)
   if (!data.contiguous())
       ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:299:7: note: candidate is:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:299:7: note: operator!(bool) <built-in>
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:299:7: note:   no known conversion for argument 1 from ‘at::Tensor’ to ‘bool’
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::send(thd::Scalar&, thd::rank_type)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:309:56: error: request for member ‘scalarType’ in ‘(& data)->thd::Scalar::type()’, which is of non-class type ‘thd::RPCType’
   MPI_Send(data.data(), 1, mpi_datatype.at(data.type().scalarType()),
                                                        ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::receive(thd::Scalar&, thd::rank_type)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:324:17: error: ‘struct thd::Scalar’ has no member named ‘data_ptr’
   MPI_Recv(data.data_ptr(), 1, mpi_datatype.at(data.type().scalarType()),
                 ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:324:60: error: request for member ‘scalarType’ in ‘(& data)->thd::Scalar::type()’, which is of non-class type ‘thd::RPCType’
   MPI_Recv(data.data_ptr(), 1, mpi_datatype.at(data.type().scalarType()),
                                                            ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::receive(at::Tensor&, thd::rank_type)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:346:10: error: ‘status’ was not declared in this scope
   return status.MPI_SOURCE;
          ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:346:17: error: return-statement with a value, in function returning 'void' [-fpermissive]
   return status.MPI_SOURCE;
                 ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual thd::DataChannelMPI::RequestMPI* thd::DataChannelMPI::ireceive(at::Tensor&, thd::rank_type)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:384:10: error: cannot convert ‘std::unique_ptr<thd::DataChannelMPI::RequestMPI>’ to ‘thd::DataChannelMPI::RequestMPI*’ in return
   return request;
          ^
[ 20%] Building CXX object CMakeFiles/THD.dir/base/init_methods/InitMethodUtils.cpp.o
[ 24%] Building CXX object CMakeFiles/THD.dir/base/init_methods/InitMethodFile.cpp.o
CMakeFiles/THD.dir/build.make:62: recipe for target 'CMakeFiles/THD.dir/base/data_channels/DataChannelMPI.cpp.o' failed
make[2]: *** [CMakeFiles/THD.dir/base/data_channels/DataChannelMPI.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/THD.dir/all' failed
make[1]: *** [CMakeFiles/THD.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

Reverting this PR fixes the problem. I'll file an issue.
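Reading the log, these look like a handful of small API mismatches. A sketch of the likely fixes, inferred from the errors above rather than taken from the actual follow-up patch (`numel` and `mpi_type` below are placeholder names):

// at::Tensor::contiguous() returns a new Tensor, which has no operator!;
// the boolean predicate is is_contiguous():
if (!data.is_contiguous())
  throw std::logic_error("tensor has to be contiguous");

// receive(at::Tensor&, rank_type) is declared void, so it cannot return the
// sender's rank; the MPI_Status also has to be declared before MPI_Recv
// fills it in:
MPI_Status status;
MPI_Recv(data.data_ptr(), numel, mpi_type, MPI_ANY_SOURCE, 0,
         MPI_COMM_WORLD, &status);

// ireceive() returns a raw RequestMPI*, while `request` is a
// std::unique_ptr; release() hands ownership over explicitly:
return request.release();

// The narrowing warnings come from brace-initialization, which forbids
// implicit long -> int conversion; an explicit cast silences them:
AutoGPU gpu_guard { t.is_cuda() ? static_cast<int>(t.get_device()) : -1 };

// Finally, `input` in _newLikeFlat() is undeclared; presumably one of the
// entries of `tensors` was meant (an assumption, not confirmed by the log).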

miraclewkf commented

@mjacar I'm hitting the same problem. Could you tell me how to deal with it?

apaszke (Contributor, Author) commented Nov 22, 2017

For now, just reset your local copy to a point before this commit. I'll push a fix soon.
