
Improve DataChannelMPI #3817

Merged: 1 commit merged into master on Nov 21, 2017
Conversation

apaszke (Contributor) commented Nov 21, 2017

Remove unnecessary messages and make certain functions in-place.

This commit weakens error checking, but I think it's fine to make it UB for now and implement a better asynchronous mechanism later. This is much needed for achieving high performance.

This also adds support for CUDA-aware MPI implementations.
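For context, making the collectives in-place most likely follows MPI's standard MPI_IN_PLACE convention, where one buffer serves as both send and receive buffer, avoiding a staging copy. A minimal sketch of that pattern (plain MPI, not the PR's actual code; the function name is made up):

#include <mpi.h>

// In-place sum-allreduce: MPI_IN_PLACE tells MPI to reduce directly into
// `buf` instead of copying between separate send and receive buffers.
// With a CUDA-aware MPI build, `buf` may even point at GPU memory.
void allreduce_sum_inplace(float* buf, int count) {
  MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}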
soumith merged commit cf3ca13 into master on Nov 21, 2017
mjacar (Contributor) commented Nov 22, 2017

This PR breaks the build for me.

Scanning dependencies of target THD
[  4%] Building CXX object CMakeFiles/THD.dir/base/data_channels/DataChannelTCP.cpp.o
[  8%] Building CXX object CMakeFiles/THD.dir/base/init_methods/InitMethodEnv.cpp.o
[ 16%] Building CXX object CMakeFiles/THD.dir/base/TensorDescriptor.cpp.o
[ 16%] Building CXX object CMakeFiles/THD.dir/base/data_channels/DataChannelMPI.cpp.o
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘at::Tensor thd::DataChannelMPI::_newLikeFlat(std::vector<at::Tensor>&) const’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:157:57: warning: narrowing conversion of ‘((& t)->at::Tensor::is_cuda() ? (& t)->at::Tensor::get_device() : -1l)’ from ‘long int’ to ‘int’ inside { } [-Wnarrowing]
   AutoGPU gpu_guard { t.is_cuda() ? t.get_device() : -1 };
                                                         ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:158:45: warning: narrowing conversion of ‘(& tensors)->std::vector<_Tp, _Alloc>::size<at::Tensor, std::allocator<at::Tensor> >()’ from ‘std::vector<at::Tensor>::size_type {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
   std::vector<int64_t> sizes { tensors.size() };  // sizes = [output.size()] + input.sizes()
                                             ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:158:45: warning: narrowing conversion of ‘(& tensors)->std::vector<_Tp, _Alloc>::size<at::Tensor, std::allocator<at::Tensor> >()’ from ‘std::vector<at::Tensor>::size_type {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:160:10: error: ‘input’ was not declared in this scope
   return input.type().tensor(sizes);
          ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::scatter(std::vector<at::Tensor>&, at::Tensor&, thd::rank_type, THDGroup)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:233:7: error: no match for ‘operator!’ (operand type is ‘at::Tensor’)
   if (!output.contiguous())
       ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:233:7: note: candidate is:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:233:7: note: operator!(bool) <built-in>
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:233:7: note:   no known conversion for argument 1 from ‘at::Tensor’ to ‘bool’
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::reduce(at::Tensor&, THDReduceOp, thd::rank_type, THDGroup)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:281:7: error: no match for ‘operator!’ (operand type is ‘at::Tensor’)
   if (!data.contiguous())
       ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:281:7: note: candidate is:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:281:7: note: operator!(bool) <built-in>
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:281:7: note:   no known conversion for argument 1 from ‘at::Tensor’ to ‘bool’
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::broadcast(at::Tensor&, thd::rank_type, THDGroup)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:299:7: error: no match for ‘operator!’ (operand type is ‘at::Tensor’)
   if (!data.contiguous())
       ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:299:7: note: candidate is:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:299:7: note: operator!(bool) <built-in>
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:299:7: note:   no known conversion for argument 1 from ‘at::Tensor’ to ‘bool’
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::send(thd::Scalar&, thd::rank_type)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:309:56: error: request for member ‘scalarType’ in ‘(& data)->thd::Scalar::type()’, which is of non-class type ‘thd::RPCType’
   MPI_Send(data.data(), 1, mpi_datatype.at(data.type().scalarType()),
                                                        ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::receive(thd::Scalar&, thd::rank_type)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:324:17: error: ‘struct thd::Scalar’ has no member named ‘data_ptr’
   MPI_Recv(data.data_ptr(), 1, mpi_datatype.at(data.type().scalarType()),
                 ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:324:60: error: request for member ‘scalarType’ in ‘(& data)->thd::Scalar::type()’, which is of non-class type ‘thd::RPCType’
   MPI_Recv(data.data_ptr(), 1, mpi_datatype.at(data.type().scalarType()),
                                                            ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual void thd::DataChannelMPI::receive(at::Tensor&, thd::rank_type)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:346:10: error: ‘status’ was not declared in this scope
   return status.MPI_SOURCE;
          ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:346:17: error: return-statement with a value, in function returning 'void' [-fpermissive]
   return status.MPI_SOURCE;
                 ^
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp: In member function ‘virtual thd::DataChannelMPI::RequestMPI* thd::DataChannelMPI::ireceive(at::Tensor&, thd::rank_type)’:
/home/michael/pytorch/torch/lib/THD/base/data_channels/DataChannelMPI.cpp:384:10: error: cannot convert ‘std::unique_ptr<thd::DataChannelMPI::RequestMPI>’ to ‘thd::DataChannelMPI::RequestMPI*’ in return
   return request;
          ^
[ 20%] Building CXX object CMakeFiles/THD.dir/base/init_methods/InitMethodUtils.cpp.o
[ 24%] Building CXX object CMakeFiles/THD.dir/base/init_methods/InitMethodFile.cpp.o
CMakeFiles/THD.dir/build.make:62: recipe for target 'CMakeFiles/THD.dir/base/data_channels/DataChannelMPI.cpp.o' failed
make[2]: *** [CMakeFiles/THD.dir/base/data_channels/DataChannelMPI.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/THD.dir/all' failed
make[1]: *** [CMakeFiles/THD.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

Reverting this PR fixes the problem. I'll file an issue.
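Reading the log, these look like a handful of small API mismatches. A sketch of the likely fixes, inferred from the errors above rather than taken from the actual follow-up patch (`numel` and `mpi_type` below are placeholder names):

// at::Tensor::contiguous() returns a new Tensor, which has no operator!;
// the boolean predicate is is_contiguous():
if (!data.is_contiguous())
  throw std::logic_error("tensor has to be contiguous");

// receive(at::Tensor&, rank_type) is declared void, so it cannot return the
// sender's rank; the MPI_Status also has to be declared before MPI_Recv
// fills it in:
MPI_Status status;
MPI_Recv(data.data_ptr(), numel, mpi_type, MPI_ANY_SOURCE, 0,
         MPI_COMM_WORLD, &status);

// ireceive() returns a raw RequestMPI*, while `request` is a
// std::unique_ptr; release() hands ownership over explicitly:
return request.release();

// The narrowing warnings come from brace-initialization, which forbids
// implicit long -> int conversion; an explicit cast silences them:
AutoGPU gpu_guard { t.is_cuda() ? static_cast<int>(t.get_device()) : -1 };

// Finally, `input` in _newLikeFlat() is undeclared; presumably one of the
// entries of `tensors` was meant (an assumption, not confirmed by the log).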

miraclewkf commented

@mjacar I'm hitting the same problem. Could you tell me how to deal with it?

apaszke (Contributor, Author) commented Nov 22, 2017

For now, just reset your local copy to a point before this commit. I'll push a fix soon.
