PyTorch support for Horovod #267

Merged
merged 27 commits into master from pytorch on May 17, 2018

Conversation

@alsrgv
Collaborator

alsrgv commented May 13, 2018

This PR brings first-class PyTorch support in Horovod.
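
For context, a minimal sketch of how the new API is meant to be used in a training script. This is illustrative only; `Net`, `train_loader`, and the learning rate are placeholders, not part of this PR:

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = Net().cuda()                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so step() averages gradients across workers.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from rank 0's initial weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for data, target in train_loader:                      # placeholder data loader
    optimizer.zero_grad()
    output = model(data.cuda())
    loss = torch.nn.functional.nll_loss(output, target.cuda())
    loss.backward()                                    # launches allreduces per gradient
    optimizer.step()                                   # waits for allreduces, then updates
```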

@alsrgv alsrgv requested review from sblotner, jeevandev and jiezhang May 13, 2018

@alsrgv alsrgv self-assigned this May 13, 2018

class _DistributedOptimizer(torch.optim.Optimizer):
# TODO (doc): make it clear that parameters are taken from optimizer,


@alsrgv

alsrgv May 13, 2018

Collaborator

Will resolve TODOs before merging.

@sblotner

Minor edits

README.md Outdated
@@ -4,8 +4,8 @@
<p align="center"><img src="https://user-images.githubusercontent.com/16640218/34506318-84d0c06c-efe0-11e7-8831-0425772ed8f2.png" alt="Logo" width="200"/></p>
Horovod is a distributed training framework for TensorFlow. The goal of Horovod is to make distributed Deep Learning
fast and easy to use.
Horovod is a distributed training framework for TensorFlow, Keras and PyTorch. The goal of Horovod is to make


@sblotner

sblotner May 13, 2018

Collaborator

Add comma after “Keras”

compilation terminated.
```
You can do this by installing a `python-dev` or `python3-dev` package. For example, on Debian or Ubuntu system:


@sblotner

sblotner May 13, 2018

Collaborator

On Debian —> on a Debian

@alsrgv


Collaborator

alsrgv commented May 14, 2018

@sblotner, could you take another look? I've added more docs in the torch/ package.

average gradient values before applying gradients to model weights.
Allreduce operations are executed after each gradient is computed by `loss.backward()`
in parallel with each other. `step()` method ensures that all allreduce operations are


@sblotner

sblotner May 14, 2018

Collaborator

'step()' --> The 'step()'

in parallel with each other. `step()` method ensures that all allreduce operations are
finished before applying gradients to the model.
DistributedOptimizer exposes `synchronize()` method which forces allreduce operations


@sblotner

sblotner May 14, 2018

Collaborator

synchronize() --> the synchronize()
Add comma after "method"
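
To make the docstring's point concrete, a hedged sketch (not code from this PR) of calling `synchronize()` explicitly when the averaged gradients need to be inspected or modified before the update; `step()` performs the same synchronization implicitly:

```python
loss.backward()           # launches one allreduce per gradient
optimizer.synchronize()   # block until every allreduce has finished
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.mul_(0.5)   # e.g. scale the already-averaged gradients
optimizer.step()          # apply the gradients to the model
```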

allreduce operations. Typically just `model.named_parameters()`.
"""
# We dynamically create a new class that inherits from the optimizer that was passed in.
# The goal is to override `step()` method with an allreduce implementation.


@sblotner

sblotner May 14, 2018

Collaborator

'step()' --> The 'step()'

"""
Broadcasts the parameters from root rank to all other processes.
Typical usage is to broadcast the `model.state_dict()`,
`model.named_parameters()` or `model.parameters()`.


@sblotner

sblotner May 14, 2018

Collaborator

Add comma before "or"
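
A hedged usage sketch of the call described in this docstring, assuming `model` has already been constructed on every worker:

```python
# Send rank 0's weights to every other worker so training starts in sync.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```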

A function that performs asynchronous averaging or summation of the input tensor
over all the Horovod processes. The input tensor is not modified.
The reduction operation is keyed by the name. If name is not provided, incremented


@sblotner

sblotner May 14, 2018

Collaborator

incremented --> an incremented
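
A hedged sketch of the asynchronous pattern this docstring describes, assuming `horovod.torch` is imported as `hvd` and Horovod has been initialized; the tensor and name are illustrative:

```python
grad = torch.ones(4)
# Kick off the averaging without blocking; the name keys the collective.
handle = hvd.allreduce_async(grad, average=True, name='example.grad')
# ... overlap other work here ...
averaged = hvd.synchronize(handle)   # block and fetch the averaged tensor
```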

A function that performs averaging or summation of the input tensor over all the
Horovod processes. The input tensor is not modified.
The reduction operation is keyed by the name. If name is not provided, incremented


@sblotner

sblotner May 14, 2018

Collaborator

incremented --> an incremented

A function that performs asynchronous in-place averaging or summation of the input
tensor over all the Horovod processes.
The reduction operation is keyed by the name. If name is not provided, incremented


@sblotner

sblotner May 14, 2018

Collaborator

incremented --> an incremented

A function that performs in-place averaging or summation of the input tensor over
all the Horovod processes.
The reduction operation is keyed by the name. If name is not provided, incremented


@sblotner

sblotner May 14, 2018

Collaborator

incremented --> an incremented

def allgather_async(tensor, name=None):
"""
A function which asynchronously concatenates the input tensor with the same input


@sblotner

sblotner May 14, 2018

Collaborator

which --> that

def allgather(tensor, name=None):
"""
A function which concatenates the input tensor with the same input tensor on


@sblotner

sblotner May 14, 2018

Collaborator

which --> that
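
A hedged sketch of `allgather`, which concatenates along the first dimension and therefore tolerates different first-dimension sizes across workers; the shapes and name here are illustrative:

```python
# Each worker contributes a different number of rows; every worker receives
# the concatenation of all of them.
local = torch.zeros(hvd.rank() + 1, 3)
gathered = hvd.allgather(local, name='example.gather')
```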

buffer_ = new char[size];
} else {
#if HAVE_CUDA
THCudaCheck(THCudaMalloc(state, (void**)&buffer_, size));


@jiezhang

jiezhang May 14, 2018

Collaborator

No destructor to release memory?


@alsrgv

alsrgv May 14, 2018

Collaborator

It was omitted on purpose. Unfortunately, by the time this destructor would be called under normal circumstances (application shutdown), the CUDA context is already destroyed, and cudaFree() operations print nasty errors in the log even in a perfectly normal termination scenario.

If we add functionality to terminate Horovod without restarting the application, we should revisit this logic. I'll add a comment with clarifications.

std::shared_ptr<Status> HandleManager::ReleaseHandle(int handle) {
std::lock_guard<std::mutex> guard(mutex_);
auto status = results_[handle];


@jiezhang

jiezhang May 14, 2018

Collaborator

Check if handle exists first.


@alsrgv

alsrgv May 14, 2018

Collaborator

Good catch!

_ffi.RTLD_GLOBAL)
__all__ = []
def _import_symbols(locals):


@jiezhang

jiezhang May 14, 2018

Collaborator

Extract into a lib to avoid code dupe?


@alsrgv

alsrgv May 14, 2018

Collaborator

This part was actually auto-generated by the cffi compiler, but I had to modify the file a little to make it do RTLD_GLOBAL loading instead of the default behavior.
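
For readers unfamiliar with the distinction, a hedged illustration at the Python level of what RTLD_GLOBAL loading means; this is not the cffi-generated code, and the library name is hypothetical:

```python
import ctypes

# Default mode: the library's symbols stay local to this shared object.
lib_local = ctypes.CDLL('libexample.so')

# RTLD_GLOBAL: exported symbols become visible to libraries loaded afterwards,
# which is what a native extension needs when another .so must resolve them.
lib_global = ctypes.CDLL('libexample.so', mode=ctypes.RTLD_GLOBAL)
```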

return synchronize(handle)
def allreduce_async_(tensor, average=True, name=None):


@jiezhang

jiezhang May 14, 2018

Collaborator

Why trailing underscore?


@alsrgv

alsrgv May 14, 2018

Collaborator

A trailing underscore is the common PyTorch naming convention for in-place operations.
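
A hedged sketch of the convention being referenced, mirroring PyTorch's own `add()` vs. `add_()`:

```python
x = torch.ones(3)
y = hvd.allreduce(x)   # out-of-place: returns a new averaged tensor, x unchanged
hvd.allreduce_(x)      # in-place: x itself now holds the averaged values
```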

"""
# We dynamically create a new class that inherits from the optimizer that was passed in.
# The goal is to override the `step()` method with an allreduce implementation.
cls = type(optimizer.__class__.__name__, (optimizer.__class__,),


@jiezhang

jiezhang May 14, 2018

Collaborator

It may be more readable to define _DistributedOptimizer here as a nested class. Creating a new class with the same name as the original optimizer seems a bit weird.


@alsrgv

alsrgv May 15, 2018

Collaborator

I think it would be better to do this as a separate change, since the Keras version has the same code structure. I'll make a note to do it later.
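
For reference, a stripped-down, hedged sketch of the `type()` pattern being discussed; this is a toy illustration of the idea, not the code in this PR, and `wrap_optimizer` / `_step_with_allreduce` are hypothetical names:

```python
import torch

def _step_with_allreduce(self, closure=None):
    # The real override would force all outstanding allreduces to finish here
    # before the base class applies the gradients.
    return super(self.__class__, self).step(closure)

def wrap_optimizer(optimizer):
    # Build a subclass of the user's optimizer class on the fly, keeping the
    # original class name so repr() and isinstance() checks stay familiar.
    cls = type(optimizer.__class__.__name__, (optimizer.__class__,),
               {'step': _step_with_allreduce})
    optimizer.__class__ = cls
    return optimizer

opt = wrap_optimizer(torch.optim.SGD([torch.nn.Parameter(torch.ones(2))], lr=0.1))
```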

template <class T>
Status TorchOpContext<T>::AllocateOutput(TensorShape shape,
std::shared_ptr<Tensor>* tensor) {
int64_t* shape_array = new int64_t[shape.dims()];


@jiezhang

jiezhang May 14, 2018

Collaborator

use vector


@alsrgv

alsrgv May 15, 2018

Collaborator

The underlying function `THTensor##_resizeNd(tensor, nDimension, size, stride)` uses an array pointer, so I'm just saving the conversion back and forth.

// See the License for the specific language governing permissions and
// limitations under the License.
// =============================================================================


@jiezhang

jiezhang May 14, 2018

Collaborator

missing include guard?


@alsrgv

alsrgv May 15, 2018

Collaborator

This and the other interface.h file are actually just used by CFFI. CFFI does its own parsing of the header file, and it's pretty simplistic in terms of the subset of C it understands. Specifically, I don't believe it understands include guards.

// See the License for the specific language governing permissions and
// limitations under the License.
// =============================================================================


@jiezhang

jiezhang May 14, 2018

Collaborator

missing include guard?

auto enqueue_result = EnqueueTensorAllreduce(
hvd_context, hvd_cpu_buffer, hvd_cpu_buffer, ready_event,
"allreduce." + name_or_handle, CPU_DEVICE_ID,


@jiezhang

jiezhang May 14, 2018

Collaborator

Can you extract "allreduce." + name_or_handle into a helper function that takes handle, name_prefix?

if handle not in _handle_map:
return
mpi_lib.horovod_torch_wait_and_clear(handle)
_, output = _handle_map[handle]


@jiezhang

jiezhang May 14, 2018

Collaborator

You can merge these two lines using pop.


@alsrgv

alsrgv May 15, 2018

Collaborator

Good catch!
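
A hedged sketch of the suggested simplification, assuming `_handle_map` is a plain dict mapping a handle to a `(tensor, output)` pair that is currently deleted later in the same function:

```python
# Instead of:
#     _, output = _handle_map[handle]
#     del _handle_map[handle]
# do the lookup and removal in one step:
_, output = _handle_map.pop(handle)
```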

alsrgv added some commits May 15, 2018

@alsrgv alsrgv merged commit 172e9fd into master May 17, 2018

2 checks passed:
continuous-integration/travis-ci/pr: The Travis CI build passed
license/cla: Contributor License Agreement is signed

@alsrgv alsrgv deleted the pytorch branch May 17, 2018
