
PyTorch Synthetic Benchmark #545

Merged
alsrgv merged 2 commits into master from pytorch_benchmark on Oct 8, 2018

Conversation

3 participants
@alsrgv
Collaborator

alsrgv commented Oct 6, 2018

No description provided.

@alsrgv alsrgv self-assigned this Oct 6, 2018

@alsrgv alsrgv requested a review from tgaddair Oct 6, 2018

# Horovod: broadcast parameters & optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
# TODO: needs bugfix
#hvd.broadcast_optimizer_state(optimizer, root_rank=0)

@alsrgv

alsrgv Oct 6, 2018

Collaborator

@tgaddair, we should fix this bug before landing. optim.SGD w/o momentum & weight decay causes the following issue with broadcast_optimizer_state:

Traceback (most recent call last):
  File "pytorch_synthetic_benchmark.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "/usr/local/lib/python2.7/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 4592867280
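
For reference, a minimal pure-Python sketch (plain dicts, not Horovod's or PyTorch's actual code) of why the lookup fails: a freshly constructed SGD optimizer with no momentum or weight decay accumulates no per-parameter state until the first step(), so state_dict()['state'] is empty while 'param_groups' still lists every parameter id, and indexing by that id raises KeyError.

```python
# Hypothetical sketch of the failure mode, using plain dicts instead of torch.
# Before the first optimizer.step(), a stateless SGD optimizer has an empty
# 'state' dict, but 'param_groups' still lists every parameter id.
state_dict = {
    'state': {},                                  # no state accumulated yet
    'param_groups': [{'params': [4592867280]}],   # id taken from the traceback
}

for group in state_dict['param_groups']:
    for pid in group['params']:
        try:
            param_state = state_dict['state'][pid]  # mirrors line 213 above
        except KeyError as err:
            print('KeyError:', err)  # → KeyError: 4592867280
```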

@tgaddair

tgaddair Oct 8, 2018

Collaborator

Done #548.


@alsrgv alsrgv merged commit 983a06e into master Oct 8, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
license/cla Contributor License Agreement is signed.

@alsrgv alsrgv deleted the pytorch_benchmark branch Oct 8, 2018

@bapriddy

Contributor

bapriddy commented Oct 19, 2018

@alsrgv I'm getting the following error when running pytorch_synthetic_benchmark.py:

Traceback (most recent call last):
  File "syn.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 140230636548384

I get the error in parallel as well, with the same output repeated k times. For example:

aprun -n 4 -N 1 ~/miniconda3/bin/python syn.py

Gives
Traceback (most recent call last):
  File "syn.py", line 64, in <module>
Traceback (most recent call last):
  File "syn.py", line 64, in <module>
Traceback (most recent call last):
  File "syn.py", line 64, in <module>
Traceback (most recent call last):
  File "syn.py", line 64, in <module>
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
  File "~/miniconda3/lib/python3.6/site-packages/horovod/torch/__init__.py", line 213, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
    param_state = state_dict['state'][pid]
KeyError: 46913509835904
KeyError: 46913509835904
    param_state = state_dict['state'][pid]
KeyError: 46913509835904
    param_state = state_dict['state'][pid]
KeyError: 46913509835904

I've included the reference call below from horovod/torch/__init__.py:

def _create_callback(pid, name, t, p):
    def _from_tensor():
        state_dict['state'][pid][name] = t(p.numpy()[0])
    return _from_tensor

@alsrgv

Collaborator

alsrgv commented Oct 19, 2018

@bapriddy, can you comment out this line or install Horovod from master?

@bapriddy

Contributor

bapriddy commented Oct 20, 2018

Yes. I'll check it. Thanks.

@bapriddy

Contributor

bapriddy commented Oct 20, 2018

It worked. Thanks!!

@bapriddy

Contributor

bapriddy commented Oct 20, 2018

@alsrgv Awesome!!! So nice to have this. Again, Thanks!!!

@bapriddy

Contributor

bapriddy commented Oct 20, 2018

@alsrgv How does the code decide when to stop "warmup" and proceed with the test? Just curious. Also, would switching to fp16 have any effect?

@alsrgv

Collaborator

alsrgv commented Oct 22, 2018

@bapriddy, warmup runs for --num-warmup-batches, which defaults to 10. Using --fp16-allreduce should improve performance if your network is slow.
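
The warmup/measurement split can be sketched like this (a simplified stand-in for the benchmark's timing loop, not its exact code; `run_batch` is a hypothetical placeholder for one forward/backward step):

```python
import time

def benchmark(run_batch, num_warmup_batches=10, num_batches=30, batch_size=32):
    # Warmup: run a fixed number of batches so allocator growth and autotuning
    # settle before timing starts; the count comes from --num-warmup-batches.
    for _ in range(num_warmup_batches):
        run_batch()
    # Timed region: only these batches contribute to the reported img/sec.
    start = time.time()
    for _ in range(num_batches):
        run_batch()
    elapsed = time.time() - start
    return num_batches * batch_size / elapsed  # images per second
```

So warmup is purely count-based: after the fixed number of untimed batches, measurement begins unconditionally.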

@bapriddy

Contributor

bapriddy commented Oct 22, 2018

@alsrgv Is it possible to modify the pytorch_synthetic_benchmark.py for resnet18, resnet101, or other imagenet models? I did this with pytorch_imagenet_resnet50.py by changing line 114.

# Set up standard ResNet-50 model.
model = models.resnet50()

@alsrgv

Collaborator

alsrgv commented Oct 22, 2018

@bapriddy, yeah, you can just pass --model resnet101.
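
The flag works because the script constructs the model by name rather than hard-coding it; roughly this pattern (a hedged sketch, not the script verbatim):

```python
import argparse

# Hypothetical sketch of the benchmark's model-selection flag.
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default='resnet50',
                    help='model to benchmark, e.g. resnet18 / resnet101')
args = parser.parse_args(['--model', 'resnet101'])

# In the benchmark this name resolves against torchvision.models, roughly:
#   model = getattr(models, args.model)()
print(args.model)  # → resnet101
```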

@bapriddy

Contributor

bapriddy commented Oct 22, 2018

@alsrgv Got it!
