@gujinghui
Collaborator

Fix fallback issues to handle inplace case

@gujinghui
Collaborator Author

@yinghai, please help review.

Contributor

Doesn't sum support the 2-input case? If it does, we can just use fallback_sum_ to simplify things.

Collaborator Author

I'm not sure whether sum supports broadcast add...
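
For context on the exchange above, here is a minimal sketch of why a sum that requires identically shaped inputs could not simply replace an add that broadcasts. It is plain Python on nested lists, not Caffe2 code: `shape`, `strict_sum`, and `broadcast_add` are hypothetical names used only for illustration, and whether the real sum operator broadcasts is exactly what is left unconfirmed in this thread.

```python
def shape(t):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0] if t else None
    return tuple(dims)


def _add(a, b):
    """Element-wise add of two nested lists that have the same shape."""
    if isinstance(a, list):
        return [_add(x, y) for x, y in zip(a, b)]
    return a + b


def strict_sum(*tensors):
    """N-ary sum that rejects inputs whose shapes differ (no broadcasting)."""
    shapes = {shape(t) for t in tensors}
    if len(shapes) != 1:
        raise ValueError(f"shape mismatch: {sorted(shapes)}")
    result = tensors[0]
    for t in tensors[1:]:
        result = _add(result, t)
    return result


def broadcast_add(a, b):
    """Add a 1-D list b to every row of a 2-D list a."""
    if shape(a)[1:] != shape(b):
        raise ValueError(f"cannot broadcast {shape(b)} onto {shape(a)}")
    return [_add(row, b) for row in a]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    print(strict_sum(a, [[10, 20], [30, 40]]))  # same shapes: accepted
    print(broadcast_add(a, [10, 20]))           # bias repeated on each row
    try:
        strict_sum(a, [10, 20])                 # same inputs, strict sum
    except ValueError as err:
        print("strict_sum rejected the inputs:", err)
```

Under these assumptions, the 2-input case with equal shapes works with either function, but a bias-style add only works with the broadcasting one, which is the case a sum-based simplification would need to rule out.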

Contributor

@facebook-github-bot left a comment

@yinghai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@gujinghui
Collaborator Author

@yinghai
Sorry, I submitted the wrong patch earlier.
The sum fallback should be reviewed in #15267.
Sorry for the inconvenience.

Contributor

@yinghai left a comment

LGTM. Please rebase to resolve conflicts.

Change-Id: I46a58f5e07ea874bfa033fce50e92fb11e4cc591
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>

@gujinghui
Collaborator Author

@yinghai
Rebased. Please help review.

@gujinghui
Collaborator Author

@yinghai
The failure is not caused by this PR.

Jan 09 11:00:24 test_scatter_stress_cuda (__main__.ProcessGroupGlooTest) ... Process process 1:
Jan 09 11:00:24 Traceback (most recent call last):
Jan 09 11:00:24 File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
Jan 09 11:00:24 self.run()
Jan 09 11:00:24 File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
Jan 09 11:00:24 self._target(*self._args, **self._kwargs)
Jan 09 11:00:24 File "test_c10d.py", line 487, in _run
Jan 09 11:00:24 getattr(self, self.id().split(".")[2])()
Jan 09 11:00:24 File "test_c10d.py", line 453, in wrapper
Jan 09 11:00:24 fn(self)
Jan 09 11:00:24 File "test_c10d.py", line 50, in wrapper
Jan 09 11:00:24 return func(*args, **kwargs)
Jan 09 11:00:24 File "test_c10d.py", line 867, in test_scatter_stress_cuda
Jan 09 11:00:24 self._test_scatter_stress(inputs, lambda t: t.clone().cuda())
Jan 09 11:00:24 File "test_c10d.py", line 844, in _test_scatter_stress
Jan 09 11:00:24 work_handle.wait()
Jan 09 11:00:24 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1000ms for recv operation to complete
Jan 09 11:00:25 Process process 2:
Jan 09 11:00:25 Traceback (most recent call last):
Jan 09 11:00:25 File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
Jan 09 11:00:25 self.run()
Jan 09 11:00:25 File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
Jan 09 11:00:25 self._target(*self._args, **self._kwargs)
Jan 09 11:00:25 File "test_c10d.py", line 487, in _run
Jan 09 11:00:25 getattr(self, self.id().split(".")[2])()
Jan 09 11:00:25 File "test_c10d.py", line 453, in wrapper
Jan 09 11:00:25 fn(self)
Jan 09 11:00:25 File "test_c10d.py", line 50, in wrapper
Jan 09 11:00:25 return func(*args, **kwargs)
Jan 09 11:00:25 File "test_c10d.py", line 867, in test_scatter_stress_cuda
Jan 09 11:00:25 self._test_scatter_stress(inputs, lambda t: t.clone().cuda())
Jan 09 11:00:25 File "test_c10d.py", line 844, in _test_scatter_stress
Jan 09 11:00:25 work_handle.wait()
Jan 09 11:00:25 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:543] Connection closed by peer [127.0.0.1]:36555
Jan 09 11:00:25 Process process 3:
Jan 09 11:00:25 Traceback (most recent call last):
Jan 09 11:00:25 File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
Jan 09 11:00:25 self.run()
Jan 09 11:00:25 File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
Jan 09 11:00:25 self._target(*self._args, **self._kwargs)
Jan 09 11:00:25 File "test_c10d.py", line 487, in _run
Jan 09 11:00:25 getattr(self, self.id().split(".")[2])()
Jan 09 11:00:25 File "test_c10d.py", line 453, in wrapper
Jan 09 11:00:25 fn(self)
Jan 09 11:00:25 File "test_c10d.py", line 50, in wrapper
Jan 09 11:00:25 return func(*args, **kwargs)
Jan 09 11:00:25 File "test_c10d.py", line 867, in test_scatter_stress_cuda
Jan 09 11:00:25 self._test_scatter_stress(inputs, lambda t: t.clone().cuda())
Jan 09 11:00:25 File "test_c10d.py", line 844, in _test_scatter_stress
Jan 09 11:00:25 work_handle.wait()
Jan 09 11:00:25 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1000ms for recv operation to complete

Contributor

@facebook-github-bot left a comment

@yinghai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@yinghai left a comment

LGTM. Thanks!

4 participants