kshitij12345
Collaborator

Reference: #54261

@facebook-github-bot
Contributor

facebook-github-bot commented May 28, 2021

💊 CI failures summary and remediations

As of commit c08c481 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

May 29 11:19:16 RuntimeError: tensorflow/compil...OK() (Unknown: Could not start gRPC server vs. OK)
May 29 11:19:16   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 314, in _setup_replication
May 29 11:19:16     device = xm.xla_device()
May 29 11:19:16   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 232, in xla_device
May 29 11:19:16     devkind=devkind if devkind is not None else None)
May 29 11:19:16   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 137, in get_xla_supported_devices
May 29 11:19:16     xla_devices = _DEVICES.value
May 29 11:19:16   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/utils/utils.py", line 32, in value
May 29 11:19:16     self._value = self._gen_fn()
May 29 11:19:16   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 19, in <lambda>
May 29 11:19:16     _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
May 29 11:19:16 RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Unknown: Could not start gRPC server vs. OK)
May 29 11:19:16 Traceback (most recent call last):
May 29 11:19:16   File "/var/lib/jenkins/workspace/xla/test/test_mp_all_to_all.py", line 34, in <module>
May 29 11:19:16     xmp.spawn(_mp_fn, args=())
May 29 11:19:16   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.9-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
May 29 11:19:16     start_method=start_method)
May 29 11:19:16   File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
May 29 11:19:16     while not context.join():
May 29 11:19:16   File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 144, in join
May 29 11:19:16     exit_code=exitcode
May 29 11:19:16 torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 17

This comment was automatically generated by Dr. CI.

('true_divide', (S, S, S), (uniform_scalar(0.1),), 'scalar_broadcast_rhs', (True,)),
('true_divide', (), (uniform_scalar(0.1),), 'scalar_broadcast_lhs', (True,)),
('true_divide', torch.rand(S, S, S) + 1e-1, (3.14,), 'constant', (True,)),
('true_divide', uniform_scalar(1e-1, requires_grad=True), (3.14,), 'scalar_constant', (True,)),
Collaborator Author

Cases covered by binary_pwise,

scalar = 3.14 + 3.14j if dtype.is_complex else (3.14 if dtype.is_floating_point else 3)
scalar = 1 if dtype is torch.bool else scalar
tests_list = [
((S, S, S), (S, S, S), False),
((S, S, S), (S, S), False),
((), (), False),
((S, S, S), (), False),
((S, S, S), scalar, False),
((), scalar, False)
]
tests_with_lhs_broadcasting = [
((S, S), (S, S, S), True),
((), (S, S, S), True),
((S, 1, S), (M, S), True),
]
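
For readers skimming the list above, the third element of each tuple appears to mark whether the left-hand operand itself gets broadcast. A minimal sketch of that check in plain Python follows; the values of `S` and `M` and the helper names `broadcast_shape` and `lhs_is_broadcast` are assumptions for illustration, not code from this PR, and the scalar cases are left out:

```python
from itertools import zip_longest

# Assumed sizes; the thread does not state the actual values of S and M.
S, M = 5, 10


def broadcast_shape(lhs, rhs):
    """Return the broadcast shape of two shapes (NumPy-style rules)."""
    out = []
    # Walk dimensions right to left, padding the shorter shape with 1s.
    for a, b in zip_longest(reversed(lhs), reversed(rhs), fillvalue=1):
        if a != b and 1 not in (a, b):
            raise ValueError(f"shapes {lhs} and {rhs} do not broadcast")
        out.append(b if a == 1 else a)
    return tuple(reversed(out))


def lhs_is_broadcast(lhs, rhs):
    """True when the left-hand shape differs from the broadcast result."""
    return broadcast_shape(lhs, rhs) != tuple(lhs)


# Tensor-vs-tensor cases from the lists above, with their expected flags.
cases = [
    ((S, S, S), (S, S, S), False),
    ((S, S, S), (S, S), False),
    ((), (), False),
    ((S, S, S), (), False),
    ((S, S), (S, S, S), True),
    ((), (S, S, S), True),
    ((S, 1, S), (M, S), True),
]

for lhs, rhs, flag in cases:
    assert lhs_is_broadcast(lhs, rhs) == flag
```

Under these assumptions, every flag in the two lists matches what the shape arithmetic gives.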

@kshitij12345 kshitij12345 requested a review from krshrimali May 28, 2021 17:04
@kshitij12345 kshitij12345 marked this pull request as ready for review May 28, 2021 17:05
@kshitij12345 kshitij12345 requested a review from mruberry May 28, 2021 17:05
@mruberry
Collaborator

Another NNC issue:

======================================================================
ERROR [0.366s]: test_unsupported_true_divide_cpu_float32 (__main__.TestNNCOpInfoCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 297, in instantiated_test
    raise rte
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 292, in instantiated_test
    result = test_fn(self, *args)
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 266, in test_wrapper
    return test(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 729, in only_fn
    return fn(slf, device, *args, **kwargs)
  File "test_jit_fuser_te.py", line 2001, in test_unsupported
    raise RuntimeError("Expected test to fail. If it now works, move op into works_list")
RuntimeError: Expected test to fail. If it now works, move op into works_list

cc @Chillee

Also we should rename that test to be more specific, like "test_nnc_unsupported"

And can we move it into test_ops.py? That way people only have to run one test file when adding an OpInfo. The jit support tests are also in test_ops.py, for example.
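
For context on the error above, the "expected to fail" pattern can be sketched roughly as follows. This is a simplified illustration, not the code in test_jit_fuser_te.py; `check_unsupported`, `run_op`, and the return strings are hypothetical names:

```python
def check_unsupported(op_name, run_op, works_list):
    """Raise if an op that is not in works_list runs without error."""
    if op_name in works_list:
        # Ops known to work are covered by a different test.
        return "skipped"
    try:
        run_op()
    except Exception:
        # The op is still unsupported, which is what this test expects.
        return "failed as expected"
    # Reaching this point means the op ran cleanly, so the list is stale.
    raise RuntimeError(
        "Expected test to fail. If it now works, move op into works_list"
    )


def broken_op():
    raise NotImplementedError("not supported yet")


def working_op():
    return 1.0


print(check_unsupported("broken", broken_op, works_list=[]))
```

Once an op starts working, the check itself fails until the op is added to `works_list`, which seems to be what happened here with true_divide.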

real = torch.tensor(torch.finfo(float_dtype).eps, device=device, dtype=dtype)
imag = torch.tensor(torch.finfo(float_dtype).eps, device=device, dtype=dtype)
replace_with = torch.complex(real, imag)
float_eps = torch.tensor(torch.finfo(float_dtype).eps, device=device, dtype=float_dtype)
Collaborator

Cool
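
A rough plain-Python analogue of the snippet under review, assuming its purpose is to swap zeros for machine epsilon so that a divisor is never exactly zero. The thread does not say this outright, and `exclude_zeros` is a hypothetical helper, not a PyTorch function:

```python
import sys


def exclude_zeros(values):
    """Replace float or complex zeros with machine epsilon."""
    eps = sys.float_info.epsilon
    out = []
    for v in values:
        if v == 0:
            # Complex zeros get epsilon in both the real and imaginary parts.
            out.append(complex(eps, eps) if isinstance(v, complex) else eps)
        else:
            out.append(v)
    return out


divisors = exclude_zeros([0j, 1 + 2j, 0.0, 3.0])
```

The complex replacement is built from two real-valued epsilons, which is roughly what `torch.complex` needs: its `real` and `imag` arguments must be floating-point tensors, not complex ones.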

Collaborator

@mruberry mruberry left a comment

Nice work @kshitij12345. This needs a fix for the NNC test and a small comment update.

I also left a couple of notes for @Chillee to move that test into test_ops.py (if possible), improve its naming, and (these notes are elsewhere) add metadata to OpInfos for whether they support NNC or not. That should make it easier to work with that test. But we shouldn't do any of those things in this PR.

Is div next?

@Chillee
Collaborator

Chillee commented May 29, 2021

I also left a couple of notes for @Chillee to move that test into test_ops.py (if possible), improve its naming, and (these notes are elsewhere) add metadata to OpInfos for whether they support NNC or not. That should make it easier to work with that test. But we shouldn't do any of those things in this PR.

These changes sound fine to me - the main consideration is just maximizing the ease of adding new OpInfos + maintainability of these OpInfos.

I'd be glad to put up a PR for it, although I'm PTO the next 2 weeks, so it's possible I won't get to it (unless I do it now... maybe). Feel free to move the code to where you'd like.

Contributor

@krshrimali krshrimali left a comment

LGTM, @kshitij12345!

@facebook-github-bot
Contributor

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry
Collaborator

I also left a couple of notes for @Chillee to move that test into test_ops.py (if possible), improve its naming, and (these notes are elsewhere) add metadata to OpInfos for whether they support NNC or not. That should make it easier to work with that test. But we shouldn't do any of those things in this PR.

These changes sound fine to me - the main consideration is just maximizing the ease of adding new OpInfos + maintainability of these OpInfos.

I'd be glad to put up a PR for it, although I'm PTO the next 2 weeks, so it's possible I won't get to it (unless I do it now... maybe). Feel free to move the code to where you'd like.

Enjoy your PTO! I filed #59185 (comment) to track this

@facebook-github-bot
Contributor

@mruberry merged this pull request in ea465f7.

facebook-github-bot pushed a commit that referenced this pull request Jun 1, 2021
Summary:
Reference: #54261

Depends on: #59154

Pull Request resolved: #59173

Reviewed By: ngimel

Differential Revision: D28785178

Pulled By: mruberry

fbshipit-source-id: 902310f2d77e499a2355a23b2d5a8c0b21b8c5bb
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Reference: pytorch#54261

Pull Request resolved: pytorch#59154

Reviewed By: ngimel

Differential Revision: D28780115

Pulled By: mruberry

fbshipit-source-id: 91e254698597fa0c7d4df6053ec017a85e180304
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Reference: pytorch#54261

Depends on: pytorch#59154

Pull Request resolved: pytorch#59173

Reviewed By: ngimel

Differential Revision: D28785178

Pulled By: mruberry

fbshipit-source-id: 902310f2d77e499a2355a23b2d5a8c0b21b8c5bb