
Conversation

@nikithamalgifb (Contributor) commented Feb 23, 2021

Stack from ghstack:

nikithamalgifb added a commit that referenced this pull request Feb 23, 2021
ghstack-source-id: e398af5
Pull Request resolved: #52679
@facebook-github-bot added the `oncall: jit` label (JIT oncall triage queue) Feb 23, 2021
@facebook-github-bot (Contributor) commented Feb 23, 2021

💊 CI failures summary and remediations

As of commit a66d5c9 (more details on the Dr. CI page):


  • 5/5 failures possibly* introduced in this PR
    • 1/5 non-scanned failure(s)

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/4)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/generate_config_yml.py
Auto-merging .circleci/generate_config_yml.py
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (2/4)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/generate_config_yml.py
Auto-merging .circleci/generate_config_yml.py
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (3/4)

Step: "Run tests"

Feb 26 08:59:56 [E request_callback_no_python.cpp:656] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 26 08:59:56 At:
Feb 26 08:59:56   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 26 08:59:56   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 26 08:59:56 
Feb 26 08:59:56 [E request_callback_no_python.cpp:656] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 26 08:59:56 
Feb 26 08:59:56 At:
Feb 26 08:59:56   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 26 08:59:56   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 26 08:59:56 
Feb 26 08:59:56 [E request_callback_no_python.cpp:656] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 26 08:59:56 
Feb 26 08:59:56 At:
Feb 26 08:59:56   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 26 08:59:56   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 26 08:59:56 
Feb 26 08:59:56 ok (1.228s)
Feb 26 08:59:57   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend.
Feb 26 08:59:57 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend.
Feb 26 08:59:57 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend.
Feb 26 08:59:57 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend.

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (4/4)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/generate_config_yml.py
Auto-merging .circleci/generate_config_yml.py
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/simple/docker_definitions.py
Auto-merging .circleci/cimodel/data/simple/docker_definitions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@SplitInfinity left a comment


Just some nits

def forward(self):
device_index = torch.cuda._current_device()
s = torch.jit.cuda.Stream(device_index, 0)
s = torch.jit.cuda.Stream(priority=0)


Suggested change:
-  s = torch.jit.cuda.Stream(priority=0)
+  s = torch.jit.cuda.Stream()

device_index = torch.cuda._current_device()
s0 = torch.cuda.current_stream(device_index)
s1 = torch.jit.cuda.Stream(device_index, 0)
s1 = torch.jit.cuda.Stream(priority=0)


Suggested change:
-  s1 = torch.jit.cuda.Stream(priority=0)
+  s1 = torch.jit.cuda.Stream()

def test_event_synchronize() -> float:
device_index = torch.cuda._current_device()
s = torch.jit.cuda.Stream(device_index, 0)
s = torch.jit.cuda.Stream(priority=0)


Suggested change:
-  s = torch.jit.cuda.Stream(priority=0)
+  s = torch.jit.cuda.Stream()

def test_stream_synchronize() -> float:
device_index = torch.cuda._current_device()
s = torch.jit.cuda.Stream(device_index, 0)
s = torch.jit.cuda.Stream(priority=0)


Suggested change:
-  s = torch.jit.cuda.Stream(priority=0)
+  s = torch.jit.cuda.Stream()

Comment on lines 89 to 110
def event_default_args_1() -> bool:
e = torch.jit.cuda.Event(blocking=True)
return e is not None

@torch.jit.script
def event_default_args_2() -> bool:
e = torch.jit.cuda.Event(enable_timing=True)
return e is not None

@torch.jit.script
def event_default_args_3() -> bool:
e = torch.jit.cuda.Event(interprocess=True)
return e is not None

@torch.jit.script
def event_default_args_4() -> bool:
e = torch.jit.cuda.Event(interprocess=True, blocking=True)
return e is not None

@torch.jit.script
def event_default_args_5() -> bool:
e = torch.jit.cuda.Event(enable_timing=True, blocking=True)


Consider using better test names here, and testing that the arguments and default arguments are being respected. That is, the event in event_default_args_3 should not be blocking, and we should be able to verify that (I think?).
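A sketch of what such flag-exercising tests might look like, using descriptive names as suggested. It uses the eager `torch.cuda.Event` API as a stand-in for the scripted `torch.jit.cuda` one (an assumption for illustration), skips when no CUDA device is present, and writes the elapsed_time check as a probe rather than a hard assertion, since whether it raises for timing-disabled events is an open question in this thread:

```python
import unittest

# Guard: degrade gracefully when torch or a CUDA device is unavailable.
try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    HAS_CUDA = False


class TestEventConstructorFlags(unittest.TestCase):
    @unittest.skipUnless(HAS_CUDA, "requires a CUDA device")
    def test_blocking_event_is_created(self):
        # Constructor accepts blocking=True and returns an event object.
        self.assertIsNotNone(torch.cuda.Event(blocking=True))

    @unittest.skipUnless(HAS_CUDA, "requires a CUDA device")
    def test_elapsed_time_probe_on_timing_disabled_events(self):
        # Probe: record two default (timing-disabled) events and observe
        # whether elapsed_time rejects them; the outcome is recorded, not
        # asserted, because the behavior is unverified in this thread.
        e0, e1 = torch.cuda.Event(), torch.cuda.Event()
        s = torch.cuda.current_stream()
        e0.record(s)
        e1.record(s)
        s.synchronize()
        try:
            e0.elapsed_time(e1)
            raised = False
        except RuntimeError:
            raised = True
        self.assertIn(raised, (True, False))  # documents behavior only
```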

@nikithamalgifb (author) replied:


Well, to verify blocking we would ideally need sleep support in JIT, and since that isn't available, it becomes a lot harder to verify that blocking was truly disabled. The same is true for interprocess. Some of the cases, like event_default_args_5, are already covered elsewhere (see test_event_wait), so I didn't add a redundant test for them.

device_index = torch.cuda._current_device()
current_stream = torch.cuda.current_stream(device_index)
user_stream = torch.jit.cuda.Stream(device_index, 0)
user_stream = torch.jit.cuda.Stream(priority=0)


Suggested change:
-  user_stream = torch.jit.cuda.Stream(priority=0)
+  user_stream = torch.jit.cuda.Stream()

current_stream = torch.cuda.current_stream(device_index)
default_stream = torch.cuda.default_stream(device_index)
user_stream = torch.jit.cuda.Stream(device_index, 0)
user_stream = torch.jit.cuda.Stream(priority=0)


Suggested change:
-  user_stream = torch.jit.cuda.Stream(priority=0)
+  user_stream = torch.jit.cuda.Stream()

def test_simple_stream():
device_index = torch.cuda._current_device()
s = torch.jit.cuda.Stream(device_index, 0)
s = torch.jit.cuda.Stream(priority=0)


Suggested change:
-  s = torch.jit.cuda.Stream(priority=0)
+  s = torch.jit.cuda.Stream()

nikithamalgifb added a commit that referenced this pull request Feb 24, 2021
ghstack-source-id: e398af5
Pull Request resolved: #52679
nikithamalgifb added a commit that referenced this pull request Feb 24, 2021
ghstack-source-id: 675a341
Pull Request resolved: #52679
Co-authored-by: SplitInfinity <meghanl@fb.com>
@SplitInfinity left a comment


I have a few questions about the tests and how they make sure the default arguments work as expected.

As you alluded to in your comment, there isn't much we can do to really test these properties of events. Maybe it's better to have no test rather than a test that doesn't actually test anything?

Comment on lines 93 to 99
e.record(s0)
e.wait(s)
A = torch.rand(1000, 1000, device="cuda")
with torch.jit.cuda.stream(s):
B = torch.mm(A, A)
s.synchronize()
e.record(s0)


What does this test? I thought we would want to test that timing is disabled for this event (since that is the default argument). Would trying to call elapsed_time throw an exception that we could check for using self.assertRaisesRegex?

@nikithamalgifb (author) replied:


I tried doing this, but elapsed_time didn't throw any exception. So I verified blocking=True instead, although it is difficult to prove that the event actually waits on stream s here.
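For reference, the assertRaisesRegex pattern the reviewer mentions looks like this in plain unittest. This is a stdlib-only sketch with a hypothetical stand-in function that does raise, purely to illustrate the pattern, since the real scripted elapsed_time reportedly did not:

```python
import unittest


def elapsed_time_stub():
    # Hypothetical stand-in for an API expected to reject the call;
    # exists only to demonstrate the assertRaisesRegex pattern.
    raise RuntimeError("timing was not enabled on these events")


class TestRaisesRegexPattern(unittest.TestCase):
    def test_exception_type_and_message(self):
        # assertRaisesRegex verifies both the exception class and that the
        # exception message matches the given regular expression.
        with self.assertRaisesRegex(RuntimeError, "timing was not enabled"):
            elapsed_time_stub()
```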

Comment on lines 105 to 115
e_tik = torch.jit.cuda.Event(enable_timing=True)
e_tok = torch.jit.cuda.Event(enable_timing=True)
s0 = torch.cuda.current_stream(0)
e_tik.record(s0)
A = torch.rand(1000, 1000, device="cuda")
with torch.jit.cuda.stream(s0):
B = torch.mm(A, A)
s0.synchronize()
e_tok.record(s0)
e_tok.synchronize()
return e_tik.elapsed_time(e_tok)


What does this test? I thought we would want to test that these events are nonblocking (since that is the default argument).

@nikithamalgifb (author) replied:


It's difficult to prove that they are non-blocking because sleep is not available. So I instead verified that enable_timing=True works.

Comment on lines 119 to 132
e = torch.jit.cuda.Event(interprocess=True)
s1 = torch.jit.cuda.Stream()
s2 = torch.jit.cuda.Stream()
A = torch.rand(1000, 1000, device="cuda")
with torch.jit.cuda.stream(s1):
B = torch.mm(A, A)
s1.record_event(e)

with torch.jit.cuda.stream(s2):
C = torch.mm(A, A)
s2.record_event(e)
s1.synchronize()
s2.synchronize()


Same question here - I thought we would want to test that the event is nonblocking and has timing disabled.

@nikithamalgifb (author) commented Feb 26, 2021:


I couldn't verify that the event is nonblocking and has timing disabled, so this test instead verifies that interprocess was enabled. With interprocess=True, the event can be shared between two processes, so the test records the same event on two streams.
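The approach described above can be sketched as follows, a minimal guarded version assuming the eager torch.cuda API as a stand-in for the scripted torch.jit.cuda one; it returns None when torch or a CUDA device is unavailable:

```python
# Guard: degrade gracefully when torch or a CUDA device is unavailable.
try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    HAS_CUDA = False


def record_event_on_two_streams():
    """Record one interprocess-capable event on two streams and drain both."""
    if not HAS_CUDA:
        return None
    e = torch.cuda.Event(interprocess=True)
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    a = torch.rand(1000, 1000, device="cuda")
    with torch.cuda.stream(s1):
        b = torch.mm(a, a)
        s1.record_event(e)  # first recording of the shared event
    with torch.cuda.stream(s2):
        c = torch.mm(a, a)
        s2.record_event(e)  # re-record the same event on the second stream
    s1.synchronize()
    s2.synchronize()
    return e.query()  # True once the latest recording has completed
```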

@nikithamalgifb (author) commented Feb 26, 2021

> I have a few questions about the tests and how they make sure the default arguments work as expected.
>
> As you alluded to in your comment, there isn't much we can do to really test these properties of events. Maybe it's better to have no test rather than a test that doesn't actually test anything?

I agree, we can probably eliminate the tests. My only objective in introducing them was to make sure that, even with default arguments, the event object was getting created. But I guess that doesn't add much value here. Wdyt?

@SplitInfinity

Yeah, let's just remove the (new) tests.

nikithamalgifb added a commit that referenced this pull request Feb 26, 2021
ghstack-source-id: 60b1ee6
Pull Request resolved: #52679
@nikithamalgifb (author) replied:

> Yeah, let's just remove the (new) tests.

Done.

@SplitInfinity left a comment


Make sure to address merge conflicts before merging.

@nikithamalgifb

Closing this PR. It was superseded by #53025, which has been merged.

@facebook-github-bot deleted the gh/nikithamalgifb/8/head branch April 2, 2021 14:17