[rpc] special case tensor type check when getting RRef #33582

Closed
wants to merge 4 commits into from

@wanchaol wanchaol commented Feb 20, 2020

Stack from ghstack:

Differential Revision: D20009837

dr-ci bot commented Feb 20, 2020

💊 CircleCI build failures summary and remediations

As of commit 44f99e4:

None of the build failures appear to be your fault.

  • 1/1 recognized as flaky ❄️
    • Re-run these jobs?

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

❄️ 1 failure recognized as flaky

The following build failures have been detected as flaky and may not be your fault:

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_gcc7_test (1/1)

Step: "Test" (full log | pattern match details) ❄️

Feb 27 01:02:52 ConnectionResetError: [Errno 104] Connection reset by peer
Feb 27 01:02:52   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 455, in accept 
Feb 27 01:02:52     deliver_challenge(c, self._authkey) 
Feb 27 01:02:52   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 722, in deliver_challenge 
Feb 27 01:02:52     response = connection.recv_bytes(256)        # reject large message 
Feb 27 01:02:52   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes 
Feb 27 01:02:52     buf = self._recv_bytes(maxlength) 
Feb 27 01:02:52   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes 
Feb 27 01:02:52     buf = self._recv(4) 
Feb 27 01:02:52   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv 
Feb 27 01:02:52     chunk = read(handle, remaining) 
Feb 27 01:02:52 ConnectionResetError: [Errno 104] Connection reset by peer 
Feb 27 01:02:52 /opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown 
Feb 27 01:02:52   len(cache)) 
Feb 27 01:02:55 Process ErrorTrackingProcess-122: 
Feb 27 01:02:55 Traceback (most recent call last): 
Feb 27 01:02:55   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap 
Feb 27 01:02:55     self.run() 
Feb 27 01:02:55   File "/var/lib/jenkins/workspace/test/test_dataloader.py", line 333, in run 
Feb 27 01:02:55     super(ErrorTrackingProcess, self).run() 
Feb 27 01:02:55   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run 
Feb 27 01:02:55     self._target(*self._args, **self._kwargs) 

This comment has been revised 8 times.

@zhaojuanmao zhaojuanmao left a comment

There are real test failures; please see the inline comments as well.

      TORCH_INTERNAL_ASSERT(ownerRRef->type()->isSubtypeOf(TensorType::get()));
    } else {
      TORCH_INTERNAL_ASSERT(ownerRRef->type() == type);
    }
    return ownerRRef;
  } else {
    return createUserRRef(ownerId, rrefId, forkId, type);
Contributor

What about the createUserRRef case, when the type does not match exactly?

Contributor Author

IIRC, createUserRRef does not try to look up an existing UserRRef in the RRefContext, so there is no type-matching problem on that path.
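The point above can be illustrated with a small, self-contained sketch. It is only a toy model: `ToyRRef`, `ToyRRefContext`, `addOwner`, and `getOrCreate` are hypothetical names, types are plain strings, and the owner-side check uses strict equality purely to show where a mismatch could surface. It is not the real RRefContext code.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical, simplified stand-in for an RRef: only its role and a type name.
struct ToyRRef {
  bool isOwner;
  std::string type;
};

// Hypothetical, simplified stand-in for the context that tracks owner RRefs.
class ToyRRefContext {
 public:
  explicit ToyRRefContext(int selfId) : selfId_(selfId) {}

  // Register an owner-side RRef with the type recorded when it was created.
  void addOwner(int rrefId, const std::string& type) {
    owners_[rrefId] = std::make_shared<ToyRRef>(ToyRRef{true, type});
  }

  std::shared_ptr<ToyRRef> getOrCreate(int ownerId, int rrefId,
                                       const std::string& type) {
    if (ownerId == selfId_) {
      // Owner path: an existing object is looked up, so its stored type
      // is compared against the requested type and a mismatch is possible.
      auto it = owners_.find(rrefId);
      if (it == owners_.end()) {
        throw std::out_of_range("unknown rrefId");
      }
      if (it->second->type != type) {
        throw std::logic_error("owner type does not match requested type");
      }
      return it->second;
    }
    // User path: a fresh object is created with the requested type;
    // nothing pre-existing is consulted, so nothing can mismatch here.
    return std::make_shared<ToyRRef>(ToyRRef{false, type});
  }

 private:
  int selfId_;
  std::map<int, std::shared_ptr<ToyRRef>> owners_;
};
```

In this model only the owner path can fail on a type mismatch, which is why the special case in the diff sits in that branch and not around createUserRRef.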

Contributor

I mean: would there be an issue if one user created a UserRRef whose type is a subtype of TensorType::get(), then shared that UserRRef with another user, and the other user created a UserRRef here with plain TensorType::get()? IIUC, the OwnerRRef will have the subtype of TensorType::get().

So, based on the above, some UserRRefs will have slightly different types. I'm wondering whether this will be an issue.

Contributor Author

A specialized tensor type (SubTensorType here) is always a subtype of the plain TensorType. The only reason this is failing is that we have a type-equality assertion here; I didn't see this in other places. For the case you described, the forked UserRRef will hold the plain TensorType, which is subtype-compatible with the SubTensorType, so it should be safe. In fact, we can only get into the case you described when we first run the ScriptFunction locally and then call rpc.remote on the same ScriptFunction remotely. But when we run the ScriptFunction remotely, we shouldn't preserve the specialized SubTensorType information (because that is information we got from the local run). So I think the fix here should be enough.
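As a rough illustration of why relaxing the equality assertion to a subtype check is enough, here is a minimal sketch with toy types. `ToyType`, `plainTensor`, and `ownerTypeMatches` are hypothetical names, not the real JIT type classes. The branch condition (requested type equals the plain tensor type) is an assumption, since the quoted diff fragment does not show it.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a JIT type; NOT the real c10 type hierarchy.
struct ToyType {
  std::string kind;         // e.g. "Tensor" or "int"
  bool specialized;         // true when extra shape information is attached
  std::vector<long> shape;  // meaningful only when specialized is true

  ToyType(std::string k, bool s, std::vector<long> sh)
      : kind(std::move(k)), specialized(s), shape(std::move(sh)) {}

  bool operator==(const ToyType& other) const {
    return kind == other.kind && specialized == other.specialized &&
           shape == other.shape;
  }

  // Every type is a subtype of itself; in addition, any tensor type
  // (specialized or not) is a subtype of the plain tensor type.
  bool isSubtypeOf(const ToyType& other) const {
    if (*this == other) {
      return true;
    }
    return kind == "Tensor" && other.kind == "Tensor" && !other.specialized;
  }
};

// The plain, unspecialized tensor type.
inline ToyType plainTensor() { return ToyType("Tensor", false, {}); }

// Mirrors the structure of the two assertions in the diff:
// tensors get a subtype check, everything else keeps strict equality.
// ASSUMPTION: the special case is selected when the requested type is
// the plain tensor type.
inline bool ownerTypeMatches(const ToyType& ownerType,
                             const ToyType& requested) {
  if (requested == plainTensor()) {
    return ownerType.isSubtypeOf(plainTensor());
  }
  return ownerType == requested;
}
```

With this model, an owner holding a specialized tensor type passes when the caller asks for the plain tensor type, although strict equality between the two would fail. Non-tensor types still need an exact match.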

Contributor

Thanks for explaining! Sounds good. Would you please also add this explanation as a code comment?

@facebook-github-bot

@wanchaol merged this pull request in 4dad00b.

hczhu pushed a commit that referenced this pull request Feb 28, 2020
Summary: Pull Request resolved: #33582

Test Plan: Imported from OSS

Differential Revision: D20009837

Pulled By: wanchaol

fbshipit-source-id: 7e9ab87d4dddb822c7575891a2b620eff83bfa00
@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/85/head branch March 1, 2020 15:18
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary: Pull Request resolved: pytorch#33582

Test Plan: Imported from OSS

Differential Revision: D20009837

Pulled By: wanchaol

fbshipit-source-id: 7e9ab87d4dddb822c7575891a2b620eff83bfa00