Conversation

@mrshenli mrshenli commented Dec 3, 2019

Summary:
Note: This PR has been merged into master at efe1859 after the 1.4 branch cut (see original PR: #30217). This PR is to merge it into the 1.4 branch.

---- Original Commit Description Follows ---

Before this commit, RRefContext threw an error if it detected any
RRef leak during shutdown. However, this required the application to
ensure that it had freed all references to RRefs in application
code, which can make for a poor debugging experience in large
applications. It also relied on the Python GC to free things
up in time, which is not always the case. After this commit,
RRefContext ignores leaked RRefs during shutdown, since shutdown
is called when the application has finished training and no longer
cares about local state. Hence, it should be OK to simply ignore
those leaks and destroy the remaining OwnerRRefs. If an application
would like to enforce that there are no leaks, it can set
torch.distributed.rpc.api._ignore_rref_leak to False.
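The shutdown behavior described above can be modeled with a small pure-Python sketch. Note this is a toy stand-in for illustration only; the real RRefContext is implemented in C++ inside torch.distributed.rpc, and `ToyRRefContext` and its method names are invented here:

```python
class ToyRRefContext:
    """Toy model of the shutdown behavior described in this PR.

    With ignore_rref_leak=True (the new default behavior), leaked
    OwnerRRefs are silently destroyed at shutdown. With it set to
    False, shutdown raises instead, mirroring the pre-PR behavior.
    """

    def __init__(self, ignore_rref_leak=True):
        self.ignore_rref_leak = ignore_rref_leak
        self.owner_rrefs = {}  # rref_id -> owned value

    def create_owner_rref(self, rref_id, value):
        self.owner_rrefs[rref_id] = value

    def delete_owner_rref(self, rref_id):
        self.owner_rrefs.pop(rref_id, None)

    def shutdown(self):
        leaked = list(self.owner_rrefs)
        if leaked and not self.ignore_rref_leak:
            raise RuntimeError(
                "Leaked OwnerRRefs at shutdown: {}".format(leaked))
        # Ignore leaks: destroy any remaining OwnerRRefs and move on,
        # since the application no longer cares about local state.
        self.owner_rrefs.clear()
        return leaked
```

For example, a context created with the default setting destroys a leaked RRef quietly, while one created with `ignore_rref_leak=False` raises at shutdown.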


mrshenli commented Dec 3, 2019

Only merge after #30545 please.


@rohan-varma rohan-varma left a comment

LGTM pending tests, the GitHub actions failures are irrelevant.

Summary:
Pull Request resolved: pytorch#30217

Before this commit, RRefContext threw an error if it detected any
RRef leak during shutdown. However, this required the application to
ensure that it had freed all references to RRefs in application
code, which can make for a poor debugging experience in large
applications. It also relied on the Python GC to free things
up in time, which is not always the case. After this commit,
RRefContext ignores leaked RRefs during shutdown, since shutdown
is called when the application has finished training and no longer
cares about local state. Hence, it should be OK to simply ignore
those leaks and destroy the remaining OwnerRRefs. If an application
would like to enforce that there are no leaks, it can set
torch.distributed.rpc.api._ignore_rref_leak to False.

Test Plan: Imported from OSS

Differential Revision: D18632546

Pulled By: mrshenli

fbshipit-source-id: 2744b2401dafdd16de0e0a76cf8e07777bed0f38

mrshenli commented Dec 4, 2019

@mingbowan I keep getting merge conflicts in the following test on files I didn't touch. Do you know why?

https://app.circleci.com/jobs/github/pytorch/pytorch/3768805

It seems that this test explicitly operates on the master branch?

Merge master branch into pull/30699 before build in environment pytorch-linux-xenial-py3.6-gcc5.4-build
+ git config --global user.email circleci.ossci@gmail.com
+ git config --global user.name CircleCI
+ git config remote.origin.url https://github.com/pytorch/pytorch.git
+ git config --add remote.origin.fetch +refs/heads/master:refs/remotes/origin/master
+ git fetch --tags --progress https://github.com/pytorch/pytorch.git +refs/heads/master:refs/remotes/origin/master --depth=100 --quiet


mrshenli commented Dec 4, 2019

The same errors already exist on the v1.4.0 branch (#30684). I am merging this.

@mrshenli mrshenli merged commit 3eda9e7 into pytorch:v1.4.0 Dec 4, 2019