-
Notifications
You must be signed in to change notification settings - Fork 25.6k
nightly robustness fixes for linking across devices #43771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #43771 +/- ##
=========================================
Coverage ? 69.34%
=========================================
Files ? 378
Lines ? 46698
Branches ? 0
=========================================
Hits ? 32381
Misses ? 14317
Partials ? 0 Continue to review full report at Codecov.
|
tools/nightly.py
Outdated
if platform.startswith("win"): | ||
return False | ||
src = os.path.join(source_dir, "__nightly_test__") | ||
trg = os.path.join(target_dir, "__nightky_test__") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nightky?
Can you briefly describe how you tested this fixed the bug in question? Also, another common fix for this cross-link problem is to just put the tmp files in the same directory as where the eventual location will be; see also https://rachelbythebay.com/w/2020/08/11/files/ |
I tested it locally as did @lezcano (who originally reported the issue).
It is true, but I though this was a more reasonable solution given the existing architecture and that most of the files in the conda package will not actually be linked, so in the event of a failure they will be cleaned up by the OS automatically, whereas if the tmpdir is in the target dir then you are relying on Python to clean them up. Basically, this seems more robust overall and less likely to get the entirety of the nightly pytorch accidentally committed to the repo. |
tools/nightly.py
Outdated
src = os.path.join(source_dir, "__nightly_test__") | ||
trg = os.path.join(target_dir, "__nightky_test__") | ||
try: | ||
with open(src, "w"): | ||
pass | ||
os.link(src, trg) | ||
linkable = True | ||
except OSError: | ||
linkable = False | ||
finally: | ||
if os.path.isfile(trg): | ||
os.remove(trg) | ||
if os.path.isfile(src): | ||
os.remove(src) | ||
return linkable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, wouldn't using os.stat
be a much simpler solution?
src = os.path.join(source_dir, "__nightly_test__") | |
trg = os.path.join(target_dir, "__nightky_test__") | |
try: | |
with open(src, "w"): | |
pass | |
os.link(src, trg) | |
linkable = True | |
except OSError: | |
linkable = False | |
finally: | |
if os.path.isfile(trg): | |
os.remove(trg) | |
if os.path.isfile(src): | |
os.remove(src) | |
return linkable | |
try: | |
# Hard linking is possible between two folders on the same device | |
return os.stat(source_dir).st_dev == os.stat(target_dir).st_devsrc = os.path.join(source_dir, "__nightly_test__") | |
except OSError: | |
# hard-linking between non-existing/non-accessible directories is not possible | |
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for taking a long time to get back to this. I don't believe that using stat to test the device is semantically correct here. We want to test if we can link, so the the way to do that is by trying to link a temporary file. The implementation details about which devices allow linking to which other devices isn't a strong enough guarantee that linking can happen, though it may be necessary based on the system
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. On the other hand, existing implementation does not, strictly speaking, answer the question whether files between two folders can be linked, because it will return false if user is not allowed to create files in source folder or if __nightly_test__
already exists in source folder or if __nightky_test__
exists in destination folder.
Considering all that, I think ideal approach would be to just try hard_link and if it fails to copy as suggested in https://github.com/pytorch/pytorch/pull/43771/files#r479539719
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider simpler implementation of _can_link
, or perhaps just fall back to try: except:
between link_file
and copy_file
tools/nightly.py
Outdated
if _can_link(platform, source_dir, target_dir): | ||
_link_files(listing, source_dir, target_dir) | ||
else: | ||
_copy_files(listing, source_dir, target_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't try: except:
paradigm be better than if: then:
in this case?
if _can_link(platform, source_dir, target_dir): | |
_link_files(listing, source_dir, target_dir) | |
else: | |
_copy_files(listing, source_dir, target_dir) | |
try:: | |
_link_files(listing, source_dir, target_dir) | |
except OSError: | |
_copy_files(listing, source_dir, target_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try is in the _can_link()
function. There is nothing particularly exceptional about choosing which function should be used to transfer files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach eliminate the need of _can_link
function completely, isn't it?
@malfet are you satisfied by the response here, or is there more stuff we have to do? It would be good to move this along. |
tools/nightly.py
Outdated
if _can_link(platform, source_dir, target_dir): | ||
_link_files(listing, source_dir, target_dir) | ||
else: | ||
_copy_files(listing, source_dir, target_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach eliminate the need of _can_link
function completely, isn't it?
tools/nightly.py
Outdated
src = os.path.join(source_dir, "__nightly_test__") | ||
trg = os.path.join(target_dir, "__nightky_test__") | ||
try: | ||
with open(src, "w"): | ||
pass | ||
os.link(src, trg) | ||
linkable = True | ||
except OSError: | ||
linkable = False | ||
finally: | ||
if os.path.isfile(trg): | ||
os.remove(trg) | ||
if os.path.isfile(src): | ||
os.remove(src) | ||
return linkable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. On the other hand, existing implementation does not, strictly speaking, answer the question whether files between two folders can be linked, because it will return false if user is not allowed to create files in source folder or if __nightly_test__
already exists in source folder or if __nightky_test__
exists in destination folder.
Considering all that, I think ideal approach would be to just try hard_link and if it fails to copy as suggested in https://github.com/pytorch/pytorch/pull/43771/files#r479539719
OK, here is a version that works with try-except and removes the |
💊 CI failures summary and remediationsAs of commit 677b919 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 1 ongoing upstream failure:These were probably caused by upstream breakages that are not fixed yet:
🚧 1 fixed upstream failure:These were probably caused by upstream breakages that were already fixed.
Please rebase on the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes #43761
CC @rgommers @ezyang