[Object Spilling] Clean up FS storage upon sigint for ray.init(). #13649

Merged: 8 commits, Jan 27, 2021

Contributor

@rkooo567 rkooo567 commented Jan 23, 2021

Why are these changes needed?

This PR cleans up filesystem (FS) spill storage when the driver receives SIGINT. It also prints cleanup progress during shutdown, so that users are not confused about why their driver does not terminate immediately.

The current output looks like this:

Removing remaining files that are spilled to /private/var/folders/b8/khqvq5l13d76pjwz49bw76ph0000gn/T/pytest-of-sangbincho/pytest-98/test_file_deleted_when_driver_0/spill.

Total number of files spilled: 21
Path: /private/var/folders/b8/khqvq5l13d76pjwz49bw76ph0000gn/T/pytest-of-sangbincho/pytest-98/test_file_deleted_when_driver_0/spill.
Removed [5 / 21]
Removed [9 / 21]
Removed [13 / 21]
Removed [17 / 21]

Removed: 21
Path: /private/var/folders/b8/khqvq5l13d76pjwz49bw76ph0000gn/T/pytest-of-sangbincho/pytest-98/test_file_deleted_when_driver_0/spill.
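
A minimal sketch of how progress output of this kind could be produced while removing spilled files. The function name `remove_spilled_files`, the `report_every` parameter, and the exact message format are assumptions for illustration, not the PR's actual implementation.

```python
import os
import tempfile


def remove_spilled_files(directory_path, report_every=4):
    """Delete every regular file in directory_path, printing progress.

    Hypothetical helper: the name, parameters, and messages are illustrative.
    Returns the number of files this call removed.
    """
    filenames = [
        name for name in os.listdir(directory_path)
        if os.path.isfile(os.path.join(directory_path, name))
    ]
    total = len(filenames)
    print(f"Total number of files spilled: {total}")
    print(f"Path: {directory_path}")

    removed = 0
    for name in filenames:
        try:
            os.remove(os.path.join(directory_path, name))
        except FileNotFoundError:
            # Another process may have deleted the file first; skip it.
            continue
        removed += 1
        # Report periodically so a slow shutdown is visibly making progress.
        if removed % report_every == 0 and removed < total:
            print(f"Removed [{removed} / {total}]")

    print(f"Removed: {removed}")
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(21):
            with open(os.path.join(tmp, f"spill-{i}"), "wb") as f:
                f.write(b"x")
        remove_spilled_files(tmp)
```

Printing only every few files keeps the output short when many objects were spilled, while still showing that shutdown has not hung.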

Note

  • It doesn't work with ray start. Making it work would complicate the design, and anyone who uses ray start is unlikely to hit the problem of objects lingering in external storage (because we already delete spilled objects when their refs go out of scope).
  • I found that Ray doesn't stop upon SIGTERM. Not sure whether that is intended.
  • S3 cleanup is not implemented yet.
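
On the SIGTERM note: in plain Python, SIGINT is turned into KeyboardInterrupt by the default handler, while SIGTERM has no Python-level handler, so a process killed by it does not run `finally` blocks or `atexit` hooks. The following is a generic sketch of routing SIGTERM through the normal interpreter shutdown path. The function names are hypothetical, and this is not a statement about how Ray itself handles signals.

```python
import signal
import sys


def exit_on_signal(signum, frame):
    """Convert a signal into SystemExit so finally blocks and atexit hooks run."""
    # 128 + signal number is the conventional shell exit status for a signal.
    sys.exit(128 + signum)


def install_sigterm_handler():
    """Register the handler. signal.signal must be called from the main thread."""
    signal.signal(signal.SIGTERM, exit_on_signal)
```

A driver that wants the same cleanup on both signals could call `install_sigterm_handler()` once at startup and keep its cleanup in a `finally` block or an `atexit` hook.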

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 changed the title [Object Spilling] Clean up FS storage upon sigint. [Object Spilling] Clean up FS storage upon sigint for ray.init(). Jan 23, 2021
@@ -243,6 +256,43 @@ def delete_spilled_objects(self, urls: List[str]):
            filename = parse_url_with_offset(url.decode()).base_url
            os.remove(os.path.join(self.directory_path, filename))

    def destroy_external_storage(self):
Contributor

Can we just delete the entire directory with one call?

Contributor Author

Agreed, this is simpler logic. My only concern is that the UX is probably bad, since we cannot guarantee that users will give us a directory used solely for object spilling. What about this? I will always create a subdirectory under the given spilling directory and delete that subdirectory directly. For example,

file_system_object_spilling_config = {
    "type": "filesystem",
    "params": {
        "directory_path": "/tmp/something"
    }
}

This will create a new directory /tmp/something/ray_spilled_objects/[files] and I will delete ray_spilled_objects so that we can avoid deleting files that are not related to object spilling.
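
A minimal sketch of the dedicated-subdirectory idea, assuming the subdirectory name `ray_spilled_objects` from the proposal above. The helper names `create_spill_dir` and `destroy_spill_dir` are hypothetical and only illustrate the approach.

```python
import os
import shutil

# Subdirectory name taken from the proposal above; helper names are hypothetical.
SPILL_SUBDIR = "ray_spilled_objects"


def create_spill_dir(directory_path):
    """Create and return a dedicated spill subdirectory under directory_path."""
    spill_dir = os.path.join(directory_path, SPILL_SUBDIR)
    os.makedirs(spill_dir, exist_ok=True)
    return spill_dir


def destroy_spill_dir(directory_path):
    """Remove only the dedicated subdirectory, leaving the user's files alone."""
    shutil.rmtree(os.path.join(directory_path, SPILL_SUBDIR), ignore_errors=True)
```

Because cleanup targets only the subdirectory that the spilling code created, unrelated files in the user-supplied `directory_path` are never touched.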

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 24, 2021
Contributor

ericl commented Jan 24, 2021 via email

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
Contributor Author

It is simplified.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
Contributor Author

@ericl I found another race condition with the delete-the-directory approach. If IO workers are still deleting files while the directory deletion runs, it throws an exception and fails. To avoid that, I added some extra logic. If you prefer the previous solution, since this one adds complexity too, let me know.
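
A minimal sketch of one way to tolerate this kind of race, assuming the failure shows up as a `FileNotFoundError` (or another `OSError`) raised by `shutil.rmtree` when entries change during the walk. The function name and retry parameters are hypothetical, and the PR's actual logic may differ.

```python
import os
import shutil
import time


def remove_dir_tolerating_races(path, max_attempts=5, delay_s=0.1):
    """Recursively remove path, retrying if other workers change it mid-walk.

    Returns True once path no longer exists, False if attempts run out.
    """
    for _ in range(max_attempts):
        try:
            shutil.rmtree(path)
        except FileNotFoundError:
            # Another worker deleted a file (or the directory) concurrently.
            pass
        except OSError:
            # For example, a worker wrote a new file during the walk.
            time.sleep(delay_s)
        if not os.path.exists(path):
            return True
    return False
```

Checking `os.path.exists` after each attempt makes the function idempotent: it also returns True if the directory was already gone.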

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
                # deleting the file at the same time.
                pass
            except Exception:
                print("There were unexpected errors while deleting "
Contributor

@ericl ericl Jan 26, 2021

Suggested change
- print("There were unexpected errors while deleting "
+ logger.exception("Error cleaning up spill files")

Contributor Author

We still want to display the traceback, right?

Contributor Author

Let me know if you'd like me to remove the traceback messages.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
            except Exception:
                logger.exception("Error cleaning up spill files\n"
                                 f"Directory path: {self.directory_path}\n"
                                 f"Traceback: {traceback.format_exc()}")
Contributor

You don't need to include the traceback; logger.exception already does this.
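
A small self-contained demonstration of that point: `logging.Logger.exception` logs at ERROR level and appends the active exception's traceback, so formatting it by hand is redundant. The logger name, the helper names, and the simulated failure are made up for this example.

```python
import io
import logging


def cleanup_with_logging(logger, action):
    """Run action; on failure, log the message plus traceback via logger.exception."""
    try:
        action()
    except Exception:
        # No traceback.format_exc() needed: logger.exception adds exc_info itself.
        logger.exception("Error cleaning up spill files")


def _failing_action():
    # Simulated failure, for demonstration only.
    raise OSError("simulated failure")


def demo():
    """Return the text a stream handler receives for the logged error."""
    stream = io.StringIO()
    logger = logging.getLogger("spill_cleanup_demo")
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.propagate = False
    try:
        cleanup_with_logging(logger, _failing_action)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


if __name__ == "__main__":
    print(demo())
```

The captured text contains the message followed by a standard "Traceback (most recent call last):" block ending in the exception type and message.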

Contributor Author

Oh I see. Didn't know about that. I will fix this and merge the PR.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2021
@rkooo567 rkooo567 merged commit d2963f4 into ray-project:master Jan 27, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
…y-project#13649)

* Initial iteration done.

* Remove unnecessary messages.

* Addressed code review.

* Addressed code review.

* fix issues.

* addressed code review.

* Addressed the last code review.
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021