Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] Pass on KMS-related kwargs for s3fs #35938

Merged
merged 4 commits into from
Jun 1, 2023

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented May 31, 2023

Why are these changes needed?

We currently only parse and pass limited selection of options to s3fs. One recent request was related to passing KMS settings. This PR extends the s3 uri string to allow configuration of signature version, sse, sse key ID, and ACLs in s3 URIs if s3fs is used.

This PR also changes the fs caching logic, which is a requirement for options to be parsed again, e.g. when a key ID is changed in subsequent calls. FS cache keys now include the query string, and cache items are stale after 5 minutes and re-created. As a side-effect, this should fix any problems that come with cached filesystems, e.g. expiring credentials.

Related issue number

Closes #35561
Might close #35519 (pending investigation)

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>
Comment on lines +226 to +229
option_map = {
"ServerSideEncryption": "ServerSideEncryption",
"SSEKMSKeyId": "SSEKMSKeyId",
"GrantFullControl": "GrantFullControl",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to come up with a dynamic approach to cover all possible options here without churning the APIs too much. I'd like to revisit this when we think about custom upload logic and Syncer deprecation, and for now go with manually selected properties to forward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree that we should revisit this. I was searching for other possibilities and ACL is another param that can be passed into s3_additional_kwargs. Then, GCS also has a bunch of query params that users can specify.

Another thing to note: with the current Syncer abstraction, users could technically just create an s3fs client and customize all of these settings there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, the custom Syncer doesn't affect Checkpoint.to_uri at all.

Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

I do have some questions about getting rid of the cache:

  • Should we calculate some before/after timing of the benefit this cache actually had?
  • Regarding getting rid of the cache being a "requirement for options to be parsed again": how do users actually pass in these options again? They only get to specify storage_path once at the beginning, so even if the KeyID or access_key get refreshed, the query params would still be the same as before, and we'd run into the same problem. Am I missing something here?

Comment on lines +226 to +229
option_map = {
"ServerSideEncryption": "ServerSideEncryption",
"SSEKMSKeyId": "SSEKMSKeyId",
"GrantFullControl": "GrantFullControl",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree that we should revisit this. I was searching for other possibilities and ACL is another param that can be passed into s3_additional_kwargs. Then, GCS also has a bunch of query params that users can specify.

Another thing to note: with the current Syncer abstraction, users could technically just create an s3fs client and customize all of these settings there.

@krfricke
Copy link
Contributor Author

krfricke commented Jun 1, 2023

Looks good to me!

I do have some questions about getting rid of the cache:

* Should we calculate some before/after timing of the benefit this cache actually had?

I'll do some small-scale benchmarking before merge, but I don't expect any significant drawdowns as we call get_fs_and_path relatively rarely.

* Regarding getting rid of the cache being a "requirement for options to be parsed again": how do users actually pass in these options again? They only get to specify `storage_path` once at the beginning, so even if the `KeyID` or `access_key` get refreshed, the query params would still be the same as before, and we'd run into the same problem. Am I missing something here?

It's mostly for calls to e.g. Checkpoint.to_uri() where the URI can be re-specified.

@krfricke
Copy link
Contributor Author

krfricke commented Jun 1, 2023

Before:

> time python -c "from ray.air._internal.remote_storage import get_fs_and_path; [get_fs_and_path('s3://some') for i in range(10000)]"
python -c   1.21s user 2.53s system 428% cpu 0.872 total

After:

> time python -c "from ray.air._internal.remote_storage import get_fs_and_path; [get_fs_and_path('s3://some') for i in range(10000)]"
python -c   1.23s user 2.56s system 403% cpu 0.940 total

Looks like no significant speed reduction

@krfricke
Copy link
Contributor Author

krfricke commented Jun 1, 2023

Hm, some tests are failing because we re-create the mock filesystem. It looks like we need to at least cache that.

A few options:

  1. We continue to cache and include the query parameters in the caching key
  2. We only cache the mock/memory filesystem
  3. We patch get_fs_and_path for mock/memory tests
  4. We use moto/s3 instead of mock/memory filesystems
  1. would require to use a time-based cache to address [Train] Syncer fails to write to S3 after 2 days of training #35519
  2. means we would keep test-specific code in the library, which is not ideal
  3. means boilerplate for the tests and mocking
  4. would increase runtime of the tests

From these options I believe 1 is the best. We can keep the timeout relatively small (say a minute or so).

I'll update the PR.

Kai Fricke added 3 commits June 1, 2023 14:45
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, so mock:// is something provided by pyarrow, and it gets wiped if we don't cache if for tests.

@krfricke krfricke merged commit ba31fdf into ray-project:master Jun 1, 2023
@krfricke krfricke deleted the air/remote-storage-kms branch June 1, 2023 17:17
scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
We currently only parse and pass limited selection of options to s3fs. One recent request was related to passing KMS settings. This PR extends the s3 uri string to allow configuration of signature version, sse, sse key ID, and ACLs in s3 URIs if s3fs is used.

This PR also changes the fs caching logic, which is a requirement for options to be parsed again, e.g. when a key ID is changed in subsequent calls. FS cache keys now include the query string, and cache items are stale after 5 minutes and re-created. As a side-effect, this should fix any problems that come with cached filesystems, e.g. expiring credentials.

Signed-off-by: Kai Fricke <kai@anyscale.com>
@matthewdeng matthewdeng mentioned this pull request Jul 22, 2023
8 tasks
krfricke pushed a commit that referenced this pull request Jul 25, 2023
#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported. 

Signed-off-by: Matthew Deng <matt@anyscale.com>
matthewdeng added a commit to matthewdeng/ray that referenced this pull request Jul 28, 2023
ray-project#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported. 

Signed-off-by: Matthew Deng <matt@anyscale.com>
rickyyx pushed a commit that referenced this pull request Jul 28, 2023
#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported.

Signed-off-by: Matthew Deng <matt@anyscale.com>
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
ray-project#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported.

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
ray-project#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported.

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
ray-project#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported. 

Signed-off-by: Matthew Deng <matt@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
We currently only parse and pass limited selection of options to s3fs. One recent request was related to passing KMS settings. This PR extends the s3 uri string to allow configuration of signature version, sse, sse key ID, and ACLs in s3 URIs if s3fs is used.

This PR also changes the fs caching logic, which is a requirement for options to be parsed again, e.g. when a key ID is changed in subsequent calls. FS cache keys now include the query string, and cache items are stale after 5 minutes and re-created. As a side-effect, this should fix any problems that come with cached filesystems, e.g. expiring credentials.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
ray-project#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported.

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
ray-project#35938 introduced a pyarrow typehint, though pyarrow is being lazily imported.

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants