Skip to content

[Data] Fix credential drop and IMDS herd in download expression#62894

Closed
xyuzh wants to merge 2 commits into
ray-project:masterfrom
xyuzh:fix/download-credentials-and-imds-herd
Closed

[Data] Fix credential drop and IMDS herd in download expression#62894
xyuzh wants to merge 2 commits into
ray-project:masterfrom
xyuzh:fix/download-credentials-and-imds-herd

Conversation

@xyuzh
Copy link
Copy Markdown
Member

@xyuzh xyuzh commented Apr 23, 2026

Summary

Fixes two independent bugs in the download expression pipeline that compound under concurrent S3 workloads on EC2.

Bug 1 — Silent credential drop (obstore path)

When a user passes an fsspec S3FileSystem with Okta/STS/profile-based auth (wrapped as PyFileSystem(FSSpecHandler(s3fs_fs))), _extract_credentials_from_filesystem read only static attrs (key/secret/token, storage_options). Those attrs are None for session-backed s3fs — credentials live on fs.session.get_credentials() and may rotate. The function returned {}, obstore's from_url got no keys, and obstore silently fell back to its own credential chain (IMDS on EC2). Users never saw a warning.

Bug 2 — IMDS thundering herd (threaded fallback)

download_bytes_threaded spawned 16 worker threads via make_async_gen. Each worker independently called _resolve_paths_and_filesystem(uri, cached_fs=None) on its first URI, constructing a fresh pyarrow.fs.S3FileSystem → triggering an IMDS credential fetch. With N concurrent Download tasks on a node that was ~16×N near-simultaneous IMDS HTTP calls — enough to trip the per-instance rate limit and produce intermittent NoCredentialsError.

Changes

  • _extract_credentials_from_filesystem now returns Optional[Dict]. None signals "route to threaded" so the user's filesystem stays authoritative. New _frozen_s3fs_credentials helper snapshots via session.get_credentials().get_frozen_credentials() for both sync (botocore) and async (aiobotocore) sessions. Unrecognized non-None filesystems also return None per the new contract.
  • _plan_obstore_routing(fs) -> (use_obstore, fs_kwargs) centralizes dispatch with a one-shot WARNING per filesystem when credentials can't be extracted.
  • plan_download_op pre-decides obstore-vs-threaded at plan time and routes AsyncPartitionActor accordingly — no silent downgrade.
  • download_bytes_threaded resolves the filesystem exactly once per (block, column) before make_async_gen and shares it across all 16 workers via closure default-arg binding. Probe loop requires both paths and fs to be usable before breaking, avoiding leak-through when _resolve_paths_and_filesystem silently drops a bad URI.
  • download_bytes_async extracts credentials once in sync context and threads kwargs through to _download_uris_with_obstore. Fixes a correctness issue where aiobotocore's async get_credentials couldn't run from inside an existing event loop.
  • AsyncPartitionActor fails closed with a clear RuntimeError if it's ever constructed with an unextractable filesystem (dispatch-bug guard).

Test plan

  • pre-commit run passes on all changed files (ruff, pydoclint, black, docstyle, semgrep, import order, mock-method / logger checks).
  • Added unit tests in test_obstore_download.py:
    • TestSessionBackedFsspecCredentials — boto3 sync session, aiobotocore async session, _session fallback for older s3fs, unresolvable session returns None, no-access-key returns None, anon=True skips session, static attrs win over session.
    • TestPlanObstoreRoutingNone filesystem → obstore, non-S3 fsspec → threaded, unextractable fsspec-S3 → threaded with warning, warning dedup (same FS twice = one warning), extractable fsspec-S3 → obstore, AsyncPartitionActor raises on unextractable creds.
    • TestThreadedDownloadPreResolve — exactly one probe across 16 workers, supplied-FS skips probe, bad-first-URI still finds FS, all-bad URIs don't crash, empty block doesn't spawn workers, regression test for ([], fs) probe-break bug.
  • End-to-end against moto/minio with Okta-style fsspec S3FileSystem over ~100 URIs (no NoCredentialsError, moto sees signed requests).
  • Run with RAY_DATA_USE_OBSTORE=0 over concurrent tasks, verifying S3FileSystem.__init__ is invoked exactly once per block rather than per worker.

Two independent bugs in the `download` expression pipeline that compound
under concurrent S3 workloads on EC2:

1. **Silent credential drop (obstore path).** For fsspec S3 filesystems
   with Okta/STS/profile-based auth, static attrs (`key`/`secret`/`token`,
   `storage_options`) are `None` — credentials live on
   `fs.session.get_credentials()` and may rotate. Extraction returned
   `{}`, obstore's `from_url` got no keys, and obstore fell back to its
   own credential chain (IMDS on EC2). Users never saw a warning.

2. **IMDS thundering herd (threaded fallback).** `download_bytes_threaded`
   spawned 16 worker threads, each calling `_resolve_paths_and_filesystem`
   independently on its first URI. With N concurrent Download tasks on a
   node that was ~16xN simultaneous IMDS calls, tripping the per-instance
   rate limit and producing intermittent `NoCredentialsError`.

Changes:

- `_extract_credentials_from_filesystem` now returns `Optional[Dict]`:
  `None` signals "route to threaded" (user's FS stays authoritative).
  New `_frozen_s3fs_credentials` helper snapshots via
  `session.get_credentials().get_frozen_credentials()` (sync boto3 and
  async aiobotocore), so Okta/STS/profile-based credentials reach obstore.
  Unrecognized non-None filesystems also return `None` per the contract.
- New `_plan_obstore_routing(fs) -> (use_obstore, fs_kwargs)` centralizes
  the dispatch decision with a one-shot WARNING per filesystem when
  credentials can't be extracted.
- `plan_download_op` pre-decides obstore-vs-threaded at plan time and
  routes `AsyncPartitionActor` accordingly (no silent downgrade).
- `download_bytes_threaded` resolves the filesystem exactly once per
  (block, column) before `make_async_gen`, then shares it across all 16
  workers — no per-worker IMDS storm. The probe loop requires both
  `paths` and `fs` to be usable before breaking, avoiding a leak-through
  when `_resolve_paths_and_filesystem` silently drops a bad URI.
- `download_bytes_async` extracts credentials once in sync context and
  threads the kwargs through to `_download_uris_with_obstore` (fixes a
  correctness issue where aiobotocore's async `get_credentials` couldn't
  run from inside an existing event loop).
- `AsyncPartitionActor` fails closed with a clear `RuntimeError` if it's
  ever constructed with an unextractable filesystem (dispatch bug guard).

Tests: added coverage for session-based credential snapshotting (boto3
sync + aiobotocore async), warning dedup, dispatch routing on
unextractable filesystems, and the threaded pre-resolve (one probe
across workers, supplied-FS skips probe, bad-first-URI handling,
empty-paths regression, empty-block safety, fail-closed actor init).
@xyuzh xyuzh requested a review from a team as a code owner April 23, 2026 23:25
@xyuzh xyuzh added the go add ONLY when ready to merge, run all tests label Apr 23, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the obstore download path by adding support for session-backed s3fs credentials (Okta/STS) and introducing a routing logic that falls back to threaded PyArrow downloads when credentials cannot be statically extracted. It also mitigates the "IMDS thundering herd" problem in the threaded path by pre-resolving the filesystem once per block. Reviewers noted that obstore support could be extended to GCS and Azure fsspec protocols and raised concerns about a potential regression in handling mixed-scheme blocks due to the new "sticky" filesystem behavior in threaded downloads.

Comment on lines +264 to +268
if protocol not in ("s3", "s3a"):
# Non-S3 fsspec — obstore can't use these. Callers usually filter
# this out via _obstore_filesystem_requires_threaded_download, but
# direct callers must also route to the threaded path.
return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment states that obstore cannot use non-S3 fsspec protocols, but obstore (via the underlying object_store Rust crate) does support GCS and Azure. While the current credential extraction logic is focused on S3, it might be worth clarifying the comment or considering future support for GCS/Azure fsspec protocols to avoid unnecessary fallbacks to the threaded path when obstore could potentially be used.

Comment on lines +270 to +273
resolved_paths, _ = _resolve_paths_and_filesystem(
uri, filesystem=resolved_fs
)
fs_for_read = wrapped_fs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation of download_bytes_threaded now uses a single pre-resolved filesystem (resolved_fs) for all URIs in a block to mitigate the IMDS thundering herd. This introduces a 'sticky' filesystem behavior per block. If a block contains mixed-scheme URIs (e.g., S3 and GCS), URIs that are incompatible with the first resolved filesystem will fail to download, as _resolve_paths_and_filesystem will return the incompatible input filesystem when one is provided (per its documented behavior). While mixed-scheme blocks are rare in Ray Data, this is a regression in flexibility compared to the previous per-worker resolution. Consider if a per-scheme cache within the worker would be a more robust alternative to support mixed-scheme blocks while still avoiding the thundering herd.

Direct-caller guard: if _download_uris_with_obstore is invoked with a
filesystem whose credentials can't be statically extracted, raise
RuntimeError instead of coercing extraction's None to {}. The old
coercion silently handed obstore's ambient credential chain (IMDS/env)
to any internal/test caller that bypassed _plan_obstore_routing —
exactly the silent-drop contract the new routing was designed to
prevent. filesystem=None still legitimately yields fs_kwargs={}.

Adds a regression test covering the new raise.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 53bfece. Configure here.

# Unrecognized non-None filesystem — route to threaded so the user's FS
# stays authoritative. Obstore with empty kwargs would silently use its
# own credential chain, dropping any configuration on the user's FS.
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrecognized native filesystems unnecessarily forced to threaded path

Low Severity

_extract_credentials_from_filesystem now returns None for all unrecognized non-None filesystems (e.g. LocalFileSystem), whereas it previously returned {}. This causes _plan_obstore_routing to route these to the threaded path and emit a misleading warning about "S3 credentials" — even though obstore natively supports file:// URIs and no credentials are needed. Users explicitly passing pafs.LocalFileSystem() would see an unnecessary performance downgrade and a confusing warning.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 53bfece. Configure here.

@xyuzh
Copy link
Copy Markdown
Member Author

xyuzh commented Apr 23, 2026

Splitting into two independent PRs for easier review:

The two bugs are independent; either PR can land first. Closing this combined PR in favor of the split.

@xyuzh xyuzh closed this Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant