[Data] Can't read files if input paths share common directory #39043

Closed
bveeramani opened this issue Aug 29, 2023 · 2 comments
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@bveeramani
Member

What happened + What you expected to happen

I passed in several input paths to read_images. Each path points to a folder in a bucket. I expected Ray to read all of the images in the folders, but I got an unexpected error instead:

ray.exceptions.RayTaskError(ValueError): ray::_get_reader() (pid=48565, ip=10.0.36.183)
  File "/tmp/ray/session_2023-08-28_22-16-18_573417_6518/runtime_resources/py_modules_files/_ray_pkg_b75ae321665ac451/ray/data/read_api.py", line 2348, in _get_reader
    reader = ds.create_reader(**kwargs)
  File "/tmp/ray/session_2023-08-28_22-16-18_573417_6518/runtime_resources/py_modules_files/_ray_pkg_b75ae321665ac451/ray/data/datasource/image_datasource.py", line 66, in create_reader
    return _ImageDatasourceReader(
  File "/tmp/ray/session_2023-08-28_22-16-18_573417_6518/runtime_resources/py_modules_files/_ray_pkg_b75ae321665ac451/ray/data/datasource/image_datasource.py", line 159, in __init__
    super().__init__(
  File "/tmp/ray/session_2023-08-28_22-16-18_573417_6518/runtime_resources/py_modules_files/_ray_pkg_b75ae321665ac451/ray/data/datasource/file_based_datasource.py", line 508, in __init__
    raise ValueError(
ValueError: No input files found to read. Please double check that 'partition_filter' field is set properly.

Versions / Dependencies

Ray: ad4ce20

Reproduction script

from ray.data.datasource.file_based_datasource import _resolve_paths_and_filesystem
from ray.data.datasource.image_datasource import _ImageFileMetadataProvider

input_paths = [
    "s3://anonymous@air-example-data-2/100TB-tif/1000034852_tile_id_45198/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000052395_tile_id_45170/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000093814_tile_id_5682/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000147279_tile_id_37315/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000174431_tile_id_14397/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000249111_tile_id_8939/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000263288_tile_id_10853/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000273212_tile_id_33684/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000321272_tile_id_34945/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000481847_tile_id_10149/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000534279_tile_id_44645/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000555931_tile_id_19042/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000613374_tile_id_42535/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000708946_tile_id_38641/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000759254_tile_id_9292/",
    "s3://anonymous@air-example-data-2/100TB-tif/1000821758_tile_id_27188/",
]

input_paths, filesystem = _resolve_paths_and_filesystem(input_paths, None)
meta_provider = _ImageFileMetadataProvider()
expanded_paths, _ = map(
    list,
    zip(
        *meta_provider.expand_paths(
            input_paths,
            filesystem,
        )
    ),
)
assert len(expanded_paths) > len(input_paths), len(expanded_paths)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@bveeramani added the bug and data labels Aug 29, 2023
@bveeramani
Member Author

This issue only occurs when we go down this code path:

# 2. Common path prefix case.
# Get longest common path of all paths.
common_path = os.path.commonpath(paths)
# If parent directory (or base directory, if using partitioning) is common to
# all paths, fetch all file infos at that prefix and filter the response to the
# provided paths.
if (
    partitioning is not None
    and common_path == _unwrap_protocol(partitioning.base_dir)
) or all(str(pathlib.Path(path).parent) == common_path for path in paths):
    yield from _get_file_infos_common_path_prefix(
        paths, common_path, filesystem, ignore_missing_paths
    )
@anyscalesam added the triage label Oct 31, 2023
@bveeramani
Member Author

Looks like this was fixed by #39592
