[data] Dataset.num_blocks() not always == len(Dataset.input_files()) #33179

Closed · bdewilde opened this issue on Mar 9, 2023 · 2 comments · Fixed by #33185
Labels: docs (An issue or change related to documentation)

Comments

bdewilde commented Mar 9, 2023

Description

Hi! I'd like to read a parquet dataset from a directory of files on local disk where each file corresponds exactly to one block in the dataset. The docs suggest that this should be the case —

Read Parquet files into a tabular Dataset. The Parquet data will be read into Arrow Table blocks. Although this simple example demonstrates reading a single file, note that Datasets can also read directories of Parquet files, with one tabular block created per file.

— but I keep running into situations where Dataset.num_blocks() != len(Dataset.input_files()). I can modify the number of blocks by tweaking DatasetContext's block_splitting_enabled and target_max_block_size values. I delved into the source code but couldn't find where the disconnect between input files and output blocks was occurring. (I got as far as the BlockOutputBuffer before tapping out...)
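
For reference, here's roughly the check I'm running (the directory path is just a placeholder, and the context values are only examples of the knobs mentioned above):

```python
import ray
from ray.data.context import DatasetContext

# Context knobs I've been tweaking (values here are just examples).
ctx = DatasetContext.get_current()
ctx.block_splitting_enabled = True
ctx.target_max_block_size = 512 * 1024 * 1024  # bytes

ds = ray.data.read_parquet("my_parquet_dir/")  # placeholder local directory

# Based on the docs I expected these to be equal, but they often aren't.
print(ds.num_blocks(), len(ds.input_files()))
```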

I'm not doing anything unusual with the ray.data.read_parquet(dir_path) call, so I'm assuming this is expected behavior and that the problem is really the misleading info in the linked documentation. Please let me know if there's a way to guarantee 1:1 file-to-block reading of Parquet data! If not, clarifying in the docs when that relationship doesn't hold would be a big help.

Link

https://docs.ray.io/en/latest/data/creating-datasets.html#supported-file-formats

bdewilde added the docs and triage labels on Mar 9, 2023
bveeramani (Member) commented

Hey @bdewilde, thanks for opening an issue!

Looks like this is an inaccuracy in our docs. I've opened a PR to fix it: #33185.

To achieve 1:1 file-to-block reading, you could try setting parallelism to the number of files:

ray.data.read_parquet(..., parallelism=NUM_FILES)
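
For example, something along these lines (the file paths are just placeholders):

```python
import ray

# Placeholder list of Parquet files; in practice, enumerate your directory.
paths = [f"my_parquet_dir/part-{i}.parquet" for i in range(8)]

# One read task per file should give one block per file.
ds = ray.data.read_parquet(paths, parallelism=len(paths))

print(ds.num_blocks(), len(ds.input_files()))  # expected to match
```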

Also, just out of curiosity, why are you interested in a 1:1 mapping between files and blocks?

bdewilde (Author) commented

Hi @bveeramani, thanks for confirming! I'll try setting parallelism as you suggest; I didn't realize that would guarantee the 1:1 mapping between file and batch.

Each file represents an independent chunk of data from a much larger dataset that I've already grouped using AWS Athena. I'd like to process each chunk via Dataset.map_batches(batch_size=None) or, possibly, BatchMapper(batch_size=None).transform(ds), roughly as sketched below. I could in principle re-group the full dataset via Dataset.groupby(), but that's a bit clunky — the data is grouped on multiple columns, which Ray doesn't support directly, so it requires workarounds — and slow, since the full dataset is very large. It's much faster to read in N files at a time, process each independently, and then write the results back to disk in a more streaming fashion.
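
Roughly, the flow looks like this (the paths, output directory, and processing function are placeholders):

```python
import ray

# Placeholder: N pre-grouped Parquet files produced by Athena.
paths = [f"grouped/part-{i}.parquet" for i in range(100)]

def process_group(batch):
    # With batch_size=None, each batch is one whole block, i.e. one input
    # file when files map 1:1 to blocks. Real processing would go here.
    return batch

ds = ray.data.read_parquet(paths, parallelism=len(paths))
processed = ds.map_batches(process_group, batch_size=None)
processed.write_parquet("processed/")  # placeholder output location
```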

bveeramani removed the triage label on Mar 10, 2023