[data] Dataset.num_blocks() not always == len(Dataset.input_files()) #33179

Closed · bdewilde opened this issue on Mar 9, 2023 · 2 comments · Fixed by #33185
Labels: docs (An issue or change related to documentation)

Comments

bdewilde commented Mar 9, 2023

Description

Hi! I'd like to read a parquet dataset from a directory of files on local disk where each file corresponds exactly to one block in the dataset. The docs suggest that this should be the case —

Read Parquet files into a tabular Dataset. The Parquet data will be read into Arrow Table blocks. Although this simple example demonstrates reading a single file, note that Datasets can also read directories of Parquet files, with one tabular block created per file.

— but I keep running into situations where Dataset.num_blocks() != len(Dataset.input_files()). I can modify the number of blocks by tweaking DatasetContext's block_splitting_enabled and target_max_block_size values. I delved into the source code but couldn't find where the disconnect between input files and output blocks was occurring. (I got as far as the BlockOutputBuffer before tapping out...)
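
For reference, here's roughly the check I'm running (the directory path is just a placeholder, and the context values are only examples of the knobs mentioned above):

```python
import ray
from ray.data.context import DatasetContext

# Context knobs I've been tweaking (values here are just examples).
ctx = DatasetContext.get_current()
ctx.block_splitting_enabled = True
ctx.target_max_block_size = 512 * 1024 * 1024  # bytes

ds = ray.data.read_parquet("my_parquet_dir/")  # placeholder local directory

# Based on the docs I expected these to be equal, but they often aren't.
print(ds.num_blocks(), len(ds.input_files()))
```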

I'm not doing anything unusual with the ray.data.read_parquet(dir_path) call, so I'm assuming this is expected behavior and that the problem is really the misleading info in the linked documentation. Please let me know if there's a way to guarantee 1:1 file-to-block reading of Parquet data! If not, clarifying in the docs when that relationship doesn't hold would be a big help.

Link

https://docs.ray.io/en/latest/data/creating-datasets.html#supported-file-formats

bdewilde added the docs and triage labels on Mar 9, 2023
bveeramani (Member) commented

Hey @bdewilde, thanks for opening an issue!

Looks like this is an inaccuracy in our docs. I've opened a PR to fix it: #33185.

To achieve 1:1 file-to-block reading, you could try setting parallelism to the number of files:

ray.data.read_parquet(..., parallelism=NUM_FILES)
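
For example, something along these lines (the file paths are just placeholders):

```python
import ray

# Placeholder list of Parquet files; in practice, enumerate your directory.
paths = [f"my_parquet_dir/part-{i}.parquet" for i in range(8)]

# One read task per file should give one block per file.
ds = ray.data.read_parquet(paths, parallelism=len(paths))

print(ds.num_blocks(), len(ds.input_files()))  # expected to match
```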

Also, just out of curiosity, why are you interested in a 1:1 mapping between files and blocks?

bdewilde (Author) commented

Hi @bveeramani, thanks for confirming! I'll try setting parallelism as you suggest; I didn't realize that would guarantee the 1:1 mapping between file and batch.

Each file represents an independent chunk of data from a much larger dataset that I've already grouped using AWS Athena. I'd like to process each chunk via Dataset.map_batches(batch_size=None) or, possibly, BatchMapper(batch_size=None).transform(ds), roughly as sketched below. I could in principle re-group the full dataset via Dataset.groupby(), but that's a bit clunky — the data is grouped on multiple columns, which Ray doesn't support directly, so it requires workarounds — and slow, since the full dataset is very large. It's much faster to read in N files at a time, process each independently, and then write the results back to disk in a more streaming fashion.
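
Roughly, the flow looks like this (the paths, output directory, and processing function are placeholders):

```python
import ray

# Placeholder: N pre-grouped Parquet files produced by Athena.
paths = [f"grouped/part-{i}.parquet" for i in range(100)]

def process_group(batch):
    # With batch_size=None, each batch is one whole block, i.e. one input
    # file when files map 1:1 to blocks. Real processing would go here.
    return batch

ds = ray.data.read_parquet(paths, parallelism=len(paths))
processed = ds.map_batches(process_group, batch_size=None)
processed.write_parquet("processed/")  # placeholder output location
```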

bveeramani removed the triage label on Mar 10, 2023