Hi! I'd like to read a parquet dataset from a directory of files on local disk where each file corresponds exactly to one block in the dataset. The docs suggest that this should be the case —
Read Parquet files into a tabular Dataset. The Parquet data will be read into Arrow Table blocks. Although this simple example demonstrates reading a single file, note that Datasets can also read directories of Parquet files, with one tabular block created per file.
— but I keep running into situations where Dataset.num_blocks() != len(Dataset.input_files()). I can modify the number of blocks by tweaking DatasetContext's block_splitting_enabled and target_max_block_size values. I delved into the source code but couldn't find where the disconnect between input files and output blocks was occurring. (I got as far as the BlockOutputBuffer before tapping out...)
I'm not doing anything unusual with the ray.data.read_parquet(dir_path) call, so I'm assuming this is expected behavior, and the problem is actually misleading info in the linked documentation (https://docs.ray.io/en/latest/data/creating-datasets.html#supported-file-formats). Please let me know if there's a way to guarantee 1:1 file:block reading of parquet data! If not, clarifying in the docs when that relationship doesn't hold would be a help.
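For reference, here's a minimal sketch of what I'm running — the directory path and the target_max_block_size value are placeholders, not my actual setup:

```python
import ray
from ray.data.context import DatasetContext

dir_path = "/data/chunks/"  # placeholder: a local directory of parquet files

ds = ray.data.read_parquet(dir_path)
# I'd expect one block per input file, but these counts frequently differ:
print(ds.num_blocks(), len(ds.input_files()))

# Tweaking these context values changes the resulting block count:
ctx = DatasetContext.get_current()
ctx.block_splitting_enabled = False
ctx.target_max_block_size = 2**31  # placeholder: larger than any single file

ds = ray.data.read_parquet(dir_path)
print(ds.num_blocks(), len(ds.input_files()))
```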
bdewilde added the docs and triage labels on Mar 9, 2023
Hi @bveeramani, thanks for confirming! I'll try setting parallelism as you suggest; I didn't realize that would guarantee the 1:1 mapping between file and batch.
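Something like this, if I'm understanding right? (The file paths are hypothetical, and since parallelism is a hint rather than a hard guarantee, I'll verify with the assert.)

```python
import ray
from ray.data.context import DatasetContext

# Turn off block splitting so a large file isn't split into multiple blocks.
DatasetContext.get_current().block_splitting_enabled = False

# Hypothetical list of per-chunk parquet files.
file_paths = ["/data/chunks/part-0.parquet", "/data/chunks/part-1.parquet"]

# Ask for exactly one read task (and thus one block) per file.
ds = ray.data.read_parquet(file_paths, parallelism=len(file_paths))
assert ds.num_blocks() == len(ds.input_files())
```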
Each file represents an independent chunk of data from a much larger dataset that I've already grouped using AWS Athena. I'd like to process each chunk via Dataset.map_batches(batch_size=None) or, possibly, BatchMapper(batch_size=None).transform(ds). I could in principle re-group the full dataset via Dataset.groupby(), but that's a bit clunky — it's grouped on multiple columns, which Ray doesn't support, so it requires workarounds — and slow, since the full dataset is very large. It's much faster to read in N files at a time, process each independently, then write the results back to disk in a more streaming fashion.
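Concretely, the per-chunk flow I'm after looks something like this — the processing function and the paths are illustrative only:

```python
import pandas as pd
import ray

def process_chunk(batch: pd.DataFrame) -> pd.DataFrame:
    # Illustrative per-chunk logic; with batch_size=None, each batch
    # should be one whole block (ideally one whole file).
    batch["n_rows_in_chunk"] = len(batch)
    return batch

ds = ray.data.read_parquet("/data/chunks/")  # hypothetical input directory
ds = ds.map_batches(process_chunk, batch_size=None, batch_format="pandas")
ds.write_parquet("/data/processed/")  # hypothetical output directory
```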