[Data] Improve block size selection when reading large jsonl data chunks #41196
Labels
data
Ray Data-related issues
enhancement
Request for new feature and/or capability
P1
Issue that should be fixed within a few weeks
ray 2.10
size:small
Description
As a follow-up to #40533, add logic to gracefully handle cases where the block size configured in Arrow's JSON loader is too small to hold a single row.
Hugging Face has an implementation that handles this dynamically. We could do something similar by choosing a better default value and adding fallback logic that increases the block size until it is large enough: https://github.com/huggingface/datasets/blob/d122b3ddc67705cc2b622bcbd79de9ff943a5742/src/datasets/packaged_modules/json/json.py#L119
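A minimal sketch of what such fallback logic could look like, not the actual Ray or Hugging Face code. The function names, the default sizes, and the idea of detecting the failure by looking for "straddling" in the error message are assumptions made for illustration; the real error type and message should be checked against the Arrow version in use.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_block_size_fallback(
    read_fn: Callable[[int], T],
    initial_block_size: int = 1 << 20,  # hypothetical default (1 MiB)
    max_block_size: int = 1 << 30,      # hypothetical upper bound (1 GiB)
) -> T:
    """Call read_fn(block_size), doubling block_size while it is too small.

    Assumption: a too-small block size surfaces as an exception whose
    message contains "straddling". Any other error is re-raised as-is.
    """
    block_size = initial_block_size
    while True:
        try:
            return read_fn(block_size)
        except Exception as e:
            # Give up on unrelated errors, or once the cap has been reached.
            if "straddling" not in str(e) or block_size >= max_block_size:
                raise
            block_size = min(block_size * 2, max_block_size)


def read_jsonl(path: str):
    """Hypothetical wrapper around Arrow's JSON reader."""
    import pyarrow.json as pajson  # imported lazily; requires pyarrow

    return read_with_block_size_fallback(
        lambda bs: pajson.read_json(
            path, read_options=pajson.ReadOptions(block_size=bs)
        )
    )
```

Doubling keeps the number of retries logarithmic in the size of the largest row, and the cap prevents an unbounded loop on input that is malformed rather than merely large.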
Related issues:
Use case
No response