[Data] Add read_json docs example for setting PyArrow block size when reading large files (#40533)

Adds an example of a workaround for errors when reading large files with ray.data.read_json: setting the block size used by PyArrow's JSON loader via the read_options parameter.

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
scottjlee committed Nov 15, 2023
1 parent 227fe5d commit d4cae1d
Showing 1 changed file with 15 additions and 0 deletions.
python/ray/data/read_api.py (15 additions, 0 deletions)
@@ -1028,6 +1028,21 @@ def read_json(
>>> ds.take(1)
[{'order_number': 10107, 'quantity': 30, 'year': '2022', 'month': '09'}]

When reading large files, the default block size configured in PyArrow can be too small,
resulting in the following error:
``pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries
(try to increase block size?)``.

To resolve this, use the ``read_options`` parameter to set a larger block size:

>>> import pyarrow.json as pajson
>>> block_size = 10 << 20  # Set block size to 10MB
>>> ray.data.read_json(
...     "s3://anonymous@ray-example-data/log.json",
...     read_options=pajson.ReadOptions(block_size=block_size)
... )
Dataset(num_blocks=8, num_rows=1, schema={timestamp: timestamp[s], size: int64})

Args:
    paths: A single file or directory, or a list of file or directory paths.
        A list of paths can contain both files and directories.
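
The docstring above shows a fixed 10MB block size. As a complementary illustration (not part of this commit), the sketch below retries the read with a doubled block size whenever the "straddling object" error appears; the helper name, starting size, and cap are hypothetical, and it assumes the ArrowInvalid error reaches the driver once the lazy read is forced with materialize().

# Sketch (assumption, not from the commit): retry ray.data.read_json with a
# doubled PyArrow block size until no JSON object straddles two block boundaries.
import pyarrow as pa
import pyarrow.json as pajson
import ray

def read_json_with_growing_block_size(path, block_size=1 << 20, max_block_size=1 << 30):
    while True:
        try:
            ds = ray.data.read_json(
                path, read_options=pajson.ReadOptions(block_size=block_size)
            )
            # Reads execute lazily; materialize() forces them to run, which is
            # where the "straddling object" ArrowInvalid error would surface.
            return ds.materialize()
        except pa.ArrowInvalid:
            if block_size >= max_block_size:
                raise  # Give up once the cap is reached.
            block_size *= 2  # Double the block size and retry.

# Hypothetical usage with the example file from the docstring:
# ds = read_json_with_growing_block_size("s3://anonymous@ray-example-data/log.json")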