
[Data] Add read_json docs example for setting PyArrow block size when reading large files #40533

Merged

7 commits merged into ray-project:master on Nov 15, 2023

Conversation

scottjlee (Contributor) commented Oct 20, 2023

Why are these changes needed?

Adds a docs example for a workaround when reading large files with ray.data.read_json, which involves setting the block size used by PyArrow's JSON loader. See the generated docs page.

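The documented workaround, assembled from the snippets reviewed below, looks roughly like the following. This is a sketch rather than the exact merged text; 10 MB is an example value that is simply larger than PyArrow's default block size.

>>> import ray
>>> import pyarrow.json as pajson
>>> # Raise PyArrow's JSON block size so that no single JSON object
>>> # straddles two block boundaries.
>>> block_size = 10 << 20  # 10 * 2**20 bytes = 10 MiB
>>> ds = ray.data.read_json(
...     "s3://anonymous@ray-example-data/log.json",
...     read_options=pajson.ReadOptions(block_size=block_size),
... )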

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
When reading large files, the default block size configured in PyArrow can be too small,
resulting in the following error:
``pyarrow.lib.ArrowInvalid: straddling object straddles two block boundaries
(try to increase block size?)``
Member

Suggested change
- (try to increase block size?)``
+ (try to increase block size?)``.

Comment on lines 1048 to 1049
>>> ray.data.read_json(...,
... read_options=pajson.ReadOptions(block_size=block_size))
Member

Suggested change
- >>> ray.data.read_json(...,
- ...     read_options=pajson.ReadOptions(block_size=block_size))
+ >>> ray.data.read_json(
+ ...     s3://anonymous@ray-example-data/log.json,
+ ...     read_options=pajson.ReadOptions(block_size=block_size)
+ ... )

Member

If you wanted to, you could add a large-log.json to make this example more realistic, but I don't think that's necessary

Signed-off-by: Scott Lee <sjl@anyscale.com>
>>> import pyarrow.json as pajson
>>> block_size = 10 << 20 # Set block size to 10MB
>>> ray.data.read_json(
... s3://anonymous@ray-example-data/log.json,
Contributor

Missing the quotes: should be "s3://anonymous@ray-example-data/log.json"

Contributor (Author)

thanks for the catch, just fixed
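As an aside, the ReadOptions object here belongs to PyArrow itself, so the same block-size setting applies when calling PyArrow's JSON reader directly, which is the loader Ray Data wraps. A minimal sketch, assuming a hypothetical local log.json file:

>>> import pyarrow.json as pajson
>>> # "log.json" is a hypothetical local file used only for illustration.
>>> table = pajson.read_json(
...     "log.json",
...     read_options=pajson.ReadOptions(block_size=10 << 20),  # 10 MiB
... )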

scottjlee and others added 5 commits October 23, 2023 09:50
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@bveeramani bveeramani merged commit d4cae1d into ray-project:master Nov 15, 2023
16 of 19 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
[Data] Add read_json docs example for setting PyArrow block size when reading large files (ray-project#40533)

Adds an example for a workaround when reading large files with ray.data.read_json, which involves setting the block size used by PyArrow's JSON loader. Generated docs page

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>