[Data] Support JSONL file format #37637

Merged
merged 4 commits into from
Jul 22, 2023

Conversation

Contributor

@c21 c21 commented Jul 21, 2023

Why are these changes needed?

This adds support for the JSONL file format. Our read_json and write_json already support the JSONL file format. The actual code change is to support the ".jsonl" file extension in read_json, and to update the documentation and examples.
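
For illustration only, extension-based file selection of the kind described above could look like the sketch below. The helper name `has_supported_extension`, the `SUPPORTED_EXTENSIONS` set, and the example paths are all hypothetical; this is not Ray's actual implementation, which does not appear on this page.

```python
from pathlib import PurePosixPath

# Hypothetical set of extensions a JSON reader might accept.
SUPPORTED_EXTENSIONS = {".json", ".jsonl"}


def has_supported_extension(path: str) -> bool:
    """Return True if the path ends in one of the accepted extensions."""
    return PurePosixPath(path).suffix.lower() in SUPPORTED_EXTENSIONS


# Hypothetical paths, used only to exercise the helper.
paths = [
    "s3://bucket/logs.json",
    "s3://bucket/logs.jsonl",
    "s3://bucket/logs.csv",
]
selected = [p for p in paths if has_supported_extension(p)]
```

Under this assumption, accepting ".jsonl" only means adding one more entry to the set of accepted extensions; the parsing code path stays the same.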

Related issue number

Closes #37611

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cheng Su <scnju13@gmail.com>
Member

@bveeramani bveeramani left a comment

Seems like there's some misalignment in our APIs:

  • read_json can only read JSONL files
  • write_json always writes to JSON files

To avoid confusion, maybe we should rename read_json to read_jsonl?

@@ -933,10 +933,10 @@ def read_json(
     ignore_missing_paths: bool = False,
     **arrow_json_args,
 ) -> Dataset:
-    """Creates a :class:`~ray.data.Dataset` from JSON files.
+    """Creates a :class:`~ray.data.Dataset` from JSON and JSONL files.
Member

PyArrow's read_json only supports JSONL, not normal JSON.

Currently only the line-delimited JSON format is supported.

Suggested change
-    """Creates a :class:`~ray.data.Dataset` from JSON and JSONL files.
+    """Creates a :class:`~ray.data.Dataset` from JSONL files.

Contributor Author

I think it supports both JSONL and JSON format. Essentially JSON is just a line of JSONL.

Tried with the example in https://json.org/example.html:

>>> import ray
>>> ds = ray.data.read_json("/Users/chengsu/try/json/input.json")
>>> ds
Dataset(
   num_blocks=20,
   num_rows=1,
   schema={
      glos...: struct<title: string, GlossDiv: struct<title: string, GlossList: struct<GlossEntry: struct<ID: string, SortAs: string, GlossTerm: string, Acronym: string, Abbrev: string, GlossDef: struct<para: string, GlossSeeAlso: list<item: string>>, GlossSee: string>>>>
   }
)

Contributor

We can document this: For an input json file, the whole file will be read as one row. For an input jsonl file, each line will be read as one row.

Contributor Author

Yes, updated.
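
The row semantics discussed in this thread (a JSON file is read as one row, a JSONL file as one row per line) can be sketched with the standard library alone. This is an illustration of the two formats, not how Arrow or Ray parse files; the helper `iter_json_values` and the sample strings are made up for the example.

```python
import json


def iter_json_values(text: str):
    """Yield each top-level JSON value in text, separated by whitespace."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # Skip whitespace (including newlines) between values.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # raw_decode parses one value and returns the index where it stopped.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# A JSON file: one object, possibly spread over several lines.
json_text = '{\n  "name": "foo",\n  "tags": ["a", "b"]\n}\n'
# A JSONL file: one object per line.
jsonl_text = '{"name": "foo"}\n{"name": "bar"}\n{"name": "baz"}\n'

json_rows = list(iter_json_values(json_text))
jsonl_rows = list(iter_json_values(jsonl_text))
print(len(json_rows), len(jsonl_rows))  # 1 3
```

Each yielded value corresponds to one row in the sense used above.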


 Examples:
-    Read a file in remote storage.
+    Read a JSON file in remote storage.
Member

logs.json is actually a JSONL file with the JSON extension. To avoid confusion, maybe we should rename logs.json to logs.jsonl? We also don't need to keep both examples because they both read JSONL.

Contributor Author

Yeah, good catch. I will add a file with a single line of JSON to represent a JSON file.

python/ray/data/tests/test_json.py (resolved)
@@ -13,7 +13,7 @@

 @PublicAPI
 class JSONDatasource(FileBasedDatasource):
-    """JSON datasource, for reading and writing JSON files.
+    """JSON datasource, for reading and writing JSON and JSONL files.
Member

Suggested change
-    """JSON datasource, for reading and writing JSON and JSONL files.
+    """JSON datasource, for reading JSONL and writing JSON files.

@@ -2535,6 +2535,11 @@ def write_json(
 >>> ds = ray.data.range(100)
 >>> ds.write_json("local:///tmp/data")

+Write the dataset as JSONL files to a local directory.
Member

I'm not sure if we need both of these examples. They both write to JSON format files.

Contributor Author

@c21 c21 Jul 21, 2023

Updated to have two examples: one writes a JSON file, the other writes multiple JSONL files.

@@ -2513,7 +2513,7 @@ def write_json(
     ray_remote_args: Dict[str, Any] = None,
     **pandas_json_args,
 ) -> None:
-    """Writes the :class:`~ray.data.Dataset` to JSON files.
+    """Writes the :class:`~ray.data.Dataset` to JSON and JSONL files.
Member

Looks like we use Pandas' to_json, and that always writes to JSON (not JSONL)?

Suggested change
-    """Writes the :class:`~ray.data.Dataset` to JSON and JSONL files.
+    """Writes the :class:`~ray.data.Dataset` to JSON files.

Contributor

I tried ds.write_json("xxx.jsonl") locally, and it did write JSONL files. But I'm not sure whether it was the ".jsonl" suffix or something else that made it write JSONL files. I think we need to figure it out and also make it clear in the docs.

Contributor Author

It always writes in JSONL format. If the output has only a single line, it is also a JSON file. More context in #37637 (comment).

Contributor

Got it. Let's update the doc to reflect this?

Contributor Author

Sure, updated.

Contributor Author

c21 commented Jul 21, 2023

@bveeramani, @raulchen, just to add some context: the JSONL format (https://jsonlines.org/examples/) is essentially multiple lines of JSON (https://json.org/example.html).

A JSONL file (multiple lines of JSON):

{"name": "foo", ...}
{"name": "bar", ...}
...

A JSON file (a single line of JSON):

{"name": "foo", ...}

We use Arrow's read_json, which supports both JSON and JSONL files. A JSON file is a special case of a JSONL file anyway: it is a JSONL file with a single line.

We use Pandas' to_json to write JSON and JSONL files. We set lines=True, so if there are multiple rows, each row is written as JSON on a separate line (JSONL format).
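
The "one JSON value per line" writing behavior described here can be mimicked with the standard library, as a rough sketch. It does not call Pandas; `to_jsonl` is a hypothetical helper standing in for what `to_json` with `lines=True` is described as doing in this comment.

```python
import json


def to_jsonl(rows):
    """Serialize each row as one JSON value on its own line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


multi = to_jsonl([{"name": "foo"}, {"name": "bar"}])
single = to_jsonl([{"name": "foo"}])

# A single-row output is also a valid JSON document.
single_is_valid_json = json.loads(single) == {"name": "foo"}

# A multi-row output is valid JSONL but not one valid JSON document.
try:
    json.loads(multi)
    multi_is_valid_json = True
except json.JSONDecodeError:
    multi_is_valid_json = False
```

This matches the point made above: the writer always emits JSONL, and the single-row case just happens to be valid JSON too.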

Contributor Author

c21 commented Jul 21, 2023

To avoid confusion, maybe we should rename read_json to read_jsonl?

Given that all the other libraries (Arrow, Pandas) use read_json and to_json, I would love for us to stick with read_json and avoid a breaking API naming change as well. Does that make sense?

Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
Comment on lines 2518 to 2519
When the dataset has multiple rows, the output file is in JSONL format.
Otherwise, the output file is in JSON format.
Member

Nit: IMO it's better to exclude this information, as well as the example of writing to a single JSON file. How often will users work with single-row datasets? I'm concerned this would introduce confusion, and it'd be easier to understand "write_json always writes JSONL".

Contributor Author

Okay, I agree single-row datasets would be very rare. Removed these two lines.

Signed-off-by: Cheng Su <scnju13@gmail.com>
@c21 c21 merged commit 2bc97f0 into ray-project:master Jul 22, 2023
59 of 62 checks passed
@c21 c21 deleted the jsonl branch July 22, 2023 21:30
Development

Successfully merging this pull request may close these issues.

[Data] Support ".jsonl" file extension in read_json
3 participants