[Data] Support JSONL file format #37637
Signed-off-by: Cheng Su <scnju13@gmail.com>
Seems like there's some misalignment in our APIs:
- `read_json` can only read JSONL files
- `write_json` always writes to JSON files
To avoid confusion, maybe we should rename `read_json` to `read_jsonl`?
@@ -933,10 +933,10 @@ def read_json(
     ignore_missing_paths: bool = False,
     **arrow_json_args,
 ) -> Dataset:
-    """Creates a :class:`~ray.data.Dataset` from JSON files.
+    """Creates a :class:`~ray.data.Dataset` from JSON and JSONL files.
PyArrow's `read_json` only supports JSONL, not normal JSON: "Currently only the line-delimited JSON format is supported."
Suggested change:
-    """Creates a :class:`~ray.data.Dataset` from JSON and JSONL files.
+    """Creates a :class:`~ray.data.Dataset` from JSONL files.
I think it supports both JSONL and JSON format. Essentially JSON is just a line of JSONL.
Tried with the example in https://json.org/example.html:
>>> import ray
>>> ds = ray.data.read_json("/Users/chengsu/try/json/input.json")
>>> ds
Dataset(
   num_blocks=20,
   num_rows=1,
   schema={
      glos...: struct<title: string, GlossDiv: struct<title: string, GlossList: struct<GlossEntry: struct<ID: string, SortAs: string, GlossTerm: string, Acronym: string, Abbrev: string, GlossDef: struct<para: string, GlossSeeAlso: list<item: string>>, GlossSee: string>>>>
   }
)
We can document this: for an input JSON file, the whole file is read as one row; for an input JSONL file, each line is read as one row.
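To illustrate the row semantics described here, the following is a minimal standard-library sketch, not Ray's or Arrow's actual implementation. It assumes a reader that treats every top-level JSON value in the input as one row; the helper name `iter_rows` and the sample strings are hypothetical.

```python
import json


def iter_rows(text: str):
    """Yield each top-level JSON value found in `text` as one row."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # Skip whitespace (including newlines) between consecutive values.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # raw_decode parses one value starting at `pos` and returns the
        # index just past it, so the loop can continue with the next value.
        row, pos = decoder.raw_decode(text, pos)
        yield row


# A single JSON document, even when pretty-printed over several lines.
json_doc = '{\n  "glossary": {"title": "example glossary"}\n}\n'
# A JSONL document: one JSON object per line.
jsonl_doc = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'

json_rows = list(iter_rows(json_doc))
jsonl_rows = list(iter_rows(jsonl_doc))
```

Under this assumption, `json_rows` holds one row and `jsonl_rows` holds three, which matches the `num_rows=1` result reported above for the json.org example.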
Yes, updated.
     Examples:
-        Read a file in remote storage.
+        Read a JSON file in remote storage.
`logs.json` is actually a JSONL file with the JSON extension. To avoid confusion, maybe we should rename `logs.json` to `logs.jsonl`? We also don't need to keep both examples because they both read JSONL.
Yeah, good catch. I will add a file with a single line of JSON to represent a JSON file.
@@ -13,7 +13,7 @@

 @PublicAPI
 class JSONDatasource(FileBasedDatasource):
-    """JSON datasource, for reading and writing JSON files.
+    """JSON datasource, for reading and writing JSON and JSONL files.
Suggested change:
-    """JSON datasource, for reading and writing JSON and JSONL files.
+    """JSON datasource, for reading JSONL and writing JSON files.
@@ -2535,6 +2535,11 @@ def write_json(
     >>> ds = ray.data.range(100)
     >>> ds.write_json("local:///tmp/data")

+    Write the dataset as JSONL files to a local directory.
I'm not sure if we need both of these examples. They both write to JSON format files.
Updated to have two examples: one writes a JSON file, the other writes multiple JSONL files.
@@ -2513,7 +2513,7 @@ def write_json(
     ray_remote_args: Dict[str, Any] = None,
     **pandas_json_args,
 ) -> None:
-    """Writes the :class:`~ray.data.Dataset` to JSON files.
+    """Writes the :class:`~ray.data.Dataset` to JSON and JSONL files.
Looks like we use Pandas' `to_json`, and that always writes to JSON (not JSONL)?
Suggested change:
-    """Writes the :class:`~ray.data.Dataset` to JSON and JSONL files.
+    """Writes the :class:`~ray.data.Dataset` to JSON files.
I tried `ds.write_json("xxx.jsonl")` locally, and it did write JSONL files. But I'm not sure whether it's the ".jsonl" suffix or something else that makes it write JSONL files. I think we need to figure that out and also make it clear in the doc.
It always writes in JSONL format. If the output has only a single line, it becomes a JSON file. More context in #37637 (comment).
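The point that a single-line JSONL output is also a valid JSON file can be checked with a small standard-library sketch. This is an illustration only and does not reproduce how `write_json` serializes data internally; the helpers `to_jsonl` and `is_single_json_document` are hypothetical names.

```python
import json


def to_jsonl(rows):
    """Serialize each row as one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def is_single_json_document(text):
    """Return True if `text` parses as exactly one JSON value."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        # A second object after the first one is rejected as extra data.
        return False


single = to_jsonl([{"id": 0}])
many = to_jsonl([{"id": 0}, {"id": 1}])

single_is_json = is_single_json_document(single)
many_is_json = is_single_json_document(many)
```

Here `single_is_json` is true and `many_is_json` is false: a one-row JSONL output also parses as plain JSON, while a multi-row output is valid only as JSONL.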
Got it. Let's update the doc to reflect this?
Sure, updated.
@bveeramani, @raulchen, just to add some context. JSONL (https://jsonlines.org/examples/) format is essentially lines of JSON format (https://json.org/example.html). A JSONL file (multiple lines of JSON):
A JSON file (a single line of JSON):
We use Arrow's `read_json`, which supports both JSON and JSONL files. A JSON file is a special case of a JSONL file anyway: a JSON file is a JSONL file with a single line. We use Pandas' `to_json` to write JSON and JSONL files. We set |
Given all other libraries (Arrow, Pandas) are using |
python/ray/data/dataset.py
Outdated
+    When the dataset has multiple rows, the output file is in JSONL format.
+    Otherwise, the output file is in JSON format.
Nit: IMO it's better to exclude this information, as well as the example of writing to a single JSON file. How often will users work with single-row datasets? I'm concerned this would introduce confusion, and it'd be easier to understand "`write_json` always writes JSONL".
Okay, I agree a single-row dataset would be very rare. Removed these two lines.
Why are these changes needed?
This adds support for the JSONL file format. Our `read_json` and `write_json` already support the JSONL file format. The actual code change is to support the ".jsonl" file extension in `read_json`, and to update the documentation and examples.
Related issue number
Closes #37611
Checks
…`git commit -s`) in this PR.
…`scripts/format.sh` to lint the changes in this PR.
…method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.