renames and exports; benchmarks #10

Merged: 5 commits, Feb 20, 2025
4 changes: 4 additions & 0 deletions Cargo.toml
@@ -31,6 +31,10 @@ regex = "1"
name = "bench_compare"
harness = false

[[bench]]
name = "bench_deserialize"
harness = false

[[example]]
crate-type = ["bin"]
path = "examples/json5-doublequote-fixer/src/main.rs"
278 changes: 110 additions & 168 deletions README.md
@@ -33,195 +33,137 @@ fn main() {
{
name: 'Hello',
count: 42,
maybe: NaN
maybe: null
}
"#;

let parsed = from_str::<MyData>(source).unwrap();
let expected = MyData {name: "Hello".to_string(), count: 42, maybe: Some(NaN)}
assert_eq!(parsed, expected)
let expected = MyData {name: "Hello".to_string(), count: 42, maybe: None};
assert_eq!(parsed, expected);
}
```
## Examples

See the `examples/` directory for examples of programs that utilize round-tripping features.

- `examples/json5-doublequote-fixer` gives an example of tokenization-based round-tripping edits
- `examples/json5-trailing-comma-formatter` gives an example of model-based round-tripping edits

## Benchmarking

Benchmarks are available in the `benches/` directory, with test data in the `data/` directory. A couple of benchmarks use
big files that are not committed to this repo; run `./data/setupdata.sh` to download them so that the big benchmarks
are not skipped. The benchmarks compare `json_five` (this crate) to
[serde_json](https://github.com/serde-rs/json) and [json5-rs](https://github.com/callum-oakley/json5-rs).

Notwithstanding the general caveats of benchmarks, in initial testing `json_five` outperforms `json5-rs`,
typically by a factor of about 3-4. At the time of writing (pre-v0), no performance optimizations have been made, so I
expect performance to improve, at least marginally, in the future.

These benchmarks were run on Windows on an i9-10900K. This table won't be updated unless significant changes happen.

| test | json_five | serde_json | json5 |
|--------------------|---------------|---------------|---------------|
| big (25MB) | 580.31 ms | 150.39 ms | 3.0861 s |
| medium-ascii (5MB) | 199.88 ms | 59.008 ms | 706.94 ms |
| empty | 228.62 ns | 38.786 ns | 708.00 ns |
| arrays | 578.24 ns | 100.95 ns | 1.3228 µs |
| objects | 922.91 ns | 205.75 ns | 2.0748 µs |
| nested-array | 22.990 µs | 5.0483 µs | 29.356 µs |
| nested-objects | 50.659 µs | 14.755 µs | 132.75 µs |
| string | 421.17 ns | 91.051 ns | 3.5691 µs |
| number | 238.75 ns | 36.179 ns | 779.13 ns |



# Round-trip model

The `rt` module contains the round-trip parser. This is intended to be ergonomic for round-trip use cases, although
it is still very possible to use the default parser (which is more performance-oriented) for certain round-trip use cases.
The round-trip AST model produced by the round-trip parser includes additional `context` fields that describe the whitespace, comments,
and (where applicable) trailing commas on each production. Moreover, unlike the default parser, the AST consists
entirely of owned types, allowing for simplified in-place editing.


The `context` field holds a single-field struct whose only field, `wsc` (meaning 'white space and comments'),
holds a tuple of `String`s that represent the contextual whitespace and comments. In the `context` of
`JSONArrayValue` and `JSONKeyValuePair` objects, the last element of the `wsc` tuple is an `Option<String>`, which
marks an optional trailing comma and any whitespace that may follow it.

The `context` field is always an `Option`.
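
To make the trailing-comma marker concrete, here is a minimal, self-contained sketch of how such a context could be turned back into source text. The types `ArrayValue` and `ArrayValueContext` and the functions `render_array_value` and `render_items` are hypothetical stand-ins written for this illustration; they only mirror the layout described above and are not the crate's own definitions.

```rust
// Hypothetical mirror of the described layout; NOT the crate's own types.
struct ArrayValueContext {
    // .0: whitespace/comments after the value
    // .1: Some(text following the comma) when a comma is present, None otherwise
    wsc: (String, Option<String>),
}

struct ArrayValue {
    // The raw text of the value, kept as a plain string for brevity.
    value: String,
    // Mirrors the statement that the context is always an Option.
    context: Option<ArrayValueContext>,
}

// Re-emit one item: value, trailing whitespace, then the optional comma
// (which is not stored in the string) and whatever followed it.
fn render_array_value(item: &ArrayValue) -> String {
    let mut out = item.value.clone();
    if let Some(ctx) = &item.context {
        out.push_str(&ctx.wsc.0);
        if let Some(after_comma) = &ctx.wsc.1 {
            out.push(',');
            out.push_str(after_comma);
        }
    }
    out
}

// Re-emit a sequence of items by concatenating their rendered forms.
fn render_items(items: &[ArrayValue]) -> String {
    items.iter().map(render_array_value).collect()
}
```

Under this sketch, removing a trailing comma amounts to setting the last element of `wsc` to `None` on the final item, which is why keeping the comma as an `Option` rather than as part of the whitespace string is convenient for in-place edits.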

Contexts are associated with the following structs (which correspond to the JSON5 productions) and their context layout:

## `rt::parser::JSONText`

Represents the top-level Text production of a JSON5 document. It consists solely of a single (required) value.
It may have whitespace/comments before or after the value. The `value` field contains any `JSONValue` and the `context`
field contains the context struct containing the `wsc` field, a two-length tuple that describes the whitespace before and after the value.
In other words: `{ wsc.0 } value { wsc.1 }`

```rust
use json_five::rt::parser::from_str;
use json_five::rt::parser::JSONValue;

let doc = from_str(" 'foo'\n").unwrap();
let context = doc.context.unwrap();

assert_eq!(&context.wsc.0, " ");
assert_eq!(doc.value, JSONValue::SingleQuotedString("foo".to_string()));
assert_eq!(&context.wsc.1, "\n");
```

## `rt::parser::JSONValue::JSONObject`

Member of the `rt::parser::JSONValue` enum representing [JSON5 objects](https://spec.json5.org/#objects).

There are two fields: `key_value_pairs`, which is a `Vec` of `JSONKeyValuePair`s, and `context` whose `wsc` is
a one-length tuple containing the whitespace/comments that occur after the opening brace. In non-empty objects,
the whitespace that precedes the closing brace is part of the last item in the `key_value_pairs` Vec.
In other words: `LBRACE { wsc.0 } [ key_value_pairs ] RBRACE`
and: `.context.wsc: (String,)`

### `rt::parser::JSONKeyValuePair`

The `JSONKeyValuePair` struct represents the ['JSON5Member' production](https://spec.json5.org/#prod-JSON5Member).
It has three fields: `key`, `value`, and `context`. The `key` is a `JSONValue`, in practice limited to `JSONValue::Identifier`,
`JSONValue::DoubleQuotedString` or `JSONValue::SingleQuotedString`. The `value` is any `JSONValue`.

Its context describes the whitespace/comments between the key
and `:`, between the `:` and the value, after the value, and (optionally) a trailing comma and the whitespace trailing the
comma.
In other words, roughly: `key { wsc.0 } COLON { wsc.1 } value { wsc.2 } [ COMMA { wsc.3 } [ next_key_value_pair ] ]`
and: `.context.wsc: (String, String, String, Option<String>)`

When `context.wsc.3` is `Some()`, it indicates the presence of a trailing comma (not included in the string) and
holds the whitespace that follows the comma. This item MUST be `Some()` when the pair is not the last member in the object.

## `rt::parser::JSONValue::JSONArray`

Member of the `rt::parser::JSONValue` enum representing [JSON5 arrays](https://spec.json5.org/#arrays).

There are two fields on this struct: `values`, which is of type `Vec<JSONArrayValue>`, and `context` which holds
a one-length tuple containing the whitespace/comments that occur after the opening bracket. In non-empty arrays,
the whitespace that precedes the closing bracket is part of the last item in the `values` Vec.
In other words: `LBRACKET { wsc.0 } [ values ] RBRACKET`
and: `.context.wsc: (String,)`

### `rt::parser::JSONArrayValue`

The `JSONArrayValue` struct represents a single member of a JSON5 Array. It has two fields: `value`, which is any
`JSONValue`, and `context` which contains the contextual whitespace/comments around the member. The `context`'s `wsc`
field is a two-length tuple for the whitespace that may occur after the value and (optionally) after the comma following the value.
In other words, roughly: `value { wsc.0 } [ COMMA { wsc.1 } [ next_value ]]`
and: `.context.wsc: (String, Option<String>)`

When `context.wsc.1` is `Some()`, it indicates the presence of the comma (not included in the string), and the string
contains any whitespace following the comma. This item MUST be `Some()` when the value is not the last member of the array.

## Other `rt::parser::JSONValue`s

- `JSONValue::Integer(String)`
- `JSONValue::Float(String)`
- `JSONValue::Exponent(String)`
- `JSONValue::Null`
- `JSONValue::Infinity`
- `JSONValue::NaN`
- `JSONValue::Hexadecimal(String)`
- `JSONValue::Bool(bool)`
- `JSONValue::DoubleQuotedString(String)`
- `JSONValue::SingleQuotedString(String)`
- `JSONValue::Unary { operator: UnaryOperator, value: Box<JSONValue> }`
- `JSONValue::Identifier(String)` (for object keys only!).

Where these enum members have `String`s, they represent the object as it was tokenized without any modifications (that
is, for example, without any escape sequences un-escaped). The single- and double-quoted `String`s do not include the surrounding
quote characters. These members alone have no `context`.

Serializing also works in the usual way. The re-exported `to_string` function comes from the `ser` module and works
as you'd expect, with default formatting.

```rust
use serde::Serialize;
use json_five::to_string;
#[derive(Serialize)]
struct Test {
    int: u32,
    seq: Vec<&'static str>,
}
let test = Test {
    int: 1,
    seq: vec!["a", "b"],
};
let expected = r#"{"int": 1, "seq": ["a", "b"]}"#;
assert_eq!(to_string(&test).unwrap(), expected);
```

You may also use `to_string_formatted` with a `FormatConfiguration` to control the output format, including
indentation, trailing commas, and key/item separators.

```rust
use serde::Serialize;
use json_five::{to_string_formatted, FormatConfiguration, TrailingComma};
#[derive(Serialize)]
struct Test {
    int: u32,
    seq: Vec<&'static str>,
}
let test = Test {
    int: 1,
    seq: vec!["a", "b"],
};

let config = FormatConfiguration::with_indent(4, TrailingComma::ALL);
let formatted_doc = to_string_formatted(&test, config).unwrap();

let expected = r#"{
    "int": 1,
    "seq": [
        "a",
        "b",
    ],
}"#;

assert_eq!(formatted_doc, expected);
```

## Examples

See the `examples/` directory for examples of programs that utilize round-tripping features.

- `examples/json5-doublequote-fixer` gives an example of tokenization-based round-tripping edits
- `examples/json5-trailing-comma-formatter` gives an example of model-based round-tripping edits

# Benchmarking

Benchmarks are available in the `benches/` directory, with test data in the `data/` directory. A couple of benchmarks use
big files that are not committed to this repo; run `./data/setupdata.sh` to download them so that the big benchmarks
are not skipped. The benchmarks compare `json_five` (this crate) to
[serde_json](https://github.com/serde-rs/json) and [json5-rs](https://github.com/callum-oakley/json5-rs).

Notwithstanding the general caveats of benchmarks, in initial testing `json_five` consistently outperforms `json5-rs`.
In typical scenarios it has been about 3-4x faster, and up to 20x faster in some synthetic tests on extremely large data.
At the time of writing (pre-v0), no performance optimizations have been made, so I expect performance to improve,
at least marginally, in the future.

These benchmarks were run on Windows on an i9-10900K with rustc 1.83.0 (90b35a623 2024-11-26). This table won't be updated unless significant changes happen.

| test                       | json_five | json5     | serde_json |
|----------------------------|-----------|-----------|------------|
| big (25MB)                 | 580.31 ms | 3.0861 s  | 150.39 ms  |
| medium-ascii (5MB)         | 199.88 ms | 706.94 ms | 59.008 ms  |
| empty                      | 228.62 ns | 708.00 ns | 38.786 ns  |
| arrays                     | 578.24 ns | 1.3228 µs | 100.95 ns  |
| objects                    | 922.91 ns | 2.0748 µs | 205.75 ns  |
| nested-array               | 22.990 µs | 29.356 µs | 5.0483 µs  |
| nested-objects             | 50.659 µs | 132.75 µs | 14.755 µs  |
| string                     | 421.17 ns | 3.5691 µs | 91.051 ns  |
| number                     | 238.75 ns | 779.13 ns | 36.179 ns  |
| deserialize (size 10)      | 6.9898 µs | 58.398 µs | 886.33 ns  |
| deserialize (size 100)     | 66.005 µs | 830.79 µs | 9.9705 µs  |
| deserialize (size 1000)    | 599.39 µs | 8.4952 ms | 69.110 µs  |
| deserialize (size 10000)   | 5.9841 ms | 82.591 ms | 734.40 µs  |
| deserialize (size 100000)  | 66.841 ms | 955.37 ms | 11.638 ms  |
| deserialize (size 1000000) | 674.13 ms | 9.5758 s  | 119.03 ms  |
| serialize (size 10)        | 2.3496 µs | 48.915 µs | 891.85 ns  |
| serialize (size 100)       | 19.602 µs | 458.98 µs | 6.7109 µs  |
| serialize (size 1000)      | 194.19 µs | 4.6035 ms | 62.667 µs  |
| serialize (size 10000)     | 2.2104 ms | 47.253 ms | 761.10 µs  |
| serialize (size 100000)    | 24.418 ms | 502.35 ms | 11.410 ms  |
| serialize (size 1000000)   | 245.26 ms | 4.6211 s  | 115.84 ms  |

# round-trip tokenizer

The `rt::tokenizer` module contains some useful tools for round-tripping tokens. The `Token`s produced by the
rt tokenizer are owned types containing the lexeme from the source. There are two key functions in the tokenizer module:

- `rt::tokenize::source_to_tokens`
- `rt::tokenize::tokens_to_source`

Each `Token` generated from `source_to_tokens` also contains some contextual information, such as line/col numbers, offsets, etc.
This contextual information is not required for `tokens_to_source` -- that is: you can create new tokens and insert them
into your tokens array and process those tokens back to JSON5 source without issue.

The `tok_type` attribute leverages the same `json_five::tokenize::TokType` types. Those are:

- `LeftBrace`
- `RightBrace`
- `LeftBracket`
- `RightBracket`
- `Comma`
- `Colon`
- `Name` (Identifiers)
- `SingleQuotedString`
- `DoubleQuotedString`
- `BlockComment`
- `LineComment` note: the lexeme includes the single trailing newline, if present (it is absent, for example, when a comment sits just before EOF in a file with no final newline)
- `Whitespace`
- `True`
- `False`
- `Null`
- `Integer`
- `Float`
- `Infinity`
- `Nan`
- `Exponent`
- `Hexadecimal`
- `Plus`
- `Minus`
- `EOF`

Note: string tokens will include surrounding quotes.

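The round-trip idea behind the tokenizer can be sketched without the crate at all: if every token keeps its exact lexeme (whitespace and quotes included), concatenating the lexemes reproduces the source, and editing a lexeme edits the document. The block below is a deliberately simplified illustration of that idea. `Tok`, `Kind`, `source_to_toks`, `toks_to_source`, and `requote` are hypothetical names invented for this sketch, not the crate's `Token`, `TokType`, or tokenizer functions, and the scanner does not treat comments or numbers specially.

```rust
// Hypothetical, simplified token model; NOT the crate's `Token`/`TokType`.
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    Punct,
    Str,
    Whitespace,
    Other,
}

#[derive(Debug, Clone, PartialEq)]
struct Tok {
    kind: Kind,
    lexeme: String, // exact source text, including quotes for strings
}

// Split the source into tokens without discarding any characters.
fn source_to_toks(src: &str) -> Vec<Tok> {
    let chars: Vec<char> = src.chars().collect();
    let mut toks = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let start = i;
        let c = chars[i];
        let kind = if c.is_whitespace() {
            while i < chars.len() && chars[i].is_whitespace() {
                i += 1;
            }
            Kind::Whitespace
        } else if c == '\'' || c == '"' {
            i += 1;
            while i < chars.len() && chars[i] != c {
                if chars[i] == '\\' {
                    i += 1; // skip the escaped character
                }
                i += 1;
            }
            i += 1; // closing quote
            Kind::Str
        } else if "{}[],:".contains(c) {
            i += 1;
            Kind::Punct
        } else {
            while i < chars.len()
                && !chars[i].is_whitespace()
                && !"{}[],:'\"".contains(chars[i])
            {
                i += 1;
            }
            Kind::Other
        };
        let end = i.min(chars.len());
        toks.push(Tok { kind, lexeme: chars[start..end].iter().collect() });
    }
    toks
}

// Rebuild source text; only the lexemes are needed, no position data.
fn toks_to_source(toks: &[Tok]) -> String {
    toks.iter().map(|t| t.lexeme.as_str()).collect()
}

// Naive edit: turn simple single-quoted strings into double-quoted ones.
// Strings containing `"` or `\` are left untouched to keep the sketch safe.
fn requote(src: &str) -> String {
    let mut toks = source_to_toks(src);
    for t in toks.iter_mut() {
        let simple = t.kind == Kind::Str
            && t.lexeme.len() >= 2
            && t.lexeme.starts_with('\'')
            && t.lexeme.ends_with('\'')
            && !t.lexeme.contains('"')
            && !t.lexeme.contains('\\');
        if simple {
            let inner = t.lexeme[1..t.lexeme.len() - 1].to_string();
            t.lexeme = format!("\"{}\"", inner);
        }
    }
    toks_to_source(&toks)
}
```

Because `toks_to_source` reads only lexemes, a token created by hand (with no line or column information) can be inserted into the vector and serialized like any other, which is the property the section above describes for the real tokenizer.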

# Notes