Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support unicode idents (matching rust) #444

Closed
wants to merge 15 commits into from

Conversation

ModProg
Copy link
Contributor

@ModProg ModProg commented Mar 25, 2023

  • I've included my change in CHANGELOG.md

This not only changes the ident parsing, but also uses &str instead of &[u8]
for the rest of the parsing (sometime as a byte slice through .bytes()).

If you prefer to have a more restricted implementation that only changes
.*ident*() methods on Bytes I can revert the other changes and use
from_utf8_unchecked instead.

But as I saw that the benchmark using this implementation was more or less on
par with the current implementation I wanted to share it at least:

ron-new is this PR's string based implementation, ron is the current git
version.

Benchmarking Serde Deserialization/json/data/canada: Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 89.8s, or reduce sample count to 550.
Serde Deserialization/json/data/canada
                        time:   [8.9827 ms 8.9938 ms 9.0056 ms]
                        change: [-3.4193% -2.9076% -2.4310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 860 outliers among 10000 measurements (8.60%)
  1 (0.01%) low mild
  493 (4.93%) high mild
  366 (3.66%) high severe
Benchmarking Serde Deserialization/ron/data/canada: Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 317.2s, or reduce sample count to 150.
Serde Deserialization/ron/data/canada
                        time:   [31.000 ms 31.028 ms 31.056 ms]
                        change: [-1.5615% -1.2462% -0.9532%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 521 outliers among 10000 measurements (5.21%)
  209 (2.09%) high mild
  312 (3.12%) high severe
Benchmarking Serde Deserialization/ron-new/data/canada: Warming up for 3.0000 s
Warning: Unable to complete 10000 samples in 5.0s. You may wish to increase target time to 320.1s, or reduce sample count to 150.
Serde Deserialization/ron-new/data/canada
                        time:   [30.619 ms 30.638 ms 30.658 ms]
                        change: [-2.9465% -2.7434% -2.5733%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 164 outliers among 10000 measurements (1.64%)
  63 (0.63%) high mild
  101 (1.01%) high severe

I also didn't add any tests yet.

fixes #321

src/parse.rs Outdated Show resolved Hide resolved
src/parse.rs Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Mar 25, 2023

Codecov Report

Patch coverage: 86.23% and project coverage change: -2.88 ⚠️

Comparison is base (5a407f3) 86.73% compared to head (091f715) 83.85%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #444      +/-   ##
==========================================
- Coverage   86.73%   83.85%   -2.88%     
==========================================
  Files          59       60       +1     
  Lines        7280     7561     +281     
==========================================
+ Hits         6314     6340      +26     
- Misses        966     1221     +255     
Impacted Files Coverage Δ
tests/struct_integers.rs 95.65% <ø> (-4.35%) ⬇️
src/parse.rs 68.40% <81.35%> (-24.00%) ⬇️
tests/unicode.rs 82.75% <82.35%> (-17.25%) ⬇️
src/error.rs 37.76% <88.00%> (-4.69%) ⬇️
src/de/mod.rs 74.26% <96.39%> (-2.67%) ⬇️
src/de/tests.rs 100.00% <100.00%> (ø)
src/extensions.rs 100.00% <100.00%> (ø)
src/ser/mod.rs 75.65% <100.00%> (+4.50%) ⬆️
tests/321_unicode_ident.rs 100.00% <100.00%> (ø)
tests/407_raw_value.rs 100.00% <100.00%> (ø)
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@juntyr
Copy link
Member

juntyr commented Mar 25, 2023

@ModProg I've only had time to briefly skim your changes so far, but (formatting excluded) they look good so far. Adding some tests would definitely be the next step, and I'll try to review the PR more thoroughly over the coming days. Thank you already for your work!

@juntyr juntyr self-requested a review March 25, 2023 21:41
@ModProg
Copy link
Contributor Author

ModProg commented Mar 25, 2023

Yes, def needs some cleanup.

@juntyr
Copy link
Member

juntyr commented Mar 26, 2023

Once you get through the final cleanup, there are also still some dbg!() macros left

@juntyr
Copy link
Member

juntyr commented Mar 27, 2023

Could you also add tests for the new crash cases you discovered to document which errors they should produce (and on which column)?

@ModProg
Copy link
Contributor Author

ModProg commented Mar 27, 2023

Could you also add tests for the new crash cases you discovered to document which errors they should produce (and on which column)?

The crashes occurred due to a mistake I made with byte offsets, checked them in by accident.

src/parse.rs Outdated Show resolved Hide resolved
src/parse.rs Outdated Show resolved Hide resolved
src/parse.rs Show resolved Hide resolved
src/parse.rs Outdated Show resolved Hide resolved
@juntyr
Copy link
Member

juntyr commented Mar 28, 2023

I am wondering if we could not entirely stop relying on bytes but replace all "next byte" methods with "next character" methods. Would this be possible? What are your thoughts @ModProg?

@ModProg
Copy link
Contributor Author

ModProg commented Mar 28, 2023

I am wondering if we could not entirely stop relying on bytes but replace all "next byte" methods with "next character" methods. Would this be possible? What are your thoughts @ModProg?

probably, we should run the benchmark with that change

@ModProg
Copy link
Contributor Author

ModProg commented Mar 29, 2023

I just noticed that I made a stupid mistake while benchmarking and benchmarked the wrong versions... The performance impact of some of my changes is larger than I thought and I will have to check that tomorrow.

@ModProg
Copy link
Contributor Author

ModProg commented Mar 29, 2023

I just noticed that I made a stupid mistake while benchmarking and benchmarked the wrong versions... The performance impact of some of my changes is larger than I thought and I will have to check that tomorrow.

(in total they ended up increasing the benchmark by ca. 50%)

src/de/mod.rs Outdated Show resolved Hide resolved
src/parse.rs Outdated Show resolved Hide resolved
@juntyr
Copy link
Member

juntyr commented Mar 29, 2023

I just noticed that I made a stupid mistake while benchmarking and benchmarked the wrong versions... The performance impact of some of my changes is larger than I thought and I will have to check that tomorrow.

(in total they ended up increasing the benchmark by ca. 50%)

That is significant ... perhaps once the code changes have stabilised a bit we can look into where the perf hit comes from. So far your changes seem to make the code more readable, which would be a counter-benefit.

@ModProg
Copy link
Contributor Author

ModProg commented Mar 29, 2023

Performance still 30% worse

@ModProg
Copy link
Contributor Author

ModProg commented Mar 30, 2023

@juntyr I got performance back to where it was before this PR, but it would maybe be best if someone could verify.

@juntyr
Copy link
Member

juntyr commented Mar 30, 2023

@juntyr I got performance back to where it was before this PR, but it would maybe be best if someone could verify.

Ok, that sounds wonderful! I'll have a deeper look sometime later, check the performance, and see if I can find anything else :)

@juntyr
Copy link
Member

juntyr commented Apr 2, 2023

@ModProg Ok, I've been doing a bit of toying around with the changes. First, I just did a bit of cleanup with clippy and extended the docs. I also tested removing all possible byte strings from parsing, please feel free to disregard those changes. You can find my experiments here:
juntyr@c7e428b

I also tested the other side of ron, serialisation. Previously we also used byte strings heavily there and just hoped that it would all come out as UTF8 on the other end. I've tested switching it to UTF8 without breaking the existing API surface in:
juntyr@c6c0997

I'm quite happy with this PR! If you include the clippy fixes, we can definitely land the existing changes and add onto them with UTF8 serialisation and a better Value for ints in follow-up commits.

@ModProg
Copy link
Contributor Author

ModProg commented Apr 2, 2023

I also tested removing all possible byte strings from parsing, please feel free to disregard those changes. You can find my experiments here:
juntyr@c7e428b

Those seem to add about 30% of time to the benchmark on my machine (from 29 to 38ms)

@juntyr
Copy link
Member

juntyr commented Apr 5, 2023

(sorry for the long review delay, I'm finishing up my second thesis atm and it will occupy me for a few more days)

@ModProg
Copy link
Contributor Author

ModProg commented Apr 6, 2023

(sorry for the long review delay, I'm finishing up my second thesis atm and it will occupy me for a few more days)

no worries, I am in no rush.

@juntyr
Copy link
Member

juntyr commented Aug 24, 2023

@ModProg I am really sorry that it has taken me so long to get back to this PR, life has been happening.

I want to get this PR in before v0.9, as it is the last big change I intend to include in that. I will get started with rebasing the PR (on top of #438) over the coming days and will try to get it merged soon.

I am slowly trying to make ron even more Rusty (e.g. with Rusty byte strings in #438 and typed number suffixes and underscores in floats in #481), and this PR is an important step in that direction. Thank you so much for working on it!

@juntyr juntyr marked this pull request as draft August 25, 2023 22:38
@juntyr
Copy link
Member

juntyr commented Sep 3, 2023

#488 has now landed, which supersedes this PR

@juntyr juntyr closed this Sep 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deserializer doesn't support unicode identifiers
3 participants