Implement number parsing from simdjson v0.3.1 - second try #253

philss · 2022-10-16T21:02:54Z

This PR introduces the more correct number parsing from simdjson, version 0.3.1.

The work is based on a previous PR from @ijl - #127

This PR includes a new feature flag to enable the behaviour prior to this: approx-number-parsing. Thanks @Licenser for the work on this.

Another difference is that this includes the 128bit feature for the new version.

Closes #251

PS 1: I need to post the benchmarks. Going to do this tomorrow.
PS 2: I didn't change the docs/commentary, neither fix issues from clippy. I can do this as well tomorrow.

codecov · 2022-10-16T21:14:03Z

Codecov Report

Base: 82.60% // Head: 82.68% // Increases project coverage by +0.07% 🎉

Coverage data is based on head (fa8df70) compared to base (224fcc8).
Patch coverage: 95.84% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #253      +/-   ##
==========================================
+ Coverage   82.60%   82.68%   +0.07%     
==========================================
  Files          34       35       +1     
  Lines       10623    10545      -78     
==========================================
- Hits         8775     8719      -56     
+ Misses       1848     1826      -22

Flag	Coverage Δ
	`82.13% <95.84%> (+0.11%)`	⬆️
128bit	`82.52% <96.48%> (+0.04%)`	⬆️
avx2	`82.19% <95.84%> (+0.07%)`	⬆️
beef	`82.69% <95.84%> (+0.09%)`	⬆️
knownkey	`82.68% <95.84%> (+0.07%)`	⬆️
pclmulqdq	`82.17% <95.84%> (+0.09%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/error.rs	`75.00% <ø> (ø)`
src/numberparse.rs	`100.00% <ø> (+8.96%)`	⬆️
src/numberparse/correct.rs	`95.64% <95.64%> (ø)`
src/lib.rs	`97.50% <100.00%> (+0.01%)`	⬆️
src/stage2.rs	`96.86% <100.00%> (-0.86%)`	⬇️
src/serde/value/borrowed/se.rs	`41.74% <0.00%> (-0.19%)`	⬇️
src/serde/value/borrowed/de.rs	`59.09% <0.00%> (-0.17%)`	⬇️
src/serde/value/owned/de.rs	`62.46% <0.00%> (-0.16%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

src/numberparse/correct.rs

Licenser · 2022-10-17T09:05:35Z

Really nice 👍, I like it, only thing is the parse_number function signature I'm not too worried about which of the two to pick just that they're the same :)

This makes the interface similar to the "correct" number parsing, and avoid the duplication at "stage2".

Licenser

🚀 Brilliant! Thank you :D

philss · 2022-10-17T17:05:55Z

@Licenser so here is a comparison between the new version and the old one:

$ cargo bench
apache_builds/simd_json::to_tape
                        time:   [45.094 µs 45.346 µs 45.591 µs]
                        change: [-6.1785% -4.7876% -3.3903%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
apache_builds/simd_json::to_borrowed_value
                        time:   [119.68 µs 119.99 µs 120.28 µs]
                        change: [+0.1303% +0.3994% +0.6900%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
apache_builds/simd_json::to_owned_value
                        time:   [239.90 µs 240.81 µs 241.85 µs]
                        change: [+0.5983% +0.9693% +1.3318%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

canada/simd_json::to_tape
                        time:   [2.9022 ms 2.9234 ms 2.9501 ms]
                        change: [+61.780% +64.279% +66.726%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking canada/simd_json::to_borrowed_value: Warming up for 1.0000 s
Warning: Unable to complete 100 samples in 20.0s. You may wish to increase target time to 28.1s, enable flat sampling, or reduce sample count to 50.
canada/simd_json::to_borrowed_value
                        time:   [5.3654 ms 5.3756 ms 5.3870 ms]
                        change: [+25.664% +25.954% +26.266%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) low mild
  1 (1.00%) high severe
Benchmarking canada/simd_json::to_owned_value: Warming up for 1.0000 s
Warning: Unable to complete 100 samples in 20.0s. You may wish to increase target time to 28.8s, enable flat sampling, or reduce sample count to 50.
canada/simd_json::to_owned_value
                        time:   [5.5307 ms 5.5462 ms 5.5643 ms]
                        change: [+33.910% +34.522% +35.157%] (p = 0.00 < 0.05)
                        Performance has regressed.

citm_catalog/simd_json::to_tape
                        time:   [721.99 µs 723.01 µs 724.10 µs]
                        change: [+15.370% +17.422% +19.529%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
citm_catalog/simd_json::to_borrowed_value
                        time:   [1.7609 ms 1.7637 ms 1.7664 ms]
                        change: [+9.1757% +9.9526% +10.669%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  14 (14.00%) low severe
  4 (4.00%) high mild
  2 (2.00%) high severe
citm_catalog/simd_json::to_owned_value
                        time:   [2.3977 ms 2.3990 ms 2.4003 ms]
                        change: [+5.3221% +5.8969% +6.4064%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 25 outliers among 100 measurements (25.00%)
  14 (14.00%) low severe
  5 (5.00%) high mild
  6 (6.00%) high severe

log/simd_json::to_tape  time:   [901.81 ns 906.80 ns 910.67 ns]
                        change: [-2.4722% -1.2121% +0.1193%] (p = 0.07 > 0.05)
                        No change in performance detected.
log/simd_json::to_borrowed_value
                        time:   [1.6889 µs 1.6977 µs 1.7047 µs]
                        change: [+0.1057% +0.7008% +1.2702%] (p = 0.02 < 0.05)
                        Change within noise threshold.
log/simd_json::to_owned_value
                        time:   [3.6619 µs 3.6634 µs 3.6651 µs]
                        change: [-0.1406% -0.0755% +0.0016%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low severe
  8 (8.00%) high mild
  5 (5.00%) high severe

twitter/simd_json::to_tape
                        time:   [242.57 µs 243.27 µs 243.82 µs]
                        change: [+0.9054% +2.7445% +4.6065%] (p = 0.00 < 0.05)
                        Change within noise threshold.
twitter/simd_json::to_borrowed_value
                        time:   [599.43 µs 601.60 µs 603.31 µs]
                        change: [+1.1898% +1.7737% +2.3186%] (p = 0.00 < 0.05)
                        Performance has regressed.
twitter/simd_json::to_owned_value
                        time:   [1.0780 ms 1.0804 ms 1.0825 ms]
                        change: [+0.1648% +0.4393% +0.6964%] (p = 0.00 < 0.05)
                        Change within noise threshold.

And running the perf file on main:

$ cargo run --release --example perf --features perf -- --baseline
                                    Instructions             Cache.              Cycle/byte
        Name           Cycles    Normal.     Branch     Misses   References    Best       Avg
apache_builds            473733    1501796     203584       1615      41891      3.677      3.722
canada                 16377411   52059260    8339754      50203     809507      7.205      7.275
citm_catalog            6186946   20904404    3059458      43122     551051      3.207      3.582
github_events            196072     577217      70402       2042      18309      2.905      3.010
gsoc-2018               5727720   15881320    1467366      45027     634238      1.691      1.721
instruments             1086079    3101098     439331      11511      95058      4.845      4.929
log                        8324      23915       2914         21        829      3.563      3.827
marine_ik              26121243   68833970   11378275      71421    1495889      8.681      8.755
mesh                    6123522   15449576    2673161       6539     352969      8.420      8.463
mesh.pretty             7494155   21262520    3401106       8008     461639      4.726      4.751
numbers                  841116    2310800     345855       1085      50913      5.568      5.603
random                  2987024    9093028    1278887      32010     294740      5.818      5.851
twitter                 2340567    6746892     787240      25008     213382      3.670      3.706
twitterescaped          3151726    9386014    1284856      26224     208258      5.561      5.604
update-center           2354684    6419202     854014      15985     183999      4.368      4.416

Versus this branch:

$  cargo run --release --example perf --features perf
                                    Instructions             Cache.              Cycle/byte
        Name           Cycles    Normal.     Branch     Misses   References    Best       Avg
apache_builds            430096    1428620     190849       1324      38388      3.342      3.379
apache_builds(+/-)     -10.146%    -5.122%    -6.673%   -21.979%    -9.125%   -10.003%   -10.146%
canada                 18623941   64874450   10156178      40839     801088      8.223      8.273
canada(+/-)             12.063%    19.754%    17.885%   -22.929%    -1.051%    12.381%    12.063%
citm_catalog            5682995   19846716    2786410      17146     479069      3.181      3.290
citm_catalog(+/-)       -8.868%    -5.329%    -9.799%  -151.499%   -15.025%    -0.809%    -8.868%
github_events            187436     571008      69506       2178      19165      2.802      2.878
github_events(+/-)      -4.607%    -1.087%    -1.289%     6.244%     4.466%    -3.668%    -4.607%
gsoc-2018               5536468   15726945    1447318      43783     631928      1.633      1.664
gsoc-2018(+/-)          -3.454%    -0.982%    -1.385%    -2.841%    -0.366%    -3.576%    -3.454%
instruments             1064396    3212796     443614      10520      94552      4.742      4.831
instruments(+/-)        -2.037%     3.477%     0.965%    -9.420%    -0.535%    -2.173%    -2.037%
log                        8201      23280       2827         26        852      3.494      3.771
log(+/-)                -1.500%    -2.728%    -3.077%    19.231%     2.700%    -1.974%    -1.500%
marine_ik              27950053   80932676   13025486      65427    1480295      9.286      9.368
marine_ik(+/-)           6.543%    14.949%    12.646%    -9.161%    -1.053%     6.523%     6.543%
mesh                    6586025   18914288    3062289       6860     353356      9.059      9.102
mesh(+/-)                7.022%    18.318%    12.707%     4.679%     0.110%     7.050%     7.022%
mesh.pretty             9328963   31226203    5064481       8328     461656      5.871      5.914
mesh.pretty(+/-)        19.668%    31.908%    32.844%     3.842%     0.004%    19.513%    19.668%
numbers                  946921    3007911     445869       1198      50708      6.264      6.308
numbers(+/-)            11.174%    23.176%    22.431%     9.432%    -0.404%    11.112%    11.174%
random                  2933293    9126465    1285915      34969     295091      5.706      5.746
random(+/-)             -1.832%     0.366%     0.547%     8.462%     0.119%    -1.961%    -1.832%
twitter                 2258294    6621953     771737      25760     214103      3.533      3.576
twitter(+/-)            -3.643%    -1.887%    -2.009%     2.919%     0.337%    -3.901%    -3.643%
twitterescaped          3088402    9255751    1269368      28086     209138      5.447      5.491
twitterescaped(+/-)     -2.050%    -1.407%    -1.220%     6.630%     0.421%    -2.097%    -2.050%
update-center           2278730    6323905     850337      15561     181607      4.234      4.274
update-center(+/-)      -3.333%    -1.507%    -0.432%    -2.725%    -1.317%    -3.167%    -3.333%

So yeah, it's slower, but not that much if you don't have a bunch of floats in the file.

Licenser · 2022-10-18T08:52:25Z

Ja looks good and original / approximate / faster behaviour is preserved via flag 👍

philss added 7 commits October 14, 2022 16:43

WIP: bring some files from ijl-parse-number branch

ddd6388

WIP: make approximation version work

13e18ca

WIP: run clippy against "correct" float parsing

51617a3

Fix "correct" parser of numbers

ad6cbab

Add test that with float 0.6

03ad36d

Fix branches of stage2 for parsing numbers approx

505798e

Fix 128bit feature for "correct" number parsing

41a66c5

philss added 2 commits October 16, 2022 23:21

WIP: fix importing

1d42a9d

Fix clippy issues

97a2875

Licenser reviewed Oct 17, 2022

View reviewed changes

src/numberparse/correct.rs Outdated Show resolved Hide resolved

philss added 3 commits October 17, 2022 12:17

Rename for consistency

fd93629

Add "0.6" as a float test for the new algo

f0217db

Change approximet parse number to take a buffer slice

fa8df70

This makes the interface similar to the "correct" number parsing, and avoid the duplication at "stage2".

Licenser approved these changes Oct 17, 2022

View reviewed changes

Licenser merged commit 90144db into simd-lite:main Oct 18, 2022

philss deleted the ps-parse-number-simdjson-0.3.1 branch October 18, 2022 13:11

philss mentioned this pull request Oct 21, 2022

Float parsing errors in the JSON parser after 0.24.0 pola-rs/polars#5021

Closed

2 tasks

Licenser mentioned this pull request Oct 21, 2022

align number parsing with upstream #61

Closed

lmatz mentioned this pull request Oct 26, 2022

connector: precision issue of float numbers with different JSON parser risingwavelabs/risingwave#5804

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement number parsing from simdjson v0.3.1 - second try #253

Implement number parsing from simdjson v0.3.1 - second try #253

philss commented Oct 16, 2022

codecov bot commented Oct 16, 2022 •

edited

Licenser commented Oct 17, 2022

Licenser left a comment

philss commented Oct 17, 2022

Licenser commented Oct 18, 2022

Implement number parsing from simdjson v0.3.1 - second try #253

Implement number parsing from simdjson v0.3.1 - second try #253

Conversation

philss commented Oct 16, 2022

codecov bot commented Oct 16, 2022 • edited

Codecov Report

Licenser commented Oct 17, 2022

Licenser left a comment

Choose a reason for hiding this comment

philss commented Oct 17, 2022

Licenser commented Oct 18, 2022

codecov bot commented Oct 16, 2022 •

edited