Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concerns on whitespace emitting and parsing #113

Open
jekku opened this issue Sep 8, 2022 · 3 comments
Open

Concerns on whitespace emitting and parsing #113

jekku opened this issue Sep 8, 2022 · 3 comments

Comments

@jekku
Copy link

jekku commented Sep 8, 2022

Context

We use Saxy to parse XML files from several APIs.

There have been changes introduced to emit whitespaces correctly, connected to [this issue and its pull request](#51).

Concerns

For our first concern related to the above, we noticed a reduction in performance when using the latest version of Saxy (v1.4.0), as opposed to a fork based on v0.9.1.

We did some digging to check whether it was changes on our end, or if it was a performance regression in this library.

We observed that the generated SimpleForm data from a pretty-printed XML, using the latest version of saxy is double the file size from the original one. Very likely caused or correlated by the emitted / parsed whitespaces.

Running benchmarks between SimpleForm results coming from both versions, we found that parsed / emitted whitespaces cause some performance regressions.

Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 16 GB
Elixir 1.13.4
Erlang 25.0.4

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 5 s
parallel: 5
inputs: none specified
Estimated total run time: 24 s

Benchmarking new_saxy_version ...
Benchmarking old_saxy_version ...

Name                       ips        average  deviation         median         99th %
old_saxy_version          4.33      231.03 ms     ±3.94%      229.68 ms      257.67 ms
new_saxy_version          4.22      237.09 ms     ±4.67%      234.80 ms      265.52 ms

Comparison:
old_saxy_version          4.33
new_saxy_version          4.22 - 1.03x slower +6.05 ms

Reduction count statistics:

Name                     average  deviation         median         99th %
old_saxy_version         20.86 M     ±0.04%        20.86 M        20.87 M
new_saxy_version         24.84 M     ±0.03%        24.84 M        24.85 M

Comparison:
old_saxy_version         20.86 M
new_saxy_version         24.84 M - 1.19x reduction count +3.98 M

The second concern is about whitespace values within tags. Previously, an XML in the shape of:

<Body> <Value> </Value> </Body>

Would result into the contents of Value being parsed as nil. In the current build, it is parsed as a string containing whitespace.

In a previous issue mentioned in the beginning, this should only happen when certain attributes are provided. Like so:

<Body> <Value xml:space="preserve"> </Value> </Body>

Proposed solutions

  • We should only emit whitespaces when xml:space="preserve" is provided. By default whitespaces should not be emitted.
  • Alternatively, providing an option to ignore whitespaces might be a way to solve it, such that we can do calls similar to these:
Saxy.parse_stream(stream, Saxy.SimpleForm.Handler, [emit_whitespaces?: false])
@qcam
Copy link
Owner

qcam commented Sep 18, 2022

@jekku Thanks for the comprehensive report!

Unfortunately #51 was a bug in the parser that we need to fix. Saxy should match and emit all CharData characters, and leave the content handling to the user.

I understand it introduced an inconvenient breaking change. The patch was released with a major version bump (from v0 to v1) though I learned the CHANGELOG doesn't emphasize that enough. I did expect some slowness since the fix indeed adds work for the parser but didn't anticipate it to be noticeable.

If you could provide the benchmark suite with the sample code which causes the performance penalty, I could look into what could possibly be optimized in the parser.

@qcam
Copy link
Owner

qcam commented Sep 18, 2022

Linking #103 since it could be relevant.

@jekku
Copy link
Author

jekku commented Oct 18, 2022

Hello. Just got back to this.

I forked the old saxy version into a different package locally to avoid namespace conflicts and then did:

Benchee.run(
  "old_saxy_version" => OlderSaxy.parse_stream(stream, OldSaxy.SimpleForm.Handler, []),
  "new_saxy_version" => Saxy.parse_stream(stream, Saxy.SimpleForm.Handler, []),
)

The file we used to stream is a big pretty printed formatted XML file that's around 10MB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants