
Various performance optimizations #173

Merged
merged 37 commits into from Nov 13, 2022

Conversation

@sharifhsn (Contributor) commented Oct 8, 2022

I did some profiling of hexyl using `cargo flamegraph` and identified some key performance optimizations that could be made.

The three operations that take up the most time are:

- `write_all` into the buffer and then stdout
- the `format!` call in `print_position_panel`
- multiple uses of the `write!` and `writeln!` macros in the `print_*` functions

[flamegraph image]

The three commits in the pull request address these three issues.

The first replaces the manual buffers used by hexyl for reading and writing with `BufReader` and `BufWriter`, respectively. This required some major refactoring because of the way that `buffer_line` and `raw_line` were used. I feel the refactoring was appropriate and made the code much easier to read, so it was worth it even without the performance gains. With this change, here is the benchmark:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hexyl data` | 1.068 ± 0.015 | 1.045 | 1.084 | 1.23 ± 0.03 |
| `hexyl --no-squeezing data` | 1.073 ± 0.035 | 1.041 | 1.164 | 1.23 ± 0.04 |
| `target/release/hexyl data` | 0.895 ± 0.019 | 0.871 | 0.917 | 1.03 ± 0.03 |
| `target/release/hexyl --no-squeezing data` | 0.871 ± 0.013 | 0.851 | 0.899 | 1.00 |
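The buffering change can be illustrated with a minimal sketch; `copy_hex_dump` and the fixed 16-byte line are illustrative assumptions, not hexyl's actual code:

```rust
use std::io::{self, BufReader, BufWriter, Read, Write};

// Minimal sketch of the idea: wrap the raw handles in BufReader/BufWriter so
// each syscall moves a large block instead of a handful of bytes.
// copy_hex_dump is a hypothetical helper, not hexyl's real API.
fn copy_hex_dump<R: Read, W: Write>(input: R, output: W) -> io::Result<()> {
    let mut reader = BufReader::new(input);
    let mut writer = BufWriter::new(output);
    let mut line = [0u8; 16]; // one 16-byte display line per iteration
    loop {
        let n = reader.read(&mut line)?;
        if n == 0 {
            break; // Ok(0) from read means end of input
        }
        for byte in &line[..n] {
            write!(writer, "{:02x} ", byte)?;
        }
        writer.write_all(b"\n")?;
    }
    writer.flush() // BufWriter buffers output, so flush before dropping
}

fn main() {
    let mut out = Vec::new();
    copy_hex_dump(&[0x00u8, 0xff][..], &mut out).unwrap();
    print!("{}", String::from_utf8(out).unwrap());
}
```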

The second commit replaces the `format!` call in `print_position_panel` with a workaround. Previously, the function would take the position, format it as hexadecimal, and then paint it if color was enabled. This was expensive! To avoid it, I used the same strategy as with `byte_hex_panel` and `byte_char_panel` and precomputed the values in `new` so they are only created once, through `byte_hex_panel_g` and `byte_char_panel_g`. This adds a tiny amount of overhead to startup, but it's invisible in profiling. Here is the benchmark with both the first and second commits:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hexyl data` | 1.181 ± 0.025 | 1.135 | 1.215 | 1.67 ± 0.05 |
| `hexyl --no-squeezing data` | 1.122 ± 0.024 | 1.091 | 1.165 | 1.59 ± 0.05 |
| `target/release/hexyl data` | 0.738 ± 0.015 | 0.713 | 0.767 | 1.04 ± 0.03 |
| `target/release/hexyl --no-squeezing data` | 0.706 ± 0.014 | 0.687 | 0.729 | 1.00 |
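The precomputation strategy can be sketched as follows; `build_hex_table` and `format_position` are hypothetical names standing in for the real tables:

```rust
// Sketch of the precomputation strategy: build the two-character hex string
// for every possible byte once at startup, so the per-line hot path only
// does table lookups instead of calling format!.
// build_hex_table/format_position are illustrative names, not hexyl's API.
fn build_hex_table() -> Vec<String> {
    (0u16..256).map(|b| format!("{:02x}", b)).collect()
}

// Render a 64-bit offset by looking up each of its bytes, most significant first.
fn format_position(table: &[String], position: u64) -> String {
    position
        .to_be_bytes()
        .iter()
        .map(|b| table[*b as usize].as_str())
        .collect()
}

fn main() {
    let table = build_hex_table();
    println!("{}", format_position(&table, 0x1234)); // 0000000000001234
}
```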

The last and most dramatic optimization was eliminating the `write!` macros sprinkled throughout the three print methods that run on every line, replacing them with the more direct `write_all` method on `self.writer`. This wasn't actually too difficult, because none of those `write!` calls used any formatting machinery. I didn't remove every use of `write!`, since profiling shows it isn't significant outside the hot loop. Here is the benchmark with all of these commits:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hexyl data` | 1.062 ± 0.027 | 1.024 | 1.101 | 3.69 ± 0.11 |
| `hexyl --no-squeezing data` | 1.072 ± 0.011 | 1.060 | 1.099 | 3.73 ± 0.07 |
| `target/release/hexyl data` | 0.332 ± 0.007 | 0.323 | 0.343 | 1.15 ± 0.03 |
| `target/release/hexyl --no-squeezing data` | 0.287 ± 0.005 | 0.282 | 0.299 | 1.00 |
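The difference between `write!` and `write_all` for fixed text can be sketched like this (the separator string is an illustrative assumption):

```rust
use std::io::{self, Write};

// Sketch of the third commit's idea: a fixed separator needs no formatting,
// so writing its bytes directly with write_all skips the core::fmt machinery
// that a write! call would go through on every line.
fn print_separator<W: Write>(writer: &mut W) -> io::Result<()> {
    // Before (conceptually): write!(writer, "│ ")?;
    // After: a direct copy of the same bytes into the buffered writer.
    writer.write_all("│ ".as_bytes())
}

fn main() {
    let mut out = Vec::new();
    print_separator(&mut out).unwrap();
    assert_eq!(out, "│ ".as_bytes());
}
```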

And here is the final flamegraph:

[flamegraph image]

@sharifhsn marked this pull request as ready for review October 8, 2022 00:52
@sharifhsn (Contributor, Author)

As an interesting footnote, I looked over #66 when writing this for reference on profiling. With the new optimizations, hexyl is only about 1.5x slower than xxd on pure random data:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `xxd data` | 116.9 ± 2.7 | 112.9 | 122.8 | 1.00 |
| `target/release/hexyl data` | 183.8 ± 5.2 | 179.5 | 199.3 | 1.57 ± 0.06 |

but beats it by 2.5x on pure squeezed data:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `xxd data` | 113.5 ± 2.8 | 110.6 | 122.1 | 2.63 ± 0.13 |
| `target/release/hexyl data` | 43.2 ± 1.8 | 41.8 | 54.7 | 1.00 |

For hexdump, the opposite is true. hexyl is faster by about 5.5x on pure random data:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hexdump -C data` | 1.031 ± 0.012 | 1.016 | 1.050 | 5.67 ± 0.09 |
| `target/release/hexyl data` | 0.182 ± 0.002 | 0.179 | 0.187 | 1.00 |

but is about 4.25x slower on pure squeezed data:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `hexdump -C data` | 10.2 ± 0.3 | 9.2 | 11.5 | 1.00 |
| `target/release/hexyl data` | 43.2 ± 1.4 | 41.8 | 48.4 | 4.25 ± 0.20 |

@sharkdp (Owner) commented Oct 8, 2022

Awesome work and fantastic documentation - Thank you very much! It might be a few days before I get to a full review.

> but is about 4.25x slower on pure squeezed data:

I don't really know how hexdump does this. To be completely honest, I don't even know how we currently do it. But I could imagine that the fastest way would be to use something similarly optimized to memchr (Rust wrapper: https://docs.rs/memchr/latest/memchr/), but in reverse (searching for anything but \0).
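The memchr crate only searches *for* a given byte, so an inverted scan has to be hand-rolled. A plain iterator version of the idea might look like this (`find_non_zero` is a hypothetical helper, not part of hexyl or memchr):

```rust
// The memchr crate has no "anything but this byte" search, so a reverse
// scan must be written by hand. This plain iterator version illustrates
// the idea; find_non_zero is a hypothetical helper, not part of hexyl.
fn find_non_zero(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b != 0)
}

fn main() {
    assert_eq!(find_non_zero(&[0, 0, 3, 0]), Some(2));
    assert_eq!(find_non_zero(&[0, 0, 0]), None);
}
```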

@sharifhsn (Contributor, Author)

I've added a few micro-optimizations, but there's a limit to how much more I can optimize.

The fundamental problem is that hexyl processes the input byte by byte instead of line by line. It's not possible to search an entire line at once to check if a non-squeeze byte is in it because hexyl doesn't know ahead of time what the next bytes are. This places a hard limit on how fast the squeezer can be.

Personally, I'm not a fan of this architectural decision and I would like to change it so it's line by line. That way, each squeezed line could be skipped at the read step before trying to print it. It's impossible to profile without actually writing it but I think this could be another opportunity for optimization.

However, that would essentially constitute an entire rewrite of Printer and Squeezer. I'm down for the challenge, but I think the optimizations I've already written stand on their own. We should merge this PR, then I'll do some experimental rewriting to see if it's worth it.

@sharifhsn (Contributor, Author)

Also, implementing a version of memchr would almost certainly not be worth it. The haystack in this case is the width of a single line, which will usually be 16 bytes and probably not more than 128. memchr is optimized for throughput, not latency. It was worth investigating though, and I might use it in some other future program.

@sharkdp (Owner) commented Oct 9, 2022

> Also, implementing a version of memchr would almost certainly not be worth it. The haystack in this case is the width of a single line, which will usually be 16 bytes and probably not more than 128. memchr is optimized for throughput, not latency. It was worth investigating though, and I might use it in some other future program.

Hm, why would the haystack be limited to a single line? I was thinking of files with large blocks of zeros. Couldn't we "scan ahead" and only then print the required output? Something like this: If we discover a full line of zeros with our usual (byte-based/line-based) method, don't print anything yet. Start a fast lookahead search for the next nonzero byte to figure out the size of the zero-block. Print the appropriate "squeezed lines" output. Continue normally after the zero block.
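The scan-ahead idea could be sketched as follows, assuming the upcoming data is already in a buffer (`zero_lines_ahead` is a hypothetical helper, not hexyl code):

```rust
// Sketch of the scan-ahead: once one zero line is seen, count how many
// further complete lines in the buffered data are also all zero, emit a
// single squeeze marker for the block, then resume normal printing.
// zero_lines_ahead is a hypothetical helper, not part of hexyl.
fn zero_lines_ahead(buffered: &[u8], line_len: usize) -> usize {
    buffered
        .chunks_exact(line_len)
        .take_while(|line| line.iter().all(|&b| b == 0))
        .count()
}

fn main() {
    assert_eq!(zero_lines_ahead(&[0u8; 40], 16), 2); // two full zero lines
    let mut data = vec![0u8; 40];
    data[20] = 5; // nonzero byte in the second line stops the scan
    assert_eq!(zero_lines_ahead(&data, 16), 1);
}
```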

> The fundamental problem is that hexyl processes the input byte by byte instead of line by line. It's not possible to search an entire line at once to check if a non-squeeze byte is in it because hexyl doesn't know ahead of time what the next bytes are. This places a hard limit on how fast the squeezer can be.
>
> Personally, I'm not a fan of this architectural decision and I would like to change it so it's line by line. That way, each squeezed line could be skipped at the read step before trying to print it. It's impossible to profile without actually writing it but I think this could be another opportunity for optimization.

Sounds great. To be completely honest with you, I'm not at all happy with the current architecture either. hexyl grew way too fast from a very simple tool into something more feature-complete, and I haven't paid enough attention to software architecture, mostly because I wasn't too interested in extending the project anymore. Many different contributors wanted to add new features, and I failed as a maintainer in terms of keeping a certain quality.

So I'd be perfectly fine with a major redesign if it also improves the structure of the code. If it's only for the sake of squeezing (heh) even more performance out, I'm not so sure. I am a big fan of fast tools, but I also believe that optimizing hexyl is more of an intellectual exercise. As I have mentioned elsewhere, it's already faster than most terminal emulators can handle... and I haven't really heard of any real world use cases where hexyl is too slow.

@sharifhsn (Contributor, Author) commented Oct 13, 2022

I've completed the architectural rewrite, which can be tested for speed. On performance, it is slightly faster than xxd on pure random data, and 1.7x faster than hexdump on squeezed data. The rewrite significantly simplifies the code structure, making the entire squeezer module unnecessary. By processing one line at a time, squeezing can be done at the read step instead of after the fact.

It still has a major roadblock in terms of passing tests, because it currently doesn't print out the last line when it is incomplete. I'm working on fixing this.

@sharifhsn (Contributor, Author)

I fixed a few corner cases where it printed incorrectly.

I also added a big optimization when squeezing zero bytes, since that is by far the most common use of squeezing. Instead of checking that every single byte in the line is the same, the line is divided into `usize`-sized chunks, and each chunk is checked against zero. Checking that a `usize` equals zero optimizes extremely well.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `hexdump -C zerodata` | 997.0 ± 19.6 | 960.7 | 1030.9 | 2.88 ± 0.09 |
| `target/release/hexyl zerodata` | 346.0 ± 8.6 | 338.6 | 367.4 | 1.00 |
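The chunked zero check described above might look roughly like this; it is a sketch of the technique, not the exact code from the PR:

```rust
// Sketch of the chunked zero test: reinterpret the line as usize-sized
// words and compare each word against zero, which the compiler vectorizes
// far better than a byte-by-byte comparison. Not the exact PR code.
fn is_all_zero(line: &[u8]) -> bool {
    let mut chunks = line.chunks_exact(std::mem::size_of::<usize>());
    // Bytes that don't fill a whole word are checked individually.
    let tail_is_zero = chunks.remainder().iter().all(|&b| b == 0);
    chunks.all(|c| usize::from_ne_bytes(c.try_into().unwrap()) == 0) && tail_is_zero
}

fn main() {
    assert!(is_all_zero(&[0u8; 16]));
    let mut line = [0u8; 16];
    line[7] = 1;
    assert!(!is_all_zero(&line));
}
```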

@sharifhsn (Contributor, Author)

For what it's worth, the improvement of the squeezer over the original is extreme, which is the only performance benchmark that really matters. As you've noted, the performance of printing out random data is limited by the terminal emulator. However, the squeezing of data is now faster than any hex viewer I could find in my system package repositories. For squeezing zero data, it is now 67 times faster.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hexyl zerodata` | 11.968 ± 0.070 | 11.883 | 12.072 | 67.33 ± 0.85 |
| `target/release/hexyl zerodata` | 0.178 ± 0.002 | 0.173 | 0.180 | 1.00 |

@sharkdp (Owner) commented Nov 4, 2022

Thank you very much. That sounds absolutely fantastic!

I would like to see some (unit and/or integration) tests for the squeezing behavior before we merge this. I know those tests should already be there (see also my comment above). But it's better to introduce them now than never. Performance is great, but I would like to ensure that we're (still) doing everything correctly first. And the (new) code is complex enough to warrant some tests, I believe.

Now this does not have to be done by you, of course. We can also write those tests in another PR and then come back to this later.

@sharifhsn (Contributor, Author)

Great idea! I've implemented three tests that should cover the squeeze behavior.


Unnecessary information beyond this point...

Interestingly, I ran into some unexplainable buggy behavior while creating the tests. When I call `read` on the reader, it should pull as many bytes as possible into the buffer. If the end of file is reached, it will pull fewer bytes into the buffer than the buffer's size. This can be checked for, which is the "leftover" case in the cleanup code. Curiously, when an odd number of panels is used, and the squeeze ends with eight squeeze bytes followed by non-squeeze bytes, `read` suddenly terminates, which gives the appearance that end of file has been reached. Luckily, this behavior is caught by the `hello_world_elf64` test file, so I was able to find the issue and solve it by applying a second `read` check to ensure that end of file has actually been reached. The question for me, then, is why this happens. I dug through the standard library code and it seems to just use libc's `read` function. I'm not sure if there's some bizarre optimization going on or some weird aspect of Linux file reading. Either way, the issue is fixed.

@sharifhsn (Contributor, Author)

Are there any other blockers to merging?

@sharkdp (Owner) commented Nov 13, 2022

Thank you so much for the updates. And for discovering and fixing that buggy behavior. Sorry for all the delays, but I'd like to understand this better before merging.

> When I call read on the reader, it should pull as many bytes as possible into the buffer.

According to the documentation, this is not necessarily the case:

> It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.

> Curiously, when an odd number of panels are used, and the squeeze ends with eight squeeze bytes followed by non-squeeze bytes, then read will suddenly terminate, which gives the appearance that end of file has been reached. Luckily, this behavior is caught by the hello_world_elf64 test file, so I was able to find the issue and solve it by applying a second read check to assure that end of file has been reached.

Can I reproduce this on master somehow?

@sharifhsn (Contributor, Author)

> According to the documentation, this is not necessarily the case:

Hmm, that's good to know. The workaround I made checks again to see if `read` returns `Ok(0)`, which is guaranteed to be EOF per the documentation, so that checks out.
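The retry-until-`Ok(0)` approach described here can be sketched like this (`read_full` is a hypothetical helper, not hexyl's exact code):

```rust
use std::io::{self, Read};

// Sketch of the workaround: a short read does not mean end of file, so
// keep calling read until the buffer is full or read returns Ok(0), which
// the Read documentation does guarantee means end of stream.
// read_full is a hypothetical helper, not hexyl's exact code.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..])? {
            0 => break, // Ok(0): genuine EOF
            n => filled += n,
        }
    }
    Ok(filled)
}

fn main() {
    let mut buf = [0u8; 8];
    let n = read_full(&mut &[1u8, 2, 3][..], &mut buf).unwrap();
    assert_eq!((n, &buf[..n]), (3, &[1u8, 2, 3][..]));
}
```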

> Can I reproduce this on master somehow?

It was a bug introduced by my architectural rewrite, so no. Apparently the former method of reading one byte at a time doesn't cause this issue, which makes sense.

@sharkdp merged commit 0caab15 into sharkdp:master Nov 13, 2022
@sharkdp (Owner) commented Nov 13, 2022

Thank you very much for your patience and this great work. Not only does it make hexyl faster, but also better structured and tested 👍.

More of a note to self: I have to admit that I did not fully review the code in all detail, something that I usually try to do. The reasons are twofold: (1) from the whole conversation, the concept, and the superficial code review, this seems like a high quality contribution. (2) Due to time restrictions, I need to prioritize my open source projects a bit. Over the years, hexyl has dropped down a bit on the list of projects I care about deeply.

@sharkdp mentioned this pull request Nov 13, 2022
@sharifhsn (Contributor, Author)

I totally get it, compared to your other projects it's somewhat less popular. I just got interested in this particular problem for myself, and I'm happy I could contribute to an application used by thousands of people ❤️

@sharifhsn deleted the perf branch November 14, 2022 00:54
@sharkdp (Owner) commented Nov 14, 2022

> I totally get it, compared to your other projects it's somewhat less popular.

To be clear, that's not the deciding factor 😄.

> I just got interested in this particular problem for myself, and I'm happy I could contribute to an application used by thousands of people ❤️

I'm glad you did!
