Various performance optimizations #173
Conversation
As an interesting footnote, I looked over #66 when writing this for reference on profiling. With the new optimizations,
but beats it by 2.5x on pure squeezed data:
For
but is about 4.25x slower on pure squeezed data:
Awesome work and fantastic documentation - Thank you very much! It might be a few days before I get to a full review.
I don't really know how |
I've added a few micro-optimizations, but there's a limit to how much more I can optimize. The fundamental problem is that Personally, I'm not a fan of this architectural decision and I would like to change it so it's line by line. That way, each squeezed line could be skipped at the read step before trying to print it. It's impossible to profile without actually writing it but I think this could be another opportunity for optimization. However, that would essentially constitute an entire rewrite of |
Also, implementing a version of |
Hm, why would the haystack be limited to a single line? I was thinking of files with large blocks of zeros. Couldn't we "scan ahead" and only then print the required output? Something like this:

- If we discover a full line of zeros with our usual (byte-based/line-based) method, don't print anything yet.
- Start a fast lookahead search for the next nonzero byte to figure out the size of the zero-block.
- Print the appropriate "squeezed lines" output.
- Continue normally after the zero block.
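The scan-ahead step described here is essentially a search for the first nonzero byte after the current position. A minimal sketch of that search (the function name and buffer handling are hypothetical, not hexyl's actual code):

```rust
/// Hypothetical sketch of the "scan ahead" idea: given a buffer position
/// where a full line of zeros was just detected, find how far the zero
/// block extends so all of it can be reported as squeezed at once.
fn zero_block_len(buf: &[u8], start: usize) -> usize {
    buf[start..]
        .iter()
        .position(|&b| b != 0)          // fast search for the next nonzero byte
        .unwrap_or(buf.len() - start)   // block runs to the end of the buffer
}

fn main() {
    let mut data = vec![0u8; 1024];
    data[100] = 0xFF;
    assert_eq!(zero_block_len(&data, 0), 100);
    assert_eq!(zero_block_len(&data, 101), 923);
}
```

The iterator-based `position` compiles down to a tight loop, so the lookahead itself stays cheap even on large zero blocks.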
Sounds great. To be completely honest with you, I'm not at all happy with the current architecture either. hexyl grew way too fast from a very simple tool to something more feature-complete, and I haven't paid enough attention to SW architecture, mostly because I wasn't too interested in extending the project anymore. Many different contributors wanted to add new features, and I failed as a maintainer in terms of keeping a certain quality. So I'd be perfectly fine with a major redesign if it also improves the structure of the code. If it's only for the sake of squeezing (heh) even more performance out, I'm not so sure. I am a big fan of fast tools, but I also believe that optimizing |
I've completed the architectural rewrite, which can be tested for speed. On performance, it is slightly faster than It still has a major roadblock in terms of passing tests, because it currently doesn't print out the last line when it is incomplete. I'm working on fixing this.
I fixed a few corner cases where it printed incorrectly. I also added a big optimization when squeezing zero bytes, since that is by far the most common usage of squeezing. Instead of checking that every single byte in the line is the same, the line is divided into
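The chunked comparison described above can be sketched roughly as follows. The function name is hypothetical and this is not hexyl's actual implementation, just the general technique: compare the line in word-sized chunks instead of byte by byte, then check the few leftover bytes individually.

```rust
use std::mem::size_of;

/// Hypothetical sketch: check that every byte in `line` equals the first,
/// comparing usize-sized chunks instead of individual bytes.
fn all_bytes_equal(line: &[u8]) -> bool {
    const W: usize = size_of::<usize>();
    let Some(&first) = line.first() else { return true };
    // A word whose every byte is `first`, e.g. 0x0000... for a zero line.
    let word = usize::from_ne_bytes([first; W]);
    let mut chunks = line.chunks_exact(W);
    // Compare whole words, then the remainder that didn't fill a word.
    chunks
        .by_ref()
        .all(|c| usize::from_ne_bytes(c.try_into().unwrap()) == word)
        && chunks.remainder().iter().all(|&b| b == first)
}

fn main() {
    assert!(all_bytes_equal(&[0u8; 16]));
    assert!(all_bytes_equal(&[0xAB; 17]));
    assert!(!all_bytes_equal(&[0, 0, 0, 1]));
    assert!(all_bytes_equal(&[]));
}
```

For a 16-byte line on a 64-bit target this turns sixteen byte comparisons into two word comparisons, which is why squeezing zero data benefits so much.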
For what it's worth, the improvement of the squeezer over the original is extreme, which is the only performance benchmark that really matters. As you've noted, the performance of printing out random data is limited by the terminal emulator. However, the squeezing of data is now faster than any hex viewer I could find in my system package repositories. For squeezing zero data, it is now 67 times faster.
Thank you very much. That sounds absolutely fantastic! I would like to see some (unit and/or integration) tests for the squeezing behavior before we merge this. I know those tests should already be there (see also my comment above). But it's better to introduce them now than never. Performance is great, but I would like to ensure that we're (still) doing everything correctly first. And the (new) code is complex enough to warrant some tests, I believe. Now this does not have to be done by you, of course. We can also write those tests in another PR and then come back to this later.
Great idea! I've implemented three tests that should cover the squeeze behavior. Unnecessary information beyond this point... Interestingly, I ran into some unexplainable buggy behavior while creating the tests. When I call |
Are there any other blockers to merging?
Thank you so much for the updates. And for discovering and fixing that buggy behavior. Sorry for all the delays, but I'd like to understand this better before merging.
According to the documentation, this is not necessarily the case:

> It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e.g. being close to end-of-file) or because read() was interrupted by a signal.
Can I reproduce this on |
Hmm, that's good to know. The workaround I made checks again to see if
It was a bug introduced by my architectural rewrite, so no. Apparently the former method of reading one byte at a time doesn't cause this issue, which makes sense.
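For reference, the partial-read behavior quoted from the `Read::read` documentation is usually handled with a fill loop that keeps reading until the buffer is full or the stream truly ends. A sketch, with helper names made up for illustration (the `OneByte` wrapper just simulates a reader that returns short reads):

```rust
use std::io::{self, Read};

/// Keep calling read() until the buffer is full or EOF, since a single
/// read() may legitimately return fewer bytes than requested.
fn fill_buf<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..])? {
            0 => break, // true end of stream
            n => filled += n,
        }
    }
    Ok(filled)
}

/// A reader that returns at most one byte per call, emulating short reads.
struct OneByte<R: Read>(R);

impl<R: Read> Read for OneByte<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let len = buf.len().min(1);
        self.0.read(&mut buf[..len])
    }
}

fn main() -> io::Result<()> {
    let mut src = OneByte(&b"hello"[..]);
    let mut buf = [0u8; 8];
    let n = fill_buf(&mut src, &mut buf)?;
    assert_eq!(n, 5);
    assert_eq!(&buf[..n], b"hello");
    Ok(())
}
```

The standard library's `read_exact` does essentially this (plus retrying on `ErrorKind::Interrupted`), but it errors on EOF mid-buffer, which a hex viewer handling a short final line cannot use directly.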
Thank you very much for your patience and this great work. Not only does it make

More of a note to self: I have to admit that I did not fully review the code in all detail, something that I usually try to do. The reasons are twofold: (1) from the whole conversation, the concept, and the superficial code review, this seems like a high-quality contribution. (2) Due to time restrictions, I need to prioritize my open source projects a bit. Over the years, |
I totally get it; compared to your other projects, it's somewhat less popular. I just got interested in this particular problem for myself, and I'm happy I could contribute to an application used by thousands of people ❤️
To be clear, that's not the deciding factor 😄.
I'm glad you did!
I did some profiling of `hexyl` using `cargo flamegraph` and identified some key performance optimizations that could be made. The three methods that take up the most time are:

1. `write_all` into the buffer and then `stdout`
2. the `format!` call in `print_position_panel`
3. the `write!` and `writeln!` macros in the `print_*` functions

The three commits in the pull request address these three issues.
The first replaces the manual buffers used by `hexyl` for reading and writing with `BufReader` and `BufWriter`, respectively. This required some major refactoring because of the way that `buffer_line` and `raw_line` were used. I feel that the refactoring was appropriate and made the code much easier to read, so it was worth it even without the performance gains. With this addition, here is the benchmark:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
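The `BufReader`/`BufWriter` structure described above might look roughly like this sketch; the `dump` function is a simplified stand-in for illustration, not hexyl's actual printer:

```rust
use std::io::{self, BufReader, BufWriter, Read, Write};

/// Hypothetical sketch of the buffered structure: read one display line's
/// worth of bytes at a time and write hex through a buffered writer, so
/// syscalls happen per buffer refill/flush instead of per byte.
fn dump<R: Read, W: Write>(input: R, output: W) -> io::Result<()> {
    let mut reader = BufReader::new(input);
    let mut writer = BufWriter::new(output);
    let mut buf = [0u8; 16]; // one display line
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of input
        }
        for byte in &buf[..n] {
            write!(writer, "{:02x} ", byte)?;
        }
        writeln!(writer)?;
    }
    writer.flush() // buffered output must be flushed before dropping
}

fn main() -> io::Result<()> {
    let mut out = Vec::new();
    dump(&[0x00u8, 0xff][..], &mut out)?;
    assert_eq!(String::from_utf8(out).unwrap(), "00 ff \n");
    Ok(())
}
```

Because both wrappers take any `Read`/`Write`, the same code path serves files, pipes, and in-memory buffers in tests.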
The second commit replaces the `format!` call in `print_position_panel` with a workaround. Previously, the function would take the position, format it as hexadecimal, and then paint it if color was enabled. This was expensive! To avoid it, I used the same strategy as with `byte_hex_panel` and `byte_char_panel` and precomputed the values in the `new` function so they are only created once, through `byte_hex_panel_g` and `byte_char_panel_g`. This adds a tiny amount of overhead to startup, but it's invisible in profiling. Here is the benchmark with both the first and second commits:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
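The precomputation strategy can be illustrated with a small sketch; the `Tables` type here is hypothetical and only mirrors the general idea of building the per-byte strings once in a constructor so the hot loop does plain indexing instead of formatting:

```rust
/// Hypothetical sketch of precomputing per-byte strings once, instead of
/// calling format! for every byte on every line.
struct Tables {
    byte_hex: Vec<String>, // "00" through "ff", built once in new()
}

impl Tables {
    fn new() -> Self {
        Tables {
            // 256 small allocations at startup; invisible next to the
            // per-byte formatting cost this removes from the print loop.
            byte_hex: (0u8..=255).map(|b| format!("{:02x}", b)).collect(),
        }
    }

    /// Lookup is a plain index; no formatting machinery at print time.
    fn hex(&self, byte: u8) -> &str {
        &self.byte_hex[byte as usize]
    }
}

fn main() {
    let t = Tables::new();
    assert_eq!(t.hex(0x00), "00");
    assert_eq!(t.hex(0xab), "ab");
    assert_eq!(t.hex(0xff), "ff");
}
```

The same trick extends to colored output: the precomputed strings can include the ANSI escape codes, so painting also happens once at startup.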
The last and most dramatic optimization was the elimination of the `write!` macro sprinkled throughout the three `print` methods that run for every line, replacing it with the more direct `write_all` method on `self.writer`. This wasn't actually too difficult, because none of those `write!` calls used any formatting machinery. I didn't remove every use of `write!`, because profiling shows it isn't important outside of the loop. Here is the benchmark with all of these commits:

- `hexyl data`
- `hexyl --no-squeezing data`
- `target/release/hexyl data`
- `target/release/hexyl --no-squeezing data`
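The difference between the two calls can be seen in a tiny example: `write!` goes through `Display`/`Arguments` formatting machinery even when the string is fixed, while `write_all` hands the bytes straight to the writer.

```rust
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let mut out: Vec<u8> = Vec::new();

    // write! builds a fmt::Arguments and drives the formatter,
    // even though nothing here is actually being formatted:
    write!(out, "│ ")?;

    // write_all skips all of that and copies the bytes directly:
    out.write_all("│ ".as_bytes())?;

    // Both produce identical output; only the per-call cost differs.
    assert_eq!(out, "│ │ ".as_bytes());
    Ok(())
}
```

In a loop that runs once per output line, removing the formatting overhead from every fixed-string write adds up quickly.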
And here is the final flamegraph: