Memcpy/memcmp macros #1109
Conversation
Force-pushed ca1b8d0 to 9e31ad4.
This pull request introduces 15 alerts when merging 9e31ad4 into b2ef108 - view on LGTM.com.
Force-pushed 9e31ad4 to d40a42d.
This pull request introduces 15 alerts when merging d40a42d into b2ef108 - view on LGTM.com.
Force-pushed d40a42d to 8e12f2b.
Codecov Report
@@ Coverage Diff @@
## develop #1109 +/- ##
===========================================
- Coverage 80.23% 79.71% -0.52%
===========================================
Files 98 92 -6
Lines 9040 8943 -97
Branches 1438 1432 -6
===========================================
- Hits 7253 7129 -124
- Misses 1223 1247 +24
- Partials 564 567 +3
Flags with carried forward coverage won't be shown.
Force-pushed 70239dc to de97ab4.
This does look very promising. I wish we could do some broader benchmarking of this though, like aarch32 for example.
Yeah I agree. I don't have any access to ARM machines. 😟
Force-pushed de97ab4 to 6df6a0b.
Rebased.
Oracle Cloud has free Ampere ARM VMs.
I've got an Odroid N2 for aarch64 and a quad PowerMac G5 for ppc64.
Force-pushed 6df6a0b to 6db1377.
Rebased.
I plan to split some of this PR out into a separate PR just for moving unaligned detection into code.
Force-pushed 6db1377 to fffd789.
Rebased.
Looks like it still happens with
We knew limiting minimum match length to 4 instead of 3 would hurt compression in some cases...
It hurts decompression...
The compressed stream changes and somehow decompression is hurt. Maybe it is having to seek far back into the window all the time? I did a comparison of minigzip output with z_verbose=2 between level 7 and 8 and there didn't seem to be that much of a difference imo. Any debugging ideas?
I would check distances of 3 and 4 byte matches... It's possible that the chain lengths are long enough to hurt performance.
My thinking is that my CPU may need to bring memory in and out of cache to perform these operations if they are too far back in the window. If somebody else can run the test using
I think I would do a profiling comparison, and see where the huge amount of extra time is spent. Not sure whether it will actually tell us what is "wrong" about the data, but hopefully it would give a hint.

My thinking is that the commit that you found did not cause this, it just modified the compressed data in a way that made the issue crop up in this specific circumstance. The compressed data is valid after all, and it would surprise me if it is very different from level 7, making the increased time spent even more surprising.

Interestingly, I do not see the same problem with the same data. But we do have several cases where we compress the same data differently depending on what arch-specific optimizations are enabled etc. Not sure how that still happens with

You could upload the compressed level 7 and level 8 files, then we could all test decompression to see whether we can at least replicate the slowdown with the same compressed data as you.
Last night I tried to do some performance profiling in Visual Studio 2022 but nothing stood out to me.
Nothing terribly conclusive there - just that 4% less of the total runtime is burned in zng_inflate_fast on level 8 than level 7 (inclusive, not just self). There also seems to be ~5% less time in the C runtime, though it's hard to say if that's a good apples-to-apples comparison here (there's going to be some run-to-run variance and it's hard to get great resolution). Do you see this same behavior with the larger version of the test data?
Here are benchmark results on my Tiger Lake laptop: MSVC 17.0.5, x86-64, silesia-small.tar, DEVELOP 748127e
Oddly enough, on level 0 the decompression time is also higher.
As far back as I can go in the git tree, it seems that inflate stored takes the same amount of time.
So that's not too surprising. At level 0 you're likely incurring much more IO if you're actually decompressing from block storage. You're running nearly double the amount of data through memory and back out to storage.
That is what I have always seen too, that the time spent to inflate a file increases roughly linearly with its compressed size. I have not gotten around to testing the file you uploaded yet.
Xeon E5-2650, x86-64, GCC 4.8 Baseline GCC4.8 88f331e
PR #1109 GCC4.8 727254d
Slightly faster on an older compiler for x86-64; that is very good to see. That alleviates my worries about this introducing any potential regressions.
@nmoinvaz No, you are right. Level 0 being so much slower is indeed seriously strange, and I would guess it is probably related to the level 8 slowdown. It involves less code, so possibly that could make it easier to narrow down the problem. Interestingly, my Xeon benchmark above makes the exact same file as yours on level 8, but in my tests the performance is right in line with what is expected. And my level 0 decompression is more than 5x faster than level 1, not twice as slow as yours seems to be. This one is tricky.
One possible experiment that might or might not tell us anything: If you append silesia-small.tar onto itself, so the same bitstream comes twice after each other in a file twice the size, does that file have the same behavior? I mean, it should at least have the same slowdown for the first half of the file, since the bitstreams would be identical up to that point. However, if it turns out to not be slower than expected, then it becomes seriously weird and I'd begin to suspect Windows itself.. Also, not trimming any of the worst benchmark results would tell us for sure whether this slowdown sometimes happens on the other levels, and whether some runs are even worse than this on levels 0 and 8. (Still makes no sense that it would happen always on levels 0 and 8, but I am really grasping at straws here.)
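The doubling experiment described above can be sketched as a small C helper. This is a hypothetical illustration (the function name and file paths are placeholders, not part of the PR); it simply writes the input file's bytes twice into the output so the identical bitstream appears back-to-back.

```c
#include <stdio.h>

/* Write the contents of in_path twice into out_path.
   Returns 0 on success, -1 on any IO failure. */
int append_to_self(const char *in_path, const char *out_path) {
    FILE *out = fopen(out_path, "wb");
    if (out == NULL)
        return -1;
    for (int pass = 0; pass < 2; pass++) {
        FILE *in = fopen(in_path, "rb");
        if (in == NULL) {
            fclose(out);
            return -1;
        }
        char buf[65536];
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
            fwrite(buf, 1, n, out);
        fclose(in);
    }
    return fclose(out) == 0 ? 0 : -1;
}
```

Running this on silesia-small.tar and benchmarking the doubled file would show whether the slowdown scales with the (identical) bitstream or is file-size dependent.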
@Dead2 are you sure it's not time lost in IO for @nmoinvaz at level 0? It'd be helpful if these were fed from a RAM disk. At least at that point you're only measuring the additional memory access time (which would still likely be slower, but not by the same scale at least). Is there a way to measure system time of a process in Windows to narrow it down? If it's all in system time, that would 100% agree with that theory. Also something to consider: that Xeon has a 20 MB LLC; his laptop's -U series Tiger Lake CPU only has a 12 MB LLC.
Here is a run on WSL2 using GCC 9.3.0.
It seems to be only related to this dataset. A single run using
@KungFuJesus I thought maybe if I used WSL2 I could have access to all the perf tools, but they were all missing or needed to be compiled manually. So I couldn't get access to the sys timings.
Perf is not a strictly userspace component either, though WSL2 does provide a kernel so perhaps that doesn't matter anymore. You should be able to do a direct comparison with time -v (it should give you memory utilization, system time, etc). You will want to restrict it to just the compression levels you're testing though, and perhaps just a single run.
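The user/system time split that `time -v` reports can also be obtained from inside the process itself via POSIX `getrusage`, which works on WSL2. A minimal sketch (the function name `report_cpu_times` is illustrative, not from the PR):

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print the user and system CPU time consumed so far by this process,
   similar to the split `time -v` reports. Returns 0 on success. */
int report_cpu_times(FILE *out) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    fprintf(out, "user %ld.%06lds sys %ld.%06lds\n",
            (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
            (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}
```

If most of the extra time on level 0 showed up as system time here, that would point at IO or syscall overhead rather than the inflate code itself.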
@KungFuJesus For sure, it could absolutely be explained by slow IO. AFAIK, there is no way the OS should be aware of what compression level was used for those files. Also, the file is compressed before each decompression test, so it is not even the same file that was decompressed 70 times; it is 70 different (and probably cached) files that should not look any different to the OS (compared to level 7, for example) than the quite minor differences in size. The OS could be slower when doing IO because of those exact file sizes, but that would be a quite weird bug. Also, remember the fact that using different files for testing does not result in this phenomenon..
Well, given that compression is kind of a state machine with opcodes, it's not super surprising that differently compressible files would react differently here. If I had to wager a guess, you're fitting more into cache than he is for this data set. Perf can be used to measure LLC misses, so it'd be interesting to see from the Xeon and from the Tiger Lake -U CPU what those were for those given compression levels.
@KungFuJesus Level 7 and level 8 should not differ so much that it would affect speed much. And certainly level 0 should not be very challenging at all, since it does not in fact compress the data at all. I have a feeling this is OS or timing related, but it is pretty hard to do any kind of deep inspection like this on Windows. Remember that even the Raspberry Pi 1B shows a much faster level 0 than level 1, and level 8 is just about as expected. And that thing does not have a lot of cache at all.
I used SysInternals' Process Monitor to watch the file reads/writes from
Here are rough benchmarks (not all my apps were shut down or videos stopped playing) with Windows Defender real-time protection on and off. For those who don't know, Windows Defender is the anti-virus/anti-malware that comes with Windows - it's not something I installed.
Real-time protection enabled
Real-time protection disabled
Notice that level 0 and level 8 return to normal when real-time protection is disabled.
I'll contend that the level 7 & 8 difference could very well be a bug or a really strange side effect. However, level 0 being slower than level 1, despite being far less work for the CPU, doesn't seem strange to me, especially if level 1 is not significantly more work and the compression reduces the traffic from memory to disk by nearly half. Even if it were just dumping it back into memory with a round trip, I could see the reduction in memory bandwidth and cache pressure being a win. This is precisely the argument made for compressed swap or file systems with transparent compression. Though, on my system level 0 is beating level 1 by a long shot (at least when operating entirely out of a tmpfs-backed file system):
I do likely have cache-deprived systems I could try this on, but I might even have something a step further - a slow 7200 RPM disk:
Ok, I'll eat my words. What is going on with Windows? Are you plugged in?
Lol, oh yeah, that's likely it. We've seen all kinds of weird performance issues with Windows Defender. You might want to whitelist the directory you're benchmarking to; it intercepts all writes and tries to inspect them. Man, I wonder how many times you've had bad benchmark info due to Defender chowing down on your IO performance. This definitely justifies the Google Benchmark microbenchmarks quite a bit more.
Good idea! I had forgotten about that, which I had done before... but I recently reimaged my machine and now need to set that up again.
This PR adds zmemcpy/zmemcmp macros that force unaligned access on supported architectures, and uses zmemcpy/zmemcmp in the code where necessary.
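A sketch of the general idiom behind such macros (illustrative only; the actual zlib-ng macro names, guards, and widths may differ): on targets where unaligned access is known to be fast, a small fixed-size copy or compare goes through a wide integer type via `memcpy`, which compilers lower to a single unaligned load/store without invoking undefined behavior.

```c
#include <stdint.h>
#include <string.h>

/* Copy 4 bytes between possibly-unaligned pointers. The memcpy calls
   are compiled down to one unaligned load and one unaligned store on
   architectures that support them. */
static inline void copy32(void *dst, const void *src) {
    uint32_t w;
    memcpy(&w, src, sizeof(w));
    memcpy(dst, &w, sizeof(w));
}

/* Compare 4 bytes at possibly-unaligned pointers.
   Returns 0 when equal, nonzero otherwise. */
static inline int cmp32(const void *a, const void *b) {
    uint32_t x, y;
    memcpy(&x, a, sizeof(x));
    memcpy(&y, b, sizeof(y));
    return x != y;
}
```

Comparing 4 bytes as one 32-bit word instead of byte-by-byte is what makes match checking in deflate cheaper on unaligned-capable targets, which ties into the minimum-match-length discussion above.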