Faster Transposition #2730
Conversation
Would it make sense to ensure that allocations (and lines) are cache aligned?
In ImageMagick it is called transverse. I think this is an appropriate name.
It's about 11% faster for large images, but there is no easy, cross-platform way to do that.
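For context, aligned allocation itself is possible, but it hides behind different APIs on different platforms, which is why it isn't a drop-in change. A minimal sketch of what a cache-line-aligned allocator could look like (the helper name is hypothetical, not Pillow code):

```c
#include <stdlib.h>
#include <stdint.h>
#if defined(_MSC_VER)
#include <malloc.h>   /* _aligned_malloc */
#endif

/* Allocate `size` bytes aligned to a 64-byte cache line.
   POSIX has posix_memalign, C11 has aligned_alloc (with its own
   size-multiple restriction), and MSVC has _aligned_malloc, whose
   result must be released with _aligned_free rather than free --
   which is why there is no single portable call. */
static void *
alloc_cache_aligned(size_t size)
{
    void *p = NULL;
#if defined(_MSC_VER)
    p = _aligned_malloc(size, 64);
#else
    if (posix_memalign(&p, 64, size) != 0)
        p = NULL;
#endif
    return p;
}
```

The mismatched deallocation functions alone mean every allocation and free site would need the same `#ifdef` treatment.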
libImaging/Geometry.c
out[xr] = in[x];
}
}
If you think this suggestion is less clear, feel free to ignore.
for (y = 0; y < imIn->ysize; y++, yr--) {
    /* the pointer declarations must stay in the branch that uses
       them, since the two branches work on different element types */
    if (imIn->image8) {
        UINT8 *in = (UINT8 *) imIn->image8[y];
        UINT8 *out = (UINT8 *) imOut->image8[yr];
        xr = imIn->xsize - 1;
        for (x = 0; x < imIn->xsize; x++, xr--)
            out[xr] = in[x];
    } else {
        UINT32 *in = (UINT32 *) imIn->image32[y];
        UINT32 *out = (UINT32 *) imOut->image32[yr];
        xr = imIn->xsize - 1;
        for (x = 0; x < imIn->xsize; x++, xr--)
            out[xr] = in[x];
    }
}
}
Transverse is a transpose around the other diagonal, right? I agree that hitting cache lines would be hard on many processors (and could waste a decent amount of memory). Is there a gain to be had by aligning to something at SSE width, or at Chunk=8?
Yes, that's the right meaning. @radarhere I've rewritten it.
Coverage decreased because of the last commit: it removed 28 covered lines and added only 4 (macros don't count as lines according to Coveralls).
Last time I sped up image transposition using 2-level loops, which work better with the CPU cache. There was a bunch of benchmarks proving that 2-level loops work much faster.
Unfortunately, transposition of big matrices is still inexplicably slow to me. On the current master, an i5-4430 CPU transposes a large 5120x3200 RGB image at 295 Mpx/s. That is, it moves 1180 megabytes per second, which is 1/18 of the memory bandwidth. Each pixel requires one read operation and one write, and roughly speaking 1-2 extra operations are required for the loops, so in the worst case this should need about 1.2 GHz of a 3.2 GHz CPU.
So I started experiments. Last time I selected a chunk size of 128 based on several tests. This time I tested different chunk sizes on four available CPUs with three image sizes: relatively small, fitting into the L3 cache; relatively large, 2-5 times the L3 cache of desktop CPUs; and extra large, 2-3 times the L3 cache of server CPUs. This is the performance of `Image.TRANSPOSE`, but it is almost the same for `Image.ROTATE_90` and `Image.ROTATE_270`.

There are several interesting patterns.
I believe that Chunk=16 works fast in many cases because 16 pixels * 4 bytes per pixel = 64 bytes, which is the cache line size of every modern CPU. On the other hand, Chunk=8 often works even faster because the data is not aligned to 64 bytes and often crosses cache lines.
It does not look like the current value (Chunk=128) is the best choice, but there is no other obvious leader either. So, what to do?
Introducing 3-level loops :-)
We have chunk size 128, which is 128 * 128 * 4 bytes per pixel * 2 images = 128 KB of data. Enough to fit into the L2 cache. What if we reduce the chunk size to a smaller value that fits into the L1 cache, but also add another, outer level which fits into the L2/L3 cache?

In all cases small images are rotated at near the maximum speed. On each CPU the fastest 3-level loop is faster than the 2-level loop for medium and large images, so it is an obvious win. It only remains to decide which chunk size pair is better.
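The 3-level scheme can be sketched as a standalone function. This is illustrative only, not Pillow's actual code (which walks `imIn->image32` rows); the chunk sizes match the pair eventually chosen below:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t UINT32;

/* Chunk sizes per the benchmarks: 512-pixel outer tiles targeting
   L2/L3, split into 8-pixel inner tiles targeting L1. */
#define CHUNK_OUTER 512
#define CHUNK_INNER 8

/* 3-level blocked transpose of a row-major xsize*ysize UINT32 buffer:
   the output element (x, y) gets the input element (y, x). */
static void
transpose_blocked(UINT32 *out, const UINT32 *in, int xsize, int ysize)
{
    int xx, yy, x2, y2, x, y;
    for (yy = 0; yy < ysize; yy += CHUNK_OUTER) {        /* L2/L3 tile */
        for (xx = 0; xx < xsize; xx += CHUNK_OUTER) {
            for (y2 = yy; y2 < yy + CHUNK_OUTER && y2 < ysize; y2 += CHUNK_INNER) {  /* L1 tile */
                for (x2 = xx; x2 < xx + CHUNK_OUTER && x2 < xsize; x2 += CHUNK_INNER) {
                    for (y = y2; y < y2 + CHUNK_INNER && y < ysize; y++) {  /* pixels */
                        for (x = x2; x < x2 + CHUNK_INNER && x < xsize; x++) {
                            out[(size_t)x * ysize + y] = in[(size_t)y * xsize + x];
                        }
                    }
                }
            }
        }
    }
}
```

The inner pair of loops touches at most a CHUNK_INNER x CHUNK_INNER block of each buffer, so both the read and the write working sets stay cache-resident even though one of the two accesses is strided.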
In most cases an inner loop of size 8 is faster than a loop of size 16. Chunk=256,8 is fastest 3 times and Chunk=512,8 is fastest 5 times, so I chose the latter.
Later I noticed that the test implementation was not optimal, but I didn't restart all benchmarks on all CPUs, only for the chosen chunk size. This is the comparison: the current chunk size, the chosen chunk size without the fix, the best results without the fix, and the chosen size with the fix (final performance).
One more thing
I've implemented the last possible transformation: it is the equivalent of TRANSPOSE + ROTATE_180, and I don't know a better name for it than `TRANSPOSE_ROTATE_180`.
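To make the equivalence concrete: TRANSPOSE maps pixel (x, y) to (y, x), and a subsequent ROTATE_180 mirrors both axes, so the combined operation is a flip across the anti-diagonal. A small self-contained check on toy row-major matrices (not Pillow's API; `h` and `w` are the input's dimensions):

```c
#include <stdint.h>
#include <string.h>

/* Transpose: output has w rows by h cols. */
static void transpose(uint8_t *out, const uint8_t *in, int h, int w)
{
    for (int i = 0; i < w; i++)
        for (int j = 0; j < h; j++)
            out[i * h + j] = in[j * w + i];
}

/* Rotate 180 degrees in place of dimensions: output keeps h x w. */
static void rotate180(uint8_t *out, const uint8_t *in, int h, int w)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++)
            out[i * w + j] = in[(h - 1 - i) * w + (w - 1 - j)];
}

/* The proposed transformation: a flip across the anti-diagonal,
   done in a single pass. Output has w rows by h cols. */
static void transverse(uint8_t *out, const uint8_t *in, int h, int w)
{
    for (int i = 0; i < w; i++)
        for (int j = 0; j < h; j++)
            out[i * h + j] = in[(h - 1 - j) * w + (w - 1 - i)];
}
```

Substituting the transpose into the rotation gives `rotate180(transpose(in))[i][j] = in[h-1-j][w-1-i]`, which is exactly the single-pass `transverse` above; doing it in one pass avoids the intermediate image entirely.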