Faster Transposition #2730

homm · 2017-09-11T18:51:39Z

Last time I sped up image transposition using 2-level loops, which work better with the CPU cache. There were a bunch of benchmarks to prove that 2-level loops work much faster.
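For readers unfamiliar with the idea, here is a minimal sketch of a 2-level (tiled) transposition. It is an illustration only: the real code is C in libImaging/Geometry.c, the function name `transpose_chunked` and the list-of-rows representation are assumptions made for this example, and pure Python will not show the cache effect that the benchmarks measure.

```python
def transpose_chunked(src, chunk=128):
    """Transpose a list of equal-length rows, walking it tile by tile."""
    height = len(src)
    width = len(src[0]) if height else 0
    out = [[None] * height for _ in range(width)]
    # Outer level: step over the image in chunk x chunk tiles.
    for y0 in range(0, height, chunk):
        for x0 in range(0, width, chunk):
            y1 = min(y0 + chunk, height)
            x1 = min(x0 + chunk, width)
            # Inner level: transpose one tile, so that reads and writes
            # stay inside a small region of both images.
            for y in range(y0, y1):
                row = src[y]
                for x in range(x0, x1):
                    out[x][y] = row[x]
    return out


if __name__ == "__main__":
    print(transpose_chunked([[1, 2, 3], [4, 5, 6]], chunk=2))
```

The point of the tiling is that a naive transposition writes each output pixel to a different row, while the tiled version keeps returning to the same few rows before moving on.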

Unfortunately, transposition of big matrices is still an inexplicably slow operation to me. On the current master, an i5-4430 CPU transposes a large 5120x3200 RGB image at 295 Mpx/s. I.e. it rotates 1180 megabytes per second, which is 1/18 of memory bandwidth. Each pixel requires one read and one write; roughly speaking, 1-2 extra operations are required for the loops, so in the worst case this requires 1.2 GHz of a 3.2 GHz CPU.
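The arithmetic behind these estimates can be written out as a small sketch. The constant names are made up for this example; the figures are the ones quoted above, and the memory bandwidth is only what the "1/18" ratio implies, not a measured value.

```python
PIXELS_PER_S = 295_000_000   # measured throughput quoted above (295 Mpx/s)
BYTES_PER_PIXEL = 4          # RGB pixels occupy 4 bytes each
OPS_PER_PIXEL = 4            # 1 read + 1 write + up to 2 loop operations

bytes_per_s = PIXELS_PER_S * BYTES_PER_PIXEL
ops_per_s = PIXELS_PER_S * OPS_PER_PIXEL
# Bandwidth implied by "1/18 of memory bandwidth" (an inference, not a measurement).
implied_bandwidth = bytes_per_s * 18

if __name__ == "__main__":
    print(bytes_per_s / 1e6, "MB/s moved")
    print(ops_per_s / 1e9, "G operations/s in the worst case")
    print(implied_bandwidth / 1e9, "GB/s implied memory bandwidth")
```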

So I started experimenting. Last time I selected a chunk size of 128 based on several tests. This time I tested different chunk sizes on four available CPUs with three image sizes: relatively small, which fits into L3 cache; relatively large, 2-5 times the L3 cache size of desktop CPUs; and extra large, 2-3 times the L3 cache size of server CPUs. This is the performance of Image.TRANSPOSE, but it is almost the same for Image.ROTATE_90 and Image.ROTATE_270.

i5-4258U
                 512x320         2560x1600       5120x3200
Chunk=8         773.00 Mpx/s    299.88 Mpx/s    178.38 Mpx/s
Chunk=16        784.47 Mpx/s    314.09 Mpx/s    227.64 Mpx/s
Chunk=32        783.57 Mpx/s    338.01 Mpx/s    239.08 Mpx/s
Chunk=64        712.86 Mpx/s    336.15 Mpx/s    224.80 Mpx/s
Chunk=128       526.99 Mpx/s    326.40 Mpx/s    240.56 Mpx/s
Chunk=256       406.62 Mpx/s    278.09 Mpx/s    262.29 Mpx/s
Chunk=512       407.59 Mpx/s    275.88 Mpx/s    223.16 Mpx/s
Chunk=1024      410.76 Mpx/s    188.63 Mpx/s    111.92 Mpx/s

i5-4430
                 512x320         2560x1600       5120x3200
Chunk=8         866.58 Mpx/s    556.31 Mpx/s    298.67 Mpx/s
Chunk=16        862.23 Mpx/s    505.81 Mpx/s    312.78 Mpx/s
Chunk=32        862.23 Mpx/s    458.42 Mpx/s    322.88 Mpx/s
Chunk=64        799.06 Mpx/s    432.70 Mpx/s    303.64 Mpx/s
Chunk=128       597.56 Mpx/s    415.84 Mpx/s    294.89 Mpx/s
Chunk=256       457.52 Mpx/s    354.08 Mpx/s    310.84 Mpx/s
Chunk=512       457.52 Mpx/s    357.91 Mpx/s    280.47 Mpx/s
Chunk=1024      459.05 Mpx/s    326.90 Mpx/s    199.09 Mpx/s

E5-2666 v3
                 512x320         2560x1600       5120x3200
Chunk=8         831.95 Mpx/s    419.63 Mpx/s    161.96 Mpx/s
Chunk=16        831.95 Mpx/s    449.82 Mpx/s    141.83 Mpx/s
Chunk=32        815.18 Mpx/s    453.25 Mpx/s    133.27 Mpx/s
Chunk=64        761.86 Mpx/s    364.06 Mpx/s    136.88 Mpx/s
Chunk=128       604.39 Mpx/s    362.03 Mpx/s    123.80 Mpx/s
Chunk=256       452.40 Mpx/s    327.50 Mpx/s    124.97 Mpx/s
Chunk=512       453.89 Mpx/s    344.67 Mpx/s    169.71 Mpx/s
Chunk=1024      447.68 Mpx/s    282.17 Mpx/s    135.36 Mpx/s

E5-2680 v2
                 512x320         2560x1600       5120x3200
Chunk=8         766.10 Mpx/s    430.21 Mpx/s    175.39 Mpx/s
Chunk=16        783.57 Mpx/s    473.91 Mpx/s    154.12 Mpx/s
Chunk=32        776.49 Mpx/s    468.32 Mpx/s    146.73 Mpx/s
Chunk=64        751.85 Mpx/s    398.37 Mpx/s    141.45 Mpx/s
Chunk=128       560.98 Mpx/s    408.33 Mpx/s    142.27 Mpx/s
Chunk=256       443.92 Mpx/s    350.87 Mpx/s    147.06 Mpx/s
Chunk=512       441.64 Mpx/s    349.43 Mpx/s    137.96 Mpx/s
Chunk=1024      448.85 Mpx/s    328.57 Mpx/s     84.50 Mpx/s

There are several interesting patterns.

First of all, on each CPU, the smaller the chunk, the faster the small transposition (512x320) works.
On the server CPUs the smallest chunk size is also the fastest (with one exception) for large images (5120x3200), but this is not true for the desktop CPUs, where Chunk=32 and Chunk=256 are faster than their neighbors. For some reason Chunk=512 is far faster than the others on the E5-2666 v3, but its neighbors are relatively slow.
As for the medium size (2560x1600), it looks totally unpredictable.

I believe that Chunk=16 works fast in many cases because 16 * 4 bytes per pixel = 64 bytes, which is the size of a cache line in every modern CPU. On the other hand, Chunk=8 often works even faster because the data is not aligned to 64 bytes and often crosses cache lines.
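One way to make this guess concrete is to count how many cache lines a run of pixels touches for each possible starting offset. This is a sketch of the reasoning only, not a measurement; `lines_touched` and `crossings` are hypothetical helper names, and the 64-byte line size and 4 bytes per pixel are taken from the paragraph above.

```python
CACHE_LINE = 64       # bytes per cache line, as assumed in the text
BYTES_PER_PIXEL = 4


def lines_touched(start_byte, chunk):
    """Cache lines spanned by `chunk` consecutive pixels starting at start_byte."""
    end_byte = start_byte + chunk * BYTES_PER_PIXEL - 1
    return end_byte // CACHE_LINE - start_byte // CACHE_LINE + 1


def crossings(chunk):
    """How many of the 16 pixel-aligned offsets within a line make a run span 2 lines."""
    offsets = range(0, CACHE_LINE, BYTES_PER_PIXEL)
    return sum(1 for start in offsets if lines_touched(start, chunk) > 1)


if __name__ == "__main__":
    for chunk in (8, 16):
        print("Chunk=%d crosses a line for %d of 16 offsets" % (chunk, crossings(chunk)))
```

A 16-pixel run covers exactly one line only when it starts on a line boundary, while an 8-pixel run stays inside a single line for more than half of the offsets.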

It does not look like the current value (Chunk=128) is the best choice. But there is no other obvious leader either. So, what to do?

Introducing 3-level loops :-)

We have a chunk size of 128, which is 128 * 128 * 4 bytes per pixel * 2 images = 128 KB of data. That is enough to fit into L2 cache. What if we reduce the chunk size to a smaller value that fits into L1 cache, but also add another external level that fits in L2/L3 cache?
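A minimal sketch of what a 3-level loop could look like, together with the working-set arithmetic from the paragraph above. The names `transpose_3level` and `working_set_bytes` are made up for this example, and the exact loop order of the C implementation is not stated here, so treat the nesting below as an assumption.

```python
def working_set_bytes(chunk, bytes_per_pixel=4, images=2):
    """Bytes touched by one chunk x chunk tile in the source and destination."""
    return chunk * chunk * bytes_per_pixel * images


def transpose_3level(src, outer=512, inner=8):
    """Transpose a list of equal-length rows using outer tiles split into inner tiles."""
    height = len(src)
    width = len(src[0]) if height else 0
    out = [[None] * height for _ in range(width)]
    # Level 1: large tiles, intended to stay within L2/L3 cache.
    for y0 in range(0, height, outer):
        for x0 in range(0, width, outer):
            y1 = min(y0 + outer, height)
            x1 = min(x0 + outer, width)
            # Level 2: small tiles inside the large one, intended for L1 cache.
            for yy in range(y0, y1, inner):
                for xx in range(x0, x1, inner):
                    yy1 = min(yy + inner, y1)
                    xx1 = min(xx + inner, x1)
                    # Level 3: copy the pixels of one small tile.
                    for y in range(yy, yy1):
                        row = src[y]
                        for x in range(xx, xx1):
                            out[x][y] = row[x]
    return out


if __name__ == "__main__":
    for chunk in (8, 128, 512):
        print("Chunk=%d working set: %d bytes" % (chunk, working_set_bytes(chunk)))
```

With these numbers an 8x8 inner tile touches 512 bytes, the old 128x128 tile touches 128 KB, and a 512x512 outer tile touches 2 MB.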

i5-4258U
                 512x320         2560x1600       5120x3200
Chunk=64,8      773.00 Mpx/s    443.34 Mpx/s    256.59 Mpx/s
Chunk=128,8     766.10 Mpx/s    450.11 Mpx/s    281.14 Mpx/s
Chunk=256,8     776.49 Mpx/s    462.25 Mpx/s    295.64 Mpx/s
Chunk=512,8     783.57 Mpx/s    473.03 Mpx/s    292.84 Mpx/s
Chunk=64,16     769.54 Mpx/s    419.42 Mpx/s    253.36 Mpx/s
Chunk=128,16    769.54 Mpx/s    413.69 Mpx/s    277.51 Mpx/s
Chunk=256,16    765.25 Mpx/s    430.07 Mpx/s    291.23 Mpx/s
Chunk=512,16    776.49 Mpx/s    426.89 Mpx/s    292.11 Mpx/s

i5-4430
                 512x320         2560x1600       5120x3200
Chunk=64,8      849.44 Mpx/s    566.60 Mpx/s    346.38 Mpx/s
Chunk=128,8     823.97 Mpx/s    570.40 Mpx/s    339.70 Mpx/s
Chunk=256,8     857.92 Mpx/s    584.91 Mpx/s    346.96 Mpx/s
Chunk=512,8     870.97 Mpx/s    573.03 Mpx/s    350.03 Mpx/s
Chunk=64,16     844.22 Mpx/s    530.23 Mpx/s    338.49 Mpx/s
Chunk=128,16    836.00 Mpx/s    522.79 Mpx/s    332.62 Mpx/s
Chunk=256,16    826.95 Mpx/s    516.19 Mpx/s    337.27 Mpx/s
Chunk=512,16    857.92 Mpx/s    512.89 Mpx/s    346.55 Mpx/s

E5-2666 v3
                 512x320         2560x1600       5120x3200
Chunk=64,8      822.99 Mpx/s    463.03 Mpx/s    147.79 Mpx/s
Chunk=128,8     827.95 Mpx/s    459.04 Mpx/s    140.30 Mpx/s
Chunk=256,8     836.00 Mpx/s    470.32 Mpx/s    144.62 Mpx/s
Chunk=512,8     845.26 Mpx/s    474.35 Mpx/s    171.73 Mpx/s
Chunk=64,16     811.33 Mpx/s    470.27 Mpx/s    137.55 Mpx/s
Chunk=128,16    807.51 Mpx/s    468.60 Mpx/s    140.57 Mpx/s
Chunk=256,16    822.99 Mpx/s    468.70 Mpx/s    143.72 Mpx/s
Chunk=512,16    830.95 Mpx/s    470.27 Mpx/s    155.74 Mpx/s

E5-2680 v2
                 512x320         2560x1600       5120x3200
Chunk=64,8      766.10 Mpx/s    457.09 Mpx/s    158.32 Mpx/s
Chunk=128,8     769.54 Mpx/s    467.85 Mpx/s    163.17 Mpx/s
Chunk=256,8     776.49 Mpx/s    472.82 Mpx/s    166.31 Mpx/s
Chunk=512,8     787.16 Mpx/s    464.94 Mpx/s    175.66 Mpx/s
Chunk=64,16     776.49 Mpx/s    477.51 Mpx/s    152.06 Mpx/s
Chunk=128,16    776.49 Mpx/s    487.73 Mpx/s    153.67 Mpx/s
Chunk=256,16    776.49 Mpx/s    486.34 Mpx/s    155.96 Mpx/s
Chunk=512,16    787.16 Mpx/s    488.08 Mpx/s    160.12 Mpx/s

In all cases small images are rotated at near the maximum speed. On each CPU the fastest 3-level loop is faster than the 2-level loop for medium and large images. So it is an obvious win. It only remains to decide which pair of chunk sizes is better.

In most cases an inner loop of size 8 is faster than a loop of size 16. Chunk=256,8 is fastest 3 times and Chunk=512,8 is fastest 5 times. So I chose the latter.

Later I noticed that the test implementation was not optimal, but I didn't rerun all benchmarks on all CPUs, only the ones for the chosen chunk size. This is the comparison: the current chunk size, the chosen chunk size without the fix, the best results without the fix, and the chosen size with the fix (final performance).

i5-4258U
                 512x320         2560x1600       5120x3200
Chunk=128       526.99 Mpx/s    326.40 Mpx/s    240.56 Mpx/s
Chunk=512,8     783.57 Mpx/s    473.03 Mpx/s    292.84 Mpx/s
Fastest         783.57 Mpx/s    473.03 Mpx/s    295.64 Mpx/s
Optimized       987.35 Mpx/s    501.40 Mpx/s    298.20 Mpx/s

i5-4430
                 512x320         2560x1600       5120x3200
Chunk=128       597.56 Mpx/s    415.84 Mpx/s    294.89 Mpx/s
Chunk=512,8     870.97 Mpx/s    573.03 Mpx/s    350.03 Mpx/s
Fastest         870.97 Mpx/s    573.03 Mpx/s    350.03 Mpx/s
Optimized      1085.62 Mpx/s    606.72 Mpx/s    352.94 Mpx/s

E5-2666 v3
                 512x320         2560x1600       5120x3200
Chunk=128       604.39 Mpx/s    362.03 Mpx/s    123.80 Mpx/s
Chunk=512,8     845.26 Mpx/s    474.35 Mpx/s    171.73 Mpx/s
Fastest         845.26 Mpx/s    474.35 Mpx/s    171.73 Mpx/s
Optimized       987.35 Mpx/s    475.23 Mpx/s    171.19 Mpx/s

E5-2680 v2
                 512x320         2560x1600       5120x3200
Chunk=128       560.98 Mpx/s    408.33 Mpx/s    142.27 Mpx/s
Chunk=512,8     787.16 Mpx/s    464.94 Mpx/s    175.66 Mpx/s
Fastest         787.16 Mpx/s    488.08 Mpx/s    175.66 Mpx/s
Optimized       963.81 Mpx/s    476.89 Mpx/s    177.08 Mpx/s
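To read the comparison as relative gains, the "Chunk=128" and "Optimized" rows can be turned into speedup ratios. The sketch below only restates the numbers from the tables above; the dictionary layout and the `speedups` helper are assumptions made for this example.

```python
# (512x320, 2560x1600, 5120x3200) in Mpx/s, copied from the tables above.
RESULTS = {
    "i5-4258U": {"before": (526.99, 326.40, 240.56), "after": (987.35, 501.40, 298.20)},
    "i5-4430": {"before": (597.56, 415.84, 294.89), "after": (1085.62, 606.72, 352.94)},
    "E5-2666 v3": {"before": (604.39, 362.03, 123.80), "after": (987.35, 475.23, 171.19)},
    "E5-2680 v2": {"before": (560.98, 408.33, 142.27), "after": (963.81, 476.89, 177.08)},
}


def speedups(results):
    """Ratio of optimized to Chunk=128 throughput for each CPU and image size."""
    return {
        cpu: tuple(after / before for before, after in zip(rows["before"], rows["after"]))
        for cpu, rows in results.items()
    }


if __name__ == "__main__":
    for cpu, ratios in speedups(RESULTS).items():
        print(cpu, " ".join("%.2fx" % r for r in ratios))
```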

One more thing

I've implemented the last possible transformation: it is the equivalent of TRANSPOSE + ROTATE_180, and I don't know a better name for it than TRANSPOSE_ROTATE_180.
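A small sketch of the equivalence on plain nested lists. The helper names are made up for this example and the input is assumed to be a non-empty list of equal-length rows; the `xr`/`yr` reversed indices mirror the naming used in the C code but the function is not a port of it.

```python
def transpose(src):
    """Flip around the main diagonal."""
    return [list(col) for col in zip(*src)]


def rotate_180(src):
    """Reverse the row order and the pixel order within each row."""
    return [row[::-1] for row in src[::-1]]


def transpose_rotate_180(src):
    """Do both in a single pass: pixel (x, y) lands at (yr, xr)."""
    height, width = len(src), len(src[0])
    out = [[None] * height for _ in range(width)]
    for y in range(height):
        yr = height - 1 - y
        for x in range(width):
            xr = width - 1 - x
            out[xr][yr] = src[y][x]
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6]]
    print(transpose_rotate_180(image))
    print(rotate_180(transpose(image)))
```

Doing it in one pass avoids the intermediate image that two separate operations would need.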

wiredfool · 2017-09-11T19:11:53Z

> I believe that Chunk=16 works fast in many cases because 16 * 4 bytes per pixel = 64 bytes, which is the size of a cache line in every modern CPU. On the other hand, Chunk=8 often works even faster because the data is not aligned to 64 bytes and often crosses cache lines.

Would it make sense to ensure that allocations, (and lines) are cache aligned?

homm · 2017-09-11T19:12:40Z

> I don't know a better name for it than TRANSPOSE_ROTATE_180.

In ImageMagick it is called transverse. I think this is an appropriate name.

homm · 2017-09-11T19:42:05Z

> Would it make sense to ensure that allocations, (and lines) are cache aligned?

It's about 11% faster for large images, but there is no easy, cross-platform way to do that.

radarhere · 2017-09-12T02:48:26Z

libImaging/Geometry.c

+                out[xr] = in[x];
+        }
+    }
+


If you think this suggestion is less clear, feel free to ignore.

for (y = 0; y < imIn->ysize; y++, yr--) {
    if (imIn->image8) {
        UINT8* in = (UINT8*) imIn->image8[y];
        UINT8* out = (UINT8*) imOut->image8[yr];
    } else {
        UINT32* in = (UINT32*) imIn->image32[y];
        UINT32* out = (UINT32*) imOut->image32[yr];
    }
    xr = imIn->xsize-1;
    for (x = 0; x < imIn->xsize; x++, xr--)
        out[xr] = in[x];
}
}

wiredfool · 2017-09-12T07:24:22Z

Transverse is transpose around the other diagonal, right?

I agree that hitting cache lines would be hard on many processors (and could waste a decent amount of memory). Is there a gain to be had by aligning to something at SSE width, or at the Chunk=8?
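To put rough numbers on the memory cost, here is a sketch that pads each image row up to a given alignment. It assumes rows would be padded individually and that each pixel takes 4 bytes; `padded_stride` and `wasted_bytes` are hypothetical helpers, and nothing here describes how Pillow actually allocates image memory. The 16-byte and 32-byte alignments stand for SSE register width and a Chunk=8 run of pixels respectively.

```python
def padded_stride(width_px, bytes_per_pixel=4, align=64):
    """Row length in bytes, rounded up to a multiple of `align`."""
    raw = width_px * bytes_per_pixel
    return (raw + align - 1) // align * align


def wasted_bytes(width_px, height_px, bytes_per_pixel=4, align=64):
    """Total padding added to an image whose rows are aligned to `align` bytes."""
    raw = width_px * bytes_per_pixel
    return (padded_stride(width_px, bytes_per_pixel, align) - raw) * height_px


if __name__ == "__main__":
    for align in (16, 32, 64):
        for size in ((5120, 3200), (513, 320), (1, 100)):
            print(align, size, wasted_bytes(size[0], size[1], align=align))
```

Under these assumptions the padding is zero for widths that are already a multiple of 16 pixels and matters most for very narrow images.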

homm · 2017-09-12T11:09:04Z

> Transverse is transpose around the other diagonal, right?

Yes, this is the right meaning.

@radarhere I rewrote ImagingRotate180 and ImagingFlipLeftRight using macros, like for the other functions.

homm · 2017-09-12T13:37:54Z

Coverage decreased because of the last commit: it removed 28 covered lines and added only 4 (because macros are not lines according to coveralls).

homm added 4 commits September 11, 2017 01:07

2 times faster ImagingRotate180

6745979

Implement ImagingTransposeRotate180

a2a2d8d

Change geometry chunk size

b8789e6

3-level transpospose

fd297fe

homm added the Needs Tests label Sep 11, 2017

homm added 2 commits September 11, 2017 22:58

rename TRANSPOSE_ROTATE_180 to TRANSVERSE

29515f5

tests for transverse, add to docs

b6b3b00

homm removed the Needs Tests label Sep 11, 2017

homm mentioned this pull request Sep 11, 2017

Release Pillow 4.3.0 on October 1, 2017 #2664

Closed

radarhere reviewed Sep 12, 2017

Use macros for FLIP_LEFT_RIGHT and ROTATE_180

bc13a9d

homm added this to the 4.3.0 milestone Sep 12, 2017

wiredfool merged commit 7541755 into python-pillow:master Sep 19, 2017

homm deleted the fast-geometry branch September 19, 2017 17:07
