Faster Transposition #2730

Merged
merged 7 commits into from
Sep 19, 2017

homm
Member

@homm homm commented Sep 11, 2017

Last time I sped up image transposition by using 2-level loops, which work better with the CPU cache. There were a bunch of benchmarks showing that 2-level loops are much faster.
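For readers unfamiliar with the technique, here is a minimal sketch of a 2-level (tiled) transposition. It is not the Pillow implementation: it assumes a flat, row-major array of 32-bit pixels, and `transpose_tiled` and its parameters are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Transpose a width x height image of 32-bit pixels (row-major) into a
   height x width image. `chunk` is the tile edge in pixels.
   Hypothetical helper for illustration only. */
void transpose_tiled(const uint32_t *in, uint32_t *out,
                     size_t width, size_t height, size_t chunk)
{
    assert(chunk > 0);
    /* Outer level: walk over chunk x chunk tiles. */
    for (size_t y0 = 0; y0 < height; y0 += chunk) {
        for (size_t x0 = 0; x0 < width; x0 += chunk) {
            /* Clamp the tile at the image border. */
            size_t y1 = y0 + chunk < height ? y0 + chunk : height;
            size_t x1 = x0 + chunk < width ? x0 + chunk : width;
            /* Inner level: transpose one tile. */
            for (size_t y = y0; y < y1; y++) {
                for (size_t x = x0; x < x1; x++) {
                    /* A row of the output is `height` pixels long. */
                    out[x * height + y] = in[y * width + x];
                }
            }
        }
    }
}
```

The point of the tiling is that both the rows read and the rows written inside one tile stay in cache while the tile is processed, instead of the output being touched one pixel per row across the whole image.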

Unfortunately, transposing big matrices is still inexplicably slow to me. On the current master, an i5-4430 CPU transposes a large 5120x3200 RGB image at 295 Mpx/s. That is, it moves 1180 megabytes per second (at 4 bytes per pixel), which is 1/18 of the memory bandwidth. Each pixel requires one read and one write and, roughly speaking, 1-2 extra operations for the loops, so in the worst case this takes about 1.2 GHz of a 3.2 GHz CPU.
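The arithmetic behind these two estimates can be written down as a small sketch. The function names are hypothetical, and the model (one read, one write, and a fixed number of loop operations per pixel) is only the rough approximation used in the paragraph above, not a measurement.

```c
#include <assert.h>

/* Bytes per second moved in one direction (read or write) at a given
   pixel rate, assuming every pixel occupies bytes_per_pixel bytes. */
double bytes_per_second(double mpx_per_s, int bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return mpx_per_s * 1e6 * bytes_per_pixel;
}

/* Rough operations per second: one read, one write, plus loop_ops
   operations of loop overhead for every pixel. */
double ops_per_second(double mpx_per_s, int loop_ops)
{
    assert(loop_ops >= 0);
    return mpx_per_s * 1e6 * (2 + loop_ops);
}
```

For 295 Mpx/s and 4 bytes per pixel the first function gives 1.18e9 bytes per second (1180 MB/s); with 2 loop operations per pixel the second gives 1.18e9 operations per second, which is the "about 1.2 GHz" figure.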

So I started experimenting. Last time I selected a chunk size of 128 based on several tests. This time I tested different chunk sizes on four available CPUs with three image sizes: relatively small (fits into L3 cache), relatively large (2-5 times the L3 cache of the desktop CPUs), and extra large (2-3 times the L3 cache of the server CPUs). This is the performance of Image.TRANSPOSE, but it is almost the same for Image.ROTATE_90 and Image.ROTATE_270.

i5-4258U
                 512x320         2560x1600       5120x3200
Chunk=8         773.00 Mpx/s    299.88 Mpx/s    178.38 Mpx/s
Chunk=16        784.47 Mpx/s    314.09 Mpx/s    227.64 Mpx/s
Chunk=32        783.57 Mpx/s    338.01 Mpx/s    239.08 Mpx/s
Chunk=64        712.86 Mpx/s    336.15 Mpx/s    224.80 Mpx/s
Chunk=128       526.99 Mpx/s    326.40 Mpx/s    240.56 Mpx/s
Chunk=256       406.62 Mpx/s    278.09 Mpx/s    262.29 Mpx/s
Chunk=512       407.59 Mpx/s    275.88 Mpx/s    223.16 Mpx/s
Chunk=1024      410.76 Mpx/s    188.63 Mpx/s    111.92 Mpx/s

i5-4430
                 512x320         2560x1600       5120x3200
Chunk=8         866.58 Mpx/s    556.31 Mpx/s    298.67 Mpx/s
Chunk=16        862.23 Mpx/s    505.81 Mpx/s    312.78 Mpx/s
Chunk=32        862.23 Mpx/s    458.42 Mpx/s    322.88 Mpx/s
Chunk=64        799.06 Mpx/s    432.70 Mpx/s    303.64 Mpx/s
Chunk=128       597.56 Mpx/s    415.84 Mpx/s    294.89 Mpx/s
Chunk=256       457.52 Mpx/s    354.08 Mpx/s    310.84 Mpx/s
Chunk=512       457.52 Mpx/s    357.91 Mpx/s    280.47 Mpx/s
Chunk=1024      459.05 Mpx/s    326.90 Mpx/s    199.09 Mpx/s

E5-2666 v3
                 512x320         2560x1600       5120x3200
Chunk=8         831.95 Mpx/s    419.63 Mpx/s    161.96 Mpx/s
Chunk=16        831.95 Mpx/s    449.82 Mpx/s    141.83 Mpx/s
Chunk=32        815.18 Mpx/s    453.25 Mpx/s    133.27 Mpx/s
Chunk=64        761.86 Mpx/s    364.06 Mpx/s    136.88 Mpx/s
Chunk=128       604.39 Mpx/s    362.03 Mpx/s    123.80 Mpx/s
Chunk=256       452.40 Mpx/s    327.50 Mpx/s    124.97 Mpx/s
Chunk=512       453.89 Mpx/s    344.67 Mpx/s    169.71 Mpx/s
Chunk=1024      447.68 Mpx/s    282.17 Mpx/s    135.36 Mpx/s

E5-2680 v2
                 512x320         2560x1600       5120x3200
Chunk=8         766.10 Mpx/s    430.21 Mpx/s    175.39 Mpx/s
Chunk=16        783.57 Mpx/s    473.91 Mpx/s    154.12 Mpx/s
Chunk=32        776.49 Mpx/s    468.32 Mpx/s    146.73 Mpx/s
Chunk=64        751.85 Mpx/s    398.37 Mpx/s    141.45 Mpx/s
Chunk=128       560.98 Mpx/s    408.33 Mpx/s    142.27 Mpx/s
Chunk=256       443.92 Mpx/s    350.87 Mpx/s    147.06 Mpx/s
Chunk=512       441.64 Mpx/s    349.43 Mpx/s    137.96 Mpx/s
Chunk=1024      448.85 Mpx/s    328.57 Mpx/s     84.50 Mpx/s

There are several interesting patterns.

  • First of all, on each CPU, the smaller the chunk, the faster the small transposition (512x320) runs.
  • On the server CPUs the smallest chunk size is also the fastest (with one exception) for large images (5120x3200), but this is not true for the desktop CPUs, where Chunk=32 and Chunk=256 are faster than their neighbors. For some reason Chunk=512 is far faster than the others on the E5-2666 v3, while its neighbors are relatively slow.
  • As for the medium size (2560x1600), it looks totally unpredictable.

I believe that Chunk=16 is fast in many cases because 16 * 4 bytes per pixel = 64 bytes, which is the cache line size on modern x86 CPUs. On the other hand, Chunk=8 is often even faster, because the data is not aligned to 64 bytes and a chunk often crosses cache lines.
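To make the alignment argument concrete, here is a small sketch that counts how many cache lines a run of pixels touches. The 64-byte line size is an assumption, `cache_lines_touched` is a made-up helper, and the sketch only illustrates the arithmetic; it does not prove that this is the real reason for the measured difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Number of cache lines covered by `pixels` consecutive pixels that
   start at byte address `addr`. Hypothetical helper for illustration. */
size_t cache_lines_touched(uintptr_t addr, size_t pixels,
                           size_t bytes_per_pixel)
{
    assert(pixels > 0 && bytes_per_pixel > 0);
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last = (addr + pixels * bytes_per_pixel - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}
```

A run of 16 four-byte pixels covers exactly one line only when it starts on a 64-byte boundary; at any other offset it always spans two lines. A run of 8 pixels spans two lines only for some offsets (for example, starting at byte 48 but not at byte 16).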

The current value (Chunk=128) does not look like the best choice, but there is also no other obvious leader. So, what to do?

Introducing 3-level loops :-)

We have a chunk size of 128, which means 128 * 128 * 4 bytes per pixel * 2 images = 128 KB of data: enough to fit into L2 cache. What if we reduce the chunk size to a smaller value that fits into L1 cache, and add another, outer level that fits into L2/L3 cache? In the tables, Chunk=A,B means an outer chunk of A with an inner chunk of B.
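A minimal sketch of the 3-level idea, under the same assumptions as any standalone illustration: a flat row-major array of 32-bit pixels and hypothetical names (`transpose_3level`, `tile_working_set_bytes`, `min_size`). The code in this pull request is organized differently; this only shows the loop structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes touched while processing one chunk x chunk tile: the source
   tile plus the destination tile. */
size_t tile_working_set_bytes(size_t chunk, size_t bytes_per_pixel)
{
    assert(chunk > 0 && bytes_per_pixel > 0);
    return chunk * chunk * bytes_per_pixel * 2;
}

size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Transpose with two tile levels: `outer` tiles sized for L2/L3 cache,
   split into `inner` tiles sized for L1 cache. */
void transpose_3level(const uint32_t *in, uint32_t *out,
                      size_t width, size_t height,
                      size_t outer, size_t inner)
{
    assert(inner > 0 && outer >= inner);
    for (size_t y0 = 0; y0 < height; y0 += outer) {
        for (size_t x0 = 0; x0 < width; x0 += outer) {
            size_t y0_end = min_size(y0 + outer, height);
            size_t x0_end = min_size(x0 + outer, width);
            /* Second level: inner tiles inside the outer tile. */
            for (size_t y1 = y0; y1 < y0_end; y1 += inner) {
                for (size_t x1 = x0; x1 < x0_end; x1 += inner) {
                    size_t y1_end = min_size(y1 + inner, y0_end);
                    size_t x1_end = min_size(x1 + inner, x0_end);
                    /* Third level: the pixels of one inner tile. */
                    for (size_t y = y1; y < y1_end; y++) {
                        for (size_t x = x1; x < x1_end; x++) {
                            out[x * height + y] = in[y * width + x];
                        }
                    }
                }
            }
        }
    }
}
```

With 4 bytes per pixel, `tile_working_set_bytes(128, 4)` is 131072 bytes (the 128 KB above), while an inner tile of 8 needs only 512 bytes.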

i5-4258U
                 512x320         2560x1600       5120x3200
Chunk=64,8      773.00 Mpx/s    443.34 Mpx/s    256.59 Mpx/s
Chunk=128,8     766.10 Mpx/s    450.11 Mpx/s    281.14 Mpx/s
Chunk=256,8     776.49 Mpx/s    462.25 Mpx/s    295.64 Mpx/s
Chunk=512,8     783.57 Mpx/s    473.03 Mpx/s    292.84 Mpx/s
Chunk=64,16     769.54 Mpx/s    419.42 Mpx/s    253.36 Mpx/s
Chunk=128,16    769.54 Mpx/s    413.69 Mpx/s    277.51 Mpx/s
Chunk=256,16    765.25 Mpx/s    430.07 Mpx/s    291.23 Mpx/s
Chunk=512,16    776.49 Mpx/s    426.89 Mpx/s    292.11 Mpx/s

i5-4430
                 512x320         2560x1600       5120x3200
Chunk=64,8      849.44 Mpx/s    566.60 Mpx/s    346.38 Mpx/s
Chunk=128,8     823.97 Mpx/s    570.40 Mpx/s    339.70 Mpx/s
Chunk=256,8     857.92 Mpx/s    584.91 Mpx/s    346.96 Mpx/s
Chunk=512,8     870.97 Mpx/s    573.03 Mpx/s    350.03 Mpx/s
Chunk=64,16     844.22 Mpx/s    530.23 Mpx/s    338.49 Mpx/s
Chunk=128,16    836.00 Mpx/s    522.79 Mpx/s    332.62 Mpx/s
Chunk=256,16    826.95 Mpx/s    516.19 Mpx/s    337.27 Mpx/s
Chunk=512,16    857.92 Mpx/s    512.89 Mpx/s    346.55 Mpx/s

E5-2666 v3
                 512x320         2560x1600       5120x3200
Chunk=64,8      822.99 Mpx/s    463.03 Mpx/s    147.79 Mpx/s
Chunk=128,8     827.95 Mpx/s    459.04 Mpx/s    140.30 Mpx/s
Chunk=256,8     836.00 Mpx/s    470.32 Mpx/s    144.62 Mpx/s
Chunk=512,8     845.26 Mpx/s    474.35 Mpx/s    171.73 Mpx/s
Chunk=64,16     811.33 Mpx/s    470.27 Mpx/s    137.55 Mpx/s
Chunk=128,16    807.51 Mpx/s    468.60 Mpx/s    140.57 Mpx/s
Chunk=256,16    822.99 Mpx/s    468.70 Mpx/s    143.72 Mpx/s
Chunk=512,16    830.95 Mpx/s    470.27 Mpx/s    155.74 Mpx/s

E5-2680 v2
                 512x320         2560x1600       5120x3200
Chunk=64,8      766.10 Mpx/s    457.09 Mpx/s    158.32 Mpx/s
Chunk=128,8     769.54 Mpx/s    467.85 Mpx/s    163.17 Mpx/s
Chunk=256,8     776.49 Mpx/s    472.82 Mpx/s    166.31 Mpx/s
Chunk=512,8     787.16 Mpx/s    464.94 Mpx/s    175.66 Mpx/s
Chunk=64,16     776.49 Mpx/s    477.51 Mpx/s    152.06 Mpx/s
Chunk=128,16    776.49 Mpx/s    487.73 Mpx/s    153.67 Mpx/s
Chunk=256,16    776.49 Mpx/s    486.34 Mpx/s    155.96 Mpx/s
Chunk=512,16    787.16 Mpx/s    488.08 Mpx/s    160.12 Mpx/s

In all cases small images are transposed at close to the maximum speed. On each CPU the fastest 3-level loop is faster than the 2-level loop for medium and large images, so it is an obvious win. It only remains to decide which pair of chunk sizes is better.

In most cases an inner loop of size 8 is faster than one of size 16. Chunk=256,8 is the fastest 3 times and Chunk=512,8 is the fastest 5 times, so I chose the latter.

Later I noticed that the test implementation was not optimal, but I didn't rerun all the benchmarks on all CPUs, only the chosen chunk size. Here is the comparison: the current chunk size (Chunk=128), the chosen chunk size without the fix (Chunk=512,8), the best result without the fix (Fastest), and the chosen size with the fix (Optimized, the final performance).

i5-4258U
                 512x320         2560x1600       5120x3200
Chunk=128       526.99 Mpx/s    326.40 Mpx/s    240.56 Mpx/s
Chunk=512,8     783.57 Mpx/s    473.03 Mpx/s    292.84 Mpx/s
Fastest         783.57 Mpx/s    473.03 Mpx/s    295.64 Mpx/s
Optimized       987.35 Mpx/s    501.40 Mpx/s    298.20 Mpx/s

i5-4430
                 512x320         2560x1600       5120x3200
Chunk=128       597.56 Mpx/s    415.84 Mpx/s    294.89 Mpx/s
Chunk=512,8     870.97 Mpx/s    573.03 Mpx/s    350.03 Mpx/s
Fastest         870.97 Mpx/s    573.03 Mpx/s    350.03 Mpx/s
Optimized      1085.62 Mpx/s    606.72 Mpx/s    352.94 Mpx/s

E5-2666 v3
                 512x320         2560x1600       5120x3200
Chunk=128       604.39 Mpx/s    362.03 Mpx/s    123.80 Mpx/s
Chunk=512,8     845.26 Mpx/s    474.35 Mpx/s    171.73 Mpx/s
Fastest         845.26 Mpx/s    474.35 Mpx/s    171.73 Mpx/s
Optimized       987.35 Mpx/s    475.23 Mpx/s    171.19 Mpx/s

E5-2680 v2
                 512x320         2560x1600       5120x3200
Chunk=128       560.98 Mpx/s    408.33 Mpx/s    142.27 Mpx/s
Chunk=512,8     787.16 Mpx/s    464.94 Mpx/s    175.66 Mpx/s
Fastest         787.16 Mpx/s    488.08 Mpx/s    175.66 Mpx/s
Optimized       963.81 Mpx/s    476.89 Mpx/s    177.08 Mpx/s

One more thing

I've implemented the last possible transformation: it is equivalent to TRANSPOSE + ROTATE_180, and I don't know a better name for it than TRANSPOSE_ROTATE_180.
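The equivalence can be checked with a small sketch. It assumes a flat row-major array of 32-bit pixels and uses plain, untiled loops; the function names are made up for this illustration and are not the functions added in this pull request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* out is height x width: out(row x, col y) = in(row y, col x). */
void transpose(const uint32_t *in, uint32_t *out,
               size_t width, size_t height)
{
    assert(in != out); /* in-place operation is not supported */
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            out[x * height + y] = in[y * width + x];
}

/* Same dimensions as the input, both axes reversed. */
void rotate_180(const uint32_t *in, uint32_t *out,
                size_t width, size_t height)
{
    assert(in != out);
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            out[(height - 1 - y) * width + (width - 1 - x)] =
                in[y * width + x];
}

/* The combined transformation in one pass: a transposition with both
   output axes reversed. */
void transpose_rotate_180(const uint32_t *in, uint32_t *out,
                          size_t width, size_t height)
{
    assert(in != out);
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            out[(width - 1 - x) * height + (height - 1 - y)] =
                in[y * width + x];
}
```

For a 3x2 image with rows `1 2 3` and `4 5 6`, both the single pass and the two-step version produce the rows `6 3`, `5 2`, `4 1`.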

@wiredfool
Member

I believe that Chunk=16 is fast in many cases because 16 * 4 bytes per pixel = 64 bytes, which is the cache line size on modern x86 CPUs. On the other hand, Chunk=8 is often even faster, because the data is not aligned to 64 bytes and a chunk often crosses cache lines.

Would it make sense to ensure that allocations (and lines) are cache aligned?

@homm
Member Author

homm commented Sep 11, 2017

I don't know a better name for it than TRANSPOSE_ROTATE_180.

In ImageMagick it is called transverse. I think this is an appropriate name.

@homm
Member Author

homm commented Sep 11, 2017

Would it make sense to ensure that allocations (and lines) are cache aligned?

It's about 11% faster for a large image, but there is no easy, cross-platform way to do that.
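For context: C11 has `aligned_alloc`, POSIX has `posix_memalign`, and MSVC has `_aligned_malloc`, but none of them is available everywhere. One portable workaround is to over-allocate with `malloc` and round the pointer up by hand. The sketch below shows that workaround only as an illustration; the 64-byte line size and all names are assumptions, and this is not what this pull request does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Round a row size in bytes up to a whole number of cache lines. */
size_t padded_stride(size_t row_bytes)
{
    assert(row_bytes <= SIZE_MAX - (CACHE_LINE - 1));
    return (row_bytes + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
}

/* Allocate `size` bytes starting on a cache line boundary.
   The pointer returned by malloc is saved just before the block. */
void *alloc_cache_aligned(size_t size)
{
    size_t extra = CACHE_LINE - 1 + sizeof(void *);
    if (size > SIZE_MAX - extra)
        return NULL;
    void *raw = malloc(size + extra);
    if (raw == NULL)
        return NULL;
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned =
        (start + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Release a block obtained from alloc_cache_aligned. */
void free_cache_aligned(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

The cost is up to 63 + `sizeof(void *)` wasted bytes per allocation, plus the row padding from `padded_stride` if every line should be aligned as well.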

out[xr] = in[x];
}
}

Member

If you think this suggestion is less clear, feel free to ignore.

for (y = 0; y < imIn->ysize; y++, yr--) {
	xr = imIn->xsize-1;
	/* in/out must be declared in the scope that uses them,
	   so each branch carries its own inner loop */
	if (imIn->image8) {
		UINT8* in = (UINT8*) imIn->image8[y];
		UINT8* out = (UINT8*) imOut->image8[yr];
		for (x = 0; x < imIn->xsize; x++, xr--)
			out[xr] = in[x];
	} else {
		UINT32* in = (UINT32*) imIn->image32[y];
		UINT32* out = (UINT32*) imOut->image32[yr];
		for (x = 0; x < imIn->xsize; x++, xr--)
			out[xr] = in[x];
	}
}

@wiredfool
Member

Transverse is transpose around the other diagonal, right?

I agree that hitting cache lines would be hard on many processors (and could waste a decent amount of memory). Is there a gain to be had by aligning to something at SSE width, or to Chunk=8?

@homm
Member Author

homm commented Sep 12, 2017

Transverse is transpose around the other diagonal, right?

Yes, that is the right meaning.

@radarhere I rewrote ImagingRotate180 and ImagingFlipLeftRight using macros, as for the other functions.
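For illustration, the macro approach could look roughly like the following sketch. The `Image` struct here is a cut-down stand-in, and `FLIP_ROW_LOOP` and `flip_left_right` are hypothetical names; the actual macros in the commit may differ. The macro lets the 8-bit and 32-bit cases share one loop body even though the pointer types differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for an image: one of the two row-pointer arrays
   is set, the other is NULL. */
typedef struct {
    int xsize, ysize;
    uint8_t **image8;   /* rows of 8-bit pixels, or NULL */
    uint32_t **image32; /* rows of 32-bit pixels, or NULL */
} Image;

/* One loop body, instantiated once per pixel type. TYPE is the pixel
   type and ROWS is the name of the row-pointer member to use. */
#define FLIP_ROW_LOOP(TYPE, ROWS)                  \
    for (y = 0; y < in->ysize; y++) {              \
        TYPE *src = in->ROWS[y];                   \
        TYPE *dst = out->ROWS[y];                  \
        xr = in->xsize - 1;                        \
        for (x = 0; x < in->xsize; x++, xr--)      \
            dst[xr] = src[x];                      \
    }

/* Mirror every row of `in` into `out` (same size, same pixel type). */
void flip_left_right(Image *out, const Image *in)
{
    int x, y, xr;
    assert(out->xsize == in->xsize && out->ysize == in->ysize);
    if (in->image8) {
        FLIP_ROW_LOOP(uint8_t, image8)
    } else {
        FLIP_ROW_LOOP(uint32_t, image32)
    }
}
```

This keeps the source free of duplicated loops, at the price that the loop lines live inside a macro definition.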

@homm homm added this to the 4.3.0 milestone Sep 12, 2017
@homm
Member Author

homm commented Sep 12, 2017

Coverage decreased because of the last commit: it removed 28 covered lines and added only 4 (because macros are not counted as lines by Coveralls).

@wiredfool wiredfool merged commit 7541755 into python-pillow:master Sep 19, 2017
@homm homm deleted the fast-geometry branch September 19, 2017 17:07