Stretch optimization #977

homm · 2014-10-26T17:06:55Z

This pull request speeds up one of Pillow's core features: image resizing. It also includes the changes from #970.

Main part

I've changed the traversal order of the pixels.

Old: There are two passes, a horizontal stretch and a vertical stretch. In the horizontal pass, pixels are iterated from left to right first, then from top to bottom. This allows each set of coefficients to be calculated only once, but it is very inefficient for the CPU cache. In the vertical pass the traversal order is reversed: from top to bottom first, then from left to right. However, calculating each pixel of the destination image requires pixels from different lines of the source image.

New: The stretch always works in the horizontal direction; pixels are iterated from top to bottom first, then from left to right. Calculating each pixel of the destination image requires only pixels from the same line of the source image, so the CPU cache is used as efficiently as possible. The coefficients for each column are calculated in advance and stored separately, which avoids recomputing them for every row. For the vertical stretch, the image is transposed and stretched in the horizontal direction again, then transposed back to match the original orientation. The transposition is also done in a cache-friendly way: in blocks of 64×64 pixels (16 KB).
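
The scheme described above can be sketched in pure Python. This is only an illustration, not the C code from this pull request: it assumes a single-band image stored as a list of rows, a triangle (bilinear) kernel, and hypothetical helper names (`precompute_coeffs`, `stretch_horizontal`, `transpose`, `stretch`). The simple `zip`-based transpose stands in for the block-wise transposition.

```python
import math

def triangle(x):
    # Triangle (bilinear) kernel with support 1.0; chosen only for the sketch.
    return max(0.0, 1.0 - abs(x))

def precompute_coeffs(in_size, out_size, kernel=triangle, support=1.0):
    # For every output column, compute once the source window [xmin, xmax)
    # and the normalized weights applied to that window.
    scale = in_size / out_size
    filterscale = max(scale, 1.0)
    radius = support * filterscale
    bounds, coeffs = [], []
    for xx in range(out_size):
        center = (xx + 0.5) * scale
        xmin = max(0, int(math.floor(center - radius)))
        xmax = min(in_size, int(math.ceil(center + radius)))
        k = [kernel((x + 0.5 - center) / filterscale) for x in range(xmin, xmax)]
        total = sum(k)
        bounds.append((xmin, xmax))
        coeffs.append([w / total for w in k])
    return bounds, coeffs

def stretch_horizontal(rows, out_width):
    # Outer loop over rows, inner loop over output columns: every output
    # pixel reads only from its own source row.
    bounds, coeffs = precompute_coeffs(len(rows[0]), out_width)
    out = []
    for row in rows:
        out.append([
            sum(row[x] * w for x, w in zip(range(xmin, xmax), k))
            for (xmin, xmax), k in zip(bounds, coeffs)
        ])
    return out

def transpose(rows):
    # Plain transpose; the real code works block by block for cache reasons.
    return [list(col) for col in zip(*rows)]

def stretch(rows, out_width, out_height):
    # Horizontal pass, transpose, horizontal pass again, transpose back.
    tmp = stretch_horizontal(rows, out_width)
    tmp = transpose(tmp)
    tmp = stretch_horizontal(tmp, out_height)
    return transpose(tmp)

source = [[10.0] * 8 for _ in range(6)]   # 8 wide, 6 high, constant value
result = stretch(source, 4, 3)            # 4 wide, 3 high
```

Because the weights for each column are normalized, a constant image stays constant after both passes, which makes a convenient sanity check for the sketch.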

Other optimizations

Effective cache usage makes other optimizations possible. Mostly this means using integers instead of floats where possible. Unrolling the loop over bands also gave a huge boost.

Benchmarks

I've tested performance before and after these changes. I've also tested GraphicsMagick 1.3.18 with the following equivalent filters: antialias = lanczos, bicubic = catrom, bilinear = triangle.

Source image: http://www.apple.com/imac-with-retina/5k.html

Filter	Destination size	GraphicsMagick	Before	After
Antialias	2048×1152	0.88 s (17 Mpx/s)	1.016 s (14 Mpx/s)	0.421 s (35 Mpx/s)
Antialias	320×240	0.65 s (23 Mpx/s)	0.473 s (31 Mpx/s)	0.225 s (66 Mpx/s)
Bicubic	2048×1152	0.64 s (23 Mpx/s)	0.840 s (18 Mpx/s)	0.326 s (45 Mpx/s)
Bicubic	320×240	0.37 s (39 Mpx/s)	0.326 s (45 Mpx/s)	0.154 s (96 Mpx/s)
Bilinear	2048×1152	0.42 s (35 Mpx/s)	0.650 s (23 Mpx/s)	0.237 s (62 Mpx/s)
Bilinear	320×240	0.19 s (78 Mpx/s)	0.184 s (80 Mpx/s)	0.082 s (180 Mpx/s)
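
For readers wondering how the Mpx/s figures relate to the timings, here is a small sketch. It assumes the source image is 5120×2880 pixels (the size is not stated in this thread) and that throughput is source megapixels divided by elapsed seconds; under those assumptions it matches figures such as 35 and 180 Mpx/s. The name `mpx_per_second` is made up for the example.

```python
# Assumed source dimensions; not given explicitly in the benchmark.
SRC_WIDTH, SRC_HEIGHT = 5120, 2880

def mpx_per_second(seconds):
    # Source megapixels processed per second of wall-clock time.
    return SRC_WIDTH * SRC_HEIGHT / 1e6 / seconds

antialias_large = round(mpx_per_second(0.421))
bilinear_small = round(mpx_per_second(0.082))
print(antialias_large, bilinear_small)
```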

Test source:

from PIL import Image
import time

def timeit(n, f, *args, **kwargs):
    # Run f n times and return the fastest wall-clock time in seconds.
    def run():
        start = time.time()
        f(*args, **kwargs)
        return time.time() - start

    return min(run() for _ in range(n))

n = 20
# .copy() forces the image data to be loaded before timing starts.
image = Image.open('5k_image.png').copy()
print('warmup {}'.format(timeit(n // 4, image.im.stretch, (2048, 1152), Image.ANTIALIAS)))
print('Antialias | 2048x1152 | {:.4}'.format(timeit(n, image.im.stretch, (2048, 1152), Image.ANTIALIAS)))
print('Antialias | 320x240   | {:.4}'.format(timeit(n, image.im.stretch, (320, 240),   Image.ANTIALIAS)))
print('Bicubic   | 2048x1152 | {:.4}'.format(timeit(n, image.im.stretch, (2048, 1152), Image.BICUBIC)))
print('Bicubic   | 320x240   | {:.4}'.format(timeit(n, image.im.stretch, (320, 240),   Image.BICUBIC)))
print('Bilinear  | 2048x1152 | {:.4}'.format(timeit(n, image.im.stretch, (2048, 1152), Image.BILINEAR)))
print('Bilinear  | 320x240   | {:.4}'.format(timeit(n, image.im.stretch, (320, 240),   Image.BILINEAR)))

coveralls · 2014-10-26T17:18:35Z

Changes Unknown when pulling 6f11bfd on homm:fast-stretch into python-pillow:master.

homm · 2014-10-27T11:53:26Z

For now, the code generated on 64-bit Linux runs at almost the same speed as before because of a bug in GCC. I'll try to find a workaround.

homm · 2014-10-28T21:15:00Z

I've added a fix for 64-bit GCC 4.8 suggested by Vyacheslav Egorov, a V8 core developer. I asked him on Twitter whether he would accept similar code if someone wrote it for V8. He said yes, provided it is properly commented and guarded by preprocessor directives.

Thanks to the directives, this fix is used only on 64-bit x86 with SSE enabled, on GCC versions lower than 4.9, and not on Clang. The alternative i2f version is valid C code without any hacks and should work with any compiler. I've tested it on 32- and 64-bit Ubuntu, OS X, and 64-bit Windows; it works as expected.

I think this fix is important because 64-bit Linux with GCC 4.8 makes up a large share of Pillow's installation base. I think it is about 50% of all Pillow installations, maybe more.

homm · 2014-10-28T21:40:02Z

If anyone is looking for an even faster version, I've implemented the main pixel loop for 8-bit channels with SSE4 instructions. It is in the fast-stretch-sse branch. I don't think it can be merged into Pillow; use it at your own risk! The results are very impressive, though:

Filter	Destination size	Before	After	SSE
Antialias	2048×1152	1.016 s (14 Mpx/s)	0.421 s (35 Mpx/s)	0.1845 s (80 Mpx/s)
Antialias	320×240	0.473 s (31 Mpx/s)	0.225 s (66 Mpx/s)	0.1088 s (136 Mpx/s)
Bicubic	2048×1152	0.840 s (18 Mpx/s)	0.326 s (45 Mpx/s)	0.1509 s (98 Mpx/s)
Bicubic	320×240	0.326 s (45 Mpx/s)	0.154 s (96 Mpx/s)	0.0732 s (201 Mpx/s)
Bilinear	2048×1152	0.650 s (23 Mpx/s)	0.237 s (62 Mpx/s)	0.1206 s (122 Mpx/s)
Bilinear	320×240	0.184 s (80 Mpx/s)	0.082 s (180 Mpx/s)	0.0372 s (396 Mpx/s)

wiredfool · 2014-10-28T21:45:54Z

Wow. That's fast. Thanks for your work on this.

I'd like to understand this before merging -- so I'm probably going to be asking questions about it as I go through.

homm · 2014-11-06T02:11:51Z

I have rebased this branch to remove the changes from #970.

coveralls · 2014-11-06T02:26:19Z

Changes Unknown when pulling 5f4859e on homm:fast-stretch into python-pillow:master.

wiredfool · 2014-11-06T05:12:02Z

libImaging/Antialias.c

+                            ss1 += i2f((UINT8) imIn->image[yy][x*4 + 1]) * k[x - xmin];
+                            ss2 += i2f((UINT8) imIn->image[yy][x*4 + 2]) * k[x - xmin];
+                        }
+                        imOut->image32[yy][xx] =


Need to check endianness issues here.

Confirmed endianness problems -- fix coming. maybe, test was borked.

Ok, looks like the RGBA version doesn't have a problem. Only because it swaps the bands twice, since it's run, transposed, and run again.

The RGB version does exhibit the problem.

======================================================================
FAIL: TestImagingStretch.test_endianess
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/erics/Pillow/Tests/test_imaging_stretch.py", line 83, in test_endianess
    self.assert_image_equal(stretched.split()[0],target, 'rxRGB R channel fail')
  File "/home/erics/Pillow/Tests/helper.py", line 81, in assert_image_equal
    self.fail(msg or "got different content")
AssertionError: rxRGB R channel fail
----------------------------------------------------------------------
Ran 4 tests in 0.740s

I've fixed both in my branch https://github.com/wiredfool/Pillow/tree/bicubic-stretch here: wiredfool@d387499
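
As a generic illustration of the hazard discussed in this review comment (not the code from either branch): when four 8-bit bands are combined into one 32-bit integer and stored with a single assignment, the order of the bytes in memory depends on the machine's endianness. The sketch below uses Python's `struct` module to show both layouts; `pack_pixel` is a hypothetical helper.

```python
import struct

def pack_pixel(r, g, b, a):
    # Combine four 8-bit bands into one 32-bit integer, R in the lowest byte.
    return r | (g << 8) | (b << 16) | (a << 24)

value = pack_pixel(0x11, 0x22, 0x33, 0x44)

# Memory layout of the same 32-bit store on each kind of machine.
little_endian_bytes = struct.pack('<I', value)
big_endian_bytes = struct.pack('>I', value)

print(list(little_endian_bytes))  # bands end up in R, G, B, A order
print(list(big_endian_bytes))     # bands end up in A, B, G, R order
```

Writing each band as a separate 8-bit store avoids the dependency, because the position of every byte is then chosen explicitly.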

wiredfool · 2014-11-07T23:00:46Z

libImaging/Antialias.c

+                    xmin = xbounds[xx * 2 + 0];
+                    xmax = xbounds[xx * 2 + 1];
+                    k = &kk[xx * kmax];
+                    if (imIn->bands == 3) {


There's also a case (LA) where im->pixelsize = 4 and im->bands=2
[edit]
Never mind, it'll fall through to the else clause, and the two junk bands will be stretched right along with the rest of it.

homm · 2014-11-08T23:42:41Z

@wiredfool Looks like I was wrong. I've tested UINT32 versus 4×UINT8 assignment again and concluded that there is no noticeable difference in performance, so I've removed this questionable code entirely.

I've also improved the readability of the endianness test (at least I want to believe so) and added a special case for LA images.

wiredfool · 2014-11-09T00:05:59Z

Can you rebase on top of master so that this is mergeable?

homm · 2014-11-09T00:09:59Z

@wiredfool yes, already rebased :)

wiredfool · 2014-11-09T00:13:57Z

I see that now. :)

Imaging.stretch optimization

This was referenced Oct 30, 2014

Fix bicubic stretch interpolation #970

Merged

Fast Gaussian blur #961

Merged

wiredfool reviewed Nov 6, 2014

wiredfool mentioned this pull request Nov 7, 2014

Add transpose and speedup rotation #994

Merged

wiredfool reviewed Nov 7, 2014

homm added 11 commits November 9, 2014 03:05

Hide stretch implementation detail in Antialias.c

c8471bc

two ImagingStretchHorizaontal pass with two transposes

40f9f48

Precompute coefficients for all x

b77521b

Iterate pixels in native order

01b947c

move ww into coefficients

e276e6a

make x indexes int

a484d28

faster float to 8bit convertion

e9fc720

optimize memory usage

c22af89

speedup by unrolling loops

42967dd

fix performance regression on 64 bit GCC 4.8.

1cd6da4

typo. Free mem after ModeError.

b8a2b5b

wiredfool and others added 4 commits November 9, 2014 03:07

Test for endianness issues in stretch

2657c0d

Fix for endianness issues in stretch

af02f2b

generalize endianess test

1a7c9b7

two bands case

abc5e11

homm added 2 commits November 9, 2014 03:08

Replace UINT32 assignment with per-channel UINT8 assignment

7a64f7b

fix typo

3894dbe

Move ImagingTransposeToNew from Antialias.c to Geometry.c

90ee223

homm mentioned this pull request Nov 9, 2014

Replace resize method #997

Merged

wiredfool added a commit that referenced this pull request Nov 9, 2014

Merge pull request #977 from homm/fast-stretch

d781d56

Imaging.stretch optimization

wiredfool merged commit d781d56 into python-pillow:master Nov 9, 2014

wiredfool mentioned this pull request Nov 16, 2014

Experimental: OpenMP #1013

Open

homm deleted the fast-stretch branch April 27, 2016 21:42

homm mentioned this pull request Aug 13, 2017

Fast filters #2679

Merged
