Stretch optimization #977

Merged
merged 18 commits into from
Nov 9, 2014

Member

homm commented Oct 26, 2014

This pull request speeds up a core piece of Pillow functionality: image resizing. It also includes the changes from #970.

Main part

I've changed the traversal order of the pixels.

Old: There are two passes, a horizontal and a vertical stretch. In the horizontal pass, pixels are iterated from left to right first, then from top to bottom. This allows the coefficients for each row to be calculated only once, but it is very inefficient for the CPU cache. In the vertical pass the traversal order is reversed: from top to bottom first, then from left to right, but calculating each pixel of the destination image requires pixels from different lines of the source image.

New: The stretch always works in the horizontal direction, and pixels are iterated from top to bottom first, then from left to right. Calculating each pixel of the destination image requires only pixels from the same line of the source image, so the CPU cache is used as efficiently as possible. The coefficients for each column are calculated in advance and stored separately, which avoids recomputing them for every row. For the vertical stretch, the image is transposed and stretched in the horizontal direction again, then transposed back to restore the original orientation. The transposition is also done in a cache-friendly way: in blocks of 64 pixels (16 KB).
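The new traversal order can be sketched in pure Python. This is only an illustration of the order of operations, not the C code from this pull request: the function names, the triangle (bilinear) filter and the list-of-lists image are all assumptions, and Python lists will not show the cache effect itself. The 64×64 tile in `transpose` is one reading of the block size above (64×64 pixels at 4 bytes each is 16 KB).

```python
def triangle(x):
    """Bilinear (triangle) filter with support 1 (illustrative choice)."""
    x = abs(x)
    return 1.0 - x if x < 1.0 else 0.0


def precompute_coeffs(in_size, out_size, filt=triangle, support=1.0):
    """Return (xmin, weights) for every output column, computed once."""
    scale = in_size / out_size
    filterscale = max(scale, 1.0)      # widen the filter when downscaling
    radius = support * filterscale
    coeffs = []
    for xx in range(out_size):
        center = (xx + 0.5) * scale
        xmin = max(0, int(center - radius))
        xmax = min(in_size, int(center + radius) + 1)
        w = [filt((x + 0.5 - center) / filterscale) for x in range(xmin, xmax)]
        total = sum(w)
        coeffs.append((xmin, [v / total for v in w]))  # normalized weights
    return coeffs


def stretch_horizontal(rows, out_width):
    """Resample every row; each output pixel reads one source row only."""
    coeffs = precompute_coeffs(len(rows[0]), out_width)
    out = []
    for row in rows:                   # top to bottom first
        out.append([
            sum(row[xmin + i] * k for i, k in enumerate(w))
            for xmin, w in coeffs      # then left to right
        ])
    return out


def transpose(rows, block=64):
    """Transpose tile by tile so reads and writes stay in a small area."""
    h, w = len(rows), len(rows[0])
    out = [[0] * h for _ in range(w)]
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            for y in range(y0, min(y0 + block, h)):
                for x in range(x0, min(x0 + block, w)):
                    out[x][y] = rows[y][x]
    return out


def stretch(rows, out_width, out_height):
    """Horizontal pass, transpose, horizontal pass again, transpose back."""
    tmp = stretch_horizontal(rows, out_width)
    tmp = transpose(tmp)
    tmp = stretch_horizontal(tmp, out_height)
    return transpose(tmp)
```

Calling `stretch(rows, 4, 3)` on a 6-row by 10-column single-band image returns 3 rows of 4 values each; the same two horizontal passes serve both directions.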

Other optimizations

Efficient cache usage makes other optimizations possible. In general, this means using integers instead of floats where possible. Unrolling the loop over bands also gave a huge boost.

Benchmarks

I've tested performance before and after these changes. I've also tested GraphicsMagick 1.3.18 with the following filters: antialias = lanczos, bicubic = catrom, bilinear = triangle.

Source image: http://www.apple.com/imac-with-retina/5k.html

| Filter | Destination size | GM | Before | After |
|---|---|---|---|---|
| Antialias | 2048×1152 | 0.88 s, 17 Mpx/s | 1.016 s, 14 Mpx/s | 0.421 s, 35 Mpx/s |
| Antialias | 320×240 | 0.65 s, 23 Mpx/s | 0.473 s, 31 Mpx/s | 0.225 s, 66 Mpx/s |
| Bicubic | 2048×1152 | 0.64 s, 23 Mpx/s | 0.840 s, 18 Mpx/s | 0.326 s, 45 Mpx/s |
| Bicubic | 320×240 | 0.37 s, 39 Mpx/s | 0.326 s, 45 Mpx/s | 0.154 s, 96 Mpx/s |
| Bilinear | 2048×1152 | 0.42 s, 35 Mpx/s | 0.650 s, 23 Mpx/s | 0.237 s, 62 Mpx/s |
| Bilinear | 320×240 | 0.19 s, 78 Mpx/s | 0.184 s, 80 Mpx/s | 0.082 s, 180 Mpx/s |
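The Mpx/s figures appear to be source pixels divided by elapsed time. A small sketch of that arithmetic follows, assuming a 5120×2880 source image; the size is not stated in this thread and is inferred from the "5K" source and from the numbers themselves. The helper names are made up for illustration.

```python
# Assumed source size: 5120 x 2880 (a "5K" image); not stated in the thread.
SOURCE_PIXELS = 5120 * 2880


def mpx_per_second(seconds, pixels=SOURCE_PIXELS):
    """Throughput in megapixels of source image processed per second."""
    return pixels / seconds / 1e6


def speedup(before, after):
    """How many times faster the new timing is than the old one."""
    return before / after


# Antialias, 2048x1152: timings taken from the Before and After columns.
antialias_after = mpx_per_second(0.421)
antialias_gain = speedup(1.016, 0.421)
```

Under that assumption the After timings reproduce the Mpx/s column, and the Antialias 2048×1152 case comes out roughly 2.4 times faster than before.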

Test source:

from PIL import Image
import time

def timeit(n, f, *args, **kwargs):
    def run():
        start = time.time()
        f(*args, **kwargs)
        return time.time() - start

    return min(run() for _ in range(n))

n = 20
image = Image.open('5k_image.png').copy()
print 'warmup', timeit(n // 4, image.im.stretch, (2048, 1152), Image.ANTIALIAS)
print 'Antialias | 2048x1152 | {:.4}'.format(timeit(n, image.im.stretch, (2048, 1152), Image.ANTIALIAS))
print 'Antialias | 320x240   | {:.4}'.format(timeit(n, image.im.stretch, (320, 240),   Image.ANTIALIAS))
print 'Bicubic   | 2048x1152 | {:.4}'.format(timeit(n, image.im.stretch, (2048, 1152), Image.BICUBIC))
print 'Bicubic   | 320x240   | {:.4}'.format(timeit(n, image.im.stretch, (320, 240),   Image.BICUBIC))
print 'Bilinear  | 2048x1152 | {:.4}'.format(timeit(n, image.im.stretch, (2048, 1152), Image.BILINEAR))
print 'Bilinear  | 320x240   | {:.4}'.format(timeit(n, image.im.stretch, (320, 240),   Image.BILINEAR))

@coveralls

Coverage Status

Changes Unknown when pulling 6f11bfd on homm:fast-stretch into * on python-pillow:master*.

@homm
Member Author

homm commented Oct 27, 2014

For now, the code generated on 64-bit Linux runs at almost the same speed as before because of a bug in GCC. I'll try to find a workaround for this.

@homm
Member Author

homm commented Oct 28, 2014

I've added a fix for 64-bit GCC 4.8 suggested by Vyacheslav Egorov, a V8 core developer. I asked him on Twitter whether he would accept similar code if someone wrote it for V8. He said yes, if it is properly commented and wrapped in preprocessor directives.

Because of the directives, this fix is used only on 64-bit x86 platforms with GCC versions lower than 4.9 (not Clang) and with SSE enabled. The alternative i2f version is absolutely valid C code without any hacks and should work with any compiler. I've tested it on 32-bit and 64-bit Ubuntu, OS X and 64-bit Windows, and it works as expected.

I think this fix is important because 64-bit Linux with GCC 4.8 makes up a large part of the Pillow installation base. I think it is about 50% of all Pillow installations, maybe more.

@homm
Member Author

homm commented Oct 28, 2014

If anyone is looking for an even faster version, I've implemented the main pixel loop for 8-bit channels with SSE4 instructions. It is in the fast-stretch-sse branch. I don't think it can be merged into Pillow, so use it at your own risk! The results are very impressive, though:

| Filter | Destination size | Before | After | SSE |
|---|---|---|---|---|
| Antialias | 2048×1152 | 1.016 s, 14 Mpx/s | 0.421 s, 35 Mpx/s | 0.1845 s, 80 Mpx/s |
| Antialias | 320×240 | 0.473 s, 31 Mpx/s | 0.225 s, 66 Mpx/s | 0.1088 s, 136 Mpx/s |
| Bicubic | 2048×1152 | 0.840 s, 18 Mpx/s | 0.326 s, 45 Mpx/s | 0.1509 s, 98 Mpx/s |
| Bicubic | 320×240 | 0.326 s, 45 Mpx/s | 0.154 s, 96 Mpx/s | 0.0732 s, 201 Mpx/s |
| Bilinear | 2048×1152 | 0.650 s, 23 Mpx/s | 0.237 s, 62 Mpx/s | 0.1206 s, 122 Mpx/s |
| Bilinear | 320×240 | 0.184 s, 80 Mpx/s | 0.082 s, 180 Mpx/s | 0.0372 s, 396 Mpx/s |

@wiredfool
Member

Wow. That's fast. Thanks for your work on this.

I'd like to understand this before merging -- so I'm probably going to be asking questions about it as I go through.

This was referenced Oct 30, 2014
@homm
Member Author

homm commented Nov 6, 2014

I have rebased this branch to remove the changes from #970.

@coveralls

Coverage Status

Changes Unknown when pulling 5f4859e on homm:fast-stretch into * on python-pillow:master*.

ss1 += i2f((UINT8) imIn->image[yy][x*4 + 1]) * k[x - xmin];
ss2 += i2f((UINT8) imIn->image[yy][x*4 + 2]) * k[x - xmin];
}
imOut->image32[yy][xx] =

Member

Need to check endianness issues here.

Member

Confirmed endianness problems -- fix coming. maybe, test was borked.

Member

Ok, looks like the RGBA version doesn't have a problem. Only because it swaps the bands twice, since it's run, transposed, and run again.

The RGB version does exhibit the problem.

======================================================================
FAIL: TestImagingStretch.test_endianess
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/erics/Pillow/Tests/test_imaging_stretch.py", line 83, in test_endianess
    self.assert_image_equal(stretched.split()[0],target, 'rxRGB R channel fail')
  File "/home/erics/Pillow/Tests/helper.py", line 81, in assert_image_equal
    self.fail(msg or "got different content")
AssertionError: rxRGB R channel fail

----------------------------------------------------------------------
Ran 4 tests in 0.740s

I've fixed both in my branch https://github.com/wiredfool/Pillow/tree/bicubic-stretch here: wiredfool@d387499
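The difference between the RGBA and RGB cases can be reproduced with a small simulation. This is a sketch under an assumption: that each output pixel is built as `v0 | v1 << 8 | v2 << 16 | v3 << 24` and stored as one native 32-bit value, which is what the `imOut->image32[yy][xx] =` assignment above suggests. The function names are hypothetical, and no resampling is done; each pass just copies the band values.

```python
import struct


def store_packed(bands, byteorder):
    """Pack 8-bit bands into one 32-bit value; return the bytes in memory."""
    value = 0
    for i, band in enumerate(bands):
        value |= band << (8 * i)
    fmt = '<I' if byteorder == 'little' else '>I'
    return list(struct.pack(fmt, value))


def pass_rgba(pixel, byteorder):
    """One pass over a 4-band pixel: read 4 bytes, store them packed."""
    return store_packed(pixel[:4], byteorder)


def pass_rgb(pixel, byteorder):
    """One pass over a 3-band pixel: read 3 bytes, 4th byte stays 0."""
    return store_packed(pixel[:3], byteorder)


def two_passes(one_pass, pixel, byteorder):
    """Horizontal pass followed by the pass on the transposed image."""
    return one_pass(one_pass(pixel, byteorder), byteorder)


rgba = [10, 20, 30, 40]
rgb = [10, 20, 30, 0]

rgba_big = two_passes(pass_rgba, rgba, 'big')   # bands reversed twice
rgb_big = two_passes(pass_rgb, rgb, 'big')      # R is lost on the way
```

On a big-endian byte order the 4-band pixel comes back unchanged after two passes because the reversal cancels out, while the 3-band pixel does not: the first pass moves R into the fourth byte, which the second pass never reads.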

xmin = xbounds[xx * 2 + 0];
xmax = xbounds[xx * 2 + 1];
k = &kk[xx * kmax];
if (imIn->bands == 3) {

Member

There's also a case (LA) where im->pixelsize = 4 and im->bands=2
[edit]
Never mind, it'll fall through to the else clause, and the two junk bands will be stretched right along with the rest of it.

@homm
Member Author

homm commented Nov 8, 2014

@wiredfool Looks like I was wrong. I've tested UINT32 vs. 4×UINT8 assignment again and concluded that there is no noticeable difference in performance, so I've removed this doubtful code entirely.

I've also improved the readability of the endianness test (at least I want to believe so) and added a special case for LA images.

@wiredfool
Member

Can you rebase on top of master so that this is mergeable?

@homm
Member Author

homm commented Nov 9, 2014

@wiredfool yes, already rebased :)

@wiredfool
Member

I see that now. :)

@homm homm mentioned this pull request Nov 9, 2014
wiredfool added a commit that referenced this pull request Nov 9, 2014
Imaging.stretch optimization
@wiredfool wiredfool merged commit d781d56 into python-pillow:master Nov 9, 2014
@homm homm deleted the fast-stretch branch April 27, 2016 21:42
@homm homm mentioned this pull request Aug 13, 2017