-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stretch optimization #977
Stretch optimization #977
Conversation
Changes Unknown when pulling 6f11bfd on homm:fast-stretch into * on python-pillow:master*. |
For now code generated on 64 bit Linux has almost same speed as before due to bug in GCC. I'll try to find a workaround for this. |
I've added fix for 64 bit GCC 4.8 suggested by Vyacheslav Egorov. Vyacheslav is V8 core developer. I've asked him in twitter, if someone wrote similar code for V8, would he accept it? He said yes, if it properly commented and inside preprocessor directives. Due to directives this fix should be only used on 64 bit x86 platform on GCC version lower than 4.9, but not on Clang, with SSE enabled. Alternative I think this fix is important because 64 bit Linux with GCC 4.8 is large Pillow installation base. I think it is about 50% of all Pillow installations, maybe more. |
If someone are looking for even faster version, I've implemented main pixel loop for 8 bit channels on SSE4 instructions. It is fast-stretch-sse branch. I don't think it can be merged Pillow, use it on your own risk! Results are very impressive, though:
|
Wow. That's fast. Thanks for your work on this. I'd like to understand this before merging -- so I'm probably going to be asking questions about it as I go through. |
I have rebased this branch to subtract changes from #970 |
Changes Unknown when pulling 5f4859e on homm:fast-stretch into * on python-pillow:master*. |
ss1 += i2f((UINT8) imIn->image[yy][x*4 + 1]) * k[x - xmin]; | ||
ss2 += i2f((UINT8) imIn->image[yy][x*4 + 2]) * k[x - xmin]; | ||
} | ||
imOut->image32[yy][xx] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to check endianness issues here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed endianness problems -- fix coming. maybe, test was borked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, looks like the RGBA version doesn't have a problem. Only because it swaps the bands twice, since it's run, transposed, and run again.
The RGB version does exhibit the problem.
======================================================================
FAIL: TestImagingStretch.test_endianess
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/erics/Pillow/Tests/test_imaging_stretch.py", line 83, in test_endianess
self.assert_image_equal(stretched.split()[0],target, 'rxRGB R channel fail')
File "/home/erics/Pillow/Tests/helper.py", line 81, in assert_image_equal
self.fail(msg or "got different content")
AssertionError: rxRGB R channel fail
----------------------------------------------------------------------
Ran 4 tests in 0.740s
I've fixed both in my branch https://github.com/wiredfool/Pillow/tree/bicubic-stretch here: wiredfool@d387499
xmin = xbounds[xx * 2 + 0]; | ||
xmax = xbounds[xx * 2 + 1]; | ||
k = &kk[xx * kmax]; | ||
if (imIn->bands == 3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also a case (LA) where im->pixelsize = 4 and im->bands=2
[edit]
Never mind, it'll fall through to the else clause, and the two junk bands will be stretched right along with the rest of it.
@wiredfool Looks like I was wrong. I've tested UINT32 vs 4xUINT8 assignment again and come to conclusion what there is no noticeable difference in performance. So I exclude this doubtful code at all. Also I've improve endianness test readability (at least I want to believe), and make special case for LA images. |
Can you rebase on top of master so that this is mergable? |
@wiredfool yes, already rebased :) |
I see that now. :) |
This request speeds up main Pillow functionality: image resizing. It also includes changes from #970.
Main part
I've changed the traversal order of the pixels.
Old: There are two passes: horizontal and vertical stretch. In horizontal pass pixels are iterated from left to right first, then from top to bottom. That allows to calculate coefficients for each row only once, but is very inefficient for CPU cache. In vertical pass the order of traversal is changed: from top to bottom first, then from left to right, but calculation of each pixel of destination image requires pixels from different lines of source image.
New: Stretch always works in horizontal direction, pixels are iterated from top to bottom first, then from left to right. Calculation of each pixel in destination image requires only pixels from the same line of source image. CPU cache is used as efficiently as possible. Coefficients for each column are calculated in advance and stored separately, which prevents overheads. For vertical stretch image is transposed and stretched in horizontal direction again. Then it is transposed again to match the original image. Transposition is also made in effective for the CPU cache way: in block of 64 pixels (16kb).
Other optimizations
Effective cache usage allows other optimizations. In common this is using of integer numbers insted of floats when possible. Also, huge boost was given by bands loop unrolling.
Benchmarks
I've tested performance before and after this changes. Also i've tested GraphicsMagick 1.3.18 with following filters:
antialias
=lanczos
,bicubic
=catrom
,bilinear
=triangle
.Source image: http://www.apple.com/imac-with-retina/5k.html
Test source: