Experimental: OpenMP #1013 (Open)

wiredfool wants to merge 4 commits into main from wiredfool:openmp
Conversation

wiredfool (Member)

While looking through the SSE version of @homm's benchmarking of the new stretch implementation, I ran across an Intel paper on speeding up imaging operations. In addition to vectorization, they used openmp to parallelize the loops with very little developer effort. GCC 4.4+ ships with OpenMP 3.0, which is good enough for what we would need to do here. I've put the pragma on the horizontal stretch inner loop and put together the necessary bits to get Pillow built using OpenMP.

python setup.py build_ext --enable-openmp install or simply make install-openmp
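For illustration, a minimal sketch of the kind of change involved -- not the actual patch; the function name resample_horizontal and the elided accumulation body are hypothetical stand-ins for the real stretch code:

/* Each output row of the horizontal stretch is independent, so the row
 * loop can be distributed across threads with a single pragma.  Without
 * -fopenmp the pragma is simply ignored and the loop runs serially. */
static void
resample_horizontal(unsigned char **out, unsigned char **in,
                    int xsize, int ysize)
{
    int x, y;

    #pragma omp parallel for private(x)
    for (y = 0; y < ysize; y++) {
        for (x = 0; x < xsize; x++) {
            /* ... accumulate filter taps from in[y] into out[y][x] ... */
        }
    }
}

Presumably --enable-openmp / make install-openmp just arrange for -fopenmp to be passed at compile and link time; without that flag the build is unchanged.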

Using the same benchmark as #977,

Current master:

Interpolation | Size      | Time
------------- | --------- | -------
Antialias     | 2048x1152 | 0.4238
Antialias     | 320x240   | 0.2331
Bicubic       | 2048x1152 | 0.3306
Bicubic       | 320x240   | 0.1509
Bilinear      | 2048x1152 | 0.2369
Bilinear      | 320x240   | 0.08423

With OpenMP, on 4 cores of an i7 from 2008 in an Ubuntu 12.04 VM running under KVM. (So, not cutting edge, but not wimpy.)

Interpolation | Size      | Time
------------- | --------- | -------
Antialias     | 2048x1152 | 0.1845
Antialias     | 320x240   | 0.08882
Bicubic       | 2048x1152 | 0.212
Bicubic       | 320x240   | 0.1025
Bilinear      | 2048x1152 | 0.1741
Bilinear      | 320x240   | 0.06423

Note that the speeds of the OpenMP version seem to be roughly constrained by memory bandwidth rather than by processor ops. I think this is a win, but it needs further investigation.

There are at least a few places where I think work may be required:

  • Currently gcc only. I think Windows could be supported easily; there's currently no OpenMP support in clang/Xcode. There is an OpenMP implementation for clang, but it's not in mainline yet.
  • Unsure of the packaging issues for binaries.
  • Not sure what the performance will be like on systems that advertise 32 threads but only provide about 1.5 cores' worth (e.g. Travis).
  • Unsure of interaction with SSE. Would be awesome if it boosted that as well.
  • Testing and benchmarking are going to be important.
  • Can't return from the loop, so errors need to be trapped elsewhere. I didn't fix that here, but it would need to be done prior to actual usage (see the sketch after this list).
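As a sketch of the error-trapping pattern that last point would need (the helper process_row() and its return convention are hypothetical; this is not the current patch):

/* Instead of returning from inside the parallel loop, record the failure
 * in a shared flag, let remaining iterations skip their work, and report
 * the error once after the parallel region has finished. */
static int process_row(int y) { return y >= 0 ? 0 : -1; }  /* placeholder */

static int
resample_rows(int ysize)
{
    int y;
    int error = 0;

    #pragma omp parallel for
    for (y = 0; y < ysize; y++) {
        if (error)
            continue;               /* best-effort early exit; can't break */
        if (process_row(y) < 0) {
            #pragma omp critical
            error = 1;
        }
    }

    return error ? -1 : 0;          /* caller raises the error outside the loop */
}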

Links:

@coveralls

Coverage Status: Changes Unknown when pulling 5f1455f on wiredfool:openmp into python-pillow:master.

homm (Member) commented Nov 16, 2014

I'm very impressed by how easy OpenMP is to use. We should understand that this is not a true performance win; it is just one way to use parallelism. For example, my application is a web server which resizes images on the fly. It already works in parallel if two requests come in at the same time. And in my case it is more important to ensure that small and moderate images are processed in a consistent time than to try to speed up large-image resizing using all available resources. But without a doubt, OpenMP can be useful for a wide range of tasks.

Not sure what the performance will be like on systems that advertise 32 threads but only provide about 1.5 cores' worth (e.g. Travis).

You can easily test any number of threads with the OMP_NUM_THREADS env var. I haven't noticed any slowdown on a VM with one core and OMP_NUM_THREADS=32. Also, we could set OMP_NUM_THREADS to 4, for example, specifically for Travis.
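A quick standalone check of that behaviour (assuming gcc/libgomp's standard handling of OMP_NUM_THREADS; not part of this patch):

/* Build with `gcc -fopenmp check_threads.c` and run as
 * `OMP_NUM_THREADS=4 ./a.out` to see the cap take effect. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("team size:   %d\n", omp_get_num_threads());
    }
    return 0;
}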

Unsure of interaction with SSE. Would be awesome if it boosted that as well.

I have an i5 with two cores and hyper-threading. I get a speedup of 1.8–1.95× with two threads and no additional speedup with three or four threads. The same holds for the SSE and scalar versions, so SSE instructions on different cores don't interfere with each other.

speeds of the OpenMP version seem to be roughly constrained by memory bandwidth

I don't think so. The SSE version with OpenMP runs almost twice as fast, so the version without SSE can't be constrained by memory bandwidth. Maximum throughput for bilinear resize on my system is 820 Mpx/s, or about 3.2 GB/s at 4 bytes per pixel, which is far from the 25 GB/s maximum memory bandwidth.

wiredfool (Member Author)

I think that this is a win, for a couple of reasons. We're basically able to speed up Pillow in three ways: better algorithms (e.g. iterative approximations in box blur), better implementations of those algorithms (cache awareness, SSE), and parallelization. Parallelization is orthogonal to the other two in most cases, and that's the real win here. This is a minimally invasive, fine-grained parallelization technique. It is easily enabled and disabled, should you already have coarse-grained methods running. It doesn't appear to be dangerous -- it could introduce bugs, but it's likely to introduce far fewer than if we were adding threads manually. And it is additive over the other methods, since there are SSE units available on each core (it may impact cache awareness, but that's a tuning thing).

I remember a guest lecture back in school -- the speaker was talking about the advances in large-scale finite element analysis solving. Over a decade or more, there was a six order of magnitude increase in the raw capacity of the machines, and over the same time there was just as much of an advance in solving the linear equations (iirc, it went Invert -> LU -> several different iterative methods, the later ones converging very quickly). Of course, what that really meant was that the problems got bigger and the solution time never really changed.

The other option we have for parallelization is to look at GPU-based kernels, likely with OpenCL. That is a far more developer-intensive approach, and much less useful for most server-based operations. On the other hand, the speedups are likely to be better than the roughly throughput * ncores we'd see from this approach.

I get a speedup of 1.8–1.95× with two threads and no additional speedup with three or four threads

I may be jumping to conclusions, but that would seem to imply that RealCores(tm) are reasonably well saturated by this workload, and that there aren't a whole lot of stalls or other inefficiencies that are easily exploitable by additional hardware threads. I could probably test this by pinning my virtual machine's CPUs to specific cores. Or there are stalls, but the other thread isn't able to take advantage of them because it's stalled as well. I know I've seen some details on how to dig in and instrument that, but I'd have to find them again.

wiredfool (Member Author)

Running tonight I'm noticing that the benchmarks are jumping around a lot -- I'm seeing ±50% on some of them on sequential runs with the same code.

@coveralls

Coverage Status: Changes Unknown when pulling 522db0f on wiredfool:openmp into python-pillow:master.

Inline review comment on the following diff context:

case IMAGING_TYPE_FLOAT32:
    break;
default:
    ImagingSectionLeave(&cookie);
Member:

At least there is an error on this line.

In general, we can move this code before any allocations, so we don't have to free them.

Member Author:

Right, I see that.

homm (Member) commented Nov 19, 2014

Running tonight I'm noticing that the benchmarks are jumping around a lot

Indeed. Results across 20 runs for each size:

Without OpenMP
Antialias | 2048x1152 | min 0.4323 max 0.4961 average 0.4561
Antialias | 320x240   | min 0.2299 max 0.2884 average 0.2408
Bicubic   | 2048x1152 | min 0.3391 max 0.3988 average 0.3549
Bicubic   | 320x240   | min 0.1581 max 0.1913 average 0.1656
Bilinear  | 2048x1152 | min 0.2419 max 0.2977 average 0.2586
Bilinear  | 320x240   | min 0.0845 max 0.1098 average 0.0887

OpenMP, One thread
Antialias | 2048x1152 | min 0.4404 max 0.5129 average 0.4621
Antialias | 320x240   | min 0.2337 max 0.2973 average 0.2477
Bicubic   | 2048x1152 | min 0.3417 max 0.4104 average 0.3591
Bicubic   | 320x240   | min 0.1607 max 0.1918 average 0.1688
Bilinear  | 2048x1152 | min 0.2492 max 0.3304 average 0.2624
Bilinear  | 320x240   | min 0.0856 max 0.1060 average 0.0886

OpenMP, Two threads
Antialias | 2048x1152 | min 0.2344 max 0.3851 average 0.2925
Antialias | 320x240   | min 0.1198 max 0.2172 average 0.1585
Bicubic   | 2048x1152 | min 0.1780 max 0.3180 average 0.2313
Bicubic   | 320x240   | min 0.0816 max 0.1428 average 0.0936
Bilinear  | 2048x1152 | min 0.1312 max 0.2244 average 0.1741
Bilinear  | 320x240   | min 0.0442 max 0.0843 average 0.0548

With two threads the maximum is always almost as large as the minimum with one thread, and the average with two threads is only 1.5–1.6× faster.

wiredfool (Member Author)

I've added standard deviations to mine, and I've noticed that the more iterations, the higher the standard deviation generally is. The non-OpenMP versions are somewhat more consistent, but not significantly. All of these are with n=40, but I've tried up to 200. The results are representative, not necessarily the most consistent or the worst of the runs. Runs are much more consistent when the VM cores are pinned to specific processors: with <= 4 cores they go to 100% each, at 8 cores (4 + hyperthreading) they were all running at 85% or so.

Dev % is the standard deviation divided by the mean.

No openmp

Interpolation | Size      | min   | max   | mean  | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.517 | 0.663 | 0.589 | 0.599  | 0.0459 | 7.8%
Antialias     | 320x240   | 0.310 | 0.376 | 0.330 | 0.329  | 0.0178 | 5.4%
Bicubic       | 2048x1152 | 0.439 | 0.543 | 0.472 | 0.463  | 0.0339 | 7.2%
Bicubic       | 320x240   | 0.205 | 0.229 | 0.208 | 0.206  | 0.0071 | 3.4%
Bilinear      | 2048x1152 | 0.318 | 0.357 | 0.338 | 0.336  | 0.0111 | 3.3%
Bilinear      | 320x240   | 0.114 | 0.124 | 0.115 | 0.114  | 0.0023 | 2.0%

OpenMP

4 separate cores:

Interpolation | Size      | min   | max   | mean  | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.248 | 0.313 | 0.274 | 0.276  | 0.0222 | 8.1%
Antialias     | 320x240   | 0.172 | 0.198 | 0.179 | 0.174  | 0.0101 | 5.6%
Bicubic       | 2048x1152 | 0.265 | 0.311 | 0.289 | 0.286  | 0.0146 | 5.0%
Bicubic       | 320x240   | 0.136 | 0.162 | 0.149 | 0.159  | 0.0114 | 7.7%
Bilinear      | 2048x1152 | 0.201 | 0.244 | 0.214 | 0.216  | 0.0105 | 4.9%
Bilinear      | 320x240   | 0.075 | 0.077 | 0.075 | 0.075  | 0.0005 | 0.7%

two cores + two hyperthreads

Interpolation | Size      | min   | max   | mean  | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.250 | 0.331 | 0.284 | 0.287  | 0.0249 | 8.8%
Antialias     | 320x240   | 0.183 | 0.215 | 0.193 | 0.185  | 0.0121 | 6.3%
Bicubic       | 2048x1152 | 0.261 | 0.307 | 0.271 | 0.265  | 0.0129 | 4.8%
Bicubic       | 320x240   | 0.146 | 0.179 | 0.162 | 0.173  | 0.0135 | 8.3%
Bilinear      | 2048x1152 | 0.200 | 0.253 | 0.226 | 0.228  | 0.0127 | 5.6%
Bilinear      | 320x240   | 0.074 | 0.079 | 0.076 | 0.076  | 0.0014 | 1.9%

8 cores

Interpolation | Size      | min   | max   | mean  | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.149 | 0.234 | 0.181 | 0.183  | 0.0199 | 11.0%
Antialias     | 320x240   | 0.102 | 0.197 | 0.128 | 0.119  | 0.0216 | 16.9%
Bicubic       | 2048x1152 | 0.190 | 0.275 | 0.234 | 0.234  | 0.0245 | 10.4%
Bicubic       | 320x240   | 0.105 | 0.178 | 0.133 | 0.130  | 0.0176 | 13.2%
Bilinear      | 2048x1152 | 0.142 | 0.276 | 0.191 | 0.188  | 0.0279 | 14.6%
Bilinear      | 320x240   | 0.050 | 0.140 | 0.087 | 0.086  | 0.0206 | 23.8%

2 cores

Interpolation | Size      | min   | max   | mean  | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.293 | 0.408 | 0.329 | 0.324  | 0.0310 | 9.4%
Antialias     | 320x240   | 0.191 | 0.234 | 0.210 | 0.208  | 0.0102 | 4.9%
Bicubic       | 2048x1152 | 0.283 | 0.344 | 0.298 | 0.284  | 0.0180 | 6.0%
Bicubic       | 320x240   | 0.135 | 0.166 | 0.140 | 0.136  | 0.0071 | 5.1%
Bilinear      | 2048x1152 | 0.208 | 0.258 | 0.223 | 0.222  | 0.0119 | 5.3%
Bilinear      | 320x240   | 0.071 | 0.158 | 0.086 | 0.072  | 0.0273 | 31.7%

Another 2 core run

Interpolation | Size      | min   | max   | mean  | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.318 | 0.403 | 0.347 | 0.346  | 0.0200 | 5.8%
Antialias     | 320x240   | 0.186 | 0.301 | 0.220 | 0.215  | 0.0189 | 8.6%
Bicubic       | 2048x1152 | 0.286 | 0.397 | 0.318 | 0.311  | 0.0237 | 7.5%
Bicubic       | 320x240   | 0.137 | 0.168 | 0.151 | 0.150  | 0.0084 | 5.6%
Bilinear      | 2048x1152 | 0.230 | 0.304 | 0.251 | 0.253  | 0.0127 | 5.0%
Bilinear      | 320x240   | 0.076 | 0.087 | 0.082 | 0.083  | 0.0027 | 3.3%

Current test script:

from PIL import Image
import time
import math

def timeit(n, f, *args, **kwargs):
    def run():
        start = time.time()
        f(*args, **kwargs)
        return time.time() - start

    runs = [run() for _ in range(n)]
    mean = sum(runs)/float(n)
    stddev = math.sqrt(sum((r-mean)**2 for r in runs)/float(n))
    return {'mean':mean,
            'median': sorted(runs)[int(n/2)],
            'min': min(runs),
            'max': max(runs),
            'stddev':stddev,
            'dev_pct': stddev/mean*100.0
            }

    #return min(run() for _ in range(n))

n = 40
image = Image.open('5k_image.png').copy()
print 'warmup {mean:.4}'.format(**timeit(n // 4, image.im.stretch, (2048, 1152), Image.ANTIALIAS))
print "%s runs"%n
print "Interpolation | Size  |  min  |  max  |  mean | median| stddev | Dev %"
print "--------- | --------- | ----- | ----- | ----- | ----- | -----  | ----"
print 'Antialias | 2048x1152 | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (2048, 1152), Image.ANTIALIAS))
print 'Antialias | 320x240   | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (320, 240),   Image.ANTIALIAS))
print 'Bicubic   | 2048x1152 | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (2048, 1152), Image.BICUBIC))
print 'Bicubic   | 320x240   | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (320, 240),   Image.BICUBIC))
print 'Bilinear  | 2048x1152 | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (2048, 1152), Image.BILINEAR))
print 'Bilinear  | 320x240   | {min:5.3f} | {max:5.3f} | {mean:5.3f} | {median:5.3f} | {stddev:5.4f} | {dev_pct:4.1f}%'.format(**timeit(n, image.im.stretch, (320, 240),   Image.BILINEAR))
"""

homm mentioned this pull request, Feb 11, 2015
aclark4life added this to the Future milestone, Apr 1, 2015
wiredfool removed the No Auto label, Jun 17, 2015
wiredfool (Member Author)

I've just remerged this to master.

wiredfool (Member Author) commented Nov 30, 2017

So, I've had a bit of a go with this on a monster machine: a 2 GHz, 96-core ARM machine with 128 GB of memory. Individual cores are about half as fast as my (old) laptop's cores on the test suite, but I can't say that we're particularly well optimized there. There's a sub-linear speedup across the cores: I'm seeing about 30-60% usage and speeds in the 20-30x range of a single core.

without openmp:

Interpolation | Size      | min   | max   | mean  | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.947 | 0.952 | 0.949 | 0.949  | 0.0011 | 0.1%
Antialias     | 320x240   | 0.565 | 0.571 | 0.566 | 0.565  | 0.0010 | 0.2%
Bicubic       | 2048x1152 | 0.637 | 0.649 | 0.638 | 0.637  | 0.0019 | 0.3%
Bicubic       | 320x240   | 0.391 | 0.391 | 0.391 | 0.391  | 0.0000 | 0.0%
Bilinear      | 2048x1152 | 0.416 | 0.416 | 0.416 | 0.416  | 0.0000 | 0.0%
Bilinear      | 320x240   | 0.227 | 0.227 | 0.227 | 0.227  | 0.0000 | 0.0%

With openmp, resize

Interpolation | Size      | min   | max   | mean  | median | stddev | Dev %
------------- | --------- | ----- | ----- | ----- | ------ | ------ | -----
Antialias     | 2048x1152 | 0.036 | 0.043 | 0.037 | 0.037  | 0.0010 | 2.7%
Antialias     | 320x240   | 0.027 | 0.030 | 0.028 | 0.028  | 0.0007 | 2.5%
Bicubic       | 2048x1152 | 0.019 | 0.023 | 0.020 | 0.020  | 0.0006 | 3.0%
Bicubic       | 320x240   | 0.008 | 0.016 | 0.012 | 0.013  | 0.0023 | 18.9%
Bilinear      | 2048x1152 | 0.008 | 0.009 | 0.008 | 0.008  | 0.0001 | 1.1%
Bilinear      | 320x240   | 0.005 | 0.005 | 0.005 | 0.005  | 0.0001 | 1.8%

[Screenshot big_arm_htop2: htop on the 96-core ARM machine during the OpenMP run]

aclark4life added this to New Issues in Pillow, Sep 12, 2019
radarhere moved this from New Issues to Icebox in Pillow, Feb 24, 2022