IDWT 5x3 single-pass lifting and SSE2/AVX2 implementation #957

Merged
merged 6 commits into from Jun 26, 2017

@rouault
Collaborator

rouault commented Jun 21, 2017

Implements #953

With the new bench_dwt utility, on x86_64:

  • before changes: 3.356 s
  • with SSE2 optimization (the default on x86_64): 0.992 s
  • with AVX2 optimization (requested at compilation time): 0.744 s

SSE2/AVX2 is used in the vertical pass to handle several columns at the same time. This avoids a lot of CPU cache thrashing.
Note: I tried an SSE2-optimized version of opj_idwt53_h_cas0(), but the gain is almost unnoticeable, so it is not included in this PR.
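To illustrate the idea of handling several columns at once in the vertical pass, here is a minimal sketch, not the code from this PR. It assumes a hypothetical out-of-place layout (the first `sn` rows hold low-pass coefficients, the next `dn` rows hold high-pass coefficients, and even output rows are low-pass), and the function names are made up for the example. Each SSE2 load fetches 4 adjacent columns of one row, so memory is walked row by row instead of jumping by one stride per sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

/* Scalar reference: inverse 5/3 down ONE column (hypothetical layout).
 * Relies on >> being an arithmetic shift for negative values. */
static void idwt53_v_col_scalar(const int32_t *in, int32_t *out,
                                size_t sn, size_t dn, size_t stride)
{
    const int32_t *low = in, *high = in + sn * stride;
    int32_t d_prev, d_cur, e_prev, e_cur;
    size_t n;
    assert(dn >= 1 && (sn == dn || sn == dn + 1));
    d_prev = high[0];                        /* d[-1] mirrors d[0] */
    e_prev = low[0] - ((d_prev + d_prev + 2) >> 2);
    out[0] = e_prev;
    for (n = 1; n < sn; n++) {
        d_cur = high[(n < dn ? n : dn - 1) * stride]; /* mirror at the end */
        e_cur = low[n * stride] - ((d_prev + d_cur + 2) >> 2);
        out[2 * n * stride] = e_cur;
        out[(2 * n - 1) * stride] = d_prev + ((e_prev + e_cur) >> 1);
        d_prev = d_cur;
        e_prev = e_cur;
    }
    if (sn == dn) {                          /* even height: mirror last even row */
        out[(2 * sn - 1) * stride] = d_prev + e_prev;
    }
}

/* Same recurrence for 4 adjacent columns at once; falls back to the
 * scalar routine when SSE2 is not available at compile time. */
static void idwt53_v_4cols(const int32_t *in, int32_t *out,
                           size_t sn, size_t dn, size_t stride)
{
#ifdef SKETCH_HAVE_SSE2
    const int32_t *low = in, *high = in + sn * stride;
    const __m128i two = _mm_set1_epi32(2);
    __m128i d_prev, d_cur, e_prev, e_cur;
    size_t n;
    assert(dn >= 1 && (sn == dn || sn == dn + 1));
    d_prev = _mm_loadu_si128((const __m128i *)high);
    e_prev = _mm_sub_epi32(_mm_loadu_si128((const __m128i *)low),
        _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(d_prev, d_prev), two), 2));
    _mm_storeu_si128((__m128i *)out, e_prev);
    for (n = 1; n < sn; n++) {
        d_cur = _mm_loadu_si128(
            (const __m128i *)(high + (n < dn ? n : dn - 1) * stride));
        e_cur = _mm_sub_epi32(
            _mm_loadu_si128((const __m128i *)(low + n * stride)),
            _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(d_prev, d_cur), two), 2));
        _mm_storeu_si128((__m128i *)(out + 2 * n * stride), e_cur);
        _mm_storeu_si128((__m128i *)(out + (2 * n - 1) * stride),
            _mm_add_epi32(d_prev,
                          _mm_srai_epi32(_mm_add_epi32(e_prev, e_cur), 1)));
        d_prev = d_cur;
        e_prev = e_cur;
    }
    if (sn == dn) {
        _mm_storeu_si128((__m128i *)(out + (2 * sn - 1) * stride),
                         _mm_add_epi32(d_prev, e_prev));
    }
#else
    size_t c;
    for (c = 0; c < 4; c++) {
        idwt53_v_col_scalar(in + c, out + c, sn, dn, stride);
    }
#endif
}
```

A caller would step through the tile 4 columns at a time with `idwt53_v_4cols` and finish any leftover columns with the scalar routine.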

rouault added some commits Jun 20, 2017

Improve performance of inverse DWT 5x3 (#953)
* Use single-pass lifting inverse wavelet transform.
* For the vertical pass, use SSE2 when available so as to process 8 columns
  in parallel. This is the most beneficial improvement, since the
  vertical pass involves a lot of cache thrashing.

With the bench_dwt utility with default arguments (16383x16383 image),
time goes from 4.064 s to 1.212 s.
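
As an illustration of what "single-pass lifting" can look like for one line of samples, here is a minimal scalar sketch. It is not the code from this commit: the function name and the separate `low`/`high`/`out` buffers are assumptions made for the example, and it only covers the case where even-indexed samples are low-pass. Instead of one loop that undoes the update step for all even samples followed by a second loop that undoes the prediction step for all odd samples, each iteration reconstructs one even sample and immediately the odd sample just before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inverse reversible 5/3 transform of one line, in a single loop.
 * low[0..sn) are low-pass coefficients, high[0..dn) are high-pass ones,
 * out receives sn + dn samples (even indices come from the low band).
 * Relies on >> being an arithmetic shift for negative values, which holds
 * on mainstream compilers but is implementation-defined in C. */
static void idwt53_1d_single_pass(const int32_t *low, size_t sn,
                                  const int32_t *high, size_t dn,
                                  int32_t *out)
{
    size_t n;
    int32_t d_prev, d_cur, even_prev, even_cur;

    assert(sn >= 1 && (sn == dn || sn == dn + 1));
    if (dn == 0) {                  /* single sample: nothing to undo */
        out[0] = low[0];
        return;
    }

    d_prev = high[0];               /* d[-1] mirrors d[0] */
    even_prev = low[0] - ((d_prev + d_prev + 2) >> 2);
    out[0] = even_prev;

    for (n = 1; n < sn; n++) {
        /* past the last high-pass coefficient, mirror it */
        d_cur = (n < dn) ? high[n] : high[dn - 1];
        /* undo the update step for even sample 2n ... */
        even_cur = low[n] - ((d_prev + d_cur + 2) >> 2);
        out[2 * n] = even_cur;
        /* ... and right away the prediction step for odd sample 2n-1 */
        out[2 * n - 1] = d_prev + ((even_prev + even_cur) >> 1);
        d_prev = d_cur;
        even_prev = even_cur;
    }

    if (sn == dn) {                 /* even length: mirror the last even sample */
        out[2 * sn - 1] = d_prev + even_prev;
    }
}
```

Because every input coefficient is read once and every output sample is written once, the line is traversed a single time rather than twice.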

IDWT 5x3: generalize SSE2 version for AVX2
Thanks to our macros that abstract SSE use, the functions can use
AVX2 when available (at compile time)

This brings an extra 23% speed improvement on bench_dwt in 64-bit builds
with AVX2 compared to SSE2.
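
The macro-abstraction idea can be sketched as follows. The macro names (`VREG`, `LOADU`, `SAR`, ...) and the `undo_update_row` helper are illustrative assumptions, not necessarily those used in the commit: the point is only that one function body, written against a small set of macros, compiles to SSE2 or AVX2 intrinsics depending on what the compiler is targeting, with a scalar loop for leftover columns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#define VREG            __m256i
#define VREG_INT_COUNT  8          /* 8 x int32 per 256-bit register */
#define LOADU(p)        _mm256_loadu_si256((const __m256i *)(p))
#define STOREU(p, v)    _mm256_storeu_si256((__m256i *)(p), (v))
#define ADD(a, b)       _mm256_add_epi32((a), (b))
#define SUB(a, b)       _mm256_sub_epi32((a), (b))
#define SAR(a, n)       _mm256_srai_epi32((a), (n))
#define SET1(x)         _mm256_set1_epi32(x)
#elif defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define VREG            __m128i
#define VREG_INT_COUNT  4          /* 4 x int32 per 128-bit register */
#define LOADU(p)        _mm_loadu_si128((const __m128i *)(p))
#define STOREU(p, v)    _mm_storeu_si128((__m128i *)(p), (v))
#define ADD(a, b)       _mm_add_epi32((a), (b))
#define SUB(a, b)       _mm_sub_epi32((a), (b))
#define SAR(a, n)       _mm_srai_epi32((a), (n))
#define SET1(x)         _mm_set1_epi32(x)
#endif

/* One lifting step of the vertical pass applied to a whole row:
 * out[c] = low[c] - ((d_prev[c] + d_cur[c] + 2) >> 2) for every column c.
 * The vector loop is identical for SSE2 and AVX2; only the macros differ. */
static void undo_update_row(const int32_t *low, const int32_t *d_prev,
                            const int32_t *d_cur, int32_t *out, size_t width)
{
    size_t c = 0;
    assert(low && d_prev && d_cur && out);
#ifdef VREG
    {
        const VREG two = SET1(2);
        for (; c + VREG_INT_COUNT <= width; c += VREG_INT_COUNT) {
            VREG t = ADD(ADD(LOADU(d_prev + c), LOADU(d_cur + c)), two);
            STOREU(out + c, SUB(LOADU(low + c), SAR(t, 2)));
        }
    }
#endif
    /* scalar tail (or the whole row when no vector unit was selected) */
    for (; c < width; c++) {
        out[c] = low[c] - ((d_prev[c] + d_cur[c] + 2) >> 2);
    }
}
```

Built with a flag such as `-mavx2` (GCC/Clang), the same source picks the 256-bit path; otherwise an x86_64 build gets the SSE2 path.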
.travis.yml: add a configuration to test compilation of AVX2 (but disable tests since Travis doesn't have AVX2 compatible machines)
@rouault

Collaborator

rouault commented Jun 21, 2017

Note: the failure in AppVeyor is a network flake. Passes on the same commit pushed to my account: https://ci.appveyor.com/project/rouault/openjpeg/build/2.1.1.15

@rouault

Collaborator

rouault commented Jun 26, 2017

opj_decompress timing results on 8c05f00a-ae05-4dd5-bdc7-a1b5eed4ebfb.jp2 from testovani (15595 wide x 11128 tall x 3 components):

idwt_53_improvements branch, SSE2: 48.698 s
idwt_53_improvements branch, AVX2: 48.050 s
master branch, SSE2: 55.759 s
master branch, AVX2: 55.294 s

So an overall decrease of 12.6% (7.061 s) from master to the idwt_53_improvements branch with SSE2, and an extra decrease of 1.3% from SSE2 to AVX2 within the idwt_53_improvements branch.
Note: the SSE2->AVX2 improvement here consists of the gain from recompiling the whole code base with AVX2 (55.759 - 55.294 = 0.465 s, i.e. 465 ms) plus a specific improvement due to the IDWT 5x3 AVX2 optimization (48.698 - 48.050 - 0.465 = 0.183 s, i.e. 183 ms).

@rouault rouault merged commit 533fa2f into uclouvain:master Jun 26, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

rouault added a commit that referenced this pull request Jun 29, 2017

@detonin detonin added the enhancement label Aug 3, 2017
