T1 & DWT multithreading decoding optimizations #786

Merged
merged 23 commits on Sep 13, 2016

@rouault
Collaborator

rouault commented May 25, 2016

This PR builds upon #783 and adds multithreading decoding optimizations. (The multithreading decoding optimizations are independent of the improvements in #783, but as both touch t1.c, it was more convenient to build upon it.)

The main components of this PR are :

  • Add a threading API in thread.h/thread.c that provides an abstraction of POSIX mutexes and conditions, joinable threads, and a high-level thread pool API. Implementations are provided for pthread on Unix systems and for the Win32 API, with a stub used if no OS threading is detected or if thread use is explicitly disabled with CMake -DUSE_THREAD=OFF
  • Add an opj_codec_set_threads() API and equivalent logic through the OPJ_NUM_THREADS=int/ALL_CPUS environment variable (so as to ease testing of existing code without modifying it to use opj_codec_set_threads())
  • Use thread pool for parallel decoding of code blocks in T1
  • Use thread pool for parallel processing of horizontal and vertical 1D DWT
  • Add a -threads int/ALL_CPUS parameter to opj_decompress
  • Modify time measurement (on Linux) for opj_decompress so that it reports the elapsed time rather than the total CPU time
  • Add a new .travis.yml target that runs with OPJ_NUM_THREADS=2
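
As a rough illustration of how rows could be divided among pool workers for the 1D DWT passes, the sketch below splits a range of rows into near-equal chunks, one per job. The helper name `split_rows` and its signature are hypothetical and not taken from this PR; the actual partitioning in the library may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenJPEG): compute the half-open row
 * range [*first, *last) handled by job `job` when `total_rows` rows are
 * divided among `num_jobs` jobs. The last job absorbs the remainder. */
static void split_rows(size_t total_rows, size_t num_jobs, size_t job,
                       size_t *first, size_t *last)
{
    size_t step;

    assert(num_jobs > 0 && job < num_jobs);
    step = total_rows / num_jobs;
    *first = job * step;
    *last = (job + 1 == num_jobs) ? total_rows : (job + 1) * step;
}
```

With 10 rows and 4 jobs, jobs 0 to 2 each get 2 rows and the last job gets the remaining 4.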

Benchmarking:

This has been tested with the following files :
C1: issue135.j2k (from openjpeg-data, code blocks 32x32)
C2: Bretagne2.j2k (from openjpeg-data, code blocks 32x32)
C3: 20160307_125117_0c74.jp2 (non public test file, 3 bands, 12 bits, 6600x2200 for band 1, 3300x2200 for bands 2 and 3, code blocks 64x64)
C4: issue135_vsc.jp2 (issue135.j2k recoded by opj_compress -M 8, code blocks 64x64)
C5: issue135_raw.jp2 (issue135.j2k recoded by opj_compress -M 1, code blocks 64x64)
C6: S2A_OPER_MSI_L1C_TL_MTI__20150819T171650_A000763_T30SWE_B05.jp2 (Sentinel 2 tile, 5490x5490, 1 band, 12 bits, code blocks 64x64)

Builds done with -DCMAKE_BUILD_TYPE=Release. Times measured are the smallest time of 2 consecutive runs reported by "OPJ_NUM_THREADS=4 opj_decompress -i $(INPUT_FILE) -o /tmp/out.ppm" in the "decode time: XXX ms" line

Machine & OS spec: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz, Linux 64 bit

Before = state of PR #783
After = state of this PR

| compiler | C1 before (ms) | C1 after (ms) | delta % | C2 before (ms) | C2 after (ms) | delta % | C3 before (ms) | C3 after (ms) | delta % | C4 before (ms) | C4 after (ms) | delta % | C5 before (ms) | C5 after (ms) | delta % | C6 before (ms) | C6 after (ms) | delta % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCC 4.4 | 730 | 359 | -50.8 | 1569 | 658 | -58.1 | 3920 | 1310 | -66.6 | 2060 | 819 | -60.2 | 1569 | 606 | -61.4 | 3770 | 1398 | -62.9 |
| GCC 4.6.2 | 710 | 343 | -51.7 | 1490 | 657 | -55.9 | 3860 | 1585 | -58.9 | 2060 | 793 | -61.5 | 1480 | 587 | -60.3 | 3710 | 1351 | -63.6 |
| GCC 4.8.0 | 740 | 376 | -49.2 | 1560 | 661 | -57.6 | 4040 | 1390 | -65.6 | 2130 | 763 | -64.2 | 1540 | 623 | -59.5 | 3910 | 1408 | -64.0 |
| GCC 5.2.0 | 730 | 379 | -48.1 | 1660 | 712 | -57.1 | 4080 | 1548 | -62.1 | 2050 | 700 | -65.9 | 1569 | 539 | -65.6 | 3890 | 1502 | -61.4 |
| GCC 5.3.0 | 720 | 382 | -46.9 | 1580 | 658 | -58.4 | 4000 | 1407 | -64.8 | 2040 | 814 | -60.1 | 1560 | 598 | -61.7 | 3830 | 1484 | -61.3 |
| CLang 3.7.0 | 740 | 430 | -41.9 | 1560 | 702 | -55.0 | 4080 | 1570 | -61.5 | 2420 | 880 | -63.6 | 1549 | 636 | -58.9 | 3929 | 1413 | -64.0 |
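
For readers who want to reproduce the "delta %" figures, here is a small sketch. It assumes the column is computed as (after - before) / before × 100, which is consistent with the reported numbers but is not stated explicitly in the PR; the function names are illustrative.

```c
#include <assert.h>

/* Relative change in percent between two timings; negative means faster.
 * Assumed formula: (after - before) / before * 100. */
static double delta_percent(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (after_ms - before_ms) / before_ms * 100.0;
}

/* Speed-up factor: how many times faster the "after" run is. */
static double speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}
```

For GCC 4.4 on C1, `delta_percent(730, 359)` gives about -50.8, matching the table, which corresponds to a speed-up of roughly 2.03x with OPJ_NUM_THREADS=4.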

This work has been funded by Planet Labs

c0nk and others added some commits Dec 27, 2015

Move some MQC functions into a header for speed
Allow these hot functions to be inlined. This boosts decode performance by ~10%.

opj_t1_updateflags(): tiny optimization
We can avoid using a look-up table with some shift arithmetic.

Improve code generation in opj_t1_dec_clnpass()
Add an opj_t1_dec_clnpass_step_only_if_flag_not_sig_visit() method that
does the job of opj_t1_dec_clnpass_step_only() assuming the conditions
are met, and use it in opj_t1_dec_clnpass(). The compiler generates
more efficient code.

Reduce number of occurrences of orient function argument
This is essentially used to shift inside the lut_ctxno_zc, which we
can precompute at the beginning of opj_t1_decode_cblk() /
opj_t1_encode_cblk()

Tier 1 decoding: add a colflags array
Additional flag array such that colflags[1+0] is for state of col=0,row=0..3,
colflags[1+1] for col=1, row=0..3, colflags[1+flags_stride] for col=0,row=4..7, ...
This array avoids too much cache thrashing when processing by 4 vertical samples
as done in the various decoding steps.
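
The colflags layout described in this commit message can be written as an index computation. The sketch below is derived only from the three examples given in the message; the helper name `colflags_index` is hypothetical and the real code in t1.c may compute the offset differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the colflags entry holding the state of
 * column `col` for the group of 4 rows that contains `row`.
 * Follows the layout given in the commit message:
 *   colflags[1 + 0]            -> col 0, rows 0..3
 *   colflags[1 + 1]            -> col 1, rows 0..3
 *   colflags[1 + flags_stride] -> col 0, rows 4..7 */
static size_t colflags_index(size_t col, size_t row, size_t flags_stride)
{
    /* Entries of adjacent row groups would overlap otherwise. */
    assert(col < flags_stride);
    return 1 + col + (row / 4) * flags_stride;
}
```

One entry therefore covers 4 vertically adjacent samples, which is why a pass that walks the code block in stripes of 4 rows touches consecutive entries.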

Add opj_codec_set_threads() in public API and propagate resulting thread pool to tcd level

By default, only the main thread is used. If opj_codec_set_threads() is not used,
but the OPJ_NUM_THREADS environment variable is set, its value will be
used to initialize the number of threads. The value can be either an integer
number, or "ALL_CPUS". If OPJ_NUM_THREADS is set and this function is called,
this function will override the behaviour of the environment variable.
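
The precedence rules in this commit message can be sketched as follows. This is not the library's code: `parse_num_threads` and `resolve_num_threads` are hypothetical helpers, the CPU count is passed in by the caller, and falling back to one thread on invalid input (as well as the upper bound of 1024) is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: interpret a value such as the one held by
 * OPJ_NUM_THREADS. `value` may be NULL (variable unset), an integer
 * string, or "ALL_CPUS". Returns the number of threads to use, where
 * 1 means "main thread only". */
static int parse_num_threads(const char *value, int cpu_count)
{
    char *end = NULL;
    long n;

    assert(cpu_count >= 1);
    if (value == NULL || *value == '\0') {
        return 1;               /* default: only the main thread */
    }
    if (strcmp(value, "ALL_CPUS") == 0) {
        return cpu_count;
    }
    n = strtol(value, &end, 10);
    if (*end != '\0' || n < 1 || n > 1024) {
        return 1;               /* assumption: ignore invalid input */
    }
    return (int) n;
}

/* An explicit request (as made through opj_codec_set_threads()) wins
 * over the environment; `explicit_threads` <= 0 means "not requested"
 * in this sketch. */
static int resolve_num_threads(int explicit_threads, const char *env_value,
                               int cpu_count)
{
    if (explicit_threads > 0) {
        return explicit_threads;
    }
    return parse_num_threads(env_value, cpu_count);
}
```

A caller would typically pass `getenv("OPJ_NUM_THREADS")` as `env_value`.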

opj_decompress: use clock_gettime() instead of getrusage() so as to get the time spent, and not the total CPU time
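
To illustrate why this matters for multithreaded runs, the sketch below contrasts the two measurements on a POSIX system. The choice of CLOCK_MONOTONIC and the summing of user and system time are assumptions for the example; the commit message only names clock_gettime() and getrusage().

```c
#define _XOPEN_SOURCE 700  /* for clock_gettime() and getrusage() */
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Elapsed (wall-clock) time in seconds. CLOCK_MONOTONIC is an assumption. */
static double wall_clock_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

/* CPU time (user + system) consumed by the whole process, summed over
 * all of its threads, in seconds. */
static double process_cpu_seconds(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);

    assert(rc == 0);
    (void) rc;
    return (double) ru.ru_utime.tv_sec + (double) ru.ru_utime.tv_usec * 1e-6
         + (double) ru.ru_stime.tv_sec + (double) ru.ru_stime.tv_usec * 1e-6;
}
```

With N worker threads kept busy, the process CPU time grows roughly N times faster than the elapsed time, so a CPU-time measurement would hide the gain from multithreading.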

@rouault rouault changed the title from T1 & DWD multithreading decoding optimizations to T1 & DWT multithreading decoding optimizations May 25, 2016

{
opj_dwd_decode_h_job_t* job;
job = (opj_dwd_decode_h_job_t*) opj_malloc(sizeof(opj_dwd_decode_h_job_t));

@mayeut

mayeut May 26, 2016

Collaborator

allocation shall be checked for failure.
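
A minimal sketch of the kind of check being requested, under stated assumptions: `demo_job_t` is a hypothetical stand-in for the job structure in the excerpt, and the allocator is passed in as a function pointer only so that the failure path can be exercised. It is not the code that was eventually committed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-job structure; fields are illustrative. */
typedef struct {
    int first_row;
    int last_row;
} demo_job_t;

/* Allocator signature, so that a failing allocator can be injected. */
typedef void *(*demo_alloc_fn)(size_t size);

/* Allocate and fill one job. Returns NULL when the allocation fails, so
 * the caller can clean up and report an error instead of dereferencing
 * a null pointer. */
static demo_job_t *demo_job_create(demo_alloc_fn alloc, int first_row,
                                   int last_row)
{
    demo_job_t *job;

    assert(alloc != NULL);
    job = (demo_job_t *) alloc(sizeof(demo_job_t));
    if (job == NULL) {
        return NULL;
    }
    job->first_row = first_row;
    job->last_row = last_row;
    return job;
}

/* Allocator that always fails, to exercise the error path. */
static void *demo_failing_alloc(size_t size)
{
    (void) size;
    return NULL;
}
```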

@mayeut

Collaborator

mayeut commented May 26, 2016

Lots of good things. Unfortunately, I don't have much time to review this completely.
I think using native threads rather than OpenMP is a good thing as there will be more freedom for configuration in complex integrations.

opj_thread_pool_wait_completion(tp, 0);
opj_free(job);
opj_aligned_free(h.mem);
return OPJ_FALSE;

@mayeut

mayeut May 26, 2016

Collaborator

If the jobs are submitted outside of this loop (i.e. in another one following it), then it would be possible to fall back to single-thread in case of allocation error (and changing the single-thread condition below the loop). It's very likely that if an out-of-memory condition is hit here, another one will be raised later, so it's quite arguable whether to do this or not.

@rouault

rouault May 26, 2016

Collaborator

Ah ok, I now get what you meant. Well, on a modern OS, if malloc failures happen for such small structures you are in big trouble (swap thrashing, etc.), so a clean error exit is probably good enough, and a smarter fallback strategy is not needed.

@mayeut

mayeut May 27, 2016

Collaborator

As I said, it was quite arguable. A clean error is probably good enough.
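
For the record, the alternative discussed in this thread (allocate all jobs first, submit them afterwards, and fall back to single-thread processing if any allocation fails) could look roughly like the sketch below. It was not adopted here. All names are hypothetical, and the "submission" phase simply runs the jobs in sequence where real code would hand them to the thread pool and wait for completion.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int first_row; int last_row; } demo_job_t;
typedef void *(*demo_alloc_fn)(size_t size);
typedef void (*demo_row_fn)(int first_row, int last_row, void *user_data);

/* Returns the number of jobs used: `num_jobs` when every allocation
 * succeeded, 1 when the single-thread fallback was taken. */
static int demo_process_rows(demo_alloc_fn alloc, int total_rows,
                             int num_jobs, demo_row_fn process,
                             void *user_data)
{
    demo_job_t **jobs;
    int step, i, allocated = 0;

    assert(total_rows > 0 && num_jobs > 0 && num_jobs <= total_rows);
    step = total_rows / num_jobs;

    /* Phase 1: allocate everything up front, submit nothing yet. */
    jobs = (demo_job_t **) alloc((size_t) num_jobs * sizeof(*jobs));
    if (jobs != NULL) {
        for (i = 0; i < num_jobs; i++) {
            jobs[i] = (demo_job_t *) alloc(sizeof(demo_job_t));
            if (jobs[i] == NULL) {
                break;
            }
            jobs[i]->first_row = i * step;
            jobs[i]->last_row =
                (i + 1 == num_jobs) ? total_rows : (i + 1) * step;
            allocated++;
        }
    }

    if (jobs == NULL || allocated < num_jobs) {
        /* Fallback: release what was allocated, do all rows in one go. */
        for (i = 0; i < allocated; i++) {
            free(jobs[i]);
        }
        free(jobs);
        process(0, total_rows, user_data);
        return 1;
    }

    /* Phase 2: run the jobs (sequentially in this sketch). */
    for (i = 0; i < num_jobs; i++) {
        process(jobs[i]->first_row, jobs[i]->last_row, user_data);
        free(jobs[i]);
    }
    free(jobs);
    return num_jobs;
}

/* Test helpers: an allocator that fails after a set number of calls,
 * and a row callback that counts the rows it was given. */
static int demo_allocs_left = 0;

static void *demo_limited_alloc(size_t size)
{
    if (demo_allocs_left <= 0) {
        return NULL;
    }
    demo_allocs_left--;
    return malloc(size);
}

static void demo_count_rows(int first_row, int last_row, void *user_data)
{
    *(int *) user_data += last_row - first_row;
}
```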

{
opj_dwd_decode_v_job_t* job;
job = (opj_dwd_decode_v_job_t*) opj_malloc(sizeof(opj_dwd_decode_v_job_t));

@mayeut

mayeut May 26, 2016

Collaborator

Missing allocation check.

opj_thread_pool_wait_completion(tp, 0);
opj_free(job);
opj_aligned_free(v.mem);
return OPJ_FALSE;

julienmalik added a commit to senbox-org/s2tbx that referenced this pull request May 27, 2016

@mayeut

Collaborator

mayeut commented Jul 14, 2016

@detonin, it would probably be a good time to merge this PR or #783
Any idea ?

@detonin

Contributor

detonin commented Jul 22, 2016

@rouault Thanks for this great work. 2 questions:

  • Could you add the allocation checks suggested by @mayeut ?
  • Does this multithreading implementation work on all platforms or only on Linux ? (I haven't had the chance yet to test it on Windows or macOS, and I saw that the configuration you added in Travis is for Linux.)

thread->thread_fn = thread_fn;
thread->user_data = user_data;
thread->hThread = CreateThread( NULL, 0, opj_thread_callback_adapter, thread,

@mayeut

mayeut Jul 22, 2016

Collaborator

When building with MSVC, _beginthreadex shall be used instead of CreateThread; cf. the Remarks section of https://msdn.microsoft.com/en-us/library/windows/desktop/ms682453(v=vs.85).aspx.

@rouault

rouault Jul 26, 2016

Collaborator

thread.c is a C port of the cpl_multiproc.cpp code used in GDAL. We have used CreateThread() for years without problem. That said we could switch to _beginthreadex() if you think it is worth it.

@mayeut

mayeut Jul 26, 2016

Collaborator

It should be easy to check for _beginthreadex availability with cmake and given the remark from MSDN, I think it should be used so that the library returns properly rather than crashing the whole process. I don't know if the threads are calling any function from the CRT as of now (probably not) but future evolutions might and we'll have forgotten this.
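
A minimal sketch of what a joinable-thread wrapper using _beginthreadex() on Windows and pthread elsewhere might look like. The `demo_*` names are hypothetical and this is not the thread.c code from the PR; it only illustrates the shape of the adapter under discussion.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <process.h>   /* _beginthreadex */
#else
#include <pthread.h>
#endif

typedef void (*demo_thread_fn)(void *user_data);

typedef struct {
    demo_thread_fn fn;
    void *user_data;
#ifdef _WIN32
    HANDLE handle;
#else
    pthread_t handle;
#endif
} demo_thread_t;

/* Adapter matching the entry-point signature each backend expects. */
#ifdef _WIN32
static unsigned __stdcall demo_thread_adapter(void *arg)
#else
static void *demo_thread_adapter(void *arg)
#endif
{
    demo_thread_t *t = (demo_thread_t *) arg;
    t->fn(t->user_data);
    return 0;
}

/* Start a joinable thread; returns NULL on failure. */
static demo_thread_t *demo_thread_create(demo_thread_fn fn, void *user_data)
{
    demo_thread_t *t;

    assert(fn != NULL);
    t = (demo_thread_t *) malloc(sizeof(*t));
    if (t == NULL) {
        return NULL;
    }
    t->fn = fn;
    t->user_data = user_data;
#ifdef _WIN32
    /* _beginthreadex() goes through the C runtime, unlike a bare
     * CreateThread() call; it returns 0 on failure. */
    t->handle = (HANDLE) _beginthreadex(NULL, 0, demo_thread_adapter,
                                        t, 0, NULL);
    if (t->handle == NULL) {
        free(t);
        return NULL;
    }
#else
    if (pthread_create(&t->handle, NULL, demo_thread_adapter, t) != 0) {
        free(t);
        return NULL;
    }
#endif
    return t;
}

/* Wait for the thread to finish and release its resources. */
static void demo_thread_join(demo_thread_t *t)
{
#ifdef _WIN32
    WaitForSingleObject(t->handle, INFINITE);
    CloseHandle(t->handle);
#else
    pthread_join(t->handle, NULL);
#endif
    free(t);
}

/* Example thread body: sets the int pointed to by user_data to 1. */
static void demo_set_flag(void *user_data)
{
    *(int *) user_data = 1;
}
```

The handle returned by _beginthreadex() has to be closed by the caller, which is why the join helper calls CloseHandle() after waiting.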

@rouault

Collaborator

rouault commented Aug 11, 2016

I think all review comments have now been addressed

rouault added some commits Sep 8, 2016

Merge branch 'master' of https://github.com/uclouvain/openjpeg into tier1_optimizations_multithreading_2

Conflicts:
	src/lib/openjp2/t1.c
@rouault

Collaborator

rouault commented Sep 8, 2016

@mayeut @detonin Anything preventing this PR from being merged ? This branch has been sync'ed today with master to fix merge conflicts, all CI tests pass and all comments from review have been addressed AFAICS.

@detonin

Contributor

detonin commented Sep 8, 2016

@rouault Thanks for having kept the PR in sync and mergeable, and for having addressed the comments. From my point of view, it can now be merged indeed, especially as it does not break API nor ABI.
@mayeut ok for you ?

@mayeut

Collaborator

mayeut commented Sep 8, 2016

@detonin, is it possible to release v2.1.2 before merging this in something that would become v2.2.0 ?

@malaterre

Collaborator

malaterre commented Sep 9, 2016

@mayeut I was about to say something like this ! However, the root issue is still the need for some kind of release branch where all CVEs are fixed, which would simplify my life as a Debian package maintainer, while git/master serves as the -next branch...
I did not originally post this comment in the bug report, fearing it would delay @rouault's PR even more... (sorry Even)

@mayeut

Collaborator

mayeut commented Sep 11, 2016

@malaterre, what could be done is to merge current master branch in openjpeg-2.1 branch (openjpeg-2.1...master) while waiting for a release then merge this PR in master.

@detonin, is that ok ?

@detonin

Contributor

detonin commented Sep 11, 2016

@mayeut Yes, this is the purpose of having created a specific branch for each minor version. Releases are actually never tagged in master but rather in their specific X.y branch. I was wondering if we should keep doing it this way, given that in any case we do not have enough resources to maintain several versions of OpenJPEG and systematically backport bugfixes from master to release branches. Unless @malaterre you could take care of backporting potential security fixes after #786 is merged ?

@malaterre

Collaborator

malaterre commented Sep 13, 2016

@mayeut great suggestion ! I'll trim the thirdparty changes out of the merge and update this bug once I am done. People will go nuts if I incorporate such a large change in PNG/LCMS in a minor release.

@malaterre

Collaborator

malaterre commented Sep 13, 2016

@mayeut I am now happy with openjpeg-2.1 branch. Please go ahead and merge anything you want in git/master. Thanks everyone !

@detonin

Contributor

detonin commented Sep 13, 2016

@malaterre thanks Mathieu for the merge, we'll proceed with the PR then

@detonin detonin merged commit d6d0f07 into uclouvain:master Sep 13, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@detonin

Contributor

detonin commented Sep 13, 2016

@rouault Many thanks Even for these great optimizations that are now merged in master and will be released soon (as v2.2.0).

I guess some of the still-open PRs will need to be updated before they can be merged.

@rouault

Collaborator

rouault commented Sep 13, 2016

Great ! Thanks.

mayeut added a commit to mayeut/openjpeg that referenced this pull request Sep 13, 2016

Fix some warnings
Fix warnings introduced by uclouvain#786

@mayeut mayeut referenced this pull request Sep 13, 2016

Merged

Fix some warnings #838

mayeut added a commit that referenced this pull request Sep 13, 2016

Fix some warnings (#838)
Fix warnings introduced by #786

@detonin detonin added the enhancement label Aug 3, 2017
