T1 & DWT multithreading decoding optimizations #786

Merged
merged 23 commits into from
Sep 13, 2016

rouault
Collaborator

@rouault rouault commented May 25, 2016

This PR builds upon #783 and adds multithreaded decoding optimizations. (These optimizations are independent of the improvements in #783, but since both touch t1.c, it was more convenient to build on top of it.)

The main components of this PR are :

  • Add a threading API in thread.h/thread.c that abstracts POSIX-style mutexes, condition variables and joinable threads, plus a high-level thread pool API. There are implementations for pthreads on Unix systems and for the Win32 API, and a stub used when no OS threading support is detected or when threading is explicitly disabled with CMake -DUSE_THREAD=OFF
  • Add an opj_codec_set_threads() API and equivalent logic through the OPJ_NUM_THREADS=int/ALL_CPUS environment variable (to ease testing of existing code without modifying it to call opj_codec_set_threads())
  • Use thread pool for parallel decoding of code blocks in T1
  • Use thread pool for parallel processing of horizontal and vertical 1D DWT
  • Add a -threads int/ALL_CPUS parameter to opj_decompress
  • Modify time measurement (on Linux) for opj_decompress so that it reports elapsed time rather than total CPU time
  • Add a new .travis.yml target that runs with OPJ_NUM_THREADS=2
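
To illustrate how the rows (or columns) of a 1D DWT pass could be shared among pool jobs, here is a minimal sketch that splits a number of rows into contiguous, nearly equal ranges. The names `demo_row_range_t` and `demo_split_rows` are hypothetical and not taken from this PR; the actual job layout in dwt.c may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical job descriptor: a half-open range [first, last) of rows. */
typedef struct {
    size_t first;
    size_t last;
} demo_row_range_t;

/*
 * Split num_rows rows into at most num_jobs contiguous ranges of nearly
 * equal size. Writes the ranges to out (which must hold num_jobs entries)
 * and returns how many ranges were produced.
 */
static size_t demo_split_rows(size_t num_rows, size_t num_jobs,
                              demo_row_range_t *out)
{
    size_t i, count, base, extra, start = 0;

    assert(out != NULL);
    if (num_rows == 0 || num_jobs == 0) {
        return 0;
    }
    /* Never create more jobs than there are rows. */
    count = num_jobs < num_rows ? num_jobs : num_rows;
    base = num_rows / count;   /* rows that every job gets */
    extra = num_rows % count;  /* the first `extra` jobs get one more row */

    for (i = 0; i < count; i++) {
        size_t len = base + (i < extra ? 1 : 0);
        out[i].first = start;
        out[i].last = start + len;
        start += len;
    }
    return count;
}
```

Each range would then be handed to one pool job, and the caller would wait for all jobs to complete before starting the next pass.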

Benchmarking:

This has been tested with the following files :
C1: issue135.j2k (from openjpeg-data, code blocks 32x32)
C2: Bretagne2.j2k (from openjpeg-data, code blocks 32x32)
C3: 20160307_125117_0c74.jp2 (non public test file, 3 bands, 12 bits, 6600x2200 for band 1, 3300x2200 for bands 2 and 3, code blocks 64x64)
C4: issue135_vsc.jp2 (issue135.j2k re-encoded by opj_compress -M 8, code blocks 64x64)
C5: issue135_raw.jp2 (issue135.j2k re-encoded by opj_compress -M 1, code blocks 64x64)
C6: S2A_OPER_MSI_L1C_TL_MTI__20150819T171650_A000763_T30SWE_B05.jp2 (Sentinel 2 tile, 5490x5490, 1 band, 12 bits, code blocks 64x64)

Builds were done with -DCMAKE_BUILD_TYPE=Release. Each reported time is the smaller of 2 consecutive runs of "OPJ_NUM_THREADS=4 opj_decompress -i $(INPUT_FILE) -o /tmp/out.ppm", as printed in the "decode time: XXX ms" line.

Machine & OS spec: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz, Linux 64 bit

Before = state of PR #783
After = state of this PR

| compiler | C1 before (ms) | C1 after (ms) | delta % | C2 before (ms) | C2 after (ms) | delta % | C3 before (ms) | C3 after (ms) | delta % | C4 before (ms) | C4 after (ms) | delta % | C5 before (ms) | C5 after (ms) | delta % | C6 before (ms) | C6 after (ms) | delta % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCC 4.4 | 730 | 359 | -50.8 | 1569 | 658 | -58.1 | 3920 | 1310 | -66.6 | 2060 | 819 | -60.2 | 1569 | 606 | -61.4 | 3770 | 1398 | -62.9 |
| GCC 4.6.2 | 710 | 343 | -51.7 | 1490 | 657 | -55.9 | 3860 | 1585 | -58.9 | 2060 | 793 | -61.5 | 1480 | 587 | -60.3 | 3710 | 1351 | -63.6 |
| GCC 4.8.0 | 740 | 376 | -49.2 | 1560 | 661 | -57.6 | 4040 | 1390 | -65.6 | 2130 | 763 | -64.2 | 1540 | 623 | -59.5 | 3910 | 1408 | -64.0 |
| GCC 5.2.0 | 730 | 379 | -48.1 | 1660 | 712 | -57.1 | 4080 | 1548 | -62.1 | 2050 | 700 | -65.9 | 1569 | 539 | -65.6 | 3890 | 1502 | -61.4 |
| GCC 5.3.0 | 720 | 382 | -46.9 | 1580 | 658 | -58.4 | 4000 | 1407 | -64.8 | 2040 | 814 | -60.1 | 1560 | 598 | -61.7 | 3830 | 1484 | -61.3 |
| CLang 3.7.0 | 740 | 430 | -41.9 | 1560 | 702 | -55.0 | 4080 | 1570 | -61.5 | 2420 | 880 | -63.6 | 1549 | 636 | -58.9 | 3929 | 1413 | -64.0 |
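
For readers who want to reproduce the "delta %" columns, the figures are consistent with (after - before) / before * 100. That formula is inferred from the numbers rather than stated in the PR, and the helper names below are hypothetical.

```c
#include <assert.h>

/* Relative change in percent between two timings; negative means faster. */
static double demo_delta_percent(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (after_ms - before_ms) / before_ms * 100.0;
}

/* Speed-up factor corresponding to the same pair of timings. */
static double demo_speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}
```

For the GCC 4.4 / C1 cell (730 ms before, 359 ms after), this gives roughly -50.8 %, i.e. a speed-up of about 2x with OPJ_NUM_THREADS=4.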

This work has been funded by Planet Labs

c0nk and others added 19 commits May 21, 2016 15:18
Allow these hot functions to be inlined. This boosts decode performance by ~10%.
We can avoid using a look-up table with some shift arithmetic.
Add an opj_t1_dec_clnpass_step_only_if_flag_not_sig_visit() method that
does the job of opj_t1_dec_clnpass_step_only() assuming the conditions
are met. And use it in opj_t1_dec_clnpass(). The compiler generates
more efficient code.
This is essentially used to shift inside the lut_ctxno_zc, which we
can precompute at the beginning of opj_t1_decode_cblk() /
opj_t1_encode_cblk()
Additional flag array such that colflags[1+0] is for state of col=0,row=0..3,
colflags[1+1] for col=1, row=0..3, colflags[1+flags_stride] for col=0,row=4..7, ...
This array avoids too much cache thrashing when processing by 4 vertical samples
as done in the various decoding steps.
…ead pool to tcd level

By default, only the main thread is used. If opj_codec_set_threads() is not used,
but the OPJ_NUM_THREADS environment variable is set, its value will be
used to initialize the number of threads. The value can be either an integer
number, or "ALL_CPUS". If OPJ_NUM_THREADS is set and this function is called,
this function will override the behaviour of the environment variable.
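
The precedence described in this commit message can be sketched as follows. This is only an illustration: `demo_resolve_num_threads` is a hypothetical helper, and the conventions that a negative `explicit_threads` means "API not called", that 0 means "main thread only", that the match on "ALL_CPUS" is case-sensitive and that values above 1024 are rejected are all assumptions of this sketch, not behaviour confirmed by the PR.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: decide how many threads to use.
 *  - explicit_threads >= 0: value passed through the API; it always wins.
 *  - otherwise env_value (e.g. the result of getenv("OPJ_NUM_THREADS")):
 *    "ALL_CPUS" maps to cpu_count, an integer is taken as is.
 *  - unset or unparsable values fall back to 0 (main thread only).
 */
static int demo_resolve_num_threads(int explicit_threads,
                                    const char *env_value, int cpu_count)
{
    char *end = NULL;
    long parsed;

    assert(cpu_count >= 0);
    if (explicit_threads >= 0) {
        return explicit_threads;      /* API call overrides the environment */
    }
    if (env_value == NULL || env_value[0] == '\0') {
        return 0;                     /* default: only the main thread */
    }
    if (strcmp(env_value, "ALL_CPUS") == 0) {
        return cpu_count;
    }
    parsed = strtol(env_value, &end, 10);
    if (*end != '\0' || parsed < 0 || parsed > 1024) {
        return 0;                     /* reject garbage and absurd values */
    }
    return (int)parsed;
}
```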
…et the time spent, and not to the total CPU time
@rouault rouault changed the title from "T1 & DWD multithreading decoding optimizations" to "T1 & DWT multithreading decoding optimizations" May 25, 2016
{
opj_dwd_decode_h_job_t* job;

job = (opj_dwd_decode_h_job_t*) opj_malloc(sizeof(opj_dwd_decode_h_job_t));
Collaborator

allocation shall be checked for failure.
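
A minimal sketch of the kind of check being asked for, using a hypothetical `demo_job_t` and an injectable allocator rather than the PR's actual job struct and opj_malloc():

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-row DWT job; not the PR's actual struct. */
typedef struct {
    int first_row;
    int last_row;
} demo_job_t;

/* Allocator signature, so that a failing allocator can be injected. */
typedef void *(*demo_alloc_fn)(size_t size);

/*
 * Allocate and fill one job. Returns NULL when the allocation fails, so the
 * caller can wait for already-submitted jobs, release resources and bail out.
 */
static demo_job_t *demo_job_create(demo_alloc_fn alloc,
                                   int first_row, int last_row)
{
    demo_job_t *job;

    assert(alloc != NULL);
    job = (demo_job_t *)alloc(sizeof(demo_job_t));
    if (job == NULL) {
        return NULL; /* report the failure instead of dereferencing NULL */
    }
    job->first_row = first_row;
    job->last_row = last_row;
    return job;
}

/* Allocator that always fails, to exercise the error path. */
static void *demo_failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}
```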

@mayeut
Collaborator

mayeut commented May 26, 2016

Lots of good things; unfortunately, I don't have much time to review this completely.
I think using native threads rather than OpenMP is a good thing as there will be more freedom for configuration in complex integrations.

opj_thread_pool_wait_completion(tp, 0);
opj_free(job);
opj_aligned_free(h.mem);
return OPJ_FALSE;
Collaborator

If the jobs were submitted outside of this loop (i.e. in a second loop following it), it would be possible to fall back to single-threaded decoding in case of an allocation error (and moving the single-thread condition below the loop). It's very likely that if an out-of-memory condition is hit here, another will be raised later, so it's quite arguable whether to do this or not.

Collaborator Author

Ah OK, I've just now got what you meant. Well, on a modern OS, if malloc fails for such small structures you are in big trouble (swap thrashing, etc.), so a clean error exit is probably good enough, rather than a smarter fallback strategy.

Collaborator

As I said, it was quite arguable. A clean error is probably good enough.

julienmalik added a commit to senbox-org/s2tbx that referenced this pull request May 27, 2016
@mayeut
Collaborator

mayeut commented Jul 14, 2016

@detonin, it would probably be a good time to merge this PR or #783
Any idea ?

@detonin
Contributor

detonin commented Jul 22, 2016

@rouault Thanks for this great work. 2 questions:

  • Could you add the allocation checks suggested by @mayeut ?
  • Does this multithreading implementation work on all platforms or only on Linux? (I haven't had the chance to test it on Windows or macOS yet, and I saw that the configuration you added in Travis is Linux.)

thread->thread_fn = thread_fn;
thread->user_data = user_data;

thread->hThread = CreateThread( NULL, 0, opj_thread_callback_adapter, thread,
Collaborator

When building with MSVC, _beginthreadex shall be used instead of CreateThread; cf. the remarks section of https://msdn.microsoft.com/en-us/library/windows/desktop/ms682453(v=vs.85).aspx.

Collaborator Author

thread.c is a C port of the cpl_multiproc.cpp code used in GDAL. We have used CreateThread() for years without problem. That said we could switch to _beginthreadex() if you think it is worth it.

Collaborator

It should be easy to check for _beginthreadex availability with cmake and given the remark from MSDN, I think it should be used so that the library returns properly rather than crashing the whole process. I don't know if the threads are calling any function from the CRT as of now (probably not) but future evolutions might and we'll have forgotten this.
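
As a rough sketch of what the switch could look like, the snippet below starts a thread with _beginthreadex() on Windows and joins it. The `demo_*` names are hypothetical and this is not the code that ended up in thread.c; on non-Windows builds the sketch simply calls the function directly so that it still compiles and runs.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#include <process.h> /* _beginthreadex */
#endif

typedef void (*demo_thread_fn)(void *user_data);

typedef struct {
    demo_thread_fn fn;
    void *user_data;
} demo_thread_args_t;

#ifdef _WIN32
/* Adapter with the calling convention and return type _beginthreadex expects. */
static unsigned __stdcall demo_thread_adapter(void *arg)
{
    demo_thread_args_t *args = (demo_thread_args_t *)arg;
    args->fn(args->user_data);
    return 0;
}
#endif

/* Run fn(user_data) and wait for it to finish. Returns 1 on success, 0 on failure. */
static int demo_run_and_join(demo_thread_fn fn, void *user_data)
{
    demo_thread_args_t args;

    assert(fn != NULL);
    args.fn = fn;
    args.user_data = user_data;
#ifdef _WIN32
    {
        /* args lives on this stack frame, which is safe because we join below. */
        uintptr_t handle = _beginthreadex(NULL, 0, demo_thread_adapter,
                                          &args, 0, NULL);
        if (handle == 0) {
            return 0; /* thread creation failed */
        }
        WaitForSingleObject((HANDLE)handle, INFINITE);
        CloseHandle((HANDLE)handle);
        return 1;
    }
#else
    /* Non-Windows builds of this sketch: no CRT thread API, call directly. */
    args.fn(args.user_data);
    return 1;
#endif
}

/* Trivial work item used to check that the function actually ran. */
static void demo_increment(void *user_data)
{
    ++*(int *)user_data;
}
```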

@rouault
Collaborator Author

rouault commented Aug 11, 2016

I think all review comments have now been addressed

@rouault
Collaborator Author

rouault commented Sep 8, 2016

@mayeut @detonin Anything preventing this PR from being merged ? This branch has been sync'ed today with master to fix merge conflicts, all CI tests pass and all comments from review have been addressed AFAICS.

@detonin
Contributor

detonin commented Sep 8, 2016

@rouault Thanks for having kept the PR in sync and mergeable, and for having addressed the comments. From my point of view, it can indeed be merged now, especially as it breaks neither the API nor the ABI.
@mayeut ok for you ?

@mayeut
Collaborator

mayeut commented Sep 8, 2016

@detonin, is it possible to release v2.1.2 before merging this in something that would become v2.2.0 ?

@malaterre
Collaborator

@mayeut I was about to say something like this! However, the root issue is still to have some kind of release branch where all CVEs are fixed, which would simplify my life as a Debian package maintainer, and at the same time have git/master be the -next branch...
I did not originally post this comment in the bug report, fearing it would delay @rouault's PR even more... (sorry Even)

@mayeut
Collaborator

mayeut commented Sep 11, 2016

@malaterre, what could be done is to merge current master branch in openjpeg-2.1 branch (openjpeg-2.1...master) while waiting for a release then merge this PR in master.

@detonin, is that ok ?

@detonin
Contributor

detonin commented Sep 11, 2016

@mayeut Yes, this is the purpose of having created a specific branch for each minor version. Releases are actually never tagged in master but rather in their specific X.y branch. I was wondering whether we should keep working this way, given that in any case we do not have enough resources to maintain several versions of OpenJPEG and systematically backport bugfixes from master to release branches. Unless, @malaterre, you could take care of this backport for potential security fixes after #786 is merged?

@malaterre
Collaborator

@mayeut great suggestion! I'll trim the thirdparty changes out of the merge and update this bug once I am done. People will go nuts if I incorporate such a large change to PNG/LCMS in a minor release.

@malaterre
Collaborator

@mayeut I am now happy with openjpeg-2.1 branch. Please go ahead and merge anything you want in git/master. Thanks everyone !

@detonin
Contributor

detonin commented Sep 13, 2016

@malaterre thanks Mathieu for the merge, we'll proceed with the PR then

@detonin detonin merged commit d6d0f07 into uclouvain:master Sep 13, 2016
@detonin
Contributor

detonin commented Sep 13, 2016

@rouault Many thanks Even for these great optimizations that are now merged in master and will be released soon (as v2.2.0).

I guess some of the still-open PRs will need to be updated before they can be merged.

@rouault
Collaborator Author

rouault commented Sep 13, 2016

Great ! Thanks.

mayeut added a commit to mayeut/openjpeg that referenced this pull request Sep 13, 2016
Fix warnings introduced by uclouvain#786
@mayeut mayeut mentioned this pull request Sep 13, 2016
mayeut added a commit that referenced this pull request Sep 13, 2016
Fix warnings introduced by #786
5 participants