T1 & DWT multithreading decoding optimizations #786

rouault · 2016-05-25T20:16:50Z

This PR builds upon #783 and adds multithreading decoding optimizations. (The multithreading decoding optimizations are independent of the improvements in #783, but as both touch t1.c, it was more convenient to build upon it.)

The main components of this PR are :

Add a threading API in thread.h/thread.c that abstracts POSIX-style mutexes, condition variables and joinable threads, plus a high-level thread pool API. Implementations are provided for pthread on Unix systems and for the Win32 API, with a stub used if no OS threading support is detected or if threading is explicitly disabled with CMake -DUSE_THREAD=OFF
Add an opj_codec_set_threads() API and equivalent logic through the OPJ_NUM_THREADS=int/ALL_CPUS environment variable (to ease testing of existing code without modifying it to call opj_codec_set_threads())
Use the thread pool for parallel decoding of code blocks in T1
Use the thread pool for parallel processing of horizontal and vertical 1D DWT
Add a -threads int/ALL_CPUS parameter to opj_decompress
Modify time measurement (on Linux) for opj_decompress to avoid reporting the total CPU time
Add a new .travis.yml target that runs with OPJ_NUM_THREADS=2
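
How the rows or columns of a 1D DWT pass are divided among thread pool jobs is not spelled out in the list above. As a rough sketch only, one simple scheme gives each job a contiguous band and lets the last job absorb the remainder. The name `demo_job_range` and its parameters are hypothetical and are not OpenJPEG functions:

```c
#include <assert.h>

/* Hypothetical helper: compute the half-open range [*first, *end) of rows
 * handled by job number `job` out of `num_jobs`, for `total_rows` rows.
 * The last job also takes the rows left over by the integer division. */
void demo_job_range(unsigned total_rows, unsigned num_jobs, unsigned job,
                    unsigned *first, unsigned *end)
{
    unsigned step;

    assert(num_jobs > 0 && job < num_jobs);
    assert(first != 0 && end != 0);

    step = total_rows / num_jobs;
    *first = job * step;
    *end = (job + 1 == num_jobs) ? total_rows : (job + 1) * step;
}
```

With 10 rows and 4 jobs, the first three jobs get 2 rows each and the last one gets 4. Each range could then be handed to a worker as an independent job, since the rows of a horizontal pass do not depend on each other.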

Benchmarking:

This has been tested with the following files :
C1: issue135.j2k (from openjpeg-data, code blocks 32x32)
C2: Bretagne2.j2k (from openjpeg-data, code blocks 32x32)
C3: 20160307_125117_0c74.jp2 (non public test file, 3 bands, 12 bits, 6600x2200 for band 1, 3300x2200 for bands 2 and 3, code blocks 64x64)
C4: issue135_vsc.jp2 (issue135.j2k recoded by opj_compress -M 8, code blocks 64x64)
C5: issue135_raw.jp2 (issue135.j2k recoded by opj_compress -M 1, code blocks 64x64)
C6: S2A_OPER_MSI_L1C_TL_MTI__20150819T171650_A000763_T30SWE_B05.jp2 (Sentinel 2 tile, 5490x5490, 1 band, 12 bits, code blocks 64x64)

Builds done with -DCMAKE_BUILD_TYPE=Release. Times measured are the smallest time of 2 consecutive runs reported by "OPJ_NUM_THREADS=4 opj_decompress -i $(INPUT_FILE) -o /tmp/out.ppm" in the "decode time: XXX ms" line

Machine & OS spec: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz, Linux 64 bit

Before = state of PR #783
After = state of this PR

compiler	C1 before (ms)	C1 after (ms)	delta %	C2 before (ms)	C2 after (ms)	delta %	C3 before (ms)	C3 after (ms)	delta %	C4 before (ms)	C4 after (ms)	delta %	C5 before (ms)	C5 after (ms)	delta %	C6 before (ms)	C6 after (ms)	delta %
GCC 4.4	730	359	-50.8	1569	658	-58.1	3920	1310	-66.6	2060	819	-60.2	1569	606	-61.4	3770	1398	-62.9
GCC 4.6.2	710	343	-51.7	1490	657	-55.9	3860	1585	-58.9	2060	793	-61.5	1480	587	-60.3	3710	1351	-63.6
GCC 4.8.0	740	376	-49.2	1560	661	-57.6	4040	1390	-65.6	2130	763	-64.2	1540	623	-59.5	3910	1408	-64.0
GCC 5.2.0	730	379	-48.1	1660	712	-57.1	4080	1548	-62.1	2050	700	-65.9	1569	539	-65.6	3890	1502	-61.4
GCC 5.3.0	720	382	-46.9	1580	658	-58.4	4000	1407	-64.8	2040	814	-60.1	1560	598	-61.7	3830	1484	-61.3
CLang 3.7.0	740	430	-41.9	1560	702	-55.0	4080	1570	-61.5	2420	880	-63.6	1549	636	-58.9	3929	1413	-64.0
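
The "delta %" columns are consistent with the relative change between the two timings, (after - before) / before × 100. A minimal sketch of that arithmetic, where `demo_delta_percent` is a hypothetical helper name and not part of OpenJPEG:

```c
#include <assert.h>

/* Hypothetical helper: relative change in percent between two timings.
 * A negative result means the "after" run is faster than the "before" run. */
double demo_delta_percent(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (after_ms - before_ms) / before_ms * 100.0;
}
```

For the GCC 4.4 row, C1 goes from 730 ms to 359 ms, which gives about -50.8, matching the table.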

This work has been funded by Planet Labs

Additional flag array such that colflags[1+0] holds the state of col=0, row=0..3, colflags[1+1] that of col=1, row=0..3, colflags[1+flags_stride] that of col=0, row=4..7, ... This array avoids too much cache thrashing when processing 4 vertical samples at a time, as done in the various decoding steps.
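
The layout described above can be read as one entry per column and per group of 4 rows, offset by 1. A small sketch of the index computation implied by those three examples, assuming nothing beyond them (`demo_colflags_index` is a hypothetical name, not a function from t1.c):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper derived from the examples in the text:
 *   col=0, row=0..3 -> 1 + 0
 *   col=1, row=0..3 -> 1 + 1
 *   col=0, row=4..7 -> 1 + flags_stride
 * so each group of 4 vertical samples shares a single entry. */
size_t demo_colflags_index(size_t col, size_t row, size_t flags_stride)
{
    assert(flags_stride > 0);
    return 1 + col + (row / 4) * flags_stride;
}
```

Because the 4 samples of a column stripe map to the same entry, a decoding pass that walks a stripe touches one array element instead of four separate flag rows.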

…ead pool to tcd level

By default, only the main thread is used. If opj_codec_set_threads() is not called but the OPJ_NUM_THREADS environment variable is set, its value is used to initialize the number of threads. The value can be either an integer or "ALL_CPUS". If OPJ_NUM_THREADS is set and opj_codec_set_threads() is also called, the function call overrides the environment variable.
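
The precedence rules above can be summarized in a few lines of C. This is only a sketch of the described behaviour, not the library's code: `demo_resolve_num_threads`, its "value <= 0 means the setter was not called" convention, and the fallback to 1 for unparsable values are all assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper returning the number of threads to use.
 *   api_threads: value given through the API, or <= 0 if the setter was
 *                never called (assumed convention)
 *   env_value:   content of OPJ_NUM_THREADS, or NULL if unset
 *   cpu_count:   number of CPUs, used for "ALL_CPUS" */
int demo_resolve_num_threads(int api_threads, const char *env_value,
                             int cpu_count)
{
    char *end = NULL;
    long parsed;

    assert(cpu_count >= 1);

    /* An explicit API call overrides the environment variable. */
    if (api_threads >= 1) {
        return api_threads;
    }
    /* Default: only the main thread is used. */
    if (env_value == NULL || env_value[0] == '\0') {
        return 1;
    }
    if (strcmp(env_value, "ALL_CPUS") == 0) {
        return cpu_count;
    }
    parsed = strtol(env_value, &end, 10);
    if (end == env_value || *end != '\0' || parsed < 1 || parsed > INT_MAX) {
        return 1; /* assumed fallback for invalid values */
    }
    return (int) parsed;
}
```

A caller could pass `getenv("OPJ_NUM_THREADS")` as `env_value`. Keeping the environment lookup outside the helper makes the precedence logic easy to test in isolation.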

mayeut · 2016-05-26T21:02:20Z

src/lib/openjp2/dwt.c

+            {
+                opj_dwd_decode_h_job_t* job;
+
+                job = (opj_dwd_decode_h_job_t*) opj_malloc(sizeof(opj_dwd_decode_h_job_t));


allocation shall be checked for failure.
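
As an illustration of the requested check, here is a generic sketch using plain `malloc()`. The `demo_job_t` type and `demo_create_job()` function are hypothetical stand-ins and do not reproduce the job structure or the allocation wrappers used in dwt.c:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical job description: a band of rows to process. */
typedef struct {
    int first_row;
    int last_row;
} demo_job_t;

/* Allocate and fill a job. Returns 1 on success, 0 if the allocation
 * failed, in which case *out_job is set to NULL and the caller can
 * clean up and report an error instead of dereferencing a null pointer. */
int demo_create_job(int first_row, int last_row, demo_job_t **out_job)
{
    demo_job_t *job;

    assert(out_job != NULL);

    job = (demo_job_t *) malloc(sizeof(demo_job_t));
    if (job == NULL) {
        *out_job = NULL;
        return 0;
    }
    job->first_row = first_row;
    job->last_row = last_row;
    *out_job = job;
    return 1;
}
```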

mayeut · 2016-05-26T21:10:38Z

Lots of good things, unfortunately, I don't have much time to review this completely.
I think using native threads rather than OpenMP is a good thing as there will be more freedom for configuration in complex integrations.

mayeut · 2016-05-26T21:16:52Z

src/lib/openjp2/dwt.c

+                    opj_thread_pool_wait_completion(tp, 0);
+                    opj_free(job);
+                    opj_aligned_free(h.mem);
+                    return OPJ_FALSE;


If the jobs are submitted outside of this loop (i.e. in another one following), then it would be possible to fall back to single-thread in case of allocation error (& changing single-thread condition below the loop). It's very likely that if hitting an out of memory condition here, one will be raised later, so it's quite arguable whether to do this or not.

rouault · 2016-05-26

Ah ok, I now just got what you meant. Well, in modern OS, if malloc failures for such small structures happen you are in big trouble (swap thrashing, etc...), so a clean error exit is probably good enough, rather than a smarter fallback strategy

mayeut · 2016-05-27

As I said, it was quite arguable. A clean error is probably good enough.

mayeut · 2016-07-14T08:56:46Z

@detonin, it would probably be a good time to merge this PR or #783
Any idea ?

detonin · 2016-07-22T18:46:46Z

@rouault Thanks for this great work. 2 questions:

Could you add the allocation checks suggested by @mayeut ?
Does this multithreading implementation work on all platforms or only on Linux ? (didn't have the chance yet to test it on windows or macos and saw the conf you added in travis is linux)

mayeut · 2016-07-22T21:13:51Z

src/lib/openjp2/thread.c

+    thread->thread_fn = thread_fn;
+    thread->user_data = user_data;
+
+    thread->hThread = CreateThread( NULL, 0, opj_thread_callback_adapter, thread,


When building with MSVC, _beginthreadex shall be used instead of CreateThread. c.f. https://msdn.microsoft.com/en-us/library/windows/desktop/ms682453(v=vs.85).aspx remarks section.

rouault · 2016-07-26

thread.c is a C port of the cpl_multiproc.cpp code used in GDAL. We have used CreateThread() for years without problem. That said we could switch to _beginthreadex() if you think it is worth it.

mayeut · 2016-07-26

It should be easy to check for _beginthreadex availability with cmake and given the remark from MSDN, I think it should be used so that the library returns properly rather than crashing the whole process. I don't know if the threads are calling any function from the CRT as of now (probably not) but future evolutions might and we'll have forgotten this.
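
To make the suggestion concrete, here is a small portable sketch of a joinable-thread wrapper that uses `_beginthreadex()` on Windows and `pthread_create()` elsewhere. It is not the code from thread.c: all `demo_*` names are hypothetical, and error handling is reduced to a success flag.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#include <process.h>
#else
#include <pthread.h>
#endif

typedef void (*demo_thread_fn)(void *user_data);

typedef struct {
    demo_thread_fn fn;
    void *user_data;
#ifdef _WIN32
    HANDLE handle;
#else
    pthread_t handle;
#endif
} demo_thread_t;

/* Adapter with the entry-point signature each API expects. */
#ifdef _WIN32
static unsigned __stdcall demo_thread_adapter(void *arg)
#else
static void *demo_thread_adapter(void *arg)
#endif
{
    demo_thread_t *t = (demo_thread_t *) arg;
    t->fn(t->user_data);
    return 0;
}

/* Start a joinable thread. Returns 1 on success, 0 on failure. */
int demo_thread_create(demo_thread_t *t, demo_thread_fn fn, void *user_data)
{
    assert(t != 0 && fn != 0);
    t->fn = fn;
    t->user_data = user_data;
#ifdef _WIN32
    /* _beginthreadex() sets up per-thread CRT state; it returns 0 on error. */
    t->handle = (HANDLE) _beginthreadex(NULL, 0, demo_thread_adapter, t, 0, NULL);
    return t->handle != NULL;
#else
    return pthread_create(&t->handle, NULL, demo_thread_adapter, t) == 0;
#endif
}

/* Wait for the thread to finish and release its handle. */
void demo_thread_join(demo_thread_t *t)
{
#ifdef _WIN32
    WaitForSingleObject(t->handle, INFINITE);
    CloseHandle(t->handle);
#else
    pthread_join(t->handle, NULL);
#endif
}

/* Sample thread body used to exercise the wrapper. */
void demo_increment(void *user_data)
{
    ++*(int *) user_data;
}
```

Only the Windows creation call differs from a `CreateThread()` version. The handle returned by `_beginthreadex()` is still waited on and closed with the usual Win32 functions.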

rouault · 2016-08-11T20:19:44Z

I think all review comments have now been addressed

rouault · 2016-09-08T09:23:28Z

@mayeut @detonin Anything preventing this PR from being merged ? This branch has been sync'ed today with master to fix merge conflicts, all CI tests pass and all comments from review have been addressed AFAICS.

detonin · 2016-09-08T10:29:00Z

@rouault Thanks for having kept the PR in sync and mergeable, and for having addressed the comments. From my point of view, it can now be merged indeed, especially as it does not break API nor ABI.
@mayeut ok for you ?

mayeut · 2016-09-08T21:05:45Z

@detonin, is it possible to release v2.1.2 before merging this in something that would become v2.2.0 ?

malaterre · 2016-09-09T05:51:25Z

@mayeut I was about to say something like this ! However the root issue is still to have some kind of release branch where all CVE are fixed, which would simplify my life as a Debian package maintainer. And at the same time have git/master be the -next branch...
I did not post originally this comment in the bug report, fearing this would delay @rouault PR even more... (sorry Even)

mayeut · 2016-09-11T18:38:59Z

@malaterre, what could be done is to merge current master branch in openjpeg-2.1 branch (openjpeg-2.1...master) while waiting for a release then merge this PR in master.

@detonin, is that ok ?

detonin · 2016-09-11T18:53:43Z

@mayeut Yes, this is the purpose of having created a specific branch for each minor version. Releases are actually never tagged in master but rather in their specific X.y branch. I was wondering if we should keep this way of doing given that in any case we do not have enough resources to maintain several versions of OpenJPEG and backport systematically bugfixes from Master to release branches. Unless @malaterre you could take care of this backport for potential security fixes after #786 merging ?

malaterre · 2016-09-13T08:49:23Z

@mayeut great suggestion ! I'll trim the thirdparty changes out of the merge and update this bug once I am done. People will go nuts if I incorporate such large change in PNG/LCMS in a minor release.

malaterre · 2016-09-13T09:01:53Z

@mayeut I am now happy with openjpeg-2.1 branch. Please go ahead and merge anything you want in git/master. Thanks everyone !

detonin · 2016-09-13T09:29:45Z

@malaterre thanks Mathieu for the merge, we'll proceed with the PR then

detonin · 2016-09-13T14:41:54Z

@rouault Many thanks Even for these great optimizations that are now merged in master and will be released soon (as v2.2.0).

I guess some of the still open PR will need to be updated before being able to merge them.

rouault · 2016-09-13T14:49:59Z

Great ! Thanks.

c0nk and others added 19 commits May 21, 2016 15:18

Move some MQC functions into a header for speed

426bf8d

Allow these hot functions to be inlined. This boosts decode performance by ~10%.

opj_t1_updateflags(): tiny optimization

c539808

We can avoid using a look-up table with some shift arithmetic.

Improve code generation in opj_t1_dec_clnpass()

d8fef96

Add an opj_t1_dec_clnpass_step_only_if_flag_not_sig_visit() method that does the job of opj_t1_dec_clnpass_step_only() assuming the conditions are met, and use it in opj_t1_dec_clnpass(). The compiler generates more efficient code.

Specialize decoding passes for 64x64 code blocks

23a01df

Reduce number of occurrences of orient function argument

ba1edf6

This is essentially used to shift inside the lut_ctxno_zc, which we can precompute at the beginning of opj_t1_decode_cblk() / opj_t1_encode_cblk()

Const'ify lut arrays so they are in the read-only data section

31882ad

opj_t1_decode_cblks(): tiny perf increase when loop unrolling

93f7f90

opj_t1_dec_clnpass(): remove useless test in the runlen decoding path (of the non VSC case)

956c31d

Better inlining of opj_t1_updateflagscolflags() w.r.t. flags_stride

8371491

Improve perf of opj_t1_dec_sigpass_mqc_vsc() and opj_t1_dec_refpass_mqc_vsc() with loop unrolling

107eb31

Fix MSVC210 build issue (use of C99 declaration after statement) introduced in ba1edf6

7092f7e

Add threading and thread pool API

54179fe

Use thread-pool for T1 decoding

5fbb8b2

Use thread pool for DWT decoding

57b216b

.travis.yml: add a conf with OPJ_NUM_THREADS=2

e3eb0a2

opj_decompress: add a -threads <num_threads> option

d67cd22

opj_decompress: use clock_gettime() instead of getrusage() so as to get the time spent, and not to the total CPU time

69497d3
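
The reason for this last change is that CPU time is summed over all threads, so a multithreaded decode can report more CPU time than the time the user actually waited. A minimal sketch of the two kinds of measurement follows. The choice of `CLOCK_MONOTONIC` is an assumption, standard `clock()` is used here as a stand-in for `getrusage()`, and the `demo_*` names are hypothetical:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() under strict C modes */
#endif
#include <assert.h>
#include <time.h>

/* Elapsed (wall-clock) time in seconds since an arbitrary starting point. */
double demo_wall_clock_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void) rc;
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

/* Processor time consumed by the whole process, in seconds. With several
 * busy threads this can grow faster than the wall-clock time. */
double demo_cpu_seconds(void)
{
    return (double) clock() / (double) CLOCKS_PER_SEC;
}
```

Taking the difference of two `demo_wall_clock_seconds()` calls around the decode gives the kind of figure a "decode time" line is expected to report for a multithreaded run.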

rouault mentioned this pull request May 25, 2016

Performance patch and a few enhancements #568

Closed

rouault changed the title ~~T1 & DWD multithreading decoding optimizations~~ T1 & DWT multithreading decoding optimizations May 25, 2016

mayeut reviewed May 26, 2016

Be robust to failed allocations of job structures

7d3c7a3

julienmalik added a commit to senbox-org/s2tbx that referenced this pull request May 27, 2016

Integrate uclouvain/openjpeg#786 into our openjpeg builds

e8573d4

rouault mentioned this pull request Jul 20, 2016

Implement predictive termination check #800

Merged

mayeut reviewed Jul 22, 2016

[Win32] Use _beginthreadex instead of CreateThread()

4f9abb9

rouault added 2 commits September 8, 2016 09:43

opj_thread_pool: fix potential deadlock at thread pool destruction

ab22c5b

Merge branch 'master' of https://github.com/uclouvain/openjpeg into t…

48c16b2

…ier1_optimizations_multithreading_2 Conflicts: src/lib/openjp2/t1.c

detonin merged commit d6d0f07 into uclouvain:master Sep 13, 2016

mayeut mentioned this pull request Sep 13, 2016

Add overflow checks for opj_aligned_malloc #832

Merged

mayeut added a commit to mayeut/openjpeg that referenced this pull request Sep 13, 2016

Fix some warnings

08b7c79

Fix warnings introduced by uclouvain#786

mayeut mentioned this pull request Sep 13, 2016

Fix some warnings #838

Merged

mayeut added a commit that referenced this pull request Sep 13, 2016

Fix some warnings (#838)

0954bc1

Fix warnings introduced by #786

detonin added the enhancement label Aug 3, 2017

tangxu00 mentioned this pull request Jan 17, 2019

multithreading encoding support? #1177

Open
