Tier1 decoder speed optimizations #783

Merged
merged 12 commits into uclouvain:master on Sep 13, 2016

Conversation

rouault (Collaborator) commented May 23, 2016

This patch series improves T1 decoding speed, resulting in overall decompression time gains of typically 10-15% for operational products.

Various tricks used:

  • more aggressive inlining (reusing #675)
  • specialization of the decoding of 64x64 code blocks, which are common in a lot of products (see the sketch after this list)
  • addition of an auxiliary colflags array, in which each 16-bit item stores the overall state of 4 values in a column and thus enables quick checks in a cache-friendly way
  • loop unrolling for the VSC steps (similar to the non-VSC case)
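
A minimal sketch of the 64x64 specialization idea, assuming nothing about the real pass bodies (the loop body and names below are stand-ins, not the actual code): the inner pass is written once as a macro parameterized on the dimensions, then expanded a second time with the literal constants 64, 64 so the compiler sees fixed trip counts and can unroll or vectorize the stripe loops.

```c
#include <stdint.h>

/* Shared body, parameterized on width/height; the "+= 1" is only a
 * placeholder for the real per-sample decoding work. */
#define DEC_PASS_INTERNAL(data, w, h)                                   \
    do {                                                                \
        uint32_t i, j, k;                                               \
        for (k = 0; k < (h); k += 4)            /* stripes of 4 rows */ \
            for (i = 0; i < (w); i++)                                   \
                for (j = 0; j < 4 && k + j < (h); j++)                  \
                    (data)[(k + j) * (w) + i] += 1;                     \
    } while (0)

/* Generic variant: dimensions only known at run time. */
static void dec_pass_generic(int32_t *data, uint32_t w, uint32_t h)
{
    DEC_PASS_INTERNAL(data, w, h);
}

/* Specialized variant: the 64x64 constants are visible to the optimizer. */
static void dec_pass_64x64(int32_t *data)
{
    DEC_PASS_INTERNAL(data, 64, 64);
}
```

The caller would then dispatch to the 64x64 variant when the code block is exactly 64x64 and fall back to the generic one otherwise.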

Benchmarking:

This has been tested with the following files:
C1: issue135.j2k (from openjpeg-data, code blocks 32x32)
C2: Bretagne2.j2k (from openjpeg-data, code blocks 32x32)
C3: 20160307_125117_0c74.jp2 (non-public test file, 3 bands, 12 bits, 6600x2200 for band 1, 3300x2200 for bands 2 and 3, code blocks 64x64)
C4: issue135_vsc.jp2 (issue135.j2k recoded by opj_compress -M 8, code blocks 64x64)
C5: issue135_raw.jp2 (issue135.j2k recoded by opj_compress -M 1, code blocks 64x64)
C6: S2A_OPER_MSI_L1C_TL_MTI__20150819T171650_A000763_T30SWE_B05.jp2 (Sentinel 2 tile, 5490x5490, 1 band, 12 bits, code blocks 64x64)

Builds were done with -DCMAKE_BUILD_TYPE=Release. Times measured are the smallest of 2 consecutive runs, as reported by "opj_decompress -i $(INPUT_FILE) -o /tmp/out.ppm" in its "decode time: XXX ms" line.

Machine & OS spec: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz, 64-bit Linux

Each cell shows the decode time before → after in ms, with the relative delta in %.

| compiler | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| GCC 4.4.3 | 799 → 730 (-8.6%) | 1710 → 1559 (-8.8%) | 4390 → 3960 (-9.8%) | 2670 → 2070 (-22.5%) | 1990 → 1589 (-20.2%) | 4250 → 3770 (-11.3%) |
| GCC 4.6.2 | 830 → 720 (-13.3%) | 1700 → 1480 (-12.9%) | 4620 → 3860 (-16.5%) | 2469 → 2070 (-16.2%) | 1930 → 1490 (-22.8%) | 4400 → 3710 (-15.7%) |
| GCC 4.8.0 | 880 → 740 (-15.9%) | 1790 → 1569 (-12.3%) | 4820 → 4040 (-16.2%) | 2630 → 2120 (-19.4%) | 2130 → 1530 (-28.2%) | 4700 → 3880 (-17.4%) |
| GCC 5.2.0 | 860 → 740 (-14.0%) | 1720 → 1569 (-8.8%) | 4630 → 4160 (-10.2%) | 2480 → 2050 (-17.3%) | 1859 → 1569 (-15.6%) | 4520 → 3830 (-15.3%) |
| GCC 5.3.0 | 850 → 730 (-14.1%) | 1730 → 1569 (-9.3%) | 4640 → 4010 (-13.6%) | 2480 → 2040 (-17.7%) | 1850 → 1569 (-15.2%) | 4530 → 3840 (-15.2%) |
| CLang 3.7.0 | 809 → 730 (-9.8%) | 1670 → 1560 (-6.6%) | 4510 → 4090 (-9.3%) | 2770 → 2430 (-12.3%) | 1859 → 1549 (-16.7%) | 4390 → 3920 (-10.7%) |

Same with a 32-bit build (-m32):

| compiler | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| GCC 4.4 | 1100 → 989 (-10.1%) | 2129 → 2069 (-2.8%) | 5570 → 4900 (-12.0%) | 3489 → 2719 (-22.1%) | 2479 → 2149 (-13.3%) | 5429 → 4740 (-12.7%) |
| GCC 4.6.2 | 950 → 840 (-11.6%) | 1940 → 1800 (-7.2%) | 5210 → 4170 (-20.0%) | 2950 → 2420 (-18.0%) | 2280 → 1859 (-18.5%) | 5110 → 4019 (-21.4%) |
| GCC 4.8.0 | 1000 → 809 (-19.1%) | 2060 → 1810 (-12.1%) | 5560 → 4570 (-17.8%) | 2960 → 2380 (-19.6%) | 2500 → 1950 (-22.0%) | 5450 → 4360 (-20.0%) |
| GCC 5.2.0 | 909 → 779 (-14.3%) | 1839 → 1700 (-7.6%) | 4880 → 4230 (-13.3%) | 2680 → 2340 (-12.7%) | 2050 → 1770 (-13.7%) | 4810 → 4090 (-15.0%) |
| GCC 5.3.0 | 909 → 789 (-13.2%) | 1830 → 1710 (-6.6%) | 4860 → 4240 (-12.8%) | 2690 → 2300 (-14.5%) | 2070 → 1760 (-15.0%) | 4800 → 4070 (-15.2%) |
| CLang 3.7.0 | 980 → 850 (-13.3%) | 2009 → 1740 (-13.4%) | 5340 → 4490 (-15.9%) | 3050 → 2710 (-11.1%) | 2160 → 1799 (-16.7%) | 5200 → 4300 (-17.3%) |

This work has been funded by Planet Labs.

c0nk and others added some commits Dec 27, 2015

Move some MQC functions into a header for speed
Allow these hot functions to be inlined. This boosts decode performance by ~10%.
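
A hedged sketch of the pattern only (the struct and names are hypothetical, not the real opj_mqc API): defining a small, hot helper as static inline in a header makes its body visible at every call site, so the compiler can inline it instead of emitting a call per decoded symbol.

```c
/* mq_inl_sketch.h (hypothetical header) */
#ifndef MQ_INL_SKETCH_H
#define MQ_INL_SKETCH_H

typedef struct {
    unsigned int c;   /* code register (toy)     */
    unsigned int a;   /* interval register (toy) */
} mq_sketch_t;

/* Stand-in for a per-symbol decode step: the body is deliberately trivial;
 * what matters is that it lives in the header and can be inlined. */
static inline int mq_decode_bit_sketch(mq_sketch_t *mq)
{
    mq->a -= 1u;
    return (mq->c > mq->a) ? 1 : 0;
}

#endif /* MQ_INL_SKETCH_H */
```
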
opj_t1_updateflags(): tiny optimization
We can avoid using a look-up table with some shift arithmetic.
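
An illustrative example of the trick, not the real opj_t1_updateflags() code (flag names and values are invented): when the two entries of a tiny table differ only by a known shift, the lookup lut[s] with s in {0, 1} can be replaced by shifting a constant by s, trading a memory load for register arithmetic.

```c
#include <stdint.h>

#define FLAG_SGN_NORTH  (1u << 4)
#define FLAG_SGN_SOUTH  (1u << 5)   /* == FLAG_SGN_NORTH << 1 */

/* Table-based version: one memory load per call. */
static uint32_t sign_flag_lut(uint32_t s)
{
    static const uint32_t lut[2] = { FLAG_SGN_NORTH, FLAG_SGN_SOUTH };
    return lut[s];
}

/* Shift-based version: pure register arithmetic, same result. */
static uint32_t sign_flag_shift(uint32_t s)
{
    return FLAG_SGN_NORTH << s;
}
```
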
Improve code generation in opj_t1_dec_clnpass()
Add an opj_t1_dec_clnpass_step_only_if_flag_not_sig_visit() function that
does the job of opj_t1_dec_clnpass_step_only() assuming the conditions are
already met, and use it in opj_t1_dec_clnpass(). The compiler generates
more efficient code.
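
A rough sketch of the pattern with toy names (not the real opj_t1 step functions): the cleanup pass has already established at the call site that the sample is neither significant nor visited, so a dedicated variant skips the re-check and the compiler can emit a straight-line path.

```c
typedef unsigned int flag_t;
#define F_SIG   0x1u   /* sample already significant */
#define F_VISIT 0x2u   /* sample already visited     */

/* Generic variant: re-checks the condition on every call. */
static void step_only(const flag_t *flags, int *data)
{
    if ((*flags & (F_SIG | F_VISIT)) == 0) {
        *data += 1;    /* stand-in for the real decoding work */
    }
}

/* Specialized variant: assumes the caller has already verified
 * (*flags & (F_SIG | F_VISIT)) == 0, so no branch is needed here. */
static void step_only_assume_not_sig_visit(int *data)
{
    *data += 1;
}
```
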
Reduce the number of occurrences of the orient function argument
It is essentially used as a shift into lut_ctxno_zc, which we can
precompute at the beginning of opj_t1_decode_cblk() /
opj_t1_encode_cblk().
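
A hedged sketch of the idea (the table size and layout below are assumptions, not a description of the real lut_ctxno_zc): the orient-dependent sub-table is resolved once per code block and stored as a pointer, so the hot context lookup no longer needs the orient argument or a per-call shift.

```c
#include <stdint.h>

/* Placeholder table: 4 orientations x 256 neighbourhood patterns (assumed). */
static const uint8_t lut_zc_sketch[4 * 256] = { 0 };

typedef struct {
    const uint8_t *lut_zc_orient;   /* per-codeblock, orient-resolved pointer */
} t1_sketch_t;

/* Done once, at the start of decoding/encoding a code block. */
static void t1_sketch_init(t1_sketch_t *t1, uint32_t orient)
{
    t1->lut_zc_orient = lut_zc_sketch + (orient << 8);
}

/* Hot path: no orient argument, no shift, just an indexed load. */
static uint8_t t1_sketch_getctxno_zc(const t1_sketch_t *t1, uint32_t flags)
{
    return t1->lut_zc_orient[flags & 0xFFu];
}
```
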
Tier 1 decoding: add a colflags array
Additional flags array such that colflags[1+0] holds the state of col=0, rows 0..3,
colflags[1+1] that of col=1, rows 0..3, colflags[1+flags_stride] that of col=0, rows 4..7, and so on.
This array avoids cache thrashing when processing 4 vertical samples at a time,
as done in the various decoding steps.
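
A minimal sketch of the colflags idea with an invented bit layout (the real patch packs different per-row flags): one 16-bit entry summarizes 4 vertical samples of one column, so the stripe loops can decide with a single load and compare whether any of the 4 samples needs work, instead of touching 4 scattered entries of the per-sample flags array.

```c
#include <stdint.h>

typedef uint16_t colflag_t;

#define COLFLAG_SIG            0x1u   /* sample significant (toy flag) */
#define COLFLAG_VISIT          0x2u   /* sample visited (toy flag)     */
#define COLFLAG_BITS_PER_ROW   4      /* 4 bits reserved per row       */

/* Mark one of the 4 rows covered by this column/stripe entry. */
static void colflag_set(colflag_t *entry, unsigned row_in_stripe, colflag_t bit)
{
    *entry |= (colflag_t)(bit << (row_in_stripe * COLFLAG_BITS_PER_ROW));
}

/* Fast check used by the passes: are all 4 rows still untouched? */
static int colflag_stripe_is_clear(colflag_t entry)
{
    return entry == 0;
}
```
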
}
} /* VSC and BYPASS by Antonin */
static void opj_t1_dec_sigpass_mqc(
#define opj_t1_dec_sigpass_mqc_internal(t1, bpno, w, h, flags_stride) \

mayeut (Collaborator) commented Jul 12, 2016

Hi, is it possible to use an inline function here so that debugging is made easier, like what was done in ae1da37?

Timings will have to be checked after that, of course.

}
} /* VSC and BYPASS by Antonin */
static void opj_t1_dec_refpass_mqc(
#define opj_t1_dec_refpass_mqc_internal(t1, bpno, w, h, flags_stride) \

mayeut (Collaborator) commented Jul 12, 2016

Hi, is it possible to use an inline function here so that debugging is made easier, like what was done in ae1da37?

Timings will have to be checked after that, of course.

static void opj_t1_dec_clnpass(
#define MACRO_t1_flags_internal(x,y,flags_stride) t1->flags[((x)*(flags_stride))+(y)]
#define opj_t1_dec_clnpass_internal(consistency_check, t1, bpno, cblksty, w, h, flags_stride) \

mayeut (Collaborator) commented Jul 12, 2016

Hi, is it possible to use an inline function here so that debugging is made easier, like what was done in ae1da37?

Timings will have to be checked after that, of course.

rouault (Collaborator) commented Jul 12, 2016

Possibly, but the resulting assembly code should be checked to verify that the compiler really inlines when w and h are set to 64.
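
A hedged sketch of the alternative being discussed (hypothetical names, not the ae1da37 code): the shared body becomes a static inline function taking w and h, and the 64x64 entry point calls it with literal constants. As noted above, the generated assembly still needs to be inspected to confirm that the constants actually propagate and the loops get specialized as well as with the macro.

```c
#include <stdint.h>

static inline void dec_refpass_internal(int32_t *data, uint32_t w, uint32_t h)
{
    uint32_t i, k;
    for (k = 0; k < h; k += 4)          /* stand-in for the real pass body */
        for (i = 0; i < w; i++)
            data[k * w + i] ^= 1;
}

/* Run-time sized entry point. */
static void dec_refpass_generic(int32_t *data, uint32_t w, uint32_t h)
{
    dec_refpass_internal(data, w, h);
}

/* 64x64 entry point: the constants reach the loop bounds only if the
 * compiler really inlines dec_refpass_internal() here. */
static void dec_refpass_64x64(int32_t *data)
{
    dec_refpass_internal(data, 64, 64);
}
```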

@@ -1369,12 +1649,14 @@ static OPJ_BOOL opj_t1_decode_cblk(opj_t1_t *t1,
{
opj_raw_t *raw = t1->raw; /* RAW component */
opj_mqc_t *mqc = t1->mqc; /* MQC component */

stweil (Contributor) commented Jul 12, 2016

Adding such whitespace should be avoided.

detonin merged commit 7092f7e into uclouvain:master on Sep 13, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

detonin added the enhancement label on Aug 3, 2017
