Tier1 decoder speed optimizations #783

Merged
merged 12 commits into uclouvain:master on Sep 13, 2016

Conversation

rouault (Collaborator) commented May 23, 2016

This patch series improves T1 decoding speed, resulting in overall decompression time gains of typically 10-15% for operational products.

Various tricks used:

  • more aggressive inlining (reusing #675)
  • specialization of the decoding of 64x64 code blocks, which are common in a lot of products (see the sketch after this list)
  • addition of an auxiliary colflags array, in which each 16-bit item stores the overall state of 4 values in a column and thus enables quick checks in a cache-friendly way
  • loop unrolling for the VSC steps (similar to the non-VSC case)
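
A minimal sketch of the 64x64 specialization idea, assuming nothing about the real pass bodies (the loop body and names below are stand-ins, not the actual code): the inner pass is written once as a macro parameterized on the dimensions, then expanded a second time with the literal constants 64, 64 so the compiler sees fixed trip counts and can unroll or vectorize the stripe loops.

```c
#include <stdint.h>

/* Shared body, parameterized on width/height; the "+= 1" is only a
 * placeholder for the real per-sample decoding work. */
#define DEC_PASS_INTERNAL(data, w, h)                                   \
    do {                                                                \
        uint32_t i, j, k;                                               \
        for (k = 0; k < (h); k += 4)            /* stripes of 4 rows */ \
            for (i = 0; i < (w); i++)                                   \
                for (j = 0; j < 4 && k + j < (h); j++)                  \
                    (data)[(k + j) * (w) + i] += 1;                     \
    } while (0)

/* Generic variant: dimensions only known at run time. */
static void dec_pass_generic(int32_t *data, uint32_t w, uint32_t h)
{
    DEC_PASS_INTERNAL(data, w, h);
}

/* Specialized variant: the 64x64 constants are visible to the optimizer. */
static void dec_pass_64x64(int32_t *data)
{
    DEC_PASS_INTERNAL(data, 64, 64);
}
```

The caller would then dispatch to the 64x64 variant when the code block is exactly 64x64 and fall back to the generic one otherwise.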

Benchmarking:

This has been tested with the following files:
C1: issue135.j2k (from openjpeg-data, code blocks 32x32)
C2: Bretagne2.j2k (from openjpeg-data, code blocks 32x32)
C3: 20160307_125117_0c74.jp2 (non-public test file, 3 bands, 12 bits, 6600x2200 for band 1, 3300x2200 for bands 2 and 3, code blocks 64x64)
C4: issue135_vsc.jp2 (issue135.j2k recoded by opj_compress -M 8, code blocks 64x64)
C5: issue135_raw.jp2 (issue135.j2k recoded by opj_compress -M 1, code blocks 64x64)
C6: S2A_OPER_MSI_L1C_TL_MTI__20150819T171650_A000763_T30SWE_B05.jp2 (Sentinel 2 tile, 5490x5490, 1 band, 12 bits, code blocks 64x64)

Builds were done with -DCMAKE_BUILD_TYPE=Release. Times measured are the smallest of 2 consecutive runs, as reported by "opj_decompress -i $(INPUT_FILE) -o /tmp/out.ppm" in its "decode time: XXX ms" line.

Machine & OS spec: Intel(R) Core(TM) i5 CPU 750 @ 2.67GHz, 64-bit Linux

Each cell shows the decode time before → after in ms, with the relative delta in %.

| compiler | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| GCC 4.4.3 | 799 → 730 (-8.6%) | 1710 → 1559 (-8.8%) | 4390 → 3960 (-9.8%) | 2670 → 2070 (-22.5%) | 1990 → 1589 (-20.2%) | 4250 → 3770 (-11.3%) |
| GCC 4.6.2 | 830 → 720 (-13.3%) | 1700 → 1480 (-12.9%) | 4620 → 3860 (-16.5%) | 2469 → 2070 (-16.2%) | 1930 → 1490 (-22.8%) | 4400 → 3710 (-15.7%) |
| GCC 4.8.0 | 880 → 740 (-15.9%) | 1790 → 1569 (-12.3%) | 4820 → 4040 (-16.2%) | 2630 → 2120 (-19.4%) | 2130 → 1530 (-28.2%) | 4700 → 3880 (-17.4%) |
| GCC 5.2.0 | 860 → 740 (-14.0%) | 1720 → 1569 (-8.8%) | 4630 → 4160 (-10.2%) | 2480 → 2050 (-17.3%) | 1859 → 1569 (-15.6%) | 4520 → 3830 (-15.3%) |
| GCC 5.3.0 | 850 → 730 (-14.1%) | 1730 → 1569 (-9.3%) | 4640 → 4010 (-13.6%) | 2480 → 2040 (-17.7%) | 1850 → 1569 (-15.2%) | 4530 → 3840 (-15.2%) |
| CLang 3.7.0 | 809 → 730 (-9.8%) | 1670 → 1560 (-6.6%) | 4510 → 4090 (-9.3%) | 2770 → 2430 (-12.3%) | 1859 → 1549 (-16.7%) | 4390 → 3920 (-10.7%) |

Same with a 32-bit build (-m32):

| compiler | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| GCC 4.4 | 1100 → 989 (-10.1%) | 2129 → 2069 (-2.8%) | 5570 → 4900 (-12.0%) | 3489 → 2719 (-22.1%) | 2479 → 2149 (-13.3%) | 5429 → 4740 (-12.7%) |
| GCC 4.6.2 | 950 → 840 (-11.6%) | 1940 → 1800 (-7.2%) | 5210 → 4170 (-20.0%) | 2950 → 2420 (-18.0%) | 2280 → 1859 (-18.5%) | 5110 → 4019 (-21.4%) |
| GCC 4.8.0 | 1000 → 809 (-19.1%) | 2060 → 1810 (-12.1%) | 5560 → 4570 (-17.8%) | 2960 → 2380 (-19.6%) | 2500 → 1950 (-22.0%) | 5450 → 4360 (-20.0%) |
| GCC 5.2.0 | 909 → 779 (-14.3%) | 1839 → 1700 (-7.6%) | 4880 → 4230 (-13.3%) | 2680 → 2340 (-12.7%) | 2050 → 1770 (-13.7%) | 4810 → 4090 (-15.0%) |
| GCC 5.3.0 | 909 → 789 (-13.2%) | 1830 → 1710 (-6.6%) | 4860 → 4240 (-12.8%) | 2690 → 2300 (-14.5%) | 2070 → 1760 (-15.0%) | 4800 → 4070 (-15.2%) |
| CLang 3.7.0 | 980 → 850 (-13.3%) | 2009 → 1740 (-13.4%) | 5340 → 4490 (-15.9%) | 3050 → 2710 (-11.1%) | 2160 → 1799 (-16.7%) | 5200 → 4300 (-17.3%) |

This work has been funded by Planet Labs.

c0nk and others added some commits Dec 27, 2015

Move some MQC functions into a header for speed
Allow these hot functions to be inlined. This boosts decode performance by ~10%.
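
A hedged sketch of the pattern only (the struct and names are hypothetical, not the real opj_mqc API): defining a small, hot helper as static inline in a header makes its body visible at every call site, so the compiler can inline it instead of emitting a call per decoded symbol.

```c
/* mq_inl_sketch.h (hypothetical header) */
#ifndef MQ_INL_SKETCH_H
#define MQ_INL_SKETCH_H

typedef struct {
    unsigned int c;   /* code register (toy)     */
    unsigned int a;   /* interval register (toy) */
} mq_sketch_t;

/* Stand-in for a per-symbol decode step: the body is deliberately trivial;
 * what matters is that it lives in the header and can be inlined. */
static inline int mq_decode_bit_sketch(mq_sketch_t *mq)
{
    mq->a -= 1u;
    return (mq->c > mq->a) ? 1 : 0;
}

#endif /* MQ_INL_SKETCH_H */
```
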
opj_t1_updateflags(): tiny optimization
We can avoid using a look-up table with some shift arithmetic.
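
An illustrative example of the trick, not the real opj_t1_updateflags() code (flag names and values are invented): when the two entries of a tiny table differ only by a known shift, the lookup lut[s] with s in {0, 1} can be replaced by shifting a constant by s, trading a memory load for register arithmetic.

```c
#include <stdint.h>

#define FLAG_SGN_NORTH  (1u << 4)
#define FLAG_SGN_SOUTH  (1u << 5)   /* == FLAG_SGN_NORTH << 1 */

/* Table-based version: one memory load per call. */
static uint32_t sign_flag_lut(uint32_t s)
{
    static const uint32_t lut[2] = { FLAG_SGN_NORTH, FLAG_SGN_SOUTH };
    return lut[s];
}

/* Shift-based version: pure register arithmetic, same result. */
static uint32_t sign_flag_shift(uint32_t s)
{
    return FLAG_SGN_NORTH << s;
}
```
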
Improve code generation in opj_t1_dec_clnpass()
Add an opj_t1_dec_clnpass_step_only_if_flag_not_sig_visit() function that
does the job of opj_t1_dec_clnpass_step_only() assuming the conditions are
already met, and use it in opj_t1_dec_clnpass(). The compiler generates
more efficient code.
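
A rough sketch of the pattern with toy names (not the real opj_t1 step functions): the cleanup pass has already established at the call site that the sample is neither significant nor visited, so a dedicated variant skips the re-check and the compiler can emit a straight-line path.

```c
typedef unsigned int flag_t;
#define F_SIG   0x1u   /* sample already significant */
#define F_VISIT 0x2u   /* sample already visited     */

/* Generic variant: re-checks the condition on every call. */
static void step_only(const flag_t *flags, int *data)
{
    if ((*flags & (F_SIG | F_VISIT)) == 0) {
        *data += 1;    /* stand-in for the real decoding work */
    }
}

/* Specialized variant: assumes the caller has already verified
 * (*flags & (F_SIG | F_VISIT)) == 0, so no branch is needed here. */
static void step_only_assume_not_sig_visit(int *data)
{
    *data += 1;
}
```
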
Reduce the number of occurrences of the orient function argument
It is essentially used as a shift into lut_ctxno_zc, which we can
precompute at the beginning of opj_t1_decode_cblk() /
opj_t1_encode_cblk().
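
A hedged sketch of the idea (the table size and layout below are assumptions, not a description of the real lut_ctxno_zc): the orient-dependent sub-table is resolved once per code block and stored as a pointer, so the hot context lookup no longer needs the orient argument or a per-call shift.

```c
#include <stdint.h>

/* Placeholder table: 4 orientations x 256 neighbourhood patterns (assumed). */
static const uint8_t lut_zc_sketch[4 * 256] = { 0 };

typedef struct {
    const uint8_t *lut_zc_orient;   /* per-codeblock, orient-resolved pointer */
} t1_sketch_t;

/* Done once, at the start of decoding/encoding a code block. */
static void t1_sketch_init(t1_sketch_t *t1, uint32_t orient)
{
    t1->lut_zc_orient = lut_zc_sketch + (orient << 8);
}

/* Hot path: no orient argument, no shift, just an indexed load. */
static uint8_t t1_sketch_getctxno_zc(const t1_sketch_t *t1, uint32_t flags)
{
    return t1->lut_zc_orient[flags & 0xFFu];
}
```
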
Tier 1 decoding: add a colflags array
Additional flags array such that colflags[1+0] holds the state of col=0, rows 0..3,
colflags[1+1] that of col=1, rows 0..3, colflags[1+flags_stride] that of col=0, rows 4..7, and so on.
This array avoids cache thrashing when processing 4 vertical samples at a time,
as done in the various decoding steps.
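
A minimal sketch of the colflags idea with an invented bit layout (the real patch packs different per-row flags): one 16-bit entry summarizes 4 vertical samples of one column, so the stripe loops can decide with a single load and compare whether any of the 4 samples needs work, instead of touching 4 scattered entries of the per-sample flags array.

```c
#include <stdint.h>

typedef uint16_t colflag_t;

#define COLFLAG_SIG            0x1u   /* sample significant (toy flag) */
#define COLFLAG_VISIT          0x2u   /* sample visited (toy flag)     */
#define COLFLAG_BITS_PER_ROW   4      /* 4 bits reserved per row       */

/* Mark one of the 4 rows covered by this column/stripe entry. */
static void colflag_set(colflag_t *entry, unsigned row_in_stripe, colflag_t bit)
{
    *entry |= (colflag_t)(bit << (row_in_stripe * COLFLAG_BITS_PER_ROW));
}

/* Fast check used by the passes: are all 4 rows still untouched? */
static int colflag_stripe_is_clear(colflag_t entry)
{
    return entry == 0;
}
```
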
}
} /* VSC and BYPASS by Antonin */
static void opj_t1_dec_sigpass_mqc(
#define opj_t1_dec_sigpass_mqc_internal(t1, bpno, w, h, flags_stride) \

mayeut (Collaborator) commented Jul 12, 2016

Hi, is it possible to use an inline function here so that debugging is made easier, like what was done in ae1da37?

Timings will have to be checked after that, of course.

}
} /* VSC and BYPASS by Antonin */
static void opj_t1_dec_refpass_mqc(
#define opj_t1_dec_refpass_mqc_internal(t1, bpno, w, h, flags_stride) \

mayeut (Collaborator) commented Jul 12, 2016

Hi, is it possible to use an inline function here so that debugging is made easier, like what was done in ae1da37?

Timings will have to be checked after that, of course.

static void opj_t1_dec_clnpass(
#define MACRO_t1_flags_internal(x,y,flags_stride) t1->flags[((x)*(flags_stride))+(y)]
#define opj_t1_dec_clnpass_internal(consistency_check, t1, bpno, cblksty, w, h, flags_stride) \

mayeut (Collaborator) commented Jul 12, 2016

Hi, is it possible to use an inline function here so that debugging is made easier, like what was done in ae1da37?

Timings will have to be checked after that, of course.

rouault (Collaborator) commented Jul 12, 2016

Possibly, but the resulting assembly code should be checked to verify that the compiler really inlines when w and h are set to 64.
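
A hedged sketch of the alternative being discussed (hypothetical names, not the ae1da37 code): the shared body becomes a static inline function taking w and h, and the 64x64 entry point calls it with literal constants. As noted above, the generated assembly still needs to be inspected to confirm that the constants actually propagate and the loops get specialized as well as with the macro.

```c
#include <stdint.h>

static inline void dec_refpass_internal(int32_t *data, uint32_t w, uint32_t h)
{
    uint32_t i, k;
    for (k = 0; k < h; k += 4)          /* stand-in for the real pass body */
        for (i = 0; i < w; i++)
            data[k * w + i] ^= 1;
}

/* Run-time sized entry point. */
static void dec_refpass_generic(int32_t *data, uint32_t w, uint32_t h)
{
    dec_refpass_internal(data, w, h);
}

/* 64x64 entry point: the constants reach the loop bounds only if the
 * compiler really inlines dec_refpass_internal() here. */
static void dec_refpass_64x64(int32_t *data)
{
    dec_refpass_internal(data, 64, 64);
}
```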

@@ -1369,12 +1649,14 @@ static OPJ_BOOL opj_t1_decode_cblk(opj_t1_t *t1,
{
opj_raw_t *raw = t1->raw; /* RAW component */
opj_mqc_t *mqc = t1->mqc; /* MQC component */

stweil (Contributor) commented Jul 12, 2016

Adding such whitespace should be avoided.

detonin merged commit 7092f7e into uclouvain:master on Sep 13, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

detonin added the enhancement label on Aug 3, 2017
