lmcs for 16bpc handling widths 8 through 128 #1

stone-d-chen · 2024-03-16T21:02:43Z

No description provided.

…mizations

Changes VVCLMCSDSPContext to hold an array of six function pointers, void (*filter[6]), one per CTU width (4, 8, 16, 32, 64, 128), so that each width can take a different code path. All six entries are initialized with the original scalar function FUNC(lmcs_filter_luma), and existing filter() calls become filter[0]().

The primary target is AVX2. For 16bpc data, widths from 16 to 128 pixels can be handled with a single code path, but widths 4 and 8 will likely need separate code paths. A later expansion to AVX512 could require separate code paths as well.

This commit contains no change in functionality.
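As a rough illustration of the table described above, here is a minimal sketch in plain C. The names SketchLMCSDSPContext, sketch_lmcs_filter_luma_16 and sketch_lmcs_dsp_init are hypothetical stand-ins, not the identifiers in the patch, and the scalar body assumes the filter is a per-pixel lookup into a 16-bit LUT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical filter signature, modelled on a typical DSP callback. */
typedef void (*sketch_lmcs_fn)(uint8_t *dst, ptrdiff_t dst_stride,
                               int width, int height, const void *lut);

/* Hypothetical stand-in for VVCLMCSDSPContext:
 * index 0..5 corresponds to widths 4, 8, 16, 32, 64, 128. */
typedef struct SketchLMCSDSPContext {
    sketch_lmcs_fn filter[6];
} SketchLMCSDSPContext;

/* Scalar filter for 16-bit samples (assumed behaviour):
 * every pixel is replaced by its LUT entry. */
static void sketch_lmcs_filter_luma_16(uint8_t *_dst, ptrdiff_t dst_stride,
                                       int width, int height, const void *_lut)
{
    const uint16_t *lut = _lut;
    uint16_t *dst = (uint16_t *)_dst;

    dst_stride /= sizeof(uint16_t);   /* stride arrives in bytes */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = lut[dst[x]];
        dst += dst_stride;
    }
}

/* Every slot starts out pointing at the scalar version, so behaviour
 * is unchanged until a slot is overridden by an optimized routine. */
static void sketch_lmcs_dsp_init(SketchLMCSDSPContext *c)
{
    assert(c != NULL);
    for (int i = 0; i < 6; i++)
        c->filter[i] = sketch_lmcs_filter_luma_16;
}
```

With this layout, calling c.filter[0](...) behaves exactly like the old single filter() pointer, which is consistent with the commit claiming no change in functionality.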

Creates two functions in x86 asm to perform LMCS: lmcs_16bpc (for widths 16 to 128) and lmcs_16bpc_8pix (for width 8). The key instruction is vpgatherdd, which performs a gather based on indices. Since vpgatherdd only operates on 32-bit elements, the major steps are to unpack the 16-bit data into 32 bits (via punpcklwd with a vector of zeros), perform the gather, shuffle off the garbage data, and repack to 16-bit ints. Widths 16 through 128 use YMM registers (16 * 16 = 256 bits) and width 8 uses XMM registers (8 * 16 = 128 bits). The current plan for width 4 is to fall back to the scalar version. This rough draft is functional but will require cleanup and further optimization.

Fix loop counter for 8pixel 16bit lmcs

The loop counter was the same as in the 16pixel+ version, meaning it shifted the width right by mmsize/8. Instead we only need 1 loop iteration. Technically we should eliminate the looping completely.
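To make the unpack, gather, and repack steps easier to follow, here is a scalar model in C of what a single lane might go through. This is a sketch under assumptions, not the asm from the patch: it assumes the 32-bit gather reads from a 16-bit LUT with a byte scale of 2 on a little-endian machine, so each lane picks up the wanted entry in its low half and the neighbouring entry as garbage in its high half. The names sketch_gather_lane and sketch_lmcs_row are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model one dword gather lane: load 4 bytes starting at byte offset
 * idx * 2 inside a table of 16-bit entries. The table needs one extra
 * padding entry so the read for the last index stays in bounds. */
static uint32_t sketch_gather_lane(const uint16_t *lut, uint32_t idx)
{
    uint32_t v;
    memcpy(&v, (const uint8_t *)lut + (size_t)idx * 2, sizeof(v));
    return v;
}

/* Apply the LUT to one row of 16-bit pixels, one lane at a time. */
static void sketch_lmcs_row(uint16_t *dst, int width, const uint16_t *lut)
{
    assert(width > 0);
    for (int x = 0; x < width; x++) {
        uint32_t idx = dst[x];                        /* unpack: zero-extend 16 -> 32 bits */
        uint32_t raw = sketch_gather_lane(lut, idx);  /* gather: 32 bits per index */
        dst[x] = (uint16_t)(raw & 0xFFFF);            /* drop the garbage half, repack to 16 bits */
    }
}
```

A vector version would do the same for 8 lanes per 256-bit gather, which is why 16 pixels of 16-bit input need two gathers before being packed back into one YMM register.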

Prepares the appropriate files to create checkasm_check_vvc_lmcs. No functionality but successfully compiles.

…ideos; run function based on CTU width

Modifies ff_vvc_dsp_init_x86 to load ff_lmcs_16bpc_avx2 and ff_lmcs_16bpc_8pix_avx2 for 10-bit videos, and modifies the call sites so that the appropriate function is selected based on the CTU width.

For some reason this shows a performance regression versus another local copy on Tango2_3840x2160_60_10_420_27_LD.266: 3.0% vs 3.6%.
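The width-based selection could look roughly like the following sketch. The helper names, the SketchKernel enum, and the have_avx2 flag are all hypothetical; the real code stores function pointers and checks CPU flags through FFmpeg's own mechanisms. The index mapping assumes the six-entry layout (4, 8, 16, 32, 64, 128) from the earlier commit.

```c
#include <assert.h>

/* Map a power-of-two width in 4..128 to a table index 0..5. */
static int sketch_width_to_index(int width)
{
    int idx = 0;

    assert(width >= 4 && width <= 128 && (width & (width - 1)) == 0);
    while ((4 << idx) < width)
        idx++;
    return idx;
}

/* Which code path a given width would take (illustrative only). */
typedef enum SketchKernel {
    SKETCH_SCALAR,       /* original C function */
    SKETCH_AVX2_8PIX,    /* stands in for ff_lmcs_16bpc_8pix_avx2 */
    SKETCH_AVX2_16PLUS   /* stands in for ff_lmcs_16bpc_avx2 */
} SketchKernel;

static SketchKernel sketch_pick_kernel(int width, int bit_depth, int have_avx2)
{
    /* Only 10-bit content gets the AVX2 routines; width 4 stays scalar. */
    if (!have_avx2 || bit_depth != 10 || width < 8)
        return SKETCH_SCALAR;
    return width == 8 ? SKETCH_AVX2_8PIX : SKETCH_AVX2_16PLUS;
}
```

A call site would then index the table with sketch_width_to_index(width) instead of always using entry 0.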

Adds two checkasm tests for LMCS. check_vvc_lmcs_16bpc checks widths 16, 32, 64, and 128, and check_vvc_lmcs_16bpc_8pixels checks the 8 pixel width.
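The idea behind such a test is to run the reference and the optimized function on identical random input and compare the results. The sketch below shows that idea as standalone C; it does not use the real checkasm macros, and every name in it (sketch_check, sketch_ref, sketch_broken, sketch_rand) is hypothetical. It assumes 10-bit samples and a stride counted in pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_W    128
#define SKETCH_MAX_H    8
#define SKETCH_LUT_SIZE 1024 /* 10-bit samples, assumed */

typedef void (*sketch_filter_fn)(uint16_t *dst, ptrdiff_t stride,
                                 int width, int height, const uint16_t *lut);

/* Reference: plain per-pixel LUT lookup. */
static void sketch_ref(uint16_t *dst, ptrdiff_t stride,
                       int width, int height, const uint16_t *lut)
{
    for (int y = 0; y < height; y++, dst += stride)
        for (int x = 0; x < width; x++)
            dst[x] = lut[dst[x]];
}

/* Deliberately wrong candidate: flips the low bit of every output. */
static void sketch_broken(uint16_t *dst, ptrdiff_t stride,
                          int width, int height, const uint16_t *lut)
{
    for (int y = 0; y < height; y++, dst += stride)
        for (int x = 0; x < width; x++)
            dst[x] = lut[dst[x]] ^ 1;
}

/* Small LCG so the input is reproducible without any library state. */
static uint32_t sketch_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Returns 1 if both functions produce identical buffers, else 0.
 * Comparing the whole buffer also catches writes outside `width`. */
static int sketch_check(sketch_filter_fn ref, sketch_filter_fn opt,
                        int width, int height)
{
    static uint16_t lut[SKETCH_LUT_SIZE];
    static uint16_t buf_ref[SKETCH_MAX_W * SKETCH_MAX_H];
    static uint16_t buf_opt[SKETCH_MAX_W * SKETCH_MAX_H];
    uint32_t seed = 1;

    assert(width > 0 && width <= SKETCH_MAX_W);
    assert(height > 0 && height <= SKETCH_MAX_H);

    for (int i = 0; i < SKETCH_LUT_SIZE; i++)
        lut[i] = sketch_rand(&seed) & (SKETCH_LUT_SIZE - 1);
    for (int i = 0; i < SKETCH_MAX_W * SKETCH_MAX_H; i++)
        buf_ref[i] = sketch_rand(&seed) & (SKETCH_LUT_SIZE - 1);
    memcpy(buf_opt, buf_ref, sizeof(buf_ref));

    ref(buf_ref, SKETCH_MAX_W, width, height, lut);
    opt(buf_opt, SKETCH_MAX_W, width, height, lut);
    return memcmp(buf_ref, buf_opt, sizeof(buf_ref)) == 0;
}
```

Looping sketch_check over widths 16, 32, 64, and 128 for one function and calling it once with width 8 for the other mirrors the split between the two tests.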

stone-d-chen added 5 commits March 16, 2024 10:48

prepare checkasm_check_vvc_lmcs, no functionality

2987b81


Introduce check_vvc_lmcs_16bpc_8pixels & check_vvc_lmcs_16bpc

1df6946


stone-d-chen mentioned this pull request Mar 16, 2024

Add AVX2 assembly code for LMCS filter ffvvc/FFmpeg#46

Open
