lmcs for 16bpc handling widths 8 through 128 #1

Open
wants to merge 5 commits into
base: main
from

Conversation

stone-d-chen
Owner

No description provided.

…mizations

Changes VVCLMCSDSPContext to contain an array of six function pointers, void (*filter[6]), one for each CTU width (4, 8, 16, 32, 64, 128), so that each width can take a different code path. All six are initialized with the original scalar function FUNC(lmcs_filter_luma). Existing filter() calls are changed to filter[0]().

The primary target is AVX2. For 16bpc data, widths of 16 to 128 pixels can be handled with a single code path, but widths 4 and 8 will likely need separate code paths. A later expansion to AVX512 could require separate code paths as well.

This commit contains no change in functionality.
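
As an illustration of the per-width table described above, here is a minimal C sketch. The type names, the function signature, and the width-to-slot mapping (log2(width) - 2) are assumptions made for this example and are not taken from the patch; stride is counted in 16-bit elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LMCS_NB_WIDTHS 6 /* widths 4, 8, 16, 32, 64, 128 */

/* Hypothetical signature for illustration, not FFmpeg's actual prototype. */
typedef void (*lmcs_filter_fn)(uint16_t *dst, ptrdiff_t stride,
                               int width, int height, const uint16_t *lut);

/* Hypothetical stand-in for the DSP context: one slot per width. */
typedef struct LMCSDSPSketch {
    lmcs_filter_fn filter[LMCS_NB_WIDTHS];
} LMCSDSPSketch;

/* Scalar reference: replace every sample by its LUT entry. */
static void lmcs_filter_luma_scalar(uint16_t *dst, ptrdiff_t stride,
                                    int width, int height, const uint16_t *lut)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = lut[dst[x]];
        dst += stride;
    }
}

/* Map a width in {4, 8, ..., 128} to a slot 0..5 (assumed mapping). */
static int lmcs_width_index(int width)
{
    int idx = 0;
    assert(width >= 4 && width <= 128 && (width & (width - 1)) == 0);
    while ((4 << idx) < width)
        idx++;
    return idx;
}

/* Every slot starts out pointing at the scalar function. */
static void lmcs_dsp_init_sketch(LMCSDSPSketch *c)
{
    for (int i = 0; i < LMCS_NB_WIDTHS; i++)
        c->filter[i] = lmcs_filter_luma_scalar;
}
```

With this layout, a caller would pick a slot with something like c.filter[lmcs_width_index(width)](...), and an optimized version can later replace individual slots without touching the call sites.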
Creates two functions in x86 asm to perform LMCS: lmcs_16bpc (for widths 16 to 128) and lmcs_16bpc_8pix (for width 8).

The key instruction is vpgatherdd, which performs a gather based on indices. Since vpgatherdd only operates on 32-bit elements, the major steps are to unpack the 16-bit data into 32 bits (via punpcklwd with a vector of zeros), perform the gather, shuffle off the garbage data, and repack to 16-bit ints.

Widths 16 through 128 use YMM registers (16 pixels * 16 bits = 256 bits) and width 8 uses XMM registers (8 pixels * 16 bits = 128 bits). The current plan for width 4 is to fall back to the scalar version.

Rough draft is functional but will require cleanup and further optimizations.
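
To make the unpack, gather, discard, repack sequence easier to follow, here is a portable scalar model in C. It is not the asm from this commit and uses no intrinsics; it assumes the gather reads 32 bits at byte offset index * 2 from a table of 16-bit entries on a little-endian target, which is why the upper half of each lane holds the neighbouring entry. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8 /* one 256-bit register holds eight 32-bit lanes */

/* Model of a dword gather from a 16-bit table: the low half of the
 * result is lut[idx], the high half is the next entry (garbage). */
static uint32_t gather_dword(const uint16_t *lut, uint32_t idx)
{
    return (uint32_t)lut[idx] | ((uint32_t)lut[idx + 1] << 16);
}

/* Remap LANES pixels in place. Because of the 32-bit load, lut needs
 * one readable entry of padding past the largest index used. */
static void lmcs_remap_lanes(uint16_t *px, const uint16_t *lut)
{
    uint32_t wide[LANES];

    assert(px && lut);
    for (int i = 0; i < LANES; i++)   /* 1. widen 16 -> 32 bits with zeros */
        wide[i] = px[i];
    for (int i = 0; i < LANES; i++)   /* 2. gather one dword per lane */
        wide[i] = gather_dword(lut, wide[i]);
    for (int i = 0; i < LANES; i++)   /* 3. drop the high half and narrow */
        px[i] = (uint16_t)(wide[i] & 0xFFFF);
}
```

In this model a 16-pixel YMM load would correspond to two calls of lmcs_remap_lanes, one for each half after widening.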

newline

Fix loop counter for 8pixel 16bit lmcs

The loop counter was the same as in the 16-pixel+ version, meaning it shifted the width right by mmsize/8.

Instead we only need one loop iteration. Technically we should eliminate the looping completely.
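
The iteration count can be checked with simple arithmetic. The helper below is a hypothetical illustration, not the counter logic from the asm: it assumes 2 bytes per 16bpc pixel and a width that is a multiple of the pixels held by one vector register.

```c
#include <assert.h>

/* Number of vector iterations needed to cover one row of 16-bit pixels. */
static int lmcs_row_iterations(int width, int vector_bytes)
{
    int pixels_per_vector = vector_bytes / 2; /* 2 bytes per 16bpc pixel */

    assert(width > 0 && pixels_per_vector > 0);
    assert(width % pixels_per_vector == 0);
    return width / pixels_per_vector;
}
```

For a 32-byte YMM register this gives one iteration at width 16 and eight at width 128; for a 16-byte XMM register at width 8 it gives exactly one, which is why the row loop is redundant in the 8-pixel function.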
Prepares the appropriate files to create checkasm_check_vvc_lmcs.

Adds no functionality yet, but compiles successfully.
…ideos; run function based on CTU width

Modifies ff_vvc_dsp_init_x86 to load ff_lmcs_16bpc_avx2 and ff_lmcs_16bpc_8pix_avx2 for 10bit videos.

Modifies call locations so that the appropriate function is selected based on the CTU width.

For some reason I see a performance regression compared with another local copy: 3.0% vs 3.6% on Tango2_3840x2160_60_10_420_27_LD.266.
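
The selection logic described in this commit can be sketched in C as follows. Everything here is a stand-in: the struct, the signature, the two stub functions that take the place of ff_lmcs_16bpc_avx2 and ff_lmcs_16bpc_8pix_avx2, and the have_avx2 flag are assumptions for illustration, and the slot order (4, 8, 16, 32, 64, 128) is assumed as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*lmcs_filter_fn)(uint16_t *dst, ptrdiff_t stride,
                               int width, int height, const uint16_t *lut);

typedef struct { lmcs_filter_fn filter[6]; } LMCSDSPSketch;

static void lmcs_scalar(uint16_t *dst, ptrdiff_t stride,
                        int width, int height, const uint16_t *lut)
{
    for (int y = 0; y < height; y++, dst += stride)
        for (int x = 0; x < width; x++)
            dst[x] = lut[dst[x]];
}

/* Stubs standing in for the two asm entry points. */
static void lmcs_16bpc_stub(uint16_t *dst, ptrdiff_t stride,
                            int width, int height, const uint16_t *lut)
{
    lmcs_scalar(dst, stride, width, height, lut);
}

static void lmcs_16bpc_8pix_stub(uint16_t *dst, ptrdiff_t stride,
                                 int width, int height, const uint16_t *lut)
{
    lmcs_scalar(dst, stride, width, height, lut);
}

static void lmcs_init_x86_sketch(LMCSDSPSketch *c, int bit_depth, int have_avx2)
{
    assert(c);
    for (int i = 0; i < 6; i++)
        c->filter[i] = lmcs_scalar;          /* default: scalar everywhere */
    if (bit_depth == 10 && have_avx2) {
        c->filter[1] = lmcs_16bpc_8pix_stub; /* width 8 */
        for (int i = 2; i < 6; i++)
            c->filter[i] = lmcs_16bpc_stub;  /* widths 16 to 128 */
    }                                        /* width 4 stays scalar */
}
```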
Adds two checkasm tests for lmcs.

check_vvc_lmcs_16bpc checks widths 16, 32, 64, and 128; check_vvc_lmcs_16bpc_8pixels covers the 8-pixel width.
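
The idea behind these tests is a differential check: run the reference and the new function on identical pseudo-random input and compare the output buffers. The sketch below shows that idea in plain C without the checkasm framework. All names are hypothetical, lmcs_new is only a scalar stand-in for the asm under test, and the 1024-entry table assumes 10-bit samples.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LUT_SIZE 1024 /* assumed: 10-bit samples */
#define MAX_W    128

/* Reference: plain table lookup. */
static void lmcs_ref(uint16_t *px, int width, const uint16_t *lut)
{
    for (int x = 0; x < width; x++)
        px[x] = lut[px[x]];
}

/* Stand-in for the function under test: 32-bit load, keep the low half. */
static void lmcs_new(uint16_t *px, int width, const uint16_t *lut)
{
    for (int x = 0; x < width; x++) {
        uint32_t v = (uint32_t)lut[px[x]] | ((uint32_t)lut[px[x] + 1] << 16);
        px[x] = (uint16_t)v;
    }
}

/* Small deterministic generator so runs are reproducible. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Returns 1 when both versions agree for this width, 0 otherwise. */
static int check_width(int width, uint32_t seed)
{
    uint16_t lut[LUT_SIZE + 1], a[MAX_W], b[MAX_W];

    assert(width > 0 && width <= MAX_W);
    for (int i = 0; i <= LUT_SIZE; i++)
        lut[i] = (uint16_t)(next_rand(&seed) % LUT_SIZE);
    for (int x = 0; x < width; x++)
        a[x] = b[x] = (uint16_t)(next_rand(&seed) % LUT_SIZE);
    lmcs_ref(a, width, lut);
    lmcs_new(b, width, lut);
    return memcmp(a, b, (size_t)width * sizeof(*a)) == 0;
}
```

Calling check_width for 8, 16, 32, 64, and 128 mirrors the width coverage of the two tests described above.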