Import x86 assembly for dav1d 0.9.1 #2769

Merged
merged 54 commits into from Aug 9, 2021

@shssoichiro (Collaborator) commented Aug 6, 2021

The majority of this assembly targets high bit-depth encoding for SSSE3/SSE4.1, so this will primarily benefit older machines that do not have AVX2.

There are minor improvements to AVX2 which result in about a 1% speedup at speed 6.

With only SSE4.1 available, this provides a 24% speedup on high bit-depth content at speed 6.

anotherwon and others added 30 commits August 6, 2021 09:26
Particularly in code that makes heavy use of macros it's possible
to end up with 3-operand instructions with a memory operand in src1.
In the case of SSE this works fine due to automatic move insertions,
but in AVX that fails since memory operands are only allowed in src2.

The main purpose of this feature is to minimize the amount of code
changes required to facilitate conversion of existing SSE code to AVX.

The stack size calculation ended up being incorrect when the stack
alignment was larger than 16 due to auto-generated alignment padding.

* Rename macro for consistency. WHT has exactly one line per register.
* Use REPX to make code more readable.
                                                 64-bit  32-bit
inv_txfm_add_4x4_adst_adst_0_10bpc_c:            257.0   346.3
inv_txfm_add_4x4_adst_adst_0_10bpc_sse4:          47.1    51.7
inv_txfm_add_4x4_adst_adst_0_10bpc_avx2:          57.4
inv_txfm_add_4x4_adst_adst_1_10bpc_c:            259.8   345.6
inv_txfm_add_4x4_adst_adst_1_10bpc_sse4:          47.1    52.0
inv_txfm_add_4x4_adst_adst_1_10bpc_avx2:          56.9
inv_txfm_add_4x4_adst_dct_0_10bpc_c:             284.6   369.9
inv_txfm_add_4x4_adst_dct_0_10bpc_sse4:           42.2    46.0
inv_txfm_add_4x4_adst_dct_0_10bpc_avx2:           51.9
inv_txfm_add_4x4_adst_dct_1_10bpc_c:             285.2   369.8
inv_txfm_add_4x4_adst_dct_1_10bpc_sse4:           42.4    45.9
inv_txfm_add_4x4_adst_dct_1_10bpc_avx2:           51.9
inv_txfm_add_4x4_adst_flipadst_0_10bpc_c:        262.9   345.0
inv_txfm_add_4x4_adst_flipadst_0_10bpc_sse4:      46.8    50.1
inv_txfm_add_4x4_adst_flipadst_0_10bpc_avx2:      57.0
inv_txfm_add_4x4_adst_flipadst_1_10bpc_c:        262.1   345.6
inv_txfm_add_4x4_adst_flipadst_1_10bpc_sse4:      46.8    50.3
inv_txfm_add_4x4_adst_flipadst_1_10bpc_avx2:      57.1
inv_txfm_add_4x4_adst_identity_0_10bpc_c:        225.6   302.9
inv_txfm_add_4x4_adst_identity_0_10bpc_sse4:      38.0    42.3
inv_txfm_add_4x4_adst_identity_0_10bpc_avx2:      41.4
inv_txfm_add_4x4_adst_identity_1_10bpc_c:        225.7   303.1
inv_txfm_add_4x4_adst_identity_1_10bpc_sse4:      37.8    42.3
inv_txfm_add_4x4_adst_identity_1_10bpc_avx2:      41.4
inv_txfm_add_4x4_dct_adst_0_10bpc_c:             274.6   378.0
inv_txfm_add_4x4_dct_adst_0_10bpc_sse4:           44.8    48.5
inv_txfm_add_4x4_dct_adst_0_10bpc_avx2:           50.7
inv_txfm_add_4x4_dct_adst_1_10bpc_c:             274.0   377.4
inv_txfm_add_4x4_dct_adst_1_10bpc_sse4:           44.6    48.6
inv_txfm_add_4x4_dct_adst_1_10bpc_avx2:           51.0
inv_txfm_add_4x4_dct_dct_0_10bpc_c:               39.2    50.6
inv_txfm_add_4x4_dct_dct_0_10bpc_sse4:            29.1    33.8
inv_txfm_add_4x4_dct_dct_0_10bpc_avx2:            29.3
inv_txfm_add_4x4_dct_dct_1_10bpc_c:              300.6   399.0
inv_txfm_add_4x4_dct_dct_1_10bpc_sse4:            39.7    44.3
inv_txfm_add_4x4_dct_dct_1_10bpc_avx2:            48.6
inv_txfm_add_4x4_dct_flipadst_0_10bpc_c:         278.6   377.8
inv_txfm_add_4x4_dct_flipadst_0_10bpc_sse4:       45.3    49.6
inv_txfm_add_4x4_dct_flipadst_0_10bpc_avx2:       50.2
inv_txfm_add_4x4_dct_flipadst_1_10bpc_c:         277.1   378.3
inv_txfm_add_4x4_dct_flipadst_1_10bpc_sse4:       45.0    49.7
inv_txfm_add_4x4_dct_flipadst_1_10bpc_avx2:       50.2
inv_txfm_add_4x4_dct_identity_0_10bpc_c:         246.9   335.8
inv_txfm_add_4x4_dct_identity_0_10bpc_sse4:       37.1    41.7
inv_txfm_add_4x4_dct_identity_0_10bpc_avx2:       37.4
inv_txfm_add_4x4_dct_identity_1_10bpc_c:         247.2   336.2
inv_txfm_add_4x4_dct_identity_1_10bpc_sse4:       37.1    41.6
inv_txfm_add_4x4_dct_identity_1_10bpc_avx2:       37.3
inv_txfm_add_4x4_flipadst_adst_0_10bpc_c:        259.4   351.7
inv_txfm_add_4x4_flipadst_adst_0_10bpc_sse4:      47.1    51.8
inv_txfm_add_4x4_flipadst_adst_0_10bpc_avx2:      57.9
inv_txfm_add_4x4_flipadst_adst_1_10bpc_c:        258.7   350.8
inv_txfm_add_4x4_flipadst_adst_1_10bpc_sse4:      47.1    51.8
inv_txfm_add_4x4_flipadst_adst_1_10bpc_avx2:      57.4
inv_txfm_add_4x4_flipadst_dct_0_10bpc_c:         282.3   375.4
inv_txfm_add_4x4_flipadst_dct_0_10bpc_sse4:       42.2    45.8
inv_txfm_add_4x4_flipadst_dct_0_10bpc_avx2:       52.5
inv_txfm_add_4x4_flipadst_dct_1_10bpc_c:         283.0   375.8
inv_txfm_add_4x4_flipadst_dct_1_10bpc_sse4:       42.5    45.9
inv_txfm_add_4x4_flipadst_dct_1_10bpc_avx2:       52.4
inv_txfm_add_4x4_flipadst_flipadst_0_10bpc_c:    258.8   356.1
inv_txfm_add_4x4_flipadst_flipadst_0_10bpc_sse4:  47.3    50.1
inv_txfm_add_4x4_flipadst_flipadst_0_10bpc_avx2:  57.4
inv_txfm_add_4x4_flipadst_flipadst_1_10bpc_c:    259.0   355.3
inv_txfm_add_4x4_flipadst_flipadst_1_10bpc_sse4:  47.8    50.2
inv_txfm_add_4x4_flipadst_flipadst_1_10bpc_avx2:  57.4
inv_txfm_add_4x4_flipadst_identity_0_10bpc_c:    228.6   309.4
inv_txfm_add_4x4_flipadst_identity_0_10bpc_sse4:  37.8    42.0
inv_txfm_add_4x4_flipadst_identity_0_10bpc_avx2:  41.4
inv_txfm_add_4x4_flipadst_identity_1_10bpc_c:    229.1   309.6
inv_txfm_add_4x4_flipadst_identity_1_10bpc_sse4:  37.9    42.2
inv_txfm_add_4x4_flipadst_identity_1_10bpc_avx2:  41.3
inv_txfm_add_4x4_identity_adst_0_10bpc_c:        200.8   275.8
inv_txfm_add_4x4_identity_adst_0_10bpc_sse4:      39.0    43.9
inv_txfm_add_4x4_identity_adst_0_10bpc_avx2:      47.4
inv_txfm_add_4x4_identity_adst_1_10bpc_c:        200.8   276.5
inv_txfm_add_4x4_identity_adst_1_10bpc_sse4:      39.0    44.0
inv_txfm_add_4x4_identity_adst_1_10bpc_avx2:      47.2
inv_txfm_add_4x4_identity_dct_0_10bpc_c:         226.4   300.3
inv_txfm_add_4x4_identity_dct_0_10bpc_sse4:       36.9    41.7
inv_txfm_add_4x4_identity_dct_0_10bpc_avx2:       42.8
inv_txfm_add_4x4_identity_dct_1_10bpc_c:         229.0   300.6
inv_txfm_add_4x4_identity_dct_1_10bpc_sse4:       36.8    41.6
inv_txfm_add_4x4_identity_dct_1_10bpc_avx2:       42.7
inv_txfm_add_4x4_identity_flipadst_0_10bpc_c:    202.6   278.9
inv_txfm_add_4x4_identity_flipadst_0_10bpc_sse4:  39.2    43.7
inv_txfm_add_4x4_identity_flipadst_0_10bpc_avx2:  47.1
inv_txfm_add_4x4_identity_flipadst_1_10bpc_c:    202.6   279.3
inv_txfm_add_4x4_identity_flipadst_1_10bpc_sse4:  39.2    43.8
inv_txfm_add_4x4_identity_flipadst_1_10bpc_avx2:  47.0
inv_txfm_add_4x4_identity_identity_0_10bpc_c:    168.7   235.9
inv_txfm_add_4x4_identity_identity_0_10bpc_sse4:  31.7    37.6
inv_txfm_add_4x4_identity_identity_0_10bpc_avx2:  33.9
inv_txfm_add_4x4_identity_identity_1_10bpc_c:    169.1   235.7
inv_txfm_add_4x4_identity_identity_1_10bpc_sse4:  31.7    37.4
inv_txfm_add_4x4_identity_identity_1_10bpc_avx2:  33.8
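
To make the table easier to read at a glance, here is a minimal sketch that turns a few of the 64-bit figures above into speedup ratios over the C version. The table does not state its units, so this assumes the figures are timings where lower is better; the `timings` dictionary and the `speedup` helper are illustrative names, not part of the PR.

```python
# Figures copied from the 64-bit column of the table above.
# Assumption: they are timings (lower is better); the units are not stated.
timings = {
    "inv_txfm_add_4x4_dct_dct_1_10bpc": {"c": 300.6, "sse4": 39.7, "avx2": 48.6},
    "inv_txfm_add_4x4_adst_adst_0_10bpc": {"c": 257.0, "sse4": 47.1, "avx2": 57.4},
    "inv_txfm_add_4x4_identity_identity_0_10bpc": {"c": 168.7, "sse4": 31.7, "avx2": 33.9},
}


def speedup(baseline: float, optimized: float) -> float:
    """Return how many times faster `optimized` is than `baseline`."""
    return baseline / optimized


for name, t in timings.items():
    sse4 = speedup(t["c"], t["sse4"])
    avx2 = speedup(t["c"], t["avx2"])
    print(f"{name}: sse4 {sse4:.2f}x, avx2 {avx2:.2f}x over C")
```

Under that assumption, the SSE4.1 versions of these 4x4 transforms come out slightly ahead of the AVX2 versions in the sampled rows.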
rbultje and others added 24 commits August 6, 2021 09:26
The wrapper function already backs up GPRs, and declaring 7 here means
we will backup/restore twice on x86-32.

@coveralls (Collaborator)

Coverage decreased (-0.4%) to 83.533% when pulling 76d531e on shssoichiro:dav1d-091-x86-asm into 59ef884 on xiph:master.

@negge (Collaborator) commented Aug 6, 2021

What about the arm assembly?

@shssoichiro (Collaborator, Author) commented Aug 6, 2021

I don't have an arm machine to test on. Usually mindfreeze handles those. From IRC:

10:04 AM <lu_zero> mindfreeze: willing to pick the aarch64 parts?
10:07 AM <mindfreeze> lu_zero: sure, I was checking the changes
10:07 AM <mindfreeze> https://code.videolan.org/videolan/dav1d/-/commits/0.9.1/src/arm
10:07 AM <mindfreeze> It looks like it is Filmgrain changes, we could import them, but it will be not useful at this point

@lu-zero (Collaborator) left a comment

Thank you, let's land it :)

@lu-zero merged commit 984515f into xiph:master on Aug 9, 2021