Merge dav1d 0.3.0 AArch64 Assembly #1791
Conversation
Great work again. Is it worth pulling in AArch32 NEON, considering the increase in binary size and compile times, the very low performance of ARMv7 cores, and the declining adoption? I think AArch64 should be the (only) focus for Arm.
I'd postpone any 32-bit work until the 64-bit codepaths work flawlessly.
Having the AArch32 sources available is inconsequential to AArch64 builds, so there's no harm in updating the Arm sources in the same order as upstream. One difference with x86, though, is that the 32- and 64-bit sources are completely separated, which makes it easier to skip the 32-bit sources.
While armv8 cores are commonly adopted, they generally include support for both AArch32 and AArch64. The adoption of AArch64 software stacks is nowhere near as ubiquitous as armv8 itself. This is at least partly due to the large increase in code size from AArch32 to AArch64. So for the near future, there are likely applications for AArch32 on armv8.
For testing purposes, we can match the constant stride in assembly to the buffer used by the encoder.
What's the status of this PR? Need any help testing?
Awesome, this matches the changes in dav1d from 0.1.0 to 0.3.0.

Thank you for catching the warnings in src/arm/tables.S.
On COFF, the default read-only data section is `.rdata`, not `.rodata`.
On AArch32, there's no ubfm instruction, only ubfx.
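For context (a rough Python model, not project code): on AArch64, `ubfx Wd, Wn, #lsb, #width` is just an alias of `ubfm` with `immr = lsb` and `imms = lsb + width - 1`, so either spelling assembles there; AArch32 only has the dedicated `ubfx` encoding, hence the fix. A sketch of the bitfield-extract semantics:

```python
def ubfx(value, lsb, width):
    """UBFX: zero-extend `width` bits of `value` starting at bit `lsb`."""
    return (value >> lsb) & ((1 << width) - 1)

def ubfm(value, immr, imms):
    """AArch64 UBFM in its bitfield-extract form (imms >= immr):
    equivalent to UBFX with lsb = immr, width = imms - immr + 1."""
    assert imms >= immr  # other operand combinations alias LSL/LSR instead
    return ubfx(value, immr, imms - immr + 1)

# Extract bits [11:4] of 0xABCD -> 0xBC, via either spelling.
assert ubfx(0xABCD, 4, 8) == 0xBC
assert ubfm(0xABCD, 4, 11) == 0xBC
```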
Before:                          Cortex A53  Snapdragon 835
mc_8tap_regular_w2_v_8bpc_neon:       155.1           131.8
mc_8tap_regular_w4_v_8bpc_neon:       199.6           148.1
mc_8tap_regular_w8_v_8bpc_neon:       286.2           225.5
After:
mc_8tap_regular_w2_v_8bpc_neon:       134.1           129.5
mc_8tap_regular_w4_v_8bpc_neon:       157.6           146.5
mc_8tap_regular_w8_v_8bpc_neon:       208.0           225.0
Before:                          Cortex A53  Snapdragon 835
mc_8tap_regular_w2_hv_8bpc_neon:      415.0           286.9
After:
mc_8tap_regular_w2_hv_8bpc_neon:      399.1           269.9
Before:                          Cortex A53  Snapdragon 835
mc_8tap_regular_w4_hv_8bpc_neon:      543.6           359.1
After:
mc_8tap_regular_w4_hv_8bpc_neon:      466.7           355.5

The same kind of change doesn't seem to give any benefits on the 8 pixel wide hv filtering though, potentially related to the fact that it uses not only smull/smlal but also smull2/smlal2.
Relative speedups measured with checkasm:
                                  Cortex A7  A8  A9  A53  Snapdragon 835
mc_8tap_regular_w2_0_8bpc_neon:     9.63  4.05  3.82  5.41  5.68
mc_8tap_regular_w2_h_8bpc_neon:     3.30  5.44  3.38  3.88  5.12
mc_8tap_regular_w2_hv_8bpc_neon:    3.86  6.21  4.39  5.18  6.10
mc_8tap_regular_w2_v_8bpc_neon:     4.69  5.43  3.56  7.27  4.86
mc_8tap_regular_w4_0_8bpc_neon:     9.13  4.05  5.24  5.37  6.60
mc_8tap_regular_w4_h_8bpc_neon:     4.38  7.11  4.61  6.59  7.15
mc_8tap_regular_w4_hv_8bpc_neon:    5.11  9.77  7.37  9.21  10.29
mc_8tap_regular_w4_v_8bpc_neon:     6.24  7.88  4.96  11.16  7.89
mc_8tap_regular_w8_0_8bpc_neon:     9.12  4.20  5.59  5.59  9.25
mc_8tap_regular_w8_h_8bpc_neon:     5.91  8.42  4.84  8.46  7.08
mc_8tap_regular_w8_hv_8bpc_neon:    5.46  8.35  6.52  7.19  8.33
mc_8tap_regular_w8_v_8bpc_neon:     7.53  8.96  6.28  16.08  10.66
mc_8tap_regular_w16_0_8bpc_neon:    9.77  5.46  4.06  7.02  7.38
mc_8tap_regular_w16_h_8bpc_neon:    6.33  8.87  5.03  10.30  4.29
mc_8tap_regular_w16_hv_8bpc_neon:   5.00  7.84  6.15  6.83  7.44
mc_8tap_regular_w16_v_8bpc_neon:    7.74  8.81  6.23  19.24  11.16
mc_8tap_regular_w32_0_8bpc_neon:    6.11  4.63  2.44  5.92  4.70
mc_8tap_regular_w32_h_8bpc_neon:    6.60  9.02  5.20  11.08  3.50
mc_8tap_regular_w32_hv_8bpc_neon:   4.85  7.64  6.09  6.68  6.92
mc_8tap_regular_w32_v_8bpc_neon:    7.61  8.36  6.13  19.94  11.17
mc_8tap_regular_w64_0_8bpc_neon:    4.61  3.81  1.60  3.50  2.73
mc_8tap_regular_w64_h_8bpc_neon:    6.72  9.07  5.21  11.41  3.10
mc_8tap_regular_w64_hv_8bpc_neon:   4.67  7.43  5.92  6.43  6.59
mc_8tap_regular_w64_v_8bpc_neon:    7.64  8.28  6.07  20.48  11.41
mc_8tap_regular_w128_0_8bpc_neon:   2.41  3.13  1.11  2.31  1.73
mc_8tap_regular_w128_h_8bpc_neon:   6.68  9.03  5.09  11.41  2.90
mc_8tap_regular_w128_hv_8bpc_neon:  4.50  7.39  5.70  6.26  6.47
mc_8tap_regular_w128_v_8bpc_neon:   7.21  8.23  5.88  19.82  11.42
mc_bilinear_w2_0_8bpc_neon:         9.23  4.03  3.74  5.33  6.49
mc_bilinear_w2_h_8bpc_neon:         2.07  3.52  2.71  2.35  3.40
mc_bilinear_w2_hv_8bpc_neon:        2.60  5.24  2.73  2.74  3.89
mc_bilinear_w2_v_8bpc_neon:         2.57  4.39  3.14  3.04  4.05
mc_bilinear_w4_0_8bpc_neon:         8.74  4.03  5.38  5.28  6.53
mc_bilinear_w4_h_8bpc_neon:         3.41  6.22  4.28  3.86  7.56
mc_bilinear_w4_hv_8bpc_neon:        4.38  7.45  4.61  5.26  7.95
mc_bilinear_w4_v_8bpc_neon:         3.65  6.57  4.51  4.45  7.62
mc_bilinear_w8_0_8bpc_neon:         8.74  4.50  5.71  5.46  9.39
mc_bilinear_w8_h_8bpc_neon:         6.14  10.71  6.78  6.88  14.10
mc_bilinear_w8_hv_8bpc_neon:        7.11  12.80  8.24  11.08  7.83
mc_bilinear_w8_v_8bpc_neon:         7.24  11.69  7.57  8.04  15.46
mc_bilinear_w16_0_8bpc_neon:        10.01  5.47  4.07  6.97  7.64
mc_bilinear_w16_h_8bpc_neon:        8.36  17.00  8.34  11.61  7.64
mc_bilinear_w16_hv_8bpc_neon:       7.67  13.54  8.53  13.32  8.05
mc_bilinear_w16_v_8bpc_neon:        10.19  22.56  10.52  15.39  10.62
mc_bilinear_w32_0_8bpc_neon:        6.22  4.73  2.43  5.89  4.90
mc_bilinear_w32_h_8bpc_neon:        9.47  18.96  9.34  13.10  7.24
mc_bilinear_w32_hv_8bpc_neon:       7.95  13.15  9.49  13.78  8.71
mc_bilinear_w32_v_8bpc_neon:        11.10  23.53  11.34  16.74  8.78
mc_bilinear_w64_0_8bpc_neon:        4.58  3.82  1.59  3.46  2.71
mc_bilinear_w64_h_8bpc_neon:        10.07  19.77  9.60  13.99  6.88
mc_bilinear_w64_hv_8bpc_neon:       8.08  12.95  9.39  13.84  8.90
mc_bilinear_w64_v_8bpc_neon:        11.49  23.85  11.12  17.13  7.90
mc_bilinear_w128_0_8bpc_neon:       2.37  3.24  1.15  2.28  1.73
mc_bilinear_w128_h_8bpc_neon:       9.94  18.84  8.66  13.91  6.74
mc_bilinear_w128_hv_8bpc_neon:      7.26  12.82  8.97  12.43  8.88
mc_bilinear_w128_v_8bpc_neon:       9.89  23.88  8.93  14.73  7.33
mct_8tap_regular_w4_0_8bpc_neon:    2.82  4.46  2.72  3.50  5.41
mct_8tap_regular_w4_h_8bpc_neon:    4.16  6.88  4.64  6.51  6.60
mct_8tap_regular_w4_hv_8bpc_neon:   5.22  9.87  7.81  9.39  10.11
mct_8tap_regular_w4_v_8bpc_neon:    5.81  7.72  4.80  10.16  6.85
mct_8tap_regular_w8_0_8bpc_neon:    4.48  6.30  3.01  5.82  5.04
mct_8tap_regular_w8_h_8bpc_neon:    5.59  8.04  4.18  8.68  8.30
mct_8tap_regular_w8_hv_8bpc_neon:   5.34  8.32  6.42  7.04  7.99
mct_8tap_regular_w8_v_8bpc_neon:    7.32  8.71  5.75  17.07  9.73
mct_8tap_regular_w16_0_8bpc_neon:   5.05  9.60  3.64  10.06  4.29
mct_8tap_regular_w16_h_8bpc_neon:   5.53  8.20  4.54  9.98  7.33
mct_8tap_regular_w16_hv_8bpc_neon:  4.90  7.87  6.07  6.67  7.03
mct_8tap_regular_w16_v_8bpc_neon:   7.39  8.55  5.72  19.64  9.98
mct_8tap_regular_w32_0_8bpc_neon:   5.28  8.16  4.07  11.03  2.38
mct_8tap_regular_w32_h_8bpc_neon:   5.97  8.31  4.67  10.63  6.72
mct_8tap_regular_w32_hv_8bpc_neon:  4.73  7.65  5.98  6.51  6.31
mct_8tap_regular_w32_v_8bpc_neon:   7.33  8.18  5.72  20.50  10.03
mct_8tap_regular_w64_0_8bpc_neon:   5.11  9.19  4.01  10.61  1.92
mct_8tap_regular_w64_h_8bpc_neon:   6.05  8.33  4.53  10.84  6.38
mct_8tap_regular_w64_hv_8bpc_neon:  4.61  7.54  5.69  6.35  6.11
mct_8tap_regular_w64_v_8bpc_neon:   7.27  8.06  5.39  20.41  10.15
mct_8tap_regular_w128_0_8bpc_neon:  4.29  8.21  4.28  9.55  1.32
mct_8tap_regular_w128_h_8bpc_neon:  6.01  8.26  4.43  10.78  6.20
mct_8tap_regular_w128_hv_8bpc_neon: 4.49  7.49  5.46  6.11  5.96
mct_8tap_regular_w128_v_8bpc_neon:  6.90  8.00  5.19  18.47  10.13
mct_bilinear_w4_0_8bpc_neon:        2.70  4.53  2.67  3.32  5.11
mct_bilinear_w4_h_8bpc_neon:        3.02  5.06  3.13  3.28  5.38
mct_bilinear_w4_hv_8bpc_neon:       4.14  7.04  4.75  4.99  6.30
mct_bilinear_w4_v_8bpc_neon:        3.17  5.30  3.66  3.87  5.01
mct_bilinear_w8_0_8bpc_neon:        4.41  6.46  2.99  5.74  5.98
mct_bilinear_w8_h_8bpc_neon:        5.36  8.27  3.62  6.39  9.06
mct_bilinear_w8_hv_8bpc_neon:       6.65  11.82  6.79  11.47  7.07
mct_bilinear_w8_v_8bpc_neon:        6.26  9.62  4.05  7.75  16.81
mct_bilinear_w16_0_8bpc_neon:       4.86  9.85  3.61  10.03  4.19
mct_bilinear_w16_h_8bpc_neon:       5.26  12.91  4.76  9.56  9.68
mct_bilinear_w16_hv_8bpc_neon:      6.96  12.58  7.05  13.48  7.35
mct_bilinear_w16_v_8bpc_neon:       6.46  17.94  5.72  13.70  19.20
mct_bilinear_w32_0_8bpc_neon:       5.31  8.10  4.06  10.88  2.77
mct_bilinear_w32_h_8bpc_neon:       6.91  14.28  5.33  11.24  10.33
mct_bilinear_w32_hv_8bpc_neon:      7.13  12.21  7.57  13.91  7.19
mct_bilinear_w32_v_8bpc_neon:       8.06  18.48  5.88  14.74  15.47
mct_bilinear_w64_0_8bpc_neon:       5.08  7.29  3.83  10.44  1.71
mct_bilinear_w64_h_8bpc_neon:       7.24  14.59  5.40  11.70  11.03
mct_bilinear_w64_hv_8bpc_neon:      7.24  11.98  7.59  13.72  7.30
mct_bilinear_w64_v_8bpc_neon:       8.20  18.24  5.69  14.57  15.04
mct_bilinear_w128_0_8bpc_neon:      4.35  8.23  4.17  9.71  1.11
mct_bilinear_w128_h_8bpc_neon:      7.02  13.80  5.63  11.11  11.26
mct_bilinear_w128_hv_8bpc_neon:     6.31  11.89  6.75  12.12  7.24
mct_bilinear_w128_v_8bpc_neon:      6.95  18.26  5.84  11.31  14.78
These cases looped one round too many.
The relative speedup compared to C code is around 4-8x:
                        Cortex A7    A8    A9   A53   A72   A73
wiener_luma_8bpc_neon:       4.00  7.54  4.74  6.84  4.91  8.01
This uses the right registers, corresponding to the ones shifted in the arm64 version.
Speedup vs C code:
                            Cortex A53   A72   A73
cdef_filter_4x4_8bpc_neon:        4.62  4.48  4.76
cdef_filter_4x8_8bpc_neon:        4.82  4.80  5.08
cdef_filter_8x8_8bpc_neon:        5.29  5.33  5.79
This should fix compilation with compilers that default to armv6, such as on Raspbian.
A symbol name starting with two underscores is reserved for the compiler/standard library implementation. Also remove the trailing double underscores for consistency and symmetry.
Speedup vs C code:
                     Cortex A53   A72   A73
cdef_dir_8bpc_neon:        4.43  3.51  4.39
Relative speedup vs C code:
                      Cortex A53   A72   A73
warp_8x8_8bpc_neon:         3.19  2.60  3.66
warp_8x8t_8bpc_neon:        3.09  2.50  3.58
Before:                     Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:       677.4   433.9   452.9
cdef_filter_4x8_8bpc_neon:      1255.0   815.2   841.8
cdef_filter_8x8_8bpc_neon:      2278.5  1440.0  1505.0
After:
cdef_filter_4x4_8bpc_neon:       645.5   401.9   422.5
cdef_filter_4x8_8bpc_neon:      1193.7   756.6   782.4
cdef_filter_8x8_8bpc_neon:      2162.4  1361.9  1375.6
Pad with a value which works both as a large unsigned value and a negative signed value. This allows doing the max operation using signed max, avoiding the conditional altogether. Based on the same idea for x86 by Kyle Siefring.

Before:                     Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:       645.5   401.9   422.5
cdef_filter_4x8_8bpc_neon:      1193.7   756.6   782.4
cdef_filter_8x8_8bpc_neon:      2162.4  1361.9  1375.6
After:
cdef_filter_4x4_8bpc_neon:       596.3   377.8   384.8
cdef_filter_4x8_8bpc_neon:      1097.4   705.5   707.1
cdef_filter_8x8_8bpc_neon:      1967.4  1232.3  1239.9
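To illustrate the padding trick (a Python sketch with hypothetical lane values, not the actual assembly): a 16-bit padding value of 0x8000 reads as 32768 when interpreted unsigned (larger than any pixel, so it never wins an unsigned min) and as -32768 when interpreted signed (smaller than any pixel, so it never wins a signed max). Both reductions therefore ignore padded lanes without any conditional:

```python
PAD = 0x8000  # as u16: 32768 (above any 8-bit pixel); as i16: -32768 (below any)

def as_i16(x):
    """Reinterpret a 16-bit lane as signed, the way smax sees it."""
    return x - 0x10000 if x >= 0x8000 else x

lanes = [12, 200, PAD, 3]  # one padded lane among valid pixel values

# Unsigned min (umin): the padding is the largest unsigned value present,
# so it never affects the minimum.
assert min(lanes) == 3

# Signed max (smax): reinterpreted, the padding is the most negative value,
# so it never affects the maximum either -- no conditional needed.
assert max(as_i16(v) for v in lanes) == 200
```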
This might have said pri_taps[k]/sec_taps[k] at some earlier time.
For cases with indented, nested .if/.macro in asm.S, indent those by 4 chars. Some initial assembly files were indented to 4/16 columns, while all the actual implementation files, starting with src/arm/64/mc.S, have used 8/24 for indentation.
The width register has been set to clz(w)-24, not the other way around. And the 32-bit prep function has the h parameter in r4, not in r5.
This makes it easier to disambiguate these functions when looking at perf profiles.
Relative speedup vs (autovectorized) C code:
                           Cortex A53   A72   A73
selfguided_3x3_8bpc_neon:        2.91  2.12  2.68
selfguided_5x5_8bpc_neon:        3.18  2.65  3.39
selfguided_mix_8bpc_neon:        3.04  2.29  2.98
The relative speedup vs non-vectorized C code is around 2.6-4.6x.
The exact relative speedup compared to C code is a bit vague and hard to measure, depending on exactly how many filtered blocks are skipped, as the NEON version always filters 16 pixels at a time, while the C code can skip processing individual 4 pixel blocks. Additionally, the checkasm benchmarking code runs the same function repeatedly on the same buffer, which can make the filter take different codepaths on each run, as the function updates the buffer which will be used as input for the next run. If tweaking the checkasm test data to try to avoid skipped blocks, the relative speedup compared to C is between 2x and 5x, while it is around 1x to 4x with the current checkasm test as such.

Benchmark numbers from a tweaked checkasm that avoids skipped blocks:
                          Cortex A53     A72     A73
lpf_h_sb_uv_w4_8bpc_c:        2954.7  1399.3  1655.3
lpf_h_sb_uv_w4_8bpc_neon:      895.5   650.8   692.0
lpf_h_sb_uv_w6_8bpc_c:        3879.2  1917.2  2257.7
lpf_h_sb_uv_w6_8bpc_neon:     1125.6   759.5   838.4
lpf_h_sb_y_w4_8bpc_c:         6711.0  3275.5  3913.7
lpf_h_sb_y_w4_8bpc_neon:      1744.0  1342.1  1351.5
lpf_h_sb_y_w8_8bpc_c:        10695.7  6155.8  6638.9
lpf_h_sb_y_w8_8bpc_neon:      2146.5  1560.4  1609.1
lpf_h_sb_y_w16_8bpc_c:       11355.8  6292.0  6995.9
lpf_h_sb_y_w16_8bpc_neon:     2475.4  1949.6  1968.4
lpf_v_sb_uv_w4_8bpc_c:        2639.7  1204.8  1425.9
lpf_v_sb_uv_w4_8bpc_neon:      510.7   351.4   334.7
lpf_v_sb_uv_w6_8bpc_c:        3468.3  1757.1  2021.5
lpf_v_sb_uv_w6_8bpc_neon:      625.0   415.0   397.8
lpf_v_sb_y_w4_8bpc_c:         5428.7  2731.7  3068.5
lpf_v_sb_y_w4_8bpc_neon:      1172.6   792.1   768.0
lpf_v_sb_y_w8_8bpc_c:         8946.1  4412.8  5121.0
lpf_v_sb_y_w8_8bpc_neon:      1565.5  1063.6  1062.7
lpf_v_sb_y_w16_8bpc_c:        8978.9  4411.7  5112.0
lpf_v_sb_y_w16_8bpc_neon:     1775.0  1288.1  1236.7
Hi,
This new patchset continues the merge of dav1d 0.2.0, 0.2.1, 0.2.2, and 0.3.0 assembly into rav1e.
Overview
Between the 0.2.0 and 0.3.0 releases of dav1d, the main additions are:
- CDEF dir and filters
- Improvements to loop restoration filtering
- Smart padding for CDEF
- SGR loop restoration
- Loop filtering

We are skipping integration of the above for now, in accordance with x86.
We will integrate loop restoration, loop filtering, and CDEF once we complete the initial phase of merging up to dav1d 0.5.0 for Arm.
There are around 25 new functions from dav1d 0.2.0..0.3.0 which are merged but not integrated for now:
- rav1e_cdef_padding8_neon
- rav1e_cdef_padding4_neon
- rav1e_cdef_filter8_neon
- rav1e_cdef_filter4_neon
- rav1e_cdef_find_dir_neon
- rav1e_lpf_v_sb_y_neon
- rav1e_lpf_h_sb_y_neon
- rav1e_lpf_v_sb_uv_neon
- rav1e_lpf_h_sb_uv_neon
- rav1e_wiener_filter_h_neon
- rav1e_wiener_filter_v_neon
- rav1e_copy_narrow_neon
- rav1e_sgr_box3_h_neon
- rav1e_sgr_box5_h_neon
- rav1e_sgr_box3_v_neon
- rav1e_sgr_box5_v_neon
- rav1e_sgr_calc_ab1_neon
- rav1e_sgr_calc_ab2_neon
- rav1e_sgr_x_by_x
- rav1e_sgr_finish_filter1_neon
- rav1e_sgr_finish_filter2_neon
- rav1e_sgr_weighted1_neon
- rav1e_sgr_weighted2_neon

The loop restoration cannot be taken for now, as it uses
wiener_filter
while we have an expensive RDO in rav1e; but we can use it for the final calculation of RDO and build on it later. In dav1d 0.3.0 there are also self-guided filters which could be adapted if we do some hacking over them. In an initial assessment of the NEON optimizations, which are well tuned for the A53, we noticed an improvement of ~1-1.5 minutes.
The next PR, which merges dav1d 0.4.0, should be really interesting to see, as we already have those changes in x86.
Reference
#1754