Merge dav1d 0.3.0 AArch64 Assembly #1791
Conversation
Great work again. Is it worth pulling in AArch32 NEON, considering the increase in binary size and compile times, the very low performance of ARMv7 cores, and the declining adoption? I think AArch64 should be the (only) focus for Arm.
I'd postpone any 32-bit work until the 64-bit codepaths work flawlessly.
Having the AArch32 sources available is inconsequential to AArch64 builds, so there's no harm in updating the Arm sources in the same order as upstream. One difference with x86, though, is that the 32- and 64-bit sources are completely separated, which makes it easier to skip the 32-bit sources.
While armv8 cores are commonly adopted, they generally include support for both AArch32 and AArch64. The adoption of AArch64 software stacks is nowhere near as ubiquitous as armv8 itself. This is at least partly due to the large increase in code size from AArch32 to AArch64. So for the near future, there are likely applications for AArch32 on armv8.
For testing purposes, we can match the constant stride in assembly to the buffer used by the encoder.
What's the status of this PR? Need any help testing?
Awesome, this matches the changes in dav1d from 0.1.0 to 0.3.0.

Thank you for catching the warnings in src/arm/tables.S.
On COFF, the default read-only data section is `.rdata`, not `.rodata`.
On AArch32, there's no ubfm instruction, only ubfx.
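For context (a rough Python model, not project code): on AArch64, `ubfx Wd, Wn, #lsb, #width` is just an alias of `ubfm` with `immr = lsb` and `imms = lsb + width - 1`, so either spelling assembles there; AArch32 only has the dedicated `ubfx` encoding, hence the fix. A sketch of the bitfield-extract semantics:

```python
def ubfx(value, lsb, width):
    """UBFX: zero-extend `width` bits of `value` starting at bit `lsb`."""
    return (value >> lsb) & ((1 << width) - 1)

def ubfm(value, immr, imms):
    """AArch64 UBFM in its bitfield-extract form (imms >= immr):
    equivalent to UBFX with lsb = immr, width = imms - immr + 1."""
    assert imms >= immr  # other operand combinations alias LSL/LSR instead
    return ubfx(value, immr, imms - immr + 1)

# Extract bits [11:4] of 0xABCD -> 0xBC, via either spelling.
assert ubfx(0xABCD, 4, 8) == 0xBC
assert ubfm(0xABCD, 4, 11) == 0xBC
```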
Before:                          Cortex A53  Snapdragon 835
mc_8tap_regular_w2_v_8bpc_neon:       155.1           131.8
mc_8tap_regular_w4_v_8bpc_neon:       199.6           148.1
mc_8tap_regular_w8_v_8bpc_neon:       286.2           225.5
After:
mc_8tap_regular_w2_v_8bpc_neon:       134.1           129.5
mc_8tap_regular_w4_v_8bpc_neon:       157.6           146.5
mc_8tap_regular_w8_v_8bpc_neon:       208.0           225.0
Before:                          Cortex A53  Snapdragon 835
mc_8tap_regular_w2_hv_8bpc_neon:      415.0           286.9
After:
mc_8tap_regular_w2_hv_8bpc_neon:      399.1           269.9
Before:                          Cortex A53  Snapdragon 835
mc_8tap_regular_w4_hv_8bpc_neon:      543.6           359.1
After:
mc_8tap_regular_w4_hv_8bpc_neon:      466.7           355.5

The same kind of change doesn't seem to give any benefits on the 8 pixel wide hv filtering though, potentially related to the fact that it uses not only smull/smlal but also smull2/smlal2.
Relative speedups measured with checkasm:
                                  Cortex A7  A8  A9  A53  Snapdragon 835
mc_8tap_regular_w2_0_8bpc_neon:     9.63  4.05  3.82  5.41  5.68
mc_8tap_regular_w2_h_8bpc_neon:     3.30  5.44  3.38  3.88  5.12
mc_8tap_regular_w2_hv_8bpc_neon:    3.86  6.21  4.39  5.18  6.10
mc_8tap_regular_w2_v_8bpc_neon:     4.69  5.43  3.56  7.27  4.86
mc_8tap_regular_w4_0_8bpc_neon:     9.13  4.05  5.24  5.37  6.60
mc_8tap_regular_w4_h_8bpc_neon:     4.38  7.11  4.61  6.59  7.15
mc_8tap_regular_w4_hv_8bpc_neon:    5.11  9.77  7.37  9.21  10.29
mc_8tap_regular_w4_v_8bpc_neon:     6.24  7.88  4.96  11.16  7.89
mc_8tap_regular_w8_0_8bpc_neon:     9.12  4.20  5.59  5.59  9.25
mc_8tap_regular_w8_h_8bpc_neon:     5.91  8.42  4.84  8.46  7.08
mc_8tap_regular_w8_hv_8bpc_neon:    5.46  8.35  6.52  7.19  8.33
mc_8tap_regular_w8_v_8bpc_neon:     7.53  8.96  6.28  16.08  10.66
mc_8tap_regular_w16_0_8bpc_neon:    9.77  5.46  4.06  7.02  7.38
mc_8tap_regular_w16_h_8bpc_neon:    6.33  8.87  5.03  10.30  4.29
mc_8tap_regular_w16_hv_8bpc_neon:   5.00  7.84  6.15  6.83  7.44
mc_8tap_regular_w16_v_8bpc_neon:    7.74  8.81  6.23  19.24  11.16
mc_8tap_regular_w32_0_8bpc_neon:    6.11  4.63  2.44  5.92  4.70
mc_8tap_regular_w32_h_8bpc_neon:    6.60  9.02  5.20  11.08  3.50
mc_8tap_regular_w32_hv_8bpc_neon:   4.85  7.64  6.09  6.68  6.92
mc_8tap_regular_w32_v_8bpc_neon:    7.61  8.36  6.13  19.94  11.17
mc_8tap_regular_w64_0_8bpc_neon:    4.61  3.81  1.60  3.50  2.73
mc_8tap_regular_w64_h_8bpc_neon:    6.72  9.07  5.21  11.41  3.10
mc_8tap_regular_w64_hv_8bpc_neon:   4.67  7.43  5.92  6.43  6.59
mc_8tap_regular_w64_v_8bpc_neon:    7.64  8.28  6.07  20.48  11.41
mc_8tap_regular_w128_0_8bpc_neon:   2.41  3.13  1.11  2.31  1.73
mc_8tap_regular_w128_h_8bpc_neon:   6.68  9.03  5.09  11.41  2.90
mc_8tap_regular_w128_hv_8bpc_neon:  4.50  7.39  5.70  6.26  6.47
mc_8tap_regular_w128_v_8bpc_neon:   7.21  8.23  5.88  19.82  11.42
mc_bilinear_w2_0_8bpc_neon:         9.23  4.03  3.74  5.33  6.49
mc_bilinear_w2_h_8bpc_neon:         2.07  3.52  2.71  2.35  3.40
mc_bilinear_w2_hv_8bpc_neon:        2.60  5.24  2.73  2.74  3.89
mc_bilinear_w2_v_8bpc_neon:         2.57  4.39  3.14  3.04  4.05
mc_bilinear_w4_0_8bpc_neon:         8.74  4.03  5.38  5.28  6.53
mc_bilinear_w4_h_8bpc_neon:         3.41  6.22  4.28  3.86  7.56
mc_bilinear_w4_hv_8bpc_neon:        4.38  7.45  4.61  5.26  7.95
mc_bilinear_w4_v_8bpc_neon:         3.65  6.57  4.51  4.45  7.62
mc_bilinear_w8_0_8bpc_neon:         8.74  4.50  5.71  5.46  9.39
mc_bilinear_w8_h_8bpc_neon:         6.14  10.71  6.78  6.88  14.10
mc_bilinear_w8_hv_8bpc_neon:        7.11  12.80  8.24  11.08  7.83
mc_bilinear_w8_v_8bpc_neon:         7.24  11.69  7.57  8.04  15.46
mc_bilinear_w16_0_8bpc_neon:        10.01  5.47  4.07  6.97  7.64
mc_bilinear_w16_h_8bpc_neon:        8.36  17.00  8.34  11.61  7.64
mc_bilinear_w16_hv_8bpc_neon:       7.67  13.54  8.53  13.32  8.05
mc_bilinear_w16_v_8bpc_neon:        10.19  22.56  10.52  15.39  10.62
mc_bilinear_w32_0_8bpc_neon:        6.22  4.73  2.43  5.89  4.90
mc_bilinear_w32_h_8bpc_neon:        9.47  18.96  9.34  13.10  7.24
mc_bilinear_w32_hv_8bpc_neon:       7.95  13.15  9.49  13.78  8.71
mc_bilinear_w32_v_8bpc_neon:        11.10  23.53  11.34  16.74  8.78
mc_bilinear_w64_0_8bpc_neon:        4.58  3.82  1.59  3.46  2.71
mc_bilinear_w64_h_8bpc_neon:        10.07  19.77  9.60  13.99  6.88
mc_bilinear_w64_hv_8bpc_neon:       8.08  12.95  9.39  13.84  8.90
mc_bilinear_w64_v_8bpc_neon:        11.49  23.85  11.12  17.13  7.90
mc_bilinear_w128_0_8bpc_neon:       2.37  3.24  1.15  2.28  1.73
mc_bilinear_w128_h_8bpc_neon:       9.94  18.84  8.66  13.91  6.74
mc_bilinear_w128_hv_8bpc_neon:      7.26  12.82  8.97  12.43  8.88
mc_bilinear_w128_v_8bpc_neon:       9.89  23.88  8.93  14.73  7.33
mct_8tap_regular_w4_0_8bpc_neon:    2.82  4.46  2.72  3.50  5.41
mct_8tap_regular_w4_h_8bpc_neon:    4.16  6.88  4.64  6.51  6.60
mct_8tap_regular_w4_hv_8bpc_neon:   5.22  9.87  7.81  9.39  10.11
mct_8tap_regular_w4_v_8bpc_neon:    5.81  7.72  4.80  10.16  6.85
mct_8tap_regular_w8_0_8bpc_neon:    4.48  6.30  3.01  5.82  5.04
mct_8tap_regular_w8_h_8bpc_neon:    5.59  8.04  4.18  8.68  8.30
mct_8tap_regular_w8_hv_8bpc_neon:   5.34  8.32  6.42  7.04  7.99
mct_8tap_regular_w8_v_8bpc_neon:    7.32  8.71  5.75  17.07  9.73
mct_8tap_regular_w16_0_8bpc_neon:   5.05  9.60  3.64  10.06  4.29
mct_8tap_regular_w16_h_8bpc_neon:   5.53  8.20  4.54  9.98  7.33
mct_8tap_regular_w16_hv_8bpc_neon:  4.90  7.87  6.07  6.67  7.03
mct_8tap_regular_w16_v_8bpc_neon:   7.39  8.55  5.72  19.64  9.98
mct_8tap_regular_w32_0_8bpc_neon:   5.28  8.16  4.07  11.03  2.38
mct_8tap_regular_w32_h_8bpc_neon:   5.97  8.31  4.67  10.63  6.72
mct_8tap_regular_w32_hv_8bpc_neon:  4.73  7.65  5.98  6.51  6.31
mct_8tap_regular_w32_v_8bpc_neon:   7.33  8.18  5.72  20.50  10.03
mct_8tap_regular_w64_0_8bpc_neon:   5.11  9.19  4.01  10.61  1.92
mct_8tap_regular_w64_h_8bpc_neon:   6.05  8.33  4.53  10.84  6.38
mct_8tap_regular_w64_hv_8bpc_neon:  4.61  7.54  5.69  6.35  6.11
mct_8tap_regular_w64_v_8bpc_neon:   7.27  8.06  5.39  20.41  10.15
mct_8tap_regular_w128_0_8bpc_neon:  4.29  8.21  4.28  9.55  1.32
mct_8tap_regular_w128_h_8bpc_neon:  6.01  8.26  4.43  10.78  6.20
mct_8tap_regular_w128_hv_8bpc_neon: 4.49  7.49  5.46  6.11  5.96
mct_8tap_regular_w128_v_8bpc_neon:  6.90  8.00  5.19  18.47  10.13
mct_bilinear_w4_0_8bpc_neon:        2.70  4.53  2.67  3.32  5.11
mct_bilinear_w4_h_8bpc_neon:        3.02  5.06  3.13  3.28  5.38
mct_bilinear_w4_hv_8bpc_neon:       4.14  7.04  4.75  4.99  6.30
mct_bilinear_w4_v_8bpc_neon:        3.17  5.30  3.66  3.87  5.01
mct_bilinear_w8_0_8bpc_neon:        4.41  6.46  2.99  5.74  5.98
mct_bilinear_w8_h_8bpc_neon:        5.36  8.27  3.62  6.39  9.06
mct_bilinear_w8_hv_8bpc_neon:       6.65  11.82  6.79  11.47  7.07
mct_bilinear_w8_v_8bpc_neon:        6.26  9.62  4.05  7.75  16.81
mct_bilinear_w16_0_8bpc_neon:       4.86  9.85  3.61  10.03  4.19
mct_bilinear_w16_h_8bpc_neon:       5.26  12.91  4.76  9.56  9.68
mct_bilinear_w16_hv_8bpc_neon:      6.96  12.58  7.05  13.48  7.35
mct_bilinear_w16_v_8bpc_neon:       6.46  17.94  5.72  13.70  19.20
mct_bilinear_w32_0_8bpc_neon:       5.31  8.10  4.06  10.88  2.77
mct_bilinear_w32_h_8bpc_neon:       6.91  14.28  5.33  11.24  10.33
mct_bilinear_w32_hv_8bpc_neon:      7.13  12.21  7.57  13.91  7.19
mct_bilinear_w32_v_8bpc_neon:       8.06  18.48  5.88  14.74  15.47
mct_bilinear_w64_0_8bpc_neon:       5.08  7.29  3.83  10.44  1.71
mct_bilinear_w64_h_8bpc_neon:       7.24  14.59  5.40  11.70  11.03
mct_bilinear_w64_hv_8bpc_neon:      7.24  11.98  7.59  13.72  7.30
mct_bilinear_w64_v_8bpc_neon:       8.20  18.24  5.69  14.57  15.04
mct_bilinear_w128_0_8bpc_neon:      4.35  8.23  4.17  9.71  1.11
mct_bilinear_w128_h_8bpc_neon:      7.02  13.80  5.63  11.11  11.26
mct_bilinear_w128_hv_8bpc_neon:     6.31  11.89  6.75  12.12  7.24
mct_bilinear_w128_v_8bpc_neon:      6.95  18.26  5.84  11.31  14.78
These cases looped one round too many.
The relative speedup compared to C code is around 4-8x:
                        Cortex A7    A8    A9   A53   A72   A73
wiener_luma_8bpc_neon:       4.00  7.54  4.74  6.84  4.91  8.01
This uses the right registers, corresponding to the ones shifted in the arm64 version.
Speedup vs C code:
                            Cortex A53   A72   A73
cdef_filter_4x4_8bpc_neon:        4.62  4.48  4.76
cdef_filter_4x8_8bpc_neon:        4.82  4.80  5.08
cdef_filter_8x8_8bpc_neon:        5.29  5.33  5.79
This should fix compilation with compilers that default to armv6, such as on Raspbian.
A symbol name starting with two underscores is reserved for the compiler/standard library implementation. Also remove the trailing double underscores for consistency and symmetry.
Speedup vs C code:
                     Cortex A53   A72   A73
cdef_dir_8bpc_neon:        4.43  3.51  4.39
Relative speedup vs C code:
                      Cortex A53   A72   A73
warp_8x8_8bpc_neon:         3.19  2.60  3.66
warp_8x8t_8bpc_neon:        3.09  2.50  3.58
Before:                     Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:       677.4   433.9   452.9
cdef_filter_4x8_8bpc_neon:      1255.0   815.2   841.8
cdef_filter_8x8_8bpc_neon:      2278.5  1440.0  1505.0
After:
cdef_filter_4x4_8bpc_neon:       645.5   401.9   422.5
cdef_filter_4x8_8bpc_neon:      1193.7   756.6   782.4
cdef_filter_8x8_8bpc_neon:      2162.4  1361.9  1375.6
Pad with a value which works both as a large unsigned value and a negative signed value. This allows doing the max operation using signed max, avoiding the conditional altogether. Based on the same idea for x86 by Kyle Siefring.

Before:                     Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:       645.5   401.9   422.5
cdef_filter_4x8_8bpc_neon:      1193.7   756.6   782.4
cdef_filter_8x8_8bpc_neon:      2162.4  1361.9  1375.6
After:
cdef_filter_4x4_8bpc_neon:       596.3   377.8   384.8
cdef_filter_4x8_8bpc_neon:      1097.4   705.5   707.1
cdef_filter_8x8_8bpc_neon:      1967.4  1232.3  1239.9
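To illustrate the padding trick (a Python sketch with hypothetical lane values, not the actual assembly): a 16-bit padding value of 0x8000 reads as 32768 when interpreted unsigned (larger than any pixel, so it never wins an unsigned min) and as -32768 when interpreted signed (smaller than any pixel, so it never wins a signed max). Both reductions therefore ignore padded lanes without any conditional:

```python
PAD = 0x8000  # as u16: 32768 (above any 8-bit pixel); as i16: -32768 (below any)

def as_i16(x):
    """Reinterpret a 16-bit lane as signed, the way smax sees it."""
    return x - 0x10000 if x >= 0x8000 else x

lanes = [12, 200, PAD, 3]  # one padded lane among valid pixel values

# Unsigned min (umin): the padding is the largest unsigned value present,
# so it never affects the minimum.
assert min(lanes) == 3

# Signed max (smax): reinterpreted, the padding is the most negative value,
# so it never affects the maximum either -- no conditional needed.
assert max(as_i16(v) for v in lanes) == 200
```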
This might have said pri_taps[k]/sec_taps[k] at some earlier time.
For cases with indented, nested .if/.macro in asm.S, indent those by 4 chars. Some initial assembly files were indented to 4/16 columns, while all the actual implementation files, starting with src/arm/64/mc.S, have used 8/24 for indentation.
The width register has been set to clz(w)-24, not the other way around. And the 32-bit prep function has the h parameter in r4, not in r5.
This makes it easier to disambiguate these functions when looking at perf profiles.
Relative speedup vs (autovectorized) C code:
                           Cortex A53   A72   A73
selfguided_3x3_8bpc_neon:        2.91  2.12  2.68
selfguided_5x5_8bpc_neon:        3.18  2.65  3.39
selfguided_mix_8bpc_neon:        3.04  2.29  2.98
The relative speedup vs non-vectorized C code is around 2.6-4.6x.
The exact relative speedup compared to C code is a bit vague and hard to measure, depending on exactly how many filtered blocks are skipped, as the NEON version always filters 16 pixels at a time, while the C code can skip processing individual 4 pixel blocks. Additionally, the checkasm benchmarking code runs the same function repeatedly on the same buffer, which can make the filter take different codepaths on each run, as the function updates the buffer which will be used as input for the next run. If tweaking the checkasm test data to try to avoid skipped blocks, the relative speedup compared to C is between 2x and 5x, while it is around 1x to 4x with the current checkasm test as such.

Benchmark numbers from a tweaked checkasm that avoids skipped blocks:
                          Cortex A53     A72     A73
lpf_h_sb_uv_w4_8bpc_c:        2954.7  1399.3  1655.3
lpf_h_sb_uv_w4_8bpc_neon:      895.5   650.8   692.0
lpf_h_sb_uv_w6_8bpc_c:        3879.2  1917.2  2257.7
lpf_h_sb_uv_w6_8bpc_neon:     1125.6   759.5   838.4
lpf_h_sb_y_w4_8bpc_c:         6711.0  3275.5  3913.7
lpf_h_sb_y_w4_8bpc_neon:      1744.0  1342.1  1351.5
lpf_h_sb_y_w8_8bpc_c:        10695.7  6155.8  6638.9
lpf_h_sb_y_w8_8bpc_neon:      2146.5  1560.4  1609.1
lpf_h_sb_y_w16_8bpc_c:       11355.8  6292.0  6995.9
lpf_h_sb_y_w16_8bpc_neon:     2475.4  1949.6  1968.4
lpf_v_sb_uv_w4_8bpc_c:        2639.7  1204.8  1425.9
lpf_v_sb_uv_w4_8bpc_neon:      510.7   351.4   334.7
lpf_v_sb_uv_w6_8bpc_c:        3468.3  1757.1  2021.5
lpf_v_sb_uv_w6_8bpc_neon:      625.0   415.0   397.8
lpf_v_sb_y_w4_8bpc_c:         5428.7  2731.7  3068.5
lpf_v_sb_y_w4_8bpc_neon:      1172.6   792.1   768.0
lpf_v_sb_y_w8_8bpc_c:         8946.1  4412.8  5121.0
lpf_v_sb_y_w8_8bpc_neon:      1565.5  1063.6  1062.7
lpf_v_sb_y_w16_8bpc_c:        8978.9  4411.7  5112.0
lpf_v_sb_y_w16_8bpc_neon:     1775.0  1288.1  1236.7
Hi,
This new patchset continues the merge of dav1d 0.2.0, 0.2.1, 0.2.2, and 0.3.0 assembly into rav1e.
Overview
Between the 0.2.0 and 0.3.0 releases of dav1d, the main additions are:
- CDEF dir and filters
- Improvements to loop restoration filtering
- Smart padding for CDEF
- SGR loop restoration
- Loop filtering

We are skipping integration of the above for now, in accordance with x86.
We will integrate loop restoration, loop filtering, and CDEF once we complete the initial phase of merging up to dav1d 0.5.0 for Arm.
There are around 25 new functions from dav1d 0.2.0..0.3.0 which are merged but not integrated for now:
- rav1e_cdef_padding8_neon
- rav1e_cdef_padding4_neon
- rav1e_cdef_filter8_neon
- rav1e_cdef_filter4_neon
- rav1e_cdef_find_dir_neon
- rav1e_lpf_v_sb_y_neon
- rav1e_lpf_h_sb_y_neon
- rav1e_lpf_v_sb_uv_neon
- rav1e_lpf_h_sb_uv_neon
- rav1e_wiener_filter_h_neon
- rav1e_wiener_filter_v_neon
- rav1e_copy_narrow_neon
- rav1e_sgr_box3_h_neon
- rav1e_sgr_box5_h_neon
- rav1e_sgr_box3_v_neon
- rav1e_sgr_box5_v_neon
- rav1e_sgr_calc_ab1_neon
- rav1e_sgr_calc_ab2_neon
- rav1e_sgr_x_by_x
- rav1e_sgr_finish_filter1_neon
- rav1e_sgr_finish_filter2_neon
- rav1e_sgr_weighted1_neon
- rav1e_sgr_weighted2_neon

The loop restoration cannot be taken for now, as it uses
wiener_filter
while we have an expensive RDO in rav1e; but we can use it for the final calculation of RDO and build on it later. In dav1d 0.3.0 there are also self-guided filters which could be adapted if we do some hacking over them. In an initial assessment of the NEON optimizations, which are well tuned for the A53, we noticed an improvement of ~1-1.5 minutes.
The next PR, which merges dav1d 0.4.0, should be really interesting to see, as we already have those changes in x86.
Reference
#1754