
Merge dav1d 0.3.0 AArch64 Assembly #1791

Merged: 35 commits, Nov 21, 2019

Conversation

@vibhoothi (Collaborator) commented Oct 22, 2019

Hi,

This patchset advances the addition of the dav1d 0.2.0, 0.2.1, 0.2.2, and 0.3.0 AArch64 assembly to rav1e.

Overview

The main changes from the 0.2.0 release of dav1d up to 0.3.0 are:

  • mc optimisations for the Cortex A53
  • CDEF direction and filter functions
  • Implementations of the warp8x8{,t} functions
  • Improvements to loop restoration filtering
  • Smart padding for CDEF
  • SGR loop restoration
  • Loop filtering

We are skipping integration of the above for now, in accordance with the x86 approach.
We will integrate loop restoration, loop filtering, and CDEF once the initial phase of merging the dav1d ARM assembly up to 0.5.0 is complete.

There are 25 new functions from dav1d 0.2.0..0.3.0 which are merged but not yet integrated:

  • rav1e_cdef_padding8_neon
  • rav1e_cdef_padding4_neon
  • rav1e_cdef_filter8_neon
  • rav1e_cdef_filter4_neon
  • rav1e_cdef_find_dir_neon
  • rav1e_warp_affine_8x8_8bpc_neon
  • rav1e_warp_affine_8x8t_8bpc_neon
  • rav1e_lpf_v_sb_y_neon
  • rav1e_lpf_h_sb_y_neon
  • rav1e_lpf_v_sb_uv_neon
  • rav1e_lpf_h_sb_uv_neon
  • rav1e_wiener_filter_h_neon
  • rav1e_wiener_filter_v_neon
  • rav1e_copy_narrow_neon
  • rav1e_sgr_box3_h_neon
  • rav1e_sgr_box5_h_neon
  • rav1e_sgr_box3_v_neon
  • rav1e_sgr_box5_v_neon
  • rav1e_sgr_calc_ab1_neon
  • rav1e_sgr_calc_ab2_neon
  • rav1e_sgr_x_by_x
  • rav1e_sgr_finish_filter1_neon
  • rav1e_sgr_finish_filter2_neon
  • rav1e_sgr_weighted1_neon
  • rav1e_sgr_weighted2_neon

Loop restoration cannot be integrated yet: it relies on wiener_filter, while rav1e's RDO is expensive. We could still use it for the final RDO calculation and build on it later. dav1d 0.3.0 also adds self-guided filters, which could be adapted with some additional work.
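
For context, a rough Rust sketch of how one of the merged-but-unused entry points could eventually be hooked up on the rav1e side is shown below. This is not rav1e's actual integration code; the extern signature is an assumption modelled on dav1d's cdef_find_dir(img, stride, var) convention.

```rust
// Hypothetical wiring sketch; the signature is assumed from dav1d's
// cdef_find_dir convention and may not match the eventual rav1e integration.
#[cfg(target_arch = "aarch64")]
extern "C" {
    fn rav1e_cdef_find_dir_neon(img: *const u8, stride: isize, var: *mut u32) -> i32;
}

/// Finds the dominant edge direction of an 8x8 block (and its variance),
/// calling the NEON routine when built for AArch64.
#[cfg(target_arch = "aarch64")]
fn cdef_find_dir(img: &[u8], stride: isize, var: &mut u32) -> i32 {
    // Safety: the assembly reads an 8x8 region starting at img[0] with the
    // given byte stride; the caller must guarantee the slice covers it.
    unsafe { rav1e_cdef_find_dir_neon(img.as_ptr(), stride, var) }
}
```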

In an initial assessment of the improved NEON optimisations for the Cortex A53, we noticed an improvement of roughly 1 to 1.5 minutes.

The next PR, which merges dav1d 0.4.0, should be really interesting to see, as we already have the corresponding optimisations for x86.

Reference
#1754

@coveralls (Collaborator) commented Oct 22, 2019

Coverage Status

Coverage remained the same at 75.4% when pulling 944175b on vibhoothiiaanand:src-arm-0.2.0 into 4abaed9 on xiph:master.

@EwoutH (Contributor) commented Oct 22, 2019

Great work again. Is it worth pulling in AArch32 NEON, considering the increase in binary size and compile times, the very low performance of ARMv7 cores, and the declining adoption? I think AArch64 should be the (only) focus for Arm.

@lu-zero (Collaborator) commented Oct 22, 2019

I'd postpone any 32-bit work until the 64-bit codepaths work flawlessly.

@barrbrain (Collaborator) commented:

Having the AArch32 sources available is inconsequential to AArch64 builds, so there's no harm in updating the ARM sources in the same order as upstream. One difference from x86, though, is that the 32-bit and 64-bit sources are completely separate, which makes it easier to skip the 32-bit sources.

@barrbrain (Collaborator) commented:

very low performance of ARMv7 cores and the declining adoption

While armv8 cores are commonly adopted, they generally include support for both AArch32 and AArch64. The adoption of AArch64 software stacks is nowhere near as ubiquitous as armv8. This is at least partly due to the large increase in code size from AArch32 to AArch64. So for the near future, there are likely applications for AArch32 on armv8.

@vibhoothi force-pushed the src-arm-0.2.0 branch 2 times, most recently from 95fbd88 to 71c1df0, on October 23, 2019 17:19
@vibhoothi changed the title from "[WIP] Merge dav1d 0.2.0 AArch64 Assembly" to "[WIP] Merge dav1d 0.2.1 AArch64 Assembly" on Oct 23, 2019
@barrbrain (Collaborator) left a comment

For testing purposes, we can match the constant stride in assembly to the buffer used by the encoder.

Two review threads on src/arm/64/cdef.S (resolved)
@EwoutH (Contributor) commented Nov 1, 2019

What's the status of this PR? Need any help testing?

@vibhoothi force-pushed the src-arm-0.2.0 branch 3 times, most recently from cfd3898 to 8f727d7, on November 20, 2019 18:05
@vibhoothi changed the title from "[WIP] Merge dav1d 0.2.1 AArch64 Assembly" to "[WIP] Merge dav1d 0.3.0 AArch64 Assembly" on Nov 20, 2019
@vibhoothi changed the title from "[WIP] Merge dav1d 0.3.0 AArch64 Assembly" to "Merge dav1d 0.3.0 AArch64 Assembly" on Nov 20, 2019
@vibhoothi (Collaborator, Author) commented Nov 20, 2019

So,

  • Please ignore the merge request branch name.
  • This PR advances the merge process. We do not see any major gains yet, only very minor ones: encoding 5 frames (720p 4:2:0) was roughly 10 seconds faster, around a 0.01 FPS gain; with more detailed testing we should see more.
  • It sets up the expected future gains.

Review thread on src/arm/64/mc.S (outdated, resolved)
@barrbrain (Collaborator) left a comment

Awesome, this matches the changes in dav1d from 0.1.0 to 0.3.0.
Thank you for catching the warnings in src/arm/tables.S.

vibhoothi and others added 9 commits November 21, 2019 08:14
On COFF, the default read-only data section is `.rdata`, not `.rodata`.
On arm, there's no ubfm instruction, only ubfx.
Before:                      Cortex A53   Snapdragon 835
mc_8tap_regular_w2_v_8bpc_neon:   155.1   131.8
mc_8tap_regular_w4_v_8bpc_neon:   199.6   148.1
mc_8tap_regular_w8_v_8bpc_neon:   286.2   225.5
After:
mc_8tap_regular_w2_v_8bpc_neon:   134.1   129.5
mc_8tap_regular_w4_v_8bpc_neon:   157.6   146.5
mc_8tap_regular_w8_v_8bpc_neon:   208.0   225.0
Before:                       Cortex A53   Snapdragon 835
mc_8tap_regular_w2_hv_8bpc_neon:   415.0   286.9
After:
mc_8tap_regular_w2_hv_8bpc_neon:   399.1   269.9
Before:                       Cortex A53   Snapdragon 835
mc_8tap_regular_w4_hv_8bpc_neon:   543.6   359.1
After:
mc_8tap_regular_w4_hv_8bpc_neon:   466.7   355.5

The same kind of change doesn't seem to give any benefit on the 8-pixel-wide hv filtering though, potentially because it uses not only smull/smlal but also smull2/smlal2.
mstorsjo and others added 26 commits November 21, 2019 08:14
Relative speedups measured with checkasm:
                                 Cortex A7     A8     A9    A53   Snapdragon 835
mc_8tap_regular_w2_0_8bpc_neon:       9.63   4.05   3.82   5.41   5.68
mc_8tap_regular_w2_h_8bpc_neon:       3.30   5.44   3.38   3.88   5.12
mc_8tap_regular_w2_hv_8bpc_neon:      3.86   6.21   4.39   5.18   6.10
mc_8tap_regular_w2_v_8bpc_neon:       4.69   5.43   3.56   7.27   4.86
mc_8tap_regular_w4_0_8bpc_neon:       9.13   4.05   5.24   5.37   6.60
mc_8tap_regular_w4_h_8bpc_neon:       4.38   7.11   4.61   6.59   7.15
mc_8tap_regular_w4_hv_8bpc_neon:      5.11   9.77   7.37   9.21  10.29
mc_8tap_regular_w4_v_8bpc_neon:       6.24   7.88   4.96  11.16   7.89
mc_8tap_regular_w8_0_8bpc_neon:       9.12   4.20   5.59   5.59   9.25
mc_8tap_regular_w8_h_8bpc_neon:       5.91   8.42   4.84   8.46   7.08
mc_8tap_regular_w8_hv_8bpc_neon:      5.46   8.35   6.52   7.19   8.33
mc_8tap_regular_w8_v_8bpc_neon:       7.53   8.96   6.28  16.08  10.66
mc_8tap_regular_w16_0_8bpc_neon:      9.77   5.46   4.06   7.02   7.38
mc_8tap_regular_w16_h_8bpc_neon:      6.33   8.87   5.03  10.30   4.29
mc_8tap_regular_w16_hv_8bpc_neon:     5.00   7.84   6.15   6.83   7.44
mc_8tap_regular_w16_v_8bpc_neon:      7.74   8.81   6.23  19.24  11.16
mc_8tap_regular_w32_0_8bpc_neon:      6.11   4.63   2.44   5.92   4.70
mc_8tap_regular_w32_h_8bpc_neon:      6.60   9.02   5.20  11.08   3.50
mc_8tap_regular_w32_hv_8bpc_neon:     4.85   7.64   6.09   6.68   6.92
mc_8tap_regular_w32_v_8bpc_neon:      7.61   8.36   6.13  19.94  11.17
mc_8tap_regular_w64_0_8bpc_neon:      4.61   3.81   1.60   3.50   2.73
mc_8tap_regular_w64_h_8bpc_neon:      6.72   9.07   5.21  11.41   3.10
mc_8tap_regular_w64_hv_8bpc_neon:     4.67   7.43   5.92   6.43   6.59
mc_8tap_regular_w64_v_8bpc_neon:      7.64   8.28   6.07  20.48  11.41
mc_8tap_regular_w128_0_8bpc_neon:     2.41   3.13   1.11   2.31   1.73
mc_8tap_regular_w128_h_8bpc_neon:     6.68   9.03   5.09  11.41   2.90
mc_8tap_regular_w128_hv_8bpc_neon:    4.50   7.39   5.70   6.26   6.47
mc_8tap_regular_w128_v_8bpc_neon:     7.21   8.23   5.88  19.82  11.42
mc_bilinear_w2_0_8bpc_neon:           9.23   4.03   3.74   5.33   6.49
mc_bilinear_w2_h_8bpc_neon:           2.07   3.52   2.71   2.35   3.40
mc_bilinear_w2_hv_8bpc_neon:          2.60   5.24   2.73   2.74   3.89
mc_bilinear_w2_v_8bpc_neon:           2.57   4.39   3.14   3.04   4.05
mc_bilinear_w4_0_8bpc_neon:           8.74   4.03   5.38   5.28   6.53
mc_bilinear_w4_h_8bpc_neon:           3.41   6.22   4.28   3.86   7.56
mc_bilinear_w4_hv_8bpc_neon:          4.38   7.45   4.61   5.26   7.95
mc_bilinear_w4_v_8bpc_neon:           3.65   6.57   4.51   4.45   7.62
mc_bilinear_w8_0_8bpc_neon:           8.74   4.50   5.71   5.46   9.39
mc_bilinear_w8_h_8bpc_neon:           6.14  10.71   6.78   6.88  14.10
mc_bilinear_w8_hv_8bpc_neon:          7.11  12.80   8.24  11.08   7.83
mc_bilinear_w8_v_8bpc_neon:           7.24  11.69   7.57   8.04  15.46
mc_bilinear_w16_0_8bpc_neon:         10.01   5.47   4.07   6.97   7.64
mc_bilinear_w16_h_8bpc_neon:          8.36  17.00   8.34  11.61   7.64
mc_bilinear_w16_hv_8bpc_neon:         7.67  13.54   8.53  13.32   8.05
mc_bilinear_w16_v_8bpc_neon:         10.19  22.56  10.52  15.39  10.62
mc_bilinear_w32_0_8bpc_neon:          6.22   4.73   2.43   5.89   4.90
mc_bilinear_w32_h_8bpc_neon:          9.47  18.96   9.34  13.10   7.24
mc_bilinear_w32_hv_8bpc_neon:         7.95  13.15   9.49  13.78   8.71
mc_bilinear_w32_v_8bpc_neon:         11.10  23.53  11.34  16.74   8.78
mc_bilinear_w64_0_8bpc_neon:          4.58   3.82   1.59   3.46   2.71
mc_bilinear_w64_h_8bpc_neon:         10.07  19.77   9.60  13.99   6.88
mc_bilinear_w64_hv_8bpc_neon:         8.08  12.95   9.39  13.84   8.90
mc_bilinear_w64_v_8bpc_neon:         11.49  23.85  11.12  17.13   7.90
mc_bilinear_w128_0_8bpc_neon:         2.37   3.24   1.15   2.28   1.73
mc_bilinear_w128_h_8bpc_neon:         9.94  18.84   8.66  13.91   6.74
mc_bilinear_w128_hv_8bpc_neon:        7.26  12.82   8.97  12.43   8.88
mc_bilinear_w128_v_8bpc_neon:         9.89  23.88   8.93  14.73   7.33
mct_8tap_regular_w4_0_8bpc_neon:      2.82   4.46   2.72   3.50   5.41
mct_8tap_regular_w4_h_8bpc_neon:      4.16   6.88   4.64   6.51   6.60
mct_8tap_regular_w4_hv_8bpc_neon:     5.22   9.87   7.81   9.39  10.11
mct_8tap_regular_w4_v_8bpc_neon:      5.81   7.72   4.80  10.16   6.85
mct_8tap_regular_w8_0_8bpc_neon:      4.48   6.30   3.01   5.82   5.04
mct_8tap_regular_w8_h_8bpc_neon:      5.59   8.04   4.18   8.68   8.30
mct_8tap_regular_w8_hv_8bpc_neon:     5.34   8.32   6.42   7.04   7.99
mct_8tap_regular_w8_v_8bpc_neon:      7.32   8.71   5.75  17.07   9.73
mct_8tap_regular_w16_0_8bpc_neon:     5.05   9.60   3.64  10.06   4.29
mct_8tap_regular_w16_h_8bpc_neon:     5.53   8.20   4.54   9.98   7.33
mct_8tap_regular_w16_hv_8bpc_neon:    4.90   7.87   6.07   6.67   7.03
mct_8tap_regular_w16_v_8bpc_neon:     7.39   8.55   5.72  19.64   9.98
mct_8tap_regular_w32_0_8bpc_neon:     5.28   8.16   4.07  11.03   2.38
mct_8tap_regular_w32_h_8bpc_neon:     5.97   8.31   4.67  10.63   6.72
mct_8tap_regular_w32_hv_8bpc_neon:    4.73   7.65   5.98   6.51   6.31
mct_8tap_regular_w32_v_8bpc_neon:     7.33   8.18   5.72  20.50  10.03
mct_8tap_regular_w64_0_8bpc_neon:     5.11   9.19   4.01  10.61   1.92
mct_8tap_regular_w64_h_8bpc_neon:     6.05   8.33   4.53  10.84   6.38
mct_8tap_regular_w64_hv_8bpc_neon:    4.61   7.54   5.69   6.35   6.11
mct_8tap_regular_w64_v_8bpc_neon:     7.27   8.06   5.39  20.41  10.15
mct_8tap_regular_w128_0_8bpc_neon:    4.29   8.21   4.28   9.55   1.32
mct_8tap_regular_w128_h_8bpc_neon:    6.01   8.26   4.43  10.78   6.20
mct_8tap_regular_w128_hv_8bpc_neon:   4.49   7.49   5.46   6.11   5.96
mct_8tap_regular_w128_v_8bpc_neon:    6.90   8.00   5.19  18.47  10.13
mct_bilinear_w4_0_8bpc_neon:          2.70   4.53   2.67   3.32   5.11
mct_bilinear_w4_h_8bpc_neon:          3.02   5.06   3.13   3.28   5.38
mct_bilinear_w4_hv_8bpc_neon:         4.14   7.04   4.75   4.99   6.30
mct_bilinear_w4_v_8bpc_neon:          3.17   5.30   3.66   3.87   5.01
mct_bilinear_w8_0_8bpc_neon:          4.41   6.46   2.99   5.74   5.98
mct_bilinear_w8_h_8bpc_neon:          5.36   8.27   3.62   6.39   9.06
mct_bilinear_w8_hv_8bpc_neon:         6.65  11.82   6.79  11.47   7.07
mct_bilinear_w8_v_8bpc_neon:          6.26   9.62   4.05   7.75  16.81
mct_bilinear_w16_0_8bpc_neon:         4.86   9.85   3.61  10.03   4.19
mct_bilinear_w16_h_8bpc_neon:         5.26  12.91   4.76   9.56   9.68
mct_bilinear_w16_hv_8bpc_neon:        6.96  12.58   7.05  13.48   7.35
mct_bilinear_w16_v_8bpc_neon:         6.46  17.94   5.72  13.70  19.20
mct_bilinear_w32_0_8bpc_neon:         5.31   8.10   4.06  10.88   2.77
mct_bilinear_w32_h_8bpc_neon:         6.91  14.28   5.33  11.24  10.33
mct_bilinear_w32_hv_8bpc_neon:        7.13  12.21   7.57  13.91   7.19
mct_bilinear_w32_v_8bpc_neon:         8.06  18.48   5.88  14.74  15.47
mct_bilinear_w64_0_8bpc_neon:         5.08   7.29   3.83  10.44   1.71
mct_bilinear_w64_h_8bpc_neon:         7.24  14.59   5.40  11.70  11.03
mct_bilinear_w64_hv_8bpc_neon:        7.24  11.98   7.59  13.72   7.30
mct_bilinear_w64_v_8bpc_neon:         8.20  18.24   5.69  14.57  15.04
mct_bilinear_w128_0_8bpc_neon:        4.35   8.23   4.17   9.71   1.11
mct_bilinear_w128_h_8bpc_neon:        7.02  13.80   5.63  11.11  11.26
mct_bilinear_w128_hv_8bpc_neon:       6.31  11.89   6.75  12.12   7.24
mct_bilinear_w128_v_8bpc_neon:        6.95  18.26   5.84  11.31  14.78
The relative speedup compared to C code is around 4-8x:

                    Cortex A7     A8     A9    A53    A72    A73
wiener_luma_8bpc_neon:   4.00   7.54   4.74   6.84   4.91   8.01
This uses the right registers, corresponding to the ones shifted
in the arm64 version.
Speedup vs C code:     Cortex A53    A72    A73
cdef_filter_4x4_8bpc_neon:   4.62   4.48   4.76
cdef_filter_4x8_8bpc_neon:   4.82   4.80   5.08
cdef_filter_8x8_8bpc_neon:   5.29   5.33   5.79
This should fix compilation with compilers that default to armv6, such as on Raspbian.
A symbol starting with two leading underscores is reserved for
the compiler/standard library implementation.

Also remove the two trailing double underscores, for consistency and symmetry.
Speedup vs C code:
                Cortex A53    A72    A73
cdef_dir_8bpc_neon:   4.43   3.51   4.39
Relative speedup vs C code:
                 Cortex A53    A72    A73
warp_8x8_8bpc_neon:    3.19   2.60   3.66
warp_8x8t_8bpc_neon:   3.09   2.50   3.58
Before:                  Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:    677.4   433.9   452.9
cdef_filter_4x8_8bpc_neon:   1255.0   815.2   841.8
cdef_filter_8x8_8bpc_neon:   2278.5  1440.0  1505.0
After:
cdef_filter_4x4_8bpc_neon:    645.5   401.9   422.5
cdef_filter_4x8_8bpc_neon:   1193.7   756.6   782.4
cdef_filter_8x8_8bpc_neon:   2162.4  1361.9  1375.6
Pad with a value which works both as a large unsigned value and a
negative signed value. This allows doing the max operation using
signed max, avoiding the conditional altogether.

Based on the same idea for x86 by Kyle Siefring.

Before:                  Cortex A53     A72     A73
cdef_filter_4x4_8bpc_neon:    645.5   401.9   422.5
cdef_filter_4x8_8bpc_neon:   1193.7   756.6   782.4
cdef_filter_8x8_8bpc_neon:   2162.4  1361.9  1375.6
After:
cdef_filter_4x4_8bpc_neon:    596.3   377.8   384.8
cdef_filter_4x8_8bpc_neon:   1097.4   705.5   707.1
cdef_filter_8x8_8bpc_neon:   1967.4  1232.3  1239.9
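
As an aside, the padding trick described above can be illustrated in a few lines of plain (non-NEON) Rust. Padding with a value such as 0x8000 means it is huge when read as unsigned and the most negative value when read as signed, so an unsigned min and a signed max both ignore padded pixels without any conditional; the exact constant used by the assembly is not quoted here.

```rust
// Illustration only: 0x8000 is 32768 as u16 (never the minimum of 8-bit
// pixel data) and -32768 as i16 (never the maximum), so both reductions
// can include padded positions unconditionally.
const PADDING: u16 = 0x8000;

fn clip_range(window: &[u16]) -> (u16, i16) {
    let mut min = u16::MAX; // unsigned min: padding never lowers it
    let mut max = i16::MIN; // signed max: padding never raises it
    for &px in window {
        min = min.min(px);
        max = max.max(px as i16);
    }
    (min, max)
}

fn main() {
    let window = [12u16, 250, PADDING, 7, PADDING, 199];
    assert_eq!(clip_range(&window), (7, 250));
}
```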
This might have said pri_taps[k]/sec_taps[k] at some earlier time.
For cases with indented, nested .if/.macro in asm.S, indent those by 4 chars.

Some initial assembly files were indented to 4/16 columns, while all
the actual implementation files, starting with src/arm/64/mc.S, have
used 8/24 for indentation.
The width register has been set to clz(w)-24, not the other way around. And the 32-bit prep function has the h parameter in r4, not in r5.
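
For reference, a quick check of what clz(w)-24 evaluates to for the mc block widths, assuming a 32-bit clz (that the result indexes a per-width table is an assumption for illustration, not stated in the commit message):

```rust
fn main() {
    // clz(w) - 24 for the block widths 2..128, assuming a 32-bit clz.
    for w in [2u32, 4, 8, 16, 32, 64, 128] {
        println!("w = {:3} -> clz(w) - 24 = {}", w, w.leading_zeros() - 24);
    }
    // Prints 6, 5, 4, 3, 2, 1, 0: a compact descending index per width.
}
```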
This eases disambiguating these functions when looking at perf
profiles.
Relative speedup vs (autovectorized) C code:
                      Cortex A53    A72    A73
selfguided_3x3_8bpc_neon:   2.91   2.12   2.68
selfguided_5x5_8bpc_neon:   3.18   2.65   3.39
selfguided_mix_8bpc_neon:   3.04   2.29   2.98

The relative speedup vs non-vectorized C code is around 2.6-4.6x.
The exact relative speedup compared to C code is a bit vague and hard
to measure, depending on exactly how many filtered blocks are skipped,
as the NEON version always filters 16 pixels at a time, while the
C code can skip processing individual 4-pixel blocks.

Additionally, the checkasm benchmarking code runs the same function
repeatedly on the same buffer, which can make the filter take
different codepaths on each run, as the function updates the buffer
which will be used as input for the next run.

If the checkasm test data is tweaked to try to avoid skipped blocks,
the relative speedup compared to C is between 2x and 5x, while
it is around 1x to 4x with the current checkasm test as such.

Benchmark numbers from a tweaked checkasm that avoids skipped
blocks:

                        Cortex A53     A72     A73
lpf_h_sb_uv_w4_8bpc_c:      2954.7  1399.3  1655.3
lpf_h_sb_uv_w4_8bpc_neon:    895.5   650.8   692.0
lpf_h_sb_uv_w6_8bpc_c:      3879.2  1917.2  2257.7
lpf_h_sb_uv_w6_8bpc_neon:   1125.6   759.5   838.4
lpf_h_sb_y_w4_8bpc_c:       6711.0  3275.5  3913.7
lpf_h_sb_y_w4_8bpc_neon:    1744.0  1342.1  1351.5
lpf_h_sb_y_w8_8bpc_c:      10695.7  6155.8  6638.9
lpf_h_sb_y_w8_8bpc_neon:    2146.5  1560.4  1609.1
lpf_h_sb_y_w16_8bpc_c:     11355.8  6292.0  6995.9
lpf_h_sb_y_w16_8bpc_neon:   2475.4  1949.6  1968.4
lpf_v_sb_uv_w4_8bpc_c:      2639.7  1204.8  1425.9
lpf_v_sb_uv_w4_8bpc_neon:    510.7   351.4   334.7
lpf_v_sb_uv_w6_8bpc_c:      3468.3  1757.1  2021.5
lpf_v_sb_uv_w6_8bpc_neon:    625.0   415.0   397.8
lpf_v_sb_y_w4_8bpc_c:       5428.7  2731.7  3068.5
lpf_v_sb_y_w4_8bpc_neon:    1172.6   792.1   768.0
lpf_v_sb_y_w8_8bpc_c:       8946.1  4412.8  5121.0
lpf_v_sb_y_w8_8bpc_neon:    1565.5  1063.6  1062.7
lpf_v_sb_y_w16_8bpc_c:      8978.9  4411.7  5112.0
lpf_v_sb_y_w16_8bpc_neon:   1775.0  1288.1  1236.7