Merge dav1d assembly for AArch64, by release #1754

Closed
9 tasks done
vibhoothi opened this issue Oct 11, 2019 · 4 comments · Fixed by #1868
Assignees
Labels
hacktoberfest SIMD Architecture-specific SIMD optimization speed performance

Comments

@vibhoothi
Collaborator

vibhoothi commented Oct 11, 2019

Releases of dav1d

Extract dav1d subtree history

# Create one branch per dav1d release tag holding only the src/arm/64 history
for TAG in $(git -C dav1d tag -l)
do
    git -C dav1d subtree split -P src/arm/64 -b "src-arm-64-$TAG" "$TAG"
done

Apply to rav1e, restoring the prefix

git -C dav1d format-patch --no-stat --keep-subject --stdout --full-index -M100% \
--src-prefix=a/src/arm/64/ --dst-prefix=b/src/arm/64/ ..src-arm-64-0.1.0 -- '*.S' |
git -C rav1e am --keep
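
Since the merge is meant to go release by release, the two steps above could be chained per release. The following is only a minimal sketch, assuming the `src-arm-64-$TAG` branches from the split step already exist, that tags arrive oldest first, and that each split branch contains the history of the previous one. The helper name `release_ranges` is hypothetical and is not part of git, dav1d, or rav1e.

```shell
# Hypothetical helper: read release tags (one per line, oldest first) on stdin
# and print one revision range per consecutive pair of split branches.
release_ranges() {
    prev=""
    while IFS= read -r tag; do
        [ -n "$tag" ] || continue    # skip blank lines
        if [ -n "$prev" ]; then
            printf 'src-arm-64-%s..src-arm-64-%s\n' "$prev" "$tag"
        fi
        prev="$tag"
    done
}
```

Each printed range could then stand in for `..src-arm-64-0.1.0` in the `format-patch` command, for example by feeding the helper from `git -C dav1d tag -l --sort=version:refname`. The first release still needs the original single-ended range, because it has no predecessor.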

Reference #1750

@shssoichiro
Collaborator

This process will be a little bit different since we don't have ARM assembly in rav1e yet. But adding dav1d's ARM assembly is a great plan!

@barrbrain barrbrain added this to To do in Hacktoberfest 2019 via automation Oct 11, 2019
@vibhoothi
Collaborator Author

@shssoichiro Yeah,
we have been trying to make the build system better for AArch64,
and with this we are slowly bringing in the dav1d ARM assembly.

@EwoutH
Contributor

EwoutH commented Oct 14, 2019

@vibhoothiiaanand Great work, it would be amazing if rav1e works faster on ARMv8 CPUs, especially as rav1e's multithreading gets better.

@shssoichiro shssoichiro added SIMD Architecture-specific SIMD optimization speed performance labels Oct 14, 2019
@barrbrain barrbrain moved this from To do to In progress in Hacktoberfest 2019 Oct 23, 2019
@vibhoothi
Collaborator Author

vibhoothi commented Nov 19, 2019

Ok,

Here are the new updates from this week

We tried to incorporate the CDEF changes, but it is not easily doable: we would have to rework the memory buffers to support a variable stride for the input buffer, which would take time and become a blocker. So the plan now is to bring to AArch64 all the changes that are already integrated for x86, get them to pass all tests, and then run a benchmark.

Here is the summary of changes from upstream

  • 0.2.0 has CDEF, ARM optimisations for MC, LR improvements
  • 0.2.1 has smart padding for CDEF
  • 0.2.2 has SGR loop restoration, loop filtering
  • 0.3.0 has no changes for src/arm/
  • 0.3.1 has cold attribute changes and msac_decode_symbol_adapt
  • 0.4.0 has msac_decode_bool, blend, w_mask, inv_txfm_add
  • 0.5.0 has msac optimisations, blend_h, w_mask, intra_pred_(dc/h/v), paeth, smooth, palette, filter, cfl_pred, cfl_ac
  • 0.5.1 has loop restoration improvements

So the interesting parts start from 0.4.0.
Below is the list of commits from 0.2.0 through 0.5.0. Commits that are relevant for integration are in green (marked +); commits that will be added but not integrated are in red (marked -). The selection follows what has already been integrated for x86 from dav1d.

- arm64: looprestoration: Minimal scheduling improvements
- arm64: looprestoration: Fix a typo  …
- arm64: looprestoration: Fix register references in comments
- arm64: looprestoration: Use ld2r instead of ld1+dup+dup
+ arm64: ipred: Make sure all symbols are aligned 
+ arm: util: Split movrel into movrel and movrel_local
- arm64: ipred: NEON implementation of the cfl_ac functions  
+ arm64: ipred: NEON implementation of the cfl_pred functions  
- arm64: ipred: NEON implementation of the filter function  
- arm64: ipred: NEON implementation of palette prediction  
+ arm64: ipred: NEON implementation of smooth prediction  
+ arm64: ipred: NEON implementation of paeth prediction  
+ arm64: mc: Use addp instead of addv+trn1 in warp  
- arm64: cdef: Improve find_dir  
- arm64: cdef: Calculate two initial parameters in the same vector  
- arm64: cdef: Use loads with postincrement in more places in the padding function
- arm64: cdef: Rewrite an expression slightly  
+ arm64: mc: Schedule instructions better in the warp8x8 functions  
+ arm64: mc: Use sbfx instead of ubfx+sxth in the warp function
+ arm64: ipred: NEON implementation of dc/h/v prediction modes  
+ arm64: itx: Fix overflows in idct  
+ arm64: itx: Consistently use the factor 2896 in adst  
+ arm64: itx: Use smull+smlal instead of addl+mul  
+ arm64: itx: Do the final calculation of adst4/adst8/adst16 in 32 bit to avoid too narrow clipping  
- arm64: mc: NEON implementation of w_mask_444/422/420 function  
- arm64: mc: NEON implementation of blend, blend_h and blend_v function  
- Add msac optimizations  
+ arm64: itx: Add NEON optimized inverse transforms  
+ arm64: Consistently name macro arguments tX for temporaries in transposes
- arm64: msac: Add handwritten versions of msac_decode_bool functions  
- arm64: msac: Fix a typo in a comment
- Add __attribute__((cold)) to rarely used functions
- arm64: remove invalid macro argument delimiter
- arm64: msac: Implement NEON msac_decode_symbol_adapt  
- arm64: loopfilter: Implement NEON loop filters  
- arm64: looprestoration: Add a NEON implementation of SGR  
- arm64: cdef: Clarify a slightly confusing comment  
- arm64: cdef: Use a smarter padding constant  
- arm64: cdef: Do saturating subtractions to avoid max operations with 0  
+ fix dav1d spelling
- arm64/ios: use prefixed dav1d_mc_warp_filter symbol
- arm64: mc: NEON implementation of warp8x8{,t}  
- arm64: cdef: NEON implementation of the dir function  
- arm64: cdef: NEON optimized cdef filter function  
- arm64: looprestoration: Optimize loop termination checks in copy_narrow_neon
- arm64: looprestoration: Simplify the horizontal filtering of one pixel at a time
- arm64: looprestoration: Simplify the setup of wiener_filter_v_neon
- arm64: looprestoration: Fix the loop condition in copy_narrow_neon  
- arm64: looprestoration: Fix comment typos
- arm64: looprestoration: Avoid unnecessary alignment of the mid buffer  
+ arm64: mc: Optimize mc_8tap_regular_w4_hv_8bpc for A53  
+ arm64: mc: Simplify the 8tap_2w_hv code slightly  
+ arm64: mc: Optimize the mul_mla_8_* macros for Cortex A53  
+ arm64: mc: Improve a comment
+ arm64: mc: Remove unused/unnecessary macro args
- arm64: mc: Use ubfx instead of ubfm, for consistency with arm  
- arm64: looprestoration: NEON optimized wiener filter  
- arm64: mc: Implement 8tap and bilin functions
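
To see how the list above splits between the two groups, the leading marker can be counted once the list is saved to a text file. This is only a small sketch: saving the list to a file and the function names `count_integrated` and `count_skipped` are assumptions made for illustration.

```shell
# Hypothetical helpers: count list entries by their leading diff marker.
# $1 is the path to a text file holding the commit list, one entry per line.
count_integrated() { grep -c '^+ ' "$1"; }   # entries marked "+" (to be integrated)
count_skipped()    { grep -c '^- ' "$1"; }   # entries marked "-" (added, not integrated)
```

Calling `count_integrated commits.txt` prints the number of entries marked for integration, where `commits.txt` is whatever file the list was saved to.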

Hacktoberfest 2019 automation moved this from In progress to Done Nov 23, 2019