Superbig boost in performance? #60

ghost · 2022-08-13T20:01:39Z

Huge set of performance improvements. Users have reported absurd gains over master in some games, in the range of 30%-90%.

Added per-xexmodule caching of per-instruction information, which can be used to remember what code needs compiling at startup.
Record which guest addresses wrote MMIO and backpropagate that to future runs, eliminating the dependence on exception trapping. This makes many games like h3 actually tolerable to run under a debugger.
Fixed a number of errors where temporaries were being passed by reference/pointer.
Can now be compiled with clang-cl 14.0.1, though this requires -Werror to be off and some other solution/project changes.
Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold.
Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsr/mtmsrd and it seriously cripples AMD CPUs. Removing this yielded around a 3x speedup in Halo Reach for me.
Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it.
Disabled the trace writer in release builds. Despite just returning after checking whether the file was open, the trace functions were consuming about 0.60% of total CPU time.
Added IsValidReg. GetRegisterInfo is a huge (about 45k) branching function, and using it to check whether a register was valid consumed a significant chunk of time.
Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op.
Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x.
For the most frequently called Win32 functions, I added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about or that isn't relevant to our use cases.
^This can be toggled off in the platform_win header.
Handle indirect call true with a constant function pointer; this was occurring in h3.
Look up host format swizzle in a denser array.
By default, don't check whether a GPU register is unknown; instead just check whether it's out of range. Controlled by a cvar.
^Looking up whether it's known or not took approx. 0.3% CPU time.
Changed some things in /cpu to make the project UNITYBUILD friendly.
The timer thread was spinning way too much and consuming a ton of CPU; changed it to use a blocking wait instead.
Tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds).
Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed.
Added support for docdecaduple precision floating point so that we can represent our performance gains numerically.
Tons of other stuff I'm probably forgetting.
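
For readers unfamiliar with the compiler-extension macros and the out-of-range register check mentioned above, here is a rough sketch of how the two ideas can combine. It is not the code from this PR: every `DEMO_`/`Demo` name, the register count of 16, and the MSVC fallbacks are assumptions made for illustration. The hint macros only expand to `__builtin_expect` under clang/GCC (which includes clang-cl), consistent with the note that the tagging only affects clang builds.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrappers; the names used in the PR itself are not shown here.
#if defined(__clang__) || defined(__GNUC__)
#define DEMO_LIKELY(x) __builtin_expect(!!(x), 1)
#define DEMO_UNLIKELY(x) __builtin_expect(!!(x), 0)
#define DEMO_NOINLINE __attribute__((noinline))
#define DEMO_FORCEINLINE inline __attribute__((always_inline))
#define DEMO_COLD __attribute__((cold))
#elif defined(_MSC_VER)
// Plain MSVC: the branch hints degrade to the bare condition.
#define DEMO_LIKELY(x) (!!(x))
#define DEMO_UNLIKELY(x) (!!(x))
#define DEMO_NOINLINE __declspec(noinline)
#define DEMO_FORCEINLINE __forceinline
#define DEMO_COLD
#else
#define DEMO_LIKELY(x) (!!(x))
#define DEMO_UNLIKELY(x) (!!(x))
#define DEMO_NOINLINE
#define DEMO_FORCEINLINE inline
#define DEMO_COLD
#endif

// Assumed size of a toy register file (not a real GPU register count).
constexpr std::uint32_t kDemoRegisterCount = 16;

// Rarely executed path, kept out of line and marked cold where supported.
DEMO_COLD DEMO_NOINLINE bool DemoRejectWrite(std::uint32_t index) {
  (void)index;  // A real handler might log the bad index.
  return false;
}

// Hot path: a single range comparison instead of a per-register lookup.
DEMO_FORCEINLINE bool DemoWriteRegister(std::uint32_t* regs,
                                        std::uint32_t index,
                                        std::uint32_t value) {
  assert(regs != nullptr);
  if (DEMO_UNLIKELY(index >= kDemoRegisterCount)) {
    return DemoRejectWrite(index);
  }
  regs[index] = value;
  return true;
}
```

Calling `DemoWriteRegister(regs, 3, 42)` on a 16-entry array stores the value and returns true, while an index of 16 or more takes the cold path and returns false. Whether the hints actually help depends on the compiler and should be confirmed with a profiler, as was done for this PR.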

…pecific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90%. But for normal msvc builds i would put it at around 30-50%.

Blackbird88 · 2022-08-14T10:54:48Z

Great job!
Before

After (vsync_interval = 8)

t0mtee · 2022-08-15T21:23:48Z

this is insane, rdr is so good now, especially with 1080p through 540p patch and 2x res scale.

ajax-lives · 2022-08-21T02:56:47Z

It's people like you who inspire me to change my major back to software development. Fantastic work, absolutely incredible.

Tested with Red Dead Redemption on a midspec PC and am pushing past 60 FPS with relative ease.

Margen67 · 2022-08-21T15:41:41Z

@Epsilon93 Ask for tech support on the Xenia Discord server, not here: https://discord.gg/Q9mxZf9

Epsilon160 · 2022-08-21T15:42:19Z

Ok thank you for the way ;)

disjtqz added 4 commits August 13, 2022 12:59

revert to using old bad spinwait, disruptorplus' blocking_wait code d…

020d64a

…oes not compile

Add branch of ffmpeg with non-recursive split_radix_permutation

c9e4119

Add branch of disruptorplus with working blocking_wait_stategy Switch back to blocking wait for timer queue

once again return to spinloop

495b1f8

Gliniak merged commit 6bc3191 into xenia-canary:canary_experimental Aug 13, 2022

xenia-canary deleted a comment from ajax-lives Aug 21, 2022

xenia-canary locked as resolved and limited conversation to collaborators Aug 21, 2022
