-
-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Superbig boost in performance? #60
Conversation
…pecific build and clang-cl users have reported absurd gains over master for some gains, in the range 50%-90% But for normal msvc builds i would put it at around 30-50% Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger fixed a number of errors where temporaries were being passed by reference/pointer Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes. Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold. Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me. Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases ^this can be toggled off in the platform_win header handle indirect call true with constant function pointer, was occurring in h3 lookup host format swizzle in denser array by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar ^looking up whether its known or not took approx 0.3% cpu time Changed some things in /cpu to make the project UNITYBUILD friendly The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds) Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed added support for docdecaduple precision floating point so that we can represent our performance gains numerically tons of other stuff im probably forgetting
Add branch of disruptorplus with working blocking_wait_stategy Switch back to blocking wait for timer queue
this is insane, rdr is so good now, especially with 1080p through 540p patch and 2x res scale. |
Its people like you who inspire me to change my major back to software development. Fantastic work, absolutely incredible. Tested with Red Dead Redemption on a midspec PC and am pushing past 60 FPS with relative ease. |
This comment was marked as off-topic.
This comment was marked as off-topic.
@Epsilon93 Ask for tech support on the Xenia Discord server, not here: https://discord.gg/Q9mxZf9 |
Ok thank you for the way ;) |
Huge set of performance improvements, users have reported absurd gains over master for some gains, in the range 30%-90%
Added per-xexmodule caching of information per instruction, can be used to remember what code needs compiling at start up
Record what guest addresses wrote mmio and backpropagate that to future runs, eliminating dependence on exception trapping. this makes many games like h3 actually tolerable to run under a debugger
fixed a number of errors where temporaries were being passed by reference/pointer
Can now be compiled with clang-cl 14.0.1, requires -Werror off though and some other solution/project changes.
Added macros wrapping compiler extensions like noinline, forceinline, __expect, and cold.
Removed the "global lock" in guest code completely. It does not properly emulate the behavior of mfmsrd/mtmsr and it seriously cripples amd cpus. Removing this yielded around a 3x speedup in Halo Reach for me.
Disabled the microprofiler for now. The microprofiler has a huge performance cost associated with it. Developers can re-enable it in the base/profiling header if they really need it
Disable the trace writer in release builds. despite just returning after checking if the file was open the trace functions were consuming about 0.60% cpu time total
Add IsValidReg, GetRegisterInfo is a huge (about 45k) branching function and using that to check if a register was valid consumed a significant chunk of time
Optimized RingBuffer::ReadAndSwap and RingBuffer::read_count. This gave us the largest overall boost in performance. The memcpies were unnecessary and one of them was always a no-op
Added simplification rules for multiplicative patterns like (x+x), (x<<1)+x
For the most frequently called win32 functions i added code to call their underlying NT implementations, which lets us skip a lot of MS code we don't care about/isnt relevant to our usecases
^this can be toggled off in the platform_win header
handle indirect call true with constant function pointer, was occurring in h3
lookup host format swizzle in denser array
by default, don't check if a gpu register is unknown, instead just check if its out of range. controlled by a cvar
^looking up whether its known or not took approx 0.3% cpu time
Changed some things in /cpu to make the project UNITYBUILD friendly
The timer thread was spinning way too much and consuming a ton of cpu, changed it to use a blocking wait instead
tagged some conditions as XE_UNLIKELY/LIKELY based on profiler feedback (will only affect clang builds)
Shifted around some code in CommandProcessor::WriteRegister based on how frequently it was executed
added support for docdecaduple precision floating point so that we can represent our performance gains numerically
tons of other stuff im probably forgetting