Optimize Thumbulator #945

Open
thrust26 opened this issue Nov 26, 2022 · 0 comments
thrust26 (Member) commented Nov 26, 2022

In the StellaDS thread on AtariAge, member llabnip made some suggestions on how to speed up the Thumbulator class quite significantly (almost 40% faster on the Nintendo DS):

  • Not calling into execute() for each Thumb instruction - the overhead of the call was not optimized away by GCC even at the maximum settings, so I moved the handling of the Thumb loop inside execute(). (implemented with 025de6e)
  • Keeping a 16-bit pointer always pointing at the next instruction rather than re-indexing into the Thumb instruction ROM array. I don't even bother to update the PC register until it's needed (it is easy to back-calculate from the 16-bit Thumb instruction pointer and the start of ROM). (together with the previous item, see the first sketch after this list)
  • The biggest improvement in speed came from simply using the top 2 bits of the Thumb instruction to binary-parse it. So I check if the high bit is set - that puts the instruction into one of two buckets. Then I check the next bit down to split those two buckets into two further buckets each. This way I only have to check the instructions in each bucket, which really shortens the long search for the opcode. Then I did some profiling and found the popular instructions, which were often several orders of magnitude more likely to be called, and check them first in each bucket (e.g. ADD with a big immediate into one register, CMP immediate, conditional branch, etc.). (Stella decodes opcodes only once, which is even better; see the second sketch after this list)
  • The conditional branch is heavily used in most programs - Galagon executes it about 200k times per second. Since the 8-bit decode table (256 possible entries) only holds roughly 72 opcodes, some of the most heavily used opcodes could be split further during decoding. The conditional branch, for example, could be split into the 13 different types (branch if zero, branch if not zero, etc.). This would just add to the opcode count but would save the shift, AND and switch for that instruction. (implemented with 96d5a3f; see the third sketch after this list)
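
A minimal sketch of how the first two items could look in C++ follows. This is not the actual StellaDS or Stella code; all names (ThumbCoreSketch, myInstrPtr, etc.) are invented for illustration:

```cpp
// Sketch of ideas 1 and 2: the fetch/dispatch loop lives inside execute(),
// and a pointer to 16-bit Thumb words tracks the next instruction; the PC
// register is only back-calculated when it is actually needed.
#include <cstdint>
#include <cstddef>

class ThumbCoreSketch
{
  public:
    ThumbCoreSketch(const uint16_t* rom, size_t words, uint32_t romBase)
      : myRom{rom}, myRomEnd{rom + words}, myInstrPtr{rom}, myRomBase{romBase} { }

    void execute(uint32_t maxInstructions)
    {
      // One call runs many instructions, so the per-instruction call overhead
      // (which GCC did not optimize away) is paid only once per batch.
      while(maxInstructions-- && myInstrPtr < myRomEnd)
      {
        const uint16_t op = *myInstrPtr++;  // fetch next opcode, advance pointer
        dispatch(op);
      }
    }

    uint32_t pc() const
    {
      // back-calculate PC from the instruction pointer and the ROM base address
      return myRomBase + static_cast<uint32_t>((myInstrPtr - myRom) * 2);
    }

  private:
    void dispatch(uint16_t /*op*/) { /* decode and execute one opcode (not shown) */ }

    const uint16_t* myRom{nullptr};       // Thumb instruction ROM (16-bit words)
    const uint16_t* myRomEnd{nullptr};    // one past the last ROM word
    const uint16_t* myInstrPtr{nullptr};  // always points at the next instruction
    uint32_t myRomBase{0};                // ARM address of the start of ROM
};
```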
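
The bucketing idea from the third item could be sketched roughly like this. The encodings follow the ARMv4T Thumb instruction set, but the choice and ordering of cases is only an illustration of the technique, not the decoder actually used:

```cpp
// Hypothetical sketch of the "binary parse by top bits" idea: test bit 15
// first to pick a bucket, then check the most frequent encodings inside
// each bucket before the rare ones.
#include <cstdint>

enum class Op { add_imm8, cmp_imm8, b_cond, ldst_group, shift_group, other };

Op decodeSketch(uint16_t inst)
{
  if(inst & 0x8000)                                  // bucket 1xxx xxxx xxxx xxxx
  {
    if((inst & 0xF000) == 0xD000) return Op::b_cond;     // conditional branch: very frequent
                                                          // (cond=1111 is SWI in a real decoder)
    if((inst & 0xE000) == 0x8000) return Op::ldst_group; // LDRH/STRH and SP-relative load/store
    return Op::other;                                     // remaining 1xxx encodings
  }
  else                                               // bucket 0xxx xxxx xxxx xxxx
  {
    if((inst & 0xF800) == 0x3000) return Op::add_imm8;    // ADD Rd, #imm8: very frequent
    if((inst & 0xF800) == 0x2800) return Op::cmp_imm8;    // CMP Rn, #imm8
    if((inst & 0xE000) == 0x0000) return Op::shift_group; // shift-by-immediate / add-subtract group
    return Op::other;                                      // remaining 0xxx encodings
  }
}
```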
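
And the fourth item, expanding a hot opcode into per-condition entries at decode time, might look roughly like this (again just an illustrative sketch; the enum and function names are made up):

```cpp
// Hypothetical sketch: expand "conditional branch" into one decoded entry per
// condition code while building the decode table, so executing it later needs
// no extra shift/AND/switch on the condition field.
#include <cstdint>

enum class DecodedOp : uint8_t
{
  b_eq, b_ne, b_cs, b_cc, b_mi, b_pl, b_vs, b_vc,
  b_hi, b_ls, b_ge, b_lt, b_gt, b_le,   // one entry per branch condition
  other                                 // everything that is not a conditional branch
};

DecodedOp decodeCondBranch(uint16_t inst)
{
  if((inst & 0xF000) != 0xD000)
    return DecodedOp::other;            // not in the conditional-branch encoding space

  const uint8_t cond = (inst >> 8) & 0x0F;
  if(cond >= 14)                        // cond=1110 is undefined, 1111 encodes SWI
    return DecodedOp::other;

  // map condition code 0..13 straight onto the per-condition branch entries
  return static_cast<DecodedOp>(static_cast<uint8_t>(DecodedOp::b_eq) + cond);
}
```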

This might ease the CPU load on other platforms too.

BTW: It is quite impressive to get ARM games running at (mostly) full speed on a platform (Nintendo DSi) where the main CPU clocks at just 133 MHz.

@thrust26 thrust26 added this to the Prio 2 milestone Nov 26, 2022
@thrust26 thrust26 self-assigned this Nov 29, 2022