Optimize Thumbulator #945

Open
thrust26 opened this issue Nov 26, 2022 · 0 comments
thrust26 (Member) commented Nov 26, 2022

In the StellaDS thread on AtariAge, member llabnip made some suggestions on how to speed up the Thumbulator class quite significantly (almost 40% faster on the Nintendo DS):

  • Not calling into execute() for each Thumb instruction - the overhead of the call was not optimized away by GCC even at the maximum settings, so I moved the handling of the Thumb loop inside execute(). (implemented with 025de6e)
  • Keeping a 16-bit pointer always pointing at the next instruction rather than re-indexing into the Thumb instruction ROM array. I don't even bother to update the PC register until it's needed (it is easy to back-calculate from the 16-bit Thumb instruction pointer and the start of ROM). (together with the previous item, see the first sketch after this list)
  • The biggest improvement in speed came from simply using the top 2 bits of the Thumb instruction to binary-parse it. So I check if the high bit is set - that puts the instruction into one of two buckets. Then I check the next bit down to split those two buckets into two further buckets each. This way I only have to check the instructions in each bucket, which really shortens the long search for the opcode. Then I did some profiling and found the popular instructions, which were often several orders of magnitude more likely to be called, and check them first in each bucket (e.g. ADD with a big immediate into one register, CMP immediate, conditional branch, etc.). (Stella decodes opcodes only once, which is even better; see the second sketch after this list)
  • The conditional branch is heavily used in most programs - Galagon executes it about 200k times per second. Since the 8-bit decode table (256 possible entries) only holds roughly 72 opcodes, some of the most heavily used opcodes could be split further during decoding. The conditional branch, for example, could be split into the 13 different types (branch if zero, branch if not zero, etc.). This would just add to the opcode count but would save the shift, AND and switch for that instruction. (implemented with 96d5a3f; see the third sketch after this list)
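
A minimal sketch of how the first two items could look in C++ follows. This is not the actual StellaDS or Stella code; all names (ThumbCoreSketch, myInstrPtr, etc.) are invented for illustration:

```cpp
// Sketch of ideas 1 and 2: the fetch/dispatch loop lives inside execute(),
// and a pointer to 16-bit Thumb words tracks the next instruction; the PC
// register is only back-calculated when it is actually needed.
#include <cstdint>
#include <cstddef>

class ThumbCoreSketch
{
  public:
    ThumbCoreSketch(const uint16_t* rom, size_t words, uint32_t romBase)
      : myRom{rom}, myRomEnd{rom + words}, myInstrPtr{rom}, myRomBase{romBase} { }

    void execute(uint32_t maxInstructions)
    {
      // One call runs many instructions, so the per-instruction call overhead
      // (which GCC did not optimize away) is paid only once per batch.
      while(maxInstructions-- && myInstrPtr < myRomEnd)
      {
        const uint16_t op = *myInstrPtr++;  // fetch next opcode, advance pointer
        dispatch(op);
      }
    }

    uint32_t pc() const
    {
      // back-calculate PC from the instruction pointer and the ROM base address
      return myRomBase + static_cast<uint32_t>((myInstrPtr - myRom) * 2);
    }

  private:
    void dispatch(uint16_t /*op*/) { /* decode and execute one opcode (not shown) */ }

    const uint16_t* myRom{nullptr};       // Thumb instruction ROM (16-bit words)
    const uint16_t* myRomEnd{nullptr};    // one past the last ROM word
    const uint16_t* myInstrPtr{nullptr};  // always points at the next instruction
    uint32_t myRomBase{0};                // ARM address of the start of ROM
};
```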
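
The bucketing idea from the third item could be sketched roughly like this. The encodings follow the ARMv4T Thumb instruction set, but the choice and ordering of cases is only an illustration of the technique, not the decoder actually used:

```cpp
// Hypothetical sketch of the "binary parse by top bits" idea: test bit 15
// first to pick a bucket, then check the most frequent encodings inside
// each bucket before the rare ones.
#include <cstdint>

enum class Op { add_imm8, cmp_imm8, b_cond, ldst_group, shift_group, other };

Op decodeSketch(uint16_t inst)
{
  if(inst & 0x8000)                                  // bucket 1xxx xxxx xxxx xxxx
  {
    if((inst & 0xF000) == 0xD000) return Op::b_cond;     // conditional branch: very frequent
                                                          // (cond=1111 is SWI in a real decoder)
    if((inst & 0xE000) == 0x8000) return Op::ldst_group; // LDRH/STRH and SP-relative load/store
    return Op::other;                                     // remaining 1xxx encodings
  }
  else                                               // bucket 0xxx xxxx xxxx xxxx
  {
    if((inst & 0xF800) == 0x3000) return Op::add_imm8;    // ADD Rd, #imm8: very frequent
    if((inst & 0xF800) == 0x2800) return Op::cmp_imm8;    // CMP Rn, #imm8
    if((inst & 0xE000) == 0x0000) return Op::shift_group; // shift-by-immediate / add-subtract group
    return Op::other;                                      // remaining 0xxx encodings
  }
}
```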
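
And the fourth item, expanding a hot opcode into per-condition entries at decode time, might look roughly like this (again just an illustrative sketch; the enum and function names are made up):

```cpp
// Hypothetical sketch: expand "conditional branch" into one decoded entry per
// condition code while building the decode table, so executing it later needs
// no extra shift/AND/switch on the condition field.
#include <cstdint>

enum class DecodedOp : uint8_t
{
  b_eq, b_ne, b_cs, b_cc, b_mi, b_pl, b_vs, b_vc,
  b_hi, b_ls, b_ge, b_lt, b_gt, b_le,   // one entry per branch condition
  other                                 // everything that is not a conditional branch
};

DecodedOp decodeCondBranch(uint16_t inst)
{
  if((inst & 0xF000) != 0xD000)
    return DecodedOp::other;            // not in the conditional-branch encoding space

  const uint8_t cond = (inst >> 8) & 0x0F;
  if(cond >= 14)                        // cond=1110 is undefined, 1111 encodes SWI
    return DecodedOp::other;

  // map condition code 0..13 straight onto the per-condition branch entries
  return static_cast<DecodedOp>(static_cast<uint8_t>(DecodedOp::b_eq) + cond);
}
```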

This might ease the CPU load on other platforms too.

BTW: It is quite impressive to get ARM games running at (mostly) full speed on a platform (Nintendo DSi) where the main CPU clocks at just 133 MHz.

@thrust26 thrust26 added this to the Prio 2 milestone Nov 26, 2022
@thrust26 thrust26 self-assigned this Nov 29, 2022