Introducing local register allocation for the tier-1 JIT compiler #341

Merged
merged 1 commit into from
Feb 13, 2024

Conversation

qwe661234
Collaborator

@qwe661234 qwe661234 commented Feb 4, 2024

Local register allocation reuses host register values within the scope of a basic block, thereby reducing the number of load and store instructions.

Take consecutive addi instructions as an example:

addi t0, t0, 1
addi t0, t0, 1
addi t0, t0, 1
  • The generated machine code without register allocation
load t0, t0_addr
add t0, 1
sw t0, t0_addr
load t0, t0_addr
add t0, 1
sw t0, t0_addr
load t0, t0_addr
add t0, 1
sw t0, t0_addr
  • The generated machine code with register allocation
load t0, t0_addr
add t0, 1
add t0, 1
add t0, 1
sw t0, t0_addr

As shown in the above example, register allocation reuses the host register and reduces the number of load and store instructions.
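
The saving in the example above can be expressed as a small cost model. This is only a sketch: `mem_ops_t`, `count_without_ra`, and `count_with_ra` are hypothetical names introduced for illustration and are not part of the patch. It assumes a run of `n` consecutive instructions that all read and write the same VM register.

```c
#include <stddef.h>

/* Hypothetical cost model (not from the patch): memory operations emitted
 * for a run of n instructions that all use the same VM register. */
typedef struct {
    size_t loads;
    size_t stores;
} mem_ops_t;

/* Without register allocation, every instruction reloads the value and
 * stores it back. */
static mem_ops_t count_without_ra(size_t n)
{
    mem_ops_t ops = {n, n};
    return ops;
}

/* With local register allocation, the value is loaded on first use and
 * stored back once at the end of the basic block. */
static mem_ops_t count_with_ra(size_t n)
{
    mem_ops_t ops = {0, 0};
    if (n > 0) {
        ops.loads = 1;
        ops.stores = 1;
    }
    return ops;
}
```

For the three addi instructions above, this model gives 3 loads and 3 stores without register allocation, against 1 load and 1 store with it.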

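  • x86-64 (i7-11700)

| Metric    | w/o RA   | w/ RA    | SpeedUp |
|-----------|----------|----------|---------|
| dhrystone | 0.342 s  | 0.328 s  | +4.27%  |
| miniz     | 1.243 s  | 1.185 s  | +4.89%  |
| primes    | 1.716 s  | 1.689 s  | +1.60%  |
| sha512    | 2.063 s  | 1.880 s  | +9.73%  |
| stream    | 11.619 s | 11.419 s | +1.75%  |

  • Aarch64 (eMag)

| Metric    | w/o RA   | w/ RA    | SpeedUp |
|-----------|----------|----------|---------|
| dhrystone | 1.935 s  | 1.301 s  | +48.73% |
| miniz     | 7.706 s  | 4.362 s  | +76.66% |
| primes    | 10.513 s | 9.633 s  | +9.14%  |
| sha512    | 6.508 s  | 6.119 s  | +6.36%  |
| stream    | 45.174 s | 38.037 s | +18.76% |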
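
The SpeedUp figures are consistent with the ratio of the two run times minus one, expressed as a percentage. The PR does not state the formula, so the helper below is an assumption inferred from the numbers, and `speedup_percent` is a hypothetical name.

```c
/* Assumed formula (inferred from the table, not stated in the PR):
 * speedup = (time without RA / time with RA - 1) * 100. */
static double speedup_percent(double t_without_ra, double t_with_ra)
{
    return (t_without_ra / t_with_ra - 1.0) * 100.0;
}
```

For example, `speedup_percent(0.342, 0.328)` evaluates to roughly 4.27, in line with the dhrystone row for x86-64.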

As demonstrated in the performance analysis, register allocation improves the overall performance of the T1C-generated machine code. Without RA, the generated machine code needs to store the register value back at the end of each instruction. With RA, we only need to store the register value back at the end of a basic block or when the host registers are fully occupied. The performance enhancement is particularly pronounced on Aarch64 because it has more registers available, so more VM registers can be mapped to host registers.
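
To illustrate why the number of host registers matters, here is a simplified model of the store-back behaviour described above. It is a sketch under stated assumptions, not the patch's logic: `count_stores` is a hypothetical name, and the policy of storing back every mapped register when the host registers run out is an assumption, since the PR only says that values are stored back "when host registers are fully occupied". It expects `n_host_regs >= 1` and VM register numbers below 32.

```c
#define MODEL_VM_REGS 32

/* Counts the stores emitted for one basic block, given the VM registers
 * written by its instructions in order. Assumed policy: when no host
 * register is free, store back every mapped register and start over. */
static int count_stores(const int *written_vm_regs, int n, int n_host_regs)
{
    int mapped[MODEL_VM_REGS] = {0}; /* 1 if the VM register is in a host register */
    int in_use = 0;                  /* host registers currently occupied */
    int stores = 0;

    for (int i = 0; i < n; i++) {
        int vm = written_vm_regs[i];
        if (mapped[vm])
            continue; /* value already lives in a host register: reuse it */
        if (in_use == n_host_regs) {
            /* Host registers fully occupied: store everything back. */
            stores += in_use;
            for (int r = 0; r < MODEL_VM_REGS; r++)
                mapped[r] = 0;
            in_use = 0;
        }
        mapped[vm] = 1;
        in_use++;
    }
    /* End of the basic block: store back whatever is still mapped. */
    return stores + in_use;
}
```

Under this model, a block that writes VM registers 5, 6, 7, 5, 6, 7 emits 3 stores with three host registers but 6 stores with only two, which matches the intuition that a larger host register file benefits more.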

@jserv
Contributor

jserv commented Feb 4, 2024

Prior to our review of this pull request, please provide a detailed explanation of the register allocation (RA) algorithm you've implemented, including the rationale behind your choice. It would also be beneficial to discuss how this approach compares with traditional algorithms like graph coloring and linear scan, highlighting the advantages and considerations of each.

Reference:

@jserv jserv requested a review from vacantron February 4, 2024 09:56
@qwe661234
Collaborator Author

I implemented register allocation using an available host register table, a counter, and a VM register table.

  • Available host register table: Records the available host registers for register allocation.
  • Counter: Index of the next available host register in the table above.
  • VM register table: Records the mapping between host registers and VM registers.

Take the instruction sequence below as an example:

addi t0, a0, 1
addi t0, t0, 1

counter = 0
available_host_register = [RAX, RBX, RCX]

In the first addi instruction, we check the VM register table and find that reg_a0 has not been mapped to a host register. Therefore, we map reg_a0 to RAX and load its value from the RV data into RAX. Next, we need to map the destination VM register t0. We check the VM register table and find that reg_t0 has not been mapped to a host register either, so we map reg_t0 to RBX, move the value in RAX (reg_a0) to RBX (reg_t0), and add 1 to RBX (reg_t0).

In the second addi instruction, we check the VM register table and find that reg_t0 is already mapped to the host register RBX, so there is no need to load reg_t0. The destination VM register is also t0, which is already mapped to RBX, so we simply add 1 to RBX (reg_t0).
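
The walk-through above can be sketched in C as follows. This is a minimal illustration, not the code in src/jit.c: `ra_state_t`, `ra_init`, `ra_map`, and the `HOST_*` constants are hypothetical names, loads are only counted rather than emitted, and running out of host registers is reported to the caller instead of being handled.

```c
#include <stdbool.h>

#define HOST_REG_COUNT 3
#define VM_REG_COUNT 32
#define UNMAPPED (-1)

/* Illustrative host register IDs, in the order used by the example. */
enum { HOST_RAX, HOST_RBX, HOST_RCX };

typedef struct {
    int available[HOST_REG_COUNT]; /* available host register table */
    int counter;                   /* index of the next free entry above */
    int vm_table[VM_REG_COUNT];    /* VM register -> host register */
    int loads_emitted;             /* stand-in for emitted load instructions */
} ra_state_t;

static void ra_init(ra_state_t *ra)
{
    ra->available[0] = HOST_RAX;
    ra->available[1] = HOST_RBX;
    ra->available[2] = HOST_RCX;
    ra->counter = 0;
    ra->loads_emitted = 0;
    for (int i = 0; i < VM_REG_COUNT; i++)
        ra->vm_table[i] = UNMAPPED;
}

/* Returns the host register holding vm_reg, mapping it on first use.
 * A source operand that was not mapped yet costs one load. */
static int ra_map(ra_state_t *ra, int vm_reg, bool is_source)
{
    if (ra->vm_table[vm_reg] != UNMAPPED)
        return ra->vm_table[vm_reg]; /* already mapped: reuse, no load */
    if (ra->counter == HOST_REG_COUNT)
        return UNMAPPED; /* no free host register; not modelled here */

    int host = ra->available[ra->counter++];
    ra->vm_table[vm_reg] = host;
    if (is_source)
        ra->loads_emitted++;
    return host;
}
```

With a0 as x10 and t0 as x5, the first addi maps a0 to RAX (one load) and t0 to RBX. The second addi finds t0 already in RBX for both source and destination, so no further load is counted.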

Contributor

@jserv jserv left a comment


Rebase the latest master branch for API refinement.

@jserv
Contributor

jserv commented Feb 5, 2024

As shown in the above example, register allocation reuses the host register and reduces the number of load and store instructions.
Metric w/o RA w/ RA Speedup

Provide benchmarks for both x86-64 and Aarch64.

if (reg_table[i] == target_reg) {
    reg_table[i] = -1;
    emit_store(state, S32, target_reg, parameter_reg[0],
               offsetof(riscv_t, X) + 4 * i);
Collaborator


Can we do an early return here? The target should be unique in the reg_table.

Collaborator Author


Yes, we should return early here.
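
A sketch of what the early return could look like. The surrounding loop is not shown in the reviewed snippet, so the loop, the function name `store_back`, and the stub `fake_emit_store` (a stand-in for the project's emit_store) are assumptions made for illustration only.

```c
#define SKETCH_VM_REGS 32

/* VM register -> host register, -1 when unmapped (as in the snippet). */
static int reg_table[SKETCH_VM_REGS];
static int stores_emitted;
static int last_stored_vm_reg = -1;

static void reg_table_clear(void)
{
    for (int i = 0; i < SKETCH_VM_REGS; i++)
        reg_table[i] = -1;
}

/* Stand-in for emit_store(): only records what would be emitted. */
static void fake_emit_store(int host_reg, int vm_reg)
{
    (void) host_reg;
    stores_emitted++;
    last_stored_vm_reg = vm_reg;
}

/* Stores the VM register held in target_reg back to the VM state. */
static void store_back(int target_reg)
{
    for (int i = 0; i < SKETCH_VM_REGS; i++) {
        if (reg_table[i] == target_reg) {
            reg_table[i] = -1;
            fake_emit_store(target_reg, i);
            return; /* target_reg is unique in reg_table: stop scanning */
        }
    }
}
```

Because a host register holds at most one VM register at a time, the first match is the only match, so returning immediately skips the rest of the scan without changing behaviour.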

@qwe661234
Collaborator Author

qwe661234 commented Feb 6, 2024

As shown in the above example, register allocation reuses the host register and reduces the number of load and store instructions.
Metric w/o RA w/ RA Speedup

Provide benchmarks for both x86-64 and Aarch64.

Added

@jserv
Contributor

jserv commented Feb 6, 2024

Provide benchmarks for both x86-64 and Aarch64.

Added

Refine git commit message with correct column description -- w/o vs. w/.
Can you explain the measurements of RA for X64 and A64 code generation?

@qwe661234
Collaborator Author

Provide benchmarks for both x86-64 and Aarch64.

Added

Refine git commit message with correct column description -- w/o vs. w/. Can you explain the measurements of RA for X64 and A64 code generation?

Sure, I explained it in the git commit message.

@jserv
Contributor

jserv commented Feb 12, 2024

In the realm of compiler optimizations, register allocation strategies are broadly categorized into Local, Global, and Interprocedural register allocation. The proposed change specifically pertains to local register allocation. This approach focuses on tracking register contents within a basic block -- a linear sequence of instructions -- allowing for the efficient reuse of variables and constants directly from registers. This detail should be clearly communicated in the context of both pull request and git commit messages to ensure clarity regarding the scope and nature of the register allocation.

I defer to @vacantron for confirmation.

Local register allocation effectively reuses the host register value
within a basic block scope, thereby reducing the number of load and
store instructions.

Take consecutive addi instructions as an example:

addi t0, t0, 1
addi t0, t0, 1
addi t0, t0, 1

* The generated machine code without register allocation

load t0, t0_addr
add t0, 1
sw t0, t0_addr
load t0, t0_addr
add t0, 1
sw t0, t0_addr
load t0, t0_addr
add t0, 1
sw t0, t0_addr

* The generated machine code with register allocation

load t0, t0_addr
add t0, 1
add t0, 1
add t0, 1
sw t0, t0_addr

As shown in the above example, register allocation reuses the host
register and reduces the number of load and store instructions.

* x86-64 (i7-11700)

| Metric   |  W/O RA  |  W/ RA   | SpeedUp |
|----------+----------+----------+---------|
| dhrystone| 0.342 s  | 0.328 s  |  +4.27% |
| miniz    | 1.243 s  | 1.185 s  |  +4.89% |
| primes   | 1.716 s  | 1.689 s  |  +1.60% |
| sha512   | 2.063 s  | 1.880 s  |  +9.73% |
| stream   |11.619 s  |11.419 s  |  +1.75% |

* Aarch64 (eMag)

| Metric   |  W/O RA  |  W/ RA   | SpeedUp |
|----------+----------+----------+---------|
| dhrystone| 1.935 s  | 1.301 s  | +48.73% |
| miniz    | 7.706 s  | 4.362 s  | +76.66% |
| primes   |10.513 s  | 9.633 s  |  +9.14% |
| sha512   | 6.508 s  | 6.119 s  |  +6.36% |
| stream   |45.174 s  |38.037 s  | +18.76% |

As demonstrated in the performance analysis, register allocation
improves the overall performance of the T1C-generated machine code.
Without RA, the generated machine code needs to store the register
value back at the end of each instruction. With RA, we only need to
store the register value back at the end of a basic block or when the
host registers are fully occupied. The performance enhancement is
particularly pronounced on Aarch64 because it has more registers
available, so more VM registers can be mapped to host registers.
@qwe661234 qwe661234 changed the title Introducing register allocation for the tier-1 JIT compiler Introducing lcoal register allocation for the tier-1 JIT compiler Feb 13, 2024
@qwe661234 qwe661234 changed the title Introducing lcoal register allocation for the tier-1 JIT compiler Introducing local register allocation for the tier-1 JIT compiler Feb 13, 2024
@jserv jserv merged commit 3c3d440 into sysprog21:master Feb 13, 2024
7 checks passed