Introduce preliminary macro operation fusion #132

Merged
merged 1 commit into from
May 29, 2023

qwe661234 (Collaborator) commented May 22, 2023

We have observed recurring patterns in instruction sequences. Converting these specific RISC-V instruction patterns into faster, equivalent code improves execution efficiency, as the measurements below show.

In our current analysis, we focus on commonly used benchmarks and have found the following frequently occurring instruction patterns: auipc + addi, auipc + add, multiple sw, and multiple lw.

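| Metric    | commit fba5802            | macro fuse operation      | Speedup |
|-----------|---------------------------|---------------------------|---------|
| CoreMark  | 1351.065 (Iterations/Sec) | 1352.843 (Iterations/Sec) | +0.13%  |
| dhrystone | 1073 DMIPS                | 1146 DMIPS                | +6.8%   |
| nqueens   | 8295 msec                 | 7824 msec                 | +6.0%   |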
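
To make the auipc + addi pattern named above concrete, here is a minimal sketch of what fusing the pair amounts to. The record `fused_auipc_addi_t` and the functions `run_unfused` and `run_fused` are hypothetical names chosen for illustration; they are not the project's decoder structures or handlers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified record for a fused auipc + addi pair. */
typedef struct {
    uint8_t rd;   /* destination register shared by both instructions */
    int32_t imm;  /* auipc immediate, already shifted left by 12 */
    int32_t imm2; /* addi immediate, sign-extended 12-bit value */
} fused_auipc_addi_t;

/* Reference behaviour: execute auipc, then addi, on the same register. */
uint32_t run_unfused(uint32_t pc, int32_t upper, int32_t lower)
{
    uint32_t rd = pc + (uint32_t) upper; /* auipc rd, upper >> 12 */
    rd += (uint32_t) lower;              /* addi rd, rd, lower */
    return rd;
}

/* Fused behaviour: one dispatch and one register write. */
uint32_t run_fused(uint32_t pc, const fused_auipc_addi_t *ir)
{
    assert(ir != 0);
    return pc + (uint32_t) ir->imm + (uint32_t) ir->imm2;
}
```

Both functions return the same value for the same inputs; the fused form simply avoids a second dispatch and a second write to `rd`.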

jserv (Contributor) commented May 22, 2023

> To enhance execution efficiency, we employ instruction fusion by combining sequences that adhere to specific patterns into fused instructions. Currently, we have incorporated four fused instructions: auipc + addi, auipc + add, multiple sw, and multiple lw.

Please show some numbers to illustrate how we benefit from macro operation fusion.
In addition, why were these 4 patterns picked? Illustrate them with existing benchmark programs.

@jserv jserv changed the title Add fuse instruction Introduce macro operation fusion May 22, 2023
@@ -1219,6 +1220,60 @@ RVOP(cswsp, {
})
#endif

/* auipc + addi */
jserv (Contributor) commented May 27, 2023

Is it possible to manipulate the sequence lui + addi?
See #81 (comment)

Disassembly of CoreMark:

   10324:       000087b7                lui     a5,0x8
   10328:       b0578793                addi    a5,a5,-1275 # 0x7b05
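
As a worked check of the disassembly above, the following sketch decodes the two instruction words and computes the constant that the lui + addi pair leaves in a5. The helper names `is_lui_addi_pair` and `lui_addi_constant` are hypothetical and not part of the project; the field positions follow the standard RV32I U-type and I-type encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* True when `first` is lui and `second` is addi with rd == rs1 == lui's rd. */
bool is_lui_addi_pair(uint32_t first, uint32_t second)
{
    bool is_lui = (first & 0x7f) == 0x37;
    bool is_addi = (second & 0x707f) == 0x13; /* opcode 0x13, funct3 0 */
    uint32_t lui_rd = (first >> 7) & 0x1f;
    uint32_t addi_rd = (second >> 7) & 0x1f;
    uint32_t addi_rs1 = (second >> 15) & 0x1f;
    return is_lui && is_addi && lui_rd == addi_rd && lui_rd == addi_rs1;
}

/* Constant left in rd after the pair executes. */
uint32_t lui_addi_constant(uint32_t first, uint32_t second)
{
    assert(is_lui_addi_pair(first, second));
    uint32_t upper = first & 0xfffff000u;     /* lui: imm[31:12] */
    int32_t lower = (int32_t) (second >> 20); /* addi: imm[11:0] */
    if (lower & 0x800)
        lower -= 0x1000;                      /* sign-extend 12 bits */
    return upper + (uint32_t) lower;
}
```

For the words 0x000087b7 and 0xb0578793, the upper part is 0x8000 and the sign-extended lower part is -1275, giving 0x7b05, which matches the disassembler's annotation.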

qwe661234 (Collaborator, Author) commented:

It is possible. However, importing this pattern causes problems when running qrcode.elf, so I skipped it in this pull request.

jserv (Contributor) commented:

> It is possible. However, there are some problems when running qrcode.elf if we import this pattern, so I skip it in this pull request.

Add a comment starting with "FIXME: lui + addi".

rv->PC += ir->insn_len * (ir->imm2 - 1);
})

/* multiple lw */
jserv (Contributor) commented:

lw is the most frequent instruction (see #34), so its use cases deserve a closer look.

jserv (Contributor) commented:

Can you handle the following case? (disassembly from CoreMark)

   10248:       03012603                lw      a2,48(sp)
   1024c:       01c11583                lh      a1,28(sp)
   10250:       03412503                lw      a0,52(sp)

jserv (Contributor) commented:

In addition, consider the following scenario:

   10a84:       01c12083                lw      ra,28(sp)
   10a88:       07f47513                andi    a0,s0,127
   10a8c:       01812403                lw      s0,24(sp)
   10a90:       01412483                lw      s1,20(sp)
   10a94:       01012903                lw      s2,16(sp)
   10a98:       00c12983                lw      s3,12(sp)

Apart from the intervening andi, it can be regarded as 5 lw. Roughly speaking, if peephole optimization can be applied, we would benefit from further optimizations.

jserv (Contributor) commented:

Another case: (disassembly from CoreMark)

   10c08:       01162023                sw      a7,0(a2)
   10c0c:       00052783                lw      a5,0(a0)
   10c10:       00059883                lh      a7,0(a1)
   10c14:       00259603                lh      a2,2(a1)
   10c18:       00f82023                sw      a5,0(a6)
   10c1c:       01052023                sw      a6,0(a0)
   10c20:       00e82223                sw      a4,4(a6)
   10c24:       0006a783                lw      a5,0(a3)

Mixture of sw and lw.

qwe661234 (Collaborator, Author) commented:

> Another case: (disassembly from CoreMark)
>
>    10c08:       01162023                sw      a7,0(a2)
>    10c0c:       00052783                lw      a5,0(a0)
>    10c10:       00059883                lh      a7,0(a1)
>    10c14:       00259603                lh      a2,2(a1)
>    10c18:       00f82023                sw      a5,0(a6)
>    10c1c:       01052023                sw      a6,0(a0)
>    10c20:       00e82223                sw      a4,4(a6)
>    10c24:       0006a783                lw      a5,0(a3)
>
> Mixture of sw and lw.

In this case, the memory addresses are not contiguous. All we can do is pack these instructions together; we cannot save any operations, such as the misalignment check.
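
The contiguity condition discussed here can be sketched as a scan for runs of lw (or sw) that share a base register and whose offsets advance by a constant 4 bytes. This is an illustrative sketch under stated assumptions, not the project's `COMBINE_MEM_OPS` logic: the types `op_t` and `insn_t` and the function `fusible_run` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode tags and decoded-instruction record. */
typedef enum { OP_LW, OP_LH, OP_SW, OP_ANDI } op_t;

typedef struct {
    op_t op;
    uint8_t rs1; /* base register for loads and stores */
    int32_t imm; /* byte offset from the base register */
} insn_t;

/* Length of the fusible run starting at insns[start]: same opcode (lw or sw),
 * same base register, offsets advancing by a constant +4 or -4.
 * Returns 0 when the starting instruction is neither lw nor sw. */
size_t fusible_run(const insn_t *insns, size_t n, size_t start)
{
    assert(start < n);
    const insn_t *first = &insns[start];
    if (first->op != OP_LW && first->op != OP_SW)
        return 0;

    size_t count = 1;
    int32_t stride = 0; /* fixed by the first pair in the run */
    for (size_t i = start + 1; i < n; i++) {
        const insn_t *cur = &insns[i];
        if (cur->op != first->op || cur->rs1 != first->rs1)
            break;
        int32_t delta = cur->imm - insns[i - 1].imm;
        if (delta != 4 && delta != -4)
            break;
        if (stride == 0)
            stride = delta;
        else if (delta != stride)
            break;
        count++;
    }
    return count;
}
```

Applied to the mixed sw/lw sequence above, every starting position yields a run of at most 1, because neighbouring accesses use different base registers. Applied to the epilogue sequence with `lw s0,24(sp)` through `lw s3,12(sp)`, the scan yields a run of 4. When a run shares one base register and a 4-byte stride, checking the alignment of the first address is enough for the whole run, which is the saving that is unavailable in the mixed case.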

qwe661234 (Collaborator, Author) commented:

> In addition, consider the following scenario:
>
>    10a84:       01c12083                lw      ra,28(sp)
>    10a88:       07f47513                andi    a0,s0,127
>    10a8c:       01812403                lw      s0,24(sp)
>    10a90:       01412483                lw      s1,20(sp)
>    10a94:       01012903                lw      s2,16(sp)
>    10a98:       00c12983                lw      s3,12(sp)
>
> It can be regarded as 5 lw. Roughly speaking, if peephole optimization can be applied, we shall benefit from further optimizations.

In this case, we can pack the last four lw instructions. To handle this case by packing all 5 lw, we need to reorder the instructions, for example by swapping the first and second instructions.
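
Whether two adjacent instructions may be swapped, as suggested here, depends on their register dependencies. The sketch below checks the three classic hazards (read-after-write, write-after-read, write-after-write). The type `reg_use_t` and the functions `reads_reg` and `can_swap` are hypothetical, and the check models registers only: it ignores memory aliasing and trap ordering, which a real reordering pass would also have to consider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register-usage summary of one instruction.
 * Register 0 (x0) means "none" and never creates a dependency. */
typedef struct {
    uint8_t rd;  /* register written */
    uint8_t rs1; /* first register read */
    uint8_t rs2; /* second register read */
} reg_use_t;

bool reads_reg(const reg_use_t *insn, uint8_t reg)
{
    return reg != 0 && (insn->rs1 == reg || insn->rs2 == reg);
}

/* True when adjacent instructions a (first) and b (second) can be swapped
 * without changing any register result. */
bool can_swap(const reg_use_t *a, const reg_use_t *b)
{
    assert(a != 0 && b != 0);
    bool raw = reads_reg(b, a->rd);          /* b reads what a writes */
    bool war = reads_reg(a, b->rd);          /* b overwrites what a reads */
    bool waw = a->rd != 0 && a->rd == b->rd; /* both write the same register */
    return !raw && !war && !waw;
}
```

With `lw ra,28(sp)` as `{1, 2, 0}` and `andi a0,s0,127` as `{10, 8, 0}`, `can_swap` reports that the pair can be exchanged, so the andi can move ahead of the first lw. It cannot instead sink below `lw s0,24(sp)` (`{8, 2, 0}`), because that load overwrites s0, which the andi reads.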

qwe661234 (Collaborator, Author) commented:

> Can you handle the following case? (disassembly from CoreMark)
>
>    10248:       03012603                lw      a2,48(sp)
>    1024c:       01c11583                lh      a1,28(sp)
>    10250:       03412503                lw      a0,52(sp)

Ditto: to handle this case, we need a strategy for reordering the instructions.

jserv (Contributor) commented:

In this pull request, let's concentrate on preliminary support for macro operation fusion. Please add some comments about further work, such as instruction reordering.

@jserv jserv changed the title Introduce macro operation fusion Introduce preliminary macro operation fusion May 28, 2023
@qwe661234 qwe661234 requested a review from jserv May 29, 2023 09:01
jserv (Contributor) left a comment:

Add some FIXME/TODO comments noting further macro operation fusion patterns that deserve attention.

case rv_insn_lw:
    COMBINE_MEM_OPS(1);
    break;
/* FIXME: lui + addi */

Code scanning / CodeQL check notice: FIXME comment: lui + addi
@qwe661234 qwe661234 requested a review from jserv May 29, 2023 09:58

qwe661234 (Collaborator, Author) commented:

Check CI failure.

In debug mode, rv_step emulates only one instruction per step: it executes only the first instruction of a basic block, then translates the next basic block at PC + 4. Applying macro operation fusion in debug mode therefore causes errors. For instance, suppose auipc and addi are fused and executed together as the first instruction of a block. The emulator then advances to PC + 4, where the instruction is still addi rather than a nop, so the net effect becomes auipc + addi + addi.

Therefore, we cannot apply fusion in debug mode.
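
The failure described above can be reproduced with a small model of single-stepping. This is a sketch under simplifying assumptions: the type `cpu_t` and the function `single_step_pair` are hypothetical, and the model tracks only one destination register and a two-instruction program (auipc followed by addi).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { INSN_LEN = 4 }; /* uncompressed RV32 instruction length in bytes */

/* Hypothetical minimal CPU state: program counter and one register. */
typedef struct {
    uint32_t pc;
    uint32_t rd;
} cpu_t;

/* Single-step through "auipc rd, upper ; addi rd, rd, lower".
 * Each step executes only the first decoded entry of the block at pc and
 * then advances pc by one instruction. `fuse` selects whether the first
 * entry was rewritten into a fused auipc + addi. */
uint32_t single_step_pair(uint32_t start_pc, int32_t upper, int32_t lower,
                          bool fuse)
{
    assert(start_pc % INSN_LEN == 0);
    cpu_t cpu = {.pc = start_pc, .rd = 0};

    /* Step 1: the block starts at auipc. */
    if (fuse)
        cpu.rd = cpu.pc + (uint32_t) upper + (uint32_t) lower;
    else
        cpu.rd = cpu.pc + (uint32_t) upper;
    cpu.pc += INSN_LEN;

    /* Step 2: the block translated at pc + 4 still begins with addi. */
    cpu.rd += (uint32_t) lower;
    cpu.pc += INSN_LEN;

    return cpu.rd;
}
```

With `start_pc = 0x10000`, `upper = 0x2000`, and `lower = -16`, the unfused run returns 0x11ff0, while the fused run returns 0x11fe0 because the addi immediate is applied twice.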

@qwe661234 qwe661234 requested a review from jserv May 29, 2023 15:45
@jserv jserv merged commit 5fb9d8b into sysprog21:master May 29, 2023