Introduce mtval2 and stval2 CSRs #394

avpatel · 2019-06-06T15:22:35Z

Currently, we only have one trap value register for M-mode and S-mode
which is mtval and stval respectively.

The contents of mtval and stval will be fault address for following traps:
Load address misaligned
Load access fault
Store/AMO address misaligned
Store/AMO access fault
Load page fault
Store/AMO page fault

Unfortunately, knowing just fault address for above mentioned traps is not
sufficent because software needs to know exact width (1, 2, 4, or 8 bytes)
of load/store operation and target registers for handling the trap. This
means software (OpenSBI/Hypervisor) will have to do unprivledge load for
determining and decoding the faulting instruction.
(Refer, get_insn() and get_insn_virt() of OpenSBI at
https://github.com/riscv/opensbi/blob/hyp_ext_changes_v1/lib/riscv_unpriv.c)
(Refer, get_insn() of Xvisor at
https://github.com/avpatel/xvisor-next/blob/master/arch/riscv/cpu/generic/include/riscv_unpriv.h)

Use-Case1: Misaligned load/store emulation in M-mode
The M-mode RUNTIME firmware (OpenSBI) will have to do misaligned load/store
emulation for both HS-mode (Hypervisor) and VS-mode (Guest OS) whenever
underlying HW does not implement misaligned load/store. This means M-mode
RUNTIME firmware will do unprivledge load for getting faulting instruction.
This is expensive for traps coming from both HS-mode and VS-mode. Particularly,
it's very expensive for traps coming from VS-mode because unprivledge load
can use nested page table walk to read instruction from memory.

Use-Case2: Software device emulation in HS-mode
Real world HS-mode hypervisors will be emulating lot of devices in software.
For such devices, hypervisor will not create any mapping in Stage2 page table
so whenever Guest OS access these devices it will generate load page fault
or store page fault. To handle the load/store page fault, hypervisor will
do unprivledge load for getting faulting instruction. This is very expensive
for hypervisor because unprivledge load can use nested page table walk to
read instruction from memory.

To make RISC-V competitive with other architectures, I propose to introduce
mtval2 and stval2 CSRs. These CSRs will hold faulting instruction for above
mentioned traps. If CPU designers don't want to implement these CSRs then
they can simply tie these CSRs to zero in which case software will fallback
to manually reading instruction using unprivledge load. A similar thing
already exist for illegal instruction trap where mtval/stval has faulting
instruction or zero.

DavidYu360 · 2019-06-07T03:04:40Z

Maybe too late to change, I think for the consistency, there should be one fixed CSR that always contains the synchronous fault instruction.

When I writing a very simple random instruction generating script, I found it is hard to tell an EBREAK or a C.EBREAK without loading instruction from data side in the exception handler.

avpatel · 2019-06-07T03:57:55Z

Thanks for details related to EBREAK and C.EBREAK. This means we should have faulting instruction in a separate CSR (probably mtval2/stval2) for "Breakpoint" trap as well.

I used mtval2 and stval2 CSR names because for "Instruction access fault" and "Instruction page fault" traps we cannot know the faulting instruction. In-future, we might have more traps where its not possible to know the faulting instruction.

aswaterman · 2019-06-07T05:32:01Z

We need to evaluate these suggestions quantitatively. In both use cases, the code paths are already pretty long, and so the opportunity for speedup is limited.

The EBREAK thing is more a matter of convenience than performance, so while it's an understandable desire, I don't think it factors into this decision.

fintelia · 2019-07-15T02:57:39Z

Could you explain a bit more why doing an unprivileged load via MPRV would be so expensive? The processor just loaded the instruction only a handful of cycles beforehand so presumably the two-level translation will still be in the TLB and the data in the L2 cache (and L1 instruction cache, but that doesn't help). Am I missing something?

avpatel · 2019-07-15T03:46:19Z

First of all, if you look existing uses of MPRV (OpenSBI) and SPRV (Xvisor, KVM, etc) then you will realize that it's quite a few instructions.

Further, when we do unprivledge load/store (from M-mode or HS-mode) it can fault because S-mode Linux has swapper (running in background) which can change page table mappings. We have seen unprivledge load/store faulting on SiFive unleashed board but the rate will depend on rate of swapping which in-turn depends on system load and pressure for more memory.

The rationale here is that HW already has the instruction when the fault is generated so it can easily save faulting instruction in some trap CSR (probably mtval2/stval2) at time of taking trap.

aswaterman · 2019-07-15T03:48:05Z

If it faults because S-mode has swapped the page out, the instruction page fault can be redirected to S-mode so that the OS page-fault handler will bring the instruction page back into memory. Then the OS will retry the misaligned instruction, and it will once again trap into M-mode for emulation. It should be a perf issue, not a functional issue.

avpatel · 2019-07-15T04:24:20Z

In case of virtualization, lot of Guest/VM devices are software emulated (UART, ethernet, block, etc). This means whenever Guest Linux access these SW emulated devices on Xvisor/KVM it will generate STORE/LOAD trap. In fact, entire PLIC is SW emulated for Guest Linux.

Now imagine a loaded system where Guest Linux is doing lot of network/storage IO resulting in lots of PLIC, ethernet, block, etc MMIO accesses. For every MMIO access from Guest Linux is generating STORE/LOAD trap and we are doing get_insn() (roughly 50-100 instructions) just to get the faulting instruction.

I mean not knowing the faulting instruction is making RISC-V hypervisor extension less competitive in-terms of performance.

IMHO, having dedicated HS-mode CSR for faulting instruction makes perfect sense for RISC-V hypervisor extension. Providing same CSR for M-mode as well is for completeness.

aswaterman · 2019-07-15T04:35:37Z

One thing we are confused about is, why does it take so many instructions in those codes? In bbl, it takes 10 - 14 instructions, depending on size and alignment. That code is believed to handle all corner cases correctly. (I'm referring to this code when it is passed through GCC 8.2: https://github.com/riscv/riscv-pk/blob/d5909ee6f6c41ac0c539b687fb6e849401b0aecc/machine/unprivileged_memory.h#L69)

avpatel · 2019-07-15T04:48:14Z

Yes, we have extended same code from BBL/OpenSBI to Xvisor/KVM but for HS-mode.

For KVM, the host Linux runs most of the KVM code in process context hence it can be preempted even when we are in Linux KVM code (except the KVM world-switch).

The get_insn() in KVM RISC-V has to be:

Done atomically using local_irq_save() and local_irq_restore()
HSTATUS, BSSTATUS and SSTATUS have to be saved-updated-restored

Above adds more instructions to get_insn().
(Refer, https://github.com/avpatel/linux/blob/riscv_kvm_v1/arch/riscv/kvm/vcpu_exit.c)

aswaterman · 2019-07-15T04:56:25Z

Thanks for the link.

To be clear, I'm not opposed to adding this feature, I just want to see a quantitative case for its inclusion. When emulating device I/O, what fraction of the runtime is in get_insn, in the case that the guest code is already hot?

avpatel · 2019-07-15T05:38:29Z

We don't have quantitative data as of now about amount of runtime spend in get_insn(). The impact of get_insn() will only show up in Guest Linux workloads resulting in high MMIO access.

Currently, irrespective to the hypervisor type we will have get_insn() called once for every MMIO trap.

Here's some details about MMIO trap rate for 1Gbps network traffic from Guest Linux....

Network throughput of 1Gbps from Guest Linux implies 83K packets-per-second of 1500 bytes (max MTU size). Now for every virtual ethernet interrupt, we will have two PLIC MMIO traps (read CLAIM register and write CLAIM register). Now, the number of MMIO access for virtual ethernet controller will vary based on type of ethernet controller we are emulating in hypervisor. If we assume 2 MMIO traps to virtual ethernet controller per-packet (optimistic) and one virtual interrupt per-packet then this will be roughly 333K MMIO traps per second for 1Gbps traffic. Now if get_insn() is 20 instructions then we have 6.6 million instructions per-second just to read the faulting instructions.

jscheid-ventana · 2019-07-15T15:16:33Z

Unfortunately, knowing just fault address for above mentioned traps is not
sufficent because software needs to know exact width (1, 2, 4, or 8 bytes)
of load/store operation and target registers for handling the trap.

As an alternative to providing the full instruction memory encoding, consider providing applicable exception syndrome information directly. Such has the possibility to be much less constraining on implementations, and potentially more efficient for SW. So in this case, pieces of information like the access width and target register.

fintelia · 2019-07-15T18:46:53Z

As an alternative to providing the full instruction memory encoding, consider providing applicable exception syndrome information directly. Such has the possibility to be much less constraining on implementations, and potentially more efficient for SW. So in this case, pieces of information like the access width and target register.

For uncompressed load/store instructions that information is stored directly in bits [12,13] and [7,11] respectively (and whether to sign extend in bit 14). It is rather more annoying to extract that information from the compressed versions but perhaps mtval2/stval2 could always store the uncompressed version of instructions? Bits [0, 1] would then be free to indicate the real instruction size for advancing the PC. This would also give a reason to make mtval2/stval2 consistent across all traps including illegal instruction traps.

avpatel · 2019-07-16T05:22:23Z

Unfortunately, knowing just fault address for above mentioned traps is not
sufficent because software needs to know exact width (1, 2, 4, or 8 bytes)
of load/store operation and target registers for handling the trap.

As an alternative to providing the full instruction memory encoding, consider providing applicable exception syndrome information directly. Such has the possibility to be much less constraining on implementations, and potentially more efficient for SW. So in this case, pieces of information like the access width and target register.

Yes, having decoded information (such as access width and target registers number) in mtval2/stval2 CSR will make SW more efficient. In fact, this is what is done in ARM64 architecture. Although, issue with ARM64 way of representing decoded information is that it does not cover all ARM64 load/store instructions particularly it does not cover ARM64 LDP/STP instructions where we have two target/source registers.

The rationale for full instruction encoding in mtval2/stval2 was:

It is inline with what's already done for illegal instruction trap
Keeps things simpler for CPU designers

I guess we can also go with decoded information in mtval2/stval2 because we don't have too many variants of load/store instructions in RISC-V.

Regards,
Anup

aswaterman · 2019-07-16T08:08:59Z

Providing exactly the bits that were fetched from memory is the more future-proof approach, and it also matches the existing mtval illegal-instruction trap behavior.

jscheid-ventana · 2019-07-18T17:43:14Z

Providing exactly the bits that were fetched from memory is the more future-proof approach, and it also matches the existing mtval illegal-instruction trap behavior.

In general, providing more exception syndrome information, and especially more initial or intermediate state information, can impose higher costs for various implementation choices. The intent of providing abstract syndrome information is to focus on the intersection of what is useful for SW and what is likely to be available to a broad set of implementations at the point a particular exception type is identified.

Example implementations where providing the full instruction value has additional cost:

Hardware implementations where instructions are decomposed into operations that do not preserve the original instruction encodings past the point of decode
Partial decode may occur prior to caching structures.
Dynamic code optimization implementations such as Transmeta where instructions are similarly decomposed but in a software layer. etc.

All of these implementations would likely know aspects of the instruction, such as the access type, size, destination register, etc. at the time of detecting a data access fault, but not the specific instruction encoding.

The mtval values are optional. But the risk is that the architecture will eventually need multiple mechanisms to present syndrome information to SW. So to address this it’s worth contemplating a mechanism that is more broadly feasible, presenting a single architecture to SW.

avpatel · 2019-11-07T07:16:15Z

Closing this issue since it's already covered in v0.5 draft.

This was referenced Jul 31, 2019

Add CSRs to the hypervisor extension for more trap values #424

Closed

Add CSRs to the hypervisor extension for more trap values, version 2 #431

Closed

avpatel closed this as completed Nov 7, 2019

ChengJin01 mentioned this issue Jan 5, 2021

Porting OpenJ9 to RISC-V eclipse-openj9/openj9#5058

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce mtval2 and stval2 CSRs #394

Introduce mtval2 and stval2 CSRs #394

avpatel commented Jun 6, 2019

DavidYu360 commented Jun 7, 2019

avpatel commented Jun 7, 2019

aswaterman commented Jun 7, 2019

fintelia commented Jul 15, 2019

avpatel commented Jul 15, 2019

aswaterman commented Jul 15, 2019 •

edited

avpatel commented Jul 15, 2019

aswaterman commented Jul 15, 2019 •

edited

avpatel commented Jul 15, 2019

aswaterman commented Jul 15, 2019

avpatel commented Jul 15, 2019 •

edited

jscheid-ventana commented Jul 15, 2019

fintelia commented Jul 15, 2019

avpatel commented Jul 16, 2019

aswaterman commented Jul 16, 2019

jscheid-ventana commented Jul 18, 2019

avpatel commented Nov 7, 2019

Introduce mtval2 and stval2 CSRs #394

Introduce mtval2 and stval2 CSRs #394

Comments

avpatel commented Jun 6, 2019

DavidYu360 commented Jun 7, 2019

avpatel commented Jun 7, 2019

aswaterman commented Jun 7, 2019

fintelia commented Jul 15, 2019

avpatel commented Jul 15, 2019

aswaterman commented Jul 15, 2019 • edited

avpatel commented Jul 15, 2019

aswaterman commented Jul 15, 2019 • edited

avpatel commented Jul 15, 2019

aswaterman commented Jul 15, 2019

avpatel commented Jul 15, 2019 • edited

jscheid-ventana commented Jul 15, 2019

fintelia commented Jul 15, 2019

avpatel commented Jul 16, 2019

aswaterman commented Jul 16, 2019

jscheid-ventana commented Jul 18, 2019

avpatel commented Nov 7, 2019

aswaterman commented Jul 15, 2019 •

edited

aswaterman commented Jul 15, 2019 •

edited

avpatel commented Jul 15, 2019 •

edited