Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce mtval2 and stval2 CSRs #394

Closed
avpatel opened this issue Jun 6, 2019 · 17 comments
Closed

Introduce mtval2 and stval2 CSRs #394

avpatel opened this issue Jun 6, 2019 · 17 comments

Comments

@avpatel
Copy link

avpatel commented Jun 6, 2019

Currently, we only have one trap value register for M-mode and S-mode
which is mtval and stval respectively.

The contents of mtval and stval will be fault address for following traps:
Load address misaligned
Load access fault
Store/AMO address misaligned
Store/AMO access fault
Load page fault
Store/AMO page fault

Unfortunately, knowing just fault address for above mentioned traps is not
sufficent because software needs to know exact width (1, 2, 4, or 8 bytes)
of load/store operation and target registers for handling the trap. This
means software (OpenSBI/Hypervisor) will have to do unprivledge load for
determining and decoding the faulting instruction.
(Refer, get_insn() and get_insn_virt() of OpenSBI at
https://github.com/riscv/opensbi/blob/hyp_ext_changes_v1/lib/riscv_unpriv.c)
(Refer, get_insn() of Xvisor at
https://github.com/avpatel/xvisor-next/blob/master/arch/riscv/cpu/generic/include/riscv_unpriv.h)

Use-Case1: Misaligned load/store emulation in M-mode
The M-mode RUNTIME firmware (OpenSBI) will have to do misaligned load/store
emulation for both HS-mode (Hypervisor) and VS-mode (Guest OS) whenever
underlying HW does not implement misaligned load/store. This means M-mode
RUNTIME firmware will do unprivledge load for getting faulting instruction.
This is expensive for traps coming from both HS-mode and VS-mode. Particularly,
it's very expensive for traps coming from VS-mode because unprivledge load
can use nested page table walk to read instruction from memory.

Use-Case2: Software device emulation in HS-mode
Real world HS-mode hypervisors will be emulating lot of devices in software.
For such devices, hypervisor will not create any mapping in Stage2 page table
so whenever Guest OS access these devices it will generate load page fault
or store page fault. To handle the load/store page fault, hypervisor will
do unprivledge load for getting faulting instruction. This is very expensive
for hypervisor because unprivledge load can use nested page table walk to
read instruction from memory.

To make RISC-V competitive with other architectures, I propose to introduce
mtval2 and stval2 CSRs. These CSRs will hold faulting instruction for above
mentioned traps. If CPU designers don't want to implement these CSRs then
they can simply tie these CSRs to zero in which case software will fallback
to manually reading instruction using unprivledge load. A similar thing
already exist for illegal instruction trap where mtval/stval has faulting
instruction or zero.

@DavidYu360
Copy link

Maybe too late to change, I think for the consistency, there should be one fixed CSR that always contains the synchronous fault instruction.

When I writing a very simple random instruction generating script, I found it is hard to tell an EBREAK or a C.EBREAK without loading instruction from data side in the exception handler.

@avpatel
Copy link
Author

avpatel commented Jun 7, 2019

Thanks for details related to EBREAK and C.EBREAK. This means we should have faulting instruction in a separate CSR (probably mtval2/stval2) for "Breakpoint" trap as well.

I used mtval2 and stval2 CSR names because for "Instruction access fault" and "Instruction page fault" traps we cannot know the faulting instruction. In-future, we might have more traps where its not possible to know the faulting instruction.

@aswaterman
Copy link
Member

We need to evaluate these suggestions quantitatively. In both use cases, the code paths are already pretty long, and so the opportunity for speedup is limited.

The EBREAK thing is more a matter of convenience than performance, so while it's an understandable desire, I don't think it factors into this decision.

@fintelia
Copy link
Contributor

Could you explain a bit more why doing an unprivileged load via MPRV would be so expensive? The processor just loaded the instruction only a handful of cycles beforehand so presumably the two-level translation will still be in the TLB and the data in the L2 cache (and L1 instruction cache, but that doesn't help). Am I missing something?

@avpatel
Copy link
Author

avpatel commented Jul 15, 2019

First of all, if you look existing uses of MPRV (OpenSBI) and SPRV (Xvisor, KVM, etc) then you will realize that it's quite a few instructions.

Further, when we do unprivledge load/store (from M-mode or HS-mode) it can fault because S-mode Linux has swapper (running in background) which can change page table mappings. We have seen unprivledge load/store faulting on SiFive unleashed board but the rate will depend on rate of swapping which in-turn depends on system load and pressure for more memory.

The rationale here is that HW already has the instruction when the fault is generated so it can easily save faulting instruction in some trap CSR (probably mtval2/stval2) at time of taking trap.

@aswaterman
Copy link
Member

aswaterman commented Jul 15, 2019

If it faults because S-mode has swapped the page out, the instruction page fault can be redirected to S-mode so that the OS page-fault handler will bring the instruction page back into memory. Then the OS will retry the misaligned instruction, and it will once again trap into M-mode for emulation. It should be a perf issue, not a functional issue.

@avpatel
Copy link
Author

avpatel commented Jul 15, 2019

In case of virtualization, lot of Guest/VM devices are software emulated (UART, ethernet, block, etc). This means whenever Guest Linux access these SW emulated devices on Xvisor/KVM it will generate STORE/LOAD trap. In fact, entire PLIC is SW emulated for Guest Linux.

Now imagine a loaded system where Guest Linux is doing lot of network/storage IO resulting in lots of PLIC, ethernet, block, etc MMIO accesses. For every MMIO access from Guest Linux is generating STORE/LOAD trap and we are doing get_insn() (roughly 50-100 instructions) just to get the faulting instruction.

I mean not knowing the faulting instruction is making RISC-V hypervisor extension less competitive in-terms of performance.

IMHO, having dedicated HS-mode CSR for faulting instruction makes perfect sense for RISC-V hypervisor extension. Providing same CSR for M-mode as well is for completeness.

@aswaterman
Copy link
Member

aswaterman commented Jul 15, 2019

One thing we are confused about is, why does it take so many instructions in those codes? In bbl, it takes 10 - 14 instructions, depending on size and alignment. That code is believed to handle all corner cases correctly. (I'm referring to this code when it is passed through GCC 8.2: https://github.com/riscv/riscv-pk/blob/d5909ee6f6c41ac0c539b687fb6e849401b0aecc/machine/unprivileged_memory.h#L69)

@avpatel
Copy link
Author

avpatel commented Jul 15, 2019

Yes, we have extended same code from BBL/OpenSBI to Xvisor/KVM but for HS-mode.

For KVM, the host Linux runs most of the KVM code in process context hence it can be preempted even when we are in Linux KVM code (except the KVM world-switch).

The get_insn() in KVM RISC-V has to be:

  1. Done atomically using local_irq_save() and local_irq_restore()
  2. HSTATUS, BSSTATUS and SSTATUS have to be saved-updated-restored

Above adds more instructions to get_insn().
(Refer, https://github.com/avpatel/linux/blob/riscv_kvm_v1/arch/riscv/kvm/vcpu_exit.c)

@aswaterman
Copy link
Member

Thanks for the link.

To be clear, I'm not opposed to adding this feature, I just want to see a quantitative case for its inclusion. When emulating device I/O, what fraction of the runtime is in get_insn, in the case that the guest code is already hot?

@avpatel
Copy link
Author

avpatel commented Jul 15, 2019

We don't have quantitative data as of now about amount of runtime spend in get_insn(). The impact of get_insn() will only show up in Guest Linux workloads resulting in high MMIO access.

Currently, irrespective to the hypervisor type we will have get_insn() called once for every MMIO trap.

Here's some details about MMIO trap rate for 1Gbps network traffic from Guest Linux....

Network throughput of 1Gbps from Guest Linux implies 83K packets-per-second of 1500 bytes (max MTU size). Now for every virtual ethernet interrupt, we will have two PLIC MMIO traps (read CLAIM register and write CLAIM register). Now, the number of MMIO access for virtual ethernet controller will vary based on type of ethernet controller we are emulating in hypervisor. If we assume 2 MMIO traps to virtual ethernet controller per-packet (optimistic) and one virtual interrupt per-packet then this will be roughly 333K MMIO traps per second for 1Gbps traffic. Now if get_insn() is 20 instructions then we have 6.6 million instructions per-second just to read the faulting instructions.

@jscheid-ventana
Copy link
Contributor

Unfortunately, knowing just fault address for above mentioned traps is not
sufficent because software needs to know exact width (1, 2, 4, or 8 bytes)
of load/store operation and target registers for handling the trap.

As an alternative to providing the full instruction memory encoding, consider providing applicable exception syndrome information directly. Such has the possibility to be much less constraining on implementations, and potentially more efficient for SW. So in this case, pieces of information like the access width and target register.

@fintelia
Copy link
Contributor

As an alternative to providing the full instruction memory encoding, consider providing applicable exception syndrome information directly. Such has the possibility to be much less constraining on implementations, and potentially more efficient for SW. So in this case, pieces of information like the access width and target register.

For uncompressed load/store instructions that information is stored directly in bits [12,13] and [7,11] respectively (and whether to sign extend in bit 14). It is rather more annoying to extract that information from the compressed versions but perhaps mtval2/stval2 could always store the uncompressed version of instructions? Bits [0, 1] would then be free to indicate the real instruction size for advancing the PC. This would also give a reason to make mtval2/stval2 consistent across all traps including illegal instruction traps.

@avpatel
Copy link
Author

avpatel commented Jul 16, 2019

Unfortunately, knowing just fault address for above mentioned traps is not
sufficent because software needs to know exact width (1, 2, 4, or 8 bytes)
of load/store operation and target registers for handling the trap.

As an alternative to providing the full instruction memory encoding, consider providing applicable exception syndrome information directly. Such has the possibility to be much less constraining on implementations, and potentially more efficient for SW. So in this case, pieces of information like the access width and target register.

Yes, having decoded information (such as access width and target registers number) in mtval2/stval2 CSR will make SW more efficient. In fact, this is what is done in ARM64 architecture. Although, issue with ARM64 way of representing decoded information is that it does not cover all ARM64 load/store instructions particularly it does not cover ARM64 LDP/STP instructions where we have two target/source registers.

The rationale for full instruction encoding in mtval2/stval2 was:

  1. It is inline with what's already done for illegal instruction trap
  2. Keeps things simpler for CPU designers

I guess we can also go with decoded information in mtval2/stval2 because we don't have too many variants of load/store instructions in RISC-V.

Regards,
Anup

@aswaterman
Copy link
Member

Providing exactly the bits that were fetched from memory is the more future-proof approach, and it also matches the existing mtval illegal-instruction trap behavior.

@jscheid-ventana
Copy link
Contributor

Providing exactly the bits that were fetched from memory is the more future-proof approach, and it also matches the existing mtval illegal-instruction trap behavior.

In general, providing more exception syndrome information, and especially more initial or intermediate state information, can impose higher costs for various implementation choices. The intent of providing abstract syndrome information is to focus on the intersection of what is useful for SW and what is likely to be available to a broad set of implementations at the point a particular exception type is identified.

Example implementations where providing the full instruction value has additional cost:

  • Hardware implementations where instructions are decomposed into operations that do not preserve the original instruction encodings past the point of decode
  • Partial decode may occur prior to caching structures.
  • Dynamic code optimization implementations such as Transmeta where instructions are similarly decomposed but in a software layer. etc.

All of these implementations would likely know aspects of the instruction, such as the access type, size, destination register, etc. at the time of detecting a data access fault, but not the specific instruction encoding.

The mtval values are optional. But the risk is that the architecture will eventually need multiple mechanisms to present syndrome information to SW. So to address this it’s worth contemplating a mechanism that is more broadly feasible, presenting a single architecture to SW.

@avpatel
Copy link
Author

avpatel commented Nov 7, 2019

Closing this issue since it's already covered in v0.5 draft.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants