
SBI Remote Fence API: why doesn't this require the page table? #42

Closed
konrad-schwarz opened this issue Apr 8, 2020 · 32 comments

@konrad-schwarz
Contributor

The remote fence API allows invalidation of translations for memory ranges, possibly restricted to a given ASID or VMID.

The privileged architecture specification makes it clear that one Xfence.Yvma instruction needs to be issued per page-table entry that is to be invalidated; that is, one invalidation instruction per 4 KiB page in the memory range. With superpages, however, this count would be reduced per level of paging removed: by a factor of 2^10 for 32-bit paging systems and 2^9 for 64-bit paging systems.

For an arbitrary (process) address space, there is no way for the (machine-level) SBI to discover the corresponding page table (so as to map the address range onto the page-table entry structure); it seems the only viable strategy for the SBI is to assume that no superpages were used and to march through the memory range with a 4 KiB stride.

Could you please comment on whether this is true? If so, I'd have to say that this part of the specification is not well thought out: a convenience function that makes things dramatically worse in certain cases. I'd much prefer the SBI to limit itself to essential functionality.
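To make the cost concrete, here is a minimal sketch (the helper name is mine, not anything from the SBI spec) of how many invalidation operations a page-table-unaware implementation must issue when it marches through a range at a 4 KiB stride:

```c
#include <stdint.h>

/* Hypothetical helper: number of sfence.vma operations an SBI
 * implementation must issue for [start, start + size) when it cannot
 * see the page table and must assume a uniform 4 KiB page size. */
#define PAGE_SHIFT 12

static uint64_t fence_ops_4k_stride(uint64_t start, uint64_t size)
{
    if (size == 0)
        return 0;
    uint64_t first = start >> PAGE_SHIFT;                /* first page frame touched */
    uint64_t last  = (start + size - 1) >> PAGE_SHIFT;   /* last page frame touched */
    return last - first + 1;
}
```

For a single aligned 2 MiB superpage this yields 512 operations, where one sfence.vma would have sufficed had the superpage structure been known.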

@avpatel
Contributor

avpatel commented Apr 8, 2020

Remote sfence only deals with remote TLB shoot-down. It is not meant for updating remote page tables. An SMP OS kernel (e.g., Linux) will usually do a remote sfence after updating a page table.

Regards,
Anup

@konrad-schwarz
Contributor Author

Where in my post do you see anything about updating the remote page table?

My point is that the SBI is in no position to do remote TLB shoot down efficiently.

Since remote TLB shoot-down does not require machine-mode privilege (except for the inter-processor interrupt), and since the SBI isn't especially good at doing it, there is no reason to include this functionality.

@avpatel
Contributor

avpatel commented Apr 8, 2020

The RISC-V spec does not mandate which page sizes a TLB holds, so the SBI implementation has to be conservative and invalidate at 4 KiB granularity.

The S-mode software has to be intelligent about deciding the range of a remote TLB flush when making an SBI call.

For example, if S-mode software wants to invalidate a 2 MB page entry and the underlying HW TLB implements the 2 MB page size, then S-mode software should invalidate only a single 4 KiB range within the 2 MB region, not the entire 2 MB.

Regards,
Anup

@konrad-schwarz
Contributor Author

konrad-schwarz commented Apr 8, 2020

The RISC-V spec does not mandate which page sizes a TLB holds, so the SBI implementation has to be conservative and invalidate at 4 KiB granularity.

This inefficiency is precisely why the SBI should not be attempting to do this at all.

Furthermore, contrary to your assertion, the RISC-V spec does require the TLB to support all page sizes of the paging modes it supports; see Section 4.2:

For the common case that the translation data structures have only been modified for a single address mapping (i.e., one page or superpage), rs1 can specify a virtual address within that mapping to effect a translation fence for that mapping only.

I.e., if a superpage has been used, a single invalidation is all that is required.

The rest of your response is predicated on this erroneous assumption.

@avpatel
Contributor

avpatel commented Apr 8, 2020

A RISC-V implementation can always support fewer page sizes in the TLB to simplify the hardware, by breaking a bigger page-table entry into smaller mappings in the TLB.

The statement you are referring to does not mandate that all page sizes be supported in the TLB. Look at the phrase "single address mapping (i.e., one page or superpage)".

You can raise an issue against the RISC-V ISA manual if you like.

Regards,
Anup

@avpatel
Contributor

avpatel commented Apr 8, 2020

We tried doing remote TLB flushes in Linux with and without SBI-based remote TLB flushes. It turns out the SBI-based remote TLB flushes are much faster; please refer to our Linux kernel mailing list archives.

Further, SBI-based remote TLB flushes also help with remote TLB shoot-down for guests/VMs under hypervisors.

As I mentioned in my previous reply, the S-mode operating system has to use the SBI calls efficiently, by passing appropriate parameters, rather than relying on the SBI implementation to peek at page tables.

Regards,
Anup

@avpatel
Contributor

avpatel commented Apr 8, 2020

The email thread below provides a summary of past discussions regarding the SBI RFENCE extension.

https://patchwork.kernel.org/cover/11302029/

Regards,
Anup

@konrad-schwarz
Contributor Author

A RISC-V implementation can always support fewer page sizes in the TLB to simplify the hardware, by breaking a bigger page-table entry into smaller mappings in the TLB.

No, it cannot; the passage I quoted makes this clear.

The statement you are referring to does not mandate that all page sizes be supported in the TLB. Look at the phrase "single address mapping (i.e., one page or superpage)".

If not all page sizes need to be supported, then a given PTE would be represented in the TLB either by a larger page size or by a smaller one.

  • If larger, then more address space than requested would be mapped by this PTE. Obviously broken.
  • If smaller, then an sfence.vma instruction to a given address would invalidate just that smaller area, not the surrounding mapping, directly contradicting the quoted passage.

The only possible conclusion: the spec requires the TLB to support all possible page sizes of the paging modes it supports. Note that this is not particularly onerous: each TLB entry just needs a mask indicating which address bits it applies to.

And honestly: the primary motivation for having superpages is to reduce the number of TLB misses. An implementation that backed superpages with smaller entries would not be well received.
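The per-entry mask scheme can be sketched as follows (an illustrative model, not any particular implementation): each entry stores a tag plus a mask that clears the offset-within-mapping bits, so a 4 KiB entry and a 2 MiB superpage entry use the same comparator.

```c
#include <stdint.h>

/* Illustrative TLB entry: the mask selects which virtual-address bits
 * participate in the tag compare.  For a 4 KiB entry the mask covers
 * bits [63:12]; for a 2 MiB superpage it covers bits [63:21]. */
struct tlb_entry {
    uint64_t tag;   /* virtual address of the mapping's base */
    uint64_t mask;  /* ~(mapping_size - 1) */
};

static struct tlb_entry make_entry(uint64_t va, uint64_t mapping_size)
{
    struct tlb_entry e = { va & ~(mapping_size - 1), ~(mapping_size - 1) };
    return e;
}

static int tlb_hit(const struct tlb_entry *e, uint64_t va)
{
    /* Masked compare: any address inside the mapping matches. */
    return (va & e->mask) == e->tag;
}
```

A single sfence.vma to any address inside the superpage then matches, and can invalidate, exactly one entry.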


@avpatel
Contributor

avpatel commented Apr 8, 2020

I think you should raise an issue against the RISC-V ISA manual to get this clarified. I am quite confident that supporting all page sizes in the TLB is not mandatory. Supporting more page sizes makes the HW TLB lookup state machine more complicated and the TLB table larger. The complexity of the TLB state machine should be a choice of the RISC-V implementation, and the RISC-V spec does not mandate it.

Let me give a brief explanation of why SBI-based IPI and RFENCE calls are used in the RISC-V world.

All RISC-V systems available so far lack dedicated hardware for S-mode to inject IPIs. Most current systems have an M-mode CLINT, through which only M-mode IPIs can be injected. This means S-mode software needs assistance from M-mode to inject IPIs; for this reason we have SBI-based IPI injection. Now, S-mode can certainly do remote TLB flushes on its own using SBI-based IPIs, but this is slower (we have performance numbers as well) than an SBI-based remote TLB flush, hence we have SBI-based remote TLB flushes too. It also turns out that SBI-based remote TLB flushes are very useful in virtualization, because they are faster than CLINT trap-and-emulate, and the hypervisor can further optimise the effective set of host CPUs in which flushes are required.

With SBI v0.2, all SBI calls are optional and discoverable at boot time by S-mode software. In the future, when a RISC-V extension is defined for remote TLB flushes, S-mode software can prefer that over the SBI extension.

Regards,
Anup

@avpatel
Contributor

avpatel commented Apr 8, 2020

BTW, a RISC-V implementation that chooses to implement fewer page sizes will have to select the smaller page sizes to preserve functional correctness. This is pretty obvious, and such implementations usually target low-performance, low-power, small SoCs.

Regards,
Anup

@fintelia
Contributor

fintelia commented Apr 8, 2020

The RISC-V spec does not specify anything about which page sizes need to be stored in the TLB, because it doesn't even require that an implementation has a TLB.

@aswaterman

aswaterman commented Apr 9, 2020

There's nothing in the spec that supports the idea that superpage support is optional. Obviously, how superpages are implemented in the microarchitecture is up to the microarchitects: breaking them into 4K pages internally is a valid strategy, but it definitely complicates the implementation of SFENCE.VMA, because SFENCE.VMA is allowed to use any address within the superpage and must then flush all the related 4K entries out of such a TLB. So it probably won't be popular in practice.

@avpatel
Contributor

avpatel commented Apr 9, 2020

Thanks Andrew for the clarification.

Coming back to the original question of SBI RFENCE on superpages: even if we assume that the TLB does support superpages, it is still the responsibility of S-mode software to pass the right parameters.

For example, if S-mode software updates a 2 MB mapping at virtual address 0x10000000 in a page table shared across CPUs, then it will need to shoot down the TLB on all CPUs. S-mode software can make the SBI RFENCE call with start=0x10000000 and size=2MB or size=4KB. In this case, it should pass size=4KB, because it knows that it updated a 2 MB mapping and one sfence.vma is sufficient. The same rationale applies to using the sfence.vma instruction on a single CPU.
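This parameter choice can be sketched as follows (the struct and function names are mine, purely for illustration): when S-mode knows it modified exactly one mapping, it passes the mapping's base with a 4 KiB size, since one sfence.vma covers the whole mapping.

```c
#include <stdint.h>

struct fence_range { uint64_t start; uint64_t size; };

/* Hypothetical helper: pick SBI remote-fence arguments for a single
 * modified mapping of any size (4 KiB page or superpage).  Passing a
 * 4 KiB size tells the SBI implementation to issue exactly one
 * sfence.vma, which suffices because the spec lets sfence.vma name any
 * address within the mapping. */
static struct fence_range fence_args_single_mapping(uint64_t base,
                                                    uint64_t mapping_size)
{
    struct fence_range r;
    r.start = base & ~(mapping_size - 1); /* align down to the mapping */
    r.size  = 4096;                       /* one fence suffices */
    return r;
}
```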

The above example clearly shows that efficient use of the SBI RFENCE calls is the responsibility of S-mode software; S-mode software cannot expect the SBI implementation to peek into page tables.

Regards,
Anup

@konrad-schwarz
Contributor Author

konrad-schwarz commented Apr 9, 2020

Now S-mode can certainly do remote TLB flush on it's own using SBI based IPIs but this will be slower compared (we have performance numbers as well) to SBI based remote TLB flush hence we have SBI based remote TLB flushes as well.

Well, you should rethink your API to include a pointer to the relevant page table, so that the SBI can efficiently clear superpage entries.

Coming back to the original question of SBI RFENCE on super-pages, If we assume that TLB does support super pages then still it is responsibility of S-mode software to pass right parameters. [...]

Please demonstrate how S-mode software is supposed to call the SBI RFENCE API to invalidate an aligned 2 MiB region (a) mapped with 512 4 KiB pages and (b) mapped using a single 2 MiB superpage on RV64. If the answer in the first case is "512 calls to an RFENCE function", then your API is mis-documented and could easily be improved -- see the paragraph above.

@avpatel
Contributor

avpatel commented Apr 9, 2020

Use the SBI RFENCE calls for multiple CPUs just as you would use the sfence.vma instruction on a single CPU.

The SBI RFENCE calls are basically sfence.vma on multiple CPUs.

I don't think there is any inefficiency in the SBI RFENCE calls; the S-mode software has to use them in an efficient manner.

Regards,
Anup

@fintelia
Contributor

fintelia commented Apr 9, 2020

If I need to invalidate say 5 adjacent 2MB huge pages (a 10MB range of addresses) then it sounds like you are suggesting that I issue 5 separate SBI RFENCE calls. But isn't it rather wasteful to do 5 IPIs when only one is really required?

@avpatel
Contributor

avpatel commented Apr 9, 2020

Yes, 5 SBI RFENCE calls are more efficient for 5 consecutive 2MB pages than 1 SBI RFENCE call for 10MB, because the SBI implementation would have to issue sfence.vma 4 KiB at a time to flush the 10MB range.

For the same case on a single CPU, we would execute the sfence.vma instruction 5 times.

Basically, the SBI RFENCE calls are modelled on the definition of sfence.vma.

Regards,
Anup

@fintelia
Contributor

fintelia commented Apr 9, 2020

But the current design is the wrong baseline to use. Yes 5 SBI RFENCE calls might be the fastest approach given the current options, but I'd estimate it is still nearly 5x slower than doing 1 SBI RFENCEv2 call that took a "stride" as an additional argument.

This isn't necessarily to say that RFENCEv2 should be added; the workloads that this matters in might be sufficiently rare, or the speedup so minimal that this doesn't matter. I'm just trying to talk through the trade-offs in the current design.
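The trade-off can be quantified with a small sketch (RFENCEv2 and its stride argument are hypothetical, as discussed): the number of local fence operations the SBI side would perform for one call covering a range at a given stride.

```c
#include <stdint.h>

/* Hypothetical: local sfence.vma operations for one remote-fence call
 * covering a range of the given size at the given stride.  The current
 * API behaves as if the stride were fixed at 4 KiB. */
static uint64_t rfence_ops(uint64_t size, uint64_t stride)
{
    return (size + stride - 1) / stride; /* ceiling division */
}
```

For the 10 MB example: one call with a 2 MiB stride would need 5 local fences and a single IPI per target hart, versus 5 calls (and 5 IPIs) with the current API, or 2560 local fences for one call under the fixed 4 KiB stride.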

@avpatel
Contributor

avpatel commented Apr 9, 2020

Yes, a stride parameter would reduce multiple SBI RFENCE calls to just one, but only for the case of multiple contiguous superpages.

As you mentioned, there is not much benefit in adding calls with a stride parameter to the SBI RFENCE extension, since it only helps a very particular case that will be rare in real-world usage.

Regards,
Anup

@avpatel
Contributor

avpatel commented Apr 11, 2020

I further investigated the potential use of a stride parameter in the Linux kernel.

It turns out that it is currently not possible for the Linux kernel to use a stride parameter, because none of the flush_tlb_xyz() arch functions receive a stride/mapping_size parameter from the generic Linux memory-management code.

The generic Linux memory-management code would need to be extended first so that it can pass a stride/mapping_size parameter to the arch's flush_tlb_xyz() functions.

(Refer to <linux_source>/arch/riscv/include/asm/tlbflush.h.)

Regards,
Anup

@konrad-schwarz
Contributor Author

Please see #42 (comment) on how to solve your conundrum.

@avpatel
Contributor

avpatel commented Apr 14, 2020

Passing a page table as a parameter to SBI RFENCE calls is the last thing we want, because on SMP systems the Linux kernel can update the page table on one CPU while another CPU makes an SBI call with that same page table as a parameter. We have seen this in the past when debugging OpenSBI.

Regards,
Anup

@konrad-schwarz
Contributor Author

This is an interesting point.

An OS needs to invalidate after modifying its page table, because a RISC-V core is allowed to prefetch TLB entries -- and because the spec says so:

To ensure the implicit reads observe writes to the same memory locations, an SFENCE.VMA instruction must be executed after the writes to flush the relevant cached translations.

So we can no longer assume that the original page table is available to guide invalidations (unless the OS uses some sort of copying/universal non-blocking construction to modify the page table).

Even though Linux in its current form cannot easily use a stride parameter, other operating systems or hypervisors may be able to do so. The SBI should not cater solely to Linux. So I think the addition of a stride parameter to the current API would be worthwhile.

Just as a Gedankenexperiment: the only interesting thing about the page table from the RFENCE perspective is the superpage structure, i.e., whether branches in the page-table tree are shorter than maximal length. The class of page-table modifications that does not modify the page-table structure includes swapping/paging operations. These might be frequent enough to warrant RFENCE variants that reference the page table (in its updated form) -- although paging operations will typically touch only individual entries.

As an aside: why does RISC-V allow the caching of invalid entries? An access violation is going to lead to some sort of error handling (a slow path) anyway; what is the use of occupying TLB entries with translations that are supposed to happen infrequently and are not expected to be handled quickly?

@aswaterman

As an aside: why does RISC-V allow the caching of invalid entries? An access violation is going to lead to some sort of error handling (a slow path) anyway; what is the use of occupying TLB entries with translations that are supposed to happen infrequently and are not expected to be handled quickly?

It simplifies implementations. It's a useful simplifying assumption that instructions can only complete if they get a TLB hit.

@konrad-schwarz
Contributor Author

Another question: is there any chance of introducing remote TLB fences at the hardware level? E.g., PowerPC has tlbie (invalidate entry), tlbia (invalidate all), and tlbsync (ensure previous TLB-invalidate instructions have completed) instructions for this.

Regarding tlbsync: my impression is that RISC-V avoids global fences; but one could make the global invalidate instructions synchronous (i.e., blocking), just as memory accesses block the CPU.

How complicated can it be to extend the existing cache-coherency protocol to operate on the TLB as well? And add S-level IPIs at the same time :-) And cache-management instructions for non-coherent external agents?

Thanks!

@aswaterman

Another question: is there any chance of introducing remote TLB fences at the hardware level? E.g., PowerPC has tlbie (invalidate entry), tlbia (invalidate all), and tlbsync (ensure previous TLB-invalidate instructions have completed) instructions for this.

These instructions add massive hardware complexity, and their benefits are a mixed bag. (The old x86-style scheme where ASIDs are private has its own set of advantages and is much easier to implement.) In the long run, if people make a good quantitative case for these instructions' ROI, they might eventually make it in.

I can understand your intuition for why cache coherence might help with TLB coherence, and indeed some IBM mainframes have implemented TLB coherence in this manner. Unfortunately, it requires tagging TLB entries with the addresses of all levels of the page table that might ultimately affect the translation. Obviously, it's possible to compress this information, but it's still costly and complex.

@avpatel
Contributor

avpatel commented May 7, 2020

I think it is possible to add a stride/page_size parameter to the SBI RFENCE calls in a way that is backward compatible with the existing SBI v0.2 RFENCE calls.

We can name the additional parameter "stride_order_hint" or "page_order_hint". As the name suggests, the additional parameter is a hint, and S-mode software can always pass zero in it.

If S-mode software passes an invalid value of "page_order_hint" (i.e., value < 12 or __riscv_xlen <= value), then the SBI implementation should fall back to page_order_hint = 12.

If S-mode software passes a valid value X as "page_order_hint" but the "start" address is not aligned to 2^X, then the SBI implementation should likewise fall back to page_order_hint = 12.

An SBI implementation can also choose to ignore the "page_order_hint" parameter entirely and always assume it to be 12 (i.e., 4 KiB). This is equivalent to the way the SBI v0.2 RFENCE calls are already defined; hence the additional "page_order_hint" parameter is backward compatible with the SBI v0.2 RFENCE calls.
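The fallback rules above can be sketched as follows (a model of the proposal, not a specified interface; XLEN is fixed at 64 here for illustration, where the proposal uses __riscv_xlen):

```c
#include <stdint.h>

#define XLEN 64 /* assumption: RV64 */

/* Returns the page order the SBI implementation actually uses: an
 * out-of-range hint, or a start address not aligned to 2^hint,
 * degrades to order 12 (4 KiB), matching SBI v0.2 behaviour. */
static unsigned effective_page_order(uint64_t start, unsigned hint)
{
    if (hint < 12 || hint >= XLEN)
        return 12; /* invalid hint */
    if (start & ((UINT64_C(1) << hint) - 1))
        return 12; /* start not aligned to 2^hint */
    return hint;
}
```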

@fintelia, since this was your suggestion, I request you to send a patch or GitHub PR for this.

Thanks,
Anup

@fintelia
Contributor

fintelia commented May 7, 2020

I'm not sure that adding a stride_order_hint parameter would be backward compatible. My understanding is that the ABI allows any argument registers beyond the number needed by the called function to hold garbage values.

@avpatel
Contributor

avpatel commented May 7, 2020

Yes, the ABI does not specify the state of unused argument registers, but luckily most S-mode software (Linux, etc.) sets these unused registers to zero.

Regards,
Anup

@avpatel
Contributor

avpatel commented May 7, 2020

To tackle the backward-compatibility issue, we can use the higher bits of the SBI RFENCE function number.

For example, if bit[31] of the function number is set, then the stride_order_hint argument is valid.

Alternatively, we can treat some of the higher bits as a function version number.
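A minimal sketch of such an encoding (the bit position is illustrative, not a ratified layout):

```c
#include <stdint.h>

/* Hypothetical encoding: bit 31 of the RFENCE function ID marks calls
 * whose stride_order_hint argument register is meaningful.  Legacy
 * callers leave the bit clear and get SBI v0.2 semantics. */
#define RFENCE_FID_HAS_HINT (UINT32_C(1) << 31)

static int fid_has_stride_hint(uint32_t fid)
{
    return (fid & RFENCE_FID_HAS_HINT) != 0;
}

static uint32_t fid_base(uint32_t fid)
{
    return fid & ~RFENCE_FID_HAS_HINT; /* legacy function number */
}
```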

Regards,
Anup

@atishp04
Collaborator

atishp04 commented Nov 1, 2022

Closing this issue. Please reopen if it is not resolved.

@atishp04 atishp04 closed this as completed Nov 1, 2022