Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement vectored mtvec #691

Merged
merged 4 commits into from Sep 26, 2023
Merged

Conversation

NikLeberg
Copy link
Collaborator

@NikLeberg NikLeberg commented Sep 17, 2023

Well hello there!

I stumbled upon this:

An implementation may have different alignment constraints for different modes. In particular, MODE=Vectored may have stricter alignment constraints than MODE=Direct.
[...]
Allowing coarser alignments in Vectored mode enables vectoring to be implemented without a hardware adder circuit.

Which I found very interesting and thought we could implement that! It could shorten the latency from IRQ to calling the right ISR as no intermediate generic handler is required. A simple jal zero, <function> as mtvec jump table could directly jump to the handler.

Example jump table:

.balign 128 // 128 byte alignment i.e. lowest 7 bits all zero
mtvec_jump_table:
  jal zero, trap_handler          //  0 - exception (all)
  nop                             //  1 - n/a
  nop                             //  2 - n/a
  jal zero, msi_handler           //  3 - machine software interrupt
  nop                             //  4 - n/a
  nop                             //  5 - n/a
  nop                             //  6 - n/a
  jal zero, mti_handler           //  7 - machine timer interrupt
  nop                             //  8 - n/a
  nop                             //  9 - n/a
  nop                             // 10 - n/a
  jal zero, mei_handler           // 11 - machine external interrupt
  nop                             // 12 - n/a
  nop                             // 13 - n/a
  nop                             // 14 - n/a
  nop                             // 15 - n/a
  jal zero, firq0_handler         // 16 - fast interrupt  0
  jal zero, firq1_handler         // 17 - fast interrupt  1
  [...]

This can then be installed with:

neorv32_cpu_csr_write(CSR_MTVEC, mtvec_jump_table | 0x1);

SiFive actually implemented this in plain C using some nice GCC compiler tricks.

As the direct mode is still available, no immediate change to the current sw framework is required. But those that wish to utilize the lower latency may enable vectored mtvec mode.

I'm not sure about the implications about area and speed. Quartus showed non-deterministic values for LUT usage and adding the vectored mode actually decreased LUT usage (with pr: 1474; without pr: 1494).

WIP to get your feedback if this is generally wished for or not. If ok then I can come up with a test to see how much reduction in latency there actually is and also add some documentation.

Cheers,
Nik

@NikLeberg NikLeberg marked this pull request as draft September 17, 2023 15:24
@stnolting
Copy link
Owner

stnolting commented Sep 19, 2023

I'm not sure about the implications about area and speed. Quartus showed non-deterministic values for LUT usage and adding the vectored mode actually decreased LUT usage (with pr: 1474; without pr: 1494).

I have synthesized your modifications and compared them to the current version. Interestingly, I can confirm your findings - the hardware is smaller with vectored mode implemented. To this seems like a good deal: more functionality with less hardware 😎

So, I really like this! But we should think about the alignment of the table entries. I am not sure about the actual use cases, but the current alignment only allows a single jump instruction per table entry reaching a ~1MB range. What if we want to place the table in RAM, say at address 0x8000_0000, and have all the interrupt handler functions in ROM, say at address 0x0000_0000? Then a single jump instruction is not able to reach the according handler functions. Is this something we should take care of?! 🤔 Can we use several instructions (in the table) to construct larger branches without polluting registers of t he interrupted context?

WIP to get your feedback if this is generally wished for or not.

So yes! Let's add this! 👍

@stnolting stnolting self-assigned this Sep 19, 2023
@stnolting stnolting added enhancement New feature or request risc-v compliance Modification to comply with official RISC-V specs. HW hardware-related labels Sep 19, 2023
@NikLeberg
Copy link
Collaborator Author

To this seems like a good deal: more functionality with less hardware 😎

Perfect, thanks for checking. Let's do this. :)

Then a single jump instruction is not able to reach the according handler functions. Is this something we should take care of?! 🤔 Can we use several instructions (in the table) to construct larger branches without polluting registers of t he interrupted context?

This is actually how it's designed in the ISA... 🤷‍♂️, pc is simply to be put at mtvec + 4 * mcause which gives every interrupt cause only one single 32-bit instruction:

When MODE=Vectored, all synchronous exceptions into machine mode cause the pc to be set to the address in the BASE field, whereas interrupts cause the pc to be set to the address in the BASE field plus four times the interrupt cause number. [source]

But this makes the control hardware actually simpler. No additional fetching of the address from memory needs to be done.

We could solve this in software by jumping to an intermediate larger table that holds more entries per interrupt source and is located close (+/- 1 MB) to the mtvec table. I.e. the trap handler jumps to mtvec + 4 * mcause, where a larger table is located from where we can jump to the whole 4 GB address space. The intermediate trampoline table must save or utilize a scratch register and modify mpc though... I fear that with such a setup the latency is not reduced much.

Alternatively we could support the CLINT ISA Draft in the future. It adds with xtvt a new CSR that is an actual real vector table. But this is more complex: On an interrupt the table entry first needs to be fetched from memory before it can be jumped to it.

So yes! Let's add this! 👍

Great, I'll work out a latency test then. :)

@stnolting
Copy link
Owner

stnolting commented Sep 21, 2023

This is actually how it's designed in the ISA... 🤷‍♂️, pc is simply to be put at mtvec + 4 * mcause which gives every interrupt cause only one single 32-bit instruction:

I was thinking about a more relaxed alignment of the individual table entries, maybe something like mtvec + 16 * mcause providing space for up to 4 32-bit instructions per entry... 🤔 But you are right - let's stick to the RISC-V specs.

We could solve this in software by jumping to an intermediate larger table that holds more entries per interrupt source and is located close (+/- 1 MB) to the mtvec table

That would be a nice workaround - if you are not too concerned about interrupt latency, of course.

Alternatively we could support the CLINT ISA Draft in the future. It adds with xtvt a new CSR that is an actual real vector table. But this is more complex: On an interrupt the table entry first needs to be fetched from memory before it can be jumped to it.

Yeah I saw that spec... But let's keep things simple for now 😉

Great, I'll work out a latency test then. :)

Awesome! We should also think about a simple test case for the processor_check to verify this new hardware mechanism feature. 😉

@NikLeberg
Copy link
Collaborator Author

I have tested the latency of the direct and vectored mtvec mode. It is highly dependant on how the software is set up though. The basic test setup was: CPU is in wfi sleep with mei interrupt enabled. Testbench triggers interrupt and the installed ISR writes GPIO pin 0 to high. The latency is then assumed to be the time between mei assertion and gpio value change.

I got some preliminary latency measurements while simulating various different SW configurations:

  • 26 cycles: Vectored mode + not calling functions in ISR
  • 121 cycles: Vectored mode + calling functions in ISR
  • 73 cycles: Direct mode + simple switch-case handler + not calling functions in ISR
  • 142 cycles: Direct mode + simple switch-case handler + calling functions in ISR
  • 265 cycles: Direct mode + default neorv32 RTE + not calling functions in ISR
  • 285 cycles: Direct mode + default neorv32 RTE + calling functions in ISR

By not calling any functions, gcc is able to optimize away the saving and restoring of the caller-saved cpu registers in the ISR.

The vectored mode seems to reduce the interrupt latency up to 10 times! 🚀

We should also think about a simple test case for the processor_check to verify this new hardware mechanism.

Sure! I'll probably manually install a table, use the sim_irq_trigger() to test mei and msi interrupts, and then re-install the rte with the global_trap_handler() as handler for everything. Do you see any problems with that approach?

@NikLeberg
Copy link
Collaborator Author

What do I do wrong...? I can compile and run the new test just fine...

root@neorv32_soc:/workspaces/neorv32_soc/lib/neorv32/sw/example/processor_check# make all
Memory utilization:
   text    data     bss     dec     hex filename
  23636       0     296   23932    5d7c main.elf

root@neorv32_soc:/workspaces/neorv32_soc/lib/neorv32/sw/example/processor_check# make sim
Installing application image to ../../../rtl/core/neorv32_application_image.vhd
Simulating neorv32_application_image.vhd...
[...]
[27] Vectored IRQ (sim) [ok]
[...]

I have GCC 12.2.0.

@stnolting
Copy link
Owner

stnolting commented Sep 23, 2023

Hey @NikLeberg!

Looks great so far! I'm in the train right now having bad no internet connection... 🙈
I am no C/software/GCC expert at all, but as far as I can see this might be the problem:

void vectored_irq_table(void) {
  asm volatile(
    ".org  vectored_irq_table + 0*4     \n"
    "jal   zero,vectored_global_handler \n" // 0

The entire code is compiled first and is linked afterwards. So at compile-time the label vectored_global_handler is not known (= the address is not fixed) until linking.

You need to keep an "alias" here until linking. Maybe something like this could work (?):

  asm volatile(
    ".org  vectored_irq_table + 0*4     \n"
    "jal zero, %[dst]                   \n" // 0
  : : [dst] "i" ((uint32_t)&vectored_global_handler)
  );

At least this is what I see from the GitHub actions log. Maybe it is just a version issue - the actions use GCC 12.1.

@stnolting
Copy link
Owner

By the way, thanks for the detailed latency evaluation!

Vectored interrupts might be very handy for small-scale real-time systems running at low operating frequencies. I think FreeRTOS (now?) also provides an option to use vectored interrupts... ?! 🤔

@stnolting stnolting self-requested a review September 23, 2023 16:40
@stnolting
Copy link
Owner

Maybe we could put the table into a unique compilation unit - so the labels might remain aliases until linking 🤔

section(".text.vector_table")

@NikLeberg
Copy link
Collaborator Author

That was worth a try, thanks for the idea. Sadly it was not it..
It's the -flto link time optimization flag! The run_check.sh script specifically sets that. Seems as if something gets optimized too early or the inline assembly simply can't be written like that when lto is enabled...

@stnolting
Copy link
Owner

In general, link-time optimization seems like a good thing to do. I am not sure if it would be some (dirty) work-around if we disable that...

What do you think about giving an explicit clobber list? This should keep an alias for the addresses which the linker should be able to resolve:

  asm volatile(
    ".org  vectored_irq_table + 0*4     \n"
    "jal zero, %[dst]                   \n" // 0
  : : [dst] "i" ((uint32_t)&vectored_global_handler)
  );

@NikLeberg
Copy link
Collaborator Author

NikLeberg commented Sep 24, 2023

Spoke too soon, it was the inline asm alias thing you suggested earlier. Sorry for not trying that out sooner.

Edit: Yes, LTO is always a good thing! I remember stm32 crt0 not allowing to be compiled with lto what really annoyed me.

@NikLeberg
Copy link
Collaborator Author

All passing now. Thanks for the help! ❤
I rebased on latest main. Now the last step would be to add documentation.
Can you share some insights about your editor setup? I'm not too skilled with AsciiDoc to get it visually right on the first time. And waiting for CI to scream at me is a slow workflow. 😄
Do you edit the .adoc files by hand or do you have a specific VsCode extension to recommend? Especially support for creating / editing tables would be nice to have.

@stnolting
Copy link
Owner

Awesome! Thanks for all your work Nik! I this is is very cool new feature 👍

Do you edit the .adoc files by hand or do you have a specific VsCode extension to recommend? Especially support for creating / editing tables would be nice to have

Actually, I use Notepad++ for everything. I guess I am quite a minimalist 😅 Anyway, asciidoc is quite simple (a bit like Markdown) and I think it is sufficient if we update the mtvec CSR description.

Beyond the documentation, we still need to update the version number and make an entry for the change log.

@NikLeberg NikLeberg marked this pull request as ready for review September 25, 2023 16:48
@stnolting
Copy link
Owner

Looks good to me! 👍

Thanks for all your work Nik!

@stnolting stnolting merged commit 562d838 into stnolting:main Sep 26, 2023
5 checks passed
@NikLeberg NikLeberg deleted the vectored_mtvec branch September 27, 2023 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request HW hardware-related risc-v compliance Modification to comply with official RISC-V specs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants