Weird behavior on RISC-V code (compiled with the toolchain) #1434

Closed
alex-paschoaletto opened this issue Mar 12, 2024 · 12 comments

Comments

@alex-paschoaletto

alex-paschoaletto commented Mar 12, 2024

Hello everyone! Hope you're all doing well.

I don't know if this is the proper place to ask for help or report a bug, so please feel free to tell me if I should post this elsewhere. I've been trying to develop real-time software on RISC-V (taking the official FreeRTOS QEMU demo ports as a starting point), but some very weird issues got in the way and made me decide to ask for help. I have already posted this issue on the FreeRTOS forums, but as one might see throughout this post, it might have something to do with the compiler and not the RTOS library.

TL;DR

This repository contains a sample project that illustrates well one of the problems I've been getting lately. It basically goes like this: the program executes fine when a certain code snippet is encapsulated within a function, but "crashes" (i.e. hangs) when the same snippet is placed directly in the main code:

for(int i=0; i < NUMBER_OF_ITEMS; i++){
    createAndPushItem(i);

    // the function above does the exact same thing as the commented code below
    // yet, the commented code does not work and will crash the program. but why??
        
    // int index = priorities[i];                                               
    // void *value = (void *) getValue(i + 1);
    // LinkedListItem_t *item = createItem(index, value);                          
    // if(item){
    //     push(item, &list);
    // }
}

The scope shouldn't matter at all here, since there is no local variable being used or anything like that. Also, for the sake of simplicity, the sample code doesn't even use FreeRTOS' tasks or scheduler, just the pvPortMalloc/vPortFree functions.
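For reference, the wrapped version can be sketched roughly as below. Only the body of createAndPushItem follows the commented-out snippet above. The LinkedListItem_t layout, the LinkedList_t type, the contents of priorities, and the bodies of getValue, createItem and push are assumptions made up for illustration, and plain malloc stands in for pvPortMalloc.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUMBER_OF_ITEMS 5

/* Hypothetical item layout; the real one in the sample project may differ. */
typedef struct LinkedListItem {
    int index;
    void *value;
    struct LinkedListItem *next;
} LinkedListItem_t;

/* Hypothetical list header holding the head and tail pointers. */
typedef struct {
    LinkedListItem_t *head;
    LinkedListItem_t *tail;
    int count;
} LinkedList_t;

LinkedList_t list = { NULL, NULL, 0 };
const int priorities[NUMBER_OF_ITEMS] = { 3, 1, 4, 2, 5 }; /* made-up values */

/* Placeholder payload generator (assumption: the payload fits in a pointer). */
intptr_t getValue(int n) { return (intptr_t)n * 10; }

/* Allocate one item on the heap; returns NULL if the allocation fails. */
LinkedListItem_t *createItem(int index, void *value) {
    LinkedListItem_t *item = malloc(sizeof *item);
    if (item) {
        item->index = index;
        item->value = value;
        item->next = NULL;
    }
    return item;
}

/* Append an item at the tail of the list. */
void push(LinkedListItem_t *item, LinkedList_t *l) {
    if (l->tail) {
        l->tail->next = item;
    } else {
        l->head = item;
    }
    l->tail = item;
    l->count++;
}

/* Same statements as the commented-out snippet, wrapped in a function. */
void createAndPushItem(int i) {
    int index = priorities[i];
    void *value = (void *) getValue(i + 1);
    LinkedListItem_t *item = createItem(index, value);
    if (item) {
        push(item, &list);
    }
}
```

In both variants the temporaries (index, value, item) are locals, so each one needs stack or register space in whichever frame the compiler places it.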

Context

It all started with me deciding to use the ESFree library to add an EDF scheduler to the project, but the code didn't work out of the box on either of the 'virt' ports. It ran OK on the ARM ports for QEMU, though, so I figured it must be some incompatibility issue on the RISC-V side. While investigating, I found the problem seemed to be related to the linked lists API, as the List_t uxNumberOfItems variable just seemed to go crazy at a certain point of the code that wasn't even supposed to interact with it.

I then decided to create my own library for managing linked lists as a workaround and created a whole new sample project to test it, but that attempt also went sideways with a new bug: the code would execute just fine when a certain snippet was placed inside a dedicated function, but not when placed directly in the main code. That is what I describe in the TL;DR section above.

This doesn't really seem to me like a problem within FreeRTOS since, as I said, there are no schedulers or tasks being used. The only resources provided by FreeRTOS which I actually use in the sample project are pvPortMalloc and vPortFree, but the problem happens anyway with regular malloc (as shown below). I can't really cross FreeRTOS out of the equation, though, since the whole port setup (e.g. libraries, assembly files and makefile) has been provided by it.

Workarounds

These were my attempts at solving the problem so far:

  1. Tried using the other FreeRTOS RISC-V port meant for QEMU (RISC-V-Qemu-virt-GCC), but the problem persists.
  2. Tried building the project without compiler optimization (using -O0 instead of -Os in the makefile). The output is even worse, with the execution hanging even before printing the list items for the first time.
  3. Tested with other heap files provided by FreeRTOS (heap_1, heap_3 and heap_4). No success whatsoever.
  4. Tested with regular malloc() and free() from stdlib.h. Output shows same behavior as compiling without optimization (i.e. prints nothing on the terminal).

I am familiar with many programming languages, but FreeRTOS/QEMU/compiler development and debugging are definitely out of my league, which is why I'm here asking for help from the clever ones. I'll be posting this issue on the QEMU forums as well, if there are any. Hopefully someone can help me out on this quest.
Thanks in advance.

@alex-paschoaletto changed the title from "Weird behavior on RISC-V code" to "Weird behavior on RISC-V code (compiled with the toolchain)" on Mar 12, 2024
@TommyMurphyTM1234
Collaborator

Did you debug your code to find out why it crashes/hangs in the failure case? In particular what CSRs like mcause, mepc, mtval are set to at the point of failure? If interactive debug isn't possible for some reason then at least have something like an exception/trap handler that captures/dumps these in some way in order to better understand the nature of the failure. Systematic debug is a better approach than randomly changing other things.

@patrick-rivos
Collaborator

Thanks for spending the time to try reducing the problem/make such a clear writeup about this hang.
Can you share the hash/version of the riscv compiler you're using (riscv32-unknown-elf-gcc -v)?

@alex-paschoaletto
Author

Did you debug your code to find out why it crashes/hangs in the failure case? In particular what CSRs like mcause, mepc, mtval are set to at the point of failure? If interactive debug isn't possible for some reason then at least have something like an exception/trap handler that captures/dumps these in some way in order to better understand the nature of the failure. Systematic debug is a better approach than randomly changing other things.

Hello there! I'm sorry if I didn't provide enough information. I am quite new to this world of low-level debugging and don't know exactly how to do what you say. Could you please explain how I would check the register values throughout execution?

Thanks for spending the time to try reducing the problem/make such a clear writeup about this hang. Can you share the hash/version of the riscv compiler you're using (riscv32-unknown-elf-gcc -v)?

Yeah, sure! Here's the terminal output to riscv32-unknown-elf-gcc -v:

Using built-in specs.
COLLECT_GCC=riscv32-unknown-elf-gcc
COLLECT_LTO_WRAPPER=/opt/riscv/libexec/gcc/riscv32-unknown-elf/13.2.0/lto-wrapper
Target: riscv32-unknown-elf
Configured with: /home/alex/riscv/riscv-gnu-toolchain/gcc/configure --target=riscv32-unknown-elf --prefix=/opt/riscv --disable-shared --disable-threads --enable-languages=c,c++ --with-pkgversion= --with-system-zlib --enable-tls --with-newlib --with-sysroot=/opt/riscv/riscv32-unknown-elf --with-native-system-header-dir=/include --disable-libmudflap --disable-libssp --disable-libquadmath --disable-libgomp --disable-nls --disable-tm-clone-registry --src=.././gcc --enable-multilib --with-abi=ilp32d --with-arch=rv32gc --with-tune=rocket --with-isa-spec=20191213 'CFLAGS_FOR_TARGET=-Os    -mcmodel=medlow' 'CXXFLAGS_FOR_TARGET=-Os    -mcmodel=medlow'
Thread model: single
Supported LTO compression algorithms: zlib
gcc version 13.2.0 () 

@TommyMurphyTM1234
Collaborator

TommyMurphyTM1234 commented Mar 12, 2024

Did you debug your code to find out why it crashes/hangs in the failure case? In particular what CSRs like mcause, mepc, mtval are set to at the point of failure? If interactive debug isn't possible for some reason then at least have something like an exception/trap handler that captures/dumps these in some way in order to better understand the nature of the failure. Systematic debug is a better approach than randomly changing other things.

Hello there! I'm sorry if I didn't provide enough information. I am quite new to this world of low-level debugging and don't know exactly how to do what you say. Could you please explain how would I check the register values throughout execution?

The mcause, mepc and mtval CSRs provide information about a RISC-V trap/exception. If your "hang/crash" is actually a trap/exception then the values of these registers will be very useful in investigating the problem. The CSRs are described in the RISC-V Privileged spec:

The values of these registers should be available by dumping them in some way from, say, a catch-all trap/exception handler or via interactive debugging.

If the problem doesn't actually result in a trap/exception then interactive debugging should still help to identify where/when/why the code doesn't behave as you expect and pinpoint the root cause of the problem.
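A catch-all handler of the kind described above might capture the three CSRs along the following lines. This is only a sketch under assumptions: trap_info_t, capture_trap_info, format_trap_info and the READ_CSR macro are hypothetical names, the csrr reads are compiled only for RISC-V targets (a host build gets zeros), and how the formatted text reaches a UART or debugger is left out.

```c
#include <stdio.h>

/* Hypothetical snapshot of the machine-mode trap CSRs. */
typedef struct {
    unsigned long mcause; /* trap cause code */
    unsigned long mepc;   /* PC of the instruction that trapped */
    unsigned long mtval;  /* extra trap information, e.g. a faulting address */
} trap_info_t;

#if defined(__riscv)
/* Read a CSR by name with the csrr instruction (requires machine mode). */
#define READ_CSR(name, dest) __asm__ volatile("csrr %0, " #name : "=r"(dest))
#else
/* Host fallback so the sketch also builds off-target. */
#define READ_CSR(name, dest) ((dest) = 0ul)
#endif

/* Capture the CSRs; should run early in the handler, before another trap
   can overwrite them. */
trap_info_t capture_trap_info(void) {
    trap_info_t info;
    READ_CSR(mcause, info.mcause);
    READ_CSR(mepc, info.mepc);
    READ_CSR(mtval, info.mtval);
    return info;
}

/* Format the snapshot into a caller-provided buffer; returns snprintf's result. */
int format_trap_info(const trap_info_t *info, char *out, size_t out_size) {
    return snprintf(out, out_size, "mcause=0x%lx mepc=0x%lx mtval=0x%lx",
                    info->mcause, info->mepc, info->mtval);
}
```

A handler would typically call capture_trap_info() first, print the formatted line, and then spin in an infinite loop so the state stays available to a debugger. Formatting itself needs stack space, so on a system whose stack is already damaged it may be safer to store the raw values in a global and inspect them with gdb.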

There is plenty of info about debugging RISC-V programs running on QEMU available elsewhere. E.g.:

@patrick-rivos
Collaborator

@TommyMurphyTM1234 provided some great links for debugging with QEMU - the second link is helpful for debugging when using qemu-system (which your program does).

If you're able to extract this problem into a program that runs in qemu-user mode (make build-qemu) and produces a different result from x86/arm/whatever your host machine uses then I have a flow that will let me reduce it down to a small program that will show the bug (or the unintended undefined behavior).

AKA if you can reproduce the problem with a flow like this:

<riscv32-unknown-elf-gcc compile commands>
<qemu-user output.out>
{result}

<gcc compile commands for the same source files>
<./output.out>
{result that doesn't match riscv}

I can reduce it down to a preprocessed C file that compiler people can work with :)

@TommyMurphyTM1234
Collaborator

If you're able to extract this problem into a program that runs in qemu-user mode (make build-qemu) and produces a different result from x86/arm/whatever your host machine uses then I have a flow that will let me reduce it down to a small program that will show the bug (or the unintended undefined behavior).

That's a very good point/suggestion about compiling the code for some non RISC-V ISA (e.g. x86_64 or Arm) and comparing the results to the RISC-V case. If the program still crashes in the non-RISC-V case then it strengthens the hypothesis that it's a problem with the actual code rather than the toolchain.

@alex-paschoaletto
Author

I'm sorry for the absence, have been a bit busy lately.

The mcause, mepc and mtval CSRs provide information about a RISC-V trap/exception. If your "hang/crash" is actually a trap/exception then the values of these registers will be very useful in investigating the problem.

Thank you very much for the enlightenment! I'm currently learning how to use gdb, but a colleague of mine with some knowledge of it did me the favor of running this project under gdb. When the application hangs, these are the mcause, mepc and mtval register values:

[Image: registers]

If you're able to extract this problem into a program that runs in qemu-user mode (make build-qemu) and produces a different result from x86/arm/whatever your host machine uses then I have a flow that will let me reduce it down to a small program that will show the bug (or the unintended undefined behavior).

Ok! I will give a try on that and report results here as soon as I have them.

That's a very good point/suggestion about compiling the code for some non RISC-V ISA (e.g. x86_64 or Arm) and comparing the results to the RISC-V case. If the program still crashes in the non-RISC-V case then it strengthens the hypothesis that it's a problem with the actual code rather than the toolchain.

I understand and agree, but like I said, running on QEMU emulating ARM (MPS2 M3) worked just fine. Running on the Raspberry Pi Pico (ARM Cortex M0+) also showed no issues whatsoever.

I also have a new finding: last Thursday I ran this program on a Seeed XIAO ESP32-C3 (RISC-V based) and the program behaved well in all cases. Unfortunately, the test conditions were not the same, since Espressif has its own compiler (riscv32-esp-elf-gcc) and its own custom flavor of FreeRTOS. Hopefully this is a useful piece of information, though.

@TommyMurphyTM1234
Collaborator

mcause 2 is illegal instruction.
What's the disassembly of your program at/around the mepc of 0x80001180?
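As a quick reference for reading mcause values like this one, the standard synchronous exception codes from the RISC-V Privileged spec can be put in a small lookup function. The sketch below assumes RV32, where the interrupt flag is bit 31; mcause_is_interrupt and mcause_exception_name are hypothetical helper names, not part of FreeRTOS or the toolchain.

```c
#include <stdint.h>

/* On RV32 the top bit of mcause distinguishes interrupts from exceptions. */
#define MCAUSE_INTERRUPT_BIT (UINT32_C(1) << 31)

int mcause_is_interrupt(uint32_t mcause) {
    return (mcause & MCAUSE_INTERRUPT_BIT) != 0;
}

/* Map a synchronous exception code to its name in the Privileged spec. */
const char *mcause_exception_name(uint32_t mcause) {
    if (mcause_is_interrupt(mcause)) {
        return "interrupt (not a synchronous exception)";
    }
    switch (mcause) {
        case 0:  return "instruction address misaligned";
        case 1:  return "instruction access fault";
        case 2:  return "illegal instruction";
        case 3:  return "breakpoint";
        case 4:  return "load address misaligned";
        case 5:  return "load access fault";
        case 6:  return "store/AMO address misaligned";
        case 7:  return "store/AMO access fault";
        case 8:  return "environment call from U-mode";
        case 9:  return "environment call from S-mode";
        case 11: return "environment call from M-mode";
        case 12: return "instruction page fault";
        case 13: return "load page fault";
        case 15: return "store/AMO page fault";
        default: return "reserved or custom";
    }
}
```

An illegal instruction trap is often a symptom of the program counter landing somewhere that does not hold code, for example after a jump through a corrupted pointer or return address, so the mepc value and the code leading up to it are the next things to inspect.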

@alex-paschoaletto
Author

mcause 2 is illegal instruction. What's the disassembly of your program at/around the mepc of 0x80001180?

The assembly code around this mepc value seems to be the exception handler itself, if my understanding is right from the little time I had using gdb so far:

[Image: assembly]

The code itself is this one, located in FreeRTOS' portASM file:

freertos_risc_v_exception_handler:
    portcontextSAVE_EXCEPTION_CONTEXT
    /* a0 now contains mcause. */
    li t0, 11                           /* 11 == environment call. */
    bne a0, t0, other_exception         /* Not an M environment call, so some other exception. */
    call vTaskSwitchContext
    portcontextRESTORE_CONTEXT

With that in mind, 0x80001180 seems to be the line where the code flows after the error has occurred (therefore a consequence, not a cause). I am still investigating to understand where exactly the code jumps to this trap.

@alex-paschoaletto
Author

Ok, so I have a new finding.

After a deeper line-by-line analysis of the code at the assembly level, consistently checking the linked list state, I've found out that the inlined code hangs right after creating and printing the items for the first time because printi - a subroutine of both printf and sprintf - corrupts the pointers to the linked list's head and tail after executing. More precisely, the tail gets corrupted first, at assembly instruction [printi+98]: sb t3, -1(a5), whereas the head comes next, a few loop cycles later, at the exact same instruction.

So it looks like the first forEach executes entirely because it only reads the head pointer once, at the start. But when the sort function follows, it tries to access the head pointer again - and with a corrupted value, a segmentation fault happens and the handler raises the register values as an illegal instruction. Or at least that's my understanding.

This is what the linked list with 5 items inside looks like normally:
[Image: normal linked list]

And this is what it looks like after corruption:
[Image: corrupted linked list]

Note that all elements within the linked list are stored on the heap (0x8000...). The data within 0x3030... or 0x3031... is just some code I don't really understand the purpose of but assume it has something to do with the illegal instruction mask:
[Image: random code]

With that said, I suppose the problem is related to the standard C printi function. If I run the code printing only strings - i.e.

printf("item: whatever");

instead of

printf("item: %d", itemValue);

The linked list pointers are never corrupted. It still may not explain the first problem I reported of encapsulated vs inline code, but it is a problem that deserves some attention anyway.
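One possible reading of the 0x3030... and 0x3031... values, which the thread does not confirm: 0x30 and 0x31 are the ASCII codes for '0' and '1', so these "addresses" may simply be decimal digits that a number-formatting routine wrote over the pointers. The sketch below uses two hypothetical helpers to show how digits stored backward into a buffer, in a pattern that resembles the sb t3, -1(a5) store, look when the same four bytes are reloaded as a little-endian 32-bit word.

```c
#include <stdint.h>

/* Hypothetical helper: write the decimal digits of `value` backward, ending
   just before `end`, and return a pointer to the first digit written. */
char *format_decimal_backward(char *end, unsigned value) {
    char *p = end;
    do {
        *--p = (char)('0' + (value % 10u)); /* one byte stored below the cursor */
        value /= 10u;
    } while (value != 0u);
    return p;
}

/* Reinterpret four bytes as a little-endian 32-bit word, the way an RV32
   word load would see them. */
uint32_t bytes_as_word_le(const unsigned char b[4]) {
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

Formatting 1000 into a 4-byte buffer leaves the bytes 0x31 0x30 0x30 0x30, which read back as the word 0x30303031. If such bytes land on a stored head or tail pointer, the list appears to point into the 0x3030... range.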

@alex-paschoaletto
Author

After all, the problem seems to have been that the RISC-V port had its __stack_size statically defined as only 350 bytes, which was not enough for this application:

LDFLAGS += -nostartfiles -Xlinker --gc-sections -Wl,-Map,$(OUTPUT_DIR)/RTOSDemo.map \
           -T./fake_rom.ld -march=rv32imac -mabi=ilp32 -mcmodel=medlow -Xlinker \
           --defsym=__stack_size=350 -Wl,--start-group -Wl,--end-group

Increasing it to a larger value, such as 700 bytes, got everything working. Thanks for the help anyway!
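A common way to catch this kind of problem earlier is "stack painting": fill the stack region with a known byte pattern at startup and later count how many bytes still hold it. The sketch below demonstrates the idea on an ordinary buffer; stack_paint, stack_unused_bytes and STACK_PAINT_BYTE are hypothetical names, and on the real target the region's bounds would have to come from the linker script, which is not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT_BYTE 0xA5u

/* Fill the region with the paint pattern; call before the region is used. */
void stack_paint(uint8_t *base, size_t size) {
    for (size_t i = 0; i < size; i++) {
        base[i] = STACK_PAINT_BYTE;
    }
}

/* Count untouched bytes starting at the lowest address. The stack grows
   downward on RISC-V, so the lowest addresses are the last to be used. */
size_t stack_unused_bytes(const uint8_t *base, size_t size) {
    size_t unused = 0;
    while (unused < size && base[unused] == STACK_PAINT_BYTE) {
        unused++;
    }
    return unused;
}
```

A result of 0 suggests the stack was used all the way to its lower bound and may have overflowed into whatever sits below it. The count is only an estimate, since a stored byte can happen to equal the paint value.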

@TommyMurphyTM1234
Collaborator

Thanks a lot for following up with an explanation of the root cause @alex-paschoaletto! 👍
