Weird behavior on RISC-V code (compiled with the toolchain) #1434

Open
alex-paschoaletto opened this issue Mar 12, 2024 · 6 comments

@alex-paschoaletto

alex-paschoaletto commented Mar 12, 2024

Hello everyone! Hope you're all doing well.

I don't know if this is the proper place to ask for help or report a bug, so please let me know if I should post this elsewhere. I've been trying to develop real-time software on RISC-V (taking the official FreeRTOS QEMU demo ports as a starting point), but I ran into some very weird issues and decided to ask for help. I have already posted this issue on the FreeRTOS forums, but as this post shows, it might have something to do with the compiler rather than the RTOS library.

TL;DR

This repository contains a sample project that illustrates well one of the problems I've been getting lately. It basically goes like this: the program executes fine when a certain code snippet is encapsulated within a function, but "crashes" (i.e. hangs) when the same snippet is placed directly in the main code:

for(int i=0; i < NUMBER_OF_ITEMS; i++){
    createAndPushItem(i);

    // the function above does the exact same thing as the commented code below
    // yet, the commented code does not work and will crash the program. but why??
        
    // int index = priorities[i];                                               
    // void *value = (void *) getValue(i + 1);
    // LinkedListItem_t *item = createItem(index, value);                          
    // if(item){
    //     push(item, &list);
    // }
}

The scope shouldn't matter at all here, since there is no local variable being used or anything like that. Also, for the sake of simplicity, the sample code doesn't even use FreeRTOS tasks or the scheduler, just the pvPortMalloc/vPortFree functions.

Context

It all started when I decided to use the ESFree library to add an EDF scheduler to the project, but the code didn't work out of the box on either of the 'virt' ports. It ran OK on the ARM ports for QEMU, though, so I figured it was some incompatibility on the RISC-V side. While investigating, I found the problem seemed to be related to the linked list API, as the List_t uxNumberOfItems variable just seemed to go crazy at a certain point in the code that wasn't even supposed to interact with it.

I then decided to write my own linked list library as a workaround and created a whole new sample project to test it, but that attempt also went sideways with a new bug: the code would execute just fine when a certain snippet was placed inside a dedicated function, but not when placed directly in the main code. That is what I describe in the TL;DR section above.
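Since the symptom described above is a list counter changing where nothing should touch it, one cheap check is to walk the list after every operation and compare the number of reachable nodes with the stored count. The Node and List layouts and the name list_is_consistent below are assumptions made for illustration; the actual LinkedListItem_t in the sample project may look different.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list layout, assumed for this sketch only. */
typedef struct Node {
    int index;
    void *value;
    struct Node *next;
} Node;

typedef struct {
    Node *head;
    size_t count;   /* number of nodes the list believes it holds */
} List;

/* Returns true if walking from head reaches exactly list->count nodes.
 * max_nodes bounds the walk so a cycle or a wild next pointer that loops
 * back into the list is reported instead of spinning forever. */
bool list_is_consistent(const List *list, size_t max_nodes)
{
    size_t seen = 0;
    for (const Node *n = list->head; n != NULL; n = n->next) {
        if (++seen > max_nodes) {
            return false;   /* more nodes than could exist: likely corruption */
        }
    }
    return seen == list->count;
}
```

Calling such a check after each push, in both the function-wrapped and the inlined variant, narrows down the first iteration at which the structure stops being consistent.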

This doesn't really seem to me like a problem within FreeRTOS since, as I said, there are no schedulers or tasks being used. The only FreeRTOS resources I actually use in the sample project are pvPortMalloc and vPortFree, and the problem happens anyway with regular malloc (as shown below). I can't really rule FreeRTOS out, though, since the whole port setup (e.g. libraries, assembly files and makefile) is provided by it.

Workarounds

These are my attempts at solving the problem so far:

  1. Tried using the other FreeRTOS RISC-V port meant for QEMU (RISC-V-Qemu-virt-GCC), but the problem persists.
  2. Tried building the project without compiler optimization (using -O0 instead of -Os in the makefile). The output is even worse, with the execution hanging even before printing the list items for the first time.
  3. Tested with other heap files provided by FreeRTOS (heap_1, heap_3 and heap_4). No success whatsoever.
  4. Tested with regular malloc() and free() from stdlib.h. The output shows the same behavior as compiling without optimization (i.e. nothing is printed on the terminal).

I am familiar with many programming languages, but FreeRTOS, QEMU and compiler development/debugging are definitely out of my league, which is why I'm here asking for help from the clever ones. I'll be posting this issue on the QEMU forums as well, if there are any. Hopefully someone can help me out on this quest.
Thanks in advance.

@alex-paschoaletto changed the title from "Weird behavior on RISC-V code" to "Weird behavior on RISC-V code (compiled with the toolchain)" on Mar 12, 2024
@TommyMurphyTM1234
Collaborator

Did you debug your code to find out why it crashes/hangs in the failure case? In particular what CSRs like mcause, mepc, mtval are set to at the point of failure? If interactive debug isn't possible for some reason then at least have something like an exception/trap handler that captures/dumps these in some way in order to better understand the nature of the failure. Systematic debug is a better approach than randomly changing other things.

@patrick-rivos
Collaborator

Thanks for spending the time to try reducing the problem/make such a clear writeup about this hang.
Can you share the hash/version of the riscv compiler you're using (riscv32-unknown-elf-gcc -v)?

@alex-paschoaletto
Author

> Did you debug your code to find out why it crashes/hangs in the failure case? In particular what CSRs like mcause, mepc, mtval are set to at the point of failure? If interactive debug isn't possible for some reason then at least have something like an exception/trap handler that captures/dumps these in some way in order to better understand the nature of the failure. Systematic debug is a better approach than randomly changing other things.

Hello there! I'm sorry if I didn't provide enough information. I am quite new to this world of low-level debugging and don't know exactly how to do what you say. Could you please explain how I would check the register values throughout execution?

> Thanks for spending the time to try reducing the problem/make such a clear writeup about this hang. Can you share the hash/version of the riscv compiler you're using (riscv32-unknown-elf-gcc -v)?

Yeah, sure! Here's the terminal output of riscv32-unknown-elf-gcc -v:

Using built-in specs.
COLLECT_GCC=riscv32-unknown-elf-gcc
COLLECT_LTO_WRAPPER=/opt/riscv/libexec/gcc/riscv32-unknown-elf/13.2.0/lto-wrapper
Target: riscv32-unknown-elf
Configured with: /home/alex/riscv/riscv-gnu-toolchain/gcc/configure --target=riscv32-unknown-elf --prefix=/opt/riscv --disable-shared --disable-threads --enable-languages=c,c++ --with-pkgversion= --with-system-zlib --enable-tls --with-newlib --with-sysroot=/opt/riscv/riscv32-unknown-elf --with-native-system-header-dir=/include --disable-libmudflap --disable-libssp --disable-libquadmath --disable-libgomp --disable-nls --disable-tm-clone-registry --src=.././gcc --enable-multilib --with-abi=ilp32d --with-arch=rv32gc --with-tune=rocket --with-isa-spec=20191213 'CFLAGS_FOR_TARGET=-Os    -mcmodel=medlow' 'CXXFLAGS_FOR_TARGET=-Os    -mcmodel=medlow'
Thread model: single
Supported LTO compression algorithms: zlib
gcc version 13.2.0 () 

@TommyMurphyTM1234
Collaborator

TommyMurphyTM1234 commented Mar 12, 2024

> > Did you debug your code to find out why it crashes/hangs in the failure case? In particular what CSRs like mcause, mepc, mtval are set to at the point of failure? If interactive debug isn't possible for some reason then at least have something like an exception/trap handler that captures/dumps these in some way in order to better understand the nature of the failure. Systematic debug is a better approach than randomly changing other things.

> Hello there! I'm sorry if I didn't provide enough information. I am quite new to this world of low-level debugging and don't know exactly how to do what you say. Could you please explain how I would check the register values throughout execution?

The mcause, mepc and mtval CSRs provide information about a RISC-V trap/exception. If your "hang/crash" is actually a trap/exception then the values of these registers will be very useful in investigating the problem. The CSRs are described in the RISC-V Privileged spec:

The values of these registers should be available by dumping them in some way from, say, a catch-all trap/exception handler or via interactive debugging.

If the problem doesn't actually result in a trap/exception then interactive debugging should still help to identify where/when/why the code doesn't behave as you expect and pinpoint the root cause of the problem.
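As a rough illustration of the catch-all handler idea above, the sketch below reads the three CSRs and prints them with a human-readable cause. The names READ_CSR, mcause_is_interrupt, describe_mcause and dump_trap are hypothetical. The exception-code table follows the standard codes in the RISC-V Privileged spec, and the sketch assumes printf is routed to a working console in the demo. How the port's startup code dispatches traps to such a function is not shown here.

```c
#include <stdio.h>

/* Read a machine-mode CSR by name. On a non-RISC-V host build there are no
 * CSRs, so the macro yields a placeholder 0 to keep the file compilable. */
#if defined(__riscv)
#define READ_CSR(name) \
    ({ unsigned long v_; __asm__ volatile("csrr %0, " #name : "=r"(v_)); v_; })
#else
#define READ_CSR(name) 0UL
#endif

/* The most significant bit of mcause is set for interrupts. */
int mcause_is_interrupt(unsigned long mcause)
{
    return (int)((mcause >> (sizeof(unsigned long) * 8 - 1)) & 1UL);
}

/* Map a synchronous exception code to its name in the Privileged spec. */
const char *describe_mcause(unsigned long mcause)
{
    static const char *const names[] = {
        [0]  = "Instruction address misaligned",
        [1]  = "Instruction access fault",
        [2]  = "Illegal instruction",
        [3]  = "Breakpoint",
        [4]  = "Load address misaligned",
        [5]  = "Load access fault",
        [6]  = "Store/AMO address misaligned",
        [7]  = "Store/AMO access fault",
        [8]  = "Environment call from U-mode",
        [9]  = "Environment call from S-mode",
        [11] = "Environment call from M-mode",
        [12] = "Instruction page fault",
        [13] = "Load page fault",
        [15] = "Store/AMO page fault",
    };
    if (mcause_is_interrupt(mcause)) {
        return "Interrupt";
    }
    if (mcause < sizeof names / sizeof names[0] && names[mcause] != NULL) {
        return names[mcause];
    }
    return "Reserved or custom";
}

/* Hypothetical catch-all: print the trap state, then park the hart so the
 * state can also be inspected from a debugger. */
void dump_trap(void)
{
    unsigned long mcause = READ_CSR(mcause);
    unsigned long mepc   = READ_CSR(mepc);
    unsigned long mtval  = READ_CSR(mtval);
    printf("trap: mcause=0x%lx (%s) mepc=0x%lx mtval=0x%lx\n",
           mcause, describe_mcause(mcause), mepc, mtval);
    for (;;) {
    }
}
```

If mcause turns out to be a load or store access fault, mtval typically holds the faulting address and mepc the address of the instruction that caused it, which can then be looked up in the disassembly.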

There is plenty of info about debugging RISC-V programs running on QEMU available elsewhere. E.g.:

@patrick-rivos
Collaborator

@TommyMurphyTM1234 provided some great links for debugging with QEMU - the second link is helpful for debugging when using qemu-system (which your program does).

If you're able to extract this problem into a program that runs in qemu-user mode (make build-qemu) and produces a different result from x86/arm/whatever your host machine uses then I have a flow that will let me reduce it down to a small program that will show the bug (or the unintended undefined behavior).

In other words, if you can reproduce the problem with a flow like this:

<riscv32-unknown-elf-gcc compile commands>
<qemu-user output.out>
{result}

<gcc compile commands for the same source files>
<./output.out>
{result that doesn't match riscv}

I can reduce it down to a preprocessed C file that compiler people can work with :)

@TommyMurphyTM1234
Collaborator

> If you're able to extract this problem into a program that runs in qemu-user mode (make build-qemu) and produces a different result from x86/arm/whatever your host machine uses then I have a flow that will let me reduce it down to a small program that will show the bug (or the unintended undefined behavior).

That's a very good point/suggestion about compiling the code for some non RISC-V ISA (e.g. x86_64 or Arm) and comparing the results to the RISC-V case. If the program still crashes in the non-RISC-V case then it strengthens the hypothesis that it's a problem with the actual code rather than the toolchain.
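One hedged way to make the two runs comparable is to have the program fold every value it observes into a running checksum and print a single line at the end. The line from the RISC-V build can then be compared against the line from the host build. The sketch below uses 32-bit FNV-1a; TRACE_INIT, trace_fold_byte, trace_fold and trace_report are hypothetical names, not part of the sample project.

```c
#include <stdint.h>
#include <stdio.h>

/* 32-bit FNV-1a parameters: offset basis and prime. */
#define TRACE_INIT  2166136261u
#define TRACE_PRIME 16777619u

/* Fold one byte into the running hash (FNV-1a: XOR first, then multiply). */
uint32_t trace_fold_byte(uint32_t hash, uint8_t byte)
{
    hash ^= byte;
    hash *= TRACE_PRIME;
    return hash;
}

/* Fold a 32-bit value, least significant byte first. The bytes are taken
 * with shifts rather than by reading memory, so host endianness does not
 * change the result. */
uint32_t trace_fold(uint32_t hash, uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        hash = trace_fold_byte(hash, (uint8_t)(value >> (8 * i)));
    }
    return hash;
}

/* Print the final checksum as one line that is easy to diff between runs. */
void trace_report(uint32_t hash)
{
    printf("trace=%08lx\n", (unsigned long)hash);
}
```

A program would start from TRACE_INIT, call trace_fold on each item's index (or any other observable value) as the list is built and traversed, and call trace_report once before exiting. Only values that are expected to be identical on both targets should be folded in, so pointers are best left out.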
