
Compiler inefficiency - Automatic arrays #193

Open
mp-17 opened this issue Aug 6, 2020 · 4 comments

Comments

@mp-17

mp-17 commented Aug 6, 2020

Dear All,

I hope this is the right place for this issue; otherwise, I will be happy to submit it wherever is appropriate.

I am working on reducing the code size of RISC-V programs compiled with GCC. To do so, I compile the same program with GCC targeting both the rv32imc and armv7-m ISAs and analyze the disassemblies.

While compiling opus_demo (an example from the Opus audio codec, https://github.com/xiph/opus), I found that some functions were very inflated with respect to their ARM counterparts: for example, celt_decode_lost (https://github.com/xiph/opus/blob/484af2580b82dffd695fcd0841fc34a9dc5f7293/celt/celt_decoder.c#L502).

The function makes use of automatic arrays and automatic variable-length arrays (VLAs). I analyzed the disassembly and detected strange choices made by RISC-V GCC that impact both performance and code density. I tried to manually optimize the disassembly and was able to eliminate a lot of (apparently) unnecessary operations.

I tried to reproduce the issue with two smaller toy-example programs. Even though the problem is not as extensive as in celt_decode_lost, GCC still makes some apparently strange choices in the following:

#include <stdio.h>

#define N 1000
#define M 2000

extern void foo(int *arr);

int main(int argc, char **argv) {

    int arr_0[N];
    int arr_1[M];

    foo(arr_0);
    foo(arr_1);

    return 0;
}

It seems that, once a certain number of bytes has been pushed onto the stack, GCC performs redundant operations to reference the variables that are far from the current stack pointer. When compiling celt_decode_lost, the redundancies are even more extensive.
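For comparison, here is a shrunken variant of the toy example (a sketch of mine; N_SMALL and M_SMALL are just placeholder sizes I picked, not something from the presentation) whose whole frame stays within a couple of KB. Diffing its disassembly against the 12 KB version makes it easier to locate the threshold:

#include <stdio.h>

#define N_SMALL 100   /* 400 B on rv32 */
#define M_SMALL 200   /* 800 B on rv32, so the two arrays total ~1.2 KB */

extern void foo(int *arr);

int main(int argc, char **argv) {

    int arr_0[N_SMALL];
    int arr_1[M_SMALL];

    foo(arr_0);
    foo(arr_1);

    return 0;
}

Both versions can be compiled to assembly with a bare-metal RISC-V GCC, e.g. riscv32-unknown-elf-gcc -Os -march=rv32imc -mabi=ilp32 -S (the toolchain name and flags here are just my usual setup, not the only valid choice).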

More detailed analyses of the problem can be found in this auxiliary presentation:
https://github.com/pulp-platform/fpnew/blob/feature/tiny-fpu/report/riscv_possible_inefficiencies.pdf
The first part is about the original celt_decode_lost; the last part is about the toy examples.

I hope I did not misunderstand anything. Feel free to ask if the presentation is not clear enough.

I have tried GCC 7.3.0, 9.2.0, and 10.2.0, and the problem seems to be present in all of them.

@jim-wilson
Collaborator

This is a known problem. GCC first emits code using a frame pointer, and then tries to eliminate the frame pointer during register allocation. When the frame size is larger than the immediate field of an add, we need temporaries to compute frame offsets. Compiler optimizations like common subexpression elimination (cse) try to optimize the calculation of the temporaries. Then when we try to eliminate the frame pointer, we run into trouble because the cse opt changes interfere with the frame pointer elimination, and we end up with an ugly inefficient mess.

This problem isn't RISC-V specific, but since RISC-V has 12-bit immediates while most other architectures have 16-bit immediates, we hit the problem sooner, which makes it much more visible on RISC-V. The example above puts 12 KB on the stack, which is larger than 12 bits of range (4 KB) but well within 16 bits of range (64 KB).
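To make the arithmetic explicit, here is a minimal sketch of mine based on the sizes in the toy example above (not something from the original reports):

#define N 1000
#define M 2000

/* On rv32, sizeof(int) == 4, so the two arrays alone need
 * 1000*4 + 2000*4 = 12000 bytes of stack space. */
_Static_assert((N + M) * 4 > 4096,  "frame exceeds the 12-bit immediate range (4 KB)");
_Static_assert((N + M) * 4 < 65536, "frame is well within a 16-bit immediate range (64 KB)");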

I have a prototype of a patch to fix it by allowing >12-bit offsets for frame pointer references and then fixing them up when eliminating the frame pointer, but this can cause other problems and needs more work to be usable. I have no idea when someone might try to finish the patch.

@mp-17
Author

mp-17 commented Aug 6, 2020

Thank you for the fast reply,

Very clear explanation. The only thing I did not quite catch is how GCC is supposed to act in these situations: does it try to perform at least one of these two incompatible optimizations?

  • If I understood correctly, in my toy example the frame pointer was originally eliminated, but it is then recomputed at every address formation (at every stack reference, the stack pointer is incremented to form the frame pointer, and the reference is then computed from it). At the same time, the temporaries were not optimized.

  • In the celt_decode_lost function, the frame pointer is still present AND the temporaries are not optimized.

Thank you again,
Matteo

@vineetgarc
Contributor

Hi @jim-wilson - it's been a while, but would you happen to have the patch somewhere for us to use as a starting point?
FWIW, this issue is formally tracked in Bugzilla 105733; I also observed something similar in 106439.

@vineetgarc
Contributor

Never mind, it is there in your optimizations page too.
