One of the trickier problems when coding for an embedded system is ensuring that each task has enough stack space for its needs.
Most microcontroller units (MCUs) have limited amounts of RAM. On a general-purpose machine, for example a Linux desktop, it would in theory be possible for a potential stack overflow to be recognised and additional stack space allocated; even if that's not possible, a large stack can be pre-allocated using virtual memory at no immediate cost.
This can't be done if the MCU only has real memory, and a stack overflow may or may not be caught. If you're lucky, there'll be enough information left behind for you to use the debugger to find the reason for the resulting crash straight away; if not, it could take a while. Under FreeRTOS you're very likely to get a hard fault to add to the joy: stacks and task control blocks live close together, so running off the end of your stack is likely to trample on the stored state of another task, which will then try to access invalid memory.
stack_usage has been developed on macOS Mojave using Python 2.7 and 3.7 and PLY (Python Lex-Yacc). To install PLY,

```shell
pip install --user ply
```
It relies on the information generated by the GCC compiler (FSF GCC 10 or later, GNAT GPL 2015 or later) using the switch -fcallgraph-info=su,da.
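Each compiled unit then gets a .ci file holding its callgraph in VCG form; the su and da modifiers add stack-usage and dynamic-allocation annotations to the node labels. A node and edge look roughly like this (illustrative only, not verbatim GCC output; the file name, line numbers and label layout are assumptions):

```
graph: { title: "event_support.ci"
  node: { title: "event_support__led_event_handling__handleN"
          label: "event_support.led_event_handling.handleN\nevent_support.adb:30:14\n312 bytes (static)" }
  edge: { sourcename: "event_support__led_event_handling__handleP"
          targetname: "event_support__led_event_handling__handleN"
          label: "event_support.adb:45:7" }
}
```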
```
usage: stack_usage.py [flags] <input .ci files>
flags:
  -h, --help:        output this message
  -s, --save=file:   save data in file
  -l, --load=file:   restore previously saved data
  -o, --output=file: file for CSV output (D=stack_usage.csv)
  -d, --diagnostics: output diagnostic info on parsing
```
- When data is saved, it's in Python's pickle form.
- The diagnostics output is only useful while developing the parsing information.
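Since the saved data is ordinary pickle output, the save/load round-trip is conceptually just the following (the dictionary here is a stand-in; the tool's real data structure is internal to stack_usage.py):

```python
import pickle

# Stand-in for the tool's parsed call-graph data (the real structure
# is internal to stack_usage.py).
data = {"wrapper": 40, "timer_body": 296}

# In effect, what --save=microbit.p does.
with open("microbit.p", "wb") as f:
    pickle.dump(data, f)

# In effect, what --load=microbit.p does.
with open("microbit.p", "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # → True
```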
As noted above, the initial motivation for this work was a hard fault encountered while writing a test program to check that Ada timing events work properly (well, usably) with the FreeRTOS-based Cortex GNAT RTS.
Timing events should, per ARM D.15(25), "be executed directly by the real-time clock interrupt mechanism". This wasn't done in Cortex GNAT RTS, because of the difficulty of ensuring proper inter-task and inter-interrupt protection (especially tricky on the micro:bit, because it uses a Cortex-M0 part); instead, a highest-priority task checks timings and runs handlers when required.
Handlers are protected procedures, and code for them is generated as for any protected procedure, with the proper locking mechanisms.
The Timer task was specified as
```ada
task Timer is
   pragma Priority (System.Priority'Last);
end Timer;
```
and implemented, comments elided, as
```ada
task body Timer is
   Next   : Time := Time_First;
   Period : constant Time_Span := Milliseconds (5);
begin
   loop
      Process_Queued_Events (Next_Event_Time => Next);
      delay until (if Next = Time_First then Clock + Period else Next);
   end loop;
end Timer;
```
In Cortex GNAT RTS, an Ada task is created as a FreeRTOS task whose initial procedure is a wrapper which allocates any secondary stack required out of the task's stack (in this case, 10% of the task's default 768 bytes) and invokes a compiler-generated wrapper round the user-written task body.
The event handlers for this software are specified as
```ada
protected LED_Event_Handling is
   pragma Interrupt_Priority;
   procedure Handle
     (Event : in out Ada.Real_Time.Timing_Events.Timing_Event);
private
   ...
end LED_Event_Handling;
```
and are implemented by the compiler as a wrapper, which performs any locking and calls the user-written handler body. This wrapper is called directly by the timer task, so any stack used by the handler body comes out of the timer task's stack.
It turned out that a lot of the problem was that the standard compilation options used included -O0 (no optimisation); this led to the wrapper procedure (see above) using 300 bytes more stack than the 40 bytes it used with -Og (which "offer[s] a reasonable level of optimization while maintaining fast compilation and a good debugging experience").
A further improvement was to eliminate the Timer task's secondary stack and bump its main stack:

```ada
task Timer
  with
    Priority             => System.Priority'Last,
    Storage_Size         => 1024,
    Secondary_Stack_Size => 0
is
   pragma Task_Name ("events_timer");
end Timer;
```
At this point, it seemed like a good idea to implement
stack_usage.py. I also added callgraph code generation as an option to the Cortex GNAT RTS build process.
To generate raw callgraph info for the micro:bit RTS,
```shell
cd ~/cortex-gnat-rts/microbit
make CALLGRAPH=yes clean all install
```
To generate raw callgraph info for the
events test program,
```shell
cd ~/cortex-gnat-rts/test-microbit
make clean
gprbuild -p -P testbed events -cargs -Og -fcallgraph-info=su,da
```
To process and save the RTS's information,
```shell
cd ~/stack_usage
./stack_usage.py \
  --save=microbit.p \
  ~/cortex-gnat-rts/microbit/.build/*.ci
```
To combine this saved RTS information with the
events program's data,
```shell
./stack_usage.py \
  --load=microbit.p \
  --output=events.csv \
  ~/cortex-gnat-rts/test-microbit/.build/*.ci
```
Looking at events.csv, we find that
ada.real_time.timing_events.timerTKVIP (called during elaboration to create the
Timer task) uses 264 bytes (from the environment task's stack). The
system.tasking.restricted.stages.wrapper procedure uses 40 bytes of the
Timer task's stack, and calls (via a pointer, so invisibly to stack_usage.py)
ada.real_time.timing_events.timerTKB, the task's body, which uses 296 bytes (including 64 bytes used by the subprograms it calls).
stack_usage.py can't trace into the handler procedure, because it's called via a pointer:
```ada
if Handler /= null then
   Handler.all (Timing_Event (Next_Event.all));
end if;
```
so we have to use our knowledge of the actual handlers to find that it's
event_support.led_event_handling.handleN (312 bytes), wrapped by
event_support.led_event_handling.handleP (which organises locking) for a total of 328 bytes.
The total stack usage for the timer task is then predicted to be 40 + 296 + 328 = 664 bytes.
This is a worst-case value: measuring the actual free space showed that the actual peak usage was 424 bytes.
The worst-case usage for a particular subprogram is the amount it uses for its own purposes (register storage, local variables) plus the maximum used by any of the subprograms it calls. If a particular execution pattern doesn't call that maximal subprogram, then the actual usage will be lower. In this case, one of the most expensive called subprograms was 64-bit scaled arithmetic division, at 192 bytes, called by
Ada.Real_Time.Time_Of at 248 bytes. I'm assuming that this was never actually invoked.
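The rule above (own frame plus the deepest callee) can be sketched as follows. The frame sizes are this post's measured figures, but the identifiers and the splits (the body's 296 bytes into 232 plus 64 for its callees, Time_Of's 248 bytes into 56 plus the division's 192) are illustrative assumptions, not output from stack_usage.py:

```python
# Worst-case stack for a subprogram: its own frame plus the largest
# worst case among the subprograms it (statically) calls.

frames = {                         # own frame sizes, in bytes
    "timer_body": 232,             # illustrative: 296 total minus callees
    "process_queued_events": 64,
    "time_of": 56,                 # illustrative: 248 total minus 192 below
    "scaled_division": 192,        # 64-bit scaled arithmetic division
}
calls = {                          # static call-graph edges
    "timer_body": ["process_queued_events", "time_of"],
    "time_of": ["scaled_division"],
}

def worst_case(name, seen=()):
    # A cycle (recursion) would make the worst case unbounded: reject it.
    if name in seen:
        raise ValueError("recursion involving " + name)
    deepest = max((worst_case(c, seen + (name,)) for c in calls.get(name, [])),
                  default=0)
    return frames[name] + deepest

print(worst_case("timer_body"))    # → 480, i.e. 232 + (56 + 192)
```

If the execution pattern never reaches scaled_division, the deepest path actually taken is only 232 + 64 = 296 bytes, which is why measured peaks fall short of the prediction.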
The tool currently ignores
- calls to subprograms not compiled with the required options (e.g. …)
- dynamic objects (for example, declare blocks with variables sized at runtime)
- dispatching calls
At the very least it should mark the subprograms where there's a potential problem.
It might be helpful to see details of which subprograms are called by a given subprogram.