
Zephyr instrumentation subsystem #57373

Open
gromero opened this issue Apr 28, 2023 · 6 comments
Assignees
Labels
area: Profiling RFC Request For Comments: want input from the community

Comments

@gromero
Collaborator

gromero commented Apr 28, 2023

TL;DR

New Zephyr subsystem for profiling and tracing, proof of concept: #57371

CLI tool demo:
https://asciinema.org/a/581092

Perfetto output example:

(screenshot: Perfetto trace view of collected events)

Introduction

This RFC proposes a new Zephyr subsystem named instrumentation meant for
profiling and tracing. The new 'instrumentation' subsys leverages the code
instrumentation feature available in compilers like GCC and LLVM, enabled
via the -finstrument-functions flag, to collect profile data, function
entry/exit traces, and thread scheduling info in a non-intrusive way.

Problem description

Most profiling and tracing features in Zephyr today either depend on proprietary
and expensive tooling or have significant platform limitations. For instance,
Segger's SystemView requires a J-Link probe, while gprof only supports a few
platforms, often emulated ones like native_posix and native_posix_64, with no
timestamp data associated with events. Also, although the tracing subsys has
numerous hooks available in its APIs, these hooks are static and so can't
easily be set to trace arbitrary functions in kernel or application code.

By leveraging the compiler code instrumentation feature it is possible to have a
platform-, tooling-, and arch-neutral tracing and profiling subsystem which
allows collecting traces at arbitrary code locations, selected both at compile
time and at runtime, while, using the very same mechanism, collecting profiling
data for the traced code.

Once the traces and profiling data are collected, various transport layers can
be used to transfer data from the target to the host, similarly to the transport
backends available in the tracing subsys (UART, USB, etc).

Data can then be viewed on the host using a CLI tool, which can display traces
and profiling statistics in the text terminal or export them to various formats
so the data can be visualized in GUI tools, like Perfetto TraceViewer. The same
CLI tool allows setting the functions to be traced at runtime (they can also be
set via Kconfig, at compile time).

Proposed change

Implement a new Zephyr subsystem named instrumentation and an associated CLI
tool that allows setting, collecting, displaying, and exporting tracing and
profiling data captured by this new subsystem.

Detailed RFC

Compilers like GCC and LLVM can generate instrumentation calls for function
entry and exit if the -finstrument-functions flag is passed to the compiler.
When this feature is enabled, just after function entry and just before function
exit, the following functions, respectively, are called with the address of the
current function and its call site:

void __cyg_profile_func_enter(void *callee, void *call_site);
void __cyg_profile_func_exit(void *callee, void *call_site);

The instrumentation is also done for functions expanded inline in other
functions. A function may be given the attribute no_instrument_function, in
which case this instrumentation is not done for it. It's also possible to
exclude all functions in a compilation unit (source file) from instrumentation
by setting -finstrument-functions-exclude-file-list=file,file,..., or to
exclude functions by name with
-finstrument-functions-exclude-function-list=sym,sym,....
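
The hook contract above can be sketched as follows. This is not the subsystem's actual code, just a minimal pair of __cyg_profile hooks that count entries and exits; when a tree is built with -finstrument-functions the compiler inserts calls to them automatically, while the no_instrument_function attribute keeps the hooks themselves out of the instrumented set:

```c
#include <assert.h>

/* Event counters, for illustration only. */
static unsigned long n_enter, n_exit;

/* The hooks must not be instrumented themselves, or the compiler
 * would insert calls to them inside their own bodies. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *callee, void *call_site)
{
	(void)callee; (void)call_site;
	n_enter++;
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *callee, void *call_site)
{
	(void)callee; (void)call_site;
	n_exit++;
}
```

In an instrumented build every function entry/exit in the image lands in these two hooks, which is what makes a single system-wide handler possible.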

All these options take effect at compile time, so the instrumentation
subsys provides yet another way to disable the instrumentation at runtime:
calling instr_disable(), and instr_enable() to enable it back. This is an
interesting mechanism because it allows using any function inside the
instrumentation code without having to exclude that function from
instrumentation at compile time, which would prevent tracing and profiling that
function outside the instrumentation code, i.e. elsewhere in kernel or
application code.
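
A minimal sketch of that runtime switch, with the internals assumed for illustration (the RFC only names instr_disable()/instr_enable()): the handler returns immediately while instrumentation is disabled, so any instrumented helper it calls cannot recurse back into it:

```c
#include <assert.h>
#include <stdbool.h>

static volatile bool instr_enabled = true;  /* assumed internal state */
static unsigned long events_handled;

static void instr_disable(void) { instr_enabled = false; }
static void instr_enable(void)  { instr_enabled = true;  }

/* Hypothetical shape of the system-wide handler described below. */
static void instr_event_handler(void *callee, void *call_site)
{
	(void)call_site;
	if (!instr_enabled) {
		return;          /* re-entered from instrumentation code */
	}
	instr_disable();         /* critical section: no recursion */
	events_handled++;        /* ... record an event for 'callee' ... */
	(void)callee;
	instr_enable();
}
```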

Because the compiler instruments function entry and exit, it's possible
to have a single handler, called by __cyg_profile_func_enter() and
__cyg_profile_func_exit(), that collects profiling and tracing data for the
whole system. instr_event_handler() is this handler.

Conceptually, the only events that can trigger the handler are function entries
and exits, which provide good function trace information. However, once the
handler inspects the context and the name of the function (callee), it can
promote these events to others. For instance, if it's possible to get the
context when a function is called, a function entry event with additional
context (CPU ID, priority, thread name, etc.) is generated; as another example,
if the callee is part of the thread switching code, the event can be promoted to
a thread switched in/out scheduler event (e.g. z_thread_mark_switched_in/out()).
Other events can easily be added the same way.

The tracing subsys needs to be enabled (CONFIG_TRACING=y) to give access to
hooks like z_thread_mark_switched_in/out().

The instrumentation subsys relies upon a monotonic timer which is initialized to
zero at subsys initialization. Different platforms can set the timer backend to
the most suitable timer source. For example, on some Arm Cortex-M cores the
DWT (Data Watchpoint and Trace) unit is available and DWT_CYCCNT can be used
as a monotonic clock source. This timer is used by the handler to acquire a
timestamp for each event processed.

The handler collects profile data for each function in an aggregated way,
where the total amount of time spent in the function (delta) is calculated per
entry/exit and accumulated. The accumulators are kept in an array. Similarly to
the way a function stack is created and destroyed in the function's prologue and
epilogue, a timestamp (S0) is pushed to an alternative stack when a function is
called and another timestamp is obtained just before the function returns (S1).
The delta is calculated as delta = S1 - S0 and is then accumulated per
function. The depth of the timestamp stack can be configured (e.g. to 16), and
the number of functions for which deltas can be accumulated can also be
configured (e.g. 256 as the size of the accumulator array).
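
The push/pop delta accounting above can be sketched like this. The sizes follow the examples in the text (16-deep timestamp stack, 256 accumulators); the mapping from callee address to accumulator slot is an assumption for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define TS_STACK_DEPTH 16
#define NUM_FUNCS      256

static uint64_t ts_stack[TS_STACK_DEPTH];
static int ts_top;                      /* next free slot */
static uint64_t delta_acc[NUM_FUNCS];   /* accumulated time per function */

/* Illustrative only: map a callee address to an accumulator slot. */
static int func_slot(void *callee)
{
	return (int)(((uintptr_t)callee >> 2) % NUM_FUNCS);
}

static void on_entry(uint64_t now)
{
	if (ts_top < TS_STACK_DEPTH) {
		ts_stack[ts_top] = now;   /* push S0 */
	}
	ts_top++;                         /* overflow counted, not stored */
}

static void on_exit(void *callee, uint64_t now)
{
	ts_top--;
	if (ts_top >= 0 && ts_top < TS_STACK_DEPTH) {
		uint64_t delta = now - ts_stack[ts_top];  /* S1 - S0 */
		delta_acc[func_slot(callee)] += delta;
	}
}
```

Note that with this scheme a caller's delta includes the time spent in its callees; separating "self" from "cumulative" time would need extra bookkeeping.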

The handler also collects traces, notably function entry/exit events but
also scheduling events. The events are recorded into a buffer of configurable
size, which can behave as a ring buffer or as a fixed buffer, depending on the
use case.

The ring buffer API used by the instrumentation subsys to keep the traces is
similar to the Zephyr Ring Buffers API but allows putting variable-size
elements (event records) and can also work in overwrite mode, in which, once
the buffer fills up, newer events overwrite the older ones. The buffer can
also be configured in fixed mode, in which no more traces are recorded once
the buffer fills up. The ring buffer API consists of:

void init_instr_ringbuffer(void);

struct instr_record* put_item_claim(enum instr_event_types type);
void put_item_finish(enum instr_event_types type);
int put_item_wrapping(struct instr_record* record);

struct instr_record* get_item_claim(void);
void get_item_finish(struct instr_record* record);
int get_item_wrapping(struct instr_record* record);
struct instr_record* get_item(struct instr_record* record);

When the instrumentation handler needs to write a record it calls
put_item_claim, informing the type of the record it wants to write. If there
is enough contiguous space (the buffer doesn't wrap), put_item_claim returns a
pointer that can be used to write the record directly to the buffer. If the
buffer wraps, the instrumentation handler instead needs to first fill in a
record structure and pass it to put_item_wrapping, which takes care of copying
the record to the buffer, wrapping the data around to the start of the buffer
as needed.

When the instrumentation handler needs to read a record, usually for dumping it
to the host, the ring buffer get_* functions are used instead; they work
similarly to the put_* ones. One additional function is available for
convenience, though: get_item, which returns the next available item at the
tail of the ring buffer until there are no more items to get.
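
The claim/finish flow can be sketched with a much-simplified version of the API. This sketch covers fixed (non-overwrite) mode only and takes a byte length where the real API takes an event type; the real subsystem derives record sizes from instr_record layouts, and the wrapping copy path is only indicated in comments:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RB_SIZE 64

static uint8_t rb[RB_SIZE];
static size_t rb_head;   /* write offset */
static size_t rb_tail;   /* read offset */

/* Claim 'len' contiguous bytes for in-place writing, or NULL if the
 * record would wrap; in that case the caller would fill a record
 * structure and use the wrapping (copying) path instead. */
static void *put_item_claim(size_t len)
{
	if (rb_head + len > RB_SIZE) {
		return NULL;
	}
	return &rb[rb_head];
}

static void put_item_finish(size_t len)
{
	rb_head += len;      /* commit the claimed bytes */
}

/* Return the next complete record at the tail, or NULL when drained. */
static void *get_item(size_t len)
{
	if (rb_tail + len > rb_head) {
		return NULL;
	}
	void *p = &rb[rb_tail];
	rb_tail += len;
	return p;
}
```

The claim/finish split mirrors the design choice in the real API: the common case writes the record directly into the buffer with no intermediate copy, and only the wrap-around case pays for a staging structure.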

Event records are defined as follows, where the header is essentially used to
specify the event type:

struct instr_record {
        struct instr_header header;
        void* callee;
        void* caller;
        uint64_t timestamp;
        union {
                struct instr_payload0 payload0;     // Metadata.
                struct instr_payload1 payload1;     // Entry/Exit, no context.
                struct instr_event_context context; // Entry/Exit, with context.
                // ...
                /* Add more payloads here */
        };
} __packed;

The following basic event types are proposed:

enum __attribute__ ((__packed__)) instr_event_types {
        INSTR_EVENT_META = 0,	/* Metadata record, defined in TLV format. */
        INSTR_EVENT_ENTRY,	/* Callee entry event record */
        INSTR_EVENT_EXIT,	/* Callee exit event record */
        INSTR_EVENT_ENTRY_CTXT,	/* Entry event record, but with context */
        INSTR_EVENT_EXIT_CTXT,	/* Exit event record, but with context */
        INSTR_EVENT_PROFILE,	/* Profile events */
        INSTR_EVENT_SCHED_IN,	/* Thread switched in scheduler event */
        INSTR_EVENT_SCHED_OUT,	/* Thread switched out scheduler event */
        INSTR_EVENT_NUM
};

Subsystem initialization and activation

Because the compiler can instrument any function, the instrumentation handler
can be called quite early in the boot process, even as soon as the first kernel
function is called. This can be problematic because RAM might not be properly
initialized yet.

One way to guarantee the instrumentation subsystem has the proper context to be
initialized is to check whether a C variable in RAM already holds a magic
number; in other words, to check an in-RAM value against a magic value stored
in flash memory. This check is done by instr_fundamentals_initialized() and
guarantees essential Zephyr RAM initialization functions have been executed
(like z_early_memset & friends).
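
A sketch of that readiness check, with the magic value and variable name assumed for illustration: the variable lives in the initialized data section, so it only holds the magic once the section has been copied from flash to RAM:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INSTR_MAGIC 0x5AFEC0DEu  /* illustrative magic value */

/* Lives in .data: on target, it holds INSTR_MAGIC only after the
 * early init code has copied .data from flash into RAM. */
static uint32_t instr_magic = INSTR_MAGIC;

static bool instr_fundamentals_initialized(void)
{
	/* Before .data is copied, this read would not match the magic,
	 * telling the handler it is too early to do anything. */
	return instr_magic == INSTR_MAGIC;
}
```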

During instrumentation initialization the tracing buffer is initialized.
Also, the trigger and stopper functions are set accordingly. The trigger
and stopper function addresses control when, after the subsys is enabled,
tracing and profiling will start and stop, respectively. The trigger and
stopper addresses are set to default values at compile time based on Kconfig,
but since they can be set at runtime and need to survive reboots, they can
also be set from RAM, via the retained_mem driver. Hence, if values for
trigger and stopper are available from retained memory, they will be used,
overwriting the compile-time defaults.

After proper initialization the instrumentation subsystem will be enabled but
turned off.

In the instrumentation subsys terminology, enabled means that tracing and
profiling can be performed at any time but will start only when instrumentation
is turned on, and will stop when it is turned off. Thus there are two stages to
effectively activate tracing and profiling. This allows disabling the
instrumentation in the instrumentation code itself at runtime, as in critical
sections, avoiding infinite recursion (instrumentation subsys functions, like
the handler, are excluded at compile time from instrumentation but can call
other functions, like strncpy, which are instrumented, leading to infinite
recursion).

Tracing and profiling are turned on when the trigger function is called and
turned off when the stopper function returns. This allows fine-grained control
over the code locations where tracing and profiling are active.
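
The gating described above can be sketched as follows; the variable names and the hook shape are assumptions, and setting trigger and stopper to the same function corresponds to the tool's --couple option:

```c
#include <assert.h>
#include <stdbool.h>

static void *trigger_func;  /* turn on when this function is entered */
static void *stopper_func;  /* turn off when this function returns */
static bool turned_on;

static void on_func_entry(void *callee)
{
	if (callee == trigger_func) {
		turned_on = true;   /* start collecting from here on */
	}
	/* ... record the event only if turned_on ... */
}

static void on_func_exit(void *callee)
{
	/* ... record the event only if turned_on ... */
	if (callee == stopper_func) {
		turned_on = false;  /* stop once the stopper returns */
	}
}
```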

Transport

Much like in the tracing subsys, the instrumentation subsys might support
various backends for transporting the tracing and profiling data collected in
the buffers. The focus must be on popular, low-cost, and widely available
interfaces, like UART and USB.

The transport controller supports the following commands:

status: Get status on tracing and profiling (enabled / disabled)
reboot: Reboot target
ping: Ping transport controller
dump_trace: Dump traces from buffer
dump_profiler: Dump profile deltas from accumulator array
trigger: Set trigger function
stopper: Set stopper function

The dump commands dump the event records as a binary stream via the transport
channel. The binary stream is decoded using the CTF (Common Trace Format) v1.8
and libbabeltrace2.

Currently only the asynchronous transfer mode is considered.

CLI tool

To interface with the transport controller, an easy-to-use CLI tool is
proposed, which provides the subcommands and options below:

~/zephyrproject/zephyr$ ./scripts/zaru --help
usage: zaru [-h] [--serial SERIAL] [--elf ELF] {reboot,status,trace,profile} ...

Zephyr compiler-managed system profiling and tracing tool.

positional arguments:
  {reboot,status,trace,profile}
    reboot              reboot target.
    status              get instrumentation status from target.
    trace               get traces from target.
    profile             get profile info from target.

options:
  -h, --help            show this help message and exit
  --serial SERIAL       default to '/dev/ttyACM0'.
  --elf ELF             default to 'build/zephyr/zephyr.elf'.

~/zephyrproject/zephyr$ ./scripts/zaru trace --help
usage: zaru trace [-h] [--verbose] [--reboot] [--list] [--trigger FUNC_NAME] [--stopper FUNC_NAME] [--couple FUNC_NAME] [--perfetto] [--output OUTPUT] [--annotation]

options:
  -h, --help            show this help message and exit
  --verbose, -v         verbose mode.
  --reboot, -r          reboot target before getting traces.
  --list, -l            list trigger and stopper functions set in the target.
  --trigger FUNC_NAME, -t FUNC_NAME
                        set FUNC_NAME as the trigger function. When called it turns on instrumentation.
  --stopper FUNC_NAME, -s FUNC_NAME
                        set FUNC_NAME as the stopper function. When returned it turns off instrumentation.
  --couple FUNC_NAME, -c FUNC_NAME
                        set both trigger and stopper to FUNC_NAME.
  --perfetto, -p        export traces to Perfetto TraceViewer UI (Event Trace Format).
  --output OUTPUT, -o OUTPUT
                        output filename. Default to './perfetto.json'.
  --annotation, -a      disable function name annotation in function returns.

~/zephyrproject/zephyr$ ./scripts/zaru profile --help
usage: zaru profile [-h] [--verbose] [--reboot] [-n [N]]

options:
  -h, --help     show this help message and exit
  --verbose, -v  verbose mode.
  --reboot, -r   reboot target before profiling.
  -n [N]         show first N most expensive functions.

Changes in Zephyr core

No changes are necessary in the existing Zephyr core code or in the tracing
hooks since the instrumentation subsys is called from instrumented code
generated by the compiler.

Use cases

The instrumentation subsys and its CLI tool, zaru, can be used to trace and
profile specific code locations.

The user will set the trigger and stopper function names, usually both to
main. If they are changed at runtime, --reboot must be passed to reboot the
target before data is collected at the newly specified location.

Profiling data is always collected while tracing is active.

The collected data can be viewed using the zaru trace or zaru profile
subcommands. The size of the ring buffer can be configured according to the
memory constraints of the target: the larger the buffer, the more traces can
be collected.

If the ring buffer is in overwrite mode, the most recent events can be
observed. Otherwise, once the buffer gets full, data collection stops and only
the first (and oldest) events are kept recorded.

The tracing data can be exported to various formats, like the Event Trace Format,
that can be imported by Perfetto Trace Viewer.

Dependencies

The instrumentation subsys depends on the tracing subsys, i.e.
CONFIG_TRACING=y, for the thread scheduling hooks. It also depends on the
retained_mem driver for retaining the trigger and stopper addresses
between reboots, i.e. CONFIG_RETAINED_MEM=y and CONFIG_RETAINED_MEM_MUTEXES=n.
It also depends on the reboot capability (CONFIG_REBOOT=y).

The CLI tool, zaru, depends on the Python 3 Babeltrace2 module (bt2) for
decoding the binary stream (event records) via the Common Trace Format's Trace
Stream Description Language (TSDL).

Concerns and Unresolved Questions

  • Calculating deltas in profiling for recursive calls can easily lead to wrong
    values when the number of recursive calls exceeds the number of slots in the
    timestamp stack (usually 16). Maybe some heuristic can be used to avoid
    pushing further timestamps during recursion.
  • If the number of functions profiled exceeds the max number of functions for
    which deltas can be accumulated, some kind of heuristic can be used to
    discard the least relevant ones, for instance, functions with low delta
    values.
  • About 50% (20 bytes) of the event record size is spent on the thread name.
    It can be compacted further by using a lookup table and recording just the
    lookup index.
  • Due to limitations in CTF 1.8, a variant id can't be used to select a field
    type, so it's difficult to represent 32-bit vs 64-bit binary formats in a
    single metadata file. The same happens for expressing big-endian vs
    little-endian formats. Hence multiple metadata files would be necessary with
    CTF 1.8. This problem is solved in CTF 2.0, but currently there is no
    library implementation for it (it's a WIP).
  • The trigger and stopper mechanism can be enhanced, for instance, to be
    activated only a certain number of times, or only when the callee is called
    from specific callers.
  • It's possible to enhance the tracing by providing additional statistics. For
    example: if the trace buffer isn't large enough to capture all the events
    for the selected location, print how much memory would be necessary to
    capture all of them, so the user can adjust the buffer size based on this
    estimate.
@gromero gromero added the RFC Request For Comments: want input from the community label Apr 28, 2023
@github-actions

Hi @gromero! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

@gromero gromero self-assigned this May 1, 2023
@microbuilder microbuilder added dev-review To be discussed in dev-review meeting and removed dev-review To be discussed in dev-review meeting labels May 2, 2023
@cfriedt
Member

cfriedt commented May 2, 2023

Also, although tracing subsys has numerous hooks available in its APIs, these hooks are static and so can't be
easily set to trace arbitrary functions in kernel or application code.

💯
Very much support adding this upstream. I'll see if I can get some interested parties to comment here.

Once the traces and profiling data are collected, various transport layers can be used to transfer data from the target to the host, similarly to the transport backends available in the tracing subsys (UART, USB, etc).

Perfect - this is exactly what I was looking for!

Thanks for putting this together @gromero and @microbuilder

@vlotech
Contributor

vlotech commented May 15, 2023

This is looking great!
We used to debug and trace similarly with eCos and its internal tools, which solved issues in hours instead of weeks.

@microbuilder
Member

@gromero Any update on getting this over the finish line?

@TaiJuWu
Collaborator

TaiJuWu commented Feb 6, 2024

Is it ready to merge?

@gromero
Collaborator Author

gromero commented Mar 6, 2024

Is it ready to merge?

@TaiJuWu @microbuilder Unfortunately, not yet. I need to remove the variable length support in the ring buffer (use Zephyr API instead) and test it on a VM as requested by @carlocaione in Prague. I didn't give up. It's been just hard to find the cycles I need to address it. Otherwise, I deem it done.
