Process trace events into intermediate storage format #35

Open · mjcarroll opened this issue Feb 6, 2023 · 4 comments

@mjcarroll (Member)

As first discussed in safe-ros/ros2_profiling#1

The idea would be to read raw CTF traces into some intermediate time-series data that is well suited for analysis tasks. Further high-level APIs could be built to ingest the intermediate data.

General requirements:

  • Disk- and memory-efficient
  • Readable/writable across multiple languages
  • Convenient API for interaction with time series as well as basic tasks like JOIN

tracetools_read currently uses pandas DataFrames.
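
For reference, a minimal sketch of the third requirement (time-series interaction and a basic JOIN) as it looks with DataFrames today; the column names are illustrative, not the actual tracetools_analysis schema:

```python
import pandas as pd

# Hypothetical event tables; columns are illustrative only.
callbacks = pd.DataFrame({
    'timestamp': [100, 250, 400],            # ns since trace start
    'callback_object': [0x1, 0x2, 0x1],
})
callback_symbols = pd.DataFrame({
    'callback_object': [0x1, 0x2],
    'symbol': ['TimerCallback', 'SubCallback'],
})

# Basic JOIN: attach symbol names to callback instances.
joined = callbacks.merge(callback_symbols, on='callback_object', how='left')

# Time-series JOIN: match each event to the nearest earlier sample of
# another series (e.g. CPU usage), via merge_asof (requires sorted keys).
cpu = pd.DataFrame({'timestamp': [90, 240, 390], 'cpu_pct': [12.0, 55.0, 30.0]})
aligned = pd.merge_asof(
    callbacks.sort_values('timestamp'),
    cpu.sort_values('timestamp'),
    on='timestamp', direction='backward',
)
```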

Proposed alternatives:

CC: @iluetkeb

@christophebedard (Member) commented Feb 7, 2023

@mjcarroll I would just like to clarify a few things:

  1. tracetools_read (in this repository)
    1. Currently uses the babeltrace Python bindings to read a CTF trace from disk and return a list of events as Python dictionaries; it doesn't do anything else.
  2. tracetools_analysis (in ros-tracing/tracetools_analysis)
    1. Reads events from a CTF trace using tracetools_read and writes the dictionaries to a file (pickle). This is because reading from a pickle file is quicker than reading the CTF trace with babeltrace, and it lets us read the actual CTF trace only once and then just read the pickle file afterwards (see the first sketch after this list). See tracetools_analysis/process.py's process() function or the load_file() function, which is usually what's used in Jupyter notebooks, as you probably know.
  2. Processes events one by one and writes some data to pandas DataFrames. See tracetools_analysis/processor/ros2.py and tracetools_analysis/data_model/ros2.py, respectively. A single row in a DataFrame roughly corresponds to a single trace event, but at this point the trace events are abstracted away.
      1. To improve performance, it actually first writes data to normal Python lists, and then converts these lists to DataFrames once all trace events have been processed; appending to a Python list is much faster than appending to a DataFrame (see the second sketch after this list).
    3. Then some functions are written to compare/merge/etc. DataFrames to extract high-level information. See files under tracetools_analysis/utils/.
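
For illustration, here is roughly what steps 1.i and 2.i amount to, assuming the legacy babeltrace (v1) Python bindings; this is a simplified sketch, not the actual tracetools_read/tracetools_analysis code:

```python
import pickle

import babeltrace  # legacy v1 Python bindings


def read_ctf_events(trace_path):
    """Read a CTF trace and return its events as plain dictionaries."""
    collection = babeltrace.TraceCollection()
    collection.add_trace(trace_path, 'ctf')
    events = []
    for event in collection.events:
        # Copy field values out of the babeltrace event object so the
        # result is picklable.
        d = {key: event[key] for key in event.keys()}
        d['_name'] = event.name
        d['_timestamp'] = event.timestamp
        events.append(d)
    return events


def load_events(trace_path, cache_path):
    """Use the pickle cache if present; otherwise convert the trace once."""
    try:
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        events = read_ctf_events(trace_path)
        with open(cache_path, 'wb') as f:
            pickle.dump(events, f)
        return events
```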
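
And a sketch of the list-append pattern from 2.ii.a: rows are accumulated in a plain Python list and converted to a DataFrame once at the end, which avoids repeated DataFrame reallocation (the event/field names below are illustrative):

```python
import pandas as pd

# `events` as returned by load_events() in the previous sketch.
events = load_events('path/to/trace', 'path/to/trace.pickle')

rows = []
for event in events:
    if event['_name'] == 'ros2:callback_start':
        rows.append({
            'timestamp': event['_timestamp'],
            'callback_object': event.get('callback'),
        })

# Convert once, at the end: much cheaper than appending row by row to a
# DataFrame, which reallocates its internal arrays on each append.
df = pd.DataFrame(rows)
```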

> The idea would be to read raw CTF traces into some intermediate time-series data that is well suited for analysis tasks. Further high-level APIs could be built to ingest the intermediate data.

So I'm guessing you're talking about an alternative to step 2.ii, and not talking about storing the events themselves?

Then in parallel we can change steps 1.i/2.i, which is kind of more related to #22.

@mjcarroll (Member, Author)

> So I'm guessing you're talking about an alternative to step 2.ii, and not talking about storing the events themselves?

Yes, mostly talking about an alternative to 2.ii in this outline.

Depending on the outcome of #22, there may be potential to collapse 2.i and 2.ii into a single step, e.g. if ctf -> <intermediate format> is wildly more efficient than ctf -> pickled dict -> intermediate format while retaining all the same information. I don't see this as a high priority, though.
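
A rough sketch of what that collapsed step could look like, streaming events straight from babeltrace into Parquet via pyarrow with no pickle stage (the schema and batching here are illustrative assumptions):

```python
import pyarrow as pa
import pyarrow.parquet as pq

import babeltrace  # legacy v1 Python bindings, as in the earlier sketch


def ctf_to_parquet(trace_path, out_path, batch_size=10_000):
    """Stream CTF events directly into a Parquet file."""
    collection = babeltrace.TraceCollection()
    collection.add_trace(trace_path, 'ctf')
    # Minimal illustrative schema; a real one would cover event payloads.
    schema = pa.schema([
        ('name', pa.string()),
        ('timestamp', pa.int64()),
    ])
    with pq.ParquetWriter(out_path, schema) as writer:
        batch = []
        for event in collection.events:
            batch.append({'name': event.name, 'timestamp': event.timestamp})
            if len(batch) >= batch_size:
                writer.write_table(pa.Table.from_pylist(batch, schema=schema))
                batch = []
        if batch:
            writer.write_table(pa.Table.from_pylist(batch, schema=schema))
```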

@iluetkeb (Contributor) commented Feb 10, 2023

Coming at this from the usage end, we have two different kinds of information in the CTF trace:

  1. meta-data, essentially mapping names of functions and endpoints to thread IDs and memory addresses
  2. activity data, that is, callbacks being called, messages being sent/received, etc.

Meta-data is emitted first, but due to things like the life-cycle, system modes, and more complex launch scenarios, the entire trace file has to be scanned to be sure to get everything. We usually need all meta-data for later association. For reasons of efficiency and storage size, I am assuming that we also want to store meta-data separately during later stages, but note that we never measured the advantage of this, and due to things like category tables etc., merged storage might actually be comparable.
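
As a minimal illustration of that full scan (a sketch only; the event and field names come from ros2_tracing, and `events` is assumed to be the dict-based event list from the loading sketch earlier in the thread):

```python
# One full pass, since metadata-like events can appear at any point in
# the trace (lifecycle transitions, late-launched nodes, etc.).
symbols = {}   # callback address -> symbol name
nodes = {}     # node handle -> node name
activity = []  # everything else, to be processed later / in chunks

for event in events:
    name = event['_name']
    if name == 'ros2:rclcpp_callback_register':
        symbols[event['callback']] = event['symbol']
    elif name == 'ros2:rcl_node_init':
        nodes[event['node_handle']] = event['node_name']
    else:
        activity.append(event)
```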

In contrast, for activity data, it is often sufficient, and quite often very useful, to process just parts of it, usually temporal chunks. For example, for performance analysis, we usually need to differentiate at least whether the system is starting up, idle, active, or shutting down. Many systems also frequently switch between active and idle.

Last but not least, memory-wise it can be necessary to load data only partially.

I think it doesn't matter very much in practice whether we store data after it has been converted into a pandas DataFrame or before, assuming we use one of the several data storage formats that can easily be written from and loaded into pandas DataFrames (like those from Apache Arrow).
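
A sketch of that round trip, including the partial (temporal-chunk) loading mentioned above, assuming Parquet via pyarrow; file and column names are illustrative:

```python
import pandas as pd
import pyarrow.parquet as pq

# A processed activity table (illustrative columns).
df = pd.DataFrame({
    'timestamp': [900_000_000, 1_500_000_000, 2_500_000_000],
    'callback_object': [0x1, 0x2, 0x1],
})

# Write to Parquet (an Arrow-native on-disk format)...
df.to_parquet('activity.parquet', index=False)

# ...and later load only a temporal chunk, e.g. the "active" phase,
# without reading the whole file (predicate pushdown skips row groups).
t0, t1 = 1_000_000_000, 2_000_000_000  # illustrative ns bounds
chunk = pq.read_table(
    'activity.parquet',
    filters=[('timestamp', '>=', t0), ('timestamp', '<', t1)],
).to_pandas()
```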

@mjcarroll (Member, Author)

> Meta-data is emitted first, but due to things like the life-cycle, system modes, and more complex launch scenarios, the entire trace file has to be scanned to be sure to get everything.

I was hoping, but could not find evidence, that babeltrace2 would let us filter on event type/name, such that you could iterate over all metadata first before doing filtered views of the longer-running event data.
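
For reference, the bt2 (babeltrace2) Python bindings expose a message iterator, so a first metadata-only pass can at least be filtered by event name on the Python side; a minimal sketch (the filtering below happens in user code, not in a native babeltrace2 filter component, and the event names are illustrative):

```python
import bt2

# Illustrative set of metadata event names (from ros2_tracing).
METADATA_EVENTS = {
    'ros2:rcl_node_init',
    'ros2:rclcpp_callback_register',
}


def iter_events(trace_path, names=None):
    """Yield (name, timestamp_ns, event) for each event message.

    Name filtering is done here on the Python side.
    """
    for msg in bt2.TraceCollectionMessageIterator(trace_path):
        if type(msg) is not bt2._EventMessageConst:
            continue
        if names is not None and msg.event.name not in names:
            continue
        yield (msg.event.name,
               msg.default_clock_snapshot.ns_from_origin,
               msg.event)


# First pass: metadata only. Payload fields can then be read off the
# event, e.g. event.payload_field['node_name'] for ros2:rcl_node_init.
for name, ts, event in iter_events('path/to/trace', METADATA_EVENTS):
    ...
```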
