Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add strace feature to trace syscalls in the kernel. #1443

Merged
merged 13 commits into from
Jan 22, 2020

Conversation

gendx
Copy link
Contributor

@gendx gendx commented Nov 1, 2019

Pull Request Overview

This pull request adds an "strace" configuration to the kernel, which causes a trace of all syscalls and callbacks to be displayed on the debug port.

This also adds a new way of statically configure the kernel, via a kernel/src/config.rs file containing a static const configuration object. This allows to quickly turn on/off configuration values during development, while making sure that un-configured code is still checked for syntax and types, and allowing the compiler to optimize away dead code (due to the configuration being const).

Testing Strategy

This pull request was tested by:

  • Running make ci-travis.
  • Flashing Tock and an app on an nRF52840dk board and checking the debug output.

TODO or Help Wanted

N/A

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make formatall.

@alevy
Copy link
Member

alevy commented Nov 1, 2019

Implementation-wise, this seems good to me. And this also seems like a useful feature to have for debugging or auditing.

I think the biggest concern is around using features, as is the current persistent point of conflict.

Perhaps we should discuss whether there other alternatives (e.g. moving the decision as to whether to include strace to the board crate rather than as a feature in the kernel) might make sense? Or if, rather, this is really a case where features are the best option.

@alevy
Copy link
Member

alevy commented Nov 1, 2019

For some context: a number of folks have been burned by #ifdef hell in other OSs, where it becomes very very difficult for a non-expert to reason about which code is included, what version of things people are running, etc.

In some ways, Rust features are better than C-style #ifdefs, but in other ways not really... A lot of the pushback comes from a reluctance to see features creep their way into ubiquity, which is why there is such pushback for relatively innocuous individual PRs that happen to use features.

(I'm summarizing my sense of other people's thoughts. Those people should chime in themselves and express their opinions more accurately. Reminder to please address this particular PR in the comments---in other words, please address how we can best include this proposed feature).

@ppannuto
Copy link
Member

ppannuto commented Nov 1, 2019

For me, I think a service like strace falls on the line of either something that Tock supports or Tock does not support, in contrast to something that this particular compilation of Tock might support. One of the advantages afforded by an operating system is a somewhat standard suite of services afforded to applications; i.e. we have (in principle at least) a remote application update service.[1]

While features are fine for on-the-desk, quick in-loop debugging, they make life an absolute misery once things are deployed. If I have a misbehaving node in the wild and I want to inspect application behavior, I want to be able to turn on strace -- it can't require re-flashing the kernel to turn on debugging options for deployed nodes. Especially because that necessarily restarts an application, which may be in a rarely seen state that I want to introspect. We had a nice demo put together last year around how to remotely perform compute performance tracking. I think I'd like to see strace looking more like that, with a design that's looking towards all debugging scenarios.

[1] In practice, one major concern with this is that remote update is quite heavyweight in size / burden, and could be quite wasteful for nodes that don't use it. The solution was to have a privileged userspace application that does the majority of the work (download new binary, validate it, write it to flash, etc) that requires a minimal set of hooks in the kernel (networking, flash access, ability to stop an app and start an app). These hooks are always present in the kernel. <-- n.b. a decent chunk of this code is at a mostly working / proof of concept off in Signpost stuff

Here, I think strace is light enough that it can simply be directly integrated in the short term, probably as a bit in the process struct and the cfgs becoming ifs. The bit probably set by the application header in the short term with a longer-term vision of syscall that twiddles it. Further down the longer-term vision might have the remote access stuff and maybe a supervisory application that gets called on each syscall (as is done with strace/ptrace in practice today).

I'm definitely very strongly opposed to something like this being a feature that only some versions of compiled Tock kernels support. Keeping track of that in practice is impossible.


Written quickly on a plane before landing, sorry if things are unclear but wanted to get some stuff written / out there before call today. I'll try to join later if I can

@rajivr
Copy link
Contributor

rajivr commented Nov 2, 2019

For some context: a number of folks have been burned by #ifdef hell in other OSs, where it becomes very very difficult for a non-expert to reason about which code is included, what version of things people are running, etc.

Personally, I really like the fact that Tock doesn't have #[cfg(...)] in its code base. I agree with Amit, lack of cfg (and Rust Type system) helps beginners come up to speed with Tock, much more quickly than other OSes.

I've had few patches rejected in the past that we really needed and which I think would have been useful, but core team felt otherwise (PR 1264 as an example).

Something that we could potentially consider doing is to have something like tock-patches repository where we collect useful, but rejected patches, which some of us can keep current with release (we'll do that anyway).

@gendx
Copy link
Contributor Author

gendx commented Nov 4, 2019

Thanks for your feedback!

A separate tock-patches repository would definitely be the right place for a patch like this, given that it can be useful for debugging but not much worth the overhead in production.

I was mostly seeing this strace/ptrace as a "print to debug" feature (as it is implemented right now), not so much as an "hypervisor" available to other apps (given the elevated privileges that it gives).

I definitely agree with the ifdefs/features adding complexity & maintenance burden. On the other hand, the major advantage of conditional compilation is that it avoids the overhead of debug code, especially in terms of flash (due to extra code) and RAM (due to memory used by this extra code) usage, both of which are limited resources on embedded platforms. There can also be CPU overhead to formatting debug strings that are then thrown away when no debug output (UART, RTT) is connected. Removing any debugging code also reduces attack surface for a "production-grade" firmware.

Instead of defining one feature per debugging part, I see another possible trade-off to configure conditional code:

  • A single compile-time feature to distinguish between release/production builds and dev/debug builds (something like [RFC] Providing a "no_debug" building mode #1372). This would allow to entirely remove any debugging code when compiling for a production firmware. It could be a feature as we know them, or based on Cargo's --release flag, or any other suitable way to configure it.
  • A runtime DebugConfig type, which would contain a bunch of flags to control enabling of each debug part (e.g. strace could be such a flag). The DebugConfig would be initialized in the board setup (as a const debug_config = DebugConfig { ... }), so that one can easily toggle debug flags while developing. If we're lucky, the Rust compiler might even realize that flags are constant and optimize away code generated for unset flags. The flags could also be more flexible than Cargo features (that are limited to booleans) and instead contain integer (or richer) values.

@phil-levis
Copy link
Contributor

@gendx can you report on how much this increases code size?

I agree that the code size overhead seems small. Having a mechanism that will allow us to cut out debug output (for production/deployment) is increasingly important, but that's shouldn't block this.

That being said, I agree with Pat that allowing this to be a runtime configuration - i.e., turn it on and off dynamically -- is important. What would code with runtime switching look like?

@gendx
Copy link
Contributor Author

gendx commented Nov 23, 2019

As an experiment, I replaced the feature by a const configuration object. The good news is that the compiler optimizes away dead code as expected :)

Here are the numbers with strace = false and strace = true. The impact on code size is roughly:

  • 1KB in text for ARM.
  • 1.5KB in text + 350B in data + some changes in bss for RISC-V.

strace = false (I got the same numbers on before my commits, so the const config is really optimized away)

   text	   data	    bss	    dec	    hex	filename
 146432	   5832	  51512	 203776	  31c00	target/thumbv7em-none-eabi/release/imix
  89600	   4388	  61144	 155132	  25dfc	target/thumbv7em-none-eabi/release/hail
  56321	   2180	  71548	 130049	  1fc01	target/thumbv7em-none-eabi/release/nucleo_f429zi
  43520	   5452	  14004	  62976	   f600	target/riscv32imac-unknown-none-elf/release/hifive1
  59392	   1212	  72512	 133116	  207fc	target/thumbv7em-none-eabi/release/launchxl
  56833	   2180	  71548	 130561	  1fe01	target/thumbv7em-none-eabi/release/nucleo_f446re
  46592	   5028	  19544	  71164	  115fc	target/riscv32imac-unknown-none-elf/release/arty-e21
  58368	   1828	  47324	 107520	  1a400	target/thumbv7em-none-eabi/release/acd52832
  95744	   2464	 259680	 357888	  57600	target/thumbv7em-none-eabi/release/nrf52840dk
  77312	   1660	  39300	 118272	  1ce00	target/thumbv7em-none-eabi/release/nrf52dk

strace = true

   text	   data	    bss	    dec	    hex	filename
 147968	   5832	  51512	 205312	  32200	target/thumbv7em-none-eabi/release/imix
  90624	   4388	  61144	 156156	  261fc	target/thumbv7em-none-eabi/release/hail
  57857	   2180	  71548	 131585	  20201	target/thumbv7em-none-eabi/release/nucleo_f429zi
  45056	   5812	  14668	  65536	  10000	target/riscv32imac-unknown-none-elf/release/hifive1
  62464	   1212	  72512	 136188	  213fc	target/thumbv7em-none-eabi/release/launchxl
  57857	   2180	  71548	 131585	  20201	target/thumbv7em-none-eabi/release/nucleo_f446re
  48128	   5388	  19184	  72700	  11bfc	target/riscv32imac-unknown-none-elf/release/arty-e21
  59904	   1828	  47324	 109056	  1aa00	target/thumbv7em-none-eabi/release/acd52832
  97280	   2464	 259680	 359424	  57c00	target/thumbv7em-none-eabi/release/nrf52840dk
  78336	   1660	  39300	 119296	  1d200	target/thumbv7em-none-eabi/release/nrf52dk

@gendx
Copy link
Contributor Author

gendx commented Nov 23, 2019

So overall, I don't think features are needed for this kind of configuration. A const config seems to work, and will be cleaner as the code is always checked by the compiler and doesn't require to mess with Cargo features.

Regarding having this as a dynamic configuration, I'm not sure how you would want that to work, i.e. what should toggle on/off the stracing? If it's an application via some syscall, as you mentioned above, this requires defining a suitable syscall API, and it will be important to define a threat model. There could be something heavier that sends back stracing events to an application (although can an app strace itself??), but again that's much more work.

For now, I mostly see this stracing as a manual debugging tool from the kernel, e.g. to help implementing things like a USB stack, and a const configuration is sufficient for this purpose.

If you want to go ahead with the const configuration, I'll be happy to document this process further. Such configuration could be helpful for other things like selecting a scheduling algorithm for example (once someone implements some kind of "fair" scheduler), which I don't think should change at runtime. Another interesting improvement could be to define such const configuration via the boards, so that ultimately all configuration happens there (provided the compiler still optimizes away dead code configured from the boards).

@ppannuto
Copy link
Member

Quick reaction: I think this looks fantastic.

I agree that there's a bunch of other pieces that need to come together before dynamic configuration would make sense, but I like that this design is still quite amenable to eventually moving that direction if/when we were ready (at the obvious cost of permanent code-size growth).

@phil-levis
Copy link
Contributor

phil-levis commented Nov 23, 2019

I don't think we need dynamic tracing right now. My thought is that this could be a significant overhead, so something you don't want always on. But as Pat pointed out, you might want to be able to turn it on in the field in order to see what's happening. We don't have such as use case right now, so we don't need to be able to do it. But it's clear architecturally how we would if such a need comes up.

I like the const config approach very much.

Is the PR comment/text still correct?

phil-levis
phil-levis previously approved these changes Nov 23, 2019
@gendx
Copy link
Contributor Author

gendx commented Nov 27, 2019

Is the PR comment/text still correct?

I updated the PR text and added some documentation.

@gendx
Copy link
Contributor Author

gendx commented Nov 27, 2019

Hum, weird that Travis complains about macro_use on the website. I can't reproduce locally with make ci-travis, despite having the same version of Rust:

$ rustc --version
rustc 1.40.0-nightly (0e8a4b441 2019-10-16)
$ rustup --version
rustup 1.20.2 (13979c968 2019-10-16)
$ cargo --version
cargo 1.40.0-nightly (3a9abe3f0 2019-10-15)

Is there anything broken with the pinning of a specific toolchain?

@gendx
Copy link
Contributor Author

gendx commented Nov 27, 2019

Hum, weird that Travis complains about macro_use on the website. I can't reproduce locally with make ci-travis, despite having the same version of Rust

Ok, looks like Travis runs the tests on top of some automatic but hidden merge commit. I could reproduce and fix the bug with a manual merge commit.

@gendx
Copy link
Contributor Author

gendx commented Dec 2, 2019

@alevy Does that look good to you?

@gendx gendx requested a review from phil-levis December 4, 2019 14:53
@bradjc
Copy link
Contributor

bradjc commented Dec 9, 2019

While I'm intrigued by the idea of a config module, this has the same net effect as a feature, doesn't it?

@gendx
Copy link
Contributor Author

gendx commented Dec 10, 2019

While I'm intrigued by the idea of a config module, this has the same net effect as a feature, doesn't it?

Features can be more powerful in that they allow to conditionally compile arbitrary syntax. For example, one could define some type T to be completely different with or without some feature. Modifying types according to constants is currently not possible (although the upcoming const generics will provide more flexibility in that space as well).

But I think that whenever possible, a const config will be more understandable, more testable and less error-prone than features, as we discussed in many other pull requests.

I'm not sure whether strace or ptrace would be the most appropriate, so
a more explicit name may be clearer for the reader.
phil-levis
phil-levis previously approved these changes Jan 18, 2020
@phil-levis
Copy link
Contributor

bors r+

@gendx
Copy link
Contributor Author

gendx commented Jan 21, 2020

I don't think this is quite mergeable yet. It adds a new file to the kernel crate with no in-file documentation, not even the header line for rustdoc. Some of the doc/ description should be included there, but in particular how to use it should be documented in the rust source, as is done in other kernel crate files.

Done.

@gendx gendx requested review from bradjc and phil-levis January 21, 2020 15:21
Consistent with rest of kernel crate.
Avoid using the internal `idx()` function. This may not be the correct
way to reference a process in the future. Plus we have formatting
defined for AppId already.
bradjc
bradjc previously approved these changes Jan 21, 2020
Copy link
Contributor

@bradjc bradjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than leave comments I just added commits. See what you think.

  • Added link in doc readme to the new file.
  • Print {:?}, appid rather than use the .idx() function. This is looking ahead as it will be better to think of AppId as an opaque type rather than just a holder of a number. The display trait is already implemented for it.
  • Change pub(crate) to crate to match the rest of the kernel.
  • And change the print format a bit. I think it is important to have :#x as otherwise it is a pain to know what value something like "56" is. But I also changed some numbers to decimal as that is how they are specified in the Tock source (in particular driver numbers and subnumbers) and I think it is easier to not make readers have to do the conversion.

phil-levis
phil-levis previously approved these changes Jan 22, 2020
alistair23
alistair23 previously approved these changes Jan 22, 2020
@gendx gendx dismissed stale reviews from alistair23, phil-levis, and bradjc via 46d3977 January 22, 2020 10:26
@gendx
Copy link
Contributor Author

gendx commented Jan 22, 2020

  • Print {:?}, appid rather than use the .idx() function. This is looking ahead as it will be better to think of AppId as an opaque type rather than just a holder of a number. The display trait is already implemented for it.

You mention the Display trait but used the Debug formatting ;)

  • Change pub(crate) to crate to match the rest of the kernel.

Today I learned. Did this syntax appear in a recent Rust?

  • And change the print format a bit. I think it is important to have :#x as otherwise it is a pain to know what value something like "56" is.

+1

But I also changed some numbers to decimal as that is how they are specified in the Tock source (in particular driver numbers and subnumbers) and I think it is easier to not make readers have to do the conversion.

  • That's incorrect. Driver numbers are defined in hexadecimal: https://github.com/tock/tock/blob/master/capsules/src/driver.rs
  • I don't think we currently use more than 10 subdrivers/driver, so keeping those in decimal sounds good.
  • You forgot to update the formatting in schedule.
  • I don't know what you did for memop. You're mixing input (arg0) and result within the brackets.
  • The buffer length in allow would make more sense in hexadecimal, as buffer lengths are usually powers of two - and one can then more easily see the end address of the slice w.r.t. the start of the slice.
  • Let's keep the formatting simple: function(a, b, c) = result, there's no need to add too fancy brackets or res:, the = should be clear enough (same spirit as strace) and prints less characters to the console (therefore reducing the latency of debugging).

Updated the code again on my side.

@gendx gendx requested review from phil-levis and bradjc January 22, 2020 10:27
@gendx
Copy link
Contributor Author

gendx commented Jan 22, 2020

  • Print {:?}, appid rather than use the .idx() function. This is looking ahead as it will be better to think of AppId as an opaque type rather than just a holder of a number. The display trait is already implemented for it.

Also, note that in Process::remove_pending_callbacks there is no such thing as the AppId, only the index. Leaving it like this for now, I guess one can fix that later when the forward looking vision of an opaque AppId type is implemented in Process.

@bradjc
Copy link
Contributor

bradjc commented Jan 22, 2020

bors r+

@gendx
Copy link
Contributor Author

gendx commented Jan 22, 2020

Hum, don't know why bors canceled this merge.

@bradjc bradjc merged commit a8f30ae into tock:master Jan 22, 2020
@gendx gendx deleted the add-strace-feature branch January 22, 2020 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants