[Bug] Incorrect VirtualMuxAlarm implementation #1651
Comments
@gendx this is a great summary, thanks!
I, personally, need to think about this more carefully. Which leads me to...
Yes, this is 100% my experience. I believe I've fixed timer bugs like these over several iterations in the past few years, and they are incredibly difficult to debug, especially since they often tend to appear at the edge (near wrapping cases), which occur very infrequently.
> We can be in a situation where an alarm 1 is already set, and by the time we set another alarm 2, alarm 1 is already in the past (but it hasn't fired yet).

But this can be a normal occurrence with any finite type. I think your notion of "past" is confused, because a smaller number can actually mean a time in the future, due to wraparound. You can go down the RISC-V route and make all counter values 64 bits and hope things never wrap around. So I would rephrase this as "we can't tell if alarm 1 is in the recent past or far future."

But the thing is, this problem isn't new. It's been solved before. Put another way, there are lots of embedded systems out there that don't miss alarms and don't use 64-bit types. So how do they do it? The root problem is the decoupling of when an alarm is requested and when it's processed. I think we discussed this in depth in tock-dev.

There are good reasons to have a 64-bit counter in the kernel, but I don't think this is one of them. The 64-bit counter will wrap around eventually too -- you may have avoided the problem in practice, but you haven't actually solved it.
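To make the "recent past or far future" ambiguity concrete, here is a minimal Rust sketch. It assumes a free-running 32-bit tick counter; the helper name `ticks_until` is hypothetical and is not part of Tock.

```rust
/// Distance in ticks from `now` to `alarm` on a free-running 32-bit
/// counter, computed modulo 2^32.
/// Hypothetical helper for illustration only, not a Tock API.
fn ticks_until(now: u32, alarm: u32) -> u32 {
    // Wrapping subtraction never panics; a target that is slightly
    // behind `now` comes out as a value close to u32::MAX.
    alarm.wrapping_sub(now)
}
```

With `now = 1_000`, an alarm at `1_010` is 10 ticks away, while an alarm at `990` (10 ticks in the past) yields `u32::MAX - 9`. From the counter values alone, that second case cannot be told apart from an alarm set almost a full wrap into the future.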
To come back to the bug that I observed, there has been no wraparound (the bug occurred within a second of booting the board, all alarms are set to durations much smaller than the wraparound), so there's no confusion about what is past and future (referring to the notions of "past" and "future" in the real world). Before going back to lengthy discussions about the time HIL, can we agree that Otherwise, Tock doesn't uphold the guarantee that setting an alarm in the future will fire. If there's no such guarantee, all capsules that rely on an alarm are essentially broken. Indeed, capsules do rely on the assumption that it's safe to wait on The "setting an alarm in the future" part is also a reason why I suggested to have a
That's a good question, but if it was solved in Tock, there wouldn't be this bug.
I don't think this argument is relevant. Unless there is an embedded system where the wraparound happens within the lifetime of the device (please let me know if you have a system in mind, but the consensus in the last discussion we had was that 64 bits is always enough), the counter will not eventually wrap around.

If the problem is avoided in practice, I don't see any reason to solve it (at least not in mainline Tock). Of course, it's also unclear whether 64-bit timestamps will be simpler to handle (due to the extra cost of handling overflow interrupts). But solving for solving's sake shouldn't be a reason to pursue complex solutions.

I think we should be clear about the expected properties of an alarm, and make sure that these properties are implemented accordingly. Here are some further thoughts.
Note: this bug is a "re-discovery" of what I observed a few months ago already in #1513.
@gendx I agree completely with your observation on the bug. I was disagreeing with your generalization and conclusions from it. In particular,

> If the problem is avoided in practice, I don't see any reason to solve it (at least not in mainline Tock). Of course, it's also unclear whether 64-bit timestamps will be simpler to handle (due to the extra cost of handling overflow interrupts). But solving for solving's sake shouldn't be a reason to pursue complex solutions.

They're not complex. They're what everyone else does. From a systems standpoint, I'd argue it's always better to solve the problem in essence -- you solve it in practice when doing so in essence is too expensive/complex.
Dropping in to report that I have run into this bug in the real world while bringing up Tock on a new nRF52-based platform. This userland app prints (via RTT)

#include <stdio.h>
#include <timer.h>

int main(void) {
  int i = 0;
  // Print a counter once per second; each delay relies on a virtual alarm.
  while (1) {
    printf("Loop %d\n", ++i);
    delay_ms(1000);
  }
  return 0;
}
2089: Time redesign v3 r=bradjc a=phil-levis

### Pull Request Overview

This pull request updates the time HIL to address a series of bug reports with the previous API (#1651, #1691, #1513). It also incorporates proposed changes by @gendx to generalize the width of counters/alarms/timers with an associated type rather than assume 32 bits (#1521).

This has been implemented on all of the chips. It has been tested for the 24-bit nRF52 series, the 32-bit SAM4L, and 64-bit OpenTitan.

The overall design and summary of the traits is described in https://github.com/tock/tock/blob/time-redesign-v3/doc/reference/trd-time.md

We will update this document and give it a TRD number when ready to merge.

There is also an update to the system call API: a new command for Alarm passes both a reference time and a dt. This new API can be used by using the `timer_v3_updates` branch of libtock-c.

### Testing Strategy

This pull request was tested by compiling and testing on nrf52, SAM4L (imix), and OpenTitan (FPGA) boards. For imix and OT, it was tested using the multi_alarm_test and multi_timer_test tests in the kernel. On imix, it was tested in userspace by running a pair of multi_alarm_test processes. I was not able to test the userspace alarm driver on OpenTitan -- after struggling to get libtock-rs applications to run and libtock-c ones to compile I gave up. This is an important test because the capsule is 32 bits, and tries to automatically handle an underlying 64-bit Alarm.

### TODO or Help Wanted

This pull request needs userspace testing on OT (to test that 64-to-32 conversion works correctly for the userspace API). This PR updates the mtimer implementation to seed it with a value close to a 32-bit overflow, so you do not have to run the test very long. Any userspace application that uses an alarm should be a good test.

This pull request needs kernel testing on

- [x] arty_e1 (@bradjc )
- [x] hifive (@alevy )
- [x] msp (@hudson-ayers )
- [x] nano33ble (OK)
- [x] nucleo ()
- [x] redboard (@alistair23 )
- [x] stm32f ()

To test, you need to run a `multi_alarm_test`. I've added a `multi_alarm_test` for each board and modified each board's `main.rs` to invoke it. Double-check you see a call to `multi_alarm_test::run_multi_alarm(mux_alarm)`. This test starts 3 alarms (A, B, C). The `dt` of these alarms is random, with one in 11 alarms (randomly) having a `dt` of 0. A typical output of the test looks something like this (this is from OpenTitan):

```
TestA: Alarm fired.
TestA@Ticks64(17033607736): Expected at Ticks64(17033607729) (diff = Ticks64(7)), setting alarm to Ticks64(17033616266) (delay = Ticks64(8537))
TestB: Alarm fired.
TestB@Ticks64(17033614398): Expected at Ticks64(17033614391) (diff = Ticks64(7)), setting alarm to Ticks64(17033626851) (delay = Ticks64(12462))
TestC: Alarm fired.
TestC@Ticks64(17033614581): Expected at Ticks64(17033614576) (diff = Ticks64(5)), setting alarm to Ticks64(17033618165) (delay = Ticks64(3592))
TestA: Alarm fired.
TestA@Ticks64(17033616273): Expected at Ticks64(17033616266) (diff = Ticks64(7)), setting alarm to Ticks64(17033629481) (delay = Ticks64(13214))
TestC: Alarm fired.
TestC@Ticks64(17033618172): Expected at Ticks64(17033618165) (diff = Ticks64(7)), setting alarm to Ticks64(17033626435) (delay = Ticks64(8270))
```

The `delay` value is the `dt` set for the next invocation of this Alarm. The `diff` value is the number of ticks between the desired firing time and a call to `now` in the firing. Note that this value is large (e.g., 7 ticks above!) mostly because of these print statements: formatting the numbers takes significant cycles at these timescales.

The three things to look for to make sure the test is running properly are:

- All 3 Alarms are firing (one has not been lost or dropped or otherwise miscalculated)
- The diff values are always positive
- The diff of alarm firings after a delay of 0 are not excessively high (they will be higher than non-zero delays)

### Documentation Updated

- [x] Updated the relevant files in `/docs`, or no updates are required.

### Formatting

- [X] Ran `make prepush`.

Co-authored-by: Philip Levis <pal@cs.stanford.edu>
Co-authored-by: Guillaume Endignoux <guillaumee@google.com>
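One way to read the "reference time plus `dt`" form of the alarm command is as an expiry test that measures elapsed time from the reference instead of comparing absolute counter values. The following is a minimal sketch of that idea, assuming a 32-bit tick counter and assuming the check runs less than one full wrap after `reference`; the function name `alarm_expired` is hypothetical and this is not the code from the pull request.

```rust
/// Returns true if an alarm requested at `reference` with delay `dt`
/// has expired by the time the counter reads `now`.
/// Hypothetical helper: assumes fewer than 2^32 ticks have elapsed
/// since `reference`, so the elapsed time is unambiguous.
fn alarm_expired(reference: u32, dt: u32, now: u32) -> bool {
    // Elapsed ticks since the reference, correct across a counter wrap.
    let elapsed = now.wrapping_sub(reference);
    elapsed >= dt
}
```

Because both sides of the comparison are durations measured from the same reference, a `dt` of 0 is reported as expired immediately, and an alarm whose target has already passed is not mistaken for one almost a full wrap away.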
If we set alarms for a very short duration, it is possible for these alarms to get set in the past (which means they have to wait until a rollover timer event). Ensure we are always comparing against the same common time in the past (prev) as our reference point.

This independently fixes the same issue that is reported here: #1651

BUG=b:183711396
TEST=no more hangs on fw_updater during AP boot
TEST=no more "missed" alarms that must wait until after roll over event.

Change-Id: I7f1bfe703db4698bfd382d825d67c463c7cec6dd
Context
When setting an alarm with `VirtualMuxAlarm`, a reference point (`self.mux.prev`) is set along with the alarm.

tock/capsules/src/virtual_alarm.rs
Lines 84 to 95 in f30961f

This `prev` reference point is then used to decide whether alarms in the mux have expired or not.

tock/capsules/src/virtual_alarm.rs
Line 149 in f30961f
Problem
The problem is that inside `VirtualMuxAlarm::set_alarm` there is no guarantee that `cur_alarm >= now`. We can be in a situation where an alarm 1 is already set, and by the time we set another alarm 2, alarm 1 is already in the past (but it hasn't fired yet).

In that case, because the new `prev` is just after alarm 1, the alarm 1 won't be considered as expired next time `MuxAlarm::fired` is called - it would instead take time for the ticks to wrap around for the alarm to be considered expired.

This is in particular observable with the Segger RTT debugging on Nordic, which sets a timer very close in the future (100us). By the time other code has run, this can already be in the past.

tock/capsules/src/segger_rtt.rs
Line 243 in f30961f

But all capsules using a virtual alarm are affected.

The observable result is that the alarm client (in this case Segger RTT) waits forever (for Segger RTT, kernel debugging & console seem to freeze).
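The failure mode described above can be replayed with plain integers. This is a minimal sketch, assuming 32-bit ticks and an expiry check that compares distances from `prev` modulo 2^32; the function names and the tick values are illustrative assumptions, not code copied from `virtual_alarm.rs`.

```rust
/// Expiry check relative to a shared reference point `prev`:
/// an alarm counts as expired once `now` is at least as far from
/// `prev` as the alarm is (all arithmetic modulo 2^32).
/// Illustrative model only.
fn has_expired(alarm: u32, now: u32, prev: u32) -> bool {
    now.wrapping_sub(prev) >= alarm.wrapping_sub(prev)
}

/// Replays the scenario with made-up tick values and returns
/// (result with the original `prev`, result after `prev` was moved).
fn replay_scenario() -> (bool, bool) {
    let prev_original = 0u32; // reference point when alarm 1 was set
    let alarm_1 = 100u32; // alarm 1 is due at tick 100
    let prev_moved = 150u32; // alarm 2 is set at tick 150, moving `prev`
    let fired_at = 200u32; // the mux finally processes alarms at tick 200

    let with_original_prev = has_expired(alarm_1, fired_at, prev_original);
    // alarm_1 - prev_moved wraps to 2^32 - 50, far larger than the 50
    // elapsed ticks, so alarm 1 now looks like it is in the far future.
    let with_moved_prev = has_expired(alarm_1, fired_at, prev_moved);
    (with_original_prev, with_moved_prev)
}
```

In this model `replay_scenario()` returns `(true, false)`: with the original reference point alarm 1 is correctly seen as expired, but once `prev` has been moved past it, the same check only succeeds again after the counter wraps.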
Reproducing the bug
It's easy to reproduce by adding the following check to `VirtualAlarmMux::set_alarm` on an nRF52840-DK board with `USB_DEBUGGING` and `trace_syscalls` enabled. Note that I've already applied the fix from #1636.

I'm using the OpenSK app with the following parameters (see deploy.py).

And an example of panic is the following.

Without the manual `panic`, the debug output also freezes quite quickly in this setup.

Solutions?

`prev` forward when setting a new alarm. It should only be updated in the `MuxAlarm::fired` function, where expired alarms are actually all fired.
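The idea of updating `prev` only at the point where expired alarms are fired can be illustrated with a toy model. This is a sketch under stated assumptions: `ToyMux` and its methods are hypothetical, it tracks bare 32-bit deadlines in a `Vec`, and it does not model reprogramming the hardware alarm or notifying clients as a real mux must.

```rust
/// Toy alarm multiplexer (hypothetical; not Tock's `MuxAlarm`).
struct ToyMux {
    /// Reference point: the last time expired alarms were processed.
    prev: u32,
    /// Pending absolute deadlines, in ticks.
    alarms: Vec<u32>,
}

impl ToyMux {
    fn new(now: u32) -> Self {
        ToyMux { prev: now, alarms: Vec::new() }
    }

    /// Registers a deadline. Deliberately leaves `prev` untouched.
    fn set_alarm(&mut self, when: u32) {
        self.alarms.push(when);
    }

    /// Removes and returns every deadline reached between `prev` and
    /// `now`, then moves `prev` forward. This is the only place where
    /// `prev` changes.
    fn fired(&mut self, now: u32) -> Vec<u32> {
        let prev = self.prev;
        let elapsed = now.wrapping_sub(prev);
        let (expired, pending): (Vec<u32>, Vec<u32>) =
            std::mem::take(&mut self.alarms)
                .into_iter()
                .partition(|a| a.wrapping_sub(prev) <= elapsed);
        self.alarms = pending;
        self.prev = now;
        expired
    }
}
```

For example, with `prev = 0`, setting deadlines at ticks 100 and 300 and then calling `fired(200)` returns the deadline at 100 even though a second alarm was registered after it was due, because registering an alarm never changes the reference point.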