[Bug] Incorrect VirtualMuxAlarm implementation #1651

gendx · 2020-03-03T17:52:40Z

Context

When setting an alarm with VirtualMuxAlarm, a reference point (self.mux.prev) is set along with the alarm.

Lines 84 to 95 in f30961f

    
           if enabled > 0 { 
        
               let cur_alarm = self.mux.alarm.get_alarm(); 
        
               let now = self.now(); 
        
               if cur_alarm.wrapping_sub(now) > when.wrapping_sub(now) { 
        
                   self.mux.prev.set(self.mux.alarm.now()); 
        
                   self.mux.alarm.set_alarm(when); 
        
               } 
        
           } else { 
        
               self.mux.prev.set(self.mux.alarm.now()); 
        
               self.mux.alarm.set_alarm(when); 
        
           }

This prev reference point is then used to decide whether alarms in the mux have expired or not.

tock/capsules/src/virtual_alarm.rs

Line 149 in f30961f

.filter(|cur| cur.armed.get() && has_expired(cur.when.get(), now, prev))

Problem

The problem is that inside VirtualMuxAlarm::set_alarm there is no guarantee that cur_alarm >= now. We can be in a situation where an alarm 1 is already set, and by the time we set another alarm 2, alarm 1 is already in the past (but it hasn't fired yet).

> in VirtualMuxAlarm::set_alarm
alarms:       | alarm 1   |      | alarm 2
before:  prev | cur_alarm | now  | when
after:        |           | prev | cur_alarm

In that case, because the new prev is just after alarm 1, the alarm 1 won't be considered as expired next time MuxAlarm::fired is called - it would instead take time for the ticks to wrap-around for the alarm to be considered expired.

This is in particular observable with the Segger RTT debugging on Nordic, which sets a timer very close in the future (100us). By the time other code has run, this can already be in the past.

tock/capsules/src/segger_rtt.rs

Line 243 in f30961f

let interval = (100 as u32) * <A::Frequency>::frequency() / 1000000;

But all capsules using a virtual alarm are affected.

The observable result is that the alarm client (in this case Segger RTT) waits forever (for Segger RTT, kernel debugging & console seem to freeze).

Reproducing the bug

It's easy to reproduce by adding the following check to VirtualAlarmMux::set_alarm on an nRF52840-DK board with USB_DEBUGGING and trace_syscalls enabled. Note that I've already applied the fix from #1636.

if enabled > 0 {
    let cur_alarm = self.mux.alarm.get_alarm();
    let now = self.now();
    let prev = self.mux.prev.get();

    if now.wrapping_sub(prev) > cur_alarm.wrapping_sub(prev) {
        panic!("prev = {}, cur_alarm = {}, now = {}");
    }
    ...
}

I'm using the OpenSK app with the following parameters (see deploy.py).

$ ./deploy.py os --board nrf52840_dk
$ ./deploy.py app --debug --panic-console --opensk

And an example of panic is the following.

Kernel panic at <::core::macros::panic macros>:5:
        "prev = 9025, cur_alarm = 9028, now = 9031"

Without the manual panic, the debug output also freezes quite quickly in this setup.

Solutions?

I think it's incorrect to move prev forward when setting a new alarm. It should only be updated in the MuxAlarm::fired function, where expired alarms are actually all fired.
This again shows that the "3-way comparison" of 32-bit timestamps is error-prone, as it's complex to implement, hard to understand, hard to know when it's safe to update the reference point, etc. I think that this is another argument towards having truely monotonic (i.e. 64-bit) timestamps in time-related capsules & HILs (as previously discussed in https://groups.google.com/d/msg/tock-dev/CWecIZq7jqg/RrWySQLbBgAJ).

The text was updated successfully, but these errors were encountered:

alevy · 2020-03-03T21:03:37Z

@gendx this is a great summary, thanks!

I think it's incorrect to move prev forward when setting a new alarm. It should only be updated in the MuxAlarm::fired function, where expired alarms are actually all fired.

I, personally, need to think about this more carefully. Which leads me to...

This again shows that the "3-way comparison" of 32-bit timestamps is error-prone, as it's complex to implement, hard to understand, hard to know when it's safe to update the reference point, etc.

Yes, this is 100% my experience. I believe I've fixed timer bugs like these over several iterations in the past few years, and they are incredibly difficult to debug, especially since they often tend to appear at the edge (near wrapping cases), which occur very infrequently.

phil-levis · 2020-03-04T17:00:31Z

We can be in a situation where an alarm 1 is already set, and by the time we set another alarm 2, alarm 1 is already in the past (but it hasn't fired yet).

But this can be a normal occurrence with any finite type. I think your notion of "past" is confused, because a smaller number actually means in the future, due to wraparound. You can go down the RISC-V route and make all counter values 64 bits and hope things never wraparound. So I would rephrase this as "we can't tell if alarm 1 is in the recent past or far future."

But the thing is, this problem isn't new. It's been solved before. Put another way, there are lots of embedded systems out there that don't miss alarms and don't use 64-bit types. So how do they do it?

The root problem is the decoupling of when an alarm is request and then when it's processed. I think we discussed this in depth in tock-dev. There are good reasons to have a 64-bit counter in the kernel, but I don't think this one of them. The 64 bit counter will wraparound eventually too -- you may have avoided the problem in practice, but you haven't actually solved it.

gendx · 2020-03-04T18:22:49Z

I think your notion of "past" is confused, because a smaller number actually means in the future, due to wraparound.

To come back to the bug that I observed, there has been no wraparound (the bug occurred within a second of booting the board, all alarms are set to durations much smaller than the wraparound), so there's no confusion about what is past and future (referring to the notions of "past" and "future" in the real world).

Before going back to lengthy discussions about the time HIL, can we agree that prev should never move forward of any alarm still waiting to fire, to hopefully fix this bug?

Otherwise, Tock doesn't uphold the guarantee that setting an alarm in the future will fire. If there's no such guarantee, all capsules that rely on an alarm are essentially broken. Indeed, capsules do rely on the assumption that it's safe to wait on time::AlarmClient::fire() before resuming execution, and that this will indeed fire (more or less at the right time, but definitely not an wraparound later). This is what happens with the Segger RTT capsule.

The "setting an alarm in the future" part is also a reason why I suggested to have a set_alarm_from_now function in #1521. This could even go down to the chip level where "now" would be checked just before setting up the alarm - to make sure that the timestamp is not already elapsed by the time the hardware timer is set. Indeed, this has the shortcoming that the "now" we're talking about may be a bit later in the future, so the alarm may fire later than expected. But regardless of that, even if the interrupt triggers at the right time, it could also already be "too much in the future" by the time the Tock scheduler handles it.

But the thing is, this problem isn't new. It's been solved before. Put another way, there are lots of embedded systems out there that don't miss alarms and don't use 64-bit types. So how do they do it?

That's a good question, but if it was solved in Tock, there wouldn't be this bug.

The 64 bit counter will wraparound eventually too -- you may have avoided the problem in practice, but you haven't actually solved it.

I don't think this argument is relevant. Unless there is an embedded system where the wraparound happens within the lifetime of the device (please let me know if you have a system in mind, but the consensus in the last discussion we had was that 64 bits is always enough), the counter will not eventually wraparound.

If the problem is avoided in practice, I don't see any reason to solve it (at least not in mainline Tock). Of course, it's also unclear whether 64-bit timestamps will be simpler to handle (due to the extra cost of handling overflow interrupts). But solving for solving's sake shouldn't be a reason to pursue complex solutions.

I think we should be clear about the expected properties of an alarm, and make sure that these properties are implemented accordingly. Here are some further thoughts.

There should be some set_alarm_from(now, duration) method, which given a now timestamp (as recorded by the capsule) and a duration to wait for, (1) will fire in a reasonable time frame and (2) not before now + duration. Alternatively, an additional set_alarm_from_now(duration) function will compute now itself for capsules that don't care. But I assume making the former available would be more in line with @alevy's concerns in [RFC] Generalize ticks in the time HIL, and add support for 24- and 32-bit ticks #1521 (comment).
A side note is that passing the pair (now, duration) has the same cost as passing a 64-bit value. I'm not generally against it if it works (after all, simulating 64-bit timestamps with overflow interrupts isn't trivial either), but memory usage shouldn't be the main argument.
If the now + duration is already elapsed, the implementation may call fire() right away without setting any hardware timer.
The "reasonable" time frame within which the alarm will fire is TBD, but I don't think this could be a hard number like "before now + duration + N", because that depends on many factors.
- Maybe the real now when calling it is already past now + duration + N.
- Other capsules that are also requesting alarms at the same time. Each capsule has its own processing time, which we don't control. But there could be rules about the order in which alarms are triggered, so that a slow capsule that triggers high-frequency alarms doesn't starve other capsules.
- Down to the interrupts in the chip, we may want to delay the alarm a little bit, e.g. when a counter is at N and the alarm is to be set at N+1. This could lead to no interrupt being delivered (see the rules for nRF52840's RTC), violating the guarantee that the alarm will fire. Instead, the chip could set the interrupt at N+2 for example. The client's fire() will then trigger one tick too late, but anyway other operations in Tock between the interrupt and the fire() call may already take multiple ticks. I think it would be wrong though to immediately trigger fire() one tick too early, because that's against the "not before now + duration" rule.
As a corollary, it should be clear that capsules that expect real-time properties (e.g. an alarm must fire within N ticks of its timestamp) are not currently supported by Tock - in the sense that the capsule might work if it's alone on the system, but there are no guarantees if it competes with other "slow" capsules.

gendx · 2020-03-05T14:26:49Z

Note: this bug is a "re-discovery" of what I observed a few months ago already in #1513.

phil-levis · 2020-03-05T17:15:43Z

@gendx I agree completely with your observation on the bug. I was disagreeing with your generalization and conclusions from it. In particular,

If the problem is avoided in practice, I don't see any reason to solve it (at least not in mainline Tock). Of course, it's also unclear whether 64-bit timestamps will be simpler to handle (due to the extra cost of handling overflow interrupts). But solving for solving's sake shouldn't be a reason to pursue complex solutions.

They're not complex. They're what everyone else does. From a systems standpoint, I'd argue it's always better to solve the problem in essence -- you solve it in practice when doing so in essence is too expensive/complex.

JayKickliter · 2020-04-05T22:21:40Z

Dropping in to report that I have run into this bug in the real-world while bringing up Tock on a new nRF52-based platform.

This userland app prints (via RTT) Loop 1 once and delay_ms() never returns.

#include <timer.h>

int
main() {
    int i = 0;
    while (1) {
        printf("Loop %d\n", ++i);
        delay_ms(1000);
    }
    return 0;
}

@gendx

2089: Time redesign v3 r=bradjc a=phil-levis ### Pull Request Overview This pull request updates the time HIL to address a series of bug reports with the previous API (#1651, #1691, #1513). It also incorporates proposed changes by @gendx to generalize the width of counters/alarms/timers with an associated type rather than assume 32 bits (#1521). This has been implemented on all of the chips. It has been tested for the 24-bit nRF52 series, the 32-bit SAM4L, and 64-bit OpenTitan. The overall design and summary of the traits is described in https://github.com/tock/tock/blob/time-redesign-v3/doc/reference/trd-time.md We will update this document and give it a TRD number when ready to merge. There is also an update to the system call API: a new command for Alarm passes both a reference time and a dt. This new API can be used by using the `timer_v3_updates` branch of libtock-c. ### Testing Strategy This pull request was tested by compiling and testing on nrf52, SAM4L (imix), and OpenTitan (FPGA) boards. For imix and OT, it was tested using the multi_alarm_test and multi_timer_test tests in the kernel. On imix, it was tested in userspace by running a pair of multi_alarm_test processes. I was not able to test the userspace alarm driver on OpenTitan -- after struggling to get libtock-rs applications to run and librtock-c ones to compile I gave up. This is an important test because the capsule is 32 bits, and tries to automatically handle an underlying 64-bit Alarm. ### TODO or Help Wanted This pull request needs userspace testing on OT (to test that 64-to-32 conversion works correctly for the userspace API). This PR updates the mtimer implementation to seed it with a value close to a 32-bit overflow. So you do not have to run the test very long. Any userspace application that uses an alarm should be a good test. This pull request needs kernel testing on - [x] arty_e1 (@bradjc ) - [x] hifive (@alevy ) - [x] msp (@hudson-ayers ) - [x] nano33ble (OK) - [x] nucleo () - [x] redboard (@alistair23 ) - [x] stm32f () To test, you need run a `multi_alarm_test`. I've added a `multi_alarm_test` for each board and modified each board's `main.rs` to invoke it. Double-check you see a call to `multi_alarm_test::run_multi_alarm(mux_alarm)`. This test starts 3 alarms (A, B, C). The `dt` of these alarms is random, with one in 11 alarms (randomly) having a `dt` of 0. A typical output of the test looks something like this (this is from OpenTitan): ```TestA: Alarm fired. TestA@Ticks64(17033607736): Expected at Ticks64(17033607729) (diff = Ticks64(7)), setting alarm to Ticks64(17033616266) (delay = Ticks64(8537)) TestB: Alarm fired. TestB@Ticks64(17033614398): Expected at Ticks64(17033614391) (diff = Ticks64(7)), setting alarm to Ticks64(17033626851) (delay = Ticks64(12462)) TestC: Alarm fired. TestC@Ticks64(17033614581): Expected at Ticks64(17033614576) (diff = Ticks64(5)), setting alarm to Ticks64(17033618165) (delay = Ticks64(3592)) TestA: Alarm fired. TestA@Ticks64(17033616273): Expected at Ticks64(17033616266) (diff = Ticks64(7)), setting alarm to Ticks64(17033629481) (delay = Ticks64(13214)) TestC: Alarm fired. TestC@Ticks64(17033618172): Expected at Ticks64(17033618165) (diff = Ticks64(7)), setting alarm to Ticks64(17033626435) (delay = Ticks64(8270)) ``` The `delay` value is the `dt` set for the next invocation of this Alarm. The `diff` value is the number of ticks between the desired firing time and a call to `now` in the firing. Note that this value is large (e.g., 7 ticks above!) mostly because of these print statements: formatting the numbers takes significant cycles at these timescales. The three things to look for to make sure the test is running properly are: - All 3 Alarms are firing (one has not been lost or dropped or otherwise miscalculated) - The diff values are always positive - The diff of alarm firings after a delay of 0 are not excessively high (they will be higher than non-zero delays) ### Documentation Updated - [x] Updated the relevant files in `/docs`, or no updates are required. ### Formatting - [X] Ran `make prepush`. Co-authored-by: Philip Levis <pal@cs.stanford.edu> Co-authored-by: Guillaume Endignoux <guillaumee@google.com>

If we set alarms for very short duration, then it is possible for these alarms to get set in the past (which mean they have to wait until a roll over timer event). Ensure we are always comparing against the same common time in the past (prev) as our reference point. This independently fixes the same issue that is reported here: #1651 BUG=b:183711396 TEST=no more hangs on fw_updater during AP boot TEST=no more "missed" alarms that must wait until after roll over event. Change-Id: I7f1bfe703db4698bfd382d825d67c463c7cec6dd

This was referenced Mar 3, 2020

[RFC] Expand Process::last_syscall to a list or ring buffer #1641

Closed

[RFC] Panic buffer #1652

Closed

gendx changed the title ~~Bug in VirtualMuxAlarm implementation~~ [Bug] Incorrect VirtualMuxAlarm implementation Mar 5, 2020

gendx mentioned this issue Mar 5, 2020

Correctness of has_elapsed formula in virtual_alarm capsule? #1513

Closed

bradjc added the release-blocker Issue or PR that must be resolved before the next release label Mar 18, 2020

alexandruradovici mentioned this issue Mar 22, 2020

[draft] proposal for new alarm hil #1712

Closed

2 tasks

gendx mentioned this issue Apr 6, 2020

single-byte console writes never complete #1736

Closed

gendx mentioned this issue Apr 24, 2020

Upgrade to Tock-1.5-rc2 google/OpenSK#97

Merged

2 tasks

bradjc removed the release-blocker Issue or PR that must be resolved before the next release label Apr 24, 2020

cisen mentioned this issue Jun 27, 2020

tock 源码相关 cisen/blog#840

Open

phil-levis mentioned this issue Sep 1, 2020

Time redesign v3 #2089

Merged

9 tasks

bradjc closed this as completed Sep 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Incorrect VirtualMuxAlarm implementation #1651

[Bug] Incorrect VirtualMuxAlarm implementation #1651

gendx commented Mar 3, 2020

alevy commented Mar 3, 2020

phil-levis commented Mar 4, 2020

gendx commented Mar 4, 2020

gendx commented Mar 5, 2020

phil-levis commented Mar 5, 2020

JayKickliter commented Apr 5, 2020 •

edited

[Bug] Incorrect VirtualMuxAlarm implementation #1651

[Bug] Incorrect VirtualMuxAlarm implementation #1651

Comments

gendx commented Mar 3, 2020

Context

Problem

Reproducing the bug

Solutions?

alevy commented Mar 3, 2020

phil-levis commented Mar 4, 2020

gendx commented Mar 4, 2020

gendx commented Mar 5, 2020

phil-levis commented Mar 5, 2020

JayKickliter commented Apr 5, 2020 • edited

JayKickliter commented Apr 5, 2020 •

edited