Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: controller: Follow up on ticker conflict resolution #16830

Closed
cvinayak opened this issue Jun 14, 2019 · 11 comments

Comments

@cvinayak
Copy link
Contributor

commented Jun 14, 2019

Is your enhancement proposal related to a problem? Please describe.
A clear and concise description of what the problem is.

With the changes in the PR #15787, I notice 2 main issues:

  1. The scan_adv sample is missing required advertising events due to incorrect conflict resolution. As the conflict resolution is done late into an occupied region, the overlapping event is skipped and its lazy_current is incremented.
    a. Instead if the occupied region has not been lazy (skipped) before, then the new overlapping event has equal or higher priority. And should its LLL be given a chance to contest by using the is_abort procedure. This applies to nRF52 with higher CPU power.
    b. For nRF51, similar solution using the lazy, but has to be done advance in time during the grant of the region to the currently occupying event, because ticker_worker cannot be running while a Radio ISR is expected. Of course, ticker worker could be configured at lower IRQ priority which means we exhaust a minimum of 2 ISR priority levels of the total 4 available on nRF51 (CM0).
    c. Old behavior, logic capture attached
    scan_adv_ticker_old_new.zip

scan_adv_before

d. New behavior (missing adv events), , logic capture attached
scan_adv_after

  1. In nRF51, due to slow CPU, any execution of ticker_worker inside an occupied region will introduce high Radio ISR latency causing fatal failure in controller.
    a. The conflict resolution should have to be done immediately after the Prepare of the current winning event and before the start of Radio of the winning event. i.e. resolution of any future event that will overlap this occupied region.

I will assume the upstream/master would be broken for nRF51 and for nRF52 will be degraded in terms of expected functionality till the above issues are addressed.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Some in the above description (which was an email initially)

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

We will discuss in the comments below.

Additional context
Add any other context or graphics (drag-and-drop an image) about the feature request here.

@cvinayak

This comment has been minimized.

Copy link
Contributor Author

commented Jun 14, 2019

From: Chettimada, Vinayak Kariappa [mailto:vinayak.kariappa.chettimada@nordicsemi.no]
Sent: 13. juni 2019 17:31
To: Morten Priess (MTPR) mtpr@demant.com; Peter Søren Kirk Hansen (PSHA) psha@demant.com; Mark Ruvald Pedersen (MPED) mped@demant.com
Cc: Alberto Escolar Piedras (ALPI) ALPI@demant.com; Cufi, Carles Carles.Cufi@nordicsemi.no
Subject: Re: Difference in scan_adv sample after #15787

Hi Morten,

We hold back on roll-back. I feel, we can iterate on the existing changes and come up on a decision or list of requirements to be covered. A skype call makes sense, unfortunately I will not be available until next Friday (I need to be full time at my desk to make any sense to have fruitful discussion, I travel back to Bangalore next Tuesday).

Thinking aloud:
A little something in me says, we are improving things, whats in is better and we will get it right with an iteration. Just need to take the right ranked decision and/or compromise in requirement/limitation.
Limitations/requirements I place myself are:

  1. The design shall be able to sustain with only one IRQ level used, and should sustain with use of multiple IRQ levels to reduce the maximum ISR latencies in a SoC when the controller uses lower IRQ priority levels.
  2. Controller’s own implementation shall avoid introducing Radio ISR latencies (nRF platform limitation), ULL executions have acceptable CPU time to yield to LLL context when in same IRQ priority level.
  3. Predictive use of future CPU cycles, use less CPU now than more later in timeline under most likely scenario.

Regards,
Vinayak

@cvinayak

This comment has been minimized.

Copy link
Contributor Author

commented Jun 14, 2019

From: Morten Priess (MTPR) mtpr@demant.com
Sent: 13 June 2019 23:31
To: Chettimada, Vinayak Kariappa vinayak.kariappa.chettimada@nordicsemi.no
Cc: Peter Søren Kirk Hansen (PSHA) psha@demant.com; Mark Ruvald Pedersen (MPED) mped@demant.com; Alberto Escolar Piedras (ALPI) ALPI@demant.com; Cufi, Carles Carles.Cufi@nordicsemi.no
Subject: Regarding the missing required advertising events with new ticker

Hi All,

It turns out that it is a matter of the “old” ticker favoring long intervals.
It’s like if you want to go to the movies, and book your seat in advance – the longer time in advance you book, the more certain you are to get the best seat ;-)

The “old” ticker reserves the ADV slot in good time, as it is inserted on the ticker “bottom half” (ticker_job), after timeout.
When some other ticker node comes along and wants to occupy that slot, it gets re-scheduled to a later slot with no conflict. So older reservations prevail.
This is true if there is no latency and no force-flag involved, which is how it looks for the adv_scan sample.

The “new” ticker evaluates the nodes at the timeout, just prior to the ADV ticker node timeout and says “next node (ADV) does not have higher priority, no latency” and goes ahead and schedules the node.
When ADV ticker node expires, the previous node is still in the air, and ADV must be skipped.

So basically one could argue that the “new” ticker does it right… Why should old reservations have higher priority? If this is a feature of the ticker that actually makes things run better, obviously that’s how it should be.
Changing the priority of the ADV ticker node from 0 to -1, changes the picture. With ADV on -1 priority, the behavior is identical to the “old” ticker, down to the microsecond – in simulation, that is.

Could this be the solution for the collision resolving issue? To set the default priority of ADV ticker node to -1?

Regarding the CPU usage, I will look into that tomorrow.

B.R.
Morten

@cvinayak

This comment has been minimized.

Copy link
Contributor Author

commented Jun 14, 2019

Fra: Chettimada, Vinayak Kariappa vinayak.kariappa.chettimada@nordicsemi.no
Sendt: 14. juni 2019 14:36
Til: Morten Priess (MTPR)
Cc: Peter Søren Kirk Hansen (PSHA); Mark Ruvald Pedersen (MPED); Alberto Escolar Piedras (ALPI); Cufi, Carles
Emne: RE: Regarding the missing required advertising events with new ticker

Hi,

In the old ticker, conflicts are resolved on the following basis:

  1. New Start
    a. A new ticker_start instance has 1 ‘force’ value higher but not high enough to cancel existing tickers that need forced expire at their timeout (slave/master under supervision timeout)
  2. Advance booking (as you put it)
    a. Roles with higher periodic interval have higher chance compared to lower intervals as lower intervals get more chances with time space anyway. An advance booked role under supervision timeout (force = 1) would have higher chance to retain its timeout over a recently added ‘New start’ (force = 1).
  3. Round robin
    a. Conflicting with same timeout tick get placed after the existing ticker in the pending ticker list
  4. Fairness (not perfectly though)
    a. Conflicting with same timeout tick and that has higher lazy_current and lower force value than the occupied ticker get higher chances to be scheduled in.
  5. Absolute reservation (no ticker CPU use inside reserved region)
    a. In order to avoid latencies inside the reserved region, ticker code is restricted from executing inside a reserved time space.
    b. Ticker_worker executes at the boundary of the reserved regions and not inside.
    c. This is partially implemented in co-operation with the user of ticker API, in this case, the controller disables the mayfly executing the ticker_job functions.

BLE controller and Flash driver are users of ticker, wherein a reserved time space by flash driver halts CPU and conflict resolution could be delay as much as the reserved duration requested by the flash driver.

Regarding the definition of priority, for me it meant a fixed precedence set by ‘an’ authority (like a master or a single entity controller in the system). In the old controller there is no single master/entity module that maintains the states/scheduling of the roles (adv, scan, slave, master or flash driver or any new proprietary role that can get added at runtime). The conflict is resolved by contest, hence my difficult decision to use the term ‘force’ that’s applied just-in-time by the contesting roles (or ticker instances).

With the introduction of ‘priority’ (set by an authority), I will prefer we continue to use ‘force’ being now derived from ‘priority’ but try to have conflict resolution by contest.

Could this be the solution for the collision resolving issue? To set the default priority of ADV ticker node to -1?

No. We don’t have an authority entity in the controller yet, we will need one that will need to decide on the priority based on all active ticker_instances. Do we want to go this way?
Example, scanner with higher interval and advertiser with lower interval value, scanner instance will need priority -1 but then conflict resolution inside the advertisers event space causes Radio ISR latencies when ULL and LLL are at same IRQ priority.

Regards,
Vinayak

@mtpr-ot

This comment has been minimized.

Copy link
Collaborator

commented Jun 15, 2019

Regarding 1 "The scan_adv sample is missing required advertising events due to incorrect conflict resolution":
I agree that it would be difficult to handle use of the priority feature to have ADV in the adv_scan sample run as scheduled by the "legacy" ticker.
Instead I propose implementing the rule described above in 2.a "Roles with higher periodic interval have higher chance compared to lower intervals" by adding the rule to the conflict resolution. This would mean that conflict between nodes with equal priority, force and latency would be resolved using the interval (when periodic).

A quick test with this added shows identical behavior in adv_scan sample simulation for legacy- and new ticker.

Question: Should this feature be under some CONFIG_? That would allow users to chose legacy- or new behavior?

@mtpr-ot

This comment has been minimized.

Copy link
Collaborator

commented Jun 15, 2019

Regarding 2 "...any execution of ticker_worker inside an occupied region will introduce high Radio ISR latency":
I think this is quite managable to fix in ticker bottom half (ticker_job). The idea is that ticker nodes are checked for future expiry within active region. If such nodes are found we should:

  1. Remove the ticker node
  2. Add a periodic interval to the ticks_to_expire
  3. Re-insert ticker node
  4. If node is a "must expire", invoke callback with "would-have-been" ticks as ticks_at_expire

This means that jitter will be seen on the prepare callback due to some early "must_expire", however, this is not thought to be a problem.

@cvinayak

This comment has been minimized.

Copy link
Contributor Author

commented Jun 15, 2019

Regarding 1...

This would mean that conflict between nodes with equal priority, force and latency would be resolved using the interval (when periodic).

Question: Should this feature be under some CONFIG_? That would allow users to chose legacy- or new behavior?

CONFIG_ not required, we use the legacy behavior for legacy BLE states/roles (adv, scan, slave, master & flash driver) and priority be used for 'future' feature requirement

Regarding 2...

I think this is quite managable to fix in ticker bottom half (ticker_job). The idea is that ticker nodes are checked for future expiry within active region.

This appears to be like revert back of #15787 conflict resolution from ticker_worker to ticker_job, if that is what you mean, I am ok. Reason, having the old implementation and adding a simple "must expire"check in ticker_job and a "would-have-been" callback seems reasonable. I would like the ticker_worker to be very lightweight to ensure soft realtime event anchors have least latency due to ticker's own code.

@mtpr-ot

This comment has been minimized.

Copy link
Collaborator

commented Jun 15, 2019

This appears to be like revert back of #15787 conflict resolution from ticker_worker to ticker_job, if that is what you mean, I am ok. Reason, having the old implementation and adding a simple "must expire"check in ticker_job and a "would-have-been" callback seems reasonable. I would like the ticker_worker to be very lightweight to ensure soft realtime event anchors have least latency due to ticker's own code.

This would not be exactly like a revert of #15787. One reason for this is that the old implementation would potentially skip several intervals (in theory) to find a free region. I would prefer to only skip one interval per ticker_job (for each conflicting node). This way, you will only get one early "must expire" callback, and it will be easier to undo, if a ticker update opens a free slot (see next comment about dynamic audio slave timing).

@mtpr-ot

This comment has been minimized.

Copy link
Collaborator

commented Jun 15, 2019

Regarding audio timing requirements

In our audio implementation, we use a modified LLL BLE slave implementation, where after each successful RX of the primary packet, we add a 1 cycle latency to the ticker node (lazy_periodic = 1) to skip the retransmission, thereby freeing the next interval for other events. At the next prepare, we remove the latency (lazy_periodic = 0). By doing this, we can effectively implement optimal audio in LLL, without need for modifying ULL.
As soon as we start deferring events in the ticker node queue, this method stops working, unless we make it possible to "undo" a deferred event by pulling it back into the original spot. This is possible by looking at ticks_to_expire and lazy_current, however, it really complicates things.

So maybe we're at a cross road where we need to decide if we have to split the ticker functionality in two: A legacy config to make sure ticker_worker is not run during radio activity, and an "advanced" (or whatever) version favoring real time and optimal conflict resolution?

@cvinayak

This comment has been minimized.

Copy link
Contributor Author

commented Jun 17, 2019

I would prefer to only skip one interval per ticker_job (for each conflicting node).

Lets not skip one full interval, but :

Thinking aloud:

  1. Revert implementation to old behavior to skip ahead as before for non-must-expire ticker nodes
  2. For must-expire ticker nodes, conflicts be resolved late,
    a. ticker_job (ULL context) generates optional (ULL context, can be delegated to LLL context using mayfly as required) callback informing the resolution will be done after the winning ticker_node's preparation completes. The ticker node is inserted after the winning ticker_node (overlapping the reserved region).
    b. afterwards at the ticks_to_expire of the winning ticker's prepare callback is executed in ticker_worker (LLL), check for any conflicting must-expire ticker node that also expires in the same region. Let the two use the pre-empt concept using the is_abort_cb to resolve conflict.

This way if the winning ticker gets removed, then the must-expire ticker node gets the reserved region.

we add a 1 cycle latency to the ticker node (lazy_periodic = 1) to skip the retransmission, thereby freeing the next interval for other events. At the next prepare, we remove the latency (lazy_periodic = 0). By doing this, we can effectively implement optimal Audio in LLL, without need for modifying ULL.

Case 1: Periodic must-expire ticker node and normal ticker node of same interval overlap with must-expire expiring 'before' normal ticker node:

  • must-expire ticker node's prepare could decide to yield its region early (skip retransmission). i.e. update the timeout callback signature to return a reduced ticks_slot (can be a value of 0). Remember, this has to be in ticker_worker context, as timeout callbacks are inline function pointer call in ticker_worker.
  • ticker_job will then be able to dequeue the must-expire and set the next counter expiry to the now-not-overlapping normal ticker node.

Audio saves from using more CPU, i.e. calling ticker_update to ping-pong the lazy_periodic value.

Case 2: Periodic must-expire ticker node and normal ticker node of same interval overlap with must-expire expiring 'after' normal ticker node:

  • In this case the conflict resolution happens by the use of pre-empt concept, as explained in 2a above.
@pskhansen

This comment has been minimized.

Copy link
Collaborator

commented Jun 17, 2019

I do not think it is a good idea to involve LLL mechanisms in the basic scheduling rules for several reasons:

  1. The LLL is vendor specific.
  2. The LLL execution assumes the reservation slot includes LLL start+done margins, i.e., events are always placed having a minimum distance in time.

As I see it, we agree that the scheduler should have an option to support radios with critical IRQ latency, i.e.,

  1. This means that scheduler execution should be avoided during a reservation slot.
  2. This means that the scheduler should process all entries in time intervals that is known not to change scheduling decisions, i.e., when the head entry of the scheduler queue is forwarded to LLL for event execution. In this case, the reservation slot of the forwarded event CANNOT be compromised.

Thus, the loosing entries are moved out of the just forwarded reservation slot, however, the scheduling resolution is NOT done until the scheduler time/head reach them in a future time.

One remaining question is how the loosing entries are moved out of the just forwarded reservation slot, when this requires several connection intervals and the entries are of the type 'must-expire'. An easy solution would be to have counter to say how many times the 'must-expire' was skipped such that the 'must-expire' execution can take that into account when updating the event state etc.

aescolar added a commit that referenced this issue Jun 18, 2019

bluetooth: controller: Reverted revised ticker pending redesign
Reversed revised ticker implementation pending new design which resolves
the issues described in GH issues #16830.

Signed-off-by: Morten Priess <mtpr@oticon.com>
@aescolar

This comment has been minimized.

Copy link
Contributor

commented Jun 18, 2019

As this is not an issue in master anymore, I will close it so it does not count for the release. Let's have the discussion in another place.

@aescolar aescolar closed this Jun 18, 2019

alexanderwachter added a commit to alexanderwachter/zephyr that referenced this issue Jul 2, 2019

bluetooth: controller: Reverted revised ticker pending redesign
Reversed revised ticker implementation pending new design which resolves
the issues described in GH issues zephyrproject-rtos#16830.

Signed-off-by: Morten Priess <mtpr@oticon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.