[BUG][Zephyr] multiple-pipeline-capture-200.sh fails on ADLP-NOCODEC #5556
Comments
Some tidbits so far:
@kv2019i this must also include DMIC capture, no? Not just SSP capture? Interestingly, the Chrome team reported an issue on ADL with DMIC capture (with noise cancellation on core 1) and HDMI playback, with similar logs. Of course it doesn't use Zephyr.
@kv2019i can we try the following patch to show the IRQs enabled and the status? We may see the culprit...
diff --git a/src/lib/agent.c b/src/lib/agent.c
index 1aac47796..44753bb54 100644
--- a/src/lib/agent.c
+++ b/src/lib/agent.c
@@ -81,6 +81,9 @@ static enum task_state validate(void *data)
else
tr_warn(&sa_tr, "validate(), ll drift detected, delta = %u",
(unsigned int)delta);
+ tr_warn(&sa_tr, "irq enabled %x status %x",
+ arch_interrupt_get_enabled(),
+ arch_interrupt_get_status());
}
 /* update last_check to current */
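If the extra trace turns out to flood the log, a lower-cost variant could cache the previous mask and only warn when the set of enabled interrupts actually changes. A minimal sketch of that idea (not part of the patch above), assuming the same arch_interrupt_get_enabled()/arch_interrupt_get_status() helpers and the sa_tr trace context from agent.c:

```c
/* Sketch only: warn from the agent validate() path when the enabled-IRQ
 * mask changes between checks, instead of on every drift warning.
 */
static uint32_t prev_irq_enabled;

static void sa_check_irq_mask(void)
{
	uint32_t enabled = arch_interrupt_get_enabled();
	uint32_t status = arch_interrupt_get_status();

	if (enabled != prev_irq_enabled) {
		tr_warn(&sa_tr, "irq enabled changed %x -> %x, status %x",
			prev_irq_enabled, enabled, status);
		prev_irq_enabled = enabled;
	}
}
```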
@kv2019i I think we need to disable the agent (so we only have 1 IRQ user for the timer - no context sharing) and add some checks like the following (untested).
diff --git a/src/schedule/zephyr_domain.c b/src/schedule/zephyr_domain.c
index 6b9960268..1c38db6d7 100644
--- a/src/schedule/zephyr_domain.c
+++ b/src/schedule/zephyr_domain.c
@@ -101,6 +101,7 @@ static void zephyr_domain_timer_fn(struct k_timer *timer)
struct zephyr_domain *zephyr_domain = k_timer_user_data_get(timer);
uint64_t now = k_uptime_ticks();
int core;
+ int missed_irq_warn = 0;
if (!zephyr_domain)
return;
@@ -112,8 +113,13 @@ static void zephyr_domain_timer_fn(struct k_timer *timer)
* struct task::start for a strictly periodic Zephyr-based LL scheduler
* implementation, they will be removed after a short grace period.
*/
- while (zephyr_domain->ll_domain->next_tick <= now)
+ while (zephyr_domain->ll_domain->next_tick <= now) {
zephyr_domain->ll_domain->next_tick += LL_TIMER_PERIOD_TICKS;
+ if (!missed_irq_warn++)
+ tr_err(&ll_tr, "missed IRQ now %ld next %ld delta %ld", now,
+ zephyr_domain->ll_domain->next_tick,
+ now - zephyr_domain->ll_domain->next_tick);
+ }
for (core = 0; core < CONFIG_CORE_COUNT; core++) {
struct zephyr_domain_thread *dt = zephyr_domain->domain_thread + core;
diff --git a/src/schedule/zephyr_ll.c b/src/schedule/zephyr_ll.c
index 5af538767..1c19330da 100644
--- a/src/schedule/zephyr_ll.c
+++ b/src/schedule/zephyr_ll.c
@@ -21,12 +21,18 @@ DECLARE_SOF_UUID("zll-schedule", zll_sched_uuid, 0x1547fe68, 0xde0c, 0x11eb,
DECLARE_TR_CTX(ll_tr, SOF_UUID(zll_sched_uuid), LOG_LEVEL_INFO);
+/* headroom for scheduling, warning if 10uS late. */
+/* TODO: use Zephyr timing macros instead of hard-coded clocks */
+#define HEADROOM_CYCLES (38400 / 100)
+
/* per-scheduler data */
struct zephyr_ll {
struct list_item tasks; /* list of ll tasks */
unsigned int n_tasks; /* task counter */
struct ll_schedule_domain *ll_domain; /* scheduling domain */
unsigned int core; /* core ID of this instance */
+ unsigned int period_cycles; /* period of LL tick in cycles */
+ unsigned int last_tick_cycles; /* last tick cycles */
};
/* per-task scheduler data */
@@ -174,6 +180,18 @@ static void zephyr_ll_run(void *data)
struct task *task;
struct list_item *list;
uint32_t flags;
+ unsigned int cycles_now = k_cycle_get_32();
+ unsigned int cycles_target = sch->last_tick_cycles + sch->period_cycles
+ + HEADROOM_CYCLES;
+
+ /* are we scheduled on time ? if not lets report it as we could be
+ * blocked by other higher priority threads or IRQs.
+ */
+ if (cycles_now > cycles_target) {
+ tr_err(&ll_tr, "LL scheduling %d cycles late",
+ cycles_now - cycles_target);
+ }
+ sch->last_tick_cycles = cycles_now;
zephyr_ll_lock(sch, &flags);
@@ -343,6 +361,8 @@ static int zephyr_ll_task_schedule_common(struct zephyr_ll *sch, struct task *ta
zephyr_ll_unlock(sch, &flags);
+	//TODO: is this the best place to initially set last_tick_cycles?
+ sch->last_tick_cycles = k_cycle_get_32();
ret = domain_register(sch->ll_domain, task, &zephyr_ll_run, sch);
if (ret < 0)
tr_err(&ll_tr, "zephyr_ll_task_schedule: cannot register domain %d",
@@ -534,6 +554,9 @@ int zephyr_ll_scheduler_init(struct ll_schedule_domain *domain)
sch->ll_domain = domain;
sch->core = cpu_get_id();
+ // TODO: get this from topology and use Zephyr API to calc cycles.
+ sch->period_cycles = 38400;
+
scheduler_init(domain->type, &z
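As noted in the follow-up below, a hard-coded 10 us headroom fires too often; one option is to derive the warning threshold from the tick period itself once period_cycles is known. A rough sketch building on the (untested) patch above, with the 10% figure taken from the later comment:

```c
/* Sketch: derive the late-warning threshold from the LL tick period
 * (e.g. 10% of the period) instead of a hard-coded cycle count.
 * Assumes sch->period_cycles is set as in the patch above.
 */
#define LL_LATE_WARN_PERCENT	10

static inline unsigned int ll_headroom_cycles(const struct zephyr_ll *sch)
{
	return sch->period_cycles * LL_LATE_WARN_PERCENT / 100;
}

/* in zephyr_ll_run():
 *   cycles_target = sch->last_tick_cycles + sch->period_cycles +
 *                   ll_headroom_cycles(sch);
 */
```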
The timer tick does not seem to be systematically late at all. There is definitely some minor jitter in waking up the per-core threads (this has increased lately and was less severe before the spinlock change), but when I've traced what happens on the n+1 cycle after the agent notices the jitter, we always catch up. So it could be that the actual failure is caused by too much "agent traffic", after which some DMA runs out of data. Like here:
So in summary, we probably need to try removing the agent. It's adding to the problem here, and if we have load issues, the DSP load traces will catch that.
The system agent is not required with SOF Zephyr builds as the Zephyr ll scheduler implementation can track DSP load and print periodic status of the load and any observed overruns in scheduling. Disabling agent is beneficial as agent can create a lot of DMA traffic when DMA trace is enabled. This happens if there is jitter in agent execution, with delta slightly over the warning threshold. This can itself worsen the scheduling variation and lead to actual problems. BugLink: #5556 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
#5557 did not help. Logs are less cluttered, but there's still a failure in configuring the SSP1 DAI (on core 3).
@lgirdwood Some log data with your patches. First the IRQs. The status doesn't change during any tests when measured at the agent. Probably if some IRQ is blocked, that would have to be observed outside the agent:
[ 25633407.734375] ( 9.739583) c0 sa src/lib/agent.c:87 WARN irq enabled 10040 status 222
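Those "enabled"/"status" values are plain bitmasks of interrupt lines. For reference, a throwaway host-side helper (not SOF code) can list which bit positions are set in the values from the trace above:

```c
/* Throwaway helper (host side): print the set bit positions of an IRQ mask
 * such as the "enabled"/"status" values in the agent trace.
 */
#include <stdint.h>
#include <stdio.h>

static void print_irq_bits(const char *name, uint32_t mask)
{
	printf("%s 0x%x:", name, mask);
	for (int bit = 0; bit < 32; bit++)
		if (mask & (1u << bit))
			printf(" %d", bit);
	printf("\n");
}

int main(void)
{
	print_irq_bits("enabled", 0x10040);	/* values from the trace above */
	print_irq_bits("status", 0x222);
	return 0;
}
```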
The missed IRQ trace didn't provide useful output. next_tick is always advanced at least once per timer callback, so the trace fires on every tick and floods the terminal. Instead I modified the "LL scheduler" trace a bit. With 10 us headroom, this fires too often and affects test execution itself, but with 10% headroom and a slight modification to print "x cycles late" without the headroom included, the results are as follows:
All cores are loaded, but the delay is not reported on all of them. Still, the latency spikes happen at a very slow pace, here hundreds of milliseconds apart. Here's another log excerpt showing warnings from all cores, including avg/max data from the audio tasks (on different cores). So we get occasional variation in the scheduling invocation, but the audio processing load is not high and no peaks are seen:
I modified the trace a bit to print out the LL-to-LL delay on the N+1 iteration after a delayed invocation. As expected (with a fixed timer IRQ running the system), we systematically catch up on the next iteration:
And a second piece:
There is definitely some unexpected variation, but we stay within the 15% budget all the time, and no peaks in audio processing are reported that would exceed the budget and cause a problem even for a single scheduler run. We can probably add similar lower-cost tracking of the IRQ wakeups to get more data on this, without spamming the DMA trace, e.g. track avg+max of the IRQ wakeup delay.
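A cheap way to do that avg/max tracking is to accumulate the wakeup delay in the scheduler and only emit a trace once per reporting window instead of per tick. A sketch of the idea, reusing the ll_tr context from zephyr_ll.c (the struct and macro names are illustrative, not from the actual code):

```c
/* Sketch: accumulate scheduler wakeup delay and report avg/max once per
 * reporting window instead of on every late tick.
 */
#define LL_DELAY_REPORT_TICKS	1000	/* e.g. once per second at a 1 ms tick */

struct ll_delay_stats {
	unsigned int count;
	unsigned int sum;	/* cycles */
	unsigned int max;	/* cycles */
};

static void ll_delay_update(struct ll_delay_stats *st, unsigned int delay)
{
	st->sum += delay;
	if (delay > st->max)
		st->max = delay;

	if (++st->count >= LL_DELAY_REPORT_TICKS) {
		tr_info(&ll_tr, "ll wakeup delay avg %u max %u cycles",
			st->sum / st->count, st->max);
		st->count = 0;
		st->sum = 0;
		st->max = 0;
	}
}
```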
Managed to reproduce on one of the CI machines now. There's a Zephyr OS panic. Debugging it now. |
Status update:
Bisect shows the spinlock change (#5286) makes this problem easier to reproduce.
Another update:
There's a draft workaround for the Zephyr timer reliability issue, but unfortunately the error still happens with that workaround in place. There are still unexplained spikes in timer invocation, so there can still be a connection, but at least the error in the timer average time does not explain the multiple-pipeline-capture failure.
@kv2019i Could it be IRQs OFF in locks?
I'll resume the IRQ timing tomorrow. I spent today with DW-DMA. The variety of symptoms still leaves me puzzled. Sometimes the error happens at state transitions (either a Zephyr OS panic via an invalid load, or a timed-out IPC), but often the main failure is an xrun in stable state. There are four streams running on four cores, all using DW-DMA. Traces like these are very common:
[ 374512296.472385] ( 18.750000) c2 ssp-dai 1.0 /drivers/intel/ssp/ssp.c:1001 INFO ssp_start() OUT
Note: sof-logger was run with a 19.2 MHz clock reference, so timestamps are off by 2x in the output. The last trace is a local modification that prints the "avail bytes" value in the three last invocations of get_data_size(). So previous iterations have been normal, but suddenly in the middle of stable streaming state, the DW-DMA goes crazy (avail
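The "avail bytes" history mentioned above can be kept with a tiny ring buffer around the get_data_size() path and dumped only when an implausible value shows up. This is a sketch of that kind of local modification only; the function, variable, and trace-context names here are illustrative, not the actual DW-DMA driver code:

```c
/* Sketch: remember the last three "avail" values seen on the DW-DMA
 * get_data_size() path and dump them when an implausible value appears
 * (e.g. avail larger than the whole buffer).
 */
#define AVAIL_HISTORY	3

static uint32_t avail_hist[AVAIL_HISTORY];
static unsigned int avail_idx;

static void dw_dma_track_avail(uint32_t avail, uint32_t buffer_bytes)
{
	avail_hist[avail_idx % AVAIL_HISTORY] = avail;
	avail_idx++;

	if (avail > buffer_bytes)
		tr_err(&dma_tr, "bad avail %u (buffer %u), history %u %u %u",
		       avail, buffer_bytes, avail_hist[0], avail_hist[1],
		       avail_hist[2]);
}
```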
Status update:
Status update:
Allocate "struct zephyr_ll_pdata" in shared/coherent memory as it embeds a "struct k_sem" object. Zephyr kernel code assumes the object to be in cache coherent memory, so incorrect operation may result if condition is not met. Long test runs of all-core capture stress test on Intel cAVS2.5 platform show failures that are fixed with this change. Discovered via runtime assert in zephyr/kernel/sched.c:pend() that is hit without this patch. BugLink: thesofproject#5556 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Performed a bisect of sorts on all current fixes and it turns out the k_sem coherency fix is the key change. The others are probably valid fixes, but not required for this bug. Test details at #5588
Allocate "struct zephyr_ll_pdata" in shared/coherent memory as it embeds a "struct k_sem" object. Zephyr kernel code assumes the object to be in cache coherent memory, so incorrect operation may result if condition is not met. Long test runs of all-core capture stress test on Intel cAVS2.5 platform show failures that are fixed with this change. Discovered via runtime assert in zephyr/kernel/sched.c:pend() that is hit without this patch. BugLink: #5556 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
No errors in Intel daily plan 11433, so potentially fixed by the fix to #5556. |
Not seen in Intel daily testing for multiple days, closing. |
Describe the bug
To Reproduce
TPLG=/lib/firmware/intel/sof-tplg/sof-adl-nocodec.tplg MODEL=ADLP_RVP_NOCODEC_ZEPHYR ~/sof-test/test-case/multiple-pipeline.sh -f c -l 200
Reproduction Rate
100% (with 200 iterations)
Expected behavior
Test passes
Impact
Recording fails.
Environment
Screenshots or console output