[BUG][Zephyr] multiple-pipeline-capture-200.sh fails on ADLP-NOCODEC #5556
Comments
Some tidbits so far:
@kv2019i this must also include DMIC capture, no? Not just SSP capture? Interestingly, the Chrome team reported an issue on ADL with DMIC capture (with noise cancellation on core 1) and HDMI playback, with similar logs. Of course it doesn't use Zephyr.
@kv2019i can we try the following patch to show the IRQs enabled and the status? We may see the culprit...
diff --git a/src/lib/agent.c b/src/lib/agent.c
index 1aac47796..44753bb54 100644
--- a/src/lib/agent.c
+++ b/src/lib/agent.c
@@ -81,6 +81,9 @@ static enum task_state validate(void *data)
else
tr_warn(&sa_tr, "validate(), ll drift detected, delta = %u",
(unsigned int)delta);
+ tr_warn(&sa_tr, "irq enabled %x status %x",
+ arch_interrupt_get_enabled(),
+ arch_interrupt_get_status());
}
 /* update last_check to current */
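If the extra trace turns out to flood the log, a lower-cost variant could cache the previous mask and only warn when the set of enabled interrupts actually changes. A minimal sketch of that idea (not part of the patch above), assuming the same arch_interrupt_get_enabled()/arch_interrupt_get_status() helpers and the sa_tr trace context from agent.c:

```c
/* Sketch only: warn from the agent validate() path when the enabled-IRQ
 * mask changes between checks, instead of on every drift warning.
 */
static uint32_t prev_irq_enabled;

static void sa_check_irq_mask(void)
{
	uint32_t enabled = arch_interrupt_get_enabled();
	uint32_t status = arch_interrupt_get_status();

	if (enabled != prev_irq_enabled) {
		tr_warn(&sa_tr, "irq enabled changed %x -> %x, status %x",
			prev_irq_enabled, enabled, status);
		prev_irq_enabled = enabled;
	}
}
```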
@kv2019i I think we need to disable the agent (so we only have 1 IRQ user for the timer - no context sharing) and add some checks like the following (untested).
diff --git a/src/schedule/zephyr_domain.c b/src/schedule/zephyr_domain.c
index 6b9960268..1c38db6d7 100644
--- a/src/schedule/zephyr_domain.c
+++ b/src/schedule/zephyr_domain.c
@@ -101,6 +101,7 @@ static void zephyr_domain_timer_fn(struct k_timer *timer)
struct zephyr_domain *zephyr_domain = k_timer_user_data_get(timer);
uint64_t now = k_uptime_ticks();
int core;
+ int missed_irq_warn = 0;
if (!zephyr_domain)
return;
@@ -112,8 +113,13 @@ static void zephyr_domain_timer_fn(struct k_timer *timer)
* struct task::start for a strictly periodic Zephyr-based LL scheduler
* implementation, they will be removed after a short grace period.
*/
- while (zephyr_domain->ll_domain->next_tick <= now)
+ while (zephyr_domain->ll_domain->next_tick <= now) {
zephyr_domain->ll_domain->next_tick += LL_TIMER_PERIOD_TICKS;
+ if (!missed_irq_warn++)
+ tr_err(&ll_tr, "missed IRQ now %ld next %ld delta %ld", now,
+ zephyr_domain->ll_domain->next_tick,
+ now - zephyr_domain->ll_domain->next_tick);
+ }
for (core = 0; core < CONFIG_CORE_COUNT; core++) {
struct zephyr_domain_thread *dt = zephyr_domain->domain_thread + core;
diff --git a/src/schedule/zephyr_ll.c b/src/schedule/zephyr_ll.c
index 5af538767..1c19330da 100644
--- a/src/schedule/zephyr_ll.c
+++ b/src/schedule/zephyr_ll.c
@@ -21,12 +21,18 @@ DECLARE_SOF_UUID("zll-schedule", zll_sched_uuid, 0x1547fe68, 0xde0c, 0x11eb,
DECLARE_TR_CTX(ll_tr, SOF_UUID(zll_sched_uuid), LOG_LEVEL_INFO);
+/* headroom for scheduling, warning if 10uS late. */
+/* TODO: use Zephyr timing macros instead of hard-coded clocks */
+#define HEADROOM_CYCLES (38400 / 100)
+
/* per-scheduler data */
struct zephyr_ll {
struct list_item tasks; /* list of ll tasks */
unsigned int n_tasks; /* task counter */
struct ll_schedule_domain *ll_domain; /* scheduling domain */
unsigned int core; /* core ID of this instance */
+ unsigned int period_cycles; /* period of LL tick in cycles */
+ unsigned int last_tick_cycles; /* last tick cycles */
};
/* per-task scheduler data */
@@ -174,6 +180,18 @@ static void zephyr_ll_run(void *data)
struct task *task;
struct list_item *list;
uint32_t flags;
+ unsigned int cycles_now = k_cycle_get_32();
+ unsigned int cycles_target = sch->last_tick_cycles + sch->period_cycles
+ + HEADROOM_CYCLES;
+
+ /* are we scheduled on time ? if not lets report it as we could be
+ * blocked by other higher priority threads or IRQs.
+ */
+ if (cycles_now > cycles_target) {
+ tr_err(&ll_tr, "LL scheduling %d cycles late",
+ cycles_now - cycles_target);
+ }
+ sch->last_tick_cycles = cycles_now;
zephyr_ll_lock(sch, &flags);
@@ -343,6 +361,8 @@ static int zephyr_ll_task_schedule_common(struct zephyr_ll *sch, struct task *ta
zephyr_ll_unlock(sch, &flags);
+	//TODO: is this the best place to initially set last_tick_cycles?
+ sch->last_tick_cycles = k_cycle_get_32();
ret = domain_register(sch->ll_domain, task, &zephyr_ll_run, sch);
if (ret < 0)
tr_err(&ll_tr, "zephyr_ll_task_schedule: cannot register domain %d",
@@ -534,6 +554,9 @@ int zephyr_ll_scheduler_init(struct ll_schedule_domain *domain)
sch->ll_domain = domain;
sch->core = cpu_get_id();
+ // TODO: get this from topology and use Zephyr API to calc cycles.
+ sch->period_cycles = 38400;
+
scheduler_init(domain->type, &z
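As noted in the follow-up below, a hard-coded 10 us headroom fires too often; one option is to derive the warning threshold from the tick period itself once period_cycles is known. A rough sketch building on the (untested) patch above, with the 10% figure taken from the later comment:

```c
/* Sketch: derive the late-warning threshold from the LL tick period
 * (e.g. 10% of the period) instead of a hard-coded cycle count.
 * Assumes sch->period_cycles is set as in the patch above.
 */
#define LL_LATE_WARN_PERCENT	10

static inline unsigned int ll_headroom_cycles(const struct zephyr_ll *sch)
{
	return sch->period_cycles * LL_LATE_WARN_PERCENT / 100;
}

/* in zephyr_ll_run():
 *   cycles_target = sch->last_tick_cycles + sch->period_cycles +
 *                   ll_headroom_cycles(sch);
 */
```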
The timer tick does not seem to be systematically late at all. There is definitely some minor jitter in waking up the per-core threads (this has increased lately and was less severe before the spinlock change), but when I've traced what happens on the n+1 cycle after the agent notices the jitter, we always catch up. So it could be that the actual failure is caused by too much "agent traffic", after which some DMA runs out of data. Like here:
So in summary, we probably need to try removing the agent. It's adding to the problem here, and if we have load issues, the DSP load traces will catch that.
The system agent is not required with SOF Zephyr builds as the Zephyr ll scheduler implementation can track DSP load and print periodic status of the load and any observed overruns in scheduling. Disabling agent is beneficial as agent can create a lot of DMA traffic when DMA trace is enabled. This happens if there is jitter in agent execution, with delta slightly over the warning threshold. This can itself worsen the scheduling variation and lead to actual problems. BugLink: #5556 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
#5557 did not help. Logs are less cluttered, but there's still a failure in configuring the SSP1 DAI (on core 3).
@lgirdwood Some log data with your patches. First the IRQs. The status doesn't change during any tests when measured at the agent. Probably if some IRQ is blocked, that would have to be observed outside the agent:
[ 25633407.734375] ( 9.739583) c0 sa src/lib/agent.c:87 WARN irq enabled 10040 status 222
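Those "enabled"/"status" values are plain bitmasks of interrupt lines. For reference, a throwaway host-side helper (not SOF code) can list which bit positions are set in the values from the trace above:

```c
/* Throwaway helper (host side): print the set bit positions of an IRQ mask
 * such as the "enabled"/"status" values in the agent trace.
 */
#include <stdint.h>
#include <stdio.h>

static void print_irq_bits(const char *name, uint32_t mask)
{
	printf("%s 0x%x:", name, mask);
	for (int bit = 0; bit < 32; bit++)
		if (mask & (1u << bit))
			printf(" %d", bit);
	printf("\n");
}

int main(void)
{
	print_irq_bits("enabled", 0x10040);	/* values from the trace above */
	print_irq_bits("status", 0x222);
	return 0;
}
```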
The missed IRQ trace didn't provide useful output. next_tick is always advanced at least once per timer callback, so the trace fires on every tick and floods the terminal. Instead I modified the "LL scheduler" trace a bit. With 10 us headroom, this fires too often and affects test execution itself, but with 10% headroom and a slight modification to print "x cycles late" without the headroom included, the results are as follows:
All cores are loaded, but the delay is not reported on all of them. Still, the latency spikes happen at a very slow pace, here hundreds of milliseconds apart. Here's another log excerpt showing warnings from all cores, including avg/max data from the audio tasks (on different cores). So we get occasional variation in the scheduling invocation, but the audio processing load is not high and no peaks are seen:
I modified the trace a bit to print out the LL-to-LL delay on the N+1 iteration after a delayed invocation. As expected (with a fixed timer IRQ running the system), we systematically catch up on the next iteration:
And a second piece:
There is definitely some unexpected variation, but we stay within the 15% budget all the time, and no peaks in audio processing are reported that would exceed the budget and cause a problem even for a single scheduler run. We can probably add similar lower-cost tracking of the IRQ wakeups to get more data on this, without spamming the DMA trace, e.g. track avg+max of the IRQ wakeup delay.
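A cheap way to do that avg/max tracking is to accumulate the wakeup delay in the scheduler and only emit a trace once per reporting window instead of per tick. A sketch of the idea, reusing the ll_tr context from zephyr_ll.c (the struct and macro names are illustrative, not from the actual code):

```c
/* Sketch: accumulate scheduler wakeup delay and report avg/max once per
 * reporting window instead of on every late tick.
 */
#define LL_DELAY_REPORT_TICKS	1000	/* e.g. once per second at a 1 ms tick */

struct ll_delay_stats {
	unsigned int count;
	unsigned int sum;	/* cycles */
	unsigned int max;	/* cycles */
};

static void ll_delay_update(struct ll_delay_stats *st, unsigned int delay)
{
	st->sum += delay;
	if (delay > st->max)
		st->max = delay;

	if (++st->count >= LL_DELAY_REPORT_TICKS) {
		tr_info(&ll_tr, "ll wakeup delay avg %u max %u cycles",
			st->sum / st->count, st->max);
		st->count = 0;
		st->sum = 0;
		st->max = 0;
	}
}
```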
Managed to reproduce on one of the CI machines now. There's a Zephyr OS panic. Debugging it now. |
Status update:
Bisect shows the spinlock change (#5286) makes this problem easier to reproduce.
Another update:
There's a draft workaround for the Zephyr timer reliability issue, but unfortunately the error still happens with that workaround in place. There are still unexplained spikes in timer invocation, so there can still be a connection, but at least the error in the timer average time does not explain the multiple-pipeline-capture failure.
@kv2019i Could it be IRQs OFF in locks?
I'll resume the IRQ timing tomorrow. I spent today with DW-DMA. The variety of symptoms still leaves me puzzled. Sometimes the error happens at state transitions (either a Zephyr OS panic via an invalid load, or a timed-out IPC), but often the main failure is an xrun in stable state. There are four streams running on four cores, all using DW-DMA. Traces like these are very common:
[ 374512296.472385] ( 18.750000) c2 ssp-dai 1.0 /drivers/intel/ssp/ssp.c:1001 INFO ssp_start() OUT
Note: sof-logger was run with a 19.2 MHz clock reference, so timestamps are off by 2x in the output. The last trace is a local modification that prints the "avail bytes" value in the three last invocations of get_data_size(). So previous iterations have been normal, but suddenly in the middle of stable streaming state, the DW-DMA goes crazy (avail
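The "avail bytes" history mentioned above can be kept with a tiny ring buffer around the get_data_size() path and dumped only when an implausible value shows up. This is a sketch of that kind of local modification only; the function, variable, and trace-context names here are illustrative, not the actual DW-DMA driver code:

```c
/* Sketch: remember the last three "avail" values seen on the DW-DMA
 * get_data_size() path and dump them when an implausible value appears
 * (e.g. avail larger than the whole buffer).
 */
#define AVAIL_HISTORY	3

static uint32_t avail_hist[AVAIL_HISTORY];
static unsigned int avail_idx;

static void dw_dma_track_avail(uint32_t avail, uint32_t buffer_bytes)
{
	avail_hist[avail_idx % AVAIL_HISTORY] = avail;
	avail_idx++;

	if (avail > buffer_bytes)
		tr_err(&dma_tr, "bad avail %u (buffer %u), history %u %u %u",
		       avail, buffer_bytes, avail_hist[0], avail_hist[1],
		       avail_hist[2]);
}
```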
Status update:
Status update:
Allocate "struct zephyr_ll_pdata" in shared/coherent memory as it embeds a "struct k_sem" object. Zephyr kernel code assumes the object to be in cache coherent memory, so incorrect operation may result if condition is not met. Long test runs of all-core capture stress test on Intel cAVS2.5 platform show failures that are fixed with this change. Discovered via runtime assert in zephyr/kernel/sched.c:pend() that is hit without this patch. BugLink: thesofproject#5556 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Performed a bisect of sorts on all current fixes and it turns out the k_sem coherency fix is the key change. The others are probably valid fixes, but not required for this bug. Test details at #5588
Allocate "struct zephyr_ll_pdata" in shared/coherent memory as it embeds a "struct k_sem" object. Zephyr kernel code assumes the object to be in cache coherent memory, so incorrect operation may result if condition is not met. Long test runs of all-core capture stress test on Intel cAVS2.5 platform show failures that are fixed with this change. Discovered via runtime assert in zephyr/kernel/sched.c:pend() that is hit without this patch. BugLink: #5556 Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
No errors in Intel daily plan 11433, so potentially fixed by the fix to #5556. |
Not seen in Intel daily testing for multiple days, closing. |
Describe the bug
To Reproduce
TPLG=/lib/firmware/intel/sof-tplg/sof-adl-nocodec.tplg MODEL=ADLP_RVP_NOCODEC_ZEPHYR ~/sof-test/test-case/multiple-pipeline.sh -f c -l 200
Reproduction Rate
100% (with 200 iterations)
Expected behavior
Test passes
Impact
Recording fails.
Environment
Screenshots or console output