[BUG][Zephyr] multiple-pipeline-capture-200.sh fails on ADLP-NOCODEC #5556

Closed
kv2019i opened this issue Mar 16, 2022 · 19 comments
Assignee: kv2019i
Labels: ADL (Applies to Alder Lake platform), bug (Something isn't working as expected), P1 (Blocker bugs or important features), TGL (Applies to Tiger Lake), Zephyr (Issues only observed with Zephyr integrated)
Milestone: v2.1

@kv2019i
Collaborator

kv2019i commented Mar 16, 2022

Describe the bug

To Reproduce
TPLG=/lib/firmware/intel/sof-tplg/sof-adl-nocodec.tplg MODEL=ADLP_RVP_NOCODEC_ZEPHYR ~/sof-test/test-case/multiple-pipeline.sh -f c -l 200

Reproduction Rate
100% (with 200 iterations)

Expected behavior
Test passes

Impact
Recording fails.

Environment

  1. Branch: sof-2.1 stable branch, SOF + Zephyr
  2. Topology: sof-cavs-nocodec
  3. Platform: ADLP, TGLP

Screenshots or console output

[       52109.633346] (          22.187500) c2 ssp-dai      1.0   /drivers/intel/ssp/ssp.c:962  INFO ssp_early_start(): SSE set for SSP0
[       52253.227090] (         143.593750) c2 ssp-dai      1.0   /drivers/intel/ssp/ssp.c:1079 INFO ssp_trigger() cmd 1
[       52273.331256] (          20.104166) c2 ssp-dai      1.0   /drivers/intel/ssp/ssp.c:977  INFO ssp_start()
[       80064.788485] (       27791.457031) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40517
[       86218.694491] (        6153.905762) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40763
[      136613.588321] (       50394.894531) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40503
[      148851.712835] (       12238.125000) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 41096
[      195152.388079] (       46300.675781) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40399
[      197192.440081] (        2040.052002) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40367
[      213275.356109] (       16082.916016) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40700
[      221436.084951] (        8160.729004) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 41717
[      229411.605467] (        7975.520508) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40619
kv2019i added the bug, Zephyr, TGL, and ADL labels on Mar 16, 2022
@kv2019i kv2019i self-assigned this Mar 16, 2022
@kv2019i
Collaborator Author

kv2019i commented Mar 16, 2022

Some tidbits so far:

@ranj063
Collaborator

ranj063 commented Mar 16, 2022

@kv2019i this must also include DMIC capture, no? Not just SSP capture? Interestingly, the Chrome team reported an issue on ADL with DMIC capture (with noise cancellation on core 1) and HDMI playback with similar logs. Of course, it doesn't use Zephyr.

@lgirdwood lgirdwood added this to the v2.1 milestone Mar 16, 2022
@lgirdwood
Member

@kv2019i can we try the following patch to show the IRQs enabled and their status? We may see the culprit...

diff --git a/src/lib/agent.c b/src/lib/agent.c
index 1aac47796..44753bb54 100644
--- a/src/lib/agent.c
+++ b/src/lib/agent.c
@@ -81,6 +81,9 @@ static enum task_state validate(void *data)
                else
                        tr_warn(&sa_tr, "validate(), ll drift detected, delta = %u",
                                (unsigned int)delta);
+               tr_warn(&sa_tr, "irq enabled %x status %x",
+                       arch_interrupt_get_enabled(),
+                       arch_interrupt_get_status());
        }
 
        /* update last_check to current */

@lgirdwood
Member

@kv2019i I think we need to disable the agent (so we only have one IRQ user for the timer and no context sharing) and add some checks like the following (untested).

diff --git a/src/schedule/zephyr_domain.c b/src/schedule/zephyr_domain.c
index 6b9960268..1c38db6d7 100644
--- a/src/schedule/zephyr_domain.c
+++ b/src/schedule/zephyr_domain.c
@@ -101,6 +101,7 @@ static void zephyr_domain_timer_fn(struct k_timer *timer)
 	struct zephyr_domain *zephyr_domain = k_timer_user_data_get(timer);
 	uint64_t now = k_uptime_ticks();
 	int core;
+	int missed_irq_warn = 0;
 
 	if (!zephyr_domain)
 		return;
@@ -112,8 +113,13 @@ static void zephyr_domain_timer_fn(struct k_timer *timer)
 	 * struct task::start for a strictly periodic Zephyr-based LL scheduler
 	 * implementation, they will be removed after a short grace period.
 	 */
-	while (zephyr_domain->ll_domain->next_tick <= now)
+	while (zephyr_domain->ll_domain->next_tick <= now) {
 		zephyr_domain->ll_domain->next_tick += LL_TIMER_PERIOD_TICKS;
+		if (!missed_irq_warn++)
+			tr_err(&ll_tr, "missed IRQ now %ld next %ld delta %ld", now,
+				zephyr_domain->ll_domain->next_tick,
+				now - zephyr_domain->ll_domain->next_tick);
+	}
 
 	for (core = 0; core < CONFIG_CORE_COUNT; core++) {
 		struct zephyr_domain_thread *dt = zephyr_domain->domain_thread + core;
diff --git a/src/schedule/zephyr_ll.c b/src/schedule/zephyr_ll.c
index 5af538767..1c19330da 100644
--- a/src/schedule/zephyr_ll.c
+++ b/src/schedule/zephyr_ll.c
@@ -21,12 +21,18 @@ DECLARE_SOF_UUID("zll-schedule", zll_sched_uuid, 0x1547fe68, 0xde0c, 0x11eb,
 
 DECLARE_TR_CTX(ll_tr, SOF_UUID(zll_sched_uuid), LOG_LEVEL_INFO);
 
+/* headroom for scheduling, warning if 10uS late. */
+/* TODO: use Zephyr timing macros instead of hard-coded clocks */
+#define HEADROOM_CYCLES		(38400 / 100)
+
 /* per-scheduler data */
 struct zephyr_ll {
 	struct list_item tasks;			/* list of ll tasks */
 	unsigned int n_tasks;			/* task counter */
 	struct ll_schedule_domain *ll_domain;	/* scheduling domain */
 	unsigned int core;			/* core ID of this instance */
+	unsigned int period_cycles;		/* period of LL tick in cycles */
+	unsigned int last_tick_cycles;		/* last tick cycles */
 };
 
 /* per-task scheduler data */
@@ -174,6 +180,18 @@ static void zephyr_ll_run(void *data)
 	struct task *task;
 	struct list_item *list;
 	uint32_t flags;
+	unsigned int cycles_now = k_cycle_get_32();
+	unsigned int cycles_target = sch->last_tick_cycles + sch->period_cycles
+				   + HEADROOM_CYCLES;
+
+	/* are we scheduled on time ? if not lets report it as we could be
+	 * blocked by other higher priority threads or IRQs.
+	 */
+	if (cycles_now > cycles_target) {
+		tr_err(&ll_tr, "LL scheduling %d cycles late",
+			cycles_now - cycles_target);
+	}
+	sch->last_tick_cycles = cycles_now;
 
 	zephyr_ll_lock(sch, &flags);
 
@@ -343,6 +361,8 @@ static int zephyr_ll_task_schedule_common(struct zephyr_ll *sch, struct task *ta
 
 	zephyr_ll_unlock(sch, &flags);
 
+	//TODO: is this the best place to initially set last_tick_cycles?
+	sch->last_tick_cycles = k_cycle_get_32();
 	ret = domain_register(sch->ll_domain, task, &zephyr_ll_run, sch);
 	if (ret < 0)
 		tr_err(&ll_tr, "zephyr_ll_task_schedule: cannot register domain %d",
@@ -534,6 +554,9 @@ int zephyr_ll_scheduler_init(struct ll_schedule_domain *domain)
 	sch->ll_domain = domain;
 	sch->core = cpu_get_id();
 
+	// TODO: get this from topology and use Zephyr API to calc cycles.
+	sch->period_cycles = 38400;
+
 	scheduler_init(domain->type, &z

@kv2019i
Collaborator Author

kv2019i commented Mar 16, 2022

The timer tick does not seem to be systematically late at all. There is definitely some minor jitter in waking up the per-core threads (and this has increased lately; it was less severe before the spinlock change), but when I've traced what happens on the N+1 cycle after the agent notices the jitter, we always catch up.

So it could be that the actual failure is caused by too much "agent traffic", after which some DMA runs out of data. Like here:

[   619607122.097775] (        6040.729004) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40708
[   619613215.587116] (        6093.489258) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40687
[   619617305.222370] (        4089.635254) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 41312
[   619627351.888638] (       10046.666016) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40494
[   619627546.888630] (         194.999985) c3 host         4.18          src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 384 free in DMA
[   619629445.846888] (        1898.958252) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40603

[   619635556.054978] (          40.677082) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 40688
[   619636555.065356] (         213.541656) c2 pipe         2.11  ....../pipeline-stream.c:261  INFO pipe trigger cmd 6
[   619636745.638265] (         102.760414) c0 pipe         21.71 ....../pipeline-stream.c:261  INFO pipe trigger cmd 6
[   619637490.481985] (         265.833313) c2 zll-schedule       src/schedule/zephyr_ll.c:67   INFO task complete 0xbe08c000 pipe-task
[   619635515.377897] (        6069.531250) c2 dw-dma                 src/drivers/dw/dma.c:1128 ERROR dw_dma_get_data_size(): xrun detected
[   619635590.794560] (          34.739582) c0 dw-dma                 src/drivers/dw/dma.c:1128 ERROR dw_dma_get_data_size(): xrun detected
[   619635710.065389] (         119.270828) c2 dai          2.10           src/audio/dai.c:882  ERROR dai_report_xrun(): overrun due to no space available
[   619635898.867465] (         188.802078) c0 dai          21.70          src/audio/dai.c:882  ERROR dai_report_xrun(): overrun due to no space available
[   619636122.200789] (         223.333328) c2 dai          2.10           src/audio/dai.c:766  ERROR comp_overrun(): sink->free = 756, copy_bytes = 0
[   619636341.523697] (         219.322906) c0 dai          21.70          src/audio/dai.c:766  ERROR comp_overrun(): sink->free = 512, copy_bytes = 0
[   619636642.877852] (          87.812500) c2 pipe         2.11  ....../pipeline-stream.c:134  ERROR pipeline_copy(): ret = -61, start->comp.id = 10, dir = 0
[   619636781.263263] (          35.625000) c0 pipe         21.71 ....../pipeline-stream.c:134  ERROR pipeline_copy(): ret = -61, start->comp.id = 70, dir = 0
[   619636978.971589] (         197.708328) c2 pipe         2.11  ..../pipeline-schedule.c:183  ERROR pipeline_task(): xrun recovery failed! pipeline is stopped.
[   619637224.648662] (         245.677078) c0 pipe         21.71 ..../pipeline-schedule.c:183  ERROR pipeline_task(): xrun recovery failed! pipeline is stopped.
[   619637510.169484] (          19.687500) c2 zll-schedule       src/schedule/zephyr_ll.c:69   INFO num_tasks 1 total_num_tasks 6
[   619637532.461150] (          22.291666) c2 zll-schedule       ......../zephyr_domain.c:209  INFO zephyr_domain_unregister domain->type 1 domain->clk 4
[   619637691.888227] (         159.427078) c0 zll-schedule       src/schedule/zephyr_ll.c:67   INFO task complete 0xbe087b80 pipe-task 
[   619637711.159060] (          19.270832) c0 zll-schedule       src/schedule/zephyr_ll.c:69   INFO num_tasks 3 total_num_tasks 5
[   619637775.742390] (          64.583328) c0 sa                          src/lib/agent.c:83   WARN validate(), ll drift detected, delta = 42640
[   619638673.398605] (         897.656189) c0 dmic-dai     2.1   ....../intel/dmic/dmic.c:508  INFO dmic_irq_handler(), OUTSTAT0 = 0x3000000c, OUTSTAT1 = 0x28000000
[   619638692.252771] (          18.854166) c0 dmic-dai     2.1   ....../intel/dmic/dmic.c:517  ERROR dmic_irq_handler(): full fifo B or PDM overrun
[   619939207.240829] (      300515.000000) c1 zll-schedule       src/schedule/zephyr_ll.c:142  INFO ll task 0xbe084f80 pipe-task  avg 10722, max 12786
[   619939231.407495] (          24.166666) c1 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 11030, max 12994, overruns 0
[   619969135.208390] (       29903.800781) c3 zll-schedule       src/schedule/zephyr_ll.c:142  INFO ll task 0xbe089e00 pipe-task  avg 6765, max 7902
[   619969160.052139] (          24.843748) c3 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 6968, max 8115, overruns 0

So in summary, we probably need to try removing the agent. It's adding to the problem here, and if we have load issues, the DSP load traces will catch them.

kv2019i added a commit that referenced this issue Mar 16, 2022
The system agent is not required with SOF Zephyr builds
as the Zephyr ll scheduler implementation can track DSP
load and print periodic status of the load and any observed
overruns in scheduling.

Disabling agent is beneficial as agent can create a lot of DMA traffic
when DMA trace is enabled. This happens if there is jitter in agent
execution, with delta slightly over the warning threshold. This can
itself worsen the scheduling variation and lead to actual problems.

BugLink: #5556
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
@kv2019i
Collaborator Author

kv2019i commented Mar 16, 2022

#5557 did not help. Logs are less cluttered, but there's still a failure in configuring the SSP1 DAI (on core 3).

@kv2019i
Collaborator Author

kv2019i commented Mar 17, 2022

@lgirdwood Some log data with your patches. First, the IRQs: the status doesn't change during any of the tests when measured at the agent. If some IRQ is blocked, that would probably have to be observed outside the agent:

[ 25633407.734375] ( 9.739583) c0 sa src/lib/agent.c:87 WARN irq enabled 10040 status 222
[ 25641487.421875] ( 10.520833) c0 sa src/lib/agent.c:87 WARN irq enabled 10040 status 222
[ 25645529.895833] ( 11.119792) c0 sa src/lib/agent.c:87 WARN irq enabled 10040 status 222

@kv2019i
Collaborator Author

kv2019i commented Mar 17, 2022

The missed-IRQ trace didn't provide useful output. The next_tick is always updated at least once, so this floods the terminal.

Instead I modified the "LL scheduler" trace a bit. With 10 us of headroom it fires too often and affects test execution itself, but with 10% headroom and a slight modification to print "x cycles late" without the headroom included, the results are as follows:

[    26978454.166667] (       50684.347656) c1 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe083e80 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 10534, max 12566
[    26978466.510417] (          12.343750) c1 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 10938, max 12770, overruns 0
[    27019625.000000] (       41158.488281) c0 host         21.64         src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 256 free in DMA
[    27064062.968750] (       44437.968750) c1 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 4038 cycles late
[    27166364.739583] (      102301.773438) c1 host         20.56         src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 768 free in DMA
[    27291377.708333] (      125012.968750) c1 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 3947 cycles late
[    27383355.208333] (       91977.500000) c0 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe086700 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 3179, max 4953
[    27398462.942708] (       15107.734375) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 3845 cycles late
[    27473343.125000] (       74880.179688) c1 host         20.56         src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 768 free in DMA
[    27493476.015625] (       20132.890625) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 4437 cycles late
[    27604917.500000] (      111441.484375) c0 host         21.64         src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 256 free in DMA
[    27749323.098958] (      144405.593750) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 3950 cycles late
[    27795998.255208] (       46675.156250) c2 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe088d00 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 6923, max 8233

All cores are loaded, but the delay is not reported on all of them. Still, the latency spikes happen at a very slow pace, here hundreds of milliseconds apart.
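For reference, a minimal sketch of the modified lateness check described above (building on the draft patch earlier in this thread; the 38400 cycles per 1 ms tick and the 10% headroom value are assumptions, and the actual local change may differ):

/* Sketch only: warn when zephyr_ll_run() starts more than 10% of a tick late,
 * printing the raw lateness without the headroom folded into the number.
 * Assumes the last_tick_cycles field added by the draft patch above.
 */
#define LL_PERIOD_CYCLES	38400			/* 1 ms tick at 38.4 MHz */
#define LL_HEADROOM_CYCLES	(LL_PERIOD_CYCLES / 10)	/* 10% headroom */

static void zephyr_ll_check_late(struct zephyr_ll *sch)
{
	uint32_t cycles_now = k_cycle_get_32();
	uint32_t expected = sch->last_tick_cycles + LL_PERIOD_CYCLES;

	/* the headroom only gates the warning; the printed value is the raw delta */
	if (cycles_now > expected + LL_HEADROOM_CYCLES)
		tr_err(&ll_tr, "LL scheduling %u cycles late", cycles_now - expected);

	sch->last_tick_cycles = cycles_now;
}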

Here's another log excerpt showing warnings from all cores; it includes avg/max data from the audio tasks (on different cores). So we get occasional variation in the scheduling invocation, but the audio processing load is not high and no peaks are seen:

[   407846942.916667] (       56481.980469) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 3980 cycles late
[   407850015.338542] (        3072.421875) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 3937 cycles late
[   407886468.411458] (       36453.074219) c1 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 4267 cycles late
[   407930947.161458] (       44478.750000) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 4187 cycles late
[   407950400.104167] (       19452.943359) c1 host         20.56         src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 768 free in DMA
[   407969469.062500] (       19068.958984) c1 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 3914 cycles late
[   408192947.447917] (      223478.390625) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 4205 cycles late
[   408202141.015625] (        9193.567383) c3 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe08c080 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 4214, max 5047
[   408202341.119792] (         200.104172) c3 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe08c1c0 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 7103, max 7653
[   408202353.593750] (          12.473958) c3 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 11600, max 12722, overruns 0
[   408203210.494792] (         856.901062) c2 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe088d00 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 7196, max 8373
[   408203222.942708] (          12.447917) c2 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 7561, max 8564, overruns 0
[   408227392.213542] (       24169.271484) c1 host         20.56         src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 768 free in DMA
[   408352454.843750] (      125062.632812) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 4501 cycles late
[   408401926.901042] (       49472.058594) c1 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 4211 cycles late
[   408418354.244792] (       16427.343750) c1 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe083e80 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 10521, max 12456
[   408418366.328125] (          12.083333) c1 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 10791, max 12658, overruns 0
[   408560402.005208] (      142035.671875) c1 host         20.56         src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 768 free in DMA
[   408639940.000000] (       79537.992188) c0 zll-schedule       src/schedule/zephyr_ll.c:192  INFO LL scheduling 3893 cycles late
[   408788566.901042] (      148626.906250) c0 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 4089, max 5826, overruns 

I modified the trace a bit to print out the LL-to-LL delay on the N+1 iteration after a delayed invocation. As expected (a fixed timer IRQ drives the system), we systematically catch up on the next iteration:

[     6592101.666667] (       66791.851562) c0 zll-schedule       src/schedule/zephyr_ll.c:195  INFO LL scheduling 4006 cycles late
[     6593062.968750] (         961.302063) c0 zll-schedule       src/schedule/zephyr_ll.c:201  INFO LL scheduling N+1 -2449 cycles
[     6597119.921875] (        4056.953125) c0 zll-schedule       src/schedule/zephyr_ll.c:195  INFO LL scheduling 3872 cycles late
[     6598102.473958] (         982.552063) c0 zll-schedule       src/schedule/zephyr_ll.c:201  INFO LL scheduling N+1 -1990 cycles

And second piece:

[    40701401.223958] (       38392.812500) c1 zll-schedule       src/schedule/zephyr_ll.c:195  INFO LL scheduling 3987 cycles late
[    40702358.072917] (         956.848938) c1 zll-schedule       src/schedule/zephyr_ll.c:201  INFO LL scheduling N+1 -2406 cycles
[    40812577.916667] (      110219.843750) c2 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe088d00 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 7006, max 8295
[    40812590.885417] (          12.968750) c2 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 7435, max 8567, overruns 0
[    40813526.171875] (         935.286438) c3 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe08c080 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 4152, max 4867
[    40813721.614583] (         195.442703) c3 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe08c1c0 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 7177, max 7717
[    40813733.177083] (          11.562500) c3 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 11615, max 12740, overruns 0
[    40814440.312500] (         707.135437) c0 zll-schedule       src/schedule/zephyr_ll.c:195  INFO LL scheduling 3909 cycles late
[    40815368.619792] (         928.307312) c0 zll-schedule       src/schedule/zephyr_ll.c:201  INFO LL scheduling N+1 -3108 cycles
[    40860152.239583] (       44783.621094) c3 host         7.36          src/audio/host.c:475  INFO no bytes to copy, 0 free in buffer, 384 available in DMA
[    40887525.052083] (       27372.812500) c2 host         2.6           src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 384 free in DMA
[    40916856.796875] (       29331.744141) c1 zll-schedule       src/schedule/zephyr_ll.c:195  INFO LL scheduling 3881 cycles late
[    40917896.666667] (        1039.869751) c1 zll-schedule       src/schedule/zephyr_ll.c:201  INFO LL scheduling N+1 1496 cycles
[    40923021.796875] (        5125.130371) c0 host         21.64         src/audio/host.c:481  INFO no bytes to copy, 0 avail in buffer, 256 free in DMA
[    41029471.718750] (      106449.921875) c1 zll-schedule       src/schedule/zephyr_ll.c:148  INFO ll task 0xbe083e80 pipe-task <f11818eb-e92e-4082-82a3-dc54c604ebb3> avg 10481, max 12010
[    41029484.114583] (          12.395833) c1 zll-schedule       ......../zephyr_domain.c:89   INFO ll timer avg 10705, max 12314, overruns 0

There is definitely some unexpected variation, but we stay within the 15% budget all the time, and no peaks are reported in audio processing that would exceed the budget and cause a problem even for a single scheduler run.
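A minimal sketch of the N+1 delta print mentioned above (the was_late flag and the reuse of LL_PERIOD_CYCLES/LL_HEADROOM_CYCLES from the previous sketch are illustrative assumptions, not the exact local patch):

/* Sketch only: after a late LL invocation, also report how far the next (N+1)
 * invocation lands from the nominal period, to verify that we catch up.
 */
static void zephyr_ll_check_n1(struct zephyr_ll *sch)
{
	uint32_t cycles_now = k_cycle_get_32();
	int32_t delta = (int32_t)(cycles_now - sch->last_tick_cycles) - LL_PERIOD_CYCLES;

	if (sch->was_late)
		tr_info(&ll_tr, "LL scheduling N+1 %d cycles", delta);

	sch->was_late = delta > (int32_t)LL_HEADROOM_CYCLES;
	sch->last_tick_cycles = cycles_now;
}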

We can probably add similar lower-cost tracking of the IRQ wakeups to get more data on this without spamming the DMA trace, e.g. track the avg and max of the IRQ wakeup delay.
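A hedged sketch of what such lower-cost tracking could look like (the struct, field names, and reporting interval are assumptions, not existing SOF code):

/* Sketch only: accumulate avg/max of the timer wakeup delay and report once
 * per window so the DMA trace is not flooded.
 */
#define LL_WAKEUP_REPORT_TICKS	1000	/* report roughly once per second */

struct ll_wakeup_stats {
	uint32_t sum;	/* accumulated lateness in cycles */
	uint32_t max;	/* worst lateness seen in the window */
	uint32_t count;	/* ticks in the current window */
};

static void ll_wakeup_track(struct ll_wakeup_stats *st, uint32_t late_cycles)
{
	st->sum += late_cycles;
	if (late_cycles > st->max)
		st->max = late_cycles;

	if (++st->count < LL_WAKEUP_REPORT_TICKS)
		return;

	tr_info(&ll_tr, "ll wakeup delay avg %u max %u", st->sum / st->count, st->max);
	st->sum = 0;
	st->max = 0;
	st->count = 0;
}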

@kv2019i
Collaborator Author

kv2019i commented Mar 17, 2022

Managed to reproduce on one of the CI machines now. There's a Zephyr OS panic. Debugging it now.

kv2019i added the P1 label on Mar 17, 2022
@kv2019i
Collaborator Author

kv2019i commented Mar 17, 2022

Status update:

  • the Zephyr panic dumps are scrambled, but @lyakh managed to descramble one and it points to dai_report_xrun()
  • in today's testing the failure has happened in multiple ways; there is not always a Zephyr panic
  • the common factor is an xrun on SSP0 on core0 when three other streams are already active
  • other observed failure patterns:
    - running out of memory when allocating a new pipeline
    - no panic, no out-of-memory, just an xrun in the FW log

A bisect shows the spinlock change (#5286) makes this problem easier to reproduce.

@kv2019i
Collaborator Author

kv2019i commented Mar 17, 2022

Another update:

  • issue hit without the new spinlock implementation (Convert spinlock to Zephyr API #5286) -- takes much longer, but can still be hit
  • issue hit on TGL -- harder to hit as well, but similar error footprint

@kv2019i
Collaborator Author

kv2019i commented Mar 21, 2022

There's a draft workaround for the Zephyr timer reliability, but unfortunately the error still happens with that workaround in place. There are still unexplained spikes in timer invocation, so there could still be a connection, but at least the error in the timer's average period does not explain the multiple-pipeline-capture failure.

@lgirdwood
Member

@kv2019i Could it be IRQs being off in locks?
Could we have a hack in k_spin_unlock() that does the following:

  1. Check the interrupt timer status IRQ bit.
  2. If it is pending, log the delta between the current time and the timeout, and also log the caller (a rough sketch of such a wrapper follows below).
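A minimal sketch of such a wrapper, assuming the check is done against a cached LL deadline via the cycle counter rather than the platform-specific interrupt status register (ll_next_deadline_cycles and dbg_spin_unlock() are made-up names):

/* Sketch only: warn if a lock was held past the expected LL timer deadline
 * and log the caller address.
 */
extern uint32_t ll_next_deadline_cycles;	/* assumed to be refreshed by the LL timer */

static inline void dbg_spin_unlock(struct k_spinlock *l, k_spinlock_key_t key)
{
	uint32_t now = k_cycle_get_32();

	if ((int32_t)(now - ll_next_deadline_cycles) > 0)
		tr_warn(&ll_tr, "unlock past LL deadline by %u cycles, caller 0x%x",
			now - ll_next_deadline_cycles,
			(uint32_t)(uintptr_t)__builtin_return_address(0));

	k_spin_unlock(l, key);
}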

@kv2019i
Collaborator Author

kv2019i commented Mar 21, 2022

I'll resume the IRQ timing work tomorrow; I spent today with DW-DMA. The variety of symptoms still leaves me puzzled. Sometimes the error happens at state transitions (either a Zephyr OS panic via an invalid load or a timed-out IPC), but often the main failure is an xrun in steady state. There are four streams running on four cores, all using DW-DMA. Traces like these are very common:

[ 374512296.472385] ( 18.750000) c2 ssp-dai 1.0 /drivers/intel/ssp/ssp.c:1001 INFO ssp_start() OUT
[ 374849807.761057] ( 1738.541626) c2 zll-schedule ......../zephyr_domain.c:125 INFO domain tick avg 38400, max 41351
[ 376558874.411895] ( 11971.875000) c2 zll-schedule src/schedule/zephyr_ll.c:142 INFO ll task 0xbe081f80 pipe-task avg 4581, max 8032
[ 376558910.088977] ( 35.677082) c2 zll-schedule ......../zephyr_domain.c:89 INFO ll timer avg 4845, max 8375, overruns 0
[ 376560620.453492] ( 1710.364502) c2 dw-dma src/drivers/dw/dma.c:1146 INFO dw_dma_get_data_size(): avail 3188078672 free 3188078676
[ 376560639.411825] ( 18.958332) c2 dw-dma src/drivers/dw/dma.c:1148 INFO dw_dma_get_data_size(): av n1 496 n2 520 n3 520

Note: sof-logger was run with the 19.2 MHz clock reference, so timestamps are off by 2x in the output.

The last trace is a local modification that prints the "avail bytes" value from the last three invocations of get_data_size(). The previous iterations have been normal, but suddenly, in the middle of a stable streaming state, the DW-DMA goes haywire (avail 3188078672). Soon after this, the streams on the other cores start returning errors.
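For reference, a minimal sketch of that local debug change (the dwdma_tr trace context name and the standalone helper are assumptions; the real change likely lives inside dw_dma_get_data_size() itself):

/* Sketch only: remember the last three "avail" samples so a sudden bogus value
 * (like the ~3.1e9 one above) can be compared with the preceding healthy ones.
 */
#define DW_AVAIL_HISTORY	3

static uint32_t avail_hist[DW_AVAIL_HISTORY];

static void dw_dma_track_avail(uint32_t avail)
{
	/* shift out the oldest sample and append the newest one */
	avail_hist[0] = avail_hist[1];
	avail_hist[1] = avail_hist[2];
	avail_hist[2] = avail;

	tr_info(&dwdma_tr, "dw_dma_get_data_size(): av n1 %u n2 %u n3 %u",
		avail_hist[0], avail_hist[1], avail_hist[2]);
}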

@kv2019i
Collaborator Author

kv2019i commented Mar 23, 2022

Status update:

@kv2019i
Collaborator Author

kv2019i commented Mar 23, 2022

Status update:

kv2019i added a commit to kv2019i/sof that referenced this issue Mar 25, 2022
Allocate "struct zephyr_ll_pdata" in shared/coherent memory as it embeds
a "struct k_sem" object. Zephyr kernel code assumes the object to be in
cache coherent memory, so incorrect operation may result if condition is
not met.

Long test runs of all-core capture stress test on Intel cAVS2.5
platform show failures that are fixed with this change.

Discovered via runtime assert in zephyr/kernel/sched.c:pend() that
is hit without this patch.

BugLink: thesofproject#5556
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
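As a rough illustration of the change described in the commit message, the allocation of the per-task scheduler data would move to a shared/coherent zone, roughly like this (SOF_MEM_ZONE_RUNTIME_SHARED is an assumed zone name; see the actual commit for the exact call):

/* Sketch only: allocate struct zephyr_ll_pdata from a shared/coherent zone
 * instead of per-core cached memory, because the embedded struct k_sem is
 * touched by Zephyr kernel code from more than one core.
 */
pdata = rzalloc(SOF_MEM_ZONE_RUNTIME_SHARED, 0, SOF_MEM_CAPS_RAM,
		sizeof(*pdata));
if (!pdata) {
	tr_err(&ll_tr, "failed to allocate zephyr_ll_pdata");
	return -ENOMEM;
}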
@kv2019i
Collaborator Author

kv2019i commented Mar 25, 2022

Performed a bisect of sorts on all the current fixes, and it turns out the k_sem coherency fix is the key change. The others are probably valid fixes, but not required for this bug. Test details at #5588

lgirdwood pushed a commit that referenced this issue Mar 28, 2022
Allocate "struct zephyr_ll_pdata" in shared/coherent memory as it embeds
a "struct k_sem" object. Zephyr kernel code assumes the object to be in
cache coherent memory, so incorrect operation may result if condition is
not met.

Long test runs of all-core capture stress test on Intel cAVS2.5
platform show failures that are fixed with this change.

Discovered via runtime assert in zephyr/kernel/sched.c:pend() that
is hit without this patch.

BugLink: #5556
Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
@kv2019i
Collaborator Author

kv2019i commented Mar 29, 2022

No errors in Intel daily plan 11433, so potentially fixed by the fix to #5556.

@kv2019i
Collaborator Author

kv2019i commented Apr 1, 2022

Not seen in Intel daily testing for multiple days, closing.

@kv2019i kv2019i closed this as completed Apr 1, 2022