LL scheduler domain refinement #4089

Merged: 7 commits into thesofproject:main on May 13, 2021

Conversation

@keyonjie (Contributor) commented Apr 25, 2021

Cleanup of the LL scheduler domain clients/tasks management.

ll_schedule: domain client/task refinement

Refinement to make the domain client/task management more clear:

a. update the total_num_tasks and num_clients in the helper pair
domain_register/unregister().

b. update the enabled[core] and enabled_clients in the helper pair
domain_enable/disable().

c. for the completed task, do the domain client/task update in the newly
created helper schedule_ll_task_done().

d. cleanup and remove the client/task management from the ll_schedule
helpers schedule_ll_clients_enable(), schedule_ll_tasks_run(),
schedule_ll_domain_set() and schedule_ll_domain_clear().

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

This aims to fix #3947 #3949 #3950.

Without this refinement, num_clients was not used correctly: we wanted to use it for both num_registered_cores and
num_enabled_cores, and with that @slawblauciak observed it could change to '-1' when a component/pipeline is asked to
run on a slave core. That's why we are seeing issues #3947 #3949 #3950.

This reverts changes in 'commit b500999477ca ("scheduler: guard against
subdivision underflow on domain clear")' as it is not needed anymore.
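
To make points (a) and (b) above concrete, here is a minimal sketch of the intended counter split. The types, fields and helper signatures below are simplified stand-ins, not the actual SOF definitions (SOF has its own atomic_t API, and the real helpers take domain/task arguments):

/* Simplified stand-in types, not the real SOF definitions. */
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_CORES 4

struct ll_schedule_domain {
	atomic_int total_num_tasks;   /* all registered tasks, all cores */
	atomic_int registered_cores;  /* cores with at least one registered task */
	atomic_int enabled_cores;     /* cores currently enabled */
	bool enabled[NUM_CORES];      /* per-core enable state */
};

/* (a) registration counters move only in the register/unregister pair */
static void domain_register(struct ll_schedule_domain *d, int core_task_count)
{
	atomic_fetch_add(&d->total_num_tasks, 1);
	if (core_task_count == 0)	/* first task on this core */
		atomic_fetch_add(&d->registered_cores, 1);
}

static void domain_unregister(struct ll_schedule_domain *d, int core_task_count)
{
	atomic_fetch_sub(&d->total_num_tasks, 1);
	if (core_task_count == 0)	/* last task just left this core */
		atomic_fetch_sub(&d->registered_cores, 1);
}

/* (b) per-core enable state moves only in the enable/disable pair */
static void domain_enable(struct ll_schedule_domain *d, int core)
{
	if (!d->enabled[core]) {
		d->enabled[core] = true;
		atomic_fetch_add(&d->enabled_cores, 1);
	}
}

static void domain_disable(struct ll_schedule_domain *d, int core)
{
	if (d->enabled[core]) {
		d->enabled[core] = false;
		atomic_fetch_sub(&d->enabled_cores, 1);
	}
}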

@keyonjie keyonjie changed the title [TEST ONLY] LL scheduler domain refinement LL scheduler domain refinement Apr 26, 2021
@jgunn0262 (Contributor) left a comment

Sorry @keyonjie, I really don't understand what you are fixing here.

@@ -201,7 +202,7 @@ static void dma_multi_chan_domain_unregister(struct ll_schedule_domain *domain,

/* check if task should be unregistered */
if (!pipe_task->registrable)
-	return;
+	return -EINVAL;
Contributor

I don't know how the failure will impact the scheduler, but this is probably important, so should it be logged?

@@ -46,6 +46,7 @@ struct ll_schedule_domain {
spinlock_t lock; /**< standard lock */
atomic_t total_num_tasks; /**< total number of registered tasks */
atomic_t num_clients; /**< number of registered cores */
atomic_t enabled_clients; /**< number of enabled cores */
Contributor

This is confusing - clients should be renamed to cores.

Contributor Author

This is confusing - clients should be renamed to cores.

yes, let me change them.

@keyonjie (Contributor Author)

Sorry @keyonjie, I really don't understand what you are fixing here.

This aims to fix #3947 #3949 #3950.

@lgirdwood (Member)

Sorry @keyonjie, I really don't understand what you are fixing here.

This aims to fix #3947 #3949 #3950.

@keyonjie I think @jgunn0262 means this is not clear in the commit message, i.e. the commit message should explain the bug and why this PR fixes it.

@keyonjie (Contributor Author)

Sorry @keyonjie, I really don't understand what you are fixing here.

This aims to fix #3947 #3949 #3950.

@keyonjie I think @jgunn0262 means this is not clear in the commit message, i.e. the commit message should explain the bug and why this PR fixes it.

Got it. Basically num_clients was not used correctly: we wanted to use it for both num_registered_cores and num_enabled_cores, and with that @slawblauciak observed it could change to '-1' when a component/pipeline is asked to run on a slave core.

@slawblauciak provided a fix here: #4088, but we should really remove this mess to avoid encountering more similar issues.

@keyonjie (Contributor Author) commented Apr 28, 2021

@zrombel @slawblauciak @mwasko This actually reverts the changes in 'commit b500999 ("scheduler: guard against
subdivision underflow on domain clear")' as it is no longer needed. Can you please help check whether this looks good to you, as I can't reproduce the issues at my end?

@lyakh (Collaborator) left a comment

This PR really has to explain what it changes and why...

@@ -201,7 +202,7 @@ static void dma_multi_chan_domain_unregister(struct ll_schedule_domain *domain,

/* check if task should be unregistered */
if (!pipe_task->registrable)
-	return;
+	return -EINVAL;
Collaborator

is this actually an error? dma_multi_chan_domain_register() returns 0 in this case.

@keyonjie (Contributor Author), Apr 30, 2021

Oh thanks, let's not change the existing logic here and return 0 for both at the moment.

}
}

return -EINVAL;
Collaborator

presumably we'd end up here if the domain hadn't been registered or maybe is still busy?

Contributor Author

yes, it could be either, so I use '-EINVAL' here.

@@ -370,27 +371,29 @@ static void dma_single_chan_domain_unregister(struct ll_schedule_domain *domain,

/* check if task should be unregistered */
if (!pipe_task->registrable)
-	return;
+	return -EINVAL;
Collaborator

as above - is this an error?

Contributor Author

yes, let me make it consistent with the register() part.

@@ -192,14 +192,16 @@ static void timer_domain_unregister(struct ll_schedule_domain *domain,

/* tasks still registered on this core */
if (!timer_domain->arg[core] || num_tasks)
-	return;
+	return -EINVAL;
Collaborator

-EBUSY?

Contributor Author

yes.


/* no client anymore, clear the domain */
if (!atomic_read(&domain->registered_cores))
domain_clear(domain);
Collaborator

this changes behaviour, right? E.g. in timer domain this clears the interrupt, which if done wrongly can cause a missed interrupt. Why is this needed? Was there a bug before?

atomic_sub(&sch->num_tasks, 1);

/* the last task of the core, unregister the client/core */
if (!atomic_read(&sch->num_tasks) &&
Collaborator

I think this has been discussed before. The sequence

atomic_sub(y);
x = atomic_read(y);

isn't the same as

x = atomic_sub(y);

namely because the former one breaks atomicity.

Contributor Author

As the lock is held by the caller, we are still fine even if the atomicity is broken. I changed it for a readability improvement; otherwise, since atomic_xx() returns the original value, we would have to use:

x = atomic_sub(y);
if (!(x - 1))
Collaborator

I don't think it improves readability at all. It adds confusion why a wrong use of an API is allowed. Either we're protected by some external locking, then we should remove atomic_t and just make it an int and make sure it's consistently protected, or we use atomic_t and we understand why we need it and we use it correctly.

Contributor Author

I don't think it improves readability at all. It adds confusion why a wrong use of an API is allowed. Either we're protected by some external locking, then we should remove atomic_t and just make it an int and make sure it's consistently protected, or we use atomic_t and we understand why we need it and we use it correctly.

We don't have to guarantee atomicity, but we still need to keep them volatile to make sure we don't read them from cache, since we need them synced across the L1 caches of the different cores. @lyakh what do you suggest here?

Member

IIRC the atomic types are used as there is or was a need to add tasks from within other tasks and the scheduler lock was already held at this point. Not sure if this is the case today.
@lyakh the Zephyr timer domain work you are doing should be a lot simpler wrt locking/atomics since this is all done by Zephyr. We should see a significant reduction in complexity.

atomic_sub(&sch->domain->registered_cores, 1);

/* no client anymore, clear the domain */
if (!atomic_read(&sch->domain->registered_cores))
Collaborator

ditto


/* no client anymore, clear the domain */
if (!atomic_read(&sch->domain->registered_cores))
domain_clear(sch->domain);
Collaborator

this wasn't done before, why is it needed?

Contributor Author

Good point.
domain_clear() clears interrupts, so it should be called only in the interrupt handler; it should not be called here, nor in domain_unregister() or domain_disable().

Let me refine this part as well. Thanks!
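
In outline, and only as a hedged sketch with simplified stand-in types (not the SOF signatures), the clear then belongs in the interrupt path rather than in the unregister/disable helpers:

/* Simplified stand-ins, not the real SOF types or signatures. */
struct ll_schedule_domain {
	int pending_irq;
};

static void domain_clear(struct ll_schedule_domain *d)
{
	/* in the real code this acks/clears the domain interrupt source */
	d->pending_irq = 0;
}

static void run_ll_tasks_on_this_core(struct ll_schedule_domain *d)
{
	(void)d;	/* placeholder for running the registered LL tasks */
}

/* The interrupt handler is the one place where clearing is safe: ack the
 * interrupt first so the next one is not missed, then run the tasks.
 * domain_unregister()/domain_disable() only update counters and enable
 * state and do not call domain_clear(). */
static void ll_domain_irq_handler(struct ll_schedule_domain *d)
{
	domain_clear(d);
	run_ll_tasks_on_this_core(d);
}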

@lgirdwood (Member)

@keyonjie before I review, does this fix the DSM issue?

@keyonjie (Contributor Author) commented May 8, 2021

@keyonjie before I review, does this fix the DSM issue?

I would say it improves multi-core support a lot; for the DSM issue, it needs @zrombel's double check.

Without this PR, we see many errors in multi-core validation, see it here: https://sof-ci.sh.intel.com/#/result/planresultdetail/3832

I am waiting for the validation result with this PR applied here:
https://sof-ci.sh.intel.com/#/result/planresultdetail/3837

@keyonjie (Contributor Author) commented May 8, 2021

@lgirdwood there was some conflict with the latest main; just rebased, let's wait for the result here: https://sof-ci.sh.intel.com/#/result/planresultdetail/3837

@gkbldcig (Collaborator)

Can one of the admins verify this patch?

@keyonjie (Contributor Author)

SOFCI TEST

@zrombel commented May 10, 2021

I ran some stress tests on this PR, and it seems that the stream stall problem #4101 is fixed!
There were no stream stalls in 200 iterations of the DSM multicore tests (in def and chrome configs).
Now I can only see a sporadic glitch in the recorded file.
The glitch can be seen at the beginning of the stream, or throughout the whole stream:

  1. Glitch at the beginning (screenshot)

  2. Glitch in the whole stream (screenshot)

But this is another bug reported here: #4052

A scheduler will run on a specific core only; add a core index flag
to schedule_data to denote that.

To schedule or cancel a task, look for the scheduler with the correct
core index to perform the action.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
The timer_disable() call in timer_register() is wrong; the
interrupt_enable() call handles the interrupt enabling already.
Remove the wrong timer_disable() call to correct it.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Add a return value to domain_unregister(), which will be used for the LL
scheduler and the domain state management in the subsequent refinement.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Rename the 'num_clients' in struct ll_schedule_domain to 'registered_cores'
to reflect the real use of it.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
The flag registered_cores was created to denote the number of registered
cores, but it is also used as the number of the enabled cores today.

Here add a new flag enabled_cores for the latter purpose, to remove the
confusion and help the subsequent domain tasks/cores management
refinement.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Refinement to make the domain tasks/cores management more clear:

a. update the total_num_tasks and num_cores in the helper pair
domain_register/unregister().

b. update the enabled[core] and enabled_cores in the helper pair
domain_enable/disable().

c. for the completed task, do the domain client/task update in the newly
created helper schedule_ll_task_done().

d. cleanup and remove the tasks/cores management from the ll_schedule
helpers schedule_ll_clients_enable(), schedule_ll_tasks_run(),
schedule_ll_domain_set() and schedule_ll_domain_clear().

Without this refinement, num_clients was not used correctly:
we used it for both num_registered_cores and num_enabled_cores, and with
that we observed that it could change to '-1' when a component/pipeline
is asked to run on a slave core.

This reverts changes in 'commit b500999 ("scheduler: guard against
subdivision underflow on domain clear")' as it is not needed anymore.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Add a new_target_tick to store the new target tick for the next set,
which is used during the reschedule stage.

Update the new_target_tick as tasks are done on each core, and do the
final domain_set() at the point when tasks on all cores have finished.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
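
A rough illustration of that last commit's idea follows; this is a simplified sketch, not the SOF code, with stand-in types and a stubbed domain_set(), and it assumes the domain lock or equivalent serialization protects new_target_tick as discussed in the review above:

#include <stdatomic.h>
#include <stdint.h>

#define NO_TARGET UINT64_MAX

struct ll_schedule_domain {
	atomic_int pending_cores;	/* cores that have not finished this cycle */
	uint64_t new_target_tick;	/* earliest next deadline seen so far */
};

/* stub: the real code programs the timer/DMA interrupt for the next tick */
static void domain_set(struct ll_schedule_domain *d, uint64_t tick)
{
	(void)d;
	(void)tick;
}

/* Called when all LL tasks of one core are done for this cycle;
 * core_next_tick is that core's earliest pending deadline. */
static void core_tasks_done(struct ll_schedule_domain *d, uint64_t core_next_tick)
{
	/* each core lowers new_target_tick to its earliest pending deadline */
	if (core_next_tick < d->new_target_tick)
		d->new_target_tick = core_next_tick;

	/* only the last core to finish does the final domain_set() */
	if (atomic_fetch_sub(&d->pending_cores, 1) == 1) {
		domain_set(d, d->new_target_tick);
		d->new_target_tick = NO_TARGET;	/* reset for the next cycle */
	}
}
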
@keyonjie (Contributor Author) commented May 11, 2021

@slawblauciak the rescheduling part refinement has been added on top, please help to review.

@keyonjie (Contributor Author)

SOFCI TEST

@keyonjie (Contributor Author) commented May 12, 2021

@lgirdwood This PR will address > 90% of the failed cases observed here: http://sof-ci.sh.intel.com/#/result/planresultdetail/3868 and some other multi-core related issues observed by the slim driver; I think we need this for 1.8.

@lgirdwood (Member) left a comment

Is this intended for rc2? Are we happy with validation on TGL 12?


@keyonjie (Contributor Author) commented May 12, 2021

Is this intended for rc2? Are we happy with validation on TGL 12?

Yes, it is intended for rc2.
For TGL 12 validation, @slawblauciak @mwasko can you help cherry-pick the related commits to the 12 branch (there could be conflicts, so other related commits may need to go together)? Or do you prefer to branch out a new branch from the latest main for the release?

@lgirdwood (Member)

@keyonjie can you check the Zephyr build CI?

@marc-hb (Collaborator) commented May 12, 2021

I've seen similar issues in other PRs, hopefully just a temporary glitch on llvm.org. I re-ran the job and it passed.

https://github.com/thesofproject/sof/pull/4089/checks?check_run_id=2554961695

Err:7 http://apt.llvm.org/focal llvm-toolchain-focal-12/main i386 Packages
  File has unexpected size (2110 != 2108). Mirror sync in progress? [IP: 151.101.250.49 443]

@marc-hb (Collaborator) commented May 12, 2021

Err:7 http://apt.llvm.org/focal llvm-toolchain-focal-12/main i386 Packages
  File has unexpected size (2110 != 2108). Mirror sync in progress? [IP: 151.101.250.49 443]

I just removed the entire dependency on llvm.org in #4182, please review.

@keyonjie (Contributor Author)

@keyonjie can you check the Zephyr build CI?

Yes, as @marc-hb mentioned, that looks like a common issue for PRs; @marc-hb just helped address it.

@lgirdwood lgirdwood merged commit fc0fe9f into thesofproject:main May 13, 2021