LL scheduler domain refinement #4089
Conversation
Force-pushed from be46b06 to f5eddbd
Sorry @keyonjie, I really don't understand what you are fixing here.
src/schedule/dma_multi_chan_domain.c
Outdated
@@ -201,7 +202,7 @@ static void dma_multi_chan_domain_unregister(struct ll_schedule_domain *domain,

 	/* check if task should be unregistered */
 	if (!pipe_task->registrable)
-		return;
+		return -EINVAL;
I don't know how the failure will impact the scheduler, but this is probably important, so should it be logged?
@@ -46,6 +46,7 @@ struct ll_schedule_domain {
 	spinlock_t lock;		/**< standard lock */
 	atomic_t total_num_tasks;	/**< total number of registered tasks */
 	atomic_t num_clients;		/**< number of registered cores */
+	atomic_t enabled_clients;	/**< number of enabled cores */
This is confusing - clients should be renamed to cores.
yes, let me change them.
Got it. Basically, num_clients was not used correctly: we wanted to use it for both num_registered_cores and num_enabled_cores, and with that @slawblauciak observed it may be changed to '-1' when a component/pipeline is asked to run on a slave core. @slawblauciak provided a fix in #4088, but we should really remove this mess to avoid encountering more similar issues.
@zrombel @slawblauciak @mwasko This actually reverts the changes in commit b500999 ("scheduler: guard against subdivision underflow on domain clear").
This PR really has to explain what it changes and why...
src/schedule/dma_multi_chan_domain.c
Outdated
@@ -201,7 +202,7 @@ static void dma_multi_chan_domain_unregister(struct ll_schedule_domain *domain,

 	/* check if task should be unregistered */
 	if (!pipe_task->registrable)
-		return;
+		return -EINVAL;
is this actually an error? dma_multi_chan_domain_register()
returns 0 in this case.
Oh thanks, let's not change the existing logic here and return 0 for both at the moment.
	}
}

	return -EINVAL;
presumably we'd end up here if the domain hadn't been registered or maybe is still busy?
yes, it could be either, so I use '-EINVAL' here.
@@ -370,27 +371,29 @@ static void dma_single_chan_domain_unregister(struct ll_schedule_domain *domain,

 	/* check if task should be unregistered */
 	if (!pipe_task->registrable)
-		return;
+		return -EINVAL;
as above - is this an error?
yes, let me make it consistent with the register() part.
src/schedule/timer_domain.c
Outdated
@@ -192,14 +192,16 @@ static void timer_domain_unregister(struct ll_schedule_domain *domain,

 	/* tasks still registered on this core */
 	if (!timer_domain->arg[core] || num_tasks)
-		return;
+		return -EINVAL;
-EBUSY?
yes.
	/* no client anymore, clear the domain */
	if (!atomic_read(&domain->registered_cores))
		domain_clear(domain);
This changes behaviour, right? E.g. in the timer domain this clears the interrupt, which if done wrongly can cause a missed interrupt. Why is this needed? Was there a bug before?
	atomic_sub(&sch->num_tasks, 1);

	/* the last task of the core, unregister the client/core */
	if (!atomic_read(&sch->num_tasks) &&
I think this has been discussed before. The sequence

    atomic_sub(y);
    x = atomic_read(y);

isn't the same as

    x = atomic_sub(y);

namely because the former one breaks atomicity.
As the lock is held by the caller, even if the atomicity is broken we are still fine. I changed it for readability; otherwise, since atomic_sub() returns the original value, we would have to use:

    x = atomic_sub(y);
    if (!(x - 1))
I don't think it improves readability at all. It adds confusion about why a wrong use of an API is allowed. Either we're protected by some external locking, in which case we should remove atomic_t, just make it an int, and make sure it's consistently protected; or we use atomic_t, we understand why we need it, and we use it correctly.
We don't have to guarantee atomicity, but we still need to make them volatile so we don't read stale values from cache, as they need to be synced across the L1 caches of the different cores. @lyakh what do you suggest here?
IIRC the atomic types are used as there is or was a need to add tasks from within other tasks, and the scheduler lock was already held at this point. Not sure if this is the case today.
@lyakh the Zephyr timer domain work you are doing should be a lot simpler wrt locking/atomics since this is all done by Zephyr. We should see a significant reduction in complexity.
src/schedule/ll_schedule.c
Outdated
	atomic_sub(&sch->domain->registered_cores, 1);

	/* no client anymore, clear the domain */
	if (!atomic_read(&sch->domain->registered_cores))
ditto
src/schedule/ll_schedule.c
Outdated
	/* no client anymore, clear the domain */
	if (!atomic_read(&sch->domain->registered_cores))
		domain_clear(sch->domain);
this wasn't done before, why is it needed?
Good point.
domain_clear() clears interrupts, so it should be called only from the interrupt handler; we should not call it here, nor in domain_unregister() or domain_disable().
Let me refine this part as well. Thanks!
Force-pushed from cf6b41f to 340c306
Force-pushed from 82d8f30 to 37f11e3
@keyonjie before I review, does this fix the DSM issue?
I would say it improves multi-core support a lot; for the DSM issue, we need @zrombel's double check. Without this PR, we see many errors in multi-core validation, see here: https://sof-ci.sh.intel.com/#/result/planresultdetail/3832 I am waiting for the validation result with this PR applied here:
@lgirdwood there was some conflict with the latest main; just rebased, let's wait for the result here: https://sof-ci.sh.intel.com/#/result/planresultdetail/3837
Can one of the admins verify this patch?
SOFCI TEST
I ran some stress tests on this PR, and it seems that the stream stall problem #4101 is fixed! But there is another bug reported here: #4052
A scheduler will run on a specific core only; add a core index flag to schedule_data to denote that. To schedule or cancel a task, look for the scheduler with the correct core index to perform the action. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
The timer_disable() in timer_register() is wrong: the interrupt_enable() call handles the interrupt enabling already, so remove the wrong timer_disable() call to correct it. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Add a return value to domain_unregister(), which will be used by the LL scheduler and the domain state management in the subsequent refinement. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Rename 'num_clients' in struct ll_schedule_domain to 'registered_cores' to reflect its real use. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
The flag registered_cores was created to denote the number of registered cores, but today it is also used as the number of enabled cores. Add a new flag enabled_cores for the latter purpose to remove the confusion and help with the subsequent domain tasks/cores management refinement. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Refine the domain tasks/cores management to make it clearer:
a. update total_num_tasks and num_cores in the helper pair domain_register()/domain_unregister().
b. update enabled[core] and enabled_cores in the helper pair domain_enable()/domain_disable().
c. for a completed task, do the domain client/task update in the newly created helper schedule_ll_task_done().
d. clean up and remove the tasks/cores management from the ll_schedule helpers schedule_ll_clients_enable(), schedule_ll_tasks_run(), schedule_ll_domain_set() and schedule_ll_domain_clear().
Without this refinement, num_clients was not used correctly: we used it for both num_registered_cores and num_enabled_cores, and with that we observed it may be changed to '-1' when a component/pipeline is asked to run on a slave core.
This reverts the changes in commit b500999 ("scheduler: guard against subdivision underflow on domain clear") as it is no longer needed.
Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Add a new_target_tick to store the new target tick for the next set, which is used during the reschedule stage. Update new_target_tick as tasks complete on each core, and do the final domain_set() at the point where tasks on all cores have finished. Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
@slawblauciak the rescheduling part refinement has been added on top, please help to review.
SOFCI TEST
@lgirdwood This PR will address >90% of the failed cases observed here: http://sof-ci.sh.intel.com/#/result/planresultdetail/3868 and some other multi-core related issues observed by the slim driver; I think we need this for 1.8.
Is this intended for rc2? Are we happy with validation on TGL 12?
Yes, it is intended for rc2.
@keyonjie can you check the Zephyr build CI.
I've seen similar issues in other PRs, hopefully just a temporary glitch on llvm.org. I re-ran the job and it passed. https://github.com/thesofproject/sof/pull/4089/checks?check_run_id=2554961695
I just removed the entire dependency on llvm.org in #4182, please review.
Clean up of the ll_scheduler domain clients/tasks management.