LL scheduler domain refinement #4089

Merged: 7 commits into thesofproject:main on May 13, 2021

Conversation

@keyonjie (Contributor) commented Apr 25, 2021

Cleanup of the LL scheduler domain clients/tasks management.

ll_schedule: domain client/task refinement

Refinement to make the domain client/task management more clear:

a. update the total_num_tasks and num_clients in the helper pair
domain_register/unregister().

b. update the enabled[core] and enabled_clients in the helper pair
domain_enable/disable().

c. for the completed task, do the domain client/task update in the newly
created helper schedule_ll_task_done().

d. cleanup and remove the client/task management from the ll_schedule
helpers schedule_ll_clients_enable(), schedule_ll_tasks_run(),
schedule_ll_domain_set() and schedule_ll_domain_clear().

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>

This aims to fix #3947 #3949 #3950.

Without this refinement, num_clients was not used correctly: we wanted to use it for both num_registered_cores and
num_enabled_cores, and with that @slawblauciak observed it could change to '-1' when a component/pipeline is asked to
run on a slave core. That's why we are seeing issues #3947 #3949 #3950.

This reverts changes in 'commit b500999477ca ("scheduler: guard against
subdivision underflow on domain clear")' as it is not needed anymore.
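
To make points (a) and (b) above concrete, here is a minimal sketch of the intended counter split. The types, fields and helper signatures below are simplified stand-ins, not the actual SOF definitions (SOF has its own atomic_t API, and the real helpers take domain/task arguments):

/* Simplified stand-in types, not the real SOF definitions. */
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_CORES 4

struct ll_schedule_domain {
	atomic_int total_num_tasks;   /* all registered tasks, all cores */
	atomic_int registered_cores;  /* cores with at least one registered task */
	atomic_int enabled_cores;     /* cores currently enabled */
	bool enabled[NUM_CORES];      /* per-core enable state */
};

/* (a) registration counters move only in the register/unregister pair */
static void domain_register(struct ll_schedule_domain *d, int core_task_count)
{
	atomic_fetch_add(&d->total_num_tasks, 1);
	if (core_task_count == 0)	/* first task on this core */
		atomic_fetch_add(&d->registered_cores, 1);
}

static void domain_unregister(struct ll_schedule_domain *d, int core_task_count)
{
	atomic_fetch_sub(&d->total_num_tasks, 1);
	if (core_task_count == 0)	/* last task just left this core */
		atomic_fetch_sub(&d->registered_cores, 1);
}

/* (b) per-core enable state moves only in the enable/disable pair */
static void domain_enable(struct ll_schedule_domain *d, int core)
{
	if (!d->enabled[core]) {
		d->enabled[core] = true;
		atomic_fetch_add(&d->enabled_cores, 1);
	}
}

static void domain_disable(struct ll_schedule_domain *d, int core)
{
	if (d->enabled[core]) {
		d->enabled[core] = false;
		atomic_fetch_sub(&d->enabled_cores, 1);
	}
}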

@keyonjie keyonjie changed the title [TEST ONLY] LL scheduler domain refinement LL scheduler domain refinement Apr 26, 2021
@jgunn0262 (Contributor) left a comment

Sorry @keyonjie, I really don't understand what you are fixing here.

@@ -201,7 +202,7 @@ static void dma_multi_chan_domain_unregister(struct ll_schedule_domain *domain,

/* check if task should be unregistered */
if (!pipe_task->registrable)
-	return;
+	return -EINVAL;
Contributor

I don't know how the failure will impact the scheduler, but this is probably important, so should it be logged?

@@ -46,6 +46,7 @@ struct ll_schedule_domain {
spinlock_t lock; /**< standard lock */
atomic_t total_num_tasks; /**< total number of registered tasks */
atomic_t num_clients; /**< number of registered cores */
atomic_t enabled_clients; /**< number of enabled cores */
Contributor

This is confusing - clients should be renamed to cores.

Contributor Author

This is confusing - clients should be renamed to cores.

yes, let me change them.

@keyonjie (Contributor Author)

Sorry @keyonjie, I really don't understand what you are fixing here.

This aims to fix #3947 #3949 #3950.

@lgirdwood (Member)

Sorry @keyonjie, I really don't understand what you are fixing here.

This aims to fix #3947 #3949 #3950.

@keyonjie I think @jgunn0262 means this is not clear in the commit message, i.e. the commit message should explain the bug and why this PR fixes it.

@keyonjie (Contributor Author)

Sorry @keyonjie, I really don't understand what you are fixing here.

This aims to fix #3947 #3949 #3950.

@keyonjie I think @jgunn0262 means this is not clear in the commit message, i.e. the commit message should explain the bug and why this PR fixes it.

Got it. Basically num_clients was not used correctly: we wanted to use it for both num_registered_cores and num_enabled_cores, and with that @slawblauciak observed it could change to '-1' when a component/pipeline is asked to run on a slave core.

@slawblauciak provided a fix here: #4088, but we should really remove this mess to avoid encountering more similar issues.

@keyonjie (Contributor Author) commented Apr 28, 2021

@zrombel @slawblauciak @mwasko This actually reverts the changes in 'commit b500999 ("scheduler: guard against
subdivision underflow on domain clear")' as it is no longer needed. Can you please help check whether this looks good to you, as I can't reproduce the issues at my end?

@lyakh (Collaborator) left a comment

This PR really has to explain what it changes and why...

@@ -201,7 +202,7 @@ static void dma_multi_chan_domain_unregister(struct ll_schedule_domain *domain,

/* check if task should be unregistered */
if (!pipe_task->registrable)
-	return;
+	return -EINVAL;
Collaborator

is this actually an error? dma_multi_chan_domain_register() returns 0 in this case.

@keyonjie (Contributor Author), Apr 30, 2021

Oh thanks, let's not change the existing logic here and return 0 for both at the moment.

}
}

return -EINVAL;
Collaborator

presumably we'd end up here if the domain hadn't been registered or maybe is still busy?

Contributor Author

yes, it could be either, so I use '-EINVAL' here.

@@ -370,27 +371,29 @@ static void dma_single_chan_domain_unregister(struct ll_schedule_domain *domain,

/* check if task should be unregistered */
if (!pipe_task->registrable)
-	return;
+	return -EINVAL;
Collaborator

as above - is this an error?

Contributor Author

yes, let me make it consistent with the register() part.

@@ -192,14 +192,16 @@ static void timer_domain_unregister(struct ll_schedule_domain *domain,

/* tasks still registered on this core */
if (!timer_domain->arg[core] || num_tasks)
-	return;
+	return -EINVAL;
Collaborator

-EBUSY?

Contributor Author

yes.


/* no client anymore, clear the domain */
if (!atomic_read(&domain->registered_cores))
domain_clear(domain);
Collaborator

this changes behaviour, right? E.g. in timer domain this clears the interrupt, which if done wrongly can cause a missed interrupt. Why is this needed? Was there a bug before?

atomic_sub(&sch->num_tasks, 1);

/* the last task of the core, unregister the client/core */
if (!atomic_read(&sch->num_tasks) &&
Collaborator

I think this has been discussed before. The sequence

atomic_sub(y);
x = atomic_read(y);

isn't the same as

x = atomic_sub(y);

namely because the former one breaks atomicity.

Contributor Author

As the lock is held by the caller, we are still fine even if the atomicity is broken. I changed it for a readability improvement; otherwise, since atomic_xx() returns the original value, we would have to use:

x = atomic_sub(y);
if (!(x - 1))
Collaborator

I don't think it improves readability at all. It adds confusion why a wrong use of an API is allowed. Either we're protected by some external locking, then we should remove atomic_t and just make it an int and make sure it's consistently protected, or we use atomic_t and we understand why we need it and we use it correctly.

Contributor Author

I don't think it improves readability at all. It adds confusion why a wrong use of an API is allowed. Either we're protected by some external locking, then we should remove atomic_t and just make it an int and make sure it's consistently protected, or we use atomic_t and we understand why we need it and we use it correctly.

We don't have to guarantee atomicity, but we still need to keep them volatile to make sure we don't read them from cache, since we need them synced across the L1 caches of the different cores. @lyakh what do you suggest here?

Member

IIRC the atomic types are used as there is or was a need to add tasks from within other tasks and the scheduler lock was already held at this point. Not sure if this is the case today.
@lyakh the Zephyr timer domain work you are doing should be a lot simpler wrt locking/atomics since this is all done by Zephyr. We should see a significant reduction in complexity.

atomic_sub(&sch->domain->registered_cores, 1);

/* no client anymore, clear the domain */
if (!atomic_read(&sch->domain->registered_cores))
Collaborator

ditto


/* no client anymore, clear the domain */
if (!atomic_read(&sch->domain->registered_cores))
domain_clear(sch->domain);
Collaborator

this wasn't done before, why is it needed?

Contributor Author

Good point.
domain_clear() clears interrupts, so it should be called only in the interrupt handler; it should not be called here, nor in domain_unregister() or domain_disable().

Let me refine this part as well. Thanks!
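
In outline, and only as a hedged sketch with simplified stand-in types (not the SOF signatures), the clear then belongs in the interrupt path rather than in the unregister/disable helpers:

/* Simplified stand-ins, not the real SOF types or signatures. */
struct ll_schedule_domain {
	int pending_irq;
};

static void domain_clear(struct ll_schedule_domain *d)
{
	/* in the real code this acks/clears the domain interrupt source */
	d->pending_irq = 0;
}

static void run_ll_tasks_on_this_core(struct ll_schedule_domain *d)
{
	(void)d;	/* placeholder for running the registered LL tasks */
}

/* The interrupt handler is the one place where clearing is safe: ack the
 * interrupt first so the next one is not missed, then run the tasks.
 * domain_unregister()/domain_disable() only update counters and enable
 * state and do not call domain_clear(). */
static void ll_domain_irq_handler(struct ll_schedule_domain *d)
{
	domain_clear(d);
	run_ll_tasks_on_this_core(d);
}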

@lgirdwood (Member)

@keyonjie before I review, does this fix the DSM issue?

@keyonjie (Contributor Author) commented May 8, 2021

@keyonjie before I review, does this fix the DSM issue?

I would say it improves multi-core support a lot; for the DSM issue, it needs @zrombel's double check.

Without this PR, we see many errors in multi-core validation, see it here: https://sof-ci.sh.intel.com/#/result/planresultdetail/3832

I am waiting for the validation result with this PR applied here:
https://sof-ci.sh.intel.com/#/result/planresultdetail/3837

@keyonjie (Contributor Author) commented May 8, 2021

@lgirdwood there was some conflict with the latest main; just rebased, let's wait for the result here: https://sof-ci.sh.intel.com/#/result/planresultdetail/3837

@gkbldcig (Collaborator)

Can one of the admins verify this patch?

@keyonjie (Contributor Author)

SOFCI TEST

@zrombel commented May 10, 2021

I ran some stress tests on this PR, and it seems that the stream stall problem #4101 is fixed!
There were no stream stalls in 200 iterations of the DSM multicore tests (in def and chrome configs).
Now I can only see a sporadic glitch in the recorded file.
The glitch can be seen at the beginning of the stream, or throughout the whole stream:

  1. Glitch at the beginning (screenshot)

  2. Glitch in the whole stream (screenshot)

But this is another bug reported here: #4052

A scheduler will run on a specific core only; add a core index flag
to schedule_data to denote that.

To schedule or cancel a task, look for the scheduler with the correct
core index to perform the action.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
The timer_disable() call in timer_register() is wrong; the
interrupt_enable() call handles the interrupt enabling already.
Remove the wrong timer_disable() call to correct it.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Add a return value to domain_unregister(), which will be used for the LL
scheduler and the domain state management in the subsequent refinement.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Rename the 'num_clients' in struct ll_schedule_domain to 'registered_cores'
to reflect the real use of it.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
The flag registered_cores was created to denote the number of registered
cores, but it is also used as the number of the enabled cores today.

Here add a new flag enabled_cores for the latter purpose, to remove the
confusion and help the subsequent domain tasks/cores management
refinement.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Refinement to make the domain tasks/cores management more clear:

a. update the total_num_tasks and num_cores in the helper pair
domain_register/unregister().

b. update the enabled[core] and enabled_cores in the helper pair
domain_enable/disable().

c. for the completed task, do the domain client/task update in the newly
created helper schedule_ll_task_done().

d. cleanup and remove the tasks/cores management from the ll_schedule
helpers schedule_ll_clients_enable(), schedule_ll_tasks_run(),
schedule_ll_domain_set() and schedule_ll_domain_clear().

Without this refinement, num_clients was not used correctly:
we used it for both num_registered_cores and num_enabled_cores, and with
that we observed that it could change to '-1' when a component/pipeline
is asked to run on a slave core.

This reverts changes in 'commit b500999 ("scheduler: guard against
subdivision underflow on domain clear")' as it is not needed anymore.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
Add a new_target_tick to store the new target tick for the next set,
which is used during the reschedule stage.

Update the new_target_tick as tasks are done on each core, and do the
final domain_set() at the point when tasks on all cores have finished.

Signed-off-by: Keyon Jie <yang.jie@linux.intel.com>
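
A rough illustration of that last commit's idea follows; this is a simplified sketch, not the SOF code, with stand-in types and a stubbed domain_set(), and it assumes the domain lock or equivalent serialization protects new_target_tick as discussed in the review above:

#include <stdatomic.h>
#include <stdint.h>

#define NO_TARGET UINT64_MAX

struct ll_schedule_domain {
	atomic_int pending_cores;	/* cores that have not finished this cycle */
	uint64_t new_target_tick;	/* earliest next deadline seen so far */
};

/* stub: the real code programs the timer/DMA interrupt for the next tick */
static void domain_set(struct ll_schedule_domain *d, uint64_t tick)
{
	(void)d;
	(void)tick;
}

/* Called when all LL tasks of one core are done for this cycle;
 * core_next_tick is that core's earliest pending deadline. */
static void core_tasks_done(struct ll_schedule_domain *d, uint64_t core_next_tick)
{
	/* each core lowers new_target_tick to its earliest pending deadline */
	if (core_next_tick < d->new_target_tick)
		d->new_target_tick = core_next_tick;

	/* only the last core to finish does the final domain_set() */
	if (atomic_fetch_sub(&d->pending_cores, 1) == 1) {
		domain_set(d, d->new_target_tick);
		d->new_target_tick = NO_TARGET;	/* reset for the next cycle */
	}
}
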
@keyonjie (Contributor Author) commented May 11, 2021

@slawblauciak the rescheduling part refinement has been added on top, please help to review.

@keyonjie (Contributor Author)

SOFCI TEST

@keyonjie (Contributor Author) commented May 12, 2021

@lgirdwood This PR will address > 90% of the failed cases observed here: http://sof-ci.sh.intel.com/#/result/planresultdetail/3868 and some other multi-core related issues observed by the slim driver; I think we need this for 1.8.

@lgirdwood (Member) left a comment

Is this intended for rc2? Are we happy with validation on TGL 12?


@keyonjie (Contributor Author) commented May 12, 2021

Is this intended for rc2? Are we happy with validation on TGL 12?

Yes, it is intended for rc2.
For TGL 12 validation, @slawblauciak @mwasko can you help cherry-pick the related commits to the 12 branch (there could be conflicts, so other related commits may need to go together)? Or do you prefer to branch out a new branch from the latest main for the release?

@lgirdwood (Member)

@keyonjie can you check the Zephyr build CI?

@marc-hb (Collaborator) commented May 12, 2021

I've seen similar issues in other PRs, hopefully just a temporary glitch on llvm.org. I re-ran the job and it passed.

https://github.com/thesofproject/sof/pull/4089/checks?check_run_id=2554961695

Err:7 http://apt.llvm.org/focal llvm-toolchain-focal-12/main i386 Packages
  File has unexpected size (2110 != 2108). Mirror sync in progress? [IP: 151.101.250.49 443]

@marc-hb (Collaborator) commented May 12, 2021

Err:7 http://apt.llvm.org/focal llvm-toolchain-focal-12/main i386 Packages
  File has unexpected size (2110 != 2108). Mirror sync in progress? [IP: 151.101.250.49 443]

I just removed the entire dependency on llvm.org in #4182, please review.

@keyonjie (Contributor Author)

@keyonjie can you check the Zephyr build CI?

Yes, as @marc-hb mentioned, that looks like a common issue for PRs; @marc-hb just helped address it.

@lgirdwood lgirdwood merged commit fc0fe9f into thesofproject:main May 13, 2021