Efficiency on manycore/fabric CPUs #9

Open
dumblob opened this issue Jun 1, 2022 · 8 comments

dumblob commented Jun 1, 2022

One of the unsolved questions in the realm of work stealing is the fixed (as opposed to dynamic/elastic) number of workers: it is determined ahead of time, with no way to grow or shrink the pool arbitrarily at run time.

Most modern devices have multicore architectures and there is a strong trend towards commissioning manycore systems. There are at least two distinct driving forces behind the desire to grow/shrink the number of workers at run time.

  1. long-running applications that allow on-the-fly exchange/addition/removal of processing cores (typical for manycore servers, fabric CPUs, etc.) without interruption, maintaining high throughput and low latency despite constant heavy load
  2. power efficiency on mobile devices (smartphones with 8+ cores and big.LITTLE/non-unified architectures) - these move tasks between performance cores and less performant cores, frequently turn off whole cores, and turn them back on very quickly once the need arises (typically in response to human input, e.g. switching between a game window and a web browser)

In both cases we have a non-negligible number of running workers (tens, hundreds, or even thousands) and we map them to cores. If the number of workers does not match the number of cores, either context switching becomes dominant and the app's performance drops significantly (when multiple workers are mapped to one low-performance core in response to power-saving demands), or the app does not use the full power of the hardware (when cores are added to the server, or are turned on quickly on a smartphone in response to the user starting a game).

What are the plans for Lace to embrace all this?

trolando commented Jun 2, 2022

Right now, one can simply stop all workers and restart Lace to change the number of workers. That is not exactly on the fly; the problem with doing it truly on the fly is that the task queues need to support it, and I imagine it is pretty difficult to untangle all the dependencies between tasks.

Lace is not meant for general-purpose applications anyway; it is meant for research programs that basically run solo on a big machine to do some task, not to run concurrently with other user applications. If someone wants to reimagine the framework, or use the work-stealing deque in a framework that supports those things, be my guest.

dumblob commented Jun 2, 2022

Yep, as I mentioned above, this is a highly under-researched feature. But it is necessary even on big machines (where booked resources provisioned by a hypervisor can be changed fully on the fly) - otherwise the running apps need to maintain persistent state themselves, which greatly increases their complexity.

If you know of any references or people researching this, I would very much welcome talking to them.

trolando commented Jun 3, 2022

I have not been active in parallel processing communities for quite a while. I was briefly engaged when I wrote Lace back in 2013 and visited EuroPar, but have not published or engaged since.

I can imagine that someone (e.g. an MSc or PhD student) could adapt Lace for on-the-fly addition (trivial) and removal (slightly more involved) of workers. There are roughly two ways to deal with worker death. The core issue is that local task variables live on the program stack of the worker, so if the worker is going to die, something has to be done with those local variables. One option is to have some way of saving a task's state, but that would require such significant changes to tasks that one might as well completely replace the framework to allow this kind of synchronization. The other option is to simply "unsteal" the work: this is relatively easy to implement - when dying, the worker simply has to "unsteal" all the tasks it has stolen - but it introduces a large cost when workers steal extensively from each other.
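
To make the "unsteal" option concrete, here is a minimal sketch of that route; all names (`worker_t`, `deque_push`, `worker_retire`, ...) are hypothetical and not part of Lace's actual API:

```c
/* Hypothetical sketch of the "unsteal" option, not Lace's real API:
 * a retiring worker returns every unfinished task it stole back to the
 * deque of the worker it stole it from, then exits. */
#include <stddef.h>

typedef struct task   task_t;
typedef struct worker worker_t;

struct task {
    worker_t *owner;      /* worker whose deque this task was stolen from */
    task_t   *next;       /* intrusive list of tasks currently held by the thief */
};

#define DEQUE_CAP 1024

struct worker {
    task_t *deque[DEQUE_CAP];  /* toy stand-in for the real work-stealing deque */
    size_t  tail;
    task_t *stolen_head;       /* tasks this worker stole and has not yet finished */
};

/* Toy owner-side push; a real deque would need the usual memory-ordering care. */
static void deque_push(worker_t *owner, task_t *t)
{
    if (owner->tail < DEQUE_CAP)
        owner->deque[owner->tail++] = t;
}

/* Return every unfinished stolen task to the deque it was taken from. */
static void worker_unsteal_all(worker_t *self)
{
    task_t *t = self->stolen_head;
    while (t != NULL) {
        task_t *next = t->next;
        deque_push(t->owner, t);
        t = next;
    }
    self->stolen_head = NULL;
}

/* Called when the worker is asked to retire: unsteal, then let the thread exit. */
void worker_retire(worker_t *self)
{
    worker_unsteal_all(self);
}
```

The cost mentioned above shows up in `worker_unsteal_all`: every stolen-but-unfinished task has to be handed back before the thread may exit.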

dumblob commented Jun 3, 2022

Thanks for the pointers! If I find someone interested, I will contact you.

I will leave this thread open as an open topic.

trolando closed this as completed Sep 9, 2022
dumblob mentioned this issue Sep 10, 2022

dumblob commented Sep 12, 2022

This is a continuation of the discussion from #10 (comment) to not pollute that PR's thread:

I think the proposed approach (i.e. only grow the number of threads and let idle ones sleep) is a good one, as sleeping threads are IMHO efficient enough and should pose almost no harm to the running system (power-wise, performance-wise, or latency-wise, even in the long run). So yes, I would be totally in favor of not really removing threads on the fly but just letting them sleep.

Thoughts?

trolando commented Feb 14, 2023

Let's say some work is being done by 6 out of 8 Lace workers and 2 are sleeping. How would they wake up when work becomes available? Should every worker that pushes new work wake up all potentially sleeping threads? Should they do that each time a task is pushed to the queue?

dumblob commented Feb 15, 2023

> Let's say some work is being done by 6 out of 8 Lace workers and 2 are sleeping. How would they wake up when work becomes available? Should every worker that pushes new work wake up all potentially sleeping threads?

I guess not if it is just one task (though would that scale well enough?). But if it is a batch of tasks, then probably yes.

> Should they do that each time a task is pushed to the queue?

If Lace has unshared per-worker queues, then I think all these decisions could be made heuristically based on the fill percentage of those queues, or better, on their fill rate (see the sketch at the end of this comment). But again, this is just guess-work.

These are just my speculations - real tests of both approaches are needed to decide this.
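
As a rough illustration of that heuristic (purely speculative; the names and the threshold are made up and not Lace's API), the owner could wake sleepers only when its own queue has grown by a whole batch since the last wake-up:

```c
/* Speculative sketch: wake sleeping workers only when the local queue has
 * grown by at least WAKE_BATCH tasks since the last wake-up.
 * All names and the threshold are illustrative, not Lace's actual API. */
#include <stddef.h>
#include <stdio.h>

#define WAKE_BATCH 32            /* batch size; would need tuning on real workloads */

typedef struct {
    size_t head;                 /* steal/consume end of the per-worker queue */
    size_t tail;                 /* owner push end of the per-worker queue */
    size_t fill_at_last_wake;    /* fill level when sleepers were last woken */
} queue_stats_t;

/* Stub: in a real implementation this would unpark workers via a futex/condvar. */
static void wake_sleeping_workers(void)
{
    printf("waking sleeping workers\n");
}

/* Called by the owner right after pushing a task. */
static void maybe_wake(queue_stats_t *q)
{
    size_t fill = q->tail - q->head;
    if (fill >= q->fill_at_last_wake + WAKE_BATCH) {
        q->fill_at_last_wake = fill;
        wake_sleeping_workers();
    }
}

int main(void)
{
    queue_stats_t q = {0, 0, 0};
    for (int i = 0; i < 100; i++) {   /* push 100 tasks; wakes fire at fill 32, 64, 96 */
        q.tail++;
        maybe_wake(&q);
    }
    return 0;
}
```

Whether a per-push check like this beats a simple broadcast on every push is exactly the kind of thing that would need measuring.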

trolando commented Mar 5, 2023

Would this help? 67eaa42

It suspends all Lace workers when there is no work, and resumes them as soon as work is offered. The assumption is that Lace is used for fine-grained work anyway, so if there is any work at all, it should be enough to keep all workers busy.
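
For reference, a suspend/resume gate of that kind can be sketched with a plain condition variable; this is only an illustration of the idea, not the actual code in 67eaa42, and the function names below are made up:

```c
/* Illustrative suspend/resume gate using pthreads; not the code in the
 * commit above, and the names are invented for this sketch. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
static bool            work_available = false;

/* Workers block here while the framework has no work at all. */
void workers_wait_for_work(void)
{
    pthread_mutex_lock(&gate_lock);
    while (!work_available)
        pthread_cond_wait(&gate_cond, &gate_lock);
    pthread_mutex_unlock(&gate_lock);
}

/* Called when new root work is offered: open the gate and wake everyone. */
void workers_resume_all(void)
{
    pthread_mutex_lock(&gate_lock);
    work_available = true;
    pthread_cond_broadcast(&gate_cond);
    pthread_mutex_unlock(&gate_lock);
}

/* Called when the root work is done: close the gate so workers park again. */
void workers_suspend_all(void)
{
    pthread_mutex_lock(&gate_lock);
    work_available = false;
    pthread_mutex_unlock(&gate_lock);
}
```

Broadcasting only on the no-work to work transition keeps the wake-up cost off the per-task push path, which fits the fine-grained-work assumption above.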

trolando reopened this Mar 5, 2023