Efficiency on manycore/fabric CPUs #9

Open
dumblob opened this issue Jun 1, 2022 · 8 comments

dumblob commented Jun 1, 2022

One of the unsolved questions in the realm of work stealing is the fixed (as opposed to dynamic/elastic) number of workers: it is determined ahead of time, with no way to grow or shrink the pool arbitrarily at run time.

Most modern devices have multicore architectures and there is a strong trend towards commissioning manycore systems. There are at least two distinct driving forces behind the desire to grow/shrink the number of workers at run time.

  1. long-running applications that allow on-the-fly exchange/addition/removal of processing cores (typical for manycore servers, fabric CPUs, etc.) without interruption, maintaining high throughput and low latency despite constant heavy load
  2. power efficiency on mobile devices (smartphones with 8+ cores and big.LITTLE/non-unified architectures) - these move tasks between performance cores and less performant cores, frequently turn off whole cores, and turn them back on very quickly once the need arises (typically in response to human input, e.g. switching between a game window and a web browser)

In both cases we have a non-negligible number of running workers (tens, hundreds, or even thousands) and we map them to cores. If the number of workers does not match the number of cores, either context switching becomes dominant and the app's performance drops significantly (when multiple workers are mapped to one low-performance core in response to power-saving demands), or the app does not use the full power of the hardware (when cores are added to the server, or are turned on quickly on a smartphone in response to the user starting a game).

What are the plans for Lace to embrace all this?

trolando commented Jun 2, 2022

Right now, one can simply stop all workers and restart Lace to change the number of workers. That is not exactly on the fly; the problem with doing it truly on the fly is that the task queues need to support it, and I imagine it is pretty difficult to untangle all the dependencies between tasks.

Lace is not meant for general-purpose applications anyway; it is meant for research programs that basically run solo on a big machine to do some task, not to run concurrently with other user applications. If someone wants to reimagine the framework, or use the work-stealing deque in a framework that supports those things, be my guest.

dumblob commented Jun 2, 2022

Yep, as I mentioned above, this is a highly under-researched feature. But it is necessary even on big machines (where booked resources provisioned by a hypervisor can be changed fully on the fly) - otherwise the running apps need to maintain persistent state themselves, which greatly increases their complexity.

If you know of any references or people researching this, I would very much welcome talking to them.

trolando commented Jun 3, 2022

I have not been active in parallel processing communities for quite a while. I was briefly engaged when I wrote Lace back in 2013 and visited EuroPar, but have not published or engaged since.

I can imagine that someone (e.g. an MSc or PhD student) could adapt Lace for on-the-fly addition (trivial) and removal (slightly more involved) of workers. There are roughly two ways to deal with worker death. The core issue is that local task variables live on the program stack of the worker, so if the worker is going to die, something has to be done with those local variables. One option is to have some way of saving a task's state, but that would require such significant changes to tasks that one might as well completely replace the framework to allow this kind of synchronization. The other option is to simply "unsteal" the work: this is relatively easy to implement - when dying, the worker simply has to "unsteal" all the tasks it has stolen - but it introduces a large cost when workers steal extensively from each other.
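
To make the "unsteal" option concrete, here is a minimal sketch of that route; all names (`worker_t`, `deque_push`, `worker_retire`, ...) are hypothetical and not part of Lace's actual API:

```c
/* Hypothetical sketch of the "unsteal" option, not Lace's real API:
 * a retiring worker returns every unfinished task it stole back to the
 * deque of the worker it stole it from, then exits. */
#include <stddef.h>

typedef struct task   task_t;
typedef struct worker worker_t;

struct task {
    worker_t *owner;      /* worker whose deque this task was stolen from */
    task_t   *next;       /* intrusive list of tasks currently held by the thief */
};

#define DEQUE_CAP 1024

struct worker {
    task_t *deque[DEQUE_CAP];  /* toy stand-in for the real work-stealing deque */
    size_t  tail;
    task_t *stolen_head;       /* tasks this worker stole and has not yet finished */
};

/* Toy owner-side push; a real deque would need the usual memory-ordering care. */
static void deque_push(worker_t *owner, task_t *t)
{
    if (owner->tail < DEQUE_CAP)
        owner->deque[owner->tail++] = t;
}

/* Return every unfinished stolen task to the deque it was taken from. */
static void worker_unsteal_all(worker_t *self)
{
    task_t *t = self->stolen_head;
    while (t != NULL) {
        task_t *next = t->next;
        deque_push(t->owner, t);
        t = next;
    }
    self->stolen_head = NULL;
}

/* Called when the worker is asked to retire: unsteal, then let the thread exit. */
void worker_retire(worker_t *self)
{
    worker_unsteal_all(self);
}
```

The cost mentioned above shows up in `worker_unsteal_all`: every stolen-but-unfinished task has to be handed back before the thread may exit.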

dumblob commented Jun 3, 2022

Thanks for the pointers! If I find someone interested, I will contact you.

I will leave this thread open as an open topic.

trolando closed this as completed Sep 9, 2022
dumblob mentioned this issue Sep 10, 2022

dumblob commented Sep 12, 2022

This is a continuation of the discussion from #10 (comment) to not pollute that PR's thread:

I think the proposed approach (i.e. only grow the number of threads and let idle ones sleep) is a good one, as sleeping threads are IMHO efficient enough and should pose almost no harm to the running system (power-wise, performance-wise, or latency-wise, even in the long run). So yes, I would be totally in favor of not really removing threads on the fly but just letting them sleep.

Thoughts?

trolando commented Feb 14, 2023

Let's say some work is being done by 6 out of 8 Lace workers and 2 are sleeping. How would they wake up when work becomes available? Should every worker that pushes new work wake up all potentially sleeping threads? Should they do that each time a task is pushed to the queue?

dumblob commented Feb 15, 2023

> Let's say some work is being done by 6 out of 8 Lace workers and 2 are sleeping. How would they wake up when work becomes available? Should every worker that pushes new work wake up all potentially sleeping threads?

I guess not if it is just one task (though would that scale well enough?). But if it is a batch of tasks, then probably yes.

> Should they do that each time a task is pushed to the queue?

If Lace has unshared per-worker queues, then I think all these decisions could be made heuristically based on the fill percentage of those queues, or better, on their fill rate (see the sketch at the end of this comment). But again, this is just guess-work.

These are just my speculations - real tests of both approaches are needed to decide this.
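
As a rough illustration of that heuristic (purely speculative; the names and the threshold are made up and not Lace's API), the owner could wake sleepers only when its own queue has grown by a whole batch since the last wake-up:

```c
/* Speculative sketch: wake sleeping workers only when the local queue has
 * grown by at least WAKE_BATCH tasks since the last wake-up.
 * All names and the threshold are illustrative, not Lace's actual API. */
#include <stddef.h>
#include <stdio.h>

#define WAKE_BATCH 32            /* batch size; would need tuning on real workloads */

typedef struct {
    size_t head;                 /* steal/consume end of the per-worker queue */
    size_t tail;                 /* owner push end of the per-worker queue */
    size_t fill_at_last_wake;    /* fill level when sleepers were last woken */
} queue_stats_t;

/* Stub: in a real implementation this would unpark workers via a futex/condvar. */
static void wake_sleeping_workers(void)
{
    printf("waking sleeping workers\n");
}

/* Called by the owner right after pushing a task. */
static void maybe_wake(queue_stats_t *q)
{
    size_t fill = q->tail - q->head;
    if (fill >= q->fill_at_last_wake + WAKE_BATCH) {
        q->fill_at_last_wake = fill;
        wake_sleeping_workers();
    }
}

int main(void)
{
    queue_stats_t q = {0, 0, 0};
    for (int i = 0; i < 100; i++) {   /* push 100 tasks; wakes fire at fill 32, 64, 96 */
        q.tail++;
        maybe_wake(&q);
    }
    return 0;
}
```

Whether a per-push check like this beats a simple broadcast on every push is exactly the kind of thing that would need measuring.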

trolando commented Mar 5, 2023

Would this help? 67eaa42

It suspends all Lace workers when there is no work, and resumes them as soon as work is offered. The assumption is that Lace is used for fine-grained work anyway, so if there is any work at all, it should be enough to keep all workers busy.
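
For reference, a suspend/resume gate of that kind can be sketched with a plain condition variable; this is only an illustration of the idea, not the actual code in 67eaa42, and the function names below are made up:

```c
/* Illustrative suspend/resume gate using pthreads; not the code in the
 * commit above, and the names are invented for this sketch. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
static bool            work_available = false;

/* Workers block here while the framework has no work at all. */
void workers_wait_for_work(void)
{
    pthread_mutex_lock(&gate_lock);
    while (!work_available)
        pthread_cond_wait(&gate_cond, &gate_lock);
    pthread_mutex_unlock(&gate_lock);
}

/* Called when new root work is offered: open the gate and wake everyone. */
void workers_resume_all(void)
{
    pthread_mutex_lock(&gate_lock);
    work_available = true;
    pthread_cond_broadcast(&gate_cond);
    pthread_mutex_unlock(&gate_lock);
}

/* Called when the root work is done: close the gate so workers park again. */
void workers_suspend_all(void)
{
    pthread_mutex_lock(&gate_lock);
    work_available = false;
    pthread_mutex_unlock(&gate_lock);
}
```

Broadcasting only on the no-work to work transition keeps the wake-up cost off the per-task push path, which fits the fine-grained-work assumption above.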

trolando reopened this Mar 5, 2023