Proper support for distributed computing, parallelism and concurrency #1868

Closed · dumblob opened this issue Sep 4, 2019 · 86 comments
Labels: Feature Request (This issue is made to request a feature.)

@dumblob (Contributor) commented Sep 4, 2019

There is one thing in V I care a lot about: parallelism & concurrency for seamless local and distributed programming.

Note: I didn't want to write this in #561, as the scope I'm describing is broader.

I'm confident every new language must provide first-class support for both parallelism and concurrency: concurrency, so that one physical CPU thread can schedule work cooperatively, and parallelism, so that the same go/v routine can be spawned on several physical threads. This is essentially what Go does (when GOMAXPROCS is set to 2 or more and the underlying operating system supports threads), except that Go always spawns just one instance of a goroutine regardless of the number of physical CPU cores - which is where V should improve (see below).

Why both concurrency and parallelism, and not just one of them?

In a world of big.LITTLE, GPU/FPGA offloading, CPUs with tens or hundreds of physical cores (plus hyperthreading on top), mobile devices trying to save as much energy as possible, NUMA supercomputers, and mobile & close-vicinity networks versus optical-fiber networks, it is inevitable that computation gets scheduled dynamically at the runtime of each application. Scheduling only operating-system-wide is too coarse and thus inefficient, and scheduling only cooperatively would use just one physical CPU core per application, which makes sense only for power-saving scenarios and nowhere else.

We have instruction-level parallelism covered (it's not yet present in V - see e.g. my rant about restrict - but it's a "solved problem" in today's compilers). The rest of parallelism is just about "how close can we get to pure dataflow programming", which turns out to be rather difficult.

Because instruction-level parallelism is solved, the best approach nowadays seems to be to:

  1. write as much serial code as can be highly optimized by instruction-level parallelism, all at once in one go/v routine (typically a tight loop, one tree of nested tight loops, a slow I/O, offloading to GPU/FPGA, or similar) - the outermost construct of this go/v routine will always be an endless loop yielding (i.e. calling the cooperative/parallel scheduler) every N iterations

  2. then schedule these go/v routines cooperatively on one physical CPU core with a fixed-size queue (single-producer single-consumer, "SPSC" - for performant implementations see the discussion below) between these go/v routines (this queue is called a channel in Go); the queue size might be between the size of the L1 cache and 2x the L2 cache, as suggested in paragraph 3.3 of Analyzing Efficient Stream Processing on Modern Hardware.

  3. and if any go/v routine becomes a bottleneck during computation (i.e. a consumer is slower than its producer for some time - easily monitored by the cooperative/parallel scheduler through the metric "how full are the queues"), spawn a new instance of the slow consumer go/v routine (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core), including all the plumbing work: spawning an additional go/v routine acting as a multiplexer that takes the producer's outgoing data and multiplexes it to the original as well as the new instance of the consumer, and spawning a demultiplexer go/v routine to join the outgoing data from both consumers (see the hand-written sketch of this plumbing after this list). All of this would be done by the cooperative/parallel scheduler.

  4. and if the queues become almost empty for some time, remove the multiplexers and demultiplexers.
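
For concreteness, here is a hand-written V sketch of the step-3 plumbing using channels as they exist in vlib today. The round-robin split and the -1 end marker are illustrative assumptions; in the proposal the scheduler would insert and remove this plumbing automatically:

```v
fn producer(out chan int) {
	for i in 0 .. 100 {
		out <- i
	}
	out.close()
}

// splitter plays the "multiplexer": it fans the producer's items out to two
// consumer instances once one consumer alone became a bottleneck.
fn splitter(src chan int, a chan int, b chan int) {
	mut flip := false
	for {
		v := <-src or { break } // the or-branch runs once src is closed
		if flip { a <- v } else { b <- v }
		flip = !flip
	}
	a.close()
	b.close()
}

// consumer does the (slow) work; -1 is an assumed end marker because two
// instances share the same sink channel.
fn consumer(src chan int, sink chan int) {
	for {
		v := <-src or { break }
		sink <- v * v
	}
	sink <- -1
}

// main plays the "demultiplexer": it merges both consumers' output streams.
fn main() {
	src := chan int{cap: 64}
	a := chan int{cap: 64}
	b := chan int{cap: 64}
	sink := chan int{cap: 64}
	go producer(src)
	go splitter(src, a, b)
	go consumer(a, sink)
	go consumer(b, sink)
	mut ended := 0
	for ended < 2 {
		v := <-sink
		if v == -1 { ended++ } else { println(v) }
	}
}
```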

The cooperative/parallel scheduler referenced above shall be a user-definable routine (if none is provided, a default built-in scheduler working roughly as described above will be used). This scheduler shall run preemptively (in its own thread?) and receive at least the following input arguments: thread handles, pointers to all go/v routines and the corresponding queues between them, a pointer to the built-in scheduler (in case the user just wants to wrap it), and a pointer to user-defined metadata (e.g. statistics allowing better scheduling judgement). This would allow for really complex scheduling on embedded systems as well as on NUMA systems or any other highly dynamic systems (e.g. a smartphone could leverage being briefly connected to a cluster of other computers or smartphones and offload some slow algorithm there). See e.g. MultiQueues.
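
A sketch of what such a scheduler hook could look like. None of these names exist in V - they are assumptions purely to illustrate the inputs listed above:

```v
// Snapshot of one go/v routine as the scheduler would see it (hypothetical).
struct RoutineInfo {
	thread_handle voidptr // OS thread the routine currently runs on
	queue_fill    f64     // inbound queue fill level, 0.0 (empty) .. 1.0 (full)
}

// Signature of a user-definable scheduler callback (hypothetical).
type SchedulerFn = fn (routines []RoutineInfo, user_data voidptr)

fn my_scheduler(routines []RoutineInfo, user_data voidptr) {
	for r in routines {
		if r.queue_fill > 0.9 {
			println('would spawn another consumer instance + mux/demux (step 3)')
		} else if r.queue_fill < 0.1 {
			println('would remove redundant instances and their plumbing (step 4)')
		}
	}
}

fn main() {
	s := my_scheduler
	s([RoutineInfo{
		queue_fill: 0.95
	}], voidptr(0))
}
```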

This way one can write an application once and deploy it to your grandma's wristwatch as well as to a supercomputer (assuming the user-level scheduler is a "bit more" clever than outlined above - imagine complexity similar to an SQL query optimizer for a distributed DB). It's also a highly failure-tolerant design (imagine Erlang, which even supports an online update of the whole application in memory while serving all clients under full load, without interruption!). Also, as you may have noticed, Go does not scale this way: first because there is no spawning of redundant goroutines (those which become bottlenecks) onto other physical CPU cores, and second because Go doesn't have a dynamic user-influenceable scheduler (taking into account advanced statistics, power consumption, etc.).

There is one catch though. If implemented naively, such elasticity doesn't guarantee global ordering among channels/queues (ordering holds only inside one channel), and real-life scenarios in IT more often than not rely on ordering. This has to be accounted for and will require user intervention (i.e. it must be syntactically explicit): either in the form of knowledge of (a) where to insert "tagging" of items before they are spilled among different channels and (b) where the "tagging" is removed and used to reassemble the original ordering; or in the form of assigning a strict priority to each channel; or using some other scheme.
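
A minimal V sketch of the tagging variant; the Tagged wrapper and the joiner are illustrative assumptions, not a proposed API:

```v
struct Tagged {
	seq  u64
	data int
}

// join_in_order restores global ordering behind a demultiplexer: it buffers
// out-of-order items until the next expected sequence number arrives.
fn join_in_order(src chan Tagged, out chan int) {
	mut pending := map[u64]int{}
	mut next := u64(0)
	for {
		t := <-src or { break }
		pending[t.seq] = t.data
		// drain every item that is now in order
		for {
			if next !in pending {
				break
			}
			out <- pending[next]
			pending.delete(next)
			next++
		}
	}
	out.close()
}
```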

See also lock-free and wait-free data structures (state of the art as of now), which might be handy for an efficient implementation of parallelism.

IMHO we could actually postpone the implementation of concurrency and stick with just parallelism (as it's easier to implement than the interleaving of concurrency and parallelism outlined above), implement the queue and a basic scheduler, and only then (maybe even after V 1.0) get back to concurrency. As of now we have parallelism without queues and without the scheduler, so we already have a good starting point.

Table of parallelism possibilities we have nowadays (just for reference):

| parallelism kind | suited for execution duration | latency overhead | suited for how frequent execution startup | internode bandwidth quality has influence | special parallel SW architecture required | requirements for execution & deployment |
|---|---|---|---|---|---|---|
| CPU non-JIT vectorization on a CPU core | very short to long | ~none | frequent | none (it's just 1 node) | no | binary for the particular CPU |
| CPU JIT vectorization on a CPU core | short to long | low | moderate to frequent | none (it's just 1 node) | no | app source code + VM binary for the particular CPU |
| CPU cores utilization (thread, process, ...) | long | low to moderate | moderate | lower to none (it's just 1 node) | yes (except: pure objects as actors or alike) | binary for the particular CPU |
| accelerators (GPU, FPGA, ...) | long | low to moderate | moderate | lower to none (it's just 1 node) | yes (except: array/matrix/tensor/...-based languages) | binary for the particular CPU and accelerator (GPU, FPGA, ...) |
| supercomputer with NUMA topology | long | moderate | sporadic to moderate | moderate (hypercube/... is really fast) | yes (except: pure objects as actors, etc., array/matrix/tensor/...-based languages) | binary for the particular CPU and accelerator (GPU, FPGA, ...) |
| LAN cluster | long | moderate to high | sporadic to moderate | moderate to high (can be fast, but not always) | yes | binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...) |
| WAN cluster | long | high | sporadic | high (basically can't be that fast) | yes | binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...) |

(in practice these might overlap and/or be used simultaneously)

P.S. The dynamic scaling over the different kinds of parallelism outlined in the table above is called "fine-grained distributed computing". And if anything in the above proposal sounds crazy to you, I can assure you the world doesn't sleep: there is at least one seamless (fully dynamic) solution offering first-class fine-grained distributed computing - Unison.


Other things to consider and/or track:

  1. As of March 2021 the fastest publicly known (and most comprehensive & general) parallelism & concurrency backend/library is Weave
  2. use a PRNG (pseudo-random number generator) which copes well with dynamic threading (i.e. doesn't suffer from "same seed leads to same random numbers in all threads") - see e.g. splittable PRNGs such as the one in JAX, and the SplitMix64-style sketch after this list
  3. maybe infinite loop in rand a7c8483#r39632493
  4. g_str_buf with parallel access:
  5. FPU stuff needs special attention from the threading (work-stealing) scheduler - see [Request] Optional per-thread startup procedure mratsim/weave#163
  6. under the hood we might utilize some wait-free and lock-free algorithms (see e.g. https://github.com/pramalhe/ConcurrencyFreaks )
  7. reconsider deadlock/livelock/no_thread_running/... detection - see dead lock error message like in Go #10340
  8. incorporate a "sharded" RwMutex into the standard library and use it also for V's built-in "sharded" data structures (map etc.)
  9. memory models of multicore architectures (x86, ARM, POWER, ...) are still sometimes not fully formally defined, but SW memory models are even worse at that 😮 https://research.swtch.com/mm
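
Regarding point 2, a minimal sketch of a splittable generator in the SplitMix64 style. The mixing constants are SplitMix64's; the split rule here is a simplification for illustration, not a vetted design:

```v
struct SplitRng {
mut:
	state u64
}

// next advances the state and applies the SplitMix64 finalizer.
fn (mut r SplitRng) next() u64 {
	r.state += 0x9e3779b97f4a7c15
	mut z := r.state
	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9
	z = (z ^ (z >> 27)) * 0x94d049bb133111eb
	return z ^ (z >> 31)
}

// split derives a child generator, so every spawned routine gets its own
// stream instead of a copy of the parent seed.
fn (mut r SplitRng) split() SplitRng {
	return SplitRng{
		state: r.next() ^ 0x6a09e667f3bcc909
	}
}

fn main() {
	mut parent := SplitRng{
		state: 42
	}
	mut child := parent.split() // hand this to the spawned routine
	println(parent.next())
	println(child.next())
}
```
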
@dumblob added the Feature Request label Sep 4, 2019
@gslicer commented Sep 5, 2019

Wow, you should write a full master's thesis around this topic one day ;)
For me I would prioritize this topic as follows:
Prio1: provide safe multi-threading/scheduler (no data races, locks, semaphores etc. -> ponylang.io)
Prio2: provide spawning and killing of new processes for SMP, where each can be multi-threaded
Prio3: provide processes for non-CPU resources like MMX/SSE, GPU/CUDA, PhysX, ray tracing, you name it
Prio4: provide distributed computing (cluster nodes over networks) -> julialang.org (of course this is something one could even code oneself, having good standard libs for networking in V)

@dumblob (Author) commented Sep 5, 2019

Wow, you should write a full master's thesis around this topic one day ;)

I already did, many years ago (wow, time flies) - but on a (very) different topic 😉.

Prio1: provide safe multi-threading/scheduler (no data races, locks, semaphores etc. -> ponylang.io)

I'd say let's keep things simple and actually provide no locks, no semaphores, etc., to "force" programmers to stick with the built-in parallelism. Of course, there will always be the unsafe module providing raw locks etc., but that has nothing to do with parallelism in V.

Prio2: provide spawning and killing of new processes for SMP, where each can be multi-threaded

There shall be just one construct - imagine go from Go, but nothing else. The point is to completely forget about SMP etc. - the programmer shall actually be strongly discouraged from assuming that the platform running her app is capable of SMP, and instead be encouraged to think in terms of fully dynamic behavior.
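
A minimal example of that single construct as V already has it (in current V this keyword was later renamed to spawn; the sleep exists only so this toy program doesn't exit before the thread prints):

```v
import time

fn say(msg string) {
	println(msg)
}

fn main() {
	go say('runs on another thread') // the one and only construct
	say('runs on the main thread')
	time.sleep(10 * time.millisecond)
}
```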

Prio3: provide processes for non-CPU resources like MMX/SSE, GPU/CUDA, PhysX, ray tracing, you name it

There is no need for this - if one wants to offload stuff, she would just use the unsafe or ffi module for interfacing with e.g. a CUDA library. In other words, instruction-level parallelism shall IMHO not have any explicit support in V (instruction-level parallelism is already supported in V through patterns, the same way as in C99+, to which V compiles - we should though "nudge" the compiler to do more vectorization and maybe even use smart auto-annotation SW). If utterly needed, explicit instruction-level parallelism could be added the same way as C11/C++17/non-standard intrinsics (__m128i, _mm_or_si128, _mm_cmpeq_epi8 and tens/hundreds of others).

Prio4: provide distributed computing (cluster nodes over networks) -> julialang.org (of course this is something one could even code oneself, having good standard libs for networking in V)

Actually rather not. There shall be no need for this - that's the responsibility of the programmer, who will define her own cooperative/parallel scheduler (as I outlined in my first post). This might be a simple scheduler, but also - as you pointed out - something super complex, adding tens/hundreds of new nodes and many new multiplexers and demultiplexers to the huge graph of go/v routines with queues scaling over heterogeneous clusters (for a moderately complex scheduler see the internals of Python's Dask). So this can't be part of V.

@gslicer commented Sep 6, 2019

@dumblob I see the "go" keyword as a builtin for multi-threading only (lightweight processes) - V could however try to load-balance the threads automatically among available CPU cores?

Fully agree that higher layers of parallelism should be spun off into the standard libs (or even separate lib projects written in V)

@dumblob (Author) commented Sep 6, 2019

I see the "go" keyword as a builtin for multi-threading only (lightweight processes)

My proposal is about changing it to semantics similar to Go's - i.e. the go keyword would spawn a go/v routine (aka CSP thread, green thread, fiber, or whatever you want to call it).

V could however try to load-balance the threads automatically among available CPU cores?

If you refer to operating system threads, then there is no need for V to load-balance them, as the operating system does it. If you refer to go/v routines, then sure, V will need a scheduler which is aware of the number of physical CPU cores available (disregarding whether they are tiny ones or fast ones, as e.g. in big.LITTLE), spawns that number of threads, and schedules go/v routines among them.

@dumblob (Author) commented Sep 9, 2019

As @aprell pointed out in nim-lang/RFCs#160 (comment), there is a nice and very current write-up describing the Go scheduler. It's a must-read for anyone trying to implement an efficient scheduler for the mix of parallelism & concurrency. Some downsides of the Go scheduler are the following:

  1. it treats all operating system threads uniformly, assuming each thread runs on one physical CPU core (or "hyperthread" on hyperthreading CPUs), but this is not true on big.LITTLE or in cases of power management (see my comment Project Picasso - A multithreading runtime for Nim nim-lang/RFCs#160 (comment)) - Go could solve this in the future by moving running goroutines among threads (there are some tricky corners, like thread-local storage, which would need to be atomically moved to the other thread as well, but these are solvable)

  2. there is no concept of "reentrant" goroutines in Go, and thus it's impossible to spawn new instances (or kill running ones) and do the muxing/demuxing as outlined in my initial post above to accommodate bottlenecks - unlike issue (1) above, this is not quickly solved in Go (it would probably require extending the language), and I think this could be the major advantage of V over Go

Otherwise Go implements pretty closely what I described in my initial post above - so it would make sense to "port" all the parallelism & concurrency logic (which is not exactly trivial code) to V and start improving on top of it.

@medvednikov (Member) commented

Thanks a lot @dumblob

This is the last unresolved thing in V for me, so all this info is very helpful.

I'll start working on this in early October, but if anyone is willing to do this earlier, they are more than welcome.

@gslicer commented Sep 9, 2019

so it would make sense to "port" all the parallelism & concurrency logic to V and start improving on top of it.

I hope you mean:
parallelism => preemptive
&
concurrency => cooperative
multitasking.

For me these are two distinct use cases: in the first one I rely on the OS to "load-balance" my threads among the available (by chance hyper-threaded) CPU cores, and in the second one I want control over which parts of the code have higher priority to be executed (coroutines), even if I do not know a priori how many CPU cycles or how much wait time a particular instruction block will use.

I hope one day we can state about V as follows (taken from the Pony-project):

"It’s data-race free. V doesn’t have locks or atomic operations or anything like that. Instead, the type system ensures at compile time that your concurrent program can never have data races. So you can write highly concurrent code and never get it wrong.

It’s deadlock free. This one is easy, because V has no locks at all! So they definitely don’t deadlock, because they don’t exist."

That's all from my side for now.

@dumblob mentioned this issue Sep 9, 2019
@dumblob (Author) commented Sep 9, 2019

I hope you mean:
parallelism => preemptive
&
concurrency => cooperative
multitasking.

Sure, I do.

For me these are two distinct use cases: in the first one I rely on the OS to "load-balance" my threads among the available (by chance hyper-threaded) CPU cores, and in the second one I want control over which parts of the code have higher priority to be executed (coroutines), even if I do not know a priori how many CPU cycles or how much wait time a particular instruction block will use.

I don't understand why these should be two different use cases. For me it's definitely just one use case - I want things to run purely in parallel (concurrency won't help me at all). But because parallelism is super expensive (memory consumption as well as context switching of operating system threads doesn't scale into the millions), I have to combine it with concurrency, leveraging all the extra knowledge I get from the programmer (unlike an operating system kernel, which doesn't know what the programmer intended and thus has to waste memory and context switches on fully featured operating system threads) to make go/v routines (i.e. concurrent threads) behave like truly parallel operating system threads - thus combining the preemptiveness of operating system threads with the memory and context-switching efficiency of go/v routines.

I really don't see any point in having concurrency if we get true parallelism for the same amount of resources and comfort.

Mind that my proposal gives one full power over scheduling by exposing the scheduler to the programmer.

"It’s data-race free. V doesn’t have locks or atomic operations or anything like that. Instead, the type system ensures at compile time that your concurrent program can never have data races. So you can write highly concurrent code and never get it wrong.

I'm pretty sure this will not be possible (see the latest research) - at least not with the current V semantics. Such guarantees can be achieved by constraining the programmer (see e.g. Rust, as already mentioned many times in V's discussions) or by making the language so incapable that it becomes very impractical 😢. What can be achieved in V, though, is to make races highly improbable (the syntax & semantics will guide the programmer into writing more or less race-free code) and on top of that to provide e.g. a race analyzer as Go does (and run it automatically as part of compilation?). This is the difference between a guarantee and high probability, and it's enormous.

Besides, if one utterly needs plain concurrency, it's easy to do it yourself (there are many implementations on GitHub, as it's really an easy high-school exercise). Even on embedded systems I always try one big event loop first, and if that's not enough, I resort directly to interrupts (which are costly but provide true preemptiveness) instead of using any coroutines, because coroutines won't bring me anything more than an event loop while being about as complex as true interrupts with preemptiveness.

@oleg-kachan commented Sep 16, 2019

upvote for parallelism + SIMD vectorization as requested here #1079

@dumblob (Author) commented Nov 4, 2019

@medvednikov some implementation ideas:

  1. there is a good-looking attempt to improve on a single-producer single-consumer ring buffer by avoiding the extra empty slot in the buffer, and especially by avoiding any modulo (which is a very slow instruction) - see Memory and Speed efficient ring-buffer design in Project Picasso - A multithreading runtime for Nim nim-lang/RFCs#160 (comment);
    the (hopefully) final implementation of a lock-less unbounded Multiple Producer Single Consumer queue (ring buffer) is to be found in Change to lockless unbounded mpsc mratsim/weave#21.
    Note though that the current implementation of V channels (vlib/sync/channels.v) uses something akin to SPSC with subscribing at run time. There is a potential alternative to the SPSC ring buffer in the form of a "BipBuffer" (bipartite buffer) offering higher performance. (A mask-based SPSC sketch follows after this list.)

  2. follow the yet incomplete design & implementation of a modern thread-safe object pool in Project Picasso - A multithreading runtime for Nim nim-lang/RFCs#160 (comment) (the objectives are really challenging, and if delivered without any side effects it'll be a leap forward)

  3. get inspired by the ideas as well as the actual implementation of mimalloc, the most stable and efficient (and probably fastest as of now) os-threads-friendly malloc() replacement which doesn't sacrifice single-thread performance
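
For reference, a minimal sketch of the mask-based SPSC idea from point 1: free-running u64 indices avoid both the modulo and the wasted slot. Plain fields are used here for brevity; a real implementation would use atomic loads/stores on head/tail, as the designs linked above do:

```v
struct Spsc {
mut:
	buf  []int
	mask u64
	head u64 // advanced only by the consumer
	tail u64 // advanced only by the producer
}

fn new_spsc(pow2_cap int) Spsc {
	return Spsc{
		buf: []int{len: pow2_cap}
		mask: u64(pow2_cap - 1)
	}
}

// try_push returns false when the buffer is full.
fn (mut q Spsc) try_push(v int) bool {
	if q.tail - q.head == u64(q.buf.len) {
		return false
	}
	q.buf[int(q.tail & q.mask)] = v // mask instead of modulo
	q.tail++
	return true
}

// try_pop returns none when the buffer is empty.
fn (mut q Spsc) try_pop() ?int {
	if q.tail == q.head {
		return none
	}
	v := q.buf[int(q.head & q.mask)]
	q.head++
	return v
}

fn main() {
	mut q := new_spsc(8)
	assert q.try_push(1)
	println(q.try_pop() or { -1 }) // 1
}
```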

@dumblob (Author) commented Nov 19, 2019

@medvednikov a possible implementation of highly efficient preemptive scheduling:

Note also that we can easily analyze at compile time the maximum stack size needed for each V parallel routine, so we don't need dynamic stacks (see also the cactus stack problem and an innovative solution mratsim/weave#24 - worth reading!), but rather a fixed-size array of a different size (known at compile time) for each V parallel routine. This saves a lot of trouble and allows for the best possible performance (yeah, for indirect & direct recursion, for FFI, and for unsafe stuff we actually do need dynamic stacks, which means tricks like the stack jump used in Go, but that shouldn't be very common in performance-critical places, so we can afford some performance penalty in such cases).

@dumblob (Author) commented Dec 1, 2019

There seems to be an implementation effort with very similar goals (maybe except for the reentrancy, which I'd definitely like to see in V as mentioned above) and with an extensive survey of existing implementations (Go is mentioned as well, as an interesting example).

mratsim/weave#22

@medvednikov (Member) commented

Thanks @dumblob I'll check these out.

@cristian-ilies-vasile commented

@dumblob Could you please take a look at Chapel, a parallel programming language (https://chapel-lang.org/ https://github.com/chapel-lang/ ) designed by Cray engineers? Maybe you can borrow some design ideas from Chapel.

@dumblob (Author) commented Jan 6, 2020

@cristian-ilies-vasile I already did that. Chapel has a large set of very nice features (e.g. regions, a multi-paradigm approach, etc.), but first, they don't seem to match V's goals (especially "there is only one way to do a thing"), and second, they're not generic enough (read: "Chapel is a pile of best practices & optimized principles for a - moderately large - selected finite number of use cases"). Due to this, Chapel programs do squeeze the maximum out of a given supercomputer, but can't run efficiently enough on non-supercomputer hardware (which is the majority nowadays - see the table in the initial post above) or on the hardware of tomorrow (yeah, stuff like several layers of remote memories with different access speeds instead of just local and remote, etc.). Chapel is also quite a complex language in my experience (despite the authors saying the opposite 😉).

I have to admit I couldn't find nearly anything in Chapel useful for V (the above proposal is generic enough to quite efficiently support nearly any of the Chapel abstractions, implemented later in specific V libraries - note especially the scheduler "callback").

If you have anything specific in mind, feel free to discuss it here though. I'll be more than happy to learn something or clarify or modify the proposal.

@cristian-ilies-vasile commented

@dumblob Thank you for the nice words.

//1 - framework to eliminate bugs
On the concurrency front, my understanding is that more or less the Go green-threads scheduler model (threads monitor) shall be used, with improvements. However, concurrency is hard.
According to the paper "Understanding Real-World Concurrency Bugs in Go" [1], it is easy to introduce concurrency bugs even when using Go.
It would be great if the compiler or a separate tool could check for such bugs and produce clever, human-readable error messages.

//2 - Function as a Service at run time
Quote:
To run foo() concurrently, just call it with go foo()
On top of this functionality, could we have vm foo(), where the V compiler creates some sort of micro virtual machine/container, so that the OS and in turn the hypervisor can move foo() to a different machine and execute the function there?
That microVM could be wrapped around the WASM binary format, so the processor of the remote server does not matter.

[1] https://songlh.github.io/paper/go-study.pdf
Thank you.

Regards,
Cristian.

@gslicer commented Jan 11, 2020

@cristian-ilies-vasile YES to //1 but NOOOO to //2 - V shall have no "runtime" or even a VM in my opinion - you are free to build your own VM using V though

@cristian-ilies-vasile commented

@gslicer I feel that you did not get it. It's not like Java and the JVM; my suggestion is related to one specific case: the coder could decide that a long-running function should automatically run on a different server/environment. The V ecosystem should provide tools and mechanisms for such an approach in the year 2020.

Think about C - it is still used after 50 years... so I presume V's lifetime will be the same. In this case, let's prepare the foundations to sustain this type of weight.

Regards,
Cristian.

@dumblob (Author) commented Jan 12, 2020

//1...

Well, my proposal diverges from Go in that the routines won't be cooperatively concurrent but preemptively parallel. This is more in line with the "standard threading" experience one gets on Windows or Unix (pthreads and the like) than with Go. So one can reuse the knowledge, tooling, etc., and more easily reason about the program than one can about Go programs.

My proposal goes further though - I envision built-in support (e.g. a dedicated syntax or even a new keyword) for "reentrancy" (a scheduler can spawn an arbitrary number of instances of a go routine at any time, and can also kill and remove an arbitrary number of them at any time). Second, the programmer shall get full power over scheduling (this might be achieved by many means, so I don't have any particular opinion on the implementation - imagine e.g. the most error-prone but widely known one: a callback).

//2

This sounds somewhat similar to the reentrancy & scheduler "callback" I'm proposing, but my proposal is about the primitives needed (the language syntax & semantics) rather than about any built-in "high-level" stuff like VMs. My goal is to offer the very minimum set of primitives that makes it possible to efficiently implement something like you describe in custom libraries, but not as part of V.

@cristian-ilies-vasile commented Jan 19, 2020

@dumblob Related to
//1
could you please add a protection mechanism like Time To Live (TTL) when a go routine is called, to set a maximum lifespan for that green thread?
If TTL is 0 (zero) it is ignored; otherwise, for a strictly positive integer value, V's green-thread monitor/scheduler must return an error if that value is exceeded.

@dumblob (Author) commented Jan 19, 2020

@cristian-ilies-vasile sounds like something easily doable in the app itself - in other words, I don't see any obstacle in my proposal which would prevent implementing such TTL functionality efficiently in your app without any cooperation from V's side. It also doesn't sound like default behavior for V's parallelism, so currently I don't see any reason to add explicit support directly in V.
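
For illustration, a minimal app-level sketch in today's V; work_finished is a hypothetical stand-in for the routine's real exit condition, and no cooperation from a V scheduler is involved:

```v
import time

// work_finished is a hypothetical stand-in for the routine's real exit test.
fn work_finished() bool {
	return false
}

// worker checks its own deadline and reports the outcome over a channel.
fn worker(ttl time.Duration, done chan bool) {
	start := time.now()
	for {
		// ... one slice of real work ...
		if work_finished() {
			done <- true
			return
		}
		if time.since(start) > ttl {
			done <- false // TTL exceeded
			return
		}
	}
}

fn main() {
	done := chan bool{cap: 1}
	go worker(100 * time.millisecond, done)
	ok := <-done
	println(if ok { 'finished in time' } else { 'TTL exceeded' })
}
```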

Or am I missing something? Please elaborate 😉.

@cristian-ilies-vasile commented

@dumblob
It's up to you; however, if we follow your school of thinking ("easily doable in the app itself"), the green thread / fiber logic should also be placed at the app level and not be part and parcel of the V runtime, because then we would operate only with low-level primitives.

Indeed, one can implement Timing Wheels logic on one's own, but wouldn't it be more helpful for coders to use a rock-solid language abstraction out of the box, with TTL included?

Regards,
Cristian.

P.S.
For the goroutine scheduler/monitor I feel this blog post [1] is worth reading; the data structures involved (Range-Encoded Bitmaps, Range-Encoded Bit-Slice Indexes) are implemented in Go [2]
[1] https://www.pilosa.com/blog/range-encoded-bitmaps/
[2] https://github.com/pilosa/pilosa

@dumblob (Author) commented Jul 21, 2020

THE Fuzzer of Fuzzy Latency for User Space Threads

This seems more suited to purely non-preemptive scheduling. Otherwise it appears to defeat the whole point of determinism, especially in latency-critical situations like games, etc. - an area where Go has had (and still has) historical issues. Btw. Weave doesn't have this problem, and if fully strict ordering of request processing is needed, Weave even offers a "gather it all in one place" queue for such scenarios. I envision something similar in V.

Btw. I'm slowly reading all the newly referenced materials and will hopefully get to this THE Fuzzer of Fuzzy Latency for User Space Threads soon.

@cristian-ilies-vasile commented

@dumblob
THE Fuzzer of Fuzzy Latency for User Space Threads is designed only for debugging purposes. I forgot to mention this detail.

@cristian-ilies-vasile commented

// 8
Nice to have:
A way to let the coder adjust 3-5 of the thread scheduler/monitor's run-time parameters.
At compile time we could use a special knob, "-trace_threads", which would instruct the thread scheduler to track and collect the necessary internal parameters during a "preflight" mode, when the application runs in a canary/preprod environment. When the application exits, those parameters would be written to a text file.
Based on the collected data and maybe a few formulas, the negative feedback loop can be closed and the scheduler's main run-time attributes could be fine-tuned.

@cristian-ilies-vasile commented

I found a paper studying the influence of processor frequency on task completion times.
In essence, in the real world the processor frequency moves up and down, generally between 1 GHz and 3 GHz.

Based on this, one can think of a mechanism for allocating threads, including "overloading" at the processor level, that dispatches more user-space threads onto fast CPUs.

Paper: Fork/Wait and Multicore Frequency Scaling: a Generational Clash
Link: https://hal.inria.fr/hal-02349987v2/document

@cristian-ilies-vasile commented Aug 7, 2020

@dumblob
dumblob commented on Nov 19, 2019:
Note also that we can easily analyze at compile time the maximum stack size needed for each V parallel routine

Well, it is not so easy. If there is a piece of code inside a go fn() that uses recursive calls, the static measurement will fail.
We need to:

  • detect recursive calls at compile time
  • add a reasonable amount of RAM for that particular stack on top of the computed one
  • enforce proper checks at run time against stack-overflow disasters

@cristian-ilies-vasile commented

@dumblob
Do you know of resources describing how to multiplex/demultiplex M user-space processes onto N kernel threads, where N = number of CPUs?
(see also krinkinmu/userspace-threads#1 ) - this part is missing there.
How to feed a limited number of kernel/OS threads with preemptive tasks?
Thank you.

@dumblob (Author) commented Aug 14, 2020

Answering in a hurry, bear with me.

@dumblob
Do you know of resources describing how to multiplex/demultiplex M user-space processes onto N kernel threads, where N = number of CPUs?
(see also krinkinmu/userspace-threads#1 ) - this part is missing there.
How to feed a limited number of kernel/OS threads with preemptive tasks?

I'm not sure I understand, but it sounds like you're asking about the basic idea behind work stealing 😉. If so, my answer is: "See any good work-stealing implementation (the best I've seen is in Weave; another good implementation is the Go scheduler)."

Nonetheless, there is no work-stealing implementation I know of which handles dynamic removal and addition of physical CPU cores efficiently (i.e. changing N dynamically at runtime - e.g. in response to power saving). I envision the work-stealing scheduler should respond by spawning new OS threads or stopping (joining) running OS threads - but first stop stealing work in the designated OS threads, e.g. those having the highest identification number (because this condition is detectable in each OS thread separately, without stop-the-world) or those having the shortest queues, and wait for their work queues to empty.

Btw. https://github.com/krinkinmu/userspace-threads/ (pure preemptive scheduling in user space) could be merged with code from Weave (pure non-preemptive scheduling in user space) and we should be good to go. Actually, rather incorporate preemptiveness (inspired by https://github.com/krinkinmu/userspace-threads/ ) into Weave, as Weave is formally verified to be correct, already quite polished, and - as you wrote - hella fast (it scales nearly linearly with the number of physical cores! There is no other library this capable that I know of!).
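
As a toy starting point, the work-sharing cousin of that idea is already expressible in today's V: M tasks multiplexed onto N OS threads through one shared channel. Real work stealing would give each worker its own deque plus a steal path, which this sketch deliberately omits:

```v
import runtime

// worker drains the shared queue until it is closed.
fn worker(id int, tasks chan int, done chan bool) {
	for {
		task := <-tasks or { break } // the or-branch runs once the channel closes
		println('worker ${id} ran task ${task}')
	}
	done <- true
}

fn main() {
	n := runtime.nr_cpus() // N OS threads, matching the CPU count
	tasks := chan int{cap: 64}
	done := chan bool{cap: n}
	for id in 0 .. n {
		go worker(id, tasks, done)
	}
	for t in 0 .. 100 { // M tasks, M >> N
		tasks <- t
	}
	tasks.close()
	for _ in 0 .. n {
		fin := <-done
		assert fin
	}
}
```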

@shelby3 commented Sep 15, 2020

@dumblob wrote:

3. and if any go/v routine becomes a bottleneck during computation (i.e. a consumer is slower than its producer for some time - easily monitored by the cooperative/parallel scheduler through the metric "how full are the queues"), spawn a new instance of the slow consumer go/v routine (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core), including all the plumbing work

[…]

2. there is no concept of "reentrant" goroutines in Go, and thus it's impossible to spawn new instances (or kill running ones) and do the muxing/demuxing as outlined in my initial post above to accommodate bottlenecks - unlike issue (1) above, this is not quickly solved in Go (it would probably require extending the language), and I think this could be the major advantage of V over Go

As I mentioned in my post in the related issue thread #3814, I think this reentrancy invariant is onerous and should be dropped. Unless I am misunderstanding what you intend “reentrancy” to mean in this context?

2. Actually channels should be encouraged and anything else strongly discouraged as anything shared leads to locks and locks translate to stalls so the programmer should not think of concurrency in terms of anything shared, but rather in terms of a "laystall", i.e. "one-way pipes" with buffered data recently processed by one processing unit instance and waiting there to be processed by another processing unit instance in one batch while allowing the first processing unit instance to do some other work in the mean time.

Contention leads to a hornet’s nest of Heisenbugs. Lockfree is the only way to go, please. Lift the semantics out of the low-level gutter. Or I mean at least strongly discourage the use of low-level thread synchronization primitives.

@dumblob (Author) commented Sep 15, 2020

As I mentioned in my post in the related issue thread #3814, I think this reentrancy invariant is onerous and should be dropped. Unless I am misunderstanding what you intend “reentrancy” to mean in this context?

Reentrancy is a flag a go routine gets from the programmer, serving as a covenant with the meaning: "I, the programmer, hereby certify that the following go routine may be spawned 1+ times at any time, and any running instance, if it's not the last one, may be stopped and discarded at any time." It's just an optimization giving the scheduler a huge amount of freedom to make scheduling fully dynamic in time (i.e. spawning an "unlimited" number of instances of the same go routine, or stopping/discarding running instances if throughput is no longer a bottleneck) and in general more efficient.
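
To make the covenant concrete, a sketch of what a flagged routine might look like. The [reentrant] attribute is purely hypothetical (shown as a comment so the snippet stays valid V today); the function qualifies because it keeps no state outside its own stack and channels:

```v
struct Job {
	x int
}

fn process(j Job) int {
	return j.x * j.x
}

// [reentrant]  <- hypothetical marker, not an existing V attribute.
// The scheduler may run any number of instances of this function against the
// same channels, and may stop any instance that is not the last one.
fn consumer(jobs chan Job, results chan int) {
	for {
		job := <-jobs or { break }
		results <- process(job)
	}
}
```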

Contention leads to a hornet’s nest of Heisenbugs. Lockfree is the only way to go, please. Lift the semantics out of the low-level gutter.

Hm, I hope you don't confuse lock-free and lockless. The proposal above is fully preemptive, and thus what the language (V) itself offers is always lock-free. The programmer, though, can freely build lock-free as well as non-lock-free algorithms (V has unsafe{} to pinpoint usage of structures and approaches leading to potentially non-lock-free behavior). I hope it's clear that V can't guarantee anything if any unsafe{} section is being used, and thus I believe your concern is 100% addressed by not using unsafe{} at all 😉.

Note though that I'm not sure V is already in a state where this lock-free guarantee holds in all cases - V is still alpha software, so keep that in mind.

But if you meant lockless, then I'd disagree. Please google for lockless variants of efficient lock-based algorithms and you'll be surprised that lockless is only sometimes more efficient in practice (on top of that, lockless quite often suffers from other symptoms). It's by far not this easy to say "lockless is best"...

Or I mean at least strongly discourage the use of low-level thread synchronization primitives.

That's what we also advocate for - mainly through unsafe{}, and second by promoting the use of channels and other higher-level, though still efficient, primitives for communication among go routines.

@shelby3 commented Sep 15, 2020

As I mentioned in my post in the related issue thread #3814, I think this reentrancy invariant is onerous and should be dropped. Unless I am misunderstanding what you intend “reentrancy” to mean in this context?

Reentrancy is a flag a go routine gets from the programmer, serving as a covenant with the meaning: "I, the programmer, hereby certify that the following go routine may be spawned 1+ times at any time, and any running instance, if it's not the last one, may be stopped and discarded at any time." It's just an optimization giving the scheduler a huge amount of freedom to make scheduling fully dynamic in time (i.e. spawning an "unlimited" number of instances of the same go routine, or stopping/discarding running instances if throughput is no longer a bottleneck) and in general more efficient.

Oh, if it is optional then that seems more acceptable. The programmer could set the flag if, for example, they don’t mutate any resources shared between all copies of the thread. It is otherwise very difficult to prove compliance with the invariant - Rust’s exclusive write ownership is another means to prove reentrancy, but I personally find that an unpleasant paradigm to code in. And I explained why I think Pony’s model is more flexible than Rust’s.

I hope it's clear that V can't guarantee anything if any unsafe{} section is being used, and thus I believe your concern is 100% addressed by not using unsafe{} at all 😉.

That seems reasonable to me iff (if and only if) thread-synchronization primitives will only be accessible from unsafe{} code.

Hm, I hope you don't confuse lock-free and lockless.

There seems to be no salient distinction between the two when I distill their definitions down, because afaics in theory any code which employs locks will fail to meet the stated requirement of lockfree:

https://stackoverflow.com/questions/20467247/whats-the-difference-between-lockless-and-lockfree

It seems someone has it in their mind that the terms should be applied in different contexts?

So per my edit in my post in the #3814 thread, I understand now that Dmitry was referring to the scheduler not being lockfree, which is not what I was concerned with. I am referring to the APIs that the safety-conscious V programmer has access to, not the implementation of the runtime or of unsafe{} code.

But if you meant lockless, then I'd disagree. Please google for lockless variants of efficient lock-based algorithms and you'll be surprised that lockless is only sometimes more efficient in practice (on top of that, lockless quite often suffers from other symptoms). It's by far not this easy to say "lockless is best"...

I’m already aware of that. I mean to suggest avoiding the need to lock by avoiding contention, e.g. by allowing shared-memory access only to immutable data in code that is not marked unsafe{}. Locking synchronization in general cannot be proven free of live-locks, deadlocks, and insidious heisenbugs - analogously, a Turing-complete program can’t be proven to halt. Said heisenbugs can in some cases be impossible to track down.

This is a facet of Go that really disappoints me. Unfortunately there is no construct in the Go language for declaring and enforcing data to be immutable.

Note, as you may know, that immutable data structures have the advantage that they can’t contain circular references, which is important when employing ARC instead of a tracing GC, because ARC can’t collect circular references (although there exist probabilistic designs which attempt some localized tracing along with ARC).

One possible model for mutating shared immutable data is for the exclusive owner to create a new mutated, immutable copy (perhaps via an efficient data structure that only copies the changed part, which afaik is what Clojure employs), then send a message telling other threads to update the copy of the immutable data they reference. As I think you all know (bcz I’ve seen it referenced already), the Pony language has a reference-capabilities model for tracking which resources can be consumed into an immutable and then shared between threads. And there are other possible models, such as Arena/Actor-like partitions, which I have contemplated.
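
A toy V sketch of that model: a channel carries the "update your copy" message and array clones stand in for the efficient persistent structure (note that nothing here enforces immutability; in this sketch it is discipline by convention):

```v
// owner mutates only private clones and publishes each new version.
fn owner(updates chan []int) {
	mut current := [1, 2, 3]
	for i in 4 .. 7 {
		mut next := current.clone() // readers keep the old version meanwhile
		next << i
		current = next
		updates <- current.clone() // the "update your copy" message
	}
	updates.close()
}

fn main() {
	updates := chan []int{cap: 4}
	go owner(updates)
	for {
		version := <-updates or { break }
		println('received version: ${version}')
	}
}
```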

I posited back in 2018 that immutable shared data will scale better to the inevitable future massively-multicore shift away from universal hardware cache coherency, because the cache invalidation can be unified with the message telling threads to update their copy.

Of course there will still be locking issues at some higher-level semantics layer such as shared access to a shared database from concurrent threads, but afaics that is out-of-scope of Vlang’s design concerns.

You do seem very knowledgeable about concurrency and more so than myself. I found myself upvoting most of your comment posts. Thanks.

@dumblob (Author) commented May 7, 2021

For those interested, I posted a report, On the way to an ultra low overhead equal-priority hard real time preemptive scheduler, in the V Concurrency Considerations thread.

@dumblob (Author) commented May 17, 2021

Just an example for eBPF non-believers 😉: how the authors of picoev (which is bundled with V) implemented eBPF in their h2o HTTP/1,2,3 server (btw. the server itself seems to be a pretty good piece of SW despite still being experimental; it's also not bulky 😮, quite the opposite).

@dumblob (Author) commented Jul 8, 2021

@UweKrueger @medvednikov @spytheman @cristian-ilies-vasile @JalonSolov and others - feel free to read the new mini-series about the current state of the art in memory models (which exist only because of multicore architectures sharing the same physical memory) and their major flaws (led by the fact that they're mostly not formally defined 😮):

https://research.swtch.com/mm

The series is written by the Go lead dev - Russ Cox (rsc).

rsc asked in golang/go#47141 how Go shall approach synchronization primitives and memory access in general in the future. It's worth reading to get a glimpse of the issues V will face sooner or later.

For V I think this is the most important point of the series:

Both considerations strongly suggest we should adopt sequentially consistent atomics over acquire/release atomics: sequentially consistent atomics are more useful, and some chips have already completely closed the gap between the two levels. Presumably others will do the same if the gap is significant.

The same considerations, along with Go's overall philosophy of having minimal, easily understood APIs, argue against providing acquire/release as an additional, parallel set of APIs. It seems best to provide only the most understandable, most useful, least misusable set of atomic operations.

@vlang locked and limited the conversation to collaborators Sep 22, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
