Parallel execution #63

Open
bitprophet opened this Issue Apr 29, 2013 · 14 comments

@bitprophet
Member

bitprophet commented Apr 29, 2013

Clearinghouse for "how to execute N instances of a given task, parameterized in some fashion, in parallel?"

Fabric 1.x has a naive parallelization which assumes a single parameterization vector (host lists); we need to support this more generally while still honoring that sort of use case.

Fab 1.x's implementation has to leverage multiprocessing because Fabric isn't threadsafe; Invoke should be threadsafe, so threads are an option, as are multiprocessing and (if we decide it's worthwhile to add a dependency) coroutines/greenlets. EDIT: see also Tulip/asyncio in Python 3.x and its Python 2.x port, Trollius.

In any case, we have to solve the following problems:

  • Display of live output - how to tell different "channels" apart?
    • Fabric handles this by forcing 'linewise' output in parallel mode, and using a line prefix string. This is functional but not very readable (though troubleshooting-related "reading" can be fixed by using logging to per-channel files or data structures). A minimal sketch of this prefixing approach follows the list below.
  • Communicating data between channels
    • Generally, not doable or desirable since you have no idea which permutations of the task will run before/after others.
    • Though if threading is used, users can simply opt to use locks/semaphores/queues/etc.
  • Communication between channels and the "master" process, e.g. return values
  • Error handling - should errors partway through the run cause a halt, or just get tallied and displayed at the end?
  • Probably more.
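
For illustration, here's a minimal sketch of that linewise-prefix approach - not Fabric's actual implementation, just the general shape, with placeholder channel names and a placeholder command:

import subprocess
import threading

print_lock = threading.Lock()

def run_prefixed(prefix, cmd):
    # Read the child's output a line at a time and tag each line with a
    # channel prefix, so interleaved parallel output stays attributable.
    proc = subprocess.Popen(
        cmd, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    for line in proc.stdout:
        with print_lock:  # keep whole lines atomic across channels
            print("[%s] %s" % (prefix, line.decode().rstrip()))
    proc.wait()

channels = [
    threading.Thread(target=run_prefixed, args=(name, "ls /"))
    for name in ("host1", "host2", "host3")
]
for t in channels:
    t.start()
for t in channels:
    t.join()
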
@riltsken

riltsken commented May 21, 2013

Some thoughts:

Error handling - I like options. I feel the default should be for errors not to cause a halt, but, tying into the 'Communication' bit above, you could specify that a task returning an invalid value should halt the other tasks.
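
For what it's worth, a sketch of what such an option could look like - a hypothetical helper, not anything Invoke ships, built on stdlib concurrent.futures:

import concurrent.futures

def run_all(funcs, fail_fast=False):
    # Hypothetical knob: either re-raise on the first error, or tally
    # errors and report them once everything has finished.
    errors = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(f): f.__name__ for f in funcs}
        for future in concurrent.futures.as_completed(futures):
            exc = future.exception()
            if exc is not None:
                if fail_fast:
                    raise exc  # note: pool shutdown still waits on in-flight tasks
                errors[futures[future]] = exc
    # Tally-and-display behavior:
    for name, exc in errors.items():
        print("%s failed: %r" % (name, exc))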

Display of live output - Not sure if you have experienced 'live' output as a big benefit for parallelized tasks. It tends to clutter things up for me when you have more than 3-4 processes spitting out information. I didn't realize there was a 'prefix' option in Fabric. I'll take a look at that for my own use actually.

@bitprophet

Member

bitprophet commented May 26, 2013

@riltsken The prefix in Fabric is the default, so a simple setup like this will show you how it behaves there:

from fabric.api import task, parallel, run

@task
@parallel
def mytask():
    run("ls /") # or some other multiline-printing, ideally not-super-fast command

# Execute as: fab -H host1,host2,host3 mytask

Options, yea. The setup I have right now is a user-selectable/subclassable "Executor" class; ideally, any sort of "how to go from parsed args + a task namespace to actual execution" pattern can be implemented in one.

The default and only instance right now is of course naively serial, but the idea is to use it to support parallelism, different error handling behaviors, etc. Having Fabric users locked into the core implementation has always been crummy.
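
For concreteness, a hypothetical sketch of what a parallel-flavored Executor subclass might look like - the import path and execute() signature here are assumptions about an API that's still settling, not a spec:

from threading import Thread

from invoke.executor import Executor  # assumed import path

class ParallelExecutor(Executor):
    # Illustrative only: run each requested task in its own thread
    # instead of serially, then wait for all of them to finish.
    def execute(self, *tasks):
        serial = super(ParallelExecutor, self).execute
        threads = [Thread(target=serial, args=(t,)) for t in tasks]
        for t in threads:
            t.start()
        for t in threads:
            t.join()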

@alendit

alendit commented Nov 21, 2013

Would concurrent execution (i.e. starting a watcher for CoffeeScript and another for SCSS at the same time) be the same feature, or should I open an additional issue?

@bitprophet

Member

bitprophet commented Nov 27, 2013

I'd argue that's related strongly enough to live here, @alendit; it's really just making the concept that much more general.

That use case also lends weight to the "just make sure we are 100% threadsafe and ensure using threads is as easy as possible" option, which is both less work for the core codebase and maximum flexibility for users.

If I remember my threading correctly (dubious...) it means your use case would look something like this - one task using threads to do >1 thing "concurrently":

from threading import Thread

from invoke import task, run

@task
def coffee():
    run("coffeescript --watch")

@task
def sass():
    run("sass --scss --watch")

@task(default=True)
def main():
    # A real list (not a lazy map) so we can iterate it twice below
    threads = [Thread(target=x) for x in (coffee, sass)]
    # Kick off
    [x.start() for x in threads]
    # Wait for completion - maybe pending KeyboardInterrupt or similar
    [x.join() for x in threads]

Then maybe store that in watch.py, add it as watch to your tasks.py's namespace, and call e.g. invoke watch (to watch both, via threads) or invoke sass (to just watch one) etc.
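
For the namespace wiring, something along these lines ought to do it (assuming Collection keeps behaving the way it does today; watch is just the module name from above):

# tasks.py
from invoke import Collection

import watch  # the module holding the coffee/sass/main tasks above

ns = Collection()
ns.add_collection(watch)  # exposes watch.coffee, watch.sass, etc.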

Note that I haven't tried this myself yet - it might work fine now, but it's possible something in how run behaves would goof things up.

As noted in the comments, this also fudges over control/shutdown concerns. That + the thread busywork are places where Invoke could grow an API so users can save some boilerplate - but it would also be optional, and you could just do the above on your own.

@alendit

alendit commented Nov 29, 2013

Thanks, works perfectly :)

@bitprophet bitprophet added the Feature label Aug 25, 2014

@ghost

ghost commented Oct 23, 2014

I'm interested in using invoke to write a bioinformatics pipeline that would be executed in parallel on an LSF or Amazon EC2 cluster, but it wasn't clear to me from this thread whether this is possible in the current incarnation of invoke. Do you have any suggestions on this by any chance, or would you suggest going with something that explicitly supports this and interfaces with the Python cluster-managing library DRMAA, like ruffus?

@bitprophet

Member

bitprophet commented Nov 6, 2014

@erasmusthereformer No parallelization is in yet, though I expect to poke at it soon since it's a key part of the lib I'm writing on top of it (fabric 2).

@bitprophet

Member

bitprophet commented Mar 24, 2015

See also paramiko/paramiko#389 which links to some asyncore/asyncio related prior art. Part of this ticket here is to figure out if an interface/implementation can work well for both local+remote execution or if they need to be separate concerns.

@bitprophet

Member

bitprophet commented Sep 18, 2015

A great example of this, possibly distinct from the Fab use case and one I find myself wanting now, is running >1 file-watch task at the same time - specifically, two concurrent instances of watchmedo running with different arguments.

I can work around this by creating a single hybrid invocation of watchmedo but it feels more "natural" to implement them as distinct invocations so one could e.g. inv -P www.watch docs.watch.

@cdman

cdman commented Jun 27, 2016

Hello,

I'm looking into using invoke as a build tool, and "parallel execution" would be very useful to reduce build times / make use of the machines we actually have. For example, we have multiple modules in the same repository, and using parallel execution we could run the linting on all of them at once. Another example would be to run the linting of our JS code in parallel with the linting of our PY code.

A simplistic way to do this (but which nonetheless works perfectly fine up to at least ~10k tasks) would be something like:

import collections

Task = collections.namedtuple('Task', 'dependencies')

tasks = {
  'a': Task(dependencies=[]),
  'b': Task(dependencies=[]),
  'c': Task(dependencies=['a', 'b']),
  'd': Task(dependencies=['a', 'c']),
}

# For each task, the set of dependencies it is still waiting on...
tasks_by_dependencies = {k: set(v.dependencies) for k, v in tasks.items()}
# ...and the inverse mapping: which tasks each task is blocking.
blocked_by_me = {k: set() for k in tasks}
for k, v in tasks.items():
  for d in v.dependencies:
    blocked_by_me[d].add(k)

# Repeatedly pick a task with no unmet dependencies, "run" it, then
# unblock everything that was waiting on it (i.e. topological order).
while tasks_by_dependencies:
  name = next(
    name for name, unmet_dependencies in tasks_by_dependencies.items()
    if not unmet_dependencies)
  del tasks_by_dependencies[name]

  print("Executed %s" % name)

  blocked = blocked_by_me.pop(name)
  for k in blocked:
    tasks_by_dependencies[k].remove(name)
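
To get actual make -j behavior out of that sketch, the "ready" tasks can be drained into a worker pool instead of executed inline. Here's one way, under the same data model (stdlib concurrent.futures; execute() is a stand-in for real task work, and max_workers plays the role of -j):

import concurrent.futures

def execute(name):
    print("Executed %s" % name)  # stand-in for real task work

def run_parallel(tasks, max_workers=4):
    # Same bookkeeping as above: unmet dependencies per task, plus the
    # inverse map of which tasks each task is blocking.
    unmet = {k: set(v.dependencies) for k, v in tasks.items()}
    blocked_by_me = {k: set() for k in tasks}
    for k, v in tasks.items():
        for d in v.dependencies:
            blocked_by_me[d].add(k)

    with concurrent.futures.ThreadPoolExecutor(max_workers) as pool:
        futures = {}

        def submit_ready():
            # Submit every task whose dependencies are all met.
            for k in [k for k, deps in unmet.items() if not deps]:
                del unmet[k]
                futures[pool.submit(execute, k)] = k

        submit_ready()
        while futures:
            done, _ = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
            for future in done:
                finished = futures.pop(future)
                for k in blocked_by_me[finished]:
                    unmet[k].discard(finished)
            submit_ready()

run_parallel(tasks)  # `tasks` as defined above
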
@chaos-ad

chaos-ad commented Jun 27, 2016

Just discovered this library, and thought at first that it could be used to replace GNU Make. It would do so blissfully, at least for my use cases, if only it had an option that works like make's -j.

@cdman

cdman commented Jun 28, 2016

@chaos-ad - that was my thinking exactly!

What I liked about make: well supported and integrated (i.e. you get tab completion by default on many systems), and of course the -j option.

Why I would like to use invoke or an equivalent: because I know more Python than Makefile, and frankly Makefile syntax seems like Bash - lots of data structures shoved into strings, and lots of things which "seem to work" but can actually break, with little way to discover it in advance (see Bash's handling of filenames with spaces in them).

@bitprophet

Member

bitprophet commented Jul 3, 2016

This should probs be linked to #45 since enhanced dependency resolution is pretty important re: the kinds of use cases @cdman describes.

If it wasn't clear from the original description, @chaos-ad / @cdman - yes, you've hit on one major reason why this feature's needed! I totally want Invoke to be a good make replacement and that includes -j type stuff. (The other reason, and one I'll personally be banging on soon, is for parallel remote execution, aka the Fabric use case).

I'll update this once I start poking that functionality, with notes / requests for API feedback / etc :)

@bitprophet

Member

bitprophet commented Jun 19, 2018

This should also link to #15 since both are concerned with how to handle differentiating output from concurrently executing contexts (different hosts in Fabric, different tasks or subprocesses in Invoke, etc).
