
Interface for executing workflows on multiple nodes #1273

Closed
BoPeng opened this issue Jun 27, 2019 · 5 comments

Comments

@BoPeng
Contributor

BoPeng commented Jun 27, 2019

Currently the -j option specifies the number of workers, which are processes on the same machine. Although this more or less works on modern machines with plenty of cores, it would be better to allow the execution of workflows on multiple computers.

We are almost there because

  1. With the use of zmq, the processes exchange information using http://localhost:port, which can almost trivially be changed to http://ip.address:port if we can start the processes on a remote computer (see the sketch after this list).
  2. Tasks can still be used although they are no longer required; we now have the option -q none to disable tasks.
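
For illustration, here is a minimal pyzmq sketch of that change; the socket types, address, and port are made up and this is not the actual SoS socket layout. The only difference for a remote worker is the address it connects to:

    import zmq

    MASTER_ADDR = "10.0.0.1"   # hypothetical master IP; today this is localhost
    PORT = 5555                # hypothetical port

    context = zmq.Context()
    sock = context.socket(zmq.REQ)
    # Current single-machine setup: sock.connect(f"tcp://localhost:{PORT}")
    # Multi-node setup: connect to the master's IP address instead.
    sock.connect(f"tcp://{MASTER_ADDR}:{PORT}")
    sock.send(b"ready")        # announce ourselves to the master
    reply = sock.recv()        # block until the master answers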

So there are currently two problems:

  1. How to start a sos worker on a remote machine. This looks like some sort of sos remote server --port. We already have the remote command and a mechanism to run commands on a remote machine, so this is almost done. However, on a cluster it is a bit more difficult to start, since we need something along the lines of mpistart (see the sketch after this list).

  2. What would be a good interface? The question is whether or not we should allow a non-cluster multi-node mode, which is a lot harder to use.
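
As a rough illustration of problem 1, workers on ordinary remote machines could be launched over ssh; the sos remote server --port subcommand below is only the hypothetical interface proposed above, not an existing sos command:

    import subprocess

    # Hypothetical sketch: start the proposed worker command on a remote host
    # over plain ssh; on a cluster the batch system (mpistart-like) would start
    # the same command on every allocated node instead.
    node = "server1"   # made-up host name
    port = 5555        # made-up port for the worker to use

    worker = subprocess.Popen(["ssh", node, f"sos remote server --port {port}"])
    # ... later, when the workflow finishes:
    worker.terminate()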

Existing options are

  1. -j, which could be extended to multiple nodes; this seems to make the most sense.
  2. -r (remote), which could be used to specify multiple remote machines.
  3. -q, which could be used to define special multi-node queues.

In any case we need to specify how the workers are started:

  1. In cluster mode, sos should assume that workers are automatically started by the batch manager and the master only needs to connect to them. In this case the master does not even have to know the number of workers in advance, and could be more intelligent in handling dead nodes.

The procedure could be something like

mpistart sos run workflow -j 5 *:5

The master process will start working, collect worker information, and treat the other processes as workers.
The worker processes will simply start as workers. Here we assume that the master node will have 5 workers, and each remote node will also have 5 workers (see the sketch below).
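
A rough sketch of that registration step, assuming zmq ROUTER/DEALER sockets; this is illustrative only, not the actual SoS protocol, and the port and message format are made up:

    import zmq

    ctx = zmq.Context()

    # Master side: bind on all interfaces so remote workers can reach it.
    master = ctx.socket(zmq.ROUTER)
    master.bind("tcp://*:5555")                 # made-up port

    # Worker side (normally a separate process, possibly on another node):
    worker = ctx.socket(zmq.DEALER)
    worker.connect("tcp://127.0.0.1:5555")      # tcp://<master-ip>:5555 on a remote node
    worker.send(b"READY")

    # Back on the master: collect registrations as workers come up.
    identity, payload = master.recv_multipart() # DEALER->ROUTER: [identity, payload]
    workers = {identity}                        # remembered for later substep dispatch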

  2. If not in cluster mode, the sos commands could be started by the master process on the nodes, and users need to specify the machines on which sos should be started, and optionally how many processes.

The process could be

sos run workflow -j 4 server1:4 server2:4

so that there will be 4 local workers, 4 workers on server1, and 4 workers on server2. In this case the workers will be started by sos itself (a sketch of parsing these host:count specifications follows).
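
How such host:count arguments might be parsed is sketched below; the syntax (including * for "every node" in cluster mode) is only the proposal above, not an implemented option:

    def parse_worker_specs(specs, default=4):
        """Turn ['server1:4', 'server2:4'] into [('server1', 4), ('server2', 4)]."""
        parsed = []
        for spec in specs:
            host, sep, count = spec.partition(":")
            parsed.append((host, int(count) if sep else default))
        return parsed

    # sos run workflow -j 4 server1:4 server2:4  ->  4 local workers plus:
    print(parse_worker_specs(["server1:4", "server2:4"]))
    # [('server1', 4), ('server2', 4)]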

Relationship with other options

-r can remain untouched.

That is to say,

sos run workflow -r server -j4 server1:4

would copy the workflow to server, start 4 workers on server and 4 workers on server1 as defined by hosts.yml on server.

-q can remain untouched, as this is still the mechanism for tasks. It seems natural to use -q none in this mode, but we could still allow the use of -q.

@gaow
Member

gaow commented Jun 27, 2019

almost trivially be changed to http://ip.address:port ... now have the option -q none to disable tasks.

This should be more robust than talking to PBS, right? I am not sure how robust http://ip.address:port is, or whether any remote system has constraints that would prevent it from working -- do you know if it works at least on the MDA HPC?

How to start a sos worker on remote machine.

Does this refer to how the SoS mini-cluster works internally? It sounds like a redis queue -- once we have "booked" multiple nodes, there must be existing solutions for using those nodes to implement the mini-cluster, aren't there?

The question is whether or not we should allow non-cluster multi-node mode, which is a lot harder to use.

I cannot think of a use case for it, unless someone has lots of desktops in the lab that are not configured as a cluster, which would be weird (please correct me if I get it wrong). And the mechanism is not going to be robust anyway (we would have to configure each computer separately, worry about file sync, etc.). So I'm not very excited about the non-cluster multi-node mode unless you have a need for it?

If we do not have to worry about the non-cluster mode, should we only change the yml and leave most of the command interface untouched?

@BoPeng
Contributor Author

BoPeng commented Jun 27, 2019

This should be more robust than talking to PBS, right? I am not sure how robust http://ip.address:port is, or whether any remote system has constraints that would prevent it from working -- do you know if it works at least on the MDA HPC?

Our HPC limits outside connections but seems to allow inter-node communication on any port; otherwise I do not think MPI jobs would be able to run.

Does this refer to how the SoS mini-cluster works internally? It sounds like a redis queue -- once we have "booked" multiple nodes, there must be existing solutions for using those nodes to implement the mini-cluster, aren't there?

Yes, mpirun starts the same program on multiple nodes, and each process can learn its position in the MPI geometry through environment variables and act accordingly. This is very mature on clusters (see the sketch below).
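
For example, a process started under mpirun can usually discover its rank through environment variables; the exact variable names depend on the MPI implementation or scheduler (Open MPI, MPICH/Intel MPI, Slurm), and this is a generic sketch, not SoS code:

    import os

    def mpi_rank_and_size():
        """Best-effort lookup of (rank, size); variable names differ per MPI flavor."""
        for rank_var, size_var in [
            ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # Open MPI
            ("PMI_RANK", "PMI_SIZE"),                          # MPICH / Intel MPI
            ("SLURM_PROCID", "SLURM_NTASKS"),                  # Slurm srun
        ]:
            if rank_var in os.environ:
                return int(os.environ[rank_var]), int(os.environ[size_var])
        return 0, 1  # not started by MPI: act as the single master

    rank, size = mpi_rank_and_size()
    role = "master" if rank == 0 else "worker"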

I cannot think of a use case for it,

The idea is like sending tasks to another server with a process queue, but all in memory/over the network. For example, if previously we needed to rsync a bunch of files on the cluster head node, I would have to submit a bunch of rsync tasks. With this option I am basically starting a sos worker on the head node, which accepts rsync substeps through the network. There are pros and cons to this approach; at least on our system the cluster head node does not allow anything other than ssh on port 22, so the remote worker approach will not work there. This is one of the reasons we used ssh/rsync for everything, to achieve the best compatibility.

I agree that we could implement this as another kind of task engine with a different kind of template.

@gaow
Member

gaow commented Jun 27, 2019

For example, if previously we need to rsync a bunch of files on the cluster head node

My point is that I don't see this happening often ... Mostly I see people working on a single HPC cluster, or running from their desktops but submitting to that cluster, or at most using a couple of random desktop computers for remote tasks (in which case the existing task model is not so bad). But it would be warranted if you have a solid use case.

Implementing a new template seems the cleanest and most flexible approach, at least for the beta phase.

@BoPeng
Contributor Author

BoPeng commented Jun 27, 2019

My point is that I don't see this happening often ...

It all depends on how far we are willing to go, because if we only assume the nodes of a cluster, the interface would be a lot cleaner, and we can automatically assume a shared disk and do not have to worry about path translation and file synchronization. Adding support for non-cluster multi-node computation looks useful, but introduces a lot of complexity (e.g. heterogeneous disks) for both developers and users, with little benefit. Right now I tend to agree that we should focus on the mini-cluster design.

@gaow
Member

gaow commented Jun 27, 2019

Yes, and we can worry about the other case if there indeed is a need (at least among ourselves).

@BoPeng BoPeng closed this as completed Aug 8, 2019