-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Add MPI support on Ray cluster #40917
Conversation
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally lgtm. Some comments in input validation & API doc
["mpirun", "--version"], capture_output=True, check=True | ||
) | ||
except subprocess.CalledProcessError: | ||
logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe logger.exception to print stacktrace?
except subprocess.CalledProcessError: | ||
logger.error( | ||
"Failed to run mpi run. Please make sure mpi has been installed" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we kill proc here? Or does it guarantee the proc is killed? (if so can you comment here?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modify context is in runtime env agent I think. Exception should be good? I can test it.
from pathlib import Path | ||
|
||
# mpirun -n 10 python mpi.py worker_entry_func | ||
worker_entry = mpi_config["worker_entry"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we need to handle a case where "worker_entry" doesn't exist because we don't have input validation iiuc
worker_Entry = mpi_config.get("worker_entry")
if worker_entry is None:
raise
|
||
# mpirun -n 10 python mpi.py worker_entry_func | ||
worker_entry = mpi_config["worker_entry"] | ||
assert Path(worker_entry).is_file() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert Path(worker_entry).is_file(), "worker_entry is not a file but ..."
|
||
|
||
if __name__ == "__main__": | ||
parser = argparse.ArgumentParser(description="Setup MPI worker") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the main function? or the parser?
This will will be used as the mpi entry point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main function. I don't see any main function from other plugins though. Maybe it should be a part of mpi_worker.py not here? (this means the function is executed when you import it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MPIRUN is like a fork and the rest plugin doesn't have this.
The function won't execute when import since it checks main. If you import, __name__
won't be __main__
.
This piece of code is part of the plugin, that's why I put it here and it's simple.
But open to move if you insist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I see. I think this is a bit confusing when I read code first time. I prefer to move it to a separate file (something like mpi_start.py), but it is also okay if you add comments in details in the main block. E.g., "the plugin starts a subprocess that runs this main method. It is not executed as a part of normal plugin" or sth like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about moving it to python/ray/_private/workers/mpi_workers.py
but feel it just moves the code to far away from the place where it's used. I think split it and move it to the other file is better.
@@ -287,6 +288,7 @@ def __init__( | |||
nsight: Optional[Union[str, Dict[str, str]]] = None, | |||
config: Optional[Union[Dict, RuntimeEnvConfig]] = None, | |||
_validate: bool = True, | |||
mpi: Optional[Dict] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should update a doc, or consider _mpi
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll update the doc. i think it's a feature.
python/ray/tests/mpi/test_mpi.py
Outdated
runtime_env={ | ||
"mpi": { | ||
"args": ["-n", "4"], | ||
"worker_entry": "mpi_worker.py", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does it find the file? From the current directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be clear about it in the docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's from the working dir. I'll add the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm having main inside runtime env plugin file seems wrong?
|
||
|
||
if __name__ == "__main__": | ||
parser = argparse.ArgumentParser(description="Setup MPI worker") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main function. I don't see any main function from other plugins though. Maybe it should be a part of mpi_worker.py not here? (this means the function is executed when you import it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think code lgtm now. Can you ping me after updating the docstring for mpi API? I think we also need api approval as it is a new plugin
|
||
|
||
if __name__ == "__main__": | ||
parser = argparse.ArgumentParser(description="Setup MPI worker") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I see. I think this is a bit confusing when I read code first time. I prefer to move it to a separate file (something like mpi_start.py), but it is also okay if you add comments in details in the main block. E.g., "the plugin starts a subprocess that runs this main method. It is not executed as a part of normal plugin" or sth like that
@rkooo567 I'll create another PR for the doc. |
@@ -160,6 +160,9 @@ install_miniconda() { | |||
) | |||
fi | |||
|
|||
# Install mpi4py | |||
"${WORKSPACE_DIR}"/ci/suppress_output conda install -c anaconda mpi4py -y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm it is a bit weird we have it here? (it will be requested by every dev to download mpi4py)
Why don't we just make it test requirement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that only conda can install it. It has system deps and doesn't work with py311 :(
@rkooo567 I updated the API to avoid the extra worker file. |
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
This PR adds the support to run MPI based code on top of Ray. The support is done with runtime env plugin. To enable it, the following decorator needs to be added inside ray remote options: @ray.remote( runtime_env={ "mpi": { "args": ["-n", "4"], "worker_entry": "mpi_worker.run", } } ) def f(): pass Here the mpi_worker.run is the function the process with rank > 0 will run. It'll run as import mpi_worker; mpi_worker.run(). The parameter needs to be passed with MPI comm.bcast. Here the process with rank 0 sill will run the remote function f. Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Why are these changes needed?
This PR adds the support to run MPI based code on top of Ray.
The support is done with runtime env plugin. To enable it, the following decorator needs to be added inside ray remote options:
Here the
mpi_worker.run
is the function the process with rank > 0 will run. It'll run asimport mpi_worker; mpi_worker.run()
. The parameter needs to be passed with MPI comm.bcast.Here the process with rank 0 sill will run the remote function f.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.