[core] Add MPI support on Ray cluster #40917

Merged: 32 commits, Nov 9, 2023


@fishbone fishbone commented Nov 2, 2023

Why are these changes needed?

This PR adds support for running MPI-based code on top of Ray.

The support is implemented as a runtime env plugin. To enable it, pass an "mpi" entry in the runtime_env option of @ray.remote:

import ray

@ray.remote(
    runtime_env={
        "mpi": {
            "args": ["-n", "4"],
            "worker_entry": "mpi_worker.run",
        }
    }
)
def f():
    pass

Here mpi_worker.run is the function that the processes with rank > 0 will run. It is invoked as import mpi_worker; mpi_worker.run(). Any parameters need to be passed with MPI comm.bcast.

The process with rank 0 will still run the remote function f.
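A minimal sketch of what an mpi_worker module could look like under the description above, assuming mpi4py is available on the cluster. The optional `comm` argument and the `"scale"` parameter are hypothetical and only there for illustration; they are not part of this PR's API. The mpi4py calls used are `MPI.COMM_WORLD`, `Get_rank`, and the lowercase `bcast`, which pickles arbitrary Python objects.

```python
# mpi_worker.py (module name taken from the example above)


def run(comm=None):
    """Entry point for ranks > 0: wait for parameters broadcast by rank 0."""
    if comm is None:
        # Imported lazily so the module can be loaded without MPI present.
        from mpi4py import MPI

        comm = MPI.COMM_WORLD

    # Non-root ranks pass None; bcast returns the object sent by the root.
    params = comm.bcast(None, root=0)
    rank = comm.Get_rank()
    # Illustrative work only: each rank scales the shared parameter.
    return rank * params["scale"]
```

Since bcast is a collective call, rank 0 (the process running the remote function f) would need to make the matching call, for example `MPI.COMM_WORLD.bcast({"scale": 2}, root=0)`, for the other ranks to receive anything.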

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Contributor

@rkooo567 rkooo567 left a comment

Generally lgtm. Some comments on input validation & API doc

python/ray/_private/runtime_env/mpi.py
["mpirun", "--version"], capture_output=True, check=True
)
except subprocess.CalledProcessError:
logger.error(
Contributor

Maybe logger.exception to print stacktrace?
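A rough sketch of the suggestion, not the code in this PR: `logger.exception` logs at ERROR level and appends the current traceback. The helper name `mpi_available` is hypothetical. Catching `FileNotFoundError` is an addition beyond the snippet under review, on the assumption that `subprocess.run` raises it when the launcher binary is missing from PATH.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def mpi_available(launcher: str = "mpirun") -> bool:
    """Return True if `<launcher> --version` exits successfully."""
    try:
        subprocess.run([launcher, "--version"], capture_output=True, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Must be called from inside an except block to capture the traceback.
        logger.exception(
            "Failed to run %s. Please make sure MPI has been installed.", launcher
        )
        return False
    return True
```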

except subprocess.CalledProcessError:
    logger.error(
        "Failed to run mpi run. Please make sure mpi has been installed"
    )
Contributor

should we kill proc here? Or does it guarantee the proc is killed? (if so can you comment here?)

Contributor Author

@fishbone fishbone Nov 3, 2023

I think modify context runs in the runtime env agent. Raising an exception should be fine? I can test it.

from pathlib import Path

# mpirun -n 10 python mpi.py worker_entry_func
worker_entry = mpi_config["worker_entry"]
Contributor

maybe we need to handle a case where "worker_entry" doesn't exist because we don't have input validation iiuc

worker_entry = mpi_config.get("worker_entry")
if worker_entry is None:
    raise ValueError("`worker_entry` must be set in the mpi config")


# mpirun -n 10 python mpi.py worker_entry_func
worker_entry = mpi_config["worker_entry"]
assert Path(worker_entry).is_file()
Contributor

Add an error message?

Contributor

assert Path(worker_entry).is_file(), "worker_entry is not a file but ..."
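The two validation comments above could be combined roughly as follows. This is a sketch, not the merged code: the helper name `get_worker_entry` and the error messages are hypothetical, and it assumes `worker_entry` is a file path as in this revision of the PR. Explicit exceptions are used instead of `assert` because assertions are skipped when Python runs with `-O`.

```python
from pathlib import Path


def get_worker_entry(mpi_config: dict) -> str:
    """Validate and return the `worker_entry` value of an mpi config dict."""
    worker_entry = mpi_config.get("worker_entry")
    if worker_entry is None:
        raise ValueError("`worker_entry` must be set in the mpi config")
    if not Path(worker_entry).is_file():
        # Include the offending value so the user can see what was passed.
        raise ValueError(f"`worker_entry` is not a file: {worker_entry!r}")
    return worker_entry
```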



if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Setup MPI worker")
Contributor

is this intentional?

Contributor Author

Yes.

Contributor Author

Do you mean the main function? or the parser?

This will be used as the mpi entry point.

Contributor

main function. I don't see any main function from other plugins though. Maybe it should be a part of mpi_worker.py not here? (this means the function is executed when you import it?)

Contributor Author

mpirun works like a fork, and the other plugins don't have this.
The function won't execute on import since it checks __main__. If you import the module, __name__ won't be __main__.

This piece of code is part of the plugin, that's why I put it here and it's simple.

But open to move if you insist.
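The point about the `__main__` check can be demonstrated with a small standalone sketch. The module name `entry_demo` and the flag `executed_as_main` are made up for this illustration. The same file is loaded once as an import and once as a script, using only the standard library.

```python
import importlib.util
import os
import runpy
import tempfile

SOURCE = """
executed_as_main = False
if __name__ == "__main__":
    executed_as_main = True
"""


def demo():
    """Return (flag when imported, flag when run as a script)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "entry_demo.py")
        with open(path, "w") as f:
            f.write(SOURCE)

        # Import path: __name__ is "entry_demo", so the guarded block is skipped.
        spec = importlib.util.spec_from_file_location("entry_demo", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # Script path: __name__ is "__main__", so the guarded block runs.
        script_globals = runpy.run_path(path, run_name="__main__")
        return module.executed_as_main, script_globals["executed_as_main"]
```

Calling `demo()` returns `(False, True)`: the guarded block is skipped on import and only runs when the file is executed as the entry point, which is how a launcher such as mpirun would start it.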

Contributor

Hmm I see. I think this is a bit confusing when reading the code for the first time. I'd prefer moving it to a separate file (something like mpi_start.py), but it is also okay if you add detailed comments in the main block, e.g., "the plugin starts a subprocess that runs this main method. It is not executed as part of the normal plugin" or something like that.

Contributor Author

I thought about moving it to python/ray/_private/workers/mpi_workers.py but felt that it just moves the code too far away from the place where it's used. I think splitting it out and moving it to another file is better.

@@ -287,6 +288,7 @@ def __init__(
nsight: Optional[Union[str, Dict[str, str]]] = None,
config: Optional[Union[Dict, RuntimeEnvConfig]] = None,
_validate: bool = True,
mpi: Optional[Dict] = None,
Contributor

We should update the doc, or consider _mpi!

Contributor Author

we'll update the doc. i think it's a feature.

runtime_env={
    "mpi": {
        "args": ["-n", "4"],
        "worker_entry": "mpi_worker.py",
Contributor

Where does it find the file? From the current directory?

Contributor

I think we should be clear about it in the docstring

Contributor Author

I think it's from the working dir. I'll add the doc.

python/ray/widgets/util.py (outdated)
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 3, 2023
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Contributor

@rkooo567 rkooo567 left a comment

Hmm having main inside runtime env plugin file seems wrong?




Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Contributor

@rkooo567 rkooo567 left a comment

I think code lgtm now. Can you ping me after updating the docstring for mpi API? I think we also need api approval as it is a new plugin




Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@fishbone fishbone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 8, 2023
Contributor Author

fishbone commented Nov 8, 2023

@rkooo567 I'll create another PR for the doc.

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@@ -160,6 +160,9 @@ install_miniconda() {
)
fi

# Install mpi4py
"${WORKSPACE_DIR}"/ci/suppress_output conda install -c anaconda mpi4py -y
Contributor

Hmm, it is a bit weird that we have it here? (Every dev will be required to download mpi4py.)

Why don't we just make it a test requirement?

Contributor Author

The problem is that only conda can install it. It has system deps and doesn't work with py311 :(

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Contributor Author

fishbone commented Nov 9, 2023

@rkooo567 I updated the API to avoid the extra worker file.
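With the updated API, `worker_entry` names a function such as mpi_worker.run rather than a file. One way such a dotted string could be turned into a callable is sketched below. This is an assumption about the approach, not the plugin's actual code, and `resolve_entry` is a hypothetical helper. The module is looked up on `sys.path` like any other import.

```python
import importlib
from typing import Callable


def resolve_entry(worker_entry: str) -> Callable:
    """Resolve a 'module.function' string to the function object."""
    module_name, sep, func_name = worker_entry.rpartition(".")
    if not sep or not module_name or not func_name:
        raise ValueError(f"expected 'module.function', got {worker_entry!r}")
    # Equivalent to `import <module_name>` followed by an attribute lookup.
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError(f"{worker_entry!r} does not refer to a callable")
    return func
```

For example, `resolve_entry("mpi_worker.run")()` would behave like `import mpi_worker; mpi_worker.run()`, provided mpi_worker is importable in the worker process.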

fishbone and others added 4 commits November 9, 2023 00:07
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@can-anyscale can-anyscale merged commit 99b1a2c into ray-project:master Nov 9, 2023
23 of 30 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023

6 participants