[runtime env]: Integrating Nsight to Ray worker process #39998

Merged (21 commits) on Oct 17, 2023

Conversation

jonathan-anyscale
Contributor

@jonathan-anyscale jonathan-anyscale commented Sep 29, 2023

Why are these changes needed?

Nsight internal docs: https://docs.google.com/document/d/11RlNTbGLf6fat7HYARU8yWhodBD9j5uiZCdAB0geEik
Related issue: #39094

Nsight integration with Ray using runtime_env. Currently, nsight can't profile GPU usage from Ray tasks/actors, because it can only trace the driver process and its subprocesses, whereas Ray tasks/actors run in worker processes. This PR therefore adds native nsight support to runtime_env, which modifies the worker process to run under nsys profile and produces a report for each worker process once it exits.

The nsight field in runtime_env takes the flags that the user wants to pass to nsys profile, for example:

import ray

@ray.remote(runtime_env={"nsight": {
    "-t": "cuda,nvtx",
    "--cudabacktrace": "true",
}})
def task():
    ...
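
To make the mapping from this dict to a command line concrete, here is a minimal sketch of how such flags could be flattened into an nsys profile invocation. It is an illustration only, not the code from this PR: build_nsys_command is a hypothetical helper, and the rule it applies (short options followed by a separate value, long options written as --name=value) is an assumption based on the two flags shown above.

```python
from typing import Dict, List


def build_nsys_command(flags: Dict[str, str]) -> List[str]:
    """Flatten a flag dict into an `nsys profile` argument list.

    Hypothetical helper for illustration; not Ray's implementation.
    """
    cmd = ["nsys", "profile"]
    for flag, value in flags.items():
        if flag.startswith("--"):
            # Long options are emitted in the `--name=value` form.
            cmd.append(f"{flag}={value}")
        else:
            # Short options take their value as a separate argument.
            cmd += [flag, value]
    return cmd


cmd = build_nsys_command({"-t": "cuda,nvtx", "--cudabacktrace": "true"})
print(" ".join(cmd))
```

Returning a list rather than a single string keeps each argument separate until the command is finally joined or executed.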

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jonathan-anyscale jonathan-anyscale changed the title from "Nsight ray" to "[runtime env]: Integrating Nsight to Ray worker process" Sep 29, 2023
@jonathan-anyscale jonathan-anyscale marked this pull request as ready for review October 7, 2023 06:52
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Contributor

@rkooo567 rkooo567 left a comment

Hmm I thought we decided to go with "profiler" now? Is this updated correctly?

python/ray/remote_function.py Show resolved Hide resolved
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
python/ray/runtime_env/runtime_env.py Outdated Show resolved Hide resolved
python/ray/tests/test_runtime_env_profiler.py Show resolved Hide resolved
python/ray/tests/test_runtime_env_profiler.py Show resolved Hide resolved
os.environ.get("CI") and sys.platform != "linux",
reason="Requires PR wheels built in CI, so only run on linux CI machines.",
)
def test_nsight_custom_name(
Contributor

Suggested change
- def test_nsight_custom_name(
+ def test_nsight_custom_option(

(The purpose of the test is to make sure the options work, not the custom name. There's no reason to test whether a custom name works, because that should already be tested by nsight itself.)

python/ray/_private/runtime_env/nsight.py Show resolved Hide resolved
python/ray/_private/runtime_env/nsight.py Outdated Show resolved Hide resolved
python/ray/_private/runtime_env/nsight.py Outdated Show resolved Hide resolved
)
# add set output path to logs dir
nsight_config["-o"] = f"{self._logs_dir}/" + nsight_config.get(
"-o", NSIGHT_DEFAULT_CONFIG["-o"]
Contributor

Is it possible to include the worker ID in the output file name?

Contributor Author

I talked with Jiajun about this before: since the worker hasn't been launched yet when the plugin is created, we can't pass its ID into the report filename.

Contributor

Hmm yeah that sounds tricky...
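
For reference, the output-path handling in the diff context above can be sketched as a small standalone function. This is a hedged illustration rather than the PR's code: with_report_in_logs_dir is a hypothetical name, and the default report name worker_process_%p is an assumption (the actual value of NSIGHT_DEFAULT_CONFIG["-o"] is not shown in this thread). It relies on the %p placeholder that nsys documents for output names, which expands to the profiled process's PID and so keeps per-worker reports distinct even though the worker ID is not known when the plugin is created.

```python
from typing import Dict

# Assumed default report name; nsys expands `%p` to the PID of the profiled process.
DEFAULT_REPORT_NAME = "worker_process_%p"


def with_report_in_logs_dir(flags: Dict[str, str], logs_dir: str) -> Dict[str, str]:
    """Return a copy of `flags` whose `-o` value is placed under `logs_dir`.

    Hypothetical helper for illustration; not Ray's implementation.
    """
    resolved = dict(flags)  # Copy so the caller's dict is not mutated.
    name = resolved.get("-o", DEFAULT_REPORT_NAME)
    resolved["-o"] = f"{logs_dir.rstrip('/')}/{name}"
    return resolved


print(with_report_in_logs_dir({"-t": "cuda,nvtx"}, "/tmp/ray/logs")["-o"])
print(with_report_in_logs_dir({"-o": "my_report"}, "/tmp/ray/logs/")["-o"])
```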

python/ray/_private/runtime_env/nsight.py Outdated Show resolved Hide resolved
)

self.nsight_cmd = parse_nsight_config(nsight_config)
return 0
Contributor

Q: what's the plan for GC policy now?

Contributor

Also, cc @architkulkarni: how do we currently GC runtime resources? Which directory should we use to take advantage of the default GC features? (Also, should create return the size of the artifact to use this feature?)

Contributor Author

create in runtime_env is used when we need to install some packages (like pip or working_dir).
I set it to return 0 because we don't install any dependency for the user; we only modify the RuntimeContext.

Contributor

I see. I guess the files generated by a profiler are a separate question. Do you plan to tackle that in follow-up PRs?
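
The distinction made in this thread (create reports 0 because nothing is installed, and the plugin only changes how the worker is started) can be sketched with stand-in classes. Everything here is hypothetical and simplified: WorkerContext and ProfilerPlugin are illustrative names, not Ray's plugin interface, and the assumption is that the context exposes the worker's Python executable as a string that can be prefixed.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerContext:
    """Hypothetical stand-in for the context that decides how a worker starts."""

    py_executable: str = "python"


class ProfilerPlugin:
    """Hypothetical plugin: installs nothing, only wraps the worker command."""

    def __init__(self, profiler_cmd: List[str]):
        self.profiler_cmd = profiler_cmd

    async def create(self) -> int:
        # Nothing is downloaded or installed, so the reported size is 0 bytes.
        return 0

    def modify_context(self, context: WorkerContext) -> None:
        # Prefix the interpreter so the worker process starts under the profiler.
        context.py_executable = " ".join(self.profiler_cmd + [context.py_executable])


plugin = ProfilerPlugin(["nsys", "profile", "-t", "cuda,nvtx"])
ctx = WorkerContext()
size = asyncio.run(plugin.create())
plugin.modify_context(ctx)
print(size, ctx.py_executable)
```

Note that the report files written by the profiler at worker exit are not covered by the size returned here, which is why their cleanup is raised as a separate question.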

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 11, 2023

"""

# use empty as nsight report test filename
nsight_config_copy = copy.deepcopy(nsight_config)
Contributor

"Raylet just gets stuck and keeps trying to relaunch the worker process instead of terminating"

Hmm, this actually seems like a bug. Maybe we can't catch the error when a worker process fails to start because of a command error. Can you create an issue for this?

"""

# use empty as nsight report test filename
nsight_config_copy = copy.deepcopy(nsight_config)
Contributor

For now, the workaround looks okay.

python/ray/_private/runtime_env/nsight.py Outdated Show resolved Hide resolved
python/ray/_private/runtime_env/nsight.py Outdated Show resolved Hide resolved
"""

# use empty as nsight report test filename
nsight_config_copy = copy.deepcopy(nsight_config)
Contributor

Theoretically, the cpp worker pool should be able to catch this (maybe it could be a follow-up for you if you have time in the future).
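
The workaround discussed in these threads (checking the profiler command up front, so that a bad option fails fast instead of leaving the raylet retrying worker launches) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in nsight.py: dry_run is a hypothetical helper that runs an arbitrary command once and reports whether it started and exited cleanly, and the demo uses the current Python interpreter in place of nsys so that it runs anywhere.

```python
import subprocess
import sys
from typing import List, Tuple


def dry_run(cmd: List[str], timeout_s: float = 30.0) -> Tuple[bool, str]:
    """Run `cmd` once and return (ok, error_message).

    Hypothetical helper for illustration; not Ray's implementation.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except FileNotFoundError:
        # The executable itself is missing, e.g. the profiler is not installed.
        return False, f"{cmd[0]} is not installed or not on PATH"
    except subprocess.TimeoutExpired:
        return False, f"{cmd[0]} did not finish within {timeout_s} seconds"
    if proc.returncode != 0:
        # Invalid options usually surface as a non-zero exit code plus a message.
        return False, proc.stderr.strip() or proc.stdout.strip()
    return True, ""


# Stand-in commands: a trivially valid one and a missing executable.
print(dry_run([sys.executable, "-c", "pass"]))
print(dry_run(["definitely-not-a-real-binary-xyz"]))
```

In the real setting the command would be the assembled nsys profile invocation with a throwaway report name and a trivial target, so that the generated test report can be deleted afterwards.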

python/ray/_private/runtime_env/nsight.py Outdated Show resolved Hide resolved
python/ray/_private/runtime_env/nsight.py Outdated Show resolved Hide resolved
Function to check if the provided nsight options are
valid nsys profile options and if nsys profile is installed

The function returns:
Contributor

This doesn't seem to have changed?

ci/docker/core.build.Dockerfile Outdated Show resolved Hide resolved
python/requirements/nsight_mock/nsys_mock.py Outdated Show resolved Hide resolved
python/requirements/nsight_mock/setup.py Outdated Show resolved Hide resolved
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jonathan-anyscale jonathan-anyscale removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 13, 2023
jonathan-anyscale and others added 2 commits October 13, 2023 13:02
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Contributor

@rkooo567 rkooo567 left a comment

We should not include the fake nsight in the official Docker image.

@@ -16,3 +16,5 @@ RUN pip install -U --ignore-installed \
-r python/requirements.txt \
-r python/requirements/test-requirements.txt \
-r python/requirements/ml/dl-cpu-requirements.txt

RUN pip install python/requirements/nsight_fake
Contributor

Hmm, I feel like we shouldn't include this in the actual Docker image. Can we include it only in the test images? Or install it separately from the build?

Contributor Author

I'll try moving it to basetest.Dockerfile.

@@ -6,5 +6,7 @@ srcs:
- python/requirements_compiled.txt
- python/requirements/test-requirements.txt
- python/requirements/ml/dl-cpu-requirements.txt
- python/requirements/nsight_fake/nsys_fake.py
- python/requirements/nsight_fake/setup.py
Contributor

same as above

python/ray/_private/runtime_env/nsight.py Show resolved Hide resolved
@rkooo567
Contributor

Follow up:

  • enhance the API to follow the API proposal
  • GC policy
  • other profilers

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 16, 2023
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@rkooo567 rkooo567 merged commit 4113ab4 into ray-project:master Oct 17, 2023
63 of 69 checks passed
@jonathan-anyscale jonathan-anyscale removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 17, 2023
5 participants