-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[runtime env]: Integrating Nsight to Ray worker process #39998
Conversation
47c5c1f
to
7865f2c
Compare
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
7865f2c
to
8d96d4a
Compare
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
3305c6b
to
c445e24
Compare
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
a68df52
to
27895b1
Compare
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
3519905
to
bfc7c1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I thought we decided to go with "profiler" now? Is this updated correctly?
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
d1af0ec
to
cc2e783
Compare
os.environ.get("CI") and sys.platform != "linux", | ||
reason="Requires PR wheels built in CI, so only run on linux CI machines.", | ||
) | ||
def test_nsight_custom_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_nsight_custom_name( | |
def test_nsight_custom_option( |
(since the purpose of the test is to make sure options work, not the custom name. There's no reason to test if custom name works because that's already should've tested with nsight itself)
) | ||
# add set output path to logs dir | ||
nsight_config["-o"] = f"{self._logs_dir}/" + nsight_config.get( | ||
"-o", NSIGHT_DEFAULT_CONFIG["-o"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to include worker id to the output file name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked with Jiajun before, since worker isn't launch when creating plugin, we can't pass it to report filename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah that sounds tricky...
) | ||
|
||
self.nsight_cmd = parse_nsight_config(nsight_config) | ||
return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: what's the plan for GC policy now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, cc @architkulkarni how do we currently GC runtime resources? Which directory should we use to utilize the default GC features? (Also, should create
return the size of artifact to utilize this feature?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create
in runtime_env is used if we requires installing some packages (like pip, or working_dir)
So I set it to 0 as we don't install the dependency for user, only modifying the RuntimeContext
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I guess the file generated by a profiler is a separate question. do you plan to tackle it in the followup prs?
) | ||
|
||
self.nsight_cmd = parse_nsight_config(nsight_config) | ||
return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I guess the file generated by a profiler is a separate question. do you plan to tackle it in the followup prs?
) | ||
# add set output path to logs dir | ||
nsight_config["-o"] = f"{self._logs_dir}/" + nsight_config.get( | ||
"-o", NSIGHT_DEFAULT_CONFIG["-o"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah that sounds tricky...
""" | ||
|
||
# use empty as nsight report test filename | ||
nsight_config_copy = copy.deepcopy(nsight_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raylet just stuck and keep trying to relaunch worker process instead of terminate
Hmm this actually seems like a bug. Maybe we cannot catch the error if a worker process fails to start by a command error. Can you create an issue for this?
""" | ||
|
||
# use empty as nsight report test filename | ||
nsight_config_copy = copy.deepcopy(nsight_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for now, the workaround looks okay.
""" | ||
|
||
# use empty as nsight report test filename | ||
nsight_config_copy = copy.deepcopy(nsight_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically the cpp worker pool should be able to catch this (maybe it could be follow-up for you if you have time in the future)
Function to check if the provided nsight options are | ||
valid nsys profile options and if nsys profile is installed | ||
|
||
The function returns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems not changed?
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
c9699e7
to
1426054
Compare
446f32e
to
eb2c199
Compare
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
eb2c199
to
e83a1a0
Compare
444183d
to
6ebe5d3
Compare
ccd822e
to
9f2b321
Compare
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not include the fake nsight to the official docker image.
ci/docker/core.build.Dockerfile
Outdated
@@ -16,3 +16,5 @@ RUN pip install -U --ignore-installed \ | |||
-r python/requirements.txt \ | |||
-r python/requirements/test-requirements.txt \ | |||
-r python/requirements/ml/dl-cpu-requirements.txt | |||
|
|||
RUN pip install python/requirements/nsight_fake |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I feel like we shouldn't include this to actual docker image. Can we only include it to a test images? Or install is separately from the build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try move it on basetest.Dockerfile
ci/docker/core.build.wanda.yaml
Outdated
@@ -6,5 +6,7 @@ srcs: | |||
- python/requirements_compiled.txt | |||
- python/requirements/test-requirements.txt | |||
- python/requirements/ml/dl-cpu-requirements.txt | |||
- python/requirements/nsight_fake/nsys_fake.py | |||
- python/requirements/nsight_fake/setup.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
Follow up:
|
794abbd
to
4c6908e
Compare
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
4c6908e
to
ed7f9ee
Compare
Why are these changes needed?
Nsight internal docs: https://docs.google.com/document/d/11RlNTbGLf6fat7HYARU8yWhodBD9j5uiZCdAB0geEik
Related issue: #39094
Nsight integration with Ray using runtime_env. Currently nsight can't profile the GPU usage from Ray tasks/actors since the processes that can be traced by nsight must be driver processes and it's subprocesses, whereas Ray tasks/actors are run by worker process. Thus, we added nsight native to runtime_env in order to modify the worker process to run with
nsys profile
which can produce the report for each worker processes once it exits.The nsight API in the runtime_env can be specified with flags that user want to add to the
nsys profile
for exampleRelated issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.