[runtime env] Fix Ray hangs when nonexistent conda environment is specified #28105 #34956

Merged 13 commits on Aug 23, 2023

@rkooo567 rkooo567 commented May 2, 2023

Why are these changes needed?

When a conda env name is given to the runtime env, we assume the env already exists. However, sometimes the env doesn't exist, and when that happens Ray hangs forever. This PR fixes the issue by always checking the conda env list before creating a conda runtime env.
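
For readers following along, here is a minimal sketch of the kind of check described above, assuming the env list is read from `conda env list --json`. The helper names `list_conda_env_paths` and `check_conda_env_exists` are illustrative only and are not necessarily the functions this PR adds.

```python
import json
import os
import subprocess
from typing import List


def list_conda_env_paths(conda_exe: str = "conda") -> List[str]:
    """Return env paths reported by `conda env list --json` (needs conda on PATH)."""
    result = subprocess.run(
        [conda_exe, "env", "list", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["envs"]


def check_conda_env_exists(name: str, env_paths: List[str]) -> None:
    """Fail fast (instead of hanging) when the named env is missing.

    Assumption: a named env is matched by the last component of its path.
    This simplification does not treat the "base" env specially.
    """
    known = {os.path.basename(os.path.normpath(p)) for p in env_paths}
    if name not in known:
        raise ValueError(
            f"conda env {name!r} not found; known envs: {sorted(known)}"
        )
```

Typical use would be `check_conda_env_exists("my-env", list_conda_env_paths())` before the env is activated, so that a typo in the env name surfaces as an error rather than a hang.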

Related issue number

Closes #28105

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <rkooo567@gmail.com>
def _create():
    if uri is None:

I moved it to inside a function because I want get_conda_env_list to run in a separate thread to avoid blocking agent main thread.
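
A rough sketch of the idea in this comment, assuming the agent runs an asyncio event loop. `slow_env_list` is a hypothetical stand-in for a blocking call such as `get_conda_env_list`, and `create_env` and `main` are illustrative names, not code from this PR.

```python
import asyncio
import time
from typing import Callable, List, Tuple


def slow_env_list() -> List[str]:
    # Hypothetical stand-in for a blocking call that shells out to conda.
    time.sleep(0.2)
    return ["/opt/conda/envs/demo"]


async def create_env(list_envs: Callable[[], List[str]] = slow_env_list) -> List[str]:
    loop = asyncio.get_running_loop()
    # run_in_executor hands the blocking call to a worker thread, so the
    # event loop (the "main thread" of the agent) keeps serving other tasks.
    return await loop.run_in_executor(None, list_envs)


async def main() -> Tuple[List[str], int]:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the event loop gets to run while we wait.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    hb = asyncio.create_task(heartbeat())
    envs = await create_env()
    hb.cancel()
    return envs, ticks
```

If the env listing were called directly inside the coroutine, the heartbeat would not tick at all during the 0.2 s wait; with the executor it keeps ticking.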

@architkulkarni architkulkarni left a comment

Looks great! Just a couple of nits/questions.

By the way, do you happen to know the impact of this on worker startup time? My impression is that conda activate already takes a couple seconds, and this check might take several more seconds depending on how many envs there are, but this performance impact still seems worth the correctness tradeoff.

Comment on lines 327 to 328
# TODO(architkulkarni): Try "conda activate" here to see if the
# env exists, and raise an exception if it doesn't.
Suggested change
# TODO(architkulkarni): Try "conda activate" here to see if the
# env exists, and raise an exception if it doesn't.

Thanks for fixing this :)

jjyao commented May 2, 2023

Could you fix the PR title?

@rkooo567 rkooo567 changed the title from "Fixed lint" to "[runtime env] Fix Ray hangs when nonexistent conda environment is specified #28105" May 2, 2023
rkooo567 commented May 2, 2023

> By the way, do you happen to know the impact of this on worker startup time? My impression is that conda activate already takes a couple seconds, and this check might take several more seconds depending on how many envs there are, but this performance impact still seems worth the correctness tradeoff.

Hmm, actually that's a good point. Maybe I can make the check run only once? We could create a URI that carries the state about it.

@architkulkarni

I think despite the additional complexity, your URI approach is the correct solution. That's what we do for pip currently. I'm fine with leaving it as a followup issue, unless you think the performance regression in this PR is unacceptable.

Note that for production use cases, we don't recommend runtime_env anyway (we recommend using a container image). If the user must use an existing conda environment and they don't need multiple environments, we can recommend they call conda activate before ray start.
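
To make the URI idea concrete, here is a simplified sketch, not Ray's actual plugin code: the (slow) env listing runs once per env name, and later requests for the same URI skip it. The class name `CondaEnvCheckCache` and the `conda://` URI scheme are assumptions made up for this illustration.

```python
from typing import Callable, Iterable, Set


class CondaEnvCheckCache:
    """Hypothetical cache keyed by a per-env URI."""

    def __init__(self, list_env_names: Callable[[], Iterable[str]]):
        # list_env_names is the expensive call (e.g. it shells out to conda).
        self._list_env_names = list_env_names
        self._validated_uris: Set[str] = set()

    def ensure_exists(self, env_name: str) -> str:
        uri = f"conda://{env_name}"  # hypothetical URI scheme
        if uri in self._validated_uris:
            # Already validated: skip the expensive listing.
            return uri
        if env_name not in set(self._list_env_names()):
            raise ValueError(f"conda env {env_name!r} does not exist")
        self._validated_uris.add(uri)
        return uri
```

With this shape, only the first worker that asks for a given env pays for the listing; failures are not cached, so a missing env is rechecked on the next request.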

rkooo567 commented May 3, 2023

Yeah I think it is a perf regression if we merge it this way. Let me see how complicated it is to support the URI.

Signed-off-by: SangBin Cho <rkooo567@gmail.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com>
rkooo567 commented May 3, 2023

@architkulkarni can you check it again? I addressed the problem ^. I think testing is a bit tough (do we have a test that already has a conda env?), so I manually verified it, but lmk if you have an idea for testing.

@architkulkarni architkulkarni left a comment

I just realized a concern about deletion, can you take a look?

stale bot commented Jun 10, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale label Jun 10, 2023
@rkooo567 rkooo567 removed the stale label Jun 21, 2023
stale bot commented Aug 10, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale label Aug 10, 2023
@stale stale bot removed the stale label Aug 22, 2023
@architkulkarni architkulkarni left a comment

My main concern is that using time.time() in the test is risky; it might make the test flaky. I'll let you decide the urgency, given that branch cut is tomorrow.
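
The test under discussion is not shown here, so the following is only a general illustration of one way to make time-dependent test logic deterministic: take the clock as a parameter (defaulting to `time.monotonic`, which is not affected by system clock adjustments) and substitute a fake clock in tests. `wait_until` and `FakeClock` are hypothetical names, not part of this PR.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll predicate until it returns True or timeout_s elapses on `clock`."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if predicate():
            return True
        sleep(0.01)
    # One last check after the deadline has passed.
    return predicate()


class FakeClock:
    """Deterministic clock for tests: time only advances when sleep() is called."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds
```

In a test, passing `clock=fake` and `sleep=fake.sleep` means the result no longer depends on how loaded the CI machine is.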

@rkooo567

Failed tests seem unrelated

@rkooo567 rkooo567 merged commit 69ed38b into ray-project:master Aug 23, 2023
86 of 94 checks passed
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…cified ray-project#28105 (ray-project#34956)

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…cified ray-project#28105 (ray-project#34956)

Signed-off-by: Victor <vctr.y.m@example.com>
Development

Successfully merging this pull request may close these issues.

[runtime env] Ray hangs when nonexistent conda environment is specified
3 participants