
[operator] Use of '~' hard-coding in ray code #14155

Closed

erikerlandson opened this issue Feb 17, 2021 · 10 comments
Labels: bug (Something that is supposed to be working; but isn't), infra (autoscaler, ray client, kuberay, related issues), P2 (Important issue, but not time-critical)

Milestone: Serverless Autoscaling

Comments

erikerlandson (Contributor) commented Feb 17, 2021

What is the problem?

There are several places in the Ray code where ~ is hard-coded as the user's home directory.
This is a problem for Ray images in which no home directory has been created. It is also a potential problem for any image running on OpenShift, where containers run with an anonymous random UID for which no passwd entry exists.
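
For concreteness, here is a minimal Python sketch of the failure mode (my illustration of the mechanics, not Ray code):

```python
import os

# Minimal sketch of the failure mode (illustration, not Ray code).
# On an OpenShift-style container the process runs as a random UID with no
# passwd entry, and HOME is either unset or set to '/'.

# Case 1: HOME unset and no passwd entry for the UID. On Python 3.8+,
# expanduser() then returns the path *unchanged*, so a hard-coded '~/...'
# stays literal. (On a normal workstation the passwd fallback succeeds,
# so you won't see this locally.)
os.environ.pop("HOME", None)
print(os.path.expanduser("~/.bashrc"))

# Case 2: HOME=/ (as in the image described below): '~' expands to '/',
# and /.bashrc does not exist.
os.environ["HOME"] = "/"
print(os.path.expanduser("~/.bashrc"))  # prints '/.bashrc'
```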

Ray version and other system information (Python version, TensorFlow version, OS):

ray 2.0 (dev branch)

Reproduction (REQUIRED)


Run a Ray cluster using quay.io/erikerlandson/ray-ubi as the image. It will fail when trying to execute source ~/.bashrc, because ~ maps to / and no such .bashrc file exists. For example, any command run via _with_interactive will fail in this way; there are possibly other places where ~ is used.
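
For reference, the wrapper in question behaves roughly like this (a paraphrase of the autoscaler's command runner at the time, not the verbatim Ray source):

```python
from shlex import quote

# Paraphrase of the autoscaler's _with_interactive wrapper (not verbatim
# Ray source): every command gets prefixed with `source ~/.bashrc`, so it
# breaks on any image where '~' resolves to a directory with no .bashrc.
def _with_interactive(cmd: str) -> list:
    force_interactive = "true && source ~/.bashrc && ({})"
    return ["bash", "--login", "-c", "-i", quote(force_interactive.format(cmd))]

print(_with_interactive("ray start --head"))
```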

One possible solution is to locate all appearances of ~ in the Ray code and replace them with an environment variable, perhaps called RAY_HOME (the exact name is unimportant).
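
A sketch of what that could look like (RAY_HOME here is the proposed variable, which does not exist in Ray today):

```python
import os

def ray_home() -> str:
    # RAY_HOME is the variable proposed above; it does not exist in Ray
    # today. Fall back to '~' expansion so images that do have a home
    # directory keep working unchanged.
    return os.environ.get("RAY_HOME") or os.path.expanduser("~")

bashrc_path = os.path.join(ray_home(), ".bashrc")
```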

I suspect the root problem is the assumption that a home directory was used during the installation of Ray on the image, which is where the references to ~/.bashrc and similar come from. So possibly a RAY_HOME environment variable is less the issue than simply documenting the requirements for a Ray image to operate: for example, "construct your image such that ray is on the path", or better yet, "by the time ray start is run, ray must be on the path and the Ray libraries must be in the Python environment". For example, I replace invocations of ray start ... with pipenv run ray start ... in the places where that is configured in YAML.

@erikerlandson added the bug and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Feb 17, 2021
erikerlandson (Contributor, Author) commented:
I may be able to hack around this by setting HOME on my images; I will experiment.
However, avoiding the use of ~ still seems desirable.

On a related note, assuming that Ray has been installed via conda, or that it needs .bashrc and friends, is an assumption worth avoiding if possible.

@erikerlandson changed the title from "Use of '~' hard-coding in ray code should be replaced by $RAY_HOME" to "[operator] Use of '~' hard-coding in ray code should be replaced by $RAY_HOME" on Feb 17, 2021
@erikerlandson changed the title from "[operator] Use of '~' hard-coding in ray code should be replaced by $RAY_HOME" to "[operator] Use of '~' hard-coding in ray code" on Feb 17, 2021
erikerlandson (Contributor, Author) commented Feb 17, 2021

@DmitriGekhtman what do you think: is a RAY_HOME environment variable a good way forward, or should the goal be to remove all references to ~, ~/.bashrc, etc. altogether?

Use of ~/.bashrc is not relevant to my images at all, so IMO that argues for not assuming its existence in the code.

erikerlandson (Contributor, Author) commented:
OK, a workaround exists: setting HOME=/valid/path and creating an empty $HOME/.bashrc on my image.
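
A quick sanity check (my own sketch, not Ray code) to confirm inside the image that the workaround took effect:

```python
import os

# Sketch of an in-image sanity check: HOME must point at a real directory,
# and the stub .bashrc must exist so that `source ~/.bashrc` succeeds.
home = os.environ.get("HOME", "")
assert home not in ("", "/"), "set HOME to a valid writable path"
assert os.path.isdir(home), "HOME must be an existing directory"
assert os.path.isfile(os.path.join(home, ".bashrc")), "create an empty $HOME/.bashrc"
```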

As a side note, I had to install uptime on my image, as it wasn't there by default. This should be easy to do with any standard OS package manager, but it might be worth documenting in a "requirements for Ray container images" topic.

DmitriGekhtman (Contributor) commented:
@erikerlandson

Thanks for the suggestions -- agreed that Ray code should make fewer assumptions about the existence of a home directory, .bashrc, etc.,
and that we should more carefully document the requirements for building Ray images for use on Kubernetes.

cc @ijrsvt

Hmm...as for uptime in particular, I think the code path that calls it could be avoided with some restructuring...

ijrsvt (Contributor) commented Feb 22, 2021

@erikerlandson Are you using the Ray Cluster Launcher with the Kubernetes NodeProvider?

DmitriGekhtman (Contributor) commented:
The focus here is on the K8s operator, which does this under the hood:
https://github.com/ray-project/ray/blob/master/python/ray/ray_operator/operator.py#L58

erikerlandson (Contributor, Author) commented:
@ijrsvt I am using the ray operator to create clusters. To see specifically what I'm doing, the relevant demo is here:
https://github.com/erikerlandson/ray-odh-demo

(Note: I'm working purely with the Ray 2.0 head of the dev branch, including the new client/server connection.)

stale bot commented Jun 22, 2021

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale bot added the stale label (The issue is stale. It will be closed within 7 days unless there is further conversation) on Jun 22, 2021
erikerlandson (Contributor, Author) commented:
bump

@stale bot removed the stale label on Jun 22, 2021
DmitriGekhtman (Contributor) commented:
This should be solved at the same time as #16093.

@richardliaw added this to the Serverless Autoscaling milestone on Jul 2, 2021
@richardliaw added the P2 (Important issue, but not time-critical) label and removed the triage label on Jul 7, 2021
@AmeerHajAli added the infra (autoscaler, ray client, kuberay, related issues) label on Mar 26, 2022