[core][Bug] Ray processes escaping hermetic python environment #22977
Comments
Also can confirm that the issue gets fixed by passing Will see if i can repro the issue in a shareable example |
Would deleting the |
Thanks @architkulkarni , will try passing in the runtime environment args and confirm |
Sounds good, thanks! I'm not too familiar with bazel environments, but it might be appropriate to pass in a certain directory for |
Naively setting
I will continue checking other env variables that could possibly impact this behaviour. The easiest solution from our perspective would be an additional param in ray.init that would toggle the Something like Do you think adding that is a feasible option at your end ? |
@ponner-github thanks for trying it out! I'd like to figure out why the runtime_env failed to be set up. Do you mind sharing your code that defines and passes in the runtime_env? Also, could you see if there are relevant logs in Regarding the possibility of I'd say we should try to get the environment variable approach to work first. I think it should be possible using PYTHONPATH, PYTHONHOME or something similar. |
So it looks like
https://peps.python.org/pep-0370/#implementation Looking for -S now... |
It seems like It doesn't seem like PYTHONPATH/PYTHONHOME will have the same effect -- these environment variables augment the import flow for Python, but do not disable the searching in site-packages. |
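To make the distinction above concrete, here is a minimal sketch (plain Python, not Ray code) that starts a child interpreter and reports its `sys.flags.no_site` and `sys.flags.no_user_site` values. The helper name `child_site_flags` is made up for illustration. As far as I know, `PYTHONNOUSERSITE` only mirrors `-s`, and there is no environment variable that mirrors `-S`, which is what the last call is meant to show.

```python
import os
import subprocess
import sys


def child_site_flags(interp_flags=(), extra_env=None):
    """Start a child interpreter and return its (no_site, no_user_site) flags.

    Hypothetical helper for illustration only; it is not part of Ray.
    """
    env = dict(os.environ)
    # Drop any inherited setting so the result reflects only what we pass in.
    env.pop("PYTHONNOUSERSITE", None)
    env.update(extra_env or {})
    code = "import sys; print(sys.flags.no_site, sys.flags.no_user_site)"
    result = subprocess.run(
        [sys.executable, *interp_flags, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    no_site, no_user_site = (int(value) for value in result.stdout.split())
    return no_site, no_user_site


if __name__ == "__main__":
    print("-S flag:         ", child_site_flags(["-S"]))
    print("-s flag:         ", child_site_flags(["-s"]))
    print("PYTHONNOUSERSITE:", child_site_flags(extra_env={"PYTHONNOUSERSITE": "1"}))
```

The environment variable flips `no_user_site` but leaves `no_site` at 0, so the `site` module is still imported and `site-packages` can still be added to the search path.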
Thanks @architkulkarni and @richardliaw, apologies for the delayed response; I was working on an internal patch to unblock. Here is a simple repro of the issue, emulating the behaviour we are observing
Content of run_dummy_job.py
Output
The raylet extends the module search path into site-packages, as @richardliaw |
Also confirming that passing in env variables with runtime_env does not fix the issue
Output
|
@ponner-github I see, thanks for running those tests. I'll make a patch for this and try to get the API change approved. |
BTW @architkulkarni we should consider doing this via runtime envs rather than through the top-level API. But yeah, let's go through the API change process. |
@architkulkarni this was labeled P1; have we made progress on this? |
@richardliaw Sorry for the delay, there hasn't been progress on this. I'll try to prioritize it. |
Hi @ponner-github , sorry for the late reply. I did some more thinking about this and I think your specific reproduction is actually supported by the current Ray if you add
Running your command:
The output is now
which is what we wanted. There is a separate issue, which is that this only handles the Could you help me confirm my understanding of the gap between just having |
One more question--you mentioned you added the flags in four places for it to work: start_worker_cmd, agent_command, java_command and start_ray_client_server Would it be enough to have it just for the Ray workers (the start_worker_cmd)? One of the implementations I have in mind only works for Ray workers. If not, I can think of a different approach. |
Thanks @architkulkarni
let me revisit my notes and retry the setup on
I think the start worker command and the java_command should be sufficient. When I was initially trying to fix the issue I added the |
Hi @ponner-github @architkulkarni @richardliaw , I have seen the same issue while running it in databricks. I am starting the cluster with What can I do to solve it? Trigger it like: |
@WaterKnight1998 can you give more details about your issue? You can pass Can you try the solution in #22977 (comment)? If you also need the |
Hi @architkulkarni ,
https://discuss.ray.io/t/ray-tune-not-working-inside-databricks/6594/6
I tried this with Ray on Databricks but it didn't work, it looks like it is getting packages from pyspark:

covid_df = (spark
    .read
    .option("header", "true")
    .option('inferSchema', 'true')
    .csv('/databricks-datasets/COVID/USAFacts/covid_deaths_usafacts.csv'))
select_cols = covid_df.columns[4:]
df = (covid_df
    .select(
        col('County Name').alias('county_name'),
        array([col(n) for n in select_cols]
        ).alias('deaths')))

@ray.remote
def linear_pred(x,y, i):
    import sys
    sys.path="/databricks/python/lib/python3.8/site-packages"
    from os import listdir
    raise Exception(listdir("/databricks/python/lib/python3.8/site-packages"))
    import imblearn
    import pandas as pd
    reg = linear_model.ElasticNet().fit(x, y)
    p = reg.predict(np.array([[i + 1]]))
    return p[0]

@pandas_udf(ArrayType(LongType()))
def ray_udf(s):
    ray.init(ignore_reinit_error=True, address='auto', _redis_password='d4t4bricks', runtime_env = {"env_vars": {"PYTHONPATH": "/databricks/python/lib/python3.8/site-packages", "PYTHONNOUSERSITE": "1"}})
    s = list(s)
    pred = []
    workers = []
    for i in range(len(s)):
        x = list(range(i+1))
        x = np.asarray([[n] for n in x])
        y = s[:i+1]
        y = np.asarray(y)
        workers.append(linear_pred.remote(x, y, i))
    pred = ray.get(workers)
    return pd.Series(pred)

res = df.select("county_name", "deaths", ray_udf("deaths").alias("preds"))
display(res)
Maybe I could try to do the same at databricks init script, I guess I just need to replace installation by this:

mkdir /tmp/custom_install && \
python3.7 -m pip install -t /tmp/custom_install setuptools && \
python3.7 -m pip install -t /tmp/custom_install ray

So init script would look like:
|
For the record, we've been patching the -S flag into the Ray code each time we upgrade Ray. Adding myself as an assignee just so I can easily track Ray issues that affect Cruise -- I don't necessarily intend to fix this myself. |
A proposed quick fix that wouldn't require an explicit API change is to pass the -S and -s flags from the root Ray process down to the Python workers. |
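A rough sketch of what that propagation could look like, assuming the parent's settings are read from `sys.flags` and prepended to the child's argv. The helper names `inherited_site_flags` and `build_worker_command` and the `worker.py` script are hypothetical; this is not how Ray's `services.py` is actually written.

```python
import sys
from types import SimpleNamespace


def inherited_site_flags(flags=sys.flags):
    """Return the -S/-s options matching the given interpreter flags."""
    options = []
    if flags.no_site:
        options.append("-S")  # parent was started without the site module
    if flags.no_user_site:
        options.append("-s")  # parent ignores the user site-packages directory
    return options


def build_worker_command(script, script_args=(), flags=sys.flags):
    """Build an argv list for a child process that keeps the parent's site settings."""
    return [sys.executable, *inherited_site_flags(flags), script, *script_args]


if __name__ == "__main__":
    # Simulate a parent that was launched as `python -S -s ...`.
    hermetic_parent = SimpleNamespace(no_site=1, no_user_site=1)
    print(build_worker_command("worker.py", ["--example-arg"], hermetic_parent))
```

Reading the flags at launch time means a non-hermetic parent produces the same command as before, so existing setups would not change behaviour.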
Search before asking
Ray Component
Ray Core
Issue Severity
Medium: It contributes significant difficulty to completing my task, but I can work around it and get it resolved.
What happened + What you expected to happen
After calling ray.init(), the (grand)child Python processes escape our hermetic Python environment (specifically, they start to look for modules on the system instead of in our Bazel build sandbox).
This leads to Actor failures.
Looking at python/ray/_private/services.py, most Python subprocesses are started as {sys.executable} <some script> without the -Ss flags that would prevent extending the module search path into site-packages.
Please let us know if there are any workarounds that can be applied to deal with this or code references that show this should not be happening.
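For anyone who wants to check the effect described above outside of Ray, the following sketch compares a child interpreter's `sys.path` with and without `-S -s`. The helpers `child_sys_path` and `site_dirs` are illustrative names, and matching on directories named `site-packages` or `dist-packages` is a simplifying assumption. `PYTHONPATH` is removed from the child's environment so that only the interpreter flags differ between runs.

```python
import json
import os
import subprocess
import sys


def child_sys_path(interp_flags=()):
    """Return sys.path as seen by a child interpreter started with the given flags."""
    # Remove PYTHONPATH so that only the interpreter flags influence the result.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    code = "import sys, json; print(json.dumps(sys.path))"
    result = subprocess.run(
        [sys.executable, *interp_flags, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def site_dirs(paths):
    """Keep only entries whose last path component is site-packages or dist-packages."""
    names = ("site-packages", "dist-packages")
    return [p for p in paths if os.path.basename(p.rstrip(os.sep)) in names]


if __name__ == "__main__":
    print("default:", site_dirs(child_sys_path()))
    print("-S -s:  ", site_dirs(child_sys_path(["-S", "-s"])))
```

If the second line prints an empty list while the first does not, the child started without the flags is the one that can see packages outside the sandbox.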
Versions / Dependencies
ray==1.9.1
python==3.7 (via bazel)
Reproduction script
Working on a shareable repro...
Anything else
No response
Are you willing to submit a PR?