[core] "Windows fatal exception: access violation" cluttering terminal #13511
Comments
any idea how to solve this?
(pid=31996) Windows fatal exception: access violation
And then at some point, it will crash with
2021-02-05 17:24:29,648 WARNING worker.py:1034 -- The log monitor on node DESKTOP-QJDSQ0R failed with the following error:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
forrtl: error (200): program aborting due to control-C event
please do help |
Hi Evan,
A partial solution to this problem is to use ray.init(log_to_driver=False) when
initializing your ray cluster. This got rid of some of the mess in the
terminal due to the particular library I was using (jpype), but the
messages still show up sometimes related to other things (seems random).
Wish I could help more, and if you find a solution please post to Github!
Thanks,
Avi
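The partial workaround described above can be sketched as follows. This is a minimal sketch, assuming Ray is installed; `init_ray_quietly` is a hypothetical helper name, and only the `log_to_driver=False` argument comes from this thread. It stops worker output from being forwarded to the driver terminal, so it hides legitimate worker prints as well.

```python
def init_ray_quietly(**init_kwargs):
    """Start Ray without forwarding worker stdout/stderr to the driver.

    Hypothetical helper for illustration; extra keyword arguments are
    passed straight through to ray.init().
    """
    # Imported lazily so this sketch can be loaded even where Ray is absent.
    import ray

    return ray.init(log_to_driver=False, **init_kwargs)
```

Calling `init_ray_quietly(num_cpus=4)` instead of `ray.init(num_cpus=4)` would be one way to apply it.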
…On Fri, Feb 5, 2021 at 4:26 AM Evan Hu (YiFan Hu) ***@***.***> wrote:
any idea how to solve this?
I have a similar problem when I use the deap package.
The code seems to run fine, but it keeps printing a "fatal" exception,
and it seems to be only printed output, not a real exception.
import ray

@ray.remote
class Ray_Deap_Map():
    def __init__(self, creator_setup=None, pset_creator=None):
        # issue 946? Ensure non trivial startup to prevent bad load balance across a cluster
        # sleep(0.01)
        # recreate scope from global
        # For GA no need to provide pset_creator. Both needed for GP
        self.creator_setup = creator_setup
        self.psetCreator = pset_creator
        if creator_setup is not None:
            self.creator_setup()
            self.psetCreator()

    def ray_remote_eval_batch(self, f, iterable):
        # iterable, id_ = zipped_input
        # attach id so we can reorder the batches
        return [f(i) for i in iterable]

def ray_deap_map(func, pop, creator_setup, pset_creator):
    n_workers = int(ray.cluster_resources()['CPU'])
    if n_workers == 1:
        results = list(map(func, pop))  # forced eval to time it
    else:
        # many workers: never start more actors than there are individuals
        n_workers = min(n_workers, len(pop))
        n_per_batch = int(len(pop) / n_workers) + 1
        batches = [pop[i:i + n_per_batch] for i in range(0, len(pop), n_per_batch)]
        actors = [Ray_Deap_Map.remote(creator_setup, pset_creator) for _ in range(n_workers)]
        result_ids = [a.ray_remote_eval_batch.remote(func, b) for a, b in zip(actors, batches)]
        results = ray.get(result_ids)
    return sum(results, [])
(pid=31996) Windows fatal exception: access violation
(pid=31996)
(pid=21820) Windows fatal exception: access violation
(pid=21820)
(pid=31372) Windows fatal exception: access violation
(pid=31372)
(pid=24640) Windows fatal exception: access violation
(pid=24640)
(pid=31380) Windows fatal exception: access violation
(pid=31380)
(pid=15396) Windows fatal exception: access violation
(pid=15396)
(pid=21660) Windows fatal exception: access violation
(pid=21660)
(pid=21976) Windows fatal exception: access violation
(pid=21976)
(pid=29076) Windows fatal exception: access violation
(pid=29076)
(pid=32212) Windows fatal exception: access violation
(pid=32212)
(pid=25964) Windows fatal exception: access violation
(pid=25964)
(pid=17224) Windows fatal exception: access violation
(pid=17224)
(pid=31964) Windows fatal exception: access violation
(pid=31964)
(pid=25632) Windows fatal exception: access violation
(pid=25632)
(pid=27112) Windows fatal exception: access violation
(pid=27112)
(pid=32620) Windows fatal exception: access violation
And then at some point, it will crash with
2021-02-05 17:24:29,648 WARNING worker.py:1034 -- The log monitor on node DESKTOP-QJDSQ0R failed with the following error:
OSError: [WinError 87] The parameter is incorrect.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\Users\eiahb.conda\envs\env_genetic_programming\lib\site-packages\ray\log_monitor.py", line 354, in
    log_monitor.run()
  File "C:\Users\eiahb.conda\envs\env_genetic_programming\lib\site-packages\ray\log_monitor.py", line 275, in run
    self.open_closed_files()
  File "C:\Users\eiahb.conda\envs\env_genetic_programming\lib\site-packages\ray\log_monitor.py", line 164, in open_closed_files
    self.close_all_files()
  File "C:\Users\eiahb.conda\envs\env_genetic_programming\lib\site-packages\ray\log_monitor.py", line 102, in close_all_files
    os.kill(file_info.worker_pid, 0)
SystemError: returned a result with an error set
forrtl: error (200): program aborting due to control-C event
Image PC Routine Line Source
libifcoremd.dll 00007FFDC0AE3B58 Unknown Unknown Unknown
KERNELBASE.dll 00007FFE221862A3 Unknown Unknown Unknown
KERNEL32.DLL 00007FFE24217C24 Unknown Unknown Unknown
ntdll.dll 00007FFE2470D4D1 Unknown Unknown Unknown
Windows fatal exception: access violation
please do help
|
Thanks I'll try that out |
I have this problem just by running the example code from the README. What is the cause, and why is it happening in such an obvious place?
from ray import tune

def objective(step, alpha, beta):
    return (0.1 + alpha * step / 100)**(-1) + beta * 0.1

def training_function(config):
    # Hyperparameters
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        # Iterative training function - can be any arbitrary training procedure.
        intermediate_score = objective(step, alpha, beta)
        # Feed the score back to Tune.
        tune.report(mean_loss=intermediate_score)

analysis = tune.run(
    training_function,
    config={
        "alpha": tune.grid_search([0.001, 0.01, 0.1]),
        "beta": tune.choice([1, 2, 3])
    })
print("Best config: ", analysis.get_best_config(metric="mean_loss", mode="min"))
# Get a dataframe for analyzing trial results.
df = analysis.results_df
Here are some of the outputs:
|
I have this problem when I try to run the example code from the tutorials.
Here are some outputs:
|
I have a very similar error using the ray tune setup from here: https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html#sphx-glr-beginner-hyperparameter-tuning-tutorial-py |
I have the same issue, using the following reproducer.
from ray.util.queue import Queue
from ray import available_resources
from time import sleep
queue1 = Queue(actor_options={"num_cpus": 4})
sleep(10)
print(available_resources())
queue2 = Queue(actor_options={"num_cpus": 4})
sleep(10)
print(available_resources())
queue2 = Queue(actor_options={"num_cpus": 1})
sleep(10)
print(available_resources())
Logs:
2021-08-27 09:21:41,123 INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8265
{'object_store_memory': 4175845785.0, 'memory': 8351691572.0, 'node:192.168.100.108': 1.0, 'CPU': 4.0}
{'memory': 8351691572.0, 'node:192.168.100.108': 1.0, 'object_store_memory': 4175845785.0}
(pid=37296) Windows fatal exception: access violation
(pid=37296)
{'object_store_memory': 4175845785.0, 'memory': 8351691572.0, 'node:192.168.100.108': 1.0, 'CPU': 3.0}
Initially found in modin-project/modin#3256. cc @rkooo567
In some cases this causes tests to hang. |
I am able to reproduce the exception in the description with the following code:
import psutil
import ray
import jpype
import sys
print("psutil", psutil.__version__)
print("ray", ray.__version__)
print("jpype", jpype.__version__)
print("sys", sys.version_info)

@ray.remote
class ObjectiveFunc(object):
    def __init__(self):
        self.java = jpype.startJVM()

class RayMap(object):
    def __init__(self, num_workers):
        self.workers = []
        for _ in range(num_workers):
            self.workers.append(ObjectiveFunc.remote())

num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus, include_dashboard=False)
rm = RayMap(4)
Output:
psutil 5.8.0
ray 2.0.0.dev0
jpype 1.3.0
sys sys.version_info(major=3, minor=8, micro=11, releaselevel='final', serial=0)
c:\users\gagan\gsingh\ray\python\ray\_private\services.py:238: UserWarning: Not all Ray Dashboard dependencies were found. To use the dashboard please install Ray using `pip install ray[default]`. To disable this message, set RAY_DISABLE_IMPORT_WARNING env var to '1'.
warnings.warn(warning_message)
(pid=6856) Windows fatal exception: access violation
(pid=6856)
(pid=6856) Stack (most recent call first):
(pid=6856) File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=6856) File "ray_jpype.py", line 13 in __init__
(pid=6856) File "c:\users\gagan\gsingh\ray\python\ray\_private\function_manager.py", line 579 in actor_method_executor
(pid=6856) File "c:\users\gagan\gsingh\ray\python\ray\worker.py", line 429 in main_loop
(pid=6856) File "c:\users\gagan\gsingh\ray\python\ray\workers/default_worker.py", line 214 in <module>
(pid=5812) Windows fatal exception: access violation
(pid=5812)
(pid=5812) Stack (most recent call first):
(pid=5812) File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=5812) File "ray_jpype.py", line 13 in __init__
(pid=5812) File "c:\users\gagan\gsingh\ray\python\ray\_private\function_manager.py", line 579 in actor_method_executor
(pid=5812) File "c:\users\gagan\gsingh\ray\python\ray\worker.py", line 429 in main_loop
(pid=5812) File "c:\users\gagan\gsingh\ray\python\ray\workers/default_worker.py", line 214 in <module>
(pid=4984) Windows fatal exception: access violation
(pid=4984)
(pid=4984) Stack (most recent call first):
(pid=4984) File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=4984) File "ray_jpype.py", line 13 in __init__
(pid=4984) File "c:\users\gagan\gsingh\ray\python\ray\_private\function_manager.py", line 579 in actor_method_executor
(pid=4984) File "c:\users\gagan\gsingh\ray\python\ray\worker.py", line 429 in main_loop
(pid=4984) File "c:\users\gagan\gsingh\ray\python\ray\workers/default_worker.py", line 214 in <module>
(pid=9560) Windows fatal exception: access violation
(pid=9560)
(pid=9560) Stack (most recent call first):
(pid=9560) File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=9560) File "ray_jpype.py", line 13 in __init__
(pid=9560) File "c:\users\gagan\gsingh\ray\python\ray\_private\function_manager.py", line 579 in actor_method_executor
(pid=9560) File "c:\users\gagan\gsingh\ray\python\ray\worker.py", line 429 in main_loop
(pid=9560) File "c:\users\gagan\gsingh\ray\python\ray\workers/default_worker.py", line 214 in <module> |
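The `Stack (most recent call first):` header in the output above is the format used by Python's standard `faulthandler` module. A minimal sketch, assuming faulthandler is what prints these messages inside the worker process: pause it around the call that triggers them. `faulthandler_paused` is a hypothetical helper, and whether this silences the output inside a Ray actor has not been verified in this thread.

```python
import contextlib
import faulthandler


@contextlib.contextmanager
def faulthandler_paused():
    """Temporarily disable faulthandler, restoring it afterwards if it was on."""
    was_enabled = faulthandler.is_enabled()
    if was_enabled:
        faulthandler.disable()
    try:
        yield
    finally:
        if was_enabled:
            try:
                # Re-enabling needs a real stderr file descriptor; give up quietly if absent.
                faulthandler.enable()
            except (OSError, ValueError, RuntimeError, AttributeError):
                pass
```

In the reproducer, this would mean wrapping the JVM start in `with faulthandler_paused(): ...` inside `ObjectiveFunc.__init__`. A real crash in that window would then produce no Python-level stack dump, which is the trade-off.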
Can I investigate this further? Upon further investigation I found that this issue is related to access of unallocated memory address by
import jpype
from multiprocessing import Pool

def f(x):
    obj = jpype.startJVM()
    print(obj)
    return x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3])) |
Hi. I am investigating this issue. I noticed that there is a concept of a worker (which is a process, as far as I understand). IMO, the above issue is caused by a memory allocation problem while creating that worker. Would it be possible to point me to the code in ray (its location inside the project) that creates that worker? Thanks. |
Hmm, I think you might want to look at ray/services.py? |
Updates: with different versions I observed different things.
ray-1.6.0 - Output is as described by the author.
ray-1.3.0 - There is some exception ignored in
(pid=6764) Windows fatal exception: access violation
(pid=6764)
(pid=6764) Stack (most recent call first):
(pid=6764) File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=6764) File "ray_jpype.py", line 13 in __init__
(pid=6764) File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\_private\function_manager.py", line 556 in actor_method_executor
(pid=6764) File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\worker.py", line 382 in main_loop
(pid=6764) File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\workers/default_worker.py", line 196 in <module>
Exception ignored in: <function ActorHandle.__del__ at 0x000001BE40690AF0>
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\actor.py", line 809, in __del__
AttributeError: 'NoneType' object has no attribute 'global_worker'
ray-1.0.0 |
Hi. I dug deeper into the issue of the access violation exception cluttering the command prompt (a.k.a. terminal). The following are some noteworthy points:
@ray.remote
def f(i):
    return subprocess.run(['python', 'C:\\Users\\gagan\\ray_project\\java_work.py', str(i)])
In other words, we launch a Python subprocess and call the jpype APIs there because, from my observations, the jpype APIs work fine when called from Python processes but not from C++ libraries (as described above). |
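The subprocess approach above can be tried without Ray or jpype. The following is a minimal sketch using only the standard library; `CHILD_CODE` is a hypothetical stand-in for `java_work.py`, and `run_isolated` is an illustrative name. In the Ray version, the body of the remote task would make the same kind of call.

```python
import subprocess
import sys

# Hypothetical stand-in for java_work.py: reads one integer argument and prints its double.
CHILD_CODE = "import sys; print(int(sys.argv[1]) * 2)"


def run_isolated(i):
    """Run the work in a fresh Python process and return what it printed."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(i)],
        capture_output=True,  # collect stdout/stderr instead of inheriting the terminal
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()
```

Using `sys.executable` rather than the bare name `python` keeps the child in the same environment as the parent, which matters when Ray and jpype live in a conda env.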
IMO, Fix in 1 would be the best to have as it is easy. We need to find the right spot inside C++ code of ray to add |
Our current hypothesis is that the access violation messages are not impacting functionality (but they are very annoying indeed). There is a summary in #18944. We are suppressing them for now, see #19561. I'm closing this issue for now as #19561 should fix most of the inconvenience here. However, if somebody has more insight into this problem and can actually make these access violation errors go away, that would be most welcome. There are several open source projects (including Python/C extension related ones) that have been wrestling with this issue, and to the best of my knowledge the problem is not super well understood at the moment. |
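For anyone on an older version who post-processes driver output, here is a minimal sketch of what suppressing these messages could look like. It assumes the output arrives as text lines with the `(pid=N)` prefix seen in this thread. This is not the change made in #19561; `drop_access_violation_lines` and its regex are hypothetical names for illustration. It drops only the message and the empty line that follows it, not any stack lines.

```python
import re

ACCESS_VIOLATION = "Windows fatal exception: access violation"
# Optional "(pid=1234) " prefix that the driver puts in front of worker output.
PID_PREFIX = re.compile(r"^\(pid=\d+\)\s*")


def drop_access_violation_lines(lines):
    """Yield lines, skipping the access violation message and the blank line after it."""
    skip_blank = False
    for line in lines:
        body = PID_PREFIX.sub("", line.strip())
        if body == ACCESS_VIOLATION:
            skip_blank = True
            continue
        if skip_blank and body == "":
            skip_blank = False
            continue
        skip_blank = False
        yield line
```

Matching on the exact message text keeps unrelated worker output untouched, at the cost of missing any variant wording.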
For what it's worth, albeit with an entirely different call stack, I'm seeing a similar error message: "Windows fatal exception: access violation". This is with an application using sockets under asyncio, with the IOCP Proactor on Windows 10. The Python version is 3.11.5, installed via Chocolatey.
When using a selector event loop on Windows, the segfault does not occur as such.
import asyncio as aio
import sys
loop = aio.SelectorEventLoop() if sys.platform == "win32" else aio.get_event_loop_policy().get_event_loop()
Towards reproducing the error: There's an example using HTTPX to run a single HTTP request [moved to gist]
With the example, the Windows access violation might not occur until the end of
Using a selector event loop, the segfault does not occur.
HTH, apologies if it's too far off topic, moreover with the different call stack in the example. |
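The selector-loop workaround above can be sketched as a complete runnable example. This is a minimal sketch using only the standard library; `make_loop`, `ping`, and `run_with_selector_loop` are hypothetical names, and `ping` stands in for the real socket work (the HTTPX example from the gist is not reproduced here).

```python
import asyncio
import sys


def make_loop():
    """Return a selector-based loop on Windows, the default loop type elsewhere."""
    if sys.platform == "win32":
        # Avoids the IOCP proactor loop that Windows uses by default.
        return asyncio.SelectorEventLoop()
    return asyncio.new_event_loop()


async def ping():
    # Placeholder coroutine standing in for the application's socket I/O.
    await asyncio.sleep(0)
    return "ok"


def run_with_selector_loop(coro):
    """Run a coroutine to completion on the loop chosen above, then close the loop."""
    loop = make_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()
```

Note that the selector loop on Windows does not support subprocesses or pipes, so this swap is only an option for socket-only applications.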
What is the problem?
I am using Ray 1.1.0 with Python 3.7.6 to run an ActorPool. Each actor needs access to its own copy of a Java virtual machine (created using jpype, which is a dependency of another package used by the actors, and which seems to be the root of this issue). Ray seems to handle this just fine; however, it prints many lines of errors to the terminal, all of which are repeats of:
(pid=18064) Windows fatal exception: access violation
(pid=18064)
(pid=18064) Stack (most recent call first):
(pid=18064) File "C:\ProgramData\Anaconda3\lib\site-packages\jpype\_core.py", line 222 in startJVM
(pid=18064) File "c:\Users\Kursti\Documents\Python\ray_access_violation.py", line 15 in __init__
(pid=18064) File "C:\ProgramData\Anaconda3\lib\site-packages\ray\function_manager.py", line 556 in actor_method_executor
(pid=18064) File "C:\ProgramData\Anaconda3\lib\site-packages\ray\worker.py", line 383 in main_loop
(pid=18064) File "C:\ProgramData\Anaconda3\lib\site-packages\ray\workers/default_worker.py", line 181 in
(pid=11676) Windows fatal exception: access violation
Again, the code we're running seems to work fine, but the terminal clutter makes it challenging to work with our code. This issue has also come up intermittently without using jpype, but is not reproducible. Any idea how we can fix this problem?
Reproduction (REQUIRED)
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".