
[core] "Windows fatal exception: access violation" cluttering terminal #13511

Closed
ashragai opened this issue Jan 18, 2021 · 16 comments
Labels
bug (Something that is supposed to be working; but isn't) · P1 (Issue that should be fixed within a few weeks) · windows

Comments

@ashragai

ashragai commented Jan 18, 2021

What is the problem?

I am using Ray 1.1.0 with Python 3.7.6 to run an ActorPool. Each actor needs access to its own copy of a Java virtual machine (created using jpype, which is a dependency of another package used by the actors, and which seems to be the root of this issue). Ray appears to handle this just fine; however, it prints many lines of errors to the terminal, all of which are repeats of:

(pid=18064) Windows fatal exception: access violation
(pid=18064)
(pid=18064) Stack (most recent call first):
(pid=18064)   File "C:\ProgramData\Anaconda3\lib\site-packages\jpype\_core.py", line 222 in startJVM
(pid=18064)   File "c:\Users\Kursti\Documents\Python\ray_access_violation.py", line 15 in __init__
(pid=18064)   File "C:\ProgramData\Anaconda3\lib\site-packages\ray\function_manager.py", line 556 in actor_method_executor
(pid=18064)   File "C:\ProgramData\Anaconda3\lib\site-packages\ray\worker.py", line 383 in main_loop
(pid=18064)   File "C:\ProgramData\Anaconda3\lib\site-packages\ray\workers/default_worker.py", line 181 in <module>
(pid=11676) Windows fatal exception: access violation

Again, the code we're running seems to work fine, but the terminal clutter makes it challenging to work with our code. This issue has also come up intermittently without using jpype, but that case is not reproducible. Any idea how we can fix this problem?

Reproduction (REQUIRED)

import psutil
import ray
import jpype

@ray.remote
class ObjectiveFunc(object):
    def __init__(self):
        self.java = jpype.startJVM()

class RayMap(object):
    def __init__(self, num_workers):
        self.workers = []
        for _ in range(num_workers):
            self.workers.append(ObjectiveFunc.remote())

num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus, include_dashboard=True)
rm = RayMap(4)

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [x] I have verified the issue also occurs with the latest wheels.
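A possible stopgap, sketched below on the assumption (not verified in this issue) that the "Windows fatal exception" banner is printed by Python's faulthandler module running inside each worker process. QuietObjectiveFunc is an illustrative name, not part of the reproduction above; the jpype.startJVM() call mirrors the one in the script.

import faulthandler

import jpype
import ray

@ray.remote
class QuietObjectiveFunc:
    def __init__(self):
        # Assumption: the banner comes from faulthandler inside the worker;
        # disabling it here may silence the message without changing JVM startup.
        faulthandler.disable()
        self.java = jpype.startJVM()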
@ashragai ashragai added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 18, 2021
@krfricke krfricke changed the title "Windows fatal exception: access violation" cluttering terminal [core] "Windows fatal exception: access violation" cluttering terminal Jan 18, 2021
@ericl ericl added P2 Important issue, but not time-critical and removed core triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 20, 2021
@eiahb3838ya

Any idea how to solve this?
I have a similar problem when I use the deap package.
The code seems to run fine, but it keeps printing "fatal" exceptions.
They appear to be printed output rather than real exceptions.

import ray

@ray.remote
class Ray_Deap_Map():
    def __init__(self, creator_setup=None, pset_creator = None):
        # issue 946? Ensure non trivial startup to prevent bad load balance across a cluster
        # sleep(0.01)

        # recreate scope from global
        # For GA no need to provide pset_creator. Both needed for GP
        self.creator_setup = creator_setup
        self.psetCreator = pset_creator
        if creator_setup is not None:
            self.creator_setup()
            self.psetCreator()

    def ray_remote_eval_batch(self, f, iterable):
        # iterable, id_ = zipped_input
        # attach id so we can reorder the batches
        return [f(i) for i in iterable]

def ray_deap_map(func, pop, creator_setup, pset_creator):
    n_workers = int(ray.cluster_resources()['CPU'])
    if n_workers == 1:
        # single worker: forced eval to time it; return here so the actor path below is skipped
        return list(map(func, pop))

    # many workers: never use more workers than individuals in the population
    n_workers = min(n_workers, len(pop))

    n_per_batch = int(len(pop) / n_workers) + 1
    batches = [pop[i:i + n_per_batch] for i in range(0, len(pop), n_per_batch)]
    actors = [Ray_Deap_Map.remote(creator_setup, pset_creator) for _ in range(n_workers)]
    result_ids = [a.ray_remote_eval_batch.remote(func, b) for a, b in zip(actors, batches)]
    results = ray.get(result_ids)

    return sum(results, [])

(pid=31996) Windows fatal exception: access violation
(pid=31996)
(pid=21820) Windows fatal exception: access violation
(pid=21820)
(pid=31372) Windows fatal exception: access violation
(pid=31372)
(pid=24640) Windows fatal exception: access violation
(pid=24640)
(pid=31380) Windows fatal exception: access violation
(pid=31380)
(pid=15396) Windows fatal exception: access violation
(pid=15396)
(pid=21660) Windows fatal exception: access violation
(pid=21660)
(pid=21976) Windows fatal exception: access violation
(pid=21976)
(pid=29076) Windows fatal exception: access violation
(pid=29076)
(pid=32212) Windows fatal exception: access violation
(pid=32212)
(pid=25964) Windows fatal exception: access violation
(pid=25964)
(pid=17224) Windows fatal exception: access violation
(pid=17224)
(pid=31964) Windows fatal exception: access violation
(pid=31964)
(pid=25632) Windows fatal exception: access violation
(pid=25632)
(pid=27112) Windows fatal exception: access violation
(pid=27112)
(pid=32620) Windows fatal exception: access violation

And then at some point, it will crash with

2021-02-05 17:24:29,648 WARNING worker.py:1034 -- The log monitor on node DESKTOP-QJDSQ0R failed with the following error:
OSError: [WinError 87] The parameter is incorrect. (參數錯誤。)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\eiahb\.conda\envs\env_genetic_programming\lib\site-packages\ray\log_monitor.py", line 354, in <module>
    log_monitor.run()
  File "C:\Users\eiahb\.conda\envs\env_genetic_programming\lib\site-packages\ray\log_monitor.py", line 275, in run
    self.open_closed_files()
  File "C:\Users\eiahb\.conda\envs\env_genetic_programming\lib\site-packages\ray\log_monitor.py", line 164, in open_closed_files
    self.close_all_files()
  File "C:\Users\eiahb\.conda\envs\env_genetic_programming\lib\site-packages\ray\log_monitor.py", line 102, in close_all_files
    os.kill(file_info.worker_pid, 0)
SystemError: <built-in function kill> returned a result with an error set

forrtl: error (200): program aborting due to control-C event
Image PC Routine Line Source
libifcoremd.dll 00007FFDC0AE3B58 Unknown Unknown Unknown
KERNELBASE.dll 00007FFE221862A3 Unknown Unknown Unknown
KERNEL32.DLL 00007FFE24217C24 Unknown Unknown Unknown
ntdll.dll 00007FFE2470D4D1 Unknown Unknown Unknown
Windows fatal exception: access violation

Please do help.

@ashragai
Author

ashragai commented Feb 5, 2021 via email

@eiahb3838ya

Thanks, I'll try that out.

@offchan42

offchan42 commented Jun 15, 2021

I have this problem just by running the example code from the README. What is the cause, and why is it happening in such an obvious place?

from ray import tune


def objective(step, alpha, beta):
    return (0.1 + alpha * step / 100)**(-1) + beta * 0.1


def training_function(config):
    # Hyperparameters
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        # Iterative training function - can be any arbitrary training procedure.
        intermediate_score = objective(step, alpha, beta)
        # Feed the score back to Tune.
        tune.report(mean_loss=intermediate_score)


analysis = tune.run(
    training_function,
    config={
        "alpha": tune.grid_search([0.001, 0.01, 0.1]),
        "beta": tune.choice([1, 2, 3])
    })

print("Best config: ", analysis.get_best_config(metric="mean_loss", mode="min"))

# Get a dataframe for analyzing trial results.
df = analysis.results_df

Here is some of the output:

(pid=4448) Windows fatal exception: access violation
(pid=4448)
(pid=9924) Windows fatal exception: access violation
(pid=9924)
Result for training_function_b0b44_00002:
  date: 2021-06-16_03-31-42
  done: false
  experiment_id: aa74e089743f42979edb606b5abf80d3
  hostname: LENOVO-LAPTOP
  iterations_since_restore: 1
  mean_loss: 10.2
  neg_mean_loss: -10.2
  node_ip: 192.168.1.102
  pid: 8600
  time_since_restore: 0.0009984970092773438
  time_this_iter_s: 0.0009984970092773438
  time_total_s: 0.0009984970092773438
  timestamp: 1623789102
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: b0b44_00002

Result for training_function_b0b44_00002:
  date: 2021-06-16_03-31-42
  done: true
  experiment_id: aa74e089743f42979edb606b5abf80d3
  experiment_tag: 2_alpha=0.1,beta=2
  hostname: LENOVO-LAPTOP
  iterations_since_restore: 10
  mean_loss: 9.374311926605502
  neg_mean_loss: -9.374311926605502
  node_ip: 192.168.1.102
  pid: 8600
  time_since_restore: 0.039893150329589844
  time_this_iter_s: 0.0039899349212646484
  time_total_s: 0.039893150329589844
  timestamp: 1623789102
  timesteps_since_restore: 0
  training_iteration: 10
  trial_id: b0b44_00002

== Status ==
Memory usage on this node: 15.3/31.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/24 CPUs, 0/2 GPUs, 0.0/20.5 GiB heap, 0.0/10.25 GiB objects
Result logdir: C:\Users\off99\ray_results\training_function_2021-06-16_03-31-40
Number of trials: 3/3 (3 TERMINATED)
+-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------+
| Trial name                    | status     | loc   |   alpha |   beta |     loss |   iter |   total time (s) |   neg_mean_loss |
|-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------|
| training_function_b0b44_00000 | TERMINATED |       |   0.001 |      2 | 10.191   |     10 |        0.0398934 |       -10.191   |
| training_function_b0b44_00001 | TERMINATED |       |   0.01  |      1 | 10.0108  |     10 |        0.0688159 |       -10.0108  |
| training_function_b0b44_00002 | TERMINATED |       |   0.1   |      2 |  9.37431 |     10 |        0.0398932 |        -9.37431 |
+-------------------------------+------------+-------+---------+--------+----------+--------+------------------+-----------------+


(pid=8600) Windows fatal exception: access violation
(pid=8600)
2021-06-16 03:31:42,258 INFO tune.py:549 -- Total run time: 4.48 seconds (1.65 seconds for the tuning loop).
Best config:  {'alpha': 0.1, 'beta': 2}

@swagshaw

swagshaw commented Jun 19, 2021

I have this problem when I try to run the example code from the tutorials.
Here is my code:


import io
from io import BytesIO

import requests
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import resnet18

import ray
from ray import serve  # needed for @serve.deployment and serve.start() below


@serve.deployment(route_prefix="/image_predict")
class ImageModel:
    def __init__(self):
        self.model = resnet18(pretrained=True).eval()
        self.preprocessor = transforms.Compose([
            transforms.Resize(224),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Lambda(lambda t: t[:3, ...]),  # remove alpha channel
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])

    async def __call__(self, starlette_request):
        image_payload_bytes = await starlette_request.body()
        pil_image = Image.open(BytesIO(image_payload_bytes))
        print("[1/3] Parsed image data: {}".format(pil_image))

        pil_images = [pil_image]  # Our current batch size is one
        input_tensor = torch.cat(
            [self.preprocessor(i).unsqueeze(0) for i in pil_images])
        print("[2/3] Images transformed, tensor shape {}".format(
            input_tensor.shape))

        with torch.no_grad():
            output_tensor = self.model(input_tensor)
        print("[3/3] Inference done!")
        return {"class_index": int(torch.argmax(output_tensor[0]))}

if __name__ == '__main__':
    # ray.init(log_to_driver=False)
    serve.start()
    ImageModel.deploy()
    
    # ray_logo_bytes = requests.get(
    # "https://github.com/ray-project/ray/raw/"
    # "master/doc/source/images/ray_header_logo.png").content
    
    # Transform the image to Bytes
    img = Image.open('D:\\develop\\ModelCI-e\\experiment\\data\\cat.jpg', mode='r')
    imgByteArr = io.BytesIO()
    img.save(imgByteArr, format='JPEG')
    imgByteArr = imgByteArr.getvalue()
    resp = requests.post(
    "http://localhost:8000/image_predict", data=imgByteArr)
    print(resp.text)
    # Output
    # {'class_index': 463} 

Here are some outputs:

2021-06-19 18:11:20,745 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=15160) 2021-06-19 18:11:31,655     INFO http_state.py:72 -- Starting HTTP proxy with name 'IhznBQ:SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:192.168.0.108-0' on node 'node:192.168.0.108-0' listening on '127.0.0.1:8000'
2021-06-19 18:11:31,913 INFO api.py:415 -- Updating deployment 'ImageModel'.
(pid=14328) INFO:     Started server process [14328]
(pid=15160) 2021-06-19 18:11:31,962     INFO backend_state.py:773 -- Adding 1 replicas to backend 'ImageModel'.
(pid=14264) D:\Miniconda3\envs\ray\lib\site-packages\torch\nn\functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ..\c10/core/TensorImpl.h:1156.)
(pid=14264)   return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
(pid=14264) [1/3] Parsed image data: <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=498x720 at 0x2397A0E4CA0>
(pid=14264) [2/3] Images transformed, tensor shape torch.Size([1, 3, 224, 224])
{
  "class_index": 285
}
(pid=15160) Windows fatal exception: access violation
(pid=15160)
(pid=14328) Windows fatal exception: access violation
(pid=14328)
(pid=14264) [3/3] Inference done!
(pid=14264) Windows fatal exception: access violation
(pid=14264)

@AnaMakarevich

AnaMakarevich commented Jul 13, 2021

I have a very similar error using the ray tune setup from here: https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html#sphx-glr-beginner-hyperparameter-tuning-tutorial-py
The error I get on Windows is:
PermissionError: [Errno 13] Permission denied: 'C:\Users\Ana\ray_results\DEFAULT_2021-07-13_12-17-01\DEFAULT_15e3d_00000_0_anneal_rate=0.015316,lr=0.039717_2021-07-13_12-17-23\checkpoint_000000\checkpoint'
I tried changing the log directory used to store the results, but that doesn't help either. I also allowed everything when running Ray Tune for the first time.
However, the data does seem to be stored (this is the new directory that I specified, and it still throws this error):
[screenshot: contents of the new results directory]

@anmyachev
Contributor

anmyachev commented Aug 27, 2021

I have the same issue, using the following reproducer.
Env: Windows, ray==1.6.0, 8 logical processors.

from ray.util.queue import Queue
from ray import available_resources
from time import sleep

queue1 = Queue(actor_options={"num_cpus": 4})
sleep(10)
print(available_resources())

queue2 = Queue(actor_options={"num_cpus": 4})
sleep(10)
print(available_resources())

queue2 = Queue(actor_options={"num_cpus": 1})
sleep(10)
print(available_resources())

Logs:

2021-08-27 09:21:41,123 INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8265
{'object_store_memory': 4175845785.0, 'memory': 8351691572.0, 'node:192.168.100.108': 1.0, 'CPU': 4.0}
{'memory': 8351691572.0, 'node:192.168.100.108': 1.0, 'object_store_memory': 4175845785.0}
(pid=37296) Windows fatal exception: access violation
(pid=37296)
{'object_store_memory': 4175845785.0, 'memory': 8351691572.0, 'node:192.168.100.108': 1.0, 'CPU': 3.0}

Initially found in modin-project/modin#3256.

cc @rkooo567

In some cases this leads to hanging tests.

@czgdp1807
Contributor

I am able to reproduce the exception in the description with the following code:

import psutil
import ray
import jpype
import sys
print("psutil", psutil.__version__)
print("ray", ray.__version__)
print("jpype", jpype.__version__)
print("sys", sys.version_info)

@ray.remote
class ObjectiveFunc(object):
    def __init__(self):
        self.java = jpype.startJVM()

class RayMap(object):
    def __init__(self, num_workers):
        self.workers = []
        for _ in range(num_workers):
            self.workers.append(ObjectiveFunc.remote())

num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus, include_dashboard=False)
rm = RayMap(4)

Output

psutil 5.8.0
ray 2.0.0.dev0
jpype 1.3.0
sys sys.version_info(major=3, minor=8, micro=11, releaselevel='final', serial=0)
c:\users\gagan\gsingh\ray\python\ray\_private\services.py:238: UserWarning: Not all Ray Dashboard dependencies were found. To use the dashboard please install Ray using `pip install ray[default]`. To disable this message, set RAY_DISABLE_IMPORT_WARNING env var to '1'.
  warnings.warn(warning_message)
(pid=6856) Windows fatal exception: access violation
(pid=6856)
(pid=6856) Stack (most recent call first):
(pid=6856)   File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=6856)   File "ray_jpype.py", line 13 in __init__
(pid=6856)   File "c:\users\gagan\gsingh\ray\python\ray\_private\function_manager.py", line 579 in actor_method_executor
(pid=6856)   File "c:\users\gagan\gsingh\ray\python\ray\worker.py", line 429 in main_loop
(pid=6856)   File "c:\users\gagan\gsingh\ray\python\ray\workers/default_worker.py", line 214 in <module>
(pid=5812) Windows fatal exception: access violation
(pid=5812)
(pid=5812) Stack (most recent call first):
(pid=5812)   File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=5812)   File "ray_jpype.py", line 13 in __init__
(pid=5812)   File "c:\users\gagan\gsingh\ray\python\ray\_private\function_manager.py", line 579 in actor_method_executor
(pid=5812)   File "c:\users\gagan\gsingh\ray\python\ray\worker.py", line 429 in main_loop
(pid=5812)   File "c:\users\gagan\gsingh\ray\python\ray\workers/default_worker.py", line 214 in <module>
(pid=4984) Windows fatal exception: access violation
(pid=4984)
(pid=4984) Stack (most recent call first):
(pid=4984)   File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=4984)   File "ray_jpype.py", line 13 in __init__
(pid=4984)   File "c:\users\gagan\gsingh\ray\python\ray\_private\function_manager.py", line 579 in actor_method_executor
(pid=4984)   File "c:\users\gagan\gsingh\ray\python\ray\worker.py", line 429 in main_loop
(pid=4984)   File "c:\users\gagan\gsingh\ray\python\ray\workers/default_worker.py", line 214 in <module>
(pid=9560) Windows fatal exception: access violation
(pid=9560)
(pid=9560) Stack (most recent call first):
(pid=9560)   File "C:\ProgramData\Anaconda3\envs\ray_dev\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=9560)   File "ray_jpype.py", line 13 in __init__
(pid=9560)   File "c:\users\gagan\gsingh\ray\python\ray\_private\function_manager.py", line 579 in actor_method_executor
(pid=9560)   File "c:\users\gagan\gsingh\ray\python\ray\worker.py", line 429 in main_loop
(pid=9560)   File "c:\users\gagan\gsingh\ray\python\ray\workers/default_worker.py", line 214 in <module>

@czgdp1807
Contributor

czgdp1807 commented Sep 4, 2021

Can I investigate this further?

Upon further investigation, I found that this issue is related to jpype.startJVM accessing an unallocated memory address. Doing the same thing with Python's multiprocessing module doesn't result in any such exception; see the code below. I will look into how ray.remote actually works here. If it creates standalone processes, then the access violation exception shouldn't have been thrown.

import jpype
from multiprocessing import Pool

def f(x):
    obj = jpype.startJVM()
    print(obj)
    return x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))

@czgdp1807
Contributor

Hi. I am investigating this issue. I noticed that there is a concept of a worker (which is a process, as far as I understand). IMO, the above issue is caused by some memory allocation problem while creating that worker. Would it be possible to point me to the code in ray (its location inside the project) that is used to create that worker? Thanks.

@richardliaw
Contributor

Hmm, I think you might want to look at ray/services.py?

@czgdp1807
Contributor

Updates:

With different versions I observed different things.

ray-1.6.0 - Output is as described by the author.

ray-1.3.0 - There is an exception ignored in the __del__ method of an actor object. "Windows fatal exception: access violation" is raised by the OS when we try to access something that is not ours (either not allocated, or already deallocated). Since __del__ deals with freeing the memory occupied by the object, the exception raised inside it seems related to the issue at hand. Trace below:

(pid=6764) Windows fatal exception: access violation
(pid=6764)
(pid=6764) Stack (most recent call first):
(pid=6764)   File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\jpype\_core.py", line 226 in startJVM
(pid=6764)   File "ray_jpype.py", line 13 in __init__
(pid=6764)   File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\_private\function_manager.py", line 556 in actor_method_executor
(pid=6764)   File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\worker.py", line 382 in main_loop
(pid=6764)   File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\workers/default_worker.py", line 196 in <module>
Exception ignored in: <function ActorHandle.__del__ at 0x000001BE40690AF0>
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\actor.py", line 809, in __del__
AttributeError: 'NoneType' object has no attribute 'global_worker'

ray-1.0.0 - Everything works fine, without any exceptions or errors. rm.workers is [Actor(ObjectiveFunc, df5a1a8201000000)].

@czgdp1807
Contributor

Hi. I dug deeper into the issue of the access violation exception cluttering the command prompt (a.k.a. the terminal). Following are some noteworthy points:

  1. The problem is only with jpype.startJVM and nothing else. That is, only when user code calls jpype.startJVM does the terminal get cluttered with access violation exceptions; all other user code works fine.

  2. Interestingly, when jpype.startJVM is called from Python files inside the ray project (default_worker.py), the access violation exception is written to the log files and the terminal is not cluttered with it. However, when it is called from task_execution_handler or any of the functions called inside it, the terminal is cluttered with these exceptions.

  3. task_execution_handler is assigned to options.task_execution_callback, which is invoked from the C++ code in core_worker.cc. So it can be said that the terminal is cluttered with these exceptions only when the call comes from C++ code; otherwise they appear only in the log files.

I also thought of a not-so-good workaround. We can put all the jpype-related code in a file (say java_work.py) and then do something like the following (a sketch of what java_work.py might contain appears after the explanation below):

import subprocess

import ray

@ray.remote
def f(i):
    return subprocess.run(['python', 'C:\\Users\\gagan\\ray_project\\java_work.py', str(i)])

In words, we launch a Python subprocess and call the jpype APIs there, because from my observations the jpype APIs work fine when called from plain Python processes but not when the call goes through the C++ libraries (as described above).
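A hypothetical sketch of what java_work.py could contain (the actual file is not shown in this thread; the argument handling and the printed result are purely illustrative):

import sys

import jpype

if __name__ == "__main__":
    i = sys.argv[1]               # argument passed in by the Ray task above
    jpype.startJVM()              # the JVM lives in this plain Python subprocess
    print(f"work item {i} done")  # stand-in for the real jpype-based computation
    jpype.shutdownJVM()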

@czgdp1807
Contributor

czgdp1807 commented Sep 14, 2021

  1. Since “Windows fatal exception: access violation” is Windows-specific, I tried to catch it using Microsoft-specific facilities in C++. I stumbled across SEH (Structured Exception Handling); see https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160. I have added it at various levels in the call flow, but compiler error C2712 doesn’t go away.

  2. Another fix is to redesign the architecture so that we don’t call Python functions from C++ code. Maybe the user function can be called directly in default_worker.py, or a Python process can be spawned from the worker process to do the execution there.

  3. For now, we can also update the documentation so that users know the Windows access violation is nothing much to worry about; it is just an error report specific to Windows.

IMO, fix 1 would be the best to have, as it is the easiest: we need to find the right spot inside Ray's C++ code to add a __try/__except block that catches EXCEPTION_ACCESS_VIOLATION. Fix 2 is more robust, but it would require a lot of effort and could lead to other unexpected breakages.

@pcmoritz
Contributor

Our current hypothesis is that the access violation messages are not impacting functionality (but they are very annoying indeed). There is a summary in #18944.

We are suppressing them for now, see #19561

I'm closing this issue for now as #19561 should fix most of the inconvenience here.

However, if somebody has more insight into this problem and can actually make these access violation errors go away, that would be most welcome. Several open source projects (including Python/C extension related ones) have been wrestling with this issue, and to the best of my knowledge the problem is not well understood at the moment.
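Until then, one user-side stopgap (a sketch, not the fix from #19561) is to stop Ray from forwarding worker stdout/stderr to the driver terminal, as hinted by the commented-out ray.init(log_to_driver=False) in an earlier comment. Note that this hides all worker output on the driver, not just these banners; worker logs are still written under the session's log directory.

import ray

# log_to_driver=False disables forwarding of worker stdout/stderr to the driver,
# so the "(pid=...) Windows fatal exception" lines no longer clutter the terminal.
ray.init(log_to_driver=False)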

@spchamp

spchamp commented Sep 21, 2023

For what it's worth, albeit with an entirely different call stack, I'm seeing a similar error message: "Windows fatal exception: access violation".

This is with an application using sockets under asyncio, with the IOCP Proactor on Windows 10. The Python version is 3.11.5, installed via Chocolatey.

When using a selector event loop on Windows, the segfault does not occur as such.

import asyncio as aio
import sys

loop = (aio.SelectorEventLoop() if sys.platform == "win32"
        else aio.get_event_loop_policy().get_event_loop())

Towards reproducing the error: There's an example using HTTPX to run a single HTTP request [moved to gist]

With the example, the Windows access violation might not occur until the end of loop.run_until_complete().

Using a selector event loop, the segfault does not occur.
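A minimal way to get the selector loop by default with asyncio.run (a sketch assuming Python 3.8+ on Windows, where WindowsSelectorEventLoopPolicy is available; the coroutine body is a placeholder for the HTTPX request in the gist):

import asyncio
import sys

if sys.platform == "win32":
    # Make asyncio.run() use the selector loop instead of the IOCP Proactor.
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

async def main():
    await asyncio.sleep(0)  # placeholder for the real HTTP request

asyncio.run(main())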

HTH, apologies if it's too far off topic, moreover with the different call stack in the example.
