
Couldn't open shared file mapping: <torch_573824_1569179339>, error code: <1455> #31874


Closed
weilueluo opened this issue Jan 5, 2020 · 11 comments
Labels
module: tensorboard triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@weilueluo

🐛 Bug

Runtime error similar to #18797.

To Reproduce

Steps to reproduce the behavior:
No idea how to reproduce this reliably.

Expected behavior

no error

Environment

Collecting environment information...
PyTorch version: 1.2.0+cu92
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Microsoft Windows 10 家庭版 (Home edition)
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 441.28
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.2.0+cu92
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.4.0+cu92
[conda] Could not collect

Additional context

[...]
epoch 35/40: 100%|████████████████████████████████████████████████| 174/174 [03:04<00:00,  1.73it/s]
evaluate: 100%|█████████████████████████████████████████████████████| 39/39 [00:51<00:00,  1.31s/it]
epoch 36/40: 100%|████████████████████████████████████████████████| 174/174 [03:12<00:00,  1.03s/it]
evaluate: 100%|█████████████████████████████████████████████████████| 39/39 [00:53<00:00,  1.48it/s]
epoch 37/40: 100%|████████████████████████████████████████████████| 174/174 [03:10<00:00,  1.56it/s]
evaluate: 100%|█████████████████████████████████████████████████████| 39/39 [00:54<00:00,  1.40s/it]
epoch 38/40:  52%|█████████████████████████▋                       | 91/174 [01:59<01:40,  1.21s/it]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-14-186500daf789> in <module>
----> 1 acc, w = run(lr=0.001, epochs=40, data_loaders=data_loaders, net=None, write_tensorboard=True, weight_decay=0, start_epoch=1)

<ipython-input-12-7ab892883eff> in run(lr, epochs, data_loaders, net, write_tensorboard, weight_decay, start_epoch)
     19     best_acc, best_w = train_net(net=net, criterion=criterion, optimizer=optimizer, 
     20                                  data_loaders=data_loaders, epochs=epochs, writer=writer,
---> 21                                  scheduler=scheduler, start_epochs=start_epoch)
     22     if writer:
     23         writer.close()

<ipython-input-11-0d98cdc0971c> in train_net(net, criterion, optimizer, data_loaders, epochs, writer, scheduler, start_epochs)
      6 
      7         loss_sum, train_acc, val_acc = train_one_epoch(net=net, criterion=criterion, optimizer=optimizer,
----> 8                                                       data_loaders=data_loaders, epoch=epoch, epochs=epochs)
      9 
     10         if writer:

<ipython-input-10-d126b1e19a5e> in train_one_epoch(net, criterion, optimizer, data_loaders, epoch, epochs)
      3     corrects_sum = 0.0
      4     total_samples = 0
----> 5     for images, labels in tqdm(data_loaders[train], desc=f'epoch {epoch}/{epochs}', ncols=100):
      6         loss, corrects = train_one_sample(net=net, criterion=criterion, optimizer=optimizer,
      7                                          inputs=images, labels=labels)

~\AppData\Roaming\Python\Python37\site-packages\tqdm\_tqdm.py in __iter__(self)
   1015                 """), fp_write=getattr(self.fp, 'write', sys.stderr.write))
   1016 
-> 1017             for obj in iterable:
   1018                 yield obj
   1019                 # Update and possibly print the progressbar.

~\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    817             else:
    818                 del self.task_info[idx]
--> 819                 return self._process_data(data)
    820 
    821     next = __next__  # Python 2 compatibility

~\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\dataloader.py in _process_data(self, data)
    844         self._try_put_index()
    845         if isinstance(data, ExceptionWrapper):
--> 846             data.reraise()
    847         return data
    848 

~\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\_utils.py in reraise(self)
    367             # (https://bugs.python.org/issue2651), so we work around it.
    368             msg = KeyErrorMessage(msg)
--> 369         raise self.exc_type(msg)

RuntimeError: Caught RuntimeError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "C:\Users\wweilue\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\_utils\worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "C:\Users\wweilue\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\_utils\fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "C:\Users\wweilue\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\_utils\collate.py", line 80, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "C:\Users\wweilue\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\_utils\collate.py", line 80, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "C:\Users\wweilue\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\utils\data\_utils\collate.py", line 54, in default_collate
    storage = elem.storage()._new_shared(numel)
  File "C:\Users\wweilue\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\storage.py", line 128, in _new_shared
    return cls._new_using_filename(size)
RuntimeError: Couldn't open shared file mapping: <torch_573824_1569179339>, error code: <1455>
@weilueluo
Author

Not sure if it is related, but I also have the following error in TensorBoard:

TensorBoard 2.0.2 at http://localhost:6006/ (Press CTRL+C to quit)
Fatal Python error: Aborted

Current thread 0x0007c8c0 (most recent call first):
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 1034 in GetNext
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 71 in Load
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\backend\event_processing\event_file_loader.py", line 94 in Load
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 113 in _LoadInternal
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\backend\event_processing\directory_watcher.py", line 89 in Load
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\backend\event_processing\plugin_event_accumulator.py", line 177 in Reload
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 224 in Worker
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\backend\event_processing\plugin_event_multiplexer.py", line 246 in Reload
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\backend\application.py", line 504 in _reload
  File "c:\users\wweilue\appdata\local\programs\python\python37\lib\threading.py", line 870 in run
  File "c:\users\wweilue\appdata\local\programs\python\python37\lib\threading.py", line 926 in _bootstrap_inner
  File "c:\users\wweilue\appdata\local\programs\python\python37\lib\threading.py", line 890 in _bootstrap

Thread 0x00080788 (most recent call first):
  File "c:\users\wweilue\appdata\local\programs\python\python37\lib\selectors.py", line 314 in _select
  File "c:\users\wweilue\appdata\local\programs\python\python37\lib\selectors.py", line 323 in select
  File "c:\users\wweilue\appdata\local\programs\python\python37\lib\socketserver.py", line 232 in serve_forever
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\werkzeug\serving.py", line 735 in serve_forever
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\program.py", line 284 in _run_serve_subcommand
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\program.py", line 267 in main
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\absl\app.py", line 250 in _run_main
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\absl\app.py", line 299 in run
  File "C:\Users\wweilue\AppData\Roaming\Python\Python37\site-packages\tensorboard\main.py", line 66 in run_main
  File "C:\Users\wweilue\AppData\Local\Programs\Python\Python37\Scripts\tensorboard.exe\__main__.py", line 7 in <module>
  File "c:\users\wweilue\appdata\local\programs\python\python37\lib\runpy.py", line 85 in _run_code
  File "c:\users\wweilue\appdata\local\programs\python\python37\lib\runpy.py", line 193 in _run_module_as_main

@peterjc123
Collaborator

What about using python my_script.py instead of calling ipython?

@peterjc123
Collaborator

Alternatively, you may try reducing the number of workers in the DataLoader.
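
For reference, a minimal sketch of that change, assuming the loader is built roughly like the one in the traceback (the dataset and batch size below are placeholders, not the reporter's code). Error code 1455 is Windows' "the paging file is too small for this operation to complete", and each DataLoader worker process opens its own shared-memory file mappings, so fewer workers means less commit charge:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset so the snippet stands on its own; substitute the real Dataset.
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))

# Fewer workers -> fewer worker processes creating shared-memory mappings.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=2)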

@weilueluo
Author

@peterjc123 I was not able to reproduce the same error the second time I ran it; I got a memory error instead. It runs normally after reducing the number of workers.

@jerryzh168 jerryzh168 added module: tensorboard triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jan 8, 2020
@jerryzh168
Contributor

@Redcxx is this resolved?

@weilueluo
Author

@jerryzh168 For me yes.

@zimonitrome

Alternatively, you may try out reducing the number of workers in the DataLoader.

I feel like num_workers should not need to be reduced. I am trying to run 8 workers on a 16-thread CPU and it fails ~50% of the time, while my CPU is working at ~25% and there is plenty of memory available.

@Di-Ma-S21

@Redcxx Hi, I also encountered this error when using the multiprocessing functionality in PyTorch on Windows 10. Here is my issue post: #63331. I noticed that you have resolved this problem. Could you please let me know how you solved it? I have tried to put all the code (including train()) under if __name__ == '__main__':, but it still could not solve the problem. Thanks in advance!
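
(For anyone following along, the Windows-safe layout being described is roughly the following; the toy dataset and empty training loop are placeholders, not code from the linked issue.)

import torch
from torch.utils.data import DataLoader, TensorDataset

def train():
    # Toy data so the sketch is self-contained; replace with the real Dataset.
    dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
    loader = DataLoader(dataset, batch_size=16, num_workers=2)
    for images, labels in loader:
        pass  # training step goes here

if __name__ == '__main__':
    # On Windows, DataLoader worker processes re-import this module,
    # so anything that starts workers must only run under this guard.
    train()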

@sinhong96

@Redcxx Hi, I also encountered this error when using the multiprocessing functionality in PyTorch on Windows 10. Here is my issue post: #63331. I noticed that you have resolved this problem. Could you please let me know how you solved it? I have tried to put all the code (including train()) under if __name__ == '__main__':, but it still could not solve the problem. Thanks in advance!

Hi, I am still facing the same problem. May I know whether you have solved it, and how? Thanks!

@Keeyahto

Keeyahto commented Mar 3, 2023

I think I have solved this problem. I called torch.multiprocessing.set_start_method('spawn', True) in the train function, and the error stopped appearing.
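
(A sketch of where such a call can sit; here it is placed at the script entry point rather than inside the train function, and the data and training loop are placeholders. The second positional argument of set_start_method is force, and 'spawn' is already the default start method on Windows.)

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def train():
    dataset = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))  # placeholder data
    loader = DataLoader(dataset, batch_size=8, num_workers=2)
    for batch in loader:
        pass  # training step goes here

if __name__ == '__main__':
    # force=True keeps the call from raising if the start method was already set.
    mp.set_start_method('spawn', True)
    train()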

@Originofamonia

@sinhong96

Hi weilueluo, could you please share how you solved this problem? I also faced the same problem. Thanks
