Bus error (core dumped) model share memory #2244

Closed
acrosson opened this Issue Jul 29, 2017 · 3 comments

Comments


acrosson commented Jul 29, 2017

I'm getting a Bus error (core dumped) when using the share_memory method on a model.

OS: Ubuntu 16.04
It happens on both Python 2.7 and 3.5, in a conda environment and a plain install. I'm using the latest version from http://pytorch.org/. I've also tried installing from source; same issue.

I tried doing a basic test using this code:

import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu / F.max_pool2d / F.dropout / F.log_softmax

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # The huge number of input channels makes conv1's weight tensor very large,
        # which is what triggers the error once share_memory() is called.
        self.conv1 = nn.Conv2d(2563 * 50, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x)

n = Net()
n.share_memory()  # crashes here with "Bus error (core dumped)"
print('okay')

If the input size is small it works fine, but anything above some threshold throws the Bus error. If I don't call share_memory() at all, there's no error.
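For scale, here's a rough back-of-the-envelope for conv1's weight alone (just a sketch; it assumes float32 parameters, i.e. 4 bytes per element), which gives a sense of why there is a size threshold:

# Conv2d weight shape is (out_channels, in_channels, kH, kW).
in_channels, out_channels, k = 2563 * 50, 10, 5
n_elements = out_channels * in_channels * k * k
print("conv1.weight: %.1f MiB" % (n_elements * 4 / 2.0**20))  # ~122 MiB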

I ran trace; here are the last few lines of the output.

module.py(391):             if module is not None and module not in memo:
module.py(392):                 memo.add(module)
module.py(393):                 yield name, module
module.py(378):             yield module
module.py(118):             module._apply(fn)
 --- modulename: module, funcname: _apply
module.py(117):         for module in self.children():
 --- modulename: module, funcname: children
module.py(377):         for name, module in self.named_children():
 --- modulename: module, funcname: named_children
module.py(389):         memo = set()
module.py(390):         for name, module in self._modules.items():
module.py(120):         for param in self._parameters.values():
module.py(121):             if param is not None:
module.py(124):                 param.data = fn(param.data)
 --- modulename: module, funcname: <lambda>
module.py(468):         return self._apply(lambda t: t.share_memory_())
 --- modulename: tensor, funcname: share_memory_
tensor.py(86):         self.storage().share_memory_()
 --- modulename: storage, funcname: share_memory_
storage.py(95):         from torch.multiprocessing import get_sharing_strategy
 --- modulename: _bootstrap, funcname: _handle_fromlist
<frozen importlib._bootstrap>(1006): <frozen importlib._bootstrap>(1007): <frozen importlib._bootstrap>(1012): <frozen importlib._bootstrap>(1013): <frozen importlib._bootstrap>(1012): <frozen importlib._bootstrap>(1025):
storage.py(96):         if self.is_cuda:
storage.py(98):         elif get_sharing_strategy() == 'file_system':
 --- modulename: __init__, funcname: get_sharing_strategy
__init__.py(59):     return _sharing_strategy
storage.py(101):             self._share_fd_()
Bus error (core dumped)
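For what it's worth, the trace suggests Module.share_memory() bottoms out in storage().share_memory_() for each parameter, so a minimal sketch like the one below (an assumption on my part, not something I've isolated separately) should exercise the same code path by sharing a single tensor roughly the size of conv1.weight:

import torch

# conv1.weight has shape (out_channels, in_channels, kH, kW) = (10, 2563*50, 5, 5).
t = torch.zeros(10, 2563 * 50, 5, 5)
t.share_memory_()   # moves the underlying storage into shared memory
print('shared okay')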

I tried running gdb, but it won't give me a full trace.

I've tried creating a symbolic link to libgomp.so.1, since I suspected it might be a similar issue, but I still get the same error.

Any suggestions? This is running inside a Docker container, btw.

acrosson commented Jul 29, 2017

Okay, I think I solved it. It looks like the shared memory of the Docker container wasn't set high enough. Setting a higher amount by adding --shm-size 8G to the docker run command seems to do the trick, as mentioned here. Let me fully test it; if it's solved I'll close the issue.
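If it helps anyone else, a quick sanity check (just a sketch, Python 3 only) is to confirm from inside the container that /dev/shm really is larger after adding --shm-size 8G, since the file-descriptor sharing strategy backs tensors with files in shared memory (typically /dev/shm on Linux):

import shutil

# Report the size of the container's shared-memory mount.
usage = shutil.disk_usage('/dev/shm')
print("/dev/shm total: %.2f GiB, free: %.2f GiB"
      % (usage.total / 2.0**30, usage.free / 2.0**30))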

acrosson commented Jul 30, 2017

Works fine now!

dneprDroid commented Sep 13, 2018

@acrosson Do you have experience with Google Cloud ML? Sorry to bother you, but I'm getting this error on a Cloud ML job with machine params standard_gpu (NVIDIA Tesla K80 GPU, 30 GB memory).
How can I configure the --shm-size param for a Cloud ML job?

My config.yaml file:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 1
  parameterServerCount: 1
  parameterServerType: standard_gpu
