Network surgery for transfer fails #12181

johncorring opened this issue Sep 28, 2018 · 0 comments

Per the pytorch/caffe2 Readme I am asking here.

I would like to use an existing network definition and weights from the model zoo as the backbone for a new network. In this specific example the architecture is SqueezeNet, and the new network differs only in the shape of the top parameterized layer (blobs ['conv10_w', 'conv10_b']), to accommodate a set of classes different from ImageNet's.
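
A minimal sketch of the intended surgery, written in plain NumPy rather than against the Caffe2 API: keep every backbone array and re-initialize only the two head blobs for the new class count. The helper name `replace_head`, the stand-in weight shapes, and the small Gaussian initialization are assumptions for illustration and are not taken from the linked script.

```python
import numpy as np


def replace_head(params, num_classes, head_w="conv10_w", head_b="conv10_b", seed=0):
    """Return a new dict whose head blobs are re-initialized for `num_classes`.

    `params` maps blob names to arrays. Backbone arrays are reused as-is.
    """
    rng = np.random.RandomState(seed)
    old_w = params[head_w]
    # Only the output-channel axis (axis 0 of an NCHW filter) changes.
    new_w_shape = (num_classes,) + old_w.shape[1:]
    new_params = dict(params)  # shallow copy: backbone arrays are shared
    # Assumed init: small Gaussian weights, zero bias.
    new_params[head_w] = (0.01 * rng.randn(*new_w_shape)).astype(old_w.dtype)
    new_params[head_b] = np.zeros((num_classes,), dtype=params[head_b].dtype)
    return new_params


# Stand-in arrays with assumed shapes; real values would come from the model zoo.
pretrained = {
    "conv1_w": np.ones((64, 3, 3, 3), dtype=np.float32),
    "conv10_w": np.ones((1000, 512, 1, 1), dtype=np.float32),
    "conv10_b": np.ones((1000,), dtype=np.float32),
}
surgered = replace_head(pretrained, num_classes=2)
print(surgered["conv10_w"].shape, surgered["conv10_b"].shape)
```

Deriving the new weight shape from the old one avoids hard-coding the input channel count, so a mismatch can only come from the class dimension.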

Unfortunately, it is not clear to me from the documentation, tutorials, or examples how to achieve this. Some environment notes: I built caffe2 + OpenCV from source at the current master, in a Python 2.7.12 virtualenv, with CUDA 9.0 and cuDNN 7.0.

I wrote a script (based on https://nbviewer.jupyter.org/gist/kyamagu/6cff70840c10ca374e069a3a7eb00cb4/dogs-vs-cats.ipynb) that I think should do this: https://gist.github.com/johncorring/d735675e75add96fbdfbcc40fa00f3ba

I get the following error message:

    Traceback (most recent call last):
      File "dogsvscats.py", line 184, in
        shtyp = workspace.InferShapesAndTypes([train_model.net])
      File "/home/john/Code/pytorch/build/caffe2/python/workspace.py", line 258, in InferShapesAndTypes
        blobdesc_prototxt = C.infer_shapes_and_types_from_workspace(net_protos)
    MemoryError: std::bad_alloc

This isn't very helpful, especially since cross-referencing against the caffe2 docs doesn't yield anything.
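
Since the error does not name a blob, one check that might narrow it down is to compare the shapes the new network expects with the shapes actually present before running shape inference. The sketch below is plain Python with hypothetical names (`find_shape_problems`, the `expected` and `actual` dicts) and made-up shapes. In Caffe2 the `actual` shapes could presumably be collected with `workspace.Blobs()` and `workspace.FetchBlob(name).shape`, but that wiring is an assumption and is not part of the script above.

```python
def find_shape_problems(expected, actual):
    """List blobs that are missing from `actual` or whose shape differs.

    Both arguments map blob names to shape tuples.
    """
    problems = []
    for name, want in sorted(expected.items()):
        if name not in actual:
            problems.append((name, "missing", tuple(want), None))
        elif tuple(actual[name]) != tuple(want):
            problems.append((name, "mismatch", tuple(want), tuple(actual[name])))
    return problems


# Illustrative shapes only: the head weight still has the old class count
# and the input blob was never created.
expected = {
    "data": (32, 3, 227, 227),
    "conv10_w": (2, 512, 1, 1),
    "conv10_b": (2,),
}
actual = {
    "conv10_w": (1000, 512, 1, 1),
    "conv10_b": (2,),
}
problems = find_shape_problems(expected, actual)
for problem in problems:
    print(problem)
```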

When I comment out the offending line and try to continue to training, I receive a segfault that I have narrowed down to line 204, workspace.RunNet(train_model.net). lldb returns the following stack trace:

    thread #1: tid = 9130, 0x00007fffaa112240 libcaffe2.so`void caffe2::math::CopyMatrix<float, caffe2::CPUContext>(int, int, float const*, int, int, float*, int, int, caffe2::CPUContext*) + 208, name = 'python', stop reason = signal SIGSEGV: address access protected (fault address: 0xb15400000)
      * frame #0: 0x00007fffaa112240 libcaffe2.so`void caffe2::math::CopyMatrix<float, caffe2::CPUContext>(int, int, float const*, int, int, float*, int, int, caffe2::CPUContext*) + 208
        frame #1: 0x00007fffaa11392f libcaffe2.so`void caffe2::math::Im2Col<float, caffe2::CPUContext, (caffe2::StorageOrder)2>(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float*, caffe2::CPUContext*, int) + 1087
        frame #2: 0x00007fffaa3f52b1 libcaffe2.so`caffe2::ConvOp<float, caffe2::CPUContext>::RunOnDeviceWithOrderNCHW()::{lambda(caffe2::Tensor*)#1}::operator()(caffe2::Tensor*) const + 1169
        frame #3: 0x00007fffaa3f77f8 libcaffe2.so`caffe2::ConvOp<float, caffe2::CPUContext>::RunOnDeviceWithOrderNCHW() + 2712
        frame #4: 0x00007fffaa1c93ed libcaffe2.so`caffe2::ConvPoolOpBase<caffe2::CPUContext>::RunOnDevice() + 301
        frame #5: 0x00007fffa9fb52e5 libcaffe2.so`caffe2::Operator<caffe2::CPUContext>::Run(int) + 229
        frame #6: 0x00007fffaa09275c libcaffe2.so`caffe2::SimpleNet::Run() + 460
        frame #7: 0x00007fffaa0aeb8a libcaffe2.so`caffe2::Workspace::RunNet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 954
        frame #8: 0x00007fffab11a277 caffe2_pybind11_state_gpu.so`void pybind11::cpp_function::initialize<caffe2::python::addGlobalMethods(pybind11::module&)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool)#21}, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool, pybind11::name, pybind11::scope, pybind11::sibling>(caffe2::python::addGlobalMethods(pybind11::module&)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool)#21}&&, bool (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) + 311
        frame #9: 0x00007fffab160220 caffe2_pybind11_state_gpu.so`pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 3552
        frame #10: 0x00000000004c30ce python`PyEval_EvalFrameEx + 29342
        frame #11: 0x00000000004b9ab6 python`PyEval_EvalCodeEx + 774
        frame #12: 0x00000000004c1e6f python`PyEval_EvalFrameEx + 24639
        frame #13: 0x00000000004b9ab6 python`PyEval_EvalCodeEx + 774
        frame #14: 0x00000000004c16e7 python`PyEval_EvalFrameEx + 22711
        frame #15: 0x00000000004b9ab6 python`PyEval_EvalCodeEx + 774
        frame #16: 0x00000000004eb30f python`??? + 63
        frame #17: 0x00000000004e5422 python`PyRun_FileExFlags + 130
        frame #18: 0x00000000004e3cd6 python`PyRun_SimpleFileExFlags + 390
        frame #19: 0x0000000000493ae2 python`Py_Main + 1554
        frame #20: 0x00007ffff7810830 libc.so.6`__libc_start_main(main=(python`main), argc=2, argv=0x00007fffffffda18, init=, fini=, rtld_fini=, stack_end=0x00007fffffffda08) + 240 at libc-start.c:291
        frame #21: 0x00000000004933e9 python`_start + 41

zou3519 added the caffe2 label Oct 1, 2018