Network surgery for transfer fails #12181

johncorring · 2018-09-28T20:37:56Z

Per the pytorch/caffe2 Readme I am asking here.

I would like to use an existing network definition and weights from the model zoo as the backbone for a new network. In this specific example the architecture is SqueezeNet, and the new network simply has a different shape for the top parameterized layers ['conv10_w', 'conv10_b'], to accommodate a set of classes different from ImageNet's.
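To make the intended surgery concrete, here is a minimal sketch of the idea in plain Python, not my actual script. The op list is a hypothetical stand-in for the pretrained init net (plain dicts rather than Caffe2 protobufs), the shapes are illustrative, and `replace_head` is a made-up helper name. The operator type names are used only as string labels here.

```python
import copy

# Hypothetical stand-in for the ops of a pretrained init net: each entry
# mimics a fill operator that writes one parameter blob of a given shape.
# Shapes are illustrative assumptions, not read from the model zoo files.
pretrained_init_ops = [
    {"type": "GivenTensorFill", "output": "conv1_w", "shape": [64, 3, 3, 3]},
    {"type": "GivenTensorFill", "output": "conv1_b", "shape": [64]},
    {"type": "GivenTensorFill", "output": "conv10_w", "shape": [1000, 512, 1, 1]},
    {"type": "GivenTensorFill", "output": "conv10_b", "shape": [1000]},
]

# Blobs that belong to the classification head and must be reshaped.
HEAD_BLOBS = ("conv10_w", "conv10_b")


def replace_head(init_ops, num_classes, head_blobs=HEAD_BLOBS):
    """Return a new op list in which head blobs are re-initialized for num_classes.

    Backbone ops are copied unchanged; head ops keep their trailing
    dimensions but get num_classes as the leading (output) dimension.
    """
    new_ops = []
    for op in init_ops:
        if op["output"] not in head_blobs:
            new_ops.append(copy.deepcopy(op))
            continue
        new_shape = [num_classes] + list(op["shape"][1:])
        # Weights get a random fill, biases a constant fill (labels only).
        fill = "XavierFill" if len(new_shape) > 1 else "ConstantFill"
        new_ops.append({"type": fill, "output": op["output"], "shape": new_shape})
    return new_ops


surgery_ops = replace_head(pretrained_init_ops, num_classes=2)
for op in surgery_ops:
    print(op["output"], op["type"], op["shape"])
```

The backbone initializers are left untouched and only the two head blobs change shape, which is all the new network should need if the rest of the graph is reused as-is.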

Unfortunately, it is not clear to me from the documentation, tutorials, or examples how to achieve this. Some notes on my setup: I have built caffe2+OpenCV from source with the current master, into a Python 2.7.12 virtualenv, with CUDA 9.0 and cuDNN 7.0.

I wrote a script (based on https://nbviewer.jupyter.org/gist/kyamagu/6cff70840c10ca374e069a3a7eb00cb4/dogs-vs-cats.ipynb) that I think should do this: https://gist.github.com/johncorring/d735675e75add96fbdfbcc40fa00f3ba

I get the following error message:
Traceback (most recent call last):
  File "dogsvscats.py", line 184, in <module>
    shtyp = workspace.InferShapesAndTypes([train_model.net])
  File "/home/john/Code/pytorch/build/caffe2/python/workspace.py", line 258, in InferShapesAndTypes
    blobdesc_prototxt = C.infer_shapes_and_types_from_workspace(net_protos)
MemoryError: std::bad_alloc

This isn't very helpful, especially since cross-referencing against the caffe2 docs doesn't yield anything.

When I comment out the offending line and continue to training, I receive a segfault that I have narrowed down to line 204, workspace.RunNet(train_model.net). lldb returns the following stack trace:

thread #1: tid = 9130, 0x00007fffaa112240 libcaffe2.so`void caffe2::math::CopyMatrix<float, caffe2::CPUContext>(int, int, float const*, int, int, float*, int, int, caffe2::CPUContext*) + 208, name = 'python', stop reason = signal SIGSEGV: address access protected (fault address: 0xb15400000)

frame #0: 0x00007fffaa112240 libcaffe2.so`void caffe2::math::CopyMatrix<float, caffe2::CPUContext>(int, int, float const*, int, int, float*, int, int, caffe2::CPUContext*) + 208
frame #1: 0x00007fffaa11392f libcaffe2.so`void caffe2::math::Im2Col<float, caffe2::CPUContext, (caffe2::StorageOrder)2>(int, int, int, int, int, int, int, int, int, int, int, int, int, float const*, float*, caffe2::CPUContext*, int) + 1087
frame #2: 0x00007fffaa3f52b1 libcaffe2.so`caffe2::ConvOp<float, caffe2::CPUContext>::RunOnDeviceWithOrderNCHW()::{lambda(caffe2::Tensor*)#1}::operator()(caffe2::Tensor*) const + 1169
frame #3: 0x00007fffaa3f77f8 libcaffe2.so`caffe2::ConvOp<float, caffe2::CPUContext>::RunOnDeviceWithOrderNCHW() + 2712
frame #4: 0x00007fffaa1c93ed libcaffe2.so`caffe2::ConvPoolOpBase<caffe2::CPUContext>::RunOnDevice() + 301
frame #5: 0x00007fffa9fb52e5 libcaffe2.so`caffe2::Operator<caffe2::CPUContext>::Run(int) + 229
frame #6: 0x00007fffaa09275c libcaffe2.so`caffe2::SimpleNet::Run() + 460
frame #7: 0x00007fffaa0aeb8a libcaffe2.so`caffe2::Workspace::RunNet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 954
frame #8: 0x00007fffab11a277 caffe2_pybind11_state_gpu.so`void pybind11::cpp_function::initialize<caffe2::python::addGlobalMethods(pybind11::module&)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool)#21}, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool, pybind11::name, pybind11::scope, pybind11::sibling>(caffe2::python::addGlobalMethods(pybind11::module&)::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool)#21}&&, bool (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) + 311
frame #9: 0x00007fffab160220 caffe2_pybind11_state_gpu.so`pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 3552
frame #10: 0x00000000004c30ce python`PyEval_EvalFrameEx + 29342
frame #11: 0x00000000004b9ab6 python`PyEval_EvalCodeEx + 774
frame #12: 0x00000000004c1e6f python`PyEval_EvalFrameEx + 24639
frame #13: 0x00000000004b9ab6 python`PyEval_EvalCodeEx + 774
frame #14: 0x00000000004c16e7 python`PyEval_EvalFrameEx + 22711
frame #15: 0x00000000004b9ab6 python`PyEval_EvalCodeEx + 774
frame #16: 0x00000000004eb30f python`??? + 63
frame #17: 0x00000000004e5422 python`PyRun_FileExFlags + 130
frame #18: 0x00000000004e3cd6 python`PyRun_SimpleFileExFlags + 390
frame #19: 0x0000000000493ae2 python`Py_Main + 1554
frame #20: 0x00007ffff7810830 libc.so.6`__libc_start_main(main=(python`main), argc=2, argv=0x00007fffffffda18, init=, fini=, rtld_fini=, stack_end=0x00007fffffffda08) + 240 at libc-start.c:291
frame #21: 0x00000000004933e9 python`_start + 41


zou3519 added the caffe2 label Oct 1, 2018
