Chase down haphazard core dump when running mnist example main #52

forbearer · 2018-08-15T20:31:52Z

When running the mnist example main, about 2 out of every 3 runs fail with a core dump, typically during the Reader open phase (before training begins). In my latest run, the segfault occurred once at the end of the first training epoch but before the first test batch, so it very likely still happened during Reader construction.
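To put a number on the "2 out of every 3" estimate, a small harness along these lines could rerun the example and count crashes. This is only a sketch: the helper names `is_segfault` and `count_segfaults` are made up for illustration, and the command that starts the mnist example has to be supplied by the caller. It relies on the fact that, on POSIX, `subprocess` reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def is_segfault(returncode):
    """Return True if a POSIX child process was terminated by SIGSEGV.

    subprocess reports death-by-signal as the negated signal number.
    """
    return returncode == -signal.SIGSEGV


def count_segfaults(cmd, runs):
    """Run `cmd` (a list of arguments) `runs` times and count SIGSEGV deaths."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if is_segfault(result.returncode):
            crashes += 1
    return crashes


# Hypothetical usage; replace the placeholder with the real example command:
#   crashes = count_segfaults([sys.executable, "path/to/mnist_main.py"], runs=30)
#   print("%d of 30 runs segfaulted" % crashes)
```

Running the same harness before and after a candidate workaround gives a rough before/after comparison, although an intermittent crash needs a fair number of runs before the difference means much.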

The core dump occurs in data page release within pyarrow libparquet:

#0  0x00007f1b48f9db48 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#1  0x00007f1b48febe73 in std::_Sp_counted_ptr_inplace<parquet::DataPage, std::allocator<parquet::DataPage>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#2  0x00007f1b48f9db29 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#3  0x00007f1b48fc520f in parquet::internal::TypedRecordReader<parquet::DataType<(parquet::Type::type)1> >::ReadNewPage() ()
   from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#4  0x00007f1b48fc5ba0 in parquet::internal::TypedRecordReader<parquet::DataType<(parquet::Type::type)1> >::ReadRecords(long) ()
   from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#5  0x00007f1b48f9b676 in parquet::arrow::PrimitiveImpl::NextBatch(long, std::shared_ptr<arrow::Array>*) () from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#6  0x00007f1b48f96fae in parquet::arrow::ColumnReader::NextBatch(long, std::shared_ptr<arrow::Array>*) () from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#7  0x00007f1b48f97bcb in parquet::arrow::FileReader::Impl::ReadColumnChunk(int, int, std::shared_ptr<arrow::Array>*) ()
   from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#8  0x00007f1b48f9810f in parquet::arrow::FileReader::Impl::ReadRowGroup(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*)::{lambda(int)#1}::operator()(int) const () from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#9  0x00007f1b48f98f20 in parquet::arrow::FileReader::Impl::ReadRowGroup(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) ()
   from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#10 0x00007f1b48f994c2 in parquet::arrow::FileReader::ReadRowGroup(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) ()
   from /usr/local/lib/python2.7/dist-packages/pyarrow/libparquet.so.1
#11 0x00007f1b47de8cc7 in __pyx_pw_7pyarrow_8_parquet_13ParquetReader_7read_row_group(_object*, _object*, _object*) ()
   from /usr/local/lib/python2.7/dist-packages/pyarrow/_parquet.so
#12 0x00000000004cdea9 in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7f1b1493ea70, 
    func=<built-in method read_row_group of pyarrow._parquet.ParquetReader object at remote 0x7f1b484fcb50>) at ../Python/ceval.c:4235
#13 call_function (oparg=<optimized out>, pp_stack=0x7f1b1493ea70) at ../Python/ceval.c:4043
#14 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b45f92250, for file /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.py, line 125, in read_row_group (self=<ParquetFile(common_metadata=None, _nested_paths_by_prefix={'digit': [0], 'image': [2], 'idx': [1]}, reader=<pyarrow._parquet.ParquetReader at remote 0x7f1b484fcb50>) at remote 0x7f1b45f7c990>, i=0, columns=set(['digit', 'image', 'idx']), nthreads=1, use_pandas_metadata=False, column_indices=[0, 2, 1]), throwflag=throwflag@entry=0) at ../Python/ceval.c:2666
#15 0x00000000004704ea in PyEval_EvalCodeEx (closure=<optimized out>, defcount=<optimized out>, defs=0x7f1b4858fd88, kwcount=<optimized out>, kws=<optimized out>, 
    argcount=<optimized out>, args=<optimized out>, locals=0x0, globals=<optimized out>, co=<optimized out>) at ../Python/ceval.c:3252
#16 function_call.15337 (func=<optimized out>, arg=<optimized out>, kw=<optimized out>) at ../Objects/funcobject.c:526
#17 0x00000000004c9aa5 in PyObject_Call (kw={'use_pandas_metadata': False, 'nthreads': 1, 'columns': set(['digit', 'image', 'idx'])}, 
    arg=(<ParquetFile(common_metadata=None, _nested_paths_by_prefix={'digit': [0], 'image': [2], 'idx': [1]}, reader=<pyarrow._parquet.ParquetReader at remote 0x7f1b484fcb50>) at remote 0x7f1b45f7c990>, 0), func=<function at remote 0x7f1b485948c0>) at ../Objects/abstract.c:2529
#18 ext_do_call (nk=<optimized out>, na=<optimized out>, flags=<optimized out>, pp_stack=0x7f1b1493ecb0, func=<function at remote 0x7f1b485948c0>) at ../Python/ceval.c:4333
#19 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b10007080, for file /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.py, line 459, in read (self=<ParquetDatasetPiece(path='/home/ocheng/dev/datasets/mnist/test/part-00000-c04ef970-dd95-44b3-8ca6-0f6d4cbf321e-c000.parquet', partition_keys=[], row_group=0) at remote 0x7f1b0f71e190>, columns=set(['digit', 'image', 'idx']), nthreads=1, partitions=<ParquetPartitions(levels=[], partition_names=set([])) at remote 0x7f1b0f71e090>, open_file_func=<function at remote 0x7f1b45f77ed8>, file=None, use_pandas_metadata=False, reader=<ParquetFile(common_metadata=None, _nested_paths_by_prefix={'digit': [0], 'image': [2], 'idx': [1]}, reader=<pyarrow._parquet.ParquetReader at remote 0x7f1b484fcb50>) at remote 0x7f1b45f7c990>, options={'use_pandas_metadata': False, 'nthreads': 1, 'columns': set(...)}), throwflag=throwflag@entry=0) at ../Python/ceval.c:2705
#20 0x00000000004cfedc in PyEval_EvalCodeEx (co=0x7f1b48575030, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, 
    kws=<optimized out>, kwcount=3, defs=0x7f1b4857ec98, defcount=6, closure=0x0) at ../Python/ceval.c:3252
#21 0x00000000004c8314 in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7f1b1493eed0, func=<function at remote 0x7f1b48599320>) at ../Python/ceval.c:4116
#22 call_function (oparg=<optimized out>, pp_stack=0x7f1b1493eed0) at ../Python/ceval.c:4041
#23 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b10006e30, for file /home/ocheng/dev/petastorm/petastorm/reader_worker.py, line 191, in _read_with_shuffle_row_drop (self=<ReaderWorker(_split_pieces=[<ParquetDatasetPiece(path='/home/ocheng/dev/datasets/mnist/test/part-00000-c04ef970-dd95-44b3-8ca6-0f6d4cbf321e-c000.parquet', partition_keys=[], row_group=0) at remote 0x7f1b0f71e190>], publish_func=<function at remote 0x7f1b45f77c08>, _sequence=None, args=('file:///home/ocheng/dev/datasets/mnist/test', <Unischema(digit=<UnischemaField at remote 0x7f1b45fb0e88>, _namedtuple=None, idx=<UnischemaField at remote 0x7f1b45fb0ef0>, image=<UnischemaField at remote 0x7f1b45fb0f58>, _name='MnistSchema', _fields=<OrderedDict(_OrderedDict__root=[[[[[...], [...], 'digit'], [...], 'idx'], [...], 'image'], [...], None], _OrderedDict__map={'digit': [...], 'image': [...], 'idx': [...]}) at remote 0x7f1b46353ab8>) at remote 0x7f1b45f7cd10>, None, [...], <NullCache at remote 0x7f1b45f7c950>, None), _schema=<...>, _dataset=<ParquetDataset(paths='/home/ocheng/dev/datas...(truncated), 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2666
#24 0x00000000004cfedc in PyEval_EvalCodeEx (co=0x7f1b48543830, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, 
    kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:3252
#25 0x00000000004c9419 in fast_function (nk=<optimized out>, na=5, n=5, pp_stack=0x7f1b1493f0f0, func=<function at remote 0x7f1b4855c410>) at ../Python/ceval.c:4116
#26 call_function (oparg=<optimized out>, pp_stack=0x7f1b1493f0f0) at ../Python/ceval.c:4041
#27 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b10006c20, for file /home/ocheng/dev/petastorm/petastorm/reader_worker.py, line 122, in _load_rows (self=<ReaderWorker(_split_pieces=[<ParquetDatasetPiece(path='/home/ocheng/dev/datasets/mnist/test/part-00000-c04ef970-dd95-44b3-8ca6-0f6d4cbf321e-c000.parquet', partition_keys=[], row_group=0) at remote 0x7f1b0f71e190>], publish_func=<function at remote 0x7f1b45f77c08>, _sequence=None, args=('file:///home/ocheng/dev/datasets/mnist/test', <Unischema(digit=<UnischemaField at remote 0x7f1b45fb0e88>, _namedtuple=None, idx=<UnischemaField at remote 0x7f1b45fb0ef0>, image=<UnischemaField at remote 0x7f1b45fb0f58>, _name='MnistSchema', _fields=<OrderedDict(_OrderedDict__root=[[[[[...], [...], 'digit'], [...], 'idx'], [...], 'image'], [...], None], _OrderedDict__map={'digit': [...], 'image': [...], 'idx': [...]}) at remote 0x7f1b46353ab8>) at remote 0x7f1b45f7cd10>, None, [...], <NullCache at remote 0x7f1b45f7c950>, None), _schema=<...>, _dataset=<ParquetDataset(paths='/home/ocheng/dev/datasets/mnist/test', ...(truncated), 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2666
#28 0x00000000004c8762 in fast_function (nk=<optimized out>, na=<optimized out>, n=4, pp_stack=0x7f1b1493f270, func=<function at remote 0x7f1b4855c320>) at ../Python/ceval.c:4106
#29 call_function (oparg=<optimized out>, pp_stack=0x7f1b1493f270) at ../Python/ceval.c:4041
#30 PyEval_EvalFrameEx (f=f@entry=Frame 0x7f1b45fb89b0, for file /home/ocheng/dev/petastorm/petastorm/reader_worker.py, line 104, in <lambda> (), throwflag=throwflag@entry=0)
    at ../Python/ceval.c:2666
#31 0x00000000004cfedc in PyEval_EvalCodeEx (co=0x7f1b4853e530, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, 
    kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, 
    closure=(<cell at remote 0x7f1b45f74b08>, <cell at remote 0x7f1b45f74948>, <cell at remote 0x7f1b45f74bb0>, <cell at remote 0x7f1b45f74a28>)) at ../Python/ceval.c:3252
#32 0x00000000004c9419 in fast_function (nk=<optimized out>, na=0, n=0, pp_stack=0x7f1b1493f490, func=<function at remote 0x7f1b45f77a28>) at ../Python/ceval.c:4116
#33 call_function (oparg=<optimized out>, pp_stack=0x7f1b1493f490) at ../Python/ceval.c:4041
#34 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b45f9ce50, for file /home/ocheng/dev/petastorm/petastorm/cache.py, line 36, in get (self=<NullCache at remote 0x7f1b45f7c950>, key='d6aff736faf6f8954553f8bed1c01cf1:/home/ocheng/dev/datasets/mnist/test/part-00000-c04ef970-dd95-44b3-8ca6-0f6d4cbf321e-c000.parquet:0', fill_cache_func=<function at remote 0x7f1b45f77a28>), 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2666
#35 0x00000000004c8762 in fast_function (nk=<optimized out>, na=<optimized out>, n=3, pp_stack=0x7f1b1493f610, func=<function at remote 0x7f1b4859f5f0>) at ../Python/ceval.c:4106
#36 call_function (oparg=<optimized out>, pp_stack=0x7f1b1493f610) at ../Python/ceval.c:4041
#37 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b10000dc0, for file /home/ocheng/dev/petastorm/petastorm/reader_worker.py, line 104, in process (self=<ReaderWorker(_split_pieces=[<ParquetDatasetPiece(path='/home/ocheng/dev/datasets/mnist/test/part-00000-c04ef970-dd95-44b3-8ca6-0f6d4cbf321e-c000.parquet', partition_keys=[], row_group=0) at remote 0x7f1b0f71e190>], publish_func=<function at remote 0x7f1b45f77c08>, _sequence=None, args=('file:///home/ocheng/dev/datasets/mnist/test', <Unischema(digit=<UnischemaField at remote 0x7f1b45fb0e88>, _namedtuple=None, idx=<UnischemaField at remote 0x7f1b45fb0ef0>, image=<UnischemaField at remote 0x7f1b45fb0f58>, _name='MnistSchema', _fields=<OrderedDict(_OrderedDict__root=[[[[[...], [...], 'digit'], [...], 'idx'], [...], 'image'], [...], None], _OrderedDict__map={'digit': [...], 'image': [...], 'idx': [...]}) at remote 0x7f1b46353ab8>) at remote 0x7f1b45f7cd10>, None, [...], <NullCache at remote 0x7f1b45f7c950>, None), _schema=<...>, _dataset=<ParquetDataset(paths='/home/ocheng/dev/datasets/mnist/test', com...(truncated), 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2666
#38 0x00000000004704ea in PyEval_EvalCodeEx (closure=<optimized out>, defcount=<optimized out>, defs=0x7f1b10000f38, kwcount=<optimized out>, kws=<optimized out>, 
    argcount=<optimized out>, args=<optimized out>, locals=0x0, globals=<optimized out>, co=<optimized out>) at ../Python/ceval.c:3252
#39 function_call.15337 (func=<optimized out>, arg=<optimized out>, kw=<optimized out>) at ../Objects/funcobject.c:526
#40 0x00000000004c9aa5 in PyObject_Call (kw={'worker_predicate': None, 'shuffle_row_drop_partition': (0, 1), 'piece_index': 0}, 
    arg=(<ReaderWorker(_split_pieces=[<ParquetDatasetPiece(path='/home/ocheng/dev/datasets/mnist/test/part-00000-c04ef970-dd95-44b3-8ca6-0f6d4cbf321e-c000.parquet', partition_keys=[], row_group=0) at remote 0x7f1b0f71e190>], publish_func=<function at remote 0x7f1b45f77c08>, _sequence=None, args=('file:///home/ocheng/dev/datasets/mnist/test', <Unischema(digit=<UnischemaField at remote 0x7f1b45fb0e88>, _namedtuple=None, idx=<UnischemaField at remote 0x7f1b45fb0ef0>, image=<UnischemaField at remote 0x7f1b45fb0f58>, _name='MnistSchema', _fields=<OrderedDict(_OrderedDict__root=[[[[[...], [...], 'digit'], [...], 'idx'], [...], 'image'], [...], None], _OrderedDict__map={'digit': [...], 'image': [...], 'idx': [...]}) at remote 0x7f1b46353ab8>) at remote 0x7f1b45f7cd10>, None, [...], <NullCache at remote 0x7f1b45f7c950>, None), _schema=<...>, _dataset=<ParquetDataset(paths='/home/ocheng/dev/datasets/mnist/test', common_metadata=<pyarrow._parquet.FileMetaData at remote 0x7f1b484fcaf8>, fs=<LocalFileSystem at remote 0x7f1b4a066...(truncated), 
    func=<function at remote 0x7f1b4855c2a8>) at ../Objects/abstract.c:2529
#41 ext_do_call (nk=<optimized out>, na=<optimized out>, flags=<optimized out>, pp_stack=0x7f1b1493f850, func=<function at remote 0x7f1b4855c2a8>) at ../Python/ceval.c:4333
#42 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b4613a810, for file /home/ocheng/dev/petastorm/petastorm/workers_pool/thread_pool.py, line 60, in run (self=<WorkerThread(_ventilator_queue=<Queue(unfinished_tasks=1, queue=<collections.deque at remote 0x7f1b45faa980>, maxsize=0, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e290>, mutex=<thread.lock at remote 0x7f1b45f75510>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e250>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45...(truncated), 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2705
#43 0x00000000004c8762 in fast_function (nk=<optimized out>, na=<optimized out>, n=1, pp_stack=0x7f1b1493f9d0, func=<function at remote 0x7f1b484f2c80>) at ../Python/ceval.c:4106
#44 call_function (oparg=<optimized out>, pp_stack=0x7f1b1493f9d0) at ../Python/ceval.c:4041
#45 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b08000910, for file /usr/lib/python2.7/threading.py, line 810, in __bootstrap_inner (self=<WorkerThread(_ventilator_queue=<Queue(unfinished_tasks=1, queue=<collections.deque at remote 0x7f1b45faa980>, maxsize=0, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e290>, mutex=<thread.lock at remote 0x7f1b45f75510>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e250>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<...(truncated), 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2666
#46 0x00000000004c8762 in fast_function (nk=<optimized out>, na=<optimized out>, n=1, pp_stack=0x7f1b1493fb50, func=<function at remote 0x7f1ba6334398>) at ../Python/ceval.c:4106
#47 call_function (oparg=<optimized out>, pp_stack=0x7f1b1493fb50) at ../Python/ceval.c:4041
#48 PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f1b45f9cad0, for file /usr/lib/python2.7/threading.py, line 783, in __bootstrap (self=<WorkerThread(_ventilator_queue=<Queue(unfinished_tasks=1, queue=<collections.deque at remote 0x7f1b45faa980>, maxsize=0, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e290>, mutex=<thread.lock at remote 0x7f1b45f75510>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e250>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-...(truncated), 
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2666
#49 0x00000000004704ea in PyEval_EvalCodeEx (closure=<optimized out>, defcount=<optimized out>, defs=0x0, kwcount=<optimized out>, kws=<optimized out>, argcount=<optimized out>, 
    args=<optimized out>, locals=0x0, globals=<optimized out>, co=<optimized out>) at ../Python/ceval.c:3252
#50 function_call.15337 (func=func@entry=<function at remote 0x7f1ba63342a8>, 
    arg=arg@entry=(<WorkerThread(_ventilator_queue=<Queue(unfinished_tasks=1, queue=<collections.deque at remote 0x7f1b45faa980>, maxsize=0, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e290>, mutex=<thread.lock at remote 0x7f1b45f75510>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e250>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], rel...(truncated), kw=kw@entry=0x0)
    at ../Objects/funcobject.c:526
#51 0x00000000004d8194 in PyObject_Call (kw=0x0, 
    arg=(<WorkerThread(_ventilator_queue=<Queue(unfinished_tasks=1, queue=<collections.deque at remote 0x7f1b45faa980>, maxsize=0, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e290>, mutex=<thread.lock at remote 0x7f1b45f75510>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e250>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], rel...(truncated), 
    func=<function at remote 0x7f1ba63342a8>) at ../Objects/abstract.c:2529
#52 instancemethod_call.8802 (func=<function at remote 0x7f1ba63342a8>, func@entry=<instancemethod at remote 0x7f1b45f7b190>, 
    arg=(<WorkerThread(_ventilator_queue=<Queue(unfinished_tasks=1, queue=<collections.deque at remote 0x7f1b45faa980>, maxsize=0, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e290>, mutex=<thread.lock at remote 0x7f1b45f75510>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f1b45f75510>) at remote 0x7f1b0f71e250>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f1b45f75510>, acquire=<built-in method acquire of thread.lock object at remote 0x7f1b45f75510>, _Condition__waiters=[], rel...(truncated), arg@entry=(), kw=0x0)
    at ../Objects/classobject.c:2602
#53 0x00000000004d40fb in PyObject_Call (kw=<optimized out>, arg=(), func=<instancemethod at remote 0x7f1b45f7b190>) at ../Objects/abstract.c:2529
#54 PyEval_CallObjectWithKeywords (func=<instancemethod at remote 0x7f1b45f7b190>, arg=(), kw=<optimized out>) at ../Python/ceval.c:3889
#55 0x000000000057f3a2 in t_bootstrap.71638 (boot_raw=0x462e8b0) at ../Modules/threadmodule.c:614
#56 0x00007f1bbd621184 in start_thread (arg=0x7f1b14940700) at pthread_create.c:312
#57 0x00007f1bbd34e37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


forbearer · 2018-08-28T15:39:02Z

@selitvin pointed out: "The segfault happens in memory release. We had these kinds of issues with TF, and they were related to memory allocators. Just as another point, you can try using a different memory allocator (e.g. the one mentioned in pytorch/pytorch#2314):"
LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"
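One way to try the suggested allocator without editing shell profiles is to launch the example from a small wrapper that sets `LD_PRELOAD` only for the child process. The sketch below is an assumption-laden illustration: `env_with_preload` and `run_with_allocator` are hypothetical helper names, the library path is the one quoted above and may differ on other systems, and the command to run is left to the caller.

```python
import os
import subprocess

# Path quoted in the comment above; the actual location varies by distribution.
TCMALLOC_PATH = "/usr/lib/libtcmalloc_minimal.so.4"


def env_with_preload(lib_path, base_env=None):
    """Return a copy of the environment with `lib_path` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list of libraries.
    env["LD_PRELOAD"] = lib_path if not existing else lib_path + ":" + existing
    return env


def run_with_allocator(cmd, lib_path=TCMALLOC_PATH):
    """Run `cmd` with the alternative allocator preloaded; return its exit code."""
    if not os.path.exists(lib_path):
        raise FileNotFoundError(lib_path)
    return subprocess.run(cmd, env=env_with_preload(lib_path)).returncode


# Hypothetical usage; replace the placeholder with the real example command:
#   run_with_allocator(["python", "path/to/mnist_main.py"])
```

Building the environment in a separate function keeps the parent process untouched, so runs with and without the preloaded allocator can be compared from the same session.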

rgruener added the bug Something isn't working label Aug 17, 2018

forbearer mentioned this issue Aug 23, 2018

Pytorch example with DataLoader adapter, using MNIST data #50

Merged

forbearer pushed a commit to forbearer/petastorm that referenced this issue Sep 6, 2018

Add troubleshooting steps to address issue uber#52

24283cf

forbearer mentioned this issue Sep 6, 2018

Add troubleshooting steps to address issue #52 #122

Merged

forbearer pushed a commit to forbearer/petastorm that referenced this issue Sep 6, 2018

Add troubleshooting steps to address issue uber#52

773e38c

forbearer pushed a commit to forbearer/petastorm that referenced this issue Sep 7, 2018

Add troubleshooting steps as workaround for issue uber#52.

f74c9ee

forbearer closed this as completed in #122 Sep 7, 2018

forbearer pushed a commit that referenced this issue Sep 7, 2018

Add troubleshooting steps as workaround for issue #52.

1bfb605

selitvin pushed a commit to selitvin/petastorm that referenced this issue Oct 9, 2018

Add troubleshooting steps as workaround for issue uber#52.

d9941d6
