materialized_views_test.TestMaterializedViews.add_dc_during_mv_insert_test: node crashed due to OOM #4869

Closed
denesb opened this issue Aug 20, 2019 · 13 comments

Comments

@denesb
Contributor

denesb commented Aug 20, 2019

Links:

EDIT: Scylla version: 131acc0

@denesb
Contributor Author

denesb commented Aug 20, 2019

I'm getting this when trying to look at the core:

(gdb) p seastar::local_engine
Cannot find thread-local storage for LWP 2014, executable file /work/scylla-jenkins-crash/scylla:
Cannot find thread-local variables on this target
Cannot find thread-local storage for LWP 2014, executable file /work/scylla-jenkins-crash/scylla:
Cannot find thread-local variables on this target

Probably #4673.

@tgrabiec
Contributor

@denesb It doesn't look like a relocatable package, so I think it's probably just a mismatch between host's and core's libraries. Are you running the GDB inside dbuild container?

@bhalevy changed the title from "materialized_views_test.TestMaterializedViews.add_dc_during_mv_insert_test: node crashed due toOOM" to "materialized_views_test.TestMaterializedViews.add_dc_during_mv_insert_test: node crashed due to OOM" on Aug 20, 2019
@denesb
Contributor Author

denesb commented Aug 20, 2019

No, I tried in a docker container where I installed a relocatable package. I'll try in a dbuild container.

@denesb
Contributor Author

denesb commented Aug 20, 2019

Doesn't work in a dbuild container either.

$ ./tools/toolchain/dbuild -v /home/bdenes/Storage/Work/scylla-jenkins-crash:/work --interactive -- gdb -q --core=/work/node5-ld-linux-x86-64.2014.1565692120.core /work/scylla
Reading symbols from /work/scylla...done.

warning: core file may not match specified executable file.
[New LWP 2014]
[New LWP 2020]
[New LWP 2018]
[New LWP 2019]
[New LWP 2017]
[New LWP 2016]

warning: Could not load shared library symbols for 55 libraries, e.g. /jenkins/workspace/scylla-master/dtest-release/scylla-dtest/../scylla/dynamic_libs/libthrift-0.10.0.so.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?
Core was generated by `scylla --library-path /jenkins/workspace/scylla-master/dtest-release/scylla-dte'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fe1fbc2c53f in ?? ()
[Current thread is 1 (LWP 2014)]
warning: File "/work/scylla-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /work/scylla-gdb.py
line to your configuration file "/home/bdenes/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/home/bdenes/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
(gdb) p seastar::local_engine 
(gdb) Cannot find thread-local storage for LWP 2014, executable file /work/scylla:
Cannot find thread-local variables on this target

@tgrabiec
Contributor

Looks like the problem is that GDB cannot find the right libraries.

Try this under gdb invoked without any arguments:

set solib-search-path /lib64
file ./scylla 
core ./node5-ld-linux-x86-64.2014.1565692120.core 
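
The same sequence can be scripted so that the search path is set before the executable and the core are loaded. A minimal sketch, assuming GDB's standard `-q` and `-ex` options; the helper name `gdb_core_argv` and the `/lib64` default are illustrative and not part of Scylla's tooling:

```python
import shlex


def gdb_core_argv(executable, core, solib_search_path="/lib64"):
    """Build a gdb command line that sets the shared-library search path,
    then loads the executable, then the core file, in that order."""
    commands = [
        f"set solib-search-path {solib_search_path}",
        f"file {executable}",
        f"core-file {core}",
    ]
    argv = ["gdb", "-q"]
    for command in commands:
        # Each -ex argument is executed by gdb in the order given.
        argv += ["-ex", command]
    return argv


if __name__ == "__main__":
    argv = gdb_core_argv(
        "./scylla", "./node5-ld-linux-x86-64.2014.1565692120.core"
    )
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(argv))
```

This only assembles the command line; whether GDB then resolves the libraries still depends on the files under the search path matching the ones the core was generated with.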

@denesb
Contributor Author

denesb commented Aug 22, 2019

Still doesn't work:

$ ./tools/toolchain/dbuild -v /home/bdenes/Storage/Work/:/work --interactive -t --privileged -- gdb -q
(gdb) set solib-search-path /lib64
(gdb) file /work/jenkins-core/scylla 
Reading symbols from /work/jenkins-core/scylla...done.
(gdb) core /work/jenkins-core/node5-ld-linux-x86-64.2014.1565692120.core 
warning: core file may not match specified executable file.
[New LWP 2014]
[New LWP 2020]
[New LWP 2018]
[New LWP 2019]
[New LWP 2017]
[New LWP 2016]
warning: Could not load shared library symbols for scylla.
Do you need "set solib-search-path" or "set sysroot"?
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
Core was generated by `scylla --library-path /jenkins/workspace/scylla-master/dtest-release/scylla-dte'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fe1fbc2c53f in raise () from /usr/lib64/libc-2.28.so
[Current thread is 1 (Thread 0x7fe1f9916100 (LWP 2014))]
Missing separate debuginfos, use: dnf debuginfo-install boost-date-time-1.66.0-15.1.scylladb.fc29.x86_64 boost-filesystem-1.66.0-15.1.scylladb.fc29.x86_64 boost-program-options-1.66.0-15.1.scylladb.fc29.x86_64 boost-regex-1.66.0-15.1.scylladb.fc29.x86_64 boost-system-1.66.0-15.1.scylladb.fc29.x86_64 boost-thread-1.66.0-15.1.scylladb.fc29.x86_64 c-ares-1.13.0-5.fc29.x86_64 cryptopp-7.0.0-2.fc29.x86_64 fmt-5.2.1-1.fc29.x86_64 glibc-2.28-26.fc29.x86_64 gmp-6.1.2-9.fc29.x86_64 gnutls-3.6.7-1.fc29.x86_64 hwloc-libs-1.11.12-1.fc29.x86_64 libatomic-8.3.1-2.fc29.x86_64 libblkid-2.32.1-1.fc29.x86_64 libcap-2.25-12.fc29.x86_64 libffi-3.1-18.fc29.x86_64 libgcrypt-1.8.4-1.fc29.x86_64 libgpg-error-1.33-1.fc29.x86_64 libidn2-2.1.1a-1.fc29.x86_64 libmount-2.32.1-1.fc29.x86_64 libselinux-2.8-6.fc29.x86_64 libstdc++-8.3.1-2.fc29.x86_64 libtasn1-4.13-5.fc29.x86_64 libtool-ltdl-2.4.6-27.fc29.x86_64 libunistring-0.9.10-4.fc29.x86_64 libuuid-2.32.1-1.fc29.x86_64 libxcrypt-4.4.4-2.fc29.x86_64 lz4-libs-1.8.3-1.fc29.x86_64 nettle-3.4.1rc1-1.fc29.x86_64 numactl-libs-2.0.12-1.fc29.x86_64 p11-kit-0.23.15-2.fc29.x86_64 protobuf-3.5.0-8.fc29.x86_64 snappy-1.1.7-8.fc29.x86_64 thrift-0.10.0-15.fc29.x86_64 trousers-lib-0.3.13-11.fc29.x86_64 xz-libs-5.2.4-3.fc29.x86_64 yaml-cpp-0.6.1-4.fc29.x86_64 zlib-1.2.11-14.fc29.x86_64
(gdb) p seastar::local_engine
Cannot find thread-local storage for Thread 0x7fe1f9916100 (LWP 2014), executable file /work/jenkins-core/scylla:
generic error
(gdb) 

Tried on my box, master dbuild, 131acc0 dbuild.

@denesb
Contributor Author

denesb commented Aug 22, 2019

This is strange, as it seems that jenkins is also using dbuild to build and run scylla.

@avikivity
Member

ccm uses relocatable packages in its own unique way, and that defeats debugging. It should use install.sh to install scylla instead of its own hacks (but that needs an extra fix first).

@denesb
Contributor Author

denesb commented Aug 22, 2019

Trying to find something without relying on TLS. Successfully fished a reactor and database pointer from the stack.

(gdb) p $1
$19 = (database * const) 0x600000596010
(gdb) p $15
$20 = (seastar::reactor * const) 0x600000020000

A storage_proxy pointer would be nice too.

There isn't a crazy number of tasks:

(gdb) scylla task-queues *$15
   id name                             shares  tasks
 A 00 "main"                           1000.00 7
   01 "atexit"                         1000.00 0
 A 02 "streaming"                       200.00 3
 A 03 "compaction"                       50.54 4
   04 "mem_compaction"                 1000.00 0
*A 05 "statement"                      1000.00 1008
   06 "memtable"                          3.96 0
   07 "memtable_to_cache"               200.00 0

We have 1k statement-related tasks executing. The crash also happens in a database::apply() call.
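
To compare queues across several cores without eyeballing the table, the output can be parsed offline. A rough sketch, assuming the column layout shown above (id, quoted name, shares, tasks); the function names are hypothetical and the leading `A` and `*` markers are simply ignored:

```python
import re

# Matches: <id> "<name>" <shares> <tasks> at the end of a line.
# The header row has no quoted name, so it never matches.
ROW = re.compile(r'(\d+)\s+"([^"]+)"\s+([\d.]+)\s+(\d+)\s*$')


def parse_task_queues(text):
    """Return a list of (name, shares, tasks) tuples from the table text."""
    rows = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match:
            _queue_id, name, shares, tasks = match.groups()
            rows.append((name, float(shares), int(tasks)))
    return rows


def busiest_queue(rows):
    """Return (name, tasks, total_tasks) for the queue with the most tasks."""
    total = sum(tasks for _name, _shares, tasks in rows)
    name, _shares, tasks = max(rows, key=lambda row: row[2])
    return name, tasks, total


if __name__ == "__main__":
    sample = '''   id name                             shares  tasks
 A 00 "main"                           1000.00 7
*A 05 "statement"                      1000.00 1008
'''
    print(busiest_queue(parse_task_queues(sample)))
```

On the full table above this reports that "statement" holds 1008 of the 1022 queued tasks.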

@denesb
Contributor Author

denesb commented Aug 22, 2019

The tasks seem to be coming from the apply stage:

(gdb) python inheriting_execution_stage('$1->_data_query_stage').print()
ES: id= 0 enqueued=74 executed=74 diff=0
ES: id= 1 None
ES: id= 2 None
ES: id= 3 None
ES: id= 4 None
ES: id= 5 None
ES: id= 6 None
ES: id= 7 None
ES: id= 8 None
ES: id= 9 None
ES: id=10 None
ES: id=11 None
ES: id=12 None
ES: id=13 None
ES: id=14 None
ES: id=15 None
(gdb) python inheriting_execution_stage('$1->_apply_stage').print()
ES: id= 0 enqueued=93 executed=93 diff=0
ES: id= 1 None
ES: id= 2 None
ES: id= 3 enqueued=40 executed=40 diff=0
ES: id= 4 None
ES: id= 5 enqueued=168036 executed=167016 diff=1020
ES: id= 6 None
ES: id= 7 None
ES: id= 8 None
ES: id= 9 None
ES: id=10 None
ES: id=11 None
ES: id=12 None
ES: id=13 None
ES: id=14 None
ES: id=15 None

No exploded queue found so far.
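
The `diff` column is the number of calls enqueued but not yet executed, so checking a stage for a backlog reduces to a subtraction per scheduling group. A small sketch that does this on captured output, assuming the `ES: id=... enqueued=... executed=... diff=...` line format shown above; `stage_backlogs` and the threshold parameter are hypothetical names for illustration:

```python
import re

# Rows printed as "ES: id= N None" carry no counters and are skipped.
ES_LINE = re.compile(r"ES: id=\s*(\d+) enqueued=(\d+) executed=(\d+) diff=(\d+)")


def stage_backlogs(text):
    """Map scheduling-group id -> (enqueued - executed) for each populated row."""
    backlogs = {}
    for line in text.splitlines():
        match = ES_LINE.search(line)
        if match:
            group_id, enqueued, executed, _diff = map(int, match.groups())
            backlogs[group_id] = enqueued - executed
    return backlogs


def over_threshold(backlogs, threshold):
    """Return the ids whose backlog exceeds the given threshold, sorted."""
    return sorted(gid for gid, pending in backlogs.items() if pending > threshold)


if __name__ == "__main__":
    sample = """ES: id= 0 enqueued=93 executed=93 diff=0
ES: id= 1 None
ES: id= 5 enqueued=168036 executed=167016 diff=1020
"""
    backlogs = stage_backlogs(sample)
    print(backlogs, over_threshold(backlogs, 1000))
```

For the `_apply_stage` output above, only id 5 has a non-zero backlog (1020), which lines up with the roughly 1k tasks in the "statement" queue. What threshold counts as "exploded" is a judgment call and not something this thread establishes.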

@denesb
Contributor Author

denesb commented Aug 22, 2019

_mutation_query_stage is sane too.

@denesb
Contributor Author

denesb commented Aug 22, 2019

Managed to fish a storage_proxy pointer:

(gdb) p $41                         
$53 = (service::storage_proxy *) 0x6000007fd400

I don't know why I thought it was also located in TLS. In fact it is a static, which is accessible.
Found nothing of note in the stats. No queue seems to be exploding.

@denesb
Contributor Author

denesb commented Aug 23, 2019

Without access to TLS (#4673) I am just shooting in the dark, with no way to tell whether the little that I'm seeing (e.g. 1000 tasks) is a problem or not. Closing for now; we should reopen (or create a new issue) when we manage to reproduce with TLS debugging.

@denesb denesb closed this as completed Aug 23, 2019