
Crash in intrusive_list.hpp line 174 #3205

Closed
timjrobinson opened this issue Oct 17, 2014 · 27 comments

@timjrobinson

One of my servers in my 6 node cluster just crashed with the following error:

2014-10-17T04:52:46.750552143 572340.036310s error: Error in ./src/containers/intrusive_list.hpp at line 174:
2014-10-17T04:52:46.750636141 572340.036393s error: Guarantee failed: [!node->in_a_list()]
2014-10-17T04:52:46.750656669 572340.036414s error: Backtrace:
2014-10-17T04:52:47.502883385 572340.788641s error: Fri Oct 17 04:52:46 2014\n\n1: backtrace_t::backtrace_t() at 0xb8b090 (/usr/bin/rethinkdb)\n2: format_backtrace(bool) at 0xb8b423 (/usr/bin/rethinkdb)\n3: report_fatal_error(char const*, int, char const*, ...) at 0xd760c5 (/usr/bin/rethinkdb)\n4: linux_message_hub_t::store_message_sometime(threadnum_t, linux_thread_message_t*) at 0xace9d9 (/usr/bin/rethinkdb)\n5: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa30c91 (/usr/bin/rethinkdb)\n6: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3748d (/usr/bin/rethinkdb)\n7: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3991c (/usr/bin/rethinkdb)\n8: /usr/bin/rethinkdb() [0xaea453] at 0xaea453 ()\n9: coro_t::run() at 0xacc468 (/usr/bin/rethinkdb)
2014-10-17T04:52:47.507345988 572340.793103s error: Exiting.

RethinkDB version:

root@db-7:/var/lib/rethinkdb/instances.d/db-7# rethinkdb -v
rethinkdb 1.15.0-1-0ubuntu1~trusty (GCC 4.8.2)

Ubuntu version 14.04

@timjrobinson
Author

After restarting RethinkDB it appears to be working fine without any data corruption.

@timmaxw
Member

timmaxw commented Oct 17, 2014

Hi @timjrobinson, sorry you ran into this, and thanks for the bug report.

I reformatted your backtrace for legibility:

1: backtrace_t::backtrace_t() at 0xb8b090 (/usr/bin/rethinkdb)
2: format_backtrace(bool) at 0xb8b423 (/usr/bin/rethinkdb)
3: report_fatal_error(char const*, int, char const*, ...) at 0xd760c5 (/usr/bin/rethinkdb)
4: linux_message_hub_t::store_message_sometime(threadnum_t, linux_thread_message_t*) at 0xace9d9 (/usr/bin/rethinkdb)
5: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa30c91 (/usr/bin/rethinkdb)
6: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3748d (/usr/bin/rethinkdb)
7: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3991c (/usr/bin/rethinkdb)
8: /usr/bin/rethinkdb() [0xaea453] at 0xaea453 ()
9: coro_t::run() at 0xacc468 (/usr/bin/rethinkdb)

@mlucy, this looks changefeed-related.

danielmewes added this to the 1.15.x milestone Oct 17, 2014
@danielmewes
Member

A few additional questions, @timjrobinson:

  • Do you know if any specific query was running when this happened?
  • Had you previously reconfigured (a table in) the cluster, or had a different machine been shut down or restarted?
  • As @timmaxw mentioned, this appears to be related to change feeds. Can you explain a little bit how you are using change feeds in your application? How many do you have open at a time (approximately)? What are the actual change feed queries you are running?
  • Are the tables that you use change feeds on sharded? Are they replicated? If yes, how many shards? How many replicas?

Thank you for your help!

Please also let us know in case this happens a second time.

@mlucy
Member

mlucy commented Oct 18, 2014

@AtnNn -- is there a way to go from that backtrace to line numbers? It would really help to know which line of mailbox_cb is failing.

@AtnNn
Member

AtnNn commented Oct 18, 2014

@mlucy:

/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/backtrace.cc:214
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/backtrace.cc:282
??:0
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/./src/containers/intrusive_list.hpp:174
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/rdb_protocol/changefeed.cc:766
/usr/include/boost/variant/variant.hpp:1373
/usr/include/boost/variant/variant.hpp:1373
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/rpc/mailbox/mailbox.cc:267
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/arch/runtime/coroutines.cc:206

The debug symbols are available in the buildbot assets folder, in this case rethinkdb-dbg_1.15.0-1-0ubuntu1~trusty_amd64.deb. Deb files can be extracted with dpkg -x, and addresses converted using addr2line.
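For example (sketch only — the exact path of the debug binary inside the -dbg package is an assumption here, though usr/lib/debug/usr/bin/rethinkdb is the usual location for Debian debug packages), resolving frame 5 of the backtrace above would look like:

dpkg -x rethinkdb-dbg_1.15.0-1-0ubuntu1~trusty_amd64.deb extracted/
addr2line -f -C -e extracted/usr/lib/debug/usr/bin/rethinkdb 0xa30c91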

@mlucy
Member

mlucy commented Oct 18, 2014

@AtnNn -- thanks!

That line is the end of the second if statement in mailbox_cb, which suggests that it's the rwlock_in_line_t destructor causing the problem. Nothing looks wrong with the logic, though. We can't get there without, in order:

  • Checking that we aren't detached.
  • Acquiring a drainer lock.
  • Waiting for the queues to be ready.

There are no obvious race conditions that we might be falling afoul of.

@danielmewes -- is it plausible for destroying an rwlock_in_line_t to cause that guarantee to trigger? Can you glean any information from the fact that that guarantee in particular triggered?

@timjrobinson -- thanks for reporting this! Is there any chance you have a core dump from the crash?

@timjrobinson
Author

@danielmewes

  1. I'm not sure what queries were being run at the time; it crashed in the middle of the night, unfortunately.
  2. I had 2 servers die a month or two ago (Rethinkdb cluster random servers crashing protob.cc line 299 #3077). I also upgraded RethinkDB on the servers a day or two after 1.15 was released, but other than that nothing has been restarted.
  3. I don't know how many change feeds I have open at a time; my best guess is somewhere between 50 and 300.
    I have change feeds for 3 different tables: users, queuers and lobbies. All are accessed with the following code on the model that corresponds to each table. It looks the same for all 3 tables, just with a different model name substituted in.
User.changes = (id, callback) ->
  User.closeChangesConnection(id)
  db.onConnect (err, conn) =>
    if err then return callback(err)
    User.table.changes().filter(db.row('new_val')('id').eq(id)).run conn, (err, cursor) =>
      if err then return callback(err)
      User.changeConnections[id] = conn
      cursor.each (err, data) ->
        if err then return callback(err)
        callback(null, new User(data['new_val']))

User.closeChangesConnection = (id) ->
  if User.changeConnections[id]?
    User.changeConnections[id].close()
    delete User.changeConnections[id]

In posting this I just realized there's a bug in this code: if two connections attempt to listen to the same user / lobby / queuer ID, it will close the existing connection -_-.
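A minimal, untested sketch of one possible fix (using a new, hypothetical User.changeCallbacks map) would be to track callbacks per id, so a second listener reuses the already-open feed instead of closing it:

# Untested sketch. User.changeCallbacks is a new map, keyed by id.
User.changeCallbacks = {}

User.changes = (id, callback) ->
  if User.changeCallbacks[id]?
    # A feed is already open for this id; just register another listener.
    User.changeCallbacks[id].push(callback)
    return
  User.changeCallbacks[id] = [callback]
  db.onConnect (err, conn) =>
    if err then return callback(err)
    User.table.changes().filter(db.row('new_val')('id').eq(id)).run conn, (err, cursor) =>
      if err then return callback(err)
      User.changeConnections[id] = conn
      cursor.each (err, data) ->
        # Fan each change out to every registered listener.
        for cb in (User.changeCallbacks[id] ? [])
          if err then cb(err) else cb(null, new User(data['new_val']))

User.closeChangesConnection = (id) ->
  if User.changeConnections[id]?
    User.changeConnections[id].close()
    delete User.changeConnections[id]
  delete User.changeCallbacks[id]

A real fix would also want per-listener removal (e.g. reference counting), so one socket disconnecting doesn't tear the feed down for everyone else still listening on that id.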

I use this function like so:

User.changes userId, (err, user) =>
  if err then return log.error("User changes encountered error", {err})
  log.info("User details changed, new user is: ", user.getInfo())
  socket.emit('user.details', user.getInfo())

It closes the connection like so:

socket.on 'disconnect', ->
  User.closeChangesConnection(userId)

  4. All three of these tables have 8 shards and 3 replicas. I'm running on 6 nodes, each with 8 GB of RAM and 4 CPUs.

@timjrobinson
Author

@mlucy How do I get a core dump?

Also db-2 crashed today with what looks to be the same error:

2014-10-18T08:54:28.238998527 49506.913574s error: Error in ./src/containers/intrusive_list.hpp at line 174:
2014-10-18T08:54:28.240229488 49506.914805s error: Guarantee failed: [!node->in_a_list()]
2014-10-18T08:54:28.240865301 49506.915479s error: Backtrace:
2014-10-18T08:54:29.164835765 49507.839410s error: Sat Oct 18 08:54:28 2014
1: backtrace_t::backtrace_t() at 0xb8b090 (/usr/bin/rethinkdb)
2: format_backtrace(bool) at 0xb8b423 (/usr/bin/rethinkdb)
3: report_fatal_error(char const*, int, char const*, ...) at 0xd76115 (/usr/bin/rethinkdb)
4: linux_message_hub_t::store_message_sometime(threadnum_t, linux_thread_message_t*) at 0xace9d9 (/usr/bin/rethinkdb)
5: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa30c91 (/usr/bin/rethinkdb)
6: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3748d (/usr/bin/rethinkdb)
7: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3991c (/usr/bin/rethinkdb)
8: /usr/bin/rethinkdb() [0xaea453] at 0xaea453 ()
9: coro_t::run() at 0xacc468 (/usr/bin/rethinkdb)
2014-10-18T08:54:29.166202345 49507.840776s error: Exiting.

I upgraded to RethinkDB 1.15.1 yesterday hoping it might fix the issue, but it looks like it didn't.

@mlucy
Member

mlucy commented Oct 18, 2014

@timjrobinson: if you have core dumps enabled, there will be a file named core in the directory where your program crashed. You can enable core dumps by running ulimit -c unlimited before running RethinkDB.

@mlucy
Member

mlucy commented Oct 18, 2014

Sorry to hear it crashed again! At least it's reproducible, which will help a lot with tracking it down.

If you could get a core file for us, that would be a big help. In the meantime I'm going to take another look at the surrounding code.

@timjrobinson
Author

What directory should this core file be in on Ubuntu 14.04? I checked /var/lib/rethinkdb/instances.d/db-2, /etc/rethinkdb/instances.d and /usr/bin and it's not in any of those folders. I guess it doesn't work until I do ulimit -c unlimited.

To get the core dump working I've done the following:

Logged in as root then:

su - rethinkdb
ulimit -c unlimited 
service rethinkdb restart

Should this be all I need to do?

@timjrobinson
Author

db-4 crashed today with the same error:

2014-10-19T09:27:03.837396955 75090.467864s error: Error in ./src/containers/intrusive_list.hpp at line 174:
2014-10-19T09:27:03.837505603 75090.467972s error: Guarantee failed: [!node->in_a_list()]
2014-10-19T09:27:03.837531775 75090.467998s error: Backtrace:
2014-10-19T09:27:04.553268635 75091.183736s error: Sun Oct 19 09:27:03 2014
1: backtrace_t::backtrace_t() at 0xb8b090 (/usr/bin/rethinkdb)
2: format_backtrace(bool) at 0xb8b423 (/usr/bin/rethinkdb)
3: report_fatal_error(char const*, int, char const*, ...) at 0xd76115 (/usr/bin/rethinkdb)
4: linux_message_hub_t::store_message_sometime(threadnum_t, linux_thread_message_t*) at 0xace9d9 (/usr/bin/rethinkdb)
5: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa30c91 (/usr/bin/rethinkdb)
6: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3748d (/usr/bin/rethinkdb)
7: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3991c (/usr/bin/rethinkdb)
8: /usr/bin/rethinkdb() [0xaea453] at 0xaea453 ()
9: coro_t::run() at 0xacc468 (/usr/bin/rethinkdb)
2014-10-19T09:27:04.555531080 75091.186004s error: Exiting.

A core file was created in / but it's 5.3GB. What's the best way of getting it to you guys?

@danielmewes
Member

@timjrobinson can you write an email to mike at rethinkdb.com? He will send you directions on how to upload the core file. I also recommend gzipping it to reduce its size a bit. (@mglukhovsky)

@danielmewes
Member

(@timjrobinson Also, note that the core file will contain data from your database. If you have sensitive data in there, we are happy to sign an NDA.)

@mglukhovsky
Member

@danielmewes, I've worked with @timjrobinson to transfer the core file to our internal servers, so you should be all set.

@mlucy
Member

mlucy commented Oct 23, 2014

Thanks @timjrobinson, @mglukhovsky. Looking at this now.

@mlucy
Member

mlucy commented Oct 24, 2014

Alright, @danielmewes and I looked at this together and we think we have a fix for this.

@timjrobinson -- if we put together a custom binary for you, could you try running it and see if this problem goes away?

@AtnNn -- could we put together a custom binary based on https://github.com/rethinkdb/rethinkdb/tree/mlucy_3205 ? (It's branched off 1.15.x.)

@mlucy
Member

mlucy commented Oct 24, 2014

(That branch fixes the problem and also turns on some guarantees, so we'll have more information if we misdiagnosed it and the server crashes again.)

@AtnNn
Member

AtnNn commented Oct 24, 2014

Yes. I will build a package for Trusty amd64 from the mlucy_3205 branch.

@AtnNn
Member

AtnNn commented Oct 24, 2014

Here is the package:

http://atnnn.github.io/rethinkdb/rethinkdb_1.15.1+1+g505d7d~0trusty_amd64.deb

It can be installed using dpkg, replacing an already installed rethinkdb_1.15.1 package:

wget http://atnnn.github.io/rethinkdb/rethinkdb_1.15.1+1+g505d7d~0trusty_amd64.deb
dpkg -i rethinkdb_1.15.1+1+g505d7d~0trusty_amd64.deb

@mlucy
Member

mlucy commented Oct 30, 2014

@timjrobinson -- not to be a bother, but did you ever get a chance to test this?

@timjrobinson
Author

Sorry I haven't had a chance yet. Will try it tonight.

@mlucy
Member

mlucy commented Nov 4, 2014

@timjrobinson -- Thanks! Here's hoping it fixes the problem.

danielmewes modified the milestones: 1.15.2, 1.15.x Nov 5, 2014
@danielmewes
Member

@mlucy I would like to ship the likely fix you've implemented with 1.15.2. I'm happy to do the code review so we can merge it today.

chipotle pushed a commit that referenced this issue Nov 5, 2014
@timjrobinson
Author

I've installed it and haven't had any crashes yet. It usually crashes once every 24 hours, so I'll let you guys know in a day or two.

@danielmewes
Member

Thank you for your feedback, @timjrobinson.
We are going to ship the fix you are testing with the upcoming 1.15.2 release (likely release date Friday or Monday). Should your further testing turn up another issue here, we will follow up with another patch release later.

@mlucy
Member

mlucy commented Nov 5, 2014

Alright, this is in next and v1.15.x.

@timjrobinson -- if it turns out this does crash in production even after the fix, please re-open this and we'll look into it again. (Also, thanks again for all the help.)

mlucy closed this as completed Nov 5, 2014