
Crash in intrusive_list.hpp line 174 #3205

Closed
timjrobinson opened this issue Oct 17, 2014 · 27 comments

@timjrobinson

One of my servers in my 6 node cluster just crashed with the following error:

2014-10-17T04:52:46.750552143 572340.036310s error: Error in ./src/containers/intrusive_list.hpp at line 174:
2014-10-17T04:52:46.750636141 572340.036393s error: Guarantee failed: [!node->in_a_list()]
2014-10-17T04:52:46.750656669 572340.036414s error: Backtrace:
2014-10-17T04:52:47.502883385 572340.788641s error: Fri Oct 17 04:52:46 2014\n\n1: backtrace_t::backtrace_t() at 0xb8b090 (/usr/bin/rethinkdb)\n2: format_backtrace(bool) at 0xb8b423 (/usr/bin/rethinkdb)\n3: report_fatal_error(char const*, int, char const*, ...) at 0xd760c5 (/usr/bin/rethinkdb)\n4: linux_message_hub_t::store_message_sometime(threadnum_t, linux_thread_message_t*) at 0xace9d9 (/usr/bin/rethinkdb)\n5: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa30c91 (/usr/bin/rethinkdb)\n6: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3748d (/usr/bin/rethinkdb)\n7: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3991c (/usr/bin/rethinkdb)\n8: /usr/bin/rethinkdb() [0xaea453] at 0xaea453 ()\n9: coro_t::run() at 0xacc468 (/usr/bin/rethinkdb)
2014-10-17T04:52:47.507345988 572340.793103s error: Exiting.

RethinkDB version:

root@db-7:/var/lib/rethinkdb/instances.d/db-7# rethinkdb -v
rethinkdb 1.15.0-1-0ubuntu1~trusty (GCC 4.8.2)

Ubuntu version 14.04

@timjrobinson
Author

After restarting RethinkDB it appears to be working fine without any data corruption.

@timmaxw
Member

timmaxw commented Oct 17, 2014

Hi @timjrobinson, sorry you ran into this, and thanks for the bug report.

I reformatted your backtrace for legibility:

1: backtrace_t::backtrace_t() at 0xb8b090 (/usr/bin/rethinkdb)
2: format_backtrace(bool) at 0xb8b423 (/usr/bin/rethinkdb)
3: report_fatal_error(char const*, int, char const*, ...) at 0xd760c5 (/usr/bin/rethinkdb)
4: linux_message_hub_t::store_message_sometime(threadnum_t, linux_thread_message_t*) at 0xace9d9 (/usr/bin/rethinkdb)
5: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa30c91 (/usr/bin/rethinkdb)
6: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3748d (/usr/bin/rethinkdb)
7: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3991c (/usr/bin/rethinkdb)
8: /usr/bin/rethinkdb() [0xaea453] at 0xaea453 ()
9: coro_t::run() at 0xacc468 (/usr/bin/rethinkdb)

@mlucy, this looks changefeed-related.

danielmewes added this to the 1.15.x milestone Oct 17, 2014
@danielmewes
Member

A few additional questions, @timjrobinson:

  • Do you know if any specific query was running when this happened?
  • Had you previously reconfigured (a table in) the cluster, or had a different machine been shut down or restarted?
  • As @timmaxw mentioned, this appears to be related to change feeds. Can you explain a little bit how you are using change feeds in your application? How many do you have open at a time (approximately)? What are the actual change feed queries you are running?
  • Are the tables that you use change feeds on sharded? Are they replicated? If yes, how many shards? How many replicas?

Thank you for your help!

Please also let us know in case this happens a second time.

@mlucy
Member

mlucy commented Oct 18, 2014

@AtnNn -- is there a way to go from that backtrace to line numbers? It would really help to know which line of mailbox_cb is failing.

@AtnNn
Member

AtnNn commented Oct 18, 2014

@mlucy:

/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/backtrace.cc:214
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/backtrace.cc:282
??:0
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/./src/containers/intrusive_list.hpp:174
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/rdb_protocol/changefeed.cc:766
/usr/include/boost/variant/variant.hpp:1373
/usr/include/boost/variant/variant.hpp:1373
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/rpc/mailbox/mailbox.cc:267
/home/buildslave/buildslave/package-trusty/build-deb/rethinkdb-1.15.0-1/src/arch/runtime/coroutines.cc:206

The debug symbols are available in the buildbot assets folder, in this case rethinkdb-dbg_1.15.0-1-0ubuntu1~trusty_amd64.deb. Deb files can be extracted with dpkg -x, and addresses converted using addr2line.
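For example (sketch only — the exact path of the debug binary inside the -dbg package is an assumption here, though usr/lib/debug/usr/bin/rethinkdb is the usual location for Debian debug packages), resolving frame 5 of the backtrace above would look like:

dpkg -x rethinkdb-dbg_1.15.0-1-0ubuntu1~trusty_amd64.deb extracted/
addr2line -f -C -e extracted/usr/lib/debug/usr/bin/rethinkdb 0xa30c91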

@mlucy
Member

mlucy commented Oct 18, 2014

@AtnNn -- thanks!

That line is the end of the second if statement in mailbox_cb, which suggests that it's the rwlock_in_line_t destructor causing the problem. Nothing looks wrong with the logic, though. We can't get there without, in order:

  • Checking that we aren't detached.
  • Acquiring a drainer lock.
  • Waiting for the queues to be ready.

There are no obvious race conditions that we might be falling afoul of.

@danielmewes -- is it plausible for destroying an rwlock_in_line_t to cause that guarantee to trigger? Can you glean any information from the fact that that guarantee in particular triggered?

@timjrobinson -- thanks for reporting this! Is there any chance you have a core dump from the crash?

@timjrobinson
Author

@danielmewes

  1. I'm not sure what queries were being run at the time; it crashed in the middle of the night, unfortunately.
  2. I had 2 servers die a month or two ago (Rethinkdb cluster random servers crashing protob.cc line 299 #3077). I also upgraded RethinkDB on the servers a day or two after 1.15 was released, but other than that nothing has been restarted.
  3. I don't know how many change feeds I have open at a time; my best guess is somewhere between 50 and 300.
    I have change feeds for 3 different tables: users, queuers and lobbies. All are accessed with the following code on the model that corresponds to each table. It looks the same for all 3 tables, just with a different model name substituted in.
User.changes = (id, callback) ->
  User.closeChangesConnection(id)
  db.onConnect (err, conn) =>
    if err then return callback(err)
    User.table.changes().filter(db.row('new_val')('id').eq(id)).run conn, (err, cursor) =>
      if err then return callback(err)
      User.changeConnections[id] = conn
      cursor.each (err, data) ->
        if err then return callback(err)
        callback(null, new User(data['new_val']))

User.closeChangesConnection = (id) ->
  if User.changeConnections[id]?
    User.changeConnections[id].close()
    delete User.changeConnections[id]

In posting this I just realized there's a bug in this code: if two connections attempt to listen to the same user / lobby / queuer ID, it will close the existing connection -_-.
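A minimal, untested sketch of one possible fix (using a new, hypothetical User.changeCallbacks map) would be to track callbacks per id, so a second listener reuses the already-open feed instead of closing it:

# Untested sketch. User.changeCallbacks is a new map, keyed by id.
User.changeCallbacks = {}

User.changes = (id, callback) ->
  if User.changeCallbacks[id]?
    # A feed is already open for this id; just register another listener.
    User.changeCallbacks[id].push(callback)
    return
  User.changeCallbacks[id] = [callback]
  db.onConnect (err, conn) =>
    if err then return callback(err)
    User.table.changes().filter(db.row('new_val')('id').eq(id)).run conn, (err, cursor) =>
      if err then return callback(err)
      User.changeConnections[id] = conn
      cursor.each (err, data) ->
        # Fan each change out to every registered listener.
        for cb in (User.changeCallbacks[id] ? [])
          if err then cb(err) else cb(null, new User(data['new_val']))

User.closeChangesConnection = (id) ->
  if User.changeConnections[id]?
    User.changeConnections[id].close()
    delete User.changeConnections[id]
  delete User.changeCallbacks[id]

A real fix would also want per-listener removal (e.g. reference counting), so one socket disconnecting doesn't tear the feed down for everyone else still listening on that id.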

I use this function like so:

User.changes userId, (err, user) =>
  if err then return log.error("User changes encountered error", {err})
  log.info("User details changed, new user is: ", user.getInfo())
  socket.emit('user.details', user.getInfo())

It closes the connection like so:

socket.on 'disconnect', ->
  User.closeChangesConnection(userId)

  4. All three of these tables have 8 shards and 3 replicas. I'm running on 6 nodes, each with 8 GB of RAM and 4 CPUs.

@timjrobinson
Author

@mlucy How do I get a core dump?

Also db-2 crashed today with what looks to be the same error:

2014-10-18T08:54:28.238998527 49506.913574s error: Error in ./src/containers/intrusive_list.hpp at line 174:
2014-10-18T08:54:28.240229488 49506.914805s error: Guarantee failed: [!node->in_a_list()]
2014-10-18T08:54:28.240865301 49506.915479s error: Backtrace:
2014-10-18T08:54:29.164835765 49507.839410s error: Sat Oct 18 08:54:28 2014
1: backtrace_t::backtrace_t() at 0xb8b090 (/usr/bin/rethinkdb)
2: format_backtrace(bool) at 0xb8b423 (/usr/bin/rethinkdb)
3: report_fatal_error(char const*, int, char const*, ...) at 0xd76115 (/usr/bin/rethinkdb)
4: linux_message_hub_t::store_message_sometime(threadnum_t, linux_thread_message_t*) at 0xace9d9 (/usr/bin/rethinkdb)
5: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa30c91 (/usr/bin/rethinkdb)
6: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3748d (/usr/bin/rethinkdb)
7: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3991c (/usr/bin/rethinkdb)
8: /usr/bin/rethinkdb() [0xaea453] at 0xaea453 ()
9: coro_t::run() at 0xacc468 (/usr/bin/rethinkdb)
2014-10-18T08:54:29.166202345 49507.840776s error: Exiting.

I upgraded to RethinkDB 1.15.1 yesterday hoping it might fix the issue, but it looks like it didn't.

@mlucy
Member

mlucy commented Oct 18, 2014

@timjrobinson: if you have core dumps enabled, there will be a file named core in the directory where your program crashed. You can enable core dumps by running ulimit -c unlimited before running RethinkDB.

@mlucy
Member

mlucy commented Oct 18, 2014

Sorry to hear it crashed again! At least it's reproducible, which will help a lot with tracking it down.

If you could get a core file for us, that would be a big help. In the meantime I'm going to take another look at the surrounding code.

@timjrobinson
Author

What directory should this core file be in on Ubuntu 14.04? I checked /var/lib/rethinkdb/instances.d/db-2, /etc/rethinkdb/instances.d and /usr/bin and it's not in any of those folders. I guess it doesn't work until I do ulimit -c unlimited.

To get the core dump working I've done the following:

Logged in as root then:

su - rethinkdb
ulimit -c unlimited 
service rethinkdb restart

Should this be all I need to do?

@timjrobinson
Author

db-4 crashed today with the same error:

2014-10-19T09:27:03.837396955 75090.467864s error: Error in ./src/containers/intrusive_list.hpp at line 174:
2014-10-19T09:27:03.837505603 75090.467972s error: Guarantee failed: [!node->in_a_list()]
2014-10-19T09:27:03.837531775 75090.467998s error: Backtrace:
2014-10-19T09:27:04.553268635 75091.183736s error: Sun Oct 19 09:27:03 2014
1: backtrace_t::backtrace_t() at 0xb8b090 (/usr/bin/rethinkdb)
2: format_backtrace(bool) at 0xb8b423 (/usr/bin/rethinkdb)
3: report_fatal_error(char const*, int, char const*, ...) at 0xd76115 (/usr/bin/rethinkdb)
4: linux_message_hub_t::store_message_sometime(threadnum_t, linux_thread_message_t*) at 0xace9d9 (/usr/bin/rethinkdb)
5: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa30c91 (/usr/bin/rethinkdb)
6: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3748d (/usr/bin/rethinkdb)
7: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3991c (/usr/bin/rethinkdb)
8: /usr/bin/rethinkdb() [0xaea453] at 0xaea453 ()
9: coro_t::run() at 0xacc468 (/usr/bin/rethinkdb)
2014-10-19T09:27:04.555531080 75091.186004s error: Exiting.

A core file was created in / but it's 5.3GB. What's the best way of getting it to you guys?

@danielmewes
Member

@timjrobinson can you write an email to mike at rethinkdb.com? He will send you directions on how to upload the core file. I also recommend gzipping it to reduce its size a bit. (@mglukhovsky)

@danielmewes
Member

(@timjrobinson Also, note that the core file will contain data from your database. If you have sensitive data in there, we are happy to sign an NDA.)

@mglukhovsky
Member

@danielmewes, I've worked with @timjrobinson to transfer the core file to our internal servers, so you should be all set.

@mlucy
Member

mlucy commented Oct 23, 2014

Thanks @timjrobinson, @mglukhovsky. Looking at this now.

@mlucy
Member

mlucy commented Oct 24, 2014

Alright, @danielmewes and I looked at this together and we think we have a fix for this.

@timjrobinson -- if we put together a custom binary for you, could you try running it and see if this problem goes away?

@AtnNn -- could we put together a custom binary based on https://github.com/rethinkdb/rethinkdb/tree/mlucy_3205 ? (It's branched off 1.15.x.)

@mlucy
Member

mlucy commented Oct 24, 2014

(That branch fixes the problem and also turns on some guarantees, so we'll have more information if we misdiagnosed it and the server crashes again.)

@AtnNn
Member

AtnNn commented Oct 24, 2014

Yes. I will build a package for Trusty amd64 from the mlucy_3205 branch.

@AtnNn
Member

AtnNn commented Oct 24, 2014

Here is the package:

http://atnnn.github.io/rethinkdb/rethinkdb_1.15.1+1+g505d7d~0trusty_amd64.deb

It can be installed using dpkg, replacing an already installed rethinkdb_1.15.1 package:

wget http://atnnn.github.io/rethinkdb/rethinkdb_1.15.1+1+g505d7d~0trusty_amd64.deb
dpkg -i rethinkdb_1.15.1+1+g505d7d~0trusty_amd64.deb

@mlucy
Member

mlucy commented Oct 30, 2014

@timjrobinson -- not to be a bother, but did you ever get a chance to test this?

@timjrobinson
Author

Sorry I haven't had a chance yet. Will try it tonight.

@mlucy
Member

mlucy commented Nov 4, 2014

@timjrobinson -- Thanks! Here's hoping it fixes the problem.

danielmewes modified the milestones: 1.15.2, 1.15.x Nov 5, 2014
@danielmewes
Member

@mlucy I would like to ship the likely fix you've implemented with 1.15.2. I'm happy to do the code review so we can merge it today.

chipotle pushed a commit that referenced this issue Nov 5, 2014
@timjrobinson
Author

I've installed it and haven't had any crashes yet. It usually crashes once every 24 hours, so I'll let you guys know in a day or two.

@danielmewes
Member

Thank you for your feedback, @timjrobinson.
We are going to ship the fix you are testing with the upcoming 1.15.2 release (likely release date Friday or Monday). Should your further testing turn up another issue here, we will follow up with another patch release later.

@mlucy
Member

mlucy commented Nov 5, 2014

Alright, this is in next and v1.15.x.

@timjrobinson -- if it turns out this does crash in production even after the fix, please re-open this and we'll look into it again. (Also, thanks again for all the help.)

mlucy closed this as completed Nov 5, 2014