Core Dump #3393

Closed
internalfx opened this issue Dec 3, 2014 · 20 comments

@internalfx

Not sure what to make of it exactly, I do use change feeds....

Server was under high load when it happened.

Version: rethinkdb 1.15.2~0trusty (GCC 4.8.2)
error: Error in src/concurrency/auto_drainer.cc at line 24:
error: Guarantee failed: [!parent->draining.is_pulsed()] New processes should not acquire a draining `auto_drainer_t`.
error: Backtrace:
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
error: Wed Dec  3 13:54:07 2014

       1: backtrace_t::backtrace_t() at 0xb927b0 (rethinkdb)
       2: format_backtrace(bool) at 0xb92b43 (rethinkdb)
       3: report_fatal_error(char const*, int, char const*, ...) at 0xd7ed35 (rethinkdb)
       4: auto_drainer_t::lock_t::lock_t(auto_drainer_t*) at 0xb9b1e8 (rethinkdb)
       5: ql::changefeed::feed_t::each_sub_in_vec(std::vector<std::set<ql::changefeed::subscription_t*, std::less<ql::changefeed::subscription_t*>, std::allocator<ql::changefeed::subscription_t*> >, std::allocator<std::set<ql::changefeed::subscription_t*, std::less<ql::changefeed::subscription_t*>, std::allocator<ql::changefeed::subscription_t*> > > > const&, rwlock_in_line_t*, std::function<void (ql::changefeed::subscription_t*)> const&) at 0xa319b5 (rethinkdb)
       6: ql::changefeed::feed_t::each_table_sub(std::function<void (ql::changefeed::subscription_t*)> const&) at 0xa31b35 (rethinkdb)
       7: ql::changefeed::msg_visitor_t::operator()(ql::changefeed::msg_t::change_t const&) const at 0xa3bc53 (rethinkdb)
       8: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa32691 (rethinkdb)
       9: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3923d (rethinkdb)
       10: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3b6a5 (rethinkdb)
       11: mailbox_manager_t::mailbox_read_coroutine(connectivity_cluster_t::connection_t*, auto_drainer_t::lock_t, threadnum_t, unsigned long, std::vector<char, std::allocator<char> >*, long, mailbox_manager_t::force_yield_t) at 0xaefadb (rethinkdb)
       12: rethinkdb() [0xaefbd2] at 0xaefbd2 ()
       13: coro_t::run() at 0xad11c8 (rethinkdb)
error: Exiting.
Trace/breakpoint trap (core dumped)
@danielmewes added this to the 1.15.x milestone Dec 3, 2014
@danielmewes
Member

Thank you for the bug report @internalfx .
I know you have been working with sensitive data. Is there a chance you can give us the core file for further analysis? Note that it might contain parts of your database. We would be happy to sign an NDA.

@mlucy can you take a look?

@internalfx
Author

I will check on the possibility of giving data...

I can at least give you the script that was running....

sync = require('synchronize')
async = require('async')
_ = require('lodash')
r = require('rethinkdb')
moment = require('moment')
numeral = require('numeral')
inflect = require('inflection')
deep = require('deep-diff')

module.exports = (grunt) ->
  grunt.registerTask('devsync', '', ->
    done = this.async()
    process.maxTickDepth = 9999999

    json_deserializer = (key, value) ->
      if _.isString(value)
        regexp = /^\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d.\d\d\dZ$/.exec(value)
        if regexp
          return new Date(value)
      return value

    sync.fiber(->
      conn = sync.await(r.connect({host: 'localhost', db: 'aamcrm'}, sync.defer()))

      #Create Tables
      table_list = sync.await(r.db('aamcrm').tableList().run(conn, sync.defer()))
      table_list_dev = sync.await(r.db('aamcrm_dev').tableList().run(conn, sync.defer()))

      for table in table_list
        if _.contains(table_list_dev, table)
          sync.await(r.db('aamcrm_dev').tableDrop(table).run(conn, sync.defer()))

      table_list_dev = sync.await(r.db('aamcrm_dev').tableList().run(conn, sync.defer()))

      console.log "REBUILDING TABLES..."
      for table in table_list
        unless _.contains(table_list_dev, table)
          sync.await(r.db('aamcrm_dev').tableCreate(table).run(conn, sync.defer()))

      console.log "REBUILDING INDEXES..."
      for table in table_list
        indexes = sync.await(r.db('aamcrm').table(table).indexList().run(conn, sync.defer()))
        for index in indexes
          indexes_dev = sync.await(r.db('aamcrm_dev').table(table).indexList().run(conn, sync.defer()))
          unless _.contains(indexes_dev, index)
            index_obj = _.first(sync.await(r.db('aamcrm').table(table).indexStatus(index).run(conn, sync.defer())))
            sync.await(r.db('aamcrm_dev').table(table).indexCreate(
              index_obj.index, index_obj.function, {geo: index_obj.geo, multi: index_obj.multi}
            ).run(conn, sync.defer()))

      for table in table_list
        console.log "CLONING DATA:", table
        sync.await(r.db('aamcrm_dev').table(table).insert(r.db('aamcrm').table(table)).run(conn, sync.defer()))

      sync.await(conn.close(sync.defer()))

      done()
    )

  )

It died while cloning one of the tables... I'm sure I just overloaded something; it must have flooded the feed like crazy.

@danielmewes
Member

Thank you for the script @internalfx . Even under the heaviest possible load, the server shouldn't crash. So this is definitely a bug and we'll try to fix it asap.

@internalfx
Author

agreed...

I think I can at least get by for now by not doing dumb stuff like rewriting the WHOLE TABLE while listening on a change feed.

Keep up the great work guys!

@mlucy
Member

mlucy commented Dec 3, 2014

Working on this now. Thanks for the bug report!

@mlucy
Member

mlucy commented Dec 5, 2014

Alright, this is a bug similar to others we've had in the past: the feed_t is being destroyed, but there are a couple of ongoing operations, so the auto_drainer's destruction blocks; then a change comes in, we try to acquire an auto drainer lock partway through processing the change, and things blow up.

What makes this bug different from the last few is that the lock we're acquiring is superfluous. It looks like we're correctly acquiring an auto drainer lock without crashing in feed_t::mailbox_cb (because we check first that we aren't detached), but then we do some blocking work and later try to acquire another superfluous lock in feed_t::each_sub_in_vec, which causes us to crash because we started destroying ourselves in-between the two events.

(In other words, this bug can't just be solved by moving the mailbox below the auto_drainer, unlike the last similar bug we encountered.)

The immediate solution is probably to make each_sub_in_vec take a lock rather than acquiring a lock, so that mailbox_cb can just thread its legally-acquired lock down to prove we have one.
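
A minimal, self-contained sketch of that approach, using hypothetical names rather than the real RethinkDB types (the actual signatures in the changefeed code are more involved): the callee accepts an already-held lock as proof instead of constructing a fresh one, so it can no longer trip the guarantee mid-drain.

    // Hypothetical sketch only -- not the actual auto_drainer_t / feed_t code.
    #include <cassert>
    #include <functional>

    class drainer_t {
    public:
        class lock_t {
        public:
            explicit lock_t(drainer_t *parent) : parent_(parent) {
                // This is the guarantee that fired in the backtrace above.
                assert(!parent_->draining_);
                ++parent_->refs_;
            }
            lock_t(const lock_t &other) : parent_(other.parent_) { ++parent_->refs_; }
            ~lock_t() { --parent_->refs_; }
        private:
            drainer_t *parent_;
        };
        bool draining_ = false;
        int refs_ = 0;
    };

    // Takes the caller's lock as proof that draining hasn't completed, rather
    // than acquiring a new (superfluous) lock that can crash if draining started.
    void each_sub_in_vec(const drainer_t::lock_t &proof,
                         const std::function<void()> &notify) {
        (void)proof;  // holding it for the duration is the whole point
        notify();
    }

    void mailbox_cb(drainer_t *drainer) {
        drainer_t::lock_t lock(drainer);  // legally acquired up front
        // ... blocking work; draining may start here without invalidating `lock` ...
        each_sub_in_vec(lock, [] { /* notify subscribers */ });
    }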


Longer-term, I think it might be nice to have an alternate interface to auto drainers. I understand why we don't let people acquire a lock on a draining drainer; the chance of infinitely delayed destruction is too high (and this is especially the case when the drainer is protecting a callback that can spawn jobs). But it would be really nice if I could write if (maybe_lock_t = drainer.maybe_lock()) { do_work(); }, and if this interface were encouraged. 90% of the time the behavior I want is "acquire a drainer lock and execute this code if you can, otherwise do nothing" rather than "acquire a drainer lock and execute this code if you can, otherwise crash the server". @danielmewes, @timmaxw -- any thoughts on this?
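
For concreteness, a rough sketch of what such an interface might look like (hypothetical: maybe_lock() is not part of the real auto_drainer_t API, and the idea is retracted a few comments down; std::optional is used purely for brevity):

    #include <optional>

    class drainer_t {
    public:
        struct lock_t { /* would ref-count the drainer as usual */ };

        std::optional<lock_t> maybe_lock() {
            if (draining_) return std::nullopt;  // "otherwise do nothing"
            return lock_t{};                     // "acquire a lock if you can"
        }
    private:
        bool draining_ = false;
    };

    void on_change(drainer_t *drainer) {
        if (auto lock = drainer->maybe_lock()) {
            // do_work();  // runs only if the drainer wasn't already draining
        }
    }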


@larkost -- How hard would it be to write a test to duplicate this scenario (high write load, all of it going to a changefeed, and then the changefeed gets disconnected in a variety of different ways)?

@timmaxw
Member

timmaxw commented Dec 5, 2014

I would be OK with the thing you're proposing, although I almost never want that pattern in my own code.

We should check if the C++ standard guarantees that it's legal to access an object during that object's destructor.

@VeXocide
Member

VeXocide commented Dec 5, 2014

Quoting the C++ standard:

3.8 Object lifetime
The lifetime of an object of type T ends when:
— if T is a class type with a non-trivial destructor (12.4), the destructor call starts, or
— the storage which the object occupies is reused or released.

12.4 Destructors
A destructor is trivial if it is not user-provided and if: ...

Since the destructor of auto_drainer_t is user-provided, the lifetime of an instance ends when the destructor call starts, so it can't legally be accessed during the destructor.
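
In concrete terms, the pattern under discussion looks roughly like this (illustrative only, heavily simplified from the real code):

    // Coroutine B touches the drainer while coroutine A is already inside the
    // destructor. Per the 3.8 / 12.4 wording quoted above, the object's lifetime
    // has ended once ~auto_drainer_t() starts, so B's access is formally outside
    // the object's lifetime, even though the destructor is merely blocked
    // waiting for outstanding locks to be released.
    struct auto_drainer_t {
        bool draining = false;
        ~auto_drainer_t() {
            draining = true;
            // ... yields to other coroutines until all outstanding locks are released ...
        }
    };

    void coroutine_b(auto_drainer_t *drainer) {
        // Runs while the destructor above is blocked.
        if (!drainer->draining) {
            // acquire a lock and do work
        }
    }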

@mlucy
Member

mlucy commented Dec 5, 2014

We violate that rule pretty freely, for the record.

@mlucy
Member

mlucy commented Dec 5, 2014

I looked into this more, and section 12.7 of https://isocpp.org/files/papers/N3690.pdf says:

For an object with a non-trivial constructor, referring to any non-static member or base class of the object before the constructor begins execution results in undefined behavior. For an object with a non-trivial destructor, referring to any non-static member or base class of the object after the destructor finishes execution results in undefined behavior.

Which seems to give us a bit more leeway. In practice, since the destructor itself needs to be able to manipulate the object freely, I have trouble imagining an optimization that would allow the destructor to behave correctly in all circumstances but prevent a function running in another coroutine from accessing the object safely.

@danielmewes
Member

@mlucy I think that interface would be useful, though possibly a little unsafe.
One would have to be careful not to call it after the auto_drainer_t has finished destructing.
I might be missing something, but as far as I can see you can only be sure of that if you know that we (or our caller) already hold a lock on the drainer. That raises the question of why we would want to create a new lock in the first place.

@mlucy
Member

mlucy commented Dec 5, 2014

@danielmewes -- that's a good point, actually. I retract my proposed interface.

@mlucy
Member

mlucy commented Dec 6, 2014

The fix is in CR 2381 by @danielmewes, although we don't yet have a good way to test that it actually fixes the crashing bug in question.

@mlucy added the tp:review label Dec 6, 2014
@mlucy
Member

mlucy commented Dec 9, 2014

This is in next. @danielmewes -- are we doing another point release between now and 1.16 (i.e. should I backport this)?

@danielmewes
Member

Since 1.16 is at least a month away, I think we should do another point release.
@mlucy can you backport it to 1.15.x?

@AtnNn Can you start making preparations for a 1.15.3 point release? We would include this, #3346 , #3253 , and should also change libcurl to be linked statically (--static curl) on OS X by default while we're at it.

@danielmewes
Member

@mlucy is this in v1.15.x?

@AtnNn modified the milestones: 1.15.3, 1.15.x Dec 16, 2014
@mlucy
Member

mlucy commented Dec 18, 2014

@danielmewes: merging this didn't work too well because next and v1.15.x have diverged quite a bit, so I just made the change again manually. It's up in CR 2415.

@danielmewes
Member

@mlucy just merged this 9e30583

@AtnNn
Member

AtnNn commented Jan 9, 2015

@internalfx The fix has been released in RethinkDB 1.15.3.

@internalfx
Author

@danielmewes @mlucy @AtnNn You guys never cease to impress.
