Core Dump #3393

Closed
internalfx opened this issue Dec 3, 2014 · 20 comments

@internalfx

Not sure what to make of it exactly, I do use change feeds....

Server was under high load when it happened.

Version: rethinkdb 1.15.2~0trusty (GCC 4.8.2)
error: Error in src/concurrency/auto_drainer.cc at line 24:
error: Guarantee failed: [!parent->draining.is_pulsed()] New processes should not acquire a draining `auto_drainer_t`.
error: Backtrace:
addr2line: 'rethinkdb': No such file
addr2line: 'rethinkdb': No such file
error: Wed Dec  3 13:54:07 2014

       1: backtrace_t::backtrace_t() at 0xb927b0 (rethinkdb)
       2: format_backtrace(bool) at 0xb92b43 (rethinkdb)
       3: report_fatal_error(char const*, int, char const*, ...) at 0xd7ed35 (rethinkdb)
       4: auto_drainer_t::lock_t::lock_t(auto_drainer_t*) at 0xb9b1e8 (rethinkdb)
       5: ql::changefeed::feed_t::each_sub_in_vec(std::vector<std::set<ql::changefeed::subscription_t*, std::less<ql::changefeed::subscription_t*>, std::allocator<ql::changefeed::subscription_t*> >, std::allocator<std::set<ql::changefeed::subscription_t*, std::less<ql::changefeed::subscription_t*>, std::allocator<ql::changefeed::subscription_t*> > > > const&, rwlock_in_line_t*, std::function<void (ql::changefeed::subscription_t*)> const&) at 0xa319b5 (rethinkdb)
       6: ql::changefeed::feed_t::each_table_sub(std::function<void (ql::changefeed::subscription_t*)> const&) at 0xa31b35 (rethinkdb)
       7: ql::changefeed::msg_visitor_t::operator()(ql::changefeed::msg_t::change_t const&) const at 0xa3bc53 (rethinkdb)
       8: ql::changefeed::feed_t::mailbox_cb(ql::changefeed::stamped_msg_t) at 0xa32691 (rethinkdb)
       9: std::_Function_handler<void (ql::changefeed::stamped_msg_t), std::_Bind<std::_Mem_fn<void (ql::changefeed::feed_t::*)(ql::changefeed::stamped_msg_t)> (ql::changefeed::feed_t*, std::_Placeholder<1>)> >::_M_invoke(std::_Any_data const&, ql::changefeed::stamped_msg_t) at 0xa3923d (rethinkdb)
       10: mailbox_t<void (ql::changefeed::stamped_msg_t)>::read_impl_t::read(cluster_version_t, read_stream_t*) at 0xa3b6a5 (rethinkdb)
       11: mailbox_manager_t::mailbox_read_coroutine(connectivity_cluster_t::connection_t*, auto_drainer_t::lock_t, threadnum_t, unsigned long, std::vector<char, std::allocator<char> >*, long, mailbox_manager_t::force_yield_t) at 0xaefadb (rethinkdb)
       12: rethinkdb() [0xaefbd2] at 0xaefbd2 ()
       13: coro_t::run() at 0xad11c8 (rethinkdb)
error: Exiting.
Trace/breakpoint trap (core dumped)
@danielmewes added this to the 1.15.x milestone Dec 3, 2014
@danielmewes
Member

Thank you for the bug report @internalfx .
I know you have been working with sensitive data. Is there a chance you can give us the core file for further analysis? Note that it might contain parts of your database. We would be happy to sign an NDA.

@mlucy can you take a look?

@internalfx
Author

I will check on the possibility of giving data...

I can at least give you the script that was running....

sync = require('synchronize')
async = require('async')
_ = require('lodash')
r = require('rethinkdb')
moment = require('moment')
numeral = require('numeral')
inflect = require('inflection')
deep = require('deep-diff')

module.exports = (grunt) ->
  grunt.registerTask('devsync', '', ->
    done = this.async()
    process.maxTickDepth = 9999999

    json_deserializer = (key, value) ->
      if _.isString(value)
        regexp = /^\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d.\d\d\dZ$/.exec(value)
        if regexp
          return new Date(value)
      return value

    sync.fiber(->
      conn = sync.await(r.connect({host: 'localhost', db: 'aamcrm'}, sync.defer()))

      #Create Tables
      table_list = sync.await(r.db('aamcrm').tableList().run(conn, sync.defer()))
      table_list_dev = sync.await(r.db('aamcrm_dev').tableList().run(conn, sync.defer()))

      for table in table_list
        if _.contains(table_list_dev, table)
          sync.await(r.db('aamcrm_dev').tableDrop(table).run(conn, sync.defer()))

      table_list_dev = sync.await(r.db('aamcrm_dev').tableList().run(conn, sync.defer()))

      console.log "REBUILDING TABLES..."
      for table in table_list
        unless _.contains(table_list_dev, table)
          sync.await(r.db('aamcrm_dev').tableCreate(table).run(conn, sync.defer()))

      console.log "REBUILDING INDEXES..."
      for table in table_list
        indexes = sync.await(r.db('aamcrm').table(table).indexList().run(conn, sync.defer()))
        for index in indexes
          indexes_dev = sync.await(r.db('aamcrm_dev').table(table).indexList().run(conn, sync.defer()))
          unless _.contains(indexes_dev, index)
            index_obj = _.first(sync.await(r.db('aamcrm').table(table).indexStatus(index).run(conn, sync.defer())))
            sync.await(r.db('aamcrm_dev').table(table).indexCreate(
              index_obj.index, index_obj.function, {geo: index_obj.geo, multi: index_obj.multi}
            ).run(conn, sync.defer()))

      for table in table_list
        console.log "CLONING DATA:", table
        sync.await(r.db('aamcrm_dev').table(table).insert(r.db('aamcrm').table(table)).run(conn, sync.defer()))

      sync.await(conn.close(sync.defer()))

      done()
    )

  )

It died while cloning one of the tables... I'm sure I just overloaded something; it must have flooded the feed like crazy.

@danielmewes
Member

Thank you for the script @internalfx . Even under the heaviest possible load, the server shouldn't crash. So this is definitely a bug and we'll try to fix it asap.

@internalfx
Author

agreed...

I think I can at least get by for now by not doing dumb stuff like rewriting the WHOLE TABLE while listening on a change feed.

Keep up the great work guys!

@mlucy
Member

mlucy commented Dec 3, 2014

Working on this now. Thanks for the bug report!

@mlucy
Member

mlucy commented Dec 5, 2014

Alright, this is a bug similar to others we've had in the past: the feed_t is being destroyed, but there are a couple of ongoing operations, so the auto_drainer's destruction blocks; then a change comes in, we try to acquire an auto drainer lock partway through processing the change, and things blow up.

What makes this bug different from the last few is that the lock we're acquiring is superfluous. It looks like we're correctly acquiring an auto drainer lock without crashing in feed_t::mailbox_cb (because we check first that we aren't detached), but then we do some blocking work and later try to acquire another superfluous lock in feed_t::each_sub_in_vec, which causes us to crash because we started destroying ourselves in-between the two events.

(In other words, this bug can't just be solved by moving the mailbox below the auto_drainer, unlike the last similar bug we encountered.)

The immediate solution is probably to make each_sub_in_vec take a lock rather than acquiring a lock, so that mailbox_cb can just thread its legally-acquired lock down to prove we have one.
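
A minimal, self-contained sketch of that approach, using hypothetical names rather than the real RethinkDB types (the actual signatures in the changefeed code are more involved): the callee accepts an already-held lock as proof instead of constructing a fresh one, so it can no longer trip the guarantee mid-drain.

    // Hypothetical sketch only -- not the actual auto_drainer_t / feed_t code.
    #include <cassert>
    #include <functional>

    class drainer_t {
    public:
        class lock_t {
        public:
            explicit lock_t(drainer_t *parent) : parent_(parent) {
                // This is the guarantee that fired in the backtrace above.
                assert(!parent_->draining_);
                ++parent_->refs_;
            }
            lock_t(const lock_t &other) : parent_(other.parent_) { ++parent_->refs_; }
            ~lock_t() { --parent_->refs_; }
        private:
            drainer_t *parent_;
        };
        bool draining_ = false;
        int refs_ = 0;
    };

    // Takes the caller's lock as proof that draining hasn't completed, rather
    // than acquiring a new (superfluous) lock that can crash if draining started.
    void each_sub_in_vec(const drainer_t::lock_t &proof,
                         const std::function<void()> &notify) {
        (void)proof;  // holding it for the duration is the whole point
        notify();
    }

    void mailbox_cb(drainer_t *drainer) {
        drainer_t::lock_t lock(drainer);  // legally acquired up front
        // ... blocking work; draining may start here without invalidating `lock` ...
        each_sub_in_vec(lock, [] { /* notify subscribers */ });
    }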


Longer-term, I think it might be nice to have an alternate interface to auto drainers. I understand why we don't let people acquire a lock on a draining drainer; the chance of infinitely delayed destruction is too high (and this is especially the case when the drainer is protecting a callback that can spawn jobs). But it would be really nice if I could write if (maybe_lock_t = drainer.maybe_lock()) { do_work(); }, and if this interface were encouraged. 90% of the time the behavior I want is "acquire a drainer lock and execute this code if you can, otherwise do nothing" rather than "acquire a drainer lock and execute this code if you can, otherwise crash the server". @danielmewes, @timmaxw -- any thoughts on this?
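
For concreteness, a rough sketch of what such an interface might look like (hypothetical: maybe_lock() is not part of the real auto_drainer_t API, and the idea is retracted a few comments down; std::optional is used purely for brevity):

    #include <optional>

    class drainer_t {
    public:
        struct lock_t { /* would ref-count the drainer as usual */ };

        std::optional<lock_t> maybe_lock() {
            if (draining_) return std::nullopt;  // "otherwise do nothing"
            return lock_t{};                     // "acquire a lock if you can"
        }
    private:
        bool draining_ = false;
    };

    void on_change(drainer_t *drainer) {
        if (auto lock = drainer->maybe_lock()) {
            // do_work();  // runs only if the drainer wasn't already draining
        }
    }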


@larkost -- How hard would it be to write a test to duplicate this scenario (high write load, all of it going to a changefeed, and then the changefeed gets disconnected in a variety of different ways)?

@timmaxw
Member

timmaxw commented Dec 5, 2014

I would be OK with the thing you're proposing, although I almost never want that pattern in my own code.

We should check if the C++ standard guarantees that it's legal to access an object during that object's destructor.

@VeXocide
Member

VeXocide commented Dec 5, 2014

Quoting the C++ standard:

3.8 Object lifetime
The lifetime of an object of type T ends when:
— if T is a class type with a non-trivial destructor (12.4), the destructor call starts, or
— the storage which the object occupies is reused or released.

12.4 Destructors
A destructor is trivial if it is not user-provided and if: ...

Since the destructor of auto_drainer_t is user-provided, the lifetime of an instance ends when the destructor call starts, so it can't legally be accessed during the destructor.
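
In concrete terms, the pattern under discussion looks roughly like this (illustrative only, heavily simplified from the real code):

    // Coroutine B touches the drainer while coroutine A is already inside the
    // destructor. Per the 3.8 / 12.4 wording quoted above, the object's lifetime
    // has ended once ~auto_drainer_t() starts, so B's access is formally outside
    // the object's lifetime, even though the destructor is merely blocked
    // waiting for outstanding locks to be released.
    struct auto_drainer_t {
        bool draining = false;
        ~auto_drainer_t() {
            draining = true;
            // ... yields to other coroutines until all outstanding locks are released ...
        }
    };

    void coroutine_b(auto_drainer_t *drainer) {
        // Runs while the destructor above is blocked.
        if (!drainer->draining) {
            // acquire a lock and do work
        }
    }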

@mlucy
Member

mlucy commented Dec 5, 2014

We violate that rule pretty freely, for the record.

@mlucy
Member

mlucy commented Dec 5, 2014

I looked into this more, and section 12.7 of https://isocpp.org/files/papers/N3690.pdf says:

For an object with a non-trivial constructor, referring to any non-static member or base class of the object before the constructor begins execution results in undefined behavior. For an object with a non-trivial destructor, referring to any non-static member or base class of the object after the destructor finishes execution results in undefined behavior.

Which seems to give us a bit more leeway. In practice, since the destructor itself needs to be able to manipulate the object freely, I have trouble imagining an optimization that would allow the destructor to behave correctly in all circumstances but prevent a function running in another coroutine from accessing the object safely.

@danielmewes
Member

@mlucy I think that interface would be useful, though possibly a little unsafe.
One would have to be careful not to call it after the auto_drainer_t has finished destructing.
I might be missing something, but as far as I can see you can only be sure of that if you know that we (or our caller) already hold a lock on the drainer. That raises the question of why we would want to create a new lock in the first place.

@mlucy
Member

mlucy commented Dec 5, 2014

@danielmewes -- that's a good point, actually. I retract my proposed interface.

@mlucy
Member

mlucy commented Dec 6, 2014

The fix is in CR 2381 by @danielmewes, although we don't yet have a good way to test that it actually fixes the crashing bug in question.

@mlucy added the tp:review label Dec 6, 2014
@mlucy
Member

mlucy commented Dec 9, 2014

This is in next. @danielmewes -- are we doing another point release between now and 1.16 (i.e. should I backport this)?

@danielmewes
Member

Since 1.16 is at least a month away, I think we should do another point release.
@mlucy can you backport it to 1.15.x?

@AtnNn Can you start making preparations for a 1.15.3 point release? We would include this, #3346 , #3253 , and should also change libcurl to be linked statically (--static curl) on OS X by default while we're at it.

@danielmewes
Member

@mlucy is this in v1.15.x?

@AtnNn modified the milestones: 1.15.3, 1.15.x Dec 16, 2014
@mlucy
Member

mlucy commented Dec 18, 2014

@danielmewes: merging this didn't work too well because next and v1.15.x have diverged quite a bit, so I just made the change again manually. It's up in CR 2415.

@danielmewes
Member

@mlucy just merged this 9e30583

@AtnNn
Member

AtnNn commented Jan 9, 2015

@internalfx The fix has been released in RethinkDB 1.15.3.

@internalfx
Author

@danielmewes @mlucy @AtnNn You guys never cease to impress.
