
rethinkdb crash 1.12.2 #2264

Closed
wojons opened this issue Apr 16, 2014 · 14 comments

@wojons
Contributor

wojons commented Apr 16, 2014

Here is the trace below. Just to let you know, there were a lot of connect/disconnect and heartbeat timeout messages above it.

error: Error in ./src/concurrency/fifo_enforcer.hpp at line 219:
error: illegal to destroy fifo_enforcer_sink_t while outstanding exit_write_t objects exist
error: Backtrace:
addr2line: 'rethinkdb': No such file
error: Wed Apr 16 13:52:45 2014

       1: backtrace_t::backtrace_t() at 0xd36be2 (rethinkdb)
       2: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at 0xd36edd (rethinkdb)
       3: format_backtrace(bool) at 0xd38141 (rethinkdb)
       4: report_fatal_error(char const*, int, char const*, ...) at 0xe03405 (rethinkdb)
       5: fifo_enforcer_sink_t::exit_write_t::on_early_shutdown() at 0xdc681a (rethinkdb)
       6: fifo_enforcer_sink_t::~fifo_enforcer_sink_t() at 0xdc5316 (rethinkdb)
       7: directory_read_manager_t<cluster_directory_metadata_t>::propagate_initialization(peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t> const&, fifo_enforcer_state_t, auto_drainer_t::lock_t) at 0xd5e999 (rethinkdb)
       8: callable_action_instance_t<boost::_bi::bind_t<void, boost::_mfi::mf5<void, directory_read_manager_t<cluster_directory_metadata_t>, peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t> const&, fifo_enforcer_state_t, auto_drainer_t::lock_t>, boost::_bi::list6<boost::_bi::value<directory_read_manager_t<cluster_directory_metadata_t>*>, boost::_bi::value<peer_id_t>, boost::_bi::value<uuid_u>, boost::_bi::value<boost::shared_ptr<cluster_directory_metadata_t> >, boost::_bi::value<fifo_enforcer_state_t>, boost::_bi::value<auto_drainer_t::lock_t> > > >::run_action() at 0xd5122d (rethinkdb)
       9: coro_t::run() at 0xdf455c (rethinkdb)
error: Exiting.
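
For context on the assertion itself, here is a minimal sketch (plain C++ with invented names, not the actual RethinkDB classes) of the invariant it enforces: the sink hands out write tokens, and destroying the sink while any token is still alive is treated as a fatal error, which is what frames 4 through 6 of the trace show.

// Minimal sketch of the invariant behind the assertion above.
// The class and member names are illustrative; the real fifo_enforcer_sink_t differs.
#include <cstdio>
#include <cstdlib>

class sink_t {
public:
    ~sink_t() {
        // The real sink calls exit_write_t::on_early_shutdown(), which calls
        // report_fatal_error() -- frames 4 and 5 of the backtrace above.
        if (outstanding_writers_ != 0) {
            std::fprintf(stderr,
                "illegal to destroy sink while outstanding writers exist\n");
            std::abort();
        }
    }

    // A write token; its lifetime marks a write that has entered the FIFO
    // but has not yet left it.
    class exit_write_t {
    public:
        explicit exit_write_t(sink_t *parent) : parent_(parent) {
            ++parent_->outstanding_writers_;
        }
        ~exit_write_t() { --parent_->outstanding_writers_; }
    private:
        sink_t *parent_;
    };

private:
    int outstanding_writers_ = 0;
};

int main() {
    sink_t *sink = new sink_t;
    sink_t::exit_write_t writer(sink);  // a write is still in flight...
    delete sink;                        // ...so this aborts, like the crash above
}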
@coffeemug coffeemug added this to the 1.12.x milestone Apr 16, 2014
@coffeemug
Contributor

Pinging @Tryneus -- could you take a look?

@Tryneus
Member

Tryneus commented Apr 16, 2014

Well, I imagine the code involved is this:

srh   2013-01-24 17:02:06 |    // Create a metadata FIFO sink and pulse the `got_initial_message` cond so
srh   2013-01-24 17:02:06 |    // that instances of `propagate_update()` can proceed
srh   2013-01-24 17:02:06 |    // TODO: Do we want this .reset() here?  It is not easily provable
srh   2013-01-24 17:02:06 |    // that it'll be initialized only once.
srh   2013-01-24 17:02:06 |    session->metadata_fifo_sink.reset();
srh   2013-01-24 17:02:06 |    session->metadata_fifo_sink.init(new fifo_enforcer_sink_t(metadata_fifo_state));

So it would appear that not only is this very old, it was even anticipated by our prophet and savior, @srh. I will investigate to see under what conditions the session's metadata_fifo_sink might be stale.
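
As a hedged illustration of the failure mode being investigated (the session_t / propagate_* names and the single-threaded interleaving below are simplifications, not the real code paths): if propagate_initialization() runs a second time for the same session while a propagate_update() writer is still holding a token on the old sink, the reset() destroys a sink that has an outstanding exit_write_t, which is exactly the assertion in the backtrace.

// Sketch of the suspected interleaving. All names are illustrative simplifications.
#include <cstdio>
#include <cstdlib>
#include <memory>

struct fifo_sink_t {
    int outstanding_writers = 0;
    ~fifo_sink_t() {
        if (outstanding_writers != 0) {
            std::fprintf(stderr, "illegal to destroy fifo_enforcer_sink_t while "
                                 "outstanding exit_write_t objects exist\n");
            std::abort();  // stands in for on_early_shutdown()/report_fatal_error()
        }
    }
};

struct session_t {
    std::unique_ptr<fifo_sink_t> metadata_fifo_sink;  // stands in for the pointer in the snippet above
};

// A propagate_update()-style writer enters the FIFO on the *current* sink and
// may then block (waiting its turn) before the token is released.
struct pending_update_t {
    fifo_sink_t *sink;
    explicit pending_update_t(session_t *s) : sink(s->metadata_fifo_sink.get()) {
        ++sink->outstanding_writers;   // exit_write_t constructed
    }
    ~pending_update_t() { --sink->outstanding_writers; }  // exit_write_t destroyed
};

// propagate_initialization() re-creates the sink, as in the snippet quoted above.
void propagate_initialization(session_t *s) {
    s->metadata_fifo_sink.reset(new fifo_sink_t);  // destroys the *old* sink
}

int main() {
    session_t session;
    propagate_initialization(&session);  // first initialization: fine
    pending_update_t blocked(&session);  // an update is still waiting on the old sink
    propagate_initialization(&session);  // second initialization: aborts, as in the crash
}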

@Tryneus Tryneus self-assigned this Apr 16, 2014
@danielmewes
Member

This looks like a duplicate of #2092. We thought we had shipped a fix in 1.12.2, but maybe that wasn't actually a full fix.

@Tryneus
Member

Tryneus commented Apr 16, 2014

@srh, do you want to take this?

@coffeemug coffeemug assigned srh and unassigned Tryneus Apr 16, 2014
@coffeemug
Contributor

Reassigning to @srh since he's already familiar with the problem.

@srh
Contributor

srh commented Apr 17, 2014

@wojons: Were the servers it was connecting to also running 1.12.2?

@srh
Contributor

srh commented Apr 23, 2014

I guess that's impossible (that the servers were running 1.12.1). I will try to reproduce this (tomorrow).

@wojons
Contributor Author

wojons commented Apr 23, 2014

Since you can't mix versions, and I know all machines were online after the restart, they were all the same version.

@coffeemug
Contributor

I opened a separate issue for the callstack overflow bug -- please track #2357 for progress.

I'm also deleting comments related to that issue from this one, since they're unrelated. Thanks @angry-elf for reporting.

@coffeemug coffeemug modified the milestones: 1.13.x, 1.12.x Jun 12, 2014
@srh srh mentioned this issue Jul 7, 2014
@danielmewes
Member

Still exists in 1.13.1, as reported by @wojons:

Version: rethinkdb 1.13.1-0ubuntu1~lucid (GCC 4.8.1)
error: Error in ./src/concurrency/fifo_enforcer.hpp at line 219:
error: illegal to destroy fifo_enforcer_sink_t while outstanding exit_write_t objects exist
error: Backtrace:
addr2line: 'rethinkdb': No such file
error: Fri Jul  4 03:29:41 2014

       1: backtrace_t::backtrace_t() at 0xcbc880 (rethinkdb)
       2: format_backtrace(bool) at 0xcbcc13 (rethinkdb)
       3: report_fatal_error(char const*, int, char const*, ...) at 0x962ef5 (rethinkdb)
       4: fifo_enforcer_sink_t::exit_write_t::on_early_shutdown() at 0xcb749a (rethinkdb)
       5: fifo_enforcer_sink_t::~fifo_enforcer_sink_t() at 0xcb613d (rethinkdb)
       6: directory_read_manager_t<cluster_directory_metadata_t>::propagate_initialization(peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t> const&, fifo_enforcer_state_t, auto_drainer_t::lock_t) at 0xc939f3 (rethinkdb)
       7: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (directory_read_manager_t<cluster_directory_metadata_t>::*)(peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t> const&, fifo_enforcer_state_t, auto_drainer_t::lock_t)> (directory_read_manager_t<cluster_directory_metadata_t>*, peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t>, fifo_enforcer_state_t, auto_drainer_t::lock_t)> >::run_action() at 0xc91944 (rethinkdb)
       8: coro_t::run() at 0xcdc5c8 (rethinkdb)
error: Exiting.

@srh
Contributor

srh commented Jul 7, 2014

Maybe this should be reassigned to @timmaxw now that he's here and since I haven't looked at this.

@wojons
Contributor Author

wojons commented Jul 7, 2014

Not sure if this is relevant, but when this crash happens it sometimes, if not mostly, happens on more than one node at once, or within a short period of time. Some nodes that don't crash just lock up: the process is running and serves some requests, but sometimes has a hard time with backfilling, or acts like it has stuck coroutines or a stuck event loop.

@timmaxw timmaxw assigned timmaxw and unassigned srh Jul 10, 2014
timmaxw added a commit that referenced this issue Jul 11, 2014
This should eliminate an entire class of bugs, including #2264. Many RethinkDB components have a procedure that is supposed to happen for every connection. For example, the `directory_read_manager_t` expects to receive one initialization message on each connection, followed by some number of update messages. However, the old low-level cluster API didn't explicitly expose the concept of a connection, so components would sometimes get confused when one connection was dropped and a new connection appeared for the same peer in a short period of time. This was the cause of #2264. These changes directly expose the concept of a connection, which should make those bugs impossible.

This commit also removes several layers of abstraction: `connectivity_service_t`, `message_service_t`, and `message_multiplexer_t`.
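
A hedged sketch of the idea described in the commit message (the connection_id_t and class names below are invented for illustration; the actual API introduced by the commit differs): when per-peer state is keyed by an explicit connection identity rather than by the peer alone, a message that arrives for an already-dropped connection finds no entry and is ignored, so it can never tear down or re-initialize state belonging to a newer connection.

// Sketch of keying per-connection state by an explicit connection identity.
// All names here are invented for illustration; see the commit for the real API.
#include <cstdint>
#include <cstdio>
#include <map>
#include <memory>

using connection_id_t = uint64_t;

struct per_connection_state_t {
    // In the real code this would hold things like the metadata FIFO sink.
    int updates_applied = 0;
};

class directory_read_manager_sketch_t {
public:
    // Called exactly once when a connection is established.
    void on_connect(connection_id_t conn) {
        states_[conn] = std::make_unique<per_connection_state_t>();
    }

    // Called exactly once when that same connection drops; its state goes with it.
    void on_disconnect(connection_id_t conn) {
        states_.erase(conn);
    }

    // An update message is tagged with the connection it arrived on. If that
    // connection is already gone (a stale delivery), there is nothing to touch.
    void on_update(connection_id_t conn) {
        auto it = states_.find(conn);
        if (it == states_.end()) {
            std::printf("ignoring update from stale connection %llu\n",
                        static_cast<unsigned long long>(conn));
            return;
        }
        ++it->second->updates_applied;
    }

private:
    std::map<connection_id_t, std::unique_ptr<per_connection_state_t>> states_;
};

int main() {
    directory_read_manager_sketch_t mgr;
    mgr.on_connect(1);     // old connection to a peer
    mgr.on_disconnect(1);  // it drops...
    mgr.on_connect(2);     // ...and a new connection to the same peer appears
    mgr.on_update(1);      // a late message from the old connection is ignored
    mgr.on_update(2);      // the new connection's state is unaffected
}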
@timmaxw
Member

timmaxw commented Jul 22, 2014

This has been fixed and merged into next in 75e3424. The fix will be in 1.14.

@larkost
Collaborator

larkost commented Aug 18, 2014

@wojons: This fix is in 1.13.4, which we released today.
