
rethinkdb crash 1.12.2 #2264

Closed
wojons opened this issue Apr 16, 2014 · 14 comments

@wojons
Contributor

wojons commented Apr 16, 2014

Here is the trace below. Just to let you know, there were a lot of connect/disconnect and heartbeat timeout messages above it.

error: Error in ./src/concurrency/fifo_enforcer.hpp at line 219:
error: illegal to destroy fifo_enforcer_sink_t while outstanding exit_write_t objects exist
error: Backtrace:
addr2line: 'rethinkdb': No such file
error: Wed Apr 16 13:52:45 2014

       1: backtrace_t::backtrace_t() at 0xd36be2 (rethinkdb)
       2: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at 0xd36edd (rethinkdb)
       3: format_backtrace(bool) at 0xd38141 (rethinkdb)
       4: report_fatal_error(char const*, int, char const*, ...) at 0xe03405 (rethinkdb)
       5: fifo_enforcer_sink_t::exit_write_t::on_early_shutdown() at 0xdc681a (rethinkdb)
       6: fifo_enforcer_sink_t::~fifo_enforcer_sink_t() at 0xdc5316 (rethinkdb)
       7: directory_read_manager_t<cluster_directory_metadata_t>::propagate_initialization(peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t> const&, fifo_enforcer_state_t, auto_drainer_t::lock_t) at 0xd5e999 (rethinkdb)
       8: callable_action_instance_t<boost::_bi::bind_t<void, boost::_mfi::mf5<void, directory_read_manager_t<cluster_directory_metadata_t>, peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t> const&, fifo_enforcer_state_t, auto_drainer_t::lock_t>, boost::_bi::list6<boost::_bi::value<directory_read_manager_t<cluster_directory_metadata_t>*>, boost::_bi::value<peer_id_t>, boost::_bi::value<uuid_u>, boost::_bi::value<boost::shared_ptr<cluster_directory_metadata_t> >, boost::_bi::value<fifo_enforcer_state_t>, boost::_bi::value<auto_drainer_t::lock_t> > > >::run_action() at 0xd5122d (rethinkdb)
       9: coro_t::run() at 0xdf455c (rethinkdb)
error: Exiting.
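
For context on the assertion itself, here is a minimal sketch (plain C++ with invented names, not the actual RethinkDB classes) of the invariant it enforces: the sink hands out write tokens, and destroying the sink while any token is still alive is treated as a fatal error, which is what frames 4 through 6 of the trace show.

// Minimal sketch of the invariant behind the assertion above.
// The class and member names are illustrative; the real fifo_enforcer_sink_t differs.
#include <cstdio>
#include <cstdlib>

class sink_t {
public:
    ~sink_t() {
        // The real sink calls exit_write_t::on_early_shutdown(), which calls
        // report_fatal_error() -- frames 4 and 5 of the backtrace above.
        if (outstanding_writers_ != 0) {
            std::fprintf(stderr,
                "illegal to destroy sink while outstanding writers exist\n");
            std::abort();
        }
    }

    // A write token; its lifetime marks a write that has entered the FIFO
    // but has not yet left it.
    class exit_write_t {
    public:
        explicit exit_write_t(sink_t *parent) : parent_(parent) {
            ++parent_->outstanding_writers_;
        }
        ~exit_write_t() { --parent_->outstanding_writers_; }
    private:
        sink_t *parent_;
    };

private:
    int outstanding_writers_ = 0;
};

int main() {
    sink_t *sink = new sink_t;
    sink_t::exit_write_t writer(sink);  // a write is still in flight...
    delete sink;                        // ...so this aborts, like the crash above
}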
@coffeemug coffeemug added this to the 1.12.x milestone Apr 16, 2014
@coffeemug
Contributor

Pinging @Tryneus -- could you take a look?

@Tryneus
Member

Tryneus commented Apr 16, 2014

Well, I imagine the code involved is this:

srh   2013-01-24 17:02:06 |    // Create a metadata FIFO sink and pulse the `got_initial_message` cond so
srh   2013-01-24 17:02:06 |    // that instances of `propagate_update()` can proceed
srh   2013-01-24 17:02:06 |    // TODO: Do we want this .reset() here?  It is not easily provable
srh   2013-01-24 17:02:06 |    // that it'll be initialized only once.
srh   2013-01-24 17:02:06 |    session->metadata_fifo_sink.reset();
srh   2013-01-24 17:02:06 |    session->metadata_fifo_sink.init(new fifo_enforcer_sink_t(metadata_fifo_state));

So it would appear that not only is this very old, it was even anticipated by our prophet and savior, @srh. I will investigate to see under what conditions the session's metadata_fifo_sink might be stale.
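
As a hedged illustration of the failure mode being investigated (the session_t / propagate_* names and the single-threaded interleaving below are simplifications, not the real code paths): if propagate_initialization() runs a second time for the same session while a propagate_update() writer is still holding a token on the old sink, the reset() destroys a sink that has an outstanding exit_write_t, which is exactly the assertion in the backtrace.

// Sketch of the suspected interleaving. All names are illustrative simplifications.
#include <cstdio>
#include <cstdlib>
#include <memory>

struct fifo_sink_t {
    int outstanding_writers = 0;
    ~fifo_sink_t() {
        if (outstanding_writers != 0) {
            std::fprintf(stderr, "illegal to destroy fifo_enforcer_sink_t while "
                                 "outstanding exit_write_t objects exist\n");
            std::abort();  // stands in for on_early_shutdown()/report_fatal_error()
        }
    }
};

struct session_t {
    std::unique_ptr<fifo_sink_t> metadata_fifo_sink;  // stands in for the pointer in the snippet above
};

// A propagate_update()-style writer enters the FIFO on the *current* sink and
// may then block (waiting its turn) before the token is released.
struct pending_update_t {
    fifo_sink_t *sink;
    explicit pending_update_t(session_t *s) : sink(s->metadata_fifo_sink.get()) {
        ++sink->outstanding_writers;   // exit_write_t constructed
    }
    ~pending_update_t() { --sink->outstanding_writers; }  // exit_write_t destroyed
};

// propagate_initialization() re-creates the sink, as in the snippet quoted above.
void propagate_initialization(session_t *s) {
    s->metadata_fifo_sink.reset(new fifo_sink_t);  // destroys the *old* sink
}

int main() {
    session_t session;
    propagate_initialization(&session);  // first initialization: fine
    pending_update_t blocked(&session);  // an update is still waiting on the old sink
    propagate_initialization(&session);  // second initialization: aborts, as in the crash
}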

@Tryneus Tryneus self-assigned this Apr 16, 2014
@danielmewes
Member

This looks like a duplicate of #2092. We thought we had shipped a fix in 1.12.2, but maybe that wasn't actually a full fix.

@Tryneus
Member

Tryneus commented Apr 16, 2014

@srh, do you want to take this?

@coffeemug coffeemug assigned srh and unassigned Tryneus Apr 16, 2014
@coffeemug
Contributor

Reassigning to @srh since he's already familiar with the problem.

@srh
Contributor

srh commented Apr 17, 2014

@wojons: Were the servers it was connecting to also running 1.12.2?

@srh
Contributor

srh commented Apr 23, 2014

I guess that's impossible (that the servers were running 1.12.1). I will try to reproduce this (tomorrow).

@wojons
Contributor Author

wojons commented Apr 23, 2014

Since you can't mix versions, and I know all machines were online after the restart, they were all the same version.

@coffeemug
Contributor

I opened a separate issue for the callstack overflow bug -- please track #2357 for progress.

I'm also deleting comments related to that issue from this one, since they're unrelated. Thanks @angry-elf for reporting.

@coffeemug coffeemug modified the milestones: 1.13.x, 1.12.x Jun 12, 2014
@srh srh mentioned this issue Jul 7, 2014
@danielmewes
Member

Still exists in 1.13.1, as reported by @wojons:

Version: rethinkdb 1.13.1-0ubuntu1~lucid (GCC 4.8.1)
error: Error in ./src/concurrency/fifo_enforcer.hpp at line 219:
error: illegal to destroy fifo_enforcer_sink_t while outstanding exit_write_t objects exist
error: Backtrace:
addr2line: 'rethinkdb': No such file
error: Fri Jul  4 03:29:41 2014

       1: backtrace_t::backtrace_t() at 0xcbc880 (rethinkdb)
       2: format_backtrace(bool) at 0xcbcc13 (rethinkdb)
       3: report_fatal_error(char const*, int, char const*, ...) at 0x962ef5 (rethinkdb)
       4: fifo_enforcer_sink_t::exit_write_t::on_early_shutdown() at 0xcb749a (rethinkdb)
       5: fifo_enforcer_sink_t::~fifo_enforcer_sink_t() at 0xcb613d (rethinkdb)
       6: directory_read_manager_t<cluster_directory_metadata_t>::propagate_initialization(peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t> const&, fifo_enforcer_state_t, auto_drainer_t::lock_t) at 0xc939f3 (rethinkdb)
       7: callable_action_instance_t<std::_Bind<std::_Mem_fn<void (directory_read_manager_t<cluster_directory_metadata_t>::*)(peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t> const&, fifo_enforcer_state_t, auto_drainer_t::lock_t)> (directory_read_manager_t<cluster_directory_metadata_t>*, peer_id_t, uuid_u, boost::shared_ptr<cluster_directory_metadata_t>, fifo_enforcer_state_t, auto_drainer_t::lock_t)> >::run_action() at 0xc91944 (rethinkdb)
       8: coro_t::run() at 0xcdc5c8 (rethinkdb)
error: Exiting.

@srh
Contributor

srh commented Jul 7, 2014

Maybe this should be reassigned to @timmaxw now that he's here and since I haven't looked at this.

@wojons
Contributor Author

wojons commented Jul 7, 2014

Not sure if this is relevant, but when this crash happens it sometimes, if not mostly, happens on more than one node at once, or within a short period of time. Some nodes that don't crash just lock up: the process is running and serves some requests, but sometimes has a hard time with backfilling, or acts like it has stuck coroutines or a stuck event loop.

@timmaxw timmaxw assigned timmaxw and unassigned srh Jul 10, 2014
timmaxw added a commit that referenced this issue Jul 11, 2014
This should eliminate an entire class of bugs, including #2264. Many RethinkDB components have a procedure that is supposed to happen for every connection. For example, the `directory_read_manager_t` expects to receive one initialization message on each connection, followed by some number of update messages. However, the old low-level cluster API didn't explicitly expose the concept of a connection, so components would sometimes get confused when one connection was dropped and a new connection appeared for the same peer in a short period of time. This was the cause of #2264. These changes directly expose the concept of a connection, which should make those bugs impossible.

This commit also removes several layers of abstraction: `connectivity_service_t`, `message_service_t`, and `message_multiplexer_t`.
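
A hedged sketch of the idea described in the commit message (the connection_id_t and class names below are invented for illustration; the actual API introduced by the commit differs): when per-peer state is keyed by an explicit connection identity rather than by the peer alone, a message that arrives for an already-dropped connection finds no entry and is ignored, so it can never tear down or re-initialize state belonging to a newer connection.

// Sketch of keying per-connection state by an explicit connection identity.
// All names here are invented for illustration; see the commit for the real API.
#include <cstdint>
#include <cstdio>
#include <map>
#include <memory>

using connection_id_t = uint64_t;

struct per_connection_state_t {
    // In the real code this would hold things like the metadata FIFO sink.
    int updates_applied = 0;
};

class directory_read_manager_sketch_t {
public:
    // Called exactly once when a connection is established.
    void on_connect(connection_id_t conn) {
        states_[conn] = std::make_unique<per_connection_state_t>();
    }

    // Called exactly once when that same connection drops; its state goes with it.
    void on_disconnect(connection_id_t conn) {
        states_.erase(conn);
    }

    // An update message is tagged with the connection it arrived on. If that
    // connection is already gone (a stale delivery), there is nothing to touch.
    void on_update(connection_id_t conn) {
        auto it = states_.find(conn);
        if (it == states_.end()) {
            std::printf("ignoring update from stale connection %llu\n",
                        static_cast<unsigned long long>(conn));
            return;
        }
        ++it->second->updates_applied;
    }

private:
    std::map<connection_id_t, std::unique_ptr<per_connection_state_t>> states_;
};

int main() {
    directory_read_manager_sketch_t mgr;
    mgr.on_connect(1);     // old connection to a peer
    mgr.on_disconnect(1);  // it drops...
    mgr.on_connect(2);     // ...and a new connection to the same peer appears
    mgr.on_update(1);      // a late message from the old connection is ignored
    mgr.on_update(2);      // the new connection's state is unaffected
}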
@timmaxw
Member

timmaxw commented Jul 22, 2014

This has been fixed and merged into next in 75e3424. The fix will be in 1.14.

@larkost
Collaborator

larkost commented Aug 18, 2014

@wojons: This fix is in 1.13.4, which we released today.
