rethinkdb crash 1.12.2 #2264
Comments
Pinging @Tryneus -- could you take a look?
Well, I imagine the code involved is this (blame: srh, 2013-01-24):

    // Create a metadata FIFO sink and pulse the `got_initial_message` cond so
    // that instances of `propagate_update()` can proceed
    // TODO: Do we want this .reset() here? It is not easily provable
    // that it'll be initialized only once.
    session->metadata_fifo_sink.reset();
    session->metadata_fifo_sink.init(new fifo_enforcer_sink_t(metadata_fifo_state));

So it would appear that not only is this very old, it was even anticipated by our prophet and savior, @srh. I will investigate to see what conditions the session
This looks like a duplicate of #2092. We thought we had shipped a fix in 1.12.2, but maybe that wasn't actually a full fix.
@srh, do you want to take this?
Reassigning to @srh since he's already familiar with the problem.
@wojons: Were the servers it was connecting to also running 1.12.2?
I guess that's impossible (that the servers were running 1.12.1). I will try to reproduce this (tomorrow).
Since you can't mix versions, and I know that after the restart all machines were online, they were all the same version.
I opened a separate issue for the callstack overflow bug -- please track #2357 for progress. I'm also deleting comments related to that issue from this one, since they're unrelated. Thanks @angry-elf for reporting.
Still exists in 1.13.1, as reported by @wojons:
Maybe this should be reassigned to @timmaxw now that he's here and since I haven't looked at this. |
Not sure if this is relevant, but when this crash happens, it sometimes (if not mostly) happens on more than one node at once, or within a short period of time. And some nodes that don't get it are just locked up: the process is running and serves some requests, but sometimes has a hard time with backfilling, or something like stuck coroutines or a stuck event loop.
This should eliminate an entire class of bugs, including #2264. Many RethinkDB components have a procedure that is supposed to happen for every connection. For example, the `directory_read_manager_t` expects to receive one initialization message on each connection, followed by some number of update messages. However, the old low-level cluster API didn't explicitly expose the concept of a connection, so components would sometimes get confused when one connection was dropped and a new connection appeared for the same peer in a short period of time. This was the cause of #2264. These changes directly expose the concept of a connection, which should make those bugs impossible. This commit also removes several layers of abstraction: `connectivity_service_t`, `message_service_t`, and `message_multiplexer_t`.
This has been fixed and merged into
Here is the trace below. Just to let you know, there were a lot of connects/disconnects above it ("heartbeat timeout", and so on).