changefeed.cc:1799 Guarantee failed: [num_subs == 0] #5708

Closed
haraldschilly opened this issue Apr 22, 2016 · 11 comments

@haraldschilly

Unfortunately, one of our RethinkDB servers in our 6-node cluster crashed. I didn't find a similar error report here, so please excuse a possible duplicate (GitHub search isn't very good).

Running on Linux, 2.3.0~0wily, ...

2016-04-22T18:19:52.311159197 34559.402505s info: Rejected a connection from server 39606cf4-03f0-4ec7-90c8-3c13e902cc85 since one is open already.
2016-04-22T18:20:07.311540722 34574.402887s info: Rejected a connection from server 39606cf4-03f0-4ec7-90c8-3c13e902cc85 since one is open already.
2016-04-22T18:20:07.313109523 34574.404455s info: Rejected a connection from server 932c161a-3394-4602-b492-cf4f44680068 since one is open already.
2016-04-22T18:20:21.821005159 34588.912351s notice: Connected to server "db3" 932c161a-3394-4602-b492-cf4f44680068
2016-04-22T18:21:31.500568734 34658.591915s error: Error in src/rdb_protocol/changefeed.cc at line 1799:
2016-04-22T18:21:31.500621603 34658.591967s error: Guarantee failed: [num_subs == 0] 
2016-04-22T18:21:31.500637500 34658.591983s error: Backtrace:
2016-04-22T18:21:32.153313485 34659.244661s error: Fri Apr 22 18:21:31 2016
1 [0xb2e69a]: backtrace_t::backtrace_t() at ??:?
2 [0xb2eb7a]: format_backtrace[abi:cxx11](bool) at ??:?
3 [0xdf847c]: report_fatal_error(char const*, int, char const*, ...) at ??:?
4 [0x922bd4]: ql::changefeed::real_feed_t::~real_feed_t() at ??:?
5 [0x923190]: ql::changefeed::real_feed_t::constructor_cb() at ??:?
6 [0xa38812]: coro_t::run() at ??:?
2016-04-22T18:21:32.153380847 34659.244727s error: Exiting.
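
For context, RethinkDB's internal "guarantee" checks terminate the server when an invariant is violated; the one that fired here asserts that a changefeed has no remaining subscribers (num_subs == 0) by the time the feed object is torn down. Below is a minimal, hypothetical C++ sketch of that kind of invariant check; the GUARANTEE macro and feed_t type are illustrative stand-ins, not the actual RethinkDB code.

#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for this kind of invariant check: if the condition is
// false, print the failed expression and terminate the process, roughly the
// behaviour behind the "Guarantee failed: [num_subs == 0]" line above.
#define GUARANTEE(cond)                                              \
    do {                                                             \
        if (!(cond)) {                                               \
            std::fprintf(stderr, "Guarantee failed: [%s]\n", #cond); \
            std::abort();                                            \
        }                                                            \
    } while (0)

// Illustrative feed type (not RethinkDB's real_feed_t): the invariant is that
// every subscriber has detached before the feed is destroyed.
class feed_t {
public:
    void add_sub() { ++num_subs; }
    void remove_sub() { --num_subs; }
    ~feed_t() { GUARANTEE(num_subs == 0); }  // the kind of check that fired here
private:
    int num_subs = 0;
};

int main() {
    feed_t feed;
    feed.add_sub();
    // Destroying the feed while a subscriber is still attached trips the
    // guarantee: the same class of failure as in the log above.
    return 0;
}
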
@haraldschilly
Author

Oh, and there is also a small one-liner in dmesg:

[Fri Apr 22 18:21:31 2016] traps: rethinkdb[991] trap int3 ip:922bd5 sp:7fd2eef0dc60 error:0

Too cryptic for me, but maybe it's relevant.

@danielmewes
Member

Ouch. Thanks for the report.

Pinging @mlucy.

danielmewes added this to the 2.3.x milestone Apr 22, 2016
@danielmewes
Member

@mlucy Looking at the code, I don't understand how we can get to that guarantee, since there's an identical guarantee just before we start destructing real_feed_t in constructor_cb. Unless the self feed_t is the wrong feed_t object...
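
To make the puzzle concrete, here is a minimal hypothetical sketch of the pattern being described (illustrative names only, not the actual changefeed.cc code): the cleanup path checks that the feed has no subscribers and then destroys it, and the destructor repeats the same check. For the destructor's check to fail, either a subscriber would have to attach between the two checks, or the object checked first would have to be a different feed_t than the one being destroyed.

#include <cassert>

// assert() stands in for the guarantee machinery in this sketch.
struct real_feed_t {
    int num_subs = 0;
    ~real_feed_t() {
        assert(num_subs == 0);  // corresponds to the guarantee at changefeed.cc:1799
    }
};

struct client_t {
    real_feed_t *feed = nullptr;

    // Roughly the sequence under discussion: verify the invariant, then
    // destroy the feed, whose destructor re-verifies the same invariant.
    void constructor_cb() {
        assert(feed->num_subs == 0);  // the "identical guarantee just before"
        real_feed_t *self = feed;     // if this ever refers to the wrong feed_t,
        feed = nullptr;               // or a subscriber attaches between the two
        delete self;                  // checks, only the destructor's check fires
    }
};

int main() {
    client_t client;
    client.feed = new real_feed_t();
    client.constructor_cb();  // passes both checks in this single-threaded sketch
    return 0;
}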

danielmewes pushed a commit that referenced this issue Apr 22, 2016
@danielmewes
Member

@mlucy I might have a fix in 852c3af but I'm not sure. Let's talk about this later.

danielmewes modified the milestones: 2.3.1, 2.3.x Apr 22, 2016
@danielmewes
Member

(Also, I'd still like to sneak this into 2.3.1 if possible.)

danielmewes modified the milestones: 2.3.x, 2.3.1 Apr 22, 2016
@danielmewes
Member

My theory about the bug turned out to be wrong, and we currently don't have an explanation for this.

@haraldschilly Did rethinkdb leave a core file behind by any chance?
If not, could you enable core files for the future, in case this happens again on one of your servers? A core file would make it a lot easier to debug this issue if it ever happens again.
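
(For anyone following along: the usual way to allow core files is to raise the core-size limit before the server starts, e.g. ulimit -c unlimited in the shell or init script that launches rethinkdb. The snippet below is a minimal sketch, assuming a POSIX system, of doing the same thing from inside a process via setrlimit; it is illustrative only and not part of RethinkDB.)

#include <sys/resource.h>

#include <cstdio>

int main() {
    // Raise this process's core-dump size limit so the kernel will write a
    // core file on a crash. Raising the hard limit above its current value
    // requires privilege; an unprivileged process can only raise the soft
    // limit up to the existing hard limit.
    struct rlimit lim;
    lim.rlim_cur = RLIM_INFINITY;
    lim.rlim_max = RLIM_INFINITY;
    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        std::perror("setrlimit(RLIMIT_CORE)");
        return 1;
    }
    std::puts("core dumps enabled for this process");
    return 0;
}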

@haraldschilly
Author

Sorry, we don't have one, but I've enabled it now...

@haraldschilly
Author

OK, apport might have collected something despite core dumps not being enabled via those process limits. I'll send you the file via email.

@williamstein

We hit this bug almost every time a server is restarted. Now that the apport stuff is enabled, it's even worse: the process spins at 100% CPU forever instead of rethinkdb getting restarted. This is so bad that, as far as a "single point of failure" goes, using rethinkdb as a cluster is worse than just using a single big rethinkdb node... How can we use automatic failover if the failure of one node (even on purpose, via "service rethinkdb stop") leads to other nodes segfaulting and exiting? Sorry, not happy right now.

@danielmewes
Member

@williamstein @haraldschilly Very sorry. :-( Did you get a core file from apport?

danielmewes modified the milestones: 2.3.x, 2.3.2 May 3, 2016
danielmewes pushed a commit that referenced this issue May 4, 2016
OTS reviewed by @mlucy
danielmewes pushed a commit that referenced this issue May 4, 2016
OTS reviewed by @mlucy
@danielmewes
Member

Running final tests right now, but this should be closed by 20b6ae9 in branch v2.3.x and by aa69dbe in next.
This will ship with RethinkDB 2.3.2 in the next few days.
