changefeed.cc:1799 Guarantee failed: [num_subs == 0] #5708

Closed
haraldschilly opened this issue Apr 22, 2016 · 11 comments

@haraldschilly

Unfortunately, one of our RethinkDB servers in our 6-node cluster crashed. I didn't find a similar error report here, so please excuse a possible duplicate (GitHub search isn't very good).

Running on Linux, 2.3.0~0wily, ...

2016-04-22T18:19:52.311159197 34559.402505s info: Rejected a connection from server 39606cf4-03f0-4ec7-90c8-3c13e902cc85 since one is open already.
2016-04-22T18:20:07.311540722 34574.402887s info: Rejected a connection from server 39606cf4-03f0-4ec7-90c8-3c13e902cc85 since one is open already.
2016-04-22T18:20:07.313109523 34574.404455s info: Rejected a connection from server 932c161a-3394-4602-b492-cf4f44680068 since one is open already.
2016-04-22T18:20:21.821005159 34588.912351s notice: Connected to server "db3" 932c161a-3394-4602-b492-cf4f44680068
2016-04-22T18:21:31.500568734 34658.591915s error: Error in src/rdb_protocol/changefeed.cc at line 1799:
2016-04-22T18:21:31.500621603 34658.591967s error: Guarantee failed: [num_subs == 0] 
2016-04-22T18:21:31.500637500 34658.591983s error: Backtrace:
2016-04-22T18:21:32.153313485 34659.244661s error: Fri Apr 22 18:21:31 2016
1 [0xb2e69a]: backtrace_t::backtrace_t() at ??:?
2 [0xb2eb7a]: format_backtrace[abi:cxx11](bool) at ??:?
3 [0xdf847c]: report_fatal_error(char const*, int, char const*, ...) at ??:?
4 [0x922bd4]: ql::changefeed::real_feed_t::~real_feed_t() at ??:?
5 [0x923190]: ql::changefeed::real_feed_t::constructor_cb() at ??:?
6 [0xa38812]: coro_t::run() at ??:?
2016-04-22T18:21:32.153380847 34659.244727s error: Exiting.
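
For context, RethinkDB's internal "guarantee" checks terminate the server when an invariant is violated; the one that fired here asserts that a changefeed has no remaining subscribers (num_subs == 0) by the time the feed object is torn down. Below is a minimal, hypothetical C++ sketch of that kind of invariant check; the GUARANTEE macro and feed_t type are illustrative stand-ins, not the actual RethinkDB code.

#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for this kind of invariant check: if the condition is
// false, print the failed expression and terminate the process, roughly the
// behaviour behind the "Guarantee failed: [num_subs == 0]" line above.
#define GUARANTEE(cond)                                              \
    do {                                                             \
        if (!(cond)) {                                               \
            std::fprintf(stderr, "Guarantee failed: [%s]\n", #cond); \
            std::abort();                                            \
        }                                                            \
    } while (0)

// Illustrative feed type (not RethinkDB's real_feed_t): the invariant is that
// every subscriber has detached before the feed is destroyed.
class feed_t {
public:
    void add_sub() { ++num_subs; }
    void remove_sub() { --num_subs; }
    ~feed_t() { GUARANTEE(num_subs == 0); }  // the kind of check that fired here
private:
    int num_subs = 0;
};

int main() {
    feed_t feed;
    feed.add_sub();
    // Destroying the feed while a subscriber is still attached trips the
    // guarantee: the same class of failure as in the log above.
    return 0;
}
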
@haraldschilly
Author

Oh, and there is also a small one-liner in dmesg:

[Fri Apr 22 18:21:31 2016] traps: rethinkdb[991] trap int3 ip:922bd5 sp:7fd2eef0dc60 error:0

Too cryptic for me, but maybe it's relevant.

@danielmewes
Member

Ouch. Thanks for the report.

Pinging @mlucy.

danielmewes added this to the 2.3.x milestone Apr 22, 2016
@danielmewes
Member

@mlucy Looking at the code, I don't understand how we can get to that guarantee, since there's an identical guarantee just before we start destructing real_feed_t in constructor_cb. Unless the self feed_t is the wrong feed_t object...
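
To make the puzzle concrete, here is a minimal hypothetical sketch of the pattern being described (illustrative names only, not the actual changefeed.cc code): the cleanup path checks that the feed has no subscribers and then destroys it, and the destructor repeats the same check. For the destructor's check to fail, either a subscriber would have to attach between the two checks, or the object checked first would have to be a different feed_t than the one being destroyed.

#include <cassert>

// assert() stands in for the guarantee machinery in this sketch.
struct real_feed_t {
    int num_subs = 0;
    ~real_feed_t() {
        assert(num_subs == 0);  // corresponds to the guarantee at changefeed.cc:1799
    }
};

struct client_t {
    real_feed_t *feed = nullptr;

    // Roughly the sequence under discussion: verify the invariant, then
    // destroy the feed, whose destructor re-verifies the same invariant.
    void constructor_cb() {
        assert(feed->num_subs == 0);  // the "identical guarantee just before"
        real_feed_t *self = feed;     // if this ever refers to the wrong feed_t,
        feed = nullptr;               // or a subscriber attaches between the two
        delete self;                  // checks, only the destructor's check fires
    }
};

int main() {
    client_t client;
    client.feed = new real_feed_t();
    client.constructor_cb();  // passes both checks in this single-threaded sketch
    return 0;
}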

danielmewes pushed a commit that referenced this issue Apr 22, 2016
@danielmewes
Member

@mlucy I might have a fix in 852c3af but I'm not sure. Let's talk about this later.

danielmewes modified the milestones: 2.3.1, 2.3.x Apr 22, 2016
@danielmewes
Member

(Also, I'd still like to sneak this into 2.3.1 if possible.)

danielmewes modified the milestones: 2.3.x, 2.3.1 Apr 22, 2016
@danielmewes
Member

My theory about the bug turned out to be wrong, and we currently don't have an explanation for this.

@haraldschilly Did rethinkdb leave a core file behind by any chance?
If not, could you enable core files for the future, in case this happens again on one of your servers? A core file would make it a lot easier to debug this issue if it ever happens again.
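
(For anyone following along: the usual way to allow core files is to raise the core-size limit before the server starts, e.g. ulimit -c unlimited in the shell or init script that launches rethinkdb. The snippet below is a minimal sketch, assuming a POSIX system, of doing the same thing from inside a process via setrlimit; it is illustrative only and not part of RethinkDB.)

#include <sys/resource.h>

#include <cstdio>

int main() {
    // Raise this process's core-dump size limit so the kernel will write a
    // core file on a crash. Raising the hard limit above its current value
    // requires privilege; an unprivileged process can only raise the soft
    // limit up to the existing hard limit.
    struct rlimit lim;
    lim.rlim_cur = RLIM_INFINITY;
    lim.rlim_max = RLIM_INFINITY;
    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        std::perror("setrlimit(RLIMIT_CORE)");
        return 1;
    }
    std::puts("core dumps enabled for this process");
    return 0;
}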

@haraldschilly
Author

Sorry, we don't have one, but I've enabled it now...

@haraldschilly
Author

OK, apport might have collected something despite core dumps not being enabled via those process limits. I'll send you the file via email.

@williamstein

We hit this bug almost every time a server is restarted. Now that the apport stuff is enabled, it's even worse: the process spins at 100% CPU forever instead of rethinkdb getting restarted. This is so bad that, as far as a "single point of failure" goes, using rethinkdb as a cluster is worse than just using a single big rethinkdb node... How can we use automatic failover if the failure of one node (even on purpose, via "service rethinkdb stop") leads to other nodes segfaulting and exiting? Sorry, not happy right now.

@danielmewes
Member

@williamstein @haraldschilly Very sorry. :-( Did you get a core file from apport?

danielmewes modified the milestones: 2.3.x, 2.3.2 May 3, 2016
danielmewes pushed a commit that referenced this issue May 4, 2016
OTS reviewed by @mlucy
danielmewes pushed a commit that referenced this issue May 4, 2016
OTS reviewed by @mlucy
@danielmewes
Member

Running final tests right now, but this should be closed by 20b6ae9 in branch v2.3.x and by aa69dbe in next.
This will ship with RethinkDB 2.3.2 in the next few days.
