Server crashing on exit if joined server crashes with an open changefeed #3038

Closed
larkost opened this issue Sep 11, 2014 · 8 comments

@larkost
Collaborator

larkost commented Sep 11, 2014

On next I am seeing a crash on exit with Guarantee failed: [mailboxes.empty()] when a joined server has crashed while a changefeed was open. This is fully repeatable:

  1. Start two servers, with the second joining the first.
  2. Create a table (e.g. test.data) and add arbitrary data to it: r.table('data').insert(r.expr([{}]).mul(100))
  3. Shard the data across the two servers.
  4. Open a changefeed from Python: [x for x in rethinkdb.table('data').changes().run(conn)]
  5. kill -KILL the second server.
  6. Gracefully stop the first server (Ctrl-C) and watch it crash.
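
The steps above can be sketched as a Python 3 script. This is only an approximation of the manual procedure, not the regression test mentioned in this issue. The data directory names, the sleep durations, and the helper names `server_command` and `reproduce` are assumptions made for illustration, and the ports assume the default cluster and driver ports on localhost. Sharding (step 3) is left as a manual pause, and the changefeed is held open as a cursor instead of being drained in a list comprehension.

```python
import signal
import subprocess
import time

# Assumed defaults for a local server started with port offset 0.
CLUSTER_PORT = 29015
DRIVER_PORT = 28015


def server_command(directory, port_offset=0, join=None):
    """Build the command line for one rethinkdb server process."""
    cmd = ["rethinkdb", "--directory", directory,
           "--port-offset", str(port_offset)]
    if join is not None:
        cmd += ["--join", join]
    return cmd


def reproduce():
    """Walk through steps 1-6 and return the first server's exit status."""
    # Imported lazily so the helper above works without the driver installed.
    import rethinkdb as r

    # Step 1: start two servers, the second joining the first.
    first = subprocess.Popen(server_command("repro_data_1"))
    second = subprocess.Popen(
        server_command("repro_data_2", port_offset=1,
                       join="localhost:%d" % CLUSTER_PORT))
    time.sleep(5)  # crude wait for both servers to come up (assumption)

    # Step 2: create the table and insert 100 empty documents.
    conn = r.connect("localhost", DRIVER_PORT)
    r.db("test").table_create("data").run(conn)
    r.table("data").insert(r.expr([{}]).mul(100)).run(conn)

    # Step 3: sharding is done by hand here, e.g. through the admin UI.
    input("Shard test.data across both servers, then press Enter: ")

    # Step 4: open a changefeed and keep the cursor alive.
    feed = r.table("data").changes().run(conn)

    # Step 5: hard-kill the second server (SIGKILL on POSIX).
    second.kill()
    second.wait()
    time.sleep(5)  # let the first server notice the disconnect (assumption)

    # Step 6: ask the first server to shut down gracefully, as Ctrl-C would.
    first.send_signal(signal.SIGINT)
    return first.wait()
```

Calling `reproduce()` on an affected build should end with the first server logging the guarantee failure shown below instead of exiting cleanly.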

The log from the first server:

info: Our machine ID: 11e40ea6-c975-4003-b6ae-fe4ed30dad84
info: Created directory '/var/folders/nt/lgyltlr140ndr1r9gy9ygh000000gn/T/tmpwajPC6/0' and a metadata file inside it.
info: Running rethinkdb 1.14.0-509-g7ffaaa (debug) (CLANG 5.1 (clang-503.0.40))...
info: Running on Darwin 13.3.0 x86_64
info: Using cache size of 512 MB
warn: Requested cache size is larger than available memory.
info: Loading data from directory /private/var/folders/nt/lgyltlr140ndr1r9gy9ygh000000gn/T/tmpwajPC6/0
info: Our machine ID is 11e40ea6-c975-4003-b6ae-fe4ed30dad84
info: Listening for intracluster connections on port 49335
info: Listening for client driver connections on port 49336
info: Listening for administrative HTTP connections on port 49337
info: Listening on addresses: 127.0.0.1, 192.168.0.185, ::1, fe80::1%1, fe80::6a5b:35ff:febb:bb47%4
info: Server ready
info: Connected to server "node_1" 0f8037ee-3592-4d62-8046-7849e45f3596
info: Applying data {"rdb_namespaces":{"f8779309-200f-45ed-9a1c-7369a9c97fdb":{"shards":["[\"\", \"Nc0a4640000000000%232610\"]","[\"Nc0a4640000000000%232610\", null]"]}}}
info: Disconnected from server "node_1" 0f8037ee-3592-4d62-8046-7849e45f3596
info: Server got SIGINT from pid 69464, uid 501; shutting down...
info: Shutting down client connections...
info: All client connections closed.
info: Shutting down storage engine... (This may take a while if you had a lot of unflushed data in the writeback cache.)
info: Storage engine shut down.
Version: rethinkdb 1.14.0-509-g7ffaaa (debug) (CLANG 5.1 (clang-503.0.40))
error: Error in src/rpc/mailbox/mailbox.cc at line 129:
error: Guarantee failed: [mailboxes.empty()] Please destroy all mailboxes before destroying the cluster
error: Backtrace:
error: Thu Sep 11 16:11:04 2014

       1: 0   rethinkdb                           0x0000000108a64ed0 _Z19rethinkdb_backtracePPvi + 272 at 0x108a64ed0 ()
       2: 0   rethinkdb                           0x0000000107bc8aa0 _ZN11backtrace_tC2Ev + 304 at 0x107bc8aa0 ()
       3: 0   rethinkdb                           0x0000000107bca99b _ZN26lazy_backtrace_formatter_tC2Ev + 43 at 0x107bca99b ()
       4: 0   rethinkdb                           0x0000000107bc78e5 _ZN26lazy_backtrace_formatter_tC1Ev + 21 at 0x107bc78e5 ()
       5: 0   rethinkdb                           0x0000000107bc7850 _Z16format_backtraceb + 48 at 0x107bc7850 ()
       6: 0   rethinkdb                           0x0000000108506815 _Z18report_fatal_errorPKciS0_z + 757 at 0x108506815 ()
       7: 0   rethinkdb                           0x0000000108ad14a4 _ZN17mailbox_manager_t15mailbox_table_tD2Ev + 116 at 0x108ad14a4 ()
       8: 0   rethinkdb                           0x0000000108ad1425 _ZN17mailbox_manager_t15mailbox_table_tD1Ev + 21 at 0x108ad1425 ()
       9: 0   rethinkdb                           0x0000000107e5e371 _ZN15object_buffer_tIN17mailbox_manager_t15mailbox_table_tEE5resetEv + 49 at 0x107e5e371 ()
       10: 0   rethinkdb                           0x0000000107e5e6e1 _ZNK16one_per_thread_tIN17mailbox_manager_t15mailbox_table_tEE10destruct_tclEi + 81 at 0x107e5e6e1 ()
       11: 0   rethinkdb                           0x0000000107e5e652 _ZN21pmap_runner_one_arg_tIN16one_per_thread_tIN17mailbox_manager_t15mailbox_table_tEE10destruct_tEiEclEv + 34 at 0x107e5e652 ()
       12: 0   rethinkdb                           0x0000000107e5ea4c _ZN26callable_action_instance_tI21pmap_runner_one_arg_tIN16one_per_thread_tIN17mailbox_manager_t15mailbox_table_tEE10destruct_tEiEE10run_actionEv + 28 at 0x107e5ea4c ()
       13: 0   rethinkdb                           0x0000000107bb921c _ZN25callable_action_wrapper_t3runEv + 108 at 0x107bb921c ()
       14: 0   rethinkdb                           0x0000000107ba0739 _ZN6coro_t3runEv + 937 at 0x107ba0739 ()
error: Exiting.

I have a test written for this, but it requires some things that will be added with the server component of #2694.

@mlucy mlucy added this to the 1.14.x milestone Sep 12, 2014
@mlucy mlucy self-assigned this Sep 12, 2014
@larkost
Collaborator Author

larkost commented Sep 13, 2014

The test for this is up for review with CR 2089 on branch larkost/3038-server-crash-on-exit. It should be noted that this code depends on code in the test for #2694.

larkost added a commit that referenced this issue Sep 20, 2014
@larkost
Collaborator Author

larkost commented Sep 20, 2014

The test for this is in next as of ea3fd8b.

@AtnNn
Member

AtnNn commented Sep 20, 2014

@larkost the test shows up as regression.2790-2 and is enabled by default.

@larkost
Collaborator Author

larkost commented Sep 22, 2014

@mlucy Per our conversation on Friday: this problem does appear in 1.14.1. So based on the logic from Friday, this is not a blocker for 1.15.0. But it probably should be scheduled for another milestone.

@larkost
Collaborator Author

larkost commented Sep 22, 2014

@AtnNn the name is fixed in 2909282 (over-the-shoulder review by @deontologician), and it should stay enabled: it is a valid failure that needs to be tracked.

@coffeemug coffeemug modified the milestones: 1.14.x, 1.15.x Sep 29, 2014
@mlucy
Member

mlucy commented Oct 25, 2014

This is in CR 2239 by @Tryneus.

@mlucy
Member

mlucy commented Oct 25, 2014

This is in next.

@mlucy mlucy closed this as completed Oct 25, 2014
@mlucy
Member

mlucy commented Oct 25, 2014

(And 1.15.x)

@AtnNn AtnNn modified the milestones: 1.15.2, 1.15.x Nov 5, 2014