Run time configuration reload #321

Open · wants to merge 171 commits into base: master

Conversation

vlm commented Feb 12, 2015

This patch introduces a runtime configuration reload: sending SIGUSR1 (for example, killall -USR1 nutcracker) causes nutcracker to reload its configuration.

Changes

  • The patch contains test cases that exercise different failure scenarios. Unfortunately, to make the tests work across platforms (Linux, Mac OS X), I also had to fix an nc_kqueue bug which caused crashes on Mac OS X.
  • The patch replaces the separate .client and .redis bit fields in nc_connection.h with a more precise enumeration of what a connection is (a sketch of the idea follows this list). This also allows significantly more descriptive logging.
  • The event loop in the original branch was tightly coupled to struct conn, in a way that is not very conducive to letting multiple kinds of entities receive file descriptor events. This did not work well when I tried to safely pause the statistics-gathering events, so I restructured that part: the statistics handling was removed from the event loop and moved into the statistics thread itself.
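
A minimal sketch of the enumeration idea, using hypothetical names rather than the identifiers the patch actually introduces:

    /* One enum instead of the separate .client/.redis bits, so every
     * connection carries exactly one self-describing kind. Names are
     * illustrative, not the ones in nc_connection.h. */
    enum conn_kind {
        CONN_PROXY,           /* a pool's listening socket           */
        CONN_CLIENT,          /* an accepted client connection       */
        CONN_SERVER_MEMCACHE, /* a connection to a memcached server  */
        CONN_SERVER_REDIS     /* a connection to a redis server      */
    };

    /* Descriptive logging becomes a table lookup. */
    static const char *conn_kind_str(enum conn_kind kind) {
        static const char *names[] = {
            "proxy", "client", "server(memcache)", "server(redis)"
        };
        return names[kind];
    }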

Model of operation

The new config is read into memory, and the new pools are allocated. If the new pools cannot be allocated for some reason, the configuration change is aborted and the configuration state is rolled back. Once the new pools are allocated, we pause all ingress data transfers from the clients and drain the output queues. Once the client queues are drained (and all outstanding server responses have been sent out), we replace the old pools with the newly configured pools. Then we unblock the clients so they can send new requests.
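
As a rough, self-contained sketch of this sequence (hypothetical names throughout; the pause/drain steps that the real patch performs on live connections are reduced to comments):

    /* Build: cc -o reload-sketch reload-sketch.c; send SIGUSR1 to reload. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static volatile sig_atomic_t reload_requested = 0;

    static void on_sigusr1(int sig) { (void)sig; reload_requested = 1; }

    struct pools { char conf[256]; };

    /* Parse and allocate the new pools up front; NULL means "roll back". */
    static struct pools *load_pools(const char *path) {
        FILE *f = fopen(path, "r");
        if (f == NULL) return NULL;
        struct pools *p = calloc(1, sizeof(*p));
        if (p != NULL && fgets(p->conf, sizeof(p->conf), f) == NULL) {
            free(p);
            p = NULL;
        }
        fclose(f);
        return p;
    }

    static void maybe_reload(struct pools **live, const char *path) {
        if (!reload_requested) return;
        reload_requested = 0;

        struct pools *fresh = load_pools(path);  /* 1. allocate new pools */
        if (fresh == NULL) {                     /*    on failure, keep   */
            fprintf(stderr, "reload failed; keeping old config\n");
            return;                              /*    running unchanged  */
        }
        /* 2. pause client ingress and 3. drain output queues (stubbed)   */
        free(*live);                             /* 4. swap in new pools  */
        *live = fresh;
        /* 5. unblock clients so they can send new requests (stubbed)     */
        printf("reloaded from %s\n", path);
    }

    int main(void) {
        signal(SIGUSR1, on_sigusr1);
        struct pools *live = load_pools("nutcracker.yml");
        if (live == NULL) return 1;
        for (;;) {            /* stands in for the real event loop */
            pause();          /* wake up on a signal               */
            maybe_reload(&live, "nutcracker.yml");
        }
    }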

I am OK with restructuring this patch however you see fit. Looking forward to your feedback.

saa commented Feb 13, 2015

+1

bobrik commented Feb 23, 2015

Here's what happens when I add servers and run get x with telnet:

[.......................] signal 10 (SIGUSR1) received, config reload
/bin/nutcracker(nc_stacktrace_fd+0x17)[0x418027]
/bin/nutcracker(signal_handler+0xeb)[0x41574b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7fd0b79c20a0]
/bin/nutcracker(_stats_server_incr+0x25)[0x415595]
/bin/nutcracker(server_connected+0x2a)[0x40d93a]
/bin/nutcracker(req_send_next+0x65)[0x4113a5]
/bin/nutcracker(msg_send+0x24)[0x4105b4]
/bin/nutcracker(core_core+0x64)[0x40bcf4]
/bin/nutcracker(event_wait+0x11a)[0x42115a]
/bin/nutcracker(core_loop+0x37)[0x40c237]
/bin/nutcracker(main+0x5dc)[0x40b74c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7fd0b73c4ead]
/bin/nutcracker[0x40bb09]
[.......................] signal 11 (SIGSEGV) received, core dumping

Config that crashed nutcracker:

memcached_one:
  listen: 127.0.0.1:12335
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: false
  servers:
  - web321:31545:1 twemproxy_memcached_one.841cae5c-bb4f-11e4-8669-56847afe9799
  - web227:31710:1 twemproxy_memcached_one.8bf224cd-bb4f-11e4-8669-56847afe9799
  - web363:31002:1 twemproxy_memcached_one.8bf24bde-bb4f-11e4-8669-56847afe9799
  - web540:31094:1 twemproxy_memcached_one.8bf24bdf-bb4f-11e4-8669-56847afe9799
  - web338:31418:1 twemproxy_memcached_one.8bf272f0-bb4f-11e4-8669-56847afe9799
  - web322:31513:1 twemproxy_memcached_one.8bf29a01-bb4f-11e4-8669-56847afe9799
  - web172:31592:1 twemproxy_memcached_one.8bf29a02-bb4f-11e4-8669-56847afe9799
  - web453:31636:1 twemproxy_memcached_one.8bf2c113-bb4f-11e4-8669-56847afe9799
  - web447:31582:1 twemproxy_memcached_one.8bf2e824-bb4f-11e4-8669-56847afe9799
  - web397:31048:1 twemproxy_memcached_one.8bf30f35-bb4f-11e4-8669-56847afe9799
  - web358:31000:1 twemproxy_memcached_one.8bf33646-bb4f-11e4-8669-56847afe9799
  - web13:31755:1 twemproxy_memcached_one.8bf35d57-bb4f-11e4-8669-56847afe9799
  - web424:31944:1 twemproxy_memcached_one.8bf35d58-bb4f-11e4-8669-56847afe9799
  - web448:31356:1 twemproxy_memcached_one.8bf38469-bb4f-11e4-8669-56847afe9799
  - web457:31854:1 twemproxy_memcached_one.8bf3ab7a-bb4f-11e4-8669-56847afe9799
  - web300:31940:1 twemproxy_memcached_one.8bf3d28b-bb4f-11e4-8669-56847afe9799
  - web305:31128:1 twemproxy_memcached_one.8bf3d28c-bb4f-11e4-8669-56847afe9799
  - web385:31837:1 twemproxy_memcached_one.8bf3f99d-bb4f-11e4-8669-56847afe9799
  - web324:31160:1 twemproxy_memcached_one.8bf420ae-bb4f-11e4-8669-56847afe9799
  - web311:31173:1 twemproxy_memcached_one.8bf420af-bb4f-11e4-8669-56847afe9799
redis_one:
  listen: 127.0.0.1:12334
  hash: fnv1a_64
  distribution: ketama
  redis: true
  auto_eject_hosts: false
  servers:
  - web203:31962:1 twemproxy_redis_one.18ac61be-bb47-11e4-8669-56847afe9799

This happens from time to time, not always.

bobrik commented Feb 23, 2015

Well, it actually happens with this config all the time. Steps to reproduce:

Run with dummy config:

dummy:
  listen: /tmp/dummy-listen
  servers:
    - /tmp/dummy-server:1

Replace it with the actual config from the previous comment and send SIGUSR1 to trigger the reload.

Then telnet 127.0.0.1 12335 and run get x:

web381 ~ # telnet 127.0.0.1 12335
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
get x
Connection closed by foreign host.
/bin/nutcracker(nc_stacktrace_fd+0x17)[0x418027]
/bin/nutcracker(signal_handler+0xeb)[0x41574b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7fa64eb5d0a0]
/bin/nutcracker(_stats_server_incr+0x25)[0x415595]
/bin/nutcracker(req_server_enqueue_imsgq+0x66)[0x410d36]
/bin/nutcracker[0x410aab]
/bin/nutcracker(req_recv_done+0x151)[0x411291]
/bin/nutcracker(msg_recv+0x160)[0x410430]
/bin/nutcracker(core_core+0x9c)[0x40bd2c]
/bin/nutcracker(event_wait+0x11a)[0x42115a]
/bin/nutcracker(core_loop+0x37)[0x40c237]
/bin/nutcracker(main+0x5dc)[0x40b74c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7fa64e55fead]
/bin/nutcracker[0x40bb09]
[.......................] signal 11 (SIGSEGV) received, core dumping

vlm commented Feb 25, 2015

@bobrik: Fixed, try again!

davidradunz commented Sep 12, 2015

Any reason this hasn't been merged yet? We really need it.

JamieCressey commented Sep 15, 2015

+1

zhitaoli commented Sep 16, 2015

+1

davidradunz commented Sep 18, 2015

Due to this issue and the topology of our stack, we've switched to Redis Cluster and implemented specialized clients that support it. All of our problems have gone away, and things seem much more reliable now, and also faster! (To clarify, the stack was app -> local haproxy -> twemproxy -> redis.)

Feedback: For twemproxy to be a viable option for corporate infrastructure, it should not only support reloading of the config but also provide one of the following:

  • Direct support for Redis Sentinel: blocking client connections when a master is down, waiting for a sentinel notification (for a timeout period, naturally), and then failing over to the new master (a sketch of this follows the list).
  • Support for assigning the slaves to the masters in a pool; when a master goes down, querying the slaves to determine which is the new master (whilst blocking client connections until the failover occurs).
  • A direct replacement for sentinel that manages the whole cluster: blocking client connections when a master goes down and assigning a replacement for it.
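
A minimal sketch of "waiting for a sentinel notification", assuming the hiredis client library and a Sentinel on its conventional port 26379; Sentinel publishes a +switch-master event on every failover. This illustrates the mechanism only and is not proposed twemproxy code:

    #include <stdio.h>
    #include <hiredis/hiredis.h>

    int main(void) {
        redisContext *c = redisConnect("127.0.0.1", 26379); /* example sentinel */
        if (c == NULL || c->err) {
            fprintf(stderr, "cannot connect to sentinel\n");
            return 1;
        }

        /* Subscribe to the channel Sentinel publishes failovers on. */
        redisReply *r = redisCommand(c, "SUBSCRIBE +switch-master");
        if (r != NULL)
            freeReplyObject(r);

        /* Payload: "<master> <old-ip> <old-port> <new-ip> <new-port>".
         * A proxy would block its clients here until it has re-pointed
         * the pool at the new master. */
        while (redisGetReply(c, (void **)&r) == REDIS_OK) {
            if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
                r->element[2]->type == REDIS_REPLY_STRING)
                printf("failover: %s\n", r->element[2]->str);
            freeReplyObject(r);
        }

        redisFree(c);
        return 0;
    }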

sethrosenblum commented Dec 2, 2015

Is something holding this up?

draco2003 commented Dec 31, 2015

+1 for this patch. Would love to see it merged so we can run off the mainline master branch again.

axot commented Feb 4, 2016

+1

tdshank commented Feb 25, 2016

+1

TrumanDu commented Feb 26, 2016

+1

homeyjd commented Apr 13, 2016

+1

JeffXue commented Apr 25, 2016

+1

artursitarski commented May 20, 2016

+1

alexef commented Sep 28, 2016

+1, anything holding up this patch?

manjuraj (Collaborator) commented Sep 28, 2016

@andyqzb - can you look into this?

vlm commented Sep 28, 2016

@manjuraj, let me finish this up. I have a list of things I promised you I'd do on it.

blsmth commented Nov 14, 2016

any news on this?

vsacheti commented Nov 17, 2016

It would be great to have this functionality added, since in the Marathon/Mesos world the hosts are dynamic. Thanks!

elukey commented Dec 13, 2016

Sorry to ask again after all the pings, but is there any news? Really looking forward to seeing this feature merged!

yjqg6666 commented Dec 14, 2016

+1. Any updates on this PR?

fabiomsouto commented Dec 14, 2016

+1

ge-fa commented Dec 23, 2016

+1

everpcpc commented Jan 12, 2017

+1

zeitos commented Jun 23, 2017

Hello? Is this going to get merged?

armcburney left a comment

LGTM! 👍

lpan approved these changes Mar 22, 2018

lpan left a comment

👍

coderall commented Mar 22, 2018

+1

xginn8 commented May 6, 2018

I'm maintaining a fork of twemproxy with this patch merged (https://github.com/xginn8/twemproxy).
