
Run time configuration reload #321

Open: vlm wants to merge 171 commits into master

Conversation

@vlm commented Feb 12, 2015

This patch introduces runtime configuration reload: sending SIGUSR1 to the process (for example, killall -USR1 nutcracker) causes the configuration to be reloaded.
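
For context, this relies on the usual signal-to-flag pattern: the handler only records that a reload was requested, and the event loop performs the reload outside signal context. A minimal, compilable sketch follows; reload_requested and reload_config() are illustrative names, not the exact symbols in the patch.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t reload_requested = 0;

/* The handler only sets a flag; the actual reload happens in the loop. */
static void
on_sigusr1(int signo)
{
    (void)signo;
    reload_requested = 1;
}

static void
reload_config(void)
{
    /* Placeholder for: re-read the YAML, build new pools, swap them in. */
    printf("reloading configuration\n");
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    for (;;) {                 /* stand-in for nutcracker's event loop */
        pause();               /* returns when a signal arrives */
        if (reload_requested) {
            reload_requested = 0;
            reload_config();
        }
    }
}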

Changes

  • The patch contains test cases that exercise different failure scenarios. Unfortunately, to make the tests work across platforms (Linux, Mac OS X), I also had to fix an nc_kqueue bug which caused crashes on Mac OS X.
  • The patch replaces the separate .client and .redis bit fields in nc_connection.h with a more precise enumeration of what a connection is, which also allows significantly more descriptive logging (see the sketch after this list).
  • The event loop in the original branch was tightly coupled to struct conn, in a way that is not very conducive to letting multiple different kinds of entities receive file descriptor events. This did not work well when I tried to safely pause the statistics-gathering events. I restructured that part by removing the statistics handling from the event loop and moving it into the statistics thread itself.
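
As a rough illustration of the connection-kind enumeration mentioned in the second item above, it could look something like the following; the constant names here are hypothetical and the patch's actual identifiers may differ:

/* Hypothetical sketch: one enumeration of connection kinds instead of the
 * separate .client/.redis bit fields. */
enum conn_kind {
    CONN_PROXY,             /* listening socket accepting client connections */
    CONN_CLIENT_MEMCACHE,   /* downstream client speaking the memcached protocol */
    CONN_CLIENT_REDIS,      /* downstream client speaking the redis protocol */
    CONN_SERVER_MEMCACHE,   /* upstream memcached server */
    CONN_SERVER_REDIS       /* upstream redis server */
};

/* A single kind value makes log lines self-describing. */
static const char *
conn_kind_str(enum conn_kind kind)
{
    switch (kind) {
    case CONN_PROXY:           return "proxy";
    case CONN_CLIENT_MEMCACHE: return "client(memcache)";
    case CONN_CLIENT_REDIS:    return "client(redis)";
    case CONN_SERVER_MEMCACHE: return "server(memcache)";
    case CONN_SERVER_REDIS:    return "server(redis)";
    }
    return "unknown";
}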

Model of operation

The new config is read into memory, and the new pools are allocated. If the new pools cannot be allocated for some reason, the configuration change is not performed and the configuration state is rolled back. Once the new pools are allocated, we pause all ingress data transfer from the clients and drain the output queues. Once the client queues are drained (and we have sent out all the outstanding server responses), we replace the old pools with the newly configured pools. Then we unblock the clients so they can send new requests.
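
To make the ordering concrete, here is a compilable sketch of that drain-and-swap sequence. Every helper below (load_new_pools, pause_client_ingress, and so on) is a hypothetical stand-in stubbed out for illustration, not the patch's actual API:

#include <stdbool.h>
#include <stdio.h>

struct pools { int n; };                        /* stand-in for the pool array */

static bool load_new_pools(struct pools *p)     { p->n = 1; return true; }
static void free_pools(struct pools *p)         { p->n = 0; }
static void pause_client_ingress(void)          { puts("pause client reads"); }
static void drain_outstanding_responses(void)   { puts("drain output queues"); }
static void swap_in_pools(struct pools *p)      { (void)p; puts("swap pools"); }
static void resume_client_ingress(void)         { puts("resume client reads"); }

static bool
reload(void)
{
    struct pools new_pools;

    /* 1. Read the new config and allocate the new pools up front. */
    if (!load_new_pools(&new_pools)) {
        free_pools(&new_pools);    /* roll back and keep running on the old config */
        return false;
    }

    /* 2. Stop reading new client data and flush the responses we still owe. */
    pause_client_ingress();
    drain_outstanding_responses();

    /* 3. Only then replace the old pools and let clients send again. */
    swap_in_pools(&new_pools);
    resume_client_ingress();
    return true;
}

int
main(void)
{
    return reload() ? 0 : 1;
}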

I am OK with restructuring this patch however you see fit. Looking forward to your feedback.


saa commented Feb 13, 2015

+1


bobrik commented Feb 23, 2015

Here's what happens when I add servers and run get x with telnet:

[.......................] signal 10 (SIGUSR1) received, config reload
/bin/nutcracker(nc_stacktrace_fd+0x17)[0x418027]
/bin/nutcracker(signal_handler+0xeb)[0x41574b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7fd0b79c20a0]
/bin/nutcracker(_stats_server_incr+0x25)[0x415595]
/bin/nutcracker(server_connected+0x2a)[0x40d93a]
/bin/nutcracker(req_send_next+0x65)[0x4113a5]
/bin/nutcracker(msg_send+0x24)[0x4105b4]
/bin/nutcracker(core_core+0x64)[0x40bcf4]
/bin/nutcracker(event_wait+0x11a)[0x42115a]
/bin/nutcracker(core_loop+0x37)[0x40c237]
/bin/nutcracker(main+0x5dc)[0x40b74c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7fd0b73c4ead]
/bin/nutcracker[0x40bb09]
[.......................] signal 11 (SIGSEGV) received, core dumping

Config that crashed nutcracker:

memcached_one:
  listen: 127.0.0.1:12335
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: false
  servers:
  - web321:31545:1 twemproxy_memcached_one.841cae5c-bb4f-11e4-8669-56847afe9799
  - web227:31710:1 twemproxy_memcached_one.8bf224cd-bb4f-11e4-8669-56847afe9799
  - web363:31002:1 twemproxy_memcached_one.8bf24bde-bb4f-11e4-8669-56847afe9799
  - web540:31094:1 twemproxy_memcached_one.8bf24bdf-bb4f-11e4-8669-56847afe9799
  - web338:31418:1 twemproxy_memcached_one.8bf272f0-bb4f-11e4-8669-56847afe9799
  - web322:31513:1 twemproxy_memcached_one.8bf29a01-bb4f-11e4-8669-56847afe9799
  - web172:31592:1 twemproxy_memcached_one.8bf29a02-bb4f-11e4-8669-56847afe9799
  - web453:31636:1 twemproxy_memcached_one.8bf2c113-bb4f-11e4-8669-56847afe9799
  - web447:31582:1 twemproxy_memcached_one.8bf2e824-bb4f-11e4-8669-56847afe9799
  - web397:31048:1 twemproxy_memcached_one.8bf30f35-bb4f-11e4-8669-56847afe9799
  - web358:31000:1 twemproxy_memcached_one.8bf33646-bb4f-11e4-8669-56847afe9799
  - web13:31755:1 twemproxy_memcached_one.8bf35d57-bb4f-11e4-8669-56847afe9799
  - web424:31944:1 twemproxy_memcached_one.8bf35d58-bb4f-11e4-8669-56847afe9799
  - web448:31356:1 twemproxy_memcached_one.8bf38469-bb4f-11e4-8669-56847afe9799
  - web457:31854:1 twemproxy_memcached_one.8bf3ab7a-bb4f-11e4-8669-56847afe9799
  - web300:31940:1 twemproxy_memcached_one.8bf3d28b-bb4f-11e4-8669-56847afe9799
  - web305:31128:1 twemproxy_memcached_one.8bf3d28c-bb4f-11e4-8669-56847afe9799
  - web385:31837:1 twemproxy_memcached_one.8bf3f99d-bb4f-11e4-8669-56847afe9799
  - web324:31160:1 twemproxy_memcached_one.8bf420ae-bb4f-11e4-8669-56847afe9799
  - web311:31173:1 twemproxy_memcached_one.8bf420af-bb4f-11e4-8669-56847afe9799
redis_one:
  listen: 127.0.0.1:12334
  hash: fnv1a_64
  distribution: ketama
  redis: true
  auto_eject_hosts: false
  servers:
  - web203:31962:1 twemproxy_redis_one.18ac61be-bb47-11e4-8669-56847afe9799

This happens from time to time, not always.


bobrik commented Feb 23, 2015

Well, it actually happens with this config all the time. Steps to reproduce:

Run with dummy config:

dummy:
  listen: /tmp/dummy-listen
  servers:
    - /tmp/dummy-server:1

Replace it with the actual config from the previous comment.

telnet 127.0.0.1 12335 and get x:

web381 ~ # telnet 127.0.0.1 12335
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
get x
Connection closed by foreign host.
/bin/nutcracker(nc_stacktrace_fd+0x17)[0x418027]
/bin/nutcracker(signal_handler+0xeb)[0x41574b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7fa64eb5d0a0]
/bin/nutcracker(_stats_server_incr+0x25)[0x415595]
/bin/nutcracker(req_server_enqueue_imsgq+0x66)[0x410d36]
/bin/nutcracker[0x410aab]
/bin/nutcracker(req_recv_done+0x151)[0x411291]
/bin/nutcracker(msg_recv+0x160)[0x410430]
/bin/nutcracker(core_core+0x9c)[0x40bd2c]
/bin/nutcracker(event_wait+0x11a)[0x42115a]
/bin/nutcracker(core_loop+0x37)[0x40c237]
/bin/nutcracker(main+0x5dc)[0x40b74c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7fa64e55fead]
/bin/nutcracker[0x40bb09]
[.......................] signal 11 (SIGSEGV) received, core dumping

vlm (Author) commented Feb 25, 2015

@bobrik: Fixed, try again!

@davidradunz

Any reason this hasn't been merged in yet? We really need it.

@JamieCressey

+1

1 similar comment
@zhitaoli

+1

@davidradunz

Due to this issue and the topology of our stack, we've switched to Redis Cluster and implemented specialised clients that support it. All of our problems have gone away, and things seem much more reliable and also faster now. (To clarify, the stack was app -> local haproxy -> twemproxy -> redis.)

Feedback: for twemproxy to be a viable option for corporate infrastructure, it should not only support reloading the config but also one of the following:

  • Direct support for Redis Sentinel: block client connections when a master is down, wait for a Sentinel notification (for a timeout period, naturally), and then fail over to the promoted master.
  • Support for assigning slaves to the masters in a pool: when a master goes down, query the slaves to determine which is the new master (whilst blocking client connections until the failover occurs).
  • A direct replacement for Sentinel that manages the whole cluster: block client connections when a master goes down and assign a replacement for it.

@sethrosenblum

Is something holding this up?

@draco2003

+1 for this patch. Would love to see it merged in so we can run off of the mainline master branch again.


axot commented Feb 4, 2016

+1

5 similar comments

ghost commented Feb 25, 2016

+1

@TrumanDu

+1


homeyjd commented Apr 13, 2016

+1


JeffXue commented Apr 25, 2016

+1

@artursitarski

+1


alexef commented Sep 28, 2016

+1, anything holding up this patch?

@manjuraj (Collaborator)

@andyqzb - can you look into this?

vlm (Author) commented Sep 28, 2016

@manjuraj, let me finish this up. I have a list of things I promised you I would do on it.


blsmth commented Nov 14, 2016

any news on this?

@vsacheti

It will be great to have this functionality added, since in the Marathon/Mesos world the hosts are dynamic. Thanks!


elukey commented Dec 13, 2016

Sorry to ask again after all the pings, but any news? Really looking forward to seeing this feature merged!

@yjqg6666

+1. Any updates on this PR?

@fabiomsouto

+1

2 similar comments

geor-g commented Dec 23, 2016

+1

@everpcpc

+1


zeitos commented Jun 23, 2017

hello? is this going to get merged?

@armcburney left a comment

LGTM! 👍

@lpan left a comment

👍

@coderall

+1


xginn8 commented May 6, 2018

I'm maintaining a fork of twemproxy and have merged this commit (https://github.com/xginn8/twemproxy).


CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ Lev Walkin
❌ manjuraj


Lev Walkin does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@Sairammp

Is this dynamic config reload option planned to be merged to the main branch?
