
If dns lookup fails for all hosts in a pool, memcache get/multiget (and redis mget) requests fail (but not other commands such as incr) #596

Closed
TysonAndre opened this issue Mar 19, 2021 · 3 comments · Fixed by #597

Comments

@TysonAndre
Collaborator

TysonAndre commented Mar 19, 2021

Describe the bug

If hostnames are used instead of ip addresses for all hosts within a pool, and dns lookup fails, then get/multiget will hang indefinitely instead of responding with an error such as SERVER_ERROR Host is down. (I expect a client would eventually time out and close the connection, but that is not ideal; the client timeout could be a second or more.)

  • If a server resolves but is down, then get does respond with SERVER_ERROR Host is down.

I suspect that's because memcache get is implemented to coalesce responses from multiple backend servers (even when there's only one key), and this is likely a bug specific to the coalescing logic for the case where no attempt is made to send the request to a backend server.

Impact: This may discourage the use of hostnames instead of ip addresses with nutcracker.

To Reproduce
Steps to reproduce the behavior:

  1. Set up conf/nutcracker.yml with a pool containing a hostname that fails to resolve, e.g. invalid, by modifying the example nutcracker config. This simulates removing the dns entry for a host before the new dns entry is added, or dns being misconfigured.
    (This happens whether or not auto_eject_hosts is true)
gamma:
  listen: 127.0.0.1:22123
  hash: fnv1a_64
  distribution: ketama
  timeout: 4000
  backlog: 1024
  preconnect: true
  auto_eject_hosts: false
  server_retry_timeout: 2000
  server_failure_limit: 3
  servers:
   - invalid:11212:1
   - invalid:11213:1

Observed result of a telnet session to that memcache proxy: the proxy hangs instead of responding to get requests

» telnet 127.0.0.1 22123
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
incr key 1
SERVER_ERROR Host is down
incr key 1
SERVER_ERROR Host is down
get key
get key
^]
telnet> Connection closed.

Observed nutcracker log output

» src/nutcracker -v 7
[2021-03-19 11:24:36.975] nc.c:185 nutcracker-0.4.1 built for Linux 5.4.0-67-generic x86_64 started on pid 20994
[2021-03-19 11:24:36.975] nc.c:190 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one
[2021-03-19 11:24:36.975] nc_mbuf.c:271 mbuf hsize 48 chunk size 16384 offset 16336 length 16336
[2021-03-19 11:24:36.975] nc_message.c:422 msg size 440
[2021-03-19 11:24:36.975] nc_connection.c:315 conn size 352
[2021-03-19 11:24:36.975] nc_server.c:166 init 4 servers in pool 0 'beta'
[2021-03-19 11:24:36.975] nc_server.c:166 init 1 servers in pool 1 'alpha'
[2021-03-19 11:24:36.975] nc_server.c:166 init 10 servers in pool 2 'delta'
[2021-03-19 11:24:36.975] nc_server.c:166 init 2 servers in pool 3 'gamma'
[2021-03-19 11:24:36.975] nc_server.c:166 init 2 servers in pool 4 'omega'
[2021-03-19 11:24:36.975] nc_ketama.c:120 4 of 4 servers are live for pool 0 'beta'
[2021-03-19 11:24:36.976] nc_ketama.c:120 1 of 1 servers are live for pool 1 'alpha'
[2021-03-19 11:24:36.976] nc_ketama.c:120 10 of 10 servers are live for pool 2 'delta'
[2021-03-19 11:24:36.976] nc_ketama.c:120 2 of 2 servers are live for pool 3 'gamma'
[2021-03-19 11:24:36.977] nc_ketama.c:120 2 of 2 servers are live for pool 4 'omega'
[2021-03-19 11:24:36.977] nc_server.c:894 init 5 pools
[2021-03-19 11:24:36.977] nc_core.c:41 max fds 1024 max client conns 968 max server conns 24
[2021-03-19 11:24:36.977] nc_stats.c:419 stats buffer size 11856
[2021-03-19 11:24:36.977] nc_stats.c:850 m 3 listening on '0.0.0.0:22222'
[2021-03-19 11:24:36.977] nc_epoll.c:63 e 4 with nevent 1024
[2021-03-19 11:24:37.136] nc_util.c:505 address resolution of node 'invalid' service '11212' failed: Name or service not known
[2021-03-19 11:24:37.136] nc_server.c:236 connect to server 'invalid:11212:1' failed, ignored: Host is down
[2021-03-19 11:24:37.302] nc_util.c:505 address resolution of node 'invalid' service '11213' failed: Name or service not known
[2021-03-19 11:24:37.302] nc_server.c:236 connect to server 'invalid:11213:1' failed, ignored: Host is down
[2021-03-19 11:24:37.302] nc_proxy.c:213 p 6 listening on '127.0.0.1:22122' in redis pool 0 'beta' with 4 servers
[2021-03-19 11:24:37.302] nc_proxy.c:213 p 7 listening on '127.0.0.1:22121' in redis pool 1 'alpha' with 1 servers
[2021-03-19 11:24:37.302] nc_proxy.c:213 p 8 listening on '127.0.0.1:22124' in memcache pool 2 'delta' with 10 servers
[2021-03-19 11:24:37.302] nc_proxy.c:213 p 9 listening on '127.0.0.1:22123' in memcache pool 3 'gamma' with 2 servers
[2021-03-19 11:24:37.302] nc_proxy.c:213 p 10 listening on '/tmp/gamma' in memcache pool 4 'omega' with 2 servers
[2021-03-19 11:24:39.119] nc_proxy.c:385 accepted c 11 on p 9 from '127.0.0.1:45974'
[2021-03-19 11:24:41.070] nc_util.c:505 address resolution of node 'invalid' service '11213' failed: Name or service not known
[2021-03-19 11:24:41.070] nc_request.c:524 forward req 1 len 12 type 10 from c 11 failed: Host is down
[2021-03-19 11:24:41.070] nc_request.c:91 req 1 done on c 11 req_time 163.670 msec type REQ_MC_INCR narg 2 req_len 12 rsp_len 27 key0 'key' peer '127.0.0.1:45974' done 1 error 1
[2021-03-19 11:24:43.029] nc_util.c:505 address resolution of node 'invalid' service '11213' failed: Name or service not known
[2021-03-19 11:24:43.029] nc_request.c:524 forward req 3 len 12 type 10 from c 11 failed: Host is down
[2021-03-19 11:24:43.029] nc_request.c:91 req 3 done on c 11 req_time 155.130 msec type REQ_MC_INCR narg 2 req_len 12 rsp_len 27 key0 'key' peer '127.0.0.1:45974' done 1 error 1
[2021-03-19 11:24:47.709] nc_util.c:505 address resolution of node 'invalid' service '11213' failed: Name or service not known
[2021-03-19 11:24:47.709] nc_request.c:524 forward req 6 len 10 type 1 from c 11 failed: Host is down
[2021-03-19 11:24:49.387] nc_util.c:505 address resolution of node 'invalid' service '11213' failed: Name or service not known
[2021-03-19 11:24:49.387] nc_request.c:524 forward req 9 len 10 type 1 from c 11 failed: Host is down
[2021-03-19 11:24:54.062] nc_connection.c:359 recv on sd 11 eof rb 42 sb 54

This is using commit 48e3f39 (probably an issue for a long, long time)

EDIT: I see that any command following the hanging get will also hang, which may be useful in investigating this (i.e. twemproxy is failing to send any response to the request, rather than sending an empty response, and that delays other pipelined requests).
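
For example (hypothetical session against the gamma pool above; the incr would normally fail fast with SERVER_ERROR, but it is stuck behind the unanswered get):

» telnet 127.0.0.1 22123
get key
incr key 1
(no response to either command until the connection is closed)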

Expected behavior
When there is a dns error, get commands should promptly respond with SERVER_ERROR if all of the keys had dns lookup errors, or with a partial response if only some of the backend servers had errors.
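
For comparison, something like this would be reasonable for the pool above where no host resolves (hypothetical output, not current behavior; it mirrors what incr already returns):

get key
SERVER_ERROR Host is down
get key1 key2
SERVER_ERROR Host is down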

Environment
Linux

@TysonAndre
Collaborator Author

A possible patch is included below; it increments the received fragment count when a server pool connection could not be established.

Note that with the patch there is still a difference between the behavior for a dns lookup error and for an unreachable host:

  1. For a dns lookup failure, the client connection to nutcracker is closed by nutcracker with no response
  2. For an error establishing a connection, the client connection is closed by nutcracker after a SERVER_ERROR response

With this patch, it seems like operations continue to work properly on hosts that aren't down: for a pool with a mix of hosts, multigets will disconnect if they include a key on a host with a dns error, but will work properly if they don't.

--- a/src/nc_request.c
+++ b/src/nc_request.c
@@ -579,6 +579,12 @@ req_forward(struct context *ctx, struct conn *c_conn, struct msg *msg)
 
     s_conn = server_pool_conn(ctx, c_conn->owner, key, keylen);
     if (s_conn == NULL) {
+        /* Handle a failure to establish a new connection to a server, e.g. due to dns resolution errors. */
+        /* If this is a fragmented request sent to multiple servers such as a memcache get(multiget) mark the fragment for this request to the server as done. */
+        /* Normally, this would be done when the request was forwarded to the server, but due to failing to connect to the server this check is repeated here */
+        if (msg->frag_owner != NULL) {
+            msg->frag_owner->nfrag_done++;
+        }
         req_forward_error(ctx, c_conn, msg);
         return;
     }
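
To spell out why that single increment matters, here is a tiny standalone model of the fragment bookkeeping as I understand it (my own illustration, not twemproxy source; the struct and field names just mirror the patch above):

/* Standalone model (not twemproxy source) of the bookkeeping the patch feeds:
 * the coalesced response for a fragmented request can only be sent once every
 * fragment of the frag_owner has been accounted for. */
#include <stdbool.h>
#include <stdio.h>

struct frag_owner {
    unsigned nfrag;      /* fragments created for the client request */
    unsigned nfrag_done; /* fragments answered, or marked done on an error path */
};

static bool response_ready(const struct frag_owner *owner)
{
    return owner->nfrag_done == owner->nfrag;
}

int main(void)
{
    struct frag_owner owner = { .nfrag = 2, .nfrag_done = 0 };

    owner.nfrag_done++;  /* fragment 1: forwarded and accounted for */

    /* fragment 2: server_pool_conn() returned NULL (dns failure); without the
     * patch nothing increments nfrag_done, so the owner never becomes ready */
    printf("without patch: ready=%d\n", response_ready(&owner));

    owner.nfrag_done++;  /* the increment added by the patch on the connect-failure path */
    printf("with patch:    ready=%d\n", response_ready(&owner));
    return 0;
}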

@TysonAndre
Collaborator Author

It seems like server_pool_conn returns non-null if there is an error connecting (connection attempts are asynchronous, so there is no error at that point), but returns null if there is an error looking up the dns name, because dns lookups are synchronous. (A standalone sketch of this synchronous-vs-asynchronous asymmetry follows after the backtrace below.)

Normally, rsp_send_next would get called (in response to an event from the event loop for the outgoing request) and would call msg_get_error to send the error; I'm not sure how this would be implemented for a fragmented request.

The lack of a response message before closing the connection seems harmless, and I think the patch should be sufficient (?). conn->err = 1 is why I'm seeing the unrelated message Operation not permitted at high verbosity: the value 1 ends up being passed to strerror.

[2021-03-19 16:07:01.889] nc_core.c:227 recv on c 7 failed: Host is down
[2021-03-19 16:07:01.889] nc_core.c:266 close c 7 '127.0.0.1:47830' on event 00FF eof 0 done 0 rb 9 sb 0: Operation not permitted
Thread 1 "nutcracker" hit Breakpoint 4, msg_get_error (redis=false, err=err@entry=111) at nc_message.c:327
327     {
(gdb) bt
#0  msg_get_error (redis=false, err=err@entry=111) at nc_message.c:327
#1  0x0000555555567eb4 in rsp_make_error (msg=0x5555555b4460, conn=0x5555555b3e10, ctx=0x5555555a3540) at nc_response.c:83
#2  rsp_send_next (ctx=0x5555555a3540, conn=0x5555555b3e10) at nc_response.c:342
#3  0x0000555555564fdc in msg_send (ctx=0x5555555a3540, conn=0x5555555b3e10) at nc_message.c:876
#4  0x000055555555d81c in core_send (conn=0x5555555b3e10, ctx=0x5555555a3540) at nc_core.c:414
#5  core_core (arg=0x5555555b3e10, events=<optimized out>) at nc_core.c:414
#6  0x000055555557dd1f in event_wait (evb=0x5555555b9000, timeout=<optimized out>) at nc_epoll.c:269
#7  0x000055555555dfc7 in core_loop (ctx=0x5555555a3540) at nc_core.c:429
#8  0x000055555555d150 in nc_run (nci=0x7fffffffd790) at nc.c:530
#9  main (argc=<optimized out>, argv=<optimized out>) at nc.c:579
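
Here is a standalone sketch of that asymmetry (my own illustration of the underlying socket behavior, not twemproxy code; it assumes the hostname invalid does not resolve, as in the repro config, and uses the non-routable documentation address 192.0.2.1 so the connect cannot complete):

/* getaddrinfo() reports a dns failure synchronously, while a non-blocking
 * connect() returns immediately and only delivers the real error later,
 * when the event loop sees the socket become writable. */
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* 1. dns failure: visible right away, like server_pool_conn() returning NULL */
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
    struct addrinfo *res = NULL;
    int rc = getaddrinfo("invalid", "11212", &hints, &res);
    printf("getaddrinfo: %s\n", rc == 0 ? "ok" : gai_strerror(rc));
    if (rc == 0) {
        freeaddrinfo(res);
    }

    /* 2. unreachable host: non-blocking connect() typically returns -1 with
     * errno EINPROGRESS; the failure only surfaces later via poll/epoll */
    if (getaddrinfo("192.0.2.1", "11212", &hints, &res) == 0) {
        int fd = socket(res->ai_family, SOCK_STREAM, 0);
        if (fd >= 0) {
            fcntl(fd, F_SETFL, O_NONBLOCK);
            rc = connect(fd, res->ai_addr, res->ai_addrlen);
            printf("connect: rc=%d (%s)\n", rc, rc == 0 ? "ok" : strerror(errno));
            close(fd);
        }
        freeaddrinfo(res);
    }
    return 0;
}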

@TysonAndre TysonAndre changed the title If dns lookup fails for all hosts in a pool, memcache get/multiget requests fail (but not other commands such as incr) If dns lookup fails for all hosts in a pool, memcache get/multiget (and redis mget) requests fail (but not other commands such as incr) Mar 19, 2021
@TysonAndre
Collaborator Author

TysonAndre commented Mar 19, 2021

gamma:
  listen: 127.0.0.1:22123
  hash: fnv1a_64
  distribution: ketama
  timeout: 4000
  backlog: 1024
  redis: true
  preconnect: true
  auto_eject_hosts: false
  server_retry_timeout: 2000
  server_failure_limit: 3
  servers:
   - invalid:11212:1
   - invalid:11213:1

hamma:
  listen: 127.0.0.1:22124
  hash: fnv1a_64
  distribution: ketama
  timeout: 4000
  backlog: 1024
  redis: true
  preconnect: true
  auto_eject_hosts: false
  server_retry_timeout: 2000
  server_failure_limit: 3
  servers:
   - 127.0.0.1:11212:1
   - 127.0.0.1:11213:1

Redis is also affected, e.g. for mget. I see it hanging for the redis pool on port 22123, where dns lookup fails (the mget hangs and I interrupt it manually), while the pool on port 22124 with resolvable hosts responds with an error:

redis-cli -h 127.0.0.1 -p 22123
127.0.0.1:22123> mget a b

^C
» redis-cli -h 127.0.0.1 -p 22124
127.0.0.1:22124> mget a b
(error) ERR Connection refused
127.0.0.1:22124> 

TysonAndre added a commit to ifwe/twemproxy that referenced this issue Mar 19, 2021
If hostnames are used instead of ip addresses for all hosts within a pool,
and dns lookup fails, then get/multiget  will hang indefinitely **instead of**
responding with an error such as `SERVER_ERROR Host is down`.
(I expect a client would detect this and close the connection,
but this is not ideal, the client timeout could be a second)

- Both redis and memcache protocols are affected
- If a server resolves but is down, then `get` does respond with `SERVER_ERROR Host is down`.

I suspect that's because memcached get is implemented to coalesce responses
from multiple backend servers, even when there's only one key,
and this is likely a bug specific to handling coalescing when
there's no attempt to send the request to a backend server

Because connection attempts are asynchronous but dns lookups are synchronous,
there's a non-null server connection for a host that's unavailable
but a null server connection for a host that has a dns lookup error,
and these end up using different code paths.
(e.g. the former will call server_close() which does adjust nfrag_sent)

Fixes twitter#596
TysonAndre added a commit to ifwe/twemproxy that referenced this issue May 11, 2021
TysonAndre added a commit to ifwe/twemproxy that referenced this issue Jun 30, 2021