If dns lookup fails for all hosts in a pool, memcache get/multiget (and redis mget) requests fail (but not other commands such as incr) #596
A possible patch is included below: it increments the received fragment count when a server pool connection could not be established. Note that there is still a difference between the behavior for a DNS lookup error and an unreachable host.

With this patch, operations seem to continue working properly on hosts that aren't down: for a pool with a mix of hosts, multigets will disconnect if they include a host with a DNS error, but work properly if they don't include such a host.

```diff
--- a/src/nc_request.c
+++ b/src/nc_request.c
@@ -579,6 +579,12 @@ req_forward(struct context *ctx, struct conn *c_conn, struct msg *msg)
     s_conn = server_pool_conn(ctx, c_conn->owner, key, keylen);
     if (s_conn == NULL) {
+        /* Handle a failure to establish a new connection to a server, e.g. due to DNS resolution errors. */
+        /* If this is a fragmented request sent to multiple servers, such as a memcache get (multiget), mark this request's fragment to the server as done. */
+        /* Normally this happens when the request is forwarded to the server; since connecting failed, the check is repeated here. */
+        if (msg->frag_owner != NULL) {
+            msg->frag_owner->nfrag_done++;
+        }
         req_forward_error(ctx, c_conn, msg);
         return;
     }
```
It seems that server_pool_conn returns non-NULL if there is an error connecting (connection attempts are asynchronous, so no error is visible yet), but returns NULL if there is an error looking up the DNS name, because DNS lookups are synchronous. Normally, rsp_send_next would be called from the event loop to invoke msg_get_error and send the error response for the outgoing request; for a fragmented request I'm not sure how this would be implemented. The lack of a response message before closing the connection seems harmless, and I think the patch should be sufficient (?).
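The synchronous/asynchronous split described above can be seen in a standalone sketch (this is not twemproxy code; the hostname, address, and port are placeholders): `getaddrinfo` reports a resolution failure immediately, while a non-blocking `connect()` to an unreachable address typically just returns -1 with `errno == EINPROGRESS`, so the real error only surfaces later through the event loop.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Synchronous DNS lookup: returns 0 on success, a nonzero EAI_* code on
 * failure. The ".invalid" TLD is reserved (RFC 6761) and never resolves. */
static int resolve_name(const char *host) {
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    int rc = getaddrinfo(host, "11212", &hints, &res);
    if (rc == 0) {
        freeaddrinfo(res);
    }
    return rc;
}

int main(void) {
    /* 1. DNS errors are reported immediately, before any connection exists. */
    int rc = resolve_name("name.invalid");
    printf("getaddrinfo: %s\n", rc != 0 ? gai_strerror(rc) : "ok");

    /* 2. Connecting is asynchronous: a non-blocking connect() to a routable
     * but unresponsive address (203.0.113.1 is TEST-NET-3) usually returns
     * -1 with EINPROGRESS; the caller sees no error until the event loop
     * reports writability and the socket error is read back. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(fd, F_SETFL, O_NONBLOCK);
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(11212);
    inet_pton(AF_INET, "203.0.113.1", &sa.sin_addr);
    rc = connect(fd, (struct sockaddr *)&sa, sizeof(sa));
    printf("connect: rc=%d (%s)\n", rc, rc == -1 ? strerror(errno) : "none");
    close(fd);
    return 0;
}
```

This mirrors why twemproxy has a server connection object for an unreachable host (the connect error arrives later) but no connection object at all for a host that failed DNS resolution.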
```yaml
gamma:
  listen: 127.0.0.1:22123
  hash: fnv1a_64
  distribution: ketama
  timeout: 4000
  backlog: 1024
  redis: true
  preconnect: true
  auto_eject_hosts: false
  server_retry_timeout: 2000
  server_failure_limit: 3
  servers:
   - invalid:11212:1
   - invalid:11213:1
hamma:
  listen: 127.0.0.1:22124
  hash: fnv1a_64
  distribution: ketama
  timeout: 4000
  backlog: 1024
  redis: true
  preconnect: true
  auto_eject_hosts: false
  server_retry_timeout: 2000
  server_failure_limit: 3
  servers:
   - 127.0.0.1:11212:1
   - 127.0.0.1:11213:1
```

Redis is also affected, e.g. for mget: I see it hanging for a redis pool on port 22123 when DNS lookup fails (it hangs until I manually interrupt it).
If hostnames are used instead of IP addresses for all hosts within a pool, and DNS lookup fails, then get/multiget will hang indefinitely **instead of** responding with an error such as `SERVER_ERROR Host is down`. (I expect a client would detect this and close the connection, but this is not ideal; the client timeout could be a second.)

- Both redis and memcache protocols are affected.
- If a server resolves but is down, then `get` does respond with `SERVER_ERROR Host is down`. I suspect that's because memcached get is implemented to coalesce responses from multiple backend servers, even when there's only one key, and this is likely a bug specific to handling coalescing when there's no attempt to send the request to a backend server.

Because connection attempts are asynchronous but DNS lookup is synchronous, there's a non-null server connection for a host that's unavailable but a null server connection for a host that has a DNS lookup error, and these end up using different code paths (e.g. the former will call server_close(), which does adjust nfrag_sent).

Fixes twitter#596
Describe the bug
If hostnames are used instead of IP addresses for all hosts within a pool, and DNS lookup fails, then get/multiget will hang indefinitely instead of responding with an error such as `SERVER_ERROR Host is down`. (I expect a client would detect this and close the connection, but this is not ideal; the client timeout could be a second.)

If a server resolves but is down, then `get` does respond with `SERVER_ERROR Host is down`. I suspect that's because memcached get is implemented to coalesce responses from multiple backend servers, even when there's only one key, and this is likely a bug specific to handling coalescing when there's no attempt to send the request to a backend server.

Impact: This may discourage using hostnames instead of IP addresses for nutcracker.
To Reproduce
Steps to reproduce the behavior: edit `conf/nutcracker.yml` (modifying the example nutcracker config) so a pool contains a hostname that fails to resolve, e.g. `invalid`. This simulates a case where you're removing a DNS entry for a host before adding a new DNS entry, or where DNS is misconfigured. (This happens whether or not auto_eject_hosts is true.)
Observed result of a telnet session to that memcache proxy - the server hangs instead of responding to get requests
Observed nutcracker log output
This is using commit 48e3f39 (probably an issue for a long, long time)
EDIT: I see that any command following the hanging get command will also hang, which may be useful in investigating this (i.e. twemproxy is failing to send a response rather than sending a blank response to the request, which is delaying other pipelined requests)
Expected behavior
When there is a DNS error, get commands should promptly respond with SERVER_ERROR if all keys had dns lookup errors, or a partial response if only some of the backend servers had errors
Environment
Linux