New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Full resync with slave is done twice when slaveof <masterip> is a FQDN #2075
Comments
For some reason it loses the (already established) connection.
and then a Are you sure this is only due to the use of a FQDN? Some client sends a slaveof again. Is this done automatically by your system? Or did you send that manually? |
It happens automatically and I can reproduce it every time. When the FQDN is changed to IP, full sync is performed once. |
After a few attempts, I can't reproduce the name-based auto-disconnect on my systems. Do you have more details about your steps or setup? Are you using redis-cli directly? A different redis client? A management application? What OS? Anything going through proxies or load balancers? |
The setup was done in the "2.6.32-431.29.2.el6.centos.plus.x86_64" running in VMware. There was |
It looks like a strange networking issue since the connection just goes away without any user interaction. Are timeouts set too low? If you can still reproduce the problem try increasing the Redis log level to get more details about what's happening. |
As said it happens only with FQDN. It seems that it happens during / after config rewrite (FQDN => IP) Params: set loglevel debug (on both master and slave) master: [8098] 24 Oct 20:54:28.899 * Slave asks for synchronization slave: [1100] 24 Oct 20:54:15.410 # Server started, Redis version 2.8.17 |
You mention this:
But I'm not sure that's true. Your new log shows the disconnect is caused by a user initiated SLAVEOF command. A SLAVEOF command to an ip address. Something in your monitoring system is running "SLAVEOF 10.86.4.140 6400" even though you are already connected to detta1-redis-db-901.data.det:6400. |
Also, are you using Sentinel at all? There's no mention of Sentinel above, but Sentinel will force all configurations to use IP addresses. If you are using Sentinel, you'll see an event in your sentinel log like "+fix-slave-config" — Sentinel sees your replica is connected to "detta1-redis-db-901.data.det:6400", but it doesn't know that "detta1-redis-db-901.data.det" is the correct IP address for your master. |
We should probably be forcing Redis to only act on IP addresses after name resolution anyway. Otherwise we have things like INFO output showing hostnames instead of IP addresses: master_host:poo1333.vcap.me
master_port:6379 Using hostnames here makes Sentinel unable to verify the configuration, so Sentinel re-attaches the replica to the IP address of the same host. The only change we need to make Redis use the IP of hostnames during SLAVEOF is: diff --git a/src/replication.c b/src/replication.c
index 16014c8..af68ece 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -1305,6 +1305,15 @@ void slaveofCommand(redisClient *c) {
} else {
long port;
+ char *ip = c->argv[1]->ptr;
+ char ipbuf[REDIS_IP_STR_LEN];
+ if (anetResolve(NULL, ip, ipbuf, sizeof(ipbuf)) == ANET_OK) {
+ ip = ipbuf;
+ } else {
+ addReplyError(c, "Requested hostname can't be resolved.");
+ return;
+ }
+
if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
return;
@@ -1317,7 +1326,7 @@ void slaveofCommand(redisClient *c) {
}
/* There was no previous master or the user specified a different one,
* we can continue. */
- replicationSetMaster(c->argv[1]->ptr, port);
+ replicationSetMaster(ip, port);
redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
server.masterhost, server.masterport);
} Which gives us this INFO even after a master_host:127.0.0.1
master_port:6379 The only problem with that approach: it could confuse people who think the hostname will be saved in the config for future connections. If the user later changes where their hostname points, expecting Redis to connect to the same hostname again, it won't work because Redis is already pinned (and config-saved/sentinel-saved) to the original IP address of the original hostname. So, the other way to fix it is to just throw a big "NO HOSTNAMES IN SLAVEOF" error if somebody tries to do that. The BIG ERROR approach would be: diff --git a/src/replication.c b/src/replication.c
index 16014c8..a49c8c4 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -1305,6 +1305,11 @@ void slaveofCommand(redisClient *c) {
} else {
long port;
+ if (anetResolveIP(NULL, c->argv[1]->ptr, NULL, 0) != ANET_OK) {
+ addReplyError(c, "Must use only IP address with SLAVEOF");
+ return;
+ }
+
if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
return; which results in: 127.0.0.1:9999> slaveof foo7.vcap.me 9999
(error) ERR Must use only IP address with SLAVEOF But wait—there's one more option where we mix allow config hostnames, but use IP addresses! The third fix would be: store the original host name to persist in the config file, but don't use it for live connection tracking. The live "config.masterhost" will always be an IP address but maybe a "config.originalmasterhost" would be the hostname, and the hostname is what would be persisted to the config file (but Sentinel would just check against the live IP address given at config.masterhost/INFO output). |
Yes, you right I use Sentinel (actually 4 of them). I haven't mention it as it seemed obvious for me - sorry about that. I am sure that your analysis is correct I will verify Sentinels' logs on Monday. In the initial setup, all IPs i.e. in the “slaveof” and “sentinel monitor” were provided as FQDNs. All of them were rewritten at the end. In my opinion, during the server/sentinel start-up, DNS-Lookup and some “initial” config rewrite (even if only in memory) would be the cleanest solution. This would provide consistent state as only IPs would be used and using FQDNs would be also possible. Manual commands should perform DNS-Lookup (probably on the server side) before application. Thanks to everyone for the analysis ! |
As you mentioned, I see "+fix-slave-config" in the sentinel.log [8049] 24 Oct 20:54:15.465 * +reboot slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400 |
Should I close the issue or do you need as a reference for any CR/fix ? |
Sentniel verifies configurations by IP, but if a hostname gets set as master_host, Sentniel re-syncs to the already connected master because it can't match the hostname to the IP address. The downside to this approach is the DNS resolve is blocking (we don't have an async DNS resolver), so it can potentially stall the server while the lookup happens. Fixes redis#2075 Fixes redis#2032
If we don't resolve hosts to IP addresses, Sentniel compares a slave's masterhost (potentially a hostname) against the master's slave list (only IPs). This commit resolves hostnames provided by replicas so we can string compare IP->IP instead of Hostname->IP. During any failures, Sentinel will use IP addresses to re-parent replicas or rewrite the main config, so the hostnames are for convenience only and never get persisted in the Sentinel config. If the resolve fails, we emit a "-noresolve" event and set the local IP field equal to the non-existing hostname anyway. (Note: since the Redis server is _already_ connected to that hostname, it is *very* unlikely that Sentniel can't also resolve the same hostname because Redis already had to resolve it when it ran `SLAVEOF hostname port`.) Fixes redis#2075 Fixes redis#2032
If we don't resolve hosts to IP addresses, Sentniel compares a slave's masterhost (potentially a hostname) against the master's slave list (only IPs). This commit resolves hostnames provided by replicas so we can string compare IP->IP instead of Hostname->IP. During any failures, Sentinel will use IP addresses to re-parent replicas or rewrite the main config, so the hostnames are for convenience only and never get persisted in the Sentinel config. If the resolve fails, we emit a "-noresolve" event and set the local IP field equal to the non-existing hostname anyway. (Note: since the Redis server is _already_ connected to that hostname, it is *very* unlikely that Sentniel can't also resolve the same hostname because Redis already had to resolve it when it ran `SLAVEOF hostname port`.) Fixes redis#2075 Fixes redis#2032
The issue of Sentinel and hostnames should be resolved by #8282. |
In case the host parameter is provided as a FQDN (SLAVEOF host port), slave full resync is performed twice.
slave log:
[13868] 16 Oct 11:28:30.119 # Server started, Redis version 2.8.17
[13868] 16 Oct 11:28:43.881 * DB loaded from disk: 13.762 seconds
[13868] 16 Oct 11:28:43.882 * The server is now ready to accept connections on port 6400
[13868] 16 Oct 11:28:43.888 * Connecting to MASTER detta1-redis-db-901.data.det:6400
[13868] 16 Oct 11:28:43.889 * MASTER <-> SLAVE sync started
[13868] 16 Oct 11:28:43.889 * Non blocking connect for SYNC fired the event.
[13868] 16 Oct 11:28:43.889 * Master replied to PING, replication can continue...
[13868] 16 Oct 11:28:43.890 * Partial resynchronization not possible (no cached master)
[13868] 16 Oct 11:28:43.890 * Full resync from master: de143d0176fbae863acfb19241628ff395a150e7:1
[13868] 16 Oct 11:29:01.898 * MASTER <-> SLAVE sync: receiving 1551510931 bytes from master
[13868] 16 Oct 11:29:15.473 * MASTER <-> SLAVE sync: Flushing old data
[13868] 16 Oct 11:29:18.733 * MASTER <-> SLAVE sync: Loading DB in memory
[13868] 16 Oct 11:29:35.499 * MASTER <-> SLAVE sync: Finished with success
[13868] 16 Oct 11:29:40.326 # Connection with master lost.
[13868] 16 Oct 11:29:40.326 * Caching the disconnected master state.
[13868] 16 Oct 11:29:40.326 * Discarding previously cached master state.
[13868] 16 Oct 11:29:40.326 * SLAVE OF 10.86.3.175:6400 enabled (user request)
[13868] 16 Oct 11:29:40.345 # CONFIG REWRITE executed with success.
[13868] 16 Oct 11:29:41.120 * Connecting to MASTER 10.86.3.175:6400
[13868] 16 Oct 11:29:41.120 * MASTER <-> SLAVE sync started
[13868] 16 Oct 11:29:41.120 * Non blocking connect for SYNC fired the event.
[13868] 16 Oct 11:29:41.121 * Master replied to PING, replication can continue...
[13868] 16 Oct 11:29:41.121 * Partial resynchronization not possible (no cached master)
[13868] 16 Oct 11:29:41.121 * Full resync from master: de143d0176fbae863acfb19241628ff395a150e7:17244
[13868] 16 Oct 11:29:59.130 * MASTER <-> SLAVE sync: receiving 1551510931 bytes from master
[13868] 16 Oct 11:30:09.797 * MASTER <-> SLAVE sync: Flushing old data
[13868] 16 Oct 11:30:12.871 * MASTER <-> SLAVE sync: Loading DB in memory
[13868] 16 Oct 11:30:27.986 * MASTER <-> SLAVE sync: Finished with success
master log:
[13319] 16 Oct 11:27:05.917 # Server started, Redis version 2.8.17
[13319] 16 Oct 11:27:18.744 * DB loaded from disk: 12.827 seconds
[13319] 16 Oct 11:27:18.745 * The server is now ready to accept connections on port 6400
[13319] 16 Oct 11:28:43.890 * Slave asks for synchronization
[13319] 16 Oct 11:28:43.890 * Full resync requested by slave.
[13319] 16 Oct 11:28:43.890 * Starting BGSAVE for SYNC
[13319] 16 Oct 11:28:43.892 * Background saving started by pid 13344
[13344] 16 Oct 11:29:01.838 * DB saved on disk
[13344] 16 Oct 11:29:01.849 * RDB: 24 MB of memory used by copy-on-write
[13319] 16 Oct 11:29:01.897 * Background saving terminated with success
[13319] 16 Oct 11:29:15.265 * Synchronization with slave succeeded
[13319] 16 Oct 11:29:40.326 # Connection with slave 10.86.3.176:6400 lost.
[13319] 16 Oct 11:29:41.121 * Slave asks for synchronization
[13319] 16 Oct 11:29:41.121 * Full resync requested by slave.
[13319] 16 Oct 11:29:41.121 * Starting BGSAVE for SYNC
[13319] 16 Oct 11:29:41.124 * Background saving started by pid 13346
[13346] 16 Oct 11:29:59.027 * DB saved on disk
[13346] 16 Oct 11:29:59.028 * RDB: 22 MB of memory used by copy-on-write
[13319] 16 Oct 11:29:59.129 * Background saving terminated with success
[13319] 16 Oct 11:30:09.577 * Synchronization with slave succeeded
The text was updated successfully, but these errors were encountered: