Full resync with slave is done twice when slaveof <masterip> is a FQDN #2075

everesio · 2014-10-16T09:56:10Z

In case the host parameter is provided as a FQDN (SLAVEOF host port), slave full resync is performed twice.

slave log:

[13868] 16 Oct 11:28:30.119 # Server started, Redis version 2.8.17
[13868] 16 Oct 11:28:43.881 * DB loaded from disk: 13.762 seconds
[13868] 16 Oct 11:28:43.882 * The server is now ready to accept connections on port 6400
[13868] 16 Oct 11:28:43.888 * Connecting to MASTER detta1-redis-db-901.data.det:6400
[13868] 16 Oct 11:28:43.889 * MASTER <-> SLAVE sync started
[13868] 16 Oct 11:28:43.889 * Non blocking connect for SYNC fired the event.
[13868] 16 Oct 11:28:43.889 * Master replied to PING, replication can continue...
[13868] 16 Oct 11:28:43.890 * Partial resynchronization not possible (no cached master)
[13868] 16 Oct 11:28:43.890 * Full resync from master: de143d0176fbae863acfb19241628ff395a150e7:1
[13868] 16 Oct 11:29:01.898 * MASTER <-> SLAVE sync: receiving 1551510931 bytes from master
[13868] 16 Oct 11:29:15.473 * MASTER <-> SLAVE sync: Flushing old data
[13868] 16 Oct 11:29:18.733 * MASTER <-> SLAVE sync: Loading DB in memory
[13868] 16 Oct 11:29:35.499 * MASTER <-> SLAVE sync: Finished with success
[13868] 16 Oct 11:29:40.326 # Connection with master lost.
[13868] 16 Oct 11:29:40.326 * Caching the disconnected master state.
[13868] 16 Oct 11:29:40.326 * Discarding previously cached master state.
[13868] 16 Oct 11:29:40.326 * SLAVE OF 10.86.3.175:6400 enabled (user request)
[13868] 16 Oct 11:29:40.345 # CONFIG REWRITE executed with success.
[13868] 16 Oct 11:29:41.120 * Connecting to MASTER 10.86.3.175:6400
[13868] 16 Oct 11:29:41.120 * MASTER <-> SLAVE sync started
[13868] 16 Oct 11:29:41.120 * Non blocking connect for SYNC fired the event.
[13868] 16 Oct 11:29:41.121 * Master replied to PING, replication can continue...
[13868] 16 Oct 11:29:41.121 * Partial resynchronization not possible (no cached master)
[13868] 16 Oct 11:29:41.121 * Full resync from master: de143d0176fbae863acfb19241628ff395a150e7:17244
[13868] 16 Oct 11:29:59.130 * MASTER <-> SLAVE sync: receiving 1551510931 bytes from master
[13868] 16 Oct 11:30:09.797 * MASTER <-> SLAVE sync: Flushing old data
[13868] 16 Oct 11:30:12.871 * MASTER <-> SLAVE sync: Loading DB in memory
[13868] 16 Oct 11:30:27.986 * MASTER <-> SLAVE sync: Finished with success

master log:

[13319] 16 Oct 11:27:05.917 # Server started, Redis version 2.8.17
[13319] 16 Oct 11:27:18.744 * DB loaded from disk: 12.827 seconds
[13319] 16 Oct 11:27:18.745 * The server is now ready to accept connections on port 6400
[13319] 16 Oct 11:28:43.890 * Slave asks for synchronization
[13319] 16 Oct 11:28:43.890 * Full resync requested by slave.
[13319] 16 Oct 11:28:43.890 * Starting BGSAVE for SYNC
[13319] 16 Oct 11:28:43.892 * Background saving started by pid 13344
[13344] 16 Oct 11:29:01.838 * DB saved on disk
[13344] 16 Oct 11:29:01.849 * RDB: 24 MB of memory used by copy-on-write
[13319] 16 Oct 11:29:01.897 * Background saving terminated with success
[13319] 16 Oct 11:29:15.265 * Synchronization with slave succeeded
[13319] 16 Oct 11:29:40.326 # Connection with slave 10.86.3.176:6400 lost.
[13319] 16 Oct 11:29:41.121 * Slave asks for synchronization
[13319] 16 Oct 11:29:41.121 * Full resync requested by slave.
[13319] 16 Oct 11:29:41.121 * Starting BGSAVE for SYNC
[13319] 16 Oct 11:29:41.124 * Background saving started by pid 13346
[13346] 16 Oct 11:29:59.027 * DB saved on disk
[13346] 16 Oct 11:29:59.028 * RDB: 22 MB of memory used by copy-on-write
[13319] 16 Oct 11:29:59.129 * Background saving terminated with success
[13319] 16 Oct 11:30:09.577 * Synchronization with slave succeeded

badboy · 2014-10-16T11:32:32Z

[13868] 16 Oct 11:29:40.326 # Connection with master lost.

For some reason it loses the (already established) connection.

[13868] 16 Oct 11:29:40.326 * SLAVE OF 10.86.3.175:6400 enabled (user request)

and then a SLAVEOF is issued again, resulting in the full resync.

Are you sure this is only due to the use of a FQDN? Some client sends a slaveof again. Is this done automatically by your system? Or did you send that manually?

everesio · 2014-10-16T11:35:57Z

It happens automatically and I can reproduce it every time. When the FQDN is changed to IP, full sync is performed once.

mattsta · 2014-10-23T16:15:30Z

It happens automatically and I can reproduce it every time.

After a few attempts, I can't reproduce the name-based auto-disconnect on my systems.

Do you have more details about your steps or setup? Are you using redis-cli directly? A different redis client? A management application? What OS? Anything going through proxies or load balancers?

everesio · 2014-10-23T19:09:40Z

The setup was done in the "2.6.32-431.29.2.el6.centos.plus.x86_64" running in VMware. There was
only one NIC on each host, the same network and no load balancer was used.
FQDN of master was configured in the redis.conf and redis slave was started. The instances were provision by Foreman/Puppet, but during the sync there was no interaction with provisioning system (maybe apart from DNS Lookup). After the slave was killed and the rewritten config replaced with the original one, after a slave start-up, the double sync happened again.

mattsta · 2014-10-23T20:26:50Z

It looks like a strange networking issue since the connection just goes away without any user interaction. Are timeouts set too low? If you can still reproduce the problem try increasing the Redis log level to get more details about what's happening.

everesio · 2014-10-24T19:03:09Z

As said it happens only with FQDN. It seems that it happens during / after config rewrite (FQDN => IP)

Params:
timeout 300
tcp-keepalive 60
repl-timeout 180
repl-disable-tcp-nodelay no

set loglevel debug (on both master and slave)

master:

[8098] 24 Oct 20:54:28.899 * Slave asks for synchronization
[8098] 24 Oct 20:54:28.900 * Full resync requested by slave.
[8098] 24 Oct 20:54:28.900 * Starting BGSAVE for SYNC
[8098] 24 Oct 20:54:28.927 * Background saving started by pid 24219
[24219] 24 Oct 20:54:49.248 * DB saved on disk
[24219] 24 Oct 20:54:49.274 * RDB: 6 MB of memory used by copy-on-write
[8098] 24 Oct 20:54:49.404 * Background saving terminated with success
[8098] 24 Oct 20:55:04.426 * Synchronization with slave succeeded
[8098] 24 Oct 20:55:37.963 # Connection with slave 10.86.4.141:6400 lost.
[8098] 24 Oct 20:55:38.513 * Slave asks for synchronization
[8098] 24 Oct 20:55:38.513 * Full resync requested by slave.
[8098] 24 Oct 20:55:38.513 * Starting BGSAVE for SYNC
[8098] 24 Oct 20:55:38.536 * Background saving started by pid 24221
[24221] 24 Oct 20:56:00.509 * DB saved on disk
[24221] 24 Oct 20:56:00.542 * RDB: 6 MB of memory used by copy-on-write
[8098] 24 Oct 20:56:00.616 * Background saving terminated with success
[8098] 24 Oct 20:56:17.239 * Synchronization with slave succeeded

slave:

[1100] 24 Oct 20:54:15.410 # Server started, Redis version 2.8.17
[1100] 24 Oct 20:54:28.701 * DB loaded from disk: 13.291 seconds
[1100] 24 Oct 20:54:28.701 * The server is now ready to accept connections on port 6400
[1100] 24 Oct 20:54:28.894 * Connecting to MASTER detta1-redis-db-901.data.det:6400
[1100] 24 Oct 20:54:28.895 * MASTER <-> SLAVE sync started
[1100] 24 Oct 20:54:28.896 * Non blocking connect for SYNC fired the event.
[1100] 24 Oct 20:54:28.896 * Master replied to PING, replication can continue...
[1100] 24 Oct 20:54:28.897 * Partial resynchronization not possible (no cached master)
[1100] 24 Oct 20:54:28.898 * Full resync from master: 5169384220cbb44a145c047930d787e83e957477:202434404
[1100] 24 Oct 20:54:49.402 * MASTER <-> SLAVE sync: receiving 1551513934 bytes from master
[1100] 24 Oct 20:55:04.768 * MASTER <-> SLAVE sync: Flushing old data
[1100] 24 Oct 20:55:07.740 * MASTER <-> SLAVE sync: Loading DB in memory
[1100] 24 Oct 20:55:29.928 * MASTER <-> SLAVE sync: Finished with success
[1100] 24 Oct 20:55:37.960 # Connection with master lost.
[1100] 24 Oct 20:55:37.960 * Caching the disconnected master state.
[1100] 24 Oct 20:55:37.960 * Discarding previously cached master state.
[1100] 24 Oct 20:55:37.960 * SLAVE OF 10.86.4.140:6400 enabled (user request)
[1100] 24 Oct 20:55:37.974 # CONFIG REWRITE executed with success.
[1100] 24 Oct 20:55:38.509 * Connecting to MASTER 10.86.4.140:6400
[1100] 24 Oct 20:55:38.509 * MASTER <-> SLAVE sync started
[1100] 24 Oct 20:55:38.509 * Non blocking connect for SYNC fired the event.
[1100] 24 Oct 20:55:38.510 * Master replied to PING, replication can continue...
[1100] 24 Oct 20:55:38.510 * Partial resynchronization not possible (no cached master)
[1100] 24 Oct 20:55:38.511 * Full resync from master: 5169384220cbb44a145c047930d787e83e957477:202454910
[1100] 24 Oct 20:56:00.614 * MASTER <-> SLAVE sync: receiving 1551513934 bytes from master
[1100] 24 Oct 20:56:17.432 * MASTER <-> SLAVE sync: Flushing old data
[1100] 24 Oct 20:56:20.682 * MASTER <-> SLAVE sync: Loading DB in memory
[1100] 24 Oct 20:56:38.769 * MASTER <-> SLAVE sync: Finished with success

mattsta · 2014-10-24T19:21:49Z

You mention this:

during the sync there was no interaction with provisioning system

But I'm not sure that's true.

Your new log shows the disconnect is caused by a user initiated SLAVEOF command. A SLAVEOF command to an ip address. Something in your monitoring system is running "SLAVEOF 10.86.4.140 6400" even though you are already connected to detta1-redis-db-901.data.det:6400.

mattsta · 2014-10-24T20:14:10Z

Also, are you using Sentinel at all? There's no mention of Sentinel above, but Sentinel will force all configurations to use IP addresses.

If you are using Sentinel, you'll see an event in your sentinel log like "+fix-slave-config" — Sentinel sees your replica is connected to "detta1-redis-db-901.data.det:6400", but it doesn't know that "detta1-redis-db-901.data.det" is the correct IP address for your master.

mattsta · 2014-10-24T20:43:24Z

We should probably be forcing Redis to only act on IP addresses after name resolution anyway. Otherwise we have things like INFO output showing hostnames instead of IP addresses:

master_host:poo1333.vcap.me
master_port:6379

Using hostnames here makes Sentinel unable to verify the configuration, so Sentinel re-attaches the replica to the IP address of the same host.

The only change we need to make Redis use the IP of hostnames during SLAVEOF is:

diff --git a/src/replication.c b/src/replication.c
index 16014c8..af68ece 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -1305,6 +1305,15 @@ void slaveofCommand(redisClient *c) {
     } else {
         long port;

+        char *ip = c->argv[1]->ptr;
+        char ipbuf[REDIS_IP_STR_LEN];
+        if (anetResolve(NULL, ip, ipbuf, sizeof(ipbuf)) == ANET_OK) {
+            ip = ipbuf;
+        } else {
+            addReplyError(c, "Requested hostname can't be resolved.");
+            return;
+        }
+
         if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
             return;

@@ -1317,7 +1326,7 @@ void slaveofCommand(redisClient *c) {
         }
         /* There was no previous master or the user specified a different one,
          * we can continue. */
-        replicationSetMaster(c->argv[1]->ptr, port);
+        replicationSetMaster(ip, port);
         redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
             server.masterhost, server.masterport);
     }

Which gives us this INFO even after a SLAVEOF foo9.vcap.me 6379:

master_host:127.0.0.1
master_port:6379

The only problem with that approach: it could confuse people who think the hostname will be saved in the config for future connections. If the user later changes where their hostname points, expecting Redis to connect to the same hostname again, it won't work because Redis is already pinned (and config-saved/sentinel-saved) to the original IP address of the original hostname.

So, the other way to fix it is to just throw a big "NO HOSTNAMES IN SLAVEOF" error if somebody tries to do that.

The BIG ERROR approach would be:

diff --git a/src/replication.c b/src/replication.c
index 16014c8..a49c8c4 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -1305,6 +1305,11 @@ void slaveofCommand(redisClient *c) {
     } else {
         long port;

+        if (anetResolveIP(NULL, c->argv[1]->ptr, NULL, 0) != ANET_OK) {
+            addReplyError(c, "Must use only IP address with SLAVEOF");
+            return;
+        }
+
         if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
             return;

which results in:

127.0.0.1:9999> slaveof foo7.vcap.me 9999
(error) ERR Must use only IP address with SLAVEOF

But wait—there's one more option where we mix allow config hostnames, but use IP addresses! The third fix would be: store the original host name to persist in the config file, but don't use it for live connection tracking. The live "config.masterhost" will always be an IP address but maybe a "config.originalmasterhost" would be the hostname, and the hostname is what would be persisted to the config file (but Sentinel would just check against the live IP address given at config.masterhost/INFO output).

everesio · 2014-10-24T22:40:18Z

Yes, you right I use Sentinel (actually 4 of them). I haven't mention it as it seemed obvious for me - sorry about that. I am sure that your analysis is correct I will verify Sentinels' logs on Monday.

In the initial setup, all IPs i.e. in the “slaveof” and “sentinel monitor” were provided as FQDNs. All of them were rewritten at the end.

In my opinion, during the server/sentinel start-up, DNS-Lookup and some “initial” config rewrite (even if only in memory) would be the cleanest solution. This would provide consistent state as only IPs would be used and using FQDNs would be also possible. Manual commands should perform DNS-Lookup (probably on the server side) before application.

Thanks to everyone for the analysis !

everesio · 2014-10-27T09:54:56Z

As you mentioned, I see "+fix-slave-config" in the sentinel.log

[8049] 24 Oct 20:54:15.465 * +reboot slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400
[8049] 24 Oct 20:54:15.533 # -sdown slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400
[8049] 24 Oct 20:55:17.985 * +fix-slave-config slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400
[8049] 24 Oct 20:55:28.062 * +fix-slave-config slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400
[8049] 24 Oct 21:22:48.923 * +reboot slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400

everesio · 2014-10-27T11:16:56Z

Should I close the issue or do you need as a reference for any CR/fix ?

Sentniel verifies configurations by IP, but if a hostname gets set as master_host, Sentniel re-syncs to the already connected master because it can't match the hostname to the IP address. The downside to this approach is the DNS resolve is blocking (we don't have an async DNS resolver), so it can potentially stall the server while the lookup happens. Fixes redis#2075 Fixes redis#2032

If we don't resolve hosts to IP addresses, Sentniel compares a slave's masterhost (potentially a hostname) against the master's slave list (only IPs). This commit resolves hostnames provided by replicas so we can string compare IP->IP instead of Hostname->IP. During any failures, Sentinel will use IP addresses to re-parent replicas or rewrite the main config, so the hostnames are for convenience only and never get persisted in the Sentinel config. If the resolve fails, we emit a "-noresolve" event and set the local IP field equal to the non-existing hostname anyway. (Note: since the Redis server is _already_ connected to that hostname, it is *very* unlikely that Sentniel can't also resolve the same hostname because Redis already had to resolve it when it ran `SLAVEOF hostname port`.) Fixes redis#2075 Fixes redis#2032

yossigo · 2021-01-28T10:46:42Z

The issue of Sentinel and hostnames should be resolved by #8282.

mattsta mentioned this issue Oct 29, 2014

sentinel can't use a domain name for master #2032

Closed

mattsta mentioned this issue Oct 31, 2014

Sentinel: resolve master_host to IP address #2118

Closed

mattsta mentioned this issue Feb 19, 2015

CLUSTER MEET doesn't allow hostname to be used #2410

Closed

mabrarov mentioned this issue Jan 10, 2021

in a K8s deployment, Sentinel Can't update the IP of the old master when the old master get back online again with different IP #8300

Closed

yossigo closed this as completed Jan 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Full resync with slave is done twice when slaveof <masterip> is a FQDN #2075

Full resync with slave is done twice when slaveof <masterip> is a FQDN #2075

everesio commented Oct 16, 2014

badboy commented Oct 16, 2014

everesio commented Oct 16, 2014

mattsta commented Oct 23, 2014

everesio commented Oct 23, 2014

mattsta commented Oct 23, 2014

everesio commented Oct 24, 2014

mattsta commented Oct 24, 2014

mattsta commented Oct 24, 2014

mattsta commented Oct 24, 2014

everesio commented Oct 24, 2014

everesio commented Oct 27, 2014

everesio commented Oct 27, 2014

yossigo commented Jan 28, 2021

Full resync with slave is done twice when slaveof <masterip> is a FQDN #2075

Full resync with slave is done twice when slaveof <masterip> is a FQDN #2075

Comments

everesio commented Oct 16, 2014

badboy commented Oct 16, 2014

everesio commented Oct 16, 2014

mattsta commented Oct 23, 2014

everesio commented Oct 23, 2014

mattsta commented Oct 23, 2014

everesio commented Oct 24, 2014

mattsta commented Oct 24, 2014

mattsta commented Oct 24, 2014

mattsta commented Oct 24, 2014

everesio commented Oct 24, 2014

everesio commented Oct 27, 2014

everesio commented Oct 27, 2014

yossigo commented Jan 28, 2021