Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full resync with slave is done twice when slaveof <masterip> is a FQDN #2075

Closed
everesio opened this issue Oct 16, 2014 · 13 comments
Closed

Comments

@everesio
Copy link

In case the host parameter is provided as a FQDN (SLAVEOF host port), slave full resync is performed twice.

slave log:

[13868] 16 Oct 11:28:30.119 # Server started, Redis version 2.8.17
[13868] 16 Oct 11:28:43.881 * DB loaded from disk: 13.762 seconds
[13868] 16 Oct 11:28:43.882 * The server is now ready to accept connections on port 6400
[13868] 16 Oct 11:28:43.888 * Connecting to MASTER detta1-redis-db-901.data.det:6400
[13868] 16 Oct 11:28:43.889 * MASTER <-> SLAVE sync started
[13868] 16 Oct 11:28:43.889 * Non blocking connect for SYNC fired the event.
[13868] 16 Oct 11:28:43.889 * Master replied to PING, replication can continue...
[13868] 16 Oct 11:28:43.890 * Partial resynchronization not possible (no cached master)
[13868] 16 Oct 11:28:43.890 * Full resync from master: de143d0176fbae863acfb19241628ff395a150e7:1
[13868] 16 Oct 11:29:01.898 * MASTER <-> SLAVE sync: receiving 1551510931 bytes from master
[13868] 16 Oct 11:29:15.473 * MASTER <-> SLAVE sync: Flushing old data
[13868] 16 Oct 11:29:18.733 * MASTER <-> SLAVE sync: Loading DB in memory
[13868] 16 Oct 11:29:35.499 * MASTER <-> SLAVE sync: Finished with success
[13868] 16 Oct 11:29:40.326 # Connection with master lost.
[13868] 16 Oct 11:29:40.326 * Caching the disconnected master state.
[13868] 16 Oct 11:29:40.326 * Discarding previously cached master state.
[13868] 16 Oct 11:29:40.326 * SLAVE OF 10.86.3.175:6400 enabled (user request)
[13868] 16 Oct 11:29:40.345 # CONFIG REWRITE executed with success.
[13868] 16 Oct 11:29:41.120 * Connecting to MASTER 10.86.3.175:6400
[13868] 16 Oct 11:29:41.120 * MASTER <-> SLAVE sync started
[13868] 16 Oct 11:29:41.120 * Non blocking connect for SYNC fired the event.
[13868] 16 Oct 11:29:41.121 * Master replied to PING, replication can continue...
[13868] 16 Oct 11:29:41.121 * Partial resynchronization not possible (no cached master)
[13868] 16 Oct 11:29:41.121 * Full resync from master: de143d0176fbae863acfb19241628ff395a150e7:17244
[13868] 16 Oct 11:29:59.130 * MASTER <-> SLAVE sync: receiving 1551510931 bytes from master
[13868] 16 Oct 11:30:09.797 * MASTER <-> SLAVE sync: Flushing old data
[13868] 16 Oct 11:30:12.871 * MASTER <-> SLAVE sync: Loading DB in memory
[13868] 16 Oct 11:30:27.986 * MASTER <-> SLAVE sync: Finished with success

master log:

[13319] 16 Oct 11:27:05.917 # Server started, Redis version 2.8.17
[13319] 16 Oct 11:27:18.744 * DB loaded from disk: 12.827 seconds
[13319] 16 Oct 11:27:18.745 * The server is now ready to accept connections on port 6400
[13319] 16 Oct 11:28:43.890 * Slave asks for synchronization
[13319] 16 Oct 11:28:43.890 * Full resync requested by slave.
[13319] 16 Oct 11:28:43.890 * Starting BGSAVE for SYNC
[13319] 16 Oct 11:28:43.892 * Background saving started by pid 13344
[13344] 16 Oct 11:29:01.838 * DB saved on disk
[13344] 16 Oct 11:29:01.849 * RDB: 24 MB of memory used by copy-on-write
[13319] 16 Oct 11:29:01.897 * Background saving terminated with success
[13319] 16 Oct 11:29:15.265 * Synchronization with slave succeeded
[13319] 16 Oct 11:29:40.326 # Connection with slave 10.86.3.176:6400 lost.
[13319] 16 Oct 11:29:41.121 * Slave asks for synchronization
[13319] 16 Oct 11:29:41.121 * Full resync requested by slave.
[13319] 16 Oct 11:29:41.121 * Starting BGSAVE for SYNC
[13319] 16 Oct 11:29:41.124 * Background saving started by pid 13346
[13346] 16 Oct 11:29:59.027 * DB saved on disk
[13346] 16 Oct 11:29:59.028 * RDB: 22 MB of memory used by copy-on-write
[13319] 16 Oct 11:29:59.129 * Background saving terminated with success
[13319] 16 Oct 11:30:09.577 * Synchronization with slave succeeded

@badboy
Copy link
Contributor

badboy commented Oct 16, 2014

[13868] 16 Oct 11:29:40.326 # Connection with master lost.

For some reason it loses the (already established) connection.

[13868] 16 Oct 11:29:40.326 * SLAVE OF 10.86.3.175:6400 enabled (user request)

and then a SLAVEOF is issued again, resulting in the full resync.

Are you sure this is only due to the use of a FQDN? Some client sends a slaveof again. Is this done automatically by your system? Or did you send that manually?

@everesio
Copy link
Author

It happens automatically and I can reproduce it every time. When the FQDN is changed to IP, full sync is performed once.

@mattsta
Copy link
Contributor

mattsta commented Oct 23, 2014

It happens automatically and I can reproduce it every time.

After a few attempts, I can't reproduce the name-based auto-disconnect on my systems.

Do you have more details about your steps or setup? Are you using redis-cli directly? A different redis client? A management application? What OS? Anything going through proxies or load balancers?

@everesio
Copy link
Author

The setup was done in the "2.6.32-431.29.2.el6.centos.plus.x86_64" running in VMware. There was
only one NIC on each host, the same network and no load balancer was used.
FQDN of master was configured in the redis.conf and redis slave was started. The instances were provision by Foreman/Puppet, but during the sync there was no interaction with provisioning system (maybe apart from DNS Lookup). After the slave was killed and the rewritten config replaced with the original one, after a slave start-up, the double sync happened again.

@mattsta
Copy link
Contributor

mattsta commented Oct 23, 2014

It looks like a strange networking issue since the connection just goes away without any user interaction. Are timeouts set too low? If you can still reproduce the problem try increasing the Redis log level to get more details about what's happening.

@everesio
Copy link
Author

As said it happens only with FQDN. It seems that it happens during / after config rewrite (FQDN => IP)

Params:
timeout 300
tcp-keepalive 60
repl-timeout 180
repl-disable-tcp-nodelay no

set loglevel debug (on both master and slave)

master:

[8098] 24 Oct 20:54:28.899 * Slave asks for synchronization
[8098] 24 Oct 20:54:28.900 * Full resync requested by slave.
[8098] 24 Oct 20:54:28.900 * Starting BGSAVE for SYNC
[8098] 24 Oct 20:54:28.927 * Background saving started by pid 24219
[24219] 24 Oct 20:54:49.248 * DB saved on disk
[24219] 24 Oct 20:54:49.274 * RDB: 6 MB of memory used by copy-on-write
[8098] 24 Oct 20:54:49.404 * Background saving terminated with success
[8098] 24 Oct 20:55:04.426 * Synchronization with slave succeeded
[8098] 24 Oct 20:55:37.963 # Connection with slave 10.86.4.141:6400 lost.
[8098] 24 Oct 20:55:38.513 * Slave asks for synchronization
[8098] 24 Oct 20:55:38.513 * Full resync requested by slave.
[8098] 24 Oct 20:55:38.513 * Starting BGSAVE for SYNC
[8098] 24 Oct 20:55:38.536 * Background saving started by pid 24221
[24221] 24 Oct 20:56:00.509 * DB saved on disk
[24221] 24 Oct 20:56:00.542 * RDB: 6 MB of memory used by copy-on-write
[8098] 24 Oct 20:56:00.616 * Background saving terminated with success
[8098] 24 Oct 20:56:17.239 * Synchronization with slave succeeded

slave:

[1100] 24 Oct 20:54:15.410 # Server started, Redis version 2.8.17
[1100] 24 Oct 20:54:28.701 * DB loaded from disk: 13.291 seconds
[1100] 24 Oct 20:54:28.701 * The server is now ready to accept connections on port 6400
[1100] 24 Oct 20:54:28.894 * Connecting to MASTER detta1-redis-db-901.data.det:6400
[1100] 24 Oct 20:54:28.895 * MASTER <-> SLAVE sync started
[1100] 24 Oct 20:54:28.896 * Non blocking connect for SYNC fired the event.
[1100] 24 Oct 20:54:28.896 * Master replied to PING, replication can continue...
[1100] 24 Oct 20:54:28.897 * Partial resynchronization not possible (no cached master)
[1100] 24 Oct 20:54:28.898 * Full resync from master: 5169384220cbb44a145c047930d787e83e957477:202434404
[1100] 24 Oct 20:54:49.402 * MASTER <-> SLAVE sync: receiving 1551513934 bytes from master
[1100] 24 Oct 20:55:04.768 * MASTER <-> SLAVE sync: Flushing old data
[1100] 24 Oct 20:55:07.740 * MASTER <-> SLAVE sync: Loading DB in memory
[1100] 24 Oct 20:55:29.928 * MASTER <-> SLAVE sync: Finished with success
[1100] 24 Oct 20:55:37.960 # Connection with master lost.
[1100] 24 Oct 20:55:37.960 * Caching the disconnected master state.
[1100] 24 Oct 20:55:37.960 * Discarding previously cached master state.
[1100] 24 Oct 20:55:37.960 * SLAVE OF 10.86.4.140:6400 enabled (user request)
[1100] 24 Oct 20:55:37.974 # CONFIG REWRITE executed with success.
[1100] 24 Oct 20:55:38.509 * Connecting to MASTER 10.86.4.140:6400
[1100] 24 Oct 20:55:38.509 * MASTER <-> SLAVE sync started
[1100] 24 Oct 20:55:38.509 * Non blocking connect for SYNC fired the event.
[1100] 24 Oct 20:55:38.510 * Master replied to PING, replication can continue...
[1100] 24 Oct 20:55:38.510 * Partial resynchronization not possible (no cached master)
[1100] 24 Oct 20:55:38.511 * Full resync from master: 5169384220cbb44a145c047930d787e83e957477:202454910
[1100] 24 Oct 20:56:00.614 * MASTER <-> SLAVE sync: receiving 1551513934 bytes from master
[1100] 24 Oct 20:56:17.432 * MASTER <-> SLAVE sync: Flushing old data
[1100] 24 Oct 20:56:20.682 * MASTER <-> SLAVE sync: Loading DB in memory
[1100] 24 Oct 20:56:38.769 * MASTER <-> SLAVE sync: Finished with success

@mattsta
Copy link
Contributor

mattsta commented Oct 24, 2014

You mention this:

during the sync there was no interaction with provisioning system

But I'm not sure that's true.

Your new log shows the disconnect is caused by a user initiated SLAVEOF command. A SLAVEOF command to an ip address. Something in your monitoring system is running "SLAVEOF 10.86.4.140 6400" even though you are already connected to detta1-redis-db-901.data.det:6400.

@mattsta
Copy link
Contributor

mattsta commented Oct 24, 2014

Also, are you using Sentinel at all? There's no mention of Sentinel above, but Sentinel will force all configurations to use IP addresses.

If you are using Sentinel, you'll see an event in your sentinel log like "+fix-slave-config" — Sentinel sees your replica is connected to "detta1-redis-db-901.data.det:6400", but it doesn't know that "detta1-redis-db-901.data.det" is the correct IP address for your master.

@mattsta
Copy link
Contributor

mattsta commented Oct 24, 2014

We should probably be forcing Redis to only act on IP addresses after name resolution anyway. Otherwise we have things like INFO output showing hostnames instead of IP addresses:

master_host:poo1333.vcap.me
master_port:6379

Using hostnames here makes Sentinel unable to verify the configuration, so Sentinel re-attaches the replica to the IP address of the same host.

The only change we need to make Redis use the IP of hostnames during SLAVEOF is:

diff --git a/src/replication.c b/src/replication.c
index 16014c8..af68ece 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -1305,6 +1305,15 @@ void slaveofCommand(redisClient *c) {
     } else {
         long port;

+        char *ip = c->argv[1]->ptr;
+        char ipbuf[REDIS_IP_STR_LEN];
+        if (anetResolve(NULL, ip, ipbuf, sizeof(ipbuf)) == ANET_OK) {
+            ip = ipbuf;
+        } else {
+            addReplyError(c, "Requested hostname can't be resolved.");
+            return;
+        }
+
         if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
             return;

@@ -1317,7 +1326,7 @@ void slaveofCommand(redisClient *c) {
         }
         /* There was no previous master or the user specified a different one,
          * we can continue. */
-        replicationSetMaster(c->argv[1]->ptr, port);
+        replicationSetMaster(ip, port);
         redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
             server.masterhost, server.masterport);
     }

Which gives us this INFO even after a SLAVEOF foo9.vcap.me 6379:

master_host:127.0.0.1
master_port:6379

The only problem with that approach: it could confuse people who think the hostname will be saved in the config for future connections. If the user later changes where their hostname points, expecting Redis to connect to the same hostname again, it won't work because Redis is already pinned (and config-saved/sentinel-saved) to the original IP address of the original hostname.

So, the other way to fix it is to just throw a big "NO HOSTNAMES IN SLAVEOF" error if somebody tries to do that.

The BIG ERROR approach would be:

diff --git a/src/replication.c b/src/replication.c
index 16014c8..a49c8c4 100644
--- a/src/replication.c
+++ b/src/replication.c
@@ -1305,6 +1305,11 @@ void slaveofCommand(redisClient *c) {
     } else {
         long port;

+        if (anetResolveIP(NULL, c->argv[1]->ptr, NULL, 0) != ANET_OK) {
+            addReplyError(c, "Must use only IP address with SLAVEOF");
+            return;
+        }
+
         if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
             return;

which results in:

127.0.0.1:9999> slaveof foo7.vcap.me 9999
(error) ERR Must use only IP address with SLAVEOF

But wait—there's one more option where we mix allow config hostnames, but use IP addresses! The third fix would be: store the original host name to persist in the config file, but don't use it for live connection tracking. The live "config.masterhost" will always be an IP address but maybe a "config.originalmasterhost" would be the hostname, and the hostname is what would be persisted to the config file (but Sentinel would just check against the live IP address given at config.masterhost/INFO output).

@everesio
Copy link
Author

Yes, you right I use Sentinel (actually 4 of them). I haven't mention it as it seemed obvious for me - sorry about that. I am sure that your analysis is correct I will verify Sentinels' logs on Monday.

In the initial setup, all IPs i.e. in the “slaveof” and “sentinel monitor” were provided as FQDNs. All of them were rewritten at the end.

In my opinion, during the server/sentinel start-up, DNS-Lookup and some “initial” config rewrite (even if only in memory) would be the cleanest solution. This would provide consistent state as only IPs would be used and using FQDNs would be also possible. Manual commands should perform DNS-Lookup (probably on the server side) before application.

Thanks to everyone for the analysis !

@everesio
Copy link
Author

As you mentioned, I see "+fix-slave-config" in the sentinel.log

[8049] 24 Oct 20:54:15.465 * +reboot slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400
[8049] 24 Oct 20:54:15.533 # -sdown slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400
[8049] 24 Oct 20:55:17.985 * +fix-slave-config slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400
[8049] 24 Oct 20:55:28.062 * +fix-slave-config slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400
[8049] 24 Oct 21:22:48.923 * +reboot slave 10.86.4.141:6400 10.86.4.141 6400 @ contentexchange-cache 10.86.4.140 6400

@everesio
Copy link
Author

Should I close the issue or do you need as a reference for any CR/fix ?

mattsta added a commit to mattsta/redis that referenced this issue Oct 31, 2014
Sentniel verifies configurations by IP, but if a hostname
gets set as master_host, Sentniel re-syncs to the already
connected master because it can't match the hostname to the IP address.

The downside to this approach is the DNS resolve is blocking
(we don't have an async DNS resolver), so it can potentially
stall the server while the lookup happens.

Fixes redis#2075
Fixes redis#2032
mattsta added a commit to mattsta/redis that referenced this issue Oct 31, 2014
If we don't resolve hosts to IP addresses, Sentniel compares
a slave's masterhost (potentially a hostname) against the
master's slave list (only IPs).

This commit resolves hostnames provided by replicas so
we can string compare IP->IP instead of Hostname->IP.

During any failures, Sentinel will use IP addresses to
re-parent replicas or rewrite the main config, so the
hostnames are for convenience only and never get persisted
in the Sentinel config.

If the resolve fails, we emit a "-noresolve" event and
set the local IP field equal to the non-existing hostname
anyway.  (Note: since the Redis server is _already_ connected
to that hostname, it is *very* unlikely that Sentniel can't also
resolve the same hostname because Redis already had to resolve
it when it ran `SLAVEOF hostname port`.)

Fixes redis#2075
Fixes redis#2032
mattsta added a commit to mattsta/redis that referenced this issue Oct 31, 2014
If we don't resolve hosts to IP addresses, Sentniel compares
a slave's masterhost (potentially a hostname) against the
master's slave list (only IPs).

This commit resolves hostnames provided by replicas so
we can string compare IP->IP instead of Hostname->IP.

During any failures, Sentinel will use IP addresses to
re-parent replicas or rewrite the main config, so the
hostnames are for convenience only and never get persisted
in the Sentinel config.

If the resolve fails, we emit a "-noresolve" event and
set the local IP field equal to the non-existing hostname
anyway.  (Note: since the Redis server is _already_ connected
to that hostname, it is *very* unlikely that Sentniel can't also
resolve the same hostname because Redis already had to resolve
it when it ran `SLAVEOF hostname port`.)

Fixes redis#2075
Fixes redis#2032
@yossigo
Copy link
Member

yossigo commented Jan 28, 2021

The issue of Sentinel and hostnames should be resolved by #8282.

@yossigo yossigo closed this as completed Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants