Filter deleted Aurora replicas from auto-discovery #2336

Merged

Conversation

@wjordan
Contributor

wjordan commented Oct 21, 2019

Aurora instances that have been deleted still exist in the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, but with REPLICA_LAG_IN_MILLISECONDS set to 900000. This causes the current auto-discovery query to keep adding entries to runtime_mysql_servers for deleted instances, filling the log with unnecessary connection-failure errors as it keeps trying to connect to an instance that no longer exists.

To allow auto-discovery to remove these deleted instances from the server list, this PR adds AND REPLICA_LAG_IN_MILLISECONDS != 900000 to the monitor query, which excludes the deleted instances from the query results.

I haven't run this code change through the existing Aurora automated tests, but I have manually run the new query on an existing Aurora cluster (version 5.7.mysql_aurora.2.04.6) with a deleted replica instance to verify that it returns the correct results:

Old query:

mysql> SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS WHERE REPLICA_LAG_IN_MILLISECONDS > 0 OR SESSION_ID = 'MASTER_SESSION_ID' ORDER BY SERVER_ID;
+----------------+--------------------------------------+----------------------------+-----------------------------+--------------------+
| SERVER_ID      | SESSION_ID                           | LAST_UPDATE_TIMESTAMP      | REPLICA_LAG_IN_MILLISECONDS | CPU                |
+----------------+--------------------------------------+----------------------------+-----------------------------+--------------------+
| staging-0      | MASTER_SESSION_ID                    | 2019-10-21 21:12:31.836369 |                           0 |                0.5 |
| staging-1      | [id]                                 | 2019-10-21 21:12:31.855021 |          19.878999710083008 | 1.8181818723678589 |
| staging-master | [id]                                 | 2019-10-21 19:39:56.000000 |                      900000 | 12.087912559509277 |
+----------------+--------------------------------------+----------------------------+-----------------------------+--------------------+
3 rows in set (0.00 sec)

New query:

mysql> SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS WHERE (REPLICA_LAG_IN_MILLISECONDS > 0 AND REPLICA_LAG_IN_MILLISECONDS != 900000) OR SESSION_ID = 'MASTER_SESSION_ID' ORDER BY SERVER_ID;
+-----------+--------------------------------------+----------------------------+-----------------------------+--------------------+
| SERVER_ID | SESSION_ID                           | LAST_UPDATE_TIMESTAMP      | REPLICA_LAG_IN_MILLISECONDS | CPU                |
+-----------+--------------------------------------+----------------------------+-----------------------------+--------------------+
| staging-0 | MASTER_SESSION_ID                    | 2019-10-21 21:14:00.018776 |                           0 | 2.6315789222717285 |
| staging-1 | [id]                                 | 2019-10-21 21:13:59.197106 |          19.409000396728516 |  4.142011642456055 |
+-----------+--------------------------------------+----------------------------+-----------------------------+--------------------+
2 rows in set (0.00 sec)

Aurora instances that have been deleted still exist in the
`INFORMATION_SCHEMA.REPLICA_HOST_STATUS` table, but with a
`REPLICA_LAG_IN_MILLISECONDS` set to 900000.
To allow auto discovery to remove these deleted instances from the
server list, add `AND REPLICA_LAG_IN_MILLISECONDS != 900000` to the
monitor query.
@pondix
Contributor

pondix commented Oct 21, 2019

Automated message: PR pending admin approval for build testing

@renecannao
Contributor

Hi @wjordan.
Thank you for the PR!
I couldn't find this detail in AWS documentation:

Aurora instances that have been deleted still exist in the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, but with a REPLICA_LAG_IN_MILLISECONDS set to 900000.

Instead of filtering out the specific value 900000, does it sound reasonable to filter out any value greater than a given threshold?
Because mysql_aws_aurora_hostgroups has:

max_lag_ms INT NOT NULL CHECK (max_lag_ms>= 10 AND max_lag_ms <= 600000) DEFAULT 600000

instead of adding AND REPLICA_LAG_IN_MILLISECONDS != 900000 we could add AND REPLICA_LAG_IN_MILLISECONDS <= 600000.
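
For reference, the new query from this PR with that threshold applied would look roughly like this (just a sketch of the suggested change, not a query I've run against a cluster):

SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU
FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS
-- deleted replicas report REPLICA_LAG_IN_MILLISECONDS = 900000, well above the 600000 cap,
-- so the <= threshold drops them along with any replica lagging too far to be usable
WHERE (REPLICA_LAG_IN_MILLISECONDS > 0 AND REPLICA_LAG_IN_MILLISECONDS <= 600000)
   OR SESSION_ID = 'MASTER_SESSION_ID'
ORDER BY SERVER_ID;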

Thoughts?

@wjordan
Contributor Author

wjordan commented Oct 23, 2019

I couldn't find this detail in AWS documentation:

Neither could I; it was only a guess that 900000 is a special value meaning 'deleted instance'. Since the product is closed source, I've submitted a support request to clarify the expected behavior around this edge case, and I'll update when I hear back. It's probably safer to get an official, stable word on this than to make guesses about an undocumented API based on current observation!

@wjordan
Contributor Author

wjordan commented Oct 23, 2019

That said, AND REPLICA_LAG_IN_MILLISECONDS <= 600000 sounds reasonable based on the existing CHECK constraint you mentioned, so it would be fine by me if you wanted to make that change right away 👍

renecannao merged commit c00d24d into sysown:v2.0.8 Oct 27, 2019
renecannao added a commit that referenced this pull request Oct 27, 2019
@renecannao
Contributor

Hi @wjordan.
I merged your PR and made the minor change mentioned previously.
Thank you!

@wjordan
Contributor Author

wjordan commented Oct 31, 2019

@renecannao just following up with the response I received from AWS support on the current Aurora-MySQL behavior. They confirmed that the current behavior with deleted replicas is a bug and not intentional, which means the behavior described above is not stable or reliable and could change when the bug is fixed in a future release. I think the PR you merged should be safe enough as a workaround until the bug is eventually fixed, however.
