Filter deleted Aurora replicas from auto-discovery #2336

Merged

Conversation

@wjordan
Contributor

wjordan commented Oct 21, 2019

Aurora instances that have been deleted still exist in the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, but with REPLICA_LAG_IN_MILLISECONDS set to 900000. This causes the current auto-discovery query to keep adding entries to runtime_mysql_servers for deleted instances, filling the log with unnecessary connection-failure errors as it keeps trying to connect to an instance that no longer exists.

To allow auto-discovery to remove these deleted instances from the server list, this PR adds AND REPLICA_LAG_IN_MILLISECONDS != 900000 to the monitor query, which excludes the deleted instances from the query results.

I haven't run this code change through the existing Aurora automated tests, but I have manually run the new query on an existing Aurora cluster (version 5.7.mysql_aurora.2.04.6) with a deleted replica instance to verify that it returns the correct results:

Old query:

mysql> SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS WHERE REPLICA_LAG_IN_MILLISECONDS > 0 OR SESSION_ID = 'MASTER_SESSION_ID' ORDER BY SERVER_ID;
+----------------+--------------------------------------+----------------------------+-----------------------------+--------------------+
| SERVER_ID      | SESSION_ID                           | LAST_UPDATE_TIMESTAMP      | REPLICA_LAG_IN_MILLISECONDS | CPU                |
+----------------+--------------------------------------+----------------------------+-----------------------------+--------------------+
| staging-0      | MASTER_SESSION_ID                    | 2019-10-21 21:12:31.836369 |                           0 |                0.5 |
| staging-1      | [id]                                 | 2019-10-21 21:12:31.855021 |          19.878999710083008 | 1.8181818723678589 |
| staging-master | [id]                                 | 2019-10-21 19:39:56.000000 |                      900000 | 12.087912559509277 |
+----------------+--------------------------------------+----------------------------+-----------------------------+--------------------+
3 rows in set (0.00 sec)

New query:

mysql> SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS WHERE (REPLICA_LAG_IN_MILLISECONDS > 0 AND REPLICA_LAG_IN_MILLISECONDS != 900000) OR SESSION_ID = 'MASTER_SESSION_ID' ORDER BY SERVER_ID;
+-----------+--------------------------------------+----------------------------+-----------------------------+--------------------+
| SERVER_ID | SESSION_ID                           | LAST_UPDATE_TIMESTAMP      | REPLICA_LAG_IN_MILLISECONDS | CPU                |
+-----------+--------------------------------------+----------------------------+-----------------------------+--------------------+
| staging-0 | MASTER_SESSION_ID                    | 2019-10-21 21:14:00.018776 |                           0 | 2.6315789222717285 |
| staging-1 | [id]                                 | 2019-10-21 21:13:59.197106 |          19.409000396728516 |  4.142011642456055 |
+-----------+--------------------------------------+----------------------------+-----------------------------+--------------------+
2 rows in set (0.00 sec)

Aurora instances that have been deleted still exist in the
`INFORMATION_SCHEMA.REPLICA_HOST_STATUS` table, but with a
`REPLICA_LAG_IN_MILLISECONDS` set to 900000.
To allow auto discovery to remove these deleted instances from the
server list, add `AND REPLICA_LAG_IN_MILLISECONDS != 900000` to the
monitor query.
@pondix
Contributor

pondix commented Oct 21, 2019

Automated message: PR pending admin approval for build testing

@renecannao
Contributor

Hi @wjordan.
Thank you for the PR!
I couldn't find this detail in AWS documentation:

Aurora instances that have been deleted still exist in the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, but with a REPLICA_LAG_IN_MILLISECONDS set to 900000.

Instead of filtering out the specific value 900000, does it sound reasonable to filter out any value greater than a given threshold?
Because mysql_aws_aurora_hostgroups has:

max_lag_ms INT NOT NULL CHECK (max_lag_ms>= 10 AND max_lag_ms <= 600000) DEFAULT 600000

instead of adding AND REPLICA_LAG_IN_MILLISECONDS != 900000 we could add AND REPLICA_LAG_IN_MILLISECONDS <= 600000.
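
For reference, the new query from this PR with that threshold applied would look roughly like this (just a sketch of the suggested change, not a query I've run against a cluster):

SELECT SERVER_ID, SESSION_ID, LAST_UPDATE_TIMESTAMP, REPLICA_LAG_IN_MILLISECONDS, CPU
FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS
-- deleted replicas report REPLICA_LAG_IN_MILLISECONDS = 900000, well above the 600000 cap,
-- so the <= threshold drops them along with any replica lagging too far to be usable
WHERE (REPLICA_LAG_IN_MILLISECONDS > 0 AND REPLICA_LAG_IN_MILLISECONDS <= 600000)
   OR SESSION_ID = 'MASTER_SESSION_ID'
ORDER BY SERVER_ID;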

Thoughts?

@wjordan
Contributor Author

wjordan commented Oct 23, 2019

I couldn't find this detail in AWS documentation:

Neither could I; it was only a guess that 900000 is a special value meaning 'deleted instance'. Since the product is closed source, I've submitted a support request to clarify the expected behavior around this edge case, and I'll update when I hear back. It's probably safer to get an official, stable word on this than to make guesses about an undocumented API based on current observation!

@wjordan
Contributor Author

wjordan commented Oct 23, 2019

That said, AND REPLICA_LAG_IN_MILLISECONDS <= 600000 sounds reasonable based on the existing CHECK constraint you mentioned, so it would be fine by me if you wanted to make that change right away 👍

renecannao merged commit c00d24d into sysown:v2.0.8 Oct 27, 2019
renecannao added a commit that referenced this pull request Oct 27, 2019
@renecannao
Contributor

Hi @wjordan.
I merged your PR and made the minor change mentioned previously.
Thank you!

@wjordan
Contributor Author

wjordan commented Oct 31, 2019

@renecannao just following up with the response I received from AWS support on the current Aurora-MySQL behavior. They confirmed that the current behavior with deleted replicas is a bug and not intentional, which means the behavior described above is not stable or reliable and could change when the bug is fixed in a future release. I think the PR you merged should be safe enough as a workaround until the bug is eventually fixed, however.
