Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address flakiness of vtgate_vindex.prefixfanout tests #10216

Merged
merged 15 commits into from
May 5, 2022

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented May 4, 2022

Description

The vtgate_vindex->prefixfanout tests have been flaky, particularly when GitHub Actions is slower/has more resource contention than usual.

In this PR we make the following changes:

  1. Add code to wait for the vttablets to be seen as healthy and serving (in TestMain) by the vtgate before executing any tests
    • Correct the WaitForTabletsToHealthyInVtgate() function as it was always waiting for 1 replica tablet in each shard to be seen as healthy and serving in the vtgate but replica tablets are optional and we have none of them in the cluster used for the prefixfanout test
  2. Mark the vtgate_vindex test as heavy since even with a long wait we still seemed unable to have mysqld start at times
    • Although I later realized that this was caused by the WaitForTabletsToHealthyInVtgate() bug, this is fairly heavy so leaving this in place (can remove though if others prefer)
  3. Deal with Cluster_17 flakiness seen here too (hitting the 10 min time limit); renamed that to vtgate_general_heavy
  4. Since I was already renaming 2 CI workflows here, I went ahead and renamed 20 to xb_backup as I missed the opportunity to do that in Temp: Pin XtraBackup version used at 2.4.24 for 5.7 tests #10194 and was sad

ℹ️ NOTE: marking a CI workflow as heavy causes us to increase some key OS limits (e.g. local ephemeral port range, AIO slots, open files, etc) while also decreasing the resource usage of each mysqld

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests are not required
  • Documentation is not required

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord marked this pull request as draft May 4, 2022 18:01
@mattlord mattlord changed the title Wait for vtgate and tablets to be healthy in prefixfanout tests Address flakiness of vtgate_vindex.prefixfanout tests May 4, 2022
Sometimes GitHub Actions is *super* slow and our tests should
still be able to pass.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@vitessio vitessio deleted a comment from github-actions bot May 4, 2022
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord marked this pull request as ready for review May 4, 2022 20:14
@vitessio vitessio deleted a comment from github-actions bot May 4, 2022
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
And get related files aligned

Signed-off-by: Matt Lord <mattalord@gmail.com>
We were waiting for 1 replica tablet when the clsuter defined
for the test did not have any replica tablets.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Member

@frouioui frouioui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, I left a comment and a question.

However, the new vtgate_vindex_heavy test has a timeout. It seems like starting the VTGate makes the test timeout:

I0504 23:20:14.082316   16539 vtgate_process.go:109] Running vtgate with command: vtgate --topo_implementation etcd2 --topo_global_server_address localhost:16002 --topo_global_root /vitess/global --log_dir /tmp/vt_1342812822/vtroot_16001/tmp_16003 --log_queries_to_file /tmp/vt_1342812822/vtroot_16001/tmp_16003/vtgate_querylog.txt --port 16031 --grpc_port 16032 --mysql_server_port 16033 --mysql_server_socket_path /tmp/vt_1342812822/vtroot_16001/tmp_16003/mysql.sock --cell zone1 --cells_to_watch zone1 --tablet_types_to_wait PRIMARY,REPLICA --service_map grpc-tabletmanager,grpc-throttler,grpc-queryservice,grpc-updatestream,grpc-vtctl,grpc-vtworker,grpc-vtgateservice --mysql_auth_server_impl none --planner_version Gen4CompareV3 --health_check_interval=2s

Is the last log before the test is interrupted due to the time out. I am unsure whether or not this time out has a link with the changes made to the port range and fd limit in the workflow.

go/test/endtoend/cluster/cluster_process.go Show resolved Hide resolved
go/test/endtoend/cluster/vtgate_process.go Show resolved Hide resolved
@frouioui
Copy link
Member

frouioui commented May 4, 2022

I am unsure whether or not this time out has a link with the changes made to the port range and fd limit in the workflow.

The timeout can possibly come from:

		if err := clusterInstance.WaitForTabletsToHealthyInVtgate(); err != nil {
 			return 1
 		}

as well.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Copy link
Contributor Author

mattlord commented May 5, 2022

It looks good to me, I left a comment and a question.

However, the new vtgate_vindex_heavy test has a timeout. It seems like starting the VTGate makes the test timeout:

I0504 23:20:14.082316   16539 vtgate_process.go:109] Running vtgate with command: vtgate --topo_implementation etcd2 --topo_global_server_address localhost:16002 --topo_global_root /vitess/global --log_dir /tmp/vt_1342812822/vtroot_16001/tmp_16003 --log_queries_to_file /tmp/vt_1342812822/vtroot_16001/tmp_16003/vtgate_querylog.txt --port 16031 --grpc_port 16032 --mysql_server_port 16033 --mysql_server_socket_path /tmp/vt_1342812822/vtroot_16001/tmp_16003/mysql.sock --cell zone1 --cells_to_watch zone1 --tablet_types_to_wait PRIMARY,REPLICA --service_map grpc-tabletmanager,grpc-throttler,grpc-queryservice,grpc-updatestream,grpc-vtctl,grpc-vtworker,grpc-vtgateservice --mysql_auth_server_impl none --planner_version Gen4CompareV3 --health_check_interval=2s

Is the last log before the test is interrupted due to the time out. I am unsure whether or not this time out has a link with the changes made to the port range and fd limit in the workflow.

Just noting that this was discussed in Slack. I’m not sure why the vtgate wait didn’t timeout at 60s: https://github.com/vitessio/vitess/blob/main/go/test/endtoend/cluster/vtgate_process.go#L109-L135

We might be waiting in here:

    go func() {
        if vtgate.proc != nil {
            vtgate.exit <- vtgate.proc.Wait()
        }
    }()

Perhaps vtgate.proc.Wait() blocks and never returns because the OS process has exited (in this case vtgate exited immediately because it was being passed an invalid flag).

Perhaps another bug to fix in the future. 🙂

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord
Copy link
Contributor Author

mattlord commented May 5, 2022

The vtgate_vindex_heavy test has passed 6 times in a row now! I'm going to make a minor logging change now. This will give us one more test run and I can correct the log message. The log messages now look like this:

2022-05-05T02:11:32.9276338Z I0505 02:10:29.618259   16601 cluster_process.go:592] Vtgate started, connect to mysql using : mysql -h 127.0.0.1 -P 16033
2022-05-05T02:11:32.9278331Z I0505 02:10:29.618544   16601 vtgate_process.go:109] Running vtgate with command: vtgate --topo_implementation etcd2 --topo_global_server_address localhost:16002 --topo_global_root /vitess/global --log_dir /tmp/vt_717048359/vtroot_16001/tmp_16003 --log_queries_to_file /tmp/vt_717048359/vtroot_16001/tmp_16003/vtgate_querylog.txt --port 16031 --grpc_port 16032 --mysql_server_port 16033 --mysql_server_socket_path /tmp/vt_717048359/vtroot_16001/tmp_16003/mysql.sock --cell zone1 --cells_to_watch zone1 --tablet_types_to_wait PRIMARY,REPLICA --service_map grpc-tabletmanager,grpc-throttler,grpc-queryservice,grpc-updatestream,grpc-vtctl,grpc-vtworker,grpc-vtgateservice --mysql_auth_server_impl none --planner_version Gen4CompareV3
2022-05-05T02:11:32.9279784Z I0505 02:10:59.723618   16601 vtgate_process.go:181] Waiting for healthy status of 1 cfc_testing.-41.primary tablets in shard zone1
2022-05-05T02:11:32.9280383Z I0505 02:10:59.725485   16601 vtgate_process.go:181] Waiting for healthy status of 1 cfc_testing.41-4180.primary tablets in shard zone1
2022-05-05T02:11:32.9280975Z I0505 02:10:59.726655   16601 vtgate_process.go:181] Waiting for healthy status of 1 cfc_testing.4180-42.primary tablets in shard zone1
2022-05-05T02:11:32.9281532Z I0505 02:10:59.727817   16601 vtgate_process.go:181] Waiting for healthy status of 1 cfc_testing.42-.primary tablets in shard zone1
2022-05-05T02:11:32.9282112Z I0505 02:10:59.728965   16601 vtgate_process.go:181] Waiting for healthy status of 1 cfc_testing_md5.-c2.primary tablets in shard zone1
2022-05-05T02:11:32.9282700Z I0505 02:10:59.729899   16601 vtgate_process.go:181] Waiting for healthy status of 1 cfc_testing_md5.c2-c20a80.primary tablets in shard zone1
2022-05-05T02:11:32.9283309Z I0505 02:10:59.730821   16601 vtgate_process.go:181] Waiting for healthy status of 1 cfc_testing_md5.c20a80-d0.primary tablets in shard zone1
2022-05-05T02:11:32.9283890Z I0505 02:10:59.731723   16601 vtgate_process.go:181] Waiting for healthy status of 1 cfc_testing_md5.d0-.primary tablets in shard zone1
2022-05-05T02:11:32.9284221Z === RUN   TestCFCPrefixQueryNoHash
...

Once the test passes again for the 7th time in a row (hopefully) and I confirm the correct log messages (should be ... tablets in cell zone1) then I will merge the PR.

Signed-off-by: Matt Lord <mattalord@gmail.com>
@frouioui
Copy link
Member

frouioui commented May 5, 2022

Thanks for doing this @mattlord! It is amazing ❤️

@mattlord mattlord merged commit f7d6c0d into vitessio:main May 5, 2022
@mattlord mattlord deleted the prefixfanout_flakes branch May 5, 2022 03:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flakes: address vtgate_vindex test flakiness
2 participants