Address flakiness of vtgate_vindex.prefixfanout tests #10216
Signed-off-by: Matt Lord <mattalord@gmail.com>
Sometimes GitHub Actions is *super* slow and our tests should still be able to pass. Signed-off-by: Matt Lord <mattalord@gmail.com>
And get related files aligned Signed-off-by: Matt Lord <mattalord@gmail.com>
We were waiting for 1 replica tablet when the cluster defined for the test did not have any replica tablets. Signed-off-by: Matt Lord <mattalord@gmail.com>
It looks good to me, I left a comment and a question.
However, the new `vtgate_vindex_heavy` test has a timeout. It seems like starting the VTGate makes the test time out:
I0504 23:20:14.082316 16539 vtgate_process.go:109] Running vtgate with command: vtgate --topo_implementation etcd2 --topo_global_server_address localhost:16002 --topo_global_root /vitess/global --log_dir /tmp/vt_1342812822/vtroot_16001/tmp_16003 --log_queries_to_file /tmp/vt_1342812822/vtroot_16001/tmp_16003/vtgate_querylog.txt --port 16031 --grpc_port 16032 --mysql_server_port 16033 --mysql_server_socket_path /tmp/vt_1342812822/vtroot_16001/tmp_16003/mysql.sock --cell zone1 --cells_to_watch zone1 --tablet_types_to_wait PRIMARY,REPLICA --service_map grpc-tabletmanager,grpc-throttler,grpc-queryservice,grpc-updatestream,grpc-vtctl,grpc-vtworker,grpc-vtgateservice --mysql_auth_server_impl none --planner_version Gen4CompareV3 --health_check_interval=2s
This is the last log line before the test is interrupted due to the timeout. I am unsure whether this timeout is linked to the changes made to the port range and fd limit in the workflow.
The timeout could also come from: `if err := clusterInstance.WaitForTabletsToHealthyInVtgate(); err != nil { return 1 }`.
Just noting that this was discussed in Slack. I’m not sure why the We might be waiting in here:
Perhaps Perhaps another bug to fix in the future. 🙂 |
The
Once the test passes again for the 7th time in a row (hopefully) and I confirm the correct log messages (should be |
Thanks for doing this @mattlord! It is amazing ❤️
Description
The `vtgate_vindex`->`prefixfanout` tests have been flaky, particularly when GitHub Actions is slower/has more resource contention than usual.

In this PR we make the following changes:
- Wait for the `vttablet`s to be seen as healthy and serving (in `TestMain`) by the `vtgate` before executing any tests
- Fix the `WaitForTabletsToHealthyInVtgate()` function, as it was always waiting for 1 `replica` tablet in each shard to be seen as healthy and serving in the `vtgate`, but `replica` tablets are optional and we have none of them in the cluster used for the `prefixfanout` test
- Mark the `vtgate_vindex` test as heavy, since even with a long wait we still seemed unable to have mysqld start at times
  - Even with the `WaitForTabletsToHealthyInVtgate()` bug fixed, this is fairly heavy, so leaving this in place (can remove though if others prefer)
- `Cluster_17` flakiness seen here too (hitting the 10 min time limit); renamed that to `vtgate_general_heavy`
- Rename `20` to `xb_backup`, as I missed the opportunity to do that in "Temp: Pin XtraBackup version used at 2.4.24 for 5.7 tests" #10194 and was sad

ℹ️ NOTE: marking a CI workflow as heavy causes us to increase some key OS limits (e.g. local ephemeral port range, AIO slots, open files, etc.) while also decreasing the resource usage of each mysqld
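The `WaitForTabletsToHealthyInVtgate()` change described above comes down to waiting only for the tablet types a shard actually contains. The following is a minimal sketch of that idea, not the code from this PR: the `Tablet` and `Shard` types and the `expectedHealthyCounts` function are hypothetical, simplified stand-ins for the real test cluster structures.

```go
package main

import "fmt"

// Tablet is a hypothetical, simplified stand-in for a test cluster tablet.
type Tablet struct {
	Alias string
	Type  string // e.g. "primary", "replica", "rdonly"
}

// Shard is a hypothetical, simplified stand-in for a shard definition.
type Shard struct {
	Name    string
	Tablets []Tablet
}

// expectedHealthyCounts returns, per tablet type, how many tablets of that
// type the shard defines. Types with no tablets are absent from the map, so
// a caller iterating over it never waits for a replica that does not exist.
func expectedHealthyCounts(s Shard) map[string]int {
	counts := make(map[string]int)
	for _, t := range s.Tablets {
		counts[t.Type]++
	}
	return counts
}

func main() {
	// A shard with only a primary, like the cluster described for the
	// prefixfanout test (aliases here are made up for illustration).
	shard := Shard{
		Name:    "-80",
		Tablets: []Tablet{{Alias: "zone1-100", Type: "primary"}},
	}
	for typ, n := range expectedHealthyCounts(shard) {
		fmt.Printf("shard %s: wait for %d %s tablet(s)\n", shard.Name, n, typ)
	}
}
```

Deriving the expected counts from the cluster definition, rather than hard-coding one replica per shard, keeps the wait condition satisfiable for clusters that define only primaries.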
Related Issue(s)
Checklist