test fix for mysql start timeout [WIP] [DO NOT MERGE] #1214
Conversation
If this passes, please continue to recheck the upgrade job over and over again. Don't care about the ceph or swift jobs for this particular PR.
(force-pushed from 0699087 to 79cdcb6)
back to square one:
Seems like there's a sneaky start of mariadb 10 happening when the package is being installed, even though it's meant to be denied by the policy:
Specifically note this line:
What are you doing there, buddy?
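For context, the policy mechanism referred to here is Debian's policy-rc.d hook: during package installs, invoke-rc.d consults /usr/sbin/policy-rc.d and treats exit code 101 as "do not start the service". A minimal deny-all version looks roughly like this (the exact file OSA drops into the containers may differ):

```sh
# Minimal deny-all policy-rc.d: invoke-rc.d interprets exit code 101 as
# "action forbidden", so package scripts should not be able to start services.
cat > /usr/sbin/policy-rc.d <<'EOF'
#!/bin/sh
exit 101
EOF
chmod +x /usr/sbin/policy-rc.d
```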
Using jenkins build 8491:
- specific bit of the galera log we are interested in, showing the multiple starts/stops of mysqld/innodb/wsrep: http://paste.openstack.org/show/526090/
- ansible console log: http://jenkins.propter.net/job/RPC-AIO/8491/consoleFull
- specific bit of the ansible log showing the sneaky mysqld start at package install time:
(force-pushed from 0c4473f to 71203fe)
@mancdaz are you sure it's starting? The following two lines in the log suggest it was prevented:
@git-harry I'm not sure of anything tbh, just continuing to pore through the logfiles and trying to note anything odd. However, the timing of that line does suggest that something is trying to start up at the point of install, and indeed the mysql error log shows lots of activity around then. If it was truly being denied by policy-rc.d, I would not expect it to be assigned a PID.
More notes: In a successful startup, the following is in the mysql log:
In a failed startup:
The existence of a … may mean nothing, who knows.
So, adding a task to pgrep for mysqld and to look at netstat right after the mariadb-server package is installed (bearing in mind that policy-rc.d should not allow it to be running) shows that it is indeed running (or at least the galera wsrep part of it is listening on 4567):
Pulling out the important bits: netstat:
pgrep:
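For reference, the same check expressed as plain shell (the actual change was an Ansible task; the commands and flags below are illustrative):

```sh
# Is anything mysql-ish already running right after the package install,
# and is the galera replication port (4567) already bound?
pgrep -a mysqld
netstat -ntlp | grep -E ':(3306|4567) '
```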
In an upgrade job, nodes 1 and 2 of the cluster are restarted at the beginning of the liberty upgrade script, and then node 0 is restarted. In a job where the later mariadb upgrade fails, it's useful to note that earlier, when those restarts happened, the cluster was left in a bad state:
In a 'good' job, the cluster gets itself back into a good state after those container restarts:
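For anyone reproducing this, one way to eyeball cluster health on each node after those restarts is via the standard Galera status variables (generic commands, not taken from the job output):

```sh
# On a healthy node expect: wsrep_cluster_status = Primary,
# wsrep_cluster_size = <number of nodes>, wsrep_local_state_comment = Synced.
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"
```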
OK so it's not mariadb 10 that's starting right on install, but an old lingering mariadb 5.5 daemon that never got shut down properly from earlier. We can see this by looking at the PID we saw earlier: netstat:
And looking in the mariadb log for that container:
Shortly after this, the cluster fails to establish quorum and goes into error state:
For whatever reason, this means that the mariadb 5.5 daemon fails to shut down properly when we come to install/start mariadb 10. We need to understand what is causing the cluster to go into error (fail to reach quorum) around the time of the container restarts at the beginning of the upgrade.
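A quick way to confirm that the PID bound to 4567 belongs to the old 5.5 daemon rather than a freshly started 10.x one (the PID below is a placeholder for whatever netstat reported):

```sh
PID=12345                                # hypothetical: the PID netstat showed on :4567
ls -l /proc/$PID/exe                     # which binary is it actually running?
tr '\0' ' ' < /proc/$PID/cmdline; echo   # full command line
ps -o lstart=,etime= -p $PID             # start time shows it predates the mariadb 10 install
```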
(force-pushed from fd131ac to 8f3e9f8)
(force-pushed from deda927 to d2f4e8e)
The current working theory is to gracefully stop mariadb on the nodes prior to the container restarts, to ensure data consistency and prevent nodes seeing themselves as 'crashed'. Interestingly, stopping nodes 2/3 simultaneously (gracefully or otherwise) causes the manifestation of this galera bug: https://bugs.launchpad.net/galera/+bug/1217225. Now trying stopping mariadb in serial to see if we get what we want.
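The shape of the idea, sketched as shell against hypothetical container names (the real change is an Ansible task, and the service name on these containers may differ):

```sh
# Stop mysql on one galera container at a time and wait for a clean shutdown
# before moving on, so no node is left thinking it crashed.
for c in galera_container_1 galera_container_2 galera_container_3; do
    lxc-attach -n "$c" -- service mysql stop
    while lxc-attach -n "$c" -- pgrep mysqld >/dev/null; do
        sleep 2
    done
done
```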
(force-pushed from f82945e to 040e01b)
(force-pushed from 040e01b to 70910fe)
(force-pushed from 722b31b to 2316e67)
Turns out my graceful stop of mariadb on the master node was being skipped because of https://bugs.launchpad.net/openstack-ansible/+bug/1604775. Trying again now with a workaround.
The story deepens. Turns out that the part of the lxc_container_create role that adds some autodev stuff for containers in liberty, thusly: causes a secret container restart. So even though my patch is shutting mysql down prior to the later explicit container restart, that secret container restart causes mysql to start again. Then, when it comes to the later explicit container restart, we're basically just back to forcefully crashing the mysql daemon without a graceful stop. Le sigh.
(force-pushed from 2316e67 to 0ee4df7)
w00t! https://review.openstack.org/#/c/344834/ fixes the 'secret container restart' problem and gives us more flexibility and control over when containers restart. Testing again, including that fix.
(force-pushed from 0ee4df7 to 2984d9f)
(force-pushed from 2984d9f to 8bce6b5)
(force-pushed from 8bce6b5 to 9eb1c2e)
http://jenkins.propter.net/job/RPC-AIO/9842/consoleFull: all failed in kilo when deploying maas:
All jenkins jobs are currently failing due to maas rate limiting.
http://jenkins.propter.net/job/RPC-AIO/9924/consoleFull: the install of elasticsearch <2.1.0 fails (liberty), tests not running maas. I appear to have introduced some kind of problem with elasticsearch installs, maybe an artifact of the way I was rebuilding the same test, or some kind of problem with a lack of rebase for a while. Hoping that a rebase will sort that issue. The tests are at least passing the mysql upgrade part! Tests with the reverted jimmy patch to try and fix the elasticsearch constraint: straight liberty build to test the jimmy patch revert.
(force-pushed from 7d226c5 to 9eb1c2e)
Pointing at my own fork of openstack-ansible, since that is where the actual fix is.
(force-pushed from 9eb1c2e to 61a7f6f)
OK, so aside from issues with global requirements, we have 11 successful tests of the mysql changes. Going to do a few more for fun.
So, the story comes to an end. The patch was submitted to OSA at https://review.openstack.org/#/c/347195/ and has merged. Yay for better control of mariadb/galera in upgrades.
Here I'm testing a fix in OSA, but changing the submodule to point at
my fork of OSA where I have made a patch. I have to do it this way
because OSA doesn't do upgrade testing of mariadb between kilo and
liberty, which explains why we're only seeing this mostly in rpco.
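Roughly how the submodule swap was done; the submodule name/path "openstack-ansible" is an assumption here:

```sh
# Point the openstack-ansible submodule at my fork for the duration of the test,
# then check out the branch carrying the fix.
git config -f .gitmodules submodule.openstack-ansible.url \
    https://github.com/mancdaz/openstack-ansible
git submodule sync openstack-ansible
cd openstack-ansible && git fetch origin && git checkout mysql-start
```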
The fix is essentially to wait longer between mysql start attempts after doing a mariadb upgrade.
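In spirit, the change amounts to something like this (attempt counts and sleep times are made up; the real values live in the OSA patch linked below):

```sh
# Retry the mysql start more patiently after the mariadb upgrade instead of
# giving up after a short timeout.
for attempt in 1 2 3 4 5 6 7 8 9 10; do
    service mysql start && break
    echo "mysql not up yet (attempt ${attempt}); waiting before retrying..."
    sleep 30
done
```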
Please, feel free to recheck_upgrade on this because that's what I'm testing here.
FYI this is the OSA fix I am testing https://github.com/mancdaz/openstack-ansible/commits/mysql-start
Connects #1201