MySQL 1047 wsrep, .info dictionary not handled in reset invalidation? #4225
Comments
Michael Bayer (@zzzeek) wrote: The 1047 state indicates the database can't serve queries. Reconnecting to the DB in that state will still produce a non-working connection, and I can't reproduce a situation where the connection fails to recover automatically as things stand. This is because once the WSREP state comes back, the connection itself is bounced out on a 2013, which is handled normally. Here's a raw PyMySQL script:
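A minimal sketch of such a loop (the original attachment isn't reproduced here; the host, credentials, and poll interval are illustrative assumptions):

```python
# Illustrative sketch only: keep querying a single Galera node with raw
# PyMySQL, reconnecting whenever an error such as 2013 or 1047 is raised.
import time

import pymysql


def connect():
    # hypothetical connection parameters
    return pymysql.connect(host="127.0.0.1", user="scott", password="tiger", db="test")


conn = connect()
while True:
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
            print(cursor.fetchall())
    except pymysql.MySQLError as err:
        print("error: %s; reconnecting" % (err,))
        try:
            conn.close()
        except pymysql.MySQLError:
            pass
        try:
            conn = connect()
        except pymysql.MySQLError as err:
            print("reconnect failed: %s" % (err,))
    time.sleep(1)
```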
Here is a run against a node that is in the 1047 state; then I bring it back manually. At the point the node is brought back, it throws a 2013, which is recoverable:
Here's a script that holds a SQLAlchemy connection open for a long time. It recovers:
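A minimal sketch of that kind of long-lived-connection script (the URL and loop details are assumptions, not the original attachment):

```python
# Illustrative sketch: hold a single SQLAlchemy connection open and keep
# querying; when the pool invalidates the connection (e.g. on a 2013),
# check a fresh one out of the engine.
import time

from sqlalchemy import create_engine, exc, text

# hypothetical connection URL
engine = create_engine("mysql+pymysql://scott:tiger@127.0.0.1/test")

conn = engine.connect()
while True:
    try:
        print(conn.execute(text("SELECT 1")).fetchall())
    except exc.DBAPIError as err:
        print("error: %s (invalidated=%s)" % (err.orig, err.connection_invalidated))
        if err.connection_invalidated:
            # the underlying DBAPI connection is gone; close this Connection
            # and acquire a new one from the pool
            conn.close()
            conn = engine.connect()
    time.sleep(1)
```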
Output while I take the node offline and then bring it back online; the 2013 condition occurs, which allows the connection to be reset, and queries run normally:
In another window I remove the node from the cluster (you can see it takes 35 seconds):
In the script, the connection is lost with a 2013, then reconnects into the 1047 state, where it remains stable:
Bring the node back:
In the script, the connection gets another 2013, which makes it reconnect; then it works:
I also tried returning the connection to the pool each time. In that case, the connection is invalidated even sooner, when it is returned to the pool, since the reset operation (e.g. rollback()) fails. So I cannot reproduce a condition where the 1047 error is not automatically recovered once the database is available again. Reconnecting while the DB remains in 1047 doesn't accomplish anything, unless you are implementing your own client-side failover system, in which case you need to intercept 1047 yourself using the handle_error() event: http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error . However, since this is Mistral, you're using oslo.db, which isn't doing anything like that; it expects that you're connected to HAProxy. Usually Galera is used with HAProxy or a similar proxy server, which prevents the 1047 error from ever being exposed to the client application; instead, the proxy fails over to another Galera node and the client sees the usual 2013, from which it recovers. As for #3497, that was a special condition that involved the ".info" dictionary, and I don't see anything here that indicates that specific condition is occurring; it isn't related to whether the connection is invalidated or not, it had to do with the mechanics of invalidation. In short, the 1047 error means the database is not available, and there is no need to reconnect while it's in this state.
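A hedged sketch of what intercepting 1047 in handle_error() could look like; the engine URL and the decision to flag 1047 as a disconnect are illustrative assumptions, not something oslo.db configures:

```python
# Sketch: catch the Galera 1047 error in the handle_error() event and mark
# it as a disconnect so the pool invalidates the connection.
from sqlalchemy import create_engine, event

# hypothetical engine
engine = create_engine("mysql+pymysql://scott:tiger@127.0.0.1/test")


@event.listens_for(engine, "handle_error")
def intercept_wsrep_not_ready(context):
    # context.original_exception is the raw DBAPI exception, e.g.
    # InternalError: (1047, 'WSREP has not yet prepared node for application use')
    orig = context.original_exception
    if orig is not None and getattr(orig, "args", ()) and orig.args[0] == 1047:
        # treat the WSREP error like a disconnect so the connection
        # is invalidated rather than left in the pool
        context.is_disconnect = True
```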
Kövi András (@akovi) wrote: Michael, thank you very much for the detailed analysis. Yes, we use HAProxy, and it works well in most cases. At first I also thought that 1047 should be recoverable; however, it seems that, due to something very similar to #3497, it does not recover. The specimen for the error is the very last error in the attached file.
The same error occurs every time the CronTrigger scheduler tries to run, and it never resolves. Unfortunately, if the erroneous connection is used in an API request, the error bubbles up to the user as well. Forgive me if I'm mistaken somewhere; I'm trying hard to understand the logged errors.
Changes by Michael Bayer (@zzzeek):
Michael Bayer (@zzzeek) wrote: Also... you're using HAProxy with clustercheck, I assume? You should never see 1047 in your client, right?
Kövi András (@akovi) wrote: This is a really tortured DB, and the HAProxy check happens every few seconds. There are probably a few seconds during which the proxy has not yet detected the error but Mistral is already trying to use the node. I'll check the HAProxy settings.
Michael Bayer (@zzzeek) wrote: Oh right, there's that window of a few seconds, sure. Then it doesn't reset .info in that code path, we think. OK
Michael Bayer (@zzzeek) wrote: Well, OK, oslo.db should possibly get involved in adding app-specific conditions like this as disconnect situations.
Michael Bayer (@zzzeek) wrote: Need the exact SQLAlchemy version in use.
Kövi András (@akovi) wrote: 1.2.0
Michael Bayer (@zzzeek) wrote: It's possible that the connect event which sets 'pid' in oslo.db is not occurring, because an earlier connect event is failing. In this case it seems the MySQL set-transaction-isolation step on connect is failing first:
This is not invalidating the connection, which it probably should. The 'pid' event then doesn't occur. Then it does a checkin, which proceeds normally because .connection is still present. But as of yet I can't reproduce this exactly, because that 2013 always comes up and fixes everything. I have to leave for today, so I will try to look at it tomorrow.
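For reference, the 'pid' handlers in question follow the general fork-safety recipe from the SQLAlchemy pooling docs; a rough sketch (the actual oslo.db handlers differ in detail):

```python
# Rough sketch of pid-tracking pool events: the connect event records the
# creating process id, and checkout refuses connections inherited across a fork.
import os

from sqlalchemy import create_engine, event, exc

# hypothetical engine
engine = create_engine("mysql+pymysql://scott:tiger@127.0.0.1/test")


@event.listens_for(engine, "connect")
def on_connect(dbapi_connection, connection_record):
    connection_record.info["pid"] = os.getpid()


@event.listens_for(engine, "checkout")
def on_checkout(dbapi_connection, connection_record, connection_proxy):
    pid = os.getpid()
    creator_pid = connection_record.info.get("pid")
    if creator_pid != pid:
        # connection was created in another process; discard it and force
        # the pool to retry with a fresh connection
        connection_record.connection = connection_proxy.connection = None
        raise exc.DisconnectionError(
            "Connection record belongs to pid %s, "
            "attempting to check out in pid %s" % (creator_pid, pid)
        )
```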
Michael Bayer (@zzzeek) wrote: I have a patch running through at https://gerrit.sqlalchemy.org/#/c/zzzeek/sqlalchemy/+/711/ that should fix this. However, I haven't tried to get the full-blown Galera test environment to reproduce it; I would need to get it to produce a cleaner 1047 by blocking the network, or something like that. Is your environment able to reproduce the condition reliably, and could you perhaps see if this short patch resolves it?
Michael Bayer (@zzzeek) wrote: Blocking the Galera node with firewall rules produces a better test for me now. There's no drop of the connection at all (raw PyMySQL):
So let me try this with the oslo.db events set up on an engine.
Kövi András (@akovi) wrote: Thanks, Michael! This is great news. I'll try the patch tomorrow. We also test by blocking with the firewall and by restarting nodes.
Michael Bayer (@zzzeek) wrote: OK, I was able to reproduce the 'pid' issue, and the patch fixes it, so I will go with this.
Michael Bayer (@zzzeek) wrote: Invalidate on failed connect handler. Fixed bug in connection pool where a connection could be present in the pool even though one of its connect event handlers had failed, leaving it only partially initialized; the connection is now invalidated when a connect handler raises an error. Change-Id: I61d6f4827a98ab8455f1c3e1c55d046eeccec09a → c543729
Michael Bayer (@zzzeek) wrote: Invalidate on failed connect handler. Fixed bug in connection pool where a connection could be present in the pool even though one of its connect event handlers had failed, leaving it only partially initialized; the connection is now invalidated when a connect handler raises an error. Change-Id: I61d6f4827a98ab8455f1c3e1c55d046eeccec09a → 967998f
Changes by Michael Bayer (@zzzeek):
hi old me, this is a connection pool issue, let's tag it as such! |
Migrated issue, originally created by Kövi András (@akovi)
It seems like WSREP errors can cause #3497 to reappear.
Example error: InternalError: (1047, u'WSREP has not yet prepared node for application use')
The error was produced in an HA testing environment where the Galera cluster was being bombarded with errors. The connecting OpenStack Mistral service correctly identified the 2013 disconnect errors, but it seems to have missed the 1047, which caused erroneous connections to stay in the pool.
Attachments: sqlalchemy-issue.txt