MySQL 1047 wsrep, .info dictionary not handled in reset invalidation? #4225

Closed
sqlalchemy-bot opened this issue Mar 27, 2018 · 19 comments
Labels: bug, connection pool, mysql

Migrated issue, originally created by Kövi András (@akovi)

It seems like WSREP errors can cause #3497 to reappear.

Example error: InternalError: (1047, u'WSREP has not yet prepared node for application use')

The error was produced in an HA testing environment where the Galera cluster was being bombarded with failures. The connecting OpenStack Mistral service correctly identified the 2013 disconnect errors, but appears to have missed the 1047, which caused broken connections to stay in the pool.


Attachments: sqlalchemy-issue.txt

Michael Bayer (@zzzeek) wrote:

The 1047 state indicates the database can't serve queries. Reconnecting to the DB in that state will still produce a non-working connection.

I also can't produce a situation where the connection doesn't already recover automatically. This is because once the WSREP state comes back, the connection itself is bounced out with a 2013, which is handled normally. Here's a raw PyMySQL script:

import pymysql

conn = pymysql.connect(user='nova', passwd='Nova_password', host='10.0.20.14', db='nova')

def go():
    try:
        cursor = conn.cursor()
        cursor.execute("select 1")
        print "OK"
    except Exception as err:
        print err

while True:
    raw_input("press enter")
    go()

Here is a run against a node that is in the 1047 state, then I bring it back manually. At the point the node is brought back, it throws 2013 which is recoverable:

# python test2.py 
press enter
(1047, u'WSREP has not yet prepared node for application use')
press enter
(1047, u'WSREP has not yet prepared node for application use')
press enter
(1047, u'WSREP has not yet prepared node for application use')
press enter
(1047, u'WSREP has not yet prepared node for application use')
press enter
(2013, 'Lost connection to MySQL server during query')
press enter
(0, '')
press enter
(0, '')

Here's a script that holds a SQLAlchemy connection open for a long time. It recovers:

from sqlalchemy import create_engine
import time

e = create_engine("mysql+pymysql://nova:Nova_password@10.0.20.14/nova", echo='debug')

long_lived_connection = e.connect()

def go_long():
    try:
        print long_lived_connection.scalar("select 1")
        print "long conn success"
    except Exception as err:
        print "long conn exception: %s" % err

while True:
    raw_input("press enter to run queries")
    go_long()

Output while I bring the node offline, then back online again; the 2013 condition occurs, which allows the connection to be reset:

queries run normally:

press enter to run queries
2018-03-27 17:39:33,370 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:39:33,370 INFO sqlalchemy.engine.base.Engine {}
2018-03-27 17:39:33,371 DEBUG sqlalchemy.engine.base.Engine Col (u'1',)
2018-03-27 17:39:33,371 DEBUG sqlalchemy.engine.base.Engine Row (1,)
1
long conn success
press enter to run queries
2018-03-27 17:39:33,744 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:39:33,744 INFO sqlalchemy.engine.base.Engine {}
2018-03-27 17:39:33,745 DEBUG sqlalchemy.engine.base.Engine Col (u'1',)
2018-03-27 17:39:33,745 DEBUG sqlalchemy.engine.base.Engine Row (1,)
1
long conn success

In another window, I remove the node from the cluster (you can see it takes 35 seconds):

MariaDB [(none)]> SET GLOBAL wsrep_cluster_address="gcomm://192.168.28.1:4567";                                                                                                   
Query OK, 0 rows affected (35.02 sec)


In the script, the connection is lost with a 2013; the script then reconnects back into the 1047 state, where it remains stable:

press enter to run queries
2018-03-27 17:40:20,027 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:40:20,028 INFO sqlalchemy.engine.base.Engine {}
long conn exception: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'select 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
press enter to run queries
2018-03-27 17:40:21,601 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:40:21,601 INFO sqlalchemy.engine.base.Engine {}
2018-03-27 17:40:21,602 INFO sqlalchemy.engine.base.Engine ROLLBACK
/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py:74: SAWarning: An exception has occurred during handling of a previous exception.  The previous exception is:
 <class 'pymysql.err.InternalError'> (1047, u'WSREP has not yet prepared node for application use')

  "is:\n %s %s\n" % (self._exc_info[0], self._exc_info[1]))
long conn exception: (pymysql.err.InternalError) (1047, u'WSREP has not yet prepared node for application use') (Background on this error at: http://sqlalche.me/e/2j85)
press enter to run queries
2018-03-27 17:40:23,921 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:40:23,921 INFO sqlalchemy.engine.base.Engine {}
2018-03-27 17:40:23,921 INFO sqlalchemy.engine.base.Engine ROLLBACK
long conn exception: (pymysql.err.InternalError) (1047, u'WSREP has not yet prepared node for application use') (Background on this error at: http://sqlalche.me/e/2j85)
press enter to run queries
2018-03-27 17:40:24,831 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:40:24,832 INFO sqlalchemy.engine.base.Engine {}
2018-03-27 17:40:24,833 INFO sqlalchemy.engine.base.Engine ROLLBACK
long conn exception: (pymysql.err.InternalError) (1047, u'WSREP has not yet prepared node for application use') (Background on this error at: http://sqlalche.me/e/2j85)
press enter to run queries

bring the node back:

MariaDB [(none)]> SET GLOBAL wsrep_cluster_address="gcomm://stack1-overcloud-controller-0.internalapi.localdomain,stack1-overcloud-controller-1.internalapi.localdomain,stack1-overcloud-controller-2.internalapi.localdomain";
Query OK, 0 rows affected (3.00 sec)

In the script, the connection gets another 2013, which makes it reconnect; then it works:

2018-03-27 17:40:35,056 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:40:35,056 INFO sqlalchemy.engine.base.Engine {}
long conn exception: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'select 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
press enter to run queries
2018-03-27 17:40:36,185 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:40:36,186 INFO sqlalchemy.engine.base.Engine {}
2018-03-27 17:40:36,186 DEBUG sqlalchemy.engine.base.Engine Col (u'1',)
2018-03-27 17:40:36,186 DEBUG sqlalchemy.engine.base.Engine Row (1,)
1
long conn success
press enter to run queries
2018-03-27 17:40:37,187 INFO sqlalchemy.engine.base.Engine select 1
2018-03-27 17:40:37,187 INFO sqlalchemy.engine.base.Engine {}
2018-03-27 17:40:37,188 DEBUG sqlalchemy.engine.base.Engine Col (u'1',)
2018-03-27 17:40:37,188 DEBUG sqlalchemy.engine.base.Engine Row (1,)
1
long conn success

I also tried returning the connection to the pool each time. In this case, the connection is invalidated even sooner, when it is returned to the pool, since the reset operation (e.g. rollback()) fails.
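
For reference, the "return it to the pool each time" variant looks roughly like this (a sketch reusing the engine e from the script above; go_short is a hypothetical name, not part of the original test):

def go_short():
    try:
        conn = e.connect()
        try:
            print conn.scalar("select 1")
            print "short conn success"
        finally:
            # close() returns the connection to the pool; the pool's reset
            # (rollback) runs here and is what fails while the node is in 1047
            conn.close()
    except Exception as err:
        print "short conn exception: %s" % err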

So I cannot reproduce a condition where the 1047 error is not automatically recovered once the database is available again. Reconnecting while the DB remains in the 1047 state doesn't accomplish anything, unless you are implementing your own client-side failover system, in which case you need to intercept 1047 yourself using the handle_error() event: http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error. However, since this is Mistral, you're using oslo.db, which isn't doing anything like that and expects that you're connected through HAProxy.
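
To illustrate, a minimal sketch of intercepting 1047 via the documented handle_error() event and flagging it as a disconnect, for the case where you really are doing your own client-side failover (not something oslo.db sets up for you; the engine URL is reused from the earlier script):

from sqlalchemy import create_engine, event

engine = create_engine("mysql+pymysql://nova:Nova_password@10.0.20.14/nova")

@event.listens_for(engine, "handle_error")
def flag_wsrep_not_ready(context):
    # PyMySQL raises InternalError(1047, u'WSREP has not yet prepared node ...')
    args = getattr(context.original_exception, "args", ())
    if not context.is_disconnect and args and args[0] == 1047:
        # marking the error as a disconnect causes the pool to invalidate the
        # connection instead of keeping it around
        context.is_disconnect = True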

Usually, Galera is used with HAProxy or a similar proxy server, which would prevent the 1047 error from ever being exposed to the client application; instead, the proxy fails over to another Galera node and the client sees the usual 2013, from which it recovers.

As for #3497, that was a special condition involving the ".info" dictionary, and I don't see anything here that indicates that specific condition is occurring. That issue wasn't about whether or not the connection is invalidated; it had to do with the mechanics of invalidation.

In short, the 1047 error means the database is not available and there is no need to reconnect while it's in this state.

Kövi András (@akovi) wrote:

Michael, thank you very much for the detailed analysis. Yes, we use HAProxy, and it works well in most cases.

At first I also thought that 1047 should be recoverable. However, it seems that, due to something really similar to #3497, it does not recover. An example of the error is the very last one in the attached file.

...
 File "/opt/venv/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py", line 388, in checkout
    if connection_record.info['pid'] != pid:
KeyError: 'pid'

The same error occurs every time the CronTrigger scheduler tries to run, and it never resolves. Unfortunately, if the broken connection is used in an API request, the error bubbles up to the user as well.
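
For context, the check that raises this KeyError is oslo.db's process guard, which in simplified form looks roughly like the following sketch (not the exact oslo.db code); the KeyError means the "connect" handler that stamps the pid never ran for this connection:

import os

from sqlalchemy import create_engine, event, exc

engine = create_engine("mysql+pymysql://nova:Nova_password@10.0.20.14/nova")

@event.listens_for(engine, "connect")
def remember_pid(dbapi_conn, connection_record):
    # stamp the pool record with the process that created the connection
    connection_record.info['pid'] = os.getpid()

@event.listens_for(engine, "checkout")
def check_pid(dbapi_conn, connection_record, connection_proxy):
    # refuse to hand out a connection that was created in another process
    if connection_record.info['pid'] != os.getpid():
        connection_record.connection = connection_proxy.connection = None
        raise exc.DisconnectionError("connection belongs to another process")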

Forgive me if I'm mistaken somewhere. I'm trying hard to understand the logged errors.

Michael Bayer (@zzzeek) wrote:

ah OK, the 'pid' thing is at the end there, so yes, this looks more like #3497, but it doesn't necessarily mean the 1047 needs to be caught. I bet the issue is the invalidation occurring during the reset; I can take a look at that.

Changes by Michael Bayer (@zzzeek):

  • changed title from "MySQL error 1047 should be handled as disconnect" to "MySQL 1047 wsrep, .info dictionary not handled in reset invalidation?"

Michael Bayer (@zzzeek) wrote:

also... you're using HAProxy with clustercheck, I assume? You should never see a 1047 in your client, then.

Kövi András (@akovi) wrote:

This is a really tortured DB, and the HAProxy check happens every few seconds. There's probably a window of a few seconds where the proxy has not yet detected the error but Mistral is still trying to use the node. I'll check the HAProxy settings.
This whole thing is really good to know: even though the error may not need to be caught, it should be transformed into a meaningful error in the API and retried a few times before giving up.
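
A generic sketch of that kind of retry (not Mistral's actual code; run_query and the retry parameters are hypothetical), using the 1047 code and SQLAlchemy's connection_invalidated flag as the retry conditions:

import time

from sqlalchemy import exc

def run_with_retries(run_query, attempts=3, delay=1.0):
    # retry a short DB operation while the node reports 1047 or the
    # connection was invalidated; re-raise once the attempts are used up
    for attempt in range(attempts):
        try:
            return run_query()
        except exc.DBAPIError as err:
            wsrep_not_ready = bool(err.orig.args) and err.orig.args[0] == 1047
            if not (wsrep_not_ready or err.connection_invalidated):
                raise
            if attempt == attempts - 1:
                raise
            time.sleep(delay)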

Michael Bayer (@zzzeek) wrote:

oh right, there's that one second, sure. then it doesn't reset .info in that codepath, we think. OK

Michael Bayer (@zzzeek) wrote:

well OK, oslo.db should possibly get involved in adding app-specific conditions like this as disconnect situations.

Michael Bayer (@zzzeek) wrote:

need exact SQLAlchemy version in use

Kövi András (@akovi) wrote:

1.2.0

Michael Bayer (@zzzeek) wrote:

it's possible that the connect event which sets 'pid' in oslo.db is not occurring because an earlier connect handler is failing. In this case it seems the MySQL "set transaction isolation" step is failing first:

 File "/opt/venv/lib/python2.7/site-packages/sqlalchemy/pool.py", line 679, in __connect
    pool.dispatch.connect(self.connection, self)
  File "/opt/venv/lib/python2.7/site-packages/sqlalchemy/event/attr.py", line 256, in __call__
    fn(*args, **kw)
  File "/opt/venv/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 173, in on_connect
    do_on_connect(conn)
  File "/opt/venv/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 1708, in connect
    self.set_isolation_level(conn, self.isolation_level)
  File "/opt/venv/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 1725, in set_isolation_level
    self._set_isolation_level(connection, level)
  File "/opt/venv/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/mysqldb.py", line 217, in _set_isolation_level
    level)
  File "/opt/venv/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 1736, in _set_isolation_level
    cursor.execute("COMMIT")
    ...
  DBError: (pymysql.err.InternalError) (1047, u'WSREP has not yet prepared node for application use') 

this is not invalidating the connection, which it probably should. The 'pid' event then doesn't occur, and the checkin that follows proceeds normally because .connection is still there.

but... as of yet I can't get this to reproduce exactly, because that 2013 always comes up and fixes everything. I have to leave for today, so I will try to look tomorrow.

Michael Bayer (@zzzeek) wrote:

I have a patch running through at https://gerrit.sqlalchemy.org/#/c/zzzeek/sqlalchemy/+/711/ that should fix this. However, I haven't tried to get the full-blown Galera test environment to reproduce it; I would need to get it to produce a cleaner 1047 by blocking the network, or something like that. Is your environment able to reliably reproduce the condition, and could you perhaps see if this short patch resolves it?

Michael Bayer (@zzzeek) wrote:

Blocking the Galera node with firewall rules produces a better test for me now. There's no drop of the connection at all (raw PyMySQL):

press enter
OK
press enter
OK
press enter
(1047, u'WSREP has not yet prepared node for application use')
press enter
(1047, u'WSREP has not yet prepared node for application use')
press enter
(1047, u'WSREP has not yet prepared node for application use')
press enter
OK
press enter
OK

so let me try this with the oslo.db events set up on an engine

Kövi András (@akovi) wrote:

Thanks, Michael! This is great news. I'll try the patch tomorrow. We test with firewall blocking and with restarts as well.

Michael Bayer (@zzzeek) wrote:

OK, I was able to reproduce the 'pid' issue, and the patch fixes it, so I will go with this.

Michael Bayer (@zzzeek) wrote:

Invalidate on failed connect handler

Fixed bug in connection pool where a connection could be present in the
pool without all of its "connect" event handlers called, if a previous
"connect" handler threw an exception; note that the dialects themselves
have connect handlers that emit SQL, such as those which set transaction
isolation, which can fail if the database is in a non-available state, but
still allows a connection. The connection is now invalidated first if any
of the connect handlers fail.

Change-Id: I61d6f4827a98ab8455f1c3e1c55d046eeccec09a
Fixes: #4225

c543729

Michael Bayer (@zzzeek) wrote:

Invalidate on failed connect handler

Fixed bug in connection pool where a connection could be present in the
pool without all of its "connect" event handlers called, if a previous
"connect" handler threw an exception; note that the dialects themselves
have connect handlers that emit SQL, such as those which set transaction
isolation, which can fail if the database is in a non-available state, but
still allows a connection. The connection is now invalidated first if any
of the connect handlers fail.

Change-Id: I61d6f4827a98ab8455f1c3e1c55d046eeccec09a
Fixes: #4225
(cherry picked from commit c543729)

967998f

Changes by Michael Bayer (@zzzeek):

  • changed status to closed

zzzeek commented Dec 12, 2022

hi old me, this is a connection pool issue, let's tag it as such!
