MySQL 1047 wsrep, .info dictionary not handled in reset invalidation? #4225
Comments
Michael Bayer (@zzzeek) wrote: The 1047 state indicates the database can't serve queries. Reconnecting to the DB in that state will still produce a non-working connection, and I can't reproduce a situation where the connection fails to recover automatically as things stand. This is because once the WSREP state comes back, the connection itself is bounced out on a 2013, which is handled normally. Here's a raw PyMySQL script:
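A minimal sketch of such a loop (the original attachment isn't reproduced here; the host, credentials, and poll interval are illustrative assumptions):

```python
# Illustrative sketch only: keep querying a single Galera node with raw
# PyMySQL, reconnecting whenever an error such as 2013 or 1047 is raised.
import time

import pymysql


def connect():
    # hypothetical connection parameters
    return pymysql.connect(host="127.0.0.1", user="scott", password="tiger", db="test")


conn = connect()
while True:
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
            print(cursor.fetchall())
    except pymysql.MySQLError as err:
        print("error: %s; reconnecting" % (err,))
        try:
            conn.close()
        except pymysql.MySQLError:
            pass
        try:
            conn = connect()
        except pymysql.MySQLError as err:
            print("reconnect failed: %s" % (err,))
    time.sleep(1)
```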
Here is a run against a node that is in the 1047 state; then I bring it back manually. At the point the node is brought back, it throws a 2013, which is recoverable:
Here's a script that holds a SQLAlchemy connection open for a long time. It recovers:
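A minimal sketch of that kind of long-lived-connection script (the URL and loop details are assumptions, not the original attachment):

```python
# Illustrative sketch: hold a single SQLAlchemy connection open and keep
# querying; when the pool invalidates the connection (e.g. on a 2013),
# check a fresh one out of the engine.
import time

from sqlalchemy import create_engine, exc, text

# hypothetical connection URL
engine = create_engine("mysql+pymysql://scott:tiger@127.0.0.1/test")

conn = engine.connect()
while True:
    try:
        print(conn.execute(text("SELECT 1")).fetchall())
    except exc.DBAPIError as err:
        print("error: %s (invalidated=%s)" % (err.orig, err.connection_invalidated))
        if err.connection_invalidated:
            # the underlying DBAPI connection is gone; close this Connection
            # and acquire a new one from the pool
            conn.close()
            conn = engine.connect()
    time.sleep(1)
```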
Output while I take the node offline and then bring it back online; the 2013 condition occurs, which allows the connection to be reset, and queries run normally:
In another window I remove the node from the cluster (you can see it takes 35 seconds):
In the script, the connection is lost with a 2013, then reconnects into the 1047 state, where it remains stable:
Bring the node back:
In the script, the connection gets another 2013, which makes it reconnect; then it works:
I also tried returning the connection to the pool each time. In that case, the connection is invalidated even sooner, when it is returned to the pool, since the reset operation (e.g. rollback()) fails. So I cannot reproduce a condition where the 1047 error is not automatically recovered once the database is available again. Reconnecting while the DB remains in 1047 doesn't accomplish anything, unless you are implementing your own client-side failover system, in which case you need to intercept 1047 yourself using the handle_error() event: http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error . However, since this is Mistral, you're using oslo.db, which isn't doing anything like that; it expects that you're connected to HAProxy. Usually Galera is used with HAProxy or a similar proxy server, which prevents the 1047 error from ever being exposed to the client application; instead, the proxy fails over to another Galera node and the client sees the usual 2013, from which it recovers. As for #3497, that was a special condition that involved the ".info" dictionary, and I don't see anything here that indicates that specific condition is occurring; it isn't related to whether the connection is invalidated or not, it had to do with the mechanics of invalidation. In short, the 1047 error means the database is not available, and there is no need to reconnect while it's in this state.
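A hedged sketch of what intercepting 1047 in handle_error() could look like; the engine URL and the decision to flag 1047 as a disconnect are illustrative assumptions, not something oslo.db configures:

```python
# Sketch: catch the Galera 1047 error in the handle_error() event and mark
# it as a disconnect so the pool invalidates the connection.
from sqlalchemy import create_engine, event

# hypothetical engine
engine = create_engine("mysql+pymysql://scott:tiger@127.0.0.1/test")


@event.listens_for(engine, "handle_error")
def intercept_wsrep_not_ready(context):
    # context.original_exception is the raw DBAPI exception, e.g.
    # InternalError: (1047, 'WSREP has not yet prepared node for application use')
    orig = context.original_exception
    if orig is not None and getattr(orig, "args", ()) and orig.args[0] == 1047:
        # treat the WSREP error like a disconnect so the connection
        # is invalidated rather than left in the pool
        context.is_disconnect = True
```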
Kövi András (@akovi) wrote: Michael, thank you very much for the detailed analysis. Yes, we use HAProxy, and it works well in most cases. At first I also thought that 1047 should be recoverable; however, it seems that, due to something very similar to #3497, it does not recover. The specimen for the error is the very last error in the attached file.
The same error occurs every time the CronTrigger scheduler tries to run, and it never resolves. Unfortunately, if the erroneous connection is used in an API request, the error bubbles up to the user as well. Forgive me if I'm mistaken somewhere; I'm trying hard to understand the logged errors.
Changes by Michael Bayer (@zzzeek):
Michael Bayer (@zzzeek) wrote: Also... you're using HAProxy with clustercheck, I assume? You should never see 1047 in your client, right?
Kövi András (@akovi) wrote: This is a really tortured DB, and the HAProxy check happens every few seconds. There are probably a few seconds during which the proxy has not yet detected the error but Mistral is already trying to use the node. I'll check the HAProxy settings.
Michael Bayer (@zzzeek) wrote: Oh right, there's that window of a few seconds, sure. Then it doesn't reset .info in that code path, we think. OK
Michael Bayer (@zzzeek) wrote: Well, OK, oslo.db should possibly get involved in adding app-specific conditions like this as disconnect situations.
Michael Bayer (@zzzeek) wrote: Need the exact SQLAlchemy version in use.
Kövi András (@akovi) wrote: 1.2.0
Michael Bayer (@zzzeek) wrote: It's possible that the connect event which sets 'pid' in oslo.db is not occurring, because an earlier connect event is failing. In this case it seems the MySQL set-transaction-isolation step on connect is failing first:
This is not invalidating the connection, which it probably should. The 'pid' event then doesn't occur. Then it does a checkin, which proceeds normally because .connection is still present. But as of yet I can't reproduce this exactly, because that 2013 always comes up and fixes everything. I have to leave for today, so I will try to look at it tomorrow.
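For reference, the 'pid' handlers in question follow the general fork-safety recipe from the SQLAlchemy pooling docs; a rough sketch (the actual oslo.db handlers differ in detail):

```python
# Rough sketch of pid-tracking pool events: the connect event records the
# creating process id, and checkout refuses connections inherited across a fork.
import os

from sqlalchemy import create_engine, event, exc

# hypothetical engine
engine = create_engine("mysql+pymysql://scott:tiger@127.0.0.1/test")


@event.listens_for(engine, "connect")
def on_connect(dbapi_connection, connection_record):
    connection_record.info["pid"] = os.getpid()


@event.listens_for(engine, "checkout")
def on_checkout(dbapi_connection, connection_record, connection_proxy):
    pid = os.getpid()
    creator_pid = connection_record.info.get("pid")
    if creator_pid != pid:
        # connection was created in another process; discard it and force
        # the pool to retry with a fresh connection
        connection_record.connection = connection_proxy.connection = None
        raise exc.DisconnectionError(
            "Connection record belongs to pid %s, "
            "attempting to check out in pid %s" % (creator_pid, pid)
        )
```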
Michael Bayer (@zzzeek) wrote: I have a patch running through at https://gerrit.sqlalchemy.org/#/c/zzzeek/sqlalchemy/+/711/ that should fix this. However, I haven't tried to get the full-blown Galera test environment to reproduce it; I would need to get it to produce a cleaner 1047 by blocking the network, or something like that. Is your environment able to reproduce the condition reliably, and could you perhaps see if this short patch resolves it?
Michael Bayer (@zzzeek) wrote: Blocking the Galera node with firewall rules produces a better test for me now. There's no drop of the connection at all (raw PyMySQL):
So let me try this with the oslo.db events set up on an engine.
Kövi András (@akovi) wrote: Thanks, Michael! This is great news. I'll try the patch tomorrow. We also test by blocking with the firewall and by restarting nodes.
Michael Bayer (@zzzeek) wrote: OK, I was able to reproduce the 'pid' issue, and the patch fixes it, so I will go with this.
Michael Bayer (@zzzeek) wrote: Invalidate on failed connect handler. Fixed bug in connection pool where a connection could be present in the pool even though one of its connect event handlers had failed, leaving it only partially initialized; the connection is now invalidated when a connect handler raises an error. Change-Id: I61d6f4827a98ab8455f1c3e1c55d046eeccec09a → c543729
Michael Bayer (@zzzeek) wrote: Invalidate on failed connect handler. Fixed bug in connection pool where a connection could be present in the pool even though one of its connect event handlers had failed, leaving it only partially initialized; the connection is now invalidated when a connect handler raises an error. Change-Id: I61d6f4827a98ab8455f1c3e1c55d046eeccec09a → 967998f
Changes by Michael Bayer (@zzzeek):
hi old me, this is a connection pool issue, let's tag it as such! |
Migrated issue, originally created by Kövi András (@akovi)
It seems like WSREP errors can cause #3497 to reappear.
Example error: InternalError: (1047, u'WSREP has not yet prepared node for application use')
The error was produced in an HA testing environment where the Galera cluster was being bombarded with errors. The connecting OpenStack Mistral service correctly identified the 2013 disconnect errors, but it seems to have missed the 1047, which caused erroneous connections to stay in the pool.
Attachments: sqlalchemy-issue.txt