Regression: Connection fails re-opening connections in 1.0.0? #133

Closed · gorlins opened this issue Aug 23, 2016 · 25 comments

gorlins commented Aug 23, 2016

I believe that ConnectionPool doesn't play nice with 1.0 and can yield broken connections even in best-case scenarios. Here's my reproducing code:

import happybase
from happybase.pool import ConnectionPool, TException, socket
import thriftpy

for mod in [happybase, thriftpy]:
    print mod.__name__, mod.__version__

kwargs = load_connection_kwargs()  # site-specific helper; returns host/protocol/transport kwargs
print kwargs
pool = ConnectionPool(1, **kwargs)

for i in range(10):
    print 'Loop', i,
    try:
        with pool.connection() as conn:
            print 'opened',  # Ensure yield succeeds
            conn.tables()
            conn.close()  # Or any random failure/timeout etc
    except (TException, socket.error) as e:
        print 'failed', e
    else:
        print 'succeeded'

Works for 0.9:

happybase 0.9
thriftpy 0.3.8
{u'host': 'XXXX', 'protocol': 'binary', 'transport': 'framed'}
Loop 0 opened succeeded
Loop 1 opened succeeded
Loop 2 opened succeeded
Loop 3 opened succeeded
Loop 4 opened succeeded
Loop 5 opened succeeded
Loop 6 opened succeeded
Loop 7 opened succeeded
Loop 8 opened succeeded
Loop 9 opened succeeded

But not 1.0.0:

happybase 1.0.0
thriftpy 0.3.8
{u'host': 'XXXX', 'protocol': 'binary', 'transport': 'framed'}
Loop 0 opened succeeded
Loop 1 opened failed [Errno 9] Bad file descriptor
Loop 2 opened succeeded
Loop 3 opened failed [Errno 9] Bad file descriptor
Loop 4 opened succeeded
Loop 5 opened failed [Errno 9] Bad file descriptor
Loop 6 opened succeeded
Loop 7 opened failed [Errno 9] Bad file descriptor
Loop 8 opened succeeded
Loop 9 opened failed [Errno 9] Bad file descriptor

Given that ConnectionPool.connection() calls open() before yielding in both versions, I can't immediately see what causes this. I am using CDH 5.7 with HBase 1.2.0-cdh5.7.1.

wbolster (Member) commented:

Could be different behaviour in the underlying Python lib: thriftpy vs. thrift (those versions pull in different dependencies).

Also, a stack trace would give more info perhaps?


gorlins commented Aug 24, 2016

So I can actually simplify the error: happybase just cannot cleanly close and re-open a connection:

conn = Connection(**kwargs)
conn.open()
print 'First', conn.tables()  # Succeeds
conn.close()
conn.open()
print 'Second', conn.tables()  # Fails

It fails on the second call; the stack trace is the same in both cases:

Traceback (most recent call last):
  File "junk.py", line 503, in <module>
    conn.tables()
  File "/Users/sgorlin/miniconda/envs/dna-f7cfb5b8/lib/python2.7/site-packages/happybase/connection.py", line 244, in tables
    names = self.client.getTableNames()
  File "/Users/sgorlin/miniconda/envs/dna-f7cfb5b8/lib/python2.7/site-packages/thriftpy/thrift.py", line 157, in _req
    self._send(_api, **kwargs)
  File "/Users/sgorlin/miniconda/envs/dna-f7cfb5b8/lib/python2.7/site-packages/thriftpy/thrift.py", line 169, in _send
    self._oprot.trans.flush()
  File "thriftpy/transport/framed/cyframed.pyx", line 107, in thriftpy.transport.framed.cyframed.TCyFramedTransport.flush (thriftpy/transport/framed/cyframed.c:2310)
  File "thriftpy/transport/framed/cyframed.pyx", line 95, in thriftpy.transport.framed.cyframed.TCyFramedTransport.c_flush (thriftpy/transport/framed/cyframed.c:2044)
  File "/Users/sgorlin/miniconda/envs/dna-f7cfb5b8/lib/python2.7/site-packages/thriftpy/transport/socket.py", line 129, in write
    self.sock.sendall(buff)
  File "/Users/sgorlin/miniconda/envs/dna-f7cfb5b8/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
  File "/Users/sgorlin/miniconda/envs/dna-f7cfb5b8/lib/python2.7/socket.py", line 174, in _dummy
    raise error(EBADF, 'Bad file descriptor')
socket.error: [Errno 9] Bad file descriptor

gorlins changed the title from “Regression: ConnectionPool fails re-opening connections in 1.0.0?” to “Regression: Connection fails re-opening connections in 1.0.0?” on Aug 24, 2016

gorlins commented Aug 25, 2016

For context, I'll add the use case that motivates this. I get frequent timeouts of the form:

conn = Connection(**kwargs)
conn.tables()  # works
from time import sleep; sleep(75)
conn.tables()
TTransportException: TTransportException(message='TSocket read 0 bytes', type=4)

So I use a subclass of ConnectionPool which tries to shield the consuming code from bad connections:

from contextlib import contextmanager
from happybase.pool import ConnectionPool, TException, socket

class AutoRetryPool(ConnectionPool):
    """Connection pool which does its best to ensure open connections

    May impose some overhead (ie not good for low-latency stuff) but should
    increase reliability of getting a working, open connection when needed

    """
    def __init__(self, size, retries=5, **kwargs):
        self.retries = retries
        super(AutoRetryPool, self).__init__(size, **kwargs)

    @contextmanager
    def connection(self, timeout=None):
        attempts = 0
        opened = False
        while True:
            try:
                with super(AutoRetryPool, self).connection(timeout=timeout) as conn:
                    # Force refresh the client
                    conn.close()
                    conn.open()

                    # Make an actual call to ensure the connection really works
                    _ = conn.tables()
                    opened = True
                    yield conn
                return

            except (TException, socket.error):
                # These trigger refreshed connections in parent
                if opened:
                    # aka the failure was inside the yield
                    raise
                attempts += 1
                # print 'Replacing connection, attempt', attempts
                if attempts >= self.retries:
                    raise

I can't simply let ConnectionPool refresh the connection, because it could idle in the queue long enough to go stale again before it is requested. (Aside: perhaps open() should not be called in the except block of ConnectionPool.connection(), so a replaced connection only opens lazily at its next checkout; see the sketch below.) Perhaps the right call to use here is Connection._refresh_thrift_client, but as a private method that is unsatisfactory: the public Connection.open() and Connection.close() should be the only methods other classes use to refresh the client, and it's not clear to me how that private method relates to them.
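
A rough sketch of that lazy variant (assumptions: it subclasses the real pool, reuses the private _acquire_connection/_return_connection helpers, and omits happybase's thread-local nesting bookkeeping for brevity):

import logging
from contextlib import contextmanager
from happybase.pool import ConnectionPool, TException, socket

logger = logging.getLogger(__name__)

class LazyReopenPool(ConnectionPool):
    """Pool variant that defers reopening a tainted connection."""

    @contextmanager
    def connection(self, timeout=None):
        connection = self._acquire_connection(timeout)
        try:
            connection.open()  # connections (re)open lazily, at checkout
            yield connection
        except (TException, socket.error):
            logger.info('Replacing tainted pool connection')
            connection._refresh_thrift_client()
            # No connection.open() here: the fresh client is only opened
            # by the connection.open() call at the next checkout.
            raise
        finally:
            self._return_connection(connection)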

@wbolster
Copy link
Member

wbolster commented Aug 25, 2016

Since Connection.open() only calls transport.open(), at first glance it seems the problem is in the thriftpy lib. Can you try opening a bug report there (and refer to this one)?

https://github.com/wbolster/happybase/blob/master/happybase/connection.py#L169-L178


gorlins commented Aug 25, 2016

I'm not sure I understand the underlying bug/difference between thriftpy and thrift, but more importantly, based on my understanding of the code, I think this should be addressable in Connection rather than upstream.

The expected behavior of conn.close(); conn.open(), to me, is that the connection returns to a clean working state. What is the purpose of conn._refresh_thrift_client if it is not invoked in that sequence? Maybe:

def open(self):
    if self.transport is None:
        self._refresh_thrift_client()
    # rest of method omitted

def close(self):
    if self.transport is None or not self.transport.is_open():
        self.transport = None
        return

    # logging omitted
    self.transport.close()
    self.transport = None  # Release all socket-like resources for gc
    self.client = None  # not sure about this one

Unless you need persistent transport/client/etc. objects for some reason? I know conn.client usage leaks out into Table, but ideally these would be totally encapsulated inside Connection.


gorlins commented Aug 25, 2016

Cause identified in the other ticket: relying on self.transport.is_open() is unsafe in this instance.


gorlins commented Aug 26, 2016

thriftpy has been bumped to 0.3.9; do you want to bump the requirement and release 1.0.1? I'll consider a pull request for the other issues.

wbolster (Member) commented:

Does happybase need changes, or is just bumping the thriftpy dep enough?


gorlins commented Sep 15, 2016

I think the dep bump is enough here. If I have time I'll try a pull request to make things more robust, but the acute issue is resolved by thriftpy.

aswinQuiklo commented:

I still get the TTransportException(message='TSocket read 0 bytes', type=4) exception if I keep my connection alive for ~2 minutes. I tried creating the connection with timeout=0, but that doesn't seem to help. Can you tell me what to do if I have to keep a connection alive indefinitely and run queries against HBase? I am using HBase version 1.2.1.


gorlins commented Mar 1, 2017

Here is what I currently have deployed, although I have not recently verified whether the extra work is still necessary:

from contextlib import contextmanager
from happybase.pool import ConnectionPool, TException, socket

class AutoRetryPool(ConnectionPool):
    """Connection pool which does its best to ensure open connections

    May impose some overhead (ie not good for low-latency stuff) but should
    increase reliability of getting a working, open connection when needed

    """
    def __init__(self, size, retries=5, **kwargs):
        self.retries = retries
        super(AutoRetryPool, self).__init__(size, **kwargs)

    @contextmanager
    def connection(self, timeout=None):
        attempts = 0
        opened = False
        while True:
            try:
                with super(AutoRetryPool, self).connection(timeout=timeout) as conn:
                    # Make an actual call to ensure the connection really works
                    _ = conn.tables()
                    opened = True
                    yield conn
                return

            except (TException, socket.error):
                # These trigger refreshed connections in parent
                if opened:
                    # aka the failure was inside the yield, can't fix
                    raise

                attempts += 1
                # print 'Replacing connection, attempt', attempts
                if attempts >= self.retries:
                    raise

    def _return_connection(self, connection):
        # Don't leave idling connections around - too easy to timeout before use
        connection.close()
        return super(AutoRetryPool, self)._return_connection(connection)
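
Usage is the same as for a plain ConnectionPool; the host, table name, and row key below are placeholders:

pool = AutoRetryPool(5, retries=3, host='hbase-thrift.example.com')

with pool.connection() as conn:
    table = conn.table('mytable')
    print table.row('row-1')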

aswinQuiklo commented:

Thanks! I will test this and get back to you!

ecederstrand (Contributor) commented:

I believe this may be caused by TSocket.is_open() returning True even after raising TTransportException. See Thriftpy/thriftpy#298
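
That would explain the symptom: Connection.open() guards on the transport already being open, roughly like the sketch below (paraphrased, not verbatim, from happybase/connection.py), so a dead socket that still claims to be open is never reconnected:

def open(self):
    if self.transport.is_open():
        return  # skipped reconnect: the stale socket wrongly reports open
    self.transport.open()  # not reached for a stale socket, so the next RPC hits EBADF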


pranny commented Jul 3, 2017

@gorlins Thanks for the AutoRetryPool. I tried the AutoRetryPool with _return_connection and I keep getting errors like the one below; this happens when I scan a Client table which has 2 rows. If I remove the _return_connection method, it works fine. It's a great relief, as I struggled for weeks to understand why I kept getting random, inconsistent errors originating from thriftpy.

>>> [x for x in Client.scan()]
INFO:happybase.pool:Replacing tainted pool connection
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/pranav/workspace/venv/lib/python2.7/site-packages/happybase/table.py", line 402, in scan
    self.name, scan, {})
  File "/Users/pranav/workspace/venv/lib/python2.7/site-packages/thriftpy/thrift.py", line 195, in _req
    self._send(_api, **kwargs)
  File "/Users/pranav/workspace/venv/lib/python2.7/site-packages/thriftpy/thrift.py", line 206, in _send
    self._oprot.write_message_end()
  File "thriftpy/protocol/cybin/cybin.pyx", line 463, in cybin.TCyBinaryProtocol.write_message_end (thriftpy/protocol/cybin/cybin.c:6845)
  File "thriftpy/transport/buffered/cybuffered.pyx", line 80, in thriftpy.transport.buffered.cybuffered.TCyBufferedTransport.c_flush (thriftpy/transport/buffered/cybuffered.c:2147)
  File "/Users/pranav/workspace/venv/lib/python2.7/site-packages/thriftpy/transport/socket.py", line 129, in write
    self.sock.sendall(buff)

Also, sometimes I see the following error during a scan. Not sure how to handle this:

Traceback (most recent call last):
  File "/home/apps/etl_worker2/src/batchpostprocessor.py", line 39, in run
    res = ETLData.get_values_in_timerange(tag, *duration, values_only=True)
  File "/home/apps/etl_worker2/venv/local/lib/python2.7/site-packages/deming/models/etldata.py", line 45, in get_values_in_timerange
    res = [x for x in cls.scan(row_start=row_start, row_stop=row_stop)]
  File "/home/apps/etl_worker2/venv/local/lib/python2.7/site-packages/happybase/table.py", line 434, in scan
    self.connection.client.scannerClose(scan_id)
  File "/home/apps/etl_worker2/venv/local/lib/python2.7/site-packages/thriftpy/thrift.py", line 198, in _req
    return self._recv(_api)
  File "/home/apps/etl_worker2/venv/local/lib/python2.7/site-packages/thriftpy/thrift.py", line 210, in _recv
    fname, mtype, rseqid = self._iprot.read_message_begin()
  File "thriftpy/protocol/cybin/cybin.pyx", line 439, in cybin.TCyBinaryProtocol.read_message_begin (thriftpy/protocol/cybin/cybin.c:6470)
ProtocolError: No protocol version header


gorlins commented Jul 4, 2017

@pranny It looks like you didn't copy the error for the first issue, so I can't really comment. The latter issue looks like a client misconfiguration: check your server's transport and protocol and configure the client to match (https://happybase.readthedocs.io/en/latest/api.html#happybase.Connection), for example as below. Also, to raise the obvious question: why are you scanning/using HBase at all if your table only has 2 rows?
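
A minimal example of matching the client to the server (the host is a placeholder, and 'framed'/'binary' are illustrative values, not recommendations; they must agree with how the HBase Thrift server was started):

import happybase

conn = happybase.Connection(
    host='hbase-thrift.example.com',
    transport='framed',   # 'buffered' (default) or 'framed'
    protocol='binary',    # 'binary' (default) or 'compact'
)
print conn.tables()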

nackjicholson commented:

@gorlins @wbolster I scanned this page for info, but it's been quite a long time since the last reply. Is there any resolution to this problem? Any plan to address it? I get errno 32 (broken pipe) any time I try to use pool.connection() twice during a process. To me, that's unusably broken.

Thanks for the AutoRetryPool code above, I'll give that a shot.

ecederstrand (Contributor) commented:

According to https://github.com/Thriftpy/thriftpy, thriftpy is deprecated and users should migrate to thriftpy2. I don't know what the implications are for happybase.


ethe commented May 24, 2019

I think it has already been fixed in thriftpy2, see Thriftpy/thriftpy2@ce522cc. Thanks.

wbolster (Member) commented:

happybase 1.2.0 uses thriftpy2: https://happybase.readthedocs.io/en/latest/news.html#happybase-1-2-0

ankitgaur commented:

Hi, I am using happybase 1.2.0 and thriftpy2 0.4.4 and still facing this issue.

Steps to reproduce:

  1. Create a connection pool:

     self.pool = happybase.ConnectionPool(size=self.POOL_SIZE, host=self.HBASE_HOST)

  2. Get a connection and fetch a row from an HBase table:

     with self.pool.connection() as connection:
         table = connection.table(self.HBASE_TABLE)
         row = table.row(id)

  3. Sleep 3 minutes.

  4. Do step 2 again.

Result: I get the following exception.

thriftpy2.transport.TTransportException: TTransportException(type=4, message='TSocket read 0 bytes')
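
For reference, the steps above as one standalone script (host, table name, and row key are placeholders):

import time
import happybase

pool = happybase.ConnectionPool(size=3, host='hbase-thrift.example.com')

def fetch_row(row_key):
    with pool.connection() as connection:
        table = connection.table('mytable')
        return table.row(row_key)

fetch_row(b'row-1')  # works
time.sleep(180)      # idle long enough for the server to drop the socket
fetch_row(b'row-1')  # raises TTransportException('TSocket read 0 bytes')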


ethe commented Aug 2, 2019

I think this is by design in thriftpy2: EOF is returned if the connection was closed.

ankitgaur commented:

So, in conclusion, the connection pooling in happybase does not work?


ethe commented Aug 3, 2019

I think HappyBase should catch this exception itself if it provides a built-in connection pool; see the sketch below for what that could look like from user code today.
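
A user-side sketch of that idea, assuming thriftpy2's exception type (TTransportException is a TException subclass, so the pool already replaces the tainted connection before re-raising, and the retry gets a fresh socket); run_with_retry is an illustrative name, not part of happybase:

from thriftpy2.transport import TTransportException

def run_with_retry(pool, func, retries=2):
    """Retry a pooled operation when the checked-out connection is dead."""
    for attempt in range(retries):
        try:
            with pool.connection() as conn:
                return func(conn)  # func receives an open happybase Connection
        except TTransportException:
            if attempt == retries - 1:
                raise

# e.g. run_with_retry(pool, lambda conn: conn.table('mytable').row(b'row-1'))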

nackjicholson commented:

The AutoRetryPool solution written by @gorlins above helps to repair broken connections that are released back to the pool. This is nice, but it doesn't solve all connection problems. A random blog on the internet pretty clearly demonstrates a problem with HBase and happybase:

http://www.programmersought.com/article/41421562488/;jsessionid=660BCDAD2F7713EAE1A111B2255FD2FA

HBase is hanging up on us while we hold an active connection, and there is nothing gorlins's code or happybase can really do to repair that. That is this bit:

if opened:
    # aka the failure was inside the yield, can't fix
    raise

So, if you're doing something like:

with pool.connection() as conn:
    t = conn.table("foo")
    t.put("1", {})

    # do something expensive for 60s
    sleep(60)

    # try to write some more data, BOOM BrokenPipeError!
    t.put("2", {})

That second t.put is going to blow up with a BrokenPipeError, because the pipe to HBase is gone.

The solution, unfortunately, is to open a new connection for each set of operations and do the expensive work in between without holding one:

with pool.connection() as conn:
    t = conn.table("foo")
    t.put("1", {})

# close that one.
# do something expensive and time consuming, this time without an active conn.
sleep(60)

# open up another one
with pool.connection() as conn:
    t = conn.table("foo")
    t.put("2", {})

@ankitgaur So yes, connection pooling in happybase does not work. Also, even if it did work, HBase is going to hang up on you anyway.


ethe commented Oct 19, 2019

Hi @nackjicholson, that can be a workaround. However, catching the lost-connection exception inside HappyBase would fully resolve this issue.
