Scanner limit not always correct #72

Closed
daz-li opened this Issue Nov 17, 2014 · 4 comments


@daz-li
daz-li commented Nov 17, 2014

There is a bug (and a simple fix) in the handling of limit in table.py:

# Avoid round-trip when exhausted
if len(items) < how_many: 
    break 

To provide some context: items holds the rows of the current batch returned from the HBase server (via a scan iterator), and how_many is the number of rows requested for that batch. The idea behind the check above is to stop fetching once the number of rows in items is less than how_many.

However, how_many is only a suggestion to the server, not a requirement: the batch returned by the server may well contain fewer rows than the suggested number (how_many) even though there are still more rows to return. This causes the scan to end prematurely. The fix is easy:

# if len(items) < how_many:
#     break
if len(items) == 0:
    break
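The fixed loop can be sketched standalone like this (a minimal sketch: `scan_with_limit`, `fetch_batch`, and `make_server` are hypothetical names, not HappyBase's actual code, and the simulated server stands in for the Thrift scanner call):

```python
def scan_with_limit(fetch_batch, limit, batch_size):
    """Yield up to `limit` rows; stop only on an *empty* batch.

    A short (but non-empty) batch does not mean the scanner is
    exhausted -- the server may return fewer rows than requested
    even when more remain.
    """
    n_returned = 0
    while n_returned < limit:
        how_many = min(batch_size, limit - n_returned)
        items = fetch_batch(how_many)
        if len(items) == 0:  # truly exhausted
            break
        for item in items:
            yield item
            n_returned += 1
            if n_returned == limit:
                return


def make_server(total_rows, caps):
    """Simulated server: caps each batch at the next value in `caps`,
    mimicking a server that returns short batches mid-scan."""
    pos = [0]
    cap_iter = iter(caps)

    def fetch(how_many):
        n = min(how_many, next(cap_iter), total_rows - pos[0])
        rows = list(range(pos[0], pos[0] + n))
        pos[0] += n
        return rows

    return fetch
```

With caps like [100, 100, 48, 100, ...] the old `len(items) < how_many` check would have stopped right after the 48-row batch; this loop keeps fetching and delivers the full limit.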
@wbolster
Owner

Are you sure the how_many flag (or whatever the name is in the Thrift API) is only a hint? Where did you get that information?

@daz-li
daz-li commented Nov 20, 2014

I used the following very simple code to debug table.py. Basically, I print the number of items returned by inserting the following into table.py:

   print '===:', len(items)

Of course, I first applied the aforementioned fix to avoid the premature exit. Then I used the following to test against my table:

    import happybase
    conn = happybase.Connection(thrift)
    tb = conn.table(tbname)
    scanner = tb.scan(row_prefix='com.google', limit=1000, batch_size=100)
    for (k, v) in scanner:
        pass

The output is:

===: 100
===: 100
===: 100
===: 100
===: 100
===: 100 
===: 48
===: 100
===: 8
===: 100
===: 7
===: 100
===: 10
===: 27

Despite batch_size=100, we see batches returned mid-scan with fewer than 100 items (i.e. 48, 8, 7, 10). I have not looked into the HBase code yet, so I cannot confirm whether this is intended or accidental on the HBase side. But it would make sense if the cache size passed to the scan is only a suggested number.
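Feeding the observed batch sizes into both checks makes the early exit concrete (a minimal sketch; the helper names are made up, and the list of ints stands in for the lengths of the `items` batches):

```python
# Batch sizes observed in the debug output above.
observed = [100, 100, 100, 100, 100, 100, 48, 100, 8, 100, 7, 100, 10, 27]


def rows_before_exit_buggy(batch_sizes, how_many=100):
    # Original check: stop as soon as a batch is shorter than requested.
    total = 0
    for n in batch_sizes:
        total += n
        if n < how_many:
            break
    return total


def rows_before_exit_fixed(batch_sizes):
    # Fixed check: stop only when a batch comes back empty.
    total = 0
    for n in batch_sizes:
        if n == 0:
            break
        total += n
    return total
```

The buggy check exits after the first short batch (648 of the 1000 rows the server actually delivered); the fixed check consumes all 1000.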

BTW, this is on MapR's table implementation of BigTable.

@wbolster
Owner

Okay, seems HBase behaves differently than I expected from its docs.

@wbolster wbolster added the bug label Nov 20, 2014
@wbolster wbolster self-assigned this Nov 20, 2014
@wbolster wbolster added this to the 0.9 milestone Nov 20, 2014
@wbolster wbolster changed the title from A bug in dealing with limit. to Scanner limit not always correct Nov 24, 2014
@wbolster wbolster closed this in 8a27521 Nov 24, 2014
@wbolster
Owner

FYI, I've released HappyBase 0.9.
