
doubleflush bug #58

Closed
scraperdragon opened this issue Aug 21, 2014 · 11 comments

Comments

@scraperdragon
Contributor

http://pastebin.com/J4eGcWDC

--> contains additional debugging statements
https://github.com/scraperwiki/scraperwiki-python/tree/doubleflush_bug

#!/usr/bin/env python
import logging
import scraperwiki
import json

logging.basicConfig(level=logging.DEBUG)

data = [{'rowx':x} for x in range(900042)]
#scraperwiki.sql.save(['rowx'], data)

print 'cat'
try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print repr(e)
else:
    print 'no exception'
print 'dog'
try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print repr(e)
else:
    print 'no exception'
print 'nope'

print 'la la la la\n'*6
cat
AttributeError("'NoneType' object has no attribute 'decode'",)
dog
RuntimeError('Double flush',)
nope
la la la la
la la la la
la la la la
la la la la
la la la la
la la la la
@pwaller
Contributor

pwaller commented Aug 21, 2014

Nice sleuthing. Spotted a couple of bugs as a result.

Boiled the test case down to this:

#!/usr/bin/env python
import scraperwiki

data = [{'rowx': None} for x in range(10000)]

try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print "Masked exception:", repr(e)

scraperwiki.sql.save(['rowx'], data)

@pwaller
Contributor

pwaller commented Aug 21, 2014

I'm not yet sure what to do in this case. What's happening is that the first save fails due to an exception (because the user passed bad data). The user masks this, then goes on to try another save, which fails.

If I were writing the code I wouldn't (a) pass bad data or (b) mask the exception; i.e. I would want it to blow up.

The problem is that the user is doing something which causes save() to explode, but they expect it to be non-fatal. By buffering the save and replaying it later, we make the explosion happen somewhere else, which isn't optimal.

I think the solution is to sanity-check the input in append before buffering it.
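
For illustration, here is a rough sketch of what that check might look like. The function names and the exact rules are made up for this example; they are not the actual scraperwiki.sql internals:

def _validate_row(row):
    # Hypothetical check: reject non-dict rows and rows whose values are
    # all None, so the error surfaces in the save() call that supplied the
    # bad data rather than in some later, unrelated flush.
    if not isinstance(row, dict):
        raise TypeError("row must be a dict, got %r" % type(row))
    if row and all(value is None for value in row.values()):
        raise ValueError("all values in row are null: %r" % (row,))

def append(buffer, rows):
    # Hypothetical buffering append(): validate first, buffer second.
    rows = [rows] if isinstance(rows, dict) else list(rows)
    for row in rows:
        _validate_row(row)
    buffer.extend(rows)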

@pwaller
Contributor

pwaller commented Aug 21, 2014

There is also another bug where state is incorrectly reset in the flush() which needs to be preserved during a long append(), which I'll submit a pull request for.

@pwaller
Contributor

pwaller commented Aug 21, 2014

I note that:

scraperwiki.sql.save(['rowx'], {})

is fine, but:

scraperwiki.sql.save(['rowx'], [{}])

raises ValueError: "You passed no sample values, or all the values you passed were null."

It feels like it violates a generalisation principle that passing no data isn't a no-op. But I don't know what the user might expect.

It feels pretty evil that dumptruck deletes keys with None as the value. What if you want to set a column to null? Then, after having deleted them, it explodes if nothing is left.
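
As a rough illustration of that behaviour (this is just the effect, not dumptruck's actual code), dropping None-valued keys means an all-None row has nothing left to infer columns from, which is how you end up at the ValueError above:

row = {'rowx': None}

# The effect described above: keys whose value is None are dropped...
cleaned = {k: v for k, v in row.items() if v is not None}

# ...leaving nothing to derive column names or types from, so the save
# raises instead of writing a NULL into the rowx column.
assert cleaned == {}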

/cc @IanHopkinson @StevenMaude @scraperdragon

@scraperdragon
Contributor Author

I think the solution is to sanity-check the input in append before buffering it.

Agreed, though I'm not entirely sure what the input's requirements are. There may be some odd edge cases (dates spring to mind).

It feels like it violates a generalisation principle that passing no data isn't a NOOP

Agreed.

It feels pretty evil that dumptruck deletes keys with None as the value.

Agreed. I think this is an artifact of the way the dataproxy used to work / was intended to work, where:

scraperwiki.sql.save({'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save({'my_id':5, 'pet_name': 'Rover'})

would lead to a row of {'my_id': 5, 'animal': 'dog', 'pet_name': 'Rover'}.
However, I don't remember this ever working.

@pwaller
Contributor

pwaller commented Aug 21, 2014

I'm happy with the behaviour you just stated; that's how it should work. However, what it actually does is take:

scraperwiki.sql.save([], {'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id':5, 'pet_name': 'Rover', 'animal': None})

And where I would expect animal to be nulled, it just ignores the animal field, giving the same effect as the code from your example. It's as if you had run:

scraperwiki.sql.save([], {'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id':5, 'pet_name': 'Rover'})
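
Spelled out in plain sqlite3 terms (an illustration of the semantics I would expect, not of what dumptruck actually emits), the second save ought to act like an update that explicitly nulls animal:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (my_id INTEGER PRIMARY KEY, animal TEXT, pet_name TEXT)")

# First save: establishes the row.
conn.execute("INSERT INTO swdata (my_id, animal) VALUES (?, ?)", (5, 'dog'))

# Second save, with an explicit animal=None: the behaviour I would expect
# is that the None nulls the column rather than being silently dropped.
conn.execute("UPDATE swdata SET pet_name = ?, animal = ? WHERE my_id = ?",
             ('Rover', None, 5))

print(conn.execute("SELECT my_id, animal, pet_name FROM swdata").fetchall())
# [(5, None, 'Rover')] -- animal has been nulled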

@pwaller
Contributor

pwaller commented Aug 21, 2014

We still have to somehow deal with the case where real_save legitimately explodes (someone just yanked the hard disk, etc.) and the user "legitimately" does a catch-all and then later tries writing again. In that case I think we have to scream a big warning that there was an error whilst flushing the buffer, and accept that that error might jump into the fray at any later point.

Ultimately this is all just hacking around the fact that dumptruck is (extremely) suboptimal for doing individual row insertions. If @sean-duffy or someone attacks the core of the problem by rewriting that logic in sqlalchemy then it should be possible to make it so that .save() is as cheap as an INSERT. I guess some time will have to be spent considering the transaction semantics of that, however.
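
As a very rough sketch of that direction (the schema, table and file name here are invented for illustration; this is not the actual rewrite), SQLAlchemy Core lets a single save translate into a plain INSERT inside a short transaction:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, Text

# Hypothetical schema and database path, for illustration only.
engine = create_engine("sqlite:///scraperwiki.sqlite")
metadata = MetaData()
swdata = Table("swdata", metadata,
               Column("my_id", Integer, primary_key=True),
               Column("animal", Text),
               Column("pet_name", Text))
metadata.create_all(engine)

def save(row):
    # One row, one INSERT, one short transaction: there is no buffer to
    # flush later, so any error surfaces in the save() call that caused it.
    with engine.begin() as conn:
        conn.execute(swdata.insert(), row)

save({"my_id": 5, "animal": "dog"})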

@pwaller
Contributor

pwaller commented Aug 21, 2014

(From my last point I conclude that we may wish to defer the buffering question indefinitely in favour of the rewrite to use sqlalchemy: getting this right could be very hard and might not give a well-behaved solution in all circumstances, whereas fundamentally fixing the backend probably will result in obviously correct behaviour.)

@drj11
Contributor

drj11 commented Aug 27, 2014

FWIW I agree with @scraperdragon. The thing about passing (key,None) pairs is just wrong.

The thing about exceptions is that it's an exception. All bets are off. You should not reasonably expect any future .save()s to work (though @pwaller is right, we could make it so that this case was more clearly diagnosed).

A rewrite in sqlalchemy is clearly a good idea, but so is doing a million other things. (PRs accepted).

@sean-duffy
Contributor

@drj11 A sqlalchemy rewrite of dumptruck is what I'm currently doing:
https://github.com/scraperwiki/dumptruck/tree/sqlalchemy-rewrite

@drj11
Contributor

drj11 commented Sep 23, 2014

Almost all of this is irrelevant now that we have the SQLAlchemy rewrite.
