
doubleflush bug #58

Closed
scraperdragon opened this issue Aug 21, 2014 · 11 comments

Comments

@scraperdragon
Contributor

http://pastebin.com/J4eGcWDC

--> contains additional debugging statements
https://github.com/scraperwiki/scraperwiki-python/tree/doubleflush_bug

#!/usr/bin/env python
import logging
import scraperwiki
import json

logging.basicConfig(level=logging.DEBUG)

data = [{'rowx':x} for x in range(900042)]
#scraperwiki.sql.save(['rowx'], data)

print 'cat'
try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print repr(e)
else:
    print 'no exception'
print 'dog'
try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print repr(e)
else:
    print 'no exception'
print 'nope'

print 'la la la la\n'*6
cat
AttributeError("'NoneType' object has no attribute 'decode'",)
dog
RuntimeError('Double flush',)
nope
la la la la
la la la la
la la la la
la la la la
la la la la
la la la la
@pwaller
Contributor

pwaller commented Aug 21, 2014

Nice sleuthing. Spotted a couple of bugs as a result.

Boiled the test case down to this:

#!/usr/bin/env python
import scraperwiki

data = [{'rowx': None} for x in range(10000)]

try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print "Masked exception:", repr(e)

scraperwiki.sql.save(['rowx'], data)

@pwaller
Contributor

pwaller commented Aug 21, 2014

I'm not yet sure what to do in this case. What's happening is that the first save fails due to an exception (because the user passed bad data). The user masks this, then goes on to try another save, which fails.

If I were writing the code I wouldn't (a) pass bad data or (b) mask the exception; i.e. I would want it to blow up.

The problem is that the user is doing something which causes save() to explode, but they expect it to be non-fatal. By buffering the save and replaying it later, we make the explosion happen somewhere else, which isn't optimal.

I think the solution is to sanity-check the input in append before buffering it.
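
For illustration, here is a rough sketch of what that check might look like. The function names and the exact rules are made up for this example; they are not the actual scraperwiki.sql internals:

def _validate_row(row):
    # Hypothetical check: reject non-dict rows and rows whose values are
    # all None, so the error surfaces in the save() call that supplied the
    # bad data rather than in some later, unrelated flush.
    if not isinstance(row, dict):
        raise TypeError("row must be a dict, got %r" % type(row))
    if row and all(value is None for value in row.values()):
        raise ValueError("all values in row are null: %r" % (row,))

def append(buffer, rows):
    # Hypothetical buffering append(): validate first, buffer second.
    rows = [rows] if isinstance(rows, dict) else list(rows)
    for row in rows:
        _validate_row(row)
    buffer.extend(rows)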

@pwaller
Contributor

pwaller commented Aug 21, 2014

There is also another bug where state is incorrectly reset in the flush() which needs to be preserved during a long append(), which I'll submit a pull request for.

@pwaller
Contributor

pwaller commented Aug 21, 2014

I note that:

scraperwiki.sql.save(['rowx'], {})

is fine, but:

scraperwiki.sql.save(['rowx'], [{}])

raises ValueError: "You passed no sample values, or all the values you passed were null."

It feels like it violates a generalisation principle that passing no data isn't a no-op. But I don't know what the user might expect.

It feels pretty evil that dumptruck deletes keys with None as the value. What if you want to set a column to null? Then, after having deleted them, it explodes if nothing is left.
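
As a rough illustration of that behaviour (this is just the effect, not dumptruck's actual code), dropping None-valued keys means an all-None row has nothing left to infer columns from, which is how you end up at the ValueError above:

row = {'rowx': None}

# The effect described above: keys whose value is None are dropped...
cleaned = {k: v for k, v in row.items() if v is not None}

# ...leaving nothing to derive column names or types from, so the save
# raises instead of writing a NULL into the rowx column.
assert cleaned == {}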

/cc @IanHopkinson @StevenMaude @scraperdragon

@scraperdragon
Contributor Author

I think the solution is to sanity-check the input in append before buffering it.

Agreed, though I'm not entirely sure what the input's requirements are. There may be some odd edge cases (dates spring to mind).

It feels like it violates a generalisation principle that passing no data isn't a NOOP

Agreed.

It feels pretty evil that dumptruck deletes keys with None as the value.

Agreed. I think this is an artifact of the way the dataproxy used to work / was intended to work, where:

scraperwiki.sql.save({'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save({'my_id':5, 'pet_name': 'Rover'})

would lead to a row of {'my_id': 5, 'animal': 'dog', 'pet_name': 'Rover'}.
However, I don't remember this ever working.

@pwaller
Contributor

pwaller commented Aug 21, 2014

I'm happy with the behaviour you just stated; that's how it should work. However, what it actually does is take:

scraperwiki.sql.save([], {'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id':5, 'pet_name': 'Rover', 'animal': None})

And where I would expect animal to be nulled, it just ignores the animal field, giving the same effect as the code from your example. It's as if you had run:

scraperwiki.sql.save([], {'my_id':5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id':5, 'pet_name': 'Rover'})
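
Spelled out in plain sqlite3 terms (an illustration of the semantics I would expect, not of what dumptruck actually emits), the second save ought to act like an update that explicitly nulls animal:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (my_id INTEGER PRIMARY KEY, animal TEXT, pet_name TEXT)")

# First save: establishes the row.
conn.execute("INSERT INTO swdata (my_id, animal) VALUES (?, ?)", (5, 'dog'))

# Second save, with an explicit animal=None: the behaviour I would expect
# is that the None nulls the column rather than being silently dropped.
conn.execute("UPDATE swdata SET pet_name = ?, animal = ? WHERE my_id = ?",
             ('Rover', None, 5))

print(conn.execute("SELECT my_id, animal, pet_name FROM swdata").fetchall())
# [(5, None, 'Rover')] -- animal has been nulled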

@pwaller
Contributor

pwaller commented Aug 21, 2014

We still have to somehow deal with the case where real_save legitimately explodes (someone just yanked the hard disk, etc.) and the user "legitimately" does a catch-all and then later tries writing again. In that case I think we have to scream a big warning that there was an error whilst flushing the buffer, and accept that that error might jump into the fray at any later point.

Ultimately this is all just hacking around the fact that dumptruck is (extremely) suboptimal for doing individual row insertions. If @sean-duffy or someone attacks the core of the problem by rewriting that logic in sqlalchemy then it should be possible to make it so that .save() is as cheap as an INSERT. I guess some time will have to be spent considering the transaction semantics of that, however.
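
As a very rough sketch of that direction (the schema, table and file name here are invented for illustration; this is not the actual rewrite), SQLAlchemy Core lets a single save translate into a plain INSERT inside a short transaction:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, Text

# Hypothetical schema and database path, for illustration only.
engine = create_engine("sqlite:///scraperwiki.sqlite")
metadata = MetaData()
swdata = Table("swdata", metadata,
               Column("my_id", Integer, primary_key=True),
               Column("animal", Text),
               Column("pet_name", Text))
metadata.create_all(engine)

def save(row):
    # One row, one INSERT, one short transaction: there is no buffer to
    # flush later, so any error surfaces in the save() call that caused it.
    with engine.begin() as conn:
        conn.execute(swdata.insert(), row)

save({"my_id": 5, "animal": "dog"})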

@pwaller
Contributor

pwaller commented Aug 21, 2014

(From my last point I conclude that we may wish to defer the buffering question indefinitely in favour of the rewrite to use sqlalchemy: getting this right could be very hard and might not give a well-behaved solution in all circumstances, whereas fundamentally fixing the backend probably will result in obviously correct behaviour.)

@drj11
Contributor

drj11 commented Aug 27, 2014

FWIW I agree with @scraperdragon. The thing about passing (key,None) pairs is just wrong.

The thing about exceptions is that it's an exception. All bets are off. You should not reasonably expect any future .save()s to work (though @pwaller is right, we could make it so that this case was more clearly diagnosed).

A rewrite in sqlalchemy is clearly a good idea, but so is doing a million other things. (PRs accepted).

@sean-duffy
Contributor

@drj11 A sqlalchemy rewrite of dumptruck is what I'm currently doing:
https://github.com/scraperwiki/dumptruck/tree/sqlalchemy-rewrite

@drj11
Contributor

drj11 commented Sep 23, 2014

Almost all of this is irrelevant now that we have the SQLAlchemy rewrite.
