doubleflush bug #58
Nice sleuthing. Spotted a couple of bugs as a result. Boiled the test case down to this:

```python
#!/usr/bin/env python
import scraperwiki

data = [{'rowx': None} for x in range(10000)]
try:
    scraperwiki.sql.save(['rowx'], data)
except Exception as e:
    print "Masked exception:", repr(e)
scraperwiki.sql.save(['rowx'], data)
```
I'm not yet sure what to do in this case. What's happening is that the first save fails due to an exception (because the user passed bad data). The user masks this, then goes on to try another save, which also fails. If I were writing the code I wouldn't a) pass bad data or b) mask the exception, i.e. I would want it to blow up. The problem is that the user is doing something which leaves the buffer in a bad state. I think the solution is to sanity-check the input in `append` before buffering it.
There is also another bug where state is incorrectly reset in the …
I note that:

```python
scraperwiki.sql.save(['rowx'], {})
```

is fine, but:

```python
scraperwiki.sql.save(['rowx'], [{}])
```

raises an exception. It feels like it violates a generalisation principle that passing no data isn't a NOOP. But I don't know what the user might expect. It feels pretty evil that dumptruck deletes keys with …
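One hedged way to restore that generalisation would be to normalise the input so that a bare dict is treated as a single row and empty rows are dropped, making `{}`, `[]`, and `[{}]` all NOOPs alike (`normalise_data` is a hypothetical helper, not part of the real API):

```python
def normalise_data(data):
    """Return a list of non-empty row dicts; empty input means 'save nothing'."""
    if isinstance(data, dict):
        data = [data]
    # Dropping empty rows makes {}, [] and [{}] behave identically: a NOOP.
    return [dict(row) for row in data if row]
```

Whether `[{}]` should silently be a NOOP or raise is exactly the open design question; this sketch just shows the NOOP option.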
Agreed, but I'm not entirely sure what the input's requirements are. There may be some odd edge cases (dates spring to mind).
Agreed.
Agreed. I think this is an artifact of the way the dataproxy used to work / was intended to work, where:

…

would lead to a row of `{'my_id': 5, 'animal': 'dog', 'pet_name': 'Rover'}`.
I'm happy with the behaviour you just stated; that's how it should work. However, what it actually does is take:

```python
scraperwiki.sql.save([], {'my_id': 5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id': 5, 'pet_name': 'Rover', 'animal': None})
```

and where I would expect `animal` to be nulled, it just ignores the `animal` field, with the same effect as the code from your example. It's as if you had run:

```python
scraperwiki.sql.save([], {'my_id': 5, 'animal': 'dog'})
scraperwiki.sql.save([], {'my_id': 5, 'pet_name': 'Rover'})
```
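To illustrate the semantics being asked for here (an absent key keeps the stored value, while an explicit `None` nulls it), a minimal sketch in plain `sqlite3` rather than dumptruck, with a hard-coded `pets` table made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (my_id INTEGER PRIMARY KEY, animal TEXT, pet_name TEXT)")
COLUMNS = ('my_id', 'animal', 'pet_name')

def save(row):
    # Merge with any existing row: absent keys keep their stored value,
    # keys explicitly set to None overwrite the stored value with NULL.
    cur = conn.execute("SELECT my_id, animal, pet_name FROM pets WHERE my_id = ?",
                       (row['my_id'],))
    existing = cur.fetchone()
    merged = dict(zip(COLUMNS, existing)) if existing else {}
    merged.update(row)
    cols = sorted(merged)
    conn.execute("INSERT OR REPLACE INTO pets (%s) VALUES (%s)"
                 % (", ".join(cols), ", ".join("?" for _ in cols)),
                 [merged[c] for c in cols])

save({'my_id': 5, 'animal': 'dog'})
save({'my_id': 5, 'pet_name': 'Rover', 'animal': None})
print(conn.execute("SELECT animal, pet_name FROM pets WHERE my_id = 5").fetchone())
# (None, 'Rover') -- animal explicitly nulled, pet_name kept
```

The read-then-merge step is what distinguishes "key omitted" from "key set to None"; a bare `INSERT OR REPLACE` alone cannot make that distinction because it replaces the whole row.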
We still have to somehow deal with the case that …

Ultimately this is all just hacking around the fact that dumptruck is (extremely) suboptimal for doing individual row insertions. If @sean-duffy or someone attacks the core of the problem by rewriting that logic in sqlalchemy, then it should be possible to make it so that …
(From my last point I conclude that we may wish to indefinitely defer the buffering question entirely in favour of the rewrite to use sqlalchemy: getting this right could be very hard and might not give a well-behaved solution in all circumstances, whereas fundamentally fixing the backend probably will result in obviously correct behaviour.)
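For a sense of what batched row insertion through SQLAlchemy Core could look like, a sketch only: the `pets` table and its columns are made up for illustration, and the real rewrite may well be structured differently. SQLite's `OR REPLACE` prefix gives upsert semantics on the primary key, and the whole batch runs in one transaction:

```python
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, Text

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
pets = Table("pets", metadata,
             Column("my_id", Integer, primary_key=True),
             Column("animal", Text),
             Column("pet_name", Text))
metadata.create_all(engine)

rows = [{"my_id": i, "animal": "dog", "pet_name": "Rover%d" % i} for i in range(3)]

# One transaction for the whole batch; executemany-style insert avoids
# the per-row commit cost that makes dumptruck slow here.
with engine.begin() as conn:
    conn.execute(pets.insert().prefix_with("OR REPLACE"), rows)

with engine.connect() as conn:
    print(conn.execute(pets.select()).fetchall())
```

Because the engine batches and transacts the inserts itself, buffering in the scraperwiki layer could plausibly go away entirely.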
FWIW I agree with @scraperdragon. The thing about passing …

The thing about exceptions is that it's an exception. All bets are off. You should not reasonably expect any future …

A rewrite in sqlalchemy is clearly a good idea, but so is doing a million other things. (PRs accepted.)
@drj11 A sqlalchemy rewrite of dumptruck is what I'm currently doing: |
Almost all of this is irrelevant now that we have the SQLAlchemy rewrite.
http://pastebin.com/J4eGcWDC (contains additional debugging statements)

https://github.com/scraperwiki/scraperwiki-python/tree/doubleflush_bug