BUG: Fix inconsistent C engine quoting behaviour #13411
a5f270b
to
8122350
Compare
I think that we should expose these. Not sure if this is an issue in the real world, but it seems to be a small pothole.
cols = ['a', 'b', 'c']

# QUOTE_MINIMAL and QUOTE_ALL apply only to
# the CSV writer, so they should have no
should these be an error then? (or you are saying they don't do anything so might as well accept them)?
Definitely not an error. Just leave it alone is best.
But maybe we shouldn't mention them explicitly as options in documentation then?
Well they are technically valid to Python's csv.reader class, so at this point I would leave them so that users know that they are options for completeness.
@jreback : Documentation on the meanings of the |
@jreback : I'm perfectly fine with name-spacing as I just commented above, though I'm hesitant about accepting strings. IMO enums are sufficient and will make it easier for users to debug (spelling errors). Not entirely sure how many people use them as you said (my guess is that many don't, and I certainly have never used them in my work), though it should probably remain as an option since it does play a role in |
383f2db
to
c879395
Compare
Current coverage is 84.32%
@@            master   #13411   diff @@
==========================================
  Files          138      138
  Lines        51069    51068     -1
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
- Hits         43066    43065     -1
  Misses        8003     8003
  Partials         0        0
Added the |
@@ -287,7 +287,7 @@ lineterminator : str (length 1), default ``None``
 quotechar : str (length 1)
   The character used to denote the start and end of a quoted item. Quoted items
   can include the delimiter and it will be ignored.
-quoting : int or ``csv.QUOTE_*`` instance, default ``None``
+quoting : int or ``csv.QUOTE_*`` instance, default ``0``
Isn't this still defaulted to None, meaning don't consider quoting at all (and NOT QUOTE_NONE)? Or are these the same?
- The signature says quoting=0
- Passing in None raises an error (it always has)
oh, you are essentially fixing the documentation, ok then
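For reference, the integer values behind the csv.QUOTE_* constants can be checked with the standard library alone. The sketch below uses only Python's csv module, not pandas, so the None case it probes is the behaviour of csv.reader rather than of read_csv; the helper name accepts_quoting is made up for illustration.

```python
import csv

# Integer values of the stdlib quoting constants discussed in this thread.
QUOTING_VALUES = {
    "QUOTE_MINIMAL": csv.QUOTE_MINIMAL,
    "QUOTE_ALL": csv.QUOTE_ALL,
    "QUOTE_NONNUMERIC": csv.QUOTE_NONNUMERIC,
    "QUOTE_NONE": csv.QUOTE_NONE,
}


def accepts_quoting(value):
    """Hypothetical helper: does csv.reader accept `value` for quoting?"""
    try:
        csv.reader(["a,b"], quoting=value)
    except TypeError:
        return False
    return True


for name, value in QUOTING_VALUES.items():
    print(name, value)

# A default of 0 corresponds to QUOTE_MINIMAL, which is distinct from QUOTE_NONE.
print("quoting=0 accepted:", accepts_quoting(0))
print("quoting=None accepted:", accepts_quoting(None))
```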
c879395
to
14d5ef2
Compare
expected = DataFrame([[3, '4 " 5"']],
                     columns=['a', 'b'])
result = self.read_csv(StringIO(data), quotechar='"',
                       doublequote=False)
I don't really understand this one (I mean: reading the docs, I would expect '4 "" 5'). Can you explain the logic of this result?
Not entirely sure from the Python engine perspective, but from the tokenizer.c perspective, the two quotes that you see are there because they are interpreted as in-field quotations, so they are processed like normal characters.
But why the quote after the 5? (so why '4 " 5"' and not '4 " 5') Because isn't this quote the ending quote of the string (since quotechar='"')?
No it is not. The quote is considered in field. Follow along with the code and you'll see.
I can see indeed the logic in the code. So after a closing quote, the rest of the field is regarded as a normal field, and cannot be a quoted field anymore, regardless of occurring quotes.
But you could also set the state to START_FIELD instead of IN_FIELD.
Anyway, this is a rather pathological case, so not that important. I just found it a bit strange (and as you are adding the behaviour explicitly as a test, wanted to check it)
Oh, okay. Thanks for clarifying!
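The state-machine behaviour discussed in this thread can be reproduced with Python's own csv reader, which is a rough stand-in here and not the pandas C tokenizer. The input string below is an assumption, since the test's data variable is not shown in the snippet above; the function name parse is likewise illustrative.

```python
import csv
from io import StringIO

# Assumed sample input: a header row, then a row whose second field is
# quoted and contains a doubled quote character.
DATA = 'a,b\n3,"4 "" 5"\n'


def parse(doublequote):
    """Parse DATA with the stdlib reader and return all rows."""
    reader = csv.reader(StringIO(DATA), quotechar='"', doublequote=doublequote)
    return list(reader)


rows_doubled = parse(doublequote=True)
rows_not_doubled = parse(doublequote=False)

print(rows_doubled[1])      # ['3', '4 " 5']
print(rows_not_doubled[1])  # ['3', '4 " 5"']
```

With doublequote=False the first inner quote closes the quoted part of the field, and every later quote character, including the final one, is kept as an ordinary character. That is why the trailing quote survives in the second result.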
I am not really sure that I find |
@jorisvandenbossche : I disagree about the strings. See my reasoning above. The inclusion of the enums in the namespace was to avoid hard-coding them as I did initially in |
14d5ef2
to
4013482
Compare
@jreback : Did some minor tweaking of the docs once more to mention that the |
4013482
to
633f78f
Compare
OK on not adding those, but still, I don't see the added value to users of having the enums in pandas.io.common (it's just more ways to import the exact same thing). So personally I would just not mention it in the docs (also, if we do put it in the docs, we should have a test for it)
What test exactly? Not sure what you mean. @jreback thoughts?
Sorry, I meant just a simple test that it is available in the pd.io.common namespace (if we want users to use it from there). But let's first decide what to do with it.
@jorisvandenbossche : Not sure if we're encouraging users to use it from there. Just providing a |
e075f0b
to
6595618
Compare
if not isinstance(quoting, int):
    raise TypeError('"quoting" must be an integer')

if not 0 <= quoting <= 3:
you should instead test that it is a valid value (e.g. it has to match one of the constants)
Fair enough. It's a little nicer than the two checks that I have currently (e.g. someone for whatever reason could pass in np.int32(2) and that would raise).
Actually, I take that back. The current checks will align nicely with what the Python engine currently does in its csv library. I do check that it is a valid value by first checking whether or not it's an integer. Then if so, it can only take on four values (between 0 and 3 inclusive, hence my subsequent check with the inequalities).
no my point is instead of actually using 0 and 3, use QUOTE_NONE, and QUOTE_* (whatever the last one is).
Ah, got it. Fixed.
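A minimal sketch of the check as it might look after this exchange, with the literals 0 and 3 replaced by the named constants. This is an illustration, not the code merged in the PR: the function names validate_quoting and is_valid are hypothetical, and the second error message is an assumption.

```python
import csv


def validate_quoting(quoting):
    """Illustrative validation only; not the actual pandas implementation."""
    if not isinstance(quoting, int):
        raise TypeError('"quoting" must be an integer')
    # Bound the range with the named constants instead of the literals 0 and 3.
    if not csv.QUOTE_MINIMAL <= quoting <= csv.QUOTE_NONE:
        raise TypeError('bad "quoting" value')  # assumed message
    return quoting


def is_valid(quoting):
    """Hypothetical convenience wrapper returning True/False instead of raising."""
    try:
        validate_quoting(quoting)
    except TypeError:
        return False
    return True


print(is_valid(csv.QUOTE_NONNUMERIC))
print(is_valid(10))
print(is_valid("foo"))
```

As noted above, a strict isinstance(quoting, int) check also rejects values such as np.int32(2), which matches what the thread says about the existing behaviour.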
Adding it to the docstring seems 'encouraging users to use it' to me, why otherwise adding it there? For me it's OK to add it to the pd.io.common namespace if that makes the implementation easier, no problem, I personally would just not add it to the documentation.
I could go either way. I thought it would be more consistent, IOW, a user can just use it directly, not having to think about So @gfyoung let's revert that (the location of the constants).
397b665
to
bf2a59e
Compare
1) Add significant testing to quoting in read_csv
2) Fix bug in C engine in which a NULL quotechar would raise even though quoting=csv.QUOTE_NONE
3) Fix bug in C engine in which quoting=csv.QUOTE_NONNUMERIC wouldn't cause non-quoted fields to be cast to float.
4) Fixed minor doc error for quoting parameter, as the default value is ZERO not None (passing None will in fact raise an error).
bf2a59e
to
0e791a5
Compare
thanks @gfyoung
@gfyoung Thanks a lot! This more comprehensive testing is really good
- Add significant testing to quoting in read_csv
- Fix bug in C engine in which a NULL quotechar would raise even though quoting=csv.QUOTE_NONE.
- Fix bug in C engine in which quoting=csv.QUOTE_NONNUMERIC wouldn't cause non-quoted fields to be cast to float. Relevant definitions can be found in the Python docs here.
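The two reader-side semantics named in the list above can be seen in Python's standard csv module, whose definitions the description points to. This sketch exercises only the stdlib reader, not the pandas C engine, and the sample line is made up for illustration.

```python
import csv

LINE = '1,"a",2.5'  # illustrative input, not taken from the PR's tests

# QUOTE_NONNUMERIC: unquoted fields are converted to float; quoted ones stay str.
nonnumeric_row = next(csv.reader([LINE], quoting=csv.QUOTE_NONNUMERIC))

# QUOTE_NONE: quote characters get no special treatment, so a quotechar of
# None is acceptable and the quotes remain part of the field.
none_row = next(csv.reader([LINE], quoting=csv.QUOTE_NONE, quotechar=None))

print(nonnumeric_row)  # [1.0, 'a', 2.5]
print(none_row)        # ['1', '"a"', '2.5']
```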