Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix inconsistent C engine quoting behaviour #13411

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jun 9, 2016

  1. Add significant testing to quoting in read_csv

  2. Fix bug in C engine in which a NULL quotechar would raise even though quoting=csv.QUOTE_NONE.

  3. Fix bug in C engine in which quoting=csv.QUOTE_NONNUMERIC wouldn't cause non-quoted fields to be cast to float. Relevant definitions can be found in the Python docs here.

@jreback
Copy link
Contributor

jreback commented Jun 9, 2016

I think that we should expose these csv.QUOTE_* values somewhere in the pandas namespace as well. and/or, maybe accept a string? e.g. quoting='QUOTE_NONE' ? (and case_insensitve), or maybe just quoting='nonumeric','minimal','all' (not really even sure what QUOTE_NONE does)

not sure if this is an issue in the real-world, but seems to be a small pot-hole.

@jorisvandenbossche

@jreback jreback added API Design IO CSV read_csv, to_csv labels Jun 9, 2016
@jreback jreback added this to the 0.18.2 milestone Jun 9, 2016
cols = ['a', 'b', 'c']

# QUOTE_MINIMAL and QUOTE_ALL apply only to
# the CSV writer, so they should have no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these be an error then? (or you are saying they don't do anything so might as well accept them)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not an error. Just leave it alone is best.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe we shouldn't mention them explicitly as options in documentation then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well they are technically valid to Python's csv.reader class, so at this point I would leave them so that users know that they are options for completeness.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 9, 2016

@jreback : Documentation on the meanings of the csv.* enums can be found in the link I provided in the initial PR description.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 9, 2016

@jreback : I'm perfectly fine with name-spacing as I just commented above, though I'm hesitant about accepting strings. IMO enums are sufficient and will make it easier for users to debug (spelling errors).

Not entirely sure how many people use them as you said (my guess is that many don't, and I certainly have never used them in my work), though it should probably remain as an option since it does play a role in csv.Dialect creation.

@gfyoung gfyoung force-pushed the quoting-read-csv-tests branch 2 times, most recently from 383f2db to c879395 Compare June 9, 2016 15:18
@codecov-io
Copy link

codecov-io commented Jun 9, 2016

Current coverage is 84.32%

Merging #13411 into master will decrease coverage by <.01%

@@             master     #13411   diff @@
==========================================
  Files           138        138          
  Lines         51069      51068     -1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43066      43065     -1   
  Misses         8003       8003          
  Partials          0          0          

Powered by Codecov. Last updated by 013c2ce...bf2a59e

@gfyoung
Copy link
Member Author

gfyoung commented Jun 9, 2016

Added the io.common enums, and Travis is still happy. Ready to merge if there are no other concerns.

@@ -287,7 +287,7 @@ lineterminator : str (length 1), default ``None``
quotechar : str (length 1)
The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.
quoting : int or ``csv.QUOTE_*`` instance, default ``None``
quoting : int or ``csv.QUOTE_*`` instance, default ``0``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this still defaulted to None?, meaning don't consider quoting at all? (and NOT QUOTE_NONE?) or are these the same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The signature says quoting=0
  2. Passing in None raises an error (it always has been)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, you are essentially fixing the documentation, ok then

expected = DataFrame([[3, '4 " 5"']],
columns=['a', 'b'])
result = self.read_csv(StringIO(data), quotechar='"',
doublequote=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this one (I mean: reading the docs, I would expect '4 "" 5') Can you explain the logic of this result?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure from the Python engine perspective but from the tokenizer.c perspective, the two quotes that you see are because they are interpreted as in-field quotations so they are processed like normal characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why the quote after the 5? (so why '4 " 5"' and not '4 " 5') Because isn't this quote the ending quote of the string (since quotechar='"')?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it is not. The quote is considered in field. Follow along with the code and you'll see.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see indeed the logic in the code. So after a closing quote, the rest of the field is regarded as a normal field, and cannot be a quoted field anymore, regardless of occurring quotes.
But you could also set the state to START_FIELD instead of IN_FIELD

Anyway, this is a rather pathological case, so not that important. I just found it a bit strange (and as you are adding the behaviour explicitly as a test, wanted to check it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, okay. Thanks for clarifying!

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 9, 2016

I am not really sure that I find pd.io.common.QUOTE_ better than csv.QUOTE_.. (but it is one import less, that is true).
I would maybe rather accept strings like 'minimal' or 'nonnumeric', this seems the more user friendly option? (but it makes it of course more verbose to explain all possibilities in the docstring)

@gfyoung
Copy link
Member Author

gfyoung commented Jun 9, 2016

@jorisvandenbossche : I disagree about the strings. See my reasoning above. The inclusion of the enums in the namespace was to avoid hard-coding them as I did initially in parsers.pyx.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 14, 2016

@jreback : Did some minor tweaking of the docs once more to mention that the csv enums can be found in pandas.io.common. Ready to merge if there are no other concerns.

@jorisvandenbossche
Copy link
Member

I disagree about the strings.

OK on not adding those, but still, I don't see the added value to users of having the enums in pandas.io.common (it's just more ways to import the exact same thing). So personally I would just not mention it in the docs (also, if we do put it in the docs, we should have a test for it)

@gfyoung
Copy link
Member Author

gfyoung commented Jun 14, 2016

What test exactly? Not sure what you mean. @jreback thoughts?

@jorisvandenbossche
Copy link
Member

What test exactly? Not sure what you mean

Sorry, I meant just a simple test that it is available in the pd.io.common namespace (if we want users to use it from there). But let's first decide what to do with it.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 15, 2016

@jorisvandenbossche : Not sure if we're encouraging users to use it from there. Just providing a pandas-namespace location where those enums are available from what it seems.

@gfyoung gfyoung force-pushed the quoting-read-csv-tests branch 3 times, most recently from e075f0b to 6595618 Compare June 16, 2016 18:45
if not isinstance(quoting, int):
raise TypeError('"quoting" must be an integer')

if not 0 <= quoting <= 3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should test that instead it is a valid value (e.g. it has to match one of the constants)

Copy link
Member Author

@gfyoung gfyoung Jun 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. It's a little nicer than the two checks that I have currently (e.g. someone for whatever reason could pass in np.int32(2) and that would raise).

Copy link
Member Author

@gfyoung gfyoung Jun 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I take that back. The current checks will align nicely with what the Python engine currently does in its csv library. I do check that it is a valid value by first checking whether or not it's an integer. Then if so, it can only take on four values (between 0 and 3 inclusive, hence my subsequent check with the inequalities).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no my point is instead of actually using 0 and 3, use QUOTE_NONE, and QUOTE_* (whatever the last one is).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it. Fixed.

@jorisvandenbossche
Copy link
Member

Not sure if we're encouraging users to use it from there. Just providing a pandas-namespace location where those enums are available from what it seems.

Adding it to the docstring seems 'encouraging users to use it' to me, why otherwise adding it there? For me it's OK to add it to the pd.io.common namespace if that makes the implementation easier, no problem, I personally would just not add it to the documentation.
@jreback do you have any preference?

@jreback
Copy link
Contributor

jreback commented Jun 16, 2016

I could go either way. I thought it would be more consistent, IOW, a user can just use it directly, not having to think about csv module at all. Though since we don't really document these completely, maybe should just back it out and point for all of this to csv module (for the constants).

So @gfyoung let's revert that (the location of the constants).

@gfyoung gfyoung force-pushed the quoting-read-csv-tests branch 2 times, most recently from 397b665 to bf2a59e Compare June 17, 2016 07:02
1) Add significant testing to quoting in read_csv

2) Fix bug in C engine in which a NULL quotechar
would raise even though quoting=csv.QUOTE_NONE

3) Fix bug in C engine in which quoting=csv.QUOTE_
NONNUMERIC wouldn't cause non-quoted fields to be
cast to float.

4) Fixed minor doc error for quoting parameter, as
the default value is ZERO not None (this will raise
an error in fact).
@jreback
Copy link
Contributor

jreback commented Jun 17, 2016

thanks @gfyoung

@gfyoung gfyoung deleted the quoting-read-csv-tests branch June 17, 2016 16:40
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 17, 2016

@gfyoung Thanks a lot! This more comprehensive testing is really good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants