BUG: Fix inconsistent C engine quoting behaviour #13411

gfyoung · 2016-06-09T13:18:35Z

Add significant testing to quoting in read_csv
Fix bug in C engine in which a NULL quotechar would raise even though quoting=csv.QUOTE_NONE.
Fix bug in C engine in which quoting=csv.QUOTE_NONNUMERIC wouldn't cause non-quoted fields to be cast to float. Relevant definitions can be found in the Python docs here.
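
For reference, the semantics that both fixes align with can be seen with Python's own csv.reader. This is only a sketch of the standard-library behaviour described in the Python docs, not of the pandas C engine; the sample rows are made up for illustration.

```python
import csv
import io

# QUOTE_NONNUMERIC: csv.reader converts every unquoted field to float.
rows_nonnumeric = list(
    csv.reader(io.StringIO('1,"x",2.5\n'), quoting=csv.QUOTE_NONNUMERIC)
)

# QUOTE_NONE: quote characters get no special treatment, so a missing
# (None) quotechar is acceptable and the quotes stay in the data.
rows_none = list(
    csv.reader(io.StringIO('1,"x"\n'), quoting=csv.QUOTE_NONE, quotechar=None)
)

print(rows_nonnumeric)  # [[1.0, 'x', 2.5]]
print(rows_none)        # [['1', '"x"']]
```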

jreback · 2016-06-09T13:24:08Z

I think that we should expose these csv.QUOTE_* values somewhere in the pandas namespace as well. And/or maybe accept a string, e.g. quoting='QUOTE_NONE' (and case-insensitive), or maybe just quoting='nonnumeric', 'minimal', 'all' (not really even sure what QUOTE_NONE does).

not sure if this is an issue in the real-world, but seems to be a small pot-hole.

@jorisvandenbossche

jreback · 2016-06-09T13:25:02Z

pandas/io/tests/parser/quoting.py

+        cols = ['a', 'b', 'c']
+
+        # QUOTE_MINIMAL and QUOTE_ALL apply only to
+        # the CSV writer, so they should have no


should these be an error then? (or you are saying they don't do anything so might as well accept them)?

Definitely not an error. Just leave it alone is best.

But maybe we shouldn't mention them explicitly as options in documentation then?

Well they are technically valid to Python's csv.reader class, so at this point I would leave them so that users know that they are options for completeness.
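
A small sketch of the point made in this thread, using only the standard library: csv.reader accepts QUOTE_MINIMAL and QUOTE_ALL but parses identically under both, while csv.writer is where they differ. The sample data and the write_row helper are hypothetical.

```python
import csv
import io

data = 'a,b\n1,"x y"\n'

# On the reading side, both constants are accepted and give the same rows.
parsed_minimal = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_MINIMAL))
parsed_all = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_ALL))


def write_row(row, quoting):
    """Hypothetical helper: serialise one row with the given quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(row)
    return buf.getvalue()


# On the writing side, the two constants produce different output.
written_minimal = write_row(['1', 'x y'], csv.QUOTE_MINIMAL)
written_all = write_row(['1', 'x y'], csv.QUOTE_ALL)

print(parsed_minimal == parsed_all)    # True
print(written_minimal == written_all)  # False
```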

gfyoung · 2016-06-09T13:29:17Z

@jreback : Documentation on the meanings of the csv.* enums can be found in the link I provided in the initial PR description.

gfyoung · 2016-06-09T14:03:27Z

@jreback : I'm perfectly fine with name-spacing as I just commented above, though I'm hesitant about accepting strings. IMO enums are sufficient and will make it easier for users to debug (spelling errors).

Not entirely sure how many people use them as you said (my guess is that many don't, and I certainly have never used them in my work), though it should probably remain as an option since it does play a role in csv.Dialect creation.
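
To illustrate the debugging argument for enums: a misspelled constant name fails as soon as it is looked up, with the bad name in the message. This sketch only shows the attribute lookup on the csv module; how a misspelled string option would be reported is not covered here.

```python
import csv

try:
    # Deliberate typo: the real constant is csv.QUOTE_NONNUMERIC.
    quoting = csv.QUOTE_NONUMERIC
except AttributeError as exc:
    message = str(exc)

print(message)
```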

codecov-io · 2016-06-09T15:18:42Z

Current coverage is 84.32%

Merging #13411 into master will decrease coverage by <.01%

@@             master     #13411   diff @@
==========================================
  Files           138        138          
  Lines         51069      51068     -1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43066      43065     -1   
  Misses         8003       8003          
  Partials          0          0

Powered by Codecov. Last updated by 013c2ce...bf2a59e

gfyoung · 2016-06-09T16:11:06Z

Added the io.common enums, and Travis is still happy. Ready to merge if there are no other concerns.

jreback · 2016-06-09T16:13:21Z

doc/source/io.rst

@@ -287,7 +287,7 @@ lineterminator : str (length 1), default ``None``
 quotechar : str (length 1)
  The character used to denote the start and end of a quoted item. Quoted items
  can include the delimiter and it will be ignored.
-quoting : int or ``csv.QUOTE_*`` instance, default ``None``
+quoting : int or ``csv.QUOTE_*`` instance, default ``0``


isn't this still defaulted to None, meaning don't consider quoting at all (and NOT QUOTE_NONE)? or are these the same?

The signature says quoting=0

Passing in None raises an error (it always has)

oh, you are essentially fixing the documentation, ok then
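
As a sanity check on the documented default, here is a sketch against the standard-library csv module (not pandas itself): 0 is the value of csv.QUOTE_MINIMAL, and csv.reader rejects None for quoting.

```python
import csv
import io

# The documented default, 0, is the value of csv.QUOTE_MINIMAL.
default_is_minimal = (csv.QUOTE_MINIMAL == 0)

# None is not a valid quoting value for csv.reader.
try:
    csv.reader(io.StringIO('a,b\n'), quoting=None)
    none_rejected = False
except TypeError:
    none_rejected = True

print(default_is_minimal, none_rejected)
```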

jorisvandenbossche · 2016-06-09T17:21:36Z

pandas/io/tests/parser/quoting.py

+        expected = DataFrame([[3, '4 " 5"']],
+                             columns=['a', 'b'])
+        result = self.read_csv(StringIO(data), quotechar='"',
+                               doublequote=False)


I don't really understand this one (I mean: reading the docs, I would expect '4 "" 5') Can you explain the logic of this result?

Not entirely sure from the Python engine perspective but from the tokenizer.c perspective, the two quotes that you see are because they are interpreted as in-field quotations so they are processed like normal characters.

But why the quote after the 5? (so why '4 " 5"' and not '4 " 5') Because isn't this quote the ending quote of the string (since quotechar='"')?

No it is not. The quote is considered in field. Follow along with the code and you'll see.

I can see indeed the logic in the code. So after a closing quote, the rest of the field is regarded as a normal field, and cannot be a quoted field anymore, regardless of occurring quotes.
But you could also set the state to START_FIELD instead of IN_FIELD

Anyway, this is a rather pathological case, so not that important. I just found it a bit strange (and as you are adding the behaviour explicitly as a test, wanted to check it)

Oh, okay. Thanks for clarifying!
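
The logic described above can be sketched as a toy single-field state machine. This is not the code in tokenizer.c: the function name, the reduced set of states, and the input field '"4 "" 5"' are assumptions made for illustration, with doublequote treated as disabled.

```python
def tokenize_field(text, quotechar='"'):
    """Toy tokenizer for one field with doublequote disabled (hypothetical)."""
    START_FIELD, IN_QUOTED_FIELD, IN_FIELD = range(3)
    state = START_FIELD
    out = []
    for ch in text:
        if state == START_FIELD:
            if ch == quotechar:
                # An opening quote starts a quoted field.
                state = IN_QUOTED_FIELD
            else:
                out.append(ch)
                state = IN_FIELD
        elif state == IN_QUOTED_FIELD:
            if ch == quotechar:
                # Closing quote: the rest of the field is a normal field.
                state = IN_FIELD
            else:
                out.append(ch)
        else:
            # IN_FIELD: every character, quotes included, is kept as data.
            out.append(ch)
    return ''.join(out)


field = tokenize_field('"4 "" 5"')
print(field)  # 4 " 5"
```

Switching to START_FIELD instead of IN_FIELD after the closing quote, as suggested in the thread, would let the second quote open a new quoted section and change the result.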

jorisvandenbossche · 2016-06-09T17:25:59Z

I am not really sure that I find pd.io.common.QUOTE_* better than csv.QUOTE_* (but it is one import less, that is true).
I would maybe rather accept strings like 'minimal' or 'nonnumeric', this seems the more user friendly option? (but it makes it of course more verbose to explain all possibilities in the docstring)

gfyoung · 2016-06-09T17:26:49Z

@jorisvandenbossche : I disagree about the strings. See my reasoning above. The inclusion of the enums in the namespace was to avoid hard-coding them as I did initially in parsers.pyx.

gfyoung · 2016-06-14T16:41:04Z

@jreback : Did some minor tweaking of the docs once more to mention that the csv enums can be found in pandas.io.common. Ready to merge if there are no other concerns.

jorisvandenbossche · 2016-06-14T23:20:58Z

> I disagree about the strings.

OK on not adding those, but still, I don't see the added value to users of having the enums in pandas.io.common (it's just more ways to import the exact same thing). So personally I would just not mention it in the docs (also, if we do put it in the docs, we should have a test for it)

gfyoung · 2016-06-14T23:37:05Z

What test exactly? Not sure what you mean. @jreback thoughts?

jorisvandenbossche · 2016-06-14T23:54:25Z

> What test exactly? Not sure what you mean

Sorry, I meant just a simple test that it is available in the pd.io.common namespace (if we want users to use it from there). But let's first decide what to do with it.

gfyoung · 2016-06-15T09:29:29Z

@jorisvandenbossche : Not sure if we're encouraging users to use it from there. Just providing a pandas-namespace location where those enums are available from what it seems.

jreback · 2016-06-16T21:06:08Z

pandas/parser.pyx

+        if not isinstance(quoting, int):
+            raise TypeError('"quoting" must be an integer')
+
+        if not 0 <= quoting <= 3:


you should instead test that it is a valid value (e.g. it has to match one of the constants)

Fair enough. It's a little nicer than the two checks that I have currently (e.g. someone for whatever reason could pass in np.int32(2) and that would raise).

Actually, I take that back. The current checks will align nicely with what the Python engine currently does in its csv library. I do check that it is a valid value by first checking whether or not it's an integer. Then if so, it can only take on four values (between 0 and 3 inclusive, hence my subsequent check with the inequalities).

no, my point is that instead of using the literals 0 and 3, you should use the QUOTE_* constants themselves (QUOTE_NONE, and whatever the one at the other end is).

Ah, got it. Fixed.
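
A minimal sketch of the check as discussed, with the bounds spelled as constants rather than 0 and 3. The function name and the second error message are hypothetical; the snippet uses the csv module's constants, where QUOTE_MINIMAL is 0 and QUOTE_NONE is 3.

```python
import csv


def validate_quoting(quoting):
    """Hypothetical helper mirroring the two checks discussed above."""
    if not isinstance(quoting, int):
        raise TypeError('"quoting" must be an integer')
    # The four constants in this thread span QUOTE_MINIMAL (0) to QUOTE_NONE (3).
    if not csv.QUOTE_MINIMAL <= quoting <= csv.QUOTE_NONE:
        raise TypeError('bad "quoting" value')
    return quoting


accepted = [
    validate_quoting(q)
    for q in (csv.QUOTE_MINIMAL, csv.QUOTE_ALL, csv.QUOTE_NONNUMERIC, csv.QUOTE_NONE)
]
print(accepted)  # [0, 1, 2, 3]
```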

jorisvandenbossche · 2016-06-16T21:27:39Z

> Not sure if we're encouraging users to use it from there. Just providing a pandas-namespace location where those enums are available from what it seems.

Adding it to the docstring seems like 'encouraging users to use it' to me; why else add it there? For me it's OK to add it to the pd.io.common namespace if that makes the implementation easier, no problem. I personally would just not add it to the documentation.
@jreback do you have any preference?

jreback · 2016-06-16T21:37:06Z

I could go either way. I thought it would be more consistent, IOW, a user can just use it directly, not having to think about the csv module at all. Though since we don't really document these completely, maybe we should just back it out and point to the csv module for all of this (for the constants).

So @gfyoung let's revert that (the location of the constants).

1) Add significant testing to quoting in read_csv
2) Fix bug in C engine in which a NULL quotechar would raise even though quoting=csv.QUOTE_NONE
3) Fix bug in C engine in which quoting=csv.QUOTE_NONNUMERIC wouldn't cause non-quoted fields to be cast to float
4) Fix minor doc error for the quoting parameter: the default value is zero, not None (passing None will in fact raise an error)

jreback · 2016-06-17T16:39:19Z

thanks @gfyoung

jorisvandenbossche · 2016-06-17T17:10:22Z

@gfyoung Thanks a lot! This more comprehensive testing is really good

gfyoung force-pushed the quoting-read-csv-tests branch from a5f270b to 8122350 Compare June 9, 2016 13:19

jreback added API Design IO CSV read_csv, to_csv labels Jun 9, 2016

jreback added this to the 0.18.2 milestone Jun 9, 2016

jreback reviewed Jun 9, 2016

gfyoung force-pushed the quoting-read-csv-tests branch 2 times, most recently from 383f2db to c879395 Compare June 9, 2016 15:18

jreback reviewed Jun 9, 2016

gfyoung force-pushed the quoting-read-csv-tests branch from c879395 to 14d5ef2 Compare June 9, 2016 17:04

jorisvandenbossche reviewed Jun 9, 2016

gfyoung force-pushed the quoting-read-csv-tests branch from 14d5ef2 to 4013482 Compare June 9, 2016 19:37

gfyoung force-pushed the quoting-read-csv-tests branch from 4013482 to 633f78f Compare June 14, 2016 16:49

gfyoung force-pushed the quoting-read-csv-tests branch 3 times, most recently from e075f0b to 6595618 Compare June 16, 2016 18:45

jreback reviewed Jun 16, 2016

gfyoung force-pushed the quoting-read-csv-tests branch 2 times, most recently from 397b665 to bf2a59e Compare June 17, 2016 07:02

gfyoung force-pushed the quoting-read-csv-tests branch from bf2a59e to 0e791a5 Compare June 17, 2016 15:06

jreback closed this in 883df65 Jun 17, 2016

gfyoung deleted the quoting-read-csv-tests branch June 17, 2016 16:40
