DOC, BUG: Clarify and Standardize Whitespace Delimiter Behaviour with Custom Line Terminator #12912

Closed
gfyoung opened this issue Apr 17, 2016 · 13 comments
Labels: Bug, Docs, IO CSV (read_csv, to_csv)

@gfyoung
Member

gfyoung commented Apr 17, 2016

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = """a b c~1 2 3~4 5 6~7 8 9"""
>>> df = read_csv(StringIO(data), lineterminator='~', delim_whitespace=True)
>>> df
Empty DataFrame
Columns: [a, b, c~1, 2, 3~4, 5, 6~7, 8, 9]
Index: []

expected:

>>> df
    a    b    c
0   1    2    3
1   4    5    6
2   7    8    9

Note that this bug is only for the C engine, as the Python engine does not yet support delim_whitespace.

@jreback
Contributor

jreback commented Apr 17, 2016

I think this should just raise an error. It doesn't make sense; these are two conflicting options,
as delim_whitespace implicitly means \n termination.

@jreback added the Error Reporting (incorrect or improved errors from pandas), IO CSV (read_csv, to_csv), and Difficulty Intermediate labels Apr 17, 2016
@jreback modified the milestones: No action, Next Major Release Apr 17, 2016
@gfyoung
Member Author

gfyoung commented Apr 17, 2016

  1. Where in the documentation does it say that? AFAICT, there is no documentation for delim_whitespace in read_csv.

  2. If delim_whitespace=True and custom lineterminator conflict with each other, why does this work and give me what I was expecting:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>> data = """a b c~1 2 3~4 5 6~7 8 9"""
>>> df = read_csv(StringIO(data), lineterminator='~', delimiter=' ')
>>> df
    a    b    c
0   1    2    3
1   4    5    6
2   7    8    9

@gfyoung changed the title from "BUG: Lineterminator and delim_whitespace not respected together" to "DOC, BUG: Clarify and standardize delim_whitespace behaviour with lineterminator" Apr 17, 2016
@gfyoung changed the title from "DOC, BUG: Clarify and standardize delim_whitespace behaviour with lineterminator" to "DOC, BUG: Clarify and Standardize Whitespace Delimiter Behaviour with Custom Line Terminator" Apr 17, 2016
@jreback
Contributor

jreback commented Apr 17, 2016

OK, I guess accepting lineterminator is probably fine; delim_whitespace is just a shortcut for setting sep.

@jreback added the Bug label Apr 17, 2016
@jreback
Contributor

jreback commented Apr 17, 2016

The docs could be updated as well.

@jreback added the Docs label and removed the Error Reporting (incorrect or improved errors from pandas) label Apr 17, 2016
@gfyoung
Member Author

gfyoung commented Apr 17, 2016

Yep, that's what I was thinking of doing as well, as there aren't any for delim_whitespace ATM AFAICT.

@jreback
Contributor

jreback commented Apr 17, 2016

There is a tiny comment in io.rst, but it needs to be added to the docstring and expanded a little.

@gfyoung
Member Author

gfyoung commented Apr 17, 2016

Ah, okay. I'll use that documentation in io.rst as a starting point then.

@jorisvandenbossche
Member

jorisvandenbossche commented Apr 18, 2016

@gfyoung Regarding your example: read_csv(StringIO(data), lineterminator='~', delimiter=' ') does work. But I think the equivalent of delim_whitespace=True would rather be delimiter='\s+', and that fails as well in combination with the custom lineterminator.
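To illustrate why a single-space delimiter and a whitespace-run delimiter are not interchangeable, here is a minimal sketch in plain Python, outside pandas. The sample row and variable names are made up for this illustration, and it does not reproduce the parser's own code path.

```python
import re

# Hypothetical sample row: two spaces, then a tab, between values.
row = "1  2\t3"

# A literal single-space delimiter produces an empty field between
# consecutive spaces and does not split on the tab at all.
single_space_fields = row.split(" ")

# A whitespace-run delimiter collapses consecutive spaces and tabs
# into one separator.
whitespace_run_fields = re.split(r"\s+", row)

print(single_space_fields)
print(whitespace_run_fields)
```

On input that uses exactly one space between values, such as the data in this issue, both approaches give the same fields, which is why the delimiter=' ' example appears equivalent.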

@gfyoung
Member Author

gfyoung commented Apr 18, 2016

@jorisvandenbossche: Whitespace in tokenizer.c is defined as ' ' or '\t', as evidenced here, so I believe your regex would not apply. In any case, I still think you can use a custom line terminator, as that is separate from the delimiter.
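As a rough sketch of the behaviour being argued for here (rows end at the custom terminator, fields are separated by runs of ' ' or '\t'), something like the following plain-Python function captures the idea. The name split_rows and its parameters are hypothetical, and this is not how tokenizer.c is implemented; it also skips quoting and empty rows for brevity.

```python
def split_rows(text, lineterminator="~", whitespace=" \t"):
    """Split text into rows on lineterminator, then split each row
    into fields on runs of the given whitespace characters."""
    rows = []
    for line in text.split(lineterminator):
        fields = []
        current = []
        for ch in line:
            if ch in whitespace:
                # A whitespace character ends the current field, if any.
                if current:
                    fields.append("".join(current))
                    current = []
            else:
                current.append(ch)
        if current:
            fields.append("".join(current))
        if fields:  # simplification: drop rows with no fields
            rows.append(fields)
    return rows


data = "a b c~1 2 3~4 5 6~7 8 9"
rows = split_rows(data)
print(rows)
```

The point of the sketch is that the line terminator and the field delimiter are handled in separate steps, so neither setting has to constrain the other.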

@jorisvandenbossche
Member

jorisvandenbossche commented Apr 18, 2016

I don't know about the C internals of the parser code, but in any case here: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L679 the '\s+' option is translated to delim_whitespace=True, so both are equivalent (although I thought it was set the other way around).
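The override described above might look roughly like the sketch below. The function name resolve_whitespace_option and the dictionary layout are assumptions for illustration only; this is not the code at the linked line of parsers.py.

```python
def resolve_whitespace_option(options):
    """Hypothetical helper: map a '\\s+' delimiter onto the
    delim_whitespace flag, leaving other options untouched."""
    resolved = dict(options)  # copy so the caller's dict is not mutated
    if resolved.get("delimiter") == r"\s+":
        resolved["delim_whitespace"] = True
        del resolved["delimiter"]
    return resolved


opts = resolve_whitespace_option({"delimiter": r"\s+", "lineterminator": "~"})
print(opts)
```

Under this reading, delimiter='\s+' and delim_whitespace=True end up on the same code path, while delimiter=' ' is left alone and takes a different one.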

@jorisvandenbossche
Member

To be clear, I am not saying that this is not a bug. I was just pointing out that your example of delimiter=' ' is not exactly equivalent, or at least does not take the same code path.
But to the user, this of course seems equivalent, and ideally it would behave the same with regard to lineterminator.

@gfyoung
Member Author

gfyoung commented Apr 18, 2016

@jorisvandenbossche: I did understand that they were not taking the same code path, though I had not seen the override that you pointed out. Nevertheless, I think we can agree that the documentation definitely needs some explanation of what goes on with delim_whitespace.

@jorisvandenbossche
Member

@gfyoung Certainly!

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 21, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 21, 2016
@jreback modified the milestones: 0.18.1, Next Major Release Apr 21, 2016