
ENH: Option for reading files with a variable number of comment lines at start #2685

Closed
wesm opened this issue Jan 11, 2013 · 20 comments · Fixed by #7582
Labels: Enhancement, IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label)

Comments

@wesm
Member

wesm commented Jan 11, 2013

http://stackoverflow.com/questions/14276661/python-pandas-read-file-skipping-commented

@montefra

@wesm: I'm the author of the stackoverflow question.

Implementing this well would require dipping into the C tokenizer code. It's not as bad as it might sound.

Could you point me to where this is implemented, so that I can take a look, see what I understand, and whether I can contribute?

@wesm
Member Author

wesm commented Jan 15, 2013

pandas/src/parser/tokenizer.c

@montefra

I have spent a number of hours over the last few days trying to figure out which function calls which (not an easy task).
When I read a file with commented lines (in my case with read_table, but my understanding is that it would be the same with read_csv) I get an error:

pandas/io/parsers.py", line 1475, in _rows_to_cols
   raise ValueError(msg) 
ValueError: Expected 0 fields in line 7, saw 7

The error comes from within the PythonParser class. Before calling _rows_to_cols the input file is stored in a list of lists, content, and fully commented lines appear as empty lists.
So a possibility would be to add an extra keyword, skipcommentedlines (or similar): if True, count how many empty lists there are at the beginning of content and remove them. If self.orig_names is also empty, count the number of fields and fill it with range(n_fields).

If this makes sense, I'll implement it asap.
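
As a rough, standalone sketch of the idea (this is not pandas internals; the helper name and the assumed shape of content are only illustrative):

def strip_leading_comment_rows(content, names=None):
    # ``content`` is assumed to be the list of lists the Python parser
    # builds, where a fully commented line shows up as an empty list.
    n_skip = 0
    for row in content:
        if row:  # first non-empty row: real data starts here
            break
        n_skip += 1
    content = content[n_skip:]
    if not names and content:
        # no column names known: number the fields, as proposed above
        names = list(range(len(content[0])))
    return content, names

# two fully commented leading lines followed by data
rows = [[], [], ['1', '2', '3'], ['4', '5', '6']]
data, names = strip_leading_comment_rows(rows)
# data  -> [['1', '2', '3'], ['4', '5', '6']]
# names -> [0, 1, 2]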

It looks to me as if the functions in pandas/src/parser/tokenizer.c are used only in the CParserWrapper class, when the 'c' engine is requested. From a comment I understand that the 'c' engine is not supported yet. Still, I think I have found a couple of places where the skipping of commented lines could be implemented, although I have yet to understand whether it makes sense and how to do it.

At last. When reading a file of the kind

 # line1 (space before the comment)
# line2
# [...]
# line 6 
1 2 3 4 5 6 7
[...]

with pd.read_table(fname, header=None, comment='#', sep='\s'), I have noticed that with 2 commented lines content has two empty lists in the first two positions (as I would expect), but with 1 or >=3 commented lines I get n-1 empty lists in content. That looks quite strange to me. Any idea what's going on?

Thanks for taking the time to read (and hopefully reply to) this long post.

@holocronweaver

I would very much like to see this feature implemented as I have to deal with workarounds for it often.

Is there a reason that the above implementation would not work? It seems simple enough. I have never contributed to pandas before and would like some feedback before proceeding to implement the feature.
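
(For reference, the usual workaround at the time was to count the leading comment lines by hand and pass that count as skiprows; a minimal sketch, assuming a plain-text file and '#' as the comment character; the helper and file names are made up:)

import pandas as pd

def read_skipping_leading_comments(fname, comment='#', **kwargs):
    # count how many lines at the top of the file start with the
    # comment character (ignoring leading whitespace)
    n_comments = 0
    with open(fname) as fh:
        for line in fh:
            if line.lstrip().startswith(comment):
                n_comments += 1
            else:
                break
    return pd.read_table(fname, skiprows=n_comments, **kwargs)

# df = read_skipping_leading_comments('data.txt', header=None, sep=r'\s+')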

@jreback
Contributor

jreback commented Jul 30, 2013

the c-parser is the primary parser; this change would have to be made there (actually in src/parser.pyx). It's not that hard, but a bit non-trivial

@montefra

I'll take a look at src/parser.pyx ASAP. I hope to have more luck than when I looked into pandas/src/parser/tokenizer.c.

@jreback
Contributor

jreback commented Jul 30, 2013

there is already an existing feature to skip comments at the end of a line:

http://pandas.pydata.org/pandas-docs/dev/io.html#comments

IIUC you want to skip a line if there is a comment at the beginning?

e.g.

# comment
data.....
# comment
data....
   # comment

this would work if you specified comment='\s*#' (I am not sure it takes a regex ATM, though)
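
(To illustrate: at the time comment='#' only stripped the remainder of a line after the comment character; in releases that include the fix that eventually closed this issue, #7582, lines starting with the comment character are dropped entirely. A small sketch, with the output assuming a pandas version that has that fix:)

from io import StringIO
import pandas as pd

data = """\
# a fully commented line
a,b,c
1,2,3# everything after '#' on this line is ignored
# another fully commented line
4,5,6
"""

# comment='#' strips the remainder of a line after '#'; with the fix,
# fully commented lines are skipped as well.
df = pd.read_csv(StringIO(data), comment='#')
print(df)
#    a  b  c
# 0  1  2  3
# 1  4  5  6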

@jreback
Contributor

jreback commented Jul 30, 2013

the last one would be tough FYI, the tokenizer only takes a single character ATM

@holocronweaver

@montefra Whichever of us gets to it first then. The race is on. ;-)

@jreback That is correct, I want to skip lines with a comment character at the beginning. I would imagine the last case could not be handled if the file is space-delimited. Basically, comments probably need to be declared either at the very beginning or the end of a line.

@holocronweaver

I have implemented comment skipping and also fixed a related problem with CSV format sniffing. I did so by modifying io/parsers.py and have not yet begun work on the c-parser, which apparently my install of pandas is not using by default. Once I have finished making changes in parser.pyx I will push my changes.

A similar problem is whether pandas should ignore empty lines by default. It would be very easy to implement as a slight extension of ignoring comments. I will open a related issue to get some feedback.
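
(As an aside, later pandas versions expose blank-line handling through the skip_blank_lines keyword, default True; a hedged sketch, assuming a pandas recent enough to have it:)

from io import StringIO
import pandas as pd

data = "a,b,c\n1,2,3\n\n\n4,5,6\n"

pd.read_csv(StringIO(data))                          # blank lines skipped (default)
pd.read_csv(StringIO(data), skip_blank_lines=False)  # blank lines become all-NaN rows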

@jreback
Contributor

jreback commented Aug 5, 2013

great!

you can put your changes up as a PR, be sure to enable travis, lmk if you need help

@holocronweaver

Only one build failed in Travis CI, but it does not appear to have anything to do with my changes. Is this expected, or did I break something related to pytables somehow?

@jreback
Contributor

jreback commented Aug 7, 2013

@holocronweaver rebase on master again; I just pushed some code to 'fix' that failure (although net net the changes didn't actually do anything)...but it seemed to fix it

to force travis to rebuild

git commit --amend -C HEAD
git push origin yourbranch -f

this resets the last commit to a new hash, forcing a rebuild

@holocronweaver

I did as you suggested and the latest build (3) is all good. I will go ahead and submit a pull request.

@jreback
Contributor

jreback commented Sep 30, 2013

@montefra doing a PR for this?

@montefra

@jreback: ehm.
I did look into it, but I kind of forgot after @holocronweaver did his PR.
So the answer is no.
When I did spend some time on it (about 9 months ago) I couldn't figure out much of pandas/src/parser/tokenizer.c and proposed a solution, which is likely part of what @holocronweaver has done.

@jreback
Contributor

jreback commented Sep 30, 2013

ok...thanks

@holocronweaver

@jreback @montefra Yep, I have taken care of this, just need to find time to get everything ready for PR.

@jreback
Contributor

jreback commented Feb 15, 2014

@holocronweaver @montefra can either of you update the PR for this?

@holocronweaver

@jreback Will do so the moment I have a chance. Long time coming, I know. =) High on my priority list.
