
ENH: Option for reading files with a variable number of comment lines at start #2685

Closed
wesm opened this issue Jan 11, 2013 · 20 comments · Fixed by #7582
Labels: Enhancement, IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label)

Comments

@wesm
Member

wesm commented Jan 11, 2013

http://stackoverflow.com/questions/14276661/python-pandas-read-file-skipping-commented

@montefra

@wesm: I'm the author of the stackoverflow question.

Implementing this well would require dipping into the C tokenizer code. It's not as bad as it might sound.

Could you point me to where this is implemented, so that I can take a look, see what I understand, and whether I can contribute?

@wesm
Member Author

wesm commented Jan 15, 2013

pandas/src/parser/tokenizer.c

@montefra

I have spent a number of hours over the last few days trying to figure out which function calls which (not an easy task).
When I read a file with commented lines (in my case with read_table, but my understanding is that it would be the same with read_csv) I get an error:

pandas/io/parsers.py", line 1475, in _rows_to_cols
   raise ValueError(msg) 
ValueError: Expected 0 fields in line 7, saw 7

The error comes from within the PythonParser class. Before calling _rows_to_cols the input file is stored in a list of lists, content, and fully commented lines appear as empty lists.
So a possibility would be to add an extra keyword, skipcommentedlines (or similar): if True, count how many empty lists there are at the beginning of content and remove them. If self.orig_names is also empty, count the number of fields and fill it with range(n_fields).

If this makes sense, I'll implement it asap.
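
As a rough, standalone sketch of the idea (this is not pandas internals; the helper name and the assumed shape of content are only illustrative):

def strip_leading_comment_rows(content, names=None):
    # ``content`` is assumed to be the list of lists the Python parser
    # builds, where a fully commented line shows up as an empty list.
    n_skip = 0
    for row in content:
        if row:  # first non-empty row: real data starts here
            break
        n_skip += 1
    content = content[n_skip:]
    if not names and content:
        # no column names known: number the fields, as proposed above
        names = list(range(len(content[0])))
    return content, names

# two fully commented leading lines followed by data
rows = [[], [], ['1', '2', '3'], ['4', '5', '6']]
data, names = strip_leading_comment_rows(rows)
# data  -> [['1', '2', '3'], ['4', '5', '6']]
# names -> [0, 1, 2]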

It looks to me as if the functions in pandas/src/parser/tokenizer.c are used only in the CParserWrapper class, when the 'c' engine is requested. From a comment I understand that the 'c' engine is not supported yet. Still, I think I have found a couple of places where the skipping of commented lines could be implemented, although I have yet to understand whether it makes sense and how to do it.

At last. When reading a file of the kind

 # line1 (space before the comment)
# line2
# [...]
# line 6 
1 2 3 4 5 6 7
[...]

with pd.read_table(fname, header=None, comment='#', sep='\s'), I have noticed that with 2 commented lines content has two empty lists in the first two positions (as I would expect), but with 1 or >=3 commented lines I get n-1 empty lists in content. That looks quite strange to me. Any idea what's going on?

Thanks for taking the time to read (and hopefully reply to) this long post.

@holocronweaver

I would very much like to see this feature implemented as I have to deal with workarounds for it often.

Is there a reason that the above implementation would not work? It seems simple enough. I have never contributed to pandas before and would like some feedback before proceeding to implement the feature.
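
(For reference, the usual workaround at the time was to count the leading comment lines by hand and pass that count as skiprows; a minimal sketch, assuming a plain-text file and '#' as the comment character; the helper and file names are made up:)

import pandas as pd

def read_skipping_leading_comments(fname, comment='#', **kwargs):
    # count how many lines at the top of the file start with the
    # comment character (ignoring leading whitespace)
    n_comments = 0
    with open(fname) as fh:
        for line in fh:
            if line.lstrip().startswith(comment):
                n_comments += 1
            else:
                break
    return pd.read_table(fname, skiprows=n_comments, **kwargs)

# df = read_skipping_leading_comments('data.txt', header=None, sep=r'\s+')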

@jreback
Contributor

jreback commented Jul 30, 2013

the c-parser is the primary parser; this change would have to be made there (actually in src/parser.pyx). It's not that hard, but a bit non-trivial

@montefra

I'll take a look at src/parser.pyx ASAP. I hope to have more luck than when I looked into pandas/src/parser/tokenizer.c.

@jreback
Contributor

jreback commented Jul 30, 2013

there is already an existing feature to skip comments at the end of a line:

http://pandas.pydata.org/pandas-docs/dev/io.html#comments

IIUC you want to skip a line if there is a comment at the beginning?

e.g.

# comment
data.....
# comment
data....
   # comment

this would work if you specified comment='\s*#' (I am not sure it takes a regex ATM, though)
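
(To illustrate: at the time comment='#' only stripped the remainder of a line after the comment character; in releases that include the fix that eventually closed this issue, #7582, lines starting with the comment character are dropped entirely. A small sketch, with the output assuming a pandas version that has that fix:)

from io import StringIO
import pandas as pd

data = """\
# a fully commented line
a,b,c
1,2,3# everything after '#' on this line is ignored
# another fully commented line
4,5,6
"""

# comment='#' strips the remainder of a line after '#'; with the fix,
# fully commented lines are skipped as well.
df = pd.read_csv(StringIO(data), comment='#')
print(df)
#    a  b  c
# 0  1  2  3
# 1  4  5  6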

@jreback
Contributor

jreback commented Jul 30, 2013

the last one would be tough FYI, the tokenizer only takes a single character ATM

@holocronweaver

@montefra Whichever of us gets to it first then. The race is on. ;-)

@jreback That is correct, I want to skip lines with a comment character at the beginning. I would imagine the last case could not be handled if the file is space-delimited. Basically, comments probably need to be declared either at the very beginning or the end of a line.

@holocronweaver

I have implemented comment skipping and also fixed a related problem with CSV format sniffing. I did so by modifying io/parsers.py and have not yet begun work on the c-parser, which apparently my install of pandas is not using by default. Once I have finished making changes in parser.pyx I will push my changes.

A similar problem is whether pandas should ignore empty lines by default. It would be very easy to implement as a slight extension of ignoring comments. I will open a related issue to get some feedback.
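
(As an aside, later pandas versions expose blank-line handling through the skip_blank_lines keyword, default True; a hedged sketch, assuming a pandas recent enough to have it:)

from io import StringIO
import pandas as pd

data = "a,b,c\n1,2,3\n\n\n4,5,6\n"

pd.read_csv(StringIO(data))                          # blank lines skipped (default)
pd.read_csv(StringIO(data), skip_blank_lines=False)  # blank lines become all-NaN rows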

@jreback
Contributor

jreback commented Aug 5, 2013

great!

you can put your changes up as a PR, be sure to enable travis, lmk if you need help

@holocronweaver

Only one build failed in Travis CI, but it does not appear to have anything to do with my changes. Is this expected, or did I break something related to pytables somehow?

@jreback
Contributor

jreback commented Aug 7, 2013

@holocronweaver rebase on master again; I just pushed some code to 'fix' that failure (although net net the changes didn't actually do anything)...but it seemed to fix it

to force travis to rebuild

git commit --amend -C HEAD
git push origin yourbranch -f

this resets the last commit to a new hash, forcing a rebuild

@holocronweaver

I did as you suggested and the latest build (3) is all good. I will go ahead and submit a pull request.

@jreback
Contributor

jreback commented Sep 30, 2013

@montefra doing a PR for this?

@montefra

@jreback: ehm.
I did look into it, but I kind of forgot after @holocronweaver did his PR.
So the answer is no.
When I did spend some time on it (about 9 months ago) I couldn't figure out much of pandas/src/parser/tokenizer.c and proposed a solution, which is likely part of what @holocronweaver has done.

@jreback
Contributor

jreback commented Sep 30, 2013

ok...thanks

@holocronweaver

@jreback @montefra Yep, I have taken care of this, just need to find time to get everything ready for PR.

@jreback
Contributor

jreback commented Feb 15, 2014

@holocronweaver @montefra can either of you update the PR for this?

@holocronweaver

@jreback Will do so the moment I have a chance. Long time coming, I know. =) High on my priority list.
