Handling of trailing delimiters in read_csv #2442

Closed
wesm opened this issue Dec 6, 2012 · 10 comments
Labels: IO Data (IO issues that don't fit into a more specific label)
Comments

@wesm (Member) commented Dec 6, 2012

xref http://stackoverflow.com/questions/13719946/python-pandas-trailing-delimiter-confuses-read-csv

@edwardw commented Dec 8, 2012

To reproduce the bug, create a two-line CSV file: the first line is the header, without a trailing delimiter; the second line is data, with a trailing delimiter. read_csv then creates a DataFrame in which the headers and columns are offset by one.
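
A minimal sketch of that repro (the contents and printed output are illustrative; exact behaviour can vary by pandas version):

import pandas as pd
from io import StringIO

# Header has two names; each data row ends with a trailing comma, i.e. three fields.
data = "A,B\n1,2,\n3,4,\n"
print(pd.read_csv(StringIO(data)))
#    A   B
# 1  2 NaN
# 3  4 NaN   <- 1 and 3 become the index, so A/B are shifted by one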

@wesm (Member, Author) commented Dec 10, 2012

This is very annoying because the index/row-name inference is very useful in most cases but breaks down when you have a malformed file. I'll think about it some.

@changhiskhan (Contributor) commented:

Hmmm... custom dialect option?

@wesm (Member, Author) commented Dec 10, 2012

We should probably add an option like index_col=False and deal with the empty column. I have the latest FEC file (which has ballooned, remarkably, to 900+ MB) to try it out.
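
A sketch of how that index_col=False behaviour could look on the two-line example above (this mirrors the documented handling of trailing delimiters; output may differ across versions):

import pandas as pd
from io import StringIO

data = "A,B\n1,2,\n3,4,\n"
# index_col=False stops the parser from inferring an index from the extra
# field, so the trailing empty value is discarded instead of shifting the header.
print(pd.read_csv(StringIO(data), index_col=False))
#    A  B
# 0  1  2
# 1  3  4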

@edwardw commented Dec 10, 2012

While we are on it, may I suggest a read_csv feature? The FEC file I used (it was 700+ MB a week ago) was too large for the 4 GB of memory on my MacBook. If I tried to read the file in one run, it took forever because of page faults. But since not all of the 20 or so columns were used, I read the file in 4 chunks, dropped the unused columns, appended each chunk to an accumulator, and ended up with a big DataFrame that held all the rows but only the columns I was interested in (roughly the workflow sketched below).

So an option telling pandas.read_csv to read only certain columns could be very useful.
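
A sketch of that chunked workflow using read_csv's chunksize argument (the file name and column names here are placeholders, not the real FEC schema):

import pandas as pd

keep = ["cand_nm", "contbr_st", "contb_receipt_amt"]  # placeholder column names
pieces = []
# Read the large file in chunks and keep only the columns of interest.
for chunk in pd.read_csv("fec_contributions.csv", chunksize=500000):
    pieces.append(chunk[keep])
df = pd.concat(pieces, ignore_index=True)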

@wesm (Member, Author) commented Dec 10, 2012

This is already done in the development version of pandas; you should install it. See the usecols option.
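
For reference, a sketch of usecols (file name and column names are placeholders):

import pandas as pd

# Only the listed columns are parsed, which keeps memory usage down on very large files.
df = pd.read_csv("fec_contributions.csv", usecols=["cand_nm", "contbr_st", "contb_receipt_amt"])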

wesm closed this as completed in 648d581 on Dec 10, 2012
@wesm (Member, Author) commented Dec 10, 2012

I wrote a blog post about this, enjoy: http://wesmckinney.com/blog/?p=635

@johannesschweig commented:

Blog link is dead.

@wesm (Member, Author) commented Dec 4, 2018

see http://wesmckinney.com/blog/update-on-upcoming-pandas-v0-10-new-file-parser-other-performance-wins/

@smcinerney commented Feb 11, 2020

Trailing delimiters on data rows confusing the parser is still an unresolved issue as of pandas 1.0:

import pandas as pd
from io import StringIO

bad_dat = """A,B\n1,2,\n"""
df = pd.read_csv(StringIO(bad_dat), sep=',', header=0, index_col='A')

Traceback (most recent call last):
  File "read_csv_trailing_delimiter_bug2.py", line 6, in <module>
    df = pd.read_csv(StringIO(bad_dat), sep=',', header=0, index_col='A')
  File "/opt/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/opt/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/opt/anaconda/lib/python3.7/site-packages/pandas/io/parsers.py", line 2078, in read
    values = data.pop(self.index_col[i])
KeyError: 'A'

and also, if we try to explicitly specify the (single) header row:

>>> pd.read_csv(StringIO(bad_dat), sep=',', header=[0], index_col='A')
...
ValueError: index_col must only contain row numbers when specifying a multi-index header
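
A possible workaround, building on the index_col=False option discussed earlier in this thread (a sketch reusing bad_dat from the snippet above, not verified against every version): read without inferring an index so the trailing empty field is dropped, then set the index afterwards.

df = pd.read_csv(StringIO(bad_dat), sep=',', header=0, index_col=False).set_index('A')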
