read_table corrupts last column names #166

Closed
coolbutuseless opened this Issue May 6, 2015 · 5 comments

coolbutuseless commented May 6, 2015

When reading a simple whitespace separated file with read_table()

 MAID        TIME        AMT
  1.0000E+00  0.0000E+00  2.5000E+00
  1.0000E+00  0.0000E+00  0.0000E+00

The parsed colname for the final column is corrupt:

> read_table("test_nonmem2.txt")
Source: local data frame [2 x 3]

  MAID TIME AMT\n  1.000
1    1    0          2.5
2    1    0          0.0

hadley commented May 7, 2015

That isn't a valid format for read_table(): it expects the whitespace-separated columns to be in consistent positions.

coolbutuseless commented May 10, 2015

(Note: a cut/paste error means the headers in my above example weren't aligned)
So all values must take the same amount of space regardless of whether it's a header row or not?

MAID        TIME        AMT
1.0000E+00  0.0000E+00  2.5000E+00
1.0000E+00  0.0000E+00  0.0000E+00

The above (with aligned column starts) still causes a column misread:

> read_table("test.txt", col_names=TRUE)
  MAID TIME AMT\n1.0000
1    1    0         2.5
2    1    0         0.0

But if I pad the column names they are read correctly:

MAIDxxxxxx  TIMExxxxxx  AMTxxxxxxx
1.0000E+00  0.0000E+00  2.5000E+00
1.0000E+00  0.0000E+00  0.0000E+00

> read_table("test.txt", col_names=TRUE)
  MAIDxxxxxx TIMExxxxxx AMTxxxxxxx
1          1          0        2.5
2          1          0        0.0

So length of column names must match the data width for read_table()?

This is related to issue #121, because read_table() works about 20x faster than read.table() on NONMEM files, with the only problem being the parsing of the header for column names.

hadley commented May 11, 2015

read_fwf() (and hence read_table()) expects every row to have the same number of characters.
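
One way to see whether a file meets that expectation is to compare the character count of each row. The following is a minimal sketch in base R, assuming the file contents from the aligned example earlier in this thread; the temporary file and the variable names are illustrative only.

```r
# Assumed file contents, copied from the aligned example in this thread.
lines <- c(
  "MAID        TIME        AMT",
  "1.0000E+00  0.0000E+00  2.5000E+00",
  "1.0000E+00  0.0000E+00  0.0000E+00"
)
path <- tempfile(fileext = ".txt")  # hypothetical temporary file
writeLines(lines, path)

# Count the characters in each row of the file.
widths <- nchar(readLines(path))
print(widths)

# TRUE only when every row, header included, has the same width.
rows_consistent <- length(unique(widths)) == 1
print(rows_consistent)
```

Here the header row is shorter than the data rows, so the check reports an inconsistency, which matches the misread of the final column name shown above.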

hadley commented May 11, 2015

If the problem is just reading the header row, maybe you should just skip it and read it yourself?
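
A minimal sketch of that workaround, assuming the example file from this thread and using the `col_names` and `skip` arguments of read_table(); the temporary file and the `header` variable are illustrative, not part of readr.

```r
library(readr)

# Assumed file contents, copied from the aligned example in this thread.
path <- tempfile(fileext = ".txt")  # hypothetical temporary file
writeLines(c(
  "MAID        TIME        AMT",
  "1.0000E+00  0.0000E+00  2.5000E+00",
  "1.0000E+00  0.0000E+00  0.0000E+00"
), path)

# Read the header line on its own and split it on runs of whitespace.
header <- strsplit(trimws(readLines(path, n = 1)), "\\s+")[[1]]

# Skip the header row in the file and supply the column names explicitly.
df <- read_table(path, col_names = header, skip = 1)
print(df)
```

Splitting on whitespace assumes that no column name contains a space; a header with embedded spaces would need a different rule.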

coolbutuseless commented May 11, 2015

That's what I've done for now, but I would hazard that the probability of the column names ever being the same width as the data in a fixed-width file is very low, so the current read_table() header parsing would rarely work the way a user expects.

Would there be any benefit to parsing the header slightly differently by default for fwf? Or as an option e.g. relaxed_headers = TRUE.

Or am I an outlier? :)

hadley closed this in c1bd4bd on Sep 21, 2015

lock bot locked and limited conversation to collaborators on Sep 25, 2018