read_table corrupts last column names #166

Closed
coolbutuseless opened this issue May 6, 2015 · 5 comments

Comments

@coolbutuseless coolbutuseless commented May 6, 2015

When reading a simple whitespace-separated file with read_table():

 MAID        TIME        AMT
  1.0000E+00  0.0000E+00  2.5000E+00
  1.0000E+00  0.0000E+00  0.0000E+00

The parsed colname for the final column is corrupt:

> read_table("test_nonmem2.txt")
Source: local data frame [2 x 3]

  MAID TIME AMT\n  1.000
1    1    0          2.5
2    1    0          0.0

@hadley hadley commented May 7, 2015

That isn't a valid format for read_table(): it expects the whitespace-separated columns to be in consistent positions.

@coolbutuseless coolbutuseless commented May 10, 2015

(Note: a cut/paste error means the headers in my above example weren't aligned)
So all values must take the same amount of space regardless of whether it's a header row or not?

MAID        TIME        AMT
1.0000E+00  0.0000E+00  2.5000E+00
1.0000E+00  0.0000E+00  0.0000E+00

The above (with aligned column starts) still causes a column misread:

> read_table("test.txt", col_names=TRUE)
  MAID TIME AMT\n1.0000
1    1    0         2.5
2    1    0         0.0

But if I pad the column names they are read correctly:

MAIDxxxxxx  TIMExxxxxx  AMTxxxxxxx
1.0000E+00  0.0000E+00  2.5000E+00
1.0000E+00  0.0000E+00  0.0000E+00
> read_table("test.txt", col_names=TRUE)
  MAIDxxxxxx TIMExxxxxx AMTxxxxxxx
1          1          0        2.5
2          1          0        0.0

So length of column names must match the data width for read_table()?

This is related to issue #121, because read_table() works about 20x faster than read.table() on NONMEM files, with the only problem being the parsing of the header for column names.

@hadley hadley commented May 11, 2015

read_fwf() (and hence read_table()) expects every row to have the same number of characters
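
A quick way to see whether a file meets that expectation is to compare the character count of each line. The sketch below is only an illustration: the `lines` vector is a hypothetical stand-in for the aligned example earlier in this thread, and it uses base R only.

```r
# Hypothetical stand-in for the aligned example file from this thread.
lines <- c(
  "MAID        TIME        AMT",
  "1.0000E+00  0.0000E+00  2.5000E+00",
  "1.0000E+00  0.0000E+00  0.0000E+00"
)

# Number of characters in each line.
widths <- nchar(lines)

# Indices of lines that are shorter than the widest line.
short <- which(widths < max(widths))

# Right-pad every line with spaces to the widest line's length.
padded <- formatC(lines, width = max(widths), flag = "-")
```

Here the header line comes out shorter than the data rows, which matches the case where the last column name was misread.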

@hadley hadley commented May 11, 2015

If the problem is just reading the header row, maybe you should just skip it and read it yourself?
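
A minimal sketch of that workaround, assuming the header is a single line of whitespace-separated names. The temporary file and the variable names `path`, `header` and `body` are made up for illustration; `col_names` and `skip` are existing read_table() arguments.

```r
library(readr)

# Write the example from this thread to a temporary file (illustrative only).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "MAID        TIME        AMT",
  "1.0000E+00  0.0000E+00  2.5000E+00",
  "1.0000E+00  0.0000E+00  0.0000E+00"
), path)

# Read the header line on its own and split it on runs of whitespace.
header <- strsplit(trimws(readLines(path, n = 1)), "\\s+")[[1]]

# Skip the header line and supply the column names explicitly.
body <- read_table(path, col_names = header, skip = 1)
```

This keeps the fast parser for the data rows and only handles the one header line by hand.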

@coolbutuseless coolbutuseless commented May 11, 2015

That's what I've done for now, but I would hazard that the probability that the column names are ever the same width as the data in a FWF is very very low, and the times when the current read_table header parsing works as a user expects would be rare.

Would there be any benefit to parsing the header slightly differently by default for fwf? Or as an option e.g. relaxed_headers = TRUE.

Or am I an outlier? :)

@hadley hadley closed this in c1bd4bd Sep 21, 2015
@lock lock bot locked and limited conversation to collaborators Sep 25, 2018