read_table corrupts last column names #166

coolbutuseless · 2015-05-06T23:52:35Z

When reading a simple whitespace separated file with read_table()

 MAID        TIME        AMT
  1.0000E+00  0.0000E+00  2.5000E+00
  1.0000E+00  0.0000E+00  0.0000E+00

The parsed colname for the final column is corrupt:

> read_table("test_nonmem2.txt")
Source: local data frame [2 x 3]

  MAID TIME AMT\n  1.000
1    1    0          2.5
2    1    0          0.0

hadley · 2015-05-07T11:39:10Z

That isn't a valid format for read_table() - it expects the whitespace-separated columns to be in consistent positions

coolbutuseless · 2015-05-10T12:24:28Z

(Note: a cut/paste error means the headers in my above example weren't aligned)
So all values must take the same amount of space regardless of whether it's a header row or not?

MAID        TIME        AMT
1.0000E+00  0.0000E+00  2.5000E+00
1.0000E+00  0.0000E+00  0.0000E+00

The above (with aligned column starts) still causes a column misread:

> read_table("test.txt", col_names=TRUE)
  MAID TIME AMT\n1.0000
1    1    0         2.5
2    1    0         0.0

But if I pad the column names they are read correctly:

MAIDxxxxxx  TIMExxxxxx  AMTxxxxxxx
1.0000E+00  0.0000E+00  2.5000E+00
1.0000E+00  0.0000E+00  0.0000E+00

> read_table("test.txt", col_names=TRUE)
  MAIDxxxxxx TIMExxxxxx AMTxxxxxxx
1          1          0        2.5
2          1          0        0.0

So length of column names must match the data width for read_table()?

This is related to issue #121, because read_table() works about 20x faster than read.table() on NONMEM files, with the only problem being the parsing of the header for column names.

hadley · 2015-05-11T12:25:03Z

read_fwf() (and hence read_table()) expects every row to have the same number of characters
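A quick way to see why the aligned example above still fails this requirement is to compare line widths. This is only a sketch in base R; the lines are rebuilt from the example in this thread, and the variable names are illustrative:

```r
# Rebuild the aligned example: each column starts 12 characters after the previous one.
header_line <- sprintf("%-12s%-12s%s", "MAID", "TIME", "AMT")
data_line_1 <- paste("1.0000E+00", "0.0000E+00", "2.5000E+00", sep = "  ")
data_line_2 <- paste("1.0000E+00", "0.0000E+00", "0.0000E+00", sep = "  ")

lines <- c(header_line, data_line_1, data_line_2)

# Width of every line in characters.
widths <- nchar(lines)

# The header ends right after "AMT", so it is shorter than the data rows.
all_same_width <- length(unique(widths)) == 1
print(widths)
print(all_same_width)
```

The column starts line up, but the header row stops after "AMT" while the data rows run on to the end of the last value, so the rows do not all have the same number of characters.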

hadley · 2015-05-11T12:25:38Z

If the problem is just reading the header row, maybe you should just skip it and read it yourself?
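A minimal sketch of that workaround, assuming the header is a single line of whitespace-separated names and that col_names accepts a character vector alongside skip. The temporary file and variable names are illustrative only:

```r
library(readr)

# Write the aligned example from this thread to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  sprintf("%-12s%-12s%s", "MAID", "TIME", "AMT"),
  paste("1.0000E+00", "0.0000E+00", "2.5000E+00", sep = "  "),
  paste("1.0000E+00", "0.0000E+00", "0.0000E+00", sep = "  ")
), path)

# Read only the header line and split it on runs of whitespace.
header <- strsplit(trimws(readLines(path, n = 1)), "[[:space:]]+")[[1]]

# Skip the header row in the file and supply the names explicitly.
df <- read_table(path, col_names = header, skip = 1)
```

Because the header is parsed separately, its width no longer has to match the width of the data rows.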

coolbutuseless · 2015-05-11T21:41:59Z

That's what I've done for now, but I would hazard that the probability that the column names are ever the same width as the data in a FWF is very very low, and the times when the current read_table header parsing works as a user expects would be rare.

Would there be any benefit to parsing the header slightly differently by default for fwf? Or as an option e.g. relaxed_headers = TRUE.

Or am I an outlier? :)

hadley closed this as completed in c1bd4bd Sep 21, 2015

ghaarsma mentioned this issue Oct 26, 2015

read_fwf cannot read subset of columns after 0.2.x upgrade #300

Closed

lock bot locked and limited conversation to collaborators Sep 25, 2018
