read_table() can wrongly read numbers (silently) #518

Closed
aphalo opened this issue Sep 12, 2016 · 4 comments
Labels
feature a feature request or enhancement read 📖

Comments

aphalo commented Sep 12, 2016

read_table() misreads numbers in the first column when enough lines at the top of the file start with white space.

Of the attached files, the shorter one, readr-read-table.txt, is read correctly, but the longer one, readr-read-table-2.txt, is read incorrectly, with 1000 interpreted as 0, 1001 as 1, etc.

 readr::read_table("readr-read-table.txt", col_types = "dd")
 readr::read_table("readr-read-table-2.txt", col_types = "dd")

readr-read-table.txt
readr-read-table-2.txt

I noticed a problem with some of my files some months ago, but I did not find the cause of the problem until yesterday. utils::read.table() has no problems with either of these files.

readr 1.0.0.9000, installed from this repository minutes ago.
R 3.3.1 Windows 10 x64.
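
The attached files are not reproduced here, so the following is only a sketch of a file with the same shape: a right-justified first column whose first 100+ rows have three digits and a leading space. The values, the column names and the temporary path are assumptions made for illustration. Whether read_table() misreads it depends on the installed readr version, so no output is shown for that call.

```r
# Hypothetical stand-in for the attached files: rows 850..999 start with a
# space, rows 1000..1049 do not.
wavelength <- 850:1049
value <- seq(0, 1, length.out = length(wavelength))
lines <- sprintf("%4d %6.3f", wavelength, value)

path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Base R splits on runs of white space, so the leading spaces are harmless.
base_result <- utils::read.table(path, col.names = c("wavelength", "value"))
base_ok <- all(base_result$wavelength == wavelength)

# Only runs when readr is installed; compare against the known values.
if (requireNamespace("readr", quietly = TRUE)) {
  readr_result <- readr::read_table(
    path,
    col_names = c("wavelength", "value"),
    col_types = "dd"
  )
  print(all(readr_result$wavelength == wavelength))
}
```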

Contributor

yeedle commented Oct 28, 2016

Here's another similar complaint citing the same issue. I'm not sure it's a bug though. By design, the tokenizer reads only the first 100 lines to determine the structure of the file. This is probably for efficiency, though it wouldn't hurt to mention it in the documentation.
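
To make the 100-line effect concrete, here is a base R imitation of guessing column boundaries from blank character positions. It is not readr's code: `guess_columns()` is a hypothetical helper and the sample lines are made up, under the assumption that boundaries are taken from the first 100 lines only.

```r
# Hypothetical sample: right-justified first column, three digits for the
# first 150 rows (850..999), four digits afterwards (1000..1049).
wavelength <- 850:1049
value <- seq(0, 1, length.out = length(wavelength))
lines <- sprintf("%4d %6.3f", wavelength, value)

# Hypothetical helper (not a readr function): a character position belongs to
# a column if it is non-blank in at least one of the first `n` lines.
guess_columns <- function(lines, n = 100) {
  head_lines <- head(lines, n)
  width <- max(nchar(head_lines))
  used <- vapply(seq_len(width), function(i) {
    ch <- substr(head_lines, i, i)
    any(ch != " " & ch != "")
  }, logical(1))
  runs <- rle(used)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(start = starts[runs$values], end = ends[runs$values])
}

# Guessing from the first 100 lines puts the first column at positions 2..4,
# because position 1 is blank in every one of those lines.
cols <- guess_columns(lines)
row_1000 <- lines[wavelength == 1000]
cut_field <- substr(row_1000, cols$start[1], cols$end[1])  # "000", parses as 0

# Guessing from all lines sees the four-digit rows and starts at position 1.
all_cols <- guess_columns(lines, n = length(lines))
```

Cutting the row for 1000 at the guessed positions drops its leading digit, which matches the 1000 to 0 and 1001 to 1 pattern in the original report.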

Author

aphalo commented Oct 28, 2016

An option to remove white space at the start of lines before parsing could solve the problem, I think. In many data sets where the first column is an ordered index quantity like wavelength, it is common for the numbers to be right-justified. This may happen even in files where the columns are not all aligned, which prevents the use of a fixed format.
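
As a rough sketch of that idea in base R (no such option exists in readr as far as this thread shows): strip leading white space from each line, then split on runs of white space instead of relying on character positions. The sample lines and column names are invented for illustration.

```r
# Hypothetical lines with a right-justified first column.
raw <- c("  998 0.10", "  999 0.20", " 1000 0.30")

# Without trimming, splitting on white space yields an empty leading field.
untrimmed_first <- strsplit(raw, "[[:space:]]+")[[1]]

# Remove leading white space, then split on runs of white space.
trimmed <- sub("^[[:space:]]+", "", raw)
fields <- strsplit(trimmed, "[[:space:]]+")

# Convert each row to numbers and bind the rows into a matrix.
parsed <- do.call(rbind, lapply(fields, as.numeric))
colnames(parsed) <- c("wavelength", "value")
```

This only helps when fields are separated by white space; a reader that infers fixed positions would still see misaligned columns after trimming.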

Contributor

yeedle commented Oct 28, 2016

The same issue occurs with both right-aligned and left-aligned columns. See the SO question I linked to above. There, the issue occurred with a left-justified column.

Member

hadley commented Dec 22, 2016

Yes, read_table() is potentially unreliable because it is magic. If you favour correctness over ease-of-use, you should use read_fwf().

That said, read_table() should print out the spec that it uses (a la col_types) so you can tweak afterwards.
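
For reference, a minimal sketch of the read_fwf() route with explicit column widths, so nothing is inferred from the top of the file. The sample file, the widths (4 and 7 characters) and the column names are assumptions for illustration, not taken from the attached files.

```r
# Hypothetical sample file: 4-character right-justified first column, then a
# space and a 6-character second column.
wavelength <- 850:1049
value <- seq(0, 1, length.out = length(wavelength))
path <- tempfile(fileext = ".txt")
writeLines(sprintf("%4d %6.3f", wavelength, value), path)

# Only runs when readr is installed.
if (requireNamespace("readr", quietly = TRUE)) {
  # Widths are stated explicitly: 4 characters, then 7 (separator + value).
  positions <- readr::fwf_widths(c(4, 7), col_names = c("wavelength", "value"))
  fwf_result <- readr::read_fwf(path, col_positions = positions, col_types = "dd")
  fwf_ok <- all(fwf_result$wavelength == wavelength)
}
```

The cost is that the widths have to be known in advance, which is the ease-of-use trade-off mentioned above.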
