read_csv coerces all column values to NA if 100 first observations are missing #128

Closed
vincentarelbundock opened this Issue Apr 13, 2015 · 7 comments

vincentarelbundock commented Apr 13, 2015

If the first 100 observations of a variable are missing, read_csv overwrites all column values with NAs (unless the column is boolean). Ideally, column type should be determined using the first 100 non-missing values of each column.

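library(readr)
# Build a 200 x 5 data frame and set the first 100 values of X1 to NA
x = data.frame(matrix(rnorm(1000), ncol=5))
x$X1[1:100] = NA
write.csv(x, file='test.csv', row.names=FALSE)
y = read_csv('test.csv')
y$X1         # all NA, including rows 101-200
problems(y)  # parsing problems are attached to the result of read_csv, not to x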

vincentarelbundock commented Apr 13, 2015

FWIW, this actually happens a lot with the type of data I tend to work with. It is also not practical to manually specify column types in my particular application, since hundreds of columns raise warnings.


Member

hadley commented Apr 13, 2015

It's not possible to look for the first 100 non-missing values, because that doesn't take a bounded amount of time to run - it might have to scan the whole file.
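To illustrate the point about unbounded work, here is a minimal base R sketch. The helper `rows_scanned_for_non_missing()` is hypothetical and is not part of readr; it only counts how many values would have to be inspected before 100 non-missing ones are seen.

```r
# Hypothetical helper for illustration only; this is not readr's guessing code.
# Returns how many values must be inspected to see `n` non-missing ones.
rows_scanned_for_non_missing <- function(values, n = 100) {
  seen <- 0
  for (i in seq_along(values)) {
    if (!is.na(values[i])) seen <- seen + 1
    if (seen >= n) return(i)
  }
  length(values)  # ran out of data: the whole column was scanned
}

col_early <- rnorm(5100)                   # no missing values
col_late  <- c(rep(NA, 5000), rnorm(100))  # data only at the very end

rows_scanned_for_non_missing(col_early)  # 100
rows_scanned_for_non_missing(col_late)   # 5100
```

With a fixed window the cost is the same for every column; with "first 100 non-missing" the cost depends on where the data happens to start, and in the worst case it is the whole file.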


vincentarelbundock commented Apr 13, 2015

I had not thought of that. I guess I'll just keep using read.csv for now. Thanks for your work.


Member

hadley commented Apr 13, 2015

Oh hmmmm, I think the reason that this is so painful is that I have a bug in my logic somewhere - if the first 100 values are all missing, it should guess that the column is character, since that ensures you don't lose info.


vincentarelbundock commented Apr 14, 2015

Makes sense. Also, you probably don't want to see a proliferation of arguments, but since the 100 number is arbitrary, it might be useful to allow users to specify how many rows the function checks. For example, I'd be willing to waste a few cpu cycles to check 1000 lines and get good type inference.
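For readers arriving later: readr releases after this discussion expose a `guess_max` argument on `read_csv()`, which controls how many rows are used for type guessing. It was not available in the 2015 version discussed in this thread. A minimal sketch, assuming a recent readr and reusing the data from the original report (the temporary file path is an assumption made for the example):

```r
library(readr)

# Recreate the file from the original report: 200 rows, X1 missing in the first 100.
x <- data.frame(matrix(rnorm(1000), ncol = 5))
x$X1[1:100] <- NA
path <- tempfile(fileext = ".csv")
write.csv(x, file = path, row.names = FALSE)

# Guess column types from up to 200 rows, so the non-missing
# values in rows 101-200 are seen by the guesser.
y <- read_csv(path, guess_max = 200)
class(y$X1)
```

A larger `guess_max` trades some extra reading time for better type inference, which is the trade-off suggested in the comment above.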


artemklevtsov commented Apr 15, 2015

This seems to be a duplicate of #124. readxl also has the same bug.

@hadley hadley closed this in 1f352e5 Apr 16, 2015


Member

hadley commented Apr 16, 2015

The major annoyingness of this behaviour should be fixed - now all contents will be loaded without errors into a character vector. I'll continue to explore better heuristics for guessing column type.
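Once a column arrives as character, it can be converted afterwards. The sketch below is one possible workaround, not the change made in the fix itself. It assumes a recent readr in which `cols(.default = col_character())` and `type_convert()` are available, and it forces every column to character to mimic the fallback described above.

```r
library(readr)

# Recreate the file from the original report.
x <- data.frame(matrix(rnorm(1000), ncol = 5))
x$X1[1:100] <- NA
path <- tempfile(fileext = ".csv")
write.csv(x, file = path, row.names = FALSE)

# Read every column as character so no values are lost.
raw <- read_csv(path, col_types = cols(.default = col_character()))

# Re-guess each column's type from the values already in memory.
y <- type_convert(raw)
class(y$X1)
```

Because the data is already loaded, the second pass does not have to stop after a fixed number of rows, at the cost of holding the character version in memory first.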

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018