read_csv column type guess for 1000 rows of "NA" is logical - should be character? #839

Closed
jzadra opened this issue May 2, 2018 · 3 comments

Comments

@jzadra

jzadra commented May 2, 2018

If col_types = NULL, and a column has the first 1000 rows with a value of NA, read_csv specifies the column as logical.

(As an aside, this seems to have changed in the last several months, as I discovered this with old code that used to yield these columns as character type).

This seems like a behavior that is prone to errors. If we only know that the first 1000 are NA, there is no reason to assume what comes later will be logical. The most robust handling of this situation would be to treat it as character so that there is no risk of parsing failures that coerce values to fit some other class.

Here's a test file: basically a CSV with well over 1000 rows of "NA" followed by a single row with "TEST" (row 1290 in the output below).
natest.csv.zip

require(readr)
#> Loading required package: readr
read_csv("~/temp/natest.csv")
#> Parsed with column specification:
#> cols(
#>   numbers = col_double(),
#>   testcol = col_logical()
#> )
#> Warning: 1 parsing failure.
#>  row     col           expected actual                file
#> 1290 testcol 1/0/T/F/TRUE/FALSE   TEST '~/temp/natest.csv'
#> # A tibble: 1,290 x 2
#>    numbers testcol
#>      <dbl> <lgl>  
#>  1      1. NA     
#>  2      2. NA     
#>  3      3. NA     
#>  4      4. NA     
#>  5      5. NA     
#>  6      6. NA     
#>  7      7. NA     
#>  8      8. NA     
#>  9      9. NA     
#> 10     10. NA     
#> # ... with 1,280 more rows

Created on 2018-05-02 by the reprex package (v0.2.0).

@jzadra jzadra changed the title from 'read_csv column type guess for 1000 rows of "NA" is logical - should be character' to 'read_csv column type guess for 1000 rows of "NA" is logical - should be character?' May 2, 2018
@jennybc
Member

jennybc commented May 2, 2018

This is a considered decision and, I think, should have been this way from the start. If you have no information to go on, the most R-like thing to do is to guess the missing data is logical. This is also critical for later vector-binding or coercion or row-binding, because you can upcast logical NAs but can't downcast character.
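
To illustrate the coercion argument, here is a minimal base R sketch (not taken from the linked discussions) showing that a logical NA takes on whatever type it is combined with, whereas a character value forces everything else to character:

```r
# A bare NA is logical, the most general missing value in R.
class(NA)                 # "logical"

# Combining a logical NA with typed values adopts the other type.
class(c(NA, 1L))          # "integer"
class(c(NA, 2.5))         # "numeric"
class(c(NA, "TEST"))      # "character"

# A character NA pulls the other value up to character: 1L becomes "1".
c(NA_character_, 1L)      # NA "1"
```

So an all-NA column guessed as logical can still be combined with a typed column later, while one guessed as character would turn numbers into strings.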

Some places to read previous discussion re: why things are the way they are now:

#662

tidyverse/tidyverse#34 (comment)

If you know a column should be character, then it's best to express that outright. Or if you want to increase the number of rows used for guessing, you can increase guess_max. Also, cols() allows you to set your own default column type.
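
The three options could look roughly like the sketch below. The temporary file is a hypothetical stand-in for natest.csv (the same column names, but 1200 NA rows rather than the original file's contents), so treat it as an illustration, not a reproduction of the report:

```r
library(readr)

# Hypothetical stand-in for natest.csv: 1200 rows of NA, then one string.
path <- tempfile(fileext = ".csv")
df <- data.frame(
  numbers = as.double(1:1201),
  testcol = c(rep(NA_character_, 1200), "TEST"),
  stringsAsFactors = FALSE
)
write_csv(df, path)

# Option 1: state the column type outright; other columns are still guessed.
by_type <- read_csv(path, col_types = cols(testcol = col_character()))

# Option 2: let the guesser look at more rows than the default 1000.
by_guess <- read_csv(path, guess_max = 2000)

# Option 3: make character the default, overriding it where needed.
by_default <- read_csv(
  path,
  col_types = cols(.default = col_character(), numbers = col_double())
)
```

Option 1 is the most explicit. Option 2 only helps when the first non-missing value falls within guess_max rows, so it trades read time for a better guess without guaranteeing one.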

@jimhester
Collaborator

As @jennybc mentioned, this is by design and we feel the current behavior is the best of the available imperfect options.

@lock

lock bot commented Oct 31, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 31, 2018