read_csv column type guess for 1000 rows of "NA" is logical - should be character? #839

jzadra · 2018-05-02T20:28:28Z

If col_types = NULL and the first 1000 rows of a column are all NA, read_csv guesses the column type as logical.

(As an aside, this seems to have changed in the last several months, as I discovered this with old code that used to yield these columns as character type).

This seems like a behavior that is prone to errors. If we only know that the first 1000 are NA, there is no reason to assume what comes later will be logical. The most robust handling of this situation would be to treat it as character so that there is no risk of parsing failures that coerce values to fit some other class.

Here's a test file: basically just a CSV with more than 1000 rows of "NA" followed by a single row containing "TEST" (row 1290 in the output below).
natest.csv.zip

require(readr)
#> Loading required package: readr
read_csv("~/temp/natest.csv")
#> Parsed with column specification:
#> cols(
#>   numbers = col_double(),
#>   testcol = col_logical()
#> )
#> Warning: 1 parsing failure.
#>  row     col           expected actual                file
#> 1290 testcol 1/0/T/F/TRUE/FALSE   TEST '~/temp/natest.csv'
#> # A tibble: 1,290 x 2
#>    numbers testcol
#>      <dbl> <lgl>  
#>  1      1. NA     
#>  2      2. NA     
#>  3      3. NA     
#>  4      4. NA     
#>  5      5. NA     
#>  6      6. NA     
#>  7      7. NA     
#>  8      8. NA     
#>  9      9. NA     
#> 10     10. NA     
#> # ... with 1,280 more rows

Created on 2018-05-02 by the reprex package (v0.2.0).

jennybc · 2018-05-02T20:43:58Z

This is a considered decision and, I think, should have been this way from the start. If you have no information to go on, the most R-like thing to do is to guess the missing data is logical. This is also critical for later vector-binding or coercion or row-binding, because you can upcast logical NAs but can't downcast character.
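The coercion argument can be illustrated with a minimal base R sketch. It uses plain `c()` rather than any readr or dplyr binding function, and the variable names are purely illustrative:

```r
# Base R only; variable names are illustrative.
all_na <- c(NA, NA, NA)
class(all_na)                 # "logical"

# A logical NA vector is promoted to whatever type it is combined with.
with_text   <- c(all_na, "TEST")
with_number <- c(all_na, 1.5)
class(with_text)              # "character"
class(with_number)            # "numeric"

# A character NA vector does not step back down: the number becomes a string.
all_na_chr <- c(NA_character_, NA_character_)
stuck <- c(all_na_chr, 1.5)
class(stuck)                  # "character"
```

An all-NA column guessed as logical can therefore still be combined with data of any other type later, while one guessed as character forces everything it meets to become character.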

Some places to read previous discussion re: why things are the way they are now:

#662

tidyverse/tidyverse#34 (comment)

If you know a column should be character, then it's best to express that outright. Or if you want to increase the number of rows used for guessing, you can increase guess_max. Also, cols() allows you to set your own default column type.
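As a rough sketch of those three options, assuming a temporary file built here as a stand-in for the attached natest.csv (the row count and the object names `by_column`, `by_guess`, and `by_default` are assumptions for illustration):

```r
library(readr)

# Stand-in for the attached file: 1200 "NA" rows, then one "TEST" row.
path <- tempfile(fileext = ".csv")
writeLines(
  c("numbers,testcol",
    paste0(1:1201, ",", c(rep("NA", 1200), "TEST"))),
  path
)

# Option 1: state the type of the problem column outright;
# the remaining columns are still guessed.
by_column <- read_csv(path, col_types = cols(testcol = col_character()))

# Option 2: let the guesser look at more rows than the file contains.
by_guess <- read_csv(path, guess_max = 2000)

# Option 3: change the default type for every column not named explicitly.
by_default <- read_csv(path, col_types = cols(.default = col_character()))
```

Option 1 is the most targeted. Option 3 also turns the numbers column into character, so it suits files where you plan to convert types yourself afterwards.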

jimhester · 2018-05-04T13:14:46Z

As @jennybc mentioned, this is by design and we feel the current behavior is the best of the available imperfect options.

lock · 2018-10-31T14:11:30Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

jzadra changed the title ~~read_csv column type guess for 1000 rows of "NA" is logical - should be character~~ read_csv column type guess for 1000 rows of "NA" is logical - should be character? May 2, 2018

jimhester closed this as completed May 4, 2018

lock bot locked and limited conversation to collaborators Oct 31, 2018
