Make an all-NA variable logical #662

Closed
jennybc opened this Issue Apr 27, 2017 · 4 comments

@jennybc
Member

jennybc commented Apr 27, 2017

It seems like a variable that is filled with NA should be logical not character.

readr::read_csv("v1,v2\nNA,NA")
#> # A tibble: 1 × 2
#>      v1    v2
#>   <chr> <chr>
#> 1  <NA>  <NA>
read.csv(text = "v1,v2\nNA,NA")
#>   v1 v2
#> 1 NA NA
sapply(read.csv(text = "v1,v2\nNA,NA"), class)
#>        v1        v2 
#> "logical" "logical"

I feel like I've raised this issue before and learned that the choice of character was somewhat deliberate? But now I can't remember why. Recent tidyverse ingest discussion supports a default to logical.

I was thinking of this because googlesheets runs data through readr::read_csv() in some cases, and I'm working to get consistent results from googlesheets, readxl, and readr for the same data.

@jimhester


Member

jimhester commented Apr 27, 2017

This behavior was defined in 1f352e5. The rationale seems to be that if the first X values are missing, readr will guess a character vector, so that any non-missing values later in the column won't be lost. If we guess logical, then any later values that are not missing or T/F are lost.

@jennybc


Member

jennybc commented Apr 27, 2017

Yes, this sounds familiar now. Is it prohibitive to do a check after the data has been read to detect all-NA columns and make them logical?
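For illustration, such a post-read check could look roughly like the sketch below. This is only a sketch in base R applied to an already-parsed data frame: `all_na_to_logical()` is a hypothetical helper name, not a readr function, and it says nothing about how readr would do this internally.

```r
# Hypothetical helper (not part of readr): after reading, convert any
# character column that contains only NA into a logical column.
all_na_to_logical <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.character(col) && all(is.na(col))) {
      df[[nm]] <- as.logical(col)
    }
  }
  df
}

# v1 is all NA (stored as character); v2 has one real value.
df <- data.frame(
  v1 = c(NA_character_, NA_character_),
  v2 = c("a", NA),
  stringsAsFactors = FALSE
)
out <- all_na_to_logical(df)
sapply(out, class)
```

Only columns that are entirely NA are touched, so a character column with even one real value keeps its type.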

@jennybc


Member

jennybc commented Oct 29, 2017

I have a new example of why logical NA would really be better than character. At least in the "all NA" case. I actually think the default should be logical when the first guess_max observations are missing, even if there is data later. FWIW, that's what readxl does, although its guessing is based on cell types, not data.

I want to do this: map_dfr(files, read_csv, .id = "card") but cannot because there's a variable that is empty in some csv files (ergo, character NAs) and present in others (data is integer). So the row bind fails.

You can see the problem with read_csv() and dplyr::bind_rows() alone. It's also related to dplyr becoming very strict about coercions, which seems to be an intentional decision (see tidyverse/dplyr#2209 (comment)). readr's default to character and dplyr's refusal to coerce is an unfortunate combination. In this case, read.csv() (+ rbind() or bind_rows()) gives a more natural result.

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-10-28

library(tidyverse)

df1 <- read_csv("x,y\n1,\n")
df2 <- read_csv("x,y\n3,4\n")
bind_rows(df1, df2)
#> Error in bind_rows_(x, .id): Column `y` can't be converted from character to integer
rbind(df1, df2) ## "works" but is wrong because now y is character
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <chr>
#> 1     1  <NA>
#> 2     3     4

df1 <- read.csv(text = "x,y\n1,\n")
df2 <- read.csv(text = "x,y\n3,4\n")
rbind(df1, df2) %>% as_tibble()
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1    NA
#> 2     3     4
bind_rows(df1, df2) %>% as_tibble()
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1    NA
#> 2     3     4
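
One possible workaround, assuming the type of the sometimes-empty column is known ahead of time, is to declare it through `col_types` so that readr does not have to guess. The sketch below reuses the `x`/`y` data from the reprex above; the choice of `col_integer()` for `y` is an assumption about that data, and `spec` is just an illustrative variable name.

```r
library(readr)
library(dplyr)

# Declare y as integer up front; all other columns are still guessed.
spec <- cols(y = col_integer())

df1 <- read_csv("x,y\n1,\n", col_types = spec)   # y is empty, read as integer NA
df2 <- read_csv("x,y\n3,4\n", col_types = spec)  # y has data

out <- bind_rows(df1, df2)
out
```

This does not help when the column types are not known in advance, which is the case the proposed logical default is meant to cover.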
@lock


lock bot commented Sep 25, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018
