Make an all-NA variable logical #662

Closed
jennybc opened this issue Apr 27, 2017 · 4 comments

Comments

@jennybc
Member

jennybc commented Apr 27, 2017

It seems like a variable that is filled with NA should be logical not character.

readr::read_csv("v1,v2\nNA,NA")
#> # A tibble: 1 × 2
#>      v1    v2
#>   <chr> <chr>
#> 1  <NA>  <NA>
read.csv(text = "v1,v2\nNA,NA")
#>   v1 v2
#> 1 NA NA
sapply(read.csv(text = "v1,v2\nNA,NA"), class)
#>        v1        v2 
#> "logical" "logical"

I feel like I've raised this issue before and learned that the choice of character was somewhat deliberate? But now I can't remember why. Recent tidyverse ingest discussion supports a default to logical.

Was thinking of this because googlesheets runs data through readr::read_csv() in some cases. Working to get consistent result from googlesheets, readxl, and readr for the same data.

@jimhester
Collaborator

jimhester commented Apr 27, 2017

This behavior was defined in 1f352e5. The rationale seems to be that if the first X values are missing, readr will guess a character vector, so that any non-missing values later in the column won't be lost. If we guess logical, then later non-missing values that are not T/F are lost.
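
The trade-off can be illustrated with a minimal base R sketch. It does not call readr's parser; `as.logical()` and `as.character()` are only stand-ins for "the column was guessed as logical" and "the column was guessed as character", and the `values` vector is made-up data.

```r
# Hypothetical column: the leading values are missing, real data shows up later.
values <- c(NA, NA, NA, "apple", "7")

# Stand-in for a logical guess: strings that are not T/F-like cannot be kept.
as_logical <- as.logical(values)

# Stand-in for a character guess: every non-missing value survives.
as_character <- as.character(values)

sum(!is.na(as_logical))    # how many values survive as logical
sum(!is.na(as_character))  # how many values survive as character
```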

@jennybc
Member Author

jennybc commented Apr 27, 2017

Yes, this sounds familiar now. Is it prohibitive to do a check after the data has been read to detect all-NA columns and make them logical?
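
A minimal sketch of what such a post-read check could look like, written in base R so it works on any data frame. The helper name `all_na_to_logical()` is hypothetical and not part of readr; it only shows the idea of converting character columns that turned out to be entirely NA.

```r
# Hypothetical helper: after reading, turn all-NA character columns into logical.
all_na_to_logical <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    # Only touch character columns in which every value is missing.
    if (is.character(col) && all(is.na(col))) {
      df[[nm]] <- as.logical(col)
    }
  }
  df
}

# Stand-in for the result of reading "v1,v2\nNA,a" with character guesses.
df <- data.frame(v1 = NA_character_, v2 = "a", stringsAsFactors = FALSE)
fixed <- all_na_to_logical(df)
sapply(fixed, class)
```

Note that a character column with zero rows also satisfies `all(is.na(col))`, so this sketch would convert it to logical as well.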

@jennybc
Member Author

jennybc commented Oct 29, 2017

I have a new example of why logical NA would really be better than character. At least in the "all NA" case. I actually think the default should be logical when the first guess_max observations are missing, even if there is data later. FWIW, that's what readxl does, although its guessing is based on cell types, not data.

I want to do this: map_dfr(files, read_csv, .id = "card") but cannot because there's a variable that is empty in some csv files (ergo, character NAs) and present in others (data is integer). So the row bind fails.

You can see the problem with read_csv() and dplyr::bind_rows() alone. It's also related to dplyr becoming very strict about coercions, which seems to be an intentional decision (see tidyverse/dplyr#2209 (comment)). readr's default to character and dplyr's refusal to coerce is an unfortunate combination. In this case, read.csv() (+ rbind() or bind_rows()) gives a more natural result.

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-10-28

library(tidyverse)

df1 <- read_csv("x,y\n1,\n")
df2 <- read_csv("x,y\n3,4\n")
bind_rows(df1, df2)
#> Error in bind_rows_(x, .id): Column `y` can't be converted from character to integer
rbind(df1, df2) ## "works" but is wrong because now y is character
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <chr>
#> 1     1  <NA>
#> 2     3     4

df1 <- read.csv(text = "x,y\n1,\n")
df2 <- read.csv(text = "x,y\n3,4\n")
rbind(df1, df2) %>% as_tibble()
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1    NA
#> 2     3     4
bind_rows(df1, df2) %>% as_tibble()
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1    NA
#> 2     3     4
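
The difference comes down to how a missing-only column combines with an integer column. A small base R sketch of that mechanism, using hand-built data frames as stand-ins for what a reader would return (no readr or dplyr involved):

```r
# y is all-NA in the first "file"; built by hand as logical vs. character.
df_lgl <- data.frame(x = 1L, y = NA)
df_chr <- data.frame(x = 1L, y = NA_character_, stringsAsFactors = FALSE)

# y holds integer data in the second "file".
df_int <- data.frame(x = 3L, y = 4L)

from_lgl <- rbind(df_lgl, df_int)  # logical NA is promoted to integer
from_chr <- rbind(df_chr, df_int)  # integer is converted to character

class(from_lgl$y)
class(from_chr$y)
```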

@lock

lock bot commented Sep 25, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018