Issue with bind_rows and types #5358

Closed
Fablepongiste opened this issue Jun 25, 2020 · 16 comments

@Fablepongiste

Fablepongiste commented Jun 25, 2020

Should this really fail?

bind_rows() is extremely strict about types, probably too strict?

df1 <- structure(list(TEAM_ID = c("1", "2", "3", "4", "5", "6"), TEAM_ABBREVIATION = c("DEN", 
"DAL", "NYK", "ATL", "CHA", "MIA"), TEAM_NAME = c("Denver Nuggets", 
"Dallas Mavericks", "New York Knicks", "Atlanta Hawks", "Charlotte Hornets", 
"Miami Heat"), GAME_ID = c("1", "2", "3", "4", "5", "6"), GAME_DATE = c("2020-03-11", 
"2020-03-11", "2020-03-11", "2020-03-11", "2020-03-11", "2020-03-11"
)), row.names = c(NA, 6L), class = "data.frame")

df2 <- structure(list(TEAM_ID = logical(0), TEAM_ABBREVIATION = logical(0), 
    TEAM_NAME = logical(0), GAME_ID = logical(0), GAME_DATE = logical(0)), class = "data.frame", row.names = integer(0))

df <- bind_rows(df1, df2)

Error: Can't combine `..1$TEAM_ID` and `..2$TEAM_ID`.

When you get no data, column types can sometimes default to logical. Shouldn't bind_rows() take either the type from the first data frame or, better, the type from the data frame that actually has data?

This has only happened since the switch to vctrs in dplyr 1.0.

It might be on purpose, in which case that's fine, I guess.

@hadley
Member

hadley commented Jun 25, 2020

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

@Fablepongiste
Author

Fablepongiste commented Jun 25, 2020

library(dplyr)

df1 <- structure(list(TEAM_ID = c("1", "2", "3", "4", "5", "6"),
                      TEAM_ABBREVIATION = c("DEN", 
                                            "DAL", "NYK", "ATL", "CHA", "MIA"), 
                      TEAM_NAME = c("Denver Nuggets", 
                                    "Dallas Mavericks", "New York Knicks", "Atlanta Hawks", "Charlotte Hornets", 
                                    "Miami Heat"), 
                      GAME_ID = c("1", "2", "3", "4", "5", "6"), 
                      GAME_DATE = c("2020-03-11", 
                                    "2020-03-11", "2020-03-11", "2020-03-11", "2020-03-11", "2020-03-11"
                      )), 
                 row.names = c(NA, 6L), class = "data.frame")

df2 <- structure(list(TEAM_ID = logical(0), TEAM_ABBREVIATION = logical(0), 
                      TEAM_NAME = logical(0), GAME_ID = logical(0), GAME_DATE = logical(0)), 
                 class = "data.frame", row.names = integer(0))

df <- bind_rows(df1, df2)
#> Error in bind_rows(df1, df2): could not find function "bind_rows"

Created on 2020-06-25 by the reprex package (v0.3.0)

@lionel-
Member

lionel- commented Jun 26, 2020

@Fablepongiste Thank you. Can you please make sure the error message in the reprex corresponds to the one you're reporting? It looks like you're missing a library(dplyr).

@romainfrancois
Member

@Fablepongiste:

library(dplyr, warn.conflicts = FALSE)

df1 <- structure(list(TEAM_ID = c("1", "2", "3", "4", "5", "6"),
                      TEAM_ABBREVIATION = c("DEN", 
                                            "DAL", "NYK", "ATL", "CHA", "MIA"), 
                      TEAM_NAME = c("Denver Nuggets", 
                                    "Dallas Mavericks", "New York Knicks", "Atlanta Hawks", "Charlotte Hornets", 
                                    "Miami Heat"), 
                      GAME_ID = c("1", "2", "3", "4", "5", "6"), 
                      GAME_DATE = c("2020-03-11", 
                                    "2020-03-11", "2020-03-11", "2020-03-11", "2020-03-11", "2020-03-11"
                      )), 
                 row.names = c(NA, 6L), class = "data.frame")

df2 <- structure(list(TEAM_ID = logical(0), TEAM_ABBREVIATION = logical(0), 
                      TEAM_NAME = logical(0), GAME_ID = logical(0), GAME_DATE = logical(0)), 
                 class = "data.frame", row.names = integer(0))

df <- bind_rows(df1, df2)
#> Error: Can't combine `..1$TEAM_ID` <character> and `..2$TEAM_ID` <logical>.
#> Backtrace:
#>     █
#>  1. ├─dplyr::bind_rows(df1, df2)
#>  2. │ └─vctrs::vec_rbind(!!!dots, .names_to = .id) /Users/romainfrancois/git/tidyverse/dplyr/R/bind.r:122:2
#>  3. └─vctrs::vec_default_ptype2(...)
#>  4.   └─vctrs::stop_incompatible_type(...)
#>  5.     └─vctrs:::stop_incompatible(...)
#>  6.       └─vctrs:::stop_vctrs(...)

Created on 2020-06-29 by the reprex package (v0.3.0.9001)

@romainfrancois
Member

One option, along the lines of #5366, could be to ignore data frames with 0 rows in bind_rows(). That would at least solve the "When you get no data..." part of the issue.

@romainfrancois
Member

btw @Fablepongiste part of making a reprex is simplifying the example so that it's easier for us to help you, e.g.

library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(x = c("a", "b"))
df2 <- data.frame(x = logical())

df <- bind_rows(df1, df2)
#> Error: Can't combine `..1$x` <character> and `..2$x` <logical>.
#> Backtrace:
#>     █
#>  1. ├─dplyr::bind_rows(df1, df2)
#>  2. │ └─vctrs::vec_rbind(!!!dots, .names_to = .id) /Users/romainfrancois/git/tidyverse/dplyr/R/bind.r:122:2
#>  3. └─vctrs::vec_default_ptype2(...)
#>  4.   └─vctrs::stop_incompatible_type(...)
#>  5.     └─vctrs:::stop_incompatible(...)
#>  6.       └─vctrs:::stop_vctrs(...)

Created on 2020-06-29 by the reprex package (v0.3.0.9001)

@romainfrancois
Member

Ignoring 0-row data frames in bind_rows() caused many problems, so I don't think this is a viable option.

@hadley
Member

hadley commented Jun 29, 2020

I wonder if we should treat a logical() as unspecified? But that would be a big change, and I have a vague feeling that we tried that and it had some other major negative consequence.
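
For context, a small sketch of the distinction as I understand current vctrs behaviour (an assumption on my part, not something stated elsewhere in this thread): a logical vector that contains only NA is already treated as unspecified, while a zero-length logical() keeps its logical type.

```r
library(vctrs)

# A logical vector containing only NA is treated as unspecified,
# so it combines with character without complaint.
with_na <- vec_c(NA, "a")

# A zero-length logical vector keeps its logical type, so combining it
# with character signals an incompatible-type error.
with_empty <- tryCatch(
  vec_c(logical(), "a"),
  error = function(e) e
)
```

So the proposal would roughly amount to extending the existing all-NA treatment to the zero-length case.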

In this case, if the root cause is reading a 0-row csv file, I think the solution is to fix the problem upstream by (e.g.) using col_types in readr::read_csv() to ensure that even empty data frames get the correct column types.
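
A minimal sketch of that upstream fix, assuming the input is a CSV file with a header and no rows. The file, the column name `x`, and the choice of `col_character()` are all made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)
library(readr)

# Hypothetical input: a CSV file that contains a header but no data rows.
path <- tempfile(fileext = ".csv")
writeLines("x", path)

# Declaring the column type up front keeps `x` as character even with 0 rows.
empty <- read_csv(path, col_types = cols(x = col_character()))
full <- tibble(x = c("a", "b"))

# Both inputs now agree on the type of `x`, so the bind succeeds.
combined <- bind_rows(full, empty)
```

Because the type is declared rather than guessed, the result does not depend on whether a given file happens to contain any rows.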

@Fablepongiste
Author

Sure, point taken about the example, @romainfrancois, sorry about that.

It seems to me there are other cases, not just reading a csv, where you can end up with empty data frames, and it is hard to always anticipate them. Scraping is a good example.

Similar cases do not fail in base R or in data.table, which is why I find this weird.

@hadley
Member

hadley commented Jun 29, 2020

We are stricter in dplyr/vctrs because we believe it's safer. Sure, it's a bit annoying here, but it protects you from accidents like this:

rbind(
  data.frame(x = 1),
  data.frame(x = "b")
)
#>   x
#> 1 1
#> 2 b

Created on 2020-06-29 by the reprex package (v0.3.0)

This is a deliberate design decision so I'm going to close this issue.
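
For comparison, a short sketch of what the same two inputs do with bind_rows() in dplyr 1.0+. The use of tryCatch() is only there so the snippet runs to completion.

```r
library(dplyr, warn.conflicts = FALSE)

# Same inputs as the rbind() call: a double column and a character column.
result <- tryCatch(
  bind_rows(data.frame(x = 1), data.frame(x = "b")),
  error = function(e) e
)

# bind_rows() signals an error instead of silently converting 1 to "1".
inherits(result, "error")
```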

hadley closed this as completed on Jun 29, 2020

@rdatasculptor

I am not trying to reopen this issue, since I completely understand the deliberate design. The thing is, this new behaviour of bind_rows() causes a lot of "Can't combine" errors in my code. Because it affects a huge part of my automated report scripts, I was wondering if there is an easy workaround (other than using rbind() or switching back to an earlier version of dplyr) until I have updated my scripts? Any ideas? Thanks in advance!

@lionel-
Member

lionel- commented Sep 4, 2020

Sorry, there is no easy workaround for allowing character coercions, so pinning dplyr to an older version seems the best way.

@rdatasculptor

Or maybe something like this is possible (and yes I know it's quite ugly and I haven't tried it yet)?

bind_rows_workaround <- function(df1, df2) {
  # Convert every logical column to character before binding
  df1 <- df1 %>% mutate(across(where(is.logical), as.character))
  df2 <- df2 %>% mutate(across(where(is.logical), as.character))
  bind_rows(df1, df2)
}

@lionel-
Member

lionel- commented Sep 4, 2020

oh yes that could be a good starting point to update your scripts.

@Ljupch0

Ljupch0 commented Jan 26, 2022

Or maybe something like this is possible (and yes I know it's quite ugly and I haven't tried it yet)?

bind_rows_workaround <- function(df1, df2) {
  df1 <- df1 %>% mutate(across(where(is.logical), as.character))
  df2 <- df2 %>% mutate(across(where(is.logical), as.character))
  bind_rows(df1, df2)
}

This works as a workaround but it would also convert columns that deserve to be logical. The issue is getting slapped on the wrist when a column becomes logical when it's all NA. Converting an all NA logical column to any other type by default should not count as a type conversion. I think a new column type is needed for these cases, something like the unspecified() @hadley mentioned.
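
One way to narrow the workaround, as a sketch only: retype logical columns only when the data frame has 0 rows, copying the type from the data frame that has data. `match_empty_types()` is a hypothetical helper, not part of dplyr, and it assumes that the non-empty data frame carries the types you want.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of dplyr): for a 0-row data frame, replace each
# logical column with a 0-length vector of the type used in `template`.
match_empty_types <- function(empty, template) {
  if (nrow(empty) > 0) {
    return(empty)
  }
  shared <- intersect(names(empty), names(template))
  for (col in shared) {
    if (is.logical(empty[[col]])) {
      empty[[col]] <- template[[col]][0]
    }
  }
  empty
}

df1 <- data.frame(x = c("a", "b"), flag = c(TRUE, FALSE))
df2 <- data.frame(x = logical(), flag = logical())

# `x` stays character and `flag` stays logical in the combined result.
combined <- bind_rows(df1, match_empty_types(df2, df1))
```

Data frames that contain rows are returned untouched, so genuinely logical columns keep their type.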

@klin333

klin333 commented Aug 2, 2022

perhaps a workaround for now

# dplyr 1.0+ refuses to bind character columns with empty logical columns.
# The workaround is to remove 0-row tibbles from the bind_rows() call.
# Only works for data frames and lists of data frames (not a list that could itself be a data frame).
bind_rows_legacy <- function(..., .id = NULL) {
  fallback <- tibble() # best efforts fall back column spec
  args <- list(...)
  processed <- list()
  for(item in args) { # can't use purrr::map because of side effects on fallback
    if (is.data.frame(item)) {
      if (nrow(item) == 0) {
        fallback <- bind_cols(
          fallback, 
          item %>% select(-any_of(colnames(fallback)))
        )
        item <- tibble()
      } 
    } else if (is.list(item)) {
      item <- do.call(bind_rows_legacy, c(item, list(.id = .id)))
      if (nrow(item) == 0) {
        fallback <- bind_cols(
          fallback, 
          item %>% select(-any_of(colnames(fallback)))
        )
        item <- tibble()
      } 
    } else {
      stop("unsupported")
    }
    processed <- c(processed, list(item))
  }

  binded <- do.call(dplyr::bind_rows, c(processed, list(.id = .id)))
  
  binded <- bind_rows(
    binded,
    fallback %>% select(-any_of(colnames(binded)))
  )
  
  binded
}
> bind_rows_legacy(tibble(x = logical(0)), tibble(x = 'b'))
# A tibble: 1 x 1
  x    
  <chr>
1 b    

> bind_rows_legacy(list(tibble(x = 'a'), tibble(x = 'b')), tibble(x = logical(0)), tibble(x = 'c', y = 1))
# A tibble: 3 x 2
  x         y
  <chr> <dbl>
1 a        NA
2 b        NA
3 c         1

> bind_rows_legacy(list(tibble(x = 'a'), tibble(x = 'b')), tibble(x = logical(0)), tibble(x = 'c', y = TRUE))
# A tibble: 3 x 2
  x     y    
  <chr> <lgl>
1 a     NA   
2 b     NA   
3 c     TRUE 

> bind_rows_legacy(tibble(x = logical(0)))
# A tibble: 0 x 1
# ... with 1 variable: x <lgl>
# i Use `colnames()` to see all variable names

> bind_rows_legacy(tibble(x = logical(0), y = character(0)), tibble(x = 'b'))
# A tibble: 1 x 2
  x     y    
  <chr> <chr>
1 b     NA   
