Problems with dplyr bind_rows() and tidyr pivot_longer() in combination with vctrs after upgrade. #999

Closed
BenPVD opened this issue Apr 8, 2020 · 6 comments · Fixed by #1035
Labels: bug (an unexpected problem or unintended behavior), type:dataframe

Comments

@BenPVD BenPVD commented Apr 8, 2020

Hi,

After reading the blog post @hadley published earlier this week, I was tempted and upgraded to the development version of dplyr. Unfortunately, this broke my existing script in two places (so far): once at a dplyr::bind_rows() call and once at a tidyr::pivot_longer() call.

During the upgrade of dplyr to the development version, vctrs was updated as well.

I will not post my whole script here, since that would be too much. Instead, I saved my environment just before the error and run only the failing function in the code section below. I did not change anything in the script itself! The only things I changed between executions are the versions of dplyr and vctrs, and downgrading both (installing them from CRAN) restored my script.

load(file = "~/troubleshoot.RData")
DT_count_tab = dplyr::bind_rows(DT_count_tab, .id = "sample_id")
#> Error: Can't convert <data.table<
#>   percentage_reads      : double
#>   total_number_reads    : integer
#>   Vsegm                 : character
#>   Jsegm                 : character
#>   frame_info            : character
#>   frame_shift           : integer
#>   ORF_info              : character
#>   aa1_peptide_seq_insert: character
#>   aa2_peptide_seq_total : character
#>   aa2_dna_seq           : character
#> >> to <data.table<
#>   sample_id             : character
#>   percentage_reads      : double
#>   total_number_reads    : integer
#>   Vsegm                 : character
#>   Jsegm                 : character
#>   frame_info            : character
#>   frame_shift           : integer
#>   ORF_info              : character
#>   aa1_peptide_seq_insert: character
#>   aa2_peptide_seq_total : character
#>   aa2_dna_seq           : character
#> >>.
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4        knitr_1.28        magrittr_1.5      tidyselect_1.0.0 
#>  [5] R6_2.4.1          rlang_0.4.5.9000  fansi_0.4.1       stringr_1.4.0    
#>  [9] highr_0.8         dplyr_0.8.99.9002 tools_3.6.0       xfun_0.12        
#> [13] cli_2.0.2         htmltools_0.4.0   ellipsis_0.3.0    yaml_2.2.1       
#> [17] digest_0.6.25     assertthat_0.2.1  tibble_3.0.0      lifecycle_0.2.0  
#> [21] crayon_1.3.4      purrr_0.3.3       vctrs_0.2.99.9011 glue_1.4.0       
#> [25] evaluate_0.14     rmarkdown_2.1     stringi_1.4.6     compiler_3.6.0   
#> [29] pillar_1.4.3      generics_0.0.2    pkgconfig_2.0.3

Created on 2020-04-08 by the reprex package (v0.3.0)

A downgrade of dplyr alone (not vctrs) resolved the issue with the dplyr::bind_rows() function, as seen below.

load(file = "~/troubleshoot.RData")
DT_count_tab = dplyr::bind_rows(DT_count_tab, .id = "sample_id")
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4        knitr_1.28        magrittr_1.5      tidyselect_1.0.0 
#>  [5] R6_2.4.1          rlang_0.4.5.9000  fansi_0.4.1       stringr_1.4.0    
#>  [9] highr_0.8         dplyr_0.8.5       tools_3.6.0       xfun_0.12        
#> [13] cli_2.0.2         htmltools_0.4.0   ellipsis_0.3.0    yaml_2.2.1       
#> [17] assertthat_0.2.1  digest_0.6.25     tibble_3.0.0      lifecycle_0.2.0  
#> [21] crayon_1.3.4      purrr_0.3.3       vctrs_0.2.99.9011 glue_1.4.0       
#> [25] evaluate_0.14     rmarkdown_2.1     stringi_1.4.6     compiler_3.6.0   
#> [29] pillar_1.4.3      pkgconfig_2.0.3

Created on 2020-04-08 by the reprex package (v0.3.0)

The tidyr::pivot_longer() error was, of course, not resolved by downgrading dplyr. Here, a downgrade of vctrs resolved the issue, as the two reprex outputs below show. The first uses vctrs version 0.2.99.9011 and results in an error; the second was run after downgrading vctrs and did not produce any errors.

load(file = "~/troubleshoot.RData")
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
new_obj = tmp_DT_count_tab_summary %>%
  dplyr::select(patient_id, sample_type, study_week, min_clone_count, max_clone_count) %>%
  tidyr::pivot_longer(data = ., cols = c("min_clone_count", "max_clone_count"), names_to = "range", values_to = "count")
#> Error: Can't combine `..1` <data.table<>> and `..2` <tbl_df<>>.
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyr_1.0.2.9000 dplyr_0.8.5     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4        knitr_1.28        magrittr_1.5      tidyselect_1.0.0 
#>  [5] R6_2.4.1          rlang_0.4.5.9000  fansi_0.4.1       stringr_1.4.0    
#>  [9] highr_0.8         tools_3.6.0       xfun_0.12         cli_2.0.2        
#> [13] htmltools_0.4.0   ellipsis_0.3.0    yaml_2.2.1        assertthat_0.2.1 
#> [17] digest_0.6.25     tibble_3.0.0      lifecycle_0.2.0   crayon_1.3.4     
#> [21] purrr_0.3.3       vctrs_0.2.99.9011 glue_1.4.0        evaluate_0.14    
#> [25] rmarkdown_2.1     stringi_1.4.6     compiler_3.6.0    pillar_1.4.3     
#> [29] pkgconfig_2.0.3

Created on 2020-04-08 by the reprex package (v0.3.0)

The traceback for this error is:

> traceback()
25: stop(fallback)
24: signal_abort(cnd)
23: abort(message, class = c(class, "vctrs_error"), ...)
22: stop_vctrs(message, class = c(class, "vctrs_error_incompatible"), 
        x = x, y = y, details = details, ...)
21: stop_incompatible(x, y, x_arg = x_arg, y_arg = y_arg, details = details, 
        ..., message = message, class = c(class, "vctrs_error_incompatible_type"))
20: stop_incompatible_type_impl(x = x, y = y, x_arg = x_arg, y_arg = y_arg, 
        details = details, action = "combine", ..., message = message, 
        class = class)
19: stop_incompatible_type_combine(x = x, y = y, x_arg = x_arg, y_arg = y_arg, 
        details = details, ..., message = message, class = class)
18: stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
17: vec_default_ptype2(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
16: vec_cbind(vec_slice(df_out, rows$df_id), vec_slice(keys, rows$key_id), 
        vec_slice(vals, rows$val_id), .name_repair = names_repair)
15: doTryCatch(return(expr), name, parentenv, handler)
14: tryCatchOne(expr, names, parentenv, handlers[[1L]])
13: tryCatchList(expr, classes, parentenv, handlers)
12: tryCatch(code, vctrs_error_names = function(cnd) {
        abort(c("Failed to create output due to bad names.", "Choose another strategy with `names_repair`"), 
            parent = cnd)
    })
11: wrap_error_names(vec_cbind(vec_slice(df_out, rows$df_id), vec_slice(keys, 
        rows$key_id), vec_slice(vals, rows$val_id), .name_repair = names_repair))
10: pivot_longer_spec(data, spec, names_repair = names_repair, values_drop_na = values_drop_na, 
        values_ptypes = values_ptypes)
9: tidyr::pivot_longer(data = ., cols = c("min_clone_count", "max_clone_count"), 
       names_to = "range", values_to = "count")
8: function_list[[k]](value)
7: withVisible(function_list[[k]](value))
6: freduce(value, `_function_list`)
5: `_fseq`(`_lhs`)
4: eval(quote(`_fseq`(`_lhs`)), env, env)
3: eval(quote(`_fseq`(`_lhs`)), env, env)
2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1: tmp_DT_count_tab_summary %>% dplyr::select(patient_id, sample_type, 
       study_week, min_clone_count, max_clone_count) %>% tidyr::pivot_longer(data = ., 
       cols = c("min_clone_count", "max_clone_count"), names_to = "range", 
       values_to = "count")

The same code runs without errors with vctrs version 0.2.4:

load(file = "~/troubleshoot.RData")
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
new_obj = tmp_DT_count_tab_summary %>%
  dplyr::select(patient_id, sample_type, study_week, min_clone_count, max_clone_count) %>%
  tidyr::pivot_longer(data = ., cols = c("min_clone_count", "max_clone_count"), names_to = "range", values_to = "count")
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyr_1.0.2.9000 dplyr_0.8.5     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4       knitr_1.28       magrittr_1.5     tidyselect_1.0.0
#>  [5] R6_2.4.1         rlang_0.4.5.9000 fansi_0.4.1      stringr_1.4.0   
#>  [9] highr_0.8        tools_3.6.0      xfun_0.12        cli_2.0.2       
#> [13] htmltools_0.4.0  ellipsis_0.3.0   yaml_2.2.1       assertthat_0.2.1
#> [17] digest_0.6.25    tibble_3.0.0     lifecycle_0.2.0  crayon_1.3.4    
#> [21] purrr_0.3.3      vctrs_0.2.4      glue_1.4.0       evaluate_0.14   
#> [25] rmarkdown_2.1    stringi_1.4.6    compiler_3.6.0   pillar_1.4.3    
#> [29] pkgconfig_2.0.3

Created on 2020-04-08 by the reprex package (v0.3.0)

I hope all of this is clearly explained. I know this might not be the best way to submit an issue, but I wanted to put this out in case it has not been reported yet (I saw some discussion about vctrs and row binding...)

Thanks much for the support!
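
For anyone hitting the same pivot_longer() error, one possible stopgap is to convert the data.table to a tibble before reshaping, so that only one data frame class is involved when the pieces are column-bound. This is a sketch under assumptions, not a fix confirmed in this thread: `dt` is a made-up stand-in for tmp_DT_count_tab_summary, reusing some of the column names from the report with invented values.

```r
library(data.table)
library(tidyr)

# Hypothetical stand-in for tmp_DT_count_tab_summary (values invented).
dt <- data.table(
  patient_id = c("p1", "p2"),
  min_clone_count = c(1L, 2L),
  max_clone_count = c(10L, 20L)
)

# Convert to a tibble first so pivot_longer() never sees the data.table class.
long <- pivot_longer(
  tibble::as_tibble(dt),
  cols = c("min_clone_count", "max_clone_count"),
  names_to = "range",
  values_to = "count"
)
```

The result is a tibble, so any later data.table-specific syntax would need a conversion back with data.table::as.data.table().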

Member

@DavisVaughan DavisVaughan commented Apr 8, 2020

Looks like this is related to #694

Member

@lionel- lionel- commented Apr 8, 2020

And #981.

Author

@BenPVD BenPVD commented Apr 8, 2020

> And #981.

Just as a clarification: I looked briefly at #694 and #981 and saw the comments about differing data types (e.g. data frame and data table). While this is true for the tidyr::pivot_longer() call shown above, it is not true for the dplyr::bind_rows() call. The list supplied to dplyr::bind_rows() (the DT_count_tab object in my first code block) does not contain different data types. It is a list of data tables imported using fread().

Again... just a quick clarification.
Thanks again for the support!
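
A quick way to check whether the elements of such a list really agree in class and in column set is sketched below in base R. The list `tables` is a hypothetical stand-in for DT_count_tab with invented contents; plain data frames are used so the snippet runs without extra packages.

```r
# Hypothetical stand-in for DT_count_tab: a named list of data frames.
tables <- list(
  s1 = data.frame(x = 1, y = 2),
  s2 = data.frame(x = 3, y = 4, z = 5)
)

# Class vector of each element, collapsed into one string per element.
classes <- vapply(tables, function(x) paste(class(x), collapse = "/"), character(1))

# Columns that are not shared by every element of the list.
all_cols <- unique(unlist(lapply(tables, names)))
common_cols <- Reduce(intersect, lapply(tables, names))
mismatched <- setdiff(all_cols, common_cols)
```

If `classes` holds a single distinct value and `mismatched` is empty, the elements agree in class and column names; this check does not compare the column types themselves.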

Member

@DavisVaughan DavisVaughan commented Apr 8, 2020

vctrs considers them different types because their columns are not identical: there is a sample_id column in one but not the other.

library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)

packageVersion("dplyr")
#> [1] '0.8.99.9002'

df1 <- data.table(x = 1, y = 2)
df2 <- data.table(x = 3, y = 4)

bind_rows(df1, df2)
#>    x y
#> 1: 1 2
#> 2: 3 4

df3 <- data.table(x = 3, y = 4, z = 5)

bind_rows(df1, df3)
#> Error: Can't combine `..1` <data.table<
#>   x: double
#>   y: double
#> >> and `..2` <data.table<
#>   x: double
#>   y: double
#>   z: double
#> >>.

Created on 2020-04-08 by the reprex package (v0.3.0)

Author

@BenPVD BenPVD commented Apr 8, 2020

Understood!
The sample_id column is created by dplyr::bind_rows() itself through the .id = "sample_id" argument; it is not part of the data tables within the list. I did not check whether removing the .id = "sample_id" argument would solve the problem, but I now assume it would.
Thanks for the clarification.
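
Since the list holds only data.tables, one alternative that sidesteps dplyr and vctrs entirely is data.table's own rbindlist(), whose idcol argument plays a role similar to .id in bind_rows(). This is only a sketch of that alternative, not something tested against the original data; `tables` is a hypothetical stand-in for DT_count_tab.

```r
library(data.table)

# Hypothetical stand-in for DT_count_tab: a named list of data.tables.
tables <- list(
  s1 = data.table(x = 1, y = 2),
  s2 = data.table(x = 3, y = 4)
)

# idcol adds a leading column holding the list names,
# similar to .id = "sample_id" in dplyr::bind_rows().
combined <- rbindlist(tables, idcol = "sample_id")
```

The result stays a data.table, so no class conversion is involved.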

@geotheory geotheory commented Apr 18, 2020

I'm also getting Error: Can't convert <spec_tbl_df< for a map_df(d, read_csv, .id = 'set'), where d is a vector of two file path strings. Both files read in fine individually with read_csv and have the same column data types. The class of each read object is reported as "spec_tbl_df" "tbl_df" "tbl" "data.frame". Feels like maybe the same root issue?
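
One way to test that hypothesis would be to strip the "spec_tbl_df" subclass before binding and see whether the error goes away. The base R sketch below only shows the class manipulation; `df` is a hand-built stand-in carrying the class vector quoted above, and `drop_spec()` is a hypothetical helper, not a readr or vctrs function.

```r
# Hypothetical stand-in for an object returned by read_csv(): a data frame
# carrying the class vector quoted above (values invented).
df <- structure(
  data.frame(a = 1:2),
  class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame")
)

# Drop only the "spec_tbl_df" subclass; the remaining classes are untouched.
drop_spec <- function(x) {
  class(x) <- setdiff(class(x), "spec_tbl_df")
  x
}

plain <- drop_spec(df)
```

It could then be applied inside the map, e.g. map_df(d, function(f) drop_spec(read_csv(f)), .id = 'set'); whether that avoids the error is untested here.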

@lionel- lionel- modified the milestones: 0.3.0, 0.3.0-revdep-mails Apr 20, 2020
@lionel- lionel- added the bug (an unexpected problem or unintended behavior) and type:dataframe labels Apr 24, 2020
lionel- added a commit to lionel-/vctrs that referenced this issue Apr 24, 2020
lionel- added a commit to lionel-/vctrs that referenced this issue Apr 24, 2020