Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group_by .drop=TRUE on arrow table is leading to an error #5763

Closed
jangorecki opened this issue Feb 14, 2021 · 1 comment
Closed

group_by .drop=TRUE on arrow table is leading to an error #5763

jangorecki opened this issue Feb 14, 2021 · 1 comment

Comments

@jangorecki
Copy link

group_by(..., .drop=TRUE) on arrow table is leading to an error

library(arrow)
library(dplyr)
d = Table$create(name = rownames(mtcars), mtcars)
collect(group_by(d, gear, .drop=TRUE))
#Error in Table__SelectColumns(self, indices) : 
#  Invalid: Invalid column index -2147483648 to select columns.
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.4 arrow_3.0.0

loaded via a namespace (and not attached):
 [1] fansi_0.4.2      utf8_1.1.4       crayon_1.4.1     assertthat_0.2.1
 [5] R6_2.5.0         lifecycle_0.2.0  magrittr_2.0.1   pillar_1.4.7    
 [9] cli_2.3.0        rlang_0.4.10     vctrs_0.3.6      generics_0.1.0  
[13] ellipsis_0.3.1   tools_4.0.3      bit64_4.0.5      glue_1.4.2      
[17] purrr_0.3.4      bit_4.0.4        compiler_4.0.3   pkgconfig_2.0.3 
[21] tidyselect_1.1.0 tibble_3.0.6    
@romainfrancois
Copy link
Member

This feels like an arrow issue. Can you file an issue on the jira tracker here please ? https://issues.apache.org/jira/projects/ARROW/issues

cc @nealrichardson.

nealrichardson added a commit to apache/arrow that referenced this issue Feb 17, 2021
tidyverse/dplyr#5763 concisely exposes multiple issues:

* We don't expect the .drop argument to group_by (fixed here)
* You can apparently provide expressions to `...` in group_by, which effectively do `mutate()` to add the columns and then group by them (detected here with a useful error; implementation of the feature deferred to ARROW-11658)
* Our input validation in `[.ArrowTabular` and the Table/RecordBatch `SelectColumns` methods was incomplete, and where it was present was not very helpful (fixed here)

Other issues observed here and deferred:

* Table has a proper SelectColumns method in the C++ library but the RecordBatch one is in the R library and should be pushed down to C++ (ARROW-11660)
* The .drop argument is not actually implemented here, it is only caught if specified, and if the value given is other than the default, it errors. We should keep it around like we do the group_vars themselves (ARROW-11658), and we'll need to implement it too in the C++ query engine eventually.

Closes #9510 from nealrichardson/group-by-drop

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this issue Jun 7, 2021
tidyverse/dplyr#5763 concisely exposes multiple issues:

* We don't expect the .drop argument to group_by (fixed here)
* You can apparently provide expressions to `...` in group_by, which effectively do `mutate()` to add the columns and then group by them (detected here with a useful error; implementation of the feature deferred to ARROW-11658)
* Our input validation in `[.ArrowTabular` and the Table/RecordBatch `SelectColumns` methods was incomplete, and where it was present was not very helpful (fixed here)

Other issues observed here and deferred:

* Table has a proper SelectColumns method in the C++ library but the RecordBatch one is in the R library and should be pushed down to C++ (ARROW-11660)
* The .drop argument is not actually implemented here, it is only caught if specified, and if the value given is other than the default, it errors. We should keep it around like we do the group_vars themselves (ARROW-11658), and we'll need to implement it too in the C++ query engine eventually.

Closes apache#9510 from nealrichardson/group-by-drop

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this issue Jun 13, 2021
tidyverse/dplyr#5763 concisely exposes multiple issues:

* We don't expect the .drop argument to group_by (fixed here)
* You can apparently provide expressions to `...` in group_by, which effectively do `mutate()` to add the columns and then group by them (detected here with a useful error; implementation of the feature deferred to ARROW-11658)
* Our input validation in `[.ArrowTabular` and the Table/RecordBatch `SelectColumns` methods was incomplete, and where it was present was not very helpful (fixed here)

Other issues observed here and deferred:

* Table has a proper SelectColumns method in the C++ library but the RecordBatch one is in the R library and should be pushed down to C++ (ARROW-11660)
* The .drop argument is not actually implemented here, it is only caught if specified, and if the value given is other than the default, it errors. We should keep it around like we do the group_vars themselves (ARROW-11658), and we'll need to implement it too in the C++ query engine eventually.

Closes apache#9510 from nealrichardson/group-by-drop

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants