Cannot combine expressions resulting in positive & negative indices within c() #130

Closed
smingerson opened this issue Oct 23, 2019 · 8 comments

@smingerson smingerson commented Oct 23, 2019

Sorry if I should have filed this in tidyr repo, it was unclear to me which was more appropriate.

I'll be using the direct tidyselect call and splicing in the result to avoid the error, but this didn't seem right to me. ?tidyr::pivot_longer says cols is a "tidyselect specification". I took that to mean anything which works in vars_select(...) would work in c(...).

library(tidyselect)
library(tidyr)
tb <-
  tibble(
    Number_A = -2:2,
    Number_B = rnorm(5),
    Number_C = runif(5),
    Number_Participants = 1:5
  )
# Works as expected.
tidyselect::vars_select(names(tb), matches("^Number"),-ends_with("A"))
#>              Number_B              Number_C   Number_Participants 
#>            "Number_B"            "Number_C" "Number_Participants"

pivot_longer(tb, c(matches("^Number"),-ends_with("A")))
#> Error in inds_combine(.vars, ind_list): Each argument must yield either positive or negative integers
# `c()` works with two positive selections
pivot_longer(tb, c(ends_with("B"), ends_with("C")))
#> # A tibble: 10 x 4
#>    Number_A Number_Participants name      value
#>       <int>               <int> <chr>     <dbl>
#>  1       -2                   1 Number_B  1.17 
#>  2       -2                   1 Number_C  0.999
#>  3       -1                   2 Number_B -0.145
#>  4       -1                   2 Number_C  0.854
#>  5        0                   3 Number_B -1.24 
#>  6        0                   3 Number_C  0.627
#>  7        1                   4 Number_B -1.49 
#>  8        1                   4 Number_C  0.245
#>  9        2                   5 Number_B  0.203
#> 10        2                   5 Number_C  0.491

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyr_1.0.0      tidyselect_0.2.5
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1       knitr_1.24       magrittr_1.5     R6_2.4.0        
#>  [5] rlang_0.4.0      fansi_0.4.0      stringr_1.4.0    highr_0.8       
#>  [9] dplyr_0.8.3      tools_3.6.1      xfun_0.8         utf8_1.1.4      
#> [13] cli_1.1.0        htmltools_0.3.6  assertthat_0.2.1 yaml_2.2.0      
#> [17] digest_0.6.20    lifecycle_0.1.0  tibble_2.1.3     crayon_1.3.4    
#> [21] purrr_0.3.2      vctrs_0.2.0      zeallot_0.1.0    glue_1.3.1      
#> [25] evaluate_0.14    rmarkdown_1.13   stringi_1.4.3    compiler_3.6.1  
#> [29] pillar_1.4.2     backports_1.1.4  pkgconfig_2.0.2
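
For reference, a minimal sketch of the workaround mentioned at the top of this report: resolve the selection with vars_select() first, then inject the resulting column names into pivot_longer(). The names `cols` and `long` are only illustrative, and the use of `!!` to pass the character vector is an assumption about how the splicing was meant to be done, not something confirmed in this thread.

```r
library(tidyselect)
library(tidyr)

tb <- tibble(
  Number_A = -2:2,
  Number_B = rnorm(5),
  Number_C = runif(5),
  Number_Participants = 1:5
)

# Resolve the mixed positive/negative selection up front, where it is known to work.
# unname() drops the names so that only plain column names are passed on.
cols <- unname(vars_select(names(tb), matches("^Number"), -ends_with("A")))

# Inject the resolved names; `cols` and `long` are hypothetical helper names.
long <- pivot_longer(tb, !!cols)
```

This sidesteps the c() error because pivot_longer() only ever sees a plain character vector of column names.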
Member

@lionel- lionel- commented Oct 23, 2019

We were just discussing this with @hadley :)

We'll fix this. Also the next version will feature new syntax. It's still being discussed, but we're thinking about enabling binary - for set difference, this way you don't need c():

matches("^Number") - ends_with("A")

Author

@smingerson smingerson commented Oct 23, 2019

I like that solution. I've always struggled with combinations of selectors, the new options of |, &, and ! are welcome.

Member

@lionel- lionel- commented Oct 24, 2019

I wonder which of binary - or & ! is more obvious:

data %>% select(matches("^Number") - ends_with("A"))

data %>% select(matches("^Number") & !ends_with("A"))

It is a bit strange to add support for a new set operator in addition to the boolean operators, but set difference might still be more intuitive as it's a single operation.

The two boolean operations translate easily to English though:

Matches "^Number" and doesn't end with "A".
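
A small sketch comparing the boolean form with the c(foo, -bar) form. It assumes a tidyselect release in which the boolean operators discussed here are available and c() accepts mixed positive and negative selections (1.0.0 or later, to my knowledge); `dat`, `kept_bool` and `kept_c` are made-up names for illustration.

```r
library(dplyr)

# Hypothetical data; only the column names matter here.
dat <- tibble(Number_A = 1, Number_B = 2, Number_C = 3, Other = 4)

# Boolean form: matches "^Number" and doesn't end with "A".
kept_bool <- dat %>% select(matches("^Number") & !ends_with("A"))

# c() form: the negative selection acts as a set difference.
kept_c <- dat %>% select(c(matches("^Number"), -ends_with("A")))

names(kept_bool)
names(kept_c)
```

Under those assumptions both calls should keep the same columns, which is the equivalence the discussion relies on.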


@jeffreypullin jeffreypullin commented Oct 24, 2019

FWIW, personally I would strongly prefer the second option for two reasons:

  1. - is already used for single column removals and I believe single columns will not be treated as a single-element set.
  2. - isn't a universally accepted symbol for set difference, some (most?) use \

Member

@lionel- lionel- commented Oct 24, 2019

Unary - is actually syntax for set difference already, so binary - would only catch up.

Member

@lionel- lionel- commented Oct 24, 2019

e.g. these would be completely equivalent:

select(foo, -bar)
select(foo - bar)

Member

@lionel- lionel- commented Oct 24, 2019

But I think you're right, I'm leaning towards keeping things simple. If we solve c(foo, -bar) we have full syntax for setdiff, and we probably want to encourage foo & !bar instead for consistency of the overall DSL so no need to add more syntax sugar.

Author

@smingerson smingerson commented Oct 24, 2019

A couple of thoughts, I have no strong preference either way.

  1. I've seen - used for set difference in mathematics and statistics textbooks. No, they do not come immediately to mind. Wolfram mentions it http://mathworld.wolfram.com/SetDifference.html
  2. If binary - is implemented and you accidentally drop a comma between a positive and a negative selection, you will still get the expected result. In a large select statement, I think the missing comma would be easy to overlook. This would also be solved by a better error message. If binary - isn't implemented, this is what happens:
library(dplyr)
tib <- tibble(a = 1, b = 2, c = 3, f = 4)

select(tib, a - b)
# Error in .f(.x[[i]], ...) : object 'a' not found
  3. I think from a non-programmer perspective, data %>% select(matches("^Number") - ends_with("A")) is easier to understand than data %>% select(matches("^Number") & !ends_with("A")). They both have straightforward English translations, if you're comfortable with & and ! as operators. Whether my perception is right is questionable, and as you mentioned there's still c(foo, -bar)
