Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functionality to slice_* and order_by multiple columns? #6176

Closed
admivsn opened this issue Feb 4, 2022 · 6 comments · Fixed by #6384
Closed

Functionality to slice_* and order_by multiple columns? #6176

admivsn opened this issue Feb 4, 2022 · 6 comments · Fixed by #6384
Labels
feature a feature request or enhancement rows ↕️ Operations on rows: filter(), slice(), arrange()

Comments

@admivsn
Copy link

admivsn commented Feb 4, 2022

I would like to be able to order_by multiple columns when using either slice_min() or slice_max().

This would hopefully feed into dbplyr and this feature request tidyverse/dbplyr#765

library(dplyr, warn.conflicts = FALSE)

# Create dummy table
(df <- tibble(
 col_1 = c(1, 2, 1, 2, 1, 2, 1, 2),
 col_2 = c(2, 1, 0, 3, 2, 1, 0, 1)
))
#> # A tibble: 8 × 2
#>   col_1 col_2
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     1
#> 3     1     0
#> 4     2     3
#> 5     1     2
#> 6     2     1
#> 7     1     0
#> 8     2     1

df %>%
  slice_max(order_by = c(col_1, -col_2))
#> # A tibble: 4 × 2
#>   col_1 col_2
#>   <dbl> <dbl>
#> 1     2     1
#> 2     2     3
#> 3     2     1
#> 4     2     1

df %>%
  slice_max(order_by = c(col_1, desc(col_2)))
#> # A tibble: 4 × 2
#>   col_1 col_2
#>   <dbl> <dbl>
#> 1     2     1
#> 2     2     3
#> 3     2     1
#> 4     2     1
@DavisVaughan
Copy link
Member

We get a different error with the dev version.

library(dplyr, warn.conflicts = FALSE)

# Create dummy table
(df <- tibble(
  col_1 = c(1, 2, 1, 2, 1, 2, 1, 2),
  col_2 = c(2, 1, 0, 3, 2, 1, 0, 1)
))
#> # A tibble: 8 × 2
#>   col_1 col_2
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     1
#> 3     1     0
#> 4     2     3
#> 5     1     2
#> 6     2     1
#> 7     1     0
#> 8     2     1

df %>%
  slice_max(order_by = c(col_1, col_2))
#> Error in `slice_max()`:
#> ! Problem while computing indices.
#> Caused by error:
#> ! `order_by` must have size 8, not size 16.

df %>%
  slice_max(order_by = c(col_1, -col_2))
#> Error in `slice_max()`:
#> ! Problem while computing indices.
#> Caused by error:
#> ! `order_by` must have size 8, not size 16.

This happens because c(col_1, col_2) isn't interpreted like tidyselect, it literally combines col_1 and col_2 together to get a vector of size 16. @romainfrancois I'm not really sure how order_by is supposed to work here.

@romainfrancois
Copy link
Member

it is supposed to be one vector, it gets expanded in:

 slice(.data, local({
    order_by <- {{ order_by }}
    n <- dplyr::n()
    order_by <- fix_call(vec_assert(order_by, size = n, arg = "order_by"), NULL)
    idx(order_by, n)
  }))

I tried df %>% slice_max(order_by = tibble(col_1, col_2)) but it does not play well with how idx is implemented, in particular desc :

  if (with_ties) {
    idx <- function(x, n) head(
        order(x, decreasing = TRUE), smaller_ranks(desc(x), size(n))
    )
  } else {
    idx <- function(x, n) head(order(x, decreasing = TRUE), size(n))
  }

which gives:

library(dplyr, warn.conflicts = FALSE)

# Create dummy table
(df <- tibble(
  col_1 = c(1, 2, 1, 2, 1, 2, 1, 2),
  col_2 = c(2, 1, 0, 3, 2, 1, 0, 1)
))
#> # A tibble: 8 × 2
#>   col_1 col_2
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     1
#> 3     1     0
#> 4     2     3
#> 5     1     2
#> 6     2     1
#> 7     1     0
#> 8     2     1

df %>%
  slice_max(order_by = tibble(col_1, col_2))
#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames

#> Warning in xtfrm.data.frame(x): cannot xtfrm data frames
#> # A tibble: 0 × 2
#> # … with 2 variables: col_1 <dbl>, col_2 <dbl>

Created on 2022-02-07 by the reprex package (v2.0.1)

@DavisVaughan
Copy link
Member

DavisVaughan commented Feb 7, 2022

You might could swap that out order() for vec_order(), if it really is supposed to be working this way

@hadley
Copy link
Member

hadley commented Apr 15, 2022

The current constraint is the order_by must be a single variable. We could make it a tidyselect param here, but that become inconsistent with lead() etc.

@hadley hadley added feature a feature request or enhancement rows ↕️ Operations on rows: filter(), slice(), arrange() labels Apr 15, 2022
@hadley
Copy link
Member

hadley commented Jul 21, 2022

Yeah, I think we could use vec_order() here so that, if desired, you could supply a data frame/tibble of multiple variables.

@DavisVaughan
Copy link
Member

DavisVaughan commented Jul 21, 2022

We'd also have to update min_rank() and friends to use vec_rank() (not exported yet) so smaller_ranks() (used by slice_min/max()) would work

hadley added a commit that referenced this issue Aug 3, 2022
hadley added a commit that referenced this issue Aug 9, 2022
* Refactor slice tests
  * Remove tests of invariants now handled by vctrs
  * Reduce duplication by testing central code paths once
  * Update error testing style

* Use `vec_rank()`. Fixes #6176. 

* Implement na.rm for slice_min()/slice_max(). Fixes #6177.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement rows ↕️ Operations on rows: filter(), slice(), arrange()
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants