Implement vec_identify_runs() and vec_rle() #1081

Closed
DavisVaughan opened this issue May 5, 2020 · 2 comments · Fixed by #1262
Labels
feature a feature request or enhancement
Milestone

Comments

@DavisVaughan
Member

DavisVaughan commented May 5, 2020

Inspired by the adjacency grouping idea in tidyverse/dplyr#5184

We could do much better than this in terms of performance with a C implementation. vec_group_rle() does a lot more work than is necessary here because it uses a dictionary to keep track of what it has already seen.

I've added this to the vec-prefixes Google Sheet.

library(vctrs)
library(dplyr, warn.conflicts = FALSE)

vec_runs <- function(x) {
  rle <- vec_group_rle(x)
  lengths <- field(rle, "length")
  rep(seq_along(lengths), times = lengths)
}

mtcars <- as_tibble(mtcars)

mtcars %>%
  select(vs, am) %>%
  mutate(runs = vec_runs(across()))
#> # A tibble: 32 x 3
#>       vs    am  runs
#>    <dbl> <dbl> <int>
#>  1     0     1     1
#>  2     0     1     1
#>  3     1     1     2
#>  4     1     0     3
#>  5     0     0     4
#>  6     1     0     5
#>  7     0     0     6
#>  8     1     0     7
#>  9     1     0     7
#> 10     1     0     7
#> # … with 22 more rows

Created on 2020-05-05 by the reprex package (v0.3.0)
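
To illustrate the dictionary-free idea, here is a rough R-level sketch that assigns run ids by comparing each element only with its predecessor. The helper name `vec_runs_adjacent()` is hypothetical and not part of vctrs. It only shows the logic and is not the proposed C implementation, so it will not have the performance described above.

```r
library(vctrs)

# Hypothetical helper, not part of vctrs: give each run of consecutive
# equal values an integer id by comparing every element with the one
# directly before it. No dictionary of previously seen values is needed.
vec_runs_adjacent <- function(x) {
  n <- vec_size(x)
  if (n == 0L) {
    return(integer())
  }

  idx <- seq_len(n - 1L)

  # TRUE wherever an element differs from its predecessor.
  # na_equal = TRUE makes consecutive missing values share a run.
  changed <- !vec_equal(
    vec_slice(x, idx + 1L),
    vec_slice(x, idx),
    na_equal = TRUE
  )

  # The first element always starts run 1; every change starts a new run.
  cumsum(c(TRUE, changed))
}

vec_runs_adjacent(c(0, 0, 1, 1, 0, 1))
#> [1] 1 1 2 2 3 4
```

Because `vec_equal()` and `vec_slice()` work row-wise on data frames, the same sketch should also accept the `vs`/`am` columns used in the reprex above.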

@DavisVaughan DavisVaughan added this to the 0.4.0 milestone May 5, 2020
@DavisVaughan DavisVaughan changed the title Implement vec_runs() Implement vec_runs() and vec_rle() May 5, 2020
@DavisVaughan
Member Author

DavisVaughan commented May 5, 2020

Also consider vec_rle(), which would differ from vec_group_rle(), since that uses a dictionary to track the first time we saw each individual value. vec_rle() would be much simpler and would just track changes in x, like vec_runs(). It would return a two-column data frame with val and len, very much like rle().
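
As a rough sketch of what that could look like at the R level: the function name `vec_rle_sketch()` is hypothetical, and the column names `val` and `len` follow the description above. This is an assumption-laden illustration, not the eventual vctrs API or its C implementation.

```r
library(vctrs)

# Hypothetical sketch, not part of vctrs: a run-length encoding that only
# tracks changes between neighbouring elements of x.
vec_rle_sketch <- function(x) {
  n <- vec_size(x)
  if (n == 0L) {
    return(new_data_frame(list(val = x, len = integer()), n = 0L))
  }

  idx <- seq_len(n - 1L)

  # TRUE wherever an element differs from its predecessor.
  changed <- !vec_equal(
    vec_slice(x, idx + 1L),
    vec_slice(x, idx),
    na_equal = TRUE
  )

  # Position at which each run starts; run lengths are the gaps between
  # consecutive starts, with n + 1 closing the final run.
  starts <- which(c(TRUE, changed))
  len <- diff(c(starts, n + 1L))

  new_data_frame(
    list(val = vec_slice(x, starts), len = as.integer(len)),
    n = length(starts)
  )
}

vec_rle_sketch(c("a", "a", "b", "a"))
#>   val len
#> 1   a   2
#> 2   b   1
#> 3   a   1
```

Unlike vec_group_rle(), the value "a" appears twice here, because the second run of "a" is never matched to the first.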

Also, why doesn't vec_group_rle() return a data frame rather than a rcrd? It seems like that would've been simpler for such a low level function.

library(vctrs)

vec_group_rle2 <- function(x) {
  rle <- vec_group_rle(x)
  data.frame(
    grp = field(rle, "group"),
    len = field(rle, "length")
  )
}

head(vec_group_rle2(mtcars$cyl))
#>   grp len
#> 1   1   2
#> 2   2   1
#> 3   1   1
#> 4   3   1
#> 5   1   1
#> 6   3   1

For vec_runs() and vec_rle() to work efficiently, we need to extract the equality comparison caching utilities from the dictionary code (i.e. d->equal and d->vec_p). That will let us compare values of x for equality very efficiently.

@lionel- lionel- added the feature a feature request or enhancement label May 7, 2020
@DavisVaughan
Member Author

DavisVaughan commented Sep 2, 2020

vec_identify_runs() might be a better name, since it returns an id. This is in line with other vec-prefixes changes.

@DavisVaughan DavisVaughan changed the title Implement vec_runs() and vec_rle() Implement vec_identity_runs() and vec_rle() Sep 2, 2020
@DavisVaughan DavisVaughan changed the title Implement vec_identity_runs() and vec_rle() Implement vec_identify_runs() and vec_rle() Sep 2, 2020

2 participants