Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export vec_duplicate_split() #514

Closed
romainfrancois opened this issue Jul 30, 2019 · 4 comments · Fixed by #515
Closed

export vec_duplicate_split() #514

romainfrancois opened this issue Jul 30, 2019 · 4 comments · Fixed by #515

Comments

@romainfrancois
Copy link
Contributor

romainfrancois commented Jul 30, 2019

For e.g. tidyverse/dplyr#4504

vec_duplicate_split() gives exactly what is needed for a vctrs based implementation of group_by() , e.g.

library(dplyr)

d <- tibble(x = c(1, 1, 1, 1, 2, 2), y = c(1, 2, 1, 2, 1, 2), z = 1:6)

vctrs:::vec_duplicate_split(select(d, x, y))
#> $key
#> [1] 1 2 5 6
#> 
#> $idx
#> $idx[[1]]
#> [1] 1 3
#> 
#> $idx[[2]]
#> [1] 2 4
#> 
#> $idx[[3]]
#> [1] 5
#> 
#> $idx[[4]]
#> [1] 6

group_by(d, x, y) %>% 
  group_data()
#> # A tibble: 4 x 3
#>       x     y .rows    
#> * <dbl> <dbl> <list>   
#> 1     1     1 <int [2]>
#> 2     1     2 <int [2]>
#> 3     2     1 <int [1]>
#> 4     2     2 <int [1]>
  • key gives the indices I can use to vec_slice() the data to get the first columns
  • idx is exactly the .rows column

tidyverse/dplyr#4504 then goes a bit further to reveal empty groups when .drop = FALSE but that does not need to be vctrs' concern.

@lionel-
Copy link
Member

lionel- commented Jul 30, 2019

Would vec_split_data() make sense as a name? Or perhaps vec_split_info()?

@DavisVaughan
Copy link
Member

DavisVaughan commented Jul 30, 2019

Would it make sense to go ahead and vec_slice() by using ki$key and use that as the return value in vec_duplicate_split()? In both vec_split() and in @romainfrancois's PR that seems to be the first thing you do. Plus the $key information is technically redundant because it is all there in $idx already. I probably would have done this already, but I don't think vec_slice() had a C implementation at the time.

I think it would be nice to even return a data frame. Then vec_split() would just switch out the idx column with the val column. It would make it more familiar to anyone used to the vec_split() output.

With this in mind it could also be called vec_split_id(), because we would have an id column (I'd change it from idx -> id). But I'm also fine with vec_split_info().

library(vctrs)
library(tibble)

by <- mtcars[c("vs", "am")]

ki <- vctrs:::vec_duplicate_split(by)

# current return value
ki
#> $key
#> [1] 1 3 4 5
#> 
#> $idx
#> $idx[[1]]
#> [1]  1  2 27 29 30 31
#> 
#> $idx[[2]]
#> [1]  3 18 19 20 26 28 32
#> 
#> $idx[[3]]
#> [1]  4  6  8  9 10 11 21
#> 
#> $idx[[4]]
#>  [1]  5  7 12 13 14 15 16 17 22 23 24 25

# proposed return value
ki$key <- vec_slice(by, ki$key)
as_tibble(new_data_frame(ki, n = vec_size(ki$key)))
#> # A tibble: 4 x 2
#>   key$vs   $am idx       
#>    <dbl> <dbl> <list>    
#> 1      0     1 <int [6]> 
#> 2      1     1 <int [7]> 
#> 3      1     0 <int [7]> 
#> 4      0     0 <int [12]>

# similar to
as_tibble(vec_split(mtcars, by))
#> # A tibble: 4 x 2
#>   key$vs   $am             val
#>    <dbl> <dbl> <list<df[,11]>>
#> 1      0     1        [6 × 11]
#> 2      1     1        [7 × 11]
#> 3      1     0        [7 × 11]
#> 4      0     0       [12 × 11]

Created on 2019-07-30 by the reprex package (v0.2.1)

@romainfrancois
Copy link
Contributor Author

I like this, having it a data frame with a data frame column for the key adds some structural information that is nice to have.

I suppose a vec_split_id() that would return current $idx would indeed be enough for dplyr given this has no notion of empty groups. It's "nice" not to have to do map_int(1L) to get $key.

@DavisVaughan
Copy link
Member

DavisVaughan commented Jul 30, 2019

Would it make sense to document vec_split_id() on the same page as vec_split(), and give it an argument vec_split_id(by) rather than vec_split_id(x)? If I put it on a different page then I would probably use x as the argument, but on the same page it kind of makes sense to be by, since if we called it x it would complicate the documentation.

(This might go against the general vctrs api since we use x as the first argument for pretty much everything)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants