vec_split() #196

Closed
hadley opened this issue Feb 21, 2019 · 3 comments

Comments

@hadley
Member

hadley commented Feb 21, 2019

vec_split <- function(x, f) {
  unname(split(x, vec_duplicate_id(f)))
}

But it would be implemented more efficiently internally, using vec_slice().
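To make the proposal concrete, here is a minimal base R sketch of the same idea. It assumes `f` is an atomic vector, where `match(f, f)` gives the location of each value's first occurrence, which is what `vec_duplicate_id()` is documented to return. The helper names `first_loc()` and `split_by_first_loc()` are hypothetical and are not part of vctrs.

```r
# Hypothetical base R stand-in for vec_duplicate_id() on atomic vectors:
# each element is mapped to the location of its first occurrence.
first_loc <- function(f) match(f, f)

# Same shape as the proposed vec_split(): an unnamed list of groups.
split_by_first_loc <- function(x, f) {
  unname(split(x, first_loc(f)))
}

f <- c("b", "a", "b", "c", "a")
first_loc(f)
#> [1] 1 2 1 4 2

groups <- split_by_first_loc(1:5, f)
groups
#> [[1]]
#> [1] 1 3
#>
#> [[2]]
#> [1] 2 5
#>
#> [[3]]
#> [1] 4
```

Because the grouping variable is the first-occurrence location rather than `f` itself, the groups come out in order of first appearance ("b", "a", "c") instead of the sorted order that `split(x, f)` would produce.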

@hadley
Member Author

hadley commented Feb 23, 2019

Or would it return a tibble? One column would be the keys (vec_unique()) and the other would be the values (as above). Is this the answer to #197?

@DavisVaughan
Member

This can do some pretty neat things, especially when data frames are used as f.

suppressPackageStartupMessages({
  library(vctrs)
  library(gapminder)
  library(tidyr)
  library(dplyr)
})
#> Warning: package 'dplyr' was built under R version 3.5.2

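vec_split <- function(x, f) {
  # One key per distinct value of f, in order of first appearance
  keys <- vec_unique(f)

  # For each element of f, the location of its first occurrence
  prxy <- vec_duplicate_id(f)
  prxy_keys <- vec_unique(prxy)

  # Slice x once per group; lapply() stands in for the internal vctrs:::map()
  values <- lapply(prxy_keys, function(prxy_key) {
    vec_slice(x, vec_equal(prxy, prxy_key))
  })

  # Keys and the matching slices of x, side by side
  tibble::tibble(.keys = keys, .values = values)
}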

vec_split(iris, iris$Species)
#> # A tibble: 3 x 2
#>   .keys      .values              
#>   <fct>      <list>               
#> 1 setosa     <data.frame [50 × 5]>
#> 2 versicolor <data.frame [50 × 5]>
#> 3 virginica  <data.frame [50 × 5]>

vec_split(iris, 1)
#> # A tibble: 1 x 2
#>   .keys .values               
#>   <dbl> <list>                
#> 1     1 <data.frame [150 × 5]>

vec_split(iris, NA)
#> # A tibble: 1 x 2
#>   .keys .values               
#>   <lgl> <list>                
#> 1 NA    <data.frame [150 × 5]>

vec_split(iris, NULL)
#> Error: All columns in a tibble must be 1d or 2d objects:
#> * Column `.keys` is NULL
#> Backtrace:
#>     █
#>  1. └─global::vec_split(iris, NULL)
#>  2.   └─tibble::tibble(.keys = keys, .values = values)
#>  3.     └─tibble:::lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
#>  4.       └─tibble:::check_valid_cols(x)

# split by unique combinations of 2 columns
gap_nest <- nest(gapminder, continent, country)
vec_split(gap_nest, gap_nest$data)
#> # A tibble: 142 x 2
#>    .keys            .values          
#>    <list>           <list>           
#>  1 <tibble [1 × 2]> <tibble [12 × 5]>
#>  2 <tibble [1 × 2]> <tibble [12 × 5]>
#>  3 <tibble [1 × 2]> <tibble [12 × 5]>
#>  4 <tibble [1 × 2]> <tibble [12 × 5]>
#>  5 <tibble [1 × 2]> <tibble [12 × 5]>
#>  6 <tibble [1 × 2]> <tibble [12 × 5]>
#>  7 <tibble [1 × 2]> <tibble [12 × 5]>
#>  8 <tibble [1 × 2]> <tibble [12 × 5]>
#>  9 <tibble [1 × 2]> <tibble [12 × 5]>
#> 10 <tibble [1 × 2]> <tibble [12 × 5]>
#> # … with 132 more rows

# or this way
vec_split(gapminder, select(gapminder, continent, country))
#> # A tibble: 142 x 2
#>    .keys$continent $country    .values          
#>    <fct>           <fct>       <list>           
#>  1 Asia            Afghanistan <tibble [12 × 6]>
#>  2 Europe          Albania     <tibble [12 × 6]>
#>  3 Africa          Algeria     <tibble [12 × 6]>
#>  4 Africa          Angola      <tibble [12 × 6]>
#>  5 Americas        Argentina   <tibble [12 × 6]>
#>  6 Oceania         Australia   <tibble [12 × 6]>
#>  7 Europe          Austria     <tibble [12 × 6]>
#>  8 Asia            Bahrain     <tibble [12 × 6]>
#>  9 Asia            Bangladesh  <tibble [12 × 6]>
#> 10 Europe          Belgium     <tibble [12 × 6]>
#> # … with 132 more rows

mat <- matrix(1:50, 10, 5)
mat_f <- rep(1:2, times = 5)
vec_split(mat, mat_f)
#> # A tibble: 2 x 2
#>   .keys .values      
#>   <int> <list>       
#> 1     1 <int [5 × 5]>
#> 2     2 <int [5 × 5]>

Created on 2019-02-27 by the reprex package (v0.2.1.9000)

@hadley hadley closed this as completed in c8063f9 Feb 27, 2019
@hadley
Member Author

hadley commented Feb 27, 2019

And the fact that the implementation is so simple is a good signal that the underlying primitives are correct. At some point, we will need to rewrite it in C to avoid creating the internal dictionary twice (once for the unique values and once for the duplicates).
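The "dictionary twice" point can be illustrated with a rough sketch in plain R. This is not the vctrs implementation and not the planned C code. It only shows how a single pass over `f` with one dictionary can yield both the unique keys and the group of every element. The function name `group_once()` is hypothetical, and the sketch assumes `f` is an atomic vector whose values can be told apart by their string form, an assumption that fails for, say, doubles that print identically or a literal "NA" string alongside a missing value.

```r
# Hypothetical single-pass grouping; an illustration only.
group_once <- function(x, f) {
  dict <- new.env(hash = TRUE)   # the one dictionary: key string -> group number
  first <- integer()             # location of each key's first occurrence
  group <- integer(length(f))    # group number of every element of f

  for (i in seq_along(f)) {
    k <- paste0("k", as.character(f[[i]]))  # assumption: string form identifies the value
    g <- dict[[k]]
    if (is.null(g)) {
      # New key: assign the next group number and remember where it appeared
      g <- length(first) + 1L
      dict[[k]] <- g
      first[[g]] <- i
    }
    group[[i]] <- g
  }

  # Keys and values both come from the same pass over f
  list(keys = f[first], values = unname(split(x, group)))
}

res <- group_once(1:5, c("b", "a", "b", "c", "a"))
res$keys
#> [1] "b" "a" "c"
```

In the R version above, `vec_unique()` and `vec_duplicate_id()` each hash `f` separately; the sketch does that work once, which is the saving the comment describes.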
