tidy grouped data attributes #3489

Closed
romainfrancois opened this issue Apr 9, 2018 · 13 comments

Comments

@romainfrancois (Member) commented Apr 9, 2018

Would it make sense to tidy the attributes that we use internally for grouped data frames? We could, for example, keep the labels, indices and group sizes in the same data frame, which would be convenient for manipulation because it would then be tidy (inspired by working on #341).

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
gdf <- group_by(df, x, y)

attributes(gdf)[ c("labels", "indices", "group_sizes")]
#> $labels
#>   x y
#> 1 1 1
#> 2 2 1
#> 
#> $indices
#> $indices[[1]]
#> [1] 0
#> 
#> $indices[[2]]
#> [1] 1
#> 
#> 
#> $group_sizes
#> [1] 1 1
as_tibble(
  mutate( attr(gdf, "labels"), 
    ..index.. = attr(gdf, "indices"), 
    ..size.. = attr(gdf, "group_sizes")
  )
)
#> # A tibble: 2 x 4
#>       x y     ..index.. ..size..
#>   <int> <fct> <list>       <int>
#> 1     1 1     <int [1]>        1
#> 2     2 1     <int [1]>        1

Created on 2018-04-09 by the reprex package (v0.2.0).
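
The same idea can be sketched without relying on dplyr's internal attributes. The helper below, `tidy_groups()`, is hypothetical (it is not a dplyr function) and the column names `.index` and `.size` are illustrative assumptions. It uses base R only and stores 1-based row positions, so it is a rough sketch of the tidy layout rather than the actual implementation.

```r
# Hypothetical helper, not part of dplyr: one row per group, with the
# row positions (1-based) in a list column and the group size next to it.
tidy_groups <- function(data, vars) {
  # One row per distinct key combination, sorted by the keys
  keys <- unique(data[vars])
  keys <- keys[do.call(order, unname(as.list(keys))), , drop = FALSE]
  rownames(keys) <- NULL

  # Collapse each key combination into a single string so rows can be matched
  key_id <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))
  group_of_row <- match(key_id(data[vars]), key_id(keys))

  # List column of row positions, one element per group
  keys$.index <- unname(split(
    seq_len(nrow(data)),
    factor(group_of_row, levels = seq_len(nrow(keys)))
  ))
  keys$.size <- lengths(keys$.index)
  keys
}

df <- data.frame(
  x = c(1L, 1L, 2L, 2L, 2L),
  y = c("a", "a", "b", "b", "b")
)
groups <- tidy_groups(df, c("x", "y"))
str(groups)
```

Because labels, indices and sizes live in one table, filtering or reordering groups becomes an ordinary data frame operation.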

@krlmlr (Member) commented Apr 10, 2018

I think so. Added to the list of breaking changes which we can start working on right after the upcoming release.

@romainfrancois (Member, Author) commented Apr 10, 2018

Maybe this is part of a bigger change on how we do groupings. Having something tidy could perhaps open the door to other kinds of groupings. Way back we discussed bootstrap groupings for example.

The way things are done now with grouped and rowwise data frames is less than ideal.

@krlmlr (Member) commented Apr 10, 2018

Seems related to #2311 then?

@hadley (Member) commented May 7, 2018

In the long run, I think the right way to handle this sort of namespacing for variables (i.e. we need to use ..index.. to avoid clashing with a grouping variable called index) is to use a df-column:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
gdf <- group_by(df, x, y)

attr <- attributes(gdf)[ c("labels", "indices", "group_sizes")]

df <- data.frame(size = attr$group_sizes)
df$label <- attr$labels
df$index <- attr$indices
df <- df[c(2, 3, 1)]
str(df)
#> 'data.frame':    2 obs. of  3 variables:
#>  $ label:'data.frame':   2 obs. of  2 variables:
#>   ..$ x: int  1 2
#>   ..$ y: Factor w/ 2 levels "1","2": 1 1
#>   ..- attr(*, "vars")= chr  "x" "y"
#>   ..- attr(*, "drop")= logi TRUE
#>  $ index:List of 2
#>   ..$ : int 0
#>   ..$ : int 1
#>  $ size : int  1 1

This will require substantial work throughout dplyr so this is a note for the future, rather than something we should try and do now.

@romainfrancois (Member, Author) commented May 9, 2018

I have something like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# all information is in the labels attribute
group_by(mtcars, am, cyl) %>% attr("labels")
#> # A tibble: 6 x 3
#>      am   cyl ..indices..
#>   <dbl> <dbl> <list>     
#> 1     0     4 <int [3]>  
#> 2     0     6 <int [4]>  
#> 3     0     8 <int [12]> 
#> 4     1     4 <int [8]>  
#> 5     1     6 <int [3]>  
#> 6     1     8 <int [2]>

# some operations make lazy grouped_df
# in that case, "labels" is a character vector
df1 <- data_frame(a = 1:3) %>% group_by(a)
df2 <- data_frame(a = rep(1:4, 2)) %>% group_by(a)
res <- left_join(df1, df2, by = "a")
attr(res, "labels")
#> [1] "a"

# which is materialised into a tibble when needed
# this happens by reference :scream: 
group_size(res)
#> [1] 2 2 2
attr(res, "labels")
#> # A tibble: 3 x 2
#>       a ..indices..
#>   <int> <list>     
#> 1     1 <int [2]>  
#> 2     2 <int [2]>  
#> 3     3 <int [2]>

Not sure we want to keep the laziness feature.

@romainfrancois (Member, Author) commented May 9, 2018

Questions:

  • I guess we need some way to access "labels".
  • The name ..indices.. is a placeholder until someone has a better idea. Internally, the code mostly only cares that this is the last column.
  • Should the ..indices.. column be a list of 1-based indices? It is currently 0-based because it was previously meant to be used only internally.
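
To make the 0-based versus 1-based question concrete, here is a minimal base R sketch. The objects `indices0` and `x` are made-up illustration data, not dplyr internals. It shows that 0-based positions cannot be used directly for subsetting in R, because an index of 0 is silently dropped, and that converting to 1-based positions is a one-line shift.

```r
# Made-up 0-based row positions for two groups (illustration only)
indices0 <- list(c(0L, 1L), c(2L, 3L))
x <- c(10, 20, 30, 40)

# Subsetting with 0-based positions silently drops the 0,
# so only one element comes back for the first group
wrong <- x[indices0[[1]]]
length(wrong)

# Shift every position by one to get 1-based indices usable from R
indices1 <- lapply(indices0, function(i) i + 1L)
per_group <- lapply(indices1, function(i) x[i])
per_group
```

This is one argument for exposing 1-based indices if the column becomes user-visible, since users would otherwise have to remember the shift themselves.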

@romainfrancois (Member, Author) commented May 9, 2018

Since #3492, joins are the only producers of lazy grouped data frames, through the reconstruct_join function:

reconstruct_join <- function(out, x, vars) {
  if (is_grouped_df(x)) {
    groups_in_old <- match(group_vars(x), tbl_vars(x))
    groups_in_alias <- match(groups_in_old, vars$x)
    out <- grouped_df_impl(out, vars$alias[groups_in_alias], FALSE)
  }
  out
}

I'd argue we should get rid of it altogether and only have materialized grouping structures.

@krlmlr (Member) commented May 9, 2018

How do we update the grouping structure in a join?

How about group_labels()?

@hadley (Member) commented May 9, 2018

  • Yes, we probably should use 1-based indices, but it might be better to do that in a separate PR
  • I think we should call the attribute groups
  • I think we should call the column .rows
  • I think it's ok to get rid of the lazy grouping, although we should file an issue to consider preserving as part of the join process

@romainfrancois (Member, Author) commented May 9, 2018

Preserving the grouping structure in the context of joins might be tricky because of:

  • mixed grouping, e.g. when the lhs is a factor and the rhs a chr
  • right joins, and any join where the lhs is not in control of the group population.
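
The second point can be illustrated with a small base R sketch. It uses `merge()` rather than dplyr's join code, and `lhs`, `rhs` and their columns are made-up data, so treat it as an analogy for the problem rather than a description of what dplyr does. With `all.y = TRUE` (a right join), the result contains a key value that the lhs never had, so a grouping structure computed from the lhs alone would be missing a group.

```r
# Made-up tables: key 4 only exists on the right-hand side
lhs <- data.frame(a = 1:3, u = c("p", "q", "r"))
rhs <- data.frame(a = 1:4, v = c(10, 20, 30, 40))

# Group keys as they would be known from the lhs before the join
lhs_groups <- unique(lhs$a)

# Right join: every row of rhs is kept, matched or not
out <- merge(lhs, rhs, by = "a", all.y = TRUE)

# Keys present in the result that the lhs grouping does not know about
new_keys <- setdiff(out$a, lhs_groups)
new_keys
```

Any scheme that tries to carry the lhs grouping through the join would need to detect such keys and add groups for them, which is close to recomputing the grouping anyway.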

@romainfrancois (Member, Author) commented May 9, 2018

rows() would then be a good name for the function that extracts .rows

@romainfrancois (Member, Author) commented May 9, 2018

With @hadley's naming suggestions 👌, plus rows() and group_data(). Too bad groups() is taken.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(g = c(1, 1, 2, 2), x = 1:4)

# all information is in the `groups` attribute
group_by(df, g) %>% attr("groups")
#> # A tibble: 2 x 2
#>       g .rows    
#>   <dbl> <list>   
#> 1     1 <int [2]>
#> 2     2 <int [2]>

# can also extract it with group_data() and rows()
group_by(df, g) %>% group_data()
#> # A tibble: 2 x 2
#>       g .rows    
#>   <dbl> <list>   
#> 1     1 <int [2]>
#> 2     2 <int [2]>
group_by(df, g) %>% rows()
#> [[1]]
#> [1] 0 1
#> 
#> [[2]]
#> [1] 2 3

# works also on ungrouped data frames
group_data(df)
#> # A tibble: 1 x 1
#>   .rows    
#>   <list>   
#> 1 <int [4]>
rows(df)
#> [[1]]
#> [1] 0 1 2 3

# ... and rowwise
group_data(rowwise(df))
#> # A tibble: 4 x 1
#>   .rows    
#>   <list>   
#> 1 <int [1]>
#> 2 <int [1]>
#> 3 <int [1]>
#> 4 <int [1]>
rows(rowwise(df))
#> [[1]]
#> [1] 0
#> 
#> [[2]]
#> [1] 1
#> 
#> [[3]]
#> [1] 2
#> 
#> [[4]]
#> [1] 3

Created on 2018-05-10 by the reprex package (v0.2.0).

@lock (bot) commented Nov 24, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 24, 2018