Create a group_indices as a new variable #1185

Closed
matthieugomez opened this issue May 30, 2015 · 21 comments
@matthieugomez

matthieugomez commented May 30, 2015

Some packages like ggplot2 act on groups defined by one variable only (as opposed to groups defined by several variables). It would be nice to have a function, say group(), that creates a new integer variable from groups defined by multiple variables:

Batting %>% mutate(group = group(teamID, yearID))
Batting %>% group_by(teamID, yearID) %>% mutate(group = group())

This function could also have a na.rm argument. By default, it should return a missing value for any observation in which a grouping variable is missing.

group_indices is not suited for that, since (i) it requires df as an argument and (ii) it does not work inside mutate:

df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
df %>% mutate(g = group_indices(df, v1))
# Error: cannot handle

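A workaround for now:

group <- function(..., na.rm = FALSE){
  # Collect the grouping vectors into a data frame, one column per vector
  df <- data.frame(list(...))
  if (na.rm){
    # Rows with a missing grouping value keep an NA index
    out <- rep(NA_integer_, nrow(df))
    complete <- complete.cases(df)
    # Index only the complete rows, then write the indices back in place
    indices <- df %>% filter(complete) %>% group_indices_(.dots = names(df))
    out[complete] <- indices
  } else{
    # Otherwise NA is treated as a grouping value like any other
    out <- group_indices_(df, .dots = names(df))
  }
  out
}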
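
For readers without dplyr at hand, the NA-propagating behavior requested above can be sketched in base R. This is only an illustration under stated assumptions: `group_id_na` is a hypothetical name (not a dplyr function), groups are numbered in order of first appearance rather than in sorted order, and the keys are built with paste(), which assumes the separator never occurs in the data.

```r
# Hypothetical helper (not part of dplyr): integer group id built from several
# vectors, with NA wherever any grouping value is missing.
group_id_na <- function(...) {
  cols <- list(...)
  # TRUE for rows where every grouping value is present
  complete <- !Reduce(`|`, lapply(cols, is.na))
  # One string key per row; the separator is assumed absent from the data
  key <- do.call(paste, c(cols, sep = "\r"))
  out <- rep(NA_integer_, length(key))
  # Number the complete rows by first appearance of their key
  out[complete] <- match(key[complete], unique(key[complete]))
  out
}

v1 <- c(NA, NA, 2, 2, 3)
v2 <- c(NA, NA, 3, 3, 4)
ids <- group_id_na(v1, v2)
print(ids)
#> [1] NA NA  1  1  2
```

Because the function only takes vectors, it could in principle be called inside mutate() as well, though that use is not tested here.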
matthieugomez changed the title from "Using group_indices within mutate" to "Create a group_indices as a new variable" on Jun 6, 2015
@romainfrancois
Member

You use group_indices like this:

> df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
> df %>% group_by(v1) %>% group_indices
[1] 3 3 1 1 2
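
The numbering in this output (sorted group keys, with the NA group last) can be mimicked for a single column in base R. This is a sketch of the observable result only, not of how dplyr computes it internally.

```r
v1 <- c(NA, NA, 2, 2, 3)
# Distinct keys in ascending order, with NA placed last
keys <- sort(unique(v1), na.last = TRUE)
# Position of each value among the sorted keys; match() pairs NA with NA
g <- match(v1, keys)
print(g)
#> [1] 3 3 1 1 2
```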

@hadley perhaps we could support something like:

df %>% group_by(v1) %>% mutate( g = group_indices() )
# or 
df %>% group_by(v1) %>% mutate( g = group() )

@matthieugomez
Author

matthieugomez commented Jul 8, 2015

In every case where I used this function (for instance before using ggplot2 or plm), I wanted the group index to be missing for observations where any of the columns was missing. This differs from the behavior of group_indices. Maybe dplyr is not the best place to implement such a function?

@romainfrancois
Member

In that case, perhaps you can close the issue?

@voxnonecho

I think the initial idea is great: using group_indices() the way we use rleid() in data.table.

@jarkub

jarkub commented May 19, 2016

@romainfrancois

Why was the issue closed? While @matthieugomez found another solution for his particular use case, the original request would be an extremely useful feature. It would be great to be able to do what you suggested above:

df %>% group_by(v1) %>% mutate( g = group_indices() )
# or 
df %>% group_by(v1) %>% mutate( g = group() )

@stephlocke

I would expect to be able to use this inside a mutate. Like @voxnonecho, I use data.table a lot, and the lack of an easy way to do this in dplyr is a bit of a pain.

@krlmlr
Member

krlmlr commented Mar 28, 2017

I can offer:

library(dplyr, warn.conflicts = FALSE)
df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4))
df %>% group_by(v1) %>% { mutate(ungroup(.), g = group_indices(.)) }
#> # A tibble: 5 × 3
#>      v1    v2     g
#>   <dbl> <dbl> <int>
#> 1    NA    NA     3
#> 2    NA    NA     3
#> 3     2     3     1
#> 4     2     3     1
#> 5     3     4     2

@hadley: This would be much simpler if we had a hybrid handler for group_indices() that just does the right thing for grouped data frames.

@hadley
Member

hadley commented Mar 28, 2017

I want to shelve this for now and come back to it when we broadly reconsider what other pronouns would be useful inside dplyr verbs.

@ericgtaylor

howdy, thread. any updates on using group_indices inside of mutate? :)

df %>% group_by(v1) %>% mutate( g = group_indices() )

@dfrail24

dfrail24 commented Feb 5, 2018

I see that another option is doing
df$g <- df %>% group_by(v1) %>% group_indices
but it would be great if we could do this in a seamless pipe like others have voiced above.

@Zedseayou

I'll add that if you want to use group_indices inside a pipe, you can always do this:
df %>% bind_cols(g = group_indices(., group_var1, group_var2))
which does work seamlessly in a pipe but does not feel very neat.

krlmlr added the "feature" (a feature request or enhancement) and "data frame" labels on Feb 28, 2018
@krlmlr
Member

krlmlr commented Feb 28, 2018

Happy to reconsider.

krlmlr reopened this on Feb 28, 2018
@krlmlr
Member

krlmlr commented Feb 28, 2018

@hadley: Do we have a better idea what other pronouns are useful inside dplyr verbs?

romainfrancois self-assigned this on Mar 27, 2018
romainfrancois added a commit that referenced this issue Apr 9, 2018
```r
> df <- data_frame(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3,3, 4)) %>% group_by(v1)
> df %>% mutate( g = group_indices() )
# A tibble: 5 x 3
# Groups:   v1 [3]
     v1    v2     g
  <dbl> <dbl> <int>
1   NA    NA      3
2   NA    NA      3
3    2.    3.     1
4    2.    3.     1
5    3.    4.     2
```
romainfrancois added a commit that referenced this issue Apr 9, 2018
romainfrancois added a commit that referenced this issue Apr 9, 2018
@romainfrancois
Member

romainfrancois commented Apr 9, 2018

I pushed some code in the feature-1185-group branch so that group_indices() is hybrid-interpreted.

df <- data_frame(v1 = c(3, 3, 2, 2, 3, 1), v2 = 1:6) %>% group_by(v1)
mutate(df, g = group_indices())
#> # A tibble: 6 x 3
#> # Groups:   v1 [3]
#>      v1    v2     g
#>   <dbl> <int> <int>
#> 1    3.     1     3
#> 2    3.     2     3
#> 3    2.     3     2
#> 4    2.     4     2
#> 5    3.     5     3
#> 6    1.     6     1

Because of the internal implementation, this returns 0 as the group index for every row when the data frame is not grouped. Is that OK?

df <- data_frame(v1 = c(3, 3, 2, 2, 3, 1), v2 = 1:6)
mutate(df, g = group_indices())
#> # A tibble: 6 x 3
#>      v1    v2     g
#>   <dbl> <int> <int>
#> 1    3.     1     0
#> 2    3.     2     0
#> 3    2.     3     0
#> 4    2.     4     0
#> 5    3.     5     0
#> 6    1.     6     0

@krlmlr
Member

krlmlr commented Apr 9, 2018

This sounds good, but I think group_indices() should return an all-1 vector for an ungrouped data frame for consistency:

dplyr::group_indices(mtcars)
#>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Created on 2018-04-10 by the reprex package (v0.2.0).

romainfrancois added a commit that referenced this issue Apr 10, 2018
romainfrancois added a commit that referenced this issue Apr 10, 2018
romainfrancois added a commit that referenced this issue Apr 10, 2018
@romainfrancois
Member

I made the change in the branch. Why is group() returning -1 in the first place?

@krlmlr
Member

krlmlr commented Apr 10, 2018

Thanks. What is group()?

@krlmlr
Member

krlmlr commented Apr 10, 2018

Oh, I see -- it's in the C++ code.

@krlmlr
Member

krlmlr commented Apr 10, 2018

I have no idea. Do the tests still pass if we change that -1 to 1?

@romainfrancois
Member

This would have to be 0; I'll check.

romainfrancois added a commit that referenced this issue May 3, 2018
romainfrancois added a commit that referenced this issue May 3, 2018
romainfrancois added a commit that referenced this issue May 3, 2018
krlmlr added a commit that referenced this issue May 5, 2018
- `group_indices()` can be used without argument in expressions in verbs (#1185).
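
For readers arriving later: as far as I know, dplyr 1.0.0 and later deprecate the zero-argument form of group_indices() in favor of cur_group_id(). A minimal sketch of the request from this thread, assuming dplyr >= 1.0.0 is installed:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(v1 = c(NA, NA, 2, 2, 3), v2 = c(NA, NA, 3, 3, 4))

# Group index as a new column, computed inside mutate()
res <- df %>%
  group_by(v1) %>%
  mutate(g = cur_group_id()) %>%
  ungroup()

print(res$g)
```

As in the outputs shown earlier in the thread, rows with a missing v1 form their own group rather than getting a missing index.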
@lock

lock bot commented Nov 1, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

lock bot locked and limited conversation to collaborators on Nov 1, 2018