Suggestion: Add argument keep_grouping_vars to cur_data() #5342

Closed
mayer79 opened this issue Jun 19, 2020 · 6 comments
Labels
feature (a feature request or enhancement), grouping 👨‍👩‍👧‍👦

mayer79 commented Jun 19, 2020

The new cur_data() in dplyr is fantastic. However, it does not yet serve as a full replacement for do(). Why? According to its documentation, it excludes the grouping variables. I suppose this is to ensure that the same variable name does not appear twice in the resulting data.frame.

Would it be possible to add an argument keep_grouping_vars = FALSE to cur_data() so that this behaviour can be overridden?

Here is an example:

library(dplyr)

iris %>% 
  group_by(Species) %>% 
  summarize(should_be_five = ncol(cur_data()))

# Output
  Species    should_be_five
  <fct>               <int>
1 setosa                  4
2 versicolor              4
3 virginica               4

This is my suggestion:

iris %>% 
  group_by(Species) %>% 
  summarize(should_be_five = ncol(cur_data(keep_grouping_vars = TRUE)))

The current workaround:

iris %>% 
  group_by(Species) %>% 
  summarize(should_be_five = ncol(bind_cols(cur_group(), cur_data())))

hadley (Member) commented Jun 19, 2020

Could you please provide an example of how you’re using do that requires this proposed feature?

mayer79 (Author) commented Jun 19, 2020

Here, I want to calculate global surrogate trees (within Species) to explain random forest predictions:

library(ranger)
library(dplyr)
library(rpart)

rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

# Function that calculates surrogate tree to the predictions of the random forest
surrogate <- function(X) {
  simple_fit <- rpart(pred_ ~ Species + Petal.Width + Petal.Length,
                      data = X)
  simple_pred <- predict(simple_fit, X)
  data.frame(rmse = rmse(X[["pred_"]], simple_pred),
             tree = I(list(simple_fit)))
}

# Fit random forest on iris and keep OOB predictions
ir <- iris
complex_fit <- ranger(Sepal.Length ~ Species + Petal.Width + Petal.Length,
                      data = ir)
ir[["pred_"]] <- complex_fit$predictions

# Fit surrogate tree for each "by" group together with performance
group_by(ir, Species) %>%
  do(surrogate(.data)) %>%
  ungroup()

hadley (Member) commented Jun 19, 2020

BTW, one workaround in the meantime is just to use cbind(cur_group(), cur_data()).
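A minimal sketch of how this workaround could be wrapped in a helper, assuming a dplyr version in which cur_data() and cur_group() are available (1.0.x). The name cur_data_with_groups is hypothetical and is not part of dplyr; bind_cols() is used here instead of cbind(), as in the workaround from the original post.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): returns the current group's
# data with the grouping columns prepended. cur_group() is a one-row tibble
# of group keys, which bind_cols() recycles to the group's row count.
cur_data_with_groups <- function() {
  bind_cols(cur_group(), cur_data())
}

res <- iris %>%
  group_by(Species) %>%
  summarise(
    n_cols      = ncol(cur_data_with_groups()),
    has_species = "Species" %in% names(cur_data_with_groups())
  )

print(res)
```

On iris this should report 5 columns for every species, with the Species column present in each group's data.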

mayer79 (Author) commented Jun 19, 2020

Thanks for the hint. It actually is my workaround, but now with a better feeling :-).

hadley added the feature (a feature request or enhancement) and grouping 👨‍👩‍👧‍👦 labels Jun 22, 2020
hadley added this to the 1.0.1 milestone Jun 22, 2020

hadley (Member) commented Jun 22, 2020

The question is whether this should be an argument (e.g. cur_data(group_vars = TRUE)) or a separate function (cur_data_all()).

mayer79 (Author) commented Jun 23, 2020

Yes, a separate function might be even better.

The behaviour of .SD in data.table matches your current implementation, i.e. it excludes the grouping columns:

library(data.table)
ir <- as.data.table(iris)
ir[, ncol(.SD), Species] # All 4

f <- function(X) "Species" %in% colnames(X)
ir[, f(.SD), Species] # All FALSE
