New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

returned column class wrong #599

Closed
davidcarslaw opened this Issue Sep 15, 2014 · 6 comments

Comments

Projects
None yet
3 participants
@davidcarslaw

davidcarslaw commented Sep 15, 2014

Given some simple data:

dat <- data.frame(grp = rep(1:4, each = 2),
                  val = c(NA, 2, 3:8),
                  val2 = c(3, 4, NA, 1, 5:8))

and a simple function:

Mean <- function(x, thresh = 2) {
  res <- mean(x, na.rm = TRUE)
  if (res > thresh) res else NA
}

Now apply to data:

dat %>%
  group_by(grp) %>%
  summarise_each(funs(Mean(., thresh = 2)))

# Source: local data frame [4 x 3]

#   grp  val val2
#1   1   NA  3.5
#2   2 TRUE   NA
#3   3 TRUE  5.5
#4   4 TRUE  7.5

The problem is that the first column becomes boolean instead of numeric. I think this is because the first returned result is NA as it works OK for the second column that also has an NA but it appears further down...

All the best

David

@hadley

This comment has been minimized.

Member

hadley commented Sep 15, 2014

For now, work around this by using the right NA: NA_real_

@hadley hadley added this to the 0.3.1 milestone Sep 22, 2014

@hadley hadley added the feature label Sep 22, 2014

@hadley hadley self-assigned this Sep 22, 2014

@hadley hadley assigned romainfrancois and unassigned hadley Nov 18, 2014

@hadley

This comment has been minimized.

Member

hadley commented Nov 18, 2014

@romainfrancois any thoughts on this problem?

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Nov 19, 2014

Hum. That logical NA problem again. summarize makes the assumption that all the group will yield compatible results, and this is indeed completely determined by what's returned from the first group. I guess I can put in place some sort of promotion/coercion feature as I've done in rbind ...

This is particularly a problem because of NA being resolved to boolean. Do we also have to cater for other cases, let's say a function that would sometimes return an integer and sometimes a numeric.

I think I can handle the NA case separately but error on other cases.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Nov 19, 2014

Alright, I put this in place. Now instead of deciding based on the first evaluation, we decide based on the first evaluation that is not NA.

> dat <- data.frame(grp = rep(1:4, each = 2), val = c(NA, 2, 3:8))
>   Mean <- function(x, thresh = 2) {
+     res <- mean(x, na.rm = TRUE)
+     if (res > thresh) res else NA
+   }
>   res <- dat %>% group_by(grp) %>% summarise( val = Mean(val, thresh = 2))
>   str(res)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   4 obs. of  2 variables:
 $ grp: int  1 2 3 4
 $ val: num  NA 3.5 5.5 7.5
 - attr(*, "drop")= logi TRUE
@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Nov 19, 2014

Now is this a problem :

> dat <- data.frame(grp = rep(1:4, each = 2), val = c(NA, 2, 3:8))
> Mean <- function(x, thresh = 2) {
+    res <- mean(x, na.rm = TRUE)
+    if (res > thresh) res else as.integer(res)
+ }
> res <- dat %>% group_by(grp) %>% summarise( val = Mean(val, thresh = 2))
> res
Source: local data frame [4 x 2]

  grp val
1   1   2
2   2   3
3   3   5
4   4   7

i.e. the first eval governs the return type, so we get integer here and further results are coerced to integer through Rcpp::as.

It's fine in the other direction, but we lose something in this direction. options:

  • I could put some sort of promotion scheme in place. This is potentially expensive as I'd have to allocate the integer vector first, then the numeric vector, coerce all the previous results to the point where we need promotion, ...
    • Just do as is with a warning
    • Hard error with enough of a message to help the user change the code. I think that's my preferred option. Easy enough to throw in some as.integer, as.numeric etc ... it is pretty esoteric anyway to make functions that don't always return the same type. I can understand the exception for NA because nobody wants to be typing NA_integer_ etc ... but otherwise I think I'd like to go for the error.

@hadley ?

@hadley

This comment has been minimized.

Member

hadley commented Nov 19, 2014

@romainfrancois error message sounds good to me. We should probably expand on this in a vignette somewhere.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.