Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do() gives error if returned values are not data frames #397

Closed
wch opened this issue Apr 16, 2014 · 14 comments
Closed

do() gives error if returned values are not data frames #397

wch opened this issue Apr 16, 2014 · 14 comments

Comments

@wch
Copy link
Member

wch commented Apr 16, 2014

The docs say:

You can use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list.

But it doesn't work this way. For example:

mtcars %>% group_by(cyl) %>% do(nrow(.))
# Error: Results are not data frames at positions: 1, 2, 3

The result I would expect is something like this (which presently works):

x <- mtcars %>% group_by(cyl) %>% do(data.frame(nrow = nrow(.)))
as.list(x$nrow)
# [[1]]
# [1] 11
#
# [[2]]
# [1] 7
#
# [[3]]
# [1] 14

Note that I'm just using nrow to illustrate; I know you can use group_size to get this result. The thing I'm actually doing returns a vector for each group, so there's not a simple drop-in replacement for it.

@wch
Copy link
Member Author

wch commented Apr 16, 2014

Oh, maybe it's that you have to provide a name if the result isn't a data frame? That seems kind of counterintuitive, or at least it could be documented better, and could be mentioned in the error message.

x <- mtcars %>% group_by(cyl) %>% do(nrow = nrow(.))
# Source: local data frame [3 x 2]
# Groups: <by row>
# 
#   cyl     nrow
# 1   4 <int[1]>
# 2   6 <int[1]>
# 3   8 <int[1]>

x$nrow
# [[1]]
# [1] 11
# 
# [[2]]
# [1] 7
# 
# [[3]]
# [1] 14

It would also be useful to have a function for extracting columns. You can use [[, but the syntax isn't very elegant:

mtcars %>% group_by(cyl) %>% do(nrow = nrow(.)) %>% `[[`("nrow")

@romainfrancois
Copy link
Member

Maybe this perhaps. Not sure I is more elegant though:

> `%[[%` <- function(x, .) x[[.]]
> mtcars %>% group_by(cyl) %>% do(nrow = nrow(.)) %[[% "nrow"
[[1]]
[1] 11

[[2]]
[1] 7

[[3]]
[1] 14

@hadley
Copy link
Member

hadley commented Apr 17, 2014

I'd prefer to leave functions that operate on vectors/columns in magrittr.

I'll try and make the docs more clear.

@statwonk
Copy link

I'm with @wch, my expectation today was that I could run run do(gam(y ~ s(x), data = .) and it would return a list of fits.

Edit: actually, I get what I expect if I just assign to a variable name (as @wch mentions above).

One improvement might be to return a list with the group_by variables concatenated as list names instead of a separate vector with the same type of concatenation.

@hadley hadley closed this as completed Jul 28, 2014
@nzcoops
Copy link

nzcoops commented Mar 5, 2015

Hi @hadley , I've been battling with this one today. I'm happy to paste a full working example but I'm guessing since this is a year old that you're possibly done and dusted with the topic. An example of something I do regularly, is I will have hospitalisation records for people (multiple records per person), and I might be interested in the first record for each person for a particular diagnosis. I write a small function to find the first record and pull that row out (of that persons larger set of records) and return 'all the first records'.

In the past
ddply(cs_all, .(id), function(x) {findfirst(x, outcome = "BLAH")})

has worked perfectly, rbinding (I presume) the records together.

In dplyr I have to do

cs_all %>%
group_by(id) %>%
do(as.data.frame(findfirst(., outcome = "BLAH")))

I tried

cs_all %>%
group_by(id) %>%
do(findfirst(., outcome = "BLAH")) %>%
rbind()

but I guess it fails because do wants a data frame as the output? Even editing my findfirst function to output a data frame instead of a vector didn't work.

I'm obviously failing to make the connection between some of the nuances of these functions and how they handle different types of data. That said, there's probably a much better way to find the first record than the cobbled together functions I'm using, but it does feel somewhat clean and followable from my end :/

@andrewla
Copy link

andrewla commented Mar 5, 2015

@nzcoops -- you should consider posting this to stackoverflow; it's a good place for questions like this and @hadley is very active there.

I think what you want is probably filter combined with row_number:

cs_all %>% group_by(id) %>% filter(outcome="BLAH") %>% filter(row_number() == 1)

or, for a parallel with find_first, a combination of filter and head will work

cs_all %>% group_by(id) %>% filter(outcome="BLAH") %>% do(head(., 1))

Without seeing the definition of findfirst, it's hard to offer more.

@nzcoops
Copy link

nzcoops commented Mar 6, 2015

Thanks you (lots!) for the reply.

findfirst <- function(x, outcome, var = "diag_lab"){
if(any(x[, var] == outcome, na.rm=T)){x <- x[match(1,x[, var] == outcome), ];return(x)}
}

I could have hardcoded x$diag_lab instead of the way I did it but I wanted it to be generic.

I am on stackoverflow, googling and a few other SO questions (including one of mine related to do() lead me here. Both your suggestions work, the second I like, the first is just a step I would have never dreamt of taking :) My findfirst is redundant! But did work well in ddply (though there was probably a similar method to your examples within the plyr framework that I could have used).

I guess while these solutions work, it doesn't change the underlying want for this framework to rbind? vectors without the need to wrap as.data.frame? Like in the scenario where one might be doing manipulation within the findfirst function as well (as in, more complex than just head/rownumber as a proxy).

Clearly I need to go further down the rabbit hole!

@andrewla
Copy link

andrewla commented Mar 6, 2015

dplyr::do does rbind the dataframes together.

Looking at your code and trying some things, I see the problem here: the problem is not when your function returns results, it's when it doesn't. The head solution above has the advantage that given an empty input data.frame, it returns an empty data frame. the findfirst function will return NULL instead, and that's where the problem is.

So this might be something that is actually fixable on the dplyr side, but it's probably worth bringing it up as a separate issue, namely, that functions returning NULL in a do for a group should drop the value, rather than erroring.

@nzcoops
Copy link

nzcoops commented Mar 25, 2015

@andrewla me again...

So I'm trying to something similar but different, and running into another error.
Fundamentally this is the same problem, but rather than selecting the first record (pulling it out), I'm creating a new variable and tagging it with a 1. It reads pretty ugly but I want to be able to use this over and over hence appending 'f' at the start in this way. But I don't think that is the problem here.

tagFirst <- function(x, y){
  x[, paste0("f",y)] = 0  # create the new column, all 0's
  if(any(x[, y] == 1, na.rm = T)) x[match(1, x[ ,y]), paste0("f",y)] = 1 # if any of the values = 1, find (match) the first place within the new column and mark it as 1
  return(x)
}
check1 <- tagFirst(test[100:200,], "sch_psy") # this works fine, ignore the lack of rootnum here, it just proves the function works
check2 <- ddply(test[100:200,], .(rootnum), function(x) tagFirst(x, "sch_psy")) # this (my old method) works fine, expected output for the full 12000 records for >400 rootnums
check3 <- test[100:200,] %>%
  group_by(rootnum) %>%
  do(tagFirst(., "sch_psy")) # this fails, error (below) doesn't make sense to me as there will never be a missing value in that subscript (no missing data in the sch_psy column)

check3 <- test[100:200,] %>%

  • group_by(rootnum) %>%
  • do(as.data.frame(tagFirst(., "sch_psy")))
    Error in [<-.data.frame(*tmp*, match(1, x[, y]), paste0("f", y), value = 1) :
    missing values are not allowed in subscripted assignments of data frames
    Called from: [<-.data.frame(*tmp*, match(1, x[, y]), paste0("f", y), value = 1)
    Browse[1]> Q

Because x will always be returned a as a data frame in this instance, it's not the (exact) same issue as last time.

Any insights or thoughts appreciated. I can export those 100 records and share them somewhere if you a) can't spot it from the code and b) care enough to pursue it :)
Thanks for your time!

@statwonk
Copy link

statwonk commented Mar 1, 2016

Yet again I arrive at this Issue after Googling, Results are not data frames. Assigning to a variable helps, but it's easy to forget.

@kendonB
Copy link

kendonB commented Jun 13, 2016

You can still use do to get arbitrary output by using side effects:

lm_models <- list()
iris %>% group_by(Species) %>% 
  do({
    out <- lm(Petal.Length ~ Petal.Width, data = .)
    lm_models[[length(lm_models) + 1]] <<- out
    data.frame()
  })

@hadley
Copy link
Member

hadley commented Jun 13, 2016

@kendonB it's much simpler to do:

iris %>% group_by(Species) %>% 
  do(model = lm(Petal.Length ~ Petal.Width, data = .))

You should never use <<- in routine practice.

But these days I'd recommend the combination of dplyr, tidyr, and purrr described in https://www.youtube.com/watch?v=rz3_FDVt9eg

@kendonB
Copy link

kendonB commented Jun 13, 2016

Oh I see. I was trying this:

iris %>% group_by(Species) %>% 
  do({
    this_op_has_more_steps <- 2
    model = lm(Petal.Length ~ Petal.Width, data = .)
    })

when it should be this:

iris %>% group_by(Species) %>% 
  do(model = {
    this_op_has_more_steps <- 2
    lm(Petal.Length ~ Petal.Width, data = .)
    })

@kendonB
Copy link

kendonB commented Jun 13, 2016

And for completeness, I believe this was what hadley was talking about:

library(purrr)
library(tidyr)

complicated_operation <- function(x){
  this_op_has_more_steps <- 2
  lm(Petal.Length ~ Petal.Width, data = x)
}

iris %>% 
  group_by(Species) %>% 
  nest() %>% 
  # Now a df with "data" as a column of data.frames
  mutate(model = data %>% map(complicated_operation))

@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants