do() gives error if returned values are not data frames #397

wch · 2014-04-16T22:05:30Z

The docs say:

You can use do to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list.

But it doesn't work this way. For example:

mtcars %>% group_by(cyl) %>% do(nrow(.))
# Error: Results are not data frames at positions: 1, 2, 3

The result I would expect is something like this (which presently works):

x <- mtcars %>% group_by(cyl) %>% do(data.frame(nrow = nrow(.)))
as.list(x$nrow)
# [[1]]
# [1] 11
#
# [[2]]
# [1] 7
#
# [[3]]
# [1] 14

Note that I'm just using nrow to illustrate; I know you can use group_size to get this result. The thing I'm actually doing returns a vector for each group, so there's not a simple drop-in replacement for it.

The text was updated successfully, but these errors were encountered:

wch · 2014-04-16T22:11:54Z

Oh, maybe it's that you have to provide a name if the result isn't a data frame? That seems kind of counterintuitive, or at least it could be documented better, and could be mentioned in the error message.

x <- mtcars %>% group_by(cyl) %>% do(nrow = nrow(.))
# Source: local data frame [3 x 2]
# Groups: <by row>
# 
#   cyl     nrow
# 1   4 <int[1]>
# 2   6 <int[1]>
# 3   8 <int[1]>

x$nrow
# [[1]]
# [1] 11
# 
# [[2]]
# [1] 7
# 
# [[3]]
# [1] 14

It would also be useful to have a function for extracting columns. You can use [[, but the syntax isn't very elegant:

mtcars %>% group_by(cyl) %>% do(nrow = nrow(.)) %>% `[[`("nrow")

romainfrancois · 2014-04-17T08:12:21Z

Maybe this perhaps. Not sure I is more elegant though:

> `%[[%` <- function(x, .) x[[.]]
> mtcars %>% group_by(cyl) %>% do(nrow = nrow(.)) %[[% "nrow"
[[1]]
[1] 11

[[2]]
[1] 7

[[3]]
[1] 14

hadley · 2014-04-17T13:05:02Z

I'd prefer to leave functions that operate on vectors/columns in magrittr.

I'll try and make the docs more clear.

statwonk · 2014-07-26T18:26:42Z

I'm with @wch, my expectation today was that I could run run do(gam(y ~ s(x), data = .) and it would return a list of fits.

Edit: actually, I get what I expect if I just assign to a variable name (as @wch mentions above).

One improvement might be to return a list with the group_by variables concatenated as list names instead of a separate vector with the same type of concatenation.

nzcoops · 2015-03-05T01:09:33Z

Hi @hadley , I've been battling with this one today. I'm happy to paste a full working example but I'm guessing since this is a year old that you're possibly done and dusted with the topic. An example of something I do regularly, is I will have hospitalisation records for people (multiple records per person), and I might be interested in the first record for each person for a particular diagnosis. I write a small function to find the first record and pull that row out (of that persons larger set of records) and return 'all the first records'.

In the past
ddply(cs_all, .(id), function(x) {findfirst(x, outcome = "BLAH")})

has worked perfectly, rbinding (I presume) the records together.

In dplyr I have to do

cs_all %>%
group_by(id) %>%
do(as.data.frame(findfirst(., outcome = "BLAH")))

I tried

cs_all %>%
group_by(id) %>%
do(findfirst(., outcome = "BLAH")) %>%
rbind()

but I guess it fails because do wants a data frame as the output? Even editing my findfirst function to output a data frame instead of a vector didn't work.

I'm obviously failing to make the connection between some of the nuances of these functions and how they handle different types of data. That said, there's probably a much better way to find the first record than the cobbled together functions I'm using, but it does feel somewhat clean and followable from my end :/

andrewla · 2015-03-05T15:11:49Z

@nzcoops -- you should consider posting this to stackoverflow; it's a good place for questions like this and @hadley is very active there.

I think what you want is probably filter combined with row_number:

cs_all %>% group_by(id) %>% filter(outcome="BLAH") %>% filter(row_number() == 1)

or, for a parallel with find_first, a combination of filter and head will work

cs_all %>% group_by(id) %>% filter(outcome="BLAH") %>% do(head(., 1))

Without seeing the definition of findfirst, it's hard to offer more.

nzcoops · 2015-03-06T01:14:21Z

Thanks you (lots!) for the reply.

findfirst <- function(x, outcome, var = "diag_lab"){
if(any(x[, var] == outcome, na.rm=T)){x <- x[match(1,x[, var] == outcome), ];return(x)}
}

I could have hardcoded x$diag_lab instead of the way I did it but I wanted it to be generic.

I am on stackoverflow, googling and a few other SO questions (including one of mine related to do() lead me here. Both your suggestions work, the second I like, the first is just a step I would have never dreamt of taking :) My findfirst is redundant! But did work well in ddply (though there was probably a similar method to your examples within the plyr framework that I could have used).

I guess while these solutions work, it doesn't change the underlying want for this framework to rbind? vectors without the need to wrap as.data.frame? Like in the scenario where one might be doing manipulation within the findfirst function as well (as in, more complex than just head/rownumber as a proxy).

Clearly I need to go further down the rabbit hole!

andrewla · 2015-03-06T16:11:46Z

dplyr::do does rbind the dataframes together.

Looking at your code and trying some things, I see the problem here: the problem is not when your function returns results, it's when it doesn't. The head solution above has the advantage that given an empty input data.frame, it returns an empty data frame. the findfirst function will return NULL instead, and that's where the problem is.

So this might be something that is actually fixable on the dplyr side, but it's probably worth bringing it up as a separate issue, namely, that functions returning NULL in a do for a group should drop the value, rather than erroring.

nzcoops · 2015-03-25T13:13:43Z

@andrewla me again...

So I'm trying to something similar but different, and running into another error.
Fundamentally this is the same problem, but rather than selecting the first record (pulling it out), I'm creating a new variable and tagging it with a 1. It reads pretty ugly but I want to be able to use this over and over hence appending 'f' at the start in this way. But I don't think that is the problem here.

tagFirst <- function(x, y){
  x[, paste0("f",y)] = 0  # create the new column, all 0's
  if(any(x[, y] == 1, na.rm = T)) x[match(1, x[ ,y]), paste0("f",y)] = 1 # if any of the values = 1, find (match) the first place within the new column and mark it as 1
  return(x)
}

check1 <- tagFirst(test[100:200,], "sch_psy") # this works fine, ignore the lack of rootnum here, it just proves the function works
check2 <- ddply(test[100:200,], .(rootnum), function(x) tagFirst(x, "sch_psy")) # this (my old method) works fine, expected output for the full 12000 records for >400 rootnums
check3 <- test[100:200,] %>%
  group_by(rootnum) %>%
  do(tagFirst(., "sch_psy")) # this fails, error (below) doesn't make sense to me as there will never be a missing value in that subscript (no missing data in the sch_psy column)

check3 <- test[100:200,] %>%

group_by(rootnum) %>%

do(as.data.frame(tagFirst(., "sch_psy")))
Error in [<-.data.frame(*tmp*, match(1, x[, y]), paste0("f", y), value = 1) :
missing values are not allowed in subscripted assignments of data frames
Called from: [<-.data.frame(*tmp*, match(1, x[, y]), paste0("f", y), value = 1)
Browse[1]> Q

Because x will always be returned a as a data frame in this instance, it's not the (exact) same issue as last time.

Any insights or thoughts appreciated. I can export those 100 records and share them somewhere if you a) can't spot it from the code and b) care enough to pursue it :)
Thanks for your time!

statwonk · 2016-03-01T22:46:06Z

Yet again I arrive at this Issue after Googling, Results are not data frames. Assigning to a variable helps, but it's easy to forget.

kendonB · 2016-06-13T04:06:03Z

You can still use do to get arbitrary output by using side effects:

lm_models <- list()
iris %>% group_by(Species) %>% 
  do({
    out <- lm(Petal.Length ~ Petal.Width, data = .)
    lm_models[[length(lm_models) + 1]] <<- out
    data.frame()
  })

hadley · 2016-06-13T12:12:09Z

@kendonB it's much simpler to do:

iris %>% group_by(Species) %>% 
  do(model = lm(Petal.Length ~ Petal.Width, data = .))

You should never use <<- in routine practice.

But these days I'd recommend the combination of dplyr, tidyr, and purrr described in https://www.youtube.com/watch?v=rz3_FDVt9eg

kendonB · 2016-06-13T17:29:05Z

Oh I see. I was trying this:

iris %>% group_by(Species) %>% 
  do({
    this_op_has_more_steps <- 2
    model = lm(Petal.Length ~ Petal.Width, data = .)
    })

when it should be this:

iris %>% group_by(Species) %>% 
  do(model = {
    this_op_has_more_steps <- 2
    lm(Petal.Length ~ Petal.Width, data = .)
    })

kendonB · 2016-06-13T18:34:33Z

And for completeness, I believe this was what hadley was talking about:

library(purrr)
library(tidyr)

complicated_operation <- function(x){
  this_op_has_more_steps <- 2
  lm(Petal.Length ~ Petal.Width, data = x)
}

iris %>% 
  group_by(Species) %>% 
  nest() %>% 
  # Now a df with "data" as a column of data.frames
  mutate(model = data %>% map(complicated_operation))

hadley closed this as completed Jul 28, 2014

lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

do() gives error if returned values are not data frames #397

do() gives error if returned values are not data frames #397

wch commented Apr 16, 2014

wch commented Apr 16, 2014

romainfrancois commented Apr 17, 2014

hadley commented Apr 17, 2014

statwonk commented Jul 26, 2014

nzcoops commented Mar 5, 2015

andrewla commented Mar 5, 2015

nzcoops commented Mar 6, 2015

andrewla commented Mar 6, 2015

nzcoops commented Mar 25, 2015

statwonk commented Mar 1, 2016

kendonB commented Jun 13, 2016

hadley commented Jun 13, 2016

kendonB commented Jun 13, 2016

kendonB commented Jun 13, 2016

do() gives error if returned values are not data frames #397

do() gives error if returned values are not data frames #397

Comments

wch commented Apr 16, 2014

wch commented Apr 16, 2014

romainfrancois commented Apr 17, 2014

hadley commented Apr 17, 2014

statwonk commented Jul 26, 2014

nzcoops commented Mar 5, 2015

andrewla commented Mar 5, 2015

nzcoops commented Mar 6, 2015

andrewla commented Mar 6, 2015

nzcoops commented Mar 25, 2015

statwonk commented Mar 1, 2016

kendonB commented Jun 13, 2016

hadley commented Jun 13, 2016

kendonB commented Jun 13, 2016

kendonB commented Jun 13, 2016