equivalent of pig's ILLUSTRATE #549

jhofman · 2014-08-18T18:02:18Z

pig has a great command called ILLUSTRATE that demonstrates how a sample of data will be transformed through a series of commands. it's very useful for debugging, and a great educational tool to generate example data transformations.

needless to say, it'd be awesome if dplyr had a similar function.

more info here:

http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#ILLUSTRATE
http://wiki.apache.org/pig/PigIllustrate

jimhester · 2014-08-22T17:08:18Z

The way that pig's ILLUSTRATE finds a sample of data is actually very sophisticated. See http://i.stanford.edu/~olston/publications/sigmod09.pdf

Implementing this in R might be a worthwhile goal; I am pretty impressed with the paper. However, I would expect it to be a standalone package. It seems far too big an ask to be simply a dplyr function, in my opinion.

There are also a number of technical hurdles in implementing this, such as breaking down each step in a dplyr pipeline. While these are not impossible to solve, they are definitely tricky to get right.

Definitely a cool idea though!

hadley · 2014-08-26T20:19:55Z

I love this idea, and I think implementing something similar to the pig approach will be a lot of work, but extremely worthwhile.

gshotwell · 2014-08-28T00:14:08Z

I'm finding myself doing the following when using dplyr:

1. Creating a subset of the data that represents the largest group.
2. Running each line of the dplyr commands in sequence to see what's happening.

I wrote a little function to accomplish this, and I thought I'd share it because it seems similar to the ILLUSTRATE functionality and might be helpful here. The part I can't figure out is how to get the function to take a piped command and return it as a character string, so that mtcars %>% mutate(mpg = mpg) returns "mtcars %>% mutate(mpg = mpg)". It seems like it should be simple, but it's beyond me, so I just included the dplyr expression in the code as a character string.

library(dplyr)
library(stringr)

# The pipeline is kept as a character string so it can be split into steps.
expr <- "mtcars %>%
  mutate(mpg2 = mpg * 2) %>%
  group_by(cyl) %>%
  select(mpg) %>%
  mutate(mpg2 = mpg) %>%
  summarise(n = n())"

# Split into individual steps at every pipe that ends a line.
expr <- str_trim(str_split(expr, "%>%\\s*\n")[[1]])
group_expr <- which(grepl("group_by", expr))[1]

f <- paste(expr[1:group_expr], collapse = " %>% ")

# Run everything up to the point where the data frame is grouped.
data <- eval(parse(text = f))
print(f)
data

size_table <- data %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# Select the largest subgroup, which tends to be a more representative sample.
group_var <- as.character(groups(data))[1]
data <- data[data[[group_var]] == size_table[[group_var]][1], ]

# Run the remaining steps one at a time, printing each intermediate result.
# seq_len() keeps the loop empty when group_by is the last step.
for (i in group_expr + seq_len(length(expr) - group_expr)) {
  f <- paste("data", expr[i], sep = " %>% ")
  data <- eval(parse(text = f))
  print(f)
  print(data)
}
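On the open question above (getting a piped command back as a character string), one possible approach is to capture the unevaluated argument with base R's substitute() and turn it into text with deparse(). This is only a minimal sketch; the helper name pipeline_to_string is hypothetical and not part of dplyr.

```r
# Hypothetical helper (not part of dplyr): returns the code passed to it
# as a single character string, without evaluating it.
pipeline_to_string <- function(pipeline) {
  # substitute() captures the unevaluated expression;
  # deparse() converts it to text (possibly several lines, so collapse them).
  paste(deparse(substitute(pipeline)), collapse = " ")
}

pipeline_to_string(mtcars %>% mutate(mpg = mpg))
# → "mtcars %>% mutate(mpg = mpg)"
```

Because the expression is never evaluated, this works even when dplyr is not attached. The resulting string could then be fed to the splitting code above in place of the hand-written one.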

jhofman · 2014-08-28T13:44:41Z

@jimhester: agreed that pig's method of finding a good group is relatively complex.

that said, @gshotwell's suggestion above is a nice one: perhaps we can settle for some simple heuristic (e.g., pick a few large groups as a reasonable sample)?

hadley · 2014-08-28T13:56:46Z

@jhofman I think the problem is that a representative sample means different things for different problems. e.g. a grouped summary probably needs to pick a handful of complete groups, a filter could randomly sample rows, a grouped filter could pick one big group, a join needs to make sure there are some matches, ...

I think the way to achieve this is to have a new type of tbl that wraps (i.e. points to) an existing table. Then methods for tbl_illustrate could do different things for different generics.

jhofman · 2014-08-28T14:26:51Z

@hadley yes---sounds like a good approach. each verb should essentially get a corresponding "sampling" function, which might do a random sample, stratify by some columns, or simply take an entire group by certain columns.
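To make the "one sampling function per verb" idea concrete, here is a rough base R sketch. Every name in it (sample_rows, sample_whole_groups, largest_group, samplers, illustrate_sample) is hypothetical; nothing like this exists in dplyr, and joins, which would need matching keys in both tables, are left out.

```r
# Hypothetical sampling strategies, one per kind of verb.

# Plain filter: a random sample of rows.
sample_rows <- function(df, n = 5) {
  df[sample.int(nrow(df), min(n, nrow(df))), , drop = FALSE]
}

# Grouped summary: keep a handful of complete groups.
sample_whole_groups <- function(df, by, k = 2) {
  keys <- unique(df[[by]])
  keep <- keys[sample.int(length(keys), min(k, length(keys)))]
  df[df[[by]] %in% keep, , drop = FALSE]
}

# Grouped filter: keep the single largest group.
largest_group <- function(df, by) {
  counts <- table(df[[by]])
  top <- names(counts)[which.max(counts)]
  df[as.character(df[[by]]) == top, , drop = FALSE]
}

# Lookup table from verb name to sampling strategy.
samplers <- list(
  filter = sample_rows,
  summarise = sample_whole_groups,
  grouped_filter = largest_group
)

illustrate_sample <- function(df, verb, ...) {
  samplers[[verb]](df, ...)
}

set.seed(1)
illustrate_sample(mtcars, "filter", n = 3)
illustrate_sample(mtcars, "summarise", by = "cyl", k = 2)
illustrate_sample(mtcars, "grouped_filter", by = "cyl")
```

A tbl wrapper as suggested above could call the matching strategy from inside each verb's method instead of going through a lookup list; the list is used here only to keep the sketch free of package dependencies.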

datalove · 2015-03-28T06:34:34Z

I could see this working very well in conjunction with RStudio's improved data viewer. Perhaps the Viewer would allow us to step through each call one at a time?

Perhaps another option is to implement an RStudio shortcut to iteratively print each piece of a piped command. For example:

new_data <- mtcars %>% filter(cyl == 6) %>% mutate(d_cyl = disp/cyl)

With the cursor over a statement like that, typing cmd+opt+p would first print mtcars. Typing cmd+opt+p again would print the result of mtcars %>% filter(cyl == 6), and a third time would print the result of mtcars %>% filter(cyl == 6) %>% mutate(d_cyl = disp/cyl).
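Outside of an IDE shortcut, the same step-by-step printing can be approximated in plain R by walking the nested %>% calls. The sketch below assumes dplyr is installed; pipeline_prefixes and print_steps are hypothetical names, and only head() of each intermediate result is printed.

```r
library(dplyr)

# Hypothetical helper: returns every cumulative prefix of a %>% pipeline,
# shortest first (e.g. `a`, then `a %>% f()`, then `a %>% f() %>% g()`).
pipeline_prefixes <- function(expr) {
  prefixes <- list(expr)
  while (is.call(expr) && identical(expr[[1]], as.name("%>%"))) {
    expr <- expr[[2]]                      # left-hand side of the pipe
    prefixes <- c(list(expr), prefixes)
  }
  prefixes
}

# Hypothetical helper: evaluates and prints each prefix in turn.
print_steps <- function(pipeline, n = 3, env = parent.frame()) {
  for (step in pipeline_prefixes(substitute(pipeline))) {
    cat("\n>", paste(deparse(step), collapse = " "), "\n")
    print(head(eval(step, env), n))
  }
  invisible(NULL)
}

print_steps(mtcars %>% filter(cyl == 6) %>% mutate(d_cyl = disp / cyl))
```

Each prefix is re-evaluated from the start, so this is only practical for small inputs. It also handles only the %>% operator; the native |> pipe is rewritten by the parser into nested calls and leaves no operator to walk.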

gshotwell · 2015-05-07T16:19:27Z

An interesting package related to this issue is loopr: http://cran.r-project.org/web/packages/loopr/vignettes/Looping.html

It seems like this would run into some problems with large datasets but still very interesting syntax IMO.

jhofman · 2015-05-08T18:12:23Z

@datalove: that sounds like a great idea.

even an uninformed cmd+opt+p that simply did this with the first n rows of a data frame would be a nice start. then things can get smarter over time to deal with things like joins and filters that would require that consistent subsets are taken to ensure non-trivial output, following the techniques for Pig's ILLUSTRATE.
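A small base R example shows why the first-n-rows version is only a start. With mtcars, the first three rows contain no 8-cylinder cars, so a filter on cyl == 8 yields an empty result; deliberately picking some rows that pass and some that fail keeps the output non-trivial. The selection rule here is a simple stand-in chosen for illustration, not the algorithm Pig uses.

```r
n <- 3

# Naive approach: take the first n rows, then apply the filter.
naive_input  <- head(mtcars, n)
naive_output <- naive_input[naive_input$cyl == 8, ]
nrow(naive_output)   # 0 rows: the example shows nothing useful

# Filter-aware approach: include rows that pass and rows that fail.
passing <- head(mtcars[mtcars$cyl == 8, ], 2)
failing <- head(mtcars[mtcars$cyl != 8, ], 1)
smart_input  <- rbind(passing, failing)
smart_output <- smart_input[smart_input$cyl == 8, ]
nrow(smart_output)   # 2 rows kept, 1 visibly dropped
```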

hadley · 2017-02-02T21:01:07Z

I still think this is a cool idea, but realistically it's unlikely we'll ever have the time to implement it in dplyr.

hadley added bug and removed bug labels Aug 26, 2014

hadley added this to the 0.4 milestone Aug 26, 2014

hadley self-assigned this Aug 26, 2014

hadley modified the milestones: bluesky, 0.5 Oct 22, 2015

hadley closed this as completed Feb 2, 2017

lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
