equivalent of pig's ILLUSTRATE #549

Closed
jhofman opened this issue Aug 18, 2014 · 10 comments
Labels
feature a feature request or enhancement

@jhofman

jhofman commented Aug 18, 2014

pig has a great command called ILLUSTRATE that demonstrates how a sample of data will be transformed through a series of commands. it's very useful for debugging, and a great educational tool to generate example data transformations.

needless to say, it'd be awesome if dplyr had a similar function.

more info here:

http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#ILLUSTRATE
http://wiki.apache.org/pig/PigIllustrate

@jimhester
Contributor

The way that pig's ILLUSTRATE finds a sample of data is actually very sophisticated. See http://i.stanford.edu/~olston/publications/sigmod09.pdf

Implementing this in R might be a worthwhile goal; I am pretty impressed with the paper. However, I would expect it to be a standalone package. It seems far too big an ask to simply be a dplyr function, in my opinion.

There are also a number of technical hurdles in implementing this, such as breaking down each step in a dplyr pipeline. While they are not impossible to overcome, they are definitely tricky to get right.

Definitely a cool idea though!

@hadley hadley added bug and removed bug labels Aug 26, 2014
@hadley hadley added this to the 0.4 milestone Aug 26, 2014
@hadley hadley self-assigned this Aug 26, 2014
@hadley
Member

hadley commented Aug 26, 2014

I love this idea, and I think implementing something similar to the pig approach will be a lot of work, but extremely worthwhile.

@gshotwell

I'm finding myself doing the following when using dplyr:

  1. Creating a subset of the data that represents the largest group.
  2. Running each line of the dplyr commands in sequence to see what's happening.

I wrote a little function to accomplish this, and I thought I'd share it because it seems similar to the ILLUSTRATE functionality and might be helpful here. The part I can't figure out is how to get the function to capture a piped command as a character string instead of evaluating it. So mtcars %>% mutate(mpg = mpg) should return "mtcars %>% mutate(mpg = mpg)". It seems like it should be simple, but it's beyond me, so I just included the dplyr expression in the code as a character string.

library(dplyr)
library(stringr)

# the pipeline to step through, given as a character string
expr <- "mtcars%>%
  mutate(mpg2 = mpg*2)%>%
  group_by(cyl)%>%
  select(mpg)%>%
  mutate(mpg2 = mpg)%>%
  summarise(n = n())"

# split the pipeline into its individual steps
expr <- str_split(expr, "%>%\n")[[1]]
group_expr <- grep("group_by", expr)

f <- paste(expr[1:group_expr], collapse = "%>%")

# run everything up to the point the data frame is grouped
data <- eval(parse(text = f))
print(f)
data

size_table <- data %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# select the largest subgroup, which tends to be a more representative sample
# (assumes a single grouping column)
grp <- as.character(groups(data))
data <- data[data[[grp]] == size_table[[grp]][1], ]

# run the remaining steps one at a time, printing each intermediate result
for (i in (group_expr + 1):length(expr)) {
  f <- paste("data", expr[i], sep = "%>%")
  data <- eval(parse(text = f))
  print(f)
  print(data)
}
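
On the open question above (getting a piped command back as text rather than evaluating it), one possible approach in base R is to combine substitute() and deparse(). This is only a minimal sketch; the helper name pipeline_text() is made up for illustration and does not come from dplyr or from this thread.

```r
# Hypothetical helper: return the text of an expression without evaluating it.
pipeline_text <- function(pipeline) {
  # substitute() captures the argument as written, without evaluating it;
  # deparse() turns it back into text (several strings if the input is long),
  # so the pieces are trimmed and pasted into a single string.
  paste(trimws(deparse(substitute(pipeline))), collapse = " ")
}

pipeline_text(mtcars %>% mutate(mpg = mpg))
# [1] "mtcars %>% mutate(mpg = mpg)"
```

Because the argument is never evaluated, this works even when dplyr is not loaded. It relies on `%>%` being an ordinary function call; R's native `|>` is rewritten by the parser, so the captured text would show the nested call instead.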

@jhofman
Author

jhofman commented Aug 28, 2014

@jimhester: agreed that pig's method of finding a good group is relatively complex.

that said, @gshotwell's suggestion above is a nice one: perhaps we can settle for some simple heuristic (e.g., pick a few large groups as a reasonable sample)?

@hadley
Member

hadley commented Aug 28, 2014

@jhofman I think the problem is that a representative sample means different things for different problems. e.g. a grouped summary probably needs to pick a handful of complete groups, a filter could randomly sample rows, a grouped filter could pick one big group, a join needs to make sure there are some matches, ...

I think the way to achieve this is to have a new type of tbl that wraps (i.e. points to) an existing table. Then methods for tbl_illustrate could do different things for different generics.
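
A rough sketch of what such a wrapper could look like with S3 methods for dplyr's generics follows. Everything here is an assumption made for illustration: the constructor illustrate(), the fields src and n, and the choice of what each verb prints. It is not an existing dplyr feature, and it covers only three verbs.

```r
library(dplyr)

# Hypothetical constructor: wraps (points to) an existing table and
# remembers how many rows or groups to show at each step.
illustrate <- function(data, n = 3) {
  structure(list(src = data, n = n), class = "tbl_illustrate")
}

# filter: apply the verb to the full table, then show a random sample
# of the rows that survived.
filter.tbl_illustrate <- function(.data, ...) {
  out <- filter(.data$src, ...)
  rows <- sample.int(nrow(out), min(.data$n, nrow(out)))
  print(out[rows, , drop = FALSE])
  illustrate(out, .data$n)
}

# group_by: nothing worth printing yet, just keep the wrapper around.
group_by.tbl_illustrate <- function(.data, ...) {
  illustrate(group_by(.data$src, ...), .data$n)
}

# summarise: summarise complete groups, then show a handful of them.
summarise.tbl_illustrate <- function(.data, ...) {
  out <- summarise(.data$src, ...)
  print(head(out, .data$n))
  illustrate(out, .data$n)
}

res <- illustrate(mtcars, n = 2) %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarise(n = n())
```

Each method runs the real verb on the wrapped table and decides separately what to display, which is where verb-specific sampling would go. A join method that guarantees some matching keys is left out of this sketch.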

@jhofman
Author

jhofman commented Aug 28, 2014

@hadley yes---sounds like a good approach. each verb should essentially get a corresponding "sampling" function, which might do a random sample, stratify by some columns, or simply take an entire group by certain columns.
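
The three sampling strategies mentioned here could look roughly like this in base R. The function names and default sizes are hypothetical, chosen only for the example, and which verb should use which strategy is left open.

```r
# Random sample: up to n rows picked uniformly at random.
sample_rows <- function(data, n = 5) {
  data[sample.int(nrow(data), min(n, nrow(data))), , drop = FALSE]
}

# Stratified sample: up to n rows from every level of the column `by`.
sample_stratified <- function(data, by, n = 2) {
  pieces <- lapply(split(data, data[[by]], drop = TRUE), sample_rows, n = n)
  do.call(rbind, pieces)
}

# Whole group: every row of the largest group in the column `by`.
sample_largest_group <- function(data, by) {
  counts <- table(data[[by]])
  biggest <- names(counts)[which.max(counts)]
  data[which(as.character(data[[by]]) == biggest), , drop = FALSE]
}

sample_rows(mtcars, 3)
sample_stratified(mtcars, "cyl", 1)
sample_largest_group(mtcars, "cyl")
```

A grouped summary would probably call something like sample_largest_group(), while an ungrouped filter could get by with sample_rows().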

@datalove

I could see this working very well in conjunction with RStudio's improved data viewer. Perhaps the Viewer would allow us to step through each call one at a time?

Perhaps another option is to implement an RStudio shortcut to iteratively print each piece of a piped command. For example:

new_data <- mtcars %>% filter(cyl == 6) %>% mutate(d_cyl = disp/cyl) 

With the cursor over a statement like that, typing cmd+opt+p would first print mtcars, then typing cmd+opt+p again would then print the result of mtcars %>% filter(cyl == 6) and then finally it would print the result of mtcars %>% filter(cyl == 6) %>% mutate(d_cyl = disp/cyl)
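
Outside of an IDE shortcut, the same stepping behaviour can be approximated with a small function. This is only a sketch under assumptions: pipe_steps() and print_steps() are hypothetical names, only `%>%` pipelines are handled, and each intermediate result is truncated with head().

```r
library(dplyr)

# Split a quoted `%>%` pipeline into its cumulative sub-pipelines,
# shortest first: a, a %>% b(), a %>% b() %>% c(), ...
pipe_steps <- function(expr) {
  steps <- list()
  while (is.call(expr) && identical(expr[[1]], as.name("%>%"))) {
    steps <- c(list(expr), steps)
    expr <- expr[[2]]  # left-hand side of the outermost pipe
  }
  c(list(expr), steps)
}

# Evaluate and print each cumulative step of the pipeline in turn.
print_steps <- function(pipeline, n = 3, env = parent.frame()) {
  for (step in pipe_steps(substitute(pipeline))) {
    cat(">", paste(trimws(deparse(step)), collapse = " "), "\n")
    print(head(eval(step, env), n))
  }
  invisible(NULL)
}

print_steps(mtcars %>% filter(cyl == 6) %>% mutate(d_cyl = disp / cyl))
```

Every step is re-evaluated from the start of the pipeline, which is fine for small tables but wasteful for large ones; caching the previous result would avoid the repeated work.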

@gshotwell

There's an interesting package related to this issue called loopr: http://cran.r-project.org/web/packages/loopr/vignettes/Looping.html

It seems like it would run into some problems with large datasets, but the syntax is still very interesting IMO.

@jhofman
Author

jhofman commented May 8, 2015

@datalove: that sounds like a great idea.

even an uninformed cmd+opt+p that simply did this with the first n rows of a data frame would be a nice start. then things can get smarter over time to deal with things like joins and filters that would require that consistent subsets are taken to ensure non-trivial output, following the techniques for Pig's ILLUSTRATE.

@hadley hadley modified the milestones: bluesky, 0.5 Oct 22, 2015
@hadley
Member

hadley commented Feb 2, 2017

I still think this is a cool idea, but realistically it's unlikely we'll ever have the time to implement it in dplyr.

@hadley hadley closed this as completed Feb 2, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018