equivalent of pig's ILLUSTRATE #549
Comments
The way that pig's ILLUSTRATE finds a sample of data is actually very sophisticated. See http://i.stanford.edu/~olston/publications/sigmod09.pdf. Implementing this in R might be a worthwhile goal; I am pretty impressed with the paper. However, I would expect it to live in a standalone package; it seems far too big an ask to simply be a dplyr function, in my opinion. There are also a number of technical hurdles in implementing this, such as breaking down each step of a dplyr pipeline; while these are not impossible, they are definitely tricky to get right. Definitely a cool idea though!
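To make the pipeline-decomposition hurdle concrete, here is a minimal sketch that assumes the pipeline is built with magrittr's %>%; pipeline_steps is a hypothetical helper, not part of dplyr. It walks the unevaluated call tree and returns the steps in order:

pipeline_steps <- function(expr) {
  steps <- list()
  # `a %>% b %>% c` parses as (`a %>% b`) %>% c, so peel steps off the right
  while (is.call(expr) && identical(expr[[1]], as.name("%>%"))) {
    steps <- c(list(expr[[3]]), steps)
    expr <- expr[[2]]
  }
  c(list(expr), steps)  # the leftmost node is the data source
}

pipeline_steps(quote(mtcars %>% group_by(cyl) %>% summarise(n = n())))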
I love this idea. I think implementing something similar to the pig approach would be a lot of work, but extremely worthwhile.
I'm finding myself doing the following when using dplyr:
I wrote a little function to accomplish this, and I thought I'd share it because it seems similar to the ILLUSTRATE functionality, so it might be helpful here. The part I can't figure out is how to get the function to capture a piped command as a character vector: mtcars %>% mutate(mpg = mpg) should return "mtcars %>% mutate(mpg = mpg)". It seems like it should be simple, but it's beyond me, so for now the dplyr expression is passed in as a character string.

library(dplyr)
library(stringr)

# The dplyr pipeline to illustrate, supplied as a character string.
expr <- "mtcars %>%
mutate(mpg2 = mpg * 2) %>%
group_by(cyl) %>%
select(mpg) %>%
mutate(mpg2 = mpg) %>%
summarise(n = n())"

# Split the pipeline into its individual steps.
expr <- str_split(expr, "%>%\n")[[1]]
group_expr <- which(grepl("group_by", expr))

# Run everything up to the point where the data frame is grouped.
f <- paste(expr[1:group_expr], collapse = "%>%")
data <- eval(parse(text = f))
print(f)
print(data)

# Select the largest sub-group, which tends to be a more representative sample.
size_table <- data %>%
  summarise(n = n()) %>%
  arrange(desc(n))
group_col <- as.character(groups(data)[[1]])
data <- data[data[[group_col]] == size_table[[group_col]][1], ]

# Apply the remaining steps one at a time, printing each intermediate result.
for (i in (group_expr + 1):length(expr)) {
  f <- paste("data", expr[i], sep = " %>% ")
  data <- eval(parse(text = f))
  print(f)
  print(data)
}
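For what it's worth, one way to capture the piped command as a character vector (the part described as tricky above) is to wrap the pipeline in a function and deparse the unevaluated argument. A minimal sketch; pipeline_text is a made-up helper, not part of dplyr:

pipeline_text <- function(pipeline) {
  # substitute() grabs the unevaluated expression, so the pipeline never runs
  text <- paste(deparse(substitute(pipeline)), collapse = " ")
  strsplit(text, "%>%", fixed = TRUE)[[1]]
}

pipeline_text(mtcars %>% mutate(mpg2 = mpg * 2) %>% group_by(cyl))
# returns something like c("mtcars ", " mutate(mpg2 = mpg * 2) ", " group_by(cyl)")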
@jimhester: agreed that pig's method of finding a good group is relatively complex. that said, @gshotwell's suggestion above is a nice one: perhaps we can settle for some simple heuristic (e.g., pick a few large groups as a reasonable sample)?
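As a rough illustration of that heuristic (not Pig's algorithm), here is a small dplyr sketch; illustrate_groups is a made-up name:

library(dplyr)

# Keep only the rows belonging to the n_groups largest groups.
illustrate_groups <- function(df, ..., n_groups = 2) {
  biggest <- df %>%
    count(...) %>%
    arrange(desc(n)) %>%
    slice(seq_len(n_groups)) %>%
    select(-n)
  semi_join(df, biggest, by = colnames(biggest))
}

illustrate_groups(mtcars, cyl)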
@jhofman I think the problem is that a representative sample means different things for different problems. e.g. a grouped summary probably needs to pick a handful of complete groups, a filter could randomly sample rows, a grouped filter could pick one big group, a join needs to make sure there are some matches, ... I think the way to achieve this is to have a new type of tbl that wraps (i.e. points to) an existing table. Then methods for each verb could define the appropriate sampling.
@hadley yes---sounds like a good approach. each verb should essentially get a corresponding "sampling" function, which might do a random sample, stratify by some columns, or simply take an entire group by certain columns.
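A very rough sketch of what that could look like; this is not dplyr's API, and the class name and methods are made up. The idea is a wrapper object that points to the underlying table, with each verb's method deciding how to sample before delegating to the real verb:

library(dplyr)

# Wrap an existing table; verbs dispatched on this class sample first.
tbl_illustrate <- function(data) {
  structure(list(data = data), class = "tbl_illustrate")
}

# Hypothetical method: an ungrouped filter only needs a random sample of rows.
filter.tbl_illustrate <- function(.data, ...) {
  sampled <- sample_n(.data$data, size = min(20, nrow(.data$data)))
  filter(sampled, ...)
}

# Hypothetical method: a grouped summary should see a few complete groups,
# so keep whole groups rather than individual rows.
summarise.tbl_illustrate <- function(.data, ...) {
  df <- .data$data
  keep <- head(group_keys(df), 2)
  summarise(semi_join(df, keep, by = colnames(keep)), ...)
}

tbl_illustrate(mtcars) %>% filter(mpg > 20)
group_by(mtcars, cyl) %>% tbl_illustrate() %>% summarise(mean_mpg = mean(mpg))

A real implementation would re-wrap each result so that later verbs in the pipeline keep sampling, but this hopefully shows the shape of the idea.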
I could see this working very well in conjunction with RStudio's improved data viewer. Perhaps the Viewer would allow us to step through each call one at a time? Another option would be an RStudio shortcut that iteratively prints each piece of a piped command: with the cursor over a piped statement, typing a keyboard shortcut could print the result of each step of the pipeline in turn.
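Here is a rough sketch of what such a command could do behind the scenes, assuming the pipeline uses magrittr's %>%; print_pipeline is a made-up helper, not an actual RStudio feature:

library(dplyr)

print_pipeline <- function(pipeline, n = 6) {
  env <- parent.frame()
  # capture the pipeline unevaluated, then re-run each prefix of it
  text <- paste(deparse(substitute(pipeline)), collapse = " ")
  steps <- strsplit(text, "%>%", fixed = TRUE)[[1]]
  for (i in seq_along(steps)) {
    prefix <- paste(steps[seq_len(i)], collapse = " %>% ")
    cat("\n==>", prefix, "\n")
    print(head(eval(parse(text = prefix), envir = env), n))
  }
  invisible(NULL)
}

print_pipeline(mtcars %>% group_by(cyl) %>% summarise(n = n()))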
There's an interesting package related to this issue called loopr: http://cran.r-project.org/web/packages/loopr/vignettes/Looping.html. It seems like this would run into some problems with large datasets, but it's still very interesting syntax IMO.
@datalove: that sounds like a great idea. even an uninformed cmd+opt+p that simply did this with the first n rows of a data frame would be a nice start. then things could get smarter over time to handle things like joins and filters, which would require taking consistent subsets to ensure non-trivial output, following the techniques from Pig's ILLUSTRATE.
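For the join case, a crude version of "consistent subsets" could look like the sketch below. This is a rough illustration rather than Pig's algorithm; illustrate_join, orders, and customers are hypothetical names:

library(dplyr)

illustrate_join <- function(x, y, by, n = 5) {
  # keep only x rows that actually have a match in y, then take a few of them
  x_matched <- semi_join(x, y, by = by)
  x_sample <- sample_n(x_matched, size = min(n, nrow(x_matched)))
  # restrict y to the sampled keys so the illustrated join is never empty
  y_sample <- semi_join(y, x_sample, by = by)
  inner_join(x_sample, y_sample, by = by)
}

# e.g. illustrate_join(orders, customers, by = "customer_id")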
I still think this is a cool idea, but realistically it's unlikely we'll ever have the time to implement it in dplyr.
pig has a great command called ILLUSTRATE that demonstrates how a sample of data will be transformed through a series of commands. it's very useful for debugging, and a great educational tool to generate example data transformations.
needless to say, it'd be awesome if dplyr had a similar function.
more info here:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#ILLUSTRATE
http://wiki.apache.org/pig/PigIllustrate