New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redesign of hybrid and standard evaluation #3799
Conversation
I haven't looked in details yet but just a quick comment:
I don't think the data mask should be dynamically scoped. The mask should be found in the immediate caller frame. Did you encounter a case where this would be an issue? |
🤔 this is one of the first thing i looked. Should it just be parent.frame() then ? |
Yes this way |
That’d be nice. I’ll quickly test this in the morning. Is there some other information that would be useful in the data mask, e.g the column names (maybe this can help tidyselect) ? |
The column names should be available through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small things I noticed. I'll try and find time for a deeper dive tomorrow.
R/group-indices.R
Outdated
@@ -15,7 +15,12 @@ group_indices <- function(.data, ...) { | |||
} | |||
#' @export | |||
group_indices.default <- function(.data, ...) { | |||
group_indices_(.data, .dots = compat_as_lazy_dots(...)) | |||
if (missing(.data)) { | |||
context <- get_data_context(sys.frames(), "group_indices()") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we wrap this up a bit so we could just have
rep.int(from_context("..group_number"), from_context("..group_size"))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. This was saving a trip to look for the right environment, although @lionel- hints that this might not be needed and I can just do get("..group_size", parent.frame())
.
also the second argument of get_data_context()
drives the error message.
// [[Rcpp::export]]
SEXP get_data_context(SEXP frames, const char* expr) {
static SEXP symb_group_size = Rf_install("..group_size");
for (; !Rf_isNull(frames) ; frames = CDR(frames)) {
SEXP group_size = Rf_findVarInFrame3(CAR(frames), symb_group_size, FALSE) ;
if (group_size != R_UnboundValue) return CAR(frames);
}
throw Rcpp::exception(tfm::format("%s should only be called in a data context", expr).c_str(), false);
return R_NilValue;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to store all the information in a single object? Like .__dplyr_data_mask__.
.
@lionel- should we allow for things like:
because of course in that case from_context <- function(x) get(x, parent.frame(2)) and perhaps with a friendlier message than |
No but we should allow: summarise(mtcars, {
half <- function() n() / 2
half()
}) With dynamic scoping you could easily use We can also add an half <- function(env = parent.frame()) {
n(env = env) / 2
}
summarise(mtcars, half()) I agree it seems useful to wrap around these functions. |
I don't think it's necessary to support functions created inside |
It occurs naturally from lexical scoping. Equivalently, because the pipe evaluates in a child: mutate(mtcars, cyl %>% divide_by(n()) |
I am now reconsidering this because we'll hit the same issue as for lexical dispatch, i.e. the following will produce different results: by_n <- function(x, env = caller_env()) {
x / n(env = env)
}
summarise(data, map(list_col, by_n)) # Can't find data mask
summarise(data, map(list_col, ~by_n(.x))) # Works There's just no way in current R to find the right lexical context when functionals are involved. So I think you're right that dynamic scoping of data mask information gives the least surprising semantics. However it should probably be registered on the stack like in tidyselect because collecting and searching all the frames is expensive. |
@lionel- do you have some pointers of what you mean by "it should probably be registered on the stack like in tidyselect" ? |
The gist is to poke an enviroment with the current data mask information and restore the old information on exit. See https://github.com/tidyverse/tidyselect/blob/master/R/vars.R |
Right I see so instead of going to the data mask, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only looked up to the ColumnBinding
class, will continue reviewing later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed up to the source files and tests. Need to open new issues for "postpone" comments, if we agree that these are worth fixing.
SEXP column_subset_vector_impl(const Rcpp::Vector<RTYPE>& x, const Index& index, Rcpp::traits::true_type) { | ||
typedef typename Rcpp::Vector<RTYPE>::stored_type STORAGE; | ||
int n = index.size(); | ||
Rcpp::Vector<RTYPE> res = no_init(n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Postpone: Need to check if this still triggers rchk
errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've used the other form just in case. e66ac91
This does not work for Matrix
for some reason, but iirc you have a fix for Rcpp right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RcppCore/Rcpp#893 has been merged. Why doesn't it work for Matrix
?
…rguments supplied, and otherwiwe calls dplyr::group_indices
267cd02
to
9654a5b
Compare
…up_number`. Now using the `context_env` environment that lives inside the dplyr namespace instead of searching for the data mask overscope based on @lionel- suggestion Also simplify the use of the context so that we can define `n()` as : ``` n <- function() { from_context("..group_size") } ```
I reworked the handling of the context information so that n <- function() {
from_context("..group_size")
} on the R side the handling of the context is just: context_env <- new_environment()
context_env[[".group_size"]] <- NULL
context_env[[".group_index"]] <- NULL
from_context <- function(what){
context_env[[what]] %||% abort(glue("{expr} should only be called in a data context", expr = deparse(sys.call(-1))))
} and the environment is managed by the
previous_group_size = get_context_env()["..group_size"];
previous_group_number = get_context_env()["..group_number"];
get_context_env()["..group_size"] = previous_group_size;
get_context_env()["..group_number"] = previous_group_number;
// update the data context variables, these are used by n(), ...
get_context_env()["..group_size"] = indices.size();
get_context_env()["..group_number"] = indices.group() + 1; where Rcpp::Environment& get_context_env() const {
static Rcpp::Environment context_env(Rcpp::Environment::namespace_env("dplyr")["context_env"]);
return context_env;
} ping @lionel- |
…ehaviour when external pointers to data mask call back to the data mask after its lifetime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments
inst/include/dplyr/data/DataMask.h
Outdated
// TODO: when .data knows how to look ancestry, this should use mask_resolved instead | ||
// | ||
// it does not really install an active binding because there is no need for that | ||
inline void install(SEXP mask_active, SEXP mask_resolved, int /* pos */, boost::shared_ptr< DataMaskProxy<NaturalDataFrame> >& /* data_mask_proxy */) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment: Can you add some newlines in argument lists please? Long lines are hard to read. Same for comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
inst/include/dplyr/data/DataMask.h
Outdated
// top : the environment containing active bindings. | ||
// | ||
// overscope : where .data etc ... are installed | ||
overscope = internal::rlang_api().new_data_mask(mask_resolved, mask_active, parent_env); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There'll be an API update on this. The C callable will continue to work but the parent_env is now ignored. See r-lib/rlang#603
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably need to export eval_tidy()
as C callable as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was about to ask you actually. That's good this way I only have to call new_data_mask
once.
In that case I suppose we do indeed need eval_tidy
for the handling of .env
?
inst/include/dplyr/data/DataMask.h
Outdated
// The 3 environments of the data mask | ||
Environment mask_active; // where the active bindings live | ||
Environment mask_resolved; // where the resolved active bindings live | ||
Environment overscope; // actual data mask, contains the .data pronoun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overscope => data_mask
…un(...) instead of dplyr::internal::rlang_api().fun(...)
…sigbn of hybrid evaluation in #3799.
I'll merge this shortly, probably tomorrow, unless there are strong concerns. |
Thanks! I did a coarse review and summarized the results in a new issue. I like the new implementation and the fact that 2000 lines of code are gone now. |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
Please don't review this through diffs, but following these pointers (also in the
dplyr_0.8.0_new_hybrid.Rmd
document) :@krlmlr this goes a bit further than what we discussed in Amsterdam, as writing these notes led to some edits.
@lionel- great if you find the time to have a look at this, esp. I guess the
DataMask
class. As we discussed, I might need to submit a pr so that.data
goes through the ancestry instead of just the bottom environment.Happy to expand on these notes as needed.
overview
This is a complete redesign of how we evaluate expression in dplyr. We no longer attempt to evaluate part of an expression. We now either:
n()
ormean(x)
and use C++ code to evaluate it (this is what we call hybrid evaluation now, but I guess another term would be better.data mask
When used internally in the c++ code, a tibble become one of the 3 classes
GroupedDataFrame
,RowwiseDataFrame
orNaturalDataFrame
. Most internal code is templated by these classes, e.g.summarise
is:The
DataMask<SlicedTibble>
template class is used by both hybrid and standard evaluation to extract the relevant information from thecolumns (original columns or columns that have just been made by
mutate()
orsummarise()
)standard evaluation
meta information about the groups
The functions
n()
,row_number()
andgroup_indices()
when called without argumentslack contextual information, i.e. the current group size and index, so they look for that information in
the "data context" (which in fact is the data mask).
The DataMask class is responsible for updating the variables
..group_size
and..group_number
all other functions can just be called with standard evaluation in the data mask
active and resolved bindings
When doing standard evaluation, we need to install a data mask that evaluates the symbols from the data
to the relevant subset. The simple solution would be to update the data mask at each iteration with
subsets for all the variables but that would be potentially expensive and a waste, as we might not need
all of the variables at a given time, e.g. in this case:
We only need to materialize
Sepal.Length
, we don't need the other variables.DataMask
installs an active binding for each variable in one of (the top)the environment in the data mask ancestry, the active binding function is generated by this function
so that it holds an index and a pointer to the data mask in its enclosure.
When hit, the active binding calls the materialize_binding function :
The
DataMask<>::materialize(idx)
method returns the materialized subset, but also:active binding. The point is to call the active binding only once.
idx
has been materialized, so that beforeevaluating the same expression in the next group, it is proactively
materialized, because it is very likely that we need the same variables for all groups
When we move to the next expression to evaluate,
DataMask
forgets about the materializedbindings so that the active binding can be triggered again as needed.
use case of the DataMask class
DataMask<SlicedTibble> mask(tbl);
reset(parent_env)
to prepare the data mask toevaluate expression with a given parent environment. This "forgets" about the materialized
bindings.
rematerializing the already materialized bindings
hybrid evaluation
Use of DataMask
Hybrid evaluation also uses the
DataMask<>
class, but it only needs to quickly retrievethe data for an entire column. This is what the
maybe_get_subset_binding
method does.when the symbol map contains the binding, we get a
ColumnBinding<SlicedTibble>*
. These objects hold these fields:hybrid evaluation only needs
summary
anddata
.Expression
When attempting to evaluate an expression with the hybrid evaluator, we first construct an
Expression
object.This class has methods to quickly check if the expression can be managed, e.g.
This checks that the call matches
sum(<column>)
orbase::sum(<column>)
where<column>
is a column from the data mask.In that example, the
Expression
class checks that:Otherwise it means it is an expression that we can't handle, so we return
R_UnboundValue
which is the hybrid evaluationway to signal it gives up on handling the expression, and that it should be evaluated with standard evaluation.
Expression has the following methods:
inline bool is_fun(SEXP symbol, SEXP pkg, SEXP ns)
: are we callingfun
? If so doesfun
curently resolve to thefunction we intend to (it might not if the function is masked, which allows to do trghings like this:)
bool is_valid() const
: is the expression valid. the Expressio, constructor rules out a few expressions that hjave no chance of beinghandled, such as pkg::fun() when
pkg
is none ofdplyr
,stats
orbase
SEXP value(int i) const
: the expression at position ibool is_named(int i, SEXP symbol) const
: is the i'th argument namedsymbol
bool is_scalar_logical(int i, bool& test) const
: is the i'th argument a scalar logical, we need this for handling e.g.na.rm = TRUE
bool is_scalar_int(int i, int& out) const
is the i'th argument a scalar int, we need this forn = <int>
bool is_column(int i, Column& column) const
is the i'th argument a column.hybrid_do
The
hybrid_do
function uses methods fromExpression
to quickly assess if it can handle the expression and then calls the relevantfunction from
dplyr::hybrid::
to create the result at once:The functions in the C++
dplyr::hybrid::
namespace create objects whose classes hold:These classes all have these methods:
summarise()
to return a result of the same size as the number of groups. This is used when op is aSummary
. This returnsR_UnboundValue
to give upwhen the class can't do that, e.g. the classes behind
lag
window()
to return a result of the same size as the number of rows in the original data set.The classes typically don't provide these methods directly, but rather inherit, via CRTP one of:
HybridVectorScalarResult
, so that the class only has to provide aprocess
method, for example theCount
class:HybridVectorScalarResult
uses the result ofprocess
in bothsummarise()
andwindow()
HybridVectorVectorResult
expects afill
method, e.g. implementation ofntile(n=<int>)
uses this classthat derive from HybridVectorVectorResult.
The result of
fill
is only used inwindow()
. Thesummarise()
method simpliy returnsR_UnboundValue
to give up.