Redesign of hybrid and standard evaluation #3799

Merged
merged 216 commits into from Sep 14, 2018

@romainfrancois romainfrancois commented Sep 5, 2018

Please don't review this through the diffs, but by following these pointers (also in the dplyr_0.8.0_new_hybrid.Rmd document):

@krlmlr this goes a bit further than what we discussed in Amsterdam, as writing these notes led to some edits.

@lionel- great if you find the time to have a look at this, esp. I guess the DataMask class. As we discussed, I might need to submit a pr so that .data goes through the ancestry instead of just the bottom environment.

Happy to expand on these notes as needed.

overview

This is a complete redesign of how we evaluate expressions in dplyr. We no longer attempt to evaluate part of an expression. We now either:

  • recognize the entire expression, e.g. n() or mean(x), and use C++ code to evaluate it (this is what we call hybrid evaluation now, but I guess another term would be better);
  • if not, use standard evaluation in a suitable environment.
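
To make the "all or nothing" rule concrete, here is a minimal self-contained sketch. It is not the dplyr code: `Call`, `Columns`, `try_hybrid` and `evaluate` are hypothetical names, an R call is reduced to `fun(arg)`, and the standard evaluator is passed in as a callback.

```cpp
#include <functional>
#include <map>
#include <numeric>
#include <optional>
#include <string>
#include <vector>

using Column = std::vector<double>;
using Columns = std::map<std::string, Column>;

// Hypothetical stand-in for an R call of the form fun(arg), e.g. mean(x).
struct Call {
  std::string fun;
  std::string arg;
};

// Hybrid path: succeeds only when the whole call is recognized.
std::optional<double> try_hybrid(const Call& call, const Columns& data) {
  auto it = data.find(call.arg);
  if (it == data.end()) return std::nullopt;  // argument is not a column
  const Column& x = it->second;
  if (call.fun == "sum") return std::accumulate(x.begin(), x.end(), 0.0);
  if (call.fun == "mean") return std::accumulate(x.begin(), x.end(), 0.0) / x.size();
  return std::nullopt;  // unknown function: give up on the entire expression
}

// Either the hybrid result, or whatever the standard evaluator returns.
double evaluate(const Call& call, const Columns& data,
                const std::function<double(const Call&, const Columns&)>& standard_eval) {
  if (auto result = try_hybrid(call, data)) return *result;
  return standard_eval(call, data);
}
```

The point of the sketch is that `try_hybrid` never returns a partial result: it either handles the full call or hands everything over to the fallback.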

data mask

When used internally in the C++ code, a tibble becomes one of the 3 classes GroupedDataFrame, RowwiseDataFrame or NaturalDataFrame. Most internal code is templated by these classes, e.g. summarise is:

// [[Rcpp::export]]
SEXP summarise_impl(DataFrame df, QuosureList dots) {
  check_valid_colnames(df);
  if (is<RowwiseDataFrame>(df)) {
    return summarise_grouped<RowwiseDataFrame>(df, dots);
  } else if (is<GroupedDataFrame>(df)) {
    return summarise_grouped<GroupedDataFrame>(df, dots);
  } else {
    return summarise_grouped<NaturalDataFrame>(df, dots);
  }
}

The DataMask<SlicedTibble> template class is used by both hybrid and standard evaluation to extract the relevant information from the
columns (original columns or columns that have just been made by mutate() or summarise())

standard evaluation

meta information about the groups

The functions n(), row_number() and group_indices() when called without arguments
lack contextual information, i.e. the current group size and index, so they look for that information in
the "data context" (which in fact is the data mask).

n <- function() {
  get_data_context(sys.frames(), "n()")[["..group_size"]]
}

The DataMask class is responsible for updating the variables ..group_size and ..group_number

    // update the data context variables, these are used by n(), ...
    overscope["..group_size"] = indices.size();
    overscope["..group_number"] = indices.group() + 1;

all other functions can just be called with standard evaluation in the data mask

active and resolved bindings

When doing standard evaluation, we need to install a data mask that evaluates the symbols from the data
to the relevant subset. The simple solution would be to update the data mask at each iteration with
subsets for all the variables but that would be potentially expensive and a waste, as we might not need
all of the variables at a given time, e.g. in this case:

iris %>% group_by(Species) %>% summarise(Sepal.Length = +mean(Sepal.Length))

We only need to materialize Sepal.Length, we don't need the other variables.

DataMask installs an active binding for each variable in one of the environments
of the data mask ancestry (the top one). The active binding function is generated by the function below,
so that it holds an index and a pointer to the data mask in its enclosure.

.active_binding_fun <- function(index, subsets){
  function() {
    materialize_binding(index, subsets)
  }
}

When hit, the active binding calls the materialize_binding function:

// [[Rcpp::export]]
SEXP materialize_binding(int idx, XPtr<DataMaskBase> mask) {
  return mask->materialize(idx);
}

The DataMask<>::materialize(idx) method returns the materialized subset, but also:

  • installs the result in the bottom environment of the data mask, so that it masks the
    active binding. The point is to call the active binding only once.
  • remembers that the binding at position idx has been materialized, so that before
    evaluating the same expression in the next group, it is proactively
    materialized, because it is very likely that we need the same variables for all groups.

When we move to the next expression to evaluate, DataMask forgets about the materialized
bindings so that the active binding can be triggered again as needed.
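
The bookkeeping described above can be sketched without R at all. The class below is a hypothetical, simplified stand-in for DataMask (the name `LazyMask` and its members are made up for illustration): columns are plain vectors, a group is a list of row indices, and `get()` plays the role of the active binding.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Column = std::vector<double>;

class LazyMask {
public:
  explicit LazyMask(std::map<std::string, Column> columns)
    : columns_(std::move(columns)) {}

  // Start a new expression: forget which columns were materialized.
  void reset() {
    materialized_.clear();
    resolved_.clear();
  }

  // Move to another group: proactively rematerialize only the columns that
  // the current expression has already touched.
  void update(const std::vector<std::size_t>& indices) {
    indices_ = indices;
    resolved_.clear();
    for (const std::string& name : materialized_) {
      resolved_[name] = subset(name);
    }
  }

  // First access computes the subset and caches it; later accesses in the
  // same group reuse the cached ("resolved") value.
  const Column& get(const std::string& name) {
    auto it = resolved_.find(name);
    if (it == resolved_.end()) {
      ++materialize_calls;
      materialized_.insert(name);
      it = resolved_.emplace(name, subset(name)).first;
    }
    return it->second;
  }

  int materialize_calls = 0;  // how often the "active binding" was hit

private:
  Column subset(const std::string& name) const {
    const Column& full = columns_.at(name);
    Column out;
    out.reserve(indices_.size());
    for (std::size_t i : indices_) out.push_back(full[i]);
    return out;
  }

  std::map<std::string, Column> columns_;
  std::vector<std::size_t> indices_;
  std::set<std::string> materialized_;
  std::map<std::string, Column> resolved_;
};
```

With two groups and an expression that only uses one column, `materialize_calls` stays at 1: the second group gets its subset through `update()` rather than through the lazy path, and the unused column is never subset.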

use case of the DataMask class

  • before evaluating expressions, construct a DataMask from a tibble
DataMask<SlicedTibble> mask(tbl);
  • before evaluating a new expression, we need to reset(parent_env) to prepare the data mask to
    evaluate expression with a given parent environment. This "forgets" about the materialized
    bindings.
mask.reset(quosure.env());
  • before evaluating the expression on a new group, the indices are updated; this includes
    rematerializing the already materialized bindings

hybrid evaluation

Use of DataMask

Hybrid evaluation also uses the DataMask<> class, but it only needs to quickly retrieve
the data for an entire column. This is what the maybe_get_subset_binding method does.

  // returns a pointer to the ColumnBinding if it exists
  // this is mostly used by the hybrid evaluation
  const ColumnBinding<SlicedTibble>* maybe_get_subset_binding(const SymbolString& symbol) const {
    int pos = symbol_map.find(symbol);
    if (pos >= 0) {
      return &column_bindings[pos];
    } else {
      return 0;
    }
  }

when the symbol map contains the binding, we get a ColumnBinding<SlicedTibble>*. These objects hold these fields:

  // is this a summary binding, i.e. does it come from summarise
  bool summary;

  // symbol of the binding
  SEXP symbol;

  // data. it is owned either by the original data frame or by the
  // accumulator, so no need for additional protection here
  SEXP data;

hybrid evaluation only needs summary and data.

Expression

When attempting to evaluate an expression with the hybrid evaluator, we first construct an Expression object.
This class has methods to quickly check if the expression can be managed, e.g.

    // sum( <column> ) and base::sum( <column> )
    if (expression.is_fun(s_sum, s_base, ns_base)) {
      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return sum_(data, x, /* na.rm = */ false, op);
      } else {
        return R_UnboundValue;
      }
    }

This checks that the call matches sum(<column>) or base::sum(<column>) where <column> is a column from the data mask.

In that example, the Expression class checks that:

  • the first argument is not named
  • the first argument is a column from the data

Otherwise it means it is an expression that we can't handle, so we return R_UnboundValue which is the hybrid evaluation
way to signal it gives up on handling the expression, and that it should be evaluated with standard evaluation.

Expression has the following methods:

  • inline bool is_fun(SEXP symbol, SEXP pkg, SEXP ns) : are we calling fun? If so, does fun currently resolve to the
    function we intend (it might not if the function is masked, which allows things like this):
> n <- function() 42
> summarise(iris, nn = n())
  nn
1 42
  • bool is_valid() const : is the expression valid. The Expression constructor rules out a few expressions that have no chance of being
    handled, such as pkg::fun() when pkg is none of dplyr, stats or base

  • SEXP value(int i) const : the expression at position i

  • bool is_named(int i, SEXP symbol) const : is the i'th argument named symbol

  • bool is_scalar_logical(int i, bool& test) const : is the i'th argument a scalar logical, we need this for handling e.g. na.rm = TRUE

  • bool is_scalar_int(int i, int& out) const : is the i'th argument a scalar int, we need this for n = <int>

  • bool is_column(int i, Column& column) const : is the i'th argument a column.
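
The masking check done by is_fun can be illustrated with a toy model of symbol lookup. This is only a sketch under assumptions: `Env` and `resolve` are hypothetical, functions are identified by address, and the three-argument signature of the real method is reduced to a single expected function.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical miniature of an R environment: bindings plus a parent.
struct Env {
  std::map<std::string, const void*> bindings;
  std::shared_ptr<const Env> parent;
};

// Walk the ancestry until the symbol is found, as symbol lookup does.
const void* resolve(const Env& env, const std::string& symbol) {
  for (const Env* e = &env; e != nullptr; e = e->parent.get()) {
    auto it = e->bindings.find(symbol);
    if (it != e->bindings.end()) return it->second;
  }
  return nullptr;
}

// The hybrid path is only taken when the symbol, looked up from the calling
// environment, is the very function the C++ implementation stands in for.
bool is_fun(const Env& caller, const std::string& symbol, const void* expected) {
  return resolve(caller, symbol) == expected;
}
```

Once the user defines their own `n` in the calling environment, the lookup stops there, the comparison fails, and the expression goes through standard evaluation, which is what makes the `n <- function() 42` example work.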

hybrid_do

The hybrid_do function uses methods from Expression to quickly assess if it can handle the expression and then calls the relevant
function from dplyr::hybrid:: to create the result at once:

    if (expression.is_fun(s_sum, s_base, ns_base)) {
      // sum( <column> ) and base::sum( <column> )
      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return sum_(data, x, /* na.rm = */ false, op);
      }
    } else if (expression.is_fun(s_mean, s_base, ns_base)) {
      // mean( <column> ) and base::mean( <column> )

      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return mean_(data, x, false, op);
      }
    } else if ...

The functions in the C++ dplyr::hybrid:: namespace create objects whose classes hold:

  • the type of output they create
  • the information they need (e.g. the column, the value of na.rm, ...)

These classes all have these methods:

  • summarise() to return a result of the same size as the number of groups. This is used when op is a Summary. This returns R_UnboundValue to give up
    when the class can't do that, e.g. the classes behind lag
  • window() to return a result of the same size as the number of rows in the original data set.

The classes typically don't provide these methods directly, but rather inherit, via CRTP, from one of:

  • HybridVectorScalarResult, so that the class only has to provide a process method, for example the Count class:
template <typename SlicedTibble>
class Count : public HybridVectorScalarResult<INTSXP, SlicedTibble, Count<SlicedTibble> > {
public:
  typedef HybridVectorScalarResult<INTSXP, SlicedTibble, Count<SlicedTibble> > Parent ;

  Count(const SlicedTibble& data) : Parent(data) {}

  int process(const typename SlicedTibble::slicing_index& indices) const {
    return indices.size();
  }
} ;

HybridVectorScalarResult uses the result of process in both summarise() and window()

  • HybridVectorVectorResult expects a fill method, e.g. the implementation of ntile(n=<int>) uses this class
    that derives from HybridVectorVectorResult:
template <typename SlicedTibble>
class Ntile1 : public HybridVectorVectorResult<INTSXP, SlicedTibble, Ntile1<SlicedTibble> > {
public:
  typedef HybridVectorVectorResult<INTSXP, SlicedTibble, Ntile1> Parent;

  Ntile1(const SlicedTibble& data, int ntiles_): Parent(data), ntiles(ntiles_) {}

  void fill(const typename SlicedTibble::slicing_index& indices, Rcpp::IntegerVector& out) const {
    int m = indices.size();
    for (int j = m - 1; j >= 0; j--) {
      out[ indices[j] ] = (int)floor((ntiles * j) / m) + 1;
    }
  }

private:
  int ntiles;
};

The result of fill is only used in window(). The summarise() method simply returns R_UnboundValue to give up.
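
For readers less familiar with CRTP, here is a stripped-down sketch of how a base class can derive both summarise() and window() from a single process() method. It is an illustration only: `ScalarResult` and `Group` are hypothetical names, results are `std::vector<int>` instead of R vectors, and the grouping is passed in directly.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Group = std::vector<std::size_t>;  // row indices of one group

// Base class: builds both result shapes from Derived::process().
template <typename Derived>
class ScalarResult {
public:
  ScalarResult(std::vector<Group> groups, std::size_t nrows)
    : groups_(std::move(groups)), nrows_(nrows) {}

  // One value per group.
  std::vector<int> summarise() const {
    std::vector<int> out;
    out.reserve(groups_.size());
    for (const Group& g : groups_) out.push_back(self().process(g));
    return out;
  }

  // One value per row: the group's value is repeated for each of its rows.
  std::vector<int> window() const {
    std::vector<int> out(nrows_);
    for (const Group& g : groups_) {
      const int value = self().process(g);
      for (std::size_t i : g) out[i] = value;
    }
    return out;
  }

private:
  const Derived& self() const { return static_cast<const Derived&>(*this); }

  std::vector<Group> groups_;
  std::size_t nrows_;
};

// Counterpart of Count: only process() has to be written.
class Count : public ScalarResult<Count> {
public:
  using ScalarResult<Count>::ScalarResult;

  int process(const Group& indices) const {
    return static_cast<int>(indices.size());
  }
};
```

The call to `process()` is resolved at compile time through the `static_cast`, so there is no virtual dispatch inside the per-group loop.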

@romainfrancois romainfrancois added the wip work in progress label Sep 5, 2018
@lionel-
Member

lionel- commented Sep 5, 2018

I haven't looked in details yet but just a quick comment:

get_data_context(sys.frames()

I don't think the data mask should be dynamically scoped. The mask should be found in the immediate caller frame. Did you encounter a case where this would be an issue?

@romainfrancois
Member Author

🤔 This is one of the first things I looked at. Should it just be parent.frame() then?

@lionel-
Member

lionel- commented Sep 5, 2018

Yes this way n() etc are lexically scoped.

@romainfrancois
Member Author

That’d be nice. I’ll quickly test this in the morning.

Is there some other information that would be useful in the data mask, e.g the column names (maybe this can help tidyselect) ?

@lionel-
Member

lionel- commented Sep 5, 2018

The column names should be available through names(.data), albeit unordered.

Member

@hadley hadley left a comment

A couple of small things I noticed. I'll try and find time for a deeper dive tomorrow.

R/dataframe.R
@@ -15,7 +15,12 @@ group_indices <- function(.data, ...) {
}
#' @export
group_indices.default <- function(.data, ...) {
group_indices_(.data, .dots = compat_as_lazy_dots(...))
if (missing(.data)) {
context <- get_data_context(sys.frames(), "group_indices()")
Member

Could we wrap this up a bit so we could just have

rep.int(from_context("..group_number"), from_context("..group_size"))

?

Member Author

Sure. This was saving a trip to look for the right environment, although @lionel- hints that this might not be needed and I can just do get("..group_size", parent.frame()).

also the second argument of get_data_context() drives the error message.

// [[Rcpp::export]]
SEXP get_data_context(SEXP frames, const char* expr) {
  static SEXP symb_group_size = Rf_install("..group_size");

  for (; !Rf_isNull(frames) ; frames = CDR(frames)) {
    SEXP group_size = Rf_findVarInFrame3(CAR(frames), symb_group_size, FALSE) ;
    if (group_size != R_UnboundValue) return CAR(frames);
  }

  throw Rcpp::exception(tfm::format("%s should only be called in a data context", expr).c_str(), false);

  return R_NilValue;
}

Member

Would it make sense to store all the information in a single object? Like .__dplyr_data_mask__..

romainfrancois added a commit that referenced this pull request Sep 6, 2018
@romainfrancois
Member Author

romainfrancois commented Sep 6, 2018

@lionel- should we allow for things like:

> half <- function() n() / 2
> summarise(mtcars, n = half())
 Error: Evaluation error: object '..group_size' not found. 

because of course in that case parent.frame() is empty. It's not something we test for so maybe it's fine, and we can have :

from_context <- function(x) get(x, parent.frame(2))

and perhaps with a friendlier message than object '..group_size' not found

@lionel-
Member

lionel- commented Sep 6, 2018

No but we should allow:

summarise(mtcars, {
  half <- function() n() / 2
  half()
})

With dynamic scoping you could easily use half() in the wrong place and get unexpected results.

We can also add an env argument to hybrid functions, and then you can write:

half <- function(env = parent.frame()) {
  n(env = env) / 2
}
summarise(mtcars, half())

I agree it seems useful to wrap around these functions.

@hadley
Member

hadley commented Sep 6, 2018

I don't think it's necessary to support functions created inside summarise(), unless it occurs naturally by doing the right thing elsewhere.

@lionel-
Member

lionel- commented Sep 6, 2018

It occurs naturally from lexical scoping. Equivalently, because the pipe evaluates in a child:

mutate(mtcars, cyl %>% divide_by(n()))

@lionel-
Member

lionel- commented Sep 7, 2018

I am now reconsidering this because we'll hit the same issue as for lexical dispatch, i.e. the following will produce different results:

by_n <- function(x, env = caller_env()) {
  x / n(env = env)
}

summarise(data, map(list_col, by_n))       # Can't find data mask
summarise(data, map(list_col, ~by_n(.x)))  # Works

There's just no way in current R to find the right lexical context when functionals are involved. So I think you're right that dynamic scoping of data mask information gives the least surprising semantics.

However it should probably be registered on the stack like in tidyselect because collecting and searching all the frames is expensive.

@romainfrancois
Member Author

@lionel- do you have some pointers of what you mean by "it should probably be registered on the stack like in tidyselect" ?

@lionel-
Member

lionel- commented Sep 7, 2018

The gist is to poke an environment with the current data mask information and restore the old information on exit. See https://github.com/tidyverse/tidyselect/blob/master/R/vars.R

@romainfrancois
Member Author

Right, I see. So instead of going to the data mask, the ..group_* variables would go to a context_env environment that is easier to reach.

Member

@krlmlr krlmlr left a comment

I only looked up to the ColumnBinding class, will continue reviewing later.

Member

@krlmlr krlmlr left a comment

Reviewed up to the source files and tests. Need to open new issues for "postpone" comments, if we agree that these are worth fixing.

SEXP column_subset_vector_impl(const Rcpp::Vector<RTYPE>& x, const Index& index, Rcpp::traits::true_type) {
typedef typename Rcpp::Vector<RTYPE>::stored_type STORAGE;
int n = index.size();
Rcpp::Vector<RTYPE> res = no_init(n);
Member

Postpone: Need to check if this still triggers rchk errors.

Member Author

I've used the other form just in case. e66ac91

This does not work for Matrix for some reason, but iirc you have a fix for Rcpp right ?

Member

RcppCore/Rcpp#893 has been merged. Why doesn't it work for Matrix ?

…up_number`.

Now using the `context_env` environment that lives inside the dplyr namespace instead of searching for the data mask overscope based on @lionel- suggestion

Also simplify the use of the context so that we can define `n()` as :

```
n <- function() {
  from_context("..group_size")
}
```
@romainfrancois
Member Author

I reworked the handling of the context information so that n() can be written:

n <- function() {
  from_context("..group_size")
}

on the R side the handling of the context is just:

context_env <- new_environment()
context_env[["..group_size"]] <- NULL
context_env[["..group_number"]] <- NULL

from_context <- function(what){
  context_env[[what]] %||% abort(glue("{expr} should only be called in a data context", expr = deparse(sys.call(-1))))
}

and the environment is managed by the DataMask class internally:

  • on construction:
    previous_group_size = get_context_env()["..group_size"];
    previous_group_number = get_context_env()["..group_number"];
  • destruction:
    get_context_env()["..group_size"] = previous_group_size;
    get_context_env()["..group_number"] = previous_group_number;
  • when we move on to another group:
    // update the data context variables, these are used by n(), ...
    get_context_env()["..group_size"] = indices.size();
    get_context_env()["..group_number"] = indices.group() + 1;

where get_context_env() is:

  Rcpp::Environment& get_context_env() const {
    static Rcpp::Environment context_env(Rcpp::Environment::namespace_env("dplyr")["context_env"]);
    return context_env;
  }
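
The save/update/restore cycle above is essentially a scope guard. Here is a minimal sketch of that idea, independent of Rcpp; `Context`, `context()` and `ContextGuard` are hypothetical names, and an empty `std::optional` stands in for the NULL stored in context_env outside of a data context.

```cpp
#include <optional>

// Hypothetical stand-in for context_env: one slot per variable.
struct Context {
  std::optional<int> group_size;
  std::optional<int> group_number;
};

inline Context& context() {
  static Context ctx;
  return ctx;
}

// Saves the context on construction and restores it on destruction, so that
// a nested evaluation leaves the outer context untouched.
class ContextGuard {
public:
  ContextGuard() : previous_(context()) {}
  ~ContextGuard() { context() = previous_; }

  ContextGuard(const ContextGuard&) = delete;
  ContextGuard& operator=(const ContextGuard&) = delete;

  // Called when moving on to another group (group is 0-based here).
  void update(int size, int group) {
    context().group_size = size;
    context().group_number = group + 1;
  }

private:
  Context previous_;
};

// Counterpart of n(): empty when called outside of a data context.
inline std::optional<int> n() { return context().group_size; }
```

Because the restore happens in the destructor, the previous values come back even if evaluation exits early, for example through an exception.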

ping @lionel-

…ehaviour when external pointers to data mask call back to the data mask after its lifetime
Member

@lionel- lionel- left a comment

A few comments

// TODO: when .data knows how to look ancestry, this should use mask_resolved instead
//
// it does not really install an active binding because there is no need for that
inline void install(SEXP mask_active, SEXP mask_resolved, int /* pos */, boost::shared_ptr< DataMaskProxy<NaturalDataFrame> >& /* data_mask_proxy */) {
Member

General comment: Can you add some newlines in argument lists please? Long lines are hard to read. Same for comments.

Member Author

Sure.

// top : the environment containing active bindings.
//
// overscope : where .data etc ... are installed
overscope = internal::rlang_api().new_data_mask(mask_resolved, mask_active, parent_env);
Member

There'll be an API update on this. The C callable will continue to work but the parent_env is now ignored. See r-lib/rlang#603

Member

Probably need to export eval_tidy() as C callable as well.

Member Author

I was about to ask you actually. That's good this way I only have to call new_data_mask once.

In that case I suppose we do indeed need eval_tidy for the handling of .env ?

// The 3 environments of the data mask
Environment mask_active; // where the active bindings live
Environment mask_resolved; // where the resolved active bindings live
Environment overscope; // actual data mask, contains the .data pronoun
Member

overscope => data_mask

@romainfrancois
Member Author

I'll merge this shortly, probably tomorrow, unless there are strong concerns.
We can handle small problems with small pull requests.

@krlmlr
Member

krlmlr commented Sep 14, 2018

Thanks! I did a coarse review and summarized the results in a new issue.

I like the new implementation and the fact that 2000 lines of code are gone now.

@lock

lock bot commented Mar 25, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Mar 25, 2019