Make mutate use collecter h #2487

Merged
hadley merged 6 commits into tidyverse:master from the make_mutate_use_collecter_h branch on Mar 5, 2017

Conversation

zeehio
Contributor

@zeehio zeehio commented Mar 2, 2017

This PR is a continuation of #2486 and closes #1892.

mutate(col2 = fun(col1)) on a grouped data frame calls fun once per group.

It used to require that fun returns the exact same type on each call. That is not desirable for functions that may return different (but compatible) types, such as integer and numeric.

This PR changes that behavior, so the vectors returned by each of the fun calls are combined using the same coercion rules as combine and bind_rows, defined in Collecter.h.

Comments are welcome. Feel free to be picky, so I can improve my C++ and Rcpp skills a bit. 😃
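To make the intended behavior concrete outside of Rcpp, here is a minimal C++ sketch of collecting per-group results while promoting integer storage to double as soon as one group returns doubles. The names `Chunk`, `PromotingCollector`, `int_chunk` and `double_chunk` are hypothetical and exist only for this illustration; the real collecters in Collecter.h operate on R vectors and handle many more types.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the result of one fun() call on one group:
// either integer values or double values.
struct Chunk {
    bool is_double;
    std::vector<int> ints;
    std::vector<double> doubles;
};

inline Chunk int_chunk(const std::vector<int>& v) {
    Chunk c;
    c.is_double = false;
    c.ints = v;
    return c;
}

inline Chunk double_chunk(const std::vector<double>& v) {
    Chunk c;
    c.is_double = true;
    c.doubles = v;
    return c;
}

// Collects chunks in order. Storage starts as integer and is promoted to
// double the first time a double chunk arrives (assumed coercion rule:
// integer + numeric -> numeric).
class PromotingCollector {
public:
    PromotingCollector() : is_double_(false) {}

    void collect(const Chunk& chunk) {
        if (chunk.is_double && !is_double_) {
            // Promote everything collected so far from int to double.
            doubles_.assign(ints_.begin(), ints_.end());
            ints_.clear();
            is_double_ = true;
        }
        if (!is_double_) {
            ints_.insert(ints_.end(), chunk.ints.begin(), chunk.ints.end());
        } else if (chunk.is_double) {
            doubles_.insert(doubles_.end(), chunk.doubles.begin(), chunk.doubles.end());
        } else {
            // Integer chunk arriving after promotion: widen on insert.
            doubles_.insert(doubles_.end(), chunk.ints.begin(), chunk.ints.end());
        }
    }

    bool is_double() const { return is_double_; }
    const std::vector<int>& ints() const { return ints_; }
    const std::vector<double>& doubles() const { return doubles_; }

private:
    bool is_double_;
    std::vector<int> ints_;
    std::vector<double> doubles_;
};
```

With this sketch, collecting an integer chunk followed by a double chunk ends with all values stored as doubles instead of raising a type mismatch error.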

We want to support difftime in bind_rows and combine.

We are already supporting mutate and I'm preparing a PR
to make mutate use Collecter.h as well.
Member

@krlmlr krlmlr left a comment

Thanks for looking into these issues! I have a few comments and questions.

@@ -457,3 +457,10 @@ test_that("bind_rows rejects data frame columns (#2015)", {
fixed = TRUE
)
})

test_that("bind_rows accepts difftime objects", {
Member

Test case for "hms":

  df1 <- data.frame(x = hms::hms(hours = 1))
  df2 <- data.frame(x = as.difftime(1, units = "mins"))
  res <- bind_rows(df1, df2)
  expect_equal(res$x, hms::hms(hours = 1, minutes = 1))

}

inline SEXP get() {
set_class(Parent::data, "difftime");
Member

There is also the "hms" class that inherits from "difftime"; I guess we'd like to return "hms" objects if the first object is of that class.

double factor_data = time_conversion_factor(units);
if (factor_data != 1.0) {
for (int i=0; i<Parent::data.size(); i++) {
Parent::data[i] = factor_data*Parent::data[i];
Member

Do we need to copy the data before in-place modification?

Member

So, if a copy has been made, I'm missing it ;-)

units = wrap("secs");
double factor_v = time_conversion_factor(v_units);
NumericVector v_sec(v);
double* v_sec_ptr = v_sec.begin();
Member

This looks unsafe, better use an iterator and advance that (in addition to i).
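One way to read this suggestion, sketched in plain C++ rather than Rcpp: instead of indexing through a raw pointer, walk the source values with an iterator that advances together with `i`, and stop if either range runs out. The function name `scatter_scaled` and the use of `std::vector` are assumptions made for this illustration only; the PR itself works on R vectors.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: writes factor * values[i] into data[index[i]].
// The source is traversed with an iterator that advances alongside i,
// and the loop also stops when the iterator reaches the end of values.
inline void scatter_scaled(std::vector<double>& data,
                           const std::vector<std::size_t>& index,
                           const std::vector<double>& values,
                           double factor) {
    std::vector<double>::const_iterator it = values.begin();
    for (std::size_t i = 0; i < index.size() && it != values.end(); ++i, ++it) {
        // at() throws std::out_of_range on a bad destination index.
        data.at(index[i]) = factor * (*it);
    }
}
```

The second loop condition is what the raw pointer version cannot express: a source vector shorter than the index no longer leads to a read past the end.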

// then collect the data:
Parent::collect(index, v);
} else {
// We already units, is the new vector with the same units?
Member

Please check grammar.

}
if (units.isNULL()) {
// if current unit is NULL, grab the new one
units = v_units;
Member

Can we simply reject "difftime" objects that don't have a "units" attribute?

@@ -25,12 +25,13 @@ namespace dplyr {
static bool is_class_known(SEXP x) {
/* C++11 (need initializer lists)
static std::set<std::string> known_classes {
- "POSIXct", "factor", "Date", "AsIs", "integer64", "table"
+ "difftime", "POSIXct", "factor", "Date", "AsIs", "integer64", "table"
Member

We shouldn't need to carry around code in comments. The old-style code also works in C++11, we'll rewrite that when we are on C++11 and when we touch the code again.

if (first_non_na < gdf.ngroups())
grab(first, indices);
copy_most_attributes(data, first);
class Gatherer {
Member

Could you please indent like the original, to make review simpler?

@krlmlr
Member

krlmlr commented Mar 2, 2017

@zeehio: If you address the feedback here and backport to #2486, we can continue the discussion there.

`mutate(col2 = fun(col1))` on a grouped data frame calls `fun` once per group.

It used to require that `fun` returns the exact same type, which was not desirable for functions that may return different (but compatible) types, such as integer and numeric.

This PR changes that behaviour, so the vectors returned by each of the `fun` calls are combined using the same coercion rules as `combine` and `bind_rows`, defined in `Collecter.h`.
@zeehio zeehio force-pushed the make_mutate_use_collecter_h branch from 1f1b494 to 36319d8 Compare March 2, 2017 18:03
@zeehio zeehio changed the title Make mutate use collecter h [WIP] Make mutate use collecter h Mar 2, 2017
@zeehio zeehio force-pushed the make_mutate_use_collecter_h branch from 9b0fca5 to dad85f8 Compare March 2, 2017 19:49
@zeehio zeehio force-pushed the make_mutate_use_collecter_h branch from dad85f8 to 1231ced Compare March 2, 2017 20:57
@zeehio zeehio changed the title [WIP] Make mutate use collecter h Make mutate use collecter h Mar 2, 2017
@zeehio
Contributor Author

zeehio commented Mar 2, 2017

@krlmlr I have taken care of all the issues you mentioned. Apologies for the grammar error and for all the indentation issues. Thanks for all the comments and suggestions 👍

for (int i=0; i<index.size(); i++) {
Parent::data[index[i]] = factor_v * REAL(v)[i];
}
} else if (TYPEOF(v) == INTSXP) {
Member

How do you create a difftime of mode integer?

Contributor Author

I can only think of structure(4L, units="secs", class = "difftime") and I agree it is forcing it. I can drop the else if if you prefer.

double time_conversion_factor(RObject v_units) {
// Acceptable units based on r-source/src/library/base/R/datetime.R
double factor = 1;
std::string v_units_c = Rcpp::as<std::string>(v_units);
Member

I wonder if we could use a map (as a static variable in a function) that allows lookup by SEXP, both here and in has_valid_time_unit().

Contributor Author

@zeehio zeehio Mar 3, 2017

In the last commit I have used a std::string as the key. If that is not good enough I will try to fix it tomorrow.
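For readers following along, a static map keyed by std::string could look roughly like the sketch below. It is an illustration under assumptions, not the code from the commit: the accessor name `time_unit_factors` is hypothetical, and the signatures taking `const std::string&` are assumed (the versions quoted in this thread take an RObject or SEXP). The unit names are the ones R accepts for difftime objects, and the map is filled without initializer lists so it does not require C++11.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical accessor: a function-local static map from difftime unit
// name to the number of seconds per unit, built on first use.
inline const std::map<std::string, double>& time_unit_factors() {
    static std::map<std::string, double> factors;
    if (factors.empty()) {
        factors["secs"] = 1.0;
        factors["mins"] = 60.0;
        factors["hours"] = 3600.0;
        factors["days"] = 86400.0;
        factors["weeks"] = 604800.0;
    }
    return factors;
}

// Assumed signature: true if the unit name is one of the known units.
inline bool has_valid_time_unit(const std::string& units) {
    return time_unit_factors().count(units) > 0;
}

// Assumed signature: seconds per unit, or an exception for unknown units.
inline double time_conversion_factor(const std::string& units) {
    std::map<std::string, double>::const_iterator it =
        time_unit_factors().find(units);
    if (it == time_unit_factors().end()) {
        throw std::invalid_argument("Invalid difftime object");
    }
    return it->second;
}
```

Both functions share one table, so the list of accepted units lives in a single place.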

}

private:
RObject units;
Member

Can you please move data members to the bottom?

}
}

double time_conversion_factor(SEXP v_units) {
Member

This feels overworked to me now - I preferred the previous version with explicit if statements. Might be better to make the argument std::string()

}

void collect_difftime(const SlicingIndex& index, SEXP v) {
RObject v_units(Rf_getAttrib(v, Rf_install("units")));
Member

Can't you do v.attr("units")? I'm pretty sure there's a C++ api here.

void collect_difftime(const SlicingIndex& index, SEXP v) {
RObject v_units(Rf_getAttrib(v, Rf_install("units")));
if (v_units.isNULL()) {
stop("Can't collect difftime without units");
Member

I think here, and for the non-REALSXP case below, you can simply do stop("Invalid difftime object").

}
}

RObject units;
Member

Would all this code be simpler if units was a std::string?
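A rough sketch of what a collecter with units stored as a std::string might look like, written in plain C++ so it can be read without Rcpp. Everything here is an assumption for illustration: the class name `DifftimeSketch`, the helper `seconds_per_unit`, and the use of `std::vector` in place of R vectors. The logic follows the snippets quoted in this thread: keep the first units seen, and when a later vector has different units, convert the stored data and the new values to seconds.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper with explicit if statements: seconds per difftime unit.
inline double seconds_per_unit(const std::string& units) {
    if (units == "secs") return 1.0;
    if (units == "mins") return 60.0;
    if (units == "hours") return 3600.0;
    if (units == "days") return 86400.0;
    if (units == "weeks") return 604800.0;
    throw std::invalid_argument("Invalid difftime object");
}

// Hypothetical, simplified collecter: plain vectors stand in for R vectors.
class DifftimeSketch {
public:
    explicit DifftimeSketch(std::size_t n) : data_(n, 0.0) {}

    // Stores v[i] at data[index[i]], reconciling units as needed.
    void collect(const std::vector<std::size_t>& index,
                 const std::vector<double>& v,
                 const std::string& v_units) {
        if (index.size() != v.size()) {
            throw std::invalid_argument("index and values differ in length");
        }
        double factor_v = seconds_per_unit(v_units);  // also validates v_units
        if (units_.empty() || units_ == v_units) {
            // First chunk, or same units as before: store values unchanged.
            units_ = v_units;
            factor_v = 1.0;
        } else {
            // Units differ: convert what is already stored to seconds.
            const double factor_data = seconds_per_unit(units_);
            if (factor_data != 1.0) {
                for (std::size_t i = 0; i < data_.size(); ++i) {
                    data_[i] *= factor_data;
                }
            }
            units_ = "secs";
        }
        for (std::size_t i = 0; i < index.size(); ++i) {
            data_.at(index[i]) = factor_v * v[i];
        }
    }

    const std::vector<double>& data() const { return data_; }
    const std::string& units() const { return units_; }

private:
    // Data members at the bottom, as requested earlier in the review.
    std::vector<double> data_;
    std::string units_;
};
```

With a std::string member, an empty string can play the role that a NULL units attribute plays in the RObject version, and the unit comparison becomes a plain `==`.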

@zeehio
Contributor Author

zeehio commented Mar 4, 2017

I took care of the comments, feel free to do another review if you want

@hadley hadley merged commit 996318b into tidyverse:master Mar 5, 2017
@hadley
Member

hadley commented Mar 5, 2017

Looks good, thanks!

@zeehio zeehio deleted the make_mutate_use_collecter_h branch March 5, 2017 11:56
@lock

lock bot commented Jan 18, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 18, 2019
Successfully merging this pull request may close these issues.

Integer not automatically coerced to double
3 participants