New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add vctrs methods #38
Conversation
So it works with `devtools::test()`
Codecov Report
@@ Coverage Diff @@
## master #38 +/- ##
==========================================
+ Coverage 93.99% 94.30% +0.31%
==========================================
Files 8 10 +2
Lines 333 369 +36
==========================================
+ Hits 313 348 +35
- Misses 20 21 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks! I added a couple of minor comments.
But more importantly, we have to be careful with the "id" attribute. The package has a global hash table that keeps track of the covariances between pairs of errors objects that are alive in the R session. Any time you perform an operation, covariance vectors are saved:
library(errors)
library(dplyr, warn.conflicts=FALSE)
iris.q <- head(iris)
for (i in 1:4)
errors(iris.q[,i]) <- iris.q[,i] * 0.05
iris.q$another <- iris.q$Sepal.Length + iris.q$Sepal.Width
correl(iris.q$another, iris.q$Sepal.Length)
#> [1] 0.8245134 0.8528513 0.8265992 0.8292664 0.8115343 0.8106792
correl(iris.q$another, iris.q$Sepal.Width)
#> [1] 0.5658425 0.5221539 0.5627910 0.5588535 0.5843047 0.5854906
but
res <- iris.q %>%
mutate(another = Sepal.Length + Sepal.Width)
correl(res$another, res$Sepal.Length)
#> NULL
correl(res$another, res$Sepal.Width)
#> NULL
In essence, every time we make an operation that doesn't modify the length of the vector, the "id" is used to track pairs of covariances. That's why I use an environment for this "id", because methods return a copy of the object, but the reference to that environment doesn't change. However, if the operation changes the length of the vector (summaries, subsets...), then a new object is created from scratch, and thus a new "id" is assigned.
So somewhere we need to detect whether {dplyr} is mutating or summarising in order to keep the "id" or create a new object.
Things get trickier when you are mutating grouped data. But mutating subsets is not currently supported, e.g.:
iris.q$another[1:3] <- iris.q$Sepal.Length[1:3] * iris.q$Sepal.Width[1:3]
correl(iris.q$another, iris.q$Sepal.Length)
#> NULL
correl(iris.q$another, iris.q$Sepal.Width)
#> NULL
so I don't expect it to work with {dplyr} now. Definitely I need to work on this though.
Oh, and add yourself as a contributor, please! |
hmm I don't think we can use the length to figure this out. For instance if you reverse a vector it will have the same length but the correlations will be wrong if they are not recomputed properly. Maybe one thing that could help is to include some kind of identity vector in the proxy that assigns a unique identifier to the values. This seems like a hard problem though. |
Do you mean in DESCRIPTION? or in NEWS? |
DESCRIPTION.
Yeap, it was a simplification to explain the idea behind the identifier, but probably looking at the length is not a good indicator. In fact, reversing a vector currently doesn't hold the correlations either, because I think I need a trace of calls to really understand how {dplyr} and {vctrs} interact. Do you know a better way of doing this than calling |
dplyr uses vctrs operations like |
Here you go @Enchufa2 Would you like to do the {units} package? I'd be happy to send a PR as well for units (and quantities)! |
One more question. I see: library(errors)
library(dplyr, warn.conflicts=FALSE)
iris.q <- head(iris)
for (i in 1:4)
errors(iris.q[,i]) <- iris.q[,i] * 0.05
tracer <- quote({
call <- try(match.call(), silent=TRUE)
.trace.ind <<- .trace.ind + 1
if (!inherits(call, "try-error")) {
indent <- paste0(rep(" ", .trace.ind), collapse="")
line <- paste0(indent, as.character(call))
cat(line, "\n")
}
})
exit <- quote({ .trace.ind <<- .trace.ind - 1 })
pkgs <- c("dplyr", "vctrs", "errors")
for (pkg in pkgs) {
where <- asNamespace(pkg)
funs <- ls(where)
funs <- funs[sapply(funs, function(x) is.function(get(x, where)))]
trace(funs, tracer=tracer, exit=exit, print=FALSE, where=where)
#untrace(funs, where=where)
}
#> [snip]
.trace.ind <- 0
res <- iris.q %>%
mutate(another = Sepal.Length + Sepal.Width)
#> mutate.data.frame . Sepal.Length + Sepal.Width
#> group_rows data
#> group_data .data
#> group_data.data.frame .data
#> new_list_of list(seq_len(nrow(.data))) integer()
#> vec_is_list x
#> vec_size ptype
#> validate_names .data
#> names_all_or_nothing nms
#> vec_set_attributes .data attrib
#> new_data_frame list(.rows = rows) 1
#> local_mask self frame
#> context_local mask x frame
#> context_poke name value
#> group_keys data
#> group_keys.data.frame data
#> group_data .tbl
#> group_data.data.frame .tbl
#> new_list_of list(seq_len(nrow(.data))) integer()
#> vec_is_list x
#> vec_size ptype
#> validate_names .data
#> names_all_or_nothing nms
#> vec_set_attributes .data attrib
#> new_data_frame list(.rows = rows) 1
#> group_vars data
#> group_vars.data.frame data
#> group_data x
#> group_data.data.frame x
#> new_list_of list(seq_len(nrow(.data))) integer()
#> vec_is_list x
#> vec_size ptype
#> validate_names .data
#> names_all_or_nothing nms
#> vec_set_attributes .data attrib
#> new_data_frame list(.rows = rows) 1
#> is_grouped_df data
#> map seq_len(ncol(data)) function(.x) expr(promise_fn(!!.x))
#> map_lgl private$resolved is.null
#> map_lgl private$resolved is.null
#> cond2int !inherits(e1, "errors") !inherits(e2, "errors")
#> propagate unclass(NextMethod()) e1 e2 deriv[[1]] deriv[[2]]
#> errors x
#> errors.errors x
#> errors y
#> errors.errors y
#> covar x y
#> covar.errors x y
#> set_errors xx sqrt(var)
#> set_errors.numeric xx sqrt(var)
#> errors<- `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> errors<-.numeric `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> new_id
#> errors<-.errors x value
#> ids idx
#> ids idy
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> vec_unchop chunks rows
#> vec_proxy x
#> vec_proxy x
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> set_errors x$data x$errors
#> set_errors.numeric x$data x$errors
#> errors<- `*tmp*` numeric(0)
#> errors<-.numeric `*tmp*` numeric(0)
#> new_id
#> errors<-.errors x value
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> vec_ptype_finalise.default x
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> set_errors x$data x$errors
#> set_errors.numeric x$data x$errors
#> errors<- `*tmp*` numeric(0)
#> errors<-.numeric `*tmp*` numeric(0)
#> new_id
#> errors<-.errors x value
#> drop_errors.errors x
#> drop_errors.errors to
#> set_errors out errors(x)
#> set_errors.numeric out errors(x)
#> errors x
#> errors.errors x
#> errors<- `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> errors<-.numeric `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> new_id
#> errors<-.errors x value
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> vec_proxy x
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> set_errors x$data x$errors
#> set_errors.numeric x$data x$errors
#> errors<- `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> errors<-.numeric `*tmp*` c(0.309273341883842, 0.287271648444465, 0.28429737951659, 0.277353564967173, 0.308058436014987, 0.333054049667618)
#> new_id
#> errors<-.errors x value
#> map_lgl new_columns inherits rlang_zap
#> context_poke mask NULL
#> dplyr_col_modify .data cols
#> dplyr_col_modify.data.frame .data cols
#> vec_recycle_common !!!cols nrow(data)
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> dplyr_vec_data data
#> vec_data x
#> vec_assert x
#> vec_is_vector x
#> vec_proxy x
#> has_dim x
#> vec_set_attributes x list(names = names(x))
#> unset_s4 x
#> new_data_frame out nrow(x)
#> new_data_frame out nrow(data)
#> dplyr_reconstruct out data
#> dplyr_new_data_frame data
#> drop_errors.errors x
#> vctrs::new_data_frame list(data = data, errors = errors) length(data)
#> dplyr_reconstruct_dispatch data template
#> dplyr_reconstruct.data.frame data template Is there any way to carry the "id" attribute around when |
IIUC I think you'd want to use the |
Yes! I'll merge this PR and I'll play a little bit before introducing that change, but it just works, and it doesn't break the summarise case. What potential problems do you see? In theory, the object returned by
I you have cycles for this, yes, I would appreciate that help a lot. Because it's clear that I still don't have a full grasp of this mechanism. |
If the attribute doesn't depend on the data, then in theory it's part of the type, and should be contained in the
The values being restored might come from a different vector (e.g. with This is why I suggested to include id columns in the proxy, the first column would be a
Sounds good. And no worries, we're still learning as well, this is research in progress :-) |
What if vec_proxy.errors <- function(x, ...) {
data <- drop_errors.errors(x)
errors <- attr(x, "errors")
# Simplifies coercion methods
errors <- as.double(errors)
# The `errors` are a vectorised attribute, which requires a data
# frame proxy
- vctrs::new_data_frame(
+ df <- vctrs::new_data_frame(
list(data = data, errors = errors),
n = length(data)
)
+ attr(df, "id") <- attr(x, "id")
+ df
} Is there any guarantee that this attribute would survive to see |
This is UB from the point of view of vctrs. But even if not, it would also incorrectly propagate the |
A good use case to test: dplyr::slice_sample(iris_errors, n = nrow(iris_errors)) The data will restore with the same size, but it will be different. Propagating the |
Yeap, you're right. And here nor {dplyr} nor {vctrs} are calling any {errors} method. So basically there's missing information at In the first case, |
The situation is similar to x <- set_errors(1:3, 1:3)
y <- set_errors(3:1, 3:1)
z <- x + y
correl(x, z)
#> [1] 0.3162278 0.7071068 0.9486833
correl(x[], z)
#> NULL
correl(x[seq_along(x)], z)
#> NULL |
Yes, as I said, this particular use case is not currently covered in |
With the combination of a vector identifier and an element identifier in the proxy you'd have all the information as well in |
By the way your tracing routine is fantastic, thanks for that! It helped my workflow twice already. |
:D It was a quick test and for sure it can be done better. But if it's useful, I could make a package out of it, or send a PR to... {pryr} may be a good place? |
Maybe lobstr? What do you think @hadley and @gaborcsardi? Just did a few improvements: # Same logic as rlang backtrace formatting
call_add_namespace <- function(call, fn) {
if (!is.call(call) || !is.symbol(call[[1]])) {
return(call)
}
sym <- call[[1]]
nm <- as.character(sym)
if (nm %in% c("::", ":::")) {
return(call)
}
env <- environment(fn)
top <- topenv(env)
if (identical(env, globalenv())) {
prefix <- "global"
op <- "::"
} else if (isNamespace(top)) {
prefix <- rlang::ns_env_name(top)
if (rlang:::ns_exports_has(top, nm)) {
op <- "::"
} else {
op <- ":::"
}
} else {
return(call)
}
namespaced_sym <- call(op, as.symbol(prefix), sym)
call[[1]] <- namespaced_sym
call
}
trace_pkgs <- function(pkgs, max_level = Inf, ..., regexp = NULL) {
# Avoids namespace loading issues
lapply(pkgs, requireNamespace, quietly = TRUE)
trace_level <- 0
# Create a thunk because `trace()` sloppily transforms functions into calls
tracer <- rlang::call2(function() {
trace_level <<- trace_level + 1
if (trace_level > max_level) {
return()
}
# Work around sys.foo() weirdness
get_fn <- rlang::call2(function(fn = sys.function(sys.parent())) fn)
fn <- eval(get_fn, env = parent.frame())
try(silent = TRUE, {
call <- evalq(base::match.call(), envir = parent.frame())
call <- call_add_namespace(call, fn)
indent <- paste0(rep(" ", (trace_level - 1) * 2), collapse = "")
line <- paste0(indent, rlang::as_label(call))
cat(line, "\n")
})
})
exit <- rlang::call2(function() {
trace_level <<- trace_level - 1
})
if (length(regexp) == 1) {
regexp <- rep(regexp, length(pkgs))
}
for (i in seq_along(pkgs)) {
pkg <- pkgs[[i]]
ns <- asNamespace(pkg)
ns_fns <- names(Filter(is.function, as.list(ns)))
if (!is.null(regexp)) {
ns_fns <- ns_fns[grepl(regexp[[i]], ns_fns)]
}
suppressMessages(trace(
ns_fns,
tracer = tracer,
exit = exit,
print = FALSE,
where = ns
))
message(sprintf(
"Tracing %d functions in %s.",
length(ns_fns),
pkg
))
}
} This ensures the calls are namespaced, allows setting a maximum nesting level, and supplying a regexp to match functions: trace_pkgs(c("vctrs", "tibble"), max_level = 2, regexp = "^(vec|new)_")
#> Tracing 459 functions in vctrs.
#> Tracing 2 functions in tibble.
library(dplyr)
invisible(mtcars %>% transmute(ts(cyl)))
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#> vctrs::vec_is_list(x = x)
#> vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#> vctrs::vec_is_list(x = x)
#> vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#> vctrs::vec_is_list(x = x)
#> vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::vec_unchop(x = chunks, indices = rows)
#> vctrs::vec_proxy(x = x)
#> vctrs::vec_proxy(x = x)
#> vctrs::vec_default_cast(...)
#> vctrs::vec_proxy(x = x)
#> vctrs::vec_slice(x = i, i = existing)
#> vctrs::vec_slice(x = value, i = existing)
#> vctrs:::vec_dim_n(x = x)
#> vctrs::vec_recycle_common(!!!cols, .size = nrow(data))
#> vctrs::vec_data(x = x)
#> vctrs::vec_assert(x = x)
#> vctrs::vec_proxy(x = x)
#> vctrs:::vec_set_attributes(x = x, attrib = list(names = names(x)))
#> vctrs::new_data_frame(x = out, n = nrow(x))
#> vctrs::new_data_frame(x = out, n = nrow(data))
#> vctrs::new_list_of(x = list(seq_len(nrow(.data))), ptype = integer())
#> vctrs::vec_is_list(x = x)
#> vctrs::vec_size(x = ptype)
#> vctrs::new_data_frame(x = list(.rows = rows), n = 1L)
#> vctrs::vec_as_location(i = loc, n = ncol(.data), names = names(.data)) |
Great work! I'm gonna use this a lot. |
Another potential improvement: optionally, the |
To me, it feels like rlang is the right place for it, because it feels like a bit of a peer to |
Part of r-lib/vctrs#1130
I have moved the test helpers from
testthat.R
to a helper file, this way it works with both R CMD check anddevtools::test()
. Let me know if you prefer the original location for these helpers and I'll revert.This is an interesting class to work with and hopefully we'll be able to simplify the coercion methods in the future, once we know how to handle higher order types in a more generic way.