Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unconditionally clone in df_proxy() #1838

Merged
merged 2 commits into from
May 9, 2023

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented May 8, 2023

Closes #1837

I think it was a mistake to not clone in df_proxy() if the original input was not referenced. There are cases where we:

  • Create a fresh object at the C level (like with x = vec_cast(x, ptype) to cast to a common type)
  • Proxy that object to get x_proxy = vec_proxy_equal(x) and perform some manipulation on the proxy
  • Slice the original fresh x using information gained from the proxy

This is what is done in vec_set_*().

When a data frame is supplied as x, the current implementation of df_proxy() means the vec_proxy_equal() step from above modifies x directly (because it wasn't referenced), so the last step of slicing x ends up slicing and returning a proxy rather than the original object.


It doesn't look like the more sensitive functions are affected by this.

I checked vec_rbind() in particular, and it isn't affected because we do xs <- vec_cast_common(xs), which does create fresh x elements, but it immediately stores them in a list so MAYBE_REFERENCED() was already returning true for them.

library(bench)
library(tidyr)

run(against = c("vctrs", "r-lib/vctrs#1838"), local = FALSE, {
  library(vctrs)
  
  x <- vec_rep(list(data_frame(x = 1)), 1000)
  
  bench::mark(vec_rbind(!!!x))
}) %>% 
  unnest(result)
#> # A tibble: 2 × 14
#>   reference       expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
#>   <chr>           <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
#> 1 vctrs           vec_rbind… 2.34ms 2.55ms      387.    33.2KB     53.5   152    21      393ms <df>   <Rprofmem>
#> 2 r-lib/vctrs#18… vec_rbind… 2.26ms 2.45ms      398.    33.3KB     56.2   156    22      392ms <df>   <Rprofmem>
#> # ℹ 2 more variables: time <list>, gc <list>

run(against = c("vctrs", "r-lib/vctrs#1838"), local = FALSE, {
  library(vctrs)
  
  x <- vec_rep(list(data_frame(x = 1)), 1000)
  
  bench::mark(vec_ptype_common(!!!x))
}) %>% 
  unnest(result)
#> # A tibble: 2 × 14
#>   reference        expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
#>   <chr>            <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
#> 1 vctrs            vec_ptype… 644µs  718µs     1372.    10.7KB     50.2   601    22      438ms <df>   <Rprofmem>
#> 2 r-lib/vctrs#1838 vec_ptype… 646µs  721µs     1369.    10.7KB     50.2   600    22      438ms <df>   <Rprofmem>
#> # ℹ 2 more variables: time <list>, gc <list>

@DavisVaughan DavisVaughan requested a review from lionel- May 8, 2023 14:41
Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Also on hindsight it was probably not worth the risk to avoid cloning fresh dfs given that data frames with many columns are rare so the shallow copy is cheap.

@DavisVaughan DavisVaughan merged commit ae010ea into r-lib:main May 9, 2023
12 checks passed
@DavisVaughan DavisVaughan deleted the fix/df-proxy-always-clone branch May 9, 2023 13:35
@etiennebacher
Copy link
Contributor

etiennebacher commented May 9, 2023

@DavisVaughan I'm sorry for asking here because it's completely unrelated to vctrs but where does this run() function come from? It looks very useful to benchmark PRs against the main branch but I can't find it in bench, even in the devel version. It's the first time I see this function

@DavisVaughan
Copy link
Member Author

i have a script that i source that pulls it in. i am tinkering with it but i will probably put it in a small package with helpers like run() to run anything and run_mark() specifically for benchmarks

run <- function(expr,
                against,
                ...,
                local = TRUE) {
  rlang::check_dots_empty0(...)

  without_callr_messages <- function(expr) {
    suppressMessages(expr, classes = "callr_message")
  }

  libdirs <- vector("character", length = length(against))

  for (i in seq_along(libdirs)) {
    libdirs[[i]] <- withr::local_tempdir()
  }
  for (i in seq_along(against)) {
    without_callr_messages({
      pak::pkg_install(pkg = against[[i]], lib = libdirs[[i]], ask = FALSE)
    })
  }

  if (local) {
    libdir <- withr::local_tempdir()
    against <- c("local", against)
    libdirs <- c(libdir, libdirs)

    without_callr_messages({
      pak::pkg_install(pkg = "local::.", lib = libdir, ask = FALSE)
    })
  }

  body <- substitute(expr)

  fn <- rlang::new_function(
    args = NULL,
    body = body,
    env = globalenv()
  )

  libpaths <- .libPaths()

  out <- vector("list", length = length(libdirs))

  for (i in seq_along(libdirs)) {
    libpath <- c(libdirs[[i]], libpaths)
    out[[i]] <- callr::r(func = fn, libpath = libpath)
  }

  tibble::tibble(reference = against, result = out)
}

@etiennebacher
Copy link
Contributor

Nice, thanks! Having this in a small package would be cool

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

df_proxy() should unconditionally clone
3 participants