Skip to content

faster internal version of update_tibble_attrs()#717

Merged
krlmlr merged 8 commits into
tidyverse:masterfrom
romainfrancois:update_tibble_attrs
Feb 14, 2020
Merged

faster internal version of update_tibble_attrs()#717
krlmlr merged 8 commits into
tidyverse:masterfrom
romainfrancois:update_tibble_attrs

Conversation

@romainfrancois
Copy link
Copy Markdown
Member

@romainfrancois romainfrancois commented Feb 13, 2020

This has come up as part of performance benchmarks for dplyr 1.0.0:

library(tibble)
library(rlang)

update_tibble_attrs_old <- function(x, ...) {
  # Compat for https://github.com/r-spatial/sf/pull/1232
  attribs <- list(...)
  if (has_length(attribs) && is_named(attribs)) {
    attributes(x)[names(attribs)] <- attribs
  }
  
  x
}
update_tibble_attrs <- tibble:::update_tibble_attrs


df <- tibble(x = rnorm(1e6), g = sample(rep(1:1e4, 100)))
groups <- tibble(g = unique(df$g))

bench::mark(
  new = update_tibble_attrs(df, groups = groups), 
  old = update_tibble_attrs_old(df, groups = groups) 
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new          2.09µs   2.72µs   344885.        0B      0  
#> 2 old          3.98ms   4.96ms      201.    3.81MB     18.7

Created on 2020-02-13 by the reprex package (v0.3.0.9000)

I believe this is because the form attributes(x)[names(attribs)] <- pays for setting all the attributes, including row.names which is expensive to get/set.

closes tidyverse/dplyr#4855

Comment thread R/new.R
}

x
.Call(`tibble_update_attrs`, x, pairlist2(...))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably safer to check if any of the special attributes is supplied, and fall back to the R code in that case?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how safer that would be, setAttrib() does some internal dispatch to deal with special attributes, i.e. names, row.names, class ...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yes, I was confused!

There's still some special handling at R level in structure() though, but that doesn't seem important:

specials <- c(".Dim", ".Dimnames", ".Names", ".Tsp", ".Label")

Comment thread src/attributes.c Outdated
romainfrancois added a commit to tidyverse/dplyr that referenced this pull request Feb 13, 2020
@krlmlr
Copy link
Copy Markdown
Member

krlmlr commented Feb 14, 2020

Thanks! I have removed compat code that is no longer necessary with the sf update. It's still very slow, though.

I have also added a test regarding overwriting of attributes. I think our implementation will need to account for this. Do we need to adapt?

@romainfrancois
Copy link
Copy Markdown
Member Author

structure() seems to be as slow:

library(tibble)
library(rlang)

update_tibble_attrs_old <- function(x, ...) {
  # Compat for https://github.com/r-spatial/sf/pull/1232
  attribs <- list(...)
  if (has_length(attribs) && is_named(attribs)) {
    attributes(x)[names(attribs)] <- attribs
  }
  x
}
update_tibble_attrs <- tibble:::update_tibble_attrs

df <- tibble(x = rnorm(1e6), g = sample(rep(1:1e4, 100)))
groups <- tibble(g = unique(df$g))

bench::mark(
  new = update_tibble_attrs(df, groups = groups),
  old = update_tibble_attrs_old(df, groups = groups), 
  str = structure(df, groups = groups)
)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new          2.02µs   2.71µs   335090.        0B      0  
#> 2 old          3.97ms   4.77ms      211.    3.82MB     22.4
#> 3 str          3.96ms    4.3ms      226.    3.81MB     15.8

Created on 2020-02-14 by the reprex package (v0.3.0.9000)

@romainfrancois
Copy link
Copy Markdown
Member Author

which test @krlmlr ?

@romainfrancois
Copy link
Copy Markdown
Member Author

I believe vectbl_restore() has the same issue, i.e. using attributes()[<-

vectbl_restore <- function(xo, x) {
  n <- fast_nrow(xo)

  # FIXME: Use new_tibble()?
  forbidden <- attrs_names_only
  attrs <- attributes(x)
  attrs <- attrs[!(names(attrs) %in% forbidden)]

  attributes(xo)[names(attrs)] <- attrs

  # Need to patch manually
  attr(xo, "row.names") <- .set_row_names(n)

  xo
}

@krlmlr
Copy link
Copy Markdown
Member

krlmlr commented Feb 14, 2020

Good catch, pushed the tests.

I still wonder why vectbl_restore() seems to be much faster:

library(tibble)

x <- tibble(a = 1:3)
y <- x

vectbl_restore <- tibble:::vectbl_restore

bench::mark(vectbl_restore(x, y), y)
#> # A tibble: 2 x 6
#>   expression                min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vectbl_restore(x, y)    5.8µs   6.93µs   123302.    3.97KB     24.7
#> 2 y                      23.1ns  28.17ns 23492926.        0B      0

Created on 2020-02-14 by the reprex package (v0.3.0)

@romainfrancois
Copy link
Copy Markdown
Member Author

For similar reasons, I've updated vec_tbl_restore() :

library(tibble)
library(rlang)

df1 <- tibble(x = rnorm(1e6), g = sample(rep(1:1e4, 100)))
df2 <- tibble(x = rnorm(1e6))

vectbl_restore <- tibble:::vectbl_restore
fast_nrow <- function(x) {
  .row_names_info(x, 2L)
}

attrs_names_only <- c("names", "row.names")
vectbl_restore_old <- function(xo, x) {
  n <- fast_nrow(xo)
  
  # FIXME: Use new_tibble()?
  forbidden <- attrs_names_only
  attrs <- attributes(x)
  attrs <- attrs[!(names(attrs) %in% forbidden)]
  
  attributes(xo)[names(attrs)] <- attrs
  
  # Need to patch manually
  attr(xo, "row.names") <- .set_row_names(n)
  
  xo
}

bench::mark(
  vectbl_restore(df1, df2), 
  vectbl_restore_old(df1, df2)
)
#> # A tibble: 2 x 6
#>   expression                        min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                   <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 vectbl_restore(df1, df2)        929ns 1.27µs   753454.        0B      0  
#> 2 vectbl_restore_old(df1, df2)   3.97ms 4.81ms      209.    3.81MB     22.7

Created on 2020-02-14 by the reprex package (v0.3.0.9000)

@romainfrancois
Copy link
Copy Markdown
Member Author

closes tidyverse/dplyr#4855

Copy link
Copy Markdown
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into it. If you revert 752bfa2 as part of this PR, the CI tests should pass.

Comment thread src/attributes.c Outdated
Comment thread src/attributes.c Outdated
@krlmlr
Copy link
Copy Markdown
Member

krlmlr commented Feb 14, 2020

Oh, it's only a problem with long data frames:

library(tibble)
library(rlang)

df1 <- tibble(x = rnorm(1e6), g = sample(rep(1:1e4, 100)))
df1a <- df1[1:10000, ]
df1b <- df1[1:1000, ]
df1c <- df1[1:100, ]

vectbl_restore <- tibble:::vectbl_restore

bench::mark(
  vectbl_restore(df1, df1),
  vectbl_restore(df1a, df1a),
  vectbl_restore(df1b, df1b),
  vectbl_restore(df1c, df1c),
  check = FALSE
)
#> # A tibble: 4 x 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vectbl_restore(df1, df1)     3.57ms   3.97ms      241.    3.81MB     23.4
#> 2 vectbl_restore(df1a, df1a)  40.44µs  43.49µs    20181.   39.11KB     16.4
#> 3 vectbl_restore(df1b, df1b)   9.54µs  11.04µs    75970.    3.95KB     15.2
#> 4 vectbl_restore(df1c, df1c)   6.44µs   7.39µs   118001.      448B     23.6

Created on 2020-02-14 by the reprex package (v0.3.0)

Interesting threshold: going from 10k to 100k slows down by a factor of 100.

@romainfrancois
Copy link
Copy Markdown
Member Author

yes @krlmlr I suppose this is row.names being inconveninentnly materialized when using attributes[]<-

@romainfrancois

This comment has been minimized.

@krlmlr
Copy link
Copy Markdown
Member

krlmlr commented Feb 14, 2020

If you revert 752bfa2 the CI errors should go away.

Applying [ to a tibble speeds up subsequent execution of vectbl_restore():

library(tibble)
library(rlang)

df1 <- tibble(x = rnorm(1e6), g = sample(rep(1:1e4, 100)))
df1a <- df1[1:100000, ]
df1b <- df1[1:10000, ]
df1c <- df1[1:1000, ]

vectbl_restore <- tibble:::vectbl_restore

bench::mark(
  vectbl_restore(df1, df1),
  vectbl_restore(df1a, df1a),
  vectbl_restore(df1b, df1b),
  vectbl_restore(df1c, df1c),
  check = FALSE
)
#> # A tibble: 4 x 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vectbl_restore(df1, df1)     3.54ms   4.05ms      232.    3.81MB     23.4
#> 2 vectbl_restore(df1a, df1a) 345.57µs 401.14µs     2235.  390.67KB     18.3
#> 3 vectbl_restore(df1b, df1b)  40.69µs  46.54µs    19021.   39.11KB     13.2
#> 4 vectbl_restore(df1c, df1c)   9.79µs  11.77µs    71793.    3.95KB     14.4

Created on 2020-02-14 by the reprex package (v0.3.0)

I guess this no longer holds with the C implementation?

@romainfrancois
Copy link
Copy Markdown
Member Author

romainfrancois commented Feb 14, 2020

library(tibble)
library(rlang)

df1 <- tibble(x = rnorm(1e6), g = sample(rep(1:1e4, 100)))
df1a <- df1[1:100000, ]
df1b <- df1[1:10000, ]
df1c <- df1[1:1000, ]

vectbl_restore <- tibble:::vectbl_restore

bench::mark(
  vectbl_restore(df1, df1),
  vectbl_restore(df1a, df1a),
  vectbl_restore(df1b, df1b),
  vectbl_restore(df1c, df1c),
  check = FALSE
)
#> # A tibble: 4 x 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vectbl_restore(df1, df1)      947ns   1.09µs   629424.        0B      0  
#> 2 vectbl_restore(df1a, df1a)    932ns   1.06µs   635387.        0B      0  
#> 3 vectbl_restore(df1b, df1b)      1µs   1.23µs   516068.        0B      0  
#> 4 vectbl_restore(df1c, df1c)    949ns    1.1µs   782567.        0B     78.3

Created on 2020-02-14 by the reprex package (v0.3.0.9000)

romainfrancois and others added 2 commits February 14, 2020 13:26
Co-Authored-By: Kirill Müller <krlmlr@users.noreply.github.com>
@krlmlr krlmlr merged commit cba28e9 into tidyverse:master Feb 14, 2020
@krlmlr
Copy link
Copy Markdown
Member

krlmlr commented Feb 14, 2020

Thanks!

@romainfrancois romainfrancois deleted the update_tibble_attrs branch February 14, 2020 13:37
romainfrancois added a commit to tidyverse/dplyr that referenced this pull request Feb 21, 2020
romainfrancois added a commit to tidyverse/dplyr that referenced this pull request Feb 24, 2020
* 🧞tidyverse/tibble#717

* prototype vec_sprinkle()

* require fallback for more types

* handle names when sprinkling

* need to vec_cast() all the chunks to common type

* make all sprinkle steps internal with short_vec_init() and vec_cast() callables

* internal recycling by repeatedly calling vec_assign_impl() with the sprinkling

* rows_i_j needs PROTECT()

* swing back to master tibble

* globalVariables(o_rows) because it is defined lazily

* rename to vec_unchop() and move arguments around

* Apply suggestions from code review

Co-Authored-By: Lionel Henry <lionel.hry@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Lionel Henry <lionel.hry@gmail.com>

* PROTECT(Rf_getAttrib())

* skip again for a different reason

* intercept mutate() recycling problems earlier

* using vctrs::vec_unchop()

* using the callable vctrs::short_vec_recycle() so that chunks are recycled to fit

* No need to hang on to rows_lengths

* - simplify DataMask constructor
- provide DataMask$get_rows()

* Move size handling until after all the expressions. use vec_unchop() then

* recycling and then call vec_c() instead of using vec_unchop()

* use master vctrs

* vec_assign_impl() no longer used

* shirt_vec_init() no longer used

* vec_cast() not used intrenally

Co-authored-by: Lionel Henry <lionel.hry@gmail.com>
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Feb 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

performance of dplyr_col_modify()

3 participants