New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new_tibble() is slow for lots of repeated use #471

Closed
earowang opened this Issue Aug 25, 2018 · 6 comments

Comments

Projects
None yet
3 participants
@earowang
Copy link
Contributor

earowang commented Aug 25, 2018

new_tibble() provides convenience to easily create a new subclass, but it may be inhibited by downstream packages due to its speed performance, when calling it many times.

  • CRAN version:
library(tibble)
x <- tibble(a = 1:3, b = 4:6)

bench::mark(
  tibble = new_tibble(x, nrow = 3, subclass = "my_class"),
  base = structure(x, class = c("my_class", "tbl_df", "tbl", "data.frame"))
)
#> # A tibble: 2 x 10
#>   expression     min   mean  median     max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#> 1 tibble     57.14µs 68.3µs 66.61µs 330.7µs    14646.     3.8KB    34  6006
#> 2 base        4.92µs  6.2µs  5.98µs  59.9µs   161368.        0B     7  9993
#> # ... with 1 more variable: total_time <bch:tm> 
  • Dev version:
#> # A tibble: 2 x 10
#>   expression      min     mean   median   max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch>     <dbl> <bch:byt> <dbl>
#> 1 tibble     146.81µs 174.21µs 170.01µs 347µs     5740.     364KB    28
#> 2 base         4.93µs   6.31µs   6.11µs 111µs   158500.        0B     7
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>
#> The tibble must inherit from tbl_df and tbl, use tibble(), as_tibble() or new_tibble() for construction.

Created on 2018-08-25 by the reprex
package
(v0.2.0).

@krlmlr krlmlr added the performance label Oct 12, 2018

@krlmlr krlmlr added this to the 1.5.0 milestone Oct 12, 2018

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Oct 13, 2018

Thanks. One obvious way to shave off more than half of the slowdown is to unclass the argument. The rest of the time is spent on checks which are currently mandatory:

# from #501
pkgload::load_all("~/git/R/tibble")
#> Loading tibble

x <- as_tibble(setNames(map(as.list(letters), "[", c(1, 1, 1)), LETTERS))

bench::mark(
  tibble = new_tibble(x, nrow = 3, subclass = "my_class"),
  tibble_list = new_tibble(unclass(x), nrow = 3, subclass = "my_class"),
  new_valid_tibble = new_valid_tibble(unclass(x), .nrow = 3, .subclass = "my_class"),
  set_tibble_class = set_tibble_class(unclass(x), .nrow = 3, .subclass = "my_class"),
  base = structure(x, row.names = .set_row_names(3), class = c("my_class", "tbl_df", "tbl", "data.frame"))
)
#> # A tibble: 5 x 10
#>   expression      min     mean   median      max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
#> 1 tibble      551.8µs 757.31µs    640µs   2.14ms     1320.     2.3KB    12
#> 2 tibble_li… 123.66µs 155.37µs 139.65µs 482.87µs     6436.    1.25KB    14
#> 3 new_valid…  85.92µs 106.42µs  96.58µs 367.57µs     9397.    1.25KB    14
#> 4 set_tibbl…   6.36µs   8.71µs   7.54µs  83.84µs   114829.      768B     3
#> 5 base         5.03µs    5.9µs   5.64µs  31.84µs   169607.      512B     3
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>

Created on 2018-10-13 by the reprex package (v0.2.1.9000)

@hadley: Do we want to support tibble constructors with fewer checks (perhaps as arguments to new_tibble()) so that users can construct tibbles from lists ~12x-30x faster (e.g. 8 µs vs. 130 µs)?

@hadley

This comment has been minimized.

Copy link
Member

hadley commented Oct 15, 2018

I think we should refactor new_tibble() into new_tibble() and validate_tibble(), using the updated templates from adv-r/vctrs.

What is tibble:::update_tibble_attrs() supposed to do? I can't tell from reading the code.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Oct 19, 2018

To avoid foot-bullet intersections...

Made my day.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Oct 19, 2018

Simplified update_tibble_attrs(): f6f1c6c.

@hadley: Does it matter if we use class or subclass as argument name for the vector of subclass names? tibble currently uses subclass, is it worth changing to class?

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Oct 19, 2018

So the constructor doesn't even check correctness of names? Just copy attributes and set row names and class?

Currently, new_tibble() also strips the "dim" attribute from all vectors. This is expensive, so the constructor won't do it anymore. Another breaking change, let's hope the impact is small.

@hadley

This comment has been minimized.

Copy link
Member

hadley commented Oct 20, 2018

I think the general rule is that constructor should check types/lengths and nothing else, in order to be as performant as possible. If we can check names very quickly, I think it's ok to include, but generally the constructor is a developer facing function that should only be used when you know that the values are correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment