[[ subsetting much slower than $ #780

Closed
ebein opened this issue Jun 3, 2020 · 9 comments
Labels
bug an unexpected problem or unintended behavior performance 🏎️

ebein commented Jun 3, 2020

Starting with tibble 3.0.0, column subsetting using [[ is much slower than $. This causes slowdowns in functions that call [[ many times, for example data.matrix() on a wide tibble.

df <- tibble::tibble(x = 1)

bench::mark(
  dollar = df$x,
  bracket = df[["x"]],
  iterations = 1000
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dollar        6.8us    8.1us   100956.    7.96KB      0  
#> 2 bracket     190.3us  211.2us     3998.  165.09KB     12.0

Created on 2020-06-03 by the reprex package (v0.3.0)
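
To make the wide-tibble case concrete, here is a minimal sketch that times `data.matrix()` on the same data stored as a tibble and as a base data frame. The column count and the names `wide_tbl` and `wide_df` are arbitrary choices for illustration, and the timings will depend on the installed tibble version and the machine.

```r
library(tibble)

# Hypothetical wide input: 1 row, 2000 integer columns (sizes are arbitrary).
n_col <- 2000
cols <- setNames(as.list(seq_len(n_col)), paste0("v", seq_len(n_col)))

wide_tbl <- as_tibble(cols)      # tibble version
wide_df  <- as.data.frame(cols)  # base data frame version, for comparison

# data.matrix() extracts each column in turn, so per-column overhead adds up.
system.time(m_tbl <- data.matrix(wide_tbl))
system.time(m_df  <- data.matrix(wide_df))

# Both conversions should produce a matrix of the same shape.
stopifnot(identical(dim(m_tbl), dim(m_df)))
```

Any gap between the two timings reflects per-column extraction cost rather than the amount of data, since each column holds a single value.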

Member

hadley commented Jun 5, 2020

The call to vectbl_as_col_location2() is responsible for the speed difference, probably due to its use of tryCatch(), which is slow.
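
A rough way to see the cost of `tryCatch()` in isolation is to wrap a cheap operation with and without it. This is a sketch only: `lookup()`, `plain()` and `guarded()` are hypothetical stand-ins, not tibble's actual code, and the measured difference will vary by R version and machine.

```r
# Hypothetical stand-in for a cheap name lookup; not tibble's implementation.
lookup <- function(x, name) match(name, names(x))

# Same work, once called directly and once wrapped in a condition handler.
plain   <- function(x, name) lookup(x, name)
guarded <- function(x, name) {
  tryCatch(lookup(x, name), error = function(e) stop(e))
}

x <- list(a = 1, b = 2)
n <- 1e5

# Compare total time over many calls; single calls are too fast to time.
system.time(for (i in seq_len(n)) plain(x, "b"))
system.time(for (i in seq_len(n)) guarded(x, "b"))
```

Both functions return the same result, so any timing gap comes from establishing the handler on every call.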

hadley added the bug (an unexpected problem or unintended behavior) label Jun 5, 2020

md0u80c9 commented Jun 10, 2020

Out of interest, do you know if $ was significantly quicker prior to tibble 3 - or was performance more equal?

Author

ebein commented Jun 10, 2020

Here are the results from tibble 2.1.1 on a different machine. It seems $ was at most ~1.5x faster on 2.1.1 (by minimum time; the medians are identical) and is 25-30x faster on 3.0.1.

df <- tibble::tibble(x = 1)

bench::mark(
  dollar = df$x,
  bracket = df[["x"]],
  iterations = 1000
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dollar     682.65ns   1.36us   514262.    6.28KB        0
#> 2 bracket      1.02us   1.36us   472464.     5.3KB        0

Created on 2020-06-10 by the reprex package (v0.2.1)

md0u80c9 commented Jun 10, 2020

Thanks @ebein! All my machines were on tibble 3, and a full reinstall to check was enough of a hassle that I was cheeky and just asked the question. Very interesting: from a practical perspective it may be better to train my muscle memory to use $ where possible (obviously [[ still has benefits where the column name isn't a constant).
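
For readers unsure about the non-constant case, a small sketch of the difference, using a base data frame so it runs without extra packages (the names `df` and `col` are illustrative). The same distinction holds for tibbles, which additionally warn when `$` is given an unknown column.

```r
df <- data.frame(x = 1, y = 2)
col <- "y"

# [[ evaluates its argument, so the column named by the value of `col` is used.
by_bracket <- df[[col]]

# $ takes the name literally and looks for a column called "col".
by_dollar <- df$col

by_bracket  # the y column
by_dollar   # NULL, because no column is literally named "col"
```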

Member

krlmlr commented Jun 13, 2020

Once we remove the vectbl_as_col_location2() call and the associated overhead, run time drops to 9 µs. Still way too much, compared to 130 ns for base lists.

Member

krlmlr commented Jun 13, 2020

Pure S3 dispatch without doing actual work is already 1.3 µs. Oh well...
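
The dispatch overhead can be observed without tibble at all. The sketch below assumes a made-up class `demo_tbl` whose `[[` method does nothing except delegate to the default; it is not how tibble implements `[[`, and the absolute numbers will differ from those quoted here.

```r
# A plain list, and the same list with a hypothetical S3 class attached.
x <- list(a = 1)
y <- structure(list(a = 1), class = "demo_tbl")

# Method that does no real work: it only hands over to the internal default.
`[[.demo_tbl` <- function(x, i, ...) NextMethod()

n <- 1e5

# Internal `[[` on a bare list versus S3 dispatch plus NextMethod().
system.time(for (i in seq_len(n)) x[["a"]])
system.time(for (i in seq_len(n)) y[["a"]])
```

Both calls return the same value, so the difference between the two timings approximates the fixed cost of going through an R-level method.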

Member

krlmlr commented Jun 14, 2020

Now:

df <- tibble::tibble(x = 1)

bench::mark(
  dollar = df$x,
  bracket = df[["x"]],
  iterations = 1000
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dollar       4.29µs   4.67µs   199184.    17.6KB        0
#> 2 bracket      3.91µs   4.25µs   219202.    90.6KB        0

Created on 2020-06-14 by the reprex package (v0.3.0)

Member

krlmlr commented Jun 14, 2020

We could strive for even faster processing (closer to 2 µs), but I suspect that needs a full rewrite in C. The current speed should be fast enough for most use cases.

krlmlr added a commit that referenced this issue Feb 25, 2021
tibble 3.0.2

- `[[` works with classed indexes again, e.g. created with `glue::glue()` (#778).
- `add_column()` works without warning for 0-column data frames (#786).
- `tribble()` now better handles named inputs (#775) and objects of non-vtrs classes like `lubridate::Period` (#784) and `formattable::formattable` (#785).

- Subsetting and subassignment are faster (#780, #790, #794).
- `is.null()` is preferred over `is_null()` for speed.
- Implement continuous benchmarking (#793).

- `is_vector_s3()` is no longer reexported from pillar (#789).

Contributor

github-actions bot commented Jun 15, 2021

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

github-actions bot locked and limited conversation to collaborators Jun 15, 2021