Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance fixes for wide data frames #3543

Merged
merged 2 commits into from May 3, 2018

Conversation

romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Apr 25, 2018

This was going through the SymbolVector::push_back method.

void push_back(const SymbolString& s) {
    v.push_back(s.get_string());
  }

and v is a CharacterVector 😱

This is still expensive e.g. compared to [, I'll look for other bottlenecks.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

n_samples <- 500
n_features <- 100000

df <- bind_cols(
  tibble(sample_id = paste0("sample_", 1:n_samples)),
  1:n_features %>%
    map(rnorm, n = n_samples) %>%
    map(as_tibble) %>%
    bind_cols()
)

system.time(fetched_sample <- filter(df, sample_id == "sample_1"))
#>    user  system elapsed 
#>   1.720   0.029   1.755
system.time(fetched_sample <- df[which(df$sample_id == "sample_1"),])
#>    user  system elapsed 
#>   0.048   0.001   0.048

For reference in the original issue, these were the timings:

> system.time(fetched_sample <- filter(df, sample_id == "sample_1"))
#>.   user  system elapsed
#> 165.060   0.110 165.446

@krlmlr
Copy link
Member

krlmlr commented May 2, 2018

Is there a corresponding issue? Please merge with a corresponding NEWS bullet point, I've created the r-0.7.5 branch just in case.

@romainfrancois
Copy link
Member Author

I can merge but there might be some more work in there. The issue is #3335

@krlmlr
Copy link
Member

krlmlr commented May 2, 2018

Do you mean you can make the code even faster? How? Do you want to leave the PR open, or merge, leave the issue open then, and revisit later?

@romainfrancois
Copy link
Member Author

I’m pretty sure this can be faster. Might br a good example to benchmark the SlicingIndex bounds checks for example.

But i guess it does not hurt to merge and leave the issue open

@krlmlr
Copy link
Member

krlmlr commented May 2, 2018

Can you please add a bullet point to NEWS.md?

…t` method in a loop.

more efficient `LazySubsets` constructor.

#3335
@romainfrancois romainfrancois force-pushed the feature-3335-wide-performance branch from d21159a to 32dba91 Compare May 3, 2018 07:39
@romainfrancois romainfrancois merged commit 5da3d0b into master May 3, 2018
@romainfrancois romainfrancois deleted the feature-3335-wide-performance branch May 3, 2018 07:42
@lock
Copy link

lock bot commented Oct 30, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 30, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
performance 🚀 wip work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants