Extract and separate are too slow #72

Closed
hadley opened this Issue Apr 7, 2015 · 5 comments


hadley commented Apr 7, 2015

No description provided.


aaronwolen commented Apr 7, 2015

Yeah, same for unnest_string(). I wanted to use the stringi::stri_split_*() functions, which provide a nice performance boost, but doing so would require additional arguments for the user to specify whether the pattern should be treated as fixed or as a regex. And I thought unnest_string() should pass its arguments on to strsplit() to stay consistent with separate().

I personally like the way stringr handles that specification with the fixed() and regex() functions. What do you think about importing those and using them to determine whether stringi::stri_split_fixed() or stringi::stri_split_regex() should be called?
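
A minimal sketch of that dispatch, assuming an inherits()-style check on stringr's pattern-modifier classes (the helper name and the "fixed" class check are illustrative, not actual tidyr or stringr code):

split_pattern <- function(x, pattern, n = -1L) {
  # Assumed: stringr::fixed() tags the pattern with a "fixed" class;
  # anything else falls through to the regex engine.
  if (inherits(pattern, "fixed")) {
    stringi::stri_split_fixed(x, pattern, n = n)
  } else {
    stringi::stri_split_regex(x, pattern, n = n)
  }
}

# split_pattern("a-b-c", stringr::fixed("-"))  # literal split
# split_pattern("a-b-c", "-")                  # regex split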


hadley commented Apr 7, 2015

@aaronwolen the dev version of stringr uses stringi, so once that rolls out we'll get stringi performance for free. That might be enough for this issue.


hadley commented May 19, 2015

A bit of benchmarking suggests that separate() is ok (only 2x slower than stringi), but extract() needs some work:

library(tidyr)
library(dplyr)   # data_frame() comes from dplyr
library(stringi)
options(digits = 3)

x <- replicate(1e5, paste(sample(letters, 3), collapse = "-"))
df <- data_frame(x)

microbenchmark::microbenchmark(
  separate = separate(df, x, c("x", "y", "z"), "-"),
  regex = stri_split_regex(x, "-"),
  regex_n = stri_split_regex(x, "-", n = 3),
  fixed = stri_split_fixed(x, "-"),
  times = 10
)
#> Unit: milliseconds
#>      expr   min    lq  mean median    uq   max neval cld
#>  separate 106.6 109.6 122.6  112.5 120.0 208.5    10   c
#>     regex  64.9  66.0  68.1   67.6  70.4  72.4    10  b 
#>   regex_n  65.1  65.3  67.6   66.7  68.9  73.3    10  b 
#>     fixed  37.5  37.8  39.8   38.1  38.6  53.9    10 a  

microbenchmark::microbenchmark(
  extract = extract(df, x, c("x", "y", "z"), "(.)-(.)-(.)"),
  regex = stri_match_first_regex(x, "(.)-(.)-(.)"),
  times = 10
)
#> Unit: milliseconds
#>     expr    min     lq mean median   uq  max neval cld
#>  extract 1005.0 1073.7 1104   1092 1119 1209    10   b
#>    regex   82.1   89.5  115     91  115  206    10  a 
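
For reference, a stringi-backed extract() could look roughly like this (a hedged sketch, not necessarily what went into tidyr): stri_match_first_regex() returns a character matrix with the full match in column 1 and one column per capture group after it, which maps directly onto the new columns.

extract_stringi <- function(df, col, into, regex) {
  # Hypothetical helper, for illustration only.
  m <- stringi::stri_match_first_regex(df[[col]], regex)
  groups <- as.data.frame(m[, -1, drop = FALSE], stringsAsFactors = FALSE)
  names(groups) <- into
  cbind(df[setdiff(names(df), col)], groups)
}

# extract_stringi(df, "x", c("x", "y", "z"), "(.)-(.)-(.)")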

hadley closed this in f69cd17 on May 19, 2015


hadley commented May 19, 2015

The benchmark is now:

#> Unit: milliseconds
#>     expr  min   lq mean median   uq  max neval cld
#>  extract 86.7 90.6 93.5   95.1 96.2 98.3    10   b
#>    regex 79.1 80.3 82.0   80.8 82.1 88.1    10  a 

hadley commented May 21, 2015

And post be2eb95:

#> Unit: milliseconds
#>      expr  min    lq  mean median    uq   max neval
#>  separate 97.7 100.7 111.6  101.1 111.8 173.9    10
#>     regex 68.5  69.7  72.4   71.4  74.1  80.1    10
#>   regex_n 65.9  66.8  75.8   68.3  70.8 143.5    10
#>     fixed 33.2  34.1  37.7   35.5  40.3  51.5    10
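
For comparison, a stri_split_regex()-based separate() might look something like this (a hedged sketch with assumed column handling, not the code in be2eb95):

separate_stringi <- function(df, col, into, sep) {
  # Hypothetical helper: split into at most length(into) pieces and
  # bind the resulting character matrix back on as the new columns.
  m <- stringi::stri_split_regex(df[[col]], sep, n = length(into),
                                 simplify = TRUE)
  pieces <- as.data.frame(m, stringsAsFactors = FALSE)
  names(pieces) <- into
  cbind(df[setdiff(names(df), col)], pieces)
}

# separate_stringi(df, "x", c("x", "y", "z"), "-")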