Extract and separate are too slow #72

Closed
hadley opened this issue Apr 7, 2015 · 6 comments

hadley (Member) commented Apr 7, 2015

No description provided.

aaronwolen (Contributor) commented

Yeah, same for unnest_string(). I wanted to use the stringi::stri_split_* functions, which provide a nice performance boost, but doing so would require additional arguments so the user can specify whether the pattern should be treated as fixed or regex. And I thought unnest_string() should pass its arguments through to strsplit() to stay consistent with separate().

I personally like the way stringr handles that specification with its fixed() and regex() functions. What do you think about importing those and using them to decide whether stringi::stri_split_fixed() or stringi::stri_split_regex() should be called?
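
A minimal sketch of that dispatch idea, assuming a class-tagged pattern (the fixed_pat() wrapper here is hypothetical, standing in for stringr's fixed()):

library(stringi)

# Hypothetical wrapper mirroring stringr's fixed() tagging idea
fixed_pat <- function(x) structure(x, class = c("fixed_pat", "character"))

split_pattern <- function(x, pattern) {
  if (inherits(pattern, "fixed_pat")) {
    # Literal match: bypasses the regex engine entirely
    stri_split_fixed(x, unclass(pattern))
  } else {
    # Default: treat the pattern as a regular expression
    stri_split_regex(x, pattern)
  }
}

split_pattern(c("a-b", "c-d"), fixed_pat("-"))  # fixed path
split_pattern(c("a1b", "c2d"), "[0-9]")         # regex path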

hadley (Member, Author) commented Apr 7, 2015

@aaronwolen The dev version of stringr uses stringi, so once that rolls out we'll get stringi performance for free. That might be enough to close this issue.

hadley (Member, Author) commented May 19, 2015

A bit of benchmarking suggests that separate() is ok (only 2x slower than stringi), but extract() needs some work:

library(tidyr)
library(dplyr)  # for data_frame()
library(stringi)
options(digits = 3)

# 100,000 strings of the form "a-b-c"
x <- replicate(1e5, paste(sample(letters, 3), collapse = "-"))
df <- data_frame(x)

microbenchmark::microbenchmark(
  separate = separate(df, x, c("x", "y", "z"), "-"),
  regex = stri_split_regex(x, "-"),
  regex_n = stri_split_regex(x, "-", n = 3),
  fixed = stri_split_fixed(x, "-"),
  times = 10
)
#> Unit: milliseconds
#>      expr   min    lq  mean median    uq   max neval cld
#>  separate 106.6 109.6 122.6  112.5 120.0 208.5    10   c
#>     regex  64.9  66.0  68.1   67.6  70.4  72.4    10  b 
#>   regex_n  65.1  65.3  67.6   66.7  68.9  73.3    10  b 
#>     fixed  37.5  37.8  39.8   38.1  38.6  53.9    10 a  

microbenchmark::microbenchmark(
  extract = extract(df, x, c("x", "y", "z"), "(.)-(.)-(.)"),
  regex = stri_match_first_regex(x, "(.)-(.)-(.)"),
  times = 10
)
#> Unit: milliseconds
#>     expr    min     lq mean median   uq  max neval cld
#>  extract 1005.0 1073.7 1104   1092 1119 1209    10   b
#>    regex   82.1   89.5  115     91  115  206    10  a 
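
Most of that gap is extract()'s own overhead, since stri_match_first_regex() already returns every capture group in one call. A rough sketch of a stringi-based core (my guess at the shape of the fix, not necessarily the code in f69cd17):

# Pull capture groups into columns with a single stringi call
extract_cols <- function(x, into, regex) {
  m <- stri_match_first_regex(x, regex)  # full match plus one column per group
  groups <- m[, -1, drop = FALSE]        # drop the full-match column
  colnames(groups) <- into
  as.data.frame(groups, stringsAsFactors = FALSE)
}

head(extract_cols(x, c("x", "y", "z"), "(.)-(.)-(.)"))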

hadley closed this as completed in f69cd17 on May 19, 2015
hadley (Member, Author) commented May 19, 2015

The extract() benchmark is now:

#> Unit: milliseconds
#>     expr  min   lq mean median   uq  max neval cld
#>  extract 86.7 90.6 93.5   95.1 96.2 98.3    10   b
#>    regex 79.1 80.3 82.0   80.8 82.1 88.1    10  a 

hadley (Member, Author) commented May 21, 2015

And the separate() benchmark after commit be2eb95:

#> Unit: milliseconds
#>      expr  min    lq  mean median    uq   max neval
#>  separate 97.7 100.7 111.6  101.1 111.8 173.9    10
#>     regex 68.5  69.7  72.4   71.4  74.1  80.1    10
#>   regex_n 65.9  66.8  75.8   68.3  70.8 143.5    10
#>     fixed 33.2  34.1  37.7   35.5  40.3  51.5    10
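
For the fixed-separator case there is still headroom: stri_split_fixed() can return a character matrix directly, which is essentially the shape separate() needs. An illustration (not separate()'s actual internals; simplify is the current stringi argument):

# One call, one character matrix: a column per piece
pieces <- stri_split_fixed(x, "-", simplify = TRUE)
head(pieces)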

natxosorolla commented

This code ran on my computer in minutes:
library(stringi)
t5_bisep <- stri_split_fixed(t5_bigrams$bigram, " ")

My other code ran for hours without finishing:
library(dplyr)
library(tidyr)
t5_bisep <- t5_bigrams %>% separate(bigram, c("word1", "word2"), sep = " ")
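
A workaround in the meantime (assuming every bigram splits into exactly two words; t5_bigrams is the data from above): do the split once with stringi, then bind the columns on.

library(stringi)
library(dplyr)

# Split once into a character matrix, then attach the columns
pieces <- stri_split_fixed(t5_bigrams$bigram, " ", simplify = TRUE)
t5_bisep <- t5_bigrams %>%
  mutate(word1 = pieces[, 1], word2 = pieces[, 2])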
