Closed
Description
The *_join() functions are slow when the key is a character column. Is there any plan to improve this?
library("dplyr")
set.seed(71)
size1 <- 4*10^5
size2 <- size1 * 0.1
df1 <- data.frame(id=paste0("SERVICE_", 1:size1), value=rnorm(size1), stringsAsFactors=FALSE)
df2 <- data.frame(id=paste0("SERVICE_", sample(1:size1, size2)), value=rnorm(size2), stringsAsFactors=FALSE)
print(system.time(ljd <- dplyr::left_join(df1, df2, "id")))
#> user system elapsed
#> 15.50 0.07 15.56
I think *_join() can be made faster in most cases by converting the key to a factor beforehand. Furthermore, I believe the key should be treated as a factor anyway, since no one will try to join by a column where every row has a different value.
print(system.time({
  # Build one shared set of levels so both keys use the same factor coding.
  lvl <- unique(c(df1$id, df2$id))
  ljd <- dplyr::left_join(
    mutate(df1, id = factor(id, levels = lvl)),
    mutate(df2, id = factor(id, levels = lvl)),
    "id"
  )
}))
#> user system elapsed
#> 0.33 0.10 0.42
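One caveat with the workaround above is that the joined `id` column comes back as a factor rather than a character vector. The following is a minimal sketch of how the same idea could be wrapped so the key is converted back afterwards, with a check that the result agrees with a plain character join. The helper name `left_join_via_factor` and the `small1`/`small2` data frames are hypothetical and only for illustration; this is not part of dplyr.

```r
library("dplyr")

# Hypothetical helper (not a dplyr function): left join on a character key by
# converting it to a factor with shared levels, then back to character.
left_join_via_factor <- function(x, y, key) {
  lvl <- unique(c(x[[key]], y[[key]]))
  x[[key]] <- factor(x[[key]], levels = lvl)
  y[[key]] <- factor(y[[key]], levels = lvl)
  out <- dplyr::left_join(x, y, by = key)
  out[[key]] <- as.character(out[[key]])  # restore the original key type
  out
}

# Small illustrative data, shaped like df1/df2 above.
set.seed(71)
small1 <- data.frame(id = paste0("SERVICE_", 1:1000),
                     value = rnorm(1000), stringsAsFactors = FALSE)
small2 <- data.frame(id = paste0("SERVICE_", sample(1:1000, 100)),
                     value = rnorm(100), stringsAsFactors = FALSE)

plain <- dplyr::left_join(small1, small2, by = "id")
via_factor <- left_join_via_factor(small1, small2, "id")

# Both approaches should agree on the keys and on the joined values.
stopifnot(identical(plain$id, via_factor$id))
stopifnot(isTRUE(all.equal(plain$value.y, via_factor$value.y)))
```

Whether the conversion back to character eats into the time saved depends on the data size, so it would be worth timing on the full `df1`/`df2` before relying on it.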