Performance decrease in dfm_group #1295
@HolgersID I know exactly what is going on: we changed the
I think you are hit hard because you have a lot of small groups. I am checking the uniformity of values using the code at lines 148 to 151 in e8a5c0e:
Can anyone write something faster than this?

```r
set.seed(4711)
n <- 100000L
v <- seq(n)
g <- sample(seq.int(1.5 * n), n, replace = TRUE)

# this is the current version
is_grouped <- function(x, group) {
  all(sapply(split(x, group), function(x) all(x[1] == x)), na.rm = TRUE)
}

require(Matrix)
is_grouped2 <- function(x, group) {
  all(rowSums(as(sparseMatrix(i = as.integer(group),
                              j = match(x, unique(x)),
                              x = rep(1L, length(x))), 'lgCMatrix')) == 1)
}

microbenchmark::microbenchmark(
  is_grouped(v, g),
  is_grouped2(v, g),
  unit = 'relative'
)
# Unit: relative
#               expr     min       lq     mean   median       uq      max neval
#   is_grouped(v, g) 1.73397 1.766153 1.682654 1.861066 1.646036 1.090419   100
#  is_grouped2(v, g) 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000   100
```
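A simpler base-R variant (my addition, not from the thread) avoids `split()` entirely: the data are "grouped" exactly when each group carries a single distinct value, so it suffices to count distinct (group, value) pairs. The function name `is_grouped_pairs` is hypothetical.

```r
is_grouped_pairs <- function(x, group) {
  # map values of any atomic type to integer codes
  ux <- match(x, unique(x))
  # grouped iff the number of distinct (group, value) pairs
  # equals the number of distinct groups
  length(unique(paste(group, ux))) == length(unique(group))
}
```

This stays vectorized and works for character input as well, though `paste()` allocates a long character vector, so it is a sketch rather than the fastest possible approach.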
Good candidate for SO, but you should clearly explain what the inputs and expected outputs are, with examples.
data.table!!

```r
require("data.table")
is_grouped3 <- function(x, group) {
  dt <- data.table(x, group)
  dt2 <- dt[, var(x), by = group]
  all(dt2[, V1] %in% c(NA, 0))
}

microbenchmark::microbenchmark(
  is_grouped(v, g),
  is_grouped2(v, g),
  is_grouped3(v, g),
  unit = "relative", times = 50
)
# Unit: relative
#               expr      min       lq     mean   median       uq       max neval
#   is_grouped(v, g) 6.078963 5.915561 5.338450 5.856405 5.130891 1.5265294   100
#  is_grouped2(v, g) 3.720617 3.596134 3.082539 3.552472 3.227739 0.3975824   100
#  is_grouped3(v, g) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
```
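The `var()` trick above requires numeric input and relies on single-element groups returning `NA`. A variant of my own (not from the thread) using `data.table::uniqueN()` counts distinct values per group directly and handles character, factor, and numeric input alike; the function name `is_grouped_uniqueN` is hypothetical.

```r
library(data.table)

is_grouped_uniqueN <- function(x, group) {
  dt <- data.table(x = x, group = group)
  # uniqueN() is data.table's fast count of distinct values;
  # every group must contain exactly one
  all(dt[, uniqueN(x), by = group]$V1 == 1L)
}
```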
Great. I will use the code.
It actually should be this, because x is not always numeric:

```r
is_grouped3 <- function(x, group) {
  x <- match(x, unique(x))  # map any type to integer codes first
  dt <- data.table(x, group)
  dt <- dt[, var(x), by = group]
  all(dt[, V1] %in% c(NA, 0))
}
```
I am writing:

```r
is_grouped4 <- function(x, group) {
  if (is.character(x)) {
    quanteda:::qatd_cpp_is_grouped_character(x, as.integer(group))
  } else {
    quanteda:::qatd_cpp_is_grouped_numeric(as.numeric(x), as.integer(group))
  }
}
```

I am happy with this.
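For readers without the C++ toolchain, here is a base-R sketch of the single-pass idea that dedicated helpers like these presumably implement (this is my assumption, not quanteda's actual code): remember the first value seen in each group and bail out at the first mismatch. The name `is_grouped_loop` is hypothetical.

```r
is_grouped_loop <- function(x, group) {
  # assumes x is an atomic vector (not a factor)
  g <- as.integer(factor(group))     # dense 1..k group indices
  first <- vector(typeof(x), max(g)) # first value seen per group
  seen <- logical(max(g))
  for (i in seq_along(x)) {
    k <- g[i]
    if (!seen[k]) {
      seen[k] <- TRUE
      first[k] <- x[i]
    } else if (!identical(first[k], x[i])) {
      return(FALSE)                  # early exit on first mismatch
    }
  }
  TRUE
}
```

An interpreted loop like this is slow in R itself; the point of moving it to C++ is that the same early-exit scan becomes essentially free.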
That was a really quick reply and analysis of the problem. Thank you very much!
@HolgersID, please check if the problem is solved.
I can confirm that with commit 4ba983f (package version 1.1.6) the issue is fixed. Using the test data set above, the new version gives:

```r
system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##    2.28    0.25    2.54
```

Performance with the actual data set is also back up.
Hi,
thank you for this great package! I have used it successfully to demonstrate what can, and cannot, be achieved by text mining.

Between versions v1.0.0 and v1.1.0 I observed a strong decrease in the performance of the function `dfm_group`. For the example below, a grouping under v1.0.0 was fast on my system; with v1.1.0 it became much slower.

Astonishingly, this seems to depend on the fact that the DFM contains an additional document-level variable `FGroup` of type factor. If this is removed, `dfm_group` is fast again.

The DFM of the test data set consists of 100,000 documents and 15,600 features, and has a sparsity of 99.7%. The document-level variable `Group` has approx. 73,000 different values.

Example data: