Performance decrease in dfm_group #1295

Closed
HolgersID opened this issue Apr 5, 2018 · 12 comments
HolgersID commented Apr 5, 2018

Hi,
thank you for this great package! I have used it successfully to demonstrate what text mining can and cannot achieve.

Between versions v1.0.0 and v1.1.0, I observed a severe performance regression in the function dfm_group. With the example below, grouping under v1.0.0 took on my system

system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##    1.59    0.31    1.92

With v1.1.0 this changed to

system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##   95.66    0.37   96.24

Surprisingly, this seems to depend on the presence of the additional document-level variable FGroup, which is of type factor. If it is removed, dfm_group is fast again:

docvars(tdfm, "FGroup") <- NULL
system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##    1.71    0.37    2.09

The DFM of the test data set has 100,000 documents and 15,600 features, with a sparsity of 99.7%. The document-level variable Group has approx. 73,000 distinct values.

Example data:

library(quanteda)

set.seed(4711)
no <- 100000L
tdata <- data.frame(
    ID     = as.character(seq.int(no)),
    Group  = sample(seq.int(1.5 * no), no, replace = TRUE),
    FGroup = factor(sample(seq.int(20L), no, replace = TRUE)),
    Text   = replicate(no, paste(replicate(50L, paste(sample(LETTERS, 3), collapse = "")),
                                 collapse = " ")),
    stringsAsFactors = FALSE
)
tdfm <- dfm(corpus(tdata, docid_field = "ID", text_field = "Text"))
koheiw (Collaborator) commented Apr 5, 2018

@HolgersID I know exactly what is going on: in v1.1.0 we changed dfm_group to keep the document-level variables when they are invariant within groups, and that check is taking a lot of time. I will see if I can make it faster, but you can simply drop all document-level variables with docvars(x) <- NULL before grouping if you want it fast now (this is the pre-v1.1.0 behavior).

koheiw (Collaborator) commented Apr 5, 2018

I think you are hit especially hard because you have a lot of small groups. I check the uniformity of values using split() and sapply(), and the latter is really slow.

quanteda/R/dfm_group.R

Lines 148 to 151 in e8a5c0e

# check that there is no within-group variance
is_grouped <- function(x, group) {
    all(sapply(split(x, group), function(x) all(x[1] == x)), na.rm = TRUE)
}
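To illustrate what this check computes (a toy sketch of my own, not package code): it returns TRUE exactly when every group holds a single repeated value.

```r
# TRUE iff the values of x are constant within every group
is_grouped <- function(x, group) {
    all(sapply(split(x, group), function(x) all(x[1] == x)), na.rm = TRUE)
}

is_grouped(c("a", "a", "b", "b"), c(1, 1, 2, 2))  # TRUE: constant within each group
is_grouped(c("a", "c", "b", "b"), c(1, 1, 2, 2))  # FALSE: group 1 mixes "a" and "c"
```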

koheiw (Collaborator) commented Apr 5, 2018

Can anyone write something faster than is_grouped2()? This might be a good SO challenge.

set.seed(4711)
n <- 100000L
v <- seq(n)
g <- sample(seq.int(1.5 * n), n, replace = TRUE)

# the current version
is_grouped <- function(x, group) {
    all(sapply(split(x, group), function(x) all(x[1] == x)), na.rm = TRUE)
}

require(Matrix)
is_grouped2 <- function(x, group) {
    all(rowSums(as(sparseMatrix(i = as.integer(group), 
                                j = match(x, unique(x)), 
                                x = rep(1L, length(x))), 'lgCMatrix')) == 1)
}

microbenchmark::microbenchmark(
    is_grouped(v, g),
    is_grouped2(v, g),
    unit = 'relative'
)
# Unit: relative
#              expr     min       lq     mean   median       uq      max neval
#   is_grouped(v, g) 1.73397 1.766153 1.682654 1.861066 1.646036 1.090419   100
#  is_grouped2(v, g) 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000   100

kbenoit (Collaborator) commented Apr 5, 2018

Good candidate for SO, but you should clearly explain what the inputs and expected outputs are, with examples.

kbenoit (Collaborator) commented Apr 5, 2018

data.table !!

require("data.table")
is_grouped3 <- function(x, group) {
    dt <- data.table(x, group)
    dt2 <- dt[, var(x), by = group]
    all(dt2[, V1] %in% c(NA, 0))
}

microbenchmark::microbenchmark(
    is_grouped(v, g),
    is_grouped2(v, g),
    is_grouped3(v, g),
    unit = "relative", times = 50
)
# Unit: relative
#               expr      min       lq     mean   median       uq       max neval
#   is_grouped(v, g) 6.078963 5.915561 5.338450 5.856405 5.130891 1.5265294   100
#  is_grouped2(v, g) 3.720617 3.596134 3.082539 3.552472 3.227739 0.3975824   100
#  is_grouped3(v, g) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100

koheiw (Collaborator) commented Apr 5, 2018

Great. I will use the code.

koheiw (Collaborator) commented Apr 5, 2018

It actually should be this

is_grouped3 <- function(x, group) {
    x <- match(x, unique(x))
    dt <- data.table(x, group)
    dt <- dt[, var(x), by = group]
    all(dt[, V1] %in% c(NA, 0))
}

because x is not always numeric. With that change, though, is_grouped3() is not much faster than is_grouped2(). Any ideas?
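For completeness, one more base-R sketch (my own addition, not from the thread, and it does not replicate the na.rm handling of the original): the values are constant within groups exactly when the set of unique (group, value) pairs contains one entry per group.

```r
# constant within groups <=> each group contributes exactly one unique (group, value) pair
is_grouped_alt <- function(x, group) {
    u <- unique(data.frame(g = group, v = match(x, unique(x))))
    !anyDuplicated(u$g)
}

is_grouped_alt(c("a", "a", "b", "b"), c(1, 1, 2, 2))  # TRUE
is_grouped_alt(c("a", "c", "b", "b"), c(1, 1, 2, 2))  # FALSE
```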

koheiw (Collaborator) commented Apr 5, 2018

I am writing is_grouped4() using Rcpp now.

koheiw (Collaborator) commented Apr 5, 2018

is_grouped4 <- function(x, group) {
    if (is.character(x)) {
        quanteda:::qatd_cpp_is_grouped_character(x, as.integer(group))
    } else {
        quanteda:::qatd_cpp_is_grouped_numeric(as.numeric(x), as.integer(group))
    }
}

I am happy with this.

Unit: relative
              expr       min        lq      mean    median        uq       max neval
  is_grouped(v, g) 486.59550 446.14608 100.57739 387.57089 347.79485 25.965659    10
 is_grouped2(v, g)  92.98925  87.11357  41.62165  88.16636 163.20247 22.435546    10
 is_grouped3(v, g)  82.62168  89.77822  20.15067  80.49150  73.94539  4.969287    10
 is_grouped4(v, g)   1.00000   1.00000   1.00000   1.00000   1.00000  1.000000    10

@koheiw koheiw mentioned this issue Apr 5, 2018
HolgersID (Author)

That was a really quick reply and analysis of the problem. Thank you very much!
Holger

koheiw (Collaborator) commented Apr 5, 2018

@HolgersID, please check if the problem is solved.

HolgersID (Author)

I can confirm that the issue is fixed as of commit 4ba983f (package version 1.1.6). With the test data set above, the new version of dfm_group takes on my system

system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##    2.28    0.25    2.54

Performance on my actual data set is also back up.
Thank you again. :-)

kbenoit added a commit that referenced this issue Apr 6, 2018