Performance decrease in dfm_group #1295

Closed
HolgersID opened this issue Apr 5, 2018 · 12 comments
HolgersID commented Apr 5, 2018

Hi,
thank you for this great package! I have used it successfully to demonstrate what text mining can and cannot achieve.

Between versions v1.0.0 and v1.1.0, I observed a severe performance regression in the function dfm_group. With the example below, grouping under v1.0.0 took on my system

system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##    1.59    0.31    1.92

With v1.1.0 this changed to

system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##   95.66    0.37   96.24

Surprisingly, this seems to depend on the presence of the additional document-level variable FGroup, which is of type factor. If it is removed, dfm_group is fast again:

docvars(tdfm, "FGroup") <- NULL
system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##    1.71    0.37    2.09

The DFM of the test data set has 100,000 documents and 15,600 features, with a sparsity of 99.7%. The document-level variable Group has approx. 73,000 distinct values.

Example data:

library(quanteda)

set.seed(4711)
no <- 100000L
tdata <- data.frame(
    ID     = as.character(seq.int(no)),
    Group  = sample(seq.int(1.5 * no), no, replace = TRUE),
    FGroup = factor(sample(seq.int(20L), no, replace = TRUE)),
    Text   = replicate(no, paste(replicate(50L, paste(sample(LETTERS, 3), collapse = "")),
                                 collapse = " ")),
    stringsAsFactors = FALSE
)
tdfm <- dfm(corpus(tdata, docid_field = "ID", text_field = "Text"))
koheiw (Collaborator) commented Apr 5, 2018

@HolgersID I know exactly what is going on: in v1.1.0 we changed dfm_group to keep the document-level variables when they are invariant within groups, and that check is taking a lot of time. I will see if I can make it faster, but you can simply drop all document-level variables with docvars(x) <- NULL before grouping if you want it fast now (this is the pre-v1.1.0 behavior).

koheiw (Collaborator) commented Apr 5, 2018

I think you are hit especially hard because you have a lot of small groups. I check the uniformity of values using split() and sapply(), and the latter is really slow.

quanteda/R/dfm_group.R

Lines 148 to 151 in e8a5c0e

# check that there is no within-group variance
is_grouped <- function(x, group) {
    all(sapply(split(x, group), function(x) all(x[1] == x)), na.rm = TRUE)
}
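To illustrate what this check computes (a toy sketch of my own, not package code): it returns TRUE exactly when every group holds a single repeated value.

```r
# TRUE iff the values of x are constant within every group
is_grouped <- function(x, group) {
    all(sapply(split(x, group), function(x) all(x[1] == x)), na.rm = TRUE)
}

is_grouped(c("a", "a", "b", "b"), c(1, 1, 2, 2))  # TRUE: constant within each group
is_grouped(c("a", "c", "b", "b"), c(1, 1, 2, 2))  # FALSE: group 1 mixes "a" and "c"
```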

koheiw (Collaborator) commented Apr 5, 2018

Can anyone write something faster than is_grouped2()? This might be a good SO challenge.

set.seed(4711)
n <- 100000L
v <- seq(n)
g <- sample(seq.int(1.5 * n), n, replace = TRUE)

# the current version
is_grouped <- function(x, group) {
    all(sapply(split(x, group), function(x) all(x[1] == x)), na.rm = TRUE)
}

require(Matrix)
is_grouped2 <- function(x, group) {
    all(rowSums(as(sparseMatrix(i = as.integer(group), 
                                j = match(x, unique(x)), 
                                x = rep(1L, length(x))), 'lgCMatrix')) == 1)
}

microbenchmark::microbenchmark(
    is_grouped(v, g),
    is_grouped2(v, g),
    unit = 'relative'
)
# Unit: relative
#              expr     min       lq     mean   median       uq      max neval
#   is_grouped(v, g) 1.73397 1.766153 1.682654 1.861066 1.646036 1.090419   100
#  is_grouped2(v, g) 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000   100

kbenoit (Collaborator) commented Apr 5, 2018

Good candidate for SO, but you should clearly explain what the inputs and expected outputs are, with examples.

kbenoit (Collaborator) commented Apr 5, 2018

data.table !!

require("data.table")
is_grouped3 <- function(x, group) {
    dt <- data.table(x, group)
    dt2 <- dt[, var(x), by = group]
    all(dt2[, V1] %in% c(NA, 0))
}

microbenchmark::microbenchmark(
    is_grouped(v, g),
    is_grouped2(v, g),
    is_grouped3(v, g),
    unit = "relative", times = 50
)
# Unit: relative
#               expr      min       lq     mean   median       uq       max neval
#   is_grouped(v, g) 6.078963 5.915561 5.338450 5.856405 5.130891 1.5265294   100
#  is_grouped2(v, g) 3.720617 3.596134 3.082539 3.552472 3.227739 0.3975824   100
#  is_grouped3(v, g) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100

koheiw (Collaborator) commented Apr 5, 2018

Great. I will use the code.

koheiw (Collaborator) commented Apr 5, 2018

It actually should be this

is_grouped3 <- function(x, group) {
    x <- match(x, unique(x))
    dt <- data.table(x, group)
    dt <- dt[, var(x), by = group]
    all(dt[, V1] %in% c(NA, 0))
}

because x is not always numeric. With that change, though, is_grouped3() is not much faster than is_grouped2(). Any ideas?
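For completeness, one more base-R sketch (my own addition, not from the thread, and it does not replicate the na.rm handling of the original): the values are constant within groups exactly when the set of unique (group, value) pairs contains one entry per group.

```r
# constant within groups <=> each group contributes exactly one unique (group, value) pair
is_grouped_alt <- function(x, group) {
    u <- unique(data.frame(g = group, v = match(x, unique(x))))
    !anyDuplicated(u$g)
}

is_grouped_alt(c("a", "a", "b", "b"), c(1, 1, 2, 2))  # TRUE
is_grouped_alt(c("a", "c", "b", "b"), c(1, 1, 2, 2))  # FALSE
```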

koheiw (Collaborator) commented Apr 5, 2018

I am writing is_grouped4() using Rcpp now.

koheiw (Collaborator) commented Apr 5, 2018

is_grouped4 <- function(x, group) {
    if (is.character(x)) {
        quanteda:::qatd_cpp_is_grouped_character(x, as.integer(group))
    } else {
        quanteda:::qatd_cpp_is_grouped_numeric(as.numeric(x), as.integer(group))
    }
}

I am happy with this.

Unit: relative
              expr       min        lq      mean    median        uq       max neval
  is_grouped(v, g) 486.59550 446.14608 100.57739 387.57089 347.79485 25.965659    10
 is_grouped2(v, g)  92.98925  87.11357  41.62165  88.16636 163.20247 22.435546    10
 is_grouped3(v, g)  82.62168  89.77822  20.15067  80.49150  73.94539  4.969287    10
 is_grouped4(v, g)   1.00000   1.00000   1.00000   1.00000   1.00000  1.000000    10

@koheiw koheiw mentioned this issue Apr 5, 2018
HolgersID (Author)

That was a really quick reply and analysis of the problem. Thank you very much!
Holger

koheiw (Collaborator) commented Apr 5, 2018

@HolgersID, please check if the problem is solved.

HolgersID (Author)

I can confirm that the issue is fixed as of commit 4ba983f (package version 1.1.6). With the test data set above, the new version of dfm_group takes on my system

system.time(dfm_group(tdfm, "Group"))
##    user  system elapsed
##    2.28    0.25    2.54

Performance on my actual data set is also back up.
Thank you again. :-)

kbenoit added a commit that referenced this issue Apr 6, 2018