group_by memory efficiency regression #4334

Closed
jangorecki opened this issue Apr 21, 2019 · 2 comments

jangorecki commented Apr 21, 2019

A memory inefficiency in grouping operations seems to have been introduced somewhere between 0.7.8 and the development version.
Using a 40 MB data frame:

  • dplyr 0.7.8 max RSS measure reaches 228 MB
  • recent devel uses 5.7 GB max RSS

To reproduce, use the following script:

cat dplyr-debug.R
args = commandArgs(TRUE)
rows = as.integer(args[1L])
cols = as.integer(args[2L])
DF = as.data.frame(lapply(1:cols, function(i) 1:rows))
names(DF) = paste0("V",1:cols)
format(object.size(DF), units="KB")
suppressMessages(library(dplyr))
ans = DF %>% group_by(.dots = names(DF)) %>% summarize(count = n())
q("no")

Collect the data size and max RSS memory:

/usr/bin/time -v Rscript dplyr-debug.R 1e3 1e1
/usr/bin/time -v Rscript dplyr-debug.R 1e3 1e2
/usr/bin/time -v Rscript dplyr-debug.R 1e3 1e3
/usr/bin/time -v Rscript dplyr-debug.R 1e3 1e4
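
The four runs above can also be driven from R and the peak memory pulled out of the `time -v` report. This is a minimal sketch, assuming `/usr/bin/time` is GNU time (which prints a line of the form `Maximum resident set size (kbytes): N`) and that `dplyr-debug.R` is in the working directory. The helper names `parse_max_rss_kb` and `measure_max_rss_kb` are hypothetical and not part of the original report.

```r
# Hypothetical helper: extract max RSS (in KB) from the lines printed by GNU `time -v`.
parse_max_rss_kb <- function(lines) {
  hit <- grep("Maximum resident set size", lines, fixed = TRUE, value = TRUE)
  if (length(hit) == 0L) return(NA_real_)
  # Drop everything up to and including the colon, keep the number.
  as.numeric(sub("^.*:[[:space:]]*", "", hit[1L]))
}

# Hypothetical wrapper: run the script once under GNU time and return its max RSS.
# Assumes GNU time at /usr/bin/time and Rscript on the PATH.
measure_max_rss_kb <- function(rows, cols, script = "dplyr-debug.R") {
  out <- suppressWarnings(system2(
    "/usr/bin/time",
    c("-v", "Rscript", script,
      format(rows, scientific = FALSE), format(cols, scientific = FALSE)),
    stdout = TRUE, stderr = TRUE   # time -v writes its report to stderr
  ))
  parse_max_rss_kb(out)
}

# Usage (not run here, since it needs the script and GNU time):
# sapply(c(1e1, 1e2, 1e3, 1e4), function(k) measure_max_rss_kb(1e3, k))
```

Running each measurement in a fresh `Rscript` process, as the commands above do, keeps one run's peak from leaking into the next.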

Run the script with both devel and 0.7.8.
On my machine I get the following values:

dplyr, rows, cols, df_kb, max_rss_kb
devel, 1e3, 1e1, 40.9, 69632
devel, 1e3, 1e2, 402.9, 174344
devel, 1e3, 1e3, 4024, 6019788
devel, 1e3, 1e4, 40235, 6021072
0.7.8, 1e3, 1e1, 40.9, 68508
0.7.8, 1e3, 1e2, 402.9, 69576
0.7.8, 1e3, 1e3, 4024, 80612
0.7.8, 1e3, 1e4, 40235, 233504
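
To make the size of the gap easier to read, the figures in the table can be put side by side. The sketch below only re-enters the numbers reported above and derives ratios from them. The data frame and column names are chosen for illustration.

```r
# Figures copied from the table above; data size and max RSS are both in KB.
res <- data.frame(
  dplyr      = rep(c("devel", "0.7.8"), each = 4),
  cols       = rep(c(1e1, 1e2, 1e3, 1e4), times = 2),
  df_kb      = rep(c(40.9, 402.9, 4024, 40235), times = 2),
  max_rss_kb = c(69632, 174344, 6019788, 6021072,
                 68508, 69576, 80612, 233504),
  stringsAsFactors = FALSE
)

devel <- res[res$dplyr == "devel", ]
old   <- res[res$dplyr == "0.7.8", ]

# One row per column count: peak memory in MB and the devel / 0.7.8 ratio.
comparison <- data.frame(
  cols         = devel$cols,
  devel_rss_mb = round(devel$max_rss_kb / 1024),
  old_rss_mb   = round(old$max_rss_kb / 1024),
  devel_vs_old = round(devel$max_rss_kb / old$max_rss_kb, 1)
)
print(comparison)
```

The ratio is close to 1 at 10 columns and grows sharply by 1,000 columns, where devel already sits near the same peak it reaches at 10,000 columns.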

The issue was initially spotted in a more complex query that grouped by only 6 columns, so I believe it is related not to the number of grouping columns but to cardinality.
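
One way to probe the cardinality hypothesis is to hold the number of grouping columns fixed and vary only the number of distinct groups. The following is a sketch under stated assumptions: `make_df` is a hypothetical helper, not part of the original script, and `group_by_at()` is used here in place of the `.dots` argument from the script above. The dplyr part is skipped when the package is not installed.

```r
# Hypothetical helper: build `cols` integer columns with `rows` rows whose
# row-wise combinations take exactly `groups` distinct values.
make_df <- function(rows, cols, groups) {
  stopifnot(groups >= 1L, groups <= rows)
  key <- rep_len(seq_len(groups), rows)   # cycles through 1..groups
  DF <- as.data.frame(lapply(seq_len(cols), function(i) key))
  names(DF) <- paste0("V", seq_len(cols))
  DF
}

# Six grouping columns, as in the query where the issue was first noticed;
# the row and group counts are arbitrary illustration values.
DF <- make_df(rows = 1000L, cols = 6L, groups = 10L)
n_groups <- nrow(unique(DF))

if (requireNamespace("dplyr", quietly = TRUE)) {
  suppressMessages(library(dplyr))
  ans <- DF %>% group_by_at(names(DF)) %>% summarize(count = n())
  stopifnot(nrow(ans) == n_groups)
}
```

Running this under `/usr/bin/time -v` for several values of `groups`, with `cols` held at 6, would show whether peak memory tracks the number of groups rather than the number of columns.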


read.dcf(system.file("DESCRIPTION", package="dplyr"), fields="RemoteSha")[[1L]]
#[1] "792ca4909c1c2f7d5a61c4c7369d5731ef092477"
sessionInfo()
#R version 3.5.3 (2019-03-11)
#Platform: x86_64-pc-linux-gnu (64-bit)
#Running under: Ubuntu 18.04.2 LTS
#
#Matrix products: default
#BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#
#locale:
# [1] LC_CTYPE=en_IN       LC_NUMERIC=C         LC_TIME=en_IN       
# [4] LC_COLLATE=en_IN     LC_MONETARY=en_IN    LC_MESSAGES=en_IN   
# [7] LC_PAPER=en_IN       LC_NAME=C            LC_ADDRESS=C        
#[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_IN LC_IDENTIFICATION=C 
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base     
#
#loaded via a namespace (and not attached):
#[1] compiler_3.5.3

romainfrancois (Member) commented

I guess we'll revise this when vctrs is in charge of hashing via vec_compare() etc.

hadley (Member) commented Jan 7, 2020

I see:

dplyr (master): /usr/bin/time -l Rscript dplyr-debug.R 1e3 1e1
[1] "40.9 Kb"
        0.51 real         0.46 user         0.05 sys
  82210816  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     24482  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         1  signals received
        10  voluntary context switches
        33  involuntary context switches
dplyr (master): /usr/bin/time -l Rscript dplyr-debug.R 1e3 1e4
[1] "40235 Kb"
        7.20 real         7.07 user         0.11 sys
 312426496  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     80684  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         1  signals received
        10  voluntary context switches
        44  involuntary context switches
c(82210816, 312426496) / 1024
#> [1]  80284 305104

So we're back (approximately) to 0.7.8 sizes
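
As a rough check of "approximately", the converted figures can be set against the 0.7.8 and devel numbers from the table in the original report. This is only a sketch: the two sets of measurements come from different machines and different `time` implementations, so the ratios are indicative at best. The variable names are chosen for illustration.

```r
# Byte counts reported by `time -l` above, converted to KB as in the comment.
master_kb <- c(cols_1e1 = 82210816, cols_1e4 = 312426496) / 1024

# Max RSS in KB from the table in the original report, same script arguments.
old_kb   <- c(cols_1e1 = 68508, cols_1e4 = 233504)    # dplyr 0.7.8
devel_kb <- c(cols_1e1 = 69632, cols_1e4 = 6021072)   # devel at report time

ratios <- round(rbind(master_vs_old = master_kb / old_kb,
                      devel_vs_old  = devel_kb / old_kb), 2)
print(ratios)
```

At 10,000 columns master uses roughly 1.3 times the 0.7.8 figure, compared with about 26 times for the devel build in the original report.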

hadley closed this as completed Jan 7, 2020