
fail to save big gs #32

Closed
mikejiang opened this issue Mar 11, 2020 · 5 comments
Comments

@mikejiang
Member

Here is a reproducible example:

library(flowCore)
library(flowWorkspace)
fr <- flowFrame(matrix(rnorm(1e6), ncol = 2, dimnames = list(NULL, letters[1:2])))
fs <- as(list(s1 = fr), "flowSet")
fs
gs <- GatingSet(fs)
#create big tree with dummy gates
for(i in seq_len(1e3))
{
  g <- rectangleGate(a = c(-Inf, Inf))
  gs_pop_add(gs, g, name = as.character(i))
}
recompute(gs)
length(gs_get_pop_paths(gs))

#replace the data with smaller version to avoid writing too big h5
fs <- as(list(s1 = flowFrame(matrix(rnorm(2), ncol = 2, dimnames = list(NULL, letters[1:2])))), "flowSet")
gs_cyto_data(gs) <- flowSet_to_cytoset(fs)

#copy gs to mimic large data set
ptrlist <- sampleList <- list()
for(i in seq_len(100))
{
  sn <- paste0("s",i)
  gs1 <- gs_copy_tree_only(gs)
  sampleNames(gs1) <- sn
  ptrlist <- c(ptrlist, gs1@pointer)
  sampleList <- c(sampleList, sn)
}

#combine to a big gs (a hack to bypass validity check)
gs_big <- new("GatingSet", pointer = flowWorkspace:::.cpp_combineGatingSet(ptrlist, sampleList))
tmp <- tempfile()
save_gs(gs_big, tmp)
[libprotobuf ERROR google/protobuf/io/zero_copy_stream_impl_lite.cc:155] Cannot allocate buffer larger than kint32max for StringOutputStream.
Error in save_gs(gs_big, tmp) : 
@mikejiang
Member Author

This buffer size limitation was introduced by switching to protobuf-lite (RGLab/RProtoBufLib#6 (comment)). The lite runtime doesn't support iostream, so serialization goes through a StringOutputStream wrapped around a single string buffer, which cannot exceed kint32max bytes.
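The limit itself is simple to state: a single serialized buffer cannot exceed kint32max (2^31 - 1) bytes. A minimal sketch of such a guard, purely for illustration (this is not flowWorkspace or protobuf code):

```python
# kint32max is the largest value of a signed 32-bit integer; protobuf's
# StringOutputStream cannot allocate a buffer larger than this.
KINT32MAX = 2**31 - 1  # 2147483647

def fits_in_single_buffer(serialized_size: int) -> bool:
    """Return True if a message of this size fits in one string buffer."""
    return serialized_size <= KINT32MAX

print(fits_in_single_buffer(1_000_000))    # True: ~1 MB is fine
print(fits_in_single_buffer(3 * 1024**3))  # False: ~3 GiB exceeds the limit
```

A GatingSet with many samples and a large gate tree can easily push the combined message past this threshold, which is exactly the failure seen above.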

@jacobpwagner
Member

I see. Is it worth switching back then? I kept pretty detailed notes on minimization of the protobuf bundle, so if we want to do that again after moving back to the full library, it should be reasonably quick.

@mikejiang
Member Author

mikejiang commented Mar 11, 2020

Yeah, switching back to the full version of protobuf would be one quick solution. There are two other alternatives, both of which require changing the existing message format:

  1. Still save to a single pb file, but write multiple string buffers to the same file, each preceded by a small integer recording that buffer's size (so they can be reloaded by matching multiple buffer reads).
  2. Write each gh (i.e. sample) to its own pb file.

The second approach would also potentially help concurrent loading, as well as efficient sub-loading through the select argument (i.e. load_gs(path, select = c(1:3))), since it would no longer have to load and parse the entire message for all samples.
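The first alternative (size-prefixed buffer writes to one file) is a standard framing technique. A language-neutral sketch in Python, assuming a 4-byte little-endian length prefix (the actual prefix width and byte order would be an implementation choice, not something taken from flowWorkspace):

```python
import struct
from io import BytesIO

def write_delimited(stream, payloads):
    """Write each buffer preceded by a 4-byte little-endian length prefix."""
    for buf in payloads:
        stream.write(struct.pack("<I", len(buf)))
        stream.write(buf)

def read_delimited(stream):
    """Read back length-prefixed buffers until EOF."""
    out = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # clean EOF
        (size,) = struct.unpack("<I", header)
        out.append(stream.read(size))
    return out

stream = BytesIO()
write_delimited(stream, [b"sample-1-message", b"sample-2-message"])
stream.seek(0)
print(read_delimited(stream))  # [b'sample-1-message', b'sample-2-message']
```

Because each read or write only touches one per-sample buffer at a time, no single StringOutputStream ever has to hold the whole GatingSet, which sidesteps the kint32max limit.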

Either of the two could still fail in theory if a single sample reaches the same buffer limit (when the total number of gates is huge and the event count is large enough). This probably would not happen in practice. (Or I could be wrong on that, given the nature of the faust application.)

Anyway, in the short run I will do the switch. The discussion above is for the record, for the future.

@mikejiang
Member Author

I ended up implementing the second, distributed approach, i.e. one pb file per sample. Now a big gs saves without error:

> save_gs(gs_big, tmp)
Done
To reload it, use 'load_gs' function

> list.files(tmp)
  [1] "90b6757a-26ab-4158-bfd2-fb4272fd1054.pb" "s1.h5"                                  
  [3] "s1.pb"                                   "s10.h5"                                 
  [5] "s10.pb"                                  "s100.h5"                                
  [7] "s100.pb"                                 "s11.h5"                                 
...
[195] "s96.pb"                                  "s97.h5"                                 
[197] "s97.pb"                                  "s98.h5"                                 
[199] "s98.pb"                                  "s99.h5"                                 
[201] "s99.pb"                                 

And sub-loading is now more efficient than before:

> system.time(gs1 <- load_gs(tmp, select = c("s1", "s100")))
   user  system elapsed 
  2.290   0.068   2.382 
> sampleNames(gs1)
[1] "s1"   "s100"
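The efficiency gain comes from the layout itself: with one file per sample, a selective load only opens the requested files. A minimal sketch of that idea with plain files (hypothetical helpers; the real files are protobuf messages, not raw bytes):

```python
import os
import tempfile

def save_per_sample(directory, samples):
    """Write one .pb-style file per sample name."""
    for name, payload in samples.items():
        with open(os.path.join(directory, name + ".pb"), "wb") as f:
            f.write(payload)

def load_selected(directory, select):
    """Read only the requested samples' files, skipping all others."""
    out = {}
    for name in select:
        with open(os.path.join(directory, name + ".pb"), "rb") as f:
            out[name] = f.read()
    return out

tmp = tempfile.mkdtemp()
save_per_sample(tmp, {"s%d" % i: b"gates-for-s%d" % i for i in range(1, 101)})
# Loading 2 of 100 samples touches exactly 2 files on disk.
print(sorted(load_selected(tmp, ["s1", "s100"])))  # ['s1', 's100']
```

This mirrors the `load_gs(tmp, select = c("s1", "s100"))` behavior above: the cost of a selective load scales with the number of selected samples, not the total size of the GatingSet.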

mikejiang pushed a commit that referenced this issue Mar 12, 2020
mikejiang pushed a commit to RGLab/flowWorkspace that referenced this issue Mar 12, 2020
@DillonHammill

This is great @mikejiang!
