
fail to save big gs #32

Closed
mikejiang opened this issue Mar 11, 2020 · 5 comments
Comments

@mikejiang
Member

Here is a reproducible example:

library(flowCore)
library(flowWorkspace)
fr <- flowFrame(matrix(rnorm(1e6), ncol = 2, dimnames = list(NULL, letters[1:2])))
fs <- as(list(s1 = fr), "flowSet")
fs
gs <- GatingSet(fs)
#create big tree with dummy gates
for(i in seq_len(1e3))
{
  g <- rectangleGate(a = c(-Inf, Inf))
  gs_pop_add(gs, g, name = as.character(i))
}
recompute(gs)
length(gs_get_pop_paths(gs))

#replace the data with smaller version to avoid writing too big h5
fs <- as(list(s1 = flowFrame(matrix(rnorm(2), ncol = 2, dimnames = list(NULL, letters[1:2])))), "flowSet")
gs_cyto_data(gs) <- flowSet_to_cytoset(fs)

#copy gs to mimic large data set
ptrlist <- sampleList <- list()
for(i in seq_len(100))
{
  sn <- paste0("s",i)
  gs1 <- gs_copy_tree_only(gs)
  sampleNames(gs1) <- sn
  ptrlist <- c(ptrlist, gs1@pointer)
  sampleList <- c(sampleList, sn)
}

#combine to a big gs (a hack to bypass validity check)
gs_big <- new("GatingSet", pointer = flowWorkspace:::.cpp_combineGatingSet(ptrlist, sampleList))
tmp <- tempfile()
save_gs(gs_big, tmp)
[libprotobuf ERROR google/protobuf/io/zero_copy_stream_impl_lite.cc:155] Cannot allocate buffer larger than kint32max for StringOutputStream.
Error in save_gs(gs_big, tmp) : 
@mikejiang
Member Author

This buffer size limitation was introduced by switching to protobuf-lite (RGLab/RProtoBufLib#6 (comment)). The lite runtime doesn't support iostream, so serialization goes through a StringOutputStream wrapped around a single string buffer, which cannot exceed kint32max bytes.
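The limit itself is simple to state: a single serialized buffer cannot exceed kint32max (2^31 - 1) bytes. A minimal sketch of such a guard, purely for illustration (this is not flowWorkspace or protobuf code):

```python
# kint32max is the largest value of a signed 32-bit integer; protobuf's
# StringOutputStream cannot allocate a buffer larger than this.
KINT32MAX = 2**31 - 1  # 2147483647

def fits_in_single_buffer(serialized_size: int) -> bool:
    """Return True if a message of this size fits in one string buffer."""
    return serialized_size <= KINT32MAX

print(fits_in_single_buffer(1_000_000))    # True: ~1 MB is fine
print(fits_in_single_buffer(3 * 1024**3))  # False: ~3 GiB exceeds the limit
```

A GatingSet with many samples and a large gate tree can easily push the combined message past this threshold, which is exactly the failure seen above.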

@jacobpwagner
Member

I see. Is it worth switching back then? I kept pretty detailed notes on minimization of the protobuf bundle, so if we want to do that again after moving back to the full library, it should be reasonably quick.

@mikejiang
Member Author

mikejiang commented Mar 11, 2020

Yeah, switching back to the full version of protobuf would be one quick solution. There are two other alternatives, both of which require changing the existing message format:

  1. Still save to a single pb file, but write multiple string buffers to the same file, each preceded by a small integer recording that buffer's size (so they can be reloaded by matching multiple buffer reads).
  2. Write each gh (i.e. sample) to its own pb file.

The second approach would also potentially help concurrent loading, as well as efficient sub-loading through the select argument (i.e. load_gs(path, select = c(1:3))), since it would no longer have to load and parse the entire message for all samples.
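The first alternative (size-prefixed buffer writes to one file) is a standard framing technique. A language-neutral sketch in Python, assuming a 4-byte little-endian length prefix (the actual prefix width and byte order would be an implementation choice, not something taken from flowWorkspace):

```python
import struct
from io import BytesIO

def write_delimited(stream, payloads):
    """Write each buffer preceded by a 4-byte little-endian length prefix."""
    for buf in payloads:
        stream.write(struct.pack("<I", len(buf)))
        stream.write(buf)

def read_delimited(stream):
    """Read back length-prefixed buffers until EOF."""
    out = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # clean EOF
        (size,) = struct.unpack("<I", header)
        out.append(stream.read(size))
    return out

stream = BytesIO()
write_delimited(stream, [b"sample-1-message", b"sample-2-message"])
stream.seek(0)
print(read_delimited(stream))  # [b'sample-1-message', b'sample-2-message']
```

Because each read or write only touches one per-sample buffer at a time, no single StringOutputStream ever has to hold the whole GatingSet, which sidesteps the kint32max limit.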

Either of the two could still fail in theory if a single sample reaches the same buffer limit (when the total number of gates is huge and the event count is large enough). This probably would not happen in practice. (Or I could be wrong on that, given the nature of the faust application.)

Anyway, in the short run I will do the switch. The discussion above is for the record, for the future.

@mikejiang
Member Author

I ended up implementing the second, distributed approach, i.e. one pb file per sample. Now a big gs saves without error:

> save_gs(gs_big, tmp)
Done
To reload it, use 'load_gs' function

> list.files(tmp)
  [1] "90b6757a-26ab-4158-bfd2-fb4272fd1054.pb" "s1.h5"                                  
  [3] "s1.pb"                                   "s10.h5"                                 
  [5] "s10.pb"                                  "s100.h5"                                
  [7] "s100.pb"                                 "s11.h5"                                 
...
[195] "s96.pb"                                  "s97.h5"                                 
[197] "s97.pb"                                  "s98.h5"                                 
[199] "s98.pb"                                  "s99.h5"                                 
[201] "s99.pb"                                 

And sub-loading is now more efficient than before:

> system.time(gs1 <- load_gs(tmp, select = c("s1", "s100")))
   user  system elapsed 
  2.290   0.068   2.382 
> sampleNames(gs1)
[1] "s1"   "s100"
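The efficiency gain comes from the layout itself: with one file per sample, a selective load only opens the requested files. A minimal sketch of that idea with plain files (hypothetical helpers; the real files are protobuf messages, not raw bytes):

```python
import os
import tempfile

def save_per_sample(directory, samples):
    """Write one .pb-style file per sample name."""
    for name, payload in samples.items():
        with open(os.path.join(directory, name + ".pb"), "wb") as f:
            f.write(payload)

def load_selected(directory, select):
    """Read only the requested samples' files, skipping all others."""
    out = {}
    for name in select:
        with open(os.path.join(directory, name + ".pb"), "rb") as f:
            out[name] = f.read()
    return out

tmp = tempfile.mkdtemp()
save_per_sample(tmp, {"s%d" % i: b"gates-for-s%d" % i for i in range(1, 101)})
# Loading 2 of 100 samples touches exactly 2 files on disk.
print(sorted(load_selected(tmp, ["s1", "s100"])))  # ['s1', 's100']
```

This mirrors the `load_gs(tmp, select = c("s1", "s100"))` behavior above: the cost of a selective load scales with the number of selected samples, not the total size of the GatingSet.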

mikejiang pushed a commit that referenced this issue Mar 12, 2020
mikejiang pushed a commit to RGLab/flowWorkspace that referenced this issue Mar 12, 2020
@DillonHammill

This is great @mikejiang!
