-
Notifications
You must be signed in to change notification settings - Fork 78
Description
Hi folks. Over in SLiM-land I've pulled the current tskit head (well, current as of a couple days ago now) into SLiM, in PR MesserLab/SLiM#248, and am making adjustments to SLiM in anticipation of tskit's C API 1.0. As part of that work I'm moving SLiM over from reading/writing the reference sequence in kastore directly, to having it use the new support for the reference sequence in tskit (as mentioned in issue MesserLab/SLiM#180). There are a few bumps along the way, however.
One bump is probably just my own lack of knowledge of how tskit is intended to be used. Am I allowed to munge around with the reference_sequence ivar in tsk_table_collection_t directly? I don't see any getter/setter functions for it, so I guess so? Similarly, if I have a tsk_table_collection_t and get the tsk_reference_sequence_t *reference_sequence from its ivar, I guess I'm allowed to just go into the reference sequence's struct and get the data from it directly, from its data and data_length ivars? Feels a bit weird, as a C++ programmer used to objects having private state and using accessor functions. But I guess this is how it's designed, since it is C not C++?
A more important bump has to do with copying of the reference sequence – or more precisely, avoiding such copying. A 1e10 reference sequence is almost 10 GB, so it would be good not to make unnecessary copies of it – a little bit for the speed penalty, but mostly to avoid increasing the memory allocation high-water mark associated with handling it. This comes up in two spots.
One spot is on load, in tsk_table_collection_load_reference_sequence(). That calls tsk_reference_sequence_set_data(), which makes a copy of the whole buffer. It would be great if that copy could be eliminated somehow – presumably by taking over ownership of the buffer from kastore, I guess? Right now in SLiM we do the following on load (removing the error checking for readability):
kastore_t store;
ret = kastore_open(&store, p_file, "r", 0);
ret = kastore_gets_int8(&store, "reference_sequence/data", (int8_t **)&buffer, &buffer_length);
chromosome_->AncestralSequence()->ReadNucleotidesFromBuffer(buffer);
// buffer is owned by kastore and is freed by closing the store
kastore_close(&store);
The ReadNucleotidesFromBuffer() method copies the data directly from the kastore buffer into a private buffer in SLiM that stores the sequence in a 2-bit-per-base format. Now, with the way tsk_table_collection_load_reference_sequence() works at present, we will end up (1) loading the data from kastore, (2) copying that data into tskit at 8 bits per base, (3) freeing the kastore buffer, presumably, and then (4) copying tskit's data again into SLiM at 2 bits per base. The high water mark will now be two 8-bit-per-base copies of the data (kastore's and tskit's), rather than one 8-bit-per-base copy and one 2-bit-per-base copy. If tskit could take the buffer from kastore, it would get rid of the copy in step 2 and lower the high-water mark to what it was before. Is that possible? If not, is there a way that tskit could vend kastore's data directly to SLiM so that I could pull it into 2-bits-per-base directly, without the additional copy being made?
The other spot is on write, when I want to move SLiM's reference sequence (at 2-bits-per-base) back into the table collection (at 8-bits-per-base). Of course this unavoidably involves having both of those buffers in memory at the time, so that's the high-water mark we're shooting for, which at present we achieve with the following code (error checks edited out):
std::size_t buflen = chromosome_->AncestralSequence()->size();
char *buffer = (char *)malloc(buflen);
kastore_t store;
chromosome_->AncestralSequence()->WriteNucleotidesToBuffer(buffer);
ret = kastore_open(&store, path.c_str(), "a", 0);
kastore_oputs_int8(&store, "reference_sequence/data", (int8_t *)buffer, buflen, 0);
ret = kastore_close(&store);
// kastore owns buffer now, so we do not free it
Because kastore_oputs_int8() takes the buffer from us, an additional copy is avoided. But in the new scheme of things, if I understand correctly, the intended usage pattern is that I (1) malloc an 8-bit-per-base buffer and put my data into it, (2) malloc a tsk_reference_sequence_t, (3) call tsk_reference_sequence_init() and then tsk_reference_sequence_set_data() on my malloced tsk_reference_sequence_t, (4) put the pointer to my malloced tsk_reference_sequence_t into the table collection's reference_sequence ivar directly (giving it to the table collection), and then (5) call tsk_table_collection_dump() to write to disk, which calls kastore_puts() under the hood to do the write. If I'm not mistaken, the high-water mark is therefore comprised of SLiM's 2-bit-per-base buffer, the 8-bit-per-base buffer I malloced, another 8-bit-per-base buffer that tsk_reference_sequence_set_data() malloced (because there is no call to donate a buffer to tsk_reference_sequence_t rather than copying), and another 8-bit-per-base buffer that tsk_table_collection_dump() makes inside kastore_puts(). So if the reference sequence is ~10 GB, the high water mark is now ~20 GB higher than it was before – about 32.5 GB altogether, as compared to 12.5 GB before. This is a significant problem.
I can fix the additional overhead on write by going behind tskit's back, I think; I can write out the reference sequence to the kastore directly, as I do now, and simply never set it on the table collection at all. On read, I don't think there's anything I can do to avoid the increased high-water mark at present, since tskit copies kastore's data and that's out of my control. If there were a flag I could pass to tsk_table_collection_load() that said "don't load the reference sequence", I could then, again, go behind tskit's back to get the reference sequence directly from kastore to avoid the copy; perhaps such a flag could be added?
Of course, going behind tskit's back to access the reference sequence directly in kastore is not optimal. If it brings down peak memory usage for a big model by 10 GB or 20 GB, though, I will enthusiastically embrace that ugliness. But if there's a better way to do all this, perhaps with some tweaks on the tskit side, that would be great too.
I also do wonder whether you guys considered using 2-bits-per-base rather than 8-bits-per-base in tskit. The memory savings is pretty substantial. Are you allowing states other than ACGT? I guess you guys do allow weird mutation states – even the "poo" emoji as in Peter's timeless example. :-> Those "poo" emojis must be UTF-8 encoded or something, I guess? If you allow non-ASCII characters in the buffer, that means it isn't even 8-bits-per-base any more, and you can't access the reference state for base i just by indexing reference_sequence.data[i] either, eh? Is all that being handled under the hood somehow? I can't even imagine how you'd deal with that efficiently :-O. But anyway, that's a side point that I'm just curious about. SLiM will assume that the buffer is 8-bits-per-base in ASCII ACGT, and will raise if it sees anything else. :->
So, back to the main topic: avoiding extra copies. At a minimum, I'd request a flag to tsk_table_collection_load() to tell it not to load the reference sequence, and then I can do everything as I am now and it should be fine. But if you guys think I should not need to go behind tskit's back to avoid these extra copies, then let's talk about other options.
Thoughts, @benjeffery @jeromekelleher @petrelharp?