avoid double copies of the reference sequence?

Hi folks.  Over in SLiM-land I've pulled the current tskit head (well, current as of a couple days ago now) into SLiM, in PR https://github.com/MesserLab/SLiM/pull/248, and am making adjustments to SLiM in anticipation of tskit's C API 1.0.  As part of that work I'm moving SLiM over from reading/writing the reference sequence in kastore directly, to having it use the new support for the reference sequence in tskit (as mentioned in issue https://github.com/MesserLab/SLiM/issues/180).  There are a few bumps along the way, however.

One bump is probably just my own lack of knowledge of how tskit is intended to be used.  Am I allowed to munge around with the `reference_sequence` ivar in `tsk_table_collection_t` directly?  I don't see any getter/setter functions for it, so I guess so?  Similarly, if I have a `tsk_table_collection_t` and get the `tsk_reference_sequence_t *reference_sequence` from its ivar, I guess I'm allowed to just go into the reference sequence's struct and get the data from it directly, from its `data` and `data_length` ivars?  Feels a bit weird, as a C++ programmer used to objects having private state and using accessor functions.  But I guess this is how it's designed, since it is C not C++?

A more important bump has to do with copying of the reference sequence – or more precisely, avoiding such copying.  A 1e10 reference sequence is almost 10 GB, so it would be good not to make unnecessary copies of it – a little bit for the speed penalty, but mostly to avoid increasing the memory allocation high-water mark associated with handling it.  This comes up in two spots.

One spot is on load, in `tsk_table_collection_load_reference_sequence()`.  That calls `tsk_reference_sequence_set_data()`, which makes a copy of the whole buffer.  It would be great if that copy could be eliminated somehow – presumably by taking over ownership of the buffer from kastore, I guess?  Right now in SLiM we do the following on load (removing the error checking for readability):

```
		kastore_t store;

		ret = kastore_open(&store, p_file, "r", 0);
		ret = kastore_gets_int8(&store, "reference_sequence/data", (int8_t **)&buffer, &buffer_length);
		
		chromosome_->AncestralSequence()->ReadNucleotidesFromBuffer(buffer);
		
		// buffer is owned by kastore and is freed by closing the store
		kastore_close(&store);
```

The `ReadNucleotidesFromBuffer()` method copies the data directly from the kastore buffer into a private buffer in SLiM that stores the sequence in a 2-bit-per-base format.  Now, with the way `tsk_table_collection_load_reference_sequence()` works at present, we will end up (1) loading the data from kastore, (2) copying that data into tskit at 8 bits per base, (3) freeing the kastore buffer, presumably, and then (4) copying tskit's data again into SLiM at 2 bits per base.  The high water mark will now be two 8-bit-per-base copies of the data (kastore's and tskit's), rather than one 8-bit-per-base copy and one 2-bit-per-base copy.  If tskit could take the buffer from kastore, it would get rid of the copy in step 2 and lower the high-water mark to what it was before.  Is that possible?  If not, is there a way that tskit could vend kastore's data directly to SLiM so that I could pull it into 2-bits-per-base directly, without the additional copy being made?

The other spot is on write, when I want to move SLiM's reference sequence (at 2-bits-per-base) back into the table collection (at 8-bits-per-base).  Of course this unavoidably involves having both of those buffers in memory at the time, so that's the high-water mark we're shooting for, which at present we achieve with the following code (error checks edited out):

```
			std::size_t buflen = chromosome_->AncestralSequence()->size();
			char *buffer = (char *)malloc(buflen);
			kastore_t store;
			
			chromosome_->AncestralSequence()->WriteNucleotidesToBuffer(buffer);
			
			ret = kastore_open(&store, path.c_str(), "a", 0);
			kastore_oputs_int8(&store, "reference_sequence/data", (int8_t *)buffer, buflen, 0);
			ret = kastore_close(&store);
			
			// kastore owns buffer now, so we do not free it
```

Because `kastore_oputs_int8()` takes the buffer from us, an additional copy is avoided.  But in the new scheme of things, if I understand correctly, the intended usage pattern is that I (1) malloc an 8-bit-per-base buffer and put my data into it, (2) malloc a `tsk_reference_sequence_t`, (3) call `tsk_reference_sequence_init()` and then `tsk_reference_sequence_set_data()` on my malloced `tsk_reference_sequence_t`, (4) put the pointer to my malloced `tsk_reference_sequence_t` into the table collection's `reference_sequence` ivar directly (giving it to the table collection), and then (5) call `tsk_table_collection_dump()` to write to disk, which calls `kastore_puts()` under the hood to do the write.  If I'm not mistaken, the high-water mark is therefore comprised of SLiM's 2-bit-per-base buffer, the 8-bit-per-base buffer I malloced, another 8-bit-per-base buffer that `tsk_reference_sequence_set_data()` malloced (because there is no call to donate a buffer to `tsk_reference_sequence_t` rather than copying), and another 8-bit-per-base buffer that `tsk_table_collection_dump()` makes inside `kastore_puts()`.  So if the reference sequence is ~10 GB, the high water mark is now ~20 GB higher than it was before – about 32.5 GB altogether, as compared to 12.5 GB before.  This is a significant problem.

I can fix the additional overhead on write by going behind tskit's back, I think; I can write out the reference sequence to the kastore directly, as I do now, and simply never set it on the table collection at all.  On read, I don't think there's anything I can do to avoid the increased high-water mark at present, since tskit copies kastore's data and that's out of my control.  If there were a flag I could pass to `tsk_table_collection_load()` that said "don't load the reference sequence", I could then, again, go behind tskit's back to get the reference sequence directly from kastore to avoid the copy; perhaps such a flag could be added?

Of course, going behind tskit's back to access the reference sequence directly in kastore is not optimal.  If it brings down peak memory usage for a big model by 10 GB or 20 GB, though, I will enthusiastically embrace that ugliness.  But if there's a better way to do all this, perhaps with some tweaks on the tskit side, that would be great too.

I also do wonder whether you guys considered using 2-bits-per-base rather than 8-bits-per-base in tskit.  The memory savings is pretty substantial.  Are you allowing states other than ACGT?  I guess you guys do allow weird mutation states – even the "poo" emoji as in Peter's timeless example.  :->  Those "poo" emojis must be UTF-8 encoded or something, I guess?  If you allow non-ASCII characters in the buffer, that means it isn't even 8-bits-per-base any more, and you can't access the reference state for base `i` just by indexing `reference_sequence.data[i]` either, eh?  Is all that being handled under the hood somehow?  I can't even imagine how you'd deal with that efficiently :-O.  But anyway, that's a side point that I'm just curious about.  SLiM will assume that the buffer is 8-bits-per-base in ASCII ACGT, and will raise if it sees anything else.  :->

So, back to the main topic: avoiding extra copies.  At a minimum, I'd request a flag to `tsk_table_collection_load()` to tell it not to load the reference sequence, and then I can do everything as I am now and it should be fine.  But if you guys think I should not need to go behind tskit's back to avoid these extra copies, then let's talk about other options.

Thoughts, @benjeffery @jeromekelleher @petrelharp?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

avoid double copies of the reference sequence? #1970

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

avoid double copies of the reference sequence? #1970

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions