Support for reference sequences #146

jeromekelleher · 2019-03-13T10:08:14Z

For some applications it would be useful to know the reference sequence that a tree sequence coordinate space refers to. For example, with real data, we should (at a minimum) record the reference build (e.g., GRCh38) and the contig ID (e.g., chr22) associated with a tree sequence. Ideally, we would also like to be able to do things like:

for site in ts.sites():
     print(site.position, site.ref_allele)

As well as situations in which a well-known canonical reference is available, we may also have a one-off reference sequence that we wish to record, e.g., in simulations.

To support this, I suggest adding a reference section to the file store, with some fields. Roughly, these might looks like:

reference/build         -- e.g. GRCh38
reference/contig       -- e.g. chr22
reference/id              -- md5 hash of the sequence, c.f. refget
reference/sequence -- Actual sequence information

The references section would be itself optional (keeping backward compatability), and some fields would probably be optional within this section (for example, reference/sequence would definitely be optional --- no point in storing GRCh38 chr1 over and over again). We can imagine having some mechanisms for automatically retrieving sequences using refget, but this isn't at all necessary for a basic implementation.

At a high-level, we should try to follow any upstream standards as closely as possible, e.g. GA4GH refget and any others that are relevant.

Any thoughts @tskit-dev/all, @bhaller?

The text was updated successfully, but these errors were encountered:

hyanwong · 2019-03-13T12:05:03Z

This seems very sensible, but I wonder if the word "reference" is a bit too context-dependent. Perhaps "genome_reference" or something similar would be clearer?

bhaller · 2019-03-13T13:08:45Z

In SLiM I'm calling it the "ancestral sequence". I'm fine with "reference" or "genome_reference" or any such variant. The rest of the proposal seems good to me. It should be clarified, when this is documented, exactly what format reference/sequence is expected to be in – FASTA? Is a FASTA header allowed? Are newline characters allowed (\r, \n)? Nucleotides only, presumably, not FASTA amino acid sequences? But FASTA codes for ambiguous nucleotides of various sorts? But I think SLiM will just dump an unbroken stream of ACGT into it, so that is all that I need to be supported.

jeromekelleher · 2019-03-13T13:51:37Z

Thanks @bhaller. To clarify, it'll be a flat array of chars, such that reference[pos] is the reference allele at position pos.

petrelharp · 2019-03-15T17:49:15Z

I think this is a great idea, and lean towards calling it just reference (although reference_sequence would also be fine). I don't think it should be called ancestral_sequence, because I'd argue these shouldn't be the same thing, necessarily. Here's some reasons:

If we say they're supposed to be the same, then the ancestral_state column of the Site table is supposed to match the positions in the reference sequence, which is annoying and redundant.
The reference sequence is usually not the ancestral sequence, in practice. (eg GRCh38), but is still useful.
The reference sequence could still be used by e.g. SLiM to store, essentially, the ancestral sequence.

hyanwong · 2019-03-15T17:51:38Z

FWIW I would (marginally) vote for reference_sequence. reference sounds like it might contain metadata about bibliographic references relevant to the provenance of the data. But maybe I'm being too naive.

molpopgen · 2019-03-15T17:53:33Z

@petrelharp is right here. I actually get papers across my desk where someone has calculated Fay and Wu's H w.r.to the reference allelic state :(.

I do like @hyanwong's idea to call it reference_sequence. At first, I thought this thread was about adding bibtex support to the docs.

petrelharp · 2019-03-19T20:11:01Z

@bhaller is going to actually start storing a reference sequence on output from SLiM nucleotide models, so we should at least get it nailed down what we will call it. The proposal is, I think,

reference_sequence/build -- e.g. GRCh38
reference_sequence/contig -- e.g. chr22
reference_sequence/id -- md5 hash of the sequence, c.f. refget
reference_sequence/sequence -- Actual sequence information

All of these would be optional. How about it?

jeromekelleher · 2019-03-19T20:28:00Z

OK, sounds good. I haven't researched what the right names for build, contig etc should be, so perhaps we could just keep it to reference_sequence/sequence if we're going to ship files that include it?

Also reference_sequence/sequence seems repetitive and unintuitive. Maybe reference_sequence/data would be better?

petrelharp · 2019-03-19T20:29:46Z

Besides deciding what to name these entries in the kastore, we should record this somehow, possibly by providing API functions that will write and read it. We think the main question is: is the reference sequence part of a table collection object? It would seem natural to make it a part of the table collection, because then you would get it when you read the .trees file. But, in the SLiM application, the reference sequence is expected to be large, and making it part of the table collection object means that there will be more than one copy of it sitting around, because:

we have to copy the table collection when we write it out; and also
SLiM has an internal version of the reference sequence, so when reading in the .trees file with table_collection_load( ), if reference_sequence was part of the resulting object, it would need to be copied over to where SLiM wants it

Since a table_collection_t has a kastore_t, we could define a function tsk_table_collection_read_reference_sequence( ) that would go into the store and pull out the reference sequence (if indeed such a store is associated). I'm not sure how to deal with writing out, though?

petrelharp · 2019-03-19T20:34:16Z

We could at least write a function that would re-open a .trees file and add a reference sequence to it.

jeromekelleher · 2019-03-19T20:40:30Z

I don't know how we're going to manage it from the C perspective @petrelharp, and I'm afraid I just don't have time to work on it right now. I don't want to rush in changes to the C API without thinking through the consequences very carefully (particularly not with something this fundamental).

I think the best thing to do from a SLiM perspective is to reopen the kastore in append mode after tskit has written it out, and write the reference sequence/data key in. Should be ~10 lines of code.

molpopgen · 2019-03-19T20:42:12Z

Does it make sense to add just the reference state at each variable site vs the entire reference? __ Kevin Thornton Associate Professor Ecology and Evolutionary Biology UC Irvine www.molpopgen.org github.com/molpopgen github.com/ThorntonLab

…

On Tue, Mar 19, 2019 at 1:36 PM Peter Ralph ***@***.***> wrote: We could at least write a function that would re-open a .trees file and add a reference sequence to it. — You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub <#146 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AGHnH_HtcVCCz56Nr_k_VW8NrK_Uscx8ks5vYUnIgaJpZM4bs1uU> .

jeromekelleher · 2019-03-19T20:43:41Z

Does it make sense to add just the reference state at each variable site vs the entire reference?

That was my original proposal, but Ben has reasons for wanting the full reference/ancestral sequence.

bhaller · 2019-03-19T20:59:51Z

I think the best thing to do from a SLiM perspective is to reopen the kastore in append mode after tskit has written it out, and write the reference sequence/data key in. Should be ~10 lines of code.

Yep, that's what we've got for now. I think Peter just wants to clean up the code; but that can wait indefinitely, what we have now is fine for release.

petrelharp · 2019-03-23T05:52:20Z

Sounds good. We're doing that; and can continue this discussion when (if) someone else needs it.

petrelharp · 2019-09-02T16:10:36Z

I'm redirecting discussion from #326 over here, where I think it belongs. There, @hyanwong wrote:

I think we need to distinguish between (a) tree sequences that using the (msprime default) infinite sites mutation model and (b) tree sequences that have sites at integer positions (and hence probably use finite mutation models ).

Instead of guessing based on whether positions are all integers we should instead do something cleaner. Here are two non-mutually-exclusive options:

have a reference_sequence property of tree sequences, which might be NULL, and then have things like haplotypes and fasta output optionally use this to output invariant sites also.
make a subclass of the tree sequence, say FiniteSitesTreeSequence (ick, though) that only has mutations (and also recombinations?) at integer positions

This issue is about (1); but it seems like we might want to do something like (2) to be able to do (1). Otherwise whenever we do something with the reference sequence we'll have to check if positions are all integers, and if not, throw an error (or, something?). At the python level everything would be cleanest, if the relevant columns were actually integer... but this would probably be a giant pain otherwise.

hyanwong · 2019-09-02T16:21:54Z

Instead of guessing based on whether positions are all integers we should instead do something cleaner.

Yes, that would be much better.

have a reference_sequence property of tree sequences, which might be NULL, and then have things like haplotypes and fasta output optionally use this to output invariant sites also.

Nice idea.

make a subclass of the tree sequence, say FiniteSitesTreeSequence (ick, though) that only has mutations (and also recombinations?) at integer positions

It's not strictly necessary to have recombinations at integer positions, of course, but I suppose it is the option most likely to be needed by users.

This issue is about (1); but it seems like we might want to do something like (2) to be able to do (1). Otherwise whenever we do something with the reference sequence we'll have to check if positions are all integers, and if not, throw an error (or, something?). At the python level everything would be cleanest, if the relevant columns were actually integer... but this would probably be a giant pain otherwise.

Are you suggesting that the position in a FiniteSitesTreeSequence is stored as a C double, but converted to an integer in the python accessor functions?

hyanwong · 2019-11-27T16:14:59Z

I think there is a difference between a finite sites tree sequence and a tree sequence with a reference genome, and the two issues are getting confounded here. It is perfectly possible to have a TS with sites at finite positions, but without any reference sequence available. Indeed, I'm working on hacked versions of such tree sequences at the moment.

I suggest we might want to open a separate issue about finite sites?

benjeffery · 2021-11-08T13:46:45Z

It can't be in metadata as then the C API has no access.
From reading MesserLab/SLiM#180 it is clear that putting it in metadata was expected to be temporary? We originally expected a year from then to fix, it has been six months so far.

kastore key, which requires special treatment everywhere

Not sure I follow here? The special treatment would all be done by tskit.

bhaller · 2021-11-08T14:03:38Z

It can't be in metadata as then the C API has no access. From reading MesserLab/SLiM#180 it is clear that putting it in metadata was expected to be temporary? We originally expected a year from then to fix, it has been six months so far.

Yes, it was understood to be temporary; that's not a problem, although it would've been nice if, when I posted two weeks ago "Working on this now" I had known that it was about to change on the tskit side – I would've waited for the dust to settle. Anyway, I'm just saying that if we can now settle on the right final resting place, so that it doesn't have to change again after this, that would be good. :-> I'm also quite surprised that the place being proposed is back in kastore, since @petrelharp and @jeromekelleher made such strong arguments that having it in kastore was a bad idea and was causing problems.

kastore key, which requires special treatment everywhere

Not sure I follow here? The special treatment would all be done by tskit.

Well, in MesserLab/SLiM#180 @petrelharp wrote:

So, for the time being, "if it ain't broke don't fix it" would be my guiding principle. :->

Agree, but it is kinda broken at the moment, because currently it's way too easy for users to lose the reference sequence - to avoid this, pyslim's SlimTreeSequence would need to wrap all the tree sequence transformation methods to stick back on the reference sequence. In fact, many of our examples lose the reference sequence - any time we wrap something in SlimTreeSequence( ), like for instance SlimTreeSequence(msprime.mutate(ts, ....)), we have lost it. Moving it to metadata will be much easier than dealing with this, and much less confusing for the users. I don't see an alternative, besides just giving up on the idea of maintaining the reference sequence.

It also needed to be special-cased in SLiM on both read and write, to put it into kastore and fetch it from kastore; that's not a big deal, but reading/writing to/from metadata is simpler code since it works the same way as many other things. But the big concern, as I understood it, was all the special-casing needed in pyslim; would the proposed changes to tskit remove all of that difficulty for pyslim? @petrelharp is on west coast time, so he probably won't show up here for a couple of hours yet; perhaps we ought to wait for him to chime in on this?

benjeffery · 2021-11-08T14:32:41Z

@petrelharp and @jeromekelleher made such strong arguments that having it in kastore was a bad idea and was causing problems.

The problems come from pyslim writing to the kastore without tskit knowing anything about it. If we add tskit support for a reference sequence then pyslim/SLiM will just use the tskit API and not touch the kastore.

bhaller · 2021-11-08T14:40:16Z

@petrelharp and @jeromekelleher made such strong arguments that having it in kastore was a bad idea and was causing problems.

The problems come from pyslim writing to the kastore without tskit knowing anything about it. If we add tskit support for a reference sequence then pyslim/SLiM will just use the tskit API and not touch the kastore.

I see. That sounds like an improvement, then, certainly. Still, I think we ought to wait for @petrelharp to comment; I don't know all the issues on the pyslim side and can't really represent what his requirements might be.

jeromekelleher · 2021-11-08T16:36:32Z

Apologies for the mixed messages @bhaller, the confusion is my fault. I didn't expect us to have to deal with this for another 6 months or so but working on outputting alignments has sort of forced the issue. Putting in workarounds for not having a reference sequence now might be less work than just doing it, so we're considering just doing it now. As @benjeffery noted above, the arguments for putting the reference into metadata were all motivated by tskit not knowing about references for some significant amount of time.

bhaller · 2021-11-08T17:07:48Z

Gotcha. No worries, just a bit of whiplash. :-> The plan here seems fine if it works for @petrelharp, and it will be good to have this resolved definitively.

jeromekelleher · 2021-11-11T17:04:17Z

I've added this to the 0.4.0 milestone for now, just so we don't forget about it and release without thinking things through properly.

petrelharp · 2021-11-11T18:55:32Z

This plan sounds good - and, apologies for the mixed messages. As usual, I'm leaning too far on the side of "let's just make it work with existing tools". And, I wasn't thinking about the C API needing to be able to use the reference sequence.

bhaller · 2021-11-11T19:03:34Z

OK, so if I understand correctly, @petrelharp, I will back out the changes I made to move the reference sequence into metadata and go back to doing it exactly as it was before; as I understand it, that is the right move to be compatible with the tskit changes proposed here. I guess I'll wait to do that until the dust has settled here, to be sure that nothing changes.

petrelharp · 2021-11-11T19:08:26Z

That sounds right? I guess you'll want to do that before the next SLiM release (which is soon?) and so we need to decide whether this is going to be the final format (the kastore key anyhow) for reals.

bhaller · 2021-11-11T19:27:19Z

Next SLiM release is pretty much waiting for the tskit C API 1.0 release at this point; when that happens, we'll pull it into SLiM, resolve any final issues that causes (getting all the tskit-based tests for SLiM green again, notably), update pyslim to match (I believe relevant issues on pyslim are 170, 206, 207, 147), and stick a fork in it.

jeromekelleher · 2021-11-11T20:01:04Z

OK, so if I understand correctly, @petrelharp, I will back out the changes I made to move the reference sequence into metadata and go back to doing it exactly as it was before; as I understand it, that is the right move to be compatible with the tskit changes proposed here. I guess I'll wait to do that until the dust has settled here, to be sure that nothing changes.

Don't make any changes yet @bhaller! We haven't decided we're definitely going to do it yet, it's a question now of whether we want to get Python 0.4.0 out sooner without reference sequences or a bit later with them. Any opinions @bhaller @petrelharp ?

petrelharp · 2021-11-11T20:18:07Z

Well, it'd be nice to not change where reference sequences go and then change back; if this doesn't go into tskit now we'll need to put them in metadata for the time being. But, @bhaller has done the hard work already...

benjeffery · 2021-11-11T20:18:07Z

My gut says to get it in, but undocumented for 0.4.0. I will have a quick recce to guage required effort. As it's just extra fields should be straight forward.

bhaller · 2021-11-11T20:44:11Z

I'd very much like to see it happen now, so that SLiM 3.7 can have this stuff in place. Ties up a loose end for us, and as @petrelharp says, avoid having to move the reference sequence into metadata and then back out again, which is rather confusing for those users who use it. Undocumented for 0.4.0 seems fine from our end; let's just get it to its final resting place.

jeromekelleher · 2021-11-12T08:32:54Z

We'll have a look at the details @bhaller and report back with a proposal by Monday.

bhaller · 2021-11-16T15:07:07Z

Hi campers. Just a ping, since Monday is now history. :->

benjeffery · 2021-11-17T01:59:37Z

Sorry forgot you don't subscribe to the firehose! #1911

bhaller · 2021-11-17T02:14:10Z

Sorry forgot you don't subscribe to the firehose! #1911

OK, great! So the decision has been made, the reference sequence will go in reference_sequence/data in kastore for sure? I should set the wheels in motion on my end?

[EDIT: Well, it sure looks that way from the diffs at #1911, so I'm going to go ahead and do it. :-> Thanks all – good to have this settled!]

benjeffery · 2021-11-18T14:04:31Z

Sorry forgot you don't subscribe to the firehose! #1911

OK, great! So the decision has been made, the reference sequence will go in reference_sequence/data in kastore for sure? I should set the wheels in motion on my end?

[EDIT: Well, it sure looks that way from the diffs at #1911, so I'm going to go ahead and do it. :-> Thanks all – good to have this settled!]

Yes, but you should use the APIs, not accessing the kastore directly.

bhaller · 2021-11-18T15:45:16Z

Yes, but you should use the APIs, not accessing the kastore directly.

Ah, yes, indeed. Thanks for reminding me.

Closes tskit-dev#146

jeromekelleher added the enhancement New feature or request label Mar 13, 2019

petrelharp changed the title ~~Support for references~~ Support for reference sequences Mar 15, 2019

petrelharp mentioned this issue Aug 23, 2019

Outputting fasta files from tree sequences tskit-dev/msprime#802

Closed

marianne-aspbury mentioned this issue Aug 23, 2019

Outputting fasta files from tree sequences #338

Closed

petrelharp mentioned this issue Sep 2, 2019

Remove C support for haplotypes and implement in Python #326

Closed

hyanwong mentioned this issue Nov 19, 2019

Finalise semantics of fasta output #353

Closed

hyanwong mentioned this issue Nov 27, 2019

Multi letter alleles (e.g. indels) in haplotypes() #427

Closed

hyanwong mentioned this issue Feb 5, 2020

API for 'finite_sites'? tskit-dev/msprime#880

Closed

hyanwong mentioned this issue Aug 12, 2020

New top-level function for 1.0 tskit-dev/msprime#1119

Closed

hyanwong mentioned this issue Nov 27, 2020

Standardised metadata keys #495

Closed

benjeffery mentioned this issue Dec 10, 2020

add genotype matrix command line option #1080

Closed

benjeffery mentioned this issue Jun 9, 2021

move reference_sequence to metadata? tskit-dev/pyslim#147

Closed

jeromekelleher mentioned this issue Nov 11, 2021

Finalise alignment export #1894

Merged

jeromekelleher added this to the Python 0.4.0 milestone Nov 11, 2021

bhaller mentioned this issue Nov 17, 2021

move reference sequence to top-level metadata MesserLab/SLiM#180

Closed

jeromekelleher mentioned this issue Nov 23, 2021

Update alignments and downstream methods to use reference sequence #1935

Closed

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Dec 2, 2021

Closes tskit-dev#146

d43961e

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Dec 2, 2021

Basic undocumented support for reference sequences.

2df7ebf

Closes tskit-dev#146

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Dec 2, 2021

Basic undocumented support for reference sequences.

11c408e

Closes tskit-dev#146

mergify bot closed this as completed in 43921cc Dec 2, 2021

Support for reference sequences #146

Support for reference sequences #146

Comments

jeromekelleher commented Mar 13, 2019

hyanwong commented Mar 13, 2019

bhaller commented Mar 13, 2019

jeromekelleher commented Mar 13, 2019

petrelharp commented Mar 15, 2019

hyanwong commented Mar 15, 2019

molpopgen commented Mar 15, 2019

petrelharp commented Mar 19, 2019

jeromekelleher commented Mar 19, 2019

petrelharp commented Mar 19, 2019

petrelharp commented Mar 19, 2019

jeromekelleher commented Mar 19, 2019

molpopgen commented Mar 19, 2019 via email

jeromekelleher commented Mar 19, 2019

bhaller commented Mar 19, 2019

petrelharp commented Mar 23, 2019

petrelharp commented Sep 2, 2019

hyanwong commented Sep 2, 2019

hyanwong commented Nov 27, 2019

benjeffery commented Nov 8, 2021

bhaller commented Nov 8, 2021

benjeffery commented Nov 8, 2021

bhaller commented Nov 8, 2021

jeromekelleher commented Nov 8, 2021

bhaller commented Nov 8, 2021

jeromekelleher commented Nov 11, 2021

petrelharp commented Nov 11, 2021

bhaller commented Nov 11, 2021

petrelharp commented Nov 11, 2021

bhaller commented Nov 11, 2021 • edited

jeromekelleher commented Nov 11, 2021

petrelharp commented Nov 11, 2021

benjeffery commented Nov 11, 2021

bhaller commented Nov 11, 2021

jeromekelleher commented Nov 12, 2021

bhaller commented Nov 16, 2021

benjeffery commented Nov 17, 2021

bhaller commented Nov 17, 2021 • edited

benjeffery commented Nov 18, 2021

bhaller commented Nov 18, 2021

bhaller commented Nov 11, 2021 •

edited

bhaller commented Nov 17, 2021 •

edited