Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for reference sequences #146

Closed
jeromekelleher opened this issue Mar 13, 2019 · 46 comments
Closed

Support for reference sequences #146

jeromekelleher opened this issue Mar 13, 2019 · 46 comments
Labels
enhancement New feature or request
Milestone

Comments

@jeromekelleher
Copy link
Member

For some applications it would be useful to know the reference sequence that a tree sequence coordinate space refers to. For example, with real data, we should (at a minimum) record the reference build (e.g., GRCh38) and the contig ID (e.g., chr22) associated with a tree sequence. Ideally, we would also like to be able to do things like:

for site in ts.sites():
     print(site.position, site.ref_allele)

As well as situations in which a well-known canonical reference is available, we may also have a one-off reference sequence that we wish to record, e.g., in simulations.

To support this, I suggest adding a reference section to the file store, with some fields. Roughly, these might looks like:

reference/build         -- e.g. GRCh38
reference/contig       -- e.g. chr22
reference/id              -- md5 hash of the sequence, c.f. refget
reference/sequence -- Actual sequence information

The references section would be itself optional (keeping backward compatability), and some fields would probably be optional within this section (for example, reference/sequence would definitely be optional --- no point in storing GRCh38 chr1 over and over again). We can imagine having some mechanisms for automatically retrieving sequences using refget, but this isn't at all necessary for a basic implementation.

At a high-level, we should try to follow any upstream standards as closely as possible, e.g. GA4GH refget and any others that are relevant.

Any thoughts @tskit-dev/all, @bhaller?

@jeromekelleher jeromekelleher added the enhancement New feature or request label Mar 13, 2019
@hyanwong
Copy link
Member

This seems very sensible, but I wonder if the word "reference" is a bit too context-dependent. Perhaps "genome_reference" or something similar would be clearer?

@bhaller
Copy link

bhaller commented Mar 13, 2019

In SLiM I'm calling it the "ancestral sequence". I'm fine with "reference" or "genome_reference" or any such variant. The rest of the proposal seems good to me. It should be clarified, when this is documented, exactly what format reference/sequence is expected to be in – FASTA? Is a FASTA header allowed? Are newline characters allowed (\r, \n)? Nucleotides only, presumably, not FASTA amino acid sequences? But FASTA codes for ambiguous nucleotides of various sorts? But I think SLiM will just dump an unbroken stream of ACGT into it, so that is all that I need to be supported.

@jeromekelleher
Copy link
Member Author

Thanks @bhaller. To clarify, it'll be a flat array of chars, such that reference[pos] is the reference allele at position pos.

@petrelharp petrelharp changed the title Support for references Support for reference sequences Mar 15, 2019
@petrelharp
Copy link
Contributor

I think this is a great idea, and lean towards calling it just reference (although reference_sequence would also be fine). I don't think it should be called ancestral_sequence, because I'd argue these shouldn't be the same thing, necessarily. Here's some reasons:

  • If we say they're supposed to be the same, then the ancestral_state column of the Site table is supposed to match the positions in the reference sequence, which is annoying and redundant.
  • The reference sequence is usually not the ancestral sequence, in practice. (eg GRCh38), but is still useful.
  • The reference sequence could still be used by e.g. SLiM to store, essentially, the ancestral sequence.

@hyanwong
Copy link
Member

FWIW I would (marginally) vote for reference_sequence. reference sounds like it might contain metadata about bibliographic references relevant to the provenance of the data. But maybe I'm being too naive.

@molpopgen
Copy link
Member

@petrelharp is right here. I actually get papers across my desk where someone has calculated Fay and Wu's H w.r.to the reference allelic state :(.

I do like @hyanwong's idea to call it reference_sequence. At first, I thought this thread was about adding bibtex support to the docs.

@petrelharp
Copy link
Contributor

@bhaller is going to actually start storing a reference sequence on output from SLiM nucleotide models, so we should at least get it nailed down what we will call it. The proposal is, I think,

  • reference_sequence/build -- e.g. GRCh38
  • reference_sequence/contig -- e.g. chr22
  • reference_sequence/id -- md5 hash of the sequence, c.f. refget
  • reference_sequence/sequence -- Actual sequence information

All of these would be optional. How about it?

@jeromekelleher
Copy link
Member Author

OK, sounds good. I haven't researched what the right names for build, contig etc should be, so perhaps we could just keep it to reference_sequence/sequence if we're going to ship files that include it?

Also reference_sequence/sequence seems repetitive and unintuitive. Maybe reference_sequence/data would be better?

@petrelharp
Copy link
Contributor

Besides deciding what to name these entries in the kastore, we should record this somehow, possibly by providing API functions that will write and read it. We think the main question is: is the reference sequence part of a table collection object? It would seem natural to make it a part of the table collection, because then you would get it when you read the .trees file. But, in the SLiM application, the reference sequence is expected to be large, and making it part of the table collection object means that there will be more than one copy of it sitting around, because:

  • we have to copy the table collection when we write it out; and also
  • SLiM has an internal version of the reference sequence, so when reading in the .trees file with table_collection_load( ), if reference_sequence was part of the resulting object, it would need to be copied over to where SLiM wants it

Since a table_collection_t has a kastore_t, we could define a function tsk_table_collection_read_reference_sequence( ) that would go into the store and pull out the reference sequence (if indeed such a store is associated). I'm not sure how to deal with writing out, though?

@petrelharp
Copy link
Contributor

We could at least write a function that would re-open a .trees file and add a reference sequence to it.

@jeromekelleher
Copy link
Member Author

I don't know how we're going to manage it from the C perspective @petrelharp, and I'm afraid I just don't have time to work on it right now. I don't want to rush in changes to the C API without thinking through the consequences very carefully (particularly not with something this fundamental).

I think the best thing to do from a SLiM perspective is to reopen the kastore in append mode after tskit has written it out, and write the reference sequence/data key in. Should be ~10 lines of code.

@molpopgen
Copy link
Member

molpopgen commented Mar 19, 2019 via email

@jeromekelleher
Copy link
Member Author

Does it make sense to add just the reference state at each variable site vs the entire reference?

That was my original proposal, but Ben has reasons for wanting the full reference/ancestral sequence.

@bhaller
Copy link

bhaller commented Mar 19, 2019

I think the best thing to do from a SLiM perspective is to reopen the kastore in append mode after tskit has written it out, and write the reference sequence/data key in. Should be ~10 lines of code.

Yep, that's what we've got for now. I think Peter just wants to clean up the code; but that can wait indefinitely, what we have now is fine for release.

@petrelharp
Copy link
Contributor

Sounds good. We're doing that; and can continue this discussion when (if) someone else needs it.

@petrelharp
Copy link
Contributor

I'm redirecting discussion from #326 over here, where I think it belongs. There, @hyanwong wrote:

I think we need to distinguish between (a) tree sequences that using the (msprime default) infinite sites mutation model and (b) tree sequences that have sites at integer positions (and hence probably use finite mutation models ).

Instead of guessing based on whether positions are all integers we should instead do something cleaner. Here are two non-mutually-exclusive options:

  1. have a reference_sequence property of tree sequences, which might be NULL, and then have things like haplotypes and fasta output optionally use this to output invariant sites also.
  2. make a subclass of the tree sequence, say FiniteSitesTreeSequence (ick, though) that only has mutations (and also recombinations?) at integer positions

This issue is about (1); but it seems like we might want to do something like (2) to be able to do (1). Otherwise whenever we do something with the reference sequence we'll have to check if positions are all integers, and if not, throw an error (or, something?). At the python level everything would be cleanest, if the relevant columns were actually integer... but this would probably be a giant pain otherwise.

@hyanwong
Copy link
Member

hyanwong commented Sep 2, 2019

Instead of guessing based on whether positions are all integers we should instead do something cleaner.

Yes, that would be much better.

  1. have a reference_sequence property of tree sequences, which might be NULL, and then have things like haplotypes and fasta output optionally use this to output invariant sites also.

Nice idea.

  1. make a subclass of the tree sequence, say FiniteSitesTreeSequence (ick, though) that only has mutations (and also recombinations?) at integer positions

It's not strictly necessary to have recombinations at integer positions, of course, but I suppose it is the option most likely to be needed by users.

This issue is about (1); but it seems like we might want to do something like (2) to be able to do (1). Otherwise whenever we do something with the reference sequence we'll have to check if positions are all integers, and if not, throw an error (or, something?). At the python level everything would be cleanest, if the relevant columns were actually integer... but this would probably be a giant pain otherwise.

Are you suggesting that the position in a FiniteSitesTreeSequence is stored as a C double, but converted to an integer in the python accessor functions?

@hyanwong
Copy link
Member

I think there is a difference between a finite sites tree sequence and a tree sequence with a reference genome, and the two issues are getting confounded here. It is perfectly possible to have a TS with sites at finite positions, but without any reference sequence available. Indeed, I'm working on hacked versions of such tree sequences at the moment.

I suggest we might want to open a separate issue about finite sites?

@benjeffery
Copy link
Member

It can't be in metadata as then the C API has no access.
From reading MesserLab/SLiM#180 it is clear that putting it in metadata was expected to be temporary? We originally expected a year from then to fix, it has been six months so far.

kastore key, which requires special treatment everywhere

Not sure I follow here? The special treatment would all be done by tskit.

@bhaller
Copy link

bhaller commented Nov 8, 2021

It can't be in metadata as then the C API has no access. From reading MesserLab/SLiM#180 it is clear that putting it in metadata was expected to be temporary? We originally expected a year from then to fix, it has been six months so far.

Yes, it was understood to be temporary; that's not a problem, although it would've been nice if, when I posted two weeks ago "Working on this now" I had known that it was about to change on the tskit side – I would've waited for the dust to settle. Anyway, I'm just saying that if we can now settle on the right final resting place, so that it doesn't have to change again after this, that would be good. :-> I'm also quite surprised that the place being proposed is back in kastore, since @petrelharp and @jeromekelleher made such strong arguments that having it in kastore was a bad idea and was causing problems.

kastore key, which requires special treatment everywhere

Not sure I follow here? The special treatment would all be done by tskit.

Well, in MesserLab/SLiM#180 @petrelharp wrote:

So, for the time being, "if it ain't broke don't fix it" would be my guiding principle. :->

Agree, but it is kinda broken at the moment, because currently it's way too easy for users to lose the reference sequence - to avoid this, pyslim's SlimTreeSequence would need to wrap all the tree sequence transformation methods to stick back on the reference sequence. In fact, many of our examples lose the reference sequence - any time we wrap something in SlimTreeSequence( ), like for instance SlimTreeSequence(msprime.mutate(ts, ....)), we have lost it. Moving it to metadata will be much easier than dealing with this, and much less confusing for the users. I don't see an alternative, besides just giving up on the idea of maintaining the reference sequence.

It also needed to be special-cased in SLiM on both read and write, to put it into kastore and fetch it from kastore; that's not a big deal, but reading/writing to/from metadata is simpler code since it works the same way as many other things. But the big concern, as I understood it, was all the special-casing needed in pyslim; would the proposed changes to tskit remove all of that difficulty for pyslim? @petrelharp is on west coast time, so he probably won't show up here for a couple of hours yet; perhaps we ought to wait for him to chime in on this?

@benjeffery
Copy link
Member

@petrelharp and @jeromekelleher made such strong arguments that having it in kastore was a bad idea and was causing problems.

The problems come from pyslim writing to the kastore without tskit knowing anything about it. If we add tskit support for a reference sequence then pyslim/SLiM will just use the tskit API and not touch the kastore.

@bhaller
Copy link

bhaller commented Nov 8, 2021

@petrelharp and @jeromekelleher made such strong arguments that having it in kastore was a bad idea and was causing problems.

The problems come from pyslim writing to the kastore without tskit knowing anything about it. If we add tskit support for a reference sequence then pyslim/SLiM will just use the tskit API and not touch the kastore.

I see. That sounds like an improvement, then, certainly. Still, I think we ought to wait for @petrelharp to comment; I don't know all the issues on the pyslim side and can't really represent what his requirements might be.

@jeromekelleher
Copy link
Member Author

Apologies for the mixed messages @bhaller, the confusion is my fault. I didn't expect us to have to deal with this for another 6 months or so but working on outputting alignments has sort of forced the issue. Putting in workarounds for not having a reference sequence now might be less work than just doing it, so we're considering just doing it now. As @benjeffery noted above, the arguments for putting the reference into metadata were all motivated by tskit not knowing about references for some significant amount of time.

@bhaller
Copy link

bhaller commented Nov 8, 2021

Gotcha. No worries, just a bit of whiplash. :-> The plan here seems fine if it works for @petrelharp, and it will be good to have this resolved definitively.

@jeromekelleher
Copy link
Member Author

I've added this to the 0.4.0 milestone for now, just so we don't forget about it and release without thinking things through properly.

@petrelharp
Copy link
Contributor

This plan sounds good - and, apologies for the mixed messages. As usual, I'm leaning too far on the side of "let's just make it work with existing tools". And, I wasn't thinking about the C API needing to be able to use the reference sequence.

@bhaller
Copy link

bhaller commented Nov 11, 2021

OK, so if I understand correctly, @petrelharp, I will back out the changes I made to move the reference sequence into metadata and go back to doing it exactly as it was before; as I understand it, that is the right move to be compatible with the tskit changes proposed here. I guess I'll wait to do that until the dust has settled here, to be sure that nothing changes.

@petrelharp
Copy link
Contributor

That sounds right? I guess you'll want to do that before the next SLiM release (which is soon?) and so we need to decide whether this is going to be the final format (the kastore key anyhow) for reals.

@bhaller
Copy link

bhaller commented Nov 11, 2021

Next SLiM release is pretty much waiting for the tskit C API 1.0 release at this point; when that happens, we'll pull it into SLiM, resolve any final issues that causes (getting all the tskit-based tests for SLiM green again, notably), update pyslim to match (I believe relevant issues on pyslim are 170, 206, 207, 147), and stick a fork in it.

@jeromekelleher
Copy link
Member Author

OK, so if I understand correctly, @petrelharp, I will back out the changes I made to move the reference sequence into metadata and go back to doing it exactly as it was before; as I understand it, that is the right move to be compatible with the tskit changes proposed here. I guess I'll wait to do that until the dust has settled here, to be sure that nothing changes.

Don't make any changes yet @bhaller! We haven't decided we're definitely going to do it yet, it's a question now of whether we want to get Python 0.4.0 out sooner without reference sequences or a bit later with them. Any opinions @bhaller @petrelharp ?

@petrelharp
Copy link
Contributor

Well, it'd be nice to not change where reference sequences go and then change back; if this doesn't go into tskit now we'll need to put them in metadata for the time being. But, @bhaller has done the hard work already...

@benjeffery
Copy link
Member

My gut says to get it in, but undocumented for 0.4.0. I will have a quick recce to guage required effort. As it's just extra fields should be straight forward.

@bhaller
Copy link

bhaller commented Nov 11, 2021

I'd very much like to see it happen now, so that SLiM 3.7 can have this stuff in place. Ties up a loose end for us, and as @petrelharp says, avoid having to move the reference sequence into metadata and then back out again, which is rather confusing for those users who use it. Undocumented for 0.4.0 seems fine from our end; let's just get it to its final resting place.

@jeromekelleher
Copy link
Member Author

We'll have a look at the details @bhaller and report back with a proposal by Monday.

@bhaller
Copy link

bhaller commented Nov 16, 2021

Hi campers. Just a ping, since Monday is now history. :->

@benjeffery
Copy link
Member

Sorry forgot you don't subscribe to the firehose! #1911

@bhaller
Copy link

bhaller commented Nov 17, 2021

Sorry forgot you don't subscribe to the firehose! #1911

OK, great! So the decision has been made, the reference sequence will go in reference_sequence/data in kastore for sure? I should set the wheels in motion on my end?

[EDIT: Well, it sure looks that way from the diffs at #1911, so I'm going to go ahead and do it. :-> Thanks all – good to have this settled!]

@benjeffery
Copy link
Member

Sorry forgot you don't subscribe to the firehose! #1911

OK, great! So the decision has been made, the reference sequence will go in reference_sequence/data in kastore for sure? I should set the wheels in motion on my end?

[EDIT: Well, it sure looks that way from the diffs at #1911, so I'm going to go ahead and do it. :-> Thanks all – good to have this settled!]

Yes, but you should use the APIs, not accessing the kastore directly.

@bhaller
Copy link

bhaller commented Nov 18, 2021

Yes, but you should use the APIs, not accessing the kastore directly.

Ah, yes, indeed. Thanks for reminding me.

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Dec 2, 2021
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Dec 2, 2021
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Dec 2, 2021
@mergify mergify bot closed this as completed in 43921cc Dec 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Development

No branches or pull requests

6 participants