New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for reference sequences #146
Comments
This seems very sensible, but I wonder if the word "reference" is a bit too context-dependent. Perhaps "genome_reference" or something similar would be clearer? |
In SLiM I'm calling it the "ancestral sequence". I'm fine with "reference" or "genome_reference" or any such variant. The rest of the proposal seems good to me. It should be clarified, when this is documented, exactly what format reference/sequence is expected to be in – FASTA? Is a FASTA header allowed? Are newline characters allowed (\r, \n)? Nucleotides only, presumably, not FASTA amino acid sequences? But FASTA codes for ambiguous nucleotides of various sorts? But I think SLiM will just dump an unbroken stream of ACGT into it, so that is all that I need to be supported. |
Thanks @bhaller. To clarify, it'll be a flat array of chars, such that |
I think this is a great idea, and lean towards calling it just
|
FWIW I would (marginally) vote for |
@petrelharp is right here. I actually get papers across my desk where someone has calculated Fay and Wu's H w.r.to the reference allelic state :(. I do like @hyanwong's idea to call it |
@bhaller is going to actually start storing a reference sequence on output from SLiM nucleotide models, so we should at least get it nailed down what we will call it. The proposal is, I think,
All of these would be optional. How about it? |
OK, sounds good. I haven't researched what the right names for Also |
Besides deciding what to name these entries in the kastore, we should record this somehow, possibly by providing API functions that will write and read it. We think the main question is: is the reference sequence part of a table collection object? It would seem natural to make it a part of the table collection, because then you would get it when you read the
Since a |
We could at least write a function that would re-open a |
I don't know how we're going to manage it from the C perspective @petrelharp, and I'm afraid I just don't have time to work on it right now. I don't want to rush in changes to the C API without thinking through the consequences very carefully (particularly not with something this fundamental). I think the best thing to do from a SLiM perspective is to reopen the kastore in append mode after tskit has written it out, and write the |
Does it make sense to add just the reference state at each variable site vs
the entire reference?
__
Kevin Thornton
Associate Professor
Ecology and Evolutionary Biology
UC Irvine
www.molpopgen.org
github.com/molpopgen
github.com/ThorntonLab
…On Tue, Mar 19, 2019 at 1:36 PM Peter Ralph ***@***.***> wrote:
We could at least write a function that would re-open a .trees file and
add a reference sequence to it.
—
You are receiving this because you are on a team that was mentioned.
Reply to this email directly, view it on GitHub
<#146 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGHnH_HtcVCCz56Nr_k_VW8NrK_Uscx8ks5vYUnIgaJpZM4bs1uU>
.
|
That was my original proposal, but Ben has reasons for wanting the full reference/ancestral sequence. |
Yep, that's what we've got for now. I think Peter just wants to clean up the code; but that can wait indefinitely, what we have now is fine for release. |
Sounds good. We're doing that; and can continue this discussion when (if) someone else needs it. |
I'm redirecting discussion from #326 over here, where I think it belongs. There, @hyanwong wrote:
Instead of guessing based on whether positions are all integers we should instead do something cleaner. Here are two non-mutually-exclusive options:
This issue is about (1); but it seems like we might want to do something like (2) to be able to do (1). Otherwise whenever we do something with the reference sequence we'll have to check if positions are all integers, and if not, throw an error (or, something?). At the python level everything would be cleanest, if the relevant columns were actually integer... but this would probably be a giant pain otherwise. |
Yes, that would be much better.
Nice idea.
It's not strictly necessary to have recombinations at integer positions, of course, but I suppose it is the option most likely to be needed by users.
Are you suggesting that the |
I think there is a difference between a finite sites tree sequence and a tree sequence with a reference genome, and the two issues are getting confounded here. It is perfectly possible to have a TS with sites at finite positions, but without any reference sequence available. Indeed, I'm working on hacked versions of such tree sequences at the moment. I suggest we might want to open a separate issue about finite sites? |
It can't be in metadata as then the C API has no access.
Not sure I follow here? The special treatment would all be done by tskit. |
Yes, it was understood to be temporary; that's not a problem, although it would've been nice if, when I posted two weeks ago "Working on this now" I had known that it was about to change on the tskit side – I would've waited for the dust to settle. Anyway, I'm just saying that if we can now settle on the right final resting place, so that it doesn't have to change again after this, that would be good. :-> I'm also quite surprised that the place being proposed is back in kastore, since @petrelharp and @jeromekelleher made such strong arguments that having it in kastore was a bad idea and was causing problems.
Well, in MesserLab/SLiM#180 @petrelharp wrote:
It also needed to be special-cased in SLiM on both read and write, to put it into kastore and fetch it from kastore; that's not a big deal, but reading/writing to/from metadata is simpler code since it works the same way as many other things. But the big concern, as I understood it, was all the special-casing needed in pyslim; would the proposed changes to tskit remove all of that difficulty for pyslim? @petrelharp is on west coast time, so he probably won't show up here for a couple of hours yet; perhaps we ought to wait for him to chime in on this? |
The problems come from pyslim writing to the kastore without tskit knowing anything about it. If we add tskit support for a reference sequence then pyslim/SLiM will just use the tskit API and not touch the kastore. |
I see. That sounds like an improvement, then, certainly. Still, I think we ought to wait for @petrelharp to comment; I don't know all the issues on the pyslim side and can't really represent what his requirements might be. |
Apologies for the mixed messages @bhaller, the confusion is my fault. I didn't expect us to have to deal with this for another 6 months or so but working on outputting alignments has sort of forced the issue. Putting in workarounds for not having a reference sequence now might be less work than just doing it, so we're considering just doing it now. As @benjeffery noted above, the arguments for putting the reference into metadata were all motivated by tskit not knowing about references for some significant amount of time. |
Gotcha. No worries, just a bit of whiplash. :-> The plan here seems fine if it works for @petrelharp, and it will be good to have this resolved definitively. |
I've added this to the 0.4.0 milestone for now, just so we don't forget about it and release without thinking things through properly. |
This plan sounds good - and, apologies for the mixed messages. As usual, I'm leaning too far on the side of "let's just make it work with existing tools". And, I wasn't thinking about the C API needing to be able to use the reference sequence. |
OK, so if I understand correctly, @petrelharp, I will back out the changes I made to move the reference sequence into metadata and go back to doing it exactly as it was before; as I understand it, that is the right move to be compatible with the tskit changes proposed here. I guess I'll wait to do that until the dust has settled here, to be sure that nothing changes. |
That sounds right? I guess you'll want to do that before the next SLiM release (which is soon?) and so we need to decide whether this is going to be the final format (the kastore key anyhow) for reals. |
Next SLiM release is pretty much waiting for the tskit C API 1.0 release at this point; when that happens, we'll pull it into SLiM, resolve any final issues that causes (getting all the tskit-based tests for SLiM green again, notably), update pyslim to match (I believe relevant issues on pyslim are 170, 206, 207, 147), and stick a fork in it. |
Don't make any changes yet @bhaller! We haven't decided we're definitely going to do it yet, it's a question now of whether we want to get Python 0.4.0 out sooner without reference sequences or a bit later with them. Any opinions @bhaller @petrelharp ? |
Well, it'd be nice to not change where reference sequences go and then change back; if this doesn't go into tskit now we'll need to put them in metadata for the time being. But, @bhaller has done the hard work already... |
My gut says to get it in, but undocumented for 0.4.0. I will have a quick recce to guage required effort. As it's just extra fields should be straight forward. |
I'd very much like to see it happen now, so that SLiM 3.7 can have this stuff in place. Ties up a loose end for us, and as @petrelharp says, avoid having to move the reference sequence into metadata and then back out again, which is rather confusing for those users who use it. Undocumented for 0.4.0 seems fine from our end; let's just get it to its final resting place. |
We'll have a look at the details @bhaller and report back with a proposal by Monday. |
Hi campers. Just a ping, since Monday is now history. :-> |
Sorry forgot you don't subscribe to the firehose! #1911 |
OK, great! So the decision has been made, the reference sequence will go in [EDIT: Well, it sure looks that way from the diffs at #1911, so I'm going to go ahead and do it. :-> Thanks all – good to have this settled!] |
Yes, but you should use the APIs, not accessing the kastore directly. |
Ah, yes, indeed. Thanks for reminding me. |
For some applications it would be useful to know the reference sequence that a tree sequence coordinate space refers to. For example, with real data, we should (at a minimum) record the reference build (e.g., GRCh38) and the contig ID (e.g., chr22) associated with a tree sequence. Ideally, we would also like to be able to do things like:
As well as situations in which a well-known canonical reference is available, we may also have a one-off reference sequence that we wish to record, e.g., in simulations.
To support this, I suggest adding a
reference
section to the file store, with some fields. Roughly, these might looks like:The references section would be itself optional (keeping backward compatability), and some fields would probably be optional within this section (for example,
reference/sequence
would definitely be optional --- no point in storing GRCh38 chr1 over and over again). We can imagine having some mechanisms for automatically retrieving sequences using refget, but this isn't at all necessary for a basic implementation.At a high-level, we should try to follow any upstream standards as closely as possible, e.g. GA4GH refget and any others that are relevant.
Any thoughts @tskit-dev/all, @bhaller?
The text was updated successfully, but these errors were encountered: