method to convert alleles to nucleotides #174
LGTM, but I think it would be easier to document and write if it was split into two functions.
pyslim/methods.py
@@ -84,3 +86,93 @@ def recapitate(ts,
    return SlimTreeSequence(recap, reference_sequence=ts.reference_sequence)


def convert_alleles(ts, generate=True, seed=None):
I would have thought it would be clearer to have two functions: convert_alleles, which expects a SLiM nucleotide model, and generate_alleles, which will generate random alleles. It took a few goes at reading the docstring to get the current semantics (why would you not use the SLiM nucleotides, if they exist already?)
Hm, that's a nice idea. Right now it's using nucleotides when they exist and generating them if they don't, so it works with mixed nuc- and non-nuc mutations. But there's no need to make it so complex.
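For illustration only, here is a minimal sketch of what the suggested two-function split could look like. It is not the PR's code: it works on a toy list of dicts rather than a tree sequence, the `mutations` argument and the `"nucleotide"` key are hypothetical stand-ins, and the convention that codes 0 to 3 mean A, C, G, T (with anything else meaning "no nucleotide") is an assumption of the sketch.

```python
import random

NUCLEOTIDES = "ACGT"


def convert_alleles(mutations):
    """Map existing nucleotide codes (0-3) to letters; fail on anything else."""
    alleles = []
    for mut in mutations:
        code = mut["nucleotide"]  # hypothetical field name for this sketch
        if code not in (0, 1, 2, 3):
            raise ValueError(f"mutation has no nucleotide (code {code})")
        alleles.append(NUCLEOTIDES[code])
    return alleles


def generate_alleles(mutations, seed=None):
    """Ignore any stored nucleotides and draw one random letter per mutation."""
    rng = random.Random(seed)
    return [rng.choice(NUCLEOTIDES) for _ in mutations]
```

With this shape each function has one job, so the docstring no longer has to explain when stored nucleotides are used and when they are replaced.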
I haven't looked at this closely. I guess my first-blush question would be: if this remapping is primarily for the purpose of handling SLiM output, why not follow the same policy that SLiM follows, for consistency? (That policy is outlined in detail in the manual.) I'm not saying that policy is good; but if one is trying to wedge SLiM's world of infinite alleles and stacked mutations into VCF's world of nucleotide-based alleles, I'm not sure any policy is good, and if people want VCF output that really makes total sense, they should probably just run a nucleotide-based model in the first place. Consistency seems like the most important goal, maybe? So having SLiM's VCF output and pyslim's VCF output doing totally different things just seems potentially weird and confusing. (On the other hand, I guess it gives users a choice of which style of VCF output they would prefer :->). |
Good idea. But after reading the policy carefully, I think my answer is that (a) that'd be way too much work to duplicate (since I couldn't just use |
Ok, revised proposal:
OK, that's reasonable. I think it might turn out to be pretty weird, for many users, that (conceptually) point mutations get represented as indels in the VCF. I can see that maybe raising issues with downstream analysis tools, which is usually what people want VCF for in the first place. But maybe not any weirder than how SLiM handles them, and as you say, at least it's fully VCF-compliant. And people can import back into SLiM and write VCF from there if they really want that style of output. Re: your revised proposal. Hmm, "uniformly random nucleotide that differs from the parental state, and (if |
Sorry, I didn't make it clear - this revised proposal ditches the idea of using multi-nucleotides to ensure distinct alleles. I could provide that as an example somewhere, instead. In the current proposal, |
Ah, OK. I think that's for the best; creating fictional indels just for VCF output seems rather extreme.
I see, OK. Makes sense as far as I can see, but I can tell you that people who use VCF files for their analysis have very specific/fussy requirements, so you might want to run these plans past an empirical biologist or two for a sanity check. :-> |
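As a rough sketch of the "uniformly random nucleotide that differs from the parental state" rule quoted above (and only that part of it): the helper name `draw_derived_nucleotide` is hypothetical, and the real method works on tree sequence mutations rather than single characters.

```python
import random

NUCLEOTIDES = "ACGT"


def draw_derived_nucleotide(parent, rng):
    """Pick uniformly among the three nucleotides that are not `parent`."""
    candidates = [n for n in NUCLEOTIDES if n != parent]
    return rng.choice(candidates)


# Usage: the derived state never equals the parental state.
rng = random.Random(123)
derived = draw_derived_nucleotide("A", rng)
```

Because the draw excludes the parental state, every mutation shows up as a real change in the output, without needing multi-nucleotide alleles.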
Codecov Report
@@ Coverage Diff @@
## main #174 +/- ##
==========================================
+ Coverage 87.15% 87.76% +0.60%
==========================================
Files 7 7
Lines 919 981 +62
Branches 169 187 +18
==========================================
+ Hits 801 861 +60
- Misses 87 88 +1
- Partials 31 32 +1
Well, this turned out to be unavoidably complicated, since substitutions will manifest as stacked mutations, and for both of these methods we need to figure out which SLiM mutation was the 'most recent' one, which is not straightforward. I'm deciding the most recent one is the one that
This is implemented here and tested here. It's not pretty. @bhaller, mind having a look at whether the API and documentation make sense (so, here and here) and whether what I said above seems right to you? I suppose I could also be testing for agreement with what SLiM thinks the nucleotides are, by loading things into SLiM, but I have not.
Yeah, this is a pain for sure. It would be nice if SLiM put the derived state in order, but the reason it doesn't is that it would take time, and thus slow down every simulation, and most simulations don't care about this; so it would slow down the many for a benefit to the few or the one, which would contradict Spock's dictum. So. I'll have a look at this tomorrow. |
Hey. OK, well, I looked at the implementation and test code, but it is beyond my rudimentary skills in Python; I really can't say whether what it's doing looks correct or not. I'm just not a Python-fluent programmer, sorry. Maybe some day, but since I use it quite infrequently, I've even forgotten most of the Python that I learned in the past. I looked at the doc links you provided. The doc for Also, a nit: your use of the word "insert" doesn't seem quite right, as "insert" means, to me, "add a new thing somewhere internally, while leaving what was there before unchanged"; you seem to mean "replace", not "insert", especially in "will instead insert the nucleotide from the mutation's metadata". Another nit: "this method tries to assign" should start with a capital letter. Should That's all that I see.
Perfect, thanks! And, sorry the python is so impenetrable.
Well, there might be something better; but I'm trying to keep things reasonably simple; so in this case I would recommend calling TODO:
And, good thought, but I don't think it's the job of this function to go checking all possible aspects of the reference sequence. |
Just not a language I'm fluent in, I'm sure there's nothing wrong with the code. :->
OK. Perhaps the doc could suggest that workflow, rather than just saying that it produces an error?
OK. In general I'm in favor of sanity-checking inputs for user-visible APIs whenever there is not a performance reason not to, so that the user gets a good error message instead of just mysteriously/confusingly wrong behavior, but it certainly does add more work. Of course my example was a bit flip, but it might catch a legit typo in a hand-typed string like "ACGTAVGT", or a confused user doing "01230123", or FASTA data that contains extended nucleotide codes like U, R, Y, etc. (https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation). I have had users try to input such codes into SLiM for an ancestral sequence. :-> |
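A sanity check along these lines could be as small as the sketch below. The function name `find_invalid_nucleotides` is hypothetical, and accepting only uppercase A, C, G, T is an assumption of the sketch, not something settled in this thread.

```python
VALID_NUCLEOTIDES = frozenset("ACGT")


def find_invalid_nucleotides(reference_sequence):
    """Return (position, character) pairs for every character outside ACGT.

    Positions are 0-based. An empty list means the sequence passed the check.
    """
    return [
        (pos, char)
        for pos, char in enumerate(reference_sequence)
        if char not in VALID_NUCLEOTIDES
    ]


# Usage: report the first offending character in an error message.
problems = find_invalid_nucleotides("ACGTAVGT")
if problems:
    pos, char = problems[0]
    message = f"invalid nucleotide {char!r} at position {pos}"
```

Reporting the position and the character lets the error message point at a typo or at an extended FASTA code directly.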
Compelling. Added these to the TODOs. |
Primary use case would be VCF output: if ts was produced by a SLiM nucleotide simulation then we'd do:
If the simulation was produced by a non-nucleotide simulation, or a nucleotide simulation with some non-nucleotide mutations, then we'd do
Closes #73. Closes #168.