method to convert alleles to nucleotides #174

petrelharp · 2021-06-10T20:42:06Z

Primary use case would be VCF output: if ts was produced by a SLiM nucleotide simulation then we'd do:

nts = pyslim.convert_alleles(ts)
with open('out.vcf', 'w') as f:
  nts.write_vcf(f)

If the simulation was produced by a non-nucleotide simulation, or a nucleotide simulation with some non-nucleotide mutations, then we'd do

nts = pyslim.convert_alleles(
            pyslim.generate_nucleotides(ts)
)
with open('out.vcf', 'w') as f:
  nts.write_vcf(f)

Closes #73. Closes #168.

jeromekelleher

LGTM, but I think it would be easier to document and write if it was split into two functions.

jeromekelleher · 2021-06-11T11:28:52Z

pyslim/methods.py

@@ -84,3 +86,93 @@ def recapitate(ts,
    return SlimTreeSequence(recap, reference_sequence=ts.reference_sequence)


+def convert_alleles(ts, generate=True, seed=None):


I would have thought it would be clearer to have two functions convert_alleles which expects a SLiM nucleotide model and generate_alleles which will generate random alleles. It took a few goes at reading to docstring to get the current semantics (why would you not use the SLiM nucleotides, if they exist already?)

Hm, that's a nice idea. Right now it's using nucleotides when they exist and generating them if they don't, so it works with mixed nuc- and non-nuc mutations. But there's no need to make it so complex.

bhaller · 2021-06-11T13:29:17Z

I haven't looked at this closely. I guess my first-blush question would be: if this remapping is primarily for the purpose of handling SLiM output, why not follow the same policy that SLiM follows, for consistency? (That policy is outlined in detail in the manual.) I'm not saying that policy is good; but if one is trying to wedge SLiM's world of infinite alleles and stacked mutations into VCF's world of nucleotide-based alleles, I'm not sure any policy is good, and if people want VCF output that really makes total sense, they should probably just run a nucleotide-based model in the first place. Consistency seems like the most important goal, maybe? So having SLiM's VCF output and pyslim's VCF output doing totally different things just seems potentially weird and confusing. (On the other hand, I guess it gives users a choice of which style of VCF output they would prefer :->).

petrelharp · 2021-06-11T15:32:23Z

why not follow the same policy that SLiM follows, for consistency?

Good idea. But after reading the policy careful, I think my answer is that (a) that'd be way too much work to duplicate (since I couldn't just use tskit.write_vcf), and (b) I think it'd be nice to have an alternative for how multiallelic states are handled. SLiM for non-nucleotide models represents all mutations as A->T, with one line per mutation, and you say that there are only four nucleotides so you can represent at most only four states; but I think that's not right - you're allowed multi-nucleotide alleles in VCF (e.g., A,CCA,TTTTT in the REF column). Here, I'm generating unique alleles by using multicharacter strings. I think my goal here is to produce a legal VCF file (so, without duplicated POS rows) with as much information as possible (without dropping multiallelic sites), and in a relatively simple way.

petrelharp · 2021-06-11T15:39:44Z

Ok, revised proposal:

convert_alleles(ts):
    ```
    Copies the nucleotide state of each SLiM mutation to the derived state, and the reference sequence to the ancestral state.
    Mutations with no nucleotide (i.e., -1 in metadata) will have a derived state of `"."` (i.e., "missing") and tree sequences without
    a reference sequence will have ancestral states of `"."`.
    ```

generate_nucleotides(ts, reference_sequence=None, keep=True):
    ```
    Generates random nucleotides, by choosing a uniformly random nucleotide that differs from the parental state,
    and (if `ancestral_sequence` is True) a random reference sequence. Nucleotide state is stored in mutation metadata;
    this method will modify the metadata and keep everything else the same. If `keep=True`, then mutations that already have
    a nucleotide state (i.e., not set to -1 in metadata) will retain their state.
    ```

bhaller · 2021-06-11T16:27:41Z

why not follow the same policy that SLiM follows, for consistency?

Good idea. But after reading the policy careful, I think my answer is that (a) that'd be way too much work to duplicate (since I couldn't just use tskit.write_vcf), and (b) I think it'd be nice to have an alternative for how multiallelic states are handled. SLiM for non-nucleotide models represents all mutations as A->T, with one line per mutation, and you say that there are only four nucleotides so you can represent at most only four states; but I think that's not right - you're allowed multi-nucleotide alleles in VCF (e.g., A,CCA,TTTTT in the REF column). Here, I'm generating unique alleles by using multicharacter strings. I think my goal here is to produce a legal VCF file (so, without duplicated POS rows) with as much information as possible (without dropping multiallelic sites), and in a relatively simple way.

OK, that's reasonable. I think it might turn out to be pretty weird, for many users, that (conceptually) point mutations get represented as indels in the VCF. I can see that maybe raising issues with downstream analysis tools, which is usually what people want VCF for in the first place. But maybe not any weirder than how SLiM handles them, and as you say, at least it's fully VCF-compliant. And people can import back into SLiM and write VCF from there if they really want that style of output.

Re: your revised proposal. Hmm, "uniformly random nucleotide that differs from the parental state, and (if ancestral_sequence is True) a random reference sequence". If there are multiple segregating mutations at a single site, is there any guarantee that those mutations will receive different random nucleotides? Is that a concern?

petrelharp · 2021-06-11T18:10:52Z

Sorry, I didn't make it clear - this revised proposal ditches the idea of using multi-nucleotides to ensure distinct alleles. I could provide that as an example somewhere, instead. In the current proposal, generate_nucleotides serves two purposes: (a) getting a tree sequence ready to load into SLiM for a nucleotide model, and (b) letting users output vcf files from any tree sequences. More specific requirements for VCF output (eg having all alleles distinct) could be done ad-hoc by users (and I'd provide an example).

bhaller · 2021-06-11T18:36:45Z

Sorry, I didn't make it clear - this revised proposal ditches the idea of using multi-nucleotides to ensure distinct alleles. I could provide that as an example somewhere, instead.

Ah, OK. I think that's for the best; creating fictional indels just for VCF output seems rather extreme.

In the current proposal, generate_nucleotides serves two purposes: (a) getting a tree sequence ready to load into SLiM for a nucleotide model, and (b) letting users output vcf files from any tree sequences. More specific requirements for VCF output (eg having all alleles distinct) could be done ad-hoc by users (and I'd provide an example).

I see, OK. Makes sense as far as I can see, but I can tell you that people who use VCF files for their analysis have very specific/fussy requirements, so you might want to run these plans past an empirical biologist or two for a sanity check. :->

codecov-commenter · 2021-06-11T23:16:37Z

Codecov Report

Merging #174 (806dd4e) into main (198949f) will increase coverage by 0.60%.
The diff coverage is 96.77%.

@@            Coverage Diff             @@
##             main     #174      +/-   ##
==========================================
+ Coverage   87.15%   87.76%   +0.60%     
==========================================
  Files           7        7              
  Lines         919      981      +62     
  Branches      169      187      +18     
==========================================
+ Hits          801      861      +60     
- Misses         87       88       +1     
- Partials       31       32       +1

Impacted Files	Coverage Δ
pyslim/methods.py	`94.50% <96.77%> (+4.85%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 198949f...806dd4e. Read the comment docs.

petrelharp · 2021-06-13T05:12:28Z

Well, this turned out to be unavoidably complicated, since substitutions will manifest as stacked mutations, and for both of these methods we need to figure out which SLiM mutation was the 'most recent' one, which is not straightforward. I'm deciding the most recent one is the one that

has the largest slim_time
is not present in the list of SLiM mutations for the parent mutation, if any
has the largest SLiM ID, if the other two fail to distinguish.

This is implemented here and tested here. It's not pretty.

@bhaller, mind having a look at whether the API and documentation makes sense (so, here and here) and whether what I said above seems right to you? I suppose I could also be testing for agreement with what SLiM thinks is the nucleotides, by loading things into SLiM, but I have not.

bhaller · 2021-06-13T06:42:28Z

Yeah, this is a pain for sure. It would be nice if SLiM put the derived state in order, but the reason it doesn't is that it would take time, and thus slow down every simulation, and most simulations don't care about this; so it would slow down the many for a benefit to the few or the one, which would contradict Spock's dictum. So. I'll have a look at this tomorrow.

bhaller · 2021-06-14T17:03:45Z

Hey. OK, well, I looked at the implementation and test code, but it is beyond my rudimentary skills in Python; I really can't say whether what it's doing looks correct or not. I'm just not a Python-fluent programmer, sorry. Maybe some day, but since I use it quite infrequently, I've even forgotten most of the Python that I learned in the past.

I looked at the doc links you provided. The doc for convert_alleles is very clear, seems good. The one thing that gave me pause was this statement: "This method will produce an error... if any mutations do not have nucleotides". Hmm. It is legal and valid to do a nucleotide-based model that contains some non-nucleotide-based mutations; this can be useful, for example, if you want to use marker mutations to designate which genomes contain an inversion at a particular position, for example. So, is there something better than throwing an error that this function could do when confronted by such tree sequences?

Also, a nit: your use of the word "insert" doesn't seem quite right, as "insert" means, to me, "add a new thing somewhere internally, while leaving what was there before unchanged"; you seem to mean "replace", not "insert", especially in "will instead insert the nucleotide from the mutation's metadata".

Another nit: "this method tries to assign" should start with a capitol letter.

Should reference_sequence be a string of letters, "ACGTACGT", or a string of integers, "01230123"? I would have assumed letters, but then you discuss integer nucleotide values in a couple spots, so then I start to wonder. Also: should the code check that reference_sequence is actually well-formed, containing what it is expected to contain, beyond just its length? What happens if someone passes "Thequickbrownfoxjumpsoverthelazydog" for a reference sequence? :->

That's all that I see.

petrelharp · 2021-06-14T17:13:09Z

Perfect, thanks! And, sorry the python is so impenetrable.

It is legal and valid to do a nucleotide-based model that contains some non-nucleotide-based mutations; this can be useful, for example, if you want to use marker mutations to designate which genomes contain an inversion at a particular position, for example. So, is there something better than throwing an error that this function could do when confronted by such tree sequences?

Well, there might be something better; but I'm trying to keep things reasonly simple; so in this case I would recommend calling generate_nucleotides first (with keep=True, the default); this will insert nucleotides for the non-nucleotide mutations. I cannot think of a better general-purpose default; for more complex things the user will need to do it themselves.

TODO:

your use of the word "insert" doesn't seem quite right, as "insert" means, to me, "add a new thing somewhere internally, while leaving what was there before unchanged"; you seem to mean "replace", not "insert", especially in "will instead insert the nucleotide from the mutation's metadata".
"this method tries to assign" should start with a capitol letter.
clarify that reference_sequence should be a string of nucleotides (not digits).
suggest using generate_nucleotides in convert_alleles besides just saying not having nucs produces an error
add consistency check on reference sequences

And, good thought, but I don't think it's the job of this function to go checking all possible aspects of the reference sequence.

bhaller · 2021-06-14T17:21:58Z

Perfect, thanks! And, sorry the python is so impenetrable.

Just not a language I'm fluent in, I'm sure there's nothing wrong with the code. :->

It is legal and valid to do a nucleotide-based model that contains some non-nucleotide-based mutations; this can be useful, for example, if you want to use marker mutations to designate which genomes contain an inversion at a particular position, for example. So, is there something better than throwing an error that this function could do when confronted by such tree sequences?

Well, there might be something better; but I'm trying to keep things reasonly simple; so in this case I would recommend calling generate_nucleotides first (with keep=True, the default); this will insert nucleotides for the non-nucleotide mutations. I cannot think of a better general-purpose default; for more complex things the user will need to do it themselves.

OK. Perhaps the doc could suggest that workflow, rather than just saying that it produces an error?

And, good thought, but I don't think it's the job of this function to go checking all possible aspects of the reference sequence.

OK. In general I'm in favor of sanity-checking inputs for user-visible APIs whenever there is not a performance reason not to, so that the user gets a good error message instead of just mysteriously/confusingly wrong behavior, but it certainly does add more work. Of course my example was a bit flip, but it might catch a legit typo in a hand-typed string like "ACGTAVGT", or a confused user doing "01230123", or FASTA data that contains extended nucleotide codes like U, R, Y, etc. (https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation). I have had users try to input such codes into SLiM for an ancestral sequence. :->

petrelharp · 2021-06-14T17:26:39Z

Compelling. Added these to the TODOs.

This was referenced Jun 10, 2021

allele remapper for write_vcf tskit-dev/tskit#1081

Closed

provide method to remap alleles (eg for VCF output) #168

Closed

petrelharp force-pushed the nucleotidify branch from 456d30c to d4b929f Compare June 10, 2021 20:44

petrelharp mentioned this pull request Jun 11, 2021

Write VCF with nucleotides only #115

Closed

petrelharp force-pushed the nucleotidify branch from 411a72d to 26dd122 Compare June 11, 2021 05:32

jeromekelleher reviewed Jun 11, 2021

View reviewed changes

petrelharp force-pushed the nucleotidify branch from 26dd122 to 990ff44 Compare June 11, 2021 22:58

petrelharp force-pushed the nucleotidify branch from f700aa3 to 185a8c4 Compare June 15, 2021 03:54

nucleotide assignment

34e6016

petrelharp force-pushed the nucleotidify branch from 185a8c4 to 34e6016 Compare June 15, 2021 04:12

petrelharp merged commit bcc6ebe into tskit-dev:main Jun 15, 2021

petrelharp mentioned this pull request Jun 15, 2021

add example of converting to nucleotides for VCF output #178

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

method to convert alleles to nucleotides #174

method to convert alleles to nucleotides #174

petrelharp commented Jun 10, 2021 •

edited

jeromekelleher left a comment

jeromekelleher Jun 11, 2021

petrelharp Jun 11, 2021

bhaller commented Jun 11, 2021

petrelharp commented Jun 11, 2021

petrelharp commented Jun 11, 2021

bhaller commented Jun 11, 2021

petrelharp commented Jun 11, 2021

bhaller commented Jun 11, 2021

codecov-commenter commented Jun 11, 2021 •

edited

petrelharp commented Jun 13, 2021

bhaller commented Jun 13, 2021

bhaller commented Jun 14, 2021

petrelharp commented Jun 14, 2021 •

edited

bhaller commented Jun 14, 2021

petrelharp commented Jun 14, 2021

		@@ -84,3 +86,93 @@ def recapitate(ts,
		return SlimTreeSequence(recap, reference_sequence=ts.reference_sequence)


		def convert_alleles(ts, generate=True, seed=None):

method to convert alleles to nucleotides #174

method to convert alleles to nucleotides #174

Conversation

petrelharp commented Jun 10, 2021 • edited

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher Jun 11, 2021

Choose a reason for hiding this comment

petrelharp Jun 11, 2021

Choose a reason for hiding this comment

bhaller commented Jun 11, 2021

petrelharp commented Jun 11, 2021

petrelharp commented Jun 11, 2021

bhaller commented Jun 11, 2021

petrelharp commented Jun 11, 2021

bhaller commented Jun 11, 2021

codecov-commenter commented Jun 11, 2021 • edited

Codecov Report

petrelharp commented Jun 13, 2021

bhaller commented Jun 13, 2021

bhaller commented Jun 14, 2021

petrelharp commented Jun 14, 2021 • edited

bhaller commented Jun 14, 2021

petrelharp commented Jun 14, 2021

petrelharp commented Jun 10, 2021 •

edited

codecov-commenter commented Jun 11, 2021 •

edited

petrelharp commented Jun 14, 2021 •

edited