Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

method to convert alleles to nucleotides #174

Merged
merged 1 commit into from Jun 15, 2021

Conversation

petrelharp
Copy link
Contributor

@petrelharp petrelharp commented Jun 10, 2021

Primary use case would be VCF output: if ts was produced by a SLiM nucleotide simulation then we'd do:

nts = pyslim.convert_alleles(ts)
with open('out.vcf', 'w') as f:
  nts.write_vcf(f)

If the simulation was produced by a non-nucleotide simulation, or a nucleotide simulation with some non-nucleotide mutations, then we'd do

nts = pyslim.convert_alleles(
            pyslim.generate_nucleotides(ts)
)
with open('out.vcf', 'w') as f:
  nts.write_vcf(f)

Closes #73. Closes #168.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I think it would be easier to document and write if it was split into two functions.

@@ -84,3 +86,93 @@ def recapitate(ts,
return SlimTreeSequence(recap, reference_sequence=ts.reference_sequence)


def convert_alleles(ts, generate=True, seed=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have thought it would be clearer to have two functions convert_alleles which expects a SLiM nucleotide model and generate_alleles which will generate random alleles. It took a few goes at reading to docstring to get the current semantics (why would you not use the SLiM nucleotides, if they exist already?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's a nice idea. Right now it's using nucleotides when they exist and generating them if they don't, so it works with mixed nuc- and non-nuc mutations. But there's no need to make it so complex.

@bhaller
Copy link
Collaborator

bhaller commented Jun 11, 2021

I haven't looked at this closely. I guess my first-blush question would be: if this remapping is primarily for the purpose of handling SLiM output, why not follow the same policy that SLiM follows, for consistency? (That policy is outlined in detail in the manual.) I'm not saying that policy is good; but if one is trying to wedge SLiM's world of infinite alleles and stacked mutations into VCF's world of nucleotide-based alleles, I'm not sure any policy is good, and if people want VCF output that really makes total sense, they should probably just run a nucleotide-based model in the first place. Consistency seems like the most important goal, maybe? So having SLiM's VCF output and pyslim's VCF output doing totally different things just seems potentially weird and confusing. (On the other hand, I guess it gives users a choice of which style of VCF output they would prefer :->).

@petrelharp
Copy link
Contributor Author

why not follow the same policy that SLiM follows, for consistency?

Good idea. But after reading the policy careful, I think my answer is that (a) that'd be way too much work to duplicate (since I couldn't just use tskit.write_vcf), and (b) I think it'd be nice to have an alternative for how multiallelic states are handled. SLiM for non-nucleotide models represents all mutations as A->T, with one line per mutation, and you say that there are only four nucleotides so you can represent at most only four states; but I think that's not right - you're allowed multi-nucleotide alleles in VCF (e.g., A,CCA,TTTTT in the REF column). Here, I'm generating unique alleles by using multicharacter strings. I think my goal here is to produce a legal VCF file (so, without duplicated POS rows) with as much information as possible (without dropping multiallelic sites), and in a relatively simple way.

@petrelharp
Copy link
Contributor Author

Ok, revised proposal:

convert_alleles(ts):
    ```
    Copies the nucleotide state of each SLiM mutation to the derived state, and the reference sequence to the ancestral state.
    Mutations with no nucleotide (i.e., -1 in metadata) will have a derived state of `"."` (i.e., "missing") and tree sequences without
    a reference sequence will have ancestral states of `"."`.
    ```

generate_nucleotides(ts, reference_sequence=None, keep=True):
    ```
    Generates random nucleotides, by choosing a uniformly random nucleotide that differs from the parental state,
    and (if `ancestral_sequence` is True) a random reference sequence. Nucleotide state is stored in mutation metadata;
    this method will modify the metadata and keep everything else the same. If `keep=True`, then mutations that already have
    a nucleotide state (i.e., not set to -1 in metadata) will retain their state.
    ```

@bhaller
Copy link
Collaborator

bhaller commented Jun 11, 2021

why not follow the same policy that SLiM follows, for consistency?

Good idea. But after reading the policy careful, I think my answer is that (a) that'd be way too much work to duplicate (since I couldn't just use tskit.write_vcf), and (b) I think it'd be nice to have an alternative for how multiallelic states are handled. SLiM for non-nucleotide models represents all mutations as A->T, with one line per mutation, and you say that there are only four nucleotides so you can represent at most only four states; but I think that's not right - you're allowed multi-nucleotide alleles in VCF (e.g., A,CCA,TTTTT in the REF column). Here, I'm generating unique alleles by using multicharacter strings. I think my goal here is to produce a legal VCF file (so, without duplicated POS rows) with as much information as possible (without dropping multiallelic sites), and in a relatively simple way.

OK, that's reasonable. I think it might turn out to be pretty weird, for many users, that (conceptually) point mutations get represented as indels in the VCF. I can see that maybe raising issues with downstream analysis tools, which is usually what people want VCF for in the first place. But maybe not any weirder than how SLiM handles them, and as you say, at least it's fully VCF-compliant. And people can import back into SLiM and write VCF from there if they really want that style of output.

Re: your revised proposal. Hmm, "uniformly random nucleotide that differs from the parental state, and (if ancestral_sequence is True) a random reference sequence". If there are multiple segregating mutations at a single site, is there any guarantee that those mutations will receive different random nucleotides? Is that a concern?

@petrelharp
Copy link
Contributor Author

Sorry, I didn't make it clear - this revised proposal ditches the idea of using multi-nucleotides to ensure distinct alleles. I could provide that as an example somewhere, instead. In the current proposal, generate_nucleotides serves two purposes: (a) getting a tree sequence ready to load into SLiM for a nucleotide model, and (b) letting users output vcf files from any tree sequences. More specific requirements for VCF output (eg having all alleles distinct) could be done ad-hoc by users (and I'd provide an example).

@bhaller
Copy link
Collaborator

bhaller commented Jun 11, 2021

Sorry, I didn't make it clear - this revised proposal ditches the idea of using multi-nucleotides to ensure distinct alleles. I could provide that as an example somewhere, instead.

Ah, OK. I think that's for the best; creating fictional indels just for VCF output seems rather extreme.

In the current proposal, generate_nucleotides serves two purposes: (a) getting a tree sequence ready to load into SLiM for a nucleotide model, and (b) letting users output vcf files from any tree sequences. More specific requirements for VCF output (eg having all alleles distinct) could be done ad-hoc by users (and I'd provide an example).

I see, OK. Makes sense as far as I can see, but I can tell you that people who use VCF files for their analysis have very specific/fussy requirements, so you might want to run these plans past an empirical biologist or two for a sanity check. :->

@codecov-commenter
Copy link

codecov-commenter commented Jun 11, 2021

Codecov Report

Merging #174 (806dd4e) into main (198949f) will increase coverage by 0.60%.
The diff coverage is 96.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #174      +/-   ##
==========================================
+ Coverage   87.15%   87.76%   +0.60%     
==========================================
  Files           7        7              
  Lines         919      981      +62     
  Branches      169      187      +18     
==========================================
+ Hits          801      861      +60     
- Misses         87       88       +1     
- Partials       31       32       +1     
Impacted Files Coverage Δ
pyslim/methods.py 94.50% <96.77%> (+4.85%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 198949f...806dd4e. Read the comment docs.

@petrelharp
Copy link
Contributor Author

Well, this turned out to be unavoidably complicated, since substitutions will manifest as stacked mutations, and for both of these methods we need to figure out which SLiM mutation was the 'most recent' one, which is not straightforward. I'm deciding the most recent one is the one that

  1. has the largest slim_time
  2. is not present in the list of SLiM mutations for the parent mutation, if any
  3. has the largest SLiM ID, if the other two fail to distinguish.

This is implemented here and tested here. It's not pretty.

@bhaller, mind having a look at whether the API and documentation makes sense (so, here and here) and whether what I said above seems right to you? I suppose I could also be testing for agreement with what SLiM thinks is the nucleotides, by loading things into SLiM, but I have not.

@bhaller
Copy link
Collaborator

bhaller commented Jun 13, 2021

Yeah, this is a pain for sure. It would be nice if SLiM put the derived state in order, but the reason it doesn't is that it would take time, and thus slow down every simulation, and most simulations don't care about this; so it would slow down the many for a benefit to the few or the one, which would contradict Spock's dictum. So. I'll have a look at this tomorrow.

@bhaller
Copy link
Collaborator

bhaller commented Jun 14, 2021

Hey. OK, well, I looked at the implementation and test code, but it is beyond my rudimentary skills in Python; I really can't say whether what it's doing looks correct or not. I'm just not a Python-fluent programmer, sorry. Maybe some day, but since I use it quite infrequently, I've even forgotten most of the Python that I learned in the past.

I looked at the doc links you provided. The doc for convert_alleles is very clear, seems good. The one thing that gave me pause was this statement: "This method will produce an error... if any mutations do not have nucleotides". Hmm. It is legal and valid to do a nucleotide-based model that contains some non-nucleotide-based mutations; this can be useful, for example, if you want to use marker mutations to designate which genomes contain an inversion at a particular position, for example. So, is there something better than throwing an error that this function could do when confronted by such tree sequences?

Also, a nit: your use of the word "insert" doesn't seem quite right, as "insert" means, to me, "add a new thing somewhere internally, while leaving what was there before unchanged"; you seem to mean "replace", not "insert", especially in "will instead insert the nucleotide from the mutation's metadata".

Another nit: "this method tries to assign" should start with a capitol letter.

Should reference_sequence be a string of letters, "ACGTACGT", or a string of integers, "01230123"? I would have assumed letters, but then you discuss integer nucleotide values in a couple spots, so then I start to wonder. Also: should the code check that reference_sequence is actually well-formed, containing what it is expected to contain, beyond just its length? What happens if someone passes "Thequickbrownfoxjumpsoverthelazydog" for a reference sequence? :->

That's all that I see.

@petrelharp
Copy link
Contributor Author

petrelharp commented Jun 14, 2021

Perfect, thanks! And, sorry the python is so impenetrable.

It is legal and valid to do a nucleotide-based model that contains some non-nucleotide-based mutations; this can be useful, for example, if you want to use marker mutations to designate which genomes contain an inversion at a particular position, for example. So, is there something better than throwing an error that this function could do when confronted by such tree sequences?

Well, there might be something better; but I'm trying to keep things reasonly simple; so in this case I would recommend calling generate_nucleotides first (with keep=True, the default); this will insert nucleotides for the non-nucleotide mutations. I cannot think of a better general-purpose default; for more complex things the user will need to do it themselves.

TODO:

  • your use of the word "insert" doesn't seem quite right, as "insert" means, to me, "add a new thing somewhere internally, while leaving what was there before unchanged"; you seem to mean "replace", not "insert", especially in "will instead insert the nucleotide from the mutation's metadata".
  • "this method tries to assign" should start with a capitol letter.
  • clarify that reference_sequence should be a string of nucleotides (not digits).
  • suggest using generate_nucleotides in convert_alleles besides just saying not having nucs produces an error
  • add consistency check on reference sequences

And, good thought, but I don't think it's the job of this function to go checking all possible aspects of the reference sequence.

@bhaller
Copy link
Collaborator

bhaller commented Jun 14, 2021

Perfect, thanks! And, sorry the python is so impenetrable.

Just not a language I'm fluent in, I'm sure there's nothing wrong with the code. :->

It is legal and valid to do a nucleotide-based model that contains some non-nucleotide-based mutations; this can be useful, for example, if you want to use marker mutations to designate which genomes contain an inversion at a particular position, for example. So, is there something better than throwing an error that this function could do when confronted by such tree sequences?

Well, there might be something better; but I'm trying to keep things reasonly simple; so in this case I would recommend calling generate_nucleotides first (with keep=True, the default); this will insert nucleotides for the non-nucleotide mutations. I cannot think of a better general-purpose default; for more complex things the user will need to do it themselves.

OK. Perhaps the doc could suggest that workflow, rather than just saying that it produces an error?

And, good thought, but I don't think it's the job of this function to go checking all possible aspects of the reference sequence.

OK. In general I'm in favor of sanity-checking inputs for user-visible APIs whenever there is not a performance reason not to, so that the user gets a good error message instead of just mysteriously/confusingly wrong behavior, but it certainly does add more work. Of course my example was a bit flip, but it might catch a legit typo in a hand-typed string like "ACGTAVGT", or a confused user doing "01230123", or FASTA data that contains extended nucleotide codes like U, R, Y, etc. (https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation). I have had users try to input such codes into SLiM for an ancestral sequence. :->

@petrelharp
Copy link
Contributor Author

Compelling. Added these to the TODOs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

provide method to remap alleles (eg for VCF output) output VCF with legal REF/ALT columns
4 participants