Skip to content

Conversation

@jeromekelleher
Copy link
Member

This adds the ability to specify a fixed mapping for allelic values to genotypes. The motivation for this is that it can be quite annoying to have a different allele mapping for every site (especially when doing downstream things like haplotype matching). Allowing the user to specify a fixed mapping for this doesn't cover every possible situation, but it does make a very common one (fixed alphabet of ACGT) much simpler.

@codecov
Copy link

codecov bot commented Nov 29, 2019

Codecov Report

Merging #430 into master will increase coverage by 0.06%.
The diff coverage is 92.36%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #430      +/-   ##
=========================================
+ Coverage   86.74%   86.8%   +0.06%     
=========================================
  Files          20      20              
  Lines       14136   14232      +96     
  Branches     2748    2774      +26     
=========================================
+ Hits        12262   12354      +92     
- Misses        963     964       +1     
- Partials      911     914       +3
Flag Coverage Δ
#c_tests 87.88% <94.5%> (+0.02%) ⬆️
#python_c_tests 90.64% <88.63%> (+0.06%) ⬆️
#python_tests 99.24% <100%> (ø) ⬆️
Impacted Files Coverage Δ
python/tskit/trees.py 98.75% <100%> (ø) ⬆️
c/tskit/core.c 83.64% <100%> (+0.12%) ⬆️
python/tskit/__init__.py 100% <100%> (ø) ⬆️
python/_tskitmodule.c 83.8% <87.5%> (+0.18%) ⬆️
c/tskit/genotypes.c 91.43% <93.82%> (-0.08%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be76275...f0465e8. Read the comment docs.

@petrelharp
Copy link
Contributor

Very handy - I agree. I've gone through and commented.

@jeromekelleher jeromekelleher force-pushed the specify-fixed-alleles branch 2 times, most recently from 0ddcc28 to bf0ee5c Compare December 14, 2019 16:08
@jeromekelleher jeromekelleher changed the title Specify fixed alleles User specified alleles Dec 14, 2019
@jeromekelleher
Copy link
Member Author

Thanks for the comments @petrelharp and @hyanwong. Good catch on the fixed_alleles thing, that was a terrible choice of name! I've changed to "user alleles" throughout: better?

@hyanwong
Copy link
Member

hyanwong commented Dec 16, 2019

Wow, this is a lot of work, and a very useful addition. There's quite a lot in here that I don't quite understand, but I've done my best to make useful comments.

A few other points:

  1. As I understand it, it would be possible to output a fixed array for all sites (e.g. SNPs, indels, etc), as long as we listed all the possible variants at every site in the alleles string, is that right? E.g. the alleles string could be ['A', 'C', 'G', 'T', '', 'AT', 'TA' ...]. Obviously that would get tedious for a large number of different variants, but this is a possibility, right? So would it be useful to have a function that uniqu-ifies both the ancestral_states stored in the sites table, and the derived_states store in the mutations table, and returns a list or set? That could then, if small enough, be passed as an alleles list, so that even slightly more complex variant patterns could be represented as a fixed list?

  2. Will we need a way to get the 1D numpy array of integer encodings of the ancestral states over all sites, now that we can't assume it's all zeros?

@jeromekelleher
Copy link
Member Author

As I understand it, it would be possible to output a fixed array for all sites (e.g. SNPs, indels, etc), as long as we listed all the possible variants at every site in the alleles string, is that right? E.g. the alleles string could be ['A', 'C', 'G', 'T', '', 'AT', 'TA' ...]. Obviously that would get tedious for a large number of different variants, but this is a possibility, right? So would it be useful to have a function that uniqu-ifies both the ancestral_states stored in the sites table, and the derived_states store in the mutations table, and returns a list or set? That could then, if small enough, be passed as an alleles list, so that even slightly more complex variant patterns could be represented as a fixed list?

Sure, nothing stopping a user doing this (so long as there's less than 128 of them). I don't think we need functions for it right now though.

Will we need a way to get the 1D numpy array of integer encodings of the ancestral states over all sites, now that we can't assume it's all zeros?

You can still get the ancestral state from the site object from variants, so getting the ancestral state index is just alleles.index(site.ancestral_state). Easy to reproduce, so I don't think we need API additions for this unless someone asks for it.

@petrelharp
Copy link
Contributor

Looks good - "user_alleles" is ok, but I like it how the python interface just uses "alleles".

@jeromekelleher
Copy link
Member Author

Looks good - "user_alleles" is ok, but I like it how the python interface just uses "alleles".

OK, cool. Will think about it again in the new year.

@jeromekelleher
Copy link
Member Author

Merging this --- user_alleles is only used internally and isn't part of the external C API @petrelharp. This is necessary for memory management and only used to distinguish the case where a user specifies alleles and does not.

@jeromekelleher jeromekelleher merged commit bb8f2d2 into tskit-dev:master Jan 10, 2020
@jeromekelleher jeromekelleher deleted the specify-fixed-alleles branch January 10, 2020 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants