Skip to content

Conversation

@hyanwong
Copy link
Member

@hyanwong hyanwong commented Nov 27, 2019

There was discussion of this in the following PR: #426. @jeromekelleher is keen not to open the multi-letter-allele can of worms, unless there is demand from users. @hyanwong thinks that (in the case of haplotype output) the semantics are reasonably clear, and being able to output haplotype strings containing indels will be needed by users quite soon.

Some of this is probably only relevant once finite site tree sequences have been implemented. But (IMO) if we think it is reasonable to concatenate widely spaced SNPs into a string of letters, it is also reasonable to concatenate multi-letter haplotypes from an infinite-sites TS, as long as each allele takes the same number of characters. This is what this PR does.

Also from that PR, a quick summary - do we think it useful to be able to output small indels as aligned haplotypes, e.g.

Sample1: .A.T.---.G.T
Sample2: .A.C.---.G.A
Sample3: .A.T.ATT.G.A

Note that the current code does not output the . between sites (an optional convention to represents non-variable regions between variants) - I put them there for clarity, and also because I think it might be a useful addition to address #353 (comment)

@hyanwong
Copy link
Member Author

p.s. I'm also opening this as a PR because I don't want to lose the work I've done on this, especially the fast function coded herein (sitewise_allele_max_bytes()) which speedily calculates the maximum byte length of all the alleles at a site, for all sites. I think this might be generally useful however we chose to deal with multi-letter alleles.

@codecov
Copy link

codecov bot commented Nov 27, 2019

Codecov Report

Merging #427 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #427      +/-   ##
==========================================
+ Coverage   86.73%   86.75%   +0.02%     
==========================================
  Files          20       20              
  Lines       14115    14137      +22     
  Branches     2745     2747       +2     
==========================================
+ Hits        12242    12264      +22     
  Misses        963      963              
  Partials      910      910
Flag Coverage Δ
#c_tests 87.87% <100%> (+0.02%) ⬆️
#python_c_tests 90.59% <100%> (+0.03%) ⬆️
#python_tests 99.24% <100%> (ø) ⬆️
Impacted Files Coverage Δ
python/tskit/trees.py 98.76% <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7950b25...12e2407. Read the comment docs.

@hyanwong
Copy link
Member Author

hyanwong commented Dec 4, 2019

Temporarily closing due to lack of bandwidth for dealing with finite sites / multiletter alleles. May be worth revisiting later

@hyanwong hyanwong closed this Dec 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant