Add map_deletions_to_ts as Dataset method #429
Conversation
Nice, thanks @hyanwong. I think we'd probably arrange the API a bit differently, in that what we'll probably want to do is remap all deletions that pass a specific frequency threshold, so we'll likely pass in a list of site IDs rather than one range. I'd also like to add some metadata so we can track these mutations more easily in analysis. Leave it with me and I'll rejig and write tests when I get a chance.
Great. Re ranges: given the issues I just had with alignments (and after chatting to Isobel), it does seem worth including a short flanking region too, i.e. we shouldn't assume the site positions/IDs are completely accurate w.r.t. deletions. I was pleasantly surprised that when I remapped the range 11280 to 11305, counting only "significant" mutations that lead to more than 50 samples, the only regions with deletions were 11283-11296, as discussed in jeromekelleher/sc2ts-paper#249 (comment). That implies that significant deletions might actually be quite rare, and we might be able to pass in all the sites as a first approximation, then narrow down to only those with significant deletions.
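To make the site-selection idea concrete, here's a rough sketch of the step described above: pad a genomic range with a small flanking margin (since positions/IDs may be slightly off for deletions), then keep only sites where a deletion is carried by more than a threshold number of samples. All names here are hypothetical illustrations, not the actual sc2ts API.

```python
def candidate_deletion_sites(deletion_counts, start, stop, flank=5, min_samples=50):
    """Select candidate sites for deletion remapping.

    deletion_counts: dict mapping genomic position -> number of samples
    carrying a deletion at that position (hypothetical input structure).
    Returns sorted positions within [start - flank, stop + flank] whose
    deletion count exceeds min_samples.
    """
    lo, hi = start - flank, stop + flank
    return sorted(
        pos
        for pos, n in deletion_counts.items()
        if lo <= pos <= hi and n > min_samples
    )


# e.g. with counts at three positions and the 11280-11296 range above:
sites = candidate_deletion_sites(
    {11283: 120, 11290: 80, 11300: 3}, 11280, 11296, flank=5, min_samples=50
)
# -> [11283, 11290]; 11300 falls inside the flanked range but is below threshold
```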

This seemed like the neatest API for mapping deletions: we already require a Dataset object, so it's easy enough to add methods to it.
I have coded it so that the first sample could be Wuhan, or not. I've also put in a stub for a test, but I'm not sure how to actually test it, as I don't know whether we have a test tree sequence equivalent of fx_dataset with the same named samples.
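For illustration, here's a minimal sketch of what a `Dataset.map_deletions_to_ts` method along these lines might look like: take a list of site IDs, find the samples carrying a deletion allele at each, apply the frequency threshold discussed above, and attach metadata so the remapped mutations are easy to track in analysis. Only the method name comes from this PR; the `Site` container, the `"-"` deletion convention, and all internals are assumptions for the sketch, not the real sc2ts/tskit implementation.

```python
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical per-site record: an ID, a genomic position, and the
    # allele observed in each named sample ("-" marks a deletion).
    site_id: int
    position: int
    alleles: dict


@dataclass
class Dataset:
    sites: list

    def map_deletions_to_ts(self, site_ids, min_samples=1):
        """For each requested site, collect the samples carrying a deletion,
        keeping only sites with at least `min_samples` deleted samples
        (the frequency-threshold idea from the discussion above)."""
        mapped = {}
        for site in self.sites:
            if site.site_id not in site_ids:
                continue
            deleted = [s for s, a in site.alleles.items() if a == "-"]
            if len(deleted) >= min_samples:
                mapped[site.site_id] = {
                    "position": site.position,
                    "samples": deleted,
                    # Metadata stub so these mutations can be tracked later.
                    "metadata": {"sc2ts": {"type": "deletion_remap"}},
                }
        return mapped
```

Passing a list of site IDs (rather than one range) keeps the caller in control of site selection, which fits the plan of first passing in all sites and then narrowing to those with significant deletions.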