
Add map_deletions_to_ts as Dataset method #429

Merged
jeromekelleher merged 1 commit into tskit-dev:main from hyanwong:remap
Feb 5, 2025

Conversation

@hyanwong (Member) commented Dec 3, 2024

This seemed like the neatest API for mapping deletions, as we require a Dataset object and can add methods to it easily enough. For example:

```python
ds = sc2ts.Dataset("../data/viridian_2024-04-29.alpha1.zarr.zip")
ts = tszip.load("../data/find_problematic_v2-2023-01-01.ts.tsz")

start = 11284
end = 11302
del_ts = ds.map_deletions_to_ts(ts, start, end)
```

I have coded it so that the first sample can be Wuhan, or not. I have also put in a stub for a test, but I'm not sure how to actually test it, as I don't know whether we have a test tree sequence equivalent of fx_dataset with the same named samples.

@jeromekelleher (Member)

Nice, thanks @hyanwong.

I think we'd probably arrange the API a bit differently, in that what we'll probably want to do is remap all deletions that pass a specific frequency threshold, so we'll likely pass in a list of site IDs rather than one range. I'd also like to add some metadata so we can track these mutations more easily in analysis.

Leave it with me and I'll rejig and write tests when I get a chance.

@hyanwong (Member, Author) commented Dec 4, 2024

Great. Re ranges: given the issues I just had with alignments (and after chatting to Isobel), it does seem worth including a short flanking region too; i.e. we shouldn't assume the site positions/IDs are completely accurate with respect to deletions.

I was pleasantly surprised that when I remapped the range 11280 to 11305, counting only "significant" mutations affecting more than 50 samples, the only region with deletions was 11283-11296, as discussed in jeromekelleher/sc2ts-paper#249 (comment). That implies that significant deletions might actually be quite rare, and we might be able to pass in all the sites as a first approximation, then narrow down to only those with significant deletions.

@jeromekelleher (Member)

My guess is that if we do something like including only sites with > 10% frequency of deletions (or something) we'll get a very good approx. We track this in the site QC of the ARG:

[Screenshot from 2024-12-04 14-34-19: site QC of the ARG]
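The frequency-threshold filter described above could be sketched as follows with a toy genotype matrix. The encoding of deletions as -1 and the 10% cutoff are assumptions for illustration; in practice the frequencies would come from the Dataset / site QC rather than an in-memory array.

```python
# Sketch: select candidate sites whose deletion frequency exceeds 10%.
import numpy as np

# rows = sites, columns = samples; -1 marks a deletion call (assumed encoding)
genotypes = np.array([
    [0, 0, 0, 0],    # no deletions
    [-1, -1, 0, 0],  # 50% deleted
    [0, -1, 0, 0],   # 25% deleted
    [0, 0, 0, 0],
])
del_freq = (genotypes == -1).mean(axis=1)
candidate_sites = np.flatnonzero(del_freq > 0.10)
print(candidate_sites)  # indices of sites above the threshold
```

The resulting site indices are exactly the kind of list that the reworked API discussed earlier could accept in place of a single start/end range.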

@jeromekelleher jeromekelleher merged commit 3158f29 into tskit-dev:main Feb 5, 2025
3 checks passed