Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add samples and sites options to ts.genotype_matrix? #678

Closed
hyanwong opened this issue Jun 13, 2020 · 12 comments
Closed

Add samples and sites options to ts.genotype_matrix? #678

hyanwong opened this issue Jun 13, 2020 · 12 comments
Assignees
Labels
enhancement New feature or request Python API Issue is about the Python API
Milestone

Comments

@hyanwong
Copy link
Member

As noted in the docs, the genotype_matrix method can be very memory intensive. As we get to larger datasets, I suspect users might want to only extract portions of the matrix, e.g. certain samples, or certain regions. For example one extreme would be to export the haplotype for a single individual. The obvious way to do this is to run simplify to reduce the number of samples, and keep_intervals to reduce the number of sites, but this isn't likely to be obvious to the average user. Should we provide 2 parameters to the genotype_matrix method itself which, if not None, run these two preprocessing steps to reduce the output before returning the genotype matrix?

@jeromekelleher
Copy link
Member

I think there's a samples argument to variants which we could also easily add to genotype matrix. That would be an easy PR, I think. Going through simplify etc seems excessive.

Sites is a bit less obvious, but I guess list of site IDs to output would be fine. For now, we could just skip any sites that are not in this list as we're iterating over the variants.

But, ah - this is done in C at the moment IIRC. That's probably pointless, to be honest, and we should probably just implement the genotype_matrix method in Python.

@mufernando
Copy link
Member

mufernando commented Jun 24, 2020

I'm interested in this, so might have a go at it soon.

@mufernando mufernando self-assigned this Jun 24, 2020
@jeromekelleher
Copy link
Member

I'm interested in this, so might have a go at it soon.

Great, thanks @mufernando! The first thing to do is to figure out if a Python version of the current genotype_matrix function performs much worse than what we have now. To do this, create a large tree sequence (mininum 100K samples) and do some timing checks.

If we don't lose any real perf by doing it in python, then we can get rid of the current Python-C function, and our lives will be much easier.

@benjeffery benjeffery added C API Issue is about the C API enhancement New feature or request Python API Issue is about the Python API labels Sep 29, 2020
@jeromekelleher
Copy link
Member

To really do this efficiently for sites we'd need a method for efficiently seeking to particular trees, so I'm putting this in the random access project.

@petrelharp
Copy link
Contributor

In removing SlimTreeSequence over in pyslim, I've run in to the ts.mutations_at( ) method I provided there, that provides more or less this functionality. It'd be nice to point to a replacement in tskit. What's the proposal for this API? Is it a method of Variant? I'm not sure if anyone is using it, so perhaps I won't actually do it right away (maybe, though?), but it'd at least be nice to have a concrete path forward worked out, and perhaps I can re-use my testing code from pyslim at least.

@petrelharp
Copy link
Contributor

Edit: actually, this method is used in an example recipe in the SLiM manual ("18.13 - Tree-sequence recording and nucleotide-based models") to count up the trinucleotide mutation spectrum (ie the contexts in which mutations occurred), so I will be including the functionality somehow.

@benjeffery
Copy link
Member

We have #2176 planned (soon) to bring the new C variant methods up to Python. After this, it will be pretty easy to make convenient methods on top. What's your timescale on needing this?

@petrelharp
Copy link
Contributor

Ah, awesome. I can just move over my python code, so I don't need it at any particular time. Which sort of 'soon' is this one, do you think? That'd help me figure out how to deal with it in the short term.

@benjeffery
Copy link
Member

I would hope before the paper release! So a couple of months?

@benjeffery benjeffery added this to the Python 0.5.1 (paper) milestone May 24, 2022
@szhan szhan self-assigned this Aug 16, 2022
@szhan szhan removed the C API Issue is about the C API label Aug 16, 2022
@szhan
Copy link
Member

szhan commented Aug 16, 2022

@benjeffery has suggested doing the following tasks (as separate PRs) to solve this issue.

0 - Add ts.genotype_matrix to performance benchmark - PR 1
1 - Change current implementation to python similar to ts.variants - no changes to tests - PR 2
1.5 - Check for perf regression
2 - Add sites argument to this function and tests for it - PR 3
3 - Add samples argument and tests - PR 4

@hyanwong
Copy link
Member Author

Samples argument is already done, and can be ticked off.

@szhan
Copy link
Member

szhan commented Sep 16, 2022

As agreed in #2494 we will revisit adding sites argument at some time.

@szhan szhan closed this as completed Sep 16, 2022
Random access for trees automation moved this from To do to Done Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Python API Issue is about the Python API
Development

No branches or pull requests

6 participants