add genotype matrix command line option #1080
Comments
Thanks for raising this issue! It would be super useful to have a command line to spit out either a haplotype or a genotype matrix. The format we usually need this data in is quite simple:
Usually it's also useful to have the first column or the row names hold the individual IDs (with haplotypes, the rows can be named ID_1, ID_2 or a/b or something similar) |
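A minimal sketch of the layout described above, assuming one row per haplotype, tab-separated integer alleles, and row names of the form ID_1, ID_2. The function name format_haplotype_matrix and the example data are hypothetical, not part of tskit. In tskit the matrix could come from ts.genotype_matrix(), which returns sites by samples and so would need transposing first.

```python
def format_haplotype_matrix(haplotypes, individual_ids, ploidy=2, sep="\t"):
    """Return text with one row per haplotype, each prefixed by a row name.

    haplotypes: list of rows, each a sequence of integer allele codes.
    individual_ids: one ID per individual. Rows are assumed to be grouped
    by individual, with `ploidy` consecutive rows each.
    """
    if len(haplotypes) != len(individual_ids) * ploidy:
        raise ValueError("expected `ploidy` haplotype rows per individual")
    lines = []
    for row_index, row in enumerate(haplotypes):
        individual = individual_ids[row_index // ploidy]
        copy_number = row_index % ploidy + 1  # 1-based: ID_1, ID_2, ...
        name = f"{individual}_{copy_number}"
        lines.append(sep.join([name] + [str(allele) for allele in row]))
    return "\n".join(lines) + "\n"


# Made-up data: two diploid individuals, four sites.
example = [
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]
text = format_haplotype_matrix(example, ["ind0", "ind1"])
print(text, end="")
```

A file in this shape can be read back with any tab-aware reader, for example read.table(..., row.names = 1) in R.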
Tab-separated? Is there a standard we should copy (maybe one of plink's)? Here's a quick example:
producing
|
PLINK .ped uses space as the separator but also has some additional columns (Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype). My intention is not really to use the genotypes in PLINK, so it doesn't really matter what the separator is - as long as it can be read back in, it's fine 👍 But if someone has an intention to use it in PLINK, maybe a standard format would be better. |
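For comparison, here is a rough sketch of what one PLINK .ped row would look like for a diploid individual, using the six leading columns listed above. The helper name ped_line and the placeholder values are assumptions for illustration. The 0/1 alleles are recoded to 1/2 because PLINK treats 0 as a missing allele by default, and a companion .map file would also be needed for PLINK to read the result.

```python
def ped_line(family_id, individual_id, hap_a, hap_b,
             paternal_id="0", maternal_id="0", sex="0", phenotype="-9"):
    """Build one space-separated .ped row from two 0/1 haplotypes.

    The defaults are PLINK's usual "unknown" codes: parent ID 0 (founder),
    sex 0 (unknown), and phenotype -9 (missing).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("both haplotypes must cover the same sites")
    # Recode 0/1 as 1/2, since PLINK reads "0" as a missing allele.
    recode = {0: "1", 1: "2"}
    fields = [family_id, individual_id, paternal_id, maternal_id, sex, phenotype]
    for allele_a, allele_b in zip(hap_a, hap_b):
        fields.extend([recode[allele_a], recode[allele_b]])
    return " ".join(fields)


# Made-up haplotypes for one individual at three sites.
line = ped_line("fam0", "ind0", [0, 1, 0], [1, 1, 0])
print(line)
```

Note that a .ped row stores one genotype per marker, so the haplotype labels (ID_1, ID_2) are not kept in this format.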
One thing that I am missing in the current version of tskit is an actual sequence output that includes non-variant sites. The sequence could still be binary (or trinary). |
Thanks for the issue! @igronau For the full sequence we would need #146 to be done also. |
@benjeffery, for example, I am reading the haplotypes into R, to use them for further simulation of breeding programs (with AlphaSimR). For that I just need to read them in and then feed them as vectors - so for this purpose it doesn't really matter. Even just writing out the |
That's interesting, thanks for explaining @janaobsteter. If interoperability with R is what we're looking for then I wonder if text is the right approach - is there some binary matrix format that could be used? Or, would it be better to recommend using reticulate, and cut out the format conversion entirely? This is probably something we want to have a look at at some point. |
@hyanwong - I think you have some experience using reticulate with tskit - do you have any advice here on how best to interchange genotype data? |
I've only tried using reticulate with |
R interoperability may be worth promoting to a top-level section in the docs, if we see lots of demand for it. |
Seems like it's worth a tutorial - should be possible using jupyter-sphinx. Basically we'd want to show
|
I could have a go at this after Xmas, if you like. Edit: here's my play with reticulate previously: #465 (comment) |
Thanks @hyanwong - I've created tskit-dev/tutorials#11 to track. |
I think output to plink format would be the way to go here. |
Some version of the plink text format would be useful; I've created an issue to track #1086 |
Is this OK with you @janaobsteter? We don't really want to add yet another text file format to the already far too many that are out there! |
Silly question (that I should probably know the answer to) - does tskit support output of FASTA sequences? |
Yes, but it's not documented and not quite finished, I think: #353 It would be nice to finish this off. |
P.s. @igronau - if you have any comments on my suggestion in #353 (comment) please add them. This is outside my area really, so input from people who really use these formats would be very helpful. |
Thanks @hyanwong. Looks like you have it figured out in #353. The main thing that I think people care about is having some padding between variant sites. It's easy to replace these padded bases with anything you wish later on. The real pain is having to figure out the padding for yourself given the haplotype output of tskit. |
It's very useful to know that the padding is expected. I would prefer to use |
Yes, that would be perfect! I know reticulate is an option too, but sometimes it's useful also to have the command line option. |
Just to make sure. The character '.' is used to specify "invariant base", and you use one character per sequence position? That would be the preferred option, in my view. |
I hadn't seen it used for that. We can't use one character per base because sites may not be at integer positions. So every character would have a single dot between it, in my suggested output. I'm open to using other characters though - please do suggest improvements @igronau, but bear in mind that the site positions may be weird to people coming from a real data background, e.g. we could have positions |
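To make the two readings concrete, here is a small sketch. Both function names and all example values are made up for illustration, and neither is tskit's FASTA output. The first function pads by sequence position, which only works when every site sits at an integer position. The second puts a single dot between consecutive variant sites regardless of position.

```python
def pad_by_position(alleles, positions, sequence_length, pad="."):
    """One character per sequence position; requires integer positions."""
    sequence = [pad] * sequence_length
    for allele, position in zip(alleles, positions):
        if not float(position).is_integer():
            raise ValueError(f"site position {position} is not an integer")
        sequence[int(position)] = allele
    return "".join(sequence)


def dot_separated(alleles, pad="."):
    """A single pad character between consecutive variant sites."""
    return pad.join(alleles)


# Made-up haplotype with three variant sites on a sequence of length 8.
print(pad_by_position("ATG", [1, 4, 6], 8))
print(dot_separated("ATG"))

# Made-up non-integer positions: position-based padding is not defined here.
try:
    pad_by_position("AT", [0.5, 1.2], 8)
    padding_failed = False
except ValueError:
    padding_failed = True
```

The second form loses the spacing between sites, so positions would have to be carried separately if they matter downstream.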
Mind if we move the discussion of fasta output to #353? |
I'm going to close this as I think the issues are being dealt with elsewhere - the main issue is to implement the reference sequence (#146, and related issues in https://github.com/tskit-dev/tskit/projects/11) |
(Please do reopen if there's something we're not covering with fasta output, etc, like #1889) |
This came up in the stdpopsim workshop just now: it'd be nice to have a tskit genotype_matrix command, like the tskit vcf command. I'm hoping others will say what exactly the output should look like? @janaobsteter ? @gphocs-dev ?