Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add genotype matrix command line option #1080

Closed
petrelharp opened this issue Dec 9, 2020 · 27 comments
Closed

add genotype matrix command line option #1080

petrelharp opened this issue Dec 9, 2020 · 27 comments

Comments

@petrelharp
Copy link
Contributor

This came up in the stdpopsim workshop just now: it'd be nice to have a tskit genotype_matrix command, like the tskit vcf command. I'm hoping others will say what exactly the output should look like?
@janaobsteter ? @gphocs-dev ?

@janaobsteter
Copy link

Thanks for raising this issue! It would be super useful to have a command line to spit out either a haplotype or a genotype matrix. The format we usually need this data in is quite simple:

  • genotypes: each line is one individual and the genotypes are coded as 0, 1 or 2 (the standard SNP chip output)
  • haplotypes: the two subsequent rows are one individual and the haplotypes are coded as 0 or 1

Usually it's also useful to have the first column or row names with individuals IDs (with haplotypes the rows can be named ID_1, ID_2 or a/b or something similar)

@petrelharp
Copy link
Contributor Author

petrelharp commented Dec 10, 2020

Tab-separated? Is there a standard we should copy (maybe one of plink's)?

Here's a quick example:

import msprime, tskit
ts = msprime.simulate(6, mutation_rate=0.01, length=100, recombination_rate=0.01)
g = ts.genotype_matrix() 
for k in range(int(ts.num_samples / 2)):
  print(f"tsk_{k}\t" + "\t".join(map(str, g[:, 2*k] + g[:, 2*k + 1])))

producing

tsk_0	0	0	0	2	0	2	0	1	0	0	2
tsk_1	0	0	0	2	0	2	0	0	1	0	2
tsk_2	1	1	1	0	2	0	1	0	0	1	1

@janaobsteter
Copy link

janaobsteter commented Dec 10, 2020

PLINK .ped uses space as the separator but also has some additional columns (Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype). My intention is not really to use the genotypes in PLINK, so it doesn't really matter what the separator is - as long as it can be read back in, it's fine 👍 But if someone has an intention to use it in PLINK, maybe a standard format would be better.

@igronau
Copy link

igronau commented Dec 10, 2020

One thing that I am missing in the current version of tskit is an actual sequence output that includes non-variant sites. The sequence could still be binary (or trinary).

@benjeffery
Copy link
Member

Thanks for the issue!
@janaobsteter What are you reading back into? It is good to get an idea of the use cases here to check we are doing the right thing. I think adding standard format outputs to tskit is a good thing, but I'm not so sure about adding non-standard things that are one/two liners unless the use case is common and compelling as we would be adding yet another text format into the ecosystem!

@igronau For the full sequence we would need #146 to be done also.

@janaobsteter
Copy link

@benjeffery , for example, I am reading the haplotypes into R, to use them for further simulation of breeding programs (with AlphaSimR). For that I just to read them in and then feed them as vectors - so for this purpose it doesn't really matter. Even just writing out the ts.genotpe_matrix() would be helpful for that purpose. But I do agree it is easier to work with standard formats - I just don't know what that is - expect for PLINK.

@jeromekelleher
Copy link
Member

That's interesting, thanks for explaining @janaobsteter. If interoperability with R is what we're looking for then I wonder if text is the right approach - is there some binary matrix format that could be used? Or, would it be better to recommend using reticulate, and cut out the format conversion entirely?

This is probably something we want to have a look at at some point.

@jeromekelleher
Copy link
Member

@hyanwong - I think you have some experience using reticulate with tskit - do you have any advice here on how best to interchange genotype data?

@hyanwong
Copy link
Member

hyanwong commented Dec 11, 2020

I've only tried using reticulate with tskit once, but ISTR it was reasonably easy. Mark Ravinet probably has more experience with it. I thought we might have written a short example somewhere - perhaps we should have it as a tskit vignette?

@benjeffery
Copy link
Member

R interoperability may be worth promoting to a top-level section in the docs, if we see lots of demand for it.

@jeromekelleher
Copy link
Member

Seems like it's worth a tutorial - should be possible using jupyter-sphinx. Basically we'd want to show

  • Reading a tskit file
  • Exporting the genotype/haplotype data to an R dataframe (or whatever's most appropriate)
  • Exporting trees as newick and loading into some phylo library.

@hyanwong
Copy link
Member

hyanwong commented Dec 11, 2020

I could have a go at this after Xmas, if you like

Edit - here's my play with reticulate previously: #465 (comment)

@jeromekelleher
Copy link
Member

Thanks @hyanwong - I've created tskit-dev/tutorials#11 to track.

@petrelharp
Copy link
Contributor Author

I think output to plink format would be the way to go here.

@jeromekelleher
Copy link
Member

Some version of the plink text format would be useful; I've created an issue to track #1086

@jeromekelleher
Copy link
Member

Is this OK with you @janaobsteter? We don't really want to add yet another text file format to the already far too many that are out there!

@igronau
Copy link

igronau commented Dec 11, 2020

Silly question (that I should probably know the answer to) - does tskit support output of FASTA sequences?

@hyanwong
Copy link
Member

Yes, but it's not documented and not quite finished, I think: #353

It would be nice to finish this off.

@hyanwong
Copy link
Member

P.s. @igronau - if you have any comments on my suggestion in #353 (comment) please add them. This is outside my area really, so input from people who really use these formats would be very helpful.

@igronau
Copy link

igronau commented Dec 11, 2020

Thanks @hyanwong. Looks like you have it figured in #353. The main thing that I think tha tpeople care about is having some padding between variant sites. It's easy to replace these padded bases by anything you wish later on. The real pain is having to figure out the padding for yourself given the haplotype output of tskit.

@hyanwong
Copy link
Member

It's very useful to know that the padding is expected. I would prefer to use . for the padding (which could stand for any length) so that we reserve - for missing data.

@janaobsteter
Copy link

Is this OK with you @janaobsteter? We don't really want to add yet another text file format to the already far too many that are out there!

Yes, that would be perfect! I know reticulate is an option too, but sometimes it's useful also to have the command line option.

@igronau
Copy link

igronau commented Dec 12, 2020

It's very useful to know that the padding is expected. I would prefer to use . for the padding (which could stand for any length) so that we reserve - for missing data.

Just to make sure. The character '.' is used to specify "invariant base", and you use one character per sequence position? That would be the preferred option, in my view.

@hyanwong
Copy link
Member

It's very useful to know that the padding is expected. I would prefer to use . for the padding (which could stand for any length) so that we reserve - for missing data.

Just to make sure. The character '.' is used to specify "invariant base", and you use one character per sequence position? That would be the preferred option, in my view.

I hadn't seen it used for that. We can't use one character per base because sites may not be at integer positions. So every character would have a single dot between it, in my suggested output. I'm open to using other characters though - please do suggest improvements @igronau, but bear in mind that the site positions may be weird to people coming from a real data background, e.g. we could have positions [10, 11, 12, 12.22, 12.5, 12.9, 13.1, ...] or any random set of non-integer positions (indeed, often just numbers between 0 and 1).

@petrelharp
Copy link
Contributor Author

Mind if we move the discussion of fasta output to #353?

@jeromekelleher
Copy link
Member

I'm going to close this as I think the issues are being dealt with elsewhere - the main issue is to implement the reference sequence (#146, and related issues in https://github.com/tskit-dev/tskit/projects/11)

@jeromekelleher
Copy link
Member

(Please do reopen if there's something we're not covering with fasta output, etc, like #1889)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants