add genotype matrix command line option #1080

petrelharp · 2020-12-09T19:03:12Z

This came up in the stdpopsim workshop just now: it'd be nice to have a tskit genotype_matrix command, like the tskit vcf command. I'm hoping others will say what exactly the output should look like?
@janaobsteter ? @gphocs-dev ?

The text was updated successfully, but these errors were encountered:

janaobsteter · 2020-12-09T19:27:49Z

Thanks for raising this issue! It would be super useful to have a command line to spit out either a haplotype or a genotype matrix. The format we usually need this data in is quite simple:

genotypes: each line is one individual and the genotypes are coded as 0, 1 or 2 (the standard SNP chip output)
haplotypes: the two subsequent rows are one individual and the haplotypes are coded as 0 or 1

Usually it's also useful to have the first column or row names with individuals IDs (with haplotypes the rows can be named ID_1, ID_2 or a/b or something similar)

petrelharp · 2020-12-10T05:52:21Z

Tab-separated? Is there a standard we should copy (maybe one of plink's)?

Here's a quick example:

import msprime, tskit
ts = msprime.simulate(6, mutation_rate=0.01, length=100, recombination_rate=0.01)
g = ts.genotype_matrix() 
for k in range(int(ts.num_samples / 2)):
  print(f"tsk_{k}\t" + "\t".join(map(str, g[:, 2*k] + g[:, 2*k + 1])))

producing

tsk_0	0	0	0	2	0	2	0	1	0	0	2
tsk_1	0	0	0	2	0	2	0	0	1	0	2
tsk_2	1	1	1	0	2	0	1	0	0	1	1

janaobsteter · 2020-12-10T06:59:44Z

PLINK .ped uses space as the separator but also has some additional columns (Family ID, Individual ID, Paternal ID, Maternal ID, Sex, Phenotype). My intention is not really to use the genotypes in PLINK, so it doesn't really matter what the separator is - as long as it can be read back in, it's fine 👍 But if someone has an intention to use it in PLINK, maybe a standard format would be better.

igronau · 2020-12-10T07:10:28Z

One thing that I am missing in the current version of tskit is an actual sequence output that includes non-variant sites. The sequence could still be binary (or trinary).

benjeffery · 2020-12-10T10:44:37Z

Thanks for the issue!
@janaobsteter What are you reading back into? It is good to get an idea of the use cases here to check we are doing the right thing. I think adding standard format outputs to tskit is a good thing, but I'm not so sure about adding non-standard things that are one/two liners unless the use case is common and compelling as we would be adding yet another text format into the ecosystem!

@igronau For the full sequence we would need #146 to be done also.

janaobsteter · 2020-12-11T07:04:55Z

@benjeffery , for example, I am reading the haplotypes into R, to use them for further simulation of breeding programs (with AlphaSimR). For that I just to read them in and then feed them as vectors - so for this purpose it doesn't really matter. Even just writing out the ts.genotpe_matrix() would be helpful for that purpose. But I do agree it is easier to work with standard formats - I just don't know what that is - expect for PLINK.

jeromekelleher · 2020-12-11T12:03:29Z

That's interesting, thanks for explaining @janaobsteter. If interoperability with R is what we're looking for then I wonder if text is the right approach - is there some binary matrix format that could be used? Or, would it be better to recommend using reticulate, and cut out the format conversion entirely?

This is probably something we want to have a look at at some point.

jeromekelleher · 2020-12-11T12:04:02Z

@hyanwong - I think you have some experience using reticulate with tskit - do you have any advice here on how best to interchange genotype data?

hyanwong · 2020-12-11T12:14:15Z

I've only tried using reticulate with tskit once, but ISTR it was reasonably easy. Mark Ravinet probably has more experience with it. I thought we might have written a short example somewhere - perhaps we should have it as a tskit vignette?

benjeffery · 2020-12-11T12:35:44Z

R interoperability may be worth promoting to a top-level section in the docs, if we see lots of demand for it.

jeromekelleher · 2020-12-11T14:56:16Z

Seems like it's worth a tutorial - should be possible using jupyter-sphinx. Basically we'd want to show

Reading a tskit file
Exporting the genotype/haplotype data to an R dataframe (or whatever's most appropriate)
Exporting trees as newick and loading into some phylo library.

hyanwong · 2020-12-11T14:57:44Z

I could have a go at this after Xmas, if you like

Edit - here's my play with reticulate previously: #465 (comment)

jeromekelleher · 2020-12-11T15:04:00Z

Thanks @hyanwong - I've created tskit-dev/tutorials#11 to track.

petrelharp · 2020-12-11T15:20:56Z

I think output to plink format would be the way to go here.

jeromekelleher · 2020-12-11T15:36:49Z

Some version of the plink text format would be useful; I've created an issue to track #1086

jeromekelleher · 2020-12-11T15:38:01Z

Is this OK with you @janaobsteter? We don't really want to add yet another text file format to the already far too many that are out there!

igronau · 2020-12-11T16:23:37Z

Silly question (that I should probably know the answer to) - does tskit support output of FASTA sequences?

hyanwong · 2020-12-11T16:28:18Z

Yes, but it's not documented and not quite finished, I think: #353

It would be nice to finish this off.

hyanwong · 2020-12-11T16:31:16Z

P.s. @igronau - if you have any comments on my suggestion in #353 (comment) please add them. This is outside my area really, so input from people who really use these formats would be very helpful.

igronau · 2020-12-11T21:01:36Z

Thanks @hyanwong. Looks like you have it figured in #353. The main thing that I think tha tpeople care about is having some padding between variant sites. It's easy to replace these padded bases by anything you wish later on. The real pain is having to figure out the padding for yourself given the haplotype output of tskit.

hyanwong · 2020-12-11T21:49:26Z

It's very useful to know that the padding is expected. I would prefer to use . for the padding (which could stand for any length) so that we reserve - for missing data.

janaobsteter · 2020-12-12T10:54:45Z

Is this OK with you @janaobsteter? We don't really want to add yet another text file format to the already far too many that are out there!

Yes, that would be perfect! I know reticulate is an option too, but sometimes it's useful also to have the command line option.

igronau · 2020-12-12T15:12:15Z

It's very useful to know that the padding is expected. I would prefer to use . for the padding (which could stand for any length) so that we reserve - for missing data.

Just to make sure. The character '.' is used to specify "invariant base", and you use one character per sequence position? That would be the preferred option, in my view.

hyanwong · 2020-12-12T17:13:03Z

It's very useful to know that the padding is expected. I would prefer to use . for the padding (which could stand for any length) so that we reserve - for missing data.

Just to make sure. The character '.' is used to specify "invariant base", and you use one character per sequence position? That would be the preferred option, in my view.

I hadn't seen it used for that. We can't use one character per base because sites may not be at integer positions. So every character would have a single dot between it, in my suggested output. I'm open to using other characters though - please do suggest improvements @igronau, but bear in mind that the site positions may be weird to people coming from a real data background, e.g. we could have positions [10, 11, 12, 12.22, 12.5, 12.9, 13.1, ...] or any random set of non-integer positions (indeed, often just numbers between 0 and 1).

petrelharp · 2020-12-14T05:30:34Z

Mind if we move the discussion of fasta output to #353?

jeromekelleher · 2021-11-05T17:40:07Z

I'm going to close this as I think the issues are being dealt with elsewhere - the main issue is to implement the reference sequence (#146, and related issues in https://github.com/tskit-dev/tskit/projects/11)

jeromekelleher · 2021-11-05T17:40:58Z

(Please do reopen if there's something we're not covering with fasta output, etc, like #1889)

jeromekelleher mentioned this issue Dec 11, 2020

Tutorial: Using tskit from R tskit-dev/tutorials#11

Closed

petrelharp mentioned this issue Dec 14, 2020

Finalise semantics of fasta output #353

Closed

jeromekelleher closed this as completed Nov 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add genotype matrix command line option #1080

add genotype matrix command line option #1080

petrelharp commented Dec 9, 2020

janaobsteter commented Dec 9, 2020

petrelharp commented Dec 10, 2020 •

edited

janaobsteter commented Dec 10, 2020 •

edited

igronau commented Dec 10, 2020

benjeffery commented Dec 10, 2020

janaobsteter commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

hyanwong commented Dec 11, 2020 •

edited

benjeffery commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

hyanwong commented Dec 11, 2020 •

edited

jeromekelleher commented Dec 11, 2020

petrelharp commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

igronau commented Dec 11, 2020

hyanwong commented Dec 11, 2020

hyanwong commented Dec 11, 2020

igronau commented Dec 11, 2020

hyanwong commented Dec 11, 2020

janaobsteter commented Dec 12, 2020

igronau commented Dec 12, 2020

hyanwong commented Dec 12, 2020

petrelharp commented Dec 14, 2020

jeromekelleher commented Nov 5, 2021

jeromekelleher commented Nov 5, 2021

add genotype matrix command line option #1080

add genotype matrix command line option #1080

Comments

petrelharp commented Dec 9, 2020

janaobsteter commented Dec 9, 2020

petrelharp commented Dec 10, 2020 • edited

janaobsteter commented Dec 10, 2020 • edited

igronau commented Dec 10, 2020

benjeffery commented Dec 10, 2020

janaobsteter commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

hyanwong commented Dec 11, 2020 • edited

benjeffery commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

hyanwong commented Dec 11, 2020 • edited

jeromekelleher commented Dec 11, 2020

petrelharp commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

jeromekelleher commented Dec 11, 2020

igronau commented Dec 11, 2020

hyanwong commented Dec 11, 2020

hyanwong commented Dec 11, 2020

igronau commented Dec 11, 2020

hyanwong commented Dec 11, 2020

janaobsteter commented Dec 12, 2020

igronau commented Dec 12, 2020

hyanwong commented Dec 12, 2020

petrelharp commented Dec 14, 2020

jeromekelleher commented Nov 5, 2021

jeromekelleher commented Nov 5, 2021

petrelharp commented Dec 10, 2020 •

edited

janaobsteter commented Dec 10, 2020 •

edited

hyanwong commented Dec 11, 2020 •

edited

hyanwong commented Dec 11, 2020 •

edited