GitHub - szpiech/asd: Calculate allele sharing distances

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 82 Commits
bin		bin
include/win32		include/win32
lib/win32		lib/win32
src		src
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README		README
test_data_1.stru		test_data_1.stru
test_data_2.stru		test_data_2.stru

Repository files navigation

**Be aware that VCF files must contain GTs only and no other INFO for each sample and variant site otherwise GTs will be read incorrectly. This is a bug that will be fixed eventually. You can ensure proper loading of your vcf file by running: bcftools annotate -x FMT in.vcf -Ov -o out.vcf until this is fixed.

asd is a multi-threaded program to quickly compute pairwise allele sharing distances, GRMs, or frequency weighted allele similarities between diploid individuals with an arbitrary number of alleles per locus (GRMs biallelic only).  asd reads TPED/TFAM, STRU, and VCF formatted files and will load gzipped files without first requiring decompression. If your data are not biallelic, you must use the --multiallelic flag. This flag is not compatible with --grm and is automatically set when using --weighted.

usage

./asd --stru <filename> --id-col <column where individual names are>

or

./asd --tped <filename>.tped --tfam <filename>.tfam

or

./asd --vcf <filename>.vcf

optional flags --partial, --log, --ibs-all, --out, --missing-stru, --missing-tped, --biallelic, --combine, --long, --long-ibs, --grm, --weighted

CHANGE LOG
----------

18 FEB 2020 - Release of v1.1.0a fixes a bug when reading VCF without --multiallelic set.

14 FEB 2020 - Release of v1.1.0 includes support for VCF files and computing the weighted allele similarity score from Greenbaum et al. 2019 Genome Research 29:2020-2033.


asd v1.1.0a
----------Command Line Arguments----------

--combine <string1> ... <stringN>: Combine several files	
	generated with --partial.
	Default: __none

--grm <bool>: Calculate the genomic relationship matrix.
	Default: false

--help <bool>: Prints this help dialog.
	Default: false

--ibs <bool>: Set to output all IBS sharing matrices too.
	Default: false

--id <int>: Column where individual IDs are. (stru files only)
	Default: 1

--log <bool>: Transforms allele sharing distance by -ln(ps) instead of 1-(ps).
	Default: false

--long <bool>: Instead of printing a matrix, print allele sharing distances one	
	per row. Formatted <ID1> <ID2> <allele sharing distance>.	
	Not compatible with --partial.
	Default: false

--long-ibs <bool>: Instead of printing a matrix, print IBS calculations one	
	per row. Formatted <ID1> <ID2> <IBS0/1/2>.	
	Not compatible with --partial.
	Default: false

--missing-stru <string>: For stru files, set the missing data value.
	Default: -9

--missing-tped <string>: For tped files, set the missing data value.
	Default: 0

--multiallelic <bool>: Set if there are more than two alleles at at least one locus.	
	Computations are less efficient with this flag set.
	Default: false

--out <string>: Basename for output files.
	Default: outfile

--partial <bool>: If set, outputs two nind x nind matrices.    
	The first is the number of loci used for each pairwise    
	comparison, and the second is the untransformed dist/IBS	
	matrix. Useful for splitting up very large jobs.	
	Can combine with --combine.
	Default: false

--stru <string>: The input data filename (stru format).	
	Change missing data code with --missing-stru.	
	Requires two header rows: locus names, and locus positions.
	Default: __none

--tfam <string>: The input data filename (tfam format).	
	Requires a .tped file (--tped).
	Default: __none

--threads <int>: Number of threads to spawn for faster calculation.
	Default: 1

--tped <string>: The input data filename (tped format).	
	Requires a .tfam file (--tfam).	
	Change missing data code with --missing-tped.
	Default: __none

--vcf <string>: The input data filename (vcf format).
	Default: __none

--weighted <bool>: Calculate weighted allele sharing similarity scores from Greenbaum et al. 2019.
	Default: false


NOTE: Multi-dimensional scaling plots can be created easily in R with the following commands (assuming --partial was not used):

      data<-read.table("<outfile>.asd.dist",header=TRUE)
      data.mds<-cmdscale(data)
      plot(data.mds)

A toy example with no missing data:

./asd --stru test_data_1.stru --id-col 1

output:

ind0 ind1 ind2 ind3 ind4 ind5 
ind0 0 0.5 0.3 0.2 0.3 0.5 
ind1 0.5 0 0.2 0.3 0.4 0.8 
ind2 0.3 0.2 0 0.1 0.2 0.6 
ind3 0.2 0.3 0.1 0 0.3 0.5 
ind4 0.3 0.4 0.2 0.3 0 0.6 
ind5 0.5 0.8 0.6 0.5 0.6 0 


./asd --stru test_data_1.stru --id-col 1 --partial

output:

6
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 
5 5 5 5 5 5 

ind0 ind1 ind2 ind3 ind4 ind5 
ind0 5 2.5 3.5 4 3.5 2.5 
ind1 2.5 5 4 3.5 3 1 
ind2 3.5 4 5 4.5 4 2 
ind3 4 3.5 4.5 5 3.5 2.5 
ind4 3.5 3 4 3.5 5 2 
ind5 2.5 1 2 2.5 2 5 


A toy example with missing data:

./asd --stru test_data_2.stru --id-col 1

output:

ind0 ind1 ind2 ind3 ind4 ind5 
ind0 0 0.5 0.25 0.125 0.333333 0.166667 
ind1 0.5 0 0.125 0.25 0.333333 0.666667 
ind2 0.25 0.125 0 0.1 0.125 0.5 
ind3 0.125 0.25 0.1 0 0.25 0.375 
ind4 0.333333 0.333333 0.125 0.25 0 0.625 
ind5 0.166667 0.666667 0.5 0.375 0.625 0 


./asd --stru test_data_2.stru --id-col 1 --partial

output:

6
4 3 4 4 3 3 
3 4 4 4 3 3 
4 4 5 5 4 4 
4 4 5 5 4 4 
3 3 4 4 4 4 
3 3 4 4 4 4 

ind0 ind1 ind2 ind3 ind4 ind5 
ind0 4 1.5 3 3.5 2 2.5 
ind1 1.5 4 3.5 3 2 1 
ind2 3 3.5 5 4.5 3.5 2 
ind3 3.5 3 4.5 5 3 2.5 
ind4 2 2 3.5 3 4 1.5 
ind5 2.5 1 2 2.5 1.5 4