Skip to content
/ lassip Public

LASSI-Plus: A program to calculate haplotype frequency spectrum statistics

License

Notifications You must be signed in to change notification settings

szpiech/lassip

Repository files navigation

# LASSI Plus - a program to calculate haplotype frequency spectrum statistics

This is an implementation of various haplotype frequency spectrum statistics useful for detecting hard and soft selective sweeps in genomes. This program implements the following statistics:

saltiLASSI: DeGriorgio and Szpiech (2022) PLoS Genetics 18: e1010134.
LASSI: Harris and DeGiorgio (2020) MBE doi.org/10.1093/molbev/msaa115.
H12: Garud et al. (2015) PLoS Genetics 11:e1005004.
H2/H1: Garud et al. (2015) PLoS Genetics 11:e1005004.
G123: Harris et al. (2018) Genetics 210:1419-1452.
G2/G1: Harris et al. (2018) Genetics 210:1419-1452.
Number of Unique Haplotypes at Locus

lassip accepts VCF files, either phased (default, "hap" output files) or unphased (set --unphased, "mlg" output files), with or without missing data. ***SEE CHANGELOG 19JAN2024 for update on how missing data is handeled. lassip expects one vcf file per contig, provided one at a time. You must provide a population file that specifies population IDs for each individual ID you wish to analyse. Only IDs listed in the population file will be analyzed, and if multiple populations are present, all statistics will be computed on a per-population basis.

Use --hapstats to compute H/G stats in sliding windows along the genome, whether or not data are phased determines whether H or G statistics are used. 

Use --calc-spec to compute the top K haplotype frequency specra in sliding windows along a contig. Pass multiple spectra files (e.g. from multiple contigs) with --spectra and --lassi to run the CLR computation for detecting sweeps (Harris and DeGiorgio 2020) or with --salti to run CLR computation from (DeGiorgio and Szpiech 2021).

If only --hapstats is given, files are named <basename>.lassip.[hap|mlg].stats.gz with format:

<chr> <start> <end> <nSNPs> <nHaps> <uniqHaps> <h12|g123> <h2h1|g2g1>

The final 4 columns are repeated for each population in the analysis, with the population code prepended on the appropriate header label.

If only --calc-spec is given with --vcf, files are named <basename>.lassip.[hap|mlg].spectra.gz with format:

<header line for use with lassip when reading with --spectra>
<chr> <start> <end> <nSNPs> <nHaps> <uniqHaps> <hfs1> ... <hfsK>

The final K+2 columns are repeated for each population in the analysis, with the population code prepended on the appropriate header label. hfsN gives the frequency of the Nth most common haplotype in the given window.

If both --calc-spec and --hapstats are given with --vcf, files are named <basename>.lassip.[hap|mlg].spectra.gz with format:

<header line for use with lassip when reading with --spectra>
<chr> <start> <end> <nSNPs> <nHaps> <uniqHaps> <h12|g123> <h2h1|g2g1> <hfs1> ... <hfsK>

The final K+4 columns are repeated for each population in the analysis, with the population code prepended on the appropriate header label. hfsN gives the frequency of the Nth most common haplotype in the given window.

Passing *.spectra.gz files to lassip with --spectra <file1> ... <fileN> will compute the LASSI CLR if --lassi is set and the saltiLASSI CLR if --salti is set. Output is a single file concatenating the results from all contigs, named <basename>.lassip.[hap|mlg].out.gz. For LASSI computations, one of two formats is output:

<chr> <start> <end> <nSNPs> <nHaps> <uniqHaps> <h12|g123> <h2h1|g2g1> <m> <T>

or

<chr> <start> <end> <nSNPs> <nHaps> <uniqHaps> <m> <T>

For saltiLASSI computations, one of two formats is output:

<chr> <start> <end> <nSNPs> <nHaps> <uniqHaps> <h12|g123> <h2h1|g2g1> <m> <A> <L>

or

<chr> <start> <end> <nSNPs> <nHaps> <uniqHaps> <m> <A> <L>

Depending on whether --hapstats was set when the *.spectra.gz files were generated. The final columns are repeated for each population in the analysis, with the population code prepended on the appropriate header label. m gives the inferred number of sweeping haplotypes, A gives the sweep “width” and T/L gives the CLR test statistic.

If --filter-level 2 is given (default), loci are filtered on a population basis and all monomorphic snps in a given population are filtered. As a consequence, total number of windows and the window coordinates will not necessarily match between pops, and so each population has results output in a separate file.


lassip v1.2.0

----------Command Line Arguments----------

--avg-spec <bool>: Set this flag to compute and output the average
K-truncated haplotype frequency spectrum from a set of .spectra files.
	Default: false

--calc-spec <bool>: Set this flag to compute K-truncated haplotype
frequency spectra.
	Default: false

--dist-type <string>: Distance measure for saltiLASSI: bp, cm, nw.
	Default: bp

--filter-level <int>: Filter monomorphic sites and sites
with missing data: 0-no filtering, 1-compute freq for all samples, 2-compute freq per pop.
	Default: 2

--hapstats <bool>: Set this flag to calculate haplotype statistics.
	Default: false

--help <bool>: Prints this help dialog.
	Default: false

--k <int>: Top K haplotypes for LASSI computations.
	Default: 10

--lassi <bool>: Set this flag to use the LASSI method.
	Default: false

--lassi-choice <int>: Set this flag to change the way LASSI
	distributes mass across sweeping haplotype classes. Takes an integer in {1..5}.
	Default: 4

--map <string>: A map file formatted <chr#> <locusID> <genetic pos> <physical pos>.
	Sites in VCF not in map file will be interpolated.
	Default: __mapfile1

--match-tol <int>: Group haplotypes with missing data into
the same class as a haplotype with no missing data if they have <= this many pairwise differences.
	Default: 0

--max-extend-bp <double>: Maximum distance in basepairs from core window to consider for saltiLASSI.
	Default: 100000.00

--max-extend-cm <double>: Maximum distance in centimorgans from core window to consider for saltiLASSI.
	Default: 0.05

--max-extend-nw <double>: Maximum distance in number of windows from core window to consider for saltiLASSI.
	Default: 5.00

--max-hmiss <double>: Drop haplotypes with > this proportion of missing data when computing the HFS.
	Default: 0.20

--max-lmiss <double>: Filter loci with > this proportion of missing data.
	Default: 0.10

--null-spec <string>: A file containing a null K-truncated
haplotype spectrum for use computing LASSI or saltiLASSI.
	Default: __nullspec1

--out <string>: The basename for all output files.
	Default: outfile

--pop <string>: A file containing <ind ID> <pop ID>.
	Default: __popfile1

--salti <bool>: Set this flag to use the saltiLASSI method.
	Default: false

--spectra <string1> ... <stringN>: A list of spectra files for finalization.
	Default: __specfile1

--threads <int>: The number of threads to spawn during computations.
	Default: 1

--unphased <bool>: Set this flag to indicate data are unphased.
	Default: false

--vcf <string>: A VCF file containing haplotype data.
	Variants should be coded 0/1.
	Default: __vcffile1

--winsize <int>: The window size within which to calculate statistics.
	Default: 0

--winstep <int>: The sliding window step size.
	Default: 0


***CHANGE LOG***

29FEB2024 - Fixed more bugs that appear in the 1.1.2a version. DO NOT USE 1.1.2a.

18JAN2024 - v1.2.0. Changing the way missing data is handled. Three new command line arguments:

--max-hmiss <double>: Drop haplotypes with > this proportion of missing data when computing the HFS.
	Default: 0.20

Previous behavior is that as soon as a missing genotype is encountered this haplotype is excluded from HFS calculation. You can now modify this behavior with --max-hmiss. A haplotype will only be dropped if there is more than MAX_HMISS proportion of alleles missing in that haplotype. e.g. 10-00--011 is a 10-snp haplotype with missing data represented by '-'. The default MAX_HMISS is 0.2, and since 3/10 > 0.2 this haplotype would be excluded. 10-0011011 has only 1/10 missing data, so it is kept.

--max-lmiss <double>: Filter loci with > this proportion of missing data.
	Default: 0.10

Previous behavior did not filter loci based on missing data. Here you can set the proportion of missing data beyond which a locus will be excluded. --filter-level controls how this is calculated, 0 fir no filtering at all, 1 to calculate the denominator from all provided samples, 2 to calculate denominator based on population groups.

--match-tol <int>: Group haplotypes with missing data into
	the same class as a haplotype with no missing data if they have <= this many pairwise differences.
	Default: 0

The current implementation now tries to cluster any haplotypes that have missing data (that pass the MAX_HMISS check) with haplotypes in the data that are complete (i.e. no missing data). To do this, I calculate # of pairwise differences, ignoring any locations with missing data, and cluster the missing-data haplotype with the complete haplotype(s) to which it has the fewest differences (enforcing that the number of differences must at least be <= MATCH_TOL). If there are ties, it is counted fractionally with equal contribution to each matching haplotype. Let me illustrate:

Complete haps
A 101
B 111
C 110

Haps w/missing data
D 1-1
E 0-1

Let d(X,Y) be the # pairwise differences and MATCH_TOL == 0, then d(D,A) = 0, d(D,B) = 0, d(D,C) = 1. In this case D would be counted 0.5 as A and 0.5 as B.

d(E,A) = 1, d(E,B) = 1, d(E,C) = 2. E would be kept as its own haplotype class.

Let MATCH_TOL == 1, then D would still be counted as 0.5 towards both A and B. The clustering always prefers the fewest number of differences even if MATCH_TOL would allow more. E would also cluster 0.5 with A and 0.5 with B.

Even if MATCH_TOL > 0, complete haplotypes are never merged (this could be changed, thinking about it). Haps w/ missing data are never merged among themselves (some non-trivial issues here, but thinking about solutions).

03DEC2023 - v1.1.2a Reverting the changes from v1.1.2 regarding grid search. This introduced more bugs than it helped to fix. Please do not use v1.1.2. 

12SEPT2023 - v1.1.2 Grid search across epsilon now starts at 0 and not 1/(100*K). This fixes a bug that tends to manifest when there is lots of missing data. When the average/null truncated HFS has its smallest frequency class p_K < 1/(100*K), this created a condition where the likelihoods were not computed and only placeholder values were output for all windows, since we set p_K = U but also expected p_K = U >= 1/(100*K). 

03DEC2021 - Bug fixes relating to genetic map distance interpolation.

02SEPT2021 - v1.1.1 Introduced saltiLASSI options for measuring distance in bp (basepairs), nw (number of windows), or cm (centimorgans). New command line flags:
    --max-extend-nw <double>: Maximum distance in number of windows from core window to consider for saltiLASSI.
    --max-extend-cm <double>: Maximum distance in centimorgans from core window to consider for saltiLASSI.
    --map <string>: A map file formatted <chr#> <locusID> <genetic pos> <physical pos>. Sites in VCF not in map file will be interpolated.
    --dist-type <string>: Distance measure for saltiLASSI: bp, cm, nw.
    These flags are used with --salti and --spectra flags when finalizing the computations.

07MAY2021 - v1.1.0 Introduced saltiLASSI framework. Must specify --lassi or --salti when passing spectra files with --spectra in order to choose a method.
	--filter-level default changes to 2.
	--null-spec introduced. Can be used to provide a null HFS for LASSI or saltiLASSI methods.
	--avg-spec introduced. When used with --spectra will compute and output an averaged HFS which can be passed with --null-spec to use as background null HFS.
	--max-extend-bp introduced. Maximum distance in basepairs from core window to consider for saltiLASSI.

10JUN2020 - v1.0.2a Fixed a bug that caused chromosome name to be blank with filter level > 0.

07JUN2020 - v1.0.2 Memory improvements for filter levels < 2.

06JUN2020 - v1.0.1 Options for filtering monomorphic loci with --filter-level
	0 - no filtering
	1 - filter all loci that are monomorphic in the samples selected for analysis
	2 - filter monomorphic loci on a population by population basis

	Also fixed error in computation of H2/H1 and G2/G1.

26MAY2020 - Initial release v1.0.0

About

LASSI-Plus: A program to calculate haplotype frequency spectrum statistics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages