This repository has been archived by the owner on Nov 4, 2021. It is now read-only.

A fast and scalable tool for estimating and testing selection differences between populations

SeleDiff

NOTE: This project is no longer actively maintained.

Introduction

  • SeleDiff implements a probabilistic method for estimating and testing selection (coefficient) differences between populations [1].
  • If you have any problems, please feel free to contact xinhuang.res@gmail.com, or open an issue in this repository.
  • If you would like to reproduce our simulation, please see the code in ./appendix.
  • If you are interested in contributing to SeleDiff, please feel free to clone and modify it. Please include unit tests for any code you modify. You can edit build.gradle to add new dependencies. When you are done, please send a GitHub pull request with a clear list of what you have changed.
  • For more details, please see the manual in ./docs.

Installation

To install SeleDiff, first install Java SE Development Kit 8 or OpenJDK 8.

Linux/Mac

In Linux/Mac, you can open the terminal and clone SeleDiff using git:

> git clone https://github.com/xin-huang/SeleDiff

Then you can enter the SeleDiff directory and use gradlew to install SeleDiff:

> cd ./SeleDiff
> ./gradlew build
> ./gradlew install

The runnable SeleDiff is in ./build/install/SeleDiff/bin/. You can add this directory to your PATH environment variable with:

> export PATH="/path/to/SeleDiff/build/install/SeleDiff/bin/":$PATH

You can get help information by typing:

> SeleDiff

You can use gradlew to remove SeleDiff:

> ./gradlew clean

Windows

In Windows, you can download the latest release. Please make sure your environment variable JAVA_HOME correctly points to your JDK directory. After downloading and decompressing the release, open cmd and enter the SeleDiff directory. Then use gradlew.bat to build and install SeleDiff.

> cd /path/to/SeleDiff
> gradlew.bat build
> gradlew.bat install

And run SeleDiff.bat in ./build/install/SeleDiff/bin/:

> cd ./build/install/SeleDiff/bin/
> SeleDiff.bat

You can use gradlew.bat to remove SeleDiff:

> cd /path/to/SeleDiff
> gradlew.bat clean

Commands

SeleDiff contains two sub-commands:

  • compute-var for estimating variances of Ω [1], which are required by the compute-diff command;
  • compute-diff for estimating selection differences among loci.

Input Files

SeleDiff assumes bi-allelic genetic data and will not perform any checks on this assumption. All input files can be compressed by gzip.

EIGENSTRAT

SeleDiff accepts genetic data in EIGENSTRAT format as input. EIGENSOFT provides several functions to convert other formats to EIGENSTRAT format.

VCF

SeleDiff also accepts genetic data in VCF format as input, and assumes that the genotypes of each individual are encoded with 0 and 1. Because the VCF format contains no population information for the individuals, users should provide an additional file in EIGENSTRAT IND format.
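As a minimal sketch of how such an IND file might be produced, the snippet below writes one line per individual with three whitespace-separated fields (sample ID, sex, population label), which is the usual EIGENSTRAT IND layout. The sample IDs and the `write_ind` helper are hypothetical, and the lines presumably need to follow the order of the sample columns in the VCF file; please check the manual in ./docs before relying on this.

```python
import io


def write_ind(sample_to_pop, handle, sex="U"):
    """Write one EIGENSTRAT IND line per sample: ID, sex, population label.

    `sex` defaults to "U" (unknown); this helper is illustrative only.
    """
    for sample_id, population in sample_to_pop.items():
        handle.write(f"{sample_id}\t{sex}\t{population}\n")


# Hypothetical sample IDs, listed in the (assumed) VCF column order.
samples = {"sample1": "YRI", "sample2": "CEU", "sample3": "CHS"}

buffer = io.StringIO()  # replace with open("example.ind", "w") to write a file
write_ind(samples, buffer)
ind_text = buffer.getvalue()
print(ind_text, end="")
```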

Var File

The Var file is the output file from the first sub-command compute-var, which stores variances of pairwise Ω. Unlike He et al. (2015), SeleDiff does not divide Ω by generation times, in order to reduce floating-point rounding errors. When estimating Ω, SeleDiff uses only SNPs that are not fixed in any population. When using the sub-command compute-diff to estimate selection differences, SeleDiff uses the --var option to accept a SPACE delimited file without header that specifies variances of Ω between populations.

    YRI CEU 1.547660
    YRI CHS 1.639591
    CEU CHS 0.989241

The first two columns are the population IDs, and the third column is the variance of Ω between the two populations.
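To inspect a Var file outside SeleDiff, a small reader along the following lines may help. It is only a sketch: the `read_pair_values` name is made up here, and keying each pair by an unordered set assumes that the order of the two population IDs does not matter, which the text above does not state.

```python
import io


def read_pair_values(handle):
    """Parse 'POP1 POP2 VALUE' lines into {frozenset({POP1, POP2}): float}."""
    values = {}
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        pop1, pop2, value = fields
        values[frozenset((pop1, pop2))] = float(value)
    return values


# The three lines shown above; use open(path) for a real file.
example = io.StringIO(
    "YRI CEU 1.547660\n"
    "YRI CHS 1.639591\n"
    "CEU CHS 0.989241\n"
)
variances = read_pair_values(example)
print(variances[frozenset(("CEU", "YRI"))])
```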

Divergence Time File

When using the sub-command compute-diff to estimate selection differences, SeleDiff uses the --time option to accept a SPACE delimited file without header that specifies the divergence time between each pair of populations.

    YRI CEU 5000
    YRI CHS 5000
    CEU CHS 3000

The first two columns are the population IDs, and the third column is the divergence time of the two populations.
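A divergence time file like the one above can be generated from a table of pairwise times. The sketch below is one way to do it; the `write_time_file` helper is hypothetical, and the check that every pair of populations has an entry is a precaution added here, not a documented SeleDiff requirement.

```python
import io
from itertools import combinations


def write_time_file(times, populations, handle):
    """Write 'POP1 POP2 TIME' for every pair of populations, space delimited."""
    for pop1, pop2 in combinations(populations, 2):
        key = frozenset((pop1, pop2))
        if key not in times:
            raise KeyError(f"no divergence time for {pop1}-{pop2}")
        handle.write(f"{pop1} {pop2} {times[key]}\n")


# Divergence times (in generations) taken from the example above.
times = {
    frozenset(("YRI", "CEU")): 5000,
    frozenset(("YRI", "CHS")): 5000,
    frozenset(("CEU", "CHS")): 3000,
}

buffer = io.StringIO()  # replace with open("example.time", "w") to write a file
write_time_file(times, ["YRI", "CEU", "CHS"], buffer)
time_text = buffer.getvalue()
print(time_text, end="")
```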

Output File

The output file from SeleDiff is TAB delimited. The first row is a header that describes the meaning of each column.

| Column | Column Name | Description |
|---|---|---|
| 1 | SNP ID | The name of a SNP |
| 2 | Ref | The reference allele |
| 3 | Alt | The alternative allele |
| 4 | Population1 | The first population ID |
| 5 | Population2 | The second population ID |
| 6 | Selection difference | The selection difference between the first and second populations |
| 7 | Std | The standard deviation of the selection difference |
| 8 | Lower bound of 95% CI | Lower bound of 95% confidence interval of the selection difference |
| 9 | Upper bound of 95% CI | Upper bound of 95% confidence interval of the selection difference |
| 10 | Delta | The delta statistic for selection difference |
| 11 | p-value | The p-value of the delta statistic |
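Because the output is TAB delimited with a single header row, it can be post-processed with standard tools. The sketch below filters rows by p-value using column positions from the table above, so it does not depend on the exact header text. The `significant_rows` helper, the 0.05 threshold, and the two data rows are made up for illustration and are not real SeleDiff output.

```python
import csv
import io


def significant_rows(handle, alpha=0.05):
    """Return (SNP, pop1, pop2, difference, p-value) for rows with p < alpha."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    hits = []
    for row in reader:
        if len(row) < 11:
            continue  # skip blank or truncated lines
        p_value = float(row[10])  # column 11: p-value
        if p_value < alpha:
            # columns 1, 4, 5, 6: SNP ID, Population1, Population2, difference
            hits.append((row[0], row[3], row[4], float(row[5]), p_value))
    return hits


# Synthetic rows with the 11 columns described above (values are invented).
sample = "\n".join([
    "\t".join(["SNP ID", "Ref", "Alt", "Population1", "Population2",
               "Selection difference", "Std", "Lower bound of 95% CI",
               "Upper bound of 95% CI", "Delta", "p-value"]),
    "snp1\tA\tG\tPOP1\tPOP2\t-0.0010\t0.0004\t-0.0018\t-0.0002\t6.25\t0.0124",
    "snp2\tC\tT\tPOP1\tPOP2\t0.0001\t0.0004\t-0.0007\t0.0009\t0.0625\t0.8026",
]) + "\n"

hits = significant_rows(io.StringIO(sample))
for hit in hits:
    print(hit)
```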

An Example

Here is an example to show how SeleDiff estimates and tests selection differences between populations. Four populations (YRI, CEU, CHB, CHD) from HapMap3 (release 3) were extracted. CHB and CHD were merged into one population called CHS. PLINK 1.7 was used to remove correlated individuals, SNPs with minor allele frequencies less than 0.05, and SNPs in strong linkage disequilibrium. These genome-wide data are stored in ./examples/data/example.geno and used for estimating variances of Ω.

Two alternative alleles (rs1800407 and rs12913832) associated with blue eyes were identified in genes HERC2 and OCA2 [2]. These candidate data are stored in ./examples/data/example.candidates.geno and used for estimating selection differences of these SNPs between populations.

The allele counts in our example data are summarized below.

| SNP ID | Population | Reference Allele Count | Alternative Allele Count |
|---|---|---|---|
| rs1800407 | YRI | 290 | 0 |
| rs1800407 | CEU | 207 | 17 |
| rs1800407 | CHS | 486 | 4 |
| rs12913832 | YRI | 294 | 0 |
| rs12913832 | CEU | 47 | 177 |
| rs12913832 | CHS | 491 | 1 |

We assume the divergence times of YRI-CEU and YRI-CHS are both 5000 generations, while the divergence time of CEU-CHS is 3000 generations. This information is stored in ./examples/data/example.time.
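To build some intuition for how allele counts and divergence times combine, here is a rough sketch. It is not SeleDiff's code. It assumes a log-odds-ratio estimator in the spirit of He et al. (2015) [1], namely the log odds ratio of the alternative allele between two populations divided by their divergence time, with 0.5 added to every count so that zero counts do not break the logarithm. Both the estimator form and the 0.5 correction are assumptions made for this illustration.

```python
import math


def log_odds_difference(ref1, alt1, ref2, alt2, generations, correction=0.5):
    """Per-generation log odds ratio of the alternative allele (pop1 vs pop2).

    ASSUMPTION: this mirrors a log-odds-ratio style estimator; the continuity
    correction of 0.5 is a common convention, not taken from SeleDiff.
    """
    odds1 = (alt1 + correction) / (ref1 + correction)
    odds2 = (alt2 + correction) / (ref2 + correction)
    return math.log(odds1 / odds2) / generations


# rs12913832: YRI (294 ref, 0 alt) versus CEU (47 ref, 177 alt), 5000 generations.
estimate = log_odds_difference(294, 0, 47, 177, 5000)
print(f"{estimate:.6f}")
```

With these counts the sketch gives roughly -0.0015 per generation. The sign is negative because the alternative allele is far more common in CEU than in YRI. The value SeleDiff itself reports may differ in the last digits, since its exact handling of counts is not described here.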

First, we estimate variances of Ω using sub-command compute-var:

> SeleDiff compute-var --geno ./examples/data/example.geno \
                       --ind ./examples/data/example.ind \
                       --snp ./examples/data/example.snp \
                       --output ./examples/results/example.geno.var

To estimate selection differences of candidates, we use the sub-command compute-diff:

> SeleDiff compute-diff --geno ./examples/data/example.candidates.geno \
                        --ind ./examples/data/example.candidates.ind \
                        --snp ./examples/data/example.candidates.snp \
                        --var ./examples/results/example.geno.var \
                        --time ./examples/data/example.time \
                        --output ./examples/results/example.candidates.geno.results
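The two steps above can also be chained from a script. The following is a minimal sketch that only uses the sub-commands, options, and example paths shown above; the `build_commands` helper is hypothetical, and the script assumes a Linux/Mac setup where SeleDiff is on your PATH. If SeleDiff is not found, it prints the commands instead of running them.

```python
import shutil
import subprocess


def build_commands(data_dir="./examples/data", results_dir="./examples/results"):
    """Return the compute-var and compute-diff argument lists for the example."""
    var_file = f"{results_dir}/example.geno.var"
    compute_var = [
        "SeleDiff", "compute-var",
        "--geno", f"{data_dir}/example.geno",
        "--ind", f"{data_dir}/example.ind",
        "--snp", f"{data_dir}/example.snp",
        "--output", var_file,
    ]
    compute_diff = [
        "SeleDiff", "compute-diff",
        "--geno", f"{data_dir}/example.candidates.geno",
        "--ind", f"{data_dir}/example.candidates.ind",
        "--snp", f"{data_dir}/example.candidates.snp",
        "--var", var_file,  # output of the first step feeds the second
        "--time", f"{data_dir}/example.time",
        "--output", f"{results_dir}/example.candidates.geno.results",
    ]
    return [compute_var, compute_diff]


commands = build_commands()
if shutil.which("SeleDiff") is None:
    print("SeleDiff not found on PATH; commands that would run:")
    for command in commands:
        print(" ".join(command))
else:
    for command in commands:
        subprocess.run(command, check=True)  # stop at the first failure
```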

The results are stored in ./examples/results/example.candidates.geno.results. The main results are shown below.

| SNP ID | Population1 | Population2 | Selection difference | Std | Delta | p-value |
|---|---|---|---|---|---|---|
| rs1800407 | YRI | CEU | -0.000773 | 0.000380 | 4.129 | 0.042154 |
| rs1800407 | YRI | CHS | -0.000336 | 0.000393 | 0.731 | 0.392559 |
| rs1800407 | CEU | CHS | 0.000728 | 0.000377 | 3.730 | 0.053443 |
| rs12913832 | YRI | CEU | -0.001541 | 0.000378 | 16.583 | 0.000047 |
| rs12913832 | YRI | CHS | -0.000117 | 0.000415 | 0.080 | 0.777297 |
| rs12913832 | CEU | CHS | 0.002372 | 0.000433 | 30.062 | 0.000000 |

From the results, we can see that the selection coefficient of rs12913832 in CEU is significantly larger than that in YRI or CHS, which indicates that rs12913832 is under directional selection in CEU. The selection coefficient of rs1800407 in CEU is only marginally significantly larger than that in YRI or CHS.
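The reported p-values appear consistent with treating the delta statistic as a chi-square variable with one degree of freedom. This is an assumption inferred from the numbers in the table, not a statement of how SeleDiff computes them. Under that assumption, the upper-tail probability can be sketched with the standard library alone:

```python
import math


def p_value_from_delta(delta):
    """Upper-tail probability of a chi-square (1 df) variable at `delta`.

    For one degree of freedom this equals erfc(sqrt(delta / 2)).
    ASSUMPTION: delta follows a chi-square distribution with 1 df.
    """
    return math.erfc(math.sqrt(delta / 2.0))


# Delta values copied from the results table above.
for delta in (4.129, 16.583):
    print(f"delta = {delta}: p = {p_value_from_delta(delta):.6f}")
```

Small discrepancies from the table are expected, because the delta values shown there are rounded to three decimal places.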

Please refer to our previous study [1] for a more comprehensive working example using the HapMap3 dataset.

Dependencies

References

  1. He et al., Genome Res, 2015.
  2. Sturm et al., Am J Hum Genet, 2008.