ArchIE

ArchIE (ARCHaic Introgression Explorer) is a method for reference-free inference of archaic local ancestry. It uses coalescent simulations to train a model to distinguish between archaic and non-archaic windows. The process is as follows:

run bash create_training.sh to simulate training data and calculate features (from the directory)
run bash create_test.sh to simulate test data (from the directory)
run Rscript train.R to train a logistic regression model (from the directory)
run python data/calc_stats_window_data.py -s data.snp -i data.ind -a data.geno -r ref.geno -c 1 -b $start -e $end -w 50000 -z 50000 | gzip > output.gz

The last step will calculate features in 50KB windows with a 50KB step size (i.e. nonoverlapping) from $start to $end on chromosome 1.

You can find out more about the options for calc_stats_window_data.py using the -h flag:

python data/calc_stats_window_data.py -h
usage: calc_stats_window_data.py [-h] -s SNP -i IND -a ADMIX [-r REFERENCE] -c
                                 CHROM -b BEGIN -e END -w WINDOW -z STEP

Script to calculate stats for ArchIE. Prints stats to stdout.

optional arguments:
  -h, --help            show this help message and exit
  -s SNP, --snp SNP     .snp file with positions (eigenstrat format)
  -i IND, --ind IND     .ind file of IDS (eigenstrat format)
  -a ADMIX, --admix ADMIX
                        .geno file for admixed/test population (eigenstrat
                        format)
  -r REFERENCE, --reference REFERENCE
                        .geno file for reference population (eigenstrat
                        format).
  -c CHROM, --chrom CHROM
                        Chromosome
  -b BEGIN, --begin BEGIN
                        Beginning position
  -e END, --end END     End position
  -w WINDOW, --window WINDOW
                        Window length (eg. 50000)
  -z STEP, --step STEP  Step size for the window (eg. 10000)

Requirements

ArchIE requires the following dependencies:

Python 3
Numpy (1.13.0)
Scipy (0.19.0)
Scikit Learn (0.19.1)
R (3.4.0)

Note that other versions may work -- these are the ones I used.

Data formats

ArchIE uses the eigenstrat format, which contains .snp (list of polymorphic positions), .geno (genotypes), and .ind (individual IDs) files. ArchIE expects phased genotypes. You can convert from VCF format to eigenstrat using Iain Mathieson's genetic data converter. A version of the script is included in the data folder which preserves phase information during the conversion (which is required to run ArchIE). If you use your own script for this conversion, the output should be in the following order: Ind1-Hap1 Ind1-Hap2 Ind2-Hap1 Ind2-Hap2....

To calculate features on your data, use the calc_stats_window_data.py script in the data folder.

Applying to your data

When running ArchIE, you can specify the number of haploid genomes to simulate (n), the window size, or the demographic history (in ms.sh). You must simulate the same number of haploid genomes as are in your data set. E.g if I have 15 diploid individuals with phased genomes, I need to simulate a dataset with 30 haploid genomes.

Features

We use several features to predict local ancestry. When the number of haplotypes simulated is 100 (the default) this is what the columns describe:

1-100 The individual frequency spectrum (sample size dependent)
101-201 - the distribution of distances between haplotypes (sample size dependent)
202 - Mean of the distribution of distances between haplotypes
203 - Variance of the distribution of distances between haplotypes
204 - Skew of the distribution of distances between haplotypes
205 - Kurtosis of the distribution of distances between haplotypes
206 - Minimum distance to the reference population
207 - S*
208 - Number of SNPs private to the target population
209-211 one hot encoded label for [archaic, not archaic, in-between]. 'In-between' is chosen if haplotypes fall between 30-70% archaic. This can be modified in the calc_stats_ms.py script
212 - Proportion of the haplotype that is archaic ranges [0-1]

The order of the haplotypes remains the same if the sample size is changed, but the column numbers will change.

Note about `ms`

We use a customized version of ms that allows us to track the ancestry of each SNP. Check around line 218 in ms.c for the modifications. ms was compiled on Mac OSX, if you run into problems you can recompile it with gcc -o ms ms.c streec.c rand1.c -lm

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
data		data
msmodified		msmodified
simulations		simulations
train		train
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ArchIE

Requirements

Data formats

Applying to your data

Features

Note about `ms`

About

Releases

Packages

Languages

sriramlab/ArchIE

Folders and files

Latest commit

History

Repository files navigation

ArchIE

Requirements

Data formats

Applying to your data

Features

Note about ms

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Note about `ms`

Packages