-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Stage 1: Sequence alignment
Stage 2: Construction of the reference panel
Stage 3: Tabulation of allele observations at known SNPs
Stage 4: Model comparison with log likelihood ratios
Stage 5: Plot log-likelihood ratio vs. chromosomal position
Simulating various forms of trisomy
Balanced ROC Analysis
Description of the IMPUTE2 format for reference panels
Generating IMPUTE2 reference panels from a VCF file
Dependencies
Extra or missing chromosomes—a phenomenon termed aneuploidy—frequently arises during human meiosis and embryonic mitosis and is the leading cause of pregnancy loss, including in the context of in vitro fertilization (IVF). While meiotic aneuploidies affect all cells and are deleterious, mitotic errors generate mosaicism, which may be compatible with healthy live birth. Large-scale abnormalities such as triploidy and haploidy also contribute to adverse pregnancy outcomes, but remain hidden from standard sequencing-based approaches to preimplantation genetic testing (PGT-A). The ability to reliably distinguish meiotic and mitotic aneuploidies, as well as abnormalities in genome-wide ploidy may thus prove valuable for enhancing IVF outcomes. Here, we describe a statistical method for distinguishing these forms of aneuploidy based on analysis of low-coverage whole-genome sequencing data, which is the current standard in the field. Our approach overcomes the sparse nature of the data by leveraging allele frequencies and linkage disequilibrium (LD) measured in a population reference panel. The method, which we term LD-informed PGT-A (LD-PGTA), retains high accuracy down to coverage as low as 0.05x and at higher coverage can also distinguish between meiosis I and meiosis II errors based on signatures spanning the centromeres. LD-PGTA provides fundamental insight into the origins of human chromosome abnormalities, as well as a practical tool with the potential to improve genetic testing during IVF.
For more information:
Daniel Ariad, Stephanie M. Yan, Andrea R. Victor, Frank L. Barnes, Christo G. Zouves, Manuel Viotti, Rajiv C. McCoy. “Haplotype-aware inference of human chromosome abnormalities” | PNAS November 16, 2021 118 (46) e2109307118 | bioRxiv:10.1101/2021.05.18.444721 | The study is summarized in this poster
To demonstrate the use of LD-PGTA, we can obtain a trisomic sample from SRA.
wget 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-2/SRR6676163/SRR6676163.1'
mv SRR6676163.1 SRR6676163.sra
fastq-dump ./SRR6676163.sra --gzip
We next prepare a reference genome file (here GRCh38/hg38; UCSC release, Dec. 2013). In order to accelerate the alignment, we do not include unplaced sequences, centromeric sequences and alternates:
wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr?.fa.gz'
wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr??.fa.gz'
gunzip *.fa.gz
cat *.fa > hg38.fa
rm chr*.fa
The complete GRCh38/hg38 (UCSC release, Dec. 2013) reference genome, which includes unplaced sequences, centromeric sequences and alternates, can be downloaded :
wget 'http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz'
gunzip hg38.fa.gz
We use the faidx command from samtools to prepare the FASTA index file. This index file provides byte offsets in the FASTA file for each contig. Thus, it allows us to compute exactly where to find a particular reference base at specific genomic coordinates in the FASTA file.
samtools faidx hg38.fa
For reads shorter than 70 bp we use the software BWA. The reference index is created from the reference genome FASTA by executing:
bwa index -p hg38bwaidx -a bwtsw hg38.fa
See the article and the documentation of the software for more details.
For reads larger than 70 bp we use the software minimap2. The reference index is created from the reference genome FASTA by executing:
minimap2 -x sr -d hg38.sr.mmi hg38.fa
See the article and the documentation of the software for more details.
In order to align short reads (<70bp) using bwa, we execute it with the following arguments:
bwa aln -t 6 hg38bwaidx SRR6676163.fastq.gz > SRR6676163.bwa
bwa samse hg38bwaidx SRR6676163.bwa SRR6676163.fastq.gz > SRR6676163.sam
,
where the option -t 6
allows multi-threading with 6 threads.
In order to map reads ranging from 0.07 kbp to 1 kbp, we use minimap2 with the following arguments:
minimap2 -t6 -ax sr ../genome_ref_$2/minimap2/hg38.sr.mmi SRR6676163.fastq.gz > SRR6676163.sam
,
where the option -t6
allows multi-threading with 6 threads.
Conversion from SAM to BAM format reduces data size and improves performance. In addition, many applications (such as genome browsers) require the reads in the BAM file to be coordinate-based sorted, i.e., the reads are ordered from “left” to “right” across the reference genome just as you would read across a line in a book.
samtools view -bS -F 4 SRR6676163.sam | samtools sort -@ 6 -o SRR6676163.sorted.bam -
The option -F 4
filters out unmapped reads, while -@ 6
allows samtools
to use up to 6 threads.
Next, we create an index from the BAM file with the samtools index subcommand:
samtools index SRR6676163.sorted.bam SRR6676163.sorted.bai
Source: http://bioinformatics-core-shared-training.github.io/cruk-bioinf-sschool/Day1/Sequence%20Alignment_July2015_ShamithSamarajiwa.pdf and http://genomeintelligence.org/?p=1486
As described in our manuscript, LD-PGTA overcomes the sparse nature of low-coverage data by using information about allele frequencies and haplotype structure as measured in a population reference panel. Here we demonstrate the construction of such a reference panel using data from the 1000 Genomes Project.VCF files from the 1000 Genome Project Phase 3 aligned against GRCh38/hg38:
wget 'ftp://ftp.sra.ebi.ac.uk/vol1/ERZ822/ERZ822766/*'
This call set is also available from:
Reference: Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project
Create an IMPUTE2 SAMPLE file, as explained in this section. In order to generate such a list for the 1000 genome project phase 1\3 visit https://www.internationalgenome.org/data-portal/sample.
One important note: the columns of sample file for population, group, and sex expected to be filled by the user. LD-PGTA uses the values in the group column to differentiate between samples that are associated to distinct super-populations, and thus providing this information is required.
The script MAKE_REF_PANEL.py
creates reference panels for LD-PGTA, using phased genotypes in VCF files. The reference panels of LD-PGTA follow the structure of the IMPUTE2 format but are encoded, but is encoded as binary for efficient storage and retrieval.
We run the script with the following arguments and flags:
python3 MAKE_REF_PANEL.py EUR_panel.samples ALL.chr21.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --mask 20160622.chr1.mask.fasta.gz
,
where the first argument is the IMPUTE2 SAMPLE file. The second required argument is a VCF filename that correspond to a single chromosome. The third argument an accessibility mask file in a gzipped FASTA format and is optional. Supplying an accessibility mask file will reduce false SNPs in regions of the genome that are less accessible to NGS methods. GRCh38 genome accessibility masks for 1000 Genomes data are available from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20160622_genome_mask_GRCh38/.
In addition, the following flags are supported:
Optional argument | Description |
---|---|
--output-directory <PATH> |
The directory in which the reference panel would be created. |
--force-module cyvcf2/bcftools/pysam |
By deafult cyvcf2 module would be used. This allows to use pysam or bcftools instead. |
The script MAKE_OBS_TAB.py
uses a BAM file, which is aligned and sorted, as well as an IMPUTE2 legend file.
It then builds a nested list that contains observed alleles at SNPs positions together with their chromosome position and line number in the legend file. The nested list is stored in a pickle format.
We run the script with the following arguments and flags:
python3 MAKE_OBS_TAB.py SRR6676163.bam EUR_panel.chr21.legend --min-bq 30 --min-mq 30 --max-depth 0 --handle-multiple-observations all
,
where the first argument is the filename of the BAM file, containing reads mapped to a reference sequence. The second required argument is the filename of the IMPUTE2 legend file, which lists the SNPs positions. In addition, the following flags are supported:
Optional argument | Description |
---|---|
--fasta_filename <FILENAME> |
The faidx-indexed reference file in the FASTA format. Supplying a reference file will reduce false SNPs caused by misalignments using the Base Alignment Quality (BAQ) method described in the paper “Improving SNP discovery by base alignment quality”, Heng Li, Bioinformatics, Volume 27, Issue 8. |
--handle-multiple-observations <all/first/random/skip> |
We expect to observe at most a single base per SNP. When encountering an exception, the default behavior is to skip the SNP. However, a few alternative options to handle multiple observations are available: (a) take the first observed base, (b) randomly pick an observed base and (c) keep all the observed bases. |
--min-bq <INT> |
Minimum base quaility for observations. Default value is 30. |
--min-mq <INT> |
Minimum mapping quaility for observations. Default value 30. |
--max-depth <FLOAT> |
Maximum depth coverage to be considered (inclusive). Default value is 0, effectively removing the depth limit. |
--output-filename <FILENAME> |
Output filename. The default filename is the same as the BAM filename, but with an extension of .chr_id.obs.p . |
- In order to use the script, the Python module Pysam (v0.18.0) must be installed. Pysam is a lightweight wrapper of the htslib C-API, allowing our script to read SAM/BAM files. For more information check pysam-developers / pysam and also pysam: htslib interface for python.
The script ANEUPLOIDY_TEST.py
builds a dictionary that lists linkage disequilibrium (LD) blocks that contain at least two meaningful reads and gives the associated log-likelihood BPH/SPH ratio (LLR). BPH (Both Parental Homologs) describes to the presence of three unmatched haplotypes, while SPH (Single Parental Homolog) describes the presence of two identical haplotypes, with an additional unmatched haplotype. Finally, the script analyzes the dictionary and reports summary statistics, including the mean LLR, the standard error, the number of informative genomic windows and the fraction of genomic windows with a negative LLR.
As demonstration, we run the script with the following arguments and flags:
python3 ANEUPLOIDY_TEST.py SRR6676163.obs.p EUR_panel.chr21.legend EUR_panel.chr21.hap EUR_panel.chr21.sam EUR --window-size 0 --min-reads 6 --max-reads 4 --compress bz2
,
where the first argument is the filename of a pickle file created by MAKE_OBS_TAB, containing base observations at known SNP positions. The second argument is the filename of the reference panel legend, which lists the SNPs in our reference panel. The third required argument is the filename of the reference panel haplotypes, which lists the haplotypes in our reference panel. The forth required argument is the filename of the reference panel samples, which describes the ancestral group of the individuals consisting the reference panel. The fifth argument is ancestral makeup of SRR6676163. In general, the ancestral makeup is specified as follows:
- For non-admixtures the argument consists a single superpopulation, e.g., EUR.
- For recent admixtures the argument consists two superpopulations, e.g., EUR EAS.
- For distant admixtures the argument consists of the superpoplations and their proportions, e.g, EUR 0.8 EAS 0.1 SAS 0.1.
In addition, the following optional arguments are supported:
Optional argument | Description |
---|---|
--window-size <INT> |
Specifies the size of the genomic window. The default value is 100 kbp. When given a zero-size genomic window, it adjusts the size of the window to include min-reads reads. |
--offset <INT> |
Shifts all the genomic windows by the requested base pairs. The default value is 0. |
--min-reads <INT> |
Takes into account only genomic windows with at least INT reads, admitting non-zero score. The minimal value is 3, while the default is 6. |
--max-reads <INT> |
Selects up to INT reads from each genomic windows in each bootstrap sampling. The minimal value is 2, while the default value is 4. |
--min-HF <INT> |
Only haplotypes with a frequnecy between FLOAT and 1-FLOAT add to the score of a read. The default value is 0.05. |
--min-score <INT> |
Consider only reads that reach the minimal score. The default value is 2. |
--output-filename <FILENAME> |
The output filename. The default is the input filename with the extension ".obs.p" replaced by ".LLR.p". |
--compress gz/bz2/unc |
Output compressed via gzip, bzip2 or uncompressed. Default is uncompressed. |
- The presence of the Python module gmpy2 accelerates performance of this script. gmpy2 is an implementation of a standard GMP (GNU Multiple Precision Arithmetic Library) and supports arbitrary precision integers. For more information check gmpy2 and also gmpy2’s documentation.
The script PLOT_PANEL.py
plots log-likelihood ratio vs. chromosomal position from a LLR file.
As demonstration, we run the script with the following arguments and flags:
python3 PLOT_PANEL.py SRR6676163.LLR.p
,
where the first argument is the filename of a pickle file created by ANEUPLOIDY_TEST, containing likelihoods to observese various aneuploidy landscapes. In addition, the following flags are supported:
Optional argument | Description |
---|---|
--pairs |
Plots the LLR between scenario A and scenario B along the chromosome. The possible pairs are: BPH,DISOMY; DISOMY,SPH; SPH,MONOSOMY; DISOMY,MONOSOMY; BPH,SPH. In addition, giving a list of pairs would plot the LLR of each pair in the same figure, e.g. "BPH,SPH SPH,MONOSOMY". The default value is BPH,SPH. |
--bins |
The numbers of bins the chromosome is divided into. The default value is 10. |
To further demonstrate and evaluate LD-PGTA and other methods, it can be useful to simulate trisomies (as well as other ploidy configurations) with various haplotype patterns. This can be accomplished by drawing haplotypes from a reference panel such as the 1000 Genomes Project, and mixing them in defined proportions.
The script MIX_HAPLOIDS.py simulates observed bases at known SNP positions from an aneuploid cell, based on mixtures of haploid sequences. The simulation supports various aneuploidy configurations:
- Monosomy - a single copy of a chromosome.
- Disomy - two unmatched haplotypes.
- SPH (`single parental homolog') - trisomy with two identical haplotypes and one unique haplotype.
- BPH (`both parental homologs') - trisomy with three distinct haplotypes.
- Realistic meiotic-origin trisomy with recombination, characterized by a transitions between SPH to BPH tracts.
We demonstrate the script with the following arguments and flags:
python3 MIX_HAPLOIDS.py HAPLOID1.obs.p HAPLOID2.obs.p HAPLOID3.obs.p --depth 0.01 --read-length 36 --scenarios monosomy,SPH,disomy,BPH
,
where the three first arguments are the filenames of observation tables created by MAKE_OBS_TAB for each haploid sequence. In addition, some of the following optional flags were used:
Optional argument | Description |
---|---|
--depth <FLOAT> |
The average coverage for the whole chromosome. Default value 0.1 |
--read-length <INT> |
The number of base pairs (bp) sequenced from a DNA fragment. Default value 36. |
--scenarios <LIST OF STRINGS> |
The simulation supports five scenarios: monosomy/disomy/SPH/BPH/transitions. Default scenario is disomy. Giving a list of scenarios, e.g. "SPH BPH" would create a batch of simulations. |
--output-filename <FILENAME> |
The output filename. The default filename is a combination of three obs filenames. |
--compress <gz/bz2/unc> |
Output compressed via gzip, bzip2 or uncompressed. Default is uncompressed. |
--transitions <STR,FLOAT,FLOAT,...,FLOAT,> |
Relevant only for transitions scenario. Introduces transitions between SPH and BPH along the chromosome. The locations of the transition is determined by a fraction of chromosome length, ranging between 0 to 1. For example a BPH-SPH-BPH transition that equally divides the chromosomes is exressed as BPH,0.333,0.666 and, similarly, a SPH-BPH transition at the middle of the chromosome is expressed as SPH,0.5. In addition, giving a list of cases, e.g. "SPH,0.2 SPH,0.4 SPH,0.6" would create a batch of three simulations. |
--distant-admixture <FLOAT FLOAT> |
Assume a distant admixture with a certain ancestry proportion, e.g, AFR 0.8 EUR 0.2. In addition, the order of observation tables that are given as arguments is important; Odd positions are associated with population 1, while even positions with population 2. For example, in order to simulate a SPH case the observation tables should be given as follows: "python MIX_HAPLOIDS -s SPH HAPLOID1_AFR.obs.p HAPLOID2_EUR.obs.p HAPLOID3_AFR.obs.p HAPLOID4_EUR.obs.p". When simulating SPH, the first two observation tables would be associated with the duplicated homolog. |
The script EXTRACT_GENOTYPES.py simulates two observation tables of a haploid, using phased genotypes from VCF files. Each observation table includes SNP positions, "alleles from the same haplotype of a certain individual and their corresponding line number within the IMPUTE2 legend file.
We run the script with the following arguments and flags:
python3 EXTRACT_GENOTYPES.py ALL.chr21.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz chr21_EUR_panel.legend chr21 HG00097
where the first argument is the filename of a VCF file, containing phased genotypes. The second required argument is the filename of the IMPUTE2 legend file, which lists the SNPs positions. The third and forth arguments are the chromosome ID and sample ID, respectively.
The script IMPUTE2OBS.py simulates two observation tables of a haploid, using phased genotypes from IMPUTE2 files. Each observation table includes SNP positions, alleles from the same haplotype of a certain individual and their corresponding line number within the IMPUTE2 legend file.
We run the script with the following arguments and flags:
python IMPUTE2OBS.py chr21_EUR_panel.legend chr21_EUR_panel.hap chr21_EUR_panel.samples chr21 HG00097
where the first argument is the filename of a legend file, describing the SNPs. The second argument is the filename of the IMPUTE2 haplotypes file, which contains a table of haplotypes. The thirds argument is the filename of the IMPUTE2 samples file, which lists the individuals. The forth and fifth arguments are the chromosome ID and sample ID, respectively.
Based on simulated scenario that was created by MIX_HAPLOIDS and analyzed by ANEUPLOIDY_TEST, balanced ROC (Receiver Operating Characteristic) curves for predicting BPH (both parental homologs) and SPH (single parental homologs) are created.
The balanced ROC curve is a plot of BTPR (Balanced True Positive Rate) vs. BFPR (Balanced False Positive Rate), where BTPR ≡ 0.5(TPRBPH + TPRSPH) and BFPR ≡ 0.5(FPRBPH + FPRSPH).
The genome is divided into bins. For each simulated data, the mean LLR and the standard deviation for a bin are calculated. A positive (negative) prediction is made if the bounds of the confidence interval lie on the positive (negative) side of the number line. In each bin, the z-score is varied and the balanced positive and negative rates are calculated for each value it takes.
We run the script with the following arguments and flags:
python BALANCED_ROC_CURVE.py ~/results/non_masked/nonadmixed_EAS_chr16/ nonadmixed_EAS_chr16.p
where the first argument is the path of a directory that contains LLR files with simulated scenarios and the second argument is the output filename. In addition, the following optional arguments are supported:
Optional argument | Description |
---|---|
--number-of-bins <INT> |
The genome is divided into bins and for each bin a ROC curve is calculated. Default value is 15. |
--compress <gz/bz2/unc> |
Output compressed via gzip, bzip2 or uncompressed. Default is uncompressed. |
--scenarios <BPH/SPH/disomy/monosomy> |
Two simulated scenarios for which a balanced ROC curve would be created. The default is "BPH SPH". |
--ancestral-makeup
|
Apply a criterion for the ancestral makeup: a. For non-admixtures the argument consists a single superpopulation, e.g., EUR. b. For recent admixtures the argument consists two superpopulations, e.g., EUR EAS. c. For distant admixtures the argument consists of the superpoplations and their proportions, e.g, EUR 0.8 EAS 0.1 SAS 0.1. |
--chr-id <STR> |
Apply a criterion for the chromosome number, e.g., chrX. |
--window-size <INT> |
Apply a criterion for the size of the genomic window. |
--subsamples <INT> |
Apply a criterion for the number of subsamples per genomic window. |
--min-reads <INT> |
Apply a criterion for the minimal number of reads in a genomic window, admitting non-zero score. |
--max-reads <INT> |
Apply a criterion for the maximal number of sampled reads per bootstrap iteration. |
--min-HF <FLOAT> |
Apply a criterion for the minimal haplotype frequency |
--min-score <INT> |
Apply a criterion for the minimal score of a read. |
--read-length <INT> |
Apply a criterion for the number of base pairs in read. |
--depth <INT> |
Apply a criterion for the depth of coverage. |
Adopted from the documentation of SHAPEIT, https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#haplegsample
This is a default haplotype text file format used by IMPUTE2 to encode the haplotypes of the 1,000 Genomes Project. To describe it briefly, consider 4 unrelated individuals (Two white American CEU1 & CEU2, an English GBR1 & a Nigerian YRI1) for which haplotypes are available at 3 SNPs. The SNPs are located at positions 123, 456 and 789 and are donoted by SNP123, SNP456 & SNP789, respectively:
Individuals | Haplotypes | SNP123 | SNP456 | SNP789 |
---|---|---|---|---|
CEU1 | hap1 | A | T | A |
CEU1 | hap2 | A | C | T |
CEU2 | hap1 | G | C | T |
CEU2 | hap2 | A | T | A |
GBR1 | hap1 | A | T | T |
GBR1 | hap2 | A | C | T |
YRI1 | hap1 | G | T | T |
YRI1 | hap2 | G | C | T |
The SAMPLE file describes the individuals. The minimal SAMPLE file corresponding to the example dataset is:
sample | population | group | sex |
---|---|---|---|
CEU1 | CEU | EUR | 1 |
CEU2 | CEU | EUR | 2 |
GBR1 | GBR | EUR | 2 |
YRI1 | YRI | AFR | 2 |
It is SPACE delimited. The first line is a header line that describe the content of the file. Then, each line corresponds to a single individual. The first four columns are:
- Individual ID
- Population
- Superpopulation
- Sex (1 for male, 2 for female)
LD-PGTA requires at least 4 columns in the SAMPLE file. Additional columns can be added but SHAPEIT will ignore them. This file should have N+1 lines and at least 4 columns where N is the number of individuals in the reference panel. Each individual must have unique IDs containing only alphanumeric characters. Each individual can have 2 associated group IDs for subsetting purpose.
The LEGEND file describes the SNPs. The minimal LEGEND file corresponding to the example dataset is:
id | position | ref | alt |
---|---|---|---|
SNP1 | 123 | A | G |
SNP2 | 456 | T | C |
SNP3 | 789 | A | T |
This file is SPACE delimited. The first line is a header line that describe the content of the file. Each line corresponds to a single SNP. The first four columns are:
- SNP ID [string]
- SNP Position [integer]
- Reference allele [string]
- Alternative allele [string]
LD-PGTA requires 4 columns in the LEGEND file. This file should have L+1 lines and at least 4 columns where L is the number of SNPs in the reference panel.
The HAP file contains the haplotypes. The HAP file corresponding to the example dataset is:
0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
This file is SPACE delimited. Each line corresponds to a single SNP. Each successive column pair (0, 1), (2, 3), (4, 5) and (6, 7) corresponds to the alleles carried at the 4 SNPs by each haplotype of a single individual. For example a pair "1 0" means that the first haplotype carries the alternative allele while the second carries the reference allele as specified in the LEGEND file. The haplotypes are given in the same order than in the SAMPLE file. This file should have L lines and 2N columns, where L and N are the numbers of SNPs and individuals respectively.
Here we create an IMPUTE2 reference panel using bcftools. First, create a file containing a list of individuals to include in subsequent analysis. Each individual ID (as defined in the VCF header line) should be included on a separate line. In order to generate such a list for the 1000 genome project phase 1\3 visit https://www.internationalgenome.org/data-portal/sample.
The next step is to convert the reference panel VCF to BCF format:
bcftools view \
ALL.chr21.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--samples-file EUR_indv_phase3.txt \
--exclude-types indels,mnps,ref,bnd,other \
--min-alleles 2 \
--max-alleles 2 \
--min-ac 1:minor \
--phased \
--exclude 'AN!=2*N_SAMPLES' \
--output-file chr21_EUR_panel.bcf \
--output-type u
\
The arguments that were used:
Argument | Description |
---|---|
--samples-file EUR_indv_phase3.txt |
File of sample names to include with one sample per line. The sample order is updated to reflect that given in the input file. |
--exclude-types indels,mnps,ref,bnd,other |
Select only sites with SNP variants across all samples, by excluding any other type. Types are determined by comparing the REF and ALT alleles in the VCF record not INFO tags. |
--min-alleles 2 --max-alleles 2 |
Select only biallelic SNPs, by selecting sites with (at least and at most) two alleles listed in REF and ALT columns. |
--min-ac 1:minor |
Select only SNPs with a minor allele count that reaches a defined threshold. |
--phased --exclude 'AN!=2*N_SAMPLES' |
An IMPUTE reference-panel format requires phased data and bi-allelic sites. In order to make sure that no data is missing, sites where the total number of alleles across all samples is not equal to twice the number of samples are excluded. |
--output-file chr21_EUR_panel.bcf --output-type u |
Here we specify the output file name and format. We choose an uncompressed BCF in order to speed up performance by removing unnecessary compression/decompression and BCF → IMPUTE2 conversion. |
Then, we convert from BCF to hap/legend/sample format used by IMPUTE2 and SHAPEIT:
bcftools convert chr21_EUR_panel.bcf --haplegendsample chr21_EUR_panel
rm chr21_EUR_panel.bcf
where --haplegendsample prefix
converts from BCF to hap/legend/sample format used by IMPUTE2. The columns of the .legend file are ID,POS,REF,ALT. Moreover, in order to prevent strand swaps, the program uses IDs of the form "CHROM:POS_REF_ALT". The columns of the .sample file are population, group and sex.
- All scripts require Python v3.7 or above. The CPython and PyPy implementation are supported.
- The script
MAKE_REF_PANEL.py
requires one of the following: (a) bcftools v1.14 or above, (b) cyvcf2 v0.30.12 or above, (c) pysam v0.18.0 or above. - The script
MAKE_OBS_TAB.py
requires pysam v0.18.0 or above. - The script
ANEUPLOIDY_TEST.py
would perform faster when gmpy2 v2.1.0rc1 is present. - The script
PLOT_PANEL.py
requires matplotlib v3.5.1 or above.
Created by Daniel Ariad (2020).