Skip to content
Daniel Ariad edited this page Aug 6, 2022 · 71 revisions

Table of Contents

Introduction

What is LD-PGTA?

Tutorial

Stage 1: Sequence alignment
Stage 2: Construction of the reference panel
Stage 3: Tabulation of allele observations at known SNPs
Stage 4: Generating the statistical models
Stage 5: Model comparison with log likelihood ratios
Stage 6: Plot log-likelihood ratio vs. chromosomal position

Additional documentation

Simulating various forms of trisomy
Balanced ROC Analysis
Description of the IMPUTE2 format for reference panels
Generating IMPUTE2 reference panels from a VCF file
Dependencies

What is LD-PGTA?

Extra or missing chromosomes—a phenomenon termed aneuploidy—frequently arises during human meiosis and embryonic mitosis and is the leading cause of pregnancy loss, including in the context of in vitro fertilization (IVF). While meiotic aneuploidies affect all cells and are deleterious, mitotic errors generate mosaicism, which may be compatible with healthy live birth. Large-scale abnormalities such as triploidy and haploidy also contribute to adverse pregnancy outcomes, but remain hidden from standard sequencing-based approaches to preimplantation genetic testing (PGT-A). The ability to reliably distinguish meiotic and mitotic aneuploidies, as well as abnormalities in genome-wide ploidy may thus prove valuable for enhancing IVF outcomes. Here, we describe a statistical method for distinguishing these forms of aneuploidy based on analysis of low-coverage whole-genome sequencing data, which is the current standard in the field. Our approach overcomes the sparse nature of the data by leveraging allele frequencies and linkage disequilibrium (LD) measured in a population reference panel. The method, which we term LD-informed PGT-A (LD-PGTA), retains high accuracy down to coverage as low as 0.05x and at higher coverage can also distinguish between meiosis I and meiosis II errors based on signatures spanning the centromeres. LD-PGTA provides fundamental insight into the origins of human chromosome abnormalities, as well as a practical tool with the potential to improve genetic testing during IVF.

For more information:

Daniel Ariad, Stephanie M. Yan, Andrea R. Victor, Frank L. Barnes, Christo G. Zouves, Manuel Viotti, Rajiv C. McCoy. “Haplotype-aware inference of human chromosome abnormalities” | PNAS November 16, 2021 118 (46) e2109307118 | bioRxiv:10.1101/2021.05.18.444721 | The study is summarized in this poster

Stage 1: Sequence alignment

Convert the sequence SRR6676163 from SRA to FASTQ

To demonstrate the use of LD-PGTA, we can obtain a trisomic sample from SRA.

wget 'https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-2/SRR6676163/SRR6676163.1'
mv SRR6676163.1 SRR6676163.sra
fastq-dump ./SRR6676163.sra --gzip

Prepare a reference genome for alignment

We next prepare a reference genome file (here GRCh38/hg38; UCSC release, Dec. 2013). In order to accelerate the alignment, we do not include unplaced sequences, centromeric sequences and alternates:

wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr?.fa.gz'
wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr??.fa.gz'
gunzip *.fa.gz
cat *.fa > hg38.fa
rm chr*.fa

The complete GRCh38/hg38 (UCSC release, Dec. 2013) reference genome, which includes unplaced sequences, centromeric sequences and alternates, can be downloaded :

wget 'http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz'
gunzip hg38.fa.gz

Creating the fasta index file

We use the faidx command from samtools to prepare the FASTA index file. This index file provides byte offsets in the FASTA file for each contig. Thus, it allows us to compute exactly where to find a particular reference base at specific genomic coordinates in the FASTA file.

samtools faidx hg38.fa

Prepare the reference index for BWA

For reads shorter than 70 bp we use the software BWA. The reference index is created from the reference genome FASTA by executing:

bwa index -p hg38bwaidx -a bwtsw hg38.fa

See the article and the documentation of the software for more details.

For reads larger than 70 bp we use the software minimap2. The reference index is created from the reference genome FASTA by executing:

minimap2 -x sr -d hg38.sr.mmi hg38.fa

See the article and the documentation of the software for more details.

Align to reference genome

In order to align short reads (<70bp) using bwa, we execute it with the following arguments:

bwa aln -t 6 hg38bwaidx SRR6676163.fastq.gz > SRR6676163.bwa
bwa samse hg38bwaidx SRR6676163.bwa SRR6676163.fastq.gz > SRR6676163.sam,

where the option -t 6 allows multi-threading with 6 threads.

In order to map reads ranging from 0.07 kbp to 1 kbp, we use minimap2 with the following arguments:

minimap2 -t6 -ax sr ../genome_ref_$2/minimap2/hg38.sr.mmi SRR6676163.fastq.gz > SRR6676163.sam,

where the option -t6 allows multi-threading with 6 threads.

Convert from SAM to BAM format

Conversion from SAM to BAM format reduces data size and improves performance. In addition, many applications (such as genome browsers) require the reads in the BAM file to be coordinate-based sorted, i.e., the reads are ordered from “left” to “right” across the reference genome just as you would read across a line in a book.

samtools view -bS -F 4 SRR6676163.sam | samtools sort -@ 6 -o SRR6676163.sorted.bam -

The option -F 4 filters out unmapped reads, while -@ 6 allows samtools to use up to 6 threads.

Next, we create an index from the BAM file with the samtools index subcommand:

samtools index SRR6676163.sorted.bam SRR6676163.sorted.bai

Source: http://bioinformatics-core-shared-training.github.io/cruk-bioinf-sschool/Day1/Sequence%20Alignment_July2015_ShamithSamarajiwa.pdf and http://genomeintelligence.org/?p=1486

Stage 2: Construction of the reference panel

As described in our manuscript, LD-PGTA overcomes the sparse nature of low-coverage data by using information about allele frequencies and haplotype structure as measured in a population reference panel. Here we demonstrate the construction of such a reference panel using data from the 1000 Genomes Project.

Download VCF files from the 1000 Genome Project Phase 3

VCF files from the 1000 Genome Project Phase 3 aligned against GRCh38/hg38:

wget 'ftp://ftp.sra.ebi.ac.uk/vol1/ERZ822/ERZ822766/*'

This call set is also available from:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/.

Reference: Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project

Generate list of individuals to include in the creation of the reference panel

Create an IMPUTE2 sample file, as explained in this section. In order to generate such a list for the 1000 genome project phase 1\3 visit https://www.internationalgenome.org/data-portal/sample.

One important note: the columns of a SAMPLE file for population, group, and sex expected to be filled by the user. LD-PGTA uses the values in the group column to differentiate between samples that are associated to distinct super-populations, and thus providing this information is required.

Generate a reference panel

The script MAKE_REF_PANEL.py creates reference panels for LD-PGTA, using phased genotypes in VCF files. A reference panel of LD-PGTA follows the structure of the IMPUTE2 format and consist of three files: sample, legend and haplotypes. However,the reference panel is stored as binary files for efficient storage and retrieval.

We run the script with the following arguments and flags:

python3 MAKE_REF_PANEL.py EUR_panel.sample ALL.chr21.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --mask 20160622.chr1.mask.fasta.gz,

where the first argument is the IMPUTE2 sample file. The second required argument is a VCF filename that correspond to a single chromosome. The third argument is an accessibility mask file in a gzipped FASTA format and is optional. Supplying an accessibility mask file will reduce false SNPs in regions of the genome that are less accessible to NGS methods. GRCh38 genome accessibility masks for 1000 Genomes data are available from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20160622_genome_mask_GRCh38/.

In addition, the following flags are supported:

Optional argument Description
--ignore-duplicates Ignore multiple records with the same chromosomal position.
--output-directory <PATH> The directory in which the reference panel would be created.
--force-module cyvcf2/bcftools/pysam By default cyvcf2 module would be used. This allows to use pysam or bcftools instead.

Stage 3: Tabulation of allele observations at known SNPs

The script MAKE_OBS_TAB.py uses a BAM file, which is aligned and sorted, as well as a legend file. It then builds a nested list that contains observed alleles at SNPs positions together with their chromosome position and the associated read ID. The nested list is stored in a pickle format.

We run the script with the following arguments and flags:

python3 MAKE_OBS_TAB.py SRR6676163.bam EUR_panel.chr21.legend.gz EUR_panel.chr21.sample.gz --min-bq 30 --min-mq 30 --max-depth 0 --handle-multiple-observations all,

where the first argument is the filename of the BAM file, containing reads mapped to a reference sequence. The second required argument is the filename of the legend file, which lists the SNPs positions. The third required argument is the filename of the SAMPLE file, which lists the reference samples and their associated population, group/superpopulation and sex. In addition, the following flags are supported:

Optional argument Description
--fasta_filename <FILENAME> The faidx-indexed reference file in the FASTA format. Supplying a reference file will reduce false SNPs caused by misalignments using the Base Alignment Quality (BAQ) method described in the paper “Improving SNP discovery by base alignment quality”, Heng Li, Bioinformatics, Volume 27, Issue 8.
--handle-multiple-observations <all/first/random/skip> We expect to observe at most a single base per SNP. When encountering an exception, the default behavior is to skip the SNP. However, a few alternative options to handle multiple observations are available: (a) take the first observed base, (b) randomly pick an observed base and (c) keep all the observed bases.
--min-bq <INT> Minimum base quaility for observations. Default value is 30.
--min-mq <INT> Minimum mapping quaility for observations. Default value 30.
--max-depth <FLOAT> Maximum depth coverage to be considered (inclusive). Default value is 0, effectively removing the depth limit.
--output-filename <FILENAME> Output filename. The default filename is the same as the BAM filename, but with an extension of .chr_id.obs.p .

Stage 4: Generating the statistical models

The script BUILD_STATISTICAL_MODEL.py generates a file that encodes efficiently all the statistical models for up to N reads. , which are required for the next step (running the script ANEUPLOIDY_TEST.py. The script builds statistical models for 4 scenarios, namely, BPH, SPH, disomy and monosomy, where BPH (Both Parental Homologs) correspond to the presence of three unmatched haplotypes, while SPH (Single Parental Homolog) correspond to chromosome gains involving identical homologs.

As demonstration, we run the script with the following arguments and flags:

python3 MAKE_STATISTICAL_MODEL.py 12,

where the first argument is the maximal number of sampled reads that the statistical model supports. The number of terms in the statistical models grows exponentially as the number of sampled reads increases. Thus, we recommend to generate the statistical models for up to 12 sampled reads.

Stage 5: Model comparison with log likelihood ratios

The script ANEUPLOIDY_TEST.py builds a dictionary that lists linkage disequilibrium (LD) blocks that contain at least three meaningful reads and gives the associated log-likelihood BPH/SPH ratio (LLR). BPH (Both Parental Homologs) describes to the presence of three unmatched haplotypes, while SPH (Single Parental Homolog) describes the presence of two identical haplotypes, with an additional unmatched haplotype. Finally, the script analyzes the dictionary and reports summary statistics, including the mean LLR, the standard error, the number of informative genomic windows and the fraction of genomic windows with a negative LLR.

As demonstration, we run the script with the following arguments and flags:

python3 ANEUPLOIDY_TEST.py SRR6676163.obs.p EUR_panel.chr21.legend EUR_panel.chr21.hap EUR --window-size 0 --min-reads 6 --max-reads 4 --compress bz2,

where the first argument is the filename of a pickle file created by MAKE_OBS_TAB, containing base observations at known SNP positions. The second argument is the filename of the reference panel legend, which lists the SNPs in our reference panel. The third required argument is the filename of the reference panel haplotypes, which lists the haplotypes in our reference panel. The forth argument is ancestral makeup of SRR6676163. In general, the ancestral makeup is specified as follows:

  1. For non-admixtures the argument consists a single superpopulation, e.g., EUR.
  2. For recent admixtures the argument consists two superpopulations, e.g., EUR EAS.
  3. For distant admixtures the argument consists of the superpoplations and their proportions, e.g, EUR 0.8 EAS 0.1 SAS 0.1.

In addition, the following optional arguments are supported:

Optional argument Description
--window-size <INT> Specifies the size of the genomic window. The default value is 100 kbp. When given a zero-size genomic window, it adjusts the size of the window to include min-reads reads.
--offset <INT> Shifts all the genomic windows by the requested base pairs. The default value is 0.
--min-reads <INT> Takes into account only genomic windows with at least INT reads, admitting non-zero score. The minimal value is 3, while the default is 6.
--max-reads <INT> Selects up to INT reads from each genomic windows in each bootstrap sampling. The minimal value is 2, while the default value is 4.
--min-HF <FLOAT> Only haplotypes with a frequnecy between FLOAT and 1-FLOAT add to the score of a read. The default value is 0.05.
--min-score <INT> Consider only reads that reach the minimal score. The default value is 2.
--output-filename <FILENAME> The output filename. The default is the input filename with the extension ".obs.p" replaced by ".LLR.p".
--compress gz/bz2/unc Output compressed via gzip, bzip2 or uncompressed. Default is uncompressed.
  • When the python module gmpy2 is present, the script would use its implementation of popcount to boost performance. gmpy2 is an implementation of a standard GMP (GNU Multiple Precision Arithmetic Library) and supports arbitrary precision integers. For more information check gmpy2 and also gmpy2’s documentation.

Stage 6: Plot log-likelihood ratios vs. chromosomal position.

The script PLOT_PANEL.py plots log-likelihood ratio vs. chromosomal position from a LLR file.

As demonstration, we run the script with the following arguments and flags:

python3 PLOT_PANEL.py SRR6676163.LLR.p,

where the first argument is the filename of a LLR file created by ANEUPLOIDY_TEST, containing likelihoods to observe various aneuploidy landscapes. When a few LLR files are given, a panel of plots would be produced. In addition, the following flags are supported:

Optional argument Description
--pairs Plots the LLR between scenario A and scenario B along the chromosome. The possible pairs are: BPH,DISOMY; DISOMY,SPH; SPH,MONOSOMY; DISOMY,MONOSOMY; BPH,SPH. In addition, giving a list of pairs would plot the LLR of each pair in the same figure, e.g. "BPH,SPH SPH,MONOSOMY". The default value is BPH,SPH.
--bin-size The bin size in which the chromosome is divided. The default value is 4,000,000 bp.
--z-score The z-score value for the confidence intervals. The default value is 1.96, which corresponds to confidence level of 95%.

Simulating various forms of trisomy

To further demonstrate and evaluate LD-PGTA and other methods, it can be useful to simulate trisomies (as well as other ploidy configurations) with various haplotype patterns. This can be accomplished by drawing haplotypes from a reference panel such as the 1000 Genomes Project, and mixing them in defined proportions.

Creating mixtures of haploid sequences

The script MIX_HAPLOIDS.py simulates observed bases at known SNP positions from an aneuploid cell, based on mixtures of haploid sequences. The simulation supports various aneuploidy configurations:

  1. Monosomy - a single copy of a chromosome.
  2. Disomy - two unmatched haplotypes.
  3. SPH (`single parental homolog') - trisomy with two identical haplotypes and one unique haplotype.
  4. BPH (`both parental homologs') - trisomy with three distinct haplotypes.
  5. Realistic meiotic-origin trisomy with recombination, characterized by a transitions between SPH to BPH tracts.

We demonstrate the script with the following arguments and flags:

python3 MIX_HAPLOIDS.py HAPLOID1.obs.p HAPLOID2.obs.p HAPLOID3.obs.p --depth 0.01 --read-length 36 --scenarios monosomy SPH disomy BPH,

where the three first arguments are the filenames of observation tables created by MAKE_OBS_TAB for each haploid sequence. In addition, some of the following optional flags were used:

Optional argument Description
--depth <FLOAT> The average coverage for the whole chromosome. Default value 0.1
--read-length <INT> The number of base pairs (bp) sequenced from a DNA fragment. Default value 36.
--scenarios <LIST OF STRINGS> The simulation supports five scenarios: monosomy/disomy/SPH/BPH/transitions. Default scenario is disomy. Giving a list of scenarios, e.g. "SPH BPH" would create a batch of simulations.
--output-filename <FILENAME> The output filename. The default filename is a combination of three obs filenames.
--compress <gz/bz2/unc> Output compressed via gzip, bzip2 or uncompressed. Default is uncompressed.
--transitions <STR,FLOAT,FLOAT,...,FLOAT,> Relevant only for transitions scenario. Introduces transitions between SPH and BPH along the chromosome. The locations of the transition is determined by a fraction of chromosome length, ranging between 0 to 1. For example a BPH-SPH-BPH transition that equally divides the chromosomes is exressed as BPH,0.333,0.666 and, similarly, a SPH-BPH transition at the middle of the chromosome is expressed as SPH,0.5. In addition, giving a list of cases, e.g. "SPH,0.2 SPH,0.4 SPH,0.6" would create a batch of three simulations.
--distant-admixture <FLOAT FLOAT> Assume a distant admixture with a certain ancestry proportion, e.g, AFR 0.8 EUR 0.2. In addition, the order of observation tables that are given as arguments is important; Odd positions are associated with population 1, while even positions with population 2. For example, in order to simulate a SPH case the observation tables should be given as follows: "python MIX_HAPLOIDS -s SPH HAPLOID1_AFR.obs.p HAPLOID2_EUR.obs.p HAPLOID3_AFR.obs.p HAPLOID4_EUR.obs.p". When simulating SPH, the first two observation tables would be associated with the duplicated homolog.

Create haploid-like observation table using phased genotypes from VCF files.

The script EXTRACT_GENOTYPES.py simulates two observation tables of a haploid, using phased genotypes from VCF files. Each observation table includes SNP positions, "alleles from the same haplotype of a certain individual and their corresponding line number within the IMPUTE2 legend file.

We run the script with the following arguments and flags:

python3 EXTRACT_GENOTYPES.py ALL.chr21.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz chr21_EUR_panel.legend chr21 HG00097

where the first argument is the filename of a VCF file, containing phased genotypes. The second required argument is the filename of the IMPUTE2 legend file, which lists the SNPs positions. The third and forth arguments are the chromosome ID and sample ID, respectively.

Create haploid-like observation table using phased genotypes from IMPUTE2 files.

The script IMPUTE2OBS.py simulates two observation tables of a haploid, using phased genotypes from IMPUTE2 files. Each observation table includes SNP positions, alleles from the same haplotype of a certain individual and their corresponding line number within the IMPUTE2 legend file.

We run the script with the following arguments and flags:

python IMPUTE2OBS.py chr21_EUR_panel.legend chr21_EUR_panel.hap chr21_EUR_panel.samples chr21 HG00097

where the first argument is the filename of a legend file, describing the SNPs. The second argument is the filename of the IMPUTE2 haplotypes file, which contains a table of haplotypes. The thirds argument is the filename of the IMPUTE2 sample file, which lists the individuals. The forth and fifth arguments are the chromosome ID and sample ID, respectively.

Balanced ROC Analysis

Based on simulated scenario that was created by MIX_HAPLOIDS and analyzed by ANEUPLOIDY_TEST, balanced ROC (Receiver Operating Characteristic) curves for predicting BPH (both parental homologs) and SPH (single parental homologs) are created.

The balanced ROC curve is a plot of BTPR (Balanced True Positive Rate) vs. BFPR (Balanced False Positive Rate), where BTPR ≡ 0.5(TPRBPH + TPRSPH) and BFPR ≡ 0.5(FPRBPH + FPRSPH).

The genome is divided into bins. For each simulated data, the mean LLR and the standard deviation for a bin are calculated. A positive (negative) prediction is made if the bounds of the confidence interval lie on the positive (negative) side of the number line. In each bin, the z-score is varied and the balanced positive and negative rates are calculated for each value it takes.

We run the script with the following arguments and flags:

python BALANCED_ROC_CURVE.py ~/results/non_masked/nonadmixed_EAS_chr16/ nonadmixed_EAS_chr16.p

where the first argument is the path of a directory that contains LLR files with simulated scenarios and the second argument is the output filename. In addition, the following optional arguments are supported:

Optional argument Description
--number-of-bins <INT> The genome is divided into bins and for each bin a ROC curve is calculated. Default value is 15.
--compress <gz/bz2/unc> Output compressed via gzip, bzip2 or uncompressed. Default is uncompressed.
--scenarios <BPH/SPH/disomy/monosomy> Two simulated scenarios for which a balanced ROC curve would be created. The default is "BPH SPH".
--ancestral-makeup Apply a criterion for the ancestral makeup:
a. For non-admixtures the argument consists a single superpopulation, e.g., EUR.
b. For recent admixtures the argument consists two superpopulations, e.g., EUR EAS.
c. For distant admixtures the argument consists of the superpoplations and their proportions, e.g, EUR 0.8 EAS 0.1 SAS 0.1.
--chr-id <STR> Apply a criterion for the chromosome number, e.g., chrX.
--window-size <INT> Apply a criterion for the size of the genomic window.
--subsamples <INT> Apply a criterion for the number of subsamples per genomic window.
--min-reads <INT> Apply a criterion for the minimal number of reads in a genomic window, admitting non-zero score.
--max-reads <INT> Apply a criterion for the maximal number of sampled reads per bootstrap iteration.
--min-HF <FLOAT> Apply a criterion for the minimal haplotype frequency
--min-score <INT> Apply a criterion for the minimal score of a read.
--read-length <INT> Apply a criterion for the number of base pairs in read.
--depth <INT> Apply a criterion for the depth of coverage.

Description of the IMPUTE2 format for reference panels

Adopted from the documentation of SHAPEIT, https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#haplegsample

This is a default haplotype text file format used by IMPUTE2 to encode the haplotypes of the 1,000 Genomes Project. To describe it briefly, consider 4 unrelated individuals (Two white American CEU1 & CEU2, an English GBR1 & a Nigerian YRI1) for which haplotypes are available at 3 SNPs. The SNPs are located at positions 123, 456 and 789 and are donoted by SNP123, SNP456 & SNP789, respectively:

Individuals Haplotypes SNP123 SNP456 SNP789
CEU1 hap1 A T A
CEU1 hap2 A C T
CEU2 hap1 G C T
CEU2 hap2 A T A
GBR1 hap1 A T T
GBR1 hap2 A C T
YRI1 hap1 G T T
YRI1 hap2 G C T

SAMPLE file

The SAMPLE file describes the individuals. The minimal SAMPLE file corresponding to the example dataset is:

sample population group sex
CEU1 CEU EUR 1
CEU2 CEU EUR 2
GBR1 GBR EUR 2
YRI1 YRI AFR 2

It is SPACE delimited. The first line is a header line that describe the content of the file. Then, each line corresponds to a single individual. The first four columns are:

  • Individual ID
  • Population
  • Superpopulation
  • Sex (1 for male, 2 for female)

LD-PGTA requires at least 4 columns in the SAMPLE file. Additional columns can be added but SHAPEIT will ignore them. This file should have N+1 lines and at least 4 columns where N is the number of individuals in the reference panel. Each individual must have unique IDs containing only alphanumeric characters. Each individual can have 2 associated group IDs for subsetting purpose.

LEGEND file

The LEGEND file describes the SNPs. The minimal LEGEND file corresponding to the example dataset is:

id position ref alt
SNP1 123 A G
SNP2 456 T C
SNP3 789 A T

This file is SPACE delimited. The first line is a header line that describe the content of the file. Each line corresponds to a single SNP. The first four columns are:

  • SNP ID [string]
  • SNP Position [integer]
  • Reference allele [string]
  • Alternative allele [string]

LD-PGTA requires 4 columns in the LEGEND file. This file should have L+1 lines and at least 4 columns where L is the number of SNPs in the reference panel.

HAP file

The HAP file contains the haplotypes. The HAP file corresponding to the example dataset is:

0 0 1 0 0 0 1 1
0 1 1 0 0 1 0 1
0 1 1 0 1 1 1 1

This file is SPACE delimited. Each line corresponds to a single SNP. Each successive column pair (0, 1), (2, 3), (4, 5) and (6, 7) corresponds to the alleles carried at the 4 SNPs by each haplotype of a single individual. For example a pair "1 0" means that the first haplotype carries the alternative allele while the second carries the reference allele as specified in the LEGEND file. The haplotypes are given in the same order than in the SAMPLE file. This file should have L lines and 2N columns, where L and N are the numbers of SNPs and individuals respectively.

Generating IMPUTE2 reference panels from a VCF file

Here we create an IMPUTE2 reference panel using bcftools. First, create a file containing a list of individuals to include in subsequent analysis. Each individual ID (as defined in the VCF header line) should be included on a separate line. In order to generate such a list for the 1000 genome project phase 1\3 visit https://www.internationalgenome.org/data-portal/sample.

The next step is to convert the reference panel VCF to BCF format:

bcftools view \
ALL.chr21.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--sample-file EUR_indv_phase3.txt \
--exclude-types indels,mnps,ref,bnd,other \
--min-alleles 2 \
--max-alleles 2 \
--min-ac 1:minor \
--phased \
--exclude 'AN!=2*N_SAMPLES' \
--output-file chr21_EUR_panel.bcf \
--output-type u\

The arguments that were used:

Argument Description
--sample-file EUR_indv_phase3.txt File of sample names to include with one sample per line. The sample order is updated to reflect that given in the input file.
--exclude-types indels,mnps,ref,bnd,other Select only sites with SNP variants across all samples, by excluding any other type. Types are determined by comparing the REF and ALT alleles in the VCF record not INFO tags.
--min-alleles 2 --max-alleles 2 Select only biallelic SNPs, by selecting sites with (at least and at most) two alleles listed in REF and ALT columns.
--min-ac 1:minor Select only SNPs with a minor allele count that reaches a defined threshold.
--phased --exclude 'AN!=2*N_SAMPLES' An IMPUTE reference-panel format requires phased data and bi-allelic sites. In order to make sure that no data is missing, sites where the total number of alleles across all samples is not equal to twice the number of samples are excluded.
--output-file chr21_EUR_panel.bcf --output-type u Here we specify the output file name and format. We choose an uncompressed BCF in order to speed up performance by removing unnecessary compression/decompression and BCF → IMPUTE2 conversion.

Then, we convert from BCF to hap/legend/sample format used by IMPUTE2 and SHAPEIT:

bcftools convert chr21_EUR_panel.bcf --haplegendsample chr21_EUR_panel
rm chr21_EUR_panel.bcf

where --haplegendsample prefix converts from BCF to hap/legend/sample format used by IMPUTE2. The columns of the .legend file are ID,POS,REF,ALT. Moreover, in order to prevent strand swaps, the program uses IDs of the form "CHROM:POS_REF_ALT". The columns of the .sample file are population, group and sex.

Dependencies

  • All scripts require Python v3.8 or above. The CPython and PyPy implementation are supported.
  • The script MAKE_REF_PANEL.py requires one of the following: (a) bcftools v1.14 or above, (b) cyvcf2 v0.30.12 or above, (c) pysam v0.18.0 or above.
  • The script MAKE_OBS_TAB.py requires pysam v0.18.0 or above.
  • The script ANEUPLOIDY_TEST.py would perform faster when gmpy2 v2.1.0rc1 is present.
  • The script PLOT_PANEL.py requires matplotlib v3.5.1 or above.