Skip to content

Latest commit

 

History

History
322 lines (272 loc) · 15.6 KB

README.md

File metadata and controls

322 lines (272 loc) · 15.6 KB

Seq Experimental Notebook

The following datafiles were used to conduct the experiments:

The following commands will set up the benchmarking environment:

# Location of tools' sources and executables
export TOOLS=`pwd`/tools
# Location of test datasets
export DATA=`pwd`/data
# Locatiion of the local Seq installation
export SEQ=`pwd`/seq
# Make sure to point LD_LIBRARY_PATH, PATH and SEQ_LIBRARY_PATH to your Seq installation directory
export LD_LIBRARY_PATH=$SEQ/build:$SEQ/deps/lib:${LD_LIBRARY_PATH}
export PATH=$SEQ/build:${PATH}
export SEQ_LIBRARY_PATH=$SEQ/build
# Use 1 thread
export OMP_NUM_THREADS=1
# Current working directory
DIR=`pwd`
# Make the necessary directories
mkdir -p $DIR/runs/time
mkdir -p $DIR/runs/stderr
mkdir -p $DIR/runs/stdout
mkdir -p $TOOLS/seq
export OUT=$DIR/runs/out
mkdir -p $OUT

The following function is used to collect the performance metrics of each experiment.

function tm {
  TIME=`which time`  # use gtime on macOS
  NAME=$1
  if grep -qs 'Exit status: 0' $DIR/runs/time/$NAME ; then
    echo -ne "- $NAME already done "
    grep  'Exit status:' $DIR/runs/time/$NAME
  else
    cmd="$TIME -v -o $DIR/runs/time/$NAME ${@:2} >$DIR/runs/stdout/$NAME 2>$DIR/runs/stderr/$NAME"
    eval "$cmd"
    es=$?
    if [ $es -ne 0 ] ; then
      echo "- $NAME failed with $es: $cmd"
      cat $DIR/runs/stderr/$NAME
    else
      echo "- $NAME completed"
    fi
  fi
}

CORA (Homology table construction)

We have used CORA 1.1.5b to evaluate CORA performance.

CORA first needs to be run with default parameters to generate __temporary_CORA_files directory. You can kill CORA once this directory is generated and proceed by running the commands below:

cd $OUT/cora
# Generate inexact homology table
tm cora-inexact $TOOLS/cora/homTable_setup $DATA/genome.fa 64 __temporary_CORA_files/__TEMP__Aux_Physical_Splits_File cora_hom_exact_k64 cora_hom_inexact_k64 __temporary_CORA_files/__TEMP__Aux_Parallel_Splits_File EXACT 1 FULL
# Generate exact homology table (needs 2 steps)
tm cora-exact-1 $TOOLS/cora/homTable_setup $DATA/genome.fa 64 __temporary_CORA_files/__TEMP__Aux_Physical_Splits_File cora_hom_exact_k64 cora_hom_inexact_k64 __temporary_CORA_files/__TEMP__Aux_Parallel_Splits_File EXACT 1 ONLY_EXACT_COMPACT
tm cora-exact-2 $TOOLS/cora/homTable_setup $DATA/genome.fa 64 __temporary_CORA_files/__TEMP__Aux_Physical_Splits_File cora_hom_exact_k64 cora_hom_inexact_k64 __temporary_CORA_files/__TEMP__Aux_Parallel_Splits_File BOTH  1 ONLY_CONSTRUCT
cd ..

For Seq, we used:

# Build
tm seq-build-cora-exact   seqc build -release -o $TOOLS/seq/cora-exact   $TOOLS/seq/cora/hom_exact.seq
tm seq-build-cora-inexact seqc build -release -o $TOOLS/seq/cora-inexact $TOOLS/seq/cora/hom_inexact.seq

# Use 24 threads
export OMP_NUM_THREADS=24
# Generate exact homology table
tm seq-cora-exact $TOOLS/seq/cora-exact   $DATA/genome.fa   $OUT/seq-cora-k64
# Generate inexact homology table
tm seq-cora-inexact $TOOLS/seq/cora-inexact $DATA/genome.fa 1 $OUT/seq-cora-k64
export OMP_NUM_THREADS=1

Seq reported 438k homologies that were not found by CORA; CORA, on the other hand, reported 252k homologies that were not found by Seq. These homologies are available for inspection here. After inspection, we discovered that the homologies not found by Seq are not true homologies due to a bug in the original version (for example, a reported homology between chr6:78300426-78300505 and chr6:78300466-78300545 is incorrect as can be seen by comparing the first and the second region— note the sequence differences at the end).

BWA-MEM (fastmap)

We used BWA 0.7.17 with the following patch that reports the time needed to load an index.

BWA was run as follows:

# Prepare the index
tm bwa-index-chr1 $TOOLS/bwa-0.7.17/bwa index   $DATA/chr1.fa
tm bwa-chr1       $TOOLS/bwa-0.7.17/bwa fastmap $DATA/chr1.fa $DATA/sample.fastq

Seq was run as follows:

# Build prefetch and no-prefetch versions of fastmap
tm seq-build-fastmap-bn seqc build -release -o $TOOLS/seq/fastmap-bn $TOOLS/seq/bwa/fastmap_build.seq
tm seq-build-fastmap-bp "sed 's/^#\@prefetch/\@prefetch/' $TOOLS/seq/bwa/fastmap_build.seq | seqc build -release -o $TOOLS/seq/fastmap-bp -"
# Run both versions
tm seq-fastmap-bn-chr1 $TOOLS/seq/fastmap-bn search $DATA/chr1.fa $DATA/sample.fastq $OUT/seq-fastmap-bn
tm seq-fastmap-bp-chr1 $TOOLS/seq/fastmap-bp search $DATA/chr1.fa $DATA/sample.fastq $OUT/seq-fastmap-bp

Rust code was built with rust-bio v0.32.0 with the following patch that allows proper SMEM reconstruction:

# Build the tool
(cd $TOOLS/rust; cargo build --release)
# Run Rust fastmap
tm rust-fastmap-chr1 $TOOLS/rust/target/release/biorust fastmap $DATA/chr1.fa $DATA/sample.fastq $OUT/rust-fastmap-chr1

SeqAn/C++ code was built with SeqAn v3.0.2 with g++ 10:

# Build the tool
( cd $TOOLS/seqan; 
  g++ -std=c++2a -fconcepts -O3 smem_seqan.cc -o fastmap \
    -I seqan/3.0.2/include -I seqan/3.0.2/include/seqan3/submodules/range-v3/include -I seqan/3.0.2/include/seqan3/submodules/sdsl-lite/include )
# Run SeqAn fastmap
tm seqan-fastmap-chr1 $TOOLS/seqan/fastmap $DATA/chr1.fa $DATA/sample.fastq

mrsFAST

We used mrsFAST v3.4.1 with the following patch that turns off the calculation of MD tag for fair comparison, and that force-enabes SSE4 version to be built.

# Index the genomes
tm mrsfast-index-hg19 $TOOLS/mrsfast/mrsfast --index  $DATA/genome.fa
tm mrsfast-index-chr1 $TOOLS/mrsfast/mrsfast --index  $DATA/chr1.fa
# Inexact version (e=2)
tm mrsfast-chr1 $TOOLS/mrsfast/mrsfast --search $DATA/chr1.fa --seq $DATA/FIN1_1.fastq --crop 96 -e 2 -o $OUT/mrsfast-chr1 -u /dev/null --disable-sam-header --mem 10
# Exact version
tm mrsfast-hg19-all $TOOLS/mrsfast/mrsfast --search $DATA/genome.fa --seq $DATA/FIN1_1.fastq -e 0 -o $OUT/mrsfast-hg19-all -u /dev/null --disable-sam-header --mem 200

Seq version was run as follows:

# Build prefetch and no-prefetch versions of exact match mapper (FM-index)
tm seq-build-mrsfast-exact-n seqc build -release -o $TOOLS/seq/mrsfast-exact-n $TOOLS/seq/mrsfast/exact.seq
tm seq-build-mrsfast-exact-p "sed 's/^#@prefetch/@prefetch/' $TOOLS/seq/mrsfast/exact.seq | seqc build -release -o $TOOLS/seq/mrsfast-exact-p -"
# Build inexact mapper (k-mer index)
tm seq-build-mrsfast-inexact seqc build -release -o $TOOLS/seq/mrsfast-inexact $TOOLS/seq/mrsfast/mrsfast.seq
# Index the genomes
tm seq-mrsfast-index-hg19 $TOOLS/seq/mrsfast-exact-n index  $DATA/genome.fa
tm seq-mrsfast-index-chr1 $TOOLS/seq/mrsfast-inexact index  $DATA/chr1.fa
# Inexact version
tm seq-mrsfast-chr1 $TOOLS/seq/mrsfast-inexact search $DATA/chr1.fa $DATA/FIN1_1.fastq $OUT/seq-mrsfast-chr1
# Exact version
tm seq-mrsfast-hg19-n $TOOLS/seq/mrsfast-exact-n search $DATA/genome.fa $DATA/FIN1_1.fastq $OUT/seq-mrsfast-hg19-n
tm seq-mrsfast-hg19-p $TOOLS/seq/mrsfast-exact-p search $DATA/genome.fa $DATA/FIN1_1.fastq $OUT/seq-mrsfast-hg19-p

minimap2 (Smith-Waterman alignment)

The original minimap2 uses KSW2 library that Seq uses internally for seq.align function. Thus, we just run unoptimized seq.align to measure the performance of minimap2 / KSW2 alignment.

# Build the tool
tm seq-build-sw seqc build -release -o $TOOLS/seq/sw $TOOLS/seq/minimap2/sw.seq
# Run the benchmark. It reports the average time for: 
#  - non-optimized alignment, 
#  - SSE4-optimized inter-alignment, 
#  - AVX2-optimized inter-alignment, and 
#  - AVX512F-optimized inter-alignment (if available).
tm seq-sw $TOOLS/seq/sw $DATA/queries_m $DATA/targets_m

We compared our alignment with rust-bio v0.32.0, SeqAn v3.0.2 and Bio.jl and BioAlignments.jl v1.0.1 in Julia v1.6.0. Other tools (SeqAn, BioJulia and rust-bio) do not support consistently advanced alignment options (e.g. setting a band width, Z-drop score and so on). Thus, we compared a simple version of Seq's seq.align to these tools to maintain the consistency:

# Build and run a simple version of the Seq alignment benchmark
tm seq-build-sw-simple seqc build -release -o $TOOLS/seq/sw-simple $TOOLS/seq/minimap2/sw_simple.seq
tm seq-sw-simple $TOOLS/seq/sw-simple $DATA/queries.small $DATA/targets.small
# Build and run rust-bio alignment benchmark
(cd $TOOLS/rust; cargo build --release)
tm rust-sw-simple $TOOLS/rust/target/release/biorust align $DATA/queries.small $DATA/targets.small
# Build and run SeqAn v3 alignment benchmark
( cd $TOOLS/seqan; 
  g++ -std=c++2a -fconcepts -O3 align_seqan.cc -o sw \
    -I seqan/3.0.2/include -I seqan/3.0.2/include/seqan3/submodules/range-v3/include -I seqan/3.0.2/include/seqan3/submodules/sdsl-lite/include )
tm seqan-sw-simple $TOOLS/seqan/sw $DATA/queries.small $DATA/targets.small
# Run Julia/BioJulia alignment benchmark
tm julia-sw-simple julia --check-bounds=no -O3 $TOOLS/julia/align.jl $DATA/queries.small $DATA/targets.small

AVID (global alignment)

We used a Linux version of AVID v2.1 build 0. (Note: you need to register to download a statically compiled binary. Source code is not available).

As AVID only supports reading a single pair of sequences, we used a wrapper script runavid.py to run AVID and its Seq counterpart on many pairs of sequences and collect the correct time.

# Run AVID on WGAC and SEDEF-generated segmental duplications
tm avid-wgac  python3 $TOOLS/runavid.py $DATA/wgac-segdups  $OUT/avid-wgac  avid
tm avid-sedef python3 $TOOLS/runavid.py $DATA/sedef-segdups $OUT/avid-sedef avid

Seq was run as follows:

# Build Seq-AVID
tm seq-build-sw seqc build -release -o $TOOLS/seq/sw $TOOLS/seq/avid/avid.seq
# Run Seq-AVID on WGAC and SEDEF-generated segmental duplications
tm seq-avid-wgac  python3 $TOOLS/runavid.py $DATA/wgac-segdups  $OUT/seq-avid-wgac  seq
tm seq-avid-sedef python3 $TOOLS/runavid.py $DATA/sedef-segdups $OUT/seq-avid-sedef seq

UMI-tools (barcode whitelisting)

We used UMI-tools v1.1.1:

tm umi-tools umi_tools whitelist --stdin $DATA/hgmm_100_R1.fastq --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN --log2stderr

Seq verson was run as follows:

# Build
tm seq-build-umi seqc build -release -o $TOOLS/seq/umi $TOOLS/seq/umi-tools/whitelist.seq
# Run
tm seq-umi $TOOLS/seq/umi $DATA/hgmm_100_R1.fastq

GATK (SplitNCigar)

To ensure fairness in comparison, all files were converted to single-ended BAM files with no optional tags:

for i in giab k562 cytosol; do 
  ( samtools view $DATA/$i.bam -H; \
    samtools view $DATA/$i.bam \
    | awk '{$1="read"NR; $2=and($2,compl(235)); $7="*"; $8=0; $9=0; print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11}' 'OFS=\t' ) \
  | samtools view - -1 -o $DATA/$i.gatk.bam -@ 4 
done

We used GATK 4.1.4.1 to run SplitNCigar as follows:

for i in giab k562 cytosol; do
  tm gatk-$i $TOOLS/gatk-4.1.4.1/gatk SplitNCigarReads -I $DATA/$i.gatk.bam -O $OUT/gatk-$i.sam -R $DATA/Homo_sapiens_assembly19_1000genomes_decoy.fasta
done

Seq was run as follows:

tm seq-build-gatk seqc build -release -o $TOOLS/seq/gatk $TOOLS/seq/gatk-splitncigar/gatk-split.seq
for i in giab k562 cytosol; do
  tm seq-gatk-$i $TOOLS/seq/gatk $DATA/$i.gatk.bam $DATA/Homo_sapiens_assembly19_1000genomes_decoy.fasta $OUT/gatk-$i.sam
done

HapTree-X (Haplotype phasing)

We compared HapTree-X fd3bef to HapCUT2 v1.2 and phASER v1.1.1.

HapCUT2 was run as follows:

# RNA-seq and exome data:
for i in na12878 exome ; do
  # - Extract fragments
  tm hapcut2-extract-$i $TOOLS/hapcut2/build/extractHAIRS --bam $DATA/$i.bam --VCF $DATA/$i.vcf --out $OUT/${i}-frag-hapcut2
  # - Run phasing
  tm hapcut2-$i $TOOLS/hapcut2/build/HAPCUT2 --fragments $OUT/${i}-frag-hapcut2 --VCF $DATA/$i.vcf --output $OUT/${i}-hapcut2
done

# 10X data:
# - Extract fragments
tm hapcut2-extract-10x $TOOLS/hapcut2/build/extractHAIRS --bam $DATA/10x.bam --VCF $DATA/10x.vcf --out $OUT/10x-frag-hapcut2 --10X 1
# - Link barcode fragments
tm hapcut2-link-10x python3 $TOOLS/hapcut2/utilities/LinkFragments.py --bam $DATA/10x.bam --VCF $DATA/10x.vcf --fragments $OUT/10x-frag-hapcut2 --out $OUT/10x-link-hapcut2
# - Run phasing
tm hapcut2-10x $TOOLS/hapcut2/build/HAPCUT2 --nf 1 --VCF $DATA/10x.vcf --fragments $OUT/10x-link-hapcut2 --output $OUT/10x-hapcut2

phASER was run as follows:

tm phaser python2 $TOOLS/phaser/phaser/phaser.py --vcf $DATA/na12878.vcf.gz --bam $DATA/na12878.bam --paired_end 1 --mapq 255 --baseq 0 --sample NA12878 --o $DATA/na12878.phaser

phASER was run only on RNA-seq sample as it is only designed to phase RNA-seq data.

HapTree-X was run as follows:

# Build
tm seq-build-haptreex seqc build -release -o $TOOLS/seq/haptreex $TOOLS/haptreex/src/main.seq
# Run
tm haptreex-rna   $TOOLS/seq/haptreex -v $DATA/na12878.vcf -r $DATA/na12878.bam -o $OUT/na12878.haptreex -g $DATA/Homo_sapiens.GRCh37.87.gtf
tm haptreex-exome $TOOLS/seq/haptreex -v $DATA/exome.vcf -d $DATA/exome.bam -o $OUT/exome.haptreex
tm haptreex-10x   $TOOLS/seq/haptreex -v $DATA/10x.vcf -d $DATA/10x.bam -o $OUT/10x.haptreex --10x

# Run 10X phasing with 4 threads
export OMP_NUM_THREADS=4
tm haptreex-10x-4 $TOOLS/seq/haptreex -v $DATA/10x.vcf -d $DATA/10x.bam -o $OUT/10x.haptreex-4 --10x