Skip to content

Diploid assembly based SV detection using cuteSV

JiangTao edited this page Jan 15, 2022 · 7 revisions

With the well acquisition of haplotype-resolved or phased genome assembly, it has great potential for detection of phased SVs. We wrote a script that post-process the typically cuteSV callsets from assembly alignments to generate the diploid-assembly-based SV calls. Here we show a simple demo about the performance about this new function. A step-by-step operation is listed below.

NOTICE

  1. It is beneficial to set --max_size as -1 that enables cuteSV to report full-length SVs.
  2. Setting --max_split_parts as -1 will apply the alignments with all split segments for SV calling.

Get data

  1. Create directory structure
mkdir -p assemblies ref alns calls
  1. Download assembly datasets

We collect two latest published NA24385 (also called HG002) human genome assembly datasets from Wenger et al and Garg et al respectively. Both of the two assemblies were used 29.7x PacBio CCS data. The former applied Canu for contig assembly and Arrow for polishing. The latter applied DipAsm for contig assembly and a 28.5x Hi-C dataset for scaffolding.

echo "Download maternal haplotype from Wenger."
curl-s https://downloads.pacbcloud.com/public/publications/2019-HG002-CCS/asm/HG002_canu_maternal.fasta \
    > assemblies/HG002_canu_maternal.fasta
echo "Download paternal haplotype from Wenger."
curl-s https://downloads.pacbcloud.com/public/publications/2019-HG002-CCS/asm/HG002_canu_paternal.fasta \
    > assemblies/HG002_canu_paternal.fasta
echo "Download haplotype H1 from Garg."
curl-s ftp://ftp.dfci.harvard.edu/pub/hli/whdenovo/asm/NA24385-denovo-H1.fa.gz \
    > assemblies/NA24385-denovo-H1.fa.gz
echo "Download haplotype H2 from Garg."
curl-s ftp://ftp.dfci.harvard.edu/pub/hli/whdenovo/asm/NA24385-denovo-H2.fa.gz \
    > assemblies/NA24385-denovo-H2.fa.gz
  1. Download hg19 reference with decoys and map non-ACGT characters to N
curl -s ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
    > ref/human_hs37d5.fasta.gz
gunzip ref/human_hs37d5.fasta.gz
sed -i '/^[^>]/ y/BDEFHIJKLMNOPQRSUVWXYZbdefhijklmnopqrsuvwxyz/NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN/' \
    ref/human_hs37d5.fasta

Alignment

  1. Add haplotype-unique tag
cat <(cat assemblies/HG002_canu_maternal.fasta | sed 's/>/>cutesvh1_/g') \
    <(cat assemblies/HG002_canu_paternal.fasta | sed 's/>/>cutesvh2_/g') \
    > assemblies/HG002_wenger.fasta
cat <(zcat assemblies/NA24385-denovo-H1.fa.gz | sed 's/>/>cutesvh1_/g') \
    <(zcat assemblies/NA24385-denovo-H2.fa.gz | sed 's/>/>cutesvh2_/g') \
    > assemblies/HG002_garg.fasta
  1. Contigs alignment
conda install minimap2 -c bioconda
minimap2 --paf-no-hit -a -x asm5 --cs -r 2k -t 8 ref/human_hs37d5.fasta assemblies/HG002_wenger.fasta \
    > alns/HG002_wenger.sam
minimap2 --paf-no-hit -a -x asm5 --cs -r 2k -t 8 ref/human_hs37d5.fasta assemblies/HG002_garg.fasta \
    > alns/HG002_garg.sam
  1. Alignments post-processing
conda install samtools -c bioconda
samtools sort -@ 8 -o alns/HG002_wenger.bam alns/HG002_wenger.sam
samtools index -@ 8 alns/HG002_wenger.bam
samtools sort -@ 8 -o alns/HG002_garg.bam alns/HG002_garg.sam
samtools index -@ 8 alns/HG002_garg.bam

Run cuteSV

  1. Install cuteSV
git clone https://github.com/tjiangHIT/cuteSV.git
cd cuteSV && python3 setup install && cd ..
  1. Diploid-assembly-based SV calling
cuteSV alns/HG002_wenger.bam ref/human_hs37d5.fasta calls/HG002_wenger.vcf ./ \
    -s 1 --genotype --report_readid -p -1 -mi 500 -md 500 \
    --max_cluster_bias_INS 1000 \
    --diff_ratio_merging_INS 0.9 \
    --max_cluster_bias_DEL 1000 \
    --diff_ratio_merging_DEL 0.5
python3 cuteSV/src/cuteSV/diploid_calling.py calls/HG002_wenger.vcf calls/HG002_wenger_final.vcf
cuteSV alns/HG002_garg.bam ref/human_hs37d5.fasta calls/HG002_garg.vcf ./ \
    -s 1 --genotype --report_readid -p -1 -mi 500 -md 500 \
    --max_cluster_bias_INS 1000 \
    --diff_ratio_merging_INS 0.9 \
    --max_cluster_bias_DEL 1000 \
    --diff_ratio_merging_DEL 0.5
python3 cuteSV/src/cuteSV/diploid_calling.py calls/HG002_garg.vcf calls/HG002_garg_final.vcf

Evaluation

Assembly F1 % Precision % Recall % GT-F1 % GT-Precision % GT-Recall %
Wenger et al 92.18 93.32 91.08 84.09 79.17 89.65
Garg et al 96.09 95.61 96.58 95.29 94.09 96.52
Clone this wiki locally