GitHub - qinqian/minisv.py: Parsing GAF and PAF file format

Getting started

mamba create -n msvpy python==3.12 poetry
conda activate msvpy
git clone https://github.com/qinqian/minisv.py
cd minisv.py && make

# extract candidate SVs from hg38- and pangenome-based long read alignments
minisv extract -n COLO829T tests/demo/COLO829T.hs38l.paf.gz | gzip - > COLO829T_hs38l.rsv.gz
minisv extract -x 5 -n COLO829T tests/demo/COLO829T.chm13g.paf.gz | gzip - > COLO829T_chm13g.rsv.gz
minisv extract -n COLO829BL tests/demo/COLO829BL.hs38l.paf.gz | gzip - > COLO829BL.rsv.gz

# joint filtering analysis
minisv isec COLO829T_hs38l.rsv.gz COLO829T_chm13g.rsv.gz | gzip > COLO829T.rsv.gz

# single-sample germline calling
zcat COLO829T_hs38l.rsv.gz | sort -k1,1 -k2,2n -S4G | minisv merge - > COLO829T_germline.msv

# single-sample mosaic calling
zcat COLO829T.rsv.gz | sort -k1,1 -k2,2n -S4G | minisv merge - > COLO829T_mosaic.msv

# tumor-normal paired somatic calling
zcat COLO829{T,BL}.rsv.gz | sort -k1,1 -k2,2n -S4G | minisv merge - | grep COLO829T | grep -v COLO829BL > COLO829T_somatic.msv

# convert to vcf
minisv genvcf COLO829T_germline.msv > COLO829T_germline.vcf
minisv genvcf COLO829T_mosaic.msv > COLO829T_mosaic.vcf
minisv genvcf COLO829T_somatic.msv > COLO829T_somatic.vcf

Introduction

Minisv is a lightweight mosaic/somatic structural variation (SV) caller for long genomic reads. Different from other SV callers, it prefers to combine read alignments against multiple reference genomes. Minisv retains an SV on a read only if the SV is observed on the read alignments against all references. This simple strategy reduces alignment errors and filters out germline SVs in the sample (when used with the assembly of input reads) or in the population (when used with pangenomes).

Given PacBio HiFi reads at high coverage, minisv achieves higher specificity for mosaic SV calling, has comparable accuracy to tumor-normal paired SV callers, and is the only tool accurate enough for calling large chromosomal alterations (CAs) from a single tumor sample without matched normal.

minisv.py is a equivalent implementation as minisv.js, and add extra functions.

Data preprocessing

Minisv seamlessly works with PAF or GAF (graph PAF) formats. It requires the ds:Z tag outputted by minigraph-0.21+ or minimap2-2.28+. For minigraph, use -cxlr for long reads. For minimap2, use -cxmap-hifi -s50 --ds for HiFi reads and -cxlr:hq for ONT reads. The minimap2 option for Nanopore reads varies with the read error rate.

Getting the reference genomes,

wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz
wget -c https://zenodo.org/records/6983934/files/chm13-90c.r518.gfa.gz?download=1 -O chm13-90c.r518.gfa.gz

Optional step: de novo assembly of paired normal sample

Generate denovo assembly from the paired normal sample using hifiasm.

hifiasm -t32 -o normal_asm normal.fastq.gz
# for ONT
hifiasm --ont -t32 -o normal_asm normal.fastq.gz

gfatools gfa2fa normal_asm.bp.hap1.p_ctg.gfa | bgzip -c - > normal_asm.bp.hap1.p_ctg.fa.gz
gfatools gfa2fa normal_asm.bp.hap2.p_ctg.gfa | bgzip -c - > normal_asm.bp.hap2.p_ctg.fa.gz

Hifi read alignment

# hg38
minimap2 -cx map-hifi -s50 --ds -t 24 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz tumor.fastq.gz | gzip - > tumor_hg38.paf.gz
minimap2 -cx map-hifi -s50 --ds -t 24 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz normal.fastq.gz | gzip - > normal_hg38.paf.gz
# T2T
minimap2 -cx map-hifi -s50 --ds -t 24 chm13v2.0.fa.gz tumor.fastq.gz | gzip - > tumor_chm13.paf.gz
# denovo assembly
minimap2 -cx map-hifi -s50 --ds -t 24 <(zcat normal_asm.bp.hap1.p_ctg.fa.gz normal_asm.bp.hap2.p_ctg.fa.gz) -I100g --secondary=no tumor.fastq.gz | gzip - > tumor_self.paf.gz

# pangenome graph genome
minigraph -cxlr -t 24 --ds chm13-90c.r518.gfa.gz tumor.fastq.gz | gzip - > tumor_chm13g.paf.gz

ONT read alignment

# hg38
minimap2 -cxlr:hq --ds -t 24 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz tumor.fastq.gz | gzip - > tumor_hg38.paf.gz
minimap2 -cxlr:hq -s50 --ds -t 24 GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz normal.fastq.gz | gzip - > normal_hg38.paf.gz
# T2T
minimap2 -cxlr:hq -s50 --ds -t 24 chm13v2.0.fa.gz tumor.fastq.gz | gzip - > tumor_chm13.paf.gz
# denovo assembly
minimap2 -cxlr:hq -s50 --ds -t 24 <(zcat normal_asm.bp.hap1.p_ctg.fa.gz normal_asm.bp.hap2.p_ctg.fa.gz) -I100g --secondary=no tumor.fastq.gz | gzip - > tumor_self.paf.gz

# pangenome graph genome
minigraph -cxlr -t 24 --ds chm13-90c.r518.gfa.gz tumor.fastq.gz | gzip - > tumor_chm13g.paf.gz

Minisv standalone mode design

Minisv can call 1) germline SVs, 2) somatic SVs in tumor-normal pairs, 3) large somatic SVs in a single tumor, 4) mosaic SVs and 5) de novo SVs in a trio; the exact use also depends on the input data types. Minisv achieves this variety of calling modes in three steps:

Extract SVs from read alignment with extract. This command processes one read at a time, extracts long INDELs or breakends and outputs in a BED-like minisv format. If you have tumor-normal pairs, remember to use -n to specify samples such that you can distinguish the samples in step 3.
Intersect candidate SVs extracted from multiple alignment files with isec. You can skip this step if you use one reference genome only, but you would lose the key advantage of minisv. The first two steps are shared by all calling modes, though the input alignments may be different.
Call SVs supported by multiple reads using merge. This command counts the number of supporting reads for each sample. You can filter out SVs supported by normal reads to call somatic SVs - this is how minisv works with multiple samples jointly.

Calling SVs

Germline SVs

minisv extract -b data/hs38.cen-mask.bed tumor_hg38.paf.gz > sv.hg38l.rsv
cat sv.hg38l.rsv | sort -k1,1 -k2,2n -S4g | minisv merge - > sv.hg38l.msv
minisv genvcf sv.hg38l.msv > sv.hg38l.vcf

For calling germline SVs, you only need one linear reference genome. This is the simplest use case. However, minisv does not infer genotypes. It is not the best tool for germline SV calling.

Somatic SVs in tumor-normal pairs

minisv extract -n TUMOR -b data/hs38.cen-mask.bed tumor_hg38.paf.gz | gzip - > tumor_hs38l.rsv.gz
minisv extract -n TUMOR tumor_chm13.paf.gz | gzip - > tumor_chm13l.rsv.gz
minisv extract -x 5 -n TUMOR tumor_chm13g.paf.gz | gzip - > tumor_chm13g.rsv.gz
minisv extract -q 0 -x 0 -n TUMOR tumor_self.paf.gz | gzip - > tumor_self.rsv.gz

minisv extract -n NORMAL normal_hg38.paf.gz | gzip - > normal.rsv.gz

# joint filtering analysis
minisv isec tumor_hs38l.rsv.gz tumor_chm13l.rsv.gz tumor_chm13g.rsv.gz tumor_self.rsv.gz | gzip > tumor_tgs_filter.rsv.gz

zcat tumor_tgs_filter.rsv.gz normal.rsv.gz | sort -k1,1 -k2,2 -S4g \
  | minisv merge - | grep TUMOR | grep -v NORMAL > sv-paired.hg38l+tgs.msv

The last command selects SVs only present in TUMOR but not in NORMAL.

Graph alignment greatly reduces alignment errors when the normal assembly is not available. When you have the normal assembly, intersecting with graph alignment can be optional. While graph alignment still improves specificity, it may affect sensitivity a little - a classical sensitivity vs specificity problem.

Calling de novo SVs will be similar.

Large somatic SVs in tumor-only samples

minisv extract -n TUMOR -b data/hs38.cen-mask.bed tumor_hg38.paf.gz | gzip - > tumor_hs38l.rsv.gz
minisv extract -n TUMOR tumor_chm13.paf.gz | gzip - > tumor_chm13l.rsv.gz
minisv extract -x 5 -n TUMOR tumor_chm13g.paf.gz | gzip - > tumor_chm13g.rsv.gz

# joint filtering analysis
minisv isec tumor_hs38l.rsv.gz tumor_chm13l.rsv.gz tumor_chm13g.rsv.gz | gzip > tumor_tg_filter.rsv.gz

zcat tumor_tg_filter.rsv.gz | sort -k1,1 -k2,2 -S4g | minisv merge - > sv.hg38l+tg.msv
minisv.js view -IC sv.hg38l+tg.msv

In the lack of the normal assembly in this case, pangenome graph alignment is critical for reducing alignment errors and for filtering out common germline SVs. There may be several hundred SVs called with this procedure. Most of the called SVs below 10 kb are rare germline events, not somatic. Nevertheless, most large chromosomal alterations, such as translocations, foldback inversions (tagged by foldback in the output) and events longer 100 kb, are likely somatic becase such large alterations rarely occur to germline.

Mosaic SVs

Mosaic calling can be applied to a single sample from either normal or tumor sample, take tumor sample as an example below.

minisv extract -n TUMOR -b data/hs38.cen-mask.bed tumor_hg38.paf.gz | gzip - > tumor_hs38l.rsv.gz
minisv extract -n TUMOR tumor_chm13.paf.gz | gzip - > tumor_chm13l.rsv.gz
minisv extract -x 5 -n TUMOR tumor_chm13g.paf.gz | gzip - > tumor_chm13g.rsv.gz
minisv extract -q 0 -x 0 -n TUMOR tumor_self.paf.gz | gzip - > tumor_self.rsv.gz

minisv isec tumor_hs38l.rsv.gz tumor_chm13l.rsv.gz tumor_chm13g.rsv.gz tumor_self.rsv.gz | gzip > tumor_tgs_filter.rsv.gz
zcat tumor_tgs_filter.rsv.gz | sort -k1,1 -k2,2 -S4g | minisv.js merge -c2 -s0 - > sv.hg38l+tgs.msv

Having the phased sample assembly is critical to the calling of small mosaic SVs. If you do not have the assembly, please perform graph alignment. However, because there are more small rare SVs than small mosaic SVs, only large mosaic chromosomal alteration calls are reliable. If you have the assembly, graph alignment may still help specificity but may hurt sensitivity around VNTRs.

Generic filtering for other callers

Downloading example dataset from the zenodo, use COLO829_hifi1.tar.gz as an example.

tar xvfz COLO829_hifi1.tar.gz
cd COLO829_hifi1

# de novo assembly-based filtering of severus
minisv filterasm -c 5 -a -b minisv.py/data/hs38.cen-mask.bed severus/read_ids.csv minisv/COLO829T.self.Q0.gsv.gz severus_filter.stat severus/severus_somatic.vcf > severus_asm.vcf

# de novo assembly-based filtering of sniffles2
minisv filterasm -c 5 -a -b minisv.py/data/hs38.reg.bed sniffles2/grch38_snf_somatic.vcf minisv/COLO829T.self.Q0.gsv.gz snf_filter.stat sniffles2/grch38_snf_somatic.vcf > snf_asm.vcf

# de novo assembly-b5sed filtering of savana
minisv filterasm -c 5 -a -b minisv.py/data/hs38.reg.bed savana/grch38_T_tag.sv_breakpoints_read_support.tsv minisv/COLO829T.self.Q0.gsv.gz savana_filter.stat savana/grch38_T_tag.classified.somatic.vcf > sanava_asm.vcf

# de novo assembly-b5sed filtering of nanomonsv
minisv filterasm -c 5 -a -b minisv.py/data/hs38.reg.bed nanomonsv/grch38_parse.nanomonsv.supporting_read.txt minisv/COLO829T.self.Q0.gsv.gz nanomonsv_filter.stat nanomonsv/grch38_tnpair.vcf > nanomonsv_asm.vcf

# de novo assembly-b5sed filtering of minisv ltg
minisv filterasm -c 5 -a -b minisv.py/data/hs38.reg.bed minisv/COLO829_hifi1T.hg38l+tg.pair-c2s1.msv minisv/COLO829T.self.Q0.gsv.gz msv_filter.stat minisv/COLO829_hifi1T.hg38l+tg.pair-c2s1.msv > msv_filter.msv

Ensembling filtered caller results

The ensemble depends on the union function from minisv.js. TODO: add union in minsv.py. This function takes the filtered SV call sets from each caller after denovo assembly-based filtering, then ensemble the SV sets into a in silico truth set. ensembleunion collapsed the truth set by selecting one SV per group in the union output.

minisv.js union -p -c 5 -b minisv.py/data/hs38.reg.bed severus_asm.vcf msv_filter.msv sanava_asm.vcf nanomonsv_asm.vcf > ensemble.vcf
minisv ensembleunion ensemble.vcf > ensemble_collapsed.vcf

Comparing SVs

The eval command of minisv compares two or multiple SV callsets. To compare two callsets:

minisv eval -b data/hs38.reg.bed -l 100 call1.vcf call2.msv

where -l specifies the minimum SV length and -b specifies confident regions. The command line outputs TP, FN and FP. Minisv considers two SVs, S1 and S2, to be the same if both ends of S1 are within 500 bp from ends of S2 and the INDEL types of S1 and S2 are the same. Minisv compares all types of SVs that can be associated with two ends. You can also specify the minimum read support (-c) and the minimum SV length (-l) on the command line.

If three or more callsets are given on the command line, minisv will generate an output like:

SN  980     0.6091  0.8580  0.6135  0.9104  0.8754  C1
SN  0.0673  110     0.0762  0.1319  0.1119  0.0665  C2
SN  0.8408  0.6727  958     0.6411  0.9216  0.8469  C3
SN  0.1980  0.4000  0.2119  326     0.2313  0.2154  C4
SN  0.7071  0.5818  0.7359  0.6074  536     0.7012  C5
SN  0.8459  0.5636  0.8361  0.6442  0.8731  947     C6

where the diagonal gives the count of SVs and the number at row R and column C equals to the fraction of calls in R found in C. In this example, 84.59% of calls in C6 were found in C1 and 87.54% of calls in C1 found in C6. Generally, higher fraction on a row is correlated with higher sensitivity; higher fraction on a column is correlated with higher specificity for the caller on the column.

If you have 3+ callsets, you may also use option -M for evaluation in the consensus mode. In this mode, an FP is a call in callset C that is not found in another other callsets. Conversely, an FN is a call that is supported by two or more other callsets but is not called in C. Here is sample output:

RN  845  98   0.1160  C1
RP  980  49   0.0500  C1
RN  994  918  0.9235  C2
RP  110  18   0.1636  C2
...

Here the false negative rate (RN) for C1 is 11.6% and the false positive rate (RP) is 5.0% based on the definition above.

Another way of SV evaluation is to count the true positive and false negative using union result of ensembled SV call set,

# k8 javascript
minisv.js union -c 5 -b minisv.py/data/hs38.reg.bed severus_asm.vcf msv_filter.msv sanava_asm.vcf nanomonsv_asm.vcf > ensemble.vcf
# python 
minisv union -c 5 -b minisv.py/data/hs38.reg.bed severus_asm.vcf msv_filter.msv sanava_asm.vcf nanomonsv_asm.vcf > ensemble.vcf

This will output the supported SV number for each combination of SV callers, the first column is the binary vector for the appearance of each caller as in the same order of union input, the second caller show the number of SVs, e.g., 1000 row means Severus uniquely reports 16 SVs.

Minisv can also integrate and ensemble sv from the vcfs and read id files directly:

minisv advunion -p -b ~/data/pangenome_sv_benchmarking/minisv/data/hs38.reg.bed \
        -i1 output/minisv_puretumor_somatic_asm/snf_HCC1954_hifi1_somatic_generation2.vcf \
        -i1 ../1a.alignment_sv_tools/output/savana12/HCC1954_hifi1/grch38/grch38_T_tag.sv_breakpoints_read_support.tsv \
        -i1 ../1a.alignment_sv_tools/output/severus/HCC1954_hifi1/grch38_cutoff2_read_ids/read_ids.csv \
        -i1 ../1a.alignment_sv_tools/output/nanomonsv/HCC1954_hifi1/T/grch38_parse.nanomonsv.supporting_read.txt \
        -i1 output/msv_somatic/HCC1954_hifi1T.hg38l+tg.pair-c2s1.msv \
        -i2 output/minisv_puretumor_somatic_asm/snf_HCC1954_hifi1_somatic_generation2.vcf \
        -i2 ../1a.alignment_sv_tools/output/savana12/HCC1954_hifi1/grch38/grch38_T_tag.classified.somatic.vcf \
        -i2 ../1a.alignment_sv_tools/output/severus/HCC1954_hifi1/grch38_cutoff2_read_ids/somatic_SVs/severus_somatic.vcf \
        -i2 ../1a.alignment_sv_tools/output/nanomonsv/HCC1954_hifi1/grch38_tnpair.vcf \
        -i2 output/msv_somatic/HCC1954_hifi1T.hg38l+tg.pair-c2s1.msv \
        HCC1954T.self.Q0.gsv.gz > HCC1954_withasmreadids.msv

# keep the SV call with maximum read counts
minisv advunion -u -p -b ~/data/pangenome_sv_benchmarking/minisv/data/hs38.reg.bed \
        -i1 output/minisv_puretumor_somatic_asm/snf_HCC1954_hifi1_somatic_generation2.vcf \
        -i1 ../1a.alignment_sv_tools/output/savana12/HCC1954_hifi1/grch38/grch38_T_tag.sv_breakpoints_read_support.tsv \
        -i1 ../1a.alignment_sv_tools/output/severus/HCC1954_hifi1/grch38_cutoff2_read_ids/read_ids.csv \
        -i1 ../1a.alignment_sv_tools/output/nanomonsv/HCC1954_hifi1/T/grch38_parse.nanomonsv.supporting_read.txt \
        -i1 output/msv_somatic/HCC1954_hifi1T.hg38l+tg.pair-c2s1.msv \
        -i2 output/minisv_puretumor_somatic_asm/snf_HCC1954_hifi1_somatic_generation2.vcf \
        -i2 ../1a.alignment_sv_tools/output/savana12/HCC1954_hifi1/grch38/grch38_T_tag.classified.somatic.vcf \
        -i2 ../1a.alignment_sv_tools/output/severus/HCC1954_hifi1/grch38_cutoff2_read_ids/somatic_SVs/severus_somatic.vcf \
        -i2 ../1a.alignment_sv_tools/output/nanomonsv/HCC1954_hifi1/grch38_tnpair.vcf \
        -i2 output/msv_somatic/HCC1954_hifi1T.hg38l+tg.pair-c2s1.msv \
        HCC1954T.self.Q0.gsv.gz > HCC1954_withasmreadids_collapsed.msv

Minisv seamlessly parses the VCF format and the minisv format. It has been tested with Severus, Sniffles2, cuteSV, SAVANA, SVision-Pro, nanomonsv, SvABA and GRIPSS.

Limitations

Minisv simply counts supporting reads to call an SV. More complex algorithms, such as phasing, VNTR reposition and machine learning, may improve its accuracy. On the other hand, matching the accuracy of other callers with such a simple algorithm implies the potential of our multi-reference approach.
Minisv does not output genotypes. This makes it less useful for germline SV calling. On the HG002 benchmark data, minisv is close to but not as good as the best germline SV caller.
For the best accuracy, minisv needs the alignment of all reads against multiple reference genomes or pangenome graphs. This greatly increases the running time. A better strategy is to only align reads with SVs to reduce alignment time. It is possible to build a pipeline on top of minisv but this has not been implemented.

Name		Name	Last commit message	Last commit date
Latest commit History 166 Commits
data		data
minisv		minisv
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
Makefile		Makefile
NEWS.md		NEWS.md
README.md		README.md
build.py		build.py
build_docker.sh		build_docker.sh
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Getting started

Introduction

Table of Contents

Data preprocessing

Optional step: de novo assembly of paired normal sample

Hifi read alignment

ONT read alignment

Minisv standalone mode design

Calling SVs

Germline SVs

Somatic SVs in tumor-normal pairs

Large somatic SVs in tumor-only samples

Mosaic SVs

Generic filtering for other callers

Ensembling filtered caller results

Comparing SVs

Limitations

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

qinqian/minisv.py

Folders and files

Latest commit

History

Repository files navigation

Getting started

Introduction

Table of Contents

Data preprocessing

Optional step: de novo assembly of paired normal sample

Hifi read alignment

ONT read alignment

Minisv standalone mode design

Calling SVs

Germline SVs

Somatic SVs in tumor-normal pairs

Large somatic SVs in tumor-only samples

Mosaic SVs

Generic filtering for other callers

Ensembling filtered caller results

Comparing SVs

Limitations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages