SVTYPE classification #4

insilicool · 2017-02-17T19:23:48Z

Hi,

I have been testing out your new program which I was hoping would filling a niche in our tumor sv pipeline. After running svaba on a number of in-house benchmarking datasets, I've notice that although in your bioRxiv paper you've classified SVs as DUP, DEL etc in the SV vcf outputs all SVTYPE are all classified as BND. Are there any plans to further classify the outputted SVs by SVTYPE and/or do you already have a mechanism to do so?

Thanks in advance.

Best,

walaj · 2017-02-17T19:42:41Z

I have been using the BND format for everything, since it the most inclusive and can handle complex events where the traditional "DUP" and "DEL" distinctions may not reflect what is physically going on (e.g. chromothripsis will have a mix of rearrangement orientations). For this reason, the BND convention is what we agreed on for our work in the ICGC consortium. For germline runs, the DUP / DEL annotations are much more likely to be accurate, and I may make a module to convert the VCFs. In the paper, for the germline analysis I used the BND format to get the classifications (++ and -- is inversion, +- is deletion, -+ is duplication/insertion)

Either way (BND or DUP / DEL), the information about the rearrangements connections is still there. I use page 13 of the VCF 4.2 specification often to get a better picture of the rearrangement connections.

Hope this helps, and let me know if issues / further questions with SvABA.

-Jeremiah

yeswzc · 2017-04-13T12:25:54Z

Hi Jeremiah,
I also want to know how to trans the BNDs to known SV type like DUP,INV,TRA,DUP, at least for some BND that can be transformed.
I cannot understand your meaning about ++, -- or other. If this information given in 'SCTG' ?
If it is from SCTG, is that your said ++/--/+-/-+ the two [+-]sign in the two parenthesis()？
Did:
'SCTG=1:215981(-)-1:5134458(-)__1_5120501_5145501D' mean -- ?
'SCTG=1:240491(+)-1:16723527(+)__1_220501_245501D' mean ++ ?
'SCTG=1:432171(+)-1:433701(-)__1_416501_441501D' mean +- ?
...
And what if there is no those information in SCTG?

Thank you.

Zhichao

walaj · 2017-04-19T19:06:33Z

The orientations of each breakpoint are encoded using the VCF 4.2 standard, and I elected to use the BND format for all of the variants. The relevant information is in pages 12 and 13 of the VCF 4.2 specifications: https://samtools.github.io/hts-specs/VCFv4.2.pdf For instance, you can see that the "W" to "Z" rearrangement would correspond to a DEL in the germline context.

The SCTG tags you provide above are for discordant-only variants. The + means that the break is oriented so that the DNA is to the left (e.g. piece "W" from page 13 of the VCF specs). However, it is more direct to read the orientations from orientations and positions of the brackets in the ALT column.

For germline genomes, a converter from BND to DUP/etc is on my to-do. For cancer, I think it would be a mistake to abandon BND format, since there is often no direct correspondence between the breakpoint orientations and the copy-number state, due to the immense complexity of somatic rearrangements.

Let me know if this doesn't make sense, or if other questions.

walaj · 2017-04-25T12:17:08Z

@yeswzc I saw a comment come through my email, but it looks like it's not here. To reply for others who are wondering the same, each ALT field of the BND format gives both orientations (this breakpoint, and it's rearrangement partner breakpoint). There are two bits of orientation information in the ALT field: 1) The nucleotide sequence before or after the first bracket (gives this breakpoint orientation) and 2) the orientation of the first bracket (gives partner breakpoint orientation).

If the nucleotide comes first, then it is a "+" facing break at that site (see element "W" of the VCF4.2 specs, pg 13), otherwise it is "-" at that site. Then, if the bracket is ], then the partner breakpoint is "+", otherwise it is left facing. Pg 13 of the VCF4.2 specs gives examples of this.

yeswzc · 2017-04-25T12:55:03Z

@walaj Thank you very much!
I deleted the question because I thought I got the right understanding about the VCF specification on BND, but soon became uncertain about my comprehension, after reading several times and checked examples. It is quite difficult to make the definition clear, to me. I think it is my problem, and I kept think it the whole day, bu still get no good comprehension.
I could not think out the mechanism of setting so, so I did not understand the ALT column.
After your explanation, did the place of "s" before of after "p" defined the +/- for the breakpoint 1, and did "]" or "[" defined + or - for the breakpoint2 respectively?
If I get the right comprehension, did:
s[p[ mean "+-",
s]p] mean "++",
[p[s mean "--",
]p]s mean "-+" ?

pk7zuva · 2018-04-03T22:08:21Z

@walaj
Is any BND to bed conversion script available?

walaj · 2018-04-04T01:19:47Z

This is high on my to-do list, but hasn’t made it yet. I’ll let you know when this is available.

…

On Apr 3, 2018, at 6:08 PM, Pankaj Kumar ***@***.***> wrote: @walaj Is any BND to bed conversion script available? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

drvenki · 2018-05-31T09:42:51Z

Hi Jeremiah,
I have been taking SvABA for a spin in the past few weeks and I am impressed.
Has there been any update on the above BND>BED conversion?
TIA,
Venkatesh Chellappa (Venki)

walaj · 2018-06-04T00:21:36Z

Hi Venki, Unfortunately not yet. I'm on full-time clinical service through the summer, so it'll be some time. If you have a script that you wrote, I would be happy to check out a PR! Thanks, Jeremiah

…

On Thu, May 31, 2018 at 5:42 AM, Venkatesh Chellappa < ***@***.***> wrote: Hi Jeremiah, I have been taking SvABA for a spin in the past few weeks and I am impressed. Has there been any update on the above BND>BED conversion? I didn't want to beat a dead horse here and rewrite a script for this issue. TIA, Venkatesh Chellappa (Venki) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#4 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AGmfiKdFxWqMMjMbq6VvDwhGi08s2M40ks5t37shgaJpZM4MEpni> .

drvenki · 2018-06-20T08:13:01Z

Hi Jeremiah,
I am working on the script now and will pass it on soon. I am testing svaba's suitability for liquid biopsy samples and it is for this purpose I have been doing the sub-classification of BND to individual events. I was wondering if I could use your annotation R script and I noticed you use your own exon, intron definitions. May I please have the supporting files if possible? I am using GRCh37 with decoy sequences for my analysis.
Thanks,
Venki

drvenki · 2018-06-20T10:15:33Z

Hi linwang6, You wrote "Hi all, I'm also confused how to trans the BNDs (from svABA) to known SV type DUP,INV,TRA,DUP? Has any one figured it out? Thanks." I see the comment doesnt exist anymore. However, we could follow previous comments and replies on this same thread on how to interpret individual events. Based on Jeremiah's input, I wrote my own ugly classifier. Something like <CODE> if svtype == 'BND' and ':' in alt: end_pos = re.search(r'\d+', alt.split(':')[1]).group() position.append(end_pos) if pos in position: continue else: svtype = getSVType(cols[0],cols[1],cols[3]) if svtype in sv_all: sv_all[svtype] += 1 else: sv_all[svtype] = 1 return sv_all def getSVType(chr,start,alt): pattern = {'++': 'INV','-+': 'INS', '+-': 'DEL', '--': 'INV'} try: chr2 = re.search(r'\d+', alt.split(':')[0]).group() except Exception as e: chr2 = re.search(r'\d+', alt.split(':')[0][-1]) frst = alt.split(':')[0] if frst[0].isalpha(): test = '+' if frst[1] == ']': test += '+' else: test += '-' else: test = '-' if frst[0] == ']': test += '+' else: test += '-' if chr == chr2: return pattern[test] else: return 'BND' <CODE> The idea here is to strip out ins, del & inv apart from other complex events like translocations or fusions. I am still working on this task but if there's any help available, I would be glad! Thanks, Venkatesh Chellappa (Venki)

…

________________________________ From: linwang6 <notifications@github.com> Sent: 19 June 2018 18:40 To: walaj/svaba Cc: Venkatesh Chellappa; Comment Subject: Re: [walaj/svaba] SVTYPE classification (#4) Hi all, I'm also confused how to trans the BNDs (from svABA) to known SV type DUP,INV,TRA,DUP? Has any one figured it out? Thanks. — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#4 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AB8Rrj_x7n-Vs_0M8VN8sWdoTaufU9pNks5t-TergaJpZM4MEpni>.

linwang6 · 2018-06-20T14:03:34Z

Hi Venki,
Thanks so much for your reply and script. It's very helpful for me.

yoyosef · 2018-07-16T15:37:36Z

@walaj
chr19 14507136 95244:1 C [chr19:14877965[C
chr19 14877965 95244:2 C [chr19:14507136[C

Could you give me more insight into what this could be? There are no more mateids for these two.. It looks like an Inversion, but in the VCF4.2 documentation, inversions should have 4 mates.

Also, could we expect a converter by the end of the summer, or are you still busy?

Thanks, Yosef

JaretK · 2018-08-03T05:50:58Z

I took a stab at parsing the vcf files to re-classify BNDs based on the above inputs. The python script below accepts the vcf file as an argument and writes the resulting re-classified vcf to stdout (pipe it into a .classified.vcf file to save)

#!/usr/bin/env python
import re
import sys
import os

#make mates dictionary given list input of non-comment lines
def makeMateDict(m):
    d = {}
    for index1, line1 in enumerate(m):
        id1 = line1.split('\t')[2]
        numMate = re.search(r':(\d)',id1).group(1)
        origId = re.search(r'(\d+):',id1).group(1)
        if int(numMate) == 1:
            for index2, line2 in enumerate(m):
                #never start from beginning of file
                if index2 <= index1:
                    continue
                # print str(index1) + " : " + str(index2)
                id2 = line2.split('\t')[2]
                duplicateId = re.search(r'(\d+):',id2).group(1)
                duplicateNumMate = re.search(r':(\d)',id2).group(1)
                if duplicateId == origId and int(duplicateNumMate) == 2:
                    d[line1] = line2
                    break
    return d

def classify(line, ALT_INDEX, mdict):
    #get alt, chrom1, chrom2, position (pos), id, old SVTYPE (should be BND if virgin svaba vcf) from line
    s = line.split("\t")
    alt = s[ALT_INDEX]
    chrom1 = s[0]
    pos = int(s[1])
    id=s[2]

    if int(re.search(r':(\d)',id).group(1)) != 1:
        return "NONE"

    mateLine = mdict[line].split('\t')
    mateChrom = mateLine[0]
    mateAlt = mateLine[ALT_INDEX]

    oldType = re.search(r'SVTYPE=(.+?)(\s+?|:)',line).group(1)

    # get new type
    if oldType == 'BND' and chrom1 == mateChrom:
        INV_PATTERN_1 = re.compile(r'\D\].+\]')
        INV_PATTERN_2 = re.compile(r'\[.+\[\D')
        if INV_PATTERN_1.match(alt) and INV_PATTERN_1.match(mateAlt):
            return "INV"
        if INV_PATTERN_2.match(alt) and INV_PATTERN_2.match(mateAlt):
            return "INV"

        # DEL
        DEL_PATTERN_THIS = re.compile(r'\D\[.+\[')
        DEL_PATTERN_MATE = re.compile(r'\].+\]\D')
        if DEL_PATTERN_THIS.match(alt) and DEL_PATTERN_MATE.match(mateAlt):
            return "DEL"

        # INS
        INS_PATTERN_THIS = re.compile(r'\D\].+\]')
        INS_PATTERN_MATE = re.compile(r'\[.+\[\D')
        if INS_PATTERN_THIS.match(alt) and INS_PATTERN_MATE.match(mateAlt):
            return "DUP/INS"

    return 'BND'

if __name__ == "__main__":
    file = sys.argv[1]
    if not os.path.exists(file):
        raise IOError(file)
    alt_index = -1
    #generate mate:mate dictionary
    #load file into ram
    vcf_file=[]
    with open (file, 'r') as f:
        for line in f:
            if line.startswith('#'):
                continue
            vcf_file.append(line)
    matesDict = makeMateDict(vcf_file)
    with open(file, "r") as f:
        for line in f:
            # print comments
            if line.startswith("##"):
                sys.stdout.write(line)
                continue
            # header contains indexes
            if line.startswith('#'):
                split = line.split("\t")
                for index, val in enumerate(split):
                    if val == "ALT":
                        alt_index = index
                        break
                sys.stdout.write(line)
                continue
            if alt_index == -1:
                print "ERROR: NO ALT INDEX FOUND"
                exit(1)
            newType = classify(line, alt_index, matesDict)
            if newType != "NONE":
                newLine = re.sub(r'SVTYPE=BND',"SVTYPE="+newType,line)
                sys.stdout.write(newLine)

EDIT: Corrected DEL and INS calls. Added "TRA" is BND is across chromosomes. Changed INS to DUP/INS

drvenki · 2018-08-03T11:54:46Z

It is such a clean and simple work! But again, we need to add another flag for SVTYPE=TRA for complex events when it is a confirmed translocation. Right now, it is over ambitious to include fusion events as a part of this exercise but I could see if we can do something on the lines of annotating svaba vcfs. Jeremiah has a rudimentary annotation function which I am not sure if useful and also do not know if he will ever get the time to respond to queries.
It is sad that svaba performs well but it might just end up like one of those tools which are really good but been shelved because of lack of support!
Thanks for the code Jaret.
Cheers,
Venki

(Venkatesh Chellappa)

sfrenk · 2018-08-08T19:32:56Z

I also had a go at annotating the BNDs, and got a different algorithm to JaretK. Mine seems to be concordant with another software package (meerkat). I've included some R code below.

# Read in svaba vcf file
cols <- colnames(read.table(pipe(paste0('grep -v "##" ', svaba.sv.vcf,' | grep "#"| sed s/#//')), header = TRUE))
cols <- sapply(cols, function(x) gsub("(mrkdp\\.)|(\\.bam)", "", x))
svaba_uniq <- read.table(args$svaba, col.names = cols, stringsAsFactors = FALSE)

get_sv_type <- function(x){
    # Find mate pair
    root <- gsub(":[12]", "", x)
    mate1 <- paste0(root, ":1")
    mate2 <- paste0(root, ":2")
    alt1 <- svaba_uniq %>% filter(ID == mate1) %>% .$ALT
    alt2 <- svaba_uniq %>% filter(ID == mate2) %>% .$ALT
    # Determine sv type based on breakpoint orientation
    if ((grepl("\\[", alt1) & grepl("\\[", alt2)) | (grepl("\\]", alt1) & grepl("\\]", alt2))){
        sv_type <- "INV"
        
    } else if (grepl("[a-z]\\[", alt1) & grepl("^\\]", alt2)){
        sv_type <- "DEL"
        
    } else if (grepl("^\\]", alt1) & grepl("[a-z]\\[", alt2)){
        sv_type <- "DUP/INS"
        
    } else{
        sv_type <- "UNKNOWN"
    }
    return(sv_type)
}

svaba_uniq$SVTYPE <- sapply(svaba_uniq$ID, get_sv_type)

JaretK · 2018-08-09T16:45:01Z

@sfrenk 's approach is better than the one I posted (my DEL and INS calls were reversed). I edited the python script in my previous comment to reflect a correct implementation. I tested the new version against Manta and got similar results.

linwang6 · 2018-08-09T19:50:02Z

Hi Guys,
I am also working on SV calling, I have used five different callers to call SV. In my understanding, it's a little bit dogmatically to classify SV into different types(del, inv, dup,trans) just based on the orientation information (especially when you tumor sample is highly rearranged). What do you all guys think of?

PS: how do you all guys filter the SV based on supported reads or SV length or any others? What size of supported reads or SV length do you all guys use? Some papers cut off with 10 reads and larger than 10 kb. Thanks.

walaj · 2018-08-11T00:16:34Z

Hi all,

First, thank you so much for your work on this, especially while I'm tied up with clinical rotations. This will be really amazing to have a converter, not just for SvABA but for other tools as well.

In short, it is useful to have a converter between the two formats, but @linwang6 is right that the results have to be taken with caution and skepticism. For short variants, like most germline variants, the DEL, DUP etc calls are true to the physical reality. For somatic variants, they are usually not -- this actually causes a lot of problems for interpretation. Someone may think that physically there is a simple deletion, based on the read orientation and the DEL designation, but this is not necessarily the case! For instance, Yilong Li has tried to use the copy number state and orientations of somatic rearrangements to reconstruct the physical state. It turns out, for like >70% of somatic variants, this cannot be done! See: https://www.biorxiv.org/content/early/2017/08/27/181339

So this is the reason for my preference for my VCF format (which still complies with VCF4.2), but indeed for caller comparison or for germline calling, DEL, DUP is useful.

@yoyosef "Inversions" are another place where semantics should be more precise. An "in-place, copy-neutral local inversion" (what most people think of when they think inversion) should have 4 breakpoints (2 rearrangements). But, any rearrangement could have an "inverse" orientation, such that if you were to walk along the genome, when you hit that rearrangement you would start going backwards.

A related concept is that EVERY structural variation event that is made up of an ODD number of rearrangements MUST have a copy-number change. Always. Any event that preserves copy number must have an EVEN number of rearrangements. The inverse is not always true though. A collection of rearrangements with an even number of rearrangements does not have to preserve copy number.

-Jeremiah

shishuo16 · 2018-09-13T07:52:33Z

Hi,
@walaj Thanks so much for the good work on svaba. It is really a good SV detection tool. And, for the SV classification, I wrote a simple perl script. It is very rough, any suggestions will be appreciated.

#! /usr/bin/perl -w

use FindBin qw($Bin $Script);
use Getopt::Std;
use File::Path;
use File::Spec;
our %opts = (n=>'0',c=>'N');
getopts('f:o:',\%opts);

die "perl $0
          STD  output
          -f   
          -o   output file\n" unless ($opts{'f'} && $opts{'o'});
$opts{'o'}  = File::Spec->rel2abs($opts{'o'});
my %hash;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
    next if($line=~/^#/);
    my @arr=split(/\t/,$line);
    next if($arr[2]=~/:1/);
    $hash{$arr[2]}=$arr[4];
}
close IN;
open OUT,">$opts{'o'}" or die $!;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
    if($line=~/^#/){
        print OUT "$line";
        next;
    }
    my @arr=split(/\t/,$line);
    next if($arr[2]=~/:2/);
    $arr[2]=~s/:1//;
    my $END=$arr[4]=~s/[ATCGN\]\[]//gr;
    my @E=split(/:/,$END);
    $arr[7]="Chr2=".$E[0].";END=".$E[1].";";
    if($arr[0] eq $E[0]){
        if($arr[4]=~/[A-Z]\[/ && $hash{$arr[2].":2"}=~/\][A-Z]/){
            $arr[4]="DEL";
        }elsif($arr[4]=~/\[[A-Z]/ && $hash{$arr[2].":2"}=~/\[[A-Z]/ || ($arr[4]=~/[A-Z]\]/ && $hash{$arr[2].":2"}=~/[A-Z]\]/)){
            $arr[4]="INV";
        }elsif($arr[4]=~/\][A-Z]/ && $hash{$arr[2].":2"}=~/[A-Z]\[/){
            $arr[4]="DUP/INS";
        }
    }else{
        $arr[4]="TRA";
    }
    my $out=join("\t",@arr);
    print OUT "$out";
}
close IN;
close OUT;

vaishmanobala · 2019-06-19T10:08:31Z

Hi,
@walaj Thanks so much for the good work on svaba. It is really a good SV detection tool. And, for the SV classification, I wrote a simple perl script. It is very rough, any suggestions will be appreciated.

#! /usr/bin/perl -w

use FindBin qw($Bin $Script);
use Getopt::Std;
use File::Path;
use File::Spec;
our %opts = (n=>'0',c=>'N');
getopts('f:o:',%opts);

die "perl $0
STD output
-f
-o output file\n" unless ($opts{'f'} && $opts{'o'});
$opts{'o'} = File::Spec->rel2abs($opts{'o'});
my %hash;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
next if($line=/^#/);
my @arr=split(/\t/,$line);
next if($arr[2]=/:1/);
$hash{$arr[2]}=$arr[4];
}
close IN;
open OUT,">$opts{'o'}" or die $!;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
if($line=/^#/){
print OUT "$line";
next;
}
my @arr=split(/\t/,$line);
next if($arr[2]=/:2/);
$arr[2]=s/:1//;
my $END=$arr[4]=s/[ATCGN][]//gr;
my @e=split(/:/,$END);
$arr[7]="Chr2=".$E[0].";END=".$E[1].";";
if($arr[0] eq $E[0]){
if($arr[4]=/[A-Z][/ && $hash{$arr[2].":2"}=/][A-Z]/){
$arr[4]="DEL";
}elsif($arr[4]=~~/[[A-Z]/ && $hash{$arr[2].":2"}=~~/[[A-Z]/ || ($arr[4]=~~/[A-Z]]/ && $hash{$arr[2].":2"}=~~/[A-Z]]/)){
$arr[4]="INV";
}elsif($arr[4]=~~/][A-Z]/ && $hash{$arr[2].":2"}=~~/[A-Z][/){
$arr[4]="DUP/INS";
}
}else{
$arr[4]="TRA";
}
my $out=join("\t",@arr);
print OUT "$out";
}
close IN;
close OUT;

"my" variable @arr masks earlier declaration in same scope at translocation.pl line 31.
"my" variable $line masks earlier declaration in same statement at translocation.pl line 31.
"my" variable @arr masks earlier declaration in same statement at translocation.pl line 32.
"my" variable @arr masks earlier declaration in same scope at translocation.pl line 33.
syntax error at translocation.pl line 17, near "=)"
syntax error at translocation.pl line 22, near "}"
syntax error at translocation.pl line 26, near "=)"
syntax error at translocation.pl line 30, near "}"
syntax error at translocation.pl line 50, near "}"
Execution of translocation.pl aborted due to compilation errors.

drvenki · 2019-06-19T11:07:40Z

Hi Vaishnavi,
What is that you are trying to communicate here. While posting your issues, could you please

state your usage of this tool (what you are trying to do and how you run the tool if relevant to the issue)
what is your issue/error
what is your question regarding the issue/error (background, remedy, workaround, etc.)
Thanks,
Venkatesh Chellappa

Hi,
@walaj Thanks so much for the good work on svaba. It is really a good SV detection tool. And, for the SV classification, I wrote a simple perl script. It is very rough, any suggestions will be appreciated.
#! /usr/bin/perl -w
use FindBin qw($Bin $Script);
use Getopt::Std;
use File::Path;
use File::Spec;
our %opts = (n=>'0',c=>'N');
getopts('f⭕️',%opts);
die "perl $0
STD output
-f
-o output file\n" unless ($opts{'f'} && $opts{'o'});
$opts{'o'} = File::Spec->rel2abs($opts{'o'});
my %hash;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
next if($line=/^#/);
my @arr=split(/\t/,$line);
next if($arr[2]=/:1/);
$hash{$arr[2]}=$arr[4];
}
close IN;
open OUT,">$opts{'o'}" or die $!;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
if($line=/^#/){
print OUT "$line";
next;
}
my @arr=split(/\t/,$line);
next if($arr[2]=/:2/);
$arr[2]=s/:1//;
my $END=$arr[4]=s/[ATCGN][]//gr;
my @e=split(/:/,$END);
$arr[7]="Chr2=".$E[0].";END=".$E[1].";";
if($arr[0] eq $E[0]){
if($arr[4]=/[A-Z][/ && $hash{$arr[2].":2"}=/][A-Z]/){
$arr[4]="DEL";
}elsif($arr[4]=~~/[[A-Z]/ && $hash{$arr[2].":2"}=~~/[[A-Z]/ || ($arr[4]=~~/[A-Z]]/ && $hash{$arr[2].":2"}=~~/[A-Z]]/)){
$arr[4]="INV";
}elsif($arr[4]=~~/][A-Z]/ && $hash{$arr[2].":2"}=~~/[A-Z][/){
$arr[4]="DUP/INS";
}
}else{
$arr[4]="TRA";
}
my $out=join("\t",@arr);
print OUT "$out";
}
close IN;
close OUT;

"my" variable @arr masks earlier declaration in same scope at translocation.pl line 31.
"my" variable $line masks earlier declaration in same statement at translocation.pl line 31.
"my" variable @arr masks earlier declaration in same statement at translocation.pl line 32.
"my" variable @arr masks earlier declaration in same scope at translocation.pl line 33.
syntax error at translocation.pl line 17, near "=)"
syntax error at translocation.pl line 22, near "}"
syntax error at translocation.pl line 26, near "=)"
syntax error at translocation.pl line 30, near "}"
syntax error at translocation.pl line 50, near "}"
Execution of translocation.pl aborted due to compilation errors.

SalendraSingh · 2019-07-17T17:52:37Z

Hi @walaj, Great work with SvABA. Just like others, I am also trying to classify BND into various subtypes (INS, DUP, DEL, TRA). In terms of that, I am trying to understand the SCTG tag provided. It will be great if you can explain how is that tag defined and what different part of the tag means.
for example : "SCTG=1:215981(-)-1:5134458(-)__1_5120501_5145501D"
In this I am able to associate that there is a translocation between 1:215981(-) and 1:5134458(-), but I am not understanding what "__1_5120501_5145501D" means.

jykr · 2020-12-06T21:08:07Z

Hi, thanks a lot for Svaba and all the helpful discussion that's going on here! I'm trying to adopt the codes here to classify SV types and I'm confused whether we can define insertions just from the breakend information.

Based on my understanding, insertions should have another contig that would be inserted so would have 4 breakend entries in vcf output (as in p.14 in VCF 4.2 spec) (,or ideally would be only present in .sv.indel output of Svaba where the inserted sequences are assembled as a single line indel). Am I missing something here? It seems to me that all the mates with ALT

Mate 1: ]p]t & Mate2: t[p[

should be DUPs not INSs, given the position of Mate 1 is smaller than the position of Mate2.
Am I missing something here? Plus, is there a tag or any information in .sv.vcf that marks a single insertion? Thanks!

jykr · 2020-12-06T21:17:23Z

@JaretK @sfrenk Thanks a lot for the helpful scripts! I am a bit confused because the codes provided above still don't seem to agree each other. Is it correct to interpret that given the mate breakends' chromosome are same and Mate1 position is smaller than Mate2, then

Mate1 = t]position], Mate2 = t]position] or Mate1 = [position[t, Mate2 = [position[t : INVs
Mate1 = t[position[, Mate2 = ]position]t : DELs
Mate1 = ]position]t, Mate2 = t[position[ : DUPs (possibly INSs if I'm missing something)

Thank you!

panxiaoguang · 2022-09-01T02:35:21Z

I took a stab at parsing the vcf files to re-classify BNDs based on the above inputs. The python script below accepts the vcf file as an argument and writes the resulting re-classified vcf to stdout (pipe it into a .classified.vcf file to save)

#!/usr/bin/env python
import re
import sys
import os

#make mates dictionary given list input of non-comment lines
def makeMateDict(m):
    d = {}
    for index1, line1 in enumerate(m):
        id1 = line1.split('\t')[2]
        numMate = re.search(r':(\d)',id1).group(1)
        origId = re.search(r'(\d+):',id1).group(1)
        if int(numMate) == 1:
            for index2, line2 in enumerate(m):
                #never start from beginning of file
                if index2 <= index1:
                    continue
                # print str(index1) + " : " + str(index2)
                id2 = line2.split('\t')[2]
                duplicateId = re.search(r'(\d+):',id2).group(1)
                duplicateNumMate = re.search(r':(\d)',id2).group(1)
                if duplicateId == origId and int(duplicateNumMate) == 2:
                    d[line1] = line2
                    break
    return d

def classify(line, ALT_INDEX, mdict):
    #get alt, chrom1, chrom2, position (pos), id, old SVTYPE (should be BND if virgin svaba vcf) from line
    s = line.split("\t")
    alt = s[ALT_INDEX]
    chrom1 = s[0]
    pos = int(s[1])
    id=s[2]

    if int(re.search(r':(\d)',id).group(1)) != 1:
        return "NONE"

    mateLine = mdict[line].split('\t')
    mateChrom = mateLine[0]
    mateAlt = mateLine[ALT_INDEX]

    oldType = re.search(r'SVTYPE=(.+?)(\s+?|:)',line).group(1)

    # get new type
    if oldType == 'BND' and chrom1 == mateChrom:
        INV_PATTERN_1 = re.compile(r'\D\].+\]')
        INV_PATTERN_2 = re.compile(r'\[.+\[\D')
        if INV_PATTERN_1.match(alt) and INV_PATTERN_1.match(mateAlt):
            return "INV"
        if INV_PATTERN_2.match(alt) and INV_PATTERN_2.match(mateAlt):
            return "INV"

        # DEL
        DEL_PATTERN_THIS = re.compile(r'\D\[.+\[')
        DEL_PATTERN_MATE = re.compile(r'\].+\]\D')
        if DEL_PATTERN_THIS.match(alt) and DEL_PATTERN_MATE.match(mateAlt):
            return "DEL"

        # INS
        INS_PATTERN_THIS = re.compile(r'\D\].+\]')
        INS_PATTERN_MATE = re.compile(r'\[.+\[\D')
        if INS_PATTERN_THIS.match(alt) and INS_PATTERN_MATE.match(mateAlt):
            return "DUP/INS"

    return 'BND'

if __name__ == "__main__":
    file = sys.argv[1]
    if not os.path.exists(file):
        raise IOError(file)
    alt_index = -1
    #generate mate:mate dictionary
    #load file into ram
    vcf_file=[]
    with open (file, 'r') as f:
        for line in f:
            if line.startswith('#'):
                continue
            vcf_file.append(line)
    matesDict = makeMateDict(vcf_file)
    with open(file, "r") as f:
        for line in f:
            # print comments
            if line.startswith("##"):
                sys.stdout.write(line)
                continue
            # header contains indexes
            if line.startswith('#'):
                split = line.split("\t")
                for index, val in enumerate(split):
                    if val == "ALT":
                        alt_index = index
                        break
                sys.stdout.write(line)
                continue
            if alt_index == -1:
                print "ERROR: NO ALT INDEX FOUND"
                exit(1)
            newType = classify(line, alt_index, matesDict)
            if newType != "NONE":
                newLine = re.sub(r'SVTYPE=BND',"SVTYPE="+newType,line)
                sys.stdout.write(newLine)

EDIT: Corrected DEL and INS calls. Added "TRA" is BND is across chromosomes. Changed INS to DUP/INS

Thank you so much for your great scripts, but I think you still make some mistakes.

for DUP/INS, it should be r'\].+\]\D' and r'\D\[.+\[' .
for BND( I think it's called interchromosomal translocation) , you should't remove it's mate record because we always need to know the strand information for both itself and it's mate reord.

drvenki · 2022-09-01T06:00:56Z

it is the end of 2022 and I see people are still using svaba despite no regular updates or upgrades. if SVtype classification was done and the tool itself was tweaked for sensitivity and specificity to detect low frequency SVs then, it would be such an amazing opportunity to bring this back to regular use in several pipelines.
Cheers,
Venki

Venkatesh Chellappa

sbamin mentioned this issue Jun 26, 2018

To-Do: SV pipeline fpbarthel/GLASS#7

Closed

3 tasks

hurleyLi mentioned this issue Aug 5, 2018

Specific types of BND #28

Open

xiucz mentioned this issue Apr 1, 2022

Annotate SV_type column with inversion/deletion/duplication/insertion lgmgeo/AnnotSV#118

Closed

walaj mentioned this issue Aug 1, 2022

Is it possible to infer duplications/deletions/translocations/insertions/inversions from SvABA output? #116

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SVTYPE classification #4

SVTYPE classification #4

insilicool commented Feb 17, 2017

walaj commented Feb 17, 2017

yeswzc commented Apr 13, 2017

walaj commented Apr 19, 2017 •

edited

walaj commented Apr 25, 2017

yeswzc commented Apr 25, 2017

pk7zuva commented Apr 3, 2018

walaj commented Apr 4, 2018 via email

drvenki commented May 31, 2018 •

edited

walaj commented Jun 4, 2018 via email

drvenki commented Jun 20, 2018

drvenki commented Jun 20, 2018 via email •

edited

linwang6 commented Jun 20, 2018

yoyosef commented Jul 16, 2018 •

edited

JaretK commented Aug 3, 2018 •

edited

drvenki commented Aug 3, 2018

sfrenk commented Aug 8, 2018 •

edited

JaretK commented Aug 9, 2018

linwang6 commented Aug 9, 2018 •

edited

walaj commented Aug 11, 2018 •

edited

shishuo16 commented Sep 13, 2018 •

edited

vaishmanobala commented Jun 19, 2019

drvenki commented Jun 19, 2019

SalendraSingh commented Jul 17, 2019

jykr commented Dec 6, 2020

jykr commented Dec 6, 2020

panxiaoguang commented Sep 1, 2022

drvenki commented Sep 1, 2022 •

edited

SVTYPE classification #4

SVTYPE classification #4

Comments

insilicool commented Feb 17, 2017

walaj commented Feb 17, 2017

yeswzc commented Apr 13, 2017

walaj commented Apr 19, 2017 • edited

walaj commented Apr 25, 2017

yeswzc commented Apr 25, 2017

pk7zuva commented Apr 3, 2018

walaj commented Apr 4, 2018 via email

drvenki commented May 31, 2018 • edited

walaj commented Jun 4, 2018 via email

drvenki commented Jun 20, 2018

drvenki commented Jun 20, 2018 via email • edited

linwang6 commented Jun 20, 2018

yoyosef commented Jul 16, 2018 • edited

JaretK commented Aug 3, 2018 • edited

drvenki commented Aug 3, 2018

sfrenk commented Aug 8, 2018 • edited

JaretK commented Aug 9, 2018

linwang6 commented Aug 9, 2018 • edited

walaj commented Aug 11, 2018 • edited

shishuo16 commented Sep 13, 2018 • edited

vaishmanobala commented Jun 19, 2019

drvenki commented Jun 19, 2019

SalendraSingh commented Jul 17, 2019

jykr commented Dec 6, 2020

jykr commented Dec 6, 2020

panxiaoguang commented Sep 1, 2022

drvenki commented Sep 1, 2022 • edited

walaj commented Apr 19, 2017 •

edited

drvenki commented May 31, 2018 •

edited

drvenki commented Jun 20, 2018 via email •

edited

yoyosef commented Jul 16, 2018 •

edited

JaretK commented Aug 3, 2018 •

edited

sfrenk commented Aug 8, 2018 •

edited

linwang6 commented Aug 9, 2018 •

edited

walaj commented Aug 11, 2018 •

edited

shishuo16 commented Sep 13, 2018 •

edited

drvenki commented Sep 1, 2022 •

edited