Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVTYPE classification #4

Open
insilicool opened this issue Feb 17, 2017 · 27 comments
Open

SVTYPE classification #4

insilicool opened this issue Feb 17, 2017 · 27 comments

Comments

@insilicool
Copy link

Hi,

I have been testing out your new program which I was hoping would filling a niche in our tumor sv pipeline. After running svaba on a number of in-house benchmarking datasets, I've notice that although in your bioRxiv paper you've classified SVs as DUP, DEL etc in the SV vcf outputs all SVTYPE are all classified as BND. Are there any plans to further classify the outputted SVs by SVTYPE and/or do you already have a mechanism to do so?

Thanks in advance.

Best,

@walaj
Copy link
Owner

walaj commented Feb 17, 2017

I have been using the BND format for everything, since it the most inclusive and can handle complex events where the traditional "DUP" and "DEL" distinctions may not reflect what is physically going on (e.g. chromothripsis will have a mix of rearrangement orientations). For this reason, the BND convention is what we agreed on for our work in the ICGC consortium. For germline runs, the DUP / DEL annotations are much more likely to be accurate, and I may make a module to convert the VCFs. In the paper, for the germline analysis I used the BND format to get the classifications (++ and -- is inversion, +- is deletion, -+ is duplication/insertion)

Either way (BND or DUP / DEL), the information about the rearrangements connections is still there. I use page 13 of the VCF 4.2 specification often to get a better picture of the rearrangement connections.

Hope this helps, and let me know if issues / further questions with SvABA.

-Jeremiah

@yeswzc
Copy link

yeswzc commented Apr 13, 2017

Hi Jeremiah,
I also want to know how to trans the BNDs to known SV type like DUP,INV,TRA,DUP, at least for some BND that can be transformed.
I cannot understand your meaning about ++, -- or other. If this information given in 'SCTG' ?
If it is from SCTG, is that your said ++/--/+-/-+ the two [+-]sign in the two parenthesis()?
Did:
'SCTG=1:215981(-)-1:5134458(-)__1_5120501_5145501D' mean -- ?
'SCTG=1:240491(+)-1:16723527(+)__1_220501_245501D' mean ++ ?
'SCTG=1:432171(+)-1:433701(-)__1_416501_441501D' mean +- ?
...
And what if there is no those information in SCTG?

Thank you.

Zhichao

@walaj
Copy link
Owner

walaj commented Apr 19, 2017

The orientations of each breakpoint are encoded using the VCF 4.2 standard, and I elected to use the BND format for all of the variants. The relevant information is in pages 12 and 13 of the VCF 4.2 specifications: https://samtools.github.io/hts-specs/VCFv4.2.pdf For instance, you can see that the "W" to "Z" rearrangement would correspond to a DEL in the germline context.

The SCTG tags you provide above are for discordant-only variants. The + means that the break is oriented so that the DNA is to the left (e.g. piece "W" from page 13 of the VCF specs). However, it is more direct to read the orientations from orientations and positions of the brackets in the ALT column.

For germline genomes, a converter from BND to DUP/etc is on my to-do. For cancer, I think it would be a mistake to abandon BND format, since there is often no direct correspondence between the breakpoint orientations and the copy-number state, due to the immense complexity of somatic rearrangements.

Let me know if this doesn't make sense, or if other questions.

@walaj
Copy link
Owner

walaj commented Apr 25, 2017

@yeswzc I saw a comment come through my email, but it looks like it's not here. To reply for others who are wondering the same, each ALT field of the BND format gives both orientations (this breakpoint, and it's rearrangement partner breakpoint). There are two bits of orientation information in the ALT field: 1) The nucleotide sequence before or after the first bracket (gives this breakpoint orientation) and 2) the orientation of the first bracket (gives partner breakpoint orientation).

If the nucleotide comes first, then it is a "+" facing break at that site (see element "W" of the VCF4.2 specs, pg 13), otherwise it is "-" at that site. Then, if the bracket is ], then the partner breakpoint is "+", otherwise it is left facing. Pg 13 of the VCF4.2 specs gives examples of this.

@yeswzc
Copy link

yeswzc commented Apr 25, 2017

@walaj Thank you very much!
I deleted the question because I thought I got the right understanding about the VCF specification on BND, but soon became uncertain about my comprehension, after reading several times and checked examples. It is quite difficult to make the definition clear, to me. I think it is my problem, and I kept think it the whole day, bu still get no good comprehension.
I could not think out the mechanism of setting so, so I did not understand the ALT column.
After your explanation, did the place of "s" before of after "p" defined the +/- for the breakpoint 1, and did "]" or "[" defined + or - for the breakpoint2 respectively?
If I get the right comprehension, did:
s[p[ mean "+-",
s]p] mean "++",
[p[s mean "--",
]p]s mean "-+" ?

@pk7zuva
Copy link

pk7zuva commented Apr 3, 2018

@walaj
Is any BND to bed conversion script available?

@walaj
Copy link
Owner

walaj commented Apr 4, 2018 via email

@drvenki
Copy link

drvenki commented May 31, 2018

Hi Jeremiah,
I have been taking SvABA for a spin in the past few weeks and I am impressed.
Has there been any update on the above BND>BED conversion?
TIA,
Venkatesh Chellappa (Venki)

@walaj
Copy link
Owner

walaj commented Jun 4, 2018 via email

@drvenki
Copy link

drvenki commented Jun 20, 2018

Hi Jeremiah,
I am working on the script now and will pass it on soon. I am testing svaba's suitability for liquid biopsy samples and it is for this purpose I have been doing the sub-classification of BND to individual events. I was wondering if I could use your annotation R script and I noticed you use your own exon, intron definitions. May I please have the supporting files if possible? I am using GRCh37 with decoy sequences for my analysis.
Thanks,
Venki

@drvenki
Copy link

drvenki commented Jun 20, 2018 via email

@linwang6
Copy link

Hi Venki,
Thanks so much for your reply and script. It's very helpful for me.

@yoyosef
Copy link

yoyosef commented Jul 16, 2018

@walaj
chr19 14507136 95244:1 C [chr19:14877965[C
chr19 14877965 95244:2 C [chr19:14507136[C

Could you give me more insight into what this could be? There are no more mateids for these two.. It looks like an Inversion, but in the VCF4.2 documentation, inversions should have 4 mates.

Also, could we expect a converter by the end of the summer, or are you still busy?

Thanks, Yosef

@JaretK
Copy link

JaretK commented Aug 3, 2018

I took a stab at parsing the vcf files to re-classify BNDs based on the above inputs. The python script below accepts the vcf file as an argument and writes the resulting re-classified vcf to stdout (pipe it into a .classified.vcf file to save)

#!/usr/bin/env python
import re
import sys
import os

#make mates dictionary given list input of non-comment lines
def makeMateDict(m):
    d = {}
    for index1, line1 in enumerate(m):
        id1 = line1.split('\t')[2]
        numMate = re.search(r':(\d)',id1).group(1)
        origId = re.search(r'(\d+):',id1).group(1)
        if int(numMate) == 1:
            for index2, line2 in enumerate(m):
                #never start from beginning of file
                if index2 <= index1:
                    continue
                # print str(index1) + " : " + str(index2)
                id2 = line2.split('\t')[2]
                duplicateId = re.search(r'(\d+):',id2).group(1)
                duplicateNumMate = re.search(r':(\d)',id2).group(1)
                if duplicateId == origId and int(duplicateNumMate) == 2:
                    d[line1] = line2
                    break
    return d

def classify(line, ALT_INDEX, mdict):
    #get alt, chrom1, chrom2, position (pos), id, old SVTYPE (should be BND if virgin svaba vcf) from line
    s = line.split("\t")
    alt = s[ALT_INDEX]
    chrom1 = s[0]
    pos = int(s[1])
    id=s[2]

    if int(re.search(r':(\d)',id).group(1)) != 1:
        return "NONE"

    mateLine = mdict[line].split('\t')
    mateChrom = mateLine[0]
    mateAlt = mateLine[ALT_INDEX]

    oldType = re.search(r'SVTYPE=(.+?)(\s+?|:)',line).group(1)

    # get new type
    if oldType == 'BND' and chrom1 == mateChrom:
        INV_PATTERN_1 = re.compile(r'\D\].+\]')
        INV_PATTERN_2 = re.compile(r'\[.+\[\D')
        if INV_PATTERN_1.match(alt) and INV_PATTERN_1.match(mateAlt):
            return "INV"
        if INV_PATTERN_2.match(alt) and INV_PATTERN_2.match(mateAlt):
            return "INV"

        # DEL
        DEL_PATTERN_THIS = re.compile(r'\D\[.+\[')
        DEL_PATTERN_MATE = re.compile(r'\].+\]\D')
        if DEL_PATTERN_THIS.match(alt) and DEL_PATTERN_MATE.match(mateAlt):
            return "DEL"

        # INS
        INS_PATTERN_THIS = re.compile(r'\D\].+\]')
        INS_PATTERN_MATE = re.compile(r'\[.+\[\D')
        if INS_PATTERN_THIS.match(alt) and INS_PATTERN_MATE.match(mateAlt):
            return "DUP/INS"

    return 'BND'

if __name__ == "__main__":
    file = sys.argv[1]
    if not os.path.exists(file):
        raise IOError(file)
    alt_index = -1
    #generate mate:mate dictionary
    #load file into ram
    vcf_file=[]
    with open (file, 'r') as f:
        for line in f:
            if line.startswith('#'):
                continue
            vcf_file.append(line)
    matesDict = makeMateDict(vcf_file)
    with open(file, "r") as f:
        for line in f:
            # print comments
            if line.startswith("##"):
                sys.stdout.write(line)
                continue
            # header contains indexes
            if line.startswith('#'):
                split = line.split("\t")
                for index, val in enumerate(split):
                    if val == "ALT":
                        alt_index = index
                        break
                sys.stdout.write(line)
                continue
            if alt_index == -1:
                print "ERROR: NO ALT INDEX FOUND"
                exit(1)
            newType = classify(line, alt_index, matesDict)
            if newType != "NONE":
                newLine = re.sub(r'SVTYPE=BND',"SVTYPE="+newType,line)
                sys.stdout.write(newLine)

EDIT: Corrected DEL and INS calls. Added "TRA" is BND is across chromosomes. Changed INS to DUP/INS

@drvenki
Copy link

drvenki commented Aug 3, 2018

It is such a clean and simple work! But again, we need to add another flag for SVTYPE=TRA for complex events when it is a confirmed translocation. Right now, it is over ambitious to include fusion events as a part of this exercise but I could see if we can do something on the lines of annotating svaba vcfs. Jeremiah has a rudimentary annotation function which I am not sure if useful and also do not know if he will ever get the time to respond to queries.
It is sad that svaba performs well but it might just end up like one of those tools which are really good but been shelved because of lack of support!
Thanks for the code Jaret.
Cheers,
Venki

(Venkatesh Chellappa)

@sfrenk
Copy link

sfrenk commented Aug 8, 2018

I also had a go at annotating the BNDs, and got a different algorithm to JaretK. Mine seems to be concordant with another software package (meerkat). I've included some R code below.

# Read in svaba vcf file
cols <- colnames(read.table(pipe(paste0('grep -v "##" ', svaba.sv.vcf,' | grep "#"| sed s/#//')), header = TRUE))
cols <- sapply(cols, function(x) gsub("(mrkdp\\.)|(\\.bam)", "", x))
svaba_uniq <- read.table(args$svaba, col.names = cols, stringsAsFactors = FALSE)

get_sv_type <- function(x){
    # Find mate pair
    root <- gsub(":[12]", "", x)
    mate1 <- paste0(root, ":1")
    mate2 <- paste0(root, ":2")
    alt1 <- svaba_uniq %>% filter(ID == mate1) %>% .$ALT
    alt2 <- svaba_uniq %>% filter(ID == mate2) %>% .$ALT
    # Determine sv type based on breakpoint orientation
    if ((grepl("\\[", alt1) & grepl("\\[", alt2)) | (grepl("\\]", alt1) & grepl("\\]", alt2))){
        sv_type <- "INV"
        
    } else if (grepl("[a-z]\\[", alt1) & grepl("^\\]", alt2)){
        sv_type <- "DEL"
        
    } else if (grepl("^\\]", alt1) & grepl("[a-z]\\[", alt2)){
        sv_type <- "DUP/INS"
        
    } else{
        sv_type <- "UNKNOWN"
    }
    return(sv_type)
}

svaba_uniq$SVTYPE <- sapply(svaba_uniq$ID, get_sv_type)

@JaretK
Copy link

JaretK commented Aug 9, 2018

@sfrenk 's approach is better than the one I posted (my DEL and INS calls were reversed). I edited the python script in my previous comment to reflect a correct implementation. I tested the new version against Manta and got similar results.

@linwang6
Copy link

linwang6 commented Aug 9, 2018

Hi Guys,
I am also working on SV calling, I have used five different callers to call SV. In my understanding, it's a little bit dogmatically to classify SV into different types(del, inv, dup,trans) just based on the orientation information (especially when you tumor sample is highly rearranged). What do you all guys think of?

PS: how do you all guys filter the SV based on supported reads or SV length or any others? What size of supported reads or SV length do you all guys use? Some papers cut off with 10 reads and larger than 10 kb. Thanks.

@walaj
Copy link
Owner

walaj commented Aug 11, 2018

Hi all,

First, thank you so much for your work on this, especially while I'm tied up with clinical rotations. This will be really amazing to have a converter, not just for SvABA but for other tools as well.

In short, it is useful to have a converter between the two formats, but @linwang6 is right that the results have to be taken with caution and skepticism. For short variants, like most germline variants, the DEL, DUP etc calls are true to the physical reality. For somatic variants, they are usually not -- this actually causes a lot of problems for interpretation. Someone may think that physically there is a simple deletion, based on the read orientation and the DEL designation, but this is not necessarily the case! For instance, Yilong Li has tried to use the copy number state and orientations of somatic rearrangements to reconstruct the physical state. It turns out, for like >70% of somatic variants, this cannot be done! See: https://www.biorxiv.org/content/early/2017/08/27/181339

So this is the reason for my preference for my VCF format (which still complies with VCF4.2), but indeed for caller comparison or for germline calling, DEL, DUP is useful.

@yoyosef "Inversions" are another place where semantics should be more precise. An "in-place, copy-neutral local inversion" (what most people think of when they think inversion) should have 4 breakpoints (2 rearrangements). But, any rearrangement could have an "inverse" orientation, such that if you were to walk along the genome, when you hit that rearrangement you would start going backwards.

A related concept is that EVERY structural variation event that is made up of an ODD number of rearrangements MUST have a copy-number change. Always. Any event that preserves copy number must have an EVEN number of rearrangements. The inverse is not always true though. A collection of rearrangements with an even number of rearrangements does not have to preserve copy number.

-Jeremiah

@shishuo16
Copy link

shishuo16 commented Sep 13, 2018

Hi,
@walaj Thanks so much for the good work on svaba. It is really a good SV detection tool. And, for the SV classification, I wrote a simple perl script. It is very rough, any suggestions will be appreciated.

#! /usr/bin/perl -w

use FindBin qw($Bin $Script);
use Getopt::Std;
use File::Path;
use File::Spec;
our %opts = (n=>'0',c=>'N');
getopts('f:o:',\%opts);

die "perl $0
          STD  output
          -f   
          -o   output file\n" unless ($opts{'f'} && $opts{'o'});
$opts{'o'}  = File::Spec->rel2abs($opts{'o'});
my %hash;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
    next if($line=~/^#/);
    my @arr=split(/\t/,$line);
    next if($arr[2]=~/:1/);
    $hash{$arr[2]}=$arr[4];
}
close IN;
open OUT,">$opts{'o'}" or die $!;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
    if($line=~/^#/){
        print OUT "$line";
        next;
    }
    my @arr=split(/\t/,$line);
    next if($arr[2]=~/:2/);
    $arr[2]=~s/:1//;
    my $END=$arr[4]=~s/[ATCGN\]\[]//gr;
    my @E=split(/:/,$END);
    $arr[7]="Chr2=".$E[0].";END=".$E[1].";";
    if($arr[0] eq $E[0]){
        if($arr[4]=~/[A-Z]\[/ && $hash{$arr[2].":2"}=~/\][A-Z]/){
            $arr[4]="DEL";
        }elsif($arr[4]=~/\[[A-Z]/ && $hash{$arr[2].":2"}=~/\[[A-Z]/ || ($arr[4]=~/[A-Z]\]/ && $hash{$arr[2].":2"}=~/[A-Z]\]/)){
            $arr[4]="INV";
        }elsif($arr[4]=~/\][A-Z]/ && $hash{$arr[2].":2"}=~/[A-Z]\[/){
            $arr[4]="DUP/INS";
        }
    }else{
        $arr[4]="TRA";
    }
    my $out=join("\t",@arr);
    print OUT "$out";
}
close IN;
close OUT;

@vaishmanobala
Copy link

Hi,
@walaj Thanks so much for the good work on svaba. It is really a good SV detection tool. And, for the SV classification, I wrote a simple perl script. It is very rough, any suggestions will be appreciated.

#! /usr/bin/perl -w

use FindBin qw($Bin $Script);
use Getopt::Std;
use File::Path;
use File::Spec;
our %opts = (n=>'0',c=>'N');
getopts('f:o:',%opts);

die "perl $0
STD output
-f
-o output file\n" unless ($opts{'f'} && $opts{'o'});
$opts{'o'} = File::Spec->rel2abs($opts{'o'});
my %hash;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
next if($line=/^#/);
my @arr=split(/\t/,$line);
next if($arr[2]=
/:1/);
$hash{$arr[2]}=$arr[4];
}
close IN;
open OUT,">$opts{'o'}" or die $!;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
if($line=/^#/){
print OUT "$line";
next;
}
my @arr=split(/\t/,$line);
next if($arr[2]=
/:2/);
$arr[2]=s/:1//;
my $END=$arr[4]=s/[ATCGN][]//gr;
my @e=split(/:/,$END);
$arr[7]="Chr2=".$E[0].";END=".$E[1].";";
if($arr[0] eq $E[0]){
if($arr[4]=
/[A-Z][/ && $hash{$arr[2].":2"}=
/][A-Z]/){
$arr[4]="DEL";
}elsif($arr[4]=/[[A-Z]/ && $hash{$arr[2].":2"}=/[[A-Z]/ || ($arr[4]=/[A-Z]]/ && $hash{$arr[2].":2"}=/[A-Z]]/)){
$arr[4]="INV";
}elsif($arr[4]=/][A-Z]/ && $hash{$arr[2].":2"}=/[A-Z][/){
$arr[4]="DUP/INS";
}
}else{
$arr[4]="TRA";
}
my $out=join("\t",@arr);
print OUT "$out";
}
close IN;
close OUT;

"my" variable @arr masks earlier declaration in same scope at translocation.pl line 31.
"my" variable $line masks earlier declaration in same statement at translocation.pl line 31.
"my" variable @arr masks earlier declaration in same statement at translocation.pl line 32.
"my" variable @arr masks earlier declaration in same scope at translocation.pl line 33.
syntax error at translocation.pl line 17, near "=)"
syntax error at translocation.pl line 22, near "}"
syntax error at translocation.pl line 26, near "=)"
syntax error at translocation.pl line 30, near "}"
syntax error at translocation.pl line 50, near "}"
Execution of translocation.pl aborted due to compilation errors.

@drvenki
Copy link

drvenki commented Jun 19, 2019

Hi Vaishnavi,
What is that you are trying to communicate here. While posting your issues, could you please

  1. state your usage of this tool (what you are trying to do and how you run the tool if relevant to the issue)
  2. what is your issue/error
  3. what is your question regarding the issue/error (background, remedy, workaround, etc.)
    Thanks,
    Venkatesh Chellappa

Hi,
@walaj Thanks so much for the good work on svaba. It is really a good SV detection tool. And, for the SV classification, I wrote a simple perl script. It is very rough, any suggestions will be appreciated.
#! /usr/bin/perl -w
use FindBin qw($Bin $Script);
use Getopt::Std;
use File::Path;
use File::Spec;
our %opts = (n=>'0',c=>'N');
getopts('f⭕️',%opts);
die "perl $0
STD output
-f
-o output file\n" unless ($opts{'f'} && $opts{'o'});
$opts{'o'} = File::Spec->rel2abs($opts{'o'});
my %hash;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
next if($line=/^#/);
my @arr=split(/\t/,$line);
next if($arr[2]=
/:1/);
$hash{$arr[2]}=$arr[4];
}
close IN;
open OUT,">$opts{'o'}" or die $!;
open IN,"$opts{'f'}" or die $!;
while (my $line=){
if($line=/^#/){
print OUT "$line";
next;
}
my @arr=split(/\t/,$line);
next if($arr[2]=
/:2/);
$arr[2]=s/:1//;
my $END=$arr[4]=s/[ATCGN][]//gr;
my @e=split(/:/,$END);
$arr[7]="Chr2=".$E[0].";END=".$E[1].";";
if($arr[0] eq $E[0]){
if($arr[4]=
/[A-Z][/ && $hash{$arr[2].":2"}=
/][A-Z]/){
$arr[4]="DEL";
}elsif($arr[4]=/[[A-Z]/ && $hash{$arr[2].":2"}=/[[A-Z]/ || ($arr[4]=/[A-Z]]/ && $hash{$arr[2].":2"}=/[A-Z]]/)){
$arr[4]="INV";
}elsif($arr[4]=/][A-Z]/ && $hash{$arr[2].":2"}=/[A-Z][/){
$arr[4]="DUP/INS";
}
}else{
$arr[4]="TRA";
}
my $out=join("\t",@arr);
print OUT "$out";
}
close IN;
close OUT;

"my" variable @arr masks earlier declaration in same scope at translocation.pl line 31.
"my" variable $line masks earlier declaration in same statement at translocation.pl line 31.
"my" variable @arr masks earlier declaration in same statement at translocation.pl line 32.
"my" variable @arr masks earlier declaration in same scope at translocation.pl line 33.
syntax error at translocation.pl line 17, near "=)"
syntax error at translocation.pl line 22, near "}"
syntax error at translocation.pl line 26, near "=)"
syntax error at translocation.pl line 30, near "}"
syntax error at translocation.pl line 50, near "}"
Execution of translocation.pl aborted due to compilation errors.

@SalendraSingh
Copy link

Hi @walaj, Great work with SvABA. Just like others, I am also trying to classify BND into various subtypes (INS, DUP, DEL, TRA). In terms of that, I am trying to understand the SCTG tag provided. It will be great if you can explain how is that tag defined and what different part of the tag means.
for example : "SCTG=1:215981(-)-1:5134458(-)__1_5120501_5145501D"
In this I am able to associate that there is a translocation between 1:215981(-) and 1:5134458(-), but I am not understanding what "__1_5120501_5145501D" means.

@jykr
Copy link

jykr commented Dec 6, 2020

Hi, thanks a lot for Svaba and all the helpful discussion that's going on here! I'm trying to adopt the codes here to classify SV types and I'm confused whether we can define insertions just from the breakend information.

Based on my understanding, insertions should have another contig that would be inserted so would have 4 breakend entries in vcf output (as in p.14 in VCF 4.2 spec) (,or ideally would be only present in .sv.indel output of Svaba where the inserted sequences are assembled as a single line indel). Am I missing something here? It seems to me that all the mates with ALT

  • Mate 1: ]p]t & Mate2: t[p[

should be DUPs not INSs, given the position of Mate 1 is smaller than the position of Mate2.
Am I missing something here? Plus, is there a tag or any information in .sv.vcf that marks a single insertion? Thanks!

@jykr
Copy link

jykr commented Dec 6, 2020

@JaretK @sfrenk Thanks a lot for the helpful scripts! I am a bit confused because the codes provided above still don't seem to agree each other. Is it correct to interpret that given the mate breakends' chromosome are same and Mate1 position is smaller than Mate2, then

  • Mate1 = t]position], Mate2 = t]position] or Mate1 = [position[t, Mate2 = [position[t : INVs

  • Mate1 = t[position[, Mate2 = ]position]t : DELs

  • Mate1 = ]position]t, Mate2 = t[position[ : DUPs (possibly INSs if I'm missing something)

Thank you!

@panxiaoguang
Copy link

I took a stab at parsing the vcf files to re-classify BNDs based on the above inputs. The python script below accepts the vcf file as an argument and writes the resulting re-classified vcf to stdout (pipe it into a .classified.vcf file to save)

#!/usr/bin/env python
import re
import sys
import os

#make mates dictionary given list input of non-comment lines
def makeMateDict(m):
    d = {}
    for index1, line1 in enumerate(m):
        id1 = line1.split('\t')[2]
        numMate = re.search(r':(\d)',id1).group(1)
        origId = re.search(r'(\d+):',id1).group(1)
        if int(numMate) == 1:
            for index2, line2 in enumerate(m):
                #never start from beginning of file
                if index2 <= index1:
                    continue
                # print str(index1) + " : " + str(index2)
                id2 = line2.split('\t')[2]
                duplicateId = re.search(r'(\d+):',id2).group(1)
                duplicateNumMate = re.search(r':(\d)',id2).group(1)
                if duplicateId == origId and int(duplicateNumMate) == 2:
                    d[line1] = line2
                    break
    return d

def classify(line, ALT_INDEX, mdict):
    #get alt, chrom1, chrom2, position (pos), id, old SVTYPE (should be BND if virgin svaba vcf) from line
    s = line.split("\t")
    alt = s[ALT_INDEX]
    chrom1 = s[0]
    pos = int(s[1])
    id=s[2]

    if int(re.search(r':(\d)',id).group(1)) != 1:
        return "NONE"

    mateLine = mdict[line].split('\t')
    mateChrom = mateLine[0]
    mateAlt = mateLine[ALT_INDEX]

    oldType = re.search(r'SVTYPE=(.+?)(\s+?|:)',line).group(1)

    # get new type
    if oldType == 'BND' and chrom1 == mateChrom:
        INV_PATTERN_1 = re.compile(r'\D\].+\]')
        INV_PATTERN_2 = re.compile(r'\[.+\[\D')
        if INV_PATTERN_1.match(alt) and INV_PATTERN_1.match(mateAlt):
            return "INV"
        if INV_PATTERN_2.match(alt) and INV_PATTERN_2.match(mateAlt):
            return "INV"

        # DEL
        DEL_PATTERN_THIS = re.compile(r'\D\[.+\[')
        DEL_PATTERN_MATE = re.compile(r'\].+\]\D')
        if DEL_PATTERN_THIS.match(alt) and DEL_PATTERN_MATE.match(mateAlt):
            return "DEL"

        # INS
        INS_PATTERN_THIS = re.compile(r'\D\].+\]')
        INS_PATTERN_MATE = re.compile(r'\[.+\[\D')
        if INS_PATTERN_THIS.match(alt) and INS_PATTERN_MATE.match(mateAlt):
            return "DUP/INS"

    return 'BND'

if __name__ == "__main__":
    file = sys.argv[1]
    if not os.path.exists(file):
        raise IOError(file)
    alt_index = -1
    #generate mate:mate dictionary
    #load file into ram
    vcf_file=[]
    with open (file, 'r') as f:
        for line in f:
            if line.startswith('#'):
                continue
            vcf_file.append(line)
    matesDict = makeMateDict(vcf_file)
    with open(file, "r") as f:
        for line in f:
            # print comments
            if line.startswith("##"):
                sys.stdout.write(line)
                continue
            # header contains indexes
            if line.startswith('#'):
                split = line.split("\t")
                for index, val in enumerate(split):
                    if val == "ALT":
                        alt_index = index
                        break
                sys.stdout.write(line)
                continue
            if alt_index == -1:
                print "ERROR: NO ALT INDEX FOUND"
                exit(1)
            newType = classify(line, alt_index, matesDict)
            if newType != "NONE":
                newLine = re.sub(r'SVTYPE=BND',"SVTYPE="+newType,line)
                sys.stdout.write(newLine)

EDIT: Corrected DEL and INS calls. Added "TRA" is BND is across chromosomes. Changed INS to DUP/INS

Thank you so much for your great scripts, but I think you still make some mistakes.

  1. for DUP/INS, it should be r'\].+\]\D' and r'\D\[.+\[' .
  2. for BND( I think it's called interchromosomal translocation) , you should't remove it's mate record because we always need to know the strand information for both itself and it's mate reord.

@drvenki
Copy link

drvenki commented Sep 1, 2022

it is the end of 2022 and I see people are still using svaba despite no regular updates or upgrades. if SVtype classification was done and the tool itself was tweaked for sensitivity and specificity to detect low frequency SVs then, it would be such an amazing opportunity to bring this back to regular use in several pipelines.
Cheers,
Venki

Venkatesh Chellappa

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests