Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fusion strand #26

Closed
kgaonkar6 opened this issue Aug 6, 2019 · 22 comments
Closed

fusion strand #26

kgaonkar6 opened this issue Aug 6, 2019 · 22 comments

Comments

@kgaonkar6
Copy link

Hello,

I am curious about the strand1 and strand2 columns in arriba output. I read in the documentation that the strand before "/" is the strand of the gene from the gtf and the second strand (that is after "/") is from the assembled fusion.

However I'm not sure how arriba assembles the fusion supporting reads to get the strand information from the STAR BAM file. Can you please explain?

Thanks!

@suhrig
Copy link
Owner

suhrig commented Aug 8, 2019

Assembly is not used to predict the transcribed strand (given after the slash). The transcribed strand is derived from either the strand of the supporting reads (if a stranded library was used) or from splice patterns of the supporting reads (if a non-stranded library was used).

Assembly is used to determine the transcript sequence around the breakpoints. Basically, Arriba performs a pileup of all the reads supporting a fusion and takes the most frequent base for each position.

Does this answer your question?

@kgaonkar6
Copy link
Author

Thank you that does help.

So to clarify in the draw_fusions.R the script reads the "after the slash" strand information to gather/re-arrange the gene structure (and overlapping domains) to plot the fusion? That visualization looks amazing so I'm trying to understand it better.

@suhrig
Copy link
Owner

suhrig commented Aug 8, 2019

draw_fusions.R could in theory use the strand information after the slash, but it does not need to, because Arriba arranges gene1 and gene2 to reflect the 5' and 3' end of the fusion. Transcription always starts in gene1 and ends in gene2 (with few exceptions where it's completely unclear). draw_fusions.R orients the fusion gene such that transcription goes from left to right. It does not need to evaluate the transcribed strand, because this information is implicit in the order of gene1 and gene2.

That visualization looks amazing so I'm trying to understand it better.

Thanks. Feel free to ask if you would like to customize some aspects. What may take an hour for you may be a matter of minutes for me. I can either send you a custom patch or - if the request is likely useful to many users - add a new feature.

@kgaonkar6
Copy link
Author

Oh great! I'm actually trying to visualize STAR-fusion calls the same way arriba visualizes them to have similar plots from both callers which is why I had specific questions as to how the draw_fusion.R script works. https://gist.github.com/kgaonkar6/d2efc3660d31c20217a9071c6244e70a is what I have now to format the files correctly which seems to work to create the similar plots. But if you can send it as a patch to the draw_fusion.R script itself that would be really appreciated. I guess others might find it useful as well..?

@Ephedria
Copy link

Hello @kgaonkar6

I'm trying to use your script to draw Fusions from StarFusion. It seem to work for some fusions but not all. Did you encounter this kind of behavior ?

Best,

@kgaonkar6
Copy link
Author

Hi @Ephedria , I was able to plot all the fusions from my StarFusion file. However, I did notice if the gene was not found in the gtf file provided it would print out "Error XYZ exon coordinate not found in gtf" on the pdf page.
Also you might want to check the format/colnames of Starfusion, mine was output from STAR-Fusion 1.5.0

Did you get any errors?

@Ephedria
Copy link

Hello Krutika,
I didn't get an error like you, mine stopped the process right away :

Warning message:
In colnames(fusions)[colnames(fusions) %in% c("X.gene1", "strand1.gene.fusion.", :
number of items to replace is not a multiple of replacement length
Loading annotation
Loading protein domains
Drawing fusion #1: CALR:TANC2
Drawing fusion #2: SENP6:MYO6
Drawing fusion #3: IKZF2:ERBB4
Error in if ((codingLength1 == 0 || antisenseTranscription1) && (codingLength2 == :
missing value where TRUE/FALSE needed
Calls: drawProteinDomains
Execution halted

I tried to look at the line that cause the bug, I didn't found anything relevant...
My output is from a FusionInspector with the --examine_coding_effect which is the same as StarFusion.

I didn't thought of this possibility since your script worked fine. Could you send me an exemple of your StarFusion output so I may try to understand what cause the bug ?

Best,

@kgaonkar6
Copy link
Author

Sorry for the late reply, here you go:
testStarFusion.txt

@Ephedria
Copy link

Thanks for the link, I will try to found out the problem.

@suhrig
Copy link
Owner

suhrig commented Feb 27, 2020

I also apologize for the lack of responses from my side. I have been really busy lately.

In colnames(fusions)[colnames(fusions) %in% c("X.gene1", "strand1.gene.fusion.", :

This is likely the source of the error. I suspect, the column names of your file are not correct. You should check that they are identical to the ones that are generated by Arriba.

I plan on making the draw_fusions.R script compatible with STAR-Fusion and could send you the script. But I probably will not come around to doing this in the next 1-2 weeks.

@Ephedria
Copy link

Hello
Sorry I forgot to answer ... The header doesn't exactly look like Arriba's but it work on some fusions and not on others fusions.
I don't find out why ...
I will love to see draw_fusions.R working with StarFusion. I hope this is not too much work for you ...

Best,

@suhrig
Copy link
Owner

suhrig commented Mar 10, 2020

I already have a prototype of draw_fusions.R that is compatible with STAR-Fusion. I'll push it to GitHub next week probably. Still need to do some testing. Stay tuned.

@Ephedria
Copy link

Ho cool I will be eager to test it !

@suhrig
Copy link
Owner

suhrig commented Mar 19, 2020

Hi Krutika,

I just uploaded an enhanced version of draw_fusions.R to the development branch of the repository that should also be compatible with STAR-Fusion output (both with and without FusionInspector extra columns). You can download it from here:

https://raw.githubusercontent.com/suhrig/arriba/develop/draw_fusions.R

Feel free to give it a try if you find the time.

Regards,
Sebastian

@Ephedria
Copy link

Thanks for the hard work.

I was testing it with an input of FusionInspector but I didn't seem to work ...
I had this error :

Loading annotation
Warning message:
In parseGtfAttribute("exon_number", exons) :
Failed to parse 'exon_number' attribute of 558 GTF record(s).
Loading protein domains

Can you send me the command used to create your output of StarFusion ? I will try to stick to it.

Best,
Ephe

@suhrig
Copy link
Owner

suhrig commented Mar 20, 2020

Warning message:
In parseGtfAttribute("exon_number", exons) :

This error is related to the GTF file and has nothing to do with the fusion file. What GTF are you using? Can you send me a link or upload it somewhere?

Moreover, this is only a warning and in view of the fact that only 558 records are affected (of probably tens of thousands), this should not even be a concern. Is a PDF file generated as output?

@Ephedria
Copy link

Hum quite strange, because I try with an ouput from Arriba and it worked well. I didn't change the gtf file.
I use the : GRCh38_gencode_v31_CTAT_lib_Oct012019.plug-n-play/ctat_genome_lib_build_dir/ref_annot.gtf

It does create a PDF file, but it's empty ...

@suhrig
Copy link
Owner

suhrig commented Mar 22, 2020

The error about parsing the GTF file should appear regardless of whether you use Arriba or STAR-Fusion. Can you confirm? Please send me the full call to draw_fusions.R for both Arriba and STAR-Fusion.

When I use CTAT annotation, the GTF parsing error is shown with output from both Arriba and STAR-Fusion. So at least the error is definitely not related to the fusions file. And in both cases a non-empty PDF file is produced. I'm suspecting an issue with the content of your STAR-Fusion file that the script cannot handle. Can you share the content? Or at least the header line and one fusion prediction? Thanks.

@Ephedria
Copy link

I did verified, you were right, I do have the same error when trying with an Arriba output...

Here is the output with Arriba output :
Loading annotation
Warning message:
In parseGtfAttribute("exon_number", exons) :
Failed to parse 'exon_number' attribute of 558 GTF record(s).
Loading protein domains
Drawing fusion #1: RNASEH2B-AS1:RNASEH2B
Drawing fusion #2: AC004943.2:AC116667.2
....

And here is the output with a FusionInspector output :

Loading annotation
Warning message:
In parseGtfAttribute("exon_number", exons) :
Failed to parse 'exon_number' attribute of 558 GTF record(s).
Loading protein domains
/mnt/go-docker/cmd.sh: line 8: 36 Killed /scratch2/tmp/mneou/Scripts/draw_fusions.R --fusions="/scratch2/tmp/mneou/FusionInspector/MAP634_WTS/MAP634_WTS.FusionInspector.fusions.abridged.tsv.coding_effect" --alignments="/scratch2/tmp/mneou/Arriba/MAP634_WTS/STAR/MAP634_WTSAligned.sortedByCoord.out.bam" --output="/scratch2/tmp/mneou/FusionInspector/MAP634_WTS/MAP634_WTS_fusions_CR.pdf" --annotation=/data/annotations/Human/GRCh38/index/star-fusion/v1.8/GRCh38_gencode_v31_CTAT_lib_Oct012019.plug-n-play/ctat_genome_lib_build_dir/ref_annot.gtf --cytobands=/database/cytobands_hg38_GRCh38_2018-02-23.tsv --proteinDomains=/database/protein_domains_hg38_GRCh38_2019-07-05.gff3

What annotation are you using ? I will try to use the same one, if it can prevent this warning ?

Sure here it is :
https://github.com/Ephedria/FusionInspector/blob/master/Sample1_FusionInspector.tsv

Thanks again for the help.

Best

@suhrig
Copy link
Owner

suhrig commented Mar 22, 2020

/mnt/go-docker/cmd.sh: line 8: 36 Killed

This is the real issue. The script got killed by an external process. The most likely explanation is that it ran out of memory. Typically, the script needs < 2GB. Does the script run inside a job with an artificial memory limit? Another explanation is that your job has a time limit and was killed after reaching that limit. Either way, this is not a bug in the script, but an operating system issue.

What annotation are you using ? I will try to use the same one, if it can prevent this warning ?

I use GENCODE >= v28. The warning is harmless, though. You can continue using your GTF file if you want. All this means is that about 1% of the genes will lack exon labels when they are drawn, i.e., there is no number inside the exon. So it's only cosmetic.

@Ephedria
Copy link

Hoo my bad, you were right I did ran out of memory ...
I did again by expending the memory allocatated it works fine.

I will try to test on different samples, but I pretty sure it will turn out well.
Thanks again for this tool and the help.

Best,

@suhrig
Copy link
Owner

suhrig commented Oct 10, 2020

Hi Krutika,

Arriba version 2.0.0 has just been released. It natively supports creating visualizations of fusion predictions made by STAR-Fusion. So this issue is finally resolved. Let know if you still have questions. I am closing this thread for now.

Regards,
Sebastian

@suhrig suhrig closed this as completed Oct 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants