vg on RNA-seq data #35
Comments
@inti YES!!!! vg should in theory be able to do all of these things. However, the intron-as-indel interpretation issue is somewhat complicated. I need to think about that. (Perhaps you could elaborate?) You can start testing by building a graph with a few transcripts for a gene. However, there is a blocking issue: #39. Ideally, we should align assembled transcripts. This would get us the "indel" corresponding to the intron splicing events. However, it may be more feasible to align just the exons from each transcript, using the transcript name as the alignment name. Then, we'd have to modify the graph to include the introns that are excised. I'm not sure at present which way will work better. Do you have test data to work with? |
Hi @ekg, I think a test could be to take some mRNA-seq data from the human genome, potentially for one of the 1KG samples, and check whether we can find the known variation. Does that make sense to you? I do not have that data at hand, but for the sake of trying I can find it and send you small BAM files. |
I really like this approach. It's not completely general and will require more coding to enable (ideally, we just align assembled transcripts to build up the RNA+DNA variation graph), but it will work cleanly and the input is well-curated. To get both variation and transcripts, all that needs to be done is to modify the 1000G+GRCh37 graph to include the intronic deletions that we see in the RNA-seq data. Then we'd just need to clean up the graph, ensuring partial ordering of ids and such ( Code will need to be written to do the inclusion. It would look like this.
So the basic idea is that we'd use the exon/intron map to build up alignments (vg::Alignment) whose paths are the transcript alignment and whose names are the transcript names. Then we can use an existing tool
What kind of format are the exon-intron maps in? |
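To make the "intron as deletion" idea above concrete, here is a minimal sketch in plain Python. It does not use vg or `vg::Alignment`; the `TranscriptPath` class, its fields, and the 0-based half-open coordinate convention are all assumptions made for illustration. It only shows how a named exon list implies both the spliced sequence and the reference intervals that would appear as deletions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TranscriptPath:
    """Hypothetical stand-in for a named transcript alignment (not a vg type)."""
    name: str
    # Exons as 0-based, half-open (start, end) intervals on the reference,
    # sorted by start and non-overlapping (an assumption of this sketch).
    exons: List[Tuple[int, int]]

    def introns(self) -> List[Tuple[int, int]]:
        """Reference intervals skipped between consecutive exons."""
        return [
            (prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(self.exons, self.exons[1:])
        ]

    def spliced_sequence(self, reference: str) -> str:
        """Concatenate the exon sequences, i.e. the reference with introns deleted."""
        return "".join(reference[start:end] for start, end in self.exons)


if __name__ == "__main__":
    reference = "AAACCCGGGTTT"
    tx = TranscriptPath(name="tx1", exons=[(0, 3), (6, 9)])
    print(tx.name, tx.introns(), tx.spliced_sequence(reference))
```

Aligning the spliced sequence back to the reference would, in principle, show each interval from `introns()` as one long deletion, which is the interpretation discussed in this thread.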
Regarding the format of the exon-intron maps: it depends on the software (as always?). GSNAP does it like this, from the README http://research-pub.gene.com/gmap/src/README
The gtf_splicesites and gtf_introns programs come bundled with GSNAP, and they come in handy for going from GTF/GFF3 annotations to intron-exon maps. This approach could be flexible, since people can either rely on known annotations or first map the RNA-seq data to the genome to identify likely variation and transcript structures, and then re-map/refine the variation/BAM files using variation graphs. Is it possible to modify a variation graph with a VCF file? Would it be easier to take a GTF/GFF file and convert it into a mock VCF file to code the introns as just another variant, perhaps with a special label like NM_004448.ERBB2.intron1 (as in the example above)? |
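As a rough sketch of the "mock VCF" idea, the function below turns the exon coordinates of one transcript into VCF-style records that encode each intron as a symbolic `<DEL>` allele. The function name, the argument layout, and the choice of symbolic alleles are assumptions for illustration, not something GSNAP or vg provides. Exons are taken as 1-based inclusive intervals, as in GTF, and introns are numbered in genomic order, which would be reversed for minus-strand transcripts.

```python
from typing import List, Tuple


def introns_to_mock_vcf(
    chrom: str,
    transcript_id: str,
    gene: str,
    exons: List[Tuple[int, int]],
    reference: str,
) -> List[str]:
    """Return VCF body lines (no header) coding each intron as a deletion.

    exons: 1-based inclusive (start, end) pairs, GTF-style.
    reference: sequence of `chrom`, used only to fill the REF column.
    """
    records = []
    ordered = sorted(exons)
    pairs = zip(ordered, ordered[1:])
    for number, ((_, prev_end), (next_start, _)) in enumerate(pairs, start=1):
        intron_len = next_start - prev_end - 1
        if intron_len <= 0:
            continue  # adjacent or overlapping exons: nothing to delete
        pos = prev_end                      # base just before the deleted span
        ref_base = reference[pos - 1]       # convert 1-based POS to 0-based index
        info = f"SVTYPE=DEL;END={next_start - 1};SVLEN={-intron_len}"
        var_id = f"{transcript_id}.{gene}.intron{number}"
        records.append(
            "\t".join([chrom, str(pos), var_id, ref_base, "<DEL>", ".", ".", info])
        )
    return records


if __name__ == "__main__":
    toy_reference = "ACGT" * 5
    toy_exons = [(1, 4), (9, 12), (17, 20)]
    for line in introns_to_mock_vcf("chr17", "NM_004448", "ERBB2", toy_exons, toy_reference):
        print(line)
```

Whether a given graph construction tool accepts symbolic deletions of intron length is a separate question; an explicit REF/ALT deletion would be the fallback, at the cost of very long REF strings.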
It would be possible, but again there needs to be some coding to make this happen. The simplest way is to convert the variants in the file into reads of the novel haplotype represented by the variant, then map these into the existing graph. One thing that's lacking in vg now is naming mechanisms. Read and sequence names need to be able to be used to annotate paths in the graph. You could actually script out the above solution using the vg command line API if this were the case. So, keep an eye on #39. |
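The "variant to read" conversion described above can be sketched as follows. This is plain Python, not a vg feature; the function name and the `flank` parameter are hypothetical. It builds the sequence of the novel haplotype by substituting the ALT allele into the reference and keeping some flanking context, so that the result could be mapped back to an existing graph.

```python
def variant_to_read(reference: str, pos: int, ref: str, alt: str, flank: int = 50) -> str:
    """Build a read carrying the ALT allele with up to `flank` bases on each side.

    pos is 1-based, as in VCF; ref and alt are explicit sequences
    (symbolic alleles are not handled in this sketch).
    """
    start = pos - 1  # 0-based index of the first REF base
    if reference[start:start + len(ref)] != ref:
        raise ValueError(f"REF allele {ref!r} does not match reference at position {pos}")
    left = reference[max(0, start - flank):start]
    right = reference[start + len(ref):start + len(ref) + flank]
    return left + alt + right


if __name__ == "__main__":
    toy_reference = "AAAACCCCGGGGTTTT"
    # A 3 bp deletion written VCF-style: REF=CCCC, ALT=C at position 5.
    print(variant_to_read(toy_reference, pos=5, ref="CCCC", alt="C", flank=4))
```

The flank length is a trade-off: it has to be long enough for the mapper to anchor the read on both sides of the variant, which matters most for long deletions such as introns.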
Have you had a chance to test this out further? |
Not yet. I got busy with something else. I expect to come back to this within 2-3 weeks. helps?
|
With banded/split read mapping and the graph editing stuff stabilizing, I'd say it's the perfect time to come back to this. These enable the alignment of long contigs (such as transcript sequences) to the reference, and then the editing of the reference with them. Also, I think I've got GCSA indexing working for the whole human genome in <200G of RAM, which is critical for making a usable high-performance mapper. Fingers crossed. Ping me if you come back to this and have any questions. |
We've done a couple proof of concept things with RNA graphs, and I think we're definitely ready for someone to do RNA-seq with vg as an actual project. I'm going to close this. |
Hi, I would like to follow-up on this interesting thread. |
If the splicing is represented in the graph then we should have no problem
aligning directly against them. Alternative splicings look the same as the
reference from the perspective of the aligner.
To detect novel splicing we could adapt the banded aligner or adjust the
scoring function that is used during clustering in single ended read
alignment.
I am happy to work on this in the near future. I expect there may be some
problems because we have not attempted this before, but in principle the
algorithms in vg should have no problem aligning reads to RNAseq graphs.
Further, we can align reads to RNAseq assemblies. This would be interesting
in that the reference would then represent the full data set down to some
noise threshold set in the assembler. Do others think that would be a
useful thing to explore?
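One way to picture the "adjust the scoring function" suggestion above is a deletion penalty that stops growing once a gap is long enough to be an intron. The sketch below is not vg's scoring code; the function name and every parameter value are hypothetical, chosen only to show the shape of such a penalty.

```python
def deletion_penalty(
    length: int,
    gap_open: int = 6,
    gap_extend: int = 1,
    intron_penalty: int = 30,
    min_intron: int = 20,
) -> int:
    """Penalty for a deletion of `length` reference bases.

    Short gaps use ordinary affine scoring. Gaps of at least `min_intron`
    bases are never charged more than a flat `intron_penalty`, so a long
    skip over an intron is not penalised in proportion to its length.
    """
    if length <= 0:
        raise ValueError("deletion length must be positive")
    affine = gap_open + gap_extend * length
    if length >= min_intron:
        return min(affine, intron_penalty)
    return affine


if __name__ == "__main__":
    for n in (3, 20, 1000):
        print(n, deletion_penalty(n))
```

With a cap like this, two exon-anchored clusters separated by kilobases of reference can still be chained into one alignment, whereas pure affine scoring would make that chain score far worse than two separate partial alignments.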
…On Wed, Mar 22, 2017, 9:58 AM CDieterich ***@***.***> wrote:
Hi,
I would like to follow-up on this interesting thread.
While I understand that pan-transcriptome and pan-genome indices can be
built and used right away,
I was wondering how spliced alignment in the spirit of HISAT2 could be
performed.
Any feedback is highly appreciated.
|
OK, the first part sounds great (i.e. incorporating known splice sites). The second part is even more interesting (i.e. finding novel splice sites or even back-splice sites). I am very happy to see your positive reply. Could you point me towards possible options? |
Hi @ekg, I tried to build a graph for a gene with different transcripts. But what I got was a separate linear reference for each transcript. Could you help me with that? vg construct -r gene.fa > tmp/x.vg Looking forward to your reply. |
Hi @btrspg, since this issue was closed we've developed methods for constructing variation graphs with splice edges in |
@jeizenga Appreciate. |
Hi, @jeizenga, I still have some questions about I manipulated like this:
Then I used sequenceTubeMap to see whether there was any new path in the graph, but what I got was a linear reference. |
If I recall correctly, sequenceTubeMap is designed primarily for genomic variation graphs with phased haplotypes. There are some pitfalls that might be catching you up on a splice graph. I would recommend |
Hi, @btrspg, transcript paths are not added to the graph by default in |
@btrspg If you run If you have multiple homologous sequences to make a graph from, you want https://github.com/ekg/seqwish or maybe |
Thank you all @jeizenga @jonassibbesen @adamnovak. Indeed, what I need is to use vg to do some RNA-seq work. I now know why I could not see the paths in the graph: I needed to build the index for
Then I could see the different paths from the different transcripts in sequenceTubeMap. I used Thank you all again. |
Hi,
This is great work and I look forward to trying it! I had some questions and I thought it would be better to ask than not to.
thanks in advance,
Inti