build a variation graph from a collection of yeast genomes #189
Comments
One alternative is to use
➜ Downloads zcat SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz | vcfsamplenames
BC187
DBVPG1106
DBVPG1373
DBVPG1788
DBVPG6044
DBVPG6765
L1374
L1528
SK1
UWOPS03-461.4
UWOPS83-787.3
UWOPS87-2421
W303
Y12
Y55
YJM975
YJM978
YJM981
YPS128 |
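For anyone without vcflib installed, the sample list above can also be read straight from the VCF header. A minimal sketch in Python, assuming a standard VCF where the `#CHROM` line carries nine fixed columns followed by the sample names; the function names and the tiny in-memory VCF are hypothetical and only for illustration:

```python
import gzip
import io


def vcf_sample_names(handle):
    """Return the sample names from the #CHROM header line of a VCF text stream."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # Columns 0-8 are fixed (CHROM through FORMAT); samples follow.
            return fields[9:]
        if not line.startswith("#"):
            break  # reached the records without seeing a header line
    return []


def vcf_sample_names_from_gz(path):
    """Same as above, for a gzip-compressed VCF on disk."""
    with gzip.open(path, "rt") as handle:
        return vcf_sample_names(handle)


# Tiny in-memory VCF, made up for this example.
demo = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tBC187\tSK1\tY12\n"
    "chrI\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t1/1\t0/0\n"
)
sample_names = vcf_sample_names(demo)
print(sample_names)
```

On the real file, `vcf_sample_names_from_gz("SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz")` should give the same list as `vcfsamplenames`, provided the header is well formed.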
A paper describes the work: A High-Definition View of Functional Genetic Variation from Natural Yeast Genomes. Also, I'm waiting to hear back from Jared. |
ftp://ftp.sanger.ac.uk/pub/users/dmc/yeast/SGRP2/ |
Also, this may be what we're after: http://www.tgac.ac.uk/news/232/15/Yeast-treasure-trove-goes-live/ |
@markmcdowall might be able to help you with pombe data, help you find VCFs and such. |
For S. pombe there was a study of 57 of the lab strains of S. pombe that are most commonly used. The paper you want is: The matching VCFs are in the EVA at: |
Moving a previous conversation between me, @ekg and @edawson over to this thread. Interestingly, even with verbose output on, it isn't clear why my reads aren't aligning. The backstory: I'm comparing the process of aligning reads from http://opendata.ifr.ac.uk/NCYC/ (NCYC10 specifically) to the variation graph I've built and to the SGD_2010 reference. I'm able to map to the variation graph but not to the linear reference.
Also for the pipeline aln + samse:
It does just seem like the reads aren't mapping. I'll try a different yeast strain which I know to be S. cerevisiae. |
Try aligning 100 reads with vg map with verbose debugging on (-D). What [...]
|
Here's what it looks like for the first read. This read maps to the variation graph but not to the linear reference. It seems to be finding mems effectively
|
This is saying that it is aligning with 40 soft clips. Is this the whole log? We should check that it is passing the identity filter. I don't know if [...]
|
Yep it's the whole log from what is returned - and yes that would be a great exercise :D |
So the reads from a different strain, NCYC88, map just fine - I think it must be due to this.
100 + 0 in total (QC-passed reads + QC-failed reads) |
Yes, I build on the farm. You will need to have a new version of g++ in [...]
|
Try adding /nfs/users/nfs_e/eg10/bin and [...]
|
I'm redownloading everything on the farm - is there a better way to do this? |
@ekg All goes well except jansson isn't available! |
I guess you will need to install it in your home directory. Build with [...]
|
Hey @ekg @edawson, I'm noticing some strange behaviour in the recent version of vg map. I'm simulating perfect reads from the graph but the alignment scores vary quite a bit.
Simulate 10 perfect reads:
Map these reads:
real 0m0.905s
Look at scores:
16 |
Along this theme: it seems that vg map doesn't work quite as well as bwa mem. Here's a histogram of alignment scores from vg map. Many of them are near zero, whereas bwa mem maps a very high percentage of the reads.
1001 + 0 in total (QC-passed reads + QC-failed reads)
Possible reasons: vg map is not mapping the reverse strand; around half of my reads map to the reverse strand in bwa mem. |
Hi @willgdjones - why do you expect there to be less variation in the mapping qualities? Even with perfect reads, mapping quality will depend on whether the read comes from a repeat. Does bwa_mem give a very different histo with them all more confident? |
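To make the repeat point concrete: even an error-free read can have more than one exact placement, and a mapper cannot be confident about which one is right. A toy sketch in Python, not how vg or bwa compute mapping quality; the reference string, the reads and the `exact_hits` helper are all made up for illustration:

```python
def exact_hits(reference, read):
    """Return every 0-based start position where read matches reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Step by one so overlapping occurrences are also found.
        start = reference.find(read, start + 1)
    return hits


# Hypothetical reference with the same 14 bp element inserted twice.
repeat_unit = "GATTACAGATTACA"
reference = "ACGTTTGACCA" + repeat_unit + "CCGTAGGCT" + repeat_unit + "TTTGGC"

unique_read = "GACCAGATT"   # spans the left flank of the first copy
repeat_read = repeat_unit   # lies entirely inside the repeated element

unique_positions = exact_hits(reference, unique_read)
repeat_positions = exact_hits(reference, repeat_read)
print(len(unique_positions), "placement(s) for the unique read")
print(len(repeat_positions), "placement(s) for the repeat read")
```

Both reads are "perfect", but only the first has a single candidate position, so only it deserves a high mapping quality.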
Do you know for sure that exactly the same reads are present in each [...]
|
@ekg Yep I just noticed this! Quick spot.. |
Among those reads that are in both sets, do a scatter plot where X=bwa mem [...]
|
How was the index generated and how is the mapping working? Note that the alignment score is not equivalent to the mapping quality. They may be correlated depending on how they are derived, but they are not the same thing. What do you get when you plot X=bwa mapping quality and Y=vg alignment quality? |
The reads in the SAM output are in a different order from the input FASTQ. Just fiddling and joining on the reads now to get them to match up! |
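Joining on read name avoids depending on output order at all. A rough sketch in Python, assuming the bwa side is plain SAM text and the vg side is one JSON alignment per line (as `vg view -a` prints) with `name` and `score` fields. The helper names and the demo records are hypothetical, and a missing `score` is treated as 0 on the assumption that zero-valued fields are omitted from the JSON:

```python
import json


def bwa_mapq_by_read(sam_lines):
    """Map read name -> MAPQ from SAM text, using primary records only."""
    mapq = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        mapq[fields[0]] = int(fields[4])
    return mapq


def vg_score_by_read(json_lines):
    """Map read name -> alignment score from JSON lines."""
    score = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        aln = json.loads(line)
        score[aln["name"]] = int(aln.get("score", 0))
    return score


def join_on_name(mapq, score):
    """Return (name, bwa_mapq, vg_score) for reads present in both sets."""
    return [(name, mapq[name], score[name]) for name in sorted(mapq.keys() & score.keys())]


# Made-up records, deliberately in different orders.
sam_demo = [
    "@SQ\tSN:chrI\tLN:230218",
    "read2\t16\tchrI\t500\t60\t100M\t*\t0\t0\t*\t*",
    "read1\t0\tchrI\t100\t60\t100M\t*\t0\t0\t*\t*",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
vg_demo = [
    '{"name": "read1", "score": 100}',
    '{"name": "read2"}',
    '{"name": "read4", "score": 90}',
]
pairs = join_on_name(bwa_mapq_by_read(sam_demo), vg_score_by_read(vg_demo))
print(pairs)
```

The resulting triples can go straight into a scatter plot, and reads missing from either side drop out of the join, which also answers the question of whether both sets contain the same reads.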
Chatting with Richard it seems likely that the reason for half of the reads mapping unsuccessfully is that they need reverse complementing and that this is not done automatically within vg map. |
This is done automatically within vg map.
|
I see, you've probably generated the gcsa2 index without reverse kmers. You disable this behavior by removing -F from the vg kmers call in the [...]
|
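A toy illustration in Python of why a forward-only kmer set loses reverse-strand reads. This is not vg's or GCSA2's indexing code: the set-based "index", the function names and the short T-free reference are all hypothetical, chosen so that forward and reverse-complement kmers cannot coincide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(seq, k, include_reverse):
    """Toy kmer index: forward kmers, plus reverse-strand kmers if requested."""
    index = kmers(seq, k)
    if include_reverse:
        index |= kmers(reverse_complement(seq), k)
    return index


def seed_hits(read, index, k):
    """Count how many kmer positions of the read are found in the index."""
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in index)


reference = "ACGAACCAGACAAGCACGAGAACCGA"  # made-up sequence
k = 5
forward_read = reference[4:16]
reverse_read = reverse_complement(forward_read)  # same locus, opposite strand

forward_only = build_index(reference, k, include_reverse=False)
both_strands = build_index(reference, k, include_reverse=True)

print("forward read, forward-only index:", seed_hits(forward_read, forward_only, k))
print("reverse read, forward-only index:", seed_hits(reverse_read, forward_only, k))
print("reverse read, both-strand index: ", seed_hits(reverse_read, both_strands, k))
```

With the forward-only set the reverse-strand read finds no seeds at all, which is consistent with roughly half of the reads failing to map when the index is built without reverse kmers.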
I see! That makes a lot of sense. Somehow this came from where we were discussing on the gist, can't immediately find where the -F came from prior to that. https://gist.github.com/willgdjones/8a3b2ac59a645d4c033a4f0abb22728a |
You want this part:
Let's work from that page on the wiki to make sure we've got the right docs [...]
|
Yep I can edit this - it's good as it is for now, though. I can also put my work notebooks up as extra documentation. |
Since this thread was last updated we've made this kind of resequencing analysis routine. Thanks to everyone who participated, in particular @willgdjones, whose focus on the yeast data set really helped to kickstart serious improvement in vg map through rigorous testing. |
I should leave this here. It implements indexing of the SGRP2, including pruning with GBWT path filling into the gaps to build a GCSA2 index with order 256:
|
If you have FASTA sequences, it should be straightforward enough to name them, concatenate them into one FASTA per genome, and run vg msga as such:
Or, more verbosely:
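The naming and concatenation step might look roughly like this in Python. This is only a sketch of the preparation before vg msga, not of the msga call itself; the `genome.record` naming scheme, the function names and the demo sequences are assumptions made for illustration:

```python
def rename_fasta_records(fasta_text, genome_name):
    """Prefix every record ID with the genome name and drop header descriptions."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            record_id = tokens[0] if tokens else "unnamed"
            out.append(f">{genome_name}.{record_id}")
        elif line.strip():
            out.append(line.strip())
    return "\n".join(out) + "\n"


def concatenate_genome(fasta_chunks, genome_name):
    """Combine several FASTA texts (e.g. one per chromosome) into one per genome."""
    return "".join(rename_fasta_records(chunk, genome_name) for chunk in fasta_chunks)


# Made-up input: two small FASTA texts belonging to the same strain.
chunks = [
    ">chrI some description\nACGT\nAC\n",
    ">chrII\nGGTT\n",
]
combined = concatenate_genome(chunks, "SK1")
print(combined, end="")
```

Unique, strain-prefixed record names keep sequences from different genomes distinguishable once they are all in the same graph.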