Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build a variation graph from a collection of yeast genomes #189

Closed
ekg opened this issue Feb 2, 2016 · 35 comments
Closed

build a variation graph from a collection of yeast genomes #189

ekg opened this issue Feb 2, 2016 · 35 comments

Comments

@ekg
Copy link
Member

ekg commented Feb 2, 2016

If you have FASTA sequences, it should be straightforward enough to name them, concatenate them into one FASTA per genome, and run vg msga as such:

time vg msga -f yeasts.fa -B 1024 -k 22 -K 11 -X 1 -E 4 -Q 22 -D >yeasts.vg

Or, more verbosely:

time vg msga -f yeasts.fa \
    --band-width 1024 \
    --map-kmer-size 22 \
    --idx-kmer-size 11 \
    --idx-doublings 1 \
    --idx-edge-max 4 \
    --idx-prune-subs 22 \
    --debug >yeasts.vg
@ekg
Copy link
Member Author

ekg commented Feb 2, 2016

One alternative is to use vg construct to go from VCF files and the reference to a graph. You can also use this as the base graph in msga, to which it will align the other sequences. The Saccharomyces Genome Resequencing Project has a download page that includes VCF files for a number of yeast strains:

➜  Downloads  zcat SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz | vcfsamplenames 
BC187
DBVPG1106
DBVPG1373
DBVPG1788
DBVPG6044
DBVPG6765
L1374
L1528
SK1
UWOPS03-461.4
UWOPS83-787.3
UWOPS87-2421
W303
Y12
Y55
YJM975
YJM978
YJM981
YPS128

@ekg
Copy link
Member Author

ekg commented Feb 2, 2016

@ekg
Copy link
Member Author

ekg commented Feb 2, 2016

ftp://ftp.sanger.ac.uk/pub/users/dmc/yeast/SGRP2/

@ekg
Copy link
Member Author

ekg commented Feb 2, 2016

Also, this may be what we're after: http://www.tgac.ac.uk/news/232/15/Yeast-treasure-trove-goes-live/

@nerdstrike
Copy link

@markmcdowall might be able to help you with pombe data, help you find VCFs and such.

@markmcdowall
Copy link

For S. pombe there was a study of 57 of the lab strains of S. pombe that are most commonly used.

The paper you want is:
http://europepmc.org/abstract/MED/25665008

The matching VCFs are in the EVA at:
http://www.ebi.ac.uk/eva/?eva-study=PRJEB2733

@willgdjones
Copy link

willgdjones commented Apr 18, 2016

Moving a previous conversation between me @ekg and @edawson over to this thread. Interestingly even with verbose output on, it doesn't seem clear why my reads aren't aligning.

The backstory, I'm comparing the process of aligning reads from http://opendata.ifr.ac.uk/NCYC/ (NCYC10 specifically) to the variation graph I've built and the SGD_2010 reference. I'm able to map to the variation graph but not to the linear reference.

bwa mem -N10 -v 4 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sam ​ [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 100 sequences (10100 bp)... [M::mem_process_seqs] Processed 100 reads in 0.012 CPU sec, 0.332 real sec [main] Version: 0.7.13-r1126 [main] CMD: bwa mem -N10 -v 4 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 6.612 sec; CPU: 0.084 sec

Also for the pipeline aln + samse:

!bwa aln -n 100 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sai [bwa_aln_core] calculate SA coordinate... 0.01 sec [bwa_aln_core] write to the disk... 0.00 sec [bwa_aln_core] 100 sequences have been processed. [main] Version: 0.7.13-r1126 [main] CMD: bwa aln -n 100 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 10.371 sec; CPU: 0.028 sec

!bwa samse data/SGD_2010.fasta NCYCaln-T10.sai data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sam [bwa_aln_core] convert to sequence coordinate... 0.03 sec [bwa_aln_core] refine gapped alignments... 0.00 sec [bwa_aln_core] print alignments... 0.00 sec [bwa_aln_core] 100 sequences have been processed. [main] Version: 0.7.13-r1126 [main] CMD: bwa samse data/SGD_2010.fasta NCYCaln-T10.sai data/NCYC10_Run1/100NCYC10reads.fastq [main] Real time: 8.700 sec; CPU: 0.040 sec

!samtools flagstat NCYCaln-T10.sam 100 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 mapped (0.00% : N/A) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (N/A : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (N/A : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

It does just seem like the reads aren't mapping. I'll try a different yeast strain which I know to be S. Cerevisiae.

@ekg
Copy link
Member Author

ekg commented Apr 18, 2016

Try aligning 100 reads with vg map with verbose debugging on (-D). What
does it say?

On Mon, Apr 18, 2016 at 11:22 AM William Jones notifications@github.com
wrote:

Moving a previous conversation between me @ekg https://github.com/ekg
and @edawson https://github.com/edawson over to this thread.
Interestingly even with verbose output on, it doesn't seem clear why my
reads aren't aligning.

The backstory, I'm comparing the process of aligning reads from
http://opendata.ifr.ac.uk/NCYC/ (NCYC10 specifically) to the variation
graph I've built and the SGD_2010 reference. I'm able to map to the
variation graph but not to the linear reference.

bwa mem -N10 -v 4 data/SGD_2010.fasta
data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 100 sequences (10100 bp)...
[M::mem_process_seqs] Processed 100 reads in 0.012 CPU sec, 0.332 real sec
[main] Version: 0.7.13-r1126
[main] CMD: bwa mem -N10 -v 4 data/SGD_2010.fasta
data/NCYC10_Run1/100NCYC10reads.fastq
[main] Real time: 6.612 sec; CPU: 0.084 sec

Also when doing it for

http://opendata.ifr.ac.uk/NCYC/

http://opendata.ifr.ac.uk/NCYC/

!bwa aln -n 100 data/SGD_2010.fasta data/NCYC10_Run1/100NCYC10reads.fastq

NCYCaln-T10.sai
[bwa_aln_core] calculate SA coordinate... 0.01 sec
[bwa_aln_core] write to the disk... 0.00 sec
[bwa_aln_core] 100 sequences have been processed.
[main] Version: 0.7.13-r1126
[main] CMD: bwa aln -n 100 data/SGD_2010.fasta
data/NCYC10_Run1/100NCYC10reads.fastq
[main] Real time: 10.371 sec; CPU: 0.028 sec

!bwa samse data/SGD_2010.fasta NCYCaln-T10.sai
data/NCYC10_Run1/100NCYC10reads.fastq > NCYCaln-T10.sam
[bwa_aln_core] convert to sequence coordinate... 0.03 sec
[bwa_aln_core] refine gapped alignments... 0.00 sec
[bwa_aln_core] print alignments... 0.00 sec
[bwa_aln_core] 100 sequences have been processed.
[main] Version: 0.7.13-r1126
[main] CMD: bwa samse data/SGD_2010.fasta NCYCaln-T10.sai
data/NCYC10_Run1/100NCYC10reads.fastq
[main] Real time: 8.700 sec; CPU: 0.040 sec

!samtools flagstat NCYCaln-T10.sam
100 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@willgdjones
Copy link

willgdjones commented Apr 18, 2016

Here's what it looks like for the first read. This read maps to the variation graph but not to the linear reference. It seems to be finding mems effectively

!vg-81b5a2cb map -D -G -X 0.9 -f data/NCYC10_Run1/1NCYC10reads.fastq -x output/yeastgraph-m100.xg -g output/yeastgraph-m100-p4.gcsa > output/1NCYC10reads.gam

Loading xg index output/yeastgraph-m100.xg... Loading GCSA2 index output/yeastgraph-m100-p4.gcsa... Loading LCP index output/yeastgraph-m100-p4.gcsa... aligning {"quality": "IiIiJSAkJSUnJycmJycnKCkoJicmKCkpJicnIScnJyEnJycoKCQkJyYmKCcmKCkpKSIoKCgjJycoJygpEiUmJSEfISIlJScmKCcnJyQZIx8kHRYeHiIkIhohIyEeISAiIiIfIh8=", "sequence": "TATTTTGCTCTTTTCTGGCTTTGCTATAAAGTTCATAAAACATTATTCTATTAGAGGACTAGCTCAATTTTCATTCGATAGCATCTTTATCGTTATTATGA", "name": "HWI-ST319:276:D1BR0ACXX:7:1101:1116:2188 1:N:0:ATCACG"} mems before filling [["TATTTTGCTCTTTT",[]],["TTTTGCTCTTTTCTG",[]],["TGCTCTTTTCTGG",[]],["CTCTTTTCTGGC",[]],["TCTTTTCTGGCTT",[]],["TCTTTTCTGGCTTT",[]],["CTTTTCTGGCTTTG",[]],["TCTGGCTTTGCT",[]],["CTGGCTTTGCTA",[]],["GCTTTGCTAT",[]],["GCTTTGCTATA",[]],["GCTTTGCTATAA",[]],["CTTTGCTATAAA",[]],["TTTGCTATAAAGT",[]],["TTGCTATAAAGTT",[]],["GCTATAAAGTTC",[]],["CTATAAAGTTCA",[]],["CTATAAAGTTCAT",[]],["ATAAAGTTCATA",[]],["TAAAGTTCATAAA",[]],["AAAGTTCATAAAA",[]],["AAAGTTCATAAAACA",[]],["TTCATAAAACAT",[]],["TTCATAAAACATTAT",[]],["ATAAAACATTATT",[]],["TAAAACATTATTC",[]],["TAAAACATTATTCT",[]],["AAACATTATTCTA",[]],["ACATTATTCTAT",[]],["ACATTATTCTATT",[]],["CATTATTCTATTA",[]],["ATTATTCTATTAG",[]],["ATTCTATTAGA",[]],["ATTCTATTAGAG",[]],["TTCTATTAGAGGA",[]],["TATTAGAGGAC",[]],["TATTAGAGGACT",[]],["TTAGAGGACTA",[]],["TTAGAGGACTAG",[]],["TTAGAGGACTAGC",[]],["AGGACTAGCT",[]],["AGGACTAGCTCA",[]],["GACTAGCTCAAT",[]],["ACTAGCTCAATT",[]],["CTAGCTCAATTT",[]],["TAGCTCAATTTT",[]],["GCTCAATTTTC",[]],["GCTCAATTTTCA",[]],["CTCAATTTTCAT",[]],["CTCAATTTTCATT",[]],["TCAATTTTCATTC",[]],["AATTTTCATTCG",[]],["ATTTTCATTCGA",[]],["ATTTTCATTCGAT",[]],["TTTTCATTCGATAG",[]],["TCATTCGATAGC",[]],["TCATTCGATAGCA",[]],["ATTCGATAGCAT",[]],["TCGATAGCATC",[]],["TCGATAGCATCTT",[]],["CGATAGCATCTTTAT",[]],["TAGCATCTTTATCGT",[]],["CATCTTTATCGTTA",[]],["CATCTTTATCGTTATTAT",[]],["TATCGTTATTATG",[]],["TCGTTATTATGA",[]]] mems after filtering [["TATTTTGCTCTTTT",["458849:8"]],["TTTTGCTCTTTTCTG",["587307:32"]],["TGCTCTTTTCTGG",["754248:60"]],["CTCTTTTCTGGC",["161838:0"]],["TCTTTTCTGGCTT",["471056:20","620182:48"]],["TCTTTTCTGGCTTT",["471056:20"]],["CTTTTCTGGCTTTG",["420687:15"]],["TCTGGCTTTGCT",["691583:31"]],["CTGGCTTTGCTA",["143855:44","435460:66","634670:40"]],["GCTTTGCTAT",["44789:27","48958:3","54711:34","93361:70","386543:3","499676:19","502626:15","516985:26","647377:13","710274:36","717703:36","732156:23"]],["GCTTTGCTATA",["48958:3","54711:34","516985:26","717703:36","732156:23"]],["GCTTTGCTATAA",["54711:34"]],["CTTTGCTATAAA",["584676:36","638922:58","781922:35"]],["TTTGCTATAAAGT",["565437:90"]],["TTGCTATAAAGTT",["371940:63"]],["GCTATAAAGTTC",["88879:38","147055:25"]],["CTATAAAGTTCA",["148634:58","325868:16","404564:3","766676:6"]],["CTATAAAGTTCAT",["325868:16","766676:6"]],["ATAAAGTTCATA",["403193:44","628917:2"]],["TAAAGTTCATAAA",["276594:40"]],["AAAGTTCATAAAA",["126125:25","406841:25","517677:1","530178:41"]],["AAAGTTCATAAAACA",["406841:25","530178:41"]],["TTCATAAAACAT",["202193:21","318916:8","350442:61","571914:90"]],["TTCATAAAACATTAT",["318916:8"]],["ATAAAACATTATT",["37628:31","451365:15"]],["TAAAACATTATTC",["677054:3","745519:28"]],["TAAAACATTATTCT",["677054:3"]],["AAACATTATTCTA",["78831:30"]],["ACATTATTCTAT",["150957:54","183332:1","416058:38","425753:2","511228:34","511554:43","735341:7"]],["ACATTATTCTATT",["150957:54"]],["CATTATTCTATTA",["126151:23"]],["ATTATTCTATTAG",["387265:55"]],["ATTCTATTAGA",["107769:90","113002:17","338064:74","434236:7","480929:1","606976:13","759424:23"]],["ATTCTATTAGAG",["759424:23"]],["TTCTATTAGAGGA",["65330:95","765283:50"]],["TATTAGAGGAC",["713373:34","733531:1"]],["TATTAGAGGACT",["733531:1"]],["TTAGAGGACTA",["10601:65","593492:40","652337:49"]],["TTAGAGGACTAG",["593492:40","652337:49"]],["TTAGAGGACTAGC",["593492:40"]],["AGGACTAGCT",["52661:86","125077:37","227711:2","309070:70","744433:30"]],["AGGACTAGCTCA",["744433:30"]],["GACTAGCTCAAT",["713745:1"]],["ACTAGCTCAATT",["67818:37","564635:6"]],["CTAGCTCAATTT",["430488:38"]],["TAGCTCAATTTT",["506480:6"]],["GCTCAATTTTC",["401556:46","409100:2","512557:41","559702:80","669687:68","705882:12"]],["GCTCAATTTTCA",["401556:46","409100:2","559702:80","705882:12"]],["CTCAATTTTCAT",["12793:22","34924:0","669308:14","714469:33"]],["CTCAATTTTCATT",["714469:33"]],["TCAATTTTCATTC",["159190:12"]],["AATTTTCATTCG",["71309:10"]],["ATTTTCATTCGA",["363885:5","448792:34","567072:1","765258:66"]],["ATTTTCATTCGAT",["363885:5","567072:1"]],["TTTTCATTCGATAG",["263775:39"]],["TCATTCGATAGC",["406615:24","661628:33"]],["TCATTCGATAGCA",["406615:24"]],["ATTCGATAGCAT",["397453:46"]],["TCGATAGCATC",["415166:6","609877:9"]],["TCGATAGCATCTT",["415166:6"]],["CGATAGCATCTTTAT",["452838:38"]],["TAGCATCTTTATCGT",["588856:29"]],["CATCTTTATCGTTA",["267163:0","580741:71"]],["CATCTTTATCGTTATTAT",["580741:71"]],["TATCGTTATTATG",["575072:21"]],["TCGTTATTATGA",["181082:62","453155:16"]]] aligning to 112 clusters 18:1 580741-580741 15:1 452838-452838 15:1 530178-530178 15:1 406841-406841 15:1 587307-587307 15:1 588856-588856 15:1 318916-318916 14:1 267163-267163 14:1 471056-471056 14:1 458849-458849 14:1 420687-420687 14:1 263775-263775 14:1 677054-677054 13:1 406615-406615 13:1 593492-593492 13:1 325868-325868 13:1 65330-65330 13:1 714469-714469 13:1 371940-371940 13:1 387265-387265 13:1 745519-745519 13:1 575072-575072 13:1 567072-567072 13:1 363885-363885 13:1 754248-754248 13:1 565437-565437 13:1 415166-415166 13:1 517677-517677 13:1 451365-451365 13:1 37628-37628 13:1 765283-765283 13:1 766676-766676 13:1 276594-276594 13:1 78831-78831 13:1 620182-620182 13:1 150957-150957 13:1 126125-126125 13:1 159190-159190 13:1 126151-126151 12:1 511228-511228 12:1 759424-759424 12:1 559702-559702 12:1 765258-765258 12:1 511554-511554 12:1 661628-661628 12:1 691583-691583 12:1 506480-506480 12:1 705882-705882 12:1 713745-713745 12:1 564635-564635 12:1 733531-733531 12:1 669308-669308 12:1 571914-571914 12:1 652337-652337 12:1 638922-638922 12:1 634670-634670 12:1 584676-584676 12:1 628917-628917 12:1 744433-744433 12:1 781922-781922 12:1 735341-735341 12:1 397453-397453 12:1 143855-143855 12:1 147055-147055 12:1 148634-148634 12:1 161838-161838 12:1 181082-181082 12:1 183332-183332 12:1 202193-202193 12:1 88879-88879 12:1 71309-71309 12:1 350442-350442 12:1 67818-67818 12:1 401556-401556 12:1 403193-403193 12:1 409100-409100 12:1 404564-404564 12:1 416058-416058 12:1 425753-425753 12:1 430488-430488 12:1 54711-54711 12:1 12793-12793 12:1 453155-453155 12:1 448792-448792 12:1 34924-34924 12:1 435460-435460 11:1 480929-480929 11:1 113002-113002 11:1 713373-713373 11:1 48958-48958 11:1 107769-107769 11:1 717703-717703 11:1 732156-732156 11:1 10601-10601 11:1 669687-669687 11:1 512557-512557 11:1 516985-516985 11:1 434236-434236 11:1 338064-338064 11:1 606976-606976 11:1 609877-609877 10:1 93361-93361 10:1 309070-309070 10:1 227711-227711 10:1 386543-386543 10:1 647377-647377 10:1 52661-52661 10:1 710274-710274 10:1 44789-44789 10:1 125077-125077 10:1 502626-502626 10:1 499676-499676 attempt 1 on cluster 580741-580741 attempt 1 on subgraph 580734-580747 got 2 forward and 0 reverse mems softclip before 40 2 average node size 32 softclip after 40 2 attempt 2 on cluster 452838-452838 attempt 2 on subgraph 452830-452845 got 1 forward and 0 reverse mems softclip before 30 4 average node size 14 softclip after 30 4 attempt 3 on cluster 530178-530178 attempt 3 on subgraph 530170-530186 got 2 forward and 0 reverse mems softclip before 23 59 average node size 7 softclip after 23 21 softclip before 23 21 average node size 6 softclip after 23 21 attempt 4 on cluster 406841-406841 attempt 4 on subgraph 406833-406849 got 2 forward and 0 reverse mems softclip before 23 64 average node size 2 softclip after 23 3 softclip before 23 3 average node size 18 softclip after 23 3 attempt 5 on cluster 587307-587307 attempt 5 on subgraph 587300-587316 got 1 forward and 0 reverse mems softclip before 2 11 average node size 12 softclip after 2 6 softclip before 2 6 average node size 15 softclip after 2 6 attempt 6 on cluster 588856-588856 attempt 6 on subgraph 588848-588863 got 1 forward and 0 reverse mems softclip before 0 8 average node size 22 softclip after 0 8 attempt 7 on cluster 318916-318916 attempt 7 on subgraph 318908-318924 got 2 forward and 0 reverse mems softclip before 31 3 average node size 10 softclip after 31 3

@ekg
Copy link
Member Author

ekg commented Apr 18, 2016

This is saying that it is aligning with 40 soft clips. Is this the whole
log?

We should check that it is passing the identity filter. I don't know if
that is output to the log. It would be an easy and educational change to
add debugging output for your investigation :)

On Mon, Apr 18, 2016, 12:24 William Jones notifications@github.com wrote:

Here's what it looks like for the first read. This read maps to the
variation graph but not to the linear reference. It seems to be finding
mems effectively

`!vg-81b5a2cb map -D -G -X 0.9 -f data/NCYC10_Run1/1NCYC10reads.fastq -x
output/yeastgraph-m100.xg -g output/yeastgraph-m100-p4.gcsa >
output/1NCYC10reads.gam

Loading xg index output/yeastgraph-m100.xg...
Loading GCSA2 index output/yeastgraph-m100-p4.gcsa...
Loading LCP index output/yeastgraph-m100-p4.gcsa...
aligning {"quality":
"IiIiJSAkJSUnJycmJycnKCkoJicmKCkpJicnIScnJyEnJycoKCQkJyYmKCcmKCkpKSIoKCgjJycoJygpEiUmJSEfISIlJScmKCcnJyQZIx8kHRYeHiIkIhohIyEeISAiIiIfIh8=",
"sequence":
"TATTTTGCTCTTTTCTGGCTTTGCTATAAAGTTCATAAAACATTATTCTATTAGAGGACTAGCTCAATTTTCATTCGATAGCATCTTTATCGTTATTATGA",
"name": "HWI-ST319:276:D1BR0ACXX:7:1101:1116:2188 1:N:0:ATCACG"}
mems before filling
[["TATTTTGCTCTTTT",[]],["TTTTGCTCTTTTCTG",[]],["TGCTCTTTTCTGG",[]],["CTCTTTTCTGGC",[]],["TCTTTTCTGGCTT",[]],["TCTTTTCTGGCTTT",[]],["CTTTTCTGGCTTTG",[]],["TCTGGCTTTGCT",[]],["CTGGCTTTGCTA",[]],["GCTTTGCTAT",[]],["GCTTTGCTATA",[]],["GCTTTGCTATAA",[]],["CTTTGCTATAAA",[]],["TTTGCTATAAAGT",[]],["TTGCTATAAAGTT",[]],["GCTATAAAGTTC",[]],["CTATAAAGTTCA",[]],["CTATAAAGTTCAT",[]],["ATAAAGTTCATA",[]],["TAAAGTTCATAAA",[]],["AAAGTTCATAAAA",[]],["AAAGTTCATAAAACA",[]],["TTCATAAAACAT",[]],["TTCATAAAACATTAT",[]],["ATAAAACATTATT",[]],["TAAAACATTATTC",[]],["TAAAACATTATTCT",[]],["AAACATTATTCTA",[]],["ACATTATTCTAT",[]],["ACATTATTCTATT",[]],["CATTATTCTATTA",[]],["ATTATTCTATTAG",[]],["ATTCTATTAGA",[]],["ATTCTATTAGAG",[]],["TTCTATTAGAGGA",[]],["TATTAGAGGAC",[]],["TATTAGAGGACT",[]],["TTAGAGGACTA",[]],["TTAGAGGACTAG",[]],["TTAGAGGACTAGC",[]],["AGGACTAGCT",[]],["AGGACTAGCTCA",[]],["GACTAGCTCAAT",[]],["ACTAGCTCAATT",[]],["CTAGCTCAATTT",[]],["TAGCTCAATTTT",[]],["GCTCAATTTTC",[]],["GCTCAATTTTCA
",[]],["CTCAATTTTCAT",[]],["CTCAATTTTCATT",[]],["TCAATTTTCATTC",[]],["AATTTTCATTCG",[]],["ATTTTCATTCGA",[]],["ATTTTCATTCGAT",[]],["TTTTCATTCGATAG",[]],["TCATTCGATAGC",[]],["TCATTCGATAGCA",[]],["ATTCGATAGCAT",[]],["TCGATAGCATC",[]],["TCGATAGCATCTT",[]],["CGATAGCATCTTTAT",[]],["TAGCATCTTTATCGT",[]],["CATCTTTATCGTTA",[]],["CATCTTTATCGTTATTAT",[]],["TATCGTTATTATG",[]],["TCGTTATTATGA",[]]]
mems after filtering
[["TATTTTGCTCTTTT",["458849:8"]],["TTTTGCTCTTTTCTG",["587307:32"]],["TGCTCTTTTCTGG",["754248:60"]],["CTCTTTTCTGGC",["161838:0"]],["TCTTTTCTGGCTT",["471056:20","620182:48"]],["TCTTTTCTGGCTTT",["471056:20"]],["CTTTTCTGGCTTTG",["420687:15"]],["TCTGGCTTTGCT",["691583:31"]],["CTGGCTTTGCTA",["143855:44","435460:66","634670:40"]],["GCTTTGCTAT",["44789:27","48958:3","54711:34","93361:70","386543:3","499676:19","502626:15","516985:26","647377:13","710274:36","717703:36","732156:23"]],["GCTTTGCTATA",["48958:3","54711:34","516985:26","717703:36","732156:23"]],["GCTTTGCTATAA",["54711:34"]],["CTTTGCTATAAA",["584676:36","638922:58","781922:35"]],["TTTGCTATAAAGT",["565437:90"]],["TTGCTATAAAGTT",["371940:63"]],["GCTATAAAGTTC",["88879:38","147055:25"]],["CTATAAAGTTCA",["148634:58","325868:16","404564:3","766676:6"]],["CTATAAAGTTCAT",["325868:16","766676:6"]],["ATAAAGTTCATA",["403193:44","628917:2"]],["TAAAGTTCATAAA",["276594:40"]],["AAAGTTCATAAAA",["126125:25","406841:25","517677
:1","530178:41"]],["AAAGTTCATAAAACA",["406841:25","530178:41"]],["TTCATAAAACAT",["202193:21","318916:8","350442:61","571914:90"]],["TTCATAAAACATTAT",["318916:8"]],["ATAAAACATTATT",["37628:31","451365:15"]],["TAAAACATTATTC",["677054:3","745519:28"]],["TAAAACATTATTCT",["677054:3"]],["AAACATTATTCTA",["78831:30"]],["ACATTATTCTAT",["150957:54","183332:1","416058:38","425753:2","511228:34","511554:43","735341:7"]],["ACATTATTCTATT",["150957:54"]],["CATTATTCTATTA",["126151:23"]],["ATTATTCTATTAG",["387265:55"]],["ATTCTATTAGA",["107769:90","113002:17","338064:74","434236:7","480929:1","606976:13","759424:23"]],["ATTCTATTAGAG",["759424:23"]],["TTCTATTAGAGGA",["65330:95","765283:50"]],["TATTAGAGGAC",["713373:34","733531:1"]],["TATTAGAGGACT",["733531:1"]],["TTAGAGGACTA",["10601:65","593492:40","652337:49"]],["TTAGAGGACTAG",["593492:40","652337:49"]],["TTAGAGGACTAGC",["593492:40"]],["AGGACTAGCT",["52661:86","125077:37","227711:2","309070:70","744433:30"]],["AGGACTAGCTCA",["744433:30"]],["GACTAGCT
CAAT",["713745:1"]],["ACTAGCTCAATT",["67818:37","564635:6"]],["CTAGCTCAATTT",["430488:38"]],["TAGCTCAATTTT",["506480:6"]],["GCTCAATTTTC",["401556:46","409100:2","512557:41","559702:80","669687:68","705882:12"]],["GCTCAATTTTCA",["401556:46","409100:2","559702:80","705882:12"]],["CTCAATTTTCAT",["12793:22","34924:0","669308:14","714469:33"]],["CTCAATTTTCATT",["714469:33"]],["TCAATTTTCATTC",["159190:12"]],["AATTTTCATTCG",["71309:10"]],["ATTTTCATTCGA",["363885:5","448792:34","567072:1","765258:66"]],["ATTTTCATTCGAT",["363885:5","567072:1"]],["TTTTCATTCGATAG",["263775:39"]],["TCATTCGATAGC",["406615:24","661628:33"]],["TCATTCGATAGCA",["406615:24"]],["ATTCGATAGCAT",["397453:46"]],["TCGATAGCATC",["415166:6","609877:9"]],["TCGATAGCATCTT",["415166:6"]],["CGATAGCATCTTTAT",["452838:38"]],["TAGCATCTTTATCGT",["588856:29"]],["CATCTTTATCGTTA",["267163:0","580741:71"]],["CATCTTTATCGTTATTAT",["580741:71"]],["TATCGTTATTATG",["575072:21"]],["TCGTTATTATGA",["181082:62","453155:16"]]]
aligning to 112 clusters
18:1 580741-580741
15:1 452838-452838
15:1 530178-530178
15:1 406841-406841
15:1 587307-587307
15:1 588856-588856
15:1 318916-318916
14:1 267163-267163
14:1 471056-471056
14:1 458849-458849
14:1 420687-420687
14:1 263775-263775
14:1 677054-677054
13:1 406615-406615
13:1 593492-593492
13:1 325868-325868
13:1 65330-65330
13:1 714469-714469
13:1 371940-371940
13:1 387265-387265
13:1 745519-745519
13:1 575072-575072
13:1 567072-567072
13:1 363885-363885
13:1 754248-754248
13:1 565437-565437
13:1 415166-415166
13:1 517677-517677
13:1 451365-451365
13:1 37628-37628
13:1 765283-765283
13:1 766676-766676
13:1 276594-276594
13:1 78831-78831
13:1 620182-620182
13:1 150957-150957
13:1 126125-126125
13:1 159190-159190
13:1 126151-126151
12:1 511228-511228
12:1 759424-759424
12:1 559702-559702
12:1 765258-765258
12:1 511554-511554
12:1 661628-661628
12:1 691583-691583
12:1 506480-506480
12:1 705882-705882
12:1 713745-713745
12:1 564635-564635
12:1 733531-733531
12:1 669308-669308
12:1 571914-571914
12:1 652337-652337
12:1 638922-638922
12:1 634670-634670
12:1 584676-584676
12:1 628917-628917
12:1 744433-744433
12:1 781922-781922
12:1 735341-735341
12:1 397453-397453
12:1 143855-143855
12:1 147055-147055
12:1 148634-148634
12:1 161838-161838
12:1 181082-181082
12:1 183332-183332
12:1 202193-202193
12:1 88879-88879
12:1 71309-71309
12:1 350442-350442
12:1 67818-67818
12:1 401556-401556
12:1 403193-403193
12:1 409100-409100
12:1 404564-404564
12:1 416058-416058
12:1 425753-425753
12:1 430488-430488
12:1 54711-54711
12:1 12793-12793
12:1 453155-453155
12:1 448792-448792
12:1 34924-34924
12:1 435460-435460
11:1 480929-480929
11:1 113002-113002
11:1 713373-713373
11:1 48958-48958
11:1 107769-107769
11:1 717703-717703
11:1 732156-732156
11:1 10601-10601
11:1 669687-669687
11:1 512557-512557
11:1 516985-516985
11:1 434236-434236
11:1 338064-338064
11:1 606976-606976
11:1 609877-609877
10:1 93361-93361
10:1 309070-309070
10:1 227711-227711
10:1 386543-386543
10:1 647377-647377
10:1 52661-52661
10:1 710274-710274
10:1 44789-44789
10:1 125077-125077
10:1 502626-502626
10:1 499676-499676
attempt 1 on cluster 580741-580741
attempt 1 on subgraph 580734-580747
got 2 forward and 0 reverse mems
softclip before 40 2
average node size 32
softclip after 40 2
attempt 2 on cluster 452838-452838
attempt 2 on subgraph 452830-452845
got 1 forward and 0 reverse mems
softclip before 30 4
average node size 14
softclip after 30 4
attempt 3 on cluster 530178-530178
attempt 3 on subgraph 530170-530186
got 2 forward and 0 reverse mems
softclip before 23 59
average node size 7
softclip after 23 21
softclip before 23 21
average node size 6
softclip after 23 21
attempt 4 on cluster 406841-406841
attempt 4 on subgraph 406833-406849
got 2 forward and 0 reverse mems
softclip before 23 64
average node size 2
softclip after 23 3
softclip before 23 3
average node size 18
softclip after 23 3
attempt 5 on cluster 587307-587307
attempt 5 on subgraph 587300-587316
got 1 forward and 0 reverse mems
softclip before 2 11
average node size 12
softclip after 2 6
softclip before 2 6
average node size 15
softclip after 2 6
attempt 6 on cluster 588856-588856
attempt 6 on subgraph 588848-588863
got 1 forward and 0 reverse mems
softclip before 0 8
average node size 22
softclip after 0 8
attempt 7 on cluster 318916-318916
attempt 7 on subgraph 318908-318924
got 2 forward and 0 reverse mems
softclip before 31 3
average node size 10
softclip after 31 3`


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@willgdjones
Copy link

Yep it's the whole log from what is returned - and yes that would be a great exercise :D

@willgdjones
Copy link

willgdjones commented Apr 18, 2016

So the reads from a different strain NCYC88 map just fine - I think it must be do to this.

!samtools flagstat NCYC88aln.sam

100 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
93 + 0 mapped (93.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

@willgdjones
Copy link

willgdjones commented Apr 18, 2016

@ekg @edawson Is there a way I can quickly build recent versions of VG up on the farm? Development seems to be moving quite quickly and I'm on an old version (vg-81b5a2cb)

@ekg
Copy link
Member Author

ekg commented Apr 18, 2016

Yes, I build on the farm. You will need to have a new version of g++ in
your path and also bison. There might be something else I'm forgetting. It
is awkward to build there because we don't have control over the package
manager.

On Mon, Apr 18, 2016, 15:07 William Jones notifications@github.com wrote:

@ekg https://github.com/ekg @edawson https://github.com/edawson Is
there a way I can quickly build recent versions of VG up on the farm?
Development seems to be moving quite quickly and I'm on an old version.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@ekg
Copy link
Member Author

ekg commented Apr 18, 2016

Try adding /nfs/users/nfs_e/eg10/bin and
/lustre/scratch113/projects/graphs/bin to your PATH. Put them in front so
the executables there take precedence. Now try to build vg.

On Mon, Apr 18, 2016, 15:23 Erik Garrison erik.garrison@gmail.com wrote:

Yes, I build on the farm. You will need to have a new version of g++ in
your path and also bison. There might be something else I'm forgetting. It
is awkward to build there because we don't have control over the package
manager.

On Mon, Apr 18, 2016, 15:07 William Jones notifications@github.com
wrote:

@ekg https://github.com/ekg @edawson https://github.com/edawson Is
there a way I can quickly build recent versions of VG up on the farm?
Development seems to be moving quite quickly and I'm on an old version.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@willgdjones
Copy link

I'm redownloading everything on the farm - is there a better way to do this?

@willgdjones
Copy link

@ekg All goes well except jansson isn't is available!

@ekg
Copy link
Member Author

ekg commented Apr 18, 2016

I guess you will need to install it in your home directory. Build with
configure --prefix=$HOME.

On Mon, Apr 18, 2016, 16:48 William Jones notifications@github.com wrote:

@ekg https://github.com/ekg All goes well except jansson isn't is
available!


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@willgdjones
Copy link

willgdjones commented Apr 19, 2016

Hey @ekg @edawson noticing some strange behaviour in the recent version of vg map. I'm simulating perfect reads from the graph but the alignment scores vary quite a bit.

Simulate 10 perfect reads

!vg sim -l 101 -n 10 -e 0.00 -i 0.00 -x data/graphs/yeastgraph-m100.xg > data/raw/reads/sim10perf.reads

Map these reads

!time vg map -r data/raw/reads/sim10perf.reads -x data/graphs/yeastgraph-m100.xg -g data/graphs/yeastgraph-m100-p4.gcsa > data/GAM/sim10perf.gam

real 0m0.905s
user 0m0.444s
sys 0m0.260s

Look at scores

!vg view -a data/GAM/sim10perf.gam | jq '.score'

16
14
101
17
15
15
16
16
101
17

@willgdjones
Copy link

willgdjones commented Apr 21, 2016

Along this theme: it seems that vg map doesn't quite work as well as BWA mem. Here's a histogram of alignment scores from vg map.

download

Many of them are near zero, whereas bwa mem maps a very high percentage of them.

1001 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1 + 0 supplementary
0 + 0 duplicates
964 + 0 mapped (96.30%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Possible reasons: Vg map is not mapping the reverse strand, around half of my reads mapping to the reverse strand in bwa mem.

@iqbal-lab
Copy link

Hi @willgdjones - why do you expect there to be less variation in the mapping qualities? Even with perfect reads, mapping quality will depend on whether the read comes from a repeat. Does bwa_mem give a very different histo with them all more confident?

@ekg
Copy link
Member Author

ekg commented Apr 21, 2016

Do you know for sure that exactly the same reads are present in each
histogram? I ask because it looks like there are only 425 reads in the
biggest bin of the bwa set, but 2x this number in the vg one.

On Thu, Apr 21, 2016 at 10:40 AM William Jones notifications@github.com
wrote:

Yeah I just checked it out, it does [image: download 1]
https://cloud.githubusercontent.com/assets/1719848/14704604/6a745cd2-07ad-11e6-8e76-19eaf7340f35.png


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@willgdjones
Copy link

@ekg Yep I just noticed this! Quick spot..

@ekg
Copy link
Member Author

ekg commented Apr 21, 2016

Among those reads that are in both sets, do a scatter plot where X=bwa mem
mapping quality and Y=vg alignment score.

On Thu, Apr 21, 2016 at 10:42 AM William Jones notifications@github.com
wrote:

@ekg https://github.com/ekg Yep I just noticed this! Quick spot..


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@willgdjones
Copy link

willgdjones commented Apr 21, 2016

So this is the revised bwa mem version. The 0 quality scores correspond with the reads that weren't mapped at all.

download 2

@ekg
Copy link
Member Author

ekg commented Apr 21, 2016

How was the index generated and how is the mapping working?

Note that the alignment score is not equivalent to the mapping quality. They may be correlated depending on how they are derived, but they are no the same thing.

What do you get when you plot X=bwa mappping quality and Y=vg alignment quality?

@willgdjones
Copy link

The reads in the sam output are in the different order to the input fastq order. Just fiddling and joining on the reads now to get them to match up!

@willgdjones
Copy link

Chatting with Richard it seems likely that the reason for half of the reads mapping unsuccessfully is that they need reverse complementing and that this is not done automatically within vg map.

@ekg
Copy link
Member Author

ekg commented Apr 21, 2016

This is done automatically within vg map.

On Thu, Apr 21, 2016 at 1:58 PM William Jones notifications@github.com
wrote:

Chatting with Richard it seems likely that the reason for half of the
reads mapping unsuccessfully is that they need reverse complementing and
that this is not done automatically within vg map.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@ekg
Copy link
Member Author

ekg commented Apr 21, 2016

I see, you've probably generated the gcsa2 index without reverse kmers.
What documentation suggests to do that? We should correct it. There is no
need to generate kmers only from the reverse strand.

You disable this behavior by removing -F from the vg kmers call in the
indexing process.

On Thu, Apr 21, 2016 at 2:14 PM Erik Garrison erik.garrison@gmail.com
wrote:

This is done automatically within vg map.

On Thu, Apr 21, 2016 at 1:58 PM William Jones notifications@github.com
wrote:

Chatting with Richard it seems likely that the reason for half of the
reads mapping unsuccessfully is that they need reverse complementing and
that this is not done automatically within vg map.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@willgdjones
Copy link

willgdjones commented Apr 21, 2016

I see! That makes a lot of sense. Somehow this came from where we were discussing on the gist, can't immediately find where the -F came from prior to that.

https://gist.github.com/willgdjones/8a3b2ac59a645d4c033a4f0abb22728a

@ekg
Copy link
Member Author

ekg commented Apr 21, 2016

You want this part:
https://github.com/vgteam/vg/wiki/working-with-a-whole-genome-variation-graph#indexing-with-xg-and-gcsa2

Let's work from that page on the wiki to make sure we've got the right docs
up and are on the same page. Can you edit it?

On Thu, Apr 21, 2016 at 2:41 PM William Jones notifications@github.com
wrote:

I see, some this came from where we were discussing on the gist
https://gist.github.com/willgdjones/8a3b2ac59a645d4c033a4f0abb22728a


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#189 (comment)

@willgdjones
Copy link

Yep I can edit this - it's good as it is though for now though. I can also put my work notebooks up as extra documentation.

@ekg
Copy link
Member Author

ekg commented Aug 27, 2018

Since this thread was last updated we've made this kind of resequencing analysis routine.

Thanks to everyone who participated, in particular @willgdjones who's focus on the yeast data set really helped to kickstart serious improvement in vg map through rigorous testing.

@ekg ekg closed this as completed Aug 27, 2018
@ekg
Copy link
Member Author

ekg commented Aug 27, 2018

I should leave this here. It implements indexing of the SGRP2, including pruning with GBWT path filling into the gaps to build an GCSA2 index with order 256:

vg construct -a -r SGD_2010.fasta -v SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz -p >SGRP2-cerevisiae.vg
vg index -p -x SGRP2-cerevisiae.haps.xg -v SGRP2-cerevisiae-freebayes-snps-Q30-GQ30.vcf.gz -G SGRP2-cerevisiae.gbwt SGRP2-cerevisiae.vg
vg prune -p -u -g SGRP2-cerevisiae.gbwt -m SGRP2-cerevisiae.unfold.mapping SGRP2-cerevisiae.vg >SGRP2-cerevisiae.prune.vg
mkdir -p work
vg index -b work/ -p -g SGRP2-cerevisiae.gcsa -f SGRP2-cerevisiae.unfold.mapping -k 16 SGRP2-cerevisiae.prune.vg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants