Bubble removal and unitig construction while retaining original coordinates #2979
Comments
gfatools appears to be unable to record or work with the embedded paths. In this case there is a simple workaround. Map the unitigs back into the original graph. You can then annotate the positions of all the nodes in the original graph with xg's GFA output (https://github.com/vgteam/xg). The trivial way to do this is to add the unitigs to your source FASTA when you're running seqwish. Align them to the other sequences as well. Then, when you re-induce the graph, the unitigs will be embedded and you'll be able to get the positions of the bubbles relative to them. I think this will do what you want:
|
Thank you @ekg for the quick response! |
Take the sequences that you used as input and the sequences of the unitigs.
Make sure they have unique names.
Concatenate them into a single file (s.fa).
Map them all together with minimap2 -X s.fa s.fa >s.paf.
Induce the variation graph with seqwish.
seqwish -s s.fa -p s.paf -g s.gfa
Index the graph using xg to get the coordinates of each node relative to
all overlapping paths.
xg -g s.gfa -G >a.gfa
Now in a.gfa you will have a coordinate descriptor tag in JSON for each
node. You can use this to find the coordinates of the bubbles versus your
unitigs.
…On Tue, Sep 15, 2020, 14:34 rickbeeloo ***@***.***> wrote:
*The simplified data*
source.gfa (= left in the previous figure)
H VN:Z:1.0
S 1 ATCGACTGACACGATCGACTA
S 2 C
S 3 GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG
S 4 ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S 5 G
S 6 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
L 1 + 2 + 0M
L 1 + 5 + 0M
L 2 + 3 + 0M
L 3 + 4 + 0M
L 5 + 3 + 0M
L 6 + 4 + 0M
P genome1 1+,2+,3+,4+ *,*,*
P genome2 1+,5+,3+,4+ *,*,*
P genome3 6+,4+ *
unitig.gfa (= right in the previous figure)
S utg0000001l ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG LN:i:74 RC:i:3 lc:i:74
A utg0000001l 0 + 1 0 21
A utg0000001l 21 + 5 0 1
A utg0000001l 22 + 3 0 52
S utg0000002l ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC LN:i:289 RC:i:1 lc:i:289
A utg0000002l 0 + 4 0 289
S utg0000003l TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC LN:i:110 RC:i:1 lc:i:110
A utg0000003l 0 + 6 0 110
L utg0000001l + utg0000002l + 0M L1:i:74 L2:i:289
L utg0000002l - utg0000003l - 0M L1:i:289 L2:i:110
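The A lines above already record which source segment each stretch of a unitig came from, so one way to recover genome coordinates without any remapping is to join them with the P lines of source.gfa. The following is only a minimal sketch: the function names are hypothetical, the A-line column layout is inferred from the example above rather than taken from the gfatools documentation, and it assumes forward-oriented paths that visit each segment at most once.

```python
from collections import defaultdict


def path_offsets(source_gfa):
    """Return {path: {segment: (start, end)}}, 0-based and end-exclusive.

    Assumes every path step is in forward orientation and that a path
    visits each segment at most once, as in the source.gfa shown above.
    """
    seg_len, paths = {}, {}
    for line in source_gfa.splitlines():
        f = line.split()
        if not f:
            continue
        if f[0] == "S":
            seg_len[f[1]] = len(f[2])
        elif f[0] == "P":
            # Drop the trailing orientation sign from each step.
            paths[f[1]] = [step.rstrip("+-") for step in f[2].split(",")]
    offsets = {}
    for name, steps in paths.items():
        pos, placed = 0, {}
        for seg in steps:
            placed[seg] = (pos, pos + seg_len[seg])
            pos += seg_len[seg]
        offsets[name] = placed
    return offsets


def unitig_positions(unitig_gfa, offsets):
    """Return {unitig: [(unitig_offset, segment, path, (start, end)), ...]}.

    Assumed A-line layout: A <unitig> <offset> <strand> <segment> <start> <end>.
    """
    hits = defaultdict(list)
    for line in unitig_gfa.splitlines():
        f = line.split()
        if len(f) >= 7 and f[0] == "A":
            unitig, unitig_offset, seg = f[1], int(f[2]), f[4]
            for path, placed in offsets.items():
                if seg in placed:
                    hits[unitig].append((unitig_offset, seg, path, placed[seg]))
    return dict(hits)
```

Called as `unitig_positions(unitig_text, path_offsets(source_text))`, this lists, for every segment inside a unitig, the interval it occupies on each genome path that contains it. A segment that was popped as a bubble branch (node 2 here) never appears in an A line, so it is simply absent from the result.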
*Mapping attempt*
I was not sure where I'm supposed to use the xg command you provided
earlier, but this is what I tried now for the mapping:
# GFA to vg
vg view -F source.gfa -v > source.vg
# Index
vg index -x source_index.gx -g source_index.gcsa source.vg
# Unitig graph to FASTA
gfatools gfa2fa unitig.gfa > query.fasta
# Query the FASTA
vg map -F query.fasta -M 10 -x source_index.gx -g source_index.gcsa -j
However, this is different from what I would expect. For example, let's
look at the json for utg0000001l it only returns the matching nodes 1, 5,
3 for genome2 but not for genome1 (of course 1 and 3 overlap but 2 is
absent from the output).
[image: image]
<https://user-images.githubusercontent.com/19516376/93210373-c490d000-f75f-11ea-82c0-491b80f97889.png>
{
"identity":1.0,
"mapping_quality":60,
"name":"utg0000001l",
"path":{
"mapping":[
{
"edit":[
{
"from_length":21,
"to_length":21
}
],
"position":{
"node_id":"1"
},
"rank":"1"
},
{
"edit":[
{
"from_length":1,
"to_length":1
}
],
"position":{
"node_id":"5"
},
"rank":"2"
},
{
"edit":[
{
"from_length":52,
"to_length":52
}
],
"position":{
"node_id":"3"
},
"rank":"3"
}
]
},
"refpos":[
{
"name":"genome1"
},
{
"name":"genome2"
}
],
"score":84,
"sequence":"ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG",
"time_used":351.0
}
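The alignment record above describes one walk through the graph (nodes 1, 5, 3), so it can be compared against the P lines of source.gfa to see which genome paths contain that exact walk. This is a rough sketch with hypothetical helper names; it assumes the JSON has been cleaned into valid form and that paths are given as plain lists of node ids.

```python
import json


def node_walk(alignment_json):
    """Return the ordered node ids visited by one vg alignment record."""
    aln = json.loads(alignment_json)
    return [m["position"]["node_id"] for m in aln["path"]["mapping"]]


def paths_containing(walk, paths):
    """Return names of paths that contain `walk` as a contiguous run of steps.

    `paths` maps a path name to its list of node ids (orientation ignored).
    """
    n = len(walk)

    def contains(steps):
        return any(steps[i:i + n] == walk for i in range(len(steps) - n + 1))

    return [name for name, steps in paths.items() if contains(steps)]
```

With the paths from source.gfa, the walk 1, 5, 3 is contained only in genome2; genome1 shares nodes 1 and 3 but goes through node 2 instead of node 5, which would be consistent with node 2 being absent from this record.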
*Questions*
- Am I missing something here? I thought the -M parameter would allow
more than one match and thereby include genome1 too.
- Where exactly did you mean to use the xg command you provided?
Thank you in advance!
|
By the way, you can probably apply vg deconstruct at the end rather than
xg. Then you will get a VCF. You would specify the unitig names as
references. The xg route is just very simple.
|
Hi @ekg, thanks again! I first created the new FASTA, with unitigs and the original sequences, as you suggested:
2. The
We know that I could resolve this by setting the
However, since resolving the unitigs depends heavily on the parameters, I do wonder how robust this approach is when extending to thousands of genomes (which is our plan), where unitigs can range from very small to almost the size of whole genomes.
3.
Here we see that each sequence gets its own node rather than the overlap it should have?
4.
|
Sorry just realized you did not provide |
You need to tell minimap2 to produce cigars to use it in seqwish. Add -c to
the mapping command. Otherwise there is no way to know what base pairs map
to others.
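A quick way to catch this before running seqwish is to check that every PAF record carries a `cg:Z:` CIGAR tag, which minimap2 writes when `-c` is given. The snippet below is only an illustrative sketch; the function name is hypothetical and it is not part of seqwish or minimap2.

```python
def records_missing_cigar(paf_text):
    """Return 1-based line numbers of PAF records that lack a cg:Z: tag.

    PAF has 12 mandatory columns; optional tags follow from column 13 on.
    """
    missing = []
    for number, line in enumerate(paf_text.splitlines(), 1):
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        if not any(tag.startswith("cg:Z:") for tag in fields[12:]):
            missing.append(number)
    return missing
```

If the returned list is non-empty, the alignments were most likely produced without `-c` and should be regenerated before inducing the graph.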
…On Wed, Sep 16, 2020, 11:13 rickbeeloo ***@***.***> wrote:
Hi @ekg <https://github.com/ekg>, thanks again!
I ran the exact steps you suggested however some unexpected things appear
:(
I first created the new FASTA, with unitigs and the original sequences, as
you suggested:
*1. cat query.fasta source.fa > s.fa*
Which thus looks like this:
>utg0000001l
ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCG
ACATCGATCGACTG
>utg0000002l
ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCA
TCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACT
GACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCT
ATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCG
ACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>utg0000003l
TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGC
TACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
>genome1
ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>genome2
ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>genome3
TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
*2. minimap2 -X s.fa s.fa >s.paf* (*not as expected*)
The s.paf content:
genome1 363 24 356 + genome2 363 24 356 332 332 0 tp:A:S cm:i:58 s1:i:332 dv:f:0.0011 rl:i:0
genome1 363 83 356 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0
genome1 363 83 356 + genome3 399 119 392 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0.0014 rl:i:0
genome1 363 24 74 + utg0000001l 74 24 74 50 50 0 tp:A:S cm:i:9 s1:i:50 dv:f:0.0070 rl:i:0
genome2 363 83 356 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0
genome2 363 83 356 + genome3 399 119 392 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0.0014 rl:i:0
genome2 363 4 74 + utg0000001l 74 4 74 70 70 0 tp:A:S cm:i:12 s1:i:70 dv:f:0 rl:i:0
genome3 399 119 392 + utg0000002l 289 9 282 273 273 0 tp:A:S cm:i:46 s1:i:273 dv:f:0 rl:i:0
genome3 399 2 102 + utg0000003l 110 2 102 100 100 0 tp:A:S cm:i:14 s1:i:100 dv:f:0 rl:i:0
We know that genome1 and genome2 are identical except for one SNP.
utg0000001l should cover positions 1-74 of both genome1 and genome2. In
fact, utg0000001l is identical to positions 1-74 of genome2. However,
upon inspection of this output we notice that neither of the genomes is
correctly covered (does not start at 1) nor are they equally aligned
(different start positions):
genome1 363 24 74 + utg0000001l
genome2 363 4 74 + utg0000001l
Perhaps parameter settings can fix this or I can make an issue on the
Github of Heng Li, however, I then wonder how reliable this approach is,
especially since utg0000001l is even identical to the first part of
genome2. Moreover, we plan to extend this to thousands of genomes.
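To make the coverage problem concrete, the PAF records can be parsed and the unaligned bases at each end of the target reported. This is a small sketch with hypothetical names; it relies only on the standard PAF column order (query name, length, start, end, strand, target name, length, start, end, then matches, block length, and mapping quality), with 0-based coordinates.

```python
from typing import NamedTuple


class PafHit(NamedTuple):
    query: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    target: str
    tlen: int
    tstart: int
    tend: int


def parse_paf(text):
    """Parse the first nine PAF columns of every well-formed record."""
    hits = []
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 12:
            hits.append(PafHit(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                               f[5], int(f[6]), int(f[7]), int(f[8])))
    return hits


def uncovered_target_ends(hit):
    """Return (unaligned bases at target start, unaligned bases at target end)."""
    return hit.tstart, hit.tlen - hit.tend
```

For the two utg0000001l records this gives 24 uncovered leading bases against genome1 and 4 against genome2, which is the discrepancy described above. One possible explanation, stated here as an assumption, is that these are approximate chain coordinates rather than base-level alignments.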
I just continued regardless just to see what the GFA would look like in
the end.
*3. seqwish -s s.fa -p s.paf -g s.gfa*
With s.gfa looking like:
H VN:Z:1.0
S 1 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG
S 2 ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S 3 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
S 4 ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S 5 ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S 6 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
P utg0000001l 1+ *
P utg0000002l 2+ *
P utg0000003l 3+ *
P genome1 4+ *
P genome2 5+ *
P genome3 6+ *
Here we see that each sequence gets its own node rather than the overlap
it should have.
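The same observation can be checked programmatically by counting how many paths use each segment: in a graph where overlaps were induced, at least some segments should be shared. This is a minimal sketch with a hypothetical function name, reading only the P lines of a GFA.

```python
from collections import Counter


def shared_segments(gfa_text):
    """Return {segment: path_count} for segments used by more than one path."""
    usage = Counter()
    for line in gfa_text.splitlines():
        f = line.split()
        if len(f) >= 3 and f[0] == "P":
            # Count each segment once per path, ignoring orientation.
            for seg in {step.rstrip("+-") for step in f[2].split(",")}:
                usage[seg] += 1
    return {seg: n for seg, n in usage.items() if n > 1}
```

On the s.gfa above the result is empty, since every path has its own private node, whereas on the original source.gfa segments 1, 3, and 4 are shared.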
Any suggestions?
|
This is why you get a graph with a separate node for each contig.
|
I'll add a warning to seqwish when this is encountered. It's an obvious
gotcha. Sorry for the confusion.
|
Thanks again @ekg, however, an interesting problem remains. I was looking at how I could format the output and noticed the following:
So what we see here is that the bubble we initially popped, namely the SNP at position 22 (see the first figure I sent), re-appears here. This makes sense as the bubble popping took the
Minimap
Each unitig indeed correctly covers the genomes. In this way, we can still extract the unitig sequences from the unitig GFA and their genomic locations from minimap. In fact, I suppose we can take the unitig GFA and store the minimap positions in the JSON, creating the "location graph". Perhaps I'm missing something in my reasoning, as I suppose you proposed re-inducing the graph using seqwish for a reason?
The a.gfa (in case you wonder/need this)
|
Just out of curiosity, can you use
In the near future (sometime in the next week) we'll have a clean way to get MAF out of these graphs, using smoothxg. That should also solve your problem.
I am a little confused by this. The unitigs are embedded in the graph by the seqwish step. The xg index makes random access to positions fast (not a problem for your small graph, but potentially useful on big ones). You don't need a separate kind of graph to represent the mapping. The graph directly represents it.
What's missing for you is something that gives you coordinates for things that are near existing reference sequences. This could be given by vg deconstruct. We might also extend the positional annotation to give the nearest coordinates on any path for sequences that don't have any overlapping paths.
Sorry if this is trouble. It's helpful to me to work through a typical use case. I want to build a tool on top of xg that provides the kind of interface that you're interested in here. As you see, xg nearly does this, but it's not user-friendly and doesn't quite answer all the questions you have.
p.s. maffer might be able to make MAF from this, solving your problem in yet another way. |
Use case
Minimap confusion
Here I meant not running
Maffer
Vg deconstruct
If you have any other suggestions that would be useful to try before the meeting tomorrow, let me know :) |
I'm aware questions are preferably answered at Biostars, however, I saw older questions not being addressed there while being here. If this is not convenient just notify me and I will immediately transfer it to Biostars!
Simple example
We created three simple sequences of which two are identical except an SNP, and one has a new subsequence. We constructed the GFA using SeqWish after aligning them using Minimap2:
We used GFAtools (gfatools asm -b 10 -u) to remove bubbles and then merge the nodes into unitigs.
Question
As GFAtools is based on Miniasm, and thus on "genome assembly", it does not (yet) remember the original paths. Hence, in the unitig graph, we are unable to tell the locations of the nodes/fragments in the input genomes (which is what we are interested in). Therefore we wonder whether, and how, we can pop bubbles and get the unitigs using VG. We took a look at vg simplify, however this seems to require BED and VCF files instead of our GFA.