Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bubble removal and unitig construction while retaining original coordinates #2979

Open
rickbeeloo opened this issue Sep 14, 2020 · 12 comments
Open

Comments

@rickbeeloo
Copy link

rickbeeloo commented Sep 14, 2020

I'm aware questions are preferably answered at Biostars, however, I saw older questions not being addressed there while being here. If this is not convenient just notify me and I will immediately transfer it to Biostars!

Simple example
We created three simple sequences of which two are identical except an SNP, and one has a new subsequence. We constructed the GFA using SeqWish after aligning them using Minimap2:
image
We used GFAtools (gfatools asm -b 10 -u) to remove bubbles and hereafter merge the nodes to unitigs.
image

Question
As GFAtools is based on Miniasm, and thus "genome assembly", it does not (yet) remember the original paths. Hence, in the unitig graph, we are unable to tell the locations of the nodes/fragments in the input genomes (which we are interested in). Therefore we wonder whether/how we can pop bubbles and get the unitigs using VG? We took a look at vg simplify however this seems to require BED and VCF files instead of our GFA.

@ekg
Copy link
Member

ekg commented Sep 14, 2020

gfatools appears to be unable to record or work with the embedded paths. In this case there is a simple workaround. Map the unitigs back into the original graph. You can then annotate the positions of all the nodes in the original graph with xg's GFA output (https://github.com/vgteam/xg).

The trivial way to do this is to add the unitigs to your source FASTA when you're running seqwish. Align them to the other sequences as well. Then, when you re-induce the graph, the unitigs will be embedded and you'll be able to get the positions of the bubbles relative to them.

I think this will do what you want:

xg -g input.gfa -G >position_annotated.gfa

@rickbeeloo
Copy link
Author

rickbeeloo commented Sep 15, 2020

Thank you @ekg for the quick response!

The simplified data

source.gfa (= left in the previous figure)

H	VN:Z:1.0
S	1	ATCGACTGACACGATCGACTA
S	2	C
S	3	GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG
S	4	ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S	5	G
S	6	TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
L	1	+	2	+	0M
L	1	+	5	+	0M
L	2	+	3	+	0M
L	3	+	4	+	0M
L	5	+	3	+	0M
L	6	+	4	+	0M
P	genome1	1+,2+,3+,4+	*,*,*
P	genome2	1+,5+,3+,4+	*,*,*
P	genome3	6+,4+	*

unitig.gfa (= right in the previous figure)

S	utg0000001l	ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG	LN:i:74	RC:i:3	lc:i:74
A	utg0000001l	0	+	1	0	21
A	utg0000001l	21	+	5	0	1
A	utg0000001l	22	+	3	0	52
S	utg0000002l	ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC	LN:i:289	RC:i:1	lc:i:289
A	utg0000002l	0	+	4	0	289
S	utg0000003l	TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC	LN:i:110	RC:i:1	lc:i:110
A	utg0000003l	0	+	6	0	110
L	utg0000001l	+	utg0000002l	+	0M	L1:i:74	L2:i:289
L	utg0000002l	-	utg0000003l	-	0M	L1:i:289	L2:i:110

Mapping attempt
I was not sure where I'm supposed to use the xg command you provided earlier, but this is what I tried now for the mapping:

# GFA to vg
vg view -F source.gfa -v > source.vg

# Index
vg index -x source_index.gx -g source_index.gcsa source.vg

# Unitig graph to FASTA
gfatools gfa2fa unitig.gfa > query.fasta

# Query the FASTA
vg map -F query.fasta -M 10 -x source_index.gx -g source_index.gcsa -j

However, this is different from what I would expect. For example, let's look at the json for utg0000001l it only returns the matching nodes 1, 5, 3 for genome2 but not for genome1 (although it is mentioned in refpos: of course 1 and 3 overlap but 2 is absent from the output).
image

{
   "identity":1.0,
   "mapping_quality":60,
   "name":"utg0000001l",
   "path":{
      "mapping":[
         {
            "edit":[
               {
                  "from_length":21,
                  "to_length":21
               }
            ],
            "position":{
               "node_id":"1"
            },
            "rank":"1"```
         },
         {
            "edit":[
               {
                  "from_length":1,
                  "to_length":1
               }
            ],
            "position":{
               "node_id":"5"
            },
            "rank":"2"
         },
         {
            "edit":[
               {
                  "from_length":52,
                  "to_length":52
               }
            ],
            "position":{
               "node_id":"3"
            },
            "rank":"3"
         }

Questions

  • Am I missing something here as I thought node 2 from genome1 of source.gfa should also be included in the mapping?
  • Where exactly did you mean to use the xg command you provided?

Thank you in advance!

@ekg
Copy link
Member

ekg commented Sep 15, 2020 via email

@ekg
Copy link
Member

ekg commented Sep 15, 2020 via email

@rickbeeloo
Copy link
Author

rickbeeloo commented Sep 16, 2020

Hi @ekg, thanks again!
I ran the exact steps you suggested however some unexpected things appear :(

I first created the new FASTA, with unitigs and the original sequences, as you suggested:
1. cat query.fasta source.fa > s.fa
Which thus looks like this:

>utg0000001l
ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG
>utg0000002l
ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>utg0000003l
TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
>genome1
ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>genome2
ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
>genome3
TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCtACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC

2. minimap2 -X s.fa s.fa >s.paf (not as expected)

The s.paf content:

genome1	363	24	356	+	genome2	363	24	356	332	332	0	tp:A:S	cm:i:58	s1:i:332	dv:f:0.0011	rl:i:0
genome1	363	83	356	+	utg0000002l	289	9	282	273	273	0	tp:A:S	cm:i:46	s1:i:273	dv:f:0	rl:i:0
genome1	363	83	356	+	genome3	399	119	392	273	273	0	tp:A:S	cm:i:46	s1:i:273	dv:f:0.0014	rl:i:0
genome1	363	24	74	+	utg0000001l	74	24	74	50	50	0	tp:A:S	cm:i:9	s1:i:50	dv:f:0.0070	rl:i:0
genome2	363	83	356	+	utg0000002l	289	9	282	273	273	0	tp:A:S	cm:i:46	s1:i:273	dv:f:0	rl:i:0
genome2	363	83	356	+	genome3	399	119	392	273	273	0	tp:A:S	cm:i:46	s1:i:273	dv:f:0.0014	rl:i:0
genome2	363	4	74	+	utg0000001l	74	4	74	70	70	0	tp:A:S	cm:i:12	s1:i:70	dv:f:0	rl:i:0
genome3	399	119	392	+	utg0000002l	289	9	282	273	273	0	tp:A:S	cm:i:46	s1:i:273	dv:f:0	rl:i:0
genome3	399	2	102	+	utg0000003l	110	2	102	100	100	0	tp:A:S	cm:i:14	s1:i:100	dv:f:0	rl:i:0

We know that genome1 and genome2 are identical except for one SNP. utg0000001l should cover positions 0-75 of both genome1 and genome2. In fact, utg0000001l is identical to positions 0-75 of genome2. However, upon inspection of this output we notice that neither of the genomes is correctly covered (does not start at 0) nor are they equally aligned (different start positions). Notably, the SNP is at position 22 probably influencing this start position difference here too, however, if I understood correctly, the goal of this approach was to actually make the unitig cover the exact parts it was originally derived from. Likewise, we know that utg0000003l should start at position 0 on genome3 but it starts at position 2.

I could resolve this by setting the -w (minimizer window size) parameter to 2:
minimap2 -X -w 2 s.fa s.fa >s.paf:

genome1	363	0	362	+	genome2	363	0	362	359	362	0	tp:A:S	cm:i:223	s1:i:359	dv:f:0.0032	rl:i:0
genome1	363	75	362	+	genome3	399	111	398	287	287	0	tp:A:S	cm:i:184	s1:i:287	dv:f:0.0004	rl:i:0
genome1	363	75	362	+	utg0000002l	289	1	288	287	287	0	tp:A:S	cm:i:184	s1:i:287	dv:f:0	rl:i:0
genome1	363	0	74	+	utg0000001l	74	0	74	71	74	0	tp:A:S	cm:i:29	s1:i:71	dv:f:0.0214	rl:i:0
genome2	363	75	362	+	genome3	399	111	398	287	287	0	tp:A:S	cm:i:184	s1:i:287	dv:f:0.0004	rl:i:0
genome2	363	75	362	+	utg0000002l	289	1	288	287	287	0	tp:A:S	cm:i:184	s1:i:287	dv:f:0	rl:i:0
genome2	363	0	74	+	utg0000001l	74	0	74	74	74	0	tp:A:S	cm:i:39	s1:i:74	dv:f:0	rl:i:0
genome3	399	111	398	+	utg0000002l	289	1	288	287	287	0	tp:A:S	cm:i:184	s1:i:287	dv:f:0	rl:i:0
genome3	399	0	110	+	utg0000003l	110	0	110	110	110	0	tp:A:S	cm:i:68	s1:i:110	dv:f:0	rl:i:0

However, when resolving the unitigs heavily depends on the parameters I do wonder how robust this approach is when extending beyond thousands of genomes (which is our plan) where unitigs can be very small till almost the sizes of whole genomes.

3. seqwish -s s.fa -p s.paf -g s.gfa
With s.gfa looking like:

H	VN:Z:1.0
S	1	ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG
S	2	ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S	3	TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC
S	4	ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S	5	ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
S	6	TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
P	utg0000001l	1+	*
P	utg0000002l	2+	*
P	utg0000003l	3+	*
P	genome1	4+	*
P	genome2	5+	*
P	genome3	6+	*

Here we see that each sequence gets its own node rather than the overlap it should have?

4. xg -g s.gfa -G >a.gfa
With a.gfa:

H	VN:Z:1.0
S	1	ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG	PO:J:[["utg0000001l","+",0]]
S	2	ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC	PO:J:[["utg0000002l","+",0]]
S	3	TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC	PO:J:[["utg0000003l","+",0]]
S	4	ATCGACTGACACGATCGACTACGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC	PO:J:[["genome1","+",0]]
S	5	ATCGACTGACACGATCGACTAGGACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC	PO:J:[["genome2","+",0]]
S	6	TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGCACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC	PO:J:[["genome3","+",0]]

@rickbeeloo
Copy link
Author

rickbeeloo commented Sep 16, 2020

Sorry just realized you did not provide -c for the minimap2 command in your example and as you recently noticed (ekg/seqwish#61) it didn't trigger a warning for me. I will report back after I run minimap -c!

@ekg
Copy link
Member

ekg commented Sep 16, 2020 via email

@ekg
Copy link
Member

ekg commented Sep 16, 2020 via email

@ekg
Copy link
Member

ekg commented Sep 16, 2020 via email

@rickbeeloo
Copy link
Author

rickbeeloo commented Sep 17, 2020

Thanks again @ekg, however, an interesting problem remains. I was looking at how I could format the output and noticed the following:

image

So what we see here is that the bubble we initially popped, namely the SNP at position 22 (see the first figure I sent) re-appears here. This makes sense as the bubble popping took the genome2 sequence with G at 22 as "replacement" rather than C from genome1. This is detected by SeqWish resulting in a separate node: S 13 C PO:J:[["genome1","+",21]]. Thus this approach does not seem to work with bubble popping unless we:

  • Would somehow remember where bubble popping occurred
  • Post-process the a.gfa where we assume that small windows of "missing coverage" were previous bubbles - and thus ignore this.
  • Perhaps replace the nucleotides in a bubble for N
  • Just use the minimap output directly? (see below)

Minimap
Perhaps we can just use the minmap output directly as this correctly tells us that e.g. utg0000001l covers 1-74 of genome1 and genome2. To make this more visual I plotted the minimap output where the blocks represent the found alignments for each genome (the unitigs in orange):

image

Each unitig indeed correctly covers the genomes. In this way, we can still extract the unitig sequences from the unitig GFA and their genomic locations from minimap. In fact, I suppose we can take the unitig GFA and store the minimap positions in the JSON - creating the "location graph". Perhaps I'm missing something in my reason, as I suppose you proposed re-inducing the graph using seqwish for a reason?

The a.gfa (in case you wonder/need this)

H	VN:Z:1.0
S	1	ATCGACTGACACGATCGACTA	PO:J:[["utg0000001l","+",0],["genome1","+",0],["genome2","+",0]]
L	1	+	2	+	0M
L	1	+	13	+	0M
S	2	G	PO:J:[["utg0000001l","+",21],["genome2","+",21]]
L	2	+	3	+	0M
S	3	GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACAT	PO:J:[["utg0000001l","+",22],["genome1","+",22],["genome2","+",22]]
L	3	+	4	+	0M
S	4	CGATCGA	PO:J:[["utg0000001l","+",64],["utg0000003l","+",98],["genome1","+",64],["genome2","+",64],["genome3","+",98]]
L	4	+	5	+	0M
L	4	+	10	+	0M
S	5	C	PO:J:[["utg0000001l","+",71],["utg0000003l","+",106],["genome1","+",71],["genome2","+",71],["genome3","+",106]]
L	5	+	6	+	0M
L	5	+	11	+	0M
S	6	T	PO:J:[["utg0000001l","+",72],["genome1","+",72],["genome2","+",72]]
L	6	+	7	+	0M
S	7	G	PO:J:[["utg0000001l","+",73],["utg0000003l","+",108],["genome1","+",73],["genome2","+",73],["genome3","+",108]]
L	7	+	8	+	0M
L	7	+	12	+	0M
S	8	ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC	PO:J:[["utg0000002l","+",0],["genome1","+",74],["genome2","+",74],["genome3","+",110]]
S	9	TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTA	PO:J:[["utg0000003l","+",0],["genome3","+",0]]
L	9	+	4	+	0M
S	10	T	PO:J:[["utg0000003l","+",105],["genome3","+",105]]
L	10	+	5	+	0M
S	11	A	PO:J:[["utg0000003l","+",107],["genome3","+",107]]
L	11	+	7	+	0M
S	12	C	PO:J:[["utg0000003l","+",109],["genome3","+",109]]
L	12	+	8	+	0M
S	13	C	PO:J:[["genome1","+",21]]
L	13	+	3	+	0M
P	utg0000001l	1+,2+,3+,4+,5+,6+,7+	21M,1M,42M,7M,1M,1M,1M
P	utg0000002l	8+	289M
P	utg0000003l	9+,4+,10+,5+,11+,7+,12+	98M,7M,1M,1M,1M,1M,1M
P	genome1	1+,13+,3+,4+,5+,6+,7+,8+	21M,1M,42M,7M,1M,1M,1M,289M
P	genome2	1+,2+,3+,4+,5+,6+,7+,8+	21M,1M,42M,7M,1M,1M,1M,289M
P	genome3	9+,4+,10+,5+,11+,7+,12+,8+	98M,7M,1M,1M,1M,1M,1M,289M

@ekg
Copy link
Member

ekg commented Sep 17, 2020

Just out of curiosity, can you use vg deconstruct? That should give you what you're interested in (variants vs. embedded references). You'd need to pick the reference sequence(s) when you run it. Then you'll get a VCF describing the bubbles versus the reference sequences.

In the near future (sometime in the next week) we'll have a clean way to get MAF out of these graphs, using smoothxg. That should also solve your problem.

In fact, I suppose we can take the unitig GFA and store the minimap positions in the JSON - creating the "location graph". Perhaps I'm missing something in my reason, as I suppose you proposed re-inducing the graph using seqwish for a reason?

I am a little confused by this. The unitigs are embedded in the graph with the seqwish step. The xg index makes accessing and determining the positions randomly fast (not a problem for your small graph, but potentially useful on big ones). You don't need a separate kind of graph to represent the mapping. The graph directly represents it. What's missing for you is something that gives you coordinates for things that are near existing reference sequences. This could be given by vg deconstruct. We might also extend the positional annotation to give the nearest coordinates on any path for sequences that don't have any overlapping paths.

Sorry if this is trouble. It's helpful to me to work through a typical use case. I want to build a tool on top of xg that provides the kinds of interface that you're interested in here. As you see, xg nearly does this, but it's not user-friendly and doesn't quite answer all the questions you have.

p.s. maffer might be able to make MAF from this, solving your problem in yet another way.

@rickbeeloo
Copy link
Author

rickbeeloo commented Sep 17, 2020

Use case
@ekg I'm a PhD student of Bas and we will have a meeting tomorrow so we can discuss everything there in more detail. In short, we want to extract shared sequences between genomes while allowing some variations (hence the bubble pop) and also retain sequences only present in a single genome. For all those sequences we want to know in what genomes they were present and at what locations. So basically:

ATCGATCGACTAGCACGACATC... genome1[start,end], genome2[start,end]...
GGGGGGGGGGGGACGACATC... genome10[start,end], genome3[start,end]...
CTAGCTACGGATCGACTACTGF... genome3[start,end]

Minimap confusion

I am a little confused by this. The unitigs are embedded in the graph with the seqwish step

Here I meant not running seqwish and gx at all. So original graph > bubble pop > unitigs. Then unitig + source fasta > minimap, this would already tell us the location of each unitig in the genomes (while allowing some variability). However, preferably we would not align everything over again as this would consume a lot of extra time for large sequence collections.

Maffer
I just tried maffer on the original graph (before bubble pop etc.) however the alignment breaks on the SNP (of note I did not use odgi sort yet - if this would make a difference). Moreover, looking at the second block, we preferably want something like this to appear without the big gap for genome3, thus having the shared sequence between genome1 and genome2 (and not genome3) separately: GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTG


a loops=false
s genome2 0 21 + 363 ATCGACTGACACGATCGACTA-
s genome1 0 22 + 363 ATCGACTGACACGATCGACTAC

a loops=false
s genome3 110 289 + 399 ----------------------------------------------------ACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
s genome2  22 341 + 363 GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC
s genome1  22 341 + 363 GACTAGCACGATCACGACTATCGGCGCGCGATCGATCGACATCGATCGACTGACTACGGACTAGCAGTCGATCGTTTGTTTGATCGATGCTAGCTAGTAGCTACGACTAGCATCGACTACGACTAGCGACTACGACTAGCCTAGCTACGACTAGCTACGACTAGCATCGACTGACTGTACGGACGACTACGATCAGCTACGACGGCCGGCGCATCTAGCATCACGACGAGCTATCATCGATCGATCGATCTACTGACGATCGACTAGCTACTAGCTAGCAGCTAGCTAGTCGACGGATCGATCGATCGATCGATGCATCGACGATCGTACTATCGACTGAC

a loops=false
s genome2 21 1 + 363 G

a loops=false
s genome3 0 110 + 399 TGCTACGATCGTTTTTAGCTAGCATCGATCAGCTAGCTATCAGCTAGCTAGCTAGCATGCTACGTACGATCAGCTAGCTACGTACGACTAGCTAGCTACGATCGATCAGC

Vg deconstruct
I'm not sure what the references here should be?

If you have any other suggestions that would be useful to try before the meeting tomorrow, let me know :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants