Support GFA with sequences from a FASTA file [feature request] #23

sjackman · 2016-04-26T18:18:21Z

The sequences of a GFA file may optionally be stored in an associated FASTA file. See https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#segment-line

ABySS for example stores the graph in scaffolds.gfa and the sequences in scaffolds.fa. The sequences of the segment records are *. I would like to be able to load the sequences from the associated FASTA file to support BLAST searches.

I don't know off the top of my head of a utility to load the GFA and FA files into a single GFA file with sequences, but I'll look.

The text was updated successfully, but these errors were encountered:

sjackman · 2016-04-26T22:37:10Z

abyss-todot --gfa foo.gv >foo.gfa does not include sequences in the GFA file.
gfatools abyss2gfa foo >foo.gfa combines foo.fa and foo.dot to create the GFA file including sequences. See https://github.com/lh3/gfatools

rrwick · 2016-04-27T13:43:12Z

This sounds like a pretty easy feature request. If when loading a GFA, Bandage sees a * for the sequence, it could look for a .fasta or .fa file with the same base name as the .gfa file. If it finds one, it would then load those sequences and use them in place of any *.

I don't have any ABySS graphs to test on - could you send/attach one for me? And would it be correct to assume that the GFA segment name exactly matches the FASTA header, or is there something more complex going on with the names?

sjackman · 2016-04-27T16:37:36Z

Thanks! That would be great. Yes, the sequence names are identical, ignoring the comment of the FASTA header (everything after the first space character).

Here's a gist of an assembly of ABySS 1.9.0 k=364 E.coli
Reads: https://basespace.illumina.com/sample/3756762/Ecoli
Scaffolds (.fa and .gfa): https://gist.github.com/49ba3aab5f1a9fba20debc55ca3f14ec

Here's some example graph files that I put together while writing the GFA spec, but it doesn't include a GFA without sequence: https://github.com/sjackman/assembly-graph

sjackman · 2016-04-27T16:44:53Z

When the sequences are missing from the GFA file, it would be helpful to include a header line that points to the associated FASTA file(s), along with their SHA1/SHA256, similar to what the SAM/BAM spec does with the @SQ UR and M5 header tags for reference sequences (which are not included in a SAM/BAM file). Rather than one tag per sequence, I'm intending one header per FASTA file. I'm going to submit a PR to the GFA spec requesting this addition. Do you have any comments before I do?

sjackman · 2016-04-27T16:46:23Z

If there are more than one FASTA file, should there be some way of indicating which sequences are found in which FASTA file? If all of the FASTA files are loaded at startup, it's not necessary, but for random access, it is. So, yes I think it's necessary to indicate which file has which seq. Of course only necessary if there's more than one FASTA file. ABySS does this for intermediate stages of its assembly, but it may be relatively uncommon in general.

sjackman · 2016-04-27T16:47:21Z

I should also mention that there may be sequences in the FASTA file that are not found in the GFA file for intermediate stages of the assembly, for example when they've been removed by an error removal algorithm.

rrwick · 2016-04-28T08:02:44Z

That all sounds good - I have nothing to add before you submit a PR to GFA spec.

Regarding the specific use of this in Bandage, I don't care whether there is an indication of which segment sequences are in which FASTA file. I'd probably just load all the FASTA sequences into memory at the start, using their names as keys. Then whenever I encounter a *, I'll check to see if I have a sequence for that name. So random access isn't as issue (for me and Bandage, anyway).

rrwick · 2016-04-28T08:06:32Z

And even when no FASTA is specified in the GFA, Bandage can still check for a same-name FASTA if it encounters a * sequence. So this GFA update shouldn't be necessary for working with ABySS graphs.

sjackman · 2016-04-28T16:04:18Z

I'd probably just load all the FASTA sequences into memory at the start, using their names as keys.

If the graph had a million vertices and 20 Gbp of sequence, e.g. conifer genomes, and the goal was to display a neighbourhood of 1 around a search vertex, loading the sequences only as they're needed would be faster. Just a performance improvement though. Perhaps it's fast enough as is. I haven't yet tested such a large graph, though I definitely have a few.

So this GFA update shouldn't be necessary for working with ABySS graphs.

Great thanks!

rrwick · 2016-04-28T23:16:59Z

Generally speaking, Bandage isn't very good with 20 Gbp, 1 million vertex graphs. I'm biased from working in bacterial genomics where assembly graphs are usually small and manageable. Bandage makes the implicit assumption that you have enough RAM to hold your entire graph, sequences and all.

A long-standing item on my to-do list for Bandage is to make it more large graph friendly by NOT loading sequences into memory and just grabbing them out of their file when needed. I'll get around to that someday...

sjackman · 2016-04-28T23:26:43Z

When I loaded a graph without sequences in it, I didn't even notice that the sequences were missing until I tried to BLAST something. Ideally each individually sequence would be loaded as-needed from an indexed FASTA file .fa.fai, but until that happens, a simpler approach would be to load no sequences at all until one is requested (e.g. for a BLAST) and then load all the sequences at that point.

rrwick · 2016-05-01T04:52:22Z

I've just published the new release of Bandage (v0.8.0), and this feature is included (sneaked in at the end!). It's implemented pretty much as you said in your last post - no sequences are loaded until one is requested, then all are loaded. It specifically looks for a .fa or .fasta file with the same base name as the GFA graph and in the same directory. The whole thing should be completely transparent to the user, except that when the sequences are loaded there may be a small delay and the RAM usage of Bandage will increase. Presumably on a very large graph (which I haven't tried) the delay may not be so small!

There are still some further developments for some day in the future:

If sequence-containing FASTA files make it into the GFA spec (as you suggested above), then I'll update Bandage to load sequences from those.
For all types of graph, I should at least have the option (probably automatically enabled for graphs above a certain size) to leave sequences in their files (not load into RAM) and get them only as needed. That should make working with a 20 GB graph feasible on an 8 GB machine.

But at least in the meantime, ABySS graphs should be better supported in Bandage. Those further developments are on my to-do list, so I'll close this particular issue. Thanks for the suggestion!

sjackman · 2016-05-01T16:54:26Z

Thanks, Ryan! I'll test it out tomorrow. I've submitted a PR to update the Homebrew-cask: Homebrew/homebrew-cask#20897

jelber2 · 2020-07-06T11:24:31Z

abyss-todot --gfa foo.gv >foo.gfa does not include sequences in the GFA file.
gfatools abyss2gfa foo >foo.gfa combines foo.fa and foo.dot to create the GFA file including sequences. See https://github.com/lh3/gfatools

@sjackman Does gfatools still have this feature?

sjackman · 2020-07-08T22:25:02Z

Does gfatools still have this feature?

I'm not sure, but Gfakluge has this feature gfak fillseq. https://github.com/edawson/gfakluge#command-line-utilities
You may install it using Homebrew on Linux, macOS, and Windows WSL. brew install brewsci/bio/gfakluge

jelber2 · 2020-07-09T08:29:50Z

Thanks- I couldn't find the gfatools feature for abyss anymore, but gfakluge fillseq worked fine!

sjackman mentioned this issue Apr 26, 2016

abyss-todot --gfa: Optionally include sequences in the GFA file bcgsc/abyss#109

Closed

rrwick closed this as completed May 1, 2016

sjackman mentioned this issue Jul 18, 2016

Pan-Genome Graphs GFA-spec/GFA-spec#7

Closed

mrvollger mentioned this issue Dec 6, 2019

Support GFA with sequences from a FASTA file [feature request] maickrau/GraphAligner#12

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support GFA with sequences from a FASTA file [feature request] #23

Support GFA with sequences from a FASTA file [feature request] #23

sjackman commented Apr 26, 2016 •

edited

Loading

sjackman commented Apr 26, 2016

rrwick commented Apr 27, 2016

sjackman commented Apr 27, 2016 •

edited

Loading

sjackman commented Apr 27, 2016

sjackman commented Apr 27, 2016 •

edited

Loading

sjackman commented Apr 27, 2016

rrwick commented Apr 28, 2016

rrwick commented Apr 28, 2016

sjackman commented Apr 28, 2016

rrwick commented Apr 28, 2016

sjackman commented Apr 28, 2016 •

edited

Loading

rrwick commented May 1, 2016

sjackman commented May 1, 2016

jelber2 commented Jul 6, 2020

sjackman commented Jul 8, 2020

jelber2 commented Jul 9, 2020

Support GFA with sequences from a FASTA file [feature request] #23

Support GFA with sequences from a FASTA file [feature request] #23

Comments

sjackman commented Apr 26, 2016 • edited Loading

sjackman commented Apr 26, 2016

rrwick commented Apr 27, 2016

sjackman commented Apr 27, 2016 • edited Loading

sjackman commented Apr 27, 2016

sjackman commented Apr 27, 2016 • edited Loading

sjackman commented Apr 27, 2016

rrwick commented Apr 28, 2016

rrwick commented Apr 28, 2016

sjackman commented Apr 28, 2016

rrwick commented Apr 28, 2016

sjackman commented Apr 28, 2016 • edited Loading

rrwick commented May 1, 2016

sjackman commented May 1, 2016

jelber2 commented Jul 6, 2020

sjackman commented Jul 8, 2020

jelber2 commented Jul 9, 2020

sjackman commented Apr 26, 2016 •

edited

Loading

sjackman commented Apr 27, 2016 •

edited

Loading

sjackman commented Apr 27, 2016 •

edited

Loading

sjackman commented Apr 28, 2016 •

edited

Loading