bio-velvet is a biogem for interacting with the velvet sequence assembler. It includes both a wrapper for the velvet executable and a parser for the 'LastGraph' format files that velvet creates. This gives access to the underlying assembly graph created by velvet.
To install bio-velvet and its rubygem dependencies:

    gem install bio-velvet
To run velvet with a kmer length of 87 on a set of single-ended reads in a FASTA file (/path/to/reads.fa in this example):

    require 'bio-velvet'

    velvet_result = Bio::Velvet::Runner.new.velvet(87, '-short /path/to/reads.fa') #=> Bio::Velvet::Result object
    contigs_file = velvet_result.contigs_path #=> path to contigs file as a String
    lastgraph_file = velvet_result.last_graph_path #=> path to last graph file as a String
    Bio::Velvet::Runner.new.binary_version #=> e.g. "1.2.08"

By default, the velvet method passes no parameters to velvetg other than the velvet directory created by velveth. This directory is a temporary directory by default, but it can also be set explicitly. For instance, to run velvet with a -cov_cutoff parameter and write the assembly to the velvet_dir directory:

    velvet_result = Bio::Velvet::Runner.new.velvet(
      87,
      '-short /path/to/reads.fa',
      '-cov_cutoff 3.5',
      :output_assembly_path => 'velvet_dir'
    )

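
To make the mapping between these arguments and the underlying velvet programs more concrete, here is a minimal sketch that builds the velveth and velvetg command lines for the same settings. The helper name velvet_commands is hypothetical and is not part of bio-velvet. It assumes that the second argument is handed to velveth and the third to velvetg, and it only constructs the commands without running anything.

```ruby
require 'shellwords'

# Hypothetical helper (not part of bio-velvet): build the velveth and velvetg
# command lines for a kmer length, two argument strings and an output
# directory. Nothing is executed here.
def velvet_commands(kmer, velveth_args, velvetg_args = '', output_dir = 'velvet_dir')
  # velveth takes the directory, then the hash (kmer) length, then read options.
  velveth = ['velveth', output_dir, kmer.to_s] + Shellwords.split(velveth_args)
  # velvetg takes the same directory, then its own options.
  velvetg = ['velvetg', output_dir] + Shellwords.split(velvetg_args)
  [velveth, velvetg]
end

velveth_cmd, velvetg_cmd = velvet_commands(87, '-short /path/to/reads.fa', '-cov_cutoff 3.5')
puts Shellwords.join(velveth_cmd)
puts Shellwords.join(velvetg_cmd)
```

Keeping the commands as arrays rather than one long string avoids quoting problems if a read path contains spaces.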
The graph file can be parsed from a Bio::Velvet::Result object:

    graph = velvet_result.last_graph #=> Bio::Velvet::Graph object

In my experience (mostly on complex metagenomes), the graph object itself does not take as much RAM as initially expected. Most of the hard work has already been done by velvet itself, particularly if -cov_cutoff has been set. However, parsing the graph can take many minutes or even hours if the LastGraph file is big (>500MB). The slowest part is parsing the positions of reads, which are present when velvet is run with the -read_trkg yes option. To speed up that process, parsing can be restricted to the reads of interest, e.g.

    velvet_result.last_graph(:interesting_read_ids => Set.new([1,2,3]))

This parses the positions of only the first 3 reads.
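
The idea behind this option can be sketched independently of the gem. The code below is not bio-velvet's parser. It assumes the read-tracking layout described in the velvet manual, where each "NR node_id read_count" line is followed by one "read_id offset_from_node_start start_coord" line per read, and it keeps only the reads whose IDs are in a Set. The method name tracked_reads and the example data are made up for illustration.

```ruby
require 'set'

# Illustrative only: collect read-tracking entries from the NR sections of
# LastGraph-formatted text, keeping only reads whose IDs are in wanted_ids.
def tracked_reads(lastgraph_text, wanted_ids)
  hits = Hash.new { |hash, node_id| hash[node_id] = [] }
  current_node = nil
  lastgraph_text.each_line do |line|
    fields = line.split
    next if fields.empty?
    if fields[0] == 'NR'
      current_node = fields[1].to_i
    elsif current_node && fields.length == 3 && fields[0] =~ /\A\d+\z/
      read_id = fields[0].to_i
      # Skip uninteresting reads before building anything for them.
      next unless wanted_ids.include?(read_id)
      hits[current_node] << { read_id: read_id, offset: fields[1].to_i, start_coord: fields[2].to_i }
    else
      current_node = nil # any other line ends the current NR section
    end
  end
  hits
end

example = <<~LASTGRAPH
  NR 1 3
  1 0 0
  2 5 0
  7 9 2
  NR -1 1
  3 2 0
LASTGRAPH

p tracked_reads(example, Set.new([1, 2, 3]))
```

Because Set membership tests are constant time on average, the filter adds little overhead per line while avoiding the cost of storing positions for every read.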
With a parsed graph (a Bio::Velvet::Graph object) you can interact with the graph, e.g.

    graph.kmer_length #=> 87
    graph.nodes #=> Bio::Velvet::Graph::NodeArray object
    graph.nodes[3] #=> Bio::Velvet::Graph::Node object with node ID 3
    graph.get_arcs_by_node_id(1, 3) #=> an array of arcs between nodes 1 and 3 (Bio::Velvet::Graph::Arc objects)
    graph.nodes[3].noded_reads #=> array of Bio::Velvet::Graph::NodedRead objects, for read tracking

There is much more that can be done to interact with the graph object and its components - see the rubydoc.
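
For readers unfamiliar with the underlying file, the following sketch shows roughly where values such as the kmer length and the arcs come from. It is not the gem's implementation. It assumes the LastGraph layout from the velvet manual: a first line starting with the node count, sequence count and kmer length, and arc lines of the form "ARC start_node end_node multiplicity", where a negative node ID refers to the reverse strand. The Arc struct, both method names and the choice to ignore strand when matching arcs are simplifications made for this example.

```ruby
# Illustrative only: read the header and ARC lines of LastGraph-formatted text.
Arc = Struct.new(:start_node, :end_node, :multiplicity)

def read_header_and_arcs(lastgraph_text)
  lines = lastgraph_text.lines
  node_count, sequence_count, kmer_length = lines.first.split.first(3).map(&:to_i)
  arcs = lines.select { |line| line.start_with?('ARC') }.map do |line|
    _, start_node, end_node, multiplicity = line.split
    Arc.new(start_node.to_i, end_node.to_i, multiplicity.to_i)
  end
  { node_count: node_count, sequence_count: sequence_count, kmer_length: kmer_length, arcs: arcs }
end

# Arcs joining two nodes, ignoring strand and direction (a simplification).
def arcs_between(arcs, node_a, node_b)
  wanted = [node_a, node_b].sort
  arcs.select { |arc| [arc.start_node.abs, arc.end_node.abs].sort == wanted }
end

example = <<~LASTGRAPH
  3 10 87 1
  ARC 1 3 4
  ARC -3 2 1
  ARC 2 -1 6
LASTGRAPH

parsed = read_header_and_arcs(example)
puts parsed[:kmer_length]
p arcs_between(parsed[:arcs], 1, 3)
```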
With default parameters velvet generates a Sequences file that includes read ID information and the sequences themselves.

    seqs = Bio::Velvet::Sequences.parse_from_file(File.join(velvet_result.result_directory, 'Sequences'))
    seqs[1] #=> 'AAAATTGTCAGACTAGCTATCAGCATATCAGCGCGCATCTCAGACGAGCACTATC'

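
As a rough picture of what such a lookup involves, the sketch below indexes sequences by read ID from FASTA-like text. It is not the gem's parser, and the layout is an assumption: each header line is taken to be ">name", a tab, the read ID, a tab, and a category, with the sequence possibly spanning several lines. The method name sequences_by_read_id and the example records are invented for illustration.

```ruby
require 'stringio'

# Illustrative only: map read ID => sequence from FASTA-like input whose
# headers are assumed to look like ">name<TAB>read_id<TAB>category".
def sequences_by_read_id(io)
  sequences = {}
  current_id = nil
  io.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      _name, read_id, _category = line[1..-1].split("\t")
      current_id = read_id.to_i
      sequences[current_id] = String.new
    elsif current_id
      # Sequence lines are concatenated until the next header.
      sequences[current_id] << line
    end
  end
  sequences
end

example = StringIO.new(">read1\t1\t0\nAAAATTGTCAGACTAG\nCTATCAGC\n>read2\t2\t0\nGGGCCC\n")
seqs_by_id = sequences_by_read_id(example)
puts seqs_by_id[1]
```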
If the -create_binary flag is set when running velveth, a names file (CnyUnifiedSeq.names) is generated that encodes the read names and IDs.

    entries = Bio::Velvet::CnyUnifiedSeqNamesFile.extract_entries(
      File.join(velvet_result.result_directory, 'CnyUnifiedSeq.names'),
      ['read1','read2']
    ) #=> Hash of read name to Array of CnyUnifiedSeqNamesFileEntry objects
    entries['read1'] #=> Array of CnyUnifiedSeqNamesFileEntry objects
    entries['read1'][0].read_id #=> 1 (i.e. '1'.to_i)

When speed is required, grep can come to the rescue (at the cost of some portability):

    entries = Bio::Velvet::CnyUnifiedSeqNamesFile.extract_entries_using_grep_hack(
      File.join(velvet_result.result_directory, 'CnyUnifiedSeq.names'),
      ['read1','read2']
    ) #=> same returned object as above

The sequences themselves are stored in a separate file when -create_binary is used. An interface for this is included in the bio-velvet_underground biogem.
Project home page
Information on the source tree, documentation, examples, issues and how to contribute, see
The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
This code is currently unpublished.
This Biogem is listed at biogems.info
Copyright (c) 2013 Ben J Woodcroft. See LICENSE.txt for further details.