bio-velvet

bio-velvet is a biogem for interacting with the velvet sequence assembler. It includes both a wrapper for the velvet executable and a parser for the 'LastGraph' format files that velvet creates. The parser gives access to the underlying assembly graph created by velvet.

Installation

To install bio-velvet and its rubygem dependencies:

gem install bio-velvet

Usage

To run velvet with a kmer length of 87 on a set of single-ended reads in /path/to/reads.fa:

require 'bio-velvet'

velvet_result = Bio::Velvet::Runner.new.velvet(87, '-short /path/to/reads.fa') #=> Bio::Velvet::Result object

contigs_file = velvet_result.contigs_path #=> path to contigs file as a String
lastgraph_file = velvet_result.last_graph_path #=> path to last graph file as a String

Bio::Velvet::Runner.new.binary_version #=> e.g. "1.2.08"

By default, the velvet method passes no parameters to velvetg other than the velvet directory created by velveth. This directory is a temporary directory by default, but a specific directory can be set instead. For instance, to run velvet with a -cov_cutoff parameter in the velvet_dir directory:

velvet_result = Bio::Velvet::Runner.new.velvet(87,
  '-short /path/to/reads.fa',
  '-cov_cutoff 3.5', 
  :output_assembly_path => 'velvet_dir')
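
For orientation, the two command lines that a wrapper like this drives can be sketched as below. This is a minimal illustration that assumes the standard velvet invocation forms (`velveth <directory> <hash_length> <options>` and `velvetg <directory> <options>`). The `velvet_commands` helper is hypothetical and not part of bio-velvet, and it only builds the commands without running them.

```ruby
require 'shellwords'
require 'tmpdir'

# Hypothetical helper (not part of bio-velvet): build the velveth and velvetg
# command lines for a kmer length and option strings, without executing them.
def velvet_commands(kmer, velveth_options, velvetg_options = '', output_dir: nil)
  # Fall back to a fresh temporary directory when no output directory is given.
  dir = output_dir || Dir.mktmpdir('velvet')
  velveth = ['velveth', dir, kmer.to_s] + Shellwords.split(velveth_options)
  velvetg = ['velvetg', dir] + Shellwords.split(velvetg_options)
  [velveth, velvetg]
end

velveth_cmd, velvetg_cmd = velvet_commands(
  87, '-short /path/to/reads.fa', '-cov_cutoff 3.5', output_dir: 'velvet_dir'
)
puts velveth_cmd.join(' ')
puts velvetg_cmd.join(' ')
```

Keeping each command as an array of arguments (rather than one string) means it could later be passed to `system(*cmd)` without going through a shell.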

The graph file can be parsed from a velvet_result:

graph = velvet_result.last_graph #=> Bio::Velvet::Graph object

In my experience (mostly on complex metagenomes), the graph object itself does not take as much RAM as might be expected. Most of the hard work has already been done by velvet itself, particularly if -cov_cutoff has been set. However, parsing the graph can take many minutes or even hours if the LastGraph file is big (>500MB). The slowest part is parsing the positions of reads, which are present when velvet was run with the -read_trkg yes option. To speed up that process, parsing can be restricted to reads of interest, e.g.

require 'set'

velvet_result.last_graph(:interesting_read_ids => Set.new([1,2,3]))

This parses the positions of only the reads with IDs 1, 2 and 3.
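
The idea behind restricting the parse to a set of read IDs can be sketched in plain Ruby. This is illustrative only and is not bio-velvet's implementation. It assumes the tab-separated read-tracking layout described in the velvet manual, where an `NR <node_id> <number_of_reads>` line is followed by `<read_id> <offset_from_start_of_node> <start_coord>` lines. The `noded_reads_for` helper and the hash keys it returns are hypothetical names.

```ruby
require 'set'
require 'stringio'

# Illustrative only: collect read positions from the NR section of a
# LastGraph-style stream, keeping just the reads whose IDs are in `wanted`.
def noded_reads_for(io, wanted)
  positions = Hash.new { |hash, node_id| hash[node_id] = [] }
  current_node = nil
  io.each_line do |line|
    fields = line.chomp.split("\t")
    if fields[0] == 'NR'
      # Start of the read list for a node (negative IDs are reverse strands).
      current_node = fields[1].to_i
    elsif current_node && fields.length == 3
      read_id = fields[0].to_i
      # Set membership is a hash lookup, so uninteresting reads are skipped cheaply.
      next unless wanted.include?(read_id)
      positions[current_node] << {
        read_id: read_id, offset: fields[1].to_i, start_coord: fields[2].to_i
      }
    end
  end
  positions
end

example = StringIO.new("NR\t1\t3\n1\t0\t0\n4\t12\t0\n2\t5\t3\nNR\t-1\t1\n7\t0\t0\n")
reads = noded_reads_for(example, Set.new([1, 2, 3]))
p reads[1].map { |r| r[:read_id] } # prints [1, 2]
```

Reads outside the set never have objects built for them, which is the reason such a filter can save time on large files.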

With a parsed graph (a Bio::Velvet::Graph object), you can interact with the graph, e.g.

graph.kmer_length #=> 87
graph.nodes #=> Bio::Velvet::Graph::NodeArray object
graph.nodes[3] #=> Bio::Velvet::Graph::Node object with node ID 3
graph.get_arcs_by_node_id(1, 3) #=> an array of arcs between nodes 1 and 3 (Bio::Velvet::Graph::Arc objects)
graph.nodes[5].noded_reads #=> array of Bio::Velvet::Graph::NodedRead objects, for read tracking

There is much more that can be done to interact with the graph object and its components - see the rubydoc.

Parsers for Sequences and CnyUnifiedSeq.names files

With default parameters, velvet generates a Sequences file that includes read ID information and the sequences themselves.

seqs = Bio::Velvet::Sequences.parse_from_file(File.join(velvet_result.result_directory, 'Sequences'))
seqs[1] #=> 'AAAATTGTCAGACTAGCTATCAGCATATCAGCGCGCATCTCAGACGAGCACTATC'
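
To make the lookup by read ID more concrete, here is a rough sketch of reading such a file by hand. It assumes a FASTA-like layout in which each header line is `>name<TAB>read_id<TAB>category` and the sequence may wrap over several lines. That layout is an assumption for illustration, and `parse_sequences` is a hypothetical helper rather than the gem's parser.

```ruby
require 'stringio'

# Illustrative only: map read ID => sequence for a FASTA-like stream whose
# headers are assumed to look like ">name\tread_id\tcategory".
def parse_sequences(io)
  seqs = {}
  current_id = nil
  io.each_line do |line|
    line = line.chomp
    next if line.empty?
    if line.start_with?('>')
      _name, id, _category = line[1..-1].split("\t")
      current_id = id.to_i
      seqs[current_id] = String.new
    elsif current_id
      # Sequence lines may be wrapped, so append to the current record.
      seqs[current_id] << line
    end
  end
  seqs
end

example = StringIO.new(">read1\t1\t0\nAAAATTGT\nCAGACT\n>read2\t2\t0\nGGCC\n")
seqs = parse_sequences(example)
puts seqs[1] # prints AAAATTGTCAGACT
```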

If the -create_binary flag is set when running velveth, a names file is generated that encodes the read names and IDs.

entries = Bio::Velvet::CnyUnifiedSeqNamesFile.extract_entries(
  File.join(velvet_result.result_directory, 'CnyUnifiedSeq.names'),
  ['read1','read2']
  ) #=> Hash of read name to Array of CnyUnifiedSeqNamesFileEntry objects
entries['read1'] #=> Array of CnyUnifiedSeqNamesFileEntry objects
entries['read1'][0].read_id #=> 1 (i.e. '1'.to_i)

When speed is required, grep can come to the rescue (at the cost of some portability):

entries = Bio::Velvet::CnyUnifiedSeqNamesFile.extract_entries_using_grep_hack(
  File.join(velvet_result.result_directory, 'CnyUnifiedSeq.names'),
  ['read1','read2']
  ) #=> same returned object as above
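
The general shape of a grep-based pre-filter can be sketched as follows. This is not the gem's code. It assumes, purely for illustration, a names file whose lines look like `>name<TAB>read_id<TAB>category`, and that a `grep` supporting `-F` and `-f` is on the PATH. The real CnyUnifiedSeq.names layout and the gem's actual grep invocation may differ, and `grep_names` is a hypothetical helper.

```ruby
require 'open3'
require 'tempfile'

# Illustrative only: let grep pre-filter candidate lines, then confirm exact
# name matches in Ruby. Returns a Hash of read name => Array of matching lines.
def grep_names(names_path, read_names)
  Tempfile.create('patterns') do |patterns|
    # One fixed-string pattern per read name; the trailing tab prevents
    # 'read1' from also matching 'read10'.
    read_names.each { |name| patterns.puts(">#{name}\t") }
    patterns.flush
    stdout, status = Open3.capture2('grep', '-F', '-f', patterns.path, names_path)
    # grep exits with 1 when nothing matched and with 2 or more on error.
    raise 'grep failed' unless [0, 1].include?(status.exitstatus)

    found = Hash.new { |hash, name| hash[name] = [] }
    stdout.each_line do |line|
      name = line.chomp[1..-1].to_s.split("\t").first
      found[name] << line.chomp if read_names.include?(name)
    end
    found
  end
end

Tempfile.create('names') do |names|
  names.write(">read1\t1\t0\n>read10\t2\t0\n>read2\t3\t0\n")
  names.flush
  p grep_names(names.path, ['read1', 'read2'])
end
```

Shelling out like this ties the code to systems that provide a compatible grep, which is the portability cost mentioned above.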

The sequences themselves are stored in a separate file when -create_binary is used - an interface for this is included in the bio-velvet_underground biogem.

Project home page

For information on the source tree, documentation, examples, issues and how to contribute, see

http://github.com/wwood/bioruby-velvet

The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.

Cite

This code is currently unpublished.

Biogems.info

This Biogem is listed at biogems.info

Copyright

Copyright (c) 2013 Ben J Woodcroft. See LICENSE.txt for further details.
