Skip to content
Parsing and mutation calling from SAM, CIGAR and MD:Z.
Ruby
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
bin
lib
spec
test
.document
.gitignore
.rspec
.travis.yml
Gemfile
LICENSE.txt
README.md
README.rdoc
Rakefile

README.md

bio-sam-mutation

Build Status

  • Methods for calling mutations from SAM alignments, including CIGAR and MD tag parsers.

  • Annotates mutations in HGVS format: http://www.hgvs.org/mutnomen/recs.html.

  • Incorporates Ensembl VEP lookup.

Installation

gem install bio-sam-mutation

Usage

Command line

mutations -r -c config.yml *.sam

Produces a report using settings in config.yml. Report contains read counts for each mutant allele broken down by file. Use to identify mutants generated by CRISPR mutagenesis (for example), with settings to specify target region.

gem install bioruby-sam-mutation # Ruby > 2.1 required.

mutations -t -c config.yml *.sam

mutations --help

Tags a SAM file with mutation calls in HGVS format, using the settings in config.yml.

Example config file (YAML format). Multiple amplicons can be deconvoluted on-the-fly by specifiying the start sequence (allows reuse of sequencing barcodes). Transcript specifies the transcript of interest, required for Ensembl VEP consequence lookup. Translation start should be specified when aligning to cDNA sequence to get accurate HGVS annotation.

:species: "human"
:products:
  pool1:
    :start: "GGG"
    :offset: 25
    :length: 140
    :transcript: "ENST00000366794"
    :translation_start: 145
  pool2:
    :start: "AGT"
    :offset: 10
    :length: 320
    :transcript: "ENST00000366794"
    :translation_start: 145

Include colons prior to field names where shown (so they are interpreted as symbols in Ruby).

Gem

require 'bio-sam-mutation'

  # NB must be tab-delimited
  insertion_and_deletion = Bio::DB::Alignment.new("I2M5K:00253:00406\t0\t5\t112839854\t70\t63M2I138M1D27M7S\t*\t0\t0\tCAGTGATCTTCCAGATAGCCCTGGACAAACCATGCCACCAAGCAGAAGTAAAACACCTCCACCATACCTCCTCAAACAGCTCAAACCAAGCGAGAAGTACCTAAAAATAAAGCACCTACTGCTGAAAAGAGAGAGAGTGGACCTAAGCAAGCTGCAGTAAATGCTGCAGTTCAGAGGGTCCAGGTTCTTCCAGATGCTGATACTTATTACATTTTGCCACGGAAAGTACTGCTGAGG\t@CDDDCCCCACACCCCCCCC?CCACCCC>A6;;;;7;;6;6;BC;;6;;;;;.;;>ADDA??;;;;;?CCACCCD>C??@CCCC>C@C;>?CCCC@C=::@:::::+:::/:CCC?>>>>CCCCDDD9CCCC@AB????=AB>??;?BB>@@@AA???CC<@@?????BB>??;;;B<BC;??8;6:A=@=@BBB;;;?<77//*08*088888*8=9=?B7;;4;??????????<\tPG:Z:novoalign\tAS:i:183\tUQ:i:183\tNM:i:3\tMD:Z:201^T27")

  insertion_and_deletion.mutations
  #=> [#<Bio::Mutation:0x007fa20b5b4fc8 @position=112839916, @type=:insertion, @reference=nil, @mutant="AT", @seqname="5">, #<Bio::Mutation:0x007fa20b5b4960 @position=112840055, @type=:deletion, @reference="T", @mutant=nil, @seqname="5">]

  insertion_and_deletion.mutations.first.to_hgvs("g")
  #=> "5:g.112839916_112839917insAT"

  puts YAML.dump(insertion_and_deletion.mutations.first.vep("human","g").first["transcript_consequences"].keep_if{|c| c["transcript_id"] == "ENST00000257430"})
  #---
  # - variant_allele: AT
  #   cdna_end: 4379
  #   codons: cca/ccATa
  #   protein_end: 1441
  #   strand: 1
  #   hgnc_id: HGNC:583
  #   amino_acids: P/PX
  #   gene_symbol: APC
  #   cdna_start: 4378
  #   transcript_id: ENST00000257430
  #   cds_start: 4322
  #   gene_id: ENSG00000134982
  #   protein_start: 1441
  #   biotype: protein_coding
  #   gene_symbol_source: HGNC
  #   cds_end: 4323
  #   consequence_terms:
  #   - frameshift_variant
  #   impact: HIGH
  # => nil

# E.g. of full request return
# http://rest.ensembl.org/documentation/info/vep_hgvs_get
insertion_and_deletion.mutations(112839854).first.vep("human","g")
# => [{"assembly_name"=>"GRCh38", "end"=>112839917, "seq_region_name"=>"5", "transcript_consequences"=>[{"gene_id"=>"ENSG00000134982", "distance"=>46, "variant_allele"=>"AT", "biotype"=>"nonsense_mediated_decay", "gene_symbol_source"=>"HGNC", "consequence_terms"=>["downstream_gene_variant"], "strand"=>1, "hgnc_id"=>"HGNC:583", "gene_symbol"=>"APC", "transcript_id"=>"ENST00000502371", "impact"=>"MODIFIER"}, {"variant_allele"=>"AT", "cdna_end"=>4380, "codons"=>"-/AT", "protein_end"=>1442, "strand"=>1, "hgnc_id"=>"HGNC:583", "amino_acids"=>"-/X", "gene_symbol"=>"APC", "cdna_start"=>4379, "transcript_id"=>"ENST00000257430", "cds_start"=>4323, "gene_id"=>"ENSG00000134982", "protein_start"=>1441, "biotype"=>"protein_coding", "gene_symbol_source"=>"HGNC", "cds_end"=>4324, "consequence_terms"=>["frameshift_variant"], "impact"=>"HIGH"}, {"gene_id"=>"ENSG00000134982", "distance"=>863, "variant_allele"=>"AT", "biotype"=>"protein_coding", "gene_symbol_source"=>"HGNC", "consequence_terms"=>["downstream_gene_variant"], "strand"=>1, "hgnc_id"=>"HGNC:583", "gene_symbol"=>"APC", "transcript_id"=>"ENST00000507379", "impact"=>"MODIFIER"}, {"variant_allele"=>"AT", "cdna_end"=>4481, "codons"=>"-/AT", "protein_end"=>1442, "strand"=>1, "hgnc_id"=>"HGNC:583", "amino_acids"=>"-/X", "gene_symbol"=>"APC", "cdna_start"=>4480, "transcript_id"=>"ENST00000508376", "cds_start"=>4323, "gene_id"=>"ENSG00000134982", "protein_start"=>1441, "biotype"=>"protein_coding", "gene_symbol_source"=>"HGNC", "cds_end"=>4324, "consequence_terms"=>["frameshift_variant"], "impact"=>"HIGH"}, {"gene_id"=>"ENSG00000134982", "distance"=>409, "variant_allele"=>"AT", "biotype"=>"protein_coding", "gene_symbol_source"=>"HGNC", "consequence_terms"=>["downstream_gene_variant"], "strand"=>1, "hgnc_id"=>"HGNC:583", "gene_symbol"=>"APC", "transcript_id"=>"ENST00000512211", "impact"=>"MODIFIER"}, {"gene_id"=>"ENSG00000134982", "variant_allele"=>"AT", "cdna_end"=>4569, "biotype"=>"nonsense_mediated_decay", "gene_symbol_source"=>"HGNC", "consequence_terms"=>["3_prime_UTR_variant", "NMD_transcript_variant"], "strand"=>1, "hgnc_id"=>"HGNC:583", "gene_symbol"=>"APC", "cdna_start"=>4568, "transcript_id"=>"ENST00000508624", "impact"=>"MODIFIER"}, {"gene_id"=>"ENSG00000258864", "variant_allele"=>"AT", "biotype"=>"nonsense_mediated_decay", "gene_symbol_source"=>"Clone_based_vega_gene", "consequence_terms"=>["intron_variant", "NMD_transcript_variant"], "strand"=>1, "gene_symbol"=>"CTC-554D6.1", "transcript_id"=>"ENST00000520401", "impact"=>"MODIFIER"}, {"gene_id"=>"ENSG00000134982", "distance"=>2195, "variant_allele"=>"AT", "biotype"=>"protein_coding", "gene_symbol_source"=>"HGNC", "consequence_terms"=>["downstream_gene_variant"], "strand"=>1, "hgnc_id"=>"HGNC:583", "gene_symbol"=>"APC", "transcript_id"=>"ENST00000504915", "impact"=>"MODIFIER"}], "strand"=>1, "id"=>"5:g.112839917_112839918insAT", "allele_string"=>"-/AT", "most_severe_consequence"=>"frameshift_variant", "start"=>112839918}]


The API doc is online. For more code examples see the test files in the source tree.

Project home page

Information on the source tree, documentation, examples, issues and how to contribute, see

http://github.com/stveep/bioruby-sam-mutation

The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.

Cite

If you use this software, please cite one of

Biogems.info

This Biogem is published at (http://biogems.info/index.html#bio-sam)

Copyright

Copyright (c) 2015 stveep. See LICENSE.txt for further details.

You can’t perform that action at this time.