Convert various sequence formats to FASTA





You may wonder why this tool even exists. Well, I tried to do the right thing and use established tools like readseq and seqret from EMBOSS, but both mangled IDs containing | or . characters, and there is no way to fix this behaviour. This resulted in inconsistencies between the .gbk and .fna versions of files in my pipelines.

Then you may wonder why I didn't use BioPerl or Biopython. Well, they are heavyweight libraries, and actually very slow at parsing Genbank files. This script uses only core Perl modules, has no other dependencies, and runs very quickly.

It supports the following input formats:

  1. Genbank flat file, typically .gb, .gbk, .gbff (starts with LOCUS)
  2. EMBL flat file, typically .embl (starts with ID)
  3. GFF with sequence, typically .gff, .gff3 (starts with ##gff)
  4. FASTA DNA, typically .fasta, .fa, .fna, .ffn (starts with >)
  5. FASTQ DNA, typically .fastq, .fq (starts with @)
  6. CLUSTAL alignments, typically .clw, .clu (starts with CLUSTAL or MUSCLE)
  7. STOCKHOLM alignments, typically .sth (starts with # STOCKHOLM)
  8. GFA assembly graph, typically .gfa (starts with ^[A-Z]\t)
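
The "starts with" hints above suggest that a format can be guessed from the first line of the input. The following is a minimal Python sketch of that idea, not the tool's actual Perl code. The names `SIGNATURES` and `sniff_format` are hypothetical, and the patterns are copied directly from the list above.

```python
import re

# Hypothetical signature table built from the "starts with" hints in the README.
# The order only matters if two patterns could match the same line.
SIGNATURES = [
    ("genbank", re.compile(r"^LOCUS")),
    ("embl", re.compile(r"^ID")),
    ("gff", re.compile(r"^##gff")),
    ("fasta", re.compile(r"^>")),
    ("fastq", re.compile(r"^@")),
    ("clustal", re.compile(r"^(CLUSTAL|MUSCLE)")),
    ("stockholm", re.compile(r"^# STOCKHOLM")),
    ("gfa", re.compile(r"^[A-Z]\t")),
]


def sniff_format(first_line):
    """Return a format name for the first line of a file, or None if unknown."""
    for name, pattern in SIGNATURES:
        if pattern.search(first_line):
            return name
    return None


if __name__ == "__main__":
    for line in ("LOCUS       example", ">seq1", "S\t1\tACGT", "not a sequence"):
        print(repr(line), "->", sniff_format(line))
```

A table of patterns keeps the sketch easy to extend: supporting another format would mean adding one more entry.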

Files may be compressed with:

  1. gzip, typically .gz
  2. bzip2, typically .bz2
  3. zip, typically .zip
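
This README does not say how any2fasta recognises a compressed file. One common approach, sketched below in Python, is to look at the file's leading "magic" bytes rather than trust the extension. The function name `open_maybe_compressed` is hypothetical, and reading only the first member of a zip archive is an assumption made for illustration.

```python
import bz2
import gzip
import io
import zipfile


def open_maybe_compressed(path):
    """Return a text-mode handle, picking a decompressor from the magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(4)
    if magic[:2] == b"\x1f\x8b":          # gzip
        return gzip.open(path, "rt")
    if magic[:3] == b"BZh":               # bzip2
        return bz2.open(path, "rt")
    if magic == b"PK\x03\x04":            # zip (non-empty archive)
        archive = zipfile.ZipFile(path)
        first_member = archive.namelist()[0]  # assumption: use the first member only
        return io.TextIOWrapper(archive.open(first_member))
    return open(path, "rt")               # fall back to plain text
```

Checking the content instead of the name means a gzip file without a .gz suffix would still be handled.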


Installation

any2fasta has no dependencies except Perl 5.10 or higher. It only uses core modules, so no CPAN needed.

Direct script download

% cd /usr/local/bin  # choose a folder in your $PATH
% wget
% chmod +x any2fasta


Homebrew

% brew install brewsci/bio/any2fasta


Conda

% conda install -c bioconda any2fasta


Source

% git clone
% cp any2fasta/any2fasta /usr/local/bin # choose a folder in your $PATH

Test Installation

% ./any2fasta -v
any2fasta 0.4.2

% ./any2fasta -h
  any2fasta 0.4.2
  Convert various sequence formats into FASTA
  any2fasta [options] file.{gb,fa,fq,gff,gfa,clw,sth}[.gz,bz2,zip] > output.fasta
  -h       Print this help
  -v       Print version and exit
  -q       No output while running, only errors
  -n       Replace ambiguous IUPAC letters with 'N'
  -l       Lowercase the sequence
  -u       Uppercase the sequence


Examples

% any2fasta ref.gbk > ref.fna

% any2fasta in.fasta > out.fasta  # should behave like "cat"

% any2fasta prokka.gff > prokka.fna  # only if GFF has FASTA appended

% any2fasta - < > file.fasta  # '-' means stdin

% any2fasta genes.gff.gz > genes.ffn  # automatically decompresses

% any2fasta 2.fa.gz 3.gff.bz2 - > out.fa  # multiple files and stdin

% any2fasta R1.fq.gz | bzip2 > R1.fa.bz2  # 'seqtk seq -A' is much faster

% any2fasta -q 23S.clw > 23S.aln  # gaps '-' will be preserved

% any2fasta pfam4321.sth > pfam4321.aln  # '.' gaps will become '-'
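
To make the last example concrete, here is a minimal Python sketch of turning a Stockholm alignment into aligned FASTA, with '.' gaps rewritten as '-'. It is an illustration of the described behaviour, not the tool's implementation. The function name `stockholm_to_fasta` is hypothetical, and the sketch assumes every sequence line has the form "name sequence".

```python
def stockholm_to_fasta(text):
    """Convert Stockholm alignment text to aligned FASTA, turning '.' into '-'."""
    seqs = {}  # name -> sequence, in order of first appearance
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, markup lines ('#...') and the '//' terminator.
        if not line or line.startswith("#") or line == "//":
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: anything else is not a sequence line
        name, chunk = parts
        # Interleaved blocks repeat the name, so append to what we already have.
        seqs[name] = seqs.get(name, "") + chunk.replace(".", "-")
    return "".join(f">{name}\n{seq}\n" for name, seq in seqs.items())


if __name__ == "__main__":
    example = "# STOCKHOLM 1.0\nseq1  AC.GT\nseq2  A-.GT\n//\n"
    print(stockholm_to_fasta(example), end="")
```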


Options

  • -n replaces any characters that aren't A,C,G,T with N (gaps preserved)
  • -l will lowercase all the letters
  • -u will uppercase all the letters
  • -q will prevent logging messages being printed
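
The effect of the sequence-modifying options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's code. The function name `apply_options` is hypothetical. The sketch assumes that lowercase a, c, g, t are kept by -n, that the replacement letter is an uppercase N, and that case conversion happens after the -n step.

```python
import re


def apply_options(seq, n=False, lower=False, upper=False):
    """Mimic the described effect of -n, -l and -u on one sequence string."""
    if n:
        # Keep A, C, G, T (either case, an assumption) and '-' gaps; the rest becomes N.
        seq = re.sub(r"[^ACGTacgt-]", "N", seq)
    if lower:
        seq = seq.lower()
    if upper:
        seq = seq.upper()
    return seq


if __name__ == "__main__":
    print(apply_options("ACGTRYKM-ACGT", n=True))
    print(apply_options("AcGt", lower=True))
    print(apply_options("AcGt", upper=True))
```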


Submit feedback to the Issue Tracker


License

GPL v3


Author

Torsten Seemann
