betsig edited this page Oct 8, 2019 · 2 revisions

borf - orf prediction using python

borf is a command-line Python tool for translating RNA sequences into open reading frames (ORFs). The defaults for borf are set to provide the most fitting ORF translations from de novo assembled transcripts, such as those generated by Trinity. In addition to the ORF predictions, borf writes details about ORF locations within the provided transcripts to a separate file.
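To illustrate what translating a transcript in a given reading frame involves, here is a minimal Python sketch. It is not borf's implementation: the helper name `translate_frame`, the use of the standard genetic code, and the choice to translate ambiguous codons as `X` are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(seq, frame=1):
    """Translate a nucleotide sequence in reading frame 1, 2 or 3.

    Hypothetical helper for illustration only; not part of borf.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    start = frame - 1
    # Only complete codons are translated; trailing bases are ignored.
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    # Codons containing ambiguous bases (e.g. N) become X (an assumption).
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

Splitting each translated frame at `*` then gives the uninterrupted stretches of amino acids from which candidate ORFs can be chosen.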

Installation

To install from PyPi:

pip install borf

Usage

To run, call borf and provide it with a fasta file as the first argument.

borf test.fa

This will run borf with the default settings and produce two output files: test.pep and test.txt.

Output

borf produces two files, a .pep and a .txt file. The .pep file contains the predicted ORF sequences in fasta format, and the .txt file contains details about the predicted ORFs.

.txt output format

| column name | description |
| --- | --- |
| orf_id | id assigned to the predicted ORF sequences in the corresponding .pep file |
| transcript_id | transcript id (from the input .fa file) |
| frame | ORF reading frame (1-3) |
| strand | ORF strand (+/-) |
| seq_length_nt | length of the ORF in nt |
| start_site_nt | position of the first nucleotide of the first predicted amino acid |
| stop_site_nt | position of the last nucleotide of the last predicted amino acid |
| utr3_length_nt | length of the 3' UTR in nt |
| start_site_aa | position of the ORF's first predicted amino acid |
| stop_site_aa | position of the ORF's stop site (*) or the last predicted amino acid (when no stop codon is found) |
| orf_length_aa | length of the ORF in aa |
| first_aa_MET | is the first amino acid of the ORF a methionine (M/MET)? (M/ALT) |
| final_aa_stop | is the last amino acid of the ORF a STOP (*)? (STOP/ALT) |
| orf_class | ORF class. One of 'complete' (first aa is M, and last is *); 'incomplete_5prime' (first aa is not M, and last is *); 'incomplete_3prime' (first aa is M, and last is not *); or 'incomplete' (first aa is not M, and last is not *) |
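The orf_class rules in the table can be written as a small Python function. This is a sketch of the rule as described here, not borf's own code; the function name `classify_orf` is hypothetical, and it assumes the ORF is given as an amino acid string in which a stop is written as `*`.

```python
def classify_orf(aa_seq):
    """Return the ORF class for an amino acid string (hypothetical helper)."""
    first_is_met = aa_seq.startswith("M")
    last_is_stop = aa_seq.endswith("*")
    if first_is_met and last_is_stop:
        return "complete"
    if last_is_stop:
        # Ends with a stop but does not start with M.
        return "incomplete_5prime"
    if first_is_met:
        # Starts with M but runs off the end without a stop.
        return "incomplete_3prime"
    return "incomplete"
```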

We provide ORF classes because transcript annotations, particularly for de novo assembled transcripts, may be incomplete and missing parts of the 3' or 5' end. This allows ORFs that have uninterrupted strings of amino acids, but not necessarily a start or a stop codon, to still be returned and used in downstream applications such as functional domain annotation.
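For downstream use, the .txt file can be loaded and filtered by orf_class. The sketch below assumes the file is tab-delimited with a header row holding the column names listed above; that layout is an assumption, so check your own output before relying on it. The helper names `read_orf_table` and `select_by_class` are hypothetical.

```python
import csv


def read_orf_table(lines):
    """Parse borf .txt output from an open file or any iterable of lines.

    Assumes tab-delimited text with a header row (not confirmed by the docs).
    """
    return list(csv.DictReader(lines, delimiter="\t"))


def select_by_class(rows, classes):
    """Keep only the rows whose orf_class is one of the given classes."""
    return [row for row in rows if row["orf_class"] in classes]
```

For example, `with open("test.txt", newline="") as handle: rows = read_orf_table(handle)` followed by `select_by_class(rows, {"complete", "incomplete_5prime"})` would keep only the ORFs that end in a stop codon.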

Options

borf has several options which can be changed to suit your data. To display all, use the -h or --help flag.

$ borf --help
usage: borf [-h] [-o OUTPUT_PATH] [-s] [-a] [-l ORF_LENGTH]
            [-u UPSTREAM_INCOMPLETE_LENGTH]
            fasta_file

Get orf predicitions from a nucleotide fasta file

positional arguments:
  fasta_file            fasta file to predict ORFs

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
                        path to write output files. [OUTPUT_PATH].pep and
                        [OUTPUT_PATH].txt (default: input .fa file name)
  -s, --strand          Predict orfs for both strands
  -a, --all_orfs        Return all ORFs for each sequence longer than the
                        cutoff
  -l ORF_LENGTH, --orf_length ORF_LENGTH
                        Minimum ORF length (AA). (default: 100)
  -u UPSTREAM_INCOMPLETE_LENGTH, --upstream_incomplete_length UPSTREAM_INCOMPLETE_LENGTH
                        Minimum length (AA) of uninterupted sequence upstream
                        of ORF to be included for incomplete_5prime
                        transcripts (default: 50)

By default, the output files are written to the same location as the input file, with .fa replaced by .pep or .txt. To change this, use the -o flag to provide a base file name (and path). For example, borf test.fa -o test_borf will produce test_borf.pep and test_borf.txt.
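The naming rule can be sketched in Python as follows. This is an illustration of the behaviour described above, not borf's code: the function name `default_output_paths` is hypothetical, and stripping whatever extension the input has (rather than only .fa) is an assumption.

```python
import os


def default_output_paths(fasta_file, output_path=None):
    """Return the (.pep, .txt) paths for a run (hypothetical helper)."""
    if output_path is not None:
        base = output_path  # value that would be passed with -o
    else:
        # Assumption: the input extension is dropped to form the base name.
        base = os.path.splitext(fasta_file)[0]
    return base + ".pep", base + ".txt"
```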

To return all predicted ORFs longer than the minimum ORF length, use the -a flag.

To predict ORFs on both strands, use the -s flag. Note that unless the -a flag is also used, only the single longest ORF will be reported for each transcript, not one prediction for each strand.
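As a rough picture of what searching both strands and keeping one ORF per transcript means, the sketch below reverse-complements a sequence and picks the longest candidate from a list. It is not borf's implementation: the names `reverse_complement` and `longest_orf` are hypothetical, the candidate tuple layout is invented for the example, and how borf breaks ties between equally long ORFs is not documented here (this sketch keeps the first one).

```python
# Map each base to its complement; U is treated like T, so output uses DNA letters.
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(candidates):
    """Pick the longest candidate from (strand, frame, aa_seq) tuples."""
    return max(candidates, key=lambda candidate: len(candidate[2]))
```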

The default ORF length is set to 100 amino acids. This can be changed using the -l argument and providing an integer.

The default upstream incomplete length is 50 amino acids. This can be changed using the -u argument and providing an integer.

NOTE: We do not recommend setting this lower than 50 AA. A large proportion of transcripts have uninterrupted upstream AAs up to 40 AA long (average ~25 AA in well-annotated human transcripts) that do not code for protein sequence.
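When borf is run from a script, the options above can be assembled into an argument list. The sketch below uses only the flags shown in the help text; the function name `build_borf_command` and its keyword names are hypothetical, and options left as None are omitted so that borf's own defaults apply.

```python
def build_borf_command(fasta_file, output_path=None, both_strands=False,
                       all_orfs=False, orf_length=None,
                       upstream_incomplete_length=None):
    """Build an argument list for calling borf (hypothetical helper)."""
    cmd = ["borf", fasta_file]
    if output_path is not None:
        cmd += ["-o", output_path]
    if both_strands:
        cmd.append("-s")
    if all_orfs:
        cmd.append("-a")
    if orf_length is not None:
        cmd += ["-l", str(orf_length)]
    if upstream_incomplete_length is not None:
        cmd += ["-u", str(upstream_incomplete_length)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, assuming borf is installed and available on your PATH.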
