
JSON-based FON (Feature Object Notation) format and tools to simplify genomic annotations usage






FONtools combines:

  • The new FON (Feature Object Notation) format to store genomic annotations based on JSON
  • Command-line tools to work with FON files: import (from Ensembl, Ensembl Genomes or any GFF3), export, merge, mask sequence, filter annotations and more.
  • A Python library to work with FON files

With FONtools:

  • Using genomic annotations is easy and program-friendly. No need to write a parser to extract specific features from annotations: use them directly from the FON files.
  • Using genomic annotations is standardized. FON is based on JSON, for which nearly every programming language provides a parser.
  • Importing annotations from an external source (Ensembl) is convenient and included.
  • Maintaining your repository of genomic annotations, sequence and indices for read-mapping software is built-in.

Why the FON format?

FON is a program-friendly and extensible format to store genomic annotations. Since FON is using the JSON format for storing data, and JSON stands for JavaScript Object Notation, we named our format FON for Feature Object Notation.

Genomic annotations are mainly stored in files using the BED or GFF (specifically GFF3) formats.

Format | Base format | Parse                   | Hierarchical | Extensible | Coordinates
BED    | tab         | Simple                  | No           | Limited    | 0-based
GFF    | tab         | Complex                 | Yes          | Yes        | 1-based
FON    | JSON        | Existing JSON libraries | Possible     | Yes        | 0-based

While BED is simple to parse, it was not designed to store hierarchical annotations, such as exons on a transcript. Instead, BED12 "sub-splits" columns using commas rather than tabs to store the exon coordinates of transcripts. GFF, in contrast, allows hierarchical annotations but is difficult to parse: it translates such structures into multiple records linked by a common ID. This approach is generic and describes annotations as a graph, which requires more complex parsing code.

To overcome these limitations, FON format enables simple parsing and hierarchical annotation storage by capitalizing on the strengths of the JSON format:

  • Dictionaries and lists are used to store structured annotations, such as a list of exons (structures are limited to trees, whereas GFF can describe graphs).
  • Types are implicit: quoted names are imported as strings while coordinates are imported as numbers. Describing the type of each column or attribute is no longer necessary. More properties can be added to any annotation without extending the format.
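As a minimal sketch of what this means in practice, the snippet below parses a small FON document with Python's standard json module. The inline document is a trimmed copy of the zebrafish example shown later in this README, not a complete FON file; a real file would be read with json.load instead.

```python
import json

# A trimmed FON1 document (values copied from the zebrafish example in this README).
fon_text = """
{
    "fon_version": 1,
    "features": [
        {
            "transcript_stable_id": "ENSDART00000171909",
            "gene_name": "pacsin3",
            "chrom": "7",
            "strand": "+",
            "exons": [[54260003, 54260129], [54263505, 54263662]]
        }
    ]
}
"""

fon = json.loads(fon_text)

for feature in fon["features"]:
    # Names arrive as str and coordinates as int, without any per-column type declaration.
    exon_lengths = [end - start for start, end in feature["exons"]]
    print(feature["gene_name"], feature["chrom"], feature["strand"], exon_lengths)
```

No custom parser is involved: the exons are already a list of [start, end] pairs once the JSON is loaded.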

FON isn't intended to replace GFF to share genomic annotations, but rather to simplify, ease and streamline the use of annotations within programs and pipelines.


FON1 is the first version of the Feature Object Notation format. It stores features in a list. Each feature is a dictionary with a set of defined keys. Keys can be freely added to or removed from each feature; none of them is required. Programs that use specific keys should let the user select them by name (for example, the --key option of fon_mask_fasta). Chromosomes and scaffolds can be described in the assembly key with their name, level and length.

Example for one zebrafish transcript (one feature):

{
    "fon_version": 1,
    "assembly": [
        {
            "name": "1",
            "level": "chromosome",
            "length": 59578282
        },
        {
            "name": "2",
            "level": "chromosome",
            "length": 59640629
        },
        {
            "name": "KN149708.1",
            "level": "scaffold",
            "length": 20567
        }
    ],
    "features": [
        {
            "transcript_stable_id": "ENSDART00000171909",
            "gene_stable_id": "ENSDARG00000099339",
            "gene_name": "pacsin3",
            "protein_stable_id": "ENSDARP00000138886",
            "chrom": "7",
            "strand": "+",
            "transcript_version": "2",
            "gene_version": "4",
            "transcript_biotype": "protein_coding",
            "gene_biotype": "protein_coding",
            "exons": [[54260003, 54260129], [54263505, 54263662]],
            "exons_on_transcript": [[0, 126], [126, 283]],
            "cds_exons": [[54260075, 54260129], [54263505, 54263662]],
            "cds_exons_on_transcript": [[72, 126], [126, 283]],
            "cds_exons_frame": [0, 0],
            "cds_exons_frame_on_transcript": [0, 0],
            "utr5_exons": [[54260003, 54260075]],
            "utr5_exons_on_transcript": [[0, 72]],
            "utr3_exons": [],
            "utr3_exons_on_transcript": [],
            "go": {
                "GO:0097320": {
                    "term": "plasma membrane tubulation",
                    "domain": "biological_process",
                    "sources": []
                }
            }
        }
    ]
}

Field descriptions:

  • exons, cds_exons, utr5_exons, utr3_exons: Lists of exons, exons from the coding sequence (CDS), and exons from the 5' and 3' UTRs. Each exon is a list of start and end genomic coordinates. All coordinates are 0-based and relative to the forward genomic strand. In the example above, the transcript ENSDART00000171909 starts at position 54260003 on chromosome 7, which is equal to the first exon start.
  • XX_on_transcript: Lists of exon, CDS, etc. coordinates translated to transcript coordinates. The first exon start is equal to 0 and the last exon end is equal to the length of the transcript. The translation takes the strand of the transcript into account: coordinates run in the transcript's forward direction, making them directly usable on the transcript sequence stored in seq.
  • cds_exons_frame: Frame of the first nucleotide. With the coding sequence ATGGCA, the following 4 exons would have frame 0, 1, 2 and 0 respectively:
  • seq contains the transcript sequence.
  • go holds Gene Ontology (GO) terms, domains and sources. Optional: GO is only imported with the --go option from the import_ensembl script.
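The relation between genomic and transcript coordinates, and the frame values, can be illustrated with a short Python sketch. It is not FONtools code: both function names are hypothetical, intervals are assumed to be 0-based and end-exclusive (as the example feature suggests), and the frame is assumed to be the offset of an exon's first nucleotide within the CDS, modulo 3.

```python
def exons_on_transcript(exons, strand):
    """Translate genomic exon intervals to transcript coordinates.

    Assumes 0-based, end-exclusive intervals sorted on the forward genomic strand.
    """
    # On the reverse strand, the transcript starts at the last genomic exon.
    ordered = exons if strand == "+" else list(reversed(exons))
    translated = []
    offset = 0
    for start, end in ordered:
        length = end - start
        translated.append([offset, offset + length])
        offset += length
    return translated


def cds_frames(cds_exons_in_transcript_order):
    """Frame of the first nucleotide of each CDS exon (assumed: CDS offset modulo 3)."""
    frames = []
    cds_offset = 0
    for start, end in cds_exons_in_transcript_order:
        frames.append(cds_offset % 3)
        cds_offset += end - start
    return frames


# Values from the zebrafish example feature above ("+" strand).
exons = [[54260003, 54260129], [54263505, 54263662]]
print(exons_on_transcript(exons, "+"))
print(cds_frames([[72, 126], [126, 283]]))
```

Under these assumptions the two calls reproduce the exons_on_transcript and cds_exons_frame_on_transcript values of the example feature. Splitting the coding sequence ATGGCA into exons of lengths 1, 1, 1 and 3 gives frames 0, 1, 2 and 0.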

Future FON version

Future versions might address limitations of the current FON version. For example, FON1 doesn't allow features to be stored hierarchically; they are stored in a list. Contributions to add new FON versions are welcome.


See tags page.


pip3 install fontools

If you don't have root permission, install in your home directory using the --user option:

pip3 install fontools --user

Scripts are then installed in $HOME/.local/bin, which must be in your shell PATH for the scripts to run. After adding $HOME/.local/bin to your PATH, try:

import_ensembl -h

If you get an error message like import_ensembl: command not found, then your PATH isn't properly configured.

FONtools depends on pyfaidx for reading FASTA and pyfnutils for logging.


Script           | Description
import_ensembl   | Import Ensembl sequence and annotations
fon_import       | Import annotations to FON (from GFF3 for now)
fon_transform    | Transform FON file
merge_annot      | Merge FON/GFF3/FASTA files
ensembl2ucsc     | Convert names from Ensembl to UCSC (in FASTA, GFF3 and tab)
fasta_format     | Format and/or sort FASTA file (split sequence)
fon_mask_fasta   | Mask sequence (FASTA) using FON
fasta_seq_length | Create tab file with sequence(s) length from FASTA file

Import Ensembl

The import_ensembl script creates and maintains an Ensembl-based annotation repository including:

  • Download Ensembl annotations and sequences
  • Parse GFF to create FON annotations
  • Map chromosome and contig names to create genome tracks in UCSC
  • Create indices for read-mapping software.

Annotations are imported using fon_import, then fon_transform is used:

  • To create FON files restricted to a biotype, for example protein coding transcripts,

  • To create FON files selecting the longest isoform of each gene,

  • To create "metagene" FON files obtained by merging all isoforms of a gene together. Example of how a metagene is obtained from 3 isoforms:


    These "metagenes" can be used for counting HTS reads per gene, where reads mapping to any isoform will map to the metagene.
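The idea behind a metagene can be sketched as a union of exon intervals. The function below is a hypothetical illustration, not the fon_transform implementation, and the coordinates are made up; it assumes 0-based, end-exclusive intervals, so touching intervals are merged as well.

```python
def union_intervals(isoforms):
    """Merge the exon intervals of several isoforms into one sorted, non-overlapping list."""
    intervals = sorted(exon for exons in isoforms for exon in exons)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


# Three made-up isoforms of one gene.
isoforms = [
    [[100, 200], [300, 400]],
    [[150, 250], [300, 350]],
    [[500, 600]],
]
metagene_exons = union_intervals(isoforms)
print(metagene_exons)
```

A read overlapping any exon of any of the three isoforms then overlaps one of the merged intervals.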

The script is compatible with Ensembl and Ensembl Genomes (see option --division/-n).


The import_ensembl script aims to maintain a local Ensembl-based repository. It requires several options, but most of them stay the same from one run to the next: the data are usually stored in the same directories, and only the release number or the species changes on the command line. For convenience, all import_ensembl options can therefore be set in a JSON config file, in addition to the command line. This config file can be placed:

  • Either in one of the following two directories (step 2 below):
    • The directory defined by the environment variable $HTS_CONFIG_PATH, which can be set by the user.
    • The hts subdirectory of $XDG_CONFIG_HOME (that is, $XDG_CONFIG_HOME/hts). $XDG_CONFIG_HOME is defined by your desktop environment.
  • Or in any directory passed with the --path_config option, where a fontools.json config file is expected. This option is not used in this tutorial.
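The lookup described above could be sketched as follows. This is only an illustration: find_config is a hypothetical helper, and the search order (explicit directory first, then $HTS_CONFIG_PATH, then $XDG_CONFIG_HOME/hts) is an assumption rather than documented import_ensembl behavior.

```python
import json
import os
import tempfile
from pathlib import Path


def find_config(path_config=None, environ=None, filename="fontools.json"):
    """Return the first existing config file among the candidate directories, or None."""
    environ = os.environ if environ is None else environ
    candidates = []
    if path_config:
        candidates.append(Path(path_config))
    if environ.get("HTS_CONFIG_PATH"):
        candidates.append(Path(environ["HTS_CONFIG_PATH"]))
    if environ.get("XDG_CONFIG_HOME"):
        candidates.append(Path(environ["XDG_CONFIG_HOME"]) / "hts")
    for directory in candidates:
        config_file = directory / filename
        if config_file.is_file():
            return config_file
    return None


# Demonstration in a temporary directory, so no real configuration is touched.
with tempfile.TemporaryDirectory() as config_dir:
    config_path = Path(config_dir) / "fontools.json"
    config_path.write_text(json.dumps({"fontools_path_main": "/data/sai"}))
    found = find_config(environ={"HTS_CONFIG_PATH": config_dir})
    found_name = found.name
    config = json.loads(found.read_text())

missing = find_config(environ={})
print(found_name, config, missing)
```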

To configure import_ensembl script:

  1. Create the root/main directory. All downloaded files, FON files, sequences, etc. will be stored in this directory:
    mkdir /data/sai
    mkdir /data/sai/download
    In the example above, we are using the /data/sai directory (sai stands for Sequence Annotations & Indices). This is intended for system-wide installation. Alternatively, you can change it to the directory of your choice, for example, if you want to store the data in a sai directory in your home:
    mkdir ~/sai
    mkdir ~/sai/download
  2. To configure a config directory, add in your ~/.bashrc:
    export HTS_CONFIG_PATH="/etc/hts"
    Alternatively, in case you created a sai directory in your home:
    mkdir ~/sai/config
    Then, to set the HTS_CONFIG_PATH environment variable, add in your ~/.bashrc:
    export HTS_CONFIG_PATH="$HOME/sai/config"
  3. Create a fontools.json config file in /etc/hts:
    {
        "fontools_path_main": "/data/sai",
        "fontools_path_download": "/data/sai/download"
    }
    Alternatively, in case you created a sai directory in your home (please replace smith with your username), create a fontools.json config file in ~/sai/config:
    {
        "fontools_path_main": "/home/smith/sai",
        "fontools_path_download": "/home/smith/sai/download"
    }
  4. Optional. To automatically keep a log of import_ensembl actions, you can create the following directory. This will automatically create a different log file per Ensembl release. To specify the location for the log from the script, use the --path_log/-l option:
    mkdir /data/sai/log
    Alternatively, in case you created a sai directory in your home:
    mkdir ~/sai/log
  5. Optional. If mapping to UCSC chromosome and contig names is desired, download the mappings from the ChromosomeMappings repository:
    mkdir /data/sai/annots
    cd /data/sai/annots
    Alternatively, in case you created a sai directory in your home:
    mkdir ~/sai/annots
    cd ~/sai/annots
    tar xvfz master.tar.gz
    rm -f master.tar.gz
    mv ChromosomeMappings-master ChromosomeMappings
    Add the fontools_path_mapping key to the fontools.json config file:
    {
        "fontools_path_main": "/data/sai",
        "fontools_path_download": "/data/sai/download",
        "fontools_path_mapping": "/data/sai/annots/ChromosomeMappings"
    }
    Alternatively, in case you created a sai directory in your home (please replace smith with your username):
    {
        "fontools_path_main": "/home/smith/sai",
        "fontools_path_download": "/home/smith/sai/download",
        "fontools_path_mapping": "/home/smith/sai/annots/ChromosomeMappings"
    }
    Although using name mapping isn't required, it's recommended. Using name mapping, files containing chromosome lengths with UCSC names will be created. These files are essential to create UCSC genome browser tracks.


If you haven't set the environment variable HTS_CONFIG_PATH (see above), then:

  • Use the --path_config option to set the directory where the fontools.json config file is found, or
  • Use the --fontools_path_main and --fontools_path_download on the command line.

To list available species, use (for Ensembl 104):

import_ensembl -r 104 -s list

To get Ensembl 104 data for 4 species using 10 cores:

import_ensembl -r 104 -s danio_rerio,saccharomyces_cerevisiae,homo_sapiens,mus_musculus -p 10

To select what data are generated, use the --steps/-t option. Currently, the following steps are available:

  • genome: downloads the FASTA genome sequences, maps chromosome/contig names to UCSC names if requested, sorts the FASTA files, and creates the chromosome length file.
  • gene: downloads the GFF annotations, imports them to FON files, maps chromosome/contig names to UCSC names if requested, and creates FON files with:
    • metagenes (isoform union, see above),
    • longest transcripts.
  • bowtie2 and star: create indices for Bowtie2 and STAR respectively.
  • all of the above steps. This is the default.

To import terms, domains and sources from Gene Ontology (GO), add the --go option.


To import from Ensembl (-n) release 104 (-r), get FASTA and GFF (-t genome,gene) and convert to FON:

import_ensembl -n ensembl -r 104 -s caenorhabditis_elegans -t genome,gene

This command prints a detailed log, which is also recorded in log/ensembl104.log:

2021-05-06 13:53:05,501 - import_ensembl - INFO - Starting (caenorhabditis_elegans,104)
2021-05-06 13:53:06,303 - import_ensembl - INFO - Found assembly WBcel235 (toplevel)
2021-05-06 13:53:06,304 - import_ensembl - INFO - Downloading fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
2021-05-06 13:53:49,983 - import_ensembl - INFO - Downloading gff3/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.104.gff3.gz
2021-05-06 13:54:02,400 - import_ensembl - INFO - Downloading fasta/caenorhabditis_elegans/cdna/Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz
2021-05-06 13:54:22,319 - import_ensembl - INFO - Downloading fasta/caenorhabditis_elegans/ncrna/Caenorhabditis_elegans.WBcel235.ncrna.fa.gz
2021-05-06 13:54:25,103 - import_ensembl - INFO - Sorting to /data/sai/seqs/caeele_genome_all_ensembl_wbcel235.fa
2021-05-06 13:54:25,104 - import_ensembl - INFO - Start ['fasta_format', '--sort', '--input', '/data/sai/download/', '--output', '/data/sai/seqs/caeele_genome_all_ensembl_wbcel235.fa']
2021-05-06 13:54:28,613 - import_ensembl - INFO - Start ['cp', '/data/sai/download/', '/data/sai/annots/caeele_cdna_all_ensembl104.gff3.gz']
2021-05-06 13:54:28,617 - import_ensembl - INFO - Start ['gzip', '-d', '/data/sai/annots/caeele_cdna_all_ensembl104.gff3.gz']
2021-05-06 13:54:28,858 - import_ensembl - INFO - Creating chromosome length file /data/sai/annots/
2021-05-06 13:54:28,859 - import_ensembl - INFO - Start ['fasta_seq_length', '--input', '/data/sai/seqs/caeele_genome_all_ensembl_wbcel235.fa', '--output', '/data/sai/annots/']
2021-05-06 13:54:29,322 - import_ensembl - INFO - Creating chromosome length file for UCSC /data/sai/annots/
2021-05-06 13:54:29,322 - import_ensembl - INFO - Start ['ensembl2ucsc', '--input', '/data/sai/annots/', '--output', '/data/sai/annots/', '--path_mapping', '/data/sai/annots/ChromosomeMappings/WBcel235_ensembl2UCSC.txt']
2021-05-06 13:54:29,344 - import_ensembl - INFO - Importing annotation
2021-05-06 13:54:29,344 - import_ensembl - INFO - Start ['fon_import', '--annotation', '/data/sai/download/', '--data_source', 'ensembl', '--fasta', '/data/sai/download/', '--fasta', '/data/sai/download/', '--cdna', '--exclude_no_seq', '--biotype', 'all,protein_coding', '--output', '/data/sai/annots/caeele_cdna_${biotype}_ensembl104.fon${version}.json', '--output_format', 'fon']
2021-05-06 13:54:39,531 - import_ensembl - INFO - Transform FON (union,protein_coding)
2021-05-06 13:54:39,531 - import_ensembl - INFO - Start ['fon_transform', '--fon', '/data/sai/annots/caeele_cdna_protein_coding_ensembl104.fon1.json', '--method', 'union', '--output', '/data/sai/annots/caeele_cdna_union2gene_protein_coding_ensembl104.fon${version}.json']
2021-05-06 13:54:42,214 - import_ensembl - INFO - Transform FON (longest,protein_coding)
2021-05-06 13:54:42,214 - import_ensembl - INFO - Start ['fon_transform', '--fon', '/data/sai/annots/caeele_cdna_protein_coding_ensembl104.fon1.json', '--method', 'longest', '--output', '/data/sai/annots/caeele_cdna_longest_transcript_protein_coding_ensembl104.fon${version}.json']
2021-05-06 13:54:44,178 - import_ensembl - INFO - Transform FON (union,all)
2021-05-06 13:54:44,178 - import_ensembl - INFO - Start ['fon_transform', '--fon', '/data/sai/annots/caeele_cdna_all_ensembl104.fon1.json', '--method', 'union', '--output', '/data/sai/annots/caeele_cdna_union2gene_all_ensembl104.fon${version}.json']
2021-05-06 13:54:48,050 - import_ensembl - INFO - Transform FON (longest,all)
2021-05-06 13:54:48,050 - import_ensembl - INFO - Start ['fon_transform', '--fon', '/data/sai/annots/caeele_cdna_all_ensembl104.fon1.json', '--method', 'longest', '--output', '/data/sai/annots/caeele_cdna_longest_transcript_all_ensembl104.fon${version}.json']

The following files will be created:

├── annots
│   ├── caeele_cdna_all_ensembl104.fon1.json          <-- All transcripts
│   ├── caeele_cdna_all_ensembl104.gff3               <-- All transcripts (GFF3)
│   ├── caeele_cdna_longest_transcript_all_ensembl104.fon1.json              <-- Longest transcript per gene (all genes)
│   ├── caeele_cdna_longest_transcript_protein_coding_ensembl104.fon1.json   <-- Longest transcript per protein-coding gene
│   ├── caeele_cdna_protein_coding_ensembl104.fon1.json                      <-- All transcripts of protein-coding genes
│   ├── caeele_cdna_union2gene_all_ensembl104.fon1.json                      <-- Metagenes of all genes
│   ├── caeele_cdna_union2gene_protein_coding_ensembl104.fon1.json           <-- Metagenes of protein-coding genes
│   ├──                  <-- Chromosome lengths (TAB)
│   ├──       <-- Chromosome lengths with UCSC names (TAB)
│   └── ChromosomeMappings                            <-- Ensembl to/from UCSC name mapping
│       .
│       .
│       └── Zv9_UCSC2ensembl.txt
├── config
│   └── fontools.json
├── download
│   └──
│       └── pub
│           └── release-104
│               ├── fasta
│               │   └── caenorhabditis_elegans
│               │       ├── cdna
│               │       │   └── Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz
│               │       ├── dna
│               │       │   └── Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
│               │       └── ncrna
│               │           └── Caenorhabditis_elegans.WBcel235.ncrna.fa.gz
│               └── gff3
│                   └── caenorhabditis_elegans
│                       └── Caenorhabditis_elegans.WBcel235.104.gff3.gz
├── log
│   └── ensembl104.log
└── seqs
    ├── caeele_genome_all_ensembl_wbcel235.fa             <-- Sequence (FASTA)
    └── caeele_genome_all_ensembl_wbcel235.fa.fai         <-- FASTA index

Import annotations to FON

Annotations can be imported into FON from any GFF source. For example, to import gene annotations for Xenopus tropicalis from Xenbase:

  1. Download GFF3:
    cd /data/sai/downloads
    wget -m
  2. Download FASTA:
    cd /data/sai/downloads
    wget -m
    • Providing a FASTA sequence is optional. If no FASTA file is provided, the resulting FON file won't include any sequence.
    • Instead of the genomic sequence, the transcript sequences can be provided. Use the --cdna option to specify that the FASTA file contains cDNA instead of genomic sequence.
  3. To import the GFF3 annotation to FON, run (change paths of files to your setup):
    fon_import --annotation "/data/sai/downloads/" \
               --output '/data/sai/annots/xentro_cdna_${biotype}_xenbase100.fon${version}.json' \
               --fasta "/data/sai/downloads/" \
               --output_format fon \
               --biotype all
    • The --output value is a path written either as a plain string or as a template string (Python string.Template).
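To show how such a template string expands, here is a small sketch using Python's string.Template directly. The values substituted for biotype and version are assumptions inferred from the file names produced by import_ensembl (for example caeele_cdna_all_ensembl104.fon1.json); how fon_import fills them internally is not shown here.

```python
from string import Template

# Same template as in the fon_import command above.
output_template = Template(
    "/data/sai/annots/xentro_cdna_${biotype}_xenbase100.fon${version}.json"
)

# Assumed values: the selected biotype and the FON format version.
output_path = output_template.substitute(biotype="all", version=1)
print(output_path)
```

On the command line, the single quotes around the --output value keep the shell from expanding ${biotype} and ${version} before fon_import sees them.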

Transform FON

fon_transform transforms FON files by selecting the longest isoform of each gene or by merging isoforms. For example, to select the longest isoform of each gene:

fon_transform --fon "/data/sai/annots/caeele_cdna_protein_coding_ensembl104.fon1.json" \
              --method "longest" \
              --output '/data/sai/annots/caeele_cdna_longest_transcript_protein_coding_ensembl104.fon${version}.json'
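For readers who want to see the selection logic on FON data, here is a minimal Python sketch. It is not the fon_transform implementation: it assumes that "longest" means the largest summed exon length, that isoforms are grouped by the gene_stable_id key, and that ties keep the first isoform encountered. The features are made up.

```python
def transcript_length(feature):
    """Summed exon length, assuming 0-based end-exclusive [start, end] pairs."""
    return sum(end - start for start, end in feature["exons"])


def longest_isoforms(features, gene_key="gene_stable_id"):
    """Keep one feature per gene: the one with the largest summed exon length."""
    longest = {}
    for feature in features:
        gene = feature[gene_key]
        if gene not in longest or transcript_length(feature) > transcript_length(longest[gene]):
            longest[gene] = feature
    return list(longest.values())


# Made-up features using FON1 key names.
features = [
    {"gene_stable_id": "geneA", "transcript_stable_id": "txA1", "exons": [[0, 100], [200, 300]]},
    {"gene_stable_id": "geneA", "transcript_stable_id": "txA2", "exons": [[0, 100], [200, 400]]},
    {"gene_stable_id": "geneB", "transcript_stable_id": "txB1", "exons": [[1000, 1500]]},
]
selected = [f["transcript_stable_id"] for f in longest_isoforms(features)]
print(selected)
```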

FON and other formats

An easy way to merge the sequences and annotations of multiple species is to pass the files of each type as a comma-separated list to the merge_annot script. In this example, annotations and sequences from zebrafish and yeast are merged:

cd /data/sai
merge_annot --input_fasta "seqs/zebrafish.fa,seqs/yeast.fa" \
            --output_fasta "seqs/zebrafish_plus_yeast.fa" \
            --input_gff "annots/zebrafish.gff3,annots/yeast.gff3" \
            --output_gff "annots/zebrafish_plus_yeast.gff3" \
            --input_fon "annots/zebrafish.fon1.json,annots/yeast.fon1.json" \
            --output_fon "annots/zebrafish_plus_yeast.fon1.json"

The ensembl2ucsc script translates chromosome/contig names from Ensembl to UCSC names using ChromosomeMappings (here for C. elegans):

cd /data/sai/annots
ensembl2ucsc --input "" \
             --output "" \
             --path_mapping "ChromosomeMappings/WBcel235_ensembl2UCSC.txt"
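The renaming itself can be pictured with the sketch below. It is not the ensembl2ucsc code: it assumes the mapping file is tab-separated with the Ensembl name in the first column and the UCSC name in the second, and that names without a mapping are left unchanged. The lengths are placeholders, not real chromosome lengths.

```python
def parse_mapping(text):
    """Build {ensembl_name: ucsc_name} from tab-separated lines, skipping empty targets."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_first_column(lines, mapping):
    """Rename the first tab-separated column of each line; unmapped names are kept as-is."""
    renamed = []
    for line in lines:
        fields = line.split("\t")
        fields[0] = mapping.get(fields[0], fields[0])
        renamed.append("\t".join(fields))
    return renamed


# Illustrative mapping and chromosome length table (placeholder lengths).
mapping = parse_mapping("I\tchrI\nMtDNA\tchrM\n")
chrom_lengths = ["I\t1000", "MtDNA\t200", "unplaced\t50"]
renamed = rename_first_column(chrom_lengths, mapping)
print(renamed)
```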

FASTA tools

  • fasta_format: Format FASTA file

    • Sort entries of FASTA file (--sort)
    • Split sequences in lines of desired length (--seq_length)
  • fon_mask_fasta: Mask part(s) of sequence (FASTA) with Ns

    fon_mask_fasta --input_fon "selected_loci.fon1.json" \
                   --input_fasta "genome.fa" \
                   --output_fasta "genome_mask.fa" \
                   --extension "50" \
                   --exterior_extension "100"
    • The intervals listed in the FON file (--input_fon) are masked with Ns in the sequence (--input_fasta). By default, the intervals from the exons key of each feature in the FON file are used. To use a different key, use the --key option; for example, use --key "cds_exons" to mask the coding sequences.
    • Each interval can be extended by --extension and each feature can be extended by --exterior_extension. --exterior_extension value is by default equal to --extension value. For example, using --extension 2 on [[10,15], [20, 30]] will mask [[8,17], [18, 32]], while --exterior_extension 2 will mask [[8,15], [20, 32]].
    • Using --inverse, only interval coordinates from FON are kept intact, the rest of the sequence is replaced by Ns.
    • Strand of FON features is ignored.
  • fasta_seq_length: Create tabulated file with sequence(s) name and length from FASTA file.
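The extension rules of fon_mask_fasta described above can be sketched in a few lines of Python. This is an illustration, not the script's code: both function names are hypothetical, intervals are assumed to be 0-based and end-exclusive, and when both values are given the exterior extension is assumed to apply only to the outer boundaries of the feature.

```python
def extend_intervals(intervals, extension=0, exterior_extension=None):
    """Extend inner boundaries by `extension` and the feature's outer boundaries by `exterior_extension`."""
    if exterior_extension is None:
        # As in the description above, the exterior value defaults to the extension value.
        exterior_extension = extension
    last = len(intervals) - 1
    extended = []
    for i, (start, end) in enumerate(intervals):
        left = exterior_extension if i == 0 else extension
        right = exterior_extension if i == last else extension
        extended.append([start - left, end + right])
    return extended


def mask_sequence(seq, intervals):
    """Replace the given intervals of `seq` with Ns, clamping to the sequence bounds."""
    masked = list(seq)
    for start, end in intervals:
        start, end = max(start, 0), min(end, len(seq))
        if start < end:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)


intervals = [[10, 15], [20, 30]]
with_extension = extend_intervals(intervals, extension=2)
with_exterior_only = extend_intervals(intervals, exterior_extension=2)
masked = mask_sequence("ACGTACGTAC", [[2, 4], [8, 12]])
print(with_extension, with_exterior_only, masked)
```

With these assumptions, the two extend_intervals calls reproduce the [[8, 17], [18, 32]] and [[8, 15], [20, 32]] results given in the description of --extension and --exterior_extension.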


FONtools is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).

Copyright © 2015-2023 Charles E. Vejnar

