User guide

robotoD edited this page Feb 15, 2023 · 29 revisions

GenoVi: Genome Visualizer Software

Description

GenoVi generates circular genome representations for complete or draft bacterial and archaeal genomes. The GenoVi pipeline combines several Python scripts to automatically generate all the files needed by Circos, including customisable options for colour palettes, fonts, font format, background colour, and scaling options for complete genomes comprising more than one replicon. Optionally, GenoVi's built-in workflow integrates DeepNOG to annotate COG categories using alignment-free methods with user-defined thresholds. In the case of draft genomes, GenoVi displays the replicons as delivered by the initial GenBank file.

Table of contents

  1. Description
  2. Index
  3. Requirements
  4. Installation
  5. Usage
  6. Tutorials
    1. Draft genome basic tutorial
    2. Complete genome tutorial
    3. Other options
  7. Arguments
  8. Scripts
  9. Output
  10. Publication
  11. Acknowledgements
  12. Citation and License

Requirements

  • Circos 0.69-8
  • Python 3.7 or later
  • DeepNOG 1.2.3
  • NumPy 1.20.2
  • Pandas 1.2.4
  • Biopython 1.79
  • CairoSVG 2.5.2
  • Perl 5
  • List::MoreUtils (Perl library)

Installation

GenoVi dependencies can be installed in a Python environment running Python 3.7 or later.

conda create -n genovi python=3.7 circos

Activate the environment

conda activate genovi

GenoVi can then be installed using pip

pip install genovi

Usage

genovi [-h] [options ..] -i input_file -s status

Main arguments

  • -i, --input_file. GenBank input file path.
  • -o, --output_file. Output file name. Default: genovi.
  • -s, --status. “complete” or “draft”. Complete genomes are drawn as separate circles for each contig/replicon.
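When GenoVi is called from another script, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of GenoVi itself: the helper name build_genovi_command is hypothetical, and it only uses the -i, -s, -o and -cs flags documented in this guide.

```python
import shlex


def build_genovi_command(input_file, status, output_file=None, colour_scheme=None):
    """Assemble a genovi argument list (hypothetical helper, not a GenoVi API)."""
    if status not in ("complete", "draft"):
        raise ValueError("status must be 'complete' or 'draft'")
    command = ["genovi", "-i", input_file, "-s", status]
    if output_file is not None:
        command += ["-o", output_file]
    if colour_scheme is not None:
        command += ["-cs", colour_scheme]
    return command


command = build_genovi_command(
    "input_test/Acinetobacter_radioresistens_DD78.gbff",
    "complete",
    colour_scheme="strong",
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

To actually run the command, the list can be passed to subprocess.run(command, check=True), provided genovi is installed and on the PATH.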

Information:

  • -h, --help. Shows this help message and exits.
  • --version. Shows the currently installed version of genovi.

COGs:

  • -cu, --cogs_unclassified. Do not classify each coding sequence into Clusters of Orthologous Groups of proteins (COGs).
  • --cogs. Specifies which COG categories to include in the circular representation. For example, 'CEFGHIPQ' or 'met-' for all metabolism-related COGs.
  • -b, --deepnog_confidence_threshold. DeepNOG confidence threshold range [0,1] Default: 0. If provided, predictions below the threshold are discarded.

Format:

  • -a, --alignment. When --status complete is specified, this flag defines the alignment of each individual contig. Options: center, top, bottom, A (first on top), < (first to the left), U (two on top, the rest below). By default, this is defined by contig sizes.
  • --scale. When using --status complete, whether to use a different scale format to ensure visibility. Options: variable, linear, sqrt. Default: sqrt.
  • -k, --keep_temporary_files. Keep temporary files.
  • -r, --reuse_predictions. If available, reuse the DeepNOG prediction results from the previous run. Useful only if the --keep_temporary_files flag was enabled.
  • -w, --window. Window size (base pair) to assign a GC analysis. Default: 5000.
  • -v, --verbose. Verbose or in-console log messages activated.

Text:

  • -c, --captions_not_included. Do not include captions in the figure.
  • -cp, --captions_position. Captions position. Options: top, bottom, bottom-right.
  • -t, --title. Figure title.
  • --title_position. Title position. Options: center, top, bottom.
  • --italic_words. How many title words should be written in italic. Default: 0.
  • --size. Displays the genome size of each independent circular representation.
  • -te, --tracks_explain. To include an additional text on each track, explaining their meaning.

Colours:

  • -cs, --colour_scheme. Prebuilt color scheme to use for CDS, RNAs, and GC analysis. Options: strong, autumn, dawn, blossom, paradise, neutral, blue, purple, soil, grayscale, velvet, pastel, ocean, wood, beach, desert, ice, island, forest, toxic, fire, spring.
  • -bc, --background. Background colour, in R, G, B format. By default, it has no background.
  • -fc, --font_colour. Font color. Default: '0, 0, 0'.
  • -pc, --CDS_positive_colour. Colour for positive CDSs, in R, G, B format. Default is defined by colour scheme.
  • -nc, --CDS_negative_colour. Colour for negative CDSs, in R, G, B format. Default is defined by colour scheme.
  • -tc, --tRNA_colour. Colour for tRNAs, in R, G, B format. Default is defined by colour scheme.
  • -rc, --rRNA_colour. Colour for rRNAs, in R, G, B format. Default is defined by colour scheme.
  • -cc, --GC_content_colour. Colour for GC content, in R, G, B format. Default is defined by colour scheme.
  • -sc, --GC_skew_colour. Colour scheme for positive and negative GC skew. A pair of RGB colors. Default is defined by colour scheme.
  • -sl, --GC_skew_line_colour. Colour for GC skew line. Default is defined by colour scheme.
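The colour arguments above take values in R, G, B format, such as '0, 0, 0'. As an illustration only, the sketch below shows one way such a string could be validated before it is passed to GenoVi. The function parse_rgb is hypothetical and assumes each channel is an integer from 0 to 255; it does not handle named colours such as white, which also appear in the tutorials.

```python
def parse_rgb(text):
    """Parse an 'R, G, B' string into a tuple of three integers (hypothetical helper)."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 3:
        raise ValueError("expected three comma-separated values, got: %r" % text)
    values = tuple(int(part) for part in parts)  # int() raises ValueError on non-numbers
    if any(value < 0 or value > 255 for value in values):
        raise ValueError("each channel must be between 0 and 255 (assumed range)")
    return values


print(parse_rgb("23, 0, 115"))
```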

Tutorials

Draft genome basic tutorial

genovi -i input_test/Corynebacterium_alimapuense_VA37.gbk -s draft -cs paradise --cogs_unclassified -bc white

This command renders a basic genome representation in png and svg formats, using the paradise color scheme and a white background. All contigs from the Corynebacterium alimapuense VA37 genome are drawn in a single circle (default behavior). From outside to inside, the tracks show the contig lengths (contigs alternately depicted in black and white), the positive-strand and negative-strand coding sequences (CDSs), the GC content, and finally the GC skew.

Complete genome tutorial

genovi -i input_test/Acinetobacter_radioresistens_DD78.gbff -cs strong -s complete --size

This command renders an image separating each scaffold as an independent chromosome or plasmid showing its size in the middle. Additional image files are generated for each chromosome or plasmid, as 1.png and 1.svg, 2.png and 2.svg, and so on.

Multiple Genome tutorial

There is an additional option to render multiple genomes at once using a directory as an input. All genomes will be drawn either as draft or complete as stated by the user with the status argument -s. To differentiate each ideogram, the --title 'filename' can be used, for each filename to be displayed as the title of each circular representation. As an output, a folder with a circular representation, general statistics, and COGs information will be delivered, for each file. Additionally, a scaled joined figure of all the circular representations will be delivered, including output tables summarizing the general statistics, COG identification and COG frequency of every genome analyzed into one file.

genovi -i input_test/Brevibacterium_Genomes -cs blossom -s draft --title 'filename'

Other options tutorial

genovi -i input_test/Acinetobacter_radioresistens_DD78.gbff -cs paradise --scale linear --alignment '<' -s complete

By default, circles are scaled using a square root scale, so small plasmids are still visible. That means the area of each circle is proportional to the replicon's length. If a linear scale is needed, you may specify it explicitly with --scale linear. The arrangement of the circles can also be changed, either by placing them on a line or by using a more complex layout, as in this example, where the chromosome is on the left side and the plasmids are lined up on the right.
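To see why the square root scale keeps small plasmids visible, the sketch below compares relative circle radii under both scales. It is an illustration rather than GenoVi's own code: the function relative_radii and the replicon lengths are hypothetical, and it assumes that the linear scale makes the radius proportional to replicon length.

```python
import math


def relative_radii(lengths, scale="sqrt"):
    """Return each replicon's radius relative to the largest one (hypothetical helper)."""
    longest = max(lengths)
    if scale == "sqrt":
        # Radius grows with sqrt(length), so circle area is proportional to length.
        return [math.sqrt(length / longest) for length in lengths]
    if scale == "linear":
        # Assumption: radius itself is proportional to length.
        return [length / longest for length in lengths]
    raise ValueError("this sketch only covers the 'sqrt' and 'linear' scales")


# Hypothetical genome: a 4 Mbp chromosome and a 40 kbp plasmid.
lengths = [4_000_000, 40_000]
print(relative_radii(lengths, "sqrt"))
print(relative_radii(lengths, "linear"))
```

With these example lengths, the plasmid's radius is about one tenth of the chromosome's under the square root scale, but only about one hundredth under the linear scale.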

Arguments

Input file

-i, --input_file. This mandatory argument specifies the path of the annotated genome file to be drawn. Accepted files are GenBank file format (.gbk and .gbff) and they might be gzipped (.gz or .z). Also, if a directory is specified, all of the supported files inside of it will be drawn, summary tables will be generated and, in case --status draft is specified, an additional image will be created including all of the assemblies (useful for comparative analysis).
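Before passing a directory to GenoVi, you may want to check which files in it have a supported extension. The sketch below is a hypothetical pre-check, not GenoVi's own file detection. It assumes extensions are matched exactly as written above (lower case), and that compressed files carry .gz or .z after the GenBank extension.

```python
from pathlib import Path

GENBANK_EXTENSIONS = (".gbk", ".gbff")
COMPRESSED_EXTENSIONS = (".gz", ".z")


def is_supported(path):
    """Return True if the file name looks like a (possibly compressed) GenBank file."""
    name = Path(path).name
    for extension in COMPRESSED_EXTENSIONS:
        if name.endswith(extension):
            name = name[: -len(extension)]  # strip the compression suffix once
            break
    return name.endswith(GENBANK_EXTENSIONS)


def list_supported_files(directory):
    """List supported files directly inside a directory, sorted by name."""
    return sorted(p for p in Path(directory).iterdir() if p.is_file() and is_supported(p))


print(is_supported("input_test/P_xenovorans_LB400.gbff"))
```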

Status

-s, --status. Specify whether your genome is complete or draft. If draft is selected (default), then all contigs are drawn in the same circular genome representation. If complete is selected, then GenoVi draws a different circle for each contig, generating several figures: one for each contig and a concatenated one. Below, the Paraburkholderia xenovorans LB400 genome is shown first as a draft and then as a complete genome. The -c flag omits the caption.

genovi -i input_test/P_xenovorans_LB400.gbff -cs autumn -s draft -c

Paraburkholderia xenovorans LB400 as a draft genome drawing

genovi -i input_test/P_xenovorans_LB400.gbff -cs autumn -a A -s complete -c

Paraburkholderia xenovorans LB400 as a complete genome drawing. '-a A' will set the largest scaffold on top, and the rest below.

Help

-h, --help. Displays the help message.

Version

--version. Displays the current version of GenoVi.

Output file

-o, --output_file. Output file name. GenoVi generates the image in both vector (svg) and raster (png) formats. This argument specifies the name of the image to create and, if --status complete is defined, the name of the directory that holds the additional figures. The file extension should not be included as part of this argument.

Cogs unclassified

-cu, --cogs_unclassified. By default, DeepNOG predicts Clusters of Orthologous Groups of proteins (COGs) of each coding sequence (CDS). Use this flag to specify you do not want CDSs to be classified into COGs. This will allow you to save time and run the program even if you don’t have DeepNOG installed on your machine.

Deepnog threshold

-b, --deepnog_confidence_threshold. DeepNOG confidence threshold range [0, 1]. Predictions below the threshold are discarded. This is equivalent to DeepNOG's infer -c/--confidence_threshold argument.
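The effect of the threshold can be pictured as a simple filter over the predictions. The sketch below is only an illustration of that idea: the record layout (dictionaries with "cog" and "confidence" keys) and the example values are hypothetical and do not reflect DeepNOG's actual output format.

```python
def filter_predictions(predictions, threshold=0.0):
    """Keep predictions whose confidence is at or above the threshold (illustrative)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0, 1]")
    return [p for p in predictions if p["confidence"] >= threshold]


# Hypothetical predictions for three coding sequences.
predictions = [
    {"cog": "COG0001", "confidence": 0.95},
    {"cog": "COG0002", "confidence": 0.40},
    {"cog": "COG0003", "confidence": 0.70},
]
print(filter_predictions(predictions, threshold=0.5))
```

With the default threshold of 0, nothing is discarded.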

Cogs

--cogs. By default, the figure shows the COG classification of every CDS. This may make it difficult to see the important information. With this argument you can specify a set of COG categories to draw. The argument is a string in which each character represents a COG category, according to this table:

Character COG
D Cell cycle control, division, chromosome partitioning
M Cell wall/membrane/envelope biogenesis
N Cell motility
O Post-translational modification, protein turnover, chaperones
T Signal transduction mechanism
U Intracellular trafficking, secretion, and vesicular transport
V Defense mechanism
W Extracellular structures
Y Nuclear structure
Z Cytoskeleton
A RNA processing and modification
B Chromatin structure and dynamics
J Translation, ribosomal structure, and biogenesis
K Transcription
L Replication, recombination, and repair
X Mobilome: prophages, transposons
C Energy production and conversion
E Amino acid transport and metabolism
F Nucleotide transport and metabolism
G Carbohydrate transport and metabolism
H Coenzyme transport and metabolism
I Lipid transport and metabolism
P Inorganic ion transport and metabolism
Q Secondary metabolites biosynthesis, transport, and metabolism
R General function prediction only
S Function unknown

There are also a few shortcuts available: cel- for DMNOTUVWYZ (cellular processes and signaling), inf- for ABJKLX (information storage and processing), met- for CEFGHIPQ (metabolism) and finally poo- for poorly characterized sequences.
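The relationship between the shortcuts and the category letters can be sketched as a small lookup. This is not GenoVi's own parser: the function expand_cogs is hypothetical, the letters for poo- (R and S) are an assumption taken from the last two rows of the table, and the numeric form of --cogs is not handled here.

```python
COG_SHORTCUTS = {
    "cel-": "DMNOTUVWYZ",  # cellular processes and signaling
    "inf-": "ABJKLX",      # information storage and processing
    "met-": "CEFGHIPQ",    # metabolism
    "poo-": "RS",          # poorly characterized (assumed from the table)
}
VALID_CATEGORIES = set("".join(COG_SHORTCUTS.values()))


def expand_cogs(value):
    """Expand a shortcut to its category letters, or validate a plain letter string."""
    letters = COG_SHORTCUTS.get(value, value)
    unknown = sorted(set(letters) - VALID_CATEGORIES)
    if unknown:
        raise ValueError("unknown COG categories: " + ", ".join(unknown))
    return letters


print(expand_cogs("met-"))
print(expand_cogs("QX"))
```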

For instance, the following command draws the genome of Rhodococcus sp. H-CA8f, displaying only the metabolism-related COGs. This strain has a complete genome assembly, with one chromosome and one plasmid, so the -s complete flag should be used. If no color scheme is specified, genovi uses the 'strong' color palette, which is colorblind-safe.

genovi -i input_test/Rhodococcus_H-CA8f.gbff -s complete --cogs met-

Rhodococcus sp. H-CA8f as a complete genome drawing displaying only metabolism COG categories.

The --cogs flag can also be used to draw only specific COG categories, for example only the Q and X categories.

genovi -i input_test/Rhodococcus_H-CA8f.gbff -s complete --cogs QX

Rhodococcus sp. H-CA8f as a complete genome drawing displaying only X and Q COG categories.

Additionally, passing a number to the --cogs flag displays only that many top COG categories. For example, to display only the top 5 COG categories:

genovi -i input_test/Rhodococcus_H-CA8f.gbff -s complete --cogs 5

Rhodococcus sp. H-CA8f as a complete genome drawing displaying only top 5 COG categories.
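One way to picture the top-N selection is as a frequency ranking over the category assigned to each CDS. The sketch below assumes that "top" means most frequent, which this guide does not state explicitly; the function top_categories and the example assignments are hypothetical.

```python
from collections import Counter


def top_categories(assignments, n):
    """Return the n most frequent COG category letters (illustrative ranking)."""
    return [category for category, _count in Counter(assignments).most_common(n)]


# Hypothetical per-CDS category assignments: S x4, E x3, K x2, L x1, Q x1.
assignments = list("EEEKKLSSSSQ")
print(top_categories(assignments, 2))
```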

Alignment

-a, --alignment. When drawing a complete genome, the circular representation of each contig can be aligned in three ways. A: First contig above and the rest below aligned horizontally. <: The first contig left and the rest depicted on the right, aligned vertically. And U: First and second contig top and the rest below, aligned horizontally.

Scale

--scale. When drawing a complete genome, the relative size of each circular representation can be determined in three ways. A linear scale depicts circular representations proportional to the size of each contig. If variable is chosen, each circular representation is depicted in a variable scale, shown in a rectangle indicating the scale (X times). The default case is sqrt, a square root scale.

Keep temporary files

-k, --keep_temporary_files. Multiple files will be generated within the user’s project folder and by default will be deleted upon completion. Specifying this argument stops the deletion of the files. Generated files are:

  • circos.conf: Main CIRCOS configuration file.
  • conf/colors_fonts_patterns.conf: Imports several files from the Circos distribution in order to define colors, fonts, and fill patterns.
  • conf/highlight.conf: Defines ideogram highlights.
  • conf/housekeeping.conf: Defines system and debug parameters.
  • conf/image.conf: Imports generic Circos image configuration and background.
  • conf/ticks.conf: Defines tick mark formatting.
  • temp/_bands.kar: Contains band annotation positions of contigs and their color.
  • temp/_CDS_neg.txt: Defines band annotation positions of negative-sense-strand coding sequences.
  • temp/_CDS_pos.txt: Defines band annotation positions of positive-sense-strand coding sequences.
  • temp/gbk_converted.fna: nucleotide fasta file converted from the original gbk.
  • temp/GC_GC_content.wig: GC content percentage on each base-pair window.
  • temp/GC_GC_skew.wig: Measures strand asymmetry in the distribution of guanines and cytosines on each base-pair window.
  • temp/_rRNA_neg.txt: Defines band annotation positions of negative-sense-strand ribosomal RNA sequences.
  • temp/_rRNA_pos.txt: Defines band annotation positions of positive-sense-strand ribosomal RNA sequences.
  • temp/_tRNA_neg.txt: Defines band annotation positions of negative-sense-strand transfer RNA sequences.
  • temp/_tRNA_pos.txt: Defines band annotation positions of positive-sense-strand transfer RNA sequences.
  • temp/_prediction_deepnog.csv: Generated only if COG prediction is enabled (default behavior). Includes COG prediction and confidence for each coding sequence.
  • temp/_CDS_pos_X.txt: Generated if COG prediction is enabled, one file for each COG category, “X” the corresponding letter. Defines band annotation positions of positive-sense-strand coding sequences of “X” COG category.
  • temp/_CDS_neg_X.txt: Generated if COG prediction is enabled, one file for each COG category, “X” the corresponding letter. Defines band annotation positions of negative-sense-strand coding sequences of “X” COG category.

In the case of a complete genome, the temp directory files will be generated for each contig and identified with the prefix contig-X-, with X being 1, 2, 3, etc.

Window

-w, --window. Window size for GC content and GC skew plotting. This indicates how many base pairs are considered in each calculation.
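To make the role of the window concrete, the sketch below computes GC content (as a percentage) and GC skew for consecutive windows of a sequence. It is a simplified illustration, not the code in GC_analysis.py: it assumes non-overlapping windows, keeps the shorter trailing window, and uses the common definition of GC skew, (G - C) / (G + C). The function name gc_windows is hypothetical.

```python
def gc_windows(sequence, window=5000):
    """Return (start, GC content %, GC skew) for each non-overlapping window."""
    sequence = sequence.upper()
    results = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        content = 100.0 * (g + c) / len(chunk)
        # Skew is undefined when a window has no G or C; report 0.0 in that case.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, content, skew))
    return results


# Tiny example with a window of 4 bp instead of the 5000 bp default.
for row in gc_windows("GGCCAATTGGGC", window=4):
    print(row)
```

A smaller window gives a noisier but more detailed track, while a larger window smooths the curve.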

Verbose

-v, --verbose. Displays additional information while executing GenoVi.

Captions not included

-c, --captions_not_included. By default, generated images include a caption with COGs and other colors. Use this flag to stop the program from including this caption.

Captions position

-cp, --captions_position. Caption position. Options: left, right or auto.

Title

-t, --title. Figure title, for example, the name of the genome being represented.

Title position

--title_position. Title position in the figure. Options: top, bottom, or center of the image.

Italic words

--italic_words. If required, a number of words of the title could be written in italic. As the title is intended for organism specification, the default is 2. For example, if the title is “Paraburkholderia xenovorans LB400”, then “Paraburkholderia xenovorans” would be in italics, but “LB400” would not.
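The word-counting rule can be sketched as a simple split on whitespace. The function split_title below is hypothetical and only illustrates which part of the title would be italicised for a given number of words; it does not reproduce how GenoVi renders the text.

```python
def split_title(title, italic_words):
    """Split a title into its italic part and its regular part (illustrative)."""
    words = title.split()
    italic = " ".join(words[:italic_words])
    regular = " ".join(words[italic_words:])
    return italic, regular


print(split_title("Paraburkholderia xenovorans LB400", 2))
print(split_title("Streptomyces sp. H-KF8", 1))
```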

Size

--size. To display the genome size (in base pairs) of each circular representation.

As an example, let's use the genome of Streptomyces sp. H-KF8 to insert a genome title and size. This genome is in a permanent-draft state. We are going to insert the name of the strain as the title with the -t flag, place it on top of the figure using --title_position, and specify that only one word should be in italic with --italic_words. Additionally, the size of the genome will be displayed using --size. WARNING! The PNG version of the image may look odd because italic text transformation is not yet properly implemented. Please prefer the svg version instead.

genovi -i input_test/Streptomyces_H-KF8.gbff -s draft -t 'Streptomyces sp. H-KF8' --title_position top --italic_words 1 --size

Streptomyces sp. H-KF8 as a draft genome drawing displaying title and size.

Tracks

-te, --tracks_explain. Adds a space break in the circular representation, including captions for each track within the ideogram.

As an example, let's use the genome of the strain Alcaligenes aquatilis QD168. This genome is a complete assembly consisting of a single chromosome, so it makes no difference whether -s draft or -s complete is used. We will add the -te flag to add space in the ideogram, including captions for each feature. Additionally, we will use the "blossom" color palette.

genovi -i input_test/Alcaligenes_aquatilis_QD168.gbff -s complete -te -cs blossom

Alcaligenes aquatilis QD168 complete genome representation by GenoVi using the blossom colour palette

Archaea dataset example

GenoVi also works with archaeal genomes. Using the genome of the strain Sulfolobus acidocaldarius DG1, we will represent its genome with the autumn color scheme, and use the -te flag to add space in the ideogram explaining each track.

genovi -i input_test/Sulfolobus_acidocaldarius_DG1.gbff -s complete -bc white -cs autumn -te --size

Sulfolobus acidocaldarius DG1 complete genome representation by GenoVi using the autumn colour palette

Color scheme

-cs, --color_scheme. Prebuilt color scheme to use. Available color schemes include: strong, autumn, dawn, blossom, paradise, neutral, blue, purple, soil, grayscale, velvet, pastel, ocean, wood, beach, desert, ice, island, forest, toxic, fire, spring. The colour of specific parts of the image can be modified individually with --background, --CDS_positive_color, --CDS_negative_color, --tRNA_color, --rRNA_color, --GC_content_color, --GC_skew_color, and --GC_skew_line_color, using an R, G, B format. By default, genovi uses the 'strong' color palette. The strong, autumn, dawn, blossom, and paradise color palettes are all colorblind-safe (based on the work of Okabe & Ito and on ColorBrewer).

Scripts

GenoVi.py

Main script. Uses custom arguments and calls the rest of the modules to generate the genome representations. Inputs and outputs are explained in the Arguments section.

create_raw.py

Generates the .kar, CDS, rRNA, tRNA files for CIRCOS, and calls DeepNOG for predicting COGs.

Input:

  • input file: GenBank file.
  • output folder (-o/--output_folder): Path to the folder that will contain all raw files.
  • CDS (-cds/--cds): CDS band files for CIRCOS will be created.
  • tRNA (-trna/--trna): tRNA band files for CIRCOS will be created.
  • rRNA (-rrna/--rrna): rRNA band files for CIRCOS will be created.
  • COG categories (-gc/--get_categories): CDS COG categories will be predicted.
  • Divided categories (-d/--divided): COG categories will be split in one file per category.
  • Complete genome (-c/--complete_genome): Script will consider the input file to be a complete genome.

createConf.py

Writes the following CIRCOS configuration files: circos.conf, conf/highlight.conf, conf/colors_fonts_patterns.conf, conf/housekeeping.conf, conf/image.conf, and conf/ticks.conf.

Input:

  • Min GC content (--content_min/--min_GC_content): Minimum GC content. Default 0.
  • Max GC content (--content_max/--max_GC_content): Maximum GC content. Default 100.
  • Min GC skew (--skew_min/--min_GC_skew): Minimum GC skew. Default -1.
  • Max GC skew (--skew_max/--max_GC_skew): Maximum GC skew. Default 1.
  • GC content color (-cc/--GC_content_color): GC content color. Default: '23, 0, 115'.
  • GC skew color (-sc/--GC_skew_color)
  • CDS positive color (-pc/--CDS_positive_color): Positive CDSs color.
  • CDS negative color (-nc/--CDS_negative_color): Negative CDSs color.

GC_analysis.py

Calculates GC percentage and GC skew of the genomic sequence, and writes them down to files.

Input:

  • Input file (-i/--input_file): FASTA input file path.
  • Window size (-w/--window_size): Number of base pairs over which the GC percentage is calculated.
  • Shift increment (-s/--shift): Shift increment. By default, it is -1.
  • Output file (-o/--output_file): Output file path. The default matches input file path.
  • Ignore trailing (-ot/--omit_tail): Trailing sequence will be omitted. Default retains leftover sequence.

genbank2faa.py

Transforms GenBank flat files into protein fasta format files. The output has the same name as the original file.

genbank2fna.py

Transforms GenBank flat files into nucleotide fasta format files. The output has the same name as the original file.

mergeImages.py

Generates a .svg file with all scaled genome visualizations.

Input: List of dictionaries that includes filenames and each image's desired size, e.g. [{"fileName": "img1.svg", "size": 30000}, {"fileName": "img2.svg", "size": 10000}].

addText.py

Adds the title and contig size to the visualization, and allows the legend color to be modified.

colors.py

Parses color schemes.

Output

Resulting images are saved in a folder called [name] as [name].svg and [name].png (the name being specified with the output_file argument or, by default, circos). In the case of a complete genome, individual contig image files are stored in a [name] subdirectory as [name]-contig_[i].png, with i in [1, the number of circles].

Besides the images, if -k or --keep_temporary_files was specified, the files described under "Keep temporary files" in the Arguments section are also stored.

Four additional files are stored in the [name] folder: a histogram displaying COG categories, named [name]_COG_histogram.png; a file with the COG classification of each replicon, named [name]_COG_Classification.csv; a csv file named [name]_Gral_Stats.csv, displaying general information about each replicon, including size, GC content, and number of CDSs, tRNAs and rRNAs; and a heatmap displaying the distribution of COGs within each replicon, named [name]_COG_Classification.csv_percentage.

Histogram of COG categories

Heatmap displaying COG categories of each replicon of Paraburkholderia xenovorans LB400.

Publication

Cumsille, A., Durán, R.E., Rodríguez-Delherbe, A., Saona-Urmeneta, V., Cámara, B., Seeger, M., Araya, M., Jara, N., Buil-Aranda, C. (2023). GenoVi, an open-source automated circular genome visualizer for bacteria and archaea. (Accepted)

Acknowledgments

This work was supported by Proyecto USM Multidisciplinarios 2020, Fondecyt 1200756 grants. A.C. acknowledges ANID 21191625 PhD fellowship and Programa de Incentivos a la Iniciación Científica, UTFSM.

Citation and License

GenoVi is under a BY-NC-SA Creative Commons License. Please cite Cumsille et al., 202x (under revision). You may remix, tweak, and build upon this work for non-commercial purposes, as long as you credit this work and license your new creations under identical terms.