GenoMiss is a computational tool for detecting potential protein-coding gene misannotations in genomes. GenoMiss identifies cases where adjacent genes may actually represent fragments of a single misannotated gene by fusing neighboring genes and comparing their alignment scores against a reference protein database.
Our tool was built to detect gene misannotations in species where intron sizes can exceed the cutoff value used by most aligners (500-600 kb), but it can also be used to detect misannotations in general. It aims to provide a list of potentially misannotated genes without requiring large computational resources. These genes can then be evaluated in the wet lab or by re-running computationally expensive annotation pipelines such as EGAPx.
- Graph-based genome representation: Handles overlapping genes and complex genomic architectures using strand-specific directed acyclic graphs
- Comprehensive scoring system: Evaluates fused gene candidates based on query coverage, overlap area, sequence identity, bit score improvement, organism hit count, and e-value significance
- Parallel processing: Uses multiple CPU threads to generate results in anywhere from 5 minutes to an hour, depending on parameters
- Flexible filtering: Customizable thresholds for percent identity, chromosome/contig filtering, and alignment sensitivity
- Detailed statistics: Includes intron length calculations, organism count hits, confidence categorization, etc.
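
As an illustration of the graph-based representation, here is a minimal sketch of one way to build strand-specific gene adjacency graphs in which overlapping genes produce branching paths. It is not taken from `GenomeMap.py`; the `Gene` tuple, the `build_adjacency` function, and the successor rule (a gene links to every downstream gene that no other gene fits entirely in front of) are assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (1-based inclusive coordinates)."""
    gene_id: str
    chrom: str
    strand: str
    start: int
    end: int


def build_adjacency(genes):
    """Return {(chrom, strand): {gene_id: [successor gene_ids]}}.

    Edges only point to genes that start after the current gene ends,
    so every per-strand graph is acyclic by construction.
    """
    by_key = defaultdict(list)
    for gene in genes:
        by_key[(gene.chrom, gene.strand)].append(gene)

    graphs = {}
    for key, group in by_key.items():
        group.sort(key=lambda g: (g.start, g.end))
        edges = {g.gene_id: [] for g in group}
        for gene in group:
            downstream = [c for c in group if c.start > gene.end]
            if not downstream:
                continue
            # A successor must start before the earliest-ending downstream
            # gene finishes; otherwise a whole gene would sit in between.
            limit = min(c.end for c in downstream)
            edges[gene.gene_id] = [c.gene_id for c in downstream if c.start <= limit]
        graphs[key] = edges
    return graphs


example = [
    Gene("A", "chr1", "+", 1, 100),
    Gene("B", "chr1", "+", 150, 300),
    Gene("C", "chr1", "+", 250, 400),  # overlaps B, so A branches to B and C
    Gene("D", "chr1", "+", 500, 600),
    Gene("E", "chr1", "-", 10, 50),    # opposite strand gets its own graph
]
graphs = build_adjacency(example)
```

The quadratic scan keeps the sketch short; a real implementation over a whole genome would likely use a sorted sweep instead.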
- Python 3.12 with the following packages:
  - `pandas` - Data manipulation and analysis
  - `tqdm` - Progress bar visualization
  - `regex` - Enhanced regular expression support
  - `openpyxl` - Excel file generation
- DIAMOND - Fast protein sequence aligner
  - Must be installed and available in your system PATH
  - Download from: https://github.com/bbuchfink/diamond
- Proteome file (`.faa`) - FASTA file containing all protein sequences for the organism
- Genome annotation (`.gff`) - GFF3 format annotation file (RefSeq recommended)
- Reference database (`.dmnd`) - Pre-built DIAMOND database of reference proteins. We recommend a protein set containing organisms from the same class as the species of interest (e.g., Insecta for Drosophila and Mammalia for mouse). Be sure to build the database with taxonomic information, or the program will not be able to filter self-hits properly.
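
For building a reference database with taxonomic information, the sketch below assembles a `diamond makedb` command using DIAMOND's taxonomy options (`--taxonmap`, `--taxonnodes`, `--taxonnames`). The helper name `makedb_command` and all file names are placeholders for this example: the accession-to-taxid map and the `nodes.dmp`/`names.dmp` files are assumed to have been downloaded from the NCBI taxonomy resources beforehand.

```python
import shlex


def makedb_command(fasta, db_prefix, taxonmap, taxonnodes, taxonnames):
    """Build the argument list for `diamond makedb` with taxonomy data."""
    return [
        "diamond", "makedb",
        "--in", fasta,               # reference protein FASTA
        "--db", db_prefix,           # output database (DIAMOND adds .dmnd)
        "--taxonmap", taxonmap,      # protein accession -> taxon ID mapping
        "--taxonnodes", taxonnodes,  # NCBI taxonomy nodes.dmp
        "--taxonnames", taxonnames,  # NCBI taxonomy names.dmp
    ]


# Placeholder file names; substitute your own paths.
cmd = makedb_command(
    "insecta_refseq_protein.faa",
    "insecta_refseq_protein_db",
    "prot.accession2taxid.FULL.gz",
    "nodes.dmp",
    "names.dmp",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```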
- Clone the repository:
git clone https://github.com/txdylan27/GenoMiss.git
cd GenoMiss
- Install Python dependencies:
pip install pandas tqdm regex openpyxl
- Install DIAMOND (if not already installed):
# On Ubuntu/Debian
sudo apt-get install diamond-aligner
# Or download from GitHub releases
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
sudo mv diamond /usr/local/bin/

python GenoMiss.py \
-p <proteome.faa> \
-a <annotation.gff> \
-db <reference.dmnd> \
-o <output_directory>

python GenoMiss.py \
-p apis_mellifera_protein.faa \
-a apis_mellifera_annotation.gff \
-db insecta_refseq_protein_db.dmnd \
-o honeybee_results \
-t 16 \
-ds very-sensitive \
-lo

| Option | Description |
|---|---|
| `-p`, `--proteome` | Path to the .faa proteome file containing all protein sequences |
| `-a`, `--organism_annotation` | Path to the .gff genome annotation file (RefSeq format recommended) |
| `-db`, `--database` | Path to the .dmnd DIAMOND reference protein database |
| `-o`, `--output` | Output directory path for results and intermediate files |

| Option | Default | Description |
|---|---|---|
| `-t`, `--num_threads` | Half of available CPU cores | Number of threads for DIAMOND alignment (positive integer) |
| `-i`, `--identity_cutoff` | 50 | Percent identity cutoff for filtering fused gene alignments (0.0-100.0) |
| `-xf`, `--xfilter` | 5 | Minimum number of genes required per chromosome/contig for analysis (filters out small contigs) |
| `-ds`, `--diamond_sensitivity` | None | DIAMOND sensitivity mode: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, or ultra-sensitive |
| `-lo`, `--longest_only` | False | Selects whether all isoforms will be fused to neighboring proteins. If set, only the longest isoform is considered, which considerably shortens the running time of the algorithm. |
| `-n`, `--organism_name` | None | Used to manually provide the name of the organism if it is not detectable in the .gff file |
| `-d`, `--taxon-id` | None | Taxon ID of the organism. Used to filter self-hits during DIAMOND alignment |

The tool generates multiple output files in the specified output directory:
| File | Format | Description |
|---|---|---|
| `misannotation_results.xlsx` | Excel | Multi-sheet workbook with formatted results, including Top Hits, All Fused Hits, Control Hits, and Scoring Methodology |
| `fused_hits.csv` | CSV | All fused gene hits sorted by composite score with comprehensive metadata |
| `high_confidence_hits.csv` | CSV | Subset of fused hits with composite score ≥ 70 |
| `control_hits.csv` | CSV | Alignment results for individual unfused gene parts |
| `fused_hits.tsv` | TSV | Tab-separated version of all fused hits |
| `control_hits.tsv` | TSV | Tab-separated version of control hits |

| File | Description |
|---|---|
| `args.log` | Command-line arguments used for the analysis |
| `genome_wide_positive_hits.faa` | FASTA file containing gene parts that were components of positive fused hits |
| `<chrom>/<strand>/` | Per-chromosome/strand intermediate files and DIAMOND results |

The tool uses a composite scoring system (0-100 scale) that combines five components:
| Component | Weight | Description |
|---|---|---|
| Query Coverage | 50% | Percentage of the fused protein sequence covered by the alignment |
| Bit Score Improvement | 15% | Relative improvement of fused protein bit score compared to individual gene part bit scores |
| Organism Count | 10% | Number of different organisms with hits to the fused gene |
| Percent Identity | 20% | Percentage of identical matches between the fused protein and the aligned subject |
| E-value | 5% | Statistical significance of the alignment (lower e-values increase confidence) |
- High Confidence (≥70): Strong evidence of misannotation; multiple species show "fused" versions of this gene
- Medium Confidence (40-69): Moderate evidence; the fused form may be present in only a small number of species, or alignment quality may be lacking
- Low Confidence (<40): Weak evidence; may represent false positives or edge cases
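
To make the weighting concrete, here is a minimal sketch of how a composite score and its confidence category could be computed. The weights and category thresholds are the ones listed above; how each component is normalized to a 0-100 scale is not specified in this README, so the normalizations below (relative bit score gain over the best gene part, an organism cap of 10, a `-log10` e-value cap of 100) and all function names are assumptions, not the logic in `scoring.py`.

```python
import math

# Weights from the scoring table (they sum to 1.0).
WEIGHTS = {
    "query_coverage": 0.50,
    "percent_identity": 0.20,
    "bitscore_improvement": 0.15,
    "organism_count": 0.10,
    "evalue": 0.05,
}


def _clamp(value):
    """Restrict a component to the 0-100 range."""
    return max(0.0, min(100.0, value))


def composite_score(query_coverage, percent_identity, fused_bitscore,
                    best_part_bitscore, organism_count, evalue,
                    organism_cap=10, evalue_cap=100.0):
    """Weighted 0-100 score; all normalizations here are assumed."""
    if best_part_bitscore > 0:
        # Relative gain of the fused protein over its best individual part.
        improvement = (fused_bitscore - best_part_bitscore) / best_part_bitscore * 100
    else:
        improvement = 100.0 if fused_bitscore > 0 else 0.0
    organisms = organism_count / organism_cap * 100
    # Lower e-values score higher; an e-value of 0 gets the maximum.
    significance = 100.0 if evalue <= 0 else -math.log10(evalue) / evalue_cap * 100

    components = {
        "query_coverage": query_coverage,
        "percent_identity": percent_identity,
        "bitscore_improvement": improvement,
        "organism_count": organisms,
        "evalue": significance,
    }
    return sum(WEIGHTS[name] * _clamp(value) for name, value in components.items())


def confidence(score):
    """Map a composite score to the confidence categories listed above."""
    if score >= 70:
        return "High"
    if score >= 40:
        return "Medium"
    return "Low"
```

Because query coverage carries half of the weight, a fused candidate with poor coverage cannot reach the high-confidence tier under this scheme, whatever its other components are.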
- Genome Map Construction
  - Parse GFF annotation to identify protein-coding genes
  - Build strand-specific directed graphs representing gene adjacency
  - Handle overlapping genes using branching paths
- Fusion Generation
  - Traverse genome graph for each chromosome and strand
  - Create all possible isoform combinations between adjacent genes
  - Store metadata (gene IDs, product IDs, sequence lengths)
- DIAMOND Alignment
  - Align fused proteins against reference database
  - Align individual gene parts (control group)
  - Filter hits based on overlap threshold (10 amino acids on each side of fusion junction)
- Score Calculation
  - Calculate composite scores for all fused hits
  - Compute intron lengths from GFF coordinates
  - Count unique organism hits per fused gene
- Output Generation
  - Generate summary statistics
  - Main overview file is "misannotation_results.xlsx"
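
The fusion and junction-filter steps above can be sketched as follows. This is an illustrative example rather than the code in `GenoMiss.py`: the function names are hypothetical, and it assumes hits carry 1-based inclusive query coordinates (`qstart`, `qend`), as in DIAMOND's tabular output.

```python
def fuse(upstream_seq, downstream_seq):
    """Concatenate two neighboring proteins; return (fused, junction).

    `junction` is the length of the upstream part, i.e. the 1-based
    position of the last upstream residue in the fused sequence.
    """
    return upstream_seq + downstream_seq, len(upstream_seq)


def spans_junction(qstart, qend, junction, min_flank=10):
    """True if the alignment covers >= min_flank residues on both sides."""
    upstream_covered = junction - qstart + 1  # aligned residues before the junction
    downstream_covered = qend - junction      # aligned residues after the junction
    return upstream_covered >= min_flank and downstream_covered >= min_flank


fused, junction = fuse("M" * 50, "K" * 40)

# Hypothetical hits on the fused query (coordinates only).
hits = [
    {"qstart": 41, "qend": 60},  # 10 residues on each side: kept
    {"qstart": 42, "qend": 60},  # only 9 upstream residues: dropped
    {"qstart": 1, "qend": 50},   # never crosses the junction: dropped
]
kept = [h for h in hits if spans_junction(h["qstart"], h["qend"], junction)]
```

Requiring coverage on both sides of the junction is what separates a hit that supports the fusion from one that simply matches a single gene part.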
GenoMiss/
├── GenoMiss.py # Main program and CLI interface
├── GenomeMap.py # Genome graph data structures
├── scoring.py # Composite scoring system
├── output_formatter.py # Multi-format report generation
└── README.md # This file
- Chromosome filtering (`-xf`): Increase threshold to skip small contigs and reduce runtime
- Thread count (`-t`): More threads accelerate DIAMOND but increase memory usage
- Sensitivity mode (`-ds`): Higher sensitivity increases alignment accuracy but significantly increases runtime
  - `fast`: ~10x faster than default
  - `very-sensitive`: ~10x slower than default
If you use this tool in your research, please cite:
[Citation information to be added]
This is an unpublished project in active development; we will not accept modifications or contributions until after publication.
- Dylan Ulloa
- David Bellini
- DIAMOND alignment tool by Benjamin Buchfink
For questions, issues, or feature requests, please open an issue on GitHub or contact the authors.