GenoMiss

GenoMiss is a computational tool for detecting potential protein-coding gene misannotations in genomes. It identifies cases where adjacent genes may actually be fragments of a single gene by fusing neighboring genes and comparing the fused protein's alignment against a reference protein database with the alignments of the individual parts.

Overview

Our tool was built to detect gene misannotations in species where intron sizes can exceed the cutoff value used by most aligners (500-600 kb), but it can also be used to detect misannotations in general. It aims to provide a list of potentially misannotated genes without requiring large computational resources. These genes can then be evaluated in the wet lab or by re-running computationally expensive annotation pipelines such as EGAPX.

Features

Graph-based genome representation: Handles overlapping genes and complex genomic architectures using strand-specific directed acyclic graphs
Comprehensive scoring system: Evaluates fused-gene candidates based on query coverage, overlap area, sequence identity, bit score improvement, organism hit count, and e-value significance
Parallel processing: Uses multiple CPU threads to generate results in anywhere from 5 minutes to an hour, depending on parameters
Flexible filtering: Customizable thresholds for percent identity, chromosome/contig filtering, and alignment sensitivity
Detailed statistics: Includes intron length calculations, organism count hits, confidence categorization, etc.

Requirements

Software Dependencies

Python 3.12 with the following packages:
- pandas - Data manipulation and analysis
- tqdm - Progress bar visualization
- regex - Enhanced regular expression support
- openpyxl - Excel file generation
DIAMOND - Fast protein sequence aligner
- Must be installed and available in your system PATH
- Download from: https://github.com/bbuchfink/diamond

Input Files

Proteome file (.faa) - FASTA file containing all protein sequences for the organism
Genome annotation (.gff) - GFF3 format annotation file (RefSeq recommended)
Reference database (.dmnd) - Pre-built DIAMOND database of reference proteins. It is recommended to use a database of proteins from organisms in the same class as the species of interest (e.g., Insecta for Drosophila and Mammalia for mouse). Be sure to build the database with taxonomic information, or the program will not be able to filter self-hits properly.
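
A database with taxonomic information can be built with DIAMOND's `makedb` command. The sketch below only assembles that command. The file names are placeholders, and the taxonomy inputs (an NCBI accession-to-taxid mapping plus `nodes.dmp` and `names.dmp`) are assumed to have been downloaded separately.

```python
import shlex

def build_makedb_command(fasta, db, taxonmap, taxonnodes, taxonnames):
    """Assemble a `diamond makedb` call that embeds taxonomy in the database."""
    return [
        "diamond", "makedb",
        "--in", fasta,               # reference proteins in FASTA format
        "--db", db,                  # output database name
        "--taxonmap", taxonmap,      # protein accession -> taxon ID mapping
        "--taxonnodes", taxonnodes,  # NCBI taxonomy nodes.dmp
        "--taxonnames", taxonnames,  # NCBI taxonomy names.dmp
    ]

# Placeholder file names; substitute your own paths.
cmd = build_makedb_command(
    "insecta_refseq_protein.faa",
    "insecta_refseq_protein_db",
    "prot.accession2taxid.FULL.gz",
    "nodes.dmp",
    "names.dmp",
)
print(shlex.join(cmd))
# To actually build the database: subprocess.run(cmd, check=True)
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` directly.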

Installation

Clone the repository:

git clone https://github.com/txdylan27/GenoMiss.git
cd GenoMiss

Install Python dependencies:

pip install pandas tqdm regex openpyxl

Install DIAMOND (if not already installed):

# On Ubuntu/Debian
sudo apt-get install diamond-aligner

# Or download from GitHub releases
wget https://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
sudo mv diamond /usr/local/bin/

Usage

Basic Usage

python GenoMiss.py \
  -p <proteome.faa> \
  -a <annotation.gff> \
  -db <reference.dmnd> \
  -o <output_directory>

Example

python GenoMiss.py \
  -p apis_mellifera_protein.faa \
  -a apis_mellifera_annotation.gff \
  -db insecta_refseq_protein_db.dmnd \
  -o honeybee_results \
  -t 16 \
  -ds very-sensitive \
  -lo

Command-Line Options

Required Arguments

Option	Description
`-p`, `--proteome`	Path to the `.faa` proteome file containing all protein sequences
`-a`, `--organism_annotation`	Path to the `.gff` genome annotation file (RefSeq format recommended)
`-db`, `--database`	Path to the `.dmnd` DIAMOND reference protein database
`-o`, `--output`	Output directory path for results and intermediate files

Optional Arguments

Option	Default	Description
`-t`, `--num_threads`	Half of available CPU cores	Number of threads for DIAMOND alignment (positive integer)
`-i`, `--identity_cutoff`	50	Percent identity cutoff for filtering fused gene alignments (0.0-100.0)
`-xf`, `--xfilter`	5	Minimum number of genes required per chromosome/contig for analysis (filters out small contigs)
`-ds`, `--diamond_sensitivity`	None	DIAMOND sensitivity mode: `fast`, `mid-sensitive`, `sensitive`, `more-sensitive`, `very-sensitive`, or `ultra-sensitive`
`-lo`, `--longest_only`	False	By default, all isoforms are fused to neighboring proteins. When this flag is set, only the longest isoform of each gene is considered, which considerably shortens the runtime.
`-n`, `--organism_name`	None	Used to manually provide the name of the organism if it's not detectable in the .GFF file
`-d`, `--taxon-id`	None	Taxon ID of the organism. Used to filter self-hits during DIAMOND alignment

Output Files

The tool generates multiple output files in the specified output directory:

Primary Output Files

File	Format	Description
`misannotation_results.xlsx`	Excel	Multi-sheet workbook with formatted results, including Top Hits, All Fused Hits, Control Hits, and Scoring Methodology
`fused_hits.csv`	CSV	All fused gene hits sorted by composite score with comprehensive metadata
`high_confidence_hits.csv`	CSV	Subset of fused hits with composite score ≥ 70
`control_hits.csv`	CSV	Alignment results for individual unfused gene parts
`fused_hits.tsv`	TSV	Tab-separated version of all fused hits
`control_hits.tsv`	TSV	Tab-separated version of control hits
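
The CSV outputs can be post-processed with a few lines of Python. The sketch below selects hits at or above a score threshold. The column names `fused_id` and `composite_score` are assumptions made for illustration; check the header of your own `fused_hits.csv` for the actual names.

```python
import csv
import io

def top_hits(csv_file, min_score=70.0, score_column="composite_score"):
    """Return rows scoring at least min_score, highest score first."""
    rows = [
        row for row in csv.DictReader(csv_file)
        if float(row[score_column]) >= min_score
    ]
    return sorted(rows, key=lambda row: float(row[score_column]), reverse=True)

# Tiny in-memory stand-in for fused_hits.csv (hypothetical columns and values).
example = io.StringIO(
    "fused_id,composite_score\n"
    "geneA|geneB,82.5\n"
    "geneC|geneD,41.0\n"
    "geneE|geneF,90.1\n"
)
selected = top_hits(example)
print([row["fused_id"] for row in selected])
```

For a real run, open the file instead of using the in-memory example, for instance `open("honeybee_results/fused_hits.csv", newline="")`.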

Intermediate Files

File	Description
`args.log`	Command-line arguments used for the analysis
`genome_wide_positive_hits.faa`	FASTA file containing gene parts that were components of positive fused hits
`<chrom>/<strand>/`	Per-chromosome/strand intermediate files and DIAMOND results

Scoring Methodology

The tool uses a composite scoring system (0-100 scale) that combines five components:

Component	Weight	Description
Query Coverage	50%	Percentage of the fused protein sequence covered by the alignment
Bit Score Improvement	15%	Relative improvement of fused protein bit score compared to individual gene part bit scores
Organism Count	10%	Number of different organisms with hits to the fused gene
Percent Identity	20%	Percentage of identical matches between the fused protein and the aligned subject
E-value	5%	Statistical significance of the alignment (lower e-values increase confidence)
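
As a rough illustration of how the weights combine, the sketch below computes a weighted sum. It assumes every component has already been normalized to a 0-100 scale; how `scoring.py` performs that normalization is not described here, and the dictionary keys are hypothetical names.

```python
# Weights from the table above; the key names are illustrative only.
WEIGHTS = {
    "query_coverage": 0.50,
    "percent_identity": 0.20,
    "bit_score_improvement": 0.15,
    "organism_count": 0.10,
    "evalue": 0.05,
}

def composite_score(components):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

example = {
    "query_coverage": 90.0,
    "percent_identity": 60.0,
    "bit_score_improvement": 40.0,
    "organism_count": 50.0,
    "evalue": 100.0,
}
# 45 + 12 + 6 + 5 + 5 = 73
print(round(composite_score(example), 2))
```

Because the weights sum to 1, a candidate that scores 100 on every component receives a composite score of 100.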

Score Interpretation

High Confidence (≥70): Strong evidence of misannotation; multiple species show "fused" versions of this gene
Medium Confidence (40-69): Moderate evidence; the fused version may be present in only a small number of species, or the alignment quality may be lacking
Low Confidence (<40): Weak evidence; may represent false positives or edge cases
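
The thresholds above translate into a simple lookup. This is a minimal sketch, not the tool's own code; it assumes fractional scores between 69 and 70 fall into the Medium category.

```python
def confidence_category(score):
    """Map a composite score (0-100) to the confidence label described above."""
    if not 0 <= score <= 100:
        raise ValueError("composite score must be between 0 and 100")
    if score >= 70:
        return "High"
    if score >= 40:
        return "Medium"
    return "Low"

for value in (85, 55, 12):
    print(value, confidence_category(value))
```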

How It Works

Genome Map Construction
- Parse GFF annotation to identify protein-coding genes
- Build strand-specific directed graphs representing gene adjacency
- Handle overlapping genes using branching paths
Fusion Generation
- Traverse genome graph for each chromosome and strand
- Create all possible isoform combinations between adjacent genes
- Store metadata (gene IDs, product IDs, sequence lengths)
DIAMOND Alignment
- Align fused proteins against reference database
- Align individual gene parts (control group)
- Filter hits based on overlap threshold (10 amino acids on each side of fusion junction)
Score Calculation
- Calculate composite scores for all fused hits
- Compute intron lengths from GFF coordinates
- Count unique organism hits per fused gene
Output Generation
- Generate summary statistics
- Main overview file is "misannotation_results.xlsx"
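
The fusion and junction-filter steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `GenomeMap.py`: it handles only non-overlapping plus-strand genes with one isoform each (so the graph reduces to a single path), and it treats the genomic gap between two genes as the putative intron length. All class and function names are hypothetical.

```python
from dataclasses import dataclass

JUNCTION_OVERLAP = 10  # amino acids required on each side of the fusion junction

@dataclass
class Gene:
    gene_id: str
    start: int    # 1-based inclusive GFF coordinates
    end: int
    protein: str  # amino-acid sequence of one isoform

def adjacent_pairs(genes):
    """Pair each gene with its downstream neighbor (plus strand, no overlaps)."""
    ordered = sorted(genes, key=lambda g: g.start)
    return list(zip(ordered, ordered[1:]))

def fuse(upstream, downstream):
    """Concatenate two proteins; return the fused sequence and junction position."""
    return upstream.protein + downstream.protein, len(upstream.protein)

def gap_length(upstream, downstream):
    """Genomic distance between two genes, i.e. the putative intron if fused."""
    return downstream.start - upstream.end - 1

def spans_junction(qstart, qend, junction, overlap=JUNCTION_OVERLAP):
    """True if a 1-based alignment [qstart, qend] covers enough of both parts."""
    left = junction - qstart + 1  # aligned residues in the upstream part
    right = qend - junction       # aligned residues in the downstream part
    return left >= overlap and right >= overlap

genes = [
    Gene("g2", 5000, 5600, "K" * 40),
    Gene("g1", 1000, 1900, "M" + "A" * 29),
]
(up, down), = adjacent_pairs(genes)
fused, junction = fuse(up, down)
print(up.gene_id, down.gene_id, len(fused), junction, gap_length(up, down))
print(spans_junction(5, 65, junction))  # covers 26 and 35 residues
print(spans_junction(1, 35, junction))  # only 5 residues past the junction
```

A hit that aligns only to one part, like the last example, is the kind of alignment the overlap threshold is meant to discard, since it gives no evidence that the two genes belong together.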

Project Structure

GenoMiss/
├── GenoMiss.py                 # Main program and command-line interface
├── GenomeMap.py                # Genome graph data structures
├── scoring.py                  # Composite scoring system
├── output_formatter.py         # Multi-format report generation
└── README.md                   # This file

Performance Considerations

Chromosome filtering (-xf): Increase threshold to skip small contigs and reduce runtime
Thread count (-t): More threads accelerate DIAMOND but increase memory usage
Sensitivity mode (-ds): Higher sensitivity increases alignment accuracy but significantly increases runtime
- fast: ~10x faster than default
- very-sensitive: ~10x slower than default

Citation

If you use this tool in your research, please cite:

[Citation information to be added]

Contributing

This is an unpublished project in active development; we will not accept modifications or contributions until after publication.

Authors

Dylan Ulloa
David Bellini

Acknowledgments

DIAMOND alignment tool by Benjamin Buchfink

Contact

For questions, issues, or feature requests, please open an issue on GitHub or contact the authors.
