EDEN: Multiscale Expected Density of Nucleotide Encoding for Enhanced DNA Sequence Classification with Hybrid Deep Learning
Background: DNA sequences are fundamental carriers of genetic information. Accurate classification is essential for understanding gene regulation and disease mechanisms. Existing encoding methods often struggle to capture both local and long-range dependencies simultaneously.
Results: We introduce EDEN (Expected Density of Nucleotide Encoding), a unified multiscale encoding framework based on kernel density estimation. EDEN captures position-specific and context-dependent nucleotide patterns and integrates them into a hybrid deep learning architecture. Across sixteen benchmark datasets, EDEN achieves state-of-the-art performance with significantly fewer parameters than competing models.
Conclusions: EDEN provides an efficient, biologically informed representation for genomic sequence classification, demonstrating high practicality for large-scale applications.
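The abstract does not spell out the EDEN formula, so the following is only a rough sketch of what a kernel-density-based, multiscale nucleotide encoding can look like. It is not the authors' implementation (that lives in `utils.py`). The function names, the Gaussian kernel, and the bandwidth values are all assumptions chosen for illustration.

```python
import math

NUCLEOTIDES = "ACGT"


def gaussian_kernel(distance, bandwidth):
    """Unnormalised Gaussian weight for a positional offset (assumed kernel)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def density_encode(sequence, bandwidths=(1.0, 3.0)):
    """Hypothetical multiscale density encoding.

    Returns one row per position. For every bandwidth, the row holds the
    kernel-weighted share of A, C, G and T around that position, so each
    row has 4 * len(bandwidths) values.
    """
    sequence = sequence.upper()
    n = len(sequence)
    encoded = []
    for i in range(n):
        row = []
        for h in bandwidths:
            # Weight every position j by its distance to position i.
            weights = [gaussian_kernel(i - j, h) for j in range(n)]
            total = sum(weights)
            for base in NUCLEOTIDES:
                mass = sum(w for w, b in zip(weights, sequence) if b == base)
                row.append(mass / total)
        encoded.append(row)
    return encoded


features = density_encode("ACGTACGT")
print(len(features), len(features[0]))
```

A small bandwidth keeps the encoding close to one-hot (local context), while a larger one smooths nucleotide composition over a wider window, which is the intuition behind combining several scales.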
The repository is modularized for academic reproducibility and clear separation of concerns:
- `models.py`: Implementation of the `Hybrid_CNN` architecture (Dual-branch CNN).
- `utils.py`: Core engine for Multiscale EDN Encoding, data loading, and evaluation metrics.
- `predict.py`: Command-line interface (CLI) for performing inference on datasets.
- `datasets/`: Genomic benchmark data in CSV format. All datasets used in this study are publicly available as part of the Genome Understanding Evaluation (GUE) benchmark. The benchmark datasets can be accessed at: https://huggingface.co/datasets/leannmlindsey/GUE
- `models/`: Pretrained `.pth` weights for various genomic tasks.
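As a quick way to inspect a dataset file before running inference, the sketch below reads sequence/label pairs from CSV text. It assumes the files have `sequence` and `label` columns, which is not stated in this README, so check the header of your own files first. The `load_sequences` helper and the toy rows are invented for illustration and are not part of the repository.

```python
import csv
import io

# Toy rows invented for illustration; real files live under datasets/.
TOY_CSV = """sequence,label
ACGTACGTAC,1
ttgacaggct,0
"""


def load_sequences(handle):
    """Read (sequence, label) pairs from an open CSV handle.

    Assumes 'sequence' and 'label' columns (an assumption, see above).
    Sequences are upper-cased so downstream code sees a single alphabet.
    """
    reader = csv.DictReader(handle)
    return [(row["sequence"].strip().upper(), int(row["label"])) for row in reader]


pairs = load_sequences(io.StringIO(TOY_CSV))
lengths = sorted({len(seq) for seq, _ in pairs})
print(f"{len(pairs)} sequences, lengths: {lengths}")
```

To use it on a real file, replace the `io.StringIO(...)` argument with `open(path, newline="")`.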
Install the necessary Python packages using conda/pip:
pip install torch pandas numpy scikit-learn
The predict.py script allows you to run predictions for specific datasets directly from the terminal.
Basic Syntax:
python predict.py --dataset <dataset_folder_name> --limit <integer_value>
Examples:
- Core Promoter Detection (TATA):
python predict.py --dataset human_prom_core_tata --limit 70
- Transcription Factor Binding (TF0):
python predict.py --dataset human_tf0 --limit 100
Use the following `--dataset` and `--limit` values for each benchmark dataset:
| Dataset Category | --dataset (Parameter) | --limit (Parameter) |
|---|---|---|
| Core Promoter | human_prom_core_all | 70 |
| Core Promoter | human_prom_core_notata | 70 |
| Core Promoter | human_prom_core_tata | 70 |
| Promoter (300bp) | human_prom_300_all | 300 |
| Promoter (300bp) | human_prom_300_notata | 300 |
| Promoter (300bp) | human_prom_300_tata | 300 |
| TF Binding (Human) | human_tf0 | 100 |
| TF Binding (Human) | human_tf1 | 100 |
| TF Binding (Human) | human_tf2 | 100 |
| TF Binding (Human) | human_tf3 | 100 |
| TF Binding (Human) | human_tf4 | 100 |
| TF Binding (Mouse) | mouse_tf0 | 100 |
| TF Binding (Mouse) | mouse_tf1 | 100 |
| TF Binding (Mouse) | mouse_tf2 | 100 |
| TF Binding (Mouse) | mouse_tf3 | 100 |
| TF Binding (Mouse) | mouse_tf4 | 100 |
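To evaluate all sixteen datasets in one go, the dataset/limit pairs listed above can be turned into command lines programmatically. The sketch below only builds and prints the commands, using the `python predict.py --dataset ... --limit ...` syntax from this README. The `LIMITS` dictionary and `build_command` helper are hypothetical names, not part of the repository.

```python
import shlex

# Dataset name -> --limit value, copied from the parameter table.
LIMITS = {}
for variant in ("all", "notata", "tata"):
    LIMITS[f"human_prom_core_{variant}"] = 70
    LIMITS[f"human_prom_300_{variant}"] = 300
for species in ("human", "mouse"):
    for index in range(5):
        LIMITS[f"{species}_tf{index}"] = 100


def build_command(dataset, script="predict.py"):
    """Return the argv list for one predict.py run on `dataset`."""
    return ["python", script, "--dataset", dataset, "--limit", str(LIMITS[dataset])]


for name in sorted(LIMITS):
    print(shlex.join(build_command(name)))
```

Each argv list can be passed to `subprocess.run(cmd, check=True)` from the repository root if you want the script to execute the runs rather than just print them.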
- Saman Zabihi
- Email: szabihi@hotmail.com
- GitHub: https://github.com/zabihis/EDEN
If you use EDEN in your research, please cite our work:
Zabihi, S., Hashemi, S. & Mansoori, E. EDEN: multiscale expected density of nucleotide encoding for enhanced DNA sequence classification with hybrid deep learning. BMC Bioinformatics 27, 40 (2026). https://doi.org/10.1186/s12859-026-06367-6
Or use the BibTeX entry:
@article{zabihi2026eden,
title={EDEN: multiscale expected density of nucleotide encoding for enhanced DNA sequence classification with hybrid deep learning},
author={Zabihi, S. and Hashemi, S. and Mansoori, E.},
journal={BMC Bioinformatics},
volume={27},
number={40},
year={2026},
publisher={BioMed Central},
doi={10.1186/s12859-026-06367-6}
}