DNALLM is a comprehensive, open-source toolkit designed for fine-tuning and inference with DNA Language Models. It provides a unified interface for working with various DNA sequence models, supporting tasks ranging from basic sequence classification to advanced in-silico mutagenesis analysis. With built-in Model Context Protocol (MCP) support, DNALLM enables seamless communication with traditional large language models, allowing for enhanced integration and interoperability in AI-powered DNA analysis workflows.
- Model Management: Load and switch between 150+ pre-trained DNA language models from Hugging Face and ModelScope
- Multi-Task Support: Binary/multi-class classification, regression, NER, MLM, and generation tasks
- Benchmarking: Multi-model performance comparison and evaluation metrics
- Fine-tuning: Comprehensive training pipeline with configurable parameters
- Interactive Interfaces: Jupyter notebooks and Marimo-based interactive demos
- MCP Support: Model Context Protocol for server/client deployment with real-time streaming
- Advanced Analysis: In-silico mutagenesis, saturation mutation analysis, and mutation effect visualization
- Comprehensive Testing: 200+ test cases covering all major functionality
DNALLM supports a wide range of DNA language models including:
- DNABERT Series: Plant DNABERT, DNABERT, DNABERT-2, DNABERT-S
- Caduceus Series: Caduceus-Ph, Caduceus-PS, PlantCaduceus
- Specialized Models: AgroNT, GENA-LM, GPN, GROVER, MutBERT, ProkBERT
- EVO Series: EVO-1, EVO-2
- Plant Models: Plant DNAGemma, Plant DNAGPT, Plant DNAMamba
- Other Models: GENERator, GenomeOcean, HyenaDNA, Jamba-DNA, Mistral-DNA
- Hugging Face Hub: Primary model repository
- ModelScope: Alternative model source with additional models
- Custom Models: Support for locally trained or custom architectures
- Python 3.10 or higher (Python 3.12 recommended)
- Git
- CUDA-compatible GPU (optional, for GPU acceleration)
- Environment Manager: Choose one of the following:
  - Python venv (built-in)
  - Conda/Miniconda (recommended for scientific computing)
DNALLM uses uv for dependency management and packaging.
uv is a fast Python package manager, reported to be 10-100x faster than traditional tools like pip.
# Clone repository
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM
# Create virtual environment
python -m venv .venv
# Activate virtual environment
source .venv/bin/activate # Linux/MacOS
# or
.venv\Scripts\activate # Windows
# Upgrade pip (recommended)
pip install --upgrade pip
# Install uv in virtual environment
pip install uv
# Install DNALLM with base dependencies
uv pip install -e '.[base]'
# For MCP server support (optional)
uv pip install -e '.[mcp]'
# Verify installation
python -c "import dnallm; print('DNALLM installed successfully!')"
# Clone repository
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM
# Create conda environment
conda create -n dnallm python=3.12 -y
# Activate conda environment
conda activate dnallm
# Install uv in conda environment
conda install uv -c conda-forge
# Install DNALLM with base dependencies
uv pip install -e '.[base]'
# For MCP server support (optional)
uv pip install -e '.[mcp]'
# Verify installation
python -c "import dnallm; print('DNALLM installed successfully!')"
For GPU acceleration, install the appropriate CUDA version:
# For venv users: activate virtual environment
source .venv/bin/activate # Linux/MacOS
# or
.venv\Scripts\activate # Windows
# For conda users: activate conda environment
# conda activate dnallm
# CUDA 12.4 (recommended for recent GPUs)
uv pip install -e '.[cuda124]'
# Other supported versions: cpu, cuda121, cuda126, cuda128
uv pip install -e '.[cuda121]'
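After installing one of the CUDA extras, it can help to confirm that the interpreter, the package, and the GPU are all visible from Python. The snippet below is a minimal sketch, not part of DNALLM: `check_environment` is a hypothetical helper, and it assumes that PyTorch is the backend installed by the CUDA extras.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Return a small report on the local Python/GPU setup."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "dnallm_installed": importlib.util.find_spec("dnallm") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        # Imported lazily so the check also works without PyTorch (assumed backend).
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is `False` on a machine with an Nvidia GPU, the installed extra may not match the local driver or CUDA version.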
Native Mamba architecture runs significantly faster than transformer-compatible Mamba architecture, but native Mamba depends on Nvidia GPUs.
If you need native Mamba architecture support, after installing DNALLM dependencies, use the following command:
# For venv users: activate virtual environment
source .venv/bin/activate # Linux/MacOS
# For conda users: activate conda environment
# conda activate dnallm
# Install Mamba support
uv pip install -e '.[mamba]' --no-cache-dir --no-build-isolation
# If you encounter network issues, use the dedicated install script for Mamba (optional)
sh scripts/install_mamba.sh # select github proxy
Please ensure your machine can connect to GitHub, otherwise Mamba dependencies may fail to download.
Note that the Plant DNAMamba, Caduceus, PlantCaduceus, PlantCAD2, Jamba-DNA, and JanusDNA models are all based on the Mamba architecture, so installing native Mamba support can accelerate both training and inference for these models.
Several models require extra dependencies for training or inference.
These models are listed below:
| Model | Model Type | Source | Dependencies |
|---|---|---|---|
| EVO-1 | CausalLM | Hugging Face | GitHub |
| EVO2 | CausalLM | Hugging Face | GitHub |
| GPN | MaskedLM | Hugging Face | GitHub |
| megaDNA | CausalLM | Hugging Face | GitHub |
| LucaOne | CausalLM | Hugging Face | GitHub |
| Omni-DNA | CausalLM | Hugging Face | GitHub |
The installation method for the dependencies of these models can be found here.
from dnallm import load_config, load_model_and_tokenizer, DNAInference
# Load configuration
configs = load_config("./example/notebooks/inference/inference_config.yaml")
# Load model and tokenizer
model_name = "zhangtaolab/plant-dnagpt-BPE-promoter_strength_protoplast"
model, tokenizer = load_model_and_tokenizer(
model_name,
task_config=configs['task'],
source="huggingface"
)
# Initialize inference engine
inference_engine = DNAInference(config=configs, model=model, tokenizer=tokenizer)
# Make inference
sequence = "AATATATTTAATCGGTGTATAATTTCTGTGAAGATCCTCGATACTTCATATAAGAGATTTTGAGAGAGAGAGAGAACCAATTTTCGAATGGGTGAGTTGGCAAAGTATTCACTTTTCAGAACATAATTGGGAAACTAGTCACTTTACTATTCAAAATTTGCAAAGTAGTC"
inference_result = inference_engine.infer(sequence)
print(f"Inference result: {inference_result}")
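Sequences copied from FASTA files or spreadsheets often carry whitespace, lowercase letters, or stray symbols, so it can be worth normalizing them before inference. The following is a minimal sketch under stated assumptions: `clean_sequence` and `gc_content` are hypothetical helpers, not DNALLM functions, and the accepted alphabet (`ACGTN`) is an assumption that may not match what a given tokenizer accepts.

```python
VALID_BASES = frozenset("ACGTN")  # assumed alphabet; adjust to the model's tokenizer


def clean_sequence(raw):
    """Uppercase a sequence and drop whitespace; raise if unexpected symbols remain."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    unexpected = sorted(set(seq) - VALID_BASES)
    if unexpected:
        raise ValueError(f"unexpected symbols: {unexpected}")
    return seq


def gc_content(seq):
    """Fraction of G/C among the unambiguous (A/C/G/T) bases of a cleaned sequence."""
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


cleaned = clean_sequence(" aatat atttaatcgg\n")
print(cleaned)              # AATATATTTAATCGG
print(gc_content(cleaned))  # 0.2
```

The cleaned string can then be passed to the inference call in place of the raw input.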
from dnallm import Mutagenesis
# Initialize mutagenesis analyzer
mutagenesis = Mutagenesis(config=configs, model=model, tokenizer=tokenizer)
# Generate saturation mutations
mutagenesis.mutate_sequence(sequence, replace_mut=True)
# Evaluate mutation effects
predictions = mutagenesis.evaluate(strategy="mean")
# Visualize results
plot = mutagenesis.plot(predictions, save_path="mutation_effects.pdf")
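To make the idea of saturation mutagenesis concrete, the sketch below enumerates every single-base substitution of a sequence in plain Python. It illustrates the concept only and is not a description of how `Mutagenesis.mutate_sequence` is implemented; `saturation_mutants` is a hypothetical name.

```python
def saturation_mutants(sequence, alphabet="ACGT"):
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    single-base substitution of `sequence` (positions are 0-based)."""
    sequence = sequence.upper()
    for position, ref_base in enumerate(sequence):
        for alt_base in alphabet:
            if alt_base == ref_base:
                continue  # skip the reference base itself
            mutated = sequence[:position] + alt_base + sequence[position + 1:]
            yield position, ref_base, alt_base, mutated


mutants = list(saturation_mutants("ACGT"))
print(len(mutants))  # 12: three alternatives at each of the four positions
print(mutants[0])    # (0, 'A', 'C', 'CCGT')
```

A sequence of length n yields 3n substitution mutants, so the number of model evaluations grows linearly with sequence length.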
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer
# Prepare dataset
dataset = DNADataset(
data_path="path/to/your/data.csv",
task_type="binary_classification",
text_column="sequence",
label_column="label"
)
# Initialize trainer
trainer = DNATrainer(
config=configs,
model=model,
tokenizer=tokenizer,
train_dataset=dataset
)
# Start training
trainer.train()
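The dataset example above reads a CSV whose columns are named by `text_column` and `label_column`. As a rough sketch of preparing such a file with the standard library, the snippet below writes a two-column CSV; the records are made-up toy data, `write_dataset_csv` is a hypothetical helper, and the use of a header row with integer labels is an assumption about what the loader expects.

```python
import csv
import tempfile
from pathlib import Path


def write_dataset_csv(path, records):
    """Write (sequence, label) pairs to a CSV with `sequence` and `label` columns."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence", "label"])  # header names match the example above
        for sequence, label in records:
            writer.writerow([sequence.upper(), int(label)])


# Toy records: made-up sequences and labels, for illustration only.
records = [("ACGTACGTAC", 1), ("ttgacatata", 0)]
csv_path = Path(tempfile.mkdtemp()) / "data.csv"
write_dataset_csv(csv_path, records)

with open(csv_path, newline="") as handle:
    rows = list(csv.DictReader(handle))
print(rows[1])  # {'sequence': 'TTGACATATA', 'label': '0'}
```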
# Start MCP server for real-time DNA sequence prediction
from dnallm.mcp import DNALLMMCPServer
# Initialize MCP server
server = DNALLMMCPServer("config/mcp_server_config.yaml")
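# `await` needs an async context (a notebook cell or an `async def` function);
# in a plain script, wrapping the call as asyncio.run(server.initialize()) is the usual alternative
await server.initialize()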
# Start server with SSE transport for real-time streaming
server.start_server(host="0.0.0.0", port=8000, transport="sse")
- Real-time Streaming: Server-Sent Events (SSE) for live prediction updates
- Multiple Transport Protocols: STDIO, SSE, and Streamable HTTP
- Comprehensive Tools: 10+ MCP tools for DNA sequence analysis
- Model Management: Dynamic model loading and switching
- Batch Processing: Efficient handling of multiple sequences
- Health Monitoring: Built-in server diagnostics and status checks
- dna_sequence_predict: Single sequence prediction
- dna_batch_predict: Batch sequence processing
- dna_multi_model_predict: Multi-model comparison
- dna_stream_predict: Real-time streaming prediction
- list_loaded_models: Model management
- health_check: Server monitoring
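For readers unfamiliar with the SSE transport, the wire format is plain text: each event is a group of `field: value` lines terminated by a blank line. The sketch below parses that generic format; it is not a DNALLM client, `parse_sse` is a hypothetical helper, and the JSON payloads in the sample stream are invented for illustration.

```python
def parse_sse(lines):
    """Group raw SSE lines into events: dicts with `event` and `data` keys."""
    events = []
    event_type, data_lines = "message", []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
    return events


# Invented sample stream; real payloads depend on the server.
stream = [
    ": keep-alive",
    "event: progress",
    'data: {"done": 1, "total": 2}',
    "",
    'data: {"done": 2, "total": 2}',
    "",
]
for event in parse_sse(stream):
    print(event["event"], event["data"])
# progress {"done": 1, "total": 2}
# message {"done": 2, "total": 2}
```

In practice an MCP client library would normally handle this framing; the sketch only shows what travels over the connection.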
# Launch Jupyter Lab
uv run jupyter lab
# Fine-tuning demo
uv run marimo run example/marimo/finetune/finetune_demo.py
# Inference demo
uv run marimo run example/marimo/inference/inference_demo.py
# Benchmark demo
uv run marimo run example/marimo/benchmark/benchmark_demo.py
# Launch Gradio configuration generator app
uv run python ui/run_config_app.py
# Or run the model config generator directly
uv run python ui/model_config_generator_app.py
# Launch Jupyter Lab
uv run jupyter lab
# Available notebooks:
# - example/notebooks/finetune_binary/ - Binary classification fine-tuning
# - example/notebooks/finetune_multi_labels/ - Multi-label classification
# - example/notebooks/finetune_NER_task/ - Named Entity Recognition
# - example/notebooks/inference_and_benchmark/ - Model evaluation
# - example/notebooks/in_silico_mutagenesis/ - Mutation analysis
# - example/notebooks/inference_for_tRNA/ - tRNA-specific analysis
# - example/notebooks/inference_evo_models/ - EVO model inference
# - example/notebooks/lora_finetune_inference/ - LoRA fine-tuning
# - example/notebooks/embedding_attention.ipynb - Embedding and attention analysis
DNALLM/
├── dnallm/  # Core library package
│   ├── __init__.py  # Package initialization and main exports
│   ├── version.py  # Version information
│   ├── cli/  # Command-line interface tools
│   │   ├── __init__.py
│   │   ├── cli.py  # Main CLI entry point
│   │   ├── train.py  # Training command implementation
│   │   ├── inference.py  # Inference command implementation
│   │   └── model_config_generator.py  # Interactive config generator
│   ├── configuration/  # Configuration management system
│   │   ├── __init__.py
│   │   ├── configs.py  # Configuration classes and loaders
│   │   └── evo  # Folder contains configs for loading evo models
│   ├── datahandling/  # Dataset processing and management
│   │   ├── __init__.py
│   │   ├── README.md
│   │   ├── data.py  # Core dataset classes
│   │   └── dataset_auto.py  # Automatic dataset builders
│   ├── finetune/  # Model fine-tuning pipeline
│   │   ├── __init__.py
│   │   └── trainer.py  # Training logic and utilities
│   ├── inference/  # Inference and analysis tools
│   │   ├── __init__.py
│   │   ├── benchmark.py  # Multi-model performance comparison
│   │   ├── inference.py  # Core inference engine
│   │   ├── mutagenesis.py  # In-silico mutation analysis
│   │   └── plot.py  # Result visualization tools
│   ├── models/  # Model loading and management
│   │   ├── __init__.py
│   │   ├── model.py  # Model utilities and helpers
│   │   ├── model_info.yaml  # Model registry and metadata
│   │   └── modeling_auto.py  # Automatic model loading
│   ├── tasks/  # Task definitions and evaluation
│   │   ├── __init__.py
│   │   ├── task.py  # Task type definitions
│   │   ├── metrics.py  # Evaluation metrics
│   │   └── metrics/  # Individual metric implementations
│   │       ├── accuracy/  # Accuracy metrics
│   │       ├── f1/  # F1 score metrics
│   │       ├── precision/  # Precision metrics
│   │       ├── recall/  # Recall metrics
│   │       ├── roc_auc/  # ROC-AUC metrics
│   │       ├── mse/  # Mean squared error
│   │       ├── mae/  # Mean absolute error
│   │       ├── r_squared/  # R-squared metrics
│   │       └── ... (30+ metrics)
│   ├── utils/  # Utility functions and helpers
│   │   ├── __init__.py
│   │   ├── logger.py  # Logging utilities
│   │   └── sequence.py  # DNA sequence processing
│   └── mcp/  # Model Context Protocol server
│       ├── __init__.py
│       ├── README.md  # MCP documentation (Chinese)
│       ├── DEVELOPMENT.md  # Development guide
│       ├── server.py  # MCP server implementation
│       ├── start_server.py  # Server startup script
│       ├── config_manager.py  # Configuration management
│       ├── config_validators.py  # Input validation
│       ├── model_manager.py  # Model lifecycle management
│       ├── example_sse_usage.py  # SSE usage examples
│       ├── run_tests.py  # Test runner
│       ├── requirements.txt  # MCP-specific dependencies
│       ├── test_mcp_curl.md  # MCP testing documentation
│       ├── configs/  # MCP configuration files
│       │   ├── mcp_server_config.yaml
│       │   ├── promoter_inference_config.yaml
│       │   ├── conservation_inference_config.yaml
│       │   └── ... (task-specific configs)
│       └── tests/  # MCP test suite
│           ├── __init__.py
│           ├── test_config_manager.py
│           ├── test_config_validators.py
│           ├── test_mcp_functionality.py
│           ├── test_server_integration.py
│           ├── test_sse_client.py
│           └── configs/  # Test configurations
├── cli/  # Legacy CLI scripts (deprecated)
│   ├── cli.py
│   ├── inference.py
│   ├── train.py
│   ├── model_config_generator.py
│   └── examples/  # CLI configuration examples
├── example/  # Examples and interactive demos
│   ├── README.md  # Example documentation
│   ├── marimo/  # Interactive Marimo applications
│   │   ├── benchmark/  # Benchmarking demos
│   │   ├── finetune/  # Fine-tuning demos
│   │   └── inference/  # Inference demos
│   ├── mcp_example/  # MCP usage examples
│   │   └── mcp_client_ollama_pydantic_ai.ipynb
│   └── notebooks/  # Jupyter notebook tutorials
│       ├── benchmark/  # Model comparison notebooks
│       ├── finetune_binary/  # Binary classification training
│       ├── finetune_multi_labels/  # Multi-label classification
│       ├── finetune_NER_task/  # Named entity recognition
│       ├── inference/  # Inference demonstrations
│       ├── inference_for_tRNA/  # tRNA-specific analysis
│       ├── in_silico_mutagenesis/  # Mutation effect analysis
│       └── embedding_attention.ipynb  # Embedding visualization
├── docs/  # Comprehensive documentation
│   ├── index.md  # Documentation home page
│   ├── api/  # API reference documentation
│   │   ├── datahandling/  # Dataset handling APIs
│   │   ├── finetune/  # Training APIs
│   │   ├── inference/  # Inference APIs
│   │   ├── mcp/  # MCP APIs
│   │   └── utils/  # Utility APIs
│   ├── cli/  # Command-line interface docs
│   ├── concepts/  # Core concepts and architecture
│   ├── getting_started/  # Installation and setup guides
│   ├── tutorials/  # Step-by-step tutorials
│   ├── resources/  # Additional resources
│   └── pic/  # Documentation images
├── tests/  # Comprehensive test suite
│   ├── TESTING.md  # Testing documentation
│   ├── pytest.ini  # Pytest configuration
│   ├── benchmark/  # Benchmarking tests
│   ├── datahandling/  # Dataset handling tests
│   ├── finetune/  # Training pipeline tests
│   ├── inference/  # Inference engine tests
│   ├── utils/  # Utility function tests
│   └── test_data/  # Test datasets
│       ├── binary_classification/
│       ├── multiclass_classification/
│       ├── multilabel_classification/
│       ├── regression/
│       ├── token_classification/
│       └── embedding/
├── ui/  # Web-based user interfaces
│   ├── README.md  # UI documentation
│   ├── model_config_generator_app.py  # Gradio configuration app
│   ├── run_config_app.py  # App launcher
│   └── requirements.txt  # UI-specific dependencies
├── scripts/  # Development and deployment scripts
│   ├── check_code.py  # Code quality checker
│   ├── check_code.sh  # Shell script for code checks
│   ├── check_code.bat  # Windows batch script
│   ├── ci_checks.sh  # Continuous integration checks
│   ├── install_mamba.sh  # Mamba installation script
│   ├── publish.sh  # Package publishing script
│   └── setup_uv.sh  # UV package manager setup
├── benchmark_results/  # Benchmark output storage
├── .github/  # GitHub workflows and templates
├── .flake8  # Code style configuration
├── .gitignore  # Git ignore patterns
├── .pre-commit-config.yaml  # Pre-commit hooks
├── CONTRIBUTING.md  # Contribution guidelines
├── LICENSE  # MIT license
├── README.md  # This file
├── mkdocs.yml  # Documentation configuration
├── pyproject.toml  # Project metadata and dependencies
├── setup.py  # Package setup script
└── run_cli.py  # Legacy CLI runner
DNALLM provides convenient CLI tools:
# Main CLI with subcommands
dnallm --help
# Training
dnallm train --config path/to/config.yaml
# or
dnallm-train --config path/to/config.yaml
# Inference
dnallm inference --config path/to/config.yaml --input path/to/sequences.txt
# or
dnallm-inference --config path/to/config.yaml --input path/to/sequences.txt
# Model configuration generator
dnallm-model-config-generator
# MCP server
dnallm-mcp-server --config path/to/config.yaml
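When many inputs need to be processed, the CLI can also be driven from a Python script. The snippet below is a minimal sketch that only reuses the `dnallm inference --config ... --input ...` form shown above; `build_inference_command` and `run_inference` are hypothetical helpers, and the paths are placeholders.

```python
import shutil
import subprocess


def build_inference_command(config_path, input_path, executable="dnallm"):
    """Assemble the argument list for `dnallm inference`."""
    return [
        executable, "inference",
        "--config", str(config_path),
        "--input", str(input_path),
    ]


def run_inference(config_path, input_path):
    """Run the CLI if it is on PATH; return its exit code, or None if it is missing."""
    command = build_inference_command(config_path, input_path)
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=False).returncode


print(build_inference_command("path/to/config.yaml", "path/to/sequences.txt"))
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems when paths contain spaces.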
DNALLM supports the following task types:
- EMBEDDING: Extract embeddings, attention maps, and token probabilities for downstream analysis
- MASK: Masked language modeling task for pre-training
- GENERATION: Text generation task for causal language models
- BINARY: Binary classification task with two possible labels
- MULTICLASS: Multi-class classification task that assigns each input to exactly one of more than two classes
- MULTILABEL: Multi-label classification task with multiple binary labels per sample
- REGRESSION: Regression task that returns a continuous score
- NER: Token classification task, typically used for Named Entity Recognition
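The classification and regression task types differ mainly in how raw model outputs are turned into predictions. The sketch below shows one common convention in plain Python; it is not DNALLM's own post-processing, and the lowercase task names, the `decode` helper, and the 0.5 threshold are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    shifted = [value - max(logits) for value in logits]  # for numerical stability
    exps = [math.exp(value) for value in shifted]
    total = sum(exps)
    return [value / total for value in exps]


def decode(task, logits, threshold=0.5):
    """Turn raw logits into a prediction for a few of the task types above."""
    if task in ("binary", "multiclass"):
        probs = softmax(logits)
        return probs.index(max(probs))  # index of the most probable class
    if task == "multilabel":
        # Each label is an independent yes/no decision.
        return [int(sigmoid(value) >= threshold) for value in logits]
    if task == "regression":
        return logits[0]  # the single output is the predicted score
    raise ValueError(f"unsupported task: {task}")


print(decode("binary", [0.2, 1.5]))            # 1
print(decode("multiclass", [2.0, 0.1, -1.0]))  # 0
print(decode("multilabel", [3.0, -2.0, 0.0]))  # [1, 0, 1]
print(decode("regression", [0.73]))            # 0.73
```

The key distinction is that multi-class outputs compete through a softmax, while multi-label outputs are thresholded independently.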
DNALLM includes a comprehensive test suite with 200+ test cases:
# Run all tests
uv run pytest
# Run specific test categories
uv run pytest tests/inference/ -v
uv run pytest tests/mcp/ -v
uv run pytest tests/tasks/ -v
# Run with coverage
uv run pytest --cov=dnallm --cov-report=html
- Getting Started - Installation and basic usage
- Tutorials - Step-by-step guides for specific tasks
- API Reference - Detailed function documentation
- Concepts - Core concepts and architecture
- FAQ - Common questions and solutions
We welcome contributions! Please see our Contributing Guide for details on:
- Code style and standards
- Testing requirements
- Pull request process
- Development setup
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face - Model hosting and transformers library
- ModelScope - Alternative model repository
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Examples: Check the example/ directory for working code
DNALLM - Empowering DNA sequence analysis with state-of-the-art language models.