DNALLM - DNA Large Language Model Toolkit

DNALLM is a comprehensive, open-source toolkit designed for fine-tuning and inference with DNA Language Models. It provides a unified interface for working with various DNA sequence models, supporting tasks ranging from basic sequence classification to advanced in-silico mutagenesis analysis. With built-in Model Context Protocol (MCP) support, DNALLM enables seamless communication with traditional large language models, allowing for enhanced integration and interoperability in AI-powered DNA analysis workflows.

🚀 Key Features

  • 🔄 Model Management: Load and switch between 150+ pre-trained DNA language models from Hugging Face and ModelScope
  • 🎯 Multi-Task Support: Binary/multi-class classification, regression, NER, MLM, and generation tasks
  • 📊 Benchmarking: Multi-model performance comparison and evaluation metrics
  • 🔧 Fine-tuning: Comprehensive training pipeline with configurable parameters
  • 📱 Interactive Interfaces: Jupyter notebooks and Marimo-based interactive demos
  • 🌐 MCP Support: Model Context Protocol for server/client deployment with real-time streaming
  • 🧬 Advanced Analysis: In-silico mutagenesis, saturation mutation analysis, and mutation effect visualization
  • 🧪 Comprehensive Testing: 200+ test cases covering all major functionality

🧬 Supported Models

DNALLM supports a wide range of DNA language models including:

Masked Language Models (MLM)

  • DNABERT Series: Plant DNABERT, DNABERT, DNABERT-2, DNABERT-S
  • Caduceus Series: Caduceus-Ph, Caduceus-PS, PlantCaduceus
  • Specialized Models: AgroNT, GENA-LM, GPN, GROVER, MutBERT, ProkBERT

Causal Language Models (CLM)

  • EVO Series: EVO-1, EVO-2
  • Plant Models: Plant DNAGemma, Plant DNAGPT, Plant DNAMamba
  • Other Models: GENERator, GenomeOcean, HyenaDNA, Jamba-DNA, Mistral-DNA

Model Sources

  • Hugging Face Hub: Primary model repository
  • ModelScope: Alternative model source with additional models
  • Custom Models: Support for locally trained or custom architectures

πŸ› οΈ Installation

Prerequisites

  • Python 3.10 or higher (Python 3.12 recommended)
  • Git
  • CUDA-compatible GPU (optional, for GPU acceleration)
  • Environment Manager: Choose one of the following:
    • Python venv (built-in)
    • Conda/Miniconda (recommended for scientific computing)
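
Before installing, it can help to confirm that the active interpreter meets the minimum version listed above. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and not part of DNALLM.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the documented minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    if python_is_supported():
        print(f"Python {current} is supported")
    else:
        print(f"Python {current} is too old; 3.10 or higher is required")
```

Run it with the same `python` executable you plan to use for the virtual or conda environment.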

Quick Installation with uv (Recommended)

DNALLM uses uv for dependency management and packaging.

What is uv? uv is a fast Python package manager, advertised as 10-100x faster than traditional tools like pip.

Method 1: Using venv + uv

# Clone repository
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM

# Create virtual environment
python -m venv .venv

# Activate virtual environment
source .venv/bin/activate  # Linux/MacOS
# or
.venv\Scripts\activate     # Windows

# Upgrade pip (recommended)
pip install --upgrade pip

# Install uv in virtual environment
pip install uv

# Install DNALLM with base dependencies
uv pip install -e '.[base]'

# For MCP server support (optional)
uv pip install -e '.[mcp]'

# Verify installation
python -c "import dnallm; print('DNALLM installed successfully!')"

Method 2: Using conda + uv

# Clone repository
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM

# Create conda environment
conda create -n dnallm python=3.12 -y

# Activate conda environment
conda activate dnallm

# Install uv in conda environment
conda install uv -c conda-forge

# Install DNALLM with base dependencies
uv pip install -e '.[base]'

# For MCP server support (optional)
uv pip install -e '.[mcp]'

# Verify installation
python -c "import dnallm; print('DNALLM installed successfully!')"
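
If the one-line check fails, it may be useful to see which packages are missing without triggering an import error. The sketch below uses the standard library's `importlib.util.find_spec`. Checking `torch` and `transformers` alongside `dnallm` is an assumption about what the base install pulls in, so adjust the list to your setup.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


# "torch" and "transformers" are assumed dependencies, listed for illustration.
for name in ("dnallm", "torch", "transformers"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name:<14}{status}")
```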

GPU Support

For GPU acceleration, install the appropriate CUDA version:

# For venv users: activate virtual environment
source .venv/bin/activate  # Linux/MacOS
# or
.venv\Scripts\activate     # Windows

# For conda users: activate conda environment
# conda activate dnallm

# CUDA 12.4 (recommended for recent GPUs)
uv pip install -e '.[cuda124]'

# Other supported versions: cpu, cuda121, cuda126, cuda128
uv pip install -e '.[cuda121]'
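
The extras listed above appear to follow a `cuda<major><minor>` naming pattern. As a rough sketch under that assumption, the hypothetical helper below maps a CUDA version string (for example, the one reported by `nvidia-smi`) to an extras name. Falling back to `cpu` when nothing matches is a conservative choice of this sketch, not documented DNALLM behavior.

```python
from typing import Optional

# Extras named in the installation instructions above.
SUPPORTED_EXTRAS = ("cpu", "cuda121", "cuda124", "cuda126", "cuda128")


def pick_extra(cuda_version: Optional[str]) -> str:
    """Map a CUDA version string such as "12.4" to an extras name.

    Returns "cpu" when no version is given or no matching extra is listed.
    """
    if not cuda_version:
        return "cpu"
    major, _, rest = cuda_version.partition(".")
    minor = rest.split(".")[0]  # drop any patch component, e.g. "12.4.1"
    candidate = f"cuda{major}{minor}"
    return candidate if candidate in SUPPORTED_EXTRAS else "cpu"


extra = pick_extra("12.4")
print(f"uv pip install -e '.[{extra}]'")
```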

Native Mamba Support

The native Mamba implementation runs significantly faster than the transformer-compatible Mamba implementation, but it requires an NVIDIA GPU.

If you need native Mamba support, install it after the DNALLM dependencies with the following command:

# For venv users: activate virtual environment
source .venv/bin/activate  # Linux/MacOS

# For conda users: activate conda environment
# conda activate dnallm

# Install Mamba support
uv pip install -e '.[mamba]' --no-cache-dir --no-build-isolation

# If you encounter network issues, use the dedicated Mamba install script instead (optional)
sh scripts/install_mamba.sh  # select github proxy

Please ensure your machine can connect to GitHub, otherwise Mamba dependencies may fail to download.

Note that the Plant DNAMamba, Caduceus, PlantCaduceus, PlantCAD2, Jamba-DNA, and JanusDNA models are all based on the Mamba architecture, so installing native Mamba support can accelerate both training and inference for these models.

Install Dependencies for Special Models

Several models require extra dependencies for training or inference.

These models are listed below:

Model      Model Type   Source         Dependencies
EVO-1      CausalLM     Hugging Face   GitHub
EVO-2      CausalLM     Hugging Face   GitHub
GPN        MaskedLM     Hugging Face   GitHub
megaDNA    CausalLM     Hugging Face   GitHub
LucaOne    CausalLM     Hugging Face   GitHub
Omni-DNA   CausalLM     Hugging Face   GitHub

The installation method for the dependencies of these models can be found here.

🚀 Quick Start

1. Basic Model Loading and Inference

from dnallm import load_config, load_model_and_tokenizer, DNAInference

# Load configuration
configs = load_config("./example/notebooks/inference/inference_config.yaml")

# Load model and tokenizer
model_name = "zhangtaolab/plant-dnagpt-BPE-promoter_strength_protoplast"
model, tokenizer = load_model_and_tokenizer(
    model_name, 
    task_config=configs['task'], 
    source="huggingface"
)

# Initialize inference engine
inference_engine = DNAInference(config=configs, model=model, tokenizer=tokenizer)

# Run inference on a single sequence
sequence = "AATATATTTAATCGGTGTATAATTTCTGTGAAGATCCTCGATACTTCATATAAGAGATTTTGAGAGAGAGAGAGAACCAATTTTCGAATGGGTGAGTTGGCAAAGTATTCACTTTTCAGAACATAATTGGGAAACTAGTCACTTTACTATTCAAAATTTGCAAAGTAGTC"
inference_result = inference_engine.infer(sequence)
print(f"Inference result: {inference_result}")
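
Sequences copied from FASTA files or spreadsheets often carry whitespace or stray characters. The sketch below shows one way to normalize input before passing it to `infer`. The helper `clean_sequence` is hypothetical, and both the `ACGTN` alphabet and the upper-casing are assumptions: some models accept other IUPAC codes or treat lower-case bases as meaningful, and DNALLM may perform its own checks.

```python
VALID_BASES = frozenset("ACGTN")  # assumed alphabet; adjust for your model


def clean_sequence(raw: str) -> str:
    """Remove whitespace, upper-case the sequence, and reject unknown symbols."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("sequence is empty")
    invalid = sorted(set(seq) - VALID_BASES)
    if invalid:
        raise ValueError(f"unexpected characters: {''.join(invalid)}")
    return seq


print(clean_sequence(" aatat attta\nATCGG "))
```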

2. In-silico Mutagenesis Analysis

from dnallm import Mutagenesis

# Initialize mutagenesis analyzer
mutagenesis = Mutagenesis(config=configs, model=model, tokenizer=tokenizer)

# Generate saturation mutations
mutagenesis.mutate_sequence(sequence, replace_mut=True)

# Evaluate mutation effects
predictions = mutagenesis.evaluate(strategy="mean")

# Visualize results
plot = mutagenesis.plot(predictions, save_path="mutation_effects.pdf")
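
To make the idea of saturation mutagenesis concrete, the standalone sketch below enumerates every single-nucleotide substitution of a short sequence. It illustrates the concept only. It is not how `Mutagenesis.mutate_sequence` is implemented, and the function name is hypothetical.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def single_substitutions(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    single-nucleotide substitution of `seq`, using 0-based positions."""
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


mutants = list(single_substitutions("ACG"))
print(len(mutants))  # 9: three alternative bases at each of three positions
print(mutants[0])    # (0, 'A', 'C', 'CCG')
```

A sequence of length n made of A, C, G, and T therefore yields 3n mutants, which is why scoring a saturation scan over long sequences can take noticeably longer than a single inference call.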

3. Model Fine-tuning

from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer

# Prepare dataset
dataset = DNADataset(
    data_path="path/to/your/data.csv",
    task_type="binary_classification",
    text_column="sequence",
    label_column="label"
)

# Initialize trainer
trainer = DNATrainer(
    config=configs,
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset
)

# Start training
trainer.train()
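
The dataset example above reads a CSV with a `sequence` column and a `label` column. A minimal sketch of producing such a file with the standard library follows. The toy rows and the 0/1 integer labels are illustrative assumptions; check the DNADataset documentation for the label encoding it actually expects.

```python
import csv
import tempfile
from pathlib import Path


def write_dataset(rows, path):
    """Write (sequence, label) pairs to a CSV file with a header row."""
    path = Path(path)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence", "label"])  # matches text_column / label_column
        writer.writerows(rows)
    return path


# Toy data for illustration only.
rows = [("ACGTACGTAC", 1), ("TTTTGGGGCC", 0)]
csv_path = write_dataset(rows, Path(tempfile.gettempdir()) / "dnallm_toy_data.csv")
print(csv_path.read_text())
```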

4. MCP Server Deployment

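# Start MCP server for real-time DNA sequence prediction
# Note: top-level `await` only works in an async context such as a Jupyter
# notebook; in a plain script, use e.g. asyncio.run(server.initialize()).
from dnallm.mcp import DNALLMMCPServer

# Initialize MCP server
server = DNALLMMCPServer("config/mcp_server_config.yaml")
await server.initialize()

# Start server with SSE transport for real-time streaming
# (0.0.0.0 listens on all network interfaces; use 127.0.0.1 for local-only access)
server.start_server(host="0.0.0.0", port=8000, transport="sse")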

MCP Server Features

  • Real-time Streaming: Server-Sent Events (SSE) for live prediction updates
  • Multiple Transport Protocols: STDIO, SSE, and Streamable HTTP
  • Comprehensive Tools: 10+ MCP tools for DNA sequence analysis
  • Model Management: Dynamic model loading and switching
  • Batch Processing: Efficient handling of multiple sequences
  • Health Monitoring: Built-in server diagnostics and status checks

Available MCP Tools

  • dna_sequence_predict - Single sequence prediction
  • dna_batch_predict - Batch sequence processing
  • dna_multi_model_predict - Multi-model comparison
  • dna_stream_predict - Real-time streaming prediction
  • list_loaded_models - Model management
  • health_check - Server monitoring
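
MCP clients invoke tools such as these through JSON-RPC 2.0 `tools/call` requests. The sketch below shows the general shape of such a message. The argument key `sequence` is hypothetical, since the tool schemas are not listed here, and a real client library would also handle the initialization handshake and the transport.

```python
import json
from itertools import count

_request_ids = count(1)  # simple incrementing request ids


def tool_call_request(tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    payload = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(payload)


# The "sequence" key is an assumption; consult the tool's input schema.
print(tool_call_request("dna_sequence_predict", {"sequence": "ACGTACGT"}))
```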

📚 Examples and Tutorials

Interactive Demos (Marimo)

# Launch Jupyter Lab
uv run jupyter lab

# Fine-tuning demo
uv run marimo run example/marimo/finetune/finetune_demo.py

# Inference demo
uv run marimo run example/marimo/inference/inference_demo.py

# Benchmark demo
uv run marimo run example/marimo/benchmark/benchmark_demo.py

Web-based UI (Gradio)

# Launch Gradio configuration generator app
uv run python ui/run_config_app.py

# Or run the model config generator directly
uv run python ui/model_config_generator_app.py

Jupyter Notebooks

# Launch Jupyter Lab
uv run jupyter lab

# Available notebooks:
# - example/notebooks/finetune_binary/ - Binary classification fine-tuning
# - example/notebooks/finetune_multi_labels/ - Multi-label classification
# - example/notebooks/finetune_NER_task/ - Named Entity Recognition
# - example/notebooks/inference_and_benchmark/ - Model evaluation
# - example/notebooks/in_silico_mutagenesis/ - Mutation analysis
# - example/notebooks/inference_for_tRNA/ - tRNA-specific analysis
# - example/notebooks/inference_evo_models/ - EVO model inference
# - example/notebooks/lora_finetune_inference/ - LoRA fine-tuning
# - example/notebooks/embedding_attention.ipynb - Embedding and attention analysis

πŸ—οΈ Project Structure

DNALLM/
β”œβ”€β”€ dnallm/                     # Core library package
β”‚   β”œβ”€β”€ __init__.py             # Package initialization and main exports
β”‚   β”œβ”€β”€ version.py              # Version information
β”‚   β”œβ”€β”€ cli/                    # Command-line interface tools
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ cli.py              # Main CLI entry point
β”‚   β”‚   β”œβ”€β”€ train.py            # Training command implementation
β”‚   β”‚   β”œβ”€β”€ inference.py        # Inference command implementation
β”‚   β”‚   └── model_config_generator.py # Interactive config generator
β”‚   β”œβ”€β”€ configuration/          # Configuration management system
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ configs.py          # Configuration classes and loaders
β”‚   β”‚   └── evo                 # Folder contains configs for loading evo models
β”‚   β”œβ”€β”€ datahandling/           # Dataset processing and management
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ README.md
β”‚   β”‚   β”œβ”€β”€ data.py             # Core dataset classes
β”‚   β”‚   └── dataset_auto.py     # Automatic dataset builders
β”‚   β”œβ”€β”€ finetune/               # Model fine-tuning pipeline
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── trainer.py          # Training logic and utilities
β”‚   β”œβ”€β”€ inference/              # Inference and analysis tools
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ benchmark.py        # Multi-model performance comparison
β”‚   β”‚   β”œβ”€β”€ inference.py        # Core inference engine
β”‚   β”‚   β”œβ”€β”€ mutagenesis.py      # In-silico mutation analysis
β”‚   β”‚   └── plot.py             # Result visualization tools
β”‚   β”œβ”€β”€ models/                 # Model loading and management
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ model.py            # Model utilities and helpers
β”‚   β”‚   β”œβ”€β”€ model_info.yaml     # Model registry and metadata
β”‚   β”‚   └── modeling_auto.py    # Automatic model loading
β”‚   β”œβ”€β”€ tasks/                  # Task definitions and evaluation
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ task.py             # Task type definitions
β”‚   β”‚   β”œβ”€β”€ metrics.py          # Evaluation metrics
β”‚   β”‚   └── metrics/            # Individual metric implementations
β”‚   β”‚       β”œβ”€β”€ accuracy/       # Accuracy metrics
β”‚   β”‚       β”œβ”€β”€ f1/             # F1 score metrics
β”‚   β”‚       β”œβ”€β”€ precision/      # Precision metrics
β”‚   β”‚       β”œβ”€β”€ recall/         # Recall metrics
β”‚   β”‚       β”œβ”€β”€ roc_auc/        # ROC-AUC metrics
β”‚   β”‚       β”œβ”€β”€ mse/            # Mean squared error
β”‚   β”‚       β”œβ”€β”€ mae/            # Mean absolute error
β”‚   β”‚       β”œβ”€β”€ r_squared/      # R-squared metrics
β”‚   β”‚       └── ... (30+ metrics)
β”‚   β”œβ”€β”€ utils/                  # Utility functions and helpers
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ logger.py           # Logging utilities
β”‚   β”‚   └── sequence.py         # DNA sequence processing
β”‚   └── mcp/                    # Model Context Protocol server
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ README.md           # MCP documentation (Chinese)
β”‚       β”œβ”€β”€ DEVELOPMENT.md      # Development guide
β”‚       β”œβ”€β”€ server.py           # MCP server implementation
β”‚       β”œβ”€β”€ start_server.py     # Server startup script
β”‚       β”œβ”€β”€ config_manager.py   # Configuration management
β”‚       β”œβ”€β”€ config_validators.py # Input validation
β”‚       β”œβ”€β”€ model_manager.py    # Model lifecycle management
β”‚       β”œβ”€β”€ example_sse_usage.py # SSE usage examples
β”‚       β”œβ”€β”€ run_tests.py        # Test runner
β”‚       β”œβ”€β”€ requirements.txt    # MCP-specific dependencies
β”‚       β”œβ”€β”€ test_mcp_curl.md    # MCP testing documentation
β”‚       β”œβ”€β”€ configs/            # MCP configuration files
β”‚       β”‚   β”œβ”€β”€ mcp_server_config.yaml
β”‚       β”‚   β”œβ”€β”€ promoter_inference_config.yaml
β”‚       β”‚   β”œβ”€β”€ conservation_inference_config.yaml
β”‚       β”‚   └── ... (task-specific configs)
β”‚       └── tests/              # MCP test suite
β”‚           β”œβ”€β”€ __init__.py
β”‚           β”œβ”€β”€ test_config_manager.py
β”‚           β”œβ”€β”€ test_config_validators.py
β”‚           β”œβ”€β”€ test_mcp_functionality.py
β”‚           β”œβ”€β”€ test_server_integration.py
β”‚           β”œβ”€β”€ test_sse_client.py
β”‚           └── configs/        # Test configurations
β”œβ”€β”€ cli/                        # Legacy CLI scripts (deprecated)
β”‚   β”œβ”€β”€ cli.py
β”‚   β”œβ”€β”€ inference.py
β”‚   β”œβ”€β”€ train.py
β”‚   β”œβ”€β”€ model_config_generator.py
β”‚   └── examples/               # CLI configuration examples
β”œβ”€β”€ example/                    # Examples and interactive demos
β”‚   β”œβ”€β”€ README.md               # Example documentation
β”‚   β”œβ”€β”€ marimo/                 # Interactive Marimo applications
β”‚   β”‚   β”œβ”€β”€ benchmark/          # Benchmarking demos
β”‚   β”‚   β”œβ”€β”€ finetune/           # Fine-tuning demos
β”‚   β”‚   └── inference/          # Inference demos
β”‚   β”œβ”€β”€ mcp_example/            # MCP usage examples
β”‚   β”‚   └── mcp_client_ollama_pydantic_ai.ipynb
β”‚   └── notebooks/              # Jupyter notebook tutorials
β”‚       β”œβ”€β”€ benchmark/          # Model comparison notebooks
β”‚       β”œβ”€β”€ finetune_binary/    # Binary classification training
β”‚       β”œβ”€β”€ finetune_multi_labels/ # Multi-label classification
β”‚       β”œβ”€β”€ finetune_NER_task/  # Named entity recognition
β”‚       β”œβ”€β”€ inference/          # Inference demonstrations
β”‚       β”œβ”€β”€ inference_for_tRNA/ # tRNA-specific analysis
β”‚       β”œβ”€β”€ in_silico_mutagenesis/ # Mutation effect analysis
β”‚       └── embedding_attention.ipynb # Embedding visualization
β”œβ”€β”€ docs/                       # Comprehensive documentation
β”‚   β”œβ”€β”€ index.md                # Documentation home page
β”‚   β”œβ”€β”€ api/                    # API reference documentation
β”‚   β”‚   β”œβ”€β”€ datahandling/       # Dataset handling APIs
β”‚   β”‚   β”œβ”€β”€ finetune/           # Training APIs
β”‚   β”‚   β”œβ”€β”€ inference/          # Inference APIs
β”‚   β”‚   β”œβ”€β”€ mcp/                # MCP APIs
β”‚   β”‚   └── utils/              # Utility APIs
β”‚   β”œβ”€β”€ cli/                    # Command-line interface docs
β”‚   β”œβ”€β”€ concepts/               # Core concepts and architecture
β”‚   β”œβ”€β”€ getting_started/        # Installation and setup guides
β”‚   β”œβ”€β”€ tutorials/              # Step-by-step tutorials
β”‚   β”œβ”€β”€ resources/              # Additional resources
β”‚   └── pic/                    # Documentation images
β”œβ”€β”€ tests/                      # Comprehensive test suite
β”‚   β”œβ”€β”€ TESTING.md              # Testing documentation
β”‚   β”œβ”€β”€ pytest.ini              # Pytest configuration
β”‚   β”œβ”€β”€ benchmark/              # Benchmarking tests
β”‚   β”œβ”€β”€ datahandling/           # Dataset handling tests
β”‚   β”œβ”€β”€ finetune/               # Training pipeline tests
β”‚   β”œβ”€β”€ inference/              # Inference engine tests
β”‚   β”œβ”€β”€ utils/                  # Utility function tests
β”‚   └── test_data/              # Test datasets
β”‚       β”œβ”€β”€ binary_classification/
β”‚       β”œβ”€β”€ multiclass_classification/
β”‚       β”œβ”€β”€ multilabel_classification/
β”‚       β”œβ”€β”€ regression/
β”‚       β”œβ”€β”€ token_classification/
β”‚       └── embedding/
β”œβ”€β”€ ui/                         # Web-based user interfaces
β”‚   β”œβ”€β”€ README.md               # UI documentation
β”‚   β”œβ”€β”€ model_config_generator_app.py # Gradio configuration app
β”‚   β”œβ”€β”€ run_config_app.py       # App launcher
β”‚   └── requirements.txt        # UI-specific dependencies
β”œβ”€β”€ scripts/                    # Development and deployment scripts
β”‚   β”œβ”€β”€ check_code.py           # Code quality checker
β”‚   β”œβ”€β”€ check_code.sh           # Shell script for code checks
β”‚   β”œβ”€β”€ check_code.bat          # Windows batch script
β”‚   β”œβ”€β”€ ci_checks.sh            # Continuous integration checks
β”‚   β”œβ”€β”€ install_mamba.sh        # Mamba installation script
β”‚   β”œβ”€β”€ publish.sh              # Package publishing script
β”‚   └── setup_uv.sh             # UV package manager setup
β”œβ”€β”€ benchmark_results/          # Benchmark output storage
β”œβ”€β”€ .github/                    # GitHub workflows and templates
β”œβ”€β”€ .flake8                     # Code style configuration
β”œβ”€β”€ .gitignore                  # Git ignore patterns
β”œβ”€β”€ .pre-commit-config.yaml     # Pre-commit hooks
β”œβ”€β”€ CONTRIBUTING.md             # Contribution guidelines
β”œβ”€β”€ LICENSE                     # MIT license
β”œβ”€β”€ README.md                   # This file
β”œβ”€β”€ mkdocs.yml                  # Documentation configuration
β”œβ”€β”€ pyproject.toml              # Project metadata and dependencies
β”œβ”€β”€ setup.py                    # Package setup script
└── run_cli.py                  # Legacy CLI runner

🔧 Command Line Interface

DNALLM provides convenient CLI tools:

# Main CLI with subcommands
dnallm --help

# Training
dnallm train --config path/to/config.yaml
# or
dnallm-train --config path/to/config.yaml

# Inference
dnallm inference --config path/to/config.yaml --input path/to/sequences.txt
# or
dnallm-inference --config path/to/config.yaml --input path/to/sequences.txt

# Model configuration generator
dnallm-model-config-generator

# MCP server
dnallm-mcp-server --config path/to/config.yaml
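
When the CLI is driven from a Python pipeline, building the argument list explicitly avoids shell-quoting problems. The sketch below uses only the subcommand and flags shown above; the helper name `inference_command` is hypothetical, and actually running the command requires DNALLM to be installed.

```python
import shlex


def inference_command(config: str, input_path: str) -> list:
    """Build the argv list for `dnallm inference` using the flags shown above."""
    return ["dnallm", "inference", "--config", config, "--input", input_path]


cmd = inference_command("path/to/config.yaml", "path/to/sequences.txt")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```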

🎯 Supported Task Types

DNALLM supports the following task types:

  • EMBEDDING: Extract embeddings, attention maps, and token probabilities for downstream analysis
  • MASK: Masked language modeling task for pre-training
  • GENERATION: Text generation task for causal language models
  • BINARY: Binary classification task with two possible labels
  • MULTICLASS: Multi-class classification task that assigns each input to exactly one of more than two classes
  • MULTILABEL: Multi-label classification task with multiple binary labels per sample
  • REGRESSION: Regression task that returns a continuous score
  • NER: Token classification task, typically used for Named Entity Recognition
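
The difference between MULTICLASS and MULTILABEL is easiest to see in how labels are encoded. The sketch below is a generic illustration with hypothetical class names and helper functions; the exact label format DNALLM expects in data files may differ.

```python
def multiclass_label(name, classes):
    """Multi-class: exactly one class per sample, encoded as its index."""
    return classes.index(name)


def multilabel_vector(names, classes):
    """Multi-label: any subset of classes, encoded as a 0/1 vector."""
    present = set(names)
    return [1 if c in present else 0 for c in classes]


classes = ["promoter", "enhancer", "silencer"]  # hypothetical class names
print(multiclass_label("enhancer", classes))                 # 1
print(multilabel_vector(["promoter", "silencer"], classes))  # [1, 0, 1]
```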

🧪 Testing

DNALLM includes a comprehensive test suite with 200+ test cases:

# Run all tests
uv run pytest

# Run specific test categories
uv run pytest tests/inference/ -v
uv run pytest tests/mcp/ -v
uv run pytest tests/tasks/ -v

# Run with coverage
uv run pytest --cov=dnallm --cov-report=html

📖 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • Code style and standards
  • Testing requirements
  • Pull request process
  • Development setup

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Hugging Face - Model hosting and transformers library
  • ModelScope - Alternative model repository

📞 Support


DNALLM - Empowering DNA sequence analysis with state-of-the-art language models.
