High-performance genomic sequence classification using minimizer-based k-mer sketching in RY (purine/pyrimidine) space.
Rype is a Rust library and CLI tool for fast sequence classification. It uses a reduced alphabet that collapses purines (A/G) and pyrimidines (T/C), so each base needs only one bit, enabling:
- Mutation tolerance: Purine-purine and pyrimidine-pyrimidine mutations don't break k-mer matches
- Compact representation: 64 bp k-mers fit in a single `u64`
- High performance: Minimizer sketching with O(n) extraction using monotonic deques
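As a toy illustration of why RY space tolerates these mutations (the `ry_bits` helper below is hypothetical, not part of the rype API): a purine-to-purine substitution leaves the reduced sequence, and therefore every k-mer over it, unchanged.

```rust
// Hypothetical sketch: map a DNA sequence to RY (purine/pyrimidine) bits.
// Purines (A/G) -> 1, pyrimidines (T/C) -> 0; anything else is rejected.
fn ry_bits(seq: &[u8]) -> Option<Vec<u8>> {
    seq.iter()
        .map(|b| match b.to_ascii_uppercase() {
            b'A' | b'G' => Some(1),
            b'T' | b'C' => Some(0),
            _ => None, // N etc. would reset the k-mer in a real encoder
        })
        .collect()
}

fn main() {
    let original = b"GATTACA";
    let mutated = b"AATTACA"; // G -> A: a purine-purine substitution
    // Both collapse to the same RY pattern, so k-mer matches survive.
    assert_eq!(ry_bits(original), ry_bits(mutated));
    println!("RY patterns match: {:?}", ry_bits(original).unwrap());
}
```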
```bash
cargo install --path .
```

```bash
# Release build
cargo build --release

# Development build
cargo build
```

```bash
# Create index from reference sequences
rype index create -o index.ryxdi -r ref1.fasta -r ref2.fasta -k 64 -w 50

# Create index with one bucket per sequence
rype index create -o genes.ryxdi -r genes.fasta --separate-buckets

# Build from a TOML configuration file
rype index from-config -c config.toml

# Merge two indices
rype index merge --index-primary idx1.ryxdi --index-secondary idx2.ryxdi -o merged.ryxdi
```

```bash
# Single-end reads
rype classify run -i index.ryxdi -1 reads.fastq -t 0.1

# Paired-end reads
rype classify run -i index.ryxdi -1 reads_R1.fastq -2 reads_R2.fastq -t 0.1

# Best-hit-only classification
rype classify run -i index.ryxdi -1 reads.fastq --best-hit

# Trim reads to first N bases before classification
rype classify run -i index.ryxdi -1 reads.fastq -t 0.1 --trim-to 100

# Host depletion with a negative index
rype classify run -i target.ryxdi -N host.ryxdi -1 reads.fastq -t 0.1

# Sample-level aggregated classification
rype classify aggregate -i index.ryxdi -1 reads.fastq -t 0.05

# Log-ratio scoring with two single-bucket indices
rype classify log-ratio -n numerator.ryxdi -d denominator.ryxdi -1 reads.fastq
```

```bash
# Show index statistics
rype index stats -i index.ryxdi
```

| Subcommand | Description |
|---|---|
| `create` | Build index from FASTA/FASTQ files |
| `from-config` | Build index from TOML configuration file |
| `merge` | Merge two indices (with optional subtraction) |
| `stats` | Show index statistics |
| `bucket-source-detail` | Show source details for a specific bucket |
| `summarize` | Show detailed minimizer statistics (legacy non-Parquet indices only) |
| Subcommand | Description |
|---|---|
| `run` | Per-read classification |
| `aggregate` | Sample-level aggregated classification (alias: `agg`) |
| `log-ratio` | Log-ratio scoring using two single-bucket indices |
Build complex indices using a TOML configuration file:
```toml
[index]
output = "output.ryxdi"
window = 50
salt = 0x5555555555555555  # XOR salt for hashing
# k = 64                   # K-mer size (default: 64)

[buckets.species_a]
files = ["ref_a1.fasta", "ref_a2.fasta"]

[buckets.species_b]
files = ["ref_b.fasta"]
```

```rust
use needletail::parse_fastx_file;
use rype::{
    classify_batch_sharded_merge_join, extract_into, BucketData, MinimizerWorkspace,
    ParquetWriteOptions, ShardedInvertedIndex,
};

fn main() -> anyhow::Result<()> {
    let mut workspace = MinimizerWorkspace::new();
    let k = 32;
    let w = 10;
    let salt = 0u64;

    // Read phiX174 genome and extract minimizers
    let mut phix_mins = Vec::new();
    let mut reader = parse_fastx_file("examples/phiX174.fasta")?;
    while let Some(record) = reader.next() {
        let record = record?;
        extract_into(&record.seq(), k, w, salt, &mut workspace);
        phix_mins.extend(workspace.buffer.drain(..));
    }
    phix_mins.sort();
    phix_mins.dedup();

    // Read pUC19 plasmid and extract minimizers
    let mut puc19_mins = Vec::new();
    let mut reader = parse_fastx_file("examples/pUC19.fasta")?;
    while let Some(record) = reader.next() {
        let record = record?;
        extract_into(&record.seq(), k, w, salt, &mut workspace);
        puc19_mins.extend(workspace.buffer.drain(..));
    }
    puc19_mins.sort();
    puc19_mins.dedup();

    // Build bucket data
    let buckets = vec![
        BucketData {
            bucket_id: 1,
            bucket_name: "phiX174".to_string(),
            sources: vec!["phiX174.fasta".to_string()],
            minimizers: phix_mins,
        },
        BucketData {
            bucket_id: 2,
            bucket_name: "pUC19".to_string(),
            sources: vec!["pUC19.fasta".to_string()],
            minimizers: puc19_mins,
        },
    ];

    // Create index in temporary directory
    let temp_dir = tempfile::tempdir()?;
    let index_path = temp_dir.path().join("example.ryxdi");
    let options = ParquetWriteOptions::default();
    rype::create_parquet_inverted_index(&index_path, buckets, k, w, salt, None, Some(&options), None)?;

    // Load and classify
    let index = ShardedInvertedIndex::open(&index_path)?;

    // Create query from first 100bp of phiX174 - should match bucket 1
    let query_seq = b"GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTT";
    let queries = vec![(1_i64, query_seq.as_slice(), None)];
    let results = classify_batch_sharded_merge_join(&index, None, &queries, 0.1, None)?;
    for hit in &results {
        let name = index.manifest().bucket_names.get(&hit.bucket_id).unwrap();
        println!("Query {} -> {} (score: {:.2})", hit.query_id, name, hit.score);
    }
    Ok(())
}
```

Rype provides a C API for FFI integration:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "rype.h"

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <index.ryxdi>\n", argv[0]);
        return 1;
    }

    // Load index
    RypeIndex* index = rype_index_load(argv[1]);
    if (!index) {
        fprintf(stderr, "Failed to load index: %s\n", rype_get_last_error());
        return 1;
    }

    // Create a query (70bp from phiX174)
    const char* seq = "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT";
    RypeQuery queries[1];
    queries[0].seq = (const uint8_t*)seq;
    queries[0].seq_len = strlen(seq);
    queries[0].pair_seq = NULL;
    queries[0].pair_len = 0;

    // Classify sequences
    RypeResultArray* results = rype_classify(index, queries, 1, 0.1);
    if (!results) {
        fprintf(stderr, "Classification failed: %s\n", rype_get_last_error());
        rype_index_free(index);
        return 1;
    }

    // Process results
    printf("Found %zu hits:\n", results->len);
    for (size_t i = 0; i < results->len; i++) {
        const char* name = rype_bucket_name(index, results->data[i].bucket_id);
        printf("  Query %ld -> %s (bucket %u, score: %.2f)\n",
               results->data[i].query_id,
               name ? name : "unknown",
               results->data[i].bucket_id,
               results->data[i].score);
    }

    // Cleanup
    rype_results_free(results);
    rype_index_free(index);
    return 0;
}
```

Build and run:

```bash
gcc example.c -L target/release -lrype -o example
LD_LIBRARY_PATH=target/release ./example index.ryxdi
```

- Index loading/freeing: NOT thread-safe
- Classification (`rype_classify`): Thread-safe (multiple threads can share the same index)
- Results: NOT thread-safe (each thread needs its own `RypeResultArray`)
The library uses a reduced 2-bit alphabet:
| Base | Category | Encoding |
|---|---|---|
| A, G | Purine | 1 |
| T, C | Pyrimidine | 0 |
| N, etc. | Invalid | Resets k-mer |
- Slide a window of size `w` over k-mers
- Select the minimum hash value within each window
- Deduplicate consecutive identical minimizers
The implementation uses a monotonic deque for O(n) time complexity.
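The sliding-window-minimum core of minimizer selection can be sketched as follows (illustrative only; rype's `extract_into` applies this to salted k-mer hashes in RY space):

```rust
use std::collections::VecDeque;

// Sketch of O(n) sliding-window minimum using a monotonic deque.
// `vals` stands in for k-mer hash values; `w` is the window size.
fn window_minima(vals: &[u64], w: usize) -> Vec<u64> {
    let mut deque: VecDeque<usize> = VecDeque::new(); // indices, values increasing
    let mut minima = Vec::new();
    for (i, &v) in vals.iter().enumerate() {
        // Drop indices that have slid out of the window.
        while deque.front().map_or(false, |&f| f + w <= i) {
            deque.pop_front();
        }
        // Drop larger values from the back; they can never be a future minimum.
        while deque.back().map_or(false, |&b| vals[b] >= v) {
            deque.pop_back();
        }
        deque.push_back(i);
        if i + 1 >= w {
            let m = vals[*deque.front().unwrap()];
            // Deduplicate consecutive identical minimizers.
            if minima.last() != Some(&m) {
                minima.push(m);
            }
        }
    }
    minima
}

fn main() {
    // Hash values for 6 window positions, window of 3 k-mers.
    println!("{:?}", window_minima(&[5, 3, 8, 3, 9, 1], 3)); // -> [3, 1]
}
```

Each index is pushed and popped at most once, which is where the O(n) bound comes from.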
- K = 16, 32, or 64 (stored as `u64`)
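One bit per RY symbol is what lets a full 64-mer occupy a single `u64`; a hypothetical packer (not rype's actual encoder) makes the arithmetic concrete:

```rust
// Hypothetical sketch: pack up to 64 RY symbols (A/G -> 1, T/C -> 0)
// into one u64, one bit per base. Returns None on invalid input.
fn pack_ry_u64(seq: &[u8]) -> Option<u64> {
    if seq.len() > 64 {
        return None; // more than 64 bases cannot fit in a u64
    }
    let mut packed = 0u64;
    for &b in seq {
        let bit = match b.to_ascii_uppercase() {
            b'A' | b'G' => 1u64,
            b'T' | b'C' => 0u64,
            _ => return None, // N etc. invalidates the k-mer
        };
        packed = (packed << 1) | bit;
    }
    Some(packed)
}

fn main() {
    // G, A are purines (1); T, T are pyrimidines (0) -> 0b1100
    println!("{:#b}", pack_ry_u64(b"GATT").unwrap());
}
```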
Rype uses a Parquet-based inverted index format stored as a directory:
```text
index.ryxdi/
├── manifest.toml     # Index metadata (k, w, salt, bucket info)
├── buckets.parquet   # Bucket metadata (id, name, sources)
└── inverted/
    ├── shard.0.parquet  # Inverted index shard (minimizer -> bucket_id pairs)
    └── ...              # Additional shards for large indices
```
This format provides:
- Efficient columnar storage with compression
- Memory-efficient streaming classification
- Support for indices larger than available RAM
- Use `--release` builds for production
- Batch processing amortizes thread pool overhead
- Parquet format enables memory-efficient classification for large indices
```bash
# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test
cargo test test_name
```

```bash
# Enable pre-commit hooks (runs fmt, clippy, tests, doc checks)
git config core.hooksPath .githooks

# Manually run checks
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features
```

The test suite includes automated verification that README examples compile and run correctly (`readme_example_test.rs` and `cli_integration_tests.rs::test_readme_bash_examples`).
Licensed under the BSD 3-Clause License.
Contributions are welcome! Please feel free to submit a Pull Request.