Rype


High-performance genomic sequence classification using minimizer-based k-mer sketching in RY (purine/pyrimidine) space.

Overview

Rype is a Rust library and CLI tool for fast sequence classification. It uses a reduced two-letter alphabet (1 bit per base) that collapses purines (A/G) and pyrimidines (T/C), enabling:

  • Mutation tolerance: Purine-purine and pyrimidine-pyrimidine mutations don't break k-mer matches
  • Compact representation: 64bp k-mers fit in a single u64
  • High performance: Minimizer sketching with O(n) extraction using monotonic deques
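The mutation-tolerance property follows directly from the encoding: each base maps to a single bit, so a purine-to-purine (or pyrimidine-to-pyrimidine) substitution leaves the packed k-mer unchanged. A minimal Rust sketch of the idea (a simplified illustration, not the library's internal implementation):

```rust
/// Pack a DNA sequence into RY (purine/pyrimidine) bit space:
/// A/G (purines) -> 1, C/T (pyrimidines) -> 0. Returns None on an
/// ambiguous base (N etc.), mirroring the "resets k-mer" rule.
fn ry_pack(seq: &[u8]) -> Option<u64> {
    assert!(seq.len() <= 64, "at most 64 bases fit in a u64");
    let mut bits = 0u64;
    for &b in seq {
        let bit: u64 = match b {
            b'A' | b'a' | b'G' | b'g' => 1,
            b'C' | b'c' | b'T' | b't' => 0,
            _ => return None, // invalid base invalidates the k-mer
        };
        bits = (bits << 1) | bit;
    }
    Some(bits)
}

fn main() {
    // A -> G is a purine-purine mutation: identical RY encoding.
    assert_eq!(ry_pack(b"ACGT"), ry_pack(b"GCGT"));
    // A -> C crosses the purine/pyrimidine boundary: different encoding.
    assert_ne!(ry_pack(b"ACGT"), ry_pack(b"CCGT"));
    // Ambiguous bases invalidate the k-mer.
    assert_eq!(ry_pack(b"ACNT"), None);
    println!("RY packing checks passed");
}
```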

Installation

From source

cargo install --path .

Building

# Release build
cargo build --release

# Development build
cargo build

Quick Start

Create an index

# Create index from reference sequences
rype index create -o index.ryxdi -r ref1.fasta -r ref2.fasta -k 64 -w 50

# Create index with one bucket per sequence
rype index create -o genes.ryxdi -r genes.fasta --separate-buckets

# Build from a TOML configuration file
rype index from-config -c config.toml

# Merge two indices
rype index merge --index-primary idx1.ryxdi --index-secondary idx2.ryxdi -o merged.ryxdi

Classify sequences

# Single-end reads
rype classify run -i index.ryxdi -1 reads.fastq -t 0.1

# Paired-end reads
rype classify run -i index.ryxdi -1 reads_R1.fastq -2 reads_R2.fastq -t 0.1

# Best-hit-only classification
rype classify run -i index.ryxdi -1 reads.fastq --best-hit

# Trim reads to first N bases before classification
rype classify run -i index.ryxdi -1 reads.fastq -t 0.1 --trim-to 100

# Host depletion with a negative index
rype classify run -i target.ryxdi -N host.ryxdi -1 reads.fastq -t 0.1

# Sample-level aggregated classification
rype classify aggregate -i index.ryxdi -1 reads.fastq -t 0.05

# Log-ratio scoring with two single-bucket indices
rype classify log-ratio -n numerator.ryxdi -d denominator.ryxdi -1 reads.fastq

Index management

# Show index statistics
rype index stats -i index.ryxdi

CLI Reference

rype index

Subcommand            Description
create                Build index from FASTA/FASTQ files
from-config           Build index from TOML configuration file
merge                 Merge two indices (with optional subtraction)
stats                 Show index statistics
bucket-source-detail  Show source details for a specific bucket
summarize             Show detailed minimizer statistics (legacy non-Parquet indices only)

rype classify

Subcommand  Description
run         Per-read classification
aggregate   Sample-level aggregated classification (alias: agg)
log-ratio   Log-ratio scoring using two single-bucket indices

Configuration File

Build complex indices using a TOML configuration file:

[index]
output = "output.ryxdi"
window = 50
salt = 0x5555555555555555    # XOR salt for hashing
# k = 64                    # K-mer size (default: 64)

[buckets.species_a]
files = ["ref_a1.fasta", "ref_a2.fasta"]

[buckets.species_b]
files = ["ref_b.fasta"]

Library Usage

use needletail::parse_fastx_file;
use rype::{
    classify_batch_sharded_merge_join, extract_into, BucketData, MinimizerWorkspace,
    ParquetWriteOptions, ShardedInvertedIndex,
};

fn main() -> anyhow::Result<()> {
    let mut workspace = MinimizerWorkspace::new();
    let k = 32;
    let w = 10;
    let salt = 0u64;

    // Read phiX174 genome and extract minimizers
    let mut phix_mins = Vec::new();
    let mut reader = parse_fastx_file("examples/phiX174.fasta")?;
    while let Some(record) = reader.next() {
        let record = record?;
        extract_into(&record.seq(), k, w, salt, &mut workspace);
        phix_mins.extend(workspace.buffer.drain(..));
    }
    phix_mins.sort();
    phix_mins.dedup();

    // Read pUC19 plasmid and extract minimizers
    let mut puc19_mins = Vec::new();
    let mut reader = parse_fastx_file("examples/pUC19.fasta")?;
    while let Some(record) = reader.next() {
        let record = record?;
        extract_into(&record.seq(), k, w, salt, &mut workspace);
        puc19_mins.extend(workspace.buffer.drain(..));
    }
    puc19_mins.sort();
    puc19_mins.dedup();

    // Build bucket data
    let buckets = vec![
        BucketData {
            bucket_id: 1,
            bucket_name: "phiX174".to_string(),
            sources: vec!["phiX174.fasta".to_string()],
            minimizers: phix_mins,
        },
        BucketData {
            bucket_id: 2,
            bucket_name: "pUC19".to_string(),
            sources: vec!["pUC19.fasta".to_string()],
            minimizers: puc19_mins,
        },
    ];

    // Create index in temporary directory
    let temp_dir = tempfile::tempdir()?;
    let index_path = temp_dir.path().join("example.ryxdi");
    let options = ParquetWriteOptions::default();
    rype::create_parquet_inverted_index(&index_path, buckets, k, w, salt, None, Some(&options), None)?;

    // Load and classify
    let index = ShardedInvertedIndex::open(&index_path)?;

    // Create query from first 100bp of phiX174 - should match bucket 1
    let query_seq = b"GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTT";
    let queries = vec![(1_i64, query_seq.as_slice(), None)];

    let results = classify_batch_sharded_merge_join(&index, None, &queries, 0.1, None)?;
    for hit in &results {
        let name = index.manifest().bucket_names.get(&hit.bucket_id).unwrap();
        println!("Query {} -> {} (score: {:.2})", hit.query_id, name, hit.score);
    }

    Ok(())
}

C API

Rype provides a C API for FFI integration:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "rype.h"

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <index.ryxdi>\n", argv[0]);
        return 1;
    }

    // Load index
    RypeIndex* index = rype_index_load(argv[1]);
    if (!index) {
        fprintf(stderr, "Failed to load index: %s\n", rype_get_last_error());
        return 1;
    }

    // Create a query (70bp from phiX174)
    const char* seq = "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT";
    RypeQuery queries[1];
    queries[0].seq = (const uint8_t*)seq;
    queries[0].seq_len = strlen(seq);
    queries[0].pair_seq = NULL;
    queries[0].pair_len = 0;

    // Classify sequences
    RypeResultArray* results = rype_classify(index, queries, 1, 0.1);
    if (!results) {
        fprintf(stderr, "Classification failed: %s\n", rype_get_last_error());
        rype_index_free(index);
        return 1;
    }

    // Process results
    printf("Found %zu hits:\n", results->len);
    for (size_t i = 0; i < results->len; i++) {
        const char* name = rype_bucket_name(index, results->data[i].bucket_id);
        printf("  Query %ld -> %s (bucket %u, score: %.2f)\n",
               results->data[i].query_id,
               name ? name : "unknown",
               results->data[i].bucket_id,
               results->data[i].score);
    }

    // Cleanup
    rype_results_free(results);
    rype_index_free(index);
    return 0;
}

Build and run:

gcc example.c -L target/release -lrype -o example
LD_LIBRARY_PATH=target/release ./example index.ryxdi

Thread Safety

  • Index loading/freeing: NOT thread-safe
  • Classification (rype_classify): Thread-safe (multiple threads can share the same index)
  • Results: NOT thread-safe (each thread needs its own RypeResultArray)

Algorithm Details

RY Encoding

The library uses a reduced two-letter alphabet, encoded at 1 bit per base:

Base      Category    Encoding
A, G      Purine      1
T, C      Pyrimidine  0
N, other  Invalid     Resets the current k-mer

Minimizer Extraction

  1. Slide a window of size w over k-mers
  2. Select the minimum hash value within each window
  3. Deduplicate consecutive identical minimizers

The implementation uses a monotonic deque for O(n) time complexity.
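The monotonic-deque technique can be sketched generically (this is an illustration of the algorithm, not Rype's actual code): the deque holds indices whose values are strictly increasing, so the window minimum is always at the front, and each index is pushed and popped at most once, giving O(n) overall.

```rust
use std::collections::VecDeque;

/// Sliding-window minimum over `vals` with window size `w`, using a
/// monotonic deque of indices. Each index enters and leaves the deque
/// at most once, so the whole pass is O(n). Consecutive duplicate
/// minimizers are skipped, mirroring the deduplication step above.
fn window_minima(vals: &[u64], w: usize) -> Vec<u64> {
    let mut deque: VecDeque<usize> = VecDeque::new();
    let mut out = Vec::new();
    for i in 0..vals.len() {
        // Drop indices that fell out of the current window.
        while deque.front().map_or(false, |&f| f + w <= i) {
            deque.pop_front();
        }
        // Drop candidates that can never be the minimum again.
        while deque.back().map_or(false, |&b| vals[b] >= vals[i]) {
            deque.pop_back();
        }
        deque.push_back(i);
        if i + 1 >= w {
            let min = vals[*deque.front().unwrap()];
            if out.last() != Some(&min) {
                out.push(min); // dedup consecutive identical minimizers
            }
        }
    }
    out
}

fn main() {
    let hashes = [5u64, 3, 8, 3, 9, 1, 7];
    // Window minima of size 3 are 3, 3, 3, 1, 1 -> deduplicated to [3, 1].
    assert_eq!(window_minima(&hashes, 3), vec![3, 1]);
    println!("{:?}", window_minima(&hashes, 3));
}
```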

Supported K-mer Sizes

  • K = 16, 32, or 64 (stored as u64)

Index Format

Rype uses a Parquet-based inverted index format stored as a directory:

index.ryxdi/
├── manifest.toml         # Index metadata (k, w, salt, bucket info)
├── buckets.parquet       # Bucket metadata (id, name, sources)
└── inverted/
    ├── shard.0.parquet   # Inverted index shard (minimizer -> bucket_id pairs)
    └── ...               # Additional shards for large indices

This format provides:

  • Efficient columnar storage with compression
  • Memory-efficient streaming classification
  • Support for indices larger than available RAM
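Streaming classification over this layout can be pictured as a merge join: the query's sorted minimizers are walked in lockstep with a shard's sorted (minimizer, bucket_id) pairs, counting hits per bucket. The sketch below assumes score = matched minimizers / total query minimizers and a user threshold like the CLI's -t flag; it is a simplified illustration, not Rype's actual implementation.

```rust
use std::collections::HashMap;

/// Merge-join sorted query minimizers against a sorted inverted-index
/// shard of (minimizer, bucket_id) pairs, counting hits per bucket.
/// Buckets scoring below `threshold` are dropped; results are sorted
/// by descending score.
fn classify(query: &[u64], shard: &[(u64, u32)], threshold: f64) -> Vec<(u32, f64)> {
    let mut hits: HashMap<u32, usize> = HashMap::new();
    let (mut i, mut j) = (0, 0);
    while i < query.len() && j < shard.len() {
        match query[i].cmp(&shard[j].0) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                // One query minimizer may occur in several buckets.
                let m = query[i];
                while j < shard.len() && shard[j].0 == m {
                    *hits.entry(shard[j].1).or_insert(0) += 1;
                    j += 1;
                }
                i += 1;
            }
        }
    }
    let total = query.len() as f64;
    let mut scored: Vec<(u32, f64)> = hits
        .into_iter()
        .map(|(b, n)| (b, n as f64 / total))
        .filter(|&(_, s)| s >= threshold)
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored
}

fn main() {
    let query = [2u64, 5, 9, 14];
    let shard = [(2u64, 1u32), (5, 1), (5, 2), (11, 2), (14, 1)];
    // Bucket 1 matches 3/4 minimizers, bucket 2 only 1/4.
    assert_eq!(classify(&query, &shard, 0.5), vec![(1, 0.75)]);
    println!("{:?}", classify(&query, &shard, 0.5));
}
```

Because both sides are already sorted, only one shard needs to be resident at a time, which is what makes classification against indices larger than RAM possible.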

Performance Tips

  • Use --release builds for production
  • Batch processing amortizes thread pool overhead
  • Parquet format enables memory-efficient classification for large indices

Testing

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test
cargo test test_name

Development Setup

# Enable pre-commit hooks (runs fmt, clippy, tests, doc checks)
git config core.hooksPath .githooks

# Manually run checks
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --all-features

The test suite includes automated verification that README examples compile and run correctly (readme_example_test.rs and cli_integration_tests.rs::test_readme_bash_examples).

License

Licensed under the BSD 3-Clause License.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
