Touchstone-rs

The live leaderboard at wenig.github.io/touchstone-rs ranks every contributed algorithm across all 20 benchmark datasets, updated automatically on every merge. Submit a pull request and your detector appears on the board.

Benchmarking streaming anomaly detectors is tedious: loading dozens of CSV files, wiring up metrics, normalizing scores, and collecting results into a comparable table all take time away from the detector itself. Touchstone-rs handles that scaffolding so you can focus on the algorithm.

Point it at a directory of CSVs, register one or more detectors, call run(), and get back a Polars DataFrame with one row per (dataset, detector) pair covering 14 standard metrics.

Touchstone-rs is designed in the spirit of TimeEval [2], a Python benchmarking toolkit for time series anomaly detection algorithms. If you are looking for ready-made datasets, the large collection from the TimeEval evaluation paper [1] is available on the TimeEval Datasets page and is already formatted for direct use with Touchstone-rs.

Quickstart

Add to Cargo.toml:

[dependencies]
touchstone-rs = "0.1"

Implementing the `Detector` Trait

Every algorithm plugs in through a single trait:

pub trait Detector: Send {
    fn name() -> &'static str where Self: Sized;
    fn new(n_dimensions: usize) -> Self where Self: Sized;
    fn update(&mut self, point: &[f32]) -> f32;
}

name() returns the display name used in the results DataFrame and comparison tables.
point is a slice of f32 features for the current time step. The length matches the number of feature columns in the dataset.
Return an anomaly score as f32. Higher values mean more anomalous.
Return f32::NAN during warmup or whenever a score is not yet meaningful. NaN points are excluded from metric computation.
Scores are minmax-normalized to [0, 1] before any metric is computed, so the absolute scale of your scores does not matter.
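
To make the last two points concrete, the sketch below drops NaN scores together with their labels and then minmax-normalizes what remains. The helper name `normalize_scores` and the choice to map constant scores to 0.0 are assumptions made for illustration; this is not the library's actual implementation.

```rust
/// Hypothetical helper (not part of the touchstone-rs API): removes NaN
/// scores together with their labels, then rescales the rest to [0, 1].
fn normalize_scores(labels: &[u8], scores: &[f32]) -> (Vec<u8>, Vec<f32>) {
    // Keep only the positions where the detector produced a real score.
    let (kept_labels, kept_scores): (Vec<u8>, Vec<f32>) = labels
        .iter()
        .zip(scores.iter())
        .filter(|(_, s)| !s.is_nan())
        .map(|(l, s)| (*l, *s))
        .unzip();

    let min = kept_scores.iter().copied().fold(f32::INFINITY, f32::min);
    let max = kept_scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;

    // Assumption: constant scores carry no ranking information, so map them to 0.0.
    let normalized: Vec<f32> = kept_scores
        .iter()
        .map(|s| if range > 0.0 { (s - min) / range } else { 0.0 })
        .collect();

    (kept_labels, normalized)
}
```

Because only the relative order and spacing of scores survive this step, a detector that emits raw distances and one that emits probabilities are evaluated on the same footing.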

Running an Evaluation

use std::path::Path;
use touchstone_rs::{Detector, Touchstone};

struct MyDetector { n_dimensions: usize }

impl Detector for MyDetector {
    fn name() -> &'static str { "MyDetector-v1" }

    fn new(n_dimensions: usize) -> Self {
        MyDetector { n_dimensions }
    }

    fn update(&mut self, point: &[f32]) -> f32 {
        // compute and return anomaly score
        0.5
    }
}

fn main() {
    let mut experiment = Touchstone::new(Path::new("data"));

    // `new(n_dimensions)` receives the dataset's feature count at runtime;
    // use it to size internal buffers to match.
    experiment.add_detector::<MyDetector>();

    let df = experiment.run().unwrap();
    println!("{df}");
}
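
The placeholder detector above returns a constant score. As a slightly more realistic sketch, the detector below keeps an exponentially weighted mean per dimension, sized from `n_dimensions`, and scores each point by its Euclidean distance to that mean. The type `EwmaDistance`, the smoothing factor, and the warm-up length are all hypothetical choices. The trait is copied locally only so the snippet compiles on its own.

```rust
// Local copy of the trait shown earlier so the sketch is self-contained;
// in a real project, write `use touchstone_rs::Detector;` instead.
pub trait Detector: Send {
    fn name() -> &'static str where Self: Sized;
    fn new(n_dimensions: usize) -> Self where Self: Sized;
    fn update(&mut self, point: &[f32]) -> f32;
}

const ALPHA: f32 = 0.1;   // assumed smoothing factor
const WARMUP: usize = 10; // assumed warm-up length

/// Hypothetical detector: distance of each point to an exponentially
/// weighted mean of the points seen before it.
struct EwmaDistance {
    mean: Vec<f32>, // one running mean per feature dimension
    seen: usize,    // number of points processed so far
}

impl Detector for EwmaDistance {
    fn name() -> &'static str { "EwmaDistance-sketch" }

    fn new(n_dimensions: usize) -> Self {
        // Size the internal buffer from the dataset's feature count.
        EwmaDistance { mean: vec![0.0; n_dimensions], seen: 0 }
    }

    fn update(&mut self, point: &[f32]) -> f32 {
        // Score against the past before absorbing the current point.
        let dist = point
            .iter()
            .zip(&self.mean)
            .map(|(x, m)| (x - m).powi(2))
            .sum::<f32>()
            .sqrt();

        // The first point initialises the mean (alpha = 1); later points blend in.
        let alpha = if self.seen == 0 { 1.0 } else { ALPHA };
        for (m, x) in self.mean.iter_mut().zip(point) {
            *m += alpha * (x - *m);
        }
        self.seen += 1;

        // NaN during warm-up, so these points are excluded from the metrics.
        if self.seen <= WARMUP { f32::NAN } else { dist }
    }
}
```

Registering it would follow the same pattern as above, that is, `experiment.add_detector::<EwmaDistance>()`.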

Output DataFrame

run() returns a DataFrame with this schema:

column	type	description
`dataset`	String	dataset filename (without extension)
`detector`	String	name returned by the detector's `name()`
`roc_auc`	f64	ROC-AUC
`pr_auc`	f64	Precision-Recall AUC
`average_precision`	f64	Average Precision
`precision`	f64	Precision at 90th-percentile threshold
`recall`	f64	Recall at 90th-percentile threshold
`f1`	f64	F1 at 90th-percentile threshold
`range_precision`	f64	Range-based Precision (Tatbul et al., NeurIPS 2018)
`range_recall`	f64	Range-based Recall
`range_f_score`	f64	Range-based F-score
`range_auc`	f64	Range-based AUC
`range_pr_vus`	f64	PR-VUS (Paparrizos et al., PVLDB 2022)
`range_roc_vus`	f64	ROC-VUS
`time_sec`	f64	wall-clock seconds for this detector on this dataset

If a dataset fails to load or a detector produces only NaN scores, the metric columns for that row contain NaN.
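
To illustrate what "at 90th-percentile threshold" means for the `precision`, `recall`, and `f1` columns, here is a minimal sketch. It assumes a nearest-rank percentile and flags every score at or above the threshold; the library may compute the percentile or treat ties differently, and the helper name `prf_at_p90` is hypothetical.

```rust
/// Hypothetical helper: precision, recall, and F1 when every score at or
/// above the 90th-percentile score is flagged as anomalous.
/// Assumes `scores` contains no NaN and has the same length as `labels`.
fn prf_at_p90(labels: &[u8], scores: &[f32]) -> (f64, f64, f64) {
    if scores.is_empty() {
        return (f64::NAN, f64::NAN, f64::NAN);
    }
    let mut sorted = scores.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank percentile: the ceil(0.9 * n)-th smallest value (1-based).
    let rank = ((0.9 * sorted.len() as f64).ceil() as usize).max(1);
    let threshold = sorted[rank - 1];

    // Count true positives, false positives, and false negatives.
    let (mut tp, mut fp, mut fn_) = (0u32, 0u32, 0u32);
    for (&label, &score) in labels.iter().zip(scores) {
        match (score >= threshold, label == 1) {
            (true, true) => tp += 1,
            (true, false) => fp += 1,
            (false, true) => fn_ += 1,
            (false, false) => {}
        }
    }

    // Assumption: a ratio with an empty denominator is reported as 0.0.
    let ratio = |num: u32, den: u32| if den == 0 { 0.0 } else { num as f64 / den as f64 };
    let precision = ratio(tp, tp + fp);
    let recall = ratio(tp, tp + fn_);
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

A percentile threshold flags a fixed share of points regardless of score scale, which is why these three columns stay comparable across detectors.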

Custom Metrics

If the default metric set does not suit your needs, swap it out entirely by adding metrics before calling run():

use std::path::Path;
use touchstone_rs::{Detector, Touchstone};
use touchstone_rs::metrics::{RocAuc, F1Score, SigmaThreshold};

struct MyDetector { n_dims: usize }

impl Detector for MyDetector {
    fn name() -> &'static str { "MyDetector" }
    fn new(n_dims: usize) -> Self { MyDetector { n_dims } }
    fn update(&mut self, _: &[f32]) -> f32 { 0.5 }
}

fn main() {
    let mut experiment = Touchstone::new(Path::new("data"));
    experiment.add_detector::<MyDetector>();
    // Register the metrics to compute in place of the default set.
    experiment.add_metric(RocAuc);
    experiment.add_metric(F1Score::new(SigmaThreshold(3.0)));

    let df = experiment.run().unwrap();
    println!("{df}");
}

You can also implement Metric directly for fully custom scoring:

use touchstone_rs::metrics::Metric;

struct MyMetric;

impl Metric for MyMetric {
    fn name(&self) -> &str { "my_metric" }
    fn score(&self, labels: &[u8], scores: &[f32]) -> f64 {
        // labels: 0 = normal, 1 = anomaly
        // scores: minmax-normalized to [0, 1], NaN already removed
        todo!()
    }
}
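
As one possible way to fill in the `todo!()` above, the sketch below reports the gap between the mean score of anomalous points and the mean score of normal points. The metric `MeanScoreGap` is a hypothetical example, and the trait is redeclared locally with the signature shown above only so the snippet compiles on its own; the real trait may carry additional bounds.

```rust
// Local copy of the trait signature shown above; in a real project,
// write `use touchstone_rs::metrics::Metric;` instead.
pub trait Metric {
    fn name(&self) -> &str;
    fn score(&self, labels: &[u8], scores: &[f32]) -> f64;
}

/// Hypothetical metric: mean score of anomalous points minus mean score
/// of normal points. Lies in [-1, 1] for scores normalized to [0, 1].
struct MeanScoreGap;

impl Metric for MeanScoreGap {
    fn name(&self) -> &str { "mean_score_gap" }

    fn score(&self, labels: &[u8], scores: &[f32]) -> f64 {
        let (mut sum_anom, mut n_anom) = (0.0_f64, 0u32);
        let (mut sum_norm, mut n_norm) = (0.0_f64, 0u32);
        for (&label, &score) in labels.iter().zip(scores) {
            if label == 1 {
                sum_anom += score as f64;
                n_anom += 1;
            } else {
                sum_norm += score as f64;
                n_norm += 1;
            }
        }
        // The gap is undefined when either class is missing.
        if n_anom == 0 || n_norm == 0 {
            return f64::NAN;
        }
        sum_anom / n_anom as f64 - sum_norm / n_norm as f64
    }
}
```

Assuming `add_metric` accepts any `Metric` implementor, as the built-in examples suggest, this would be registered with `experiment.add_metric(MeanScoreGap)`.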

Dataset Format

Datasets are CSV files with no assumed column names:

timestamp, feature_1, ..., feature_N, label
2016-04-20 10:35:12, 1.2, 3.4, 0
2016-04-20 10:35:13, 5.6, 7.8, 1

Column 1: timestamp | parsed but ignored
Columns 2 ... N+1: features | cast to f32, passed as point to update()
Last column: binary anomaly label | 0 (normal) or 1 (anomaly)

Touchstone-rs passes every row to update() in order, simulating a streaming environment. Each detector gets a fresh instance per dataset.
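
The column layout above can be sketched as a small row parser. This is an illustration of the format only, using the standard library and a hypothetical function name `parse_row`; it is not the loader Touchstone-rs uses internally.

```rust
/// Hypothetical parser for one data row: skips the timestamp, casts the
/// middle columns to f32 features, and reads the last column as the label.
/// Returns None for rows that do not fit the format (such as a header row).
fn parse_row(line: &str) -> Option<(Vec<f32>, u8)> {
    let fields: Vec<&str> = line.split(',').map(str::trim).collect();
    if fields.len() < 3 {
        return None; // need at least a timestamp, one feature, and a label
    }

    // Last column: binary label, 0 (normal) or 1 (anomaly).
    let label: u8 = fields[fields.len() - 1].parse().ok()?;
    if label > 1 {
        return None;
    }

    // Columns between the timestamp and the label: features as f32.
    let features = fields[1..fields.len() - 1]
        .iter()
        .map(|f| f.parse::<f32>().ok())
        .collect::<Option<Vec<f32>>>()?;

    Some((features, label))
}
```

The returned feature vector corresponds to the `point` slice a detector receives, and its length corresponds to the `n_dimensions` value passed to `new`.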

Python Bindings (touchstone-py)

touchstone-py exposes the same benchmarking engine as a Python package via PyO3, returning results as a Polars DataFrame.

Installation

pip install touchstone-py

Or build from source (requires maturin and Rust):

cd touchstone-py
pip install maturin
maturin develop --release

Implementing a Detector

Subclass Detector and implement three members:

from touchstone_py import Detector

class MyDetector(Detector):
    @classmethod
    def name(cls) -> str:
        return "MyDetector-v1"

    def __init__(self, n_dimensions: int) -> None:
        # called once per dataset; size internal state to n_dimensions
        self.n_dimensions = n_dimensions

    def update(self, point: list[float]) -> float:
        # called once per data point in arrival order
        # return a higher score for more anomalous points
        # return float('nan') during warm-up — NaN scores are excluded from metrics
        return 0.5

name() — display name shown in the results DataFrame.
__init__(n_dimensions) — receives the dataset's feature count; called once per dataset.
update(point) — receives the current feature vector; return an anomaly score (float). Scores are minmax-normalized to [0, 1] before metrics are computed, so absolute scale does not matter.

Running an Evaluation

from pathlib import Path
from touchstone_py import run_touchstone, Detector

class MyDetector(Detector):
    @classmethod
    def name(cls) -> str:
        return "MyDetector-v1"

    def __init__(self, n_dimensions: int) -> None:
        self.window: list[float] = []

    def update(self, point: list[float]) -> float:
        self.window.append(point[0])
        if len(self.window) < 20:
            return float("nan")
        mean = sum(self.window) / len(self.window)
        return abs(point[0] - mean)

df = run_touchstone(Path("data"), [MyDetector])
print(df)

run_touchstone accepts a directory of CSV datasets and a list of detector classes (not instances). It returns a Polars DataFrame with the same schema as the Rust API — one row per (dataset, detector) pair with all 14 metrics plus time_sec.

Benchmark Dataset Selection

Evaluating against all 976 TimeEval datasets is expensive and often redundant: many datasets exercise the same failure modes. To address this, we use a data-driven selection procedure to pick a diverse, representative subset.

We extract a feature vector for each dataset describing its performance profile across all 60 algorithms, covering five metrics: ROC-AUC, PR-AUC, Range PR-AUC, Average Precision, and execution time. We then cluster the univariate (902 datasets) and multivariate (74 datasets) subsets independently using k-medoids with k=10 for each, and select the medoid from each cluster as the representative. After filtering for public availability, this yields 19 datasets (10 univariate, 9 multivariate from 6 distinct collections).
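
The representative-selection step can be sketched as follows: within one cluster, the medoid is the member with the smallest total distance to all other members. This shows only the medoid choice, not the full k-medoids clustering, and it assumes Euclidean distance between feature vectors, which the description above does not specify. The function names are hypothetical.

```rust
/// Euclidean distance between two equally long feature vectors.
fn distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// Hypothetical helper: index of the medoid of `cluster`, i.e. the member
/// with the smallest summed distance to every other member.
/// Returns None for an empty cluster.
fn medoid(cluster: &[Vec<f64>]) -> Option<usize> {
    // Total distance from member `k` to all members of the cluster.
    let cost = |k: usize| -> f64 {
        cluster.iter().map(|p| distance(&cluster[k], p)).sum::<f64>()
    };
    (0..cluster.len()).min_by(|&i, &j| cost(i).total_cmp(&cost(j)))
}
```

Unlike a centroid, a medoid is always an actual member of the cluster, which is what makes it usable as a concrete benchmark dataset.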

The selection provides two key guarantees. First, diversity: each chosen dataset represents a distinct performance pattern, avoiding redundant evaluation on algorithmically similar benchmarks. Second, balance: univariate and multivariate time series are represented in nearly equal numbers (10 and 9 datasets, respectively) despite the heavy imbalance in the full collection (902 versus 74).

We additionally include the CoMuT synthetic dataset, which is purpose-designed to surface correlation anomalies, that is, anomalies that are invisible in individual channels and visible only through multivariate relationships. This complements TimeEval's focus on point and subsequence anomalies. The resulting 20-dataset benchmark reduces computational cost by roughly 98% (20 datasets instead of 976, assuming comparable cost per dataset) while covering both standard detection patterns and correlation-based detection. This enables fast iteration during development while maintaining statistical robustness for final validation.

See DATASETS.md for the complete list of benchmark datasets and their sources.

Running the Built-in Example

cargo run --example normal_distribution_detector

This runs a rolling z-score detector (window = 20) against all datasets in data/ and prints the results.
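
For readers who want the idea without opening the example file, a rolling z-score can be sketched as below. This is an independent, univariate illustration and not the code of the bundled example; the type name `RollingZScore` and its standalone `new`/`update` methods are assumptions, and a window of 20 would be created with `RollingZScore::new(20)`.

```rust
use std::collections::VecDeque;

/// Hypothetical univariate rolling z-score over the last `capacity` values.
struct RollingZScore {
    window: VecDeque<f32>,
    capacity: usize,
}

impl RollingZScore {
    fn new(capacity: usize) -> Self {
        RollingZScore { window: VecDeque::with_capacity(capacity), capacity }
    }

    /// Scores `x` against the values currently in the window, then stores it.
    fn update(&mut self, x: f32) -> f32 {
        let score = if self.window.len() < self.capacity {
            f32::NAN // warm-up: the window is not yet full
        } else {
            let n = self.window.len() as f32;
            let mean = self.window.iter().sum::<f32>() / n;
            let var = self.window.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
            // A constant window has zero variance; report 0.0 rather than dividing by zero.
            if var > 0.0 { ((x - mean) / var.sqrt()).abs() } else { 0.0 }
        };

        // Slide the window: evict the oldest value once it is full.
        if self.window.len() == self.capacity {
            self.window.pop_front();
        }
        self.window.push_back(x);
        score
    }
}
```

Scoring before inserting keeps the current point out of its own reference statistics, so a large spike is not dampened by its own contribution to the mean.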

Contributing an Algorithm

Touchstone-rs accepts new streaming anomaly detectors via pull request. Rust detectors live as their own crate under algorithms/rust/ and are picked up automatically by the workspace; Python detectors live under algorithms/python/. See ADD_ALGORITHM.md for the step-by-step workflow (fork, implement, open PR) and how CI validates submissions.

References

If you use Touchstone-rs or the TimeEval dataset collection in your work, please cite:

[1] Dataset collection and evaluation methodology

@article{SchmidlEtAl2022Anomaly,
  title = {Anomaly Detection in Time Series: A Comprehensive Evaluation},
  author = {Schmidl, Sebastian and Wenig, Phillip and Papenbrock, Thorsten},
  date = {2022},
  journaltitle = {Proceedings of the VLDB Endowment (PVLDB)},
  volume = {15},
  number = {9},
  pages = {1779--1797},
  doi = {10.14778/3538598.3538602}
}

[2] TimeEval benchmarking toolkit

@article{WenigEtAl2022TimeEval,
  title = {TimeEval: A Benchmarking Toolkit for Time Series Anomaly Detection Algorithms},
  author = {Wenig, Phillip and Schmidl, Sebastian and Papenbrock, Thorsten},
  date = {2022},
  journaltitle = {Proceedings of the VLDB Endowment (PVLDB)},
  volume = {15},
  number = {12},
  pages = {3678--3681},
  doi = {10.14778/3554821.3554873}
}

[3] Touchstone-rs

@software{TouchstoneRs,
  title = {Touchstone-rs: A Rust Library for Benchmarking Streaming Anomaly Detectors},
  author = {Wenig, Phillip},
  date = {2026},
  url = {https://github.com/wenig/touchstone-rs}
}

License

The MIT license applies solely to the source code in the touchstone-rs and touchstone-py directories. All other files in this repository — including algorithms/, benchmark results, datasets, scripts, and documentation — are excluded from this license.

The TiRex algorithm in algorithms/python/tirex/ is based on materials from NXAI GmbH and is governed by the NXAI Community License. Built with technology from NXAI.

The TinyTimeMixer algorithm in algorithms/python/tinytimemixer/ contains code vendored from ibm-granite/granite-tsfm, copyright contributors to the TSFM project (IBM Corporation), and is governed by the Apache License 2.0.

The Toto algorithm in algorithms/python/toto_4m/ uses the toto-2 package by Datadog, Inc., which is licensed under the Apache License 2.0.

The benchmark datasets are sourced from the TimeEval evaluation paper dataset collection. The MIT License applies to preprocessed TimeEval datasets where not otherwise stated by the original source. See DATASETS.md for details.
