The live leaderboard at wenig.github.io/touchstone-rs ranks every contributed algorithm across all 20 benchmark datasets, updated automatically on every merge. Submit a pull request and your detector appears on the board.
Benchmarking streaming anomaly detectors is tedious: loading dozens of CSV files, wiring up metrics, normalizing scores, and collecting results into a comparable table all take time away from the detector itself. Touchstone-rs handles that scaffolding so you can focus on the algorithm.
Point it at a directory of CSVs, register one or more detectors, call run(), and get back a Polars DataFrame with one row per (dataset, detector) pair covering 14 standard metrics.
Touchstone-rs is designed in the spirit of TimeEval [2], a Python benchmarking toolkit for time series anomaly detection algorithms. If you are looking for ready-made datasets, the TimeEval evaluation paper [1] provides a large collection already formatted for direct use with Touchstone-rs at the TimeEval Datasets page.
Add to Cargo.toml:
[dependencies]
touchstone-rs = "0.1"Every algorithm plugs in through a single trait:
pub trait Detector: Send {
fn name() -> &'static str where Self: Sized;
fn new(n_dimensions: usize) -> Self where Self: Sized;
fn update(&mut self, point: &[f32]) -> f32;
}name()returns the display name used in the results DataFrame and comparison tables.pointis a slice off32features for the current time step. The length matches the number of feature columns in the dataset.- Return an anomaly score as
f32. Higher values mean more anomalous. - Return
f32::NANduring warmup or whenever a score is not yet meaningful. NaN points are excluded from metric computation. - Scores are minmax-normalized to
[0, 1]before any metric is computed, so the absolute scale of your scores does not matter.
use std::path::Path;
use touchstone_rs::{Detector, Touchstone};
struct MyDetector { n_dimensions: usize }
impl Detector for MyDetector {
fn name() -> &'static str { "MyDetector-v1" }
fn new(n_dimensions: usize) -> Self {
MyDetector { n_dimensions }
}
fn update(&mut self, point: &[f32]) -> f32 {
// compute and return anomaly score
0.5
}
}
fn main() {
let mut experiment = Touchstone::new(Path::new("data"));
// `new(n_dimensions)` receives the dataset's feature count at runtime,
// use it to size internal buffers to match.
experiment.add_detector::<MyDetector>();
let df = experiment.run().unwrap();
println!("{df}");
}run() returns a DataFrame with this schema:
| column | type | description |
|---|---|---|
dataset |
String | dataset filename (without extension) |
detector |
String | name passed to add_detector |
roc_auc |
f64 | ROC-AUC |
pr_auc |
f64 | Precision-Recall AUC |
average_precision |
f64 | Average Precision |
precision |
f64 | Precision at 90th-percentile threshold |
recall |
f64 | Recall at 90th-percentile threshold |
f1 |
f64 | F1 at 90th-percentile threshold |
range_precision |
f64 | Range-based Precision (Tatbul et al., NeurIPS 2018) |
range_recall |
f64 | Range-based Recall |
range_f_score |
f64 | Range-based F-score |
range_auc |
f64 | Range-based AUC |
range_pr_vus |
f64 | PR-VUS (Paparrizos et al., PVLDB 2022) |
range_roc_vus |
f64 | ROC-VUS |
time_sec |
f64 | wall-clock seconds for this detector on this dataset |
If a dataset fails to load or a detector produces only NaN scores, the metric columns for that row contain NaN.
If the default metric set does not suit your needs, swap it out entirely by adding metrics before calling run():
use std::path::Path;
use touchstone_rs::{Detector, Touchstone};
use touchstone_rs::metrics::{RocAuc, F1Score, SigmaThreshold};
# struct MyDetector { n_dims: usize }
# impl Detector for MyDetector { fn name() -> &'static str { "MyDetector" } fn new(n_dims: usize) -> Self { MyDetector { n_dims } } fn update(&mut self, _: &[f32]) -> f32 { 0.5 } }
let mut experiment = Touchstone::new(Path::new("data"));
experiment.add_detector::<MyDetector>();
experiment.add_metric(RocAuc);
experiment.add_metric(F1Score::new(SigmaThreshold(3.0)));You can also implement Metric directly for fully custom scoring:
use touchstone_rs::metrics::Metric;
struct MyMetric;
impl Metric for MyMetric {
fn name(&self) -> &str { "my_metric" }
fn score(&self, labels: &[u8], scores: &[f32]) -> f64 {
// labels: 0 = normal, 1 = anomaly
// scores: minmax-normalized to [0, 1], NaN already removed
todo!()
}
}Datasets are CSV files with no assumed column names:
timestamp, feature_1, ..., feature_N, label
2016-04-20 10:35:12, 1.2, 3.4, 0
2016-04-20 10:35:13, 5.6, 7.8, 1
- Column 1: timestamp | parsed but ignored
- Columns 2 ... N: features | cast to
f32, passed aspointtoupdate() - Last column: binary anomaly label |
0(normal) or1(anomaly)
Touchstone-rs passes every row to update() in order, simulating a streaming environment. Each detector gets a fresh instance per dataset.
touchstone-py exposes the same benchmarking engine as a Python package via PyO3, returning results as a Polars DataFrame.
pip install touchstone-pyOr build from source (requires maturin and Rust):
cd touchstone-py
pip install maturin
maturin develop --releaseSubclass Detector and implement three members:
from touchstone_py import Detector
class MyDetector(Detector):
@classmethod
def name(cls) -> str:
return "MyDetector-v1"
def __init__(self, n_dimensions: int) -> None:
# called once per dataset; size internal state to n_dimensions
self.n_dimensions = n_dimensions
def update(self, point: list[float]) -> float:
# called once per data point in arrival order
# return a higher score for more anomalous points
# return float('nan') during warm-up — NaN scores are excluded from metrics
return 0.5name()— display name shown in the results DataFrame.__init__(n_dimensions)— receives the dataset's feature count; called once per dataset.update(point)— receives the current feature vector; return an anomaly score (float). Scores are minmax-normalized to[0, 1]before metrics are computed, so absolute scale does not matter.
from pathlib import Path
from touchstone_py import run_touchstone, Detector
class MyDetector(Detector):
@classmethod
def name(cls) -> str:
return "MyDetector-v1"
def __init__(self, n_dimensions: int) -> None:
self.window: list[float] = []
def update(self, point: list[float]) -> float:
self.window.append(point[0])
if len(self.window) < 20:
return float("nan")
mean = sum(self.window) / len(self.window)
return abs(point[0] - mean)
df = run_touchstone(Path("data"), [MyDetector])
print(df)run_touchstone accepts a directory of CSV datasets and a list of detector classes (not instances). It returns a Polars DataFrame with the same schema as the Rust API — one row per (dataset, detector) pair with all 14 metrics plus time_sec.
Evaluating against all 976 TimeEval datasets is expensive and often redundant: many datasets exercise the same failure modes. To address this, we use a data-driven selection procedure to pick a diverse, representative subset.
We extract a feature vector for each dataset describing its performance profile across all 60 algorithms, covering five metrics: ROC-AUC, PR-AUC, Range PR-AUC, Average Precision, and execution time. We then cluster the univariate (902 datasets) and multivariate (74 datasets) subsets independently using k-medoids with k=10 for each, and select the medoid from each cluster as the representative. After filtering for public availability, this yields 19 datasets (10 univariate, 9 multivariate from 6 distinct collections).
The selection provides two key guarantees. First, diversity: each chosen dataset represents a distinct performance pattern, avoiding redundant evaluation on algorithmically similar benchmarks. Second, balance: both univariate and multivariate time series are covered proportionally despite the class imbalance in the full collection.
We additionally include the CoMuT synthetic dataset, which is purpose-designed to surface correlation anomalies, that is, anomalies invisible in individual channels but visible only through multivariate relationships. This complements TimeEval's focus on point and subsequence anomalies. The resulting 20-dataset benchmark reduces computational cost by 98% while covering both standard detection patterns and correlation-based detection, enabling fast iteration during development while maintaining statistical robustness for final validation.
See DATASETS.md for the complete list of benchmark datasets and their sources.
cargo run --example normal_distribution_detectorThis runs a rolling z-score detector (window = 20) against all datasets in data/ and prints the results.
Touchstone-rs accepts new streaming anomaly detectors via pull request. Rust detectors live as their own crate under algorithms/rust/ and are picked up automatically by the workspace; Python detectors live under algorithms/python/. See ADD_ALGORITHM.md for the step-by-step workflow (fork, implement, open PR) and how CI validates submissions.
If you use Touchstone-rs or the TimeEval dataset collection in your work, please cite:
[1] Dataset collection and evaluation methodology
@article{SchmidlEtAl2022Anomaly,
title = {Anomaly Detection in Time Series: A Comprehensive Evaluation},
author = {Schmidl, Sebastian and Wenig, Phillip and Papenbrock, Thorsten},
date = {2022},
journaltitle = {Proceedings of the VLDB Endowment (PVLDB)},
volume = {15},
number = {9},
pages = {1779--1797},
doi = {10.14778/3538598.3538602}
}[2] TimeEval benchmarking toolkit
@article{WenigEtAl2022TimeEval,
title = {TimeEval: A Benchmarking Toolkit for Time Series Anomaly Detection Algorithms},
author = {Wenig, Phillip and Schmidl, Sebastian and Papenbrock, Thorsten},
date = {2022},
journaltitle = {Proceedings of the VLDB Endowment (PVLDB)},
volume = {15},
number = {12},
pages = {3678--3681},
doi = {10.14778/3554821.3554873}
}[3] Touchstone-rs
@software{TouchstoneRs,
title = {Touchstone-rs: A Rust Library for Benchmarking Streaming Anomaly Detectors},
author = {Wenig, Phillip},
date = {2026},
url = {https://github.com/wenig/touchstone-rs}
}The MIT license applies solely to the source code in the touchstone-rs and touchstone-py directories. All other files in this repository — including algorithms/, benchmark results, datasets, scripts, and documentation — are excluded from this license.
The TiRex algorithm in algorithms/python/tirex/ is based on materials from NXAI GmbH and is governed by the NXAI Community License. Built with technology from NXAI.
The TinyTimeMixer algorithm in algorithms/python/tinytimemixer/ contains code vendored from ibm-granite/granite-tsfm, copyright contributors to the TSFM project (IBM Corporation), and is governed by the Apache License 2.0.
The Toto algorithm in algorithms/python/toto_4m/ uses the toto-2 package by Datadog, Inc., which is licensed under the Apache License 2.0.
The Matrix Profile algorithm in algorithms/python/matrix-profile/ uses the STUMPY package, copyright 2019 TD Ameritrade, which is licensed under the 3-Clause BSD License.
The benchmark datasets are sourced from the TimeEval evaluation paper dataset collection. The MIT License applies to preprocessed TimeEval datasets where not otherwise stated by the original source. See DATASETS.md for details.
