OpenSTBench is an evaluation toolkit centered on translation and speech translation. It provides a unified way to score text translation quality, speech output quality, preservation-related properties, and streaming latency.
This project is best suited to the following use cases:
- MT or S2TT text-side evaluation with `BLEU`, `chrF++`, `COMET`, and `BLEURT`
- S2ST evaluation by combining text quality, speech quality, speaker similarity, and latency
- Streaming or simultaneous speech translation latency evaluation with a custom agent
- Preservation analysis for speech translation outputs, including speaker similarity, emotion, and paralinguistic similarity
- Temporal consistency analysis for speech translation or dubbing outputs, including duration compliance and duration error
| Module | Main Use | Typical Metrics |
|---|---|---|
| `TranslationEvaluator` | Text-side translation quality | sacreBLEU, chrF++, COMET, BLEURT |
| `SpeechQualityEvaluator` | Naturalness and text-speech consistency | UTMOS, WER_Consistency, CER_Consistency |
| `SpeakerSimilarityEvaluator` | Speaker preservation | wavlm_similarity, resemblyzer_similarity |
| `EmotionEvaluator` | Emotion preservation or classification accuracy | Emotion2Vec_Cosine_Similarity, Audio_Emotion_Accuracy |
| `ParalinguisticEvaluator` | Non-verbal and paralinguistic preservation | Paralinguistic_Fidelity_Cosine, Acoustic_Event_Preservation_Rate, Acoustic_Event_Preservation_Macro_F1, Acoustic_Event_Preservation_Macro_Recall, Event_Aligned_Preservation_Rate, Conditional_Relative_Onset_Error |
| `TemporalConsistencyEvaluator` | Source-target temporal structure consistency | Duration_Consistency_SLC_0.2, Duration_Consistency_SLC_0.4 |
| `LatencyEvaluator` | Streaming / simultaneous translation latency | StartOffset, ATD, CustomATD, RTF, Model_Generate_RTF |
Basic install:
```bash
pip install OpenSTBench
```

Optional extras:

```bash
pip install "OpenSTBench[comet]"
pip install "OpenSTBench[whisper]"
pip install "OpenSTBench[speech_quality]"
pip install "OpenSTBench[emotion]"
pip install "OpenSTBench[paralinguistics]"
pip install "OpenSTBench[all]"
```

If you need BLEURT:

```bash
pip install git+https://github.com/lucadiliello/bleurt-pytorch.git
```

PyPI package name: `OpenSTBench`

Python import name: `openstbench`

Example:

```python
from openstbench import (
    TranslationEvaluator,
    SpeechQualityEvaluator,
    TemporalConsistencyEvaluator,
)
```
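A minimal text-side call might then look like the sketch below. The method and keyword names here are assumptions for illustration, not confirmed API; `examples/python/translation_eval.py` shows the authoritative usage:

```python
from openstbench import TranslationEvaluator

evaluator = TranslationEvaluator()

# Hypothetical call shape (method and argument names are not confirmed by
# this README). Inputs may also be .txt or .json files, per the input
# conventions described below.
results = evaluator.evaluate_all(
    hypotheses=["Das ist ein kleiner Test."],  # system outputs
    references=["Das ist ein Test."],          # gold references
)
```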
Quick-start scripts live under `examples/`.
Python examples:
- `examples/python/translation_eval.py`
- `examples/python/speech_quality_eval.py`
- `examples/python/speaker_similarity_eval.py`
- `examples/python/emotion_eval.py`
- `examples/python/paralinguistic_eval.py`
- `examples/python/paralinguistic_identity_baseline.py`
- `examples/python/temporal_consistency_eval.py`
- `examples/python/latency_eval.py`
Shell examples:
- `examples/bash/install_extras.sh`
- `examples/bash/run_latency_cli.sh`
Minimal temporal consistency example:
```python
from openstbench import TemporalConsistencyEvaluator

evaluator = TemporalConsistencyEvaluator(
    thresholds=(0.2, 0.4),
)

results, diagnostics = evaluator.evaluate_all(
    source_audio="./source_wavs",
    target_audio="./generated_wavs",
    sample_ids=["sample_1", "sample_2"],
    return_diagnostics=True,
)
```
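The `thresholds=(0.2, 0.4)` above correspond to the reported `Duration_Consistency_SLC_0.2` and `Duration_Consistency_SLC_0.4` metrics. As a reading aid, the sketch below shows one plausible interpretation of such a thresholded duration-compliance score; the exact definition used by the toolkit is not spelled out in this README, so treat this as an assumed reading rather than the implementation:

```python
def duration_consistency_slc(source_durations, target_durations, threshold=0.2):
    """Assumed reading of SLC_t: the fraction of audio pairs whose relative
    duration error |target - source| / source stays within the threshold."""
    within = [
        abs(tgt - src) / src <= threshold
        for src, tgt in zip(source_durations, target_durations)
    ]
    return sum(within) / len(within)

# e.g. duration_consistency_slc([2.0, 4.0], [2.3, 4.1], threshold=0.2) -> 1.0
```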
Latency output distinguishes two RTF variants:

- `Real_Time_Factor_(RTF)`: system-level RTF. This includes agent policy overhead, pre/post-processing, and other runtime costs around model inference.
- `Model_Generate_RTF`: model-level RTF. This is reported only when the agent explicitly records model inference time via `record_model_inference_time(...)` or returns it in `Segment.config["model_inference_time"]` (see the sketch below).
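In other words, `Model_Generate_RTF` only appears if the custom agent supplies its own model inference timing. A minimal sketch of what that can look like follows; the hook name `record_model_inference_time` and the `Segment.config["model_inference_time"]` field come from the note above, while the rest of the agent shape (constructor, `step` method, the placeholder model) is assumed for illustration:

```python
import time

class TimedAgent:
    """Illustrative agent wrapper that reports model-level inference time."""

    def __init__(self, model):
        self.model = model       # placeholder; any streaming model object
        self.model_time = 0.0

    def record_model_inference_time(self, seconds):
        # In the real toolkit this hook is expected to be provided by the
        # agent interface; here it simply accumulates the measured time.
        self.model_time += seconds

    def step(self, audio_chunk):
        start = time.perf_counter()
        output = self.model.generate(audio_chunk)  # placeholder model call
        self.record_model_inference_time(time.perf_counter() - start)
        return output
```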
Common text input formats:
- Python `List[str]`
- `.txt` files with one sample per line
- `.json` files
Common audio input formats:
- a folder path
- Python `List[str]` of file paths
- `.txt` path lists
- `.json` path lists
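These conventions apply uniformly, so the same argument can take any supported form. A small sketch using the `evaluate_all` call shown above (default handling of equivalent inputs is assumed here; the folder and list variants are documented, the exact pairing behavior is not):

```python
from openstbench import TemporalConsistencyEvaluator

evaluator = TemporalConsistencyEvaluator(thresholds=(0.2, 0.4))

# Audio passed as folder paths, paired via sample ids:
results = evaluator.evaluate_all(
    source_audio="./source_wavs",
    target_audio="./generated_wavs",
    sample_ids=["sample_1", "sample_2"],
)

# The same call with explicit Python lists; a .txt file with one path per
# line or a .json path list can be passed in the same positions.
results = evaluator.evaluate_all(
    source_audio=["./source_wavs/sample_1.wav", "./source_wavs/sample_2.wav"],
    target_audio=["./generated_wavs/sample_1.wav", "./generated_wavs/sample_2.wav"],
    sample_ids=["sample_1", "sample_2"],
)
```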
- For `zh`/`ja`/`ko`, the toolkit uses CJK-aware handling for text-side evaluation.
- `SpeechQualityEvaluator` returns `CER_Consistency` for `zh`/`ja`/`ko`, and `WER_Consistency` for most other languages.
- `ParalinguisticEvaluator` always supports `Paralinguistic_Fidelity_Cosine`, a continuous CLAP-based audio similarity score between source and target speech.
- `TemporalConsistencyEvaluator` supports `List[str]`, audio folders, `.txt` path lists, and `.json` path lists for both `source_audio` and `target_audio`.
- `TemporalConsistencyEvaluator` reports thresholded duration compliance metrics (`Duration_Consistency_SLC_*`).
- The discrete preservation branch is an utterance-level single-label task. With source-side gold labels, it reports `Acoustic_Event_Preservation_Rate`, `Acoustic_Event_Preservation_Macro_F1`, and `Acoustic_Event_Preservation_Macro_Recall`.
- If `source_onsets_ms` are available, the evaluator can also report alignment-aware metrics: `Event_Aligned_Preservation_Rate` and `Conditional_Relative_Onset_Error`.
- Alignment is computed on relative onset position, not absolute wall-clock time. This makes it suitable for cross-lingual S2ST, where source and target utterance durations naturally differ.
- If target-side onset timestamps are not provided, the default localizer estimates them with CLAP sliding-window scoring conditioned on the target event label.
- These alignment metrics should be interpreted as weak, coarse-grained alignment signals rather than timestamp-accurate event localization benchmarks.
- If source-side gold labels are not available, the evaluator can still run in prediction-only mode, reporting `Predicted_Event_Consistency_Rate`, `Predicted_Event_Consistency_Macro_F1`, and `Predicted_Event_Consistency_Macro_Recall`.
- The default discrete predictor is a closed-set CLAP classifier over `candidate_labels`. Users may replace it with any custom predictor object that implements `predict(audio_paths, candidate_labels)` (see the sketch after this list).
- The default event localizer is also replaceable. Custom localizers only need to implement `localize(audio_paths, labels, candidate_labels)`.
- Dataset-specific label mapping is intentionally outside the core package. Pass `candidate_labels` and `label_normalizer` at call time so the same evaluator works across datasets without changing core code.
- For offline environments, `clap_model_path` accepts either a Hugging Face repo id or a local model directory or snapshot.
- Model-loading parameters such as `clap_model_path`, `wavlm_model_path`, `whisper_model`, `e2v_model_path`, `comet_model`, and `bleurt_path` now use a consistent local-first rule: if the supplied local path exists, it is used; otherwise the evaluator falls back to the default remote model id.
- In S2S latency evaluation, alignment prefers the model's native transcript when available. If the model is audio-only, the evaluator can optionally use ASR fallback to prepare alignment text.
- For S2S forced alignment, pass language-appropriate MFA models through `alignment_acoustic_model` and `alignment_dictionary_model`. The defaults are English.
- Some modules rely on optional dependencies or local model paths in offline environments.
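The pluggable pieces above can be exercised with a stub. The sketch below satisfies the documented `predict(audio_paths, candidate_labels)` interface and passes `candidate_labels` plus a `label_normalizer` at call time; the `predictor=` constructor keyword and the `evaluate_all` method name for `ParalinguisticEvaluator` are assumptions, not confirmed API:

```python
from openstbench import ParalinguisticEvaluator

class FirstLabelPredictor:
    """Stub satisfying the documented predict(audio_paths, candidate_labels)
    interface. A real replacement would run an audio event classifier; this
    one just predicts the first candidate label for every file."""

    def predict(self, audio_paths, candidate_labels):
        return [candidate_labels[0] for _ in audio_paths]

# `predictor=` is a hypothetical wiring point; check the evaluator's
# constructor for the actual argument name.
evaluator = ParalinguisticEvaluator(predictor=FirstLabelPredictor())

results = evaluator.evaluate_all(
    source_audio="./source_wavs",
    target_audio="./generated_wavs",
    candidate_labels=["laughter", "cough", "applause"],
    label_normalizer=lambda label: label.strip().lower(),
)
```

A custom localizer is swapped in the same way; it only needs to implement `localize(audio_paths, labels, candidate_labels)`.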
MIT License