- [2026/5] We release both the paper and code for DySem.
We propose DySem, a novel training-free framework that identifies more semantically relevant internal components through multilingual consensus and moves beyond static representation spaces by dynamically selecting sample-specific semantic dimensions. Specifically, DySem constructs a text-dependent joint semantic set and computes similarity over this selected dimensional subset. It achieves superior performance across standard STS benchmarks and diverse base and instruction-tuned LLMs, consistently outperforming strong training-free baselines while using substantially fewer dimensions.
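To make "computing similarity over a selected dimensional subset" concrete, here is a minimal sketch in plain Python. It is not the DySem implementation: the selection rule used here (top-k entries by absolute magnitude, unioned across the two texts) is a stand-in assumption, whereas DySem derives its semantic dimensions from multilingual consensus. The function names `top_k_dims`, `subset_cosine`, and `dynamic_similarity` are hypothetical and do not come from the repository.

```python
import math
from typing import Iterable, Sequence, Set


def top_k_dims(vec: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k largest-magnitude entries of vec.

    Assumption: magnitude is only a placeholder selection criterion.
    """
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return set(order[:k])


def subset_cosine(a: Sequence[float], b: Sequence[float], dims: Iterable[int]) -> float:
    """Cosine similarity restricted to the given dimension indices."""
    dims = list(dims)
    dot = sum(a[i] * b[i] for i in dims)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in dims))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in dims))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no signal on the selected dimensions
    return dot / (norm_a * norm_b)


def dynamic_similarity(a: Sequence[float], b: Sequence[float], k: int) -> float:
    """Build a pair-specific joint dimension set, then score on it only."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    joint_dims = top_k_dims(a, k) | top_k_dims(b, k)
    return subset_cosine(a, b, joint_dims)


if __name__ == "__main__":
    emb_a = [0.9, 0.1, 0.0, 0.05]
    emb_b = [0.8, 0.0, 0.2, 0.05]
    print(dynamic_similarity(emb_a, emb_b, k=2))
```

The point of the sketch is that the dimension set depends on the input pair, so two different sentence pairs are generally compared in different subspaces rather than in one fixed representation space.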
DySem achieves strong and consistent performance across standard STS benchmarks on both base and instruction-tuned LLMs. Across ten evaluated model settings, its two variants obtain the best and second-best average results. On base models, DySem reaches 78.98 on LLaMA2-7B and 80.54 on Qwen3-8B, outperforming AlignedWVA by up to 5.75 points and PromptEOL by up to 11.63 points. On instruction-tuned models, DySem achieves the best average score in every setting, including 81.20 on LLaMA3.1-8B-it and 80.16 on Qwen3-8B-it. Overall, these results show that dynamic semantic dimension selection effectively filters non-semantic noise and improves STS computation while using substantially fewer dimensions.
- Base Models. DySem achieves strong and consistent performance on base LLMs, outperforming strong training-free baselines while using fewer dimensions.
- Instruction-tuned Models. DySem also performs best across instruction-tuned LLMs, showing that dynamic semantic dimension selection effectively filters non-semantic variation introduced by instruction tuning.
Clone the repository and install DySem:
# Clone the repository
git clone https://github.com/szu-tera/DySem.git
cd DySem
# Install the package and dependencies
pip install -e .

Configure the evaluated causal LLMs in configs/models.yaml:
models:
  - path: /path/to/local/causal-lm
    tag: organization/model-name

You can also evaluate a single model directly with environment variables:
MODEL_PATH=/path/to/local/causal-lm MODEL_TAG=organization/model-name bash run_dydim_eval.sh

Run the default DySem evaluation grid:
bash run_dydim_eval.sh

The default grid evaluates both prompt settings and semantic vector variants:
PROMPT_SETTINGS="english language-specific"
SEMANTIC_VECTORS="source mean"
LANGUAGE_COUNTS="12"
DIMENSION_SIZES="256 512 768 1024 1280 2048"

To reproduce a specific model setting, use the prepared scripts:
bash scripts/Qwen3-8b/EP.sh
bash scripts/Qwen3-8b/LP.sh

Translations are cached locally. We recommend running translation on CPU to avoid competing with the evaluated LLM for GPU memory:
TRANSLATION_DEVICE=cpu TRANSLATION_BATCH_SIZE=8 bash run_dydim_eval.sh

Evaluation artifacts are written to project-local directories:
translation_cache/ # generated translations
rank_cache/ # reusable language-ranking files
results/ # final evaluation CSV files
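
To gauge how many settings a full run covers per model, the default grid listed earlier can be expanded in a few lines. This is a rough sketch that assumes `run_dydim_eval.sh` evaluates the full Cartesian product of the four space-separated variables; that behavior is not confirmed here, and `DEFAULT_GRID` and `expand_grid` are illustrative names rather than part of the repository.

```python
from itertools import product
from typing import Dict, List

# Values copied from the default grid; the dict itself is illustrative.
DEFAULT_GRID: Dict[str, str] = {
    "PROMPT_SETTINGS": "english language-specific",
    "SEMANTIC_VECTORS": "source mean",
    "LANGUAGE_COUNTS": "12",
    "DIMENSION_SIZES": "256 512 768 1024 1280 2048",
}


def expand_grid(grid: Dict[str, str]) -> List[Dict[str, str]]:
    """Split each space-separated value list and take the Cartesian product."""
    keys = list(grid)
    value_lists = [grid[key].split() for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    configs = expand_grid(DEFAULT_GRID)
    print(len(configs))  # 2 * 2 * 1 * 6 = 24
    print(configs[0])
```

Under that assumption, the default grid amounts to 24 settings per model, which is useful to know before launching a full evaluation.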
For a quick configuration check without loading a model:
DYDIM_DRY_RUN=1 bash run_dydim_eval.sh

For a minimal STSBenchmark smoke test:
TASKS="STSBenchmark" LANGUAGE_COUNTS="1" DIMENSION_SIZES="256" PROMPT_SETTINGS="english" SEMANTIC_VECTORS="mean" bash run_dydim_eval.sh

This project builds upon the following open-source projects:
We sincerely thank the authors and contributors for their valuable work.
- Kaijie Zheng: liushengle35@gmail.com
- Weiqin Wang: here1swqw@gmail.com
If you find this work useful for your research, please consider citing our paper:
@article{zheng2026dysem,
  title={DySem: Uncovering Dynamic Semantic Components via Multilingual Consensus for Calculating Semantic Textual Similarity},
  author={Kaijie Zheng and Weiqin Wang and Yile Wang and Hui Huang},
  journal={arXiv preprint arXiv:2605.29751},
  year={2026}
}

