ESRT supports many-to-many speech-to-text translation across 45 languages (45 × 44 = 1,980 translation directions). It uses an edge-cloud split inference architecture: the edge device transmits only compressed acoustic features instead of raw audio, which protects voice privacy and reduces bandwidth.
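
To make the edge-cloud split concrete, here is a minimal, standard-library-only sketch of the general idea. It is not ESRT's implementation: the README does not describe the actual feature extractor or wire format, so `edge_encode`, `cloud_decode`, the 16 kHz sample rate, the 20 ms frame size, and the one-byte log-energy feature are all hypothetical stand-ins for a learned edge encoder. The only point is that the cloud side receives a small feature payload and never sees the waveform.

```python
import math
import struct

SAMPLE_RATE = 16_000   # assumed sample rate, not taken from ESRT
FRAME = 320            # assumed 20 ms frames at 16 kHz
LOG_MAX = 9.1          # upper bound on log10(energy + 1) for int16 audio


def edge_encode(samples):
    """Edge side (hypothetical): one quantized log-energy byte per frame."""
    codes = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[start:start + FRAME]
        energy = sum(s * s for s in frame) / FRAME
        log_e = math.log10(energy + 1.0)
        codes.append(min(255, int(log_e / LOG_MAX * 255)))
    # Payload layout: 4-byte little-endian frame count, then one byte per frame.
    return struct.pack(f"<I{len(codes)}B", len(codes), *codes)


def cloud_decode(payload):
    """Cloud side (hypothetical): recover approximate features from the payload."""
    (n,) = struct.unpack_from("<I", payload, 0)
    codes = struct.unpack_from(f"<{n}B", payload, 4)
    return [c / 255 * LOG_MAX for c in codes]


# One second of a synthetic 440 Hz tone as int16-range samples.
raw = [int(10_000 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
       for t in range(SAMPLE_RATE)]

payload = edge_encode(raw)          # what would leave the device
features = cloud_decode(payload)    # what the cloud model would consume
raw_bytes = len(raw) * 2            # size of the same audio as 16-bit PCM

print(f"raw audio: {raw_bytes} bytes, transmitted payload: {len(payload)} bytes")
```

A real edge encoder would produce far richer features than a single energy value per frame, so the compression ratio here says nothing about ESRT's actual bandwidth savings.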

Set up the environment:

    uv venv --python 3.10
    source .venv/bin/activate
    uv pip install -r requirements.txt

Download the test data:

    git clone https://huggingface.co/datasets/yxdu/fleurs_eng_test ./fleurs_eng_test

Two-stage inference: edge side and cloud side.

    # Offline, for quick testing
    python test_inference.py
    # Online deployment guide coming soon.

Training code will be open-sourced in a future release.

Validated on:
- GPU: NVIDIA A100 80GB × 8
- NPU: Huawei Ascend 910C 64GB × 8

Supported languages, grouped by language family:

| Family | Languages |
|---|---|
| Afro-Asiatic | Arabic, Hebrew |
| Austroasiatic | Khmer, Vietnamese |
| Austronesian | Indonesian, Malay, Tagalog |
| Dravidian | Tamil |
| Indo-European | Bengali, Bulgarian, Catalan, Czech, Danish, Dutch, English, French, German, Greek, Hindi, Croatian, Italian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Urdu |
| Japonic | Japanese |
| Koreanic | Korean |
| Kra–Dai | Lao, Thai |
| Sino-Tibetan | Chinese, Burmese, Cantonese |
| Turkic | Azerbaijani, Kazakh, Turkish, Uzbek |
| Uralic | Finnish, Hungarian |
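
The 45 × 44 direction count from the introduction can be checked directly against this table. The sketch below copies the table into a dictionary (the name `LANGUAGES_BY_FAMILY` is only illustrative, not part of the ESRT code) and counts ordered source-target pairs with distinct languages.

```python
from itertools import permutations

# Language table from this README, keyed by family.
LANGUAGES_BY_FAMILY = {
    "Afro-Asiatic": ["Arabic", "Hebrew"],
    "Austroasiatic": ["Khmer", "Vietnamese"],
    "Austronesian": ["Indonesian", "Malay", "Tagalog"],
    "Dravidian": ["Tamil"],
    "Indo-European": [
        "Bengali", "Bulgarian", "Catalan", "Czech", "Danish", "Dutch",
        "English", "French", "German", "Greek", "Hindi", "Croatian",
        "Italian", "Norwegian", "Persian", "Polish", "Portuguese", "Romanian",
        "Russian", "Slovak", "Slovenian", "Spanish", "Swedish", "Urdu",
    ],
    "Japonic": ["Japanese"],
    "Koreanic": ["Korean"],
    "Kra-Dai": ["Lao", "Thai"],
    "Sino-Tibetan": ["Chinese", "Burmese", "Cantonese"],
    "Turkic": ["Azerbaijani", "Kazakh", "Turkish", "Uzbek"],
    "Uralic": ["Finnish", "Hungarian"],
}

languages = [lang for langs in LANGUAGES_BY_FAMILY.values() for lang in langs]

# Each direction is an ordered (source, target) pair of two different languages.
directions = list(permutations(languages, 2))

print(len(languages), len(directions))  # prints: 45 1980
```

Because directions are ordered, English to French and French to English count separately, which is why the total is 45 × 44 rather than half of that.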

Citation:

    @misc{du2026bandwidthefficientprivacypreservingedgecloudmanytomany,
          title={Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation},
          author={Yexing Du and Kaiyuan Liu and Youcheng Pan and Bo Yang and Ming Liu and Bing Qin and Yang Xiang},
          year={2026},
          eprint={2605.28642},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2605.28642},
    }