RoMoAligner is a novel alignment model designed for non-autoregressive Text-to-Speech (TTS) synthesis. It combines a rough aligner and a fine-grained monotonic boundary aligner (MoBoAligner) to achieve fast and accurate alignment between text and speech.
- Two-stage alignment: RoMoAligner first uses a rough aligner to estimate the coarse boundaries of each text token, then applies MoBoAligner to refine the alignment within the selected boundaries.
- Monotonic alignment: MoBoAligner ensures the monotonicity and continuity of the alignment, which is crucial for TTS.
- Robust and efficient: By selecting the most relevant mel frames for each text token, RoMoAligner reduces the computational complexity and improves the robustness of the alignment.
- Easy integration: RoMoAligner can be easily integrated into any non-autoregressive TTS system to provide accurate duration information.
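
The monotonicity and continuity properties listed above can be illustrated with a small check. The following is a minimal sketch, not code from this repository: it assumes an alignment is given as a plain list mapping each mel frame to a text token index, and the function name `is_monotonic_and_continuous` is hypothetical.

```python
def is_monotonic_and_continuous(frame_to_token, num_tokens):
    """Check a frame-to-token path (hypothetical representation, one token index per mel frame).

    Monotonic: the token index never decreases from one frame to the next.
    Continuous: the token index never skips a token, and every token is covered.
    """
    if not frame_to_token:
        return num_tokens == 0
    # The path must start at the first token and end at the last one.
    if frame_to_token[0] != 0 or frame_to_token[-1] != num_tokens - 1:
        return False
    for prev, curr in zip(frame_to_token, frame_to_token[1:]):
        # Each frame either stays on the same token (step 0) or moves to the next one (step 1).
        if curr - prev not in (0, 1):
            return False
    return True


valid = is_monotonic_and_continuous([0, 0, 1, 1, 1, 2], num_tokens=3)
skipping = is_monotonic_and_continuous([0, 0, 2, 2], num_tokens=3)   # skips token 1
backwards = is_monotonic_and_continuous([0, 1, 0, 1], num_tokens=2)  # goes back to token 0
print(valid, skipping, backwards)
```

A path that passes this check assigns every token a contiguous run of at least one frame, which is what makes it usable as duration information for TTS.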
- Clone the repository:

      git clone https://github.com/yourusername/RoMoAligner.git
      cd RoMoAligner

- Install the required dependencies:

      pip install -r requirements.txt

- Compile the Cython extension:

      python setup.py build_ext --inplace

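To use the aligner in Python:

    from romo_aligner import RoMoAligner

    # The constructor arguments are hyperparameters; set them to match your model.
    aligner = RoMoAligner(
        text_channels, mel_channels, attention_dim, attention_head, dropout, noise_scale
    )

    # Align text embeddings with mel embeddings; the masks distinguish real positions from padding.
    soft_alignment, hard_alignment, expanded_text_embeddings, dur_by_rough, dur_by_mobo = aligner(
        text_embeddings,
        mel_embeddings,
        text_mask,
        mel_mask,
        direction=["forward", "backward"],
    )
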
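
To show how a hard alignment relates to durations and to expanded text, here is a minimal pure-Python sketch. It does not use the repository's tensors: the nested-list layout (`hard_alignment[i][j]` is 1 when frame `j` belongs to token `i`) and both function names are assumptions made for illustration, and the real outputs may be batched and laid out differently.

```python
def durations_from_hard_alignment(hard_alignment):
    """Count the frames assigned to each token.

    Assumes hard_alignment[i][j] == 1 when mel frame j belongs to text token i
    (an illustrative layout, not necessarily the one used by the model).
    """
    return [sum(row) for row in hard_alignment]


def expand_by_durations(tokens, durations):
    """Repeat each token by its duration so the result has one entry per mel frame."""
    expanded = []
    for token, duration in zip(tokens, durations):
        expanded.extend([token] * duration)
    return expanded


hard = [
    [1, 1, 0, 0, 0],  # token 0 covers frames 0-1
    [0, 0, 1, 0, 0],  # token 1 covers frame 2
    [0, 0, 0, 1, 1],  # token 2 covers frames 3-4
]
durations = durations_from_hard_alignment(hard)
frames = expand_by_durations(["h", "i", "!"], durations)
print(durations, frames)
```

In a non-autoregressive TTS system, the same expansion is typically applied to token embeddings rather than to characters, so that the decoder input has the same length as the mel spectrogram.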
RoMoAligner consists of two main components:
- RoughAligner: A cross-modal attention-based module that estimates the coarse boundaries of each text token in the mel spectrogram.
- MoBoAligner (unofficial implementation): A fine-grained monotonic boundary aligner that refines the alignment within the selected boundaries.
The rough aligner first provides an initial estimation of the text token durations, which are then used to select the most relevant mel frames for each token. MoBoAligner then performs a more precise alignment within these selected frames, ensuring the monotonicity and continuity of the alignment.
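
The frame-selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the repository's selection procedure: rough durations are taken to be integer frame counts per token, and the `margin` parameter and the name `candidate_windows` are hypothetical.

```python
def candidate_windows(rough_durations, num_frames, margin=2):
    """Turn rough per-token durations into frame windows for fine-grained alignment.

    Each token's rough span is widened by `margin` frames on both sides
    (a hypothetical choice) and clipped to [0, num_frames). The end index is exclusive.
    """
    windows = []
    start = 0
    for duration in rough_durations:
        end = start + duration                # rough span is [start, end)
        lo = max(0, start - margin)           # widen to the left, clip at frame 0
        hi = min(num_frames, end + margin)    # widen to the right, clip at the last frame
        windows.append((lo, hi))
        start = end
    return windows


windows = candidate_windows([3, 2, 4], num_frames=9, margin=1)
print(windows)
```

Restricting the fine aligner to such windows means it only has to score a subset of frames per token instead of the full spectrogram, which is the source of the reduced computation mentioned earlier. Overlapping windows leave room for the refined boundaries to move away from the rough estimate.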
We welcome contributions to RoMoAligner! If you have any bug reports, feature requests, or suggestions, please open an issue on the GitHub repository. If you'd like to contribute code, please fork the repository and submit a pull request.
RoMoAligner is released under the MIT License.
We would like to thank the open-source community for their valuable contributions and feedback. Special thanks to the developers of ESPnet and PyTorch for their excellent libraries.