Marusia demo interface: an HTTP API client that automatically converts short Russian texts into speech using VK Cloud.
To install system-wide dependencies:

```shell
sudo apt-get update && sudo apt-get install build-essential libtag1-dev ffmpeg
```

To create a conda environment and install dependencies, use the following commands:
```shell
conda env create -f environment.yml
python -c 'import nltk; nltk.download("punkt")'
```

Then activate the created environment:
```shell
conda activate marude
```

There is an auxiliary script fetch.sh for generating the baneks dataset, which should be invoked like this:
```shell
./fetch.sh 05.11.2025
```

The script accepts a version name, which may be an arbitrary string.
Then, to transform the generated dataset to speech:

```shell
python -m rr handle-aneks -s assets/baneks/05.11.2025/default.tsv -d assets/baneks-speech -e salute -rk
```

Here assets/baneks-speech should contain the previously generated audio files, so that duplicates are not created.
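To avoid re-synthesizing jokes that already have audio, the destination folder can be scanned first. Below is a minimal sketch of such a check; the helper `find_missing` is illustrative and not part of marude, and the assumption that each output file is named `<id>.mp3` is hypothetical:

```python
from pathlib import Path


def find_missing(ids, speech_dir):
    """Return the ids that do not yet have a generated audio file.

    Assumes (hypothetically) that each joke id is stored as <id>.mp3
    inside speech_dir, mirroring the assets/baneks-speech layout.
    """
    existing = {p.stem for p in Path(speech_dir).glob('*.mp3')}
    return [i for i in ids if str(i) not in existing]
```

A caller would then synthesize only the returned ids instead of the whole dataset.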
Then, to copy new files to a separate folder, use a command like:

```shell
ls -alth | grep Nov | grep mp3 | awk '{print $9}' | xargs -I {} mv {} ../baneks-speech-new/
```

And to generate an archive, run the following command:
```shell
tar -cJvf ../../baneks-speech/speech/040001-040720.tar.xz -C /home/zeio/raconteur/assets/baneks-speech-new/ .
```

After the environment is set up, the app can be used from the command line:
```shell
python -m marude tts 'Привет, мир' -m pavel -p message.mp3
```

The provided text (which must be at most 1024 characters long) will be converted into speech and saved as the audio file message.mp3. By default the file is saved at assets/message.mp3.
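Longer texts have to be split before synthesis to respect the 1024-character limit mentioned above. Here is a minimal sketch of such a splitter; the helper `split_text` is illustrative and not part of marude, and sentence detection uses a naive regex rather than a proper tokenizer:

```python
import re

MAX_LEN = 1024  # per-request character limit mentioned above


def split_text(text, max_len=MAX_LEN):
    """Greedily pack sentences into chunks no longer than max_len.

    Sentences are detected with a naive regex on terminal punctuation;
    a single sentence longer than max_len is hard-cut.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        while len(sentence) > max_len:  # hard-cut oversized sentences
            chunks.append(sentence[:max_len])
            sentence = sentence[max_len:]
        if not sentence:
            continue
        candidate = f'{current} {sentence}'.strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the tts command (or to the client) separately.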
The module can also be used programmatically. First, install the system-wide dependencies:

```shell
sudo apt-get update && sudo apt-get install ffmpeg libtag1-dev
```

Then, install the module through pip:
```shell
pip install marude
```

Then run your script, which may look like this (see example):
```python
from tasty import pipe

from marude import CloudVoiceClient, Voice

if __name__ == '__main__':
    client = CloudVoiceClient(Voice.MARIA)

    with open('message-1.mp3', 'wb') as file:
        _ = 'Съешь еще этих мягких французских булок' | pipe | client.tts | pipe | file.write

    with open('message-2.mp3', 'wb') as file:
        _ = 'да выпей чаю' | pipe | client.tts | pipe | file.write
```

The automatic speech recognition (ASR) pipeline is implemented as a pair of services:
- producer: splits the input mp3 file into chunks of a given max length on silence and converts the audio to wav format;
- consumer: as soon as the producer finishes converting the next chunk, it loads the file and sends it to a remote service for recognition, then appends the recognized text to the output file.
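The handoff between the two services can be sketched with a thread and a queue. The conversion and recognition steps are replaced by stand-in callables, so this is only an illustration of the pattern, not marude's actual implementation:

```python
import queue
import threading


def run_pipeline(chunks, convert, recognize):
    """Toy producer/consumer: the producer converts chunks one by one,
    and the consumer recognizes each converted chunk as soon as it is
    ready, appending the result to a shared transcript."""
    ready = queue.Queue()
    transcript = []

    def producer():
        for chunk in chunks:
            ready.put(convert(chunk))  # e.g. mp3 chunk -> wav file
        ready.put(None)                # sentinel: no more chunks

    def consumer():
        while (wav := ready.get()) is not None:
            transcript.append(recognize(wav))  # e.g. remote ASR call

    worker = threading.Thread(target=producer)
    worker.start()
    consumer()
    worker.join()
    return transcript
```

Because conversion and recognition overlap, the consumer can start on the first chunk while later chunks are still being produced, which is what makes the timings below possible.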
An example command for running the pipeline:

```shell
python -m marude asr assets/real-colonel.mp3
```

An example log looks like this (the whole pipeline completed in 28 seconds):
```
Finished converting assets/real-colonel.mp3 to .wav which is saved as assets/real-colonel-converted.wav. Audio duration is 1502.856 seconds
Started segmenting assets/real-colonel-converted.wav
Finished segmenting assets/real-colonel-converted.wav. There are 413 segments
Final number of segments is 15
Started saving segments on disk
Finished saving 01/15 segments. The latest segment was saved as assets/real-colonel/00.wav
Finished saving 02/15 segments. The latest segment was saved as assets/real-colonel/01.wav
Finished saving 03/15 segments. The latest segment was saved as assets/real-colonel/02.wav
Finished saving 04/15 segments. The latest segment was saved as assets/real-colonel/03.wav
Finished saving 05/15 segments. The latest segment was saved as assets/real-colonel/04.wav
Finished saving 06/15 segments. The latest segment was saved as assets/real-colonel/05.wav
Finished saving 07/15 segments. The latest segment was saved as assets/real-colonel/06.wav
Finished saving 08/15 segments. The latest segment was saved as assets/real-colonel/07.wav
Finished saving 09/15 segments. The latest segment was saved as assets/real-colonel/08.wav
Finished saving 10/15 segments. The latest segment was saved as assets/real-colonel/09.wav
Finished saving 11/15 segments. The latest segment was saved as assets/real-colonel/10.wav
Finished saving 12/15 segments. The latest segment was saved as assets/real-colonel/11.wav
Recognized text from file assets/real-colonel/00.wav
Finished saving 13/15 segments. The latest segment was saved as assets/real-colonel/12.wav
Finished saving 14/15 segments. The latest segment was saved as assets/real-colonel/13.wav
Finished saving 15/15 segments. The latest segment was saved as assets/real-colonel/14.wav
Recognized text from file assets/real-colonel/01.wav
Recognized text from file assets/real-colonel/02.wav
Recognized text from file assets/real-colonel/03.wav
Recognized text from file assets/real-colonel/04.wav
Recognized text from file assets/real-colonel/05.wav
Recognized text from file assets/real-colonel/06.wav
Recognized text from file assets/real-colonel/07.wav
Recognized text from file assets/real-colonel/08.wav
Recognized text from file assets/real-colonel/09.wav
Recognized text from file assets/real-colonel/10.wav
Recognized text from file assets/real-colonel/11.wav
Recognized text from file assets/real-colonel/12.wav
Recognized text from file assets/real-colonel/13.wav
Recognized text from file assets/real-colonel/14.wav
```

See details of this example here. There is also another example with a 3-hour recording, which was handled in about 12 minutes.
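The log shows 413 raw silence splits being reduced to 15 final segments, which suggests a step that packs adjacent splits up to a maximum chunk length. The following is only a plausible sketch of such a step, not marude's actual code; the function, its signature, and the duration-based representation are all hypothetical:

```python
def merge_segments(durations, max_length):
    """Greedily merge adjacent silence-split segments so that each
    resulting chunk stays within max_length seconds.

    durations -- lengths of the raw segments, in order, in seconds
    Returns (start_index, end_index) pairs with end_index exclusive.
    A single segment longer than max_length is kept as its own chunk.
    """
    chunks, start, total = [], 0, 0.0
    for i, duration in enumerate(durations):
        if total + duration > max_length and total > 0:
            chunks.append((start, i))   # close the current chunk
            start, total = i, 0.0
        total += duration
    if total > 0:
        chunks.append((start, len(durations)))
    return chunks
```

With the audio duration from the log (about 1503 seconds over 15 chunks), a max length of roughly 100 seconds per chunk would produce a similar reduction.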