GitHub - wasertech/TrainingSpeech: Open and freely reusable dataset of voices for speech-to-text models training

TrainingSpeech is an initiative to provide an open and freely reusable dataset of voices for training speech-to-text models on non-English languages, using already available data (such as audio-books).

Right now, data is extracted exclusively from French-language audio-books. Let me know if you are interested in contributing by creating an issue.

Tooling

TrainingSpeech comes with a CLI that automates and simplifies:

transcript extraction
forced-alignment (using aeneas)
validation and correction

Common workflow

1. Generate and validate alignment on existing source

pick a source that has NOT been validated yet: see python manage.py stats and ./sources.json for more info
download assets (i.e. epub and mp3 files): python manage.py download -s <SOURCE_NAME>
check alignment: python manage.py check-alignment <SOURCE_NAME> (may require multiple iterations)
send a pull request with generated transcript and alignment
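
The command-line steps above can be chained from a small script. This is only a sketch: the function names are hypothetical and not part of TrainingSpeech, it assumes it is run from the repository root inside the pipenv shell, and it defaults to a dry run that only prints the commands. check-alignment may need to be re-run several times, so a real run will probably be interactive.

```python
import subprocess


def alignment_workflow_commands(source_name):
    """Return the manage.py invocations from the steps above, in order."""
    return [
        ["python", "manage.py", "stats"],
        ["python", "manage.py", "download", "-s", source_name],
        ["python", "manage.py", "check-alignment", source_name],
    ]


def run_alignment_workflow(source_name, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for command in alignment_workflow_commands(source_name):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(command, check=True)
```

Calling run_alignment_workflow("<SOURCE_NAME>") prints the three commands; passing dry_run=False executes them.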

2. Add new source (team members only)

retrieve the epub and corresponding mp3 file and store them in ./data/epubs and ./data/mp3 (respectively)
create a new source in ./sources.json (NB: all fields are mandatory)
generate the initial transcript using python manage.py build-transcript <SOURCE_NAME>
upload the epub and mp3 files to S3: python manage.py upload -s <SOURCE_NAME>

Dev setup

$ sudo apt-get install -y ffmpeg espeak libespeak-dev python3-numpy python-numpy libncurses-dev libncursesw5-dev sox libsqlite3-dev
$ git clone git@gitlab.com:wasertech/TrainingSpeech.git
$ pip3 install --user pipenv
$ cd TrainingSpeech
$ pipenv install --python=3.6.6
$ pipenv sync
$ pipenv shell
$ pytest

Last releases & download

Releases are ready-to-use zip archives containing:

short 16 kHz, 16-bit WAV speech clips (0-15 s)
a single data.csv file with the following columns:
- path: path to the audio file inside the archive
- duration: audio duration in seconds
- text: transcript
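
A release archive could be inspected without unpacking it, as in the sketch below. It rests on several assumptions the README does not confirm: that data.csv sits at the archive root, is comma-separated and UTF-8 encoded, and has a header row with the column names listed above. The function name is hypothetical.

```python
import csv
import io
import zipfile


def summarize_release(archive, csv_name="data.csv"):
    """Return (number of clips, total duration in seconds) for a release.

    `archive` is a path or a binary file-like object holding the zip.
    """
    count = 0
    total = 0.0
    with zipfile.ZipFile(archive) as zf:
        with zf.open(csv_name) as raw:
            # Assumed layout: header row naming path, duration and text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                count += 1
                total += float(row["duration"])
    return count, total
```

If the real file uses another delimiter or has no header, the csv.DictReader call would need the matching delimiter or fieldnames arguments.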

Name	# speeches	# speakers	Total Duration	Language
2019-04-11_fr_FR (w/ 💖 from @lissyx)	124089	4	182:43:35	fr_FR
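
As a quick sanity check on the figures in the table, the total duration can be converted to seconds and divided by the number of speeches. The helper name is hypothetical; the numbers are the ones listed for the 2019-04-11_fr_FR release.

```python
def hms_to_seconds(value):
    """Convert an H:MM:SS string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = hms_to_seconds("182:43:35")  # 657815
average = total_seconds / 124089             # about 5.30 seconds per speech
print(f"{total_seconds} s total, {average:.2f} s per speech on average")
```

An average of roughly 5.3 seconds per speech is consistent with the 0-15 s clip length stated above.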
