diff --git a/README.md b/README.md
index bc8838d..eaf9d80 100644
--- a/README.md
+++ b/README.md
@@ -6,15 +6,16 @@
 [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
 [![Mailing list : test](https://img.shields.io/badge/Contact-Authors-blue.svg)](mailto:open_stt@googlegroups.com)
-
 # **Russian Open Speech To Text (STT/ASR) Dataset**
 
 Arguably the largest public Russian STT dataset to date:
 - ~16m utterances (1-2m with less perfect annotation, see [#7](https://github.com/snakers4/open_stt/issues/7));
 - ~20 000 hours;
-- 2,3 TB (in `.wav` format in `int16`);
+- 2.3 TB (in `.wav` format in `int16`), 356 GB in `.opus`;
 - (**new!**) A new domain - public speech;
 - (**new!**) A huge Radio dataset update with **10 000+ hours**;
+- (**new!**) Utils for working with OPUS;
+- (**Coming soon!**) New OPUS torrent, **unlimited direct links**;
 
 Prove [us](mailto:open_stt@googlegroups.com) wrong! Open issues, collaborate, submit a PR, contribute, share your datasets!
 
@@ -22,35 +23,33 @@ Let's make STT in Russian (and more) as open and available as CV models.
 
 **Planned releases:**
 
-- Refine and publish speaker labels, probably add speakers for old datasets;
-- Improve / re-upload some of the existing datasets, refine the STT labels;
-- Probably add new languages;
-- Add pre-trained models;
+- Working on a new project with 3 more languages, stay tuned!
# **Table of contents** - - - [Dataset composition](https://github.com/snakers4/open_stt/#dataset-composition) - - [Downloads](https://github.com/snakers4/open_stt/#downloads) - - [Via torrent](https://github.com/snakers4/open_stt/#via-torrent) - - [Links](https://github.com/snakers4/open_stt/#links) - - [Download-instructions](https://github.com/snakers4/open_stt/#download-instructions) - - [End-to-end download scripts](https://github.com/snakers4/open_stt/#end-to-end-download-scripts) - - [Annotation methodology](https://github.com/snakers4/open_stt/#annotation-methodology) - - [Audio normalization](https://github.com/snakers4/open_stt/#audio-normalization) - - [Disk db methodology](https://github.com/snakers4/open_stt/#on-disk-db-methodology) - - [Helper functions](https://github.com/snakers4/open_stt/#helper-functions) - - [Contacts](https://github.com/snakers4/open_stt/#contacts) - - [Acknowledgements](https://github.com/snakers4/open_stt/#acknowledgements) - - [FAQ](https://github.com/snakers4/open_stt/#faq) - - [License](https://github.com/snakers4/open_stt/#license) - - [Donations](https://github.com/snakers4/open_stt/#donations) +- [Dataset composition](https://github.com/snakers4/open_stt/#dataset-composition) +- [Downloads](https://github.com/snakers4/open_stt/#downloads) + - [Via torrent](https://github.com/snakers4/open_stt/#via-torrent) + - [Links](https://github.com/snakers4/open_stt/#links) + - [Download-instructions](https://github.com/snakers4/open_stt/#download-instructions) + - [End-to-end download scripts](https://github.com/snakers4/open_stt/#end-to-end-download-scripts) +- [Annotation methodology](https://github.com/snakers4/open_stt/#annotation-methodology) +- [Audio normalization](https://github.com/snakers4/open_stt/#audio-normalization) +- [Disk db methodology](https://github.com/snakers4/open_stt/#on-disk-db-methodology) +- [Helper functions](https://github.com/snakers4/open_stt/#helper-functions) +- [How to open 
opus](https://github.com/snakers4/open_stt/#how-to-open-opus) +- [Contacts](https://github.com/snakers4/open_stt/#contacts) +- [Acknowledgements](https://github.com/snakers4/open_stt/#acknowledgements) +- [FAQ](https://github.com/snakers4/open_stt/#faq) +- [License](https://github.com/snakers4/open_stt/#license) +- [Donations](https://github.com/snakers4/open_stt/#donations) +- [Further reading](https://github.com/snakers4/open_stt/#further-reading) # **Dataset composition** | Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise | |---------------------------|------------|-------|-----|------------|------------------|-------------|---------------| -| radio_v4 | 7,603,192 | 10,430 | 1,195 | 4.94s / 68 | Radio | Alignment (*)| 95% / crisp | +| radio_v4 | 7,603,192 | 10,430 | 1,195 | 4.94s / 68 | Radio | Alignment (*)| 95% / crisp | | public_speech | 1,700,060 | 2,709 | 301 | 5,73s / 79 | Public speech | Alignment (*)| 95% / crisp | | audiobook_2 | 1,149,404 | 1,511 | 162 | 4.7s / 56 | Books | Alignment (*)| 95% / crisp | | radio_2 | 651,645 | 1,439 | 154 | 7.95s / 110 | Radio | Alignment (*)| TBC, should be high | @@ -77,6 +76,15 @@ This alignment was performed using Yuri's alignment tool. # **Updates** +## **_Update 2020-05-04_** + +**Migration to OPUS** + +- Conversion of the whole dataset to OPUS +- New OPUS torrent +- Added OPUS helpers and build instructions +- Coming soon - **new unlimited direct downloads** + ## **_Update 2020-02-07_** **Temporarily Deprecated Direct MP3 Links:** @@ -87,10 +95,10 @@ This alignment was performed using Yuri's alignment tool. **New train datasets added:** - - 10,430 hours radio_v4; - - 2,709 hours public_speech; - - 154 hours radio_v4_add; - - 5% sample of all new datasets with annotation. +- 10,430 hours radio_v4; +- 2,709 hours public_speech; +- 154 hours radio_v4_add; +- 5% sample of all new datasets with annotation.
Click to expand

@@ -144,16 +152,16 @@ This alignment was performed using Yuri's alignment tool.
 
 ## **Via torrent**
 
-Save us a couple of bucks, download via torrent:
-- ~~An **MP3** [version](http://academictorrents.com/details/4a2656878dc819354ba59cd29b1c01182ca0e162) of the dataset (v3)~~ not supported anymore;
-- A **WAV** [version](https://academictorrents.com/details/a7929f1d8108a2a6ba2785f67d722423f088e6ba) of the dataset (v5);
+- ~~An **MP3** [version](http://academictorrents.com/details/4a2656878dc819354ba59cd29b1c01182ca0e162) of the dataset (v3)~~ DEPRECATED;
+- ~~A **WAV** [version](https://academictorrents.com/details/a7929f1d8108a2a6ba2785f67d722423f088e6ba) of the dataset (v5)~~ DEPRECATED;
+- An **OPUS** [version](https://academictorrents.com/details/95b4cab0f99850e119114c8b6df00193ab5fa34f) of the dataset (v1.01);
 
 You can download separate files via torrent.
-~~Try several torrent clients if some do not work.~~
+
 It looks like, due to the large chunk size, most conventional torrent clients just fail silently.
 
-No problem (re-calculating the torrent takes much time, and some people have downloaded it already):
+No problem (re-calculating the torrent takes much time, and some people have downloaded it already), use `aria2c`:
 
-```
+```bash
 apt update
 apt install aria2
 # list the torrent files
@@ -165,11 +173,16 @@ aria2c --select-file=4 ru_open_stt_wav_v10.torrent
 # https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-metalink-options
 # https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-specific-options
 ```
-If you are using Windows, you may use Linux subsystem to run these commands.
+
+If you are using Windows, you can use the **Linux subsystem** (WSL) to run these commands.
 
 ## **Links**
 
-All **WAV** files can be downloaded ONLY via [torrent](https://academictorrents.com/details/a7929f1d8108a2a6ba2785f67d722423f088e6ba)
+**Coming soon** - new direct OPUS links!
+
+All WAV and MP3 files / links / torrents will be superseded by OPUS.
+
+Total size of the OPUS files is about 356 GB, so OPUS is ~10% smaller than MP3.
 
 | Dataset                                | GB, wav | GB, mp3 | Mp3 | Source | Manifest |
 |---------------------------------------|------|----------------|-----| -------| ----------|
@@ -198,8 +211,10 @@ All **WAV** files can be downloaded ONLY via [torrent](https://academictorrents.
 
 ### End to end
 
-`download.sh`
-or
+`download.sh`
+
+or
+
 `download.py` with this config [file](https://github.com/snakers4/open_stt/blob/master/md5sum.lst). Please check the config first.
 
 ### Manually
 
@@ -207,16 +222,19 @@ or
 1. Download each dataset separately:
 
    Via `wget`
+
   ```
   wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
   ```
 
   For multi-threaded downloads use aria2 with the `-x` flag, e.g.
+
   ```
   aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
   ```
 
  If necessary, merge chunks like this:
+
  ```
  cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
  ```
@@ -276,7 +294,7 @@ manifest_df = read_manifest('path/to/manifest.csv')
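Before unpacking, downloads can be verified against the checksum config. A minimal sketch, assuming the two-column `<md5>  <filename>` layout used by `md5sum.lst`; `check_md5_list` is a hypothetical helper, not part of the repo, and streams files in chunks so multi-GB archives do not need to fit in RAM:

```python
import hashlib


def md5sum(path, chunk_size=2 ** 20):
    # Stream the file in 1 MB chunks so huge archives use constant memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def check_md5_list(md5_list_path):
    # Parse `<md5>  <filename>` lines and compare each file's actual hash
    results = {}
    with open(md5_list_path) as f:
        for line in f:
            expected, _, fname = line.strip().partition(' ')
            fname = fname.strip()
            if fname:
                results[fname] = (md5sum(fname) == expected)
    return results
```

Files that come in chunks should be merged (with `cat`, as shown above) before checking.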
See example

-```python +```python3 from utils.open_stt_utils import (plain_merge_manifests, check_files, save_manifest) @@ -295,6 +313,45 @@ save_manifest(train_manifest,

+# **How to open opus**
+
+The most efficient way we know of to read opus files in Python without significant overhead (i.e. launching subprocesses, or chaining libraries via sox, FFmpeg, etc.) is to use pysoundfile (a Python CFFI wrapper around libsndfile).
+
+When this solution was being researched, the community had been waiting for a major libsndfile release for years. Opus support was implemented upstream some time ago, but it has not been properly released. Therefore we opted for a custom build + monkey patching.
+
+By the time you read / use this, there will probably be decent / proper builds of libsndfile.
+
+## **Building libsndfile**
+
+```bash
+apt-get update
+apt-get install cmake autoconf autogen automake build-essential libasound2-dev \
+libflac-dev libogg-dev libtool libvorbis-dev libopus-dev pkg-config -y
+
+cd /usr/local/lib
+git clone https://github.com/erikd/libsndfile.git
+cd libsndfile
+git reset --hard 49b7d61
+mkdir -p build && cd build
+
+cmake .. -DBUILD_SHARED_LIBS=ON
+make && make install
+cmake --build .
+```
+
+## **Patched pysoundfile wrapper**
+
+```python3
+import utils.soundfile_opus as sf
+
+path = 'path/to/file.opus'
+audio, sr = sf.read(path, dtype='int16')
+```
+
+## **Known issues**
+
+When you attempt to write large files (90-120s), an upstream bug in libsndfile prevents writing them with `opus` / `vorbis`. This will most likely be fixed by upcoming major libsndfile releases.
+
 # **Contacts**
 
 Please contact us [here](mailto:open_stt@googlegroups.com) or just create a GitHub issue!
@@ -310,6 +367,7 @@ Please contact us [here](mailto:open_stt@googlegroups.com) or just create a GitH
 
 # **Acknowledgements**
 
 This repo would not be possible without these people:
+
 - Many thanks to [akreal](https://nuget.pkg.github.com/akreal) for helping to encode the initial bulk of the data into mp3;
 - 18 hours of ground truth annotation datasets for validation are a courtesy of [activebc](https://activebc.ru/);
 
 Kudos!
 
 # **FAQ**
 
-## **0. ~~Why not MP3?~~ MP3 encoding / decoding**
+## **0. ~~Why not MP3?~~ MP3 encoding / decoding** - DEPRECATED
 
-#### **Encoding**
+### **Encoding**
 
 Mostly we used `pydub` (via ffmpeg) or `sox` (a much, much faster way) to convert to MP3.
 
 We omitted blank files (YouTube mostly).
@@ -367,8 +425,7 @@ if res != 0:

- -#### **Decoding** +### **Decoding** It is up to you, but to save space and spare CPU during training, I would suggest the following pipeline to extract the files: @@ -432,7 +489,7 @@ wav_path = save_wav_diskdb(wav,

-#### **Why not OGG/ Opus**
+#### **Why not OGG / Opus** - DEPRECATED
 
 Even though OGG / Opus is considered to be better for speech with higher compression, we opted for a more conventional, well-known format.
 
 Also, the LPCNet codec boasts ultra-low bitrate speech compression as well. But we d
@@ -440,7 +497,7 @@
 
 ## **1. Issues with reading files**
 
-#### **Maybe try this approach:**
+### **Maybe try this approach:**
See example

@@ -461,15 +518,16 @@ if abs_max>0:
 
 ## **2. Why share such a dataset?**
 
-We are not altruists, life just is **not a zero sum game**.
+We are not altruists; life is just **not a zero-sum game**.
 
 Consider the progress in computer vision that was made possible by:
+
 - Public datasets;
 - Public pre-trained models;
 - Open source frameworks;
 - Open research;
 
-TTS does not enjoy the same attention by ML community because it is data hungry and public datasets are lacking, especially for languages other than English.
+STT does not enjoy the same attention from the ML community because it is data hungry and public datasets are lacking, especially for languages other than English.
 
 Ultimately this leads to a worse-off situation for the general community.
 
 ## **3. Known issues with the dataset to be fixed**
@@ -477,12 +535,36 @@ Ultimately it leads to worse-off situation for the general community.
 
 - Speaker labels coming soon;
 - Validation sets for new domains: Radio/Public Speech will be added in next releases.
 
+## **4. Why migrate to OPUS?**
+
+After extensive testing, both during training and validation, we confirmed that converting 16kHz int16 data to OPUS does not degrade quality, at the very least.
+
+Also, being designed for speech, OPUS takes less space than MP3 even at default compression rates and does not introduce artefacts.
+
+Some people even reported quality improvements when training on OPUS.
+
 # **License**
 
 ![сс-nc-by-license](https://static.wixstatic.com/media/342407_05e016f9f44240429203c35dfc8df63b~mv2.png/v1/fill/w_563,h_200,al_c,lg_1,q_80/342407_05e016f9f44240429203c35dfc8df63b~mv2.webp)
 
-Сc-by-nc and commercial usage available after agreement with dataset authors.
+CC BY-NC; commercial usage is available after agreement with the dataset authors.
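The headline numbers in this README are easy to sanity-check: 20 000 hours of 16 kHz mono `int16` PCM is 32 000 bytes per second, which reproduces the ~2.3 TB WAV figure, and 356 GB of OPUS then works out to roughly 40 kbit/s, about a 6.5x reduction. A back-of-envelope sketch (assuming decimal GB/TB units for the quoted sizes):

```python
HOURS = 20_000
SECONDS = HOURS * 3600
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2   # int16 PCM, mono
OPUS_BYTES = 356e9     # the quoted "356G", taken as decimal gigabytes

wav_bytes = SECONDS * SAMPLE_RATE * BYTES_PER_SAMPLE
print(f"wav size:     {wav_bytes / 1e12:.2f} TB")                   # ~2.30 TB
print(f"opus bitrate: {OPUS_BYTES * 8 / SECONDS / 1000:.1f} kbit/s")
print(f"compression:  {wav_bytes / OPUS_BYTES:.1f}x vs wav")
```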
# **Donations** [Donate](https://buymeacoff.ee/8oneCIN) (each coffee pays for several full downloads) or via [open_collective](https://opencollective.com/open_stt) or just use our DO referral [link](https://sohabr.net/habr/post/357748/) to help. + +# **Further reading** + +## **English** + +- https://thegradient.pub/towards-an-imagenet-moment-for-speech-to-text/ +- https://thegradient.pub/a-speech-to-text-practitioners-criticisms-of-industry-and-academia/ + +## **Chinese** + +- https://www.infoq.cn/article/4u58WcFCs0RdpoXev1E2 + +## **Russian** + +- https://habr.com/ru/post/494006/ +- https://habr.com/ru/post/474462/ \ No newline at end of file diff --git a/md5sum.lst b/md5sum.lst index 28d5fc6..395163f 100644 --- a/md5sum.lst +++ b/md5sum.lst @@ -1,43 +1,38 @@ -b0ce7564ba90b121aeb13aada73a6e30 asr_public_phone_calls_1.csv -6867d14dfdec1f9e9b8ca2f1de9ceda6 asr_public_phone_calls_2.csv -0bdd77e15172e654d9a1999a86e92c7f asr_public_stories_1.csv -f388013039d94dc36970547944db51c7 asr_public_stories_2.csv -3b67e27c1429593cccbf7c516c4b582d private_buriy_audiobooks_2.csv -04027c20eb3aff05f6067957ecff856b public_lecture_1.csv -89da3f1b6afcd4d4936662ceabf3033e public_series_1.csv -a81dfb018c88d0ecd5194ab3d8ff6c95 public_youtube700.csv -c858f020729c34ba0ab525bbb8950d0c ru_RU.csv -0275525914825dec663fd53390fdc9a0 russian_single.csv -52f406f4e30fcc8c634f992befd91beb tts_russian_addresses_rhvoice_4voices.csv -a6f888c53d7cbded85ab51627ef57c96 asr_public_phone_calls_1_mp3.tar.gz -f707e34f488c62af2e3142085ff595ad asr_public_phone_calls_2_mp3.tar.gz -baa491ed0b526b2a989b8c4a8897429d asr_public_stories_1_mp3.tar.gz -f24e21c69c03062d667caf0f055244f2 asr_public_stories_2_mp3.tar.gz -42b9c8c2e31100d6c5b972c9ac000167 private_buriy_audiobooks_2_mp3.tar.gz -7a5704721012fafa115e7316e5f6e058 public_lecture_1_mp3.tar.gz -16cf820330f9f8b388395d777b2331ac public_series_1_mp3.tar.gz -dd048e7110c0c852c353759dad8fec0f public_youtube700_mp3.tar.gz -579e9d98bd159a27d3573641edee69b0 ru_ru_mp3.tar.gz 
-177b041594684623ec7d038613e1330d russian_single_mp3.tar.gz -d7ce4c4116dcc655be2b466f82c98b6e tts_russian_addresses_rhvoice_4voices_mp3.tar.gz -25ea6d9e249a242ecc217acc28c8077b voxforge_ru_mp3.tar.gz -97cd6b56ba1eb5088bc5643dce054028 asr_calls_2_val_mp3.tar.gz -0cc0f50db85ec4271696b4eb03a2203c buriy_audiobooks_2_val_mp3.tar.gz -f5d2e3d13b47e1566ba0b021f00788cf public_youtube1120_hq_mp3.tar.gz -12eb78a9ab7c3d39bbe2842b8d6550ca public_youtube1120_mp3.tar.gz -f6b6034e1e91d9a0a5069fc9ad2ed545 public_youtube700_val_mp3.tar.gz -0cdbd085ffa6dab4bfdce7c3ed31fcfe asr_calls_2_val.csv -4e0b73e0d00374482a0f2286acf314a0 buriy_audiobooks_2_val.csv -6b9ce6828a55d2741d51bc3503345db5 public_youtube1120.csv -33040a25cad99e70a81e9e54ff8c758e public_youtube1120_hq.csv -525bd20802e529dcabf9e44345a50d0b public_youtube700_val.csv -d1b37f4cb32c4f461c56062523d0c645 radio_2.csv -69a465e218fc1f597f7b5da836952d9d radio_2_mp3.tar.gz -7c2273a5b8c3cc10df3754dbe9c783e1 radio_v4_mp3.tar.gz -d41f3f21d3cb9328de3cd6a530a70832 public_speech_mp3.tar.gz -ae00489678836b92e3a65d2ee8b51960 radio_v4_add_mp3.tar.gz -84397631475426f505babbb73b4197d9 radio_pspeech_sample_mp3.tar.gz -2551236643466a8df6a99ebaa64491b2 radio_v4_manifest.csv -0556f324bf43bdd8cb5e2eff97ea125b public_speech_manifest.csv -5758c95bd141d62e32a77c9b8aca9711 radio_v4_add_manifest.csv -48eb023de631fbc690dfbaa426a0df80 radio_pspeech_sample_manifest.csv +5146eef0aa64af41e17c5ec18a9cab5c asr_calls_2_val.tar.gz +8a6da753d8e76fc0188505ba84761efa asr_public_phone_calls_1.tar.gz +61736beda9893df218cf755ea2308077 asr_public_phone_calls_2.tar.gz +2926853f6eefe67dff4db2b9fdb93d44 asr_public_stories_1.tar.gz +207deb061308d4906ecdb558b94dcac0 asr_public_stories_2.tar.gz +dc6e33299e09d804eb6cedad49c7866a buriy_audiobooks_2_val.tar.gz +34419c7d29cc21d8d1a280c78dd6aa5c private_buriy_audiobooks_2.tar.gz +184a4d0a2016f8f5359bd4365a488fe5 public_lecture_1.tar.gz +10f41cb506403dd3d9100ced98c0cc0d public_series_1.tar.gz +2339d242aca15890eefaa73e86a4b527 
public_speech_manifest.tar.gz +32e727b732c2caead2b36664302905b7 public_youtube1120_hq.tar.gz +27cc829396f72fc0a7b8a15693609714 public_youtube1120.tar.gz +2ba18cc170f3254d32f917d2ef619355 public_youtube700.tar.gz +0bd6b99cfe702d70b65c2e5210eaf135 public_youtube700_val.tar.gz +6975f79b4593121cc61eaca9c4e976bb radio_2.tar.gz +0e7e36e0e25563e92334a6aa5829a19f radio_pspeech_sample_manifest.tar.gz +79dbbbed705e69ae811a4e598e47e4e2 radio_v4_add_manifest.tar.gz +b808ef771aab22b4cbf333e2f478b3aa radio_v4_manifest.tar.gz +2bdd0e26d972f60a0e54dafeef642264 tts_russian_addresses_rhvoice_4voices.tar.gz +6c2b582a63d5cec7b644232f2ad4f8da asr_calls_2_val.csv +32699abae0c0a3cc37bbda261cd9c46a asr_public_phone_calls_1.csv +e68a816759846db2432b9a09b96b6c7b asr_public_phone_calls_2.csv +182c7c87422793a85f8d142051e0ef3d asr_public_stories_1.csv +dfee0731071be2015830b0e9a1c2c51a asr_public_stories_2.csv +2599f9a8c226418e201d82651288014b buriy_audiobooks_2_val.csv +243a13f8b6a8b98f3742abc85ae77bdb private_buriy_audiobooks_2.csv +cf65339e4ed95d01d06c79a6f40f7114 public_lecture_1.csv +f97bfae4536eedd16bb54847e7d0985c public_series_1.csv +61f61202e89af13bb781e8c1280e24f1 public_speech_manifest.csv +9b27d9b9766ff869ec8fb36e1f584c17 public_youtube1120.csv +98adaa2a36e25a8bd77b6c72621de7ed public_youtube1120_hq.csv +45272b66ccb704bbdd200bedf3438c93 public_youtube700.csv +42dc17c92dd4f582d1462d78ae729ad7 public_youtube700_val.csv +4f368dfa929b39dd99ac36d05715647e radio_2.csv +fb6587c1a62a70b809c92053f694d978 radio_pspeech_sample_manifest.csv +844c326e8a16fc0aad158e4143d592be radio_v4_add_manifest.csv +cbb993513fada1f0ca8d3388372f016f radio_v4_manifest.csv +628c2974eeb2edfba4a560445d9dc628 tts_russian_addresses_rhvoice_4voices.csv diff --git a/utils/soundfile_opus.py b/utils/soundfile_opus.py new file mode 100644 index 0000000..e5ca034 --- /dev/null +++ b/utils/soundfile_opus.py @@ -0,0 +1,39 @@ +import soundfile as sf +import os + + +# Fx for soundfile read/write functions +def fx_seek(self, 
frames, whence=os.SEEK_SET): + self._check_if_closed() + position = sf._snd.sf_seek(self._file, frames, whence) + return position + + +def fx_get_format_from_filename(file, mode): + format = '' + file = getattr(file, 'name', file) + try: + format = os.path.splitext(file)[-1][1:] + format = format.decode('utf-8', 'replace') + except Exception: + pass + if format == 'opus': + return 'OGG' + if format.upper() not in sf._formats and 'r' not in mode: + raise TypeError("No format specified and unable to get format from " + "file extension: {0!r}".format(file)) + return format + + +#sf._snd = sf._ffi.dlopen('/usr/local/lib/libsndfile/build/libsndfile.so.1.0.29') +sf._subtypes['OPUS'] = 0x0064 +sf.SoundFile.seek = fx_seek +sf._get_format_from_filename = fx_get_format_from_filename + + +def read(file, **kwargs): + return sf.read(file, **kwargs) + + +def write(file, data, samplerate, **kwargs): + return sf.write(file, data, samplerate, **kwargs)
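The core trick in the monkey patch above is mapping the `.opus` extension onto libsndfile's `OGG` major format (Opus frames are carried inside an Ogg container) and registering the `OPUS` subtype id. The mapping logic can be sketched standalone, without libsndfile installed (`container_for` is a hypothetical mirror of `fx_get_format_from_filename`, not part of the repo):

```python
import os


def container_for(filename):
    # libsndfile has no 'OPUS' major format, so .opus files must be
    # opened as 'OGG' containers; other extensions map to themselves
    ext = os.path.splitext(filename)[-1][1:].lower()
    return 'OGG' if ext == 'opus' else ext.upper()


print(container_for('data/utt_0001.opus'))  # -> OGG
print(container_for('data/utt_0001.wav'))   # -> WAV
```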