# LibriSpeech corpus

This corpus was created from the raw data found at [OpenSLR](http://www.openslr.org/12/). Speech samples and partial transcripts were extracted by parsing the metadata included in the downloaded raw data. The full transcript was extracted by searching for the first and last partial transcript in the original book text.
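
The search for the full transcript can be sketched as follows. This is a minimal illustration, not the code used to build the corpus: the function name `find_full_transcript` and the sample text are hypothetical, and it assumes that the partial transcripts occur verbatim in the book text (real data would probably need some normalization of whitespace and punctuation first).

```python
def find_full_transcript(book_text, first_partial, last_partial):
    """Return the part of book_text spanning from the first to the last partial transcript.

    Hypothetical helper: assumes both partial transcripts appear verbatim in book_text.
    Returns None if either of them cannot be found.
    """
    start = book_text.find(first_partial)
    if start == -1:
        return None
    # look for the last partial transcript only at or after the first one
    last_start = book_text.find(last_partial, start)
    if last_start == -1:
        return None
    return book_text[start:last_start + len(last_partial)]


# made-up example data
book_text = "Preface. It was a dark night. The wind howled. Nobody slept. The end."
partials = ["It was a dark night.", "The wind howled.", "Nobody slept."]

full_transcript = find_full_transcript(book_text, partials[0], partials[-1])
print(full_transcript)
```

Text before the first and after the last partial transcript (here the preface and the closing words) is dropped, so the full transcript covers only what is actually spoken in the recording.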

The corpus was stored in a generic, proprietary format that maps the individual speech segments to _corpus entries_ (which hold information about language, book chapter and speaker) and allows easy extraction of statistical information. Because DeepSpeech does not understand this format, the corpus had to be converted to the format DeepSpeech expects.
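
The idea behind the conversion can be sketched like this. The classes `Segment` and `CorpusEntry` and the function `write_deepspeech_csv` are hypothetical stand-ins and not the classes used in this notebook. The sketch assumes that the target format is a CSV file with the columns `wav_filename`, `wav_filesize` and `transcript`, one row per speech segment, and that each segment has already been written to its own audio file.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """Hypothetical speech segment that was already exported to its own audio file."""
    wav_filename: str   # path to the audio file of this segment
    wav_filesize: int   # size of that file in bytes
    transcript: str     # text spoken in this segment


@dataclass
class CorpusEntry:
    """Hypothetical corpus entry grouping segments with their metadata."""
    language: str
    chapter: str
    speaker: str
    segments: List[Segment] = field(default_factory=list)


def write_deepspeech_csv(entries, csv_file):
    """Flatten all segments of all entries into one CSV and return the number of rows written."""
    writer = csv.writer(csv_file)
    writer.writerow(['wav_filename', 'wav_filesize', 'transcript'])
    n_rows = 0
    for entry in entries:
        for segment in entry.segments:
            writer.writerow([segment.wav_filename, segment.wav_filesize, segment.transcript])
            n_rows += 1
    return n_rows


# made-up example data
entry = CorpusEntry(language='en', chapter='chapter 1', speaker='speaker 42', segments=[
    Segment('seg_0001.wav', 32044, 'it was a dark night'),
    Segment('seg_0002.wav', 28844, 'the wind howled'),
])

buffer = io.StringIO()  # stands in for a file opened with open(path, 'w', newline='')
n_written = write_deepspeech_csv([entry], buffer)
print(buffer.getvalue())
```

The metadata of the corpus entry (language, chapter, speaker) is lost in this flat representation, which is one reason to keep the original corpus for statistics and to generate the CSV files from it.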

In [None]:
from IPython.display import HTML, Audio, display
from util.corpus_util import get_corpus
ls_corpus = get_corpus('/media/daniel/IP9/corpora/librispeech')
display(HTML(ls_corpus.summary(html=True)))

In [None]:
from IPython.display import HTML, Audio, display
ls_entry = ls_corpus[0]
display(HTML(ls_entry.summary(html=True)))
display(Audio(ls_entry.audio_path))
display(HTML(ls_entry.transcript))

In [None]:
from IPython.display import HTML, Audio, display

# Show the first five speech segments of the corpus entry
print(f'{len(ls_entry)} segments')
for s in ls_entry[:5]:
    display(Audio(data=s.audio, rate=s.rate))
    display(HTML(s.transcript))

# CommonVoice corpus

This corpus was downloaded directly from [the CommonVoice project page](https://voice.mozilla.org/de/data) and already contains speech samples in the expected format, so no conversion was needed.

In [None]:
from IPython.display import HTML, Audio, display
from util.corpus_util import get_corpus
cv_corpus = get_corpus('/media/daniel/IP9/corpora/cv', 'en')
display(HTML(cv_corpus.summary(html=True)))

In [None]:
from IPython.display import HTML, Audio, display
cv_entry = cv_corpus[0]
display(HTML(cv_entry.summary(html=True)))
display(Audio(cv_entry.audio_path))
display(HTML(cv_entry.transcript))

In [None]:
from IPython.display import HTML, Audio, display
cv_entry = cv_corpus[0]

# Show the first five speech segments of the corpus entry
print(f'{len(cv_entry)} segments')
for s in cv_entry[:5]:
    display(Audio(data=s.audio, rate=s.rate))
    display(HTML(s.transcript))

# ReadyLingua corpus

This corpus was received from ReadyLingua and was stored in the same generic, proprietary format as the _LibriSpeech_ corpus.

In [None]:
from IPython.display import HTML, Audio, display
from util.corpus_util import get_corpus

rl_corpus = get_corpus('/media/daniel/IP9/corpora/readylingua')
display(HTML(rl_corpus.summary(html=True)))

In [None]:
from IPython.display import HTML, Audio, display

rl_entry = rl_corpus[0]
# rl_entry = rl_corpus['news160929']

display(HTML(rl_entry.summary(html=True)))
display(Audio(rl_entry.audio_path))
display(HTML(rl_entry.transcript))

In [None]:
from IPython.display import HTML, Audio, display

# Show the first five speech segments of the corpus entry
print(f'{len(rl_entry)} speech segments')
for s in rl_entry[:5]:
    display(Audio(data=s.audio, rate=s.rate))
    display(HTML(s.transcript))

# ReadyLingua corpus (synthesized data)

In [None]:
from IPython.display import HTML, Audio, display
from corpus.corpus import DeepSpeechCorpus

train_csv = '/media/daniel/IP9/corpora/readylingua-de/readylingua-de-train.csv'
dev_csv = '/media/daniel/IP9/corpora/readylingua-de/readylingua-de-dev.csv'
test_csv = '/media/daniel/IP9/corpora/readylingua-de/readylingua-de-test.csv'

rl_de_synth_corpus = DeepSpeechCorpus('de', train_csv, dev_csv, test_csv)
display(HTML(rl_de_synth_corpus.summary(html=True)))