# Prepare Training Data
Prepare training data for the `NN_Train_Align.ipynb` notebook. Currently we train on the CommonVoice data.
First **go to the [CommonVoice Downloads](https://commonvoice.mozilla.org/en/datasets) page,** select Czech and the Common Voice Corpus with the highest version number (likely the second item listed). Further down the page, fill in the starred fields (email, You are prepared..., You agree...) and right-click Download Dataset Bundle. Choose where to **save the .tar.gz archive.**

Then edit the location of the downloaded archive and the other paths in the config cell below.



In [1]:
# config cell - edit paths as needed

# Where is the archive you just downloaded:
cv_archive = "/data/commonvoice/dl4/cv-corpus-12.0-2022-12-07-cs.tar.gz"

# Where to uncompress the archive:
uncompress_dir = "/data/commonvoice/"

# Where to put a parallel hierarchy with wavs resampled to 16 kHz and NFC-normalized text files:
clean_dir = "/data4T/commonvoice/"

In [2]:
# Verify dependencies - install/modify if something fails below:
import pandas as pd # install it via mamba/conda if needed
import unicodedata
!which parallel # not that critical, can use bash instead
!which mpg123 # needs something to convert mp3 to wav, might use sox or ffmpeg
# If your torchaudio.load() opens mp3s, you can also train directly from mp3s.

/usr/bin/parallel
/usr/bin/mpg123
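
The same dependency check can be done from Python, which makes it easy to fail early instead of reading shell output. This is a minimal sketch using the standard library's `shutil.which`; the helper name `missing_tools` is an assumption for illustration, not part of the notebook.

```python
import shutil

def missing_tools(tools):
    """Return the names from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# An empty list means everything was found.
print(missing_tools(["parallel", "mpg123"]))
```

If `mpg123` shows up in the result, install it or switch the conversion step to sox or ffmpeg.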


In [None]:
# uncompress the archive
!cd {uncompress_dir} && tar xzf {cv_archive}
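
If `tar` is not available, or you prefer to stay in Python, the standard `tarfile` module can do the same extraction. The sketch below is an alternative to the shell command, not what the notebook ran; `extract_archive` is a hypothetical helper name.

```python
import tarfile

def extract_archive(archive, dest):
    """Extract a .tar.gz archive into the directory `dest`."""
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: use the safer 'data' extraction filter.
            tar.extractall(dest, filter="data")
        else:
            # Older Python versions without extraction filters.
            tar.extractall(dest)
```

With the config cell above it would be called as `extract_archive(cv_archive, uncompress_dir)`.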

In [4]:
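# get strings like 'cv-corpus-12.0-2022-12-07' and 'cs'
# (the slicing below assumes the archive name ends in '-cs.tar.gz', i.e. a two-letter language code):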
corpus_name = cv_archive.split('/')[-1][:-len('-cs.tar.gz')]
lang = cv_archive[-len('cs.tar.gz'):-len('.tar.gz')]
corpus_name, lang

('cv-corpus-12.0-2022-12-07', 'cs')
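
The slicing above depends on the length of the language code. A regular expression is one way to make it work for other languages too. This sketch assumes archives are named `cv-corpus-<version>-<YYYY-MM-DD>-<lang>.tar.gz`, as the file used here is; the pattern and the helper name `parse_archive_name` are assumptions and may need adjusting for other releases.

```python
import re

# Assumed naming pattern: cv-corpus-<version>-<YYYY-MM-DD>-<lang>.tar.gz
ARCHIVE_RE = re.compile(r"^(cv-corpus-[\d.]+-\d{4}-\d{2}-\d{2})-(.+)\.tar\.gz$")

def parse_archive_name(archive_path):
    """Return (corpus_name, lang) parsed from the archive file name."""
    filename = archive_path.split("/")[-1]
    match = ARCHIVE_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected archive name: {filename}")
    return match.group(1), match.group(2)

print(parse_archive_name("/data/commonvoice/dl4/cv-corpus-12.0-2022-12-07-cs.tar.gz"))
# prints: ('cv-corpus-12.0-2022-12-07', 'cs')
```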

In [5]:
# Where we expect train.tsv:
raw_train_tsv_file = f"{uncompress_dir}/{corpus_name}/{lang}/train.tsv"
clean_train_tsv_file = f"{clean_dir}/{corpus_name}/{lang}/train.tsv"
!ls -l {raw_train_tsv_file}
clean_train_tsv_file

-rw-r--r-- 1 hanzl hanzl 3502164 Dec  8 19:07 /data/commonvoice//cv-corpus-12.0-2022-12-07/cs/train.tsv


'/data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/train.tsv'
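
The doubled slash in these paths comes from the configured directories ending in `/`. It is harmless on POSIX systems, but if you want tidy paths, `pathlib` can join the pieces. A small sketch, where `corpus_path` is a hypothetical helper and not something the notebook defines:

```python
from pathlib import PurePosixPath

def corpus_path(base_dir, corpus_name, lang, *parts):
    """Join path pieces, collapsing redundant slashes."""
    return str(PurePosixPath(base_dir, corpus_name, lang, *parts))

print(corpus_path("/data/commonvoice/", "cv-corpus-12.0-2022-12-07", "cs", "train.tsv"))
# prints: /data/commonvoice/cv-corpus-12.0-2022-12-07/cs/train.tsv
```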

In [6]:
# clips paths, make dir for cleaned ones
clean_clips = f"{clean_dir}/{corpus_name}/{lang}/clips"
raw_clips = f"{uncompress_dir}/{corpus_name}/{lang}/clips"
!mkdir -p {clean_clips}
!ls {raw_clips}|wc

  59925   59925 1737825


In [7]:
# Clean text file (LF, NFC, no BOMs)
import unicodedata

def clean_textline(line):
    if line and line[0] == '\uFEFF':
        line = line[1:]
    line = line.rstrip("\r\n")
    line = unicodedata.normalize('NFC', line)
    return line

def clean_textfile(infile, outfile):
    # Common Voice TSVs are UTF-8; set the encoding explicitly instead of relying on the locale default.
    with open(infile, 'r', encoding='utf-8') as f_in, open(outfile, 'w', encoding='utf-8') as f_out:
        for line in f_in:
            # Reuse the per-line cleaning defined above.
            f_out.write("%s\n" % clean_textline(line))
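
To see what the cleaning does, here is a small self-contained demonstration on a made-up line. It repeats the per-line steps from the cell above and feeds them a string with a BOM, a decomposed "č" (the letter `c` followed by the combining caron U+030C) and a CRLF line ending; the sample string is invented for illustration.

```python
import unicodedata

def clean_textline(line):
    # Same steps as in the cell above: drop BOM, strip line ending, normalize to NFC.
    if line and line[0] == '\uFEFF':
        line = line[1:]
    line = line.rstrip("\r\n")
    return unicodedata.normalize('NFC', line)

raw = "\ufeffc\u030c\r\n"           # BOM + decomposed č + CRLF
cleaned = clean_textline(raw)
print(len(raw), len(cleaned), cleaned == "\u010d")
# prints: 5 1 True
```

After normalization the two code points are composed into the single precomposed character U+010D, so identical-looking sentences compare equal.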

In [8]:
# This is a bit paranoid; CV files already seem to be NFC-clean, but better safe than sorry:
clean_textfile(raw_train_tsv_file, clean_train_tsv_file)
!diff {raw_train_tsv_file} {clean_train_tsv_file}
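
No output from `diff` means the two files are identical. Where `diff` is not available, the standard `filecmp` module can make the same check; this is a sketch and `files_identical` is a hypothetical helper name.

```python
import filecmp

def files_identical(path_a, path_b):
    """Compare file contents byte by byte (not just size and mtime)."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Here it would be called as `files_identical(raw_train_tsv_file, clean_train_tsv_file)`.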

In [9]:
# Use full width of browser window:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [10]:
# read tsv:
df = pd.read_csv(clean_train_tsv_file, sep="\t", keep_default_na=False)
df.client_id = df.client_id.str[:6] # shorten very long hash
df

,client_id,path,sentence,up_votes,down_votes,age,gender,accents,locale,segment
0,2b8bbe,common_voice_cs_25695144.mp3,S judem začínala v rodném Kjóto.,2,0,,,,cs,
1,2b8bbe,common_voice_cs_25695145.mp3,Průtok se vyznačuje prudkými výkyvy a prudce roste v létě v období dešťů.,2,0,,,,cs,
2,2b8bbe,common_voice_cs_25695148.mp3,Dělí ji pouze přidané pásy jednotlivých pater.,2,0,,,,cs,
3,2b8bbe,common_voice_cs_25695233.mp3,"Nesmíme jim ztěžovat použití dřeva, které je výborným přírodním materiálem.",2,0,,,,cs,
4,2b8bbe,common_voice_cs_25695235.mp3,Počet přeživších pacientů závisí na kmenu viru a na fyzické kondici pacienta.,2,1,,,,cs,
...,...,...,...,...,...,...,...,...,...,...
14810,419567,common_voice_cs_23959820.mp3,Celkově tyto změny v signalizaci negativně ovlivňují proliferaci a přežití buněk.,2,0,fourties,male,,cs,
14811,419567,common_voice_cs_23959822.mp3,Zvyk je tendence vykonávat za určitých okolností určitou činnost.,2,0,fourties,male,,cs,
14812,419567,common_voice_cs_23959824.mp3,Jeho žena Marie byla mladší sestra spisovatele Zdeňka Bára.,2,0,fourties,male,,cs,
14813,419567,common_voice_cs_23959825.mp3,Za stejnou roli získal i Oscara.,2,0,fourties,male,,cs,


In [11]:
df['wav'] = [clean_clips+"/"+p.replace(".mp3",".wav") for p in df.path.values]
df['mp3'] = [raw_clips+"/"+p for p in df.path.values]

In [12]:
cols = ["wav", "mp3", "sentence"]
zf = df[cols]
#zf

In [13]:
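import shlex  # quote paths so spaces or shell metacharacters cannot break the commands

with open('tmp_batch', 'w') as f:
    for wav, mp3 in zip(zf.wav.values, zf.mp3.values):
        f.write(f"mpg123 -q -r 16000 -w {shlex.quote(wav)} {shlex.quote(mp3)}\n")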
!wc tmp_batch
!head -n 3 tmp_batch

  14815  103705 2814850 tmp_batch
mpg123 -q -r 16000 -w /data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695144.wav /data/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695144.mp3
mpg123 -q -r 16000 -w /data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695145.wav /data/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695145.mp3
mpg123 -q -r 16000 -w /data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695148.wav /data/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695148.mp3


In [14]:
# Convert mp3s to wavs in parallel.
# If you do not have GNU parallel installed, replace `parallel` with `bash` or `sh`
# (the conversions then run one after another instead of in parallel).
!cat tmp_batch|parallel
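
If GNU parallel is missing but you still want concurrent conversion, a thread pool that launches one `mpg123` process per file is one option. This is a sketch rather than what the notebook ran: `mpg123_command` and `convert_all` are hypothetical helpers, the worker count is arbitrary, and the `run` parameter exists only so the function can be tried without `mpg123` installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def mpg123_command(wav, mp3):
    """Argument list equivalent to one line of tmp_batch."""
    return ["mpg123", "-q", "-r", "16000", "-w", wav, mp3]

def convert_all(pairs, run=subprocess.run, workers=4):
    """Convert each (wav, mp3) pair; return the pairs whose command failed."""
    pairs = list(pairs)

    def convert(pair):
        wav, mp3 = pair
        # No shell is involved, so paths need no quoting here.
        return run(mpg123_command(wav, mp3)).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(convert, pairs))
    return [pair for pair, code in zip(pairs, codes) if code != 0]
```

With the dataframe above, `failed = convert_all(zip(zf.wav.values, zf.mp3.values))` would return the pairs that need another look.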

In [15]:
!ls -l {clean_clips}|head -n 3

total 2232352
-rw-r--r-- 1 hanzl hanzl 129836 Feb 14 11:34 common_voice_cs_20487672.wav
-rw-r--r-- 1 hanzl hanzl 185132 Feb 14 11:34 common_voice_cs_20487695.wav
ls: write error: Broken pipe
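
Beyond listing the directory, it can be worth spot-checking that a converted file really has the expected sample rate. The sketch below reads the WAV header with the standard `wave` module, assuming the converter wrote plain PCM WAV files; `wav_info` is a hypothetical helper name.

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return rate, w.getnchannels(), w.getnframes() / rate
```

For example, `wav_info(zf.wav.values[0])` should report 16000 as its first element if the resampling worked.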


In [17]:
cols = ["wav", "sentence"]
ini_tsv = zf[cols]

In [18]:
# Write the initial TSV for NN acoustic model (AM) training:
ini_tsv.to_csv("initial_train.tsv", sep="\t", index=False)
!head initial_train.tsv

wav	sentence
/data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695144.wav	S judem začínala v rodném Kjóto.
/data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695145.wav	Průtok se vyznačuje prudkými výkyvy a prudce roste v létě v období dešťů.
/data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695148.wav	Dělí ji pouze přidané pásy jednotlivých pater.
/data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695233.wav	Nesmíme jim ztěžovat použití dřeva, které je výborným přírodním materiálem.
/data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695235.wav	Počet přeživších pacientů závisí na kmenu viru a na fyzické kondici pacienta.
/data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695236.wav	To je bezpochyby pravda.
/data4T/commonvoice//cv-corpus-12.0-2022-12-07/cs/clips/common_voice_cs_25695238.wav	Poté se i v samotném Mexiku situace poněkud uklidnil